Apache Spark in 7 Steps
If you want to get up to speed with the fundamentals of Apache Spark in a short time, you can count on this course to help you learn the basics of the engine. Spark has become a popular big data processing engine thanks to its unique ability to run in memory at excellent speed. It is also easy to use and offers simple syntax.
The course is designed to give you a fundamental understanding of and hands-on experience in writing basic code as well as running applications on a Spark cluster. Over 7 steps, you will work on interesting examples and assignments that will demonstrate and help you understand basic operations, querying, machine learning, and streaming.
By the end of this course, you’ll be able to put your learning into practice and build your own projects with ease and confidence.
About the Author
Karen Yang has been a passionate self-learner in computer science for over 6 years. She has programming, big data processing, and engineering experience. Her recent interests include cloud computing. She previously taught for 5 years in a college evening adult program.
This video will give you an overview of the course.
The aim of this video is to set up an AWS account with a basic (free) plan.
• Register a new account at https://aws.amazon.com/
• Enter login, contact, and payment information, and choose the basic plan
• Confirm AWS account subscription and login to the console
The aim of this video is to launch a Spark Cluster on EC2 using a Python script.
• Download the Spark EC2 script and dependencies
• Create a key-pair and generate AWS access credentials
• Launch Spark cluster with EC2 script
The aim of this video is to set up your development environment.
• Install Python 3.5 and dependencies on the master node
• Install Python 3.5 and dependencies on the worker nodes
• Verify the installations
The aim of this video is to run a test application on an EC2 cluster.
• Copy a test file to be used in running the application
• Show the command to submit an application deployed locally
• Show the command to submit an application to the worker nodes, using the cluster
Working with RDDs
The aim of this video is to learn how to create RDDs while working in the PySpark shell.
• Using the PySpark shell for code development
• Creating RDDs from text file(s)
• Creating RDDs programmatically
The aim of this video is to learn how RDD actions trigger execution of RDD transformations.
• Review some commonly used RDD actions and what they do
• Provide some examples of RDD actions
• Highlight that actions are a way to send data from executors to the driver after performing a computation
The aim of this video is to show how transformations are operations that transform your RDD data from one form to another.
• Review some commonly used RDD transformations and what they do
• Provide some examples of RDD transformations
• Highlight that RDD transformations are lazy evaluations in Spark
The aim of this video is to review RDD operations such as joins, set operations, and numeric operations.
• Review inner, left outer, right outer, and full outer joins
• Review set operations such as intersection, subtraction, union, and distinct
• Review numeric operations such as minimum, maximum, mean, sum, standard deviation, variance, and statistics
The aim of this video is to learn about shared variables such as broadcast and accumulator.
• Review the purpose of shared variables
• Broadcast variables cache a read-only value on every worker node of a cluster
• Accumulator variables let worker nodes add to a shared value that the driver reads
The aim of this video is to install Jupyter Notebook and useful notebook extensions.
• Instructions to download the Python 3.7 Anaconda distribution
The aim of this video is to learn how to start the Jupyter Notebook for Spark and to perform basic RDD and DataFrame operations.
• Demonstrate how to initialize Spark in the notebook
• Learn about the entry point to Spark through the SparkSession class
• Perform basic RDD and DataFrame operations in the Jupyter Notebook
The aim of this video is to explore DataFrame row operations such as changing values, filtering rows, and using a row function to create a DataFrame.
• Convert a DataFrame to an RDD and perform operations
• Alter row values and filter row values
• Use a row function to create a DataFrame
The aim of this video is to explore DataFrame column operations such as selecting columns, creating new columns, and sorting values in a column.
• Perform DataFrame operations such as show, head, describe, and take
• Select columns, create new columns, filter, and alter values in a column
• Sort values in a column in ascending and descending order
The aim of this video is to learn about DataFrame manipulation.
• Calculate summary statistics such as min, max, and mean
• Use aggregation functions such as groupby() and agg()
• Join two DataFrames with the use of inner join and left outer join
The aim of this video is to show how to use Spark SQL by registering a DataFrame as a temporary view or a global temporary view.
• Show how to register a temporary view before using Spark SQL
• Show how to register a global temporary view before using Spark SQL
• Perform a basic Spark SQL query, selecting columns from views
The aim of this video is to learn about schemas—inferring schema and programmatically specifying schema.
• Demonstrate how to infer schema using reflection
• Demonstrate how to programmatically specify the schema
• Work with schemas, create views, and run SQL queries
The aim of this video is to explore Spark SQL, using some commonly used operations.
• Select columns, filter rows, mutate values in a table, and calculate the mean
• Use split-apply-combine aggregation in a Spark SQL query
• Perform write and read operations for CSV, JSON, and Parquet files
The aim of this video is to examine I/O options when reading and writing CSV, JSON, and Parquet files.
• Load and read a CSV file with the inferSchema option set to true
• Load and read a JSON file and save to parquet
• Demonstrate that Parquet supports direct SQL querying and schema merging
The aim of this video is to demonstrate the use of Hive in relation to Spark SQL.
• Make use of the Hive context
• Show Hive tables
• Run Hive queries
Machine Learning Fundamentals
The aim of this video is to perform basic statistics for machine learning in Spark.
• Show how to calculate Pearson’s correlation
• Show how to calculate Spearman’s correlation
• Perform hypothesis testing, using the Chi-Square test
The aim of this video is to learn how a pipeline chains multiple transformers and estimators together to specify an ML workflow.
• Explore the pipeline component called Transformers
• Explore the pipeline component called Estimators
• Explore how parameters belong to specific instances of Estimators and Transformers
The aim of this video is to explore feature extractors as a part of Spark machine learning fundamentals.
• Demonstrate the use of text as features with TF-IDF
• Transform words into vectors, using Word2Vec
• Convert a collection of text documents to vectors of token counts with CountVectorizer
The aim of this video is to examine feature transformers as a part of Spark machine learning fundamentals.
• Show how Principal Component Analysis (PCA) projects vectors into low-dimensional space
• Show how the OneHotEncoder Estimator maps categorical features to binary vectors
• Show how MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a range such as 0 to 1
The aim of this video is to demonstrate the use of feature selectors as part of Spark machine learning fundamentals.
• Explain the purpose of feature selection
• Demonstrate the use of VectorSlicer, which extracts features from a vector column
• Demonstrate the use of ChiSqSelector, which operates on labeled data with categorical features
Machine Learning Models
The aim of this video is to show how to use classification models in Spark, namely binomial Logistic Regression and Naïve Bayes classification.
• Classification is the process of predicting the class of given data points
• Show how to use Logistic Regression in Spark to predict a binary class outcome
• Show how to use Naïve Bayes classification in Spark to predict a binary class outcome
The aim of this video is to show how to use regression models in Spark, namely Linear Regression and Gradient-Boosted Tree Regression.
• Regression is a measure of relationship between an outcome variable and its explanatory variables
• Demonstrate how to do Linear Regression in Spark
• Demonstrate how to do Gradient-Boosted Tree Regression in Spark
The aim of this video is to explore Clustering, using two commonly used models, namely K-Means and LDA (Latent Dirichlet Allocation).
• Clustering involves grouping data points into a predefined number of clusters by similarity
• Demonstrate the use of K-Means Clustering
• Demonstrate the use of Latent Dirichlet Allocation (LDA)
The aim of this video is to examine Collaborative Filtering in Spark as a model for recommendation based on users’ past behavior.
• Generate the top 10 movie recommendations for each user and the top 10 user recommendations for each movie
• Generate top 10 movie recommendations for a specified set of users
• Generate top 10 user recommendations for a specified set of movies
The aim of this video is to demonstrate model selection and tuning in Spark.
• Explain the use of model selection and tuning in Spark
• Show how Cross-Validation works for model selection
• Show how Train-Validation split works for hyper-parameter tuning
The aim of this video is to show Spark’s RDD-based streaming, namely DStreams.
• Illustrate how to start the streaming context and receive input
• Show how to perform RDD transformation operations on DStreams
• Demonstrate DStreams, using a word count example with a data server
The aim of this video is to learn about DStream window operations.
• Learn how to maintain state, using the function updateStateByKey()
• Demonstrate the use of the window method, which includes window length (size) and slide interval
• Demonstrate the use of a DStream window operation called reduceByKeyAndWindow()
The aim of this video is to explore Structured Streaming, in which new rows of data arriving on the stream are appended to an unbounded table.
• Structured Streaming is built on top of Spark SQL and operates much like DataFrames
• Present pseudo code to demonstrate the key aspects of Structured Streaming
• Provide a word count code example to demonstrate Structured Streaming
The aim of this video is to examine window operations such as aggregation and watermarking.
• Demonstrate the use of aggregations, using groupBy() and window()
• Explain the use of watermarking with withWatermark()
• Provide an example of Structured Streaming with window operations
The aim of this video is to demonstrate how to join batch and streaming data.
• Show that joining batch and streaming data results in streaming data
• Point out particulars about Structured Streaming joins
• Provide an example of joining batch and streaming data, using a vacation dataset