Apache Spark in 7 Steps

Delve into 7 short lessons and daily exercises carefully chosen to get you started with Apache Spark
Instructor:
Packt Publishing
Find out how to deploy a Spark cluster in the AWS (Amazon Web Services) cloud, using a Python EC2 (Amazon Elastic Compute Cloud) script
Study the basic concepts of Spark, including transformations and actions
Uncover what RDDs (Resilient Distributed Datasets) are and how to perform operations on them
Run queries using Spark SQL
Write Spark SQL queries and work with Spark DataFrames
Understand how to use MLlib for machine learning applications
Discover streaming operations

If you’re looking to get up to speed with the fundamentals of Apache Spark in a short period of time, you can count on this course. Spark has become a popular big data processing engine thanks to its unique ability to run in-memory at excellent speed, and it is easy to use, with simple syntax.

The course is designed to give you a fundamental understanding of and hands-on experience in writing basic code as well as running applications on a Spark cluster. Over 7 steps, you will work on interesting examples and assignments that will demonstrate and help you understand basic operations, querying, machine learning, and streaming.

By the end of this course, you’ll be able to put your learning to practice and build your own projects with ease and confidence.

About the Author

Karen Yang has been a passionate self-learner in computer science for over 6 years. She has programming, big data processing, and engineering experience. Her recent interests include cloud computing. She previously taught for 5 years in a college evening adult program.

Getting Started

1
The Course Overview

This video will give you an overview of the course.

2
Setting Up an AWS Account

The aim of this video is to set up an AWS account with a basic (free) plan.

   •  Register a new account at https://aws.amazon.com/

   •  Enter login, contact, payment information, and choose a basic plan

   •  Confirm the AWS account subscription and log in to the console

3
Launching a Spark Cluster on EC2

The aim of this video is to launch a Spark Cluster on EC2 using a Python script.

   •  Download the Spark EC2 script and dependencies

   •  Create a key-pair and generate AWS access credentials

   •  Launch Spark cluster with EC2 script

4
Setting Up Your Environment

The aim of this video is to set up your development environment.

   •  Install Python 3.5 and dependencies on the master node

   •  Install Python 3.5 and dependencies on the worker node

   •  Verify the installations

5
Running a Test Application

The aim of this video is to run a test application on an EC2 cluster.

   •  Copy a test file to be used in running the application

   •  Show the command to submit an application deployed locally

   •  Show the command to submit an application to the worker nodes, using the cluster

Working with RDDs

1
Creating RDDs

The aim of this video is to learn how to create RDDs while working in the PySpark shell.

   •  Using the PySpark shell for code development

   •  Creating RDDs from text file(s)

   •  Creating RDDs programmatically
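For instance, in the PySpark shell (where the SparkContext is already available as sc), both approaches look roughly like this; the file path is hypothetical:

```python
# The PySpark shell predefines the SparkContext as `sc`.
lines = sc.textFile("data.txt")           # RDD from a text file (hypothetical path)
nums = sc.parallelize([1, 2, 3, 4, 5])    # RDD created programmatically from a list
```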

2
Actions

The aim of this video is to learn how RDD actions trigger the execution of RDD transformations.

   •  Review some commonly used RDD actions and what they do

   •  Provide some examples of RDD actions

   •  Highlight that actions are a way to send data from executors to the driver after performing a computation
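A minimal sketch of a few common actions, again assuming the PySpark shell's sc:

```python
nums = sc.parallelize([1, 2, 3, 4])
nums.count()                      # 4
nums.take(2)                      # [1, 2]
nums.reduce(lambda a, b: a + b)   # 10
nums.collect()                    # [1, 2, 3, 4] -- results are sent back to the driver
```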

3
Transformations

The aim of this video is to show how transformations are operations that transform your RDD data from one form to another.

   •  Review some commonly used RDD transformations and what they do

   •  Provide some examples of RDD transformations

   •  Highlight that RDD transformations are lazy evaluations in Spark
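The laziness is easy to see in a short sketch: nothing runs until an action is called.

```python
nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)       # transformation: recorded, not executed
evens = doubled.filter(lambda x: x > 4)   # still nothing has run
evens.collect()                           # the action triggers execution: [6, 8]
```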

4
Joins, Set, and Numeric Operations

The aim of this video is to review RDD operations such as joins, set, and numeric operations.

   •  Review inner, left outer, right outer, and full outer joins

   •  Review set operations such as intersection, subtraction, union, and distinct

   •  Review numeric operations such as minimum, maximum, mean, sum, standard deviation, variance, and statistics
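A compact sketch of all three families of operations, assuming the PySpark shell's sc:

```python
a = sc.parallelize([("k1", 1), ("k2", 2)])
b = sc.parallelize([("k1", 10), ("k3", 30)])
a.join(b).collect()              # inner join: [('k1', (1, 10))]
a.leftOuterJoin(b).collect()     # keeps 'k2', paired with None

x = sc.parallelize([1, 2, 3])
y = sc.parallelize([2, 3, 4])
x.intersection(y).collect()      # set operation: [2, 3]
x.union(y).distinct().collect()  # union without duplicates
x.stats()                        # count, mean, stdev, max, min in one pass
```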

5
Shared Variables

The aim of this video is to learn about shared variables such as broadcast and accumulator.

   •  Review the purpose of shared variables

   •  Broadcast variables provide every worker node in a cluster with a read-only copy of shared data

   •  Accumulator variables let worker nodes write (add) to a shared value that the driver reads back
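A minimal sketch of both kinds of shared variable:

```python
lookup = sc.broadcast({"a": 1, "b": 2})   # read-only copy shipped to every worker
hits = sc.accumulator(0)                  # workers add to it; the driver reads it

def score(word):
    hits.add(1)
    return lookup.value.get(word, 0)

sc.parallelize(["a", "b", "c"]).map(score).collect()   # [1, 2, 0]
print(hits.value)                                      # 3, visible on the driver
```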

DataFrames

1
Installing Jupyter Notebook

The aim of this video is to install Jupyter Notebook and useful notebook extensions.

   •  Follow the instructions to download the Python 3.7 Anaconda distribution

2
RDDs and DataFrames

The aim of this video is to learn how to start the Jupyter Notebook for Spark and to perform basic RDD and DataFrame operations.

   •  Demonstrate how to initialize Spark in the notebook

   •  Learn about the entry point to Spark through the SparkSession class

   •  Perform basic RDD and DataFrame operations in the Jupyter Notebook
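A minimal sketch of the SparkSession entry point; the application name is hypothetical:

```python
from pyspark.sql import SparkSession

# SparkSession is the entry point for DataFrame and SQL functionality.
spark = SparkSession.builder.appName("notebook-demo").getOrCreate()
sc = spark.sparkContext                            # the RDD entry point

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
df.rdd.map(lambda row: row.id).collect()           # drop down to the RDD API
```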

3
DataFrame Row Operations

The aim of this video is to explore DataFrame row operations such as changing values, filtering rows, and using a row function to create a DataFrame.

   •  Convert a DataFrame to an RDD and perform operations

   •  Alter row values and filter row values

   •  Use a row function to create a DataFrame
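A small sketch of these row operations, assuming the SparkSession spark from the previous step:

```python
from pyspark.sql import Row

rows = [Row(name="Ann", age=34), Row(name="Bob", age=28)]
df = spark.createDataFrame(rows)    # create a DataFrame from Row objects
df.filter(df.age > 30).show()       # filter rows
df.rdd.map(lambda r: Row(name=r.name, age=r.age + 1)).collect()  # alter values via the RDD
```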

4
DataFrame Column Operations

The aim of this video is to explore DataFrame column operations such as selecting columns, creating new columns, and sorting values in a column.

   •  Perform DataFrame operations such as show, head, describe, and take

   •  Select columns, create new columns, filter, and alter values in a column

   •  Sort values in a column in ascending and descending order
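Using the same name/age DataFrame as in the previous sketch, the column operations might look like this:

```python
df.select("name").show()                          # select a column
df2 = df.withColumn("age_plus_one", df.age + 1)   # create a new column
df2.filter(df2.age > 30).show()                   # filter on a column's values
df2.orderBy(df2.age.desc()).show()                # sort descending (asc() for ascending)
df2.describe().show()                             # summary stats; see also head(), take()
```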

5
DataFrame Manipulation

The aim of this video is to learn about DataFrame manipulation.

   •  Calculate summary statistics such as min, max, and mean

   •  Use aggregation functions such as groupBy() and agg()

   •  Join two DataFrames with the use of inner join and left outer join
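A brief sketch with hypothetical sales data, assuming the SparkSession spark:

```python
from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("east", 10), ("east", 20), ("west", 5)], ["region", "amount"])
sales.groupBy("region").agg(
    F.min("amount"), F.max("amount"), F.mean("amount")).show()

labels = spark.createDataFrame([("east", "East Coast")], ["region", "label"])
sales.join(labels, on="region", how="inner").show()        # inner join
sales.join(labels, on="region", how="left_outer").show()   # keeps 'west' with null
```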

Spark SQL

1
Views

The aim of this video is to show how to use Spark SQL by registering a DataFrame as a temporary view or a global temporary view.

   •  Show how to register a temporary view before using Spark SQL

   •  Show how to register a global temporary view before using Spark SQL

   •  Perform a basic Spark SQL query, selecting columns from views
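Assuming a DataFrame df with name and age columns (as in the earlier sketches), registering and querying views looks like this:

```python
df.createOrReplaceTempView("people")        # session-scoped temporary view
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

df.createGlobalTempView("people_global")    # shared across sessions
spark.sql("SELECT name FROM global_temp.people_global").show()
```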

2
Schemas

The aim of this video is to learn about schemas—inferring schema and programmatically specifying schema.

   •  Demonstrate how to infer schema using reflection

   •  Demonstrate how to programmatically specify the schema

   •  Work with schemas, create views, and run SQL queries
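A minimal sketch of programmatically specifying a schema (schema inference by reflection is what createDataFrame does when no schema is given):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([("Ann", 34), ("Bob", 28)], schema)
df.printSchema()
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age < 30").show()
```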

3
SQL Operations

The aim of this video is to explore Spark SQL, using some commonly used operations.

   •  Select columns, filter rows, mutate values in a table, and calculate the mean

   •  Use split-apply-combine aggregation in a Spark SQL query

   •  Perform write and read operations for CSV, JSON, and Parquet files
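As a brief illustration, a split-apply-combine aggregation in Spark SQL might look like this; the sales view name is hypothetical:

```python
# Assumes a temporary view was registered first, e.g.
# sales_df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, AVG(amount) AS avg_amount
    FROM sales
    GROUP BY region
""").show()
```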

4
I/O Options

The aim of this video is to examine I/O options when reading and writing CSV, JSON, and Parquet files.

   •  Load and read a CSV file with the inferSchema option set to true

   •  Load and read a JSON file and save it to Parquet

   •  Demonstrate that Parquet supports direct SQL querying and schema merging
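A short sketch of these I/O options; all file paths are hypothetical:

```python
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # guess column types
df.printSchema()

jdf = spark.read.json("data.json")                        # read JSON
jdf.write.mode("overwrite").parquet("data.parquet")       # save as Parquet

spark.sql("SELECT * FROM parquet.`data.parquet`").show()  # direct SQL on a Parquet file
```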

5
Hive

The aim of this video is to demonstrate the use of Hive in relation to Spark SQL.

   •  Make use of the Hive context

   •  Show Hive tables

   •  Run Hive queries
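A minimal sketch; Hive support must be enabled when the SparkSession is built, and the table name below is hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()     # turns on Hive support for this session
         .getOrCreate())

spark.sql("SHOW TABLES").show()                            # list Hive tables
spark.sql("SELECT * FROM some_hive_table LIMIT 5").show()  # hypothetical table
```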

Machine Learning Fundamentals

1
Basic Statistics

The aim of this video is to perform basic statistics for machine learning in Spark.

   •  Show how to calculate Pearson’s correlation

   •  Show how to calculate Spearman’s correlation

   •  Perform hypothesis testing, using the Chi Square Test
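A compact sketch on tiny made-up vectors, assuming the SparkSession spark:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation, ChiSquareTest

data = [(Vectors.dense([1.0, 2.0]),),
        (Vectors.dense([2.0, 4.1]),),
        (Vectors.dense([3.0, 6.2]),)]
df = spark.createDataFrame(data, ["features"])
print(Correlation.corr(df, "features", "pearson").head()[0])
print(Correlation.corr(df, "features", "spearman").head()[0])

labeled = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0])), (1.0, Vectors.dense([2.0]))], ["label", "features"])
print(ChiSquareTest.test(labeled, "features", "label").head())
```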

2
Pipelines

The aim of this video is to learn how a pipeline chains multiple transformers and estimators together to specify an ML workflow.

   •  Explore the pipeline component called Transformers

   •  Explore the pipeline component called Estimators

   •  Explore how parameters belong to specific instances of Estimators and Transformers
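A minimal sketch of chaining stages into a workflow; training_df is a hypothetical labeled DataFrame with a text column:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # Transformer
hashing_tf = HashingTF(inputCol="words", outputCol="features")  # Transformer
lr = LogisticRegression(maxIter=10)                             # Estimator
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
# model = pipeline.fit(training_df)   # training_df: hypothetical labeled DataFrame
```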

3
Feature Extractors

The aim of this video is to explore feature extractors as a part of Spark machine learning fundamentals.

   •  Demonstrate the use of text as features with TF-IDF

   •  Transform words into vectors, using Word2Vec

   •  Convert a collection of text documents to vectors of token counts with CountVectorizer
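A small TF-IDF sketch on two toy documents, assuming the SparkSession spark:

```python
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

docs = spark.createDataFrame(
    [(0, "spark is fast"), (1, "spark runs in memory")], ["id", "text"])
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="raw").transform(words)
idf = IDF(inputCol="raw", outputCol="features").fit(tf)   # IDF is an Estimator
idf.transform(tf).select("features").show(truncate=False)
```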

4
Feature Transformers

The aim of this video is to examine feature transformers as a part of Spark machine learning fundamentals.

   •  Show how Principal Component Analysis (PCA) projects vectors into a low-dimensional space

   •  Show how the OneHotEncoder estimator maps categorical features, represented as label indices, to binary vectors

   •  Show how MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a range such as 0 to 1
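For example, MinMaxScaler on a two-row toy dataset:

```python
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 10.0]),), (Vectors.dense([3.0, 30.0]),)], ["features"])
scaler = MinMaxScaler(inputCol="features", outputCol="scaled")
scaler.fit(df).transform(df).show(truncate=False)   # each feature rescaled to [0, 1]
```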

5
Feature Selectors

The aim of this video is to demonstrate the use of feature selectors as part of Spark machine learning fundamentals.

   •  Explain the purpose of feature selection

   •  Demonstrate the use of VectorSlicer, which extracts features from a vector column

   •  Demonstrate the use of ChiSqSelector, which operates on labeled data with categorical features
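A minimal VectorSlicer sketch on a single toy row:

```python
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([0.1, 0.2, 0.3]),)], ["features"])
slicer = VectorSlicer(inputCol="features", outputCol="selected", indices=[0, 2])
slicer.transform(df).select("selected").show()   # keeps features 0 and 2
```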

Machine Learning Models

1
Classification

The aim of this video is to show how to use classification models in Spark, namely binomial Logistic Regression and Naïve Bayes classification.

   •  Classification is the process of predicting the class of given data points

   •  Show how to use Logistic Regression in Spark to predict a binary class outcome

   •  Show how to use Naïve Bayes classification in Spark to predict a binary class outcome
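A tiny two-row sketch of both classifiers (Naïve Bayes expects non-negative features):

```python
from pyspark.ml.classification import LogisticRegression, NaiveBayes
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
], ["label", "features"])

lr_model = LogisticRegression(maxIter=10).fit(train)
lr_model.transform(train).select("label", "prediction").show()

nb_model = NaiveBayes().fit(train)
nb_model.transform(train).select("label", "prediction").show()
```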

2
Regression

The aim of this video is to show how to use regression models in Spark, namely Linear Regression and Gradient-Boosted Tree Regression.

   •  Regression models the relationship between an outcome variable and its explanatory variables

   •  Demonstrate how to do Linear Regression in Spark

   •  Demonstrate how to do Gradient-Boosted Tree Regression in Spark
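A toy sketch of both regressors:

```python
from pyspark.ml.regression import LinearRegression, GBTRegressor
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame([
    (1.0, Vectors.dense([1.0])),
    (2.0, Vectors.dense([2.0])),
    (3.0, Vectors.dense([3.0])),
], ["label", "features"])

lin = LinearRegression().fit(train)
print(lin.coefficients, lin.intercept)

gbt = GBTRegressor(maxIter=10).fit(train)   # gradient-boosted trees
gbt.transform(train).select("label", "prediction").show()
```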

3
Clustering

The aim of this video is to explore Clustering, using two commonly used models, namely K-Means and LDA (Latent Dirichlet Allocation).

   •  Clustering involves separating data points into a predefined number of clusters

   •  Demonstrate the use of K-Means Clustering

   •  Demonstrate the use of Latent Dirichlet Allocation (LDA)
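A K-Means sketch on four toy points that form two obvious clusters:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.1]),)], ["features"])
model = KMeans(k=2, seed=1).fit(df)   # the number of clusters k is chosen up front
print(model.clusterCenters())
model.transform(df).show()            # each row gets a cluster assignment
```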

4
Collaborative Filtering

The aim of this video is to examine Collaborative Filtering in Spark as a model for recommendation based on users’ past behavior.

   •  Generate the top 10 movie recommendations for each user and the top 10 user recommendations for each movie

   •  Generate top 10 movie recommendations for a specified set of users

   •  Generate top 10 user recommendations for a specified set of movies
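A minimal ALS sketch on made-up ratings; the column names here are conventions, not requirements:

```python
from pyspark.ml.recommendation import ALS

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 5.0), (1, 2, 1.0)],
    ["userId", "movieId", "rating"])
model = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
            coldStartStrategy="drop").fit(ratings)

model.recommendForAllUsers(10).show()    # top 10 movies per user
model.recommendForAllItems(10).show()    # top 10 users per movie
some_users = ratings.select("userId").distinct().limit(1)
model.recommendForUserSubset(some_users, 10).show()   # for a specified set of users
```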

5
Model Selection and Tuning

The aim of this video is to demonstrate model selection and tuning in Spark.

   •  Explain the use of model selection and tuning in Spark

   •  Show how Cross-Validation works for model selection

   •  Show how Train-Validation split works for hyper-parameter tuning
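A sketch of both tools over a tiny parameter grid; train_df is a hypothetical labeled DataFrame:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid,
                           evaluator=BinaryClassificationEvaluator(), trainRatio=0.8)
# best = cv.fit(train_df).bestModel   # train_df: hypothetical labeled DataFrame
```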

Streaming

1
DStreams

The aim of this video is to show Spark’s RDD-based streaming, namely DStreams.

   •  Illustrate how to start the streaming context and receive input

   •  Show how to perform RDD transformation operations on DStreams

   •  Demonstrate DStreams, using a word count example with a data server, as in the sketch below
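The classic word count looks roughly like this; the stream can be fed by a data server such as nc -lk 9999, and sc is the existing SparkContext:

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                     # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)   # text from a data server
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)    # RDD-style transformation per batch
counts.pprint()
ssc.start()
ssc.awaitTermination()
```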

2
DStream Window Operations

The aim of this video is to learn about DStream window operations.

   •  Learn how to maintain state, using the function updateStateByKey()

   •  Demonstrate the use of the window method, which includes window length (size) and slide interval

   •  Demonstrate the use of a DStream window operation called reduceByKeyAndWindow()
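Continuing the word count sketch above (these lines go before ssc.start() is called), the stateful and windowed variants might look like this:

```python
ssc.checkpoint("checkpoint")   # stateful and windowed operations need a checkpoint dir

# Running total per word across all batches.
totals = pairs.updateStateByKey(lambda new, total: sum(new) + (total or 0))

# 30-second window sliding every 10 seconds; the second function "subtracts"
# counts for data that has slid out of the window.
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                      lambda a, b: a - b, 30, 10)
windowed.pprint()
```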

3
Structured Streaming

The aim of this video is to explore Structured Streaming, in which new rows of data arriving on the stream are appended to an unbounded table.

   •  Structured Streaming is built on top of Spark SQL and operates much like DataFrames

   •  Present pseudo code to demonstrate the key aspects of Structured Streaming

   •  Provide a word count code example to demonstrate Structured Streaming
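The word count example from the bullet above, roughly as it appears in structured form, assuming the SparkSession spark:

```python
from pyspark.sql import functions as F

lines = (spark.readStream.format("socket")
              .option("host", "localhost").option("port", 9999).load())
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()   # conceptually an unbounded, growing table

query = (counts.writeStream.outputMode("complete")
               .format("console").start())
query.awaitTermination()
```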

4
Window Operations

The aim of this video is to examine window operations such as aggregation and watermarking.

   •  Demonstrate the use of aggregations, using groupBy() and window()

   •  Explain the use of watermarking with withWatermark()

   •  Provide an example of Structured Streaming with window operations
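A sketch combining both ideas; the socket source can attach an arrival timestamp to each line:

```python
from pyspark.sql import functions as F

lines = (spark.readStream.format("socket")
              .option("host", "localhost").option("port", 9999)
              .option("includeTimestamp", "true").load())
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"),
                     lines.timestamp)

windowed = (words
    .withWatermark("timestamp", "10 minutes")   # tolerate data up to 10 minutes late
    .groupBy(F.window("timestamp", "10 minutes", "5 minutes"), "word")
    .count())
query = windowed.writeStream.outputMode("update").format("console").start()
```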

5
Joining Batch and Streaming Data

The aim of this video is to demonstrate how to join batch and streaming data.

   •  Show that joining batch and streaming data results in streaming data

   •  Point out particulars about Structured Streaming joins

   •  Provide an example of joining batch and streaming data, using a vacation dataset
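The vacation dataset itself isn't reproduced here, but the shape of a stream-static join can be sketched with Spark's built-in rate source:

```python
from pyspark.sql import functions as F

# Streaming side: the rate source emits (timestamp, value) rows continuously.
stream = (spark.readStream.format("rate")
               .option("rowsPerSecond", 1).load()
               .withColumn("key", F.col("value") % 2))

# Batch side: an ordinary static DataFrame.
static = spark.createDataFrame([(0, "even"), (1, "odd")], ["key", "parity"])

joined = stream.join(static, on="key")          # stream-static join
joined.writeStream.format("console").start()    # the result is itself streaming
```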

30-Day Money-Back Guarantee

Includes

4 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion