2.75 out of 5
4 reviews on Udemy

PySpark for Beginners

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0
Packt Publishing
19 students enrolled
English [Auto-generated]
Learn about Apache Spark and the Spark 2.0 architecture
Build and interact with Spark DataFrames using Spark SQL
Read, transform, and understand data and use it to train machine learning models
Build machine learning models with MLlib and ML

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This course will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this course, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

About the Author

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science in numerous fields, including advanced technology, airlines, telecommunications, finance, and consulting, gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research, with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark. 

Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016.

Understanding Spark

The Course Overview

This video gives you an overview of the course.

Spark Jobs and APIs

In this video, you will get a short overview of Apache Spark jobs and APIs. This provides the necessary foundation for the subsequent video on the Spark 2.0 architecture.

  • Learn about the execution process
  • Explore the resilient distributed dataset
  • Explore DataFrames and Datasets

Spark 2.0 Architecture

Apache Spark 2.0 is a major release of the Apache Spark project, based on the key learnings from the previous two years of development of the platform.

  • Unify Datasets and DataFrames
  • Introduce SparkSession
  • Understand Tungsten phase 2

Resilient Distributed Datasets

Creating RDDs

RDDs operate in parallel. This is the strongest advantage of working in Spark: each transformation is executed in parallel for an enormous increase in speed. In this video, we will first create RDDs. We will also look at global versus local scope.

  • Create RDDs
  • Study global versus local scope


Transformations

Transformations shape your dataset. These include mapping, filtering, joining, and transcoding the values in your dataset. In this video, we will showcase some of the transformations available on RDDs.

  • Understand various types of transformation

Actions

Actions execute the scheduled tasks on the dataset: once you have finished transforming your data, an action runs your transformations. An action may be called on a dataset with no transformations applied, or it may trigger the whole chain of transformations. In this video, we will see different types of actions in PySpark.

  • Understand various types of actions
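
The split between transformations and actions can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not the PySpark API: the `LazyDataset` class is a hypothetical stand-in whose method names merely mirror the RDD-style `map`, `filter`, `collect`, and `count`. Transformations only record a step; nothing runs until an action is called.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # scheduled transformations, not yet executed

    # Transformations: return a new dataset with one more scheduled step.
    def map(self, func):
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    # Actions: execute the whole scheduled chain and return a result.
    def collect(self):
        items = iter(self._data)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []  # records when the mapped function really runs


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
pipeline = numbers.map(double).filter(lambda x: x > 4)  # nothing executed yet
calls_before_action = len(calls)  # only the plan exists at this point
result = pipeline.collect()       # the action runs map, then filter
```

Calling `numbers.count()` shows the other case mentioned above: an action on a dataset with no transformations scheduled at all.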


Basic Operations with DataFrames

Whenever a PySpark program is executed using RDDs, there is a potentially large overhead to execute the job.

  • See a Catalyst Optimizer refresher
  • Speed up PySpark with DataFrames
  • Create DataFrames

High End Operations – Interpolating and Querying

There are two different methods for converting existing RDDs to DataFrames: inferring the schema using reflection, or programmatically specifying the schema. The former allows you to write more concise code, while the latter allows you to construct DataFrames when the columns and their data types are only revealed at run time. In this video, we will look at interoperating with RDDs and at querying with the DataFrame API and SQL.

  • Explore interoperating with RDDs
  • Learn querying with the DataFrame API
  • See querying with SQL
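
The two ways of attaching a schema can be sketched in plain Python. This illustrates the idea only and is not Spark code: `infer_schema` and `apply_schema` are hypothetical helpers, and the sample rows are made up. In PySpark the corresponding choice is, roughly, whether or not an explicit `StructType` schema is passed to `createDataFrame`.

```python
def infer_schema(rows, column_names):
    """Inference by reflection: read the types off the first row."""
    first = rows[0]
    return [(name, type(value)) for name, value in zip(column_names, first)]


def apply_schema(raw_rows, schema):
    """Programmatic schema: cast every raw value to the declared type."""
    return [
        tuple(col_type(value) for (_, col_type), value in zip(schema, row))
        for row in raw_rows
    ]


# Case 1: typed Python objects, so the schema can be inferred.
typed_rows = [("Alice", 34, 55.5), ("Bob", 41, 72.0)]
inferred = infer_schema(typed_rows, ["name", "age", "weight"])

# Case 2: raw strings (for example, lines read from a text file). The columns
# and their types are only known at run time, so the schema is built in code.
raw_rows = [("Alice", "34", "55.5"), ("Bob", "41", "72.0")]
runtime_schema = [("name", str), ("age", int), ("weight", float)]
converted = apply_schema(raw_rows, runtime_schema)
```

Inference keeps the code short, while the explicit schema is the only option when the values arrive untyped.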

Prepare Data for Modeling

Checking for Duplicates, Missing Observations, and Outliers

Until you have fully tested the data and proven it worthy of your time, you should neither trust it nor use it. In this video, we will show you how to deal with duplicates, missing observations, and outliers.

  • Understand duplicates in PySpark
  • Deal with a number of missing values
  • Study outliers
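
The three checks can be sketched on a handful of made-up records. This is plain Python rather than PySpark, and the 1.5 × IQR rule used for outliers is an assumption chosen for illustration, not necessarily the rule used in the video. On a Spark DataFrame, methods such as `dropDuplicates` and `approxQuantile` play roughly the same roles.

```python
import statistics

# Made-up sample records: (id, weight, height); None marks a missing value.
records = [
    (1, 10, 150),
    (2, 12, None),
    (2, 12, None),   # exact duplicate of the previous row
    (3, 11, 160),
    (4, 13, 155),
    (5, 12, None),
    (6, 11, 158),
    (7, 95, 152),    # suspiciously large weight
]

# 1. Duplicates: keep the first occurrence of each identical row.
deduplicated = list(dict.fromkeys(records))

# 2. Missing observations: share of missing values per column.
columns = ["id", "weight", "height"]
missing_share = {
    name: sum(row[i] is None for row in deduplicated) / len(deduplicated)
    for i, name in enumerate(columns)
}

# 3. Outliers: flag weights outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
weights = [row[1] for row in deduplicated]
q1, _, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [row for row in deduplicated if not lower <= row[1] <= upper]
```

Whether a flagged row is dropped, capped, or kept is a modeling decision; the sketch only identifies candidates.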

Getting Familiar with Your Data

Although we would strongly discourage such behavior, you can build a model without knowing your data. It will most likely take you longer, and the quality of the resulting model might be less than optimal, but it is doable.

  • Study the descriptive statistics
  • Understand correlation in PySpark
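
As a rough illustration of what these two steps compute, here is a plain-Python sketch of a descriptive summary and a Pearson correlation. The column names and values are made up, and `describe` and `pearson` are hypothetical helpers; on a Spark DataFrame the `describe` and `corr` methods serve a similar purpose.

```python
import math
import statistics


def describe(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up columns: limit rises with balance, discount falls with it.
balance = [100.0, 200.0, 300.0, 400.0]
limit = [1100.0, 2100.0, 3100.0, 4100.0]
discount = [40.0, 30.0, 20.0, 10.0]

summary = describe(balance)
corr_up = pearson(balance, limit)       # perfectly linear, increasing
corr_down = pearson(balance, discount)  # perfectly linear, decreasing
```

Highly correlated feature pairs are candidates for dropping one of the two before modeling.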

Visualization

There are multiple visualization packages, but in this video we will be using matplotlib and Bokeh exclusively to give you the best tools for your needs.

  • Study histograms for visualization

Introducing MLlib

Loading and Transforming the Data

Even though MLlib is designed with RDDs and DStreams in focus, for ease of transforming the data we will read the data and convert it to a DataFrame.

  • Import pyspark.sql.types
  • Specify our recode dictionary
  • Specify our recoding methods

Getting to Know Your Data

In order to build a statistical model in an informed way, an intimate knowledge of the dataset is necessary. It is possible to build a successful model without knowing the data, but it is a much more arduous task, or it requires more technical resources to test all the possible combinations of features. Therefore, after spending the required 80% of the time cleaning the data, we spend the next 15% getting to know it!

  • Look at the descriptive statistics
  • Learn about correlations
  • See statistical testing

Creating the Final Dataset

Now, we will create the final dataset that we will use to build our models. We will convert our DataFrame into an RDD of LabeledPoints. A LabeledPoint is an MLlib structure that is used to train the machine learning models. It consists of two attributes: label and features.

  • Create an RDD of LabeledPoints
  • Split into training and testing
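
The shape of this step can be sketched in plain Python. The `LabeledRow` type and the `random_split` helper below are hypothetical stand-ins, and the rows are made up; in PySpark the corresponding pieces are `LabeledPoint` from `pyspark.mllib.regression` and the RDD method `randomSplit`.

```python
import random
from typing import NamedTuple, Sequence


class LabeledRow(NamedTuple):
    """Hypothetical stand-in for a labeled point: a label plus its features."""
    label: float
    features: Sequence[float]


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test at random; sizes are approximate."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


# Made-up rows: the first value is the label, the rest are features.
raw = [
    (1.0, 0.5, 1.2),
    (0.0, 0.1, 3.4),
    (1.0, 0.7, 0.9),
    (0.0, 0.2, 2.8),
    (1.0, 0.6, 1.1),
]
points = [LabeledRow(label=row[0], features=row[1:]) for row in raw]
train, test = random_split(points, train_fraction=0.6, seed=42)
```

Fixing the seed makes the split reproducible, which matters when comparing models trained on the same data.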

Predicting Infant Survival

In this video, we will build two models: a linear classifier (logistic regression) and a non-linear one (a random forest).

  • Split into training and testing
  • Select only the most predictive features

Introducing the ML Package

Predicting the Chances of Infant Survival with ML

In this video, we will once again attempt to predict the chances of survival of an infant. First we will load the data, then create transformers, an estimator, and a pipeline. We will then fit the model, evaluate its performance, and save the model.

  • Load the data
  • Create transformers, estimator and a pipeline
  • Fit, evaluate and save the model
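
The relationship between transformers, estimators, and a pipeline can be shown with a toy version in plain Python. None of the classes below are the `pyspark.ml` classes; they are hypothetical, deliberately tiny stand-ins that only mimic the pattern: a transformer has `transform`, an estimator has `fit` and returns a fitted model, and a pipeline chains the stages.

```python
class Scaler:
    """Toy transformer: multiplies the single feature by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [(x * self.factor, y) for x, y in rows]


class ThresholdModel:
    """Toy fitted model: a transformer that appends a 0/1 prediction."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [(x, y, 1 if x >= self.threshold else 0) for x, y in rows]


class ThresholdEstimator:
    """Toy estimator: fit() learns a threshold and returns a model."""

    def fit(self, rows):
        positives = [x for x, y in rows if y == 1]
        negatives = [x for x, y in rows if y == 0]
        # Use the midpoint between the two class means as the threshold.
        midpoint = (sum(positives) / len(positives)
                    + sum(negatives) / len(negatives)) / 2
        return ThresholdModel(midpoint)


class ToyPipelineModel:
    """Result of fitting a pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # an estimator becomes a model
            fitted.append(stage)
            rows = stage.transform(rows)
        return ToyPipelineModel(fitted)


# Made-up rows: (feature, label).
training = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
model = ToyPipeline([Scaler(0.5), ThresholdEstimator()]).fit(training)
scored = model.transform([(3.0, 0), (7.0, 1)])
predictions = [row[2] for row in scored]
```

Because the fitted pipeline carries the scaler together with the model, new data goes through exactly the same preparation as the training data.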

Parameter Hyper-Tuning

The concept of parameter hyper-tuning is to find the best parameters of the model: for example, the maximum number of iterations needed to properly estimate a logistic regression model, or the maximum depth of a decision tree. In this video, we will explore two concepts that allow us to find the best parameters for our models: grid search and train-validation splitting.

  • Understand grid search with example
  • Study train validation splitting with example
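
Both ideas fit in a short plain-Python sketch. The toy nearest-neighbour classifier, its two hyperparameters, and the data are all made up for illustration; this is not the video's example. In PySpark the equivalent building blocks are `ParamGridBuilder` and `TrainValidationSplit` from `pyspark.ml.tuning`.

```python
import itertools
import random


def knn_predict(train, x, k, weighted):
    """Vote among the k training points closest to x (1-D features)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    score = 0.0
    for xi, label in nearest:
        weight = 1.0 / (abs(xi - x) + 1e-9) if weighted else 1.0
        score += weight if label == 1 else -weight
    return 1 if score > 0 else 0


def accuracy(train, validation, k, weighted):
    """Share of validation rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k, weighted) == label
               for x, label in validation)
    return hits / len(validation)


# Made-up data: the label is 1 when the feature is at least 5.
data = [(x / 2, 1 if x >= 10 else 0) for x in range(20)]

# Train-validation split: shuffle once, hold out 25% for validation.
rng = random.Random(0)
rng.shuffle(data)
cut = int(len(data) * 0.75)
train, validation = data[:cut], data[cut:]

# Grid search: evaluate every combination of hyperparameter values.
grid = {"k": [1, 3, 5], "weighted": [False, True]}
results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((params, accuracy(train, validation, **params)))

best_params, best_score = max(results, key=lambda item: item[1])
```

A single train-validation split evaluates each combination once, which is cheaper than k-fold cross-validation but gives a noisier estimate on small datasets.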

Other Features of PySpark ML in Action

In this video, we will provide examples of how to use some of the Transformers and Estimators.

  • Understand the concept of feature extraction
  • Study classification and clustering
  • Understand regression in PySpark


2 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion