4.1 out of 5
128 reviews on Udemy

Apache Spark with Python – Learn by Doing

50 Python source code examples and multiple deployment scenarios
Todd McGrath
903 students enrolled
English [Auto-generated]
Have confidence using Spark from Python
Understand Spark core concepts and processing options
Run Spark and Python on your own computer
Set up Spark on a new Amazon EC2 cluster
Deploy Python programs to a Spark cluster
Know what tools to use for Spark administration
Certificate of completion
30-day money-back guarantee

Would you like to advance your career? Learning Apache Spark will help.

There’s no doubt Apache Spark is an in-demand skill that commands higher pay. This course will help you get there.

This course prepares you for job interviews and technical conversations. At the end of this course, you can update your resume or CV with a variety of Apache Spark experiences.

Or maybe you need to learn Apache Spark quickly for a current or upcoming project?

How can this course help?

You will become confident and productive with Apache Spark after taking this course, and that confidence and productivity is what makes you more valuable.

Now, I’m not going to pretend here. You are going to need to put in work. This course puts you in a position to focus on the work you will need to complete.

This course uses Python, which is a fun, dynamic programming language perfect for both beginners and industry veterans.

At the end of this course, you will have a rock-solid foundation to accelerate your career and growth in the exciting world of Apache Spark.

Why choose this course?

Let’s be honest. You can find Apache Spark learning material online for free. Using these free resources is okay for people with extra time.

This course saves you time and effort. It is organized in a step-by-step approach, with each lesson building on the previous one. PowerPoint presentations are kept to a minimum.

The intended audience of this course is people who need to learn Spark in a focused, organized fashion. If you want a more academic approach to learning Spark with over 4-5 hours of lectures covering the same material as found here, there are other courses on Udemy which may be better for you.

This Apache Spark with Python course emphasizes running source code examples.

All source code examples are available for download, so you can execute, experiment and customize for your environment after or during the course.

This Apache Spark with Python course covers over 50 hands-on examples. We run them locally first and then deploy them on cloud computing services such as Amazon EC2.

The following will be covered and more:

  • What makes Spark a power tool of Big Data and Data Science?
  • Learn the fundamentals of Spark including Resilient Distributed Datasets, Spark Actions and Transformations
  • Run Spark in a Cluster in your local environment and Amazon EC2
  • Deploy Python applications to a Spark Cluster
  • Explore Spark SQL with CSV, JSON and mySQL (JDBC) data sources
  • Convenient links to download all source code
  • Reinforce your understanding through multiple quizzes and lecture recap

Course Overview

Course Overview and Methodology

Introducing the course objectives, benefits, instructor, and overall methodology for you to learn and become confident with Apache Spark with Python.

Apache Spark and Python Foundational Building Blocks

The Big Picture - Running Some Code, Analyzing Data with Apache Spark

I don't expect you to follow all the details of this video. I want to give you a big picture through source code examples of using Apache Spark and Python to analyze data.

We're going to start by running some code examples of Python against the Spark API through a Spark Driver program called PySpark.

I don't expect you to follow along with all these Python examples yet. I'll fill in the blanks later in the course.

We'll talk about what a "Spark Driver" program means later.

From PySpark, we're going to analyze some Uber data. Uber is a company which has disrupted the worldwide taxi industry. We're going to use Uber data for New York City, published by the NYC Taxi & Limousine Commission.

With this data, we can use Python and Spark to determine the total number of Uber trips, the most popular Uber bases in NYC, and more.

In this example, we'll give a glimpse into Spark core concepts such as Resilient Distributed Datasets, Transformations, Actions and Spark drivers. In addition, we'll see code examples of how to use Python with Spark.

Again, I don't expect you to follow all the details here; it's intended as a high-level overview to begin.

Apache Spark Fundamentals - The Essentials You Need to Know

Now that we've seen Spark with Python examples, let's continue by considering the key Spark concepts you need to know. These concepts will be used throughout the rest of this Spark with Python data science course.

We need to describe Resilient Distributed Datasets, Transformations, Actions, Spark Drivers and applications deployed to clusters in more detail.

This builds the foundations for later sections of the Spark with Python Data Science Power Tools course.

[Milestone] Key Concepts Quiz

Let's confirm our understanding of the foundations of Apache Spark. Just three questions to confirm the goals of this section of the Apache Spark course.

Prepare Your Environment

Setting Up Your Environment

Now that we've seen an example of data analysis with Python using Spark, let's configure your environment. With your own environment, you'll be able to run code from this Spark with Python course as well as experiment on your own.

If you already have Spark downloaded and installed, you can skip the next lecture. For Python setup, we're going to use a particular flavor of Python. So even if you don't end up using the version of Python used in this course, the recommendation is to still view the Python videos.

Let me know in the course comments if you have any questions. It should be a straightforward process. It just takes a while to download.

For Windows Operating System Users Only

If you are preparing your Apache Spark and Python environment on a Windows machine, please watch this video. It highlights two areas you need to know before proceeding to the following lectures.

Download and Install Spark

This video shows how and where to download and install Apache Spark used in this course. You are free to watch this course without installing Spark, but if you want to experiment with your own environment, you should download and install Spark on your own machine.

Download and Install Python

Walk through installing the Python version used in this Spark with Python course. We're going to use the Anaconda Python distribution, which provides convenient access to many 3rd party Python libraries used in data science, such as libraries for charting, graphing, math, etc.

[Milestone] Setup Checkpoint

At this point, let's confirm your Spark environment is running and we're able to interact with the Spark Python API.

To accomplish this, start up the pyspark Spark driver program. This is just a short video to show how to confirm your Spark with Python environment.

Check ipython notebook Setup (optional)

ipython notebook is not a requirement for this course. But, it may help if you decide to copy-and-paste from the provided source code examples.

This video goes through ipython setup. Also, see the private course discussions on how people with a variety of setups have succeeded in configuring Apache Spark with ipython notebook.

Sample Data Access

We're going to use a few different sample data files for this Apache Spark with Python course. This video shows how and where to download them.

Links for both files are also provided at the end of this section.

[Milestone] Setup References and Download Links

Hyperlinks to download Spark, Python, and a command reference.

Apache Spark Transformations and Actions

Spark Transformations and Actions Overview

In essence, there are two kinds of Spark functions: Transformations and Actions. Transformations transform an existing RDD into a new, different one and are evaluated lazily. Actions are functions used against RDDs to produce a value; calling an Action triggers computation.

In this section of the Apache Spark with Python Data Science course, we'll go over a wide variety of Spark Transformation and Action functions.

This should build your confidence and understanding of how you can apply these functions to your use cases. It will also create more foundation for us to build upon in your journey of learning Apache Spark with Python.

Spark Transformations Part 1

We're going to break Apache Spark transformations into groups. In this video, we'll cover some common Spark transformations which produce RDDs. These include map, flatMap, filter, etc.

We're going to use a CSV dataset of baby names in New York. As we progress through transformations and actions in this Apache Spark with Python course, we'll determine more and more results for this sample data set.

So, let's begin with some commonly used Spark transformations.

Spark Transformations Part 2

In part 2 of Spark Transformations, we'll discover spark transformations used when we need to combine, compare and contrast elements in two RDDs. This is something we often have to do when working with datasets. Spark helps compare RDDs through transformation functions union, intersection, distinct, etc.

Spark Transformations Part 3

In part 3 of our focus on Spark Transformation functions, we're going to work with the "key" functions, including groupByKey, reduceByKey, aggregateByKey, and sortByKey.

All these transformations work with key/value pair RDDs, so we will cover the creation of pair RDDs as well.

We'll continue to use the baby_names.csv file used in Part 1 and Part 2 of Spark Transformations.

[Milestone] Transformations Quiz

Let's confirm our understanding of Spark Transformations at this point.

Spark Actions

Spark Actions return values to the Spark Driver program. Also, recall that Action functions called against an RDD cause the previously lazy RDD to be evaluated. So, in the real world, when working with large datasets, we need to be careful about when we trigger RDD evaluation through Spark Actions.

This video shows commonly used Spark Actions.

[Milestone] Spark Actions Quiz

Let's confirm our understanding of Spark Actions.

[Milestone] Download Resources and Source Code Access

Provides links to download the source code (ipython notebook) used in this section of the course.

Apache Spark Clusters

Spark on a Cluster Introduction

Clusters allow Spark to process huge volumes of data by distributing the workload across multiple nodes. This is also referred to as "running in parallel" or "horizontal scaling."

A cluster manager is required to run Spark on a cluster. Spark supports three types of cluster managers: Apache YARN, Apache Mesos, and an internal cluster manager distributed with Spark called Standalone.

Let's cover the key concepts of this Spark Clustering section of the course.

Run Standalone Cluster

Let's run a Spark Standalone cluster within your environment. We'll start a Spark Master and one Spark worker. We'll quickly go over the Spark UI web console. We'll return to the Spark UI console in a later lecture after we deploy a couple of Python programs to it.

Deploy Python Programs to the Cluster

Now that we have a Spark cluster running, how do we use it? In this lecture of the Spark with Python course, we'll deploy a couple of Python programs. We'll start with a simple example and then progress to more complicated examples which include utilizing spark-packages and Spark SQL.

Deploy commands include:

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 examples/src/main/python/pi.py

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 examples/src/main/python/wordcount.py baby_names.csv

[Milestone] Write and Deploy Python Program to the Spark Cluster

Let's review a Python program which utilizes examples we've already seen in this Spark with Python course. It's a program which analyzes New York City Uber data using Spark SQL. The video will show the program in the Sublime Text editor, but you can use any editor you wish.

When deploying our driver program, we need to do things differently than we did while working with pyspark. For example, we need to obtain a SparkContext and SQLContext ourselves, which requires specific Python imports.

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv

Spark Cluster Administrative Diagnostics - The Spark UI

The Spark UI was briefly introduced in a previous lecture. Let's return to it now that we have an available worker in the cluster and have deployed some Python programs.

The Spark UI is the tool for Spark Cluster diagnostics, so we'll review the key attributes of the tool.

Create an Amazon EC2 Based Cluster Part 1

Apache Spark can be run on a cluster of two or more Amazon EC2 instances. In this video, let's go over how to create a Spark cluster on EC2. We'll cover both the Spark setup and how to configure Amazon EC2 authentication and authorization using the AWS Management Console.

[Milestone] Create an Amazon EC2 Based Cluster Part 2

We'll continue setting up the Spark cluster on EC2, with special attention to how we can use ipython notebook against our Spark cluster running in EC2.

Before the EC2 cluster is ready to use from ipython notebook, we need to open port 7077.

[Milestone] Spark Cluster Quiz

Let's confirm our understanding of Spark Clustering.

Spark SQL

Spark SQL Introduction

Spark SQL is perfect for those coming from a SQL background. It allows us to use SQL against a variety of datasets including CSV, JSON and JDBC databases. The Spark code for working with these datasets looks the same!

In this section of the Spark with Python course, we're going to discuss a certain kind of RDD used with Spark SQL. Then, we're going to cover Spark SQL through input data source examples such as CSV, JSON and a mySQL database.

Spark SQL with New York City Uber Trips CSV Source

Spark SQL uses a type of Resilient Distributed Dataset called DataFrames, which are composed of Row objects accompanied by a schema. The schema describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database.


We’re going to use the Uber dataset, ipython notebook and the spark-csv package available from Spark Packages to make our lives easier. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames.” This library is compatible with Spark 1.3 and above.

Spark SQL with Historical World Cup Player Statistics JSON Source

Let's load a JSON input source into Spark SQL’s SQLContext. This Spark SQL JSON with Python portion of the course has two parts. The first part shows examples of JSON input sources with a specific structure. The second part warns you of something you might not expect when using Spark SQL with a JSON data source.


We are going to use two JSON inputs. We’ll start with a simple, trivial example and then move to an analysis of historical World Cup player data.

The World Cup Player source data may be downloaded from the github repo for this course or https://raw.githubusercontent.com/jokecamp/FootballData/master/World%20Cups/all-world-cup-players.json

Spark SQL with mySQL (JDBC) source

Now that we have Spark SQL experience with CSV and JSON, connecting and using a mySQL database will be easy. So, let’s cover how to use Spark SQL with Python and a mySQL database input data source.


We’re going to load some NYC Uber data into a database. Then, we’re going to fire up pyspark with a command line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.

[Milestone] Spark SQL Resources and Download Source Code

All the source code used in the Spark SQL section of the Spark with Python course is available from the course github repo.

Conclusion and Free Bonus Lecture

Apache Spark with Python Course Conclusion and Looking Ahead

Thanks for taking the course! I hope you enjoyed the course and you are feeling comfortable and confident using Python with Apache Spark. If you have any questions or suggestions on how to improve this course, just let me know in the course discussions forum.

Bonus Lecture: Access to Free Books and Course Discounts

Be sure to visit http://www.supergloo.com for discount coupons to my other Udemy courses and links to my Spark related books. Also, you'll have a chance to sign up for my mailing list where I send announcements of new courses, books, Spark tutorials, etc.

Come check it out!

2 hours on-demand video
3 articles
Full lifetime access
Access on mobile and TV
Certificate of Completion