Apache Spark with Python – Learn by Doing
Would you like to advance your career, and would learning Apache Spark help?
There’s no doubt Apache Spark is an in-demand skill set that commands higher pay. This course will help you get there.
This course prepares you for job interviews and technical conversations. At the end of this course, you can update your resume or CV with a variety of Apache Spark experiences.
Or maybe you need to learn Apache Spark quickly for a current or upcoming project?
How can this course help?
You will become confident and productive with Apache Spark after taking this course. You need to be confident and productive in Apache Spark to be more valuable.
Now, I’m not going to pretend here. You are going to need to put in work. This course puts you in a position to focus on the work you will need to complete.
This course uses Python, which is a fun, dynamic programming language perfect for both beginners and industry veterans.
At the end of this course, you will have a rock-solid foundation to accelerate your career and growth in the exciting world of Apache Spark.
Why choose this course?
Let’s be honest. You can find Apache Spark learning material online for free. Using these free resources is okay for people with extra time.
This course saves you time and effort. It is organized in a step-by-step approach in which each lesson builds upon the previous ones. PowerPoint presentations are minimal.
The intended audience of this course is people who need to learn Spark in a focused, organized fashion. If you want a more academic approach to learning Spark, with more than four or five hours of lectures covering the same material as found here, there are other courses on Udemy which may be better for you.
This Apache Spark with Python course emphasizes running source code examples.
All source code examples are available for download, so you can execute, experiment and customize for your environment after or during the course.
This Apache Spark with Python course covers over 50 hands-on examples. We run them locally first and then deploy them on cloud computing services such as Amazon EC2.
The following will be covered and more:
- What makes Spark a power tool of Big Data and Data Science?
- Learn the fundamentals of Spark including Resilient Distributed Datasets, Spark Actions and Transformations
- Run Spark in a Cluster in your local environment and Amazon EC2
- Deploy Python applications to a Spark Cluster
- Explore Spark SQL with CSV, JSON and MySQL (JDBC) data sources
- Convenient links to download all source code
- Reinforce your understanding through multiple quizzes and lecture recap
Introducing the course objectives, benefits, instructor, and overall methodology for you to learn and become confident with Apache Spark with Python.
Apache Spark and Python Foundational Building Blocks
I don't expect you to follow all the details of this video. I want to give you a big picture through source code examples of using Apache Spark and Python to analyze data.
We're going to start by running some code examples of Python against the Spark API through a Spark Driver program called PySpark.
I don't expect you to follow along with all these Python examples yet. I'll fill in the blanks later in the course.
We'll talk about what a "Spark Driver" program means later.
From PySpark, we're going to analyze some Uber data. Uber is a company which has disrupted the worldwide taxi industry. We're going to use Uber data for New York City from NYC's Taxi & Limousine Commission.
We can use Python and Spark to analyze this data. For example, we can determine the total number of Uber trips, the most popular Uber bases in NYC, etc.
In this example, we'll give a glimpse into Spark core concepts such as Resilient Distributed Datasets, Transformations, Actions and Spark drivers. In addition, we'll see code examples of how to use Python with Spark.
Again, I don't expect you to follow all the details here. It's intended as a high-level overview to begin.
Now that we've seen Spark with Python examples, let's continue by considering the key Spark concepts you need to know. These concepts will be used throughout the rest of this Spark with Python data science course.
We need to describe Resilient Distributed Datasets, Transformations, Actions, Spark Drivers and applications deployed to clusters in more detail.
This builds the foundations for later sections of the Spark with Python Data Science Power Tools course.
Let's confirm our understanding of the foundations of Apache Spark. Just three questions to confirm the goals of this section of the Apache Spark course.
Prepare Your Environment
Now that we've seen an example of data analysis with Python using Spark, let's configure your environment. With your own environment, you'll be able to run code from this Spark with Python course as well as experiment on your own.
If you already have Spark downloaded and installed, you can skip the next lecture. For Python setup, we're going to use a particular flavor of Python. So even if you don't end up using the version of Python used in this course, the recommendation is to still view the Python videos.
Let me know in the course comments if you have any questions. It should be a straightforward process. It just takes a while to download.
If you are preparing your Apache Spark and Python environment on a Windows machine, please watch this video. It highlights two areas you need to know before proceeding to the following lectures.
This video shows how and where to download and install Apache Spark used in this course. You are free to watch this course without installing Spark, but if you want to experiment with your own environment, you should download and install Spark on your own machine.
Walk through installing the Python version used in this Spark with Python course. We're going to use the Anaconda Python distribution, which gives us convenient access to many third-party Python libraries used in data science for charts and graphs, math, etc.
At this point, let's confirm your Spark environment is running and we're able to interact with the Spark Python API.
To accomplish this, start up the pyspark Spark driver program. This is just a short video to show how to confirm your Spark with Python environment.
ipython notebook is not a requirement for this course. But, it may help if you decide to copy-and-paste from the provided source code examples.
This video goes through ipython setup. Also, see the private course discussions on how people with a variety of setups have succeeded in configuring Apache Spark with ipython notebook.
We're going to use a few different sample data files for this Apache Spark with Python course. This video shows how and where to download them.
Links for both files are also provided at the end of this section.
Hyperlinks to download Spark, Python and command reference
Apache Spark Transformations and Actions
In essence, there are two kinds of Spark functions: Transformations and Actions. Transformations transform an existing RDD into a new, different one. Actions are functions used against RDDs to produce a value.
In this section of the Apache Spark with Python Data Science course, we'll go over a wide variety of Spark Transformation and Action functions.
This should build your confidence and understanding of how you can apply these functions to your use cases. It will also give us more of a foundation to build upon in your journey of learning Apache Spark with Python.
We're going to break Apache Spark transformations into groups. In this video, we'll cover some common spark transformations which produce RDDs. These include map, flatMap, filter, etc.
We're going to use a CSV dataset of baby names in New York. As we progress through transformations and actions in this Apache Spark with Python course, we'll determine more and more results for this sample data set.
So, let's begin with some commonly used Spark transformations.
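As a rough preview of what map, flatMap and filter do, here is a minimal sketch in plain Python rather than Spark itself. The sample rows are invented, and the column layout (year, name, county, sex, count) is an assumption about baby_names.csv. The comments name the PySpark call each step stands in for.

```python
from itertools import chain

# Invented sample rows; the column layout (year, name, county, sex, count)
# is an assumption about baby_names.csv, not taken from the course files.
lines = [
    "2013,GAVIN,ST LAWRENCE,M,9",
    "2013,LEVI,ST LAWRENCE,M,9",
    "2014,MICHAEL,KINGS,M,20",
]

# map: exactly one output element per input element.
# PySpark equivalent: rows = rdd.map(lambda line: line.split(","))
rows = list(map(lambda line: line.split(","), lines))

# flatMap: each input may yield several outputs, flattened into one sequence.
# PySpark equivalent: fields = rdd.flatMap(lambda line: line.split(","))
fields = list(chain.from_iterable(line.split(",") for line in lines))

# filter: keep only the elements that satisfy a predicate.
# PySpark equivalent: kings = rdd.filter(lambda line: "KINGS" in line)
kings = list(filter(lambda line: "KINGS" in line, lines))

print(len(rows), len(fields), kings)
```

The difference to watch for is the shape of the result: map gives back three lists of five fields each, while flatMap gives back one flat sequence of fifteen fields.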
In part 2 of Spark Transformations, we'll discover spark transformations used when we need to combine, compare and contrast elements in two RDDs. This is something we often have to do when working with datasets. Spark helps compare RDDs through transformation functions union, intersection, distinct, etc.
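To make the behavior of these functions concrete, here is a small plain-Python sketch that mimics them on two ordinary lists. It is not Spark code, and the names in the lists are invented. The comments name the PySpark call each step stands in for.

```python
# Plain-Python stand-ins for two small RDDs; the names are invented.
names_2013 = ["EMMA", "LIAM", "EMMA", "NOAH"]
names_2014 = ["NOAH", "OLIVIA", "EMMA"]

# union keeps duplicates, like concatenating the two datasets.
# PySpark equivalent: rdd_a.union(rdd_b)
union = names_2013 + names_2014

# intersection returns the common elements with duplicates removed.
# PySpark equivalent: rdd_a.intersection(rdd_b)
intersection = sorted(set(names_2013) & set(names_2014))

# distinct removes duplicates within a single dataset.
# PySpark equivalent: rdd_a.distinct()
distinct = sorted(set(names_2013))

print(union, intersection, distinct)
```

The results are sorted here only to make the output stable. Spark itself makes no ordering promise for these transformations.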
In part 3 of our focus on Spark Transformation functions, we're going to work with the "key" functions, including groupByKey, reduceByKey, aggregateByKey and sortByKey.
All these transformations work with key,value pair RDDs, so we will cover the creation of PairRDDs as well.
We'll continue to use the baby_names.csv file used in Part 1 and Part 2 of Spark Transformations.
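The idea behind the "key" functions can be sketched in plain Python with a list of (key, value) tuples. This is only an analogy under stated assumptions: the helper names `group_by_key`, `reduce_by_key` and `sort_by_key` are hypothetical stand-ins for groupByKey, reduceByKey and sortByKey, and the pairs are invented.

```python
from collections import defaultdict
from functools import reduce
from operator import add

# Invented (name, count) pairs standing in for a key,value pair RDD.
pairs = [("EMMA", 10), ("LIAM", 7), ("EMMA", 5), ("NOAH", 3), ("LIAM", 1)]

def group_by_key(kv_pairs):
    """Collect every value that shares a key (like groupByKey)."""
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_by_key(func, kv_pairs):
    """Merge the values of each key with a two-argument function (like reduceByKey)."""
    return {key: reduce(func, values)
            for key, values in group_by_key(kv_pairs).items()}

def sort_by_key(kv_pairs):
    """Order the pairs by key (like sortByKey)."""
    return sorted(kv_pairs, key=lambda kv: kv[0])

grouped = group_by_key(pairs)
totals = reduce_by_key(add, pairs)
ordered = sort_by_key(totals.items())
print(grouped, ordered)
```

One difference the sketch hides: on a real cluster, reduceByKey combines values within each partition before moving data between nodes, which is why it is often preferred over groupByKey followed by a sum.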
Let's confirm our understanding of Spark Transformations at this point.
Spark Actions return values to the Spark Driver program. Also, recall that calling an Action function against an RDD causes a previously lazy RDD to be evaluated. So, in the real world when working with large datasets, we need to be careful when triggering RDDs to be evaluated through Spark actions.
This video shows commonly used Spark Actions.
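Lazy evaluation can be hard to picture, so here is a toy model in plain Python. The `LazyDataset` class is hypothetical and is not how Spark is implemented. It only illustrates the contract described above: transformations record work, and an action is what finally runs it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new LazyDataset and do no work yet.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _evaluate(self):
        items = iter(self._source)
        for kind, func in self._steps:
            if kind == "map":
                items = map(func, items)
            else:
                items = filter(func, items)
        return items

    # Actions: trigger evaluation and return a value to the caller.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


calls = []

def tracked_double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2

dataset = LazyDataset(range(5)).map(tracked_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: no action has been called
result = dataset.collect()        # the action triggers the recorded steps
print(calls_before_action, result)
```

Note that calling a second action such as `dataset.count()` runs `tracked_double` over the data again. Real RDDs behave similarly unless they are cached, which is one reason to be deliberate about actions on large datasets.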
Let's confirm our understanding of Spark Actions.
Provides links to download the source code (ipython notebook) used in this section of the course.
Apache Spark Clusters
Clusters allow Spark to process huge volumes of data by distributing the workload across multiple nodes. This is also referred to as "running in parallel" or "horizontal scaling".
A cluster manager is required to run Spark on a cluster. Spark supports three types of cluster managers: Apache YARN, Apache Mesos and an internal cluster manager distributed with Spark called Standalone.
Let's cover the key concepts of this Spark Clustering section of the course.
Let's run a Spark Standalone cluster within your environment. We'll start a Spark Master and one Spark worker. We'll quickly go over the Spark UI web console. We'll return to the Spark UI console in a later lecture after we deploy a couple of Python programs to it.
Now that we have a Spark cluster running, how do we use it? In this lecture of the Spark with Python course, we'll deploy a couple of Python programs. We'll start with a simple example and then progress to more complicated examples which include utilizing spark-packages and Spark SQL.
Deploy commands include:
bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 examples/src/main/python/pi.py
bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 examples/src/main/python/wordcount.py baby_names.csv
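The commands above all follow the same shape: the spark-submit script, a `--master` URL, the Python program, then the program's own arguments. A minimal sketch of that shape, using a hypothetical helper named `build_submit_command` and a placeholder master host, might look like this:

```python
import shlex

def build_submit_command(master, script, script_args=(), packages=()):
    """Assemble a spark-submit command line as a list of arguments."""
    command = ["bin/spark-submit", "--master", master]
    if packages:
        # --packages takes a comma-separated list of Maven coordinates.
        command += ["--packages", ",".join(packages)]
    command.append(script)
    command.extend(script_args)
    return command

# "spark://master-host:7077" is a placeholder master URL, not a real host.
cmd = build_submit_command(
    "spark://master-host:7077",
    "examples/src/main/python/wordcount.py",
    script_args=["baby_names.csv"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Replace the placeholder with the master URL shown at the top of your own Spark UI page. To try a program without a cluster, a master such as `local[2]` runs it locally with two threads.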
Let's review a Python program which utilizes examples we've already seen in this Spark with Python course. It's a program which analyzes New York City Uber data using Spark SQL. The video will show the program in the Sublime Text editor, but you can use any editor you wish.
When deploying our driver program, we need to do things differently than we have while working with pyspark. For example, we need to obtain a SparkContext and SQLContext ourselves. We also need specific Python imports.
bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv
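For orientation, here is a hedged sketch of what a standalone driver program of this kind could look like. It is not the course's uberstats.py. It assumes the Spark 1.x style API from the era of the spark-csv package, and the column names `dispatching_base_number` and `trips` are assumptions about the Uber CSV header.

```python
import sys

def main(csv_path):
    # Imported here so the module can be loaded without Spark on the path;
    # spark-submit provides pyspark at run time.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Unlike the pyspark shell, a standalone program creates its own contexts.
    sc = SparkContext(appName="UberStats")
    sql_context = SQLContext(sc)

    # The format name comes from the spark-csv package passed via --packages.
    df = (sql_context.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load(csv_path))

    # Column names below are assumptions about the Uber CSV header.
    df.registerTempTable("uber")
    totals = sql_context.sql(
        "SELECT dispatching_base_number, SUM(trips) AS total_trips "
        "FROM uber GROUP BY dispatching_base_number")
    totals.show()

    sc.stop()

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

On newer Spark releases, `registerTempTable` is deprecated in favor of `createOrReplaceTempView`, and CSV support is built in, so the `--packages` argument and the long format name would no longer be needed.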
The Spark UI was briefly introduced in a previous lecture. Let's return to it now that we have an available worker in the cluster and have deployed some Python programs.
The Spark UI is the tool for Spark Cluster diagnostics, so we'll review the key attributes of the tool.
Apache Spark can be run on a cluster of two or more Amazon EC2 instances. In this video, let's go over how to create a Spark cluster on EC2. We'll cover the setup on the Spark side as well as how to configure Amazon EC2 authentication and authorization using the AWS web console.
We'll continue setting up the Spark cluster on EC2, with special attention to how we can use ipython notebook against our Spark cluster running in EC2.
Before the EC2 cluster is ready to use from ipython notebook, we need to open port 7077.
Let's confirm our understanding of Spark Clustering.
Spark SQL is perfect for those coming from a SQL background. It allows us to use SQL against a variety of datasets including CSV, JSON and JDBC databases. The Spark code for working with these datasets looks the same!
In this section of the Spark with Python course, we're going to discuss a certain kind of RDD used with Spark SQL. Then, we're going to cover Spark SQL through input data source examples such as CSV, JSON and a MySQL database.
Spark SQL uses a distributed collection called a DataFrame, which builds on Resilient Distributed Datasets and is composed of Row objects accompanied by a schema. The schema describes the data type of each column. A DataFrame may be considered similar to a table in a traditional relational database.
We’re going to use the Uber dataset, ipython notebook and the spark-csv package available from Spark Packages to make our lives easier. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames.” This library is compatible with Spark 1.3 and above.
Let's load a JSON input source to Spark SQL’s SQLContext. This Spark SQL JSON with Python portion of the course has two parts. The first part shows examples of JSON input sources with a specific structure. The second part warns you of something you might not expect when using Spark SQL with JSON data source.
We are going to use two JSON inputs. We’ll start with a simple, trivial example and then move to an analysis of historical World Cup player data.
The World Cup Player source data may be downloaded from the github repo for this course or https://raw.githubusercontent.com/jokecamp/FootballData/master/World%20Cups/all-world-cup-players.json
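One behavior worth knowing when preparing JSON input (which may or may not be the surprise the lecture has in mind) is that Spark SQL's JSON reader, by default, expects each line of the file to be a complete JSON object rather than one large pretty-printed document. The sketch below uses only the Python standard library to convert a JSON array into that line-per-record layout. The helper name `to_json_lines` is hypothetical and the records are invented.

```python
import json

def to_json_lines(json_array_text):
    """Convert a JSON array of objects into one compact JSON object per line."""
    records = json.loads(json_array_text)
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)

# Invented records for illustration only.
pretty = """
[
  {"name": "Player A", "year": 1930},
  {"name": "Player B", "year": 1934}
]
"""

json_lines = to_json_lines(pretty)
print(json_lines)
```

Writing `json_lines` to a file gives an input where every line stands on its own, which is the layout the JSON data source reads by default.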
Now that we have Spark SQL experience with CSV and JSON, connecting to and using a MySQL database will be easy. So, let’s cover how to use Spark SQL with Python and a MySQL database input data source.
We’re going to load some NYC Uber data into a database. Then, we’re going to fire up pyspark with a command line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.
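As a rough idea of what the connection settings look like, here is a small sketch that builds the options for Spark's JDBC data source. The helper name `mysql_jdbc_options` is hypothetical, and the host, database, table, user and password are placeholders rather than values from the course.

```python
def mysql_jdbc_options(host, database, table, user, password, port=3306):
    """Build the option dictionary for Spark's JDBC data source."""
    return {
        "url": "jdbc:mysql://{}:{}/{}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class of the older MySQL Connector/J releases; newer
        # releases use com.mysql.cj.jdbc.Driver.
        "driver": "com.mysql.jdbc.Driver",
    }

# Placeholder connection details, not values from the course.
options = mysql_jdbc_options("localhost", "uber", "trips", "spark_user", "secret")
print(options["url"])

# Inside pyspark, started with the connector JAR available, for example
#   bin/pyspark --jars <path-to-mysql-connector.jar>
# the options could be used as:
#   df = sqlContext.read.format("jdbc").options(**options).load()
```

Depending on the Spark version, the connector JAR may also need to be passed with `--driver-class-path`. Once loaded, the DataFrame can be registered and queried the same way as the CSV and JSON examples.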
All the source code used in the Spark SQL section of the Spark with Python course is available from the course github repo.
Conclusion and Free Bonus Lecture
Thanks for taking the course! I hope you enjoyed the course and you are feeling comfortable and confident using Python with Apache Spark. If you have any questions or suggestions on how to improve this course, just let me know in the course discussions forum.
Be sure to visit http://www.supergloo.com for discount coupons to my other Udemy courses and links to my Spark related books. Also, you'll have a chance to sign up for my mailing list where I send announcements of new courses, books, Spark tutorials, etc.
Come check it out!