PySpark for Beginners
Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This course will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this course, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.
About the Author
Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science across numerous fields: advanced technology, airlines, telecommunications, finance, and consulting, which he gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research, with a focus on choice modeling and revenue management applications in the airline industry.
At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark.
Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016.
This video will give you an overview of the course.
In this video, you will be provided with a short overview of the Apache Spark Jobs and APIs. This covers the necessary foundation for the subsequent section on Spark 2.0 architecture.
- Learn about the execution process
- Explore the resilient distributed dataset
- Explore DataFrames and Datasets
Apache Spark 2.0 is the most recent major release of the Apache Spark project, based on the key learnings from the last two years of development of the platform.
- Unify Datasets and DataFrames
- Introduce SparkSession
- Understand Tungsten phase 2
Resilient Distributed Datasets
RDDs operate in parallel. This is the strongest advantage of working in Spark: each transformation is executed in parallel, for an enormous increase in speed. In this video, we will first create RDDs. We will also have a look at global versus local scope.
- Create RDDs
- Study global versus local scope
Transformations shape your dataset. These include mapping, filtering, joining, and transcoding the values in your dataset. In this video, we will showcase some of the transformations available on RDDs.
- Understand various types of transformation
Actions execute the scheduled tasks on the dataset; once you have finished transforming your data, you can execute the accumulated transformations. An action might trigger no transformations at all, or execute the whole chain of them. In this video, we will see different types of actions in PySpark.
- Understand various types of actions
Whenever a PySpark program is executed using RDDs, there is a potentially large overhead to execute the job, since the data has to be serialized and passed between the JVM and the Python worker processes.
- See the Catalyst Optimizer refresher
- Speed up PySpark with DataFrames
- Create DataFrames
There are two different methods for converting existing RDDs to DataFrames: inferring the schema using reflection, or programmatically specifying the schema. The former allows you to write more concise code, while the latter allows you to construct DataFrames when the columns and their data types are only revealed at run time. In this video, we will look at querying with the DataFrame API and SQL.
- Explore Interoperating with RDDs
- Learn querying with the DataFrame API
- See querying with SQL
Prepare Data for Modeling
Until you have fully tested the data and proven it worthy of your time, you should neither trust it nor use it. In this video, we will show you how to deal with duplicates, missing observations, and outliers.
- Understand duplicates in PySpark
- Deal with a number of missing values
- Study outliers
Although we would strongly discourage such behavior, you can build a model without knowing your data. It will most likely take you longer, and the quality of the resulting model might be less than optimal, but it is doable.
- Study the descriptive statistics
- Understand correlation in PySpark
There are multiple visualization packages, but in this video we will be using matplotlib and Bokeh exclusively to give you the best tools for your needs.
- Study histogram for visualization
Even though MLlib is designed with RDDs and DStreams in focus, for ease of transforming the data we will read the data and convert it to a DataFrame.
- Import the pyspark.sql.types
- Specify our recode dictionary
- Specify our recoding methods
In order to build a statistical model in an informed way, an intimate knowledge of the dataset is necessary. Without knowing the data, it is still possible to build a successful model, but it becomes a much more arduous task, or requires more technical resources to test all the possible combinations of features. Therefore, after spending the required 80% of the time cleaning the data, we spend the next 15% getting to know it!
- Look at the descriptive statistics
- Learn about correlations
- See statistical testing
Now, we will create our final dataset that we will use to build our models. We will convert our DataFrame into an RDD of LabeledPoints. A LabeledPoint is an MLlib structure that is used to train the machine learning models. It consists of two attributes: label and features.
- Create an RDD of LabeledPoints
- Split into training and testing
In this video, we will build two models: a linear classifier (logistic regression) and a non-linear one (a random forest).
- Split into training and testing
- Select only the most predictable features
Introducing the ML Package
In this video, we will once again attempt to predict the chances of survival of an infant. First, we will load the data, then create transformers, estimators, and a pipeline. We will then fit the model, evaluate its performance, and save the model.
- Load the data
- Create transformers, estimator and a pipeline
- Fit, evaluate and save the model
Hyper-parameter tuning aims to find the best parameters of the model: for example, the maximum number of iterations needed to properly estimate the logistic regression model, or the maximum depth of a decision tree. In this video, we will explore two techniques that allow us to find the best parameters for our models: grid search and train-validation splitting.
- Understand grid search with example
- Study train validation splitting with example
In this video, we will provide examples of how to use some of the Transformers and Estimators.
- Understand the concept of feature extraction
- Study classification and clustering
- Understand regression in PySpark