Processing big data in real time is challenging because of the demands of scalability, data consistency, and fault tolerance. Big Data Processing with Apache Spark teaches you how to use Spark to make your overall analytical workflow faster and more efficient. You’ll explore the core concepts and tools of the Spark ecosystem, such as Spark Streaming and its API, the machine learning extension, and Structured Streaming.
You’ll begin by learning data processing fundamentals using the Resilient Distributed Dataset (RDD), Spark SQL, Dataset, and DataFrame APIs. After grasping these fundamentals, you’ll move on to using the Spark Streaming APIs to consume data in real time from TCP sockets, and you’ll integrate Amazon Web Services (AWS) for stream consumption.
By the end of this course, you’ll not only understand how to use the machine learning extension and Structured Streaming, but you’ll also be able to apply Spark to your own upcoming big data projects.
About the Authors
Manuel Ignacio Franco Galeano is a computer scientist from Colombia. He works for Fender Musical Instruments as a lead engineer in Dublin, Ireland. He holds a master’s degree in computer science from University College Dublin (UCD). His areas of interest and research are music information retrieval, data analytics, distributed systems, and blockchain technologies.
Nimish Narang graduated from UBC in 2016 with a degree in biology and computer science. He has been developing mobile apps for Android and iOS since 2015. For the past two years he has focused on data analysis and machine learning, and he has previously published Keras and Professional Scala with Packt.
Introduction to Spark Distributed Processing
Let us begin our learning journey on Big Data Processing with Apache Spark. In this course, we will learn to efficiently tackle large datasets and perform big data analysis with Spark and Python. The GitHub repository for this course is available at https://github.com/TrainingByPackt/Big-Data-Processing-with-Apache-Spark-eLearning
Before we start learning about Apache Spark, let us ensure that we have access to the ecosystem and all of the necessary tools. To begin with, let us download and install Spark and set up our computer environment by installing the following:
Spark 2.4
OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit, or Windows 10 64-bit
Python 3.0 or above
Amazon Web Services (AWS) account
Let us begin the course by learning about data processing fundamentals using RDDs, datasets, and their APIs. In this section, we will learn how to use Spark for data processing and get an introduction to Spark SQL and Spark DataFrames.
Now that we are familiar with the lesson overview as a whole, let us begin. In this section, we will learn about the programming languages supported by Spark, its components, and its deployment modes.
RDDs support two types of operations: transformations and actions. Transformations build new RDDs lazily, while actions trigger execution of the dataset and return results. Let us dig further into these concepts and use them practically in the demo.
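To make the distinction concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the application name is arbitrary) in which map and filter are lazy transformations and collect and count are actions that trigger execution:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDOperationsDemo")  # hypothetical app name

numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing executes until an action is called
squares = numbers.map(lambda x: x * x)        # transformation
evens = squares.filter(lambda x: x % 2 == 0)  # transformation

# Actions trigger execution of the whole lineage and return results
print(evens.collect())   # [4, 16, 36, 64, 100]
print(squares.count())   # 10

sc.stop()
```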
In continuation from the previous section, let us dive deep into the concepts of map and reduce functions and their execution.
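As a rough illustration of that pattern, the sketch below (assuming a local Spark installation; the sample sentences are made up) maps each word to a (word, 1) pair and reduces by key to compute word counts:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "MapReduceDemo")  # hypothetical app name

lines = sc.parallelize([
    "spark makes big data simple",
    "big data needs fast processing",
])

# Map phase: emit (word, 1) pairs; reduce phase: sum the counts per word
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
sc.stop()
```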
The interactive Python interface is a great tool for simple computations, but its usefulness becomes limited as computing operations grow in complexity. In this section, let us learn to write Python programs that can interact with a Spark cluster outside of the interactive console. Further in this section, we will also learn about functional programming.
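A minimal standalone program might look like the sketch below (the file name standalone_app.py and the application name are hypothetical); instead of being typed into the interactive console, it can be submitted to a cluster with spark-submit, for example spark-submit standalone_app.py:

```python
# standalone_app.py (hypothetical file name)
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("StandaloneDemo")  # hypothetical app name
    sc = SparkContext(conf=conf)

    # Sum the numbers 1..1000 in parallel across the cluster
    total = sc.parallelize(range(1, 1001)).reduce(lambda a, b: a + b)
    print("Sum of 1..1000 = {}".format(total))

    sc.stop()
```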
Nested functions are functions defined inside other functions. The most important advantage of this paradigm is that the outer scope cannot see what is happening inside the inner function; nonetheless, the inner scope can access variables in the outer scope. Now, let us look at an example of this syntax. Further in this section, we will also learn about standalone Python programs.
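Here is a small, Spark-independent sketch of the nested-function pattern; the names make_multiplier and multiply are purely illustrative:

```python
def make_multiplier(factor):
    # 'factor' lives in the outer scope
    def multiply(value):
        # The inner function can read variables from the enclosing scope,
        # while the outer scope cannot see 'value' or anything defined here
        return value * factor
    return multiply

double = make_multiplier(2)
print(double(21))  # 42
```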
Before we understand how each of these works with Spark, let us first define what they mean. A dataset is a distributed collection that provides additional metadata about the structure of the data that is stored. A DataFrame is a dataset that organizes information into named columns. DataFrames can be built from different sources, such as JSON, XML, and databases. In this section, let us cover each of them in detail. For further information, refer to the MovieLens datasets.
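As a brief sketch of the DataFrame API (the file movies.json, holding one JSON record per line, and its title column are hypothetical), a DataFrame can be built from JSON and queried either through its methods or through Spark SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# 'movies.json' is a hypothetical file with one JSON record per line
df = spark.read.json("movies.json")

df.printSchema()            # named columns are inferred from the JSON
df.select("title").show(5)  # assumes a 'title' column exists

# Register the DataFrame as a temporary view and query it with Spark SQL
df.createOrReplaceTempView("movies")
spark.sql("SELECT title FROM movies LIMIT 5").show()
```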
Let us quickly recap our learning from this lesson.
Test your learning with this assessment.
Introduction to Spark Streaming
In the previous lesson, we learned to write Python programs to execute parallel operations inside a Spark cluster, and we created and transformed RDDs. In this section, let us learn to process streams of data in real time and write Python standalone programs. So, let’s get started!
Consuming live streams of data is a challenging endeavor, one reason being the volume of the incoming data. The variability in the flow of information may lead to situations where very fast producers overwhelm consumers. Let’s learn to find the right balance between reads and writes through a few key concepts.
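One such concept is backpressure, which lets Spark Streaming adapt its ingestion rate to how quickly previous batches were processed. A minimal configuration sketch (the application name is arbitrary) might look like this:

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Enabling backpressure lets the receiver rate adapt to batch processing speed
conf = (SparkConf()
        .setAppName("BackpressureDemo")  # hypothetical app name
        .set("spark.streaming.backpressure.enabled", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=2)  # 2-second micro-batches
```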
A DStream (discretized stream) is an abstraction that represents a continuous stream of data. This abstraction provides functionality for consistency and fault recovery. Let us learn about DStreams in detail in this section.
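The sketch below (assuming a local Spark installation and a hypothetical text producer listening on localhost:9999) creates a DStream from a TCP socket and counts words in each 5-second micro-batch:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SocketDStreamDemo")  # hypothetical app name
ssc = StreamingContext(sc, batchDuration=5)         # 5-second micro-batches

# DStream of text lines received from a TCP socket (hypothetical host/port)
lines = ssc.socketTextStream("localhost", 9999)

# Each transformation yields a new DStream; pprint is an output operation
word_counts = (lines.flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()

ssc.start()
ssc.awaitTermination()
```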
The Spark Streaming API supports a variety of operations on DStreams. Like RDDs, DStreams support transformations, along with output operations that trigger execution of the stream. Let us dig further into these concepts and use them practically in the demo.
Spark Streaming provides an interface for applying computations to sliding windows of data. Let us look at a scenario to understand this better.
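As a rough sketch (reusing the hypothetical 'ssc' and 'lines' DStream from the earlier socket example), the window transformation recomputes word counts over the last 30 seconds every 10 seconds:

```python
# Assumes 'ssc' and the socket DStream 'lines' from the previous sketch exist
words = lines.flatMap(lambda line: line.split(" "))

# Count words over a 30-second window that slides every 10 seconds
windowed_counts = (words.map(lambda w: (w, 1))
                        .window(windowDuration=30, slideDuration=10)
                        .reduceByKey(lambda a, b: a + b))
windowed_counts.pprint()
```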
Structured Streaming is built on top of the Spark SQL engine. It provides a mechanism to consume live streams in real time and store this data by using the DataFrame and Dataset APIs. In this section, we will learn about this in more detail.
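A minimal Structured Streaming sketch (assuming a hypothetical text producer on localhost:9999) reads the socket stream as an unbounded DataFrame and maintains running word counts:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

# Read lines streamed from a TCP socket (hypothetical host/port)
lines = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split each line into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write the complete set of counts to the console after every trigger
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```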
Let us quickly recap our learning from this lesson.
Test your learning with this assessment.
Spark Streaming Integration with AWS
In the previous lesson, we learned to write Python programs and to create and transform RDDs. In this lesson, let us learn to process streams of data in real time. This section will also teach you to write standalone Python programs that connect to live streams of data, using concepts similar to those learned in Lesson 1. Let’s begin!
Spark Streaming supports integration with external data sources such as AWS Kinesis, Apache Kafka, and Flume. We can also write custom consumers to connect to any other source. In the following video, we will focus on consuming live data from Kinesis and storing it in AWS S3 after aggregation.
Let us learn to integrate AWS Kinesis and Python. We will write Python code to create, list, and delete Kinesis streams and set up a Kinesis stream for further analysis with Spark Streaming.
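A minimal boto3 sketch of those operations (AWS credentials are assumed to be configured locally; the stream name and region are hypothetical):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # hypothetical region

# Create a stream with a single shard (hypothetical stream name)
kinesis.create_stream(StreamName="movie-ratings", ShardCount=1)

# Wait until the stream becomes active, then list the existing streams
kinesis.get_waiter("stream_exists").wait(StreamName="movie-ratings")
print(kinesis.list_streams()["StreamNames"])

# Delete the stream when it is no longer needed
kinesis.delete_stream(StreamName="movie-ratings")
```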
Amazon S3 is a storage system where you can store data within resources called buckets, place objects in a bucket, and perform operations such as write, read, and delete. The maximum size allowed for each object in a bucket is 5 TB. Let us learn about all this and much more in this section.
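The boto3 sketch below (the bucket and object names are hypothetical, and credentials are assumed to be configured locally) shows the basic write, read, and delete operations on a bucket:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # hypothetical region

s3.create_bucket(Bucket="spark-demo-aggregates")  # hypothetical bucket name

# Write, read, and delete an object inside the bucket
s3.put_object(Bucket="spark-demo-aggregates", Key="counts.txt", Body=b"spark,42")
body = s3.get_object(Bucket="spark-demo-aggregates", Key="counts.txt")["Body"].read()
print(body)
s3.delete_object(Bucket="spark-demo-aggregates", Key="counts.txt")
```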
Spark provides functionality to consume data from Kinesis streams. A Kinesis stream shard is processed by one input DStream at a time. Let us learn about live streams of data and check-pointing in this section.
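A rough consumption sketch, assuming the spark-streaming-kinesis-asl package is available on the classpath; the application, stream, and checkpoint names are hypothetical:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext("local[2]", "KinesisConsumerDemo")  # hypothetical app name
ssc = StreamingContext(sc, 10)                        # 10-second micro-batches
ssc.checkpoint("checkpoint-dir")                      # hypothetical checkpoint path

# Each Kinesis shard is processed by one input DStream at a time
stream = KinesisUtils.createStream(
    ssc,
    kinesisAppName="movie-ratings-app",                       # hypothetical
    streamName="movie-ratings",                               # hypothetical
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10)

stream.pprint()
ssc.start()
ssc.awaitTermination()
```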
Let us quickly recap our learning from this lesson.
Test your learning with this assessment.
Spark Streaming, ML, and Windowing Operations
By now, we have learnt the most relevant concepts of Apache Spark. In this section, we will learn to integrate Spark Streaming with the machine learning functionality by implementing a system to recommend movies in real time. So, let’s get started!
MLlib is Spark's machine learning library; it makes it easy to use common algorithms and data processing mechanisms at scale. Let’s learn about this in detail, in this video.
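As a small sketch of the recommendation use case (the ratings.csv file with userId, movieId, and rating columns is hypothetical), MLlib's ALS estimator can be trained on ratings and asked for recommendations:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSDemo").getOrCreate()

# 'ratings.csv' is a hypothetical file with userId, movieId, rating columns
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Train an alternating least squares (ALS) collaborative filtering model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend five movies for every user
model.recommendForAllUsers(5).show(truncate=False)
```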
Spark Streaming provides an interface for applying computations to sliding windows of data. This section contains a scenario where streams of movie ratings are grouped into windows spanning 3 batches, sliding every 2 time units. Let’s learn about this in detail.
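A minimal sketch of that scenario, assuming a hypothetical StreamingContext 'ssc' with a 1-second batch interval and a DStream 'ratings' of (movie_id, rating) pairs:

```python
# Group incoming ratings into 3-second windows that slide every 2 seconds
windowed = ratings.window(windowDuration=3, slideDuration=2)

# Average rating per movie within each window
averages = (windowed.mapValues(lambda r: (r, 1))
                    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                    .mapValues(lambda s: s[0] / s[1]))
averages.pprint()
```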
Let us quickly recap our learning from this lesson.
Test your learning with this assessment.