Big Data Processing with Apache Spark

Efficiently tackle large data sets and big data analysis challenges using Spark and Python
Instructor:
Packt Publishing
5 students enrolled
English [Auto-generated]
Write your own Python programs that can interact with Spark
Implement data stream consumption using Apache Spark
Recognize common operations in Spark to process known data streams
Integrate Spark streaming with Amazon Web Services
Create a collaborative filtering model with Python and the movielens dataset
Apply processed data streams to Spark machine learning APIs

Processing big data in real time is challenging due to scalability, information consistency, and fault tolerance. Big Data Processing with Apache Spark teaches you how to use Spark to make your overall analytical workflow faster and more efficient. You’ll explore all core concepts and tools within the Spark ecosystem, such as Spark Streaming, the Spark Streaming API, the machine learning extension, and structured streaming.

You’ll begin by learning data processing fundamentals using the Resilient Distributed Dataset (RDD), SQL, Dataset, and DataFrame APIs. After grasping these fundamentals, you’ll move on to using the Spark Streaming API to consume data in real time from TCP sockets, and integrate Amazon Web Services (AWS) for stream consumption.

By the end of this course, you’ll not only understand how to use machine learning extensions and structured streams, but you’ll also be able to apply Spark in your own upcoming big data projects.

About the Author

Manuel Ignacio Franco Galeano is a computer scientist from Colombia. He works for Fender Musical Instruments as a lead engineer in Dublin, Ireland. He holds a master’s degree in computer science from University College Dublin (UCD). His areas of interest and research are music information retrieval, data analytics, distributed systems, and blockchain technologies.

Nimish Narang graduated from UBC in 2016 with a degree in biology and computer science. He has developed mobile apps for Android and iOS since 2015. For the past two years he has focused on data analysis and machine learning, and he has previously published Keras and Professional Scala with Packt.

Introduction to Spark Distributed Processing

1
Course Overview

Let us begin our learning journey on Big Data Processing with Apache Spark. In this course, we will learn to efficiently tackle large datasets and perform big data analysis with Spark and Python. The GitHub link for this course is: https://github.com/TrainingByPackt/Big-Data-Processing-with-Apache-Spark-eLearning

2
Installation and Setup

Before we start learning about Apache Spark, let us ensure that we have access to the ecosystem and all of the necessary tools. To begin with, let us download and install Spark and set up our computer environment by installing the following (a quick verification sketch follows this list):

  • Spark 2.4

  • OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit, or Windows 10 64-bit

  • Python 3.0 or above

  • Amazon Web Services (AWS) account
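Once these tools are installed, a quick way to confirm the setup works is to start a local Spark session from Python. This is a minimal sketch, assuming the `pyspark` package is available on your Python path (for example via `pip install pyspark`); the file name is illustrative.

```python
# verify_setup.py - sanity check that Spark and Python are wired up
from pyspark.sql import SparkSession

# Start a local Spark session using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SetupCheck") \
    .getOrCreate()

print("Spark version:", spark.version)

# Run a trivial job to confirm local execution works
print("Sum of 1..100:", spark.sparkContext.parallelize(range(1, 101)).sum())

spark.stop()
```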

3
Lesson Overview

Let us begin the course by learning about data processing fundamentals using the RDD, SQL, Dataset, and DataFrame APIs. In this section, we will learn how to use Spark for data processing and get an introduction to Spark SQL and various Spark DataFrames.

4
Introduction to Spark and Resilient Distributed Datasets

Now that we are familiar with the lesson overview as a whole, let us begin to learn how to use Spark for data processing and get an introduction to Spark SQL and various Spark DataFrames. In this section, we will learn about the programming languages supported by Spark, its components, and its deployment modes.
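As a first taste of the RDD API, the sketch below creates an RDD from a local Python list. It assumes the interactive pyspark shell, where `sc` (a SparkContext) is already defined; the sample values are made up.

```python
# In the pyspark shell, `sc` (a SparkContext) is created for you.
numbers = sc.parallelize([1, 2, 3, 4, 5])   # distribute a local list as an RDD

# RDDs are split into partitions across the cluster (or local cores in local mode)
print(numbers.getNumPartitions())

# Nothing is computed until an action such as collect() is called
print(numbers.collect())   # [1, 2, 3, 4, 5]
```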

5
Operations Supported by the RDD API

RDDs support two types of operations: transformations and actions. Transformations lazily describe new RDDs, while actions trigger computation on the dataset. Let us explore these concepts further and use them practically in the demo.
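The distinction is easiest to see in code. This minimal sketch (values are illustrative) uses `filter` and `map` as transformations and `count` and `collect` as actions, again assuming the pyspark shell's `sc`.

```python
numbers = sc.parallelize(range(10))

# Transformations: lazy, they only describe new RDDs
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions: trigger the actual computation
print(squares.count())     # 5
print(squares.collect())   # [0, 4, 16, 36, 64]
```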

6
Map Reduce Operations

Continuing from the previous section, let us dive deeper into map and reduce operations and how they are executed.
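The classic map/reduce example is a word count. This sketch assumes a hypothetical local text file called `sample.txt` and the pyspark shell's `sc`; any plain-text file will do.

```python
# Hypothetical input file; replace with any local text file
lines = sc.textFile("sample.txt")

word_counts = (lines
    .flatMap(lambda line: line.split())    # map phase: emit individual words
    .map(lambda word: (word, 1))           # key each word with a count of 1
    .reduceByKey(lambda a, b: a + b))      # reduce phase: sum counts per word

for word, count in word_counts.take(10):
    print(word, count)
```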

7
Self-Contained Python Spark Programs

The interactive Python interface is a great tool for simple computations, but its usefulness becomes limited as computing operations grow in complexity. In this section, let us learn to write Python programs that can interact with a Spark cluster outside of the interactive console. Later in this section, we will also learn about functional programming.
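A self-contained program creates its own SparkSession instead of relying on the one the shell provides, and is typically launched with `spark-submit`. A minimal sketch follows; the file name and computation are illustrative.

```python
# standalone_app.py - run with: spark-submit standalone_app.py
from pyspark.sql import SparkSession

def main():
    # Build the session explicitly; the shell normally does this for us
    spark = SparkSession.builder.appName("StandaloneExample").getOrCreate()
    sc = spark.sparkContext

    # The same RDD operations used in the shell work inside a program
    total = sc.parallelize(range(1, 1001)).map(lambda x: x * 2).sum()
    print("Result:", total)

    spark.stop()

if __name__ == "__main__":
    main()
```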

8
Nested Functions and Standalone Python Programs

Nested functions are functions defined inside other functions. The most important property of this paradigm is that the outer scope cannot see what is happening in the inner function, while the inner scope can still access variables in the outer scope. Let us look at an example of a function using this syntax. Later in this section, we will also learn about standalone Python programs.
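Here is a short sketch of the nested-function pattern; the names are illustrative, and the final line assumes the pyspark shell's `sc`. The inner function reads a variable from the enclosing scope, which is convenient when building functions to pass to Spark transformations.

```python
def make_multiplier(factor):
    # `scale` is nested inside `make_multiplier`; it can read `factor`
    # from the outer scope, but the outer scope cannot see inside `scale`.
    def scale(value):
        return value * factor
    return scale

double = make_multiplier(2)
print(double(21))  # 42

# The returned function can be passed to an RDD transformation
print(sc.parallelize([1, 2, 3]).map(double).collect())  # [2, 4, 6]
```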

9
Introduction to SQL, Datasets, and DataFrames

Before we see how each of these works with Spark, let us first understand what they mean. A Dataset is a distributed collection that provides additional metadata about the structure of the data being stored. A DataFrame is a Dataset that organizes information into named columns. DataFrames can be built from different sources, such as JSON, XML, and databases. In this section, let us cover each of them in detail; later in the course, we will also work with the MovieLens dataset.
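To make this concrete, the sketch below builds a DataFrame from an in-memory list (a JSON or database source would work the same way), registers it as a temporary view, and queries it with Spark SQL. The column names and values are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameIntro").getOrCreate()

# A DataFrame organizes data into named columns
ratings = spark.createDataFrame(
    [(1, "Toy Story", 4.0), (2, "Jumanji", 3.5), (3, "Heat", 4.5)],
    ["movie_id", "title", "rating"])

ratings.printSchema()

# The same data can be queried through SQL
ratings.createOrReplaceTempView("ratings")
spark.sql("SELECT title, rating FROM ratings WHERE rating >= 4.0").show()
```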

10
Lesson Summary

Let us quickly recap our learning from this lesson.     

11
Test Your Knowledge

Test your learning with this assessment.

Introduction to Spark Streaming

1
Lesson Overview

In the previous lesson, we learned to write Python programs to execute parallel operations inside a Spark cluster, and we created and transformed RDDs. In this section, let us learn to process streams of data in real time and write Python standalone programs. So, let’s get started!   

2
Introduction to Streaming Architectures

Consuming live streams of data is a challenging endeavor, one reason being the volume of incoming data. Variability in the flow of information may lead to situations where very fast producers overwhelm consumers. Let’s learn about the concepts that help us find the right balance between reads and writes.

3
Introduction to Discretized Streams (Dstreams)

A DStream (discretized stream) is an abstraction that represents a continuous stream of data. This abstraction provides functionality for consistency and fault recovery. Let us learn about DStreams in detail in this section.
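A minimal DStream sketch is shown below, assuming a text server is listening on localhost port 9999 (for example `nc -lk 9999`); the host, port, and batch interval are illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# At least two local threads: one receiver plus one for processing
sc = SparkContext("local[2]", "DStreamExample")
ssc = StreamingContext(sc, batchDuration=2)   # 2-second micro-batches

# Each batch of lines received on the socket becomes an RDD in the DStream
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()

ssc.start()
ssc.awaitTermination()
```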

4
Operations Supported by the Spark Streaming API

The Spark Streaming API supports a variety of operations. Like RDDs, DStreams support two types of operations: transformations and actions, which we will use to process the incoming data. Let us explore these concepts further and use them practically in the demo.

5
Windowing Operations

Spark Streaming provides an interface for applying computations to sliding windows of data. Let us look at a scenario to understand this better.
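As a sketch, the windowed word count below extends the socket example: counts are computed over a 30-second window that slides every 10 seconds (window and slide durations must be multiples of the batch interval; all values here are illustrative).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowExample")
ssc = StreamingContext(sc, 5)          # 5-second batches
ssc.checkpoint("checkpoint")           # required when using an inverse reduce function

words = ssc.socketTextStream("localhost", 9999).flatMap(lambda l: l.split())
pairs = words.map(lambda w: (w, 1))

# Count words over the last 30 seconds, recomputed every 10 seconds
windowed = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,      # add values entering the window
    lambda a, b: a - b,      # subtract values leaving the window
    windowDuration=30,
    slideDuration=10)
windowed.pprint()

ssc.start()
ssc.awaitTermination()
```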

6
Structured Streaming

Structured Streaming is built on top of the Spark SQL engine. It provides a mechanism to consume live streams in real time and store this data by using the DataFrame and Dataset APIs. In this section, we will learn about this in more detail.   
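A minimal Structured Streaming sketch is shown below: it reads lines from a socket as an unbounded DataFrame, maintains a running word count, and writes results to the console. The host and port are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Unbounded DataFrame of lines arriving on the socket
lines = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
    .outputMode("complete")     # emit the full updated result table each trigger
    .format("console")
    .start())
query.awaitTermination()
```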

7
Lesson Summary

Let us quickly recap our learning from this lesson.     

8
Test Your Knowledge

Test your learning with this assessment.   

Spark Streaming Integration with AWS

1
Lesson Overview

In the previous lesson, we learned to write Python programs and created and transformed RDDs. In this lesson, let us learn to process streams of data in real time. This section will also teach you to write standalone Python programs that connect to live streams of data, using concepts similar to those learned in Lesson 1. Let’s begin!

2
Spark Integration with AWS Services

Spark Streaming supports integration with external data sources such as AWS Kinesis, Apache Kafka, and Flume. We can also write custom consumers to connect to any other source. In the following video, we will focus on consuming live data from Kinesis and storing it, after aggregation, in AWS S3.

3
Integrating AWS Kinesis and Python

Let us learn to integrate AWS Kinesis and Python. We will write Python code to create, list, and delete Kinesis streams and set up a Kinesis stream for further analysis with Spark Streaming.   
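The sketch below shows the general shape of those boto3 calls. The stream name and region are assumptions, AWS credentials are expected to be configured already (for example via `aws configure`), and creating streams may incur AWS charges.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is illustrative

# Create a stream with a single shard (name is illustrative)
kinesis.create_stream(StreamName="movie-ratings", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="movie-ratings")

# List existing streams
print(kinesis.list_streams()["StreamNames"])

# Put a sample record onto the stream
kinesis.put_record(StreamName="movie-ratings",
                   Data=b'{"user": 1, "movie": 2, "rating": 4.5}',
                   PartitionKey="1")

# Delete the stream when finished
kinesis.delete_stream(StreamName="movie-ratings")
```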

4
AWS S3 Basic Functionality

Amazon S3 is a storage system in which you store data within resources called buckets: you can place objects in a bucket and perform operations on them such as write, read, and delete. The maximum size allowed for a single object in a bucket is 5 TB. Let us learn about all this and much more in this section.
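These basic bucket operations map directly onto boto3 calls. The bucket and key names below are illustrative (bucket names must be globally unique), and in regions other than us-east-1 `create_bucket` also needs a `CreateBucketConfiguration` argument.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-spark-course-bucket-example"   # illustrative; must be globally unique

s3.create_bucket(Bucket=bucket)

# Write, read, and delete an object
s3.put_object(Bucket=bucket, Key="results/summary.txt", Body=b"hello from spark")
body = s3.get_object(Bucket=bucket, Key="results/summary.txt")["Body"].read()
print(body)

s3.delete_object(Bucket=bucket, Key="results/summary.txt")
s3.delete_bucket(Bucket=bucket)
```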

5
Kinesis Streams and Spark Streams

Spark provides functionality to consume data from Kinesis streams. A Kinesis stream shard is processed by one input DStream at a time. Let us learn about live streams of data and check-pointing in this section.   
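A sketch of the Kinesis consumer is shown below. It assumes the `spark-streaming-kinesis-asl` package is supplied to `spark-submit`, and the application name, stream name, endpoint, and region are all illustrative; the application name is used by the connector for checkpointing.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext("local[2]", "KinesisExample")
ssc = StreamingContext(sc, 10)   # 10-second batches

# One receiver consuming the Kinesis stream; each shard is processed by one DStream
records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="movie-ratings-app",
    streamName="movie-ratings",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10)

records.pprint()
ssc.start()
ssc.awaitTermination()
```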

6
Lesson Summary

Let us quickly recap our learning from this lesson.   

7
Test Your Knowledge

Test your learning with this assessment.   

Spark Streaming, ML, and Windowing Operations

1
Lesson Overview

By now, we have learnt the most relevant concepts of Apache Spark. In this section, we will learn to integrate Spark Streaming with the machine learning functionality by implementing a system to recommend movies in real time. So, let’s get started!   

2
Spark Integration with Machine Learning

MLlib is Spark's machine learning library, and it provides functionality that allows for the usage of common algorithms and data processing mechanisms at scale in an easy way. Let’s learn about this in detail, in this video.   
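The course builds a movie recommender with collaborative filtering; below is a minimal sketch using the DataFrame-based ALS estimator, with a few made-up ratings standing in for the MovieLens data, and the parameter values chosen only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSExample").getOrCreate()

# Tiny made-up ratings; in the course this comes from the MovieLens dataset
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 4.5)],
    ["userId", "movieId", "rating"])

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 3 movies for every user
model.recommendForAllUsers(3).show(truncate=False)
```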

3
Spark Streaming Windowing Operations

Spark Streaming provides an interface for applying computations to sliding windows of data. This section contains a scenario where streams of movie ratings are grouped into windows of 3 batches that slide every 2 time units. Let’s learn about this in detail.

4
Lesson Summary

Let us quickly recap our learning from this lesson.     

5
Test Your Knowledge

Test your learning with this assessment.   

4 out of 5
1 rating

Detailed Rating

5 stars: 0
4 stars: 1
3 stars: 0
2 stars: 0
1 star: 0
30-Day Money-Back Guarantee

Includes

3 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion