Today’s world witnesses a massive amount of data being generated every day, everywhere. As a result, many organizations are focusing on Big Data processing to handle large amounts of data in real time with maximum efficiency. This has led to Apache Spark rapidly gaining popularity in the Big Data market. If you want to get the most out of this trending Big Data framework for all your data processing needs, then go for this Learning Path.
This comprehensive 2-in-1 course focuses on performing data streaming and data analytics with Apache Spark. You will learn to load data from a variety of structured sources such as JSON, Hive, and Parquet using Spark SQL and schema RDDs. You will also build streaming applications and learn best practices for managing high-velocity streaming and external data sources. Next, you will explore the Spark machine learning libraries and GraphX, where you will perform graph processing and analysis. Finally, you will learn about the DataFrame implementation for performing distributed operations on datasets using SparkR.
This training program includes 2 complete courses, carefully chosen to give you the most comprehensive training possible.
The first course, Spark Analytics for Real-Time Data Processing, starts off by explaining Spark SQL. You will learn how to use the Spark SQL API and built-in functions with Apache Spark. You will also go through some interactive analysis and look at some integrations between Spark and Java/Scala/Python. Next, you will explore Spark Streaming, StreamingContext, and DStreams. You will learn how Spark Streaming works on top of the Spark core, thus inheriting its features. Finally, you will stream data and learn best practices for managing high-velocity streaming and external data sources.
In the second course, Advanced Analytics and Real-Time Data Processing in Apache Spark, you will leverage the features of various components of the Spark framework to efficiently process, analyze, and visualize your data. You will then learn how to implement high-velocity streaming operations for data processing in order to perform efficient analytics on your real-time data. You will also analyze data using machine learning techniques and graphs. Next, you will learn to solve problems using machine learning techniques and find out about all the tools available in the MLlib toolkit. Finally, you will see some useful machine learning algorithms with the help of Spark MLlib and will integrate Spark with R.
By the end of this Learning Path, you will be able to use Apache Spark to process large amounts of data in real time.
Meet Your Expert(s):
We have combined the best works of the following esteemed authors to ensure that your learning journey is smooth:
Nishant Garg has over 17 years of software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, Shark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as Greenplum). He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a technical architect for the Big Data R&D Group at Impetus Infotech Pvt. Ltd. Previously, Nishant has enjoyed working with some of the most recognizable names in the IT services and financial industries, employing full software life cycle methodologies such as Agile and Scrum. Nishant has also undertaken many speaking engagements on Big Data technologies and is the author of Apache Kafka and HBase Essentials, both published by Packt.
Tomasz Lelek is a Software Engineer and Co-Founder of InitLearn. He mostly programs in Java and Scala, and dedicates his time and effort to getting better at everything. He is currently diving into Big Data technologies. Tomasz is very passionate about everything associated with software development. He has been a speaker at a few conferences in Poland, including Confitura and JDD, and at the Krakow Scala User Group, and he has conducted a live coding session at the GeeCON conference. He was also a speaker at an international event in Dhaka. He is very enthusiastic and loves to share his knowledge.
Spark Analytics for Real-Time Data Processing
This video gives an overview of the entire course.
This video provides a complete introduction to Spark SQL, discusses the types of applications where Spark SQL is useful, and, in the end, explains the performance of Spark SQL.
At first it talks about the Spark SQL introduction
Next it explains the types of applications where Spark SQL is useful
The final step explains the performance of Spark SQL
This video explains the Spark SQL core abstractions used by the programming interfaces. These core abstractions are SQLContext, HiveContext, SparkSession, Dataset, and DataFrame.
At first it talks about the SQLContext for Spark 1.6 and 2.0
Next it explains the HiveContext for Spark 1.6 and 2.0
The final step explains the concepts of Dataset and DataFrame
This video explains the creation of DataFrames from resilient distributed datasets (RDDs) and also runs some code examples; see the sketch after this list.
At first it talks about creating DataFrames from resilient distributed datasets
Next it executes the code sample for creating a DataFrame
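A minimal sketch of this conversion, assuming Spark 2.x; the Person case class and its rows are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object RddToDataFrame {
  case class Person(name: String, age: Int) // hypothetical schema

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDataFrame").master("local[*]").getOrCreate()
    import spark.implicits._ // brings the toDF() conversion into scope

    // Build an RDD of case-class instances...
    val peopleRdd = spark.sparkContext.parallelize(
      Seq(Person("Alice", 29), Person("Bob", 41)))

    // ...and convert it to a DataFrame with an inferred schema
    val peopleDf = peopleRdd.toDF()
    peopleDf.printSchema()
    peopleDf.show()
  }
}
```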
This video explains the creation of DataFrames from different types of files and also runs some code examples; see the sketch after this list.
At first it talks about creating DataFrames from CSV files with a demonstration
Next it talks about creating DataFrames from JSON files with a demonstration
Finally, it talks about creating DataFrames from Parquet and ORC files
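A sketch of the four readers, assuming Spark 2.x and hypothetical file paths; `spark` is a SparkSession as in the previous example:

```scala
// CSV: treat the first line as a header and infer column types
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/people.csv")

// JSON: one JSON object per line by default
val jsonDf = spark.read.json("data/people.json")

// Parquet and ORC are columnar formats that carry their own schema
val parquetDf = spark.read.parquet("data/people.parquet")
val orcDf     = spark.read.orc("data/people.orc")
```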
This video explains the ways of creating DataFrames from different data sources. It also talks about storing DataFrames and runs some code examples; see the sketch after this list.
At first it talks about creating DataFrames from the Hive data source
Next it explains creating DataFrames from a JDBC data source
The final step explains storing the data within JSON/ORC files and using Hive and JDBC
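A hedged sketch of the Hive and JDBC paths; the connection details are placeholders, the JDBC driver must be on the classpath, and Hive access assumes the session was built with enableHiveSupport():

```scala
// Reading over JDBC (URL, table, and credentials are placeholders)
val ordersDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/shop")
  .option("dbtable", "public.orders")
  .option("user", "spark")
  .option("password", "secret")
  .load()

// Reading a Hive table (requires enableHiveSupport() on the builder)
val hiveDf = spark.sql("SELECT * FROM sales.transactions")

// Storing DataFrames as JSON and ORC files
ordersDf.write.mode("overwrite").json("out/orders-json")
ordersDf.write.mode("overwrite").orc("out/orders-orc")
```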
This video explains the DataFrame API for common operations such as columns, dtypes, explain, printSchema, registerTempTable, and so on, with a demonstration.
At first it talks about the common operations – columns and dtypes
Next it explains the common operations – explain and printSchema
The final step talks about the common operation – registerTempTable
This video explains the DataFrame API for query operations such as aggregation, sampling, filter, groupBy, join, intersect, orderBy, sort, and so on, with a demonstration; see the sketch after this list.
At first it talks about the query operations for aggregation, sampling, and filter
Next it explains the query operations for groupBy, join, and intersect
The final step talks about the query operations for orderBy, sort, and distinct
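A sketch of these query operations, assuming two DataFrames `df` and `otherDf`; the column names are invented for illustration:

```scala
import org.apache.spark.sql.functions._

val grouped  = df.groupBy("department").agg(avg("salary"), max("age"))
val sampled  = df.sample(withReplacement = false, fraction = 0.1)
val filtered = df.filter(col("age") > 30)
val joined   = df.join(otherDf, Seq("id"), "inner")
val common   = df.intersect(otherDf) // requires identical schemas
val ordered  = df.orderBy(col("salary").desc).distinct()
```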
This video explains the DataFrame API for actions such as limit, select, withColumn, selectExpr, count, describe, collect, and so on, with a demonstration.
At first it talks about the limit, select, and withColumn actions
Next it explains the selectExpr, count, and describe actions
The final step talks about the collect, show, and take actions
This video explains the DataFrame API's built-in functions for collections, date, time, math, and string that Spark SQL provides, optimized for fast execution; see the sketch after this list.
At first it talks about the built-in functions for collections
Next it explains the built-in functions for date and time
The final step talks about the built-in functions for math and string
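A few of those built-in functions in one select, assuming a DataFrame `df` with the columns used below:

```scala
import org.apache.spark.sql.functions._

df.select(
  size(col("tags")),                          // collections: element count
  array_contains(col("tags"), "spark"),       // collections: membership
  current_date(), year(col("created_at")),    // date and time
  round(col("price"), 2), sqrt(col("area")),  // math
  upper(col("name")),                         // string
  concat_ws(" ", col("first"), col("last"))   // string
).show()
```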
This video provides a complete introduction to Spark Streaming and DStreams, and covers the support for different data sources.
At first it talks about why Spark Streaming is needed
Next it explains how DStreams differ from RDDs
The final step explains the different data sources supported by Spark Streaming
This video explains the complete code for the word count program, along with the steps for executing it, as a very first example; see the sketch after this list.
At first it walks through the word count program code
Next it explains the steps to run the word count program
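The classic socket word count, roughly as it appears in the Spark docs; it assumes text arriving on localhost:9999 (for example via `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val pairs  = lines.flatMap(_.split(" ")).map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```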
This video explains the complete architecture of Spark Streaming, the concept of DStreams with an example, and streaming execution in Spark.
At first it walks through the Spark Streaming architecture in detail
Next it explains the concept of DStreams with example
The final step explains the Spark Streaming execution details
This video explains the different types of transformations available in Spark Streaming, namely stateless transformations and stateful transformations; see the sketch after this list.
At first it talks about the stateless transformations such as map(), filter(), groupByKey(), and so on
Next it explains the windowed operations under stateful transformation
The final step explains the updateStateByKey() stateful transformation
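A sketch of updateStateByKey, reusing `ssc` and the `pairs` DStream from the word-count example; stateful transformations also require a checkpoint directory:

```scala
ssc.checkpoint("checkpoint/") // required for stateful transformations

// Add this batch's counts to the running total per word
val updateTotal = (newCounts: Seq[Int], running: Option[Int]) =>
  Some(newCounts.sum + running.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateTotal)
runningCounts.print()
```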
This video explains the different type of input sources available for Spark Streaming such as Sockets, Files, Kafka, Flume, and so on. It also explains the different available output operations.
At first it talks about the core input sources such as sockets, files, and Akka actors
Next it explains other input sources such as Flume and Kafka
The final step explains the output operations such as saveAsTextFiles(), saveAsHadoopFiles(), and so on
This video briefly explains the performance considerations for Spark Streaming, such as batch size, parallelism, garbage collection, and memory usage.
At first it talks about the tuning batch size for Spark Streaming
Next it explains usage of parallelism for Spark Streaming performance
The final step explains garbage collection and memory usage for Streaming applications
The aim of this video is to explain the best practices for handling high-velocity streams, such as using parallelism, scheduling, setting the right configuration for memory usage, and a few other tips.
First it explains the parallelism-based best practices
Next it explains the scheduling-based best practices
Finally, it explains the memory-related configuration and a few other tips
The aim of this video is to explain the best practices for external data sources such as Flume, Kafka, sockets, and message queue protocols.
First it explains Flume in the context of streaming
Next it explains Kafka in the context of streaming
Finally, it explains the usage of sockets and message queue protocols
The aim of this video is to explain design patterns that can be used to maintain global state and to use the foreachRDD output action in Spark Streaming; see the sketch after this list.
First it explains patterns for maintaining global state within a streaming application
Next it explains patterns for handling connections within the foreachRDD action
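The connection-handling pattern from the Spark Streaming documentation, sketched below; `ConnectionPool` is a placeholder for your own pooling code, not a Spark API:

```scala
// Create connections per partition on the executors, never on the driver:
// connection objects rarely serialize across machines
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection() // placeholder pool
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // reuse across batches
  }
}
```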
Advanced Analytics and Real-Time Data Processing in Apache Spark
This video gives an overview of the entire course.
The aim of this video is to delve into Spark Streaming architecture.
Understand micro batches
Compare latency versus throughput
Learn about failure recovery and checkpointing
The aim of this video is to look into the StreamingContext of a Spark Streaming application.
Create a Spark Streaming application
Create the base for streaming processing
The goal of this video is to look into processing streaming data and understand how streaming processing differs from processing batch data.
Find out what unbounded data is
Find out how stream processing is different from batch processing
Process each event really fast
The goal of this video is to learn about use cases in Spark Streaming applications and know when to use it.
Find out why to use streaming and its pros
Learn stream use cases
The aim of this video is to look into the Spark Streaming word count problem and solve it using the Spark Streaming API.
Create the Spark Streaming word count
Test the streaming job
Learn how to write the processing logic
The goal of this video is to understand what the master URL in Spark Streaming context is.
Understand Spark architecture
Use master URL for submitting jobs
Find out what YARN is
A streaming architecture needs to have a data source, and quite often Apache Kafka serves as an event queue, which is a great data source for events. The goal of this video is to integrate Spark Streaming with Apache Kafka; see the sketch after this list.
Understand what Apache Kafka is
Use Apache Kafka as a data source for the Spark Streaming job
Learn about writing a DStream provider
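A direct-stream sketch using the spark-streaming-kafka-0-10 connector; the broker address, group id, and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-group",
  "auto.offset.reset"  -> "latest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.map(record => (record.key, record.value)).print()
```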
The aim of this video is to implement stateful streaming processing that saves some data to a Cassandra database and retrieves it, so Cassandra can be used as a durable state store; see the sketch after this list.
Implement stateful stream processing
Use Cassandra as state store
Use Spark Streaming mapWithState to implement stateful processing
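A mapWithState sketch for a running word count, again reusing `pairs`; persisting the state to Cassandra, as the video does, is left out here:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// (word, count-in-batch, running state) => emitted record
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum) // keep the new total for the next batch
  (word, sum)
}

ssc.checkpoint("checkpoint/") // mapWithState also needs checkpointing
val stateDstream = pairs.mapWithState(StateSpec.function(mappingFunc))
stateDstream.print()
```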
The aim of this video is to learn about the transform and window operations in Spark Streaming; see the sketch after this list.
Learn about transformations on the DStream
Learn to window events using the DStream API
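Both operations in brief, on the `lines` and `pairs` DStreams from earlier; window and slide durations must be multiples of the batch interval:

```scala
import org.apache.spark.streaming.Seconds

// window: 30-second windows of counts, recomputed every 10 seconds
val windowedCounts =
  pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

// transform: apply an arbitrary RDD-to-RDD function to every batch
val nonEmpty = lines.transform(rdd => rdd.filter(_.nonEmpty))
```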
In Spark, we often want to join data from multiple streams and then apply some processing to it. In this video, we will try to join two sources; see the sketch after this list.
Join two streams
Test the joining
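A sketch that joins two keyed streams; the two socket sources and the "key,value" line format are assumptions:

```scala
// Two pair DStreams keyed by the same id, e.g. (userId, event)
val clicks = ssc.socketTextStream("localhost", 9998)
  .map(_.split(",")).map(a => (a(0), a(1)))
val purchases = ssc.socketTextStream("localhost", 9999)
  .map(_.split(",")).map(a => (a(0), a(1)))

// join matches keys within each micro-batch: (key, (click, purchase))
val joined = clicks.join(purchases)
joined.print()
```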
In this video, we will learn about output operations and how to save results to a Kafka sink; see the sketch after this list.
Understand what a sink is
Define a sink for a DStream
Save results from Spark Streaming job to Kafka
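A hedged sketch of a Kafka sink; `results` is an assumed DStream of (String, String) pairs, and the broker and topic names are placeholders. KafkaProducer is not serializable, hence one producer per partition:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

results.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Create the producer on the executor, once per partition
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    partition.foreach { case (key, value) =>
      producer.send(new ProducerRecord[String, String]("output-topic", key, value))
    }
    producer.close()
  }
}
```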
In this video, we will get to know what event time is.
What processing time is
What ingestion time is
How to handle each of them
Data sources typically provide an at-least-once delivery guarantee. In this video, we will connect external systems.
Implement deduplication logic
In this video, we will get to know how to handle out-of-order events.
How to verify the order of events
Implement sorting in a stream of events
In this video, we will implement streaming processing that filters out bots.
Use the deduplication that we implemented to make streaming processing robust
Use the order verification that we implemented to make streaming processing robust
In this video, we will create a project using Spark MLlib.
What we will want to achieve
Analyze the input data
Prepare the input data to make it ready for ML models
In this video, we will see how to represent text as a vector; see the sketch after this list.
Transform text into a vector of numbers
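One simple way to do this is term-frequency hashing with the spark.ml pipeline API; the `posts` DataFrame with a `text` column is an assumption:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split raw text into words, then hash the words into a fixed-size
// term-frequency vector
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(posts)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("features").setNumFeatures(10000)
hashingTF.transform(words).select("features").show(truncate = false)
```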
In this video, we will see an algorithm for transforming text into a vector of numbers; see the sketch after this list.
Understand Bag-of-Words
Learn Word2Vec
Learn Skip-Gram
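A minimal Word2Vec example close to the one in the Spark ML docs; the two toy documents are invented:

```scala
import org.apache.spark.ml.feature.Word2Vec

val docs = spark.createDataFrame(Seq(
  "spark streaming processes micro batches".split(" "),
  "kafka is a distributed event log".split(" ")
).map(Tuple1.apply)).toDF("words")

// Learn a 50-dimensional vector per word; each document is represented
// by the average of its word vectors
val word2Vec = new Word2Vec()
  .setInputCol("words").setOutputCol("vector")
  .setVectorSize(50).setMinCount(0)

val model = word2Vec.fit(docs)
model.transform(docs).show(truncate = false)
```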
In this video, we will learn what supervised and unsupervised ML are; see the sketch after this list.
What logistic regression is
Explain logistic regression with a simple example
Implement logistic regression model in Apache Spark
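A minimal spark.ml logistic regression, assuming a `training` DataFrame with `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(10)    // iterations of the optimizer
  .setRegParam(0.01) // regularization strength

val model = lr.fit(training)
model.transform(training)
  .select("label", "prediction", "probability")
  .show(5)
```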
This video explains what cross-validation is; see the sketch after this list.
How to split training and test data in a proper way
Implement cross-validation in Apache Spark
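A 3-fold cross-validation sketch over a small regularization grid, reusing the assumed `training` DataFrame:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3) // each fold trains on 2/3 and validates on 1/3

val cvModel = cv.fit(training) // picks the best params by average metric
```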
This video explains what clustering is; see the sketch after this list.
Learn about the Gaussian mixture model (GMM)
Cluster data using post timestamps
How to use GMM in a proper way
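A GaussianMixture sketch with spark.ml; `dataset` is an assumed DataFrame with a `features` vector column (for example, post timestamps):

```scala
import org.apache.spark.ml.clustering.GaussianMixture

val gmm = new GaussianMixture().setK(3) // three Gaussian components
val model = gmm.fit(dataset)

// Each cluster is a Gaussian with a weight, mean, and covariance
for (i <- 0 until model.getK) {
  println(s"weight=${model.weights(i)} gaussian=${model.gaussians(i)}")
}
```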
In this video, we will prepare data for clustering.
Use GMM to cluster posts by the time of a post
Implement the logic in Apache Spark
In this video, we will see what the Singular Value Decomposition (SVD) is; see the sketch after this list.
When we can use it
How to implement it in Spark using MLlib
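SVD lives in the RDD-based MLlib API; a toy sketch with an invented 3x3 matrix, where `sc` is the SparkContext:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(2.0, 5.0, 1.0),
  Vectors.dense(4.0, 1.0, 2.0)))

// Keep the top 2 singular values; U, s, and V factor the matrix
val svd = new RowMatrix(rows).computeSVD(2, computeU = true)
println(svd.s) // the singular values, largest first
```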
In this video, we will look at the movie data source that will be used to train the model; see the sketch after this list.
Build collaborative filtering in Apache Spark
Use the Alternating Least Squares (ALS) algorithm
Recommend movies for a given user
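An ALS sketch with spark.ml, assuming Spark 2.2+ and a `ratings` DataFrame with userId, movieId, and rating columns:

```scala
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10).setMaxIter(10).setRegParam(0.1)
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

val model = als.fit(ratings)

// Top 5 movie recommendations per user
model.recommendForAllUsers(5).show(truncate = false)
```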
In this video, we will see what a graph is.
What an edge is
What a vertex is
This video explains Spark GraphX.
See the pros
Differentiate between graph-parallel versus data-parallel
In this video, we will create a Spark project; see the sketch after this list.
Import the GraphX library into Spark
Explain sbt
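A minimal build.sbt for such a project; the version numbers are illustrative and should match your cluster:

```scala
name := "graphx-examples"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.3.0",
  "org.apache.spark" %% "spark-graphx" % "2.3.0"
)
```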
In this video, we will see what a property graph is; see the sketch after this list.
Use GraphX API to create a graph
Define edges
Define vertices
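A tiny property graph, with invented vertices and relation labels; `sc` is the SparkContext:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a name; edges carry a relation label
val vertices = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "likes")))

val graph = Graph(vertices, edges)
println(s"${graph.vertices.count()} vertices, ${graph.edges.count()} edges")
```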
In this video, we will get to know the Graph API.
Look at and investigate its operations
In this video, we will use the Graph API to experiment with a graph.
Explain operations on edges
Explain operations on vertices
In this video, we will use the triplets API; see the sketch after this list.
Aggregate triplets to extract facts from a graph
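Reusing the small graph above, each triplet bundles the source vertex, the edge, and the destination vertex:

```scala
graph.triplets
  .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
  .collect()
  .foreach(println) // e.g. "Alice follows Bob"
```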
In this video, we will create a subgraph of a graph; see the sketch after this list.
Define properties of a subgraph
Extract subgraph
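A subgraph sketch on the same graph; subgraph takes an edge predicate and a vertex predicate:

```scala
// Keep only "follows" edges between vertices other than Carol
val followsOnly = graph.subgraph(
  epred = t => t.attr == "follows",
  vpred = (id, name) => name != "Carol")
```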
In this video, we will calculate the average of a neighbourhood; see the sketch after this list.
Use neighbourhood aggregations from GraphX API
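A neighbourhood-aggregation sketch in the style of the GraphX docs, on an invented graph whose vertex attribute is an age:

```scala
import org.apache.spark.graphx.{Edge, Graph}

val ageGraph = Graph(
  sc.parallelize(Seq((1L, 34.0), (2L, 28.0), (3L, 52.0))),
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1))))

// Each edge sends (1, sourceAge) to its destination; merge sums both parts
val sums = ageGraph.aggregateMessages[(Int, Double)](
  ctx => ctx.sendToDst((1, ctx.srcAttr)),
  (a, b) => (a._1 + b._1, a._2 + b._2))

// Divide to get the average age of in-neighbours per vertex
val averages = sums.mapValues((v: (Int, Double)) => v._2 / v._1)
averages.collect().foreach(println)
```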
In this video, we will count degrees of vertices.
Count the in-degree of a vertex
Count the out-degree of a vertex
In this video, we will optimize a graph by using caching.
Test code without caching
Test code with caching
In this video, we will define a graph structure in a file.
Load it into Spark using GraphLoader
In this video, we will take an RDD of all edges.
Take an RDD of all vertices
Perform operations using RDD API
In this video, we will see what connected components are; see the sketch after this list.
Implement in Spark GraphX
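On the small graph from earlier, connectedComponents labels each vertex with the smallest vertex id in its component:

```scala
val components = graph.connectedComponents().vertices
components.collect().foreach { case (id, component) =>
  println(s"vertex $id belongs to component $component")
}
```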
In this video, we will see what the R language is.
What SparkR is
How to use SparkR
What are the pros of SparkR
In this video, we will install RStudio for SparkR.
Use RStudio with SparkR
In this video, we will create a Spark DataFrame in SparkR.
Use SparkR console
In this video, we will use SparkR grouping.
Use SparkR aggregation
This video uses dapply.
Use dapplyCollect
In this video, we will use gapply.
Use gapplyCollect
In this video, we will use distributed functions.
Use spark.lapply method
In this video, we will use DataFrame SQL API.
Use SQL from SparkR
In this video, we will get to know about PageRank; see the sketch after this list.
Look at the input data
Calculate PageRank in Spark GraphX
Explain PageRank using Spark GraphX
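A PageRank sketch on the same small graph, run until convergence at the given tolerance; joining the ranks back to the vertices makes the output readable:

```scala
val ranks = graph.pageRank(0.0001).vertices

ranks.join(graph.vertices)
  .map { case (_, (rank, name)) => (name, rank) }
  .collect()
  .foreach(println) // prints (name, rank) pairs
```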
In this video, we will create the abandoned-cart logic.
Implement Streaming logic