Tuning Apache Spark: Powerful Big Data Processing Recipes
Video Learning Path Overview
A Learning Path is a specially tailored course that brings together two or more different topics that lead you to achieve an end goal. Much thought goes into the selection of the assets for a Learning Path, and this is done through a complete understanding of the requirements to achieve a goal.
Today, organizations have a difficult time working with large datasets. In addition, big data processing and analyzing need to be done in real time to gain valuable insights quickly. This is where data streaming and Spark come in.
In this well-thought-out Learning Path, you will not only learn how to work with Spark to solve the problem of analyzing massive amounts of data for your organization, but you'll also learn how to tune it for performance. Beginning with a step-by-step approach, you'll get comfortable using Spark and will learn how to implement practical, proven techniques to improve particular aspects of programming and administration in Apache Spark. You'll be able to perform tasks and get the best out of your databases much faster.
Moving further and accelerating the pace a bit, you'll learn some of the lesser-known techniques to squeeze the best out of Spark, and then you'll learn to overcome several problems you might come across when working with Spark, without having to break a sweat. The simple, practical solutions provided will get you back in action in no time at all!
By the end of the course, you will be well-versed in using Spark in your day-to-day projects.
From blueprint architecture to complete code solution, this course covers every important aspect involved in architecting and developing a data streaming pipeline.
Test Spark jobs using unit, integration, and end-to-end techniques to make your data pipeline robust and bulletproof.
Solve several painful issues like slow-running jobs that affect the performance of your application.
Anghel Leonard is currently a Java chief architect. He is a member of the Java EE Guardians with 20+ years’ experience. He has spent most of his career architecting distributed systems. He is also the author of several books, a speaker, and a big fan of working with data.
Tomasz Lelek is a software engineer, programming mostly in Java and Scala. He has been working with the Spark and ML APIs for the past 5 years, with production experience in processing petabytes of data. He is passionate about nearly everything associated with software development and believes that we should always try to consider different solutions and approaches before solving a problem. Recently, he has spoken at conferences in Poland, including Confitura and JDD (Java Developers Day), and at the Krakow Scala User Group. He has also conducted a live coding session at the Geecon Conference. He is a co-founder of initlearn, an e-learning platform that was built with the Java language. He has also written articles about everything related to the Java world.
Data Stream Development with Apache Spark, Kafka, and Spring Boot
This video provides an overview of the entire course.
Introduce data streaming fundamentals and shape the data streaming blueprint architecture
Cover the big picture of data streaming
Talk about classifying, securing and scaling streaming systems
Shape via a diagram the data streaming blueprint architecture
Introduce the Meetup RSVPs stream and choose the technologies for implementing the data streaming blueprint architecture. Also see alternative technologies and how to decide between them
Access the Meetup RSVP stream online
Choose the proper technology for each tier of data streaming blueprint architecture
Explore the alternative technologies per tier and criteria for choosing between them properly
After a brief overview of the Collection Tier, we have a general discussion about the protocols, interaction patterns, and issues involved in writing a Collection Tier.
Start with a brief overview of connecting to the source of data, push and pull mechanisms, and lightweight business logic
Continue with protocols and interaction patterns
Finish with the scaling problems that the direct, persistent WebSocket connection causes for the Collection Tier
Develop the Collection Tier part for ingesting Meetup RSVPs via Spring WebSocketClient API
Brief overview of WebSocket concept
Introduce Spring WebSocketClient API and its role in Collection Tier
Implement the code
Explain why this tier, which apparently complicates and slows down the data streaming pipeline, is needed.
Tackle the backpressure issue
Understand the data durability issue
Learn about the data delivery semantics issue
Apache Kafka is a powerful but complex technology. This video is a comprehensive introduction to the main Kafka concepts.
Cover the overview, terminology, high-level architecture, topics, and partitions
Explore producers and consumers, consumer groups, delivery semantics and durability
Install and configure Zookeeper and a Kafka broker
Send the collected data to the Message Queuing Tier (Kafka) via the Spring Cloud Stream Kafka Binder API.
Introduce Spring Cloud Stream goal and architecture
Discuss message binders, especially the Kafka Binder API, via suggestive diagrams
Follow the code for sending the collected data to the Message Queuing Tier
Cover the main aspects of a Data Access Tier, such as writing/reading the analyzed data to/from long-term storage, in-memory databases/data grids, and memory. Discuss caching strategies. Cover static and dynamic filtering depending on the protocol.
See the overview of the Data Access Tier by answering the question "What can we do with the analyzed data?"
Write and read the analyzed data to/from a long-term storage, in-memory databases/data-grids and memory
Cover caching strategies along with static and dynamic filtering depending on protocol
Introduce MongoDB's main headlines, justify this choice, and prepare a MongoDB instance ready to go.
Learn MongoDB: what it is, why to use it, and when to use it
Explore terminology, relational vs. document-based, capped collections, and scaling
Install and configure a localhost instance of MongoDB server and MongoDB Compass
Clarify what is "reactive programming" and "reactive streams". Introduce Spring Reactive. Coding the MongoDB and Spring Reactive interaction.
Explain "reactive programming" and "reactive streams"
Introduce Spring Reactive Mono, Flux, WebFlux API and Spring Reactive Repositories via snippets of code
Know how to tie up MongoDB and Spring Reactive at code level via the ReactiveMongoTemplate API
Focus on implementing the UI part. The end user or client is an HTML/JS-based webpage capable of connecting via the Server-Sent Events protocol to a reactive endpoint exposed via the Spring Reactive Flux API. Cover a bunch of communication patterns used in this situation.
Explain the theoretical headlines meant to clarify what we will do
Implement the UI part at code level
Discuss the publish-subscribe, RMI/RPC, Simple Messaging, and Data Sync communication patterns
General overview of the Analysis Tier. Cover the main headlines and goals of this tier in a data streaming pipeline.
Explore the Continuous Query Model specific to stream processors
Explain why the Analysis Tier should run in a distributed fashion and touch on the high-level architectures of Apache Spark, Storm, Samza, and Flink
Discover main features of a streaming process
Discover what the specific streaming algorithms look like and get a flavor of the problems these algorithms try to solve. Theoretically cover four well-known streaming algorithms.
Talk about data stream query types and stream mining constraints
Explain stream and event time. Introduce the window-of-data concept
Explore the concepts of Reservoir Sampling, HyperLogLog, Count-Min Sketch and Bloom Filter streaming algorithms
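To make one of these concrete, below is a minimal, self-contained Scala sketch of Reservoir Sampling (Algorithm R); the function name and toy stream are ours, for illustration only. It keeps a uniform random sample of k items from a stream of unknown length in O(k) memory, which is exactly the constraint stream mining imposes.

```scala
import scala.util.Random

// Keep a uniform sample of k items from a stream of unknown length, in O(k) memory.
def reservoirSample[T](stream: Iterator[T], k: Int, rng: Random = new Random): Vector[T] = {
  val reservoir = scala.collection.mutable.ArrayBuffer.empty[T]
  var seen = 0L
  for (item <- stream) {
    seen += 1
    if (reservoir.size < k) {
      reservoir += item                      // fill the reservoir first
    } else {
      val j = (rng.nextDouble() * seen).toLong
      if (j < k) reservoir(j.toInt) = item   // replace a slot with probability k / seen
    }
  }
  reservoir.toVector
}

// Example: 5 numbers sampled uniformly from a "stream" of a million.
println(reservoirSample((1 to 1000000).iterator, 5))
```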
The goal of this video is to serve as a checklist of Apache Spark headlines and to give a high-level overview of what Apache Spark is and how it works.
Understand what Apache Spark is and why to choose it
Know terminology, high-level architecture, Spark stack and Spark job architecture
Introduce RDDs, DataFrames, Datasets, checkpointing and monitoring
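As a first taste of these structures, here is a minimal compiled-Scala sketch (the object and case class names are ours): the same two records viewed as an RDD, a DataFrame, and a typed Dataset.

```scala
import org.apache.spark.sql.SparkSession

// Top-level case class so Spark can derive an Encoder for the Dataset view.
case class WordCount(word: String, count: Int)

object StructuresDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("structures-demo").getOrCreate()
  import spark.implicits._

  // RDD: the low-level distributed collection, no schema attached.
  val rdd = spark.sparkContext.parallelize(Seq(("spark", 3), ("kafka", 2)))

  // DataFrame: rows with a schema, optimized by the Catalyst engine.
  val df = rdd.toDF("word", "count")
  df.printSchema()

  // Dataset: the same data with compile-time types.
  val ds = df.as[WordCount]
  ds.show()

  spark.stop()
}
```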
Plug-in Apache Spark in our data streaming pipeline. More precisely, place the Analysis Tier (Spark) between Message Queuing Tier (Kafka) and Data Access Tier (MongoDB).
Cover aspects of running Spark on Windows
Write a Spark based kickoff application
Prepare this application to ingest data from Kafka and send it, after analysis, to MongoDB
Discover the RDD data structure specific to Apache Spark and be aware of its main characteristics. Implement the code lines needed to ingest Meetup RSVPs from Kafka in RDDs and write these RDDs in a MongoDB collection.
Introduce RDDs as a new data structure
Cover RDD transformations, actions, and memory management
Write the code lines needed to pull RSVPs from Kafka into RDDs and send them to a MongoDB collection
Get a comprehensive guide to Spark Streaming. Theoretical and practical aspects are interleaved in order to cover Discretized Streams and windowing as the two main headlines.
Cover theoretical part of DStreams, Receiver Thread, Windowing and Checkpointing
Write an application to pull RSVPs from Kafka to DStreams and send these DStreams to a MongoDB collection
Write an application to count RSVPs in a window length of 30 seconds with a sliding interval of 5 seconds
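A spark-shell-style sketch of that windowed count (sc is predefined in the shell); a socket source stands in for the Kafka RSVP stream so the example stays self-contained:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))   // 5-second batches
ssc.checkpoint("/tmp/rsvp-checkpoint")           // window operations require a checkpoint dir

// Stand-in source: e.g. `nc -lk 9999`, one RSVP per line.
val rsvps = ssc.socketTextStream("localhost", 9999)

// Count everything seen in the last 30 seconds, recomputed every 5 seconds.
val counts = rsvps.countByWindow(Seconds(30), Seconds(5))
counts.print()

ssc.start()
ssc.awaitTermination()
```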
Tackle Spark SQL headlines, cover the powerful DataFrame and Dataset data structures via a comparison with RDDs and several examples, and write an application based on Spark SQL.
Have a brief overview of Spark SQL and a comprehensive comparison of RDDs vs. DataFrames vs. Datasets
Introduce DataFrames and Datasets API via examples
Write an application for filtering RSVPs by Australian venue via Spark SQL
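A spark-shell-style sketch of the filter (spark is predefined in the shell); it assumes the RSVPs were landed as JSON files and uses the nested field names of the public Meetup stream schema (group.group_country, group.group_name), with a hypothetical landing path:

```scala
import spark.implicits._

// Hypothetical landing path for collected RSVPs.
val rsvps = spark.read.json("/tmp/rsvps/*.json")

// Australia is country code "au" in the Meetup stream.
val australian = rsvps.filter($"group.group_country" === "au")
australian.select($"group.group_name", $"response").show(truncate = false)
```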
The focus here is on discovering Spark Structured Streaming and developing an application sample.
Cover Structured Streaming processing model. Explain concepts: unbounded input table, user query, result table, output mode and triggers.
Discover windowed grouped aggregations, watermarking, sources and sinks and checkpointing.
Write an application for counting RSVPs by guest number in a window of 4 minutes with a slide of 2 minutes and a watermark of 1 minute
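A self-contained sketch of that query: to avoid wiring up Kafka here, the built-in rate source fakes the stream, with timestamp playing the RSVP event time and value % 5 the guest count (both stand-ins are ours):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").appName("guests-window").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("rate").option("rowsPerSecond", 10).load()
  .select($"timestamp".as("mtime"), ($"value" % 5).as("guests"))

val counts = events
  .withWatermark("mtime", "1 minute")   // tolerate up to 1 minute of late data
  .groupBy(window($"mtime", "4 minutes", "2 minutes"), $"guests")
  .count()

// Update mode prints only the windows changed by each trigger.
counts.writeStream.outputMode("update").format("console").start().awaitTermination()
```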
Provide the main set of knowledge about the topic in soft-technical, easy-to-assimilate language.
Introduce Machine Learning concept via an example
Loop over the 7 steps meant to shape the big picture of how Machine Learning should tackle real problems
Have a final overview of Machine Learning and some Spark hints
Spark MLlib (or Spark ML) is the Spark library for Machine Learning. The aim of this video is to discover all the main headlines of a Spark ML Pipeline. Implement an ML Pipeline for the House Price Forecast System discussed in the previous video.
Introduce Spark MLlib (Spark ML) main concept, Spark ML Pipeline, and see how data is flowing through an ML Pipeline
Cover Spark MLlib (Spark ML) operations: transformers, estimators, evaluators, etc.
Dissect the Spark Pipeline and PipelineModel APIs and use them to implement an ML Pipeline for the House Price Forecast System
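A minimal sketch of such a pipeline (the toy rows, column names, and choice of linear regression are ours, standing in for the house price data): a VectorAssembler transformer feeds a regression estimator, and Pipeline.fit returns a reusable PipelineModel.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import spark.implicits._   // spark-shell style; `spark` is predefined there

// Toy stand-in for the house price dataset: area, rooms, price.
val houses = Seq(
  (50.0, 2.0, 120000.0),
  (80.0, 3.0, 200000.0),
  (120.0, 4.0, 310000.0)
).toDF("area", "rooms", "price")

// Transformer: assemble raw columns into one feature vector.
val assembler = new VectorAssembler().setInputCols(Array("area", "rooms")).setOutputCol("features")

// Estimator: regression over the assembled features.
val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("price")

// The Pipeline fits all stages in order and yields a reusable PipelineModel.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(houses)
model.transform(houses).select("features", "price", "prediction").show()
```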
Combine the power of Spark ML and Structured Streaming in an example that trains a Logistic Regression model offline and scores online later. Explore an example of online training and scoring via the RDD API. Discuss the unreleased Streaming ML concept.
Introduce the Logistic Regression algorithm used in the following applications
Develop an application that trains the model offline and scores online on the Meetup RSVPs stream
Develop an application that trains and scores online on the Meetup RSVPs stream via the RDD API
Bring into discussion Spark GraphX, the Spark library dedicated to graphs and graph-parallel computation.
Cover Spark GraphX headlines
Cover Spark GraphX API headlines
See a simple example
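The simple example boils down to a handful of lines; here is a spark-shell-style sketch (sc is predefined there) with a made-up three-user follower graph:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, attribute); edges are (srcId, dstId, attribute).
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(users, follows)
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

// inDegrees: how many followers each user has.
graph.inDegrees.join(users).collect().foreach {
  case (_, (degree, name)) => println(s"$name has $degree follower(s)")
}
```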
Provide the argumentation for choosing logging over checkpointing as the fault-tolerance mechanism in streaming, dissect the RBML, SBML, and HML architectures, and implement HML in our streaming pipeline.
Explain why logging is better than checkpointing in a streaming pipeline
Have a bunch of meaningful diagrams to dissect the flow of data through RBML, SBML and HML
Provide the coding session for adding HML in our streaming pipeline via Spring Reactive and MongoDB
The goal here is to provide another implementation for the SBML part via the Debezium Connector for MongoDB.
Get a Kafka Connect brief overview
Explore Debezium Connector for MongoDB brief overview
Understand theoretical aspects of implementing SBML logger with Debezium Connector For MongoDB
Secure the communication between the Collection and the Message Queuing tiers and between the Analysis and the Message Queuing tiers.
Explore secure communication between Collection and Message Queuing tiers via SSL
Secure communication between Analysis and Message Queuing tiers via SSL
Touch on SSL for Kafka inter-broker communication
Apache Spark: Tips, Tricks, & Techniques
This video provides an overview of the entire course.
In this video, we will use Spark transformations to defer computations to a later time.
Understand Spark DAG creation
Execute DAG by issuing action
Defer decision about starting job until the last possible moment
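A spark-shell-style sketch of that deferral (sc is predefined there): the two transformations only extend the DAG, and nothing touches the cluster until the action fires.

```scala
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)        // transformation: no job yet
val squares = evens.map(n => n.toLong * n)      // still only building the DAG

// The action is what actually submits the job.
println(squares.count())                        // execution starts here
```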
In this video, we will learn how to avoid transformations.
Understand groupBy API
Use cache() function
Avoid skewed partitions
In this video, we will use reduce and reduceByKey to calculate results.
Understand reduce behavior
Use reduce() function
Use reduceByKey() function
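A spark-shell-style sketch of both (sc is predefined there): reduce is an action that folds the whole RDD to one value on the driver, while reduceByKey is a transformation that combines values per key on the executors, with map-side combining.

```scala
// reduce: action, collapses the RDD to a single value.
val total = sc.parallelize(Seq(1, 2, 3, 4)).reduce(_ + _)   // 10

// reduceByKey: transformation, combines per key before shuffling raw values.
val perUser = sc.parallelize(Seq(("anna", 1), ("bob", 2), ("anna", 3))).reduceByKey(_ + _)
perUser.collect().foreach(println)                          // (anna,4), (bob,2)
```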
In this video, we will understand what an action can be in Spark.
Get a walkthrough of all the actions
In this video, we reuse the same RDD for different actions.
Minimize execution time by reuse of RDD
In this video, we will delve into Spark RDDs parent/child chain.
Learn how to extend RDD
Chain the new RDD with a parent
Test our custom RDD
In this video, we will use RDD in an immutable way.
Understand DAG immutability
Create two leaves from one root RDD
Examine results from both leaves
In this video, we will use DataFrame operations to transform the RDD.
Understand DataFrame immutability
Create two leaves from one root DataFrame
Add a new column by issuing transformation
In this video, we will learn about immutability in a highly concurrent environment.
Understand cons of mutable collections
Create two threads that simultaneously modify mutable collection
Understand how to reason about a concurrent program
In this video, we will use the Dataset API in an immutable way.
Understand Dataset immutability
Create two leaves from one root Dataset
Add new columns by issuing transformation
In this video, we will learn to detect a shuffle in processing.
Load randomly partitioned data
Issue re-partition using meaningful partition key
Understand that a shuffle occurs by explaining the query
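A spark-shell-style sketch of spotting the shuffle (spark is predefined there): repartitioning by a key forces an exchange, and explain() exposes it in the physical plan.

```scala
import spark.implicits._

val users = Seq((1, "pl"), (2, "us"), (3, "pl")).toDF("id", "country")

// Repartitioning by a meaningful key forces a shuffle...
val byCountry = users.repartition($"country")

// ...which explain() shows as an Exchange node, roughly:
//   Exchange hashpartitioning(country#..., 200)
byCountry.explain()
```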
In this video, we will test operations that cause shuffle in Apache Spark.
Use join for two DataFrames
Use two DataFrames that are partitioned differently
Test join that causes shuffle
In this video, we will change the design of jobs with wide dependencies.
Re-partition DataFrames using a common partition key
Understand join with pre-partitioned data
Understand why we avoided shuffle
In this video, we will use keyBy() operation to reduce shuffle.
Load randomly partitioned data
Learn to pre-partition data in a meaningful way
Leverage keyBy() function
In this video, we will use a custom partitioner to reduce shuffle.
Implement custom partitioner
Use the partitioner with the partitionBy method in Spark
Validate that our data was partitioned properly
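A sketch of such a partitioner (the HotKeyPartitioner name and the skew scenario are ours): one known hot key gets a dedicated partition and everything else is hashed across the rest, then the placement is validated per partition.

```scala
import org.apache.spark.Partitioner

class HotKeyPartitioner(partitions: Int, hotKey: String) extends Partitioner {
  require(partitions >= 2, "need one partition for the hot key plus at least one more")
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case k: String if k == hotKey => 0               // dedicated partition for the hot key
    case k =>
      val m = partitions - 1
      1 + ((k.hashCode % m) + m) % m                 // safe modulo for negative hashCodes
  }
}

// spark-shell style; `sc` is predefined there.
val events = sc.parallelize(Seq(("hot", 1), ("a", 1), ("b", 1), ("hot", 1)))
val partitioned = events.partitionBy(new HotKeyPartitioner(4, "hot"))

// Validate the placement: list each partition's keys.
partitioned.mapPartitionsWithIndex { (i, it) =>
  Iterator(s"partition $i: ${it.map(_._1).toList}")
}.collect().foreach(println)
```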
In this video, we will learn to save data in plain text.
Load plain text data
Test your data
In this video, we will learn about tabular formats.
Load CSV data
Test your data
In this video, we will learn to use Avro with Spark.
Understand how to save data in Avro format
Load Avro data
Test your data
In this video, we will learn to use columnar formats like Parquet.
Learn how to save data in Parquet
Load Parquet data
Test your data
In this video, we will learn the available transformations on key/value pairs.
Examine your data
In this video, we will learn to use aggregateByKey instead of groupBy().
Understand why we should not use groupByKey
Learn what aggregateByKey gives us
Implement logic using aggregateByKey
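A spark-shell-style sketch of that logic (sc is predefined there): a per-key (sum, count) accumulator computes averages without shuffling every raw value the way groupByKey would.

```scala
val scores = sc.parallelize(Seq(("anna", 10), ("bob", 4), ("anna", 7)))

val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // fold one value into the per-partition accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge accumulators across partitions
)

sumCount.mapValues { case (sum, n) => sum.toDouble / n }
  .collect().foreach(println)              // (anna,8.5), (bob,4.0)
```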
In this video, we will learn to use actions on key/value pairs using accumulators.
Examine actions on K/V pairs
Examine output for K/V RDD
In this video, we will look at the available partitioners on key/value data.
Test your data
In this video, we will learn to implement a custom partitioner.
Learn to implement range partitioning
Test our partitioner
In this video, we will be creating components with logic. Then we will perform unit testing of the components, and lastly we will use a case class from the model.
Create component with logic
Perform unit testing of the component
Use case class from model
In this video, we will acquire skills to perform integration testing using SparkSession.
Leverage SparkSession for integration testing
Use the unit-tested component
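A minimal integration-test sketch along those lines (assumes ScalaTest 3.2 on the classpath; the suite name and word-count logic are ours): spin up a local SparkSession, run the logic end to end, assert on the result.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class WordCountIntegrationTest extends AnyFunSuite {
  test("counts words end to end") {
    val spark = SparkSession.builder().master("local[2]").appName("it-test").getOrCreate()
    try {
      import spark.implicits._
      val counts = Seq("spark spark kafka").toDS()
        .flatMap(_.split(" "))
        .groupByKey(identity).count()
        .collect().toMap
      assert(counts("spark") == 2)
      assert(counts("kafka") == 1)
    } finally spark.stop()
  }
}
```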
This video will take you through mocking data sources using partial functions.
Create a Spark component that reads from Hive
Test mocked component
In this video, we will be using ScalaCheck for property-based testing.
Apply property-based testing
Create a property-based test
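A property-based sketch in that spirit (assumes ScalaCheck on the classpath; the property and the local fold standing in for the Spark component are ours): instead of hand-picked cases, one invariant is checked against many generated inputs.

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object SumPerKeySpec extends Properties("sumPerKey") {
  property("fold agrees with naive groupBy") = forAll { (pairs: List[(String, Int)]) =>
    val naive = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    // Local stand-in for the tested component; a real suite would call it on an RDD.
    val folded = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
    folded == naive
  }
}
```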
In the last video of the section, we will learn to apply the tests to different versions of Spark.
Adapt the component to work with Spark pre-2.x
Mock testing pre-2.x
RDD mock testing
In this video, we will learn to create a graph from a data source.
Create loader component
Revisit graph file format
Load Spark graph from file
In this video, we will learn to use the Vertex API.
Construct graph using Vertex
Leverage Vertex transformations
In this video, we will learn to use the Edge API.
Construct graph using Edge
Leverage Edge transformations
In this video, we will learn to calculate the degree of a vertex.
In this video, we will learn to calculate PageRank.
Load data about users
Load data about followers
Use PageRank to calculate the rank of users
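A spark-shell-style sketch of that computation (sc is predefined there; the three-user follower graph is made up): an edge (a, b) means "a follows b", and pageRank runs until the ranks move less than the tolerance.

```scala
import org.apache.spark.graphx.{Edge, Graph}

val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(2L, 1L, 1)))
val graph = Graph(users, follows)

// Iterate until the rank of every vertex changes by less than 0.0001.
val ranks = graph.pageRank(0.0001).vertices

ranks.join(users).sortBy(_._2._1, ascending = false).collect().foreach {
  case (_, (rank, name)) => println(f"$name%-6s $rank%.4f")
}
```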
Troubleshooting Apache Spark
This video will give you an overview of the course.
In this video, we will be solving eager computations with lazy evaluation.
What is a Transformation?
Why are my transformations not executed?
Trigger transformations using actions
In this video, we will be solving slow-running jobs by using in-memory persistence.
The problem with data recomputation
Use the cache() function
Use the persist() function
In this video, we will be alleviating unexpected API behavior by picking the proper RDD API.
How to speed up transform/filter queries
The ordering of operators matters
Performance test of our improvement
In this video, we will learn to reduce wide dependencies by using narrow dependencies.
What is a narrow dependency?
What is a wide dependency?
How to avoid wide dependencies?
In this video, we will learn to solve slow jobs using partitions.
Examine the number of partitions of RDD
Use the coalesce() method
Use the repartition() method
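A spark-shell-style sketch of both methods (sc is predefined there): coalesce shrinks the partition count without a shuffle, which suits the many-near-empty-partitions case after a heavy filter, while repartition does a full shuffle and can also grow the count.

```scala
val data = sc.parallelize(1 to 1000, numSlices = 100)
println(data.getNumPartitions)               // 100

// coalesce: merge partitions locally, no shuffle.
val fewer = data.filter(_ % 10 == 0).coalesce(4)
println(fewer.getNumPartitions)              // 4

// repartition: full shuffle; can rebalance skew or increase parallelism.
val rebalanced = fewer.repartition(16)
println(rebalanced.getNumPartitions)         // 16
```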
In this video, we will learn the technique of extending the DataFrame API with UDFs (user-defined functions).
Use the DataFrame API
Create a UDF Function
Register the UDF for usage in the DataFrame API
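A spark-shell-style sketch of the round trip (spark is predefined there; the capitalize UDF is our toy example): define the UDF for the DataFrame API, then register it for SQL. Note that UDFs are opaque to the Catalyst optimizer, so they belong where no built-in function fits.

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

val cities = Seq("warsaw", "krakow").toDF("city")

// DataFrame API usage.
val capitalize = udf((s: String) => s.capitalize)
cities.withColumn("pretty", capitalize($"city")).show()

// SQL usage requires explicit registration.
spark.udf.register("capitalize", (s: String) => s.capitalize)
cities.createOrReplaceTempView("cities")
spark.sql("SELECT capitalize(city) AS pretty FROM cities").show()
```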
In this video, we will be understanding jobs by examining physical and logical plans.
Examine logical and physical plans of DF
Examine execution plan of RDDs
In this video, we will be replacing slow interpreted lambdas using the Spark Optimizer.
Delve into the Optimizer class
In this video, we will learn to avoid wrong join strategies by using a join type based on data volume.
Understand inner join
Understand left/right join
Understand outer join
In this video, we will discover techniques to solve the slow joins problem by choosing the proper execution plan.
Use custom partitioner during join
How to join a smaller dataset with a bigger one?
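A spark-shell-style sketch of the small-with-big case (spark is predefined there; the tables are made up): broadcasting the small side ships it to every executor, so the large side is joined in place and never shuffled.

```scala
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val bigEvents = (1 to 100000).map(i => (i % 100, s"event-$i")).toDF("countryId", "payload")
val smallCountries = Seq((0, "PL"), (1, "US")).toDF("countryId", "name")

// Hint Spark to broadcast the small table instead of shuffling both sides.
val joined = bigEvents.join(broadcast(smallCountries), "countryId")
joined.explain()   // expect BroadcastHashJoin rather than SortMergeJoin
```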
In this video, we will perform distributed joins using DataFrame.
Use DataFrame to perform join
Perform inner join
Perform outer/left/right join
In this video, we will perform distributed joins using Datasets.
How to perform type-safe joins?
Use the Dataset API to join
In this video, we will make jobs memory efficient by reusing existing objects.
How to minimize object creation
Use mutable state passed to Spark API
In this video, we will iterate over specific partitions by using mapPartition().
Understand what can be inside of a partition
Perform operations partition-wise using mapPartitions
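A spark-shell-style sketch (sc is predefined there; the fake connection string is ours): mapPartitions runs once per partition, so a per-record cost like opening a connection is paid once per partition instead.

```scala
val ids = sc.parallelize(1 to 10, numSlices = 2)

val enriched = ids.mapPartitions { records =>
  // Hypothetical expensive resource, created once per partition.
  val connection = s"conn-${java.util.UUID.randomUUID()}"
  records.map(id => s"$id via $connection")
}
enriched.collect().foreach(println)   // two distinct connections, one per partition
```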
In this video, we will learn to debug Spark jobs by introducing accumulators.
Add metrics using accumulators
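A spark-shell-style sketch of that debugging metric (sc is predefined there): a long accumulator counts malformed records on the executors and reports back to the driver once an action runs.

```scala
val malformed = sc.longAccumulator("malformed-records")

val lines = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = lines.flatMap { s =>
  // Note: accumulator updates inside transformations may be re-applied on task retries.
  try Some(s.toInt)
  catch { case _: NumberFormatException => malformed.add(1); None }
}

println(parsed.sum())       // the action triggers the job
println(malformed.value)    // 1 malformed record counted
```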
In this video, we will explore ways to avoid recomputing RDDs multiple times by using caching.
Use the Spark API to favor reusability of RDDs
In this video, we will create a replaceable and reusable sink and source.
Detect Missing Values (NaN)
Leverage the isNull() helper method
In this video, we will be reducing the time of batch jobs by using Spark's micro-batch approach.
Make NaN meaningful to processing
Replace NaN with a scalar value
In this video, we will make jobs fault tolerant by introducing a checkpoint mechanism.
Define the Ad Validator module
Understand what a backward fill is
Understand what a forward fill is
In this video, we will learn to create one codebase for stream and batch using Structured Streaming.
Handle an outlier by replacing it with a meaningful value
Implement logic using replace