4.33 out of 5
3 reviews on Udemy

Tuning Apache Spark: Powerful Big Data Processing Recipes

Uncover the lesser-known secrets of powerful big data processing with Spark and Kafka
Packt Publishing
32 students enrolled
English [Auto-generated]
How to attain a solid foundation in the most powerful and versatile technologies involved in data streaming: Apache Spark and Apache Kafka
Form a robust and clean architecture for a data streaming pipeline
Ways to implement the correct tools to bring your data streaming architecture to life
How to create robust processing pipelines by testing Apache Spark jobs
How to create highly concurrent Spark programs by leveraging immutability
How to solve repeated problems by leveraging the GraphX API
How to solve long-running computation problems by leveraging lazy evaluation in Spark
Tips to avoid memory leaks by understanding the internal memory management of Apache Spark
Troubleshoot real-time pipelines written in Spark Streaming

Video Learning Path Overview

A Learning Path is a specially tailored course that brings together two or more different topics that lead you to achieve an end goal. Much thought goes into the selection of the assets for a Learning Path, and this is done through a complete understanding of the requirements to achieve a goal.

Today, organizations have a difficult time working with large datasets. In addition, big data often has to be processed and analyzed in real time to yield valuable insights quickly. This is where data streaming and Spark come in.

In this well-thought-out Learning Path, you will not only learn how to work with Spark to solve the problem of analyzing massive amounts of data for your organization, but you’ll also learn how to tune it for performance. Beginning with a step-by-step approach, you’ll get comfortable using Spark and learn how to implement practical, proven techniques to improve particular aspects of programming and administration in Apache Spark. You’ll be able to perform tasks and get the best out of your databases much faster.

Moving further and picking up the pace a bit, you’ll learn some of the lesser-known techniques to squeeze the best out of Spark, and then you’ll learn to overcome several problems you might come across when working with Spark without breaking a sweat. The simple and practical solutions provided will get you back in action in no time at all!

By the end of the course, you will be well versed in using Spark in your day-to-day projects.

Key Features

  • From blueprint architecture to complete code solution, this course treats every important aspect involved in architecting and developing a data streaming pipeline

  • Test Spark jobs using the unit, integration, and end-to-end techniques to make your data pipeline robust and bulletproof.

  • Solve several painful issues like slow-running jobs that affect the performance of your application.

Author Bios

  • Anghel Leonard is currently a Java chief architect. He is a member of the Java EE Guardians with 20+ years’ experience. He has spent most of his career architecting distributed systems. He is also the author of several books, a speaker, and a big fan of working with data.

  • Tomasz Lelek is a Software Engineer, programming mostly in Java and Scala. He has been working with the Spark and ML APIs for the past 5 years with production experience in processing petabytes of data. He is passionate about nearly everything associated with software development and believes that we should always try to consider different solutions and approaches before solving a problem. Recently he was a speaker at conferences in Poland, Confitura and JDD (Java Developers Day), and at Krakow Scala User Group. He has also conducted a live coding session at Geecon Conference. He is a co-founder of initlearn, an e-learning platform that was built with the Java language. He has also written articles about everything related to the Java world.

Data Stream Development with Apache Spark, Kafka, and Spring Boot

The Course Overview

This video provides an overview of the entire course.

Discovering the Data Streaming Pipeline Blueprint Architecture

Introduce data streaming fundamentals and shape the data streaming blueprint architecture

  • Cover the big picture of data streaming

  • Talk about classifying, securing and scaling streaming systems

  • Shape via a diagram the data streaming blueprint architecture

Analyzing Meetup RSVPs in Real-Time

Introduce the Meetup RSVPs stream and choose the technologies for implementing the data streaming blueprint architecture. See alternative technologies as well and how to decide between them

  • Access the Meetup RSVP stream online

  • Choose the proper technology for each tier of data streaming blueprint architecture

  • Explore the alternative technologies per tier and criteria for choosing between them properly

Running the Collection Tier (Part I – Collecting Data)

After a brief overview of the Collection Tier, we have a general discussion about protocols, interaction patterns and issues involved in writing a Collection Tier.

  • Start with a brief overview about connecting to the source of data, push and pull mechanisms and lightweight business logic

  • Continue with protocols and interaction patterns

  • Finish with the problem of scaling the Collection Tier, which is caused by WebSocket's direct and persistent connection

Collecting Data Via the Stream Pattern and Spring WebSocketClient API

Develop the Collection Tier part for ingesting Meetup RSVPs via Spring WebSocketClient API

  • Brief overview of WebSocket concept

  • Introduce Spring WebSocketClient API and its role in Collection Tier

  • Implement the code

Explaining the Message Queuing Tier Role

Explain why this tier, which apparently complicates and slows down the data streaming pipeline, is needed.

  • Tackle the backpressure issue

  • Understand the data durability issue

  • Learn about the data delivery semantics issue

Introducing Our Message Queuing Tier – Apache Kafka

Apache Kafka is a powerful but complex technology. This video is a comprehensive introduction to the main Kafka concepts.

  • Cover the overview, terminology, high-level architecture, topics and partitions

  • Explore producers and consumers, consumer groups, delivery semantics and durability

  • Install and configure Zookeeper and a Kafka broker

Running The Collection Tier (Part II – Sending Data)

Send the collected data to the Message Queuing Tier (Kafka) via the Spring Cloud Stream Kafka Binder API.

  • Introduce Spring Cloud Stream goal and architecture

  • Discuss message binders, especially the Kafka Binder API, via illustrative diagrams

  • Follow the code for sending the collected data to the Message Queuing Tier

Dissecting the Data Access Tier

Cover the main aspects of a Data Access Tier, such as writing/reading the analyzed data to/from long-term storage, in-memory databases/data grids and memory. Discuss caching strategies. Cover static and dynamic filtering depending on protocol.

  • See the overview of the Data Access Tier by answering the question "What can we do with the analyzed data?"

  • Write and read the analyzed data to/from a long-term storage, in-memory databases/data-grids and memory

  • Cover caching strategies along with static and dynamic filtering depending on protocol

Introducing Our Data Access Tier – MongoDB

Introduce the main highlights of MongoDB, justify this choice, and prepare a MongoDB instance that is ready to go.

  • Learn what MongoDB is, and why and when to use it

  • Explore terminology, relational vs. document based, capped collection and scaling

  • Install and configure a localhost instance of MongoDB server and MongoDB Compass

Exploring Spring Reactive

Clarify what "reactive programming" and "reactive streams" are. Introduce Spring Reactive. Code the MongoDB and Spring Reactive interaction.

  • Explain "reactive programming" and "reactive streams"

  • Introduce Spring Reactive Mono, Flux, WebFlux API and Spring Reactive Repositories via snippets of code

  • Know how to tie up MongoDB and Spring Reactive at code level via the ReactiveMongoTemplate API

Exposing the Data Access Tier in Browser

Focus on implementing the UI part. The end user or client is an HTML/JS-based web page capable of connecting via the Server-Sent Events protocol to a reactive endpoint exposed via the Spring Reactive Flux API. Cover several communication patterns used in this situation.

  • Explain the theoretical headlines meant to clarify what we will do

  • Implement the UI part at code level

  • Discuss the publish-subscribe, RMI/RPC, Simple Messaging and Data Sync communication patterns

Diving into the Analysis Tier

General overview of Analysis Tier. Cover main headlines and goals of this tier in a data streaming pipeline.

  • Explore the Continuous Query Model, which is specific to stream processors

  • Explain why the Analysis Tier should run in a distributed fashion and touch on the high-level architectures of Apache Spark, Storm, Samza and Flink

  • Discover main features of a streaming process

Streaming Algorithms For Data Analysis

Discover what streaming algorithms look like and get a flavor of the problems that these algorithms try to solve. Cover the theory behind four well-known streaming algorithms.

  • Talk about data stream query types and stream mining constraints

  • Explain stream time and event time, and introduce the concept of a window of data

  • Explore the concepts of Reservoir Sampling, HyperLogLog, Count-Min Sketch and Bloom Filter streaming algorithms
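As an illustration of the first algorithm in that list, here is a minimal sketch of reservoir sampling (the classic "Algorithm R") in plain Python. It is not taken from the course; the function name and parameters are assumptions chosen for this example. The idea is to keep a uniform random sample of `k` items from a stream whose length is not known in advance, using only `k` slots of memory.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1)
            # by overwriting a randomly chosen slot.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Hypothetical usage: sample 5 values from a stream of 10,000 numbers.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
```

If the stream holds fewer than `k` items, the sketch simply returns all of them.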

Introducing Our Analysis Tier – Apache Spark

The goal of this video is to serve as a checklist of Apache Spark highlights and to give a high-level overview of what Apache Spark is and how it works.

  • Understand what Apache Spark is and why to choose it

  • Know terminology, high-level architecture, Spark stack and Spark job architecture

  • Introduce RDDs, DataFrames, Datasets, checkpointing and monitoring

Plug the Spark Analysis Tier into Our Pipeline

Plug Apache Spark into our data streaming pipeline. More precisely, place the Analysis Tier (Spark) between the Message Queuing Tier (Kafka) and the Data Access Tier (MongoDB).

  • Cover aspects of running Spark on Windows

  • Write a Spark based kickoff application

  • Prepare this application to ingest data from Kafka and send it, after analysis, to MongoDB

Brief Overview of Spark RDDs

Discover the RDD data structure specific to Apache Spark and be aware of its main characteristics. Implement the code lines needed to ingest Meetup RSVPs from Kafka in RDDs and write these RDDs in a MongoDB collection.

  • Introduce RDDs as a new data structure

  • Cover RDD transformations, actions, and memory management

  • Write the code lines needed to pull RSVPs from Kafka into RDDs and send them to a MongoDB collection

Spark Streaming

Work through a comprehensive guide to Spark Streaming. Theoretical and practical aspects are interleaved in order to cover Discretized Streams and Windowing as the two main topics.

  • Cover theoretical part of DStreams, Receiver Thread, Windowing and Checkpointing

  • Write an application to pull RSVPs from Kafka to DStreams and send these DStreams to a MongoDB collection

  • Write an application to count RSVPs in a window length of 30 seconds with sliding interval of 5 seconds
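To make the window length and sliding interval in the last item concrete, the following is a small pure-Python sketch, not the course's Spark Streaming code. It assumes RSVP arrival times are given as whole seconds and counts, every 5 seconds, how many RSVPs fell into the preceding 30 seconds. The function name and the sample data are hypothetical.

```python
def sliding_window_counts(timestamps, window=30, slide=5):
    """Return {window_end: count} for windows [end - window, end),
    evaluated every `slide` seconds."""
    if not timestamps:
        return {}
    # First slide boundary strictly after the latest event.
    last_end = (int(max(timestamps)) // slide + 1) * slide
    counts = {}
    for end in range(slide, last_end + 1, slide):
        start = end - window
        counts[end] = sum(1 for t in timestamps if start <= t < end)
    return counts


# Hypothetical RSVP arrival times, in seconds since the stream started.
rsvp_times = [1, 2, 7, 12, 31, 33]
counts = sliding_window_counts(rsvp_times)
```

Because the window (30 s) is longer than the slide (5 s), each RSVP is counted in several consecutive windows, which is the behavior a sliding window is meant to have.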

DataFrames, Datasets and Spark SQL

Tackle Spark SQL headlines, cover the powerful DataFrame and Dataset data structures via a comparison with RDDs and several examples, and write an application based on Spark SQL.

  • Have a brief overview of Spark SQL and a comprehensive comparison of RDDs vs. DataFrames vs. Datasets

  • Introduce DataFrames and Datasets API via examples

  • Write an application for filtering RSVPs by Australia venue via Spark SQL

Spark Structured Streaming

The focus here is on discovering Spark Structured Streaming and developing an application sample.

  • Cover Structured Streaming processing model. Explain concepts: unbounded input table, user query, result table, output mode and triggers.

  • Discover windowed grouped aggregations, watermarking, sources and sinks and checkpointing.

  • Write an application for counting RSVPs by number of guests in a window of 4 minutes with a slide interval of 2 minutes and a watermark of 1 minute
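The interplay of window, slide and watermark can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not Structured Streaming itself: event times are whole seconds, the watermark is recomputed on every event rather than per micro-batch, and all names are hypothetical. An update is dropped once the watermark (latest event time seen minus the allowed delay) has passed the end of the window it belongs to.

```python
from collections import Counter

WINDOW = 240  # 4 minutes, in seconds
SLIDE = 120   # 2 minutes
DELAY = 60    # watermark delay of 1 minute


def windows_for(event_time):
    """Start times of every sliding window [start, start + WINDOW)
    that contains event_time."""
    last_start = (event_time // SLIDE) * SLIDE
    first_start = last_start - WINDOW + SLIDE
    return list(range(first_start, last_start + 1, SLIDE))


def count_with_watermark(events):
    """events: iterable of (event_time_seconds, guests) in arrival order."""
    counts = Counter()
    max_seen = 0
    for event_time, guests in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - DELAY
        for start in windows_for(event_time):
            # A window is considered closed once the watermark passes its end;
            # late updates for closed windows are dropped.
            if start + WINDOW > watermark:
                counts[(start, guests)] += 1
    return counts


# Hypothetical RSVPs; the third one arrives far too late and is ignored.
result = count_with_watermark([(130, 2), (400, 1), (100, 2)])
```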

Machine Learning in 7 Steps

Provide the essential knowledge about the topic in lightly technical language that is easy to assimilate.

  • Introduce Machine Learning concept via an example

  • Loop over the 7 steps meant to shape the big picture of how Machine Learning should tackle real problems

  • Have a final overview of Machine Learning and some Spark hints

MLlib (Spark ML)

Spark MLlib (or Spark ML) is the Spark library for Machine Learning. The aim of this video is to discover all the main headlines of a Spark ML Pipeline. Implement an ML Pipeline for the House Price Forecast System discussed in the previous video.

  • Introduce Spark MLlib (Spark ML) main concept, Spark ML Pipeline, and see how data is flowing through an ML Pipeline

  • Cover Spark MLlib (Spark ML) operations: transformers, estimators, evaluators, etc.

  • Dissect the Spark Pipeline and PipelineModel APIs and use them to implement an ML Pipeline for the House Price Forecast System

Spark ML and Structured Streaming

Combine the power of Spark ML and Structured Streaming in an example that trains a Logistic Regression model offline and later scores online. Explore an example of online training and scoring via the RDD API. Discuss the unreleased Streaming ML concept.

  • Introduce the Logistic Regression algorithm used in the further applications

  • Develop an application that trains the model offline and scores online on the Meetup RSVPs stream

  • Develop an application that trains and scores online on the Meetup RSVPs stream via the RDDs API

Spark GraphX

Introduce Spark GraphX, the Spark library dedicated to graphs and graph-parallel computation.

  • Cover Spark GraphX headlines

  • Cover Spark GraphX API headlines

  • See a simple example

Fault Tolerance (HML)

Present the arguments for choosing logging over checkpointing as the fault tolerance mechanism in streaming, dissect the RBML, SBML and HML architectures, and implement HML in our streaming pipeline.

  • Explain why logging is better than checkpointing in a streaming pipeline

  • Have a bunch of meaningful diagrams to dissect the flow of data through RBML, SBML and HML

  • Provide the coding session for adding HML in our streaming pipeline via Spring Reactive and MongoDB

Kafka Connect

The goal here is to provide another implementation for the SBML part via the Debezium Connector for MongoDB.

  • Get a Kafka Connect brief overview

  • Get a brief overview of the Debezium Connector for MongoDB

  • Understand theoretical aspects of implementing SBML logger with Debezium Connector For MongoDB

Securing Communication between Tiers

Secure the communication between the Collection and the Message Queuing tiers and between the Analysis and the Message Queuing tiers.

  • Explore secure communication between Collection and Message Queuing tiers via SSL

  • Secure communication between Analysis and Message Queuing tiers via SSL.

  • Point out SSL for Kafka inter-broker communication

Test Your Knowledge

Apache Spark: Tips, Tricks, & Techniques

The Course Overview

This video provides an overview of the entire course.

Using Spark Transformations to Defer Computations to a Later Time

In this video, we will use Spark transformations to defer computations to a later time.

  • Understand Spark DAG creation

  • Execute DAG by issuing action

  • Defer decision about starting job until the last possible moment
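The deferral described in these items can be illustrated without Spark. The sketch below is a toy stand-in for an RDD, written in plain Python with hypothetical names: `map` and `filter` only record steps and return a new pipeline, and nothing runs until the `collect` action is called.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new pipeline, computes nothing.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new pipeline, computes nothing.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: walks the recorded steps and materializes the result.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def traced_double(x):
    calls.append(x)  # records when the function really runs
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the plan exists so far
result = pipeline.collect()       # the action triggers the computation
```

Since every transformation returns a new object, the original pipeline can be branched into several independent results, which is also the idea behind using RDDs in an immutable way.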

Avoiding Transformations

In this video, we will learn how to avoid transformations.

  • Understand groupBy API

  • Use cache() function

  • Avoid skewed partitions

Using reduce and reduceByKey to Calculate Results

In this video, we will use reduce and reduceByKey to calculate results.

  • Understand reduce behavior

  • Use reduce() function

  • Use reduceByKey() function
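The difference between the two operations can be shown with a small plain-Python sketch. It mimics the semantics only and is not Spark code; `reduce_by_key` and the sample data are hypothetical. `reduce` folds all values into a single result, while a reduce-by-key folds the values of each key separately.

```python
from functools import reduce


def reduce_by_key(pairs, fn):
    """Combine the values of each key with fn, one key at a time."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Hypothetical (user, visits) records.
visits = [("alice", 3), ("bob", 1), ("alice", 4), ("bob", 5)]

total = reduce(lambda a, b: a + b, (v for _, v in visits))  # one value overall
per_user = reduce_by_key(visits, lambda a, b: a + b)        # one value per key
```

In both cases the combining function should be associative and commutative, because a distributed engine is free to combine partial results in any order.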

Performing Actions That Trigger Computations

In this video, we will understand what an action can be in Spark.

  • Get a walkthrough of all the actions

  • Perform tests

Reusing the Same RDD for Different Actions

In this video, we reuse the same RDD for different actions.

  • Minimize execution time by reuse of RDD

  • Introduce caching

  • Perform tests

Delve into Spark RDDs Parent/Child Chain

In this video, we will delve into Spark RDDs parent/child chain.

  • Learn how to extend RDD

  • Chain the new RDD with a parent

  • Test our custom RDD

Using RDD in an Immutable Way

In this video, we will use RDD in an immutable way.

  • Understand DAG immutability

  • Create two leaves from one root RDD

  • Examine results from both leaves

Using DataFrame Operations to Transform It

In this video, we will use DataFrame operations to transform the DataFrame.

  • Understand DataFrame immutability

  • Create two leaves from one root DataFrame

  • Add a new column by issuing transformation

Immutability in the Highly Concurrent Environment

In this video, we will learn about immutability in a highly concurrent environment.

  • Understand cons of mutable collections

  • Create two threads that simultaneously modify mutable collection

  • Understand the reasoning about a concurrent program

Using Dataset API in an Immutable Way

In this video, we will use Dataset API in the immutable way.

  • Understand dataset immutability

  • Create two leaves from one root dataset

  • Add new columns by issuing transformation

Detecting a Shuffle in a Processing

In this video, we will learn to detect a shuffle in processing.

  • Load randomly partitioned data

  • Issue re-partition using meaningful partition key

  • Understand that shuffle occurs by explaining query

Testing Operations That Cause Shuffle in Apache Spark

In this video, we will test operations that cause shuffle in Apache Spark.

  • Use join for two DataFrames

  • Use two DataFrames that are partitioned differently

  • Test join that causes shuffle

Changing Design of Jobs with Wide Dependencies

In this video, we will change the design of jobs with wide dependencies.

  • Re-partition DataFrames using a common partition key

  • Understand join with pre-partitioned data

  • Understand why we avoided shuffle

Using keyBy() Operations to Reduce Shuffle

In this video, we will use keyBy() operation to reduce shuffle.

  • Load randomly partitioned data

  • Learn to pre-partition data in a meaningful way

  • Leverage keyBy() function

Using Custom Partitioner to Reduce Shuffle

In this video, we will use a custom partitioner to reduce shuffle.

  • Implement custom partitioner

  • Use the partitioner with Spark's partitionBy method

  • Validate that our data was partitioned properly

Saving Data in Plain Text

In this video, we will learn to save data in plain text.

  • Load plain text data

  • Test your data

Leveraging JSON as a Data Format

In this video, we will learn to leverage JSON as a data format.

  • Load JSON data

  • Test your data

Tabular Formats – CSV

In this video, we will learn about tabular formats.

  • Load CSV data

  • Test your data

Using Avro with Spark

In this video, we will learn to use Avro with Spark.

  • Understand how to save data in Avro

  • Load Avro data

  • Test your data

Columnar Formats – Parquet

In this video, we will learn to use columnar formats like Parquet.

  • Learn how to save data in Parquet

  • Load Parquet data

  • Test your data

Available Transformations on Key/Value Pairs

In this video, we will learn the available transformations on key/value pairs.

  • Use countByKey()

  • Examine your data

Using aggregateByKey Instead of groupBy()

In this video, we will learn to use aggregateByKey instead of groupBy().

  • Understand why we should not use groupByKey

  • Learn what aggregateByKey gives us

  • Implement logic using aggregateByKey
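The reasoning behind preferring aggregateByKey can be sketched in plain Python. This is a model of the semantics under simplifying assumptions (partitions are plain lists, all names are hypothetical), not Spark's implementation: each partition first folds its own values into a small accumulator per key, and only those accumulators are merged across partitions, so the full list of values per key never has to be gathered in one place as a group-by-key would require.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Fold values per key inside each partition, then merge the partial results."""
    partials = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            # seq_op merges one value into the partition-local accumulator.
            local[key] = seq_op(local.get(key, zero), value)
        partials.append(local)

    merged = {}
    for local in partials:
        for key, acc in local.items():
            # comb_op merges accumulators coming from different partitions.
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


# Hypothetical data split over two partitions.
partitions = [[("a", 1), ("b", 4), ("a", 3)], [("a", 5), ("b", 2)]]

# The accumulator is a (sum, count) pair, from which an average can be derived.
sum_count = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {key: s / c for key, (s, c) in sum_count.items()}
```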

Actions on Key/Value Pairs

In this video, we will learn to use actions on key/value pairs.

  • Examine actions on K/V pairs

  • Use collect

  • Examine output for K/V RDD

Available Partitioners on Key/Value Data

In this video, we will look at the available partitioners on key/value data.

  • Examine HashPartitioner

  • Examine RangePartitioner

  • Test your data
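The two partitioning strategies named above can be contrasted with a short plain-Python sketch. It shows the general idea only: a hash partitioner takes a hash of the key modulo the number of partitions, while a range partitioner places each key in the range whose bounds enclose it. The function names, the CRC32 hash and the fixed bounds are assumptions made for this example; a real range partitioner would typically derive its bounds by sampling the data.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Stable hash of the key modulo the number of partitions."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Index of the first range whose upper bound is >= key."""
    return bisect.bisect_left(upper_bounds, key)


keys = [3, 17, 42, 58, 99]
bounds = [25, 50]  # three ranges: up to 25, 26 to 50, above 50

by_range = [range_partition(k, bounds) for k in keys]  # keeps neighbors together
by_hash = [hash_partition(k, 3) for k in keys]         # spreads keys by hash
```

Range partitioning keeps ordered keys in the same partition, which helps sorted or range queries; hash partitioning ignores order and aims for an even spread.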

Implementing Custom Partitioner

In this video, we will learn to implement custom partitioner.

  • Learn to implement range partitioning

  • Test our partitioner

Separating Logic from Spark Engine – Unit Testing

In this video, we will be creating components with logic. Then we will perform unit testing of the components and lastly we will use case class from model.

  • Create component with logic

  • Perform unit testing of the component

  • Use case class from model

Integration Testing Using SparkSession

In this video, we will acquire skills to perform integration testing using SparkSession.

  • Leverage SparkSession for integration testing

  • Use the unit-tested component

Mocking Data Sources Using Partial Functions

This video will take you through mocking data sources using partial functions.

  • Create a Spark component that reads from Hive

  • Mock the component

  • Test mocked component

Using ScalaCheck for Property-Based Testing

In this video, we will be using ScalaCheck for property-based testing.

  • Apply property based testing

  • Create property based test
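The idea of a property-based test can be sketched without ScalaCheck. The following plain-Python example is a hand-rolled illustration, not the course's code: instead of checking a few fixed inputs, it generates many random inputs and checks that a property holds for all of them, returning the first counterexample if one is found. The function under test, the generator and the property are all hypothetical.

```python
import random


def normalize_tags(tags):
    """Hypothetical job logic: trim, lower-case, drop empties, de-duplicate, sort."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def check_property(prop, generator, runs=200, seed=7):
    """Return the first generated value that violates prop, or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = generator(rng)
        if not prop(value):
            return value
    return None


def random_tags(rng):
    """Generate a short list of short strings, including blanks and mixed case."""
    alphabet = "abAB "
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 4)))
        for _ in range(rng.randint(0, 6))
    ]


def is_idempotent(tags):
    # Property: normalizing twice gives the same result as normalizing once.
    return normalize_tags(normalize_tags(tags)) == normalize_tags(tags)


counterexample = check_property(is_idempotent, random_tags)
```

A dedicated library adds what this sketch lacks, most notably shrinking a failing input to a minimal counterexample.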

Testing in Different Versions of Spark

In the last video of the section, we will learn to apply the tests in different versions of Spark.

  • Resisting component to work with Spark pre-2.x

  • Mock testing pre-2.x

  • RDD mock testing

Creating Graph from Datasource

In this video, we will learn to create a graph from datasource.

  • Create loader component

  • Revisit graph file format

  • Load Spark graph from file

Using Vertex API

In this video, we will learn to use Vertex API.

  • Construct graph Using Vertex

  • Use Vertex

  • Leverage Vertex transformations

Using Edge API

In this video, we will learn to use Edge API.

  • Construct graph using Edge

  • Use Edge

  • Leverage Edge transformations

Calculate Degree of Vertex

In this video, we will learn to calculate degree of Vertex.

  • Calculate degree

  • In-Degree

  • Out-Degree
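For a directed graph stored as a list of edges, the three degree measures can be computed as in the plain-Python sketch below. It is an illustration with hypothetical names and data, not the GraphX API: the out-degree counts edges leaving a vertex, the in-degree counts edges arriving at it, and the degree is the sum of both.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree, total_degree) for directed (src, dst) edges."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    total = in_deg + out_deg
    return in_deg, out_deg, total


# Hypothetical graph with three vertices.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
in_deg, out_deg, total = degrees(edges)
```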

Calculate Page Rank

In this video, we will learn to calculate Page Rank.

  • Load data about users

  • Load data about followers

  • Use PageRank to calculate rank of users
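A minimal PageRank iteration over follower data might look like the plain-Python sketch below. It is a simplified illustration, not the GraphX implementation: the damping factor of 0.85, the fixed iteration count, the even redistribution from vertices without outgoing edges, and all names and data are assumptions. Ranks here are normalized to sum to 1, so their scale may differ from what a library reports, but the ordering of users is what matters.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Rank vertices of a directed graph given as (follower, followed) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A vertex without outgoing edges spreads its rank over all vertices.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical follower relationships: (follower, followed user).
followers = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("alice", "carol"),
    ("dave", "alice"),
]
ranks = page_rank(followers)
top_user = max(ranks, key=ranks.get)
```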

Test Your Knowledge

Troubleshooting Apache Spark

The Course Overview

This video will give you an overview of the course.

Eager Computations: Lazy Evaluation

In this video, we will be solving eager computations with lazy evaluation.

  • What is a Transformation?

  • Why are my transformations not executed?

  • Trigger transformations using actions

Caching Values: In-Memory Persistence

In this video, we will be solving slow-running jobs by using in-memory Persistence.

  • Problem with data re-computation

  • Use the cache() function

  • Use the persist() function

Unexpected API Behavior: Picking the Proper RDD API

In this video, we will be alleviating unexpected API behavior by picking the proper RDD API.

  • How to speed up transform/filter queries

  • The ordering of operators matters

  • Performance test of our improvement

Wide Dependencies: Using Narrow Dependencies

We will learn to reduce wide dependencies using narrow dependencies.

  • What is a narrow dependency?

  • What is a wide dependency?

  • How to avoid wide dependencies?

Making Computations Parallel: Using Partitions

In this video, we will learn to solve slow jobs using partitions.

  • Examine the number of partitions of RDD

  • Use the coalesce() method

  • Use the repartition() method

Defining Robust Custom Functions: Understanding User-Defined Functions

In this video, we will learn the technique of extending the DataFrame API with UDF functions.

  • Use the DataFrame API

  • Create a UDF Function

  • Register UDF for a usage in the DF API

Logical Plans Hiding the Truth: Examining the Physical Plans

In this video, we will understand jobs by examining their physical and logical plans.

  • Examine logical and physical plans of DF

  • Examine execution plan of RDDs

Slow Interpreted Lambdas: Code Generation Spark Optimization

In this video, we will be replacing slow interpreted lambdas using Spark Optimizer.

  • Delve into the Optimizer class

  • Bytecode generation

Avoid Wrong Join Strategies: Using a Join Type Based on Data Volume

In this video, we will learn to avoid wrong join strategies by using a join type based on data volume.

  • Understand inner join

  • Understand left/right join

  • Understand outer join

Slow Joins: Choosing an Execution Plan for Join

In this video, we will discover techniques to solve the slow joins problem by choosing the proper execution plan.

  • Use custom partitioner during join

  • How to join a smaller dataset with a bigger one?

Distributed Joins Problem: DataFrame API

In this video, we will perform distributed joins using DataFrame.

  • Use DataFrame to perform join

  • Perform inner join

  • Perform outer/left/right join

TypeSafe Joins Problem: The Newest DataSet API

In this video, we will perform distributed joins using DataSet.

  • How to perform type-safe joins?

  • Use DataSet to join

Minimizing Object Creation: Reusing Existing Objects

In this video, we will make jobs memory efficient by reusing existing objects.

  • How to minimize object creation

  • Use aggregateByKey

  • Use mutable state passed to Spark API

Iterating Transformations – The mapPartitions() Method

In this video, we will iterate over specific partitions by using mapPartitions().

  • Understand what can be inside of a partition

  • Perform operations partition wise using mapPartitions
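The benefit of working partition-wise can be sketched in plain Python. This is a model with hypothetical names, not Spark code: the function receives an iterator over a whole partition, so any costly setup (a parser, a connection) happens once per partition instead of once per element.

```python
def map_partitions(partitions, fn):
    """Apply fn once per partition; fn receives an iterator over that partition."""
    return [list(fn(iter(partition))) for partition in partitions]


setups = []


def parse_with_shared_parser(rows):
    # Stands in for creating an expensive resource once per partition.
    setups.append("parser created")
    for row in rows:
        yield int(row)


# Hypothetical data split over two partitions.
partitions = [["1", "2", "3"], ["4", "5"]]
parsed = map_partitions(partitions, parse_with_shared_parser)
```

With five elements in two partitions, the setup step runs twice rather than five times.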

Slow Spark Application Start: Reducing Setup Overhead

In this video, we will learn to debug Spark Start by introducing accumulators.

  • Use accumulators

  • Add metrics using accumulators

Performing Unnecessary Recomputation: Reusing RDDs

In this video, we will explore ways to avoid recomputation with RDDs multiple times by using caching.

  • Use the Spark API to favour reusability of RDDs

  • Use StorageLevel

  • Use Checkpointing

Repeating the Same Code in Stream Pipeline: Using Sources and Sinks

In this video, we will create replaceable and reusable sinks and sources.

  • Detect Missing Values (NaN)

  • Leverage the IsNull() helper method

  • Install pandas

Long Latency of Jobs: Understanding Batch Internals

In this video, we will be reducing the time of batch jobs using Spark micro-batch approach.

  • Make NaN meaningful to processing

  • Replacing NaN with scalar value

Fault Tolerance: Using Data Checkpointing

In this video, we will make jobs fault tolerant by introducing a checkpoint mechanism.

  • Define the Ad Validator module

  • Understand what a backward fill is

  • Understand what a forward fill is

Maintaining Batch and Streaming: Using Structured Streaming Pros

In this video, we will learn to create one code base for stream and batch using structured streaming.

  • Handle an outlier by replacing it with meaningful name

  • Implement logic using replace

Test Your Knowledge
12 hours on-demand video