Apache Spark: Tips, Tricks, & Techniques

Discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs
Packt Publishing
Compose Spark jobs from actions and transformations
Create highly concurrent Spark programs by leveraging immutability
Avoid the most expensive operation in the Spark API: the shuffle
Save data for further processing by picking the proper data format supported by Spark
Parallelize keyed data; learn how to use Spark's Key/Value API
Re-design your jobs to use reduceByKey instead of groupBy
Create robust processing pipelines by testing Apache Spark jobs
Solve repeated problems by leveraging the GraphX API

Apache Spark has been around for quite some time, but do you really know how to get the most out of Spark? This course aims to open up new possibilities: you will explore many aspects of Spark, including some you may never have heard of.

In this course you’ll learn to implement practical, proven techniques that improve particular aspects of programming and administration in Apache Spark. The course has 7 sections, each addressing a different aspect of Spark through 5 specific techniques, with clear, hands-on instructions for carrying out Apache Spark tasks. The techniques are demonstrated using practical examples and best practices.

By the end of this course, you will have learned some exciting tips, best practices, and techniques with Apache Spark. You will be able to perform tasks and get the best data out of your databases much faster and with ease.

About the Author

Tomasz Lelek is a Software Engineer, programming mostly in Java and Scala. He is a fan of microservice architectures and functional programming. He dedicates considerable time and effort to getting better every day. He is passionate about nearly everything associated with software development, and believes that we should always try to consider different solutions and approaches before solving a problem. Recently he was a speaker at conferences in Poland, Confitura and JDD (Java Developers Day), and also at the Krakow Scala User Group. He has also conducted a live coding session at the Geecon Conference.

He is a co-founder of initlearn, an e-learning platform that was built with the Java language.

Transformations and Actions

The Course Overview

This video provides an overview of the entire course.

Using Spark Transformations to Defer Computations to a Later Time

In this video, we will use Spark transformations to defer computations to a later time.

   •  Understand Spark DAG creation

   •  Execute DAG by issuing action

   •  Defer decision about starting job until the last possible moment
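
The idea behind these three points can be sketched without Spark at all. The block below is a minimal illustration in plain Python, not the Spark API: `LazyPipeline` and `traced_double` are hypothetical names invented for this sketch. Transformations only record steps, and the recorded chain runs when an action (`collect`) is called.

```python
from typing import Callable, Iterable, Iterator


class LazyPipeline:
    """Toy stand-in for an RDD (hypothetical): records steps, runs them on an action."""

    def __init__(self, source: Iterable, steps: tuple = ()):
        self._source = source
        self._steps = steps  # the recorded plan (a linear chain in this sketch)

    def map(self, fn: Callable) -> "LazyPipeline":
        # Transformation: returns a new pipeline and computes nothing yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyPipeline":
        # Transformation: also only extends the recorded plan.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self) -> list:
        # Action: only now is the recorded chain executed.
        data: Iterator = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every time the mapped function actually runs


def traced_double(x: int) -> int:
    calls.append(x)
    return x * 2


pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: no element has been processed
result = pipeline.collect()       # the action triggers the computation
```

In Spark the same split applies: calls such as `map` and `filter` build the plan, and an action such as `collect` or `count` starts the job.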

Avoiding Transformations

In this video, we will learn how to avoid transformations.

   •  Understand groupBy API

   •  Use cache() function

   •  Avoid skewed partitions

Using reduce and reduceByKey to Calculate Results

In this video, we will use reduce and reduceByKey to calculate results.

   •  Understand reduce behavior

   •  Use reduce() function

   •  Use reduceByKey() function
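
As a rough illustration of the difference between the two operations, here is a plain-Python sketch (not Spark code). `reduce_by_key` is a hypothetical helper written for this example: `reduce` folds a whole collection into one value, while the by-key variant folds values separately for each key.

```python
from functools import reduce
from typing import Callable, Dict, Hashable, Iterable, Tuple


def reduce_by_key(pairs: Iterable[Tuple[Hashable, int]],
                  fn: Callable[[int, int], int]) -> Dict[Hashable, int]:
    """Merge the values of each key with fn (hypothetical helper, not Spark)."""
    merged: Dict[Hashable, int] = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged in.
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# reduce: the whole collection collapses to a single value.
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])

# reduce by key: one result per key.
visits = [("alice", 1), ("bob", 1), ("alice", 1)]
visits_per_user = reduce_by_key(visits, lambda a, b: a + b)
```

In Spark, `reduce` is an action that returns a value to the driver, whereas `reduceByKey` is a transformation that produces a new keyed dataset and can merge values inside each partition before data is shuffled.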

Performing Actions That Trigger Computations

In this video, we will understand what an action can be in Spark.

   •  Get a walkthrough of all the actions

   •  Perform tests

Reusing the Same RDD for Different Actions

In this video, we reuse the same RDD for different actions.

   •  Minimize execution time by reuse of RDD

   •  Introduce caching

   •  Perform tests
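
Why caching shortens execution time when one dataset feeds several actions can be shown with a small plain-Python model. `ToyDataset` and `expensive_load` are hypothetical names for this sketch, and the behaviour is a simplification of what Spark does.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy model of a dataset whose contents are recomputed on every action."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._use_cache = False
        self._cached: Optional[List[int]] = None

    def cache(self) -> "ToyDataset":
        # Only marks the dataset; data is stored when the first action runs.
        self._use_cache = True
        return self

    def _materialize(self) -> List[int]:
        if not self._use_cache:
            return self._compute()          # recompute for every action
        if self._cached is None:
            self._cached = self._compute()  # compute once, then reuse
        return self._cached

    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


compute_runs = []  # one entry per actual computation


def expensive_load() -> List[int]:
    compute_runs.append(1)
    return [1, 2, 3, 4]


uncached = ToyDataset(expensive_load)
uncached.count()
uncached.total()
runs_without_cache = len(compute_runs)  # two actions, two computations

compute_runs.clear()
cached = ToyDataset(expensive_load).cache()
cached_count = cached.count()
cached_total = cached.total()
runs_with_cache = len(compute_runs)     # two actions, one computation
```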

Immutable Design

Delve into Spark RDDs Parent/Child Chain

In this video, we will delve into Spark RDDs parent/child chain.

   •  Learn how to extend RDD

   •  Chain the new RDD with a parent

   •  Test our custom RDD

Using RDD in an Immutable Way

In this video, we will use RDD in an immutable way.

   •  Understand DAG immutability

   •  Create two leaves from one root RDD

   •  Examine results from both leaves

Using DataFrame Operations to Transform It

In this video, we will use DataFrame operations to transform a DataFrame.

   •  Understand DataFrame immutability

   •  Create two leaves from one root DataFrame

   •  Add a new column by issuing transformation

Immutability in the Highly Concurrent Environment

In this video, we will learn about immutability in a highly concurrent environment.

   •  Understand cons of mutable collections

   •  Create two threads that simultaneously modify mutable collection

   •  Understand the reasoning about a concurrent program
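
A small sketch of the immutable alternative, in plain Python and under the assumption that a tuple stands in for an immutable dataset: two threads read the same root and each derives its own new value, so there is no shared state to corrupt. The names `root` and `derive` are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

root: Tuple[int, ...] = (1, 2, 3)  # tuples cannot be modified in place


def derive(extra: int) -> Tuple[int, ...]:
    # Builds a new tuple; the shared root is only ever read.
    return root + (extra,)


with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in the order of the inputs.
    leaf_a, leaf_b = pool.map(derive, [10, 20])
```

Because neither thread can change `root`, the result does not depend on how the threads are scheduled, which is what makes the program easy to reason about.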

Using Dataset API in an Immutable Way

In this video, we will use the Dataset API in an immutable way.

   •  Understand dataset immutability

   •  Create two leaves from one root dataset

   •  Add new columns by issuing transformation

Avoid Shuffle and Reduce Operational Expenses

Detecting a Shuffle in Processing

In this video, we will learn to detect a shuffle in processing.

   •  Load randomly partitioned data

   •  Issue re-partition using meaningful partition key

   •  Understand that shuffle occurs by explaining query

Testing Operations That Cause Shuffle in Apache Spark

In this video, we will test operations that cause shuffle in Apache Spark.

   •  Use join for two DataFrames

   •  Use two DataFrames that are partitioned differently

   •  Test join that causes shuffle

Changing Design of Jobs with Wide Dependencies

In this video, we will change the design of jobs with wide dependencies.

   •  Re-partition DataFrames using common partition key

   •  Understand join with pre-partitioned data

   •  Understand why we avoided shuffle

Using keyBy() Operations to Reduce Shuffle

In this video, we will use keyBy() operation to reduce shuffle.

   •  Load randomly partitioned data

   •  Learn to pre-partition data in a meaningful way

   •  Leverage keyBy() function

Using Custom Partitioner to Reduce Shuffle

In this video, we will use a custom partitioner to reduce shuffle.

   •  Implement custom partitioner

   •  Use partitioner with partitionBy method on Spark

   •  Validate that our data was partitioned properly
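
The three steps above can be sketched in plain Python. This is a conceptual model, not Spark code: `FirstLetterPartitioner` and `partition_by` are hypothetical names, loosely mirroring the idea that a partitioner exposes a partition count and maps each key to a partition index.

```python
from typing import List, Tuple


class FirstLetterPartitioner:
    """Hypothetical partitioner: routes string keys by their first letter."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions

    def get_partition(self, key: str) -> int:
        # Keys starting with the same letter (case-insensitive) share a partition.
        return ord(key[0].lower()) % self.num_partitions


def partition_by(pairs: List[Tuple[str, int]],
                 partitioner: FirstLetterPartitioner) -> List[List[Tuple[str, int]]]:
    """Distribute key/value pairs into buckets chosen by the partitioner."""
    partitions: List[List[Tuple[str, int]]] = [
        [] for _ in range(partitioner.num_partitions)
    ]
    for key, value in pairs:
        partitions[partitioner.get_partition(key)].append((key, value))
    return partitions


data = [("apple", 1), ("avocado", 2), ("banana", 3), ("Blueberry", 4)]
partitions = partition_by(data, FirstLetterPartitioner(num_partitions=2))

# Validation step: every first letter ends up in exactly one partition.
letters_per_partition = [{key[0].lower() for key, _ in part} for part in partitions]
```

Once related keys are known to live in the same partition, later per-key operations no longer need to move those records across the cluster.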

Saving Data in the Correct Format

Saving Data in Plain Text

In this video, we will learn to save data in plain text.

   •  Load plain text data

   •  Test your data

Leveraging JSON as a Data Format

In this video, we will learn to leverage JSON as a data format.

   •  Load JSON data

   •  Test your data

Tabular Formats – CSV

In this video, we will learn about tabular formats.

   •  Load CSV data

   •  Test your data

Using Avro with Spark

In this video, we will learn to use Avro with Spark.

   •  Understand how to save data in Avro

   •  Load Avro data

   •  Test your data

Columnar Formats – Parquet

In this video, we will learn to use columnar formats like Parquet.

   •  Learn how to save data in Parquet

   •  Load Parquet data

   •  Test your data

Working with Spark Key/Value API

Available Transformations on Key/Value Pairs

In this video, we will learn the available transformations on key/value pairs.

   •  Use countByKey()

   •  Examine your data

Using aggregateByKey Instead of groupBy()

In this video, we will learn to use aggregateByKey instead of groupBy().

   •  Understand why we should not use groupByKey

   •  Learn what aggregateByKey gives us

   •  Implement logic using aggregateByKey
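
To make the shape of an aggregate-by-key computation concrete, here is a plain-Python sketch. The function `aggregate_by_key` and the sample data are assumptions for this example, not Spark's implementation. It uses a zero value, a function that folds one value into a per-partition accumulator, and a function that merges accumulators across partitions, so only small partial results need to be combined.

```python
from typing import Callable, Dict, List, Tuple

Acc = Tuple[int, int]  # (running sum, running count)


def aggregate_by_key(partitions: List[List[Tuple[str, int]]],
                     zero: Acc,
                     seq_op: Callable[[Acc, int], Acc],
                     comb_op: Callable[[Acc, Acc], Acc]) -> Dict[str, Acc]:
    """Aggregate per key inside each partition, then merge the partial results."""
    partials: List[Dict[str, Acc]] = []
    for part in partitions:
        acc: Dict[str, Acc] = {}
        for key, value in part:
            # Start from the zero value the first time a key is seen here.
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    merged: Dict[str, Acc] = {}
    for acc in partials:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Two "partitions" of (key, value) pairs; the goal is the average per key.
partitions = [[("a", 2), ("b", 4)], [("a", 6)]]
sums_and_counts = aggregate_by_key(
    partitions,
    zero=(0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
    comb_op=lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {key: s / c for key, (s, c) in sums_and_counts.items()}
```

Grouping first would instead collect every value of a key in one place before computing anything, which is the cost this approach avoids.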

Actions on Key/Value Pairs

In this video, we will learn to use actions on key/value pairs.

   •  Examine actions on K/V pairs

   •  Use collect

   •  Examine output for K/V RDD

Available Partitioners on Key/Value Data

In this video, we will look at the available partitioners on key/value data.

   •  Examine HashPartitioner

   •  Examine RangePartitioner

   •  Test your data
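
The two strategies can be contrasted with a small plain-Python sketch. The functions `hash_partition` and `range_partition` are hypothetical. Spark's range partitioner derives its boundaries by sampling the data, whereas here the boundaries are simply given.

```python
from bisect import bisect_left
from typing import List


def hash_partition(key: int, num_partitions: int) -> int:
    # Hash strategy: spreads keys by hash value; neighbours may be separated.
    return hash(key) % num_partitions


def range_partition(key: int, upper_bounds: List[int]) -> int:
    # Range strategy: partition i holds keys up to upper_bounds[i];
    # keys above the last bound go to the final partition.
    return bisect_left(upper_bounds, key)


keys = [1, 7, 12, 25, 30]
by_hash = [hash_partition(k, 3) for k in keys]
by_range = [range_partition(k, [10, 20]) for k in keys]
```

With range partitioning, the partition index grows with the key, so sorted output can be produced partition by partition. Hash partitioning gives no such ordering but needs no knowledge of the key distribution.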

Implementing Custom Partitioner

In this video, we will learn to implement a custom partitioner.

   •  Learn to implement range partitioning

   •  Test our partitioner

Testing Apache Spark Jobs

Separating Logic from Spark Engine – Unit Testing

In this video, we will create components that contain the logic. Then we will unit test those components and, lastly, use a case class from the model.

   •  Create component with logic

   •  Perform unit testing of the component

   •  Use case class from model
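
The pattern can be sketched in plain Python, with a frozen dataclass playing the role that a case class plays in Scala. `UserTransaction` and `bonus_for` are hypothetical names and the business rule is invented for illustration. The point is that the logic is an ordinary function that can be tested without starting any engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserTransaction:
    """Immutable model object (hypothetical), analogous to a Scala case class."""
    user_id: str
    amount: int


def bonus_for(tx: UserTransaction) -> int:
    """Pure business rule (made up for this sketch): 10% bonus from 100 upwards."""
    return tx.amount // 10 if tx.amount >= 100 else 0


def test_bonus_for() -> None:
    # Unit test of the component alone; no cluster or session is involved.
    assert bonus_for(UserTransaction("u1", 250)) == 25
    assert bonus_for(UserTransaction("u2", 100)) == 10
    assert bonus_for(UserTransaction("u3", 99)) == 0


test_bonus_for()
```

A job would then apply the same function inside a transformation, and only that thin wiring layer needs an integration test.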

Integration Testing Using SparkSession

In this video, we will acquire skills to perform integration testing using SparkSession.

   •  Leverage SparkSession for integration testing

   •  Use unit-tested component

Mocking Data Sources Using Partial Functions

This video will take you through mocking data sources using partial functions.

   •  Create a Spark component that reads from Hive

   •  Mock the component

   •  Test mocked component

Using ScalaCheck for Property-Based Testing

In this video, we will be using ScalaCheck for property-based testing.

   •  Apply property-based testing

   •  Create property-based test

Testing in Different Versions of Spark

In the last video of the section, we will learn to apply the tests in different versions of Spark.

   •  Redesign component to work with Spark pre-2.x

   •  Mock testing pre-2.x

   •  RDD mock testing

Leveraging Spark GraphX API

Creating Graph from Datasource

In this video, we will learn to create a graph from datasource.

   •  Create loader component

   •  Revisit graph file format

   •  Load Spark graph from file

Using Vertex API

In this video, we will learn to use Vertex API.

   •  Construct graph using Vertex

   •  Use Vertex

   •  Leverage Vertex transformations

Using Edge API

In this video, we will learn to use Edge API.

   •  Construct graph using Edge

   •  Use Edge

   •  Leverage Edge transformations

Calculate Degree of Vertex

In this video, we will learn to calculate degree of Vertex.

   •  Calculate degree

   •  In-Degree

   •  Out-Degree
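
What the three degree measures mean can be shown on a tiny edge list in plain Python. The data is made up for illustration and this is not the GraphX API.

```python
from collections import Counter

# Directed edges as (source vertex id, destination vertex id).
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]

out_degree = Counter(src for src, _ in edges)  # edges leaving each vertex
in_degree = Counter(dst for _, dst in edges)   # edges arriving at each vertex
degree = out_degree + in_degree                # all edges touching each vertex
```

Here vertex 1 has two outgoing edges and one incoming edge, so its total degree is three.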

Calculate Page Rank

In this video, we will learn to calculate Page Rank.

   •  Load data about users

   •  Load data about followers

   •  Use PageRank to calculate rank of users
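
For readers who want to see the ranking idea itself, below is a minimal textbook-style PageRank in plain Python. It is a sketch under stated assumptions: the user names and follower pairs are invented, the damping factor and iteration count are arbitrary defaults, and the normalisation differs from what GraphX reports.

```python
from typing import Dict, List, Tuple


def page_rank(follows: List[Tuple[str, str]],
              damping: float = 0.85,
              iterations: int = 50) -> Dict[str, float]:
    """Iterative PageRank over (follower, followed) pairs; ranks sum to 1."""
    users = sorted({user for edge in follows for user in edge})
    out_links: Dict[str, List[str]] = {user: [] for user in users}
    for follower, followed in follows:
        out_links[follower].append(followed)

    rank = {user: 1.0 / len(users) for user in users}
    for _ in range(iterations):
        # Every user receives a small base share each round.
        new_rank = {user: (1 - damping) / len(users) for user in users}
        for user in users:
            # A user following nobody spreads rank evenly over everyone.
            targets = out_links[user] or users
            share = damping * rank[user] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical follower data: bob and carol follow alice, alice follows bob.
follows = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
ranks = page_rank(follows)
most_influential = max(ranks, key=ranks.get)
```

A user followed by many others, or by highly ranked users, ends up with a higher rank. Here that is `alice`, while `carol`, whom nobody follows, keeps only the base share.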




2 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion