3.5 out of 5 (1 review on Udemy)

Troubleshooting Apache Spark

Quick, simple solutions to common development issues and debugging techniques with Apache Spark.
Instructor:
Packt Publishing
15 students enrolled
English [Auto-generated]
Solve long-running computation problems by leveraging lazy evaluation in Spark
Avoid memory leaks by understanding the internal memory management of Apache Spark
Fix pipelines that fail to scale out by using partitions
Debug and create user-defined functions that enrich the Spark API
Choose a proper join strategy depending on the characteristics of your input data
Troubleshoot the join APIs: DataFrames and DataSets
Write code that minimizes object creation using the proper API
Troubleshoot real-time pipelines written in Spark Streaming

Apache Spark has been around for quite some time, but do you really know how to solve the development issues and problems you face with it? This course covers many aspects of Apache Spark: some you may know, and some you probably never knew existed. Without a grasp of its internals, you spend too long learning and performing routine tasks, cannot leverage Apache Spark's full capabilities, and hit roadblocks in your development journey. Common problems and bugs keep you from optimizing your development process, and you end up hunting for techniques that can save you from pitfalls and common errors. With this course, you will learn practical, proven techniques for improving particular aspects of Apache Spark.

To troubleshoot effectively, you need to understand the common problems and issues Spark developers face, collate them, and build simple solutions for them; one good way to find common issues is to browse Stack Overflow. This troubleshooting course highlights problems developers hit at different stages of application development and provides simple, practical solutions to each of them. Beyond fixing problems, the course also explores new possibilities with Apache Spark. By the end of this course, you will be able to solve your Spark problems without hassle.

About the Author

Tomasz Lelek is a software engineer, programming mostly in Java and Scala. He is a fan of microservice architectures and functional programming, and dedicates considerable time and effort to getting better every day. He is passionate about nearly everything associated with software development, and believes we should always consider different solutions and approaches before solving a problem. He was recently a speaker at conferences in Poland, including Confitura and JDD (Java Developers Day), as well as at the Krakow Scala User Group, and has conducted a live coding session at the GeeCON conference.

Common Problems and Troubleshooting the Spark Distributed Engine

1
The Course Overview

This video will give you an overview of the course.

2
Eager Computations: Lazy Evaluation

In this video, we will solve eager-computation problems with lazy evaluation.

   •  What is a Transformation?

   •  Why are my transformations not executed?

   •  Trigger transformations using actions

3
Caching Values: In-Memory Persistence

In this video, we will solve slow-running jobs by using in-memory persistence.

   •  Problem with data re-computation

   •  Use the cache() function

   •  Use the persist() function

4
Unexpected API Behavior: Picking the Proper RDD API

In this video, we will be alleviating unexpected API behavior by picking the proper RDD API.

   •  How to speed up transform/filter queries

   •  The ordering of operators matters

   •  Performance test of our improvement

5
Wide Dependencies: Using Narrow Dependencies

We will learn to reduce wide dependencies using narrow dependencies.

   •  What is a narrow dependency?

   •  What is a wide dependency?

   •  How to avoid wide dependencies?

Distributed DataFrames Optimization Pitfalls

1
Making Computations Parallel: Using Partitions

In this video, we will learn to solve slow jobs using partitions.

   •  Examine the number of partitions of RDD

   •  Use the coalesce() method

   •  Use the repartition() method

2
Defining Robust Custom Functions: Understanding User-Defined Functions

In this video, we will learn the technique of extending the DataFrame API with user-defined functions (UDFs).

   •  Use the DataFrame API

   •  Create a UDF Function

   •  Register the UDF for use in the DataFrame API

3
Logical Plans Hiding the Truth: Examining the Physical Plans

In this video, we will understand jobs by examining their physical and logical plans.

   •  Examine the logical and physical plans of a DataFrame

   •  Examine execution plan of RDDs

4
Slow Interpreted Lambdas: Code Generation Spark Optimization

In this video, we will replace slow interpreted lambdas using the Spark optimizer.

   •  Delve into the Optimizer class

   •  Bytecode generation

Distributed Joins in Cluster

1
Avoid Wrong Join Strategies: Using a Join Type Based on Data Volume

In this video, we will learn to avoid wrong join strategies by using a join type based on data volume.

   •  Understand inner join

   •  Understand left/right join

   •  Understand outer join

2
Slow Joins: Choosing an Execution Plan for Join

In this video, we will discover techniques to solve the slow joins problem by choosing the proper execution plan.

   •  Use custom partitioner during join

   •  How to join a smaller dataset with a bigger one?

3
Distributed Joins Problem: DataFrame API

In this video, we will perform distributed joins using DataFrame.

   •  Use DataFrame to perform join

   •  Perform inner join

   •  Perform outer/left/right join

4
TypeSafe Joins Problem: The Newest DataSet API

In this video, we will perform distributed joins using DataSet.

   •  How to perform type-safe joins?

   •  Use DataSet to join

Solving Problems with Non-Efficient Transformations

1
Minimizing Object Creation: Reusing Existing Objects

In this video, we will make jobs memory efficient by reusing existing objects.

   •  How to minimize object creation

   •  Use aggregateByKey

   •  Use mutable state passed to Spark API

2
Iterating Transformations – The mapPartitions() Method

In this video, we will iterate over specific partitions by using mapPartitions().

   •  Understand what can be inside a partition

   •  Perform operations partition-wise using mapPartitions

3
Slow Spark Application Start: Reducing Setup Overhead

In this video, we will learn to debug a slow Spark application start by introducing accumulators.

   •  Use accumulators

   •  Add metrics using accumulators

4
Performing Unnecessary Recomputation: Reusing RDDs

In this video, we will explore ways to avoid recomputing RDDs multiple times by using caching.

   •  Use the Spark API to favor reusability of RDDs

   •  Use StorageLevel

   •  Use Checkpointing

Troubleshooting Real-Time Processing Jobs in Spark Streaming

1
Repeating the Same Code in Stream Pipeline: Using Sources and Sinks

In this video, we will create replaceable and reusable sinks and sources.

   •  Understand why stream pipelines repeat code

   •  Extract a reusable source

   •  Extract a reusable sink

2
Long Latency of Jobs: Understanding Batch Internals

In this video, we will reduce the latency of batch jobs using Spark's micro-batch approach.

   •  Understand how micro-batches are formed

   •  Reduce the latency of long-running jobs

3
Fault Tolerance: Using Data Checkpointing

In this video, we will make jobs fault-tolerant by introducing a checkpoint mechanism.

   •  Understand why streaming jobs need fault tolerance

   •  Enable data checkpointing

   •  Recover a job from a checkpoint

4
Maintaining Batch and Streaming: Using Structured Streaming Pros

In this video, we will learn to create one code base for stream and batch processing using Structured Streaming.

   •  Understand the pros of Structured Streaming

   •  Maintain a single code base for batch and streaming

Detailed Rating

5 stars: 0
4 stars: 0
3 stars: 1
2 stars: 0
1 star: 0
30-Day Money-Back Guarantee

Includes

2 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion