Troubleshooting Apache Spark
Apache Spark has been around for quite some time, but do you really know how to solve the development issues and problems you face with it? This course will open up new possibilities, covering many aspects of Apache Spark: some you may know, and some you probably never knew existed. If you spend too much time wrestling with routine tasks, you cannot leverage Apache Spark's full capabilities, and your development journey hits a roadblock. Common problems and bugs will keep you from optimizing your development process, so you need techniques that steer you clear of the usual pitfalls and errors. With this course you'll learn practical, proven, well-researched techniques for improving specific aspects of your Apache Spark workloads.
You need to understand the common problems and issues Spark developers face, collate them, and build simple solutions for them. One good way to discover common issues is to browse Stack Overflow questions. This is a high-quality troubleshooting course: it highlights issues developers face at different stages of application development and provides simple, practical solutions to them. Beyond solving specific problems and challenges, the course also focuses on discovering new possibilities with Apache Spark. By the end of this course, you will be able to solve your Spark problems without any hassle.
About the Author
Tomasz Lelek is a software engineer, programming mostly in Java and Scala. He is a fan of microservice architectures and functional programming, and dedicates considerable time and effort to getting better every day. He is passionate about nearly everything associated with software development, and believes that we should always try to consider different solutions and approaches before solving a problem. Recently he spoke at conferences in Poland, Confitura and JDD (Java Developers Day), as well as at the Krakow Scala User Group, and he has conducted a live coding session at the GeeCON conference.
Common Problems and Troubleshooting the Spark Distributed Engine
This video will give you an overview of the course.
In this video, we will be solving eager computations with lazy evaluation.
• What is a Transformation?
• Why are my transformations not executed?
• Trigger transformations using actions
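The laziness described above can be made visible with a small spark-shell style sketch (the local[*] master and accumulator-based counter are illustrative assumptions, not part of the course material):

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for demonstration; a real job would point at a cluster.
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

// Count how many times the map function actually runs.
val mapCalls = sc.longAccumulator("mapCalls")

// map() is a transformation: Spark only records it in the lineage graph.
val doubled = sc.parallelize(1 to 4).map { x => mapCalls.add(1); x * 2 }

// Nothing has executed yet -- transformations are lazy.
val callsBeforeAction = mapCalls.value  // 0

// collect() is an action: it triggers the whole pipeline.
val result = doubled.collect()          // mapCalls.value is now 4
```

If a transformation seems "not to run", the usual cause is simply that no action (collect, count, saveAsTextFile, …) has been called on it yet.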
In this video, we will be solving slow-running jobs by using in-memory persistence.
• Problem with data re-computation
• Use the cache() function
• Use the persist() function
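A minimal sketch of the cache()/persist() pattern (session setup and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
val sc = spark.sparkContext

val expensive = sc.parallelize(1 to 1000).map(x => x.toLong * x)

// Without persistence, every action below would recompute the map() from scratch.
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
expensive.cache()

// persist() lets you pick a storage level explicitly, e.g. spill to disk
// when the dataset does not fit in memory:
//   expensive.persist(StorageLevel.MEMORY_AND_DISK)

val total = expensive.sum()    // first action: computes the RDD and caches it
val count = expensive.count()  // second action: served from the cache
```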
In this video, we will be alleviating unexpected API behavior by picking the proper RDD API.
• How to speed up transform/filter queries
• The ordering of operators matters
• Performance test of our improvement
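The ordering point can be sketched like this: filtering before an expensive map() means the map runs on fewer records (the "expensive" transform here is a trivial stand-in):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ordering-demo").getOrCreate()
val sc = spark.sparkContext

// Stand-in for an expensive, parity-preserving transformation.
def expensiveTransform(x: Int): Int = x + 1000

val nums = sc.parallelize(1 to 100)

// Slower shape: transform everything, then discard most of it.
val slow = nums.map(expensiveTransform).filter(_ % 2 == 0)

// Faster shape: discard first, transform only what is left.
// Here the transform runs on 50 records instead of 100, with the same result.
val fast = nums.filter(_ % 2 == 0).map(expensiveTransform)
```

This reordering is only safe when the filter predicate gives the same answer before and after the map, as it does here.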
We will learn to reduce wide dependencies using narrow dependencies.
• What is a narrow dependency?
• What is a wide dependency?
• How to avoid wide dependencies?
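A short sketch of narrow versus wide dependencies, using the classic groupByKey-versus-reduceByKey contrast (data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dep-demo").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow dependency: each output partition depends on exactly one
// input partition, so no shuffle is needed.
val narrow = pairs.mapValues(_ * 2)

// Wide dependency: groupByKey shuffles every single value across the network.
val grouped = pairs.groupByKey().mapValues(_.sum)

// reduceByKey still has a wide dependency, but it combines values
// map-side first, so far less data crosses the shuffle boundary.
val reduced = pairs.reduceByKey(_ + _)
```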
Distributed DataFrames Optimization Pitfalls
In this video, we will learn to solve slow jobs using partitions.
• Examine the number of partitions of RDD
• Use the coalesce() method
• Use the repartition() method
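The three bullets above can be sketched in a few lines (partition counts are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("partition-demo").getOrCreate()
val sc = spark.sparkContext

// Explicitly create an RDD with 8 partitions.
val rdd = sc.parallelize(1 to 100, numSlices = 8)
rdd.getNumPartitions  // 8

// coalesce() narrows to fewer partitions without a full shuffle --
// good for shrinking an over-partitioned dataset cheaply.
val fewer = rdd.coalesce(2)

// repartition() can also increase the partition count,
// but it always pays the cost of a full shuffle.
val more = rdd.repartition(16)
```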
In this video, we will learn the technique of extending the DataFrame API with UDFs (user-defined functions).
• Use the DataFrame API
• Create a UDF Function
• Register the UDF for use in the DataFrame API
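A minimal sketch of both UDF styles, column-based and SQL-registered (the word-length UDF is an illustrative assumption):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

val df = Seq("spark", "udf").toDF("word")

// Column-based UDF, usable directly in DataFrame expressions.
val wordLen = udf((s: String) => s.length)
val withLen = df.withColumn("len", wordLen(col("word")))

// Registered UDF, usable from SQL.
spark.udf.register("wordLen", (s: String) => s.length)
df.createOrReplaceTempView("words")
val viaSql = spark.sql("SELECT word, wordLen(word) AS len FROM words")
```

One caveat worth keeping in mind: UDFs are opaque to the Catalyst optimizer, so prefer built-in functions when an equivalent exists.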
In this video, we will understand jobs by examining their physical and logical plans.
• Examine logical and physical plans of DF
• Examine execution plan of RDDs
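Both kinds of plan inspection can be sketched as follows (the tiny pipelines are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("plan-demo").getOrCreate()
val sc = spark.sparkContext

// DataFrame side: explain(true) prints the parsed, analyzed, and
// optimized logical plans plus the physical plan.
val df = spark.range(10).filter(col("id") > 5)
df.explain(true)

// RDD side: toDebugString shows the lineage, including shuffle boundaries.
val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(rdd.toDebugString)
```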
In this video, we will be replacing slow interpreted lambdas using the Spark Optimizer.
• Delve into the Optimizer class
• Bytecode generation
Distributed Joins in Cluster
In this video, we will learn to avoid wrong join strategies by using a join type based on data volume.
• Understand inner join
• Understand left/right join
• Understand outer join
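The three join types above can be sketched with pair RDDs (the key-value data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-join-demo").getOrCreate()
val sc = spark.sparkContext

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, "book"),  (3, "pen")))

// Inner join: only keys present on both sides (key 1).
val inner = users.join(orders)

// Left outer join: every user key survives; missing orders become None (keys 1, 2).
val left = users.leftOuterJoin(orders)

// Full outer join: every key from either side survives (keys 1, 2, 3).
val full = users.fullOuterJoin(orders)
```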
In this video, we will discover techniques to solve the slow joins problem by choosing the proper execution plan.
• Use custom partitioner during join
• How to join a smaller dataset with a bigger one?
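Both bullets can be sketched together: a custom partitioner that co-locates both RDD sides by key, and a broadcast hint for joining a small DataFrame with a big one (sizes and data are illustrative):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("join-plan-demo").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// RDD side: partitioning both inputs with the same partitioner up front
// means the join itself needs no additional shuffle.
val part     = new HashPartitioner(4)
val bigRdd   = sc.parallelize((1 to 1000).map(i => (i, s"row$i"))).partitionBy(part)
val smallRdd = sc.parallelize(Seq((1, "one"), (2, "two"))).partitionBy(part)
val rddJoined = bigRdd.join(smallRdd)

// DataFrame side: broadcast the small dataset to every executor
// instead of shuffling the big one across the network.
val big      = spark.range(1000).toDF("key")
val small    = Seq((1L, "one"), (2L, "two")).toDF("key", "name")
val dfJoined = big.join(broadcast(small), "key")
```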
In this video, we will perform distributed joins using DataFrame.
• Use DataFrame to perform join
• Perform inner join
• Perform outer/left/right join
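A compact sketch of the DataFrame join variants listed above (the two tiny tables are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("df-join-demo").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((2, "x"), (3, "y")).toDF("id", "r")

// Inner join (default): only id 2 matches.
val inner = left.join(right, Seq("id"))

// Left outer join: both left ids survive; unmatched columns are null.
val leftJoined = left.join(right, Seq("id"), "left")

// Full outer join: ids 1, 2, and 3 all survive.
val fullJoined = left.join(right, Seq("id"), "full_outer")
```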
In this video, we will perform distributed joins using DataSet.
• How to perform type-safe joins?
• Use DataSet to join
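Type-safe joins can be sketched with Dataset.joinWith, which keeps both sides as typed objects instead of flattening them into Rows (the case classes are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ds-join-demo").getOrCreate()
import spark.implicits._

case class User(id: Long, name: String)
case class Order(userId: Long, item: String)

val users  = Seq(User(1, "alice"), User(2, "bob")).toDS()
val orders = Seq(Order(1, "book")).toDS()

// joinWith returns a Dataset[(User, Order)], so downstream code keeps
// compile-time type checking instead of stringly-typed Row access.
val joined = users.joinWith(orders, users("id") === orders("userId"), "inner")
```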
Solving Problems with Non-Efficient Transformations
In this video, we will make jobs memory efficient by reusing existing objects.
• How to minimize object creation
• Use aggregateByKey
• Use mutable state passed to Spark API
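A sketch of the object-reuse idea: aggregateByKey with a mutable buffer as the zero value, so values are accumulated in place instead of allocating a new container per record (the score data is illustrative):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("agg-demo").getOrCreate()
val sc = spark.sparkContext

val scores = sc.parallelize(Seq(("a", 1), ("a", 5), ("b", 2)))

// The zero value is serialized separately to each partition, so each
// partition gets its own private copy -- mutating it in place is safe.
val grouped = scores.aggregateByKey(ArrayBuffer.empty[Int])(
  (buf, v) => { buf += v; buf },    // seqOp: mutate and return the same buffer
  (b1, b2) => { b1 ++= b2; b1 }     // combOp: merge partition-level buffers
)

val sums = grouped.mapValues(_.sum)
```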
In this video, we will iterate over specific partitions by using mapPartitions().
• Understand what can be inside of a partition
• Perform operations partition-wise using mapPartitions()
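The classic use of mapPartitions() is paying a per-partition setup cost once instead of once per record. A sketch, where openConnection and lookup are hypothetical stand-ins for an expensive resource such as a database client:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("mapPartitions-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-ins for an expensive external resource.
def openConnection(): String = "conn"
def lookup(conn: String, id: Int): Int = id * 2

val ids = sc.parallelize(1 to 6, 3)

val looked = ids.mapPartitions { iter =>
  // Created once per partition, not once per record.
  val conn = openConnection()
  iter.map(id => lookup(conn, id))
}
```

In real code the resource should also be closed once the partition's iterator is exhausted; that cleanup is omitted here for brevity.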
In this video, we will learn to debug Spark jobs by introducing accumulators.
• Use accumulators
• Add metrics using accumulators
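A sketch of an accumulator used as a debugging metric, here counting unparseable records (the input data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("acc-demo").getOrCreate()
val sc = spark.sparkContext

val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()          // an action must run before the accumulator is populated
badRecords.value        // 1
```

One caveat: accumulator updates made inside transformations can be applied more than once if a task is retried, so treat them as debugging metrics rather than exact business counters.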
In this video, we will explore ways to avoid recomputing RDDs multiple times by using caching.
• Use the Spark API to favor reusability of RDDs
• Use StorageLevel
• Use Checkpointing
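The StorageLevel and checkpointing bullets can be sketched together (the temp checkpoint directory is an assumption for local experimentation; production jobs would use a reliable filesystem such as HDFS):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("ckpt-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 100).map(_ * 2)

// Explicit storage level: keep in memory, spill to disk if needed.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Checkpointing writes the RDD to reliable storage and truncates its
// lineage, so a lost partition is re-read rather than recomputed.
val dir = java.nio.file.Files.createTempDirectory("spark-ckpt").toString
sc.setCheckpointDir(dir)
rdd.checkpoint()

rdd.count()  // the checkpoint materializes when the next action runs
```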
Troubleshooting Real-Time Processing Jobs in Spark Streaming
In this video, we will create replaceable and reusable sinks and sources.
• Detect Missing Values (NaN)
• Leverage the isnull() helper method
• Install pandas
In this video, we will be reducing the time of batch jobs using Spark micro-batch approach.
• Make NaN meaningful to processing
• Replace NaN with a scalar value
In this video, we will make jobs fault tolerant by introducing a checkpoint mechanism.
• Define the Ad Validator module
• Understand what a backward fill is
• Understand what a forward fill is
In this video, we will learn to create one code base for stream and batch using structured streaming.
• Handle outliers by replacing them with a meaningful name
• Implement logic using replace
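The "one code base for stream and batch" idea can be sketched as a single function applied to a streaming DataFrame; MemoryStream and the memory sink are testing utilities used here as illustrative stand-ins for real sources and sinks such as Kafka:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[*]").appName("sstream-demo").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

// The same DataFrame logic works unchanged for batch and streaming input.
def countWords(df: DataFrame): DataFrame = df.groupBy("value").count()

val input = MemoryStream[String]
val query = countWords(input.toDF())
  .writeStream
  .outputMode("complete")
  .format("memory")       // in-memory sink, queryable as a temp table
  .queryName("counts")
  .start()

input.addData("spark", "streaming", "spark")
query.processAllAvailable()

spark.sql("SELECT * FROM counts").show()
```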