4.27 out of 5
32 reviews on Udemy

Apache Spark SQL – Bigdata In-Memory Analytics Master Course

Master in-memory distributed computing with Apache Spark SQL. Leverage the power of Dataframes and Datasets with real-life demos.
230 students enrolled
Spark SQL Syntax, Component Architecture in Apache Spark
Dataset, Dataframes, RDD
Advanced features on interaction of Spark SQL with other components
Using data from various data sources like MS Excel, RDBMS, AWS S3 and NoSQL MongoDB
Using different file formats like Parquet, Avro and JSON
Table partitioning and Bucketing

This course is designed for everyone from complete beginners to skilled professionals who want to enhance their Spark SQL skills. Hands-on sessions cover the end-to-end setup of a Spark cluster on AWS and on local systems.


What students are saying:

  • 5 stars, “This is classic. Spark related concepts are clearly explained with real life examples.” – Temitayo Joseph

In a data pipeline, whether the incoming data is structured or unstructured, the final extracted data ends up in structured form, and that is what we work with at the last stage. SQL is a popular query language for analyzing structured data.

Apache Spark facilitates distributed in-memory computing. Spark has a built-in module called Spark SQL for structured data processing. Users can mix SQL queries with Spark programs, and Spark SQL integrates seamlessly with the other constructs of Spark.

Spark SQL can load and write data from various sources like RDBMS, NoSQL databases and cloud storage such as S3, and it easily handles different data formats like Parquet, Avro, JSON and many more.
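As a taste of what the course covers, here is a minimal spark-shell sketch of reading and writing a couple of these sources; the bucket, paths, host and table names are hypothetical placeholders.

```scala
// Assumes a running spark-shell, where `spark` is the SparkSession.
// Read JSON from S3 (hypothetical bucket) and write it back as Parquet.
val people = spark.read.json("s3a://my-bucket/people.json")
people.write.mode("overwrite").parquet("/tmp/people_parquet")

// Read a table from an RDBMS over JDBC (hypothetical MySQL host and table).
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", "orders")
  .load()
```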

Spark provides two types of APIs:

Low Level API – RDD

High Level API – Dataframes and Datasets
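A small sketch of the two levels, assuming a running spark-shell (`spark` is the SparkSession):

```scala
// Low-level API: an RDD of plain tuples, no schema attached.
val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 29)))

// High-level API: the same data as a DataFrame with named columns,
// which Spark SQL can optimize through the Catalyst Optimizer.
import spark.implicits._
val df = rdd.toDF("name", "age")
df.filter($"age" > 30).show()
```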

Spark SQL works very well with other Spark components like Spark Streaming, Spark Core and GraphX, thanks to the good integration between the high-level and low-level APIs.

The initial part of the course introduces the Lambda Architecture and the big data ecosystem. The remaining sections concentrate on reading and writing data between Spark and various data sources.

Dataframes and Datasets are the basic building blocks of Spark SQL. We will learn how to work with transformations and actions on RDDs, Dataframes and Datasets.

We also optimize tables with partitioning and bucketing.

To aid understanding of data processing, the following use cases are included to illustrate the complete data flow.

1) NHL Dataset Analysis

2) Bay Area Bike Share Dataset Analysis


++ Apache Zeppelin notebook (Installation, configuration, Dynamic Input)

++ Spark Demo with Apache Zeppelin


  • Need for Apache Spark

  • Various sub components of Apache Spark

  • Overview of distributed memory and role of structured data processing in Spark

Need for Distributed Computing and Storage
  • Need for distributed storage

  • Need for distributed processing

  • Bottlenecks of traditional processing and the overhead of the network

  • Sending the program near to data

  • Doing the processing near to data

  • Role of in-memory data in iterative computing

  • Distributed in memory computing concepts

Lambda Architecture and Spark Stack
  • Introduction to Lambda Architecture

  • Three layers of Lambda architecture

  • Details on Speed Layer, Batch Layer and Service Layer

  • Role of model generation, Machine learning analytics, Big Data and Data Ingestion

  • Different sub components of Spark Stack and mapping to lambda architecture layers

  • Role of Spark SQL, MLlib and Spark Streaming

Master - Worker Architecture - Spark Standalone Cluster
  • Introduction to Master Worker architecture

  • Introduction to different resource managers like Mesos, YARN and Spark Standalone

  • Spark standalone architecture

  • Role of Driver, Spark Context, Executor and tasks in Spark Master and Spark Workers

Apache Spark - In memory computing

Spark - Different Execution Mode
  • Architecture of Spark - Cluster mode of execution

  • Architecture of Spark - Client mode of execution

  • Difference between Cluster and Client mode of execution

  • Pros and cons of Cluster and Client mode of execution

  • Spark Driver execution location

Spark on YARN
  • Introduction to Yet Another Resource Negotiator (YARN)

  • Introduction to its different components

  • Learn about the Node Manager, Application Master, Resource Manager and YARN gateway, and their roles

  • Spark execution in YARN

Prepare AWS EC2 Instance
  • Introduction to AWS EC2

  • Setup EC2 for Spark Installation

  • Setup security group

  • Configure private key / public key

  • Connect to EC2 with private key using putty

Spark Local Installation and Spark Shell Verification
  • Connect to EC2 Instance

  • Install required Java package

  • Download and unpack Spark package

  • Explore various files and folder structure in Spark

  • Start Spark Shell locally

  • Verify Spark Shell with spark context

  • Explore Spark Web UI with default configuration

  • Explore pyspark Shell

Spark Different Modes of Execution

Spark Session - Different Shell - Scala Shell and PySpark
  • Introduction to Spark Session

  • Different contexts and their uses

  • Introduction to Spark Context, SQL Context, Streaming Context and Hive Context

  • Details on Spark Session in Spark Shell

  • Introduction to Spark-SQL Shell
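By way of illustration, a minimal sketch of building a SparkSession in application code (spark-shell and pyspark create one for you, exposed as `spark`); the application name is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the single entry point that unifies the older
// SparkContext, SQLContext and HiveContext.
val spark = SparkSession.builder()
  .appName("spark-sql-demo")  // hypothetical application name
  .master("local[*]")         // local mode; use YARN or standalone on a cluster
  .getOrCreate()

// The older contexts remain reachable through the session:
val sc = spark.sparkContext
```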

Spark Cluster and Client Mode
  • Start and test spark shell in cluster mode

  • Start and test spark shell in client mode

  • Overview of spark application execution in Cluster and Client mode

Spark Standalone Cluster - Spark Shell
  • Start a Spark standalone cluster with multiple machines

  • Start spark shell against the standalone cluster

  • Spark shell web UI for cluster

  • Execute sample job and verify DAG cycle and jobs in web UI

Accessing log files
  • Configure web ui to access log files

  • Access logs of cluster from web UI

High Level API - Dataframes and Datasets

  • Understanding Dataframes

  • Internal working of Dataframes on shared memory

  • Concept on handling each row as generic type of Row Class

  • Creating sample dataframe

  • Visualize a dataframe with the printSchema and show options

  • Understand Dataset

  • Difference between Dataset and Dataframe

  • Introduction to Case Class

  • Using Case Class with Dataset

  • Create and visualize Dataset
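For a flavour of the Dataset and case class material, a minimal spark-shell sketch (the `Player` schema is a hypothetical example):

```scala
// A case class gives the Dataset a typed schema.
case class Player(name: String, goals: Int)

import spark.implicits._
val ds = Seq(Player("Ovechkin", 50), Player("Crosby", 44)).toDS()

ds.printSchema()
ds.filter(_.goals > 45).show()  // typed lambda, checked at compile time
```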

Spark Components and Architecture

Spark Sub Components
  • Various sub components of Apache Spark

  • Introduction to Spark-SQL sub components like the Catalyst Optimizer, Dataframes, Datasets, etc.

  • Introduction to other components like Streaming, MLlib, etc.

  • Roles played by various components in data ingestion

Spark Partitions Introduction
  • Purpose of partitioning the data

  • How partitioning works in Spark

  • Impact of partitioning the data

  • Visualize partitioned data and processing of partitioned data

Transformations and Actions
  • Introduction to RDD transformations and actions

  • RDD and Directed Acyclic Graph (DAG) during transformation

  • Optimization using DAG cycle during transformation
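The lazy-evaluation idea behind the DAG can be sketched in a few lines of spark-shell code:

```scala
// Transformations only build up the DAG; nothing executes yet.
val nums    = spark.sparkContext.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// An action triggers the whole DAG to run.
println(evens.count())
```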

Catalyst Optimizer
  • Introduction to Catalyst Optimizer

  • Purpose and logical architecture of Catalyst Optimizer

  • Logical and Physical plan selection and Catalyst optimizer role

  • Overview about logical optimization

  • Overview on physical optimization

Data Ingestion - Data Sources

MySQL Read and Write
  • Starting spark shell with the MySQL connection driver

  • Reading data from a MySQL database

  • Writing data to a MySQL database

MySQL Read and Write with Partition
  • Reading a MySQL table into multiple partitions

  • Verify partition with web UI

  • Analyze the impact of partition in RDD
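A hedged sketch of the partitioned JDBC read described above (host, database, table and column names are hypothetical, and the MySQL JDBC driver must be on the classpath):

```scala
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", "orders")
  .option("partitionColumn", "order_id") // numeric column to split the read on
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "8")          // 8 parallel JDBC connections
  .load()

// The resulting partitions are visible in the web UI when a job runs.
println(orders.rdd.getNumPartitions)
```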

Read and Write data from MongoDB
  • Understand MongoDB

  • Setup mongoDB with mlab

  • Import data into mongoDB

  • Start spark shell with mongoDB connector

  • Read data from mongoDB

MS xlsx file as Datasource
  • Configure spark excel package

  • Start spark shell by including excel package

  • Read data from xls file

  • View and verify data

AWS Simple Storage Service S3 as Datasource
  • Introduction on AWS S3

  • Configure Secret access and access key Id

  • Creating S3 bucket

  • Read data from S3

  • View and verify data

JSON File as Datasource
  • Introduction to JSON file

  • Read JSON file

  • View JSON file schema

  • Convert JSON to dataframes

Avro File as Datasource
  • Introduction to avro format

  • Download and use spark avro package with spark shell

  • Read a JSON file and store as avro

  • Overview of different configurations like compression, deflate level, etc.

  • Read and view avro files

Parquet File as Datasource
  • Introduction on Parquet files

  • Read json file and store as parquet file

  • Read and view parquet file

Working with Spark SQL Shell

Introduction to Spark SQL Shell
  • Introduction to SparkSQL shell

  • Open SparkSQL shell

  • Difference between schema on read and Schema on write

  • Create table and load data in managed table

  • Fetch records from managed table

Customize Data Warehouse Dir
  • Spark warehouse directory purpose

  • Default spark warehouse directory

  • Customize warehouse directory

Create External Table with CSV
  • Overview on external table

  • Load CSV files to external table

  • Pros and Cons of external table

Use Case : NHL Game data Analysis
  • Analyze NHL game data

  • Read NHL csv files

  • Create Dataframe from NHL data RDD

  • Analyze behavior of partitions, RDD, Performance with different queries

Creating Table in Parquet Format
  • Create table in Parquet format

  • Access table data using Spark-SQL shell

  • Verify parquet files in data warehouse directory

Table Partition
  • Overview of partition in Spark

  • Create partitioned table

  • Verify and execute query in partitioned data

Table Bucketing
  • Overview of Bucketing with partitions

  • Purpose and use of bucketing data

  • Create bucketed table

  • Load and analyse data from bucketed table

  • Overview of views

  • Create views from existing table

  • Select data from views
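The partitioning and bucketing above can be sketched with the DataFrameWriter (assuming a dataframe `trips` with these hypothetical columns, and a metastore available for `saveAsTable`):

```scala
// One directory per start_station, 8 buckets hashed on trip_id within each.
trips.write
  .partitionBy("start_station")
  .bucketBy(8, "trip_id")
  .sortBy("trip_id")          // sortBy is only valid together with bucketBy
  .saveAsTable("trips_bucketed")
```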

Use Case: Bay Area Bike Share Data - FordGoBike
  • Analyse ford go bike share data

  • Create dataframe from CSV file

  • Create required tables for Stations, Status, Trips and Weather

  • Load data to all the required tables

  • Analyse the data with various queries

Visualization with Apache Zeppelin

Zeppelin Introduction
  • Introduction to Apache Zeppelin

  • Architecture and various components of Apache Zeppelin

  • Overview on Zeppelin UI, Notebook, etc.

Zeppelin Installation
  • Install Zeppelin from binary

  • Configure system requirements like JDK and JAVA_HOME

  • Zeppelin folder overview

  • Configure Zeppelin port

  • Start Zeppelin daemon

  • Zeppelin UI overview

Zeppelin UI Overview
  • Zeppelin UI Overview

  • Create and Import Notebook

  • Play with interpreter and settings

  • Accessing saved notebook

  • Configure, enable and disable interpreters

Zeppelin Interpreter HDFS
  • Hadoop Distributed File System (HDFS) overview

  • Access HDFS files list

  • Configure HDFS Interpreter

  • List HDFS files

Zeppelin Notebook Overview
  • Zeppelin notebook functionality overview

  • Connect to MySQL database

  • View the different databases available

  • List various tables

  • Try versioning functions in the Notebook

  • Compare different versions of a notebook

  • Overview on note permissions, Configuration, Interpreter settings and keyboard shortcuts

Zeppelin Interpreter Hive
  • Introduction to Apache Hive and its components

  • Overview of configuration details to connect to Hive

  • Configuring JDBC Interpreter and required maven artifacts

  • Create and configure interpreter

  • Access hive database

  • Load data to hive tables

  • Query hive tables

  • Execute various hive queries and visualize

  • Arrange multiple query visualization

Zeppelin Interpreter Spark
  • Introduction to Apache Spark

  • Configure Spark Interpreter

  • Configuring Spark parameters in Interpreter

  • Execute various Spark SQL queries

  • Visualize Spark SQL query results

  • Arrange and visualize various query results

Zeppelin Dynamic Input elements
  • Introduction to dynamic input forms

  • Creating and using various input elements

  • Discuss various scopes of input elements

  • Demo on various scopes

Data files and Other Resources

Data Files

Bonus Lecture

Special coupon to join my other courses
30-Day Money-Back Guarantee


4 hours on-demand video
2 articles
Full lifetime access
Access on mobile and TV
Certificate of Completion