4.27 out of 5 · 32 reviews on Udemy

Apache Spark SQL – Bigdata In-Memory Analytics Master Course

Master in-memory distributed computing with Apache Spark SQL. Leverage the power of DataFrames and Datasets through real-life demos.
Instructor:
MUTHUKUMAR S
230 students enrolled
English
  • Spark SQL syntax and component architecture in Apache Spark
  • Datasets, DataFrames, and RDDs
  • Advanced features on the interaction of Spark SQL with other components
  • Using data from various sources like MS Excel, RDBMS, AWS S3, and the NoSQL database MongoDB
  • Using different file formats like Parquet, Avro, and JSON
  • Table partitioning and bucketing

This course is designed for everyone from beginners with zero experience to already skilled professionals who want to enhance their Spark SQL skills. Hands-on sessions cover the end-to-end setup of a Spark cluster on AWS and on local systems.

COURSE UPDATED PERIODICALLY SINCE LAUNCH. Last updated: December

What students are saying:

  • 5 stars, “This is classic. Spark related concepts are clearly explained with real life examples.” – Temitayo Joseph

In a data pipeline, whether the incoming data is structured or unstructured, the final extracted data ends up in structured form. At the final stage we work with structured data, and SQL is the most popular query language for analyzing it.

Apache Spark facilitates distributed in-memory computing. Spark has a built-in module called Spark SQL for structured data processing. Users can mix SQL queries with Spark programs, and Spark SQL integrates seamlessly with the other constructs of Spark.
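
As a minimal sketch of that mixing (the file name, column names, and app name below are illustrative, not from the course):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: register a DataFrame as a temp view, then mix
// plain SQL with DataFrame transformations on the result.
val spark = SparkSession.builder()
  .appName("SparkSQLDemo")
  .master("local[*]")
  .getOrCreate()

val people = spark.read.json("people.json")   // illustrative input file
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")  // SQL query
adults.groupBy("age").count().show()          // DataFrame API on the same result
```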

Spark SQL facilitates loading and writing data from various sources like RDBMS, NoSQL databases, and cloud storage like S3, and it easily handles different data formats like Parquet, Avro, JSON, and many more.
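
The reader and writer share one uniform shape across sources and formats. A sketch in spark-shell (where `spark` is predefined), with placeholder paths:

```scala
// The format(...) string selects the source or sink; paths are placeholders.
val df = spark.read.format("csv").option("header", "true").load("s3a://my-bucket/input/")
df.write.format("parquet").save("/data/output_parquet")
```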

Spark provides two types of APIs:

Low Level API – RDDs

High Level API – DataFrames and Datasets
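
A sketch contrasting the two levels in spark-shell (where `sc` and `spark` are predefined; names and values are illustrative):

```scala
val pairs = Seq(("alice", 34), ("bob", 28))

// Low level: an RDD of tuples, manipulated positionally
val rdd = sc.parallelize(pairs)
val totalAge = rdd.map(_._2).sum()

// High level: a DataFrame with named columns, planned by the optimizer
import spark.implicits._
val df = pairs.toDF("name", "age")
df.agg(Map("age" -> "sum")).show()   // same aggregate, declaratively
```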

Spark SQL integrates very well with other Spark components like Spark Streaming, Spark Core, and GraphX, since the high-level and low-level APIs interoperate cleanly.

The initial part of the course introduces the Lambda Architecture and the big data ecosystem. The remaining sections concentrate on reading and writing data between Spark and various data sources.

DataFrames and Datasets are the basic building blocks of Spark SQL. We will learn how to work with transformations and actions on RDDs, DataFrames, and Datasets (see the sketch below).
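
The key behavior is lazy evaluation, sketched here with illustrative data:

```scala
// Transformations only build the DAG; the action at the end executes it.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)   // transformation: nothing runs yet
val evens   = squares.filter(_ % 2 == 0)    // still lazy
println(evens.count())                      // action: triggers the computation
```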

Table optimization with partitioning and bucketing is also covered.
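
A sketch of the writer-side API for both techniques; the `trips` DataFrame, its columns, and paths are hypothetical:

```scala
val trips = spark.read.parquet("/data/trips")   // placeholder input

trips.write
  .partitionBy("start_year")        // one directory per year value
  .bucketBy(8, "station_id")        // 8 buckets by station within each partition
  .sortBy("station_id")
  .saveAsTable("trips_optimized")   // bucketing requires saveAsTable
```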

To aid understanding of data processing, the following use cases are included to illustrate the complete data flow:

1) NHL Dataset Analysis

2) Bay Area Bike Share Dataset Analysis

Updates:

++ Apache Zeppelin notebook (installation, configuration, dynamic input)

++ Spark demo with Apache Zeppelin

Introduction

1
Introduction
  • Need for Apache Spark

  • Various sub components of Apache Spark

  • Overview of distributed memory and role of structured data processing in Spark

2
Need for Distributed Computing and Storage
  • Need for distributed storage

  • Need for distributed processing

  • Bottleneck for traditional processing and overhead of network

  • Sending the program near to data

  • Doing the processing near to data

  • Roles of data in memory in recursive computing

  • Distributed in memory computing concepts

3
Lambda Architecture and Spark Stack
  • Introduction to Lambda Architecture

  • Three layers of Lambda architecture

  • Details on Speed Layer, Batch Layer and Service Layer

  • Role of model generation, Machine learning analytics, Big Data and Data Ingestion

  • Different sub components of Spark Stack and mapping to lambda architecture layers

  • Role of Spark SQL, MLlib and Spark Streaming

4
Master - Worker Architecture - Spark Standalone Cluster
  • Introduction to Master Worker architecture

  • Introduction to different resource managers like Mesos, YARN, and Spark Standalone

  • Spark standalone architecture

  • Role of Driver, Spark Context, Executor and tasks in Spark Master and Spark Workers

Apache Spark - In memory computing

1
Spark - Different Execution Mode
  • Architecture of Spark - Cluster mode of execution

  • Architecture of Spark - Client mode of execution

  • Difference between Cluster and Client mode of execution

  • Pros and cons of Cluster and Client modes of execution

  • Spark Driver execution location

2
Spark on YARN
  • Introduction to Yet Another Resource Negotiator (YARN)

  • Introduction to the different YARN components

  • Learn about Node Manager, Application Master, Resource Manager, YARN gateway and its role

  • Spark execution in YARN

3
Prepare AWS EC2 Instance
  • Introduction to AWS EC2

  • Setup EC2 for Spark Installation

  • Setup security group

  • Configure private key / public key

  • Connect to EC2 with private key using putty

4
Spark Local Installation and Spark Shell Verification
  • Connect to EC2 Instance

  • Install required Java package

  • Download and unpack Spark package

  • Explore various files and folder structure in Spark

  • Start Spark Shell locally

  • Verify Spark Shell with spark context

  • Explore Spark Web UI with default configuration

  • Explore pyspark Shell
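
A few sanity checks of the kind this lecture performs, using the `spark` and `sc` objects that spark-shell predefines:

```scala
spark.version                     // prints the installed Spark version
sc.parallelize(1 to 100).sum()    // runs a small distributed job (5050.0)
spark.range(5).show()             // exercises the SQL engine end to end
```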

Spark Different Modes of Execution

1
Spark Session - Different Shell - Scala Shell and PySpark
  • Introduction to Spark Session

  • Different contexts and their uses

  • Introduction to SparkContext, SQLContext, StreamingContext, and HiveContext

  • Details on Spark Session in Spark Shell

  • Introduction to Spark-SQL Shell
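
A minimal sketch of how SparkSession relates to the older contexts (in spark-shell the session already exists as `spark`; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SessionDemo")
  .master("local[*]")
  .getOrCreate()

val sc  = spark.sparkContext     // the underlying SparkContext
val sql = spark.sqlContext       // legacy SQLContext, kept for compatibility
// HiveContext is superseded by .enableHiveSupport() on the builder.
```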

2
Spark Cluster and Client Mode
  • Start and test spark shell in cluster mode

  • Start and test spark shell in client mode

  • Overview of spark application execution in Cluster and Client mode

3
Spark Standalone Cluster - Spark Shell
  • Start Spark standalone cluster with multiple machines

  • Start spark shell against the standalone cluster

  • Spark shell web UI for cluster

  • Execute sample job and verify DAG cycle and jobs in web UI

4
Accessing log files
  • Configure the web UI to access log files

  • Access cluster logs from the web UI

High Level API

1
Dataframe
  • Understanding DataFrames

  • Internal working of DataFrames on distributed memory

  • Concept of handling each row as the generic Row class

  • Creating a sample DataFrame

  • Visualize a DataFrame with printSchema and show (see the sketch below)
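
The sketch referenced above, runnable in spark-shell with illustrative data:

```scala
import spark.implicits._

val df = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")
df.printSchema()   // column names and types
df.show()          // rows rendered as a table
```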

2
Dataset
  • Understand Dataset

  • Difference between Dataset and Dataframe

  • Introduction to Case Class

  • Using Case Class with Dataset

  • Create and visualize Dataset
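
A minimal sketch of a typed Dataset backed by a case class (names and values are illustrative):

```scala
import spark.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("alice", 34), Person("bob", 28)).toDS()
ds.filter(_.age > 30).show()   // typed lambda, checked at compile time
```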

Spark Components and Architecture

1
Spark Sub Components
  • Various sub components of Apache Spark

  • Introduction to Spark SQL sub components like the Catalyst Optimizer, DataFrames, Datasets, etc.

  • Introduction to other components like Streaming, MLlib, etc.

  • Roles played by various components in data ingestion

2
Spark Partitions Introduction
  • Purpose of partitioning the data

  • How partitioning works in Spark

  • Impact of partitioning the data

  • Visualize partitioned data and processing of partitioned data

3
Transformations and Actions
  • Introduction to RDD transformations and actions

  • RDD and Directed Acyclic Graph (DAG) during transformation

  • Optimization using DAG cycle during transformation

4
Catalyst Optimizer
  • Introduction to Catalyst Optimizer

  • Purpose and logical architecture of Catalyst Optimizer

  • Logical and Physical plan selection and Catalyst optimizer role

  • Overview of logical optimization

  • Overview of physical optimization

Data Ingestion - Data Sources

1
MySQL Read and Write
  • Starting spark shell with the MySQL connector driver

  • Reading data from a MySQL database

  • Writing data to a MySQL database
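
A sketch of the JDBC round trip, assuming the MySQL connector jar was passed to spark-shell (for example via --jars); the URL, table names, and credentials are placeholders:

```scala
val jdbcUrl = "jdbc:mysql://localhost:3306/testdb"   // placeholder

val employees = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "employees")
  .option("user", "spark")
  .option("password", "secret")
  .load()

employees.write.format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "employees_copy")
  .option("user", "spark")
  .option("password", "secret")
  .mode("append")
  .save()
```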

2
MySQL Read and Write with Partition
  • Reading a MySQL table into multiple partitions

  • Verify partitions with the web UI

  • Analyze the impact of partitioning on the RDD
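
A sketch of a partitioned JDBC read; the bounds and partition count are placeholders to tune per table:

```scala
val partitioned = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "employees")
  .option("user", "spark")
  .option("password", "secret")
  .option("partitionColumn", "id")   // must be numeric, date, or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")      // four parallel reads against MySQL
  .load()

partitioned.rdd.getNumPartitions     // verify, matching what the web UI shows
```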

3
Read and Write data from MongoDB
  • Understand MongoDB

  • Set up MongoDB with mLab

  • Import data into MongoDB

  • Start spark shell with the MongoDB connector

  • Read data from MongoDB
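
A sketch assuming the MongoDB Spark connector 10.x was added via --packages; the URI (an Atlas/mLab-style connection string), database, and collection are placeholders:

```scala
val readings = spark.read.format("mongodb")
  .option("connection.uri", "mongodb+srv://user:pass@cluster.example.net")  // placeholder
  .option("database", "demo")
  .option("collection", "readings")
  .load()

readings.printSchema()
```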

4
MS xlsx file as Datasource
  • Configure spark excel package

  • Start spark shell by including excel package

  • Read data from an xlsx file

  • View and verify data
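
A sketch assuming the com.crealytics:spark-excel package was added to spark-shell via --packages; the file name and options are illustrative:

```scala
val sales = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")        // first row holds column names
  .option("inferSchema", "true")
  .load("sales.xlsx")              // placeholder file

sales.show()
```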

5
AWS Simple Storage Service S3 as Datasource
  • Introduction to AWS S3

  • Configure the access key ID and secret access key

  • Creating S3 bucket

  • Read data from S3

  • View and verify data
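
A sketch of the S3 read, assuming the hadoop-aws module is on the classpath; the bucket and keys are placeholders:

```scala
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")

val logs = spark.read.csv("s3a://my-bucket/logs/")   // placeholder bucket
logs.show(5)
```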

6
JSON File as Datasource
  • Introduction to JSON file

  • Read JSON file

  • View JSON file schema

  • Convert JSON to DataFrames
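
A sketch with an illustrative file; Spark infers the schema while reading:

```scala
val users = spark.read.json("users.json")   // returns a DataFrame directly
users.printSchema()                         // view the inferred schema
users.show()
```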

7
Avro File as Datasource
  • Introduction to avro format

  • Download and use spark avro package with spark shell

  • Read a JSON file and store as avro

  • Overview of different configurations like compression, deflate level, etc.

  • Read and view avro files
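
A sketch assuming the spark-avro package (an external Spark module since 2.4) was added via --packages; paths and the deflate level are illustrative:

```scala
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")

val users = spark.read.json("users.json")              // JSON in
users.write.format("avro").save("/tmp/users_avro")     // Avro out, deflate-compressed

spark.read.format("avro").load("/tmp/users_avro").show()
```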

8
Parquet File as Datasource
  • Introduction to Parquet files

  • Read a JSON file and store it as a Parquet file

  • Read and view Parquet files
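
A minimal round trip with placeholder paths:

```scala
val users = spark.read.json("users.json")
users.write.parquet("/tmp/users_parquet")      // columnar, compressed by default
spark.read.parquet("/tmp/users_parquet").show()
```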

Working with Spark SQL Shell

1
Introduction to Spark SQL Shell
  • Introduction to SparkSQL shell

  • Open SparkSQL shell

  • Difference between schema on read and schema on write

  • Create a table and load data into a managed table

  • Fetch records from the managed table
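
A sketch of the managed-table flow; the same statements can be typed at the spark-sql> prompt (shown here through spark.sql), assuming Hive support and with placeholder names and paths:

```scala
spark.sql("""
  CREATE TABLE games (game_id INT, season STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")
spark.sql("LOAD DATA LOCAL INPATH '/tmp/games.csv' INTO TABLE games")
spark.sql("SELECT COUNT(*) FROM games").show()
```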

2
Customize Data Warehouse Dir
  • Spark warehouse directory purpose

  • Default spark warehouse directory

  • Customize warehouse directory

3
Create External Table with CSV
  • Overview of external tables

  • Load CSV files into an external table

  • Pros and cons of external tables
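
A sketch using datasource DDL, where dropping the table leaves the CSV files in place; the path and columns are placeholders:

```scala
spark.sql("""
  CREATE TABLE stations_ext (station_id INT, name STRING)
  USING CSV
  OPTIONS (path '/data/stations/', header 'true')
""")
spark.sql("SELECT * FROM stations_ext LIMIT 5").show()
```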

4
Use Case : NHL Game data Analysis
  • Analyze NHL game data

  • Read NHL CSV files

  • Create a DataFrame from the NHL data RDD

  • Analyze behavior of partitions, RDD, Performance with different queries

5
Creating Table in Parquet Format
  • Create table in Parquet format

  • Access table data using Spark-SQL shell

  • Verify Parquet files in the data warehouse directory

6
Table Partition
  • Overview of partitioning in Spark

  • Create partitioned table

  • Verify and execute query in partitioned data
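
A sketch with hypothetical columns; each distinct partition value becomes its own directory, and filters on it prune partitions:

```scala
spark.sql("""
  CREATE TABLE trips_part (trip_id INT, duration INT, start_year INT)
  USING PARQUET
  PARTITIONED BY (start_year)
""")
spark.sql("INSERT INTO trips_part VALUES (1, 600, 2018), (2, 420, 2019)")
spark.sql("SELECT * FROM trips_part WHERE start_year = 2019").show()  // reads one partition
```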

7
Table Bucketing
  • Overview of Bucketing with partitions

  • Purpose and use of bucketing data

  • Create bucketed table

  • Load and analyze data from the bucketed table
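
A sketch of the bucketed-table DDL; column names and the bucket count are illustrative:

```scala
spark.sql("""
  CREATE TABLE trips_bucketed (trip_id INT, station_id INT)
  USING PARQUET
  CLUSTERED BY (station_id) INTO 8 BUCKETS
""")
spark.sql("DESCRIBE EXTENDED trips_bucketed").show(50, truncate = false)  // shows the bucket spec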

8
Views
  • Overview of views

  • Create views from existing table

  • Select data from views
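
A sketch reusing the hypothetical trips_part table from the partitioning example above:

```scala
spark.sql("""
  CREATE OR REPLACE VIEW trips_2019 AS
  SELECT * FROM trips_part WHERE start_year = 2019
""")
spark.sql("SELECT COUNT(*) FROM trips_2019").show()   // the view stores the query, not data
```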

9
Use Case: Bay Area Bike Share Data - FordGoBike
  • Analyze Ford GoBike share data

  • Create dataframe from CSV file

  • Create required tables for Stations, Status, Trips and Weather

  • Load data to all the required tables

  • Analyze the data with various queries

Visualization with Apache Zeppelin

1
Zeppelin Introduction
  • Introduction to Apache Zeppelin

  • Architecture and various components of Apache Zeppelin

  • Overview of the Zeppelin UI, notebooks, etc.

2
Zeppelin Installation
  • Install Zeppelin from binary

  • Configure system requirements like JDK and JAVA_HOME

  • Zeppelin folder overview

  • Configure Zeppelin port

  • Start Zeppelin daemon

  • Zeppelin UI overview

3
Zeppelin UI Overview
  • Zeppelin UI Overview

  • Create and Import Notebook

  • Play with interpreter and settings

  • Accessing saved notebook

  • Configure, enable, and disable interpreters

4
Zeppelin Interpreter HDFS
  • Hadoop Distributed File System (HDFS) overview

  • Access HDFS files list

  • Configure HDFS Interpreter

  • List HDFS files

5
Zeppelin Notebook Overview
  • Zeppelin notebook functionality overview

  • Connect to MySQL database

  • View the different databases available

  • List various tables

  • Try the notebook versioning functions

  • Compare different versions of a notebook

  • Overview of note permissions, configuration, interpreter settings, and keyboard shortcuts

6
Zeppelin Interpreter Hive
  • Introduction to Apache Hive and its components

  • Overview of configuration details to connect to Hive

  • Configuring JDBC Interpreter and required maven artifacts

  • Create and configure interpreter

  • Access Hive databases

  • Load data into Hive tables

  • Query Hive tables

  • Execute various Hive queries and visualize the results

  • Arrange multiple query visualizations

7
Zeppelin Interpreter Spark
  • Introduction to Apache Spark

  • Configure Spark Interpreter

  • Configuring Spark parameters in Interpreter

  • Execute various Spark SQL queries

  • Visualize Spark SQL query results

  • Arrange and visualize various query results

8
Zeppelin Dynamic Input elements
  • Introduction to dynamic input forms

  • Creating and using various input elements

  • Discuss various scopes of input elements

  • Demo on various scopes
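
A sketch of dynamic forms in a Scala paragraph, using the `z` ZeppelinContext that Zeppelin predefines (form names, options, and the people table are illustrative; SQL paragraphs can use the ${name=default} syntax instead):

```scala
val maxAge = z.input("maxAge", "30").toString.toInt     // free-text input form
val city   = z.select("city", Seq(("SF", "San Francisco"),
                                  ("NY", "New York"))).toString  // drop-down form

spark.sql(s"SELECT * FROM people WHERE age <= $maxAge AND city = '$city'").show()
```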

Data files and Other Resources

1
Data Files

Bonus Lecture

1
Special coupon to join my other courses
4.3 out of 5 · 32 ratings

Detailed Rating

5 stars: 18
4 stars: 9
3 stars: 3
2 stars: 2
1 star: 0
30-Day Money-Back Guarantee

Includes

4 hours on-demand video
2 articles
Full lifetime access
Access on mobile and TV
Certificate of Completion