3.9 out of 5 (286 reviews on Udemy)

HDPCD:Spark using Python (pyspark)

Prepare for the Hortonworks HDP Certified Developer - Spark certification using Python as the programming language
Basics of Python, Spark, and the skills required to take the HDPCD:Spark certification using Python/pyspark with confidence

The course covers the overall syllabus of the HDPCD:Spark certification.

  • Python Fundamentals – Basic Python programming required, using the REPL

  • Getting Started with Spark – Different setup options and the setup process

  • Core Spark – Transformations and Actions to process the data

  • Data Frames and Spark SQL – Leverage SQL skills on top of Data Frames created from Hive tables or RDDs

  • One month of complimentary lab access

  • Exercises – A set of self-evaluated exercises to test skills for certification purposes

After completing the course, you will have the confidence to take the certification exam and pass it.

All the demos are given on our state-of-the-art Big Data cluster. You can avail of one week of complimentary lab access by filling in the form provided as part of the welcome message.

Introduction

1. Introduction
2. Using itversity platforms - Big Data Developer labs and forum

Python Fundamentals

1. Introduction and Setup Python Environment
2. Basic Programming Constructs
3. Functions in Python
4. Python Collections
5. Map Reduce operations on Python Collections (see the sketch after this list)
6. Setting up Data Sets for Basic I/O Operations
7. Basic I/O operations and processing data using Collections
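As a taste of what lecture 5 covers, here is a minimal sketch of Map Reduce style processing on plain Python collections. The sample records are hypothetical and only illustrate the filter/map/reduce pattern; the course works against real data sets.

from functools import reduce

#Hypothetical sample records: order_id,order_date,customer_id,order_status
orders = [
    "1,2013-07-25,11599,CLOSED",
    "2,2013-07-25,256,PENDING_PAYMENT",
    "3,2013-07-25,12111,COMPLETE"
]

#filter - keep only COMPLETE or CLOSED orders
ordersFiltered = filter(lambda o: o.split(",")[3] in ("COMPLETE", "CLOSED"), orders)

#map - extract customer_id as an integer
customerIds = map(lambda o: int(o.split(",")[2]), ordersFiltered)

#reduce - aggregate the records, here a simple count
print(reduce(lambda total, cid: total + 1, customerIds, 0)) #2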

Spark Getting Started

1. Setup Options
2. Setup using tarball
3. Setup using Hortonworks Sandbox
4. Using labs.itversity.com
5. Using Windows - Putty and WinScp
6. Using Windows - Cygwin
7. HDFS - Quick Preview
8. YARN - Quick Preview
9. Setup Data Sets (a quick verification sketch follows this list)
10. Curriculum
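Once the environment is up, a couple of lines in the pyspark shell are enough to confirm that Spark can read the course data sets. This sketch assumes the shell is connected to a cluster where /public/retail_db exists, as on labs.itversity.com; on a sandbox or tarball setup the paths may differ.

#sc is predefined in the pyspark shell
orders = sc.textFile("/public/retail_db/orders")
print(orders.first()) #prints one sample record
print(orders.count()) #prints the total number of records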

Core Spark - Transformations and Actions

1. Introduction
2. Problem Statement and Environment
3. Initializing the job using pyspark

#Check out our lab for practice: https://labs.itversity.com

#Initializing the Job

pyspark --master yarn \
  --deploy-mode client \
  --conf spark.ui.port=12335 \
  --num-executors 1 \
  --executor-memory 2048M

#Check the size of the input data sets
hadoop fs -du -s -h /public/retail_db/orders
hadoop fs -du -s -h /public/retail_db/order_items

#Initializing the job programmatically

sc.stop() #Stop the existing Spark context first

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("yarn-client").setAppName("Daily Product Revenue")
sc = SparkContext(conf=conf)

#Raise any issues on https://discuss.itversity.com - make sure to categorize properly

4. Resilient Distributed Datasets - Create
5. Resilient Distributed Datasets - Persist and Cache
6. Previewing the data using actions - first, take(n), count, collect
7. Filtering the Data - Get completed/closed orders
8. Accumulators - Get completed/closed orders with count
9. Converting to key value pairs - using map
10. Joining data sets - join and outer join with examples
11. Get Daily revenue per product id - using reduceByKey
12. Get Daily revenue and count per product id - using aggregateByKey
13. Execution Life Cycle
14. Broadcast Variables
15. Sorting the data
16. Saving the data to the file system
17. Final Solution - Get Daily Revenue per Product (see the sketch after this list)
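To show how the pieces of this module fit together, here is a minimal sketch of the Daily Product Revenue solution using core RDD APIs - filter, map, join, reduceByKey, sortByKey and saveAsTextFile. Field positions follow the retail_db layout used in the demos, and the output path is a hypothetical example; treat this as an outline of the approach, not the course's exact solution.

orders = sc.textFile("/public/retail_db/orders")
orderItems = sc.textFile("/public/retail_db/order_items")

#Filter for completed/closed orders and convert to (order_id, order_date) pairs
ordersFiltered = orders. \
    filter(lambda o: o.split(",")[3] in ("COMPLETE", "CLOSED")). \
    map(lambda o: (int(o.split(",")[0]), o.split(",")[1]))

#Convert order items to (order_id, (product_id, subtotal)) pairs
orderItemsMap = orderItems. \
    map(lambda oi: (int(oi.split(",")[1]), (int(oi.split(",")[2]), float(oi.split(",")[4]))))

#Join on order_id, re-key by (order_date, product_id), then total the revenue
dailyRevenuePerProductId = ordersFiltered.join(orderItemsMap). \
    map(lambda rec: ((rec[1][0], rec[1][1][0]), rec[1][1][1])). \
    reduceByKey(lambda total, revenue: total + revenue)

#Sort by the (order_date, product_id) key and save as text
dailyRevenuePerProductId. \
    sortByKey(). \
    map(lambda rec: "{0},{1},{2}".format(rec[0][0], rec[0][1], rec[1])). \
    saveAsTextFile("/user/training/daily_revenue_per_product_id") #hypothetical output path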

Spark SQL using pyspark

1. Introduction to Spark SQL and Objectives
2. Different interfaces to run SQL - Hive, Spark SQL
3. Create database and tables of text file format - orders and order_items
4. Create database and tables of ORC file format - orders and order_items
5. Running SQL/Hive Commands using pyspark
6. Functions - Getting Started
7. Functions - String Manipulation
8. Functions - Date Manipulation
9. Functions - Aggregate Functions in brief
10. Functions - case and nvl
11. Row level transformations
12. Joining data between multiple tables
13. Group by and aggregations
14. Sorting the data
15. Set operations - union and union all
16. Analytics functions - aggregations
17. Analytics functions - ranking
18. Windowing functions
19. Creating Data Frames and register as temp tables
20. Write Spark Application - Processing Data using Spark SQL (see the sketch after this list)
21. Write Spark Application - Saving Data Frame to Hive tables
22. Data Frame Operations
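The sketch below outlines the same Daily Product Revenue problem solved with Spark SQL, in the style of lectures 19-20. It assumes Spark 1.6-era APIs (HiveContext, registerTempTable), which matches the HDPCD exam environment of the time; the database name is a placeholder for whichever database you create in lectures 3-4.

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf().setMaster("yarn-client").setAppName("Daily Product Revenue - SQL")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

sqlContext.sql("USE retail_db") #placeholder database name

#Aggregate revenue per date and product using plain SQL over the Hive tables
dailyRevenue = sqlContext.sql("""
SELECT o.order_date, oi.order_item_product_id,
       round(sum(oi.order_item_subtotal), 2) AS revenue
FROM orders o JOIN order_items oi
  ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id
""")

#Register the Data Frame as a temp table and run further SQL on it
dailyRevenue.registerTempTable("daily_revenue")
for rec in sqlContext.sql("SELECT * FROM daily_revenue ORDER BY order_date LIMIT 10").collect():
    print(rec)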

Exercises or Problem Statements with Solutions

1. Introduction about exercises
2. General Guidelines about Exercises or Problem Statements
3. General Guidelines - Initializing the Job
3.9 out of 5 (286 ratings)

Detailed Rating

5 stars: 117
4 stars: 94
3 stars: 47
2 stars: 15
1 star: 13
30-Day Money-Back Guarantee

Includes

12 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion