HDPCD:Spark using Python (pyspark)
The course covers the overall syllabus of the HDPCD:Spark certification.
Python Fundamentals – Basic Python programming required using REPL
Getting Started with Spark – Different setup options, setup process
Core Spark – Transformations and Actions to process the data
Data Frames and Spark SQL – Leverage SQL skills on top of Data Frames created from Hive tables or RDDs
One month complimentary lab access
Exercises – A set of self-evaluated exercises to test skills for certification purposes
After completing the course, one will gain enough confidence to attempt the certification and crack it.
All the demos are given on our state-of-the-art Big Data cluster. You can avail one week of complimentary lab access by filling in the form provided as part of the welcome message.
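The Python Fundamentals module centers on constructs such as lambda functions and the higher-order functions map and filter, which carry over directly to Spark's transformation API. A minimal plain-Python warm-up in that style (no Spark required; the sample records are illustrative, following the retail_db orders layout of order_id, order_date, customer_id, status):

```python
# Plain-Python warm-up: lambdas and higher-order functions,
# the building blocks for Spark transformations such as map and filter.
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
]

# Parse comma-delimited records into (order_id, order_status) tuples
parsed = list(map(lambda rec: (int(rec.split(",")[0]), rec.split(",")[3]), orders))

# Keep only completed or closed orders
completed = list(filter(lambda t: t[1] in ("COMPLETE", "CLOSED"), parsed))

print(completed)  # [(1, 'CLOSED'), (3, 'COMPLETE')]
```

The same lambdas can later be passed unchanged to `rdd.map` and `rdd.filter` on the cluster.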
Spark Getting Started
Core Spark - Transformations and Actions
#Check out our lab for practice: https://labs.itversity.com
#Initializing the Job
pyspark --master yarn
#Check the size of the input data sets
hadoop fs -du -s -h /public/retail_db/orders
hadoop fs -du -s -h /public/retail_db/order_items
#Initializing the job programmatically
sc.stop() #Stop the existing Spark context before creating a new one
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("yarn-client").setAppName("Daily Product Revenue")
sc = SparkContext(conf=conf)
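With the SparkContext available, a "Daily Product Revenue" job would typically filter orders to COMPLETE/CLOSED, join order_items to them, and sum revenue per date and product. The sketch below emulates that pipeline in plain Python so the shape of the logic is visible without a cluster; on the cluster the same steps would be expressed with sc.textFile, filter, map, join, and reduceByKey. The sample records are illustrative, following the retail_db layouts (orders: order_id, order_date, customer_id, status; order_items: item_id, order_id, product_id, quantity, subtotal, product_price):

```python
# Plain-Python emulation of the Daily Product Revenue pipeline.
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,1,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
]

# filter + map: keep COMPLETE/CLOSED orders as {order_id: order_date}
completed = {
    int(o.split(",")[0]): o.split(",")[1]
    for o in orders
    if o.split(",")[3] in ("COMPLETE", "CLOSED")
}

# join equivalent: (order_date, product_id, subtotal) for items of completed orders
joined = [
    (completed[int(i.split(",")[1])], int(i.split(",")[2]), float(i.split(",")[4]))
    for i in order_items
    if int(i.split(",")[1]) in completed
]

# reduceByKey equivalent: total revenue per (order_date, product_id)
revenue = {}
for date, product_id, subtotal in joined:
    key = (date, product_id)
    revenue[key] = revenue.get(key, 0.0) + subtotal

print(revenue)
```

Order 2 is PENDING_PAYMENT, so its item is dropped by the join; only revenue from order 1's two items survives.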
#Raise any issues on https://discuss.itversity.com - make sure to categorize properly