2.83 out of 5 (3 reviews on Udemy)

Big Data analytics with PySpark (Apache Spark and Python)

A course for leveraging the power of Python and putting it to use in the Apache Spark ecosystem.
Instructor: Indira Programmer
799 students enrolled
English [Auto-generated]
You will learn to use the Python API for Apache Spark to solve problems associated with building data-intensive applications.

PySpark helps data scientists interface with Resilient Distributed Datasets (RDDs) in Apache Spark from Python. Py4J is a popular library integrated within PySpark that lets Python interact dynamically with JVM objects such as RDDs. In this course you'll learn how to use Spark from Python! Spark is a tool for doing parallel computation with large datasets, and it integrates well with Python; PySpark is the Python package that makes the magic happen. You'll use this package to work through live examples, learning to wrangle data and build a complete machine learning pipeline to predict results. Get ready to put some Spark in your Python code and dive into the world of high-performance machine learning!
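To make this concrete, here is a minimal sketch of the kind of pipeline the course builds, assuming a local PySpark installation; the data, column names, and app name are illustrative placeholders, not course material:

    # A tiny end-to-end Spark ML pipeline: assemble features, fit a model, predict.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

    # Toy labeled data: two numeric features and a binary label.
    df = spark.createDataFrame(
        [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.3, 1), (0.1, 1.2, 0)],
        ["f1", "f2", "label"],
    )

    # Assemble the raw columns into the single vector column Spark ML expects,
    # then fit a logistic regression on top of it.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)

    model.transform(df).select("f1", "f2", "prediction").show()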

Apache Spark is a fast cluster computing framework which is used for processing, querying and analyzing big data. Being based on in-memory computation, it has an advantage over several other big data frameworks.

Apache Spark was originally written in the Scala programming language, and the open source community has developed an amazing tool to support Python on top of it. PySpark helps data scientists interface with RDDs in Apache Spark from Python through its Py4J library. Several features make PySpark a better framework than many others:

  • Speed: For in-memory workloads, Spark can be up to 100x faster than traditional large-scale data processing frameworks such as Hadoop MapReduce.

  • Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities (see the sketch after this list).

  • Deployment: Spark can be deployed through Mesos, Hadoop via YARN, or Spark's own standalone cluster manager.

  • Real Time: In-memory execution enables real-time computation with low latency.

  • Polyglot: Supports programming in Scala, Java, Python, and R.
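As a small illustration of the caching point above, here is a sketch of persisting an RDD so that repeated actions reuse it instead of recomputing it; the input path and app name are hypothetical placeholders:

    # Persist an RDD in executor memory (spilling to disk if needed) so that
    # repeated actions do not recompute it from the source file.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="caching-demo")

    # Hypothetical input file; replace with a real path.
    words = sc.textFile("data/corpus.txt").flatMap(lambda line: line.split())
    words.persist(StorageLevel.MEMORY_AND_DISK)

    print(words.count())             # first action materializes and caches the RDD
    print(words.distinct().count())  # reuses the cached partitions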

PySpark in the Industry

Let’s move ahead with our PySpark tutorial and see where Spark is used in the industry.

Every industry revolves around big data and where there’s big data, there’s analysis involved. So let’s have a look at the various industries where Apache Spark is used.

Media is one of the biggest industries growing towards online streaming. Netflix uses Apache Spark for real-time stream processing to provide personalized online recommendations to its customers, processing some 450 billion events per day that flow to its server-side applications.

Finance is another sector where Apache Spark’s real-time processing plays an important role. Banks are using Spark to access and analyze social media profiles to gain insights which can help them make the right business decisions for credit risk assessment, targeted ads, and customer segmentation. Customer churn is also reduced using Spark. Fraud detection is one of the most widely used areas of machine learning where Spark is involved.

Healthcare providers are using Apache Spark to analyze patient records along with past clinical data to identify which patients are likely to face health issues after being discharged from the clinic. Apache Spark is used in genomic sequencing to reduce the time needed to process genome data.

Retail and e-commerce is an industry one can't imagine running without analysis and targeted advertising. Alibaba, one of the largest e-commerce platforms today, runs some of the largest Spark jobs in the world to analyze petabytes of data, including feature extraction on image data. eBay uses Apache Spark to provide targeted offers, enhance customer experience, and optimize overall performance.

The travel industry also uses Apache Spark. TripAdvisor, a leading travel website that helps users plan a perfect trip, uses Apache Spark to speed up its personalized customer recommendations, comparing hundreds of websites to find the best hotel prices for millions of travelers.

An important aspect of this PySpark tutorial is to understand why we need to use Python. Why not Java, Scala or R?

Easy to Learn: For programmers, Python is comparatively easier to learn because of its syntax and standard libraries. Moreover, it is a dynamically typed language, which means RDDs can hold objects of multiple types (see the sketch after these points).

A vast set of libraries: Scala does not have as many data science tools and libraries as Python for machine learning and natural language processing, and it lacks Python's strength in visualization and local data transformations.

Huge Community Support: Python has a global community with millions of developers that interact online and offline in thousands of virtual and physical locations.
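As a quick illustration of the dynamic-typing point above, here is a sketch with made-up values showing a single RDD holding objects of several types side by side:

    from pyspark import SparkContext

    sc = SparkContext(appName="mixed-types")

    # One RDD holding an int, a str, a float, a tuple, and a dict.
    mixed = sc.parallelize([1, "two", 3.0, ("a", 4), {"five": 5}])
    print(mixed.map(lambda x: type(x).__name__).collect())
    # -> ['int', 'str', 'float', 'tuple', 'dict']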

One of the most important topics here is the use of RDDs. Let’s understand what RDDs are. 

Spark RDDs

When it comes to iterative distributed computing, i.e. processing data over multiple jobs, we need to reuse or share data among those jobs. Earlier frameworks like Hadoop had problems when dealing with multiple operations/jobs:

  • Storing data in intermediate storage such as HDFS.

  • Multiple I/O jobs make the computations slow.

  • Replication and serialization, which in turn make the process even slower.

RDDs solve these problems by enabling fault-tolerant, distributed, in-memory computation. RDD is short for Resilient Distributed Dataset: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only collections of objects partitioned across a set of machines, and a lost partition can be rebuilt from its lineage. Two kinds of operations are performed on RDDs (see the sketch after this list):

  • Transformations: Transformations create a new dataset from an existing one. They are lazily evaluated.

  • Actions: Spark forces the calculations to execute only when actions are invoked on the RDDs.
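Here is a minimal sketch of the two kinds of operations, using made-up data: the transformations build a lineage lazily, and nothing executes until an action is called:

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-ops")

    nums = sc.parallelize(range(10))           # base RDD
    evens = nums.filter(lambda n: n % 2 == 0)  # transformation: lazy, no work yet
    squared = evens.map(lambda n: n * n)       # another lazy transformation

    # Actions trigger execution of the whole lineage:
    print(squared.collect())  # [0, 4, 16, 36, 64]
    print(squared.count())    # 5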

Course Content

Introduction

  1. Introduction
  2. PySpark

Installation

  1. Installation brief
  2. Installation 2
  3. PATH & installing Spark
  4. Bash profile
  5. Configuring a local instance
  6. Multi-node instance
  7. Multi-node instance (continued)

Configuring a session

  1. Configuring a session in Jupyter
  2. PySpark Cloudera images
  3. Reading data
  4. Partitions and performance
  5. Transformations

RDD and more

  1. Overview of RDD actions
  2. Pitfalls of using RDDs
  3. DataFrame creation
  4. Accessing underlying RDDs
  5. Performance optimizations
  6. Reflection
  7. SQL to interact with DataFrames
  8. Column transformations
  9. Actions
  10. Preparing data
  11. Handling missing observations
  12. Handling outliers
  13. Computing correlations
  14. Visualizing interactions
  15. Machine learning with the MLlib module
  16. Testing the data
  17. Transforming the data
  18. Creating an RDD for training
  19. Forecasting example
  20. Statistics
  21. Machine learning with the ML module: transformers
  22. Estimators
  23. Pipelines

Predicting and estimating

  1. Predicting
  2. Estimating example
  3. Hyperparameters
  4. Features
  5. Variables
  6. Streaming
  7. Aggregations
  8. Continuous aggregation
  9. GF installation
  10. Graph
  11. Queries
  12. Example
  13. Visualizing
2.8 out of 5 (3 ratings)

Detailed Rating

  5 stars: 0
  4 stars: 1
  3 stars: 1
  2 stars: 0
  1 star: 1
30-Day Money-Back Guarantee

Includes

6 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion