3.8 out of 5
3.8
14 reviews on Udemy

Python PySpark & Big Data Analysis Using Python Made Simple

PySpark and Big Data Analysis Using Python for Absolute Beginners
Instructor:
Satish Venkatesh
675 students enrolled
Learn different pyspark functions
Learn how big data analysis is done using pyspark

Welcome to the course ‘Python PySpark and Big Data Analysis Using Python Made Simple’.

This course is taught by a software engineer who has cracked interviews at around 16 software companies.

Sometimes, life gives us no time to prepare. There are emergency times wherein we have to muster our courage and start bringing situations under our control rather than being controlled by them. At the end of the day, we all leave this earth empty-handed. But given a situation, we should live and fight in such a way that the whole sequence of actions makes us proud and gives us goosebumps when we think about it ten years later.

Apache Spark is an open-source processing engine built around speed, ease of use, and analytics. 

Developed to utilize distributed, in-memory data structures to improve data processing speeds for most workloads, Spark performs up to 100 times faster than Hadoop MapReduce for iterative algorithms. Spark supports Java, Scala, and Python APIs for ease of development.

The PySpark API enables the use of Python to interact with the Spark programming model. For programmers who are already familiar with Python, the PySpark API provides easy access to the extremely high-performance data processing enabled by Spark’s Scala architecture, without the need to learn Scala.

Though Scala is more efficient, the PySpark API allows data scientists experienced in Python to write programming logic in the language most familiar to them. They can use it to perform rapid distributed transformations on large data sets and get the results back in Python-friendly notation.

PySpark transformations (such as map, flatMap, and filter) return resilient distributed datasets (RDDs). Short functions are passed to RDD methods using Python’s lambda syntax, while longer functions are defined with the def keyword.

PySpark automatically ships the requested functions to worker nodes. The worker nodes then run the Python processes and push the results back to SparkContext, which stores the data in the RDD. 

PySpark offers access via an interactive shell, providing a simple way to learn the API. 

This course contains many programs and single-line statements that extensively explain the use of the PySpark APIs.
Through programs on small data sets, we demonstrate how a file with large data sets is actually analyzed and the required results are returned.

The course duration is around 6 hours. We have followed a question-and-answer approach to explain the PySpark API concepts.
Please check the list of PySpark questions on the course landing page, and if you are interested, enroll in the course.

Note: This course is designed for Absolute Beginners 

Questions:

>> Create and print an RDD from a Python collection of numbers. The given collection should be distributed across 5 partitions
>> Demonstrate the use of glom() function
>> Using the range() function, print ‘1, 3, 5’
>> What is the output of the below statements?
     sc=SparkContext()
     sc.setLogLevel("ERROR")

     sc.range(5).collect()
     sc.range(2, 4).collect()
     sc.range(1, 7, 2).collect()

>> For a given Python collection of numbers in an RDD with a given set of partitions, perform the following:
   -> write a function which calculates the square of each number
   -> apply this function on the specified partitions in the RDD

>> Given the below data in an RDD:

   [[0, 1], [2, 3], [4, 5]]

   write statements such that you get the below outputs:

   [0, 1, 16, 25]
   [0, 1]
   [4, 9]
   [16, 25]

>> With the help of SparkContext(), read and display the contents of a text file

>> Explain the use of the union() function

>> Is it possible to combine and print the contents of a text file and the contents of an RDD?

>> Write a program to list a particular directory's text files and their contents

>> Given two functions seqOp and combOp, what is the output of the below statements:
     seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
     combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
     print(sc.parallelize([1, 2, 3, 4], 2).aggregate((0, 0), seqOp, combOp))

>> Given a data set: [1, 2], write a statement such that we get the output as below:

      [(1, 1), (1, 2), (2, 1), (2, 2)]

>> Given the data: [1,2,3,4,5].
     What is the difference between the output of the below 2 statements:
        print(sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(4).glom().collect())
        print(sc.parallelize([1, 2, 3, 4, 5], 5).coalesce(4).glom().collect())

>> Given two RDDs x and y:
        x = sc.parallelize([("a", 1), ("b", 4)])
        y = sc.parallelize([("a", 2)])

     Write a PySpark program statement which produces the below output:
        [('a', ([1], [2])), ('b', ([4], []))]

>> Given the below statement:

      m = sc.parallelize([(1, 2), (3, 4)]).collectAsMap()

      Find out a way to print the below values:

      '2'
      '4'

>> Explain the output of the below statement:

     print(sc.parallelize([2, 3, 4]).count())
     output: 3

>> Given the statement:

     rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

     Find a way to count the occurrences of the keys and print the output as below:

     [('a', 2), ('b', 1)]

>> Explain the output of the below statement:

     print(sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items()))

     output: [(1, 2), (2, 3)]

>> Given the RDD which contains the elements -> [1, 1, 2, 3],
     try to print only the first occurrence of each number

     output: [1, 2, 3]

>> Given the below statement:
     rdd = sc.parallelize([1, 2, 3, 4, 5])
     write a statement to print only -> [2, 4]

>> Given data: [2, 3, 4]. Try to print only the first element in the data (i.e., 2)

>>  Given the below statement:
      rdd = sc.parallelize([2, 3, 4])
      Write a statement to get the below output from the above rdd:
      [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]

>> Given the below statement:
       x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
    Write a statement/statements to get the below output from the above rdd:
       [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

>> Given the below statement:
     rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    What is the output of the below statements:
     print(sorted(rdd.foldByKey(0, add).collect()))
     print(sorted(rdd.foldByKey(1, add).collect()))
     print(sorted(rdd.foldByKey(2, add).collect()))

>> Given below statements:
     x = sc.parallelize([("a", 1), ("b", 4)])
     y = sc.parallelize([("a", 2), ("c", 8)])
     Write a statement to get the output as
     [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]

>> Is it possible to get the number of partitions in an RDD?

>> Given below statements:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
write a snippet to get the following output: 
[(0, [2, 8]), (1, [1, 1, 3, 5])] 

>> Given below statements:
w = sc.parallelize([("a", 5), ("b", 6)])
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
z = sc.parallelize([("b", 42)])
write a snippet to get the following output:
output: [('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]

>> Given below statements:
rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
write a snippet to get the following output: 
output:
[1, 2, 3]

>> Given below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
write a snippet to get the following output:
output:
[('a', (1, 2)), ('a', (1, 3))]
[('a', (2, 1)), ('a', (3, 1))]

>> For the given data: [0, 1, 2, 3]
Write a statement to get the output as:
[(0, 0), (1, 1), (4, 2), (9, 3)]

>> For the given data: [0, 1, 2, 3, 4] and [0, 1, 2, 3, 4]
Write a statement to get the output as:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

>> Given the data:
[(0, 0), (1, 1), (4, 2), (9, 3)]
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
Write a statement to get the output as:
[(0, [[0], [0]]), (1, [[1], [1]]), (2, [[], [2]]), (3, [[], [3]]), (4, [[2], [4]]), (9, [[3], []])]

>> Given the data: [(1, 2), (3, 4)]
Print only '1' and '3'

>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
write a snippet to get the following output:
output:
[('a', (1, 2)), ('b', (4, None))]
[('a', (1, 2))]

>> What is the output of the below statements:
rdd = sc.parallelize(["b", "a", "c"])
print(sorted(rdd.map(lambda x: (x, 1)).collect()))

>> What is the output of the below statements:
rdd = sc.parallelize([1, 2, 3, 4], 2)
def f(iterator): yield sum(iterator)
print(rdd.mapPartitions(f).collect())

>> Explain the output of the below code snippet:

rdd = sc.parallelize([1, 2, 3, 4], 4)
def f(splitIndex, iterator):
  yield splitIndex
print(rdd.mapPartitionsWithIndex(f).sum())
output: 6

>> Explain the output of the below code snippet:
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
print(x.mapValues(f).collect())
output: [('a', 3), ('b', 1)]

>> What is the output of the below snippet:
import findspark
findspark.init('/opt/spark-2.2.1-bin-hadoop2.7')
import pyspark
import os
from pyspark import SparkContext

sc=SparkContext()
sc.setLogLevel("ERROR")

print(sc.parallelize([1, 2, 3]).mean())

>> What is the output of the below snippet:
pairs = sc.parallelize([1, 2, 3]).map(lambda x: (x, x))
sets = pairs.partitionBy(2).glom().collect()
print(sets)

>> Given the RDD below:
sc.parallelize([1, 2, 3, 4, 5])
write a statement to get the below output:
output: 15

>> Given the statement below:
rdd = sc.parallelize([(“a”, 1), (“b”, 1), (“a”, 1)])
Write a statement to get the below output:
output:
[(‘a’, 2), (‘b’, 1)]

>> What is the difference between leftOuterJoin() and rightOuterJoin()?

>> Given the below statement:
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
What is the output of the below statements:
print(sc.parallelize(tmp).sortBy(lambda x: x[0]).collect())
print(sc.parallelize(tmp).sortBy(lambda x: x[1]).collect())

>> Given the statement:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y = sc.parallelize([("a", 3), ("c", None)])
do something to get the output:
[('a', 1), ('b', 4), ('b', 5)]

>> Given the statement:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
do something to get the output:
[('b', 4), ('b', 5)]

>> Given the statement:
sc.parallelize(["a", "b", "c", "d"], 3)
do something to get the output:
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]

>> Given the statement:
sc.parallelize(["a", "b", "c", "d", "e"], 3)
do something to get the output:
[('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]

>> Given the statements:
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
do something to get the output:
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]

>> Output of the given program:

sc=SparkContext()
sc.setLogLevel("ERROR")

data = [["xyz1","a1",1,2],
  ["xyz1","a2",3,4],
  ["xyz2","a1",5,6],
  ["xyz2","a2",7,8],
  ["xyz3","a1",9,10]]

rdd = sc.parallelize(data,4)
output = rdd.map(lambda y : [y[0],y[1],(y[2]+y[3])/2])
output2 = output.filter(lambda y : “a2” in y)
output4 = output2.takeOrdered(num=3, key = lambda x :-x[2])
print(output4)
output5 = output2.takeOrdered(num=3, key = lambda x :x[2])
print(output5)

>> Output the contents of a text file

>> Output the contents of a CSV file

>> Write a program to save to a sequence file and read from a sequence file

>> Write a program to save data in JSON format and display the contents of a JSON file

>> Write a program to add the indices of data sets

>> Write a program to differentiate between odd and even numbers using the filter function

>> Write a program to explain the concept of the join function

>> Write a program to explain the concept of the map function

>> Write a program to explain the concept of the fold function

>> Write a program to explain the concept of the reduceByKey function

>> Write a program to explain the concept of the combineByKey function

>> There are many more programs showcased to analyze big data

Introduction

1. Introduction
2–78. Questions 1 to 77

Exercises

1–4. Exercises 1 to 4

Detailed Rating

5 stars: 6
4 stars: 3
3 stars: 2
2 stars: 1
1 star: 2
30-Day Money-Back Guarantee

Includes

6 hours on-demand video
5 articles
Full lifetime access
Access on mobile and TV
Certificate of Completion