4.3 out of 5
63 reviews on Udemy

Managing Big Data on Google’s Cloud Platform

The Second Course in a Series for Attaining the Google Certified Data Engineer
Mike West
629 students enrolled
English [Auto-generated]
At the end of the course you'll understand Cloud Dataproc.
You'll also know how to craft machine learning projects at scale on GCP.
You'll also know how to integrate Dataproc with other core services like BigQuery.
Additionally, you'll learn how to migrate on-premise Hadoop and Spark jobs to Cloud Dataproc.

Welcome to Managing Big Data on Google’s Cloud Platform. This is the second course in a series of courses designed to help you attain the coveted Google Certified Data Engineer certification.

Additionally, the series of courses is going to show you the role of the data engineer on the Google Cloud Platform.

At this juncture, the Google Certified Data Engineer is the only real-world certification for data and machine learning engineers.

NOTE: This is NOT a course on Big Data. This is a course on a specific cloud service called Google Cloud Dataproc. The course was designed to be part of a series for those who want to become data engineers on Google’s Cloud Platform.

This course is all about Google’s Cloud and migrating on-premise Hadoop jobs to GCP. In reality, Big Data is simply about unstructured data. There are two core types of data in the real world. The first is structured data, the kind of data found in a relational database. The second is unstructured data, such as a file sitting on a file system. Approximately 90% of all data in the enterprise is unstructured, and our job is to give it structure.

Why do we want to give it structure? We want to give it structure so we can analyze it. Recall that 99% of all applied machine learning is supervised learning. That simply means we have a data set and we point our machine learning models at that data set in order to gain insight into that data.

In the course we will spend much of the time working in Cloud Dataproc. This is Google’s managed Hadoop and Spark platform. 

Recall the end goal of big data is to get that data into a state where it can be analyzed and modeled. Therefore, we are also going to cover how to work on machine learning projects with big data at scale.

Please keep in mind this course alone will not give you the knowledge and skills to pass the exam. The course will provide you with the big data knowledge you need for working with Cloud Dataproc and for moving existing projects to the Google Cloud Platform. 

*Five Reasons to Take This Course*

1) The Top Job in the World

The data engineer role is the single most needed role in the world. Many believe that it’s the data scientist, but several studies have broken down the job descriptions and the most needed position is that of the data engineer.

2) Google’s the World Leader in Data

Amazon’s AWS is the most used cloud and Azure has the best UI, but no cloud vendor in the world understands data like Google. They are the world leader in open source artificial intelligence. You can’t be the leader in AI without being the leader in data.

3) 90% of all Organizational Data is Unstructured

The study of big data is the study of unstructured data. As the data in companies grows, most will need to scale to unprecedented levels. Without the cloud, that won’t be possible unless they make a significant investment in infrastructure and talent.

4) The Data Revolution is Now

We are in a data revolution. Data used to be viewed as a simple necessity and lower on the totem pole. Now it is more widely recognized as the source of truth. As we move into more complex systems of data management, the role of the data engineer becomes extremely important as a bridge between the DBA and the data consumer. Beyond the ubiquitous spreadsheet and the RDBMS (which will always have a place in the data stack), we now work with NoSQL and Big Data technologies.

5) Data is the Foundation

Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers giving meaning to an otherwise static entity. Simply put, data engineers clean, prepare and optimize data for consumption. Once the data becomes useful, data scientists can perform a variety of analyses and visualization techniques to truly understand the data, and eventually, tell a story from the data. 

Thank you for your interest in Managing Big Data on Google’s Cloud Platform and we will see you in the course!!



In this lesson let's high level what this course is about. 

This is the second course in a series of courses on Google's Cloud Platform. 

This is not an entry level course on Hadoop or Spark. 

This course is about Cloud Dataproc and how to move existing projects to GCP. 

Is this Course for You?

In this lesson let's talk about the course's targeted audience. 

If you fit one of these roles or just want to learn more about the data engineer path on GCP, then this course is for you.

Instructor Course Q&A

This lesson is just a series of questions I've been asked about this course or other similar courses. 

I try to answer them before you take the course. 

Unstructured Data

There are only two kinds of data in the enterprise. 

One is structured and the other one is unstructured. 

Let's learn about them in this lesson. 

Data Sources

Every organization has 4 different kinds of data. 

Let's learn what they are in this lesson. 


Why Cloud Dataproc

Why Use GCP for Big Data?

Why use GCP for big data if you already have an existing Hadoop on premise set up? 

Lots of reasons.

Money and scale are two reasons we will discuss in the following videos.

On-Premise Hadoop Build

In this lesson let's learn about all the sundry components of an on-premise build. 

On-premise builds are costly and don't scale very well. 

Scaling up or Scaling Out

Is it better to scale up or out? 

The answer is out but let's find out why in this lesson. 

Zones and Regions

In this lesson let's learn about regions and zones. 

Separating Compute and Storage

In GCP we decouple storage and compute. 

We do this so we can easily spin up and tear down our clusters. 

Let's learn about on-premise versus the GCP. 

Cloud Dataproc Architecture

In GCP our end goal is to use Google Cloud Storage to house our data. 

We can then use other services like BigQuery to analyze our data once that data is sitting on common storage. 


Cloud Dataproc in Action

Create Cluster Screen

The entire cluster creation process in one screen. 

Let's talk about the core parts in this lesson. 

Create Dataproc Cluster in GCP Console

In this lesson let's learn how to create a very simple cluster using the Google Cloud Console. 

Create a Cluster using the Shell

Creating a Dataproc cluster using the shell is just as easy.

In this lesson let's open a cloud shell session and spin up a cluster.
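
As a rough sketch of what a shell-based cluster build looks like, the snippet below assembles a `gcloud dataproc clusters create` command in Python. The cluster name, region, and worker count are placeholder assumptions, not values from the course, and the script only prints the command instead of running it.

```python
import shlex

def create_cluster_command(name, region, num_workers=2):
    """Build (but do not run) a gcloud command that creates a Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]

# Placeholder values chosen for illustration only.
cmd = create_cluster_command("demo-cluster", "us-central1")
print(shlex.join(cmd))
```

The printed line could be pasted into a Cloud Shell session once the placeholders are swapped for real project values.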

The Three Dataproc Configurations

We have three options when we spin up our Dataproc clusters.

In this lesson let's learn what they are and why we should probably choose high availability for production loads. 
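
One way to picture the three configurations is as the extra flags each one adds to a `gcloud dataproc clusters create` command. The mapping below is a minimal sketch: the dictionary keys are labels picked for this example, and only the flags themselves come from the gcloud CLI.

```python
# Extra gcloud flags for each of the three Dataproc cluster modes.
MODE_FLAGS = {
    "single-node": ["--single-node"],          # one VM, no separate workers
    "standard": ["--num-masters=1"],           # one master plus workers
    "high-availability": ["--num-masters=3"],  # three masters
}

def create_flags(mode):
    """Return the mode-specific flags, rejecting unknown mode names."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown cluster mode: {mode}")
    return list(MODE_FLAGS[mode])

print(create_flags("high-availability"))
```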

Using Preemption on Cloud Dataproc

Preemption is going to save your organization or your clients money.

Let's talk about what it is and how to leverage it on GCP.

How GCP Handles Preemption

In this brief lesson let's learn how preemption is handled on clusters that use preemptible workers.

Image Version Options

We have several images to choose from.

Let's learn why we might want to use the most stable ones instead of our other options. 

Scaling Clusters

We can easily scale our clusters even when jobs are running. 

Let's learn how to do that in this lesson. 
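
Scaling an existing cluster comes down to a single `gcloud dataproc clusters update` call. The sketch below builds that command in Python. The cluster name, region, and target worker count are assumptions made up for the example, and nothing is executed.

```python
import shlex

def scale_cluster_command(name, region, num_workers):
    """Build (but do not run) a gcloud command that resizes the primary workers."""
    return [
        "gcloud", "dataproc", "clusters", "update", name,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]

# Grow a hypothetical cluster to five primary workers.
scale_cmd = scale_cluster_command("demo-cluster", "us-central1", 5)
print(shlex.join(scale_cmd))
```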

Creating a Custom Image

We can spin up custom boxes in GCP. 

Let's learn how to do that in this lesson.

Cluster Customization

In this lesson let's learn how to customize our clusters.

3 Steps to Install Additional Software on Clusters

We can easily install additional software on our clusters. 

Let's learn how to use initialization scripts to do that. 

Initialization Actions

In this lesson let's demo how to implement an initialization script. 
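
Initialization scripts live in Cloud Storage and are passed to the cluster at creation time. The following is a minimal sketch of that wiring. The bucket and script names are hypothetical, and the function only builds the command.

```python
import shlex

def create_with_init_actions(name, region, script_uris):
    """Build a cluster-create command that runs the given scripts on each node."""
    for uri in script_uris:
        # Initialization actions are referenced by Cloud Storage URI.
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a gs:// URI, got: {uri}")
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        "--initialization-actions=" + ",".join(script_uris),
    ]

# Hypothetical bucket and script, for illustration only.
init_cmd = create_with_init_actions(
    "demo-cluster", "us-central1", ["gs://my-bucket/install-tools.sh"]
)
print(shlex.join(init_cmd))
```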

High Availability

Clusters have a high availability option.

Let's learn how to implement it in this lesson via the console. 


Submitting Jobs

The Submit Jobs Screen

In this lesson let's learn how to submit jobs to our cluster once it's created.

Submitting Spark Job - Console

In this lesson let's submit some jobs. 

In the lesson we will see that our cluster isn't large enough, forcing us to kill the cluster and create a new one.

Submitting Spark Job - Google Cloud Shell

In this lesson let's learn how to submit a spark job to our cluster and view the output. 
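
For a sense of what a shell submission involves, the sketch below builds a `gcloud dataproc jobs submit spark` command for the SparkPi example that ships with Spark. The cluster name and region are placeholders, and the jar path assumes the examples jar sits in its usual location on the cluster image. The command is printed, not run.

```python
import shlex

def submit_spark_command(cluster, region, main_class, jar, job_args):
    """Build (but do not run) a gcloud command that submits a Spark job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
        "--",  # everything after this is passed to the job itself
        *job_args,
    ]

spark_cmd = submit_spark_command(
    "demo-cluster",
    "us-central1",
    "org.apache.spark.examples.SparkPi",
    "file:///usr/lib/spark/examples/jars/spark-examples.jar",  # assumed path
    ["1000"],
)
print(shlex.join(spark_cmd))
```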

Submitting PySpark Job - SSH

In this lesson let's learn how to submit a PySpark job via the Google Cloud Shell. 

Moving from On-Premise to Google Cloud Dataproc

In this lesson let's learn how to move our on-premise Hadoop jobs to GCP. 

Python and Scala Code Reference Change

In this lesson let's look at the code behind a Python and a Scala job.
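
The course description does not spell out the change, but a typical one when moving a job to Dataproc is swapping `hdfs://` file references for `gs://` references so the job reads from Cloud Storage. The helper below is a small sketch of that rewrite. The function name, namenode address, and bucket are all hypothetical.

```python
from urllib.parse import urlparse

def hdfs_to_gcs(uri, bucket):
    """Rewrite an hdfs:// URI to a gs:// URI in the given bucket, keeping the path."""
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        return uri  # leave non-HDFS references untouched
    return f"gs://{bucket}{parsed.path}"

# Hypothetical on-premise path and bucket name.
print(hdfs_to_gcs("hdfs://namenode:8020/data/input.txt", "my-bucket"))
```

In a real job the same idea usually amounts to editing the input and output paths in the Python or Scala source and leaving the rest of the code alone.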


You're the Data Engineer

White Boarding: Difference between On-prem and Cloud Dataproc

In this lesson let's learn about white boarding. 

It's used heavily at Google for interviewing. 

It's also used for architecting cloud solutions. 

White Boarding: Moving Jobs to GCP

Let's white board the approach we'd use to move on-premise Hadoop and other big data jobs to GCP. 

White Boarding: Data Near Clusters

When we are designing solutions for clients we want to make sure they understand that their data and clusters need to be in the same zones or regions so they don't incur excessive data movement charges. 
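
A quick sanity check for this talking point is to compare a cluster's zone with a bucket's location. The sketch below assumes zone names follow the `region-letter` pattern (for example `us-central1-a`) and that the bucket is in a single region. Multi-region buckets such as `US` would need their own rule, and both helper names are made up for this example.

```python
def region_of(zone):
    """Derive the region from a zone name by dropping the trailing zone letter."""
    return zone.rsplit("-", 1)[0]

def is_colocated(cluster_zone, bucket_location):
    """True when a single-region bucket sits in the cluster's region."""
    return region_of(cluster_zone) == bucket_location.lower()

print(is_colocated("us-central1-a", "US-CENTRAL1"))
print(is_colocated("europe-west1-b", "US-CENTRAL1"))
```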

White Boarding: Defining Preemptibles

You'll get a lot of questions about preemptibles and how to use them. 

Let's high level some talking points and reinforce this idea of temporary clusters. 

White Boarding: On-Premise Architecture to GCP

Clients are going to want to know the exact steps to moving their jobs to GCP. 

In this lesson we will explain our phased approach to them. 

White Boarding: Add Software to Nodes

Clients always want something they can customize. 

In this lesson let's explain to them how easy it is to use initialization scripts.  



1 hour on-demand video
6 articles
Full lifetime access
Access on mobile and TV
Certificate of Completion