Welcome to Managing Big Data on Google’s Cloud Platform. This is the second course in a series of courses designed to help you attain the coveted Google Certified Data Engineer certification.
Additionally, the series shows you the role of the data engineer on the Google Cloud Platform.
At this juncture, the Google Certified Data Engineer is the only real-world certification for data and machine learning engineers.
NOTE: This is NOT a course on Big Data. This is a course on a specific cloud service called Google Cloud Dataproc. The course was designed to be part of a series for those who want to become data engineers on Google’s Cloud Platform.
This course is all about Google’s Cloud and migrating on-premise Hadoop jobs to GCP. In reality, Big Data is simply about unstructured data. There are two core types of data in the real world. The first is structured data: the kind of data found in a relational database. The second is unstructured data: a file sitting on a file system. Approximately 90% of all data in the enterprise is unstructured, and our job is to give it structure.
Why do we want to give it structure? We want to give it structure so we can analyze it. Recall that roughly 99% of all applied machine learning is supervised learning. That simply means we have a data set and we point our machine learning models at that data set in order to gain insight into it.
In the course we will spend much of the time working in Cloud Dataproc. This is Google’s managed Hadoop and Spark platform.
Recall the end goal of big data is to get that data into a state where it can be analyzed and modeled. Therefore, we are also going to cover how to work on machine learning projects with big data at scale.
Please keep in mind this course alone will not give you the knowledge and skills to pass the exam. The course will provide you with the big data knowledge you need for working with Cloud Dataproc and for moving existing projects to the Google Cloud Platform.
*Five Reasons to take this Course.*
1) The Top Job in the World
The data engineer is one of the most in-demand roles in the world. Many believe that it’s the data scientist, but several studies have broken down the job descriptions, and the most needed position is that of the data engineer.
2) Google’s the World Leader in Data
Amazon’s AWS is the most used cloud and Azure has the best UI, but no cloud vendor in the world understands data like Google. They are the world leader in open-source artificial intelligence. You can’t be the leader in AI without being the leader in data.
3) 90% of all Organizational Data is Unstructured
The study of big data is the study of unstructured data. As the data in companies grows, most will need to scale to unprecedented levels. Without the cloud, that kind of scale demands a significant investment in infrastructure and talent that most organizations can’t make.
4) The Data Revolution is Now
We are in a data revolution. Data used to be viewed as a simple necessity and lower on the totem pole. Now it is more widely recognized as the source of truth. As we move into more complex systems of data management, the role of the data engineer becomes extremely important as a bridge between the DBA and the data consumer. Beyond the ubiquitous spreadsheet, graduating from RDBMS (which will always have a place in the data stack), we now work with NoSQL and Big Data technologies.
5) Data is the Foundation
Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers giving meaning to an otherwise static entity. Simply put, data engineers clean, prepare and optimize data for consumption. Once the data becomes useful, data scientists can perform a variety of analyses and visualization techniques to truly understand the data, and eventually, tell a story from the data.
Thank you for your interest in Managing Big Data on Google’s Cloud Platform and we will see you in the course!!
Introduction
In this lesson, let's take a high-level look at what this course is about.
This is the second course in a series of courses on Google's Cloud Platform.
This is not an entry level course on Hadoop or Spark.
This course is about Cloud Dataproc and how to move existing projects to GCP.
In this lesson let's talk about the course's targeted audience.
If you fit one of these roles, or just want to learn more about the data engineer path on GCP, then this course is for you.
This lesson is just a series of questions I've been asked about this course or other similar courses.
I try to answer them before you take the course.
There are only two kinds of data in the enterprise.
One is structured and the other one is unstructured.
Let's learn about them in this lesson.
Every organization has 4 different kinds of data.
Let's learn what they are in this lesson.
Why Cloud Dataproc
Why use GCP for big data if you already have an existing Hadoop on premise set up?
Lots of reasons.
Money and scale are two reasons we will discuss in the following videos.
In this lesson let's learn about all the sundry components of an on-premise build.
On-premise builds are costly and don't scale very well.
Is it better to scale up or out?
The answer is out but let's find out why in this lesson.
In this lesson let's learn about regions and zones.
In GCP we decouple storage and compute.
We do this so we can easily spin up and tear down our clusters.
Let's learn about on-premise builds versus GCP.
In GCP our end goal is to use Google Cloud Storage to house our data.
We can then use other services like BigQuery to analyze our data once that data is sitting on common storage.
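As a quick preview (the bucket, dataset, and file names below are hypothetical), staging data on common storage from the Cloud Shell might look something like this:

```
# Create a regional bucket and copy raw files into it (names are placeholders).
gsutil mb -l us-central1 gs://my-data-lake
gsutil cp sales-*.csv gs://my-data-lake/raw/

# Once the data sits in Cloud Storage, other services can read it directly,
# for example loading it into a BigQuery table for analysis.
bq load --autodetect --source_format=CSV my_dataset.sales gs://my-data-lake/raw/sales-*.csv
```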
Cloud Dataproc in Action
The entire cluster creation process in one screen.
Let's talk about the core parts in this lesson.
In this lesson let's learn how to create a very simple cluster using the Google Cloud Console.
Creating a Dataproc cluster using the shell is just as easy.
In this lesson let's open a cloud shell session and spin up a cluster.
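As a rough sketch of what we'll type in that session (the cluster name, region, zone, and machine types are placeholders, and flag defaults change between gcloud releases), a minimal create command looks like this:

```
# Minimal Dataproc cluster: one master, two workers.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2 \
    --num-workers=2
```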
We have three options when spinning up our Dataproc clusters.
In this lesson let's learn what they are and why we should probably choose high availability for production loads.
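For reference, the three modes map to gcloud flags roughly like this (cluster names and region are placeholders):

```
# Single node: master and worker on one VM, good for development and experiments.
gcloud dataproc clusters create dev-cluster --region=us-central1 --single-node

# Standard (the default): one master plus N workers.
gcloud dataproc clusters create std-cluster --region=us-central1 --num-workers=2

# High availability: three masters, so losing a master doesn't kill running jobs.
gcloud dataproc clusters create prod-cluster --region=us-central1 --num-masters=3 --num-workers=2
```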
Preemption is going to save your organization or your clients money.
Let's talk about what this is and how to leverage it on GCP.
In this brief lesson let's learn how preemption is handled on clusters with preemptible workers.
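As a sketch (the flag names for secondary workers have changed across gcloud releases, so check your version), adding preemptible workers looks roughly like this:

```
# Two standard workers plus four cheap, reclaimable preemptible workers.
# Older gcloud releases used --num-preemptible-workers instead.
gcloud dataproc clusters create batch-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=4 \
    --secondary-worker-type=preemptible
```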
We have several images to choose from.
Let's learn why we might want to use the most stable ones instead of our other options.
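Pinning an image is a single flag; the version string below is only an example and will drift over time:

```
# Pin a known-stable image rather than taking the default or a preview image.
gcloud dataproc clusters create stable-cluster \
    --region=us-central1 \
    --image-version=2.0-debian10
```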
We can easily scale our clusters even when jobs are running.
Let's learn how to do that in this lesson.
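The resize itself is one command (cluster name and region are placeholders):

```
# Grow the running cluster to five primary workers.
gcloud dataproc clusters update demo-cluster \
    --region=us-central1 \
    --num-workers=5
```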
We can spin up custom boxes in GCP.
Let's learn how in this lesson.
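As a sketch, custom machine types use the custom-&lt;vCPUs&gt;-&lt;memoryMB&gt; naming convention (memory must be a multiple of 256 MB); the sizes below are arbitrary examples:

```
# 4 vCPUs / 16 GB master, 6 vCPUs / 22.5 GB workers.
gcloud dataproc clusters create custom-cluster \
    --region=us-central1 \
    --master-machine-type=custom-4-16384 \
    --worker-machine-type=custom-6-23040 \
    --num-workers=2
```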
In this lesson let's learn how to customize our clusters.
We can easily install additional software on our clusters.
Let's learn how to use initialization scripts to do that.
In this lesson let's demo how to implement an initialization script.
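In outline (the bucket and script paths are hypothetical), wiring in an initialization script looks like this:

```
# The script lives in Cloud Storage and runs on every node as it is created.
gcloud dataproc clusters create init-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --initialization-actions=gs://my-bucket/install-packages.sh
```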
Clusters have a high availability option.
Let's learn how to implement it in this lesson via the console.
Submitting Jobs
In this lesson let's learn how to submit jobs to our cluster once it's created.
In this lesson let's submit some jobs.
In this lesson we will see that our cluster isn't large enough, forcing us to kill the cluster and create a new one.
In this lesson let's learn how to submit a spark job to our cluster and view the output.
In this lesson let's learn how to submit a PySpark job via the Google Cloud Shell.
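For reference, both submissions follow the same pattern (the cluster name, region, and Cloud Storage paths are placeholders):

```
# Spark (Scala/Java) job using the example jar shipped on the cluster.
gcloud dataproc jobs submit spark \
    --cluster=demo-cluster --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

# PySpark job whose script is staged in Cloud Storage.
gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster=demo-cluster --region=us-central1
```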
In this lesson let's learn how to move our on-premise Hadoop jobs to GCP.
In this lesson let's look at the code behind a Python and a Scala job.
You're the Data Engineer
In this lesson let's learn about white boarding.
It's used heavily at Google for interviewing.
It's also used for architecting cloud solutions.
Let's white board the approach we'd use to move on-premise Hadoop and other big data jobs to GCP.
When we are designing solutions for clients, we want to make sure they understand that their data and clusters need to be in the same zone or region so they don't incur excessive data movement charges.
You'll get a lot of questions about preemptibles and how to use them.
Let's cover some high-level talking points and reinforce this idea of temporary clusters.
Clients are going to want to know the exact steps for moving their jobs to GCP.
In this lesson we will explain our phased approach to them.
Clients always want something they can customize.
In this lesson let's explain to them how easy it is to use initialization scripts.