Streaming Analytics on Google Cloud Platform
Review from a course in this series:
“I like the detail, especially highlighting the specifics of the test. The detail makes this course worth the investment including the summary at the end and the quizzes that test my knowledge.” — Valentina Kibuyaga
Welcome to Streaming Analytics on Google Cloud Platform. This is the fifth and final course in a series designed to help you attain the coveted Google Certified Data Engineer certification.
Additionally, the series of courses is going to show you the role of the data engineer on the Google Cloud Platform.
While this is a short course, the subject matter is dense. You won’t have to author Java pipelines for the exam, but you will need to know a lot about how they are created and executed.
At this juncture, the Google Certified Data Engineer is the only real-world certification for data and machine learning engineers.
NOTE: This is NOT a course on programming Apache Beam Pipelines. This is a very targeted course on understanding how Apache Beam and Cloud Dataflow provide us with an infrastructure to build pipelines for streaming data. The course will provide the learner with the nomenclature and process understanding they’ll need to pass the Certified Data Engineering Exam.
Streaming data processing is a big deal in big data these days, and for good reasons. Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
Processing data as it arrives spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.
In Google Cloud Platform, the main tool we use for building these pipelines is Cloud Dataflow. The product itself is a fusion of the code written by Google developers and that of the Apache Foundation. The project that came out of that business cohabitation is Apache Beam.
Apache Beam (Batch + strEAM) is a model and set of APIs for doing both batch and streaming data processing. It was open-sourced by Google (with Cloudera and PayPal) in 2016 via an Apache incubator project.
In this course, we are going to learn about Apache Beam and Cloud Dataflow. While this is an entry-level course, streaming will be new to many. As in most of my other courses in this series, I’ll attempt to break down the more complicated topics pictorially.
*Five Reasons to take this Course.*
1) You Want to be a Data Engineer
It’s the number one job in the world (not just within the computer space). The growth potential career-wise is second to none. You want the freedom to move anywhere you’d like. You want to be compensated for your efforts. You want to be able to work remotely. The list of benefits goes on.
2) The Google Certified Data Engineer
Google is always ahead of the game. If you were to look back at a timeline of their accomplishments in the data space, you might believe they have a crystal ball. They’ve been a decade ahead of everyone. Now, they are the first and only cloud vendor to have a data engineering certification. With their track record, I’ll go with Google.
3) The Growth of Data is Insane
Ninety percent of all the world’s data has been created in the last two years. Businesses around the world generate approximately 450 billion transactions a day. The amount of data collected by all organizations is approximately 2.5 exabytes a day. That number doubles every month.
4) Apache Beam in Plain English
Apache Beam pipelines require basic programming skills. The Google Certified Data Engineering exam will require that you be able to identify the parts of a Beam pipeline, in addition to understanding some of the vernacular and nuances behind streaming data.
5) You Want to be Ahead of the Curve
The data engineer role is new. While you’re learning, building your skills and becoming certified, you are also among the first to be part of this burgeoning field. You know that the first to be certified means the first to be hired and the first to receive the top compensation package.
Thank you for your interest in Streaming Analytics on Google Cloud Platform and we will see you in the course!!
In this first lesson let's learn what this course is about.
It's our introduction to Apache Beam and Cloud Dataflow for this course.
I want you to take my course but I want the course to be right for you.
In this lesson let's learn if you are part of the course's target audience.
In this lesson let's take a high-level look at what streaming is.
Let's define streaming and a few terms we will use throughout the course.
In this lesson let's learn about the big three when it comes to big data.
In this lesson let's learn what Apache Beam is.
Apache Beam is an integral part of Cloud Dataflow, so in this section we will learn all about it.
In this lesson we learn about the various objects that make up a Beam Pipeline.
Let's do a quick review of the critical objects in Beam.
The answer key is in the lecture below.
The answer key to the "Pipeline Object Review" lecture.
Let's learn the two core terms and concepts surrounding streaming data.
Understanding how these two times are related is the cornerstone of understanding streaming data sets.
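As a small illustration, assume the two times in question are event time (when something actually happened) and processing time (when the pipeline observed it). The gap between them can then be computed per record. This is a plain-Python sketch rather than Beam code, and the `Event` class, the `skew_seconds` helper, and the sample records are hypothetical names and values made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One record with both timestamps, in seconds from an arbitrary start."""
    name: str
    event_time: float       # when the event actually happened
    processing_time: float  # when the pipeline observed it


def skew_seconds(event: Event) -> float:
    """Return how long the event took to reach the pipeline."""
    return event.processing_time - event.event_time


# Hypothetical records, listed in the order they arrived at the pipeline.
events = [
    Event("a", event_time=10.0, processing_time=11.0),
    Event("b", event_time=12.0, processing_time=12.5),
    Event("c", event_time=5.0, processing_time=13.0),  # a late arrival
]

# Per-record gap between the two times.
skews = {e.name: skew_seconds(e) for e in events}

# Arrival order versus the order in which things really happened.
arrival_order = [e.name for e in events]
event_order = [e.name for e in sorted(events, key=lambda e: e.event_time)]

if __name__ == "__main__":
    print(skews)
    print(arrival_order, event_order)
```

Record "c" happened first but arrived last, which is why ordering by arrival and ordering by event time can disagree.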
How do we slice up infinite out of order data sets?
You use time windows.
In this lesson let's learn how this happens.
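One way to picture time windows is the rough sketch below, which groups out-of-order records into fixed, non-overlapping windows by their event time. It is plain Python and not the Beam API. The 60-second window size, the function names, and the sample events are assumptions chosen only for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical window length


def window_start(event_time: int, size: int = WINDOW_SIZE) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def assign_to_windows(events, size: int = WINDOW_SIZE):
    """Group (event_time, value) pairs by fixed event-time window."""
    windows = defaultdict(list)
    for event_time, value in events:
        windows[window_start(event_time, size)].append(value)
    return dict(windows)


# Hypothetical events as (event_time_seconds, value), in arrival order.
# Note that they do not arrive sorted by event time.
events = [(65, "b"), (5, "a"), (130, "d"), (119, "c"), (59, "e")]

windows = assign_to_windows(events)

if __name__ == "__main__":
    for start in sorted(windows):
        print(f"[{start}, {start + WINDOW_SIZE}): {windows[start]}")
```

Each record lands in the window its event time belongs to, regardless of when it arrived. A real streaming system also has to decide when a window is complete enough to emit, which this sketch ignores.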
Let's create a fictional use case so we can better understand streaming data sets.
There are issues that arise from dealing with infinite unordered data sets.
Let's learn what they are.
It's the paper that started it all.
Let's learn how MapReduce works at a high level.
In this lesson let's learn what event skew is.
The Dataflow Model
Apache Beam is an SDK for developing pipelines.
Cloud Dataflow is a runner for executing those pipelines.
Let's learn more about them in this brief lesson.
We've seen this already once but let's review the questions once more.
Let's build our own pipeline and then execute it on Cloud Dataflow.
We need to be able to monitor our jobs.
We can easily monitor Dataflow and most of our other services using Stackdriver.
In this lesson let's create a simple dashboard for monitoring Dataflow.
In this lesson let's learn how to monitor our Dataflow jobs.