4.12 out of 5
1866 reviews on Udemy

Learn Big Data: The Hadoop Ecosystem Masterclass

Master the Hadoop ecosystem using HDFS, MapReduce, Yarn, Pig, Hive, Kafka, HBase, Spark, Knox, Ranger, Ambari, Zookeeper
Edward Viaene
10,793 students enrolled
English [Auto-generated]
Process Big Data using batch
Process Big Data using realtime data
Be familiar with the technologies in the Hadoop Stack
Be able to install and configure the Hortonworks Data Platform (HDP)

In this course you will learn Big Data using the Hadoop Ecosystem. Why Hadoop? It is one of the most sought-after skills in the IT industry. The average salary in the US is $112,000 per year, up to an average of $160,000 in San Francisco (source: Indeed).

The course is aimed at Software Engineers, Database Administrators, and System Administrators who want to learn about Big Data. Other IT professionals can also take this course, but might have to do some extra research to understand some of the concepts.

You will learn how to use the most popular software in the Big Data industry at the moment, using batch processing as well as realtime processing. This course will give you enough background to be able to talk about real problems and solutions with experts in the industry. Adding these technologies to your LinkedIn profile can help attract recruiters and lead to interviews at the most prestigious companies in the world.

The course is very practical, with more than 6 hours of lectures. You will want to try out everything yourself, which adds multiple hours of learning. If you get stuck with the technology while trying, there is support available. I will answer your messages on the message boards and we have a Facebook group where you can post questions.


Course Introduction

Course introduction, lecture overview, course objectives

Course Guide

This document provides a guide to do the demos in this course

What is Big Data and Hadoop

What is Big Data

The 3 (or 4) V's of Big Data explained

Examples of Big Data

What is Big Data? Some examples of companies using Big Data, like Spotify, Amazon, Google, and Tesla

What is Data Science

What can we do with Big Data? Data Science explained.

What is Hadoop

How to build a Big Data System? What is Hadoop?

Hadoop Distributions

Hadoop Distributions: a comparison between Apache Hadoop, Hortonworks Data Platform, Cloudera, and MapR

What is Big Data Quiz

Introduction to Hadoop

Hadoop Installation

How to install Hadoop? You can install Hadoop using Vagrant with VirtualBox / VMware, or on the Cloud using AWS. Hortonworks also provides a Sandbox.

Demo: Hortonworks Sandbox

This is a demo of how to install and use the Hortonworks Sandbox. It is an alternative to the full installation using Ambari if your machine doesn't have a lot of memory available. You can also use both in conjunction.

Demo: Hadoop Installation - Part 1

A walkthrough of how to install the Hortonworks Data Platform (HDP) on your Laptop or Desktop

Demo: Hadoop Installation - Part 2

A walkthrough of how to install the Hortonworks Data Platform (HDP) on your Laptop or Desktop (Part II)

Introduction to HDFS

An introduction to HDFS, The Hadoop Distributed Filesystem

DataNode Communications

Communications between the DataNode and the NameNode explained

Demo: HDFS - Part 1

An introduction to HDFS using hadoop fs put. I also show how a file gets divided into blocks and where those blocks are stored.

Demo: HDFS - Part 2 - Using Ambari

An introduction to downloading, uploading and listing files. This time I'm using the Ambari HDFS Viewer and the NameNode UI. I also show what configuration changes are necessary to make this work.

MapReduce WordCount Example

MapReduce WordCount, step by step explained
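
The three WordCount phases can be sketched in plain Python. This is a single-process illustration only, not the Hadoop Java API; the function names (map_phase, shuffle_phase, reduce_phase) are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for a single word."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts)
```

On a real cluster the mappers and reducers run on different nodes, and the shuffle moves data over the network between them.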

Demo: MapReduce WordCount

A demo of MapReduce WordCount on our HDP cluster

Lines that span blocks

In HDFS, files are divided into blocks and stored on the DataNodes. In this lecture we're going to see what happens when we're reading lines from files that potentially span multiple blocks.
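
A minimal sketch of one common convention for this problem, assuming newline-delimited text: every reader except the first skips the (possibly partial) line at the start of its split, and every reader finishes the last line it started, even if that means reading past the end of its split. The tiny 8-byte block size and the helper names are assumptions for illustration.

```python
def make_splits(size, block_size):
    """Cut a file of `size` bytes into [start, end) ranges of block_size bytes."""
    return [(start, min(start + block_size, size))
            for start in range(0, size, block_size)]


def read_split(data, start, end):
    """Return the complete lines that belong to the split [start, end)."""
    pos = start
    if start != 0:
        # Skip the first line: the reader of the previous split owns it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    # Read every line that starts at or before the end of this split,
    # following it past the split boundary if needed.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        lines.append(data[pos:line_end].decode())
        pos = line_end + 1
    return lines


data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = make_splits(len(data), block_size=8)
per_split = [read_split(data, start, end) for start, end in splits]
print(per_split)
```

Every line is read exactly once, even though "bravo", "charlie", and "delta" each cross a block boundary.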

Introduction to Yarn

Introducing Yarn and concepts like the ResourceManager, the Scheduler, the ApplicationsManager, the NodeManager, and the ApplicationMaster. I explain how an application is executed and what happens when a node crashes.

Demo: Yarn and ResourceManager UI

A demo of an application executed using yarn jar. I provide an overview of Ambari Yarn metrics and the ResourceManager UI

Ambari API and Blueprints

Ambari also exposes a REST API. Commands can be executed directly against this API. Ambari also lets you do unattended installs using Ambari Blueprints.
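
As a rough sketch, a call to the Ambari REST API can be prepared with Python's standard library. The host name and credentials below are placeholders, not values from the course; the request is only built, not sent, so the snippet runs without a cluster.

```python
import base64
import urllib.request

AMBARI_URL = "http://ambari.example.com:8080"  # hypothetical host
USERNAME, PASSWORD = "admin", "changeme"       # hypothetical credentials


def ambari_request(path, method="GET", body=None):
    """Build (but do not send) a request against the Ambari REST API."""
    token = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
    request = urllib.request.Request(AMBARI_URL + path, data=body, method=method)
    request.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on requests that modify state.
    request.add_header("X-Requested-By", "ambari")
    return request


list_clusters = ambari_request("/api/v1/clusters")
print(list_clusters.get_method(), list_clusters.full_url)
# With a reachable Ambari server the request could be sent with:
# urllib.request.urlopen(list_clusters)
```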

Demo: Ambari API and Blueprints

A demo showing you the Ambari API and how to work with blueprints

ETL Processing in Hadoop

An introduction to ETL processing in Hadoop. MapReduce, Pig, and Spark are suitable for batch processing. Hive is more suitable for data exploration.

Introduction Quiz


Introduction to Pig

An introduction to Pig and Pig Latin.

Demo: Part 1 - Pig Installation

This demo shows how to install Pig and Tez using Ambari on the Hortonworks Data Platform

Demo: Part 2 - Pig Commands

In this demo I will show you basic Pig commands to load, dump and store data. I'll also show you an example of how to filter data.

Demo: Part 3 - More Pig Commands

More Pig commands in this final part of the Pig demo. I'll go over commands like GROUP ... BY, FOREACH ... GENERATE, and COUNT()

Apache Spark

Introduction to Apache Spark

An introduction to Apache Spark. This lecture explains the differences between running spark-submit in local mode, yarn-cluster mode, and yarn-client mode.

Spark WordCount

An introduction to WordCount in Spark using Python (pyspark)

Demo: Spark installation and WordCount

Spark installation using Ambari and a demo of the Spark WordCount using the pyspark shell.


This lecture gives an introduction to Resilient Distributed Datasets (RDDs). This abstraction allows you to do transformations and actions in Spark. I give an example using filter RDDs, and explain how shuffle RDDs impact disk and network IO.
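
The difference between transformations (lazy) and actions (which trigger the work) can be illustrated with a toy class. ToyRDD is a made-up, single-machine stand-in and not the PySpark API; it only covers map and filter, which need no shuffle.

```python
class ToyRDD:
    """A toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return ToyRDD(self._source, self._ops + (("filter", predicate),))

    def _compute(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: runs the recorded transformations and returns the results.
        return list(self._compute())

    def count(self):
        # Action: runs the recorded transformations and counts the results.
        return sum(1 for _ in self._compute())


calls = []


def noisy_square(x):
    calls.append(x)  # record when the function really runs
    return x * x


rdd = ToyRDD(range(1, 6)).map(noisy_square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = rdd.collect()
print(calls_before_action, result)
```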

Demo: RDD Transformations and Actions

A demo of RDD transformations and actions in Spark

Overview of RDD Transformations and Actions

An overview of the most common RDD actions and transformations

Spark MLlib

An overview of what Spark MLlib (Machine Learning Library) can do. I explain a Recommendation Engine example, and a Clustering Example (K-Means / DBScan)


Introduction to Hive

An introduction to SQL on Hadoop using Hive, enabling data warehouse capabilities. This lecture provides an architecture overview and an overview of the Hive CLI and Beeline using JDBC.

Hive Queries

An overview of Hive Queries: creating tables, creating databases, inserting data, and selecting data. This lecture also shows where the Hive data is stored in HDFS.

Demo: Hive Installation and Hive Queries

A demo that shows the installation of HiveServer2 and the clients. Afterwards I show you a few example queries using a JDBC Beeline connection.

Hive Partitioning, Buckets, UDFs, and SerDes

Optimizing Hive can't be done using indexes. This lecture explains how queries in Hive should be optimized, using partitions and buckets. It also covers User Defined Functions (UDFs) and Serializers / Deserializers (SerDes).
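
A small sketch of why partitions and buckets help: partition columns become directory names, so a query that filters on them only has to read the matching directories, and buckets spread rows over a fixed number of files by hashing a column. The table path, column names, and the crc32 hash are assumptions for illustration; Hive uses its own hash function.

```python
import zlib


def partition_path(table_dir, partition_cols, row):
    """Directory for a row, using the col=value layout of partitioned tables."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([table_dir] + parts)


def bucket_for(value, num_buckets):
    """Bucket number for a value (crc32 is a stand-in for Hive's own hash)."""
    return zlib.crc32(str(value).encode()) % num_buckets


rows = [
    {"id": 1, "country": "US", "year": 2016},
    {"id": 2, "country": "BE", "year": 2016},
    {"id": 3, "country": "US", "year": 2015},
]

layout = {}
for row in rows:
    directory = partition_path("/warehouse/sales", ["country", "year"], row)
    bucket = bucket_for(row["id"], 4)
    layout.setdefault(directory, {}).setdefault(bucket, []).append(row["id"])

# A query with WHERE country = 'US' only needs these directories:
us_dirs = sorted(d for d in layout if "/country=US/" in d)
print(us_dirs)
```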

The Stinger Initiative

The Stinger initiative brings optimizations to Hive. Query time has dropped significantly over the years. This lecture explains the details.

Hive in Spark

You can also use Hive in Spark using the Spark SQLContext.

Real Time Processing

Introduction to Realtime Processing

All the lectures up until now were batch oriented. From now on we're going to discuss Realtime processing technologies like Kafka, Storm, Spark Streaming, and HBase / Phoenix.


Introduction to Kafka

An introduction to Kafka and its terminology like Producers, Consumers, Topics and Partitions.

Kafka Topics

An explanation of Kafka Topics covering Leader partitions, Follower partitions, and how writes are sent to the partitions. It also covers consumer groups to show the difference between the publish-subscribe (pubsub) mechanism and queuing.
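
The partition and consumer group ideas can be sketched without a broker. Messages with the same key go to the same partition; inside one consumer group every partition is read by exactly one consumer (queuing), while separate groups each see all partitions (publish-subscribe). The crc32 hash, the round-robin assignment, and all names are assumptions for this sketch, not Kafka's actual partitioner or assignor.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def partition_for(key):
    """Pick a partition from the message key (stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


def assign(partitions, consumers):
    """Give every partition to exactly one consumer of a group."""
    assignment = defaultdict(list)
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return dict(assignment)


# Producer side: same key, same partition, so per-key order is kept.
topic = defaultdict(list)
for key, value in [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]:
    topic[partition_for(key)].append((key, value))

# Consumer side: two independent groups read the same topic.
groups = {
    "billing": assign(range(NUM_PARTITIONS), ["billing-1", "billing-2"]),
    "audit": assign(range(NUM_PARTITIONS), ["audit-1"]),
}
print(groups)
```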

Kafka Messages and Log Compaction

Kafka guarantees at-least-once message delivery, but can also be configured for at-most-once. Log Compaction is a technique that Kafka provides to have a full dataset maintained in the commit log. This lecture shows an example of a customer dataset fully kept in Kafka and explains Log Tail, Cleaner Point and Log Head and how it impacts consumers.
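
The effect of log compaction on a keyed dataset can be sketched as follows: for every key only the most recent record survives, and records keep their original offsets. The customer records are invented, and the drop_tombstones flag is a simplification of how deleted keys (records with an empty value) are eventually removed.

```python
def compact(log, drop_tombstones=False):
    """Keep only the latest record per key; log is a list of (offset, key, value)."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.values())     # back in offset order
    if drop_tombstones:
        survivors = [record for record in survivors if record[2] is not None]
    return survivors


log = [
    (0, "customer-1", "Alice"),
    (1, "customer-2", "Bob"),
    (2, "customer-1", "Alice Smith"),  # update for customer-1
    (3, "customer-3", "Carol"),
    (4, "customer-2", None),           # tombstone: customer-2 was deleted
]

compacted = compact(log)
print(compacted)
```

A consumer that reads the compacted log from the beginning still ends up with the latest state of every customer.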

Kafka Use Cases and Usage

A few example use cases of Kafka

Demo: Kafka Installation and Usage

The installation of Kafka on the Hortonworks Data Platform and a demo of a producer - consumer example.


Introduction to Storm

This lecture provides an introduction to Storm, a realtime computing system. The architecture overview explains components like Nimbus, Zookeeper, and the Supervisor

A Storm Topology

This lecture explains what Storm topologies are. I talk about streams, tuples, spouts, and bolts.

Demo: Storm installation and Example Topology

A demo of a Storm Topology ingesting data from Kafka and doing computation on the data.

Storm Message Processing and Reliability

Message Delivery explained:

  • At most once delivery
  • At least once delivery
  • Exactly once delivery

This lecture also explains Storm's reliability API (Anchoring and Acking) and the performance impact of acking.
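
The difference between at-most-once and at-least-once can be shown with a small simulation of a worker that crashes once. This is a generic model based on when progress is recorded, not Storm's anchoring and acking API; the function and parameter names are made up.

```python
def run(messages, commit_first, crash_at):
    """Process messages with one simulated crash; return what got processed.

    commit_first=True  records progress before processing (at-most-once).
    commit_first=False records progress after processing (at-least-once).
    """
    processed = []
    committed = 0      # position to resume from after a restart
    crashed = False
    pos = 0
    while pos < len(messages):
        if commit_first:
            committed = pos + 1
        if pos == crash_at and not crashed:
            crashed = True
            if not commit_first:
                # The work was done, but the crash hit before the commit.
                processed.append(messages[pos])
            pos = committed  # restart from the last recorded position
            continue
        processed.append(messages[pos])
        if not commit_first:
            committed = pos + 1
        pos += 1
    return processed


at_most_once = run(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run(["a", "b", "c"], commit_first=False, crash_at=1)
print(at_most_once, at_least_once)
```

With at-most-once the message "b" is lost; with at-least-once it is processed twice. Exactly-once needs extra machinery on top, such as idempotent or transactional updates.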


An introduction to the Trident API, an alternative interface for Storm that supports exactly-once processing of messages.

Spark Streaming

Introduction to Spark Streaming

Spark Streaming is an alternative to Storm that gained a lot of popularity in the last few years. It allows you to reuse the code you wrote in batch and use it for stream processing.

Spark Streaming Architecture

Spark Streaming generates DStreams, micro-batches of RDDs. This lecture explains the Spark Streaming Architecture

Spark Receivers and WordCount Streaming Example

This lecture explains possible receivers, like Kafka. It also shows a WordCount streaming example, where data is ingested from Kafka and processed using WordCount in Spark Streaming

Demo: Spark Streaming with Kafka

This demo shows the Kafka-spark-streaming example.

Spark Streaming State and Checkpointing

In the previous lecture we did a WordCount using Spark Streaming, but our example was stateless. In this lecture I'm adding state, using updateStateByKey to keep state and checkpointing to save the data to HDFS.
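
What updateStateByKey does across micro-batches can be simulated in plain Python. The update function has the same shape as the one you would pass to Spark Streaming (new values for a key, previous state); update_state_by_key itself is a made-up helper that stands in for Spark, and checkpointing is left out.

```python
def update_count(new_values, running_count):
    """Update function: add this batch's values to the previous state."""
    return sum(new_values) + (running_count or 0)


def update_state_by_key(state, batch_pairs, update_fn):
    """Apply update_fn to every key seen so far (a simulation, not Spark)."""
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    for key in set(state) | set(grouped):
        # Keys without new data still get called, with an empty list.
        new_state[key] = update_fn(grouped.get(key, []), state.get(key))
    return new_state


batches = ["a b a", "b c", ""]  # one string per micro-batch
state = {}
for text in batches:
    pairs = [(word, 1) for word in text.split()]
    state = update_state_by_key(state, pairs, update_count)

print(state)
```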

Demo: Stateful Spark Streaming

A demo of a stateful Spark Streaming application. It performs a global WordCount on a Kafka topic and does checkpointing in HDFS.

More Spark Streaming Features

More Spark Streaming Features, like Windowing and streaming algorithms


Introduction to HBase

Introduction to HBase: a realtime, distributed, scalable, big data store on top of Hadoop. The lecture also briefly explains the CAP theorem.

HBase Tables

An HBase table is different from a table in a Relational Database. This lecture explains the differences and talks about the row key, Column Families, Column Qualifiers, versions, and regions.

The HBase Meta Table

A lecture that explains the hbase:meta table, whose location is retrieved from Zookeeper when a client connects. This way the client knows which RegionServer to contact to read/write data.

HBase Writes

This lecture shows how a write (a PUT request) is handled by HBase. It shows how writes go to the WAL (write-ahead log) and the Memstore. I also show how flushes work to persist the data in HDFS.
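
The write path can be sketched with a toy class: every put is appended to the WAL first, then stored in the Memstore, and a full Memstore is flushed to an immutable sorted file. ToyRegion is invented for illustration; it flushes after a number of rows, whereas real HBase flushes based on Memstore size, and its files live in memory instead of HDFS.

```python
class ToyRegion:
    """A toy model of the HBase write path (not the HBase client API)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only write-ahead log
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the write for durability
        self.memstore[row_key] = value     # 2. store it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the Memstore out as one sorted file and start a new Memstore.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):  # newest file first
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


region = ToyRegion(flush_threshold=2)
region.put("row2", "b")
region.put("row1", "a")   # second row triggers a flush
region.put("row1", "a2")  # newer version stays in the Memstore
print(region.hfiles, region.memstore)
```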

HBase Reads

HBase reads go to the Memstore and the BlockCache first, then to HFiles on HDFS. The lecture shows how indexes and Bloom filters are used to speed up reads from disk.
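
A Bloom filter answers "is this row key possibly in this file?" without reading the file. It can return false positives but never false negatives, so a "no" lets a read skip the file entirely. The sizes and the SHA-256 based hashing below are assumptions for this sketch, not HBase's implementation.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter over string keys."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not present"; True means "maybe present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


empty_filter = BloomFilter()
hfile_filter = BloomFilter()
for row_key in ["row1", "row2", "row3"]:
    hfile_filter.add(row_key)

print(hfile_filter.might_contain("row2"))
```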


HBase does minor and major compactions to merge HFiles in HDFS.

Crash Recovery

This lecture explains how a crash recovery in HBase happens, how Zookeeper and the HMaster are involved, how recovery uses the WAL files and how data is persisted to disk after a crash.

Region Splits

When tables become bigger, their regions split. This lecture explains how Regions are split and balanced over the RegionServers, and how pre-splitting can help with performance.


HBase hotspotting is something to avoid. This lecture explains when hotspotting can happen and how to avoid it using salting.
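
A minimal sketch of salting: sequential row keys (timestamps, for example) would all land in the same region, so a prefix derived from the key spreads them over several key ranges. The bucket count, the crc32 hash, and the key format are assumptions for illustration.

```python
import zlib

NUM_BUCKETS = 4


def salted_key(row_key, num_buckets=NUM_BUCKETS):
    """Prefix the row key with a salt computed from the key itself."""
    salt = zlib.crc32(row_key.encode()) % num_buckets
    return f"{salt:02d}-{row_key}"


keys = [f"2016-01-01T00:00:{second:02d}" for second in range(8)]
salted = [salted_key(key) for key in keys]
for key in salted:
    print(key)
```

Because the salt is computed from the key, a reader can recompute it for a single get. The trade-off is that a range scan has to query every bucket.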

Demo: HBase Install

This demo shows how to install HBase using Ambari.

Demo: HBase Shell

This demo gives you an introduction to the HBase Shell, where tables can be created, data can be retrieved using get / scan, and data can be written using put

Demo: Spark HBase

An example of a stateful Spark Streaming application that ingests data from a Kafka topic, runs the wordcount on the data, and stores the data in an HBase table.


Introduction to Phoenix

An introduction to Phoenix, which brings SQL back into HBase.

Salting, Compression, and Indexes in Phoenix

An overview of Phoenix features like Salting, Compression, and Indexes. All are implemented using standard SQL commands to make it easier for database administrators and analysts to use HBase.

JOINs, VIEWs, and Phoenix in Spark

More Phoenix features like JOINs, VIEWs, and a Phoenix in Spark plugin.

Demo: Phoenix

A demo showing the Phoenix features

Hadoop Security

Introduction to Kerberos

An introduction to Kerberos, which we are going to use to secure our Hadoop cluster

Kerberos on Hadoop

An overview of different deployment strategies of Kerberos in Hadoop

Kerberos Terminology

Getting familiar with Kerberos terminology like Principals, Realms, and keytabs

Demo: Enabling Kerberos

A demo showing you how to install MIT Kerberos and enable Kerberos in Ambari, and how this impacts users of HDFS

Introduction to SPNEGO

Introduction to SPNEGO, protecting the HTTP interfaces in Hadoop against unauthorized access


A demo showing how SPNEGO works

Introduction to Knox

The Knox gateway provides a single entry point to the Hadoop APIs and UIs. This lecture explains the Knox gateway architecture and how it can be used.


Introduction to Ranger

This lecture gives an introduction to Ranger, which can be used for access control on the Hadoop services (authorization)

Demo: Ranger Installation

Demo of installing Ranger using Ambari

Demo: Ranger with Hive

A demo of Ranger with Hive. Ranger can be used to put granular access controls on Hive databases, tables, and columns.

30-Day Money-Back Guarantee


6 hours on-demand video
1 article
Full lifetime access
Access on mobile and TV
Certificate of Completion