Flume and Sqoop for Ingesting Big Data
Taught by a team that includes two Stanford-educated ex-Googlers. This team has decades of practical experience working with Java and with billions of rows of data.
Use Flume and Sqoop to import data into HDFS, HBase and Hive from a variety of sources, including Twitter and MySQL.
Let’s parse that.
Import data: Flume and Sqoop play a special role in the Hadoop ecosystem. They transport data from sources that hold or produce it, such as local file systems, HTTP endpoints, MySQL and Twitter, to data stores like HDFS, HBase and Hive. Both tools come with built-in functionality and abstract away the complexity of transporting data between these systems.
Flume: Flume Agents can transport data produced by a streaming application to data stores like HDFS and HBase.
Sqoop: Use Sqoop to bulk import data from traditional RDBMS to Hadoop storage architectures like HDFS or Hive.
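As a sketch, a minimal Sqoop bulk import from MySQL into HDFS might look like the command below; the database name, credentials, table and target directory are placeholder assumptions, not values from the course:

```shell
# Hypothetical example: bulk-import the "orders" table from a local
# MySQL database into HDFS. Connection details are placeholders.
sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4
```

Sqoop splits the table across the mappers and writes the rows into files under the target directory.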
Practical implementations for a variety of sources and data stores:
- Sources: Twitter, MySQL, Spooling Directory, HTTP
- Sinks: HDFS, HBase, Hive
Flume features: Flume Agents, Flume Events, Event bucketing, Channel selectors, Interceptors
Sqoop features: Sqoop import from MySQL, Incremental imports using Sqoop Jobs
You, This Course and Us
Let's start with an introduction to the course and what you'll know by the end of it.
Why do we need Flume and Sqoop?
Let's understand Flume and Sqoop and their role in the Hadoop Ecosystem
Installing Flume is pretty straightforward.
A Flume Agent is the most basic unit that can exist independently in Flume. An Agent is made up of Sources, Sinks and Channels.
Our first example of a Flume Agent uses a Spooling Directory Source, a File Channel and a Logger Sink.
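A minimal configuration for such an agent might look like the sketch below; the agent name ("agent1"), component names and directory paths are illustrative assumptions:

```properties
# Sketch of a flume.conf for this first agent.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Spooling Directory Source: watches a directory for new files
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /tmp/flume-spool
agent1.sources.src1.channels = ch1

# File Channel: buffers events durably on disk
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /tmp/flume/checkpoint
agent1.channels.ch1.dataDirs = /tmp/flume/data

# Logger Sink: writes events to the agent's log, handy for testing
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```

The agent would then be started with something like `flume-ng agent --conf-file flume.conf --name agent1`.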
A Flume event represents 1 record of data. Flume events consist of event headers and the event body.
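Conceptually, an event can be sketched in Python; note that Flume's real Event is a Java interface, so this model (and the sample values in it) is purely illustrative:

```python
# Illustrative model of a Flume event: one record of data, made up of a
# body (the payload, as bytes) and headers (string key/value metadata).
from dataclasses import dataclass, field

@dataclass
class FlumeEvent:
    body: bytes                                   # the raw record
    headers: dict = field(default_factory=dict)   # metadata, not payload

event = FlumeEvent(
    body=b"65.2,2016-05-01,sensor-7",
    headers={"timestamp": "1462060800000", "host": "node-1"},
)
print(event.headers["host"])
```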
Learn how to use HDFS as a sink with Flume
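A sketch of swapping the logger sink for an HDFS sink; the path and roll settings below are illustrative assumptions:

```properties
# Sketch: point the sink at HDFS instead of the logger.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/flume/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# Roll to a new file every 300 seconds
agent1.sinks.sink1.hdfs.rollInterval = 300
agent1.sinks.sink1.channel = ch1
```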
HTTP Sources can be pretty handy when you have an application capable of making POST requests.
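With its default JSONHandler, the HTTP source accepts a JSON array of events, each carrying a "headers" object and a "body" string. A Python sketch of building such a payload (the port and header values are illustrative assumptions):

```python
# Build the JSON payload that Flume's HTTP source, using the default
# JSONHandler, accepts: a list of events with "headers" and "body".
import json

events = [
    {"headers": {"source": "webapp"}, "body": "first event"},
    {"headers": {"source": "webapp"}, "body": "second event"},
]
payload = json.dumps(events)
print(payload)

# The payload would then be POSTed to the running agent, e.g.:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d "$payload" http://localhost:5140
```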
Event Headers in Flume carry useful metadata. Use event headers to bucket events in HDFS.
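A sketch of bucketing via escape sequences in the HDFS sink's path; the "topic" header name here is an illustrative assumption:

```properties
# Sketch: bucket events into per-header, per-day directories.
# %{topic} expands to the value of the "topic" event header;
# %Y-%m-%d expands from the event's timestamp.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/flume/events/%{topic}/%Y-%m-%d
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.channel = ch1
```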
Let's see how to use an HBase sink as the endpoint of the Flume Agent.
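A sketch of an HBase sink configuration; the table and column family names are placeholders, and the table must already exist in HBase before the agent starts:

```properties
# Sketch: deliver events into an HBase table instead of HDFS.
agent1.sinks.sink1.type = hbase
agent1.sinks.sink1.table = flume_events
agent1.sinks.sink1.columnFamily = cf1
agent1.sinks.sink1.channel = ch1
```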
HTTP to HDFS and Logger at the same time. See how to route events using channel selectors.
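A sketch of a replicating channel selector fanning one HTTP source out to two channels, one drained by an HDFS sink and one by a logger sink; all names, the port and the path are illustrative assumptions:

```properties
# Sketch: replicate every event from one source into two channels.
agent1.sources  = src1
agent1.channels = ch1 ch2
agent1.sinks    = hdfsSink logSink

agent1.sources.src1.type = http
agent1.sources.src1.port = 5140
agent1.sources.src1.channels = ch1 ch2
# The replicating selector copies each event to every listed channel
agent1.sources.src1.selector.type = replicating

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /user/flume/events
agent1.sinks.hdfsSink.channel = ch1

agent1.sinks.logSink.type = logger
agent1.sinks.logSink.channel = ch2
```

A multiplexing selector would instead route each event to a channel chosen by the value of one of its headers.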
Connect to the Twitter API using Flume. Use an Interceptor to do Regex filtering within Flume itself!
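A sketch combining the Twitter source with the regex-filtering interceptor; the OAuth credentials are placeholders you must replace with your own, and the regex ("hadoop") is just an example:

```properties
# Sketch: pull tweets via the Twitter source and drop any event whose
# body does not match the regex.
agent1.sources.src1.type = org.apache.flume.source.twitter.TwitterSource
agent1.sources.src1.consumerKey = YOUR_CONSUMER_KEY
agent1.sources.src1.consumerSecret = YOUR_CONSUMER_SECRET
agent1.sources.src1.accessToken = YOUR_ACCESS_TOKEN
agent1.sources.src1.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
agent1.sources.src1.channels = ch1

agent1.sources.src1.interceptors = filt1
agent1.sources.src1.interceptors.filt1.type = regex_filter
agent1.sources.src1.interceptors.filt1.regex = hadoop
# false = keep matching events; true would exclude them instead
agent1.sources.src1.interceptors.filt1.excludeEvents = false
```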
If you are unfamiliar with software that requires working in a shell/command-line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software.
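A minimal sketch of updating PATH in a shell session, assuming an illustrative install directory of $HOME/sqoop/bin:

```shell
# Append a tool's bin directory to PATH for the current shell session.
# "$HOME/sqoop/bin" is a placeholder install path.
export PATH="$PATH:$HOME/sqoop/bin"
echo "$PATH"
# To make the change permanent, add the export line to ~/.bashrc
# (Linux) or ~/.bash_profile / ~/.zshrc (Mac).
```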
Install Sqoop and the connector for Sqoop to MySQL
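Once Sqoop and the MySQL connector are in place, incremental imports can be packaged as a saved Sqoop job. In the sketch below, the job name, table, connection details and check column are placeholder assumptions:

```shell
# Sketch: a saved Sqoop job doing an incremental append import keyed
# on an auto-increment "id" column. All names are placeholders.
sqoop job --create orders_incremental -- import \
  --connect jdbc:mysql://localhost/salesdb \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --incremental append \
  --check-column id \
  --last-value 0

# Each later run imports only rows whose id exceeds the stored
# last-value, which Sqoop updates after every execution:
sqoop job --exec orders_incremental
```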