Learning Hive, Apache ZooKeeper and SAS

Leveraging Apache Hive to process raw data and run ETL operations effectively in various Hadoop environments
You will learn Hive
You will learn how to use ZooKeeper
You will learn Kafka architecture
You will learn how to integrate Hive with HBase

ZooKeeper is a replicated synchronization service with eventual consistency. It is robust because the persisted data is distributed across multiple nodes (this set of nodes is called an “ensemble”). Each client connects to one of them (its “server”) and migrates to another node if that one fails. As long as a strict majority of nodes are working, the ensemble of ZooKeeper nodes is alive. In particular, a master node is chosen dynamically by consensus within the ensemble; if the master node fails, the role of master migrates to another node.
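The "strict majority" rule can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python, not ZooKeeper code; the function names `quorum_size` and `ensemble_is_alive` are hypothetical and introduced only for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one node")
    return ensemble_size // 2 + 1


def ensemble_is_alive(ensemble_size: int, working_nodes: int) -> bool:
    """True while a strict majority of the ensemble is still working."""
    return working_nodes >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for size in (3, 4, 5):
        tolerated = size - quorum_size(size)
        print(f"{size} nodes: quorum {quorum_size(size)}, "
              f"tolerates {tolerated} failure(s)")
```

Under this rule a 4-node ensemble tolerates no more failures than a 3-node one, which is why odd ensemble sizes are the usual choice.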

The master is the authority for writes, so writes are guaranteed to be persisted in order, i.e., writes are linear. Each time a client writes to the ensemble, a majority of nodes persist the information; this majority includes the client's server and the master. Each write therefore brings the server up to date with the master. It also means, however, that you cannot have concurrent writes.
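The write path described above can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions, not ZooKeeper's actual replication protocol: the classes `Node` and `ToyEnsemble` are hypothetical, the master is fixed rather than elected, and failures are not modeled.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, not by contents
class Node:
    name: str
    log: list = field(default_factory=list)  # persisted (sequence, value) pairs


class ToyEnsemble:
    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}
        self.master = self.nodes[names[0]]  # assumption: fixed, not elected
        self.next_seq = 1

    def write(self, server_name, value):
        """Order a write through the master and persist it on a majority."""
        server = self.nodes[server_name]
        seq = self.next_seq  # the master alone assigns the order
        self.next_seq += 1
        self.master.log.append((seq, value))

        # The master and the client's server always take part;
        # further nodes are added until a strict majority is reached.
        quorum = len(self.nodes) // 2 + 1
        targets = [self.master]
        if server is not self.master:
            targets.append(server)
        for node in self.nodes.values():
            if len(targets) >= quorum:
                break
            if node not in targets:
                targets.append(node)

        # Persisting brings each participating node up to date with the master.
        for node in targets[1:]:
            node.log = list(self.master.log)
        return seq


if __name__ == "__main__":
    ensemble = ToyEnsemble(["a", "b", "c", "d", "e"])
    ensemble.write("b", "x")
    ensemble.write("e", "y")
    for node in ensemble.nodes.values():
        print(node.name, node.log)
```

Because every write takes its sequence number from the single master, two writes can never be applied in different orders on different nodes; the price is that writes are processed one after another.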

Because of the guarantee of linear writes, ZooKeeper does not perform well for write-dominant workloads. In particular, it should not be used to exchange large data, such as media. As long as your communication involves shared data, ZooKeeper helps you. When data could be written concurrently, ZooKeeper actually gets in the way, because it imposes a strict ordering of operations even when the writers do not need one. Its ideal use is coordination, where messages are exchanged between the clients.



Detailed information about Kafka and Spark Integration

Kafka Architecture
System Messages
Hive with SAS
ZooKeeper model
ZooKeeper installation
