This week's topic is the ‘Hadoop Eco-System’. The Hadoop environment includes Hive, HBase, Sqoop, Pig, Mahout and a few more products. We will cover these topics, along with a few more relating to data processing languages and how Hadoop is being integrated into other database and data management enterprise architectures.

Discussion – Read the following and discuss in class.  Hadoop is Failing (or is it really)?
– Make sure to read the comments

Also have a read of Hadump, meaning data dumped into Hadoop with no plan.

FAQ : Check out the questions and suggestions from previous students.


Click here to download the notes.

Lab Exercises

Lab time this week can be used for the following.

– Complete all lab exercises from previous weeks.

– Work on the assignment

Assignment Hint: The TextPairs code will help with completing your assignment

Assignment Hint: Check the Q&A page regularly for updates.

VM Configuration : You may need to change the size of the disk for the VM. If so, follow the instructions here, and make sure you follow them very carefully.

Additional Materials & Reading

Introduction to using Hive on Cloudera VM

The Next 50 Years of Databases

NoSQL keeps rising, but relational databases still dominate big data
The Hadoop Ecosystem – Table summarising products
A Plethora of Data Set Repositories
51 Database terms to know
The Secret Life of SQL and its Longevity
Relational Databases are far from dead — just ask Facebook

Some links on Spark (your next topic/component of the module)

Apache Spark
Installing Scala & Spark on a Mac
Scala Cheat Sheet

10-minute Spark Tutorials