This week's topic is the ‘Hadoop Eco-System’. The core Hadoop environment consists of Hive, HBase, Sqoop, Pig, Mahout and a few more products. We will cover these topics and a few more relating to data processing languages and how Hadoop is being integrated into other database and data management enterprise architectures.
Discussion – Read the following and discuss in class. Hadoop is Failing (or is it really)?
– Make sure to read the comments
Also have a read of Hadump – meaning data dumped into Hadoop with no plan
FAQ : Check out the questions and suggestions from previous students.
Notes
Click here to download the notes.
Lab Exercises
Lab time this week can be used for the following.
– Complete all lab exercises from previous weeks.
– Work on the assignment.
Assignment Hint: The TextPairs code will help with completing your assignment.
Assignment Hint: Check the Q&A page regularly for updates.
VM Configuration : You may need to change the size of the disk for the VM. If so, follow the instructions here, and make sure you follow them very carefully.
Additional Materials & Reading
Introduction to using Hive on Cloudera VM
The Next 50 Years of Databases
NoSQL keeps rising, but relational databases still dominate big data
The Hadoop Ecosystem – Table summarising products
A Plethora of Data Set Repositories
51 Database terms to know
The Secret Life of SQL and its Longevity
Relational Databases are far from dead — just ask Facebook
Some links on Spark (your next topic/component of the module)
Apache Spark
Installing Scala & Spark on a Mac
Scala Cheat Sheet
10-minute Spark Tutorials
- Practical Apache Spark in 10 minutes. Part 1 – Ubuntu installation
- Practical Apache Spark in 10 minutes. Part 2 – RDD
- Practical Apache Spark in 10 minutes. Part 3 – DataFrames and SQL
- Practical Apache Spark in 10 minutes. Part 4 – MLlib
- Practical Apache Spark in 10 minutes. Part 5 – Streaming
- Practical Apache Spark in 10 minutes. Part 6 – GraphX
- Practical Apache Spark in 10 minutes. Part 7 – GraphX and Neo4j