Welcome to the first week of the Programming for Big Data Module.
This week we will look at why newer technologies, such as Hadoop, were needed in the age of Big Data, who uses them, and why. An important consideration is the scale of the data. Hadoop is designed for massive volumes of data, far larger than you might imagine. Anything smaller can usually be handled with traditional methods. We will review the Hadoop ecosystem, then move on to the main components, starting with HDFS, before moving on to the lab work.
FAQ : Check out the questions and suggestions from previous students.
Notes
Click here to download the notes.
Videos
Hadoop Environment – Very Important
You will need a Hadoop environment to work with. See the exercises below, which show how to set up a basic Hadoop Virtual Machine (VM). You can use this VM or any other Hadoop environment you prefer. This could be a different VM, a Docker container, a cloud service such as AWS or GCP, or a Hadoop Virtual Machine you create yourself.
Some alternative environments:
IMPORTANT – You are responsible for managing whatever Hadoop environment you use. It would be impossible for me to support all possibilities. If you encounter any problems or issues, you will need to take ownership of resolving them.
Managing the supplied Hadoop VM
Make sure you regularly clear down the temp files and the files in the bin directory.
The VM has a small disk size and careful management of the available space is necessary.
If you need to change the size of the disk for the VM, you can follow the instructions here. Make sure you follow them very carefully and make a copy of the VM before making the changes.
IMPORTANT: Do not update any of the software on the VM. Do not update the OS, or version of Java, etc. No Updates.
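To keep an eye on the limited disk space, a small script can report how full the disk is and how much space the temp directory is using. The following is only a minimal sketch, assuming Python 3 is available in your environment. The function names `disk_report` and `dir_size_bytes` are made up for this illustration, and the temp directory location is whatever Python reports for your system. The script only reports sizes; it does not delete anything.

```python
import os
import shutil
import tempfile


def disk_report(path="/"):
    """Return (total, used, free) in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib


def dir_size_bytes(root):
    """Sum the sizes of regular files under `root`, skipping unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


if __name__ == "__main__":
    total, used, free = disk_report("/")
    print(f"Disk: {total:.1f} GiB total, {used:.1f} GiB used, {free:.1f} GiB free")
    tmp = tempfile.gettempdir()  # assumption: usually /tmp on a Linux VM
    print(f"{tmp}: {dir_size_bytes(tmp) / 1024 ** 2:.1f} MiB")
```

Running this before and after clearing down files gives a quick check that the clean-up actually freed space.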
Lab Exercises (complete all exercises before next class)
Exercise 0 – You should have already completed these Tasks
Approx 10-15 minutes to complete
Install the VirtualBox software on your computer, selecting the version for your operating system.
After installing VirtualBox, install the Extension Pack.
Make sure to follow the installation instructions for both VirtualBox and the Extension Pack.
After installing VirtualBox you can proceed to download the pre-built Virtual Machine (VM).
This is an 8 GB download, and additional storage space will be required. You need a minimum of 4 GB of RAM on your laptop to run the VM. Ideally you should have 8 GB of RAM, which lets you allocate more memory to the VM.
Alternatively, you can create your own VM and install Hadoop yourself.
Docker: If you like working with Docker, try out the pre-built Docker images on Docker Hub.
IMPORTANT: There are lots of options for having your own Hadoop environment. You can use one from AWS, GCP, Databricks, etc. The VM provided here is just one option.
Exercise 1 – Setup the Hadoop VM
Exercise 1 – Notes – Setup the VM
Approx 15-20 minutes to complete
Exercise 2 – Explore the Hadoop environment
Approx 15-30 minutes to complete
Exercise 2 – Notes – Explore Hadoop
Exercise 3 – Java Environment Setup – needed for Map-Reduce next week
Approx 10-60 minutes to complete – It depends!
Exercise 3 – Notes – Setting up Java for Hadoop
Additional Reading
Google white paper : Google File System => HDFS
Google white paper : MapReduce
Video-What is Hadoop & Map-Reduce
Hadoop Website
Hadoop Documentation
Hadoop APIs
HDFS Commands Cheat Sheet
Hadoop in Action
Relational databases are far from dead — just ask Facebook
Microsoft CEO Satya Nadella reveals which product he wishes the company had developed first