Students are reminded that notes provided on this site are intended to form summary material only and are not intended to be a substitute for attending lectures or further reading on the subject.
My slides are not Lecture Notes
Students should download the notes to their own device. The notes are a living artifact and will evolve from semester to semester. It cannot be guaranteed that the notes will be available after the end of a semester.
This module is 100% continuous assessment. There is a large practical aspect to this module. It is expected that students can work independently and have the necessary programming (Java, Python, scripting, etc.) and technical skills (working with Virtual Machines, Linux, etc.) for this module.
Before you decide to take this module, read the Module Overview and watch the Module Overview and pre-requisites videos (these are subject to change). This will let you know what is involved with this module, what is expected of you, and how you can prepare for the module, especially the first class.
Important: Module Overview, Module Pre-requisites and What to do before first class. These are subject to change.
Please be mindful of your location and any people nearby before playing any of the videos.
Always use headphones.
FAQ : Check out the questions and suggestions from previous students.
Make sure to complete all materials and lab exercises each week and before the next class. No sample solutions are available for the lab exercises.
Commonly used Terms
Hadoop : Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Map-Reduce : MapReduce is a programming model or pattern within the Hadoop framework that is used to process big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework. MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. At the end, it aggregates the results from the multiple servers and returns a consolidated output to the application.
Cluster : A computer cluster is a set of loosely or tightly connected computers that work together so that, in many aspects, they can be viewed as a single system. Clusters contain nodes that perform tasks, controlled and scheduled by software.
Node : A node is a computer/server running its own instance of the operating system and the distributed software. Each node will have storage device(s) attached to it.
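To make the Map-Reduce definition above more concrete, here is a minimal single-process sketch of the pattern in Python, using the usual word-count example. It is an illustration only and does not use Hadoop or its API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample `chunks` are made up for this sketch, and on a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce in sequence over a list of text chunks."""
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would go to its own node
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three small "chunks" standing in for blocks of a big file.
chunks = ["big data big clusters", "data on clusters", "big jobs"]
counts = word_count(chunks)
print(counts)
```

The three functions mirror the stages in the definition: the data is split into chunks, each chunk is mapped independently, and the reduce step consolidates the partial results into one output.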
Week | Class Type | Lecture Topic |
---|---|---|
wk1 | online | Make sure you have read and understood the module details, admin and pre-requisites. Introduction to Hadoop & Overview. Download the pre-built Hadoop VM (64-bit): complete the download before the first class and follow the installation instructions in Lab Exercises. Make sure to complete all materials and lab exercises before next week's class. |
wk2 | online | Map-Reduce – Part 1. Make sure to complete all materials and lab exercises before next week's class. |
wk3 | online | Map-Reduce – Part 2. Assignment Handout. Make sure to complete all materials and lab exercises before next week's class. |
wk4 | online | Hadoop Eco-System. No lab exercises this week. Use the time to work on the assignment. |
wk5 | | Work on Assignment: independent work week for you to work on your assignment. There will be no labs, class, etc. |
wk6 | | Introduction to Spark Programming. Hadoop Assignment due this week: Monday of Week 6 at 11pm. Must be submitted on Brightspace. No email submissions accepted. Feedback will be given on Brightspace, consisting of a short comment and your mark. |
wk7 | | More Spark Programming, SparkSQL. Make sure to complete all materials and lab exercises before next week's class. |
wk8 | | Spark for Analytics and Machine Learning. Assignment Handout. Make sure to complete all materials and lab exercises before next week's class. |
wk9 | | Spark Streaming, RDDs and Scala. Make sure to complete all materials and lab exercises before next week's class. |
wk10 | | Work on Assignment: independent work week for you to work on your assignment. There will be no labs, class, etc. |
wk11 | | Introduction to Kafka |
wk12 | | More Kafka. Spark Assignment due this week: Monday of Week 12 at 11pm. Must be submitted on Brightspace. No email submissions accepted. Feedback will be given on Brightspace, consisting of a short comment and your mark. |
wk13 | | Quiz: online in Brightspace, covering all topics. |