Programming for Big Data

Students are reminded that notes provided on this site are intended to form summary material only and are not intended to be a substitute for attending lectures or further reading on the subject.

My slides are not Lecture Notes

Students should download the notes to their own device. The notes are a living artifact and will evolve from semester to semester. It cannot be guaranteed that the notes will be available after the end of a semester.

This module is 100% continuous assessment and has a large practical component. Students are expected to be able to work independently and to have the necessary programming skills (Java, Python, scripting, etc.) and technical skills (working with Virtual Machines, Linux, etc.) for this module.

Before you decide to take this module, read the Module Overview and watch the Module Overview and Pre-requisites videos (these are subject to change). This will let you know what is involved in this module, what is expected of you, and how you can prepare for the module, especially the first class.

Important: Module Overview, Module Pre-requisites and What to do before first class. These are subject to change.

Please be mindful of your location and any people nearby before playing any of the videos.
Always use headphones.

FAQ : Check out the questions and suggestions from previous students.

Make sure to complete all materials and lab exercises each week and before the next class. No sample solutions are available for the lab exercises.

Commonly used Terms

Hadoop : Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Map-Reduce : MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework. MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. At the end, it aggregates the results from the multiple servers and returns a consolidated output to the application.

Cluster : A computer cluster is a set of loosely or tightly connected computers that work together so that, in many aspects, they can be viewed as a single system. Clusters contain nodes that perform tasks, controlled and scheduled by software.

Node : A node is a computer/server running its own instance of the operating system and the distributed software. Each node will have storage device(s) attached to it.
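The Map-Reduce definition above can be illustrated with a small word count. This is only a minimal single-machine sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and each string in the input list stands in for one chunk of data that Hadoop would process on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce in sequence over a list of text chunks.

    On a real cluster the map calls would run in parallel on different
    nodes; here they simply run one after another (an assumption made
    to keep the sketch self-contained).
    """
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two hypothetical chunks of input data.
    chunks = ["big data is big", "data is everywhere"]
    print(word_count(chunks))
    # prints {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The point of the split into three functions is that the map and reduce steps only ever see one chunk or one key at a time, which is what allows a framework such as Hadoop to distribute them across the nodes of a cluster.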

Week of Class Type Lecture Topic
wk1 online Make sure you have read and understood the module details, admin and pre-requisites.

 

Introduction to Hadoop & Overview

Download the pre-built Hadoop VM (64-bit)

Complete the download before the first class, and follow the installation instructions in the Lab Exercises.

Make sure to complete all materials and lab exercises before next week's class

wk2 online Map-Reduce – Part 1

 

Make sure to complete all materials and lab exercises before next week's class

wk3 online Map-Reduce – Part 2

 

Assignment Handout

Make sure to complete all materials and lab exercises before next week's class

wk4 online Hadoop Eco-System

 

No lab exercises this week. Use the time to work on your assignment.

wk5   Work on Assignment – Independent work week for you to work on your assignment. There will be no labs, classes, etc.
wk6   Introduction to Spark Programming

 

Hadoop Assignment Due this week.

Due Date = Monday Week 6 at 11pm

Must be submitted on Brightspace. No email submissions will be accepted.

Feedback will be given on Brightspace, consisting of a short comment and your mark.

wk7  

More Spark Programming, SparkSQL

Make sure to complete all materials and lab exercises before next week's class

wk8   Spark for Analytics and Machine Learning

 

Assignment Handout

Make sure to complete all materials and lab exercises before next week's class

wk9  

Spark Streaming, RDDs and Scala

Make sure to complete all materials and lab exercises before next week's class

wk10   Work on Assignment – Independent work week for you to work on your assignment. There will be no labs, classes, etc.
wk11   Introduction to Kafka
wk12   More Kafka

 

Spark Assignment Due this week.

Due Date = Monday Week 12 at 11pm

Must be submitted on Brightspace. No email submissions will be accepted.

Feedback will be given on Brightspace, consisting of a short comment and your mark.

wk13   Quiz – Online in Brightspace, covering all topics.