This is a Frequently Asked Questions webpage for the Hadoop part of the Programming for Big Data Module.
This page contains questions asked by previous students. The answers given aim to answer these questions and to give some additional pointers.
The questions are grouped into sections based on the topics covered each week, and also include any questions received about the Assignment.
If no questions are listed for a week or topic, then no one asked any.
Submitting new questions: All new questions should be emailed to me. Try to make your question specific to a topic or a particular challenge, and give examples to help illustrate it. If you come across any materials or resources you would like to share with the class, I can post the details on this page.
Q: Do I need to install anything to run the VM?
You will need to have VirtualBox installed on your laptop/desktop to be able to run the VM. This is a simple install and only takes a few minutes. If this can be done in advance of the first class then it will save you some time.
Q: Do I have to use the supplied VM?
No, you can use any VM that has Hadoop installed on it. An alternative is to use one of the many cloud-based Hadoop environments; Amazon, for example, offers a few different options. Several students have used these in previous years.
Q: What version of the VM should I use?
It is recommended that you use the 64bit version of the VM. Most laptops/desktops nowadays are 64bit, so use the 64bit VM.
You do need to ensure that virtualization is enabled on your laptop/desktop. This may require some changes to the BIOS settings of your machine.
Q: Do I have to setup shared folders for the VM? Is there an alternative way to share files between my laptop and VM?
No, you can do all the work on the VM. Then you don’t have to move any of the Java files from your laptop/desktop to the VM.
But if you prefer to work with Java on your laptop/desktop, an alternative is to use Google Drive, email, etc. to share the files between your working environments.
Whatever works best and easiest for you.
Q: The VM went into hibernate mode. When it wakes up it asks for a username and password. What should I enter?
Yes, this can happen. The username and password are in the lab notes for Week 1.
To save you the task of finding them, they are listed below. Make sure to use the correct username/password for the version of the VM you are using (32bit vs 64bit).
64bit VM = soctech / ubuntu
32bit VM = Hadoop / hadoopVM
Q: Can I use the latest version of Eclipse, instead of Luna?
Yes, you can use the latest version of Eclipse if you want, but it is configured to use the latest version of Java. The version of Java on the VM depends on the version of Hadoop, and will not be the latest version of Java.
You will need to configure the latest version of Eclipse to work with the version of Java on the VM.
Q: Do I need to be good at programming to take this module?
Yes, there is a lot of coding in all components of this module, not just in the Hadoop section.
For the Hadoop section it is ideal to have some experience of working with Java, but it is not mandatory. It comes down to how confident you are at coding in other languages and other IDEs. You may be able to pick up Java and the other languages quickly.
Q: I get the following error when I try to run my first MR jar file
When I enter the line below:
soc@soc-VirtualBox:~$ hadoop jar WordCount.jar WordCount shakespeare/poems myOutput
I just get this line:
Usage: WordCount <input path> <output path>
And nothing actually runs.
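There are a couple of things to check. First, check that the jar was generated as a ‘Runnable Jar File’, then try to run it from the directory where it is located. If that doesn’t work, generate it as a ‘Jar File’ and again run it from the directory where it is located.
Alternatively, try leaving the class name out of the command. If the jar’s manifest already names the main class (as a ‘Runnable Jar File’ does), Hadoop passes WordCount to the program as an extra argument, and a driver expecting exactly two arguments prints its usage message instead of running: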
hadoop jar WordCount.jar shakespeare/poems myOutput
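To illustrate why an extra word on the command line can produce the usage message, the sketch below mimics the kind of argument check a WordCount driver commonly starts with. The class name UsageCheckSketch and the method check are hypothetical, and the exact check in your own driver may differ; a real driver would go on to configure and submit the job.

```java
import java.util.Arrays;

public class UsageCheckSketch {

    // Mirrors a typical driver argument check (an assumption about your driver):
    // anything other than exactly two arguments yields the usage message.
    public static String check(String[] args) {
        if (args.length != 2) {
            return "Usage: WordCount <input path> <output path>";
        }
        return "OK: input=" + args[0] + ", output=" + args[1];
    }

    public static void main(String[] args) {
        // What the program receives when the manifest already names the main
        // class and the class name is still typed on the command line.
        String[] withClassName = {"WordCount", "shakespeare/poems", "myOutput"};
        // What it receives when the class name is left out.
        String[] withoutClassName = {"shakespeare/poems", "myOutput"};

        System.out.println(Arrays.toString(withClassName) + " -> " + check(withClassName));
        System.out.println(Arrays.toString(withoutClassName) + " -> " + check(withoutClassName));
    }
}
```

If your driver has a similar check, temporarily printing args.length before it is a quick way to see how many arguments actually arrived.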
Q: I get a syntax error after copying and pasting the sample code. Any ideas?
Look out for curly (typographic) double quotes in the code. They got in there somehow, probably from the formatting of the notes. Just change these to plain straight double quotes (").
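If there are many quotes to fix, the replacement can be scripted. The following is a minimal sketch, not part of the course material: the class name QuoteFixSketch and the method straighten are hypothetical, and it simply swaps the four common curly quote characters for their plain equivalents.

```java
public class QuoteFixSketch {

    // Replaces curly double and single quotes with their plain ASCII equivalents.
    public static String straighten(String source) {
        return source
                .replace('\u201C', '"')    // left double quotation mark
                .replace('\u201D', '"')    // right double quotation mark
                .replace('\u2018', '\'')   // left single quotation mark
                .replace('\u2019', '\'');  // right single quotation mark
    }

    public static void main(String[] args) {
        // A pasted line whose string literal is wrapped in curly quotes.
        String pasted = "System.out.println(\u201CHello\u201D);";
        System.out.println(straighten(pasted));
    }
}
```

Note that this changes every curly quote it finds, including any that were deliberately placed inside a string, so check the result before compiling.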
Q: I’m struggling to get the shared folders to work. Is there a different way of moving code from my computer to the VM?
Email the code to yourself, then open a web browser on the VM and log into your email.
Q: Can we have an Extension to the submission date for this assignment?
No, an extension to the submission date is not possible for this assignment.
Giving an extension would seriously impact the other components and assignments of this module. Complete what you can of the assignment and submit it. Then move on to the next component of the module.
The assignment schedule is the same as previous deliveries of this module.
Late submissions will be accepted and will be subject to the late penalty.
Q: Can I use another tool or language to preprocess and clean the data files?
No. Everything needs to be done using MapReduce.
Q: How many languages do we need to have?
Select 3 different languages.
Q: How many files/books should I download?
2 <= X <= 6, where X is the number of files/books per language.
If you select 2 books for each language, you will have 6 (2 x 3) books to process. If you decide to use 3 books for each language, you will have 9 (3 x 3) books to process.
Q: Do I need to do any data cleaning of the files before they are processed by map-reduce?
No. You just download the files, load them into HDFS and let map-reduce do all the work.
Q: Should the output for a language like French have a separate count for ‘ á ’ as opposed to ‘ a ’ or are we simply looking at the English alphabet?
Simplest is to consider them as separate letters.
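As an illustration of treating accented letters as separate letters, here is a minimal plain-Java sketch of the counting logic that could sit inside a mapper. The class name LetterCountSketch and the method countLetters are hypothetical, and folding everything to lower case is an assumption rather than an assignment requirement. The text is normalised to composed form first, so that an accent stored as a separate combining character is still counted together with its letter.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class LetterCountSketch {

    // Counts each letter in a line of text, keeping accented letters distinct.
    public static Map<Character, Integer> countLetters(String line) {
        // Compose "a" + combining accent into a single accented character.
        String composed = Normalizer.normalize(line, Normalizer.Form.NFC);
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : composed.toLowerCase(Locale.ROOT).toCharArray()) {
            // Skips digits, spaces and punctuation; keeps letters from any alphabet.
            if (Character.isLetter(c)) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The accented 'a' (written as an escape) is counted apart from plain 'a'.
        System.out.println(countLetters("\u00E1 a A"));
    }
}
```

In a real mapper, each letter would be emitted as a key with a count of 1 instead of being collected in a map, and the reducer would do the summing.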
Q: Can I use books written in languages that are not Latin-based, e.g. Japanese?
The assignment asks for an analysis of letter frequencies across a number of languages. In languages such as Japanese (and others), a symbol can represent a whole word. If such a language is used, the results will not be comparable with those of the other languages.
Q: Can I build/run separate MapReduce processes to answer each part or language of the assignment.
No. You should create one overall chained MapReduce process for this assignment. This can consist of a number of map-reduce processes/jobs. Each component can create an output. These outputs may contain the answer required, as well as providing input to the next job of the chained process.
Do not design and run a map-reduce process on individual books.
Only the final output should be used for charting in R or Python. No additional data aggregation should be performed in R or Python.
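The data flow of a chained process can be pictured with the plain-Java sketch below. It is not Hadoop code: each method stands in for one job, and the choice of stages (letter counts per language, then relative frequencies) is an assumption made for illustration, as are the names ChainedJobsSketch, countLetters and toFrequencies. In a real Hadoop driver each stage would be a separate job, with the second job reading from the first job's output directory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ChainedJobsSketch {

    // Stage 1 (stand-in for the first job): letter counts per language.
    public static Map<String, Map<Character, Long>> countLetters(
            Map<String, List<String>> linesByLanguage) {
        Map<String, Map<Character, Long>> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : linesByLanguage.entrySet()) {
            Map<Character, Long> perLetter =
                    counts.computeIfAbsent(entry.getKey(), k -> new TreeMap<>());
            for (String line : entry.getValue()) {
                for (char c : line.toLowerCase(Locale.ROOT).toCharArray()) {
                    if (Character.isLetter(c)) {
                        perLetter.merge(c, 1L, Long::sum);
                    }
                }
            }
        }
        return counts;
    }

    // Stage 2 (stand-in for the second job): consumes stage 1 output and
    // produces each letter's share of all letters in that language.
    public static Map<String, Map<Character, Double>> toFrequencies(
            Map<String, Map<Character, Long>> counts) {
        Map<String, Map<Character, Double>> result = new TreeMap<>();
        for (Map.Entry<String, Map<Character, Long>> entry : counts.entrySet()) {
            long total = 0;
            for (long n : entry.getValue().values()) {
                total += n;
            }
            Map<Character, Double> frequencies = new TreeMap<>();
            for (Map.Entry<Character, Long> letter : entry.getValue().entrySet()) {
                frequencies.put(letter.getKey(), letter.getValue().doubleValue() / total);
            }
            result.put(entry.getKey(), frequencies);
        }
        return result;
    }

    public static void main(String[] args) {
        // All books of a language go in together, not one run per book.
        Map<String, List<String>> input = new TreeMap<>();
        input.put("english", Arrays.asList("To be, or not to be"));
        input.put("french", Arrays.asList("\u00CAtre ou ne pas \u00EAtre"));

        // The output of stage 1 is the input of stage 2; only the final
        // result would be passed on for charting.
        System.out.println(toFrequencies(countLetters(input)));
    }
}
```

The point of the sketch is the shape: one pipeline over all books, with the aggregation finished before anything reaches R or Python.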
Q: The assignment seems to indicate you want the code in the PDF and in a file. Is this correct?
Yes, include the code in the PDF document in the sections related to it. Also, include the code files in the ZIP file submission. This will make things easier for me to find and mark.
Q: I’m confused about the difference between a Project and a Site.
Check out the webpage at https://dumps.wikimedia.org/other/pagecounts-raw/ where you can download the required files and process them accordingly.