Need help with a Big Data assignment or project? At Codersarts we offer 1:1 sessions with experts, code mentorship, course training, and ongoing development projects. Get help from vetted Machine Learning engineers, mentors, experts, and tutors.
For this project, you will solve the given problems using the MapReduce computational model and Mahout on a Hadoop cluster.
QUESTION 1
Find the descriptive statistics for temperature of each day of a given month for the year 2007.
We will use weather data from NCDC. I can provide you with hourly weather records for April, May, June and July of 2007. Each file covers one month. You may select any one of the four months (files) for analysis.
The data contain records from different weather stations, identified by WBAN number in the first column. Using the hourly data across all weather stations, find:
the daily maximum and minimum “Dry Bulb Temp” across all the weather stations
the daily mean and median of “Dry Bulb Temp” over all the weather stations
the daily variance of “Dry Bulb Temp” over all the weather stations
You must NOT use any package that calculates these statistics; you MUST use the MapReduce framework to compute them. Write the pseudo code for the mapper and reducer functions for the above three tasks and implement them in Python. Note that when using both the mapper and the reducer, it is helpful to consider the following formula for variance:
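The formula itself is not shown here. One identity that is commonly used in this setting (an assumption, not necessarily the one the brief intended) is variance = (sum of x^2)/n - ((sum of x)/n)^2, because the count, the sum and the sum of squares can each be accumulated in a single pass. The sketch below illustrates the three tasks with a mapper that emits (day, temperature) pairs and a reducer that aggregates them per day. The column names "YearMonthDay" and "Dry Bulb Temp", the comma-separated layout and the choice of population variance are assumptions; check them against the actual file header. The sort inside the reducer stands in for Hadoop's shuffle phase so that the example can run locally.

```python
import csv
from itertools import groupby


def mapper(lines, date_col="YearMonthDay", temp_col="Dry Bulb Temp"):
    """Emit one (day, temperature) pair per usable hourly record.

    The column names are assumptions about the file header.
    """
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        try:
            yield row[date_col].strip(), float(row[temp_col])
        except (KeyError, ValueError, AttributeError, TypeError):
            continue  # skip rows with missing or non-numeric readings


def reducer(pairs):
    """Aggregate all temperatures that share a day into daily statistics."""
    # Sorting by key imitates the shuffle/sort step between map and reduce.
    for day, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        temps = sorted(t for _, t in group)
        n = len(temps)
        total = sum(temps)
        total_sq = sum(t * t for t in temps)
        mean = total / n
        # Population variance via E[x^2] - (E[x])^2 (assumed form).
        variance = total_sq / n - mean ** 2
        mid = n // 2
        median = temps[mid] if n % 2 else (temps[mid - 1] + temps[mid]) / 2
        yield day, {
            "min": temps[0],
            "max": temps[-1],
            "mean": mean,
            "median": median,
            "variance": variance,
        }
```

Calling dict(reducer(mapper(open(path)))) on one monthly file, where path is a placeholder for the file location, would give one entry per day. Note that the median needs every value for a day in one place, so unlike the other statistics it cannot be built from partial sums alone.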
QUESTION 2
Cluster Analysis using Apache Mahout.
For this question, I can also provide you with the data (a set of text files placed in a folder) for the k-means algorithm. You are welcome to use your own dataset for this question. If you choose to do so, please provide a link to the data in your report.
In text clustering, the terms of the documents are treated as features. The vector space model is an algebraic model that maps the terms in a document into an n-dimensional linear space. To evaluate the similarity between data points, the textual information (terms) must first be converted to numerical values, from which the feature vectors are built.
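As a small illustration of this idea, the sketch below turns each document into a sparse term-count vector and compares two vectors with cosine distance. The whitespace tokeniser and raw term counts are simplifying assumptions; Mahout's own vectorisation (for example with TF-IDF weighting) is more elaborate, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to a sparse vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse term vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

Two documents with no terms in common are at distance 1, and documents with identical term proportions are at distance 0 regardless of their length, which is one reason cosine distance is a common choice for text.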
Use Apache Mahout and perform the standard steps for the cluster analysis:
create sequence files from the raw text
create a sparse (efficient) representation of the vectors, initialising approximate centroids for K-Means
run the K-Means algorithm
get the final iteration’s clustering solution
evaluate the final solution
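The steps above can be sketched as a sequence of Mahout command lines. The helper below only builds the argument lists and does not run anything. The sub-commands and flags are written as they appear in Mahout's bundled clustering examples and should be confirmed with each command's --help output for the installed version. The directory layout under work_dir, the default of 10 iterations and the function name are assumptions made for illustration.

```python
def mahout_kmeans_commands(raw_dir, work_dir, k, max_iter=10,
                           distance="org.apache.mahout.common.distance."
                                    "CosineDistanceMeasure",
                           final_dir_name="clusters-*-final"):
    """Build (but do not run) Mahout commands for one K-Means experiment.

    final_dir_name depends on how many iterations actually ran; it may
    need to be resolved first, e.g. by listing the output directory.
    """
    seq = f"{work_dir}/seqfiles"          # hypothetical layout
    vec = f"{work_dir}/vectors"
    init = f"{work_dir}/initial-centroids-k{k}"
    out = f"{work_dir}/kmeans-k{k}"
    return [
        # 1. Sequence files from the raw text folder.
        ["mahout", "seqdirectory", "-i", raw_dir, "-o", seq],
        # 2. Sparse TF-IDF vectors (named vectors keep document ids).
        ["mahout", "seq2sparse", "-i", seq, "-o", vec,
         "-wt", "tfidf", "-nv"],
        # 3. K-Means; -k samples k initial centroids into the -c path.
        ["mahout", "kmeans", "-i", f"{vec}/tfidf-vectors", "-c", init,
         "-o", out, "-dm", distance, "-k", str(k),
         "-x", str(max_iter), "-cl", "-ow"],
        # 4-5. Dump the final iteration's clusters and evaluate them.
        ["mahout", "clusterdump", "-i", f"{out}/{final_dir_name}",
         "-o", f"{out}/clusterdump.txt",
         "-d", f"{vec}/dictionary.file-0", "-dt", "sequencefile",
         "--pointsDir", f"{out}/clusteredPoints",
         "-dm", distance, "--evaluate"],
    ]
```

On a machine where Mahout is installed, each list could be passed to subprocess.run(cmd, check=True). Looping over several values of k with the same raw_dir is one way to automate the K sweep; the first two commands then only need to run once.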
Further, you need to consider the following points in the analysis:
Implement the K-Means clustering algorithm with cosine distance to cluster the instances into K clusters
Vary the value of K and comment on the precision
Plot a graph that shows the relation between the average distance to the cluster centroid (or efficiency metric) and the K-value
Try to smooth the graph so that you can identify the best value of K, that is, the value beyond which there is no significant reduction in the average distance to the cluster centroids
Consider another distance measure of your choice and compare the clusters obtained in the two cases. Discuss which is the best setting for K-means clustering for this dataset.
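One possible way to make the choice of K reproducible is sketched below: smooth the measured average distances with a moving average, then pick the first K after which the relative improvement falls under a threshold. The window size and the 5% threshold are arbitrary assumptions to be tuned and justified in the report, and the function names are hypothetical.

```python
def moving_average(values, window=3):
    """Smooth a sequence with a centred moving average.

    Near the ends the window is truncated, so fewer points are averaged.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pick_k(ks, avg_distances, min_relative_gain=0.05, window=3):
    """Return the first K beyond which the smoothed curve barely improves.

    ks and avg_distances are parallel lists, ordered by increasing K.
    """
    smooth = moving_average(avg_distances, window)
    for i in range(len(ks) - 1):
        if smooth[i] == 0:
            return ks[i]  # the distance cannot decrease any further
        gain = (smooth[i] - smooth[i + 1]) / smooth[i]
        if gain < min_relative_gain:
            return ks[i]
    return ks[-1]  # no flattening observed within the tested range
```

The same two lists can be plotted (raw and smoothed) to produce the graph asked for above. Running the sweep once per distance measure gives directly comparable curves.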
You must include the following in your submission:
For Q1, submit the pseudo code and Python code for the mapper and reducer implementations for all the descriptive statistics along with some comments so that a layperson can follow. Anyone should be able to run your code and reproduce your results with the instructions that you have provided.
For Q2, write a brief summary on the impact of the parameter changes on the performance of the k-means algorithm. For example, you may:
compare different distance measures in the K-Means algorithm and discuss the merits and demerits of each
present a table that shows the performance of the K-Means algorithm for different values of K
In Q2, if you automate the process of varying K, submit the code for the implementation along with some comments so that a layperson can follow it.
Submit a report on the experiments. This report will be a detailed explanation (Max 1,500 words, excluding code and references) of what you explored, the results you obtained, and some discussion points on the limitations of MapReduce methodology and Hadoop’s MapReduce computing engine.
Credit will be given for:
the technical skills you demonstrate in your write-up.
good use of the Hadoop cluster.
critical evaluation of your work.