Map Reduce I Sample assignment

Shikhar Sharma
Jul 23, 2021

Introduction

MapReduce is the programming model of Hadoop used for the analysis and processing of big data. For this assignment, both the mapper and the reducer phases were implemented. The input data goes through the following stages before the final output is written to the output directory.

MapReduce Algorithm

Input

The input directory that contains the dataset (input.txt) is given as input to the MapReduce task. In this stage, the input data is split into many independent data blocks, which are handed to multiple mappers for parallel processing.
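To make the splitting step concrete, here is a minimal sketch in Python. It assumes the input is already loaded as a list of lines and cuts it into contiguous blocks by line count. The function name split_into_blocks and the line-count rule are illustrative assumptions only. Hadoop itself computes input splits from the underlying file blocks, not like this.

```python
def split_into_blocks(lines, num_blocks):
    """Divide a list of lines into at most num_blocks contiguous blocks.

    Hypothetical helper for illustration; not how Hadoop computes splits.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    # Ceiling division so that every line lands in some block.
    size = -(-len(lines) // num_blocks)
    if size == 0:
        return []
    return [lines[i:i + size] for i in range(0, len(lines), size)]
```

Each returned block could then be handed to its own mapper, since the blocks do not depend on one another.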

Mapper, mapper.py

The mapper takes a section of the data as a text file, so the input phase is also responsible for converting the input data into key-value pair form. The map function is then applied to each of the pairs. The map phase writes its intermediate result, again as key-value pairs, to mapout.txt, which is fed as input to the next stage.
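As a rough illustration of the map step, the sketch below assumes a word-count task, which the post does not specify. Each word becomes a key with the value 1, and each pair is written as one tab-separated line, the usual convention for streaming-style mappers. The function names, the lowercasing, and the tab-separated format are assumptions and may differ from the actual mapper.py.

```python
def map_line(line):
    """Yield one (word, 1) pair per whitespace-separated token (assumed task)."""
    for word in line.split():
        # Lowercasing is an assumption so that "Big" and "big" share a key.
        yield word.lower(), 1


def format_pair(key, value):
    """Render a pair as a tab-separated line of intermediate output."""
    return f"{key}\t{value}"


def map_lines(lines):
    """Apply the map function to every input line and collect the output lines."""
    return [format_pair(key, value) for line in lines for key, value in map_line(line)]
```

Writing the strings returned by map_lines to a file, one per line, would give an intermediate file in the spirit of mapout.txt.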

Once the map step is over, its outputs are sorted and passed as input to the reducer. During this transfer from mapper to reducer, all the values that belong to a particular key in the mapper output are aggregated at a single node, where the reducer for that key is executed.
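The grouping described here can be sketched in a single process as follows. This only models the logical effect, which is values collected per key and keys in sorted order. In a real cluster the framework moves the data between nodes. The name shuffle_and_sort is a hypothetical helper.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group values by key and return (key, values) tuples in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sorting the items orders them by key, since every key is unique here.
    return sorted(grouped.items())
```

For example, the pairs ("b", 1), ("a", 1), ("b", 1) would come out as ("a", [1]) followed by ("b", [1, 1]), so each key reaches the reducer together with all of its values.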

Reducer, reducer.py

The reducer function then takes the aggregated values for a particular key as input and generates a single key-value pair as output for each key. This output is written to result.txt.

In other words, the mapper generates many intermediate key-value pairs with non-unique keys. Pairs with the same key are then aggregated at a single node, where the reducer performs the necessary computation to produce a single output per key, which is the desired result.
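A reducer along these lines might look like the sketch below. It assumes the input is a sequence of tab-separated "key, value" lines that are already sorted by key, and that the aggregation is a sum, as in word count. Both are assumptions, since the post does not state the actual computation in reducer.py. The function names are hypothetical.

```python
from itertools import groupby


def parse_line(line):
    """Split a tab-separated line into a (key, integer value) pair."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, int(value)


def reduce_sorted_lines(lines):
    """Yield one (key, total) pair per key from lines sorted by key."""
    pairs = (parse_line(line) for line in lines if line.strip())
    # groupby only merges adjacent items, which is why sorted input matters.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)
```

Because groupby merges only neighbouring lines with the same key, the sort that happens between the map and reduce steps is what guarantees a single output per key.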