Introduction
Apache Hadoop is an open-source framework designed to handle large-scale data storage and processing using distributed computing. For beginners, a WordCount project is one of the simplest yet most effective ways to understand how Hadoop works.
In this blog post, we will cover a complete WordCount example using Hadoop Streaming and Python scripts for the Mapper and Reducer. This step-by-step guide will walk you through setting up the environment, writing the scripts, running the job, and interpreting the results. By the end of this guide, you will have a solid understanding of how Hadoop's MapReduce paradigm works.
What is Hadoop?
Hadoop is an ecosystem of tools for storing and processing large amounts of data. It is designed to be distributed, fault-tolerant, and scalable. The key components of Hadoop include:
HDFS (Hadoop Distributed File System): The storage layer that allows you to store large files across multiple nodes.
MapReduce: The processing layer that breaks down tasks into smaller chunks and processes them in parallel.
YARN (Yet Another Resource Negotiator): Manages cluster resources and job scheduling.
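On a single-node (pseudo-distributed) installation, these components run as separate Java daemons, and you can usually see them with the jps command (the exact set depends on your setup):
jps
Typical output includes NameNode, DataNode, and SecondaryNameNode from HDFS, plus ResourceManager and NodeManager from YARN.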
Why Start with a WordCount Project?
The WordCount project is the 'Hello World' of the big data world. It is simple to implement yet demonstrates the core concepts of Hadoop and MapReduce. The goal of the WordCount project is to count the occurrences of each word in a given text file. This involves splitting the text into words (done by the Mapper) and aggregating the counts (done by the Reducer).
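To make that data flow concrete before touching Hadoop, here is a tiny local simulation of the two phases in plain Python. It is purely illustrative and not part of the project files:
# Toy illustration of the MapReduce data flow for WordCount (runs locally, no Hadoop)
lines = ["Hello world", "Hello Hadoop"]

# Map: emit a (word, 1) pair for every word in every line
pairs = [(word, 1) for line in lines for word in line.split()]
# -> [('Hello', 1), ('world', 1), ('Hello', 1), ('Hadoop', 1)]

# Shuffle/sort: group pairs by key so identical words end up together
pairs.sort()

# Reduce: sum the counts for each word
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

print(counts)  # {'Hadoop': 1, 'Hello': 2, 'world': 1}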
Creating a Hadoop Project Directory and Generating Sample Input Data
First, create a new project directory on the Desktop and generate some sample input data. Navigate to the Desktop with cd Desktop, create a directory named My_first_hadoop_project with mkdir, and enter it. Then write a few sample lines to a file named data.txt using the echo command; this file serves as the input data for the Hadoop job.
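In terminal form, the setup looks like this:
cd Desktop
mkdir My_first_hadoop_project
cd My_first_hadoop_project/
echo -e "Hello world\nHadoop is fun\nHello Hadoop Hadoop is a powerful open-source framework that revolutionized the way we handle large datasets" > data.txt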
Upload the File to HDFS
Next, upload data.txt to HDFS.
First, create an input directory on HDFS with the following command:
hdfs dfs -mkdir -p /user/hadoop/input
After creating the input directory on HDFS, upload the text file with the put command:
hdfs dfs -put ~/Desktop/My_first_hadoop_project/data.txt /user/hadoop/input/
Verify the Upload
Check if the file was uploaded successfully:
hdfs dfs -ls /user/hadoop/input
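If the upload succeeded, the listing should show data.txt (the user, size, and timestamp will differ on your machine):
Found 1 items
-rw-r--r--   1 <user> supergroup   <size> <date> <time> /user/hadoop/input/data.txt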
Write the Mapper Script
The Mapper script reads each line of input, splits it into words, and outputs each word with a count of 1. Create a Python file named mapper.py:
nano mapper.py
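A minimal mapper along these lines will work. It reads lines from standard input and emits a tab-separated word and a count of 1 for every word:
#!/usr/bin/env python3
# mapper.py: emit each word on standard input as "word<TAB>1"
import sys

for line in sys.stdin:
    # split the line into words on whitespace
    for word in line.strip().split():
        print(f"{word}\t1")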
Write the Reducer Script
The Reducer script aggregates the counts for each word. Create a Python file named reducer.py:
nano reducer.py
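A minimal reducer along these lines will work. Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so all counts for the same word arrive consecutively and can be summed in a single pass:
#!/usr/bin/env python3
# reducer.py: sum the counts for each word coming from the mapper
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        # a new word begins, so emit the total for the previous one
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# emit the last word
if current_word is not None:
    print(f"{current_word}\t{current_count}")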
Make the Scripts Executable
chmod +x ~/Desktop/My_first_hadoop_project/mapper.py
chmod +x ~/Desktop/My_first_hadoop_project/reducer.py
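Before submitting the job to Hadoop, it is worth sanity-checking the scripts locally; the sort command stands in for Hadoop's shuffle phase:
cat ~/Desktop/My_first_hadoop_project/data.txt | ~/Desktop/My_first_hadoop_project/mapper.py | sort | ~/Desktop/My_first_hadoop_project/reducer.py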
Run the Hadoop Streaming Job
Now, run the Hadoop Streaming job with the following command:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
-input /user/hadoop/input/data.txt \
-output /user/hadoop/output \
-mapper ~/Desktop/My_first_hadoop_project/mapper.py \
-reducer ~/Desktop/My_first_hadoop_project/reducer.py
Important Notes
Ensure the output directory does not already exist in HDFS. If it does, delete it before running the job:
hdfs dfs -rm -r /user/hadoop/output
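Depending on how your cluster is configured, Hadoop may not find the scripts at local paths on the worker nodes. In that case, ship them with the job using the -files option and refer to them by file name only, for example:
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
-files $HOME/Desktop/My_first_hadoop_project/mapper.py,$HOME/Desktop/My_first_hadoop_project/reducer.py \
-input /user/hadoop/input/data.txt \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py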
Explanation of the Command:
-input: Specifies the input file in HDFS.
-output: Specifies the output directory where results will be stored.
-mapper: Path to the Mapper script.
-reducer: Path to the Reducer script.
View the Results
After the job completes, view the results using the following command:
hdfs dfs -cat /user/hadoop/output/part*
Output:
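With the sample data.txt used in this walkthrough and the scripts sketched above, the output looks roughly like this:
Hadoop	3
Hello	2
a	1
datasets	1
framework	1
fun	1
handle	1
is	2
large	1
open-source	1
powerful	1
revolutionized	1
that	1
the	1
way	1
we	1
world	1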
This output shows the word count for each word in the input file.
You have now successfully run your first Hadoop project using the WordCount example with Python Mapper and Reducer scripts. This basic project introduces you to the core concepts of Hadoop's distributed processing. Continue experimenting with different datasets and scripts to deepen your understanding of Hadoop.
Mastering these Hadoop commands is the first step to effectively managing big data projects. Hadoop's robust ecosystem empowers you to work with vast datasets seamlessly, and proficiency with these commands will make your journey smoother.
Take Your Big Data Projects to the Next Level with Hadoop
At Codersarts, we specialize in Hadoop Development Services, enabling you to process, store, and analyze massive datasets with ease. From setting up Hadoop clusters to developing MapReduce jobs and integrating with other tools, our skilled developers deliver tailored solutions for your big data challenges.
Contact us today to hire expert Hadoop developers and transform your data processing capabilities!
Keywords: Hadoop Development Services, Big Data Processing with Hadoop, Scalable Data Storage with Hadoop HDFS, Hadoop Cluster Setup and Management, MapReduce Development with Hadoop, Data Pipeline Development with Hadoop, Hadoop Integration Services, Real-Time Data Analysis with Hadoop, Data Engineering with Hadoop, Hire Hadoop Developer, Hadoop Project Help, Hadoop Freelance Developer