
What is Hadoop?



Apache Hadoop is an open-source software framework that

  • Stores big data in a distributed manner

  • Processes big data in parallel

  • Runs on large clusters of commodity hardware

It is based on Google’s papers on the Google File System (2003) and MapReduce (2004).


Hadoop:

  • Scales easily to petabytes and beyond (Volume)

  • Offers parallel data processing (Velocity)

  • Stores all kinds of data (Variety)


Hadoop offers:

  • Redundant, fault-tolerant data storage (HDFS)

  • Parallel computation framework (MapReduce)

  • Job coordination/scheduling (YARN)


Programmers no longer need to worry about:

  • Where a file is located

  • How to handle failures and data loss

  • How to divide the computation

  • How to program for scaling
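
To make this concrete, here is a minimal word-count job for Hadoop Streaming, sketched in Python. Hadoop Streaming is a standard way to run MapReduce jobs with any executable as the mapper and reducer; the file names and paths below are placeholders for illustration. The framework, not the programmer, decides where the input blocks live, splits the work across the cluster, and reruns any task that fails.

mapper.py reads raw text from standard input and emits one "word<TAB>1" pair per word:

#!/usr/bin/env python3
# mapper.py - emit (word, 1) for every word on standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py receives the mapper output sorted by key, so all counts for the same word arrive together and can be summed:

#!/usr/bin/env python3
# reducer.py - sum the counts for each word (input arrives sorted by word)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job is then submitted with the hadoop-streaming jar (its exact path depends on the installation), pointing -input and -output at HDFS directories and passing the two scripts with -mapper, -reducer and -files.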



Hadoop Ecosystem


Core of Hadoop:

  • Hadoop Distributed File System (HDFS)

  • MapReduce

  • YARN (Yet Another Resource Negotiator) (from Hadoop v2.0)

Additional software packages:

  • Pig

  • Hive

  • Spark

  • HBase

  • and more


The Master-Slave Architecture of Hadoop




Hadoop Distributed File System (HDFS)

HDFS is a file system that

  • follows a master-slave architecture

  • allows us to store data across multiple nodes (machines)

  • allows multiple users to access the data

  • works just like the file system on your PC


HDFS supports

  • distributed storage

  • distributed computation

  • horizontal scalability
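
Because HDFS behaves like an ordinary file system, day-to-day use is simple. The sketch below (a minimal example, assuming a working Hadoop client on the PATH and a made-up /user/demo directory) copies a local file into HDFS and lists it back by calling the standard hdfs dfs shell commands from Python:

import subprocess

# Create a directory in HDFS and upload a local file into it.
# The paths here are hypothetical; adjust them for your cluster.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "access.log", "/user/demo/access.log"], check=True)

# Listing the directory shows the file as one ordinary path, even though
# its blocks are replicated across several DataNodes behind the scenes.
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)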


Vertical Scaling vs. Horizontal Scaling

Vertical scaling adds more resources (CPU, RAM, disk) to a single machine; horizontal scaling adds more machines to the cluster. HDFS scales horizontally.


HDFS Architecture




NameNode

The NameNode maintains and manages the blocks stored on the DataNodes (slave nodes).

  • Master node


Functions:

  • records the metadata of all the files

    • FsImage: the complete state of the file system namespace since the NameNode was started

    • EditLogs: all the recent modifications to the file system (e.g., from the past hour); each change to the metadata is recorded here

  • regularly checks the status of the DataNodes

  • keeps a record of all the blocks in HDFS and the DataNodes they are stored on

  • handles data recovery if a DataNode fails


DataNode

A DataNode is a commodity machine that stores the data.

  • Slave node

Functions:

  • stores the actual data

  • serves read and write requests

  • reports its health to the NameNode via periodic heartbeats
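
As a rough illustration of the heartbeat mechanism (a toy sketch, not Hadoop source code; the function and host names are made up), a DataNode periodically tells the NameNode that it is alive and how much storage it has left. The 3-second interval matches the default HDFS heartbeat interval:

import time

HEARTBEAT_INTERVAL = 3  # seconds; the HDFS default heartbeat interval

def send_heartbeat(namenode, datanode_id, free_bytes):
    # Placeholder for the RPC a real DataNode sends; here we only print it.
    print(f"{datanode_id} -> {namenode}: alive, {free_bytes} bytes free")

def heartbeat_loop(namenode="namenode-host:9000", datanode_id="dn1"):
    while True:
        send_heartbeat(namenode, datanode_id, free_bytes=10 * 2**30)
        time.sleep(HEARTBEAT_INTERVAL)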


NameNode vs. DataNode


If the NameNode Fails…

All the files on HDFS will be lost

  • there is no way to reconstruct the files from the blocks on the DataNodes without the metadata in the NameNode


To make the NameNode resilient to failure, we can:

  • back up the metadata of the NameNode (e.g., to a remote NFS mount)

  • run a Secondary NameNode



Secondary NameNode

Takes checkpoints of the file system metadata present on the NameNode

  • It is not a backup NameNode!

Functions:

  • Stores a copy of the FsImage file and the EditLogs

  • Periodically applies the EditLogs to the FsImage and refreshes the EditLogs

  • If the NameNode fails, the file system metadata can be recovered from the last saved FsImage on the Secondary NameNode
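
The checkpoint process can be pictured with a toy model (this is not Hadoop code; the dictionary-based namespace and operation names are illustrative): the FsImage is a saved snapshot of the namespace, the EditLog is the list of changes made since that snapshot, and a checkpoint folds the log into a new snapshot and empties the log.

# Toy model of the FsImage/EditLog checkpoint done by the Secondary NameNode.
fsimage = {"/data/a.txt": {"size": 128}}        # last saved namespace snapshot
editlog = [
    ("create", "/data/b.txt", {"size": 64}),    # changes made since that snapshot
    ("delete", "/data/a.txt", None),
]

def checkpoint(fsimage, editlog):
    """Apply every logged change to the snapshot, then start a fresh log."""
    new_image = dict(fsimage)
    for op, path, meta in editlog:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []                        # refreshed (empty) EditLog

fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)   # {'/data/b.txt': {'size': 64}}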


NameNode vs. Secondary NameNode


Blocks

A block is a sequence of bytes that stores data

  • Data is stored as a set of blocks in HDFS

  • The default block size is 128 megabytes (Hadoop 2.x and 3.x); in Hadoop 1.x the default was 64 megabytes

  • A file is split into multiple blocks
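
For example, with the 128 MB default, a hypothetical 500 MB file is split into 4 blocks: three full 128 MB blocks and one final 116 MB block. The arithmetic:

import math

BLOCK_SIZE = 128 * 2**20        # 128 MB, the Hadoop 2.x/3.x default
file_size = 500 * 2**20         # a hypothetical 500 MB file

num_blocks = math.ceil(file_size / BLOCK_SIZE)
last_block = file_size - (num_blocks - 1) * BLOCK_SIZE
print(num_blocks, last_block / 2**20)   # -> 4 blocks, last block is 116.0 MB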


Why Large Block Size?

  • HDFS stores huge datasets

  • If the block size is small (e.g., 4 KB, as in Linux file systems), then the number of blocks becomes very large:

    • too much metadata for the NameNode

    • too many seeks slow down reads:

    1. read time = seek time + transfer time

    2. transfer time = file size / transfer rate

  • small blocks harm the performance of MapReduce too

  • We don’t recommend using HDFS for many small files, for similar reasons.
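
A back-of-the-envelope comparison shows why seeks dominate when blocks are small. The numbers below are assumptions chosen only for illustration (10 ms per seek, 100 MB/s transfer rate, a 1 GB file read block by block):

SEEK_TIME = 0.010            # seconds per seek (assumed)
TRANSFER_RATE = 100 * 2**20  # bytes per second (assumed)
FILE_SIZE = 1 * 2**30        # 1 GB file

def read_time(block_size):
    seeks = FILE_SIZE // block_size        # one seek per block
    transfer = FILE_SIZE / TRANSFER_RATE   # transfer time = file size / transfer rate
    return seeks * SEEK_TIME + transfer

print(read_time(4 * 2**10))     # 4 KB blocks:   ~2632 s, dominated by seek time
print(read_time(128 * 2**20))   # 128 MB blocks: ~10.3 s, dominated by transfer time

With 4 KB blocks the read spends roughly 2,600 seconds just seeking; with 128 MB blocks almost all of the ~10 seconds is useful data transfer.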


If a DataNode Fails…

  • Commodity hardware fails

    • If the NameNode has not heard from a DataNode for 10 minutes, the DataNode is considered dead

  • HDFS guarantees data reliability by keeping multiple replicas of the data

    • each block has 3 replicas by default

    • the replicas are stored on different DataNodes

    • if blocks are lost due to the failure of a DataNode, they can be recovered from the other replicas

    • the total consumed space is 3 times the data size

  • Replication also helps maintain data integrity (whether the stored data is correct or not)
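
A toy sketch (not Hadoop code; the block map, node names, and helper are all illustrative) of what recovery looks like conceptually: once a DataNode is declared dead, every block that drops below the replication factor is copied from a surviving replica onto another live node.

import random

REPLICATION_FACTOR = 3

# block id -> set of DataNodes currently holding a replica (illustrative data)
block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_dead_node(dead):
    """Drop the dead node and re-replicate its under-replicated blocks."""
    live_nodes.discard(dead)
    for block, holders in block_map.items():
        holders.discard(dead)
        while len(holders) < REPLICATION_FACTOR:
            candidates = live_nodes - holders
            if not candidates:
                break                      # not enough live nodes to re-replicate
            # Real HDFS asks a surviving DataNode to copy the block to the
            # chosen target; here we only record the new replica location.
            holders.add(random.choice(sorted(candidates)))

handle_dead_node("dn3")
print(block_map)   # every block is back to 3 replicas, all on live nodes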


File, Block and Replica

  • A file consists of one or more blocks

    • the blocks of a file hold different data

    • the number of blocks depends on the file size and the block size

  • A block has multiple replicas

    • the replicas of a block are identical copies

    • the number of replicas depends on the preset replication factor


Replication Management

  • Each block is replicated 3 times and stored on different DataNodes



Why default replication factor 3?

  • With only 1 replica:

    • if a DataNode fails, its blocks are lost

  • Assume

    • # of nodes N = 4000

    • # of blocks R = 1,000,000

    • node failure rate = 1 per day (i.e., we expect one machine to fail per day)

  • If one node fails, then the R/N = 250 blocks stored on it are lost

    • E(# of blocks lost in one day) = 250

  • If the number of blocks lost per day follows a Poisson distribution with mean 250, then

    • Pr[# of blocks lost in one day >= 250] ≈ 0.508

  • So with a single replica, losing hundreds of blocks per day is practically certain; with 3 replicas, a block is lost only if all three of its DataNodes fail before re-replication completes, which is far less likely.
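
The 0.508 figure can be reproduced directly from the Poisson model (a quick check using SciPy; the survival function gives Pr[X >= 250]):

from scipy.stats import poisson

N = 4000            # number of nodes
R = 1_000_000       # number of blocks
lam = R / N         # expected blocks lost per day = 250 (one node failure per day)

# Pr[# of blocks lost in one day >= 250] under a Poisson(250) model
p = poisson.sf(lam - 1, lam)    # sf(249, 250) = P(X > 249) = P(X >= 250)
print(round(p, 3))              # ~0.508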


If you want to learn more about Hadoop, or are looking for project or assignment help related to Hadoop, big data, Spark, etc., you can send the details of your requirements to the contact address below:


contact@codersarts.com


