
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Ecosystem and Components


[Diagram: the various components of the Hadoop ecosystem]

Apache Hadoop consists of two core sub-projects:

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. MapReduce programs can process enormous volumes of data in parallel on large clusters of compute nodes (a minimal example is sketched after this list).
2. HDFS (Hadoop Distributed File System): HDFS handles the storage side of Hadoop applications; MapReduce applications consume their data from HDFS. HDFS creates multiple replicas of each data block and distributes them across the compute nodes of a cluster. This distribution enables reliable and extremely rapid computation.
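
To make the MapReduce model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's standard org.apache.hadoop.mapreduce Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the command-line input/output paths are illustrative placeholders, not part of the framework itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner pre-aggregates counts on each map node, which is a common way to reduce the volume of intermediate data shuffled across the network.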

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: it is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large datasets.
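
As a small illustration of how an application talks to HDFS, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and print the replication factor of the files in the target directory. The class name and paths are placeholders, and the snippet assumes the cluster address is supplied by a core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath;
    // fs.defaultFS is assumed to point at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path localFile = new Path(args[0]);  // local file to upload
    Path hdfsTarget = new Path(args[1]); // target path in HDFS

    // Copy the local file into HDFS; the NameNode records the block locations
    // while the DataNodes store the replicated blocks.
    fs.copyFromLocalFile(localFile, hdfsTarget);

    // List the target directory and print each file's replication factor.
    for (FileStatus status : fs.listStatus(hdfsTarget.getParent())) {
      System.out.println(status.getPath() + "  replication=" + status.getReplication());
    }
    fs.close();
  }
}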
Apart from the two core components mentioned above, the Hadoop framework also includes the following two modules:
• Hadoop Common − the Java libraries and utilities required by the other Hadoop modules.
• Hadoop YARN − a framework for job scheduling and cluster resource management.
• Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

Hadoop Architecture

High Level Hadoop Architecture


Hadoop has a master-slave architecture for data storage and for distributed data processing using MapReduce and HDFS.
NameNode:
The NameNode represents every file and directory in the HDFS namespace.
DataNode:
A DataNode manages the storage of an HDFS node and lets applications interact with the data blocks stored there.
Master node:
The master node allows you to conduct parallel processing of the data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster that store the data and carry out the computations. Each slave node runs a TaskTracker and a DataNode, which synchronize with the JobTracker and the NameNode respectively.
In Hadoop, master and slave systems can be set up in the cloud or on premises.
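
To make the storage side of this architecture concrete, the following is a minimal configuration sketch. The host name, port, and replication factor are placeholder values, while fs.defaultFS (in core-site.xml) and dfs.replication (in hdfs-site.xml) are the standard property names that point client and worker nodes at the NameNode and control how many copies of each block are kept.

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- placeholder address of the NameNode -->
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- number of replicas kept for each data block -->
    <value>3</value>
  </property>
</configuration>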

How Does Hadoop Work?


It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers into a single functional distributed system: the clustered machines read the dataset in parallel and provide much higher throughput, and the setup is cheaper than one high-end server. This is the first motivation for using Hadoop: it runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
• Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); see the sketch after this list.
• These files are then distributed across the various cluster nodes for further processing.
• HDFS, sitting on top of the local file system, supervises the processing.
• Blocks are replicated to handle hardware failure.
• Checking that the code executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a particular computer.
• Writing the debugging logs for each job.
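
The block size and replication factor are normally taken from the cluster configuration, but they can also be set per file. Below is a small sketch (not Hadoop's internal code) that writes a file to HDFS with an explicit 128 MB block size and a replication factor of 3; the class name and path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    long blockSize = 128L * 1024 * 1024; // 128 MB blocks, as described above
    short replication = 3;               // three replicas of every block
    int bufferSize = 4096;               // I/O buffer size in bytes

    // Create the file with an explicit block size and replication factor.
    try (FSDataOutputStream out = fs.create(
        new Path("/user/demo/data.txt"), true, bufferSize, replication, blockSize)) {
      out.writeUTF("sample record");
    }
    fs.close();
  }
}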

Features of Hadoop
• Suitable for Big Data Analysis
Since Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited to its analysis. Because it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept, called data locality, helps increase the efficiency of Hadoop-based applications.

• Scalability
Hadoop clusters can easily be scaled by adding cluster nodes, which allows them to keep up with the growth of Big Data. Scaling does not require modifications to the application logic.

• Fault Tolerance
The Hadoop ecosystem replicates the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the data stored on another cluster node.
