
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Ecosystem and Components


[Diagram: the various components of the Hadoop ecosystem]

Apache Hadoop consists of two core sub-projects:

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. MapReduce programs can process enormous volumes of data in parallel on large clusters of compute nodes (a minimal example is sketched after this list).
2. HDFS (Hadoop Distributed File System): HDFS handles the storage side of Hadoop applications; MapReduce applications consume their data from HDFS. HDFS creates multiple replicas of each data block and distributes them across the compute nodes of a cluster. This distribution enables reliable and extremely rapid computation.
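
To make the MapReduce model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's standard org.apache.hadoop.mapreduce Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the command-line input/output paths are illustrative placeholders, not part of the framework itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner pre-aggregates counts on each map node, which is a common way to reduce the volume of intermediate data shuffled across the network.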

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: it is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large datasets.
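
As a small illustration of how an application talks to HDFS, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and print the replication factor of the files in the target directory. The class name and paths are placeholders, and the snippet assumes the cluster address is supplied by a core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath;
    // fs.defaultFS is assumed to point at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path localFile = new Path(args[0]);  // local file to upload
    Path hdfsTarget = new Path(args[1]); // target path in HDFS

    // Copy the local file into HDFS; the NameNode records the block locations
    // while the DataNodes store the replicated blocks.
    fs.copyFromLocalFile(localFile, hdfsTarget);

    // List the target directory and print each file's replication factor.
    for (FileStatus status : fs.listStatus(hdfsTarget.getParent())) {
      System.out.println(status.getPath() + "  replication=" + status.getReplication());
    }
    fs.close();
  }
}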
Apart from the two core components mentioned above, the Hadoop framework also includes the following two modules:
• Hadoop Common − the Java libraries and utilities required by the other Hadoop modules.
• Hadoop YARN − a framework for job scheduling and cluster resource management.
• Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

Hadoop Architecture

High Level Hadoop Architecture


Hadoop has a master-slave architecture for data storage and for distributed data processing using MapReduce and HDFS.
NameNode:
The NameNode represents every file and directory in the HDFS namespace.
DataNode:
A DataNode manages the storage of an HDFS node and lets applications interact with the data blocks stored there.
Master node:
The master node allows you to conduct parallel processing of the data using Hadoop MapReduce.
Slave node:
The slave nodes are the additional machines in the Hadoop cluster that store the data and carry out the computations. Each slave node runs a TaskTracker and a DataNode, which synchronize with the JobTracker and the NameNode respectively.
In Hadoop, master and slave systems can be set up in the cloud or on premises.
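
To make the storage side of this architecture concrete, the following is a minimal configuration sketch. The host name, port, and replication factor are placeholder values, while fs.defaultFS (in core-site.xml) and dfs.replication (in hdfs-site.xml) are the standard property names that point client and worker nodes at the NameNode and control how many copies of each block are kept.

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- placeholder address of the NameNode -->
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- number of replicas kept for each data block -->
    <value>3</value>
  </property>
</configuration>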

How Does Hadoop Work?


It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers into a single functional distributed system: the clustered machines read the dataset in parallel and provide much higher throughput, and the setup is cheaper than one high-end server. This is the first motivation for using Hadoop: it runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
• Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); see the sketch after this list.
• These files are then distributed across the various cluster nodes for further processing.
• HDFS, sitting on top of the local file system, supervises the processing.
• Blocks are replicated to handle hardware failure.
• Checking that the code executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a particular computer.
• Writing the debugging logs for each job.
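
The block size and replication factor are normally taken from the cluster configuration, but they can also be set per file. Below is a small sketch (not Hadoop's internal code) that writes a file to HDFS with an explicit 128 MB block size and a replication factor of 3; the class name and path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    long blockSize = 128L * 1024 * 1024; // 128 MB blocks, as described above
    short replication = 3;               // three replicas of every block
    int bufferSize = 4096;               // I/O buffer size in bytes

    // Create the file with an explicit block size and replication factor.
    try (FSDataOutputStream out = fs.create(
        new Path("/user/demo/data.txt"), true, bufferSize, replication, blockSize)) {
      out.writeUTF("sample record");
    }
    fs.close();
  }
}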

Features of Hadoop
• Suitable for Big Data Analysis
Since Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited to its analysis. Because it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept, called data locality, helps increase the efficiency of Hadoop-based applications.

• Scalability
Hadoop clusters can easily be scaled by adding cluster nodes, which allows them to keep up with the growth of Big Data. Scaling does not require modifications to the application logic.

• Fault Tolerance
The Hadoop ecosystem replicates the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the data stored on another cluster node.
