BIG DATA ANALYTICS
Lecture 5 --- Week 5
Content
Distributed System
Challenges of Distributed Systems
Apache Hadoop
Characteristics of Hadoop
Four Distinct layers of Hadoop
Common Use Cases for Big Data in Hadoop
Data Storage Operations on HDFS
Distributed System
A distributed system is a model in which components located on networked
computers communicate and coordinate their actions by passing messages.
How Does a Distributed System Work?
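As a minimal sketch of this message-passing idea (the port number and the PING message are illustrative assumptions, not part of any particular system), one component sends a message over the network and another receives it and acts on it:

import java.io.*;
import java.net.*;

// Minimal illustration of message passing between two networked components.
// The port number and message contents are illustrative assumptions only.
public class PingNode {
    public static void main(String[] args) throws IOException {
        // One component listens for messages from other nodes.
        try (ServerSocket server = new ServerSocket(9000)) {
            // A second component (here just another thread) sends a message.
            new Thread(() -> {
                try (Socket client = new Socket("localhost", 9000);
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println("PING"); // message sent over the network
                } catch (IOException ignored) { }
            }).start();

            // The listening component receives the message and coordinates its action.
            try (Socket peer = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(peer.getInputStream()))) {
                System.out.println("Received: " + in.readLine());
            }
        }
    }
}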
Challenges of Distributed Systems
Since multiple computers are used in a distributed system, there is a high chance of failures: individual nodes can crash, the network can lose or delay messages, and keeping data consistent and coordinated across machines becomes difficult.
Introduction to Hadoop
Hadoop is a framework that allows distributed processing of large datasets
across clusters of commodity computers using simple programming models
The original task that Hadoop was created for revolved around building search
indices
It has now become a software ecosystem that forms the core of a data center operating system, built from the ground up for scalable data processing and analytics
Characteristics of Hadoop
• Open Source
Apache Hadoop is an open source project. It means its code can be
modified according to business requirements.
• Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster,
data is processed in parallel on a cluster of nodes.
• Fault Tolerance
This is one of the most important features of Hadoop. By default, 3 replicas of each block are stored across the cluster, and this can be changed as required. So if any node goes down, the data on that node can easily be recovered from other nodes thanks to this characteristic. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
Characteristics of Hadoop
• Reliability
Due to replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. Even if a machine goes down, your data will still be stored reliably because of this characteristic of Hadoop.
• High Availability
Data is highly available and accessible despite hardware failure because multiple copies of the data exist. If a machine or some piece of hardware crashes, the data can be accessed from another path.
• Scalability
Hadoop is highly scalable: new hardware can easily be added to the cluster. Hadoop also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
• Economic
Apache Hadoop is not very expensive as it runs on a cluster of
commodity hardware.
Characteristics of Hadoop
• Easy to use
The client does not need to deal with distributed computing; the framework takes care of all of it. This is what makes Hadoop easy to use.
• Data Locality
This is a unique feature of Hadoop that lets it handle Big Data efficiently. Hadoop works on the data locality principle, which states that computation should be moved to the data instead of moving data to the computation
Four distinctive layers of Hadoop
Common Use Cases for Big Data in Hadoop
Financial Sector
Healthcare Sector
Telecom Industry
Retail Sector
Building Recommendation System
Data Storage Operations on HDFS
The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications.
HDFS employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.
How does HDFS work?
HDFS stands for Hadoop Distributed File System.
It provides data storage for Hadoop.
HDFS splits the data unit into smaller units called blocks and stores them in a
distributed manner.
It has two daemons running: NameNode on the master node and DataNode on the
slave nodes.
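As a minimal sketch (assuming the standard Hadoop Java client on the classpath; the NameNode URI and file path are placeholders), a client writes and reads a file through the FileSystem API while HDFS transparently splits it into blocks stored on DataNodes:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client streams bytes; HDFS splits them into blocks and
            // the NameNode decides which DataNodes store each block.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations and the client reads
            // the blocks directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}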
a. NameNode and DataNode
HDFS has a Master-slave architecture. NameNode runs on the master server.
It is responsible for Namespace management and regulates file access by the
client.
The NameNode manages modifications to the file system namespace, such as
opening, closing, and renaming files or directories.
NameNode also keeps track of mapping of blocks to DataNodes.
NameNode coordinates with hundreds or thousands of data nodes and serves
the requests coming from client applications.
Two files ‘FSImage’ and the ‘EditLog’ are used to store metadata information.
FsImage: It is the snapshot of the file system when the NameNode is started. It is an
“Image file”. FsImage contains the entire filesystem namespace and is stored as
a file in the NameNode’s local file system. It also contains a serialized form of
all the directories and file inodes in the filesystem. Each inode is an internal
representation of a file or directory’s metadata.
EditLogs: It contains all the recent modifications made to the file system since
the most recent FsImage. When the NameNode receives a create/update/delete
request from a client, the request is first recorded in the edit log.
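A hedged sketch of the kind of client requests that reach the NameNode (the paths are placeholders): each call below is a namespace modification that the NameNode records in the edit log and that is later checkpointed into the FsImage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS is configured to point at the NameNode.
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Each call is a namespace modification handled by the NameNode
            // and first written to the edit log.
            fs.mkdirs(new Path("/user/demo/raw"));                                  // create a directory
            fs.rename(new Path("/user/demo/raw"), new Path("/user/demo/staged"));   // rename it
            fs.delete(new Path("/user/demo/staged"), true);                         // recursive delete
        }
    }
}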
a. NameNode and DataNode
DataNode runs on slave nodes.
It is responsible for storing actual business data. Internally, a file gets split
into a number of data blocks and stored on a group of slave machines.
These DataNodes serve read/write requests from the file system’s clients.
DataNode also creates, deletes and replicates blocks on demand from
NameNode.
Java is the native language of HDFS.
Hence one can deploy DataNode and NameNode on machines having Java
installed.
In a typical deployment, there is one dedicated machine running NameNode.
And all the other nodes in the cluster run DataNode.
b. Block in HDFS
Block is nothing but the smallest unit of storage on a computer system. It is
the smallest contiguous storage allocated to a file. In Hadoop, we have a
default block size of 128MB or 256 MB.
One should select the block size very carefully. To see why, take the example of
a file that is 700 MB in size. If the block size is 128 MB, HDFS divides the file
into 6 blocks: five blocks of 128 MB and one block of 60 MB. What would happen
if the block size were 4 KB? In HDFS we typically have files whose sizes are of
the order of terabytes to petabytes. With a 4 KB block size we would have an
enormous number of blocks, which in turn creates huge metadata that overloads
the NameNode. Hence we have to choose the HDFS block size judiciously.
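A small worked example of the arithmetic above (file and block sizes are just the values from the text):

public class BlockCount {
    static long blocksNeeded(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: the last block may be smaller than the block size.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize  = 700L * 1024 * 1024;   // 700 MB file
        long bigBlock  = 128L * 1024 * 1024;   // 128 MB block size
        long tinyBlock = 4L * 1024;            // 4 KB block size

        System.out.println(blocksNeeded(fileSize, bigBlock));   // 6 blocks (5 x 128 MB + 1 x 60 MB)
        System.out.println(blocksNeeded(fileSize, tinyBlock));  // 179200 blocks -> huge NameNode metadata
    }
}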
c. Replication Management
To provide fault tolerance, HDFS uses a replication technique: it makes copies
of each block and stores them on different DataNodes. The replication factor
decides how many copies of a block get stored. It is 3 by default, but it can be
configured to any value.
The above figure shows how the replication technique works. Suppose we have a file
of 1 GB; with a replication factor of 3 it will require 3 GB of total storage.
To maintain the replication factor, the NameNode collects a block report from every
DataNode. Whenever a block is under-replicated or over-replicated, the NameNode
adds or deletes replicas accordingly.
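A minimal sketch (the path and values are illustrative) of setting the replication factor, either as the cluster-wide default property or for a single existing file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml (3 by default).
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");
            // Override the replication factor for one file; the NameNode will
            // add or remove replicas until the target is met.
            fs.setReplication(file, (short) 2);
        }
    }
}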
d. What is Rack Awareness?
What is a rack?
A rack is a collection of around 40-50 DataNodes connected to the same
network switch. If the switch goes down, the whole rack becomes
unavailable. A large Hadoop cluster is deployed across multiple racks.
A rack contains many DataNode machines, and there are several such racks in
production. HDFS follows a rack awareness algorithm to place the
replicas of blocks in a distributed fashion. This rack awareness algorithm
provides low latency and fault tolerance. Suppose the configured replication
factor is 3. The rack awareness algorithm will place the first replica on the
local rack and keep the other two replicas on a different rack. It does not
store more than two replicas in the same rack, if possible.
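The following is a rough illustrative sketch of the placement rule just described, not the actual HDFS implementation; the rack names and node lists are made-up assumptions:

import java.util.*;

public class RackAwarePlacementSketch {
    // Pick target nodes for 3 replicas following the rule described above:
    // first replica on the local rack, the remaining two together on one other
    // rack, so that no rack ends up holding more than two replicas.
    static List<String> placeReplicas(String localRack, Map<String, List<String>> nodesByRack) {
        List<String> targets = new ArrayList<>();
        targets.add(nodesByRack.get(localRack).get(0));                // replica 1: local rack

        for (Map.Entry<String, List<String>> rack : nodesByRack.entrySet()) {
            if (!rack.getKey().equals(localRack) && rack.getValue().size() >= 2) {
                targets.add(rack.getValue().get(0));                   // replica 2: a remote rack
                targets.add(rack.getValue().get(1));                   // replica 3: same remote rack, different node
                break;
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cluster = new LinkedHashMap<>();
        cluster.put("/rack1", Arrays.asList("node1", "node2"));
        cluster.put("/rack2", Arrays.asList("node3", "node4"));
        System.out.println(placeReplicas("/rack1", cluster));          // [node1, node3, node4]
    }
}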