BIG DATA ANALYTICS
Lecture 5 --- Week 5
Content
Distributed System
Challenges of Distributed Systems
Apache Hadoop
Characteristics of Hadoop
Four Distinct layers of Hadoop
Common Use Cases for Big Data in Hadoop
Data Storage Operations on HDFS
Distributed System
A distributed system is a model in which components located on networked
computers communicate and coordinate their actions by passing messages.
How Does a Distributed System Work?
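As a minimal sketch of this message-passing idea (the port number and the PING message are illustrative assumptions, not part of any particular system), one component sends a message over the network and another receives it and acts on it:

import java.io.*;
import java.net.*;

// Minimal illustration of message passing between two networked components.
// The port number and message contents are illustrative assumptions only.
public class PingNode {
    public static void main(String[] args) throws IOException {
        // One component listens for messages from other nodes.
        try (ServerSocket server = new ServerSocket(9000)) {
            // A second component (here just another thread) sends a message.
            new Thread(() -> {
                try (Socket client = new Socket("localhost", 9000);
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println("PING"); // message sent over the network
                } catch (IOException ignored) { }
            }).start();

            // The listening component receives the message and coordinates its action.
            try (Socket peer = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(peer.getInputStream()))) {
                System.out.println("Received: " + in.readLine());
            }
        }
    }
}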
Challenges of Distributed Systems
Since multiple computers are used in a distributed system, there is a high chance of failures: individual nodes can crash, the network can lose or delay messages, and keeping data consistent and coordinated across machines becomes difficult.
Introduction to Hadoop
Hadoop is a framework that allows distributed processing of large datasets
across clusters of commodity computers using simple programming models
The original task that Hadoop was created for revolved around building search
indices
It has now become a software ecosystem that forms the core of a data center operating system, built from the ground up for scalable data processing and analytics
Characteristics of Hadoop
• Open Source
Apache Hadoop is an open source project. It means its code can be
modified according to business requirements.
• Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster,
data is processed in parallel on a cluster of nodes.
• Fault Tolerance
This is one of the most important features of Hadoop. By default, 3 replicas of each block are stored across the cluster, and this can be changed as required. So if any node goes down, the data on that node can easily be recovered from other nodes thanks to this characteristic. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
Characteristics of Hadoop
• Reliability
Due to replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. Even if a machine goes down, your data will still be stored reliably because of this characteristic of Hadoop.
• High Availability
Data is highly available and accessible despite hardware failure because multiple copies of the data exist. If a machine or some piece of hardware crashes, the data can be accessed from another path.
• Scalability
Hadoop is highly scalable: new hardware can easily be added to the cluster. Hadoop also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
• Economic
Apache Hadoop is not very expensive as it runs on a cluster of
commodity hardware.
Characteristics of Hadoop
• Easy to use
The client does not need to deal with distributed computing; the framework takes care of all of it. This is what makes Hadoop easy to use.
• Data Locality
This is a unique feature of Hadoop that lets it handle Big Data efficiently. Hadoop works on the data locality principle, which states that computation should be moved to the data instead of moving data to the computation
Four distinctive layers of Hadoop
Common Use Cases for Big Data in Hadoop
Financial Sector
Healthcare Sector
Telecom Industry
Retail Sector
Building Recommendation System
Data Storage Operations on HDFS
The Hadoop Distributed File System (HDFS) is the primary data storage system
used by Hadoop applications.
HDFS employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.
How does HDFS work?
HDFS stands for Hadoop Distributed File System.
It provides data storage for Hadoop.
HDFS splits the data unit into smaller units called blocks and stores them in a
distributed manner.
It has two daemons running: NameNode on the master node and DataNode on the
slave nodes.
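As a minimal sketch (assuming the standard Hadoop Java client on the classpath; the NameNode URI and file path are placeholders), a client writes and reads a file through the FileSystem API while HDFS transparently splits it into blocks stored on DataNodes:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client streams bytes; HDFS splits them into blocks and
            // the NameNode decides which DataNodes store each block.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations and the client reads
            // the blocks directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}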
a. NameNode and DataNode
HDFS has a Master-slave architecture. NameNode runs on the master server.
It is responsible for Namespace management and regulates file access by the
client.
The NameNode manages modifications to the file system namespace, such as
opening, closing, and renaming files or directories.
NameNode also keeps track of mapping of blocks to DataNodes.
NameNode coordinates with hundreds or thousands of data nodes and serves
the requests coming from client applications.
Two files ‘FSImage’ and the ‘EditLog’ are used to store metadata information.
FsImage: It is the snapshot of the file system when the NameNode is started. It is an
“Image file”. FsImage contains the entire filesystem namespace and is stored as
a file in the NameNode’s local file system. It also contains a serialized form of
all the directories and file inodes in the filesystem. Each inode is an internal
representation of a file or directory’s metadata.
EditLogs: It contains all the recent modifications made to the file system since
the most recent FsImage. When the NameNode receives a create/update/delete
request from a client, the request is first recorded in the edit log.
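A hedged sketch of the kind of client requests that reach the NameNode (the paths are placeholders): each call below is a namespace modification that the NameNode records in the edit log and that is later checkpointed into the FsImage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS is configured to point at the NameNode.
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Each call is a namespace modification handled by the NameNode
            // and first written to the edit log.
            fs.mkdirs(new Path("/user/demo/raw"));                                  // create a directory
            fs.rename(new Path("/user/demo/raw"), new Path("/user/demo/staged"));   // rename it
            fs.delete(new Path("/user/demo/staged"), true);                         // recursive delete
        }
    }
}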
a. NameNode and DataNode
DataNode runs on slave nodes.
It is responsible for storing actual business data. Internally, a file gets split
into a number of data blocks and stored on a group of slave machines.
These DataNodes serve read/write requests from the file system’s clients.
DataNode also creates, deletes and replicates blocks on demand from
NameNode.
Java is the native language of HDFS.
Hence one can deploy DataNode and NameNode on machines having Java
installed.
In a typical deployment, there is one dedicated machine running NameNode.
And all the other nodes in the cluster run DataNode.
b. Block in HDFS
Block is nothing but the smallest unit of storage on a computer system. It is
the smallest contiguous storage allocated to a file. In Hadoop, we have a
default block size of 128MB or 256 MB.
One should select the block size very carefully. To see why, take the example of
a file that is 700 MB in size. If the block size is 128 MB, HDFS divides the file
into 6 blocks: five blocks of 128 MB and one block of 60 MB. What would happen
if the block size were 4 KB? In HDFS we typically have files whose sizes are of
the order of terabytes to petabytes. With a 4 KB block size we would have an
enormous number of blocks, which in turn creates huge metadata that overloads
the NameNode. Hence we have to choose the HDFS block size judiciously.
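A small worked example of the arithmetic above (file and block sizes are just the values from the text):

public class BlockCount {
    static long blocksNeeded(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: the last block may be smaller than the block size.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize  = 700L * 1024 * 1024;   // 700 MB file
        long bigBlock  = 128L * 1024 * 1024;   // 128 MB block size
        long tinyBlock = 4L * 1024;            // 4 KB block size

        System.out.println(blocksNeeded(fileSize, bigBlock));   // 6 blocks (5 x 128 MB + 1 x 60 MB)
        System.out.println(blocksNeeded(fileSize, tinyBlock));  // 179200 blocks -> huge NameNode metadata
    }
}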
c. Replication Management
To provide fault tolerance, HDFS uses a replication technique: it makes copies
of each block and stores them on different DataNodes. The replication factor
decides how many copies of a block get stored. It is 3 by default, but it can be
configured to any value.
The above figure shows how the replication technique works. Suppose we have a file
of 1 GB; with a replication factor of 3 it will require 3 GB of total storage.
To maintain the replication factor, the NameNode collects a block report from every
DataNode. Whenever a block is under-replicated or over-replicated, the NameNode
adds or deletes replicas accordingly.
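A minimal sketch (the path and values are illustrative) of setting the replication factor, either as the cluster-wide default property or for a single existing file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml (3 by default).
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");
            // Override the replication factor for one file; the NameNode will
            // add or remove replicas until the target is met.
            fs.setReplication(file, (short) 2);
        }
    }
}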
d. What is Rack Awareness?
What is a rack?
A rack is a collection of around 40-50 DataNodes connected to the same
network switch. If the switch goes down, the whole rack becomes
unavailable. A large Hadoop cluster is deployed across multiple racks.
A rack contains many DataNode machines, and there are several such racks in
production. HDFS follows a rack awareness algorithm to place the
replicas of blocks in a distributed fashion. This rack awareness algorithm
provides low latency and fault tolerance. Suppose the configured replication
factor is 3. The rack awareness algorithm will place the first replica on the
local rack and keep the other two replicas on a different rack. It does not
store more than two replicas in the same rack, if possible.
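The following is a rough illustrative sketch of the placement rule just described, not the actual HDFS implementation; the rack names and node lists are made-up assumptions:

import java.util.*;

public class RackAwarePlacementSketch {
    // Pick target nodes for 3 replicas following the rule described above:
    // first replica on the local rack, the remaining two together on one other
    // rack, so that no rack ends up holding more than two replicas.
    static List<String> placeReplicas(String localRack, Map<String, List<String>> nodesByRack) {
        List<String> targets = new ArrayList<>();
        targets.add(nodesByRack.get(localRack).get(0));                // replica 1: local rack

        for (Map.Entry<String, List<String>> rack : nodesByRack.entrySet()) {
            if (!rack.getKey().equals(localRack) && rack.getValue().size() >= 2) {
                targets.add(rack.getValue().get(0));                   // replica 2: a remote rack
                targets.add(rack.getValue().get(1));                   // replica 3: same remote rack, different node
                break;
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cluster = new LinkedHashMap<>();
        cluster.put("/rack1", Arrays.asList("node1", "node2"));
        cluster.put("/rack2", Arrays.asList("node3", "node4"));
        System.out.println(placeReplicas("/rack1", cluster));          // [node1, node3, node4]
    }
}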