CS19P16-DATA ANALYTICS
UNIT II - HDFS & MapReduce
Contents
•HDFS
•Processing data with Hadoop
•Map Reduce
•Managing Resources and Applications with Hadoop YARN
•Interacting with Hadoop Ecosystems
What is HDFS?
•HDFS is a distributed file system that provides access
to data across Hadoop clusters.
• A cluster is a group of computers that work together.
•HDFS is a key tool that manages and supports the analysis of very large volumes of data, ranging from petabytes up to zettabytes.
Why HDFS?
•Before 2011, storing and retrieving
petabytes or zettabytes of data had the following
three major challenges: Cost, Speed, Reliability.
• Also, if search components were saved on different
servers, fetching data was difficult.
•Cost
•HDFS is open-source software, so it can be used with zero licensing and support costs. It is designed to run on regular commodity computers.
•Speed
•Large Hadoop clusters can read or write more than a
terabyte of data per second. A cluster comprises multiple
systems logically interconnected in the same network.
•HDFS can easily deliver more than two gigabytes of data
per second, per computer to MapReduce, which is a data
processing framework of Hadoop.
Why HDFS?
•Reliability
• HDFS copies the data multiple times and distributes the copies to individual nodes. A node is a commodity server interconnected through a network device.
• HDFS places at least one copy of the data on a different server, so if data is deleted from any node, it can still be found elsewhere within the cluster.
•A regular file system, such as a Linux file system, differs from HDFS in the size of its data blocks. In a regular file system, each block of data is small, typically a few kilobytes. In HDFS, however, each block is 128 Megabytes by default.
•A regular file system provides access to large data but may suffer from disk input/output problems, mainly due to multiple seek operations.
•On the other hand, HDFS can read large quantities of data sequentially, with far fewer seek operations.
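•As a rough illustration (the numbers are our own, not from these slides): a 1 GB file stored with the default 128 MB HDFS block size occupies only 8 blocks, whereas the same file on a regular file system with 4 KB blocks spans roughly 262,000 blocks, which is why sequential reads in HDFS need far fewer seeks.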
Characteristics of HDFS
•HDFS has high fault-tolerance
•HDFS may consist of thousands of server machines. Each machine stores a part of the file system data. HDFS detects faults that can occur on any of the machines and recovers from them quickly and automatically.
•HDFS has high throughput
•HDFS is designed to store and scan millions of rows of data
and to count or add some subsets of the data. The
time required in this process is dependent on the
complexities involved.
•It has been designed to support large datasets in batch-
style jobs. However, the emphasis is on high throughput of
data access rather than low latency.
Characteristics of HDFS
•HDFS is economical
•HDFS is designed so that it can be built on commodity hardware and heterogeneous platforms, which are low-priced and easily available.
•HDFS stores files as a number of blocks. Each block is replicated to a few separate computers, and the replication count can be modified by the administrator. Data is divided into 128 Megabyte blocks and replicated across the local disks of cluster nodes. Metadata controls the physical location of a block and its replication within the cluster; this metadata is stored in the NameNode. HDFS is the storage system for both the input and output of MapReduce jobs. Let's understand how HDFS stores files with an example.
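The 128 MB block size and the replication count above are cluster-wide defaults, but a client may also set them per file. Below is a minimal, illustrative sketch using the Hadoop Java FileSystem API; the NameNode address and the file path are assumptions, not values from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address

        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize)
        // Here the replication factor (3) and block size (128 MB) are set per file.
        try (FSDataOutputStream out = fs.create(
                new Path("/data/sample.txt"),   // illustrative path
                true,                           // overwrite if it exists
                4096,                           // client buffer size in bytes
                (short) 3,                      // number of replicas per block
                128L * 1024 * 1024)) {          // block size in bytes (128 MB)
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}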
How Does HDFS Work?
•Example - A patron gifted a collection of popular
books to a college library. The librarian decided to
arrange the books on a small rack and then distribute
multiple copies of each book on other racks. This
way the students could easily pick up a book from
any of the racks.
•Similarly, HDFS creates multiple copies of a data
block and keeps them in separate systems for easy
access.
HDFS Architecture and Components
•There is also a Secondary NameNode, which performs tasks on behalf of the NameNode and is also considered a master node. Prior to Hadoop 2.0.0, the NameNode was a Single Point of Failure, or SPOF, in an HDFS cluster.
•Each cluster had a single NameNode. In case of an
unplanned event, such as a system failure, the
cluster would be unavailable until an operator
restarted the NameNode.
•Also, planned maintenance events, such as software or
hardware upgrades on the NameNode system, would
result in cluster downtime.
•The HDFS High Availability, or HA, feature addresses
these problems by providing the option of running
two redundant NameNodes in the same cluster in
an Active/Passive configuration with a hot standby.
•This allows a fast failover to a new NameNode in case a
system crashes or an administrator initiates a
failover for the purpose of a planned
maintenance.
•In an HA cluster, two separate systems are
configured as NameNodes. At any instance, one of
the NameNodes is in an Active state, and the other is
in a Standby state.
•The Active NameNode is responsible for all client
operations in the cluster, while the Standby
simply acts as a slave, maintaining enough state
to provide a fast failover if necessary.
HDFS Components
•The main components of HDFS
are:
•Namenode
•Secondary Namenode
•File system
•Metadata
•Datanode
Namenode
•The NameNode server is the core component of an HDFS cluster, and there can be only one NameNode server in an entire cluster. The NameNode maintains and executes file system namespace operations, such as opening, closing, and renaming files and directories, which are present in HDFS.
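These namespace operations map directly onto calls of the Hadoop Java FileSystem API. A brief, illustrative sketch follows; the paths are made up, and the snippet assumes fs.defaultFS points at the cluster (for example via core-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceOpsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/reports"));                  // create a directory
        fs.rename(new Path("/user/demo/reports"),                   // rename (move) it
                  new Path("/user/demo/archive"));
        boolean deleted = fs.delete(new Path("/user/demo/archive"), // delete it recursively
                                    true);
        System.out.println("Deleted: " + deleted);

        // Each call above is a namespace operation carried out by the NameNode.
        fs.close();
    }
}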
•The namespace image and the edit log store information about the data and the metadata. The NameNode also determines the mapping of blocks to DataNodes. Furthermore, the NameNode is a single point of failure.
•The DataNode is a multiple-instance server: there can be many DataNode servers, and the number depends on the type of network and the storage system.
•The DataNode servers store and maintain the data blocks. The NameNode server provisions the data blocks on the basis of the type of job submitted by the client.
•The DataNode also stores and retrieves blocks when asked by clients or the NameNode. Furthermore, it serves read/write requests and performs block creation, deletion, and replication on instruction from the NameNode. Note that there can be only one Secondary NameNode server in a cluster.
Secondary Namenode
•The Secondary NameNode server maintains the edit
log and namespace image information in sync
with the NameNode server. At times, the
namespace images from the NameNode server are
not updated; therefore, you cannot totally rely on the
Secondary NameNode server for the recovery process.
File System
•HDFS exposes a file system namespace and allows user
data to be stored in files. HDFS has a hierarchical file
system with directories and files. The NameNode
manages the file system namespace, allowing clients to
work with files and directories.
•A file system supports operations like create,
remove, move, and rename. The NameNode,
apart from maintaining the file system
namespace, records any change to metadata
information.
•Now that we have learned about HDFS components,
let us see how NameNode works along with other
components.
Namenode: Operation
•The NameNode maintains two persistent files: a transaction log called the Edit Log and a namespace image called the FsImage. The Edit Log records every change that occurs in the file system metadata, such as creating a new file.
Namenode: Operation
The Edit Log is stored in the NameNode's local file system. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage, which is also kept in the NameNode's local file system.
Metadata
•When a DataNode joins a cluster, it reports the blocks it holds, and the NameNode loads this block metadata into its memory at startup. The metadata is then refreshed periodically, at user-defined or default intervals.
•When the NameNode starts up, it retrieves the Edit
Log and FsImage from its local file system. It then
updates the FsImage with Edit Log information and
stores a copy of the FsImage on the file system as a
checkpoint.
•The metadata size is limited to the RAM available
on the NameNode. A large number of small files
would require more metadata than a small number
of large files. Hence, the in-memory metadata
management issue explains why HDFS favors a small
number of large files.
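•As a rough, illustrative estimate (the per-object figure is a commonly quoted rule of thumb, not from these slides): if each file, directory, and block takes on the order of 150 bytes of NameNode memory, then 10 million single-block files need roughly 10,000,000 × 2 × 150 bytes ≈ 3 GB of NameNode RAM, while the same data packed into a few thousand large files needs only a few megabytes of metadata.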
•If a NameNode runs out of RAM, it will crash,
and the applications will not be able to use HDFS until
the NameNode is operational again.
•Data block split is an important process of HDFS
architecture. As discussed earlier, each file is split
into one or more blocks stored and replicated in
DataNodes.
Datanode
•DataNodes store the actual file blocks, while the NameNode keeps track of their names and locations. By default, each file block is 128 Megabytes. A larger block size means fewer blocks per file, which potentially reduces the amount of parallelism that can be achieved.
•The data block approach provides:
•Simplified replication
•Fault-tolerance
•Reliability.
•It also helps by shielding users from storage sub-
system details.
Block Replication Architecture
•Block replication refers to creating copies of a block on multiple DataNodes. Usually, the data file is split into parts (for example, part 0 and part 1), and each part is replicated across the DataNodes.
Replication Method
•In the replication method, each file is split into a
sequence of blocks. All blocks except the last one in
the file are of the same size. Blocks are replicated for
fault tolerance.
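•For instance (an illustrative calculation, not from these slides): a 500 MB file with the default 128 MB block size is split into four blocks of 128 MB, 128 MB, 128 MB, and 116 MB; with a replication factor of 3, the cluster stores 4 × 3 = 12 block replicas, about 1.5 GB of raw storage for 500 MB of data.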
Data Replication Topology
How Are Files Stored?
Example
•Suppose there are 2 log files to save from a local file system to the HDFS cluster.
•The cluster has 5 data nodes: node A, node B, node C, node D, and node E.
•The first log file is divided into three blocks: b1, b2, and b3, and the other log file is divided into two blocks: b4 and b5.
•The blocks b1, b2, b3, b4, and b5 are then distributed across node A, node B, node C, and node D, as shown in the diagram.
READ OPERATION
1. A client initiates the read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This process of the read() operation continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method.
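As an illustration of the read path described above, here is a minimal client-side sketch using the Hadoop Java FileSystem API; the NameNode URI and file path are assumptions, not values from these slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);              // DistributedFileSystem under the hood

        // open() returns an FSDataInputStream backed by DFSInputStream, which asks
        // the NameNode for block locations and streams data from the DataNodes
        // block by block.
        try (FSDataInputStream in = fs.open(new Path("/logs/app.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // close() releases the connections to the DataNodes
        fs.close();
    }
}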
WRITE OPERATION
1. A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file - Step no. 1 in the above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not exist already and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once the new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object, which looks after communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue called the DataQueue.
5. There is one more component called the DataStreamer, which consumes this DataQueue. The DataStreamer also asks the NameNode for the allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
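A minimal client-side sketch of the write path described above, again using the Hadoop Java FileSystem API; the NameNode URI and file path are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to record the new file (no blocks yet) and
        // returns an FSDataOutputStream backed by DFSOutputStream, which packages
        // the written bytes into packets and streams them down the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/logs/events.log"), true)) {
            out.writeBytes("first event\n");
            out.writeBytes("second event\n");
            out.hflush();   // push buffered packets to the DataNodes
        } // close() flushes the remaining packets and completes the file at the NameNode
        fs.close();
    }
}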
Replication
1. Now, the process of replication starts by creating a pipeline of DataNodes. In our case, we have chosen a replication level of 3, and hence there are 3 DataNodes in the pipeline.
2. The DataStreamer pours packets into the first DataNode in the pipeline.
3. Every DataNode in the pipeline stores each packet it receives and forwards it to the next DataNode in the pipeline.
4. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from the DataNodes.
5. Once acknowledgment for a packet is received from all DataNodes in the pipeline, the packet is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
6. After the client is done writing data, it calls the close() method (Step 9 in the diagram). The call to close() results in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgment.
7. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
MapReduce
• MAPREDUCE is a software framework and programming model used
for processing huge amounts of data.
• MapReduce programs work in two phases, namely, Map and Reduce.
• Map tasks deal with the splitting and mapping of data.
• Reduce tasks shuffle and reduce the data.
MapReduce
•Hadoop is capable of running MapReduce programs written in various
languages: Java, Ruby, Python, and C++.
•MapReduce programs are parallel in nature, thus are very useful for
performing large-scale data analysis using multiple machines in
the cluster.
•The input to each phase is key-value pairs.
•In addition, every programmer needs to specify two functions:
map function and reduce function.
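As a concrete example of these two user-supplied functions, here is a minimal sketch of the classic word-count Mapper and Reducer written against the Hadoop Java MapReduce API. The class names are illustrative; the code is not taken from these slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map function: for every line of input, emit a <word, 1> pair.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);        // key-value output: <word, 1>
            }
        }
    }

    // Reduce function: sum the counts grouped together by the shuffle phase.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);          // final output: <word, frequency>
        }
    }
}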
How MapReduce Works?
•Four phases of execution:
•Phase 1: Splitting
•Phase 2: Mapping
•Phase 3: Shuffling
•Phase 4: Reducing
Working of Map Reduce
Final output of the MapReduce task
The data goes through the following phases:
• Input Splits:
•The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
• Mapping:
•This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values.
•For example, the job of the mapping phase is to count the number of occurrences of each word from the input splits and prepare a list in the form of <word, frequency>.
The data goes through the following phases:
• Shuffling
•This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output.
•In the example, the same words are clubbed together along with their respective frequencies.
• Reducing
•In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
•In the example, this phase aggregates the values from the Shuffling phase, i.e., calculates the total occurrences of each word.
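•A small walk-through with assumed sample data (not from these slides): for the input lines "Deer Bear River" and "Car Car River", mapping emits <Deer,1>, <Bear,1>, <River,1>, <Car,1>, <Car,1>, <River,1>; shuffling groups these into <Bear,[1]>, <Car,[1,1]>, <Deer,[1]>, <River,[1,1]>; and reducing produces the final output <Bear,1>, <Car,2>, <Deer,1>, <River,2>.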
How MapReduce Organizes Work?
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
•The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Jobtracker: acts like a master (responsible for complete execution of the submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing the job
•For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode, and there are multiple Tasktrackers which reside on Datanodes.
How MapReduce Organizes Work?
How MapReduce Organizes Work?
•A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
•It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
•Execution of each individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
•The task tracker's responsibility is to send the progress report to the job tracker.
•In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker so as to notify it of the current state of the system.
•Thus the job tracker keeps track of the overall progress of each job. In the event of task failure, the job tracker can reschedule it on a different task tracker.
Managing Resources and Applications with Hadoop YARN
•YARN stands for “Yet Another Resource Negotiator”.
•It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker, which was present in Hadoop 1.0.
•YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
Hadoop 1.0
Hadoop 2.0
YARN
•YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well as
batch processing to run and process data stored in HDFS
(Hadoop Distributed File System) thus making the system much
more efficient.
• Through its various components, it can dynamically allocate
various resources and schedule the application processing. For large
volume data processing, it is quite necessary to manage the available
resources properly so that every application can leverage them.
YARN Features
• Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend to and manage thousands of nodes and clusters.
• Compatibility: YARN supports the existing map-reduce applications without disruptions, thus making it compatible with Hadoop 1.0 as well.
• Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
• Multi-tenancy: It allows multiple engines to access the cluster, thus giving organizations the benefit of multi-tenancy.
Hadoop YARN
Components of Yarn Architecture
• Client
• Resource Manager
• Scheduler
• Application manager
• Node Manager
• Application Master
• Container
• Client: It submits map-reduce jobs.
Components of Yarn Architecture
• Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:
• Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
• Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if it fails.
Components of Yarn Architecture
• Node Manager: It takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to keep itself up to date with the Resource Manager. It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
Components of Yarn Architecture
• Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application. The Application Master asks the Node Manager to launch the container by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
Components of Yarn Architecture
• Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are launched by means of a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:
Application workflow
1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master un-registers with the Resource Manager
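The driver below is a minimal sketch of the client code that triggers this workflow for the word-count example sketched earlier; the input and output paths are illustrative. Submitting the job hands it to the Resource Manager, which launches an Application Master that then negotiates containers for the map and reduce tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");        // step 1: the client creates a job

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);  // map function from the earlier sketch
        job.setReducerClass(WordCount.IntSumReducer.class);   // reduce function from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/input/logs"));         // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        // Submitting the job sends it to the Resource Manager, which starts an
        // Application Master; waitForCompletion() then polls the job status until
        // the Application Master reports that all map and reduce tasks are done.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}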
HDFS_READ FILE
HDFS_WRITE FILE
References
•https://www.simplilearn.com/tutorials/hadoop-tutorial/hdfs
•HDFS-READ
• https://www.youtube.com/watch?v=Ax7EhEsVVzE
•HDFS-WRITE
• https://www.youtube.com/watch?v=0QJKx4A4L7Y
•https://www.youtube.com/watch?v=nWqdePeOh9M
•https://techvidvan.com/tutorials/how-hadoop-works-internally/
•https://www.guru99.com/learn-hdfs-a-beginners-guide.html
•https://www.guru99.com/introduction-to-mapreduce.html
•https://data-flair.training/blogs/hdfs-data-write-operation/
THANK YOU