INTRODUCTION TO
HADOOP
Giovanna Roda
PRACE Autumn School ’21, 27–28 September 2021
Outline
Schedule
What is Big Data?
The Hadoop distributed computing architecture
MapReduce
HDFS hands-on exercises
MapReduce hands-on
The YARN resource manager
MRjob
Benchmarking I/O with testDFSio
Concluding remarks
What is Big Data?
"Big Data" is the catch-all term for massive amounts of data as well as
for frameworks and R&D initiatives aimed at working with them efficiently.
Image source: erpinnews.com
A short definition of Big Data
A nice definition from this year’s PRACE Summer of HPC presentation
"Convergence of HPC and Big Data".
The three V’s of Big Data
It is customary to define Big Data in terms of three V’s:
Volume (the sheer volume of data)
Velocity (rate of flow of the data and processing speed needs)
Variety (different sources and formats)
The three V’s of Big Data
Data arise from disparate sources and come in many sizes and formats.
Velocity refers to the speed of data generation as well as to processing
speed requirements.
Volume   Velocity         Variety
MB       batch            table
GB       periodic         database
TB       near-real time   multimedia
PB       real time        unstructured
...      ...              ...
Reference: metric prefixes
10^24 = 1 000 000 000 000 000 000 000 000   yotta  Y  septillion
10^21 = 1 000 000 000 000 000 000 000       zetta  Z  sextillion
10^18 = 1 000 000 000 000 000 000           exa    E  quintillion
10^15 = 1 000 000 000 000 000               peta   P  quadrillion
10^12 = 1 000 000 000 000                   tera   T  trillion
10^9  = 1 000 000 000                       giga   G  billion
10^6  = 1 000 000                           mega   M  million
10^3  = 1 000                               kilo   k  thousand
Note: 1 gigabyte (GB) is 10^9 bytes. Sometimes GB is also used to denote
1024^3 = 2^30 bytes, which is actually one gibibyte (GiB).
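As a quick sanity check, the difference between the two conventions can be computed directly (a minimal Python sketch):

# decimal (SI) vs. binary size units
gb = 10**9    # 1 gigabyte (GB)  = 10^9 bytes
gib = 2**30   # 1 gibibyte (GiB) = 1024^3 bytes
print("1 GiB = {} bytes = {:.3f} GB".format(gib, gib / gb))
# 1 GiB = 1073741824 bytes = 1.074 GB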
Structured vs. unstructured data
By structured data one refers to highly organized data that are usually
stored in relational databases or data warehouses. Structured data are easy
to search but inflexible in terms of the three "V"s.
Unstructured data come in mixed formats, usually require pre-processing,
and are difficult to search. Unstructured data are typically stored in NoSQL
databases or in data lakes (scalable storage spaces for raw data of mixed
formats).
Examples of structured/unstructured data
Industry      Structured data           Unstructured data
e-commerce    products & prices         reviews
              customer data             phone transcripts
              transactions              social media mentions
banking       financial transactions    customer communication
              customer data             regulations & compliance
                                        financial news
healthcare    patient data              clinical reports
              medical billing data      radiology imagery
Big Data in 2025
This table [1] shows the projected annual storage and computing needs in
four domains (astronomy, Twitter, YouTube, and genomics).

[1] Stephens ZD et al. "Big Data: Astronomical or Genomical?" In: PLoS Biol (2015).
The three V’s of Big Data: additional dimensions
Three more "V"s to be pondered:
Veracity (quality or trustworthiness of data)
Value (economic value of the data)
Variability (general variability in any of the aforementioned
characteristics)
The challenges of Big Data
Anyone working with large amounts of data will sooner or later be
confronted with one or more of these challenges:
disk and memory space
processing speed
hardware faults
network capacity and speed
the need to optimize resource use
Distributed computing for Big Data
Traditional technologies are inadequate for processing large amounts of
data efficiently.
Distributed computation makes it possible to work with Big Data using
reasonable amounts of time and resources.
Image: VSC-4 ©Matthias Heisler
What is distributed computing?
A distributed computer system
consists of several interconnected
nodes. Nodes can be physical as well
as virtual machines or containers.
When a group of nodes provides
services and applications to the client
as if it were a single machine, then it
is also called a cluster.
Main benefits of distributed computing
Performance: supports intensive workloads by spreading tasks across nodes
Scalability: new nodes can be added to increase capacity
Fault tolerance: resilience in case of hardware failures
The Hadoop distributed computing architecture
Hadoop for distributed data processing
Hadoop is a framework for running jobs on clusters of computers that
provides a good abstraction of the underlying hardware and software.
“Stripped to its core, the tools that Hadoop provides for building distributed
systems—for data storage, data analysis, and coordination—are simple. If
there’s a common theme, it is about raising the level of abstraction—to
create building blocks for programmers who just happen to have lots of
data to store, or lots of data to analyze, or lots of machines to coordinate,
and who don’t have the time, the skill, or the inclination to become
distributed systems experts to build the infrastructure to handle it.” [2]
[2] White T. Hadoop: The Definitive Guide. 4th ed. O'Reilly Media, 2015.
Hadoop: some facts
Hadoop [3] is an open-source project of the Apache Software Foundation.
The project was created to facilitate computations involving massive
amounts of data.
its core components are implemented in Java
initially released in 2006; the last stable version is 3.3.1 from June 2021
originally inspired by Google's MapReduce [4] and the proprietary GFS
(Google File System)
[3] Apache Software Foundation. Hadoop. URL: https://hadoop.apache.org
[4] J. Dean and S. Ghemawat. "MapReduce: Simplified data processing on large clusters." In: Proceedings of Operating Systems Design and Implementation (OSDI), 2004. URL: https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf
Hadoop’s features
Hadoop’s features addressing the challenges of Big Data:
scalability
fault tolerance
high availability
distributed cache / data locality
cost-effectiveness, as it does not need high-end hardware
a good abstraction of the underlying hardware
easy to learn
data can be queried through SQL-like endpoints (Hive, Cassandra)
Mini-glossary of Hadoop’s distinguishing features
fault tolerance: the ability to withstand hardware or network failures
(also: resilience)
high availability: the system minimizes downtimes by eliminating single
points of failure
data locality: tasks are run on the node where the data are located, in
order to reduce the cost of moving data around
The Hadoop core
The core of Hadoop consists of:
Hadoop common, the core libraries
HDFS, the Hadoop Distributed File System
MapReduce
the YARN (Yet Another Resource Negotiator) resource manager
The Hadoop ecosystem
There’s a whole constellation of open source components for collecting,
storing, and processing big data that integrate with Hadoop.
Image source: Cloudera
The Hadoop Distributed File System (HDFS)
HDFS stands for Hadoop Distributed File System and it takes care of
partitioning data across a cluster.
In order to prevent data loss and/or task termination due to hardware
failures HDFS uses either
replication (creating multiple copies —usually 3— of the data)
erasure coding
Data redundancy (obtained through replication or erasure coding) is the
basis of Hadoop’s fault tolerance.
Erasure coding (EC) is a method of data protection in which
data is broken into fragments, expanded and encoded with
redundant data pieces, and stored across a set of different
locations or storage media.
Replication vs. Erasure Coding
In order to provide protection against failures one introduces:
data redundancy
a method to recover the lost data using the redundant data
Replication is the simplest method: it stores n copies of the data. n-fold
replication guarantees the availability of the data for at most n − 1 failures;
the usual 3-fold replication has a storage overhead of 200% (equivalent to a
storage efficiency of 33%).
Erasure coding provides a better storage efficiency (up to 71%) but can be
more costly than replication in terms of performance.
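The trade-off can be made concrete with a small Python sketch. The Reed-Solomon parameters used here (RS(10,4), i.e. 10 data blocks plus 4 parity blocks) are only an illustrative assumption, not a statement about any particular cluster configuration:

def replication(n):
    # n-fold replication: n copies of every block
    return {"overhead_pct": (n - 1) * 100, "efficiency_pct": 100.0 / n}

def erasure_coding(k, m):
    # Reed-Solomon RS(k, m): k data blocks plus m parity blocks
    return {"overhead_pct": 100.0 * m / k, "efficiency_pct": 100.0 * k / (k + m)}

print(replication(3))          # overhead 200%, efficiency ~33%  (the numbers quoted above)
print(erasure_coding(10, 4))   # overhead 40%,  efficiency ~71%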
HDFS architecture
A typical Hadoop cluster installation
consists of:
a NameNode
a secondary NameNode
multiple DataNodes
HDFS architecture: NameNode
NameNode
The NameNode is the main point of
access of a Hadoop cluster. It is
responsible for the bookkeeping of
the data partitioned across the
DataNodes, manages the whole
filesystem metadata, and performs
load balancing
HDFS architecture: Secondary NameNode
Secondary NameNode
Keeps track of changes in the
NameNode by performing regular
snapshots, thus allowing a quick
startup.
An additional standby node is needed
to guarantee high availability (since
the NameNode is a single point of
failure).
HDFS architecture: DataNode
DataNode
Here is where the data is saved and
the computations take place (data
nodes should actually be called "data
and worker nodes")
HDFS architecture: internal data representation
HDFS supports working with very large files.
Internally, data are split into blocks. One of the reasons for splitting data
into blocks is that, this way, all block objects have the same size.
The block size in HDFS can be configured at installation time and it is by
default 128MiB (approximately 134MB).
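For example, the number of blocks (and block replicas) occupied by a file follows directly from the block size; a small Python sketch, assuming a hypothetical 1 GB file and 3-fold replication:

import math

block_size = 128 * 2**20     # default block size: 128 MiB
file_size = 10**9            # a hypothetical 1 GB file
replication = 3              # assumed replication factor

blocks = math.ceil(file_size / block_size)
print(blocks, "blocks,", blocks * replication, "block replicas stored on the cluster")
# 8 blocks, 24 block replicas stored on the cluster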
Note: Hadoop sees data as a bunch of records and it processes multiple
files the same way it does with a single file. So, if the input is a directory
instead of a single file, it will process all files in that directory.
HDFS architecture
DataNode failures
Each DataNode sends a Heartbeat
message to the NameNode
periodically. Whenever a DataNode
becomes unavailable (due to network
or hardware failure), the NameNode
stops sending requests to that node
and creates new replicas of the blocks
stored on that node.
The WORM principle of HDFS
The Hadoop Distributed File System relies on a simple design principle for
data known as Write Once Read Many (WORM).
“A file once created, written, and closed need not be changed except for
appends and truncates. Appending the content to the end of the files is
supported but cannot be updated at arbitrary point. This assumption
simplifies data coherency issues and enables high throughput data access.” [5]
The data immutability paradigm is also discussed in Chapter 2 of "Big
Data" [6].
[5] Apache Software Foundation. Hadoop. URL: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
[6] Warren J. and Marz N. Big Data. Manning Publications, 2015.
MapReduce
The origins of MapReduce
The 2004 paper “MapReduce: Simplified Data Processing on Large
Clusters” by two members of Google’s R&D team, Jeffrey Dean and Sanjay
Ghemawat, is the seminal article on MapReduce.
The article describes the methods used to split, process, and aggregate the
large amounts of data behind the Google search engine.
The open-source version of MapReduce was later released within the
Apache Hadoop project.
MapReduce explained
Image source: Stack Overflow
MapReduce explained
The MapReduce paradigm is inspired by the computing model commonly
used in functional programming.
Applying the same function independently to items in a dataset, either to
transform (map) or collate (reduce) them into new values, works well in a
distributed environment.
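The idea can be illustrated with Python's built-in map() and functools.reduce(), which play the same conceptual roles (a toy sketch, not how Hadoop is implemented):

from functools import reduce

lines = ["hadoop makes big data small", "big data is big"]

# map: transform each item independently (here: line -> number of words)
word_counts = map(lambda line: len(line.split()), lines)

# reduce: collate the partial results into a single value
total = reduce(lambda a, b: a + b, word_counts, 0)
print(total)   # 9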
The phases of MapReduce
The phases of a MapReduce job:
split: data is partitioned across several computer nodes
map: apply a map function to each chunk of data
sort & shuffle: the output of the mappers is sorted and distributed to
the reducers
reduce: finally, a reduce function is applied to the data and an output
is produced
The phases of MapReduce
We have seen that a MapReduce job consists of four phases:
split
map
sort & shuffle
reduce
While splitting, sorting and shuffling are done by the framework, the map
and reduce functions are defined by the user.
It is also possible for the user to interact with the splitting, sorting and
shuffling phases and change their default behavior, for instance by
managing the amount of splitting or defining the sorting comparator. This
will be illustrated in the hands-on exercises.
MapReduce: some notes
Notes
the same map (and reduce) function is applied to all the chunks in the
data
the map and reduce computations can be carried out in parallel
because they’re completely independent from one another.
the split is not the same as the internal partitioning into blocks
MapReduce: shuffling and sorting
The shuffling and sorting phase is often the most costly part of a
MapReduce job.
The mapper takes as input unsorted data and emits key-value pairs. The
purpose of sorting is to provide data that is already grouped by key to the
reducer. This way reducers can start working as soon as a group (identified
by a key) is filled.
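A minimal Python sketch of this idea (purely illustrative, not Hadoop code): once the mapper output is sorted by key, each group of identical keys can be reduced in a single streaming pass.

from itertools import groupby

# key-value pairs as a mapper might emit them (unsorted)
mapped = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1), ("data", 1)]

# sort & shuffle: sorting by key brings identical keys together ...
shuffled = sorted(mapped, key=lambda kv: kv[0])

# ... so the reducer can consume one complete group at a time
for word, pairs in groupby(shuffled, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in pairs))
# big 2
# data 2
# hadoop 1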
HDFS hands-on exercises
Where to find commands listing
For this part of the training you will need to activate the Hadoop module
using the command:
module load Hadoop/2.6.0-cdh5.8.0-native
All commands in this section can be found in the file:
HDFS_commands.txt
Basic HDFS filesystem commands
One can regard HDFS as a regular file system; in fact, many HDFS shell
commands are modeled on the corresponding bash commands.
To run a command on a Hadoop filesystem use the prefix hdfs dfs, for
instance use:

hdfs dfs -mkdir myDir

to create a new directory myDir on HDFS.
Note: One can use hadoop fs and hdfs dfs interchangeably when working on
an HDFS file system. The hadoop fs command is more generic because it can
be used not only on HDFS but also on other file systems that Hadoop
supports (such as the local FS, WebHDFS, S3 FS, and others).
Basic HDFS filesystem commands
Basic HDFS filesystem commands that also exist in bash
hdfs dfs -mkdir create a directory
hdfs dfs -ls list files
hdfs dfs -cp copy files
hdfs dfs -cat print files
hdfs dfs -tail output last part of a file
hdfs dfs -rm remove files
Basic HDFS filesystem commands
Here are three basic commands that are specific to HDFS.

hdfs dfs -put     copy single src, or multiple srcs, from the local file system to the destination file system
hdfs dfs -get     copy files to the local file system
hdfs dfs -usage   get help on hadoop fs
Basic HDFS filesystem commands
To get more help on a specific command use: hdfs dfs -help <command>

$ hdfs dfs -help tail
# -tail [-f] <file> :
#   Show the last 1 KB of the file.
#   -f   Shows appended data as the file grows.
Some things to try
# create a new directory called "input" on HDFS
hdfs dfs -mkdir input
# copy local file wiki_1k_lines to input on HDFS
hdfs dfs -put wiki_1k_lines input/
# list contents of directory ("-h" = human-readable)
hdfs dfs -ls -h input
# disk usage
hdfs dfs -du -h input
# get help on "du" command
hdfs dfs -help du
# remove directory
hdfs dfs -rm -r input
Some things to try
What is the size of the file wiki_1k_lines? What is its disk usage?
# show the size of wiki_1k_lines on the regular filesystem
ls -lh wiki_1k_lines
# show the size of wiki_1k_lines on HDFS
hdfs dfs -put wiki_1k_lines
hdfs dfs -ls -h wiki_1k_lines
# disk usage of wiki_1k_lines on the regular filesystem
du -h wiki_1k_lines
# disk usage of wiki_1k_lines on HDFS
hdfs dfs -du -h wiki_1k_lines
Disk usage on HDFS
The command hdfs dfs -help du will tell you that the output is of the
form:
size disk space consumed filename.
You’ll notice that the space on disk is larger than the file size (38.6MB
versus 19.3MB):
hdfs dfs -du -h wiki_1k_lines
# 19.3 M  38.6 M  wiki_1k_lines

This is due to replication. You can check the block size and replication factor using:

hdfs dfs -stat 'Block size: %o Size: %b Replication: %r' input/wiki_1k_lines
# Block size: 134217728 Size: 20250760 Replication: 2
Disk usage on HDFS
From the previous output:
Block size: 134217728  Size: 20250760  Replication: 2

we can see that the HDFS filesystem is currently using a replication factor
of 2.
Note that the Hadoop block size is defined in terms of mebibytes: 134217728
bytes corresponds to 128 MiB (approximately 134 MB). One MiB is larger than
one MB, since one MiB is 1024^2 = 2^20 bytes, while one MB is 10^6 bytes.
MapReduce hands-on
Where to find commands listing
For this part of the training you will need to activate the Hadoop module
using the command:
module load Hadoop/2.6.0-cdh5.8.0-native
All commands in this section can be found in the file:
MapReduce_commands.txt
MapReduce streaming
The MapReduce streaming library allows one to use any executable as
mapper or reducer. Mappers and reducers simply:
read the input from stdin (line by line)
emit the output to stdout
The documentation for streaming can be found at:
https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html
Locate the streaming library
First of all, we need to locate the streaming library on the system.
# find out where Hadoop is installed (variable $HADOOP_HOME)
echo $HADOOP_HOME
# /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/mapreduce

# find the streaming library
find /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native -name "hadoop-streaming*jar"
# ...
# /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.8.0.jar

# save the library path in the variable $STREAMING
export STREAMING=/opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.8.0.jar
Check input and output
We're going to use the file wiki_1k_lines (later you can experiment with
a larger file).

# make sure the output directory does not exist
hdfs dfs -rm -r output
# copy the file to HDFS
hdfs dfs -put wiki_1k_lines
Note: If you use a directory or file name that doesn’t start with a slash
(‘/‘) then the directory or file is meant to be in your home directory (both
in bash and on HDFS). A path that starts with a slash is called an absolute
path name.
Run a simple MapReduce job
Using the streaming library, we can run the simplest MapReduce job.
# launch MapReduce job
hadoop jar $STREAMING \
  -input wiki_1k_lines \
  -output output \
  -mapper /bin/cat \
  -reducer '/bin/wc -l'
This job uses the cat command as a mapper, which does nothing but echo
its input. The reducer wc -l counts the lines in the given input.
Note how we didn't need to write any code for the mapper and reducer,
because the executables (cat and wc) are already part of any standard
Linux distribution.
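Any executable that reads lines from stdin and writes to stdout can take their place. For instance, a Python stand-in for wc -l might look like this (the file name line_count.py is just an illustrative choice):

#!/bin/python3
# line_count.py -- counts the lines arriving on stdin, like "wc -l"
import sys

count = 0
for _ in sys.stdin:
    count += 1
print(count)

It could then be plugged into the same streaming call with -files line_count.py -reducer line_count.py instead of -reducer '/bin/wc -l'.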
Run a simple MapReduce job
# launch MapReduce job
hadoop jar $STREAMING \
  -input wiki_1k_lines \
  -output output \
  -mapper /bin/cat \
  -reducer '/bin/wc -l'
If the job was successful, the output directory on HDFS (we called it
output) should contain an empty file called _SUCCESS.
The file part-* contains the output of our job.
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000
Run a simple MapReduce job
Launch a MapReduce job with 4 mappers
hdfs dfs -rm -r output
# launch MapReduce job
hadoop jar $STREAMING \
  -D mapreduce.job.maps=4 \
  -input wiki_1k_lines \
  -output output \
  -mapper /bin/cat \
  -reducer '/bin/wc -l'
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000
Run a simple MapReduce job
Note how it is necessary to delete the output directory on HDFS (hdfs
dfs -rm -r output) because according to the WORM principle, Hadoop
will not delete or overwrite existing data!
The -D option right after the jar directive (in this example
-D mapreduce.job.maps=4) allows one to change MapReduce properties at
runtime.
The list of all MapReduce options can be found in: mapred-default.xml
Note: this is the link to the last stable version, there might be some slight
changes with respect to the version that is currently installed on the cluster.
Wordcount
We are now going to run a wordcount job using Python executables as
mapper and reducer.
The mapper will be called mapper.py and the reducer reducer.py. Since
these executables are not known to Hadoop, it is necessary to add them
with the options
-files mapper.py -files reducer.py
Note: it is possible to have several mappers and reducers in one MapReduce
job; the output of each function is sent as input to the next one.
Define the mapper
#!/bin/python3
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("{}\t{}".format(word, 1))
Listing 1: mapper.py
Define the reducer
#!/bin/python3
import sys

# input arrives sorted by key, so all counts for a word are consecutive
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # a new key starts: emit the total for the previous word
            print("{}\t{}".format(current_word, current_count))
        current_count = count
        current_word = word
# emit the total for the last word
if current_word == word:
    print("{}\t{}".format(current_word, current_count))
Listing 2: reducer.py
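Before submitting the job, the two scripts can be tested locally by emulating the map / sort & shuffle / reduce pipeline without Hadoop; a minimal sketch (assuming mapper.py, reducer.py, and wiki_1k_lines are in the current directory):

import subprocess

# emulate: cat wiki_1k_lines | mapper.py | sort | reducer.py
with open("wiki_1k_lines", "rb") as f:
    mapped = subprocess.run(["python3", "mapper.py"], stdin=f,
                            capture_output=True, check=True).stdout

# the "sort & shuffle" phase: group identical keys together
shuffled = b"\n".join(sorted(mapped.splitlines())) + b"\n"

reduced = subprocess.run(["python3", "reducer.py"], input=shuffled,
                         capture_output=True, check=True).stdout
print(reduced.decode()[:300])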
Run the job
# upload file to HDFS
hdfs dfs -put data/wiki_1k_lines
# remove output directory
hdfs dfs -rm -r output

hadoop jar $STREAMING \
  -files mapper.py \
  -files reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input wiki_1k_lines \
  -output output
Check results.
# check if job was successful (output should contain a file named _SUCCESS)
hdfs dfs -ls output
# check result
hdfs dfs -cat output/part-00000 | head
Sorting the output after the job
The reducer just writes the list of words and their frequencies in the order
given by the mapper.
The output of the reducer is sorted by key (the word) because that is the
ordering the reducer receives from the mapper. If we're interested in
sorting the data by frequency, we can use the Unix sort command with the
options k2, n, r, meaning respectively "by field 2", "numeric", "reverse".

hdfs dfs -cat output/part-00000 | sort -k2nr | head
The output should be something like:
the 193778
of 117170
and 89966
in 69186
. . .
Sorting with MapReduce
To sort by frequency using the MapReduce framework, we can employ a
simple trick: create a mapper that interchanges words with their frequency
values. Since the framework sorts the mappers' output by key, we get the
desired sorting as a side effect.
Create a script swap_keyval.py
#!/bin/python3
import sys

# swap key and value: emit "count<TAB>word" for words occurring more than 100 times
for line in sys.stdin:
    word, count = line.strip().split('\t')
    if int(count) > 100:
        print("{}\t{}".format(count, word))
Listing 3: swap_keyval.py
Sorting with MapReduce
Run the new MapReduce job using output as input and writing results to
a new directory output2.
# write the output to the directory output2
hdfs dfs -rm -r output2

hadoop jar $STREAMING \
  -files swap_keyval.py \
  -input output \
  -output output2 \
  -mapper swap_keyval.py
Looking at the output, one can see that it is sorted by frequency, but
alphabetically (as strings) rather than numerically.

hdfs dfs -cat output2/part-00000 | head
# 10021 his
# 1005 per
# 101 merely
# ...
Using comparator classes for sorting
In general, we can determine how mappers are going to sort their output by
configuring the comparator directive to use the special class
KeyFieldBasedComparator:
-D mapreduce.job.output.key.comparator.class=\
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
This class has some options similar to the Unix sort (-n to sort numerically,
-r for reverse sorting, -k pos1[,pos2] for specifying fields to sort by).
See documentation: KeyFieldBasedComparator.html
Using comparator classes for sorting
hdfs dfs -rm -r output2

comparator_class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

hadoop jar $STREAMING \
  -D mapreduce.job.output.key.comparator.class=$comparator_class \
  -D mapreduce.partition.keycomparator.options=-nr \
  -files swap_keyval.py \
  -input output \
  -output output2 \
  -mapper swap_keyval.py
Using comparator classes for sorting
Now MapReduce has performed the desired sorting on the data.
hdfs dfs -cat output2/part-00000 | head
# 193778 the
# 117170 of
# 89966 and
# 69186 in
# ...
Modify the Wordcount example
Try to modify the wordcount example:
using executables in other programming languages
adding a mapper that filters certain words
using larger files
Run the MapReduce examples
The MapReduce distribution comes with some standard examples including
source code.
To get a list of all available examples use:
hadoop jar \
  $HADOOP_HOME/hadoop-mapreduce-examples-2.6.0-cdh5.8.0.jar

Run the Wordcount example:

hadoop jar \
  $HADOOP_HOME/hadoop-mapreduce-examples-2.6.0-cdh5.8.0.jar \
  wordcount wiki_1k_lines output3
The YARN resource manager
YARN: Yet Another Resource Negotiator
Hadoop jobs are usually managed by YARN (an acronym for Yet Another
Resource Negotiator), which is responsible for allocating resources and
managing job scheduling. Basic resource types are:
memory (memory-mb)
virtual cores (vcores)
YARN supports an extensible resource model that allows one to define any
countable resource. A countable resource is a resource that is consumed
while a container is running but is released afterwards. Such a resource
can be, for instance:
GPU (gpu)
YARN architecture
Image source: Apache Software Foundation
YARN architecture
Each job submitted to YARN is assigned:
a container: an abstract entity which incorporates resources such as
memory, CPU, disk, and network. Container resources are allocated by
YARN's Scheduler.
an ApplicationMaster service, assigned by the Applications Manager,
which monitors the progress of the job and restarts tasks if needed
YARN architecture
The main idea of YARN is to have two distinct daemons for job monitoring
and scheduling, one global and one local for each application:
the Resource Manager is the global job manager, consisting of:
  Scheduler: allocates resources across all applications
  Applications Manager: accepts job submissions, restarts
  Application Masters on failure
an Application Master is the local application manager, responsible
for negotiating resources, monitoring the status of the job, and restarting
failed tasks
Dynamic resource pools
Sharing computing resources fairly can be a big issue in multi-user
environments.
YARN supports dynamic resource pools for scheduling applications.
A resource pool is a given configuration of resources to which a group of
users is granted access. Whenever a group is not active, the resources are
preempted and granted to other groups.
Groups are assigned a priority and resources are shared among groups
according to these priority values.
Additionally, resource configurations can be scheduled for specific intervals
of time.
YARN on SLURM
When running YARN on top of SLURM, it is not clear how to take
advantage of the flexibility of YARN’s dynamic resource pools to optimize
resource utilization.
How to leverage the best characteristics of the job schedulers from both Big
Data and HPC architectures in order to decrease latency is a subject of
active study.
MRjob
MRjob
What is MRjob? It's a wrapper for MapReduce that allows one to write
MapReduce jobs in pure Python.
The library can be used for testing MapReduce as well as Spark jobs
without the need for a Hadoop cluster.
Here’s a quick-start tutorial:
https://mrjob.readthedocs.io/en/latest/index.html
A MRjob wordcount
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    """
    A class to represent a word frequency count MapReduce job
    """

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Listing 4: word_count.py
A MRjob wordcount
Install mrjob in a virtual environment:
# install mrjob
python3 -m venv mypython
mypython/bin/pip install ipython mrjob

Run a basic mrjob wordcount:

mypython/bin/python3 word_count.py data/wiki_1k_lines
Benchmarking I/O with testDFSio
TestDFSio
TestDFSIO is a tool for measuring the performance of read and write
operations on HDFS; it can be used to benchmark or stress-test a Hadoop
cluster.
TestDFSIO uses MapReduce to write files to the HDFS filesystem, spawning
one mapper per file; the reducer is used to collect and summarize the test
data.
Find library location
Find the location of the TestDFSIO library and save it in the variable
testDFSiojar:

find /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native \
  -name "hadoop-mapreduce-client-jobclient*tests.jar"
# /opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.8.0-tests.jar

export testDFSiojar=/opt/apps/software/Hadoop/2.6.0-cdh5.8.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.8.0-tests.jar
Options
Main options:
-write to run write tests
-read to run read tests
-nrFiles the number of files (set to be equal to the number of
mappers)
-fileSize size of files (followed by B|KB|MB|GB|TB)
TestDFSIO generates exactly 1 map task per file, so it is a 1:1 mapping
from files to map tasks.
Specify custom i/o directory
By default TestDFSIO reads and writes under the HDFS directory
/benchmarks, so you need read/write access to /benchmarks/TestDFSIO on
HDFS; to avoid permission problems it is therefore recommended to run the
tests as the hdfs user.
In case you want to run the tests as a user who has no write permissions on
the HDFS root folder /, you can specify an alternative directory with the
-D option, assigning a new value to test.build.data.
Running a write test
We are going to run a test with nrFiles files, each of size fileSize,
using a custom output directory.
export myDir=/user/${USER}/benchmarks
export nrFiles=2
export fileSize=10MB
cd ~
hadoop jar $testDFSiojar TestDFSIO -D test.build.data=$myDir \
  -write -nrFiles $nrFiles -fileSize $fileSize
Running a read test
We are going to run a test with nrFiles files, each of size fileSize,
using a custom output directory.
export myDir=/user/${USER}/benchmarks
export nrFiles=2
export fileSize=10MB
cd ~
hadoop jar $testDFSiojar TestDFSIO -D test.build.data=$myDir \
  -read -nrFiles $nrFiles -fileSize $fileSize
A sample test output
19/07/25 15:44:26 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
19/07/25 15:44:26 INFO fs.TestDFSIO:            Date & time: Thu Jul 25 15:44:26 CEST 2019
19/07/25 15:44:26 INFO fs.TestDFSIO:        Number of files: 20
19/07/25 15:44:26 INFO fs.TestDFSIO: Total MBytes processed: 204800.0
19/07/25 15:44:26 INFO fs.TestDFSIO:      Throughput mb/sec: 39.38465325447428
19/07/25 15:44:26 INFO fs.TestDFSIO: Average IO rate mb/sec: 39.59946060180664
19/07/25 15:44:26 INFO fs.TestDFSIO:  IO rate std deviation: 3.0182194679812717
19/07/25 15:44:26 INFO fs.TestDFSIO:     Test exec time sec: 292.66
How to interpret the results
The main measurements produced by the HDFSio test are:
throughput in MB/sec
average IO rate in MB/sec
standard deviation of the IO rate
test execution time
All test results are logged by default to the file TestDFSIO_results.log.
The log file can be changed with the option -resFile resultFileName.
Advanced test configuration
In addition to the default sequential file access, the mapper class for
reading data can be configured to perform various types of random reads:
random read always chooses a random position to read from
(skipSize = 0)
backward read reads file in reverse order (skipSize < 0)
skip-read skips skipSize bytes after every read (skipSize > 0)
The -compression option allows one to specify a codec for the input and
output data.
What is a codec
Codec is a portmanteau of coder and decoder and it designates any
hardware or software device that is used to encode—most commonly also
reducing the original size—and decode information. Hadoop provides
classes that encapsulate compression and decompression algorithms.
These are all currently available Hadoop compression codecs:
Compression format Hadoop CompressionCodec
DEFLATE org.apache.hadoop.io.compress.DefaultCodec
gzip org.apache.hadoop.io.compress.GzipCodec
bzip2 org.apache.hadoop.io.compress.BZip2Codec
LZO com.hadoop.compression.lzo.LzopCodec
LZ4 org.apache.hadoop.io.compress.Lz4Codec
Snappy org.apache.hadoop.io.compress.SnappyCodec
Concurrent versus overall throughput
Throughput, or data transfer rate, measures the amount of data read from or
written to the filesystem per unit of time (expressed in megabytes per
second, MB/s).
Throughput is one of the main performance measures used by disk
manufacturers, since how fast data can be moved around on a disk is an
important factor to look at.
The listed throughput shows the average throughput among all the map
tasks. To get an approximate overall throughput on the cluster you can
divide the total MBytes by the test execution time in seconds.
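For instance, with the figures from the sample output shown earlier, a rough calculation (Python) gives:

total_mb = 204800.0    # Total MBytes processed
exec_time = 292.66     # Test exec time sec

print("overall throughput: {:.1f} MB/s".format(total_mb / exec_time))
# overall throughput: 699.8 MB/s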
Clean up
When done with the tests, clean up the temporary files generated by
testDFSio.
export myDir=/user/${USER}/benchmarks
hadoop jar $testDFSiojar TestDFSIO \
  -D test.build.data=$myDir -clean
Concluding remarks
Big Data on VSC course
As part of the Vienna Scientific Cluster (VSC) training program, we offer a
course "Big Data on VSC".
The first two editions ran in January and March of this year, and the next
edition will take place next spring.
Our Hadoop expertise comes from managing LBD (Little Big Data*), a Big
Data cluster at the Vienna University of Technology. The cluster has been
running since 2017 and is used for teaching and research.
(*) https://lbd.zserv.tuwien.ac.at/
Thanks
Thanks to:
Janez Povh and Leon Kos, for inviting me once again to hold this
training on Hadoop.
Dieter Kvasnicka, my colleague and co-trainer for Big Data on VSC, for
his constant support.