
Unit 3

Big Data Analytics


Faculty: Dr. Vandana Bhatia
CONTENTS
➢ Introduction to analysing data with Hadoop
➢ Data format
➢ Scaling out
➢ Hadoop streaming
➢ Hadoop pipes
➢ Hadoop distributed file system (HDFS)
➢ Data flow
➢ Hadoop I/O: data integrity, compression, serialization
➢ Avro file-based data structures
➢ Map Reduce workflows
➢ Unit tests with MRUnit
➢ Test data and local tests
➢ Anatomy of a classic Map-Reduce job run
➢ YARN
➢ Failures in classic Map-Reduce and YARN
➢ Job scheduling, shuffle and sort, task execution
Data Formats in Hadoop
1. Text/CSV Files
2. JSON Records
3. Avro Files
4. Sequence Files
5. RC Files
6. ORC Files
7. Parquet Files
Analyzing Data with Hadoop
❑ While the MapReduce programming model is at the heart of
Hadoop, it is low-level, and as such it can be an unproductive
way for developers to write complex analysis jobs.
❑ To increase developer productivity, several higher-level
languages and APIs have been created that abstract away the
low-level details of the MapReduce programming model.
❑ There are several choices available for writing data analysis
jobs.
❑ The Hive and Pig projects are popular choices that provide
SQL-like and procedural data flow-like languages, respectively.
❑ HBase is also a popular way to store and analyze data in HDFS.
It is a column-oriented database, and unlike MapReduce,
provides random read and write access to data with low latency.
Other Analytical Tools
Apache Spark
Apache Impala
Apache Mahout
Apache Storm
Apache Sqoop
Apache Flume
Apache Kafka
Tableau
Hadoop Streaming
• Hadoop streaming is a utility that comes with the Hadoop
distribution.
• This utility allows you to create and run Map/Reduce jobs with
any executable or script as the mapper and/or the reducer.
Hadoop Streaming Commands

Option – Description
-input directory_name or filename – Input location for the mapper.
-output directory_name – Output location for the reducer.
-mapper executable or JavaClassName – The command to be run as the mapper.
-reducer executable or script or JavaClassName – The command to be run as the reducer.
-file file-name – Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName – By default, TextInputFormat is used to return key-value pairs of the Text class. We can specify our own class, but it should also return a key-value pair.
-outputformat JavaClassName – By default, TextOutputFormat is used to take key-value pairs of the Text class. We can specify our own class, but it should also take a key-value pair.
-partitioner JavaClassName – The class that determines which reducer a key is sent to.
-combiner streamingCommand or JavaClassName – The combiner executable for map output.
-verbose – Verbose output.
-numReduceTasks – Specifies the number of reducers.
-mapdebug – Script to call when a map task fails.
-reducedebug – Script to call when a reduce task fails.


Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to
Hadoop MapReduce.
• Unlike Streaming, which uses standard input and
output to communicate with the map and reduce code,
Pipes uses sockets as the channel over which the
tasktracker communicates with the process running
the C++ map or reduce function.
• JNI is not used.
Hadoop I/O
❑ Hadoop comes with a set of primitives for data I/O.
❑ Some of these are general techniques that deserve special consideration when dealing with multiterabyte datasets:
➢ Data integrity
➢ Compression
❑ Others are tools or APIs that form the building blocks for developing distributed systems:
➢ Serialization frameworks
➢ On-disk data structures
Data Integrity
❑It is possible that a block of data fetched from a DataNode arrives corrupted.
❑This corruption can occur because of faults in a storage device, network faults, or
buggy software.
❑The HDFS client software implements checksum checking on the contents of HDFS
files.
❑When a client creates an HDFS file, it computes a checksum of each block of the file
and stores these checksums in a separate hidden file in the same HDFS namespace.
❑When a client retrieves file contents it verifies that the data it received from each
DataNode matches the checksum stored in the associated checksum file.
❑If not, then the client can opt to retrieve that block from another DataNode that has a
replica of that block.
Data Integrity

• Data integrity in Hadoop is achieved by maintaining a checksum of the data written to each block.
• Whenever data is written to HDFS blocks, HDFS calculates a checksum for all data written and verifies the checksum when that data is read back. A separate checksum is created for every dfs.bytes.per.checksum bytes of data; the default value of this property is 512 bytes, and each checksum is 4 bytes long.
• All DataNodes are responsible for checking the checksums of their data, and clients also verify checksums when they read data from DataNodes. In addition, each DataNode periodically runs a DataBlockScanner to verify its blocks. If corrupt data is found, HDFS replaces the corrupt block with a replica of the correct data (see the sketch below).
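As a concrete illustration (a minimal sketch, not taken from these slides), the snippet below reads an HDFS file with client-side checksum verification on, and then again with verification switched off, as a salvage tool might; the file path is hypothetical.

// Minimal sketch: reading an HDFS file with and without client-side checksum
// verification. The path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.bytes.per.checksum controls how many bytes each checksum covers (default 512).
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical path

        // Normal read: the client verifies checksums and reports corrupt replicas,
        // so HDFS can re-replicate the block from a healthy copy.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // A tool that wants to salvage a corrupt file can disable verification.
        fs.setVerifyChecksum(false);
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}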
Compression
You can compress data in Hadoop MapReduce at various stages.
1. Compressing input files – You can compress the input files, which reduces storage space in HDFS. Compressed input files are decompressed automatically when they are processed by a MapReduce job; the appropriate codec is determined from the file name extension. For example, if the extension is .snappy, the Hadoop framework will automatically use SnappyCodec to decompress the file.
2. Compressing the map output – You can compress the intermediate map output. Map output is written to disk, and data from several map outputs is transferred across the network to the nodes where the reduce tasks run, so compressing the intermediate map output reduces both disk usage and network transfer.
3. Compressing output files – You can also compress the final output of a MapReduce job (a driver configuration sketch follows below).
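The sketch below shows how the three stages can be enabled from a MapReduce driver. It assumes Hadoop 2.x property names and that the Snappy and Gzip codecs are available on the cluster; the job itself is hypothetical and has no mapper or reducer set.

// Minimal driver-configuration sketch for the three compression stages.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // 2. Compress the intermediate map output (less disk I/O and shuffle traffic).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-demo");

        // 3. Compress the final job output.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // 1. Compressed input files (e.g. *.gz, *.snappy) need no extra configuration:
        //    the framework picks the codec from the file name extension.
        return job;
    }
}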
Compression Format in Hadoop
1) GZIP
• Provides a high compression ratio.
• Uses high CPU resources to compress and decompress data.
• Good choice for cold data which is infrequently accessed.
• Compressed data is not splittable and hence not suitable for MapReduce jobs.
2) BZIP2
• Provides a high compression ratio (even higher than GZIP).
• Takes a long time to compress and decompress data.
• Good choice for cold data which is infrequently accessed.
• Compressed data is splittable.
• Even though the compressed data is splittable, it is generally not suited for MR jobs because of the high compression/decompression time.
3) LZO
• Provides a low compression ratio.
• Very fast at compressing and decompressing data.
• Compressed data is splittable if an appropriate indexing algorithm is used.
• Best suited for MR jobs because it is both fast and splittable.
4) SNAPPY
• Provides an average compression ratio.
• Aimed at very fast compression and decompression.
• Compressed data is not splittable if used with a plain file such as .txt.
• Generally used to compress container file formats like Avro and SequenceFile, because such container files compress data per block and therefore remain splittable.
Codecs in Hadoop
❑ Codec, short for compressor-decompressor, is the implementation of a compression-decompression algorithm.
❑ In the Hadoop framework there are different codec classes for different compression formats; you use the codec class that matches the compression format you are working with (see the sketch after this list).
❑ The codec classes in Hadoop are as follows:
▪ Deflate – org.apache.hadoop.io.compress.DefaultCodec or org.apache.hadoop.io.compress.DeflateCodec
(DeflateCodec is an alias for DefaultCodec). This codec uses zlib compression.
▪ Gzip – org.apache.hadoop.io.compress.GzipCodec
▪ Bzip2 – org.apache.hadoop.io.compress.BZip2Codec
▪ Snappy – org.apache.hadoop.io.compress.SnappyCodec
▪ LZO – com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec
The LZO libraries are GPL licensed and do not come with the Hadoop release; the Hadoop codec for LZO has to be downloaded separately.
▪ LZ4 – org.apache.hadoop.io.compress.Lz4Codec
▪ Zstandard – org.apache.hadoop.io.compress.ZStandardCodec
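In practice you rarely hard-code a codec class: CompressionCodecFactory picks the codec from the file name extension. The sketch below assumes a hypothetical gzip-compressed input file.

// Minimal sketch: pick the codec from the file name extension and decompress the stream.
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/data/logs/events.gz");   // hypothetical file

        // The factory maps .gz -> GzipCodec, .bz2 -> BZip2Codec, .snappy -> SnappyCodec, etc.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);
        if (codec == null) {
            System.err.println("No codec found for " + input + "; read it as plain text instead.");
            return;
        }

        try (InputStream in = codec.createInputStream(fs.open(input))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}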
Data serialization
• Data serialization is the process of translating data structures into a stream of bytes; deserialization converts that stream back to the original form.
• The serialized stream can be transmitted over the network or stored in a database regardless of the system architecture.
• Isn't simply storing information in binary form, or as a stream of bytes, already the right approach? Serialization does the same, but it is not dependent on the machine architecture.
Serialization
Data is serialized for two objectives −
➢For persistent storage
Persistent Storage is a digital storage facility that does not
lose its data with the loss of power supply. Files, folders,
databases are the examples of persistent storage.
➢ To transport the data over the network (inter-process communication)
• To establish inter-process communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used.
• RPC used internal serialization to convert the message into
binary format before sending it to the remote node via
network. At the other end the remote system deserializes
the binary stream into the original message.
Serialization
• The RPC serialization format is required to be as follows −
• Compact − To make the best use of network bandwidth, which is
the most scarce resource in a data center.
• Fast − Since the communication between the nodes is crucial in
distributed systems, the serialization and deserialization process
should be quick, producing less overhead.
• Extensible − Protocols change over time to meet new
requirements, so it should be straightforward to evolve the protocol
in a controlled manner for clients and servers.
• Interoperable − The message format should support the nodes that
are written in different languages.
Class hierarchy of Hadoop serialization
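Hadoop's own serialization types are built around the Writable interface. As a minimal sketch (the class and its fields are hypothetical, not from these slides), a custom type only has to implement write() and readFields():

// Hypothetical custom Writable illustrating Hadoop's serialization interface.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {
    private long timestamp;
    private int count;

    public PageViewWritable() { }                  // no-arg constructor required by the framework

    public PageViewWritable(long timestamp, int count) {
        this.timestamp = timestamp;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeLong(timestamp);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize in the same order
        timestamp = in.readLong();
        count = in.readInt();
    }
}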
Avro file format
➢ Avro format is a row-based storage format for Hadoop, which is widely used as a serialization platform.
➢ Avro format stores the schema in JSON format, making it easy to read and interpret by any program.
➢ The data itself is stored in a binary format, making Avro files compact and efficient.
➢ Avro format is a language-neutral data serialization system. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
➢ A key feature of the Avro format is its robust support for data schemas that change over time, i.e., schema evolution. Avro handles schema changes like missing fields, added fields, and changed fields.
➢ Avro format provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
Avro file format
• What will happen to past data when there is a change in schema? In AVRO, schema changes like missing fields, changed/modified fields, and added/new fields are easy to maintain: AVRO can read the data with the new schema regardless of when the data was generated (see the sketch below).
• Is there any problem when trying to convert AVRO data to other formats, if needed? The Hadoop ecosystem has built-in support for this when using a hybrid data system; for example, Apache Spark can convert Avro files to Parquet.
• Is it possible to manage MapReduce jobs? MapReduce jobs can be handled efficiently by using Avro file processing in MapReduce itself.
• What does the connection handshake do? The connection handshake helps to exchange schemas between client and server during RPC.
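To make schema evolution concrete, here is a minimal sketch (the User schemas, field names, and default value are hypothetical): a record written with an old writer schema is read back with a newer reader schema that adds a field with a default.

// Minimal Avro schema-evolution sketch with hypothetical schemas.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroSchemaEvolutionExample {
    static final String WRITER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}";

    static final String READER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}";   // new field with a default

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER_SCHEMA);
        Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA);

        // Serialize a record with the old (writer) schema.
        GenericRecord user = new GenericData.Record(writerSchema);
        user.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
        encoder.flush();

        // Deserialize with the new (reader) schema: the missing "age" gets its default value.
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
            .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(decoded);   // prints the name plus age = -1
    }
}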
Avro file structure
How is AVRO different from other systems?
• AVRO has a dynamic schema and stores serialized values in a binary format in a space-efficient way.
• The JSON schema and the data are stored together in the file.
• No compilation (code generation) is required; the data can be read directly by using the parser library.
• AVRO is a language-neutral system.
How to implement Avro?
• Have the schema of the data format ready.
• Read the schema in the program, either by compiling it with Avro (generating a class) or by reading it directly via the parser library.
• Achieve serialization through the serialization API (Java):
  - By generated class – the schema is fed to the Avro utility and processed into a Java class file; then write API methods to serialize the data.
  - By parser library – the schema is fed to the Avro utility and the parser library is used to serialize the data (see the sketch below).
• Achieve deserialization through the deserialization API (Java):
  - By generated class – deserialize the object and instantiate the DataFileReader class.
  - By parser library – instantiate the parser class.
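A minimal sketch of the parser-library route described above (the Employee schema and file name are hypothetical): parse the schema at runtime, write records into an Avro container file with DataFileWriter, and read them back with DataFileReader.

// Parser-library serialization and deserialization without code generation.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroParserLibraryExample {
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);   // no code generation needed
        File file = new File("employees.avro");                   // hypothetical output file

        // Serialization: the schema is embedded in the container file's header.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1);
            rec.put("name", "alice");
            writer.append(rec);
        }

        // Deserialization: instantiate DataFileReader and iterate over the records.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}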
Creating schema in Avro

• There are three ways to define and create JSON schemas of AVRO
1. JSON String
2. The JSON object
3. JSON Array
• There are two kinds of data types that need to be mentioned in a schema:
  - Primitive (common data types)
  - Complex (when collective data elements need to be processed and stored)
• Primitive data types include null, boolean, int, long, bytes, float, string, and double. In contrast, complex types include Records (attribute encapsulation), Enums, Arrays, Maps (associativity), Unions (multiple types for one field), and Fixed (deals with the size of data).
• Data types help to maintain sort order.
• Logical types (built on top of the primitive and complex types) are used to store values such as time, date, timestamp, and decimal.
YARN

• YARN stands for “Yet Another Resource Negotiator“.


• It was introduced in Hadoop 2.0 to remove the bottleneck of the JobTracker, which was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
• YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well as batch
processing to run and process data stored in HDFS (Hadoop Distributed
File System) thus making the system much more efficient.
YARN Features

Application Workflow of YARN
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Managers to launch the containers.
6. Application code is executed in the containers.
7. The client contacts the Resource Manager/Application Master to monitor the application's status (see the sketch below).
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
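As a minimal sketch of step 7 (assuming a yarn-site.xml pointing at the ResourceManager is on the classpath; no specific application is tracked), a client can ask the ResourceManager for application reports via the YarnClient API:

// List the applications known to the ResourceManager and print their state.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatusExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads the ResourceManager address
        yarnClient.start();

        // Ask the ResourceManager for application reports (step 7 of the workflow).
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s %s state=%s progress=%.0f%%%n",
                app.getApplicationId(), app.getName(),
                app.getYarnApplicationState(), app.getProgress() * 100);
        }

        yarnClient.stop();
    }
}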
