
Unit 3

Big Data Analytics


Faculty: Dr. Vandana Bhatia
CONTENTS
➢ Introduction to analysing data with Hadoop
➢ Data format
➢ Scaling out
➢ Hadoop streaming
➢ Hadoop pipes
➢ Hadoop distributed file system (HDFS)
➢ Data flow
➢ Hadoop I/O: data integrity, compression, serialization
➢ Avro file-based data structures
➢ Map Reduce workflows
➢ Unit tests with MRUnit
➢ Test data and local tests
➢ Anatomy of a classic Map-Reduce job run
➢ YARN
➢ Failures in classic Map-Reduce and YARN
➢ Job scheduling, shuffle and sort, task execution
Data Formats in Hadoop
1. Text/CSV Files
2. JSON Records
3. Avro Files
4. Sequence Files
5. RC Files
6. ORC Files
7. Parquet Files
Analyzing Data with Hadoop
❑ While the MapReduce programming model is at the heart of
Hadoop, it is low-level, and as such it can be an unproductive
way for developers to write complex analysis jobs.
❑ To increase developer productivity, several higher-level
languages and APIs have been created that abstract away the
low-level details of the MapReduce programming model.
❑ There are several choices available for writing data analysis
jobs.
❑ The Hive and Pig projects are popular choices that provide
SQL-like and procedural data flow-like languages, respectively.
❑ HBase is also a popular way to store and analyze data in HDFS.
It is a column-oriented database, and unlike MapReduce,
provides random read and write access to data with low latency.
Other Analytical Tools
Apache Spark
Apache Impala
Apache Mahout
Apache Storm
Apache Sqoop
Apache Flume
Apache Kafka
Tableau
Hadoop Streaming
• Hadoop streaming is a utility that comes with the Hadoop
distribution.
• This utility allows you to create and run Map/Reduce jobs with
any executable or script as the mapper and/or the reducer.
Hadoop Streaming Commands

Option – Description
-input directory_name or filename – Input location for the mapper.
-output directory_name – Output location for the reducer.
-mapper executable or JavaClassName – The command to be run as the mapper.
-reducer executable or script or JavaClassName – The command to be run as the reducer.
-file file-name – Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName – By default, TextInputFormat is used to return key-value pairs of the Text class. We can specify our own class, but it should also return a key-value pair.
-outputformat JavaClassName – By default, TextOutputFormat is used to take key-value pairs of the Text class. We can specify our own class, but it should also take a key-value pair.
-partitioner JavaClassName – The class that determines which reducer a key is sent to.
-combiner streamingCommand or JavaClassName – The combiner executable for map output.
-verbose – Verbose output.
-numReduceTasks – Specifies the number of reducers.
-mapdebug – Script to call when a map task fails.
-reducedebug – Script to call when a reduce task fails.


Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to
Hadoop MapReduce.
• Unlike Streaming, which uses standard input and
output to communicate with the map and reduce code,
Pipes uses sockets as the channel over which the
tasktracker communicates with the process running
the C++ map or reduce function.
• JNI is not used.
Hadoop I/O
❑ Hadoop comes with a set of primitives for data I/O.
❑ Some of these are general techniques that deserve special consideration when dealing with multiterabyte datasets:
➢ Data integrity
➢ Compression
❑ Others are tools or APIs that form the building blocks for developing distributed systems:
➢ Serialization frameworks
➢ On-disk data structures
Data Integrity
❑It is possible that a block of data fetched from a DataNode arrives corrupted.
❑This corruption can occur because of faults in a storage device, network faults, or
buggy software.
❑The HDFS client software implements checksum checking on the contents of HDFS
files.
❑When a client creates an HDFS file, it computes a checksum of each block of the file
and stores these checksums in a separate hidden file in the same HDFS namespace.
❑When a client retrieves file contents it verifies that the data it received from each
DataNode matches the checksum stored in the associated checksum file.
❑If not, then the client can opt to retrieve that block from another DataNode that has a
replica of that block.
Data Integrity

• Data integrity in Hadoop is achieved by maintaining a checksum of the data written to each block.
• Whenever data is written to HDFS blocks, HDFS calculates a checksum for all data written and verifies the checksum when that data is read back. A separate checksum is created for every dfs.bytes.per.checksum bytes of data; the default value of this property is 512 bytes, and each checksum is 4 bytes long.
• All DataNodes are responsible for checking the checksums of their data, and clients also verify checksums when they read data from DataNodes. In addition, each DataNode periodically runs a DataBlockScanner to verify its blocks. If corrupt data is found, HDFS replaces the corrupt block with a replica of the correct data (see the sketch below).
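As a concrete illustration (a minimal sketch, not taken from these slides), the snippet below reads an HDFS file with client-side checksum verification on, and then again with verification switched off, as a salvage tool might; the file path is hypothetical.

// Minimal sketch: reading an HDFS file with and without client-side checksum
// verification. The path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.bytes.per.checksum controls how many bytes each checksum covers (default 512).
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical path

        // Normal read: the client verifies checksums and reports corrupt replicas,
        // so HDFS can re-replicate the block from a healthy copy.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // A tool that wants to salvage a corrupt file can disable verification.
        fs.setVerifyChecksum(false);
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}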
Compression
You can compress data in Hadoop MapReduce at various stages.
1. Compressing input files – You can compress the input files, which reduces storage space in HDFS. Compressed input files are decompressed automatically when they are processed by a MapReduce job; the appropriate codec is determined from the file name extension. For example, if the extension is .snappy, the Hadoop framework will automatically use SnappyCodec to decompress the file.
2. Compressing the map output – You can compress the intermediate map output. Map output is written to disk, and data from several map outputs is transferred across the network to the nodes where the reduce tasks run, so compressing the intermediate map output reduces both disk usage and network transfer.
3. Compressing output files – You can also compress the final output of a MapReduce job (a driver configuration sketch follows below).
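The sketch below shows how the three stages can be enabled from a MapReduce driver. It assumes Hadoop 2.x property names and that the Snappy and Gzip codecs are available on the cluster; the job itself is hypothetical and has no mapper or reducer set.

// Minimal driver-configuration sketch for the three compression stages.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // 2. Compress the intermediate map output (less disk I/O and shuffle traffic).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-demo");

        // 3. Compress the final job output.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // 1. Compressed input files (e.g. *.gz, *.snappy) need no extra configuration:
        //    the framework picks the codec from the file name extension.
        return job;
    }
}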
Compression Format in Hadoop
1) GZIP
• Provides a high compression ratio.
• Uses high CPU resources to compress and decompress data.
• Good choice for cold data which is infrequently accessed.
• Compressed data is not splittable and hence not suitable for MapReduce jobs.
2) BZIP2
• Provides a high compression ratio (even higher than GZIP).
• Takes a long time to compress and decompress data.
• Good choice for cold data which is infrequently accessed.
• Compressed data is splittable.
• Even though the compressed data is splittable, it is generally not suited for MR jobs because of the high compression/decompression time.
3) LZO
• Provides a low compression ratio.
• Very fast at compressing and decompressing data.
• Compressed data is splittable if an appropriate indexing algorithm is used.
• Best suited for MR jobs because it is both fast and splittable.
4) SNAPPY
• Provides an average compression ratio.
• Aimed at very fast compression and decompression.
• Compressed data is not splittable if used with a plain file such as .txt.
• Generally used to compress container file formats like Avro and SequenceFile, because such container files compress data per block and therefore remain splittable.
Codecs in Hadoop
❑ Codec, short for compressor-decompressor, is the implementation of a compression-decompression algorithm.
❑ In the Hadoop framework there are different codec classes for different compression formats; you use the codec class that matches the compression format you are working with (see the sketch after this list).
❑ The codec classes in Hadoop are as follows:
▪ Deflate – org.apache.hadoop.io.compress.DefaultCodec or org.apache.hadoop.io.compress.DeflateCodec
(DeflateCodec is an alias for DefaultCodec). This codec uses zlib compression.
▪ Gzip – org.apache.hadoop.io.compress.GzipCodec
▪ Bzip2 – org.apache.hadoop.io.compress.BZip2Codec
▪ Snappy – org.apache.hadoop.io.compress.SnappyCodec
▪ LZO – com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec
The LZO libraries are GPL licensed and do not come with the Hadoop release; the Hadoop codec for LZO has to be downloaded separately.
▪ LZ4 – org.apache.hadoop.io.compress.Lz4Codec
▪ Zstandard – org.apache.hadoop.io.compress.ZStandardCodec
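In practice you rarely hard-code a codec class: CompressionCodecFactory picks the codec from the file name extension. The sketch below assumes a hypothetical gzip-compressed input file.

// Minimal sketch: pick the codec from the file name extension and decompress the stream.
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/data/logs/events.gz");   // hypothetical file

        // The factory maps .gz -> GzipCodec, .bz2 -> BZip2Codec, .snappy -> SnappyCodec, etc.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);
        if (codec == null) {
            System.err.println("No codec found for " + input + "; read it as plain text instead.");
            return;
        }

        try (InputStream in = codec.createInputStream(fs.open(input))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}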
Data serialization
• Data serialization is the process of translating data structures into a stream of bytes; deserialization converts that stream back to the original form.
• The serialized stream can be transmitted over the network or stored in a database regardless of the system architecture.
• Isn't simply storing information in binary form, or as a stream of bytes, already the right approach? Serialization does the same, but it is not dependent on the machine architecture.
Serialization
Data is serialized for two objectives −
➢For persistent storage
Persistent Storage is a digital storage facility that does not
lose its data with the loss of power supply. Files, folders,
databases are the examples of persistent storage.
➢ To transport the data over the network (inter-process communication)
• To establish inter-process communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used.
• RPC used internal serialization to convert the message into
binary format before sending it to the remote node via
network. At the other end the remote system deserializes
the binary stream into the original message.
Serialization
• The RPC serialization format is required to be as follows −
• Compact − To make the best use of network bandwidth, which is
the most scarce resource in a data center.
• Fast − Since the communication between the nodes is crucial in
distributed systems, the serialization and deserialization process
should be quick, producing less overhead.
• Extensible − Protocols change over time to meet new
requirements, so it should be straightforward to evolve the protocol
in a controlled manner for clients and servers.
• Interoperable − The message format should support the nodes that
are written in different languages.
Class hierarchy of Hadoop serialization
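Hadoop's own serialization types are built around the Writable interface. As a minimal sketch (the class and its fields are hypothetical, not from these slides), a custom type only has to implement write() and readFields():

// Hypothetical custom Writable illustrating Hadoop's serialization interface.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {
    private long timestamp;
    private int count;

    public PageViewWritable() { }                  // no-arg constructor required by the framework

    public PageViewWritable(long timestamp, int count) {
        this.timestamp = timestamp;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeLong(timestamp);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize in the same order
        timestamp = in.readLong();
        count = in.readInt();
    }
}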
Avro file format
➢ Avro format is a row-based storage format for Hadoop, which is widely used as a serialization platform.
➢ Avro format stores the schema in JSON format, making it easy to read and interpret by any program.
➢ The data itself is stored in a binary format, making Avro files compact and efficient.
➢ Avro format is a language-neutral data serialization system. It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
➢ A key feature of the Avro format is its robust support for data schemas that change over time, i.e., schema evolution. Avro handles schema changes like missing fields, added fields, and changed fields.
➢ Avro format provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record.
Avro file format
• What will happen to past data when there is a change in schema? In AVRO, schema changes like missing fields, changed/modified fields, and added/new fields are easy to maintain: AVRO can read the data with the new schema regardless of when the data was generated (see the sketch below).
• Is there any problem when trying to convert AVRO data to other formats, if needed? The Hadoop ecosystem has built-in support for this when using a hybrid data system; for example, Apache Spark can convert Avro files to Parquet.
• Is it possible to manage MapReduce jobs? MapReduce jobs can be handled efficiently by using Avro file processing in MapReduce itself.
• What does the connection handshake do? The connection handshake helps to exchange schemas between client and server during RPC.
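To make schema evolution concrete, here is a minimal sketch (the User schemas, field names, and default value are hypothetical): a record written with an old writer schema is read back with a newer reader schema that adds a field with a default.

// Minimal Avro schema-evolution sketch with hypothetical schemas.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroSchemaEvolutionExample {
    static final String WRITER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}";

    static final String READER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}";   // new field with a default

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER_SCHEMA);
        Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA);

        // Serialize a record with the old (writer) schema.
        GenericRecord user = new GenericData.Record(writerSchema);
        user.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
        encoder.flush();

        // Deserialize with the new (reader) schema: the missing "age" gets its default value.
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
            .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(decoded);   // prints the name plus age = -1
    }
}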
Avro file structure
How is AVRO different from other systems?
• AVRO has a dynamic schema and stores serialized values in a binary format in a space-efficient way.
• The JSON schema and the data are stored together in the file.
• No compilation (code generation) is required; the data can be read directly by using the parser library.
• AVRO is a language-neutral system.
How to implement Avro?
• Have the schema of the data format ready.
• Read the schema in the program, either by compiling it with Avro (generating a class) or by reading it directly via the parser library.
• Achieve serialization through the serialization API (Java):
  - By generated class – the schema is fed to the Avro utility and processed into a Java class file; then write API methods to serialize the data.
  - By parser library – the schema is fed to the Avro utility and the parser library is used to serialize the data (see the sketch below).
• Achieve deserialization through the deserialization API (Java):
  - By generated class – deserialize the object and instantiate the DataFileReader class.
  - By parser library – instantiate the parser class.
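A minimal sketch of the parser-library route described above (the Employee schema and file name are hypothetical): parse the schema at runtime, write records into an Avro container file with DataFileWriter, and read them back with DataFileReader.

// Parser-library serialization and deserialization without code generation.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroParserLibraryExample {
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);   // no code generation needed
        File file = new File("employees.avro");                   // hypothetical output file

        // Serialization: the schema is embedded in the container file's header.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1);
            rec.put("name", "alice");
            writer.append(rec);
        }

        // Deserialization: instantiate DataFileReader and iterate over the records.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}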
Creating schema in Avro

• There are three ways to define and create JSON schemas of AVRO
1. JSON String
2. The JSON object
3. JSON Array
• There are two kinds of data types that need to be mentioned in a schema:
  - Primitive (common data types)
  - Complex (when collective data elements need to be processed and stored)
• Primitive data types include null, boolean, int, long, bytes, float, string, and double. In contrast, complex types include Records (attribute encapsulation), Enums, Arrays, Maps (associativity), Unions (multiple types for one field), and Fixed (deals with the size of data).
• Data types help to maintain sort order.
• Logical types (built on top of the primitive and complex types) are used to store values such as time, date, timestamp, and decimal.
YARN

• YARN stands for “Yet Another Resource Negotiator“.


• It was introduced in Hadoop 2.0 to remove the bottleneck of the JobTracker, which was present in Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
• YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well as batch
processing to run and process data stored in HDFS (Hadoop Distributed
File System) thus making the system much more efficient.
YARN Features

Application Workflow of YARN
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Managers to launch the containers.
6. Application code is executed in the containers.
7. The client contacts the Resource Manager/Application Master to monitor the application's status (see the sketch below).
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
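As a minimal sketch of step 7 (assuming a yarn-site.xml pointing at the ResourceManager is on the classpath; no specific application is tracked), a client can ask the ResourceManager for application reports via the YarnClient API:

// List the applications known to the ResourceManager and print their state.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatusExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads the ResourceManager address
        yarnClient.start();

        // Ask the ResourceManager for application reports (step 7 of the workflow).
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s %s state=%s progress=%.0f%%%n",
                app.getApplicationId(), app.getName(),
                app.getYarnApplicationState(), app.getProgress() * 100);
        }

        yarnClient.stop();
    }
}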
