Agenda
Big Data Systems – Characteristics: Failures; Reliability and Availability; Consistency –
Notions of Consistency.
Contents
Chapter 4 The Big Data Technology Landscape
Source: Acharya, Seema and Subhashini Chellappan, Big Data and Analytics, 2nd ed., p. 14.
4.2 Hadoop
4.2.1 Features of Hadoop
4.2.2 Key Advantages of Hadoop
4.2.3 Versions of Hadoop
4.2.4 Overview of Hadoop Ecosystems
4.2.5 Hadoop Distributions
4.2.6 Hadoop versus SQL
4.2.7 Integrated Hadoop Systems Offered by Leading Market Vendors
4.2.8 Cloud-Based Hadoop Solutions
4.2 Hadoop
• Hadoop is an open-source project of the Apache Software Foundation.
• It is a framework written in Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant.
• He was working at Yahoo! at the time.
• It was created to support distribution of "Nutch", an open-source text search engine.
• Hadoop uses Google’s MapReduce and Google File System technologies as its
foundation.
• Hadoop is now a core part of the computing infrastructure for companies such as Yahoo!, Facebook, LinkedIn, and Twitter.
Refer Figure 4.8.
4.2.1 Features of Hadoop
Let us cite a few features of Hadoop:
1. It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, that is, relatively inexpensive
computers.
2. Hadoop has a shared-nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data
can still be processed from another machine that stores its replica.
4. Hadoop is built for high throughput rather than low latency. It performs batch operations over massive quantities of data; therefore, the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical
Processing (OLAP). However, it is not a replacement for a relational database
management system.
6. It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
7. It is NOT good for processing small files. It works best with huge data files and
datasets.
4.2.2 Key Advantages of Hadoop
Refer Figure 4.9 for a quick look at the key advantages of Hadoop. Some of them are as follows:
1. Stores data in its native format:
• Hadoop's data storage framework (HDFS – Hadoop Distributed File System) can store data in its native format.
• There is no structure imposed while keying in data or storing data. HDFS is pretty much schema-less.
• It is only later, when the data needs to be processed, that structure is imposed on the raw data.
2. Scalable: Hadoop can store and distribute very large datasets (involving thousands of
terabytes of data) across hundreds of inexpensive servers that operate in parallel.
3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost/terabyte of storage and processing.
4. Resilient to failure: Hadoop is fault-tolerant. It diligently replicates data, which means that whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure there will always be another copy of the data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds
of data: structured, semi-structured, and unstructured data. It can help derive
meaningful business insights from email conversations, social media data,
click-stream data, etc. It can be put to several purposes such as log analysis, data
mining, recommendation systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop as compared to other conventional
systems owing to the “move code to data” paradigm. Hadoop has a shared-nothing
architecture.
4.2.3 Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Let us take a look at the features of both. Refer Figure 4.10.
4.2.3.1 Hadoop 1.0
It has two main parts:
1. Data storage framework: It is a general-purpose file system called the Hadoop Distributed File System (HDFS). HDFS is schema-less; it simply stores data files, and these data files can be in just about any format. The idea is to store files as close to their original form as possible. This in turn provides the business units and the organization the much-needed flexibility and agility without being overly worried about what they can implement.
2. Data processing framework: This is a simple functional programming model initially
popularized by Google as MapReduce.
It essentially uses two functions: the MAP and the REDUCE functions to process data.
The “Mappers” take in a set of key–value pairs and generate intermediate data (which
is another list of key–value pairs).
The “Reducers” then act on this input to produce the output data.
The two functions work seemingly in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant, and scalable way.
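To make the Map and Reduce functions concrete, here is a minimal word-count sketch against the classic Hadoop Java API. This is an illustration only; the class names and the simple whitespace tokenization are our own choices, not the book's, and both classes are shown in one file for brevity.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: takes (offset, line) pairs and emits an intermediate
// (word, 1) pair for every word in the line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and sums the counts.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```

Note how the mapper and reducer never communicate directly; the framework shuffles the intermediate pairs between them, which is what makes the distribution transparent.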
There were, however, a few limitations of Hadoop 1.0. They are as follows:
1. The first limitation was the requirement for MapReduce programming expertise, along with proficiency in other programming languages, notably Java.
2. It supported only batch processing, which, although suitable for tasks such as log analysis and large-scale data mining projects, is pretty much unsuitable for other kinds of projects.
3. One major limitation was that Hadoop 1.0 was computationally tightly coupled with MapReduce, which meant that the established data management vendors were left with two options: either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or extract the data from HDFS and process it outside of Hadoop. Neither option was viable, as both led to process inefficiencies caused by the data being moved in and out of the Hadoop cluster.
Let us look at whether these limitations have been wholly or partly resolved by Hadoop 2.0.
4.2.3.2 Hadoop 2.0
• In Hadoop 2.0, HDFS continues to be the data storage framework.
• However, a new and separate resource management framework called Yet Another
Resource Negotiator (YARN) has been added.
• Any application capable of dividing itself into parallel tasks is supported by YARN.
• YARN coordinates the allocation of subtasks of the submitted application, thereby
further enhancing the flexibility, scalability, and efficiency of the applications.
• It works by having an ApplicationMaster in place of the erstwhile JobTracker, running
applications on resources governed by a new NodeManager (in place of the erstwhile
TaskTracker).
• ApplicationMaster is able to run any application and not just MapReduce.
• This, in other words, means that MapReduce programming expertise is no longer required.
• Furthermore, YARN supports not only batch processing but also real-time processing.
• MapReduce is no longer the only data processing option; alternative data processing functions, such as data standardization and master data management, can now be performed natively in HDFS.
4.2.4 Overview of Hadoop Ecosystems
There are components available in the Hadoop ecosystem for data ingestion,
processing, and analysis.
Data Ingestion → Data Processing → Data Analysis
Components that help with Data Ingestion are:
1. Sqoop
2. Flume
Components that help with Data Processing are:
1. MapReduce
2. Spark
Components that help with Data Analysis are:
1. Pig
2. Hive
3. Impala
HDFS
• It is the distributed storage unit of Hadoop.
• It provides streaming access to file system data as well as file permissions and
authentication. It is based on GFS (Google File System).
• It is used to scale a single cluster to hundreds or even thousands of nodes.
• It handles large datasets running on commodity hardware. HDFS is highly
fault-tolerant. It stores files across multiple machines.
• These files are stored in redundant fashion to allow for data recovery in case of
failure.
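As a hedged sketch of how a client talks to HDFS through Hadoop's Java FileSystem API; the NameNode address and paths below are placeholders, and in practice the address is usually picked up from core-site.xml:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently replicates its blocks
        // across DataNodes (three copies by default).
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello, hdfs\n");
        }

        // Streaming read of the same file.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```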
• An e-commerce website stores millions of customers' data in a distributed manner; the data has been collected over 4–5 years.
• It then runs batch analytics on the archived data to analyze customers' behavior, buying patterns, preferences, requirements, etc.
• This helps to understand which products are purchased by customers in which
months, etc.
HBase
• It stores data in HDFS.
• It is the first non-batch component of the Hadoop Ecosystem.
• It is a database on top of HDFS. It provides quick random access to the stored data.
• It has very low latency compared to HDFS.
• It is a NoSQL database: non-relational and column-oriented.
• A table can have thousands of columns.
• A table can have multiple rows.
• Each row can have several column families.
• Each column family can have several columns.
• Each column can have several key values.
• It is based on Google BigTable.
• This is widely used by Facebook, Twitter, Yahoo, etc.
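A minimal sketch of HBase's random read/write access using the standard Java client API; the table, column family, row key, and values are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Write: row key "cust-1001", column family "profile",
            // column "email".
            Put put = new Put(Bytes.toBytes("cust-1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Random read by row key: the low-latency access pattern
            // that HBase adds on top of HDFS.
            Get get = new Get(Bytes.toBytes("cust-1001"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("profile"),
                                           Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```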
Difference between HBase and Hadoop/HDFS
1. HDFS is a file system, whereas HBase is a Hadoop database. The relationship is analogous to that between NTFS and MySQL.
2. HDFS is WORM (write once, read many times). The latest versions support appending of data, but this feature is rarely used. HBase, however, supports real-time random reads and writes.
3. HDFS is based on Google File System (GFS) whereas HBase is based on Google
Big Table.
4. HDFS supports only full-table or partition scans, whereas HBase supports random small-range scans as well as table scans.
5. The performance of Hive on HDFS is relatively good, but on HBase it becomes 4–5 times slower.
6. In HDFS, data is accessed only via MapReduce jobs, whereas in HBase the access is via Java APIs and REST, Avro, or Thrift APIs.
7. HDFS does not support dynamic storage owing to its rigid structure whereas
HBase supports dynamic storage.
8. HDFS has high latency operations whereas HBase has low latency operations.
9. HDFS is most suitable for batch analytics whereas HBase is for real-time
analytics.
Hadoop Ecosystem Components for Data Ingestion
1. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are:
a) Importing data from RDBMS such as MySQL, Oracle, DB2, etc. to Hadoop file
system (HDFS, HBase, Hive).
b) Exporting data from Hadoop File system (HDFS, HBase, Hive) to RDBMS (MySQL,
Oracle, DB2).
Uses of Sqoop
a) It has a connector-based architecture to allow plug-ins to connect to external
systems such as MySQL, Oracle, DB2, etc.
b) It can provision data from external systems onto HDFS and populate tables in Hive and HBase.
c) It integrates with Oozie, allowing you to schedule and automate import and export tasks.
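As a hedged illustration of both directions, the command-line invocations below use only standard Sqoop flags; the host, database, user, and table names are placeholders:

```
# Import a MySQL table into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /user/hadoop/customers

# Export processed results from HDFS back to MySQL
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table customer_summary \
  --export-dir /user/hadoop/summary
```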
2. Flume: Flume is an important log-aggregator component in the Hadoop ecosystem (it aggregates logs from different machines and places them in HDFS).
Flume was developed by Cloudera.
It is designed for high-volume ingestion of event-based data into Hadoop.
The default destination (called a "sink" in Flume parlance) is HDFS.
However, it can also write to HBase or Solr.
• Consider a bank of web servers.
• Flume moves log events from their log files into new aggregated files in HDFS for processing.
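A minimal, illustrative agent configuration for such a scenario might look as follows (the agent, source, sink, and path names are placeholders): an exec source tails the web-server access log, a memory channel buffers events, and the HDFS sink writes the aggregated files.

```
# flume-agent.properties (illustrative)
agent.sources  = weblog
agent.channels = mem
agent.sinks    = tohdfs

agent.sources.weblog.type     = exec
agent.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

agent.channels.mem.type = memory

agent.sinks.tohdfs.type      = hdfs
agent.sinks.tohdfs.hdfs.path = hdfs://namenode/flume/weblogs
agent.sinks.tohdfs.channel   = mem
```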
Hadoop Ecosystem Components for Data Processing
1. MapReduce: It is a programming paradigm that allows distributed and parallel processing of huge datasets.
• It is based on Google's MapReduce.
• Google released a paper on the MapReduce programming paradigm in 2004, and that became the genesis of the Hadoop processing model.
• The MapReduce framework gets its input data from HDFS.
• There are two main phases: the Map phase and the Reduce phase.
• The Map phase converts the input data into another set of data (key–value pairs).
• This new intermediate dataset then serves as the input to the Reduce phase.
• The Reduce phase acts on the datasets to combine (aggregate and consolidate) and reduce them to a smaller set of tuples.
• The result is then stored back in HDFS.
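The driver below sketches how such a job is wired together and where the HDFS input and output paths come in. It assumes the hypothetical WordCountMapper and WordCountReducer classes sketched earlier; the paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map-phase and Reduce-phase classes from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS; the reduced result is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```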
2. Spark:
• It is both a programming model as well as a computing model.
• It is an open-source big data processing framework.
• It was originally developed in 2009 at UC Berkeley's AMPLab and became an open-source project in 2010.
• It is written in Scala.
• It provides in-memory computing for Hadoop.
• In Spark, workloads execute in memory rather than on disk, which makes it much faster (10 to 100 times) than when the workload is executed on disk.
• However, if the datasets are too large to fit into the available system memory, it can fall back to conventional disk-based processing.
• It serves as a potentially faster and more flexible alternative to MapReduce.
• It accesses data from HDFS (Spark does not have its own distributed file system) but bypasses the MapReduce processing model.
• Spark can be used with Hadoop, coexisting smoothly with MapReduce (sitting on top of Hadoop YARN), or used independently of Hadoop (standalone).
• As a programming model, it works well with Scala or Python (it has API connectors for using it with Java or Python), or with the R programming language.
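Here is a minimal word-count sketch using Spark's Java API, reading from and writing back to HDFS (the app name and paths are placeholders). The cache() call is what keeps the intermediate dataset in memory, the source of Spark's speed advantage over disk-based MapReduce.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read from HDFS; Spark has no file system of its own.
            JavaRDD<String> lines =
                sc.textFile("hdfs://namenode:9000/user/demo/input");

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line ->
                    Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum)
                .cache(); // keep the result in memory for reuse

            counts.saveAsTextFile("hdfs://namenode:9000/user/demo/counts");
        }
    }
}
```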
The following are the Spark libraries:
a) Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query data stored in disparate applications (see the sketch after this list).
b) Spark Streaming: It helps to analyze and present data in real time.
c) MLlib: It supports machine learning, such as applying advanced statistical operations on data in a Spark cluster.
d) GraphX: It helps in graph-parallel computation.
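As a sketch of the Spark SQL library mentioned above, the snippet below registers a dataset as a temporary view and queries it with plain SQL. The file path, view name, and column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-example")
                .getOrCreate();

        // Load a JSON file (placeholder path) and query it with SQL.
        Dataset<Row> orders = spark.read().json("hdfs:///user/demo/orders.json");
        orders.createOrReplaceTempView("orders");

        Dataset<Row> monthly = spark.sql(
            "SELECT month, SUM(amount) AS total FROM orders GROUP BY month");
        monthly.show();

        spark.stop();
    }
}
```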
Spark and Hadoop are usually used together by several companies.
Hadoop was primarily designed to house unstructured data and run batch
processing operations on it.
• Spark is used extensively for its high-speed in-memory computing and its ability to run advanced real-time analytics.
• The two together have been giving very good results.
Hadoop Ecosystem Components for Data Analysis
1. Pig: It is a high-level scripting language used with Hadoop.
• It serves as an alternative to MapReduce.
• It has two parts:
(a) Pig Latin: It is a SQL-like scripting language. Pig Latin scripts are translated into MapReduce jobs, which can then run on YARN and process data in the HDFS cluster.
It was initially developed by Yahoo. It is immensely popular with developers who
are not comfortable with MapReduce.
However, SQL developers may have a preference for Hive.
• How it works: A "LOAD" command is available to load data from HDFS into Pig.
• One can then perform functions such as grouping, filtering, sorting, joining, etc.
• The processed or computed data can then be either displayed on screen or placed back into HDFS.
• Pig gives you a platform for building data flows for ETL (Extract, Transform, and Load), and for processing and analyzing huge datasets (an illustrative script follows below).
(b) Pig runtime: It is the runtime environment.
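The illustrative Pig Latin script below strings these steps together: load, filter, group, aggregate, sort, and store back into HDFS. All paths and field names are placeholders of our own, not from the book.

```
-- Illustrative Pig Latin script (paths and field names are placeholders)
logs    = LOAD '/user/demo/weblogs' USING PigStorage('\t')
          AS (user:chararray, url:chararray, bytes:int);
big     = FILTER logs BY bytes > 1024;
by_user = GROUP big BY user;
traffic = FOREACH by_user GENERATE group AS user,
          SUM(big.bytes) AS total;
sorted  = ORDER traffic BY total DESC;
STORE sorted INTO '/user/demo/traffic_report';
```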
2. Hive:
• Hive is a data warehouse software project built on top of Hadoop.
• The three main tasks performed by Hive are summarization, querying, and analysis.
• It supports queries written in a language called HQL or HiveQL, which is a declarative SQL-like language.
• It converts the SQL-style queries into MapReduce jobs, which are then executed on the Hadoop platform.
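A short, illustrative HiveQL session (the table, columns, and paths are hypothetical) showing the declarative style that Hive compiles into MapReduce jobs:

```sql
-- Illustrative HiveQL (table, columns, and paths are placeholders)
CREATE TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Loading is essentially a file move into the warehouse directory;
-- the schema is applied only when the data is read (schema on read).
LOAD DATA INPATH '/user/demo/page_views.tsv' INTO TABLE page_views;

-- This declarative query is compiled into MapReduce jobs by Hive.
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```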
Difference between Hive and RDBMS
Both Hive and traditional databases such as MySQL, MS SQL Server, and PostgreSQL support an SQL interface.
However, Hive is better described as a data warehouse (D/W) than as a database.
Let us look at the difference between Hive and traditional databases as regards the
schema.
1. Hive enforces schema at read time, whereas an RDBMS enforces schema at write time. In an RDBMS, the table's schema is enforced at the time of loading/inserting the data.
• If the data being loaded does not conform to the schema, it is rejected.
• Thus, the schema is enforced on write (loading the data into the database).
• Schema on write takes longer to load the data into the database; however, it makes up for it during data retrieval with good query-time performance.
However, Hive does not enforce the schema when the data is being loaded into the D/W; it is enforced only when the data is being read/retrieved.
This is called schema on read.
It definitely makes for a fast initial load, as the data load or insertion operation is just a file copy or move.
2. Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for reading and writing many times.
3. Hadoop is a batch-oriented system. Hive, therefore, is not suitable for OLTP (Online Transaction Processing); although not ideal, it comes closer to OLAP (Online Analytical Processing).
• The reason is that there is considerable latency between issuing a query and receiving a reply, as the query written in HiveQL is converted into MapReduce jobs that are then executed on the Hadoop cluster.
• An RDBMS, in contrast, is suitable for housing day-to-day transaction data and supports all OLTP operations with frequent insertions, modifications (updates), and deletions of data.
4. Hive handles static, non-real-time data analysis. Hive is the data warehouse of Hadoop; there are no frequent updates to the data, and the query response time is not fast. An RDBMS is suited to handling dynamic, real-time data.
5. Hive can be scaled easily and at very low cost compared to an RDBMS. Hive uses HDFS to store its data, so it cannot be considered the owner of the data; an RDBMS, on the other hand, is the owner of the data, responsible for storing, managing, and manipulating it in the database.
6. Hive uses the concept of parallel computing, whereas RDBMS uses serial computing.
Difference between Hive and HBase
1. Hive is a MapReduce-based SQL engine that runs on top of Hadoop; HBase is a key–value NoSQL database that runs on top of HDFS.
2. Hive is for batch processing of big data; HBase is for real-time data streaming.
Impala
• It is a high-performance SQL engine that runs on a Hadoop cluster.
• It is ideal for interactive analysis.
• It has very low latency, measured in milliseconds.
• It supports a dialect of SQL called Impala SQL.
• ZooKeeper: It is a coordination service for distributed applications.
• Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
• Mahout: It is a scalable machine learning and data mining library.
• Chukwa: It is a data collection system for managing large distributed systems.
• Ambari: It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
4.2.5 Hadoop Distributions
• Hadoop is an open-source Apache project. Anyone can freely download the
core aspects of Hadoop. The core aspects of Hadoop include the following:
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN (Yet Another Resource Negotiator)
• Hadoop MapReduce
• There are a few companies, such as IBM, Amazon Web Services, Microsoft, Teradata, Hortonworks, and Cloudera, that have packaged Hadoop into more easily consumable distributions or services.
• Although each of these companies has a slightly different strategy, the essence remains the same: the ability to distribute data and workloads across potentially thousands of servers, thus making big data manageable.
• A few Hadoop distributions are given in Figure 4.12.
4.2.6 Hadoop versus SQL
Table 4.6 lists the differences between Hadoop and SQL.
4.2.7 Integrated Hadoop Systems Offered by Leading Market Vendors
Refer Figure 4.13 to get a glimpse of the leading market vendors
offering integrated Hadoop systems.
4.2.8 Cloud-Based Hadoop Solutions
• Amazon Web Services offers a comprehensive, end-to-end portfolio of cloud computing services to help manage big data.
• The aim is to achieve this and more while retaining the emphasis on reducing costs, scaling to meet demand, and accelerating the speed of innovation.
The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce jobs directly on data in Google Cloud Storage, without the need to copy it to local disk or into the Hadoop Distributed File System (HDFS).
The connector simplifies Hadoop deployment while reducing cost; it provides performance comparable to HDFS, all while increasing reliability by eliminating the single point of failure of the NameNode.
Refer Figure 4.14.
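As a hedged sketch, once the connector is on the job's classpath, a gs:// URI can be used wherever an hdfs:// URI would otherwise appear, for example in the hypothetical WordCountDriver sketched earlier. The bucket and path names below are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GcsPaths {
    // With the Cloud Storage connector installed, the job reads from and
    // writes to gs:// URIs instead of hdfs:// URIs; nothing else changes.
    static void useCloudStorage(Job job) throws Exception {
        FileInputFormat.addInputPath(job, new Path("gs://my-bucket/input"));
        FileOutputFormat.setOutputPath(job, new Path("gs://my-bucket/output"));
    }
}
```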
Thank you
Q&A