CASE STUDY:
HADOOP
OUTLINE
Hadoop - Basics
HDFS
Goals
Architecture
Other functions
MapReduce
Basics
Word Count Example
Handy tools
Finding shortest path example
Related Apache sub-projects (Pig, HBase, Hive)
HBASE: PART OF HADOOP’S ECOSYSTEM
HBase is built on top of HDFS
HBase files are internally stored in HDFS
HADOOP - WHY?
Need to process huge datasets on large clusters of computers
Very expensive to build reliability into each application
Nodes fail every day
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Need a common infrastructure
Efficient, reliable, easy to use
Open Source, Apache License
WHO USES HADOOP?
Amazon/A9
Facebook
Google
New York Times
Veoh
Yahoo!
…. many more
COMMODITY HARDWARE
[Diagram: rack switches uplinked to an aggregation switch]
Typically a 2-level architecture
Nodes are commodity PCs
30-40 nodes/rack
Uplink from rack is 3-4 gigabit
Rack-internal is 1 gigabit
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
GOALS OF HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10PB
Assumes Commodity Hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
Optimized for Batch Processing
Data locations exposed so that computations can move to where data resides
Provides very high aggregate bandwidth
DISTRIBUTED FILE SYSTEM
Single Namespace for entire cluster
Data Coherency
Write-once-read-many access model
Client can only append to existing files
Files are broken up into blocks
Typically 64MB block size
Each block replicated on multiple DataNodes
Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode
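To make the intelligent-client model concrete, here is a minimal Java sketch using the Hadoop FileSystem API (assumes a configured Hadoop client on the classpath; the path and file contents are purely illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/foodir/myfile.txt");   // illustrative path
    // Write once: create the file and write its contents
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello hdfs\n");
    }
    // Read many: the client gets block locations from the NameNode,
    // then streams the bytes directly from the DataNodes
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buf = new byte[1024];
      int n = in.read(buf);
      System.out.println(new String(buf, 0, n, "UTF-8"));
    }
  }
}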
HDFS ARCHITECTURE
FUNCTIONS OF A NAMENODE
Manages File System Namespace
Maps a file name to a set of blocks
Maps a block to the DataNodes where it resides
Cluster Configuration Management
Replication Engine for Blocks
NAMENODE METADATA
Metadata in Memory
The entire metadata is in main memory
No demand paging of metadata
Types of metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor
A Transaction Log
Records file creations, file deletions etc
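As a rough illustration of what this metadata looks like in memory, a sketch of the data structures (class and field names are assumptions for illustration, not the actual HDFS classes):
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the real NameNode implementation.
class NameNodeMetadataSketch {
  static class FileMeta {
    long creationTime;              // file attributes
    short replicationFactor;
    List<Long> blockIds;            // ordered blocks that make up the file
  }
  Map<String, FileMeta> files = new HashMap<>();            // file name -> metadata
  Map<Long, List<String>> blockLocations = new HashMap<>(); // block id -> DataNodes holding a replica

  // Every namespace change is also appended to an on-disk transaction log,
  // e.g. "CREATE /foodir/myfile.txt" or "DELETE /foodir/old.txt".
  void logEdit(String record) { /* append to the edit log */ }
}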
DATANODE
A Block Server
Stores data in the local file system (e.g. ext3)
Stores metadata of a block (e.g. CRC)
Serves data and metadata to Clients
Block Report
Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
Forwards data to other specified DataNodes
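A minimal sketch of the pipelining idea (illustrative only; the method names and the use of plain byte arrays are assumptions, not the actual HDFS write protocol):
import java.util.List;

class PipelineSketch {
  // Each DataNode stores the packet locally and forwards it to the next replica target.
  static void receive(byte[] packet, List<String> remainingTargets) {
    writeToLocalBlockFile(packet);                    // store on the local file system
    if (!remainingTargets.isEmpty()) {
      String next = remainingTargets.get(0);          // next DataNode in the pipeline
      forward(next, packet, remainingTargets.subList(1, remainingTargets.size()));
    }
  }
  static void writeToLocalBlockFile(byte[] packet) { /* e.g. append to a block file on ext3 */ }
  static void forward(String dataNode, byte[] packet, List<String> rest) { /* send over the network */ }
}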
BLOCK PLACEMENT
Current Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
Clients read from nearest replicas
Would like to make this policy pluggable
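A sketch of that placement strategy as code (illustrative; the Node type and helper methods are assumptions, not the pluggable policy interface):
import java.util.ArrayList;
import java.util.List;

class BlockPlacementSketch {
  static class Node { String name; String rack; }

  static List<Node> chooseTargets(Node writer, List<Node> cluster, int replication) {
    List<Node> targets = new ArrayList<>();
    targets.add(writer);                                        // 1st replica: local node
    Node remote = pickNodeOffRack(cluster, writer.rack);        // 2nd replica: a remote rack
    targets.add(remote);
    targets.add(pickNodeOnRack(cluster, remote.rack, remote));  // 3rd replica: same remote rack
    while (targets.size() < replication) {
      targets.add(pickRandomNode(cluster, targets));            // further replicas: random
    }
    return targets;
  }
  // Helpers left as stubs: they would filter the cluster list by rack and exclude chosen nodes.
  static Node pickNodeOffRack(List<Node> cluster, String rack) { return cluster.get(0); }
  static Node pickNodeOnRack(List<Node> cluster, String rack, Node exclude) { return cluster.get(0); }
  static Node pickRandomNode(List<Node> cluster, List<Node> exclude) { return cluster.get(0); }
}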
HEARTBEATS
DataNodes send heartbeats to the NameNode
Once every 3 seconds
NameNode uses heartbeats to detect DataNode failure
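A minimal sketch of how heartbeats turn into failure detection (illustrative; the expiry interval is an assumption, not taken from the slide):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitorSketch {
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
  private final long expiryMillis;                  // assumed timeout before declaring a node dead

  HeartbeatMonitorSketch(long expiryMillis) { this.expiryMillis = expiryMillis; }

  void onHeartbeat(String dataNodeId) {             // called roughly every 3 seconds per DataNode
    lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
  }

  boolean isDead(String dataNodeId) {               // checked by the NameNode / replication engine
    Long last = lastHeartbeat.get(dataNodeId);
    return last == null || System.currentTimeMillis() - last > expiryMillis;
  }
}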
REPLICATION ENGINE
NameNode detects DataNode failures
Chooses new DataNodes for new replicas
Balances disk usage
Balances communication traffic to DataNodes
DATA CORRECTNESS
Use Checksums to validate data
Use CRC32
File Creation
Client computes checksum per 512 bytes
DataNode stores the checksum
File access
Client retrieves the data and checksum from
DataNode
If Validation fails, Client tries other replicas
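A sketch of the checksum scheme in Java (a minimal illustration using java.util.zip.CRC32; the class and method names are not HDFS's own):
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

class ChecksumSketch {
  static final int BYTES_PER_CHECKSUM = 512;        // one CRC32 per 512 bytes, as above

  // Computed by the client at file-creation time; the DataNode stores the result.
  static List<Long> computeChecksums(byte[] data) {
    List<Long> sums = new ArrayList<>();
    for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
      int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      sums.add(crc.getValue());
    }
    return sums;
  }

  // On read, the client recomputes and compares; on mismatch it tries another replica.
  static boolean validate(byte[] data, List<Long> storedChecksums) {
    return computeChecksums(data).equals(storedChecksums);
  }
}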
NAMENODE FAILURE
A single point of failure
Transaction Log stored in multiple directories
A directory on the local file system
A directory on a remote file system (NFS/CIFS)
Need to develop a real HA solution
SECONDARY NAMENODE
Copies FsImage and Transaction Log from NameNode to a temporary directory
Merges FsImage and Transaction Log into a new FsImage in the temporary directory
Uploads new FsImage to the NameNode
Transaction Log on NameNode is purged
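The checkpoint procedure above, as a rough sketch (the method names and temporary directory are illustrative, not the actual SecondaryNameNode code):
import java.io.File;

class CheckpointSketch {
  void checkpoint() {
    File tmp = new File("/tmp/checkpoint");          // illustrative temporary directory
    File image = fetchFromNameNode("fsimage", tmp);  // current namespace image
    File edits = fetchFromNameNode("edits", tmp);    // transaction log since that image
    File newImage = merge(image, edits, tmp);        // replay the log into a fresh FsImage
    uploadToNameNode(newImage);                      // NameNode adopts it and purges its log
  }
  File fetchFromNameNode(String name, File dir) { return new File(dir, name); }
  File merge(File image, File edits, File dir) { return new File(dir, "fsimage.new"); }
  void uploadToNameNode(File newImage) { /* transfer back to the NameNode */ }
}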
USER INTERFACE
Commands for HDFS User:
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt
Commands for HDFS Administrator
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename
Web Interface
http://host:port/dfshealth.jsp
PIG
PIG
Started at Yahoo! Research
Now runs about 30% of Yahoo!’s jobs
Features
Expresses sequences of MapReduce jobs
Data model: nested “bags” of items
Provides relational (SQL) operators
(JOIN, GROUP BY, etc.)
Easy to plug in Java functions
AN EXAMPLE PROBLEM
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
[Dataflow diagram: Load Users → Filter by age; Load Pages; Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
IN PIG LATIN
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
EASE OF TRANSLATION
Each step of the dataflow maps directly onto a Pig Latin statement:
Load Users / Load Pages  ->  Users = load … / Pages = load …
Filter by age            ->  Fltrd = filter …
Join on name             ->  Joined = join …
Group on url             ->  Grouped = group …
Count clicks             ->  Summed = … count() …
Order by clicks          ->  Sorted = order …
Take top 5               ->  Top5 = limit …
EASE OF TRANSLATION
Pig compiles the same dataflow into a chain of MapReduce jobs:
Job 1: Load Users, Load Pages, Filter by age, Join on name (Users = load …, Fltrd = filter …, Pages = load …, Joined = join …)
Job 2: Group on url, Count clicks (Grouped = group …, Summed = … count() …)
Job 3: Order by clicks, Take top 5 (Sorted = order …, Top5 = limit …)
HBASE
HBASE - WHAT?
Modeled on Google’s Bigtable
Row/column store
Billions of rows, millions of columns
Column-oriented - nulls are free
Untyped - stores byte[]
HBASE - DATA MODEL
Row          Timestamp   Column family animal:       Column family repairs:
                         animal:type   animal:size   repairs:cost
enclosure1   t2          zebra                       1000 EUR
             t1          lion          big
enclosure2   …           …             …             …
HBASE - DATA STORAGE
Column family animal:
(enclosure1, t2, animal:type) zebra
(enclosure1, t1, animal:size) big
(enclosure1, t1, animal:type) lion
Column family repairs:
(enclosure1, t1, repairs:cost) 1000 EUR
HBASE - CODE
HTable table = …
Text row = new Text("enclosure1");
Text col1 = new Text("animal:type");
Text col2 = new Text("animal:size");
BatchUpdate update = new BatchUpdate(row);
update.put(col1, "lion".getBytes("UTF-8"));
update.put(col2, "big".getBytes("UTF-8"));
table.commit(update);
update = new BatchUpdate(row);
update.put(col1, "zebra".getBytes("UTF-8"));
table.commit(update);
HBASE - QUERYING
Retrieve a cell
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
Retrieve a row
RowResult = table.getRow("enclosure1");
Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
HIVE
HIVE
Developed at Facebook
Used for majority of Facebook jobs
“Relational database” built on Hadoop
Maintains list of table schemas
SQL-like query language (HiveQL)
Can call Hadoop Streaming scripts from HiveQL
Supports table partitioning, clustering, complex data
types, some optimizations
CREATING A HIVE TABLE
CREATE TABLE page_views(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Partitioning breaks table into separate files for each
(dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA
A SIMPLE QUERY
• Find all page views coming from xyz.com during March 2008:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
AND page_views.date <= '2008-03-31'
AND page_views.referrer_url like '%xyz.com';
• Hive only reads the partitions for March 2008 (dt=2008-03-01 through 2008-03-31, any country) instead of scanning the entire table
AGGREGATION AND JOINS
• Count users who visited each page by gender:
SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;
USING A HADOOP STREAMING MAPPER SCRIPT
SELECT TRANSFORM(page_views.userid,
page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
STORM
STORM
Developed by BackType, which was acquired by Twitter
Lots of tools for data (i.e. batch) processing
Hadoop, Pig, HBase, Hive, …
None of them is a realtime system, and realtime processing is becoming a real requirement for businesses
Storm provides realtime computation
Scalable
Guarantees no data loss
Extremely robust and fault-tolerant
Programming language agnostic
BEFORE STORM
BEFORE STORM – ADDING A WORKER
[Diagram: deploy the new worker, then reconfigure/redeploy the existing setup]
PROBLEMS
Scaling is painful
Poor fault-tolerance
Coding is tedious
WHAT WE WANT
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works” !!
STORM CLUSTER
A master node (similar to the Hadoop JobTracker)
Nodes used for cluster coordination
Worker nodes that run the worker processes
STREAMS
A stream is an unbounded sequence of tuples: Tuple, Tuple, Tuple, Tuple, …