Big Data Notes
Introduction
Theme of this Course
Introduction to Big Data
Big Data Definition
Characteristics of Big Data:
1 - Scale (Volume)
• Data volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 ZB
• Exponential increase in collected/generated data
Characteristics of Big Data:
2 - Complexity (Variety)
• Various formats, types, and structures
Characteristics of Big Data:
3 - Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Examples
• E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
Big Data: 3V’s
Some Make it 4V’s
Harnessing Big Data
Who’s Generating Big Data
• Mobile devices (tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
The Model Has Changed…
• The model of generating/consuming data has changed
• Old model: few companies are generating data, all others are consuming data
• New model: all of us are generating data, and all of us are consuming data
What’s Driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, from many sources
- Very large datasets
- More real-time processing
Value of Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
Challenges in Handling Big Data
What Technology Do We Have for Big Data?
Big Data Technology
What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and how it works)
• How big data is managed in a scalable, efficient way
Course Logistics
• Lectures
• Tuesday, Thursday: (4:00pm - 5:20pm)
Textbook & Reading List
• No specific textbook
• Big Data is a relatively new topic (so no fixed syllabus)
• Reading List
• We will cover state-of-the-art technology from research papers in major conferences
• Many Hadoop-related papers are available on the course website
• Related books:
• Hadoop: The Definitive Guide [pdf]
Requirements & Grading
• Seminar-Type Course
• Students will read research papers and present them (Reading List)
• Done in teams of two
• Hands-on Course
• No written homework or exams
• Several coding projects covering the entire semester
Requirements & Grading (Cont’d)
• Reviews
• When a team is presenting (not the instructor), the other students should prepare a review of the presented paper
• Course website gives guidelines on how to write good reviews
Late Submission Policy
• For Projects
• One day late: 10% off the max grade
• Two days late: 20% off the max grade
• Three days late: 30% off the max grade
• Beyond that, no late submission is accepted
• Submissions:
• Submitted via the Blackboard system by the due date
• Demonstrated to the instructor within the following week
• For Reviews
• No late submissions
• Students may skip at most 4 reviews
• Submissions:
• Given to the instructor at the beginning of class
More about Projects
• A virtual machine is provided that includes the platform needed for the projects
• Ubuntu OS (Version 12.10)
• Hadoop platform (Version 1.1.0)
• Apache Pig (Version 0.10.0)
• Mahout library (Version 0.7)
• RHadoop
• In addition to other software packages
Next Step from You…
1. Form teams of two
Course Output: What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and how it works)
• How big data is managed in a scalable, efficient way
Open Source World’s Solution
Simplified Search Engine Architecture
[Diagram: a spider and runtime feeding a batch processing system on top of Hadoop]
Simplified Data Warehouse Architecture
[Diagram: business intelligence and database layers over a batch processing system on top of Hadoop]
Hadoop History
Jan 2006 – Doug Cutting joins Yahoo
Feb 2006 – Hadoop splits out of Nutch and Yahoo starts using it
Dec 2006 – Yahoo creating 100-node Webmap with Hadoop
Apr 2007 – Yahoo on 1000-node cluster
Dec 2007 – Yahoo creating 1000-node Webmap with Hadoop
Jan 2008 – Hadoop made a top-level Apache project
Sep 2008 – Hive added to Hadoop as a contrib project
Hadoop Introduction
Open Source Apache Project
http://hadoop.apache.org/
Book: http://oreilly.com/catalog/9780596521998/index.html
Written in Java
Does work with other languages
Runs on Linux, Windows, and more
Commodity hardware with high failure rate
Current Status of Hadoop
Largest cluster: 2000 nodes (8 cores, 4 TB disk each)
Used by 40+ companies / universities around the world
Yahoo, Facebook, etc.
Cloud computing donation from Google and IBM
Startup focusing on providing services for Hadoop: Cloudera
Hadoop Components
Hadoop Distributed File System (HDFS)
Hadoop Map-Reduce
Contrib projects:
Hadoop Streaming
Pig / JAQL / Hive
HBase
Hama / Mahout
Hadoop Distributed File System
Goals of HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB
Convenient Cluster Management
Load balancing
Node failures
Cluster expansion
Optimized for Batch Processing
Allows moving computation to the data
Maximize throughput
HDFS Architecture
HDFS Details
Data Coherency
Write-once-read-many access model
Client can only append to existing files
Files are broken up into blocks
Typically 128 MB block size
Each block replicated on multiple DataNodes
Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode
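As an illustration of that last point, a client can ask the NameNode for the block locations of a file through the standard Java API and then read from a DataNode directly; a minimal sketch, where the file path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/foodir/myfile.txt")); // illustrative path
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports the DataNodes holding a replica; the client reads from one of them directly
      System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
    }
  }
}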
HDFS User Interface
Java API
Command Line
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
Web Interface
http://host:port/dfshealth.jsp
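The same operations are available programmatically through the Java API mentioned above; a minimal sketch mirroring the two commands, with illustrative paths:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());    // connects using the cluster's config files

    fs.mkdirs(new Path("/foodir"));                          // like: hadoop dfs -mkdir /foodir

    // Like: hadoop dfs -cat /foodir/myfile.txt
    try (FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}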
More about HDFS
http://hadoop.apache.org/core/docs/current/hdfs_design.html
Hadoop Map-Reduce and Hadoop Streaming
Hadoop Map-Reduce Introduction
Map/Reduce works like a parallel Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
Framework does inter-node communication
Failure recovery, consistency, etc.
Load balancing, scalability, etc.
Fits a lot of batch processing applications
Log processing
Web index building
(Simplified) Map Reduce Review
Physical Flow
Example Code
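A minimal word-count job in Java, as one concrete instance of the Input | Map | Shuffle & Sort | Reduce | Output pipeline described above; the class names and driver setup are illustrative, not taken from the slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in an input line, emit <word, 1>
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the shuffle & sort phase groups the 1s by word; sum them per key
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count"); // constructor style of the Hadoop 1.x line used in the course VM
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);            // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}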
Hadoop Streaming
Allows writing Map and Reduce functions in any language
Hadoop Map/Reduce itself only accepts Java
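For reference, a typical streaming invocation with the Hadoop 1.x layout used in the course VM looks like the following; the input/output paths and the two Python script names are illustrative, not part of the slides:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /logs/input -output /logs/output \
  -mapper my_mapper.py -reducer my_reducer.py \
  -file my_mapper.py -file my_reducer.py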
Example: Log Processing
Generate #pageviews and #distinct users for each page each day
Input: timestamp url userid
Generate the number of page views
Map: emit < <date(timestamp), url>, 1 >
Reduce: add up the values for each key
Generate the number of distinct users
Map: emit < <date(timestamp), url, userid>, 1 >
Reduce: for the set of rows with the same <date(timestamp), url>, count the number of distinct users (like "uniq -c")
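A sketch of the map side of the page-view count in Java; the tab-separated field layout, the date handling, and the class name are assumptions for illustration. The reducer is the same sum-per-key pattern as in the word-count sketch earlier, and the distinct-user count simply uses <date, url, userid> as the map key instead.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map side of the page-view count: parse "timestamp url userid" and emit <<date, url>, 1>
public class PageViewMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t"); // timestamp, url, userid (assumed tab-separated)
    String day = fields[0].substring(0, 10);        // date(timestamp), assuming a yyyy-mm-dd prefix
    outKey.set(day + "\t" + fields[1]);             // composite key <date, url>
    context.write(outKey, ONE);
  }
}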
Example: Page Rank
In each Map/Reduce job:
Map: for each input <url, <eigenvalue, vector<link>>>, emit <link, eigenvalue(url)/#links>
Reduce: add all values up for each link to generate the new eigenvalue for that link
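A sketch of one such iteration in Java, following the slide's simplified formulation; the input line format and class names are assumptions, and a full implementation would also re-emit the link structure for the next iteration:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One PageRank iteration. Assumed input line format: "url <TAB> eigenvalue <TAB> link1,link2,..."
public class PageRankStep {

  public static class RankMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      double rank = Double.parseDouble(parts[1]);
      String[] links = parts[2].split(",");
      // Emit <link, eigenvalue(url) / #links> for every outgoing link
      for (String link : links) {
        context.write(new Text(link), new DoubleWritable(rank / links.length));
      }
    }
  }

  public static class RankReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text link, Iterable<DoubleWritable> contributions, Context context)
        throws IOException, InterruptedException {
      double newRank = 0.0;
      for (DoubleWritable c : contributions) {
        newRank += c.get(); // the sum of contributions becomes the new eigenvalue for this link
      }
      context.write(link, new DoubleWritable(newRank));
    }
  }
}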
TODO: Split Job Scheduler and Map-Reduce
TODO: Faster Map-Reduce
[Diagram: Mapper, Sender, Receiver, and Reducer stages; user code supplies the Map, Reduce, Compare, and Partition functions, while the framework sorts, merges, buffers to disk, and does checkpointing]
MapReduce and Hadoop Distributed File System
The Context: Big-data
Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
Google collects 270PB data in a month (2007), 20000PB a day (2008)
2010 census data is expected to be a huge gold mine of information
Data mining huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance.
We are in a knowledge economy.
Data is an important asset to any organization
Purpose of this talk
The Outline
Introduction to MapReduce
From CS Foundation to MapReduce
MapReduce programming model
Hadoop Distributed File System
Relevance to Undergraduate Curriculum
Demo (Internet access needed)
Our experience with the framework
Summary
References
MapReduce
What is MapReduce?
From CS Foundations to MapReduce
Word Counter and Result Table
[Diagram: a Main program runs a WordCounter (parse(), count()) over a DataCollection {web, weed, green, sun, moon, land, part, web, green, …} to fill a ResultTable]
Result table: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1
Multiple Instances of Word Counter
[Diagram: the same design, but Main now spawns 1..* Thread instances, each running a WordCounter (parse(), count()) over the data collection to produce the same result table]
Improve Word Counter for Performance
[Diagram: the WordCounter is split into a Parser and a Counter connected by a WordList; Main spawns 1..* threads for each, with separate counters so no lock is needed. The word list is a stream of KEY entries (web, weed, green, sun, moon, land, part, web, green, …) with their VALUEs]
Peta-scale Data
[Diagram: the same Parser/Counter design, now facing a peta-scale data collection]
Addressing the Scale Issue
A single machine cannot serve all the data: you need a distributed special (file) system
Large number of commodity hardware disks: say, 1000 disks of 1 TB each
Issue: with a mean time between failures (MTBF) corresponding to a failure rate of 1/1000, on average 1000 × (1/1000) = 1 of the above 1000 disks will be down at any given time
Thus failure is the norm and not an exception
The file system has to be fault-tolerant: replication, checksums
Data transfer bandwidth is critical (location of data)
Peta-scale Data
[Diagram repeated: the single-machine Parser/Counter design against a peta-scale data collection]
Peta-Scale Data is Commonly Distributed
[Diagram: several distributed data collections feed the same Parser/Counter pipeline, producing one result table]
Write Once Read Many (WORM) Data
[Diagram: the distributed data collections are write-once-read-many; the Parser/Counter pipeline and the KEY/VALUE word list are unchanged]
WORM Data is Amenable to Parallelism
1. Data with WORM characteristics yields to parallel processing
2. Data without dependencies yields to out-of-order processing
[Diagram: the distributed data collections processed by independent Parser/Counter threads]
Divide and Conquer: Provision Computing at Data Location
For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks
Our parse is a mapping operation (MAP: input <key, value> pairs)
Our count is a reduce operation (REDUCE: <key, value> pairs reduced)
[Diagram: each data collection is processed where it lives by its own Main / Thread / Parser / Counter instances, turning a DataCollection into a WordList and then a ResultTable]
Mapper and Reducer
Map Operation
[Diagram: the data collection is split into n splits to supply multiple processors; each Map task reads one split and emits <word, 1> pairs (web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, …) as KEY/VALUE lists]
Reduce Operation
[Diagram: each split (split 1 … split n) goes through a Map task, and the map outputs are fed to Reduce tasks that produce the final counts]
Large Scale Data Splits
[Diagram: large-scale data splits flow through Map (parse-hash) tasks that emit <key, 1>, then into Reducers (say, Count) that write partitions P-0000 (count1), P-0001 (count2), P-0002 (count3), …]
MapReduce Example in my Operating Systems Class
[Diagram: a terabyte-scale collection of words (Cat, Bat, Dog, other words) is split; each split goes through map and combine, and reduce tasks write part0, part1, and part2]
MapReduce Programming Model
MapReduce Characteristics
Classes of problems “mapreducable”
Scope of MapReduce
[Chart: a spectrum of parallelism, from pipelined instruction-level parallelism at small data sizes up to the very large data sizes where MapReduce operates]
Hadoop
What is Hadoop?
Basic Features: HDFS
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
Hadoop Distributed File System
[Diagram: an application talks to an HDFS client alongside the local file system (block size 2K); HDFS name nodes manage replicated 128M blocks, and a second view adds the name node's blockmap]
More details: We discuss this in great detail in my Operating Systems course
Relevance and Impact on Undergraduate Courses
Demo
Summary
References
Hive - SQL on top of Hadoop
Map-Reduce and SQL
• Map-Reduce is scalable
– SQL has a huge user base
– SQL is easy to code
• Solution: Combine SQL and Map-Reduce
– Hive on top of Hadoop (open source)
– Aster Data (proprietary)
– Greenplum (proprietary)
Hive
• A database/data warehouse on top of Hadoop
– Rich data types (structs, lists and maps)
– Efficient implementations of SQL filters, joins and group-bys on top of map reduce
• Allows users to access Hive data without using Hive
• Link:
– http://svn.apache.org/repos/asf/hadoop/hive/trunk/
Hive Architecture
[Diagram: Hive QL and the MetaStore sit on top of SerDe, Map Reduce, and HDFS]
Hive QL – Join
page_view (pageid, userid, time):
1, 111, 9:08:01
2, 111, 9:08:13
1, 222, 9:08:14
user (userid, age, gender):
111, 25, female
222, 32, male
page_view X user = pv_users (pageid, age):
1, 25
2, 25
1, 32
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
Map: tag each row with its table id and emit the join key
page_view rows emit <userid, <1, pageid>>: <111, <1,1>>, <111, <1,2>>, <222, <1,1>>
user rows emit <userid, <2, age>>: <111, <2,25>>, <222, <2,32>>
Shuffle & Sort: all values with the same userid end up at the same reducer
Reduce: for each userid, combine the page_view values with the user value to produce pv_users rows (pageid, age): (1, 25), (2, 25), (1, 32)
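This is the classic reduce-side join plan; Hive generates its own operators for it, but a generic Java sketch of the same idea looks like the following (the tab-separated layouts and the trick of telling the two tables apart by column count are assumptions made for the sketch):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: rows are tagged "1:" (page_view) or "2:" (user) and keyed by userid,
// so the reducer sees both tables' rows for one user together.
public class ReduceSideJoin {

  public static class TagMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      if (f.length == 3) {                          // page_view row: pageid, userid, time
        context.write(new Text(f[1]), new Text("1:" + f[0]));
      } else {                                      // user row: userid, age, gender
        context.write(new Text(f[0]), new Text("2:" + f[1]));
      }
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text userid, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> pageids = new ArrayList<String>();
      String age = null;
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("1:")) {
          pageids.add(s.substring(2));
        } else {
          age = s.substring(2);
        }
      }
      if (age == null) {
        return;                                     // inner join: drop page views with no matching user
      }
      for (String pageid : pageids) {
        context.write(new Text(pageid), new Text(age)); // one pv_users row (pageid, age) per page view
      }
    }
  }
}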
Hive QL – Group By
pv_users (pageid, age):
1, 25
2, 25
1, 32
2, 25
pageid_age_sum (pageid, age, count):
1, 25, 1
2, 25, 2
1, 32, 1
• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
Map: each pv_users row (pageid, age) emits the key <pageid, age> with value 1, e.g. <<1,25>, 1>, <<2,25>, 1>, <<1,32>, 1>, <<2,25>, 1>
Shuffle & Sort: rows with the same <pageid, age> key are grouped together
Reduce: sum the values per key to produce pageid_age_sum: (1, 25, 1), (1, 32, 1), (2, 25, 2)
Hive QL – Group By with Distinct
page_view (pageid, userid, time):
1, 111, 9:08:01
2, 111, 9:08:13
1, 222, 9:08:14
2, 111, 9:08:20
result (pageid, count_distinct_userid):
1, 2
2, 1
• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;
Hive QL – Group By with Distinct in Map Reduce
[Diagram: the same page_view rows flow through map, shuffle and reduce, with distinct userids counted per pageid in the reducer]
Hive QL: Order By
Hive Optimizations
Efficient Execution of SQL on top of Map-Reduce
(Simplified) Map Reduce Revisit
[Diagram: map and reduce tasks laid out across Machine 1 and Machine 2]
Merge Sequential Map Reduce Jobs
a (key, av): 1, 111
b (key, bv): 1, 222
c (key, cv): 1, 333
A first Map Reduce job joins a and b into AB (key, av, bv): 1, 111, 222
A second Map Reduce job joins AB with c into ABC (key, av, bv, cv): 1, 111, 222, 333
• SQL:
– FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
Share Common Read Operations
Load Balance Problem
[Diagram: pv_users rows (mostly pageid 1, age 25, plus a few pageid 2, age 32) are fed through Map-Reduce to produce pageid_age_partial_sum and then pageid_age_sum]
Map-side Aggregation / Combiner
[Diagram: Machine 1 maps its records <k1,v1> … <k3,v3> to pairs such as <male, 343> and <male, 123>, which a local combiner merges into <male, 466>; Machine 2 likewise combines <female, 128> and <female, 244> into <female, 372> before the shuffle]
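Hive's map-side aggregation keeps a hash table of partial results inside each map task; a minimal Hadoop sketch of that in-mapper pattern, where the input layout and class name are assumptions for the sketch:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper ("map-side") aggregation: each map task keeps partial counts in a hash table
// and only emits the aggregated pairs when it finishes, so <male, 343> and <male, 123>
// leave the machine as a single <male, 466>.
public class GenderCountMapper extends Mapper<Object, Text, Text, LongWritable> {
  private final Map<String, Long> partial = new HashMap<String, Long>();

  @Override
  protected void map(Object key, Text value, Context context) {
    String gender = value.toString().split("\t")[2];           // assumed layout: userid, age, gender
    Long current = partial.get(gender);
    partial.put(gender, current == null ? 1L : current + 1L);  // aggregate locally, emit nothing yet
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Long> e : partial.entrySet()) {     // flush the partial sums once per map task
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}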
Query Rewrite
• Predicate Push-down
– select * from (select * from t) where col1 = ‘2008’;
• Column Pruning
– select col1, col3 from (select * from t);
TODO: Column-based Storage and Map-side Join
MetaStore
• Stores Table/Partition properties:
– Table schema and SerDe library
– Table location on HDFS
– Logical partitioning keys and types
– Other information
• Thrift API
– Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
• Metadata can be stored as text files or even in a SQL backend
Hive CLI
• DDL:
– create table/drop table/rename table
– alter table add column
• Browsing:
– show tables
– describe table
– cat table
• Loading Data
• Queries
Web UI for Hive
• MetaStore UI:
– Browse and navigate all tables in the system
– Comment on each table and each column
– Also captures data dependencies
• HiPal:
– Interactively construct SQL queries by mouse clicks
– Supports projection, filtering, group by and joining
– Also supports …
Hive Query Language
• Philosophy
– SQL
– Map-Reduce with custom scripts (hadoop streaming)
• Query Operators
– Projections
– Equi-joins
– Group by
– Sampling
– Order By
Hive QL – Custom Map/Reduce Scripts
• Extended SQL:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
REDUCE map.dt, map.uid
USING 'reduce_script' AS (date, count);