0% found this document useful (0 votes)

8 views114 pages

Big Data Notes Pdf3

The document provides an overview of Big Data Analytics, focusing on the definition, characteristics, and technologies involved in managing large-scale data. It highlights the exponential growth of data volume, the complexity of data types, and the speed at which data is generated and processed. The course aims to teach students about Hadoop and related technologies, including practical coding projects and the use of advanced analytics tools.

Uploaded by

renutrivedi7

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views114 pages

Big Data Notes Pdf3

Uploaded by

renutrivedi7

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 114

Big Data Analytics

Introduction

WWW.AKTutor.in
Theme of this Course

Large-Scale Data Management

Big Data Analytics
Data Science and Analytics
• How to manage very large amounts of data and extract value and
knowledge from them
2

WWW.AKTutor.in
Introduction to Big Data

What is Big Data?

What makes data, “Big” Data?

WWW.AKTutor.in
Big Data Definition

• No single standard definition…

“Big Data” is data whose scale, diversity, and

complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…

WWW.AKTutor.in
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 2020
• From 0.8 zettabytes to 35zb

• Data volume is increasing exponentially

Exponential increase in
collected/generated data

WWW.AKTutor.in
Characteristics of Big Data:
2-Complexity (Varity)
• Various formats, types, and
structures

• Text, numerical, images, audio,

video, sequences, time series, social
media data, multi-dim arrays, etc…

• Static data vs. streaming data

• A single application can be

generating/collecting many types
of data

WWW.AKTutor.in
Characteristics of Big Data:
3-Speed (Velocity)
• Data is begin generated fast and need to be processed fast

• Online Data Analytics

• Late decisions  missing opportunities

• Examples
• E-Promotions: Based on your current location, your purchase history,
what you like  send promotions right now for store next to you

• Healthcare monitoring: sensors monitoring your activities and body 

any abnormal measurements require immediate reaction

WWW.AKTutor.in
Big Data: 3V’s

WWW.AKTutor.in
Some Make it 4V’s

WWW.AKTutor.in
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

WWW.AKTutor.in
Who’s Generating Big Data

Mobile devices
(tracking all objects all the time)

Social media and networks Scientific instruments

(all of us are generating data) (collecting all sorts of data)

Sensor technology and networks

(measuring all kinds of data)

• The progress and innovation is no longer hindered by the ability to collect data

• But, by the ability to manage, analyze, summarize, visualize, and discover

knowledge from the collected data in a timely manner and in a scalable fashion

WWW.AKTutor.in
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

WWW.AKTutor.in
What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

- Ad-hoc querying and reporting

- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

WWW.AKTutor.in
Value of Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications

• Traditional DW architectures (e.g.

Exadata, Teradata) are not well-
suited for big data apps

• Shared nothing, massively parallel

processing, scale out architectures
are well-suited for big data apps

WWW.AKTutor.in
Challenges in Handling Big Data

• The Bottleneck is in technology

• New architecture, algorithms, techniques are needed

• Also in technical skills

• Experts in using the new technology and dealing with big data

WWW.AKTutor.in
What Technology Do We Have
For Big Data ??

WWW.AKTutor.in
17

WWW.AKTutor.in
Big Data Technology

WWW.AKTutor.in
What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and works)
• How big data are managed in a scalable, efficient way

• Learn writing Hadoop jobs in different languages

• Programming Languages: Java, C, Python
• High-Level Languages: Apache Pig, Hive

• Learn advanced analytics tools on top of Hadoop

• RHadoop: Statistical tools for managing big data
• Mahout: Data mining and machine learning tools over big data

• Learn state-of-art technology from recent research papers

• Optimizations, indexing techniques, and other extensions to Hadoop

WWW.AKTutor.in
Course Logistics

• Web Page: http://web.cs.wpi.edu/~cs525/s13-MYE/

• Electronic WPI system: blackboard.wpi.edu

• Lectures
• Tuesday, Thursday: (4:00pm - 5:20pm)

WWW.AKTutor.in
Textbook & Reading List
• No specific textbook
• Big Data is a relatively new topic (so no fixed syllabus)

• Reading List
• We will cover the state-of-art technology from research papers in big
conferences
• Many Hadoop-related papers are available on the course website

• Related books:
• Hadoop, The Definitive Guide [pdf]

WWW.AKTutor.in
Requirements & Grading
• Seminar-Type Course
• Students will read research papers and present them (Reading List)
Done in teams
• Hands-on Course
of two
• No written homework or exams
• Several coding projects covering the entire semester

WWW.AKTutor.in
Requirements & Grading (Cont’d)
• Reviews
• When a team is presenting (not the instructor), the other students should prepare a
review on the presented paper
• Course website gives guidelines on how to make good reviews

• Reviews are done individually

WWW.AKTutor.in
Late Submission Policy
• For Projects
• One-day late  10% off the max grade
• Two-day late  20% off the max grade
• Three-day late  30% off the max grade
• Beyond that, no late submission is accepted
• Submissions:
• Submitted via blackboard system by the due date
• Demonstrated to the instructor within the week after

• For Reviews
• No late submissions
• Student may skip at most 4 reviews
• Submissions:
• Given to the instructor at the beginning of class

WWW.AKTutor.in
More about Projects
• A virtual machine is created including the needed platform for the projects
• Ubuntu OS (Version 12.10)
• Hadoop platform (Version 1.1.0)
• Apache Pig (Version 0.10.0)
• Mahout library (Version 0.7)
• Rhadoop
• In addition to other software packages

• Download it from the course website (link)

• Username and password will be sent to you

• Need Virtual Box (Vbox) [free]

WWW.AKTutor.in
Next Step from You…
1. Form teams of two

2. Visit the course website (Reading List), each team selects

its first paper to present (1st come 1st served)
• Send me your choices top 2/3 choices

3. You have until Jan 20th

• Otherwise, I’ll randomly form teams and assign papers

4. Use Blackboard “Discussion” forum for posts or for

searching for teammates

WWW.AKTutor.in
Course Output: What You
Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and works)
• How big data are managed in a scalable, efficient way

• Learn writing Hadoop jobs in different languages

• Programming Languages: Java, C, Python
• High-Level Languages: Apache Pig, Hive

• Learn advanced analytics tools on top of Hadoop

• RHadoop: Statistical tools for managing big data
• Mahout: Analytics and data mining tools over big data

• Learn state-of-art technology from recent research papers

• Optimizations, indexing techniques, and other extensions to Hadoop

WWW.AKTutor.in
Open Source World’s Solution

 Google File System – Hadoop Distributed FS

 Map-Reduce – Hadoop Map-Reduce
 Sawzall – Pig, Hive, JAQL
 Big Table – Hadoop HBase, Cassandra
 Chubby – Zookeeper

WWW.AKTutor.in
Simplified Search Engine
Architecture

Spider Runtime
Batch Processing System
on top of Hadoop

Internet Search Log Storage SE Web Server

WWW.AKTutor.in
Simplified Data Warehouse
Architecture

Business
Intelligence Database
Batch Processing System
on top fo Hadoop

Domain Knowledge View/Click/Events Log Storage Web Server

WWW.AKTutor.in
Hadoop History
 Jan 2006 – Doug Cutting joins Yahoo
 Feb 2006 – Hadoop splits out of Nutch and Yahoo starts
using it.
 Dec 2006 – Yahoo creating 100-node Webmap with
Hadoop
 Apr 2007 – Yahoo on 1000-node cluster
 Jan 2008 – Hadoop made a top-level Apache project
 Dec 2007 – Yahoo creating 1000-node Webmap with
Hadoop
 Sep 2008 – Hive added to Hadoop as a contrib project

WWW.AKTutor.in
Hadoop Introduction
 Open Source Apache Project
 http://hadoop.apache.org/
 Book: http://oreilly.com/catalog/9780596521998/index.html
 Written in Java
 Does work with other languages
 Runs on
 Linux, Windows and more
 Commodity hardware with high failure rate

WWW.AKTutor.in
Current Status of Hadoop
 Largest Cluster
 2000 nodes (8 cores, 4TB disk)
 Used by 40+ companies / universities over
the world
 Yahoo, Facebook, etc
 Cloud Computing Donation from Google and IBM
 Startup focusing on providing services for
hadoop
 Cloudera

WWW.AKTutor.in
Hadoop Components
 Hadoop Distributed File System (HDFS)
 Hadoop Map-Reduce
 Contributes
 Hadoop Streaming
 Pig / JAQL / Hive
 HBase
 Hama / Mahout

WWW.AKTutor.in
Hadoop Distributed File System

WWW.AKTutor.in
Goals of HDFS
 Very Large Distributed File System
 10K nodes, 100 million files, 10 PB
 Convenient Cluster Management
 Load balancing
 Node failures
 Cluster expansion
 Optimized for Batch Processing
 Allow move computation to data
 Maximize throughput

WWW.AKTutor.in
HDFS Architecture

WWW.AKTutor.in
HDFS Details
 Data Coherency
 Write-once-read-many access model
 Client can only append to existing files
 Files are broken up into blocks
 Typically 128 MB block size
 Each block replicated on multiple DataNodes
 Intelligent Client
 Client can find location of blocks
 Client accesses data directly from DataNode

WWW.AKTutor.in
WWW.AKTutor.in
HDFS User Interface
 Java API
 Command Line
 hadoop dfs -mkdir /foodir
 hadoop dfs -cat /foodir/myfile.txt

 hadoop dfs -rm /foodir myfile.txt

 hadoop dfsadmin -report

 hadoop dfsadmin -decommission datanodename

 Web Interface
 http://host:port/dfshealth.jsp

WWW.AKTutor.in
More about HDFS
 http://hadoop.apache.org/core/docs/current/hdfs_design.html

 Hadoop FileSystem API

 HDFS
 Local File System

 Kosmos File System (KFS)

 Amazon S3 File System

WWW.AKTutor.in
Hadoop Map-Reduce and
Hadoop Streaming

WWW.AKTutor.in
WWW.AKTutor.in
(Simplified) Map Reduce Review
Machine 1

<k1, v1> <nk1, nv1> <nk1, nv1> <nk1, nv1>

<nk1, 2>
<k2, v2> <nk2, nv2> <nk3, nv3> <nk1, nv6>
<nk3, 1>
<k3, v3> <nk3, nv3> <nk1, nv6> <nk3, nv3>

Local Global Local Local

Map Shuffle Sort Reduce
Machine 2

<k4, v4> <nk2, nv4> <nk2, nv4> <nk2, nv4>

<k5, v5> <nk2, nv5> <nk2, nv5> <nk2, nv5> <nk2, 3>
<k6, v6> <nk1, nv6> <nk2, nv2> <nk2, nv2>

WWW.AKTutor.in
Physical Flow

WWW.AKTutor.in
Example Code

WWW.AKTutor.in
Hadoop Streaming
 Allow to write Map and Reduce functions in any
languages
 Hadoop Map/Reduce only accepts Java

 Example: Word Count

 hadoop streaming
-input /user/zshao/articles
-mapper „tr “ ” “\n”‟
-reducer „uniq -c„
-output /user/zshao/
-numReduceTasks 32

WWW.AKTutor.in
Example: Log Processing
 Generate #pageview and #distinct users
for each page each day
 Input: timestamp url userid
 Generate the number of page views
 Map: emit < <date(timestamp), url>, 1>
 Reduce: add up the values for each row
 Generate the number of distinct users
 Map: emit < <date(timestamp), url, userid>, 1>
 Reduce: For the set of rows with the same
<date(timestamp), url>, count the number of distinct users by
“uniq –c"

WWW.AKTutor.in
Example: Page Rank
 In each Map/Reduce Job:
 Map: emit <link, eigenvalue(url)/#links>
for each input: <url, <eigenvalue, vector<link>> >
 Reduce: add all values up for each link, to generate the new
eigenvalue for that link.

 Run 50 map/reduce jobs till the eigenvalues are

stable.

WWW.AKTutor.in
TODO: Split Job Scheduler and Map-
Reduce

 Allow easy plug-in of different scheduling

algorithms
 Scheduling based on job priority, size, etc
 Scheduling for CPU, disk, memory, network bandwidth
 Preemptive scheduling
 Allow to run MPI or other jobs on the same
cluster
 PageRank is best done with MPI

WWW.AKTutor.in
TODO: Faster Map-Reduce
Mapper Sender Receiver Reducer
R1
R2 sort
map R3 Merge
… sort Reduce
R1
map R2 sort
R3
…

Sender Receiver
Receiver
Mapper
Reducer
Sendermerge
doesNuser
calls
calls flows
user
flow into 1, call
functions:
functions:
control
user function Compare to sort, dump
Map and and
Compare Partition
Reduce
buffer to disk, and do checkpointing

WWW.AKTutor.in
MapReduce and Hadoop
Distributed File System
54

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
The Context: Big-data
55

 Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
 Google collects 270PB data in a month (2007), 20000PB a day (2008)
 2010 census data is expected to be a huge gold mine of information
 Data mining huge amounts of data collected in a wide range of domains
from astronomy to healthcare has become essential for planning and
performance.
 We are in a knowledge economy.
 Data is an important asset to any organization

 Discovery of knowledge; Enabling discovery; annotation of data

 We are looking at newer

 programming models, and

 Supporting algorithms and data structures.

 NSF refers to it as “data-intensive computing” and industry calls it “big-

data” and “cloud computing”
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Purpose of this talk
56

 To provide a simple introduction to:

 “The big-data computing” : An important
advancement that has a potential to impact
significantly the CS and undergraduate curriculum.
 A programming model called MapReduce for
processing “big-data”
 A supporting file system called Hadoop Distributed
File System (HDFS)
 To encourage educators to explore ways to infuse
relevant concepts of this emerging area into their
curriculum.

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
The Outline
57

 Introduction to MapReduce
 From CS Foundation to MapReduce
 MapReduce programming model
 Hadoop Distributed File System
 Relevance to Undergraduate Curriculum
 Demo (Internet access needed)
 Our experience with the framework
 Summary
 References

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
MapReduce
58

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
What is MapReduce?
59

 MapReduce is a programming model Google has used

successfully is processing its “big-data” sets (~ 20000 peta
bytes per day)
 Users specify the computation in terms of a map and a
reduce function,
 Underlying runtime system automatically parallelizes the
computation across large-scale clusters of machines, and
 Underlying system also handles machine failures,
efficient communications, and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce:
simplified data processing on large clusters. Communication of
ACM 51, 1 (Jan. 2008), 107-113.

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
From CS Foundations to MapReduce
60

Consider a large data collection:

{web, weed, green, sun, moon, land, part, web,
green,…}
Problem: Count the occurrences of the different words
in the collection.

Lets design a solution for this problem;

 We will start from scratch
 We will add and relax constraints
 We will do incremental design, improving the solution for
performance and scalability

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Word Counter and Result Table
61
{web, weed, green, sun, moon, land, part, web 2
web, green,…}
weed 1

green 2
Data Main
sun 1
collection
moon 1

land 1

WordCounter part 1

parse( )
count( )

DataCollection ResultTable

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Multiple Instances of Word Counter
62

web 2

weed 1

green 2
Data
Main sun 1
collection
moon 1
Thread
land 1
1..*
WordCounter part 1

parse( )
count( )

DataCollection ResultTable Observe:

Multi-thread
Lock on shared data

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Improve Word Counter for Performance
63 N No need for lock
Main oweb 2

weed 1

Data green 2
collection
sun 1

moon 1
Thread
land 1
1..*
1..* part 1
Parser Counter

WordList
Separate counters
DataCollection ResultTable

KEY web weed green sun moon land part web green …….

VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Peta-scale Data
64
Main
web 2

weed 1

green 2

Data sun 1

collection moon 1
Thread
land 1
1..*
1..* part 1
Parser Counter

DataCollection WordList ResultTable

KEY web weed green sun moon land part web green …….

VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Addressing the Scale Issue
65

 Single machine cannot serve all the data: you need a distributed
special (file) system
 Large number of commodity hardware disks: say, 1000 disks 1TB
each
 Issue: With Mean time between failures (MTBF) or failure rate of
1/1000, then at least 1 of the above 1000 disks would be down at a
given time.
 Thus failure is norm and not an exception.
 File system has to be fault-tolerant: replication, checksum
 Data transfer bandwidth is critical (location of data)

 Critical aspects: fault tolerance + replication + load balancing,

monitoring
 Exploit parallelism afforded by splitting parsing and counting
 Provision and locate computing at data locations

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Peta-scale Data
66
Main
web 2

weed 1

green 2

Data sun 1

collection moon 1
Thread
land 1
1..*
1..* part 1
Parser Counter

DataCollection WordList ResultTable

KEY web weed green sun moon land part web green …….

VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Data Peta Scale Data is Commonly Distributed
collection
67
Main
web 2
Data
collection weed 1

green 2

Data sun 1
collection
moon 1
Thread
land 1
Data 1..*
part 1
1..*
collection Parser Counter

Data DataCollection WordList ResultTable

collection Issue: managing the

large scale data
KEY web weed green sun moon land part web green …….

VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Data Write Once Read Many (WORM) data
collection
68
Main
web 2
Data
collection weed 1

green 2

Data sun 1
collection
moon 1
Thread
land 1
Data 1..*
part 1
1..*
collection Parser Counter

Data DataCollection WordList ResultTable

collection

KEY web weed green sun moon land part web green …….

VALUE
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Data WORM Data is Amenable to Parallelism
collection
69
Main
Data
collection
1. Data with WORM
characteristics : yields
Data to parallel processing;
collection 2. Data without
Thread dependencies: yields
to out of order
Data 1..*
processing
1..*
collection Parser Counter

Data DataCollection WordList ResultTable

collection

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Divide and Conquer: Provision Computing at Data Location
70
Main For our example,
#1: Schedule parallel parse tasks
Data Thread
#2: Schedule parallel count tasks
collection
1..*
1..*
Parser Counter

One node DataCollection WordList ResultTable This is a particular solution;

Lets generalize it:
Main

Data Thread
Our parse is a mapping operation:
collection Parser
1..*
1..*

Counter
MAP: input  <key, value> pairs
DataCollection WordList ResultTable

Main
Our count is a reduce operation:
REDUCE: <key, value> pairs reduced
Data Thread

collection
1..*
1..*

Map/Reduce originated from Lisp

Parser Counter

DataCollection WordList ResultTable

But have different meaning here

Main

Runtime adds distribution + fault

Data tolerance + replication + monitoring +
collection
Thread

1..*

load balancing to your base application!

1..*
Parser Counter

DataCollection WordList ResultTable

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Mapper and Reducer
71

Remember: MapReduce is simplified processing for larger data sets:

MapReduce Version of WordCount Source code
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Map Operation
72
web 1

MAP: Input data  <key, value> pair weed

green
1

1 web 1
sun 1 weed 1
moon 1 green 1
land 1 sun1 1
web
part 1
Map web
weed
1
moon
1
land
1

1web 1 1
Data green
green
1 part
1weed 1 1
Collection: split1 Split the data to web … 1
sun
1 web
moon 1green 1 1
Supply multiple weedKEY 1 VALUEgreen
land 1sun 1 1

processors green 1
part … 1moon 1 1
sun 1 KEY
web 1land VALUE
1
moon 1
green 1part 1

Data Map land 1

… 1web 1
part 1
Collection: split 2 web 1
KEY green
VALUE 1
……

… 1

…
green 1 KEY VALUE
… 1

KEY VALUE

Data
Collection: split n

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Reduce Operation
73

MAP: Input data  <key, value> pair

REDUCE: <key, value> pair  <result>

Reduce
Map
Data
Collection: split1 Split the data to
Supply multiple
processors
Reduce
Data Map
Collection: split 2
……

Data
…
Map Reduce
Collection: split n

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Large scale data splits
Map <key, 1> Reducers (say, Count)

Parse-hash

Count
P-0000
, count1

Parse-hash

Count
P-0001
, count2
Parse-hash

Count
P-0002
Parse-hash ,count3

CCSCNE 2009 Palttsburg, April 24 2009 74 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
MapReduce Example in my operating systems class
75

combine part0
map reduce
Cat split

reduce part1
split map combine

Bat

map part2
split combine reduce
Dog

split map
Other
Words
(size:
TByte)
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
MapReduce Programming
Model
76

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
MapReduce programming model
77

 Determine if the problem is parallelizable and solvable using

MapReduce (ex: Is the data WORM?, large data set).
 Design and implement solution as Mapper classes and
Reducer class.
 Compile the source code with hadoop core.
 Package the code as jar executable.
 Configure the application (job) as to the number of mappers
and reducers (tasks), input and output streams
 Load the data (or use it on previously available data)
 Launch the job and monitor.
 Study the result.
 Detailed steps.

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
MapReduce Characteristics
78

 Very large scale data: peta, exa bytes

 Write once and read many data: allows for parallelism without
mutexes
 Map and Reduce are the main operations: simple code
 There are other supporting operations such as combine and
partition (out of the scope of this talk).
 All the map should be completed before reduce operation starts.
 Map and reduce operations are typically performed by the same
physical processor.
 Number of map tasks and reduce tasks are configurable.
 Operations are provisioned near the data.
 Commodity hardware and storage.
 Runtime takes care of splitting and moving data for operations.
 Special distributed file system. Example: Hadoop Distributed File
System and Hadoop Runtime.

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Classes of problems “mapreducable”
79

 Benchmark for comparing: Jim Gray’s challenge on data-

intensive computing. Ex: “Sort”
 Google uses it (we think) for wordcount, adwords, pagerank,
indexing data.
 Simple algorithms such as grep, text-indexing, reverse
indexing
 Bayesian classification: data mining domain
 Facebook uses it for various operations: demographics
 Financial services use it for analytics
 Astronomy: Gaussian analysis for locating extra-terrestrial
objects.
 Expected to play a critical role in semantic web and web3.0

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Scope of MapReduce
80
Data size: small
Pipelined Instruction level

Concurrent Thread level

Service Object level

Indexed File level

Mega Block level

Virtual System Level

Data size: large

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Hadoop
81

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
What is Hadoop?
82

 At Google MapReduce operation are run on a special

file system called Google File System (GFS) that is
highly optimized for this purpose.
 GFS is not open source.
 Doug Cutting and Yahoo! reverse engineered the
GFS and called it Hadoop Distributed File System
(HDFS).
 The software framework that supports HDFS,
MapReduce and other related entities is called the
project Hadoop or simply Hadoop.
 This is open source and distributed by Apache.

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Basic Features: HDFS
83

 Highly fault-tolerant
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
 Can be built out of commodity hardware

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Hadoop Distributed File System
84

HDFS Server Master node

HDFS Client
Application

Local file
system
Block size: 2K
Name Nodes
Block size: 128M
More details: We discuss this in great detail in my Operating Replicated
Systems course
CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Hadoop Distributed File System
85

HDFS Server Master node

blockmap

HDFS Client heartbeat

Application

WWW.AKTutor.in
Relevance and Impact on Undergraduate courses
86

 Data structures and algorithms: a new look at traditional

algorithms such as sort: Quicksort may not be your
choice! It is not easily parallelizable. Merge sort is better.
 You can identify mappers and reducers among your
algorithms. Mappers and reducers are simply place
holders for algorithms relevant for your applications.
 Large scale data and analytics are indeed concepts to
reckon with similar to how we addressed “programming
in the large” by OO concepts.
 While a full course on MR/HDFS may not be warranted,
the concepts perhaps can be woven into most courses in
our CS curriculum.

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Demo
87

 VMware simulated Hadoop and MapReduce demo

 Remote access to NEXOS system at my Buffalo office
 5-node HDFS running HDFS on Ubuntu 8.04
 1 –name node and 4 data-nodes
 Each is an old commodity PC with 512 MB RAM,
120GB – 160GB external memory
 Zeus (namenode), datanodes: hermes, dionysus,
aphrodite, athena

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
Summary
88

 We introduced MapReduce programming model for

processing large scale data
 We discussed the supporting Hadoop Distributed
File System
 The concepts were illustrated using a simple example
 We reviewed some important parts of the source
code for the example.
 Relationship to Cloud Computing

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
References
89

1. Apache Hadoop Tutorial: http://hadoop.apache.org

http://hadoop.apache.org/core/docs/current/mapred_tu
torial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce:
simplified data processing on large clusters.
Communication of ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera Videos by Aaron Kimball:
http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html

CCSCNE 2009 Palttsburg, April 24 2009 B.Ramamurthy & K.Madurai

WWW.AKTutor.in
WWW.AKTutor.in
Hive - SQL on top of Hadoop

WWW.AKTutor.in
Map-Reduce and SQL
• Map-Reduce is scalable
– SQL has a huge user base
– SQL is easy to code
• Solution: Combine SQL and Map-Reduce
– Hive on top of Hadoop (open source)
– Aster Data (proprietary)
– Green Plum (proprietary)

WWW.AKTutor.in
Hive
• A database/data warehouse on top of Hadoop
– Rich data types (structs, lists and maps)
– Efficient implementations of SQL filters, joins and group-
by’s on top of map reduce
• Allow users to access Hive data without using
Hive
• Link:
– http://svn.apache.org/repos/asf/hadoop/hive/
trunk/

WWW.AKTutor.in
Hive Architecture
Map Reduce HDFS

Web UI Hive CLI

Mgmt, etc Browsing Queries DDL

MetaStore Hive QL

Parser Planner Execution

SerDe

Thrift Jute JSON

Thrift API

WWW.AKTutor.in
Hive QL – Join
page_view pv_users
user
pag use time pag age
eid rid use age gende eid
X rid r =
1 111 9:08: 1 25
01 111 25 femal
e 2 25
2 111 9:08: 1 32
222 32 male
• SQL: 13
1INSERT
222INTO9:08:
TABLE pv_users
SELECT pv.pageid,
14 u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

WWW.AKTutor.in
Hive QL – Join in Map Reduce
page_view
pv_users
pag use time key valu
eid rid key value e pag age
1 111 9:08: 111 <1,1 111 <1,1 eid
01 > > 1 25
2 111 9:08: Shuffle 111 <1,2
111 <1,2 Sort Reduce 2 25
user 13 Map
> >
use
1 age
222 gende
9:08: key valu key <2,2
111 valu
222 <1,1
rid r
14 e e
5> page age
>
111 25 femal 111 <2,2 222 <1,1 id
e 5> > 1 32
222 32 male 222 <2,3 222 <2,3
2> 2>
WWW.AKTutor.in
Hive QL – Group By
pv_users
pageid_age_sum
pag age
pag age Co
eid
eid unt
1 25
1 25 1
2 25
2 25 2
1 32
1 32 1
• SQL: 2 25
▪ INSERT INTO TABLE pageid_age_sum
▪ SELECT pageid, age, count(1)
▪ FROM pv_users
– GROUP BY pageid, age;

WWW.AKTutor.in
Hive QL – Group By in Map Reduce
pv_users pageid_age_sum
pag age key valu key valu pag age Co
eid e e eid unt
1 25 <1, 1 <1, 1 1 25 1
2 25 25> 25> 1 32 1
<2, 1 Shuffle <1, 1 Reduce
Map
Sort
pag age 25>
key valu 32>
key valu
eid e e pag age Cou
eid nt
1 32 <1, 1 <2, 1
32> 25> 2 25 2
2 25
<2, 1 <2, 1
25> 25>

WWW.AKTutor.in
Hive QL – Group By with Distinct
page_view
pag user time result
eid id page count_distinct
1 111 9:08: id _userid
01 1 2
2 111 9:08:
2 1
13
1 222 9:08:
• SQL
14
– SELECT pageid, COUNT(DISTINCT userid)
2– FROM
111page_view
9:08: GROUP BY pageid
20

WWW.AKTutor.in
Hive QL – Group By with Distinct in Map
Reduce
page_view

page useri time key v page cou

id d <1,111 id nt
1 111 9:08: > 1 2
01 Shuffle <1,22
and Reduce
2 111 9:08: 2>
page useri time Sort
13 key v
id d
page cou
1 222 9:08: <2,111 id nt
14 >
2 1
2Shuffle
111key is9:08: <2,111
a prefix of the sort key.
20 >

WWW.AKTutor.in
Hive QL: Order By
page_view

page useri time key v page useri time

id d id d
<1,111 9:08:
2 111 9:08: > 01 1 111 9:08:
13 Shuffle <2,111 9:08: 01
and Reduce 2 111 9:08:
1 111 9:08: > 13
page useri time Sort 13
01 key v page useri time
id d
2 111 9:08: <1,22 9:08: id d
20 2> 14 1 222 9:08:
1Shuffle
222 9:08: <2,111 9:08: 14
randomly.
14 > 20 2 111 9:08:
20

WWW.AKTutor.in
Hive Optimizations
Efficient Execution of SQL on top of Map-Reduce

WWW.AKTutor.in
(Simplified) Map Reduce Revisit
Machine 1

<k1, v1> <nk1, nv1> <nk1, nv1> <nk1, nv1>

<nk1, 2>
<k2, v2> <nk2, nv2> <nk3, nv3> <nk1, nv6>
<nk3, 1>
<k3, v3> <nk3, nv3> <nk1, nv6> <nk3, nv3>

Local Global Local Local

Map Shuffle Sort Reduce

Machine 2

<k4, v4> <nk2, nv4> <nk2, nv4> <nk2, nv4>

<k5, v5> <nk2, nv5> <nk2, nv5> <nk2, nv5> <nk2, 3>
<k6, v6> <nk1, nv6> <nk2, nv2> <nk2, nv2>

WWW.AKTutor.in
Merge Sequential Map Reduce Jobs

ke av AB
y
ke av bv
1 B 111 ABC
y
Map Reduce
ke bv 1 11 22 ke av bv cv
y 1C 2 Map Reduce
y
1 22 ke cv 1 11 22 33
2 y 1 2 3
• SQL: 1 33
– FROM (a join b on a.key
3 = b.key) join c on a.key =
c.key SELECT …

WWW.AKTutor.in
Share Common Read Operations

pag ag pag cou

• Extended SQL
▪ FROM pv_users
eid e Map Reduce eid nt ▪ INSERT INTO TABLE
pv_pageid_sum
1 25 1 1 ▪ SELECT pageid, count(1)
2 32 2 1 ▪ GROUP BY pageid
▪ INSERT INTO TABLE pv_age_sum
▪ SELECT age, count(1)
pag ag age cou ▪ GROUP BY age;
eid e Map Reduce nt
1 25 25 1
2 32 32 1

WWW.AKTutor.in
Load Balance Problem

pv_users
pag ag
eid e pageid_age_partial_sum
pageid_age_sum

1 25 pag ag
ag cou
cou
Map-Reduce
1 25 eid ee nt
nt
1 25
25 24
1 25
2 32
32 11
2 32
1 25 2
1 25

WWW.AKTutor.in
Map-side Aggregation / Combiner
Machine 1

<k1, v1>
<male, 343> <male, 343> <male, 343>
<k2, v2> <male, 466>
<female, 128> <male, 123> <male, 123>
<k3, v3>

Local Global Local Local

Map Shuffle Sort Reduce

Machine 2

<k4, v4>
<male, 123> <female, 128> <female, 128>
<k5, v5> <female, 372>
<female, 244> <female, 244> <female, 244>
<k6, v6>

WWW.AKTutor.in
Query Rewrite

• Predicate Push-down
– select * from (select * from t) where col1 = ‘2008’;
• Column Pruning
– select col1, col3 from (select * from t);

WWW.AKTutor.in
TODO: Column-based Storage and Map-side Join

url page IP url clicked viewed

quality
http://a.com/ 12 145
http://a.co 90 65.1.2.3
m/ http://b.com/ 45 383
http://b.co 20 68.9.0.81
http://c.com/ 23 67
m/
http://c.co 68 11.3.85.1
m/

WWW.AKTutor.in
MetaStore
• Stores Table/Partition properties:
– Table schema and SerDe library
– Table Location on HDFS
– Logical Partitioning keys and types
– Other information
• Thrift API
– Current clients in Php (Web Interface), Python (old CLI),
Java (Query Engine and CLI), Perl (Tests)
• Metadata can be stored as text files or even in a
SQL backend

WWW.AKTutor.in
Hive CLI

• DDL:
– create table/drop table/rename table
– alter table add column
• Browsing:
– show tables
– describe table
– cat table
• Loading Data
• Queries

WWW.AKTutor.in
Web UI for Hive

• MetaStore UI:
– Browse and navigate all tables in the system
– Comment on each table and each column
– Also captures data dependencies
• HiPal:
– Interactively construct SQL queries by mouse clicks
– Support projection, filtering, group by and joining
– Also support

WWW.AKTutor.in
Hive Query Language
• Philosophy
– SQL
– Map-Reduce with custom scripts (hadoop streaming)

• Query Operators
– Projections
– Equi-joins
– Group by
– Sampling
– Order By

WWW.AKTutor.in
Hive QL – Custom Map/Reduce Scripts

• Extended SQL:
• FROM (
• FROM pv_users
• MAP pv_users.userid, pv_users.date
• USING 'map_script' AS (dt, uid)
• CLUSTER BY dt) map
• INSERT INTO TABLE pv_users_reduced
• REDUCE map.dt, map.uid
• USING 'reduce_script' AS (date, count);

• Map-Reduce: similar to hadoop streaming

WWW.AKTutor.in

Force FX-8CS Service Manual - en
83% (6)
Force FX-8CS Service Manual - en
282 pages
20IT503 - Big Data Analytics - Unit4
No ratings yet
20IT503 - Big Data Analytics - Unit4
73 pages
Big Data Course Overview
No ratings yet
Big Data Course Overview
97 pages
BDA - Unit-1
No ratings yet
BDA - Unit-1
24 pages
Bda U1
No ratings yet
Bda U1
80 pages
Big Data Analytics
No ratings yet
Big Data Analytics
20 pages
Hadoop PPT
100% (1)
Hadoop PPT
25 pages
MCAD2232 (PRESS) BIG DATA and Its Applications
No ratings yet
MCAD2232 (PRESS) BIG DATA and Its Applications
140 pages
Big Data Analytics for B.Tech Students
No ratings yet
Big Data Analytics for B.Tech Students
119 pages
Ashish Presentation Stage1 Modify LR
No ratings yet
Ashish Presentation Stage1 Modify LR
24 pages
Hadoop Ecosystem Overview
No ratings yet
Hadoop Ecosystem Overview
229 pages
B.Tech. CS - CE and CSE Syllabus 3rd Year 2024-25
No ratings yet
B.Tech. CS - CE and CSE Syllabus 3rd Year 2024-25
2 pages
Big Data Analytics for B.Tech Students
No ratings yet
Big Data Analytics for B.Tech Students
134 pages
Prepared by Richa Btech (Cse) 6 Sem Dav University Jalandhar
No ratings yet
Prepared by Richa Btech (Cse) 6 Sem Dav University Jalandhar
30 pages
Big Data Analytics Syllabus - 22UAI603C - 204 - 2025
No ratings yet
Big Data Analytics Syllabus - 22UAI603C - 204 - 2025
2 pages
Big Data Training
No ratings yet
Big Data Training
244 pages
RMK Group Data Analytics Guide
No ratings yet
RMK Group Data Analytics Guide
72 pages
L8 Big Data Management en
No ratings yet
L8 Big Data Management en
58 pages
Unit 1
No ratings yet
Unit 1
19 pages
Updated Unit-2
0% (1)
Updated Unit-2
55 pages
Syllabus
No ratings yet
Syllabus
3 pages
BDA Syllabus - Sem VII - Mumbai University
No ratings yet
BDA Syllabus - Sem VII - Mumbai University
3 pages
Unit 1 BDA
No ratings yet
Unit 1 BDA
43 pages
Module - 1
No ratings yet
Module - 1
84 pages
Bda U2
No ratings yet
Bda U2
68 pages
Big Data Processing: Jiaul Paik
No ratings yet
Big Data Processing: Jiaul Paik
47 pages
COMP9313: Big Data Management
No ratings yet
COMP9313: Big Data Management
79 pages
Bigdata Hadoop Spark - Python
No ratings yet
Bigdata Hadoop Spark - Python
8 pages
21cs71BDA Question Bank
No ratings yet
21cs71BDA Question Bank
4 pages
BIG Data - Unit - 1
No ratings yet
BIG Data - Unit - 1
24 pages
4.7.1 Bda-Mba
No ratings yet
4.7.1 Bda-Mba
2 pages
Big Data Analytics
No ratings yet
Big Data Analytics
61 pages
Topic 1 Big Data Technologies
No ratings yet
Topic 1 Big Data Technologies
5 pages
Big Data Analytics: by S. P. Sajjan
No ratings yet
Big Data Analytics: by S. P. Sajjan
21 pages
BDH Admin Ebook
No ratings yet
BDH Admin Ebook
807 pages
Bba13 Notes BDF Unit 1
No ratings yet
Bba13 Notes BDF Unit 1
3 pages
IIT Kharagpur Data Science PDF
No ratings yet
IIT Kharagpur Data Science PDF
22 pages
SEM VII BDA Syllabus Theory
No ratings yet
SEM VII BDA Syllabus Theory
4 pages
B2. Introduction To Big Data With Spark and Hadoop - Coursera
No ratings yet
B2. Introduction To Big Data With Spark and Hadoop - Coursera
12 pages
Big Data Analytics
No ratings yet
Big Data Analytics
36 pages
Big Data - Road Map
No ratings yet
Big Data - Road Map
22 pages
BIG DATA Class 1 1741496163
No ratings yet
BIG DATA Class 1 1741496163
108 pages
Big Data & Hadoop Mastery Guide
No ratings yet
Big Data & Hadoop Mastery Guide
2 pages
Big Data & Hadoop - Course Curriculum
No ratings yet
Big Data & Hadoop - Course Curriculum
6 pages
Big Data Analytics Overview
No ratings yet
Big Data Analytics Overview
17 pages
Programme Elective Iv
No ratings yet
Programme Elective Iv
3 pages
Big Data Analytics
No ratings yet
Big Data Analytics
3 pages
BD Imp Ques 1
No ratings yet
BD Imp Ques 1
22 pages
Big Data
No ratings yet
Big Data
25 pages
Big Data Syllabus
No ratings yet
Big Data Syllabus
6 pages
Big Data Analytics
No ratings yet
Big Data Analytics
131 pages
Big Data Analytics Course Syllabus
No ratings yet
Big Data Analytics Course Syllabus
9 pages
Big Data Analytics Course Guide
No ratings yet
Big Data Analytics Course Guide
31 pages
Business English Vocabulary Guide
No ratings yet
Business English Vocabulary Guide
27 pages
Arrays: Shristi Technology Labs
No ratings yet
Arrays: Shristi Technology Labs
9 pages
5G Wireless Technology: Millimeter Wave Health Effects
No ratings yet
5G Wireless Technology: Millimeter Wave Health Effects
5 pages
Manual7298631 Dell Color Management User S Guide For Macos
No ratings yet
Manual7298631 Dell Color Management User S Guide For Macos
13 pages
IJRPR15453
No ratings yet
IJRPR15453
7 pages
Phones 2017 PDF
No ratings yet
Phones 2017 PDF
161 pages
Automata State Elimination Method
No ratings yet
Automata State Elimination Method
3 pages
Unit 2 - Esp in Elt - Complete
No ratings yet
Unit 2 - Esp in Elt - Complete
35 pages
Modern Programming Tools and Techniques: DCAP505
No ratings yet
Modern Programming Tools and Techniques: DCAP505
28 pages
Machine Learning Assignment Questions
No ratings yet
Machine Learning Assignment Questions
2 pages
Multi2sim Quickstart
No ratings yet
Multi2sim Quickstart
10 pages
Biology of Stem Cells: An Overview: Pedro C. Chagastelles and Nance B. Nardi
No ratings yet
Biology of Stem Cells: An Overview: Pedro C. Chagastelles and Nance B. Nardi
5 pages
The Feasibility Study of Ballitaw
No ratings yet
The Feasibility Study of Ballitaw
2 pages
Nursing Theory Foundations
No ratings yet
Nursing Theory Foundations
60 pages
Aly 8520 To Aly 8526 12V PL
No ratings yet
Aly 8520 To Aly 8526 12V PL
4 pages
Post Test Questionnaire EOC EC
No ratings yet
Post Test Questionnaire EOC EC
4 pages
Hitch Climbers Guide
No ratings yet
Hitch Climbers Guide
28 pages
Learning Objectives: Introduction W
No ratings yet
Learning Objectives: Introduction W
238 pages
International Federation of Fruit Juice Froducers I.FJ.U.-Analyses No. 31 (Characterization by Thin-Layer Chromatography On Cellulose)
No ratings yet
International Federation of Fruit Juice Froducers I.FJ.U.-Analyses No. 31 (Characterization by Thin-Layer Chromatography On Cellulose)
7 pages
Definition of Tax MCQs
No ratings yet
Definition of Tax MCQs
2 pages
The Present Continuous
No ratings yet
The Present Continuous
4 pages
ICT Audit Tender for FSB
No ratings yet
ICT Audit Tender for FSB
3 pages
MCQ
67% (3)
MCQ
274 pages
Grinding Machines
100% (2)
Grinding Machines
140 pages
Tiếng Anh thầy Tiểu Đạt - chuyên luyện thi Đại học Mr. Tieu Dat's English Academy Thầy Lưu Tiến Đạt (thầy Tiểu Đạt) Chuyên gia luyện thi môn Tiếng Anh
No ratings yet
Tiếng Anh thầy Tiểu Đạt - chuyên luyện thi Đại học Mr. Tieu Dat's English Academy Thầy Lưu Tiến Đạt (thầy Tiểu Đạt) Chuyên gia luyện thi môn Tiếng Anh
5 pages
Katz-Moses Multi Sled FENCE Drawing v2
No ratings yet
Katz-Moses Multi Sled FENCE Drawing v2
1 page
Saudi Arabia Technician Jobs Listings
No ratings yet
Saudi Arabia Technician Jobs Listings
6 pages
Drilling Machine Mechanics
No ratings yet
Drilling Machine Mechanics
14 pages
4 VMXQ J9 R Qyj 7 Xo XUaj EB
No ratings yet
4 VMXQ J9 R Qyj 7 Xo XUaj EB
49 pages