An Introduction to Hadoop
Mark Fei, Cloudera
Strata + Hadoop World 2012, New York City, October 23, 2012
Who Am I?
Mark Fei, Cloudera (Durango, Colorado)
Current: Senior Instructor at Cloudera
Past: Professional Services Education, VMware; Senior Member of Technical Staff, Hill Associates; Sales Engineer, Nortel Networks; Systems Programmer at a large bank; banking applications software developer
What's Ahead?
A solid introduction to Apache Hadoop:
  What it is
  Why it's relevant
  How it works
  The ecosystem
No prior experience needed
Feel free to ask questions
What is Apache Hadoop?
Scalable data storage and processing
Open source Apache project
Harnesses the power of commodity servers
Distributed and fault-tolerant
Core Hadoop consists of two main parts:
  HDFS (storage)
  MapReduce (processing)
A large ecosystem
Who Uses Hadoop?
Vendor integration: BI/analytics, ETL, database, OS/cloud/system management, hardware
About Cloudera
Cloudera is the commercial Hadoop company
Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo
Provides consulting and training services for Hadoop users
Staff includes several committers to Hadoop projects
Cloudera Software
Cloudera's Distribution including Apache Hadoop (CDH)
A single, easy-to-install package from the Apache Hadoop core repository
Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
100% open source
Components: Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper, Apache Flume, Hue, Apache Oozie, Apache Sqoop, Apache Mahout
Components of the CDH Stack: A Coherent Platform
  File system mount: FUSE-DFS
  UI framework / SDK: Hue and the Hue SDK
  Workflow and scheduling: Apache Oozie
  Metadata: Apache Hive
  Data integration: Apache Flume, Apache Sqoop
  Storage and computation: HDFS, MapReduce
  Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
  Fast read/write access: Apache HBase
  Coordination: Apache ZooKeeper
Cloudera Manager, Free Edition
End-to-end deployment and management of your CDH cluster
Zero to Hadoop in 15 minutes
Supports up to 50 nodes
Free (but not open source)
Cloudera Enterprise
Big data storage, processing, and analytics platform based on CDH
  Cloudera's Distribution including Apache Hadoop (CDH)
  Cloudera Manager (full version): end-to-end deployment, management, and operation of CDH, with sophisticated cluster monitoring tools not present in the free version
  Production support: a team of experts on call to help you meet your Service Level Agreements (SLAs)
Cloudera University
Training for the entire Hadoop stack
Public and private classes offered:
  Cloudera Developer Training for Apache Hadoop
  Cloudera Administrator Training for Apache Hadoop
  Cloudera Training for Apache HBase
  Cloudera Training for Apache Hive and Pig
  Cloudera Essentials for Apache Hadoop
  More courses coming, including customized on-site private classes
Industry-recognized certifications:
  Cloudera Certified Developer for Apache Hadoop (CCDH)
  Cloudera Certified Administrator for Apache Hadoop (CCAH)
  Cloudera Certified Specialist in Apache HBase (CCSHB)
Professional Services
Solutions Architects provide guidance and hands-on expertise:
Use Case Discovery, New Hadoop Deployment, Proof of Concept, Production Pilot, Process and Team Development, Hadoop Deployment Certification
How Did Apache Hadoop Originate?
Heavily influenced by Google's architecture, notably the Google File System and MapReduce papers
Early adoption by Yahoo, Facebook, and others; other Web companies quickly saw the benefits
Timeline:
  2002: Nutch spun off from Lucene
  2003: Google publishes the GFS paper
  2004: Google publishes the MapReduce paper
  2005: Nutch rewritten for MapReduce
Why Do We Have So Much Data?
And what are we supposed to do with it?
Velocity: Why We're Generating Data Faster Than Ever
Processes are increasingly automated
Systems are increasingly interconnected
People are increasingly interacting online
Variety: What Types of Data Are We Producing?
Application logs, text messages, social network connections, tweets, photos
Not all of this maps cleanly to the relational model
Volume
The result of all this is that, every single day:
  Twitter processes 340 million messages
  Facebook stores 2.7 billion comments and Likes
  Google processes about 24 petabytes of data
And every single minute:
  More than 200 million e-mail messages are sent
  Foursquare processes more than 2,000 check-ins
Where Does Data Come From?
Science: medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
Industry: financial, pharmaceutical, manufacturing, insurance, online, energy, and retail data
Legacy: sales data, customer behavior, product databases, accounting data, etc.
System data: log files, health and status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters
Analyzing Data: The Challenges
Huge volumes of data
Mixed sources result in many different formats: XML, CSV, EDI, log files, objects, SQL, text, JSON, binary, etc.
What is Common Across Hadoop-able Problems?
Nature of the data:
  Complex data
  Multiple data sources
  Lots of it
Nature of the analysis:
  Batch processing
  Parallel execution
  Spread data over a cluster of servers and take the computation to the data
Benefits of Analyzing With Hadoop
Previously impossible or impractical analysis becomes feasible
Analysis conducted at lower cost
Analysis conducted in less time
Greater flexibility
Linear scalability
What Analysis is Possible With Hadoop?
Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering
Prediction models
Sentiment analysis
Risk assessment
Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox
1. Modeling True Risk
Challenge: How much risk exposure does an organization really have with each customer? Data comes from multiple sources and across multiple lines of business.
Solution with Hadoop:
  Source and aggregate disparate data sources to build a data picture, e.g. credit card records, call recordings, chat sessions, emails, banking activity
  Structure and analyze: sentiment analysis, graph creation, pattern recognition
Typical industry: Financial Services (banks, insurance companies)
2. Customer Churn Analysis
Challenge: Why is an organization really losing customers? Data on these factors comes from different sources.
Solution with Hadoop:
  Rapidly build a behavioral model from disparate data sources
  Structure and analyze with Hadoop: graph traversal and creation, pattern recognition
Typical industry: Telecommunications, Financial Services
3. Recommendation Engine / Ad Targeting
Challenge: Using user data to predict which products to recommend
Solution with Hadoop:
  Batch processing framework allows execution in parallel over large datasets
  Collaborative filtering: collecting taste information from many users, then using it to predict what similar users will like
Typical industry: E-commerce, manufacturing, retail, advertising
4. Point of Sale Transaction Analysis
Challenge: Analyzing Point of Sale (PoS) data to target promotions and manage operations. Sources are complex and data volumes grow across chains of stores and other sources.
Solution with Hadoop:
  Batch processing framework allows execution in parallel over large datasets
  Pattern recognition: optimizing over multiple data sources and using the information to predict demand
Typical industry: Retail
5. Analyzing Network Data to Predict Failure
Challenge: Analyzing real-time data series from a network of sensors. Calculating average frequency over time is extremely tedious because of the need to analyze terabytes.
Solution with Hadoop:
  Take the computation to the data
  Expand from simple scans to more complex data mining; discrete anomalies may, in fact, be interconnected
  Better understand how the network reacts to fluctuations
  Identify leading indicators of component failure
Typical industry: Utilities, telecommunications, data centers
6. Threat Analysis / Trade Surveillance
Challenge: Detecting threats in the form of fraudulent activity or attacks. Large data volumes are involved; it is like looking for a needle in a haystack.
Solution with Hadoop:
  Parallel processing over huge datasets
  Pattern recognition to identify anomalies, i.e., threats
Typical industry: Security, Financial Services; general uses include spam fighting and click fraud
7. Search Quality
Challenge: Providing meaningful search results in real time
Solution with Hadoop:
  Analyzing search attempts in conjunction with structured data
  Pattern recognition: browsing patterns of users performing searches in different categories
Typical industry: Web, e-commerce
8. Data Sandbox
Challenge: Data deluge; you don't know what to do with the data or what analysis to run
Solution with Hadoop:
  Dump all this data into an HDFS cluster
  Use Hadoop to start trying out different analyses on the data
  Spot patterns and derive value from the data
Typical industry: Common across all industries
Hadoop: How Does It Work?
Moore's Law (and Not): Disk Capacity and Price
We're generating more data than ever before
Fortunately, the size and cost of storage has kept pace: capacity has increased while price has decreased

Year    Capacity (GB)    Cost per GB (USD)
1997    2.1              $157
2004    200              $1.05
2012    3,000            $0.05
Disk Capacity and Performance
Disk performance has also increased in the last 15 years
Unfortunately, transfer rates haven't kept pace with capacity

Year    Capacity (GB)    Transfer Rate (MB/s)    Disk Read Time
1997    2.1              16.6                    126 seconds
2004    200              56.5                    59 minutes
2012    3,000            210                     3 hours, 58 minutes
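As a quick back-of-the-envelope check of the read-time column (an illustrative sketch; it assumes 1 GB = 1,000 MB and a sustained sequential read at the quoted rate):

# Disk read time = capacity divided by transfer rate
disks = [(1997, 2.1, 16.6), (2004, 200.0, 56.5), (2012, 3000.0, 210.0)]
for year, capacity_gb, rate_mb_s in disks:
    seconds = capacity_gb * 1000 / rate_mb_s
    print("%d: about %.0f seconds (%.0f minutes)" % (year, seconds, seconds / 60))

This prints roughly 127 seconds, 59 minutes, and 238 minutes, which lines up with the figures in the table.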
Architecture of a Typical HPC System
Compute nodes are separate from the storage system, connected by a fast network
Step 1: Copy input data from the storage system to the compute nodes
Step 2: Process the data on the compute nodes
Step 3: Copy output data back to the storage system
You Don't Just Need Speed
The problem is that we have way more data than code:

$ du -ks code/
1,083
$ du -ks data/
854,632,947,314
You Need Speed, At Scale
With compute nodes separate from the storage system, the fast network between them becomes the bottleneck
HDFS: Hadoop Distributed Filesystem
Because 10,000 hard disks are better than one
Collocated Storage and Processing
Solution: store and process data on the same nodes (the "slave" nodes handle both storage and processing)
Data locality: bring the computation to the data
Reduces I/O and boosts performance
Hard Disk Latency
Disk seeks are expensive: the head must move from its current location to where the data you need is stored
Solution: read lots of data at once to amortize the cost
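As a rough illustration of that amortization (the 10 ms seek time is an assumed, typical figure; the 210 MB/s rate comes from the earlier table):

# Compare a small random read with a large sequential read
seek_ms = 10.0                       # assumed time for one disk seek
rate_mb_per_s = 210.0                # 2012 transfer rate from the table above
for read_mb in (0.004, 64.0):        # a 4 KB read vs. a 64 MB HDFS block
    transfer_ms = read_mb / rate_mb_per_s * 1000
    seek_share = 100 * seek_ms / (seek_ms + transfer_ms)
    print("%7.3f MB read: seek is %.1f%% of the total time" % (read_mb, seek_share))

For a 4 KB read the seek dominates (over 99% of the time); for a 64 MB read it is only about 3%, which is one reason HDFS favors large, sequential reads.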
Introducing HDFS
The Hadoop Distributed File System: scalable storage influenced by Google's file system paper
HDFS is optimized for Hadoop:
  Values high throughput much more than low latency
  It's a user-space Java process
  Primarily accessed via command-line utilities and a Java API
  It's not a general-purpose filesystem
HDFS is (Mostly) UNIX-like
In many ways, HDFS is similar to a UNIX filesystem:
  Hierarchical, UNIX-style paths (e.g. /foo/bar/myfile.txt)
  File ownership and permissions
There are also some major deviations from UNIX:
  No current working directory
  Files cannot be modified once written
HDFS High-Level Architecture
HDFS follows a master-slave architecture; there are two essential daemons in HDFS
Master: the NameNode
  Responsible for the namespace and metadata
  Namespace: the file hierarchy
  Metadata: ownership, permissions, block locations, etc.
Slave: the DataNode
  Responsible for storing the actual data blocks
Anatomy of a Small Hadoop Cluster
The diagram shows the HDFS-related daemons on a small cluster:
  The "master" node runs the NameNode daemon
  Each "slave" node runs a DataNode daemon
HDFS Blocks
When a file is added to HDFS, it's split into blocks
This is a similar concept to native filesystems
HDFS uses a much larger block size (64 MB) for performance
Example: a 150 MB input file becomes Block #1 (64 MB), Block #2 (64 MB), and Block #3 (the remaining 22 MB)
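A tiny sketch of that arithmetic (purely illustrative; the real splitting happens inside HDFS, not in user code, and 64 MB is the classic default block size):

def hdfs_block_sizes(file_size_mb, block_size_mb=64):
    # Return the sizes (in MB) of the blocks a file would be split into
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return sizes

print(hdfs_block_sizes(150))   # [64, 64, 22]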
HDFS Replication
Those blocks are then replicated across machines (nodes A through E in the diagram)
The first block might be replicated to A, C, and D
The next block might be replicated to B, D, and E
The last block might be replicated to A, C, and E
HDFS Reliability
Replication helps to achieve reliability
Even when a node fails, two copies of each block remain
Those blocks will be re-replicated to other nodes automatically
In the diagram, the failed node held blocks #1 and #3, but both blocks are still available on other nodes
Data Processing with MapReduce
It not only works, it's functional
MapReduce High-Level Architecture
Like HDFS, MapReduce has a master-slave architecture; there are two daemons in classical MapReduce
Master: the JobTracker
  Responsible for dividing, scheduling, and monitoring work
Slave: the TaskTracker
  Responsible for the actual processing
Anatomy of a Small Hadoop Cluster
The diagram shows both the MapReduce and HDFS daemons:
  The "master" node runs the NameNode and JobTracker daemons
  Each "slave" node runs a DataNode daemon and a TaskTracker daemon
A Gentle Introduction to MapReduce
MapReduce is conceptually like a UNIX pipeline:

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
  941 ERROR
78264 INFO
 4312 WARN

One function (Map) processes data
That output is ultimately input to another function (Reduce)
Each piece is simple, but can be powerful when combined
The Map Function
Operates on each record individually
Typical uses include filtering, parsing, or transforming input
In the pipeline analogy, the Map stage corresponds to the egrep and cut portion:

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
Intermediate Processing
The Map function's output is grouped and sorted
This is the automatic "sort and shuffle" process in Hadoop
In the pipeline analogy, this corresponds to the sort stage between Map and Reduce:

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
The Reduce Function
Operates on all records in a group
Often used for sum, average, or other aggregate functions
In the pipeline analogy, this corresponds to the uniq -c stage after the sort and shuffle:

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
MapReduce History
MapReduce is not a language, it's a programming model
  A style of processing data you could implement in any language
Many languages have functions named map and reduce
  These functions have largely the same purpose in Hadoop
MapReduce has its roots in functional programming
Popularized for large-scale data processing by Google
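To make the functional-programming connection concrete, here is a tiny sketch using Python's own map and reduce on a toy problem (ordinary Python, not Hadoop code):

from functools import reduce

words = ['hadoop', 'hdfs', 'mapreduce']
lengths = map(len, words)                      # "map": transform each record
total = reduce(lambda a, b: a + b, lengths)    # "reduce": aggregate the results
print(total)                                   # 19

Hadoop applies the same two ideas, but distributes the map and reduce work across a cluster and handles the grouping in between.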
MapReduce Benefits
Complex details are abstracted away from the developer:
  No file I/O
  No networking code
  No synchronization
It's scalable because you process one record at a time
A record consists of a key and a corresponding value
  We often care about only one of these
MapReduce Example in Python
MapReduce code for Hadoop is typically written in Java
But it's possible to use nearly any language with Hadoop Streaming
I'll show the log event counter using MapReduce in Python
It's very helpful to see the data as well as the code
Job Input
Each mapper gets a chunk of the job's input data to process
This chunk is called an InputSplit
In most cases, this corresponds to a block in HDFS

2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"
Python Code for Map Function
Our map function will parse the event type, and then output that event (the key) and a literal 1 (the value):

#!/usr/bin/env python
# Boilerplate Python stuff
import sys

# Define the list of log levels
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Split every line (record) we receive on standard input
# into fields, normalized by case
for line in sys.stdin:
    fields = line.split()
    for field in fields:
        field = field.strip().upper()
        # If this field matches a log level, print it (and a 1)
        if field in levels:
            print "%s\t1" % field
Output of Map Function
The map function produces key/value pairs as output:

INFO    1
INFO    1
WARN    1
INFO    1
WARN    1
INFO    1
ERROR   1
Input to Reduce Function
The Reducer receives a key and all values for that key
Keys are always passed to reducers in sorted order
Although it's not obvious here, the values are unordered

ERROR   1
INFO    1
INFO    1
INFO    1
INFO    1
WARN    1
WARN    1
Python Code for Reduce Function
The Reducer first extracts the key and value it was passed:

#!/usr/bin/env python
# Boilerplate Python stuff
import sys

# Initialize loop variables
previous_key = ''
sum = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    value = int(value)
    # continued on the next slide
Python Code for Reduce Function (continued)
The Reducer then simply adds up the values for each key:

    # continued from the previous slide
    if key == previous_key:
        # If the key is unchanged, increment the count
        sum = sum + value
    else:
        # If the key changed, print the sum for the previous key
        if previous_key != '':
            print '%s\t%i' % (previous_key, sum)
        # Re-initialize the loop variables
        previous_key = key
        sum = value

# Print the sum for the final key
print '%s\t%i' % (previous_key, sum)
Output of Reduce Function
The output of this Reduce function is a sum for each level:

ERROR   1
INFO    4
WARN    2
Recap of Data Flow
Map input:
  2012-09-06 22:16:49.391 CDT INFO "This can wait"
  2012-09-06 22:16:49.392 CDT INFO "Blah blah"
  2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
  2012-09-06 22:16:49.395 CDT INFO "More blather"
  2012-09-06 22:16:49.397 CDT WARN "Hey there"
  2012-09-06 22:16:49.398 CDT INFO "Spewing data"
  2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"
Map output:
  INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
Reduce input (after sort and shuffle):
  ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1
Reduce output:
  ERROR 1, INFO 4, WARN 2
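Because the mapper and reducer simply read standard input and write standard output, the whole flow can be rehearsed in a few lines of ordinary Python before submitting a real job. This is an illustrative sketch that mirrors the recap above (Hadoop performs the real partitioning and shuffle; here it is just a sorted() call):

levels = set(['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'])

map_input = [
    '2012-09-06 22:16:49.391 CDT INFO "This can wait"',
    '2012-09-06 22:16:49.392 CDT INFO "Blah blah"',
    '2012-09-06 22:16:49.394 CDT WARN "Hmmm..."',
    '2012-09-06 22:16:49.395 CDT INFO "More blather"',
    '2012-09-06 22:16:49.397 CDT WARN "Hey there"',
    '2012-09-06 22:16:49.398 CDT INFO "Spewing data"',
    '2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"',
]

# Map: emit a (level, 1) pair for each record, like the mapper's output above
map_output = [(f, 1) for line in map_input for f in line.split() if f in levels]

# Sort and shuffle: Hadoop groups and sorts the pairs by key automatically
reduce_input = sorted(map_output)

# Reduce: sum the values for each key
totals = {}
for key, value in reduce_input:
    totals[key] = totals.get(key, 0) + value

for key in sorted(totals):
    print('%s\t%d' % (key, totals[key]))   # ERROR 1, INFO 4, WARN 2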
Input Splits Feed the Map Tasks
Input for the entire job is subdivided into InputSplits
An InputSplit usually corresponds to a single HDFS block
Each of these serves as input to a single Map task
Example: the input for an entire job (192 MB) becomes three 64 MB splits, feeding Mapper #1, Mapper #2, and Mapper #3
Mappers Feed the Shuffle and Sort
The output of all Mappers is partitioned, merged, and sorted
No code is required; Hadoop does this automatically
In the diagram, each Mapper (#1, #2, ..., #N) emits its own stream of (level, 1) pairs, and the shuffle brings all the ERROR pairs, all the INFO pairs, and all the WARN pairs together
Shuffle and Sort Feeds the Reducers
All values for a given key are then collapsed into a list
The key and all of its values are fed to reducers as input
In the diagram, INFO and its values go to Reducer #1, while ERROR and WARN and their values go to Reducer #2
Each Reducer Has an Output File
These are stored in HDFS below your output directory
Use hadoop fs -getmerge to combine them into a single local copy
In this example, Reducer #1 produces INFO 8, and Reducer #2 produces ERROR 3 and WARN 4
Apache Hadoop Ecosystem: Overview
"Core Hadoop" consists of HDFS and MapReduce
  These are the kernel of a much broader platform
Hadoop has many related projects
  Most are open source Apache projects, like Hadoop itself
  Some help you integrate Hadoop with other systems
  Others help you analyze your data
  Still others, like Oozie, help you use Hadoop more effectively
  Also like Hadoop, they have funny names
All of these are part of Cloudera's CDH distribution
Ecosystem: Apache Flume
Flume collects data into Hadoop from many sources: log files, program output, syslog, custom sources, and many more
Ecosystem: Apache Sqoop
Sqoop moves data between a relational database and a Hadoop cluster
Integrates with any JDBC-compatible database
Retrieve all tables, a single table, or a portion of a table to store in HDFS
Can also export data from HDFS back to the database
Ecosystem: Apache Hive
Hive allows you to run SQL-like queries on data in HDFS:

SELECT customers.id, customers.name, SUM(orders.cost)
FROM customers
JOIN orders ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;

Hive turns this into MapReduce jobs that run on your cluster
This reduces development time and makes Hadoop more accessible to non-engineers
Ecosystem: Apache Pig
Apache Pig has a similar purpose to Hive
It has a high-level language (Pig Latin) for data analysis
Scripts yield MapReduce jobs that run on your cluster
But Pig's approach is quite different from Hive's
Ecosystem: Apache HBase
A NoSQL database built on HDFS
Low latency and high performance for reads and writes
Extremely scalable: tables can have billions of rows, and potentially millions of columns
You Should Be Using CDH
Cloudera's Distribution including Apache Hadoop (CDH)
  The most widely used distribution of Hadoop
  A stable, proven, and supported environment you can count on
Combines Hadoop with many important ecosystem tools, such as Hive, Pig, Sqoop, Flume, and many more
  All of these are integrated and work well together
How much does it cost? It's completely free, and it's Apache licensed, so it's 100% open source too
When is Hadoop (Not) a Good Choice?
Hadoop may be a great choice when:
  You need to process non-relational (unstructured) data
  You are processing large amounts of data
  You can run your jobs in batch mode
  And you know how to integrate it with other systems
Hadoop may not be a great choice when:
  You're processing small amounts of data
  Your algorithms require communication among nodes
  You need low latency or transactions
As always, use the best tool for the job
Managing the Elephant in the Room: Roles
System administrators
Developers
Analysts
Data stewards
System Administrators
Required skills:
  Strong Linux administration skills
  Networking knowledge
  Understanding of hardware
Job responsibilities:
  Install, configure, and upgrade Hadoop software
  Manage hardware components
  Monitor the cluster
  Integrate with other systems (e.g., Flume and Sqoop)
Developers
Required skills:
  Strong Java or scripting capabilities
  Understanding of MapReduce and algorithms
Job responsibilities:
  Write, package, and deploy MapReduce programs
  Optimize MapReduce jobs and Hive/Pig programs
Data Analyst / Business Analyst
Required skills:
  SQL
  Understanding of data analytics and data mining
Job responsibilities:
  Extract intelligence from the data
  Write Hive and/or Pig programs
Data Steward
Required skills:
  Data modeling and ETL
  Scripting skills
Job responsibilities:
  Catalog the data (analogous to a librarian for books)
  Manage the data lifecycle and retention
  Data quality control with SLAs
Combining Roles
System Administrator + Data Steward is analogous to a DBA
Required skills:
  Strong Linux administration skills
  Data modeling and ETL
  Scripting skills
Job responsibilities:
  Install, configure, and upgrade Hadoop software
  Manage hardware components
  Monitor the cluster
  Integrate with other systems (e.g., Flume and Sqoop)
  Manage the data lifecycle and retention
  Data quality control with SLAs
Conclusion
Thanks for your time! Questions?