3252 - Big Data Management
Systems and Tools
Module 3: Hadoop and Related Tools
Course Plan
Module Titles
Module 1 – Reliable, Scalable Data-Intensive Applications
Module 2 – Storing Data
Current Focus: Module 3 – Hadoop and Related Tools
Module 4 – Just Enough Scala for Spark
Module 5 – Introduction to Spark
Module 6 – Big Data Analytics and Processing Streaming Data
Module 7 – Spark 2
Module 8 – Tools for Moderately-Sized Datasets
Module 9 – Data Pipelines
Module 10 – Working with Distributed Databases
Module 11 – Big Data Architecture
Module 12 – Term Project Presentations (no content)
Learning Outcomes for this Module
• Describe the use and features of Hadoop
• Explain the operation of MapReduce
• Describe the various tools in the Hadoop ecosystem
• Build hands-on skills with Hadoop
Topics for this Module
• 3.1 Introduction to Hadoop
• 3.2 Hadoop Distributed File System (HDFS)
• 3.3 Yet Another Resource Negotiator (YARN)
• 3.4 MapReduce
• 3.5 Hadoop cluster management tools
• 3.6 Hadoop storage
• 3.7 Getting data into and out of Hadoop
• 3.8 Hadoop-related databases and analytics
• 3.9 Resources
• 3.10 Homework
• 3.11 Next Class
Module 3 – Section 1
Introduction to Hadoop
Hadoop 3 Architecture
[Architecture diagram. Source: Hortonworks]
Module 3 – Section 2
Hadoop Distributed File System
(HDFS)
The Hadoop Distributed File System (HDFS)
• Stores very large files as blocks
• Designed for commodity servers
• Designed to scale easily and effectively
• Achieves reliability through replication
• But is a poor choice for:
− Low latency data access
− Storing lots of small files
− Multiple writers
− Modification within files
HDFS (cont’d)
• By default creates three replicas of data
− First on same node as client
− Second on different rack
− Third on same rack as second but different node
• Transparently checksums all data
• Corrupt replicas are automatically replaced from good
copies
• Supports a variety of data compression codecs
• Transparent end-to-end encryption
HDFS Concepts
• Blocks
− Data is stored in large blocks (default 128MB) to reduce seek
time
− Files consist of blocks and so can be larger than a single disk
and split across many servers
• Namenodes
− Master server that manages the filesystem namespace and
metadata
− Single point of failure
HDFS Concepts (cont’d)
• Datanodes
− Slave servers where data is stored
• HDFS Federation
− Can have multiple namenodes, each managing different
filesystem subtrees
• HDFS High Availability
− Active-passive configuration with automatic failover (takes about a minute)
− Without High Availability, recovery means configuring a replacement namenode, which takes half an hour or more
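The concepts above map directly onto simple client operations. Here is a minimal sketch, assuming the third-party Python hdfs package (a WebHDFS client) and a hypothetical namenode address; the hdfs dfs command-line shell offers the same operations.

# Minimal HDFS sketch; assumes the third-party `hdfs` WebHDFS client
# package and a hypothetical namenode at namenode:9870.
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hdfs')

client.makedirs('/data/logs')
client.upload('/data/logs/events.txt', 'events.txt')  # split into blocks, replicated 3x

# The namenode tracks per-file metadata such as block size and replication:
status = client.status('/data/logs/events.txt')
print(status['blockSize'], status['replication'])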
Isn’t This Just Grid Computing?

| Grid Computing | Hadoop |
| --- | --- |
| CPU-intensive | Data-intensive (and possibly also CPU-intensive) |
| Possibly highly distributed | Distributed, but data and CPU kept close together to conserve bandwidth |
| Low-level coding | High-level coding |
| Typically unaware of network topology; passes work to any available processor | Network-topology aware; attempts to give related work to physically close processors |
| Data is sent to the processors | Program is sent to the processors to work on data held locally |
Module 3 – Section 3
Yet Another Resource Negotiator
(YARN)
YARN
• Yet Another Resource Negotiator
• Two main parts:
− ResourceManager: allocates cluster resources and schedules applications
− ApplicationMaster: one per application; requests a container for each task (containers are launched by the per-node NodeManagers)
• Apache Myriad allows Hadoop and Mesos to share the
same physical infrastructure
Module 3 – Section 4
MapReduce
MapReduce
• A simple parallel programming paradigm
• 2-3 phases
− Map: Split the problem into chunks and tackle them
− (Shuffle): Organize the results into matching groups
− Reduce: Combine the results of each group
• Not all problems can be carved up this way
• Works with a variety of programming languages
• The framework “looks after” message passing, scheduling work onto machines, and assembling the partial results
MapReduce (cont’d)
[Diagram: MapReduce data flow. Input splits 0, 1 and 2 are each processed by a map task; map outputs are sorted, copied to the reducers, and merged; the reduce tasks write output parts 0 and 1.]
MapReduce (cont’d)
• Mappers read (key, value) pairs from input files and emit new (key, value) pairs in which the value has been mapped to something else
• Shuffle and sort: groups the mapper output by key, so we now have (key, [value, value, …])
• You never deal with the shuffle/sort phase explicitly; it is handled by the framework
• Reducers take the sorted (key, [values]) pairs as input and emit (key, aggregated_value)
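As a concrete illustration, here is word count written as a pair of scripts; a minimal sketch assuming Hadoop Streaming (which lets any stdin/stdout program act as a map or reduce task), with hypothetical file names.

# mapper.py: emit a (word, 1) pair for every word in the input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + '\t1')

# reducer.py: Streaming delivers input sorted by key, so all counts for
# a given word arrive together and can be summed in a single pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit('\t', 1)
    if word != current:
        if current is not None:
            print(current + '\t' + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + '\t' + str(count))

These would typically be launched with the hadoop-streaming JAR, passing the scripts as -mapper and -reducer together with -input and -output paths.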
Module 3 – Section 5
Hadoop Cluster Management Tools
Ambari
• Provision a Hadoop cluster
• Manage and monitor the cluster
Sentry
• A system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster
• Originally Cloudera Access
• Works out of the box with Apache Hive and Cloudera Impala
Ranger
• Vision is to provide comprehensive security across the
Hadoop ecosystem
• Today provides fine-grained authorization and auditing for:
− Hadoop
− Hive
− HBase
− Storm
− Knox
• Centralized web application
− Policy administration
− Audit
− Reporting
Falcon
• Data processing and management platform for Hadoop to:
− Simplify moving data
− Provide process orchestration (coordination), data discovery
and lifecycle management
ZooKeeper
• Centralized coordination service for distributed systems
• Provides centralized services such as maintaining
configuration information, naming, distributed
synchronization and group services
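As a sketch of those services, here is a minimal example using the third-party kazoo Python client (an assumption; kazoo is not part of ZooKeeper itself), with hypothetical hosts and paths.

from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Centralized configuration: every client sees the same value.
zk.ensure_path('/app/config')
zk.set('/app/config', b'batch_size=500')
data, stat = zk.get('/app/config')
print(data, stat.version)

# Group membership: an ephemeral node vanishes if this client dies.
zk.ensure_path('/app/workers')
zk.create('/app/workers/worker-', ephemeral=True, sequence=True)
print(zk.get_children('/app/workers'))

zk.stop()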
Azkaban
• Batch workflow scheduler for Hadoop jobs, developed at LinkedIn
Module 3 – Section 6
Hadoop Storage
Avro
• Language-neutral (C, C++, Java, Python, Ruby) system for
storing objects (“data serialization”)
• Supports primitive (e.g. int, string) and complex (e.g. array,
enum) types
• Fast and interoperable
• Designed to work well with HDFS and MapReduce
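A minimal serialization round trip, sketched with the third-party fastavro package (an assumption; the official avro library works similarly) and illustrative names.

from fastavro import parse_schema, reader, writer

# A record schema using primitive types; complex types (array, enum,
# map, ...) are declared the same way.
schema = parse_schema({
    'type': 'record',
    'name': 'User',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
    ],
})

with open('users.avro', 'wb') as out:
    writer(out, schema, [{'name': 'Ada', 'age': 36}, {'name': 'Bob', 'age': 29}])

# The schema travels with the file, so any language can read it back.
with open('users.avro', 'rb') as f:
    for record in reader(f):
        print(record)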
Parquet, RCFile & ORCFile
• Columnar storage formats for data in Hadoop
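To make “columnar” concrete, here is a small sketch using the pyarrow package (an assumption; any Parquet library behaves similarly). Because values are stored column by column, a reader can fetch just the columns it needs.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'user': ['ada', 'bob', 'cam'], 'clicks': [10, 3, 7]})
pq.write_table(table, 'clicks.parquet')

# Columnar layout: read only the 'clicks' column and skip the rest.
print(pq.read_table('clicks.parquet', columns=['clicks']).to_pydict())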
Module 3 – Section 7
Getting Data Into and Out of
Hadoop
Flume
• Can handle arbitrary types of streaming data
• Can configure the reliability level of the data collection
based on how important the data is and how timely it needs
to be
• Can have it batch the data up for transfer or move it as soon
as it becomes available
• Can do some lightweight transformations that make it handy
for ETL
Sqoop
• Tool for bulk data transfer between Hadoop and structured datastores such as relational databases
• Imports tables (or query results) into HDFS, Hive or HBase, and exports results back out
• Transfers run as parallel MapReduce jobs
• Connects over JDBC, with optimized connectors for common databases
Pig
• Dataflow/workflow language developed at Yahoo!
• Shell is called Grunt
• Steps get added to a logical plan built by the interpreter, which doesn’t get executed until a DUMP or STORE command
• Scripts are converted to MapReduce jobs, as Hive’s queries are
• Good for ETL and dealing with unstructured data
Pig Latin Example
users = LOAD 'users' AS (name, age);
filteredUsers = FILTER users BY age >= 18 AND age <= 25;
pages = LOAD 'pages' AS (user, url);
joinedUserPages = JOIN filteredUsers BY name, pages BY user;
groupedUserPages = GROUP joinedUserPages BY url;
numberOfClicks = FOREACH groupedUserPages GENERATE group, COUNT(joinedUserPages) AS clicks;
sortedNumberOfClicks = ORDER numberOfClicks BY clicks DESC;
top5 = LIMIT sortedNumberOfClicks 5;
STORE top5 INTO 'top5sites';
Kafka
• Kafka is a distributed publish-subscribe messaging system
• Messages are published by producers, categorized into
topics and pulled for processing by consumers
• Consumers subscribe to topics they want to read messages
from
• Kafka maintains a partitioned log for each topic
• Each partition is an ordered subset of the messages
published to a topic
• Partitions of the same topic can be spread across many
nodes in a cluster and are usually replicated for reliability
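A minimal producer/consumer sketch using the third-party kafka-python package (an assumption), with a hypothetical broker address and topic name. Messages with the same key land in the same partition, which preserves their relative order.

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers='broker:9092')
producer.send('clicks', key=b'user42', value=b'/home')  # same key, same partition
producer.flush()

consumer = KafkaConsumer('clicks', bootstrap_servers='broker:9092',
                         group_id='analytics', auto_offset_reset='earliest')
for msg in consumer:  # pull loop; blocks waiting for new messages
    print(msg.partition, msg.offset, msg.key, msg.value)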
Module 3 – Section 8
Hadoop-Related Databases and
Analytics
HCatalog
• Provides metadata services (and a REST interface)
• Metadata is stored in a relational database outside Hadoop
• Makes Hive possible, and lets other tools (e.g. Pig, MapReduce) share Hive’s table definitions
Hive
• Relational-like table store in Hadoop
• Hive table consists of a schema stored in HCatalog plus
data stored in HDFS
• Converts HiveQL commands into batch jobs
• Can delete rows and add small batches of inserts but
changes are not in-place
• Limited index and subquery capability
• Supports both primitive (e.g. BOOLEAN, INT, FLOAT, VARCHAR, BINARY) and complex (ARRAY, MAP, STRUCT, UNION) types
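A minimal HiveQL sketch via the third-party PyHive package (an assumption), against a hypothetical HiveServer2 host; the same statements can be typed into the hive or beeline shells.

from pyhive import hive

conn = hive.Connection(host='hiveserver', port=10000, username='student')
cur = conn.cursor()

# The schema goes to HCatalog; the rows live as files in HDFS.
cur.execute('CREATE TABLE IF NOT EXISTS pageviews (user_name STRING, url STRING) STORED AS ORC')

# HiveQL looks like SQL but runs as batch jobs over the HDFS data.
cur.execute('SELECT url, COUNT(*) AS clicks FROM pageviews GROUP BY url')
for url, clicks in cur.fetchall():
    print(url, clicks)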
HBase
• NoSQL database
• Supports very large (potentially sparse) tables
• Latency in milliseconds
• Based on Google BigTable
• Stores data in column families
• Excels at key-based access to a single cell or range of cells
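A minimal sketch of key-based access through the third-party happybase package (an assumption; it talks to HBase via the Thrift gateway), with hypothetical host, table and column-family names.

import happybase

conn = happybase.Connection('hbase-thrift-host')  # Thrift server, default port 9090
table = conn.table('pageviews')

# Cells are bytes, addressed by row key and column-family:qualifier.
table.put(b'user42#2023-01-01', {b'cf:url': b'/home'})
print(table.row(b'user42#2023-01-01'))

# Row-key range scans are exactly what HBase excels at:
for key, data in table.scan(row_start=b'user42#', row_stop=b'user42$'):
    print(key, data)

conn.close()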
Drill
• A fast analytical engine for querying non-relational
datastores using SQL
• Based on Google BigQuery SQL (different from Hive’s)
• Columnar execution engine
• Data-driven compilation and recompilation at execution
time
• Specialized memory management that reduces memory
requirements and eliminates garbage collections
• Locality-aware execution that reduces network traffic
when Drill is co-located with the datastore
• Advanced cost-based optimizer that pushes processing
into the datastore when possible
Mahout
• Provides a variety of machine learning algorithms that run on Hadoop (or other backends)
• Requires Java programming skills to use
• Primarily used for:
− Recommendations (aka collaborative filtering)
− Clustering, e.g. documents by topic
− Classification
− Frequent itemset mining: find items that often appear together, say in queries or shopping baskets
Module 3 – Section 9
Resources
Resources
• White, Tom. Hadoop: The Definitive Guide, 4th Edition. O’Reilly, 2015.
• Grover et al. Hadoop Application Architectures. O’Reilly, 2015.
• Sammer, Eric. Hadoop Operations. O’Reilly, 2012.
• Kunigk & George. Hadoop in the Enterprise: Architecture. O’Reilly, 2018.
• Holmes, Alex. Hadoop in Practice. Manning, 2014.
• Capriolo & Wampler. Programming Hive. O’Reilly, 2012.
Resources (cont’d)
• Free online Pig book:
• Google paper on MapReduce:
Module 3 – Section 10
Homework
Assignment #2: Setup
• Sign up for an Azure free trial: portal.azure.com
• Go to Marketplace (button at the bottom of the dashboard).
• Enter “hdp” in Search and choose the Hortonworks Sandbox with HDP 2.5 entry that appears.
• Click Create at the bottom of the page.
Assignment #2: Basics setup
Assignment #2: SSH Connect
Assignment #2: Ambari password change
Assignment #2: Open ports
Assignment #2: Full tutorial
• Installing Hortonworks Data Platform 2.5 on Microsoft
Azure: Tutorial
• It also has the full list of ports that you need to open.
Assignment #2: Exercises
• Note: you do NOT need to hand in the exercises below.
• Read the tutorial:
− “Learning the Ropes of the HDP Sandbox”
• Complete the following tutorials:
− “How to Process Data with Apache Hive”
− “How to Process Data with Apache Pig”
Assignment #2: Questions
• Note: you DO need to hand in answers to the questions below. Instructions are on Quercus.
1. Describe a business situation in which Hadoop would be
a better choice for storing the data than a relational
database and explain why (give at least three reasons)
2. Describe a business situation in which a relational
database would be the better option and explain why
(give at least three reasons)
3. Describe a business situation where MongoDB would be a better choice than either, and again give at least three reasons
4. What is the point of having and using Pig if we have Hive and can therefore use SQL? (Again, give three advantages.)
Module 3 – Section 11
Next Class
Next Class
• Next class will be Just Enough Scala to get ready for
working with Spark
• In preparation for next class: Read Chapter 1 of
Programming in Scala:
Follow us on social
Join the conversation with us online:
facebook.com/uoftscs
@uoftscs
linkedin.com/company/university-of-toronto-school-of-continuing-studies
@uoftscs
Any questions?
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies