3252 - Big Data Management
Systems and Tools
Module 3: Hadoop and Related Tools
Course Plan
Module Titles
Module 1 – Reliable, Scalable Data-Intensive Applications
Module 2 – Storing Data
Current Focus: Module 3 – Hadoop and Related Tools
Module 4 – Just Enough Scala for Spark
Module 5 – Introduction to Spark
Module 6 – Big Data Analytics and Processing Streaming Data
Module 7 – Spark 2
Module 8 – Tools for Moderately-Sized Datasets
Module 9 – Data Pipelines
Module 10 – Working with Distributed Databases
Module 11 – Big Data Architecture
Module 12 – Term Project Presentations (no content)
Learning Outcomes for this Module
• Describe the use and features of Hadoop
• Explain the operation of MapReduce
• Describe the various tools in the Hadoop ecosystem
• Build hands-on skills with Hadoop
Topics for this Module
• 3.1 Introduction to Hadoop
• 3.2 Hadoop Distributed File System (HDFS)
• 3.3 Yet Another Resource Negotiator (YARN)
• 3.4 MapReduce
• 3.5 Hadoop cluster management tools
• 3.6 Hadoop storage
• 3.7 Getting data into and out of Hadoop
• 3.8 Hadoop-related databases and analytics
• 3.9 Resources
• 3.10 Homework
• 3.11 Next Class
Module 3 – Section 1
Introduction to Hadoop
Hadoop 3 Architecture
[Architecture diagram. Source: Hortonworks]
Module 3 – Section 2
Hadoop Distributed File System
(HDFS)
The Hadoop Distributed File System (HDFS)
• Stores very large files as blocks
• Designed for commodity servers
• Designed to scale easily and effectively
• Achieves reliability through replication
• But is a poor choice for:
− Low latency data access
− Storing lots of small files
− Multiple writers
− Modification within files
HDFS (cont’d)
• By default creates three replicas of data
− First on same node as client
− Second on different rack
− Third on same rack as second but different node
• Transparently checksums all data
• Corrupt replicas are automatically replaced from good
copies
• Supports a variety of data compression codecs
• Transparent end-to-end encryption
HDFS Concepts
• Blocks
− Data is stored in large blocks (default 128MB) to reduce seek
time
− Files consist of blocks and so can be larger than a single disk
and split across many servers
• Namenodes
− Master server that manages the filesystem namespace and
metadata
− Single point of failure
HDFS Concepts (cont’d)
• Datanodes
− Slave servers where data is stored
• HDFS Federation
− Can have multiple namenodes, each managing different
filesystem subtrees
• HDFS High Availability
− Active-passive configuration with automatic failover (takes about a minute)
− Without High Availability, recovery means configuring a replacement namenode, which takes half an hour or more
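The concepts above map directly onto simple client operations. Here is a minimal sketch, assuming the third-party Python hdfs package (a WebHDFS client) and a hypothetical namenode address; the hdfs dfs command-line shell offers the same operations.

# Minimal HDFS sketch; assumes the third-party `hdfs` WebHDFS client
# package and a hypothetical namenode at namenode:9870.
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hdfs')

client.makedirs('/data/logs')
client.upload('/data/logs/events.txt', 'events.txt')  # split into blocks, replicated 3x

# The namenode tracks per-file metadata such as block size and replication:
status = client.status('/data/logs/events.txt')
print(status['blockSize'], status['replication'])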
Isn’t This Just Grid Computing?

| Grid Computing | Hadoop |
| --- | --- |
| CPU-intensive | Data-intensive (and possibly also CPU-intensive) |
| Possibly highly distributed | Distributed, but data and CPU kept close together to conserve bandwidth |
| Low-level coding | High-level coding |
| Typically unaware of network topology; passes work to any available processor | Network-topology aware; attempts to give related work to physically close processors |
| Data is sent to the processors | Program is sent to the processors to work on data held locally |
Module 3 – Section 3
Yet Another Resource Negotiator
(YARN)
YARN
• Yet Another Resource Negotiator
• Two main parts:
− ResourceManager: allocates cluster resources and schedules applications
− ApplicationMaster: one per application; requests a container for each task (containers are launched by the per-node NodeManagers)
• Apache Myriad allows Hadoop and Mesos to share the
same physical infrastructure
Module 3 – Section 4
MapReduce
MapReduce
• A simple parallel programming paradigm
• 2-3 phases
− Map: Split the problem into chunks and tackle them
− (Shuffle): Organize the results into matching groups
− Reduce: Combine the results of each group
• Not all problems can be carved up this way
• Works with a variety of programming languages
• The framework “looks after” message passing, scheduling work onto machines, and assembling the partial results
MapReduce (cont’d)
[Diagram: MapReduce data flow. Input splits 0, 1 and 2 are each processed by a map task; map outputs are sorted, copied to the reducers, and merged; the reduce tasks write output parts 0 and 1.]
MapReduce (cont’d)
• Mappers read (key, value) pairs from input files and emit new (key, value) pairs in which the value has been mapped to something else
• Shuffle and sort: groups the mapper output by key, so we now have (key, [value, value, …])
• You never deal with the shuffle/sort phase explicitly; it is handled by the framework
• Reducers take the sorted (key, [values]) pairs as input and emit (key, aggregated_value)
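As a concrete illustration, here is word count written as a pair of scripts; a minimal sketch assuming Hadoop Streaming (which lets any stdin/stdout program act as a map or reduce task), with hypothetical file names.

# mapper.py: emit a (word, 1) pair for every word in the input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + '\t1')

# reducer.py: Streaming delivers input sorted by key, so all counts for
# a given word arrive together and can be summed in a single pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit('\t', 1)
    if word != current:
        if current is not None:
            print(current + '\t' + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + '\t' + str(count))

These would typically be launched with the hadoop-streaming JAR, passing the scripts as -mapper and -reducer together with -input and -output paths.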
Module 3 – Section 5
Hadoop Cluster Management Tools
Ambari
• Provision a Hadoop cluster
• Manage and monitor the cluster
Sentry
• A system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster
• Originally Cloudera Access
• Works out of the box with Apache Hive and Cloudera Impala
Ranger
• Vision is to provide comprehensive security across the
Hadoop ecosystem
• Today provides fine-grained authorization and auditing for:
− Hadoop
− Hive
− HBase
− Storm
− Knox
• Centralized web application
− Policy administration
− Audit
− Reporting
Falcon
• Data processing and management platform for Hadoop to:
− Simplify moving data
− Provide process orchestration (coordination), data discovery
and lifecycle management
ZooKeeper
• Centralized coordination service for distributed systems
• Provides centralized services such as maintaining
configuration information, naming, distributed
synchronization and group services
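As a sketch of those services, here is a minimal example using the third-party kazoo Python client (an assumption; kazoo is not part of ZooKeeper itself), with hypothetical hosts and paths.

from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Centralized configuration: every client sees the same value.
zk.ensure_path('/app/config')
zk.set('/app/config', b'batch_size=500')
data, stat = zk.get('/app/config')
print(data, stat.version)

# Group membership: an ephemeral node vanishes if this client dies.
zk.ensure_path('/app/workers')
zk.create('/app/workers/worker-', ephemeral=True, sequence=True)
print(zk.get_children('/app/workers'))

zk.stop()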
Azkaban
• Batch workflow scheduler for Hadoop jobs, developed at LinkedIn
Module 3 – Section 6
Hadoop Storage
Avro
• Language-neutral (C, C++, Java, Python, Ruby) system for
storing objects (“data serialization”)
• Supports primitive (e.g. int, string) and complex (e.g. array,
enum) types
• Fast and interoperable
• Designed to work well with HDFS and MapReduce
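A minimal serialization round trip, sketched with the third-party fastavro package (an assumption; the official avro library works similarly) and illustrative names.

from fastavro import parse_schema, reader, writer

# A record schema using primitive types; complex types (array, enum,
# map, ...) are declared the same way.
schema = parse_schema({
    'type': 'record',
    'name': 'User',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
    ],
})

with open('users.avro', 'wb') as out:
    writer(out, schema, [{'name': 'Ada', 'age': 36}, {'name': 'Bob', 'age': 29}])

# The schema travels with the file, so any language can read it back.
with open('users.avro', 'rb') as f:
    for record in reader(f):
        print(record)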
Parquet, RCFile & ORCFile
• Columnar storage formats for data in Hadoop
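To make “columnar” concrete, here is a small sketch using the pyarrow package (an assumption; any Parquet library behaves similarly). Because values are stored column by column, a reader can fetch just the columns it needs.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'user': ['ada', 'bob', 'cam'], 'clicks': [10, 3, 7]})
pq.write_table(table, 'clicks.parquet')

# Columnar layout: read only the 'clicks' column and skip the rest.
print(pq.read_table('clicks.parquet', columns=['clicks']).to_pydict())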
Module 3 – Section 7
Getting Data Into and Out of
Hadoop
Flume
• Can handle arbitrary types of streaming data
• Can configure the reliability level of the data collection
based on how important the data is and how timely it needs
to be
• Can have it batch the data up for transfer or move it as soon
as it becomes available
• Can do some lightweight transformations that make it handy
for ETL
Sqoop
• Tool for bulk data transfer between Hadoop and structured datastores such as relational databases
• Imports tables (or query results) into HDFS, Hive or HBase, and exports results back out
• Transfers run as parallel MapReduce jobs
• Connects over JDBC, with optimized connectors for common databases
Pig
• Dataflow/workflow language developed at Yahoo!
• Shell is called Grunt
• Steps get added to a logical plan built by the interpreter, which doesn’t get executed until a DUMP or STORE command
• Scripts are converted to MapReduce jobs, as Hive’s queries are
• Good for ETL and dealing with unstructured data
Pig Latin Example
users = LOAD 'users' AS (name, age);
filteredUsers = FILTER users BY age >= 18 AND age <= 25;
pages = LOAD 'pages' AS (user, url);
joinedUserPages = JOIN filteredUsers BY name, pages BY user;
groupedUserPages = GROUP joinedUserPages BY url;
numberOfClicks = FOREACH groupedUserPages GENERATE group, COUNT(joinedUserPages) AS clicks;
sortedNumberOfClicks = ORDER numberOfClicks BY clicks DESC;
top5 = LIMIT sortedNumberOfClicks 5;
STORE top5 INTO 'top5sites';
Kafka
• Kafka is a distributed publish-subscribe messaging system
• Messages are published by producers, categorized into
topics and pulled for processing by consumers
• Consumers subscribe to topics they want to read messages
from
• Kafka maintains a partitioned log for each topic
• Each partition is an ordered subset of the messages
published to a topic
• Partitions of the same topic can be spread across many
nodes in a cluster and are usually replicated for reliability
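A minimal producer/consumer sketch using the third-party kafka-python package (an assumption), with a hypothetical broker address and topic name. Messages with the same key land in the same partition, which preserves their relative order.

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers='broker:9092')
producer.send('clicks', key=b'user42', value=b'/home')  # same key, same partition
producer.flush()

consumer = KafkaConsumer('clicks', bootstrap_servers='broker:9092',
                         group_id='analytics', auto_offset_reset='earliest')
for msg in consumer:  # pull loop; blocks waiting for new messages
    print(msg.partition, msg.offset, msg.key, msg.value)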
Module 3 – Section 8
Hadoop-Related Databases and
Analytics
HCatalog
• Provides metadata services (and a REST interface)
• Metadata is stored in a relational database outside Hadoop
• Makes Hive possible, and lets other tools (e.g. Pig, MapReduce) share Hive’s table definitions
Hive
• Relational-like table store in Hadoop
• Hive table consists of a schema stored in HCatalog plus
data stored in HDFS
• Converts HiveQL commands into batch jobs
• Can delete rows and add small batches of inserts but
changes are not in-place
• Limited index and subquery capability
• Supports both primitive (e.g. BOOLEAN, INT, FLOAT, VARCHAR, BINARY) and complex (ARRAY, MAP, STRUCT, UNION) types
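A minimal HiveQL sketch via the third-party PyHive package (an assumption), against a hypothetical HiveServer2 host; the same statements can be typed into the hive or beeline shells.

from pyhive import hive

conn = hive.Connection(host='hiveserver', port=10000, username='student')
cur = conn.cursor()

# The schema goes to HCatalog; the rows live as files in HDFS.
cur.execute('CREATE TABLE IF NOT EXISTS pageviews (user_name STRING, url STRING) STORED AS ORC')

# HiveQL looks like SQL but runs as batch jobs over the HDFS data.
cur.execute('SELECT url, COUNT(*) AS clicks FROM pageviews GROUP BY url')
for url, clicks in cur.fetchall():
    print(url, clicks)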
HBase
• NoSQL database
• Supports very large (potentially sparse) tables
• Latency in milliseconds
• Based on Google BigTable
• Stores data in column families
• Excels at key-based access to a single cell or range of cells
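A minimal sketch of key-based access through the third-party happybase package (an assumption; it talks to HBase via the Thrift gateway), with hypothetical host, table and column-family names.

import happybase

conn = happybase.Connection('hbase-thrift-host')  # Thrift server, default port 9090
table = conn.table('pageviews')

# Cells are bytes, addressed by row key and column-family:qualifier.
table.put(b'user42#2023-01-01', {b'cf:url': b'/home'})
print(table.row(b'user42#2023-01-01'))

# Row-key range scans are exactly what HBase excels at:
for key, data in table.scan(row_start=b'user42#', row_stop=b'user42$'):
    print(key, data)

conn.close()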
Drill
• A fast analytical engine for querying non-relational
datastores using SQL
• Based on Google BigQuery SQL (different from Hive’s)
• Columnar execution engine
• Data-driven compilation and recompilation at execution
time
• Specialized memory management that reduces memory
requirements and eliminates garbage collections
• Locality-aware execution that reduces network traffic
when Drill is co-located with the datastore
• Advanced cost-based optimizer that pushes processing
into the datastore when possible
Mahout
• Provides a variety of machine learning algorithms that run on Hadoop (or other backends)
• Requires Java programming skills to use
• Primarily used for:
− Recommendations (aka collaborative filtering)
− Clustering, e.g. documents by topic
− Classification
− Frequent itemset mining: find items that often appear together, say in queries or shopping baskets
Module 3 – Section 9
Resources
Resources
• White, Tom. Hadoop: The Definitive Guide, 4th Edition. O’Reilly, 2015.
• Grover et al. Hadoop Application Architectures. O’Reilly, 2015.
• Sammer, Eric. Hadoop Operations. O’Reilly, 2012.
• Kunigk & George. Hadoop in the Enterprise: Architecture. O’Reilly, 2018.
• Holmes, Alex. Hadoop in Practice. Manning, 2014.
• Capriolo & Wampler. Programming Hive. O’Reilly, 2012.
Resources (cont’d)
• Free online Pig book:
• Google paper on MapReduce:
Module 3 – Section 10
Homework
Assignment #2: Setup
• Sign up for an Azure free trial: portal.azure.com
• Go to Marketplace (button at the bottom of the dashboard).
• Enter “hdp” in Search and choose the Hortonworks Sandbox with HDP 2.5 entry that appears.
• Click Create at the bottom of the page.
Assignment #2: Basics setup
Assignment #2: SSH Connect
Assignment #2: Ambari password change
Assignment #2: Open ports
Assignment #2: Full tutorial
• Installing Hortonworks Data Platform 2.5 on Microsoft
Azure: Tutorial
• It also has the full list of ports that you need to open.
Assignment #2: Exercises
• Note: you do NOT need to hand in the exercises below.
• Read the tutorial:
− “Learning the Ropes of the HDP Sandbox”
• Complete the following tutorials:
− “How to Process Data with Apache Hive”
− “How to Process Data with Apache Pig”
Assignment #2: Questions
• Note: you DO need to hand in answers to the questions below. Instructions are on Quercus.
1. Describe a business situation in which Hadoop would be
a better choice for storing the data than a relational
database and explain why (give at least three reasons)
2. Describe a business situation in which a relational
database would be the better option and explain why
(give at least three reasons)
3. Describe a business situation where MongoDB would be a better choice than either, and again give at least three reasons
4. What is the point of having and using Pig if we have Hive and can therefore use SQL? (Again, give three advantages.)
Module 3 – Section 11
Next Class
Next Class
• Next class will be Just Enough Scala to get ready for
working with Spark
• In preparation for next class: Read Chapter 1 of
Programming in Scala:
Follow us on social
Join the conversation with us online:
facebook.com/uoftscs
@uoftscs
linkedin.com/company/university-of-toronto-school-of-continuing-studies
@uoftscs
Any questions?
Thank You
Thank you for choosing the University of Toronto
School of Continuing Studies