Internet of Things: A Hands-On Approach
Chapter 6
Data Analytics for IoT: A Primer
Objectives
▪ Selecting the Right Tool for the Job
▪ Mastering Big Data Frameworks
▪ Unlocking Real-time and Advanced Analytics
Topics
▪ Overview of Hadoop ecosystem
▪ MapReduce architecture
▪ MapReduce job execution flow
▪ MapReduce schedulers
Hadoop Ecosystem (1/2)
▪ Apache Hadoop is an open source framework for distributed batch processing
of big data.
▪ The Hadoop Ecosystem includes:
• Hadoop MapReduce
• HDFS
• YARN
• HBase
• Zookeeper
• Pig
• Hive
• Mahout
• Chukwa
• Cassandra
• Avro
• Oozie
• Flume
• Sqoop
Hadoop Ecosystem (2/2)
Apache Hadoop (1/5)
▪ A Hadoop cluster comprises a master node, a backup node and a number of
slave nodes.
▪ The master node runs the NameNode and JobTracker processes and
the slave nodes run the DataNode and TaskTracker components of
Hadoop.
▪ The backup node runs the Secondary NameNode process.
Apache Hadoop (2/5)
▪ NameNode:
• NameNode keeps the directory tree of all files in the file system, and tracks
where across the cluster the file data is kept. It does not store the data of these
files itself. Client applications talk to the NameNode whenever they wish to
locate a file, or when they want to add/copy/move/delete a file.
▪ Secondary NameNode:
• The NameNode is a single point of failure for the HDFS cluster. An optional
Secondary NameNode, hosted on a separate machine, creates checkpoints of
the namespace.
Apache Hadoop (3/5)
▪ JobTracker:
• The JobTracker is the service within Hadoop that distributes MapReduce tasks
to specific nodes in the cluster, ideally the nodes that have the data, or at least
are in the same rack.
▪ TaskTracker:
• TaskTracker is a node in a Hadoop cluster that accepts Map, Reduce and
Shuffle tasks from the JobTracker.
• Each TaskTracker has a defined number of slots which indicate the number of
tasks that it can accept.
Apache Hadoop (4/5)
▪ DataNode:
• A DataNode stores data in an HDFS file system.
• A functional HDFS filesystem has more than one DataNode, with data
replicated across them.
• DataNodes respond to requests from the NameNode for filesystem operations.
• Client applications can talk directly to a DataNode once the NameNode has
provided the location of the data (see the sketch below).
• Similarly, MapReduce operations assigned to TaskTracker instances near a
DataNode talk directly to the DataNode to access the files.
• TaskTracker instances can be deployed on the same servers that host
DataNode instances, so that MapReduce operations are performed close to
the data.
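The split of responsibilities between the NameNode (metadata) and the DataNodes (file blocks) is visible even from a simple client. The sketch below uses HDFS's WebHDFS REST interface: the NameNode answers a directory listing itself, but redirects a file write to a DataNode, which the client then contacts directly. The host name, port (50070 is the Hadoop 2.x default, 9870 in Hadoop 3.x) and user name are assumptions for illustration.

```python
# Minimal sketch of talking to HDFS over the WebHDFS REST API.
# Assumptions: a NameNode with WebHDFS enabled at namenode.example.com:50070
# and a user "hduser" (both hypothetical).
import requests

NAMENODE = "http://namenode.example.com:50070"  # hypothetical NameNode address
USER = "hduser"                                 # hypothetical HDFS user

def list_directory(path):
    """Ask the NameNode for a directory listing (metadata only)."""
    url = f"{NAMENODE}/webhdfs/v1{path}?op=LISTSTATUS&user.name={USER}"
    resp = requests.get(url)
    resp.raise_for_status()
    return [f["pathSuffix"] for f in resp.json()["FileStatuses"]["FileStatus"]]

def write_file(path, data):
    """Two-step write: the NameNode redirects the client to a DataNode,
    and the file contents are then sent directly to that DataNode."""
    url = f"{NAMENODE}/webhdfs/v1{path}?op=CREATE&user.name={USER}&overwrite=true"
    # Step 1: the NameNode responds with a redirect to a DataNode location.
    redirect = requests.put(url, allow_redirects=False)
    datanode_url = redirect.headers["Location"]
    # Step 2: stream the actual bytes straight to the DataNode.
    resp = requests.put(datanode_url, data=data)
    resp.raise_for_status()

if __name__ == "__main__":
    write_file("/user/hduser/sensor_readings.txt", b"device1,22.5\ndevice2,23.1\n")
    print(list_directory("/user/hduser"))
```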
Apache Hadoop (5/5)
MapReduce (1/2)
▪ A MapReduce job consists of two phases:
• Map: In the Map phase, data is read from a distributed file system and
partitioned among a set of computing nodes in the cluster. The data is sent to
the nodes as a set of key-value pairs. The Map tasks process the input records
independently of each other and produce intermediate results as key-value
pairs. The intermediate results are stored on the local disk of the node
running the Map task.
• Reduce: When all the Map tasks are completed, the Reduce phase begins, in
which the intermediate data with the same key is aggregated (see the
word-count sketch below).
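As a concrete illustration of the two phases, the following is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain scripts that read records from stdin and emit tab-separated key-value pairs on stdout. The file names mapper.py and reducer.py are assumptions; Hadoop Streaming sorts the mapper output by key before it reaches the reducer.

```python
#!/usr/bin/env python3
# Minimal word-count mapper/reducer in the Hadoop Streaming style.
# Pass "reduce" as an argument to act as the reducer; otherwise it acts as
# the mapper (or split the two functions into mapper.py / reducer.py).
import sys

def run_mapper():
    # Map: emit an intermediate (word, 1) pair for every word in the input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Reduce: input arrives sorted by key, so all counts for a word are adjacent.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    run_reducer() if "reduce" in sys.argv else run_mapper()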
MapReduce (2/2)
▪ Optional Combine Task
• An optional Combine task can be used to perform data aggregation on the
intermediate data of the same key for the output of the mapper, before
transferring the output to the Reduce task.
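For word count, the combine step can reuse the reducer logic, since summing counts is associative. The sketch below shows the same idea as in-mapper combining: the mapper sums counts for repeated words locally before emitting them, which reduces the volume of intermediate data shuffled to the reducers. This is an illustrative alternative to configuring a separate Combine task.

```python
#!/usr/bin/env python3
# Combiner-style local aggregation inside the mapper: counts for words seen
# by this map task are summed in memory before being emitted, so fewer
# intermediate (word, count) pairs are sent across the network.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    counts.update(line.strip().split())
for word, count in counts.items():
    print(f"{word}\t{count}")
```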
MapReduce Job Execution Workflow (1/4)
▪ MapReduce job execution starts when the client applications submit jobs to the
JobTracker.
▪ The JobTracker returns a JobID to the client application. The JobTracker talks to
the NameNode to determine the location of the data.
MapReduce Job Execution Workflow (2/4)
▪ The JobTracker locates TaskTracker nodes with available slots at or near the data.
▪ The TaskTrackers send out heartbeat messages to the JobTracker, usually every
few minutes, to reassure the JobTracker that they are still alive. These messages
also inform the JobTracker of the number of available slots, so the JobTracker can
stay up to date with where in the cluster new work can be delegated.
MapReduce Job Execution Workflow (3/4)
▪ The JobTracker submits the work to the TaskTracker nodes when they poll for
tasks. To choose a task for a TaskTracker, the JobTracker uses various scheduling
algorithms (default is FIFO).
▪ The TaskTracker nodes are monitored using the heartbeat signals that are sent by
the TaskTrackers to the JobTracker.
MapReduce Job Execution Workflow (4/4)
▪ The TaskTracker spawns a separate JVM process for each task so that any task
failure does not bring down the TaskTracker.
▪ The TaskTracker monitors these spawned processes while capturing the output
and exit codes. When the process finishes, successfully or not, the TaskTracker
notifies the JobTracker. When the job is completed, the JobTracker updates its
status.
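To tie the workflow together, a job built from the mapper and reducer sketched earlier could be submitted through the Hadoop Streaming jar; the JobTracker (or YARN, in Hadoop 2.x) then schedules the map and reduce tasks as described above. The streaming jar path below is an assumption and depends on the Hadoop installation.

```python
# Sketch of submitting the word-count job from Python; equivalent to running
# the "hadoop jar" command directly on the command line.
import subprocess

subprocess.run([
    "hadoop", "jar",
    "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",   # assumed install path
    "-input", "/user/hduser/input",                      # HDFS input directory
    "-output", "/user/hduser/output",                    # must not already exist
    "-mapper", "mapper.py",
    "-combiner", "reducer.py",                           # optional Combine step
    "-reducer", "reducer.py",
    "-file", "mapper.py",
    "-file", "reducer.py",
], check=True)
```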
MapReduce 2.0 – YARN (1/2)
▪ In Hadoop 2.0, the original processing engine of Hadoop (MapReduce)
has been separated from resource management (which is now
part of YARN).
▪ This makes YARN effectively an operating system for Hadoop that
supports different processing engines on a Hadoop cluster such as
MapReduce for batch processing, Apache Tez for interactive queries,
Apache Storm for stream processing, etc.
MapReduce 2.0 – YARN (2/2)
▪ The YARN architecture divides the two major
functions of the JobTracker - resource management and
job life-cycle management - into
separate components:
• ResourceManager
• ApplicationMaster
YARN Components (1/2)
▪ Resource Manager (RM): The RM manages the global assignment of compute
resources to applications. The RM consists of two main services:
• Scheduler: The Scheduler is a pluggable service that manages and enforces
the resource scheduling policy in the cluster.
• Applications Manager (AsM): The AsM manages the running Application
Masters in the cluster. The AsM is responsible for starting application masters
and for monitoring and restarting them on different nodes in case of failures.
YARN Components (2/2)
▪ Application Master (AM): A per-application AM
manages the application’s life cycle. AM is
responsible for negotiating resources from the
RM and working with the NMs to execute and
monitor the tasks.
▪ Node Manager (NM): A per-machine NM
manages the user processes on that machine.
▪ Containers: A container is a bundle of resources
allocated by the RM (memory, CPU, network, etc.). A
container is a conceptual entity that grants an
application the privilege to use a certain amount
of resources on a given machine to run a
component task.
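The ResourceManager exposes its view of the cluster (nodes, containers, running applications) through a REST API, which is a quick way to see these components at work. The sketch below assumes a ResourceManager web UI reachable at resourcemanager.example.com on the default port 8088; the /ws/v1/cluster/metrics and /ws/v1/cluster/apps endpoints are part of YARN's web services.

```python
# Querying the YARN ResourceManager REST API for cluster metrics and
# running applications (the host name below is a placeholder).
import requests

RM = "http://resourcemanager.example.com:8088"

metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("Active NodeManagers: ", metrics["activeNodes"])
print("Allocated containers:", metrics["containersAllocated"])

apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["allocatedMB"], "MB")
```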
Hadoop Schedulers (1/2)
▪ Hadoop scheduler is a pluggable component that makes it
opentosupportdifferent scheduling algorithms
▪ The default scheduler in Hadoop is FIFO
▪ Two advanced schedulers are also available - the Fair
Scheduler, developedat Facebook, and the Capacity
Scheduler, developed at Yahoo
Hadoop Schedulers (2/2)
▪ The pluggable scheduler framework provides the flexibility
to support a variety of workloads with varying priority and
performance constraints.
▪ Efficient job scheduling makes Hadoop a multi-tasking
system that can process multiple data sets for multiple jobs
for multiple users simultaneously.
FIFO Scheduler
▪ FIFO is the default scheduler in Hadoop that maintains
a work queue in which the jobs are queued.
▪ The scheduler pulls jobs in first-in, first-out manner (oldest
job first) for scheduling.
▪ There is no concept of priority or size of job in the
FIFO scheduler.
Fair Scheduler (1/3)
▪ The Fair Scheduler allocates resources evenly between
multiple jobs and also provides capacity guarantees.
▪ The Fair Scheduler assigns resources to jobs such that each job
gets an equal share of the available resources on average
over time.
▪ Task slots that are free are assigned to new jobs, so
that each job gets roughly the same amount of CPU time.
Fair Scheduler (2/3)
▪ Job Pools
• The Fair Scheduler maintains a set of pools into which jobs are placed. Each
pool has a guaranteed capacity.
• When there is a single job running, all the resources are assigned to that
job. When there are multiple jobs in the pools, each pool gets at least as
many task slots as guaranteed.
• Each pool receives at least the minimum share.
• When a pool does not require the guaranteed share, the excess capacity is
split between other jobs.
Fair Scheduler (3/3)
▪ Fairness
• The scheduler periodically computes the difference between the
computing time received by each job and the time it should have
received under ideal scheduling.
• The job which has the highest deficit of the compute time
received is scheduled next (see the sketch below).
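A toy illustration of this fairness rule (not Hadoop code): each job's deficit is its ideal equal share of compute time minus the time it has actually received, and the job with the largest deficit is scheduled next.

```python
# Fair Scheduler fairness rule, reduced to a toy function.
def pick_next_job(received, elapsed_time):
    """received: dict mapping job id -> compute time the job has received so far."""
    ideal_share = elapsed_time / len(received)         # equal share on average over time
    deficits = {job: ideal_share - t for job, t in received.items()}
    return max(deficits, key=deficits.get)             # highest deficit runs next

received = {"job-a": 40.0, "job-b": 10.0, "job-c": 25.0}
print(pick_next_job(received, elapsed_time=90.0))      # -> job-b (deficit 20)
```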
Capacity Scheduler (1/2)
▪ The Capacity Scheduler has similar functionality to the Fair
Scheduler but adopts a different scheduling philosophy.
▪ Queues
• In the Capacity Scheduler, you define a number of named queues, each
with a configurable number of map and reduce slots.
• Each queue is also assigned a guaranteed capacity.
• The Capacity Scheduler gives each queue its capacity when it contains jobs,
and shares any unused capacity between the queues. Within each queue,
FIFO scheduling with priority is used.
Capacity Scheduler (2/2)
▪ Fairness
• For fairness, it is possible to place a limit on the percentage of running
tasks per user, so that users share a cluster equally.
• A wait time for each queue can be configured. When a queue is not
scheduled for more than the wait time, it can preempt tasks of other
queues to get its fair share.
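The queue behaviour described above can be summarised in a small toy model (again, not Hadoop code): each named queue has a guaranteed share of the cluster's slots, jobs within a queue are served FIFO, and capacity left idle by an empty queue is handed to queues that still have pending jobs.

```python
# Toy model of Capacity Scheduler queues: guaranteed capacities, FIFO within
# each queue, and redistribution of unused capacity to busy queues.
from collections import deque

TOTAL_SLOTS = 100
queues = {
    "production": {"capacity": 0.7, "jobs": deque(["prod-job-1", "prod-job-2"])},
    "research":   {"capacity": 0.3, "jobs": deque()},
}

def assign_slots():
    assignments, spare = {}, 0
    for q in queues.values():
        guaranteed = int(q["capacity"] * TOTAL_SLOTS)
        if q["jobs"]:
            assignments[q["jobs"][0]] = guaranteed   # oldest job first (FIFO)
        else:
            spare += guaranteed                      # idle queue donates its capacity
    busy = [q for q in queues.values() if q["jobs"]]
    for q in busy:
        assignments[q["jobs"][0]] += spare // len(busy)   # share the unused capacity
    return assignments

print(assign_slots())   # {'prod-job-1': 100}
```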
Further Reading
▪ Apache Hadoop, http://hadoop.apache.org
▪ Apache Hive, http://hive.apache.org
▪ Apache HBase, http://hbase.apache.org
▪ Apache Chukwa, http://chukwa.apache.org
▪ Apache Flume, http://flume.apache.org
▪ Apache Zookeeper, http://zookeeper.apache.org
▪ Apache Avro, http://avro.apache.org
▪ Apache Oozie, http://oozie.apache.org
▪ Apache Storm, http://storm-project.net
▪ Apache Tez, http://tez.incubator.apache.org
▪ Apache Cassandra, http://cassandra.apache.org
▪ Apache Mahout, http://mahout.apache.org
▪ Apache Pig, http://pig.apache.org
▪ Apache Sqoop, http://sqoop.apache.org
Summary
▪ Tackling Big Data Challenges
▪ Unlocking Real-time Insights
▪ Advanced Analytics