DATA228 Lecture Notes Week 3

The document provides an overview of Hadoop, its history, architecture, and ecosystem, highlighting its role as a scalable and resilient framework for big data storage and processing. It discusses the evolution of Hadoop from its inception based on Google's papers to its current status as a widely adopted open-source platform in the industry. Additionally, it outlines the various components of Hadoop, including HDFS, YARN, and MapReduce, as well as how to run Hadoop in different setups.

DATA 228

Big Data Technologies and Applications (Fall 2024)

Sangjin Lee
Hadoop: history & architecture

Chapter 1 & parts of 10, “Hadoop: The Definitive Guide”, 4th Edition, Tom White

Hadoop: Big Data refresher

• Store much larger volumes of data

• Compute/analyze much larger volumes of data

• Handle diverse and mostly unstructured data

• … And do it cheaply

• Hadoop is the first complete open-source platform for Big Data

Hadoop: history

• 2003 - 2004: two seminal papers from Google

  • “The Google File System”, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003

  • “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean, Sanjay Ghemawat, 2004

• These were based on large-scale systems that were in wide use at Google at the time

Hadoop: history

• 2005 - 2006: Doug Cutting at Yahoo creates a MapReduce implementation and forms an open-source project called Hadoop

• 2008: Hadoop becomes a top-level Apache project

• 2012: Hadoop 2 released (alpha; general availability followed in 2013)

  • Introduced YARN: MapReduce becomes one YARN application type

  • MR v.2 APIs

Hadoop: history

• Since: Hadoop becomes ubiquitous in the industry

• Companies built on Hadoop: Cloudera, Hortonworks, MapR (now HPE)

• Almost all companies in the industry today use and operate Hadoop

• All cloud providers offer first-class support for Hadoop

• Hadoop has spawned an ecosystem

Hadoop in the cloud

                   AWS                   GCP

Compute            Amazon EMR            Dataproc

Elastic storage    Amazon S3             GCS

Streaming          AWS Flink             Dataflow

Data lake          AWS Lake Formation    BigLake

Other              AWS Redshift          BigQuery, BigTable

What is Hadoop?

• Hadoop is two distributed systems for storage and compute

  • Highly scalable: w.r.t. horizontal scalability

  • Highly available: w.r.t. resiliency and fault tolerance

• Hadoop is a framework with which to interact with Big Data

  • MapReduce APIs

  • HDFS APIs

  • YARN APIs

• Hadoop is an ecosystem

Hadoop as a distributed system
Two distributed systems for storage and compute

[Diagram: two stacks, compute on top of storage]

Compute: MapReduce API; MapReduce and other YARN apps; YARN

Storage: Distributed filesystem API; HDFS

Hadoop as a distributed system

• HDFS as a distributed storage/filesystem

• YARN as a distributed compute scheduler

• MapReduce as a big data processing framework

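To make the MapReduce processing model more concrete, here is a minimal single-process sketch in plain Python. It does not use any Hadoop API; the names map_fn, reduce_fn, and run_mapreduce are hypothetical and exist only for illustration. In a real cluster the map and reduce calls would run as tasks scheduled by YARN, with input and output stored in HDFS.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    lines: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: collect values per key
    # Reduce each key group; keys are sorted, as reducers see them in order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["big data big clusters", "data on clusters"]
    print(run_mapreduce(sample, map_fn, reduce_fn))
```

The framework's value is in what this sketch leaves out: splitting the input across machines, moving the grouped data between them, and retrying failed tasks.
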
Hadoop as a distributed system

• Hadoop is composable: you can use some (do not have to use all)

• Examples

  • Use only HDFS

  • Use only YARN

  • Use only YARN + MapReduce

• Caveat from the provider perspective: properly spec’ed hardware

Hadoop code organization

[Diagram: code modules, top to bottom]

Client API

Tools

MapReduce

YARN

HDFS

Hadoop Common

Hadoop architecture

• Master/central nodes vs. worker nodes

  • HDFS: Namenode and Datanodes

  • YARN: Resource Manager and Node Managers

• High availability

  • Fail over to standby master nodes in case of master failures

  • Coordinated using ZooKeeper

Hadoop architecture

• Self-healing: resilient against individual node failures

  • Data gets re-replicated if a data node is lost

  • A task gets restarted (on another node) if a node fails

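The re-replication idea can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not how the Namenode is implemented: the functions place_blocks and handle_node_loss are hypothetical, replica placement is random rather than rack-aware, and the target of 3 replicas mirrors the HDFS default replication factor.

```python
import random

REPLICATION_FACTOR = 3  # assumed target; matches the HDFS default


def place_blocks(blocks, nodes, rng):
    """Assign each block to REPLICATION_FACTOR distinct nodes (random placement)."""
    return {block: set(rng.sample(nodes, REPLICATION_FACTOR)) for block in blocks}


def handle_node_loss(placement, live_nodes, lost, rng):
    """Drop the lost node and copy under-replicated blocks to other live nodes."""
    live = [node for node in live_nodes if node != lost]
    for holders in placement.values():
        holders.discard(lost)
        while len(holders) < REPLICATION_FACTOR:
            candidates = [node for node in live if node not in holders]
            if not candidates:
                break  # not enough live nodes to restore full replication
            holders.add(rng.choice(candidates))
    return live


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
    placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes, rng)
    handle_node_loss(placement, nodes, "dn2", rng)
    for block, holders in placement.items():
        print(block, sorted(holders))
```

After the simulated loss, every block is back to three replicas on the surviving nodes, which is the property the slide calls self-healing.
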
Hadoop as an ecosystem

• Higher-level frameworks that create complex MapReduce workflows: Pig, Oozie, Cascading, Scalding, …

• SQL on Hadoop: Hive, Phoenix, Impala, Presto, …

• Storage systems on Hadoop: HBase

• Serialization/format libraries: Parquet, Avro, ORC, …

• Spark

Running Hadoop

• Single-node setup

  • Single-node standalone (“local”)

  • Single-node pseudo-distributed setup

• Cluster setup

• Cloud setup

  • Roll your own: cluster setup using VMs

  • More cloud-native setup: on-demand YARN/MR + cloud storage

Running Hadoop

                      # of processes    # of machines

local                 1                 1

pseudo-distributed    several           1

cluster               many              many

Running Hadoop
Demo

Install and run Hadoop in a single-node setup

Running Hadoop
Demo

• Install pre-requisites: JDK, ssh, etc.

• Install Hadoop

• Explore the Hadoop installation

• Try standalone setup

• Start and stop pseudo-distributed setup

Running Hadoop
Pseudo-distributed cluster

• https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

• Set up ssh for localhost

  • Install ssh (server and client): sshd and ssh

  • Make sshd run in the background

  • Do key generation (keygen) to do passwordless localhost ssh

• “Format” the HDFS filesystem

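The steps above can be sketched as a small Python driver around the shell commands listed in the linked single-node guide. This is a hedged sketch: the helper run_steps and the SETUP_STEPS list are hypothetical, the commands assume you are in the Hadoop installation directory with ssh already installed, and the script only prints the commands unless dry_run is set to False. Note that formatting the Namenode initializes a fresh filesystem, so it is meant for a first-time setup.

```python
import subprocess

# (description, shell command) pairs, following the single-node setup guide.
# Paths such as bin/hdfs assume the current directory is the Hadoop install.
SETUP_STEPS = [
    ("Generate a key pair with an empty passphrase",
     "ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa"),
    ("Authorize the key for logins to localhost",
     "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys"),
    ("Restrict permissions on the authorized keys file",
     "chmod 0600 ~/.ssh/authorized_keys"),
    ("Format the HDFS filesystem (first-time setup only)",
     "bin/hdfs namenode -format"),
    ("Start the Namenode and Datanode daemons",
     "sbin/start-dfs.sh"),
]


def run_steps(steps, dry_run=True):
    """Print each step; execute it through the shell only when dry_run is False."""
    handled = []
    for description, command in steps:
        print(f"# {description}")
        print(f"$ {command}")
        if not dry_run:
            subprocess.run(command, shell=True, check=True)
        handled.append(command)
    return handled


if __name__ == "__main__":
    run_steps(SETUP_STEPS)  # dry run: shows the commands without running them
```

Keeping the default as a dry run lets you review each command first; the daemons can later be stopped with sbin/stop-dfs.sh.
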

