DATA 228
Big Data Technologies and Applications (Fall 2024)
Sangjin Lee
Hadoop: history & architecture
Chapter 1 & parts of 10, “Hadoop: The Definitive Guide”, 4th Edition, Tom White
Hadoop: Big Data refresher
• Store much larger volumes of data
• Compute/analyze much larger volumes of data
• Handle diverse and mostly unstructured data
• … And do it cheaply
• Hadoop is the first complete open-source platform for Big Data
Hadoop: history
• 2003 - 2004: two seminal papers from Google
  • “The Google File System”, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003
  • “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean, Sanjay Ghemawat, 2004
• These were based on large-scale systems that were in wide use at Google at the time
Hadoop: history
• 2005 - 2006: Doug Cutting at Yahoo creates a MapReduce implementation and forms an open-source project called Hadoop
• 2008: Hadoop becomes a top-level Apache project
• 2012: Hadoop 2 released
  • Introduced YARN: MapReduce becomes one YARN application type
  • MR v.2 APIs
Hadoop: history
• Since then: Hadoop becomes ubiquitous in the industry
• Companies built on Hadoop: Cloudera, Hortonworks, MapR (-> HPE)
• Almost all companies in the industry today use and operate Hadoop
• All cloud providers offer first-class support for Hadoop
• Hadoop has spawned an ecosystem
Hadoop in the cloud
                 AWS                                       GCP
Compute          Amazon EMR                                Dataproc
Elastic storage  Amazon S3                                 GCS
Streaming        Amazon Managed Service for Apache Flink   Dataflow
Data lake        AWS Lake Formation                        BigLake
Other            Amazon Redshift                           BigQuery, Bigtable
What is Hadoop?
• Hadoop is two distributed systems for storage and compute
  • Highly scalable: w.r.t. horizontal scalability
  • Highly available: w.r.t. resiliency and fault tolerance
• Hadoop is a framework with which to interact with Big Data
  • MapReduce APIs
  • HDFS APIs
  • YARN APIs
• Hadoop is an ecosystem
Hadoop as a distributed system
Two distributed systems for storage and compute
Compute: MapReduce API -> MapReduce (alongside other YARN apps) -> YARN
Storage: Distributed filesystem API -> HDFS
Hadoop as a distributed system
• HDFS as a distributed storage/filesystem
• YARN as a distributed compute scheduler
• MapReduce as a big data processing framework
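To make the MapReduce role concrete, here is a minimal in-memory sketch of the map -> shuffle -> reduce flow, using word count as the job. It is a toy simulation in plain Python, not the Hadoop MapReduce API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the framework distributes these phases across machines, with YARN scheduling the tasks and HDFS holding the input and output.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially in one process (hypothetical driver)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data at scale"]))
```

Only the mapper and the reducer carry job-specific logic; the grouping step in between is generic, which is why a framework can take it over.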
Hadoop as a distributed system
• Hadoop is composable: you can use some (do not have to use all)
• Examples
  • Use only HDFS
  • Use only YARN
  • Use only YARN + MapReduce
• Caveat from the provider perspective: properly spec’ed hardware
Hadoop code organization
Client API
Tools
MapReduce
YARN
HDFS
Hadoop Common
Hadoop architecture
• Master/central nodes vs. worker nodes
  • HDFS: Namenode and Datanodes
  • YARN: Resource Manager and Node Managers
• High availability
  • Fail over to standby master nodes in case of master failures
  • Coordinated using ZooKeeper
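The active/standby idea can be sketched as a toy model. The `fail_over` function, the node names and the `is_healthy` callback below are all hypothetical. On a real cluster, failure detection and the choice of the new active master are coordinated through ZooKeeper rather than by a loop like this.

```python
def fail_over(active, standbys, is_healthy):
    """Return (new_active, remaining_standbys).

    Keeps the current active master if it is healthy; otherwise promotes the
    first healthy standby. The failed master is simply dropped from the pool.
    """
    if is_healthy(active):
        return active, list(standbys)
    for i, candidate in enumerate(standbys):
        if is_healthy(candidate):
            remaining = list(standbys[:i]) + list(standbys[i + 1:])
            return candidate, remaining
    raise RuntimeError("no healthy master available")


if __name__ == "__main__":
    # Assumed scenario: the active master "nn1" has failed.
    down = {"nn1"}
    print(fail_over("nn1", ["nn2", "nn3"], lambda node: node not in down))
```

The sketch leaves out what makes failover hard in practice, such as ensuring that only one master believes it is active at any time.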
Hadoop architecture
• Self-healing: resilient against individual node failures
  • Data gets re-replicated if a data node is lost
  • A task gets restarted (on another node) if a node fails
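The re-replication idea can be sketched with a small simulation. It assumes a replication factor of 3 (the HDFS default) and chooses replacement nodes alphabetically; real HDFS placement also considers racks and load, which this sketch ignores. The function and node names are hypothetical.

```python
REPLICATION_FACTOR = 3  # assumed target; matches the HDFS default


def re_replicate(block_locations, live_nodes, replication=REPLICATION_FACTOR):
    """Return a new block -> set-of-nodes map with lost replicas restored.

    block_locations: dict mapping block id -> set of nodes holding a replica
    live_nodes: iterable of nodes that are still alive
    """
    live = set(live_nodes)
    healed = {}
    for block, nodes in block_locations.items():
        survivors = set(nodes) & live
        if not survivors:
            # No copy left to replicate from: the block is lost.
            raise RuntimeError(f"block {block} has no surviving replica")
        candidates = sorted(live - survivors)
        needed = max(0, replication - len(survivors))
        healed[block] = survivors | set(candidates[:needed])
    return healed


if __name__ == "__main__":
    blocks = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn2", "dn3", "dn4"}}
    # Assumed scenario: data node "dn2" is lost.
    print(re_replicate(blocks, ["dn1", "dn3", "dn4", "dn5"]))
```

Because the surviving replicas are the copy source, a block is lost only if every node holding it fails before re-replication completes.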
Hadoop as an ecosystem
• Higher-level frameworks that create complex MapReduce workflows: Pig, Oozie, Cascading, Scalding, …
• SQL on Hadoop: Hive, Phoenix, Impala, Presto, …
• Storage systems on Hadoop: HBase
• Serialization/format libraries: Parquet, Avro, ORC, …
• Spark
Running Hadoop
Running Hadoop
• Single-node setup
  • Single-node standalone (“local”)
  • Single-node pseudo-distributed setup
• Cluster setup
• Cloud setup
  • Roll your own: cluster setup using VMs
  • More cloud-native setup: on-demand YARN/MR + cloud storage
Running Hadoop
                    # of processes   # of machines
local               1                1
pseudo-distributed  several          1
cluster             many             many
Running Hadoop
Demo
Install and run Hadoop in a single-node setup
Running Hadoop
Demo
• Install pre-requisites: JDK, ssh, etc.
• Install Hadoop
• Explore the Hadoop installation
• Try standalone setup
• Start and stop pseudo-distributed setup
Running Hadoop
Pseudo-distributed cluster
• https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
• Set up ssh for localhost
  • Install ssh (server and client): sshd and ssh
  • Make sshd run in the background
  • Do key generation (ssh-keygen) for passwordless localhost ssh
• “Format” the HDFS filesystem
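The steps above can be sketched as a dry-run plan that only builds and prints the shell commands, in order. The commands follow the single-node setup guide linked above. The installation path `/opt/hadoop` and the function name are assumptions for illustration. Installing ssh and starting sshd are platform-specific and left out.

```python
import shlex


def pseudo_distributed_plan(hadoop_home):
    """Return the shell commands for a first pseudo-distributed start, in order.

    hadoop_home is wherever Hadoop was unpacked (hypothetical path in the demo below).
    Nothing is executed here; the caller decides whether to run the commands.
    """
    home = shlex.quote(hadoop_home)
    return [
        # Passwordless ssh to localhost: generate a key and authorize it.
        "ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa",
        "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys",
        "chmod 0600 ~/.ssh/authorized_keys",
        # Check that login now works without a password prompt.
        "ssh localhost true",
        # One-time initialization of the HDFS namespace.
        f"{home}/bin/hdfs namenode -format",
        # Start the HDFS daemons (namenode, datanode).
        f"{home}/sbin/start-dfs.sh",
    ]


if __name__ == "__main__":
    for command in pseudo_distributed_plan("/opt/hadoop"):
        print(command)
```

Formatting is a one-time step: running it again on an existing installation initializes a fresh namespace and discards the old filesystem metadata. The matching shutdown command is `sbin/stop-dfs.sh`.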