Big Data
Ghislain Fourny
1. Introduction
Fall 2021
123RF / generalfmv
1.85m
?
123RF / shtanzman
40,000 km
40 Mm
Source: Wikipedia
150 Gm
Source: Wikipedia
1 Tm
Source: Wikipedia
1 Pm
Source: Wikipedia
1000 Em
Source: Wikipedia
1 Zm
Source: Wikipedia
138 Ym
Source: Wikipedia
Exploring the infinitely big...
Source: Wikipedia
... means exploring the infinitely small
Source: Wikipedia
Data is like matter
123RF / shtanzman
Physics: study of the real world
Data Science: study of the data world
Poll
Go now to:
https://eduapp-app1.ethz.ch/
or install EduApp 3.x
My How-We-Do-Science Matrix: The four paradigms
Axes: ontological, epistemic

                                         The world as it must be            The world as it is
                                         (necessary)                        (contingent)
With our brain (natural): Thinking       Mathematics (theoretical)          Physics (empirical)
With a machine (artificial): Computing   Computer Science (computational)   Data Science (data driven)
The Physics of CS
A good decision is based on
knowledge, not on numbers.
- Plato
Data Management Lectures at ETH
Computer Science
Data Science Big Data
CBB MSc with CS background (Fall)
This lecture
Other departments
Information Systems for Engineers
Fall 2021
+ Big Data for Engineers
Spring 2022
A Short History of Databases
123RF / Samantha Craddock
123RF / andreykuzmin
Speaking/Singing
Writing
Rosetta Stone © Hans Hillewaert
Accounting
Plimpton 322 (Public Domain)
Printing
Willi Heidelbach
Ben Franske - DM IBM S360.jpg on en.wiki
Computers
1960s: File Systems
Lorem Ipsum
Dolor sit amet
Consectetur
Adipiscing
Elit. In
Imperdiet
Ipsum ante
123RF / andreykuzmin
1970s: The Relational Era
2000s: The NoSQL Era
Key-value stores
Column stores
Document stores
Triple stores
In short?
1970: We threw data at computers.
1990: We threw computers at computers.
2000: We threw computers at data.
Now: We are throwing data at data.
Big Data
It's a buzzword!
Big Data goes across disciplines
Applications
Machine Learning
Statistics Algorithms
Data Management
Programming Languages
Distributed Systems High-Performance Computing
Big Data involves a lot of proprietary technology
The Big in Big Data
123RF / Patricia Hofmeester
The Three Vs
Volume TB ZB
Big Data
Variety
Velocity
MORE Data Volume
Content Location
Web Sensors IoT
Usage
Digital Traces
Experiments
Proprietary Scientific
Surveys
Data Volume
… because we can!
Technology, Software, Hardware, Infrastructure
Data Volume
… because data carries value
Data is worth more than the sum of its parts
Utility(A + B) > Utility(A) + Utility(B), for two datasets A and B
Data totality: one must have complete data
All flights
All hotels
All shops
...
Prefixes (International System of Units)
kilo (k) 1,000 (3 zeros)
Mega (M) 1,000,000 (6 zeros)
Giga (G) 1,000,000,000 (9 zeros)
Tera (T) 1,000,000,000,000 (12 zeros)
Peta (P) 1,000,000,000,000,000 (15 zeros)
Exa (E) 1,000,000,000,000,000,000 (18 zeros)
Zetta (Z) 1,000,000,000,000,000,000,000 (21 zeros)
Yotta (Y) 1,000,000,000,000,000,000,000,000 (24 zeros)
You must know this by heart!
Clicker question
Go now to:
https://eduapp-app1.ethz.ch/
or install EduApp 3.x
Prefixes (International System of Units)
kilo (k)
Mega (M)
Giga (G)
Tera (T)
Peta (P)
Exa (E)
Zetta (Z)
Yotta (Y)
Binary prefixes (IEC)
kibi (Ki) 1,024 (2^10)
Mebi (Mi) 1,048,576 (2^20)
Gibi (Gi) 1,073,741,824 (2^30)
Tebi (Ti) 1,099,511,627,776 (2^40)
Pebi (Pi) 1,125,899,906,842,624 (2^50)
Exbi (Ei) 1,152,921,504,606,846,976 (2^60)
Zebi (Zi) 1,180,591,620,717,411,303,424 (2^70)
Yobi (Yi) 1,208,925,819,614,629,174,706,176 (2^80)
You must NOT know this by heart!
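The decimal and binary prefix families drift apart as the exponent grows. The following is a minimal sketch that only uses the values listed above; the dictionary names and the helper `binary_excess_percent` are illustrative choices, not part of the lecture.

```python
# Decimal prefixes: powers of 1,000.
DECIMAL = {
    name: 10 ** (3 * i)
    for i, name in enumerate(
        ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"], start=1
    )
}

# Binary prefixes: powers of 1,024.
BINARY = {
    name: 2 ** (10 * i)
    for i, name in enumerate(
        ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"], start=1
    )
}


def binary_excess_percent(decimal_name: str, binary_name: str) -> float:
    """How much larger the binary prefix is than its decimal counterpart, in percent."""
    return (BINARY[binary_name] / DECIMAL[decimal_name] - 1) * 100


# Compare the prefixes pairwise, in increasing order of magnitude.
for d, b in zip(DECIMAL, BINARY):
    print(f"{b} exceeds {d} by {binary_excess_percent(d, b):.1f}%")
```

The gap grows from 2.4% at kilo/kibi to roughly 21% at yotta/yobi, which is why the two families should not be mixed up when sizing storage.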
Data Variety
Data Shapes: Tables
Data Shapes: Trees
Data Shapes: Graphs
Data Shapes: Cubes
Data Shapes: Text
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel erat nec dui
aliquet vulputate sed quis nulla. Donec eget ultricies magna, eu dignissim elit.
Nullam sed urna nec nisl rhoncus ullamcorper placerat et enim. Integer varius
ornare libero quis consequat. Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Aenean eu efficitur orci. Aenean ac posuere tellus. Ut id
commodo turpis.
Praesent nec libero metus. Praesent at turpis placerat, congue ipsum eget,
scelerisque justo. Ut volutpat, massa ac lacinia cursus, nisl dui volutpat arcu,
quis interdum sapien turpis in tellus. Suspendisse potenti. Vestibulum pharetra
justo massa, ac venenatis mi condimentum nec. Proin viverra tortor non orci
suscipit rutrum. Phasellus sit amet euismod diam. Nullam convallis nunc sit
amet diam suscipit dapibus. Integer porta hendrerit nunc. Quisque pharetra
congue porta. Suspendisse vestibulum sed mi in euismod. Etiam a purus
suscipit, accumsan nibh vel, posuere ipsum. Nulla nec tempor nibh, id
venenatis lectus. Duis lobortis id urna eget tincidunt.
Data Velocity
Data is generated automatically
Picture: Vladimir Voronin/123RF
Data is a realtime byproduct of human activity
Picture: Ash Waechter/123RF
Three paramount factors
Capacity: "How much data can we store?"
Throughput: "How fast can we transmit data?"
Latency: "When do I start receiving data?"
Picture: Ash Waechter/123RF
1956: IBM RAMAC 350
Capacity: 5 MB
Throughput: 12.5 kB/s
Latency: 600 ms
Dimensions: 1.7 × 1.5 × 0.7 m
Computer history museum
computerhistory.org
1956: IBM RAMAC 350
Wikipedia
Deep silence
2021: Western Digital Ultrastar DC HC650
Capacity: 20 TB
Throughput: 250 MB/s
Latency: 4.16 ms
Dimensions: 101 × 147 × 26 mm
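A common first-order way to relate throughput and latency is to estimate the time to read a block as the latency plus the block size divided by the throughput. The sketch below applies that estimate to the two drives above. The model itself and the name `read_time_s` are assumptions made for illustration; real drives behave less uniformly.

```python
def read_time_s(size_bytes: float, throughput_bytes_per_s: float, latency_s: float) -> float:
    """First-order estimate: wait for the first byte, then stream the rest."""
    return latency_s + size_bytes / throughput_bytes_per_s


# Figures taken from the slides (decimal units).
RAMAC_350 = {"throughput": 12.5e3, "latency": 0.600}  # 12.5 kB/s, 600 ms
HC650 = {"throughput": 250e6, "latency": 0.00416}     # 250 MB/s, 4.16 ms

for size in (1e3, 1e6, 1e9):  # 1 kB, 1 MB, 1 GB
    for name, drive in (("RAMAC 350", RAMAC_350), ("HC650", HC650)):
        t = read_time_s(size, drive["throughput"], drive["latency"])
        share = drive["latency"] / t  # fraction of the time spent waiting
        print(f"{name}: {size:.0e} bytes in {t:.4g} s (latency share {share:.0%})")
```

Under this model, small reads are dominated by latency and large reads by throughput, on both drives.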
The progress made (1956-2021)
Capacity (per unit of volume): 200,000,000,000x
Throughput: 20,000x
Latency: 150x
(chart shown on linear and logarithmic scales)
Picture: Ash Waechter/123RF
Capacity: Example
~ 600,000 words.
Picture: Ash Waechter/123RF
Throughput: Example
~ 1,000 words per minute
Picture: Ash Waechter/123RF
Latency: Example
~ 1 minute to stand up, go to the shelf, pick the book, find the page.
123RF / Patricia Hofmeester
2021 – Analogy with a book
600,000 words
1,000 words per minute
10 hours
2221 – Analogy with a book
120,000,000,000,000,000 words
10,000,000 words per minute
22,800 years
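The reading times in the book analogy follow from dividing the word count by the reading speed. A minimal sketch of that arithmetic, assuming the reader reads around the clock and a year of 365.25 days (both assumptions, as is the helper name `reading_minutes`):

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365.25  # assumption: reading without any break


def reading_minutes(words: int, words_per_minute: int) -> float:
    """Time needed to read `words` words at a constant speed."""
    return words / words_per_minute


book_2021 = reading_minutes(600_000, 1_000)
book_2221 = reading_minutes(120_000_000_000_000_000, 10_000_000)

print(f"2021: {book_2021 / MINUTES_PER_HOUR:.0f} hours")
print(f"2221: {book_2221 / MINUTES_PER_YEAR:,.0f} years")
```

The second figure comes out at about 22,800 years: the amount to read grew far faster than the reading speed.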
The progress made (1956-2021): Logarithmic
200,000,000,000x
20,000x
150x
Capacity Throughput Latency
Picture: Ash Waechter/123RF
The progress made (1956-2021): Logarithmic
200,000,000,000x
20,000x
Parallelize!
150x
Capacity Throughput Latency
Picture: Ash Waechter/123RF
2221: 20,000,000 persons
could read it all in 10 hours.
Data centers: clusters of machines (10,000s)
123RF / blueringmedia
The progress made (1956-2021): Logarithmic
200,000,000,000x
20,000x
150x
Batch processing!
Capacity Throughput Latency
Picture: Ash Waechter/123RF
What is Big Data (my definition)?
Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.
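One way to see the discrepancy is to estimate how long it takes to read an entire drive once, using the capacity and throughput figures quoted earlier for the 1956 and 2021 drives. This is a rough sketch: it assumes the quoted throughput is sustained over the whole scan, and the function name `full_scan_s` is made up for illustration.

```python
def full_scan_s(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time to read the whole drive once at a constant throughput."""
    return capacity_bytes / throughput_bytes_per_s


# Figures from the slides (decimal units).
ramac_350 = full_scan_s(5e6, 12.5e3)   # 5 MB at 12.5 kB/s
hc650 = full_scan_s(20e12, 250e6)      # 20 TB at 250 MB/s

print(f"RAMAC 350 (1956): {ramac_350 / 60:.1f} minutes")
print(f"HC650 (2021):     {hc650 / 3600:.1f} hours")
print(f"A full scan takes {hc650 / ramac_350:.0f}x longer on the newer drive")
```

Because capacity grew much faster than throughput, reading everything from a single drive keeps getting slower, which motivates spreading data over many machines and reading in parallel.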
Big Data in the Sciences
Picture: pcanzo/123RF
Physics: CERN pioneers, produces 30 PB/year
Picture: CERN
600,000,000 collisions/second
11,000 computers
100,000+ processors
Wait! Actually, that was three years ago!
Physics: CERN pioneers, produces 50 PB/year
1,000,000,000 collisions/second
10,400 servers
440,000+ cores
More on http://monit-grafana-open.cern.ch/d/000000884/it-overview?orgId=16
Astronomy: Sloan Digital Sky Survey
Picture: NASA / WMAP Science Team
Since 2000, phase IV ended in 2020
The most detailed 3D maps of the Universe ever made
1G objects, 4 spectra
https://www.sdss.org/
200 GB/night
https://www.sdss.org/surveys/eboss/
Picture: Wikipedia/EdPost
Genomics: the complete human genome
3B base pairs
Picture: Wikipedia/Zephyris
Genomics: CRISPR-Cas9
New (2018): DNA as a storage layer
Lecture Scope
Lecture scope: databases only
AI
Machine Learning
Databases
Lecture Team
Ghislain Fourny
Ingo Müller (co-head TA)
David Dao (TA)
Dan Graur (TA)
Alexandru Meterez (TA)
Monica Chiosa (co-head TA)
Amir Joudaki (TA)
Pierre Motard (TA)
Thomas Zhou (TA)
Lecture Overview
             Concepts                                  Technologies
Storage      Object storage                            S3, Azure Blob Storage
             Distributed file systems                  HDFS
             Syntax                                    JSON, XML, YAML
Models       Wide column stores                        HBase
             Data models and schemas                   JSound, JSON Schema, XML Schema
             Cubes                                     OLAP
             Graphs                                    neo4j, Cypher, RDF, SPARQL
Processing   2-step distributed query processing       Hadoop MapReduce
             Resource management                       YARN
             DAG-based distributed query processing    Spark
Management   Document storage                          MongoDB
             Query languages                           JSONiq
What is expected
Attendance of the weekly lectures (3 hours/week: Tuesdays 14-16, Wednesdays 9-10)
Attendance of the exercise session (2 hours/week: Wednesdays/Fridays)
Hands-on self-study: read the books, play with technology (1-2 hours/week)
Passing the written exam (180 minutes, Winter session)
10 KP (credit points)
Bonus points!
You SHOULD solve the weekly exercise sheets (theoretical, practical)
We will grade the exercises marked as such (25 of them)
You get 0.01 extra point per passed assignment: at most 0.25
"0.n" will thus be added to your exam grade before rounding
Self-study: Docker for your laptop, Azure for large-scale clusters
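The bonus rule stated above (0.01 per passed graded assignment, capped at 0.25, added before rounding) can be sketched as follows. The rounding step of 0.25 and the round-half-up rule are assumptions that are not given on the slides, and the function name `final_grade` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

BONUS_PER_ASSIGNMENT = Decimal("0.01")  # from the slides
MAX_BONUS = Decimal("0.25")             # from the slides


def final_grade(exam_grade: str, passed_assignments: int, step: str = "0.25") -> Decimal:
    """Add the capped bonus to the exam grade, then round.

    The rounding step and the half-up rule are assumptions, not stated in the source.
    """
    bonus = min(BONUS_PER_ASSIGNMENT * passed_assignments, MAX_BONUS)
    raw = Decimal(exam_grade) + bonus
    step_d = Decimal(step)
    # Round to the nearest multiple of `step`, with ties going up.
    return (raw / step_d).quantize(Decimal("1"), rounding=ROUND_HALF_UP) * step_d


print(final_grade("4.60", 0))  # no bonus
print(final_grade("4.60", 3))  # three passed assignments
```

Exact decimal arithmetic is used so that adding hundredths does not accumulate floating-point error before the rounding step.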