Big Data
1. Introduction

Ghislain Fourny
Fall 2021
1.85 m
40,000 km = 40 Mm
150 Gm
1 Tm
1 Pm
1000 Em
1 Zm
138 Ym

Pictures: 123RF / generalfmv, 123RF / shtanzman, Wikipedia
Exploring the infinitely big...

Source: Wikipedia
... means exploring the infinitely small

Source: Wikipedia
Data is like matter

123RF / shtanzman

Study of the real world: Physics
Study of the data world: Data Science


Poll

Go now to:

https://eduapp-app1.ethz.ch/

or install EduApp 3.x


My How-We-Do-Science Matrix: the four paradigms

Ontological    Epistemic

                              The world as it must be      The world as it is
                              (necessary)                  (contingent)

With our brain (natural)      Mathematics                  Physics
Thinking                      (theoretical)                (empirical)

With a machine (artificial)   Computer Science             Data Science
Computing                     (computational)              (data driven)

The Physics of CS
A good decision is based on
knowledge, not on numbers.

- Plato
Data Management Lectures at ETH

Computer Science, Data Science, CBB MSc with CS background:
Big Data (Fall), this lecture

Other departments:
Information Systems for Engineers (Fall 2021) + Big Data for Engineers (Spring 2022)

A Short History of
Databases

123RF / Samantha Craddock


123RF / andreykuzmin
Speaking/Singing
Writing

Rosetta Stone © Hans Hillewaert


Accounting

Plimpton 322 (Public Domain)


Printing
Willi Heidelbach
Ben Franske - DM IBM S360.jpg on en.wiki
Computers
1960s: File Systems

1970s: The Relational Era
2000s: The NoSQL Era

Triple stores
Key-value stores
Column stores
Document stores
In short?

1970: We threw data at computers.
1990: We threw computers at computers.
2000: We threw computers at data.
Now: We are throwing data at data.

Picture: 123RF / blueringmedia
Big Data
It's a buzzword!
Big Data goes across disciplines

Applications

Machine Learning

Statistics Algorithms

Data Management

Programming Languages

Distributed Systems High-Performance Computing


Big Data involves a lot of proprietary technology
The Big in Big Data

123RF / Patricia Hofmeester


The Three Vs of Big Data

Volume (TB, ZB)
Variety
Velocity

MORE, MORE, MORE

Data Volume

Content Location

Web Sensors IoT


Usage
Digital Traces

Experiments
Proprietary Scientific
Surveys
Data Volume

… because we can!

Technology, Software, Hardware, Infrastructure

Data Volume

… because data carries value
Data is worth more than the sum of its parts

Utility(A + B) > Utility(A) + Utility(B)

(A and B stand for two datasets, shown as pictures on the original slide.)
Data totality: one must have complete data

All flights
All hotels
All shops
...
Prefixes (International System of Units)

kilo (k)     1,000                                  (3 zeros)
Mega (M)     1,000,000                              (6 zeros)
Giga (G)     1,000,000,000                          (9 zeros)
Tera (T)     1,000,000,000,000                      (12 zeros)
Peta (P)     1,000,000,000,000,000                  (15 zeros)
Exa (E)      1,000,000,000,000,000,000              (18 zeros)
Zetta (Z)    1,000,000,000,000,000,000,000          (21 zeros)
Yotta (Y)    1,000,000,000,000,000,000,000,000      (24 zeros)

You must know this by heart!
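
The prefix table can be turned into a small conversion aid. The following is only a sketch: the names `SI_PREFIXES`, `to_base_units` and `with_prefix` are hypothetical helpers introduced here for illustration and do not come from the lecture.

```python
# Decimal (SI) prefixes from the table above, stored as powers of ten.
SI_PREFIXES = {"k": 3, "M": 6, "G": 9, "T": 12, "P": 15, "E": 18, "Z": 21, "Y": 24}


def to_base_units(value, prefix):
    """Convert a prefixed quantity, e.g. 40 with prefix 'M', to base units."""
    return value * 10 ** SI_PREFIXES[prefix]


def with_prefix(quantity):
    """Hypothetical helper: choose the largest prefix not exceeding the quantity."""
    symbol, exponent = "", 0
    for candidate, power in SI_PREFIXES.items():  # iterates from k up to Y
        if quantity >= 10 ** power:
            symbol, exponent = candidate, power
    return quantity / 10 ** exponent, symbol


print(to_base_units(40, "M"))    # 40 Mm expressed in metres
print(with_prefix(40_000_000))   # the same length, back in prefixed form
```

For instance, the 40,000 km circumference quoted at the start of the lecture is 40,000,000 m, which the second helper maps back to 40 Mm.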


Binary prefixes (IEC)

kibi (Ki)    1,024                                  (2^10)
Mebi (Mi)    1,048,576                              (2^20)
Gibi (Gi)    1,073,741,824                          (2^30)
Tebi (Ti)    1,099,511,627,776                      (2^40)
Pebi (Pi)    1,125,899,906,842,624                  (2^50)
Exbi (Ei)    1,152,921,504,606,846,976              (2^60)
Zebi (Zi)    1,180,591,620,717,411,303,424          (2^70)
Yobi (Yi)    1,208,925,819,614,629,174,706,176      (2^80)

You must NOT know this by heart!
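
The gap between a binary prefix and its decimal counterpart grows with each step, which is easy to check numerically. This is a rough sketch; `gap_percent` is a hypothetical helper, and the prefixes are paired by position (kilo with kibi, Mega with Mebi, and so on) as an assumption.

```python
# Symbols of the decimal prefixes, in increasing order; the n-th one is
# compared with the n-th binary prefix (assumed pairing by position).
SYMBOLS = ["k", "M", "G", "T", "P", "E", "Z", "Y"]


def gap_percent(n):
    """Relative excess of 2**(10*n) over 10**(3*n), in percent."""
    return (2 ** (10 * n) / 10 ** (3 * n) - 1) * 100


for n, symbol in enumerate(SYMBOLS, start=1):
    print(f"{symbol}i vs {symbol}: binary is {gap_percent(n):.1f}% larger")

# A capacity of 20 TB (decimal) expressed in TiB (binary).
tib = 20 * 10**12 / 2**40
print(f"20 TB = {tib:.2f} TiB")
```

The difference is about 2.4% at the kilo level and roughly 10% at the Tera level, so a capacity quoted in TB looks noticeably smaller once expressed in TiB.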


Data Variety
Data Shapes: Tables
Data Shapes: Trees
Data Shapes: Graphs
Data Shapes: Cubes
Data Shapes: Text

Data Velocity
Data is generated automatically

Picture: Vladimir Voronin/123RF


Data is a realtime byproduct of human activity

Picture: Ash Waechter/123RF


Three paramount factors

Capacity: "How much data can we store?"
Throughput: "How fast can we transmit data?"
Latency: "When do I start receiving data?"

Picture: Ash Waechter/123RF
1956: IBM RAMAC 350

Capacity: 5 MB

Throughput: 12.5 kB/s

Latency: 600 ms

Dimensions: 1.7 x 1.5 x 0.7 m

Pictures: Computer History Museum (computerhistory.org), Wikipedia
Deep silence
2021: Western Digital Ultrastar DC HC650

Capacity: 20 TB

Throughput: 250 MB/s

Latency: 4.16 ms

Dimensions: 101 x 147 x 26 mm

The progress made (1956-2021)

Capacity (per unit of volume): 200,000,000,000x
Throughput: 20,000x
Latency: 150x

(Plotted first on a linear scale, then on a logarithmic scale.)
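
The improvement factors can be recomputed from the figures quoted for the two drives. The sketch below is an illustration, not the lecture's own calculation: the dictionary layout and the `full_scan_seconds` helper are assumptions, and the scan-time model (latency plus capacity divided by throughput) is a simplification that assumes one sequential read at the quoted rate.

```python
# Figures quoted on the two drive slides, converted to base units.
ramac = {"capacity_B": 5 * 10**6, "throughput_Bps": 12_500,
         "latency_s": 0.600, "volume_m3": 1.7 * 1.5 * 0.7}
hc650 = {"capacity_B": 20 * 10**12, "throughput_Bps": 250 * 10**6,
         "latency_s": 0.00416, "volume_m3": 0.101 * 0.147 * 0.026}

throughput_gain = hc650["throughput_Bps"] / ramac["throughput_Bps"]
latency_gain = ramac["latency_s"] / hc650["latency_s"]
capacity_gain = hc650["capacity_B"] / ramac["capacity_B"]
# Capacity per unit of volume: raw capacity gain times the shrink in volume.
density_gain = capacity_gain * ramac["volume_m3"] / hc650["volume_m3"]


def full_scan_seconds(drive):
    """Simplified model: time to read the whole drive once, sequentially."""
    return drive["latency_s"] + drive["capacity_B"] / drive["throughput_Bps"]


print(f"throughput: {throughput_gain:,.0f}x")
print(f"latency:    {latency_gain:,.0f}x")
print(f"capacity:   {capacity_gain:,.0f}x (raw), {density_gain:.2e}x (per volume)")
print(f"full scan:  {full_scan_seconds(ramac):,.0f} s (1956), "
      f"{full_scan_seconds(hc650) / 3600:.1f} h (2021)")
```

With the dimensions quoted on the slides, the per-volume capacity gain evaluates to roughly 1.8 x 10^10; the 200,000,000,000x shown above is an order of magnitude larger and presumably rests on different inputs. Either way, reading a full drive once goes from under 7 minutes to about 22 hours, which is the discrepancy between capacity and throughput in concrete terms.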


Capacity: Example

~ 600,000 words.

Throughput: Example

~ 1,000 words per minute

Latency: Example

~ 1 minute to stand up, go to the shelf, pick the book, find the page.

Picture: 123RF / Patricia Hofmeester
2021 – Analogy with a book

600,000 words

1,000 words per minute

10 hours
2221 – Analogy with a book

120,000,000,000,000,000 words

10,000,000 words per minute

22,800 years
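
The arithmetic behind the book analogy can be checked in a few lines. This is a sketch under stated assumptions: the 2221 figures are obtained by scaling the 2021 figures by 200,000,000,000x (capacity) and 10,000x (throughput), as the numbers on the slide imply, and a year is taken as 365 days.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: 365-day years


def reading_minutes(words, words_per_minute):
    """Time needed to read `words` at a constant reading speed."""
    return words / words_per_minute


# 2021: figures from the slide.
t_2021 = reading_minutes(600_000, 1_000)

# 2221: capacity scaled by 200,000,000,000x, throughput by 10,000x.
words_2221 = 600_000 * 200_000_000_000
speed_2221 = 1_000 * 10_000
t_2221 = reading_minutes(words_2221, speed_2221)

print(f"2021: {t_2021 / 60:.0f} hours")
print(f"2221: {t_2221 / MINUTES_PER_YEAR:,.0f} years")
```

The book grows far faster than the reading speed, so the reading time increases from hours to tens of thousands of years.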
The progress made (1956-2021): Logarithmic

Capacity: 200,000,000,000x
Throughput: 10,000x
Latency: 150x

Parallelize!


2221: 200,000,000 persons
could read it all in 10 hours.
Data centers: clusters of machines (10,000s)

123RF / blueringmedia
The progress made (1956-2021): Logarithmic

Capacity: 200,000,000,000x
Throughput: 10,000x
Latency: 150x

Batch processing!


What is Big Data (my definition)?

Big Data is a portfolio of technologies that were designed to store, manage
and analyze data that is too large to fit on a single machine, while
accommodating the growing discrepancy between capacity, throughput and
latency.
Big Data in the Sciences

Picture: pcanzo/123RF
Physics: CERN pioneers, produces 30 PB/year

600,000,000 collisions/second
11,000 computers
100,000+ processors

Wait! Actually, that was three years ago!
Physics: CERN pioneers, produces 50 PB/year

1,000,000,000 collisions/second

10,400 servers

440,000+ cores

More on http://monit-grafana-open.cern.ch/d/000000884/it-overview?orgId=16
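
To get a feeling for 50 PB/year, the yearly volume can be converted into an average rate and into a number of 20 TB drives. This is a back-of-the-envelope sketch: it assumes data is produced uniformly over a 365-day year, which is an assumption made here and not a statement about how CERN operates.

```python
PB = 10**15
TB = 10**12
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year, uniform production

yearly_bytes = 50 * PB

# Average sustained rate needed to absorb the yearly volume.
average_rate_Bps = yearly_bytes / SECONDS_PER_YEAR

# Number of 20 TB drives (the capacity quoted earlier) filled per year.
drives_per_year = yearly_bytes // (20 * TB)

print(f"average rate: {average_rate_Bps / 10**9:.2f} GB/s")
print(f"20 TB drives per year: {drives_per_year}")
```

The average rate is several times the 250 MB/s throughput quoted for a single drive, so the data cannot be written to one drive at a time.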
Astronomy: Sloan Digital Sky Survey

Picture: NASA / WMAP Science Team


Since 2000, phase IV ended in 2020

The most detailed 3D maps of the Universe ever made

1G objects, 4 spectra

200 GB/night

https://www.sdss.org/
https://www.sdss.org/surveys/eboss/
Picture: Wikipedia/EdPost
Genomics: the complete human genome

3B base pairs

Picture: Wikipedia/Zephyris
Genomics: CRISPR-Cas9
New (2018): DNA as a storage layer
Lecture Scope
Lecture scope: databases only

AI
Machine Learning

Databases
Lecture Team

Ghislain Fourny

Ingo Müller (co-head TA), David Dao (TA), Dan Graur (TA), Alexandru Meterez (TA)
Monica Chiosa (co-head TA), Amir Joudaki (TA), Pierre Motard (TA), Thomas Zhou (TA)
Lecture Overview

             Concepts                                  Technologies
Storage      Object storage                            S3, Azure Blob Storage
             Distributed file systems                  HDFS
             Syntax                                    JSON, XML, YAML
Models       Wide column stores                        HBase
             Data models and schemas                   JSound, JSON Schema, XML Schema
             Cubes                                     OLAP
             Graphs                                    neo4j, Cypher, RDF, SPARQL
Processing   2-step distributed query processing       Hadoop MapReduce
             Resource management                       YARN
             DAG-based distributed query processing    Spark
Management   Document storage                          MongoDB
             Query languages                           JSONiq
What is expected

Attendance of the weekly lectures (3 hours/w: Tuesdays 14-16, Wednesdays 9-10)

Attendance of the exercise session (2 hours/w: Wednesdays/Fridays)

Hands-on self-study, read the books, play with technology (1-2 hours/w)

Passing the written exam (180 minutes, Winter session)

10 KP
Bonus points!

You SHOULD solve the weekly exercise sheets (theoretical, practical)

We will grade the exercises marked as such (25 of them)

You get 0.01 extra point per passed assignment: at most 0.25

"0.n" will thus be added to your exam grade before rounding


Self-study: Docker for your laptop, Azure for large-scale clusters
