01 - Intro

The document discusses big data and distributed computing. It introduces the instructor and their background in bioinformatics. It then defines big data, explaining how data volumes have grown exponentially due to decreased storage costs and more data sources like social media. The document outlines several examples of big data in science and business intelligence applications.

Uploaded by

Alireza Tehrani


Data-Intensive Distributed Computing
CS431/451/631/651

Fall 2022 – Dan Holtby

1
Today’s Agenda
Who am I?
What is “Big Data?”
Why is it different than regular Data?
How is the course structured?
(When and Where is on your schedule already…)

2
Who am I?
• PhD from UW (2013)
• Bioinformatics Research Group
• Bioinformatics involves lots of big data
• (A single human’s genome is about 3.5GB!)
• Humans aren’t even the most complicated species
• Master’s thesis was on distributed computing

3
Who are you?
CS451 / CS651 – CS Majors or Data Science Majors / MDSAI
Expectations: Comfortable in Java and Scala (you’ll be expected to pick them up quickly if not)

CS431 / CS631 – Non-CS Majors, or Data Science Majors / MDSAI
Expectations: Comfortable in Python (again, you’ll be expected to pick it up quickly if not)

Everybody Should Be:

* Interested in the topic
* Comfortable with rapidly-evolving software

4
Big Data
• Question: Why are data so big these days?
• Answer: It’s complicated

5
• The only reason to delete data is if the cost of keeping it is too high
• (This is, of course, why Bilbo should not keep the One Ring)

6
It’s one gigabyte, Michael!
How much could it cost, $10?

(Top) IBM 350 Disk, 1958 ($36,000 / month for 3.5 MB)
(Middle) Shugart ST506, 5.25”, 1980 ($5000 for 5 MB)
(Bottom) WD Green, now (~$60 for 2 TB)
• 2.5” HDD for scale

7
Price per GB Over the Years
[Chart: USD / GB on a linear axis (0 to 350,000), by year from 1982 to 2022]

PRO-TIP: never make a graph that looks like this, use a log scale

8
Price per GB Over the Years (Log Scale)

[Chart: USD / GB on a log axis (0.01 to 1,000,000), by year from 1982 to 2022]
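
As a rough check on the two charts, the drive prices quoted earlier in this deck can be converted to USD per GB. This is a minimal sketch assuming decimal units (1 GB = 1000 MB); the dictionary and variable names are illustrative, and the IBM 350 is left out because its quoted price is per month rather than a purchase price.

```python
import math

# Figures quoted on the storage slide: (price in USD, capacity in MB).
# Assumption: decimal units, so 2 TB = 2,000,000 MB and 1 GB = 1000 MB.
drives = {
    "Shugart ST506": (5000.0, 5.0),
    "WD Green": (60.0, 2_000_000.0),
}

# Convert each drive to a price per GB.
usd_per_gb = {
    name: price / (capacity_mb / 1000.0)
    for name, (price, capacity_mb) in drives.items()
}

for name, cost in usd_per_gb.items():
    print(f"{name}: {cost:,.2f} USD/GB (log10 = {math.log10(cost):.2f})")

# Orders of magnitude separating the two price points.
span = math.log10(usd_per_gb["Shugart ST506"] / usd_per_gb["WD Green"])
print(f"span: {span:.1f} orders of magnitude")
```

The two prices differ by more than seven orders of magnitude, so on a linear axis the recent value is indistinguishable from zero, which is the argument for the log-scale version.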

9
Where are all these data coming from?
• Facebook generates 4PB / day (that’s 4 million GB)
• There are 500 million new tweets per day (~60 GB just for the text)
• 720,000 hours of new YouTube videos per day. (It would take 90,000 full time employees just to review uploads)
• Every “smart” device you own is sending telemetry back to corporate to be packaged and sold.
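
The parenthetical figures on this slide follow from simple division. The sketch below redoes that arithmetic, assuming decimal units; the variable names are illustrative and every input is a number quoted on the slide.

```python
# Figures from the slide.
tweets_per_day = 500_000_000
tweet_text_bytes_per_day = 60 * 10**9   # ~60 GB, decimal units assumed
video_hours_per_day = 720_000
reviewers = 90_000

# Average text size implied per tweet, in bytes.
bytes_per_tweet = tweet_text_bytes_per_day / tweets_per_day

# Hours of video each reviewer would have to watch per day.
hours_per_reviewer = video_hours_per_day / reviewers

# Facebook: 4 PB/day expressed in GB (1 PB = 1,000,000 GB).
facebook_gb_per_day = 4 * 1_000_000

print(bytes_per_tweet, hours_per_reviewer, facebook_gb_per_day)
```

So the slide's numbers imply about 120 bytes of text per tweet and an 8-hour viewing day per reviewer.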

10
How much????
• Right now* we generate 2.5 exabytes (2,500,000 TB) per day
• That’s ~2MB / person / second

* The number is from 2020, it's probably bigger now but I can't find a good source

A lot of that is video so it’s all about averages

11
2.5 EXAbytes???
• That might seem like a lot, but it’s nothing compared to what it’s going to be
• Will be up to 500 exabytes / day in 2025 (125 million 4TB HDDs filled per day)
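
The unit conversions on these two slides can be checked in a few lines. This is a sketch assuming decimal units; the world population used for the per-person rate is an assumption introduced here, not a figure from the slides.

```python
EB = 10**18  # bytes in an exabyte (decimal)
TB = 10**12  # bytes in a terabyte (decimal)

daily_2020 = 2.5 * EB   # quoted daily volume
daily_2025 = 500 * EB   # quoted projection

# 2.5 EB expressed in TB.
tb_2020 = daily_2020 / TB

# Number of 4 TB drives filled per day at the projected rate.
drives_per_day = daily_2025 / (4 * TB)

# Per-person rate in bytes per second.
# ASSUMPTION: roughly 8 billion people; this value is not from the slides.
population = 8_000_000_000
per_person_per_second = daily_2020 / population / 86_400

print(tb_2020, drives_per_day, round(per_person_per_second))
```

The TB and drive counts match the slides. Under the assumed population, 2.5 exabytes per day works out to a few kilobytes per person per second, so the ~2MB / person / second figure presumably rests on a different estimate of the total.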

12
But Why?
• Businesses
• Scientists
• People

13
Business Data

• DATA-DRIVEN DECISION-MAKING
• DATA-DRIVEN PRODUCT DESIGN
• TARGETED ADVERTISING

14
Business Intelligence
• “What worked? What didn’t?”
• This isn’t a new concept.

15
Anecdote!
• In the 1990s, Walmart discovered that people tend to buy beer and diapers at the same time, so they put them together.
• PS this isn’t true. Anecdotes rarely are.

16
What Would Walmart Do?
• Stores actually want items that are bought together to be FAR APART.
• So if Walmart did put beer and diapers close, it’s because they’re NOT bought together.
• Costco puts the rotisserie chicken at the back so you have to walk past everything else to grab one

17
Targeted Advertising
• A teenager’s parents learned she was pregnant because Target started sending coupons for diapers.
• How did Target know? Data Science

18
Preferences
• “Customers like you bought…”
• “People who liked X watch Y”
• Oddly specific Netflix categories

19
Science!
• Data-Intensive eScience

• Modern Experiments generate BIG DATA

20
Black Hole
• First Image of a Black Hole (2019)
• 4.5PB of data from 8 telescopes

They flew and drove truckloads of HDDs. Sending the data over the internet would have taken years.

21
Black Hole
• They shipped HDDs
• Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway. –Andrew Tanenbaum

Because interferometry doesn’t parallelize well, it is not a good candidate for MapReduce (or other data-intensive distributed techniques). Getting all the data in one place was mandatory.
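
To put a number on "years to send over the internet", the sketch below estimates transfer time for the 4.5 PB quoted above. The link speeds are hypothetical values chosen for illustration, the function name is made up here, and the calculation ignores protocol overhead.

```python
def transfer_days(size_bytes: float, bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link sustaining bits_per_second."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86_400

DATASET_BYTES = 4.5e15  # 4.5 PB, the figure quoted for the black hole image

# ASSUMPTION: hypothetical sustained link speeds, not figures from the slides.
links = [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]

for label, bps in links:
    print(f"{label}: {transfer_days(DATASET_BYTES, bps):,.0f} days")
```

At the assumed 100 Mbit/s the transfer takes roughly 4,167 days, a bit over 11 years, which is why shipping drives wins on bandwidth even though its latency is terrible.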

22
JAMES WEBB TELESCOPE
• JWST sends 57GB / day back to Earth
• One pretty picture requires MANY images stitched together

23
Square Kilometer Array (SKA)

Estimated Completion Date – 2027

Will generate too much data to handle today (5 Tb/sec)

They’re crossing that bridge when they come to it.
By SPDO/TDP/DRAO/Swinburne Astronomy
Productions - SKA Project Development Office and
Swinburne Astronomy Productions

24
Large Hadron Collider
• Generates 1 PB / sec during an experiment
• That’s more than the SKA, but it’s not constantly running

25
Computational Social & Political Science
• Data Driven Policy
• Voter Preferences
• Trending Hashtags

26
Humans as Sensors

Humans record their thoughts on social media.

What can we do with all those data?

27
Twitter

• Can Tweets tell us anything?

• Sentiment Analysis + Social Science

Sentiment Analysis – Figuring out the tone of a tweet. Harder than it sounds. People are sarcastic, might use memes.
Fortunately, people also tend to label their own sentences with emojis. Eyeroll emoji = sarcastic. Angry face = mad post.
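
The emoji-as-label idea can be sketched in a few lines. This is only an illustration: the emoji table, the function name, and the example posts are all made up here, and a real project would need a much larger table and some handling of conflicting emojis.

```python
from typing import Optional

# Hypothetical emoji-to-label table; a real one would be far larger.
EMOJI_LABELS = {
    "\U0001F644": "sarcastic",  # face with rolling eyes
    "\U0001F620": "angry",      # angry face
}

def label_by_emoji(text: str) -> Optional[str]:
    """Return the label of the first known emoji in text, or None if there is none."""
    for ch in text:
        if ch in EMOJI_LABELS:
            return EMOJI_LABELS[ch]
    return None

# Made-up example posts, not real tweets.
posts = [
    "Great, another Monday \U0001F644",
    "My flight got cancelled again \U0001F620",
    "Lunch was fine",
]

labels = [label_by_emoji(p) for p in posts]
print(labels)
```

Posts that come back with a label can serve as training examples for a classifier, while unlabeled ones are what the classifier is later asked to judge.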

28
Predicting X with Twitter
Fall 2020 Project: Predicting COVID with Twitter

29
Big Data, Big
Computer?
• Vertical Scaling – More
RAM, Disk, CPU
• Return of the Mainframe?
• Expensive!
• Limited!

30
Big Data, Big Network!
• Horizontal Scaling
• Cheap computers, just more of them

31
Distributed Computing
• Many inexpensive computers working together
• Just like it says on the course

32
Parallelization is hard
• Deadlocks, Livelocks, Race Conditions, oh my!
• That’s just on one computer. What if they’re remote?

If you haven’t taken OS, parallel programming, etc. you’ll just have to take my word for it
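
For readers who have not met race conditions, here is a minimal single-machine sketch using Python's `threading` module. The counter, function name, and thread counts are illustrative. The lock makes each increment happen as one unit; without it, two threads may interleave the read and the write and lose updates.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n: int) -> None:
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:  # serializes the read-modify-write on the shared counter
            counter += 1

# Four threads, each adding 10,000 to the same counter.
threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

A lock like this only works inside one process on one machine. Across machines there is no shared memory to lock, which is part of why the distributed case is harder.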

33
Scaling Out!
• A datacenter of many machines?
• Many datacenters???
• Fault tolerance

34
ALL HARDWARE FAILS

Over the years I’ve lost 2 video cards, 1 stick of RAM, several HDDs (though none with data loss), a motherboard I think? 1 power supply, 2 if you count obnoxious coil whine as failure. Oh, a monitor, if that counts. It went all yellow unless you beat it senseless (the IT term for this is “percussive maintenance”)
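
The step from "my own machines fail now and then" to "a cluster is always failing somewhere" is a one-line probability. The sketch below assumes machines fail independently; the per-machine probability is a hypothetical value, not a measured failure rate, and the function name is made up here.

```python
def p_any_failure(n_machines: int, p_single: float) -> float:
    """Probability that at least one of n independent machines fails."""
    return 1.0 - (1.0 - p_single) ** n_machines

# ASSUMPTION: hypothetical chance that one machine fails in some fixed period.
P_SINGLE = 0.01

for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(n, P_SINGLE), 4))
```

Even with a small per-machine probability, the chance that something in the cluster fails approaches certainty as machines are added, so fault tolerance has to be built into the system.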

35
DIFFICULT
We’re not going to design a fault-tolerant distributed computer network

We’re going to use one

36
Abstraction to the rescue
You didn’t need to understand the hardware to use assembly

You didn’t need to understand assembly to use C++

You didn’t need to understand a hash table to use std::unordered_map

37
What’s the
Next Layer?
• How can we abstract a distributed network?
• (That’s the topic of the next few lectures)
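
One abstraction of this kind, named elsewhere in this deck, is MapReduce. The following is a single-machine sketch of the programming model only, using word count as the task. The function names and input lines are illustrative, and it does not use Hadoop, Spark, or any real framework API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for a framework's shuffle step."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values seen for one key into a single count."""
    return key, sum(values)

lines = ["big data big network", "big computer"]  # made-up input
pairs = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

The programmer writes only the map and reduce functions. In a real system the framework would decide which machine runs each call and would rerun work after a failure.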

38
What’s CS431/CS451?
A little helping of: Data Science Tools
This Course: Analytics Infrastructure, Execution Infrastructure
• Data Science
• Distributed Analytics
• Distributed Execution

39
More Buzzwords Please!

You got it!

• Analytics
• Business intelligence
• Data warehousing
• MapReduce, Hadoop, Spark, Pig, Hive, NoSQL, Pregel, Giraph, Storm/Heron
• Thinking at scale

40
HOW HARD IS THE COURSE?

• Based on course surveys –
• CS431 - ~8 hours a week
• CS451 - ~10 hours a week (That’s a heavy course)
• UWFlow seems to think they’re both relatively easy though

41
Grading

Undergrads
Assignments – 70%
Final Exam – 30%

Grad Students
Assignments – 60%
Final Exam – 20%
Project – 20%

42
Course Info and Help
Course Website: https://www.student.cs.uwaterloo.ca/~cs451
(Yes, even if you’re in CS431)

Piazza (you should have been emailed an invite)

Online Office Hours: Microsoft Teams

In-Person Office Hours: See website.

43
Integrity
All assignments will be checked for plagiarism / unauthorized collaboration! (See the course syllabus for more details)

One term, 23% of the class was under academic investigation for plagiarism.

If caught: 0 on the assignment, -5% on your course grade

44
Assignment Mechanics (CS451/651)

Java Scala

We’ll be using private Git repos for assignments

Complete your assignments, push to GitLab
We’ll pull your repos at the deadline and grade

Late assignments will get 0

45
Assignment Mechanics (CS431/631)

Assignments will use Python and Jupyter (Google Colab)

Everything you need to know is in the assignment itself

Assignments will be submitted using Git

Details are on the course website for the appropriate assignment

Late assignments will get 0

46
Course Materials
One (required) textbook +
Three (optional but recommended) books +
Additional readings from other sources as appropriate

(optional but recommended)

Note: 4th Edition
