Data-Intensive
Distributed
Computing
CS431/451/631/651
Fall 2022 – Dan Holtby
1
Today’s Agenda
Who am I?
What is “Big Data?”
Why is it different than regular Data?
How is the course structured?
(When and Where is on your schedule already…)
2
• PhD from UW (2013)
• Bioinformatics Research Group
• Bioinformatics involves lots of big data
• (A single human’s genome is about 3.5GB!)
• Humans aren’t even the most complicated species
Who am I? • Masters Thesis was on Distributed
Computing
3
Who are you?
CS451 / CS651 – CS Majors or Data Science Majors / MDSAI
Expectations: Comfortable in Java and Scala (you’ll be expected to pick it up
quickly if not)
CS431 / CS631 – Non-CS Majors, or Data Science majors / MDSAI
Expectations: Comfortable in Python (again, you’ll be expected to pick it up
quickly if not)
Everybody Should Be:
* Interested in the topic
* Comfortable with rapidly-evolving software
4
Big Data
• Question: Why are data so big
these days?
• Answer: It’s complicated
5
• The only reason to delete data is if the cost of
keeping it is too high
• (This is, of course, why Bilbo should not keep the One
Ring)
6
It’s one gigabyte, Michael!
How much could it cost, $10?
(Top) - IBM 350 Disk
($36,000 / month for 3.5 (Bottom) – WD Green
MB) (~$60 for 2TB)
1980
1958 Now
(Middle) – Shugart ST506
5.25” ($5000 for 5MB)
• 2.5” HDD for scale
7
Price per GB Over the Years
USD / GB
350000
300000
250000
200000
150000
100000
50000
0
1982 1987 1992 1997 2002 2007 2012 2017 2022
PRO-TIP: never make a graph that looks like this, use a log scale
8
Price per GB Over the Years (Log Scale)
USD / GB
1000000
100000
10000
1000
100
10
0.1
0.01
1982 1987 1992 1997 2002 2007 2012 2017 2022
9
• Facebook generates 4PB / day (that’s 4
million GB)
Where are all • There are 500 million new tweets per day
(~60 GB just for the text)
these data • 720,000 hours of new YouTube videos per
coming day. (It would take 90,000 full time
employees just to review uploads)
from? • Every “smart” device you own is sending
telemetry back to corporate to be packaged
and sold.
10
How
much????
• Right now* we
generate 2.5
exobytes (2,500,000
TB) per day
• That’s ~2MB /
person / second
* The number is from 2020, it's probably bigger now but I can't
find a good source
A lot of that is video so it’s all about averages
11
2.5 EXObytes???
• That might seem like a lot, but it’s
nothing compared to what it’s
going to be
• Will be up to 500 exobytes / day in
2025 (125 million 4TB HDDs filled
per day)
12
Businesses
But Why? Scientists
People
13
Business Data
DATA-DRIVEN DECISION- DATA-DRIVEN PRODUCT TARGETED ADVERTISING
MAKING DESIGN
14
Business Intelligence
• “What worked? What didn’t?”
• This isn’t a new concept.
15
Anecdote!
• In the 1990s, Walmart Discovered
people tend to buy beer and
diapers at the same time, so they
put them together.
• PS this isn’t true. Anecdotes
rarely are.
16
What Would
Walmart Do?
• Stores actually want items
that are bought together
to be FAR APART.
• So if Walmart did put beer
and diapers close, it’s
because they’re NOT
bought together.
• Costco puts the rotisserie
chicken at the back so you
have to walk past
everything else to grab
one
17
• A teenager’s parents learned she
was pregnant because Target
started sending coupons for
diapers.
Targetted
Adversiting • How did Target know? Data
Science
18
• “Customers like you
bought…”
Preferences • “People who liked X watch Y”
• Oddly specific Netflix
categories
19
Science!
• Data-Intensive eScience
• Modern Experiments generate
BIG DATA
20
Black Hole
• First Image of a Black
Hole (2019)
• 4.5PB of data from 8
telescopes
They flew and drove trucks full of HDD. Would have taken years to send over the internet.
21
Black Hole
• They shipped HDDs
• Never underestimate the
bandwidth of a station
wagon full of tapes
hurtling down the
highway. –Andrew
Tanenbaum
Because infereometry doesn’t parallelize well, not a good candidate for MapReduce (or
other data-intensive distributed techniques). Getting all the data in one place was
mandatory
22
JAMES WEBB • JWST sends 57GB / day back to Earth
• One pretty picture requires MANY images stitched
TELESCOPE together
23
Square Kilometer Array (SKA)
Estimated Completion Date – 2027
Will generate too much data to
handle today (5 Tb/sec)
They’re crossing that bridge when
they come to it.
By SPDO/TDP/DRAO/Swinburne Astronomy
Productions - SKA Project Development Office and
Swinburne Astronomy Productions
24
• Generates 1 PB / sec during an experiment
Large Hadron
Collider • That’s more than the SKA, but it’s not constantly running
25
Data Driven Policy
Computational
Social & Voter Preferences
Political
Science
Trending Hashtags
26
Humans as Sensors
Humans record their thoughts What can we do with all those
on social media. data?
27
Twitter
• Can Tweets tell us anything?
• Sentiment Analysis + Social Science
Sentiment Analysis – Figuring out the tone of a tweet. Harder than it sounds. People are
sarcastic, might use memes.
Fortunately people also tend to label their own sentences with emojis. Eyeroll emoji =
sarcastic. Angry face = mad post.
28
Predicting X
with Twitter
Fall 2020 Project : Predicting
COVID with Twitter
29
Big Data, Big
Computer?
• Vertical Scaling – More
RAM, Disk, CPU
• Return of the Mainframe?
• Expensive!
• Limited!
30
Big Data, Big • Horizontal Scaling
Network! • Cheap computers, just more of them
31
Distributed • Many inexpensive computers working together
Computing • Just like it says on the course
32
Parallelization
is hard
• Deadlocks, Livelocks, Race
Conditions, oh my!
• That’s just on one computer.
What if they’re remote?
If you haven’t taken OS, parallel programming, etc. you’ll just have to take my word for it
33
Scaling Out!
• A datacenter of many machines?
• Many datacenters???
• Fault tolerance
34
ALL
HARDWARE
FAILS
Over the years I’ve lost 2 video cards, 1 stick of RAM, several HDD, though none with data
loss…a mother board I think? 1 power supply, 2 if you count obnoxious coil whine as
failure. Oh, a monitor, if that counts. Went all yellow unless you beat it senseless (the IT
term for this is “percussive maintenance”)
35
DIFFICULT
We’re not going to design
a fault-tolerant
distributed computer
network
We’re going to use one
36
Abstraction to the rescue
You didn’t need to understand the hardware to use
assembly
You didn’t need to understand assembly to use C++
You didn’t need to understand a hash table to use
std::UnorderedMap
37
What’s the
Next Layer?
• How can we abstract a
distributed network?
• (That’s the topic of the next few
lectures)
38
What’s
CS431/CS451?
Data Science
Tools
A little helping of
This Course
Analytics
• Data Science Infrastructure
• Distributed Analytics
Execution
• Distributed Execution Infrastructure
39
More Buzzwords Please!
You got it!
• Analytics
• Business intelligence
• Data warehousing
• MapReduce, Hadoop, Spark, Pig, Hive, NoSQL, Pregel, Giraph, Storm/Heron
• Thinking at scale
40
HOW HARD IS THE COURSE?
• Based on course surveys –
• CS431 - ~8 hours a week
• CS451 - ~10 hours a week (That’s a
heavy course)
• UWFlow seems to think they’re both
relatively easy though
41
Grading
Undergrads Grad Students
Assignments – 70% Assignments – 60%
Final Exam – 30% Final Exam – 20%
Project – 20%
42
Course Info and Help
Course Website: https://www.student.cs.uwaterloo.ca/~cs451
(Yes, even if you’re in CS431)
Piazza (you should have been emailed an invite)
Online Office Hours: Microsoft Teams
In-Person Office Hours: See website.
43
All assignments will be checked for
plagiarism / unauthorized collaboration!
(See the course syllabus for more details)
One term, 23% of the class was under
Academic investigation for plagiarism.
Integrity
If caught: 0 on the assignment, -5% on
your course grade
44
Assignment Mechanics (CS451/651)
Java Scala
We’ll be using private Git repos for assignments
Complete your assignments, push to GitLab
We’ll pull your repos at the deadline and grade
Late assignments will get 0
45
45
Assignment Mechanics (CS431/631)
Assignments will use Python and Jupyter (Google Colab)
Everything you need to know is in the assignment itself
Assignments will be submitted using Git
Details are on the course website for the appropriate assignment
Python
Late assignments will get 0
46
Course Materials
One (required) textbook +
Three (optional but recommended) books +
Additional readings from other sources as appropriate
(optional but recommended)
Note: 4th Edition
47