Lecture 1 - Intro & Foundations
What should you be able to do after this week?
Describe the characteristics of the recent shift in data analytics
Explain different forms of parallelism and scalability
Distinguish between scalability and performance
What is this course about?
Storing data
Previously, storage was very limited, devices were huge, and they were expensive.
Now we have huge storage capacity in small, affordable devices.
Over the years, storage cost has decreased while capacity has increased.
Why was there a shift in Data & Analytics?
Previously:
Data traditionally used for measurement
Descriptive, backward-facing view of what happened in the past
Nowadays:
Data leveraged for strategic analysis, centered on growth
Data is used in a predictive, forward-facing function.
What changed?
Data generation exploded!
Previously:
Businesses captured well-understood, well-defined transaction data (e.g., data about orders and payments)
Nowadays:
The advent of the web and mobile phones produces an unprecedented amount of much less structured and less well-defined interaction data
How does Modern Data Analytics work?
Previously:
IT department had monopoly on access to data
End users had to go through IT (via ticketing systems) for data analysis
Slow and tedious
Nowadays:
Data centrally stored in the cloud, IT department manages cloud
End users can directly access and analyse data
Challenges in working with Big Data
The four V’s of Big Data
Volume: We have to process a lot of data
Velocity: The data is arriving very fast
Variety: We have structured, semi-structured and unstructured data from many different sources
Veracity: We have data of highly varying quality and trustworthiness
Challenges with Volume & Velocity
Can’t we just use lots of machines to process lots of data really fast?
Unfortunately, programming distributed systems (i.e., working with lots of computers) is really hard, because of:
Coordination
Concurrency
Fault tolerance
We need ways to write simple but efficient programs which execute in parallel on large datasets.
Challenges with Variety & Veracity
Can’t we just feed all our data into machine learning models which magically find the right answers for us?
Unfortunately not: most data scientists spend the majority of their time preparing, cleaning and
organizing data instead of analysing data and training models…
Many data-driven ML applications are found to reproduce and amplify existing bias and discrimination.
Parallelism & Scalability
Task Parallelism
Also known as “multi-tasking”
Execute many independent tasks at once
Example: Operating system executing different processes at once on a multi-core machine
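Task parallelism can be sketched in a few lines of Python: several unrelated tasks run concurrently and need no coordination beyond collecting their results. The task functions here are made-up placeholders, not part of any real system.

```python
# Task parallelism sketch: independent tasks running at once,
# loosely analogous to an OS running different processes on a
# multi-core machine. The three tasks are hypothetical examples.
from concurrent.futures import ThreadPoolExecutor

def fetch_report():      # hypothetical task 1
    return "report ready"

def resize_images():     # hypothetical task 2
    return "images resized"

def send_emails():       # hypothetical task 3
    return "emails sent"

def run_tasks():
    tasks = [fetch_report, resize_images, send_emails]
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # The tasks are independent: they can run in any order,
        # on any worker, without communicating with each other.
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]
```

Note that the tasks differ from each other; this is what distinguishes task parallelism from data parallelism, where the *same* task runs on different data.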
Data Parallelism
Execute the same task in parallel on different slices of the data
Example: query processing in modern cloud databases, which store partitions of the data on different machines
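As a minimal sketch of data parallelism: the same task (here, a toy word count) is applied to different slices of the data in parallel, and the partial results are combined at the end. The data, the partitioning scheme, and the task are illustrative assumptions, not a real query engine.

```python
# Data parallelism sketch: the SAME task (word counting) runs in
# parallel on different slices (partitions) of the data.
from multiprocessing import Pool

def count_words(partition):
    # The same task, applied to one slice of the data.
    return sum(len(line.split()) for line in partition)

def parallel_word_count(lines, workers=4):
    # Split the data into roughly equal slices, one per worker.
    chunk = max(1, len(lines) // workers)
    partitions = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    with Pool(workers) as pool:
        partial_counts = pool.map(count_words, partitions)
    # Combine the partial results from all slices.
    return sum(partial_counts)

if __name__ == "__main__":
    data = ["hello world", "big data analytics"] * 100
    print(parallel_word_count(data))
```

In a cloud database the partitions would live on different machines rather than in one process's memory, but the shape of the computation (partition, process in parallel, combine) is the same.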
Pipeline Parallelism
Break tasks into a sequence of processing stages
Each stage takes the result from the previous stage as input, with results being passed downstream immediately
Example: instruction pipelining in modern CPUs
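The stage structure can be sketched with threads connected by queues: each stage runs concurrently and forwards each result downstream as soon as it is ready, a software analogue of CPU instruction pipelining. The stage functions and sentinel convention are assumptions for illustration.

```python
# Pipeline parallelism sketch: a task broken into a sequence of
# stages, each running in its own thread, passing results
# downstream through queues as soon as they are produced.
from queue import Queue
from threading import Thread

SENTINEL = None  # marks the end of the input stream

def stage(func, inbox, outbox):
    # Repeatedly take an item from upstream, process it, and
    # immediately pass the result downstream.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(func(item))

def run_pipeline(items, funcs):
    # One queue between each pair of adjacent stages.
    queues = [Queue() for _ in range(len(funcs) + 1)]
    threads = [Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []
    while (out := queues[-1].get()) is not SENTINEL:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

While item 3 is in the first stage, item 2 can already be in the second stage and item 1 in the third; all stages are busy at once even though each item flows through them sequentially.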
Scalability
Ability of a system to handle a growing amount of work by adding resources to the system
A distinction is often made based on how resources are added:
Scale-up: replace machine with “beefier” machine (More RAM, more Cores)
Scale-out: add more machines of the same type
Desired goal in practice:
Linear scalability with number of machines/cores in scale-out settings
“Elastic” scaling in cloud environments
Scalability ≠ Performance
A common misconception is that scalable systems are also automatically performant.
Scalability often comes with increased overheads, especially in distributed settings (e.g., network communication, coordination overhead).