
Lecture 1 - Intro & Foundations

What should you be able to do after this week?


Describe the characteristics of the recent shift in data analytics

Explain different forms of parallelism and scalability

Distinguish between scalability and performance

What is this course about?


Storing data
Previously: storage was very limited, devices were huge, and they were expensive.

Now: huge storage capacity, small in size, and affordable.


Storage cost has decreased over the years while capacity has increased.

Why was there a shift in Data & Analytics?


Previously:

Data traditionally used for measurement

A descriptive, backward-facing view of what happened in the past

Nowadays:

Data leveraged for strategic analysis, centered on growth

Data is used in a predictive, forward-facing function.

What changed?
Data generation exploded!
Previously:

Businesses captured well-understood, well-defined transaction data (e.g., data about orders and payments)

Nowadays:

The advent of the web and mobile phones produces an unprecedented amount of much less structured and less well-defined interaction data

How does Modern Data Analytics work?


Previously:

The IT department had a monopoly on access to data

End users had to go through IT (via ticketing systems) for data analysis

Slow and tedious

Nowadays:

Data centrally stored in the cloud; the IT department manages the cloud

End users can directly access and analyse data

Challenges in working with Big Data



The four V’s of Big Data

Four V’s     Description
Volume       We have to process a lot of data
Velocity     The data is arriving very fast
Variety      We have structured, semi-structured and unstructured data from many different sources
Veracity     We have data of highly varying quality and trustworthiness

Challenges with Volume & Velocity


Can’t we just use lots of machines to process lots of data really fast?

Unfortunately, programming distributed systems (=working with lots of computers) is really hard!

Coordination

Concurrency

Fault tolerance

We need ways to write simple but efficient programs which execute in parallel on large datasets.

Challenges with Variety & Veracity


Can’t we just feed all our data into machine learning models which magically find the right answers for us?

Unfortunately not: most data scientists spend the majority of their time preparing, cleaning and organizing data instead of analysing data and training models…
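
As an illustration (my own sketch, not from the lecture; the table and cleaning rules are made up), a few typical preparation steps in pandas before any model ever sees the data:

```python
# Illustrative only: a hypothetical messy "orders" table and some typical cleaning steps.
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount": ["10.5", "n/a", "n/a", "7", ""],              # numbers stored as text, with gaps
    "country": ["NL", "nl ", "nl ", "Netherlands", None],   # inconsistent labels
})

clean = (
    raw.drop_duplicates(subset="order_id")                  # remove duplicated orders
       .assign(
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),   # parse numbers
           country=lambda df: (df["country"].str.strip().str.upper()
                               .replace({"NETHERLANDS": "NL"})),             # unify labels
       )
       .dropna(subset=["amount"])                           # drop rows we cannot repair
)

print(clean)
```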

Many data-driven ML applications are found to reproduce and amplify existing bias and discrimination.

Parallelism & Scalability


Task Parallelism
Also known as “multi-tasking”
Execute many independent tasks at once
Example: Operating system executing different processes at once on a multi-core machine
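
A minimal sketch of task parallelism in Python (my own example, not from the slides): two unrelated tasks are submitted to a process pool and run at the same time, much like an operating system scheduling independent processes onto different cores.

```python
# Two independent tasks executed at once (task parallelism).
from concurrent.futures import ProcessPoolExecutor

def compress_logs():
    """Stand-in for one independent task."""
    return sum(i % 7 for i in range(1_000_000))

def build_report():
    """Stand-in for a second, unrelated task."""
    return sum(1 for i in range(1_000_000) if i % 3 == 0)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(compress_logs)   # the tasks do not depend on each other,
        f2 = pool.submit(build_report)    # so they can run in parallel
        print(f1.result(), f2.result())
```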

Data Parallelism
Execute the same task in parallel on different slices of the data
Example: query processing in modern cloud databases which store partitions of the data on different
machines
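
A hedged sketch of the data-parallel pattern (illustrative only, not the lecture's code): the same filter-and-aggregate "query" runs on every partition of the data, and the partial results are combined at the end, mirroring how a partitioned cloud database spreads the work over machines.

```python
# Same task on different slices of the data (data parallelism):
# a toy "sum of all values greater than 10" evaluated per partition, then combined.
from multiprocessing import Pool

def query_partition(partition):
    """Run the same filter + aggregate on one partition of the data."""
    return sum(x for x in partition if x > 10)

if __name__ == "__main__":
    data = list(range(100))
    partitions = [data[i:i + 25] for i in range(0, 100, 25)]   # 4 slices of the data

    with Pool(processes=4) as pool:
        partial_sums = pool.map(query_partition, partitions)   # same task, different slices

    print(sum(partial_sums))   # combine the per-partition results
```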

Pipeline Parallelism
Break tasks into a sequence of processing stages
Each stage takes result from the previous stage as input, with results being passed downstream
immediately
Example: instruction pipelining in modern CPUs
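
A minimal sketch of pipeline parallelism (my own illustration): two processing stages run as threads connected by queues, each consuming the previous stage's output and passing its own results downstream immediately.

```python
# Pipeline parallelism: stages connected by queues, results flow downstream immediately.
import threading, queue

DONE = object()   # sentinel marking the end of the stream

def stage(fn, inbox, outbox):
    """One pipeline stage: apply fn to each incoming item and forward it downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            break
        outbox.put(fn(item))   # pass the result on right away

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)).start()   # stage 1: double
threading.Thread(target=stage, args=(lambda x: x + 1, q2, q3)).start()   # stage 2: add one

for x in range(5):
    q1.put(x)          # feed items into the first stage
q1.put(DONE)

while (item := q3.get()) is not DONE:
    print(item)        # 1, 3, 5, 7, 9
```

As in CPU instruction pipelines, throughput is ultimately limited by the slowest stage.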

Scalability
Ability of a system to handle a growing amount of work by adding resources to the system
A distinction is often made based on how resources are added:



Scale-up: replace the machine with a “beefier” machine (more RAM, more cores)

Scale-out: add more machines of the same type

Desired goal in practice:

Linear scalability with number of machines/cores in scale-out settings

“Elastic” scaling in cloud environments

Scalability ≠ Performance
A common misconception is that scalable systems are also automatically performant.
Scalability often comes with increased overheads, especially in distributed settings (e.g., network communication, coordination overhead).
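
To make the distinction concrete, the toy cost model below uses made-up numbers (not from the lecture): the work parallelizes perfectly, but every machine pays a fixed coordination overhead, so the system scales while speedup stays well below the machine count.

```python
# Toy cost model (my own illustrative assumptions, not from the lecture):
# 100 units of perfectly parallelizable work are split evenly over n machines,
# but every machine also pays 5 units of coordination/communication overhead.
WORK = 100.0
OVERHEAD = 5.0

def runtime(n_machines: int) -> float:
    return WORK / n_machines + OVERHEAD

for n in (1, 2, 4, 8, 16):
    speedup = runtime(1) / runtime(n)
    print(f"{n:2d} machines: runtime {runtime(n):6.2f}, speedup {speedup:4.2f}x")

# The system scales (runtime keeps dropping as machines are added), yet speedup
# stays well below n, and a plain single-machine program that pays no overhead
# (runtime 100) even beats the 1-machine distributed setup (runtime 105).
```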
