FSDL 2022 Lecture4 Data Management

The document provides an overview of data management strategies and tools for machine learning, emphasizing the importance of data exploration and augmentation to improve performance. It discusses various data storage options including filesystems, object storage, databases, and data lakes, while highlighting the significance of using SQL and DataFrames for data manipulation. Additionally, it touches on the use of feature stores and self-supervised learning techniques to enhance data processing and model training efficiency.


FSDL 2022

Data Management
Sergey Karayev

AUGUST 29, 2022


https://veekaybee.github.io/2019/02/13/data-science-is-different/

Data Management - overview

Key Points

• Spend 10x as much time exploring the data as you would like to

• Fixing/adding/augmenting data is usually the best way to improve performance

• Keep it simple!

Let the data flow through you


[Figure: infrastructure and tooling landscape, organized under Data, Development, and Deployment. Labels shown: "All-in-one", Feature Store, Monitoring, Versioning, Labeling, Frameworks & Distributed Training, Experiment and Model Management, Edge, Web, Processing, Exploration, Datasets, Sources, Resource Management, Compute, Software Engineering, CI / Testing]



Many possibilities

[Figure: Data Sources (images, text corpus, logs, DB records) flow into Training (local filesystem + GPU)]

Different for every project / company!

• Images: simply download

• Text corpus: process, analyze and select subset

• Logs + DB records: aggregate and process

The Basics

• Filesystem

• Object Storage

• Databases


Filesystem

• Fundamental unit is a "file", which can be text or binary, is not versioned, and is easily overwritten.

• On a disk that's connected to your machine

- Physically connected on-prem

- "Attached" in the cloud

- Or even distributed (e.g. HDFS)

Data Management - storage

Local Disk Speeds

Almost 2 orders of magnitude difference!

[Figure: speed comparison of HDD, SATA SSD, and NVMe SSD]

https://voltcave.com/ssd-vs-hdd/
Latency numbers you should know
(with human-scale numbers in parens)

• L1/L2 cache access: 1 ns (1 s)

• RAM access: 100 ns (~1.5 min)

• Read 1MB from RAM: 250 µs (~2.5 days)

• Seek + read 1MB from SATA SSD: 1 ms (~1.5 weeks)

• Seek + read 1MB from spinning disk: 20 ms (~7 months)

• Send packet California -> Netherlands -> California: 150 ms (~5 years)

Please send GPU timing info!
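The human-scale numbers come from stretching 1 ns to 1 second. A minimal sketch of that arithmetic, assuming 30-day months and 365-day years; the function name `human_scale` is made up for illustration, and the slide's figures are rounder approximations of what it prints.

```python
# Latencies from the slide, in nanoseconds.
LATENCIES_NS = {
    "L1/L2 cache access": 1,
    "RAM access": 100,
    "Read 1MB from RAM": 250_000,
    "Seek + read 1MB from SATA SSD": 1_000_000,
    "Seek + read 1MB from spinning disk": 20_000_000,
    "Packet California -> Netherlands -> California": 150_000_000,
}

# Unit sizes in seconds, largest first (assumed: 30-day month, 365-day year).
UNITS = [
    ("years", 365 * 86400),
    ("months", 30 * 86400),
    ("weeks", 7 * 86400),
    ("days", 86400),
    ("minutes", 60),
    ("seconds", 1),
]


def human_scale(latency_ns):
    """Pretend 1 ns lasts 1 second and express the result in a readable unit."""
    seconds = float(latency_ns)  # 1 ns -> 1 s
    for name, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for operation, ns in LATENCIES_NS.items():
        print(f"{operation}: {ns} ns ~ {human_scale(ns)}")
```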

Local Data Format

• Binary data (images, audio):

- Just use standard formats (e.g. JPEG)

• For metadata (labels) / tabular data / text data:

- Compressed json/txt file(s) are just fine

- Parquet is a table format that's fast, compact, and widely used
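As an illustration of the "compressed json/txt" option, here is a small sketch that stores label metadata as gzip-compressed JSON lines using only the standard library. The record fields and the helper names `write_jsonl_gz` / `read_jsonl_gz` are hypothetical, not something the lecture prescribes.

```python
import gzip
import json


def write_jsonl_gz(records, path):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read the records back as a list of dicts, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

One object per line keeps the file streamable: a reader can process records one at a time without loading everything into RAM.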


Object Storage

• An API over the filesystem.

• Fundamental unit is an "object". Usually binary: image, sound file, etc.

• Versioning, redundancy can be built into the service.

• Not as fast as local, but fast enough within the cloud

e.g. s3://my-bucket-name/my-file-name.jpg


Database

• Persistent, fast, scalable storage and retrieval of structured data

• Mental model: everything is actually in RAM, but software ensures that everything is persisted to disk.

• Not for binary data! Store object-store URLs instead.

• Postgres is the right choice most of the time. Supports unstructured JSON.

• SQLite is perfectly good for small projects.
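A minimal sketch of "store object-store URLs instead", using SQLite from the Python standard library. The table name, columns, and bucket paths are made up for illustration; the binary images themselves would live in object storage, not in the database.

```python
import sqlite3

# In-memory database for the example; a real project would use a file or Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        object_url TEXT NOT NULL,  -- pointer to the binary in object storage
        label TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO photos (object_url, label) VALUES (?, ?)",
    [
        ("s3://my-bucket-name/cat_001.jpg", "cat"),
        ("s3://my-bucket-name/dog_002.jpg", "dog"),
    ],
)
conn.commit()


def urls_for_label(connection, label):
    """Return the object-store URLs of all photos with the given label."""
    rows = connection.execute(
        "SELECT object_url FROM photos WHERE label = ? ORDER BY id", (label,)
    )
    return [url for (url,) in rows]
```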

You should probably be using a database

• Code that deals with collections of objects that reference each other (e.g. a Text is from a Document, which has an Author) will eventually implement a crappy database

• Using a database from the beginning will likely save time

• Many MLOps tools are databases at their core (e.g. W&B is a DB of experiments, HuggingFace Hub is a DB of models, Label Studio is a DB of labels)

Data Warehouse
• Store for Online Analytical Processing (OLAP)

• vs Databases for Online Transaction Processing (OLTP)

• Extract-Transform-Load (ETL) data in

• OLAPs: usually column-oriented, for queries like "mean length of comments.text over last 30 days"

• OLTPs: usually row-oriented, for queries like "select comments where user_id=123"

https://addepto.com/implement-data-warehouse-business-intelligence/
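The two query shapes can be written out as follows. This is only a sketch of the queries themselves: SQLite is row-oriented, so it says nothing about column-store performance, and the `comments` schema, sample rows, and reference date are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO comments (user_id, text, created_at) VALUES (?, ?, ?)",
    [
        (123, "great photo", "2022-08-20"),
        (123, "nice", "2022-06-01"),
        (456, "wow!!", "2022-08-25"),
    ],
)


def comments_for_user(connection, user_id):
    """OLTP-style query: fetch a few whole rows by key."""
    rows = connection.execute(
        "SELECT text FROM comments WHERE user_id = ? ORDER BY id", (user_id,)
    )
    return [text for (text,) in rows]


def mean_comment_length(connection, as_of, days=30):
    """OLAP-style query: aggregate one column over many rows."""
    row = connection.execute(
        "SELECT AVG(LENGTH(text)) FROM comments WHERE created_at >= date(?, ?)",
        (as_of, f"-{days} days"),
    ).fetchone()
    return row[0]
```

The analytical query touches only `text` and `created_at` but every recent row, which is why column-oriented storage suits it.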

Data Lake
• Unstructured aggregation of data from multiple sources, e.g. databases, logs, expensive data transformations.

• ELT: dump everything in, then transform for specific needs later.

https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02

Trend: both Lake and House

• Both structured and unstructured data together

If you're interested in this stuff

https://dataintensive.net

SQL and DataFrames


• Most data solutions use SQL. Some, like Databricks, use DataFrames.

• SQL is the standard interface for structured data.

• Pandas is the main DataFrame library in the Python ecosystem.

• Our advice: become fluent in both

https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
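To make "fluent in both" concrete, here is the same aggregation written once in SQL and once with a Pandas DataFrame. A small sketch assuming pandas is installed; the `comments` table and its contents are invented for the example.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [123, 123, 456],
        "text": ["great photo", "nice", "wow!!"],
    }
)

# Load the same rows into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
df.to_sql("comments", conn, index=False)

# SQL: count comments per user.
sql_result = pd.read_sql_query(
    "SELECT user_id, COUNT(*) AS n_comments "
    "FROM comments GROUP BY user_id ORDER BY user_id",
    conn,
)

# Pandas: the equivalent groupby.
pandas_result = (
    df.groupby("user_id")
    .size()
    .reset_index(name="n_comments")
    .sort_values("user_id")
    .reset_index(drop=True)
)
```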

Pandas

• The workhorse of Python data science

• + Dask DataFrames parallelize Pandas operations over cores

• + RAPIDS to do Pandas operations on GPUs

https://projectcodeed.blogspot.com/2019/08/setting-up-jupyter-notebooks-for-data.html

Motivational Example
• We have to train a photo popularity predictor every night.

• For each photo, training data must include:

- Metadata such as posting time, title, location (in database)

- Some features of the user, such as how many times they logged in today (need to compute from logs)

- Outputs of photo classifiers (content, style) (need to run classifiers)

Task Dependencies

• Some tasks can't start until others are finished.

• Finishing a task should kick off the tasks that depend on it.
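A dependency graph like this can be sketched with the standard library's `graphlib` (Python 3.9+). The task names below are hypothetical stand-ins for the photo popularity example, and the edges between them are assumptions; a real workflow manager adds scheduling, retries, and distribution on top of this ordering logic.

```python
from graphlib import TopologicalSorter

# Maps each task to the set of tasks it depends on (names are illustrative).
DEPENDENCIES = {
    "load_photo_metadata": set(),
    "compute_user_features_from_logs": set(),
    "run_photo_classifiers": {"load_photo_metadata"},
    "train_popularity_model": {
        "load_photo_metadata",
        "compute_user_features_from_logs",
        "run_photo_classifiers",
    },
}


def run_in_dependency_order(graph, run_task):
    """Run every task only after all of its dependencies have finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        for task in sorter.get_ready():  # tasks whose dependencies are done
            run_task(task)
            order.append(task)
            sorter.done(task)  # finishing a task unlocks its dependents
    return order
```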

Ideally

• Dependencies are not always files, but programs and databases

• Work needs to be spread over many machines

• Many dependency graphs are executing all at once

Data Management - processing

Airflow

• Specify the DAG of tasks using Python

https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418

Distributing work
• The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail.

http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
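The queue-and-workers idea can be sketched in a single process with threads. This is not how Airflow is implemented; it is a toy illustration under the assumption that tasks are plain Python callables, and `run_with_workers` with its parameters is a made-up name.

```python
import queue
import threading


def run_with_workers(tasks, num_workers=2, max_attempts=3):
    """Run named zero-argument callables on worker threads, retrying failures.

    Returns a dict mapping task name to its result, or to the last exception
    if the task still failed after max_attempts tries.
    """
    todo = queue.Queue()
    for name in tasks:
        todo.put((name, 1))  # (task name, attempt number)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = todo.get()
            if item is None:  # sentinel: no more work
                todo.task_done()
                return
            name, attempt = item
            try:
                value = tasks[name]()
                with lock:
                    results[name] = value
            except Exception as exc:
                if attempt < max_attempts:
                    todo.put((name, attempt + 1))  # restart the failed job
                else:
                    with lock:
                        results[name] = exc
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    todo.join()  # wait until every queued attempt has been processed
    for _ in threads:
        todo.put(None)
    for thread in threads:
        thread.join()
    return results
```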

Prefect
• Improvements over Airflow

Dagster
• Another contender


Keep things simple whenever possible


• Don't overengineer

• We have many CPU cores and a lot of RAM nowadays

• For example, UNIX has powerful parallelism, streaming, highly optimized tools

[Figure: timings of 26 minutes, 70 seconds, and 18 seconds, with steps annotated "in parallel"]

https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

Why feature stores?

• All the data processing generates artifacts for training

• How do we

- Make sure that in production, the same processing takes place?

- Avoid recomputation when we retrain?

• Feature stores are a solution (that you may not need!)

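The two questions above can be illustrated with a toy in-memory store: one shared feature definition used by both training and serving, plus a cache so values are not recomputed. This is a hypothetical sketch, not the API of any real feature store; `TinyFeatureStore` and `login_count_feature` are invented names.

```python
def login_count_feature(events, user_id):
    """Single definition of the feature, shared by training and production."""
    return sum(
        1 for event in events
        if event["user_id"] == user_id and event["type"] == "login"
    )


class TinyFeatureStore:
    """Caches computed feature values by (feature name, entity id)."""

    def __init__(self):
        self._values = {}
        self.compute_calls = 0  # counts real computations, for illustration

    def get(self, feature_name, entity_id, compute):
        key = (feature_name, entity_id)
        if key not in self._values:
            self._values[key] = compute()  # only computed on a cache miss
            self.compute_calls += 1
        return self._values[key]
```

For example, a nightly training job and the production service would both call `store.get("logins_today", user_id, ...)` with the same `login_count_feature`, so the processing cannot drift apart.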

https://eng.uber.com/michelangelo-machine-learning-platform/

https://www.tecton.ai

Featureform

In summary

• Binary data (images, sound files, compressed texts) is stored as objects.

• Metadata (labels, user activity) is stored in a database.

• Don't be afraid of SQL, and know there are accelerated DataFrames.

• If dealing with logs and other sources of data, set up a data lake.

• Set up a repeatable process to aggregate data needed for training.

• Depending on expense and complexity of processing, a feature store could be useful.

• At training time, copy the data that is needed to a filesystem on a fast drive, and optimize GPU transfer.

Huggingface Datasets

• Over 8K datasets for vision, NLP, etc.

Example Dataset
• Github-Code: >1TB of text

• Library allows you to stream it

• Underlying format: Parquet


Example Dataset
• RedCaps: 12M image-text pairs

• Need to download images yourself (multi-threaded!)

• Underlying format: images + JSON files
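A multi-threaded download can be sketched with a thread pool. The `fetch` callable is injected so the example stays independent of any particular HTTP client (it could be, for instance, a wrapper around `urllib.request.urlopen`); `download_all` and its parameters are illustrative names, not part of the RedCaps tooling.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=16):
    """Fetch every URL concurrently with threads.

    `fetch` is any callable mapping a URL to bytes. Returns (results,
    failures): dicts mapping URL to bytes and URL to exception respectively.
    """
    results, failures = {}, {}

    def safe_fetch(url):
        # Catch errors per URL so one dead link does not abort the whole run.
        try:
            return url, fetch(url), None
        except Exception as exc:
            return url, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, data, error in pool.map(safe_fetch, urls):
            if error is None:
                results[url] = data
            else:
                failures[url] = error
    return results, failures
```

Threads suit this job because downloads spend most of their time waiting on the network rather than using the CPU.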

Example Dataset

• CommonVoice: 14K hours of speech

• Underlying format: mp3 + text file

Activeloop

• Another interesting dataset-focused solution

• Explore, stream, and transform data without saving it all locally

May not have to label data!


Self-supervised learning
Very important idea: Use parts of data to label other parts

[Figure: a summary of how self-supervised learning tasks can be constructed (image source: LeCun's talk)]

[Figure: self-supervised learning by predicting the relative position of two random patches (image source: Doersch et al., 2015)]

https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html

Data Management - sources
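"Use parts of data to label other parts" can be shown with text: hide one word and use it as the label. A minimal sketch; the function name and the "[MASK]" token are assumptions chosen for illustration, and real masked-language-model pipelines work on subword tokens and mask at random.

```python
def masked_word_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked input, target word) pairs."""
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))  # the hidden word is the label
    return examples
```

No human labeling is involved: the sentence itself supplies both the input and the target.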

Self-supervised learning

• Works across modalities, too

• Note the "contrastive" training:

- Minimize distance between image and its text

- Maximize distance between image and other texts

https://github.com/openai/CLIP
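The contrastive objective can be sketched in plain Python as a symmetric cross-entropy over cosine similarities. This is an illustration in the spirit of the bullets above, not code from the CLIP repository: the embeddings are assumed to be given as lists of floats, and the fixed temperature of 0.07 is an assumed value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Pair i of images and texts belongs together; all other pairs do not."""
    n = len(image_embs)
    sims = [
        [cosine(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Each image should pick out its own text among all texts...
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # ...and each text should pick out its own image among all images.
    text_to_image = sum(
        cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

The loss is low when matching pairs are close and mismatched pairs are far apart, which is exactly the minimize/maximize pair of bullets.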

Image data augmentation

• Must do for training vision models

• Frameworks (e.g. torchvision) provide functions that do this

• Done on the CPU, in parallel with GPU training

https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c

Augmentation can replace labels

• SimCLR: learning objective is to

- a) maximize agreement between augmented views of the same image

- b) minimize agreement between different images

https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html

Other data augmentation


• Tabular
- Delete some cells to simulate missing data

• Text
- No well-established techniques, but replace words with synonyms, change order of things.

• Speech
- Change speed, insert pauses, add audio effects

https://github.com/makcedward/nlpaug
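The tabular and text ideas can be sketched in a few lines of standard-library Python. These helpers are hypothetical and are not the nlpaug API; the synonym table is supplied by the caller, and a `random.Random` instance is passed in so runs are reproducible.

```python
import random


def drop_cells(rows, drop_prob, rng):
    """Tabular augmentation: blank out each cell with probability drop_prob."""
    return [
        [None if rng.random() < drop_prob else cell for cell in row]
        for row in rows
    ]


def replace_synonyms(text, synonyms, replace_prob, rng):
    """Text augmentation: swap known words for a synonym with some probability."""
    out = []
    for word in text.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < replace_prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)
```

Usage might look like `drop_cells(rows, 0.1, random.Random(0))`, applied freshly each epoch so the model sees different missing cells every time.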

Synthetic data
Underrated idea that is often worth starting with

https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/

This can get pretty deep!

Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests

Ask your users to label data for you!


Enables rapid improvement with user labels


But usually: label data...

https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg

Standard set of features:

- bounding boxes, segmentations, keypoints, cuboids

- set of applicable classes

Data Management - labeling

Training the annotators is crucial

Quality assurance is key


Sources of Labor

• Full-service data labeling

• Hire own annotators, promote best ones to quality control

• Crowdsource (Mechanical Turk)


Full Service Companies

• Data labeling requires separate software stack, temporary labor, and quality assurance. Makes sense to outsource.

• Dedicate several days to selecting the best one for you:

- Label gold standard data yourself

- Sales calls with several contenders, ask for work sample on same data

- Ensure agreement with your gold standard, and evaluate on value
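Checking agreement with your gold standard can be as simple as the sketch below, which assumes single-class labels keyed by item id. The vendor names and helper functions are made up; richer label types (boxes, segmentations) would need a task-specific agreement measure such as IoU.

```python
def agreement_rate(gold, vendor):
    """Fraction of shared items where the vendor label matches the gold label."""
    shared = [item for item in gold if item in vendor]
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(1 for item in shared if gold[item] == vendor[item])
    return matches / len(shared)


def rank_vendors(gold, vendor_labels):
    """Sort vendors from highest to lowest agreement with the gold standard."""
    scores = {
        name: agreement_rate(gold, labels)
        for name, labels in vendor_labels.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Agreement is only half of the decision: the slide also asks you to weigh it against price.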

Scale.ai is a dominant data labeling solution

And there are many others

Label Studio
• Open-source edition to run yourself

• Enterprise edition for managed hosting

• Using in lab!

Diffgram

• Another open-source solution that may be even better

Aquarium and Scale Nucleus


• Key feature: see where your current model performs poorly, and label that data

https://www.aquariumlearning.com

https://scale.com/nucleus

Weak supervision

• Snorkel

- Open-source project: snorkel.org

- Commercial platform: snorkel.ai

• Rubrix: open-source solution
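The core idea of weak supervision is to write cheap heuristic "labeling functions" and combine their noisy votes. The sketch below uses a plain majority vote in standard-library Python; it does not use Snorkel's or Rubrix's API, and the spam/ham heuristics are invented purely for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN


def lf_mentions_meeting(text):
    return "ham" if "meeting" in text.lower() else ABSTAIN


def weak_label(text, labeling_functions):
    """Combine heuristic votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # tie between the top two labels
    return counts[0][0]
```

The resulting labels are noisy, but they can cover far more data than hand labeling.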

Conclusions

• Think of how you can do self-supervised learning

• Use labeling software and get to know your data by labeling it yourself for a while

• Write out detailed rules and outsource to full-service company if you can afford it

• Else, hiring part-time makes more sense than trying to make crowdsourcing work

Data Versioning

Level 0: unversioned

Level 1: versioned via snapshot at training time

Level 2: versioned as a mix of assets and code

Level 3: specialized data versioning solution

Data Management - versioning

Level 0

• Data lives on filesystem/S3 and database

• Problem: Deployed machine learning models are part code, part data. If data is not versioned, deployed models are not versioned.

• Problem you will face: inability to get back to a previous level of performance

Level 1

• Data is versioned by storing a snapshot of everything at training time

• This kind of works, but it would be far better to be able to version data just as easily as code.

Level 2

• Data is versioned as a mix of assets and code.

• Heavy files are stored in S3, with unique ids. Training data is stored as JSON or Parquet, referring to these ids and including relevant metadata (labels, user activity, etc).

• Data files can get big, but using git-lfs lets us store them just as easily as code.
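One way to get "unique ids" is to hash file contents, then keep a small JSON manifest under version control. A sketch under that assumption: the lecture does not say how ids are generated, and `content_id` / `build_manifest` are illustrative names.

```python
import hashlib
import json


def content_id(data):
    """Derive a stable id from the bytes themselves (SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(assets, labels):
    """Build the small JSON file that gets versioned alongside the code.

    `assets` maps a file name to its bytes and `labels` maps a file name to
    its label. The heavy bytes would be uploaded to object storage under
    their content id; only this manifest needs to live in git.
    """
    entries = [
        {"id": content_id(blob), "name": name, "label": labels[name]}
        for name, blob in sorted(assets.items())
    ]
    return json.dumps(entries, indent=2, sort_keys=True)
```

Because ids depend only on content, any change to a file or a label shows up as a diff in the manifest, which gives each training run a reproducible pointer to its exact data.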

Level 3

• Specialized solutions for versioning data, usually helping you store large files.

• Could totally make sense, but don't assume you need it right away!

• Leading solution is DVC.

Data Versioning Solutions

Warning: This is biased toward DVC


https://dagshub.com/blog/data-version-control-tools/

DVC
[Figure: DVC diagram with callouts numbered 1 to 4]

Research Area: Privacy


• Federated Learning: training a global model from data on local devices, without ever having access to the data

• Differential privacy: aggregating data such that individual points cannot be identified

• Another topic: Learning on encrypted data

https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/
https://blogs.nvidia.com/blog/2019/10/13/what-is-federated-learning/

Thank you!
