FSDL 2022
Data Management
Sergey Karayev
AUGUST 29, 2022
https://veekaybee.github.io/2019/02/13/data-science-is-different/
Data Management - overview
Key Points
• Spend 10x as much time exploring the data as you would like to
• Fixing/adding/augmenting data is usually the best way to improve
performance
• Keep it simple!
Let the data flow through you
[Figure: infrastructure and tooling landscape under three headings: Data, Development, Deployment. Labels shown: Sources, Datasets, Processing, Exploration, Labeling, Versioning, Feature Store, Compute, Resource Management, Frameworks & Distributed Training, Experiment and Model Management, Software Engineering, CI / Testing, Edge, Web, Monitoring, "All-in-one"]
Many possibilities
Different for every project / company!
• Data sources: images, text corpus, logs, DB records
• Training: local filesystem, GPU
• Images: simply download
• Text corpus: process, then analyze and select subset
• Logs, DB records: aggregate and process
The Basics
• Filesystem
• Object Storage
• Databases
Filesystem
• Fundamental unit is a "file", which can be text or binary, is not
versioned, and is easily overwritten.
• On a disk that's connected to your machine
- Physically connected on-prem
- "Attached" in the cloud
- Or even distributed (e.g. HDFS)
Data Management - storage
Local Disk Speeds
Almost 2 orders of magnitude
difference!
[Figure: speed comparison of HDD, SATA SSD, and NVMe SSD]
https://voltcave.com/ssd-vs-hdd/
Latency numbers you should know
(with human-scale numbers in parens)
• L1/L2 cache access: 1 ns (1 s)
• RAM access: 100 ns (~1.5 min)
• Read 1MB from RAM: 250 µs (~2.5 days)
• Seek + read 1MB from SATA SSD: 1 ms (~1.5 weeks)
• Seek + read 1MB from spinning disk: 20 ms (~7 months)
• Send packet California -> Netherlands -> California: 150 ms (~5 years)
Please send GPU timing info!
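The human-scale figures on the latency slide come from stretching 1 ns to 1 s, so a latency of N nanoseconds reads as N seconds. A minimal sketch of that arithmetic follows; the function name `human_scale` and the choice of units are illustrative assumptions, and the slide rounds more loosely than this code does.

```python
# On the human scale, 1 nanosecond of machine time reads as 1 second,
# so a latency given in nanoseconds is the same number of "human" seconds.
UNITS = [
    ("years", 365 * 24 * 3600),
    ("months", 30 * 24 * 3600),
    ("weeks", 7 * 24 * 3600),
    ("days", 24 * 3600),
    ("minutes", 60),
    ("seconds", 1),
]

def human_scale(latency_ns: float) -> str:
    """Return the human-scale duration for a latency given in nanoseconds."""
    scaled_seconds = latency_ns  # 1 ns -> 1 s
    for name, size in UNITS:
        if scaled_seconds >= size:
            return f"{scaled_seconds / size:.1f} {name}"
    return f"{scaled_seconds:.1f} seconds"

# Latencies from the slide, expressed in nanoseconds.
latencies_ns = {
    "L1/L2 cache access": 1,
    "RAM access": 100,
    "Read 1MB from RAM": 250_000,
    "Seek + read 1MB from SATA SSD": 1_000_000,
    "Seek + read 1MB from spinning disk": 20_000_000,
    "Packet CA -> NL -> CA": 150_000_000,
}

for operation, ns in latencies_ns.items():
    print(f"{operation}: {human_scale(ns)}")
```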
Local Data Format
• Binary data (images, audio):
- Just use standard formats (e.g. JPEG)
• For metadata (labels) / tabular data / text data:
- Compressed json/txt file(s) are just fine
- Parquet is a table format that's fast, compact, and widely used
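As a rough illustration of the "compressed json" option, here is a sketch that writes labels as gzip-compressed JSON lines and reads them back using only the standard library. The file name, record fields, and helper names are made up for the example. Parquet needs a third-party library, so it is not shown.

```python
import gzip
import json
import os
import tempfile

def write_jsonl_gz(path, records):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl_gz(path):
    """Read the records back, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical label metadata that points at image files stored elsewhere.
labels = [
    {"image": "images/0001.jpg", "label": "cat"},
    {"image": "images/0002.jpg", "label": "dog"},
]

path = os.path.join(tempfile.mkdtemp(), "labels.jsonl.gz")
write_jsonl_gz(path, labels)
loaded = read_jsonl_gz(path)
print(loaded)
```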
Object Storage
• An API over the filesystem.
• Fundamental unit is an "object". Usually binary: image, sound file,
etc.
• Versioning, redundancy can be built into the service.
• Not as fast as local, but fast enough within the cloud
e.g. s3://my-bucket-name/my-file-name.jpg
Database
• Persistent, fast, scalable storage and retrieval of structured data
• Mental model: everything is actually in RAM, but software ensures
that everything is persisted to disk.
• Not for binary data! Store object-store URLs instead.
• Postgres is the right choice most of the time. Supports
unstructured JSON.
• SQLite is perfectly good for small projects.
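To make "store object-store URLs instead" concrete, here is a small sketch using SQLite from the Python standard library. The table layout, column names, and example rows are assumptions for illustration; the bucket name reuses the placeholder from the object storage slide.

```python
import sqlite3

# In-memory database for the example; a real project would use a file or Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        label TEXT,
        object_url TEXT NOT NULL  -- pointer to the binary data, not the bytes themselves
    )
    """
)
conn.executemany(
    "INSERT INTO photos (title, label, object_url) VALUES (?, ?, ?)",
    [
        ("Sleepy cat", "cat", "s3://my-bucket-name/photos/0001.jpg"),
        ("Park dog", "dog", "s3://my-bucket-name/photos/0002.jpg"),
    ],
)
conn.commit()

# Structured queries run against metadata; images are fetched by URL only when needed.
rows = conn.execute(
    "SELECT title, object_url FROM photos WHERE label = ?", ("cat",)
).fetchall()
print(rows)
```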
You should probably be using a database
• Code that deals with collections of objects that reference each
other (e.g. a Text is from a Document, which has an Author) will
eventually implement a crappy database
• Using a database from the beginning will likely save time
• Many MLOps tools are databases at their core (e.g. W&B is a DB of
experiments, HuggingFace Hub is a DB of models, Label Studio is a
DB of labels)
Data Warehouse
• Store for Online Analytical Processing (OLAP)
• vs Databases for Online Transaction Processing (OLTP)
• Extract-Transform-Load (ETL) data in
• OLAPs: usually column-oriented, for queries like
"mean length of comments.text over last 30 days"
• OLTPs: usually row-oriented, for queries like
"select comments where user_id=123"
https://addepto.com/implement-data-warehouse-business-intelligence/
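The two query shapes from the data warehouse slide can be written out as SQL. The sketch below runs both against SQLite purely to show the difference in shape: a point lookup versus an aggregate over one column. SQLite is row-oriented, so this says nothing about warehouse performance. The `comments` schema and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT, created_at TEXT)"
)
# created_at is computed relative to today so the 30-day window below is meaningful.
conn.executemany(
    "INSERT INTO comments (user_id, text, created_at) VALUES (?, ?, date('now', ?))",
    [
        (123, "nice photo", "-1 day"),
        (123, "love it", "-40 day"),
        (456, "great", "-2 day"),
    ],
)

# OLTP-style: fetch a few whole rows for one user.
user_comments = conn.execute(
    "SELECT text FROM comments WHERE user_id = ? ORDER BY id", (123,)
).fetchall()

# OLAP-style: scan one column across many rows and aggregate it.
mean_length = conn.execute(
    "SELECT AVG(LENGTH(text)) FROM comments WHERE created_at >= date('now', '-30 day')"
).fetchone()[0]

print(user_comments, mean_length)
```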
Data Lake
• Unstructured aggregation of data from multiple sources, e.g.
databases, logs, expensive data transformations.
• ELT: dump everything in, then transform for specific needs later.
https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02
Trend: both Lake and House
• Both structured and unstructured data together
If you're interested in this stuff
https://dataintensive.net
SQL and DataFrames
• Most data solutions use SQL. Some, like Databricks, use DataFrames.
• SQL is the standard interface for structured data.
• Pandas is the main DataFrame in the Python ecosystem.
• Our advice: become fluent in both
https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
Pandas
• The workhorse of Python data science
• + Dask DataFrames parallelize Pandas operations over cores
• + RAPIDS to do Pandas operations on GPUs
https://projectcodeed.blogspot.com/2019/08/setting-up-jupyter-notebooks-for-data.html
Motivational Example
• We have to train a photo popularity predictor every night.
• For each photo, training data must include:
- Metadata such as posting time, title, location (in database)
- Some features of the user, such as how many times they logged in today (need to compute from logs)
- Outputs of photo classifiers (content, style) (need to run classifiers)
Task Dependencies
• Some tasks can't start until others are finished.
• Finishing a task should kick off its dependents.
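The task-dependency idea can be sketched with `graphlib.TopologicalSorter` from the Python standard library (3.9+). The task names loosely follow the photo popularity example and are assumptions. A real workflow manager would also handle scheduling, retries, and multiple machines.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "fetch_photo_metadata": set(),
    "compute_user_features": set(),
    "run_photo_classifiers": set(),
    "assemble_training_data": {
        "fetch_photo_metadata",
        "compute_user_features",
        "run_photo_classifiers",
    },
    "train_popularity_predictor": {"assemble_training_data"},
}

def run(task_name):
    """Stand-in for real work (a query, a log aggregation, a model run)."""
    print(f"running {task_name}")

sorter = TopologicalSorter(dependencies)
sorter.prepare()

execution_order = []
while sorter.is_active():
    # get_ready() returns every task whose prerequisites are all done.
    for task in sorter.get_ready():
        run(task)
        execution_order.append(task)
        sorter.done(task)  # finishing a task unlocks the tasks that depend on it
```

The three independent tasks become ready together, which is the point where a workflow manager could run them in parallel.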
Ideally
• Dependencies are not always files, but programs and databases
• Work needs to be spread over many machines
• Many dependency graphs are executing all at once
Data Management - processing
Airflow
• Specify the DAG of tasks using Python
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Distributing work
• The workflow manager has a queue for the tasks, and manages
workers that pull from it, restarting jobs if they fail.
http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
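The queue-and-workers pattern can be illustrated with a toy version using threads and `queue.Queue`. This is not how Airflow is implemented. The retry limit, task names, and the `flaky` task are all assumptions made for the sketch.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # assumed retry limit for the sketch

def worker(tasks, results, lock):
    """Pull tasks until the queue is empty, re-queueing a task when it fails."""
    while True:
        try:
            name, func, attempt = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        try:
            value = func()
        except Exception:
            if attempt < MAX_ATTEMPTS:
                tasks.put((name, func, attempt + 1))  # restart the failed job
            else:
                with lock:
                    results[name] = "failed"
        else:
            with lock:
                results[name] = value

calls = {"count": 0}

def flaky():
    """Fails on its first call, succeeds afterwards."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")
    return "ok after retry"

tasks = queue.Queue()
tasks.put(("steady", lambda: "ok", 1))
tasks.put(("flaky", flaky, 1))

results = {}
lock = threading.Lock()
workers = [threading.Thread(target=worker, args=(tasks, results, lock)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(results)
```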
Prefect
• Improvements over Airflow
Dagster
• Another contender
Keep things simple whenever possible
• Don't overengineer
• We have many CPU cores and a lot of RAM nowadays
• For example, UNIX has powerful parallelism, streaming, highly
optimized tools
[Figure: command timings annotated 26 minutes, 70 seconds, and 18 seconds, with two steps marked "in parallel"]
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Why feature stores?
• All the data processing generates artifacts for training
• How do we
- Make sure that in production, the same processing takes place?
- Avoid recomputation when we retrain?
• Feature stores are a solution (that you may not need!)
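The two questions on the feature store slide (same processing in production, no recomputation on retrain) can be illustrated with a toy in-memory store: a feature is defined once, and both the training and serving paths read it through the same call. The class name, method names, and example feature are assumptions. Real feature stores also deal with persistence, freshness, and invalidation, which this sketch ignores.

```python
class TinyFeatureStore:
    """In-memory stand-in for a feature store: one definition, cached values."""

    def __init__(self):
        self._definitions = {}  # feature name -> function computing it from raw data
        self._cache = {}        # (feature name, entity id) -> computed value
        self.compute_calls = 0  # counts real computations, for demonstration only

    def register(self, name, func):
        self._definitions[name] = func

    def get(self, name, entity_id, raw):
        key = (name, entity_id)
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self._definitions[name](raw)
        return self._cache[key]

store = TinyFeatureStore()
store.register("logins_today", lambda events: sum(1 for e in events if e == "login"))

events_for_user_7 = ["login", "view", "login"]

# Training and serving go through the same definition, so they cannot drift apart.
training_value = store.get("logins_today", 7, events_for_user_7)
serving_value = store.get("logins_today", 7, events_for_user_7)
print(training_value, serving_value, store.compute_calls)
```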
https://eng.uber.com/michelangelo-machine-learning-platform/
https://www.tecton.ai
Featureform
In summary
• Binary data (images, sound files, compressed texts) is stored as objects.
• Metadata (labels, user activity) is stored in a database.
• Don't be afraid of SQL, and know there are accelerated DataFrames
• If dealing with logs and other sources of data, set up a data lake
• Set up a repeatable process to aggregate data needed for training.
• Depending on expense and complexity of processing, a feature store
could be useful
• At training time, copy the data that is needed to a filesystem on a fast
drive, and optimize GPU transfer.
Huggingface Datasets
• Over 8K datasets for vision, NLP, etc.
Example Dataset
• Github-Code: >1TB of text
• Library allows you to stream it
• Underlying format: Parquet
Example Dataset
• RedCaps: 12M image-text pairs
• Need to download images yourself (multi-threaded!)
• Underlying format: images + JSON files
Example Dataset
• CommonVoice: 14K hours of speech
• Underlying format: mp3 + text file
Activeloop
• Another interesting dataset-focused solution
• Explore, stream, and transform data without saving it all locally
May not have to label data!
Self-supervised learning
Very important idea: Use parts of data to label other parts
Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk)
Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015)
https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
Data Management - sources
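"Use parts of data to label other parts" can be shown on plain text: hide one word and use it as the label for the rest of the sentence. The sketch below builds such (input, target) pairs. The function name and the `[MASK]` token are illustrative choices, not tied to any particular model.

```python
def masked_word_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked sentence, hidden word) training pairs."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        context = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(context), target))
    return examples

examples = masked_word_examples("the cat sat")
for masked, target in examples:
    print(f"{masked!r} -> {target!r}")
```

No human labeling was needed: every label came from the data itself.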
Self-supervised learning
• Works across modalities, too
• Note the "contrastive" training:
- Minimize distance between image and its text
- Maximize distance between image and other texts
https://github.com/openai/CLIP
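The contrastive objective can be sketched in pure Python: treat each image's own text as the correct "class" among all texts in the batch and apply a softmax cross-entropy over cosine similarities. This covers only the image-to-text direction and uses a fixed temperature; both are simplifying assumptions, and the toy 2-D embeddings are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.1):
    """Mean cross-entropy where the correct text for image i is text i."""
    total = 0.0
    for i, image in enumerate(image_embeddings):
        logits = [cosine(image, text) / temperature for text in text_embeddings]
        log_sum_exp = math.log(sum(math.exp(logit) for logit in logits))
        total += log_sum_exp - logits[i]  # -log softmax probability of the matching text
    return total / len(image_embeddings)

images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]   # text i is close to image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # text i is close to the other image

aligned = contrastive_loss(images, matched_texts)
misaligned = contrastive_loss(images, shuffled_texts)
print(aligned, misaligned)
```

The loss is small when each image sits near its own text and far from the others, and large when the pairing is wrong.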
Image data augmentation
• Must do for training vision models
• Frameworks (e.g. torchvision) provide functions that do this
• Done on the CPU, in parallel with GPU training
https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c
Augmentation can replace labels
• SimCLR: learning objective is to
- a) maximize agreement between augmented views of the same image
- b) minimize agreement between different images
https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html
Other data augmentation
• Tabular
- Delete some cells to simulate missing data
• Text
- No well-established techniques, but you can replace words with
synonyms or change the order of things.
• Speech
- Change speed, insert pauses, add audio effects
https://github.com/makcedward/nlpaug
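Two of the augmentations listed on the slide (deleting tabular cells and swapping words for synonyms) are simple enough to sketch with the standard library. The function names, the drop probability, and the tiny synonym table are assumptions. This is not the nlpaug API.

```python
import random

def drop_cells(row, drop_prob, rng):
    """Tabular: replace each cell with None with probability drop_prob."""
    return {key: (None if rng.random() < drop_prob else value) for key, value in row.items()}

def replace_synonyms(text, synonyms, rng):
    """Text: swap every word that has an entry in the synonym table."""
    return " ".join(
        rng.choice(synonyms[word]) if word in synonyms else word
        for word in text.split()
    )

rng = random.Random(0)  # seeded so runs are repeatable

row = {"age": 34, "city": "Oslo", "income": 52000}
augmented_row = drop_cells(row, drop_prob=0.5, rng=rng)

synonyms = {"quick": ["fast", "speedy"], "happy": ["glad"]}
augmented_text = replace_synonyms("the quick dog is happy", synonyms, rng)

print(augmented_row)
print(augmented_text)
```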
Synthetic data
Underrated idea that is often worth starting with
https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/
This can get pretty deep!
Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests
Ask your users to label data for you!
Enables rapid improvement with user labels
But usually: label data...
https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg
Standard set of features:
- bounding boxes, segmentations, keypoints, cuboids
- set of applicable classes
Data Management - labeling
Training the annotators is crucial
Quality assurance is key
Sources of Labor
• Full-service data labeling
• Hire own annotators, promote best ones to quality control
• Crowdsource (Mechanical Turk)
Full Service Companies
• Data labeling requires a separate software stack, temporary labor, and
quality assurance. Makes sense to outsource.
• Dedicate several days to selecting the best one for you:
- Label gold standard data yourself
- Sales calls with several contenders, ask for work sample on same data
- Ensure agreement with your gold standard, and evaluate on value
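Checking vendor work samples against your own gold standard can be as simple as an agreement rate when the labels are class names. The sketch below does that. The vendor names and labels are invented, and bounding boxes or segmentations would need a different measure.

```python
def agreement_rate(gold, vendor):
    """Fraction of shared items where the vendor label equals the gold label."""
    shared = [item for item in gold if item in vendor]
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(1 for item in shared if gold[item] == vendor[item])
    return matches / len(shared)

# Gold standard labeled by you; work samples returned by two hypothetical vendors.
gold = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "bird"}
samples = {
    "vendor_a": {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "cat"},
    "vendor_b": {"img1": "cat", "img2": "cat", "img3": "dog", "img4": "bird"},
}

scores = {name: agreement_rate(gold, labels) for name, labels in samples.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Agreement is only half of the decision; the slide's "evaluate on value" step still means weighing each score against the quoted price.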
Scale.ai is a dominant data labeling solution
And there are many others
Label Studio
• Open-source edition to run yourself
• Enterprise edition for managed hosting
• We use it in the lab!
Diffgram
• Another open-source solution, may be even better
Aquarium and Scale Nucleus
• Key feature: see where your current model performs poorly, and label
that data
https://www.aquariumlearning.com
https://scale.com/nucleus
Weak supervision
• Snorkel
- Open-source project snorkel.org
- Commercial platform snorkel.ai
• Rubrix: open-source solution
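The core idea of weak supervision is to write several cheap, noisy heuristics ("labeling functions") and combine their votes into training labels. The sketch below uses a plain majority vote. It does not use the Snorkel or Rubrix APIs, and the heuristics and class names are invented for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_says_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN

def weak_label(text, labeling_functions):
    """Combine labeling-function votes by majority; abstain if nobody votes."""
    votes = [vote for vote in (lf(text) for lf in labeling_functions) if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

lfs = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]
for text in ["I want a refund!!", "Thanks, great service", "ok"]:
    print(text, "->", weak_label(text, lfs))
```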
Conclusions
• Think of how you can do self-supervised learning
• Use labeling software and get to know your data by labeling it yourself for
a while
• Write out detailed rules and outsource to a full-service company if you can
afford it
• Else, hiring part-time annotators makes more sense than trying to make
crowdsourcing work
Data Versioning
Level 0: unversioned
Level 1: versioned via snapshot at training time
Level 2: versioned as a mix of assets and code
Level 3: specialized data versioning solution
Data Management - versioning
Level 0
• Data lives on filesystem/S3 and database
• Problem: Deployed machine learning models are part code, part
data. If data is not versioned, deployed models are not versioned.
• Problem you will face: inability to get back to a previous level of
performance
Level 1
• Data is versioned by storing a snapshot of everything at training
time
• This kind of works, but it would be far better to be able to version
data just as easily as code.
Level 2
• Data is versioned as a mix of assets and code.
• Heavy files are stored in S3, with unique ids. Training data is stored
as JSON or Parquet, referring to these ids and including relevant
metadata (labels, user activity, etc).
• Data files can get big, but using git-lfs lets us store them just as
easily as code.
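A Level 2 setup can be sketched as a small JSON manifest that refers to heavy assets by id and carries the metadata. Using a content hash as the unique id is an assumption of this sketch, as are the field names. The upload to S3 and the commit to git are left out.

```python
import hashlib
import json

def asset_id(content: bytes) -> str:
    """Derive a stable id from the asset's bytes (assumed scheme: SHA-256)."""
    return hashlib.sha256(content).hexdigest()

def build_manifest(assets, labels):
    """assets: name -> bytes, labels: name -> label. Returns deterministic JSON text."""
    entries = [
        {"id": asset_id(assets[name]), "label": labels[name], "source_name": name}
        for name in sorted(assets)
    ]
    return json.dumps(entries, indent=2, sort_keys=True)

# Stand-ins for heavy files that would live in object storage under their ids.
assets = {"a.jpg": b"fake image bytes A", "b.jpg": b"fake image bytes B"}
labels = {"a.jpg": "cat", "b.jpg": "cat"}

manifest_v1 = build_manifest(assets, labels)

# Fixing one label changes the manifest (a small text diff) but not the asset ids.
labels["b.jpg"] = "dog"
manifest_v2 = build_manifest(assets, labels)

print(manifest_v1 != manifest_v2)
```

Because the manifest is plain text, each version of the training data can be committed alongside the code that trains on it.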
Level 3
• Specialized solutions for versioning data, usually helping you
store large files.
• Could totally make sense, but don't assume you need it right
away!
• Leading solution is DVC.
Data Versioning Solutions
Warning: This is biased toward DVC
https://dagshub.com/blog/data-version-control-tools/
DVC
[Figure: DVC illustration with callouts numbered 1 to 4]
Research Area: Privacy
• Federated Learning: training a global model from data on local
devices, without ever having access to the data
• Differential privacy: aggregating data such that individual points
cannot be identified
• Another topic: Learning on encrypted data
https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/
https://blogs.nvidia.com/blog/2019/10/13/what-is-federated-learning/
Thank you!