FSDL 2022 Lecture4 Data Management

The document provides an overview of data management strategies and tools for machine learning, emphasizing the importance of data exploration and augmentation to improve performance. It discusses various data storage options including filesystems, object storage, databases, and data lakes, while highlighting the significance of using SQL and DataFrames for data manipulation. Additionally, it touches on the use of feature stores and self-supervised learning techniques to enhance data processing and model training efficiency.

FSDL 2022

Data Management
Sergey Karayev

AUGUST 29, 2022

https://veekaybee.github.io/2019/02/13/data-science-is-different/

Key Points

• Spend 10x as much time exploring the data as you would like to

• Fixing/adding/augmenting data is usually the best way to improve performance

• Keep it simple!

Let the data flow through you

[Figure: overview of ML infrastructure and tooling across three stages: Data, Development, Deployment. Labels shown: Sources, Datasets, Processing, Exploration, Versioning, Labeling, Feature Store, Compute, Resource Management, Software Engineering, Frameworks & Distributed Training, Experiment and Model Management, CI / Testing, Edge, Web, Monitoring, and “All-in-one”.]


Many possibilities

Data Sources → Training (Local Filesystem + GPU)

• Images: simply download

• Text corpus: process, then analyze and select subset

• Logs + DB records: aggregate and process

Different for every project / company!

The Basics

• Filesystem

• Object Storage

• Databases

Filesystem

• Fundamental unit is a "file", which can be text or binary, is not versioned, and is easily overwritten.

• On a disk that's connected to your machine

• Physically connected on-prem

• "Attached" in the cloud

• Or even distributed (e.g. HDFS)

Local Disk Speeds

Almost 2 orders of magnitude difference!

[Figure: speed comparison of HDD, SATA SSD, and NVMe SSD]

https://voltcave.com/ssd-vs-hdd/

Latency numbers you should know
(with human-scale numbers in parens)

• L1/L2 cache access: 1 ns (1 s)

• RAM access: 100 ns (~1.5 min)

• Read 1MB from RAM: 250 µs (~2.5 days)

• Seek + read 1MB from SATA SSD: 1 ms (~1.5 weeks)

• Seek + read 1MB from spinning disk: 20 ms (~7 months)

• Send packet California -> Netherlands -> California: 150 ms (~5 years)

Please send GPU timing info!
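The human-scale figures in the latency slide come from stretching 1 ns to 1 s, that is, multiplying each latency by 10^9. A minimal sketch of that arithmetic follows; the helper name `human_scale` and the unit boundaries (30-day months, 365-day years) are illustrative assumptions, so the results land close to, but not exactly on, the rounded numbers in the slide.

```python
def human_scale(latency_seconds: float) -> str:
    """Scale a latency so that 1 ns maps to 1 s, then pick a readable unit."""
    scaled = latency_seconds * 1e9  # 1 ns -> 1 s
    units = [
        ("years", 365 * 86400),
        ("months", 30 * 86400),
        ("weeks", 7 * 86400),
        ("days", 86400),
        ("minutes", 60),
        ("seconds", 1),
    ]
    for name, size in units:
        if scaled >= size:
            return f"{scaled / size:.1f} {name}"
    return f"{scaled:.1f} seconds"


latencies = {
    "L1/L2 cache access": 1e-9,
    "RAM access": 100e-9,
    "Read 1MB from RAM": 250e-6,
    "Seek + read 1MB from SATA SSD": 1e-3,
    "Seek + read 1MB from spinning disk": 20e-3,
    "Packet California -> Netherlands -> California": 150e-3,
}
for name, seconds in latencies.items():
    print(f"{name}: {human_scale(seconds)}")
```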

Local Data Format

• Binary data (images, audio):

• Just use standard formats (e.g. JPEG)

• For metadata (labels) / tabular data / text data:

• Compressed json/txt file(s) are just fine

• Parquet is a table format that's fast, compact, and widely used
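As a rough illustration of the "compressed json" option for metadata, the sketch below writes labels as gzip-compressed JSON lines and reads them back using only the standard library. The record fields and file name are made up for the example. Parquet would need a third-party library and is not shown.

```python
import gzip
import json
import os
import tempfile

# Hypothetical label records; the field names are illustrative only.
records = [
    {"image": "cat_001.jpg", "label": "cat"},
    {"image": "dog_042.jpg", "label": "dog"},
]

path = os.path.join(tempfile.mkdtemp(), "labels.jsonl.gz")

# Write one JSON object per line, gzip-compressed.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, decoding one record per line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```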


Object Storage

• An API over the filesystem.

• Fundamental unit is an "object". Usually binary: image, sound file, etc.

• Versioning, redundancy can be built into the service.

• Not as fast as local, but fast enough within the cloud

e.g. s3://my-bucket-name/my-file-name.jpg


Database

• Persistent, fast, scalable storage and retrieval of structured data

• Mental model: everything is actually in RAM, but software ensures that everything is persisted to disk.

• Not for binary data! Store object-store URLs instead.

• Postgres is the right choice most of the time. Supports unstructured JSON.

• SQLite is perfectly good for small projects.
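One way to read "store object-store URLs instead" is sketched below with SQLite from the Python standard library: the table holds structured metadata plus a URL string, while the image bytes stay in object storage. The table name, columns, and URLs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory SQLite database for the demo
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, object_url TEXT NOT NULL, label TEXT)"
)
conn.executemany(
    "INSERT INTO photos (object_url, label) VALUES (?, ?)",
    [
        ("s3://my-bucket-name/cat-001.jpg", "cat"),
        ("s3://my-bucket-name/dog-042.jpg", "dog"),
        ("s3://my-bucket-name/cat-007.jpg", "cat"),
    ],
)

# The database answers the structured question; the bytes stay in object storage.
cat_urls = [
    row[0]
    for row in conn.execute(
        "SELECT object_url FROM photos WHERE label = ? ORDER BY id", ("cat",)
    )
]
print(cat_urls)
```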

You should probably be using a database

• Code that deals with collections of objects that reference each other (e.g. a Text is from a Document, which has an Author) will eventually implement a crappy database

• Using a database from the beginning will likely save time

• Many MLOps tools are databases at their core (e.g. W&B is a DB of experiments, HuggingFace Hub is a DB of models, Label Studio is a DB of labels)

Data Warehouse

• Store for Online Analytical Processing (OLAP)

• vs Databases for Online Transaction Processing (OLTP)

• Extract-Transform-Load (ETL) data in

• OLAPs: usually column-oriented, for queries like "mean length of comments.text over last 30 days"

• OLTPs: usually row-oriented, for queries like "select comments where user_id=123"

https://addepto.com/implement-data-warehouse-business-intelligence/
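The row-oriented versus column-oriented distinction can be sketched in plain Python. This is only a toy model of the two layouts, with a made-up comments table: the row layout makes it cheap to fetch whole records for one user, while the column layout lets an aggregate touch a single field.

```python
# Hypothetical comments table, stored two ways.
rows = [  # row-oriented: one record per entry
    {"user_id": 123, "text": "nice photo"},
    {"user_id": 456, "text": "wow"},
    {"user_id": 123, "text": "great light"},
]
columns = {  # column-oriented: one list per field
    "user_id": [r["user_id"] for r in rows],
    "text": [r["text"] for r in rows],
}

# OLTP-style query: fetch whole records for one user.
user_comments = [r for r in rows if r["user_id"] == 123]

# OLAP-style query: aggregate over a single column, ignoring the others.
mean_length = sum(len(t) for t in columns["text"]) / len(columns["text"])

print(user_comments)
print(mean_length)
```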

Data Lake
• Unstructured aggregation of data from multiple sources, e.g. databases, logs, expensive data transformations.

• ELT: dump everything in, then transform for specific needs later.

https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02

Trend: both Lake and House

• Both structured and unstructured data together

If you're interested in this stuff: https://dataintensive.net

SQL and DataFrames

• Most data solutions use SQL. Some, like Databricks, use DataFrames.

• SQL is the standard interface for structured data.

• Pandas is the main DataFrame in the Python ecosystem.

• Our advice: become fluent in both

https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html

Pandas

• The workhorse of Python data science

• + Dask DataFrames parallelize Pandas operations over cores

• + RAPIDS to do Pandas operations on GPUs

https://projectcodeed.blogspot.com/2019/08/setting-up-jupyter-notebooks-for-data.html

Motivational Example

• We have to train a photo popularity predictor every night.

• For each photo, training data must include:

- Metadata such as posting time, title, location (in database)

- Some features of the user, such as how many times they logged in today (need to compute from logs)

- Outputs of photo classifiers for content and style (need to run classifiers)

Task Dependencies

• Some tasks can't start until others are finished.

• Finishing a task should kick off the tasks that depend on it.
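The dependency idea can be sketched with `graphlib.TopologicalSorter` from the Python standard library (3.9+). The task names below are hypothetical, loosely following the photo popularity example; a real workflow manager would run the tasks instead of just recording their order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (hypothetical task names).
dependencies = {
    "compute_user_features": {"collect_logs"},
    "run_photo_classifiers": {"fetch_photos"},
    "build_training_set": {
        "compute_user_features",
        "run_photo_classifiers",
        "load_metadata",
    },
    "train_predictor": {"build_training_set"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

execution_order = []
while sorter.is_active():
    # get_ready() returns tasks whose dependencies have all finished.
    for task in sorter.get_ready():
        execution_order.append(task)  # a real system would run the task here
        sorter.done(task)  # marking it done unblocks the tasks that depend on it

print(execution_order)
```

Tasks returned together by `get_ready()` do not depend on each other, which is what lets a scheduler run them in parallel.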

Ideally

• Dependencies are not always files, but programs and databases

• Work needs to be spread over many machines

• Many dependency graphs are executing all at once

Airflow

• Specify the DAG of tasks using Python

https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418

Distributing work
• The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail.

http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
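The queue-and-workers pattern can be sketched on one machine with threads from the Python standard library. This is not how Airflow is implemented; the task names, the retry limit, and the deliberately flaky task are assumptions made for the illustration.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # illustrative retry limit
attempts = {}     # task name -> number of attempts so far
results = {}      # task name -> result
lock = threading.Lock()


def run_task(name: str) -> str:
    """Hypothetical task: 'flaky' fails on its first attempt, then succeeds."""
    if name == "flaky" and attempts[name] == 1:
        raise RuntimeError("transient failure")
    return f"{name} done"


def worker(tasks: queue.Queue) -> None:
    while True:
        name = tasks.get()
        if name is None:  # sentinel: no more work
            tasks.task_done()
            return
        with lock:
            attempts[name] = attempts.get(name, 0) + 1
        try:
            outcome = run_task(name)
            with lock:
                results[name] = outcome
        except RuntimeError:
            if attempts[name] < MAX_ATTEMPTS:
                tasks.put(name)  # put the failed job back on the queue
        finally:
            tasks.task_done()


tasks = queue.Queue()
for name in ["extract", "flaky", "aggregate"]:
    tasks.put(name)

workers = [threading.Thread(target=worker, args=(tasks,)) for _ in range(2)]
for w in workers:
    w.start()
tasks.join()  # wait until every job, including retries, is finished
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()

print(results)
```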

Prefect
• Improvements over Airflow

Dagster
• Another contender


Keep things simple whenever possible

• Don't overengineer

• We have many CPU cores and a lot of RAM nowadays

• For example, UNIX has powerful parallelism, streaming, highly optimized tools

[Figure: timings from the linked article: 26 minutes, 70 seconds, and 18 seconds, with "in parallel" annotations]

https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

Why feature stores?

• All the data processing generates artifacts for training

• How do we

- Make sure that in production, the same processing takes place?

- Avoid recomputation when we retrain?

• Feature stores are a solution (that you may not need!)
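The two questions above can be illustrated with a toy stand-in for a feature store: one shared feature definition used by both training and serving, plus a cache keyed by a version string so retraining does not recompute. All names here are hypothetical, and real feature stores also deal with freshness, storage, and scale, which this sketch ignores.

```python
FEATURE_VERSION = "v1"   # bump when the feature logic changes
_cache = {}              # (version, user_id) -> features; stands in for a store
compute_calls = 0


def compute_user_features(user_id: int, login_events: list) -> dict:
    """Single definition of the processing, shared by training and serving."""
    global compute_calls
    compute_calls += 1
    return {"logins_today": sum(1 for e in login_events if e["user_id"] == user_id)}


def get_features(user_id: int, login_events: list) -> dict:
    """Return cached features if present; otherwise compute and store them."""
    key = (FEATURE_VERSION, user_id)
    if key not in _cache:
        _cache[key] = compute_user_features(user_id, login_events)
    return _cache[key]


events = [{"user_id": 1}, {"user_id": 2}, {"user_id": 1}]
training_features = get_features(1, events)   # computed once
serving_features = get_features(1, events)    # reused, not recomputed
print(training_features, serving_features, compute_calls)
```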

https://eng.uber.com/michelangelo-machine-learning-platform/

https://www.tecton.ai

Featureform

In summary

• Binary data (images, sound files, compressed texts) is stored as objects.

• Metadata (labels, user activity) is stored in a database.

• Don't be afraid of SQL, and know there are accelerated DataFrames

• If dealing with logs and other sources of data, set up a data lake

• Set up a repeatable process to aggregate data needed for training.

• Depending on expense and complexity of processing, a feature store could be useful

• At training time, copy the data that is needed to a filesystem on a fast drive, and optimize GPU transfer.

Hugging Face Datasets

• Over 8K datasets for vision, NLP, etc

Example Dataset
• Github-Code: >1TB of text

• Library allows you to stream it

• Underlying format: Parquet

Example Dataset
• RedCaps: 12M image-text pairs

• Need to download images yourself (multi-threaded!)

• Underlying format: images + JSON files

Example Dataset

• CommonVoice: 14K hours of speech

• Underlying format: mp3 + text
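For datasets that ship only image URLs, a multi-threaded download might look roughly like the sketch below, using `concurrent.futures` from the standard library. The function names, worker count, and example URLs are assumptions; the demo swaps in a fake fetcher so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download one URL; real code would add retries and error handling."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def download_all(urls, fetch_fn=fetch, max_workers=8):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access.
    fake_fetch = lambda url: url.encode("utf-8")
    urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
    print(download_all(urls, fake_fetch))
```

Threads suit this job because downloading is I/O-bound, so workers spend most of their time waiting on the network.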

Activeloop

• Another interesting dataset-focused solution

• Explore, stream, and transform data without saving it all locally

May not have to label data!


Self-supervised learning

Very important idea: Use parts of data to label other parts

[Figure: a summary of how self-supervised learning tasks can be constructed (image source: LeCun's talk)]

[Figure: self-supervised learning by predicting the relative position of two random patches (image source: Doersch et al., 2015)]

https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
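"Use parts of data to label other parts" can be made concrete with a masked-word task on raw text: hide one word and use it as the label for the rest. The sketch below is one simple way to build such pairs; the mask token and function name are illustrative choices, not taken from the lecture.

```python
MASK = "[MASK]"  # placeholder token; the name is an illustrative choice


def masked_examples(sentence: str):
    """Turn one unlabeled sentence into (input, target) training pairs."""
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), word))  # the hidden word is the label
    return examples


for model_input, target in masked_examples("let the data flow"):
    print(model_input, "->", target)
```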

Self-supervised learning

• Works across modalities, too

• Note the "contrastive" training:

- Minimize distance between image and its text

- Maximize distance between image and other texts

https://github.com/openai/CLIP
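The contrastive objective can be sketched in plain Python as a symmetric cross-entropy over cosine similarities. This is a CLIP-style illustration under stated assumptions (toy 2-D embeddings, an arbitrary temperature of 0.1), not OpenAI's implementation.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric loss: image i should match text i and no other text."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    sims = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(sims)
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy 2-D embeddings (made up): matched pairs point in similar directions.
images = [[1.0, 0.1], [0.1, 1.0]]
texts = [[0.9, 0.2], [0.2, 0.9]]
aligned_loss = contrastive_loss(images, texts)
swapped_loss = contrastive_loss(images, texts[::-1])
print(aligned_loss, swapped_loss)
```

The loss is lower when each image sits closest to its own text, which is the behavior training pushes toward.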

Image data augmentation

• Must do for training vision models

• Frameworks (e.g. torchvision) provide functions that do this

• Done on the CPU, in parallel with GPU training

https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c

Augmentation can replace labels

• SimCLR: learning objective is to

- a) maximize agreement between augmented views of the same image

- b) minimize agreement between different images

https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html

Other data augmentation

• Tabular
- Delete some cells to simulate missing data

• Text
- No well established techniques, but replace words with synonyms, change order of things.

• Speech
- Change speed, insert pauses, add audio effects

https://github.com/makcedward/nlpaug
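The tabular and text augmentations can be sketched with the standard library alone. The synonym table, drop probability, and sample row are made up for the illustration; this does not use the nlpaug API.

```python
import random

# Hypothetical synonym table, made up for this example.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}


def drop_cells(row: dict, p: float, rng: random.Random) -> dict:
    """Tabular: replace each cell with None with probability p."""
    return {key: (None if rng.random() < p else value) for key, value in row.items()}


def replace_synonyms(text: str, rng: random.Random) -> str:
    """Text: swap words that have an entry in the synonym table."""
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in text.split()
    )


rng = random.Random(0)  # fixed seed for repeatable augmentation
print(drop_cells({"age": 31, "city": "Oslo", "income": 52000}, p=0.3, rng=rng))
print(replace_synonyms("the quick dog is happy", rng))
```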

Synthetic data
Underrated idea that is often worth starting with

https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/

This can get pretty deep!

Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests

Ask your users to label data for you!

Enables rapid improvement with user labels


But usually: label data...

https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg

Standard set of features:

- bounding boxes, segmentations, keypoints, cuboids

- set of applicable classes

Training the annotators is crucial

Quality assurance is key


Sources of Labor

• Full-service data labeling

• Hire own annotators, promote best ones to quality control

• Crowdsource (Mechanical Turk)


Full Service Companies

• Data labeling requires separate software stack, temporary labor, and quality assurance. Makes sense to outsource.

• Dedicate several days to selecting the best one for you:

• Label gold standard data yourself

• Sales calls with several contenders, ask for work sample on same data

• Ensure agreement with your gold standard, and evaluate on value
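Checking vendor work samples against your gold standard can be as simple as an agreement rate per vendor. The sketch below uses made-up labels and vendor names; it is one plain way to score agreement, and weighing the score against price is left to the reader.

```python
def agreement_rate(gold: dict, vendor: dict) -> float:
    """Fraction of gold-standard items on which the vendor label matches."""
    matches = sum(1 for item, label in gold.items() if vendor.get(item) == label)
    return matches / len(gold)


# Made-up labels for illustration.
gold = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "bird"}
vendor_samples = {
    "vendor_a": {"img1": "cat", "img2": "dog", "img3": "dog", "img4": "bird"},
    "vendor_b": {"img1": "cat", "img2": "cat", "img3": "dog", "img4": "bird"},
}

for name, labels in vendor_samples.items():
    print(name, agreement_rate(gold, labels))
```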

Scale.ai is a dominant data labeling solution

And there are many others

Label Studio
• Open-source edition to run yourself

• Enterprise edition for managed hosting

• Using in lab!

Diffgram

• Another open-source solution, may be even better

Aquarium and Scale Nucleus

• Key feature: see where your current model performs poorly, and label that data

https://www.aquariumlearning.com

https://scale.com/nucleus

Weak supervision

• Snorkel

- Open-source project snorkel.org

- Commercial platform snorkel.ai

• Rubrix: open-source solution
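The core idea of weak supervision is to write several noisy heuristics ("labeling functions") and combine their votes into training labels. The sketch below does this in plain Python with a simple majority vote. It does not use the Snorkel API, which can combine votes with a learned model instead. The heuristics and example texts are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical heuristics for spam detection.
def lf_contains_link(text):
    return "spam" if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text):
    return "spam" if "prize" in text.lower() else ABSTAIN


def lf_short_reply(text):
    return "ham" if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def weak_label(text):
    """Combine noisy votes by simple majority; abstain if nobody votes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


examples = [
    "Claim your prize at https://example.com now",
    "ok thanks",
    "see you at the meeting tomorrow",
]
for text in examples:
    print(repr(text), "->", weak_label(text))
```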

Conclusions

• Think of how you can do self-supervised learning

• Use labeling software and get to know your data by labeling it yourself for a while

• Write out detailed rules and outsource to full-service company if you can afford it

• Else, hiring part-time makes more sense than trying to make crowdsourcing work

Data Versioning

Level 0: unversioned

Level 1: versioned via snapshot at training time

Level 2: versioned as a mix of assets and code

Level 3: specialized data versioning solution


Level 0

• Data lives on filesystem/S3 and database

• Problem: Deployed machine learning models are part code, part data. If data is not versioned, deployed models are not versioned.

• Problem you will face: inability to get back to a previous level of performance

Level 1

• Data is versioned by storing a snapshot of everything at training time

• This kind of works, but would be far better to be able to version data just as easily as code.

Level 2

• Data is versioned as a mix of assets and code.

• Heavy files stored in S3, with unique ids. Training data is stored as JSON or Parquet, referring to these ids and including relevant metadata (labels, user activity, etc).

• Data files can get big, but using git-lfs lets us store them just as easily as code.
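A Level 2 setup might look roughly like the sketch below: each heavy file gets a unique id, and a small JSON manifest that refers to those ids is versioned alongside the code. Deriving the id from a SHA-256 content hash is an assumption made here (the slides only say "unique ids"), and the file contents are stand-ins.

```python
import hashlib
import json


def asset_id(content: bytes) -> str:
    """Derive a unique id from the file contents (SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest()


# Stand-ins for heavy files; in practice these bytes would be uploaded to S3.
assets = {
    "photo_a.jpg": b"\xff\xd8 fake jpeg bytes a",
    "photo_b.jpg": b"\xff\xd8 fake jpeg bytes b",
}
labels = {"photo_a.jpg": "cat", "photo_b.jpg": "dog"}

# The manifest is small text, so it can be committed alongside the code.
manifest = [
    {"asset_id": asset_id(content), "label": labels[name]}
    for name, content in sorted(assets.items())
]
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

With content-derived ids, changing a file changes its id, so an old manifest keeps pointing at exactly the data it was trained on.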

Level 3

• Specialized solutions for versioning data, usually helping you store large files.

• Could totally make sense, but don't assume you need it right away!

• Leading solution is DVC.

Data Versioning Solutions

Warning: This is biased toward DVC

https://dagshub.com/blog/data-version-control-tools/

DVC

Research Area: Privacy

• Federated Learning: training a global model from data on local devices, without ever having access to the data

• Differential privacy: aggregating data such that individual points cannot be identified

• Another topic: Learning on encrypted data

https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/
https://blogs.nvidia.com/blog/2019/10/13/what-is-federated-learning/

Thank you!
