Conversation

@egorchakov egorchakov commented Sep 26, 2025

  • update dataset state from a pair of pl.DataFrames to a combo of (see the sketch after this list):

    • data: a TensorDict for most sample keys (see blog)
    • meta: a pl.DataFrame for non-tensor-friendly dtypes
    • streams: a dict storing metadata about training-time readers
  • add dataset saving/loading

  • add a torchdata node-based dataloader (https://meta-pytorch.org/data/beta/migrate_to_nodes_from_utils.html) with thread workers

  • minor config updates

  • tests refactoring
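
A minimal sketch of what that three-part state and the naive save/load round trip might look like. The class, method names, and on-disk layout below are hypothetical illustrations rather than the PR's actual code; the sketch assumes tensordict's memmap_/load_memmap, polars' parquet I/O, a leading batch dimension on data, and a JSON-serializable streams dict.

import json
from pathlib import Path

import polars as pl
import torch
from tensordict import TensorDict


class TrainingDataset(torch.utils.data.Dataset):
    def __init__(self, data: TensorDict, meta: pl.DataFrame, streams: dict) -> None:
        self.data = data        # tensor-friendly sample keys, kept as torch.Tensors
        self.meta = meta        # non-tensor-friendly dtypes (strings, nested types, ...)
        self.streams = streams  # metadata about training-time readers

    def __len__(self) -> int:
        return self.data.batch_size[0]

    def __getitem__(self, index: int) -> TensorDict:
        return self.data[index]

    def save(self, path: str) -> None:
        root = Path(path)
        root.mkdir(parents=True, exist_ok=True)
        self.data.memmap_(prefix=root / "data")            # tensors -> memory-mapped files
        self.meta.write_parquet(root / "meta.parquet")
        (root / "streams.json").write_text(json.dumps(self.streams))

    @classmethod
    def load(cls, path: str) -> "TrainingDataset":
        root = Path(path)
        return cls(
            data=TensorDict.load_memmap(root / "data"),
            meta=pl.read_parquet(root / "meta.parquet"),
            streams=json.loads((root / "streams.json").read_text()),
        )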

@koritsky koritsky left a comment

I'm trying to understand what exactly resolved the issue. What I understood from the blog post:

  • we want to keep dataset metadata in shared memory so it can be accessed by all processes without being copied
  • every time we access a piece of dataset data we increase its refcount
  • increasing the refcount effectively turns copy-on-write into copy-on-read -> that piece of data is no longer shared but copied into each process -> this gradually increases each process's unique memory
  • the culprit is Python objects, and to avoid this we use torch.Tensor wherever we can, because torch serializes tensors for multiprocessing by saving them to a file with shared access for all child processes
  • to implement this we moved all the tensor-like objects into TensorDicts, since polars doesn't do this serialization properly.

Is it so?
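
For reference, a toy illustration of that mechanism (hypothetical code, not part of this PR): merely reading Python objects from the parent dirties the pages they live on, while reading a fork-inherited tensor buffer does not.

import multiprocessing as mp

import torch

# A big pile of Python objects: every element is a full PyObject with its own
# refcount field scattered across the heap.
py_state = [float(i) for i in range(5_000_000)]

# The same data as one contiguous tensor buffer: no per-element Python objects.
tensor_state = torch.arange(5_000_000, dtype=torch.float64)


def worker() -> None:
    # Merely reading the list elements updates each element's refcount, i.e. it
    # *writes* to the pages holding those PyObjects, so the kernel copies them
    # into this child process ("copy-on-read"): unique memory grows per worker.
    _ = sum(py_state)
    # Reading the tensor only touches its raw data buffer, which nothing writes
    # to, so the fork-inherited pages stay shared. (Only the tiny Python wrapper
    # object gets its refcount touched.) When tensors are instead sent between
    # processes via torch.multiprocessing, their storage is moved to shared
    # memory, with the same effect.
    _ = tensor_state.sum()


if __name__ == "__main__":
    ctx = mp.get_context("fork")
    workers = [ctx.Process(target=worker) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()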

from tqdm import tqdm


@hydra.main(version_base=None)
❤️

egorchakov commented Oct 8, 2025

@koritsky I tried to fix the copy-on-read issue ~2 yrs ago by switching from python containers to pl.DataFrames

iirc that at least fixed the macro issue of training runs getting OOM'd

not sure if that attempt was flawed in some way or whether something has changed about polars/etc since

either way, based on the blog post findings it seems the correct approach is to keep as much state as possible in torch.Tensors (esp. for multi-GPU scenarios) -- one assumption here being that the CoR-susceptible parts of dataset state (TensorDict metadata, the now smaller pl.DataFrame, and the streams dict) are small enough to tolerate their replication
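
For reference, the node-based dataloader mentioned in the PR description could be wired up roughly like this with torchdata.nodes (following the linked migration guide; the exact pipeline below is a guess, not the PR's actual code). Thread workers share the parent's address space, so whatever small state remains in Python objects is not subject to fork-time copy-on-read in the first place.

import torch
from torch.utils.data import RandomSampler
from torchdata.nodes import Batcher, Loader, ParallelMapper, Prefetcher, SamplerWrapper


def build_loader(dataset, batch_size: int = 32, num_workers: int = 4) -> Loader:
    node = SamplerWrapper(RandomSampler(dataset))   # yields sample indices each epoch
    node = ParallelMapper(
        node,
        map_fn=dataset.__getitem__,                 # index -> TensorDict sample
        num_workers=num_workers,
        method="thread",                            # thread workers instead of processes
    )
    node = Batcher(node, batch_size=batch_size, drop_last=True)
    # TensorDict samples can be collated with torch.stack.
    node = ParallelMapper(node, map_fn=torch.stack, num_workers=1, method="thread")
    node = Prefetcher(node, prefetch_factor=2)
    return Loader(node)


# usage:
# loader = build_loader(dataset)
# for batch in loader:
#     ...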

@egorchakov egorchakov merged commit 6a3bdbd into main Oct 8, 2025
1 check passed
@egorchakov egorchakov deleted the exp/memory branch October 8, 2025 10:34