Conversation

@egorchakov egorchakov commented Sep 26, 2025

  • update dataset state from a pair of pl.DataFrames to a combo of (see the sketch after this list):

    • data: a TensorDict for most sample keys (see blog)
    • meta: a pl.DataFrame for non-tensor-friendly dtypes
    • streams: a dict storing metadata about training-time readers
  • add dataset saving/loading

  • add a torchdata node-based dataloader (https://meta-pytorch.org/data/beta/migrate_to_nodes_from_utils.html) with thread workers

  • minor config updates

  • tests refactoring
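
A minimal sketch of what that three-part state and the naive save/load round trip might look like. The class, method names, and on-disk layout below are hypothetical illustrations rather than the PR's actual code; the sketch assumes tensordict's memmap_/load_memmap, polars' parquet I/O, a leading batch dimension on data, and a JSON-serializable streams dict.

import json
from pathlib import Path

import polars as pl
import torch
from tensordict import TensorDict


class TrainingDataset(torch.utils.data.Dataset):
    def __init__(self, data: TensorDict, meta: pl.DataFrame, streams: dict) -> None:
        self.data = data        # tensor-friendly sample keys, kept as torch.Tensors
        self.meta = meta        # non-tensor-friendly dtypes (strings, nested types, ...)
        self.streams = streams  # metadata about training-time readers

    def __len__(self) -> int:
        return self.data.batch_size[0]

    def __getitem__(self, index: int) -> TensorDict:
        return self.data[index]

    def save(self, path: str) -> None:
        root = Path(path)
        root.mkdir(parents=True, exist_ok=True)
        self.data.memmap_(prefix=root / "data")            # tensors -> memory-mapped files
        self.meta.write_parquet(root / "meta.parquet")
        (root / "streams.json").write_text(json.dumps(self.streams))

    @classmethod
    def load(cls, path: str) -> "TrainingDataset":
        root = Path(path)
        return cls(
            data=TensorDict.load_memmap(root / "data"),
            meta=pl.read_parquet(root / "meta.parquet"),
            streams=json.loads((root / "streams.json").read_text()),
        )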

@koritsky koritsky left a comment

I'm trying to understand what exactly resolved the issue. What I understood from the blog post:

  • we want to keep dataset metadata in shared memory so it can be accessed by all processes without being copied
  • every time we access a piece of dataset data we increase its refcount
  • increasing the refcount effectively turns copy-on-write into copy-on-read -> that piece of data is no longer shared but copied into each process -> this gradually increases each process's unique memory
  • the culprit is Python objects, and to avoid this we use torch.Tensor wherever we can, because torch serializes tensors for multiprocessing by saving them to a file with shared access for all child processes
  • to implement this we moved all the tensor-like objects into TensorDicts, since polars doesn't do this serialization properly.

Is it so?
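
For reference, a toy illustration of that mechanism (hypothetical code, not part of this PR): merely reading Python objects from the parent dirties the pages they live on, while reading a fork-inherited tensor buffer does not.

import multiprocessing as mp

import torch

# A big pile of Python objects: every element is a full PyObject with its own
# refcount field scattered across the heap.
py_state = [float(i) for i in range(5_000_000)]

# The same data as one contiguous tensor buffer: no per-element Python objects.
tensor_state = torch.arange(5_000_000, dtype=torch.float64)


def worker() -> None:
    # Merely reading the list elements updates each element's refcount, i.e. it
    # *writes* to the pages holding those PyObjects, so the kernel copies them
    # into this child process ("copy-on-read"): unique memory grows per worker.
    _ = sum(py_state)
    # Reading the tensor only touches its raw data buffer, which nothing writes
    # to, so the fork-inherited pages stay shared. (Only the tiny Python wrapper
    # object gets its refcount touched.) When tensors are instead sent between
    # processes via torch.multiprocessing, their storage is moved to shared
    # memory, with the same effect.
    _ = tensor_state.sum()


if __name__ == "__main__":
    ctx = mp.get_context("fork")
    workers = [ctx.Process(target=worker) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()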

from tqdm import tqdm


@hydra.main(version_base=None)
❤️

egorchakov commented Oct 8, 2025

@koritsky I tried to fix the copy-on-read issue ~2 yrs ago by switching from python containers to pl.DataFrames

iirc that at least fixed the macro issue of training runs getting OOM'd

not sure if that attempt was flawed in some way or whether something has changed about polars/etc since

either way, based on the blog post findings it seems the correct approach is to keep as much state as possible in torch.Tensors (esp. for multi-GPU scenarios) -- one assumption here being that the CoR-susceptible parts of dataset state (TensorDict metadata, the now smaller pl.DataFrame, and the streams dict) are small enough to tolerate their replication
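
For reference, the node-based dataloader mentioned in the PR description could be wired up roughly like this with torchdata.nodes (following the linked migration guide; the exact pipeline below is a guess, not the PR's actual code). Thread workers share the parent's address space, so whatever small state remains in Python objects is not subject to fork-time copy-on-read in the first place.

import torch
from torch.utils.data import RandomSampler
from torchdata.nodes import Batcher, Loader, ParallelMapper, Prefetcher, SamplerWrapper


def build_loader(dataset, batch_size: int = 32, num_workers: int = 4) -> Loader:
    node = SamplerWrapper(RandomSampler(dataset))   # yields sample indices each epoch
    node = ParallelMapper(
        node,
        map_fn=dataset.__getitem__,                 # index -> TensorDict sample
        num_workers=num_workers,
        method="thread",                            # thread workers instead of processes
    )
    node = Batcher(node, batch_size=batch_size, drop_last=True)
    # TensorDict samples can be collated with torch.stack.
    node = ParallelMapper(node, map_fn=torch.stack, num_workers=1, method="thread")
    node = Prefetcher(node, prefetch_factor=2)
    return Loader(node)


# usage:
# loader = build_loader(dataset)
# for batch in loader:
#     ...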

@egorchakov egorchakov merged commit 6a3bdbd into main Oct 8, 2025
1 check passed
@egorchakov egorchakov deleted the exp/memory branch October 8, 2025 10:34