ThreadPrefetchIterDataset memory leaks / zombie threads that hang exiting? #1196

@am001122

(This seems related to #1021, but I don't think it's the same issue, since grain_nightly presumably includes the fix for #1021 from #1051.)

This minimal reproducer:

import grain
import jax
import numpy as np
from grain.experimental import device_put as grain_device_put
from grain.sources import RandomAccessDataSource


class FakeDataSource(RandomAccessDataSource):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return {"x": np.ones((32,), dtype=np.float32) * idx}


# has to be in main() to hang
def main():
    source = FakeDataSource()
    dataset = grain.MapDataset.source(source)
    # has to have this to hang---something to do with nested threaded pre-fetching?
    dataset = grain_device_put(dataset, device=jax.devices()[0])

    # if this is for _ in iter(dataset) it doesn't hang
    data_iter = iter(dataset)
    for _ in data_iter:
        raise RuntimeError


if __name__ == "__main__":
    main()

raises the RuntimeError, prints the traceback, and then hangs indefinitely instead of exiting. I am running it with uv run --python 3.13 --with grain --with jax repro.py.
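A hang after the traceback is printed is consistent with a non-daemon thread that is still alive when the interpreter tries to shut down, although that is an assumption here and not something the output above proves. One way to check is to list the live non-daemon threads just before the script would exit. The sketch below uses only the standard threading module; the helper name live_non_daemon_threads and the stand-in worker are made up for illustration.

```python
import threading


def live_non_daemon_threads():
    """Return the live non-daemon threads other than the main thread."""
    main = threading.main_thread()
    return [
        t for t in threading.enumerate()
        if t is not main and not t.daemon
    ]


if __name__ == "__main__":
    # Stand-in for a background worker that never finishes on its own.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, name="stand-in-prefetch")
    worker.start()

    # While the worker is blocked, it shows up in the list.
    print([t.name for t in live_non_daemon_threads()])

    # Once it has been stopped and joined, it no longer does.
    stop.set()
    worker.join()
    print([t.name for t in live_non_daemon_threads()])
```

In the reproducer, the helper could be called from a finally block around main() to see whether a prefetch thread is still running at that point.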

If everything is moved from main() to the top level of the script, then it fails with an additional exception and the process exits fully and returns to the shell:

Exception ignored in: <function DatasetIterator.__del__ at 0x7f1f6f717420>
Traceback (most recent call last):
  File ".../.venv/lib/python3.13/site-packages/grain/_src/python/dataset/dataset.py", line 1602, in __del__
  File ".../.venv/lib/python3.13/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 989, in close
  File ".../.venv/lib/python3.13/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 1006, in _stop_prefetch
  File ".../.venv/lib/python3.13/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 995, in _clear_buffer
AttributeError: 'NoneType' object has no attribute 'Empty'

Running with Python 3.14 instead (uv run --python 3.14 --with grain --with jax repro.py) prints a different extra traceback and exits back to the shell:

Exception ignored while calling deallocator <function DatasetIterator.__del__ at 0x775ab867e560>:
Traceback (most recent call last):
  File ".../.cache/uv/archive-v0/86c0NR2RDIQ-nixYPZcBP/lib/python3.14/site-packages/grain/_src/python/dataset/dataset.py", line 1551, in __del__
    self.close()
  File ".../.cache/uv/archive-v0/86c0NR2RDIQ-nixYPZcBP/lib/python3.14/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 959, in close
    self._stop_prefetch()
  File ".../.cache/uv/archive-v0/86c0NR2RDIQ-nixYPZcBP/lib/python3.14/site-packages/grain/_src/python/dataset/transformations/prefetch.py", line 977, in _stop_prefetch
    self._prefetch_thread.join()
  File ".../.local/share/uv/python/cpython-3.14.2-linux-x86_64-gnu/lib/python3.14/threading.py", line 1133, in join
    self._os_thread_handle.join(timeout)
PythonFinalizationError: cannot join thread at interpreter shutdown

Running the uv commands with --with grain_nightly instead gives the same results as far as I can tell. I ran with grain_nightly==0.2.16.dev20260112 and the current grain==0.2.15. Whether JAX is CUDA-enabled or CPU-only also doesn't seem to matter, although I ran both tests on a Linux machine with a GPU.

While the example may seem contrived, I distilled it from a real-world failure in my training code where issues in my main training loop were causing jobs to hang indefinitely instead of exiting. I can't use for _ in iter(dataset) as a workaround because I need a reference to the data_iter object for checkpointing.
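A possible workaround that keeps the data_iter reference is to close the iterator explicitly when the loop exits with an error. The tracebacks above show DatasetIterator.__del__ calling self.close(), so a close() method appears to exist on the iterator in these versions; whether calling it from a finally block avoids the hang in the real reproducer is an assumption. The sketch below does not use grain: StandInPrefetchIterator is a hypothetical class that imitates an iterator fed by a non-daemon background thread, so the pattern can be run with the standard library only.

```python
import queue
import threading


class StandInPrefetchIterator:
    """Hypothetical stand-in for an iterator backed by a prefetch thread."""

    _DONE = object()  # sentinel marking the end of the stream

    def __init__(self, n, buffer_size=2):
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._stop = threading.Event()
        # Non-daemon on purpose: the interpreter waits for it at exit.
        self._thread = threading.Thread(target=self._produce, args=(n,))
        self._thread.start()

    def _put(self, item):
        # Retry with a timeout so a stop request is noticed even while
        # the buffer is full. Returns False if stopped before the put.
        while not self._stop.is_set():
            try:
                self._buffer.put(item, timeout=0.05)
                return True
            except queue.Full:
                pass
        return False

    def _produce(self, n):
        for i in range(n):
            if not self._put(i):
                return
        self._put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._buffer.get()
        if item is self._DONE:
            # Put the sentinel back so later calls also stop cleanly.
            self._buffer.put(item)
            raise StopIteration
        return item

    def close(self):
        # Ask the producer to stop, then wait for it to finish.
        self._stop.set()
        self._thread.join()


def run_loop():
    data_iter = StandInPrefetchIterator(1000)
    try:
        for _ in data_iter:
            raise RuntimeError("simulated training failure")
    finally:
        # Stop the background thread before the exception propagates.
        data_iter.close()


if __name__ == "__main__":
    try:
        run_loop()
    except RuntimeError as err:
        print(f"loop failed: {err}")
```

Without the finally block, the stand-in's producer thread would stay blocked on the full buffer and keep the process alive after the traceback. With it, the thread is stopped while the interpreter is still fully running, so nothing has to be cleaned up from __del__ during shutdown.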

Labels: type:bug