Concurrent runs results in corrupt repositories #2869

@JesseFarebro

🐛 Bug

I'm launching ~10 jobs that all write to the same Aim repository. This consistently results in a multitude of issues that corrupt the Aim repository. We ran a similar test on our cluster around December 2022, and from my understanding there were no issues at the time.

To reproduce

I'm on a Slurm cluster and I've created a minimal reproduction that includes the following job script:

test_aim_concurrent_writes.sh
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --cpus-per-task=2
#SBATCH --mem=4GB

# Execute the script for each job in the array
python test_aim_concurrent_writes.py

along with the following Python script:

test_aim_concurrent_writes.py
import time

import numpy
from aim import Run


class LearningCurve:
    def __init__(self, epochs, accuracy, seed: int | None = None):
        self.rng = numpy.random.RandomState(seed)
        self.asymptote = self.rng.uniform(accuracy * 0.95, 1)
        self.accuracy = accuracy
        self.epochs = epochs
        self.steepness = (-1 - (accuracy - self.asymptote)) / (epochs * (accuracy - self.asymptote))

    def __call__(self, epoch: float):
        accuracy = -numpy.abs(1 / (self.steepness * epoch + 1)) + self.asymptote
        return min(
            max(
                accuracy + self.rng.normal(0, max((self.asymptote - accuracy) / 4, 0.0001)),
                0,
            ),
            1,
        )


def main():
    epochs: int = 5_000
    steps: int = 1_000
    epoch_interval: float = 5.0
    step_interval: float = 2.0

    train_curve = LearningCurve(epochs, 1.0)
    valid_curve = LearningCurve(epochs, 0.95)
    test_curve = LearningCurve(epochs, 0.95)
    run = Run()

    for epoch in range(epochs):
        for step in range(steps):
            for name, foo in [
                ("train", train_curve),
                ("valid", valid_curve),
                ("test", test_curve),
            ]:
                run.track(
                    foo(epoch + step / steps),
                    name=name,
                    epoch=epoch,
                )

            run.track(
                step,
                name="step",
                epoch=epoch,
            )

            time.sleep(step_interval)

        time.sleep(epoch_interval)


if __name__ == "__main__":
    main()

To reproduce:

  1. Schedule the job (without access to Slurm, you could probably just run 10 processes concurrently)
  2. Run `aim up`
  3. Try to navigate to any page and you'll see errors everywhere.
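
For anyone without access to Slurm, step 1 could be approximated with a small launcher like the one below. This is only a sketch and has not been verified to trigger the corruption: `launch_workers` is a hypothetical helper name, and it assumes that `test_aim_concurrent_writes.py` sits in the current directory and that every process starts from the same working directory, so that each `Run()` resolves to the same repository. Behavior on a local filesystem may also differ from BeeGFS.

```python
import subprocess
import sys


def launch_workers(command, count=10):
    """Start `count` copies of `command` at once and return their exit codes."""
    # Start every process before waiting on any of them, so the runs overlap.
    processes = [subprocess.Popen(command) for _ in range(count)]
    # Block until each process has finished and collect its exit status.
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Assumption: the reproduction script is in the current working directory.
    exit_codes = launch_workers([sys.executable, "test_aim_concurrent_writes.py"])
    print(exit_codes)
```

The reproduction script sleeps between steps and runs for a very long time, so in practice the launcher would be left running in one terminal (or interrupted with Ctrl+C) while `aim up` is started in another.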

I've listed the most common stack traces at the bottom of this post, but there are even more errors than these, e.g., SQLite errors about not being able to acquire a lock, and other errors about files not being found.

Expected behavior

All jobs successfully write data to Aim without error.

Environment

  • Aim Version: 3.17.5
  • Python version: 3.10.11
  • OS: Ubuntu 18.04
  • Filesystem: BeeGFS

Additional context

Stack Traces

`aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'`
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File ".venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 205, in metric_search_result_streamer
    for trace in run_trace_collection.iter():
  File ".venv/lib/python3.10/site-packages/aim/sdk/sequence_collection.py", line 119, in iter
    for seq_name, ctx, run in self.run.iter_sequence_info_by_type(allowed_dtypes):
  File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 464, in iter_sequence_info_by_type
    for ctx_idx, run_ctx_dict in self.meta_run_tree.subtree('traces').items():
  File "aim/storage/containertreeview.py", line 152, in items
  File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
  File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
  File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
  File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 80, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'

`aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'`
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
    await super().__call__(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
    await responder(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File ".venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 278, in run_search_result_streamer
    run_dict[run.hash]['traces'] = run.collect_sequence_info(sequence_types='metric')
  File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 665, in collect_sequence_info
    ctx_dict = self.idx_to_ctx(idx).to_dict()
  File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 336, in idx_to_ctx
    return self._tracker.idx_to_ctx(idx)
  File ".venv/lib/python3.10/site-packages/aim/sdk/tracker.py", line 80, in idx_to_ctx
    ctx = Context(self.meta_tree['contexts', idx])
  File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
  File "aim/storage/containertreeview.py", line 69, in aim.storage.containertreeview.ContainerTreeView.collect
  File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
  File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
  File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
  File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
  File "aim/storage/union.pyx", line 60, in aim.storage.union.ItemsIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'

Labels: help wanted, type / bug