🐛 Bug
I'm launching ~10 jobs that all write to the same Aim repository. This consistently produces a multitude of errors and corrupts the Aim repository. We ran a similar test on our cluster around December 2022 and, as far as I know, there were no issues at the time.
To reproduce
I'm on a Slurm cluster and I've created a minimal reproduction that includes the following job script:
test_aim_concurrent_writes.sh
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --cpus-per-task=2
#SBATCH --mem=4GB
# Execute the script for each job in the array
python test_aim_concurrent_writes.py
along with the following Python script:
test_aim_concurrent_writes.py
import time

import numpy
from aim import Run


class LearningCurve:
    """Synthetic learning curve with added noise, used only to generate values to track."""

    def __init__(self, epochs, accuracy, seed: int | None = None):
        self.rng = numpy.random.RandomState(seed)
        self.asymptote = self.rng.uniform(accuracy * 0.95, 1)
        self.accuracy = accuracy
        self.epochs = epochs
        self.steepness = (-1 - (accuracy - self.asymptote)) / (epochs * (accuracy - self.asymptote))

    def __call__(self, epoch: float):
        accuracy = -numpy.abs(1 / (self.steepness * epoch + 1)) + self.asymptote
        # Add noise scaled by the distance to the asymptote, then clamp to [0, 1].
        return min(
            max(
                accuracy + self.rng.normal(0, max((self.asymptote - accuracy) / 4, 0.0001)),
                0,
            ),
            1,
        )


def main():
    epochs: int = 5_000
    steps: int = 1_000
    epoch_interval: float = 5.0  # seconds to sleep after each epoch
    step_interval: float = 2.0  # seconds to sleep after each step

    train_curve = LearningCurve(epochs, 1.0)
    valid_curve = LearningCurve(epochs, 0.95)
    test_curve = LearningCurve(epochs, 0.95)

    # Run() with no arguments writes to the default repository, here the shared .aim directory.
    run = Run()

    for epoch in range(epochs):
        for step in range(steps):
            for name, curve in [
                ("train", train_curve),
                ("valid", valid_curve),
                ("test", test_curve),
            ]:
                run.track(
                    curve(epoch + step / steps),
                    name=name,
                    epoch=epoch,
                )
            run.track(
                step,
                name="step",
                epoch=epoch,
            )
            time.sleep(step_interval)
        time.sleep(epoch_interval)


if __name__ == "__main__":
    main()

To reproduce:
- Schedule the job (or, without access to Slurm, you could probably just run 10 processes at once)
- Run `aim up`
- Try to navigate to any page and you'll see errors everywhere.
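For anyone without Slurm, the "just run 10 processes" variant from the first step might look roughly like the sketch below. It is a stand-in for the job array, not something taken from the original setup: `launch_workers` is a hypothetical helper, and running everything on one machine with a local disk may not reproduce whatever part of the problem depends on the cluster filesystem.

```python
import subprocess
import sys


def launch_workers(script, count=10):
    """Start `count` copies of `script` at once and wait for all of them.

    Hypothetical helper (not part of Aim). Returns the exit codes in launch order.
    """
    # Popen returns immediately, so all workers run concurrently.
    procs = [subprocess.Popen([sys.executable, script]) for _ in range(count)]
    # Block until every worker has finished and collect its exit code.
    return [proc.wait() for proc in procs]
```

Calling `launch_workers("test_aim_concurrent_writes.py")` from the directory that holds the script and the `.aim` repository would start ten writers against the same repository, similar to `--array=1-10` in the job script.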
I've listed the most common stack traces at the bottom of this post, but there are more errors than these, e.g., SQLite errors about not being able to acquire a lock, and other errors about files not being found.
Expected behavior
All jobs successfully write data to Aim without error.
Environment
- Aim Version: 3.17.5
- Python version: 3.10.11
- OS: Ubuntu 18.04
- Filesystem: BeeGFS
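To compare another setup against the versions listed above, something like the following sketch could collect them. `describe_environment` is a hypothetical helper, not part of Aim; it uses only the standard library and assumes the packages are installed under the distribution names `aim` and `aimrocks`.

```python
import platform
from importlib import metadata


def describe_environment(packages=("aim", "aimrocks")):
    """Return Python, OS, and package versions as a dict.

    Hypothetical helper; the package names are assumptions.
    """
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            info[name] = "not installed"
    return info
```

The filesystem type is not covered here and would still need to be checked separately.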
Additional context
Stack Traces
`aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'`
ERROR: Exception in ASGI application
Traceback (most recent call last):
File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
result = await app( # type: ignore[func-returns-value]
File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
await responder(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
await self.app(scope, receive, self.send_with_gzip)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File ".env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
await func()
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 205, in metric_search_result_streamer
for trace in run_trace_collection.iter():
File ".venv/lib/python3.10/site-packages/aim/sdk/sequence_collection.py", line 119, in iter
for seq_name, ctx, run in self.run.iter_sequence_info_by_type(allowed_dtypes):
File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 464, in iter_sequence_info_by_type
for ctx_idx, run_ctx_dict in self.meta_run_tree.subtree('traces').items():
File "aim/storage/containertreeview.py", line 152, in items
File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 80, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.Corruption: b'Corruption: Corrupt or unsupported format_version: 2847736105'
`aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'`
Traceback (most recent call last):
File ".venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
result = await app( # type: ignore[func-returns-value]
File ".venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/fastapi/applications.py", line 282, in __call__
await super().__call__(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File ".venv/lib/python3.10/site-packages/aim/web/api/utils.py", line 56, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 24, in __call__
await responder(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 44, in __call__
await self.app(scope, receive, self.send_with_gzip)
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File ".venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
raise e
File ".venv/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
await response(scope, receive, send)
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File ".venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
await func()
File ".venv/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File ".venv/lib/python3.10/site-packages/aim/web/api/runs/utils.py", line 278, in run_search_result_streamer
run_dict[run.hash]['traces'] = run.collect_sequence_info(sequence_types='metric')
File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 665, in collect_sequence_info
ctx_dict = self.idx_to_ctx(idx).to_dict()
File ".venv/lib/python3.10/site-packages/aim/sdk/run.py", line 336, in idx_to_ctx
return self._tracker.idx_to_ctx(idx)
File ".venv/lib/python3.10/site-packages/aim/sdk/tracker.py", line 80, in idx_to_ctx
ctx = Context(self.meta_tree['contexts', idx])
File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
File "aim/storage/containertreeview.py", line 69, in aim.storage.containertreeview.ContainerTreeView.collect
File "aim/storage/prefixview.py", line 232, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 253, in aim.storage.prefixview.PrefixView.items
File "aim/storage/prefixview.py", line 333, in aim.storage.prefixview.PrefixViewItemsIterator.__init__
File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
File "aim/storage/union.pyx", line 60, in aim.storage.union.ItemsIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: .aim/meta/chunks/d7186fe3b1194e3da4db46db/000034.ldb: No such file or directory'