Description
🐛 Bug
We're using PyTorch Lightning and Ray Tune with Aim's remote tracking server, on Aim version 3.15.2. We started encountering an issue where the Metrics Explorer fails to load some runs and hangs on "Searching over runs". This happens when I query for run.hash == '<hash of broken run>', or run any query that includes a broken run. The stack trace in the Aim UI logs reports:
File ".../.venv/lib/python3.10/site-packages/aim/sdk/sequence.py", line 193, in numpy
last_step = self.meta_tree['last_step']
File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
File "aim/storage/containertreeview.py", line 74, in aim.storage.containertreeview.ContainerTreeView.collect
KeyError: "No key ('last_step',) is present."
In the Run Details -> Metrics page, the run does show some of the metrics. One of the metrics fails to load its chart (spinning wheel). This is the metric whose sequence has the above issue.
This run belonged to a Ray Tune trial that was terminated after one validation run. The validation metric is the last metric that should have been reported to Aim. From Ray's side, the trial terminated successfully and the validation metric was reported to Ray.
When I manually query for the Sequence object for this run's metric, via the SDK, I see
seq
---
<Sequence#938752977989852897 name=`loss` context=`<Context#5190394695475244853 {'subset': 'val'}>` run=`<Run#-2273509748222130010 name=b13c0b51dc9f43c3832c42b5 repo=<... read_only=None>>`>
dict(seq.data.meta_tree)
---
{'dtype': 'float',
'first_step': 1023,
'last': 0.13889867067337036,
'version': 2}
So Aim received the metric value but, for some reason, did not record last_step.
It appears the issue with loading the run stems from calling .sample(...) on this SequenceV2Data which internally has no steps:
running list(seq.data.steps.values()) yields []
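To make the symptom easier to spot across runs, here is a minimal sketch of the check I ran by hand. It works on a plain-dict copy of the meta tree (as produced by dict(seq.data.meta_tree) above) and does not call into Aim. The helper name missing_meta_keys and the list of expected keys are my own assumptions, based only on the output above and on the 'last_step' lookup in the traceback.

```python
# Hypothetical helper: not part of the Aim SDK.
# Assumption: a healthy sequence meta tree carries these keys; 'last_step'
# is the one the traceback shows being read in Sequence.numpy.
EXPECTED_META_KEYS = ("dtype", "first_step", "last", "last_step", "version")


def missing_meta_keys(meta, expected=EXPECTED_META_KEYS):
    """Return the expected keys absent from a plain-dict copy of a meta tree."""
    return [key for key in expected if key not in meta]


# Meta tree of the broken sequence, copied from the SDK output above.
broken_meta = {
    "dtype": "float",
    "first_step": 1023,
    "last": 0.13889867067337036,
    "version": 2,
}

print(missing_meta_keys(broken_meta))  # prints ['last_step']
```

A sequence for which this returns a non-empty list is one that would trigger the KeyError above when the UI tries to load it.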
To reproduce
Don't have a consistent repro yet.
Expected behavior
- The UI does not hang, and loads as many of the valid metrics as it can
- last_step is correctly tracked on the Aim side in this setting
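The first expectation can be illustrated with a small sketch in plain Python. It is not how the Aim UI is implemented; load_available and the two stand-in loaders are hypothetical names used only to show the intended behavior, where one sequence with broken metadata does not block the others.

```python
# Hypothetical sketch: none of these names come from the Aim codebase.
def load_available(loaders):
    """Call each loader; keep the results that load and record KeyError failures."""
    loaded, failed = {}, {}
    for name, loader in loaders.items():
        try:
            loaded[name] = loader()
        except KeyError as exc:
            # A sequence with incomplete metadata is skipped, not fatal.
            failed[name] = str(exc)
    return loaded, failed


def load_broken_sequence():
    # Stand-in for a sequence whose meta tree lacks 'last_step'.
    meta = {"dtype": "float", "first_step": 1023, "version": 2}
    return meta["last_step"]


loaders = {
    "train_loss": lambda: [0.5, 0.3],  # stand-in for a healthy sequence
    "val_loss": load_broken_sequence,
}
loaded, failed = load_available(loaders)
print(sorted(loaded), sorted(failed))
```

With this kind of handling, the healthy metric would still be returned and the broken one could be reported as an error in the UI instead of a spinner.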
Environment
- Aim Version: 3.15.2
- Python version: 3.10
- pip version: 23
- OS (e.g., Linux): Ubuntu, running Docker container
- We are running Ray Tune experiments, using PyTorch Lightning trainers, logging to Aim via remote tracking.
Additional context
- Aim remote tracking server running in a Docker container behind network load balancer in AWS
- Ray version 2.3.0