Metrics explorer fails to load due to some run metrics not having last_step in metadata #2554

@schrobot

🐛 Bug

We're using PyTorch Lightning and Ray Tune with the remote tracking server for Aim, on version 3.15.2. We started encountering an issue where the Metrics Explorer fails to load: it hangs on "Searching over runs" whenever the query includes a broken run (for example, run.hash == '<hash of broken run>', or any other query that matches a broken run). The stack trace in the Aim UI logs reports:

File ".../.venv/lib/python3.10/site-packages/aim/sdk/sequence.py", line 193, in numpy
    last_step = self.meta_tree['last_step']
  File "aim/storage/treeview.py", line 51, in aim.storage.treeview.TreeView.__getitem__
  File "aim/storage/containertreeview.py", line 74, in aim.storage.containertreeview.ContainerTreeView.collect
KeyError: "No key ('last_step',) is present."

In the Run Details -> Metrics page, the run does show some of the metrics. One of the metrics fails to load its chart (spinning wheel). This is the metric whose sequence has the above issue.

This run belonged to a Ray Tune trial that was terminated after one validation run. The validation metric should have been the last metric reported to Aim. On Ray's side, the trial terminated successfully and the validation metric was reported to Ray.

When I manually query for the Sequence object for this run's metric, via the SDK, I see

seq
---
<Sequence#938752977989852897 name=`loss` context=`<Context#5190394695475244853 {'subset': 'val'}>` run=`<Run#-2273509748222130010 name=b13c0b51dc9f43c3832c42b5 repo=<... read_only=None>>`>

dict(seq.data.meta_tree)
---
{'dtype': 'float',
 'first_step': 1023,
 'last': 0.13889867067337036,
 'version': 2}

so Aim received the metric value, but for some reason did not mark the last step.

The failure to load the run appears to stem from calling .sample(...) on this SequenceV2Data, which internally has no steps: running list(seq.data.steps.values()) yields [].
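
The missing key can be illustrated without Aim by treating the metadata as a plain dictionary. This is a minimal sketch, not Aim's code: `missing_meta_keys` and `REQUIRED_META_KEYS` are hypothetical names, and the set of required keys is an assumption based only on the traceback and the metadata dump in this report.

```python
from typing import Any, Mapping

# Assumption: keys a reader of the sequence metadata expects to find.
# Only 'last_step' is confirmed by the traceback; 'first_step' is included
# here for symmetry.
REQUIRED_META_KEYS = ("first_step", "last_step")


def missing_meta_keys(meta: Mapping[str, Any]) -> list[str]:
    """Return the required keys that are absent from a sequence's metadata."""
    return [key for key in REQUIRED_META_KEYS if key not in meta]


# Metadata copied from the broken sequence in this report.
broken_meta = {
    "dtype": "float",
    "first_step": 1023,
    "last": 0.13889867067337036,
    "version": 2,
}

print(missing_meta_keys(broken_meta))  # ['last_step']
```

The same check could presumably be run against dict(seq.data.meta_tree) for each sequence of a run to find which metrics are affected.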

To reproduce

Don't have a consistent repro yet.

Expected behavior

  • The UI does not hang, and loads as many of the valid metrics as it can
  • last_step is correctly tracked on the Aim side in this setting
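
The first expectation could look roughly like the sketch below: load each sequence independently and skip the ones whose metadata is incomplete, instead of letting one KeyError abort the whole query. This is only an illustration on plain dictionaries. `load_valid_sequences` and `step_range` are hypothetical helpers, the metric names are made up, and none of this reflects how Aim is actually structured.

```python
from typing import Any, Callable, Iterable, Mapping


def load_valid_sequences(
    sequences: Iterable[Mapping[str, Any]],
    load: Callable[[Mapping[str, Any]], Any],
) -> tuple[list[Any], list[tuple[Any, str]]]:
    """Load every sequence that can be loaded; collect the rest with their errors."""
    loaded, skipped = [], []
    for seq in sequences:
        try:
            loaded.append(load(seq))
        except KeyError as exc:
            # A sequence with incomplete metadata is reported, not fatal.
            skipped.append((seq.get("name"), str(exc)))
    return loaded, skipped


def step_range(seq: Mapping[str, Any]) -> tuple[Any, Any, Any]:
    # Stand-in for the real loader: raises KeyError when 'last_step' is absent.
    return seq["name"], seq["first_step"], seq["last_step"]


sequences = [
    {"name": "train_loss", "first_step": 0, "last_step": 1023},
    {"name": "val_loss", "first_step": 1023},  # mirrors the broken sequence
]
loaded, skipped = load_valid_sequences(sequences, step_range)
print(loaded)   # [('train_loss', 0, 1023)]
print(skipped)  # [('val_loss', "'last_step'")]
```

With this shape, the skipped list could be surfaced to the user as a warning while the valid metrics still render.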

Environment

  • Aim Version: 3.15.2
  • Python version: 3.10
  • pip version: 23
  • OS (e.g., Linux): Ubuntu, running Docker container
  • We are running Ray Tune experiments, using PyTorch Lightning trainers, logging to Aim via remote tracking.

Additional context

  • Aim remote tracking server running in a Docker container behind network load balancer in AWS
  • Ray version 2.3.0

Metadata

Assignees

No one assigned

    Labels

    help wanted (Extra attention is needed), type / bug (Issue type: something isn't working)

    Type

    No type

    Projects

    Status

    Patch-issues

    Relationships

    None yet

    Development

    No branches or pull requests
