
Multiple runs created for a single distributed training task with AIM #3148

@zhiyxu

Description


❓ Question

When using AIM for a distributed training task with multiple GPUs (e.g., 8 GPUs), I noticed that each GPU process creates a separate run with its own hyperparameters and metrics. As a result, a single distributed training task with 8 GPUs produces 8 runs in total.
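To illustrate, a minimal script of this shape reproduces the behavior (the experiment name and loss values are placeholders, and I am assuming a typical `torchrun --nproc_per_node=8 train.py` launch):

```python
# train.py -- torchrun starts this script once per GPU, so the
# lines below execute in 8 separate processes.
import random

from aim import Run

# Each process constructs its own Run, which gets its own run hash,
# so an 8-GPU job ends up with 8 runs in the Aim UI.
run = Run(experiment="distributed-demo")
run["hparams"] = {"lr": 1e-3, "batch_size": 64}

for step in range(100):
    loss = random.random()  # stand-in for a real training step
    run.track(loss, name="loss", step=step)
```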

However, my expectation is to have only one run for the entire distributed training task, regardless of the number of GPUs used. Is this behavior expected, or is there a way to consolidate the runs into a single run for the entire task?

Having multiple runs for a single task makes it difficult to track and analyze the overall performance and metrics. It would be more convenient and intuitive to have a single run that aggregates the data from all GPUs involved in the distributed training process.

Please let me know if this behavior is intended or if there is a configuration option or workaround to achieve a single run for distributed training tasks with AIM.
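The only workaround I have found so far is the rank-0-only logging pattern used with other experiment trackers; here is a minimal sketch, assuming PyTorch's launcher sets the `RANK` environment variable (as `torchrun` does):

```python
import os
import random

from aim import Run

# torchrun / torch.distributed set RANK for every process;
# default to 0 so the script still works on a single GPU.
rank = int(os.environ.get("RANK", "0"))

# Only rank 0 creates a Run; every other rank logs nothing.
run = Run(experiment="distributed-demo") if rank == 0 else None

for step in range(100):
    loss = random.random()  # stand-in for a per-rank training step
    if run is not None:
        # This records only rank 0's local value. Logging a global
        # aggregate would require an all-reduce (e.g. with
        # torch.distributed.all_reduce) before tracking.
        run.track(loss, name="loss", step=step)
```

This works, but it silently drops the other ranks' metrics, so I would prefer a built-in way to aggregate all ranks into a single run if one exists.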
