Description
❓Question
When using AIM for a distributed training task with multiple GPUs (e.g., 8 GPUs), I noticed that each GPU generates a separate run with its own hyperparameters and metrics. As a result, for a single distributed training task with 8 GPUs, a total of 8 runs are created.
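For context, the logging in each process looks roughly like this (a minimal sketch assuming a typical `torchrun` / `torch.distributed` launch; names like `train()` are stand-ins for the real script), which is what produces one run per rank:

```python
import torch.distributed as dist
from aim import Run

dist.init_process_group(backend="nccl")

# Every process constructs its own Run, so an 8-GPU job yields 8 runs.
run = Run(experiment="distributed-training")
run["hparams"] = {"lr": 1e-3, "batch_size": 256, "world_size": dist.get_world_size()}

for step, loss in enumerate(train()):  # train() is a placeholder for the real loop
    run.track(loss, name="loss", step=step, context={"subset": "train"})
```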
However, my expectation is to have only one run for the entire distributed training task, regardless of the number of GPUs used. Is this behavior expected, or is there a way to consolidate the runs into a single run for the entire task?
Having multiple runs for a single task makes it difficult to track and analyze the overall performance and metrics. It would be more convenient and intuitive to have a single run that aggregates the data from all GPUs involved in the distributed training process.
Please let me know if this behavior is intended or if there is a configuration option or workaround to achieve a single run for distributed training tasks with AIM.
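For reference, the only workaround I can think of is to create the Run on rank 0 only and skip tracking on the other ranks (a rough sketch under that assumption, not an Aim-provided mechanism; metrics from the other ranks have to be reduced to rank 0 first):

```python
import torch.distributed as dist
from aim import Run

dist.init_process_group(backend="nccl")
is_main = dist.get_rank() == 0

# Only rank 0 owns the Aim Run; the other ranks do not log anything.
run = Run(experiment="distributed-training") if is_main else None
if run is not None:
    run["hparams"] = {"lr": 1e-3, "world_size": dist.get_world_size()}

for step, loss in enumerate(train()):  # train() is a placeholder for the real loop
    # Average the loss across ranks so rank 0 tracks a global value.
    loss_t = loss.detach().clone()
    dist.all_reduce(loss_t, op=dist.ReduceOp.SUM)
    loss_t /= dist.get_world_size()
    if run is not None:
        run.track(loss_t.item(), name="loss", step=step)
```

This avoids duplicate runs, but it would be good to know whether there is a supported way to do this, or whether runs from multiple ranks can be merged on the Aim side.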