Metrics stop being recorded some time after a run starts #2408

@anselmwang

🐛 Bug

I am using Aim in a somewhat unusual way. With this setup, metrics stop being recorded some time after a run starts.

Here is my workload:

  • Several continuously running pretraining experiments, each generating 6 checkpoints every hour. Each of these experiments corresponds to one Aim run.
  • A monitor that launches a fine-tuning experiment for every checkpoint it detects. Each fine-tuning experiment creates a new Aim run. The monitor ensures that at most 6 fine-tuning experiments run at the same time, so it does not flood Aim. All fine-tuning runs are closed correctly by my program.
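For reference, the monitor's concurrency cap can be sketched roughly as below. This is a minimal illustration, not my actual monitor: `run_all`, `fine_tune_with_tracking`, `open_run` and `train` are hypothetical names, and `open_run` stands for whatever creates the tracking run (any object with a `close()` method).

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 6  # the cap described above


def fine_tune_with_tracking(checkpoint, open_run, train):
    """Run one fine-tuning job and always close its tracking run.

    open_run and train are hypothetical callables supplied by the caller.
    """
    run = open_run(checkpoint)
    try:
        return train(checkpoint, run)
    finally:
        # Close the run even if training raises, so no run is left open.
        run.close()


def run_all(checkpoints, fine_tune, max_parallel=MAX_PARALLEL):
    """Call fine_tune(checkpoint) for each checkpoint, at most max_parallel at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # list() waits for all jobs and re-raises any worker exception.
        return list(pool.map(fine_tune, checkpoints))
```

The pool size is what bounds the number of simultaneously open fine-tuning runs; the `finally` block is what guarantees each one is closed.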

After several days:

  • The metric sequences of the pretraining experiments stop being recorded. Once this happens, no further metrics are recorded for those runs.
  • All the fine-tuning experiments can still create new Aim runs and record all their metrics.

All the fine-tuning runs are closed correctly by my program and are in the finished status.

Investigation and guess

I did a little investigation:

  • The aim server processes don't print any warnings or errors in the terminal (such messages may have been printed earlier, but I didn't write the output to a file).
  • The aim up process complains about OSError: [Errno 24] Too many open files and socket.accept() out of system resource.
  • lsof | grep '\.aim' | wc -l reports 14761 open files.
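The same check can be done per process without lsof. The sketch below is Linux-only (it reads /proc) and assumes you know the PID of the Aim process; `count_open_files` and `nofile_limits` are names made up for this example. The count will not match lsof exactly, since lsof also lists memory-mapped files and other non-descriptor entries.

```python
import os
import resource


def count_open_files(pid, needle=".aim"):
    """Count file descriptors of `pid` whose target path contains `needle`."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if needle in target:
            count += 1
    return count


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
    print(f"open '.aim' files in this process: {count_open_files(os.getpid())}")
```

Comparing the count against the soft limit of the server process shows how close it is to raising Errno 24. Note that `nofile_limits()` reports the limits of the process that calls it, not of another PID.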

It is clear that too many files are open. My guess is that the file handles belonging to the original pretraining runs are somehow closed because of the open-file limit.

#1786 says RocksDB keeps files open for read speed. But since all the fine-tuning runs have finished, when the open-file limit is hit, the open files that belong to the fine-tuning runs should be the ones closed. Files that correspond to an "in progress" run should not be closed.

BTW, lsof | grep '\.aim' | wc -l drops to 1400 after I stop all Aim clients, including browsers, for one day. So Aim does release open files when idle. It seems that Aim needs a smarter file handle recycling mechanism.
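To make the suggestion concrete, here is a rough sketch of the recycling policy I have in mind: a least-recently-used cache of handles that only ever closes handles of finished runs. This is not how Aim or RocksDB manage files internally; `HandleCache`, `opener`, `mark_active` and `mark_finished` are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class HandleCache:
    """LRU cache of open handles that never evicts handles of active runs."""

    def __init__(self, capacity, opener):
        self.capacity = capacity
        self.opener = opener            # callable: run_id -> object with close()
        self._handles = OrderedDict()   # run_id -> handle, least recently used first
        self._active = set()            # ids of runs that are still in progress

    def mark_active(self, run_id):
        self._active.add(run_id)

    def mark_finished(self, run_id):
        self._active.discard(run_id)

    def get(self, run_id):
        if run_id in self._handles:
            self._handles.move_to_end(run_id)  # mark as most recently used
            return self._handles[run_id]
        self._evict_until_room()
        handle = self.opener(run_id)
        self._handles[run_id] = handle
        return handle

    def _evict_until_room(self):
        while len(self._handles) >= self.capacity:
            # Pick the least recently used handle of a run that is not active.
            victim = next((r for r in self._handles if r not in self._active), None)
            if victim is None:
                # Only active runs are open: exceed capacity rather than close one.
                return
            self._handles.pop(victim).close()
```

With this policy, a long-lived pretraining run keeps its handle no matter how many short fine-tuning runs come and go, and the finished fine-tuning runs are the first to be closed when the limit is reached.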

Environment

  • Aim version: v3.14.4
  • Python version: 3.8.10
  • pip version: 21.1.1
  • OS: Ubuntu 18.04.6 LTS

Labels: help wanted, priority / critical-urgent, type / bug
