Description
🐛 Bug
I am using Aim in a somewhat unusual way. With this setup, metrics stop being recorded some time after a run starts.
Let me describe my workload:
- Several continuously running pretraining experiments. These experiments generate 6 checkpoints every hour. Each of these experiments corresponds to an Aim run.
- A monitor that launches a fine-tuning experiment for every checkpoint it detects. Each fine-tuning experiment launches a new Aim run. The monitor ensures that at most 6 fine-tuning experiments are running at the same time, so it does not flood Aim. All the fine-tuning runs are closed correctly by my program.
After several days:
- The metric sequences of the pretraining experiments start to fail to record. Once this happens, no further metrics are recorded for those runs.
- All the fine-tuning experiments are still able to create new Aim runs and record all their metrics.
All the fine-tuning runs are closed correctly by my program and are in the finished status.
Investigation and guess
I did a little investigation:
- The aim server processes don't print any warnings or errors in the terminal (such messages may have appeared earlier, but I did not write the output to a file).
- The aim up process complains about OSError: [Errno 24] Too many open files and socket.accept() out of system resource.
- lsof | grep '\.aim' | wc -l reports 14761 open files.
It is clear that too many files are open. My guess is that the file handles belonging to the original pretraining runs are somehow closed once the open-file limit is reached.
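To narrow down which process holds the handles, a per-process count can be compared with that process's descriptor limit. The following is a minimal sketch, assuming Linux with /proc available; the helper names count_open_files and soft_fd_limit are made up for illustration and are not part of Aim. Note that lsof may also list memory-mapped files and other non-descriptor entries, so its total can be higher than the count returned here.

```python
import os
import resource


def count_open_files(pid, needle=".aim"):
    """Count descriptors of process `pid` whose target path contains `needle`.

    Linux only: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if needle in target:
            count += 1
    return count


def soft_fd_limit():
    """Return the soft limit on open descriptors for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # Inspecting another process (for example the aim up PID) requires
    # permission to read its /proc entry; here we inspect ourselves.
    pid = os.getpid()
    print(f"{count_open_files(pid)} matching descriptors, soft limit {soft_fd_limit()}")
```

If the count for the aim up or aim server process is close to its soft limit, that would be consistent with the Errno 24 error above.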
#1786 says RocksDB keeps files open for read speed. But since all the fine-tuning runs have finished, when the open-file limit is hit, the open files that belong to fine-tuning runs should be the ones closed. We should not close files that correspond to an "in progress" run.
BTW, lsof | grep '\.aim' | wc -l drops to 1400 after I stop all Aim clients, including browsers, for one day. So Aim does release open files when idle. It seems that Aim needs a smarter file handle recycling mechanism.
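To illustrate the kind of recycling I have in mind, here is a minimal sketch of a handle pool that evicts the least recently used handle first but never evicts a handle pinned by an in-progress run. This is not how Aim or RocksDB manage files today; HandlePool, opener, pin and unpin are hypothetical names used only for this example.

```python
from collections import OrderedDict


class HandlePool:
    """Keep at most `max_open` handles, closing idle ones in LRU order.

    Pinned keys (for example in-progress runs) are never evicted, so the
    cap is soft: it can be exceeded if every other handle is pinned.
    """

    def __init__(self, max_open, opener):
        self.max_open = max_open
        self.opener = opener            # callable: key -> object with .close()
        self._handles = OrderedDict()   # least recently used first
        self._pinned = set()

    def acquire(self, key, pin=False):
        if key in self._handles:
            self._handles.move_to_end(key)   # mark as most recently used
        else:
            self._handles[key] = self.opener(key)
        if pin:
            self._pinned.add(key)
        self._evict(keep=key)
        return self._handles[key]

    def unpin(self, key):
        """Call when a run finishes; its handle becomes evictable."""
        self._pinned.discard(key)
        self._evict(keep=None)

    def _evict(self, keep):
        excess = len(self._handles) - self.max_open
        for key in list(self._handles):
            if excess <= 0:
                break
            if key in self._pinned or key == keep:
                continue
            self._handles.pop(key).close()
            excess -= 1
```

With a policy like this, handles of finished fine-tuning runs would be closed first, and a long-running pretraining run would keep its handle until it is unpinned.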
Environment
- Aim Version (e.g., 3.0.1) Aim v3.14.4
- Python version : 3.8.10
- pip version: 21.1.1
- OS (e.g., Linux) Ubuntu 18.04.6 LTS
- Any other relevant information