Hi, since there are no details in the README, could you provide some examples of how to run Step 4? I have it working, but I can't get any speedup from distributed training (it takes the same amount of time and doesn't seem to be utilizing all GPUs). I'm using the train.py script from the DeRy repo, but I'm not even sure whether I'm supposed to be using that or something else from MMClassification.
I'm running on a Slurm cluster, but just sticking to one node. My node has 8 Tesla V100s and 32 cores.
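(For what it's worth, the stock MMClassification codebase ships a tools/slurm_train.sh wrapper that launches one srun task per GPU with --launcher slurm, roughly like the sketch below; the partition and job name are placeholders and I'm not sure whether DeRy includes the same script. I wanted to keep things simple on a single node, though.)

# Sketch of the standard MM*-style Slurm launch: one srun task per GPU, --launcher slurm
srun -p my_partition --job-name=dery_step4 \
    --gres=gpu:8 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=4 \
    --kill-on-bad-exit=1 \
    python -u tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher slurm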
I'm bringing up a bash terminal on the node, then running this way:
LOCAL_RANK=0 RANK=0 WORLD_SIZE=1 MASTER_ADDR='127.0.0.1' MASTER_PORT=29500 PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch --gpus 8
But I'm still only getting an ETA of 3+ days, the same thing I get when running on 1 GPU non-distributed.
2024-12-04 16:34:19,448 - mmcls - INFO - Epoch [1][7700/10010] lr: 9.999e-04, eta: 3 days, 6:47:18, time: 0.255, data_time: 0.052, memory: 7473, loss: 5.9310
When running this way, I notice that the config that gets printed to the console shows gpu_ids = range(0, 1).
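(I'm guessing this is because --launcher pytorch only initializes the process group inside each worker; it doesn't spawn workers itself, so with a single python process and WORLD_SIZE=1 there is still only one worker on one GPU. Is the intended invocation something like the sketch below, via torch.distributed.launch or torchrun, so that 8 processes actually get started and RANK/LOCAL_RANK/WORLD_SIZE get set per process? This is a guess based on how other MM* repos launch training.)

# Sketch: spawn one process per GPU; the launcher sets the rank/world-size env vars itself
PYTHONPATH="$PWD" python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 \
    tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch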
I've also tried simply using the --gpus flag without the --launcher flag:
PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --gpus 8
but then I get this error:
Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 181, in <module>
    main()
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 169, in main
    train_model(
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcls/apis/train.py", line 164, in train_model
    runner.run(data_loaders, cfg.workflow)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/parallel/data_parallel.py", line 62, in train_step
    assert len(self.device_ids) == 1, \
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.
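From the traceback, it looks like mmcls only wraps the model in MMDistributedDataParallel when distributed training is actually initialized via a launcher; without --launcher it falls back to MMDataParallel, which only supports a single device. So I'm assuming the right path is always through a proper launcher, e.g. the standard dist_train.sh wrapper if DeRy ships it (a sketch; in the stock MMClassification script the usage is CONFIG followed by the number of GPUs):

bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8

Is that the intended way to run Step 4, or is there a different entry point I should be using?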