Hi, since there are no details in the README, could you provide some examples of how to run Step 4? I have it working, but I can't get any speedup from distributed training (it takes the same amount of time and doesn't seem to be utilizing all GPUs). I'm using the train.py script from the DeRy repo, but I'm not even sure whether I'm supposed to be using that or something else from MMClassification.
I'm running on a Slurm cluster, but just sticking to one node. My node has 8 Tesla V100s and 32 cores.
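(For what it's worth, the stock MMClassification codebase ships a tools/slurm_train.sh wrapper that launches one srun task per GPU with --launcher slurm, roughly like the sketch below; the partition and job name are placeholders and I'm not sure whether DeRy includes the same script. I wanted to keep things simple on a single node, though.)

# Sketch of the standard MM*-style Slurm launch: one srun task per GPU, --launcher slurm
srun -p my_partition --job-name=dery_step4 \
    --gres=gpu:8 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=4 \
    --kill-on-bad-exit=1 \
    python -u tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher slurm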
I'm bringing up a bash terminal on the node, then running this way:
LOCAL_RANK=0 RANK=0 WORLD_SIZE=1 MASTER_ADDR='127.0.0.1' MASTER_PORT=29500 PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch --gpus 8
But I'm still only getting an ETA of 3+ days, the same thing I get when running on 1 GPU non-distributed.
2024-12-04 16:34:19,448 - mmcls - INFO - Epoch [1][7700/10010] lr: 9.999e-04, eta: 3 days, 6:47:18, time: 0.255, data_time: 0.052, memory: 7473, loss: 5.9310
When running this way, I notice that the config that gets printed to the console shows gpu_ids = range(0, 1).
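(I'm guessing this is because --launcher pytorch only initializes the process group inside each worker; it doesn't spawn workers itself, so with a single python process and WORLD_SIZE=1 there is still only one worker on one GPU. Is the intended invocation something like the sketch below, via torch.distributed.launch or torchrun, so that 8 processes actually get started and RANK/LOCAL_RANK/WORLD_SIZE get set per process? This is a guess based on how other MM* repos launch training.)

# Sketch: spawn one process per GPU; the launcher sets the rank/world-size env vars itself
PYTHONPATH="$PWD" python -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 \
    tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch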
I've also tried simply using the --gpus flag without the --launcher flag:
PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --gpus 8
but then I get this error:
Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 181, in <module>
    main()
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 169, in main
    train_model(
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcls/apis/train.py", line 164, in train_model
    runner.run(data_loaders, cfg.workflow)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/parallel/data_parallel.py", line 62, in train_step
    assert len(self.device_ids) == 1, \
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.
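From the traceback, it looks like mmcls only wraps the model in MMDistributedDataParallel when distributed training is actually initialized via a launcher; without --launcher it falls back to MMDataParallel, which only supports a single device. So I'm assuming the right path is always through a proper launcher, e.g. the standard dist_train.sh wrapper if DeRy ships it (a sketch; in the stock MMClassification script the usage is CONFIG followed by the number of GPUs):

bash tools/dist_train.sh configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py 8

Is that the intended way to run Step 4, or is there a different entry point I should be using?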