How to run Step 4, Fine-Tuning? #10

@ntraft

Description

Hi, since there are no details in the README, could you provide some examples of how to run Step 4? I have it working, but I can't get any speedup from distributed training (it takes the same amount of time and does not seem to be utilizing all GPUs). I'm using the train.py script from the DeRy repo, but I'm not even sure whether I'm supposed to be using that or something else from MMClassification.

I'm running on a Slurm cluster, but just sticking to one node. My node has 8 Tesla V100s and 32 cores.

I'm bringing up a bash terminal on the node and running it like this:

LOCAL_RANK=0 RANK=0 WORLD_SIZE=1 MASTER_ADDR='127.0.0.1' MASTER_PORT=29500 PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --launcher pytorch --gpus 8

But I'm still getting an ETA of 3+ days, the same as what I get when running non-distributed on 1 GPU.

2024-12-04 16:34:19,448 - mmcls - INFO - Epoch [1][7700/10010]  lr: 9.999e-04, eta: 3 days, 6:47:18, time: 0.255, data_time: 0.052, memory: 7473, loss: 5.9310

When running this way, I notice that the config that gets printed to the console shows gpu_ids = range(0, 1).
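For comparison, my understanding is that mmcv-based repos normally spawn one process per GPU via torch.distributed.launch (this is what tools/dist_train.sh does in MMClassification, though I haven't confirmed that DeRy ships the same wrapper). A sketch of what I think the equivalent launch would look like:

# one process per GPU; --launcher pytorch tells mmcv to read the rank/world size
# that torch.distributed.launch sets for each spawned process
PYTHONPATH="$PWD" python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port=29500 \
    tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py \
    --launcher pytorch

Is that the intended way to run Step 4, or is there a different entry point? I suspect my original command (WORLD_SIZE=1 with a single python process) only ever creates one worker, which would explain the unchanged ETA.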

I've also tried simply using the --gpus flag without the --launcher flag:

PYTHONPATH="$PWD" python tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py --gpus 8

but then I get this error:

Traceback (most recent call last):
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 181, in <module>
    main()
  File "/gpfs2/scratch/ntraft/Development/DeRy/tools/train.py", line 169, in main
    train_model(
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcls/apis/train.py", line 164, in train_model
    runner.run(data_loaders, cfg.workflow)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/n/t/ntraft/miniconda3/envs/dery/lib/python3.11/site-packages/mmcv/parallel/data_parallel.py", line 62, in train_step
    assert len(self.device_ids) == 1, \
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.
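Should I instead be using a dist_train.sh / slurm_train.sh style wrapper? For reference, the Slurm pattern I'd guess at (a sketch only, assuming DeRy follows the usual mmcv --launcher slurm convention; <partition> is a placeholder for my cluster's partition name):

# one Slurm task per GPU; mmcv's slurm launcher derives rank/world size from the Slurm env
srun -p <partition> --ntasks=8 --ntasks-per-node=8 --gres=gpu:8 --cpus-per-task=4 \
    python -u tools/train.py configs/dery/imagenet/50m_imagenet_128x8_100e_dery_adamw_freeze.py \
    --launcher slurm

Any pointers on the intended invocation for Step 4 would be much appreciated.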
