RuntimeError: Expected to mark a variable ready only once #1

@filemon11

Description

I get the following error when running __ on two GPUs with Python 3.10.12 and torch 2.4.1:

$ python3 -u -m supar.cmds.const.aj --device $CUDA_VISIBLE_DEVICES train -b -c configs/config-mgpt.ini --dist ddp -p .."$RESULTDIR"/parser.pt --delay=0 --use_vq --train ../treebanks/ptb-gold/train.trees --dev ../treebanks/ptb-gold/dev.trees --test ../treebanks/ptb-gold/test.trees

[...]

[2024-09-20 13:41:15 INFO]
amp: false
batch_size: 250
bert: sberbank-ai/mGPT
bert_pooling: mean
binarize: false
buckets: 32
build: true
cache: false
checkpoint: false
codebook_size: 512
commitment_weight: 0.4
delay: 0
dev: ../treebanks/ptb-gold/dev.trees
device: GPU-dd8053ee-a7ef-a463-3ff9-a0001be18fb2,GPU-aea88d64-d6b2-637e-4907-ac538548ac62
dist: ddp
embed: null
encoder: bert
encoder_dropout: 0.1
epochs: 30
eps: 1.0e-08
finetune: true
folder: ../results/models-con/ptb/abs-mgpt-lstm
lr: 5.0e-05
lr_rate: 10
max_len: null
mix_dropout: 0.0
mlp_dropout: 0.33
mode: train
mu: 0.9
n_bert_layers: 0
n_decoder_layers: 2
n_encoder_hidden: 100
n_plm_embed: 100
nu: 0.999
path: ../results/models-con/ptb/abs-mgpt-lstm/parser.pt
patience: 10
seed: 1
test: ../treebanks/ptb-gold/test.trees
threads: 16
train: ../treebanks/ptb-gold/train.trees
update_steps: 6
use_vq: true
vq_decay: 0.3
vq_passes: 600
wandb: false
warmup: 0.001
weight_decay: 0
workers: 0

[2024-09-20 13:41:15 INFO] Building the fields
Tokenizer name: sberbank-ai/mGPT

 94%|##################7 | 37345/39832 [00:00<00:00, 187175.87it/s]
 82%|################3   | 32471/39832 [00:00<00:00, 324681.67it/s]
 98%|###################6| 39158/39832 [00:00<00:00, 391547.70it/s]
[2024-09-20 13:41:43 INFO] AttachJuxtaposeTree(
 (words): SubwordField(vocab_size=100000, pad=<pad>, unk=<unk>, bos=<s>, eos=<|endoftext|>)
 (tags): Field(vocab_size=50, pad=<pad>, unk=<unk>, bos=<bos>, eos=<eos>, lower=True)
 (trees): RawField()
 (node): Field(use_vocab=False)
 (parent): Field(vocab_size=1305, unk=<unk>)
 (new): Field(vocab_size=1305, unk=<unk>)
)
[2024-09-20 13:41:43 INFO] Building the model
Tokenizer name: sberbank-ai/mGPT
[2024-09-20 13:42:22 INFO] AttachJuxtaposeConstituencyModel(
  (encoder): TransformerEmbedding(sberbank-ai/mGPT, n_layers=24, n_out=100, stride=256, pooling=mean, pad_index=1, finetune=True)
  (encoder_dropout): Dropout(p=0.1, inplace=False)
  (vq): VectorQuantize()
  (label_embed): Embedding(1306, 100)
  (gnn_layers): GraphConvolutionalNetwork(n_model=100, n_layers=3, selfloop=True, dropout=0.33, norm=True)
  (node_classifier): Sequential(
    (0): Linear(in_features=200, out_features=50, bias=True)
    (1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=50, out_features=1, bias=True)
  )
  (label_classifier): Sequential(
    (0): Linear(in_features=200, out_features=50, bias=True)
    (1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=50, out_features=2610, bias=True)
  )
  (criterion): CrossEntropyLoss()
)

[2024-09-20 13:42:25 INFO] Loading the data

 97%|###################4| 1654/1700 [00:02<00:00, 730.03it/s]
[2024-09-20 13:43:51 INFO] train: Dataset(n_sentences=39832, n_batches=18265, n_buckets=32)

 99%|###################7| 2386/2416 [00:03<00:00, 756.57it/s]
/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py:168: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = GradScaler(enabled=args.amp)
/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py:168: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = GradScaler(enabled=args.amp)
[2024-09-20 13:43:56 INFO] dev:   Dataset(n_sentences=1700, n_batches=778, n_buckets=32)
[2024-09-20 13:43:56 INFO] test:  Dataset(n_sentences=2416, n_batches=1096, n_buckets=32)


  0%|                    | 5/18265 [00:02<1:41:23,  3.00it/s, lr: 5.4945e-07 - loss: 2.7114]
[2024-09-20 13:44:00 INFO] Epoch 1 / 30:
Tokenizer name: sberbank-ai/mGPT
Tokenizer name: sberbank-ai/mGPT
W0920 13:44:10.017000 47482913914112 torch/multiprocessing/spawn.py:146] Terminating process 118322 via signal SIGTERM
Traceback (most recent call last):
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/const/aj.py", line 38, in <module>
    main()
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/const/aj.py", line 34, in main
    init(parser)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/run.py", line 40, in init
    mp.spawn(parse, args=(args,), nprocs=get_device_count())
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
    while not context.join():
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/run.py", line 68, in parse
    parser.train(**args)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/models/const/aj/parser.py", line 57, in train
    return super().train(**Config().update(locals()))
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py", line 214, in train
    self.backward(loss)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py", line 569, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
    _engine_run_backward(
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 319 with name label_classifier.3.bias has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

Since I don't know whether this is caused by the original Supar code or by your additions, I'm posting the error here.
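
For what it's worth, reason 2) from the error message can be reproduced in isolation with a small self-contained script (toy model below, purely hypothetical and unrelated to the SuPar/incpar code): under DDP, routing the same submodule through two reentrant `torch.utils.checkpoint` segments makes its parameter hooks fire twice within a single backward pass, which produces exactly the "marked as ready twice" failure above.

```python
# Minimal sketch of reason 2) from the DDP error message: the same parameters
# reused in multiple reentrant backward passes. Hypothetical toy model, not the
# SuPar/incpar code; runs on CPU with the gloo backend.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = torch.nn.Linear(8, 8)

    def forward(self, x):
        # The same `shared` layer sits inside two reentrant checkpoint
        # segments, so its parameters take part in two reentrant backward
        # passes and their DDP reducer hooks fire twice in one iteration.
        h = checkpoint(self.shared, x, use_reentrant=True)
        h = checkpoint(self.shared, h, use_reentrant=True)
        return h.sum()


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(Net())
    loss = model(torch.randn(4, 8, requires_grad=True))
    loss.backward()  # RuntimeError: Expected to mark a variable ready only once

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

If the module graph really is static across iterations, the `_set_static_graph()` workaround mentioned in the message (or passing `static_graph=True` when constructing DDP) is the usual way around it, but I haven't checked whether that is appropriate here.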
