I get the following error when running __ on two GPUs with Python 3.10.12 and torch 2.4.1:
$ python3 -u -m supar.cmds.const.aj --device $CUDA_VISIBLE_DEVICES train -b -c configs/config-mgpt.ini --dist ddp \
    -p .."$RESULTDIR"/parser.pt \
    --delay=0 --use_vq \
    --train ../treebanks/ptb-gold/train.trees \
    --dev ../treebanks/ptb-gold/dev.trees \
    --test ../treebanks/ptb-gold/test.trees
[...]
[2024-09-20 13:41:15 INFO]
amp: false
batch_size: 250
bert: sberbank-ai/mGPT
bert_pooling: mean
binarize: false
buckets: 32
build: true
cache: false
checkpoint: false
codebook_size: 512
commitment_weight: 0.4
delay: 0
dev: ../treebanks/ptb-gold/dev.trees
device: GPU-dd8053ee-a7ef-a463-3ff9-a0001be18fb2,GPU-aea88d64-d6b2-637e-4907-ac538548ac62
dist: ddp
embed: null
encoder: bert
encoder_dropout: 0.1
epochs: 30
eps: 1.0e-08
finetune: true
folder: ../results/models-con/ptb/abs-mgpt-lstm
lr: 5.0e-05
lr_rate: 10
max_len: null
mix_dropout: 0.0
mlp_dropout: 0.33
mode: train
mu: 0.9
n_bert_layers: 0
n_decoder_layers: 2
n_encoder_hidden: 100
n_plm_embed: 100
nu: 0.999
path: ../results/models-con/ptb/abs-mgpt-lstm/parser.pt
patience: 10
seed: 1
test: ../treebanks/ptb-gold/test.trees
threads: 16
train: ../treebanks/ptb-gold/train.trees
update_steps: 6
use_vq: true
vq_decay: 0.3
vq_passes: 600
wandb: false
warmup: 0.001
weight_decay: 0
workers: 0
[2024-09-20 13:41:15 INFO] Building the fields
Tokenizer name: sberbank-ai/mGPT
[...]
[2024-09-20 13:41:43 INFO] AttachJuxtaposeTree(
 (words): SubwordField(vocab_size=100000, pad=<pad>, unk=<unk>, bos=<s>, eos=<|endoftext|>)
 (tags): Field(vocab_size=50, pad=<pad>, unk=<unk>, bos=<bos>, eos=<eos>, lower=True)
 (trees): RawField()
 (node): Field(use_vocab=False)
 (parent): Field(vocab_size=1305, unk=<unk>)
 (new): Field(vocab_size=1305, unk=<unk>)
)
[2024-09-20 13:41:43 INFO] Building the model
Tokenizer name: sberbank-ai/mGPT
[2024-09-20 13:42:22 INFO] AttachJuxtaposeConstituencyModel(
  (encoder): TransformerEmbedding(sberbank-ai/mGPT, n_layers=24, n_out=100, stride=256, pooling=mean, pad_index=1, finetune=True)
  (encoder_dropout): Dropout(p=0.1, inplace=False)
  (vq): VectorQuantize()
  (label_embed): Embedding(1306, 100)
  (gnn_layers): GraphConvolutionalNetwork(n_model=100, n_layers=3, selfloop=True, dropout=0.33, norm=True)
  (node_classifier): Sequential(
    (0): Linear(in_features=200, out_features=50, bias=True)
    (1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=50, out_features=1, bias=True)
  )
  (label_classifier): Sequential(
    (0): Linear(in_features=200, out_features=50, bias=True)
    (1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=50, out_features=2610, bias=True)
  )
  (criterion): CrossEntropyLoss()
)
[2024-09-20 13:42:25 INFO] Loading the data
[...]
[2024-09-20 13:43:51 INFO] train: Dataset(n_sentences=39832, n_batches=18265, n_buckets=32)
[...]
/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py:168: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = GradScaler(enabled=args.amp)
/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py:168: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = GradScaler(enabled=args.amp)
[2024-09-20 13:43:56 INFO] dev:   Dataset(n_sentences=1700, n_batches=778, n_buckets=32)
[2024-09-20 13:43:56 INFO] test:  Dataset(n_sentences=2416, n_batches=1096, n_buckets=32)
  0%|                    | 5/18265 00:02<1:41:23,  3.00it/s, lr: 5.4945e-07 - loss: 2.7114
[2024-09-20 13:44:00 INFO] Epoch 1 / 30:
Tokenizer name: sberbank-ai/mGPT
Tokenizer name: sberbank-ai/mGPT
W0920 13:44:10.017000 47482913914112 torch/multiprocessing/spawn.py:146] Terminating process 118322 via signal SIGTERM
Traceback (most recent call last):
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/const/aj.py", line 38, in <module>
    main()
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/const/aj.py", line 34, in main
    init(parser)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/run.py", line 40, in init
    mp.spawn(parse, args=(args,), nprocs=get_device_count())
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
    while not context.join():
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/run.py", line 68, in parse
    parser.train(**args)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/models/const/aj/parser.py", line 57, in train
    return super().train(**Config().update(locals()))
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py", line 214, in train
    self.backward(loss)
  File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py", line 569, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
    _engine_run_backward(
  File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 319 with name label_classifier.3.bias has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.
Since I don't know whether this is caused by the original Supar code or by additions you made, I'm posting the error here.
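
For reference, here is roughly what the static-graph workaround mentioned in the error message would look like. This is only a sketch with a toy model and a single-process "gloo" group, not supar's actual DDP wrapping code (I assume that lives somewhere in supar/parser.py, but I haven't checked), and I don't know whether it addresses the underlying cause of label_classifier.3.bias being marked ready twice:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process setup just so the example runs on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Toy stand-in for the real AttachJuxtaposeConstituencyModel.
model = torch.nn.Linear(100, 2610)

# static_graph=True is the workaround the error message refers to (torch >= 1.11);
# on older versions one would call ddp_model._set_static_graph() after construction.
ddp_model = DDP(model, static_graph=True)

x = torch.randn(8, 100)
loss = ddp_model(x).sum()
loss.backward()

dist.destroy_process_group()

As far as I understand, static_graph is only valid if the autograd graph really is identical across iterations, so it might just mask the problem rather than fix whatever is marking the parameter ready twice.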