Description
I get the following error when running __ on two GPUs with Python 3.10.12 and torch 2.4.1:
$ python3 -u -m supar.cmds.const.aj --device $CUDA_VISIBLE_DEVICES train -b -c configs/config-mgpt.ini --dist ddp -p .."$RESULTDIR"/parser.pt --delay=0 --use_vq --train ../treebanks/ptb-gold/train.trees --dev ../treebanks/ptb-gold/dev.trees --test ../treebanks/ptb-gold/test.trees
[...]
[2024-09-20 13:41:15 INFO]
amp: false
batch_size: 250
bert: sberbank-ai/mGPT
bert_pooling: mean
binarize: false
buckets: 32
build: true
cache: false
checkpoint: false
codebook_size: 512
commitment_weight: 0.4
delay: 0
dev: ../treebanks/ptb-gold/dev.trees
device: GPU-dd8053ee-a7ef-a463-3ff9-a0001be18fb2,GPU-aea88d64-d6b2-637e-4907-ac538548ac62
dist: ddp
embed: null
encoder: bert
encoder_dropout: 0.1
epochs: 30
eps: 1.0e-08
finetune: true
folder: ../results/models-con/ptb/abs-mgpt-lstm
lr: 5.0e-05
lr_rate: 10
max_len: null
mix_dropout: 0.0
mlp_dropout: 0.33
mode: train
mu: 0.9
n_bert_layers: 0
n_decoder_layers: 2
n_encoder_hidden: 100
n_plm_embed: 100
nu: 0.999
path: ../results/models-con/ptb/abs-mgpt-lstm/parser.pt
patience: 10
seed: 1
test: ../treebanks/ptb-gold/test.trees
threads: 16
train: ../treebanks/ptb-gold/train.trees
update_steps: 6
use_vq: true
vq_decay: 0.3
vq_passes: 600
wandb: false
warmup: 0.001
weight_decay: 0
workers: 0
[2024-09-20 13:41:15 INFO] Building the fields
Tokenizer name: sberbank-ai/mGPT
94%|##################7 | 37345/39832 [00:00<00:00, 187175.87it/s]
82%|################3 | 32471/39832 [00:00<00:00, 324681.67it/s]
98%|###################6| 39158/39832 [00:00<00:00, 391547.70it/s]
[2024-09-20 13:41:43 INFO] AttachJuxtaposeTree(
(words): SubwordField(vocab_size=100000, pad=<pad>, unk=<unk>, bos=<s>, eos=<|endoftext|>)
(tags): Field(vocab_size=50, pad=<pad>, unk=<unk>, bos=<bos>, eos=<eos>, lower=True)
(trees): RawField()
(node): Field(use_vocab=False)
(parent): Field(vocab_size=1305, unk=<unk>)
(new): Field(vocab_size=1305, unk=<unk>)
)
[2024-09-20 13:41:43 INFO] Building the model
Tokenizer name: sberbank-ai/mGPT
[2024-09-20 13:42:22 INFO] AttachJuxtaposeConstituencyModel(
(encoder): TransformerEmbedding(sberbank-ai/mGPT, n_layers=24, n_out=100, stride=256, pooling=mean, pad_index=1, finetune=True)
(encoder_dropout): Dropout(p=0.1, inplace=False)
(vq): VectorQuantize()
(label_embed): Embedding(1306, 100)
(gnn_layers): GraphConvolutionalNetwork(n_model=100, n_layers=3, selfloop=True, dropout=0.33, norm=True)
(node_classifier): Sequential(
(0): Linear(in_features=200, out_features=50, bias=True)
(1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
(2): ReLU()
(3): Linear(in_features=50, out_features=1, bias=True)
)
(label_classifier): Sequential(
(0): Linear(in_features=200, out_features=50, bias=True)
(1): LayerNorm((50,), eps=1e-05, elementwise_affine=True)
(2): ReLU()
(3): Linear(in_features=50, out_features=2610, bias=True)
)
(criterion): CrossEntropyLoss()
)
[2024-09-20 13:42:25 INFO] Loading the data
0%| | 0/39832 [00:00<?, ?it/s]
97%|###################4| 1654/1700 [00:02<00:00, 730.03it/s]
[2024-09-20 13:43:51 INFO] train: Dataset(n_sentences=39832, n_batches=18265, n_buckets=32)
99%|###################7| 2386/2416 [00:03<00:00, 756.57it/s]
/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py:168: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = GradScaler(enabled=args.amp)
/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/transformers/optimization.py:591: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py:168: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = GradScaler(enabled=args.amp)
[2024-09-20 13:43:56 INFO] dev: Dataset(n_sentences=1700, n_batches=778, n_buckets=32)
[2024-09-20 13:43:56 INFO] test: Dataset(n_sentences=2416, n_batches=1096, n_buckets=32)
0%| | 0/18265 [00:00<?, ?it/s]
0%| | 5/18265 [00:02<1:41:23, 3.00it/s, lr: 5.4945e-07 - loss: 2.7114]
[2024-09-20 13:44:00 INFO] Epoch 1 / 30:
Tokenizer name: sberbank-ai/mGPT
Tokenizer name: sberbank-ai/mGPT
W0920 13:44:10.017000 47482913914112 torch/multiprocessing/spawn.py:146] Terminating process 118322 via signal SIGTERM
Traceback (most recent call last):
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/const/aj.py", line 38, in <module>
main()
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/const/aj.py", line 34, in main
init(parser)
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/run.py", line 40, in init
mp.spawn(parse, args=(args,), nprocs=get_device_count())
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
while not context.join():
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
fn(i, *args)
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/cmds/run.py", line 68, in parse
parser.train(**args)
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/models/const/aj/parser.py", line 57, in train
return super().train(**Config().update(locals()))
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py", line 214, in train
self.backward(loss)
File "/gpfs/scratch/lumie101/12053192.hpc-batch/incpar/supar/parser.py", line 569, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
torch.autograd.backward(
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
_engine_run_backward(
File "/home/lumie101/.conda/envs/clusopt/lib/python3.10/site-packages/torch/autograd/graph.py", line 769, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 319 with name label_classifier.3.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
Since I don't know whether this is caused by the original Supar code or by the additions you made, I'm posting the error here.
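In case it helps with triage: the error message itself suggests static-graph mode as a workaround when the module graph does not change across iterations. I haven't verified that this applies here (the double-marking may well be a genuine parameter-sharing issue), and I don't know exactly where supar wraps the model in DDP, so the snippet below is only a minimal, generic sketch of how the flag would be set; `wrap_model`, `model` and `rank` are illustrative names, not actual supar code:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical helper: `model` stands in for the constituency model and
# `rank` for the local GPU index handed to each mp.spawn worker.
def wrap_model(model: nn.Module, rank: int) -> DDP:
    model = model.to(rank)
    # static_graph=True tells DDP that the set of parameters used and the
    # autograd graph stay the same across iterations, which is the workaround
    # the "marked ready only once" error message points to (equivalent to
    # calling ddp_model._set_static_graph() after construction).
    return DDP(model, device_ids=[rank], static_graph=True)
```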