Hi, I’m trying to train stage 1 with the following command (without DeepSpeed, for debugging):
```bash
file="src/training/train.py"
CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH="${PWD}:${PWD}/src" \
TORCH_HOME="/mnt/aopolin/.cache/torch" \
HF_HOME="/mnt/aopolin/.cache/huggingface" \
HF_HUB_CACHE="/mnt/aopolin/.cache/huggingface/hub" \
WANDB_PROJECT="CoVT" \
python ${file} \
--use_liger True \
--lora_enable True \
--vision_lora True \
--use_dora False \
--lora_namespan_exclude "['embed_tokens', 'lm_head', 'dino', 'sam', 'depth', 'SD', 'internvit', 'pidinet', 'siglip', 'metaclip']" \
--lora_rank 16 \
--lora_alpha 32 \
--lora_dropout 0.01 \
--num_lora_modules -1 \
--model_id Qwen2.5-VL-7B-Instruct \
--model_path "/mnt/aopolin/pretrained_models/huggingface/Qwen/Qwen2.5-VL-7B-Instruct/main" \
--data_path "/mnt/aopolin/projects/CoVT/train_ds.json" \
--image_folder "" \
--remove_unused_columns False \
--freeze_vision_tower True \
--freeze_llm True \
--tune_merger False \
--bf16 True \
--fp16 False \
--disable_flash_attn2 False \
--output_dir "/mnt/aopolin/projects/CoVT/outputs/stage0" \
--max_steps 6000 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--image_min_pixels 200704 \
--image_max_pixels 802816 \
--image_resized_width 448 \
--image_resized_height 448 \
--learning_rate 5e-5 \
--projection_layer_lr 1e-5 \
--weight_decay 0.01 \
--warmup_ratio 0.05 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--tf32 True \
--gradient_checkpointing True \
--lazy_preprocess True \
--save_strategy steps \
--save_steps 1000 \
--save_total_limit 1 \
--dataloader_num_workers 0 \
--report_to wandb \
--run_name 1130_reim_finetune_stage0 \
--anchor_model_id "['sam', 'depth', 'dino']" \
--training_stage full \
--stage_0_step 6000 \
--stage_1_step 6000 \
--stage_2_step 6000 \
    --vqa_only_stage 6000
```

During training, I hit the following error:
```
Traceback (most recent call last):
File "/mnt/aopolin/projects/CoVT/train/src/training/train.py", line 390, in <module>
train()
File "/mnt/aopolin/projects/CoVT/train/src/training/train.py", line 365, in train
trainer.train()
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 2245, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 3718, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 3783, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/accelerate/utils/operations.py", line 819, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/accelerate/utils/operations.py", line 807, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/peft/peft_model.py", line 881, in forward
return self.get_base_model()(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/projects/CoVT/train/src/training/covt_qwen2_5_vl.py", line 1508, in forward
gt_masks, num_masks = self.anchor_models.get_sam_mask_improved(image_files[i][0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/projects/CoVT/train/src/training/covt_qwen2_5_vl.py", line 662, in get_sam_mask_improved
masks = self.mask_generator.generate(image_np)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/projects/CoVT/train/src/anchors/segment_anything/automatic_mask_generator.py", line 153, in generate
mask_data = self._generate_masks(image)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/aopolin/projects/CoVT/train/src/anchors/segment_anything/automatic_mask_generator.py", line 215, in _generate_masks
data.to_numpy()
File "/mnt/aopolin/projects/CoVT/train/src/anchors/segment_anything/utils/amg.py", line 75, in to_numpy
v = v.detach().cpu()
^^^^
TypeError: Got unsupported ScalarType BFloat16
```
From the stack trace, it looks like `MaskData.to_numpy` is not robust to tensors with `dtype=torch.bfloat16`: when `v` is `bfloat16`, the CPU tensor cannot be converted to a NumPy array, and the conversion raises `TypeError: Got unsupported ScalarType BFloat16` (NumPy has no `bfloat16` dtype).
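For what it's worth, the failure can be reproduced outside the SAM pipeline. Here is a minimal standalone snippet, assuming the error indeed comes from the tensor-to-NumPy conversion:

```python
import torch

# NumPy has no bfloat16 dtype, so converting a bfloat16 CPU tensor
# to a NumPy array raises the same error seen in the traceback above.
x = torch.ones(4, dtype=torch.bfloat16)
try:
    x.detach().cpu().numpy()
except TypeError as e:
    print(e)  # Got unsupported ScalarType BFloat16
```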
My questions are:
- Is it expected that `MaskData.to_numpy` might receive `bfloat16` tensors in this pipeline?
- Would it be acceptable to upcast `bfloat16` tensors to `float32` inside `to_numpy` (e.g., `v = v.to(torch.float32)` after moving to CPU)? A rough sketch of what I have in mind is included after these questions.
- Would such a cast have any noticeable impact on the quality/precision of the generated masks in your intended usage?
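To make the second question concrete, here is a standalone sketch of the upcast I'm proposing; the actual change would go inside `MaskData.to_numpy` in `src/anchors/segment_anything/utils/amg.py`, whose exact structure may differ from what this helper assumes:

```python
import numpy as np
import torch


def tensor_to_numpy(v: torch.Tensor) -> np.ndarray:
    """Convert a tensor to NumPy, upcasting bfloat16 first (NumPy has no bfloat16 dtype)."""
    v = v.detach().cpu()
    if v.dtype == torch.bfloat16:
        v = v.to(torch.float32)
    return v.numpy()


# Example usage: without the upcast, this conversion would raise the TypeError above.
x = torch.ones(4, dtype=torch.bfloat16)
print(tensor_to_numpy(x).dtype)  # float32
```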
Thanks in advance for any guidance!