MaskData.to_numpy crashes on torch.bfloat16 tensors (TypeError: Got unsupported ScalarType BFloat16) #7

@aopolin-lv

Description

Hi, I’m trying to train stage 1 with the following command (without DeepSpeed, for debugging):

file="src/training/train.py"

CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH="${PWD}:${PWD}/src" \
TORCH_HOME="/mnt/aopolin/.cache/torch" \
HF_HOME="/mnt/aopolin/.cache/huggingface" \
HF_HUB_CACHE="/mnt/aopolin/.cache/huggingface/hub" \
WANDB_PROJECT="CoVT" \
python ${file} \
    --use_liger True \
    --lora_enable True \
    --vision_lora True \
    --use_dora False \
    --lora_namespan_exclude "['embed_tokens', 'lm_head', 'dino', 'sam', 'depth', 'SD', 'internvit', 'pidinet', 'siglip', 'metaclip']" \
    --lora_rank 16 \
    --lora_alpha 32 \
    --lora_dropout 0.01 \
    --num_lora_modules -1 \
    --model_id Qwen2.5-VL-7B-Instruct \
    --model_path "/mnt/aopolin/pretrained_models/huggingface/Qwen/Qwen2.5-VL-7B-Instruct/main" \
    --data_path "/mnt/aopolin/projects/CoVT/train_ds.json" \
    --image_folder "" \
    --remove_unused_columns False \
    --freeze_vision_tower True \
    --freeze_llm True \
    --tune_merger False \
    --bf16 True \
    --fp16 False \
    --disable_flash_attn2 False \
    --output_dir "/mnt/aopolin/projects/CoVT/outputs/stage0" \
    --max_steps 6000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --image_min_pixels 200704 \
    --image_max_pixels 802816 \
    --image_resized_width 448 \
    --image_resized_height 448 \
    --learning_rate 5e-5 \
    --projection_layer_lr 1e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.05 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --save_strategy steps \
    --save_steps 1000 \
    --save_total_limit 1 \
    --dataloader_num_workers 0 \
    --report_to wandb \
    --run_name 1130_reim_finetune_stage0 \
    --anchor_model_id "['sam', 'depth', 'dino']" \
    --training_stage full \
    --stage_0_step 6000 \
    --stage_1_step 6000 \
    --stage_2_step 6000 \
    --vqa_only_stage 6000

During training, I hit the following error:

Traceback (most recent call last):
  File "/mnt/aopolin/projects/CoVT/train/src/training/train.py", line 390, in <module>
    train()
  File "/mnt/aopolin/projects/CoVT/train/src/training/train.py", line 365, in train
    trainer.train()
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 3718, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/transformers/trainer.py", line 3783, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/peft/peft_model.py", line 881, in forward
    return self.get_base_model()(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/projects/CoVT/train/src/training/covt_qwen2_5_vl.py", line 1508, in forward
    gt_masks, num_masks = self.anchor_models.get_sam_mask_improved(image_files[i][0])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/projects/CoVT/train/src/training/covt_qwen2_5_vl.py", line 662, in get_sam_mask_improved
    masks = self.mask_generator.generate(image_np)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/miniconda3/envs/covt_train/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/projects/CoVT/train/src/anchors/segment_anything/automatic_mask_generator.py", line 153, in generate
    mask_data = self._generate_masks(image)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/aopolin/projects/CoVT/train/src/anchors/segment_anything/automatic_mask_generator.py", line 215, in _generate_masks
    data.to_numpy()
  File "/mnt/aopolin/projects/CoVT/train/src/anchors/segment_anything/utils/amg.py", line 75, in to_numpy
    v = v.detach().cpu()
                     ^^^^
TypeError: Got unsupported ScalarType BFloat16

From the stack trace, it looks like MaskData.to_numpy is not robust to tensors with dtype=torch.bfloat16: when v is a bfloat16 tensor, the v.detach().cpu() conversion path in amg.py raises TypeError: Got unsupported ScalarType BFloat16.
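
For what it's worth, this error message is the one PyTorch raises whenever a bfloat16 tensor is converted to a NumPy array, since NumPy has no native bfloat16 dtype. A minimal repro independent of SAM (just to illustrate the dtype limitation, not the repo's exact code path):

import torch

# NumPy cannot represent bfloat16, so the conversion itself raises the
# error seen in the traceback above.
x = torch.zeros(4, dtype=torch.bfloat16)
try:
    x.detach().cpu().numpy()
except TypeError as e:
    print(e)  # Got unsupported ScalarType BFloat16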

My questions are:

  1. Is it expected that MaskData.to_numpy might receive bfloat16 tensors in this pipeline?
  2. Would it be acceptable to upcast bfloat16 tensors to float32 inside to_numpy (e.g., v = v.to(torch.float32) after moving to CPU)? A rough sketch of what I mean follows the list below.
  3. Would such a cast have any noticeable impact on the quality/precision of the generated masks in your intended usage?
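
To make question 2 concrete, here is the cast I had in mind, written as a standalone helper rather than a patch (the idea would be to apply it to each tensor inside MaskData.to_numpy; I'm assuming the upstream segment_anything layout where the tensors live in a dict-like _stats attribute, which may differ in this repo):

import torch

def tensor_to_numpy_safe(v: torch.Tensor):
    # Hypothetical helper, not part of the repo: move to CPU and upcast
    # bfloat16 before the NumPy conversion.
    v = v.detach().cpu()
    if v.dtype == torch.bfloat16:
        # bfloat16 -> float32 is exact (every bfloat16 value is representable
        # in float32), so the stored mask values themselves are unchanged.
        v = v.to(torch.float32)
    return v.numpy()

# Example: this would replace the plain v.detach().cpu().numpy() conversion.
print(tensor_to_numpy_safe(torch.ones(3, dtype=torch.bfloat16)).dtype)  # float32

Regarding question 3, my understanding is that the bfloat16-to-float32 upcast is lossless, so it shouldn't affect mask quality, but I'd appreciate confirmation.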

Thanks in advance for any guidance!
