Description
Environment
- Python: 3.11.14
- PyTorch: 2.5.1
- DeepSpeed: 0.16.4
- Transformers: 4.50.1
- Flash Attention: 2.8.3
- Training Setup: 2× RTX A6000 GPUs with DeepSpeed ZeRO Stage 3
Problem Description
I am fine-tuning Qwen2.5-VL with CoVT. Due to network restrictions on my server (outbound requests over port 443 fail), I have to load DINOv2 from local weights instead of using torch.hub.load(). However, when the DINOv2 model is called during training with DeepSpeed ZeRO-3, I hit a RuntimeError: weight should have at least three dimensions.
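For reference, I believe torch.hub.load() can also read from a local clone of the dinov2 repository via source="local", but as far as I understand the checkpoint still has to be cached locally, so I went with the Transformers route below instead. A rough, untested sketch of what I mean (the repo path is a placeholder):

```python
import torch

# Untested sketch: load DINOv2 from a local clone of facebookresearch/dinov2.
# "/data20t/guanxiaofei/VLM/dinov2" is a placeholder path to the cloned repo;
# the checkpoint would still need to already be present in the local torch.hub cache.
dinovit = torch.hub.load(
    "/data20t/guanxiaofei/VLM/dinov2",  # local directory containing hubconf.py
    "dinov2_vitl14",
    source="local",
)
dinovit = dinovit.eval()
```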
Code Modifications
I modified the anchor model initialization code to load DINOv2 from a local path:
Original code:

```python
self.dinovit = torch.hub.load(DINO_MODEL_PATH, DINO_MODEL_TYPE)
self.dinovit = self.dinovit.eval()
self.hook_handle = self.dinovit.norm.register_forward_hook(norm_hook)
```

Modified code:
```python
DINO_MODEL_PATH = "/data20t/guanxiaofei/VLM/dinov2-large"
DINO_PROCESSOR_CONFIG = {
    "pretrained_model_name_or_path": "/data20t/guanxiaofei/VLM/dinov2-large",
    "crop_size": {"height": 448, "width": 448}
}

self.dinovit = AutoModel.from_pretrained(DINO_MODEL_PATH, trust_remote_code=True)
self.dinovit.eval()

self.extracted_outputs = {}

def norm_hook(module, module_input, module_output):
    self.extracted_outputs["norm_output"] = module_output

# Changed from self.dinovit.norm to self.dinovit.layernorm
self.hook_handle = self.dinovit.layernorm.register_forward_hook(norm_hook)

self.dino_processor = AutoImageProcessor.from_pretrained(**DINO_PROCESSOR_CONFIG)
```

Note: I changed self.dinovit.norm to self.dinovit.layernorm because the model structure is different when loaded via AutoModel:

- torch.hub.load() creates a model with a .norm attribute
- AutoModel.from_pretrained() creates a Dinov2Model with a .layernorm attribute
I don't think this change is the root of the problem.
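For completeness, a small defensive variant I considered, which should attach the hook under either loading path (just a sketch, not what I'm currently running):

```python
# Resolve the final norm layer whether the model came from torch.hub (.norm)
# or from AutoModel.from_pretrained (.layernorm), then attach the same hook.
norm_layer = getattr(self.dinovit, "layernorm", None) or getattr(self.dinovit, "norm", None)
if norm_layer is None:
    raise AttributeError("DINOv2 model exposes neither 'layernorm' nor 'norm'")
self.hook_handle = norm_layer.register_forward_hook(norm_hook)
```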
Error Traceback
File "/home/guanxiaofei/VLM/CoVT/train/src/training/covt_qwen2_5_vl.py", line 1453, in forward
dino_embed = self.anchor_models.get_dino_embed(image_file[0], self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/guanxiaofei/VLM/CoVT/train/src/training/covt_qwen2_5_vl.py", line 697, in get_dino_embed
dino_val = self.dinovit(dino_pixel_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
File "/home/guanxiaofei/miniconda3/envs/VLM/lib/python3.11/site-packages/transformers/models/dinov2/modeling_dinov2.py", line 170, in forward
embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/guanxiaofei/miniconda3/envs/VLM/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
return F.conv2d(...)
^^^^^^^^^
RuntimeError: weight should have at least three dimensions
The error occurs at the Conv2d layer in Dinov2PatchEmbeddings.projection.
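My current suspicion is that ZeRO-3 has partitioned the frozen DINOv2 parameters, so the Conv2d kernel is an empty placeholder by the time it is called. A rough probe I plan to run (the attribute path is taken from transformers' Dinov2Model; the ds_* attributes are DeepSpeed internals, so treat this as an assumption):

```python
# Rough diagnostic: under ZeRO-3, partitioned parameters are replaced by 0-element
# placeholders and carry DeepSpeed-specific attributes such as ds_id / ds_status.
w = self.dinovit.embeddings.patch_embeddings.projection.weight
print(w.shape)              # expected to show torch.Size([0]) if the weight was partitioned away
print(hasattr(w, "ds_id"))  # True if DeepSpeed ZeRO-3 is managing this parameter
```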
What I've Tried
1. Attempting to Bypass DeepSpeed Sharding (By Initializing DINOv2 as Non-Trainable)
I explicitly configured the DINOv2 model to be non-trainable and attempted to use a common workaround to prevent DeepSpeed from partitioning its parameters.
```python
# In the initialization section:
if "dino" in self.anchor_model_id:
    # 1. Load the model
    self.dinovit = AutoModel.from_pretrained(DINO_MODEL_PATH, trust_remote_code=True)
    self.dinovit.eval()

    # 2. Ensure no gradients are required
    self.dinovit = self.dinovit.to(device)
    for param in self.dinovit.parameters():
        param.requires_grad = False

    # 3. Attempt to mark the model as excluded from ZeRO-3 by removing its Hugging Face hook
    # (This is a common DeepSpeed workaround for non-trainable teacher models)
    if hasattr(self.dinovit, '_hf_hook'):
        delattr(self.dinovit, '_hf_hook')

    # ... rest of the initialization code ...
```

This approach produces the same error.
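Another workaround I am considering, but have not verified in CoVT, is to explicitly gather the ZeRO-3-partitioned DINOv2 parameters around the frozen forward pass with deepspeed.zero.GatheredParameters. A sketch (prepare_dino_input is a hypothetical helper standing in for the existing preprocessing inside get_dino_embed):

```python
import torch
import deepspeed

def get_dino_embed(self, image_file, device):
    # prepare_dino_input is a hypothetical placeholder for the existing preprocessing code.
    dino_pixel_value = self.prepare_dino_input(image_file, device)
    with torch.no_grad():
        # Temporarily gather the full (un-partitioned) DINOv2 weights on this rank,
        # run the frozen forward pass, then let DeepSpeed re-partition them.
        with deepspeed.zero.GatheredParameters(list(self.dinovit.parameters()), modifier_rank=None):
            dino_val = self.dinovit(dino_pixel_value)
    return self.extracted_outputs["norm_output"]
```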
Questions
- Is there an official way to load DINOv2 models locally in CoVT? The original code uses torch.hub.load(), which requires internet access.
- How should anchor models be handled with DeepSpeed ZeRO-3? Since these models are frozen and only used for feature extraction, should they be excluded from DeepSpeed's parameter sharding?
- Is ZeRO Stage 3 necessary for the entire training process, and if ZeRO Stage 2 is used instead, how much will it impact the training time? (A config sketch of what I mean is below.)
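For the last question, this is the kind of change I assume is involved (only the stage value matters here; the other fields are placeholders, not my actual config):

```python
# Hypothetical sketch of the relevant portion of a DeepSpeed config, written as a Python dict.
ds_config = {
    "zero_optimization": {
        "stage": 2,  # stage 2 shards optimizer states and gradients; stage 3 also shards parameters
    },
    "bf16": {"enabled": True},            # placeholder
    "train_micro_batch_size_per_gpu": 1,  # placeholder
}
```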
Searched for similar issues: I have searched extensively online for this specific error involving DINOv2 and DeepSpeed ZeRO-3. I found reports of the exact same problem (Hugging Face Discussion), but no definitive working solution has been provided. Any guidance or suggestions would be greatly appreciated! Thank you for maintaining this excellent project.