Description
Environment
- Python: 3.11.14
- PyTorch: 2.5.1
- DeepSpeed: 0.16.4
- Transformers: 4.50.1
- Flash Attention: 2.8.3
- Training Setup: 2× RTX A6000 GPUs with DeepSpeed ZeRO Stage 3
Problem Description
I am fine-tuning Qwen2.5-VL with CoVT. Due to network restrictions on my server (outbound requests over port 443 fail), I have to load DINOv2 from local weights instead of using torch.hub.load(). However, when the DINOv2 model is called during training with DeepSpeed ZeRO-3, I hit a RuntimeError: weight should have at least three dimensions.
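For reference, I believe torch.hub.load() can also read from a local clone of the dinov2 repository via source="local", but as far as I understand the checkpoint still has to be cached locally, so I went with the Transformers route below instead. A rough, untested sketch of what I mean (the repo path is a placeholder):

```python
import torch

# Untested sketch: load DINOv2 from a local clone of facebookresearch/dinov2.
# "/data20t/guanxiaofei/VLM/dinov2" is a placeholder path to the cloned repo;
# the checkpoint would still need to already be present in the local torch.hub cache.
dinovit = torch.hub.load(
    "/data20t/guanxiaofei/VLM/dinov2",  # local directory containing hubconf.py
    "dinov2_vitl14",
    source="local",
)
dinovit = dinovit.eval()
```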
Code Modifications
I modified the anchor model initialization code to load DINOv2 from a local path:
Original code:

```python
self.dinovit = torch.hub.load(DINO_MODEL_PATH, DINO_MODEL_TYPE)
self.dinovit = self.dinovit.eval()
self.hook_handle = self.dinovit.norm.register_forward_hook(norm_hook)
```

Modified code:
```python
DINO_MODEL_PATH = "/data20t/guanxiaofei/VLM/dinov2-large"
DINO_PROCESSOR_CONFIG = {
    "pretrained_model_name_or_path": "/data20t/guanxiaofei/VLM/dinov2-large",
    "crop_size": {"height": 448, "width": 448}
}

self.dinovit = AutoModel.from_pretrained(DINO_MODEL_PATH, trust_remote_code=True)
self.dinovit.eval()

self.extracted_outputs = {}

def norm_hook(module, module_input, module_output):
    self.extracted_outputs["norm_output"] = module_output

# Changed from self.dinovit.norm to self.dinovit.layernorm
self.hook_handle = self.dinovit.layernorm.register_forward_hook(norm_hook)

self.dino_processor = AutoImageProcessor.from_pretrained(**DINO_PROCESSOR_CONFIG)
```

Note: I changed self.dinovit.norm to self.dinovit.layernorm because the model structure is different when loaded via AutoModel:

- torch.hub.load() creates a model with a .norm attribute
- AutoModel.from_pretrained() creates a Dinov2Model with a .layernorm attribute
I don't think this change is the root of the problem.
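For completeness, a small defensive variant I considered, which should attach the hook under either loading path (just a sketch, not what I'm currently running):

```python
# Resolve the final norm layer whether the model came from torch.hub (.norm)
# or from AutoModel.from_pretrained (.layernorm), then attach the same hook.
norm_layer = getattr(self.dinovit, "layernorm", None) or getattr(self.dinovit, "norm", None)
if norm_layer is None:
    raise AttributeError("DINOv2 model exposes neither 'layernorm' nor 'norm'")
self.hook_handle = norm_layer.register_forward_hook(norm_hook)
```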
Error Traceback
File "/home/guanxiaofei/VLM/CoVT/train/src/training/covt_qwen2_5_vl.py", line 1453, in forward
dino_embed = self.anchor_models.get_dino_embed(image_file[0], self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/guanxiaofei/VLM/CoVT/train/src/training/covt_qwen2_5_vl.py", line 697, in get_dino_embed
dino_val = self.dinovit(dino_pixel_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
File "/home/guanxiaofei/miniconda3/envs/VLM/lib/python3.11/site-packages/transformers/models/dinov2/modeling_dinov2.py", line 170, in forward
embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/guanxiaofei/miniconda3/envs/VLM/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
return F.conv2d(...)
^^^^^^^^^
RuntimeError: weight should have at least three dimensions
The error occurs at the Conv2d layer in Dinov2PatchEmbeddings.projection.
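My current suspicion is that ZeRO-3 has partitioned the frozen DINOv2 parameters, so the Conv2d kernel is an empty placeholder by the time it is called. A rough probe I plan to run (the attribute path is taken from transformers' Dinov2Model; the ds_* attributes are DeepSpeed internals, so treat this as an assumption):

```python
# Rough diagnostic: under ZeRO-3, partitioned parameters are replaced by 0-element
# placeholders and carry DeepSpeed-specific attributes such as ds_id / ds_status.
w = self.dinovit.embeddings.patch_embeddings.projection.weight
print(w.shape)              # expected to show torch.Size([0]) if the weight was partitioned away
print(hasattr(w, "ds_id"))  # True if DeepSpeed ZeRO-3 is managing this parameter
```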
What I've Tried
1. Attempting to Bypass DeepSpeed Sharding (By Initializing DINOv2 as Non-Trainable)
I explicitly configured the DINOv2 model to be non-trainable and attempted to use a common workaround to prevent DeepSpeed from partitioning its parameters.
```python
# In the initialization section:
if "dino" in self.anchor_model_id:
    # 1. Load the model
    self.dinovit = AutoModel.from_pretrained(DINO_MODEL_PATH, trust_remote_code=True)
    self.dinovit.eval()

    # 2. Ensure no gradients are required
    self.dinovit = self.dinovit.to(device)
    for param in self.dinovit.parameters():
        param.requires_grad = False

    # 3. Attempt to mark the model as excluded from ZeRO-3 by removing its Hugging Face hook
    # (This is a common DeepSpeed workaround for non-trainable teacher models)
    if hasattr(self.dinovit, '_hf_hook'):
        delattr(self.dinovit, '_hf_hook')

    # ... rest of the initialization code ...
```

This approach produces the same error.
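Another workaround I am considering, but have not verified in CoVT, is to explicitly gather the ZeRO-3-partitioned DINOv2 parameters around the frozen forward pass with deepspeed.zero.GatheredParameters. A sketch (prepare_dino_input is a hypothetical helper standing in for the existing preprocessing inside get_dino_embed):

```python
import torch
import deepspeed

def get_dino_embed(self, image_file, device):
    # prepare_dino_input is a hypothetical placeholder for the existing preprocessing code.
    dino_pixel_value = self.prepare_dino_input(image_file, device)
    with torch.no_grad():
        # Temporarily gather the full (un-partitioned) DINOv2 weights on this rank,
        # run the frozen forward pass, then let DeepSpeed re-partition them.
        with deepspeed.zero.GatheredParameters(list(self.dinovit.parameters()), modifier_rank=None):
            dino_val = self.dinovit(dino_pixel_value)
    return self.extracted_outputs["norm_output"]
```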
Questions
- Is there an official way to load DINOv2 models locally in CoVT? The original code uses torch.hub.load(), which requires internet access.
- How should anchor models be handled with DeepSpeed ZeRO-3? Since these models are frozen and only used for feature extraction, should they be excluded from DeepSpeed's parameter sharding?
- Is ZeRO Stage 3 necessary for the entire training process, and if ZeRO Stage 2 is used instead, how much will it impact the training time? (A config sketch of what I mean is below.)
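For the last question, this is the kind of change I assume is involved (only the stage value matters here; the other fields are placeholders, not my actual config):

```python
# Hypothetical sketch of the relevant portion of a DeepSpeed config, written as a Python dict.
ds_config = {
    "zero_optimization": {
        "stage": 2,  # stage 2 shards optimizer states and gradients; stage 3 also shards parameters
    },
    "bf16": {"enabled": True},            # placeholder
    "train_micro_batch_size_per_gpu": 1,  # placeholder
}
```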
Searched for similar issues: I have searched extensively online for this specific error involving DINOv2 and DeepSpeed ZeRO-3. I found reports of the exact same problem (Hugging Face Discussion), but no definitive working solution has been provided. Any guidance or suggestions would be greatly appreciated! Thank you for maintaining this excellent project.