Lingxiao Yang, Ru-Yuan Zhang, Qi Chen, Xiaohua Xie
Sun Yat-sen University, Shanghai Jiao Tong University
IJCV 2025
Abstract: Vision-Language Models, pre-trained on large-scale image-text pairs, serve as strong foundation models for transfer learning across a variety of downstream tasks. For few-shot generalization tasks, i.e., when the model is trained on few-shot samples and then tested on unseen categories or datasets, a balance must be struck between generalization and discrimination when fine-tuning these models. Existing approaches typically rely on one or two strategies during training to learn task-specific knowledge while preserving as much task-agnostic representation as possible. However, these methods overlook the importance of other useful inductive biases, thereby limiting their generalization capabilities. In this work, we propose Learning with Enriched Inductive Biases (LwEIB), a method that explores multiple inductive biases at the text, model, and optimization levels. Specifically, we first enrich the handcrafted text prompt with descriptions generated by a Large Language Model for each category. To better capture structural cues in both language and vision, we design two new adapters for the text and image encoders, respectively. Additionally, we propose a slow-fast optimization method to explore different degrees of adaptation more efficiently, learning task-specific representations while maintaining task-agnostic ones. We empirically validate the effectiveness of LwEIB on three widely used benchmarks. Remarkably, LwEIB outperforms numerous state-of-the-art methods across all evaluation metrics, demonstrating its efficacy and versatility.
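As a rough illustration of the text-level inductive bias, the sketch below pairs a handcrafted CLIP template with a category description and encodes both with a frozen CLIP text encoder. The class name, description string, prompt format, and ViT-B/16 backbone are placeholders chosen for illustration; in LwEIB the descriptions come from a large language model, and the exact prompt construction follows the paper.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical LLM-generated description for one category (illustrative only).
classname = "goldfish"
description = "a small orange freshwater fish with shiny scales"

handcrafted = f"a photo of a {classname}."
enriched = f"a photo of a {classname}, {description}."  # enriched prompt (illustrative format)

with torch.no_grad():
    tokens = clip.tokenize([handcrafted, enriched]).to(device)
    text_features = model.encode_text(tokens)  # shape (2, 512) for ViT-B/16
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```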
- We propose Learning with Enriched Inductive Biases (LwEIB), a novel parameter-efficient fine-tuning framework that can be trained end-to-end to leverage multiple inductive biases.
- We propose inductive biases at three levels, i.e., the text level, the model level, and the optimization level, to increase the generalizability of VLMs in few-shot settings (a minimal sketch of the optimization-level idea follows this list).
- We evaluate LwEIB on three widely used and challenging few-shot generalization tasks. Experimental results show that LwEIB achieves leading performance among all compared methods on all evaluated benchmarks.
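As a minimal sketch of the optimization-level bias mentioned above, the snippet below keeps a fast adapter updated by the optimizer and a slow copy that tracks it via an exponential moving average. This is only one plausible slow-fast scheme, written as an assumption for illustration; the adapter shape, learning rate, and momentum are placeholder values, and the paper defines the actual LwEIB update.

```python
import copy

import torch
import torch.nn as nn

# Placeholder adapter standing in for LwEIB's text/image adapters (illustrative).
fast_adapter = nn.Linear(512, 512)
slow_adapter = copy.deepcopy(fast_adapter)  # slow branch: not updated by gradients
for p in slow_adapter.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(fast_adapter.parameters(), lr=1e-3)

@torch.no_grad()
def slow_update(slow, fast, momentum=0.999):
    """Nudge the slow weights toward the fast weights after each optimizer step."""
    for ps, pf in zip(slow.parameters(), fast.parameters()):
        ps.mul_(momentum).add_(pf, alpha=1.0 - momentum)

# In a training loop (loss computation omitted):
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   slow_update(slow_adapter, fast_adapter)
```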
Results reported below are average accuracies for the three evaluated test settings. Please refer to our paper for more details.
| Method | Base2New (HM) | Cross-Datasets | Domain Generalization | Avg |
|---|---|---|---|---|
| CLIP | 71.70 | 65.15 | 57.18 | 64.67 |
| CoOp | 71.66 | 63.88 | 59.28 | 64.94 |
| CoCoOp | 75.83 | 65.74 | 59.91 | 67.16 |
| MaPLe | 78.55 | 66.30 | 60.27 | 68.37 |
| PromptSRC | 79.97 | 65.81 | 60.65 | 68.81 |
| HPT | 80.23 | 67.74 | 60.71 | 69.56 |
| LwEIB (Paper) | 81.21 | 68.61 | 60.84 | 70.22 |
| LwEIB (This repo) | 81.18 | 68.79 | 60.83 | 70.27 |
Some hyper-parameters in the configs differ slightly from those in our paper, which yields better average performance over the three benchmarks (see above).
We provide all trained models and logs produced with this repo, which reproduce the results above (70.27), on BaiduYunPan (passcode: 6hge) and Google Drive.
This code is built on top of the awesome CoOp project, so you need to follow its setup steps:
First, you need to install the dassl environment - Dassl.pytorch. Simply follow the instructions described here to install dassl as well as PyTorch. After that, run `pip install -r requirements.txt` under VLM-LwEIB/ to install a few more packages required by CLIP (do this with the dassl environment activated).
Second, you need to follow DATASETS.md to install the datasets.
# arg1 = GPU id to use
# arg2 = seed number
# use the following command for the base2new experiment
bash run_base2new.sh 0 1
# use the following command for the cross-datasets and domain-generalization experiments
bash run_xd.sh 0 1

If you find our work or this repo helpful for your research, please kindly cite the following paper:
@article{LwEIB-Yang2025,
title={Learning with Enriched Inductive Biases for Vision-Language Models},
author={Yang, Lingxiao and Zhang, Ru-Yuan and Chen, Qi and Xie, Xiaohua},
journal={International Journal of Computer Vision},
year={2025},
publisher={Springer}
}

Our code is based on the Co-CoOp, CoOp and MMA repositories. We thank the authors for releasing their code.