Language-Image Alignment with Fixed Text Encoders

Jingfeng Yang*, Ziyang Wu*, Yue Zhao, Yi Ma

UC Berkeley, The University of Hong Kong

[project page] [arxiv] [pdf] [bibtex]

Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly (always from scratch) through contrastive learning, such as CLIP and its variants. In this work, we question whether such costly joint training is necessary. We investigate whether a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning.

Figure: the pipeline of LIFT, which adopts a dual-tower architecture similar to CLIP and uses an LLM-based text encoder.
- 06/03/2025: Initial commit. We release the code and the model checkpoints!
- LIFT Encodes Compositional Information Much Better! LIFT outperforms CLIP by an average accuracy gain of 7.4% across seven compositional understanding tasks and also leads CLIP on five out of six LLaVA downstream tasks, all driven by its superior ability to encode compositional information.
- LIFT Learns from Long Captions Much Better! When trained on short captions, CLIP has a slight edge over LIFT on three zero-shot retrieval tasks and one LLaVA downstream task. However, all of these advantages transfer to LIFT when both are trained on long (usually synthesized) captions. We attribute LIFT's better performance on long captions to its robustness against the inverse effect induced by syntactically homogeneous captions.
- LIFT Is Much More Efficient in Terms of FLOPs and Memory Footprint! Given average per-batch max caption token length $n$, the FLOPs and memory footprint of CLIP scale with $\mathcal{O}(n^2)$ complexity (the text encoder's self-attention must be run on every caption at every training step), whereas LIFT achieves $\mathcal{O}(1)$ amortized complexity because each caption is embedded only once offline. On average, LIFT reduces FLOPs by 25.5% for short captions and 35.7% for long ones, while lowering memory usage by 6.8% and 12.6%, respectively.
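For intuition, here is a minimal, self-contained sketch of where this efficiency comes from: the text tower is fixed, so caption embeddings can be computed once offline and cached, and each training step only runs (and updates) the image tower. The exact architecture and objective are specified in the paper; the toy image tower, the embedding dimension, and the cosine-alignment loss below are illustrative assumptions, not the repo's implementation.

```python
# Toy sketch (NOT the repo's code): train only the image tower to align with
# cached caption embeddings produced offline by a fixed LLM-based text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 4096  # assumed LLM embedding dimension; the real value depends on llm_model

class ToyImageTower(nn.Module):
    """Stand-in for the image encoder plus a projection into the LLM embedding space."""
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU())
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))

image_tower = ToyImageTower()
optimizer = torch.optim.AdamW(image_tower.parameters(), lr=1e-4)

# One training step. No text encoder runs here: the caption embeddings are read
# from the offline cache, which is why the text-side cost amortizes to O(1).
images = torch.randn(8, 3, 32, 32)          # toy image batch
cached_text_emb = torch.randn(8, EMB_DIM)   # stand-in for precomputed caption embeddings

optimizer.zero_grad()
img_emb = F.normalize(image_tower(images), dim=-1)
txt_emb = F.normalize(cached_text_emb, dim=-1)
loss = 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()  # assumed cosine-alignment loss
loss.backward()
optimizer.step()
print(f"toy alignment loss: {loss.item():.4f}")
```

By contrast, CLIP also runs its text encoder over every caption at every step, which is where the per-caption $\mathcal{O}(n^2)$ attention cost enters.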
Please create the conda environment and install the required packages using the following commands:

```bash
conda create --name lift python=3.10.16 -y
conda activate lift
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
python -m pip install -r requirements.txt
```

See Preparing Datasets for LIFT.
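After installation, an optional snippet can confirm that the PyTorch build sees your CUDA device (a quick sanity check, not part of the repo's scripts):

```python
# Optional sanity check: print the installed PyTorch version and CUDA availability.
import torch
print(torch.__version__, torch.cuda.is_available())
```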
We support the evaluation of four vision-language model tasks:
- SugarCrepe, which specifically tests models' compositional understanding
- Zero-shot classification on ImageNet-1K validation set
- Image-to-text (I2T) and text-to-image (T2I) retrieval on COCO Val2017 split
- Image-to-text (I2T) and text-to-image (T2I) retrieval on Flickr30K test set
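All of the tasks above follow the standard CLIP-style protocol: score each image embedding against a set of text embeddings by cosine similarity and rank. Below is a minimal sketch of that scoring step with placeholder features; it is not the repo's evaluation code (evaluate.sh handles the real pipeline), and the dimensions are arbitrary.

```python
# Minimal sketch of CLIP-style zero-shot classification / retrieval scoring.
# Features are random placeholders; in practice they come from the trained image
# tower and the (fixed, LLM-based) text encoder.
import torch
import torch.nn.functional as F

def similarity_matrix(image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between every image and every text, shape (num_images, num_texts)."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    return img @ txt.t()

image_feats = torch.randn(2, 512)   # e.g. 2 images
text_feats = torch.randn(5, 512)    # e.g. 5 class prompts or candidate captions

sims = similarity_matrix(image_feats, text_feats)
print(sims.argmax(dim=-1))   # zero-shot classification / I2T retrieval: best text per image
print(sims.argmax(dim=0))    # T2I retrieval: best image per text
```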
Please first read the instructions to organize the evaluation dataset and specify `CONFIG` and your evaluation task in `evaluate.sh`. Then, in the evaluation config, please fill in:

- The path to the evaluation dataset `data_path`.
- The path to the checkpoint `pretrained`.
- If you are evaluating LIFT, the path to the LLM-based text encoder `llm_model`. Our checkpoints use NV-Embed-V2 as their `llm_model`.
Then run the following command:
```bash
bash scripts/evaluate.sh
```

Below we highlight some results. We also provide the model checkpoints.
SugarCrepe compositional understanding results (accuracy, %):

| Models | Backbone | Dataset | Samples Seen | Add Obj | Add Att | Replace Obj | Replace Att | Replace Rel | Swap Obj | Swap Att |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-B/16 | DataComp | 1.28B | 82.3 | 73.7 | 91.7 | 79.4 | 61.2 | 59.6 | 56.9 |
| LIFT | ViT-B/16 | DataComp | 1.28B | 89.0 | 86.1 | 93.2 | 86.0 | 70.6 | 64.1 | 63.4 |
| CLIP | ViT-B/16 | Recap | 512M | 77.0 | 73.7 | 88.9 | 80.8 | 63.4 | 62.0 | 76.3 |
| LIFT | ViT-B/16 | Recap | 512M | 88.8 | 92.2 | 92.3 | 88.2 | 76.8 | 66.5 | 72.8 |
Zero-shot ImageNet-1K classification and COCO/Flickr retrieval results (%):

| Models | Backbone | Dataset | Samples Seen | ImageNet | COCO I2T | COCO T2I | Flickr I2T | Flickr T2I |
|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-B/16 | DataComp | 1.28B | 58.4 | 31.0 | 27.2 | 62.9 | 59.6 |
| LIFT | ViT-B/16 | DataComp | 1.28B | 58.3 | 29.1 | 28.1 | 58.8 | 63.7 |
| CLIP | ViT-B/16 | Recap | 512M | 34.6 | 25.7 | 26.7 | 56.4 | 57.9 |
| LIFT | ViT-B/16 | Recap | 512M | 43.6 | 34.6 | 36.0 | 69.1 | 72.9 |
| Models | Checkpoints | Backbone | Dataset | Samples Seen |
|---|---|---|---|---|
| CLIP | Download | ViT-B/16 | DataComp | 1.28B |
| LIFT | Download | ViT-B/16 | DataComp | 1.28B |
| CLIP | Download | ViT-B/16 | Recap | 512M |
| LIFT | Download | ViT-B/16 | Recap | 512M |
We provide the script to generate caption embeddings offline using an LLM-based text encoder. First, please read the instructions to organize the raw caption folder and specify `CONFIG` in `embed.sh`. Then, in the embedding generation config, please fill in:

- The path to the raw caption folder `raw_text_data`.
- The path to the destination folder for caption embeddings `embedded_labels`.
- The path to the LLM-based text encoder `llm_model`.
- The total number of raw captions to be embedded `raw_text_num_samples`.
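Conceptually, this step runs the fixed text encoder over every raw caption once and writes the resulting vectors to `embedded_labels`, so training never has to touch the text encoder again. The sketch below illustrates that flow with a small generic Hugging Face encoder and plain mean pooling; it is not the repo's embed script, the configured `llm_model` (e.g., NV-Embed-V2) has its own pooling and usage conventions, and the file layout shown is a made-up example.

```python
# Generic illustration of offline caption embedding (NOT the repo's embed script).
# A small public encoder and simple mean pooling stand in for the configured llm_model.
import os

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

ENCODER = "sentence-transformers/all-MiniLM-L6-v2"   # placeholder text encoder
tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = AutoModel.from_pretrained(ENCODER).eval()

captions = ["a dog chasing a red ball on the beach"]  # stand-in for raw_text_data contents

with torch.no_grad():
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                  # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # (B, T, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)         # mean-pooled caption embeddings

os.makedirs("embedded_labels", exist_ok=True)                  # hypothetical destination folder
np.save("embedded_labels/captions_shard_000.npy", emb.numpy())
```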
Then run the following command:
```bash
bash scripts/embed.sh
```

We provide the training script to replicate the LIFT and CLIP results reported in the paper. First, please read the instructions to organize the image, raw caption, and caption embedding folders and specify `CONFIG` in `train.sh`. Then, in the training config, please fill in:
- The path to the image folder `train_data`.
- If you are training CLIP, the path to the raw caption folder `raw_text_data`.
- If you are training LIFT, the path to the caption embedding folder `embedded_text_data`.
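During LIFT training, each image is paired with the caption embedding produced in the previous step rather than with a raw caption. The real pipeline is built on WebDataset and its own on-disk layout; the standalone sketch below assumes a hypothetical folder of JPEGs plus one `.npy` file per caption, purely to show the pairing.

```python
# Illustrative image/embedding pairing (NOT the repo's WebDataset pipeline).
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageEmbeddingPairs(Dataset):
    """Pairs each image with its precomputed caption embedding (hypothetical layout)."""
    def __init__(self, image_dir: str, embedding_dir: str):
        self.image_paths = sorted(Path(image_dir).glob("*.jpg"))
        self.embedding_dir = Path(embedding_dir)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self) -> int:
        return len(self.image_paths)

    def __getitem__(self, idx: int):
        img_path = self.image_paths[idx]
        image = self.transform(Image.open(img_path).convert("RGB"))
        # Assumes matching stems, e.g. 000123.jpg <-> 000123.npy
        emb = np.load(self.embedding_dir / f"{img_path.stem}.npy")
        return image, torch.from_numpy(emb).float()
```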
Then run the following command:
```bash
bash scripts/train.sh
```

If you have any general questions, feel free to email Jingfeng Yang. If you have code- or implementation-related questions, please email us or open an issue in this codebase (we recommend opening an issue, since your questions may help others).
If you find our work inspiring, please consider giving a star ⭐ and a citation!
```bibtex
@misc{yang2025languageimagealignmentfixedtext,
  title={Language-Image Alignment with Fixed Text Encoders},
  author={Jingfeng Yang and Ziyang Wu and Yue Zhao and Yi Ma},
  year={2025},
  eprint={2506.04209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.04209},
}
```
This codebase is built on top of OpenCLIP, Webdataset, SugarCrepe, and SuperClass. The source code of this README page is borrowed from UnSAM. We appreciate the authors for open-sourcing their code.