Language-Image Alignment with Fixed Text Encoders

Jingfeng Yang*, Ziyang Wu*, Yue Zhao, Yi Ma

UC Berkeley, The University of Hong Kong

[project page] [arxiv] [pdf] [bibtex]

Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly (always from scratch) through contrastive learning, such as CLIP and its variants. In this work, we question whether such costly joint training is necessary. We investigate whether a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning.

Figure: the pipeline of LIFT, which adopts a dual-tower architecture similar to CLIP and uses an LLM-based text encoder.
- 06/03/2025: Initial commit. We release the code and the model checkpoints!
- LIFT Encodes Compositional Information Much Better! LIFT outperforms CLIP by an average accuracy gain of 7.4% across seven compositional understanding tasks and also leads CLIP on five out of six LLaVA downstream tasks, all driven by its superior ability to encode compositional information.
- LIFT Learns from Long Captions Much Better! When trained on short captions, CLIP has a slight edge over LIFT on three zero-shot retrieval tasks and one LLaVA downstream task. However, all of these advantages transfer to LIFT when both are trained on long (usually synthesized) captions. We attribute LIFT's better performance on long captions to its robustness against the inverse effect induced by syntactically homogeneous captions.
- LIFT Is Much More Efficient in Terms of FLOPs and Memory Footprint! Given average per-batch max caption token length $n$, the FLOPs and memory footprint of CLIP scale with $\mathcal{O}(n^2)$ complexity (the text encoder's self-attention must be run on every caption at every training step), whereas LIFT achieves $\mathcal{O}(1)$ amortized complexity because each caption is embedded only once offline. On average, LIFT reduces FLOPs by 25.5% for short captions and 35.7% for long ones, while lowering memory usage by 6.8% and 12.6%, respectively.
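For intuition, here is a minimal, self-contained sketch of where this efficiency comes from: the text tower is fixed, so caption embeddings can be computed once offline and cached, and each training step only runs (and updates) the image tower. The exact architecture and objective are specified in the paper; the toy image tower, the embedding dimension, and the cosine-alignment loss below are illustrative assumptions, not the repo's implementation.

```python
# Toy sketch (NOT the repo's code): train only the image tower to align with
# cached caption embeddings produced offline by a fixed LLM-based text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 4096  # assumed LLM embedding dimension; the real value depends on llm_model

class ToyImageTower(nn.Module):
    """Stand-in for the image encoder plus a projection into the LLM embedding space."""
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.GELU())
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))

image_tower = ToyImageTower()
optimizer = torch.optim.AdamW(image_tower.parameters(), lr=1e-4)

# One training step. No text encoder runs here: the caption embeddings are read
# from the offline cache, which is why the text-side cost amortizes to O(1).
images = torch.randn(8, 3, 32, 32)          # toy image batch
cached_text_emb = torch.randn(8, EMB_DIM)   # stand-in for precomputed caption embeddings

optimizer.zero_grad()
img_emb = F.normalize(image_tower(images), dim=-1)
txt_emb = F.normalize(cached_text_emb, dim=-1)
loss = 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()  # assumed cosine-alignment loss
loss.backward()
optimizer.step()
print(f"toy alignment loss: {loss.item():.4f}")
```

By contrast, CLIP also runs its text encoder over every caption at every step, which is where the per-caption $\mathcal{O}(n^2)$ attention cost enters.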
Please create the conda environment and install the required packages using the following commands:

```bash
conda create --name lift python=3.10.16 -y
conda activate lift
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
python -m pip install -r requirements.txt
```

See Preparing Datasets for LIFT.
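After installation, an optional snippet can confirm that the PyTorch build sees your CUDA device (a quick sanity check, not part of the repo's scripts):

```python
# Optional sanity check: print the installed PyTorch version and CUDA availability.
import torch
print(torch.__version__, torch.cuda.is_available())
```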
We support the evaluation of four vision-language model tasks:
- SugarCrepe, which specifically tests models' compositional understanding
- Zero-shot classification on ImageNet-1K validation set
- Image-to-text (I2T) and text-to-image (T2I) retrieval on COCO Val2017 split
- Image-to-text (I2T) and text-to-image (T2I) retrieval on Flickr30K test set
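All of the tasks above follow the standard CLIP-style protocol: score each image embedding against a set of text embeddings by cosine similarity and rank. Below is a minimal sketch of that scoring step with placeholder features; it is not the repo's evaluation code (evaluate.sh handles the real pipeline), and the dimensions are arbitrary.

```python
# Minimal sketch of CLIP-style zero-shot classification / retrieval scoring.
# Features are random placeholders; in practice they come from the trained image
# tower and the (fixed, LLM-based) text encoder.
import torch
import torch.nn.functional as F

def similarity_matrix(image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between every image and every text, shape (num_images, num_texts)."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    return img @ txt.t()

image_feats = torch.randn(2, 512)   # e.g. 2 images
text_feats = torch.randn(5, 512)    # e.g. 5 class prompts or candidate captions

sims = similarity_matrix(image_feats, text_feats)
print(sims.argmax(dim=-1))   # zero-shot classification / I2T retrieval: best text per image
print(sims.argmax(dim=0))    # T2I retrieval: best image per text
```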
Please first read the instructions to organize the evaluation dataset and specify `CONFIG` and your evaluation task in `evaluate.sh`. Then, in the evaluation config, please fill in:

- The path to the evaluation dataset `data_path`.
- The path to the checkpoint `pretrained`.
- If you are evaluating LIFT, the path to the LLM-based text encoder `llm_model`. Our checkpoints use NV-Embed-V2 as their `llm_model`.
Then run the following command:
```bash
bash scripts/evaluate.sh
```

Below we highlight some results. We also provide the model checkpoints.
SugarCrepe compositional understanding results (accuracy, %):

| Models | Backbone | Dataset | Samples Seen | Add Obj | Add Att | Replace Obj | Replace Att | Replace Rel | Swap Obj | Swap Att |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-B/16 | DataComp | 1.28B | 82.3 | 73.7 | 91.7 | 79.4 | 61.2 | 59.6 | 56.9 |
| LIFT | ViT-B/16 | DataComp | 1.28B | 89.0 | 86.1 | 93.2 | 86.0 | 70.6 | 64.1 | 63.4 |
| CLIP | ViT-B/16 | Recap | 512M | 77.0 | 73.7 | 88.9 | 80.8 | 63.4 | 62.0 | 76.3 |
| LIFT | ViT-B/16 | Recap | 512M | 88.8 | 92.2 | 92.3 | 88.2 | 76.8 | 66.5 | 72.8 |
Zero-shot ImageNet-1K classification and COCO/Flickr retrieval results (%):

| Models | Backbone | Dataset | Samples Seen | ImageNet | COCO I2T | COCO T2I | Flickr I2T | Flickr T2I |
|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-B/16 | DataComp | 1.28B | 58.4 | 31.0 | 27.2 | 62.9 | 59.6 |
| LIFT | ViT-B/16 | DataComp | 1.28B | 58.3 | 29.1 | 28.1 | 58.8 | 63.7 |
| CLIP | ViT-B/16 | Recap | 512M | 34.6 | 25.7 | 26.7 | 56.4 | 57.9 |
| LIFT | ViT-B/16 | Recap | 512M | 43.6 | 34.6 | 36.0 | 69.1 | 72.9 |
| Models | Checkpoints | Backbone | Dataset | Samples Seen |
|---|---|---|---|---|
| CLIP | Download | ViT-B/16 | DataComp | 1.28B |
| LIFT | Download | ViT-B/16 | DataComp | 1.28B |
| CLIP | Download | ViT-B/16 | Recap | 512M |
| LIFT | Download | ViT-B/16 | Recap | 512M |
We provide the script to generate caption embeddings offline using an LLM-based text encoder. First, please read the instructions to organize the raw caption folder and specify `CONFIG` in `embed.sh`. Then, in the embedding generation config, please fill in:

- The path to the raw caption folder `raw_text_data`.
- The path to the destination folder for caption embeddings `embedded_labels`.
- The path to the LLM-based text encoder `llm_model`.
- The total number of raw captions to be embedded `raw_text_num_samples`.
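Conceptually, this step runs the fixed text encoder over every raw caption once and writes the resulting vectors to `embedded_labels`, so training never has to touch the text encoder again. The sketch below illustrates that flow with a small generic Hugging Face encoder and plain mean pooling; it is not the repo's embed script, the configured `llm_model` (e.g., NV-Embed-V2) has its own pooling and usage conventions, and the file layout shown is a made-up example.

```python
# Generic illustration of offline caption embedding (NOT the repo's embed script).
# A small public encoder and simple mean pooling stand in for the configured llm_model.
import os

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

ENCODER = "sentence-transformers/all-MiniLM-L6-v2"   # placeholder text encoder
tokenizer = AutoTokenizer.from_pretrained(ENCODER)
model = AutoModel.from_pretrained(ENCODER).eval()

captions = ["a dog chasing a red ball on the beach"]  # stand-in for raw_text_data contents

with torch.no_grad():
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                  # (B, T, D)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # (B, T, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)         # mean-pooled caption embeddings

os.makedirs("embedded_labels", exist_ok=True)                  # hypothetical destination folder
np.save("embedded_labels/captions_shard_000.npy", emb.numpy())
```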
Then run the following command:
```bash
bash scripts/embed.sh
```

We provide the training script to replicate the LIFT and CLIP results reported in the paper. First, please read the instructions to organize the image, raw caption, and caption embedding folders and specify `CONFIG` in `train.sh`. Then, in the training config, please fill in:
- The path to the image folder `train_data`.
- If you are training CLIP, the path to the raw caption folder `raw_text_data`.
- If you are training LIFT, the path to the caption embedding folder `embedded_text_data`.
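During LIFT training, each image is paired with the caption embedding produced in the previous step rather than with a raw caption. The real pipeline is built on WebDataset and its own on-disk layout; the standalone sketch below assumes a hypothetical folder of JPEGs plus one `.npy` file per caption, purely to show the pairing.

```python
# Illustrative image/embedding pairing (NOT the repo's WebDataset pipeline).
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageEmbeddingPairs(Dataset):
    """Pairs each image with its precomputed caption embedding (hypothetical layout)."""
    def __init__(self, image_dir: str, embedding_dir: str):
        self.image_paths = sorted(Path(image_dir).glob("*.jpg"))
        self.embedding_dir = Path(embedding_dir)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self) -> int:
        return len(self.image_paths)

    def __getitem__(self, idx: int):
        img_path = self.image_paths[idx]
        image = self.transform(Image.open(img_path).convert("RGB"))
        # Assumes matching stems, e.g. 000123.jpg <-> 000123.npy
        emb = np.load(self.embedding_dir / f"{img_path.stem}.npy")
        return image, torch.from_numpy(emb).float()
```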
Then run the following command:
```bash
bash scripts/train.sh
```

If you have any general questions, feel free to email Jingfeng Yang. If you have code- or implementation-related questions, please email us or open an issue in this codebase (we recommend opening an issue, since your questions may help others).
If you find our work inspiring, please consider giving a star ⭐ and a citation!
```bibtex
@misc{yang2025languageimagealignmentfixedtext,
  title={Language-Image Alignment with Fixed Text Encoders},
  author={Jingfeng Yang and Ziyang Wu and Yue Zhao and Yi Ma},
  year={2025},
  eprint={2506.04209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.04209},
}
```
This codebase is built on top of OpenCLIP, Webdataset, SugarCrepe, and SuperClass. The source code of this README page is borrowed from UnSAM. We appreciate the authors for open-sourcing their code.