
🚀 LIFT: Language-Image Alignment with Fixed Text Encoders

Currently, the most dominant approach to establishing language-image alignment is to pre-train (always from scratch) text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. We investigate if a pre-trained fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning.

[Figure: LIFT pipeline]

The pipeline of LIFT, which adopts a dual-tower architecture similar to CLIP. LIFT uses an LLM-based text encoder $f^{\text{text}}$ to pre-compute the embedding $z^T$ for each text sample $T$ offline. During training, we solely update the image encoder $f_{\theta}^{\text{img}}$ and the projection head $f_{\phi}^{\text{head}}$ to align image embeddings with the pre-computed text embeddings by optimizing an alignment objective.
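
To make this recipe concrete, below is a minimal PyTorch sketch of the setup. It is an illustration only, not the repository's implementation: the module names, embedding dimensions, linear projection head, and the cosine-similarity alignment loss are assumptions chosen for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIFTSketch(nn.Module):
    """Illustrative LIFT-style model: a trainable image tower aligned to frozen text embeddings."""

    def __init__(self, image_encoder: nn.Module, image_dim: int = 768, text_dim: int = 4096):
        super().__init__()
        self.image_encoder = image_encoder          # f_theta^img (trainable)
        self.head = nn.Linear(image_dim, text_dim)  # f_phi^head, maps into the LLM embedding space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z_img = self.head(self.image_encoder(images))
        return F.normalize(z_img, dim=-1)


def alignment_loss(z_img: torch.Tensor, z_text: torch.Tensor) -> torch.Tensor:
    """A simple cosine-alignment objective; z_text is pre-computed offline and never receives gradients."""
    z_text = F.normalize(z_text.detach(), dim=-1)
    return (1.0 - (z_img * z_text).sum(dim=-1)).mean()
```

Because the text embeddings are fixed, the text tower never enters the training graph; only $f_{\theta}^{\text{img}}$ and $f_{\phi}^{\text{head}}$ receive gradient updates.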

Language-Image Alignment with Fixed Text Encoders
Jingfeng Yang*, Ziyang Wu*, Yue Zhao, Yi Ma
UC Berkeley, The University of Hong Kong

[project page] [arxiv] [pdf] [bibtex]

Updates

  • 🚀 06/03/2025: Initial commit. We release the code and the model checkpoints!

Features

[Figure: teaser]

  • LIFT Encodes Compositional Information Much Better! LIFT outperforms CLIP by an average accuracy gain of 7.4% across seven compositional understanding tasks and also leads CLIP on five out of six LLaVA downstream tasks, all driven by its superior ability to encode compositional information.
  • LIFT Learns from Long Captions Much Better! When trained on short captions, CLIP has a slight edge over LIFT on three zero-shot retrieval tasks and one LLaVA downstream task. However, all of these advantages transfer to LIFT when both are trained on long (usually synthesized) captions. We attribute LIFT's better performance on long captions to its robustness against the inverse effect induced by syntactically homogeneous captions.
  • LIFT Is Much More Efficient in Terms of FLOPs and Memory Footprint! Given an average per-batch max caption token length $n$, the FLOPs and memory footprint of CLIP scale with $\mathcal{O}(n^2)$ complexity, whereas LIFT achieves $\mathcal{O}(1)$ amortized complexity, since each caption embedding is computed once offline and only loaded during training. On average, LIFT reduces FLOPs by 25.5% for short captions and 35.7% for long ones, while lowering memory usage by 6.8% and 12.6%, respectively.

Installation

Please install the conda environment and the required packages using the following commands:

conda create --name lift python=3.10.16 -y
conda activate lift
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
python -m pip install -r requirements.txt

Dataset Preparation

See Preparing Datasets for LIFT.

Evaluation

We support evaluation on four vision-language tasks:

  • SugarCrepe, which specifically tests models' compositional understanding
  • Zero-shot classification on ImageNet-1K validation set
  • Image-to-text (I2T) and text-to-image (T2I) retrieval on COCO Val2017 split
  • Image-to-text (I2T) and text-to-image (T2I) retrieval on Flickr30K test set

Please first read the instructions to organize the evaluation dataset and specify CONFIG and your evaluation task in evaluate.sh. Then, in the evaluation config, please fill in:

  1. The path to the evaluation dataset data_path.
  2. The path to the checkpoint pretrained.
  3. If you are evaluating LIFT, the path to the LLM-based text encoder llm_model. Our checkpoints use NV-Embed-V2 as their llm_model.

Then run the following command:

bash scripts/evaluate.sh
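
For intuition, each of these evaluations ultimately scores images against text embeddings (produced by the fixed LLM encoder in LIFT's case) via cosine similarity. The snippet below is a hypothetical illustration of that scoring step, with assumed names and shapes, not the repository's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Return the index of the best-matching text for each image.

    image_embeds: (N, D) outputs of the trained image tower + projection head
    text_embeds:  (C, D) class-prompt or caption embeddings from the text encoder
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    scores = image_embeds @ text_embeds.T  # (N, C) cosine similarities
    return scores.argmax(dim=-1)
```

The retrieval tasks rank the same similarity matrix along its rows (I2T) or columns (T2I) rather than taking an argmax.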

Below we highlight some results. We also provide the model checkpoints.

SugarCrepe

| Models | Backbone | Dataset | Sample Seen | Add Obj | Add Att | Replace Obj | Replace Att | Replace Rel | Swap Obj | Swap Att |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | ViT-B/16 | DataComp | 1.28B | 82.3 | 73.7 | 91.7 | 79.4 | 61.2 | 59.6 | 56.9 |
| LIFT | ViT-B/16 | DataComp | 1.28B | 89.0 | 86.1 | 93.2 | 86.0 | 70.6 | 64.1 | 63.4 |
| CLIP | ViT-B/16 | Recap | 512M | 77.0 | 73.7 | 88.9 | 80.8 | 63.4 | 62.0 | 76.3 |
| LIFT | ViT-B/16 | Recap | 512M | 88.8 | 92.2 | 92.3 | 88.2 | 76.8 | 66.5 | 72.8 |

Zero-shot Retrieval

| Models | Backbone | Dataset | Sample Seen | ImageNet | COCO I2T | COCO T2I | Flickr I2T | Flickr T2I |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | ViT-B/16 | DataComp | 1.28B | 58.4 | 31.0 | 27.2 | 62.9 | 59.6 |
| LIFT | ViT-B/16 | DataComp | 1.28B | 58.3 | 29.1 | 28.1 | 58.8 | 63.7 |
| CLIP | ViT-B/16 | Recap | 512M | 34.6 | 25.7 | 26.7 | 56.4 | 57.9 |
| LIFT | ViT-B/16 | Recap | 512M | 43.6 | 34.6 | 36.0 | 69.1 | 72.9 |

Model Zoo

| Models | Checkpoints | Backbone | Dataset | Sample Seen |
| --- | --- | --- | --- | --- |
| CLIP | Download | ViT-B/16 | DataComp | 1.28B |
| LIFT | Download | ViT-B/16 | DataComp | 1.28B |
| CLIP | Download | ViT-B/16 | Recap | 512M |
| LIFT | Download | ViT-B/16 | Recap | 512M |

Offline Embeddings Generation

We provide the script to generate caption embeddings offline using an LLM-based text encoder. First, please read the instructions to organize the raw caption folder and specify CONFIG in embed.sh. Then, in the embedding generation config, please fill in:

  1. The path to the raw caption folder raw_text_data.
  2. The path to the destination folder of caption embeddings embedded_labels.
  3. The path to the LLM-based text encoder llm_model.
  4. The total number of raw captions to be embedded raw_text_num_samples.

Then run the following command:

bash scripts/embed.sh
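
At its core, the script runs each caption through the fixed LLM text encoder once and writes the resulting embeddings to disk. The sketch below conveys the idea only; the loading code, pooling behavior, and file layout shown here are assumptions, and the actual behavior is governed by the config above.

```python
import os
import torch
from sentence_transformers import SentenceTransformer  # used here purely for illustration

# Hypothetical inputs and paths; in practice these come from the embed.sh config.
captions = ["a dog catching a frisbee", "two cyclists riding along a beach at sunset"]
encoder = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

with torch.no_grad():
    # Encode all captions once, offline; training later only loads these tensors.
    embeddings = encoder.encode(captions, convert_to_tensor=True, normalize_embeddings=True)

os.makedirs("embedded_labels", exist_ok=True)                  # placeholder destination folder
torch.save(embeddings.cpu(), "embedded_labels/shard_000.pt")   # placeholder file layout
```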

Training

We provide the training script to replicate the LIFT and CLIP models reported in the paper. First, please read the instructions to organize the image, raw caption, and caption embedding folders and specify CONFIG in train.sh. Then, in the training config, please fill in:

  1. The path to the image folder train_data.
  2. If you are training CLIP, the path to the raw caption folder raw_text_data.
  3. If you are training LIFT, the path to the caption embedding folder embedded_text_data.

Then run the following command:

bash scripts/train.sh
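
To make the division of labor explicit, here is a self-contained hypothetical training step that reuses the LIFTSketch module and alignment_loss from the pipeline sketch above. Conceptually, images come from train_data and their matching pre-computed caption embeddings from embedded_text_data; the stand-in encoder, batch shapes, and hyperparameters below are placeholders, not the repository's defaults.

```python
import torch
import torch.nn as nn

# A tiny stand-in image encoder keeps the example runnable; the real backbone is a ViT-B/16.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
model = LIFTSketch(image_encoder, image_dim=768, text_dim=4096)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # only the image tower and head are trainable

images = torch.randn(8, 3, 224, 224)    # stand-in for an image batch from train_data
text_embeddings = torch.randn(8, 4096)  # stand-in for pre-computed embeddings from embedded_text_data

z_img = model(images)                          # forward through f_theta^img and f_phi^head
loss = alignment_loss(z_img, text_embeddings)  # the text side is a constant; no gradients flow into it
optimizer.zero_grad()
loss.backward()
optimizer.step()
```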

How to get support from us?

If you have any general questions, feel free to email us at Jingfeng Yang. If you have code or implementation-related questions, please feel free to email us or open an issue in this codebase (we recommend opening an issue, since your question may help others).

Citation

If you find our work inspiring, please consider giving a star ⭐ and a citation!

@misc{yang2025languageimagealignmentfixedtext,
      title={Language-Image Alignment with Fixed Text Encoders}, 
      author={Jingfeng Yang and Ziyang Wu and Yue Zhao and Yi Ma},
      year={2025},
      eprint={2506.04209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04209}, 
}

Acknowledgement

This codebase is built on top of OpenCLIP, Webdataset, SugarCrepe, and SuperClass. The source code of the README page is borrowed from UnSAM. We thank the authors for open-sourcing their code.
