Revela: Dense Retriever Learning via Language Modeling

📑 Paper | 🔧 Installation | 📚 Resources | 🚀 Training | 📊 Evaluation | 📄 Citing

TL;DR: Self-supervised retriever learning is framed as next-token prediction with retrieval-weighted in-batch attention. Trained only on raw text, the method delivers strong performance on code, reasoning-intensive, and general-domain retrieval.

Abstract:

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers?

To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR’s unsupervised SoTA with ~1000× less training data and 10× less compute. Performance increases with batch size and model size, highlighting Revela’s scalability and its promise for self-supervised retriever learning.

Installation

To begin, set up the conda environment using the following command:

conda env create -f environment.yml

In Revela, we modify the transformers architecture to incorporate in-batch attention. To enable this, install a customized version of the transformers library:

pip uninstall transformers
pip install git+https://github.com/TRUMANCFY/transformers.git@adapt

Finally, we train the model in a modular setup. To install the local package in editable mode, run:

cd src/tevatron
pip install -e .

Resources

Data

Dataset	Source	Number of Batches	Batch Size
Revela Training Corpus	Wikipedia	320,000	16
Revela Code Training Corpus	Stackoverflow Posts, Online Tutorials, Library Documentation	358,763	16

Models

Model Name	Base Model	Training Source
Revela-3b	meta-llama/Llama-3.2-3B	Wikipedia
Revela-1b	meta-llama/Llama-3.2-1B	Wikipedia
Revela-500m	Qwen/Qwen2.5-0.5B	Wikipedia
Revela-code-3b	meta-llama/Llama-3.2-1B	Stackoverflow Posts + Online Tutorials + Library Documentation
Revela-code-1b	meta-llama/Llama-3.2-1B	Stackoverflow Posts + Online Tutorials + Library Documentation
Revela-code-500m	Qwen/Qwen2.5-0.5B	Stackoverflow Posts + Online Tutorials + Library Documentation

Training

The training script can be found at `train.sh` under DeepSpeed training framework.

export CUDA_VISIBLE_DEVICES=0,1,2,3
export TRITON_PRINT_AUTOTUNING=1

export ROOT_DIR=./
export OUTPUT_DIR=...
export RUN_NAME=...

deepspeed --include localhost:0,1,2,3 --master_port 6022 --module tevatron.llm_retriever.driver.train \
  --deepspeed $ROOT_DIR/deepspeed/ds_zero3_config.json \
  --output_dir $OUTPUT_DIR \
  --model_name_or_path meta-llama/Llama-3.2-1B \
  --lora \
  --lora_r 256 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 500 \
  --bm25_retrieval_file $DATA_PATH \
  --add_passage_prefix True \
  --add_query_prefix True \
  --first_half True \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --attn_temperature 0.0001 \
  --per_device_train_batch_size 1 \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --passage_max_len 157 \
  --num_train_epochs 1 \
  --gradient_accumulation_steps 8 \
  --logging_steps 1 \
  --overwrite_output_dir \
  --warmup_steps 100 \
  --resume latest \
  --top_k 16 \
  --run_name $RUN_NAME

Evaluation

We can evaluate the trained models with customized mteb.

from mteb.model_meta import ModelMeta
from mteb.models.repllama_models import RepLLaMAWrapper, _loader
import mteb, torch

revela_llama_code_3b = ModelMeta(
    loader=_loader(
        RepLLaMAWrapper,
        base_model_name_or_path="meta-llama/Llama-3.2-3B",
        peft_model_name_or_path="trumancai/Revela-code-3b",
        device_map="auto",
        torch_dtype=torch.bfloat16,
    ),
    name="trumancai/Revela-code-3b",
    languages=["eng_Latn"],
    open_source=True,
    revision="974f4d8e7ff5d5439cc1863088948249f612c284",
    release_date="2025-10-07",
)

model = revela_llama_code_3b.loader()

mteb.MTEB(tasks=["AppsRetrieval"])
    .run(model=model, output_folder="results/Revela-code-3b")

Results

Revela achieves robust and impressive results on code retrieval (CoIR), reasoning-intensive retrieval (BRIGHT), and general retrieval (BEIR). Additional results are provided in the paper.

Citing

@inproceedings{
cai2026revela,
title={Revela: Dense Retriever Learning via Language Modeling},
author={Fengyu Cai and Tong Chen and Xinran Zhao and Sihao Chen and Hongming Zhang and Tongshuang Wu and Iryna Gurevych and Heinz Koeppl},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=e7pAjJZJWb}
}

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
assets		assets
deepspeed		deepspeed
src/tevatron		src/tevatron
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
train.sh		train.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Revela: Dense Retriever Learning via Language Modeling

📑 Paper | 🔧 Installation | 📚 Resources | 🚀 Training | 📊 Evaluation | 📄 Citing

Installation

Resources

Data

Models

Training

Evaluation

Results

Citing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Revela: Dense Retriever Learning via Language Modeling

📑 Paper | 🔧 Installation | 📚 Resources | 🚀 Training | 📊 Evaluation | 📄 Citing

Installation

Resources

Data

Models

Training

Evaluation

Results

Citing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages