TL;DR: Self-supervised retriever learning is framed as next-token prediction with retrieval-weighted in-batch attention. Trained only on raw text, the method delivers strong performance on code, reasoning-intensive, and general-domain retrieval.
Abstract:
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers?
To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000× less training data and 10× less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
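The idea of weighting cross-document context by retriever similarity can be illustrated with a small, dependency-free sketch. It is a simplified illustration and not the implementation in this repository: the function names (`retrieval_weights`, `mix_cross_document`), the toy 2-dimensional embeddings, and the `temperature` value are all assumptions made for the example, and the real mechanism operates at the token level inside the transformer's attention layers.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieval_weights(embeddings, temperature=0.05):
    """Row-wise softmax over cosine similarities, excluding each chunk itself."""
    unit = [l2_normalize(e) for e in embeddings]
    n = len(unit)
    weights = []
    for i in range(n):
        # Similarity of chunk i to every other chunk in the batch.
        logits = [
            sum(a * b for a, b in zip(unit[i], unit[j])) / temperature
            for j in range(n) if j != i
        ]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        row = [e / z for e in exps]
        row.insert(i, 0.0)  # a chunk places no cross-document weight on itself
        weights.append(row)
    return weights

def mix_cross_document(contexts, weights):
    """For each chunk, take the similarity-weighted sum of the other chunks' context vectors."""
    dim = len(contexts[0])
    return [
        [sum(w * c[d] for w, c in zip(row, contexts)) for d in range(dim)]
        for row in weights
    ]

# Toy batch: chunks 0 and 1 are similar, chunk 2 is unrelated.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = retrieval_weights(embeddings)
mixed = mix_cross_document(embeddings, weights)
```

Because the weights are a differentiable function of the retriever's embeddings, a language modeling loss computed on top of the mixed context can, in principle, send gradients back to the retriever, which is the property the abstract relies on.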
To begin, set up the conda environment using the following command:
conda env create -f environment.yml
In Revela, we modify the transformers architecture to incorporate in-batch attention. To enable this, install a customized version of the transformers library:
pip uninstall -y transformers
pip install git+https://github.com/TRUMANCFY/transformers.git@adapt
Finally, we train the model in a modular setup. To install the local package in editable mode, run:
cd src/tevatron
pip install -e .
| Dataset | Source | Number of Batches | Batch Size |
|---|---|---|---|
| Revela Training Corpus | Wikipedia | 320,000 | 16 |
| Revela Code Training Corpus | Stackoverflow Posts, Online Tutorials, Library Documentation | 358,763 | 16 |
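As a rough size check, the table can be turned into total chunk counts. This is a back-of-the-envelope sketch that assumes every batch is full and that "Batch Size" counts text chunks per batch; the helper name `total_chunks` is hypothetical.

```python
# (number of batches, batch size) copied from the table above.
corpora = {
    "Revela Training Corpus": (320_000, 16),
    "Revela Code Training Corpus": (358_763, 16),
}

def total_chunks(num_batches, batch_size):
    """Total chunks, assuming every batch holds exactly `batch_size` chunks."""
    return num_batches * batch_size

totals = {name: total_chunks(*shape) for name, shape in corpora.items()}
for name, total in totals.items():
    print(f"{name}: {total:,} chunks")
```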
export CUDA_VISIBLE_DEVICES=0,1,2,3
export TRITON_PRINT_AUTOTUNING=1
export ROOT_DIR=./
export OUTPUT_DIR=...
export RUN_NAME=...
export DATA_PATH=...
deepspeed --include localhost:0,1,2,3 --master_port 6022 --module tevatron.llm_retriever.driver.train \
--deepspeed $ROOT_DIR/deepspeed/ds_zero3_config.json \
--output_dir $OUTPUT_DIR \
--model_name_or_path meta-llama/Llama-3.2-1B \
--lora \
--lora_r 256 \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 500 \
--bm25_retrieval_file $DATA_PATH \
--add_passage_prefix True \
--add_query_prefix True \
--first_half True \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--attn_temperature 0.0001 \
--per_device_train_batch_size 1 \
--train_group_size 16 \
--learning_rate 1e-4 \
--passage_max_len 157 \
--num_train_epochs 1 \
--gradient_accumulation_steps 8 \
--logging_steps 1 \
--overwrite_output_dir \
--warmup_steps 100 \
--resume latest \
--top_k 16 \
--run_name $RUN_NAME
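The batch-related flags in the command above combine in the usual data-parallel way. The sketch below works through that arithmetic under two assumptions that the README does not state explicitly: that `--train_group_size` is the number of passages in one in-batch group, and that one training example corresponds to one batch from the Wikipedia row of the table.

```python
import math

# Values copied from the training command above.
num_gpus = 4                      # CUDA_VISIBLE_DEVICES=0,1,2,3
per_device_train_batch_size = 1   # --per_device_train_batch_size
gradient_accumulation_steps = 8   # --gradient_accumulation_steps
train_group_size = 16             # --train_group_size
save_steps = 500                  # --save_steps

# Assumption: one training example is one batch (group) from the table.
num_groups = 320_000

# Groups and passages consumed per optimizer step across all GPUs.
groups_per_step = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
passages_per_step = groups_per_step * train_group_size

# Optimizer steps and saved checkpoints for one epoch.
optimizer_steps = math.ceil(num_groups / groups_per_step)
checkpoints = optimizer_steps // save_steps

print(groups_per_step, passages_per_step, optimizer_steps, checkpoints)
```

Changing the number of GPUs or the accumulation steps changes the effective batch per optimizer step, so the learning rate and warmup may need to be revisited when running on different hardware.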
The trained models can be evaluated with a customized version of mteb:
from mteb.model_meta import ModelMeta
from mteb.models.repllama_models import RepLLaMAWrapper, _loader
import mteb, torch
revela_llama_code_3b = ModelMeta(
    loader=_loader(
        RepLLaMAWrapper,
        base_model_name_or_path="meta-llama/Llama-3.2-3B",
        peft_model_name_or_path="trumancai/Revela-code-3b",
        device_map="auto",
        torch_dtype=torch.bfloat16,
    ),
    name="trumancai/Revela-code-3b",
    languages=["eng_Latn"],
    open_source=True,
    revision="974f4d8e7ff5d5439cc1863088948249f612c284",
    release_date="2025-10-07",
)
model = revela_llama_code_3b.loader()
# Run the AppsRetrieval task and write the scores to the output folder.
evaluation = mteb.MTEB(tasks=["AppsRetrieval"])
evaluation.run(model=model, output_folder="results/Revela-code-3b")
Revela achieves robust and impressive results on code retrieval (CoIR), reasoning-intensive retrieval (BRIGHT), and general retrieval (BEIR). Additional results are provided in the paper.
@inproceedings{
cai2026revela,
title={Revela: Dense Retriever Learning via Language Modeling},
author={Fengyu Cai and Tong Chen and Xinran Zhao and Sihao Chen and Hongming Zhang and Tongshuang Wu and Iryna Gurevych and Heinz Koeppl},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=e7pAjJZJWb}
}
