We study how model size, training data, and inference-time compute affect the performance of generative retrieval, a paradigm where LLMs generate document identifiers. To enable robust comparison, we introduce a new evaluation metric based on contrastive entropy and generation loss. Our results show that larger LLMs, especially decoder-only models like LLaMA, benefit more from increased inference compute. N-gram-based decoding aligns well with scaling trends, highlighting key design choices for future generative retrieval systems.
For more details, refer to our paper accepted to SIGIR 2025: Exploring Training and Inference Scaling Laws in Generative Retrieval.
To run the experiments, two different environments are required: one for MINDER_LLaMA and RIPOR, and another for MINDER_T5.
For MINDER_LLaMA and RIPOR:
```bash
cd MINDER_LLaMA
conda env create -f environment.yaml
conda activate mllama
```

For MINDER_T5:
```bash
cd MINDER_T5
conda env create -f environment.yaml
conda activate mt5
```

We use the following datasets:
- MINDER experiments: NQ (Natural Questions) dataset.
- RIPOR experiments: MSMARCO dataset.
The preprocessed data and FMIndex are available for download on Google Drive. Place the data in the `data` folder.
Although the downloaded FMIndex should work if the environment is set up correctly, we recommend rebuilding it in your own environment for best results.
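If you choose to rebuild it, the indexing scripts referenced below (scripts/llama_index.sh and scripts/t5_index.sh) are the supported path. For orientation only, here is a minimal sketch of the kind of build-and-save flow involved, assuming a SEAL-style `FMIndex` class with `initialize`/`save`/`load` methods and a Hugging Face tokenizer; the import path, method signatures, and file paths are assumptions, so defer to the scripts in this repository.

```python
# Hedged sketch: rebuild an FM-index over tokenized passage bodies.
# Assumes a SEAL-style FMIndex (initialize/save/load); verify the actual API
# in the SEAL repository and in scripts/llama_index.sh / scripts/t5_index.sh.
from transformers import AutoTokenizer
from seal import FMIndex  # assumption: SEAL (or its copy in this repo) is installed

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # illustrative tokenizer choice

def build_fm_index(passages, out_path):
    # The index stores each passage body as a sequence of token ids.
    corpus = [tokenizer(p, add_special_tokens=False)["input_ids"] for p in passages]
    index = FMIndex()
    index.initialize(corpus)  # assumed signature
    index.save(out_path)      # assumed signature
    return index

# Illustrative usage (paths and loader are hypothetical):
# passages = load_nq_passages("data/nq/corpus.jsonl")
# build_fm_index(passages, "data/nq/fm_index")
# index = FMIndex.load("data/nq/fm_index")
```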
MINDER is a generative retrieval method that uses text spans (e.g., body text, title, and pseudo-query) as document identifiers. For simplicity, we use only the body text as the document identifier.
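To make "text spans as identifiers" concrete, the toy sketch below shows the constraint that FM-index-based decoding enforces: the model may only generate token sequences that appear verbatim in some document body, so every completed generation points back to at least one document. This is purely illustrative Python (a brute-force substring scan stands in for the FM-index, and a dummy scoring function stands in for the LLM); it is not MINDER's actual decoding code, which uses beam search over token ids.

```python
# Toy illustration of n-gram-constrained decoding: every generated prefix must
# be a contiguous span of some document body. A brute-force scan stands in for
# the FM-index; `score_fn` stands in for the LLM's next-token scores.
def allowed_next_tokens(corpus_tokens, prefix):
    """Tokens that can extend `prefix` while it remains a span of some document."""
    allowed = set()
    for tokens in corpus_tokens:
        for i in range(len(tokens) - len(prefix)):
            if tokens[i:i + len(prefix)] == prefix:
                allowed.add(tokens[i + len(prefix)])
    return allowed

def constrained_greedy_decode(corpus_tokens, score_fn, max_len=8):
    """Greedily extend the identifier, restricted to tokens the corpus allows."""
    prefix = []
    for _ in range(max_len):
        candidates = allowed_next_tokens(corpus_tokens, prefix)
        if not candidates:
            break
        prefix.append(max(candidates, key=lambda tok: score_fn(prefix, tok)))
    return prefix

# Usage with a dummy "model" that simply prefers longer tokens:
corpus_tokens = [doc.split() for doc in [
    "generative retrieval generates document identifiers",
    "scaling laws for generative language models",
]]
print(constrained_greedy_decode(corpus_tokens, lambda prefix, tok: len(tok)))
```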
- Install FMIndex:
Follow the instructions in the SEAL repository to install the necessary dependencies (you may need to clone the SEAL repo to install sdsl-lite).
```bash
cd MINDER_LLaMA
conda activate mllama
# install FMIndex
```
- Data preparation:
We use the Natural Questions dataset. You can use scripts/llama_index.sh to build the FMIndex.
- Run the experiments:
```bash
# train
bash scripts/finetune_llama.sh
# test if needed
bash scripts/test_llama.sh
# eval loss
bash scripts/eval_loss.sh
```

For MINDER_T5:
- Install FMIndex:
The steps are similar to MINDER_LLaMA, but you will use a different environment.
```bash
cd MINDER_T5
conda activate mt5
# install FMIndex
```
- Data preparation:
We use the Natural Questions dataset. You can use scripts/t5_index.sh to build the FMIndex.
- Run the experiments:
```bash
# train
bash scripts/train.sh
# test if needed
bash scripts/test_t5.sh
# eval loss
bash scripts/eval_loss.sh
```

RIPOR is a generative retrieval method that leverages codebooks to learn discrete representations of documents. We directly use the data provided by the authors.
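For readers unfamiliar with codebook-based identifiers, the sketch below illustrates residual quantization, the general family of techniques that RIPOR-style methods use to turn dense document embeddings into short sequences of discrete codes that a model can then generate step by step. The embedding source, codebook depth, and k-means fitting here are illustrative assumptions, not RIPOR's actual configuration; for the experiments we use the identifiers provided by the RIPOR authors.

```python
# Hedged sketch of residual quantization: map dense document embeddings to
# short discrete code sequences (docids). All hyperparameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def train_residual_codebooks(embeddings, num_levels=4, codebook_size=16, seed=0):
    """Fit one k-means codebook per level on the residuals left by the previous level."""
    residual = embeddings.copy()
    codebooks, codes = [], []
    for _ in range(num_levels):
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(residual)
        ids = km.predict(residual)
        codebooks.append(km.cluster_centers_)
        codes.append(ids)
        residual = residual - km.cluster_centers_[ids]  # pass the quantization error down
    return codebooks, np.stack(codes, axis=1)  # row i is document i's code sequence

# Usage with random vectors standing in for real document embeddings:
doc_embeddings = np.random.randn(1000, 64).astype(np.float32)
codebooks, docids = train_residual_codebooks(doc_embeddings)
print(docids[:3])  # each document becomes a length-4 sequence of code indices
```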
- Environment:
```bash
cd RIPOR
conda activate mllama
```
- Data preparation:
We use the MSMARCO dataset provided by the RIPOR repository.
- Run the experiments:

For LLaMA:
```bash
# train
bash scripts/finetune_llama.sh
# eval loss
bash scripts/eval_loss_llama.sh
```

For T5:
```bash
# train
bash scripts/train_t5.sh
# eval loss
bash scripts/eval_loss_t5.sh
```

- Model Sizes: For both methods, you can test different model sizes by changing the model name.
- CGL Calculation: After evaluating the loss, you can calculate the contrastive generation loss (CGL) as described in the paper; a rough sketch follows this list.
- Inference Scaling: You can adjust the beam size in the MINDER test scripts to observe how performance changes with inference-time compute.
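The paper's definition of the contrastive generation loss is authoritative; the sketch below is only an assumed illustration of the general shape of such a metric, contrasting the generation loss of the relevant document's identifier with the losses of sampled negative identifiers, so that a model scores well when it assigns relatively low loss to the positive. All function names and numbers here are illustrative.

```python
# Hedged sketch of a contrastive generation-loss style score. The exact CGL
# definition is given in the paper; this only illustrates contrasting the
# positive identifier's loss against sampled negatives.
import math

def sequence_loss(log_probs):
    """Average negative log-likelihood of an identifier's tokens."""
    return -sum(log_probs) / len(log_probs)

def contrastive_generation_score(pos_log_probs, neg_log_probs_list):
    """Softmax over negated losses: relative probability mass the model assigns
    to the positive identifier versus the negatives (higher is better)."""
    losses = [sequence_loss(pos_log_probs)] + [sequence_loss(lp) for lp in neg_log_probs_list]
    weights = [math.exp(-l) for l in losses]
    return weights[0] / sum(weights)

# Usage with toy per-token log-probabilities (illustrative numbers only):
pos = [-0.2, -0.4, -0.1]
negs = [[-1.5, -2.0, -1.0], [-0.9, -1.2, -1.1]]
print(contrastive_generation_score(pos, negs))
```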
If you use the source code or datasets in your research, please cite our paper:
```bibtex
@inproceedings{cai2025exploringtraininginferencescaling,
  title={Exploring Training and Inference Scaling Laws in Generative Retrieval},
  author={Hongru Cai and Yongqi Li and Ruifeng Yuan and Wenjie Wang and Zhen Zhang and Wenjie Li and Tat-Seng Chua},
  booktitle={Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series={SIGIR'25},
  year={2025}
}
```
This project is licensed under the CC BY-NC 4.0 License.
For inquiries, feel free to reach out to Hongru Cai at [email protected].