FOrming semantic identifieRs for Generative retriEval in Industrial Datasets
Explore full dataset in Huggingface »
Semantic identifiers (SIDs) have gained increasing interest in generative retrieval (GR) due to their meaningful semantic discriminability. Existing studies typically rely on arbitrarily defined SIDs while neglecting the influence of SID configurations on GR. In addition, evaluations conducted on datasets with limited multimodal features and behaviors undermine their reliability under industrial traffic. To address these limitations, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieRs for Generative retriEval with industrial datasets. Specifically, FORGE includes a dataset comprising 14 billion user interactions and multimodal features of 250 million items collected from an e-commerce platform, enabling researchers to construct and evaluate their own SIDs. Leveraging this dataset, we systematically explore various strategies for SID generation and validate their effectiveness across different settings and tasks. Extensive online experiments show improvements of 8.93% in PVR and 0.35% in transaction count, highlighting the practical value of our approach. Notably, we propose two novel SID metrics that correlate well with GR performance, offering a convenient way to measure SID quality without training a GR model. Subsequent offline pretraining also supports online convergence in industrial applications. The code and data are available in the code repo.
- Clone the repository:
git clone repo_name
cd repo
- Install dependencies:
- Python 3.8+
- PyTorch 1.10+
- requirements.txt
pip install -r requirements.txt
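The version requirements above can be checked before training. The following is a minimal sketch, not part of the repository: `parse_version` and `meets_minimum` are hypothetical helper names, and the check compares numeric tuples because plain string comparison would rank "3.10" below "3.8".

```python
import re
import sys


def parse_version(text):
    """Return the leading numeric components, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(g) for g in match.groups() if g is not None)


def meets_minimum(version, minimum):
    """Compare versions numerically rather than as strings."""
    return parse_version(version) >= parse_version(minimum)


# Check the running interpreter against the stated Python 3.8+ requirement.
python_version = "{}.{}".format(*sys.version_info[:2])
print("Python OK:", meets_minimum(python_version, "3.8"))

# Check PyTorch against the stated 1.10+ requirement, if it is installed.
try:
    import torch
    print("PyTorch OK:", meets_minimum(torch.__version__, "1.10"))
except ImportError:
    print("PyTorch is not installed")
```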
- For SID Generation Task:
wget -P datas/ https://mvap-public-data.oss-cn-zhangjiakou.aliyuncs.com/ICLR_2026_data/reconstruct_data_mask.npz
wget -P datas/ https://mvap-public-data.oss-cn-zhangjiakou.aliyuncs.com/ICLR_2026_data/contrastive_data_mask.npz
Dataset Preview:
- 10m_80msideinfo_feat.npz for contrastive task in Eq.2
The file contains three components:
(1) a deduplicated mapping table between item IDs and their corresponding indices ("itemEncID") with shape (6,844,930, 2), where each row is an [item_id, index] pair (e.g., [855036080309, 0]);
(2) a list of item pairs ("pairs") with shape (9,509,084, 2), representing co-occurrence or association relationships between items (e.g., [855036080309, 545092516562]); and
(3) a deduplicated embedding matrix with shape (6,844,930, 512), where each row is a 512-dimensional vector representation of an item (e.g., [xx, xx,..., xx]).
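How the three components fit together can be illustrated on toy data. The sketch below is an assumption about typical usage, not the repository's training code: it treats the mapping as a Python dict from item ID to row index, uses 4-dimensional toy embeddings instead of 512, and `pair_embeddings` is a hypothetical helper name.

```python
import numpy as np

# Toy stand-ins for the three components (hypothetical values, not real data).
item_to_index = {855036080309: 0, 545092516562: 1, 111: 2}
pairs = np.array([[855036080309, 545092516562], [545092516562, 111]])
embeds = np.arange(3 * 4, dtype=np.float32).reshape(3, 4)  # 3 items, 4 dims


def pair_embeddings(pairs, item_to_index, embeds):
    """Return (left, right) embedding matrices, one row per item pair."""
    left_idx = np.array([item_to_index[int(a)] for a, _ in pairs])
    right_idx = np.array([item_to_index[int(b)] for _, b in pairs])
    return embeds[left_idx], embeds[right_idx]


left, right = pair_embeddings(pairs, item_to_index, embeds)
print(left.shape, right.shape)  # (2, 4) (2, 4)
```

Each row of `left` and `right` then holds the embeddings of the two items in one pair, which is the shape a contrastive objective usually consumes.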
import numpy as np
import os
# 1. Load the .npz file
dirpath = os.path.expanduser('~/git/al_sid/SID_generation/datas')  # expand '~' so np.load can find the file
filename2 = '10m_80msideinfo_feat.npz'  # replace with your filename
file_path = os.path.join(dirpath, filename2)
data = np.load(file_path, allow_pickle=True)
itemEncID, pairs, embeds = data['itemEncID'].item(), data['pairs'], data['embeds'].astype(np.float32)
# 2. Preview the first entries of the item ID -> index mapping
for key, item in itemEncID.items():
    print(key, item)  # [855036080309, 0], [545092516562, 1]
    if item > 100:
        break
print("pairs:", pairs.shape, pairs[:1])  # (9509084, 2) [[855036080309 545092516562]]
print("embeds:", embeds.shape, embeds[:1])  # (6844930, 512) [[xx,xx,...,xx]]
- 5mold_80msideinfo_feat.npz for reconstruction task in Eq.3
import numpy as np
import os
# 1. Load the .npz file
dirpath = os.path.expanduser('~/git/al_sid/SID_generation/datas')  # expand '~' so np.load can find the file
filename1 = '5mold_80msideinfo_feat.npz'  # replace with your filename
file_path = os.path.join(dirpath, filename1)
data = np.load(file_path)
print("Available arrays:", data.files)  # Available arrays: ['ids', 'embeds']
# 2. Print the shape and first entry of each array
for key in data:
    print(f"{key}: {data[key].shape}")
    print(data[key][:1])
data.close()
# ids: (4148316,) [813799260043]
# embeds: (4148316, 512) [xx,xx,...,xx]
- Seq Data for Generative Task:
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
dataset = load_dataset("AL-GR/AL-GR-Tiny", data_files="train_data/s1_tiny.csv", split="train")
Data Preview: https://huggingface.co/datasets/AL-GR/AL-GR-Tiny
- Training the Model
To start distributed training, use the following command:
python -m torch.distributed.launch --nnodes=2 --nproc_per_node=1 --master_port=27646 train.py --output_dir=/path/to/output --save_prefix=MODEL_NAME --cfg=configs/rqvae_i2v.yml
- Parameters
--cfg: Path to the configuration file.
--output_dir: Directory for model outputs.
--save_prefix: Prefix for saving the model.
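For readers adapting the launch command, here is a hedged sketch of how an entry point might parse these flags. It is not the repository's train.py: the `build_parser` name, the required/optional choices, and the "model" default are assumptions for illustration. The one behavior taken from PyTorch itself is that `torch.distributed.launch` appends a local-rank argument to the script, spelled `--local_rank` in older releases and `--local-rank` from PyTorch 2.0 onward, so the sketch accepts both.

```python
import argparse


def build_parser():
    # Hypothetical parser; the real train.py may define more or different options.
    parser = argparse.ArgumentParser(description="Sketch of a training entry point")
    parser.add_argument("--cfg", required=True, help="Path to the configuration file")
    parser.add_argument("--output_dir", required=True, help="Directory for model outputs")
    parser.add_argument("--save_prefix", default="model", help="Prefix for saving the model")
    # The launcher adds this flag; accept both the underscored and dashed spellings.
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank", type=int, default=0)
    return parser


# Simulate the argument list the launcher would hand to the script.
args = build_parser().parse_args([
    "--local_rank=0",
    "--output_dir=/path/to/output",
    "--save_prefix=MODEL_NAME",
    "--cfg=configs/rqvae_i2v.yml",
])
print(args.cfg, args.output_dir, args.save_prefix, args.local_rank)
```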
- Testing the Model
Use the following command to start testing:
python infer_SID.py
- Clone the repo
git clone repo_name
cd repo_name/algr
- training scripts:
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 runner.py --config=config/t5base_3layer_tiny.json
- predict scripts:
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 runner.py --config=config/generate_t5base_3layer_tiny.json
- calculate Hitrate:
# nebula test: python calc_hr.py --dataset_name=/home/admin/.cache/huggingface/modules/datasets_modules/datasets/AL-GR--AL-GR-Tiny/25dea07242891a2d --nebula
1. python calc_hr.py --item_sid_file=item_info/tiny_item_sid_final.csv --generate_file=logs/generate_t5base_3layer_tiny/output.jsonl
2. python calc_hr.py --item_sid_file=item_info/tiny_item_sid_final.csv --generate_file=logs/generate_qwen2.5_05b_3layer_tiny/output.jsonl --decoder_only
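As a rough illustration of the hit-rate metric, the sketch below computes HitRate@K over toy predictions. It is not the logic of calc_hr.py, whose input formats are not described here: the dict layout, the three-token SID tuples, and the `hit_rate_at_k` name are all assumptions.

```python
from typing import Dict, List, Tuple

SID = Tuple[int, ...]


def hit_rate_at_k(predictions: Dict[str, List[SID]], targets: Dict[str, SID], k: int) -> float:
    """Fraction of requests whose target SID appears among the top-k predictions."""
    if not targets:
        return 0.0
    hits = sum(
        1 for request, target in targets.items()
        if target in predictions.get(request, [])[:k]
    )
    return hits / len(targets)


# Toy data: ranked SID predictions per request and the ground-truth SID.
predictions = {
    "u1": [(1, 2, 3), (4, 5, 6)],
    "u2": [(7, 8, 9), (1, 1, 1)],
    "u3": [(2, 2, 2)],
}
targets = {"u1": (4, 5, 6), "u2": (0, 0, 0), "u3": (2, 2, 2)}

print(hit_rate_at_k(predictions, targets, 1))  # 1 of 3 requests hit
print(hit_rate_at_k(predictions, targets, 2))  # 2 of 3 requests hit
```

Requests with no predictions count as misses here; a real evaluation may also need to map generated SIDs back to items when several items share one SID.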
- Generative Retrieval Training
- SID Generation
- Data Processing
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the project_license. See LICENSE.txt for more information.
If you have any questions or encounter difficulties, please contact us via GitHub Issues. We aim to respond promptly and help you get up and running quickly with generative recommendation.
Please cite the following paper if you find our code helpful.
@misc{fu2025forge,
title={FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets},
author={Kairui Fu and Tao Zhang and Shuwen Xiao and Ziyang Wang and Xinming Zhang and Chenchi Zhang and Yuliang Yan and Junjun Zheng and Yu Li and Zhihong Chen and Jian Wu and Xiangheng Kong and Shengyu Zhang and Kun Kuang and Yuning Jiang and Bo Zheng},
year={2025},
eprint={2509.20904},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2509.20904},
}