
FORGE

FOrming semantic identifieRs for Generative retriEval in Industrial Datasets
Explore full dataset in Huggingface »

View Demo · Report Bug · Request Feature

About The Project

Semantic identifiers (SIDs) have gained increasing interest in generative retrieval (GR) due to their meaningful semantic discriminability. Existing studies typically rely on arbitrarily defined SIDs while neglecting the influence of SID configurations on GR. In addition, evaluations conducted on datasets with limited multimodal features and behaviors hinder their reliability in industrial traffic. To address these limitations, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieRs for Generative retriEval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items collected from an e-commerce platform, enabling researchers to construct and evaluate their own SIDs. Leveraging this dataset, we systematically explore various strategies for SID generation and validate their effectiveness across different settings and tasks. Extensive online experiments show 8.93% and 0.35% improvements in PVR and transaction count, highlighting the practical value of our approach. Notably, we propose two novel SID metrics that correlate well with GR performance, providing a convenient measure of SID quality without training GR. Subsequent offline pretraining also supports online convergence in industrial applications. The code and data are available at code repo.

Getting Started

To get a local copy up and running, follow the steps below.

Prerequisites

The steps below list what you need to run the software and how to install it.

Clone the repository:
```
git clone repo_name
cd repo_name
```
Install dependencies:

Python 3.8+
PyTorch 1.10+
requirements.txt
```
pip install -r requirements.txt
```

Dataset Description (Demo)

For SID Generation Task:

wget -P datas/ https://mvap-public-data.oss-cn-zhangjiakou.aliyuncs.com/ICLR_2026_data/reconstruct_data_mask.npz
wget -P datas/ https://mvap-public-data.oss-cn-zhangjiakou.aliyuncs.com/ICLR_2026_data/contrastive_data_mask.npz

Dataset Preview:

10m_80msideinfo_feat.npz for contrastive task in Eq.2

The file contains three components:

(1) a deduplicated mapping table between item IDs and their corresponding indices ("itemEncID") with shape (6,844,930, 2), where each row is an [item_id, index] pair (e.g., [855036080309, 0]);

(2) a list of item pairs ("pairs") with shape (9,509,084, 2), representing co-occurrence or association relationships between items (e.g., [855036080309, 545092516562]); and

(3) a deduplicated embedding matrix with shape (6,844,930, 512), where each row is a 512-dimensional vector representation of an item (e.g., [xx, xx,..., xx]).

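import os

import numpy as np

# 1. Load the .npz file (replace dirpath and filename2 with your own location).
#    os.path.join does not expand "~", so expand it explicitly.
dirpath = os.path.expanduser('~/git/al_sid/SID_generation/datas')
filename2 = '10m_80msideinfo_feat.npz'
file_path = os.path.join(dirpath, filename2)
data = np.load(file_path, allow_pickle=True)

# 2. Unpack the three components; itemEncID is read as a pickled dict (item_id -> index).
itemEncID = data['itemEncID'].item()
pairs = data['pairs']
embeds = data['embeds'].astype(np.float32)

# 3. Preview roughly the first 100 entries of the mapping.
for key, item in itemEncID.items():
    print(key, item)  # e.g. 855036080309 0, then 545092516562 1
    if item > 100:
        break
print("pairs:", pairs.shape, pairs[:1])  # (9509084, 2) [[855036080309 545092516562]]
print("embeds:", embeds.shape, embeds[:1])  # (6844930, 512) [[xx,xx,...,xx]]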
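The three components are linked through the index table: an item ID that appears in `pairs` resolves to a row of `embeds` via `itemEncID`. The following is a minimal sketch of that lookup on tiny synthetic stand-ins. The item ID `111`, the 4-dimensional vectors, and the helper name `pair_embeddings` are hypothetical and only for illustration; the real file uses 512 dimensions, and the dict layout is assumed from the loading code above.

```python
import numpy as np

# Synthetic stand-ins for the arrays in the .npz file (values are made up).
item_enc_id = {855036080309: 0, 545092516562: 1, 111: 2}
pairs = np.array(
    [[855036080309, 545092516562], [545092516562, 111]], dtype=np.int64
)
embeds = np.arange(12, dtype=np.float32).reshape(3, 4)  # real data: (N, 512)


def pair_embeddings(pairs, item_enc_id, embeds):
    """Return the (left, right) embedding matrices for each item pair."""
    # Translate raw item IDs into row indices of the embedding matrix.
    left_idx = np.array([item_enc_id[int(a)] for a in pairs[:, 0]])
    right_idx = np.array([item_enc_id[int(b)] for b in pairs[:, 1]])
    return embeds[left_idx], embeds[right_idx]


left, right = pair_embeddings(pairs, item_enc_id, embeds)
print(left.shape, right.shape)  # (2, 4) (2, 4)
```

Resolving IDs to indices once and caching the index arrays is likely preferable at full scale, since the Python-level dict lookups dominate the cost for millions of pairs.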

5mold_80msideinfo_feat.npz for reconstruction task in Eq.3

import os

import numpy as np

# 1. Load the .npz file (replace dirpath and filename1 with your own location).
#    os.path.join does not expand "~", so expand it explicitly.
dirpath = os.path.expanduser('~/git/al_sid/SID_generation/datas')
filename1 = '5mold_80msideinfo_feat.npz'
file_path = os.path.join(dirpath, filename1)
data = np.load(file_path)

# 2. List the stored arrays, then print the shape and first row of each.
print("Available arrays:", data.files)  # Available arrays: ['ids', 'embeds']
for key in data:
    print(f"{key}: {data[key].shape}")
    print(data[key][:1])
data.close()
# ids: (4148316,) [813799260043]
# embeds: (4148316, 512) [xx,xx,...,xx]

Seq Data for Generative Task:

from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
dataset = load_dataset("AL-GR/AL-GR-Tiny", data_files="train_data/s1_tiny.csv", split="train")

Data Preview: https://huggingface.co/datasets/AL-GR/AL-GR-Tiny

SID Generation

Training the Model

To start distributed training, use the following command:

python -m torch.distributed.launch --nnodes=2 --nproc_per_node=1 --master_port=27646 train.py --output_dir=/path/to/output --save_prefix=MODEL_NAME --cfg=configs/rqvae_i2v.yml

Parameters

--cfg: Path to the configuration file.
--output_dir: Directory for model outputs.
--save_prefix: Prefix for saving the model.

Testing the Model

Use the following command to start testing:

python infer_SID.py
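The configuration name `rqvae_i2v.yml` suggests that SIDs come from a residual-quantization model (RQ-VAE). As a rough illustration of how an embedding can be turned into a multi-level SID, here is a minimal sketch of greedy residual quantization. It is an assumption-laden toy, not the code in `train.py` or `infer_SID.py`: the function name `residual_quantize`, the random codebooks, the three levels, and the 4-dimensional embedding are all hypothetical, and the encoder and training losses are omitted.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Greedy residual quantization: pick one code index per codebook level."""
    residual = np.asarray(x, dtype=np.float32)
    codes = []
    for codebook in codebooks:  # each codebook has shape (num_codes, dim)
        # Choose the code vector closest to the current residual.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        # The next level quantizes whatever this level failed to explain.
        residual = residual - codebook[idx]
    return tuple(codes)


rng = np.random.default_rng(0)
# Hypothetical setup: 3 levels, 8 codes per level, 4-dimensional embeddings.
codebooks = [rng.normal(size=(8, 4)).astype(np.float32) for _ in range(3)]
item_embedding = rng.normal(size=4).astype(np.float32)

sid = residual_quantize(item_embedding, codebooks)
print(sid)  # a 3-tuple of code indices, one per level
```

In this scheme, items whose embeddings are close tend to share the leading code indices, so the SID prefix acts as a coarse semantic cluster.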

Generative Retrieval

Clone the repo
```
git clone repo_name
cd repo_name/algr
```

Training script:

python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 runner.py --config=config/t5base_3layer_tiny.json

Prediction script:

python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 runner.py --config=config/generate_t5base_3layer_tiny.json

Calculate the hit rate:

# nebula test: python calc_hr.py --dataset_name=/home/admin/.cache/huggingface/modules/datasets_modules/datasets/AL-GR--AL-GR-Tiny/25dea07242891a2d --nebula
# 1. T5 (encoder-decoder) output
python calc_hr.py --item_sid_file=item_info/tiny_item_sid_final.csv --generate_file=logs/generate_t5base_3layer_tiny/output.jsonl
# 2. Qwen2.5 (decoder-only) output
python calc_hr.py --item_sid_file=item_info/tiny_item_sid_final.csv --generate_file=logs/generate_qwen2.5_05b_3layer_tiny/output.jsonl --decoder_only
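For readers unfamiliar with the metric, the following is a minimal sketch of hit rate at K (HR@K): the fraction of test samples whose ground-truth item appears among the top-K generated candidates. It does not reproduce `calc_hr.py`; the function name `hit_rate_at_k`, the in-memory list format, and the SID strings are hypothetical, and the real script reads its inputs from the CSV and JSONL files passed on the command line.

```python
def hit_rate_at_k(predictions, targets, k):
    """Fraction of samples whose target appears in the top-k predictions."""
    if not targets:
        return 0.0
    hits = sum(
        1 for preds, target in zip(predictions, targets) if target in preds[:k]
    )
    return hits / len(targets)


# Hypothetical SIDs written as strings; each row is a ranked candidate list.
predictions = [
    ["C1_5_9", "C1_5_2", "C3_0_7"],
    ["C2_2_2", "C4_1_1", "C0_0_0"],
]
targets = ["C1_5_2", "C9_9_9"]

print(hit_rate_at_k(predictions, targets, k=1))  # 0.0
print(hit_rate_at_k(predictions, targets, k=3))  # 0.5
```

Because several items may map to the same SID, a real evaluation also has to decide whether a hit is counted at the SID level or the item level; this sketch compares identifiers directly.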

Roadmap

Generative Retrieval Training
SID Generation
Data Processing

See the open issues for a full list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request


License

Distributed under the project_license. See LICENSE.txt for more information.

Contact

If you have any questions or run into difficulties, please contact us via GitHub Issues. We aim to respond promptly and help you get up and running with generative recommendation quickly.

Citing this work

Please cite the following paper if you find our code helpful.

@misc{fu2025forge,
      title={FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets}, 
      author={Kairui Fu and Tao Zhang and Shuwen Xiao and Ziyang Wang and Xinming Zhang and Chenchi Zhang and Yuliang Yan and Junjun Zheng and Yu Li and Zhihong Chen and Jian Wu and Xiangheng Kong and Shengyu Zhang and Kun Kuang and Yuning Jiang and Bo Zheng},
      year={2025},
      eprint={2509.20904},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2509.20904}, 
}
