[Pytorch] The repo contains the code for "FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets"

selous123/al_sid

FORGE

FOrming semantic identifieRs for Generative retriEval in Industrial Datasets
Explore full dataset in Huggingface »

View Demo · Report Bug · Request Feature

About The Project

Semantic identifiers (SIDs) have gained increasing interest in generative retrieval (GR) due to their meaningful semantic discriminability. Existing studies typically rely on arbitrarily defined SIDs while neglecting the influence of SID configurations on GR. Besides, evaluations conducted on datasets with limited multimodal features and behaviors also hinder their reliability in industrial traffic. To address these limitations, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieRs for Generative retriEval in industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multi-modal features of 250 million items collected from an e-commerce platform, enabling researchers to construct and evaluate their own SIDs. Leveraging this dataset, we systematically explore various strategies for SID generation and validate their effectiveness across different settings and tasks. Extensive online experiments show 8.93% and 0.35% improvements in PVR and transaction count, highlighting the practical value of our approach. Notably, we propose two novel metrics of SID that correlate well with GR performance, providing insights into a convenient measurement of SID quality without training GR. Subsequent offline pretraining also offers support for online convergence in industrial applications. The code and data are available at code repo.

Getting Started

To get a local copy up and running, follow the steps below.

Prerequisites

You need Python 3.8+ and PyTorch 1.10+; the remaining dependencies are listed in requirements.txt.

  1. Clone the repository:
    git clone repo_name
    cd repo_name
  2. Install dependencies:
  • Python 3.8+
  • PyTorch 1.10+
  • requirements.txt
    pip install -r requirements.txt

Dataset Description (Demo)

  1. For SID Generation Task:
wget -P datas/ https://mvap-public-data.oss-cn-zhangjiakou.aliyuncs.com/ICLR_2026_data/reconstruct_data_mask.npz
wget -P datas/ https://mvap-public-data.oss-cn-zhangjiakou.aliyuncs.com/ICLR_2026_data/contrastive_data_mask.npz

Dataset Preview:

  • 10m_80msideinfo_feat.npz for contrastive task in Eq.2

The file contains three components:

(1) a deduplicated mapping table between item IDs and their corresponding indices ("itemEncID") with shape (6,844,930, 2), where each row is an [item_id, index] pair (e.g., [855036080309, 0]);

(2) a list of item pairs ("pairs") with shape (9,509,084, 2), representing co-occurrence or association relationships between items (e.g., [855036080309, 545092516562]); and

(3) a deduplicated embedding matrix ("embeds") with shape (6,844,930, 512), where each row is a 512-dimensional vector representation of an item (e.g., [xx, xx,..., xx]).

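import numpy as np
import os

# 1. Load the .npz file
# "~" is not expanded by os.path.join or np.load, so expand it explicitly.
dirpath = os.path.expanduser('~/git/al_sid/SID_generation/datas')  # replace with your data directory
filename2 = '10m_80msideinfo_feat.npz'  # replace with your filename
file_path = os.path.join(dirpath, filename2)
# allow_pickle=True is required because itemEncID is stored as a pickled object.
data = np.load(file_path, allow_pickle=True)
itemEncID, pairs, embeds = data['itemEncID'].item(), data['pairs'], data['embeds'].astype(np.float32)

# 2. Print the first item_id -> index entries
for key, item in itemEncID.items():
    print(key, item)  # e.g. 855036080309 0, then 545092516562 1
    if item > 100:
        break
print("pairs:", pairs.shape, pairs[:1])  # (9509084, 2) [[855036080309 545092516562]]
print("embeds:", embeds.shape, embeds[:1])  # (6844930, 512) [[xx,xx,...,xx]]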
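Once the three arrays are loaded, a natural next step for the contrastive task is to turn each item-ID pair into a pair of embedding rows. The following is a minimal sketch on a tiny synthetic stand-in for the real arrays. It assumes that `itemEncID` behaves like a dict from item ID to row index, as the loading code above suggests. The helper name `pair_cosine_similarity` and the toy values are hypothetical and are not part of the repository.

```python
import numpy as np

def pair_cosine_similarity(item_to_index, pairs, embeds):
    """Return the cosine similarity between the embeddings of each item-ID pair."""
    # Translate item IDs into row indices of the embedding matrix.
    left = np.array([item_to_index[int(a)] for a, _ in pairs])
    right = np.array([item_to_index[int(b)] for _, b in pairs])
    # L2-normalise the rows so that a dot product equals cosine similarity.
    normed = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    return np.sum(normed[left] * normed[right], axis=1)

# Tiny synthetic stand-in for the real arrays (all values are made up).
item_to_index = {855036080309: 0, 545092516562: 1, 111: 2}
pairs = np.array([[855036080309, 545092516562], [855036080309, 111]])
embeds = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]], dtype=np.float32)

sims = pair_cosine_similarity(item_to_index, pairs, embeds)
print(sims)  # one similarity per pair
```

On the full file, the list comprehensions would be slow for 9.5 million pairs; a vectorised lookup (for example via a sorted ID array and `np.searchsorted`) would be a reasonable replacement.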
  • 5mold_80msideinfo_feat.npz for reconstruction task in Eq.3
import numpy as np
import os

# 1. Load the .npz file
# "~" is not expanded by os.path.join or np.load, so expand it explicitly.
dirpath = os.path.expanduser('~/git/al_sid/SID_generation/datas')  # replace with your data directory
filename1 = '5mold_80msideinfo_feat.npz'  # replace with your filename
file_path = os.path.join(dirpath, filename1)
data = np.load(file_path)
print("Available arrays:", data.files)  # Available arrays: ['ids', 'embeds']

# 2. Print the shape and first row of each array
for key in data:
    print(f"{key}: {data[key].shape}")
    print(data[key][:1])
data.close()
# ids: (4148316,) [813799260043]
# embeds: (4148316, 512) [xx,xx,...,xx]
  2. Seq Data for Generative Task:
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
dataset = load_dataset("AL-GR/AL-GR-Tiny", data_files="train_data/s1_tiny.csv", split="train")

Data Preview: https://huggingface.co/datasets/AL-GR/AL-GR-Tiny

SID Generation

  1. Training the Model

To start distributed training, use the following command:

python -m torch.distributed.launch --nnodes=2 --nproc_per_node=1 --master_port=27646 train.py --output_dir=/path/to/output --save_prefix=MODEL_NAME --cfg=configs/rqvae_i2v.yml
  2. Parameters
  • --cfg: Path to the configuration file.
  • --output_dir: Directory for model outputs.
  • --save_prefix: Prefix for saving the model.
  3. Testing the Model

Use the following command to start testing:

python infer_SID.py
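
The config name `rqvae_i2v.yml` suggests a residual-quantization model. As a rough illustration of how residual quantization turns an embedding into a multi-level SID, the sketch below quantizes a vector against a stack of codebooks, where each level encodes the residual left by the previous one. This is a generic sketch and not the repository's model: the codebooks are random rather than learned, and the function name `residual_quantize` and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization: one code index per level plus the reconstruction."""
    residual = x.astype(np.float64)        # astype returns a copy, so x is not modified
    recon = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:             # each codebook has shape (num_codes, dim)
        # Pick the code vector closest (in L2 distance) to the current residual.
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon += codebook[idx]
        residual -= codebook[idx]
    return codes, recon

rng = np.random.default_rng(0)
dim, num_codes, levels = 8, 16, 3          # illustrative sizes only
codebooks = [rng.normal(size=(num_codes, dim)) / (2 ** level) for level in range(levels)]
x = rng.normal(size=dim)

sid, recon = residual_quantize(x, codebooks)
print("SID:", sid)
print("reconstruction error:", np.linalg.norm(x - recon))
```

The list of per-level indices plays the role of the SID: items whose embeddings are close tend to share the leading indices and differ only in the later, finer levels.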

Generative Retrieval

  1. Clone the repo
    git clone repo_name
    cd repo_name/algr
  2. Run the training script:
    python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 runner.py --config=config/t5base_3layer_tiny.json
    
  3. Run the prediction script:
    python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 runner.py --config=config/generate_t5base_3layer_tiny.json
    
  4. Calculate the hit rate:
    # nebula test: python calc_hr.py --dataset_name=/home/admin/.cache/huggingface/modules/datasets_modules/datasets/AL-GR--AL-GR-Tiny/25dea07242891a2d --nebula
    1. python calc_hr.py --item_sid_file=item_info/tiny_item_sid_final.csv --generate_file=logs/generate_t5base_3layer_tiny/output.jsonl
    2. python calc_hr.py --item_sid_file=item_info/tiny_item_sid_final.csv --generate_file=logs/generate_qwen2.5_05b_3layer_tiny/output.jsonl --decoder_only
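
For orientation, hit rate at K is commonly defined as the fraction of test samples whose ground-truth item appears among the top-K generated candidates. The sketch below illustrates that definition on in-memory lists. It is not the logic of `calc_hr.py`, whose input format is not described here; `hit_rate_at_k` is a hypothetical helper and the SID values are made up.

```python
def hit_rate_at_k(predictions, targets, k):
    """Fraction of samples whose target appears in the first k predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for preds, target in zip(predictions, targets) if target in preds[:k])
    return hits / len(targets)

# Each SID is written as a tuple of code indices (made-up values).
predictions = [
    [(1, 4, 2), (1, 4, 7), (3, 0, 5)],   # ranked candidates for sample 0
    [(2, 2, 2), (0, 1, 9), (6, 6, 1)],   # ranked candidates for sample 1
]
targets = [(1, 4, 7), (9, 9, 9)]

print(hit_rate_at_k(predictions, targets, k=1))  # 0.0
print(hit_rate_at_k(predictions, targets, k=3))  # 0.5
```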
    

Roadmap

  • Generative Retrieval Training
  • SID Generation
  • Data Processing

See the open issues for a full list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the project_license. See LICENSE.txt for more information.

Contact

If you have any questions or encounter difficulties, please contact us via GitHub Issues. We aim to respond promptly and to help you get up and running with generative recommendation quickly.

Citing this work

Please cite the following paper if you find our code helpful.

@misc{fu2025forge,
      title={FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets}, 
      author={Kairui Fu and Tao Zhang and Shuwen Xiao and Ziyang Wang and Xinming Zhang and Chenchi Zhang and Yuliang Yan and Junjun Zheng and Yu Li and Zhihong Chen and Jian Wu and Xiangheng Kong and Shengyu Zhang and Kun Kuang and Yuning Jiang and Bo Zheng},
      year={2025},
      eprint={2509.20904},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2509.20904}, 
}
