Junwei Zhu, Xiaobin Hu, Shuicheng Yan
Our proposed TokenAR achieves strong results on multi-subject image generation and consists of three core components:
- (a) Token Index Embedding: Clusters token indices to better represent images of the same reference identity.
- (b) Instruct Token Injection: Acts as an additional container for visual features, injecting detailed and complementary priors into the reference tokens.
- (c) Identity-Token Disentanglement (ITD): Explicitly guides the model to learn separate token representations for the features of each unique identity.
This approach enables strong identity consistency in the generated images while preserving high-quality background reconstruction.
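For intuition, here is a minimal PyTorch-style sketch of how these three components could be wired together. All names (TokenARConditioning, identity_disentangle_loss, the 0.02 init scale, etc.) are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenARConditioning(nn.Module):
    """Illustrative sketch of (a) Token Index Embedding and (b) Instruct Token Injection."""

    def __init__(self, dim: int, max_ref_num: int = 4, instruct_token_num: int = 120):
        super().__init__()
        # (a) One learned embedding per reference-identity slot, shared by every
        #     token that comes from the same reference image.
        self.index_embed = nn.Embedding(max_ref_num, dim)
        # (b) Learnable instruct tokens acting as an extra container for
        #     complementary visual priors.
        self.instruct_tokens = nn.Parameter(torch.randn(instruct_token_num, dim) * 0.02)

    def forward(self, ref_tokens: torch.Tensor, ref_ids: torch.Tensor) -> torch.Tensor:
        # ref_tokens: (B, N, dim) embeddings of the reference image tokens
        # ref_ids:    (B, N) identity index in [0, max_ref_num) for each token
        ref_tokens = ref_tokens + self.index_embed(ref_ids)                      # (a)
        instruct = self.instruct_tokens.unsqueeze(0).expand(ref_tokens.size(0), -1, -1)
        return torch.cat([instruct, ref_tokens], dim=1)                          # (b)

def identity_disentangle_loss(ref_tokens: torch.Tensor, ref_ids: torch.Tensor) -> torch.Tensor:
    # (c) A simple proxy for Identity-Token Disentanglement: push the mean token
    #     embedding of each identity away from the others.
    means = torch.stack([ref_tokens[ref_ids == i].mean(dim=0) for i in ref_ids.unique()])
    means = F.normalize(means, dim=-1)
    sim = means @ means.t()                                  # pairwise cosine similarity
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    return off_diag.abs().mean()                             # lower = better separated
```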
- ✅ October 5, 2025: We release the TokenAR framework.
- Environment Setup. Please follow install.sh to install the packages listed in requirements.txt. Then download all pre-trained checkpoints as instructed below.
- Download the text encoder model flan-t5-xl and put it as ./pretrained_models/t5-ckpt/flan-t5-xl. Download the VQ-VAE model vq_ds16_t2i.pt from LlamaGen and put it as ./pretrained_models/vq_ds16_t2i.pt.
- (Required for training) Download the pre-trained text-to-image model t2i_XL_stage2_512.pt from LlamaGen and put it as ./pretrained_models/t2i_XL_stage2_512.pt.
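A quick, optional sanity check (an illustrative snippet, not one of the repo scripts) that the checkpoints are in the expected locations:

```python
import os
from transformers import AutoTokenizer

# Expected checkpoint locations, matching the paths above.
paths = {
    "flan-t5-xl": "./pretrained_models/t5-ckpt/flan-t5-xl",
    "vq_ds16_t2i.pt": "./pretrained_models/vq_ds16_t2i.pt",
    "t2i_XL_stage2_512.pt (training only)": "./pretrained_models/t2i_XL_stage2_512.pt",
}

for name, path in paths.items():
    print(f"{name:40s} {'found' if os.path.exists(path) else 'MISSING'}  ({path})")

# The text-encoder directory should also load with Hugging Face transformers.
tokenizer = AutoTokenizer.from_pretrained(paths["flan-t5-xl"])
print("flan-t5-xl tokenizer vocabulary size:", tokenizer.vocab_size)
```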
Please run the following script to generate a single image. Put the source image and instruction text in ./examples as demonstrated, then run:
```bash
python3 autoregressive/sample/sample_edit_example_plus.py \
    --gpt-ckpt "${ckpt}" \
    --add_ref_embed \
    --multi-cond \
    --cfg-scale 3 \
    --seed 83 \
    --max_ref_num 4 \
    --additional-info "${CKPT_NAME}" \
    --device "${device}" \
    --concat-target \
    --dataset "./examples" \
    --instruct-token-mode casual \
    --instruct-token-num "${TOKENNUM}"
```
Data Preparation. For image editing, download SpatialSubject200K and the InstructAR Dataset (coming soon).
After preprocessing, the folder should look like:
```
./data/
    example1/
        ref/
            ref1.png
            ...
        background.png
        real.png
        description.txt
    example2/
        ...
```

We provide an example training script in train_concat_instruct.sh. Please modify train_concat_instruct.sh accordingly to run on your system. A minimal check of this folder layout is sketched below.
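To catch layout mistakes early, a check of the folder structure might look like the following (the check_example helper and the ./data root below are illustrative, not part of the released code):

```python
import os

def check_example(example_dir: str) -> list[str]:
    """Return a list of problems found in one training example folder."""
    problems = []
    for name in ("background.png", "real.png", "description.txt"):
        if not os.path.isfile(os.path.join(example_dir, name)):
            problems.append(f"missing {name}")
    ref_dir = os.path.join(example_dir, "ref")
    if not (os.path.isdir(ref_dir) and os.listdir(ref_dir)):
        problems.append("no reference images in ref/")
    return problems

data_root = "./data"
for example in sorted(os.listdir(data_root)):
    example_dir = os.path.join(data_root, example)
    if not os.path.isdir(example_dir):
        continue
    issues = check_example(example_dir)
    if issues:
        print(f"{example}: {', '.join(issues)}")
```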
Existing multi-subject datasets often suffer from:
- Limited scale and relational diversity
- Insufficient pose variation
- Inaccurate segmentation masks
- Lack of quality control
Our dataset is built using a multi-stage pipeline to ensure scale, quality, and precision.
- Image Synthesis: We use a relation-guided generative model (DreamRelation) to synthesize a large and diverse corpus of images. Each image is generated from two reference subjects and a text prompt describing their interaction. The subject data can be found in RelationBench and human_parsing_dataset.
- Foreground & Background Extraction: We use an automated tool (BNE2) to segment the foreground subjects from the background in the generated images. These masks are then manually refined to ensure high accuracy. Code is provided in ./scripts/background_segment.py.
- Rigorous Two-Stage Filtering: To guarantee the quality and fidelity of our dataset, every sample undergoes a strict filtering process:
  - Semantic Filtering: A Vision-Language Model (gemma-3n-E4B-it) validates that the generated image content accurately matches the text prompt. Samples that fail this check are discarded.
  - Identity Filtering: We use DINOv2 feature similarity to confirm that the subjects in the generated image are visually consistent with the original reference images. This ensures high identity preservation. Code for both filters is provided in ./scripts/gemma_filter.py and ./scripts/dinov2_filter; a minimal sketch of the identity filter is shown below.
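For reference, the identity-filtering step might be sketched as below; the facebook/dinov2-base checkpoint, the 0.6 similarity threshold, and the passes_identity_filter helper are assumptions for illustration, not the exact settings used to build the dataset.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# DINOv2 backbone used as a global appearance descriptor.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_feature(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    cls = model(**inputs).last_hidden_state[:, 0]   # CLS token as the image descriptor
    return F.normalize(cls, dim=-1)

def passes_identity_filter(ref_path: str, gen_path: str, threshold: float = 0.6) -> bool:
    # Keep the sample only if the generated subject stays close to the reference.
    similarity = (dino_feature(ref_path) * dino_feature(gen_path)).sum().item()
    return similarity >= threshold

# e.g. passes_identity_filter("./data/example1/ref/ref1.png", "./data/example1/real.png")
```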
The folder ends up looking like:

```
./data/
    example1/
        ref/
            ref1.png
            ...
        background.png
        real.png
        description.txt
    example2/
```

To optimize the efficacy of the Instruct Token Injection mechanism, we performed an extensive ablation study on the number of instruct tokens used during training and inference.
This study confirms that a concise sequence of 120 instruct tokens strikes the best balance, delivering precise, targeted visual priors without introducing redundancy or instability, and keeping the instructional guidance effective throughout the transformer blocks.
The implementation is mainly built on top of EditAR. We also want to thank the authors of DreamRelation, UniCombine, and DINOv2 for releasing their code.
The majority of this project is licensed under the MIT License. Portions of the project are under the separate licenses of the referenced projects.
@article{sun2025tokenar,
  title={TokenAR: Multiple Subject Generation via Autoregressive Token-level Enhancement},
  author={Haiyue Sun and Qingdong He and Jinlong Peng and Peng Tang and Jiangning Zhang and Junwei Zhu and Xiaobin Hu and Shuicheng Yan},
  journal={arXiv preprint arXiv:2510.16332},
  year={2025}
}