[NeurIPS 2025 Spotlight] Tensor ProducT ATTenTion (TPA) Transformer (T6) is a state-of-the-art transformer model that uses Tensor Product Attention (TPA) to improve performance and reduce KV cache size. This repository provides tools for data preparation, model pretraining, and evaluation to support research and development with the T6 architecture.
This repository contains the official code for the paper "Tensor Product Attention Is All You Need".
Authors: Yifan Zhang*, Yifeng Liu*, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
- [09/18/2025] Our paper was accepted as a NeurIPS 2025 Spotlight!
- [07/04/2025] The prefilling code for FlashTPA is available.
- [06/13/2025] The decoding code for FlashTPA is available.
- [05/29/2025] Our paper has been updated on arXiv with FlashTPA Decoding: https://arxiv.org/abs/2501.06425.
- [01/11/2025] Our code is open-sourced!
- [01/11/2025] Our paper is released on arXiv: https://arxiv.org/abs/2501.06425.
- Tensor Product Attention: Represents queries, keys, and values with contextual low-rank tensor product factors, which improves model performance and shrinks the KV cache.
- Scalability: Efficient training procedures optimized for large-scale datasets and multi-GPU setups.
- Flexible Data Support: Compatible with popular datasets like Fineweb-Edu-100B and OpenWebText.
- Comprehensive Evaluation: Integrated with lm-evaluation-harness for standardized benchmarking.
- FlashTPA Decoding: See Algorithms 2 and 3 in the paper for the FlashTPA Decoding algorithms, and ./decode for the Python and Triton implementations.
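To make the KV cache claim concrete, here is a back-of-the-envelope sketch in plain Python. It assumes, based on our reading of the paper, that TPA caches only the rank factors of each key and value (one factor over heads, one over the head dimension) instead of the full per-head vectors. The head count, head dimension, and ranks below are hypothetical values chosen for illustration, not settings taken from this repository.

```python
def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Numbers cached per token per layer by standard multi-head attention (keys + values)."""
    return 2 * n_heads * head_dim


def tpa_cache_per_token(n_heads: int, head_dim: int, rank_k: int, rank_v: int) -> int:
    """Numbers cached per token per layer when only the low-rank factors are stored.

    Assumption: each key (value) is a sum of rank_k (rank_v) outer products of a
    head-axis factor of length n_heads and a feature-axis factor of length head_dim.
    """
    return (rank_k + rank_v) * (n_heads + head_dim)


if __name__ == "__main__":
    # Hypothetical shape, for illustration only.
    n_heads, head_dim = 16, 64
    mha = mha_cache_per_token(n_heads, head_dim)
    tpa = tpa_cache_per_token(n_heads, head_dim, rank_k=2, rank_v=2)
    print(f"MHA: {mha} numbers/token/layer")
    print(f"TPA: {tpa} numbers/token/layer")
    print(f"Reduction: {mha / tpa:.1f}x")
```

Under these assumptions the saving grows with the number of heads and the head dimension, and shrinks as the ranks increase.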
A100 or H100 GPUs are recommended. At least 8 × 80 GB of VRAM is needed.
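A quick way to check a machine against this requirement is to query nvidia-smi. The sketch below is not part of the repository. The 79,000 MiB threshold is an assumption, chosen because cards sold as "80 GB" report slightly different totals, and the helper names are hypothetical.

```python
import shutil
import subprocess

MIN_GPUS = 8
MIN_MIB_PER_GPU = 79_000  # assumed threshold for an "80 GB" card


def parse_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one memory total (in MiB) per line, skipping anything non-numeric."""
    values = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def meets_requirement(memory_mib: list[int],
                      min_gpus: int = MIN_GPUS,
                      min_mib: int = MIN_MIB_PER_GPU) -> bool:
    """True if at least min_gpus devices each report min_mib or more."""
    return sum(1 for m in memory_mib if m >= min_mib) >= min_gpus


def query_gpus() -> list[int]:
    """Return per-GPU total memory in MiB, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_memory_mib(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    gpus = query_gpus()
    print(f"Found {len(gpus)} GPU(s): {gpus}")
    print("Meets the recommended setup:", meets_requirement(gpus))
```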
Ensure you have Python 3.10 or higher installed. It's recommended to use a virtual environment to manage dependencies.
- Clone the Repository
    git clone https://github.com/tensorgi/TPA.git
    cd TPA
- Create and Activate a Virtual Environment
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install Required Packages
    pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm
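After installation, a short script can confirm that the interpreter and packages are in place before you start a long data preparation or training run. This is a minimal sketch using only the standard library. The package list mirrors the pip command above, and the function names are hypothetical.

```python
import importlib.util
import sys

REQUIRED_PACKAGES = ["torch", "numpy", "transformers", "datasets", "tiktoken", "wandb", "tqdm"]


def python_ok(version=sys.version_info, minimum=(3, 10)) -> bool:
    """True if the (major, minor) version is at least the required minimum."""
    return tuple(version[:2]) >= minimum


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that cannot be found on the import path (nothing is imported)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    missing = missing_packages(REQUIRED_PACKAGES)
    print("Missing packages:", missing if missing else "none")
```

Note that this only checks that each package is discoverable. It does not verify versions, so a torch build other than 2.4.0 would still pass.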
Prepare the necessary datasets before pretraining the model. T6 supports both Fineweb-Edu-100B and OpenWebText.
Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.
- Navigate to the Data Directory
    cd data/fineweb-edu
- Run the Data Preparation Script
    python fineweb-edu.py
- Move the Prepared Data
    mv fineweb-edu100B ..
    cd ../..
OpenWebText is an open reproduction of OpenAI's WebText dataset.
- Run the Data Preparation Script
    python data/openwebtext/prepare.py
  Ensure you have sufficient storage and computational resources, as OpenWebText is sizable.
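To plan disk space, you can estimate the size of the tokenized output ahead of time. The sketch below assumes token IDs are stored as 2-byte unsigned integers, as nanoGPT's preparation scripts do for the GPT-2 vocabulary. Whether the scripts in this repository use the same layout is an assumption, and the estimate excludes the raw download and any dataset cache.

```python
def token_file_bytes(num_tokens: int, bytes_per_token: int = 2) -> int:
    """Size in bytes of a flat token file (assumes 2 bytes per token ID by default)."""
    return num_tokens * bytes_per_token


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # 100 billion tokens, the scale suggested by the Fineweb-Edu-100B name.
    size = token_file_bytes(100_000_000_000)
    print(f"{size:,} bytes, about {to_gib(size):.0f} GiB")
```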
Pretrain the T6 model using the prepared datasets. The provided scripts support distributed training across multiple GPUs.
- Using the Provided Bash Script
  Execute the pretraining script, which handles the training process.
    bash pretrain.sh
- Manual Execution with torchrun
  For more control or customization, use torchrun to initiate training. Replace config/train_T6_medium_adam_80g8.py with your desired configuration file.
    torchrun --standalone --nproc_per_node=8 \
        train_adam_fw.py \
        config/train_T6_medium_adam_80g8.py
  - The --nproc_per_node=8 flag specifies the number of processes (typically matching the number of GPUs).
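If you launch runs from Python, for example to sweep over configuration files, the same command can be assembled programmatically. This is a minimal sketch. The helper name is hypothetical, and it only builds and prints the command, leaving execution to you.

```python
import shlex


def build_torchrun_command(num_gpus: int,
                           config_path: str,
                           script: str = "train_adam_fw.py") -> list[str]:
    """Build a single-node torchrun command with one process per GPU."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={num_gpus}",
        script,
        config_path,
    ]


if __name__ == "__main__":
    cmd = build_torchrun_command(8, "config/train_T6_medium_adam_80g8.py")
    print(shlex.join(cmd))
    # To actually start training: subprocess.run(cmd, check=True)
```

The provided configuration appears to target eight 80 GB GPUs, so running with a different process count may also require adjusting the configuration file.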
 
Evaluate the performance of the pretrained T6 model using standardized benchmarks.
- Navigate to the Evaluation Harness Directory
    cd lm-evaluation-harness
- Follow the Instructions Within This Directory
  Ensure your model is compatible with the evaluation harness requirements.
- Karpathy’s nanoGPT provides the foundational codebase upon which this repo is built.
- Hugging Face for providing the Fineweb-Edu-100B dataset.
- EleutherAI for the lm-evaluation-harness.
- OpenWebText team for replicating the WebText dataset.
If you use Tensor Product Attention (TPA) or the Tensor ProducT ATTenTion Transformer (T6) in your research or application, please consider citing it!
@article{zhang2025tensor,
    title={Tensor Product Attention Is All You Need},
    author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Qin, Zhen and Yuan, Yang and Gu, Quanquan and Yao, Andrew Chi-Chih},
    journal={arXiv preprint arXiv:2501.06425},
    year={2025},
}