Official implementation of the paper "Group Representational Position Encoding (GRAPE)".
GRAPE is a unified group-theoretic framework for positional encoding that subsumes multiplicative mechanisms (like RoPE) and additive mechanisms (like ALiBi and FoX) under a single mathematical formalism. It does so by leveraging group actions: rotations in $SO(d)$ for the multiplicative family and unipotent maps in $GL(d+k)$ for the additive family.
Authors: Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
[Webpage] [Huggingface]
Positional information is essential for sequence modeling with Transformers. We present GRAPE, a framework that unifies two families of mechanisms:
- Multiplicative GRAPE ($SO(d)$): Models positions as norm-preserving rotations generated by rank-2 skew-symmetric matrices. It recovers RoPE exactly when the planes are canonical coordinate pairs with a log-uniform spectrum. It further extends to learned commuting subspaces and compact non-commuting mixtures.
- Additive GRAPE ($GL(d+k)$): Models positions as unipotent actions in a lifted homogeneous space. It recovers ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability.
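To make the multiplicative case concrete, here is a minimal numerical sketch (helper names are ours, not the repository's API): a rank-2 rotation applied through the Rodrigues-type closed form $\exp(\theta B) = I + \sin(\theta)\,B + (1-\cos(\theta))\,B^2$ for the generator $B = uv^\top - vu^\top$ with orthonormal $u, v$, together with a check that canonical coordinate planes and a log-uniform spectrum reproduce standard RoPE.

```python
import math
import torch

def rank2_rotation(x, u, v, theta):
    """Rotate x by angle theta in the plane spanned by orthonormal u, v.

    Uses the Rodrigues-type closed form exp(theta*B) = I + sin(theta)*B + (1 - cos(theta))*B^2
    for the rank-2 skew-symmetric generator B = u v^T - v u^T (which satisfies B^3 = -B).
    Helper name and conventions are ours, for illustration only.
    """
    ux, vx = x @ u, x @ v                          # <u, x>, <v, x>
    Bx = vx * u - ux * v                           # B x
    B2x = -(ux * u + vx * v)                       # B^2 x = minus the projection onto span{u, v}
    return x + math.sin(theta) * Bx + (1.0 - math.cos(theta)) * B2x

# Sanity check: canonical coordinate planes + log-uniform spectrum = standard RoPE.
torch.manual_seed(0)
d, pos, base = 8, 5, 10000.0
x = torch.randn(d)
eye = torch.eye(d)

out = x.clone()
ref = x.clone()
for i in range(d // 2):
    theta = pos * base ** (-2 * i / d)             # log-uniform frequency spectrum, as in RoPE
    # Plane orientation chosen so the rotation matches RoPE's sign convention.
    out = rank2_rotation(out, eye[2 * i + 1], eye[2 * i], theta)
    c, s = math.cos(theta), math.sin(theta)
    a, b = x[2 * i], x[2 * i + 1]
    ref[2 * i], ref[2 * i + 1] = a * c - b * s, a * s + b * c

print(torch.allclose(out, ref, atol=1e-5))         # True
```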
Positions act on queries and keys through group representations, so an exact relative-position law holds in both families. We derive a closed-form Rodrigues-type formula for fast evaluation of the rank-2 rotations; additive biases are realized via a homogeneous lift to $GL(d+k)$.
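To see how an additive, exactly relative bias can fall out of a unipotent action on a lifted space, here is a toy construction of our own (not necessarily the parameterization used in the paper or code; the slope `m` and the two extra coordinates are illustrative). Position-dependent unipotent matrices act on lifted queries and keys, and the lifted dot product acquires an ALiBi-style bias $-m\,(i-j)$.

```python
import torch

d, m = 4, 0.5                                     # head dim and ALiBi-style slope (toy values)

def G_q(n):
    # Unipotent query action: identity plus n * N with N nilpotent (N @ N == 0),
    # so G_q(a) @ G_q(b) == G_q(a + b) -- a one-parameter unipotent subgroup of GL(d+2).
    N = torch.zeros(d + 2, d + 2)
    N[d + 1, d] = -m
    return torch.eye(d + 2) + n * N

def G_k(n):
    # Matching unipotent key action.
    N = torch.zeros(d + 2, d + 2)
    N[d, d + 1] = m
    return torch.eye(d + 2) + n * N

q, k = torch.randn(d), torch.randn(d)
i, j = 7, 3                                       # query / key positions
q_lift = G_q(i) @ torch.cat([q, torch.tensor([1.0, 0.0])])   # -> (q, 1, -m*i)
k_lift = G_k(j) @ torch.cat([k, torch.tensor([0.0, 1.0])])   # -> (k, m*j, 1)

logit = q_lift @ k_lift
print(torch.isclose(logit, q @ k - m * (i - j)))  # tensor(True): exact relative additive bias
```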
- Scalability: Efficient training procedures optimized for large-scale datasets and multi-GPU setups.
- Flexible Data Support: Compatible with popular datasets like Fineweb-Edu-100B and OpenWebText.
- Comprehensive Evaluation: Integrated with lm-evaluation-harness for standardized benchmarking.
A100 or H100 GPUs are recommended. At least 8×80 GB of VRAM is needed.
Ensure you have Python 3.10 or higher installed. It's recommended to use a virtual environment to manage dependencies.
- Clone the Repository

  ```bash
  git clone https://github.com/model-architectures/GRAPE.git
  cd GRAPE
  ```

- Create and Activate a Virtual Environment

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install Required Packages

  ```bash
  pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm
  ```
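Optionally, verify the environment before moving on; this quick check is our own addition, not part of the repository:

```python
import torch

# Quick sanity check of the environment before launching multi-GPU training.
print("torch:", torch.__version__)                  # expect 2.4.0 per the install step above
print("CUDA available:", torch.cuda.is_available())
print("GPUs:", torch.cuda.device_count())           # 8x A100/H100 (80 GB) recommended
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```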
Prepare the necessary datasets before pretraining the model. This codebase supports both Fineweb-Edu-100B and OpenWebText.
Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.
- Navigate to the Data Directory

  ```bash
  cd data/fineweb-edu
  ```

- Run the Data Preparation Script

  ```bash
  python fineweb-edu.py
  ```

- Move the Prepared Data

  ```bash
  mv fineweb-edu100B ..
  cd ../..
  ```
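To spot-check the prepared data, something like the following can help. It assumes the preparation script follows the nanoGPT convention of writing GPT-2 BPE token ids as `uint16` into `train.bin`/`val.bin` under the dataset folder; the file names and dtype are assumptions, so adjust them to match the script's actual output:

```python
import numpy as np
import tiktoken

# Hypothetical spot check of the prepared shards (paths/dtype assume the nanoGPT layout).
tokens = np.memmap("data/fineweb-edu100B/train.bin", dtype=np.uint16, mode="r")
print(f"{len(tokens):,} training tokens")

enc = tiktoken.get_encoding("gpt2")
print(enc.decode(tokens[:64].tolist()))   # peek at the first few decoded tokens
```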
OpenWebText is an open reproduction of OpenAI's WebText dataset.
- Run the Data Preparation Script

  ```bash
  python data/openwebtext/prepare.py
  ```
Ensure you have sufficient storage and computational resources, as OpenWebText is sizable.
Pretrain GRAPE and baseline models using the prepared datasets. The provided scripts support distributed training across multiple GPUs.
- Using the Provided Bash Script

  Execute the pretraining script, which handles the training process.

  ```bash
  bash pretrain.sh
  ```

- Manual Execution with `torchrun`

  For more control or customization, use `torchrun` to initiate training. Replace `train_llama_mha_rope_medium_adam_80g8.py` with your desired configuration file.

  ```bash
  torchrun --standalone --nproc_per_node=8 \
      train_adam_finewebedu.py \
      config/train_llama_mha_rope_medium_adam_80g8.py
  ```

  `--nproc_per_node=8` specifies the number of processes (typically matching the number of GPUs).
By default the training scripts use `data_rng_mode=stateless`, which makes the exact per-rank training batches repeatable across runs under DDP (and when resuming from a checkpoint). For full reproducibility, fix `seed`/`data_seed`/`eval_seed`:
```bash
torchrun --standalone --nproc_per_node=8 \
    train_adam_finewebedu.py \
    config/train_llama_mha_rope_medium_adam_80g8.py \
    --seed=42 --data_seed=42 --eval_seed=42
```

`data_rng_mode=stateful` is also deterministic, and checkpoints the per-rank data RNG stream for exact resume via `data_rng_state.pt` (single-process) or `data_rng_state_rank{RANK}.pt` (DDP), saved alongside `optimizer.pt`. When resuming with `data_rng_mode=stateful`, keep these files and use the same `WORLD_SIZE`/`RANK` mapping.
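For reference, the mechanism behind the stateful mode can be pictured with a short sketch of our own (paths and names are illustrative, not the repository's exact code): a per-rank `torch.Generator` drives data sampling, and its state is saved and restored around checkpoints.

```python
import os
import torch

# Illustration of the mechanism (not the repository's exact code): a per-rank data RNG
# stream whose state is checkpointed for exact resume, as in data_rng_mode=stateful.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
ckpt_dir = "out"  # hypothetical checkpoint directory

data_rng = torch.Generator().manual_seed(42 + rank)  # per-rank data sampling stream
# ... sample batch offsets with torch.randint(..., generator=data_rng) during training ...

# At checkpoint time, save the stream's state alongside optimizer.pt.
fname = "data_rng_state.pt" if world_size == 1 else f"data_rng_state_rank{rank}.pt"
torch.save(data_rng.get_state(), os.path.join(ckpt_dir, fname))

# On resume (same WORLD_SIZE/RANK mapping), restore the stream before continuing.
data_rng.set_state(torch.load(os.path.join(ckpt_dir, fname)))
```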
If you need stronger end-to-end determinism beyond data order (at some performance cost), pass `--deterministic=True` to enable PyTorch deterministic kernels.

For best-effort bitwise-identical training (and exact resume), pass `--bitwise_deterministic=True`. This forces deterministic kernels, disables `torch.compile`, disables TF32, forces SDPA to use the math kernel, and checkpoints per-rank RNG + GradScaler state alongside `optimizer.pt` (it requires the same PyTorch/CUDA/hardware and the same `WORLD_SIZE`/`RANK` mapping).
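For orientation, flags of this kind typically map onto PyTorch settings like the following; this is a hedged sketch, and the training scripts remain the source of truth for exactly what each flag toggles:

```python
import os
import torch

# Hedged sketch of deterministic-training settings; consult the repo's flags for exact behavior.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # required by some deterministic CUDA ops
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

torch.use_deterministic_algorithms(True)            # error out on nondeterministic kernels
torch.backends.cudnn.benchmark = False

# Disable TF32 so matmul results are stable across runs.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Force scaled_dot_product_attention onto the math backend.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```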
Evaluate the performance of the pretrained model using standardized benchmarks.
- Navigate to the Evaluation Harness Directory

  ```bash
  cd lm-evaluation-harness
  ```

- Follow the Instructions Within This Directory

  Ensure your model is compatible with the evaluation harness requirements.
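As one possible route, the harness's Python API (lm-eval >= 0.4) can run the zero-shot suite directly. This sketch assumes the checkpoint has been exported to a Hugging Face-compatible format, which may not match the repository's own evaluation flow; treat it as illustrative only.

```python
import lm_eval

# Zero-shot evaluation over the 7 tasks averaged in the results table below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/exported_checkpoint,dtype=bfloat16",  # hypothetical path
    tasks=["arc_easy", "arc_challenge", "hellaswag", "openbookqa",
           "piqa", "winogrande", "sciq"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```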
- Karpathy's nanoGPT provides the foundational codebase upon which this repo is built.
- Hugging Face for providing the Fineweb-Edu-100B dataset.
- EleutherAI for the lm-evaluation-harness.
- OpenWebText team for replicating the WebText dataset.
We evaluate GRAPE on language modeling tasks using the FineWeb-Edu 100B dataset (0-shot with lm-evaluation-harness; Avg. over 7 tasks: ARC-E, ARC-C, HellaSwag, OBQA, PIQA, WinoGrande, SciQ).
| Method | Avg. Score (Medium, 355M) | Avg. Score (Large, 770M) |
|---|---|---|
| RoPE | 51.73 | 55.76 |
| ALiBi | 52.87 | 56.44 |
| FoX | 52.96 | 56.30 |
| GRAPE-AP | 53.25 | 56.91 |
Numbers shown are w/o KV-shift; see Tables 1 and 2 in the paper for full breakdown (including w/ KV-shift).
If you find this work useful, please cite our paper:
```bibtex
@article{zhang2025grape,
  title={Group Representational Position Encoding},
  author={Zhang, Yifan and Chen, Zixiang and Liu, Yifeng and Qin, Zhen and Yuan, Huizhuo and Xu, Kangping and Yuan, Yang and Gu, Quanquan and Yao, Andrew Chi-Chih},
  journal={arXiv preprint arXiv:2512.07805},
  year={2025}
}
```

This project is licensed under the Apache License 2.0.