
GRAPE: Group Representational Position Encoding


Official implementation of the paper "Group Representational Position Encoding (GRAPE)".

GRAPE is a unified group-theoretic framework for positional encoding that subsumes multiplicative mechanisms (like RoPE) and additive mechanisms (like ALiBi and FoX) under a single mathematical formalism. By leveraging group actions, specifically rotations in $SO(d)$ and unipotent lifts in $GL(d+k)$, GRAPE guarantees exact relative position laws and efficient streaming cacheability.

Authors: Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

[Webpage] [Huggingface]

📖 Abstract

Positional information is essential for sequence modeling with Transformers. We present GRAPE, a framework that unifies two families of mechanisms:

  1. Multiplicative GRAPE ($SO(d)$): Models positions as norm-preserving rotations generated by rank-2 skew-symmetric matrices. It recovers RoPE exactly when the planes are canonical coordinate pairs with a log-uniform spectrum. It further extends to learned commuting subspaces and compact non-commuting mixtures.
  2. Additive GRAPE ($GL(d+k)$): Models positions as unipotent actions in a lifted homogeneous space. It recovers ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability.

🚀 Key Methodologies

1. Multiplicative GRAPE (The $SO(d)$ Action)

Positions $n \in \mathbb{Z}$ act on the query/key space via a matrix exponential of a generator $L$: $$G(n) = \exp(n \omega L) \in SO(d)$$ where $L = ab^\top - ba^\top$ is a rank-2 skew-symmetric generator.

We derive a closed-form Rodrigues-type formula for fast $\mathcal{O}(d)$ application without materializing the matrix exponential: $$\exp(L) = I + \frac{\sin s}{s}L + \frac{1 - \cos s}{s^2}L^2$$ where $s = \sqrt{\|a\|^2\|b\|^2 - (a^\top b)^2}$ is the rotation angle determined by the generator vectors $a$ and $b$ (since $L^3 = -s^2 L$).
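
As a minimal illustration of how this Rodrigues form gives $\mathcal{O}(d)$ application, the sketch below applies $G(n) = \exp(n\omega L)$ to a vector using only inner products with $a$ and $b$. It is a self-contained toy, not the repository's implementation; the function name, random generator vectors, and frequency $\omega$ are placeholders.

```python
import torch

def apply_multiplicative_grape(x, a, b, n, omega):
    """Toy sketch: apply G(n) = exp(n*omega*L), L = a b^T - b a^T, to x in O(d).

    Uses exp(tL) = I + sin(ts)/s * L + (1 - cos(ts))/s^2 * L^2 with
    s^2 = |a|^2 |b|^2 - (a.b)^2, so the d x d matrix is never materialized.
    """
    s = torch.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)
    t = n * omega

    def L(v):  # L v = a (b.v) - b (a.v): a rank-2 action costing O(d)
        return a * (b @ v) - b * (a @ v)

    Lx = L(x)
    return x + (torch.sin(t * s) / s) * Lx + ((1 - torch.cos(t * s)) / s**2) * L(Lx)

# Sanity check against the dense matrix exponential on a small example.
torch.manual_seed(0)
d = 8
a, b, x = torch.randn(d), torch.randn(d), torch.randn(d)
n, omega = 5, 0.1
L_dense = torch.outer(a, b) - torch.outer(b, a)
dense = torch.matrix_exp(n * omega * L_dense) @ x
fast = apply_multiplicative_grape(x, a, b, n, omega)
assert torch.allclose(dense, fast, atol=1e-4)
```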

2. Additive GRAPE (The Unipotent Lift)

Additive biases are realized via a homogeneous lift to $GL(d+k)$ using a nilpotent generator $A$ (where $A^2=0$). The group action is defined as: $$G_{add}(n) = \exp(n \omega A) = I + n \omega A$$ This yields an exact relative law where attention scores depend strictly on the offset $j-i$: $$\tilde{q}_i^\top \tilde{k}_j = q_i^\top k_j + \text{bias}(j-i)$$
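
To make the relative law concrete, here is a toy instance of one unipotent lift (an illustrative special case, not necessarily the paper's parameterization): with two auxiliary coordinates and the nilpotent generator $A = e_{d+1}e_{d+2}^\top$ (so $A^2 = 0$), the lifted dot product reproduces an ALiBi-style linear bias $\omega(j-i)$ exactly, and each lifted key depends only on its own position $j$, so it remains cacheable for streaming.

```python
import torch

def lift_key(k, j, omega):
    """k_j -> G(j) [k_j; 0; 1] with G(n) = I + n*omega*A and A = e_{d+1} e_{d+2}^T."""
    return torch.cat([k, torch.tensor([j * omega, 1.0])])

def lift_query(q, i, omega):
    """q_i -> G(i)^{-T} [q_i; 1; 0]; since A^2 = 0, G(i)^{-1} = I - i*omega*A."""
    return torch.cat([q, torch.tensor([1.0, -i * omega])])

# Exact relative law: <lift_query(q, i), lift_key(k, j)> = q.k + omega * (j - i).
torch.manual_seed(0)
d, omega = 4, 0.5          # for causal attention (j <= i) the bias omega*(j-i) <= 0
q, k = torch.randn(d), torch.randn(d)
i, j = 10, 3
score = lift_query(q, i, omega) @ lift_key(k, j, omega)
assert torch.isclose(score, q @ k + omega * (j - i))
```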

Features

  • Scalability: Efficient training procedures optimized for large-scale datasets and multi-GPU setups.
  • Flexible Data Support: Compatible with popular datasets like Fineweb-Edu-100B and OpenWebText.
  • Comprehensive Evaluation: Integrated with lm-evaluation-harness for standardized benchmarking.

Hardware Requirements

A100 or H100 GPUs are recommended. At least 8×80 GB of VRAM (eight 80 GB GPUs) is required.

Installation

Ensure you have Python 3.10 or higher installed. It's recommended to use a virtual environment to manage dependencies.

  1. Clone the Repository

    git clone https://github.com/model-architectures/GRAPE.git
    cd GRAPE
  2. Create and Activate a Virtual Environment

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Required Packages

    pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm

Data Preparation

Prepare the necessary datasets before pretraining the model. This codebase supports both Fineweb-Edu-100B and OpenWebText.

Fineweb-Edu-100B

Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.

  1. Navigate to the Data Directory

    cd data/fineweb-edu
  2. Run the Data Preparation Script

    python fineweb-edu.py
  3. Move the Prepared Data

    mv fineweb-edu100B ..
    cd ../..

OpenWebText

OpenWebText is an open reproduction of OpenAI's WebText dataset.

  1. Run the Data Preparation Script

    python data/openwebtext/prepare.py

    Ensure you have sufficient storage and computational resources, as OpenWebText is sizable.

Pretraining

Pretrain GRAPE and baseline models using the prepared datasets. The provided scripts support distributed training across multiple GPUs.

  1. Using the Provided Bash Script

    Execute the pretraining script, which handles the training process.

    bash pretrain.sh
  2. Manual Execution with torchrun

    For more control or customization, use torchrun to initiate training. Replace config/train_llama_mha_rope_medium_adam_80g8.py with your desired configuration file.

    torchrun --standalone --nproc_per_node=8 \
        train_adam_finewebedu.py \
        config/train_llama_mha_rope_medium_adam_80g8.py 
    • --nproc_per_node=8 specifies the number of processes (typically matching the number of GPUs).

Reproducibility (Data Order)

By default the training scripts use data_rng_mode=stateless, which makes the exact per-rank training batches repeatable across runs under DDP (and when resuming from a checkpoint). For full reproducibility, also fix seed/data_seed/eval_seed:

torchrun --standalone --nproc_per_node=8 \
    train_adam_finewebedu.py \
    config/train_llama_mha_rope_medium_adam_80g8.py \
    --seed=42 --data_seed=42 --eval_seed=42

data_rng_mode=stateful is also deterministic, and checkpoints the per-rank data RNG stream for exact resume via data_rng_state.pt (single-process) or data_rng_state_rank{RANK}.pt (DDP) saved alongside optimizer.pt. When resuming with data_rng_mode=stateful, keep these files and use the same WORLD_SIZE/RANK mapping.

If you need stronger end-to-end determinism beyond data order (at some performance cost), pass --deterministic=True to enable PyTorch deterministic kernels.

For best-effort bitwise-identical training (and exact resume), pass --bitwise_deterministic=True. This forces deterministic kernels, disables torch.compile, disables TF32, forces SDPA to use the math kernel, and checkpoints per-rank RNG + GradScaler state alongside optimizer.pt (requires the same PyTorch/CUDA/hardware + WORLD_SIZE/RANK mapping).
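
For reference, the snippet below sketches the kind of PyTorch switches these determinism modes correspond to (deterministic kernels, TF32 off, math-only SDPA). It is an illustration of the settings described above, not the repository's actual code; consult the training scripts for the authoritative behavior.

```python
import os
import torch

def determinism_sketch(seed: int = 42):
    """Illustrative reproducibility settings (not the repo's exact implementation)."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic kernels; some CUBLAS ops additionally require this env var.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # Disable TF32 so matmul results are bitwise stable across runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    # Force scaled_dot_product_attention onto the deterministic math backend.
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)
```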

Evaluation

Evaluate the performance of the pretrained model using standardized benchmarks.

  1. Navigate to the Evaluation Harness Directory

    cd lm-evaluation-harness
  2. Follow the Instructions Within This Directory

    Ensure your model is compatible with the evaluation harness requirements.
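
For orientation only, a hypothetical invocation of the harness's Python API on the seven tasks reported below might look like the following; the checkpoint path is a placeholder, the model must first be exported to a Hugging Face-compatible format, and the exact interface depends on the lm-evaluation-harness version bundled in this directory.

```python
import lm_eval

# Hypothetical example; see lm-evaluation-harness/ for the supported workflow.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/exported_checkpoint",  # placeholder path
    tasks=["arc_easy", "arc_challenge", "hellaswag", "openbookqa",
           "piqa", "winogrande", "sciq"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```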

Acknowledgements

📊 Experiments

We evaluate GRAPE on language modeling tasks using the FineWeb-Edu 100B dataset (0-shot with lm-evaluation-harness; Avg. over 7 tasks: ARC-E, ARC-C, HellaSwag, OBQA, PIQA, WinoGrande, SciQ).

| Method   | Avg. Score (Medium, 355M) | Avg. Score (Large, 770M) |
|----------|---------------------------|--------------------------|
| RoPE     | 51.73                     | 55.76                    |
| ALiBi    | 52.87                     | 56.44                    |
| FoX      | 52.96                     | 56.30                    |
| GRAPE-AP | 53.25                     | 56.91                    |

Numbers shown are w/o KV-shift; see Tables 1 and 2 in the paper for full breakdown (including w/ KV-shift).

📝 Citation

If you find this work useful, please cite our paper:

@article{zhang2025grape,
  title={Group Representational Position Encoding},
  author={Zhang, Yifan and Chen, Zixiang and Liu, Yifeng and Qin, Zhen and Yuan, Huizhuo and Xu, Kangping and Yuan, Yang and Gu, Quanquan and Yao, Andrew Chi-Chih},
  journal={arXiv preprint arXiv:2512.07805},
  year={2025}
}

📜 License

This project is licensed under the Apache License 2.0.