
GRAPE: Group Representational Position Encoding


Official implementation of the paper "Group Representational Position Encoding (GRAPE)".

GRAPE is a unified group-theoretic framework for positional encoding that subsumes multiplicative mechanisms (like RoPE) and additive mechanisms (like ALiBi and FoX) under a single mathematical formalism. By leveraging group actions, specifically rotations in $SO(d)$ and unipotent lifts in $GL(d+k)$, GRAPE guarantees exact relative position laws and efficient streaming cacheability.

Authors: Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

[Webpage] [Huggingface]

📖 Abstract

Positional information is essential for sequence modeling with Transformers. We present GRAPE, a framework that unifies two families of mechanisms:

  1. Multiplicative GRAPE ($SO(d)$): Models positions as norm-preserving rotations generated by rank-2 skew-symmetric matrices. It recovers RoPE exactly when the planes are canonical coordinate pairs with a log-uniform spectrum. It further extends to learned commuting subspaces and compact non-commuting mixtures.
  2. Additive GRAPE ($GL(d+k)$): Models positions as unipotent actions in a lifted homogeneous space. It recovers ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability.

🚀 Key Methodologies

1. Multiplicative GRAPE (The $SO(d)$ Action)

Positions $n \in \mathbb{Z}$ act on the query/key space via a matrix exponential of a generator $L$: $$G(n) = \exp(n \omega L) \in SO(d)$$ where $L = ab^\top - ba^\top$ is a rank-2 skew-symmetric generator.

We derive a closed-form Rodrigues-type formula for fast $\mathcal{O}(d)$ application without materializing the matrix exponential: $$\exp(L) = I + \frac{\sin s}{s}L + \frac{1 - \cos s}{s^2}L^2$$ where $s = \sqrt{\|a\|^2\|b\|^2 - (a^\top b)^2}$ is the rotation angle determined by the generator vectors $a$ and $b$ (since $L^3 = -s^2 L$).
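
As a minimal illustration of how this Rodrigues form gives $\mathcal{O}(d)$ application, the sketch below applies $G(n) = \exp(n\omega L)$ to a vector using only inner products with $a$ and $b$. It is a self-contained toy, not the repository's implementation; the function name, random generator vectors, and frequency $\omega$ are placeholders.

```python
import torch

def apply_multiplicative_grape(x, a, b, n, omega):
    """Toy sketch: apply G(n) = exp(n*omega*L), L = a b^T - b a^T, to x in O(d).

    Uses exp(tL) = I + sin(ts)/s * L + (1 - cos(ts))/s^2 * L^2 with
    s^2 = |a|^2 |b|^2 - (a.b)^2, so the d x d matrix is never materialized.
    """
    s = torch.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)
    t = n * omega

    def L(v):  # L v = a (b.v) - b (a.v): a rank-2 action costing O(d)
        return a * (b @ v) - b * (a @ v)

    Lx = L(x)
    return x + (torch.sin(t * s) / s) * Lx + ((1 - torch.cos(t * s)) / s**2) * L(Lx)

# Sanity check against the dense matrix exponential on a small example.
torch.manual_seed(0)
d = 8
a, b, x = torch.randn(d), torch.randn(d), torch.randn(d)
n, omega = 5, 0.1
L_dense = torch.outer(a, b) - torch.outer(b, a)
dense = torch.matrix_exp(n * omega * L_dense) @ x
fast = apply_multiplicative_grape(x, a, b, n, omega)
assert torch.allclose(dense, fast, atol=1e-4)
```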

2. Additive GRAPE (The Unipotent Lift)

Additive biases are realized via a homogeneous lift to $GL(d+k)$ using a nilpotent generator $A$ (where $A^2=0$). The group action is defined as: $$G_{add}(n) = \exp(n \omega A) = I + n \omega A$$ This yields an exact relative law where attention scores depend strictly on the offset $j-i$: $$\tilde{q}_i^\top \tilde{k}_j = q_i^\top k_j + \text{bias}(j-i)$$
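
To make the relative law concrete, here is a toy instance of one unipotent lift (an illustrative special case, not necessarily the paper's parameterization): with two auxiliary coordinates and the nilpotent generator $A = e_{d+1}e_{d+2}^\top$ (so $A^2 = 0$), the lifted dot product reproduces an ALiBi-style linear bias $\omega(j-i)$ exactly, and each lifted key depends only on its own position $j$, so it remains cacheable for streaming.

```python
import torch

def lift_key(k, j, omega):
    """k_j -> G(j) [k_j; 0; 1] with G(n) = I + n*omega*A and A = e_{d+1} e_{d+2}^T."""
    return torch.cat([k, torch.tensor([j * omega, 1.0])])

def lift_query(q, i, omega):
    """q_i -> G(i)^{-T} [q_i; 1; 0]; since A^2 = 0, G(i)^{-1} = I - i*omega*A."""
    return torch.cat([q, torch.tensor([1.0, -i * omega])])

# Exact relative law: <lift_query(q, i), lift_key(k, j)> = q.k + omega * (j - i).
torch.manual_seed(0)
d, omega = 4, 0.5          # for causal attention (j <= i) the bias omega*(j-i) <= 0
q, k = torch.randn(d), torch.randn(d)
i, j = 10, 3
score = lift_query(q, i, omega) @ lift_key(k, j, omega)
assert torch.isclose(score, q @ k + omega * (j - i))
```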

Features

  • Scalability: Efficient training procedures optimized for large-scale datasets and multi-GPU setups.
  • Flexible Data Support: Compatible with popular datasets like Fineweb-Edu-100B and OpenWebText.
  • Comprehensive Evaluation: Integrated with lm-evaluation-harness for standardized benchmarking.

Hardware Requirements

A100 or H100 GPUs are recommended. At least 8×80 GB of VRAM (eight 80 GB GPUs) is required.

Installation

Ensure you have Python 3.10 or higher installed. It's recommended to use a virtual environment to manage dependencies.

  1. Clone the Repository

    git clone https://github.com/model-architectures/GRAPE.git
    cd GRAPE
  2. Create and Activate a Virtual Environment

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Required Packages

    pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm

Data Preparation

Prepare the necessary datasets before pretraining the model. This codebase supports both Fineweb-Edu-100B and OpenWebText.

Fineweb-Edu-100B

Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.

  1. Navigate to the Data Directory

    cd data/fineweb-edu
  2. Run the Data Preparation Script

    python fineweb-edu.py
  3. Move the Prepared Data

    mv fineweb-edu100B ..
    cd ../..

OpenWebText

OpenWebText is an open reproduction of OpenAI's WebText dataset.

  1. Run the Data Preparation Script

    python data/openwebtext/prepare.py

    Ensure you have sufficient storage and computational resources, as OpenWebText is sizable.

Pretraining

Pretrain GRAPE and baseline models using the prepared datasets. The provided scripts support distributed training across multiple GPUs.

  1. Using the Provided Bash Script

    Execute the pretraining script, which handles the training process.

    bash pretrain.sh
  2. Manual Execution with torchrun

    For more control or customization, use torchrun to initiate training. Replace config/train_llama_mha_rope_medium_adam_80g8.py with your desired configuration file.

    torchrun --standalone --nproc_per_node=8 \
        train_adam_finewebedu.py \
        config/train_llama_mha_rope_medium_adam_80g8.py 
    • --nproc_per_node=8 specifies the number of processes (typically matching the number of GPUs).

Reproducibility (Data Order)

By default the training scripts use data_rng_mode=stateless, which makes the exact per-rank training batches repeatable across runs under DDP (and when resuming from a checkpoint). For full reproducibility, also fix seed/data_seed/eval_seed:

torchrun --standalone --nproc_per_node=8 \
    train_adam_finewebedu.py \
    config/train_llama_mha_rope_medium_adam_80g8.py \
    --seed=42 --data_seed=42 --eval_seed=42

data_rng_mode=stateful is also deterministic, and checkpoints the per-rank data RNG stream for exact resume via data_rng_state.pt (single-process) or data_rng_state_rank{RANK}.pt (DDP) saved alongside optimizer.pt. When resuming with data_rng_mode=stateful, keep these files and use the same WORLD_SIZE/RANK mapping.

If you need stronger end-to-end determinism beyond data order (at some performance cost), pass --deterministic=True to enable PyTorch deterministic kernels.

For best-effort bitwise-identical training (and exact resume), pass --bitwise_deterministic=True. This forces deterministic kernels, disables torch.compile, disables TF32, forces SDPA to use the math kernel, and checkpoints per-rank RNG + GradScaler state alongside optimizer.pt (requires the same PyTorch/CUDA/hardware + WORLD_SIZE/RANK mapping).
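
For reference, the snippet below sketches the kind of PyTorch switches these determinism modes correspond to (deterministic kernels, TF32 off, math-only SDPA). It is an illustration of the settings described above, not the repository's actual code; consult the training scripts for the authoritative behavior.

```python
import os
import torch

def determinism_sketch(seed: int = 42):
    """Illustrative reproducibility settings (not the repo's exact implementation)."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic kernels; some CUBLAS ops additionally require this env var.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # Disable TF32 so matmul results are bitwise stable across runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    # Force scaled_dot_product_attention onto the deterministic math backend.
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)
```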

Evaluation

Evaluate the performance of the pretrained model using standardized benchmarks.

  1. Navigate to the Evaluation Harness Directory

    cd lm-evaluation-harness
  2. Follow the Instructions Within This Directory

    Ensure your model is compatible with the evaluation harness requirements.
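
For orientation only, a hypothetical invocation of the harness's Python API on the seven tasks reported below might look like the following; the checkpoint path is a placeholder, the model must first be exported to a Hugging Face-compatible format, and the exact interface depends on the lm-evaluation-harness version bundled in this directory.

```python
import lm_eval

# Hypothetical example; see lm-evaluation-harness/ for the supported workflow.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/exported_checkpoint",  # placeholder path
    tasks=["arc_easy", "arc_challenge", "hellaswag", "openbookqa",
           "piqa", "winogrande", "sciq"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```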

Acknowledgements

📊 Experiments

We evaluate GRAPE on language modeling tasks using the FineWeb-Edu 100B dataset (0-shot with lm-evaluation-harness; Avg. over 7 tasks: ARC-E, ARC-C, HellaSwag, OBQA, PIQA, WinoGrande, SciQ).

| Method   | Avg. Score (Medium, 355M) | Avg. Score (Large, 770M) |
|----------|---------------------------|--------------------------|
| RoPE     | 51.73                     | 55.76                    |
| ALiBi    | 52.87                     | 56.44                    |
| FoX      | 52.96                     | 56.30                    |
| GRAPE-AP | 53.25                     | 56.91                    |

Numbers shown are w/o KV-shift; see Tables 1 and 2 in the paper for full breakdown (including w/ KV-shift).

📝 Citation

If you find this work useful, please cite our paper:

@article{zhang2025grape,
  title={Group Representational Position Encoding},
  author={Zhang, Yifan and Chen, Zixiang and Liu, Yifeng and Qin, Zhen and Yuan, Huizhuo and Xu, Kangping and Yuan, Yang and Gu, Quanquan and Yao, Andrew Chi-Chih},
  journal={arXiv preprint arXiv:2512.07805},
  year={2025}
}

📜 License

This project is licensed under the Apache License 2.0.