InfiR2

中文版 (Chinese Version)

InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Wenjun Wang*, Shuo Cai*, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang

📄 Paper | 🤗 Huggingface | 🌐 Project Website

🌟 Overview

We introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we release the accompanying code to further democratize large-scale model training.

Our approach

  • Memory Optimization & Computation Acceleration: Compared to the widely used BF16, FP8 delivers:

    • Up to 22% reduction in end-to-end training time.
    • Up to 14% savings in peak memory usage.
    • Up to 19% increase in end-to-end throughput.

Model Size = 1.5B

Context Length = 32k, TP = 2, CP = 1, MBS = 1

| | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|------|---------|----------|-------|-------|-------------|-------|------------|-------|
| BF16 | 841 ms | 2329 ms | 3170 ms | - | 57.8 GB | - | 345 TFlops | - |
| FP8 | 875 ms | 2075 ms | 2950 ms | 0.93× | 51.7 GB | 0.89× | 360 TFlops | 1.04× |

Context Length = 8k, TP = 1, CP = 1, MBS = 2

| | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|------|---------|----------|-------|-------|-------------|-------|------------|-------|
| BF16 | 463 ms | 1567 ms | 2030 ms | - | 68.1 GB | - | 340 TFlops | - |
| FP8 | 529 ms | 1061 ms | 1590 ms | 0.78× | 58.3 GB | 0.86× | 376 TFlops | 1.10× |

Model Size = 7B

Context Length = 32k, TP = 4, CP = 1, MBS = 1

| | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|------|---------|----------|-------|-------|-------------|-------|------------|-------|
| BF16 | 2790 ms | 6800 ms | 9590 ms | - | 78.1 GB | - | 409 TFlops | - |
| FP8 | 2660 ms | 5700 ms | 8360 ms | 0.87× | 67.4 GB | 0.86× | 461 TFlops | 1.14× |

Context Length = 8k, TP = 2, CP = 1, MBS = 1

| | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|------|---------|----------|-------|-------|-------------|-------|------------|-------|
| BF16 | 1760 ms | 5320 ms | 7080 ms | - | 53.2 GB | - | 453 TFlops | - |
| FP8 | 2300 ms | 3230 ms | 5530 ms | 0.78× | 50.8 GB | 0.95× | 537 TFlops | 1.19× |

🚀 Preparation

To clone this repository, please use:

git clone --recursive https://github.com/InfiXAI/InfiR2

Environment Setup

We support environment setup via Docker and provide a custom Dockerfile. Please follow the instructions below.


Docker Setup

The custom Docker configuration is provided at docker/Dockerfile. Build the Docker image using:

docker build --no-cache \
    --file docker/Dockerfile \
    --build-arg HTTP_PROXY="$http_proxy" \
    --build-arg HTTPS_PROXY="$https_proxy" \
    --build-arg NO_PROXY="localhost,127.0.0.1" \
    --build-arg SGLANG_VERSION=${SGLANG_VERSION:-latest} \
    --build-arg MEGATRON_COMMIT=${MEGATRON_COMMIT:-main} \
    -t infir2-training:latest .

Key Components:

  • Base: lmsysorg/sglang:${SGLANG_VERSION}
  • Megatron-LM: core_v0.14.0 branch (NVIDIA official)
  • TransformerEngine: v2.4.0 (commit 3cd6870) - ⚠️ Must use this version to avoid precision/GPU memory issues
  • FlashAttention: v2.7.4.post1 + Hopper build
  • Additional: slime, mbridge, torch_memory_saver, ray, sglang-router, and more

For more details, please refer to docker/README.md.
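
Because the TransformerEngine version is critical, it is worth verifying the installed versions inside the built image. A minimal sketch, assuming the image's default Python environment exposes the transformer_engine and flash_attn packages:

# Print the TransformerEngine and FlashAttention versions baked into the image
docker run --rm infir2-training:latest \
    python -c "import transformer_engine as te, flash_attn; print(te.__version__, flash_attn.__version__)"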

🤖 Continual Pre-training with FP8

We provide continual pre-training (CPT) scripts with FP8 quantization. Our FP8 training recipe achieves up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput compared to the BF16 baseline, while maintaining performance parity on reasoning benchmarks. For more details, please refer to docs/CPT.md.

Available Scripts

We support both 7B and 1.5B models with flexible training configurations; the corresponding scripts are located under scripts/CPT/.

Running

Option 1: Complete Training Pipeline (Recommended)

Run the full warmup+stable+decay training in one go:

bash scripts/CPT/InfiR2_CPT_FP8_7B.sh

This single script will complete all three training phases automatically.
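
The CPT scripts are expected to enable FP8 through the same Megatron-LM arguments shown later in the RL configuration; a brief annotated sketch (the inline comments are our own annotations, not part of the script):

PRECISE_ARGS=(
   ...
   # for fp8 training
   --fp8-format e4m3        # use the E4M3 FP8 format for GEMM operands
   --fp8-recipe blockwise   # fine-grained blockwise scaling
   --fp8-param-gather       # perform the parameter all-gather in FP8 to save memory
   ...
)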

Option 2: Using Standalone Decay Script (Advanced)

If you want to enter the decay phase from a specific checkpoint in the stable phase:

# First, identify your stable-phase checkpoint
# Then run the decay script with the checkpoint
bash scripts/CPT/InfiR2_CPT_FP8_7B_decay.sh \
    --load exp/InfiR2_CPT_FP8_7B/checkpoints/iter_0035000

🌈 Supervised Fine-tuning with FP8

We provide two-stage SFT training scripts with FP8 quantization following InfiAlign. The training process uses Ray for distributed execution and supports multi-node training configurations. For more details, refer to docs/SFT.md.

Available Scripts

We support both 7B and 1.5B models with flexible training configurations; the corresponding scripts are located under scripts/SFT/.

Configuration

Dataset: Modify the DATA_DIR variable to point to your training data:

DATA_DIR=/path/to/stage1_data

Model Configuration:

  • HF_CHECKPOINT: Path to the model in HuggingFace format (e.g., Qwen2.5-7B-Instruct)
  • REF_LOAD: Path to the base model weights in PyTorch distributed format

HF_CHECKPOINT=/path/to/base_models_hf/qwen2.5-7B-Instruct/
REF_LOAD=/path/to/base_models_/qwen2.5-7B_torch_dist/

Running

First, start the Ray cluster:

export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
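
For multi-node training, additional nodes can join the cluster as Ray workers; a minimal sketch, assuming the default Ray GCS port 6379 on the head node:

# Run on each worker node, pointing at the head node started above
ray start --address ${MASTER_ADDR}:6379 --num-gpus 8 --disable-usage-stats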

Then launch the training:

bash scripts/SFT/InfiR2_SFT_FP8_7B_stage1.sh
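
After stage 1 completes, stage 2 is launched the same way from the stage-1 output (the script name below is a hypothetical placeholder; use the stage-2 script provided under scripts/SFT/):

# Hypothetical stage-2 script name, for illustration only
bash scripts/SFT/InfiR2_SFT_FP8_7B_stage2.sh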

🎯 Reinforcement Learning with FP8

Our RL training pipeline consists of two stages: first compressing the response length, then expanding it. Before RL training, you need to convert the SFT checkpoint to FP8 format for efficient FP8 inference during rollout generation. For more details, refer to docs/RL.md.

Model Conversion for RL

After completing SFT Stage 2, convert the model to HuggingFace format, then to FP8 format:

# Step 1: Convert PyTorch distributed checkpoint to HuggingFace format
PYTHONPATH=training/Megatron-LM:training/slime python tools/convert_torch_dist_to_hf.py \
    --input-dir /path/to/InfiR2_SFT_FP8_stg2 \
    --output-dir /path/to/InfiR2_SFT_FP8_stg2_hf \
    --origin-hf-dir /path/to/models/Qwen2.5-7B-Instruct

# Step 2: Convert BF16 HuggingFace model to FP8 format
python tools/bf16_cast_fp8.py \
    --input-bf16-hf-path /path/to/InfiR2_SFT_FP8_stg2_hf \
    --output-fp8-hf-path /path/to/InfiR2_SFT_FP8_stg2_hf_fp8 \
    --force-pow-2-scale False

The FP8 model will be used for inference during the RL rollout phase, significantly improving generation efficiency.
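
To sanity-check the cast, you can inspect the converted checkpoint's config; a minimal sketch, assuming the conversion script records a quantization_config entry in config.json:

# Print the quantization metadata of the FP8 checkpoint (path as produced in Step 2)
python -c "import json; cfg = json.load(open('/path/to/InfiR2_SFT_FP8_stg2_hf_fp8/config.json')); print(cfg.get('quantization_config'))"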

Configuration

Dataset: Set the DATA_DIR to your RL training data:

DATA_DIR=/path/to/data/dapo-math-17k.jsonl

Model Configuration:

  • HF_CHECKPOINT: Path to the FP8 converted model (for inference)
  • REF_LOAD: Path to the SFT Stage 2 checkpoint in PyTorch distributed format
HF_CHECKPOINT=/path/to/InfiR2_SFT_FP8_stg2_hf_fp8/
REF_LOAD=/path/to/InfiR2_SFT_FP8_stg2/

FP8 Training Configuration:

GRPO_ARGS=(
   ...
   --use-tis
   ...
)

PRECISE_ARGS=(
   ...
   # for fp8 training
   --fp8-format e4m3
   --fp8-recipe blockwise
   --fp8-param-gather
   ...
)

Running

RL training is launched in the same way as SFT: first start the Ray cluster, then run the RL script, as sketched below.
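
For example (the RL script name below is a hypothetical placeholder for the script shipped with this repository; the Ray commands mirror the SFT section):

export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265

# Hypothetical script name, for illustration only
bash scripts/RL/InfiR2_RL_FP8_7B.sh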

This curriculum-based strategy, first compressing and then expanding the response length, keeps training stable and delivers strong performance across different response-length requirements.

📊 Evaluation

We use the open-source evalscope framework for all model evaluations to ensure reproducibility. Our evaluation suite includes four reasoning benchmarks with provided evaluation scripts.

Environment Setup

We have verified that our models work correctly with the latest version of evalscope, achieving consistent performance results. However, to strictly reproduce the exact evaluation results reported in our paper, please use the following specific version of evalscope:

Recommended Version for Reproduction:

Installation:

Follow the official documentation at https://evalscope.readthedocs.io/zh-cn/latest/get_started/installation.html

git clone https://github.com/modelscope/evalscope.git
cd evalscope/
pip install -e .

Evaluation Benchmarks

We provide evaluation scripts for four key reasoning benchmarks:

| Benchmark | Script | Max Tokens | Samples | Temperature |
|-----------|--------|------------|---------|-------------|
| AIME 2024 | aime24_eval.sh | 31,000 | 32 | 0.65 |
| AIME 2025 | aime25_eval.sh | 31,000 | 32 | 0.65 |
| GPQA | gpqa_eval.sh | 26,000 | 8 | 0.65 |
| LiveCodeBench | livecodebenchv5_eval.sh | 27,000 | 8 | 0.65 |

Each script uses Slurm for job scheduling and SGLang for efficient inference serving. The evaluation pipeline consists of:

  1. Starting an SGLang server with the model
  2. Running evalscope with the specified benchmark
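
A minimal sketch of these two steps (the server flags, port, and model path are illustrative assumptions; the provided benchmark scripts set the exact values):

# 1. Serve the model with SGLang (illustrative flags and port)
python -m sglang.launch_server \
    --model-path /path/to/InfiR2-7B-Instruct-FP8 \
    --tp 4 \
    --port 30000 &

# 2. Once the server is ready, run the corresponding benchmark script, e.g. AIME 2024
bash aime24_eval.sh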

Model Performance

  • 7B Model

| Model | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|-------|---------|---------|------|------------------|
| Deepseek-Distill-Qwen-7B | 43.00 | 49.00 | 48.20 | 37.60 |
| Qwen2.5-7B-base (w. InfiAlign) | 33.75 | 43.02 | 48.11 | 39.48 |
| InfiR2-7B-Instruct-FP8 | 40.62 | 55.73 | 45.33 | 40.31 |

  • 1.5B Model

| Model | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|-------|---------|---------|------|------------------|
| Deepseek-Distill-Qwen-1.5B | 21.35 | 26.87 | 32.26 | 18.50 |
| Qwen2.5-1.5B-base (w. InfiAlign) | 14.58 | 10.52 | 28.98 | 12.99 |
| InfiR2-1.5B-Instruct-FP8 | 18.45 | 17.39 | 29.48 | 17.10 |

πŸ™ Acknowledgements

We would like to express our gratitude for the following open-source projects:

  • slime - An LLM post-training framework for RL scaling that powers GLM-4.5 and GLM-4.6. slime supports training for nearly all models compatible with Megatron-LM. We are actively collaborating with the slime community to achieve fully training-inference consistent FP8 RL training.
  • Megatron-LM - Large-scale transformer model training framework by NVIDIA.
  • TransformerEngine - Library for accelerating transformer models on NVIDIA GPUs with FP8 precision.
  • Qwen2.5 - Foundation models that inspired our work.

📌 Citation

If you find our work useful, please cite:

@misc{wang2025infir2comprehensivefp8training,
      title={InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models}, 
      author={Wenjun Wang and Shuo Cai and Congkai Xie and Mingfa Feng and Yiming Zhang and Zhen Li and Kejing Yang and Ming Li and Jiannong Cao and Hongxia Yang},
      year={2025},
      eprint={2509.22536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22536}, 
}
