InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wenjun Wang*,
Shuo Cai*,
Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang
Paper | 🤗 Huggingface | Project Website
- [2025.10.8] We release the code and model.
- [2025.9.26] We release the arXiv paper.
- Overview
- Preparation
- Continual Pre-training with FP8
- Supervised Fine-tuning with FP8
- Evaluation
- Acknowledgements
- Citation
We introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we release the accompanying code to further democratize large-scale model training.
Memory Optimization & Computation Acceleration: Compared to the widely used BF16, FP8 delivers:
- Up to 22% reduction in end-to-end training time.
- Up to 14% savings in peak memory usage.
- Up to 19% increase in end-to-end throughput.
Ratios below are FP8 relative to BF16: lower is better for total time and peak memory, higher is better for throughput.

Model Size = 1.5B

Context Length = 32k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 841 ms | 2329 ms | 3170 ms | - | 57.8 GB | - | 345 TFlops | - |
| FP8 | 875 ms | 2075 ms | 2950 ms | 0.93× | 51.7 GB | 0.89× | 360 TFlops | 1.04× |

Context Length = 8k, TP = 1, CP = 1, MBS = 2

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 463 ms | 1567 ms | 2030 ms | - | 68.1 GB | - | 340 TFlops | - |
| FP8 | 529 ms | 1061 ms | 1590 ms | 0.78× | 58.3 GB | 0.86× | 376 TFlops | 1.10× |

Model Size = 7B

Context Length = 32k, TP = 4, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 2790 ms | 6800 ms | 9590 ms | - | 78.1 GB | - | 409 TFlops | - |
| FP8 | 2660 ms | 5700 ms | 8360 ms | 0.87× | 67.4 GB | 0.86× | 461 TFlops | 1.14× |

Context Length = 8k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 1760 ms | 5320 ms | 7080 ms | - | 53.2 GB | - | 453 TFlops | - |
| FP8 | 2300 ms | 3230 ms | 5530 ms | 0.78× | 50.8 GB | 0.95× | 537 TFlops | 1.19× |
To clone this repository, please use:
```bash
git clone --recursive https://github.com/InfiXAI/InfiR2
```
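If you cloned without --recursive, the submodules can still be fetched afterwards:

```bash
git submodule update --init --recursive
```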
We support environment setup via Docker and provide a custom Dockerfile at docker/Dockerfile. Build the Docker image using:

```bash
docker build --no-cache \
--file docker/Dockerfile \
--build-arg HTTP_PROXY="$http_proxy" \
--build-arg HTTPS_PROXY="$https_proxy" \
--build-arg NO_PROXY="localhost,127.0.0.1" \
--build-arg SGLANG_VERSION=${SGLANG_VERSION:-latest} \
--build-arg MEGATRON_COMMIT=${MEGATRON_COMMIT:-main} \
-t infir2-training:latest .
```

Key Components:
- Base: lmsysorg/sglang:${SGLANG_VERSION}
- Megatron-LM: core_v0.14.0 branch (NVIDIA official)
- TransformerEngine: v2.4.0 (commit 3cd6870). ⚠️ Must use this version to avoid precision/GPU memory issues
- FlashAttention: v2.7.4.post1 + Hopper build
- Additional: slime, mbridge, torch_memory_saver, ray, sglang-router, and more
For more details, please refer to docker/README.md.
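Once the image is built, an interactive training container can be started roughly as follows (a minimal sketch; the mount point and shared-memory size are placeholders to adjust for your setup):

```bash
docker run --gpus all -it --rm \
    --shm-size 32g \
    -v "$(pwd)":/workspace/InfiR2 \
    infir2-training:latest /bin/bash
```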
We provide continual pre-training (CPT) scripts with FP8 quantization. Our FP8 training recipe achieves up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput compared to the BF16 baseline, while maintaining performance parity on reasoning benchmarks. For more details, please refer to docs/CPT.md.
We support both 7B and 1.5B models with flexible training configurations:
- 7B Model
- Complete Training: InfiR2_CPT_FP8_7B.sh - Full warmup+stable+decay pipeline
- Decay Only: InfiR2_CPT_FP8_7B_decay.sh - Optional standalone decay phase
- 1.5B Model
- Complete Training: InfiR2_CPT_FP8_1.5B.sh - Full warmup+stable+decay pipeline
- Decay Only: InfiR2_CPT_FP8_1.5B_decay.sh - Optional standalone decay phase
Option 1: Complete Training Pipeline (Recommended)
Run the full warmup+stable+decay training in one go:
```bash
bash scripts/CPT/InfiR2_CPT_FP8_7B.sh
```

This single script will complete all three training phases automatically.
Option 2: Using Standalone Decay Script (Advanced)
If you want to enter the decay phase from a specific checkpoint in the stable phase:

```bash
# First, identify your stable-phase checkpoint
# Then run the decay script with the checkpoint
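
# (Optional) list available checkpoints; this assumes the default exp/ output
# directory used by the complete-training script above
ls exp/InfiR2_CPT_FP8_7B/checkpoints/
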
bash scripts/CPT/InfiR2_CPT_FP8_7B_decay.sh \
    --load exp/InfiR2_CPT_FP8_7B/checkpoints/iter_0035000
```

We provide two-stage SFT training scripts with FP8 quantization, following InfiAlign. The training process uses Ray for distributed execution and supports multi-node training configurations. For more details, refer to docs/SFT.md.
We support both 7B and 1.5B models with flexible training configurations:
- 7B SFT
- Stage1: InfiR2_SFT_FP8_7B_stage1.sh.
- Stage2: InfiR2_SFT_FP8_7B_stage2.sh.
- 1.5B SFT
- Stage1: InfiR2_SFT_FP8_1.5B_stage1.sh.
- Stage2: InfiR2_SFT_FP8_1.5B_stage2.sh.
Dataset: Modify the DATA_DIR variable to point to your training data:
```bash
DATA_DIR=/path/to/stage1_data
```

Model Configuration:
- HF_CHECKPOINT: Path to the model in HuggingFace format (e.g., Qwen2.5-7B-Instruct)
- REF_LOAD: Path to the base model weights in PyTorch distributed format
```bash
HF_CHECKPOINT=/path/to/base_models_hf/qwen2.5-7B-Instruct/
REF_LOAD=/path/to/base_models_/qwen2.5-7B_torch_dist/
```

First, start the Ray cluster on the head node:
```bash
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
```
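For multi-node SFT, each worker node joins the same Ray cluster before the training script is launched. A minimal sketch, assuming the default Ray head port 6379 and 8 GPUs per node:

```bash
# Run on every worker node, pointing at the head node started above
ray start --address ${MASTER_ADDR}:6379 --num-gpus 8 --disable-usage-stats
```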
Then launch the training:

```bash
bash scripts/SFT/InfiR2_SFT_FP8_7B_stage1.sh
```

Our RL training pipeline consists of two stages: first compressing the response length, then expanding it. Before RL training, you need to convert the SFT checkpoint to FP8 format for efficient FP8 inference during rollout generation. For more details, refer to docs/RL.md.
After completing SFT Stage 2, convert the model to HuggingFace format, then to FP8 format:
```bash
# Step 1: Convert PyTorch distributed checkpoint to HuggingFace format
PYTHONPATH=training/Megatron-LM:training/slime python tools/convert_torch_dist_to_hf.py \
    --input-dir /path/to/InfiR2_SFT_FP8_stg2 \
    --output-dir /path/to/InfiR2_SFT_FP8_stg2_hf \
    --origin-hf-dir /path/to/models/Qwen2.5-7B-Instruct

# Step 2: Convert BF16 HuggingFace model to FP8 format
python tools/bf16_cast_fp8.py \
    --input-bf16-hf-path /path/to/InfiR2_SFT_FP8_stg2_hf \
    --output-fp8-hf-path /path/to/InfiR2_SFT_FP8_stg2_hf_fp8 \
    --force-pow-2-scale False
```

The FP8 model will be used for inference during the RL rollout phase, significantly improving generation efficiency.
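Before starting RL, a quick sanity check that the conversion produced a complete FP8 checkpoint can save a failed rollout launch. A minimal sketch (the grep assumes the conversion tool records its quantization settings in config.json, which may differ across versions):

```bash
# Confirm the converted checkpoint contains weights and a config
ls /path/to/InfiR2_SFT_FP8_stg2_hf_fp8

# Peek at any recorded quantization settings
grep -i "quant" /path/to/InfiR2_SFT_FP8_stg2_hf_fp8/config.json
```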
- Stage 1: InfiR2_RL_FP8_7B_stage1_4node.sh with 8K response lengths.
- Stage 2: InfiR2_RL_FP8_7B_stage2_4node.sh with 16K response lengths and higher temperature.
Dataset: Set the DATA_DIR to your RL training data:
```bash
DATA_DIR=/path/to/data/dapo-math-17k.jsonl
```

Model Configuration:
- HF_CHECKPOINT: Path to the FP8 converted model (for inference)
- REF_LOAD: Path to the SFT Stage 2 checkpoint in PyTorch distributed format
```bash
HF_CHECKPOINT=/path/to/your_model/
REF_LOAD=/path/to/your_model/
```

FP8 Training Configuration:
```bash
GRPO_ARGS=(
    ...
    --use-tis
    ...
)

PRECISE_ARGS=(
    ...
    # for fp8 training
    --fp8-format e4m3
    --fp8-recipe blockwise
    --fp8-param-gather
    ...
)
```

The way to launch RL training is the same as for SFT: first start the Ray cluster, then run the script.
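For example, Stage 1 can then be launched as in the sketch below (this assumes the RL scripts live under scripts/RL/, mirroring the scripts/CPT/ and scripts/SFT/ layout):

```bash
# Start the Ray cluster as in the SFT section, then launch RL Stage 1
bash scripts/RL/InfiR2_RL_FP8_7B_stage1_4node.sh
```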
This curriculum-based strategy ensures stable training and optimal performance across different response length requirements.
We use the open-source evalscope framework for all model evaluations to ensure reproducibility. Our evaluation suite includes four reasoning benchmarks with provided evaluation scripts.
We have verified that our models work correctly with the latest version of evalscope, achieving consistent performance results. However, to strictly reproduce the exact evaluation results reported in our paper, please use the following specific version of evalscope:
Recommended Version for Reproduction:
- Repository: evalscope
- Branch: main
- Pull Request: Add qwen-code best practice doc (#734)
Installation:
Follow the official documentation at https://evalscope.readthedocs.io/zh-cn/latest/get_started/installation.html:

```bash
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
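
# Optional, for strict reproduction: pin to the commit associated with the PR
# referenced above (this assumes the PR number appears in the merge/squash
# commit message; otherwise check the PR page for the exact commit)
git log --oneline --grep="#734"
# git checkout <commit-hash-from-the-line-above>
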
pip install -e .
```

We provide evaluation scripts for four key reasoning benchmarks:
| Benchmark | Script | Max Tokens | Samples | Temperature |
|---|---|---|---|---|
| AIME 2024 | aime24_eval.sh | 31,000 | 32 | 0.65 |
| AIME 2025 | aime25_eval.sh | 31,000 | 32 | 0.65 |
| GPQA | gpqa_eval.sh | 26,000 | 8 | 0.65 |
| LiveCodeBench | livecodebenchv5_eval.sh | 27,000 | 8 | 0.65 |
Each script uses Slurm for job scheduling and SGLang for efficient inference serving. The evaluation pipeline consists of the following steps, sketched below:
- Starting an SGLang server with the model
- Running evalscope with the specified benchmark
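A minimal sketch of these two steps is shown below; the model path, tensor-parallel size, and port are placeholders, and the provided *_eval.sh scripts wire both steps together through Slurm:

```bash
# 1) Serve the model with SGLang (placeholder path/port/TP size)
python -m sglang.launch_server \
    --model-path /path/to/InfiR2-7B-Instruct-FP8 \
    --tp 4 \
    --host 0.0.0.0 \
    --port 30000

# 2) In another shell, run evalscope against this endpoint via the matching
#    benchmark script (e.g. aime24_eval.sh), as provided in the repo
```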
- 7B Model
| Model | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 43.00 | 49.00 | 48.20 | 37.60 |
| Qwen2.5-7B-base (w. InfiAlign) | 33.75 | 43.02 | 48.11 | 39.48 |
| InfiR2-7B-Instruct-FP8 | 40.62 | 55.73 | 45.33 | 40.31 |
- 1.5B Model
| Model | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 21.35 | 26.87 | 32.26 | 18.50 |
| Qwen2.5-1.5B-base (w. InfiAlign) | 14.58 | 10.52 | 28.98 | 12.99 |
| InfiR2-1.5B-Instruct-FP8 | 18.45 | 17.39 | 29.48 | 17.10 |
We would like to express our gratitude for the following open-source projects:
- slime - An LLM post-training framework for RL scaling that powers GLM-4.5 and GLM-4.6. slime supports training for nearly all models compatible with Megatron-LM. We are actively collaborating with the slime community to achieve fully training-inference consistent FP8 RL training.
- Megatron-LM - Large-scale transformer model training framework by NVIDIA.
- TransformerEngine - Library for accelerating transformer models on NVIDIA GPUs with FP8 precision.
- Qwen2.5 - Foundation models that inspired our work.
If you find our work useful, please cite:
@misc{wang2025infir2comprehensivefp8training,
title={InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models},
author={Wenjun Wang and Shuo Cai and Congkai Xie and Mingfa Feng and Yiming Zhang and Zhen Li and Kejing Yang and Ming Li and Jiannong Cao and Hongxia Yang},
year={2025},
eprint={2509.22536},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22536},
}