InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wenjun Wang*,
Shuo Cai*,
Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang
Paper | 🤗 Huggingface | Project Website
- [2025.10.8] We release the code and model.
- [2025.9.26] We release the arXiv paper.
- Overview
- Preparation
- Continual Pre-training with FP8
- Supervised Fine-tuning with FP8
- Evaluation
- Acknowledgements
- Citation
We introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continual pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we release the accompanying code to further democratize large-scale model training.
Memory Optimization & Computation Acceleration: Compared to the widely used BF16, FP8 delivers:
- Up to 22% reduction in end-to-end training time.
- Up to 14% savings in peak memory usage.
- Up to 19% increase in end-to-end throughput.
Ratios below are FP8 relative to BF16: lower is better for total time and peak memory, higher is better for throughput.

Model Size = 1.5B

Context Length = 32k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 841 ms | 2329 ms | 3170 ms | - | 57.8 GB | - | 345 TFlops | - |
| FP8 | 875 ms | 2075 ms | 2950 ms | 0.93× | 51.7 GB | 0.89× | 360 TFlops | 1.04× |

Context Length = 8k, TP = 1, CP = 1, MBS = 2

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 463 ms | 1567 ms | 2030 ms | - | 68.1 GB | - | 340 TFlops | - |
| FP8 | 529 ms | 1061 ms | 1590 ms | 0.78× | 58.3 GB | 0.86× | 376 TFlops | 1.10× |

Model Size = 7B

Context Length = 32k, TP = 4, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 2790 ms | 6800 ms | 9590 ms | - | 78.1 GB | - | 409 TFlops | - |
| FP8 | 2660 ms | 5700 ms | 8360 ms | 0.87× | 67.4 GB | 0.86× | 461 TFlops | 1.14× |

Context Length = 8k, TP = 2, CP = 1, MBS = 1

| Precision | Forward | Backward | Total | Ratio | Peak Memory | Ratio | Throughput | Ratio |
|---|---|---|---|---|---|---|---|---|
| BF16 | 1760 ms | 5320 ms | 7080 ms | - | 53.2 GB | - | 453 TFlops | - |
| FP8 | 2300 ms | 3230 ms | 5530 ms | 0.78× | 50.8 GB | 0.95× | 537 TFlops | 1.19× |
To clone this repository, please use:
```bash
git clone --recursive https://github.com/InfiXAI/InfiR2
```
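If you cloned without --recursive, the submodules can still be fetched afterwards:

```bash
git submodule update --init --recursive
```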
We support environment setup via Docker and provide a custom Dockerfile at docker/Dockerfile. Build the Docker image using:

```bash
docker build --no-cache \
--file docker/Dockerfile \
--build-arg HTTP_PROXY="$http_proxy" \
--build-arg HTTPS_PROXY="$https_proxy" \
--build-arg NO_PROXY="localhost,127.0.0.1" \
--build-arg SGLANG_VERSION=${SGLANG_VERSION:-latest} \
--build-arg MEGATRON_COMMIT=${MEGATRON_COMMIT:-main} \
-t infir2-training:latest .
```

Key Components:
- Base: lmsysorg/sglang:${SGLANG_VERSION}
- Megatron-LM: core_v0.14.0 branch (NVIDIA official)
- TransformerEngine: v2.4.0 (commit 3cd6870). ⚠️ Must use this version to avoid precision/GPU memory issues
- FlashAttention: v2.7.4.post1 + Hopper build
- Additional: slime, mbridge, torch_memory_saver, ray, sglang-router, and more
For more details, please refer to docker/README.md.
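Once the image is built, an interactive training container can be started roughly as follows (a minimal sketch; the mount point and shared-memory size are placeholders to adjust for your setup):

```bash
docker run --gpus all -it --rm \
    --shm-size 32g \
    -v "$(pwd)":/workspace/InfiR2 \
    infir2-training:latest /bin/bash
```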
We provide continual pre-training (CPT) scripts with FP8 quantization. Our FP8 training recipe achieves up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput compared to the BF16 baseline, while maintaining performance parity on reasoning benchmarks. For more details, please refer to docs/CPT.md.
We support both 7B and 1.5B models with flexible training configurations:
- 7B Model
- Complete Training: InfiR2_CPT_FP8_7B.sh - Full warmup+stable+decay pipeline
- Decay Only: InfiR2_CPT_FP8_7B_decay.sh - Optional standalone decay phase
- 1.5B Model
- Complete Training: InfiR2_CPT_FP8_1.5B.sh - Full warmup+stable+decay pipeline
- Decay Only: InfiR2_CPT_FP8_1.5B_decay.sh - Optional standalone decay phase
Option 1: Complete Training Pipeline (Recommended)
Run the full warmup+stable+decay training in one go:
```bash
bash scripts/CPT/InfiR2_CPT_FP8_7B.sh
```

This single script will complete all three training phases automatically.
Option 2: Using Standalone Decay Script (Advanced)
If you want to enter the decay phase from a specific checkpoint in the stable phase:

```bash
# First, identify your stable-phase checkpoint
# Then run the decay script with the checkpoint
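
# (Optional) list available checkpoints; this assumes the default exp/ output
# directory used by the complete-training script above
ls exp/InfiR2_CPT_FP8_7B/checkpoints/
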
bash scripts/CPT/InfiR2_CPT_FP8_7B_decay.sh \
    --load exp/InfiR2_CPT_FP8_7B/checkpoints/iter_0035000
```

We provide two-stage SFT training scripts with FP8 quantization, following InfiAlign. The training process uses Ray for distributed execution and supports multi-node training configurations. For more details, refer to docs/SFT.md.
We support both 7B and 1.5B models with flexible training configurations:
- 7B SFT
- Stage1: InfiR2_SFT_FP8_7B_stage1.sh.
- Stage2: InfiR2_SFT_FP8_7B_stage2.sh.
- 1.5B SFT
- Stage1: InfiR2_SFT_FP8_1.5B_stage1.sh.
- Stage2: InfiR2_SFT_FP8_1.5B_stage2.sh.
Dataset: Modify the DATA_DIR variable to point to your training data:
```bash
DATA_DIR=/path/to/stage1_data
```

Model Configuration:
- HF_CHECKPOINT: Path to the model in HuggingFace format (e.g., Qwen2.5-7B-Instruct)
- REF_LOAD: Path to the base model weights in PyTorch distributed format
```bash
HF_CHECKPOINT=/path/to/base_models_hf/qwen2.5-7B-Instruct/
REF_LOAD=/path/to/base_models_/qwen2.5-7B_torch_dist/
```

First, start the Ray cluster on the head node:
```bash
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
```
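For multi-node SFT, each worker node joins the same Ray cluster before the training script is launched. A minimal sketch, assuming the default Ray head port 6379 and 8 GPUs per node:

```bash
# Run on every worker node, pointing at the head node started above
ray start --address ${MASTER_ADDR}:6379 --num-gpus 8 --disable-usage-stats
```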
Then launch the training:

```bash
bash scripts/SFT/InfiR2_SFT_FP8_7B_stage1.sh
```

Our RL training pipeline consists of two stages: first compressing the response length, then expanding it. Before RL training, you need to convert the SFT checkpoint to FP8 format for efficient FP8 inference during rollout generation. For more details, refer to docs/RL.md.
After completing SFT Stage 2, convert the model to HuggingFace format, then to FP8 format:
```bash
# Step 1: Convert PyTorch distributed checkpoint to HuggingFace format
PYTHONPATH=training/Megatron-LM:training/slime python tools/convert_torch_dist_to_hf.py \
    --input-dir /path/to/InfiR2_SFT_FP8_stg2 \
    --output-dir /path/to/InfiR2_SFT_FP8_stg2_hf \
    --origin-hf-dir /path/to/models/Qwen2.5-7B-Instruct

# Step 2: Convert BF16 HuggingFace model to FP8 format
python tools/bf16_cast_fp8.py \
    --input-bf16-hf-path /path/to/InfiR2_SFT_FP8_stg2_hf \
    --output-fp8-hf-path /path/to/InfiR2_SFT_FP8_stg2_hf_fp8 \
    --force-pow-2-scale False
```

The FP8 model will be used for inference during the RL rollout phase, significantly improving generation efficiency.
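Before starting RL, a quick sanity check that the conversion produced a complete FP8 checkpoint can save a failed rollout launch. A minimal sketch (the grep assumes the conversion tool records its quantization settings in config.json, which may differ across versions):

```bash
# Confirm the converted checkpoint contains weights and a config
ls /path/to/InfiR2_SFT_FP8_stg2_hf_fp8

# Peek at any recorded quantization settings
grep -i "quant" /path/to/InfiR2_SFT_FP8_stg2_hf_fp8/config.json
```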
- Stage 1: InfiR2_RL_FP8_7B_stage1_4node.sh with 8K response lengths.
- Stage 2: InfiR2_RL_FP8_7B_stage2_4node.sh with 16K response lengths and higher temperature.
Dataset: Set the DATA_DIR to your RL training data:
```bash
DATA_DIR=/path/to/data/dapo-math-17k.jsonl
```

Model Configuration:
- HF_CHECKPOINT: Path to the FP8 converted model (for inference)
- REF_LOAD: Path to the SFT Stage 2 checkpoint in PyTorch distributed format
```bash
HF_CHECKPOINT=/path/to/your_model/
REF_LOAD=/path/to/your_model/
```

FP8 Training Configuration:
```bash
GRPO_ARGS=(
    ...
    --use-tis
    ...
)

PRECISE_ARGS=(
    ...
    # for fp8 training
    --fp8-format e4m3
    --fp8-recipe blockwise
    --fp8-param-gather
    ...
)
```

The way to launch RL training is the same as for SFT: first start the Ray cluster, then run the script.
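For example, Stage 1 can then be launched as in the sketch below (this assumes the RL scripts live under scripts/RL/, mirroring the scripts/CPT/ and scripts/SFT/ layout):

```bash
# Start the Ray cluster as in the SFT section, then launch RL Stage 1
bash scripts/RL/InfiR2_RL_FP8_7B_stage1_4node.sh
```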
This curriculum-based strategy ensures stable training and optimal performance across different response length requirements.
We use the open-source evalscope framework for all model evaluations to ensure reproducibility. Our evaluation suite includes four reasoning benchmarks with provided evaluation scripts.
We have verified that our models work correctly with the latest version of evalscope, achieving consistent performance results. However, to strictly reproduce the exact evaluation results reported in our paper, please use the following specific version of evalscope:
Recommended Version for Reproduction:
- Repository: evalscope
- Branch: main
- Pull Request: Add qwen-code best practice doc (#734)
Installation:
Follow the official documentation at https://evalscope.readthedocs.io/zh-cn/latest/get_started/installation.html:

```bash
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
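
# Optional, for strict reproduction: pin to the commit associated with the PR
# referenced above (this assumes the PR number appears in the merge/squash
# commit message; otherwise check the PR page for the exact commit)
git log --oneline --grep="#734"
# git checkout <commit-hash-from-the-line-above>
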
pip install -e .
```

We provide evaluation scripts for four key reasoning benchmarks:
| Benchmark | Script | Max Tokens | Samples | Temperature |
|---|---|---|---|---|
| AIME 2024 | aime24_eval.sh | 31,000 | 32 | 0.65 |
| AIME 2025 | aime25_eval.sh | 31,000 | 32 | 0.65 |
| GPQA | gpqa_eval.sh | 26,000 | 8 | 0.65 |
| LiveCodeBench | livecodebenchv5_eval.sh | 27,000 | 8 | 0.65 |
Each script uses Slurm for job scheduling and SGLang for efficient inference serving. The evaluation pipeline consists of the following steps, sketched below:
- Starting an SGLang server with the model
- Running evalscope with the specified benchmark
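A minimal sketch of these two steps is shown below; the model path, tensor-parallel size, and port are placeholders, and the provided *_eval.sh scripts wire both steps together through Slurm:

```bash
# 1) Serve the model with SGLang (placeholder path/port/TP size)
python -m sglang.launch_server \
    --model-path /path/to/InfiR2-7B-Instruct-FP8 \
    --tp 4 \
    --host 0.0.0.0 \
    --port 30000

# 2) In another shell, run evalscope against this endpoint via the matching
#    benchmark script (e.g. aime24_eval.sh), as provided in the repo
```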
- 7B Model
| Model | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 43.00 | 49.00 | 48.20 | 37.60 |
| Qwen2.5-7B-base (w. InfiAlign) | 33.75 | 43.02 | 48.11 | 39.48 |
| InfiR2-7B-Instruct-FP8 | 40.62 | 55.73 | 45.33 | 40.31 |
- 1.5B Model
| Model | AIME 25 | AIME 24 | GPQA | LiveCodeBench v5 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 21.35 | 26.87 | 32.26 | 18.50 |
| Qwen2.5-1.5B-base (w. InfiAlign) | 14.58 | 10.52 | 28.98 | 12.99 |
| InfiR2-1.5B-Instruct-FP8 | 18.45 | 17.39 | 29.48 | 17.10 |
We would like to express our gratitude for the following open-source projects:
- slime - An LLM post-training framework for RL scaling that powers GLM-4.5 and GLM-4.6. slime supports training for nearly all models compatible with Megatron-LM. We are actively collaborating with the slime community to achieve fully training-inference consistent FP8 RL training.
- Megatron-LM - Large-scale transformer model training framework by NVIDIA.
- TransformerEngine - Library for accelerating transformer models on NVIDIA GPUs with FP8 precision.
- Qwen2.5 - Foundation models that inspired our work.
If you find our work useful, please cite:
@misc{wang2025infir2comprehensivefp8training,
title={InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models},
author={Wenjun Wang and Shuo Cai and Congkai Xie and Mingfa Feng and Yiming Zhang and Zhen Li and Kejing Yang and Ming Li and Jiannong Cao and Hongxia Yang},
year={2025},
eprint={2509.22536},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22536},
}