This repository contains the demo implementation of the FlowMoE paper accepted at NeurIPS 2025. FlowMoE is a scalable, generic, and user-friendly pipeline scheduling framework for accelerating the training of MoE models, and it outperforms state-of-the-art scheduling frameworks including ScheMoE, Tutel, and FasterMoE. The implementation builds on code from ScheMoE and Tutel.
The following prerequisites should be installed for this repository:
- CUDA >= 11.3
- PyTorch >= 1.12.1
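Before installing, you can quickly verify that your CUDA toolkit and PyTorch versions meet these requirements. This is only a minimal sanity check, not part of the official setup:

```shell
# Check the CUDA toolkit version (should be >= 11.3)
nvcc --version

# Check the PyTorch version, its CUDA build, and GPU visibility (PyTorch should be >= 1.12.1)
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```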
You can install the dependencies by running the following scripts:
```shell
# Install zfp
git clone https://github.com/Fragile-azalea/zfp.git
cd zfp
mkdir build
cd build
cmake ..
cmake --build . --config Release
ctest
cd ../..

# Install ScheMoE
git clone https://github.com/Fragile-azalea/ScheMoE.git
cd ScheMoE
python setup.py install
```
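After installation, a quick import check can confirm that the package and its custom kernels were built correctly. This is only an illustrative sketch: it assumes the ScheMoE package is importable as `schemoe`; adjust the module name if your installation differs.

```shell
# Sanity check (the module name `schemoe` is an assumption; change it if needed)
python3 -c "import schemoe; print('ScheMoE imported successfully')"
```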
You can download this code to the /root/code folder and run the following scripts:
```shell
# Single machine:
cd /root/code/flowmoe/dist_train
python3 -m torch.distributed.run --nproc_per_node=4 -m train_w_FlowMoE_BO --a2a_ffn_overlap_degree=2 --num_steps=10
```
Assuming you have 4 GPUs on a single node and everything works well, you will see 4 workers running on the node, training the customized MoE layers with the FlowMoE framework.
```shell
# Distributed (multi-node):
# Please refer to flowmoe/dist_train/run_mpi.sh
```
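For reference, a generic two-node launch with torch.distributed.run might look like the sketch below. The master address, port, and node ranks are placeholders, and run_mpi.sh remains the authoritative launcher for this repository:

```shell
# Illustrative only: placeholder master address/port; run_mpi.sh is the actual launcher
# On node 0 (replace node0.example.com with the master node's address):
python3 -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node=4 \
    --master_addr=node0.example.com --master_port=29500 \
    -m train_w_FlowMoE_BO --a2a_ffn_overlap_degree=2 --num_steps=10

# On node 1, run the same command with --node_rank=1
```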
The code has been tested in the following environment:
- g++ == 7.5.0
- CUDA == 11.3
- GPU == NVIDIA GeForce RTX 3090