This repository contains the demo implementation of the FlowMoE paper accepted at NeurIPS 2025. FlowMoE is a scalable, generic, and user-friendly pipeline scheduling framework for accelerating the training of MoE models, and it outperforms state-of-the-art scheduling frameworks including ScheMoE, Tutel, and FasterMoE. The implementation builds on code from ScheMoE and Tutel.
The following prerequisites should be installed for this repository:
- CUDA >= 11.3
- PyTorch >= 1.12.1
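Before installing, you can quickly verify that your CUDA toolkit and PyTorch versions meet these requirements. This is only a minimal sanity check, not part of the official setup:

```shell
# Check the CUDA toolkit version (should be >= 11.3)
nvcc --version

# Check the PyTorch version, its CUDA build, and GPU visibility (PyTorch should be >= 1.12.1)
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```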
You can install the dependencies by running the following scripts:
```shell
# Install zfp
git clone https://github.com/Fragile-azalea/zfp.git
cd zfp
mkdir build
cd build
cmake ..
cmake --build . --config Release
ctest
cd ../..

# Install ScheMoE
git clone https://github.com/Fragile-azalea/ScheMoE.git
cd ScheMoE
python setup.py install
```
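After installation, a quick import check can confirm that the package and its custom kernels were built correctly. This is only an illustrative sketch: it assumes the ScheMoE package is importable as `schemoe`; adjust the module name if your installation differs.

```shell
# Sanity check (the module name `schemoe` is an assumption; change it if needed)
python3 -c "import schemoe; print('ScheMoE imported successfully')"
```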
You can download this code to the /root/code folder and run the following scripts:
```shell
# Single machine:
cd /root/code/flowmoe/dist_train
python3 -m torch.distributed.run --nproc_per_node=4 -m train_w_FlowMoE_BO --a2a_ffn_overlap_degree=2 --num_steps=10
```
Assuming you have 4 GPUs on a single node and everything works well, you will see 4 workers running on the node, training the customized MoE layers with the FlowMoE framework.
```shell
# Distributed (multi-node):
# Please refer to flowmoe/dist_train/run_mpi.sh
```
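For reference, a generic two-node launch with torch.distributed.run might look like the sketch below. The master address, port, and node ranks are placeholders, and run_mpi.sh remains the authoritative launcher for this repository:

```shell
# Illustrative only: placeholder master address/port; run_mpi.sh is the actual launcher
# On node 0 (replace node0.example.com with the master node's address):
python3 -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node=4 \
    --master_addr=node0.example.com --master_port=29500 \
    -m train_w_FlowMoE_BO --a2a_ffn_overlap_degree=2 --num_steps=10

# On node 1, run the same command with --node_rank=1
```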
The code has been tested in the following environment:
- g++ == 7.5.0
- CUDA == 11.3
- GPU == NVIDIA GeForce RTX 3090