This repository contains the code and instructions to replicate experiments of the paper titled "SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks".
SkipPipe introduces a fault-tolerant pipeline-parallel method that dynamically skips and reorders stages to optimize training in decentralized environments. SkipPipe achieves a 55% reduction in training time compared to standard pipeline methods in these environments, with no degradation in convergence.
It is also highly fault tolerant, demonstrating robustness to node failure rates of up to 50% with only a 7% degradation in perplexity at inference time (i.e. when half of the pipeline nodes for a single model are unavailable, running inference through the now-sparse model increases perplexity by only 7%).
Unlike existing data-parallel methods, SkipPipe can accommodate large-model training. Since it shards the model itself across nodes, rather than simply sharding the dataset, SkipPipe reduces the memory footprint on each individual node and removes the per-node cap on model size, allowing arbitrarily large models to be built across distributed, and decentralised, infrastructure.
Figure: An example of partial pipeline parallelism scheduling where each colored (solid or dashed) path represents a different microbatch. Each node in stage 0 sends out 2 microbatches, the first in solid, the second in dashed. Green backgrounds show the forward pass, and light orange the backward pass. For better visualisation, the loss and de-embedding computations are omitted. Arrows show the prioritisation of the microbatches from forward to backward pass within the same node.
This code uses the following two repositories:

- simplellm - for construction of the models, loading of datasets, tokenizers, etc.
- DecCom - for communication between devices
You can install both by cloning each repository and running `pip install .`, or by running the `setup.sh` provided here. Additionally, you need to install the requirements in `requirements.txt` with `pip install -r requirements.txt`.
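For example, a minimal installation sketch (the repository URLs below are placeholders for the actual simplellm and DecCom locations):

```bash
# Clone and install the two dependencies (URLs below are placeholders)
git clone <simplellm-repo-url>
pip install ./simplellm

git clone <DecCom-repo-url>
pip install ./DecCom

# Alternatively, use the provided setup script
./setup.sh

# Install the remaining Python requirements
pip install -r requirements.txt
```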
Schedules are generated with `create_schedule.py`. Modify the respective hyperparameters of the algorithm (lines 16 to 42). Depending on your CPU and the chosen setting, this may take a while.
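For example (assuming a standard Python 3 environment; adjust the interpreter name if needed):

```bash
# Edit the hyperparameters of the algorithm (lines 16 to 42 of create_schedule.py),
# then generate the schedule; this may take a while depending on CPU and setting
python create_schedule.py
```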
Start training with
./run.sh [FIRST DEVICE] [LAST DEVICE] [SETTING] [SAMPLES PER MICROBATCH]
This will start all nodes from FIRST DEVICE to LAST DEVICE on this machine with the given SETTING (`random` for DT-FM Skip, `ca-partial` for SkipPipe with TC2, `non-ca-partial` for SkipPipe without TC2, or `baseline` for DT-FM).
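For example, the following (device numbers and microbatch size chosen purely for illustration) starts devices 0 through 3 on this machine with the SkipPipe + TC2 setting:

```bash
# Devices 0..3 on this machine, SkipPipe with TC2, 2 samples per microbatch
./run.sh 0 3 ca-partial 2
```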
If you find this work useful, please cite:

@article{blagoev2025skippipe,
title={SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks},
author={Blagoev, Nikolay and Chen, Lydia Y and Ersoy, O\u{g}uzhan},
year={2025},
eprint={2502.19913},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.19913},
}