
EXO Gym
Simulate distributed training on any hardware configuration, at any scale.

Simulate a GPU cluster with just your laptop! For example:

  • Simulate training with SPARTA on a cluster of 4 Mac Studios connected over Ethernet
  • Simulate training with DiLoCo on a cluster of 16 H100s connected over the internet

Why EXO Gym?

  • Simulate distributed training without setting up distributed clusters; avoid Kubernetes, Docker, and GPU hosting.
  • Fast iteration: implementing a new distributed training algorithm from scratch takes as little as 5 lines (see the minimal sketch below).
  • Scale up the number of nodes by changing a single parameter.
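
As a taste of how compact a strategy can be, here is a minimal sketch of naive gradient averaging (DDP-style). It reuses the Strategy base class and all_reduce helper shown later in Custom Algorithms; treat it as an illustration rather than a verbatim EXO Gym API reference.

class AllReduceStrategy(Strategy):
    def step(self):
        for param in self.model.parameters():
            if param.grad is not None:
                all_reduce(param.grad)        # sum gradients across simulated nodes
                param.grad /= self.num_nodes  # average them
        self.optim.step()                     # apply the averaged gradient locally
        super().step()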

Supported Algorithms

  • DDP
  • DiLoCo
  • SPARTA

... and anything else you can imagine! Implementing new algorithms with EXO Gym is very simple - see Custom Algorithms.

Installation

Dependencies

  • python>=3.10

To install from source:

git clone https://github.com/exo-explore/gym.git exogym
cd exogym
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Usage

Example Scripts

MNIST Comparison

Compare DDP, DiLoCo, and SPARTA on the MNIST dataset. Runs in under 2 minutes on an M4 Mac Mini.

python example/mnist.py

[Figure: MNIST training comparison]

NanoGPT OpenWebText

Train a NanoGPT-style transformer on the OpenWebText dataset.

python example/nanogpt_train.py --dataset owt --strategy diloco

[Figure: OWT DiLoCo, N=4]

Shakespeare DiLoCo Scaling K

How does DiLoCo perform at different device counts (K)? This script compares DiLoCo across device counts, normalized by FLOPs.

python example/diloco_scaling.py --dataset shakespeare

We can generate text with the newly trained model as follows:

python example/nanogpt/shakespeare_inference.py

[Figure: Shakespeare training results]

Custom Training

Strategies (e.g. DiLoCo, SPARTA) are portable across domains. A custom dataset and model can be trained with a distributed algorithm as follows:

from exogym import Trainer
from exogym.strategy.diloco import DiLoCoStrategy

train_dataset, val_dataset = ...
model = ... # model.forward() expects a batch, and returns a scalar loss

trainer = Trainer(model, train_dataset, val_dataset)

# Strategy for optimization & communication
strategy = DiLoCoStrategy(
  inner_optim='adam',
  H=100  # number of local (inner) steps between synchronizations
)

trainer.fit(
  strategy=strategy,
  num_nodes=4,  # number of simulated nodes
  device='mps'
)
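
The Trainer expects model.forward() to take a batch and return a scalar loss. Below is a minimal sketch of such a model; the class name is hypothetical, and it assumes batches arrive as (inputs, labels) tuples, which may differ for your dataset.

import torch.nn as nn

class LossWrappedMLP(nn.Module):
    def __init__(self, in_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, batch):
        x, y = batch                           # assumed (inputs, labels) tuple
        return self.criterion(self.net(x), y)  # returns a scalar loss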

Custom Algorithms

example/playground.py is a minimal starting point for writing new algorithms. For example, to implement gradient quantization from scratch:

import torch
from typing import Literal

# Strategy and all_reduce are provided by EXO Gym; see example/playground.py for the exact imports.

class QuantizationStrategy(Strategy):
    def __init__(self, optim_spec, quantization_level: Literal['int8']):
        super().__init__()
        self.optim_spec = optim_spec
        self.scale = 0.024
        self.zero_point = 0
        self.qdtype = torch.uint8

    def step(self):
        for param in self.model.parameters():
            if param.grad is not None:
                # Quantize the local gradient to 8-bit integers
                quantized = torch.round(param.grad / self.scale + self.zero_point).clamp(0, 255).to(self.qdtype)

                # Widen to int32 so the cross-node sum cannot overflow, then all-reduce
                q_wide = quantized.to(torch.int32)
                all_reduce(q_wide)

                # Dequantize the summed gradient and average across nodes
                param.grad = (q_wide.to(torch.float32) * self.scale) / self.num_nodes

        self.optim.step()
        super().step()
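
A custom strategy plugs into the same Trainer API as the built-in ones. A minimal usage sketch (the 'adam' value for optim_spec is an assumption, mirroring the inner_optim argument in the DiLoCo example above):

strategy = QuantizationStrategy(optim_spec='adam', quantization_level='int8')

trainer.fit(
  strategy=strategy,
  num_nodes=4,
  device='mps'
)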

Supported Devices

  • CPU
  • CUDA
  • MPS (copy operations are CPU-bound)

Technical Details

For further details on how EXO Gym works under the hood, please see docs/.

Citation

If you use EXO Gym in your research, please cite:

@software{exogym2025,
  title={EXO Gym},
  author={Matt Beton and Mohamed Baioumy and Matt Reed and Seth Howes and Alex Cheema},
  year={2025},
  url={https://github.com/exo-explore/gym}
}
