This repository contains a PyTorch implementation of the paper "One-step Latent-free Image Generation with Pixel Mean Flows" (Lu et al., 2026).
Pixel Mean Flow (pMF) is a one-step, latent-free generative model that trains a network to directly predict clean images from noisy inputs. It formulates the training objective using Mean Matching in the velocity space while parameterizing the network output in the pixel space (x-prediction).
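As a rough illustration of the idea (not the repository's actual Algorithm 1 in `pmf.py`), a minimal x-prediction training step and one-step sampler might look like the sketch below. It assumes a linear interpolation path z_t = (1 - t)·x + t·ε and a simplified velocity-space loss; `TinyNet` is a hypothetical stand-in for the DiT backbone in `model.py`, and the real objective additionally uses Mean Matching over average velocities.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for the DiT backbone in model.py."""
    def __init__(self, ch=3):
        super().__init__()
        self.conv = nn.Conv2d(ch + 1, ch, 3, padding=1)

    def forward(self, z_t, t):
        # Broadcast the noise level t as an extra input channel.
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *z_t.shape[2:])
        return self.conv(torch.cat([z_t, t_map], dim=1))

def pmf_train_step(net, x, opt):
    """One simplified training step: x-prediction output, velocity-space loss."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)  # noise level per sample
    eps = torch.randn_like(x)
    z_t = (1 - t) * x + t * eps                # noisy input on the linear path
    x_pred = net(z_t, t.flatten())             # network predicts the clean image
    v_pred = (z_t - x_pred) / t.clamp(min=1e-3)  # velocity implied by x_pred
    loss = (v_pred - (eps - x)).pow(2).mean()    # match the path velocity eps - x
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def sample_one_step(net, shape, device="cpu"):
    """One-step sampling: a single forward pass from pure noise."""
    eps = torch.randn(shape, device=device)
    t = torch.ones(shape[0], device=device)
    return net(eps, t)
```

The one-step sampler is the point of the method: generation is a single network evaluation in pixel space, with no latent encoder/decoder and no iterative solver.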
- `model.py`: DiT-based architecture adapted for pMF.
- `pmf.py`: Core pMF logic, including Algorithm 1 (Training) and one-step sampling.
- `optimizer.py`: Implementation of the Muon optimizer.
- `train.py`: Training script with Hugging Face Accelerate support.
- `eval.py`: Evaluation script for generating samples and FID preparation.
- `config.yaml`: Configuration file (YAML).
- `config.py`: Configuration loading logic.
- `dataset.py`: Data loading (dummy, ImageFolder, or Hugging Face Datasets).
- `auto_batch.py`: Automatic batch size estimation utility.
This implementation supports loading the ImageNet-1K dataset via the Hugging Face datasets library (Apache Parquet format).
Ensure your data directory (e.g., /data2/private/huangcheng/data/imagenet-1k-256x256-modelscope) contains the following structure:
```
/path/to/dataset/
├── data/
│   ├── train-00000-of-00040.parquet
│   ├── ...
│   └── validation-00000-of-00002.parquet
└── ...
```
The dataset.py script automatically detects Parquet files and loads them using datasets:
```python
from datasets import load_dataset

# Automatically handled in dataset.py
dataset = load_dataset(config.data_path, split='train')
```

Set `data_path` in `config.yaml` to your dataset directory.
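To feed the loaded dataset into training, it still needs to be wrapped in a PyTorch `DataLoader` with pixel-space preprocessing. The sketch below is one plausible way to do that, not necessarily what `dataset.py` does; the `"image"`/`"label"` field names follow the usual imagenet-1k schema, and the resize/crop/normalize choices are assumptions.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def to_tensor(img, size=256):
    """Resize the shorter side to `size`, center-crop, scale to [-1, 1].

    Hypothetical preprocessing; the repo's dataset.py may differ.
    """
    img = img.convert("RGB")
    w, h = img.size
    s = size / min(w, h)
    img = img.resize((round(w * s), round(h * s)))
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    x = torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).float() / 255.0
    return x * 2 - 1  # pixel-space model: map to [-1, 1]

def collate(batch):
    images = torch.stack([to_tensor(ex["image"]) for ex in batch])
    labels = torch.tensor([ex["label"] for ex in batch])
    return images, labels

def make_loader(dataset, batch_size=64, workers=8):
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=workers, collate_fn=collate, pin_memory=True)
```

Since pMF operates directly in pixel space, the only preprocessing is geometric and scaling; there is no VAE encoding step.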
- Minimum Configuration: 8x NVIDIA A100 (40 GB or 80 GB).
- Recommended Batch Size:
  - For A100 40 GB (FP16): micro-batch size per GPU ≈ 32-64.
  - Total batch size = micro-batch size × 8 GPUs.
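The relationship between the global batch size, per-GPU micro-batch size, and gradient accumulation can be made concrete with a small helper (hypothetical; `train.py` may derive this differently):

```python
def grad_accum_steps(global_batch, micro_batch, num_gpus=8):
    """Gradient-accumulation steps needed to reach the global batch size.

    Hypothetical helper, not part of the repo's API.
    """
    per_step = micro_batch * num_gpus  # images processed per optimizer micro-step
    if global_batch % per_step:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // per_step

# e.g. global 512 with micro-batch 64 on 8 GPUs -> 1 (no accumulation needed)
print(grad_accum_steps(512, 64))
```

On the recommended 8-GPU setup, a micro-batch of 64 already yields an effective batch of 512 per step, so accumulation only matters when memory forces a smaller micro-batch.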
```bash
pip install -r requirements.txt
# Or using uv
uv sync
```

Edit `config.yaml` to match your environment, specifically `data_path`.
Run the estimation tool to automatically determine the optimal batch size for your hardware:

```bash
python auto_batch.py
```

This will update `config.yaml` with the recommended `micro_batch_size` and `global_batch_size`.
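A common way to implement such a tool is to grow the batch size until a training step hits CUDA out-of-memory, then keep the last size that succeeded. The sketch below illustrates that doubling strategy under stated assumptions; it is not the actual contents of `auto_batch.py`.

```python
import torch

def estimate_micro_batch(step_fn, start=1, limit=1024):
    """Double the batch size until step_fn raises CUDA OOM, then back off.

    step_fn(batch_size) should run one full forward/backward pass.
    Hypothetical sketch of what auto_batch.py might do.
    """
    best, bs = 0, start
    while bs <= limit:
        try:
            step_fn(bs)        # probe: does one training step fit in memory?
            best = bs
            bs *= 2
        except RuntimeError as e:  # CUDA OOM surfaces as RuntimeError
            if "out of memory" in str(e).lower():
                torch.cuda.empty_cache()  # release the failed allocation
                break
            raise  # any other error is a real bug, re-raise it
    return best
```

Probing with a full forward/backward pass (rather than just a forward pass) matters, since activations kept for the backward pass dominate memory at training time.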
Use `accelerate` to launch distributed training on 8 GPUs:

```bash
accelerate launch --multi_gpu --num_processes 8 train.py
```

Or using `torchrun`:

```bash
torchrun --nproc_per_node=8 train.py
```

You can manually override the batch size in `config.yaml`:
```yaml
training:
  global_batch_size: 512   # Total across all GPUs
  micro_batch_size: 64     # Per GPU
```

Tested on 8x NVIDIA A100 40GB (ImageNet 256x256, FP16):
| Metric | Value |
|---|---|
| Throughput | ~1200 images/sec |
| Memory per GPU | ~32 GB (Batch Size 64) |
| Training Time | ~2-3 days for 160 epochs |
Note: Actual performance may vary based on CPU data loading speed and disk I/O.
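As a back-of-the-envelope check, the throughput and training-time figures above are mutually consistent: ImageNet-1K has 1,281,167 training images, so 160 epochs at ~1200 images/sec works out to roughly two days.

```python
# Sanity-check the benchmark table: epochs * images / throughput.
epochs, images, throughput = 160, 1_281_167, 1200
hours = epochs * images / throughput / 3600
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")
```

Slower data loading or disk I/O pushes the effective throughput down, which is why the observed range extends toward three days.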
- Lu et al., "One-step Latent-free Image Generation with Pixel Mean Flows", arXiv:2601.22158, 2026.
- Geng et al., "Improved Mean Flows", 2025.