Ported from pytorch-examples.
- torchvision:
pip install torchvision
- tqdm:
pip install tqdm
Run the example:
python mnist.py

Same example with logging using TQDM progress bar
python mnist_with_tqdm_logger.py

MNIST example with training and validation monitoring using Tensorboard
- Tensorboard:
pip install tensorboard
Run the example:
python mnist_with_tensorboard.py --log_dir=/tmp/tensorboard_logs
Start tensorboard:
tensorboard --logdir=/tmp/tensorboard_logs/

MNIST example with training and validation monitoring using Visdom
- Visdom:
pip install visdom
Start visdom:
python -m visdom.server
Run the example:
python mnist_with_visdom.py

- ClearML python client:
pip install clearml
Run the example:
python mnist_with_clearml_logger.py

This example shows how to save a checkpoint of the trainer, model, optimizer, and lr scheduler. The user can resume training from the latest stored checkpoint. In addition, a training crash can be emulated.
The --deterministic option sets up the trainer as a DeterministicEngine.
This trainer synchronizes the dataflow on each epoch to ensure the same dataflow when training is resumed.
Please see the documentation for more details.
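
The idea behind resuming with the same dataflow can be illustrated with a small standard-library sketch. This is not how DeterministicEngine is implemented; it only assumes that the data order of each epoch is derived from a fixed base seed and the epoch number, so a run resumed at epoch 2 sees the same batches as an uninterrupted run. The names epoch_batches, run, and base_seed are hypothetical.

```python
import random


def epoch_batches(num_samples, batch_size, base_seed, epoch):
    """Return the shuffled index batches for one epoch.

    The order depends only on (base_seed, epoch), not on anything
    that ran before, which is what makes resuming reproducible.
    """
    rng = random.Random(base_seed + epoch)  # fresh generator, re-seeded per epoch
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


def run(start_epoch, max_epochs, num_samples=10, batch_size=4, base_seed=12):
    """Collect the batches seen from start_epoch (inclusive) to max_epochs (exclusive)."""
    return {
        epoch: epoch_batches(num_samples, batch_size, base_seed, epoch)
        for epoch in range(start_epoch, max_epochs)
    }


full = run(0, 4)      # uninterrupted run over epochs 0..3
resumed = run(2, 4)   # run resumed at epoch 2

# Every epoch of the resumed run sees exactly the same batches.
same_dataflow = all(full[epoch] == resumed[epoch] for epoch in resumed)
print(same_dataflow)
```

Without the per-epoch reseeding, the resumed run would draw from a generator in a different state and the batch order would diverge from the original run.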
- torchvision:
pip install torchvision
- tqdm:
pip install tqdm
- TensorboardX:
pip install tensorboardX
- Tensorboard:
pip install tensorboard
Training
python mnist_save_resume_engine.py --log_dir=logs/run_1 --epochs=10
# or same in deterministic mode
python mnist_save_resume_engine.py --log_dir=logs-det/run_1 --deterministic --epochs=10

Resume the training
python mnist_save_resume_engine.py --log_dir=logs/run_2 --resume_from=logs/run_1/checkpoint_5628.pt --epochs=10
# or same in deterministic mode
python mnist_save_resume_engine.py --log_dir=logs-det/run_2 --resume_from=logs-det/run_1/checkpoint_5628.pt --deterministic --epochs=10

Start tensorboard:
tensorboard --logdir=.

The script logs batch stats (mean/std of images, median of targets), the norms of the model weights, and the norms of the computed gradients to
run.log and resume_run.log so that training behaviour can be compared in both cases.
With the --deterministic option set, we observe the same values after resuming the training.
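
As a rough illustration of the batch stats mentioned above, the sketch below computes the mean and standard deviation of the pixel values and the median of the targets for a toy batch, using only the standard library. It is not the logging code of mnist_save_resume_engine.py; the function name batch_stats, the dictionary keys, and the use of the population standard deviation are assumptions made for the example.

```python
import statistics


def batch_stats(images, targets):
    """Summarize one batch: mean/std over all pixels, median over targets.

    `images` is a list of flat pixel lists, `targets` a list of labels.
    The population standard deviation is used here (an assumption).
    """
    pixels = [p for image in images for p in image]
    return {
        "mean": statistics.fmean(pixels),
        "std": statistics.pstdev(pixels),
        "median_target": statistics.median(targets),
    }


# Toy batch of three 2-pixel "images" with their labels.
images = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
targets = [3, 7, 5]

stats_run = batch_stats(images, targets)
stats_resumed = batch_stats(images, targets)  # same batch after a resume

print(stats_run)
print(stats_run == stats_resumed)
```

If a resumed run feeds the same batches as the original run, these per-batch values match line by line, which is what makes them a convenient fingerprint for comparing two logs.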
[Figures: Non-deterministic | Deterministic]
[Figure: Deterministic run.log vs resume_run.log]

Initial training with a crash
python mnist_save_resume_engine.py --crash_iteration 5700 --log_dir=logs/run_3_crash --epochs 10
# or same in deterministic mode
python mnist_save_resume_engine.py --crash_iteration 5700 --log_dir=logs-det/run_3_crash --epochs 10 --deterministic

Resume from the latest checkpoint
python mnist_save_resume_engine.py --resume_from logs/run_3_crash/checkpoint_6.pt --log_dir=logs/run_4 --epochs 10
# or same in deterministic mode
python mnist_save_resume_engine.py --resume_from logs-det/run_3_crash/checkpoint_6.pt --log_dir=logs-det/run_4 --epochs 10 --deterministic

[Figures: Non-deterministic | Deterministic]
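
The crash-and-resume workflow above can be sketched with a toy training loop that uses only the standard library. This is a simplified illustration, not the script's implementation: the state dictionary, the JSON checkpoint file, and the train function are hypothetical stand-ins for the trainer, model, optimizer, and lr scheduler state that the real example stores.

```python
import json
import os
import tempfile


def train(state, max_epochs, iters_per_epoch, ckpt_path, crash_iteration=None):
    """Toy loop: 'weight' accumulates iteration numbers; checkpoint after each epoch."""
    while state["epoch"] < max_epochs:
        for _ in range(iters_per_epoch):
            state["iteration"] += 1
            if crash_iteration is not None and state["iteration"] == crash_iteration:
                raise RuntimeError(f"emulated crash at iteration {crash_iteration}")
            state["weight"] += state["iteration"]  # stand-in for a parameter update
        state["epoch"] += 1
        with open(ckpt_path, "w") as f:
            json.dump(state, f)  # persist the state at the epoch boundary
    return state


def fresh_state():
    return {"epoch": 0, "iteration": 0, "weight": 0}


with tempfile.TemporaryDirectory() as tmp:
    # Reference: an uninterrupted run of 3 epochs with 4 iterations each.
    reference = train(fresh_state(), 3, 4, os.path.join(tmp, "reference.json"))

    # Crashing run: fails in the middle of the second epoch.
    crash_ckpt = os.path.join(tmp, "crash.json")
    try:
        train(fresh_state(), 3, 4, crash_ckpt, crash_iteration=6)
    except RuntimeError as err:
        print(err)

    # Resume from the latest checkpoint, written at the end of the first epoch.
    with open(crash_ckpt) as f:
        restored = json.load(f)
    resumed = train(dict(restored), 3, 4, crash_ckpt)

print(resumed == reference)
```

Because the checkpoint is written at an epoch boundary, the work done between the last checkpoint and the crash is repeated on resume, and the final state matches the uninterrupted run.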




