This repository was archived by the owner on Dec 9, 2024. It is now read-only.

Description
Hi @tfboyd,
Problem: tf_cnn_benchmarks.py training speed using TFRecords on an SSD is only half of the speed with synthetic data.
Question: how can I identify and reduce the software bottleneck when training from TFRecords on an SSD?
The attached image shows training speed and GPU/CPU utilization for ImageNet training using TFRecords on the SSD (upper picture) and synthetic data (lower picture), based on the tf_cnn_benchmarks.py command from https://github.com/tensorflow/benchmarks
This is not a hardware bottleneck: the same PC with PyTorch 1.0.1.post2 achieves 320 img/sec (100% GPU utilization) for ResNet-50 training reading JPEGs from the SSD. The PyTorch training code is taken from https://github.com/pytorch/examples/tree/master/imagenet
python main.py -a resnet50 /N/data/ILSVRC2012/partition/imagenet-data/imagenet_data
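One quick way to separate raw SSD read speed from TensorFlow-side decoding cost is to stream the TFRecord shards with plain Python, skipping TensorFlow entirely: if raw reads are fast, the bottleneck is in the decode/preprocess pipeline rather than the disk. Below is a minimal self-contained sketch of the TFRecord on-disk framing (little-endian uint64 length, 4-byte length CRC, payload, 4-byte payload CRC); the file name and record sizes are illustrative, and the writer zeroes the CRCs, which is fine for a throughput probe but not for files TensorFlow will actually consume.

```python
import os
import struct
import time


def write_tfrecord(path, payloads):
    """Write records in TFRecord framing with zeroed CRC fields.

    Good enough for a read-throughput probe; TensorFlow itself
    verifies the masked CRC32 checksums and would reject these files.
    """
    with open(path, "wb") as f:
        for p in payloads:
            f.write(struct.pack("<Q", len(p)))  # uint64 payload length
            f.write(b"\x00" * 4)                # length CRC (zeroed)
            f.write(p)                          # record payload
            f.write(b"\x00" * 4)                # payload CRC (zeroed)


def iter_tfrecords(path):
    """Yield raw record payloads from a TFRecord file, skipping CRCs."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return
            (length,) = struct.unpack("<Q", header)
            f.read(4)                 # skip length CRC
            payload = f.read(length)
            f.read(4)                 # skip payload CRC
            yield payload


if __name__ == "__main__":
    # ~50 MB of synthetic records, roughly JPEG-sized payloads.
    path = "probe.tfrecord"
    write_tfrecord(path, [os.urandom(100 * 1024) for _ in range(512)])
    start, nbytes = time.time(), 0
    for payload in iter_tfrecords(path):
        nbytes += len(payload)
    dt = time.time() - start
    print("read %.1f MB at %.0f MB/s" % (nbytes / 1e6, nbytes / 1e6 / dt))
    os.remove(path)
```

On a 970 EVO this should report well over 1 GB/s once the file is on disk (or far more if it is still in the page cache); a number in that range would point the blame at JPEG decode and preprocessing on the CPU rather than at the SSD.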
System info:
Ubuntu 18.04
Samsung SSD 970 EVO
TensorFlow: 1.14
Model: mobilenet
Dataset: imagenet
Mode: training
SingleSess: False
Batch size: 192 global (192 per device)
Num batches: 600548
Num epochs: 90.00
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
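For reference, a tf_cnn_benchmarks invocation that should reproduce the configuration logged above (the --data_dir path is a placeholder; --datasets_num_private_threads is an existing benchmark flag worth sweeping when the input pipeline cannot keep the GPU fed):

```shell
python tf_cnn_benchmarks.py \
    --model=mobilenet \
    --data_name=imagenet \
    --data_dir=/path/to/imagenet-tfrecords \
    --batch_size=192 \
    --num_epochs=90 \
    --num_gpus=1 \
    --data_format=NCHW \
    --variable_update=parameter_server \
    --datasets_num_private_threads=8
```

Dropping --data_dir makes the script fall back to synthetic data, which is how the two curves in the attached image can be generated from the same command.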
