
Model Quantization with Calibration Examples

This folder contains examples of quantizing an FP32 model with Intel® MKL-DNN or cuDNN.

Model Quantization with Intel® MKL-DNN

Intel® MKL-DNN supports quantization with subgraph features on Intel® CPU platforms and can bring performance improvements on the Intel® Xeon® Scalable platform. A new quantization script, imagenet_gen_qsym_mkldnn.py, launches quantization for image-classification models with Intel® MKL-DNN. The script integrates with the Gluon-CV model zoo, so additional pre-trained models can be downloaded from Gluon-CV and converted for quantization. To apply the quantization flow to your own project directly, refer to Quantize custom models with MKL-DNN backend.

usage: imagenet_gen_qsym_mkldnn.py [-h] [--model MODEL] [--epoch EPOCH]
                                   [--no-pretrained] [--batch-size BATCH_SIZE]
                                   [--label-name LABEL_NAME]
                                   [--calib-dataset CALIB_DATASET]
                                   [--image-shape IMAGE_SHAPE]
                                   [--data-nthreads DATA_NTHREADS]
                                   [--num-calib-batches NUM_CALIB_BATCHES]
                                   [--exclude-first-conv] [--shuffle-dataset]
                                   [--shuffle-chunk-seed SHUFFLE_CHUNK_SEED]
                                   [--shuffle-seed SHUFFLE_SEED]
                                   [--calib-mode CALIB_MODE]
                                   [--quantized-dtype {auto,int8,uint8}]
                                   [--enable-calib-quantize ENABLE_CALIB_QUANTIZE]

Generate a calibrated quantized model from a FP32 model with Intel MKL-DNN
support

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         model to be quantized.
  --epoch EPOCH         number of epochs, default is 0
  --no-pretrained       If enabled, will not download pretrained model from
                        MXNet or Gluon-CV modelzoo.
  --batch-size BATCH_SIZE
  --label-name LABEL_NAME
  --calib-dataset CALIB_DATASET
                        path of the calibration dataset
  --image-shape IMAGE_SHAPE
  --data-nthreads DATA_NTHREADS
                        number of threads for data decoding
  --num-calib-batches NUM_CALIB_BATCHES
                        number of batches for calibration
  --exclude-first-conv  exclude the first conv layer from quantization, since
                        its input data may contain negative values, which are
                        not supported at the moment
  --shuffle-dataset     shuffle the calibration dataset
  --shuffle-chunk-seed SHUFFLE_CHUNK_SEED
                        shuffling chunk seed, see
                        https://mxnet.incubator.apache.org/api/python/io/io.html?highlight=imager#mxnet.io.ImageRecordIter
                        for more details
  --shuffle-seed SHUFFLE_SEED
                        shuffling seed, see
                        https://mxnet.incubator.apache.org/api/python/io/io.html?highlight=imager#mxnet.io.ImageRecordIter
                        for more details
  --calib-mode CALIB_MODE
                        calibration mode used for generating calibration table
                        for the quantized symbol; supports 1. none: no
                        calibration will be used. The thresholds for
                        quantization will be calculated on the fly. This will
                        result in inference speed slowdown and loss of
                        accuracy in general. 2. naive: simply take min and max
                        values of layer outputs as thresholds for
                        quantization. In general, the inference accuracy
                        worsens with more examples used in calibration. It is
                        recommended to use `entropy` mode as it produces more
                        accurate inference results. 3. entropy: calculate KL
                        divergence of the fp32 output and quantized output for
                        optimal thresholds. This mode is expected to produce
                        the best inference accuracy of all three kinds of
                        quantized models if the calibration dataset is
                        representative enough of the inference dataset.
  --quantized-dtype {auto,int8,uint8}
                        quantization destination data type for input data
  --enable-calib-quantize ENABLE_CALIB_QUANTIZE
                        If enabled, the quantize op will be calibrated offline
                        if calibration mode is enabled
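The idea behind `--calib-mode=naive` and the int8/uint8 choice in `--quantized-dtype` can be illustrated with a small pure-Python sketch. This is a simplified model, not MXNet's implementation: it assumes per-tensor linear scaling, takes the calibrated range as the global min/max seen across calibration batches, and shows why data containing negative values loses information in uint8 (the concern behind `--exclude-first-conv`). The `entropy` (KL divergence) threshold search is not shown, and the helper names `naive_calibrate` and `quantize` are hypothetical.

```python
from typing import Iterable, List, Tuple


def naive_calibrate(batches: Iterable[List[float]]) -> Tuple[float, float]:
    """Return the global (min, max) of a layer's outputs over all calibration batches.

    Hypothetical helper illustrating the idea behind --calib-mode=naive:
    the observed range is used directly as the quantization threshold.
    """
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def quantize(values: List[float], min_val: float, max_val: float, dtype: str) -> List[int]:
    """Linearly map floats to int8 or uint8 using a calibrated range (simplified sketch)."""
    if dtype == "int8":
        # Signed: the larger magnitude of min/max maps to +/-127, so negatives survive.
        threshold = max(abs(min_val), abs(max_val))
        qmin, qmax = -127, 127
    elif dtype == "uint8":
        # Unsigned: only [0, max_val] is representable; negative inputs clamp to 0.
        threshold = max(max_val, 0.0)
        qmin, qmax = 0, 255
    else:
        raise ValueError("dtype must be 'int8' or 'uint8'")
    scale = qmax / threshold if threshold > 0 else 1.0
    # Scale, round to the nearest integer, then clamp into the target range.
    return [max(qmin, min(qmax, round(v * scale))) for v in values]


calib_batches = [[-0.5, 0.2, 1.0], [0.3, -1.2, 0.8]]
lo, hi = naive_calibrate(calib_batches)
print(lo, hi)  # -1.2 1.0
data = [-1.2, -0.5, 0.0, 0.5, 1.0]
print(quantize(data, lo, hi, "int8"))   # [-127, -53, 0, 53, 106]
print(quantize(data, lo, hi, "uint8"))  # [0, 0, 0, 128, 255]
```

In the uint8 case both negative inputs collapse to 0, which is why signed int8 (or leaving the first conv layer in FP32) is needed when input data can be negative.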

A new benchmark script, launch_inference_mkldnn.sh, launches performance benchmarks for float32 or int8 image-classification models with Intel® MKL-DNN.

usage: bash ./launch_inference_mkldnn.sh [[[-s symbol_file ] [-b batch_size] [-iter iteration] [-ins instance] [-c cores/instance]] | [-h]]

optional arguments:
  -h, --help                show this help message and exit
  -s, --symbol_file         symbol file for benchmark
  -b, --batch_size          inference batch size
                            default: 64
  -iter, --iteration        inference iteration
                            default: 500
  -ins, --instance          launch multi-instance inference
                            default: one instance per socket
  -c, --core                number of cores per instance
                            default: all physical cores divided evenly among instances

Example: ResNet-50 INT8 performance benchmark on c5.24xlarge (dual sockets, 24 physical cores per socket).

    bash ./launch_inference_mkldnn.sh -s ./model/resnet50_v1-quantized-5batches-naive-symbol.json

This launches two instances for the throughput benchmark, each using 24 physical cores.
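The default layout described above (one instance per socket, physical cores divided evenly among instances) comes down to simple arithmetic. The sketch below is a hypothetical helper, not part of launch_inference_mkldnn.sh; it assumes each instance receives a contiguous block of physical core IDs, whereas the actual core numbering and pinning depend on the OS and on how the script binds threads.

```python
from typing import List, Optional


def split_cores(sockets: int, cores_per_socket: int,
                instances: Optional[int] = None,
                cores_per_instance: Optional[int] = None) -> List[List[int]]:
    """Assign physical core IDs to inference instances (hypothetical helper).

    Defaults mirror the help text: one instance per socket, and all physical
    cores divided evenly among instances. Contiguous core IDs are an assumption.
    """
    total = sockets * cores_per_socket
    if instances is None:
        instances = sockets  # default: one instance per socket
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if cores_per_instance is None:
        cores_per_instance = total // instances  # default: divide all physical cores
    if cores_per_instance < 1 or instances * cores_per_instance > total:
        raise ValueError("not enough physical cores for this configuration")
    return [list(range(i * cores_per_instance, (i + 1) * cores_per_instance))
            for i in range(instances)]


# c5.24xlarge: 2 sockets x 24 physical cores per socket.
layout = split_cores(sockets=2, cores_per_socket=24)
for idx, cores in enumerate(layout):
    print(f"instance {idx}: cores {cores[0]}-{cores[-1]}")
# instance 0: cores 0-23
# instance 1: cores 24-47
```

Passing `instances=4` on the same machine yields four instances of 12 cores each, which corresponds to running the script with `-ins 4`.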

Use the following command to install Gluon-CV:

pip install gluoncv

The following models have been tested on Linux systems. Accuracy was measured on an Intel® Xeon® Cascade Lake CPU. On CPUs with Skylake or earlier architectures, the accuracy may differ.

Model	Source	Dataset	FP32 Accuracy (top-1/top-5)	INT8 Accuracy (top-1/top-5)
ResNet18-V1	Gluon-CV	Validation Dataset	70.15%/89.38%	69.92%/89.30%
ResNet50-V1	Gluon-CV	Validation Dataset	76.34%/93.13%	76.06%/92.99%
ResNet101-V1	Gluon-CV	Validation Dataset	77.33%/93.59%	77.07%/93.47%
Squeezenet 1.0	Gluon-CV	Validation Dataset	56.98%/79.20%	56.79%/79.47%
MobileNet 1.0	Gluon-CV	Validation Dataset	72.23%/90.64%	72.06%/90.53%
MobileNetV2 1.0	Gluon-CV	Validation Dataset	70.27%/89.62%	69.82%/89.35%
Inception V3	Gluon-CV	Validation Dataset	77.76%/93.83%	78.05%/93.91%
ResNet152-V2	MXNet ModelZoo	Validation Dataset	76.65%/93.07%	76.25%/92.89%
Inception-BN	MXNet ModelZoo	Validation Dataset	72.28%/90.63%	72.02%/90.53%
SSD-VGG16	example/ssd	VOC2007/2012	0.8366 mAP	0.8357 mAP
SSD-VGG16	example/ssd	COCO2014	0.2552 mAP	0.253 mAP
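The top-1/top-5 figures for the classification models are standard top-k accuracies: a sample counts as correct if its true label is among the k highest-scoring classes. The pure-Python sketch below shows the metric itself; it is not the evaluation code used by imagenet_inference.py, and the function name is hypothetical.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top:
            correct += 1
    return correct / len(labels)


scores = [[0.1, 0.7, 0.2],   # true class 1 ranked first
          [0.5, 0.3, 0.2],   # true class 1 ranked second
          [0.2, 0.3, 0.5]]   # true class 0 ranked last
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))  # 0.3333333333333333
print(top_k_accuracy(scores, labels, k=2))  # 0.6666666666666666
```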

ResNetV1

The following command downloads the pre-trained model from Gluon-CV and converts it into a symbolic model, which is then quantized. The validation dataset is available for testing the pre-trained models:

python imagenet_gen_qsym_mkldnn.py --model=resnet50_v1 --num-calib-batches=5 --calib-mode=naive

The model is automatically rewritten into fused and quantized form, then saved as quantized symbol and parameter files in the ./model directory. Set --model to resnet18_v1, resnet50_v1b, or resnet101_v1 to quantize other models. The following commands launch inference:

# Launch FP32 Inference
python imagenet_inference.py --symbol-file=./model/resnet50_v1-symbol.json --param-file=./model/resnet50_v1-0000.params --rgb-mean=123.68,116.779,103.939 --rgb-std=58.393,57.12,57.375 --num-skipped-batches=50 --batch-size=64 --num-inference-batches=500 --dataset=./data/val_256_q90.rec --ctx=cpu

# Launch INT8 Inference
python imagenet_inference.py --symbol-file=./model/resnet50_v1-quantized-5batches-naive-symbol.json --param-file=./model/resnet50_v1-quantized-0000.params --rgb-mean=123.68,116.779,103.939 --rgb-std=58.393,57.12,57.375 --num-skipped-batches=50 --batch-size=64 --num-inference-batches=500 --dataset=./data/val_256_q90.rec --ctx=cpu

# Launch dummy data Inference
bash ./launch_inference_mkldnn.sh -s ./model/resnet50_v1-symbol.json
bash ./launch_inference_mkldnn.sh -s ./model/resnet50_v1-quantized-5batches-naive-symbol.json
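The --rgb-mean and --rgb-std values passed to imagenet_inference.py are per-channel statistics on the 0-255 pixel scale. Assuming the usual convention, each channel is normalized as (x - mean) / std. The sketch below only shows that arithmetic for a single RGB pixel and how the comma-separated flag values parse; it does not show where imagenet_inference.py applies the normalization, and the helper names are hypothetical.

```python
from typing import Sequence, Tuple


def parse_triplet(flag_value: str) -> Tuple[float, float, float]:
    """Parse a comma-separated flag value such as '123.68,116.779,103.939'."""
    # Unpacking raises ValueError unless exactly three values are given.
    r, g, b = (float(p) for p in flag_value.split(","))
    return r, g, b


RGB_MEAN = parse_triplet("123.68,116.779,103.939")
RGB_STD = parse_triplet("58.393,57.12,57.375")


def normalize_pixel(rgb: Sequence[float],
                    mean: Sequence[float] = RGB_MEAN,
                    std: Sequence[float] = RGB_STD) -> Tuple[float, ...]:
    """Apply per-channel (x - mean) / std to one pixel on the 0-255 scale."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


print(normalize_pixel((123.68, 116.779, 103.939)))  # (0.0, 0.0, 0.0)
print(normalize_pixel((255, 255, 255)))  # roughly 2.2 to 2.6 per channel
```

A pixel equal to the mean maps to zero in every channel, and the divisors scale each channel to roughly unit variance over the ImageNet data these statistics describe.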

SqueezeNet 1.0

The following command downloads the pre-trained model from Gluon-CV and converts it into a symbolic model, which is then quantized. The validation dataset is available for testing the pre-trained models:

python imagenet_gen_qsym_mkldnn.py --model=squeezenet1.0 --num-calib-batches=5 --calib-mode=naive
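The output filenames for a run can be predicted from the ResNet example (resnet50_v1-quantized-5batches-naive-symbol.json and resnet50_v1-quantized-0000.params). The helper below is hypothetical and its pattern is inferred only from those example filenames. It assumes the naive or entropy calibration mode and the default --epoch of 0, and the script may name its outputs differently in other cases, for example with --calib-mode=none.

```python
from typing import Tuple


def quantized_output_files(model: str, num_calib_batches: int, calib_mode: str,
                           epoch: int = 0, model_dir: str = "./model") -> Tuple[str, str]:
    """Predict the symbol/params paths written by imagenet_gen_qsym_mkldnn.py.

    Hypothetical helper: the pattern is inferred from the README's ResNet example
    filenames and only covers the naive and entropy calibration modes.
    """
    if calib_mode not in ("naive", "entropy"):
        raise ValueError("pattern only inferred for 'naive' and 'entropy' modes")
    symbol = f"{model_dir}/{model}-quantized-{num_calib_batches}batches-{calib_mode}-symbol.json"
    params = f"{model_dir}/{model}-quantized-{epoch:04d}.params"
    return symbol, params


print(quantized_output_files("squeezenet1.0", 5, "naive"))
# ('./model/squeezenet1.0-quantized-5batches-naive-symbol.json', './model/squeezenet1.0-quantized-0000.params')
```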