[MXNET-688] Fix quantization divide by zero errors #11833
…ithout 1 off errors
| start = j * num_merged_bins | ||
| if j == num_quantized_bins - 1: | ||
| stop = -1 | ||
| stop = len(is_nonzeros) |
This is an off-by-1 error that can be caught via the quantization tests that I added. Indexing a numpy array with x[a:-1] excludes the last element.
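A minimal sketch of the slicing behaviour described above, using a hypothetical 7-bin mask and 3 quantized bins (the values are illustrative, not taken from the tests in this PR):

```python
import numpy as np

# Hypothetical mask of non-zero bins; 7 bins do not divide evenly by 3.
is_nonzeros = np.array([1, 1, 0, 1, 1, 1, 1])
num_quantized_bins = 3
num_merged_bins = is_nonzeros.size // num_quantized_bins  # 2

# Start index of the last quantized bin.
start = (num_quantized_bins - 1) * num_merged_bins        # 4

# stop = -1 ends the slice one element early.
buggy = is_nonzeros[start:-1]
# stop = len(is_nonzeros) includes the final element.
fixed = is_nonzeros[start:len(is_nonzeros)]

print(buggy.size, fixed.size)
```

With `stop = -1`, the last bin's norm ignores the final element, so the count of non-zero entries can be too small by one.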
| norm = is_nonzeros[start:stop].sum() | ||
| if norm != 0: | ||
| q[start:stop] = float(quantized_bins[j]) / float(norm) | ||
| q[sliced_nd_hist == 0] = 0 |
This is not representative of the quantized distribution: artificially setting values to 0 does not correctly represent the quantized activation output.
| if norm != 0: | ||
| q[start:stop] = float(quantized_bins[j]) / float(norm) | ||
| q[sliced_nd_hist == 0] = 0 | ||
| q[start:stop] = float(quantized_bins[j]) / float(num_quantized_bins) |
Originally this was float(norm), and that is not appropriate. Suppose you have the distribution:
[0, 0, ... , 0, 1]
If num_quantized_bins is 3, then you theoretically should get:
[0, 0, ... , 1/3, 1/3, 1/3]
instead of:
[0, 0, ..., 1, 1, 1]
To make this clearer, suppose your original distribution is:
[0, 0, 0, ... , 1/3, 1/3, 1/3]
After quantization this should be equivalent to the first distribution, but it isn't. Under the current rules, this distribution comes back unchanged as [..., 1/3, 1/3, 1/3], while the first distribution gives [..., 1, 1, 1], which is off by the multiplier.
I'm not sure I understand your change here. The original implementation is following the explanation here (see page 38):
http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
| th = max(abs(min_val), abs(max_val)) | ||
|
|
||
| hist, hist_edeges = np.histogram(arr, bins=num_bins, range=(-th, th)) | ||
| hist, hist_edges = np.histogram(arr, bins=num_bins, range=(-th, th)) |
edges is misspelled as edeges throughout the code.
| # at one edge: [0, 0, ..., 1000]. (histogram) | ||
| # We want to make sure that the optimal threshold in this case is the max. | ||
| arr = np.array([2]*1000) | ||
| res = mx.contrib.quant._get_optimal_threshold(arr, num_quantized_bins=5) |
With the incorrect implementation of _get_optimal_threshold, this test case would hit a divide-by-zero error.
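A rough sketch of why this input is an edge case. The bin count and window width below are small illustrative values chosen here, not the defaults used by the library:

```python
import numpy as np

arr = np.array([2] * 1000)
th = max(abs(arr.min()), abs(arr.max()))

# Small bin count for readability; an assumption for this sketch.
num_bins = 11
hist, hist_edges = np.histogram(arr, bins=num_bins, range=(-th, th))

# A candidate threshold window centred on the zero bin.
zero_bin_idx = num_bins // 2
half_width = 2  # hypothetical window half-width
window = hist[zero_bin_idx - half_width : zero_bin_idx + half_width + 1]

# All 1000 samples land in the last bin, so the centred window is empty
# and any normalisation by its sum would divide by zero.
window_is_empty = window.sum() == 0
print(hist, window_is_empty)
```

Every narrow window around zero sums to 0 here, so any step that divides by the number of non-zero entries needs a guard.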
| try: | ||
| q = _smooth_distribution(q) | ||
| except ValueError: | ||
| divergence[i - num_half_quantized_bins] = float("inf") |
If the distribution is improper, we set the KL divergence to infinity. Such a distribution could theoretically model a uniform distribution on [a, b] with either endpoint unbounded, for which the KL divergence is infinite.
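A minimal sketch of this guard. `smooth_distribution` and `kl_or_inf` are hypothetical stand-ins written for illustration; they are not the library's `_smooth_distribution`, and they assume the inputs are histogram counts (non-zero entries are at least 1):

```python
import numpy as np

def smooth_distribution(p, eps=1e-4):
    """Move a little mass from non-zero bins to zero bins (illustrative)."""
    p = np.asarray(p, dtype=np.float64)
    n_zeros = int((p == 0).sum())
    n_nonzeros = p.size - n_zeros
    if n_nonzeros == 0:
        # Nothing to take mass from: the distribution is improper.
        raise ValueError("distribution is all zeros")
    eps1 = eps * n_zeros / n_nonzeros
    return np.where(p == 0, eps, p - eps1)

def kl_or_inf(p, q):
    """KL(p || q), or infinity if either input cannot be smoothed."""
    try:
        p = smooth_distribution(p)
        q = smooth_distribution(q)
    except ValueError:
        return float("inf")
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

print(kl_or_inf([0, 0, 0], [1, 2, 3]))
print(kl_or_inf([1, 0, 3], [1, 2, 3]))
```

Returning infinity means an all-zero candidate can never be picked as the minimum-divergence threshold.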
For reference, with the new PR, on ImageNet, ResNet-152, 5 batches, entropy method:
Previously:
This is proof that there was no degradation in performance. @reminisce
| if norm != 0: | ||
| q[start:stop] = float(quantized_bins[j]) / float(norm) | ||
| q[sliced_nd_hist == 0] = 0 | ||
| q[p == 0] = 0 |
According to page 38 of the slides (http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf), the zeroed-out bins are meant to be with respect to the reference distribution p, rather than sliced_nd_hist.
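A sketch of the merge-and-expand step assembled from the diff fragments in this thread. `expand_quantized` is a hypothetical helper name, and the sample values are made up; this is an illustration, not the merged code:

```python
import numpy as np

def expand_quantized(p, sliced_nd_hist, num_quantized_bins):
    """Merge sliced_nd_hist into coarse bins, then expand back to p's size."""
    p = np.asarray(p)
    sliced_nd_hist = np.asarray(sliced_nd_hist)
    is_nonzeros = (p != 0).astype(np.int64)
    num_merged_bins = sliced_nd_hist.size // num_quantized_bins

    # Merge: sum each group of num_merged_bins fine bins.
    quantized_bins = np.zeros(num_quantized_bins, dtype=np.float64)
    for j in range(num_quantized_bins):
        start = j * num_merged_bins
        quantized_bins[j] = sliced_nd_hist[start:start + num_merged_bins].sum()
    # Leftover fine bins go into the last coarse bin.
    quantized_bins[-1] += sliced_nd_hist[num_quantized_bins * num_merged_bins:].sum()

    # Expand: spread each coarse bin over its non-zero fine bins.
    q = np.zeros(sliced_nd_hist.size, dtype=np.float64)
    for j in range(num_quantized_bins):
        start = j * num_merged_bins
        stop = len(is_nonzeros) if j == num_quantized_bins - 1 else start + num_merged_bins
        norm = is_nonzeros[start:stop].sum()
        if norm != 0:
            q[start:stop] = float(quantized_bins[j]) / float(norm)
    q[p == 0] = 0  # zero out with respect to the reference distribution p
    return q

sliced_nd_hist = np.array([0, 2, 4, 0, 1, 3, 5])
p = sliced_nd_hist.copy()
p[0] += 1   # hypothetical left outliers folded into the first bin
p[-1] += 2  # hypothetical right outliers folded into the last bin
q = expand_quantized(p, sliced_nd_hist, num_quantized_bins=3)
print(q)
```

In this example the first bin of `sliced_nd_hist` is 0 but the first bin of `p` is not, because outliers were folded into it. Masking with `sliced_nd_hist == 0` would zero that entry of `q`, while masking with `p == 0` keeps it.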
* Fix quantization bug
* Added tests and made sure the edge case is now considered correctly without 1 off errors
* Changed back to original truncated distribution but with different kl divergence calc
* Reorder back to original format
* Reorder back to original format (again)
* Change comments
* Clarified comments
* Changed norm division
Description
The current quantization strategy for calib_mode='entropy' is to calculate the KL divergence for different thresholds and choose the best threshold. This assumes that the random variable is nonzero for all reals and is a continuous random variable. Because we are discretizing the distribution, we smooth the distribution over the range [-threshold, threshold]. What we are not considering is that the entire sampled distribution may lie outside the range [-threshold, threshold], in which case we end up with all zeros in the sampled candidate p distribution inside of _get_optimal_threshold.
I have added a check that the distribution (possibly unnormalized) is proper before attempting to smooth it; otherwise we run into a divide-by-zero error.
In most cases, activation functions and layers for classification-type problems output numbers symmetric around 0. This is not the case for a regressor's last layer, and there are various other examples where the activation distribution is not centered around 0. This was a major blocker for Airbnb's adoption of MXNet's quantization capabilities.
Checklist
Essentials
Changes