From 4b94d035b21dd9707fb0d81eae54f8764ef9fbf3 Mon Sep 17 00:00:00 2001
From: Jiaxiang Wu <jiaxiang.wu.90@gmail.com>
Date: Mon, 4 Mar 2019 10:33:49 +0800
Subject: [PATCH 1/3] minor updates to fix typos

---
 docs/docs/cp_learner.md             |  8 ++---
 docs/docs/dcp_learner.md            |  2 +-
 docs/docs/installation.md           |  1 +
 docs/docs/multi_gpu_training.md     |  2 +-
 docs/docs/nuq_learner.md            | 10 +++---
 docs/docs/reinforcement_learning.md |  2 +-
 docs/docs/self_defined_models.md    |  6 ++--
 docs/docs/uq_learner.md             | 54 ++++++++++++++---------------
 docs/docs/ws_learner.md             |  2 +-
 9 files changed, 45 insertions(+), 42 deletions(-)
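
Reviewer note on the `cp_learner.md` hunk below: the 1/4 figure in the "Uniform Channel Pruning" paragraph follows from the cost of a convolution being proportional to c_in * c_out. Here is a minimal Python sketch of that arithmetic. The helpers `conv_flops` and `preserved_flops_ratio` and the layer sizes are hypothetical and chosen only for illustration; they are not part of PocketFlow.

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate count of a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k * h_out * w_out


def preserved_flops_ratio(c_in, c_out, ratio_in, ratio_out, k=3, h_out=56, w_out=56):
    """Fraction of FLOPs left after keeping ratio_in / ratio_out of the channels."""
    kept_in = int(round(c_in * ratio_in))
    kept_out = int(round(c_out * ratio_out))
    pruned = conv_flops(kept_in, kept_out, k, h_out, w_out)
    return pruned / conv_flops(c_in, c_out, k, h_out, w_out)


# Both c_in and c_out halved (plain layer-by-layer network).
both_pruned = preserved_flops_ratio(64, 128, 0.5, 0.5)
# Only c_out halved (as can happen around residual connections).
only_out_pruned = preserved_flops_ratio(64, 128, 1.0, 0.5)
print(both_pruned, only_out_pruned)
```

The second case illustrates why the preserved ratio of a residual network can stay well above 0.25 even with a preserve ratio of 0.5.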

diff --git a/docs/docs/cp_learner.md b/docs/docs/cp_learner.md
index 2719fc7..3231b04 100644
--- a/docs/docs/cp_learner.md
+++ b/docs/docs/cp_learner.md
@@ -4,7 +4,7 @@
 
 Channel pruning is a kind of structural model compression approach which can not only compress the model size, but accelerate the inference speed directly. PocketFlow uses the channel pruning algorithm proposed in (He et al., 2017) to pruning each channel of convolution layers with a certain ratio, and for details please refer to the [channel pruning paper](https://arxiv.org/abs/1707.06168). For better performance and more robust, we modify some parts of the algorithm to achieve better result.
 
-In order to achieve a better performance, PocketFlow can take advantages of reinforcement learning to search a better compression ratio (He et al., 2018). User can also use the distilling (Hinton et al., 2015) and group tuning function to improve the accuracy after compression. Group tuning means setting a certain number of layers as group and then pruning and finetuning/retraining each group sequentially. For example we can set each 3 layers as a group and then pruning the first 3 layers. After that finetune/retraine the whole model and prune the next 3 layers and so on. Distilling and group tuning are experimentally proved as effective approaches to achieve higher accuracy at a certain compression ratio in most situations.
+To achieve better performance, PocketFlow can take advantage of reinforcement learning to search for a better compression ratio (He et al., 2018). Users can also use distilling (Hinton et al., 2015) and group tuning to improve the accuracy after compression. Group tuning means treating a certain number of layers as a group, then pruning and fine-tuning (or re-training) each group sequentially. For example, we can treat every 3 layers as a group and prune the first 3 layers. After that, we fine-tune (or re-train) the whole model, prune the next 3 layers, and so on. Distilling and group tuning have been shown experimentally to be effective approaches for achieving higher accuracy at a given compression ratio in most situations.
 
 ## Pruning Option
 
@@ -12,7 +12,7 @@ The code of channel pruning are located at directory `./learners/channel_pruning
 
 ### Uniform Channel Pruning
 
-One is the uniform layer pruning, which means the user can set each convolution layer pruned with an uniform pruning ratio by  `--cp_prune_option=uniform` and set the ratio (eg. making the ratio 0.5) by `--cp_uniform_preserve_ratio=0.5`. Note that for a layer, if both of pruning ratio of the layer and its previous layer are 0.5, the real preserved FLOPs are 1/4 of original FLOPs. Because channel pruning only prune the c_out channels of the convolution and c_in channels of the next convolution, if both c_in and c_out channels are pruned by 0.5, it will preserve only 1/4 of original computation cost. For a layer by layer convolution networks without residual blocks, if the user set `cp_uniform_preserve_ratio` to `0.5`, the whole model will be the 0.25 computation of the original model. However for the residual networks, some convolutions can only prune their c_in or c_out channels, which means the total preseved computation ratio may be much greater than 0.25.
+One option is uniform layer pruning: the user can prune every convolution layer with the same ratio by setting `--cp_prune_option=uniform` and choose the ratio (e.g. 0.5) with `--cp_uniform_preserve_ratio=0.5`. Note that if the preserve ratios of a layer and of its previous layer are both 0.5, the layer actually preserves only 1/4 of its original FLOPs. This is because channel pruning removes the c_out channels of a convolution together with the matching c_in channels of the next convolution, so a layer whose c_in and c_out channels are both halved keeps only 1/4 of its original computation cost. For a plain layer-by-layer convolutional network without residual blocks, setting `cp_uniform_preserve_ratio` to `0.5` therefore reduces the whole model to about 0.25 of the original computation. For residual networks, however, some convolutions can only have their c_in or their c_out channels pruned, so the total preserved computation ratio may be much greater than 0.25.
 
 **Example:**
 
@@ -62,8 +62,8 @@ The implementation of the channel pruning use Lasso algorithm to do channel sele
 
 ## Distilling
 
-Distilling is an effective approach to improve the final accuracy of compressed model with PocketFlow in most situations of classification. User can set `--enbl_dst=True` to enable distillling.
+Distilling is an effective way to improve the final accuracy of models compressed with PocketFlow in most classification scenarios. Users can set `--enbl_dst=True` to enable distilling.
 
 ## Group Tuning
 
-As introduced above, group tuning was proposed by the PocketFlow team and finding it is very useful to improve the performance of model compression. In PocketFlow, users can set `--cp_finetune=True` to enable group finetuning and set the group number by `--cp_list_group`, the default value is `1000`. There is a trade-off between the small value and large value, because if the value is `1`, Pocketflow will prune convolution and finetune/retrain by each layer, which may have better effect but be more time-consuming. If we set the value large, the function will be less effective. User can also set the number of iterations to finetune by setting `cp_nb_iters_ft_ratio` which mean the ratio the total iterations to be used in finetuning. The learning rate of finetuning can be set by `cp_lrn_rate_ft`.
+As introduced above, group tuning was proposed by the PocketFlow team, who found it very useful for improving the performance of model compression. In PocketFlow, users can set `--cp_finetune=True` to enable group fine-tuning and set the group number with `--cp_list_group` (default: `1000`). There is a trade-off between small and large values: if the value is `1`, PocketFlow prunes and fine-tunes/re-trains layer by layer, which may give better results but is more time-consuming, whereas a large value makes group tuning less effective. Users can also set the number of fine-tuning iterations through `cp_nb_iters_ft_ratio`, the ratio of the total iterations to be used in fine-tuning. The learning rate for fine-tuning can be set by `cp_lrn_rate_ft`.
diff --git a/docs/docs/dcp_learner.md b/docs/docs/dcp_learner.md
index 895bc52..85becd5 100644
--- a/docs/docs/dcp_learner.md
+++ b/docs/docs/dcp_learner.md
@@ -8,7 +8,7 @@ Discrimination-aware channel pruning (DCP, Zhuang et al., 2018) introduces a gro
 
 For a convolutional layer, we denote its input feature map as $\mathbf{X} \in \mathbb{R}^{N \times c_{i} \times h_{i} \times w_{i}}$, where $N$ is the batch size, $c_{i}$ is the number of inputs channels, and $h_{i}$ and $w_{i}$ are the spatial height and width. The convolutional kernel is denoted as $\mathbf{W} \in \mathbb{R}^{c_{o} \times c_{i} \times k \times k}$, where $c_{o}$ is the number of output channels and $k$ is the kernel size. The resulting output feature map is given by $\mathbf{Y} = f \left( \mathbf{X}; \mathbf{W} \right)$, where $f \left( \cdot \right)$ represents the convolutional operation.
 
-The idea of channel pruning is to impose the sparsity constraint on the convolutional kernel, so that some of its input channels only contains all-zero weights and can be safely removed. For instance, if the convolutional kernel satisifies:
+The idea of channel pruning is to impose a sparsity constraint on the convolutional kernel, so that some of its input channels contain only all-zero weights and can be safely removed. For instance, if the convolutional kernel satisfies:
 
 $$
 \left\| \left\| \mathbf{W}_{:, j, :, :} \right\|_{F}^{2} \right\|_{0} = c'_{i},
diff --git a/docs/docs/installation.md b/docs/docs/installation.md
index 0799266..e2a5af3 100644
--- a/docs/docs/installation.md
+++ b/docs/docs/installation.md
@@ -9,6 +9,7 @@ PocketFlow is developed and tested on Linux, using Python 3.6 and TensorFlow 1.1
 ## Clone PocketFlow
 
 To make a local copy of the PocketFlow repository, use:
+
 ``` bash
 $ git clone https://github.com/Tencent/PocketFlow.git
 ```
diff --git a/docs/docs/multi_gpu_training.md b/docs/docs/multi_gpu_training.md
index 1efc11c..6d9b88e 100644
--- a/docs/docs/multi_gpu_training.md
+++ b/docs/docs/multi_gpu_training.md
@@ -10,7 +10,7 @@ Our implementation is compatible with:
 We have provide a wrapper class, `MultiGpuWrapper`, to seamlessly switch between the above two frameworks.
 It will sequentially check whether Horovod and TF-Plus can be used, and use the first available one as the underlying framework for multi-GPU training.
 
-The main reason that using Horovod or TF-Plus instead TensorFlow's original distributed training routine is that these framewors provide many easy-to-use APIs and require far less code changes to change from single-GPU to multi-GPU training, as we shall see later.
+The main reason for using Horovod or TF-Plus instead of TensorFlow's original distributed training routine is that these frameworks provide many easy-to-use APIs and require far fewer code changes to switch from single-GPU to multi-GPU training, as we shall see later.
 
 ## From Single-GPU to Multi-GPU
 
diff --git a/docs/docs/nuq_learner.md b/docs/docs/nuq_learner.md
index 98da143..37d2c59 100644
--- a/docs/docs/nuq_learner.md
+++ b/docs/docs/nuq_learner.md
@@ -1,4 +1,5 @@
 # Non-Uniform Quantization Learner
+
 Non-uniform quantization is a generalization to uniform quantization. In non-uniform quantization, the quantization points are not distributed evenly, and can be optimized via the back-propagation of the network gradients. Consequently, with the same number of bits, non-uniform quantization is more expressive to approximate the original full-precision network comparing to uniform quantization. Nevertheless, the non-uniform quantized model cannot be accelerated directly based on current deep learning frameworks, since the low-precision multiplication requires the intervals among quantization points to be equal. Therefore, the `NonUniformQuantLearner` can only help better compress the model.
 
 ## Algorithm
@@ -35,8 +36,8 @@ To configure `NonUniformQuantLearner`, users can pass the options via the Tensor
 | `nuql_save_quant_mode_path` | the save path for quantized models. Default: `./nuql_quant_models/model.ckpt` |
 | `nuql_use_buckets`          | the switch to quantize first and last layers of network. Default: `False`. |
 | `nuql_bucket_type`          | two bucket type available: ['split', 'channel']. Default: `channel`. |
-| `nuql_bucket_size`          | the number of bucket size for bucket type 'split'. Default: `256. |
-| `nuql_enbl_rl_agent`        | the switch to enable RL to learn optimal bit strategy. Default:`False`. |
+| `nuql_bucket_size`          | the bucket size for bucket type 'split'. Default: `256`. |
+| `nuql_enbl_rl_agent`        | the switch to enable RL to learn optimal bit strategy. Default: `False`. |
 | `nuql_quantize_all_layers`  | the switch to quantize first and last layers of network. Default: `False`. |
 | `nuql_quant_epoch`          | the number of epochs for fine-tuning. Default: `60`.         |
 
@@ -60,7 +61,7 @@ Similar to uniform quantization, once `nuql_enbl_rl_agent==True` , the RL agent
 
 | Options                       | Description                                                  |
 | :---------------------------- | :----------------------------------------------------------- |
-| `nuql_evquivalent_bits`       | the number of re-allocated bits that is equivalent to non-uniform quantization without RL agent. Default: `4`. |
+| `nuql_equivalent_bits`        | the number of re-allocated bits that is equivalent to non-uniform quantization without RL agent. Default: `4`. |
 | `nuql_nb_rlouts`              | the number of roll outs for training the RL agent. Default: `200`. |
 | `nuql_w_bit_min`              | the minimal number of bits for each layer. Default: `2`.     |
 | `nuql_w_bit_max`              | the maximal number of bits for each layer. Default: `8`.     |
@@ -104,7 +105,7 @@ To quantize a MobileNet-v1 model for ILSVRC_12 classification task with 4 bits i
 
 ```bash
 # quantize mobilenet-v1 on ILSVRC-12
-sh ./scripts/run_seven.sh nets/mobilnet_at_ilsvrc12_run.py \
+sh ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
 -n=8 \
 --learner=non-uniform \
 --nuql_enbl_rl_agent=True \
@@ -112,4 +113,5 @@ sh ./scripts/run_seven.sh nets/mobilnet_at_ilsvrc12_run.py \
 ```
 
 ## References
+
 Han S, Mao H, and Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. [arXiv:1510.00149, 2015](https://arxiv.org/abs/1510.00149)
diff --git a/docs/docs/reinforcement_learning.md b/docs/docs/reinforcement_learning.md
index 2b9a0c1..9c511a0 100644
--- a/docs/docs/reinforcement_learning.md
+++ b/docs/docs/reinforcement_learning.md
@@ -32,7 +32,7 @@ In each roll-out, we sequentially traverse each layer in the network to determin
 For the $t$-th layer, we construct its state vector with following information:
 
 * one-hot embedding of layer index
-* weight tensor's shape
+* shape of weight tensor
 * number of parameters in the weight tensor
 * number of quantization bits used by previous layers
 * budget of quantization bits for remaining layers
diff --git a/docs/docs/self_defined_models.md b/docs/docs/self_defined_models.md
index 2e5300c..fd030bd 100644
--- a/docs/docs/self_defined_models.md
+++ b/docs/docs/self_defined_models.md
@@ -192,7 +192,7 @@ from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
 
-tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, '# of training epochs\'s ratio')
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, '# of training epochs\' ratio')
 tf.app.flags.DEFINE_float('lrn_rate_init', 1e-1, 'initial learning rate')
 tf.app.flags.DEFINE_float('batch_size_norm', 128, 'normalization factor of batch size')
 tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
@@ -209,7 +209,7 @@ def forward_fn(inputs, data_format):
   * inputs: outputs from the network's forward pass
   """
 
-  # tranpose the image tensor if needed
+  # transpose the image tensor if needed
   if data_format == 'channel_first':
     inputs = tf.transpose(inputs, [0, 3, 1, 2])
 
@@ -302,7 +302,7 @@ class ModelHelper(AbstractModelHelper):
     return 'fmnist'
 ```
 
-In the `build_dataset_train` and `build_dataset_eval` functions, we adopt the previously introduced `FMnistDataset` class to define the data input pipeline. The network forward-pass computation is defined in the `forward_train` and `forward_eval` functions, which corresponds to the training and evaluation graph, respectivley. The training graph is slightly different from evaluation graph, such as operations related to the batch normalization layers. The `calc_loss` function calculates the loss function's value and extra evaluation metrics, *e.g.* classification accuracy. Finally, the `setup_lrn_rate` function defines the learning rate schedule, as well as how many training iterations are need.
+In the `build_dataset_train` and `build_dataset_eval` functions, we adopt the previously introduced `FMnistDataset` class to define the data input pipeline. The network forward-pass computation is defined in the `forward_train` and `forward_eval` functions, which correspond to the training and evaluation graphs, respectively. The training graph is slightly different from the evaluation graph, e.g. in operations related to the batch normalization layers. The `calc_loss` function calculates the loss function's value and extra evaluation metrics, *e.g.* classification accuracy. Finally, the `setup_lrn_rate` function defines the learning rate schedule, as well as how many training iterations are needed.
 
 ### Execution Script
 
diff --git a/docs/docs/uq_learner.md b/docs/docs/uq_learner.md
index 10c093e..271a286 100644
--- a/docs/docs/uq_learner.md
+++ b/docs/docs/uq_learner.md
@@ -75,15 +75,15 @@ To configure `UniformQuantLearner`, users can pass options via the TensorFlow fl
 
 Here, we provide detailed description (and some analysis) for above hyper-parameters:
 
-- `uql_weight_bits`: The number of bits for weight quantization. Generally, 8 bit does not hurt the model performance while it can compress the model size by 4 folds. While 2 bit and 4 bit could lead to drop of performance on large datasets such as Imagenet.
-- `uql_activation_bits`: The number of bits for activation quantization. When both weights and activations are quantized, 8 bit does not lead to apparent drop of performance, and sometimes can even increase the classification accuracy, which is probably due to better generalization ability. Nevertheless, the performance will be more challenged when both weights and activations are quantized to lower bits, comparing to weight-only quantization.
-- `uql_save_quant_mode_path`: the path to save the quantized model. Quantization nodes  have already been inserted into the graph.
-- `uql_use_buckets`: the switch to turn on the bucket. With bucketing, weights are split into multiple pieces, while the $\alpha$ and $\beta$ are calculated individually for each piece. Therefore, turning on the bucketing can lead to more fine-grained quantization.
-- `uql_bucket_type`: the type of bucketing. Currently two types are supported: [`split`, `channel`]. `split` refers to that the weights of a layer are first concatenated into a long vector, and then cut it into pieces according to `uql_bucket_size`. The remaining last piece will be padded and taken as a new piece. After quantization of each piece, the vectors are then folded back to the original shape as the quantized weights. `channel` refers to that weights with shape `[k, k, cin, cout]` in a convolutional layer are cut into `cout` buckets, where each bucket has the size of `k * k * cin`. For weights with shape `[m, n]` in fully connected layers, they are cut into `n` buckets, each of size `m`. In practice, bucketing with type  `channel` can be calculated more efficiently comparing to type `split` since there are less buckets and less computation to iterate through all of them.
-- `uql_bucket_size`: the size of buckets when using bucket type `split`. Generally, smaller bucket size can lead to more fine grained quantization, while more storage are required since full precision statistics ($\alpha$ and $\beta$) of each bucket need to be kept.
-- `uql_quantize_all_layers`: the switch to quantize the first and last layers. The first and last layers of the network are connected directly with the input and output, and are arguably more sensitive to quantization. Keeping them un-quantized can slightly increase the performance, nevertheless, if you want to accelerate the inference speed, all layers are supposed to be quantized.
-- `uql_quant_epoch`: the epochs for fine-tuning a quantized network.
-- `uql_enbl_rl_agent`: the switch to turn on the RL agent as hyper parameter optimizer. Details about the RL agent and its configurations are described below.
+* `uql_weight_bits`: The number of bits for weight quantization. Generally, 8-bit quantization does not hurt the model performance while compressing the model size by a factor of 4, whereas 2-bit and 4-bit quantization could lead to a performance drop on large datasets such as ImageNet.
+* `uql_activation_bits`: The number of bits for activation quantization. When both weights and activations are quantized, 8 bits do not lead to an apparent performance drop and can sometimes even increase the classification accuracy, probably due to better generalization. Nevertheless, quantizing both weights and activations to fewer bits is more challenging for performance than weight-only quantization.
+* `uql_save_quant_mode_path`: the path to save the quantized model. Quantization nodes have already been inserted into the graph.
+* `uql_use_buckets`: the switch to turn on bucketing. With bucketing, weights are split into multiple pieces, and $\alpha$ and $\beta$ are calculated individually for each piece. Therefore, turning on bucketing can lead to more fine-grained quantization.
+* `uql_bucket_type`: the type of bucketing. Currently two types are supported: [`split`, `channel`]. With `split`, the weights of a layer are first concatenated into a long vector, which is then cut into pieces according to `uql_bucket_size`; the remaining last piece is padded and treated as a piece of its own. After each piece is quantized, the pieces are folded back into the original shape to form the quantized weights. With `channel`, weights of shape `[k, k, cin, cout]` in a convolutional layer are cut into `cout` buckets, each of size `k * k * cin`, and weights of shape `[m, n]` in a fully connected layer are cut into `n` buckets, each of size `m`. In practice, bucketing with type `channel` can be computed more efficiently than type `split`, since there are fewer buckets to iterate through.
+* `uql_bucket_size`: the size of buckets when using bucket type `split`. Generally, a smaller bucket size leads to more fine-grained quantization, but requires more storage, since the full-precision statistics ($\alpha$ and $\beta$) of each bucket need to be kept.
+* `uql_quantize_all_layers`: the switch to quantize the first and last layers. The first and last layers of the network are connected directly to the input and output, and are arguably more sensitive to quantization. Keeping them un-quantized can slightly improve the performance; however, all layers should be quantized if you want to accelerate inference.
+* `uql_quant_epoch`: the number of epochs for fine-tuning a quantized network.
+* `uql_enbl_rl_agent`: the switch to turn on the RL agent as the hyper-parameter optimizer. Details about the RL agent and its configurations are described below.
 
 ### Configure the RL Agent
 
@@ -91,7 +91,7 @@ Once the hyper parameter optimizer is turned on, i.e., `uql_enbl_rl_agent==True`
 
 | Option | Description |
 |:-------|:------------|
-| `uql_evquivalent_bits`       | the number of re-allocated bits that is equivalent to uniform allocation of bits. Default: `4`. |
+| `uql_equivalent_bits`        | the number of re-allocated bits that is equivalent to uniform allocation of bits. Default: `4`. |
 | `uql_nb_rlouts`              | the number of roll outs for training the RL agent. Default: `200`. |
 | `uql_w_bit_min`              | the minimal number of bits for each layer. Default: `2`.     |
 | `uql_w_bit_max`              | the maximal number of bits for each layer. Default: `8`.     |
@@ -104,19 +104,19 @@ Once the hyper parameter optimizer is turned on, i.e., `uql_enbl_rl_agent==True`
 
 Detailed description and usages for above hyper-parameters are listed below:
 
-- `uql_equivalent_bits`:  the total number of bits used in the optimal strategy will not exceed $n_{param}*$`uql_equivalent_bits` . For example, by setting `uql_equivalent_bits`=4, the RL agent will try to find the best quantization strategy with the same compression ratio to that each layer is quantized by 4 bits.
+* `uql_equivalent_bits`: the total number of bits used in the optimal strategy will not exceed $n_{param}*$`uql_equivalent_bits`. For example, with `uql_equivalent_bits=4`, the RL agent will try to find the best quantization strategy whose compression ratio matches that of quantizing every layer with 4 bits.
 
 The following parameters can be kept in default value in most cases. Users can also modify them when using their customized models if necessary.
 
-- `uql_nb_rlouts`: the number of roll-out for training the RL agent.  Generally we will use the first quarter of `uql_nb_rlouts` for collection of  the training buffer, and last three quarters for the training of the agent. The larger the `uql_nb_rlouts`, the slower the search for the hyper-parameter optimizer.
-- `uql_w_bit_min`: the minimum number of quantization bit for a layer. This is used to constrain the searching space and avoid extreme strategies that crash the entire performance of the compressed model.
-- `uql_w_bit_max`: the maximum number of quantization bit for a layer. This is used to constrain the searching space and avoid that one layer may use too much unnecessary bits.
-- `uql_enbl_rl_global_tune`: the switch to globally fine-tune the network in each roll-out, which is done by updating the full-precision weights for all layers via the STE estimator. The aim of the fine-tune is to obtain effective reward from the current strategy.
-- `uql_enbl_rl_layerwise_tune`: the switch to layer-wise fine-tune the network in each roll-out, which is done by minimizing the l2-norm between the quantized layer and full-precision layer.
-- `uql_tune_layerwise_steps`: the number of steps for layer-wise fine-tuning. Generally, the larger the value, the more precise the reward and thereon the better the strategy.
-- `uql_tune_global_steps`: the number of steps for global fine-tuning. Generally, the larger the value, the more precise the reward and thereon the better the strategy.
-- `uql_tune_disp_steps`: the intervals to display the global training process in each roll-out.
-- `uql_enbl_random_layers` : the switch to randomly permute layers of the network when searching the optimal strategy. This could be helpful since the bit budget used in previous layers may affect the searching space for following layers, while randomly shuffling all layers makes sure that all layers have equal probability of all strategies.
+* `uql_nb_rlouts`: the number of roll-outs for training the RL agent. Generally, the first quarter of the `uql_nb_rlouts` roll-outs is used to collect the training buffer, and the last three quarters to train the agent. The larger `uql_nb_rlouts` is, the slower the hyper-parameter search becomes.
+* `uql_w_bit_min`: the minimum number of quantization bits for a layer. This constrains the search space and avoids extreme strategies that ruin the performance of the compressed model.
+* `uql_w_bit_max`: the maximum number of quantization bits for a layer. This constrains the search space and prevents any single layer from using unnecessarily many bits.
+* `uql_enbl_rl_global_tune`: the switch to globally fine-tune the network in each roll-out, which is done by updating the full-precision weights of all layers via the straight-through estimator (STE). The aim of this fine-tuning is to obtain an effective reward for the current strategy.
+* `uql_enbl_rl_layerwise_tune`: the switch to fine-tune the network layer by layer in each roll-out, which is done by minimizing the l2 distance between the quantized layer and the full-precision layer.
+* `uql_tune_layerwise_steps`: the number of steps for layer-wise fine-tuning. Generally, the larger the value, the more precise the reward and hence the better the strategy.
+* `uql_tune_global_steps`: the number of steps for global fine-tuning. Generally, the larger the value, the more precise the reward and hence the better the strategy.
+* `uql_tune_disp_steps`: the interval for displaying the global training progress in each roll-out.
+* `uql_enbl_random_layers`: the switch to randomly permute the layers of the network when searching for the optimal strategy. This could be helpful since the bit budget used by previous layers may affect the search space of the following layers, while randomly shuffling all layers gives every layer an equal chance of being assigned any strategy.
 
 ### Usage Examples
 
@@ -151,7 +151,7 @@ To quantize a MobileNet-v1 model for ILSVRC_12 classification task with 4 bits i
 
 ```bash
 # quantize mobilenet-v1 on ILSVRC-12
-sh ./scripts/run_seven.sh nets/mobilnet_at_ilsvrc12_run.py \
+sh ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
 -n=8 \
 --learner=uniform \
 --uql_enbl_rl_agent=True \
@@ -160,7 +160,7 @@ sh ./scripts/run_seven.sh nets/mobilnet_at_ilsvrc12_run.py \
 
 ## UniformQuantTFLearner
 
-PocketFlow also wraps the quantization aware training in TensorFlow. The quantized model can be directly exported to `.tflite` format via [export_quant_tflite_model.py](https://github.com/haolibai/PocketFlow/blob/master/tools/conversion/export_quant_tflite_model.py) in PocketFlow, and then be easily deployed on Andriod devices.
+PocketFlow also wraps the quantization-aware training in TensorFlow. The quantized model can be exported directly to the `.tflite` format via [export_quant_tflite_model.py](https://github.com/haolibai/PocketFlow/blob/master/tools/conversion/export_quant_tflite_model.py) in PocketFlow, and then easily deployed on Android devices.
 
 To configure `UniformQuantTFLearner`, the hyper-parameters are as follows:
 
@@ -176,13 +176,13 @@ To configure `UniformQuantTFLearner`, the hyper-parameters are as follows:
 
 Here, the detailed description (and some analysis) for some above hyper-parameters are listed as follows:
 
-- `uqtf_quant_delay`: The number of steps to start fine-tuning on the quantized network. Before the training step reaches `uqtf_quant_delay`, only full precision weights of the model are updated.
-- `uqtf_freeze_bn_delay`: The number of steps after which the moving mean and variance of batch normalization layers are frozen and used, instead of the batch statistics during training.
-- `uqtf_lrn_rate_dcy` : The decay of learning rate for the quantized model. Generally the quantized network needs smaller learning rate comparing to that for the full-precision model.
+* `uqtf_quant_delay`: The number of steps after which fine-tuning of the quantized network starts. Before the training step reaches `uqtf_quant_delay`, only the full-precision weights of the model are updated.
+* `uqtf_freeze_bn_delay`: The number of steps after which the moving mean and variance of batch normalization layers are frozen and used, instead of the batch statistics during training.
+* `uqtf_lrn_rate_dcy`: The decay of the learning rate for the quantized model. Generally, the quantized network needs a smaller learning rate than the full-precision model.
 
 ### Usage Examples
 
-To deploy a quantized network on Andriod devices, there are generally 3 steps:
+To deploy a quantized network on Android devices, there are generally 3 steps:
 
 ### Quantize the pre-trained network
 
@@ -190,7 +190,7 @@ To quantize a MobileNet-v1 model for ILSVRC-12 classification task with 8 bits i
 
 ``` bash
 # quantize MobileNet-v1 on ILSVRC-12
-$ ./scripts/run_seven.sh nets/mobilnet_at_ilsvrc12_run.py -n=8 \
+$ ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
     --learner uniform-tf \
     --nb_epochs_rat 0.2
 ```
diff --git a/docs/docs/ws_learner.md b/docs/docs/ws_learner.md
index 34f6e89..7b88066 100644
--- a/docs/docs/ws_learner.md
+++ b/docs/docs/ws_learner.md
@@ -10,7 +10,7 @@ Note: in this documentation, we will use both "sparsity" and "pruning ratio" to
 
 For each convolutional kernel (for convolutional layer) or weighting matrix (for fully-connected layer), we create a binary mask of the same size to impose the sparsity constraint. During the forward pass, the convolutional kernel (or weighting matrix) is multiplied with the binary mask, so that some weights will not participate in the computation and also will not be updated via gradients. The binary mask is computed based on absolute values of weights: weight with the smallest absolute value will be masked-out until the desired sparsity is reached.
 
-During the training process, the sparsity is gradually increased to improve the overall optimization behaviour. The dynamic pruning schedule is defined as:
+During the training process, the sparsity is gradually increased to improve the overall optimization behavior. The dynamic pruning schedule is defined as:
 
 $$
 s_{t} = s_{f} - s_{f} \cdot \left( 1 - \frac{t - t_{b}}{t_{e} - t_{b}} \right)^{\alpha}, t \in \left[ t_{b}, t_{e} \right]

From 85c7f822ba36b802f43c0b7bb116d97686b84bdf Mon Sep 17 00:00:00 2001
From: Jiaxiang Wu <jiaxiang.wu.90@gmail.com>
Date: Mon, 4 Mar 2019 10:45:41 +0800
Subject: [PATCH 2/3] create .markdownlint.json to customize MD-lint

---
 docs/docs/.markdownlint.json | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 docs/docs/.markdownlint.json

diff --git a/docs/docs/.markdownlint.json b/docs/docs/.markdownlint.json
new file mode 100644
index 0000000..435c42e
--- /dev/null
+++ b/docs/docs/.markdownlint.json
@@ -0,0 +1,5 @@
+{
+    "MD013": false,
+    "MD014": false,
+    "MD024": {"allow_different_nesting": true}
+}

From c00fd7c4b2863c42da1165fa438f0098d7309fdb Mon Sep 17 00:00:00 2001
From: Jiaxiang Wu <jiaxiang.wu.90@gmail.com>
Date: Mon, 4 Mar 2019 20:25:31 +0800
Subject: [PATCH 3/3] add documentation for ChannelPrunedRmtLearner

---
 docs/docs/cpr_learner.md | 137 +++++++++++++++++++++++++++++++++++++++
 docs/mkdocs.yml          |   1 +
 2 files changed, 138 insertions(+)
 create mode 100644 docs/docs/cpr_learner.md

diff --git a/docs/docs/cpr_learner.md b/docs/docs/cpr_learner.md
new file mode 100644
index 0000000..133bd76
--- /dev/null
+++ b/docs/docs/cpr_learner.md
@@ -0,0 +1,137 @@
+# Channel Pruning - Remastered
+
+## Introduction
+
+Channel pruning (He et al., 2017) aims at reducing the number of input channels of each convolutional layer while minimizing the reconstruction loss of its output feature maps, using preserved input channels only. Similar to other model compression components based on channel pruning, this can lead to a direct reduction in both model size and computational complexity (in terms of FLOPs).
+
+In PocketFlow, we provide `ChannelPrunedRmtLearner` as the remastered version of the previous `ChannelPrunedLearner`, with a simplified and easier-to-understand implementation. The underlying algorithm is based on (He et al., 2017), with a few modifications. However, the support for RL-based hyper-parameter optimization is not yet ready and will be provided in the near future.
+
+## Algorithm Description
+
+For a convolutional layer, we denote its input feature map as $\mathcal{X} \in \mathbb{R}^{N \times h_{i} \times w_{i} \times c_{i}}$, where $N$ is the batch size, $h_{i}$ and $w_{i}$ are the spatial height and width, and $c_{i}$ is the number of input channels. The convolutional kernel is denoted as $\mathcal{W} \in \mathbb{R}^{k_{h} \times k_{w} \times c_{i} \times c_{o}}$, where $\left( k_{h}, k_{w} \right)$ is the kernel's spatial size and $c_{o}$ is the number of output channels. The resulting output feature map is given by $\mathcal{Y} = f \left( \mathcal{X}; \mathcal{W} \right) \in \mathbb{R}^{N \times h_{o} \times w_{o} \times c_{o}}$, where $h_{o}$ and $w_{o}$ are the spatial height and width, and $f \left( \cdot \right)$ denotes the convolutional operation.
+
+The convolutional operation can be understood as standard matrix multiplication between two matrices, one from $\mathcal{X}$ and the other from $\mathcal{W}$. The input feature map $\mathcal{X}$ is re-arranged via the `im2col` operator to produce a matrix $\mathbf{X}$ of size $N h_{o} w_{o} \times k_{h} k_{w} c_{i}$. The convolutional kernel $\mathcal{W}$ is correspondingly reshaped into $\mathbf{W}$ of size $k_{h} k_{w} c_{i} \times c_{o}$. The multiplication of these two matrices produces the output feature map in the matrix form, given by $\mathbf{Y} = \mathbf{X} \mathbf{W}$, which can be further reshaped back to the 4-D tensor $\mathcal{Y}$.
+
+The matrix multiplication can be decomposed along the dimension of input channels. We divide $\mathbf{X}$ into $c_{i}$ sub-matrices $\left\{ \mathbf{X}_{i} \right\}$, each of size $N h_{o} w_{o} \times k_{h} k_{w}$, and similarly divide $\mathbf{W}$ into $c_{i}$ sub-matrices $\left\{ \mathbf{W}_{i} \right\}$, each of size $k_{h} k_{w} \times c_{o}$. The computation of output feature map $\mathbf{Y}$ can be rewritten as:
+
+$$
+\mathbf{Y} = \sum\nolimits_{i = 1}^{c_{i}} \mathbf{X}_{i} \mathbf{W}_{i}
+$$
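+
+As a quick sanity check of this decomposition, the NumPy sketch below splits a random $\mathbf{X}$ and $\mathbf{W}$ along the input-channel dimension and compares the sum of per-channel products against the full product. It assumes that each input channel occupies a contiguous block of columns in $\mathbf{X}$ (and of rows in $\mathbf{W}$); the actual `im2col` layout may differ, and all sizes and variable names are illustrative only.
+
+``` python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Illustrative sizes (assumptions): n_rows stands for N * h_o * w_o
+n_rows, k_h, k_w, c_i, c_o = 6, 3, 3, 4, 5
+
+# X: im2col-style matrix; W: reshaped kernel (channel-major layout assumed)
+X = rng.standard_normal((n_rows, c_i * k_h * k_w))
+W = rng.standard_normal((c_i * k_h * k_w, c_o))
+Y = X @ W
+
+# Split into c_i blocks: X_i is (n_rows, k_h * k_w), W_i is (k_h * k_w, c_o)
+X_blocks = np.split(X, c_i, axis=1)
+W_blocks = np.split(W, c_i, axis=0)
+
+# Summing the per-channel products recovers the full output matrix
+Y_sum = sum(X_i @ W_i for X_i, W_i in zip(X_blocks, W_blocks))
+print(np.allclose(Y, Y_sum))
+```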
+
+In (He et al., 2017), a $c_{i}$-dimensional binary-valued mask vector $\boldsymbol{\beta}$ is introduced to indicate whether an input channel is pruned ($\beta_{i} = 0$) or not ($\beta_{i} = 1$). More formally, we consider the minimization of the output feature map's reconstruction loss under a sparsity constraint, where $c'_{i}$ denotes the number of preserved input channels:
+
+$$
+\min_{\mathbf{W}, \boldsymbol{\beta}} \left\| \mathbf{Y} - \sum\nolimits_{i = 1}^{c_{i}} \beta_{i} \mathbf{X}_{i} \mathbf{W}_{i} \right\|_{F}^{2}, ~ \text{s.t.} ~ \left\| \boldsymbol{\beta} \right\|_{0} \le c'_{i}
+$$
+
+The above problem can be tackled by first solving for $\boldsymbol{\beta}$ via a LASSO regression problem, and then updating $\mathbf{W}$ with the closed-form solution (or iterative solution) to least-square regression. Particularly, in the first step, we rewrite the sparsity constraint as an $l_{1}$-regularization term, so the optimization over $\boldsymbol{\beta}$ is now given by:
+
+$$
+\min_{\boldsymbol{\beta}} \left\| \mathbf{Y} - \sum\nolimits_{i = 1}^{c_{i}} \beta_{i} \mathbf{X}_{i} \mathbf{W}_{i} \right\|_{F}^{2} + \lambda \left\| \boldsymbol{\beta} \right\|_{1}
+$$
+
+The coefficient of $l_{1}$-regularization, $\lambda$, is determined via binary search so that the resulting solution $\boldsymbol{\beta}^{*}$ has exactly $c'_{i}$ non-zero entries. We solve the above unconstrained problem with the Iterative Shrinkage Thresholding Algorithm (ISTA).
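+
+The two steps above (ISTA for a fixed $\lambda$, wrapped in a binary search over $\lambda$) can be sketched in NumPy as follows. This is a minimal illustration rather than the learner's actual implementation: the function names, step-size choice, iteration counts, and problem sizes are all assumptions. Each product $\mathbf{X}_{i} \mathbf{W}_{i}$ is flattened into one column of a matrix `A`, so that the objective becomes an ordinary LASSO problem in $\boldsymbol{\beta}$; non-zero entries of the solution mark the preserved channels.
+
+``` python
+import numpy as np
+
+def soft_threshold(v, t):
+    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
+    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
+
+def ista(A, y, lam, nb_iters=200):
+    """Minimize ||y - A @ beta||^2 + lam * ||beta||_1 with ISTA."""
+    # Step size 1 / L, where L = 2 * sigma_max(A)^2 bounds the gradient's Lipschitz constant
+    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
+    beta = np.zeros(A.shape[1])
+    for _ in range(nb_iters):
+        grad = 2.0 * A.T @ (A @ beta - y)
+        beta = soft_threshold(beta - step * grad, step * lam)
+    return beta
+
+def search_lambda(A, y, nb_kept, nb_steps=50):
+    """Binary search on lam, stopping once the solution has nb_kept non-zero entries."""
+    # At lam_hi the all-zero vector is optimal, so the search interval is [0, lam_hi]
+    lam_lo, lam_hi = 0.0, 2.0 * np.max(np.abs(A.T @ y))
+    for _ in range(nb_steps):
+        lam = 0.5 * (lam_lo + lam_hi)
+        beta = ista(A, y, lam)
+        nb_nonzeros = np.count_nonzero(beta)
+        if nb_nonzeros == nb_kept:
+            break
+        if nb_nonzeros > nb_kept:
+            lam_lo = lam  # too many channels kept: regularize more strongly
+        else:
+            lam_hi = lam  # too few channels kept: regularize less
+    return lam, beta
+
+# Toy problem (illustrative sizes): 8 input channels, keep 4 of them
+rng = np.random.default_rng(0)
+n_rows, k_size, c_i, c_o = 32, 9, 8, 4
+Z = [rng.standard_normal((n_rows, k_size)) @ rng.standard_normal((k_size, c_o))
+     for _ in range(c_i)]                            # Z[i] plays the role of X_i W_i
+A = np.stack([Z_i.ravel() for Z_i in Z], axis=1)     # column i = vec(X_i W_i)
+y = sum(Z).ravel()                                   # vec(Y)
+lam, beta = search_lambda(A, y, nb_kept=4)
+print(lam, np.count_nonzero(beta))
+```
+
+The binary search relies on the number of non-zero entries generally shrinking as $\lambda$ grows; if no value of $\lambda$ yields the exact target within `nb_steps` iterations, this sketch simply returns the last candidate it tried.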
+
+## Hyper-parameters
+
+Below is the full list of hyper-parameters used in `ChannelPrunedRmtLearner`:
+
+| Name | Description |
+|:-----|:------------|
+| `cpr_save_path` | model's save path |
+| `cpr_save_path_eval` | model's save path for evaluation |
+| `cpr_save_path_ws` | model's save path for warm-start |
+| `cpr_prune_ratio` | target pruning ratio |
+| `cpr_skip_frst_layer` | skip the first convolutional layer for channel pruning |
+| `cpr_skip_last_layer` | skip the last convolutional layer for channel pruning |
+| `cpr_skip_op_names` | comma-separated Conv2D operation names to be skipped |
+| `cpr_nb_smpls` | number of cached training samples for channel pruning |
+| `cpr_nb_crops_per_smpl` | number of random crops per sample |
+| `cpr_ista_lrn_rate` | ISTA's learning rate |
+| `cpr_ista_nb_iters` | number of iterations in ISTA |
+| `cpr_lstsq_lrn_rate` | least-square regression's learning rate |
+| `cpr_lstsq_nb_iters` | number of iterations in least-square regression |
+| `cpr_warm_start` | use a channel-pruned model for warm start |
+
+Here, we provide detailed descriptions (and some analysis) of the above hyper-parameters:
+
+* `cpr_save_path`: save path for the model created in the training graph. The resulting checkpoint files can be used to resume training from a previous run and to compute the model's loss value and other evaluation metrics.
+* `cpr_save_path_eval`: save path for the model created in the evaluation graph. The resulting checkpoint files can be used to export GraphDef & TensorFlow Lite model files.
+* `cpr_save_path_ws`: save path for the model used for warm-start. This learner supports loading a previously-saved channel-pruned model, so that channel selection does not need to be performed again. This is only used when `cpr_warm_start` is `True`.
+* `cpr_prune_ratio`: target pruning ratio for input channels of each convolutional layer. The larger `cpr_prune_ratio` is, the more input channels will be pruned. If `cpr_prune_ratio` equals 0, then no input channels will be pruned and the model remains the same; if `cpr_prune_ratio` equals 1, then all input channels will be pruned.
+* `cpr_skip_frst_layer`: whether to skip the first convolutional layer for channel pruning. The first convolutional layer may be directly related to input images and pruning its input channels may harm the performance significantly.
+* `cpr_skip_last_layer`: whether to skip the last convolutional layer for channel pruning. The last convolutional layer may be directly related to final outputs and pruning its input channels may harm the performance significantly.
+* `cpr_skip_op_names`: comma-separated Conv2D operation names to be skipped. For instance, if `cpr_skip_op_names` is set to "aaa,bbb", then any Conv2D operation whose name contains either "aaa" or "bbb" will be skipped and no channel pruning will be applied to it.
+* `cpr_nb_smpls`: number of cached training samples for channel pruning. Increasing this may lead to smaller performance degradation after channel pruning but also require more training time.
+* `cpr_nb_crops_per_smpl`: number of random crops per sample. Increasing this may lead to smaller performance degradation after channel pruning but also require more training time.
+* `cpr_ista_lrn_rate`: ISTA's learning rate for LASSO regression. If `cpr_ista_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_ista_lrn_rate` is too small, then the optimization process may require lots of iterations until convergence.
+* `cpr_ista_nb_iters`: number of iterations for LASSO regression.
+* `cpr_lstsq_lrn_rate`: Adam's learning rate for least-square regression. If `cpr_lstsq_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_lstsq_lrn_rate` is too small, then the optimization process may require lots of iterations until convergence.
+* `cpr_lstsq_nb_iters`: number of iterations for least-square regression.
+* `cpr_warm_start`: whether to use a previously-saved channel-pruned model for warm-start.
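+
+To illustrate how the three skip-related options interact, here is a small sketch of the selection logic described above. The helper `select_ops_to_prune` and the operation names are hypothetical and are not part of PocketFlow; the sketch also assumes that the Conv2D operations are listed in order from the first layer to the last.
+
+``` python
+def select_ops_to_prune(conv_op_names, skip_op_names="",
+                        skip_frst_layer=False, skip_last_layer=False):
+    """Return the Conv2D operation names that remain eligible for channel pruning.
+
+    Hypothetical helper: mirrors the documented behaviour, not the actual code.
+    """
+    # Split the comma-separated string and drop empty entries
+    patterns = [p for p in skip_op_names.split(",") if p]
+    last_idx = len(conv_op_names) - 1
+    selected = []
+    for idx, name in enumerate(conv_op_names):
+        if skip_frst_layer and idx == 0:
+            continue
+        if skip_last_layer and idx == last_idx:
+            continue
+        # Substring match: skip any operation whose name contains a pattern
+        if any(p in name for p in patterns):
+            continue
+        selected.append(name)
+    return selected
+
+# Made-up operation names, for illustration only
+names = [
+    "model/conv1/Conv2D",
+    "model/block1/conv_a/Conv2D",
+    "model/block1/conv_b/Conv2D",
+    "model/logits/Conv2D",
+]
+print(select_ops_to_prune(names, skip_op_names="conv_a,logits", skip_frst_layer=True))
+```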
+
+## Empirical Evaluation
+
+In this section, we present some of our results from applying `ChannelPrunedRmtLearner` to compress image classification and object detection models.
+
+For image classification, we use `ChannelPrunedRmtLearner` to compress the ResNet-18 model on the ILSVRC-12 dataset:
+
+| Model | Prune Ratio | FLOPs | Distillation? | Top-1 Acc. | Top-5 Acc. |
+|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+| ResNet-18 | 0.2 | 73.32% | No | 69.43% | 88.97% |
+| ResNet-18 | 0.2 | 73.32% | Yes | 68.78% | 88.71% |
+| ResNet-18 | 0.3 | 61.31% | No | 68.44% | 88.30% |
+| ResNet-18 | 0.3 | 61.31% | Yes | 68.85% | 88.53% |
+| ResNet-18 | 0.4 | 50.70% | No | 67.17% | 87.48% |
+| ResNet-18 | 0.4 | 50.70% | Yes | 67.35% | 87.83% |
+| ResNet-18 | 0.5 | 41.27% | No | 65.73% | 86.38% |
+| ResNet-18 | 0.5 | 41.27% | Yes | 65.98% | 86.98% |
+| ResNet-18 | 0.6 | 32.07% | No | 63.38% | 84.62% |
+| ResNet-18 | 0.6 | 32.07% | Yes | 63.65% | 85.47% |
+| ResNet-18 | 0.7 | 24.28% | No | 60.26% | 82.70% |
+| ResNet-18 | 0.7 | 24.28% | Yes | 60.43% | 82.96% |
+
+For object detection, we use `ChannelPrunedRmtLearner` to compress the SSD-VGG16 model on the Pascal VOC 07-12 dataset:
+
+| Model | Prune Ratio | FLOPs | Pruned Layers | mAP |
+|:-----:|:-----:|:-----:|:-----:|:-----:|
+| SSD-VGG16 | 0.2 | 67.34% | Backbone | 77.53% |
+| SSD-VGG16 | 0.2 | 66.50% | All | 77.22% |
+| SSD-VGG16 | 0.3 | 53.58% | Backbone | 76.94% |
+| SSD-VGG16 | 0.3 | 52.32% | All | 76.90% |
+| SSD-VGG16 | 0.4 | 41.63% | Backbone | 75.81% |
+| SSD-VGG16 | 0.4 | 39.96% | All | 75.80% |
+| SSD-VGG16 | 0.5 | 31.56% | Backbone | 74.42% |
+| SSD-VGG16 | 0.5 | 29.47% | All | 73.76% |
+
+## Usage Examples
+
+In this section, we provide some usage examples to demonstrate how to use `ChannelPrunedRmtLearner` under different execution modes and hyper-parameter combinations:
+
+To compress a ResNet-20 model for the CIFAR-10 classification task in the local mode, use:
+
+``` bash
+# set the target pruning ratio to 0.50
+./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner=chn-pruned-rmt \
+    --cpr_prune_ratio=0.50
+```
+
+To compress a ResNet-18 model for the ILSVRC-12 classification task in the docker mode with 4 GPUs, use:
+
+``` bash
+# do not apply channel pruning to the last convolutional layer
+./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 \
+    --learner=chn-pruned-rmt \
+    --cpr_skip_last_layer=True
+```
+
+To compress a MobileNet-v1 model for the ILSVRC-12 classification task in the seven mode with 8 GPUs, use:
+
+``` bash
+# use a channel-pruned model for warm-start, so no channel selection is needed
+./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
+    --learner=chn-pruned-rmt \
+    --cpr_warm_start=True \
+    --cpr_save_path_ws=./models_cpr_ws/model.ckpt
+```
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 27352d7..46f64a1 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -5,6 +5,7 @@ nav:
 - Tutorial: tutorial.md
 - Learners - Algorithms:
   - Channel Pruning: cp_learner.md
+  - Channel Pruning - Remastered: cpr_learner.md
   - Discrimination-aware Channel Pruning: dcp_learner.md
   - Weight Sparsification: ws_learner.md
   - Uniform Quantization: uq_learner.md