
Commit 05edef2

Merge pull request Tencent#245 from jiaxiang-wu/fix-doc
Add Documentation for ChannelPrunedRmtLearner
2 parents be15305 + c00fd7c commit 05edef2

12 files changed

Lines changed: 188 additions & 42 deletions

docs/docs/.markdownlint.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
{
2+
"MD013": false,
3+
"MD014": false,
4+
"MD024": {"allow_different_nesting": true}
5+
}

docs/docs/cp_learner.md

Lines changed: 4 additions & 4 deletions
@@ -4,15 +4,15 @@
44

55
Channel pruning is a kind of structural model compression approach which can not only compress the model size, but accelerate the inference speed directly. PocketFlow uses the channel pruning algorithm proposed in (He et al., 2017) to pruning each channel of convolution layers with a certain ratio, and for details please refer to the [channel pruning paper](https://arxiv.org/abs/1707.06168). For better performance and more robust, we modify some parts of the algorithm to achieve better result.
66

7-
In order to achieve a better performance, PocketFlow can take advantages of reinforcement learning to search a better compression ratio (He et al., 2018). User can also use the distilling (Hinton et al., 2015) and group tuning function to improve the accuracy after compression. Group tuning means setting a certain number of layers as group and then pruning and finetuning/retraining each group sequentially. For example we can set each 3 layers as a group and then pruning the first 3 layers. After that finetune/retraine the whole model and prune the next 3 layers and so on. Distilling and group tuning are experimentally proved as effective approaches to achieve higher accuracy at a certain compression ratio in most situations.
7+
To achieve better performance, PocketFlow can take advantage of reinforcement learning to search for a better compression ratio (He et al., 2018). Users can also use distilling (Hinton et al., 2015) and the group tuning function to improve the accuracy after compression. Group tuning means treating a certain number of layers as a group and then pruning and fine-tuning (or re-training) each group sequentially. For example, we can set every 3 layers as a group and then prune the first 3 layers. After that, we fine-tune (or re-train) the whole model, prune the next 3 layers, and so on. Distilling and group tuning have been shown experimentally to achieve higher accuracy at a given compression ratio in most situations.
88

99
## Pruning Option
1010

1111
The code for channel pruning is located in the directory `./learners/channel_pruning`. To use channel pruning, users can set `--learners` to `channel`. Channel pruning supports 3 kinds of pruning setups, selected via the `cp_prune_option` option.
1212

1313
### Uniform Channel Pruning
1414

15-
One is the uniform layer pruning, which means the user can set each convolution layer pruned with an uniform pruning ratio by `--cp_prune_option=uniform` and set the ratio (eg. making the ratio 0.5) by `--cp_uniform_preserve_ratio=0.5`. Note that for a layer, if both of pruning ratio of the layer and its previous layer are 0.5, the real preserved FLOPs are 1/4 of original FLOPs. Because channel pruning only prune the c_out channels of the convolution and c_in channels of the next convolution, if both c_in and c_out channels are pruned by 0.5, it will preserve only 1/4 of original computation cost. For a layer by layer convolution networks without residual blocks, if the user set `cp_uniform_preserve_ratio` to `0.5`, the whole model will be the 0.25 computation of the original model. However for the residual networks, some convolutions can only prune their c_in or c_out channels, which means the total preseved computation ratio may be much greater than 0.25.
15+
The first is uniform layer pruning, in which every convolution layer is pruned with the same ratio: set `--cp_prune_option=uniform` and choose the ratio (e.g. 0.5) with `--cp_uniform_preserve_ratio=0.5`. Note that if the preserve ratio of both a layer and its previous layer is 0.5, the layer's preserved FLOPs are 1/4 of the original FLOPs. This is because channel pruning prunes the c_out channels of a convolution together with the c_in channels of the next convolution; when both c_in and c_out are halved, only 1/4 of the original computation cost remains. For a plain layer-by-layer convolutional network without residual blocks, setting `cp_uniform_preserve_ratio` to `0.5` therefore reduces the whole model to roughly 0.25 of the original computation. However, for residual networks, some convolutions can only prune their c_in or c_out channels, which means the total preserved computation ratio may be much greater than 0.25.
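The 1/4 figure follows from the fact that a convolution's cost scales with the product of its input and output channel counts. The sketch below illustrates this arithmetic only. The helpers `conv_flops` and `preserved` and the layer shape are hypothetical and are not part of PocketFlow.

```python
def conv_flops(h_out, w_out, k_h, k_w, c_in, c_out):
    """Multiply-accumulate count of a standard convolution (bias ignored)."""
    return h_out * w_out * k_h * k_w * c_in * c_out


def preserved(channels, preserve_ratio):
    """Number of channels kept for a given preserve ratio (rounding is an assumption)."""
    return int(round(channels * preserve_ratio))


# Hypothetical 3x3 convolution with 128 input and 256 output channels.
full = conv_flops(28, 28, 3, 3, 128, 256)

# Both c_in and c_out are halved, as in a plain layer-by-layer network.
both = conv_flops(28, 28, 3, 3, preserved(128, 0.5), preserved(256, 0.5))

# Only c_in is halved, as can happen around residual connections.
only_c_in = conv_flops(28, 28, 3, 3, preserved(128, 0.5), 256)

print(both / full)       # 0.25
print(only_c_in / full)  # 0.5
```

The second ratio shows why a residual network pruned with a preserve ratio of 0.5 can retain noticeably more than 0.25 of its original computation.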
1616

1717
**Example:**
1818

@@ -62,8 +62,8 @@ The implementation of the channel pruning use Lasso algorithm to do channel sele
6262

6363
## Distilling
6464

65-
Distilling is an effective approach to improve the final accuracy of compressed model with PocketFlow in most situations of classification. User can set `--enbl_dst=True` to enable distillling.
65+
Distilling is an effective approach to improve the final accuracy of compressed model with PocketFlow in most situations of classification. User can set `--enbl_dst=True` to enable distilling.
6666

6767
## Group Tuning
6868

69-
As introduced above, group tuning was proposed by the PocketFlow team and finding it is very useful to improve the performance of model compression. In PocketFlow, users can set `--cp_finetune=True` to enable group finetuning and set the group number by `--cp_list_group`, the default value is `1000`. There is a trade-off between the small value and large value, because if the value is `1`, Pocketflow will prune convolution and finetune/retrain by each layer, which may have better effect but be more time-consuming. If we set the value large, the function will be less effective. User can also set the number of iterations to finetune by setting `cp_nb_iters_ft_ratio` which mean the ratio the total iterations to be used in finetuning. The learning rate of finetuning can be set by `cp_lrn_rate_ft`.
69+
As introduced above, group tuning was proposed by the PocketFlow team and has been found very useful for improving the performance of model compression. In PocketFlow, users can set `--cp_finetune=True` to enable group fine-tuning and set the group number by `--cp_list_group`; the default value is `1000`. There is a trade-off between small and large values: if the value is `1`, PocketFlow will prune and fine-tune/re-train layer by layer, which may give better results but is more time-consuming, whereas a large value makes group tuning less effective. Users can also set the number of fine-tuning iterations via `cp_nb_iters_ft_ratio`, which is the ratio of the total iterations to be used in fine-tuning. The learning rate of fine-tuning can be set by `cp_lrn_rate_ft`.

docs/docs/cpr_learner.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
# Channel Pruning - Remastered
2+
3+
## Introduction
4+
5+
Channel pruning (He et al., 2017) aims at reducing the number of input channels of each convolutional layer while minimizing the reconstruction loss of its output feature maps, using preserved input channels only. Similar to other model compression components based on channel pruning, this can lead to direct reduction in both model size and computational complexity (in terms of FLOPs).
6+
7+
In PocketFlow, we provide `ChannelPrunedRmtLearner` as the remastered version of the previous `ChannelPrunedLearner`, with simplified and easier-to-understand implementation. The underlying algorithm is based on (He et al., 2017), with a few modifications. However, the support for RL-based hyper-parameter optimization is not yet ready and will be provided in the near future.
8+
9+
## Algorithm Description
10+
11+
For a convolutional layer, we denote its input feature map as $\mathcal{X} \in \mathbb{R}^{N \times h_{i} \times w_{i} \times c_{i}}$, where $N$ is the batch size, $h_{i}$ and $w_{i}$ are the spatial height and width, and $c_{i}$ is the number of input channels. The convolutional kernel is denoted as $\mathcal{W} \in \mathbb{R}^{k_{h} \times k_{w} \times c_{i} \times c_{o}}$, where $\left( k_{h}, k_{w} \right)$ is the kernel's spatial size and $c_{o}$ is the number of output channels. The resulting output feature map is given by $\mathcal{Y} = f \left( \mathcal{X}; \mathcal{W} \right) \in \mathbb{R}^{N \times h_{o} \times w_{o} \times c_{o}}$, where $h_{o}$ and $w_{o}$ are the spatial height and width, and $f \left( \cdot \right)$ denotes the convolutional operation.
12+
13+
The convolutional operation can be understood as standard matrix multiplication between two matrices, one from $\mathcal{X}$ and the other from $\mathcal{W}$. The input feature map $\mathcal{X}$ is re-arranged via the `im2col` operator to produce a matrix $\mathbf{X}$ of size $N h_{o} w_{o} \times k_{h} k_{w} c_{i}$. The convolutional kernel $\mathcal{W}$ is correspondingly reshaped into $\mathbf{W}$ of size $k_{h} k_{w} c_{i} \times c_{o}$. The multiplication of these two matrices produces the output feature map in the matrix form, given by $\mathbf{Y} = \mathbf{X} \mathbf{W}$, which can be further reshaped back to the 4-D tensor $\mathcal{Y}$.
14+
15+
The matrix multiplication can be decomposed along the dimension of input channels. We divide $\mathbf{X}$ into $c_{i}$ sub-matrices $\left\{ \mathbf{X}_{i} \right\}$, each of size $N h_{o} w_{o} \times k_{h} k_{w}$, and similarly divide $\mathbf{W}$ into $c_{i}$ sub-matrices $\left\{ \mathbf{W}_{i} \right\}$, each of size $k_{h} k_{w} \times c_{o}$. The computation of output feature map $\mathbf{Y}$ can be rewritten as:
16+
17+
$$
18+
\mathbf{Y} = \sum\nolimits_{i = 1}^{c_{i}} \mathbf{X}_{i} \mathbf{W}_{i}
19+
$$
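The decomposition can be checked numerically. The NumPy sketch below is only an illustration: it fills the `im2col` matrix with random values rather than running a real convolution, and it assumes that the input channel is the fastest-varying index along the shared dimension, which may differ from the layout used in PocketFlow.

```python
import numpy as np

rng = np.random.default_rng(0)
m, kh, kw, ci, co = 6, 3, 3, 4, 5  # m stands in for N * h_o * w_o

# X plays the role of the im2col matrix, W the reshaped kernel.
X = rng.standard_normal((m, kh * kw * ci))
W = rng.standard_normal((kh * kw * ci, co))
Y = X @ W

# Split the shared dimension into (kernel position, input channel).
X_parts = X.reshape(m, kh * kw, ci)   # X_parts[:, :, i] is X_i
W_parts = W.reshape(kh * kw, ci, co)  # W_parts[:, i, :] is W_i

# Sum the per-channel products X_i W_i.
Y_sum = sum(X_parts[:, :, i] @ W_parts[:, i, :] for i in range(ci))

print(np.allclose(Y, Y_sum))
```

Each per-channel term has the same shape as $\mathbf{Y}$, so dropping a term corresponds to removing one input channel's contribution to every output channel.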
20+
21+
In (He et al., 2017), a $c_{i}$-dimensional binary-valued mask vector $\boldsymbol{\beta}$ is introduced to indicate whether an input channel is pruned ($\beta_{i} = 0$) or not ($\beta_{i} = 1$). More formally, we consider the minimization of output feature map's reconstruction loss under sparsity constraint:
22+
23+
$$
24+
\min_{\mathbf{W}, \boldsymbol{\beta}} \left\| \mathbf{Y} - \sum\nolimits_{i = 1}^{c_{i}} \beta_{i} \mathbf{X}_{i} \mathbf{W}_{i} \right\|_{F}^{2}, ~ \text{s.t.} ~ \left\| \boldsymbol{\beta} \right\|_{0} \le c'_{i}
25+
$$
26+
27+
The above problem can be tackled by first solving for $\boldsymbol{\beta}$ via a LASSO regression problem, and then updating $\mathbf{W}$ with the closed-form solution (or an iterative solution) to least-squares regression. In particular, in the first step, we rewrite the sparsity constraint as an $l_{1}$-regularization term, so the optimization over $\boldsymbol{\beta}$ is now given by:
28+
29+
$$
30+
\min_{\boldsymbol{\beta}} \left\| \mathbf{Y} - \sum\nolimits_{i = 1}^{c_{i}} \beta_{i} \mathbf{X}_{i} \mathbf{W}_{i} \right\|_{F}^{2} + \lambda \left\| \boldsymbol{\beta} \right\|_{1}
31+
$$
32+
33+
The coefficient of $l_{1}$-regularization, $\lambda$, is determined via binary search so that the resulting solution $\boldsymbol{\beta}^{*}$ has exactly $c'_{i}$ non-zero entries. We solve the above unconstrained problem with the Iterative Shrinkage Thresholding Algorithm (ISTA).
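A minimal sketch of this procedure is given below. It is not PocketFlow's implementation. It assumes the problem has been flattened so that column $i$ of a matrix `A` holds the vectorized product $\mathbf{X}_{i} \mathbf{W}_{i}$ and `y` holds the vectorized $\mathbf{Y}$. The function names are hypothetical, and the step size is derived from the spectral norm of `A` instead of a tunable learning rate.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista_lasso(A, y, lam, nb_iters=500):
    """Minimize ||y - A @ beta||^2 + lam * ||beta||_1 with ISTA."""
    # Step size 1 / L, where L = 2 * sigma_max(A)^2 bounds the gradient's
    # Lipschitz constant (an assumed choice, not a PocketFlow default).
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    beta = np.zeros(A.shape[1])
    for _ in range(nb_iters):
        grad = 2.0 * A.T @ (A @ beta - y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


def search_lambda(A, y, nb_keep, nb_steps=30):
    """Binary search on lam so that beta has nb_keep non-zero entries."""
    # For lam >= 2 * max|A^T y| the all-zero vector is optimal.
    lam_lo, lam_hi = 0.0, 2.0 * np.max(np.abs(A.T @ y))
    lam, beta = lam_hi, np.zeros(A.shape[1])
    for _ in range(nb_steps):
        lam = 0.5 * (lam_lo + lam_hi)
        beta = ista_lasso(A, y, lam)
        nnz = np.count_nonzero(beta)
        if nnz == nb_keep:
            break
        if nnz > nb_keep:
            lam_lo = lam  # solution too dense: increase the penalty
        else:
            lam_hi = lam  # solution too sparse: decrease the penalty
    return lam, beta


# Toy problem: only the first 4 of 16 columns contribute to y.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 16))
y = A[:, :4] @ np.array([3.0, -2.0, 1.5, 1.0]) + 0.01 * rng.standard_normal(200)
lam, beta = search_lambda(A, y, nb_keep=4)
print(lam, np.flatnonzero(beta))
```

With a finite number of search steps and ISTA iterations, the search may stop without hitting the target count exactly, so a practical implementation would need a fallback such as keeping the entries of largest magnitude.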
34+
35+
## Hyper-parameters
36+
37+
Below is the full list of hyper-parameters used in `ChannelPrunedRmtLearner`:
38+
39+
| Name | Description |
40+
|:-----|:------------|
41+
| `cpr_save_path` | model's save path |
42+
| `cpr_save_path_eval` | model's save path for evaluation |
43+
| `cpr_save_path_ws` | model's save path for warm-start |
44+
| `cpr_prune_ratio` | target pruning ratio |
45+
| `cpr_skip_frst_layer` | skip the first convolutional layer for channel pruning |
46+
| `cpr_skip_last_layer` | skip the last convolutional layer for channel pruning |
47+
| `cpr_skip_op_names` | comma-separated Conv2D operations names to be skipped |
48+
| `cpr_nb_smpls` | number of cached training samples for channel pruning |
49+
| `cpr_nb_crops_per_smpl` | number of random crops per sample |
50+
| `cpr_ista_lrn_rate` | ISTA's learning rate |
51+
| `cpr_ista_nb_iters` | number of iterations in ISTA |
52+
| `cpr_lstsq_lrn_rate` | least-square regression's learning rate |
53+
| `cpr_lstsq_nb_iters` | number of iterations in least-square regression |
54+
| `cpr_warm_start` | use a channel-pruned model for warm start |
55+
56+
Here, we provide detailed descriptions (and some analysis) of the above hyper-parameters:
57+
58+
* `cpr_save_path`: save path for the model created in the training graph. The resulting checkpoint files can be used to resume training from a previous run and to compute the model's loss value and other evaluation metrics.
59+
* `cpr_save_path_eval`: save path for model created in the evaluation graph. The resulting checkpoint files can be used to export GraphDef & TensorFlow Lite model files.
60+
* `cpr_save_path_ws`: save path for the model used for warm-start. This learner supports loading a previously-saved channel-pruned model, so that channel selection does not need to be performed again. This is only used when `cpr_warm_start` is `True`.
61+
* `cpr_prune_ratio`: target pruning ratio for input channels of each convolutional layer. The larger `cpr_prune_ratio` is, the more input channels will be pruned. If `cpr_prune_ratio` equals 0, then no input channels will be pruned and model remains the same; if `cpr_prune_ratio` equals 1, then all input channels will be pruned.
62+
* `cpr_skip_frst_layer`: whether to skip the first convolutional layer for channel pruning. The first convolutional layer may be directly related to input images and pruning its input channels may harm the performance significantly.
63+
* `cpr_skip_last_layer`: whether to skip the last convolutional layer for channel pruning. The last convolutional layer may be directly related to final outputs and pruning its input channels may harm the performance significantly.
64+
* `cpr_skip_op_names`: comma-separated Conv2D operations names to be skipped. For instance, if `cpr_skip_op_names` is set to "aaa,bbb", then any Conv2D operation whose name contains either "aaa" or "bbb" will be skipped and no channel pruning will be applied on it.
65+
* `cpr_nb_smpls`: number of cached training samples for channel pruning. Increasing this may lead to smaller performance degradation after channel pruning but also require more training time.
66+
* `cpr_nb_crops_per_smpl`: number of random crops per sample. Increasing this may lead to smaller performance degradation after channel pruning but also require more training time.
67+
* `cpr_ista_lrn_rate`: ISTA's learning rate for LASSO regression. If `cpr_ista_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_ista_lrn_rate` is too small, then the optimization process may require lots of iterations until convergence.
68+
* `cpr_ista_nb_iters`: number of iterations for LASSO regression.
69+
* `cpr_lstsq_lrn_rate`: Adam's learning rate for least-square regression. If `cpr_lstsq_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_lstsq_lrn_rate` is too small, then the optimization process may require lots of iterations until convergence.
70+
* `cpr_lstsq_nb_iters`: number of iterations for least-square regression.
71+
* `cpr_warm_start`: whether to use a previously-saved channel-pruned model for warm-start.
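The substring matching described for `cpr_skip_op_names` can be sketched as follows. This is an illustration of the stated rule only: the helper `should_skip` and the operation names are hypothetical and are not taken from the PocketFlow source.

```python
def should_skip(op_name, cpr_skip_op_names):
    """Return True if op_name contains any of the comma-separated patterns."""
    patterns = [p.strip() for p in cpr_skip_op_names.split(',') if p.strip()]
    return any(p in op_name for p in patterns)


# Hypothetical Conv2D operation names.
conv_ops = [
    'model/aaa_block/conv1/Conv2D',
    'model/block2/conv_bbb/Conv2D',
    'model/block3/conv1/Conv2D',
]

for name in conv_ops:
    print(name, should_skip(name, 'aaa,bbb'))
```

Because the match is on substrings, a short pattern can skip more operations than intended, so patterns should be specific enough to identify only the target layers.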
72+
73+
## Empirical Evaluation
74+
75+
In this section, we present some of our results for applying `ChannelPrunedRmtLearner` to compress image classification and object detection models.
76+
77+
For image classification, we use `ChannelPrunedRmtLearner` to compress the ResNet-18 model on the ILSVRC-12 dataset:
78+
79+
| Model | Prune Ratio | FLOPs | Distillation? | Top-1 Acc. | Top-5 Acc. |
80+
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
81+
| ResNet-18 | 0.2 | 73.32% | No | 69.43% | 88.97% |
82+
| ResNet-18 | 0.2 | 73.32% | Yes | 68.78% | 88.71% |
83+
| ResNet-18 | 0.3 | 61.31% | No | 68.44% | 88.30% |
84+
| ResNet-18 | 0.3 | 61.31% | Yes | 68.85% | 88.53% |
85+
| ResNet-18 | 0.4 | 50.70% | No | 67.17% | 87.48% |
86+
| ResNet-18 | 0.4 | 50.70% | Yes | 67.35% | 87.83% |
87+
| ResNet-18 | 0.5 | 41.27% | No | 65.73% | 86.38% |
88+
| ResNet-18 | 0.5 | 41.27% | Yes | 65.98% | 86.98% |
89+
| ResNet-18 | 0.6 | 32.07% | No | 63.38% | 84.62% |
90+
| ResNet-18 | 0.6 | 32.07% | Yes | 63.65% | 85.47% |
91+
| ResNet-18 | 0.7 | 24.28% | No | 60.26% | 82.70% |
92+
| ResNet-18 | 0.7 | 24.28% | Yes | 60.43% | 82.96% |
93+
94+
For object detection, we use `ChannelPrunedRmtLearner` to compress the SSD-VGG16 model on the Pascal VOC 07-12 dataset:
95+
96+
| Model | Prune Ratio | FLOPs | Pruned Layers | mAP |
97+
|:-----:|:-----:|:-----:|:-----:|:-----:|
98+
| SSD-VGG16 | 0.2 | 67.34% | Backbone | 77.53% |
99+
| SSD-VGG16 | 0.2 | 66.50% | All | 77.22% |
100+
| SSD-VGG16 | 0.3 | 53.58% | Backbone | 76.94% |
101+
| SSD-VGG16 | 0.3 | 52.32% | All | 76.90% |
102+
| SSD-VGG16 | 0.4 | 41.63% | Backbone | 75.81% |
103+
| SSD-VGG16 | 0.4 | 39.96% | All | 75.80% |
104+
| SSD-VGG16 | 0.5 | 31.56% | Backbone | 74.42% |
105+
| SSD-VGG16 | 0.5 | 29.47% | All | 73.76% |
106+
107+
## Usage Examples
108+
109+
In this section, we provide some usage examples to demonstrate how to use `ChannelPrunedRmtLearner` under different execution modes and hyper-parameter combinations:
110+
111+
To compress a ResNet-20 model for CIFAR-10 classification task in the local mode, use:
112+
113+
``` bash
114+
# set the target pruning ratio to 0.50
115+
./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
116+
--learner=chn-pruned-rmt \
117+
--cpr_prune_ratio=0.50
118+
```
119+
120+
To compress a ResNet-18 model for ILSVRC-12 classification task in the docker mode with 4 GPUs, use:
121+
122+
``` bash
123+
# do not apply channel pruning to the last convolutional layer
124+
./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 \
125+
--learner=chn-pruned-rmt \
126+
--cpr_skip_last_layer=True
127+
```
128+
129+
To compress a MobileNet-v1 model for ILSVRC-12 classification task in the seven mode with 8 GPUs, use:
130+
131+
``` bash
132+
# use a channel-pruned model for warm-start, so no channel selection is needed
133+
./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
134+
--learner=chn-pruned-rmt \
135+
--cpr_warm_start=True \
136+
--cpr_save_path_ws=./models_cpr_ws/model.ckpt
137+
```

docs/docs/dcp_learner.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ Discrimination-aware channel pruning (DCP, Zhuang et al., 2018) introduces a gro
88

99
For a convolutional layer, we denote its input feature map as $\mathbf{X} \in \mathbb{R}^{N \times c_{i} \times h_{i} \times w_{i}}$, where $N$ is the batch size, $c_{i}$ is the number of inputs channels, and $h_{i}$ and $w_{i}$ are the spatial height and width. The convolutional kernel is denoted as $\mathbf{W} \in \mathbb{R}^{c_{o} \times c_{i} \times k \times k}$, where $c_{o}$ is the number of output channels and $k$ is the kernel size. The resulting output feature map is given by $\mathbf{Y} = f \left( \mathbf{X}; \mathbf{W} \right)$, where $f \left( \cdot \right)$ represents the convolutional operation.
1010

11-
The idea of channel pruning is to impose the sparsity constraint on the convolutional kernel, so that some of its input channels only contains all-zero weights and can be safely removed. For instance, if the convolutional kernel satisifies:
11+
The idea of channel pruning is to impose the sparsity constraint on the convolutional kernel, so that some of its input channels contain only all-zero weights and can be safely removed. For instance, if the convolutional kernel satisfies:
1212

1313
$$
1414
\left\| \left\| \mathbf{W}_{:, j, :, :} \right\|_{F}^{2} \right\|_{0} = c'_{i},

docs/docs/installation.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ PocketFlow is developed and tested on Linux, using Python 3.6 and TensorFlow 1.1
99
## Clone PocketFlow
1010

1111
To make a local copy of the PocketFlow repository, use:
12+
1213
``` bash
1314
$ git clone https://github.com/Tencent/PocketFlow.git
1415
```

docs/docs/multi_gpu_training.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ Our implementation is compatible with:
1010
We provide a wrapper class, `MultiGpuWrapper`, to seamlessly switch between the above two frameworks.
1111
It will sequentially check whether Horovod and TF-Plus can be used, and use the first available one as the underlying framework for multi-GPU training.
1212

13-
The main reason that using Horovod or TF-Plus instead TensorFlow's original distributed training routine is that these framewors provide many easy-to-use APIs and require far less code changes to change from single-GPU to multi-GPU training, as we shall see later.
13+
The main reason for using Horovod or TF-Plus instead of TensorFlow's original distributed training routine is that these frameworks provide many easy-to-use APIs and require far fewer code changes to switch from single-GPU to multi-GPU training, as we shall see later.
1414

1515
## From Single-GPU to Multi-GPU
1616
