Motivation: Magnitude pruning is a zeroth-order method that works on a fully trained model, but it is not suitable for a pretrained model that still needs fine-tuning.
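The output is $\mathbf{a}=(\mathbf{W} \odot \mathbf{M}) \mathbf{x}$, where the binary mask $\mathbf{M}=\mathrm{Top}_v(\mathbf{S})$ keeps the weights with the highest importance scores $\mathbf{S}$.
$\mathrm{Top}_v(\mathbf{S}) \in \{0, 1\}$ is not differentiable, so the straight-through estimator is used to pass the gradient through it.
The loss gradient with respect to the scores is then $\frac{\partial \mathcal{L}}{\partial S_{i, j}}=\frac{\partial \mathcal{L}}{\partial a_{i}} \frac{\partial a_{i}}{\partial S_{i, j}}=\frac{\partial \mathcal{L}}{\partial a_{i}} W_{i, j} x_{j}$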
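The masked forward pass and the straight-through score gradient above can be sketched in NumPy as follows. The function names are hypothetical, and treating `v` as a count of kept weights (rather than a percentage) is an assumption made for brevity; this is not the authors' implementation.

```python
import numpy as np

def topv_mask(scores, v):
    """Binary mask that keeps the v highest-scoring entries (assumption: v is a count)."""
    scores = np.asarray(scores, dtype=float)
    mask = np.zeros(scores.size)
    if v > 0:
        keep = np.argsort(scores, axis=None)[-v:]  # flat indices of the v largest scores
        mask[keep] = 1.0
    return mask.reshape(scores.shape)

def forward(W, S, x, v):
    """Masked linear layer: a = (W * M) x with M = Top_v(S)."""
    M = topv_mask(S, v)
    return (np.asarray(W, dtype=float) * M) @ np.asarray(x, dtype=float), M

def score_grad(grad_a, W, x):
    """Straight-through gradient: dL/dS[i, j] = dL/da[i] * W[i, j] * x[j]."""
    grad_a = np.asarray(grad_a, dtype=float)
    x = np.asarray(x, dtype=float)
    return grad_a[:, None] * np.asarray(W, dtype=float) * x[None, :]
```

Note that the score gradient is computed for every entry of `S`, including entries whose weights are currently masked out, which is what lets a pruned weight return later in training.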
Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. [TACL21]. Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett. [Survey]
a. Unstructured pruning: magnitude, movement, Reweighted Proximal Pruning (RPP).
b. Structured pruning: attention heads, layers, embedding layer.
c. Matrix decomposition.
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. [ACL20 Workshop RepL4NLP]. Mitchell A. Gordon, Kevin Duh, Nicholas Andrews. [Magnitude-based]
a. Does compressing BERT impede its ability to transfer to new tasks? Does fine-tuning make BERT more or less compressible?
b. Low levels of pruning (30-40%) are fine. Medium and high levels of pruning weaken performance on downstream tasks.
c. Finally, we observe that fine-tuning BERT on a specific task does not improve its prunability or change the order of pruning by a meaningful amount.
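For reference, magnitude pruning at a given sparsity level can be sketched as below. This is a one-shot NumPy illustration with a hypothetical function name; the paper's actual pruning schedule and layer selection are not reproduced here.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest absolute value."""
    W = np.asarray(W, dtype=float)
    k = int(round(sparsity * W.size))          # number of weights to remove
    mask = np.ones(W.size)
    if k > 0:
        order = np.argsort(np.abs(W), axis=None)  # flat indices, ascending magnitude
        mask[order[:k]] = 0.0
    mask = mask.reshape(W.shape)
    return W * mask, mask
```

Calling `magnitude_prune(W, 0.3)` would correspond to the low pruning level discussed above, under the assumption that sparsity is measured per weight matrix.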
Block Pruning For Faster Transformers. [EMNLP21]. François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush. [Gradient-based]
a. Hybrid-filled movement pruning.
TPrune: Efficient Transformer Pruning for Mobile Devices. [TCPS21]. Jiachen Mao, Huanrui Yang, Ang Li, Hai Li, Yiran Chen. [Regularization-based]
a. Using block-wise structured sparsity learning (BSSL), trained with regularization only, they find that WQ, WK, WV, and WFFN1 are pruned column-wise, while WO and WFFN2 are pruned row-wise. WQ, WK, and WV are pruned to some extent; WO, WFFN1, and WFFN2 gain hardly any sparsity. Different layers in the encoder-decoder should be given different sparsity ratios.
b. Proposes a method based on Structured Hoyer Square (also regularization-based).
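As a rough illustration of a group-level Hoyer-Square measure, the sketch below computes the squared sum of group L2 norms divided by the sum of squared group L2 norms, with rows or columns as groups. Whether TPrune uses exactly this form is an assumption, and the function name is hypothetical.

```python
import numpy as np

def structured_hoyer_square(W, axis):
    """Group Hoyer-Square: (sum_g ||w_g||_2)^2 / sum_g ||w_g||_2^2.

    axis=0 reduces over rows, so each column is one group;
    axis=1 reduces over columns, so each row is one group.
    """
    W = np.asarray(W, dtype=float)
    group_norms = np.sqrt((W ** 2).sum(axis=axis))
    denom = (group_norms ** 2).sum()
    if denom == 0.0:
        return 0.0  # all-zero matrix: define the measure as 0 to avoid dividing by zero
    return float(group_norms.sum() ** 2 / denom)
```

The value ranges from 1 (a single non-zero group) up to the number of groups (all groups with equal norm), so minimizing it as a regularizer pushes whole rows or columns toward zero.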
Blockwise Self-Attention for Long Document Understanding. [EMNLP20 Findings]. Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, Jie Tang. [Hand-crafted]
a. Block attention pruning, but with only two patterns: N=2 and N=3.
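A hand-crafted block attention mask of this kind can be sketched as below. The sketch assumes the sequence is split into equal contiguous blocks and that a permutation decides which key block each query block may attend to; the function name and this exact formulation are assumptions for illustration.

```python
import numpy as np

def block_attention_mask(seq_len, n_blocks, perm):
    """mask[i, j] = 1 iff the key block of j equals perm[query block of i]."""
    block = np.arange(seq_len) * n_blocks // seq_len  # block index of each position
    perm = np.asarray(perm)
    return (perm[block][:, None] == block[None, :]).astype(int)
```

With `n_blocks=2`, the identity permutation `[0, 1]` gives a block-diagonal mask and `[1, 0]` gives the shifted pattern; each pattern keeps only `1 / n_blocks` of the attention matrix.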
Learned Token Pruning for Transformers. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer. [Token pruning, gradient-based]
a. Input sequence lengths can vary greatly within tasks and between training and validation sets, and thus a single pruning configuration can potentially under-prune shorter sequences or over-prune longer sequences.
b. A binarized mask trained with the straight-through estimator.
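The threshold-based binarized mask and its straight-through backward pass can be sketched as follows. The sketch assumes per-token importance scores are already available and uses the simplest identity surrogate for the step function; the function names are hypothetical and the paper's exact training procedure may differ.

```python
import numpy as np

def token_mask_forward(scores, threshold):
    """Hard mask: keep a token iff its importance score exceeds the threshold."""
    return (np.asarray(scores, dtype=float) > threshold).astype(float)

def token_mask_backward(grad_mask):
    """Identity straight-through estimator for mask = step(score - threshold).

    The step is treated as the identity in the backward pass, so
    d mask / d score = 1 and d mask / d threshold = -1.
    """
    grad_mask = np.asarray(grad_mask, dtype=float)
    grad_scores = grad_mask.copy()
    grad_threshold = -float(grad_mask.sum())
    return grad_scores, grad_threshold
```

Because tokens are kept by threshold rather than by a fixed count, the number of surviving tokens adapts to each input sequence, which addresses the length-variation issue noted in (a).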