Compression

Model Compression

Transformer Compression

An Efficient Transformer Decoder with Compressed Sub-layers [AAAI 20] Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu.
- In each decoder layer, the two attentions (self-attention and encoder-decoder attention) run in parallel.
- The FFN sub-layer is removed.

Pruning

Workshop

NeurIPS 2021 ENLSP Workshop: Efficient Natural Language and Speech Processing (Models, Training, and Inference)

Normal Pruning

Movement Pruning: Adaptive Sparsity by Fine-Tuning [NeurIPS 20] Victor Sanh, Thomas Wolf, Alexander M. Rush.
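- Motivation: magnitude pruning is a zeroth-order method that works on fully trained models, but it is a poor fit for pretrained models that still need fine-tuning. Movement pruning instead uses first-order information, i.e. how the weights move during fine-tuning.
- The output is $\mathbf{a}=(\mathbf{W} \odot \mathbf{M}) \mathbf{x}$, where $\mathbf{M}$ is a binary mask computed from learned scores $\mathbf{S}$.
- $\mathbf{M}=\mathrm{Top}_v(\mathbf{S})$ has entries in $\{0, 1\}$; the gradient is passed through $\mathrm{Top}_v$ with the straight-through estimator.
- The gradient of the loss with respect to the scores is therefore $\frac{\partial \mathcal{L}}{\partial S_{i, j}}=\frac{\partial \mathcal{L}}{\partial a_{i}} \frac{\partial a_{i}}{\partial S_{i, j}}=\frac{\partial \mathcal{L}}{\partial a_{i}} W_{i, j} x_{j}$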
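
The three formulas above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `top_v_mask`, `masked_forward`, and `score_gradients` are hypothetical, matrices are nested lists, and ties in the scores are broken by sort order.

```python
def top_v_mask(scores, v):
    """Return a 0/1 mask that keeps the v highest-scoring entries of `scores`."""
    ranked = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda t: t[0],
        reverse=True,
    )
    keep = {(i, j) for _, i, j in ranked[:v]}
    return [
        [1.0 if (i, j) in keep else 0.0 for j in range(len(row))]
        for i, row in enumerate(scores)
    ]


def masked_forward(weights, mask, x):
    """Compute a = (W * M) x, where * is the element-wise product."""
    return [
        sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def score_gradients(grad_a, weights, x):
    """Straight-through gradient: dL/dS[i][j] = dL/da[i] * W[i][j] * x[j]."""
    return [
        [g * w * xj for w, xj in zip(w_row, x)]
        for g, w_row in zip(grad_a, weights)
    ]


W = [[1.0, -2.0], [0.5, 3.0]]
S = [[0.1, 0.9], [0.8, 0.2]]
x = [1.0, 2.0]
M = top_v_mask(S, v=2)
a = masked_forward(W, M, x)
grad_S = score_gradients([1.0, -1.0], W, x)
```

Note that `score_gradients` ignores the mask: with the straight-through estimator, masked entries still receive a gradient, so a pruned weight can regain a high score later in fine-tuning.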

Head Pruning

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. [HPCA21]. Hanrui Wang, Zhekai Zhang, Song Han. [Hardware-software] a. Cascade token and head pruning (iterative; deeper layers are sparser). b. Magnitude-based / importance-based. c. Top-K engine.
Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures. Archit Parnami, Rahul Singh, Tarun Joshi. [Sensitivity-based] a. A* search, loss-based.
Differentiable Subset Pruning of Transformer Heads. [TACL21]. Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan. [Gradient-based] a. Gumbel top-k, gradient-based; works in both pipeline mode and joint pruning mode.
Are Sixteen Heads Really Better than One? [NeurIPS19]. Paul Michel, Omer Levy, Graham Neubig. [Sensitivity-based] a. Sensitivity-based, iterative.
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. [ACL19]. Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov. [Gradient-based] a. Stochastic gates with a Hard Gumbel-Softmax distribution.
Scheduled DropHead: A Regularization Method for Transformer Models. [EMNLP20 Findings]. Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou. a. Drops whole attention heads during training to prevent the multi-head attention model from being dominated by a small portion of attention heads.
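
As a rough illustration of the sensitivity-based entries above, head importance can be estimated as the mean absolute gradient of the loss with respect to a per-head gate, and the least important heads pruned first. The sketch below assumes the gate gradients have already been collected; `head_importance` and `heads_to_prune` are hypothetical names, not an API from any of the listed papers.

```python
def head_importance(gate_grads):
    """gate_grads[n][h] is dL/d(gate_h) for example n.

    Returns the mean absolute gradient per head, used as an importance proxy.
    """
    num_examples = len(gate_grads)
    num_heads = len(gate_grads[0])
    return [
        sum(abs(example[h]) for example in gate_grads) / num_examples
        for h in range(num_heads)
    ]


def heads_to_prune(importance, k):
    """Return the indices of the k least important heads, in ascending index order."""
    order = sorted(range(len(importance)), key=lambda h: importance[h])
    return sorted(order[:k])


grads = [[0.5, -0.1, 0.0], [-0.3, 0.1, 0.4]]
importance = head_importance(grads)
pruned = heads_to_prune(importance, k=1)
```

In an iterative scheme, the importance scores would be recomputed after each pruning step rather than once up front.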

Layer Pruning

Reducing Transformer Depth on Demand with Structured Dropout. [ICLR20]. Angela Fan, Edouard Grave, Armand Joulin. [DropConnect] a. Applies layer-level structured dropout (LayerDrop, in the spirit of DropConnect) during training, so that layers can be removed at inference.
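
A minimal sketch of layer-level dropout, assuming each layer is a callable whose residual connection makes skipping it equivalent to the identity. The names `forward_with_layerdrop` and `prune_every_other` are hypothetical, and keeping every other layer is only one possible inference-time pruning choice.

```python
import random


def forward_with_layerdrop(x, layers, p_drop, training, rng=random):
    """Apply `layers` in order, skipping each one with probability p_drop in training."""
    for layer in layers:
        if training and rng.random() < p_drop:
            continue  # layer dropped: the input passes through unchanged
        x = layer(x)
    return x


def prune_every_other(layers):
    """Keep layers 0, 2, 4, ... to roughly halve the depth at inference."""
    return layers[::2]


# Toy layers that each add 1, so the output counts how many layers ran.
toy_layers = [lambda h: h + 1 for _ in range(4)]
full_depth = forward_with_layerdrop(0, toy_layers, p_drop=0.5, training=False)
half_depth = forward_with_layerdrop(0, prune_every_other(toy_layers), 0.5, training=False)
```

Because the model has seen randomly missing layers during training, the shallower network obtained this way is less of a distribution shift than pruning layers from a normally trained model.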

Transformer Pruning

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. [TACL21]. Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett. [Survey] a. Unstructured pruning: magnitude, movement, Reweighted Proximal Pruning (RPP). b. Structured pruning: head, layer, embedding layer. c. Matrix decomposition.
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. [ACL20 Workshop Rep4NLP]. Mitchell A. Gordon, Kevin Duh, Nicholas Andrews. [Magnitude-based] a. Does compressing BERT impede its ability to transfer to new tasks? Does fine-tuning make BERT more or less compressible? b. Low levels of pruning (30-40%) are fine. Medium and high levels of pruning degrade performance on downstream tasks. c. Fine-tuning BERT on a specific task does not improve its prunability or change the order of pruning by a meaningful amount.
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. [ICML20]. Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez. [Iterative magnitude-based] a. A deeper, wider model that is then pruned performs better than a small model.
Structured Pruning of Large Language Models. [EMNLP20]. Ziheng Wang, Jeremy Wohlwend, Tao Lei. [Factorized structured pruning] a. Factorized low-rank pruning.
Structured Pruning of a BERT-based Question Answering Model. J.S. McCarley, Rishav Chakravarti, Avirup Sil. [Weight-based] a. Uses L0 regularization for structured pruning.
Block Pruning For Faster Transformers. [EMNLP21]. François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush. [Gradient-based] a. Hybrid-filled movement pruning.
MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models. Zhewei Yao, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney. [Gradient-based] a. Multi-stage, with different regularization to control non-uniform sparsity.
NViT: Vision Transformer Compression and Parameter Redistribution. Huanrui Yang, Hongxu Yin, Pavlo Molchanov, Hai Li, Jan Kautz. [Vision, gradient-based] a. Structured first-order Taylor pruning.
Layer-wise Model Pruning based on Mutual Information. [EMNLP21]. Chun Fan, Jiwei Li, Xiang Ao, Fei Wu, Yuxian Meng, Xiaofei Sun. [MI] a. Top-down iterative pruning based on mutual information.
Rethinking Network Pruning - under the Pre-train and Fine-tune Paradigm. Dongkuan Xu, Ian E.H. Yen, Jinxi Zhao, Zhibin Xiao. [Magnitude-based]
TPrune: Efficient Transformer Pruning for Mobile Devices. [TCPS21]. Jiachen Mao, Huanrui Yang, Ang Li, Hai Li, Yiran Chen. [Regularization-based] a. Trains with block-wise structured sparsity learning (BSSL) regularization only, and finds that WQ, WK, WV, WFFN1 become sparse column-wise while WO, WFFN2 become sparse row-wise. WQ, WK, WV are pruned to some extent; WO, WFFN1, WFFN2 hardly become sparse. Different layers in the encoder-decoder should use different sparsity ratios. b. Proposes a method based on Structured Hoyer Square (also regularization-based).
LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression. [COLING20]. Yihuan Mao, Yujing Wang, Chufan Wu, Chen Zhang, Yang Wang, Yaming Yang, Quanlu Zhang, Yunhai Tong, Jing Bai. a. Weight pruning + SVD + KD.
Reweighted Proximal Pruning for Large-Scale Language Representation. Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lin, Yanzhi Wang. [Proximal pruning] a. Reweighted L1, so that a large |wi| is not penalized much more heavily than a small |wj|. b. Uses a proximal operator to optimize the L1 term.
EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets. [ACL21]. Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, Jingjing Liu. [Magnitude-based structured] a. RPP + magnitude-based structured pruning + lottery tickets.
Chasing Sparsity in Vision Transformers: An End-to-End Exploration. [NeurIPS21]. Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang. [ViT + Iterative Taylor] a. Token Pruning + Iterative structured Taylor Attention head pruning + L1 FFN Pruning.
Aligned Weight Regularizers for Pruning Pretrained Neural Networks. [ARR21Nov]. [Magnitude-based] a. Uses a regularization loss to align the pruned weights with the original weights, e.g. cosine-based or Frobenius-based.
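
The reweighted L1 plus proximal step noted for Reweighted Proximal Pruning above can be sketched as follows. This is a generic illustration that assumes the common reweighting rule alpha_i = 1 / (|w_i| + eps) and the standard soft-thresholding proximal operator; it is not taken from the paper's code, and `reweight` and `prox_weighted_l1` are hypothetical names.

```python
def reweight(weights, eps=1e-3):
    """Per-weight L1 coefficients: small weights get a large penalty, large ones a small one."""
    return [1.0 / (abs(w) + eps) for w in weights]


def prox_weighted_l1(weights, alphas, lam):
    """Soft-threshold each weight by lam * alpha_i (proximal operator of the weighted L1 norm)."""
    result = []
    for w, a in zip(weights, alphas):
        shrunk = max(abs(w) - lam * a, 0.0)  # magnitude after shrinkage, floored at zero
        result.append(shrunk if w >= 0 else -shrunk)
    return result


weights = [2.0, -0.05, 0.5]
alphas = reweight(weights, eps=0.0)  # eps=0 is safe here only because no weight is exactly zero
pruned = prox_weighted_l1(weights, alphas, lam=0.1)
```

In this toy example the small weight is driven to exactly zero while the large one barely changes, which is the behavior the reweighting is meant to produce.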

Token Pruning or Sparse Attention

Adaptively Sparse Transformers. [EMNLP 2019]. Gonçalo M. Correia, Vlad Niculae, André F.T. Martins. a. α-entmax, which replaces softmax in attention.
Blockwise Self-Attention for Long Document Understanding. [EMNLP20 Findings]. Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, Jie Tang. [Hand-crafted] a. Block attention pruning, but with only two patterns (N = 2 or 3).
Learned Token Pruning for Transformers. Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer. [Token pruning, gradient-based] a. Input sequence lengths can vary greatly within tasks and between training and validation sets, so a single pruning configuration can under-prune shorter sequences or over-prune longer sequences. b. Binarized mask trained with the straight-through estimator.
PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. [ICML20]. Saurabh Goyal, Anamitra R. Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, Ashish Verma. [Cascade Token Pruning]
Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. [ACL21]. Gyuwan Kim, Kyunghyun Cho. [Token pruning] a. LengthDrop plus a trade-off search to find a configuration that meets the accuracy and efficiency requirements. b. Drop-and-Restore lets the method be used in generation/MRC tasks.
TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. [NAACL21]. Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun. [RL + token pruning] a. Uses RL to decide which tokens to keep, layer by layer.
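
A minimal sketch of attention-based token pruning, in the spirit of the token-pruning entries above. It assumes a token's importance is the mean attention probability it receives across heads and queries, and that tokens below a threshold are dropped; `token_importance` and `keep_tokens` are hypothetical names, and the threshold here is fixed rather than learned.

```python
def token_importance(attn):
    """attn[h][i][j] is the attention probability from query i to key j in head h.

    Returns, for each token j, the mean attention it receives.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    return [
        sum(attn[h][i][j] for h in range(num_heads) for i in range(seq_len))
        / (num_heads * seq_len)
        for j in range(seq_len)
    ]


def keep_tokens(importance, threshold):
    """Indices of tokens whose importance reaches the threshold."""
    return [j for j, score in enumerate(importance) if score >= threshold]


# One head, three tokens; each row is a softmax distribution over keys.
attn = [[[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]]
importance = token_importance(attn)
kept = keep_tokens(importance, threshold=0.2)
```

A threshold adapts to the input length (a long sequence can lose many tokens, a short one few), whereas a fixed top-k budget keeps the same number of tokens regardless of length.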

MoE

A Mixture of h−1 Heads is Better than h Heads. [ACL20]. Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith. a. MoE for Attention.
EBERT: Efficient BERT Inference with Dynamic Structured Pruning. [ACL21 Findings]. Zejian Liu, Fanrong Li, Gang Li, Jian Cheng. a. Uses a small model (FFN + BN) as a router to select structured weights (head-level in MHA, channel-level in FFN).

Embedding Compression

Compressing Word Embeddings via Deep Compositional Code Learning [ICLR 18] Raphael Shu, Hideki Nakayama.
- The first to propose code-based methods for the embedding compression problem.
- Learns a set of basis vectors in the word embedding space (far fewer than the vocabulary size) and composes them to represent every embedding.
- Uses Gumbel-Softmax for reparameterization.
Near-lossless Binarization of Word Embeddings [AAAI 2019] Julien Tissier, Christophe Gravier, Amaury Habrard.
- Uses an autoencoder to binarize embeddings; in a sense, a variant of the code-based methods.
Improving Word Embedding Factorization for Compression Using Distilled Nonlinear Neural Decomposition [EMNLP 20 Findings] Vasileios Lioutas, Ahmad Rashid, Krtin Kumar, Md Akmal Haidar, Mehdi Rezagholizadeh.
- KD + matrix decomposition.
Adaptive Compression of Word Embeddings [ACL 20] Yeachan Kim, Kang-Min Kim, SangKeun Lee.
- Adaptive code-based model.
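
To make the code-based idea in the entries above concrete: each word stores a short code of M indices, one per codebook of K basis vectors, and its embedding is the sum of the selected vectors. The sketch below covers only the reconstruction side with fixed toy codebooks (learning the codes, e.g. with Gumbel-Softmax, is omitted); `compose_embedding` and `code_bits` are hypothetical names.

```python
import math


def compose_embedding(code, codebooks):
    """Sum the basis vector selected from each codebook by the word's code."""
    dim = len(codebooks[0][0])
    vector = [0.0] * dim
    for book, index in zip(codebooks, code):
        for d in range(dim):
            vector[d] += book[index][d]
    return vector


def code_bits(num_codebooks, codebook_size):
    """Bits needed to store one word's code: M indices of ceil(log2 K) bits each."""
    return num_codebooks * math.ceil(math.log2(codebook_size))


# M = 2 codebooks, K = 2 basis vectors each, embedding dimension 2.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [2.0, -1.0]],
]
embedding = compose_embedding((1, 0), codebooks)
bits = code_bits(num_codebooks=2, codebook_size=2)
```

The storage cost per word is a handful of bits plus the shared codebooks, instead of one dense float vector per vocabulary entry.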
