- A Neural Knowledge Language Model [-] Sungjin Ahn, Heeyoul Choi, Tanel Pärnamaa, Yoshua Bengio.
- add knowledge to Language Models.
- Tuple(head entity, relation, tail entity).
- token-level topic knowledge.
- aligned knowledge.
- Latent Relation Language Models [AAAI 2020] Hiroaki Hayashi*, Zecong Hu*, Chenyan Xiong, Graham Neubig.
- Span-level.
- Aligns entities to raw text through latent variables:
- span variable
- source variable
- relation variable
- Contextual Word Representations: A Contextual Introduction [-] Noah A. Smith.
- Survey of Contextual Word Representations
- Megatron-LM: Training Multi-Billion Parameter Language Models Using GPU Model Parallelism [-] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro.
- Model-parallel BERT.
- Uses a Pre-LN Transformer instead of the Post-LN Transformer of the original BERT.
- Uses GELU instead of ReLU.
- The model-parallel architecture needs only a few code changes.
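
The two architectural notes above (Pre-LN vs. Post-LN, GELU vs. ReLU) can be illustrated with a minimal pure-Python sketch. It is not Megatron-LM code: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the layer norm omits the learned scale and shift to stay short.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(x, eps=1e-5):
    # Normalization only; learned gamma/beta are left out of this sketch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    # Hypothetical stand-in for attention/FFN: element-wise GELU.
    return [gelu(v) for v in x]


def post_ln_block(x, sublayer):
    # Post-LN ordering: LN(x + Sublayer(x)).
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    # Pre-LN ordering: x + Sublayer(LN(x)); the residual path is not normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

In the Pre-LN ordering the input reaches the output through an unnormalized residual path, whereas Post-LN normalizes the sum. Unlike ReLU, GELU is smooth and slightly negative for small negative inputs.
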
- Neural Machine Translation with Byte-Level Subwords [-] Changhan Wang, Kyunghyun Cho, Jiatao Gu.
- UTF-8 -> BPE
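
The "UTF-8 -> BPE" idea above can be sketched as follows: text is first turned into UTF-8 bytes, and BPE merges are then learned over byte tokens rather than characters. This is a toy that learns merges from a single string; the helper names are hypothetical and the paper's actual training setup is not reproduced here.

```python
from collections import Counter


def to_byte_tokens(text: str) -> list:
    # Each UTF-8 byte becomes one base token (at most 256 base symbols).
    return [bytes([b]) for b in text.encode("utf-8")]


def most_frequent_pair(tokens):
    # Count adjacent token pairs and return the most frequent one, if any.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    # Replace every non-overlapping occurrence of `pair` with one merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(text: str, num_merges: int):
    # Greedy BPE over bytes: repeatedly merge the most frequent adjacent pair.
    tokens = to_byte_tokens(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges
```

Because the base symbols are bytes, any string in any script can be tokenized without an unknown-token fallback, and joining the tokens and decoding as UTF-8 recovers the original text.
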
- Semi-Supervised Sequence Modeling with Cross-View Training [EMNLP 2018] Kevin Clark, Minh-Thang Luong, Christopher D. Manning, Quoc V. Le.
- Uses windows + NER + a supervised model to train on the unlabeled data.
- Multi-task.
- Parameter-Efficient Transfer Learning for NLP [ICML 2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly.
- Motivation:
- Cloud service (Pass)
- Without forgetting previous knowledge (compare with continual learning)
- In multi-task learning, injecting new knowledge requires the previous data and retraining on all previous tasks.
- Compared with fine-tuning, the goal is to train far fewer parameters while keeping strong performance.
- Every Transformer layer has two Adapter modules: 12 × 2.
- The pre-trained BERT parameters are frozen. (Attention & FFN except LN)
- Each Adapter contains two feed-forward projections, one nonlinearity, and one skip connection.
- The skip connection ensures that the initial state is consistent with the pre-trained model.
- One Adapter adds 2dm + d + m parameters.
- Layer Norm also needs to be updated, because of $\gamma$ and $\beta$ in $y=\frac{x-\mathrm{E}\left[x\right]}{\sqrt{\mathrm{Var}\left[x\right]+\epsilon}}\ast\gamma+\beta$.
- In total, 2dm + d + m + 2d parameters are needed.
- Linear Layer
- Evaluated on GLUE and on additional classification tasks.
- Number of trained parameters vs. accuracy.
- Baselines: 1. Fine-tune only the top N Transformer layers. 2. Fine-tune only the LN parameters.
- Reducing the number of fine-tuned layers decreases accuracy dramatically.
- Adapter performance is stable across adapter dimensions.
- Fine-tuning only LN isn't useful.
- Is every Adapter layer significant?
- A single adapter on its own isn't important.
- Adapters in layers 0-4 barely affect performance.
- Lower layers extract lower-level features that are shared among tasks, while the higher layers build features that are unique to different tasks.
- The variance of the initialization parameters cannot be too large.
- Also tested: 1. adding LN/BN; 2. increasing the number of layers per adapter; 3. different activation functions; … The results are similar.
- Motivation:
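
The adapter notes above (two projections, one nonlinearity, a skip connection, and 2dm + d + m added parameters) can be made concrete with a small pure-Python sketch. It is not the authors' implementation: the ReLU nonlinearity, the `init_std` value, and the class and function names are assumptions chosen for illustration. Here `d` is the hidden size and `m` the adapter (bottleneck) size.

```python
import random


def adapter_param_count(d: int, m: int) -> int:
    # Down-projection (d*m weights + m biases) plus up-projection (m*d weights + d biases).
    return 2 * d * m + d + m


def layer_norm_param_count(d: int) -> int:
    # gamma and beta, each of size d.
    return 2 * d


class Adapter:
    """Bottleneck adapter: h + W_up * f(W_down * h). Illustrative only."""

    def __init__(self, d: int, m: int, init_std: float = 1e-3, seed: int = 0):
        rng = random.Random(seed)
        # Small-variance init (assumed value) keeps the module close to identity.
        self.w_down = [[rng.gauss(0.0, init_std) for _ in range(m)] for _ in range(d)]
        self.b_down = [0.0] * m
        self.w_up = [[rng.gauss(0.0, init_std) for _ in range(d)] for _ in range(m)]
        self.b_up = [0.0] * d

    def num_params(self) -> int:
        weights = sum(len(row) for row in self.w_down) + sum(len(row) for row in self.w_up)
        return weights + len(self.b_down) + len(self.b_up)

    def __call__(self, h):
        d, m = len(self.b_up), len(self.b_down)
        # Project d -> m, apply the nonlinearity (ReLU assumed), project m -> d.
        z = [sum(h[i] * self.w_down[i][j] for i in range(d)) + self.b_down[j] for j in range(m)]
        z = [max(0.0, v) for v in z]
        up = [sum(z[j] * self.w_up[j][k] for j in range(m)) + self.b_up[k] for k in range(d)]
        # Skip connection: with near-zero weights the output is approximately h.
        return [a + b for a, b in zip(h, up)]
```

With the near-zero initialization, `Adapter(8, 2)([1.0] * 8)` returns values very close to the input, which matches the note that the skip connection keeps the initial state consistent with the pre-trained model. For a 12-layer model with two adapters per layer, the adapter parameters total `12 * 2 * adapter_param_count(d, m)`.
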
- BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning [ICML 2019] Asa Cooper Stickland, Iain Murray.
- Projected Attention Layers add task-specific layers to the model.
- Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [-] Yige Xu, Xipeng Qiu, Ligao Zhou, Xuanjing Huang.
- Aims to replace plain fine-tuning with other methods.
- Self-ensemble: average the ensembled models.
- Self-distillation: learn from the gold labels plus the self-ensemble.
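
The two ideas above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the paper's code: self-ensemble is shown as an element-wise average of parameter snapshots, and the self-distillation loss is assumed to be cross-entropy on the gold label plus a weighted MSE between the student's logits and the self-ensemble (teacher) logits. The weight `lam` and all function names are hypothetical.

```python
import math


def average_parameters(snapshots):
    # Self-ensemble as parameter averaging: element-wise mean over snapshots.
    n = len(snapshots)
    return [sum(vals) / n for vals in zip(*snapshots)]


def softmax(logits):
    # Numerically stable softmax.
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, gold: int) -> float:
    # Negative log-probability of the gold class.
    return -math.log(softmax(logits)[gold])


def mse(a, b) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(student_logits, teacher_logits, gold: int, lam: float = 1.0) -> float:
    # Gold-label term plus a term pulling the student toward the self-ensemble teacher.
    return cross_entropy(student_logits, gold) + lam * mse(student_logits, teacher_logits)
```

When the student's logits already match the teacher's, the second term vanishes and the loss reduces to ordinary cross-entropy on the gold label.
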