BERT Rediscovers the Classical NLP Pipeline [ACL 2019] Ian Tenney, Dipanjan Das, Ellie Pavlick .
Scalar mixing weights: which layers matter most for a given task?
Cumulative scoring: how many layers does a given task need?
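The two metrics above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes a scalar mix of the form gamma * sum_l softmax(a)_l * h_l, plus an "expected layer" computed from the score gained when each additional layer is made available to the probe.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Mix per-layer vectors with softmax-normalized weights (one weight per layer)."""
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [gamma * sum(s[l] * layer_vectors[l][d] for l in range(len(s)))
            for d in range(dim)]

def center_of_gravity(raw_weights):
    """Average layer index under the mixing weights: higher means later layers matter more."""
    s = softmax(raw_weights)
    return sum(l * w for l, w in enumerate(s))

def expected_layer(cumulative_scores):
    """cumulative_scores[l] is the probe score when layers 0..l are available.
    Each layer is weighted by the score it adds over the previous one."""
    deltas = [cumulative_scores[l] - cumulative_scores[l - 1]
              for l in range(1, len(cumulative_scores))]
    return sum(l * d for l, d in enumerate(deltas, start=1)) / sum(deltas)

# Hypothetical probe scores for a 4-layer model (made-up numbers).
print(center_of_gravity([0.0, 0.0, 0.0]))
print(expected_layer([0.5, 0.7, 0.8, 0.8]))
```

A low expected layer suggests the task is mostly solved with the early layers; a high one suggests it needs deeper layers.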
Language Models as Knowledge Bases? [EMNLP 2019] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel .
BERT contains relational knowledge, even without fine-tuning.
But the experiments cannot fully verify this, because Google-RE and T-REx are both derived from Wikipedia, which is part of BERT's pre-training data.
The results may simply reflect co-occurrence patterns.
The higher BERT's output probability, the more likely the prediction is to be correct.
The Pearson correlation coefficient is used to examine the effect of co-occurrence.
ELMo behaves much like BERT, even though its training set contains no Wikipedia.
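As a reminder of what the correlation analysis measures, here is a minimal sketch of the Pearson coefficient. The variable names and numbers are hypothetical and only illustrate the shape of the analysis (for example, correlating how often an answer appears in the training data with how often the model predicts it correctly).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Made-up data: mention counts of the answer vs. precision@1 per relation.
mention_counts = [10, 50, 200, 1000]
precision_at_1 = [0.05, 0.10, 0.30, 0.45]
print(pearson(mention_counts, precision_at_1))
```

A coefficient near 1 would be consistent with the co-occurrence explanation, though correlation alone does not prove it.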
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference [ACL 2019] R. Thomas McCoy, Ellie Pavlick, Tal Linzen .
BERT performs poorly on examples that contradict common syntactic heuristics, such as:
lexical overlap
subsequence
constituent
The paper proposes a dataset, called HANS, that contains many such counter-heuristic examples.
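Two of the three heuristics can be checked with plain string operations, as sketched below. The function names are hypothetical, tokenization is deliberately naive (lowercase, whitespace split), and the constituent heuristic is omitted because it needs a syntactic parser.

```python
def tokenize(sentence):
    """Lowercase, drop a trailing period, split on whitespace."""
    return sentence.lower().rstrip(".").split()

def has_lexical_overlap(premise, hypothesis):
    """True if every hypothesis word also appears in the premise."""
    return set(tokenize(hypothesis)) <= set(tokenize(premise))

def is_subsequence(premise, hypothesis):
    """True if the hypothesis is a contiguous word sequence of the premise."""
    p, h = tokenize(premise), tokenize(hypothesis)
    n = len(h)
    return any(p[i:i + n] == h for i in range(len(p) - n + 1))

# Both pairs are non-entailment cases, yet the heuristics fire on them.
print(has_lexical_overlap("The doctor was paid by the actor.",
                          "The doctor paid the actor."))   # True
print(is_subsequence("The doctor near the actor danced.",
                     "The actor danced."))                 # True
```

A model that predicts entailment whenever one of these checks returns True would do well on typical training data but fail on such pairs.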
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data [ACL 2020] Emily M. Bender, Alexander Koller
They argue that pre-trained language models cannot understand the meaning of language.
Understanding language involves two parts: meaning and linguistic form.
In my opinion, memory is one part of understanding (it needs to be large, like our daily experience), and co-occurrence only ensures grammaticality.
For domain-specific tasks, co-occurrence is very important, especially for entity phrases.
On the other hand, topic-specific memory may be more important.
A Primer in BERTology: What we know about how BERT works [-] Anna Rogers, Olga Kovaleva, Anna Rumshisky .
What knowledge does BERT have?
BERT representations are hierarchical rather than linear.
BERT embeddings encode information about parts of speech, syntactic chunks and roles.
Syntactic structure is not directly encoded in self-attention weights, but they can be transformed to reflect it.
BERT takes subject-predicate agreement into account when performing the cloze task.
BERT is better able to detect the presence of NPIs (e.g. "ever") and the words that allow their use (e.g. "whether") than scope violations.
BERT does not "understand" negation and is insensitive to malformed input.
BERT's encoding of syntactic structure does not indicate that it actually relies on that knowledge.
Semantic knowledge
BERT has some knowledge of semantic roles.
BERT encodes information about entity types, relations, semantic roles, and proto-roles.
BERT struggles with representations of numbers
for some relation types, vanilla BERT is competitive with methods relying on knowledge bases
BERT cannot reason based on its world knowledge.
Localizing linguistic knowledge
Most self-attention heads do not directly encode any nontrivial linguistic information.
Some BERT heads seem to specialize in certain types of syntactic relations.
no single head has the complete syntactic tree information.
Attention weights are weak indicators of subject-verb agreement and reflexive anaphora.
Even when attention heads specialize in tracking semantic relations, they do not necessarily contribute to BERT's performance on relevant tasks.
lower layers have the most linear word order information.
Syntactic information is most prominent in the middle layers of BERT.
There is conflicting evidence about syntactic chunks.
The final layers of BERT are the most task-specific.
semantics is spread across the entire model
Training BERT
alternative training objectives
Future
Benchmarks that require verbal reasoning.
Developing methods to "teach" reasoning.
Learning what happens at inference time.
Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? [ACL 2020] Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, Samuel R. Bowman .
Tests which intermediate tasks help which downstream tasks.
Runs all 110 intermediate-downstream task pairs (10 × 11), covering both fine-tuning and probing.
Then computes the correlation matrix between task performances.
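The correlation step can be sketched as follows. This is an illustration under assumptions: the notes do not say which coefficient is used, so Spearman rank correlation is assumed here, and all task names and scores are made up. Each list holds one task's score after training on each intermediate task, in the same order.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))

def correlation_matrix(columns):
    """Pairwise Spearman correlation between named score columns."""
    names = list(columns)
    return {a: {b: spearman(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical scores: one entry per intermediate task, one column per evaluated task.
scores = {
    "probe_A":  [0.61, 0.70, 0.55, 0.80],
    "target_X": [0.72, 0.75, 0.68, 0.81],
    "target_Y": [0.50, 0.42, 0.58, 0.40],
}
for name, row in correlation_matrix(scores).items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

A high correlation between a probing task and a target task suggests that the skills measured by the probe track what the target task benefits from.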