🔎 Data | 🔨 Code | 🤗 Huggingface Leaderboard | 📑 Paper
🤖ConvRe🤯 is the benchmark proposed in our EMNLP 2023 main conference paper: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations.
It aims to evaluate LLMs' ability to understand converse relations.
A converse relation is defined as the opposite of a semantic relation while the surface form of the triple is kept unchanged.
For example, the triple (x, has part, y) is interpreted as "x has a part called y" under the normal relation, but as "y has a part called x" under the converse relation 🔁.
The experiments in our paper suggest that LLMs often resort to shortcut learning (superficial correlations) and that even powerful models such as GPT-4 still face challenges on our 🤖ConvRe🤯 benchmark. The figure below shows the performance of GPT models under the zero-shot easy/hard settings on our benchmark. In both the Re2Text and Text2Re tasks, the GPT models exhibit a positive scaling trend under the easy setting and an inverse scaling trend under the hard setting. Please check our paper 📑 or the Huggingface leaderboard 🤗 for more detailed and comprehensive results.
Read this in 中文.
- [2023/10/09] The ConvRe benchmark has been released 🌟.
- [2023/10/08] ConvRe has been accepted by EMNLP 2023.
The ConvRe benchmark is composed of 17 relations and 1240 triples drawn from six widely used knowledge graph datasets: WN18RR, FB15K-237, NELL-ONE, Wikidata5M, ICEWS14, and ConceptNet5. The number of triples for each relation is listed below.
| Relation | # Triples | Source |
|---|---|---|
| hypernym | 80 | WN18RR |
| has part | 78 | WN18RR |
| organization, organization relationship, child | 75 | FB15K-237 |
| location, location, partially contains | 77 | FB15K-237 |
| athlete beat athlete | 80 | NELL-ONE |
| parent of | 145 | NELL-ONE & Wikidata5M |
| represented by | 79 | Wikidata5M |
| side effect | 8 | Wikidata5M |
| has facility | 62 | Wikidata5M |
| influenced by | 65 | Wikidata5M |
| owned by | 51 | Wikidata5M |
| consult | 73 | ICEWS14 |
| praise or endorse | 78 | ICEWS14 |
| made of | 80 | ConceptNet5 |
| used for | 79 | ConceptNet5 |
| has property | 55 | ConceptNet5 |
| has subevent | 75 | ConceptNet5 |
| Total | 1240 | |
The dataset files can be found in the data directory. Here is a description of each file.
- `re2text_relations.json`: the normal and converse relation definitions and the corresponding choices for each relation in the `re2text` task.
- `re2text_examples.json`: the few-shot examples for the `re2text` task, including the `normal` prompt setting and the `hint+cot` setting.
- `text2re_relations`: the normal and converse relation definitions and the corresponding choices for each relation in the `text2re` task.
- `text2re_examples.json`: the few-shot examples for the `text2re` task, including the `normal` prompt setting and the `hint+cot` setting.
- `triple_dataset`: the full dataset of the benchmark, including triples and correct answers.
- `triple_subset`: the subset used in our paper; it contains 328 triples and their corresponding correct answers.
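To get a quick look at these files, a minimal sketch like the one below can be used. It only assumes the files live under `data/`, as in the example commands later in this README; the internal layout of each file is best checked by printing it rather than relying on assumed field names.

```python
# Minimal inspection sketch; assumes the JSON files sit under data/ as in the
# example commands below. Print the loaded objects to see their actual structure.
import json

with open("data/re2text_relations.json", encoding="utf-8") as f:
    re2text_relations = json.load(f)

with open("data/re2text_examples.json", encoding="utf-8") as f:
    re2text_examples = json.load(f)

print("re2text relations:", list(re2text_relations))
print("re2text example settings:", list(re2text_examples))
```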
The models listed below have been tested and can be run directly using the scripts in Inference.
GPT TEXT MODELS
- text-ada-001
- text-babbage-001
- text-curie-001
- text-davinci-003
- gpt-3.5-turbo
- gpt-3.5-turbo-0301
- gpt-4
- gpt-4-0314
CLAUDE MODELS
- claude-1.3
- claude-instant-1.1
FLAN-T5 MODELS
- flan-t5-small
- flan-t5-base
- flan-t5-large
- flan-t5-xl
- flan-t5-xxl
LLAMA2 CHAT MODELS
- llama-2-7b-chat-hf
- llama-2-13b-chat-hf
- llama-2-70b-chat-hf
QWEN CHAT MODELS
- qwen-7b-chat
- qwen-14b-chat
INTERNLM MODELS
- internlm-chat-7b
- internlm-chat-20b
Our benchmark is available on Huggingface 🤗 (link). You can easily run inference with main_hf.py by specifying the following three arguments.
- `model_name`: the name of the large language model; see our supported model list.
- `task`: the subtask of the ConvRe benchmark: `text2re` or `re2text`.
- `setting`: the prompt setting for the current run (`prompt1` to `prompt12`); please refer to our paper (LINK) for more details on each setting.
Example
Here is the script to run prompt4 of the re2text task on text-davinci-003 👇
```
python3 main_hf.py --model_name text-davinci-003 --task re2text --setting prompt4
```

We also provide a more flexible way to run the experiments. There are eight arguments you need to specify.
- `model_name`: the name of the large language model you want to use; see our supported model list.
- `task`: the subtask of the ConvRe benchmark: `text2re` or `re2text`.
- `data_dir`: the directory where the dataset is stored.
- `prompt`: the type of prompt to use in the experiment: `normal`, `hint`, or `hint+cot`.
- `relation`: the relation type to use in the experiment: `normal` for the normal relation and `converse` for the converse relation.
- `n_shot`: the number of few-shot examples; choose a number in [0, 1, 2, 3, 4, 5, 6].
- `example_type`: the type of few-shot examples: `hard` or `regular`.
- `text_type`: the type of text to use in the experiment: `regular` or `hard`.
The argument settings for each of the 12 prompts used in our paper are listed below.
| Prompt ID | prompt | relation | n_shot | example_type | text_type |
|---|---|---|---|---|---|
| re2text 1# | normal | normal | 0 | regular | regular |
| text2re 1# | normal | normal | 0 | regular | hard |
| re2text 2# | normal | normal | 0 | regular | hard |
| text2re 2# | normal | normal | 0 | regular | regular |
| re2text 3# | normal | converse | 0 | regular | regular |
| text2re 3# | normal | converse | 0 | regular | hard |
| re2text 4# | normal | converse | 0 | regular | hard |
| text2re 4# | normal | converse | 0 | regular | regular |
| re2text 5# | hint | converse | 0 | regular | regular |
| text2re 5# | hint | converse | 0 | regular | hard |
| re2text 6# | hint | converse | 0 | regular | hard |
| text2re 6# | hint | converse | 0 | regular | regular |
| 7# | normal | converse | 3 | hard | hard |
| 8# | hint+cot | converse | 3 | hard | hard |
| 9# | normal | converse | 6 | hard | hard |
| 10# | normal | converse | 3 | regular | hard |
| 11# | hint+cot | converse | 3 | regular | hard |
| 12# | normal | converse | 6 | regular | hard |
Example
Here is the script to run prompt3 of the text2re task on gpt-3.5-turbo-0301 👇
```
python3 main.py --model_name gpt-3.5-turbo-0301 --task text2re --data_dir data --prompt normal --relation converse --n_shot 0 --example_type regular --text_type hard
```

There are three arguments that need to be specified when running the evaluation script.
- `file_path`: the path of the result file 📁.
- `model_family`: the model family of the result file, used to choose the corresponding evaluator. Choose from `flan-t5`, `claude`, `gpt-text`, `gpt-chat`, `llama2`, `qwen`, `internlm`.
- `mode`: we provide two evaluation modes: `strict` and `auto`. `strict` mode raises an error if the model's answer is not in the expected format; in that case, you should check the model's answer manually. `auto` mode simply ignores inconsistent answers. The performance calculated under `auto` mode may be lower than under `strict` mode, but it is very convenient and requires no human effort. 💡 The ability to follow the user's requested format is also an important indicator of an LLM's capability.
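For illustration only, an evaluation run then looks roughly like the command below; `evaluate.py` is a placeholder name (the evaluation entry point is not spelled out here), and the result path depends on where your inference run wrote its output.

```
# `evaluate.py` is a placeholder; substitute the repository's actual evaluation script.
python3 evaluate.py --file_path <path-to-result-file> --model_family gpt-chat --mode auto
```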
First, create a new class that inherits LanguageModels in llms_interface.py, and then implement the completion method according to the characteristics of your model (such as the structure of its API).
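A minimal sketch of what such a wrapper might look like is shown below; the class name, the `completion` signature, and the API call are assumptions, so follow the actual base class in `llms_interface.py`.

```python
# Hypothetical sketch; the real base-class interface lives in llms_interface.py
# and its exact method signature should be matched.
from llms_interface import LanguageModels


class MyNewModel(LanguageModels):
    """Wrapper around a hypothetical model API."""

    def completion(self, prompt: str) -> str:
        # Send `prompt` to your model's API and return the raw text answer.
        # Replace the placeholder below with the real client call, e.g.:
        #   response = my_client.generate(prompt)
        #   return response.text
        raise NotImplementedError("plug in your model's API call here")
```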
After obtaining the results, create a new class that inherits BaseEvaluator in llms_evaluator.py, and then implement the evaluate method according to the answer pattern of your model.
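Again as a hedged sketch: the `evaluate` signature and the option labels below are assumptions, so match the real `BaseEvaluator` interface and your model's actual answer format.

```python
# Hypothetical sketch; match the real BaseEvaluator interface in llms_evaluator.py.
# The option labels "A"/"B" are assumptions about the answer format.
from llms_evaluator import BaseEvaluator


class MyNewModelEvaluator(BaseEvaluator):
    """Parses the raw answers produced by MyNewModel."""

    def evaluate(self, answer: str) -> str:
        # Map the model's free-form answer onto one of the benchmark choices.
        normalized = answer.strip().upper()
        if normalized.startswith("A"):
            return "A"
        if normalized.startswith("B"):
            return "B"
        # An unparseable answer should surface so it can be checked manually
        # (this mirrors the behaviour described for `strict` mode above).
        raise ValueError(f"Unrecognized answer format: {answer!r}")
```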
To add a new relation to the benchmark, first check whether the relation meets the requirements in Section 2.5 of our paper. Then write the corresponding prompts for both the Re2Text and Text2Re tasks; the required fields are listed below, followed by an illustrative sketch.
Re2Text
Note: in this task, every question asks for the head entity.
- `normal`: the `normal` instruction of the relation.
- `converse`: the `converse` instruction of the relation.
- `normal-regular`: the `regular` description of the question under the `normal` relation.
- `normal-hard`: the `hard` description of the question under the `normal` relation.
- `converse-regular`: the `regular` description of the question under the `converse` relation.
- `converse-hard`: the `hard` description of the question under the `converse` relation.
Text2Re
- `normal`: the `normal` instruction of the relation.
- `converse`: the `converse` instruction of the relation.
- `hard`: the `hard` description of the question.
- `regular`: the `regular` description of the question.
- `normal-correct`: the `correct` choice under the `normal` relation.
- `normal-wrong`: the `wrong` choice under the `normal` relation.
- `converse-correct`: the `correct` choice under the `converse` relation.
- `converse-wrong`: the `wrong` choice under the `converse` relation.
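As the illustrative sketch referenced above: the relation name below is invented and the nesting (relation name mapped to its field dictionary) is an assumption, so mirror an existing entry in `re2text_relations.json` / `text2re_relations` for the exact layout.

```python
# Illustrative only: "located in" is a made-up relation; the field names follow
# the key lists above, and the wording of each prompt is yours to write.
import json

re2text_entry = {
    "located in": {
        "normal": "...",            # normal instruction for the relation
        "converse": "...",          # converse instruction for the relation
        "normal-regular": "...",    # regular question text under the normal relation
        "normal-hard": "...",       # hard question text under the normal relation
        "converse-regular": "...",  # regular question text under the converse relation
        "converse-hard": "...",     # hard question text under the converse relation
    }
}

text2re_entry = {
    "located in": {
        "normal": "...",
        "converse": "...",
        "regular": "...",           # regular question text
        "hard": "...",              # hard question text
        "normal-correct": "...",    # correct choice under the normal relation
        "normal-wrong": "...",      # wrong choice under the normal relation
        "converse-correct": "...",  # correct choice under the converse relation
        "converse-wrong": "...",    # wrong choice under the converse relation
    }
}

print(json.dumps({"re2text": re2text_entry, "text2re": text2re_entry}, indent=2))
```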
Feel free to add new models and relations to our benchmark🥰
@misc{qi2023investigation,
title={An Investigation of LLMs' Inefficacy in Understanding Converse Relations},
author={Chengwen Qi and Bowen Li and Binyuan Hui and Bailin Wang and Jinyang Li and Jinwang Wu and Yuanjun Laili},
year={2023},
eprint={2310.05163},
archivePrefix={arXiv},
primaryClass={cs.CL}
}