feat: added the precision and recall metrics for QA accuracy #157
Conversation
| """ | ||
| if normalize_text: # pragma: no branch | ||
| model_output, target_output = (_normalize_text_quac_protocol(text) for text in (model_output, target_output)) | ||
| ret = precision(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) |
Why set and not list? Do we want to discard repetitions of words?
Valid question, but set seems standard. At least, that's what the NLTK metric we call here assumes.
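A minimal sketch of the point under discussion, assuming NLTK's `nltk.metrics.scores.precision` (which expects set-like inputs, so repeated words collapse before scoring); the example strings below are illustrative, not from the PR's tests:

```python
# NLTK's precision() calls .intersection() on its arguments, so set
# inputs are effectively required and duplicate words are deduplicated.
from nltk.metrics.scores import precision

target_output = "yes yes yes"
model_output = "yes indeed"

reference = set(target_output.split(" "))  # {"yes"} -- repetitions collapse
test = set(model_output.split(" "))        # {"yes", "indeed"}

# |reference & test| / |test| = 1 / 2
print(precision(reference=reference, test=test))  # 0.5
```

With sets, a correct word repeated many times still counts only once, which is the behavior NLTK's set-based metrics bake in.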
| """ | ||
| if normalize_text: # pragma: no branch | ||
| model_output, target_output = (_normalize_text_quac_protocol(text) for text in (model_output, target_output)) | ||
| ret = recall(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) |
Same comment as above.
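The recall counterpart, under the same assumption that the inputs are sets; NLTK's `recall` divides the overlap by the size of the reference (target) set instead of the test set:

```python
from nltk.metrics.scores import recall

reference = set("yes".split(" "))                    # target output words
test = set("yes the ship sank in 1912".split(" "))   # model output words

# |reference & test| / |reference| = 1 / 1
print(recall(reference=reference, test=test))  # 1.0
```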
| """ | ||
| if normalize_text: # pragma: no branch | ||
| model_output, target_output = (_normalize_text_quac_protocol(text) for text in (model_output, target_output)) | ||
| ret = precision(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid question, but set seems standard. At least that's what the NLTK metric assumes that we use here.
d36ef85
Description of changes: This pull request adds the Precision and Recall metrics for the Question Answering task. The existing metrics do not capture cases where one of the target output and model output is short and the other is long.
For instance, consider the question "Did RMS Titanic sink in 1912?". If the target output is `Yes` and the model output is `Yes. The ship indeed sank in 1912. It was the largest ship at the time <some long text>`, then the existing metrics will give a low score even though the answer is correct. The recall metric added in this PR will be `1.0`, indicating that all of the target output words are contained within the model output. The precision metric operates in the opposite direction and measures what fraction of the words in the model output are found in the target output.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
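A hedged end-to-end sketch of the Titanic example above; `_normalize_text_quac_protocol` is internal to this PR, so a simplified lowercase/strip-punctuation normalizer stands in for it here and is purely hypothetical:

```python
import string

from nltk.metrics.scores import precision, recall


def _normalize(text: str) -> str:
    # Hypothetical stand-in for _normalize_text_quac_protocol; the real
    # normalization protocol may differ.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


target_output = _normalize("Yes")
model_output = _normalize("Yes. The ship indeed sank in 1912.")

reference = set(target_output.split(" "))  # {"yes"}
test = set(model_output.split(" "))        # 7 distinct words

print(recall(reference=reference, test=test))     # 1.0  -- every target word appears
print(precision(reference=reference, test=test))  # ~0.14 -- 1 of 7 model words match
```

This illustrates the asymmetry described above: recall rewards the long-but-correct answer with `1.0`, while precision stays low because most model-output words are absent from the target.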