feat: added the precision and recall metrics for QA accuracy #157
Conversation
| """ | ||
| if normalize_text: # pragma: no branch | ||
| model_output, target_output = (_normalize_text_quac_protocol(text) for text in (model_output, target_output)) | ||
| ret = precision(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) |
Why set and not list? Do we want to discard repetitions of words?
Valid question, but set seems standard. At least, that's what the NLTK metric we call here assumes.
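A minimal sketch of the point under discussion, assuming NLTK's `nltk.metrics.scores.precision` (which expects set-like inputs, so repeated words collapse before scoring); the example strings below are illustrative, not from the PR's tests:

```python
# NLTK's precision() calls .intersection() on its arguments, so set
# inputs are effectively required and duplicate words are deduplicated.
from nltk.metrics.scores import precision

target_output = "yes yes yes"
model_output = "yes indeed"

reference = set(target_output.split(" "))  # {"yes"} -- repetitions collapse
test = set(model_output.split(" "))        # {"yes", "indeed"}

# |reference & test| / |test| = 1 / 2
print(precision(reference=reference, test=test))  # 0.5
```

With sets, a correct word repeated many times still counts only once, which is the behavior NLTK's set-based metrics bake in.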
| """ | ||
| if normalize_text: # pragma: no branch | ||
| model_output, target_output = (_normalize_text_quac_protocol(text) for text in (model_output, target_output)) | ||
| ret = recall(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) |
Same comment as above.
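The recall counterpart, under the same assumption that the inputs are sets; NLTK's `recall` divides the overlap by the size of the reference (target) set instead of the test set:

```python
from nltk.metrics.scores import recall

reference = set("yes".split(" "))                    # target output words
test = set("yes the ship sank in 1912".split(" "))   # model output words

# |reference & test| / |reference| = 1 / 1
print(recall(reference=reference, test=test))  # 1.0
```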
| """ | ||
| if normalize_text: # pragma: no branch | ||
| model_output, target_output = (_normalize_text_quac_protocol(text) for text in (model_output, target_output)) | ||
| ret = precision(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid question, but set seems standard. At least that's what the NLTK metric assumes that we use here.
d36ef85
Description of changes: This pull request adds the Precision and Recall metrics for the Question Answering task. The existing metrics do not capture cases where one of the target output and model output is short and the other is long.
For instance, consider the question "Did RMS Titanic sink in 1912?". If the target output is `Yes` and the model output is `Yes. The ship indeed sank in 1912. It was the largest ship at the time <some long text>`, then the existing metrics will give a low score even though the answer is correct. The recall metric added in this PR will be `1.0`, indicating that all of the target output words are contained within the model output. The precision metric operates in the opposite direction and measures what fraction of the words in the model output are found in the target output.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
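A hedged end-to-end sketch of the Titanic example above; `_normalize_text_quac_protocol` is internal to this PR, so a simplified lowercase/strip-punctuation normalizer stands in for it here and is purely hypothetical:

```python
import string

from nltk.metrics.scores import precision, recall


def _normalize(text: str) -> str:
    # Hypothetical stand-in for _normalize_text_quac_protocol; the real
    # normalization protocol may differ.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


target_output = _normalize("Yes")
model_output = _normalize("Yes. The ship indeed sank in 1912.")

reference = set(target_output.split(" "))  # {"yes"}
test = set(model_output.split(" "))        # 7 distinct words

print(recall(reference=reference, test=test))     # 1.0  -- every target word appears
print(precision(reference=reference, test=test))  # ~0.14 -- 1 of 7 model words match
```

This illustrates the asymmetry described above: recall rewards the long-but-correct answer with `1.0`, while precision stays low because most model-output words are absent from the target.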