Some fixes for the Explaining a Question Answering Transformers Model Example Notebook #3685

LetiP · 2024-06-07T12:13:42Z

This pull request concerns the example "Explaining a Question Answering Transformers Model"

Problems

A: Re-Tokenization Issue

The analysis does not work if a word is divided into several tokens and one of the subwords is also an independent word in the tokenizer's vocabulary.

In the analysis that this example code puts forward, the input is tokenized and masked. But then, the input is detokenized, and the part that is not masked, is retokenized. This can result in a different retokenization than the original one. This, then results in a different sequence length, causing shape mismatch errors.

For example, consider the tokenization: ['what', 'did', 'i', 'eat', 'for', 'now', '?', '[SEP]', 'i', 'picked', 'up', 'a', 'bag', 'of', 'peanuts', 'and', 'rai', '##sin', '##s', 'for', 'a', 'snack', '.', 'i', 'wanted', 'a', 'sweet', '##er', 'snack', 'out', 'so', 'i', 'ate', 'them', 'for', 'now', '.']

The problem arises when a part of the text including "rai" gets masked. In this case, "sins" will not be retokenized as "##sin" and "##s", but as "sins", which is recognized as a full word in the tokenizer's vocabulary. Thanks to Yuka Wolter for pointing out this bug to me.

B: Inflexible Input Formatting

The current example only works for input formatted like this:

data = [
    "what did i eat for now? [SEP] i picked up a bag of peanuts and raisins for a snack. i wanted a sweeter snack out so i ate them for now."
]

However, it fails if the input is formatted without whitespaces before and after [SEP]:

data = [
    "what did i eat for now?[SEP]i picked up a bag of peanuts and raisins for a snack. i wanted a sweeter snack out so i ate them for now."
]

Proposed changes

To address these issues, I propose rewriting f(questions, start) to avoid retokenizing after masking. The modified code below delivers the same results as the original but should be more robust because it avoids retokenization.

The idea is to use the same text masker as the original code, but force it to output ids instead of strings:
explainer_start = shap.Explainer(f_start, shap.maskers.Text(tokenizer=pmodel.tokenizer, output_type='ids'))
This way, f(questions, start) works directly with ids and does not need to retokenize anything.

Remaining problems

This example still has some remaining problems:

The fix does not address an existing issue in the original example where the output of the interpretation includes both the question and the context, even though the model is supposed to identify the token in the context that contains the answer. Including the questions in the interpretation outputs does not make sense.
The proposed fix only works if we provide one sentence at a time in data for analysis. I have not yet figured out how to modify the solution to accept multiple input sentences for interpretation.

Checklist

All pre-commit checks pass.
Unit tests added (if fixing a bug or adding a new feature)

for more information, see https://pre-commit.ci

CloseChoice

Thanks for the great PR and the in-depth analysis.

I have three suggestions:

Could we somehow get rid of the "TF-TRT Warning: Could not find TensorRT" warning? Either just delete the warning in the .ipynb file or install the according stuff. This is optional, if you struggle too much with it, please just go ping me here.
I would suggest to swap the order of things so that

def f_start(questions):
   return f(questions, tokenized_qs, True)


def f_end(questions):
   return f(questions, tokenized_qs, False)

tokenized_qs is defined before it is first used in the functions. My IDE gives me some warnings about this, so please just define it first.
3. Could you please run the cells in order, so that the appropriate numbers are displayed in the notebook?

…lls in order.

for more information, see https://pre-commit.ci

LetiP · 2024-06-07T15:24:01Z

@CloseChoice , thanks for the tips. I have updated the notebook with the following:

Regarding the "TF-TRT Warning: Could not find TensorRT" warning: I am sorry, but I do not see the warning you are talking about. I opened the ipynb as a text file, searched for any warning, did not find anything resembling it. I removed the following warning, though:

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

Now tokenized_qs is defined at the beginning of the script, such that IDEs do not flag it anymore.
Ran the cells in order, such that the cell numbers are steadily increasing.

CloseChoice · 2024-06-10T22:22:13Z

Thanks for your PR. I looked deeper into this and I have to say, our current implementation (not your notebook) does seem incompatible with this notebook, so your solution is a clever workaround. Thanks.

LetiP and others added 2 commits June 6, 2024 11:29

Fix the QA example to work for any sentence

2b6fb04

[pre-commit.ci] auto fixes from pre-commit.com hooks

5f325df

for more information, see https://pre-commit.ci

CloseChoice requested changes Jun 7, 2024

View reviewed changes

LetiP and others added 3 commits June 7, 2024 17:18

Define tokenized_qs variable at the beginning of the script. Rerun ce…

cd6e258

…lls in order.

Merge branch 'master' of https://github.com/LetiP/shap

4bffd8b

[pre-commit.ci] auto fixes from pre-commit.com hooks

7691cb7

for more information, see https://pre-commit.ci

CloseChoice merged commit a3ddf5c into shap:master Jun 10, 2024

LetiP changed the title ~~Some fixed for the Explaining a Question Answering Transformers Model Example Notebook~~ Some fixes for the Explaining a Question Answering Transformers Model Example Notebook Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Some fixes for the Explaining a Question Answering Transformers Model Example Notebook #3685

Some fixes for the Explaining a Question Answering Transformers Model Example Notebook #3685

Uh oh!

LetiP commented Jun 7, 2024 •

edited

Loading

Uh oh!

CloseChoice left a comment •

edited

Loading

Uh oh!

LetiP commented Jun 7, 2024

Uh oh!

CloseChoice commented Jun 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Some fixes for the Explaining a Question Answering Transformers Model Example Notebook #3685

Some fixes for the Explaining a Question Answering Transformers Model Example Notebook #3685

Uh oh!

Conversation

LetiP commented Jun 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problems

A: Re-Tokenization Issue

B: Inflexible Input Formatting

Proposed changes

Remaining problems

Checklist

Uh oh!

CloseChoice left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LetiP commented Jun 7, 2024

Uh oh!

CloseChoice commented Jun 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LetiP commented Jun 7, 2024 •

edited

Loading

CloseChoice left a comment •

edited

Loading