Conversation

@LetiP
Contributor

@LetiP LetiP commented Jun 7, 2024

This pull request concerns the example "Explaining a Question Answering Transformers Model"

Problems

A: Re-Tokenization Issue

The analysis does not work if a word is divided into several tokens and one of the subwords is also an independent word in the tokenizer's vocabulary.

In the analysis that this example code puts forward, the input is tokenized and masked. The masked input is then detokenized, and the part that is not masked is retokenized. This retokenization can differ from the original one, which results in a different sequence length and causes shape mismatch errors.

For example, consider the tokenization: ['what', 'did', 'i', 'eat', 'for', 'now', '?', '[SEP]', 'i', 'picked', 'up', 'a', 'bag', 'of', 'peanuts', 'and', 'rai', '##sin', '##s', 'for', 'a', 'snack', '.', 'i', 'wanted', 'a', 'sweet', '##er', 'snack', 'out', 'so', 'i', 'ate', 'them', 'for', 'now', '.']

The problem arises when the part of the text containing "rai" gets masked. In that case, the remaining "sins" will not be retokenized as "##sin" and "##s", but as "sins", which is recognized as a full word in the tokenizer's vocabulary. Thanks to Yuka Wolter for pointing out this bug to me.
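
For illustration, here is a minimal sketch of the mismatch. It assumes the distilbert/distilbert-base-cased-distilled-squad tokenizer that the notebook's pipeline defaults to; the exact subword splits depend on the vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased-distilled-squad")

# In the original text, "raisins" is split into subwords of one word.
print(tokenizer.tokenize("raisins"))  # e.g. ['rai', '##sin', '##s']

# After masking "rai" and decoding back to text, only "sins" remains, and
# retokenizing it yields a single vocabulary token, so the sequence length changes.
print(tokenizer.tokenize("sins"))  # e.g. ['sins']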

B: Inflexible Input Formatting

The current example only works for input formatted like this:

data = [
    "what did i eat for now? [SEP] i picked up a bag of peanuts and raisins for a snack. i wanted a sweeter snack out so i ate them for now."
]

However, it fails if the input is formatted without whitespace before and after [SEP]:

data = [
    "what did i eat for now?[SEP]i picked up a bag of peanuts and raisins for a snack. i wanted a sweeter snack out so i ate them for now."
]

Proposed changes

To address these issues, I propose rewriting f(questions, start) so that it does not retokenize after masking. The modified code below delivers the same results as the original but should be more robust because it avoids the retokenization step entirely.

The idea is to use the same text masker as the original code, but force it to output ids instead of strings:

explainer_start = shap.Explainer(f_start, shap.maskers.Text(tokenizer=pmodel.tokenizer, output_type='ids'))

This way, f(questions, start) works directly with ids and does not need to retokenize anything.
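
As a rough sketch of what an id-based f(questions, start) could look like (the name pmodel follows the notebook, while the loop body and output handling are illustrative assumptions rather than the merged code):

import torch
from transformers import pipeline

pmodel = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad")

def f(questions, start):
    outs = []
    for q in questions:
        # Each q already arrives as a sequence of token ids (output_type='ids'),
        # so the model can be fed directly, skipping the decode/retokenize round trip.
        input_ids = torch.tensor([[int(t) for t in q]])
        with torch.no_grad():
            output = pmodel.model(input_ids)
        logits = output.start_logits if start else output.end_logits
        outs.append(logits.reshape(-1).numpy())
    return outs

def f_start(questions):
    return f(questions, True)

def f_end(questions):
    return f(questions, False)

(The final notebook additionally passes the pre-tokenized inputs, tokenized_qs, into f, as quoted in the review comment below.)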

Remaining problems

This example still has some remaining problems:

  1. The fix does not address an existing issue in the original example: the interpretation output includes both the question and the context, even though the model is supposed to identify the token in the context that contains the answer. Including the question in the interpretation output does not make sense.
  2. The proposed fix only works if we provide one sentence at a time in data for analysis. I have not yet figured out how to modify the solution to accept multiple input sentences for interpretation.

Checklist

  • All pre-commit checks pass.
  • Unit tests added (if fixing a bug or adding a new feature)

Collaborator

@CloseChoice CloseChoice left a comment


Thanks for the great PR and the in-depth analysis.

I have three suggestions:

  1. Could we somehow get rid of the "TF-TRT Warning: Could not find TensorRT" warning? Either just delete the warning in the .ipynb file or install the corresponding dependencies. This is optional; if you struggle too much with it, please just ping me here.
  2. I would suggest swapping the order of things so that tokenized_qs is defined before it is first used in the functions:

def f_start(questions):
    return f(questions, tokenized_qs, True)


def f_end(questions):
    return f(questions, tokenized_qs, False)

My IDE gives me warnings about this, so please just define it first.
  3. Could you please run the cells in order, so that the appropriate cell numbers are displayed in the notebook?

@LetiP
Contributor Author

LetiP commented Jun 7, 2024

@CloseChoice , thanks for the tips. I have updated the notebook with the following:

  1. Regarding the "TF-TRT Warning: Could not find TensorRT" warning: I am sorry, but I do not see the warning you are talking about. I opened the .ipynb as a text file and searched for any warning, but did not find anything resembling it. I removed the following warning, though:

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

  2. Now tokenized_qs is defined at the beginning of the script, such that IDEs do not flag it anymore.
  3. Ran the cells in order, such that the cell numbers are steadily increasing.

@CloseChoice
Collaborator

Thanks for your PR. I looked deeper into this and I have to say, our current implementation (not your notebook) does seem incompatible with this notebook, so your solution is a clever workaround. Thanks.

@CloseChoice CloseChoice merged commit a3ddf5c into shap:master Jun 10, 2024
@LetiP LetiP changed the title from "Some fixed for the Explaining a Question Answering Transformers Model Example Notebook" to "Some fixes for the Explaining a Question Answering Transformers Model Example Notebook" Jun 11, 2024