Official implementation of the multimodal input ablation method introduced in the paper: "What Vision-Language Models 'See' when they See Scenes".
A tool to perform targeted semantic multimodal input ablation. It lets you perform textual ablation based on noun phrases instead of tokens, and visual ablation based on the content of a text.
- 🗃️ Repository: github.com/michelecafagna26/vl-ablation
- 📜 Paper: [What Vision-Language Models 'See' when they See Scenes](https://arxiv.org/abs/2109.07301)
- 🖊️ Contact: [email protected]
Requirements:

```
python >= 3.8
pytorch
torchvision
```

Install the dependencies:

```bash
pip install git+https://github.com/michelecafagna26/compress-fasttext
```

Install vl-ablation:

```bash
pip install git+https://github.com/michelecafagna26/vl-ablation.git#egg=ablation
```

Download the spaCy model:

```bash
python3 -m spacy download en_core_web_md
```

If you want to use the full model, download the original, non-distilled fastText model:

```bash
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
gzip -d cc.en.300.bin.gz
```

Textual ablation:

```python
from ablation.textual import TextualAblator
t_ablator = TextualAblator()
caption = "A table with pies being made and a person standing near a wall with pots and pans hanging on the wall"
ablations = t_ablator(caption)
```

`ablations` is a list of ablations that looks like this:

```python
[{'nps': (A table,),
'nps_index': [0],
'ablated_caption': 'pies being made and a person standing near a wall with pots and pans hanging on the wall'},
{'nps': (pies,),
'nps_index': [1],
'ablated_caption': 'A table and a person standing near a wall with pots and pans hanging on the wall'},
...]
```
where `nps` contains the ablated noun phrases, `nps_index` their indices, and `ablated_caption` the caption with those noun phrases removed. The list contains all possible combinations of noun phrases in the text.
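For example, a minimal sketch (assuming the `ablations` list produced above) that keeps only the ablations removing a single noun phrase and prints what was removed:

```python
# Minimal sketch: filter the ablations that remove exactly one noun phrase
# and show which phrase was dropped from the original caption.
single_np_ablations = [a for a in ablations if len(a["nps_index"]) == 1]

for a in single_np_ablations:
    removed = ", ".join(str(np) for np in a["nps"])  # noun phrases -> text
    print(f"removed {removed!r}: {a['ablated_caption']}")
```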
Visual ablation:

```python
from ablation.visual import VisualAblator
from PIL import Image
from io import BytesIO
import requests
img_url = "http://farm6.staticflickr.com/5003/5318500980_18b4dcf1fd_z.jpg"
# load the image
response = requests.get(img_url)
img = Image.open(BytesIO(response.content))
# perform visual ablation based on the text content
v_ablator = VisualAblator()
ablated_img, boxes = v_ablator(img, "a man in front of a stop sign")
```

The ablator identifies objects mentioned in the caption that are also present in the image. The match is performed semantically, so no exact match between the object label and the text is required.
`ablated_img` is the result of the ablation, namely the image with grey patches applied over the bounding boxes of the identified objects.
`boxes` looks like this:

```python
[{'token': 'man',
  'confidence': 0.7822560667991638,
  'coco_class': 'person',
  'coco_idx': 1}]
```
Note that the ablator can only identify objects belonging to the set of COCO annotation classes. Check the demo notebook to run this code.
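As a rough usage sketch (assuming `ablated_img` is returned as a PIL image; the output file name is illustrative), you could save the ablated image and inspect which COCO classes were matched:

```python
# Minimal sketch: persist the ablated image and list the matched objects.
ablated_img.save("ablated.jpg")  # illustrative file name

for box in boxes:
    print(f"caption token {box['token']!r} matched COCO class "
          f"{box['coco_class']!r} with confidence {box['confidence']:.2f}")
```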
If you want to use the full model, initialize the ablator as follows:

```python
fasttext_model = "path/to/the/model"
v_ablator = VisualAblator(fasttext_model, distilled=False)
```

If you use the distilled model (enabled by default), the fastText model takes less than 5 GB. Be aware that the original, non-distilled fastText embeddings take around 13-14 GB of memory.
```bibtex
@article{cafagna2021vision,
  title={What Vision-Language Models `See' when they See Scenes},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  journal={arXiv preprint arXiv:2109.07301},
  year={2021}
}
```