Add ONNX exporter support for ColPali model #2251
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
| "vision": "Embedding extraction for image.", | ||
| "text": "Embedding extraction for text.", |
wouldn't it make more sense to call them vision-language and language (or vision+text and text-only), etc.?
because we do pass text as well in the first one, no?
Yeah, I understand your point. My thinking on the naming was that it is more intuitive for the end user to name each variant after the modality it embeds, regardless of the internal logic, considering how real users would use it (the `ColPaliProcessor` implementation also takes either an image + prompt without custom text, or just a text). But what you say about the internal inputs is also true, so if you still prefer your naming I will follow your opinion.
let's go with vision-language and language-only, as the vision-language variant can work for both text+pixels and text alone (with empty pixels). cc @kstavro if you have any input on this!
yea I think that makes sense, looks good to me!
Confirmed it works with optimum-cli after the fix. It can now be exported like below:
`optimum-cli export onnx --model vidore/colpali-v1.3-hf ./onnx_output --task feature-extraction --variant vision`
The fp16 dtype is not working in optimum-cli for now, but that seems to be a problem in transformers; it works with fp32, so I just keep it as it is. It will probably be fixed in transformers in the future.
Okay, I took a deeper look, and it seems to me that doing it this way with variants will require users to create two ONNX models to solve the retrieval task (which is what ColPali is for); the problem is that those two models share most of the same weights. @echarlaix what do you think? I think we should have at least one clean VLM implementation (currently we have none) so that contributors can use it as a base recipe.
You've got the point, and that's exactly what I thought of and was looking for (but I did not have much time to work on it at the moment), so I greatly appreciate that you pointed out the necessity of baseline VLM configs. I did not know there are already some VLMs in optimum-intel, and I think it looks fine as a base for now. I think your team has a better understanding of how it should be implemented, but if it's helpful to make a simple starting point, at least for this model, I can definitely work on it. I'd kindly request your suggestion on how we can move forward with that!
Yeah you're right. Also, I just noticed that in the notebook created by @kstavro, simply passing zeros as pixel values is enough to get text embeddings from the same "vision" variant. I think having two variants is okay! Maybe we can even create an inference class ORTModelForRetrieval that enables a user to load a "vision" model and use it the same way as in https://huggingface.co/docs/transformers/en/model_doc/colpali
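As an illustration of that observation, here is a minimal sketch (not part of this PR) of querying the exported "vision" variant with zeroed pixel values to obtain text-only embeddings. The input/output names, the 448x448 resolution, and the single-output assumption are guesses and may differ from the actual exported graph:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("vidore/colpali-v1.3-hf")
session = ort.InferenceSession("onnx_output/model.onnx")

# text-only query: tokenize without an image
inputs = processor(text=["a query about invoices"], return_tensors="np")
batch_size = inputs["input_ids"].shape[0]

feeds = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
    # zeros stand in for pixel values; no image tokens are in the sequence,
    # so these values never reach the output embeddings
    "pixel_values": np.zeros((batch_size, 3, 448, 448), dtype=np.float32),
}
(embeddings,) = session.run(None, feeds)
print(embeddings.shape)  # (batch_size, sequence_length, embedding_dim)
```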
IlyasMoutawwakil
left a comment
LGTM, will merge tomorrow!
The name of the variants is a nit, let me know what you think.
Hey @Balladie @IlyasMoutawwakil, I just saw the tag from the other issue, thanks for taking care of this. The main thing to take from my notebook when exporting the model was that, in order to export a single model instead of two, the `ColPaliProcessor` needs to be changed to also return dummy pixel_values for text (maybe with an argument like `return_dummy_pixel_values` or so). That is actually enough for everything to work, as it makes the expected input of the ONNX model uniform. I think the dummy values don't actually play any role when processing text, as there are no image tokens in the model input (image tokens are only concatenated for images, and images are only embedded when image tokens exist in the model input). I first tried to manually concatenate the dummy pixel values to the text tokens for the text cases (while properly setting the attention mask to 0 for the image tokens), and that gave the exact same results, but with the additional computational cost of 1024 unnecessary image tokens, so not worth it. Then I realized that what I was doing was, in essence, a modified `ColPaliProcessor`. The trade-off of such an approach is that the `ColPaliProcessor` is part of the transformers library, not optimum, whereas with the export of two models everything remains more self-contained, but with double the memory consumption. That is why I asked in the other issue about any design preferences. I hope this helps somehow, thank you for integrating this model into the library :)
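For concreteness, a rough sketch of that single-model idea (not what this PR implements): a thin wrapper around `ColPaliProcessor` that also returns dummy pixel_values for text-only inputs, so every call produces the same set of ONNX inputs. The `return_dummy_pixel_values` flag and the (3, 448, 448) shape are hypothetical, and the sketch assumes `return_tensors="pt"`:

```python
import torch
from transformers import ColPaliProcessor


class UniformColPaliProcessor(ColPaliProcessor):
    """ColPaliProcessor that always returns pixel_values, even for text-only calls."""

    def __call__(self, *args, return_dummy_pixel_values=False, **kwargs):
        batch = super().__call__(*args, **kwargs)
        if return_dummy_pixel_values and "pixel_values" not in batch:
            batch_size = batch["input_ids"].shape[0]
            # zeros are never embedded because no image tokens are inserted for text
            batch["pixel_values"] = torch.zeros(batch_size, 3, 448, 448)
        return batch
```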
@IlyasMoutawwakil I have manually tested that separating it into three parts works (code and ONNX models here)
@kstavro no need to modify the processor logic, the dummy pixel values could rather be created in the inference class
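A hypothetical shape of that alternative (not the actual optimum API): the inference class, rather than the processor, supplies zeroed pixel values when a text-only batch comes in. The class name, input names, and 448 image size below are placeholders:

```python
import numpy as np
import onnxruntime as ort


class ORTColPaliForRetrieval:
    """Toy wrapper around an exported "vision"-variant ONNX graph."""

    def __init__(self, model_path: str, image_size: int = 448):
        self.session = ort.InferenceSession(model_path)
        self.image_size = image_size

    def forward(self, input_ids, attention_mask, pixel_values=None):
        if pixel_values is None:
            # text-only queries: zeros keep the graph inputs uniform and are
            # ignored because no image tokens are present in the sequence
            pixel_values = np.zeros(
                (input_ids.shape[0], 3, self.image_size, self.image_size),
                dtype=np.float32,
            )
        return self.session.run(
            None,
            {
                "input_ids": input_ids,
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
            },
        )
```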
Hi @Balladie @IlyasMoutawwakil, I am just back from my vacation and I noticed that the PR is still open. Do you require any assistance from my side for the implementation? If desired and/or preferred, I could find some time on the weekend to assist with the PR. Thanks again for taking care of this so far :)
What does this PR do?
Adds ONNX exporter support for the `ColPaliForRetrieval` model. The resulting ONNX export has two variant options: one for image embedding extraction, and the other for text embedding. Although a previous PR exists (#2074), that work became stale, and I just wanted to follow up and create a new one with some modifications. The exported models, conversion script, and usages are all uploaded to my collection, so please refer to it.
Some notes FYI below:

- The export could alternatively take `input_embeds` as input, but I stuck to the current implementation for consistency of input levels with the other configs in optimum.
- `optimum-cli` is currently not supported. To support it, `ColPaliForRetrieval` should be mapped to `AutoModel` in transformers, not to `AutoModelForPreTraining` as of now.

Thanks for the review, and please let me know if there's any suggestion! Any better idea is much appreciated.
Who can review?
@fxmarty, @echarlaix, @JingyaHuang, @michaelbenayoun