
Conversation

@Balladie (Contributor) commented May 6, 2025

What does this PR do?

Adds ONNX exporter support for the ColPaliForRetrieval model. The export has two variant options: one for image embedding extraction and the other for text embedding extraction. A previous PR (#2074) attempted this but became stale, so I followed up and created a new one with some modifications.

The exported models, conversion script, and usage examples are all uploaded to my collection, so please refer to it.
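For a quick illustration, here is a minimal sketch of querying the exported vision variant with onnxruntime. The file path and the assumption that the first graph output is the embedding tensor are mine, not part of the export; see the collection for the exact usage.

    # Minimal sketch (assumptions: the export lives in ./onnx_output/model.onnx
    # and the first graph output is the multi-vector embedding tensor).
    import onnxruntime as ort
    from PIL import Image
    from transformers import ColPaliProcessor

    processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3-hf")
    session = ort.InferenceSession("onnx_output/model.onnx")

    image = Image.open("document_page.png")  # any document image
    inputs = processor(images=[image], return_tensors="np")

    # Feed only the tensors the exported graph actually declares as inputs.
    input_names = {i.name for i in session.get_inputs()}
    feed = {name: array for name, array in inputs.items() if name in input_names}

    embeddings = session.run(None, feed)[0]
    print(embeddings.shape)  # (batch_size, sequence_length, embedding_dim)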

Some notes fyi below:

  • The reason for two variants (resulting in two separate ONNX models) stems from the difference between the forward passes for image and text. The image path runs an additional forward through the SigLIP vision tower; I could instead export a separate ONNX for the vision tower and take inputs_embeds as input, but I stuck to the current implementation for consistency of input levels with other configs in optimum.
  • In Add support to export ColPali Model to ONNX #2074 the image and text exports are split by task, but I thought a separation other than task would be better and decided to use variant (see the sketch after these notes).
  • This could also be implemented by first adding an ONNX config for PaliGemma and inheriting from it, but this model is intended for embedding extraction (no need for past KV, etc.), so I skipped that and left it as a class inheriting the Gemma config directly. It can still be implemented that way in the future.
  • optimum-cli is currently not supported. To support it, ColPaliForRetrieval would need to be mapped to AutoModel in transformers rather than to AutoModelForPreTraining as it is now.
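A rough, illustrative sketch of this variant-based setup, loosely following how variant-based configs in optimum are typically written (e.g. the SAM config); this is not the exact code in this PR, and the dynamic axes, default variant, and output naming below are assumptions:

    # Illustrative sketch only; dynamic axes, default variant, and naming are assumptions.
    from typing import Dict

    from optimum.exporters.onnx.model_configs import GemmaOnnxConfig


    class ColPaliOnnxConfigSketch(GemmaOnnxConfig):
        VARIANTS = {
            "vision": "Embedding extraction for image.",
            "text": "Embedding extraction for text.",
        }
        DEFAULT_VARIANT = "vision"

        @property
        def inputs(self) -> Dict[str, Dict[int, str]]:
            common = {
                "input_ids": {0: "batch_size", 1: "sequence_length"},
                "attention_mask": {0: "batch_size", 1: "sequence_length"},
            }
            if self.variant == "vision":
                # The vision variant additionally runs the SigLIP vision tower.
                common["pixel_values"] = {0: "batch_size"}
            return common

        @property
        def outputs(self) -> Dict[str, Dict[int, str]]:
            return {"embeddings": {0: "batch_size", 1: "sequence_length"}}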

Thanks for the review, and please let me know if you have any suggestions! Any better idea is much appreciated.

Who can review?

@fxmarty, @echarlaix, @JingyaHuang, @michaelbenayoun

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +2709 to +2710
"vision": "Embedding extraction for image.",
"text": "Embedding extraction for text.",
Member

wouldn't it make more sense to call them vision-language and language (or vision+text and text-only), etc.?
because we do pass text as well in the first one, no?

Contributor Author

Yeah, I understand your point. My thought on the naming was that it's more intuitive for end users to call each variant by the embedding modality, regardless of the internal logic, considering real usage scenarios (the ColPaliProcessor implementation also takes either an image + prompt without custom text, or just a text). But what you say about the internal inputs is also true, so if you still prefer it I will follow your opinion.

Member

let's go with vision-language and language-only, as the vision-language variant can work for both text+pixels and text alone (with empty pixels). cc @kstavro if you have any input on this!

Contributor Author

yea I think that makes sense, looks good to me!

@Balladie (Contributor, Author) commented May 7, 2025

Confirmed it works with optimum-cli after the fix. Can now export it like below:

optimum-cli export onnx --model vidore/colpali-v1.3-hf ./onnx_output --task feature-extraction --variant vision
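The text variant can presumably be exported the same way by switching the flag (variant name as defined in this PR):

optimum-cli export onnx --model vidore/colpali-v1.3-hf ./onnx_output_text --task feature-extraction --variant text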

The fp16 dtype is not working in optimum-cli for now, but that seems to be a problem on the transformers side; it works with fp32, so I'll keep it as is. It will probably be fixed in transformers in the future.

@IlyasMoutawwakil (Member) commented May 12, 2025

Okay, I took a deeper look, and it seems to me that doing it this way with variants will require users to create two ONNX models to solve the retrieval task (which is what ColPali is for); the problem is that those two models share most of the same weights.
I think a better way is to follow the example in optimum-intel, where for VLMs we define a couple of "behaviors" that are exported separately and used together during inference.
So instead of having to load two VLMs, one that takes pixel values and another that doesn't, we can divide the forward call defined here into "components"/"behaviors" that run depending on the input.
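To illustrate the idea (hypothetical names only, loosely following the optimum-intel pattern; not an existing API):

    # Each behavior becomes its own ONNX export; inference runs only the
    # components the given input requires, so the shared decoder weights are
    # exported and loaded once instead of twice.
    from enum import Enum


    class ColPaliExportBehavior(str, Enum):
        VISION_EMBEDDINGS = "vision_embeddings"  # SigLIP vision tower: pixel_values -> image features
        TEXT_EMBEDDINGS = "text_embeddings"      # embedding layer: input_ids -> inputs_embeds
        LANGUAGE_MODEL = "language_model"        # Gemma decoder: inputs_embeds -> multi-vector embeddings


    BEHAVIOR_INPUTS = {
        ColPaliExportBehavior.VISION_EMBEDDINGS: ["pixel_values"],
        ColPaliExportBehavior.TEXT_EMBEDDINGS: ["input_ids"],
        ColPaliExportBehavior.LANGUAGE_MODEL: ["inputs_embeds", "attention_mask"],
    }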
@Balladie tell me if what I said makes sense 😄

@echarlaix what do you think ? I think we should have at least one clean VLM implementation (currently we have none) so that contributors can use it as a base recipe.

@Balladie (Contributor, Author)

You've got the point, and that's exactly what I was thinking of and looking for (but I didn't have much time to work on it atm), so I greatly appreciate you pointing out the need for baseline VLM configs. I didn't know some VLMs are already in optimum-intel; it looks fine as a base for now.
So basically what I understand is that we can divide the model into parts (e.g. vision tower, text embeddings, LM decoder), one per behavior, and export them into three ONNX models like this, with formats compatible with transformers.js (which is also what I'd like to have). I would love to see optimum support that natively (it has been done quite manually so far, as far as I know), and it would also be great if optimum-cli exposed it as an option for users (e.g. --behavior vision).

I think your team has a better understanding of how this should be implemented, but if it's helpful to make a simple starting point, at least for this model, I can definitely work on it. I'd kindly request your suggestion on how we can move forward!

@IlyasMoutawwakil (Member) commented May 25, 2025

Yeah, you're right. Also, I just noticed that in the notebook created by @kstavro, simply passing zeros as pixel values is enough to get text embeddings from the same "vision" variant. I think having two variants is okay! Maybe we can even create an inference class ORTModelForRetrieval that enables a user to load a "vision" model and use it the same way as in https://huggingface.co/docs/transformers/en/model_doc/colpali
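A minimal sketch of that idea (hypothetical class, not an existing optimum API; the input/output names and the SigLIP resolution are assumptions):

    import numpy as np
    import onnxruntime as ort


    class ORTColPaliForRetrievalSketch:
        """Serve both modalities from a single "vision" export by substituting
        zero pixel values when only text is passed."""

        def __init__(self, onnx_path, image_size=448, num_channels=3):
            self.session = ort.InferenceSession(onnx_path)
            self.image_size = image_size  # assumed SigLIP input resolution
            self.num_channels = num_channels

        def __call__(self, input_ids, attention_mask, pixel_values=None):
            if pixel_values is None:
                # Text-only query: feed zeros, mirroring the trick from the notebook.
                batch_size = input_ids.shape[0]
                pixel_values = np.zeros(
                    (batch_size, self.num_channels, self.image_size, self.image_size),
                    dtype=np.float32,
                )
            feed = {
                "input_ids": input_ids,
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
            }
            return self.session.run(None, feed)[0]  # first output assumed to be the embeddings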

@IlyasMoutawwakil (Member) left a comment

LGTM, will merge tomorrow!
The naming of the variants is a nit, let me know what you think.

@kstavro commented May 26, 2025

Hey @Balladie @IlyasMoutawwakil, I just saw the tag from the other issue, thanks for taking care of this.

The main takeaway from my notebook when exporting the model was that, in order to export a single model instead of two, the ColPaliProcessor needs to be changed to also return dummy pixel_values for text (maybe with an argument like return_dummy_pixel_values or so). That is actually enough for everything to work, as it makes the expected input for the ONNX model uniform.

I think the dummy values don't actually play any role when processing text, as there are no image tokens in the model input (image tokens are only concatenated for images, and images are embedded only when image tokens exist). I first tried manually concatenating the dummy pixel values to the text tokens for the text case (while properly setting the attention mask to 0 for the image tokens), and that gave the exact same results, but with the additional computational cost of 1024 unnecessary image tokens, so it's not worth it. Then I realized that what I was doing was in essence a modified ColPaliProcessor.

The trade-off of such an approach is that the ColPaliProcessor is part of the transformers library, not optimum, whereas with the export of two models everything remains more self-contained, but with double the memory consumption. That is why I asked in the other issue about any design preferences.
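For concreteness, a sketch of the processor-side approach as a hypothetical wrapper function kept outside transformers (the function name and image size are assumptions, along the lines suggested above):

    import numpy as np
    from transformers import ColPaliProcessor


    def process_queries_with_dummy_pixels(processor: ColPaliProcessor, texts, image_size=448):
        """Tokenize text queries and attach zero pixel_values so that text and
        image inputs share the single signature expected by one exported ONNX model."""
        batch = processor(text=texts, return_tensors="np")
        # The dummy values are never merged into the sequence because the text
        # inputs contain no image tokens; they only make the input uniform.
        batch["pixel_values"] = np.zeros(
            (len(texts), 3, image_size, image_size), dtype=np.float32
        )
        return batch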

I hope this helps somehow, thank you for integrating this model into the library :)

@Balladie (Contributor, Author)

@IlyasMoutawwakil I have manually tested that separating the model into three parts works (code and ONNX models here).
It's not a full reflection of the "behavior" concept, but the flow is aligned, so I'm sharing it in its current form; hope it helps.

@IlyasMoutawwakil (Member)

@kstavro no need to modify the processor logic; dummy pixel values could instead be created in the inference class (ORTModelForRetrieval, for example) if they are not passed.
We do something similar for LLMs, where we export them with past key values and create dummy ones in the first forward pass:

use_cache_branch, past_key_values, known_output_shapes = self.prepare_past_key_values(

@echarlaix added the onnx (Related to the ONNX export) label on Jun 4, 2025
@kstavro commented Jun 5, 2025

Hi @Balladie @IlyasMoutawwakil, I am just back from my vacation and I noticed that the PR is still open. Do you require any assistance from my side for the implementation? If desired and/or preferred, I could find some time on the weekend to assist with the PR.

Thanks again for taking care of this so far :)

@echarlaix merged commit 1c7012f into huggingface:main on Jun 6, 2025
30 checks passed