
Conversation

@Balladie (Contributor) commented May 6, 2025

What does this PR do?

Adds ONNX exporter support for the ColPaliForRetrieval model. The export has two variant options: one for image embedding extraction and the other for text embedding extraction. A previous PR (#2074) attempted this but became stale, so I followed up and created a new one with some modifications.

The exported models, conversion script, and usage examples are all uploaded to my collection, so please refer to it.
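For a quick illustration, here is a minimal sketch of querying the exported vision variant with onnxruntime. The file path and the assumption that the first graph output is the embedding tensor are mine, not part of the export; see the collection for the exact usage.

    # Minimal sketch (assumptions: the export lives in ./onnx_output/model.onnx
    # and the first graph output is the multi-vector embedding tensor).
    import onnxruntime as ort
    from PIL import Image
    from transformers import ColPaliProcessor

    processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3-hf")
    session = ort.InferenceSession("onnx_output/model.onnx")

    image = Image.open("document_page.png")  # any document image
    inputs = processor(images=[image], return_tensors="np")

    # Feed only the tensors the exported graph actually declares as inputs.
    input_names = {i.name for i in session.get_inputs()}
    feed = {name: array for name, array in inputs.items() if name in input_names}

    embeddings = session.run(None, feed)[0]
    print(embeddings.shape)  # (batch_size, sequence_length, embedding_dim)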

Some notes fyi below:

  • The reason for two variants (resulting in two separate ONNX models) stems from the difference between the forward passes for image and text. The image path runs an additional forward through the SigLIP vision tower; I could instead export a separate ONNX for the vision tower and take inputs_embeds as input, but I stuck to the current implementation for consistency of input levels with other configs in optimum.
  • In Add support to export ColPali Model to ONNX #2074 the image and text exports are split by task, but I thought a separation other than task would be better and decided to use variant (see the sketch after these notes).
  • This could also be implemented by first adding an ONNX config for PaliGemma and inheriting from it, but this model is intended for embedding extraction (no need for past KV, etc.), so I skipped that and left it as a class inheriting the Gemma config directly. It can still be implemented that way in the future.
  • optimum-cli is currently not supported. To support it, ColPaliForRetrieval would need to be mapped to AutoModel in transformers rather than to AutoModelForPreTraining as it is now.
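A rough, illustrative sketch of this variant-based setup, loosely following how variant-based configs in optimum are typically written (e.g. the SAM config); this is not the exact code in this PR, and the dynamic axes, default variant, and output naming below are assumptions:

    # Illustrative sketch only; dynamic axes, default variant, and naming are assumptions.
    from typing import Dict

    from optimum.exporters.onnx.model_configs import GemmaOnnxConfig


    class ColPaliOnnxConfigSketch(GemmaOnnxConfig):
        VARIANTS = {
            "vision": "Embedding extraction for image.",
            "text": "Embedding extraction for text.",
        }
        DEFAULT_VARIANT = "vision"

        @property
        def inputs(self) -> Dict[str, Dict[int, str]]:
            common = {
                "input_ids": {0: "batch_size", 1: "sequence_length"},
                "attention_mask": {0: "batch_size", 1: "sequence_length"},
            }
            if self.variant == "vision":
                # The vision variant additionally runs the SigLIP vision tower.
                common["pixel_values"] = {0: "batch_size"}
            return common

        @property
        def outputs(self) -> Dict[str, Dict[int, str]]:
            return {"embeddings": {0: "batch_size", 1: "sequence_length"}}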

Thanks for the review, and please let me know if you have any suggestions! Any better idea is much appreciated.

Who can review?

@fxmarty, @echarlaix, @JingyaHuang, @michaelbenayoun

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +2709 to +2710
"vision": "Embedding extraction for image.",
"text": "Embedding extraction for text.",
Member

wouldn't it make more sense to call them vision-language and language (or vision+text and text-only), etc.?
because we do pass text as well in the first one, no?

Contributor Author

Yeah, I understand your point. My thought on the naming was that it's more intuitive for end users to call each variant by the embedding modality, regardless of the internal logic, considering real usage scenarios (the ColPaliProcessor implementation also takes either an image + prompt without custom text, or just a text). But what you say about the internal inputs is also true, so if you still prefer it I will follow your opinion.

Member

let's go with vision-language and language-only, as the vision-language variant can work for both text+pixels and text alone (with empty pixels). cc @kstavro if you have any input on this!

Contributor Author

yea I think that makes sense, looks good to me!

@Balladie (Contributor, Author) commented May 7, 2025

Confirmed it works with optimum-cli after the fix. Can now export it like below:

optimum-cli export onnx --model vidore/colpali-v1.3-hf ./onnx_output --task feature-extraction --variant vision
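The text variant can presumably be exported the same way by switching the flag (variant name as defined in this PR):

optimum-cli export onnx --model vidore/colpali-v1.3-hf ./onnx_output_text --task feature-extraction --variant text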

The fp16 dtype is not working in optimum-cli for now, but that seems to be a problem on the transformers side; it works with fp32, so I'll keep it as is. It will probably be fixed in transformers in the future.

@IlyasMoutawwakil (Member) commented May 12, 2025

Okay, I took a deeper look, and it seems to me that doing it this way with variants will require users to create two ONNX models to solve the retrieval task (which is what ColPali is for); the problem is that those two models share most of the same weights.
I think a better way is to follow the example in optimum-intel, where for VLMs we define a couple of "behaviors" that are exported separately and used together during inference.
So instead of having to load two VLMs, one that takes pixel values and another that doesn't, we can divide the forward call defined here into "components"/"behaviors" that run depending on the input.
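To illustrate the idea (hypothetical names only, loosely following the optimum-intel pattern; not an existing API):

    # Each behavior becomes its own ONNX export; inference runs only the
    # components the given input requires, so the shared decoder weights are
    # exported and loaded once instead of twice.
    from enum import Enum


    class ColPaliExportBehavior(str, Enum):
        VISION_EMBEDDINGS = "vision_embeddings"  # SigLIP vision tower: pixel_values -> image features
        TEXT_EMBEDDINGS = "text_embeddings"      # embedding layer: input_ids -> inputs_embeds
        LANGUAGE_MODEL = "language_model"        # Gemma decoder: inputs_embeds -> multi-vector embeddings


    BEHAVIOR_INPUTS = {
        ColPaliExportBehavior.VISION_EMBEDDINGS: ["pixel_values"],
        ColPaliExportBehavior.TEXT_EMBEDDINGS: ["input_ids"],
        ColPaliExportBehavior.LANGUAGE_MODEL: ["inputs_embeds", "attention_mask"],
    }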
@Balladie tell me if what I said makes sense 😄

@echarlaix what do you think ? I think we should have at least one clean VLM implementation (currently we have none) so that contributors can use it as a base recipe.

@Balladie (Contributor, Author)

You've got the point, and that's exactly what I was thinking of and looking for (but I didn't have much time to work on it atm), so I greatly appreciate you pointing out the need for baseline VLM configs. I didn't know some VLMs are already in optimum-intel; it looks fine as a base for now.
So basically what I understand is that we can divide the model into parts (e.g. vision tower, text embeddings, LM decoder), one per behavior, and export them into three ONNX models like this, with formats compatible with transformers.js (which is also what I'd like to have). I would love to see optimum support that natively (it has been done quite manually so far, as far as I know), and it would also be great if optimum-cli exposed it as an option for users (e.g. --behavior vision).

I think your team has a better understanding of how this should be implemented, but if it's helpful to make a simple starting point, at least for this model, I can definitely work on it. I'd kindly request your suggestion on how we can move forward!

@IlyasMoutawwakil (Member) commented May 25, 2025

Yeah, you're right. Also, I just noticed that in the notebook created by @kstavro, simply passing zeros as pixel values is enough to get text embeddings from the same "vision" variant. I think having two variants is okay! Maybe we can even create an inference class ORTModelForRetrieval that enables a user to load a "vision" model and use it the same way as in https://huggingface.co/docs/transformers/en/model_doc/colpali
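A minimal sketch of that idea (hypothetical class, not an existing optimum API; the input/output names and the SigLIP resolution are assumptions):

    import numpy as np
    import onnxruntime as ort


    class ORTColPaliForRetrievalSketch:
        """Serve both modalities from a single "vision" export by substituting
        zero pixel values when only text is passed."""

        def __init__(self, onnx_path, image_size=448, num_channels=3):
            self.session = ort.InferenceSession(onnx_path)
            self.image_size = image_size  # assumed SigLIP input resolution
            self.num_channels = num_channels

        def __call__(self, input_ids, attention_mask, pixel_values=None):
            if pixel_values is None:
                # Text-only query: feed zeros, mirroring the trick from the notebook.
                batch_size = input_ids.shape[0]
                pixel_values = np.zeros(
                    (batch_size, self.num_channels, self.image_size, self.image_size),
                    dtype=np.float32,
                )
            feed = {
                "input_ids": input_ids,
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
            }
            return self.session.run(None, feed)[0]  # first output assumed to be the embeddings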

@IlyasMoutawwakil (Member) left a comment

LGTM, will merge tomorrow!
The naming of the variants is a nit, let me know what you think.

@kstavro commented May 26, 2025

Hey @Balladie @IlyasMoutawwakil, I just saw the tag from the other issue, thanks for taking care of this.

The main takeaway from my notebook when exporting the model was that, in order to export a single model instead of two, the ColPaliProcessor needs to be changed to also return dummy pixel_values for text (maybe with an argument like return_dummy_pixel_values or so). That is actually enough for everything to work, as it makes the expected input for the ONNX model uniform.

I think the dummy values don't actually play any role when processing text, as there are no image tokens in the model input (image tokens are only concatenated for images, and images are embedded only when image tokens exist). I first tried manually concatenating the dummy pixel values to the text tokens for the text case (while properly setting the attention mask to 0 for the image tokens), and that gave the exact same results, but with the additional computational cost of 1024 unnecessary image tokens, so it's not worth it. Then I realized that what I was doing was in essence a modified ColPaliProcessor.

The trade-off of such an approach is that the ColPaliProcessor is part of the transformers library, not optimum, whereas with the export of two models everything remains more self-contained, but with double the memory consumption. That is why I asked in the other issue about any design preferences.
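For concreteness, a sketch of the processor-side approach as a hypothetical wrapper function kept outside transformers (the function name and image size are assumptions, along the lines suggested above):

    import numpy as np
    from transformers import ColPaliProcessor


    def process_queries_with_dummy_pixels(processor: ColPaliProcessor, texts, image_size=448):
        """Tokenize text queries and attach zero pixel_values so that text and
        image inputs share the single signature expected by one exported ONNX model."""
        batch = processor(text=texts, return_tensors="np")
        # The dummy values are never merged into the sequence because the text
        # inputs contain no image tokens; they only make the input uniform.
        batch["pixel_values"] = np.zeros(
            (len(texts), 3, image_size, image_size), dtype=np.float32
        )
        return batch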

I hope this helps somehow, thank you for integrating this model into the library :)

@Balladie (Contributor, Author)

@IlyasMoutawwakil I have manually tested that separating the model into three parts works (code and ONNX models here).
It's not a full reflection of the "behavior" concept, but the flow is aligned, so I'm sharing it in its current form; hope it helps.

@IlyasMoutawwakil (Member)

@kstavro no need to modify the processor logic; dummy pixel values could instead be created in the inference class (ORTModelForRetrieval, for example) if they are not passed.
We do something similar for LLMs, where we export them with past key values and create dummy ones in the first forward pass:

use_cache_branch, past_key_values, known_output_shapes = self.prepare_past_key_values(

@echarlaix added the onnx (Related to the ONNX export) label on Jun 4, 2025
@kstavro commented Jun 5, 2025

Hi @Balladie @IlyasMoutawwakil, I am just back from my vacation and I noticed that the PR is still open. Do you require any assistance from my side for the implementation? If desired and/or preferred, I could find some time on the weekend to assist with the PR.

Thanks again for taking care of this so far :)

@echarlaix merged commit 1c7012f into huggingface:main on Jun 6, 2025
30 checks passed