Fix Pan and Scan on batched images Gemma3 #36864
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the "Ready for review" button (at the bottom of the PR page).
Hm, interesting, since we had a test for PaS in image-processors which was green. Or was the error in processing code?
```python
# Add the thumbnails to the image patches
stacked_images = [stacked_images] + pas_images
# Group images by size for batched resizing (this will typically group thumbnails together and cropped patches together)
processed_image_patches_grouped = {}
grouped_image_patches, grouped_image_patches_index = group_images_by_shape(stacked_images)
for shape, stacked_image_patches in grouped_image_patches.items():
    stacked_image_patches = self.resize(
        image=stacked_image_patches,
        size=size,
        interpolation=interpolation,
    )
    processed_image_patches_grouped[shape] = stacked_image_patches
processed_image_patches = reorder_images(processed_image_patches_grouped, grouped_image_patches_index)
# Transpose to have the thumbnails with their corresponding patches
stacked_images = torch.stack(processed_image_patches, dim=0).transpose(0, 1).contiguous()
```
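For readers less familiar with these helpers, here is a minimal sketch of the group/reorder round-trip they perform. This is a toy reimplementation for illustration only, not the actual transformers code:

```python
import torch

def group_images_by_shape(images: list) -> tuple[dict, list]:
    # Bucket images by shape so each bucket can be stacked and processed
    # as a single batched tensor (one kernel launch per distinct shape).
    grouped, index = {}, []
    for image in images:
        shape = tuple(image.shape)
        grouped.setdefault(shape, []).append(image)
        index.append((shape, len(grouped[shape]) - 1))
    stacked = {shape: torch.stack(imgs) for shape, imgs in grouped.items()}
    return stacked, index

def reorder_images(processed: dict, index: list) -> list:
    # Undo the grouping: restore the original ordering after the per-shape
    # batched processing is done.
    return [processed[shape][pos] for shape, pos in index]

# Round-trip check with two distinct image sizes:
images = [torch.rand(3, 224, 224), torch.rand(3, 448, 448), torch.rand(3, 224, 224)]
grouped, index = group_images_by_shape(images)
restored = reorder_images(grouped, index)
assert all(torch.equal(a, b) for a, b in zip(images, restored))
```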
not related to this PR particularly. Seeing a second grouping by size while we are already using a grouped image batch leads me to believe the batching logic in fast processors is over-complicated. Would be nice if we could simplify things, especially for community contributors when they add a new model with new processing like PaS
Yes, this doesn't look great, agreed. This is quite a specific case where we start with images of potentially different sizes, then split them into patches and concatenate the patches with the original image (which is bigger than the patches), before resizing everything to the same size 😅.
This new code is a bit overkill, as we group the patches and images by size at every step. It's not really necessary to have that in a first implementation by external contributors, so hopefully they won't ever have to do this to get a working fast image processor.
Not sure if there's a simpler way to fully use batch processing, I guess it's case by case
yeah, no rush to work on that. I don't think we have been forcing users to add only fast processors for now. We can come back to this question later. Maybe we'll find a better way to batch, or we can add some guides on how to add special processing in fast image processors
➕ on spending time to write simpler code! We can find some magic here and there.
In this specific case, I don't necessarily know. But sometimes padding then unpadding can lead to better performance (see the sketch after this list):
- Ordering can be costly
- Depending on the size of the padding, we are not really adding too much compute
- The distribution of image sizes is important to take into account!
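For illustration, a minimal sketch of that pad-then-unpad idea, assuming a shape-preserving op (normalization, rescaling, ...); the helper name is made up, not part of the transformers API:

```python
import torch
import torch.nn.functional as F

def process_with_padding(images: list, op) -> list:
    # Pad every image to the largest (H, W) in the batch so a single batched
    # op can run, then crop each result back to its original size.
    # Only valid when `op` preserves the spatial dimensions.
    max_h = max(img.shape[-2] for img in images)
    max_w = max(img.shape[-1] for img in images)
    padded = torch.stack(
        [F.pad(img, (0, max_w - img.shape[-1], 0, max_h - img.shape[-2])) for img in images]
    )
    out = op(padded)  # one kernel launch instead of one per size group
    return [out[i, ..., : img.shape[-2], : img.shape[-1]] for i, img in enumerate(images)]

# Example: normalize a ragged batch in one shot.
images = [torch.rand(3, 224, 224), torch.rand(3, 448, 336)]
normalized = process_with_padding(images, lambda x: (x - 0.5) / 0.5)
```

Whether this beats grouping by shape depends on the points above, mostly the image-size distribution.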
I don't think PaS was tested with batched inputs, was it? Edit: I see it was, but the problem is really with num_crops: for example, if you have image inputs like [[image1, image2], [image3]], the num_crops returned will be [[2, 2], [2]], which will crash when trying to return pt tensors.
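A small repro of that failure mode (illustrative values, matching the example above):

```python
import torch

# num_crops as previously returned for inputs [[image1, image2], [image3]]:
nested_num_crops = [[2, 2], [2]]
try:
    torch.tensor(nested_num_crops)
except ValueError as e:
    print("crashes:", e)  # ragged nesting cannot form a rectangular tensor

# With one entry per flattened image, conversion is fine:
flat_num_crops = [2, 2, 2]
print(torch.tensor(flat_num_crops))
```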
Ah right, we didn't test different number of images per batch in PaS. As long as the gemma3 tests are green, lgtm. Thanks for fixing and the new test!
```python
encoding_fast = image_processor_fast(dummy_images, return_tensors="pt")

torch.testing.assert_close(encoding_slow.num_crops, encoding_fast.num_crops)
self.assertTrue(torch.allclose(encoding_slow.pixel_values, encoding_fast.pixel_values, atol=1e-1))
```
we can use `torch.testing.assert_close` here also, AFAIR that works better for tensor match tests
I'll try to see if I can make it work, but I've had some issues with `torch.testing.assert_close` when comparing pixel values, because the relative differences can be very high and it's difficult to choose an rtol value that will work for all processors (the base tests comparing slow and fast also don't use `torch.testing.assert_close` to compare pixel values).
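To illustrate the issue: with near-zero pixel values, a tiny absolute difference becomes a huge relative one, so a pure-atol check passes while assert_close with a tight rtol fails (made-up values):

```python
import torch

slow = torch.tensor([1e-6, 0.5, 1.0])
fast = torch.tensor([2e-6, 0.5, 1.0])  # off by 1e-6 absolute, but 100% relative

# Passes: every element is within the absolute tolerance.
assert torch.allclose(slow, fast, atol=1e-1)

# Fails: the first element's relative error is 1.0, far above any sane rtol.
try:
    torch.testing.assert_close(slow, fast, rtol=1e-5, atol=0.0)
except AssertionError:
    print("assert_close fails on near-zero values")
```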
Thanks, as half of the changes make the code simpler! (flat list of images vs nested!)
> ➕ on spending time to write simpler code! We can find some magic here and there.
> In this specific case, I don't necessarily know. But sometimes padding then unpadding can lead to better performance:
> - Ordering can be costly
> - Depending on the size of the padding, we are not really adding too much compute
> - The distribution of image sizes is important to take into account!
True! In general, I think now that we have several techniques for fast processing, I need to make a new benchmark to compare them for each model (batched vs unbatched, padded vs unpadded, different techniques for splitting images into patches, etc.).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: gemma3
This comment contains run-slow, running the specified jobs: models: ['models/gemma3']
@zucchini-nlp CI won't run the integration tests because of `require_read_token`
It worked, we had already requested access to gemma-3. All green, cool
Ah great, merging then |
* process flattened images in fast image proc
* process flattened images in slow proc and add tests
* remove print
* add unbalanced batch test pas image proc
* fix integration tests
What does this PR do?
Currently, inputs such as this one: `[[image1, image2], [image3]]` (a batch with an unbalanced number of images per sample)
will crash with both the slow and the fast image processor.
Non-batched inputs with pan and scan will also fail with fast image processors.
This PR fixes both issues and simplifies the image processing by processing flattened images instead of nested ones.
This PR also introduces some changes to take better advantage of batch processing in the fast image processor, with a batched `pan_and_scan` method.
It also adds some tests.
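A usage sketch of the case this fixes; the checkpoint name and the `do_pan_and_scan` kwarg are assumptions based on the Gemma 3 processors, so treat them as illustrative:

```python
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/gemma-3-4b-it", use_fast=True)

def random_image(h, w):
    return Image.fromarray(np.random.randint(0, 256, (h, w, 3), dtype=np.uint8))

image1, image2, image3 = random_image(300, 400), random_image(500, 200), random_image(400, 400)

# Unbalanced batch (two images for the first sample, one for the second) with
# pan and scan enabled: this previously crashed when building the num_crops tensor.
inputs = processor([[image1, image2], [image3]], do_pan_and_scan=True, return_tensors="pt")
print(inputs.pixel_values.shape, inputs.num_crops)
```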
Cc @zucchini-nlp @RyanMullins