Conversation

larryliu0820 (Contributor) commented Sep 12, 2025

This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.

Python Bindings Implementation:

  • Added a new high-level Python API in __init__.py for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing.
  • Implemented robust error handling: if the C++ extension is not built, placeholder classes and functions raise informative exceptions, guiding users to rebuild with Python bindings enabled (see the sketch below).
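
A minimal sketch of this fallback pattern, using the _llm_runner extension module name and EXECUTORCH_BUILD_PYBIND flag from the build integration below; the exact message and class set are illustrative:

try:
    # Prefer the compiled C++ extension when it is available.
    from executorch.extension.llm.runner._llm_runner import (  # noqa: F401
        GenerationConfig,
        MultimodalRunner,
    )
except ImportError:
    class MultimodalRunner:
        """Placeholder that fails loudly when the extension is missing."""

        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "ExecuTorch LLM runner pybindings are not built. "
                "Rebuild with EXECUTORCH_BUILD_PYBIND enabled."
            )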

Build System Integration:

  • Updated CMakeLists.txt to add a pybind11-based Python extension module (_llm_runner) when EXECUTORCH_BUILD_PYBIND is set, linking all necessary dependencies and setting up include paths.

Documentation and Planning:

  • Added a Python API section to README.md.

Utility and Extensibility:

  • Exposed utility functions (load_image_from_file, preprocess_image, create_generation_config) for easier input preprocessing and configuration from Python; see the usage sketch below.
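
A hedged usage sketch of these helpers; only the function names above are from this PR, and the keyword arguments are assumptions:

from executorch.extension.llm.runner import (
    create_generation_config,
    load_image_from_file,
    preprocess_image,
)

image = load_image_from_file("view.jpg")  # decode an image from disk
image = preprocess_image(image)  # resizing/normalization options, if any, are assumptions
config = create_generation_config(max_new_tokens=100)  # kwarg mirrors GenerationConfig.max_new_tokens (an assumption)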

Testing and Examples (Planned):

  • Added test_runner_pybindings.py.

Example usage:

from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_image_input,
    make_text_input,
)
from transformers import AutoProcessor

# Use the HF processor only to preprocess the image into pixel values.
model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)

image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Interleave raw prompt text with the preprocessed image tensor.
inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]

# Construct the runner from the exported .pte model and its tokenizer.
runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)

Output from console:

[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>


You'
[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.

**1. Weather & Terrain – Expanded:**

*   **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
*   **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] 	Prompt Tokens: 669    Generated Tokens: 99
[stats.h:114] 	Model Load Time:		2.297000 (seconds)
[stats.h:124] 	Total inference time:		20.401000 (seconds)		 Rate: 	4.852703 (tokens/second)
[stats.h:132] 		Prompt evaluation:	12.661000 (seconds)		 Rate: 	52.839428 (tokens/second)
[stats.h:143] 		Generated 99 tokens:	7.740000 (seconds)		 Rate: 	12.790698 (tokens/second)
[stats.h:151] 	Time to first generated token:	12.661000 (seconds)
[stats.h:158] 	Sampling time over 768 tokens:	0.117000 (seconds)
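
The runner also supports streaming generation via a callback, as noted in the description above. A minimal sketch, assuming generate accepts a per-token callback; the keyword name token_callback is hypothetical:

def on_token(token: str) -> None:
    # Print each token as it arrives instead of waiting for the full output.
    print(token, end="", flush=True)

runner.generate(inputs_combined, config, token_callback=on_token)  # hypothetical kwarg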

cc @mergennachin @cccclai @helunwencser @jackzhxng

pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14285

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 1 Unrelated Failure

As of commit 7f111bc with merge base d43cde5:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Sep 12, 2025
larryliu0820 added the module: llm and release notes: llm labels on Sep 13, 2025
larryliu0820 changed the title from "Add pybindings for LLM runners" to "Add pybindings for multimodal LLM runners" on Sep 15, 2025
larryliu0820 changed the title from "Add pybindings for multimodal LLM runners" to "Add pybindings for multimodal LLM runner" on Sep 15, 2025

from executorch.extension.llm.runner._llm_runner import GenerationConfig # noqa: F401


def load_image_from_file(
Contributor:

Should these methods in utils.py be prefixed with _? Otherwise, it looks like an API we would support in the long term.

Contributor:

If you want to keep it, I'd suggest this location:

extension/vision/preprocessing.py, which can be used for general CV tasks. We already have extension/audio for audio preprocessing.

larryliu0820 (author):

Will update.

return image


def create_generation_config(
Contributor:

For this method (as well as estimate_tokens and format_stats), the current extension/llm/runner/utils.py location looks like a good choice.

ValueError: If the image format is not supported
FileNotFoundError: If the image file doesn't exist
"""
if isinstance(image, (str, Path)):
Contributor:

Shouldn't you use the CV preprocessing utils function?

larryliu0820 (author):

Yeah, let me fix. Recent updates made sure it works with Gemma3, exported using optimum-et.
