Conversation

larryliu0820 (Contributor) commented Sep 12, 2025

This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.

Python Bindings Implementation:

  • Added a new high-level Python API in __init__.py for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing.
  • Implemented robust error handling: if the C++ extension is not built, placeholder classes and functions raise informative exceptions, guiding users to rebuild with Python bindings enabled (see the sketch below).
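
A minimal sketch of this fallback pattern, using the _llm_runner extension module name and EXECUTORCH_BUILD_PYBIND flag from the build integration below; the exact message and class set are illustrative:

try:
    # Prefer the compiled C++ extension when it is available.
    from executorch.extension.llm.runner._llm_runner import (  # noqa: F401
        GenerationConfig,
        MultimodalRunner,
    )
except ImportError:
    class MultimodalRunner:
        """Placeholder that fails loudly when the extension is missing."""

        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "ExecuTorch LLM runner pybindings are not built. "
                "Rebuild with EXECUTORCH_BUILD_PYBIND enabled."
            )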

Build System Integration:

  • Updated CMakeLists.txt to add a pybind11-based Python extension module (_llm_runner) when EXECUTORCH_BUILD_PYBIND is set, linking all necessary dependencies and setting up include paths.

Documentation and Planning:

  • Added a Python API section to README.md.

Utility and Extensibility:

  • Exposed utility functions (load_image_from_file, preprocess_image, create_generation_config) for easier input preprocessing and configuration from Python; see the usage sketch below.
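
A hedged usage sketch of these helpers; only the function names above are from this PR, and the keyword arguments are assumptions:

from executorch.extension.llm.runner import (
    create_generation_config,
    load_image_from_file,
    preprocess_image,
)

image = load_image_from_file("view.jpg")  # decode an image from disk
image = preprocess_image(image)  # resizing/normalization options, if any, are assumptions
config = create_generation_config(max_new_tokens=100)  # kwarg mirrors GenerationConfig.max_new_tokens (an assumption)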

Testing and Examples (Planned):

  • Added test_runner_pybindings.py.

Example usage:

from executorch.extension.llm.runner import (
    GenerationConfig,
    MultimodalRunner,
    make_image_input,
    make_text_input,
)
from transformers import AutoProcessor

# Use the HF processor only to preprocess the image into pixel values.
model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)

image_url = "https://llava-vl.github.io/static/images/view.jpg"
conversation = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {
                "type": "text",
                "text": "What are the things I should be cautious about when I visit here?",
            },
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Interleave raw prompt text with the preprocessed image tensor.
inputs_combined = [
    make_text_input("<bos><start_of_turn>user\nYou are a helpful assistant.\n\n"),
    make_image_input(inputs["pixel_values"]),
    make_text_input("What are the things I should be cautious about when I visit here?<end_of_turn>\n"),
]

# Construct the runner from the exported .pte model and its tokenizer.
runner = MultimodalRunner(
    "/Volumes/larryliu/work/optimum-executorch/model/model.pte",
    "/Volumes/larryliu/work/optimum-executorch/model/tokenizer.model",
    None,
)
config = GenerationConfig()
config.max_new_tokens = 100
runner.generate(inputs_combined, config)

Output from console:

[multimodal_runner.cpp:88] RSS after loading model: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:109] Prefilling input 0/3, type: text
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 1/3, type: image
[multimodal_prefiller.cpp:87] Image tensor dim: 4, dtype: Float
[util.h:125] second_input_sizes[0] = 1023
[multimodal_runner.cpp:109] Prefilling input 2/3, type: text
[util.h:125] second_input_sizes[0] = 1023
What are the things I should be cautious about when I visit here?<end_of_turn>


You'
[multimodal_runner.cpp:127] RSS after multimodal input processing: 0.000000 MiB (0 if unsupported)
[multimodal_runner.cpp:139] Max new tokens resolved: 100, pos_ 669, max_context_len 2048
re absolutely right to focus on the weather – it's the key factor here! Let’s delve deeper into what you should be cautious about when visiting this location, and how to prepare.

**1. Weather & Terrain – Expanded:**

*   **Snow & Ice:** As we discussed, there’s a significant risk of heavy snowfall and ice formation. This can make trails treacherous, and create hazardous conditions on the pier itself.
*   **Terrain Stability:** The
PyTorchObserver {"prompt_tokens":669,"generated_tokens":99,"model_load_start_ms":1758178599491,"model_load_end_ms":1758178601788,"inference_start_ms":1758178629348,"inference_end_ms":1758178649749,"prompt_eval_end_ms":1758178642009,"first_token_ms":1758178642009,"aggregate_sampling_time_ms":117,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
[stats.h:108] 	Prompt Tokens: 669    Generated Tokens: 99
[stats.h:114] 	Model Load Time:		2.297000 (seconds)
[stats.h:124] 	Total inference time:		20.401000 (seconds)		 Rate: 	4.852703 (tokens/second)
[stats.h:132] 		Prompt evaluation:	12.661000 (seconds)		 Rate: 	52.839428 (tokens/second)
[stats.h:143] 		Generated 99 tokens:	7.740000 (seconds)		 Rate: 	12.790698 (tokens/second)
[stats.h:151] 	Time to first generated token:	12.661000 (seconds)
[stats.h:158] 	Sampling time over 768 tokens:	0.117000 (seconds)
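
The runner also supports streaming generation via a callback, as noted in the description above. A minimal sketch, assuming generate accepts a per-token callback; the keyword name token_callback is hypothetical:

def on_token(token: str) -> None:
    # Print each token as it arrives instead of waiting for the full output.
    print(token, end="", flush=True)

runner.generate(inputs_combined, config, token_callback=on_token)  # hypothetical kwarg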

cc @mergennachin @cccclai @helunwencser @jackzhxng

pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14285

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 1 Unrelated Failure

As of commit 7f111bc with merge base d43cde5:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label on Sep 12, 2025
larryliu0820 added the module: llm and release notes: llm labels on Sep 13, 2025
larryliu0820 changed the title from "Add pybindings for LLM runners" to "Add pybindings for multimodal LLM runners" on Sep 15, 2025
larryliu0820 changed the title from "Add pybindings for multimodal LLM runners" to "Add pybindings for multimodal LLM runner" on Sep 15, 2025

from executorch.extension.llm.runner._llm_runner import GenerationConfig # noqa: F401


def load_image_from_file(
Contributor:

Should these methods in utils.py be prefixed with _? Otherwise, it looks like an API we would support in the long term.

Contributor:

If you want to keep it, I'd suggest this location:

extension/vision/preprocessing.py, which can be used for general CV tasks. We already have extension/audio for audio preprocessing.

larryliu0820 (author):

Will update.

return image


def create_generation_config(
Contributor:

For this method (as well as estimate_tokens and format_stats), the current extension/llm/runner/utils.py location looks like a good choice.

ValueError: If the image format is not supported
FileNotFoundError: If the image file doesn't exist
"""
if isinstance(image, (str, Path)):
Contributor:

Shouldn't you use the CV preprocessing utils function?

larryliu0820 (author):

Yeah, let me fix. Recent updates made sure it works with Gemma3, exported using optimum-et.
