mtmd: Expose helper_decode_image_chunk #13366


Merged: 3 commits merged into ggml-org:master on May 8, 2025

Conversation

@mattjcly (Contributor) commented May 7, 2025

New API

Decoding-only helper

  • mtmd_helper_decode_image_chunk: split out from mtmd_helper_eval_chunk_single. The logic is unchanged, but exposing it as a standalone function lets clients call mtmd_encode ahead of time, cache the resulting embeddings, and later pass them to mtmd_helper_decode_image_chunk for decoding without having to re-encode the image (which is expensive).
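
For context, here is a minimal usage sketch of the new helper (not code from this PR). It assumes the helper keeps the same parameter shape as mtmd_helper_eval_chunk_single (mtmd context, llama context, chunk, position/sequence/batch arguments) plus a pointer to the pre-computed embeddings; the parameter names below are assumptions, see mtmd.h for the actual declaration:

// Hypothetical call site; parameter names are assumptions, not the exact header.
llama_pos new_n_past = 0;
int32_t ret = mtmd_helper_decode_image_chunk(
    ctx,                 // mtmd_context holding the vision/projector state
    lctx,                // llama_context used for decoding
    chunk,               // the image chunk (carries the mtmd_image_tokens)
    cached_embd.data(),  // embeddings produced earlier by mtmd_encode and cached client-side
    n_past, seq_id, n_batch,
    &new_n_past);        // position after the image tokens have been decoded
if (ret != 0) {
    // handle decode failure
}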

Edit: the APIs below were in the original PR but have since been removed.

Output embedding copy

  • mtmd_get_output_embd_copy: allows the client to encode with mtmd_encode, then obtain a copy of the embeddings that outlives the embeddings held inside the mtmd_context. Useful for caching them and passing them into mtmd_helper_decode_image later.

mtmd_image_tokens management functions

  • mtmd_image_tokens_copy: allows clients to obtain a copy of the mtmd_image_tokens from an mtmd_input_chunk, for later use alongside pre-computed embeddings passed to mtmd_helper_decode_image.
  • mtmd_image_tokens_free: frees an mtmd_image_tokens *, such as one received from mtmd_image_tokens_copy.
  • image_tokens_ptr (made public; previously existed privately in mtmd.cpp): enables automatic memory management of mtmd_image_tokens *.

@ngxson I'm thinking there may be a way to avoid exposing a new API for mtmd_image_tokens: the rationale "for later use to send alongside pre-computed embeddings" behind mtmd_image_tokens_copy feels potentially weak, and the API of mtmd_helper_decode_image could perhaps be reworked so it doesn't need this object at all. But this also seemed like the simplest change to enable decoupled embedding + decoding.

@ngxson (Collaborator) left a comment
I think this can be made simpler: in the application code, you can handle the embedding copy as I said. This way, you can even have a C++ struct with a std::vector<float>, which makes memory management much easier. The mtmd API already provides enough functions to do that, so I think we should not extend it further.

A struct in your app could look like this:

struct my_image {
  std::vector<float> embeddings; // the encoded embeddings
  mtmd_input_chunk * chunk; // the chunk containing mtmd_image_tokens
};
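
For illustration, a hedged sketch of populating such a struct right after encoding. It assumes mtmd_get_output_embd returns a pointer owned by the mtmd_context that stays valid only until the next encode, and that the buffer holds n_image_tokens * n_embd floats; the surrounding accessors and the size calculation are assumptions, not verbatim mtmd API:

// Sketch only: the size calculation and the accessors around it are assumptions.
my_image img;
img.chunk = /* image chunk obtained from mtmd_tokenize */;

mtmd_encode(ctx, /* image tokens from img.chunk */);   // expensive encode, done once
const float * embd    = mtmd_get_output_embd(ctx);     // owned by ctx, valid until the next encode
const size_t  n_float = n_image_tokens * n_embd;       // tokens in the chunk x model embedding dim
img.embeddings.assign(embd, embd + n_float);           // copy out so it outlives ctx's buffer

// later: pass img.embeddings.data() and img.chunk to mtmd_helper_decode_image_chunk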

@mattjcly mattjcly requested a review from ngxson May 8, 2025 16:47
@mattjcly mattjcly changed the title from "mtmd: Expose helper_decode_image, output_embd_copy, image_tokens_copy/free" to "mtmd: Expose helper_decode_image_chunk" May 8, 2025
@mattjcly mattjcly requested a review from ngxson May 8, 2025 17:17
@ngxson (Collaborator) left a comment
Nice, thanks! 💯 💯

Btw @mattjcly, one nice-to-have thing I'm thinking about: currently mtmd_helper_decode_image_chunk runs non-stop, even though it actually supports smaller batches under the hood.

This can lead to poor UX: the user hits the "stop" button in the UI, but mtmd_helper_decode_image_chunk still tries to decode the whole image, which may take some extra seconds to finish.

I'm thinking about another version of mtmd_helper_decode_image_chunk (of course, to be added in another PR) that supports interruptibility. I'm thinking about maybe exposing i_batch and n_batch in the public API. Do you have any other ideas?


Edit: another idea could be to add a helper that does the pre/post batch preparation, so you can call llama_decode(prepared_image_batch) in user code; but this may still look quite cumbersome 😞

@ngxson ngxson merged commit f05a6d7 into ggml-org:master May 8, 2025
44 checks passed
@mattjcly mattjcly deleted the mtmd-api-extension branch May 8, 2025 18:37
@mattjcly (Contributor, Author) commented May 8, 2025

I'm thinking about another version of mtmd_helper_decode_image_chunk (of course, to be added in another PR) that supports interruptibility. I'm thinking about maybe exposing i_batch and n_batch in the public API. Do you have any other ideas?

I like this. I think that 1) having a point where decoding can be stopped between batches would be great, and 2) having a way, as a user, to get progress information during image decoding in the multi-batch case (other than just the current log) would also be great.

maybe exposing i_batch and n_batch in the public API

Interesting. How would you envision this working as the mechanism for client-side interruptibility? Just trying to understand more.

@ngxson (Collaborator) commented May 9, 2025

Interesting. How would you envision this working as the mechanism for client-side interruptibility? Just trying to understand more.

The most intuitive way is to provide application code with the notion of "a list of batches" instead of a single do-it-all API call. In pseudocode it looks like this:

list_batches = mtmd_generate_decode_batches()
for batch in list_batches:
  llama_decode(batch)

Then if you want the interrupt-ability:

list_batches = mtmd_generate_decode_batches()
for batch in list_batches:
  if check_user_interrupt():
    break  # stop the decode
  llama_decode(batch)

Along this line, I'm thinking maybe this will be implemented as a C++-only API to make it easier to manage batch allocation.
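
A rough C++ shape of that idea, purely hypothetical (mtmd_prepare_decode_batches and the batch container are invented names for illustration; llama_decode is the existing llama.cpp call):

// Hypothetical API sketch only; nothing like mtmd_prepare_decode_batches exists yet.
auto batches = mtmd_prepare_decode_batches(ctx, lctx, chunk, embd, n_batch); // pre-split into n_batch-sized pieces
for (llama_batch & batch : batches) {
    if (check_user_interrupt()) {   // e.g. the user pressed "stop" in the UI
        break;                      // abandon the remaining image batches
    }
    if (llama_decode(lctx, batch) != 0) {
        // handle decode failure
    }
}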

@ngxson ngxson mentioned this pull request May 9, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...