mtmd: Expose helper_decode_image_chunk #13366
Conversation
I think this can be made simpler: in the application code, you can handle the embedding copy as I said. This way, you can even have a C++ struct with std::vector<float>, which makes memory management much easier. The mtmd API already provides enough functions to let you do that, so I think we should not extend it further.
A struct in your app could look like this:
    // in the application code (requires #include <vector> and "mtmd.h")
    struct my_image {
        std::vector<float>  embeddings; // the encoded embeddings
        mtmd_input_chunk  * chunk;      // the chunk containing mtmd_image_tokens
    };
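For concreteness, a minimal sketch of filling that struct in application code, assuming the existing mtmd calls mtmd_input_chunk_get_tokens_image, mtmd_encode, mtmd_get_output_embd, and mtmd_image_tokens_get_n_tokens (names from memory, check mtmd.h) and an n_embd taken from the model:

    // Sketch only: copy the encoder output right after mtmd_encode(), so the
    // embeddings outlive the buffer owned by the mtmd_context.
    my_image encode_and_copy(mtmd_context * mctx, mtmd_input_chunk * chunk, int n_embd) {
        const mtmd_image_tokens * img = mtmd_input_chunk_get_tokens_image(chunk);
        mtmd_encode(mctx, img);                              // run the vision encoder
        const float * embd = mtmd_get_output_embd(mctx);     // valid until the next encode
        const size_t n_tok = mtmd_image_tokens_get_n_tokens(img);
        my_image out;
        out.chunk = chunk;                                   // caller keeps the chunk alive
        out.embeddings.assign(embd, embd + n_tok * n_embd);  // own a copy of the embeddings
        return out;
    }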
Nice, thanks! 💯 💯
Btw @mattjcly, one nice-to-have thing that I'm thinking about: currently mtmd_helper_decode_image_chunk runs non-stop, while it actually supports smaller batches under the hood. This can lead to a poor UX where the user hits the "stop" button in the UI, but mtmd_helper_decode_image_chunk still tries to decode the whole image, which may take a few extra seconds to finish.
I'm thinking about another version of mtmd_helper_decode_image_chunk (of course to be added in another PR) that supports interruptibility. Maybe we could expose the i_batch and n_batch to the public API. Do you have any other ideas?
Edit: another idea could be to add a helper that does the pre/post batch preparation, so you can call llama_decode(prepared_image_batch) in the user code; but this may still look quite cumbersome 😞
I like this - I think that 1) having a point where the decoding can be stopped between batches would be great, and 2) having a way, as a user, to get progress information during image decoding in the multi-batch case (other than just the current log) would be great.
Interesting. How would you envision this as the method of supporting interruptibility from the client side? Just trying to understand more.
The most intuitive way is to provide the application code with the notion of "a list of batches" instead of a one-do-all API call. Pseudocode would look like this:

    list_batches = mtmd_generate_decode_batches()
    for batch in list_batches:
        llama_decode(batch)

Then if you want the interruptibility:

    list_batches = mtmd_generate_decode_batches()
    for batch in list_batches:
        if check_user_interrupt():
            break  # stop the decode
        llama_decode(batch)

I'm thinking about this line, maybe this will be implemented as a cpp-only API to make it easier to manage batch allocation.
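To make that concrete, a hypothetical cpp-only shape of such a loop could look like the sketch below; the image_decode_batches struct and its preparation step do not exist in mtmd today, only llama_decode is the real API. Progress reporting also falls out naturally, since the caller sees the batch index and total count.

    // Hypothetical sketch of the "list of batches" idea (not an existing mtmd API).
    #include <functional>
    #include <vector>
    #include "llama.h"

    struct image_decode_batches {
        std::vector<llama_batch> views; // one view per n_batch image tokens
    };

    // The caller drives the loop, so it can stop cleanly between batches.
    static int decode_interruptible(llama_context * lctx,
                                    image_decode_batches & batches,
                                    const std::function<bool()> & user_interrupted) {
        for (llama_batch & batch : batches.views) {
            if (user_interrupted()) {
                return 1;  // stopped early; caller decides how to clean up the KV cache
            }
            if (llama_decode(lctx, batch) != 0) {
                return -1; // decode error
            }
        }
        return 0;          // all batches decoded
    }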
* origin/master: (39 commits)
  server : vision support via libmtmd (ggml-org#12898)
  sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
  metal : optimize MoE for large batches (ggml-org#13388)
  CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
  llama : do not crash if there is no CPU backend (ggml-org#13395)
  CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
  imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
  llama-run: add support for downloading models from ModelScope (ggml-org#13370)
  mtmd : fix batch_view for m-rope (ggml-org#13397)
  llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
  rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
  vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
  server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
  ci : limit write permission to only the release step + fixes (ggml-org#13392)
  mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
  server : (webui) fix a very small misalignment (ggml-org#13387)
  server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
  convert : support rope_scaling type and rope_type (ggml-org#13349)
  mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
  context : allow cache-less context for embeddings (ggml-org#13108)
  ...
New API

Decoding-only helper

mtmd_helper_decode_image_chunk: Split out from mtmd_helper_eval_chunk_single. Same logic as before, but having it as a standalone function enables clients to run mtmd_encode at some prior time, cache those embeddings, and then send them into mtmd_helper_decode_image_chunk later to decode the embeddings without having to re-encode the image (expensive). A rough usage sketch is below.
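As a rough usage sketch (the exact parameter list of mtmd_helper_decode_image_chunk here is from memory and may differ slightly; check mtmd.h), decoding from cached embeddings could look like this:

    // Sketch only: decode an image chunk from embeddings cached by an earlier
    // mtmd_encode(), without re-running the vision encoder.
    #include <vector>
    #include "mtmd.h"

    static int32_t decode_cached_image(mtmd_context * mctx, llama_context * lctx,
                                       const mtmd_input_chunk * chunk,
                                       std::vector<float> & cached_embd, // from an earlier encode
                                       llama_pos & n_past, llama_seq_id seq_id, int32_t n_batch) {
        llama_pos new_n_past = n_past;
        int32_t ret = mtmd_helper_decode_image_chunk(
            mctx, lctx, chunk,
            cached_embd.data(),       // pre-computed embeddings, no re-encode needed
            n_past, seq_id, n_batch, &new_n_past);
        if (ret == 0) {
            n_past = new_n_past;      // advance past the image tokens
        }
        return ret;
    }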
Edit: removed the APIs below, which were in the original PR.

Output embedding copy

mtmd_get_output_embd_copy: Allows the client to embed with mtmd_encode, then get a copy of the embeddings to hold onto past the lifetime of the embeddings within the mtmd_context. Useful for caching these embeddings and sending them into mtmd_helper_decode_image later.

mtmd_image_tokens management functions

mtmd_image_tokens_copy: Allows clients to get a copy of the mtmd_image_tokens from an mtmd_input_chunk, for later use alongside pre-computed embeddings with mtmd_helper_decode_image.

mtmd_image_tokens_free: Frees an mtmd_image_tokens *, as received from mtmd_image_tokens_copy.

image_tokens_ptr (made public; existed privately in mtmd.cpp before): Enables automatic memory management of mtmd_image_tokens *.
@ngxson I'm thinking that maybe there's a way to avoid the need to expose a new API for mtmd_image_tokens, since I feel like the rationale "for later use to send alongside pre-computed embeddings" for mtmd_image_tokens_copy could potentially be weak, and the API of mtmd_helper_decode_image could be reworked to not need this object in full? But it also seemed like the simplest conversion to enable decoupled embedding + decoding.