server : vision support via libmtmd #12898
Conversation
@qnixsynapse I had a problem with my logic which made it discard a text batch that comes before an image batch. It should be fixed now, could you give it a try?
Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized there are 2 refactorings that can be done in their own dedicated PRs.

I also have a question regarding the logic around this. Edit: optionally one more refactoring, we should split this as well.
@ngxson Nvm, ended up using your fork. On further testing, it seems that llama_batch_size is sometimes exceeded in successive requests.
This was useful mainly before the defragmentation support was added. The reason is that with time the KV cache can become highly fragmented, and even if it has enough total capacity, a contiguous slot for the batch might not be available.

I'll think about the input chunk question today and let you know if I have any thoughts.
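To illustrate why fragmentation matters: the cache can have enough free cells in total while still lacking a single contiguous run large enough for the batch. Below is a minimal sketch of that check, with illustrative names only (this is not llama.cpp's actual KV cache code):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative only - not llama.cpp's actual KV cache code.
// The cache is modelled as a flat array of cells; a batch of n_tokens needs a
// *contiguous* run of free cells, not just n_tokens free cells in total.
static int64_t find_contiguous_slot(const std::vector<bool> & occupied, int64_t n_tokens) {
    int64_t run_start = 0;
    int64_t run_len   = 0;
    for (int64_t i = 0; i < (int64_t) occupied.size(); ++i) {
        if (!occupied[i]) {
            if (run_len == 0) {
                run_start = i;
            }
            if (++run_len == n_tokens) {
                return run_start; // found a contiguous slot
            }
        } else {
            run_len = 0;
        }
    }
    return -1; // enough free cells may exist in total, but not contiguously
}

int main() {
    // 4 free cells in total, but the longest contiguous run is only 2 cells:
    std::vector<bool> cache = {false, false, true, false, true, false};
    printf("slot for 3 tokens: %lld\n", (long long) find_contiguous_slot(cache, 3)); // prints -1
    return 0;
}
```

Defragmentation compacts the occupied cells so that such a contiguous run becomes available again.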
Maybe a stupid question (sorry!). Edit: ok, I see the condition of mmproj being empty, so I guess it's ok.
Mistral Small does not come with a default chat template, so make sure to specify one; otherwise it will fall back to the default ChatML template, which is incorrect. Edit: a default template will be added after #13398
I experimented with this PR a bit, mostly using gemma3 12b. Excited to see this land. Thanks :)
I have mostly tested the existing text-only chat and infill functionality.
Co-authored-by: Georgi Gerganov <[email protected]>
One more thing before finishing this: I need to update the docs
I updated the mtmd docs to state that llama-server is supported, and mentioned it in the llama-server docs as well. Also added to hot topics. Merging this once the CI passes 🔥 Here is the list of follow-up PRs that I'll work on:
Unless there is already such a feature (I didn't notice it), it probably also needs a way to limit the maximum image resolution (i.e. resizing input images if necessary) and to size the vision model's resources accordingly (I remember from a different PR that it currently requires as much memory as the worst-case scenario). I'm currently unable to process any image with Mistral Small 2503 without it. Additionally, from https://docs.mistral.ai/capabilities/vision/:

Worth noting that the maximum token consumption of a single image with Pixtral-12B is 1024×1024/256 = 4096 tokens, while for Mistral Small it's 1540×1540/784 = 3025 tokens.
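For reference, the figures quoted above follow from dividing the image area by a per-token patch area. A small sketch of that arithmetic, assuming 16×16-pixel patches for Pixtral-12B and 28×28-pixel patches for Mistral Small (patch sizes inferred from the quoted numbers, not taken from the model specs):

```cpp
#include <cstdio>

// Illustrative: max tokens per image = (width * height) / patch_area,
// with patch areas inferred from the figures quoted above.
int main() {
    const int pixtral = (1024 * 1024) / 256; // 16x16-pixel patches -> 4096 tokens
    const int mistral = (1540 * 1540) / 784; // 28x28-pixel patches -> 3025 tokens
    printf("Pixtral-12B max image tokens: %d\n", pixtral);
    printf("Mistral Small max image tokens: %d\n", mistral);
    return 0;
}
```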
* origin/master: (39 commits)
  server : vision support via libmtmd (ggml-org#12898)
  sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
  metal : optimize MoE for large batches (ggml-org#13388)
  CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
  llama : do not crash if there is no CPU backend (ggml-org#13395)
  CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
  imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
  llama-run: add support for downloading models from ModelScope (ggml-org#13370)
  mtmd : fix batch_view for m-rope (ggml-org#13397)
  llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
  rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
  vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
  server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
  ci : limit write permission to only the release step + fixes (ggml-org#13392)
  mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
  server : (webui) fix a very small misalignment (ggml-org#13387)
  server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
  convert : support rope_scaling type and rope_type (ggml-org#13349)
  mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
  context : allow cache-less context for embeddings (ggml-org#13108)
  ...
Cont #12849
This is my first trial to bring `libmtmd` to `server.cpp`.

How to use this?

See this documentation

The web UI now also supports uploading images (see demo below)
Implementation

The main implementation is built around `mtmd.h` and `struct server_tokens` in `utils.hpp` (a conceptual sketch follows the TODO list below).

The idea is that:
- `struct server_tokens` will make most of the existing text-token logic work without too many changes
- `struct server_tokens` can optionally contain image tokens (managed by `libmtmd`)
- `update_slots()` calls `libmtmd` as soon as it sees an image token

TODOs
- `image_url` in addition to `base64`
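To make the idea above more concrete, here is a conceptual sketch of such a mixed token sequence. The names and layout are hypothetical; the real `struct server_tokens` lives in `utils.hpp`, wraps `libmtmd`'s chunk types, and differs in detail:

```cpp
// Conceptual sketch only - not the actual definitions from utils.hpp.
#include <cstdint>
#include <memory>
#include <vector>

struct image_chunk { /* stand-in for an mtmd-managed image chunk */ };
using image_chunk_ptr = std::shared_ptr<image_chunk>;

struct server_token {
    int32_t         text = -1;   // regular text token id, or -1 when this position holds an image
    image_chunk_ptr image;       // set when the position holds an image chunk
    bool is_image() const { return image != nullptr; }
};

struct server_tokens {
    std::vector<server_token> tokens; // mixed sequence of text tokens and image chunks

    size_t size() const { return tokens.size(); }

    void push_text(int32_t tok)            { tokens.push_back({tok, nullptr}); }
    void push_image(image_chunk_ptr chunk) { tokens.push_back({-1, std::move(chunk)}); }

    // Existing text-only code can keep iterating over positions as before;
    // update_slots() checks is_image() and hands the chunk to libmtmd for decoding.
};
```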
Demo (webui)

The server can be run with this command:

Then access the web UI (default at `localhost:8080`):

Demo (API)

Click to see