
server : vision support via libmtmd #12898


Merged: ngxson merged 79 commits into ggml-org:master on May 9, 2025

Conversation

@ngxson (Collaborator) commented Apr 11, 2025

Continuation of #12849

This is my first attempt at bringing libmtmd to server.cpp.

How to use this?

See this documentation

The web UI now also supports uploading images (see the demo below)

Implementation

The main implementation is built around mtmd.h and struct server_tokens in utils.hpp:

The idea is that:

  • struct server_tokens lets most of the existing text-token logic work without too many changes
  • struct server_tokens can optionally contain image tokens (managed by libmtmd)
  • update_slots() calls libmtmd as soon as it sees an image token (a small illustrative sketch follows the snippet below)
    // map a **start** position in tokens to the image chunk
    std::unordered_map<llama_pos, mtmd::input_chunk_ptr> map_pos_to_image;

    // list of tokens
    // it can include LLAMA_TOKEN_NULL, which is used to indicate a token that is not a text token
    // a mtmd_input_chunk can occupy multiple tokens, one llama_token per **position**
    // important: for models using mrope, an image can contain multiple tokens but will use only one **position**
    llama_tokens tokens;

    // for ex. with input of 5 text tokens and 2 images:
    //      [0] [1] [2] [3] [4] [img0] [img0] [img0] [img1] [img1]
    // pos  0   1   2   3   4   5      6      7      8      9
    // map_pos_to_image will contain: {5, img0}, {8, img1}
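
To make the position mapping concrete, here is a small, self-contained sketch of how a walk over positions can dispatch text tokens to the normal batch and image start positions to libmtmd, roughly mirroring what update_slots() does. The types below (image_chunk, the token values, the locally defined LLAMA_TOKEN_NULL) are stand-ins for illustration only, not the actual llama.cpp / libmtmd types.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

using llama_pos   = int32_t;
using llama_token = int32_t;
const llama_token LLAMA_TOKEN_NULL = -1;    // redefined locally for this sketch

struct image_chunk { int id; };             // stand-in for mtmd::input_chunk_ptr

int main() {
    // 5 text tokens followed by two images: img0 occupies positions 5-7, img1 positions 8-9
    std::vector<llama_token> tokens = {
        10, 11, 12, 13, 14,
        LLAMA_TOKEN_NULL, LLAMA_TOKEN_NULL, LLAMA_TOKEN_NULL,
        LLAMA_TOKEN_NULL, LLAMA_TOKEN_NULL,
    };
    std::unordered_map<llama_pos, image_chunk> map_pos_to_image = {
        {5, {0}},   // start position of img0
        {8, {1}},   // start position of img1
    };

    // walk the positions: text tokens go to the regular decode path,
    // image *start* positions are handed off to the image decoder
    for (llama_pos pos = 0; pos < (llama_pos) tokens.size(); pos++) {
        if (tokens[pos] != LLAMA_TOKEN_NULL) {
            printf("pos %d: text token %d -> text batch\n", pos, tokens[pos]);
        } else if (auto it = map_pos_to_image.find(pos); it != map_pos_to_image.end()) {
            printf("pos %d: start of image %d -> hand off to libmtmd\n", pos, it->second.id);
        }
        // remaining null positions belong to an image chunk that was already dispatched
    }
    return 0;
}

With the layout above, positions 0-4 take the text path, position 5 triggers img0, position 8 triggers img1, and the other null positions are skipped because their chunk has already been handed off.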

TODOs

  • automatically deactivate certain features if vision is enabled; we will work on these features later
  • implement a hash function for images, to keep track of the cache (a rough sketch of one option follows this list)
  • fix detokenize(server_inp_chunk)
  • add more error handling
  • support remote image_url in addition to base64
  • add tinygemma3 model for CI test
  • add image upload to web UI
  • update docs
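
For the image-hash TODO above, a minimal sketch of one possible approach (an assumption for illustration, not necessarily what will be implemented) is to hash the raw image bytes with FNV-1a and use the 64-bit result as the cache key:

#include <cstdint>
#include <vector>

// FNV-1a over the raw image bytes; the 64-bit result can serve as a cache key
uint64_t image_hash_fnv1a(const std::vector<uint8_t> & data) {
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (uint8_t b : data) {
        h ^= b;
        h *= 0x100000001b3ULL;            // FNV prime
    }
    return h;
}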

Demo (webui)

The server can be run with this command:

llama-server -hf ggml-org/gemma-3-4b-it-GGUF

Then access the web UI (default at localhost:8080):

[image: web UI demo screenshot]

Demo (API)

Example Python client (using the OpenAI SDK):
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# we support both remote image_url and base64; the example below uses a base64 image (read from disk)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        # alternatively, you can put the remote image url here, example: "http(s)://....."
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    # the final streamed chunk may carry no content, so guard against None
    print(chunk.choices[0].delta.content or "", end="")

print("\n\n")

(Several comments between @qnixsynapse and @ngxson were marked as resolved.)

@ngxson (Collaborator, Author) commented Apr 13, 2025

@qnixsynapse I had a problem with my logic, which made it discard the text batch that comes before the image batch.

It should be fixed now, could you give it a try?

@ngxson (Collaborator, Author) commented Apr 13, 2025

Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized there are 2 refactorings that can be done in their own dedicated PRs:

  • The first one is quite simple: currently server_task is passed by copy in some places; we need to add some std::move calls
  • The second one is a bit more tricky. Currently, we track everything using a std::vector<llama_token>. However, for multimodal, I introduced the notion of "input chunks" along with libmtmd. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.
    In the current PR, I'm kind of hacking this by having server_inp_chunk wrap around one single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text + image chunks (a rough sketch of the idea follows this list). Do you have any better ideas to handle this?
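
As a rough sketch of the second point (illustrative only; this is not the actual server_inp_chunk from this PR), one way to mix single text tokens with opaque image chunks in one sequence is a variant type, so text-only code paths keep iterating over tokens with little change:

#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

using llama_token = int32_t;

struct image_ref { uint64_t id; size_t n_pos; };   // stand-in for an mtmd image chunk

// a chunk is either one text token or a reference to an image chunk
using server_chunk  = std::variant<llama_token, image_ref>;
using server_chunks = std::vector<server_chunk>;

// text-only logic can stay close to its current shape
size_t count_text_tokens(const server_chunks & chunks) {
    size_t n = 0;
    for (const auto & c : chunks) {
        if (std::holds_alternative<llama_token>(c)) {
            n++;
        }
    }
    return n;
}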

And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.


Edit: optionally one more refactoring: we should split llama-server into different compilation units; currently it may take up to 20s to compile

@qnixsynapse (Collaborator) commented Apr 14, 2025

@ngxson Can you please refresh this branch with master?

Nvm. Ended up using your fork .. working great!!! 👍

On further testing, it seems that the llama_batch size is sometimes exceeded on successive requests:

common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

@ggerganov (Member) commented:

> And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.

This was useful mainly before the defragmentation support was added. The reason is that over time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.
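
For context, the retry-with-a-smaller-view pattern under discussion looks roughly like the sketch below (an assumed shape with a stand-in decode function, not the actual server.cpp code):

#include <cstdint>
#include <cstdio>

// stand-in for llama_decode(): succeeds only if the view is small enough
static int fake_decode(int32_t n_view) { return n_view > 512 ? 1 : 0; }

int main() {
    const int32_t n_batch = 2048;                 // tokens queued for this decode call
    for (int32_t i = 0; i < n_batch; ) {
        int32_t n_view = n_batch - i;
        // if decoding the current view fails, halve it and retry
        while (n_view > 0 && fake_decode(n_view) != 0) {
            n_view /= 2;
        }
        if (n_view == 0) {
            printf("decode failed even for a single token\n");
            return 1;
        }
        printf("decoded tokens [%d, %d)\n", i, i + n_view);
        i += n_view;
    }
    return 0;
}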

I'll think about the input chunk question today and let you know if I have any thoughts.


@ngxson (Collaborator, Author) commented May 8, 2025

This should now work out of the box with the default Web UI

[screenshot: web UI, 2025-05-08]

@vgrunner4v commented May 8, 2025

Maybe a stupid question (sorry!):
So in order to activate it, we should provide --mmproj (when loading a local GGUF), and if we don't include it, multimodal is considered disabled? I'm asking because some features are disabled once multimodal is on, like context shift.

Edit: OK, I see the condition checking whether mmproj is empty, so I guess it's fine.
One request though: can you please add some info about it in server/README.md, along with the additional flags like --mmproj, --mmproj-url, etc.? That would really help.
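
(For reference, multimodal is enabled by passing an mmproj file alongside the model; with local GGUF files the command looks roughly like the following, with placeholder file names:)

llama-server -m model.gguf --mmproj mmproj.gguf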

@pnb (Contributor) commented May 9, 2025

I tried Mistral Small 2504 (Q4_K_M) from bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF and it works terrifically with --no-mmproj-offload or a Q8_0 quant of the mmproj and only 2K context (on a 7900XTX).

It is even able to pass the strawberry test (occasionally):
[image: strawberries and blueberries on a plate, with some of the strawberries arranged in the shape of the capital letter R]

Twice I got Jinja parsing errors with --jinja, although I suspect this is unrelated to the image part. I couldn't consistently replicate it but will keep trying in case it is somehow related to libmtmd.

@ngxson (Collaborator, Author) commented May 9, 2025

For Mistral Small: it does not come with a default chat template, so make sure to specify one; otherwise it will use the default ChatML template, which is incorrect.

Edit: a default chat template will be added after #13398

@p1-0tr (Contributor) commented May 9, 2025

I experimented with this PR a bit, mostly using gemma3 12b (bartowski/google_gemma-3-12b-it-GGUF). HW: mac m4 max. Everything worked great :)

Excited to see this land. Thanks :)

@ggerganov (Member) left a review comment:

I have tested mostly the existing text-only chat and infill functionality.

@ngxson changed the title from "server : vision support via libmtmd (need testing!)" to "server : vision support via libmtmd" on May 9, 2025
@ngxson (Collaborator, Author) commented May 9, 2025

One more thing before finishing this: I need to update the docs

@ngxson (Collaborator, Author) commented May 9, 2025

I updated the mtmd docs to state that llama-server is supported, while also mentioning in the llama-server docs that /chat/completions now supports this.

Also added it to the hot topics. Merging this once the CI passes 🔥


Here is the list of follow-up PRs that I'll work on:

@github-actions bot added the documentation label (Improvements or additions to documentation) on May 9, 2025
@ngxson merged commit 33eff40 into ggml-org:master on May 9, 2025 (46 checks passed)
@BugReporterZ commented May 9, 2025

Unless such a feature already exists (I didn't notice it), it probably also needs a way to limit the maximum image resolution (i.e., resize input images if necessary) and thus limit vision model resources accordingly (I remember from a different PR that it currently requires as much memory as the worst-case scenario). I'm currently unable to process any image with Mistral Small 2503 without --no-mmproj-offload, no matter the size.

Additionally: https://docs.mistral.ai/capabilities/vision/

Pixtral:

For both Pixtral models, each image will be divided into batches of 16x16 pixels, with each batch converted to a token. As a rule of thumb, an image with a resolution of "ResolutionX"x"ResolutionY" will consume approximately (ResolutionX/16) * (ResolutionY/16) tokens. For example, a 720x512 image will consume approximately (720/16) * (512/16) ≈ 1440 tokens. Note that all images with a resolution higher than 1024x1024 will be downscaled while maintaining the same aspect ratio. For instance, a 1436x962 image will be downscaled to approximately 1024x686, consuming around (1024/16) * (686/16) ≈ 2600 tokens.

Final Formula: N of tokens ≈ (ResolutionX * ResolutionY) / 256

Small / Medium:

Small is similar; however, instead of batches of 16, it will be batched in 14 pixels. Instead of a maximum resolution of 1024x1024, it has a maximum resolution of 1540x1540. Due to its slightly different architecture, it also only uses 1/4 of that number of tokens as input to the text decoder. This means that in total, you can summarize the consumption approximately as (ResolutionX/14) * (ResolutionY/14) * 1/4, which is approximately 3x less than Pixtral models, making it use fewer tokens and be more efficient.

Final Formula: N of tokens ≈ (ResolutionX * ResolutionY) / 784

Worth noting that the maximum token consumption of a single image with Pixtral-12B is 1024×1024/256=4096 tokens, while for Mistral Small it's 1540×1540/784=3025 tokens.
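
As a back-of-the-envelope sketch, the quoted formulas can be reproduced as follows (the patch sizes, resolution caps, and the 1/4 factor come from the Mistral documentation quoted above, not from llama.cpp):

#include <algorithm>
#include <cstdio>

// estimate tokens for an image: downscale to the model's maximum side length
// (keeping aspect ratio), divide into patch x patch squares, apply an optional factor
static int estimate_tokens(double w, double h, double patch, double max_side, double factor) {
    const double scale = std::min(1.0, max_side / std::max(w, h));
    w *= scale;
    h *= scale;
    return (int) ((w / patch) * (h / patch) * factor);
}

int main() {
    printf("Pixtral, 720x512:            ~%d tokens\n", estimate_tokens(720, 512, 16.0, 1024.0, 1.0));
    printf("Pixtral, worst case:         ~%d tokens\n", estimate_tokens(1024, 1024, 16.0, 1024.0, 1.0));
    printf("Mistral Small, worst case:   ~%d tokens\n", estimate_tokens(1540, 1540, 14.0, 1540.0, 0.25));
    return 0;
}

This prints roughly 1440, 4096, and 3025 tokens, matching the figures above.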

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...
Labels: documentation (Improvements or additions to documentation), examples, python (python script changes), server, testing (Everything test related)