
server : vision support via libmtmd #12898


Merged: ngxson merged 79 commits into ggml-org:master on May 9, 2025

Conversation

@ngxson (Collaborator) commented Apr 11, 2025

Continuation of #12849

This is my first attempt at bringing libmtmd to server.cpp.

How to use this?

See this documentation

The web UI now also supports uploading images (see the demo below)

Implementation

The main implementation is built around mtmd.h and struct server_tokens in utils.hpp:

The idea is that:

  • struct server_tokens lets most of the existing text-token logic work without too many changes
  • struct server_tokens can optionally contain image tokens (managed by libmtmd)
  • update_slots() calls libmtmd as soon as it sees an image token (a small illustrative sketch follows the snippet below)
    // map a **start** position in tokens to the image chunk
    std::unordered_map<llama_pos, mtmd::input_chunk_ptr> map_pos_to_image;

    // list of tokens
    // it can include LLAMA_TOKEN_NULL, which is used to indicate a token that is not a text token
    // a mtmd_input_chunk can occupy multiple tokens, one llama_token per **position**
    // important: for models using mrope, an image can contain multiple tokens but will use only one **position**
    llama_tokens tokens;

    // for ex. with input of 5 text tokens and 2 images:
    //      [0] [1] [2] [3] [4] [img0] [img0] [img0] [img1] [img1]
    // pos  0   1   2   3   4   5      6      7      8      9
    // map_pos_to_image will contain: {5, img0}, {8, img1}
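
To make the position mapping concrete, here is a small, self-contained sketch of how a walk over positions can dispatch text tokens to the normal batch and image start positions to libmtmd, roughly mirroring what update_slots() does. The types below (image_chunk, the token values, the locally defined LLAMA_TOKEN_NULL) are stand-ins for illustration only, not the actual llama.cpp / libmtmd types.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

using llama_pos   = int32_t;
using llama_token = int32_t;
const llama_token LLAMA_TOKEN_NULL = -1;    // redefined locally for this sketch

struct image_chunk { int id; };             // stand-in for mtmd::input_chunk_ptr

int main() {
    // 5 text tokens followed by two images: img0 occupies positions 5-7, img1 positions 8-9
    std::vector<llama_token> tokens = {
        10, 11, 12, 13, 14,
        LLAMA_TOKEN_NULL, LLAMA_TOKEN_NULL, LLAMA_TOKEN_NULL,
        LLAMA_TOKEN_NULL, LLAMA_TOKEN_NULL,
    };
    std::unordered_map<llama_pos, image_chunk> map_pos_to_image = {
        {5, {0}},   // start position of img0
        {8, {1}},   // start position of img1
    };

    // walk the positions: text tokens go to the regular decode path,
    // image *start* positions are handed off to the image decoder
    for (llama_pos pos = 0; pos < (llama_pos) tokens.size(); pos++) {
        if (tokens[pos] != LLAMA_TOKEN_NULL) {
            printf("pos %d: text token %d -> text batch\n", pos, tokens[pos]);
        } else if (auto it = map_pos_to_image.find(pos); it != map_pos_to_image.end()) {
            printf("pos %d: start of image %d -> hand off to libmtmd\n", pos, it->second.id);
        }
        // remaining null positions belong to an image chunk that was already dispatched
    }
    return 0;
}

With the layout above, positions 0-4 take the text path, position 5 triggers img0, position 8 triggers img1, and the other null positions are skipped because their chunk has already been handed off.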

TODOs

  • automatically deactivate certain features if vision is enabled; we will work on these features later
  • implement a hash function for images, to keep track of the cache (a rough sketch of one option follows this list)
  • fix detokenize(server_inp_chunk)
  • add more error handling
  • support remote image_url in addition to base64
  • add tinygemma3 model for CI test
  • add image upload to web UI
  • update docs
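
For the image-hash TODO above, a minimal sketch of one possible approach (an assumption for illustration, not necessarily what will be implemented) is to hash the raw image bytes with FNV-1a and use the 64-bit result as the cache key:

#include <cstdint>
#include <vector>

// FNV-1a over the raw image bytes; the 64-bit result can serve as a cache key
uint64_t image_hash_fnv1a(const std::vector<uint8_t> & data) {
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (uint8_t b : data) {
        h ^= b;
        h *= 0x100000001b3ULL;            // FNV prime
    }
    return h;
}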

Demo (webui)

The server can be run with this command:

llama-server -hf ggml-org/gemma-3-4b-it-GGUF

Then access the web UI (default at localhost:8080):

[image: web UI demo screenshot]

Demo (API)

Example Python client (using the OpenAI SDK):
import json
import base64
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-test", timeout=9999)

# we support both remote image_url and base64; the example below uses a base64 image (read from disk)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "../models/bliss.png"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.1,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "describe what you see in details" },
                {
                    "type": "image_url",
                    "image_url": {
                        # alternatively, you can put the remote image url here, example: "http(s)://....."
                        "url": f"data:image/png;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
)

for chunk in response:
    # the final streamed chunk may carry no content, so guard against None
    print(chunk.choices[0].delta.content or "", end="")

print("\n\n")

(Several comments between @qnixsynapse and @ngxson were marked as resolved.)

@ngxson (Collaborator, Author) commented Apr 13, 2025

@qnixsynapse I had a problem with my logic, which made it discard the text batch that comes before the image batch.

It should be fixed now, could you give it a try?

@ngxson (Collaborator, Author) commented Apr 13, 2025

Btw @ggerganov I'm noting here for visibility: while working on this PR, I realized there are 2 refactorings that can be done in their own dedicated PRs:

  • The first one is quite simple: currently server_task is passed by copy in some places; we need to add some std::move calls
  • The second one is a bit more tricky. Currently, we track everything using a std::vector<llama_token>. However, for multimodal, I introduced the notion of "input chunks" along with libmtmd. The server needs to be adapted to work with chunks of tokens / embeddings instead of a simple list of tokens.
    In the current PR, I'm kind of hacking this by having server_inp_chunk wrap around one single text token (so most of the text-related logic is unchanged). But obviously this brings some complications when dealing with both text + image chunks (a rough sketch of the idea follows this list). Do you have any better ideas to handle this?
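
As a rough sketch of the second point (illustrative only; this is not the actual server_inp_chunk from this PR), one way to mix single text tokens with opaque image chunks in one sequence is a variant type, so text-only code paths keep iterating over tokens with little change:

#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

using llama_token = int32_t;

struct image_ref { uint64_t id; size_t n_pos; };   // stand-in for an mtmd image chunk

// a chunk is either one text token or a reference to an image chunk
using server_chunk  = std::variant<llama_token, image_ref>;
using server_chunks = std::vector<server_chunk>;

// text-only logic can stay close to its current shape
size_t count_text_tokens(const server_chunks & chunks) {
    size_t n = 0;
    for (const auto & c : chunks) {
        if (std::holds_alternative<llama_token>(c)) {
            n++;
        }
    }
    return n;
}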

And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.


Edit: optionally one more refactoring: we should split llama-server into different compilation units; currently it may take up to 20s to compile

@qnixsynapse (Collaborator) commented Apr 14, 2025

@ngxson Can you please refresh this branch with master?

Nvm. Ended up using your fork .. working great!!! 👍

On further testing, it seems that the llama_batch size is sometimes exceeded on successive requests:

common/common.cpp:1161: GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded") failed

@ggerganov (Member) commented:

> And I also have a question regarding the logic around batch_view. IIRC, this is because sometimes the batch is too large for llama_decode to process, so we may want to reduce the input batch size (dynamically). However, we also internally split the batch into ubatch, so I'm wondering if this logic is now obsolete.

This was useful mainly before the defragmentation support was added. The reason is that over time the KV cache can become highly fragmented, and even if it has capacity for n_tokens it won't be able to find a contiguous slot, so attempting to split the batch into smaller chunks was a way to work around this. With defragmentation enabled by default this is now rarely necessary. So yes, this should be simplified in a separate PR.
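
For context, the retry-with-a-smaller-view pattern under discussion looks roughly like the sketch below (an assumed shape with a stand-in decode function, not the actual server.cpp code):

#include <cstdint>
#include <cstdio>

// stand-in for llama_decode(): succeeds only if the view is small enough
static int fake_decode(int32_t n_view) { return n_view > 512 ? 1 : 0; }

int main() {
    const int32_t n_batch = 2048;                 // tokens queued for this decode call
    for (int32_t i = 0; i < n_batch; ) {
        int32_t n_view = n_batch - i;
        // if decoding the current view fails, halve it and retry
        while (n_view > 0 && fake_decode(n_view) != 0) {
            n_view /= 2;
        }
        if (n_view == 0) {
            printf("decode failed even for a single token\n");
            return 1;
        }
        printf("decoded tokens [%d, %d)\n", i, i + n_view);
        i += n_view;
    }
    return 0;
}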

I'll think about the input chunk question today and let you know if I have any thoughts.


@ngxson (Collaborator, Author) commented May 8, 2025

This should now work out of the box with the default Web UI

[screenshot: web UI, 2025-05-08]

@vgrunner4v commented May 8, 2025

Maybe a stupid question (sorry!):
So in order to activate it, we should provide --mmproj (when loading a local GGUF), and if we don't include it, multimodal is considered disabled? I'm asking because some features are disabled once multimodal is on, like context shift.

Edit: OK, I see the condition checking whether mmproj is empty, so I guess it's fine.
One request though: can you please add some info about it in server/README.md, along with the additional flags like --mmproj, --mmproj-url, etc.? That would really help.
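
(For reference, multimodal is enabled by passing an mmproj file alongside the model; with local GGUF files the command looks roughly like the following, with placeholder file names:)

llama-server -m model.gguf --mmproj mmproj.gguf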

@pnb (Contributor) commented May 9, 2025

I tried Mistral Small 2504 (Q4_K_M) from bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF and it works terrifically with --no-mmproj-offload or a Q8_0 quant of the mmproj and only 2K context (on a 7900XTX).

It is even able to pass the strawberry test (occasionally):
[image: strawberries and blueberries on a plate, with some of the strawberries arranged in the shape of the capital letter R]

Twice I got Jinja parsing errors with --jinja, although I suspect this is unrelated to the image part. I couldn't consistently replicate it but will keep trying in case it is somehow related to libmtmd.

@ngxson (Collaborator, Author) commented May 9, 2025

For Mistral Small: it does not come with a default chat template, so make sure to specify one; otherwise it will use the default ChatML template, which is incorrect.

Edit: a default chat template will be added after #13398

@p1-0tr (Contributor) commented May 9, 2025

I experimented with this PR a bit, mostly using gemma3 12b (bartowski/google_gemma-3-12b-it-GGUF). HW: mac m4 max. Everything worked great :)

Excited to see this land. Thanks :)

@ggerganov (Member) left a review comment:

I have tested mostly the existing text-only chat and infill functionality.

@ngxson changed the title from "server : vision support via libmtmd (need testing!)" to "server : vision support via libmtmd" on May 9, 2025
@ngxson (Collaborator, Author) commented May 9, 2025

One more thing before finishing this: I need to update the docs

@ngxson (Collaborator, Author) commented May 9, 2025

I updated the mtmd docs to state that llama-server is supported, while also mentioning in the llama-server docs that /chat/completions now supports this.

Also added it to the hot topics. Merging this once the CI passes 🔥


Here is the list of follow-up PRs that I'll work on:

@github-actions bot added the documentation label (Improvements or additions to documentation) on May 9, 2025
@ngxson merged commit 33eff40 into ggml-org:master on May 9, 2025 (46 checks passed)
@BugReporterZ commented May 9, 2025

Unless such a feature already exists (I didn't notice it), it probably also needs a way to limit the maximum image resolution (i.e., resize input images if necessary) and thus limit vision model resources accordingly (I remember from a different PR that it currently requires as much memory as the worst-case scenario). I'm currently unable to process any image with Mistral Small 2503 without --no-mmproj-offload, no matter the size.

Additionally: https://docs.mistral.ai/capabilities/vision/

Pixtral:

For both Pixtral models, each image will be divided into batches of 16x16 pixels, with each batch converted to a token. As a rule of thumb, an image with a resolution of "ResolutionX"x"ResolutionY" will consume approximately (ResolutionX/16) * (ResolutionY/16) tokens. For example, a 720x512 image will consume approximately (720/16) * (512/16) ≈ 1440 tokens. Note that all images with a resolution higher than 1024x1024 will be downscaled while maintaining the same aspect ratio. For instance, a 1436x962 image will be downscaled to approximately 1024x686, consuming around (1024/16) * (686/16) ≈ 2600 tokens.

Final Formula: N of tokens ≈ (ResolutionX * ResolutionY) / 256

Small / Medium:

Small is similar; however, instead of batches of 16, it will be batched in 14 pixels. Instead of a maximum resolution of 1024x1024, it has a maximum resolution of 1540x1540. Due to its slightly different architecture, it also only uses 1/4 of that number of tokens as input to the text decoder. This means that in total, you can summarize the consumption approximately as (ResolutionX/14) * (ResolutionY/14) * 1/4, which is approximately 3x less than Pixtral models, making it use fewer tokens and be more efficient.

Final Formula: N of tokens ≈ (ResolutionX * ResolutionY) / 784

Worth noting that the maximum token consumption of a single image with Pixtral-12B is 1024×1024/256=4096 tokens, while for Mistral Small it's 1540×1540/784=3025 tokens.
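
As a back-of-the-envelope sketch, the quoted formulas can be reproduced as follows (the patch sizes, resolution caps, and the 1/4 factor come from the Mistral documentation quoted above, not from llama.cpp):

#include <algorithm>
#include <cstdio>

// estimate tokens for an image: downscale to the model's maximum side length
// (keeping aspect ratio), divide into patch x patch squares, apply an optional factor
static int estimate_tokens(double w, double h, double patch, double max_side, double factor) {
    const double scale = std::min(1.0, max_side / std::max(w, h));
    w *= scale;
    h *= scale;
    return (int) ((w / patch) * (h / patch) * factor);
}

int main() {
    printf("Pixtral, 720x512:            ~%d tokens\n", estimate_tokens(720, 512, 16.0, 1024.0, 1.0));
    printf("Pixtral, worst case:         ~%d tokens\n", estimate_tokens(1024, 1024, 16.0, 1024.0, 1.0));
    printf("Mistral Small, worst case:   ~%d tokens\n", estimate_tokens(1540, 1540, 14.0, 1540.0, 0.25));
    return 0;
}

This prints roughly 1440, 4096, and 3025 tokens, matching the figures above.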

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...
Labels: documentation (Improvements or additions to documentation), examples, python (python script changes), server, testing (Everything test related)