Adding logprobs to /v1/completions #11344

Merged: 1 commit merged into ggml-org:master on Jan 22, 2025

Conversation

jpodivin (Contributor)

The /v1/completions endpoint of the server doesn't respect the logprobs argument when called.
The original API from OpenAI is deprecated, but the endpoint is still used in a lot of examples, and I would assume in actual projects as well.

This change allows the logprobs argument to be treated the same way n_probs is.
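
For context, the change essentially maps the OpenAI-style logprobs field onto the server's existing n_probs setting. A minimal sketch of that idea (illustrative only; the exact diff is in the commit, and the guarded version discussed later in this thread supersedes it):

// illustrative sketch of the one-line mapping, not the exact diff
params.sampling.n_probs = json_value(data, "logprobs", defaults.sampling.n_probs);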

In principle, this change would allow the /completion endpoint to be called with both n_probs and logprobs.
That could potentially cause some confusion if a user supplied both n_probs and logprobs in the same API call.

It would be possible to safeguard against that eventuality, but considering the minimal impact of this behavior, and that it can only happen when a user deliberately calls the API with both parameters, I have decided against it.
This way the PR can stay a one-liner.

The documentation needs no adjustment, because it already links to the OpenAI docs, implying that the endpoint behaves in essentially the same way.

Old output example:

jpodivin@fedora:~/repos/llama.cpp$ curl -H "Content-Type: application/json"  -d '{"model":"gpt-3.5-turbo", "prompt": "something", "logprobs": 5, "max_tokens": 1}' http://127.0.0.1:8080/v1/completions | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   518  100    80    639     98 --:--:-- --:--:-- --:--:--   739
{
  "choices": [
    {
      "text": " Ge",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "created": 1737533686,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4501-667d7284",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 2,
    "total_tokens": 3
  },
  "id": "chatcmpl-Xg4bq1JlcEyXkybZDrdyEyl48zv2EeyG",
  "timings": {
    "prompt_n": 2,
    "prompt_ms": 807.999,
    "prompt_per_token_ms": 403.9995,
    "prompt_per_second": 2.475250588181421,
    "predicted_n": 1,
    "predicted_ms": 0.032,
    "predicted_per_token_ms": 0.032,
    "predicted_per_second": 31250.0
  }
}

New output example:

jpodivin@fedora:~/repos/llama.cpp$ curl -H "Content-Type: application/json"  -d '{"model":"gpt-3.5-turbo", "prompt": "something", "logprobs": 5, "max_tokens": 1}' http://127.0.0.1:8080/v1/completions | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1073  100   993  100    80   2287    184 --:--:-- --:--:-- --:--:--  2466
{
  "choices": [
    {
      "text": " Ge",
      "index": 0,
      "logprobs": {
        "content": [
          {
            "id": 2404,
            "token": " Ge",
            "bytes": [
              32,
              71,
              101
            ],
            "logprob": -0.44570669531822205,
            "top_logprobs": [
              {
                "id": 2404,
                "token": "Ge",
                "bytes": [
                  71,
                  101
                ],
                "logprob": -0.44570669531822205
              },
              {
                "id": 2970,
                "token": "ge",
                "bytes": [
                  103,
                  101
                ],
                "logprob": -2.617774248123169
              },
              {
                "id": 349,
                "token": "is",
                "bytes": [
                  105,
                  115
                ],
                "logprob": -3.1366515159606934
              },
              {
                "id": 315,
                "token": "I",
                "bytes": [
                  73
                ],
                "logprob": -3.662461519241333
              },
              {
                "id": 369,
                "token": "that",
                "bytes": [
                  116,
                  104,
                  97,
                  116
                ],
                "logprob": -4.037624835968018
              }
            ]
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "created": 1737533297,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4501-667d7284",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 2,
    "total_tokens": 3
  },
  "id": "chatcmpl-LwqGCxZCuQHAfAbfmfuU8w4axsYGaRrH",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 424.499,
    "prompt_per_token_ms": 424.499,
    "prompt_per_second": 2.355718152457367,
    "predicted_n": 1,
    "predicted_ms": 7.645,
    "predicted_per_token_ms": 7.645,
    "predicted_per_second": 130.80444735120994
  }
}

ngxson (Collaborator) left a comment

Hmm, sorry, I'm revoking the approval because CI didn't pass. I will have a look.

jpodivin (Contributor, Author) commented Jan 22, 2025

I may need that logic after all.

Edit: yep.

jpodivin (Contributor, Author) commented Jan 22, 2025

@ngxson Basically it comes down to defaults. I could insert a conditional that uses the value of logprobs only if it is present and n_probs is still at its default of 0.

if (data.contains("logprobs") && params.sampling.n_probs == defaults.sampling.n_probs) {
    params.sampling.n_probs = json_value(data, "logprobs", defaults.sampling.n_probs);
}

That would restore the expected behavior for /completion. The alternative would be a bigger modification of params_from_json_cmpl that would prevent the use of any arguments not allowed by the endpoint in question.
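
One possible shape for that bigger alternative, sketched purely as an illustration (the helper name, the whitelist argument, and the error handling here are hypothetical, not code that exists in the server):

#include <nlohmann/json.hpp>
#include <set>
#include <stdexcept>
#include <string>

using json = nlohmann::ordered_json;

// Hypothetical helper: reject any request field that the current endpoint does not
// support, instead of silently mapping or ignoring it.
static void reject_unknown_params(const json & data, const std::set<std::string> & allowed) {
    for (const auto & el : data.items()) {
        if (allowed.find(el.key()) == allowed.end()) {
            throw std::invalid_argument("unsupported parameter for this endpoint: " + el.key());
        }
    }
}

Each endpoint handler would then pass its own allowed set before handing the request to params_from_json_cmpl.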

Example output:

jpodivin@fedora:~/repos/llama.cpp$ curl -H "Content-Type: application/json"  -d '{"model":"gpt-3.5-turbo", "prompt": "something", "n_probs": 1, "max_tokens": 1}' http://127.0.0.1:8080/completion | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1671  100  1592  100    79   1835     91 --:--:-- --:--:-- --:--:--  1925
{
  "index": 0,
  "content": " Ge",
  "tokens": [],
  "id_slot": 0,
  "stop": true,
  "model": "gpt-3.5-turbo",
  "tokens_predicted": 1,
  "tokens_evaluated": 2,
  "generation_settings": {
    "n_predict": 1,
    "seed": 4294967295,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "dry_multiplier": 0.0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": 4096,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "stop": [],
    "max_tokens": 1,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "logit_bias": [],
    "n_probs": 1,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "penalties",
      "dry",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative.n_max": 16,
    "speculative.n_min": 5,
    "speculative.p_min": 0.8999999761581421,
    "timings_per_token": false,
    "post_sampling_probs": false,
    "lora": []
  },
  "prompt": "<s> something",
  "has_new_line": false,
  "truncated": false,
  "stop_type": "limit",
  "stopping_word": "",
  "tokens_cached": 2,
  "timings": {
    "prompt_n": 2,
    "prompt_ms": 856.85,
    "prompt_per_token_ms": 428.425,
    "prompt_per_second": 2.334130828032911,
    "predicted_n": 1,
    "predicted_ms": 8.324,
    "predicted_per_token_ms": 8.324,
    "predicted_per_second": 120.1345506967804
  },
  "completion_probabilities": [
    {
      "id": 2404,
      "token": " Ge",
      "bytes": [
        32,
        71,
        101
      ],
      "logprob": -0.44570669531822205,
      "top_logprobs": [
        {
          "id": 2404,
          "token": "Ge",
          "bytes": [
            71,
            101
          ],
          "logprob": -0.44570669531822205
        }
      ]
    }
  ]
}

ngxson (Collaborator) commented Jan 22, 2025

Basically it comes down to defaults. I could insert a conditional that uses the value of logprobs only if it is present and n_probs is still at its default of 0.

Can you try pushing that so we can see if CI passes?

jpodivin (Contributor, Author)
@ngxson Sure, there it is.

ngxson merged commit 96f4053 into ggml-org:master on Jan 22, 2025
45 checks passed
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025