Adding logprobs to /v1/completions #11344

Merged: 1 commit merged into ggml-org:master on Jan 22, 2025

Conversation

jpodivin (Contributor)

The /v1/completions endpoint of the server doesn't respect the logprobs argument when called.
The original API from OpenAI is deprecated, but the endpoint is still used in a lot of examples, and I would assume in actual projects as well.

This change allows the logprobs argument to be treated the same way n_probs is.
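
For context, the change essentially maps the OpenAI-style logprobs field onto the server's existing n_probs setting. A minimal sketch of that idea (illustrative only; the exact diff is in the commit, and the guarded version discussed later in this thread supersedes it):

// illustrative sketch of the one-line mapping, not the exact diff
params.sampling.n_probs = json_value(data, "logprobs", defaults.sampling.n_probs);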

In principle, this change would allow the /completion endpoint to be called with both n_probs and logprobs.
That could potentially cause some confusion if a user supplied both n_probs and logprobs in the same API call.

It would be possible to safeguard against that eventuality, but considering the minimal impact of this behavior, and that it can only happen when a user deliberately calls the API with both parameters, I have decided against it.
This way the PR can stay a one-liner.

The documentation needs no adjustment, because it already links to the OpenAI docs, implying that the endpoint behaves in essentially the same way.

Old output example:

jpodivin@fedora:~/repos/llama.cpp$ curl -H "Content-Type: application/json"  -d '{"model":"gpt-3.5-turbo", "prompt": "something", "logprobs": 5, "max_tokens": 1}' http://127.0.0.1:8080/v1/completions | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   598  100   518  100    80    639     98 --:--:-- --:--:-- --:--:--   739
{
  "choices": [
    {
      "text": " Ge",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "created": 1737533686,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4501-667d7284",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 2,
    "total_tokens": 3
  },
  "id": "chatcmpl-Xg4bq1JlcEyXkybZDrdyEyl48zv2EeyG",
  "timings": {
    "prompt_n": 2,
    "prompt_ms": 807.999,
    "prompt_per_token_ms": 403.9995,
    "prompt_per_second": 2.475250588181421,
    "predicted_n": 1,
    "predicted_ms": 0.032,
    "predicted_per_token_ms": 0.032,
    "predicted_per_second": 31250.0
  }
}

New output example:

jpodivin@fedora:~/repos/llama.cpp$ curl -H "Content-Type: application/json"  -d '{"model":"gpt-3.5-turbo", "prompt": "something", "logprobs": 5, "max_tokens": 1}' http://127.0.0.1:8080/v1/completions | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1073  100   993  100    80   2287    184 --:--:-- --:--:-- --:--:--  2466
{
  "choices": [
    {
      "text": " Ge",
      "index": 0,
      "logprobs": {
        "content": [
          {
            "id": 2404,
            "token": " Ge",
            "bytes": [
              32,
              71,
              101
            ],
            "logprob": -0.44570669531822205,
            "top_logprobs": [
              {
                "id": 2404,
                "token": "Ge",
                "bytes": [
                  71,
                  101
                ],
                "logprob": -0.44570669531822205
              },
              {
                "id": 2970,
                "token": "ge",
                "bytes": [
                  103,
                  101
                ],
                "logprob": -2.617774248123169
              },
              {
                "id": 349,
                "token": "is",
                "bytes": [
                  105,
                  115
                ],
                "logprob": -3.1366515159606934
              },
              {
                "id": 315,
                "token": "I",
                "bytes": [
                  73
                ],
                "logprob": -3.662461519241333
              },
              {
                "id": 369,
                "token": "that",
                "bytes": [
                  116,
                  104,
                  97,
                  116
                ],
                "logprob": -4.037624835968018
              }
            ]
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "created": 1737533297,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4501-667d7284",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 1,
    "prompt_tokens": 2,
    "total_tokens": 3
  },
  "id": "chatcmpl-LwqGCxZCuQHAfAbfmfuU8w4axsYGaRrH",
  "timings": {
    "prompt_n": 1,
    "prompt_ms": 424.499,
    "prompt_per_token_ms": 424.499,
    "prompt_per_second": 2.355718152457367,
    "predicted_n": 1,
    "predicted_ms": 7.645,
    "predicted_per_token_ms": 7.645,
    "predicted_per_second": 130.80444735120994
  }
}

ngxson (Collaborator) left a comment

Hmm, sorry, I'm revoking the approval because CI didn't pass. I will have a look.

jpodivin (Contributor, Author) commented Jan 22, 2025

I may need that logic after all.

Edit: yep.

jpodivin (Contributor, Author) commented Jan 22, 2025

@ngxson Basically it comes down to defaults. I could insert a conditional that uses the value of logprobs only if it is present and n_probs is still at its default of 0.

if (data.contains("logprobs") && params.sampling.n_probs == defaults.sampling.n_probs) {
    params.sampling.n_probs = json_value(data, "logprobs", defaults.sampling.n_probs);
}

That would restore the expected behavior for /completion. The alternative would be a bigger modification of params_from_json_cmpl that would prevent the use of any arguments not allowed by the endpoint in question.
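
One possible shape for that bigger alternative, sketched purely as an illustration (the helper name, the whitelist argument, and the error handling here are hypothetical, not code that exists in the server):

#include <nlohmann/json.hpp>
#include <set>
#include <stdexcept>
#include <string>

using json = nlohmann::ordered_json;

// Hypothetical helper: reject any request field that the current endpoint does not
// support, instead of silently mapping or ignoring it.
static void reject_unknown_params(const json & data, const std::set<std::string> & allowed) {
    for (const auto & el : data.items()) {
        if (allowed.find(el.key()) == allowed.end()) {
            throw std::invalid_argument("unsupported parameter for this endpoint: " + el.key());
        }
    }
}

Each endpoint handler would then pass its own allowed set before handing the request to params_from_json_cmpl.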

Example output:

jpodivin@fedora:~/repos/llama.cpp$ curl -H "Content-Type: application/json"  -d '{"model":"gpt-3.5-turbo", "prompt": "something", "n_probs": 1, "max_tokens": 1}' http://127.0.0.1:8080/completion | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1671  100  1592  100    79   1835     91 --:--:-- --:--:-- --:--:--  1925
{
  "index": 0,
  "content": " Ge",
  "tokens": [],
  "id_slot": 0,
  "stop": true,
  "model": "gpt-3.5-turbo",
  "tokens_predicted": 1,
  "tokens_evaluated": 2,
  "generation_settings": {
    "n_predict": 1,
    "seed": 4294967295,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "dry_multiplier": 0.0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": 4096,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "stop": [],
    "max_tokens": 1,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "logit_bias": [],
    "n_probs": 1,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "penalties",
      "dry",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative.n_max": 16,
    "speculative.n_min": 5,
    "speculative.p_min": 0.8999999761581421,
    "timings_per_token": false,
    "post_sampling_probs": false,
    "lora": []
  },
  "prompt": "<s> something",
  "has_new_line": false,
  "truncated": false,
  "stop_type": "limit",
  "stopping_word": "",
  "tokens_cached": 2,
  "timings": {
    "prompt_n": 2,
    "prompt_ms": 856.85,
    "prompt_per_token_ms": 428.425,
    "prompt_per_second": 2.334130828032911,
    "predicted_n": 1,
    "predicted_ms": 8.324,
    "predicted_per_token_ms": 8.324,
    "predicted_per_second": 120.1345506967804
  },
  "completion_probabilities": [
    {
      "id": 2404,
      "token": " Ge",
      "bytes": [
        32,
        71,
        101
      ],
      "logprob": -0.44570669531822205,
      "top_logprobs": [
        {
          "id": 2404,
          "token": "Ge",
          "bytes": [
            71,
            101
          ],
          "logprob": -0.44570669531822205
        }
      ]
    }
  ]
}

ngxson (Collaborator) commented Jan 22, 2025

Basically it comes down to defaults. I could insert a conditional that uses the value of logprobs only if it is present and n_probs is still at its default of 0.

Can you try pushing that so we can see if CI passes?

jpodivin (Contributor, Author)
@ngxson Sure, there it is.

ngxson merged commit 96f4053 into ggml-org:master on Jan 22, 2025
45 checks passed
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025