Slow when logits_all=True, inconsistent logprobs and solutions #1983

Open
For-rest2005 opened this issue Mar 26, 2025 · 6 comments

@For-rest2005

Environment

Latest version as of 2025-03-26.

Issue

I tried to use this library to record the logits of my GGUF model, but the return value of Llama.__call__() appears to be wrong, and the process of recording logits is very slow.

Then I dug into the problem. I found that the "token_logprobs" in the return value of Llama.__call__() are not the log probabilities of generating the current token at the current position. The returned "token_logprobs" are actually the log probabilities the model assigns to the current token after it has already consumed that token, i.e. the values are shifted by one position.
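
To make the off-by-one concrete, here is a minimal, runnable sketch with made-up numbers (the names per_position_logprobs and token_ids are illustrative only and not part of the library's API):

import numpy as np

# per_position_logprobs[i] is the log-softmax distribution the model produces AFTER
# consuming token_ids[i]; shape [n_tokens, vocab_size]. The data is random and only
# serves to show which entry is being read.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))
per_position_logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
token_ids = np.array([3, 1, 7, 2, 5])

for i in range(1, len(token_ids)):
    returned = per_position_logprobs[i, token_ids[i]]      # what __call__ is observed to report
    expected = per_position_logprobs[i - 1, token_ids[i]]  # logprob of token i given tokens < i
    print(f"position {i}: returned={returned:.4f} expected={expected:.4f}")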

I wrote two Python programs to verify this.

# Script 1: llama-cpp-python
from llama_cpp import Llama

model_path = ""  # path to the GGUF model
prompt = "Once upon a time,"

model = Llama(
    model_path=model_path,
    n_gpu_layers=-1,
    n_ctx=2048,
    logits_all=True,  # keep the logits for every position, not only the last one
    main_gpu=0,
    verbose=False,
)

response = model(
    prompt=prompt,
    max_tokens=1,
    echo=True,      # include the prompt tokens in the returned logprobs
    logprobs=1,
    temperature=0,
)

logprobs = response['choices'][0]['logprobs']
tokens = logprobs['tokens']
token_logprobs = logprobs['token_logprobs']
top_logprobs = logprobs['top_logprobs']

print(response)
for i in range(len(tokens)):
    print("prob_ln = ", token_logprobs[i], "Token = ", tokens[i])

print(len(response['choices'][0]['logprobs']['top_logprobs']))

# Script 2: reference implementation with transformers
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = ""  # path to the matching HF model
prompt = "Once upon a time,"

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype='auto', revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]

print(input_ids)
with torch.no_grad():
    outputs = model(input_ids=input_ids)

logits = F.log_softmax(outputs.logits, dim=-1)
print(logits)
logits = torch.squeeze(logits)

for i in range(logits.shape[0]):
    # The distribution at row i-1 is what predicts the token at position i.
    # Replace "i-1" with "i" and you will get the same results as the llama_cpp version.
    # (For i == 0 this wraps to the last row; position 0 has no preceding context.)
    prob_ln = logits[i - 1, input_ids[0, i]].item()
    token = tokenizer.decode(input_ids[0, i])
    print(f"Position {i}: prob_ln = {prob_ln}, Token = '{token}', TokenID = {input_ids[0, i]}")

If you run these two scripts on the same model, differing only in format (GGUF vs. HF), you will see completely different outputs. You can modify the code as noted in the comments to verify my conclusion.

Solutions

I have a rough solution that covers my own needs, but I have been busy recently and cannot write a thorough fix; I hope you can complete it and merge it into the project.
For efficiency: Llama._create_completion sorts a vocabulary-sized array for every token (see the picture below). For my use case I only need a single top token logprob, and this sort takes most of the time in my program, so I replaced it with torch.max. For the general case, a dedicated optimization could be added for small values of the "logprobs" parameter (a rough sketch of that idea follows the code below).
For the inconsistent logprobs, change "int(token)" (see the picture below) so that each position's distribution is indexed with the token that follows it, i.e. "int(all_tokens[idx+1])"; the logprob of a token is then taken from the distribution produced before that token.
[Image: screenshot of the referenced code in Llama._create_completion]
Here is my solution. It only works for recording a single top token; if you have the same need, you can replace the corresponding section of llama.py with it.

        logprobs_or_none: Optional[CompletionLogprobs] = None
        if logprobs is not None:
            text_offset = 0 if echo else len(prompt)
            token_offset = 0 if echo else len(prompt_tokens[1:])
            text_offsets: List[int] = []
            token_logprobs: List[Optional[float]] = []
            tokens: List[str] = []
            top_logprobs: List[Optional[Dict[str, float]]] = []

            if echo:
                # Remove leading BOS token if exists
                all_tokens = (
                    prompt_tokens[1 if prompt_tokens[0] == self.token_bos() else 0 :]
                    + completion_tokens
                )
            else:
                all_tokens = completion_tokens

            all_token_strs = [
                self.detokenize([token], prev_tokens=all_tokens[:i]).decode(
                    "utf-8", errors="ignore"
                )
                for i, token in enumerate(all_tokens)
            ]
            all_logprobs = Llama.logits_to_logprobs(self._scores)[token_offset:]
            # TODO: may be able to change this loop to use np.take_along_dim
            # The first token has no preceding context, so its entries are seeded
            # with placeholders (cf. the OpenAI convention noted below).
            token_logprobs = [0.0]
            top_logprobs = [{}]
            for idx, (token, token_str, logprobs_token) in enumerate(
                zip(all_tokens, all_token_strs, all_logprobs)
            ):
                if token == bos_token_id:
                    continue
                text_offsets.append(
                    text_offset
                    + len(
                        self.detokenize(all_tokens[:idx]).decode(
                            "utf-8", errors="ignore"
                        )
                    )
                )
                tokens.append(token_str)
                # Original code: a full sort over the entire vocabulary for every token.
                # sorted_logprobs = list(
                #     sorted(
                #         zip(logprobs_token, range(len(logprobs_token))), reverse=True
                #     )
                # )
                # Replacement: only the single top entry is needed, so torch.max
                # (max value and argmax in one pass) avoids the full sort.
                # NOTE: this requires `import torch` at the top of llama.py.
                logprobs_token_tmp = torch.tensor(logprobs_token)
                max_logprob, max_index = torch.max(logprobs_token_tmp, dim=0)
                # Index the current position's distribution with the *next* token, so
                # token_logprobs lines up as "logprob of token i given tokens < i".
                if idx + 1 != len(all_tokens):
                    token_logprobs.append(logprobs_token[int(all_tokens[idx + 1])])
                top_logprob: Optional[Dict[str, float]] = {
                    # Original code iterated over sorted_logprobs[:logprobs] here.
                    self.detokenize([int(max_index)], prev_tokens=all_tokens[:idx]).decode(
                        "utf-8", errors="ignore"
                    ): float(max_logprob)
                }
                # top_logprob.update({token_str: logprobs_token[int(token)]})
                top_logprobs.append(top_logprob)
            # Weird idiosyncrasy of the OpenAI API where
            # token_logprobs and top_logprobs are null for
            # the first token.
            # if echo and len(all_tokens) > 0:
            #     token_logprobs[0] = None
            #     top_logprobs[0] = None
            logprobs_or_none = {
                "tokens": tokens,
                "text_offset": text_offsets,
                "token_logprobs": token_logprobs,
                "top_logprobs": top_logprobs,
            }
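
As a follow-up to the efficiency point above: when the "logprobs" parameter is small, the full vocabulary sort can be replaced by a partial selection. Below is a rough, untested sketch of that idea (not code from llama.py; logprobs_token stands for one position's 1-D array of log probabilities and k for the requested number of top entries):

import numpy as np

def top_k_logprobs(logprobs_token: np.ndarray, k: int):
    """Return the k (logprob, token_id) pairs with the highest log probability,
    sorted in descending order, without sorting the entire vocabulary."""
    # np.argpartition selects the top-k indices in O(V); only those k entries
    # are then fully sorted.
    top_idx = np.argpartition(logprobs_token, -k)[-k:]
    top_idx = top_idx[np.argsort(logprobs_token[top_idx])[::-1]]
    return [(float(logprobs_token[i]), int(i)) for i in top_idx]

The returned list has the same (logprob, token_id) layout as the sorted_logprobs list produced by the commented-out sort, so it could slot into the same place.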

For-rest2005 changed the title from "Slow when logits _all=True, inconsistent logprobs and solutions" to "Slow when logits_all=True, inconsistent logprobs and solutions" on Mar 26, 2025
@kunxiongzhu

@For-rest2005 Unfortunately, I modified the code and rebuilt it, but lm-eval still returns the wrong accuracy.

Even when I run the two sample scripts you provided, they still return different values.

@For-rest2005
Author

@kunxiongzhu Sorry, I have forgotten the details by now, but I can tell you my results. The values returned by my example code differ slightly (0.2%-5%) in my environment. The "acc_norm" results for GGUF and HF in lm-eval are almost the same (< 0.1%), but the "acc" results differ by 2%-3%. This difference may come from the different activation value types.

@kunxiongzhu

@For-rest2005 Thank you for your response. After using "token_logprobs.append(logprobs_token[int(all_tokens[idx])])" instead of "token_logprobs.append(logprobs_token[int(all_tokens[idx+1])])", the returned values are similar, as shown below. However, when I run lm-eval, there is still a significant difference in accuracy between Transformers and llama-cpp. Could you please help me identify any mistakes I might have made?

transformers demo code result:
Loading checkpoint shards: 100%|██████████| 2/2 [00:41<00:00, 20.71s/it]
tensor([[ 1, 9038, 2501, 263, 931, 29892]])
tensor([[[-10.9297, -11.2578, -10.7188, ..., -9.7031, -9.1562, -10.3906],
[-22.6094, -15.9297, -15.0391, ..., -18.9844, -20.4531, -20.3750],
[-25.8750, -22.0000, -18.7500, ..., -25.0156, -27.8438, -24.7344],
[-27.7188, -24.5469, -18.6875, ..., -24.9219, -28.4062, -25.7500],
[-20.6562, -19.1875, -14.9062, ..., -20.5312, -20.0000, -23.3281],
[-21.3125, -17.2344, -13.0312, ..., -16.0156, -17.8750, -18.4531]]],
dtype=torch.float16)
Position 0: prob_ln = -17.234375, Token = '', TokenID = 1
Position 1: prob_ln = -11.1171875, Token = 'Once', TokenID = 9038
Position 2: prob_ln = -0.85009765625, Token = 'upon', TokenID = 2501
Position 3: prob_ln = 0.0, Token = 'a', TokenID = 263
Position 4: prob_ln = 0.0, Token = 'time', TokenID = 931
Position 5: prob_ln = -0.08154296875, Token = ',', TokenID = 29892

llama-cpp demo code result:
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
{'id': 'cmpl-d3d0bdb7-f785-4f36-9559-2e3a50c9f912', 'object': 'text_completion', 'created': 1744041809, 'model': '/mnt/data/ehdd1/home/kz96891/models/gguf/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-fp16.gguf', 'choices': [{'text': 'Once upon a time, in', 'index': 0, 'logprobs': {'tokens': [' Once', ' upon', ' a', ' time', ',', ' in'], 'text_offset': [0, 5, 10, 12, 17, 18], 'token_logprobs': [0.0, -11.185795, -0.85255027, -0.00034016545, -0.00026854247, -0.080405265], 'top_logprobs': [{}, {' Unterscheidung': tensor(-4.0696)}, {' upon': tensor(-0.8526)}, {' a': tensor(-0.0003)}, {' time': tensor(-0.0003)}, {',': tensor(-0.0804)}, {' in': tensor(-0.8544)}]}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 1, 'total_tokens': 7}}
prob_ln = 0.0 Token = Once
prob_ln = -11.185795 Token = upon
prob_ln = -0.85255027 Token = a
prob_ln = -0.00034016545 Token = time
prob_ln = -0.00026854247 Token = ,
prob_ln = -0.080405265 Token = in
7
Exception ignored in: <function Llama.__del__ at 0x71bc48b70040>
Traceback (most recent call last):
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/llama.py", line 4651, in __del__
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/llama.py", line 4648, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 618, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 610, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 595, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 360, in __exit__
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/_internals.py", line 75, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 618, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 610, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 595, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 478, in _exit_wrapper
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/_internals.py", line 69, in free_model
TypeError: 'NoneType' object is not callable

transformers accuracy:
"results": {
"commonsense_qa": {
"alias": "commonsense_qa",
"acc,none": 0.5823095823095823,
"acc_stderr,none": 0.014119662277692187
}
}

llama-cpp accuracy:
"results": {
"commonsense_qa": {
"alias": "commonsense_qa",
"acc,none": 0.29238329238329236,
"acc_stderr,none": 0.013022531002213258
}
}

@For-rest2005
Author

@kunxiongzhu Did you use the API model in this PR?

@kunxiongzhu

@For-rest2005 Yes, I did, but it still returns the wrong accuracy

@For-rest2005
Author

For-rest2005 commented Apr 12, 2025

[Image]
@kunxiongzhu You may print the intermediate "logprobs" values to check whether you have applied the change correctly.
The code I uploaded here is slightly different from the code on my machine, but I am sure the version on my machine works correctly.
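
For example, a combined check along these lines could help (a sketch only, reusing the placeholder model paths and the same indexing as the two scripts in the issue body; it is not part of either library):

import torch
import torch.nn.functional as F
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer

gguf_path = ""  # path to the GGUF model
hf_path = ""    # path to the matching HF model
prompt = "Once upon a time,"

# llama-cpp-python side: per-token logprobs as returned by __call__.
llm = Llama(model_path=gguf_path, n_gpu_layers=-1, n_ctx=2048, logits_all=True, verbose=False)
resp = llm(prompt=prompt, max_tokens=1, echo=True, logprobs=1, temperature=0)
lp = resp['choices'][0]['logprobs']
llama_pairs = list(zip(lp['tokens'], lp['token_logprobs']))

# transformers side: row i-1 of the log-softmax predicts the token at position i.
hf_model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype='auto')
tok = AutoTokenizer.from_pretrained(hf_path)
ids = tok(prompt, return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = F.log_softmax(hf_model(input_ids=ids).logits, dim=-1).squeeze(0)
hf_pairs = [(tok.decode(ids[0, i]), logits[i - 1, ids[0, i]].item())
            for i in range(1, ids.shape[1])]

print("llama-cpp   :", llama_pairs)
print("transformers:", hf_pairs)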
