Slow when logits_all=True, inconsistent logprobs and solutions #1983


Description

@For-rest2005

Environment

Latest version as of 2025-03-26

Issue

I tried to use this library to record the logits of my GGUF model, but there seem to be some errors in the return value of Llama.__call__(), and the process of recording logits is too slow.

Then I dug into the problem. I found that the "token_logprobs" in the return value of Llama.__call__() are not the log probabilities of each token given only the preceding context. The returned "token_logprobs" are actually the log probabilities the model assigns to the current token after it has already consumed that token, i.e. they are offset by one position.

I wrote two Python programs to verify this.

from llama_cpp import Llama

model_path = "" #gguf models

prompt = "Once upon a time,"

model = Llama(
    model_path=model_path,
    n_gpu_layers=-1,
    n_ctx=2048,
    logits_all=True,  # needed to get logprobs for every position
    main_gpu=0,
    verbose=False,
)

response = model(
    prompt=prompt,
    max_tokens=1,
    echo=True,  # include the prompt tokens in the returned logprobs
    logprobs=1,
    temperature=0,
)

logprobs = response['choices'][0]['logprobs']
tokens = logprobs['tokens']
token_logprobs = logprobs['token_logprobs']
top_logprobs = logprobs['top_logprobs']

print(response)
for i in range(len(tokens)):
    print("prob_ln = ", token_logprobs[i], "Token = ", tokens[i])


print(len(response['choices'][0]['logprobs']['top_logprobs']))

# Second script: the same model in Hugging Face format, used as a reference.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "" # hf models.
prompt = "Once upon a time,"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype='auto', revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_path)  # torch_dtype is not a tokenizer argument

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]

print(input_ids)
with torch.no_grad():
    outputs = model(input_ids=input_ids)

logits = F.log_softmax(outputs.logits, dim=-1)
print(logits)
logits = torch.squeeze(logits)

for i in range(1, logits.shape[0]):  # position 0 has no preceding context
    prob_ln = logits[i - 1, input_ids[0, i]].item()  # replace "i-1" with "i" and you will get the same results as the Llama version
    token = tokenizer.decode(input_ids[0, i])
    print(f"Position {i}: prob_ln = {prob_ln}, Token = '{token}', TokenID = {input_ids[0, i]}")

If you run the two scripts on the same model (only the format differs), you will get totally different outputs. You can also modify the code as described in my comments to verify my conclusion.
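
To spell out the off-by-one in one place: take a (seq_len, vocab) matrix of log probabilities in which row i is produced after the model has consumed token i. The correct per-token logprob and the value currently returned by Llama.__call__() differ only in which row is indexed. A minimal sketch of the two (the helper names are mine, purely for illustration):

import numpy as np

def correct_token_logprobs(logprob_matrix: np.ndarray, token_ids: list) -> list:
    # Row i - 1 is the distribution that predicts token i, so the first
    # token has no defined logprob (the OpenAI API reports it as null).
    return [None] + [
        float(logprob_matrix[i - 1, token_ids[i]]) for i in range(1, len(token_ids))
    ]

def current_token_logprobs(logprob_matrix: np.ndarray, token_ids: list) -> list:
    # What the library returns today: row i has already seen token i.
    return [float(logprob_matrix[i, token_ids[i]]) for i in range(len(token_ids))]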

Solutions

I have a rough solution that covers my own use case, but I have been busy recently and cannot write a thorough patch. I hope you can integrate it into the project.
Regarding the speed: Llama._create_completion() sorts a vocabulary-sized array for every token (see the screenshot below). Since I only need a single top token logprob, this sort dominates my program's runtime, so I replaced it with torch.max. For a general fix, you could add a specialized path for when the "logprobs" parameter is small.
Regarding the inconsistent logprobs: the indexing with int(token) (see the screenshot below) needs to be shifted by one position, so that each token's logprob comes from the distribution produced before that token was consumed. In my code below this is done by indexing the current position's logprobs with the id of the next token.
[Screenshot of the relevant sort and logprobs_token[int(token)] code in Llama._create_completion]
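
If you want to avoid the torch dependency, np.argpartition would also pick the top k entries without sorting the whole vocabulary; a rough, untested sketch (the helper name is mine, not part of llama.py):

import numpy as np

def top_k_logprobs(logprobs_token: np.ndarray, k: int):
    # Find the k largest logprobs and their token ids in O(V) instead of
    # the O(V log V) full-vocabulary sort currently done in Llama._create_completion().
    if k >= len(logprobs_token):
        top = np.argsort(logprobs_token)[::-1]
    else:
        top = np.argpartition(logprobs_token, -k)[-k:]    # k largest, unordered
        top = top[np.argsort(logprobs_token[top])[::-1]]  # order them by value
    return logprobs_token[top], top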
And here is my solution. It only works for recording a single top token; you can drop it into the corresponding place in llama.py if you have the same need.

        logprobs_or_none: Optional[CompletionLogprobs] = None
        if logprobs is not None:
            text_offset = 0 if echo else len(prompt)
            token_offset = 0 if echo else len(prompt_tokens[1:])
            text_offsets: List[int] = []
            token_logprobs: List[Optional[float]] = []
            tokens: List[str] = []
            top_logprobs: List[Optional[Dict[str, float]]] = []

            if echo:
                # Remove leading BOS token if exists
                all_tokens = (
                    prompt_tokens[1 if prompt_tokens[0] == self.token_bos() else 0 :]
                    + completion_tokens
                )
            else:
                all_tokens = completion_tokens

            all_token_strs = [
                self.detokenize([token], prev_tokens=all_tokens[:i]).decode(
                    "utf-8", errors="ignore"
                )
                for i, token in enumerate(all_tokens)
            ]
            all_logprobs = Llama.logits_to_logprobs(self._scores)[token_offset:]
            # TODO: may be able to change this loop to use np.take_along_dim
            token_logprobs = [0.0]
            top_logprobs = [{}]
            for idx, (token, token_str, logprobs_token) in enumerate(
                zip(all_tokens, all_token_strs, all_logprobs)
            ):
                if token == bos_token_id:
                    continue
                text_offsets.append(
                    text_offset
                    + len(
                        self.detokenize(all_tokens[:idx]).decode(
                            "utf-8", errors="ignore"
                        )
                    )
                )
                tokens.append(token_str)
                # sorted_logprobs = list(
                #     sorted(
                #         zip(logprobs_token, range(len(logprobs_token))), reverse=True
                #     )
                # )
                # Take the single best token with torch.max instead of sorting the
                # whole vocabulary (requires `import torch` at the top of llama.py).
                logprobs_token_tmp = torch.tensor(logprobs_token)
                max_logprob, max_index = torch.max(logprobs_token_tmp, dim=0)
                # Shift by one position: the logprobs at index `idx` are the distribution
                # over the *next* token, so pair them with all_tokens[idx + 1].
                if idx + 1 != len(all_tokens):
                    token_logprobs.append(logprobs_token[int(all_tokens[idx + 1])])
                # Original (off-by-one) indexing was: logprobs_token[int(token)]
                top_logprob: Optional[Dict[str, float]] = {
                    # Only the single most likely token is recorded here; the original
                    # dict comprehension over sorted_logprobs[:logprobs] is replaced.
                    self.detokenize([int(max_index)], prev_tokens=all_tokens[:idx]).decode(
                        "utf-8", errors="ignore"
                    ): float(max_logprob)
                }
                # top_logprob.update({token_str: logprobs_token[int(token)]})
                top_logprobs.append(top_logprob)
            # Weird idiosyncrasy of the OpenAI API where
            # token_logprobs and top_logprobs are null for
            # the first token.
            # if echo and len(all_tokens) > 0:
            #     token_logprobs[0] = None
            #     top_logprobs[0] = None
            logprobs_or_none = {
                "tokens": tokens,
                "text_offset": text_offsets,
                "token_logprobs": token_logprobs,
                "top_logprobs": top_logprobs,
            }
