Slow when logits_all=True, inconsistent logprobs and solutions #1983

Open
For-rest2005 opened this issue Mar 26, 2025 · 6 comments

@For-rest2005

Environment

Latest version as of 2025-03-26.

Issue

I tried to use this library to record the logits of my GGUF model, but the return value of Llama.__call__() appears to be wrong, and the process of recording logits is very slow.

Then I dug into the problem. I found that the "token_logprobs" in the return value of Llama.__call__() are not the log probabilities of generating the current token at the current position. The returned "token_logprobs" are actually the log probabilities the model assigns to the current token after it has already consumed that token, i.e. the values are shifted by one position.
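
To make the off-by-one concrete, here is a minimal, runnable sketch with made-up numbers (the names per_position_logprobs and token_ids are illustrative only and not part of the library's API):

import numpy as np

# per_position_logprobs[i] is the log-softmax distribution the model produces AFTER
# consuming token_ids[i]; shape [n_tokens, vocab_size]. The data is random and only
# serves to show which entry is being read.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))
per_position_logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
token_ids = np.array([3, 1, 7, 2, 5])

for i in range(1, len(token_ids)):
    returned = per_position_logprobs[i, token_ids[i]]      # what __call__ is observed to report
    expected = per_position_logprobs[i - 1, token_ids[i]]  # logprob of token i given tokens < i
    print(f"position {i}: returned={returned:.4f} expected={expected:.4f}")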

I wrote two Python programs to verify this.

# Script 1: llama-cpp-python
from llama_cpp import Llama

model_path = ""  # path to the GGUF model
prompt = "Once upon a time,"

model = Llama(
    model_path=model_path,
    n_gpu_layers=-1,
    n_ctx=2048,
    logits_all=True,  # keep the logits for every position, not only the last one
    main_gpu=0,
    verbose=False,
)

response = model(
    prompt=prompt,
    max_tokens=1,
    echo=True,      # include the prompt tokens in the returned logprobs
    logprobs=1,
    temperature=0,
)

logprobs = response['choices'][0]['logprobs']
tokens = logprobs['tokens']
token_logprobs = logprobs['token_logprobs']
top_logprobs = logprobs['top_logprobs']

print(response)
for i in range(len(tokens)):
    print("prob_ln = ", token_logprobs[i], "Token = ", tokens[i])

print(len(response['choices'][0]['logprobs']['top_logprobs']))

# Script 2: reference implementation with transformers
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = ""  # path to the matching HF model
prompt = "Once upon a time,"

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype='auto', revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]

print(input_ids)
with torch.no_grad():
    outputs = model(input_ids=input_ids)

logits = F.log_softmax(outputs.logits, dim=-1)
print(logits)
logits = torch.squeeze(logits)

for i in range(logits.shape[0]):
    # The distribution at row i-1 is what predicts the token at position i.
    # Replace "i-1" with "i" and you will get the same results as the llama_cpp version.
    # (For i == 0 this wraps to the last row; position 0 has no preceding context.)
    prob_ln = logits[i - 1, input_ids[0, i]].item()
    token = tokenizer.decode(input_ids[0, i])
    print(f"Position {i}: prob_ln = {prob_ln}, Token = '{token}', TokenID = {input_ids[0, i]}")

If you run these two scripts on the same model, differing only in format (GGUF vs. HF), you will see completely different outputs. You can modify the code as noted in the comments to verify my conclusion.

Solutions

I have a rough solution that covers my own needs, but I have been busy recently and cannot write a thorough fix; I hope you can complete it and merge it into the project.
For efficiency: Llama._create_completion sorts a vocabulary-sized array for every token (see the picture below). For my use case I only need a single top token logprob, and this sort takes most of the time in my program, so I replaced it with torch.max. For the general case, a dedicated optimization could be added for small values of the "logprobs" parameter (a rough sketch of that idea follows the code below).
For the inconsistent logprobs, change "int(token)" (see the picture below) so that each position's distribution is indexed with the token that follows it, i.e. "int(all_tokens[idx+1])"; the logprob of a token is then taken from the distribution produced before that token.
[Image: screenshot of the referenced code in Llama._create_completion]
Here is my solution. It only works for recording a single top token; if you have the same need, you can replace the corresponding section of llama.py with it.

        logprobs_or_none: Optional[CompletionLogprobs] = None
        if logprobs is not None:
            text_offset = 0 if echo else len(prompt)
            token_offset = 0 if echo else len(prompt_tokens[1:])
            text_offsets: List[int] = []
            token_logprobs: List[Optional[float]] = []
            tokens: List[str] = []
            top_logprobs: List[Optional[Dict[str, float]]] = []

            if echo:
                # Remove leading BOS token if exists
                all_tokens = (
                    prompt_tokens[1 if prompt_tokens[0] == self.token_bos() else 0 :]
                    + completion_tokens
                )
            else:
                all_tokens = completion_tokens

            all_token_strs = [
                self.detokenize([token], prev_tokens=all_tokens[:i]).decode(
                    "utf-8", errors="ignore"
                )
                for i, token in enumerate(all_tokens)
            ]
            all_logprobs = Llama.logits_to_logprobs(self._scores)[token_offset:]
            # TODO: may be able to change this loop to use np.take_along_dim
            # The first token has no preceding context, so its entries are seeded
            # with placeholders (cf. the OpenAI convention noted below).
            token_logprobs = [0.0]
            top_logprobs = [{}]
            for idx, (token, token_str, logprobs_token) in enumerate(
                zip(all_tokens, all_token_strs, all_logprobs)
            ):
                if token == bos_token_id:
                    continue
                text_offsets.append(
                    text_offset
                    + len(
                        self.detokenize(all_tokens[:idx]).decode(
                            "utf-8", errors="ignore"
                        )
                    )
                )
                tokens.append(token_str)
                # Original code: a full sort over the entire vocabulary for every token.
                # sorted_logprobs = list(
                #     sorted(
                #         zip(logprobs_token, range(len(logprobs_token))), reverse=True
                #     )
                # )
                # Replacement: only the single top entry is needed, so torch.max
                # (max value and argmax in one pass) avoids the full sort.
                # NOTE: this requires `import torch` at the top of llama.py.
                logprobs_token_tmp = torch.tensor(logprobs_token)
                max_logprob, max_index = torch.max(logprobs_token_tmp, dim=0)
                # Index the current position's distribution with the *next* token, so
                # token_logprobs lines up as "logprob of token i given tokens < i".
                if idx + 1 != len(all_tokens):
                    token_logprobs.append(logprobs_token[int(all_tokens[idx + 1])])
                top_logprob: Optional[Dict[str, float]] = {
                    # Original code iterated over sorted_logprobs[:logprobs] here.
                    self.detokenize([int(max_index)], prev_tokens=all_tokens[:idx]).decode(
                        "utf-8", errors="ignore"
                    ): float(max_logprob)
                }
                # top_logprob.update({token_str: logprobs_token[int(token)]})
                top_logprobs.append(top_logprob)
            # Weird idiosyncrasy of the OpenAI API where
            # token_logprobs and top_logprobs are null for
            # the first token.
            # if echo and len(all_tokens) > 0:
            #     token_logprobs[0] = None
            #     top_logprobs[0] = None
            logprobs_or_none = {
                "tokens": tokens,
                "text_offset": text_offsets,
                "token_logprobs": token_logprobs,
                "top_logprobs": top_logprobs,
            }
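
As a follow-up to the efficiency point above: when the "logprobs" parameter is small, the full vocabulary sort can be replaced by a partial selection. Below is a rough, untested sketch of that idea (not code from llama.py; logprobs_token stands for one position's 1-D array of log probabilities and k for the requested number of top entries):

import numpy as np

def top_k_logprobs(logprobs_token: np.ndarray, k: int):
    """Return the k (logprob, token_id) pairs with the highest log probability,
    sorted in descending order, without sorting the entire vocabulary."""
    # np.argpartition selects the top-k indices in O(V); only those k entries
    # are then fully sorted.
    top_idx = np.argpartition(logprobs_token, -k)[-k:]
    top_idx = top_idx[np.argsort(logprobs_token[top_idx])[::-1]]
    return [(float(logprobs_token[i]), int(i)) for i in top_idx]

The returned list has the same (logprob, token_id) layout as the sorted_logprobs list produced by the commented-out sort, so it could slot into the same place.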

For-rest2005 changed the title from "Slow when logits _all=True, inconsistent logprobs and solutions" to "Slow when logits_all=True, inconsistent logprobs and solutions" on Mar 26, 2025
@kunxiongzhu

@For-rest2005 Unfortunately, I modified the code and rebuilt it, but lm-eval still returns the wrong accuracy.

Even when I run the two sample scripts you provided, they still return different values.

@For-rest2005
Author

@kunxiongzhu Sorry, I have forgotten the details by now, but I can tell you my results. The values returned by my example code differ slightly (0.2%-5%) in my environment. The "acc_norm" results for GGUF and HF in lm-eval are almost the same (< 0.1%), but the "acc" results differ by 2%-3%. This difference may come from the different activation value types.

@kunxiongzhu

@For-rest2005 Thank you for your response. After using "token_logprobs.append(logprobs_token[int(all_tokens[idx])])" instead of "token_logprobs.append(logprobs_token[int(all_tokens[idx+1])])", the returned values are similar, as shown below. However, when I run lm-eval, there is still a significant difference in accuracy between Transformers and llama-cpp. Could you please help me identify any mistakes I might have made?

transformers demo code result:
Loading checkpoint shards: 100%|██████████| 2/2 [00:41<00:00, 20.71s/it]
tensor([[ 1, 9038, 2501, 263, 931, 29892]])
tensor([[[-10.9297, -11.2578, -10.7188, ..., -9.7031, -9.1562, -10.3906],
[-22.6094, -15.9297, -15.0391, ..., -18.9844, -20.4531, -20.3750],
[-25.8750, -22.0000, -18.7500, ..., -25.0156, -27.8438, -24.7344],
[-27.7188, -24.5469, -18.6875, ..., -24.9219, -28.4062, -25.7500],
[-20.6562, -19.1875, -14.9062, ..., -20.5312, -20.0000, -23.3281],
[-21.3125, -17.2344, -13.0312, ..., -16.0156, -17.8750, -18.4531]]],
dtype=torch.float16)
Position 0: prob_ln = -17.234375, Token = '', TokenID = 1
Position 1: prob_ln = -11.1171875, Token = 'Once', TokenID = 9038
Position 2: prob_ln = -0.85009765625, Token = 'upon', TokenID = 2501
Position 3: prob_ln = 0.0, Token = 'a', TokenID = 263
Position 4: prob_ln = 0.0, Token = 'time', TokenID = 931
Position 5: prob_ln = -0.08154296875, Token = ',', TokenID = 29892

llama-cpp demo code result:
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
{'id': 'cmpl-d3d0bdb7-f785-4f36-9559-2e3a50c9f912', 'object': 'text_completion', 'created': 1744041809, 'model': '/mnt/data/ehdd1/home/kz96891/models/gguf/Llama-2-7b-chat-hf/Llama-2-7b-chat-hf-fp16.gguf', 'choices': [{'text': 'Once upon a time, in', 'index': 0, 'logprobs': {'tokens': [' Once', ' upon', ' a', ' time', ',', ' in'], 'text_offset': [0, 5, 10, 12, 17, 18], 'token_logprobs': [0.0, -11.185795, -0.85255027, -0.00034016545, -0.00026854247, -0.080405265], 'top_logprobs': [{}, {' Unterscheidung': tensor(-4.0696)}, {' upon': tensor(-0.8526)}, {' a': tensor(-0.0003)}, {' time': tensor(-0.0003)}, {',': tensor(-0.0804)}, {' in': tensor(-0.8544)}]}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 1, 'total_tokens': 7}}
prob_ln = 0.0 Token = Once
prob_ln = -11.185795 Token = upon
prob_ln = -0.85255027 Token = a
prob_ln = -0.00034016545 Token = time
prob_ln = -0.00026854247 Token = ,
prob_ln = -0.080405265 Token = in
7
Exception ignored in: <function Llama.__del__ at 0x71bc48b70040>
Traceback (most recent call last):
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/llama.py", line 4651, in __del__
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/llama.py", line 4648, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 618, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 610, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 595, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 360, in __exit__
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/_internals.py", line 75, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 618, in close
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 610, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 595, in __exit__
File "/home/myid/kz96891/anaconda3/envs/python3.10/lib/python3.12/contextlib.py", line 478, in _exit_wrapper
File "/home/myid/kz96891/Code/non-determinism/framework/llama-cpp-python/llama_cpp/_internals.py", line 69, in free_model
TypeError: 'NoneType' object is not callable

transformers accuracy:
"results": {
"commonsense_qa": {
"alias": "commonsense_qa",
"acc,none": 0.5823095823095823,
"acc_stderr,none": 0.014119662277692187
}
}

llama-cpp accuracy:
"results": {
"commonsense_qa": {
"alias": "commonsense_qa",
"acc,none": 0.29238329238329236,
"acc_stderr,none": 0.013022531002213258
}
}

@For-rest2005
Author

@kunxiongzhu Did you use the API model in this PR?

@kunxiongzhu

@For-rest2005 Yes, I did, but it still returns the wrong accuracy

@For-rest2005
Author

For-rest2005 commented Apr 12, 2025

[Image]
@kunxiongzhu You may print the intermediate "logprobs" values to check whether you have applied the change correctly.
The code I uploaded here is slightly different from the code on my machine, but I am sure the version on my machine works correctly.
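
For example, a combined check along these lines could help (a sketch only, reusing the placeholder model paths and the same indexing as the two scripts in the issue body; it is not part of either library):

import torch
import torch.nn.functional as F
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer

gguf_path = ""  # path to the GGUF model
hf_path = ""    # path to the matching HF model
prompt = "Once upon a time,"

# llama-cpp-python side: per-token logprobs as returned by __call__.
llm = Llama(model_path=gguf_path, n_gpu_layers=-1, n_ctx=2048, logits_all=True, verbose=False)
resp = llm(prompt=prompt, max_tokens=1, echo=True, logprobs=1, temperature=0)
lp = resp['choices'][0]['logprobs']
llama_pairs = list(zip(lp['tokens'], lp['token_logprobs']))

# transformers side: row i-1 of the log-softmax predicts the token at position i.
hf_model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype='auto')
tok = AutoTokenizer.from_pretrained(hf_path)
ids = tok(prompt, return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = F.log_softmax(hf_model(input_ids=ids).logits, dim=-1).squeeze(0)
hf_pairs = [(tok.decode(ids[0, i]), logits[i - 1, ids[0, i]].item())
            for i in range(1, ids.shape[1])]

print("llama-cpp   :", llama_pairs)
print("transformers:", hf_pairs)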
