Slow when logits_all=True, inconsistent logprobs and solutions #1983
Comments
@For-rest2005 Unfortunately, after I modified the code and rebuilt it, lm-eval still returns the wrong accuracy. Even when I ran the two sample programs you provided, they still returned different values.
@kunxiongzhu Sorry, I have forgotten the details by now, but I can share my results. The values my example code returns differ slightly (0.2%-5%) in my environment. In lm-eval, the "acc_norm" results for gguf and hf are almost the same (< 0.1%), but the "acc" results differ by 2%-3%. This difference may come from the different activation value types.
@For-rest2005 Thank you for your response. After changing the line to "token_logprobs.append(logprobs_token[int(all_tokens[idx])])" instead of "token_logprobs.append(logprobs_token[int(all_tokens[idx+1])])", the returned values are similar, as shown below. However, when I run lm-eval there still appears to be a significant difference in accuracy between Transformers and llama-cpp. Could you please help me identify any mistakes I might have made? (Screenshots: transformers demo code result, llama-cpp demo code result, transformers accuracy, llama-cpp accuracy.)
@kunxiongzhu Did you use the API model in this PR?
@For-rest2005 Yes, I did, but it still returns the wrong accuracy.
Environment
Latest version as of 2025-03-26.
Issue
I tried to use this library to record the logits of my gguf model, but there seem to be errors in the return value of Llama.__call__(), and the process of recording logits is far too slow.
I dug into the problem and found that the "token_logprobs" in the return value of Llama.__call__() are not the log probabilities of generating the current token at the current position, i.e. conditioned only on the preceding tokens. The returned "token_logprobs" are actually the log probabilities the model assigns to the current token after it has already received that token, so everything is shifted by one position.
I wrote two Python programs to verify this; a sketch of the idea follows below. If you run the same model in the two formats (identical weights, only the format differs), you will see totally different outputs, and you can modify the code following my comments to verify my conclusion.
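A minimal sketch of that kind of comparison (not the original programs; the model paths are placeholders, and it assumes the HF and GGUF tokenizers split the prompt into the same tokens so a position-by-position comparison is meaningful):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

prompt = "The quick brown fox jumps over the lazy dog"
hf_path = "path/to/hf-model"      # placeholder
gguf_path = "path/to/model.gguf"  # placeholder

# Reference log-probs from transformers: the logits at position i predict token i+1,
# so the log-prob of token i is read from the row at position i-1.
tok = AutoTokenizer.from_pretrained(hf_path)
model = AutoModelForCausalLM.from_pretrained(hf_path)
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logprobs = torch.log_softmax(model(ids).logits, dim=-1)
hf_token_logprobs = [
    logprobs[0, i - 1, ids[0, i]].item() for i in range(1, ids.shape[1])
]
print("transformers:", [round(x, 4) for x in hf_token_logprobs])

# llama-cpp-python: echo the prompt and request logprobs (requires logits_all=True).
llm = Llama(model_path=gguf_path, logits_all=True, verbose=False)
out = llm(prompt, max_tokens=1, echo=True, logprobs=1)
print("llama-cpp   :", out["choices"][0]["logprobs"]["token_logprobs"])
```

With a correct implementation the two sets of per-token log probabilities should agree (apart from the first token, which has no conditional log probability); with the current behaviour they differ noticeably.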
Solutions
I have a rough solution that covers my own needs, but I have been busy recently and have not had time to write a thorough implementation. I hope you can complete it and merge it into the project.

Regarding efficiency: Llama._create_completion sorts a vocab-size tensor at every position (see the picture below). For my use case I only need one top token logprob, and this sort accounts for most of the time in my program; I replaced it with torch.max. For the general case, you could add a specific optimization for when the parameter "logprobs" is small, as sketched below.
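For illustration, here is a hedged sketch of that optimization in plain NumPy (torch.max would work the same way); the function name and variables are illustrative, not the exact ones used in llama.py:

```python
import numpy as np

def top_k_logprobs(row: np.ndarray, k: int):
    """Return the k largest (logprob, token_id) pairs of one vocab-size row
    without a full O(V log V) sort."""
    if k == 1:
        i = int(np.argmax(row))            # single O(V) pass
        return [(float(row[i]), i)]
    idx = np.argpartition(row, -k)[-k:]    # O(V) partial selection of the k largest
    idx = idx[np.argsort(row[idx])[::-1]]  # sort only those k entries
    return [(float(row[i]), int(i)) for i in idx]
```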
Regarding the inconsistent logprobs: you just need to change "int(token)" (see the picture below) to "pre_token", so that each log probability is paired with the position that actually predicted that token.
Here is my solution. It only works for recording one top token, but you can drop it into the corresponding place in llama.py if you have the same need.
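Since the original snippet is attached as a screenshot and not reproduced here, the following is only a sketch of the intended behaviour under the diagnosis above, with illustrative names rather than the exact llama.py variables: the score row produced after token i is the distribution over token i+1, so each recorded log probability is read from the row one position earlier, and a single argmax replaces the full sort because only one top token is kept.

```python
import numpy as np

def echo_token_logprobs(all_tokens, all_logprobs):
    """all_logprobs[i] is the log-prob distribution produced after all_tokens[i],
    i.e. the distribution that predicts all_tokens[i + 1]."""
    token_logprobs = [None]  # the first token has no conditional log probability
    top_tokens = [None]
    for idx in range(1, len(all_tokens)):
        prev_row = all_logprobs[idx - 1]          # the row that predicted this token
        token_logprobs.append(float(prev_row[int(all_tokens[idx])]))
        best = int(np.argmax(prev_row))           # one top token, no full sort
        top_tokens.append((best, float(prev_row[best])))
    return token_logprobs, top_tokens
```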