Description
Environment
Latest version as of 2025.03.26
Issue
I tried to use this library to record the logits of my gguf model, but there appear to be errors in the return value of Llama.__call__(), and the process of recording logits is very slow.
Digging into the problem, I found that the "token_logprobs" in the return value of Llama.__call__() are not the log probabilities of each token given the preceding context. Instead, the returned "token_logprobs" are the log probabilities of each token taken from the logits produced after the model has already consumed that token, i.e. they are shifted by one position.
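In index terms, here is a toy sketch with made-up numbers, just to pin down the convention (rows[p] stands for the log-softmax of the logits emitted after position p):

import numpy as np

rng = np.random.default_rng(0)
token_ids = [11, 4, 7, 2]  # a toy 4-token prompt
rows = np.log(rng.dirichlet(np.ones(16), size=len(token_ids)))  # rows[p]: log-softmax after position p

# What I expect: token i is scored by the distribution from position i - 1.
expected = [None] + [float(rows[i - 1, token_ids[i]]) for i in range(1, len(token_ids))]
# What Llama.__call__() appears to return: token i scored by its own position's distribution.
observed = [float(rows[i, token_ids[i]]) for i in range(len(token_ids))]
print(expected)
print(observed)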
I wrote two Python programs to verify this.
from llama_cpp import Llama

model_path = ""  # gguf model path
prompt = "Once upon a time,"

model = Llama(
    model_path=model_path,
    n_gpu_layers=-1,
    n_ctx=2048,
    logits_all=True,
    main_gpu=0,
    verbose=False,
)

response = model(
    prompt=prompt,
    max_tokens=1,
    echo=True,
    logprobs=1,
    temperature=0,
)

logprobs = response['choices'][0]['logprobs']
tokens = logprobs['tokens']
token_logprobs = logprobs['token_logprobs']
top_logprobs = logprobs['top_logprobs']
print(response)

for i in range(len(tokens)):
    print("prob_ln = ", token_logprobs[i], "Token = ", tokens[i])
print(len(response['choices'][0]['logprobs']['top_logprobs']))
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = ""  # hf model path (same weights as the gguf model above)
prompt = "Once upon a time,"

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype='auto', revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"]
print(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids)
logits = F.log_softmax(outputs.logits, dim=-1)
print(logits)
logits = torch.squeeze(logits)

for i in range(logits.shape[0]):
    # Log prob of token i under the distribution from position i - 1.
    # Replace "i - 1" with "i" and you will get the same results as the llama-cpp-python version.
    # (At i == 0 this wraps to the last row; only i >= 1 is meaningful.)
    prob_ln = logits[i - 1, input_ids[0, i]].item()
    token = tokenizer.decode(input_ids[0, i])
    print(f"Position {i}: prob_ln = {prob_ln}, Token = '{token}', TokenID = {input_ids[0, i]}")
If you run the two scripts with models that differ only in format (the same weights as a gguf file and as a Hugging Face checkpoint), you will get totally different outputs, and you can modify the code as noted in my comments to verify my conclusion.
Solutions
I have a rough solution that covers my own need, but I have been busy lately and cannot write a thorough patch. I hope you can turn it into a complete fix for the project.
Regarding efficiency: Llama._create_completion sorts a vocab-sized array for every token (see the picture below; it is the sorted_logprobs = list(sorted(...)) block, which is also visible commented out in my code further down). For my use case I only need the single top token's logprob, and this sort accounts for most of the time in my program, so I replaced it with torch.max. For a proper fix, you may want a dedicated optimization for the case where the "logprobs" parameter is small.
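As a side note, not a patch, just a rough sketch to pin down the idea: the rows produced by Llama.logits_to_logprobs appear to be plain numpy arrays, so the same shortcut is available without a torch dependency, and np.argpartition covers the small-logprobs case:

import numpy as np

# Toy stand-in for one row of Llama.logits_to_logprobs(self._scores):
# a 1-D array of log probabilities over the whole vocabulary.
vocab_size = 32000  # placeholder value
logprobs_token = np.log(np.random.default_rng(0).dirichlet(np.ones(vocab_size)))

# Current approach: full O(V log V) sort of the vocabulary.
sorted_logprobs = sorted(zip(logprobs_token, range(len(logprobs_token))), reverse=True)

# Single top token: a plain argmax is O(V).
max_index = int(np.argmax(logprobs_token))
max_logprob = float(logprobs_token[max_index])

# Small k (k = logprobs): partial selection, then sort only the k survivors.
k = 5
top_k = np.argpartition(logprobs_token, -k)[-k:]
top_k = top_k[np.argsort(logprobs_token[top_k])[::-1]]  # descending by logprob

assert sorted_logprobs[0][1] == max_index
assert [i for _, i in sorted_logprobs[:k]] == top_k.tolist()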
Regarding the inconsistent logprobs: you just need to change the "int(token)" lookup (see the picture below) so that each logprob row is paired with the token that comes after it rather than the token that produced it, as done in my code below.
Here is my solution. It only works for recording a single top token, but you can drop it into the corresponding place in llama.py if you have the same need.
logprobs_or_none: Optional[CompletionLogprobs] = None
if logprobs is not None:
    text_offset = 0 if echo else len(prompt)
    token_offset = 0 if echo else len(prompt_tokens[1:])
    text_offsets: List[int] = []
    token_logprobs: List[Optional[float]] = []
    tokens: List[str] = []
    top_logprobs: List[Optional[Dict[str, float]]] = []

    if echo:
        # Remove leading BOS token if exists
        all_tokens = (
            prompt_tokens[1 if prompt_tokens[0] == self.token_bos() else 0 :]
            + completion_tokens
        )
    else:
        all_tokens = completion_tokens

    all_token_strs = [
        self.detokenize([token], prev_tokens=all_tokens[:i]).decode(
            "utf-8", errors="ignore"
        )
        for i, token in enumerate(all_tokens)
    ]
    all_logprobs = Llama.logits_to_logprobs(self._scores)[token_offset:]
    # TODO: may be able to change this loop to use np.take_along_dim

    # Seed both lists so that entry i lines up with token i (the first token
    # has no preceding context, hence the placeholder values).
    token_logprobs = [0.0]
    top_logprobs = [{}]
    for idx, (token, token_str, logprobs_token) in enumerate(
        zip(all_tokens, all_token_strs, all_logprobs)
    ):
        if token == bos_token_id:
            continue
        text_offsets.append(
            text_offset
            + len(
                self.detokenize(all_tokens[:idx]).decode(
                    "utf-8", errors="ignore"
                )
            )
        )
        tokens.append(token_str)
        # sorted_logprobs = list(
        #     sorted(
        #         zip(logprobs_token, range(len(logprobs_token))), reverse=True
        #     )
        # )
        # torch.max over the vocab instead of a full sort; requires
        # `import torch` at the top of llama.py (np.argmax would also work).
        logprobs_token_tmp = torch.tensor(logprobs_token)
        max_logprob, max_index = torch.max(logprobs_token_tmp, dim=0)
        if idx + 1 != len(all_tokens):
            # Log prob of the *next* token under the distribution produced
            # at this position (this is the off-by-one fix).
            token_logprobs.append(float(logprobs_token[int(all_tokens[idx + 1])]))
            # print(token, logprobs_token[int(token)])
            top_logprob: Optional[Dict[str, float]] = {
                # self.detokenize([i], prev_tokens=all_tokens[:idx]).decode(
                #     "utf-8", errors="ignore"
                # ): logprob
                self.detokenize([int(max_index)], prev_tokens=all_tokens[:idx]).decode(
                    "utf-8", errors="ignore"
                ): float(max_logprob)
            }
            # top_logprob.update({token_str: logprobs_token[int(token)]})
            top_logprobs.append(top_logprob)
    # Weird idiosyncrasy of the OpenAI API where
    # token_logprobs and top_logprobs are null for
    # the first token.
    # if echo and len(all_tokens) > 0:
    #     token_logprobs[0] = None
    #     top_logprobs[0] = None
    logprobs_or_none = {
        "tokens": tokens,
        "text_offset": text_offsets,
        "token_logprobs": token_logprobs,
        "top_logprobs": top_logprobs,
    }
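For completeness, here is a rough, untested sketch of how the same loop body might support logprobs > 1, combining the index shift above with np.argpartition instead of torch.max. It reuses the names from my snippet above (all_tokens, all_token_strs, all_logprobs, text_offset, text_offsets, tokens, bos_token_id, and the logprobs parameter of _create_completion), assumes 0 < logprobs < n_vocab, and relies on numpy already being imported as np in llama.py:

token_logprobs = [0.0]  # placeholders for the first token, as above
top_logprobs = [{}]
for idx, (token, token_str, logprobs_token) in enumerate(
    zip(all_tokens, all_token_strs, all_logprobs)
):
    if token == bos_token_id:
        continue
    text_offsets.append(
        text_offset
        + len(self.detokenize(all_tokens[:idx]).decode("utf-8", errors="ignore"))
    )
    tokens.append(token_str)
    if idx + 1 != len(all_tokens):
        next_token = int(all_tokens[idx + 1])
        # Log prob of the *next* token under this position's distribution.
        token_logprobs.append(float(logprobs_token[next_token]))
        # Top-k candidates for the next position, without a full vocab sort.
        top_idx = np.argpartition(logprobs_token, -logprobs)[-logprobs:]
        top_idx = top_idx[np.argsort(logprobs_token[top_idx])[::-1]]  # descending
        top_logprob = {
            self.detokenize([int(i)], prev_tokens=all_tokens[:idx]).decode(
                "utf-8", errors="ignore"
            ): float(logprobs_token[i])
            for i in top_idx
        }
        top_logprobs.append(top_logprob)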