Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
Expected llama-cpp-python to correctly override the model metadata when passing {"tokenizer.ggml.pre": "llama3"} as the kv_overrides argument.
Current Behavior
The string value of the override always comes through empty when the model loads, as the log line

validate_override: Using metadata override ( str) 'tokenizer.ggml.pre' =

shows, and so the model ends up using the default pre-tokenizer instead of the llama3 one.
Example output:
```
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
validate_override: Using metadata override ( str) 'tokenizer.ggml.pre' =
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
...
```
Environment and Context
- WSL with Ubuntu 20.04
```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 80
Model name: AMD Ryzen 9 5900HX with Radeon Graphics
Stepping: 0
CPU MHz: 3293.809
BogoMIPS: 6587.61
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 4 MiB
L3 cache: 16 MiB
$ uname -a
Linux LAPTOP 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```
- SDK versions:
```
$ python3 --version
Python 3.8.10
$ make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 13.1.0-8ubuntu1~20.04.2) 13.1.0
```
Failure Information (for bugs)
Steps to Reproduce
I'm running the following code:
```python
from llama_cpp import Llama

my_model_path = "./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
CONTEXT_SIZE = 8000
model = Llama(model_path=my_model_path, kv_overrides={"tokenizer.ggml.pre": "llama3"}, n_ctx=CONTEXT_SIZE)
```
Findings
I stepped through the code with a debugger, and the problem seems to be the following call in Llama.__init__, where kv_overrides is converted into the C overrides array:
```python
ctypes.memmove(
    self._kv_overrides_array[i].value.str_value,
    v_bytes,
    min(len(v_bytes), 128),
)
```
The memmove never reaches the struct. As far as I can tell, ctypes special-cases c_char-array struct fields: reading self._kv_overrides_array[i].value.str_value returns a temporary, NUL-truncated bytes copy rather than a writable view of the struct's memory, so memmove writes into that throwaway copy and str_value stays empty.
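Here is a minimal, self-contained sketch of the behavior. The Value and Override types below are hypothetical stand-ins that only mimic the str_value layout of llama_model_kv_override; treat this as a sketch of my hypothesis, not the real bindings:

```python
import ctypes

# Hypothetical stand-ins mimicking llama_model_kv_override's layout,
# with just enough structure to demonstrate the ctypes behavior.
class Value(ctypes.Union):
    _fields_ = [("str_value", ctypes.c_char * 128)]

class Override(ctypes.Structure):
    _fields_ = [("value", Value)]

ov = Override()

# Reading a c_char-array field returns a temporary, NUL-truncated bytes
# copy, not a writable view of the struct's memory. That copy is what
# memmove receives as its destination in the snippet above.
print(type(ov.value.str_value))  # <class 'bytes'>

# Writing through the field's actual address does update the struct:
ctypes.memmove(
    ctypes.addressof(ov.value) + Value.str_value.offset,
    b"llama3",
    len(b"llama3"),
)
print(ov.value.str_value)  # b'llama3'

# Plain attribute assignment also works: the ctypes setter copies the
# bytes into the struct itself.
ov.value.str_value = b"llama3"
```

If this diagnosis is right, pointing memmove at the field's real address (or simply assigning the padded bytes to str_value) should fix the empty override.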