Eval bug: Nondeterministic output with ROCm backend despite zero temperature #14727

@Googulator

Name and Version

Custom build of b5849 for ROCm 6.4.1 (RDNA3). For some reason, the built-in version reports 0.

./build/bin/llama-cli --version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7800 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 0 (unknown)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

HIP

Hardware

AMD Ryzen 5 4500 + AMD Radeon RX 7800 XT (also reproducible using AMD Radeon RX 7900 XT)

Models

https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

Problem description & steps to reproduce

llamapost.txt

The attached llama-server request (llamapost.txt) uses deterministic (greedy) sampling, i.e. a temperature of 0, but it generates nondeterministic output on ROCm with an RDNA3 GPU (tested with an RX 7800 XT and an RX 7900 XT). On CPU or on an NVIDIA GPU, it behaves deterministically, as expected.

Run the server with this command:

build/bin/llama-server -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 60000 -ngl 50 -np 1 --flash-attn --chat-template llama3 --host 0.0.0.0 --port 8012 --mlock --no-warmup -t 4

and then call it 10 times:

for a in $(seq 10); do curl 'http://localhost:8012/completion' \
   -X POST \
   -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:140.0) Gecko/20100101 Firefox/140.0' \
   -H 'Accept-Language: en-US,en;q=0.5' \
   -H 'Accept-Encoding: gzip, deflate' \
   -H 'Cache-Control: no-cache' \
   -H 'Authorization: ' \
   -H 'Content-Type: application/json' \
   -H 'Origin: http://localhost:8012' \
   -H 'Connection: keep-alive' \
   -H 'Referer: http://localhost:8012/' \
   -H 'Priority: u=0' \
   -H 'Pragma: no-cache' \
   -T llamapost.txt 2>/dev/null | jq -r '.content' | sha256sum; done

All resulting SHA-256 hashes should be identical, but they vary widely instead. (If the 10 calls are made immediately after the server is started, the same list of 10 hashes tends to appear for each set of 10 calls.)
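To make the comparison easier to read, the hashes can be summarized instead of inspected by eye. The following is a minimal sketch; `summarize_hashes` and `count_distinct` are hypothetical helper names, not part of llama.cpp, and both assume one `sha256sum` output line per request on stdin.

```shell
# Hypothetical helpers (not part of llama.cpp). Both read one
# `sha256sum` output line per request from stdin.

# Print each distinct hash with its occurrence count, most frequent first.
summarize_hashes() {
    sort | uniq -c | sort -rn
}

# Print only the number of distinct hashes; 1 means every response matched.
count_distinct() {
    sort -u | wc -l | tr -d ' '
}
```

Piping the output of the loop above into `count_distinct` should print 1 for a deterministic backend; any larger number means at least two responses differed.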

The issue is not specific to this one GGUF; many other models (e.g. Qwen2.5-Coder, Mistral-NeMo) behave the same way. Some of these show a pattern in their responses (e.g. Qwen2.5-Coder alternates between 2 responses in an A-B-A-B pattern, and the first call after starting llama-server consistently gives the same response), while others (e.g. Llama 3.1) seem completely random. Disabling Flash Attention changes the pattern, but doesn't fix the issue.
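Patterns such as A-B-A-B are easier to spot when each distinct hash is replaced by a letter in order of first appearance. A minimal sketch follows; `label_pattern` is a hypothetical helper name, it assumes one `sha256sum` output line per request on stdin, and it is only meaningful for up to 26 distinct hashes.

```shell
# Hypothetical helper (not part of llama.cpp): map each distinct hash to a
# letter in order of first appearance (A, B, C, ...) and print the sequence.
label_pattern() {
    awk '
        !($1 in id) { id[$1] = n++ }         # first sighting: assign next index
        { printf "%c", 65 + id[$1] }          # 65 is ASCII "A"
        END { print "" }                      # terminate the line
    '
}
```

Fed with the hashes from the loop above, an alternating model would print a line such as ABABABABAB, while a deterministic backend would print only A's.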

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 0 (unknown) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
system info: n_threads = 4, n_threads_batch = 4, total_threads = 12

system_info: n_threads = 4 (n_threads_batch = 4) / 12 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 18080, http threads: 11
main: loading model
srv    load_model: loading model '../models/BroadBit/llama-3.1-8b-instruct-q8_0.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 7800 XT) - 16176 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from ../models/BroadBit/llama-3.1-8b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Meta Llama 3.1 8B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv  11:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv  12:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv  13:                          llama.block_count u32              = 32
llama_model_loader: - kv  14:                       llama.context_length u32              = 131072
llama_model_loader: - kv  15:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  16:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 7
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:  226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 7.95 GiB (8.50 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Llama 3.1 8B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        ROCm0 model buffer size =  7605.33 MiB
load_tensors:   CPU_Mapped model buffer size =   532.31 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 60000
llama_context: n_ctx_per_seq = 60000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (60000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      ROCm0 KV buffer size =  7520.00 MiB
llama_kv_cache_unified: size = 7520.00 MiB ( 60160 cells,  32 layers,  1 seqs), K (f16): 3760.00 MiB, V (f16): 3760.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      ROCm0 compute buffer size =   350.50 MiB
llama_context:  ROCm_Host compute buffer size =   125.51 MiB
llama_context: graph nodes  = 1031
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 60160
Failed to infer a tool call example (possible template bug)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 60160
main: model loaded
main: chat template, chat_template: llama3, example_format: '<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
main: server is listening on http://0.0.0.0:18080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 0 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     854.48 ms /  1013 tokens (    0.84 ms per token,  1185.52 tokens per second)
       eval time =   11457.35 ms /   512 tokens (   22.38 ms per token,    44.69 tokens per second)
      total time =   12311.83 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 513 | processing task
slot update_slots: id  0 | task 513 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 513 | kv cache rm [0, end)
slot update_slots: id  0 | task 513 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 513 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 513 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 513 |
prompt eval time =     747.38 ms /  1013 tokens (    0.74 ms per token,  1355.39 tokens per second)
       eval time =   11481.99 ms /   512 tokens (   22.43 ms per token,    44.59 tokens per second)
      total time =   12229.37 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 1026 | processing task
slot update_slots: id  0 | task 1026 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 1026 | kv cache rm [0, end)
slot update_slots: id  0 | task 1026 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 1026 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 1026 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 1026 |
prompt eval time =     744.58 ms /  1013 tokens (    0.74 ms per token,  1360.50 tokens per second)
       eval time =   11473.80 ms /   512 tokens (   22.41 ms per token,    44.62 tokens per second)
      total time =   12218.38 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 1539 | processing task
slot update_slots: id  0 | task 1539 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 1539 | kv cache rm [0, end)
slot update_slots: id  0 | task 1539 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 1539 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 1539 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 1539 |
prompt eval time =     745.49 ms /  1013 tokens (    0.74 ms per token,  1358.85 tokens per second)
       eval time =   11465.44 ms /   512 tokens (   22.39 ms per token,    44.66 tokens per second)
      total time =   12210.93 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 2052 | processing task
slot update_slots: id  0 | task 2052 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 2052 | kv cache rm [0, end)
slot update_slots: id  0 | task 2052 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 2052 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 2052 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 2052 |
prompt eval time =     744.08 ms /  1013 tokens (    0.73 ms per token,  1361.41 tokens per second)
       eval time =   11473.04 ms /   512 tokens (   22.41 ms per token,    44.63 tokens per second)
      total time =   12217.12 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 2565 | processing task
slot update_slots: id  0 | task 2565 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 2565 | kv cache rm [0, end)
slot update_slots: id  0 | task 2565 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 2565 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 2565 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 2565 |
prompt eval time =     745.22 ms /  1013 tokens (    0.74 ms per token,  1359.32 tokens per second)
       eval time =   11456.97 ms /   512 tokens (   22.38 ms per token,    44.69 tokens per second)
      total time =   12202.19 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 3078 | processing task
slot update_slots: id  0 | task 3078 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 3078 | kv cache rm [0, end)
slot update_slots: id  0 | task 3078 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 3078 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 3078 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 3078 |
prompt eval time =     742.43 ms /  1013 tokens (    0.73 ms per token,  1364.44 tokens per second)
       eval time =   11453.73 ms /   512 tokens (   22.37 ms per token,    44.70 tokens per second)
      total time =   12196.16 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 3591 | processing task
slot update_slots: id  0 | task 3591 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 3591 | kv cache rm [0, end)
slot update_slots: id  0 | task 3591 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 3591 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 3591 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 3591 |
prompt eval time =     742.84 ms /  1013 tokens (    0.73 ms per token,  1363.69 tokens per second)
       eval time =   11440.67 ms /   512 tokens (   22.35 ms per token,    44.75 tokens per second)
      total time =   12183.50 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
slot launch_slot_: id  0 | task 4104 | processing task
slot update_slots: id  0 | task 4104 | new prompt, n_ctx_slot = 60160, n_keep = 0, n_prompt_tokens = 1013
slot update_slots: id  0 | task 4104 | kv cache rm [0, end)
slot update_slots: id  0 | task 4104 | prompt processing progress, n_past = 1013, n_tokens = 1013, progress = 1.000000
slot update_slots: id  0 | task 4104 | prompt done, n_past = 1013, n_tokens = 1013
slot      release: id  0 | task 4104 | stop processing: n_past = 1524, truncated = 0
slot print_timing: id  0 | task 4104 |
prompt eval time =     744.12 ms /  1013 tokens (    0.73 ms per token,  1361.34 tokens per second)
       eval time =   11442.04 ms /   512 tokens (   22.35 ms per token,    44.75 tokens per second)
      total time =   12186.16 ms /  1525 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /completion 127.0.0.1 200
