Thank you for the work here. I really am kinda desperate to finally make some use of my Snapdragon NPU. After failing to use Microsoft Olive for model conversion, I revisited your repo.
hf download Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --include "Qwen2.5-Coder-7B-Instruct-Q6_K.gguf" --local-dir C:\Users\me\models
.\llama-server.exe -m C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf --host 0.0.0.0 --port 8080 -ngl 99
PS C:\Users\me\workspace\llama-cpp-qnn-builder\llama.cpp\build-arm64-windows-llvm-debug\bin> .\llama-server.exe -m C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf --host 0.0.0.0 --port 8080 -ngl 99
backend registry init
skip hexagon device 3
qnn backend registry skip CPU device
skip device 0
register_backend: registered backend qualcomm (2 devices)
register_device: registered device qnn-npu (Hexagon NPU)
register_device: registered device qnn-gpu (Adreno GPU)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Oryon(TM) CPU)
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7771 (897501a78) with Clang 20.1.8 for Windows arm64 (debug)
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | OPENMP = 1 | REPACK = 1 |
init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
failed to load libcdsprpc.dll, error: (null)
failed to load rpcmem lib
llama_params_fit_impl: projected to use 0 MiB of device memory vs. 32326 MiB of free device memory
llama_params_fit_impl: will leave 23928 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 2.82 seconds
llama_model_load_from_file_impl: using device qnn-gpu (unknown(Adreno GPU)) (unknown id) - 24021 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 7B Instruct GGUF
llama_model_loader: - kv 3: general.finetune str = Instruct-GGUF
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 18
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: split.no u16 = 0
llama_model_loader: - kv 27: split.count u16 = 0
llama_model_loader: - kv 28: split.tensors.count i32 = 339
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q6_K: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 5.82 GiB (6.56 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3584
print_info: n_embd_inp = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = Qwen2.5 Coder 7B Instruct GGUF
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 5958.78 MiB
load_tensors: qnn-gpu model buffer size = 1.27 MiB
........................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
extend_lib_search_path is nullptr, will use as default
initialize qnn system successfully
[I]QNN API Version: 2.33.0
[I]QNN GPU API Version: 3.12.0
[I]Found C:\Windows\system32.\OpenCL.dll
[I]Successfully resolved extension function clGetDeviceImageInfoQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clSetPerfHintQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clNewRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEndRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clReleaseRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clRetainRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEnqueueRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Device version: 3.0 Device tier: 740
[I]OpenCL Driver version: OpenCL 3.0 QUALCOMM build: 827.0 Compiler DX.18.05.00
[I]QnnOpPackage: v2.0.0
[I]Creating operation package: qti.aisw
[I]Found C:\Windows\system32.\OpenCL.dll
[I]QnnOpPackage: qti.aisw
device counts 1
deviceID:0, deviceType:0, numCores 1
htp_type:0(ON_CHIP)
soc_model:unknown(unknown), htp_arch:unknown(884792475), vtcm_size:1323847928 MB
[I]Found C:\Windows\system32.\OpenCL.dll
[I]Successfully resolved extension function clGetDeviceImageInfoQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clSetPerfHintQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clNewRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEndRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clReleaseRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clRetainRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEnqueueRecordingQCOM using clGetExtensionFunctionAddressForPlatform
create QNN device successfully
failed to load libcdsprpc.dll, error: (null)
failed to load rpcmem lib
[V]Creating GPU context
qnn device name qnn-gpu
extend_lib_search_path is nullptr, will use as default
initialize qnn system successfully
[I] <I> QnnLog_create started.
[V] <V> Registered a new graph environment 0 with priority: 100, num hvx threads: 1001, num hmx threads: 1001
[W] <W> Initializing HtpProvider
[V] <V> Creating default router
[V] <V> RouterWindows creater
[V] <V> HTP: Initializing the router
[V] <V> Detected Snapdragon SOC Dynamic SDM with 4 SOCs
[V] <V> Allocating PlatformInfo struct size 120
[V] <V> Multicore support is unavailable
[V] <V> Force to use single core in default platformInfo when MultiCore is not supported, numHwDevices= 1
[V] <V> HTP: Initializing the graph registry
[V] <V> HTP: Initializing the context registry
[V] <V> HTP: Initializing the device registry
[V] <V> HTP: Initializing the tensor counter
[V] <V> HTP: setting isExitCalled to false
[V] <V> HTP: setting ssrInProgress to false
[V] <V> HTP: FinalCleanupFn fnPtr is nullptr
[V] <V> HTP: initializing mem registry
[V] <V> HTP: initializing mmap registry
[V] <V> HTP: Initializing the logger lifecycle manager
[V] <V> HTP: constructing bundle
[V] <V> Graph environment handle not opened as preparelib or driverlib is not yet loaded
[V] <V> Set default graph environment 0 remoteHandle 0
[V] <V> Opened default graph env, envRemoteHandle 0
[I] <I> exit with 0
[I] <I> exit with 0
[V] <V> HTP: initialization completed successfully
[I] <I> QnnLog_create exit.
[I] <I> QnnBackend_create started. backend = 0x4eff8290
[V] <V> Oem key validation infra not found, limiting oemMaxPriority to HIGHEST
[V] <V> Backend handle created: 1
[V] <V> Graph environment handle not opened as preparelib or driverlib is not yet loaded
[V] <V> Set default graph environment 0 remoteHandle 0
[V] <V> Opened default graph env, envRemoteHandle 0
[I] <I> QnnBackend_create done successfully. backend = 0x4eff8290
[V] <V> Deactivated logger with handle 0000000000000001
device counts 1
deviceID:0, deviceType:0, numCores 1
htp_type:0(ON_CHIP)
soc_model:unknown(unknown), htp_arch:HTP_V73(73), vtcm_size:8 MB
[I] <I> QnnDevice_create started
[V] <V> Create device with id 0x1
[V] <V> Config not passed. Loading default platform info!
[V] <V> Setting default value for unsigned PD usage
[V] <V> DSP Driver Path: C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79
[I] <I> First connection to QNN stub established!
[V] <V> Loading remote funcs
[V] <V> Getting effective domain ID of domain name cdsp
[V] <V> Effective cdsp_id is: 3, Session_id is: 0 for original Device Id: 0, DeviceId: 0, CoreId: 0, pdId: 0
[E] <E> DspTransport.openSession qnn_open failed, 0x80000406, prio 100
[E] <E> IDspTransport: Unable to load lib 0x80000406
[E] <E> DspTransport.getHandle failed, error 0x00000008
[E] <E> createDspTransportInstance failed to config transport object
[E] <E> error in creation of transport instance
[W] <W> Failed to create transport instance: 1002
[W] <W> Failed to load skel, error: 1002
[W] <W> Traditional path not available. Switching to user driver path
[V] <V> DriverLibLoader Loading HtpUsrDrv.dll
[V] <V> HTP User Driver Path: C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79/HTP
[V] <V> Max API version supported by the driver = 1.4.2
[V] <V> Min API version supported by the driver = 1.0.0
[V] <V> QNN side interface version = 1.5.21
[V] <V> Driver interface requested size 576, filled 352
[V] <V> Driver capabilities size requested 256 size filled 116
[V] <V> Initializeing OpPackageManager log callback in HtpUsrDrv_setLogCallback
[V] <V> HtpUsrDrv_setLogLevel is called
[V] <V> Driver log level is set as: 5
[V] <V> HtpUsrDrv_setProfileCallback is called
[V] <V> Setting profile extended callback
[V] <V> HtpUsrDrv_getConfig is called
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2015
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2004
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2005
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2006
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2007
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2008
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2009
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2010
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2011
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2012
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2013
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2014
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2016
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2017
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2018
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10003
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10006
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10001
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10002
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10004
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10005
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10007
[V] <V> HtpUsrDrv_getBuildId is called
[V] <V> Driver build id: v2.30.2.250124135729_113467
[W] <W> HTP user driver is loaded. Switched to user driver path
[V] <V> Calling driver's API - deviceCreate
[V] <V> Calling transport createDeviceTransportInstance from driver
[V] <V> skel file path file:///C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79\HTP\libQnnHtpV73SkelDrv.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdspDrv.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdsp
[V] <V> DSP Driver Path: C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79
[I] <I> First connection to QNN stub established!
[V] <V> Loading remote funcs
[V] <V> Getting effective domain ID of domain name cdsp
[V] <V> Effective cdsp_id is: 3, Session_id is: 0 for DeviceId: 0, CoreId: 0, pdId: 0
[V] <V> Transport session for deviceId 268435456 coreId 0 pdId 0 not found!
[V] <V> DeviceId 268435456 coreId 0 pdId 0 not present, insert a new entry 0000021D4F6FED70
[V] <V> rpcMemoryInit exits with 2, successfully initialized rpc memory
[V] <V> Successful rpcMemInit
[V] <V> rpcMemoryAlloc: 8 isInit 1
[V] <V> rpcMemoryAlloc: 136 isInit 1
[D] <D> Calling RPC transport with params 0000021D4BFC0000 [8 B], 0000000000000000 [0 B], 0000021D4BFD0000 [88 B]
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 6 (ms)
[V] <V> rpcMemoryAlloc: 8 isInit 1
[V] <V> rpcMemoryAlloc: 8 isInit 1
[D] <D> Calling RPC transport with params 0000021D4BFC0000 [8 B], 0000000000000000 [0 B], 0000021D4BFD0000 [8 B]
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 1 (ms)
[V] <V> New session config entry is found, value = 1
[V] <V> New session config value = 1
[V] <V> exits device initialization with 0
[V] <V> Calling driver's API - createGraphEnvHandle
[V] <V> Graph environments is not supported by current User Driver. Default environment will be used.
[V] <V> Set default graph environment 0 remoteHandle 0
[V] <V> Opened default graph env, envRemoteHandle 0
[V] <V> Calling driver's API - createGraphEnvHandle
[V] <V> Graph environments is not supported by current User Driver. Default environment will be used.
[V] <V> Successfully opened graph env handle, envId 0
[V] <V> Successfully opened graph environment, envId 0
[V] <V> Calling driver's API - setSkelLogLevel
[V] <V> HtpUsrDrv_setLogLevel is called
[V] <V> Setting skel log level from driver
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 0 (ms)
[V] <V> setSkelLogLevel return 0
[V] <V> Setting OpPackageManager log level from driver
[I] <I> QnnDevice_create done. device = 0x1. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
create QNN device successfully
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> QnnContext_create started. backend = 0x1, device = 0x1
[V] <V> Create context 0x1
[V] <V> Multicore support is unavailable
[V] <V> Wake up free backend (id: 1)'s thread(s)
[I] <I> Number of existing contexts: 1, graphs: 0
[I] <I> QnnContext_create done successfully. context = 0x1
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
HTP backend perf_infrastructure creation ok
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> htpPerfInfrastructureCreatePowerConfigId started for deviceId: 0, coreId: 0
[V] <V> Device with devID[0] coreID[0], pdId[0] found with CoreType:0
[V] <V> Created power config id 1534446784 for device id 0 core id 0 processDomain id 0
[I] <I> htpPerfInfrastructureCreatePowerConfigId done. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
HTP infra type = 0, which is perf infra type
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1534446784
[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> set power settings rpc polling time 9999
[V] <V> Setting poll QoS to 9999
[V] <V> Polling not supported in setPollQos
[V] <V> HtpUsrDrv_perfSetConfig is called
[V] <V> perfSetRpcPollingTime is called
[V] <V> set power settings rpc polling time 9999
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> set remote rpc control return 0
[V] <V> set power settings rpc control latency 100
[V] <V> HtpUsrDrv_perfSetConfig is called
[V] <V> perfSetRpcControlLatency is called
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> set remote rpc control return 0
[I] <I> htpPerfInfrastructureSetPowerConfig done. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1534446784
[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> set power settings DCVS V3 for context id 1534446784:
[V] <V> setDcvsEnable 1
[V] <V> dcvsEnable 0
[V] <V> powerMode 16
[V] <V> setSleepLatency 1
[V] <V> sleepLatency 40
[V] <V> setSleepDisable 1
[V] <V> sleepDisable 1
[V] <V> setBusParams 1
[V] <V> busVoltageCornerMin 160
[V] <V> busVoltageCornerTarget 160
[V] <V> busVoltageCornerMax 160
[V] <V> setCoreParams 1
[V] <V> coreVoltageCornerMin 160
[V] <V> coreVoltageCornerTarget 160
[V] <V> coreVoltageCornerMax 160
[V] <V> Resetting polling rpc graph stats
[V] <V> Polling not supported in resetDeviceGraphStats
[V] <V> Memory allocated - size 72 addr 0000021D57D125D0
[V] <V> HtpUsrDrv_perfSetConfig is called
[V] <V> perfSetPowerConfig is called
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 1 (ms)
[V] <V> Set perf settings success
[V] <V> Freeing memory - addr 0000021D57D125D0
[I] <I> htpPerfInfrastructureSetPowerConfig done. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
qnn device name qnn-npu
llama_context: CPU output buffer size = 2.32 MiB
llama_kv_cache: qnn-gpu KV buffer size = 7168.00 MiB
llama_kv_cache: size = 7168.00 MiB (131072 cells, 28 layers, 4/1 seqs), K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_context: layer 0 is assigned to device qnn-gpu but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled
llama_context: CPU compute buffer size = 7580.01 MiB
llama_context: graph nodes = 1098
llama_context: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[I]Graph precision mode is user provided
[I]Memory Optimizations enabled
[I]Node Optimizations enabled
[I]Queue Recording enabled
[V]Constructed: ElementWiseAdd ffn_inp-27
[I]QnnGraph_finalize: start
[I]Create operation: ElementWiseAdd
[I]qnn::gpu::backend::CompositionalGraph::finalize: total host time: 50.9 [ms]
[I]QnnGraph_finalize: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.7 [ms]
[I]QnnGraph_execute: finish
[I]Graph precision mode is user provided
[I]Memory Optimizations enabled
[I]Node Optimizations enabled
[I]Queue Recording enabled
[V]Constructed: ElementWiseMultiply ffn_norm-27
[I]QnnGraph_finalize: start
[I]Create operation: ElementWiseMultiply
[I]qnn::gpu::backend::CompositionalGraph::finalize: total host time: 12.3 [ms]
[I]QnnGraph_finalize: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.6 [ms]
[I]QnnGraph_execute: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.6 [ms]
[I]QnnGraph_execute: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.6 [ms]
[I]QnnGraph_execute: finish
srv load_model: initializing slots, n_slots = 4
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
slot load_model: id 1 | task -1 | new slot, n_ctx = 131072
slot load_model: id 2 | task -1 | new slot, n_ctx = 131072
slot load_model: id 3 | task -1 | new slot, n_ctx = 131072
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv load_model: thinking = 0
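To make the split easier to see than by scrolling through the log, here is a minimal Python sketch that sums the "buffer size" lines per device. The function name, the regex, and the `SAMPLE` string are my own illustration and not part of llama.cpp or the QNN backend; the sample lines are copied from the log above. On those lines the model weights sit in `CPU_Mapped`, `qnn-gpu` only holds a tiny model buffer plus the KV cache, and `qnn-npu` does not show up at all.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   load_tensors: CPU_Mapped model buffer size = 5958.78 MiB
#   llama_kv_cache: qnn-gpu KV buffer size = 7168.00 MiB
# The pattern is an assumption based on the log in this post only.
BUFFER_RE = re.compile(
    r"^\w+:\s+(?P<device>\S+)\s+(?P<kind>model|KV|output|compute)"
    r" buffer size\s*=\s*(?P<mib>[\d.]+) MiB"
)


def buffers_by_device(log_text):
    """Return {device: {buffer kind: MiB}} summed over all matching lines."""
    totals = defaultdict(lambda: defaultdict(float))
    for line in log_text.splitlines():
        match = BUFFER_RE.match(line.strip())
        if match:
            totals[match["device"]][match["kind"]] += float(match["mib"])
    # Convert to plain dicts so missing devices raise KeyError instead of
    # silently being created.
    return {device: dict(kinds) for device, kinds in totals.items()}


# Buffer lines copied from the log above.
SAMPLE = """\
load_tensors: CPU_Mapped model buffer size = 5958.78 MiB
load_tensors: qnn-gpu model buffer size = 1.27 MiB
llama_context: CPU output buffer size = 2.32 MiB
llama_kv_cache: qnn-gpu KV buffer size = 7168.00 MiB
llama_context: CPU compute buffer size = 7580.01 MiB
"""

if __name__ == "__main__":
    for device, kinds in buffers_by_device(SAMPLE).items():
        for kind, mib in kinds.items():
            print(f"{device:12s} {kind:8s} {mib:10.2f} MiB")
```

The same function should work on a full log saved to a file (read it with `open(path, encoding="utf-8").read()`), as long as the buffer lines keep the format shown here.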
So what prerequisites need to be fulfilled to have a model run on the NPU? What would be needed for this to become part of llama.cpp mainline?
Hi,
As mentioned in #14, I was able to build the project with cURL disabled and qnn_sdk 2.44.0.260225. I was very excited to test the new Qwen-3.5 series, which failed because the architecture isn't supported by your llama.cpp version. I was able to run a model, but most of the work was done by the CPU and the GPU, nothing on the NPU:
THANK YOU FOR YOUR WORK!