Integration into newer llama.cpp plus NPU support #26

@stiller-leser

Hi,

Thank you for the work here. I really am kinda desperate to finally make some use of my Snapdragon NPU. After failing to use Microsoft Olive for model conversion, I revisited your repo.

As mentioned in #14, I was able to build the project with cURL disabled and qnn_sdk 2.44.0.260225. I was very excited to test the new Qwen-3.5 series, but loading it failed because the architecture isn't supported by your llama.cpp version. With Qwen2.5-Coder, I was able to run

hf download Qwen/Qwen2.5-Coder-7B-Instruct-GGUF --include "Qwen2.5-Coder-7B-Instruct-Q6_K.gguf" --local-dir C:\Users\me\models
.\llama-server.exe -m C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf --host 0.0.0.0 --port 8080 -ngl 99

but most of the work was done by the CPU and the GPU, and nothing ran on the NPU:

PS C:\Users\me\workspace\llama-cpp-qnn-builder\llama.cpp\build-arm64-windows-llvm-debug\bin> .\llama-server.exe -m C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf --host 0.0.0.0 --port 8080 -ngl 99
backend registry init
skip hexagon device 3
qnn backend registry skip CPU device
skip device 0
register_backend: registered backend qualcomm (2 devices)
register_device: registered device qnn-npu (Hexagon NPU)
register_device: registered device qnn-gpu (Adreno GPU)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Oryon(TM) CPU)
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7771 (897501a78) with Clang 20.1.8 for Windows arm64 (debug)
system info: n_threads = 12, n_threads_batch = 12, total_threads = 12

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | OPENMP = 1 | REPACK = 1 |

init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
failed to load libcdsprpc.dll, error: (null)
failed to load rpcmem lib
llama_params_fit_impl: projected to use 0 MiB of device memory vs. 32326 MiB of free device memory
llama_params_fit_impl: will leave 23928 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 2.82 seconds
llama_model_load_from_file_impl: using device qnn-gpu (unknown(Adreno GPU)) (unknown id) - 24021 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from C:\Users\me\models\qwen2.5-coder-7b-instruct-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 7B Instruct GGUF
llama_model_loader: - kv   3:                           general.finetune str              = Instruct-GGUF
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 18
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 0
llama_model_loader: - kv  28:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q6_K:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 5.82 GiB (6.56 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3584
print_info: n_embd_inp       = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = Qwen2.5 Coder 7B Instruct GGUF
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading output layer to GPU
load_tensors: offloading 27 repeating layers to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  5958.78 MiB
load_tensors:      qnn-gpu model buffer size =     1.27 MiB
........................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 131072
llama_context: n_ctx_seq     = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
extend_lib_search_path is nullptr, will use  as default
initialize qnn system successfully
[I]QNN API Version: 2.33.0
[I]QNN GPU API Version: 3.12.0
[I]Found C:\Windows\system32.\OpenCL.dll
[I]Successfully resolved extension function clGetDeviceImageInfoQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clSetPerfHintQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clNewRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEndRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clReleaseRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clRetainRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEnqueueRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Device version: 3.0    Device tier: 740
[I]OpenCL Driver version: OpenCL 3.0 QUALCOMM build: 827.0 Compiler DX.18.05.00
[I]QnnOpPackage: v2.0.0
[I]Creating operation package: qti.aisw
[I]Found C:\Windows\system32.\OpenCL.dll
[I]QnnOpPackage: qti.aisw
device counts 1
deviceID:0, deviceType:0, numCores 1
htp_type:0(ON_CHIP)
soc_model:unknown(unknown), htp_arch:unknown(884792475), vtcm_size:1323847928 MB
[I]Found C:\Windows\system32.\OpenCL.dll
[I]Successfully resolved extension function clGetDeviceImageInfoQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clSetPerfHintQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clNewRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEndRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clReleaseRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clRetainRecordingQCOM using clGetExtensionFunctionAddressForPlatform
[I]Successfully resolved extension function clEnqueueRecordingQCOM using clGetExtensionFunctionAddressForPlatform
create QNN device successfully
failed to load libcdsprpc.dll, error: (null)
failed to load rpcmem lib
[V]Creating GPU context
qnn device name qnn-gpu
extend_lib_search_path is nullptr, will use  as default
initialize qnn system successfully
[I] <I> QnnLog_create started.
[V] <V> Registered a new graph environment 0 with priority: 100, num hvx threads: 1001, num hmx threads: 1001
[W] <W> Initializing HtpProvider
[V] <V> Creating default router
[V] <V> RouterWindows creater
[V] <V> HTP: Initializing the router
[V] <V> Detected Snapdragon SOC Dynamic SDM with 4 SOCs
[V] <V> Allocating PlatformInfo struct size 120
[V] <V> Multicore support is unavailable
[V] <V> Force to use single core in default platformInfo when MultiCore is not supported, numHwDevices= 1
[V] <V> HTP: Initializing the graph registry
[V] <V> HTP: Initializing the context registry
[V] <V> HTP: Initializing the device registry
[V] <V> HTP: Initializing the tensor counter
[V] <V> HTP: setting isExitCalled to false
[V] <V> HTP: setting ssrInProgress to false
[V] <V> HTP: FinalCleanupFn fnPtr is nullptr
[V] <V> HTP: initializing mem registry
[V] <V> HTP: initializing mmap registry
[V] <V> HTP: Initializing the logger lifecycle manager
[V] <V> HTP: constructing bundle
[V] <V> Graph environment handle not opened as preparelib or driverlib is not yet loaded
[V] <V> Set default graph environment 0 remoteHandle 0
[V] <V> Opened default graph env, envRemoteHandle 0
[I] <I> exit with 0
[I] <I> exit with 0
[V] <V> HTP: initialization completed successfully
[I] <I> QnnLog_create exit.
[I] <I> QnnBackend_create started. backend = 0x4eff8290
[V] <V> Oem key validation infra not found, limiting oemMaxPriority to HIGHEST
[V] <V> Backend handle created: 1
[V] <V> Graph environment handle not opened as preparelib or driverlib is not yet loaded
[V] <V> Set default graph environment 0 remoteHandle 0
[V] <V> Opened default graph env, envRemoteHandle 0
[I] <I> QnnBackend_create done successfully. backend = 0x4eff8290
[V] <V> Deactivated logger with handle 0000000000000001
device counts 1
deviceID:0, deviceType:0, numCores 1
htp_type:0(ON_CHIP)
soc_model:unknown(unknown), htp_arch:HTP_V73(73), vtcm_size:8 MB
[I] <I> QnnDevice_create started
[V] <V> Create device with id 0x1
[V] <V> Config not passed. Loading default platform info!
[V] <V> Setting default value for unsigned PD usage
[V] <V> DSP Driver Path: C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79
[I] <I> First connection to QNN stub established!
[V] <V> Loading remote funcs
[V] <V> Getting effective domain ID of domain name cdsp
[V] <V> Effective cdsp_id is: 3, Session_id is: 0 for original Device Id: 0, DeviceId: 0, CoreId: 0, pdId: 0
[E] <E> DspTransport.openSession qnn_open failed, 0x80000406, prio 100
[E] <E> IDspTransport: Unable to load lib 0x80000406
[E] <E> DspTransport.getHandle failed, error 0x00000008
[E] <E> createDspTransportInstance failed to config transport object
[E] <E> error in creation of transport instance
[W] <W> Failed to create transport instance: 1002
[W] <W> Failed to load skel, error: 1002
[W] <W> Traditional path not available. Switching to user driver path
[V] <V> DriverLibLoader Loading HtpUsrDrv.dll
[V] <V> HTP User Driver Path: C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79/HTP
[V] <V> Max API version supported by the driver = 1.4.2
[V] <V> Min API version supported by the driver = 1.0.0
[V] <V> QNN side interface version = 1.5.21
[V] <V> Driver interface requested size 576, filled 352
[V] <V> Driver capabilities size requested 256 size filled 116
[V] <V> Initializeing OpPackageManager log callback in HtpUsrDrv_setLogCallback
[V] <V> HtpUsrDrv_setLogLevel is called
[V] <V> Driver log level is set as: 5
[V] <V> HtpUsrDrv_setProfileCallback is called
[V] <V> Setting profile extended callback
[V] <V> HtpUsrDrv_getConfig is called
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2015
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2004
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2005
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2006
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2007
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2008
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2009
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2010
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2011
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2012
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2013
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2014
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2016
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2017
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2018
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10003
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10006
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10001
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10002
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10004
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10005
[W] <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 10007
[V] <V> HtpUsrDrv_getBuildId is called
[V] <V> Driver build id: v2.30.2.250124135729_113467
[W] <W> HTP user driver is loaded. Switched to user driver path
[V] <V> Calling driver's API - deviceCreate
[V] <V> Calling transport createDeviceTransportInstance from driver
[V] <V> skel file path file:///C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79\HTP\libQnnHtpV73SkelDrv.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdspDrv.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdsp
[V] <V> DSP Driver Path: C:\Windows\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79
[I] <I> First connection to QNN stub established!
[V] <V> Loading remote funcs
[V] <V> Getting effective domain ID of domain name cdsp
[V] <V> Effective cdsp_id is: 3, Session_id is: 0 for DeviceId: 0, CoreId: 0, pdId: 0
[V] <V> Transport session for deviceId 268435456 coreId 0 pdId 0 not found!
[V] <V> DeviceId 268435456 coreId 0 pdId 0 not present, insert a new entry 0000021D4F6FED70
[V] <V> rpcMemoryInit exits with 2, successfully initialized rpc memory
[V] <V> Successful rpcMemInit
[V] <V> rpcMemoryAlloc: 8 isInit 1
[V] <V> rpcMemoryAlloc: 136 isInit 1
[D] <D> Calling RPC transport with params 0000021D4BFC0000 [8 B], 0000000000000000 [0 B], 0000021D4BFD0000 [88 B]
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 6 (ms)

[V] <V> rpcMemoryAlloc: 8 isInit 1
[V] <V> rpcMemoryAlloc: 8 isInit 1
[D] <D> Calling RPC transport with params 0000021D4BFC0000 [8 B], 0000000000000000 [0 B], 0000021D4BFD0000 [8 B]
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 1 (ms)

[V] <V> New session config entry is found, value = 1
[V] <V> New session config value = 1
[V] <V> exits device initialization with  0
[V] <V> Calling driver's API - createGraphEnvHandle
[V] <V> Graph environments is not supported by current User Driver. Default environment will be used.
[V] <V> Set default graph environment 0 remoteHandle 0
[V] <V> Opened default graph env, envRemoteHandle 0
[V] <V> Calling driver's API - createGraphEnvHandle
[V] <V> Graph environments is not supported by current User Driver. Default environment will be used.
[V] <V> Successfully opened graph env handle, envId 0
[V] <V> Successfully opened graph environment, envId 0
[V] <V> Calling driver's API - setSkelLogLevel
[V] <V> HtpUsrDrv_setLogLevel is called
[V] <V> Setting skel log level from driver
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 0 (ms)

[V] <V> setSkelLogLevel return 0
[V] <V> Setting OpPackageManager log level from driver
[I] <I> QnnDevice_create done. device = 0x1. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
create QNN device successfully
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> QnnContext_create started. backend = 0x1, device = 0x1
[V] <V> Create context 0x1
[V] <V> Multicore support is unavailable
[V] <V> Wake up free backend (id: 1)'s thread(s)
[I] <I> Number of existing contexts: 1, graphs: 0
[I] <I> QnnContext_create done successfully. context = 0x1
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
HTP backend perf_infrastructure creation ok
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> htpPerfInfrastructureCreatePowerConfigId started for deviceId: 0, coreId: 0

[V] <V> Device with devID[0] coreID[0], pdId[0] found with CoreType:0
[V] <V> Created power config id 1534446784 for device id 0 core id 0 processDomain id 0
[I] <I> htpPerfInfrastructureCreatePowerConfigId done. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
HTP infra type = 0, which is perf infra type
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1534446784

[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> set power settings rpc polling time 9999
[V] <V> Setting poll QoS to 9999
[V] <V> Polling not supported in setPollQos
[V] <V> HtpUsrDrv_perfSetConfig is called
[V] <V> perfSetRpcPollingTime is called
[V] <V> set power settings rpc polling time 9999
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> set remote rpc control return 0
[V] <V> set power settings rpc control latency 100
[V] <V> HtpUsrDrv_perfSetConfig is called
[V] <V> perfSetRpcControlLatency is called
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[V] <V> set remote rpc control return 0
[I] <I> htpPerfInfrastructureSetPowerConfig done. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling initialize OpPackage log API
[I] <I> htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1534446784

[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> Get HTP power device instance for device id 0 core id 0 processDomain id 0 coreType 0
[V] <V> set power settings DCVS V3 for context id 1534446784:
[V] <V> setDcvsEnable 1
[V] <V> dcvsEnable 0
[V] <V> powerMode 16
[V] <V> setSleepLatency 1
[V] <V> sleepLatency 40
[V] <V> setSleepDisable 1
[V] <V> sleepDisable 1
[V] <V> setBusParams 1
[V] <V> busVoltageCornerMin 160
[V] <V> busVoltageCornerTarget 160
[V] <V> busVoltageCornerMax 160
[V] <V> setCoreParams 1
[V] <V> coreVoltageCornerMin 160
[V] <V> coreVoltageCornerTarget 160
[V] <V> coreVoltageCornerMax 160
[V] <V> Resetting polling rpc graph stats
[V] <V> Polling not supported in resetDeviceGraphStats
[V] <V> Memory allocated - size 72 addr 0000021D57D125D0
[V] <V> HtpUsrDrv_perfSetConfig is called
[V] <V> perfSetPowerConfig is called
[V] <V> Found transport session 0000021D4F6FED70 for deviceId 268435456 coreId 0 pdId 0!
[D] <D> qnn_transport_run time: 1 (ms)

[V] <V> Set perf settings success
[V] <V> Freeing memory - addr 0000021D57D125D0
[I] <I> htpPerfInfrastructureSetPowerConfig done. status 0x0
[V] <V> Deactivated logger with handle 0000000000000001
[V] <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
qnn device name qnn-npu
llama_context:        CPU  output buffer size =     2.32 MiB
llama_kv_cache:    qnn-gpu KV buffer size =  7168.00 MiB
llama_kv_cache: size = 7168.00 MiB (131072 cells,  28 layers,  4/1 seqs), K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_context: layer 0 is assigned to device qnn-gpu but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled
llama_context:        CPU compute buffer size =  7580.01 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[I]Graph precision mode is user provided
[I]Memory Optimizations enabled
[I]Node Optimizations enabled
[I]Queue Recording enabled
[V]Constructed: ElementWiseAdd            ffn_inp-27
[I]QnnGraph_finalize: start
[I]Create operation: ElementWiseAdd
[I]qnn::gpu::backend::CompositionalGraph::finalize: total host time: 50.9 [ms]
[I]QnnGraph_finalize: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.7 [ms]
[I]QnnGraph_execute: finish
[I]Graph precision mode is user provided
[I]Memory Optimizations enabled
[I]Node Optimizations enabled
[I]Queue Recording enabled
[V]Constructed: ElementWiseMultiply       ffn_norm-27
[I]QnnGraph_finalize: start
[I]Create operation: ElementWiseMultiply
[I]qnn::gpu::backend::CompositionalGraph::finalize: total host time: 12.3 [ms]
[I]QnnGraph_finalize: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.6 [ms]
[I]QnnGraph_execute: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.6 [ms]
[I]QnnGraph_execute: finish
[I]QnnGraph_execute: start
[I]qnn::gpu::backend::Graph::execute: total host time: 0.6 [ms]
[I]QnnGraph_execute: finish
srv    load_model: initializing slots, n_slots = 4
slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
slot   load_model: id  1 | task -1 | new slot, n_ctx = 131072
slot   load_model: id  2 | task -1 | new slot, n_ctx = 131072
slot   load_model: id  3 | task -1 | new slot, n_ctx = 131072
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv    load_model: thinking = 0
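
To make the device split in the log above easier to see, here is a minimal sketch that tallies the "buffer size" lines per device. It assumes those lines keep the format shown in this log; the helper name tally_buffers and the regular expression are illustrative only and are not part of llama.cpp or the QNN backend.

```python
import re

# Buffer lines copied verbatim from the log above.
LOG = """\
load_tensors:   CPU_Mapped model buffer size =  5958.78 MiB
load_tensors:      qnn-gpu model buffer size =     1.27 MiB
llama_context:        CPU  output buffer size =     2.32 MiB
llama_kv_cache:    qnn-gpu KV buffer size =  7168.00 MiB
llama_context:        CPU compute buffer size =  7580.01 MiB
"""

# Assumed line shape: "<prefix>: <device> <kind> buffer size = <number> MiB"
BUFFER_RE = re.compile(
    r"^\w+:\s+(?P<device>\S+)\s+(?P<kind>model|output|KV|compute)"
    r" buffer size =\s+(?P<mib>[\d.]+) MiB$"
)


def tally_buffers(log_text):
    """Return {device: {buffer kind: size in MiB}} for every buffer line found."""
    totals = {}
    for line in log_text.splitlines():
        match = BUFFER_RE.match(line.strip())
        if match:
            device = totals.setdefault(match["device"], {})
            device[match["kind"]] = float(match["mib"])
    return totals


if __name__ == "__main__":
    for device, kinds in tally_buffers(LOG).items():
        for kind, mib in kinds.items():
            print(f"{device:<12} {kind:<8} {mib:>10.2f} MiB")
```

On this log, the tally has entries for CPU_Mapped, CPU and qnn-gpu only. The model weights sit almost entirely in the CPU_Mapped buffer (5958.78 MiB versus 1.27 MiB on qnn-gpu), and no buffer is reported for qnn-npu, which matches what I observed.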

So what prerequisites need to be fulfilled to have a model run on the NPU? What would be needed for this to become part of llama.cpp mainline?

THANK YOU FOR YOUR WORK!