Tags: JackDanger/llama.cpp
server : fix incoming tasks not process in order (ggml-org#15395)
vulkan: disable spirv-opt for bfloat16 shaders (ggml-org#15352)
server : export max observed n_past value (ggml-org#15361)
  Add tracking for high watermark cache usage and make it available in the /metrics endpoint.
  Use case: tracking the largest cache usage needed under a realistic workload, to better understand memory requirements and to adjust the cache size or quantization for the model/cache accordingly.
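The high watermark idea above can be sketched in a few lines of Python. This is only an illustration and not the server's actual code: the class name, the field name, and the metric name `example_n_past_max` are all hypothetical, and exposing the value as a gauge in Prometheus text format is an assumption.

```python
class CacheUsageTracker:
    """Remembers the largest n_past value observed so far (a high watermark)."""

    def __init__(self) -> None:
        self.n_past_max = 0  # hypothetical field name

    def observe(self, n_past: int) -> None:
        # A high watermark only ever moves up: keep the maximum seen so far.
        self.n_past_max = max(self.n_past_max, n_past)

    def render_metric(self) -> str:
        # Prometheus-style text output; the metric name is a placeholder.
        return (
            "# TYPE example_n_past_max gauge\n"
            f"example_n_past_max {self.n_past_max}\n"
        )


tracker = CacheUsageTracker()
for n_past in (128, 4096, 512):  # n_past values seen while serving requests
    tracker.observe(n_past)
print(tracker.render_metric())
```

Reading the value after a representative workload gives the peak number of cached positions actually needed, which is the figure the use case above asks for.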
vulkan: Use larger workgroups for mul_mat_vec when M is small (ggml-org#15355)
  * vulkan: Use larger workgroups for mul_mat_vec when M is small. Also use subgroup instructions for (part of) the reduction when supported. Without this, the more expensive reductions would eat into the benefits of the larger workgroups.
  * update heuristic for amd/intel
  Co-authored-by: 0cc4m
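To illustrate why the subgroup step matters for larger workgroups, here is a rough CPU-side simulation in Python. It is not the shader code: the workgroup size of 128, the subgroup size of 32, and the function names are assumptions chosen for the example. It counts how many barrier-separated shared-memory steps a tree reduction needs with and without a first stage done inside each subgroup.

```python
def tree_reduce(shared):
    """Shared-memory style tree reduction over a power-of-two sized list.

    Returns (total, steps), where each step would need a barrier on a GPU.
    """
    vals = list(shared)
    steps = 0
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
        steps += 1
    return vals[0], steps


def reduce_row(partials, subgroup_size=None):
    """Reduce per-invocation partial sums, optionally with a subgroup stage."""
    if subgroup_size is None:
        return tree_reduce(partials)
    # Stage 1: each subgroup sums its own lanes (stands in for a subgroup add,
    # which needs no shared memory and no workgroup barrier).
    subgroup_sums = [
        sum(partials[s:s + subgroup_size])
        for s in range(0, len(partials), subgroup_size)
    ]
    # Stage 2: only one value per subgroup goes through shared memory.
    return tree_reduce(subgroup_sums)


partials = [float(i) for i in range(128)]  # one partial dot product per invocation
plain_total, plain_steps = reduce_row(partials)
sub_total, sub_steps = reduce_row(partials, subgroup_size=32)
print(plain_total, plain_steps)
print(sub_total, sub_steps)
```

Both paths produce the same sum, but in this model the subgroup variant leaves far fewer barrier-separated steps, which is the cost the commit message says would otherwise offset the gain from larger workgroups.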
ci : fix hang in windows-hip build/release (ggml-org#15365)
  * fix hang in windows-latest-cmake-hip
  * apply fix to release as well
vulkan: Optimize argsort (ggml-org#15354)
  - Launch an appropriate number of invocations (next larger power of two). 32 invocations is common and the barrier is much cheaper there.
  - Specialize for "needs bounds checking" vs not.
  - Make the code less branchy and [[unroll]] the loops. In the final code, I see no branches inside the main loop (only predicated stores) when needs_bounds_check is false.
  - Always sort ascending, then apply the ascending vs descending option when doing the final stores to memory.
  - Copy the values into shared memory, which makes them slightly cheaper to access.
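The steps listed above can be sketched on the CPU in Python. This is a hedged illustration, not the shader: it assumes a bitonic sorting network (the commit message does not name the algorithm), and the function name `argsort_sketch` and the padding scheme are made up for the example. It pads to the next power of two, always sorts ascending, and applies the requested order only at the final store.

```python
def argsort_sketch(values, descending=False):
    """Return indices that sort `values`, using a padded bitonic network."""
    n = len(values)
    if n == 0:
        return []
    # One "invocation" per slot: round n up to the next power of two.
    n_pad = 1 << (n - 1).bit_length()
    needs_bounds_check = n_pad != n

    def key(i):
        # Padding slots (i >= n) compare greater than every real element;
        # the index breaks ties so the ordering is total and deterministic.
        return (i >= n, values[i] if i < n else 0.0, i)

    idx = list(range(n_pad))  # stands in for the shared-memory copy
    k = 2
    while k <= n_pad:
        j = k // 2
        while j > 0:
            for i in range(n_pad):
                partner = i ^ j
                if partner > i:
                    ascending_block = (i & k) == 0
                    # Swap when the pair is out of order for this block.
                    if (key(idx[i]) > key(idx[partner])) == ascending_block:
                        idx[i], idx[partner] = idx[partner], idx[i]
            j //= 2
        k *= 2

    # Final store: the network always sorted ascending; the requested
    # direction only changes where each result is written.
    out = [0] * n
    for i in range(n_pad):
        if needs_bounds_check and i >= n:
            continue  # padding slot, nothing to store
        dst = n - 1 - i if descending else i
        out[dst] = idx[i]
    return out


print(argsort_sketch([3.0, 1.0, 2.0, 5.0, 4.0]))
print(argsort_sketch([3.0, 1.0, 2.0, 5.0, 4.0], descending=True))
```

Keeping the network itself direction-agnostic means the inner loop has a single comparison rule, and when the input length is already a power of two the bounds check in the final loop is never taken.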