Block interleaving support for Q4_K quantization for x86 AVX2 architecture #12332
Conversation
What're your thoughts on the TG speed decrease? Obviously it's pretty negligible overall, but it's curious to see. Will these efforts also help with ARM repacking?
Hi @bartowski1182, for the text generation case we checked whether the recent introduction of the restrict keyword in vec_dot_q4_K_q8_K is causing this difference, and tried replicating the restrict keyword usage in the repacking implementation, but we still see the difference at the moment. Regarding ARM repacking: the repacking of Q4_K has been designed similarly to Q4_0_8_8, so there is a possibility of this being extended to ARM, although we are not sure of the finer details. Thanks
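(Aside, not from the PR: a minimal hypothetical sketch of the effect being discussed. With restrict on the pointer parameters of a dot-product routine, the compiler may assume the buffers never alias, which can change how the inner loop is vectorized and hence the measured speed. dot_i8 below is a stand-in, not the ggml function.)

#include <stdint.h>

// Hypothetical illustration: restrict tells the compiler x and y do not alias,
// so loaded values can stay in registers and the loop can be vectorized more freely.
static int32_t dot_i8(int n, const int8_t * restrict x, const int8_t * restrict y) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t) x[i] * (int32_t) y[i];  // products accumulate in 32-bit
    }
    return acc;
}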
Nice work! I don't have easy access to an AVX2 machine, but will try to do some testing soon. I am curious what the performance looks like for different thread counts and model sizes (e.g. 1B, 3B).
Update : A couple of updates were made in the PR
GCC Linux :
Q4_K_M Model :
Q4_K_S Model :
GCC Version = 12.3
The models were quantized and tested from the meta-llama2 7B model - https://huggingface.co/meta-llama/Llama-2-7b
The PR was tested on an AMD Raphael 7600X, which supports the following flags by default :
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
The perplexity results are tabulated as follows post changes:
Some minor formatting comments.
I did some tests on a Ryzen 9 5950X and observed similar results to those reported.
@@ -45,6 +45,24 @@ using block_q4_0x8 = block<4, 8>;
using block_q8_0x4 = block<8, 4>;
using block_q8_0x8 = block<8, 8>;
struct block_q4_Kx8{
Suggested change:
- struct block_q4_Kx8{
+ struct block_q4_Kx8 {
static_assert(sizeof(block_q4_Kx8) == sizeof(ggml_half) * 16 + K_SCALE_SIZE * 8 + QK_K * 4, "wrong q4_K block size/padding");
struct block_q8_Kx4{
Suggested change:
- struct block_q8_Kx4{
+ struct block_q8_Kx4 {
__m256i bsums_r2 = _mm256_maddubs_epi16(one, sb_h2_interleaved);
for(int l = 0; l < 3; l++) {
There are a few places where the opening bracket should be separated like this:
- for(int l = 0; l < 3; l++) {
+ for (int l = 0; l < 3; l++) {
…architecture (ggml-org#12332)
* Add block interleaving support for Q4_K quantization
* Remove whitespaces and fix CI/CD issues
* Update pointer of bsums from int16_t to const int16_t
* Add vector version of quantize_q8_K_4x8 function
* Update code formatting based on review comments
@Srihari-mcw Hi, could you please help me check this error: #12528
Do I understand correctly that this PR repackages Q4_K tensors into some other memory layout on the fly during model loading from existing GGUF files? I noticed that after updating llama.cpp to the current master, DeepSeek R1 Q4_K_S model loading times (with the model file already cached) increased from:
to:
and only a single core is active when loading the model. Is this how it should work?
Yes, loading time is increased due to the runtime repacking of the weights.
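(Aside, not from the PR: a heavily simplified, hypothetical sketch of what load-time repacking means here. The types below are stand-ins consistent with the static_assert shown in the diff above; field and type names are assumptions, and the real ggml code interleaves the quants at a much finer granularity so one SIMD load spans all 8 blocks.)

#include <stdint.h>
#include <string.h>

#define QK_K 256           // super-block size
#define K_SCALE_SIZE 12    // packed 6-bit scales/mins per Q4_K block
typedef uint16_t ggml_half_sketch;

typedef struct {           // simplified stand-in for block_q4_K
    ggml_half_sketch d, dmin;
    uint8_t scales[K_SCALE_SIZE];
    uint8_t qs[QK_K / 2];
} q4_K_sketch;

typedef struct {           // simplified stand-in for block_q4_Kx8
    ggml_half_sketch d[8], dmin[8];
    uint8_t scales[K_SCALE_SIZE * 8];
    uint8_t qs[QK_K * 4];
} q4_Kx8_sketch;

// Pack 8 gathered Q4_K super-blocks (in practice, the same-position block from
// 8 consecutive weight rows) into one interleaved block. Running a pass like
// this over every Q4_K tensor at load time is what lengthens model loading.
static void repack_8(q4_Kx8_sketch * dst, const q4_K_sketch src[8]) {
    for (int i = 0; i < 8; i++) {
        dst->d[i]    = src[i].d;
        dst->dmin[i] = src[i].dmin;
        memcpy(dst->scales + i * K_SCALE_SIZE, src[i].scales, K_SCALE_SIZE);
        memcpy(dst->qs + i * (QK_K / 2), src[i].qs, QK_K / 2);  // simple concat here; the real kernel interleaves finer chunks
    }
}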
Hi, I have one question, why doesn't this patch modify
Hi @Nor7th, I believe this is addressed in #12544. The models we tested while developing this PR did not go through the code path you pointed out, which is probably why it was missed (presumably MoE models do go through that path). Sorry @Yangxiaoz, I missed your message; glad the fix is already there. Thanks @ggerganov
Thanks! Good to know it's already solved.
Yes, #12544 should have fixed that.
Aside from the extreme slowdown in model loading times due to repacking, this has also slowed down token generation for me, at least on my Dual Xeon 4216 system. Before this change, I could load the Cohere Command A model in Q4_K_M almost instantly (excluding the dry run), and token generation was at approximately 1.92 tokens per second. After this change, loading the model takes ages (20+ minutes), and token generation is around 1.63 tokens per second (15% slower).
Maybe we need a command line option to disable repacking for those that have the compile flag but don't like it?
I just tried out Phi 4 Q4_K_M with the same command line arguments I used for Command A, and had a similar slowdown. Model loading time with repacking on my machine for Phi 4 is at least reasonable-ish (around a minute), but that's still much worse than near-instant loading without the repacking. This commit slowed down token generation from 9.53 T/s to 8.81 T/s, a 7.5% slowdown.
I did another experiment, only using one socket of my NUMA system. In single socket (non-NUMA) usage, this patch speeds up both token generation and prompt processing significantly, at the expense of slow initial load times due to repacking.
I think an option (or maybe a default) to disable block interleaving and the associated repacking would be good to have to speed up initial loading times. This patch does speed up both prompt processing and token generation for me in single socket non-NUMA usage. A separate issue is that this patch doesn't seem to properly handle NUMA systems, with minimal to no improvement in prompt processing, and substantially slower token generation. For Command A, both prompt processing and token generation slowed down with this patch when using NUMA.
The repacking process takes ages for large models, and only seems to use around 16% of a single CPU core according to
You should be able to disable repacking with
Thanks, this brings it back to the performance before this commit, or at least close to it. Nonetheless, making the repacking faster and NUMA-friendly would be the ideal solution.
Hi, is there any evaluation data for pp and tg at other sizes? Does the method still provide acceleration when pp is small, e.g. 16 or 32?
Hi @longaaalong, we currently have llama-bench data only for the default configuration (pp512), but with llama-cli we earlier observed good performance gains with relatively small prompt sizes as well. For example, we observed a gain from ~46 t/s to ~71 t/s for prompt processing. Thanks
Block Interleaving Formats
Block_Q4_Kx8 :
Block_Q8_Kx4:
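(Assumed field-level sketch of the two interleaved formats; the struct names appear in the diff, but the field names are guesses. The Q4 side matches the static_assert above, sizeof(ggml_half) * 16 + K_SCALE_SIZE * 8 + QK_K * 4 bytes, and the Q8 side is assumed to be the standard block_q8_K fields interleaved four-wide.)

struct block_q4_Kx8 {                 // 8 Q4_K weight super-blocks packed together
    ggml_half d[8];                   // 8 super-block scales
    ggml_half dmin[8];                // 8 super-block mins
    uint8_t scales[K_SCALE_SIZE * 8]; // packed 6-bit scales/mins of all 8 blocks
    uint8_t qs[QK_K * 4];             // 4-bit quants: 8 blocks x QK_K/2 bytes, interleaved
};

struct block_q8_Kx4 {                 // 4 Q8_K activation blocks packed together
    float d[4];                       // 4 per-block scales
    int8_t qs[QK_K * 4];              // 8-bit quants of 4 blocks, interleaved
    int16_t bsums[QK_K / 4];          // group sums: 4 blocks x QK_K/16 groups
};

The intent of the interleaving is that one AVX2 load spans the corresponding chunk of 8 weight rows, so the kernel can produce 8 output rows per pass.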
GCC Linux :
Q4_K_M Model :
Q4_K_S Model :
GCC Version = 12.3
The models were quantized and tested from the meta-llama2 7B model - https://huggingface.co/meta-llama/Llama-2-7b
The PR was tested on an AMD Raphael 7600X, which supports the following flags by default :
CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Additionally, the PR was also tested for execution with clang on Linux.
Further, the perplexity was tested with the Q4_K_S model and found to be similar :
The perplexity results are tabulated as follows :