Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors #995

Closed
@ggerganov

Description

The current Q4_0 format uses a single F32 scaling factor per block of quantized values.

@ikawrakow proposed changing this to 2x F16 factors instead of 1x F32: 679e1cb
Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.
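
For illustration, a minimal C sketch of what such a block might look like. The name `block_q4_0_f16x2` and the assumption that each F16 factor scales one 16-element half of a 32-element block are mine, not taken from the linked commit:

```c
#include <stdint.h>

#include "ggml.h" // for ggml_fp16_t (an IEEE 754 half stored as uint16_t)

#define QK 32 // quantized values per block

// hypothetical layout: one F16 scale per 16-value half of the block
typedef struct {
    ggml_fp16_t d0;   // scale for values 0..15
    ggml_fp16_t d1;   // scale for values 16..31
    uint8_t qs[QK/2]; // 32 x 4-bit quants, two per byte
} block_q4_0_f16x2;

// 2 + 2 + 16 = 20 bytes per 32 weights = 5.0 bits/weight,
// i.e. the same size as the current Q4_0 (one 4-byte F32 scale + 16 bytes)
```

The appeal is that storage stays at 20 bytes per block, while two scales can presumably track within-block variance better than one.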

The goal of this task is to implement this data format efficiently (quantization, dequantization and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can consider updating the current Q4_0 data format and potentially dropping support for Q4_1.
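
As a starting point, scalar reference routines might look roughly like this (same hypothetical layout as above; the nibble packing and the `amax/7` rounding rule are illustrative stand-ins, not necessarily what the linked commit does):

```c
#include <math.h>

// scalar reference quantization: each 16-value half gets its own scale
static void quantize_block_ref(const float *x, block_q4_0_f16x2 *y) {
    float d[2];
    for (int h = 0; h < 2; ++h) {
        float amax = 0.0f; // max |x| over this half
        for (int i = 0; i < QK/2; ++i) {
            amax = fmaxf(amax, fabsf(x[h*(QK/2) + i]));
        }
        d[h] = amax / 7.0f; // illustrative: map the half to [-7, 7]
    }
    y->d0 = ggml_fp32_to_fp16(d[0]);
    y->d1 = ggml_fp32_to_fp16(d[1]);

    for (int i = 0; i < QK/2; ++i) {
        const int h = (2*i)/(QK/2); // half index: 0 for i < 8, 1 otherwise
        const float id = d[h] != 0.0f ? 1.0f/d[h] : 0.0f;
        const int v0 = (int)roundf(x[2*i + 0]*id) + 8; // in [1, 15]
        const int v1 = (int)roundf(x[2*i + 1]*id) + 8;
        y->qs[i] = (uint8_t)(v0 | (v1 << 4));
    }
}

// scalar reference dequantization
static void dequantize_block_ref(const block_q4_0_f16x2 *x, float *y) {
    const float d[2] = { ggml_fp16_to_fp32(x->d0), ggml_fp16_to_fp32(x->d1) };
    for (int i = 0; i < QK/2; ++i) {
        const int h = (2*i)/(QK/2);
        y[2*i + 0] = ((x->qs[i] & 0x0F) - 8)*d[h];
        y[2*i + 1] = ((x->qs[i] >>   4) - 8)*d[h];
    }
}
```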

SIMD implementation progress

  • ARM NEON
  • AVX
  • WASM

I plan to work on the ARM NEON implementation.
If you want to help with any of these, propose an implementation in a PR, summarizing the inference speed and the perplexity you obtained.
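
For anyone validating a SIMD kernel, a scalar dot-product baseline under the same assumptions may help (shown here against an F32 vector for simplicity; the actual ggml kernels operate on two quantized rows):

```c
// scalar reference: dot product of n quantized values with an F32 vector,
// usable as a correctness baseline for a SIMD implementation
static float vec_dot_ref(int n, const block_q4_0_f16x2 *x, const float *y) {
    float sum = 0.0f;
    for (int b = 0; b < n/QK; ++b) {
        const float d[2] = {
            ggml_fp16_to_fp32(x[b].d0),
            ggml_fp16_to_fp32(x[b].d1),
        };
        for (int i = 0; i < QK/2; ++i) {
            const int h = (2*i)/(QK/2);
            sum += d[h]*((x[b].qs[i] & 0x0F) - 8)*y[b*QK + 2*i + 0];
            sum += d[h]*((x[b].qs[i] >>   4) - 8)*y[b*QK + 2*i + 1];
        }
    }
    return sum;
}
```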
