Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors #995

Closed
@ggerganov

Description

The current Q4_0 format uses a single F32 scaling factor per block of quantized values.

@ikawrakow proposed changing this to 2x F16 factors instead of 1x F32: 679e1cb
Initial results indicate that this might be as accurate as Q4_1 and hopefully as fast as the current Q4_0.
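
For illustration, a minimal C sketch of what such a block might look like. The name `block_q4_0_f16x2` and the assumption that each F16 factor scales one 16-element half of a 32-element block are mine, not taken from the linked commit:

```c
#include <stdint.h>

#include "ggml.h" // for ggml_fp16_t (an IEEE 754 half stored as uint16_t)

#define QK 32 // quantized values per block

// hypothetical layout: one F16 scale per 16-value half of the block
typedef struct {
    ggml_fp16_t d0;   // scale for values 0..15
    ggml_fp16_t d1;   // scale for values 16..31
    uint8_t qs[QK/2]; // 32 x 4-bit quants, two per byte
} block_q4_0_f16x2;

// 2 + 2 + 16 = 20 bytes per 32 weights = 5.0 bits/weight,
// i.e. the same size as the current Q4_0 (one 4-byte F32 scale + 16 bytes)
```

The appeal is that storage stays at 20 bytes per block, while two scales can presumably track within-block variance better than one.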

The goal of this task is to implement this data format efficiently (quantization, dequantization and dot product), measure the speed and perplexity, and decide whether it is viable. Depending on the results, we can consider updating the current Q4_0 data format and potentially dropping support for Q4_1.
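
As a starting point, scalar reference routines might look roughly like this (same hypothetical layout as above; the nibble packing and the `amax/7` rounding rule are illustrative stand-ins, not necessarily what the linked commit does):

```c
#include <math.h>

// scalar reference quantization: each 16-value half gets its own scale
static void quantize_block_ref(const float *x, block_q4_0_f16x2 *y) {
    float d[2];
    for (int h = 0; h < 2; ++h) {
        float amax = 0.0f; // max |x| over this half
        for (int i = 0; i < QK/2; ++i) {
            amax = fmaxf(amax, fabsf(x[h*(QK/2) + i]));
        }
        d[h] = amax / 7.0f; // illustrative: map the half to [-7, 7]
    }
    y->d0 = ggml_fp32_to_fp16(d[0]);
    y->d1 = ggml_fp32_to_fp16(d[1]);

    for (int i = 0; i < QK/2; ++i) {
        const int h = (2*i)/(QK/2); // half index: 0 for i < 8, 1 otherwise
        const float id = d[h] != 0.0f ? 1.0f/d[h] : 0.0f;
        const int v0 = (int)roundf(x[2*i + 0]*id) + 8; // in [1, 15]
        const int v1 = (int)roundf(x[2*i + 1]*id) + 8;
        y->qs[i] = (uint8_t)(v0 | (v1 << 4));
    }
}

// scalar reference dequantization
static void dequantize_block_ref(const block_q4_0_f16x2 *x, float *y) {
    const float d[2] = { ggml_fp16_to_fp32(x->d0), ggml_fp16_to_fp32(x->d1) };
    for (int i = 0; i < QK/2; ++i) {
        const int h = (2*i)/(QK/2);
        y[2*i + 0] = ((x->qs[i] & 0x0F) - 8)*d[h];
        y[2*i + 1] = ((x->qs[i] >>   4) - 8)*d[h];
    }
}
```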

SIMD implementation progress

  • ARM NEON
  • AVX
  • WASM

I plan to work on the ARM NEON implementation.
If you want to help with any of these, propose an implementation in a PR, summarizing the inference speed and the perplexity you obtained.
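
For anyone validating a SIMD kernel, a scalar dot-product baseline under the same assumptions may help (shown here against an F32 vector for simplicity; the actual ggml kernels operate on two quantized rows):

```c
// scalar reference: dot product of n quantized values with an F32 vector,
// usable as a correctness baseline for a SIMD implementation
static float vec_dot_ref(int n, const block_q4_0_f16x2 *x, const float *y) {
    float sum = 0.0f;
    for (int b = 0; b < n/QK; ++b) {
        const float d[2] = {
            ggml_fp16_to_fp32(x[b].d0),
            ggml_fp16_to_fp32(x[b].d1),
        };
        for (int i = 0; i < QK/2; ++i) {
            const int h = (2*i)/(QK/2);
            sum += d[h]*((x[b].qs[i] & 0x0F) - 8)*y[b*QK + 2*i + 0];
            sum += d[h]*((x[b].qs[i] >>   4) - 8)*y[b*QK + 2*i + 1];
        }
    }
    return sum;
}
```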
