Closed
Feature Description
Support loading and running models quantized with the GPTQ/EXL2 formats in llama.cpp.
Motivation
These appear to be fast and useful quantisation methods:
- https://towardsdatascience.com/exllamav2-the-fastest-library-to-run-llms-32aeda294d26
- https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
- https://huggingface.co/blog/gptq-integration
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/ (a detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time)
Possible Implementation
N/A
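Although the issue leaves the implementation open, the core idea these formats share is storing weights as low-bit integers with one scale per group. A minimal sketch of that per-group 4-bit round-trip, assuming a simple symmetric scheme (helper names are hypothetical, and real GPTQ/EXL2 additionally use error compensation and mixed bit widths; this is not llama.cpp or ExLlama API):

```python
def quantize_4bit_groupwise(weights, group_size=128):
    """Quantize floats to signed 4-bit ints [-8, 7] with one scale per group.

    Illustrative only: GPTQ/EXL2 store per-group scales like this, but their
    actual algorithms also compensate quantization error across columns.
    """
    assert len(weights) % group_size == 0
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Symmetric scale: map the group's max magnitude onto 7.
        scale = max(abs(x) for x in group) / 7.0 or 1.0
        scales.append(scale)
        q.append([max(-8, min(7, round(x / scale))) for x in group])
    return q, scales

def dequantize_4bit_groupwise(q, scales):
    """Recover approximate floats from 4-bit values and per-group scales."""
    out = []
    for group, scale in zip(q, scales):
        out.extend(v * scale for v in group)
    return out

# Round-trip on synthetic weights in [-1, 1): per-element error is
# bounded by half a quantization step (scale / 2).
weights = [0.01 * ((i * 37) % 200 - 100) for i in range(256)]
q, scales = quantize_4bit_groupwise(weights)
recovered = dequantize_4bit_groupwise(q, scales)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

The per-group scale keeps the format compact (4 bits per weight plus one float per 128 weights) while bounding the reconstruction error by half a step of the group's scale, which is why these formats can stay close to FP16 perplexity.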