Add GPTQ to prototype #3517

Merged
jcaip merged 3 commits into main from jcaip/gptq on Jan 23, 2026

Conversation

@jcaip (Contributor) commented Dec 19, 2025

Summary:

This PR adds GPTQ support to torchao.prototype.gptq.

It is exposed via a new config, GPTQConfig, which has two steps: "observe" and "convert".

When quantize_(model, GPTQConfig(step="observe")) is run, observer tensors are attached to the weight tensors; these keep track of linear / torch.bmm ops and update the Hessian matrix based on the observed inputs.

When quantize_(model, GPTQConfig(step="convert")) is run, we find any observer tensors, take the accumulated Hessian, and run GPTQ quantization to compute the quantized weights. The core of this logic is in gptq_quantize.

Currently, Int4WeightOnlyConfig and Int8WeightOnlyConfig are supported as base configs.
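
For illustration, here is a minimal sketch of the two-step flow; the step="observe"/"convert" usage is taken from this PR, but the exact GPTQConfig signature (in particular how the base config is passed) is an assumption:

```python
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig
from torchao.prototype.gptq import GPTQConfig

model = ...                  # an nn.Module with nn.Linear layers
calibration_batches = [...]  # calibration inputs

# Step 1: attach observer tensors to the weights; forward passes accumulate the Hessian
# (the base_config kwarg below is assumed, not confirmed by this PR)
quantize_(model, GPTQConfig(base_config=Int4WeightOnlyConfig(), step="observe"))
with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

# Step 2: replace observed weights with GPTQ-quantized weights (core logic in gptq_quantize)
quantize_(model, GPTQConfig(base_config=Int4WeightOnlyConfig(), step="convert"))
```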

Also included is an example script, gptq_example.py, that runs sequential / non-sequential quantization on hellaswag as a simple example.

I am comparing against this survey paper: https://arxiv.org/pdf/2409.11055v1. We appear to be on par accuracy-wise for int8 / int4 GPTQ. (GPTQ** is llm-compressor, GPTQ* is AutoGPTQ.)

To reproduce these results, please run:

python torchao/prototype/gptq/gptq_example.py --quantization baseline
python torchao/prototype/gptq/gptq_example.py --quantization int4-rtn
python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-sequential
python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-nonsequential
python torchao/prototype/gptq/gptq_example.py --quantization int8-rtn
python torchao/prototype/gptq/gptq_example.py --quantization int8-gptq-sequential
python torchao/prototype/gptq/gptq_example.py --quantization int8-gptq-nonsequential

Benchmark Results:

| Model | Method | W/A | Storage (GB) | BBH (3-shot) | TorchAO Results |
|---|---|---|---|---|---|
| Llama-3.1-8B-it | FP16 | 16 / 16 | 16 | 51.35 | 51.08 |
| | FP8 | 8 / 8 | 8 | 50.78 (↓0.57) | |
| | INT8 | 8 / 16 | 8 | N/A | 41.82 |
| | INT4 | 4 / 16 | 4 | N/A | 50.91 |
| | GPTQ* | 4 / 16 | 4 | 48.27 (↓3.08) | |
| | GPTQ** | 4 / 16 | 4 | 48.48 (↓2.87) | 48.32 |
| | GPTQ** | 8 / 16 | 8 | 51.29 (↑0.06) | 51.10 |
| | SmoothQuant | 8 / 8 | 8 | 51.32 (↓0.03) | |
| | AWQ | 4 / 16 | 4 | 48.25 (↓3.10) | |

Timings:

Running python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-sequential yields a quantized model in 6.6 minutes, while running the llm-compressor example here takes 8 minutes, so we're roughly 1.2x faster.

We still need to profile the runs to get a sense of where the speedup is coming from.

Test Plan:

pytest test/prototype/gptq/test_gptq.py 

@pytorch-bot bot commented Dec 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3517

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4ff83b2 with merge base 789f07a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Dec 19, 2025
@jcaip jcaip added the topic: improvement and accuracy labels Dec 19, 2025
@jcaip jcaip marked this pull request as ready for review December 19, 2025 19:29
@jcaip jcaip requested review from jerryzh168 and vkuzo December 19, 2025 19:29
from .observer import ObserverTensor


@dataclass
Contributor

nit: put logic in some_file.py instead of __init__.py

Comment on lines 160 to 170
# handle "dead" columns that never received any input (zero diagonal in H)
dead = torch.diag(H) == 0
H[dead, dead] = 1
W[:, dead] = 0

# dampen the Hessian diagonal for numerical stability before inverting
damp = percdamp * torch.mean(torch.diag(H))
diag = torch.arange(columns, device=device)
H[diag, diag] += damp
# invert H via Cholesky, then take the upper Cholesky factor of the inverse,
# which is what the column-wise updates below operate on
H = torch.linalg.cholesky(H)
H = torch.cholesky_inverse(H)
H = torch.linalg.cholesky(H, upper=True)
Hinv = H
Contributor

nit: add some comments linking to the right page of the paper


all_qparams = []

for W_quantize_block, block_start in zip(
Contributor

nit: add comments as needed to explain how this code is following the algorithm in the paper

def _calculate_hessian(inputs, device=None):
"""Calculate Hessian matrix from input activations for GPTQ.

DEPRECATED: This function is kept for backward compatibility in tests only.
Contributor

backward compatibility with what?

from torchao.utils import TorchAOBaseTensor


class ObserverTensor(TorchAOBaseTensor):
Contributor

the logic here is specific to GPTQ, can we ensure the name specifies that

@jerryzh168 (Contributor)

can you show e2e accuracy results as well

@jerryzh168 (Contributor) left a comment

I guess with this it's good that we don't need MultiTensor stuff anymore. I do remember there were some inplace op considerations before:

if is_in_place:

how is it addressed here?

@jcaip jcaip force-pushed the jcaip/gptq branch 2 times, most recently from 3907a7d to 2e8478f Compare January 14, 2026 11:06
@jcaip jcaip requested a review from vkuzo January 14, 2026 11:22
@jcaip (Contributor Author) commented Jan 14, 2026

I guess with this it's good that we don't need MultiTensor stuff anymore. I do remember there were some inplace op considerations before:

if is_in_place:

how is it addressed here?

The in-place op considerations are specific to MultiTensor, so by avoiding MultiTensor we don't have to worry about them.

We needed MultiTensor before to support propagating quantized outputs to the subsequent observer modules. Right now we instead quantize one transformer layer at a time via some custom code in gptq_example.py. This mirrors the approach that https://github.com/IST-DASLab/MoE-Quant takes.

llm-compressor does this by tracing the FX graph; it is possible to do something similar if we want.
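
For readers, a rough sketch of that layer-by-layer flow; the helper name sequential_gptq, the way layer inputs are captured, and the single-tensor layer outputs are simplifying assumptions, not the exact code in gptq_example.py:

```python
import torch
from torchao.quantization import quantize_

def sequential_gptq(layers, layer_inputs, observe_config, convert_config):
    """Quantize one transformer layer at a time, feeding each layer the outputs
    of already-quantized layers so quantization error propagates forward."""
    for layer in layers:
        # attach GPTQ observers; forward passes accumulate this layer's Hessian
        quantize_(layer, observe_config)
        with torch.no_grad():
            for x in layer_inputs:
                layer(x)
        # quantize this layer's weights using the accumulated Hessian
        quantize_(layer, convert_config)
        # recompute outputs with the quantized layer; they become the next layer's inputs
        with torch.no_grad():
            layer_inputs = [layer(x) for x in layer_inputs]
    return layer_inputs
```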


@dataclass
class GPTQConfig(AOBaseConfig):
"""Config for GPTQ quantization
Contributor Author

add explanation that this is with unquantized activations out of the box

torch.testing.assert_close(dynamic_out_eager, static_out_eager)

@common_utils.parametrize("dtype", [torch.bfloat16])
def test_int8_weight_only_v1_vs_v2_comparison(self, dtype):
Contributor

just test v2 and remove v1?

model,
calibration_data: List[torch.Tensor],
config: Any,
) -> None:
Contributor

add docblock explaining that user is expected to implement this for their model

config: Any,
) -> None:
# run with no grad otherwise keeping all the tensors around for the backwards will cause oom
with torch.no_grad():
Contributor

can be moved to a decorator on sequential_quantize to save a level of indent
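
A minimal sketch of that suggestion, assuming sequential_quantize is the calibration entry point referenced in this thread:

```python
import torch

@torch.no_grad()  # applies to the whole function body, so no nested `with` block is needed
def sequential_quantize(model, calibration_data, config):
    ...
```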


import torch
import torch.nn as nn
from fbgemm_gpu.experimental.gen_ai.quantize import int4_row_quantize_zp, pack_int4
@namgyu-youn (Contributor) commented Jan 22, 2026

Can we focus on W8A8-INT only, instead of W4A16-INT, for an early land? The fbgemm dependency issue has been reported many times, as I remember.

Contributor Author

I think this just needs to be gated properly like the other fbgemm imports


from .observer import GPTQObserverTensor

CONFIG_TO_TORCHAO_BASE_TENSOR = {
Contributor

Is this needed? Can we drop this?

Contributor Author

We need something to map from the configs to the tensor, which we don't have currently afaik.

def _calculate_hessian(inputs, device=None):
"""Calculate Hessian matrix from input activations for GPTQ.

DEPRECATED: This function is kept for backward compatibility in tests only.
Contributor

nit: update docblock

For example we can see how GPTQ improves accuracy on a toy example (2x2 matrix, 2-bit symmetric quantization):

Example:
Given: W = [[1.2, 0.8], X = [[1.0], Original output: W @ X = [[2.0],
Contributor

nit: can we switch this to do X @ W.T to match the common case? X @ W is also fine, it's confusing to have W first here.

for i in range(group_start - block_start, group_end - block_start):
w = W_quantize_block[:, i].unsqueeze(1)
if isinstance(base_config, Int4WeightOnlyConfig):
q = _int4_row_quantize_zp_precomputed_qparams(
Contributor

can probably improve how int4/int8/etc logic works in these two for loops, but better in a future PR

# Once a block is fully processed, perform global updates to H^-1 and W using batched versions of the error propagation equations.
W[:, block_end:] -= Err1.matmul(Hinv[block_start:block_end, block_end:])

if "cuda" in device.type:
Contributor

does this support cpu? maybe just drop cpu for now? my guess is things will be too slow to be useful on cpu

Contributor Author

It should work, but I agree this will be quite slow on CPU.

@vkuzo (Contributor) left a comment

looks good! I think we should land this once CI is green and keep improving in future PRs.

@jcaip jcaip force-pushed the jcaip/gptq branch 3 times, most recently from 6eb3cb0 to d5dcb04 Compare January 22, 2026 16:03
@jcaip jcaip force-pushed the jcaip/gptq branch 4 times, most recently from ec82b27 to 69cc3e0 Compare January 22, 2026 18:36
This PR adds GPTQ (post-training quantization) support to torchao:

- Add GPTQ quantization implementation (api.py)
- Add GPTQ observer for Hessian collection (observer.py)
- Add sequential quantization example for HuggingFace models (gptq_example.py)
- Add comprehensive GPTQ tests (test_gptqv2.py)
- Clean up documentation and examples with improved clarity

GPTQ quantizes weights column-by-column while propagating quantization errors
to subsequent columns, minimizing reconstruction error.
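
For reference, a minimal sketch of that column-by-column update in the style of the GPTQ paper; quantize_column is a placeholder for the actual int4/int8 rounding in gptq_quantize, and the blocked / lazy-batch updates used in the PR are omitted:

```python
import torch

def gptq_update(W: torch.Tensor, Hinv: torch.Tensor, quantize_column):
    """W: (out_features, columns) weight; Hinv: upper Cholesky factor of H^-1."""
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i]
        q = quantize_column(w)       # round column i to the quantization grid
        Q[:, i] = q
        err = (w - q) / Hinv[i, i]   # scaled quantization error for column i
        # propagate the error to the not-yet-quantized columns
        W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return Q
```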
@jcaip jcaip merged commit 81d7755 into main Jan 23, 2026
19 checks passed