Add GPTQ to prototype #3517

Merged
jcaip merged 3 commits into main from jcaip/gptq on Jan 23, 2026

Conversation

@jcaip (Contributor) commented Dec 19, 2025

Summary:

This PR adds GPTQ support to torchao.prototype.gptq.

It is exposed via a new config, GPTQConfig, which has two steps: "observe" and "convert".

When quantize_(model, GPTQConfig(step="observe")) is run, observer tensors are attached to the weight tensors; these keep track of linear / torch.bmm ops and update the Hessian matrix based on the observed inputs.

When quantize_(model, GPTQConfig(step="convert")) is run, we find any observer tensors, take the accumulated Hessian, and run GPTQ quantization to compute the quantized weights. The core of this logic is in gptq_quantize.

Currently, Int4WeightOnlyConfig and Int8WeightOnlyConfig are supported as base configs.
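
For illustration, here is a minimal sketch of the two-step flow; the step="observe"/"convert" usage is taken from this PR, but the exact GPTQConfig signature (in particular how the base config is passed) is an assumption:

```python
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig
from torchao.prototype.gptq import GPTQConfig

model = ...                  # an nn.Module with nn.Linear layers
calibration_batches = [...]  # calibration inputs

# Step 1: attach observer tensors to the weights; forward passes accumulate the Hessian
# (the base_config kwarg below is assumed, not confirmed by this PR)
quantize_(model, GPTQConfig(base_config=Int4WeightOnlyConfig(), step="observe"))
with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

# Step 2: replace observed weights with GPTQ-quantized weights (core logic in gptq_quantize)
quantize_(model, GPTQConfig(base_config=Int4WeightOnlyConfig(), step="convert"))
```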

Also included is an example script, gptq_example.py, that runs sequential / non-sequential quantization on hellaswag as a simple example.

I am comparing against this survey paper: https://arxiv.org/pdf/2409.11055v1. We appear to be on par accuracy-wise for int8 / int4 GPTQ. (GPTQ** is llm-compressor, GPTQ* is AutoGPTQ.)

To reproduce these results, please run:

python torchao/prototype/gptq/gptq_example.py --quantization baseline
python torchao/prototype/gptq/gptq_example.py --quantization int4-rtn
python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-sequential
python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-nonsequential
python torchao/prototype/gptq/gptq_example.py --quantization int8-rtn
python torchao/prototype/gptq/gptq_example.py --quantization int8-gptq-sequential
python torchao/prototype/gptq/gptq_example.py --quantization int8-gptq-nonsequential

Benchmark Results:

| Model | Method | W/A | Storage (GB) | BBH (3-shot) | TorchAO Results |
|---|---|---|---|---|---|
| Llama-3.1-8B-it | FP16 | 16 / 16 | 16 | 51.35 | 51.08 |
| | FP8 | 8 / 8 | 8 | 50.78 (↓0.57) | |
| | INT8 | 8 / 16 | 8 | N/A | 41.82 |
| | INT4 | 4 / 16 | 4 | N/A | 50.91 |
| | GPTQ* | 4 / 16 | 4 | 48.27 (↓3.08) | |
| | GPTQ** | 4 / 16 | 4 | 48.48 (↓2.87) | 48.32 |
| | GPTQ** | 8 / 16 | 8 | 51.29 (↑0.06) | 51.10 |
| | SmoothQuant | 8 / 8 | 8 | 51.32 (↓0.03) | |
| | AWQ | 4 / 16 | 4 | 48.25 (↓3.10) | |

Timings:

Running python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-sequential yields a quantized model in 6.6 minutes, while running the llm-compressor example here takes 8 minutes, so we're roughly 1.2x faster.

We still need to profile the runs to get a sense of where the speedup is coming from.

Test Plan:

pytest test/prototype/gptq/test_gptq.py 

@pytorch-bot bot commented Dec 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3517

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4ff83b2 with merge base 789f07a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Dec 19, 2025
@jcaip jcaip added the topic: improvement and accuracy labels Dec 19, 2025
@jcaip jcaip marked this pull request as ready for review December 19, 2025 19:29
@jcaip jcaip requested review from jerryzh168 and vkuzo December 19, 2025 19:29
from .observer import ObserverTensor


@dataclass
Contributor

nit: put logic in some_file.py instead of __init__.py

Comment on lines 160 to 170
# handle "dead" columns that never received any input (zero diagonal in H)
dead = torch.diag(H) == 0
H[dead, dead] = 1
W[:, dead] = 0

# dampen the Hessian diagonal for numerical stability before inverting
damp = percdamp * torch.mean(torch.diag(H))
diag = torch.arange(columns, device=device)
H[diag, diag] += damp
# invert H via Cholesky, then take the upper Cholesky factor of the inverse,
# which is what the column-wise updates below operate on
H = torch.linalg.cholesky(H)
H = torch.cholesky_inverse(H)
H = torch.linalg.cholesky(H, upper=True)
Hinv = H
Contributor

nit: add some comments linking to the right page of the paper


all_qparams = []

for W_quantize_block, block_start in zip(
Contributor

nit: add comments as needed to explain how this code is following the algorithm in the paper

def _calculate_hessian(inputs, device=None):
"""Calculate Hessian matrix from input activations for GPTQ.

DEPRECATED: This function is kept for backward compatibility in tests only.
Contributor

backward compatibility with what?

from torchao.utils import TorchAOBaseTensor


class ObserverTensor(TorchAOBaseTensor):
Contributor

the logic here is specific to GPTQ, can we ensure the name specifies that

@jerryzh168 (Contributor)

can you show e2e accuracy results as well

@jerryzh168 (Contributor) left a comment

I guess with this it's good that we don't need MultiTensor stuff anymore. I do remember there were some inplace op considerations before:

if is_in_place:

how is it addressed here?

@jcaip jcaip force-pushed the jcaip/gptq branch 2 times, most recently from 3907a7d to 2e8478f Compare January 14, 2026 11:06
@jcaip jcaip requested a review from vkuzo January 14, 2026 11:22
@jcaip (Contributor Author) commented Jan 14, 2026

I guess with this it's good that we don't need MultiTensor stuff anymore. I do remember there were some inplace op considerations before:

if is_in_place:

how is it addressed here?

The in-place op considerations are specific to MultiTensor, so by avoiding MultiTensor we don't have to worry about them.

We needed MultiTensor before to support propagating quantized outputs to the subsequent observer modules. Right now we instead quantize one transformer layer at a time via some custom code in gptq_example.py. This mirrors the approach that https://github.com/IST-DASLab/MoE-Quant takes.

llm-compressor does this by tracing the FX graph; it is possible to do something similar if we want.
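
For readers, a rough sketch of that layer-by-layer flow; the helper name sequential_gptq, the way layer inputs are captured, and the single-tensor layer outputs are simplifying assumptions, not the exact code in gptq_example.py:

```python
import torch
from torchao.quantization import quantize_

def sequential_gptq(layers, layer_inputs, observe_config, convert_config):
    """Quantize one transformer layer at a time, feeding each layer the outputs
    of already-quantized layers so quantization error propagates forward."""
    for layer in layers:
        # attach GPTQ observers; forward passes accumulate this layer's Hessian
        quantize_(layer, observe_config)
        with torch.no_grad():
            for x in layer_inputs:
                layer(x)
        # quantize this layer's weights using the accumulated Hessian
        quantize_(layer, convert_config)
        # recompute outputs with the quantized layer; they become the next layer's inputs
        with torch.no_grad():
            layer_inputs = [layer(x) for x in layer_inputs]
    return layer_inputs
```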


@dataclass
class GPTQConfig(AOBaseConfig):
"""Config for GPTQ quantization
Contributor Author

add explanation that this is with unquantized activations out of the box

torch.testing.assert_close(dynamic_out_eager, static_out_eager)

@common_utils.parametrize("dtype", [torch.bfloat16])
def test_int8_weight_only_v1_vs_v2_comparison(self, dtype):
Contributor

just test v2 and remove v1?

model,
calibration_data: List[torch.Tensor],
config: Any,
) -> None:
Contributor

add docblock explaining that user is expected to implement this for their model

config: Any,
) -> None:
# run with no grad otherwise keeping all the tensors around for the backwards will cause oom
with torch.no_grad():
Contributor

can be moved to a decorator on sequential_quantize to save a level of indent
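
A minimal sketch of that suggestion, assuming sequential_quantize is the calibration entry point referenced in this thread:

```python
import torch

@torch.no_grad()  # applies to the whole function body, so no nested `with` block is needed
def sequential_quantize(model, calibration_data, config):
    ...
```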


import torch
import torch.nn as nn
from fbgemm_gpu.experimental.gen_ai.quantize import int4_row_quantize_zp, pack_int4
@namgyu-youn (Contributor) commented Jan 22, 2026

Can we focus on W8A8-INT only, instead of W4A16-INT, for an early land? The fbgemm dependency issue has been reported many times, as I remember.

Contributor Author

I think this just needs to be gated properly like the other fbgemm imports


from .observer import GPTQObserverTensor

CONFIG_TO_TORCHAO_BASE_TENSOR = {
Contributor

Is this needed? Can we drop this?

Contributor Author

We need something to map from the configs to the tensor, which we don't have currently afaik.

def _calculate_hessian(inputs, device=None):
"""Calculate Hessian matrix from input activations for GPTQ.

DEPRECATED: This function is kept for backward compatibility in tests only.
Contributor

nit: update docblock

For example we can see how GPTQ improves accuracy on a toy example (2x2 matrix, 2-bit symmetric quantization):

Example:
Given: W = [[1.2, 0.8], X = [[1.0], Original output: W @ X = [[2.0],
Contributor

nit: can we switch this to do X @ W.T to match the common case? X @ W is also fine, it's confusing to have W first here.

for i in range(group_start - block_start, group_end - block_start):
w = W_quantize_block[:, i].unsqueeze(1)
if isinstance(base_config, Int4WeightOnlyConfig):
q = _int4_row_quantize_zp_precomputed_qparams(
Contributor

can probably improve how int4/int8/etc logic works in these two for loops, but better in a future PR

# Once a block is fully processed, perform global updates to H^-1 and W using batched versions of the error propagation equations.
W[:, block_end:] -= Err1.matmul(Hinv[block_start:block_end, block_end:])

if "cuda" in device.type:
Contributor

does this support cpu? maybe just drop cpu for now? my guess is things will be too slow to be useful on cpu

Contributor Author

It should work, but I agree this will be quite slow on CPU.

@vkuzo (Contributor) left a comment

looks good! I think we should land this once CI is green and keep improving in future PRs.

@jcaip jcaip force-pushed the jcaip/gptq branch 3 times, most recently from 6eb3cb0 to d5dcb04 Compare January 22, 2026 16:03
@jcaip jcaip force-pushed the jcaip/gptq branch 4 times, most recently from ec82b27 to 69cc3e0 Compare January 22, 2026 18:36
This PR adds GPTQ (post-training quantization) support to torchao:

- Add GPTQ quantization implementation (api.py)
- Add GPTQ observer for Hessian collection (observer.py)
- Add sequential quantization example for HuggingFace models (gptq_example.py)
- Add comprehensive GPTQ tests (test_gptqv2.py)
- Clean up documentation and examples with improved clarity

GPTQ quantizes weights column-by-column while propagating quantization errors
to subsequent columns, minimizing reconstruction error.
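
For reference, a minimal sketch of that column-by-column update in the style of the GPTQ paper; quantize_column is a placeholder for the actual int4/int8 rounding in gptq_quantize, and the blocked / lazy-batch updates used in the PR are omitted:

```python
import torch

def gptq_update(W: torch.Tensor, Hinv: torch.Tensor, quantize_column):
    """W: (out_features, columns) weight; Hinv: upper Cholesky factor of H^-1."""
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i]
        q = quantize_column(w)       # round column i to the quantization grid
        Q[:, i] = q
        err = (w - q) / Hinv[i, i]   # scaled quantization error for column i
        # propagate the error to the not-yet-quantized columns
        W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return Q
```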
@jcaip jcaip merged commit 81d7755 into main Jan 23, 2026
19 checks passed