Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3517
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4ff83b2 with merge base 789f07a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torchao/prototype/gptq/__init__.py
Outdated
| from .observer import ObserverTensor |
| @dataclass |
nit: put logic in some_file.py instead of __init__.py
torchao/prototype/gptq/__init__.py
Outdated
| dead = torch.diag(H) == 0 |
| H[dead, dead] = 1 |
| W[:, dead] = 0 |
| damp = percdamp * torch.mean(torch.diag(H)) |
| diag = torch.arange(columns, device=device) |
| H[diag, diag] += damp |
| H = torch.linalg.cholesky(H) |
| H = torch.cholesky_inverse(H) |
| H = torch.linalg.cholesky(H, upper=True) |
| Hinv = H |
nit: add some comments linking to the right page of the paper
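For reference, this is roughly what such comments could look like on the quoted hunk (the pointer to "Step 3: Cholesky Reformulation" refers to the GPTQ paper, arXiv:2210.17323; the exact section/page references should be double-checked against the paper):

```python
# Columns with a zero Hessian diagonal never saw any activation ("dead" columns):
# zero the weight and put a 1 on the diagonal so H stays invertible.
dead = torch.diag(H) == 0
H[dead, dead] = 1
W[:, dead] = 0

# Dampen the diagonal by a small fraction of its mean (percdamp) for numerical
# stability, as recommended alongside the Cholesky reformulation in the paper.
damp = percdamp * torch.mean(torch.diag(H))
diag = torch.arange(columns, device=device)
H[diag, diag] += damp

# Step 3 of the paper ("Cholesky Reformulation"): instead of repeatedly inverting H,
# precompute the upper-triangular Cholesky factor of H^-1; every row of H^-1 that
# the later column updates need can be read off this factor.
H = torch.linalg.cholesky(H)
H = torch.cholesky_inverse(H)
H = torch.linalg.cholesky(H, upper=True)
Hinv = H
```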
torchao/prototype/gptq/__init__.py
Outdated
| all_qparams = [] |
| for W_quantize_block, block_start in zip( |
nit: add comments as needed to explain how this code is following the algorithm in the paper
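To illustrate how such comments might map the loop onto the paper (arXiv:2210.17323, Algorithm 1 with blocking), here is a self-contained, simplified sketch. The per-column quantizer is a toy symmetric rounder standing in for the PR's int4/int8 paths, and `Hinv` is assumed to be the upper Cholesky factor of H^-1 produced in the earlier hunk; this is not the PR's exact code:

```python
import torch


def gptq_quantize_sketch(W: torch.Tensor, Hinv: torch.Tensor, block_size: int = 128, n_bits: int = 4):
    """Schematic GPTQ (paper Algorithm 1, blocked); not the actual implementation."""
    W = W.clone()                      # work on a copy; the error updates mutate it
    rows, columns = W.shape
    Q = torch.zeros_like(W)

    def quant(w):
        # toy symmetric per-column quantizer
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax + 1e-12
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    for block_start in range(0, columns, block_size):
        block_end = min(block_start + block_size, columns)
        W_block = W[:, block_start:block_end]      # view into W, updated in place
        Err1 = torch.zeros_like(W_block)

        for i in range(block_end - block_start):
            col = block_start + i
            w = W_block[:, i].unsqueeze(1)
            q = quant(w)                             # paper: Q[:, j] <- quant(W[:, j])
            Q[:, col] = q.squeeze(1)
            err = (w - q) / Hinv[col, col]           # paper: E <- (W[:, j] - Q[:, j]) / [H^-1]_jj
            Err1[:, i] = err.squeeze(1)
            # propagate the error to the not-yet-quantized columns *inside* the block
            # (paper, Step 2: lazy batch-updates)
            W_block[:, i + 1:] -= err.matmul(Hinv[col, col + 1:block_end].unsqueeze(0))

        # once the block is done, one batched update for all remaining columns
        # (paper: W[:, (j+B):] -= E . H^-1[j:(j+B), (j+B):])
        W[:, block_end:] -= Err1.matmul(Hinv[block_start:block_end, block_end:])
    return Q
```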
torchao/prototype/gptq/__init__.py
Outdated
| def _calculate_hessian(inputs, device=None): |
| """Calculate Hessian matrix from input activations for GPTQ. |
| DEPRECATED: This function is kept for backward compatibility in tests only. |
backward compatibility with what?
torchao/prototype/gptq/observer.py
Outdated
| from torchao.utils import TorchAOBaseTensor |
| class ObserverTensor(TorchAOBaseTensor): |
the logic here is specific to GPTQ, can we ensure the name specifies that
can you show e2e accuracy results as well
jerryzh168 left a comment:
I guess with this it's good that we don't need MultiTensor stuff anymore. I do remember there were some inplace op considerations before:
ao/torchao/quantization/GPTQ/GPTQ.py (line 349 in f59a19b)
how is it addressed here?
The in-place op considerations are specific to MultiTensor, so if we avoid MultiTensor we don't have to worry about them. We needed MultiTensor before to support propagating quantized outputs to the subsequent observer modules. Right now we instead quantize one transformer layer at a time via some custom code in gptq_example.py. This mirrors the approach that https://github.com/IST-DASLab/MoE-Quant takes. llm-compressor does this by tracing the FX graph; it is possible to do something similar if we want.
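For context, a minimal sketch of that sequential flow (the `layers`/`layer_inputs` plumbing and the bare `GPTQConfig(step=...)` calls are simplifications; real decoder layers return tuples and need extra kwargs, see gptq_example.py for the actual logic):

```python
import torch
from torchao.quantization import quantize_
from torchao.prototype.gptq import GPTQConfig  # added by this PR


@torch.no_grad()
def sequential_gptq(layers, layer_inputs):
    """Sketch: quantize one transformer layer at a time, feeding quantized outputs forward."""
    for layer in layers:
        # 1. attach GPTQ observer tensors and accumulate the Hessian from this layer's inputs
        quantize_(layer, GPTQConfig(step="observe"))
        for inp in layer_inputs:
            layer(inp)

        # 2. run GPTQ on the collected Hessian and swap in the quantized weights
        quantize_(layer, GPTQConfig(step="convert"))

        # 3. outputs of the *quantized* layer become the inputs of the next layer;
        #    this is the propagation that MultiTensor used to handle
        layer_inputs = [layer(inp) for inp in layer_inputs]
    return layers
```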
| @dataclass |
| class GPTQConfig(AOBaseConfig): |
| """Config for GPTQ quantization |
add explanation that this is with unquantized activations out of the box
| torch.testing.assert_close(dynamic_out_eager, static_out_eager) |
| @common_utils.parametrize("dtype", [torch.bfloat16]) |
| def test_int8_weight_only_v1_vs_v2_comparison(self, dtype): |
| model, |
| calibration_data: List[torch.Tensor], |
| config: Any, |
| ) -> None: |
add docblock explaining that user is expected to implement this for their model
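For example, something along these lines (the function name `sequential_quantize` and the exact wording are placeholders inferred from the surrounding discussion):

```python
from typing import Any, List

import torch


def sequential_quantize(model, calibration_data: List[torch.Tensor], config: Any) -> None:
    """Run GPTQ observation and conversion over `model`, one block at a time.

    Users are expected to implement/adapt this for their own model: it needs to know
    how to split the model into sequentially quantizable units (e.g. transformer
    layers) and how to feed `calibration_data` through them in order.
    """
    ...
```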
| config: Any, |
| ) -> None: |
| # run with no grad otherwise keeping all the tensors around for the backwards will cause oom |
| with torch.no_grad(): |
can be moved to a decorator on sequential_quantize to save a level of indent
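i.e. roughly (sketch, using the `sequential_quantize` name from the comment above):

```python
@torch.no_grad()  # avoids retaining activations for backward, preventing the OOM
def sequential_quantize(model, calibration_data, config) -> None:
    ...
```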
torchao/prototype/gptq/gptq.py
Outdated
| import torch |
| import torch.nn as nn |
| from fbgemm_gpu.experimental.gen_ai.quantize import int4_row_quantize_zp, pack_int4 |
Can we focus on W8A8-INT only, instead of W4A16-INT for early land? fbgemm dependency issue has been reported many times I remember.
I think this just needs to be gated properly like the other fbgemm imports
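For illustration, one common shape for that gating (a sketch; the flag and helper names here are made up, and the PR may end up gating it differently):

```python
# Guarded import so the module stays importable without fbgemm_gpu installed.
try:
    from fbgemm_gpu.experimental.gen_ai.quantize import int4_row_quantize_zp, pack_int4

    _HAS_FBGEMM = True
except ImportError:
    int4_row_quantize_zp, pack_int4 = None, None
    _HAS_FBGEMM = False


def _require_fbgemm():
    # call this at the top of the code paths that actually need the fbgemm kernels
    if not _HAS_FBGEMM:
        raise ImportError(
            "this GPTQ path requires fbgemm_gpu (fbgemm_gpu.experimental.gen_ai.quantize)"
        )
```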
| from .observer import GPTQObserverTensor |
| CONFIG_TO_TORCHAO_BASE_TENSOR = { |
Is this needed? Can we drop this?
We need something to map from the configs to the tensor, which we don't have currently afaik.
test/prototype/gptq/test_gptq.py
Outdated
| def _calculate_hessian(inputs, device=None): |
| """Calculate Hessian matrix from input activations for GPTQ. |
| DEPRECATED: This function is kept for backward compatibility in tests only. |
torchao/prototype/gptq/gptq.py
Outdated
| For example we can see how GPTQ improves accuracy on a toy example (2x2 matrix, 2-bit symmetric quantization): |
| Example: |
| Given: W = [[1.2, 0.8], X = [[1.0], Original output: W @ X = [[2.0], |
nit: can we switch this to do X @ W.T to match the common case? X @ W is also fine, it's confusing to have W first here.
| for i in range(group_start - block_start, group_end - block_start): |
| w = W_quantize_block[:, i].unsqueeze(1) |
| if isinstance(base_config, Int4WeightOnlyConfig): |
| q = _int4_row_quantize_zp_precomputed_qparams( |
can probably improve how int4/int8/etc logic works in these two for loops, but better in a future PR
torchao/prototype/gptq/gptq.py
Outdated
| # Once a block is fully processed, perform global updates to H^-1 and W using batched versions of the error propagation equations. |
| W[:, block_end:] -= Err1.matmul(Hinv[block_start:block_end, block_end:]) |
| if "cuda" in device.type: |
does this support cpu? maybe just drop cpu for now? my guess is things will be too slow to be useful on cpu
It should work, but I agree this will be quite slow on CPU.
vkuzo left a comment:
looks good! I think we should land this once CI is green and keep improving in future PRs.
This PR adds GPTQ post-training quantization support to torchao:
- Add GPTQ quantization implementation (api.py)
- Add GPTQ observer for Hessian collection (observer.py)
- Add sequential quantization example for HuggingFace models (gptq_example.py)
- Add comprehensive GPTQ tests (test_gptqv2.py)
- Clean up documentation and examples with improved clarity

GPTQ quantizes weights column-by-column while propagating quantization errors to subsequent columns, minimizing reconstruction error.
Summary:
This PR adds GPTQ support to torchao.prototype.gptq. It is exposed via a new config, GPTQConfig, which can have two steps, "observe" and "convert".

When quantize_(model, GPTQConfig(step="observe")) is run, observer tensors are attached to the weight tensors, which keep track of linear / torch.bmm ops and update the Hessian matrix based on the observed inputs.

When quantize_(model, GPTQConfig(step="convert")) is run, we find any observer tensors, take the Hessian, and do int4 GPTQ quantization to find the weights. The core of this function is in gptq_quantize. Currently Int4WeightOnlyConfig and Int8WeightOnlyConfig are supported.
Also included is an example script, gptq_example.py, that does sequential / non-sequential quantization on hellaswag for a simple example. I am comparing against this survey paper: https://arxiv.org/pdf/2409.11055v1. We look to be on par accuracy-wise for int8 / int4 GPTQ. (GPTQ** is llm-compressor, GPTQ* is autogptq)
To reproduce these results please run:
Benchmark Results:
Timings:
Running python torchao/prototype/gptq/gptq_example.py --quantization int4-gptq-sequential yields a quantized model in 6.6 minutes, while running the llm-compressor example here takes 8 minutes, so we're approximately 17% faster. Need to profile the runs to get a sense of where the speedup is happening.
Test Plan: