While loading a TorchAO-quantized model that uses the Int4Opaque packing format into vLLM, I hit the following error:
(EngineCore_DP0 pid=2159165)     raise NotImplementedError(
(EngineCore_DP0 pid=2159165) NotImplementedError: Int4OpaqueTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.slice', overload='Tensor')>, types=(<class 'torchao.quantization.Int4OpaqueTensor'>,), arg_types=(<class 'torchao.quantization.Int4OpaqueTensor'>, <class 'int'>, <class 'int'>, <class 'int'>), kwarg_types={}
Versions:
torch: 2.8.0+cpu
torchao: 0.14.1
vllm: 0.11.0+cpu
Diagnosis:
- vLLM's weight-loading code (in vllm/model_executor/layers/linear.py) calls param_data.narrow() to shard weights for tensor parallelism.
- narrow() is implemented in terms of the aten.slice.Tensor operation.
- Int4OpaqueTensor does not support slicing: torchao's dispatch system intercepts aten.slice and, because no handler is registered for that op, raises NotImplementedError.
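The failure mode above can be sketched with a simplified stand-in for torchao's op-dispatch table; FakeInt4Tensor, register_op, and the handler names below are illustrative assumptions, not torchao's real internals. The point is only the mechanism: ops routed through the subclass's dispatch either hit a registered handler or raise NotImplementedError, which is exactly what happens when narrow() lowers to aten.slice.Tensor.

```python
# Minimal sketch (assumption: simplified model of torchao's dispatch;
# class and function names here are illustrative, not the library's).

OP_TABLE = {}  # op name -> handler; ops absent from this table are unsupported


def register_op(name):
    """Register a handler for one aten op (mimics an 'implements' decorator)."""
    def deco(fn):
        OP_TABLE[name] = fn
        return fn
    return deco


class FakeInt4Tensor:
    """Stand-in for a packed quantized tensor whose byte layout is opaque."""
    def __init__(self, payload):
        self.payload = payload

    def dispatch(self, op_name, *args):
        handler = OP_TABLE.get(op_name)
        if handler is None:
            # Mirrors the NotImplementedError in the traceback above.
            raise NotImplementedError(
                "dispatch: attempting to run unimplemented "
                f"operator/function: func={op_name}"
            )
        return handler(self, *args)


@register_op("aten.detach")
def _detach(t):
    return FakeInt4Tensor(t.payload)


t = FakeInt4Tensor(b"\x12\x34")
t.dispatch("aten.detach")  # registered op: succeeds

try:
    # narrow() would lower to aten.slice.Tensor, which has no handler here.
    t.dispatch("aten.slice.Tensor", 0, 0, 4096)
except NotImplementedError as e:
    err_msg = str(e)
print(err_msg)
```

Under this model, the fix is to register a slice handler that understands the packed layout (or for vLLM to avoid narrow() on such tensors); for the real library, that means an entry for aten.slice.Tensor in Int4OpaqueTensor's dispatch.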