Releases: amd/Quark
AMD Quark Release 0.11.1
AMD Quark for PyTorch
Model Support
Supported out-of-the-box model architectures:
- Kimi-K2-Thinking, Kimi-K2-Instruct, Kimi-K2.5
- Qwen3 MoE, Qwen3 Coder, Qwen3 Coder-Next
- DeepSeek-V3.2, DeepSeek-OCR
- GLM-4.7
- Minimax-M2.1
New Features
- Added File-to-File quantization for ultra-large models. This mode supports weight-only quantization and dynamic activation quantization + weight quantization, exports hf_format only, and can also accept pre-quantized inputs (DeepSeek-style FP8, compressed-tensors) and re-quantize them to a different format.
For example, the command below runs file-to-file quantization to MXFP4:
```shell
python3 quantize_quark.py --model_dir [model checkpoint folder] \
    --output_dir [output folder] \
    --quant_scheme mxfp4 \
    --file2file_quantization \
    --skip_evaluation
```
- Added a pre-quantization compatibility check for `transformers` in LLM PTQ workflows, and enabled dry-run compatibility checking by default, with clearer error messages when model loading fails.
Bug fixes and minor improvements
- Fixed weight calibration coverage to ensure complete calibration even for weights outside the forward path, and added token distribution coverage warnings during calibration.
AMD Quark for ONNX
New Features
- Support using a YAML file as input to perform custom preprocessing for float models before quantization.
Enhancements
- Memory optimization has been extended to all calibration methods, particularly further reducing memory usage during activation data collection.
Bug fixes and minor improvements
- Infer the kernel size from weights when the `kernel_shape` attribute of Conv nodes is not explicitly present during fast finetuning.
- Fix scale extraction to handle `int32_data` in ONNX initializers.
- Add an optional config parameter to onnxslim optimization.
- Update requirements.txt for onnxslim from 0.1.77 to 0.1.84.
- Handle NaN and Inf values in model inference output.
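A pass like this can be sketched as a small sanitization step over raw inference outputs before metrics are computed. The following is a minimal NumPy illustration, not Quark ONNX's actual implementation; the replacement values (zero for NaN, the largest finite magnitude for infinities) are assumptions:

```python
import numpy as np

def sanitize_output(arr: np.ndarray) -> np.ndarray:
    """Replace NaN and +/-Inf values so downstream metrics do not break."""
    finite = arr[np.isfinite(arr)]
    # Fall back to 0.0 when every element is non-finite.
    fill = float(np.abs(finite).max()) if finite.size else 0.0
    return np.nan_to_num(arr, nan=0.0, posinf=fill, neginf=-fill)

raw = np.array([1.0, np.nan, np.inf, -np.inf, 2.0])
print(sanitize_output(raw))  # [ 1.    0.    2.   -2.    2. ]
```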
Release 0.11
AMD Quark for PyTorch
AMD Quark 0.11 is tested against PyTorch 2.9 and is compatible with upstream transformers==4.57.
Fused "rotation" and "quarot" algorithms into a single interface
The pre-quantization algorithms "rotation" and "quarot" have been merged into a single rotation algorithm, configured through RotationConfig. By default, only the R1 rotation is applied, corresponding to the previous quant_algo="rotation" behavior.
Quark Torch Quantization Config Refactor
- The quantization configuration classes have been renamed for better clarity and consistency:
  - `QuantizationSpec` is deprecated in favor of `QTensorConfig`.
  - `QuantizationConfig` is deprecated in favor of `QLayerConfig`.
  - `Config` is deprecated in favor of `QConfig`.
- The deprecated class names (`QuantizationSpec`, `QuantizationConfig`, `Config`) are still available as aliases for backward compatibility, but will be removed in a future release.
- Before refactor:

  ```python
  from quark.torch.quantization.config.config import Config, QuantizationConfig, QuantizationSpec

  quant_spec = QuantizationSpec(dtype=Dtype.int8, ...)
  quant_config = QuantizationConfig(weight=quant_spec, ...)
  config = Config(global_quant_config=quant_config, ...)
  ```

- After refactor:

  ```python
  from quark.torch.quantization.config.config import QConfig, QLayerConfig, QTensorConfig

  quant_spec = QTensorConfig(dtype=Dtype.int8, ...)
  quant_config = QLayerConfig(weight=quant_spec, ...)
  config = QConfig(global_quant_config=quant_config, ...)
  ```
quark torch-llm-ptq CLI Refactor and Simplification
The CLI has been significantly refactored to use the new LLMTemplate interface and remove redundant features:
- Removed model-specific algorithm configuration files (e.g., `awq_config.json`, `gptq_config.json`, `smooth_config.json`). Algorithm configurations are now handled automatically by `LLMTemplate`.
- Removed unnecessary CLI arguments, retaining only about a dozen essential ones.
- Simplified export: The CLI now only exports to Hugging Face safetensors format.
- Simplified evaluation: Evaluation now uses perplexity (PPL) on the wikitext-2 dataset instead of the previous multi-task evaluation framework.
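For reference, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch of the metric itself (illustrative only, not the actual ppl_eval code; the sample NLL values below are made up):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs from a model on held-out text such as wikitext-2.
nlls = [2.1, 1.7, 2.5, 1.9]
print(round(perplexity(nlls), 3))
```

A model that assigns probability 1 to every token would score a perplexity of exactly 1; higher values indicate a less confident model.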
Code Organization and Examples Refactor
Moved common utilities to quark.torch.utils:
- `model_preparation.py` and `data_preparation.py` are now available in `quark.torch.utils` for easier reuse across examples and applications.
- `module_replacement` utilities are now located in `quark.torch.utils.module_replacement`.
Moved LLM evaluation code to quark.contrib:
- The `llm_eval` module has been moved to `quark.contrib.llm_eval` and `examples/contrib/llm_eval`.
- Perplexity evaluation (`ppl_eval`) is now shared between the CLI and examples via `quark.contrib.llm_eval`.
Extended quantize_quark.py example script and quark torch-llm-ptq CLI with new features:
- Support for registering custom model templates and quantization schemes (example script only).
- Support for per-layer quantization scheme configuration via the `--layer_quant_scheme` argument.
- Support for custom algorithm configurations via the `--quant_algo_config_file` argument (example script only).
- Simplified quantization scheme naming: directly use the built-in scheme names (see breaking changes below).
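Pattern-based per-layer configuration of this kind can be sketched as a first-match lookup over (pattern, scheme) pairs. This is an illustration only; fnmatch-style globbing and first-match-wins ordering are assumptions here, not necessarily Quark's exact matching rules:

```python
from fnmatch import fnmatch

def resolve_scheme(layer_name: str, rules: list[tuple[str, str]], default: str) -> str:
    """Return the scheme of the first matching pattern, else the default."""
    for pattern, scheme in rules:
        if fnmatch(layer_name, pattern):
            return scheme
    return default

# Mirrors e.g.: --layer_quant_scheme lm_head int8 --layer_quant_scheme '*down_proj' fp8
rules = [("lm_head", "int8"), ("*down_proj", "fp8")]
print(resolve_scheme("model.layers.0.mlp.down_proj", rules, default="mxfp4"))  # fp8
print(resolve_scheme("lm_head", rules, default="mxfp4"))  # int8
```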
Setting log level with QUARK_LOG_LEVEL
The logging level can now be set with the environment variable QUARK_LOG_LEVEL, e.g. QUARK_LOG_LEVEL=debug, QUARK_LOG_LEVEL=warning, QUARK_LOG_LEVEL=error, or QUARK_LOG_LEVEL=critical.
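A typical way such a variable maps onto Python's standard logging levels (a generic sketch, not Quark's internal code):

```python
import logging
import os

# Set from the shell (QUARK_LOG_LEVEL=debug python3 ...) or programmatically:
os.environ["QUARK_LOG_LEVEL"] = "debug"  # or warning / error / critical

level_name = os.environ.get("QUARK_LOG_LEVEL", "info").upper()
level = getattr(logging, level_name, logging.INFO)  # unknown names fall back to INFO
logging.basicConfig(level=level)
print(logging.getLevelName(level))  # DEBUG
```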
Support for online rotations (online hadamard transform)
The rotation algorithm supports online rotations, in which a Hadamard transform is applied to the activations at runtime and the matching inverse rotation is folded into the adjacent weights, leaving the layer output unchanged in exact arithmetic.
Online rotations can be enabled using online_r1_rotation=True in RotationConfig. Please refer to its documentation and to the user guide for more details.
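The property underlying both offline and online rotations is that an orthogonal rotation inserted between activations and weights cancels out: (W R)(Rᵀ x) = W x. A small NumPy demonstration with a normalized Hadamard matrix (illustrative only, not Quark code):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # normalize so that H @ H.T == I

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))  # a weight matrix
x = rng.standard_normal(4)       # an activation vector
H = hadamard(4)

# Rotate weights offline, rotate activations online: the output is unchanged.
y_ref = W @ x
y_rot = (W @ H) @ (H.T @ x)
print(np.allclose(y_ref, y_rot))  # True
```

Quantization benefits because the rotated activations and weights have flatter distributions with fewer outliers, while the full-precision function is mathematically identical.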
Support for rotation / SmoothQuant scales fine-tuning (SpinQuant/OSTQuant)
We support fine-tuning joint rotations and smoothing scales as a non-destructive transformation.
This support is well tested for the llama, qwen3, qwen3_moe and gpt_oss architectures.
Rotation fine-tuning and online rotations are compatible with other algorithms such as GPTQ or Qronos.
Please refer to the documentation of RotationConfig, the example and the user guide for more details.
Minor changes and bug fixes
- Fix memory duplication and OOM issues when loading `gpt_oss` models for quantization.
- `ModelQuantizer.freeze` behavior is changed to permanently quantize weights. Weights are still stored in high precision, but QDQ (quantize + dequantize) has already been applied to them, which avoids rerunning QDQ on static weights at each subsequent call.
- The `scaled_fake_quantize` operator, which is used for QDQ, is now compiled with `torch.compile` by default, allowing significant speedups depending on the quantization scheme (1x-8x).
- An efficient MXFP4 dynamic quantization kernel is used for activations when quantizing models, fusing scale computation and QDQ operations.
- Batching support is fixed in the `lm-evaluation-harness` integration in the examples, correctly passing the user-provided `--eval_batch_size`.
- CPU/GPU communication is removed in quantization observers, allowing for faster quantization and runtime, e.g. during model evaluation.
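As a rough model of what an MXFP4 QDQ pass computes: each block of up to 32 values shares a power-of-two scale, and each element is rounded to the nearest FP4 (E2M1) value. The sketch below is a simplified rendering of the OCP MX scheme, not the fused kernel itself:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes per the OCP MX specification.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_qdq_block(block: np.ndarray) -> np.ndarray:
    """Quantize-dequantize one block (<= 32 values) to MXFP4, simplified."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    # Shared power-of-two scale: block exponent minus the FP4 max exponent (2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = block / scale
    # Round each magnitude to the nearest FP4 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

x = np.array([0.1, -0.7, 1.3, 2.9])
print(mxfp4_qdq_block(x))  # quantized values: 0.0, -0.75, 1.5, 3.0
```

Dynamic quantization recomputes the block scale from the live activation values on every forward pass, which is why fusing the scale computation with QDQ pays off.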
Deprecations and breaking changes
- Quantization scheme names in `examples/torch/language_modeling/llm_ptq/quantize_quark.py` and the `quark torch-llm-ptq` CLI have been simplified and renamed:
  - `w_int4_per_group_sym` is deprecated in favor of `int4_wo_32`, `int4_wo_64`, `int4_wo_128` (depending on group size).
  - `w_uint4_per_group_asym` is deprecated in favor of `uint4_wo_32`, `uint4_wo_64`, `uint4_wo_128` (depending on group size).
  - `w_int8_a_int8_per_tensor_sym` is deprecated in favor of `int8`.
  - `w_fp8_a_fp8` is deprecated in favor of `fp8`.
  - `w_mxfp4_a_mxfp4` is deprecated in favor of `mxfp4`.
  - `w_mxfp4_a_fp8` is deprecated in favor of `mxfp4_fp8`.
  - `w_mxfp6_e3m2_a_mxfp6_e3m2` is deprecated in favor of `mxfp6_e3m2`.
  - `w_mxfp6_e2m3_a_mxfp6_e2m3` is deprecated in favor of `mxfp6_e2m3`.
  - `w_bfp16_a_bfp16` is deprecated in favor of `bfp16`.
  - `w_mx6_a_mx6` is deprecated in favor of `mx6`.
- The `--group_size` and `--group_size_per_layer` arguments in `examples/torch/language_modeling/llm_ptq/quantize_quark.py` and the `quark torch-llm-ptq` CLI have been removed. Group size is now embedded in the scheme name (e.g., `int4_wo_32`, `int4_wo_64`, `int4_wo_128`).
- The `--layer_quant_scheme` argument format in `examples/torch/language_modeling/llm_ptq/quantize_quark.py` and the `quark torch-llm-ptq` CLI has changed to repeated arguments with pattern and scheme pairs (e.g., `--layer_quant_scheme lm_head int8 --layer_quant_scheme '*down_proj' fp8`).
- The token counter used to count the number of tokens seen by each expert during calibration is now disabled by default, and requires the environment variable `QUARK_COUNT_OBSERVED_SAMPLES=1`.
- The export format `"quark_format"` is removed, following its deprecation in AMD Quark 0.10. Additionally, `quark.torch.export.api.ModelExporter` and `quark.torch.export.api.ModelImporter` are removed; please refer to the 0.10 release notes and to the documentation for the current API.
AMD Quark for ONNX
New Features
- Auto Search Pro
  - Hierarchical Search: Support for conditional and nested hyperparameter trees for advanced search strategies.
  - Custom Objectives: Support for custom evaluation logic aligned with specific needs.
  - Sampler Flexibility: Various samplers ('TPE', 'Grid Search', etc.) are available.
  - Parallel Search: Take advantage of parallelization to run multiple searches simultaneously, reducing time to solution.
  - Checkpoint: Resume interrupted hyperparameter optimization from the last checkpoint.
  - Visualization: View real-time visualizations of optimization performance and feature importance, making results easier to interpret.
  - Output Saving: Automatically save the best configuration, study database, and generated plots for analysis.
- Latency and memory usage profiling
  - Latency Profiling: Each quantization stage performs specific operations that contribute to the overall quantization pipeline, and their individual latencies are reported in the profiling results.
  - Memory Profiling
    - CPU Memory Profiling: By wrapping the Python script with mprof, detailed memory traces can be recorded during execution.
    - ROCm GPU Memory Profiling: For workflows involving ROCMExecutionProvider or any GPU-based quantization step, Quark ONNX offers a lightwe...
Release 0.10
AMD Quark for PyTorch
New Features
- Support PyTorch 2.7.1 and 2.8.0.
- Support for int3 quantization and exporting of models.
- Support the AWQ algorithm with Gemma3 and Phi4.
- Support the Qronos advanced post-training quantization algorithm. Please refer to the arXiv paper and the Quark documentation.
- Applying the GPTQ algorithm runs 3x-4x faster compared to AMD Quark 0.9, using CUDA/HIP Graph by default. If required, CUDA Graph for GPTQ can be disabled using the environment variable `QUARK_GRAPH_DEBUG=0`.
- The Quarot algorithm supports a new configuration parameter `rotation_size` to define custom Hadamard rotation sizes. Please refer to the QuaRotConfig documentation.
- QuantizationSpec check:
  - Every time a user finishes initializing a `QuantizationSpec`, a config check is performed automatically. If any invalid config is supplied, a warning or error message guides the user toward a correction, surfacing potential errors as early as possible rather than causing a runtime error during the quantization process.
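The fail-early idea can be sketched with a dataclass whose `__post_init__` hook validates fields at construction time. This is a generic illustration; the field names and valid values below are assumptions, not Quark's actual QuantizationSpec:

```python
from dataclasses import dataclass

VALID_DTYPES = {"int4", "int8", "fp8", "mxfp4"}

@dataclass
class QuantSpec:
    dtype: str
    group_size: int = 128

    def __post_init__(self) -> None:
        # Fail at construction time, not deep inside the quantization run.
        if self.dtype not in VALID_DTYPES:
            raise ValueError(
                f"unsupported dtype {self.dtype!r}, expected one of {sorted(VALID_DTYPES)}"
            )
        if self.group_size <= 0:
            raise ValueError("group_size must be positive")

spec = QuantSpec(dtype="int8")  # passes the check
print(spec)
```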
- LLM Depth-Wise Pruning tool:
  - A depth-wise pruning tool that decreases LLM model size by deleting consecutive decoder layers according to a supplied pruning ratio.
  - Based on PPL influence: the consecutive layers with the least influence on PPL are regarded as having the least influence on the LLM and can be deleted.
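The selection step can be sketched as a sliding-window search for the contiguous run of decoder layers whose removal changes PPL the least. This is a toy illustration; the per-layer influence scores are assumed to be given:

```python
def least_influential_window(influence: list[float], num_prune: int) -> tuple[int, int]:
    """Return [start, end) of the contiguous window with the smallest total influence."""
    best_start = 0
    best_sum = sum(influence[:num_prune])
    window = best_sum
    for start in range(1, len(influence) - num_prune + 1):
        # Slide the window: add the entering layer, drop the leaving one.
        window += influence[start + num_prune - 1] - influence[start - 1]
        if window < best_sum:
            best_sum, best_start = window, start
    return best_start, best_start + num_prune

# Hypothetical per-layer PPL influence for an 8-layer model; prune 3 layers.
scores = [0.9, 0.4, 0.1, 0.2, 0.1, 0.6, 0.8, 1.0]
print(least_influential_window(scores, 3))  # (2, 5): layers 2-4 are dropped
```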
- Model Support:
  - Support OCP MXFP4, MXFP6, MXFP8 quantization of new models: DeepSeek-R1, Llama4-Scout, Llama4-Maverick, gpt-oss-20b, gpt-oss-120b.
Deprecations and breaking changes
- OCP MXFP6 weight packing layout is modified to fit the layout expected by the CDNA4 `mfma_scale` instruction.
- In the `examples/language_modeling/llm_ptq/quantize_quark.py` example, the quantization scheme `w_mxfp4_a_mxfp6` is removed and replaced by `w_mxfp4_a_mxfp6_e2m3` and `w_mxfp4_a_mxfp6_e3m2`.
Important bug fixes
AMD Quark for ONNX
New Features:
- API Refactor (introduced a new API design with improved consistency and usability)
  - Supported class-based algorithm usage.
  - Aligned data types across Quark Torch and Quark ONNX.
  - Refactored quantization configs.
- Auto Search Enhancements
  - Two-Stage Search: First identifies the best calibration config, then searches for the optimal FastFinetune config based on it. Expands the search space for higher efficiency.
  - Advanced-Fastft Search: Supports continuous search spaces, advanced algorithms (e.g., TPE), and parallel execution for faster, smarter searching.
  - Joint-Parameter Search: Combines coupled parameters into a unified space to avoid ineffective configurations and improve search quality.
- Added support for ONNX 1.19 and ONNXRuntime 1.22.1.
- Added optimized weight-scale calculation with the MinMSE method to improve quantization accuracy.
- Accelerated calibration with multi-process support, covering algorithms such as MinMSE, Percentile, Entropy, Distribution, and LayerwisePercentile.
- Added progress bars for the Percentile, Entropy, Distribution, and LayerwisePercentile algorithms.
- Supported specifying a directory for saving cache files.
Enhancements:
- Significantly reduced memory usage across various configurations, including calibration and FastFinetune stages, with optimizations for both CPU and GPU memory.
- Improved clarity of error and warning outputs, helping users select better parameters based on memory and disk conditions.
Bug fixes and minor improvements:
- Provided actionable hints when OOM or insufficient disk space issues occur in calibration and fast fine-tuning.
- Fixed multi-GPU issues during FastFinetune.
- Fixed a bug related to converting BatchNorm to Conv.
- Fixed a bug in BF16 conversion on models larger than 2GB.
Quark Torch API Refactor
- LLMTemplate for simplified quantization configuration:
  - Introduced the `LLMTemplate` class for convenient LLM quantization configuration.
  - Built-in templates for popular LLM architectures (Llama4, Qwen, Mistral, Phi, DeepSeek, GPT-OSS, etc.).
  - Support for multiple quantization schemes: int4/uint4 (group sizes 32, 64, 128), int8, fp8, mxfp4, mxfp6e2m3, mxfp6e3m2, bfp16, mx6.
  - Advanced features: layer-wise quantization, KV cache quantization, attention quantization.
  - Algorithm support: AWQ, GPTQ, SmoothQuant, AutoSmoothQuant, Rotation.
  - Custom template and scheme registration capabilities for users to define their own templates and quantization schemes.
  ```python
  from quark.torch import LLMTemplate

  # List available templates
  templates = LLMTemplate.list_available()
  print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

  # Get a specific template
  llama_template = LLMTemplate.get("llama")

  # Create a basic configuration
  config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
  ```
- Export and import APIs are deprecated in favor of new ones:
  - `ModelExporter.export_safetensors_model` is deprecated in favor of `export_safetensors`.

    Before:

    ```python
    from quark.torch import ModelExporter
    from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

    export_config = ExporterConfig(json_export_config=JsonExporterConfig())
    exporter = ModelExporter(config=export_config, export_dir=export_dir)
    exporter.export_safetensors_model(model, quant_config)
    ```

    After:

    ```python
    from quark.torch import export_safetensors

    export_safetensors(model, output_dir=export_dir)
    ```

  - `ModelImporter.import_model_info` is deprecated in favor of `import_model_from_safetensors`.

    Before:

    ```python
    from quark.torch.export.api import ModelImporter

    model_importer = ModelImporter(
        model_info_dir=export_dir,
        saved_format="safetensors"
    )
    quantized_model = model_importer.import_model_info(original_model)
    ```

    After:

    ```python
    from quark.torch import import_model_from_safetensors

    quantized_model = import_model_from_safetensors(
        original_model,
        model_dir=export_dir
    )
    ```
Quark ONNX API Refactor
- Before:
  - Basic Usage:

    ```python
    from quark.onnx import ModelQuantizer
    from quark.onnx.quantization.config.config import Config
    from quark.onnx.quantization.config.custom_config import get_default_config

    input_model_path = "demo.onnx"
    quantized_model_path = "demo_quantized.onnx"
    calib_data_path = "calib_data"
    calib_data_reader = ImageDataReader(calib_data_path)

    a8w8_config = get_default_config("A8W8")
    quantization_config = Config(global_quant_config=a8w8_config)
    quantizer = ModelQuantizer(quantization_config)
    quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
    ```

  - Advanced Usage:

    ```python
    from quark.onnx import ModelQuantizer
    from quark.onnx.quantization.config.config import Config, QuantizationConfig
    from onnxruntime.quantization.calibrate import CalibrationMethod
    from onnxruntime.quantization.quant_utils import QuantFormat, QuantType, ExtendedQuantType

    input_model_path = "demo.onnx"
    quantized_model_path = "demo_quantized.onnx"
    calib_data_path = "calib_data"
    calib_data_reader = ImageDataReader(calib_data_path)

    DEFAULT_ADAROUND_PARAMS = {
        "DataSize": 1000,
        "FixedSeed": 1705472343,
        "BatchSize": 2,
        "NumIterations": 1000,
        "LearningRate": 0.1,
        "OptimAlgorithm": "adaround",
        "OptimDevice": "cpu",
        "InferDevice": "cpu",
        "EarlyStop": True,
    }

    quant_config = QuantizationConfig(
        calibrate_method=CalibrationMethod.Percentile,
        quant_format=QuantFormat.QDQ,
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        nodes_to_exclude=["/layer.2/Conv_1", "^/Conv/.*"],
        subgraphs_to_exclude=[(["start_node_1", "start_node_2"], ["end_node_1", "end_node_2"])],
        include_cle=True,
        include_fast...
    ```