Releases: amd/Quark

AMD Quark Release 0.11.1

19 Feb 16:55


AMD Quark for PyTorch

Model Support

Supported out-of-box model architectures:

  • Kimi-K2-Thinking, Kimi-K2-Instruct, Kimi-K2.5
  • Qwen3 MoE, Qwen3 Coder, Qwen3 Coder-Next
  • DeepSeek-V3.2, DeepSeek-OCR
  • GLM-4.7
  • Minimax-M2.1

New Features

  • Added file-to-file quantization for ultra-large models. This mode supports weight-only quantization as well as weight quantization combined with dynamic activation quantization, exports to hf_format only, and can also accept pre-quantized inputs (DeepSeek-style FP8, compressed-tensors) and re-quantize them to a different format.

    For example, the command below runs file-to-file quantization to MXFP4:

    python3 quantize_quark.py --model_dir [model checkpoint folder] \
                              --output_dir [output folder] \
                              --quant_scheme mxfp4 \
                              --file2file_quantization \
                              --skip_evaluation
  • Added a pre-quantization compatibility check for transformers in LLM PTQ workflows, and enabled dry-run compatibility checking by default with clearer error messages when model loading fails.

Bug fixes and minor improvements

  • Fixed weight calibration coverage to ensure complete calibration even for weights outside the forward path, and added token distribution coverage warnings during calibration.

AMD Quark for ONNX

New Features

  • Support using a YAML file as input to perform custom preprocessing for float models before quantization.

Enhancements

  • Memory optimization has been extended to all calibration methods, particularly further reducing memory usage during activation data collection.

Bug fixes and minor improvements

  • Infer the kernel size from weights when the kernel_shape attribute of Conv nodes is not explicitly present during fast finetuning.
  • Fix scale extraction to handle int32_data in ONNX initializers.
  • Add optional config parameter to onnxslim optimization.
  • Update requirements.txt for onnxslim from 0.1.77 to 0.1.84.
  • Handle NaN and Inf values in model inference output.
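
The kernel-size inference mentioned above follows directly from the ONNX Conv weight layout; a minimal sketch (the helper name is hypothetical, not a Quark API):

```python
# Illustrative sketch: when a Conv node lacks an explicit kernel_shape
# attribute, the kernel size can be inferred from the weight tensor, whose
# ONNX layout is (out_channels, in_channels/groups, *kernel_dims).
import numpy as np

def infer_kernel_shape(weight: np.ndarray) -> list[int]:
    # Spatial kernel dimensions follow the two channel dimensions.
    return list(weight.shape[2:])

w = np.zeros((64, 3, 3, 3))  # a typical 3x3 Conv weight
assert infer_kernel_shape(w) == [3, 3]
```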

Release 0.11

19 Feb 16:42


AMD Quark for PyTorch

AMD Quark 0.11 is tested against PyTorch 2.9 and is compatible with upstream transformers==4.57.

Fused "rotation" and "quarot" algorithms in a single interface

The pre-quantization algorithms "rotation" and "quarot" are fused together into a single rotation algorithm. It can be configured using RotationConfig. By default, only R1 rotation is applied, corresponding to the previous quant_algo="rotation" behavior.

Quark Torch Quantization Config Refactor

  • The quantization configuration classes have been renamed for better clarity and consistency:

    • QuantizationSpec is deprecated in favor of QTensorConfig.
    • QuantizationConfig is deprecated in favor of QLayerConfig.
    • Config is deprecated in favor of QConfig.
  • The deprecated class names (QuantizationSpec, QuantizationConfig, Config) are still available as aliases for backward compatibility, but will be removed in a future release.

  • Before Refactor:

    from quark.torch.quantization.config.config import Config, QuantizationConfig, QuantizationSpec
    
    quant_spec = QuantizationSpec(dtype=Dtype.int8, ...)
    quant_config = QuantizationConfig(weight=quant_spec, ...)
    config = Config(global_quant_config=quant_config, ...)
  • After Refactor:

    from quark.torch.quantization.config.config import QConfig, QLayerConfig, QTensorConfig
    
    quant_spec = QTensorConfig(dtype=Dtype.int8, ...)
    quant_config = QLayerConfig(weight=quant_spec, ...)
    config = QConfig(global_quant_config=quant_config, ...)

quark torch-llm-ptq CLI Refactor and Simplification

The CLI has been significantly refactored to use the new LLMTemplate interface and remove redundant features:

  • Removed model-specific algorithm configuration files (e.g., awq_config.json, gptq_config.json, smooth_config.json). Algorithm configurations are now automatically handled by LLMTemplate.
  • Removed unnecessary CLI arguments, retaining only a dozen or so essential arguments.
  • Simplified export: The CLI now only exports to Hugging Face safetensors format.
  • Simplified evaluation: Evaluation now uses perplexity (PPL) on wikitext-2 dataset instead of the previous multi-task evaluation framework.

Code Organization and Examples Refactor

Moved common utilities to quark.torch.utils:

  • model_preparation.py and data_preparation.py are now available in quark.torch.utils for easier reuse across examples and applications.
  • module_replacement utilities are now located in quark.torch.utils.module_replacement.

Moved LLM evaluation code to quark.contrib:

  • The llm_eval module has been moved to quark.contrib.llm_eval and examples/contrib/llm_eval.
  • Perplexity evaluation (ppl_eval) is now shared between CLI and examples via quark.contrib.llm_eval.

Reorganized example scripts:

  • Removed the model-specific algorithm configuration files from the example scripts as well; algorithm configurations are now automatically handled by LLMTemplate.

Extended quantize_quark.py example script and quark torch-llm-ptq CLI with new features:

  • Support for custom model templates and quantization schemes registration (example script only).
  • Support for per-layer quantization scheme configuration via --layer_quant_scheme argument.
  • Support for custom algorithm configurations via --quant_algo_config_file argument (example script only).
  • Simplified quantization scheme naming, directly use the built-in scheme names (see breaking changes below).

Setting log level with QUARK_LOG_LEVEL

The logging level can now be set with the environment variable QUARK_LOG_LEVEL, e.g. QUARK_LOG_LEVEL=debug, QUARK_LOG_LEVEL=warning, QUARK_LOG_LEVEL=error, or QUARK_LOG_LEVEL=critical.
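
A sketch of how such an environment variable is typically consumed (illustrative only; this is not Quark's actual logging setup):

```python
# Read a log level from the environment and apply it to the stdlib logger.
# The variable name matches Quark's QUARK_LOG_LEVEL; the reader is a sketch.
import logging
import os

level_name = os.environ.get("QUARK_LOG_LEVEL", "info").upper()
level = getattr(logging, level_name, logging.INFO)  # fall back to INFO
logging.basicConfig(level=level)
logging.getLogger("demo").debug("only visible with QUARK_LOG_LEVEL=debug")
```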

Support for online rotations (online hadamard transform)

The rotation algorithm supports online rotations, such that:

$$y = xRR^TW^T$$

where $x$ is the input activation, $W$ the weight, and $R$ an orthogonal matrix (e.g. a Hadamard transform). With the quantization operator $\mathcal{Q}$ added, this becomes $\mathcal{Q}(xR) \times \mathcal{Q}(WR)^T$. The activation quantization $\mathcal{Q}(xR)$ is done online, that is, the rotation is applied during inference rather than fused into a preceding layer.

Online rotations can be enabled using online_r1_rotation=True in RotationConfig. Please refer to its documentation and to the user guide for more details.
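
The identity behind online rotations can be checked numerically. A minimal numpy sketch, assuming the $y = xW^T$ linear-layer convention (illustrative only, not Quark's kernel):

```python
# Verifies that inserting an orthogonal rotation R leaves the linear layer's
# output unchanged in exact arithmetic, which is why R can redistribute
# activation outliers before quantization without changing the model.
import numpy as np

# 4x4 Walsh-Hadamard matrix, normalized so that R @ R.T == I.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
R = np.kron(H2, H2) / 2.0  # orthogonal rotation

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))  # input activations
W = rng.standard_normal((5, 4))  # weight, with y = x @ W.T

y_ref = x @ W.T
y_rot = (x @ R) @ (W @ R).T  # rotation applied to activations and weight

assert np.allclose(y_ref, y_rot)
```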

Support for rotation / SmoothQuant scales fine-tuning (SpinQuant/OSTQuant)

We support jointly fine-tuning rotations and smoothing scales as a non-destructive transformation $O = DR$, where $R$ is an orthogonal matrix and $D$ is a diagonal matrix (SmoothQuant scales), such that:

$$y = xOO^{-1}W^T$$ $$= xDRR^TD^{-1}W^T$$ $$= xDR \times (WD^{-1}R)^T$$ $$= x'R \times (WD^{-1}R)^T$$

where $x' = xD$.

The support is well tested for llama, qwen3, qwen3_moe and gpt_oss architectures.

Rotation fine-tuning and online rotations are compatible with other algorithms such as GPTQ or Qronos.

Please refer to the documentation of RotationConfig, the example and the user guide for more details.
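
The non-destructive property of $O = DR$ can likewise be verified numerically (illustrative sketch, not Quark's implementation):

```python
# Verifies that combining a diagonal smoothing matrix D with an orthogonal
# rotation R, as O = DR, leaves the layer output unchanged in exact
# arithmetic: (x D R) @ (W D^-1 R)^T == x @ W^T.
import numpy as np

rng = np.random.default_rng(0)
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
R = np.kron(H2, H2) / 2.0            # orthogonal rotation (R @ R.T == I)
d = rng.uniform(0.5, 2.0, size=4)    # per-channel smoothing scales
D, D_inv = np.diag(d), np.diag(1.0 / d)

x = rng.standard_normal((3, 4))  # input activations
W = rng.standard_normal((5, 4))  # weight, with y = x @ W.T

y_ref = x @ W.T
y_transformed = (x @ D @ R) @ (W @ D_inv @ R).T
assert np.allclose(y_ref, y_transformed)
```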

Minor changes and bug fixes

  • Fix memory duplication and OOM issues when loading gpt_oss models for quantization.
  • ModelQuantizer.freeze behavior has changed to permanently quantize weights. Weights are still stored in high precision, but QDQ (quantize + dequantize) has already been applied to them, which avoids re-running QDQ on static weights at each subsequent call.
  • The scaled_fake_quantize operator, which is used for QDQ, is now compiled with torch.compile by default, allowing significant speedups (1x to 8x) depending on the quantization scheme.
  • An efficient MXFP4 dynamic quantization kernel is used for activations when quantizing models, fusing scale computation and QDQ operations.
  • Batching support is fixed in lm-evaluation-harness integration in the examples, correctly passing the user-provided --eval_batch_size.
  • CPU/GPU communication is removed in quantization observers, allowing for faster quantization and faster runtime, e.g. during model evaluation.
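
The QDQ step referenced above can be sketched as a per-tensor symmetric int8 fake quantization (illustrative only; Quark's scaled_fake_quantize supports many more schemes):

```python
# Minimal per-tensor symmetric int8 QDQ (fake quantization) sketch:
# quantize to the int8 grid, then dequantize back to float, so the tensor
# carries the quantization error while staying in high precision.
import numpy as np

def fake_quantize_int8(w: np.ndarray) -> np.ndarray:
    scale = np.abs(w).max() / 127.0               # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -128, 127)   # quantize to int8 grid
    return q * scale                              # dequantize back to float

w = np.array([1.0, -0.5, 0.003, 2.0])
w_qdq = fake_quantize_int8(w)
# Rounding error is bounded by half a quantization step.
assert np.abs(w - w_qdq).max() <= np.abs(w).max() / 127.0
```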

Deprecations and breaking changes

  • Quantization scheme names in examples/torch/language_modeling/llm_ptq/quantize_quark.py and quark torch-llm-ptq CLI have been simplified and renamed:

    • w_int4_per_group_sym is deprecated in favor of int4_wo_32, int4_wo_64, int4_wo_128 (depending on group size).
    • w_uint4_per_group_asym is deprecated in favor of uint4_wo_32, uint4_wo_64, uint4_wo_128 (depending on group size).
    • w_int8_a_int8_per_tensor_sym is deprecated in favor of int8.
    • w_fp8_a_fp8 is deprecated in favor of fp8.
    • w_mxfp4_a_mxfp4 is deprecated in favor of mxfp4.
    • w_mxfp4_a_fp8 is deprecated in favor of mxfp4_fp8.
    • w_mxfp6_e3m2_a_mxfp6_e3m2 is deprecated in favor of mxfp6_e3m2.
    • w_mxfp6_e2m3_a_mxfp6_e2m3 is deprecated in favor of mxfp6_e2m3.
    • w_bfp16_a_bfp16 is deprecated in favor of bfp16.
    • w_mx6_a_mx6 is deprecated in favor of mx6.
  • The --group_size and --group_size_per_layer arguments in examples/torch/language_modeling/llm_ptq/quantize_quark.py and quark torch-llm-ptq CLI have been removed. Group size is now embedded in the scheme name (e.g., int4_wo_32, int4_wo_64, int4_wo_128).

  • The --layer_quant_scheme argument format in examples/torch/language_modeling/llm_ptq/quantize_quark.py and quark torch-llm-ptq CLI has changed to repeated arguments with pattern and scheme pairs (e.g., --layer_quant_scheme lm_head int8 --layer_quant_scheme '*down_proj' fp8).

  • The token counter used to count the number of tokens seen by each expert during calibration is now disabled by default, and requires the environment variable QUARK_COUNT_OBSERVED_SAMPLES=1.

  • The export format "quark_format" is removed, following deprecation in AMD Quark 0.10. Additionally, quark.torch.export.api.ModelExporter and quark.torch.export.api.ModelImporter are removed, please refer to the 0.10 release notes and to the documentation for the current API.

AMD Quark for ONNX

New Features

  • Auto Search Pro

    • Hierarchical Search: Support for conditional and nested hyperparameter trees for advanced search strategies.
    • Custom Objectives: Support custom evaluation logic that perfectly aligns with specific needs.
    • Sampler Flexibility: Various samplers (e.g. TPE, grid search) are available.
    • Parallel search: Take advantage of parallelization to run multiple searches simultaneously, reducing time to solution.
    • Checkpoint: Resume interrupted hyperparameter optimization from the last checkpoint.
    • Visualization: View real-time visualizations that show your optimization performance and feature importance, making it easier to interpret results.
    • Output Saving: Automatically save the best configuration, study database, and generated plots for your analysis.
  • Latency and memory usage profiling

    • Latency Profiling: Each quantization stage performs specific operations that contribute to the overall quantization pipeline, and their individual latencies are reported in the profiling results.

    • Memory profiling

      • CPU Memory Profiling: By wrapping the Python script with mprof, we can record detailed memory traces during execution.
      • ROCM GPU Memory Profiling: For workflows involving ROCMExecutionProvider or any GPU-based quantization step, Quark ONNX offers a lightwe...

Release 0.10

26 Sep 22:24



  • AMD Quark for PyTorch

    • New Features

      • Support PyTorch 2.7.1 and 2.8.0.
      • Support for int3 quantization and exporting of models.
      • Support the AWQ algorithm with Gemma3 and Phi4.
      • Applying the GPTQ algorithm runs 3x to 4x faster compared to AMD Quark 0.9, using CUDA/HIP Graph by default. If required, CUDA Graph for GPTQ can be disabled using the environment variable QUARK_GRAPH_DEBUG=0.
      • The Quarot algorithm supports a new configuration parameter rotation_size to define custom Hadamard rotation sizes. Please refer to the QuaRotConfig documentation.
      • Support the Qronos post-training quantization algorithm. Please refer to the arXiv paper and Quark documentation.
    • QuantizationSpec check:

      • Whenever a QuantizationSpec is initialized, a configuration check is performed automatically. If an invalid configuration is supplied, a warning or error message guides the user toward a correction, catching potential errors as early as possible rather than causing a runtime error during the quantization process.
    • LLM Depth-Wise Pruning tool:

      • A depth-wise pruning tool that can decrease LLM model size by deleting consecutive decoder layers under a user-supplied pruning ratio.
      • Based on their influence on perplexity (PPL), the consecutive layers with the least impact on PPL are regarded as having the least influence on the LLM and can be deleted.
    • Model Support:

      • Support OCP MXFP4, MXFP6, MXFP8 quantization of new models: DeepSeek-R1, Llama4-Scout, Llama4-Maverick, gpt-oss-20b, gpt-oss-120b.
    • Deprecations and breaking changes

      • OCP MXFP6 weight packing layout is modified to fit the expected layout by CDNA4 mfma_scale instruction.

      • In the examples/language_modeling/llm_ptq/quantize_quark.py example, the quantization scheme w_mxfp4_a_mxfp6 is removed and replaced by w_mxfp4_a_mxfp6_e2m3 and w_mxfp4_a_mxfp6_e3m2.

    • Important bug fixes

      • A bug in Quarot and Rotation algorithms where fused rotations were wrongly applied twice on input embeddings / LM head weights is fixed.

      • Reduced the slow reloading of large quantized models such as DeepSeek-R1 when using Transformers + Quark.

  • AMD Quark for ONNX

    • New Features:

      • API Refactor (Introduced the new API design with improved consistency and usability)

        • Supported class-based algorithm usage.
        • Aligned data type both for Quark Torch and Quark ONNX.
        • Refactored quantization configs.
      • Auto Search Enhancements

        • Two-Stage Search: First identifies the best calibration config, then searches for the optimal FastFinetune config based on it. Expands the search space for higher efficiency.
        • Advanced-Fastft Search: Supports continuous search spaces, advanced algorithms (e.g., TPE), and parallel execution for faster, smarter searching.
        • Joint-Parameter Search: Combines coupled parameters into a unified space to avoid ineffective configurations and improve search quality.
      • Added support for ONNX 1.19 and ONNXRuntime 1.22.1

      • Added optimized weight-scale calculation with the MinMSE method to improve quantization accuracy.

      • Accelerated calibration with multi-process support, covering algorithms such as MinMSE, Percentile, Entropy, Distribution, and LayerwisePercentile.

      • Added progress bars for Percentile, Entropy, Distribution, and LayerwisePercentile algorithms.

      • Allowed users to specify a directory for saving cache files.

    • Enhancements:

      • Significantly reduced memory usage across various configurations, including calibration and FastFinetune stages, with optimizations for both CPU and GPU memory.
      • Improved clarity of error and warning outputs, helping users select better parameters based on memory and disk conditions.
    • Bug fixes and minor improvements:

      • Provided actionable hints when OOM or insufficient disk space issues occur in calibration and fast fine-tuning.
      • Fixed multi-GPU issues during FastFinetune.
      • Fixed a bug related to converting BatchNorm to Conv.
      • Fixed a bug in BF16 conversion on models larger than 2GB.
  • Quark Torch API Refactor

    • LLMTemplate for simplified quantization configuration:

      • Introduced the LLMTemplate class for convenient LLM quantization configuration.
      • Built-in templates for popular LLM architectures (Llama4, Qwen, Mistral, Phi, DeepSeek, GPT-OSS, etc.)
      • Support for multiple quantization schemes: int4/uint4 (group sizes 32, 64, 128), int8, fp8, mxfp4, mxfp6e2m3, mxfp6e3m2, bfp16, mx6
      • Advanced features: layer-wise quantization, KV cache quantization, attention quantization
      • Algorithm support: AWQ, GPTQ, SmoothQuant, AutoSmoothQuant, Rotation
      • Custom template and scheme registration capabilities for users to define their own template and quantization schemes
            from quark.torch import LLMTemplate

            # List available templates
            templates = LLMTemplate.list_available()
            print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

            # Get a specific template
            llama_template = LLMTemplate.get("llama")

            # Create a basic configuration
            config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
  • Export and import APIs are deprecated in favor of new ones:

    • ModelExporter.export_safetensors_model is deprecated in favor of export_safetensors:

      Before:

            from quark.torch import ModelExporter
            from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

            export_config = ExporterConfig(json_export_config=JsonExporterConfig())
            exporter = ModelExporter(config=export_config, export_dir=export_dir)
            exporter.export_safetensors_model(model, quant_config)
      After:

            from quark.torch import export_safetensors
            export_safetensors(model, output_dir=export_dir)
    • ModelImporter.import_model_info is deprecated in favor of import_model_from_safetensors:

      Before:

            from quark.torch.export.api import ModelImporter

            model_importer = ModelImporter(
               model_info_dir=export_dir,
               saved_format="safetensors"
            )
            quantized_model = model_importer.import_model_info(original_model)
      After:

            from quark.torch import import_model_from_safetensors
            quantized_model = import_model_from_safetensors(
               original_model,
               model_dir=export_dir
            )
  • Quark ONNX API Refactor

    • Before:

      • Basic Usage:
           from quark.onnx import ModelQuantizer
           from quark.onnx.quantization.config.config import Config
           from quark.onnx.quantization.config.custom_config import get_default_config

           input_model_path = "demo.onnx"
           quantized_model_path = "demo_quantized.onnx"
           calib_data_path = "calib_data"
           calib_data_reader = ImageDataReader(calib_data_path)

           a8w8_config = get_default_config("A8W8")
           quantization_config = Config(global_quant_config=a8w8_config )
           quantizer = ModelQuantizer(quantization_config)
           quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
      • Advanced Usage:
	   from quark.onnx import ModelQuantizer
	   from quark.onnx.quantization.config.config import Config, QuantizationConfig
	   from onnxruntime.quantization.calibrate import CalibrationMethod
	   from onnxruntime.quantization.quant_utils import QuantFormat, QuantType, ExtendedQuantType

	   input_model_path = "demo.onnx"
	   quantized_model_path = "demo_quantized.onnx"
	   calib_data_path = "calib_data"
	   calib_data_reader = ImageDataReader(calib_data_path)

	   DEFAULT_ADAROUND_PARAMS = {
	       "DataSize": 1000,
	       "FixedSeed": 1705472343,
	       "BatchSize": 2,
	       "NumIterations": 1000,
	       "LearningRate": 0.1,
	       "OptimAlgorithm": "adaround",
	       "OptimDevice": "cpu",
	       "InferDevice": "cpu",
	       "EarlyStop": True,
	   }

	   quant_config = QuantizationConfig(
	       calibrate_method=CalibrationMethod.Percentile,
	       quant_format=QuantFormat.QDQ,
	       activation_type=QuantType.QInt8,
	       weight_type=QuantType.QInt8,
	       nodes_to_exclude=["/layer.2/Conv_1", "^/Conv/.*"],
	       subgraphs_to_exclude=[(["start_node_1", "start_node_2"], ["end_node_1", "end_node_2"])],
	       include_cle=True,
	       include_fast...