Description
Summary
Add a new ai-artifact cataloger to Syft that detects and parses GGUF files (.gguf). Extraction should be header-only, which keeps it fast and avoids reading full weights or downloading whole files. Results should be emitted as:
- Syft JSON (native) using a new metadata type for GGUF
- CycloneDX 1.6 (ML-BOM) as machine-learning-model components with basic properties.
Goals / Scope
Detect .gguf files from supported sources. This issue covers the local filesystem and container filesystems. A second issue will cover OCI media types and a new Syft source that reads from the Docker layer API for efficient cataloging.
Notes
- Parse only the GGUF header (magic, version, KV count, KV table) to capture identity & key facts.
- Create a new package type model and a new metadata type gguf-file-metadata.
- Emit Syft JSON package(s) with:
  - type: "model"
  - metadataType: "gguf-file-metadata"
  - metadata: minimal but stable fields (see below).
- Emit CycloneDX 1.6 with:
  - type: "machine-learning-model"
  - minimal modelCard.modelParameters and properties mapping (see below).
- Zero network calls for local/container sources.
We're also looking for a stable global identifier across remotes. This will be obtained by taking a hash of the metadata extracted from the model.
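One possible way to derive such an identifier, sketched under assumptions: hash the extracted KV pairs in sorted key order so the result does not depend on map iteration order or on where the file was fetched from. The choice of SHA-256, the length-prefixed encoding, and the metadataID name are illustrative and not decided by this issue.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// metadataID (hypothetical helper) returns a hex SHA-256 over the KV pairs in
// sorted key order. Keys and values are length-prefixed so that pairs such as
// ("ab", "c") and ("a", "bc") produce different digests.
func metadataID(kv map[string]string) string {
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(kv[k]), kv[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	kv := map[string]string{
		"general.name":         "Qwen3-Coder-30B-A3B-Instruct",
		"general.architecture": "qwen3moe",
		"general.license":      "apache-2.0",
	}
	fmt.Println(metadataID(kv))
}
```

Which keys feed the hash matters: including fields that vary between copies of the same model would make the identifier unstable.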
Examples
Syft JSON example (native):
{
  "name": "Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf",
  "type": "ai-artifact",
  "foundBy": "ai-artifact-cataloger",
  "locations": [{"path": "/models/Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf"}],
  "licenses": [],
  "purl": "",
  "metadataType": "gguf-file-metadata",
  "metadata": {
    "ModelFormat": "gguf",
    "ModelName": "Qwen3-Coder-30B-A3B-Instruct",
    "ModelVersion": "unknown",
    "FileSize": 0, // best-effort if available from resolver
    "Hash": "", // leave blank unless already computed upstream
    "License": "apache-2.0",
    "GGUFVersion": 3,
    "Architecture": "qwen3moe",
    "Quantization": "IQ4_NL",
    "Parameters": 0, // if present in header
    "TensorCount": 579, // derived from header tensor entries
    "Header": { // raw KVs (namespaced)
      "general.architecture": "qwen3moe",
      "general.name": "Qwen3-Coder-30B-A3B-Instruct",
      "general.license": "apache-2.0",
      "general.quantized_by": "Unsloth"
    },
    "TruncatedHeader": false
  }
}
CycloneDX 1.6 (ML-BOM) mapping:
component:
- type = "machine-learning-model"
- name = general.name || filename
- version = header field if available (else "unknown")
- modelCard.modelParameters (best-effort):
  - architectureFamily from general.architecture (map common values: llama/qwen/gemma → "transformer" family)
  - modelArchitecture freeform (e.g., "decoder-only", if inferable; else omit)
Note: Keep CycloneDX output minimal & typed; avoid dumping the entire KV bag to properties.
CLI UX
Works out of the box for local files or a Hugging Face URL:
syft dir:./path/to/models -o json
go run cmd/syft/main.go -o json https://huggingface.co/janhq/Jan-v1-4B-GGUF/blob/main/Jan-v1-4B-Q4_K_M.gguf
Add --select-catalogers=ai-artifact to limit runs if needed (optional).
Follow-ups
- OCI Artifact (local | remote)
- PURL strategy (e.g., pkg:huggingface/...) once we add remote/registry context.
- Safetensors & ONNX parsers.