Phase-based Research & Interpretability Spectral Microscope
PRISM is a model-agnostic mechanistic interpretability toolkit for transformer-family language models. It measures activation geometry — the structural properties of hidden-state tensors that predict quantisation error, representation collapse, and alignment shifts — and exposes a 14-module suite of causal, spectral, and evaluation tools for deeper analysis.
When a language model is fine-tuned, its weights change. What changes in the activations is harder to see — and more revealing. PRISM's geometry metrics, inspired by TurboQuant (Google, ICLR 2026), measure four per-layer statistics that together capture how "hostile" a layer's activations are to low-bit quantisation:
| Metric | What it measures | Danger sign |
|---|---|---|
| outlier_ratio | max dim magnitude / mean dim magnitude | > 10 |
| activation_kurtosis | heavy-tail-ness of per-dim magnitudes | large positive |
| cardinal_proximity | how axis-aligned unit vectors are | near 1.0 |
| quantization_hostility | composite score in [0, 1] | > 0.70 |
A high quantization_hostility score means the layer will lose significant information when quantised to 4- or 8-bit precision. Tracking this score across training runs reveals whether fine-tuning is moving a model toward or away from a quantisation-friendly (and typically more robust) geometry.
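As a rough illustration of what the first three statistics in the table capture, the NumPy sketch below computes one plausible reading of each definition on a (seq_len, hidden_dim) array. The function name `sketch_geometry_stats`, the exact formulas, and the `eps` guard are assumptions made for this example and are not taken from PRISM's source. The composite quantization_hostility score is left out because its weighting is not specified here.

```python
import numpy as np


def sketch_geometry_stats(H, eps=1e-12):
    """Illustrative per-layer statistics for hidden states H of shape (seq_len, hidden_dim).

    These formulas are one reading of the table's descriptions, not PRISM's code.
    """
    H = np.asarray(H, dtype=np.float64)

    # Mean absolute activation of each hidden dimension across the sequence.
    dim_mag = np.abs(H).mean(axis=0)

    # Largest per-dimension magnitude relative to the average dimension.
    outlier_ratio = dim_mag.max() / (dim_mag.mean() + eps)

    # Excess kurtosis of the per-dimension magnitudes (0 for a normal distribution).
    centred = dim_mag - dim_mag.mean()
    variance = np.mean(centred ** 2)
    kurtosis = np.mean(centred ** 4) / (variance ** 2 + eps) - 3.0

    # Normalise each token vector, then take its largest absolute coordinate.
    # A value near 1.0 means the vector lies almost along a single axis.
    unit = H / (np.linalg.norm(H, axis=1, keepdims=True) + eps)
    cardinal_proximity = np.abs(unit).max(axis=1).mean()

    return {
        "outlier_ratio": float(outlier_ratio),
        "activation_kurtosis": float(kurtosis),
        "cardinal_proximity": float(cardinal_proximity),
    }


rng = np.random.default_rng(0)
smooth = rng.standard_normal((64, 256))   # isotropic activations
spiky = smooth.copy()
spiky[:, 7] *= 100.0                      # one dominant "outlier" dimension

smooth_stats = sketch_geometry_stats(smooth)
spiky_stats = sketch_geometry_stats(spiky)
print(smooth_stats)
print(spiky_stats)
```

With a single dominant dimension, all three statistics move in the "danger" direction together, which is the pattern a low-bit quantiser struggles with: the scale is set by the outlier and the remaining dimensions lose resolution.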
    pip install humanaiconvention-prism

Measure the quantisation hostility of any model in 3 lines:

    from prism.geometry import scan_model_geometry
    results = scan_model_geometry("google/gemma-4-e2b-it")
    print(results["mean_quantization_hostility"])  # e.g. 0.914

That is the complete API for the primary use case. scan_model_geometry loads the model from the Hugging Face Hub, runs a single forward pass, and returns per-layer geometry metrics, with no hook management, manual tokenisation, or dtype juggling.
The result that motivated PRISM's public release came from the Gemma4Good Hackathon (April 2026), where the HumanAI Convention team fine-tuned Gemma 4 E2B using QLoRA on curated semantic-grounding interview data.
PRISM tracked quantization_hostility at three checkpoints:
| Checkpoint | Mean hostility | SGT score | Security failures |
|---|---|---|---|
| Gemma 4 E2B baseline | 0.9146 | — | — |
| HAIC v1 adapter (BEAST, untagged data) | 0.9144 | 6.20 / 10 | 0 |
| HAIC v2 adapter (A100, SGT-formatted data) | 0.7398 | 8.56 / 10 | 0 |
The v2 adapter reduced mean quantisation hostility by 0.175 (from 0.9146 to 0.7398, a 19% relative drop) while also producing the strongest SGT behavioural score in the HAIC model family. This is not a coincidence.
The geometry shift tracks the training-data quality change precisely:
- v1 was trained on grounding_mix_v3i.jsonl: 975 examples with no [PIVOT:] format tags. The LoRA adapted the model's style but never saw the target output format. Hostility: unchanged (0.9144, Δ = −0.0002, noise level).
- v2 was trained on grounding_gemma4_v2.jsonl: 500 SGT-formatted examples with 100% [PIVOT:] coverage, three training windows per conversation (T2 + T4 + T6 loss), on an A100. The geometry shifted.
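The absolute and relative shifts quoted for the two adapters follow directly from the mean-hostility column of the checkpoint table. The short calculation below reproduces them; the helper name `shift` is just for this example.

```python
# Mean hostility values copied from the checkpoint table.
baseline, v1, v2 = 0.9146, 0.9144, 0.7398


def shift(before, after):
    """Return (absolute change, change relative to the starting value)."""
    delta = after - before
    return delta, delta / before


v1_delta, v1_rel = shift(baseline, v1)
v2_delta, v2_rel = shift(baseline, v2)

for name, delta, rel in (("v1", v1_delta, v1_rel), ("v2", v2_delta, v2_rel)):
    print(f"{name}: delta={delta:+.4f} relative={rel:+.1%}")
```

The v1 change is in the fourth decimal place, while the v2 change is roughly 0.175 absolute and about 19% of the baseline value.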
Interpretation under C(t): LoRA adapters generally navigate the activation manifold rather than reshape it — which is why v1 showed geometry-stable Δ ≈ 0. The v2 geometry shift from 0.9146 to 0.7398 indicates that the higher-quality, format-consistent training data moved the adapter outside the geometry-neutral zone. The model is not just stylistically different; its residual-stream structure has measurably changed in a direction that reduces quantisation sensitivity and — empirically — correlates with improved task performance and maintained security.
This is the kind of result that PRISM was built to surface. Without geometry tracking, the v1 and v2 adapters look similar on loss curves. With it, the bifurcation point is obvious.
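One way to locate where such a shift happens is to compare two scans layer by layer. The sketch below assumes each scan is a dict with a "layers" list whose entries carry "layer_idx" and "quantization_hostility", matching the return value documented for scan_model_geometry. The helper names and the toy numbers are invented for illustration and are not measurements.

```python
def layer_hostility_deltas(before, after):
    """Per-layer change in quantization_hostility for layers present in both scans."""
    b = {layer["layer_idx"]: layer["quantization_hostility"] for layer in before["layers"]}
    a = {layer["layer_idx"]: layer["quantization_hostility"] for layer in after["layers"]}
    shared = sorted(b.keys() & a.keys())
    return [(idx, a[idx] - b[idx]) for idx in shared]


def largest_shift(before, after):
    """Return the (layer_idx, delta) pair with the largest absolute change."""
    return max(layer_hostility_deltas(before, after), key=lambda pair: abs(pair[1]))


# Toy stand-ins for two scan results (values are made up for the example).
scan_a = {"layers": [
    {"layer_idx": 0, "quantization_hostility": 0.90},
    {"layer_idx": 1, "quantization_hostility": 0.92},
]}
scan_b = {"layers": [
    {"layer_idx": 0, "quantization_hostility": 0.89},
    {"layer_idx": 1, "quantization_hostility": 0.70},
]}

deltas = layer_hostility_deltas(scan_a, scan_b)
worst = largest_shift(scan_a, scan_b)
print(deltas)
print(worst)
```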
    # Standard (CPU / CUDA)
    pip install humanaiconvention-prism
    # With BitsAndBytes for 4-bit / 8-bit quantised models
    pip install 'humanaiconvention-prism[quantized]'
    # With notebook / visualisation tools
    pip install 'humanaiconvention-prism[notebook]'
    # Full install
    pip install 'humanaiconvention-prism[all]'

Or from source:

    git clone https://github.com/humanaiconvention/prism.git
    cd prism && pip install -e .

scan_model_geometry accepts either a Hub model id or a pre-loaded model:

    from prism.geometry import scan_model_geometry

    # From a HF Hub model id (auto-loads model + tokenizer)
    results = scan_model_geometry("google/gemma-4-e2b-it")

    # With a pre-loaded model
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    results = scan_model_geometry(model, tokenizer=tokenizer)

    # 4-bit quantised (requires bitsandbytes + accelerate)
    results = scan_model_geometry("google/gemma-4-e2b-it", load_in_4bit=True)

Return value (a flat dict):
    {
        "model_name": str,                     # identifier
        "prompt": str,                         # probe text used
        "n_layers": int,                       # transformer block count
        "layers": list,                        # per-layer dicts (see below)
        "mean_quantization_hostility": float,  # mean across all layers
        "worst_layer_idx": int,
        "best_layer_idx": int,
        "worst_layer_hostility": float,
        "best_layer_hostility": float,
        "n_hostile_layers": int,               # layers with hostility > 0.70
    }

    # Each element of "layers":
    {
        "layer_idx": int,
        "outlier_ratio": float,
        "activation_kurtosis": float,
        "cardinal_proximity": float,
        "quantization_hostility": float,
    }

The same metrics can be computed directly on a single layer's hidden states:

    from prism.geometry import outlier_geometry
    import torch

    # hidden states from a single layer: (seq_len, hidden_dim)
    H = torch.randn(64, 2048)
    metrics = outlier_geometry(H)
    # {'outlier_ratio': ..., 'activation_kurtosis': ...,
    #  'cardinal_proximity': ..., 'quantization_hostility': ...}

outlier_geometry accepts both torch.Tensor and numpy.ndarray.
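The dict returned by scan_model_geometry is plain data, so ordinary Python is enough to post-process it. As a minimal sketch, the helper below lists the layers above the 0.70 hostility threshold, worst first. The helper name `hostile_layers` and the hand-written `fake_results` dict are illustrative stand-ins for a real scan result.

```python
HOSTILITY_THRESHOLD = 0.70  # the "hostile layer" cut-off used in the return value


def hostile_layers(results, threshold=HOSTILITY_THRESHOLD):
    """Return (layer_idx, hostility) pairs above the threshold, most hostile first."""
    flagged = [
        (layer["layer_idx"], layer["quantization_hostility"])
        for layer in results["layers"]
        if layer["quantization_hostility"] > threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hand-written stand-in for a scan result; the values are made up.
fake_results = {
    "n_layers": 3,
    "layers": [
        {"layer_idx": 0, "quantization_hostility": 0.62},
        {"layer_idx": 1, "quantization_hostility": 0.91},
        {"layer_idx": 2, "quantization_hostility": 0.74},
    ],
}

flagged = hostile_layers(fake_results)
print(flagged)  # [(1, 0.91), (2, 0.74)]
```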
For environments where PyTorch is not available (lightweight CI, pre-computed activation files, etc.):
    from prism.geometry import outlier_geometry_numpy
    import numpy as np

    H = np.load("layer_12_hidden_states.npy")  # (seq_len, hidden_dim)
    metrics = outlier_geometry_numpy(H)

To track hostility across training checkpoints, scan each one in turn:

    from prism.geometry import scan_model_geometry

    checkpoints = [
        "path/to/checkpoint-500",
        "path/to/checkpoint-1000",
        "path/to/checkpoint-2000",
    ]
    for ckpt in checkpoints:
        r = scan_model_geometry(ckpt)
        print(f"{ckpt} hostility={r['mean_quantization_hostility']:.4f} "
              f"hostile_layers={r['n_hostile_layers']}/{r['n_layers']}")

PRISM 1.2.0 adds a prism.provenance submodule that wraps Cisco's Model Provenance Kit (Apache-2.0, released 2026-05-04). Given a model, compare_models(candidate, claimed_parent) and scan_model_provenance(model) answer "does this artifact actually descend from what its producer says?" using five weight-level statistical signals (EAS, END, NLF, LEP, WVC). See docs/PROVENANCE.md for the technique, threshold calibration, and the limitations Cisco discloses (the output is statistical evidence, not a cryptographic signature, and cannot distinguish "copied weights" from "trained from the same template" when architectures are identical).
The submodule keeps MPK as an optional dependency — install provenancekit only when you want the real backend; the deterministic mock in prism.provenance.mock_compare / mock_scan is the default for tests and offline demos. Every result carries a not_cryptographic=True flag that propagates into the audit-dict serialisation so downstream consumers can't silently drop the caveat.
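The idea of a caveat flag that survives serialisation can be sketched with a small dataclass. Everything in this example is hypothetical: the class name `ProvenanceResult`, its fields, the `to_audit_dict` method, and the sample values are invented for illustration and are not PRISM's actual schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceResult:
    """Hypothetical result record; field names are illustrative only."""
    candidate: str
    claimed_parent: str
    score: float
    not_cryptographic: bool = True

    def to_audit_dict(self):
        record = asdict(self)
        # Re-assert the caveat at serialisation time so the exported
        # record always states that it is statistical evidence.
        record["not_cryptographic"] = True
        return record


# Placeholder identifiers and score, not real measurements.
result = ProvenanceResult("org/candidate", "org/parent", score=0.5)
audit = result.to_audit_dict()
print(audit)
```

Writing the flag inside the serialiser, rather than relying on the field default alone, means the caveat is present even if a caller constructs the record with the flag switched off.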
PRISM's geometry metrics describe the structure of hidden states. In May 2026 Anthropic published Natural Language Autoencoders (NLAs), a complementary technique that describes the semantics of those same hidden states — a learned decoder converts an activation token into a natural-language explanation of what the layer appears to be representing. PRISM 1.1.0 adds a prism.nla submodule that bundles the public Anthropic-style checkpoints (released by kitft under Apache-2.0 for Qwen2.5-7B, Gemma-3-12B/27B, and Llama-3.3-70B) into a registry, and adds an optional nla_explainer= parameter to scan_model_geometry so a geometry profile can carry an NLA explanation alongside each measured layer.
NLAs are powerful but not mechanistic — Anthropic flags four honest limitations (confabulation, blackbox-by-construction, excessive expressivity, training cost) which PRISM repeats verbatim in docs/NLA.md. There is no released NLA for Gemma-4-E2B-it, and PRISM will not silently run a foreign-architecture NLA against Gemma-4 activations — scan_model_geometry raises on d_model mismatch. Treat NLA text as a hypothesis generator paired with a geometric anomaly, not as ground truth.
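A width check of this kind can be expressed in a few lines. The guard below is a sketch of the behaviour described above, not PRISM's implementation: the function name `check_nla_compatible` and the error message are assumptions, and the widths in the usage lines are arbitrary example numbers.

```python
def check_nla_compatible(model_d_model: int, nla_d_model: int) -> None:
    """Raise if an NLA decoder was trained for a different hidden width."""
    if model_d_model != nla_d_model:
        raise ValueError(
            f"NLA expects d_model={nla_d_model}, but the scanned model has "
            f"d_model={model_d_model}; refusing to decode foreign activations."
        )


# Arbitrary example widths: matching values pass silently, mismatches raise.
check_nla_compatible(3584, 3584)
try:
    check_nla_compatible(1536, 3584)
except ValueError as err:
    print(err)
```

Failing loudly is the safer default here: a decoder applied to activations of the wrong width would, at best, produce fluent text unrelated to what the layer represents.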
PRISM is a 14-module library. The geometry scanner is its primary entry point, but the full suite is available for deeper mechanistic analysis.
    from prism import SpectralMicroscope

    microscope = SpectralMicroscope()
    report = microscope.full_scan(model, tokenizer, prompt="The capital of France is")
    # Returns: logit_lens, rank_profile, static_circuits,
    #          attention_heatmap, positional_sensitivity, provenance

| Module | Import | Capability |
|---|---|---|
| Geometry scanner | prism.geometry | Quantisation hostility profiling |
| Natural-language autoencoders | prism.nla | Per-layer activation verbalisation (Anthropic, May 2026) |
| Model provenance | prism.provenance | Weight-fingerprint lineage detection (Cisco MPK, May 2026) |
| Causal patching | prism.causal | Activation swap & attribution patching |
| Logit / Tuned Lens | prism.lens | Vocabulary projection at every layer |
| Attention circuits | prism.attention | Induction head detection, OV/QK SVD |
| Linear probing | prism.probing | Concept directions, CKA drift |
| Sparse features | prism.sae | TopK SAE training & feature attribution |
| MLP decomposition | prism.mlp | Rank restoration, neuron mapping |
| Hybrid diagnostics | prism.arch | Recurrent / linear-attention architectures |
| Phase coherence | prism.phase | Hilbert phase, PLV, FFT spectral |
| Entropy dynamics | prism.entropy | Shannon / Rényi entropy profiles |
| Geometric viability | prism.geometry | Intrinsic dimensionality, Fisher curvature |
| Circuit discovery | prism.discovery | Automated circuit extraction |
| Evaluation | prism.eval | Calibration, drift, temporal collapse |
| Telemetry schemas | prism.telemetry | Verifiable snapshot & delta proofs |
PRISM's architecture adapter resolves transformer components without hard-coding any single layout. Tested families include:
LLaMA · Gemma (2, 3, 4, E2B, E4B) · Qwen2 / Qwen3 · Mistral · Phi · GPT-2 / GPT-NeoX · Pythia · SmolLM2 · OLMo2 · Mamba / GLA / FoX hybrids · T5 · mT5
Any model that supports output_hidden_states=True works with scan_model_geometry.
- Precision: float32 avoids spectral-decomposition instabilities on CUDA. BitsAndBytes quantised activations are automatically cast to float32 for metric computation.
- Memory: a single forward pass stores all layer hidden states simultaneously. For very large models, reduce sequence length via the prompt parameter.
- Attention hooks: use attn_implementation="eager" if flash-attention drops hidden-state hooks on your architecture.
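To get a feel for the memory note above, the cost of holding every layer's hidden states can be estimated with simple arithmetic. The sketch below assumes float32 storage (4 bytes per value) and one extra tensor for the embedding output, which is how Hugging Face models typically report hidden states. The function name and the example model shape are illustrative, not taken from PRISM.

```python
def hidden_state_bytes(n_layers, seq_len, hidden_dim, bytes_per_value=4, batch_size=1):
    """Approximate bytes needed to keep all hidden states from one forward pass."""
    # One tensor per transformer block, plus the embedding output.
    n_tensors = n_layers + 1
    return n_tensors * batch_size * seq_len * hidden_dim * bytes_per_value


# Illustrative shape: 32 blocks, hidden size 4096, 512-token prompt.
full = hidden_state_bytes(n_layers=32, seq_len=512, hidden_dim=4096)
half = hidden_state_bytes(n_layers=32, seq_len=256, hidden_dim=4096)

print(f"512 tokens: {full / 2**20:.0f} MiB")  # 264 MiB
print(f"256 tokens: {half / 2**20:.0f} MiB")  # 132 MiB
```

The estimate scales linearly with sequence length, which is why shortening the prompt is the simplest lever when a scan runs out of memory. It covers only the stored hidden states, not model weights or attention buffers.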
experiments/genesis/ is a complete 12-phase mechanistic interpretability research run on guiferrarib/genesis-152m-instruct, a 152M-parameter hybrid GLA/FoX model. It serves as a worked example of PRISM applied end-to-end.
    python experiments/genesis/go.py --list   # list all phases
    python experiments/genesis/go.py 10A      # run a phase

Run the test suite with pytest:

    pytest                        # full suite
    pytest tests/test_geometry*   # geometry module only
    pytest --cov=prism            # with coverage

Contributions expanding architecture coverage, adding new geometry metrics, or improving the test suite are welcome. All new modules must include tests under tests/. See CONTRIBUTING.md for guidelines.
    @software{prism_2026,
      title  = {PRISM: Phase-based Research \& Interpretability Spectral Microscope},
      author = {HumanAI Convention Contributors},
      year   = {2026},
      url    = {https://github.com/humanaiconvention/prism},
    }