feat: NemotronH architecture support #166

Open
cpagac wants to merge 1 commit into p-e-w:master from venomx-pentester:nemotron-support

Conversation

cpagac (Contributor) commented Feb 16, 2026

Summary

Adds support for hybrid Mamba/SSM architectures (e.g. nvidia/NVIDIA-Nemotron-Nano-9B-v2), with several correctness fixes along the way.

Hybrid model support (model.py)

  • get_layers(): adds model.backbone.layers path for NemotronH and other backbone-based architectures
  • get_layer_modules(): adds mixer-based module discovery for NemotronH's unified mixer attribute (attention o_proj, MLP down_proj, MoE experts, Mamba2 SSM out_proj). Also softens the attn.o_proj
    lookup from a hard assert to a suppressed exception, so layers without standard self-attention don't abort
  • get_abliterable_components(): scans all layers instead of only layer 0, collecting the union of component types. Necessary for hybrid architectures where different layers have different
    structures. Logs a warning for any layers with no recognized modules
  • get_residuals(): detects when generate() returns None hidden states (NemotronH behavior) and falls back to forward hooks to capture per-layer outputs directly. Also moves tensors to a common
    device before stacking, required on multi-GPU setups
  • _get_hidden_states_via_hooks(): new method — registers forward pre/post hooks on all layers to capture hidden states, matching the standard [embedding, layer_0, layer_1, ...] format (a rough sketch follows this list)
  • _has_mamba_layers(): new method — detects hybrid SSM models by checking for layer.mixer.out_proj
  • abliterate(): skips modules whose base weight is on meta device or contains NaN values
  • NaN warnings: added to get_residuals() and get_logprobs_batched() to surface post-abliteration model corruption early
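For orientation, here is a minimal sketch of how a hook-based fallback like the one described above can capture per-layer hidden states. The function name, argument handling, and tuple-vs-tensor unpacking are illustrative assumptions, not the exact code in this PR:

```python
import torch

def capture_hidden_states_via_hooks(model, layers, inputs):
    # Capture the embedding output (input to the first layer) via a forward
    # pre-hook, and each layer's output via forward hooks, yielding the
    # standard [embedding, layer_0, layer_1, ...] ordering.
    captured, handles = [], []

    def pre_hook(module, args):
        captured.append(args[0].detach())

    def post_hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured.append(hidden.detach())

    handles.append(layers[0].register_forward_pre_hook(pre_hook))
    handles.extend(layer.register_forward_hook(post_hook) for layer in layers)

    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        for handle in handles:
            handle.remove()

    # Move everything onto one device before stacking (needed on multi-GPU).
    device = captured[0].device
    return torch.stack([h.to(device) for h in captured])
```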

Multi-GPU VRAM calibration (model.py)

NemotronH's Mamba2 SSM layers allocate a persistent ~4 GiB workspace during the first forward pass — after model loading — which Accelerate cannot account for when computing the device map. On
multi-GPU systems this causes OOM mid-inference even when initial placement looked balanced.

After the warmup generate() call, if any GPU has less than 6 GiB free and the model has Mamba layers, the model is released, actual post-warmup free VRAM is measured per device, and the model
reloads once with corrected max_memory caps. A 0.7 correction factor is applied only to overloaded GPUs to prevent Accelerate from repeating the same placement error. Regular models are unaffected.
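As a rough sketch of the cap computation (constant names and the exact reload flow are assumptions; in the actual change the overloaded GPUs are identified before the model is released):

```python
import torch

_HEADROOM = 6 * 1024**3       # Minimum free VRAM to keep per GPU.
_CORRECTION_FACTOR = 0.7      # Applied only to GPUs that ended up overloaded.

def corrected_max_memory(overloaded: set[int]) -> dict[int, str]:
    # Measure free VRAM per device (after the model has been released) and
    # derive per-GPU caps, reserving headroom for inference working memory.
    caps: dict[int, str] = {}
    for i in range(torch.cuda.device_count()):
        free_bytes, _total = torch.cuda.mem_get_info(i)
        usable_gib = max(free_bytes - _HEADROOM, 0) / 1024**3
        if i in overloaded:
            usable_gib *= _CORRECTION_FACTOR
        caps[i] = f"{max(int(usable_gib), 1)}GiB"
    return caps

# The result is passed as max_memory=... to from_pretrained() on the one-time reload.
```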

trust_remote_code interactive prompt (model.py)

Previously, models requiring custom code would fail silently or trigger HuggingFace's own prompt with no context. Now heretic passes trust_remote_code=False on the first attempt, catches the resulting error, explains to the user what custom code execution means (with a link to the HF repo), and retries with trust_remote_code=True if the user confirms. It also handles mamba-ssm import errors with an auto-install prompt.
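In outline, the flow resembles the following sketch; the helper name, model_id, and load_kwargs are placeholders, and the real logic lives inside heretic's model loading loop:

```python
import questionary
from transformers import AutoModelForCausalLM

def load_model_with_trust_prompt(model_id: str, **load_kwargs):
    # First attempt with custom code disabled, so HuggingFace raises an error
    # instead of showing its own interactive prompt.
    try:
        return AutoModelForCausalLM.from_pretrained(
            model_id, trust_remote_code=False, **load_kwargs
        )
    except Exception as error:
        if "trust_remote_code" not in str(error).lower():
            raise
        print(f"{model_id} ships custom architecture code that must be executed to load it.")
        print(f"You can inspect the repository at https://huggingface.co/{model_id}")
        if questionary.confirm("Trust and run this model's custom code?", default=True).ask():
            return AutoModelForCausalLM.from_pretrained(
                model_id, trust_remote_code=True, **load_kwargs
            )
        raise
```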

Fast kernel suggestion (model.py)

After listing abliterable components, if Mamba layers are present and causal-conv1d/mamba-ssm are not installed, prints a one-time suggestion with the pip command, CUDA toolkit version requirement
(≥ 11.6), and expected build time (~10 min).
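The check itself can be as simple as the following sketch (function name and message wording are illustrative):

```python
import importlib.util

def maybe_suggest_fast_kernels(has_mamba_layers: bool) -> None:
    # Suggest the fast kernels once if the model has Mamba layers and either
    # package is missing.
    missing = [
        pkg for pkg in ("causal_conv1d", "mamba_ssm")
        if importlib.util.find_spec(pkg) is None
    ]
    if has_mamba_layers and missing:
        print("Mamba layers detected, but fast kernels are not installed:")
        print("  pip install causal-conv1d mamba-ssm")
        print("  (requires CUDA toolkit >= 11.6; building takes roughly 10 minutes)")
```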

Merge path fix for models with built-in quantization (model.py, main.py)

get_merged_model() and obtain_merge_strategy() now detect models that have a quantization_config baked into their HuggingFace config (e.g. models already quantized at publish time) and route them through the same CPU-reload merge path as BNB 4-bit, preventing a silent merge failure on export.
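The detection itself is small. A sketch of the idea, where the helper name is hypothetical and the attribute check mirrors how transformers exposes baked-in quantization settings on the config object:

```python
def has_builtin_quantization(model) -> bool:
    # True for models published with a quantization_config in their HF config.
    return getattr(model.config, "quantization_config", None) is not None
```

When this returns True, the merge takes the same CPU-reload path as BNB 4-bit.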

Test plan

  • nvidia/NVIDIA-Nemotron-Nano-9B-v2 loads and abliterates on a dual-GPU system without OOM
  • VRAM calibration triggers on NemotronH, not on standard transformers (e.g. Qwen/Qwen3-14B)
  • trust_remote_code prompt appears with context when loading a model that requires it
  • Fast kernel suggestion prints when causal-conv1d/mamba-ssm are absent
  • Models with built-in quantization config reach the CPU-reload merge path correctly
  • All CI checks pass (ruff format, ruff lint, ty check, uv build)

gemini-code-assist bot commented:

Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the model compatibility of Heretic by introducing robust support for FP8/NVFP4 pre-quantized models and the complex NemotronH hybrid architecture. It refines the model loading and processing pipeline to intelligently handle diverse quantization schemes and novel layer structures, ensuring stability and accurate abliteration across a broader range of large language models.

Highlights

  • FP8/NVFP4 Model Support: Enabled loading of pre-quantized FP8/NVFP4 models by treating "fp8" as a special dtype token, allowing HuggingFace to auto-detect quantization configurations.
  • NemotronH Architecture Integration: Added comprehensive support for the NemotronH hybrid architecture, which combines Mamba2 SSM, MoE, and Attention layers, including specific layer discovery and module extraction patterns.
  • Dynamic Layer Discovery: Implemented a fallback for backbone.layers and enhanced component scanning to inspect all layers, accommodating diverse and hybrid model structures.
  • Robust Hidden State Capture: Introduced a hook-based mechanism to capture per-layer hidden states for models that do not return them via standard generate() or forward() calls.
  • Quantization Method Expansion: Added FP8 to the QuantizationMethod enum and integrated its handling into model loading, merging, and reset processes for consistency.
  • Evaluator Stability: Fixed a division-by-zero error in the evaluator's refusal score calculation when base_refusals is zero.
  • Abliteration Safety: Incorporated NaN and meta-device safety checks during the abliteration process to prevent model corruption and ensure stability with CPU-offloaded modules.


Changelog
  • FP8_NVFP4_CHANGES.md
    • Added a new documentation file detailing the problem, solution, usage, and implementation specifics for FP8/NVFP4 quantization and NemotronH architecture support.
  • pyproject.toml
    • Added an optional fp8 dependency group for future on-the-fly FP8 quantization.
  • src/heretic/config.py
    • Added FP8 variant to the QuantizationMethod enum.
    • Updated the quantization field description to include fp8 as an option.
  • src/heretic/evaluator.py
    • Fixed a division-by-zero error in get_score() when base_refusals is zero.
  • src/heretic/main.py
    • Updated obtain_merge_strategy() signature to accept a Model instance.
    • Modified obtain_merge_strategy() to detect FP8 quantized models for merge path considerations.
  • src/heretic/model.py
    • Introduced _FP8_DTYPE_TOKEN constant for FP8 dtype detection.
    • Modified __init__ to handle _FP8_DTYPE_TOKEN for from_pretrained() calls, using torch_dtype=torch.bfloat16 and storing the _loaded_dtype.
    • Expanded get_merged_model() logic to include FP8 models in the CPU-reload merge path.
    • Updated reset_model() to correctly reload FP8 models using torch_dtype=torch.bfloat16.
    • Fixed _get_quantization_config() to treat fp8 dtype similarly to auto for compute dtype determination.
    • Added a model.backbone.layers fallback in get_layers() for NemotronH-like architectures.
    • Enhanced get_layer_modules() to recognize NemotronH hybrid layer patterns (Mamba2 SSM, MoE, Attention) and removed a restrictive assertion.
    • Revised get_abliterable_components() to scan all layers for a comprehensive list of abliterable components.
    • Implemented _get_hidden_states_via_hooks() as a fallback for capturing hidden states in models that don't expose them directly.
    • Improved get_residuals() to utilize the hook-based hidden state capture, ensure multi-GPU device compatibility, and include NaN detection.
    • Added a guard in abliterate() to skip meta-device and NaN-weight modules.
    • Included a NaN diagnostic warning in get_logprobs().
    • Updated the __init__ printout to display per-component layer counts.

gemini-code-assist bot left a comment:

Code Review

This pull request introduces significant enhancements to support the NemotronH architecture and FP8/NVFP4 pre-quantized models. The changes are extensive and well-implemented, including robust fallbacks for layer discovery and hidden state extraction, as well as safety checks for multi-GPU and quantized model handling. The addition of detailed documentation in FP8_NVFP4_CHANGES.md is particularly helpful.

My review focuses on a couple of areas for improvement:

  • A performance optimization opportunity in the model initialization logic.
  • A minor documentation inconsistency.

Overall, this is a high-quality contribution that greatly expands the capabilities of the project.

cpagac force-pushed the nemotron-support branch 5 times, most recently from c1ec534 to af4384f on February 18, 2026 at 07:18
cpagac (Contributor, Author) commented Feb 18, 2026

The PR body and description have been overhauled to accurately reflect the full diff.

cpagac (Contributor, Author) commented Feb 18, 2026

/gemini review

gemini-code-assist bot left a comment:

Code Review

This pull request introduces significant new functionality to support hybrid Mamba/SSM architectures like NemotronH, along with several correctness fixes and user experience improvements. The changes are comprehensive and well-structured, particularly the VRAM calibration for multi-GPU setups and the interactive prompts for trust_remote_code and missing dependencies.

My review focuses on improving code maintainability and ensuring adherence to the repository's style guide. I've pointed out a few areas where the code could be refactored for clarity, a magic number that should be a constant, and several comments that need to be updated to match the project's coding conventions. Overall, this is a great contribution that significantly expands the tool's capabilities.

Comment on lines 99 to 292
for dtype in settings.dtypes:
if abort:
break
print(f"* Trying dtype [bold]{dtype}[/]... ", end="")

try:
quantization_config = self._get_quantization_config(dtype)

extra_kwargs = {}
# Only include quantization_config if it's not None
# (some models like gpt-oss have issues with explicit None).
if quantization_config is not None:
extra_kwargs["quantization_config"] = quantization_config

self.model = get_model_class(settings.model).from_pretrained(
settings.model,
dtype=dtype,
device_map=settings.device_map,
max_memory=self.max_memory,
trust_remote_code=self.trusted_models.get(settings.model),
**extra_kwargs,
)

# If we reach this point and the model requires trust_remote_code,
# either the user accepted, or settings.trust_remote_code is True.
if self.trusted_models.get(settings.model) is None:
self.trusted_models[settings.model] = True

# A test run can reveal dtype-related problems such as the infamous
# "RuntimeError: probability tensor contains either `inf`, `nan` or element < 0"
# (https://github.com/meta-llama/llama/issues/380).
self.generate(
[
Prompt(
system=settings.system_prompt,
user="What is 1+1?",
while True:
try:
quantization_config = self._get_quantization_config(dtype)

extra_kwargs = {}
# Only include quantization_config if it's not None
# (some models like gpt-oss have issues with explicit None).
if quantization_config is not None:
extra_kwargs["quantization_config"] = quantization_config

# Pass trust_remote_code=False (not None) when trust hasn't been
# established yet. This prevents HF from showing its own interactive
# prompt; we handle that ourselves below with clearer context.
self.model = get_model_class(settings.model).from_pretrained(
settings.model,
dtype=dtype,
device_map=self.device_map,
max_memory=self.max_memory,
trust_remote_code=self.trusted_models.get(settings.model)
or False,
**extra_kwargs,
)

# If we reach this point and the model requires trust_remote_code,
# either the user accepted, or settings.trust_remote_code is True.
if self.trusted_models.get(settings.model) is None:
self.trusted_models[settings.model] = True

# A test run can reveal dtype-related problems such as the infamous
# "RuntimeError: probability tensor contains either `inf`, `nan` or element < 0"
# (https://github.com/meta-llama/llama/issues/380).
self.generate(
[
Prompt(
system=settings.system_prompt,
user="What is 1+1?",
)
],
max_new_tokens=1,
)

# After a successful load and warmup on multi-GPU systems, check
# whether each GPU has enough free VRAM for batch inference. If not,
# compute corrected per-GPU caps from the actual measured allocations
# and reload once. This handles architectures (e.g. NemotronH) where
# SSM workspace and other one-time allocations during the first
# forward pass leave insufficient headroom for batched inference.
# Only applies to hybrid SSM models — regular transformers don't
# allocate persistent inference workspace on top of model weights.
if (
not _vram_calibrated
and torch.cuda.is_available()
and torch.cuda.device_count() > 1
and self._has_mamba_layers()
):
_HEADROOM = 6 * 1024**3 # 6 GiB minimum free per GPU
gpu_count = torch.cuda.device_count()
min_free = min(
torch.cuda.mem_get_info(i)[0] for i in range(gpu_count)
)
],
max_new_tokens=1,
)
except Exception as error:
self.model = None # ty:ignore[invalid-assignment]
empty_cache()
print(f"[red]Failed[/] ({error})")
if min_free < _HEADROOM:
print()
print(
f"[yellow]Only {min_free / (1024**3):.1f} GiB free on "
"most-loaded GPU — recalibrating layout for batch inference...[/]"
)
# Identify overloaded GPUs before releasing the model.
overloaded = {
i
for i in range(gpu_count)
if torch.cuda.mem_get_info(i)[0] < _HEADROOM
}

# Release model so we can measure true available VRAM.
self.model = None # ty:ignore[invalid-assignment]
empty_cache()

max_mem: dict[int | str, str] = {}
for i in range(gpu_count):
free_i, _ = torch.cuda.mem_get_info(i)
# Reserve headroom for inference working memory
# (SSM workspace, KV cache, activations, etc.).
usable = max(free_i - _HEADROOM, 0)
if i in overloaded:
# Apply correction to prevent Accelerate from
# overloading this GPU again due to layer-size
# underestimation (~30% on hybrid architectures).
stated_gib = max(int(usable / (1024**3) * 0.7), 1)
else:
# Full budget — this GPU absorbs displaced layers.
stated_gib = max(int(usable / (1024**3)), 1)
max_mem[i] = f"{stated_gib}GiB"
caps = ", ".join(
f"GPU {k}: {v}" for k, v in max_mem.items()
)
print(f" [dim]Corrected caps: {caps}[/]")
self.max_memory = max_mem
_vram_calibrated = True
print(
f"* Retrying dtype [bold]{dtype}[/] with corrected caps... ",
end="",
)
continue # reload this dtype with corrected max_memory
except Exception as error:
self.model = None # ty:ignore[invalid-assignment]
empty_cache()
print(f"[red]Failed[/] ({error})")

error_str = str(error).lower()

if "trust_remote_code" in error_str:
if self.trusted_models.get(settings.model) is None:
# Model requires custom code — explain and ask once.
print()
print(
"[yellow](This is expected — the model requires permission to run custom code.)[/]"
)
print(
f"[yellow][bold]{settings.model}[/bold] ships custom architecture "
"code that must be executed to load this model. "
f"You can inspect the repository at "
f"https://huggingface.co/{settings.model}[/]"
)
print()
if questionary.confirm(
"Trust and run this model's custom code?",
default=True,
).ask():
self.trusted_models[settings.model] = True
print(f"* Retrying dtype [bold]{dtype}[/]... ", end="")
continue # retry this dtype with trust granted
else:
self.trusted_models[settings.model] = False
abort = True
break # trust already decided; move to next dtype or abort

if "mamba-ssm" in error_str:
# Missing dependency — retrying other dtypes won't help.
print()
print(
f"[bold red]mamba-ssm is required to load [cyan]{settings.model}[/cyan].[/]"
)
print()
if questionary.confirm(
"Install mamba-ssm now? (this may take several minutes)",
default=True,
).ask():
try:
subprocess.check_call(
[
sys.executable,
"-m",
"pip",
"install",
"mamba-ssm",
]
)
except subprocess.CalledProcessError:
print()
print("[bold red]Auto-install failed.[/]")
print(
"[yellow]mamba-ssm requires the CUDA toolkit (nvcc) to build. "
"Install nvcc, then run:[/] pip install mamba-ssm"
)
print("[yellow]To install nvcc:[/]")
print(
" sudo apt install nvidia-cuda-toolkit [dim]# Ubuntu/Debian[/]"
)
print(
" conda install -c nvidia cuda-nvcc [dim]# Conda[/]"
)
raise SystemExit(1)
print()
print(
"[green]Installation complete. Retrying model load...[/]"
)
print()
continue # retry this dtype after install
abort = True
break

# For all other errors, update trust cache if needed and try next dtype.
if self.trusted_models.get(settings.model) is None:
self.trusted_models[settings.model] = True
break
else:
# Load and test generate succeeded — exit the retry loop.
break

Severity: medium

The __init__ method has grown quite large and complex. To improve maintainability and readability, consider refactoring parts of this logic into smaller, dedicated helper methods. For example, the VRAM calibration logic (lines 145-206), the trust_remote_code prompt handling, and the mamba-ssm installation could each be extracted into their own methods.

Additionally, there are a few other areas for improvement:

  • Magic Number: The correction factor 0.7 on line 191 is a magic number. It should be defined as a named constant with a descriptive name (e.g., _VRAM_CORRECTION_FACTOR).

  • Style Guide: Several inline comments do not adhere to the repository's style guide (Rule 4), which requires comments to start with a capital letter and end with a period. Please update the following comments to be compliant:

    • Line 159: # 6 GiB minimum free per GPU
    • Line 206: # reload this dtype with corrected max_memory
    • Line 234: # retry this dtype with trust granted
    • Line 238: # trust already decided; move to next dtype or abort
    • Line 281: # retry this dtype after install
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

cpagac (Contributor, Author) replied:

Agreed on the style violations; I'll fix those.
On the 0.7 correction factor: this was tuned empirically against observed NemotronH Mamba2 SSM workspace allocation. Accelerate tends to underestimate layer sizes by around 30% on hybrid architectures, so the factor compensates for this. Determining how to distribute layers across GPUs without overcommitting VRAM is genuinely one of the harder problems here, since the only reliable signal is a live OOM.

To add a bit more context on why there is 6 GiB of headroom: it was similarly derived from observed post-warmup free VRAM on the test system — enough to cover the SSM workspace plus KV cache and activations for batch inference.

Both could be worth promoting to named constants with a comment explaining the rationale, which I can do if that would be helpful.

The broader refactoring into helper methods is tied to the hybrid.py question below.
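For example, something along these lines (constant names are only suggestions):

```python
# Headroom covers the observed Mamba2 SSM workspace plus KV cache and
# activations for batch inference on the test system.
_VRAM_HEADROOM = 6 * 1024**3

# Accelerate underestimates hybrid-architecture layer sizes by roughly 30%,
# so overloaded GPUs are capped at 70% of their measured free VRAM.
_VRAM_CORRECTION_FACTOR = 0.7
```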

Comment on lines 416 to 417
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters

Severity: medium

This comment does not end with a period, which violates the repository's style guide (Rule 4).

Suggested change
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters.
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Comment on lines +572 to +586
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite):
# include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]

Severity: medium

Some comments in this section violate the repository's style guide (Rule 4):

  • Line 572: Does not end with a period.
  • Line 582: Ends with a colon instead of a period.
  • Line 583: Starts with a lowercase letter.

Please correct these comments to follow the style guide.

Suggested change
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite).
# Include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

cpagac (Contributor, Author) commented Feb 18, 2026

@p-e-w
A thought that came up while I was initially working on this, and Gemini's review sparked it again: would it be worth moving the hybrid/SSM-specific logic (VRAM calibration, hook-based hidden-state capture, mixer-layer detection) into a dedicated module, such as hybrid.py? I feel like most users won't run a NemotronH-style model, or really any hybrid model, so it could live outside core model.py. Curious what your take is on that approach before I consider a full restructure.

cpagac (Contributor, Author) commented Feb 19, 2026

When testing with a model that would not fit within VRAM constraints, the calibration would still trigger: it released the model, computed tiny caps from whatever free VRAM remained, and reloaded with those caps, making disk offloading worse rather than better. The fix checks whether any model parameters are on the meta device before deciding to recalibrate. If disk offloading is detected, calibration is skipped entirely, since rebalancing the GPU distribution cannot address a capacity problem.
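The guard amounts to something like this sketch (helper name is hypothetical):

```python
import torch

def is_disk_offloaded(model: torch.nn.Module) -> bool:
    # With Accelerate disk offloading, spilled weights remain on the meta
    # device, so any meta parameter means recalibration cannot help.
    return any(p.device.type == "meta" for p in model.parameters())
```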

This was fixed in the most recent commit. The commit message below summarizes the full change set:

NemotronH (Mamba2 SSM + MoE + Attention) requires several changes to
load and abliterate correctly on multi-GPU systems.

Architecture support (model.py):
- Add backbone.layers fallback in get_layers() for NemotronH's
  model.backbone.layers structure
- Add get_layer_modules() patterns for NemotronH's unified mixer
  attribute: mixer.out_proj (Mamba2), mixer.o_proj (attention),
  mixer.down_proj / mixer.experts[*].down_proj /
  mixer.shared_experts.down_proj (MoE)
- Scan all layers in get_abliterable_components() instead of only
  layer 0, to discover the full union of component types in hybrid
  architectures
- Add _get_hidden_states_via_hooks() fallback for models that don't
  return hidden_states through generate() (NemotronH returns tuple
  of Nones); use forward hooks on each layer with device-aware
  stacking for multi-GPU compatibility
- Skip meta-device and NaN-weight modules in abliterate() to prevent
  NaN corruption when layers are CPU-offloaded by Accelerate
- Add _has_mamba_layers() to detect hybrid SSM architectures

Multi-GPU VRAM calibration (model.py):
- After inference warmup on multi-GPU systems, check if any GPU has
  less than 6 GiB free; if so, release the model, measure actual free
  VRAM per GPU, and reload once with corrected per-GPU caps
- Overloaded GPUs get a 0.7 correction factor for Accelerate's
  layer-size underestimation; other GPUs get full budget to absorb
  displaced layers; gated to hybrid SSM models via _has_mamba_layers()
  so regular transformers are unaffected

User experience:
- Show trust_remote_code explanation with model repo link before
  prompting, replacing the bare HuggingFace error message
- Auto-install mamba-ssm when required, with clear nvcc/CUDA toolkit
  guidance on build failure
- Suggest installing causal-conv1d and mamba-ssm after loading any
  model with Mamba layers when fast kernels are missing

Other fixes:
- Sum VRAM across all GPUs in print_memory_usage() (utils.py)
- Show total and per-GPU VRAM in startup output (main.py)
- Fix division by zero in evaluator when base_refusals is 0
- Add mamba optional dependency group to pyproject.toml
