Summary of Changes

Hello @cpagac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the model compatibility of Heretic by introducing robust support for FP8/NVFP4 pre-quantized models and the complex NemotronH hybrid architecture. It refines the model loading and processing pipeline to intelligently handle diverse quantization schemes and novel layer structures, ensuring stability and accurate abliteration across a broader range of large language models.
Code Review
This pull request introduces significant enhancements to support the NemotronH architecture and FP8/NVFP4 pre-quantized models. The changes are extensive and well-implemented, including robust fallbacks for layer discovery and hidden state extraction, as well as safety checks for multi-GPU and quantized model handling. The addition of detailed documentation in FP8_NVFP4_CHANGES.md is particularly helpful.
My review focuses on a couple of areas for improvement:
- A performance optimization opportunity in the model initialization logic.
- A minor documentation inconsistency.
Overall, this is a high-quality contribution that greatly expands the capabilities of the project.
Force-pushed from c1ec534 to af4384f.
PR body and description have been overhauled to accurately reflect the full diff.
/gemini review
Code Review
This pull request introduces significant new functionality to support hybrid Mamba/SSM architectures like NemotronH, along with several correctness fixes and user experience improvements. The changes are comprehensive and well-structured, particularly the VRAM calibration for multi-GPU setups and the interactive prompts for trust_remote_code and missing dependencies.
My review focuses on improving code maintainability and ensuring adherence to the repository's style guide. I've pointed out a few areas where the code could be refactored for clarity, a magic number that should be a constant, and several comments that need to be updated to match the project's coding conventions. Overall, this is a great contribution that significantly expands the tool's capabilities.
The diff under review (new version of the dtype loading loop in `__init__`):

```python
for dtype in settings.dtypes:
    if abort:
        break
    print(f"* Trying dtype [bold]{dtype}[/]... ", end="")
    while True:
        try:
            quantization_config = self._get_quantization_config(dtype)

            extra_kwargs = {}
            # Only include quantization_config if it's not None
            # (some models like gpt-oss have issues with explicit None).
            if quantization_config is not None:
                extra_kwargs["quantization_config"] = quantization_config

            # Pass trust_remote_code=False (not None) when trust hasn't been
            # established yet. This prevents HF from showing its own interactive
            # prompt; we handle that ourselves below with clearer context.
            self.model = get_model_class(settings.model).from_pretrained(
                settings.model,
                dtype=dtype,
                device_map=self.device_map,
                max_memory=self.max_memory,
                trust_remote_code=self.trusted_models.get(settings.model)
                or False,
                **extra_kwargs,
            )

            # If we reach this point and the model requires trust_remote_code,
            # either the user accepted, or settings.trust_remote_code is True.
            if self.trusted_models.get(settings.model) is None:
                self.trusted_models[settings.model] = True

            # A test run can reveal dtype-related problems such as the infamous
            # "RuntimeError: probability tensor contains either `inf`, `nan` or element < 0"
            # (https://github.com/meta-llama/llama/issues/380).
            self.generate(
                [
                    Prompt(
                        system=settings.system_prompt,
                        user="What is 1+1?",
                    )
                ],
                max_new_tokens=1,
            )

            # After a successful load and warmup on multi-GPU systems, check
            # whether each GPU has enough free VRAM for batch inference. If not,
            # compute corrected per-GPU caps from the actual measured allocations
            # and reload once. This handles architectures (e.g. NemotronH) where
            # SSM workspace and other one-time allocations during the first
            # forward pass leave insufficient headroom for batched inference.
            # Only applies to hybrid SSM models — regular transformers don't
            # allocate persistent inference workspace on top of model weights.
            if (
                not _vram_calibrated
                and torch.cuda.is_available()
                and torch.cuda.device_count() > 1
                and self._has_mamba_layers()
            ):
                _HEADROOM = 6 * 1024**3  # 6 GiB minimum free per GPU
                gpu_count = torch.cuda.device_count()
                min_free = min(
                    torch.cuda.mem_get_info(i)[0] for i in range(gpu_count)
                )
                if min_free < _HEADROOM:
                    print()
                    print(
                        f"[yellow]Only {min_free / (1024**3):.1f} GiB free on "
                        "most-loaded GPU — recalibrating layout for batch inference...[/]"
                    )
                    # Identify overloaded GPUs before releasing the model.
                    overloaded = {
                        i
                        for i in range(gpu_count)
                        if torch.cuda.mem_get_info(i)[0] < _HEADROOM
                    }

                    # Release model so we can measure true available VRAM.
                    self.model = None  # ty:ignore[invalid-assignment]
                    empty_cache()

                    max_mem: dict[int | str, str] = {}
                    for i in range(gpu_count):
                        free_i, _ = torch.cuda.mem_get_info(i)
                        # Reserve headroom for inference working memory
                        # (SSM workspace, KV cache, activations, etc.).
                        usable = max(free_i - _HEADROOM, 0)
                        if i in overloaded:
                            # Apply correction to prevent Accelerate from
                            # overloading this GPU again due to layer-size
                            # underestimation (~30% on hybrid architectures).
                            stated_gib = max(int(usable / (1024**3) * 0.7), 1)
                        else:
                            # Full budget — this GPU absorbs displaced layers.
                            stated_gib = max(int(usable / (1024**3)), 1)
                        max_mem[i] = f"{stated_gib}GiB"
                    caps = ", ".join(
                        f"GPU {k}: {v}" for k, v in max_mem.items()
                    )
                    print(f" [dim]Corrected caps: {caps}[/]")
                    self.max_memory = max_mem
                    _vram_calibrated = True
                    print(
                        f"* Retrying dtype [bold]{dtype}[/] with corrected caps... ",
                        end="",
                    )
                    continue  # reload this dtype with corrected max_memory
        except Exception as error:
            self.model = None  # ty:ignore[invalid-assignment]
            empty_cache()
            print(f"[red]Failed[/] ({error})")

            error_str = str(error).lower()

            if "trust_remote_code" in error_str:
                if self.trusted_models.get(settings.model) is None:
                    # Model requires custom code — explain and ask once.
                    print()
                    print(
                        "[yellow](This is expected — the model requires permission to run custom code.)[/]"
                    )
                    print(
                        f"[yellow][bold]{settings.model}[/bold] ships custom architecture "
                        "code that must be executed to load this model. "
                        f"You can inspect the repository at "
                        f"https://huggingface.co/{settings.model}[/]"
                    )
                    print()
                    if questionary.confirm(
                        "Trust and run this model's custom code?",
                        default=True,
                    ).ask():
                        self.trusted_models[settings.model] = True
                        print(f"* Retrying dtype [bold]{dtype}[/]... ", end="")
                        continue  # retry this dtype with trust granted
                    else:
                        self.trusted_models[settings.model] = False
                        abort = True
                break  # trust already decided; move to next dtype or abort

            if "mamba-ssm" in error_str:
                # Missing dependency — retrying other dtypes won't help.
                print()
                print(
                    f"[bold red]mamba-ssm is required to load [cyan]{settings.model}[/cyan].[/]"
                )
                print()
                if questionary.confirm(
                    "Install mamba-ssm now? (this may take several minutes)",
                    default=True,
                ).ask():
                    try:
                        subprocess.check_call(
                            [
                                sys.executable,
                                "-m",
                                "pip",
                                "install",
                                "mamba-ssm",
                            ]
                        )
                    except subprocess.CalledProcessError:
                        print()
                        print("[bold red]Auto-install failed.[/]")
                        print(
                            "[yellow]mamba-ssm requires the CUDA toolkit (nvcc) to build. "
                            "Install nvcc, then run:[/] pip install mamba-ssm"
                        )
                        print("[yellow]To install nvcc:[/]")
                        print(
                            "    sudo apt install nvidia-cuda-toolkit [dim]# Ubuntu/Debian[/]"
                        )
                        print(
                            "    conda install -c nvidia cuda-nvcc [dim]# Conda[/]"
                        )
                        raise SystemExit(1)
                    print()
                    print(
                        "[green]Installation complete. Retrying model load...[/]"
                    )
                    print()
                    continue  # retry this dtype after install
                abort = True
                break

            # For all other errors, update trust cache if needed and try next dtype.
            if self.trusted_models.get(settings.model) is None:
                self.trusted_models[settings.model] = True
            break
        else:
            # Load and test generate succeeded — exit the retry loop.
            break
```
The __init__ method has grown quite large and complex. To improve maintainability and readability, consider refactoring parts of this logic into smaller, dedicated helper methods. For example, the VRAM calibration logic (lines 145-206), the trust_remote_code prompt handling, and the mamba-ssm installation could each be extracted into their own methods.
Additionally, there are a few other areas for improvement:
- **Magic Number:** The correction factor `0.7` on line 191 is a magic number. It should be defined as a named constant with a descriptive name (e.g., `_VRAM_CORRECTION_FACTOR`).
- **Style Guide:** Several inline comments do not adhere to the repository's style guide (Rule 4), which requires comments to start with a capital letter and end with a period. Please update the following comments to be compliant:
  - Line 159: `# 6 GiB minimum free per GPU`
  - Line 206: `# reload this dtype with corrected max_memory`
  - Line 234: `# retry this dtype with trust granted`
  - Line 238: `# trust already decided; move to next dtype or abort`
  - Line 281: `# retry this dtype after install`
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
Agreed on the style violations, I'll fix those.
On the 0.7 correction factor: this was tuned empirically against observed NemotronH Mamba2 SSM workspace allocation. Accelerate tends to underestimate layer sizes by around 30% on hybrid architectures, so the factor compensates for this. Determining how to distribute layers across GPUs without overcommitting VRAM is genuinely one of the harder problems here, since the only reliable signal is a live OOM.
To add a bit more context on why there is 6 GiB of headroom: it was similarly derived from observed post-warmup free VRAM on the test system — enough to cover the SSM workspace plus KV cache and activations for batch inference.
Both could be worth promoting to named constants with a comment explaining the rationale, which I can do if that would be helpful.
The broader refactoring into helper methods is tied to the hybrid.py question below.
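For reference, a minimal sketch of what promoting the two values to named constants with rationale comments could look like (constant names are illustrative):

```python
# Accelerate underestimates layer sizes by roughly 30% on hybrid SSM
# architectures, so overloaded GPUs get a reduced stated cap on reload.
_VRAM_CORRECTION_FACTOR = 0.7

# Minimum free VRAM to keep per GPU, derived from observed post-warmup
# allocations: SSM workspace plus KV cache and activations for batch inference.
_VRAM_HEADROOM = 6 * 1024**3  # 6 GiB.
```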
```python
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters
```
This comment does not end with a period, which violates the repository's style guide (Rule 4).
Suggested change:

```python
# Quantized models need special handling - we must reload the base model
# in full precision to merge the LoRA adapters.
```
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite):
# include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```
Some comments in this section violate the repository's style guide (Rule 4):
- Line 572: Does not end with a period.
- Line 582: Ends with a colon instead of a period.
- Line 583: Starts with a lowercase letter.
Please correct these comments to follow the style guide.
Suggested change:

```python
# NemotronH hybrid layers - all use a unified `mixer` attribute.
# Attention layers have mixer.o_proj.
with suppress(Exception):
    try_add("attn.o_proj", layer.mixer.o_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH simple MLP layers have mixer.down_proj.
with suppress(Exception):
    try_add("mlp.down_proj", layer.mixer.down_proj)  # ty:ignore[possibly-missing-attribute]

# NemotronH MoE layers have mixer.experts (per-expert) and mixer.shared_experts.
# Following heretic's standard pattern for MoE models (Qwen3, Phi-3.5-MoE, Granite).
# Include all expert down_proj modules. Optuna will optimize the weight.
with suppress(Exception):
    for expert in layer.mixer.experts:  # ty:ignore[possibly-missing-attribute, not-iterable]
        try_add("mlp.down_proj", expert.down_proj)  # ty:ignore[possibly-missing-attribute]
```
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
@p-e-w
Force-pushed from af4384f to f2d66c8.
When testing with a model that would not fit within VRAM constraints, the calibration would still trigger: it released the model, computed tiny caps from whatever free VRAM remained, and reloaded with those caps, making disk offloading worse rather than better. The fix, included in the most recent commit, checks whether any model parameters are on the meta device before deciding to recalibrate. If disk offloading is detected, calibration is skipped entirely, since rebalancing the GPU distribution cannot address a capacity problem.
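A minimal sketch of the kind of guard described, assuming a straightforward parameter scan (the actual check in the commit may differ):

```python
# Parameters that Accelerate offloaded to disk are left on the "meta" device.
# If any are found, the model simply does not fit, and rebalancing per-GPU
# caps cannot help, so the VRAM recalibration step is skipped entirely.
offloaded_to_disk = any(
    p.device.type == "meta" for p in self.model.parameters()
)
if not offloaded_to_disk:
    ...  # Proceed with the multi-GPU VRAM calibration shown earlier.
```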
NemotronH (Mamba2 SSM + MoE + Attention) requires several changes to load and abliterate correctly on multi-GPU systems.

Architecture support (model.py):

- Add backbone.layers fallback in get_layers() for NemotronH's model.backbone.layers structure
- Add get_layer_modules() patterns for NemotronH's unified mixer attribute: mixer.out_proj (Mamba2), mixer.o_proj (attention), mixer.down_proj / mixer.experts[*].down_proj / mixer.shared_experts.down_proj (MoE)
- Scan all layers in get_abliterable_components() instead of only layer 0, to discover the full union of component types in hybrid architectures
- Add _get_hidden_states_via_hooks() fallback for models that don't return hidden_states through generate() (NemotronH returns a tuple of Nones); use forward hooks on each layer with device-aware stacking for multi-GPU compatibility
- Skip meta-device and NaN-weight modules in abliterate() to prevent NaN corruption when layers are CPU-offloaded by Accelerate
- Add _has_mamba_layers() to detect hybrid SSM architectures

Multi-GPU VRAM calibration (model.py):

- After inference warmup on multi-GPU systems, check if any GPU has less than 6 GiB free; if so, release the model, measure actual free VRAM per GPU, and reload once with corrected per-GPU caps
- Overloaded GPUs get a 0.7 correction factor for Accelerate's layer-size underestimation; other GPUs get the full budget to absorb displaced layers; gated to hybrid SSM models via _has_mamba_layers() so regular transformers are unaffected

User experience:

- Show a trust_remote_code explanation with the model repo link before prompting, replacing the bare HuggingFace error message
- Auto-install mamba-ssm when required, with clear nvcc/CUDA toolkit guidance on build failure
- Suggest installing causal-conv1d and mamba-ssm after loading any model with Mamba layers when fast kernels are missing

Other fixes:

- Sum VRAM across all GPUs in print_memory_usage() (utils.py)
- Show total and per-GPU VRAM in startup output (main.py)
- Fix division by zero in the evaluator when base_refusals is 0
- Add a mamba optional dependency group to pyproject.toml
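As a small illustration of the print_memory_usage() change listed above, summing allocated VRAM across all visible GPUs rather than only device 0 might look roughly like the sketch below; the helper name is illustrative, not the actual function in utils.py:

```python
import torch

def total_vram_allocated_gib() -> float:
    # Sum allocated VRAM across every visible GPU instead of only device 0.
    return sum(
        torch.cuda.memory_allocated(i)
        for i in range(torch.cuda.device_count())
    ) / 1024**3
```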
Force-pushed from f2d66c8 to ecaf645.
Summary
Adds support for hybrid Mamba/SSM architectures (e.g. nvidia/NVIDIA-Nemotron-Nano-9B-v2), with several correctness fixes along the way.
Hybrid model support (model.py)
- Changes the attention projection (o_proj) lookup from a hard assert to a suppressed exception, so layers without standard self-attention don't abort
- Adds get_layers() and get_layer_modules() fallbacks for NemotronH's backbone.layers and unified mixer structures. Logs a warning for any layers with no recognized modules
- Adds a forward-hook fallback for hidden state extraction, moving each layer's output to a common device before stacking, required on multi-GPU setups (see the sketch after this list)
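A simplified sketch of that hook-based fallback, assuming it runs a plain forward pass and stacks the per-layer outputs (names and details are illustrative, not the exact code from the PR):

```python
import torch

def _get_hidden_states_via_hooks(self, inputs: dict) -> torch.Tensor:
    # Fallback for models whose generate() does not return usable
    # hidden_states (NemotronH returns a tuple of Nones): capture each
    # decoder layer's output with a forward hook instead.
    captured: list[torch.Tensor] = []

    def hook(module, args, output):
        out = output[0] if isinstance(output, tuple) else output
        captured.append(out)

    handles = [layer.register_forward_hook(hook) for layer in self.get_layers()]
    try:
        with torch.no_grad():
            self.model(**inputs)
    finally:
        for handle in handles:
            handle.remove()

    # On multi-GPU setups the layers (and their outputs) live on different
    # devices, so move everything to a common device before stacking.
    device = captured[0].device
    return torch.stack([hidden.to(device) for hidden in captured])
```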
Multi-GPU VRAM calibration (model.py)
NemotronH's Mamba2 SSM layers allocate a persistent ~4 GiB workspace during the first forward pass — after model loading — which Accelerate cannot account for when computing the device map. On
multi-GPU systems this causes OOM mid-inference even when initial placement looked balanced.
After the warmup generate() call, if any GPU has less than 6 GiB free and the model has Mamba layers, the model is released, actual post-warmup free VRAM is measured per device, and the model
reloads once with corrected max_memory caps. A 0.7 correction factor is applied only to overloaded GPUs to prevent Accelerate from repeating the same placement error. Regular models are unaffected.
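The gating helper referred to here, _has_mamba_layers(), could plausibly be a simple module scan; the following is a hedged sketch, not necessarily the PR's actual implementation:

```python
def _has_mamba_layers(self) -> bool:
    # Heuristic: hybrid SSM architectures such as NemotronH contain Mamba
    # mixer modules; detect them by module class name rather than by
    # architecture identifier.
    return any(
        "mamba" in type(module).__name__.lower()
        for module in self.model.modules()
    )
```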
trust_remote_code interactive prompt (model.py)
Previously, models requiring custom code would fail silently or trigger HuggingFace's own prompt with no context. Now heretic passes trust_remote_code=False on first attempt, catches the resulting
error, explains to the user what custom code execution means (with a link to the HF repo), and retries with trust_remote_code=True if the user confirms. Also handles mamba-ssm import errors with an
auto-install prompt.
Fast kernel suggestion (model.py)
After listing abliterable components, if Mamba layers are present and causal-conv1d/mamba-ssm are not installed, prints a one-time suggestion with the pip command, CUDA toolkit version requirement
(≥ 11.6), and expected build time (~10 min).
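The availability check itself could be as simple as probing for the two packages; a sketch under the assumption that import probing is how it is detected (the PR may do it differently):

```python
import importlib.util

def fast_mamba_kernels_available() -> bool:
    # The fused CUDA kernels are only used when both packages are importable.
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("causal_conv1d", "mamba_ssm")
    )
```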
Merge path fix for models with built-in quantization (model.py, main.py)
get_merged_model() and obtain_merge_strategy() now detect models that have a quantization_config baked into their HuggingFace config (e.g. models already quantized at publish time) and route them through the same CPU-reload merge path as BNB 4-bit, preventing a silent merge failure on export.
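The detection presumably reduces to checking the HuggingFace config for a baked-in quantization_config; a hedged sketch (helper name is illustrative):

```python
def has_builtin_quantization(config) -> bool:
    # Models published pre-quantized (e.g. FP8/NVFP4) carry a quantization_config
    # in their HuggingFace config; such models must take the CPU full-precision
    # reload path before LoRA adapters can be merged, just like BNB 4-bit.
    return getattr(config, "quantization_config", None) is not None
```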
Test plan