GPU Node Resilience System for Kubernetes
NVSentinel is a comprehensive collection of Kubernetes services that automatically detect, classify, and remediate hardware and software faults in GPU nodes. Designed for GPU clusters, it ensures maximum uptime and seamless fault recovery in high-performance computing environments.
> **Warning: Experimental Preview Release**
>
> This is an experimental/preview release of NVSentinel. Use at your own risk in production environments. The software is provided "as is" without warranties of any kind. Features, APIs, and configurations may change without notice in future releases. For production deployments, thoroughly test in non-critical environments first.
Prerequisites:

- Kubernetes 1.25+
- Helm 3.0+
- NVIDIA GPU Operator (includes DCGM for GPU monitoring)
```shell
# Install from GitHub Container Registry
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.2.0 \
  --namespace nvsentinel \
  --create-namespace

# View chart information
helm show chart oci://ghcr.io/nvidia/nvsentinel --version v0.2.0
```

Features:

- **Comprehensive Monitoring**: Real-time detection of GPU, NVSwitch, and system-level failures
- **Automated Remediation**: Intelligent fault handling with cordon, drain, and break-fix workflows
- **Modular Architecture**: Pluggable health monitors with standardized gRPC interfaces
- **High Availability**: Kubernetes-native design with replica support and leader election
- **Real-time Processing**: Event-driven architecture with immediate fault response
- **Persistent Storage**: MongoDB-based event store with change streams for real-time updates
- **Graceful Handling**: Coordinated workload eviction with configurable timeouts
For a full installation with all dependencies, follow these steps:
```shell
# Install cert-manager
helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.19.1 --set installCRDs=true \
  --wait

# Install a minimal kube-prometheus-stack for metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts --force-update
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.enabled=true \
  --set alertmanager.enabled=false \
  --set grafana.enabled=false \
  --set kubeStateMetrics.enabled=false \
  --set nodeExporter.enabled=false \
  --wait

# Install NVSentinel
NVSENTINEL_VERSION=v0.2.0
helm upgrade --install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --namespace nvsentinel --create-namespace \
  --version "$NVSENTINEL_VERSION" \
  --timeout 15m \
  --wait

# Verify the deployment
kubectl get pods -n nvsentinel
kubectl get nodes  # Verify GPU nodes are visible

# Run comprehensive validation
./scripts/validate-nvsentinel.sh --version v0.2.0 --verbose
```

Testing: The example above uses default settings. For testing with simulated GPU nodes, use `tilt/release/values-release.yaml`. For production, customize values for your environment.
Production: By default, only health monitoring is enabled. Enable fault quarantine and remediation modules via Helm values. See Configuration below.
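As a sketch of that step, the core modules can be enabled with a Helm values override. The file name is illustrative; the module keys mirror the Configuration section below.

```yaml
# values-production.yaml (illustrative override; enable only the modules you need)
global:
  faultQuarantineModule:
    enabled: true
  nodeDrainerModule:
    enabled: true
  faultRemediationModule:
    enabled: true
```

Pass it to the install command with `-f values-production.yaml`.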
NVSentinel follows a microservices architecture with modular health monitors and core processing modules:
```mermaid
graph TB
    subgraph "Health Monitors"
        GPU["GPU Health Monitor<br/>(DCGM Integration)"]
        SYS["Syslog Health Monitor<br/>(Journalctl)"]
        CSP["CSP Health Monitor<br/>(CSP APIs)"]
    end

    subgraph "Core Processing"
        PC["Platform Connectors<br/>(gRPC Server)"]
        STORE[("MongoDB Store<br/>(Event Database)")]
        FQ["Fault Quarantine<br/>(Node Cordon)"]
        ND["Node Drainer<br/>(Workload Eviction)"]
        FR["Fault Remediation<br/>(Break-Fix Integration)"]
        HEA["Health Events Analyzer<br/>(Pattern Analysis)"]
        LBL["Labeler<br/>(Node Labels)"]
    end

    subgraph "Kubernetes Cluster"
        K8S["Kubernetes API<br/>(Nodes, Pods, Events)"]
    end

    GPU -->|gRPC| PC
    SYS -->|gRPC| PC
    CSP -->|gRPC| PC
    PC -->|persist| STORE
    PC <-->|update status| K8S
    FQ -.->|watch changes| STORE
    FQ -->|cordon| K8S
    ND -.->|watch changes| STORE
    ND -->|drain| K8S
    FR -.->|watch changes| STORE
    FR -->|create CRDs| K8S
    HEA -.->|watch changes| STORE
    LBL -->|update labels| K8S
```
Data Flow:
- Health Monitors detect hardware/software faults and send events via gRPC to Platform Connectors
- Platform Connectors validate, persist events to MongoDB, and update Kubernetes node conditions
- Core Modules independently watch MongoDB change streams for relevant events
- Modules interact with Kubernetes API to cordon, drain, label nodes, and create remediation CRDs
- Labeler monitors pods to automatically label nodes with DCGM and driver versions
Note: All modules operate independently without direct communication. Coordination happens through MongoDB change streams and Kubernetes API.
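This coordination model can be illustrated with a small sketch: plain Python functions stand in for the real services, and a list stands in for the MongoDB change stream. All names here are illustrative, not NVSentinel code.

```python
# Each "module" reacts to the same event stream independently; no module
# calls another directly.
cordoned, drained = set(), set()

def fault_quarantine(event):
    # Cordon any node that reported a fatal fault.
    if event["isFatal"]:
        cordoned.add(event["node"])

def node_drainer(event):
    # Drain only nodes that quarantine has already cordoned (observed via
    # the shared store in the real system).
    if event["node"] in cordoned:
        drained.add(event["node"])

change_stream = [
    {"node": "gpu-node-1", "isFatal": True},
    {"node": "gpu-node-2", "isFatal": False},
    {"node": "gpu-node-1", "isFatal": True},
]

for event in change_stream:
    for module in (fault_quarantine, node_drainer):
        module(event)

print(sorted(cordoned))  # ['gpu-node-1']
print(sorted(drained))   # ['gpu-node-1']
```

Because each module keeps its own cursor into the stream, modules can restart or lag independently without losing coordination.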
Control module enablement and behavior:
```yaml
global:
  dryRun: false  # Test mode - no actual actions

  # Health Monitors (enabled by default)
  gpuHealthMonitor:
    enabled: true
  syslogHealthMonitor:
    enabled: true
  cspHealthMonitor:
    enabled: false  # Cloud provider integration

  # Core Modules (disabled by default - enable for production)
  faultQuarantineModule:
    enabled: false
  nodeDrainerModule:
    enabled: false
  faultRemediationModule:
    enabled: false
  healthEventsAnalyzer:
    enabled: false
```

For detailed per-module configuration, see Module Details.
**GPU Health Monitor**: Monitors GPU hardware health via DCGM, detecting thermal issues, ECC errors, and XID events.
Key Configuration:
```yaml
global:
  gpuHealthMonitor:
    enabled: true
    useHostNetworking: false  # Enable for direct DCGM access
    dcgm:
      service:
        endpoint: "nvidia-dcgm.gpu-operator.svc"
        port: 5555
```

**Syslog Health Monitor**: Analyzes system logs for hardware and software fault patterns via journalctl.
Key Configuration:
```yaml
global:
  syslogHealthMonitor:
    enabled: true
    pollingInterval: "30m"
    stateFile: "/var/run/syslog_health_monitor/state.json"
```

**CSP Health Monitor**: Integrates with cloud provider APIs (GCP/AWS) for maintenance events.
Key Configuration:
```yaml
global:
  cspHealthMonitor:
    enabled: false
    cspName: "gcp"  # or "aws"
    configToml:
      maintenanceEventPollIntervalSeconds: 60
```

**Platform Connectors**: Receive health events from monitors via gRPC, persist them to MongoDB, and update Kubernetes node status.
Key Configuration:
```yaml
platformConnector:
  mongodbStore:
    enabled: true
    connectionString: "mongodb://nvsentinel-mongodb:27017"
```

**Fault Quarantine**: Watches MongoDB for health events and cordons nodes based on configurable rules.
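Conceptually, the cordon decision is rule evaluation over incoming events. A minimal sketch, with rules pre-parsed from TOML into Python structures; the evaluator is illustrative, not NVSentinel's:

```python
# Rule set mirroring the configuration shown below, already parsed.
RULE_SETS = [
    {
        "name": "GPU fatal error ruleset",
        "match_all": ["event.isFatal == true"],
        "should_cordon": True,
    }
]

def should_cordon(event: dict) -> bool:
    """Return True if any rule set matches the event and requests a cordon."""
    for rule_set in RULE_SETS:
        # Simplified evaluation: only "event.<field> == true" expressions.
        fields = [expr.split(".")[1].split(" ")[0] for expr in rule_set["match_all"]]
        if all(event.get(f) is True for f in fields) and rule_set["should_cordon"]:
            return True
    return False

print(should_cordon({"isFatal": True}))   # True
print(should_cordon({"isFatal": False}))  # False
```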
Key Configuration:
```yaml
global:
  faultQuarantineModule:
    enabled: false
    config: |
      [[rule-sets]]
      name = "GPU fatal error ruleset"

      [[rule-sets.match.all]]
      expression = "event.isFatal == true"

      [rule-sets.cordon]
      shouldCordon = true
```

**Node Drainer**: Gracefully evicts workloads from cordoned nodes with configurable policies.
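A sketch of how per-namespace drain policies might resolve; the glob pattern and mode mirror the configuration shown below, but the resolver itself is hypothetical:

```python
from fnmatch import fnmatch

# Namespace policies, e.g. let Run:ai workloads finish rather than evicting them.
USER_NAMESPACE_POLICIES = [
    {"name": "runai-*", "mode": "AllowCompletion"},
]

def drain_mode(namespace: str, default: str = "Evict") -> str:
    """Return the drain mode for a pod's namespace (first matching policy wins)."""
    for policy in USER_NAMESPACE_POLICIES:
        if fnmatch(namespace, policy["name"]):
            return policy["mode"]
    return default

print(drain_mode("runai-team-a"))  # AllowCompletion
print(drain_mode("default"))       # Evict
```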
Key Configuration:
```yaml
global:
  nodeDrainerModule:
    enabled: false
    config: |
      evictionTimeoutInSeconds = "60"

      [[userNamespaces]]
      name = "runai-*"
      mode = "AllowCompletion"
```

**Fault Remediation**: Triggers external break-fix systems after drain completion.
Key Configuration:
```yaml
global:
  faultRemediationModule:
    enabled: false
    maintenanceResource:
      apiGroup: "janitor.dgxc.nvidia.com"
      namespace: "dgxc-janitor"
```

**Health Events Analyzer**: Analyzes event patterns and generates recommended actions.
Key Configuration:
```yaml
global:
  healthEventsAnalyzer:
    enabled: false
    config: |
      [[rules]]
      name = "XID Pattern Detection"
      time_window = "30m"
      recommended_action = "COMPONENT_RESET"
```

**MongoDB Store**: Persistent storage for health events with real-time change streams.
Key Configuration:
```yaml
mongodb:
  architecture: replicaset
  replicaCount: 3
  auth:
    enabled: true
  tls:
    enabled: true
    mTLS:
      enabled: true
```

- Kubernetes: 1.25 or later
- Helm: 3.0 or later
- NVIDIA GPU Operator: For GPU monitoring capabilities (includes DCGM)
- Storage: Persistent storage for MongoDB (recommended 10GB+)
- Network: Cluster networking for inter-service communication
We welcome contributions! Here's how to get started:
Ways to Contribute:
- Report bugs and request features via issues
- Improve documentation
- Add tests and increase coverage
- Submit pull requests to fix issues
- Help others in discussions
Getting Started:
- Read the Contributing Guide for guidelines
- Check the Development Guide for setup instructions
- Browse open issues for opportunities
All contributors must sign their commits (DCO). See the contributing guide for details.
- Bug Reports: Create an issue
- Questions: Start a discussion
- Security: See Security Policy
- Star this repository to show your support
- Watch for updates on releases and announcements
- Share NVSentinel with others who might benefit
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Built with ❤️ by NVIDIA for GPU infrastructure reliability