NVIDIA DGX H100/H200 User Guide
NVIDIA Corporation
Contents

4.4.1 Startup Considerations
4.4.2 Shutdown Considerations
4.5 Verifying Functionality - Quick Health Check
4.6 Running the Pre-flight Test
4.7 Running NGC Containers with GPU Support
4.7.1 Using Native GPU Support
4.7.2 Using the NVIDIA Container Runtime for Docker
4.8 Managing CPU Mitigations
4.8.1 Determining the CPU Mitigation State of the DGX System
4.8.2 Disabling CPU Mitigations
4.8.3 Re-enabling CPU Mitigations
5 SBIOS Settings
5.1 Accessing the SBIOS Setup
5.2 Configuring the Boot Order
5.3 Configuring the Local Terminal
5.3.1 Linux
5.3.2 Windows and MacOS
5.4 Power on or Reboot the System
8 Security
8.1 User Security Measures
8.1.1 Securing the BMC Port
8.2 System Security Measures
8.2.1 Secure Flash of DGX H100/H200 Firmware
8.2.2 Encryption
8.2.3 NVIDIA System Manager Security
8.3 Secure Data Deletion
8.3.1 Prerequisites
8.3.2 Procedure
9 Redfish APIs Support
9.1 Supported Redfish Features
9.2 Connectivity Between the Host and BMC
9.3 Redfish Examples
9.3.1 BMC Manager
9.3.2 Firmware Update
9.3.3 BIOS Settings
9.3.4 Modifying the Boot Order on DGX H100/H200 Using Redfish
9.3.5 Changing the UEFI Secure Boot Platform Key
9.3.6 Telemetry
9.3.7 Chassis
9.3.8 SEL Logs
9.3.9 Virtual Image
9.3.10 Backing Up and Restoring BMC Configurations
9.3.10.1 Backing Up the BMC Configuration
9.3.10.2 Restoring the BMC Configuration
9.3.11 Collecting BMC Debug Data
9.3.12 Clear BIOS and Reset to Factory Defaults
9.3.13 Querying GPU Power Limit
9.3.14 Power Capping
9.3.14.1 Services
9.3.14.2 Domains
9.3.14.3 Custom Policies
9.3.14.4 PSU Policies
10 Safety
10.1 Safety Information
10.2 Safety Warnings and Cautions
10.3 Intended Application Uses
10.4 Site Selection
10.5 Equipment Handling Practices
10.6 Electrical Precautions
10.6.1 Power and Electrical Warnings
10.6.2 Power Cord Warnings
10.7 System Access Warnings
10.8 Rack Mount Warnings
10.9 Electrostatic Discharge
10.10 Other Hazards
10.10.1 CALIFORNIA DEPARTMENT OF TOXIC SUBSTANCES CONTROL
10.10.2 NICKEL
10.10.3 Battery Replacement
10.10.4 Cooling and Airflow
11 Compliance
11.1 United States
11.2 United States/Canada
11.3 Canada
11.4 CE
11.5 Australia and New Zealand
11.6 Brazil
11.7 Japan
11.8 South Korea
11.9 China
11.10 Taiwan
11.11 Russia/Kazakhstan/Belarus
11.12 Israel
11.13 India
11.14 South Africa
11.15 Great Britain (England, Wales, and Scotland)
13 Notices
13.1 Notice
13.2 Trademarks
The NVIDIA DGX H100/H200 System User Guide is also available as a PDF.
Chapter 1. Introduction to NVIDIA DGX H100/H200 Systems
The NVIDIA DGX™ H100/H200 Systems are the universal systems purpose-built for all AI infrastructure
and workloads from analytics to training to inference. The DGX H100/H200 systems are built on eight
NVIDIA H100 Tensor Core GPUs or eight NVIDIA H200 Tensor Core GPUs.
GPU
  For H100: 8 x NVIDIA H100 GPUs that provide 640 GB total GPU memory
  For H200: 8 x NVIDIA H200 GPUs that provide 1,128 GB total GPU memory

CPU
  2 x Intel Xeon 8480C PCIe Gen5 CPUs with 56 cores each
  2.0/2.9/3.8 GHz (base/all core turbo/Max turbo)

NVSwitch
  4 x 4th generation NVLinks that provide 900 GB/s GPU-to-GPU bandwidth

Storage (OS)
  2 x 1.92 TB NVMe M.2 SSD (ea) in RAID 1 array

Storage (Data Cache)
  8 x 3.84 TB NVMe U.2 SED (ea) in RAID 0 array

Network (Cluster) card
  4 x OSFP ports for 8 x NVIDIA® ConnectX®-7 Single Port InfiniBand Cards
  Each card provides the following speeds:
  ▶ InfiniBand (default): Up to 400Gbps
  ▶ Ethernet: 400GbE, 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE

Network (storage and in-band management) card
  2 x NVIDIA® ConnectX®-7 Dual Port Ethernet Cards
  Each card provides the following speeds:
  ▶ Ethernet (default): 400GbE, 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE
  ▶ InfiniBand: Up to 400Gbps
The system includes six power supply units (PSU) configured for 4+2 redundancy.
Refer to the following additional considerations:
▶ If a PSU fails, troubleshoot the cause and replace the failed PSU immediately.
▶ If three PSUs lose power as a result of a data center issue or power distribution unit failure, the
system continues to function, but at a reduced performance level.
▶ If only three PSUs have power, shut down the system before replacing an operational PSU.
▶ The system only boots if at least three PSUs are operational. If fewer than three PSUs are oper-
ational, only the BMC is available.
▶ Do not operate the system with PSUs depopulated.
Warning
To avoid electric shock or fire, only use the NVIDIA-provided power cords to connect power to the
DGX H100/H200. For more details, refer to Electrical Precautions.
Important
Do not use the provided cables with any other product or for any other purpose.
▶ To insert or remove a cable, make sure the cable is unlocked, then push it into or pull it out of the socket.
Important
Refer to the section First Boot Setup for instructions on how to properly turn the system on or off.
See Network Connections, Cables, and Adaptors for details on the network connections.
Important
Connect directly to the DGX H100/H200 console if the NVIDIA DGX™ H100/H200 system is con-
nected to a 172.17.xx.xx subnet.
DGX OS Server software installs Docker Engine which uses the 172.17.xx.xx subnet by default for
Docker containers. If the DGX H100/H200 system is on the same subnet, you will not be able to
establish a network connection to the DGX H100/H200 system.
Refer to Configuring Docker IP Addresses in the NVIDIA DGX OS 6 User Guide for instructions on how
to change the default Docker network settings.
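If moving the DGX system to a different subnet is not practical, the conflict can also be avoided by giving the Docker bridge a different address range. As a minimal sketch (the subnet value below is only an example; the NVIDIA DGX OS 6 User Guide remains the authoritative procedure), add a line such as the following to /etc/docker/daemon.json, keeping any existing entries:
"bip": "192.168.99.1/24"
Then restart Docker so the change takes effect:
sudo systemctl restart docker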
Caution
Perform the First Boot Setup to change the default credentials before connecting the BMC to an unsecured network.
Caution
When you create a BMC admin user, we strongly recommend that you change the default password for this user. DO NOT use the default password.
During the first-boot procedure, you were prompted to configure an administrator username and pass-
word and a password for the BMC. The BMC username is the same as the administrator username:
▶ Username: <administrator-username>
▶ Password: <bmc-password>
1. Make sure you have connected the BMC port on the DGX H100/H200 system to your LAN.
2. Open a browser within your LAN and go to https://<bmc-ip-address>/
Make sure popups are allowed for the BMC address.
3. Log in.
4. From the navigation menu, click Remote Control.
The Remote Control page enables you to open a virtual Keyboard/Video/Mouse (KVM) on the
DGX H100/H200 system, as if you were using a physical monitor and keyboard connected to the
front of the system.
5. Click Launch KVM.
The DGX H100/H200 console appears in your browser.
This section provides information about the set up process after you first boot the NVIDIA DGX™
H100/H200 Systems.
While NVIDIA partner network personnel or NVIDIA field service engineers will install the DGX
H100/H200 system at the site and perform the first boot setup, the first boot setup instructions
are provided here for reference and to support any reimaging of the server.
3. Refer to First Boot Process for DGX Servers in the NVIDIA DGX OS 6 User Guide for information
about the following topics:
▶ Optionally encrypt the root file system.
▶ Use the first boot wizard to set the language, locale, country, and so on.
▶ Create an administrative user account for the system, BMC, and Grub boot loader.
▶ Configure the primary network interface.
Note
During this time, running the nvsm show health command reports a warning that the RAID volume is re-syncing.
You can monitor the status of the RAID 1 rebuild process by running the sudo nvsm show volumes command, and then view the output under /systems/localhost/storage/volumes/md0/rebuild.
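Because the OS drives form a Linux software RAID 1 (md) array, the rebuild progress can also be read directly from the kernel. This is a generic Linux check, not a DGX-specific command:
# Shows resync/rebuild progress for the md arrays, including md0
cat /proc/mdstat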
This topic provides basic requirements and instructions for using the NVIDIA DGX™ H100/H200 Sys-
tems, including how to perform a preliminary health check and how to prepare for running containers.
Refer to the DGX documentation for additional product documentation.
Important
Your DGX H100/H200 system must be installed by NVIDIA partner network personnel or NVIDIA field service engineers. If the installation is not performed accordingly, your hardware warranty will be voided.
4.2. Registration
To obtain support for your DGX H100/H200, follow the instructions for registration in the Entitlement
Certification email that was sent as part of the purchase.
Registration allows you to access the NVIDIA Enterprise Support Portal, obtain technical support, get
software updates, and set up an NGC for DGX systems account. If you did not receive the informa-
tion, open a case with the NVIDIA Enterprise Support Team at https://www.nvidia.com/en-us/support/
enterprise/.
Refer to Customer Support for contact information.
Warning
Risk of Danger - Removing power cables or using Power Distribution Units (PDUs) to shut off the
system while the Operating System is running may cause damage to sensitive components in the
DGX H100/H200 server.
3. Verify that the output summary shows that all checks are Healthy and that the overall system
status is Healthy.
4. Verify that Docker is installed by viewing the installed Docker version.
sudo docker --version
On success, the command returns the version as Docker version xx.yy.zz, where the actual
version may differ depending on the specific release of the DGX OS Server software.
5. Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.
sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
The preceding command pulls the nvidia/cuda container image layer by layer, then runs the
nvidia-smi command.
When complete, the output shows the NVIDIA Driver version and a description of each installed
GPU.
For more information, refer to Containers For Deep Learning Frameworks User Guide.
Recommended Command
The following command runs the test on all supported components (GPU, CPU, memory, and storage),
and takes approximately 20 minutes.
sudo nvsm stress-test --force
▶ Native GPU support
▶ NVIDIA Container Runtime for Docker (deprecated - availability to be removed in a future DGX OS release)
The DGX OS also includes the NVIDIA Container Runtime for Docker (nvidia-docker2) which lets you
run GPU-accelerated containers in one of the following ways:
▶ Use docker run and specify runtime=nvidia.
docker run --runtime=nvidia ...
The nvidia-docker2 package provides backward compatibility with the previous nvidia-docker package,
so you can run GPU-accelerated containers using this command and the new runtime will be used.
▶ Use docker run with nvidia as the default runtime.
You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.
"default-runtime": "nvidia",
Here is an example of how the added line appears in the JSON file. Do not remove any pre-existing
content when making this change.
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "∕usr∕bin∕nvidia-container-runtime",
"args": []
}
}
}
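After saving /etc/docker/daemon.json, restart Docker so the new default runtime takes effect. The following is a hedged sketch of restarting the service and confirming that a container can see the GPUs without an explicit --gpus or --runtime flag (the container image tag matches the earlier verification step and is only an example):
sudo systemctl restart docker
# With nvidia set as the default runtime, no --gpus or --runtime flag is required
sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi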
Caution
If you build Docker images while nvidia is set as the default runtime, make sure the build scripts
executed by the Dockerfile specify the GPU architectures that the container will need. Failure to
do so might result in the container being optimized only for the GPU architecture on which it was
built. Instructions for specifying the GPU architecture depend on the application and are beyond
the scope of this document. Consult the specific application build process.
▶ CPU mitigations are enabled if the output consists of multiple lines prefixed with Mitigation:.
Example
KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Mitigation: PTI
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy∕swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
▶ CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.
Example
KVM: Vulnerable
Mitigation: PTE Inversion; VMX: vulnerable
Vulnerable; SMT vulnerable
Vulnerable
Vulnerable
Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerable
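The per-vulnerability state shown in these examples is exposed by the Linux kernel under sysfs. A generic way to produce this kind of output (the DGX OS documentation may show a slightly different command) is:
# Print the kernel-reported mitigation state for each known CPU vulnerability
grep . /sys/devices/system/cpu/vulnerabilities/*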
Caution
Performing the following instructions will disable the CPU mitigations provided by the DGX OS
Server software.
The output should include several Vulnerable lines. See Determining the CPU Mitigation State
of the DGX System for example output.
The output should include several Mitigations lines. See Determining the CPU Mitigation State
of the DGX System for example output.
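On Ubuntu-based distributions such as DGX OS, CPU mitigations are typically toggled through the mitigations= kernel command-line parameter. The following is only a generic sketch of that mechanism; the file edited and the exact steps are assumptions, so follow the DGX OS procedure referenced above for the supported method:
# Disable mitigations: add mitigations=off to the kernel command line, then rebuild GRUB and reboot
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&mitigations=off /' /etc/default/grub
sudo update-grub
sudo reboot
# Re-enable mitigations: remove mitigations=off from /etc/default/grub, run update-grub, and reboot again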
The NVIDIA DGX™ H100/H200 system comes with a system BIOS with optimized settings for the DGX
system. There might be situations where the settings need to be changed, such as changes in the
boot order, changes to enable PXE booting, or changes in the BMC network settings.
Instructions for these use cases are provided in this section.
Important
Do not change settings in the SBIOS other than those described in this or other DGX H100/H200
user documents. Contact NVIDIA Enterprise Services before making other changes.
Here are some occasions where it might be necessary to reconfigure settings in the SBIOS:
▶ Configuring a BMC Static IP Address Using the System BIOS
▶ Enabling the TPM and Preventing the BIOS from Sending Block SID Requests
▶ Clearing the TPM
5.3.1. Linux
1. Set the locale and language for your terminal:
sudo localectl set-locale LANG=en_US.UTF-8
▶ Using IPMItool
ipmitool -I lanplus -H <ip-address> -U admin -P dgxluna.admin sol activate
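To end the serial-over-LAN session, the standard ipmitool mechanisms apply; this is a general ipmitool usage note rather than a DGX-specific procedure. From inside the session, the default escape sequence ~. closes the connection, or the session can be terminated from another shell:
# Terminate an active SOL session
ipmitool -I lanplus -H <ip-address> -U <username> -P <password> sol deactivate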
The NVIDIA DGX™ H100/H200 system comes with a baseboard management controller (BMC) for
monitoring and controlling various hardware devices on the system. It monitors system sensors and
other parameters.
Note
If you cannot access the DGX H100/H200 system remotely, connect a display (1440x900 or lower
resolution) and keyboard directly to the DGX H100/H200 system.
▶ To set the subnet mask, enter the following and replace the italicized text with your infor-
mation.
$ sudo ipmitool lan set 1 netmask <my-netmask-address>
▶ To set the default gateway IP (Router IP address in the BIOS settings), enter the following
and replace the italicized text with your information.
$ sudo ipmitool lan set 1 defgw ipaddr <my-default-gateway-ip-address>
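To confirm that the new BMC network settings took effect, the standard ipmitool LAN print command can be used (shown here as a general ipmitool check):
# Display the current BMC LAN configuration for channel 1
$ sudo ipmitool lan print 1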
6.5.2. Procedure
To change your credentials or add or remove users, perform the following steps:
1. Select Settings from the left-side navigation menu.
2. Select the User Management card.
3. Click the help icon (?) for information about configuring users and creating a password.
4. Log out and then log in with the new credentials.
2. Click Active Directory Settings or LDAP/E-Directory Settings and follow the instructions.
The Event Filters page shows all configured event filters and available slots. You can modify or add
new event filter entry on this page.
▶ To view available configured and unconfigured slots, click All in the upper-left corner of the page.
▶ To view available configured slots, click Configured in the upper-left corner of the page.
▶ To view available unconfigured slots, click UnConfigured in the upper-left corner of the page.
▶ To delete an event filter from the list, click the x icon.
The View SSL Certificate page displays the following basic information about the uploaded SSL cer-
tificate:
▶ Certificate Version, Serial Number, Algorithm, and Public Key
▶ Issuer information
▶ Valid Date range
▶ Issued to information
Common Name (CN) The common name for which the certificate is to be generated.
▶ Maximum length of 64 alphanumeric characters.
▶ Special characters ‘#’ and ‘$’ are not allowed.
Organization (O) The name of the organization for which the certificate is gener-
ated.
▶ Maximum length of 64 alphanumeric characters.
▶ Special characters ‘#’ and ‘$’ are not allowed.
Organization Unit (OU) Overall organization section unit name for which the certificate is
generated.
▶ Maximum length of 64 alphanumeric characters.
▶ Special characters ‘#’ and ‘$’ are not allowed.
2. Click the New Certificate folder icon, browse to locate the appropriate file, and select it.
3. Click the New Private Key folder icon, browse and locate the appropriate file, and select it.
4. Click Save.
6. In the BIOS setup menu on the Advanced tab, select Tls Auth Config.
This section provides information about security measures in the NVIDIA DGX™ H100/H200 system.
8.2.2. Encryption
Here is some information about encrypting the DGX H100/H200 firmware.
The firmware encryption algorithm is AES-CBC.
▶ The firmware encryption key strength is 128 bits or higher.
▶ Each firmware class uses a unique encryption key.
▶ Firmware decryption is performed either by the same agent that performs signature check or a
more trusted agent in the same COT.
8.3.1. Prerequisites
You need to prepare a bootable installation medium that contains the current DGX OS Server ISO
image.
Refer to Reimaging in the NVIDIA DGX OS 6 User Guide for information on the following topics:
▶ Obtaining the DGX OS ISO Image
▶ Booting the DGX OS ISO Image
8.3.2. Procedure
Here are the instructions to securely delete data from the DGX H100/H200 system SSDs.
1. Boot the system from the ISO image, either remotely or from a bootable USB key.
2. At the GRUB menu, select:
▶ (For DGX OS 6): Rescue a broken system and configure the locale and network information.
3. When prompted to select a root file system, select Do not use a root file system and then select
Execute a shell in the installer environment.
4. Log in.
5. Run the following command to identify the devices available in the system:
nvme list
If the nvme-cli package is not installed, then install the CLI as follows and then run nvme list.
dpkg -i /usr/lib/live/mount/rootfs/filesystem.squashfs/curtin/repo/<nvme-cli-package.deb>
where <device-path> is the specific storage node as listed in the previous step. For example, /dev/nvme0n1.
Chapter 9. Redfish APIs Support
The DGX System firmware supports Redfish APIs. Redfish is DMTF’s standard set of APIs for managing
and monitoring a platform. By default, Redfish support is enabled in the DGX H100/H200 BMC and the
SBIOS. By using the Redfish interface, administrator-privileged users can browse physical resources
at the chassis and system level through the REST API interface. Redfish provides information that is
categorized under a specific resource endpoint, and Redfish clients can use the endpoints with the following HTTP methods:
▶ GET
▶ POST
▶ PATCH
▶ PUT
▶ DELETE
Not all endpoints support all these operations. Refer to the Redfish JSON Schema for more information about the operations. The Redfish server follows the DSP0266 1.7.0 Specification and Redfish Schema 2019.1 documentation. Redfish URIs are accessed by using basic authentication, so IPMI users with the required privilege can access the Redfish URIs.
Replace the network interface name and IP address in the preceding example according to your needs.
After you configure the network interface, you can use commands such as curl and nvfwupd with
the 169.254.0.17 IP address to connect to the BMC and use the Redfish API.
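As a hedged illustration of what such a host-side configuration can look like (the interface name and host address below are placeholders and assumptions; use values appropriate to your system):
# Assign a link-local address on the host interface that faces the BMC (example values only)
sudo ip addr add 169.254.0.2/16 dev <interface-name>
# Confirm that the BMC responds on its link-local address
ping -c 3 169.254.0.17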
The following example command shows the firmware versions:
nvfwupd -t ip=169.254.0.17 username=<bmc-user> password=<password> show_version
The password field is mandatory and must meet the following requirements:
▶ At least 13 characters long but no more than 20 characters.
▶ At least 1 lowercase letter (a-z).
▶ At least 1 uppercase letter (A-Z).
▶ At least 1 digit (0-9).
▶ At least 1 special character (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).
▶ White space is not allowed.
▶ Reset BMC
The following curl command forces a reset of the DGX H100/H200 BMC.
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/Actions/Manager.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "ForceRestart"}'
Example Output
{
"@odata.context": "/redfish/v1/$metadata#SoftwareInventoryCollection.SoftwareInventoryCollection",
"@odata.etag": "\"1683226281\"",
"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory",
"@odata.type": "#SoftwareInventoryCollection.SoftwareInventoryCollection",
"Description": "Collection of Firmware Inventory resources available to the UpdateService",
"Members": [
{
"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/CPLDMB_0"
},
{
"@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/CPLDMID_0"
},
...
{
"Name": "CPLDMB_0",
"Version": "0.2.1.6"
},
{
"DataSourceUri": "/redfish/v1/UpdateService/FirmwareInventory/CPLDMID_0",
"Name": "CPLDMID_0",
"Version": "0.2.0.7"
},
// ...
]
}
On success, the command returns a 204 HTTP status code. If you attempt to set the flag to the
currently set value, the command returns a 400 HTTP status code.
To get the value of the ForceUpdate parameter:
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/UpdateService'
One of the Registries in the list is your BIOS attribute registry. The format is BiosAttributeRegistry<version><version>. For example, for BIOS 0.1.6, the registry is BiosAttributeRegistry106.1.0.6.
2. Get the URI of the BIOS registry:
curl -k -u <bmc-user>:<password> --location --request GET 'https://<bmc-ip-address>/redfish/v1/Registries/BiosAttributeRegistry016.0.1.6/'
The response includes the location of the JSON file that describes all the BIOS attributes.
Under Location, the Uri is specified, for example, "Uri": "/redfish/v1/Registries/BiosAttributeRegistry106.1.0.6".
3. Get the JSON file with the registry of all your BIOS attributes:
curl -k -u <bmc-user>:<password> --location --request GET 'https://<bmc-ip-address>/redfish/v1/Registries/BiosAttributeRegistry106.en-US.1.0.6.json' --output BiosAttributeRegistry106.en-US.1.0.6.json
Each attribute name has a default value, display name, help text, a read-only indicator, and
an indicator of whether a reset is required to take effect.
▶ To get the current BIOS settings:
curl -k -u <bmc-user>:<password> --location --request GET 'https://<bmc-ip-address>/redfish/v1/Systems/DGX/Bios'
Match the attribute name with the value in the registry for a description.
Example response:
"Description": "Current BIOS Settings",
"Id": "Bios",
"Name": "Current BIOS Settings"
...
▶ To change an attribute in the future BIOS settings, PATCH the SD URI and specify the attribute
name with the new value. You can change more than one attribute at a time.
For example, the following PATCH request specifies how the system responds when the SEL log
is full:
curl -k -u <bmc-user>:<password> --location --request PATCH 'https://<bmc-ip-address>/redfish/v1/Systems/DGX/Bios/SD' -H 'Content-Type: application/json' -d '{"Attributes": {"<attribute-name>": "IPMI201Donotloganymore"}}'
Example response:
"Description": "Future BIOS Settings",
"Id": "SD",
"Name": "Future BIOS Settings"
...
Note
All attribute changes to the BIOS require a power cycle to take effect. If an attribute change is followed by a BIOS update, an additional power cycle is needed to apply the changes.
From any system in the same network as the BMC, run the following curl command to get the
current boot order:
$ curl -k -u <BMC username>:<BMC password> https://<BMC_IP_address>/redfish/v1/Systems/DGX/SD -H "content-type:application/json" -X GET -s | jq .Boot.BootOrder
[
"Boot0000",
"Boot000F",
"Boot0004",
"Boot0005",
"Boot0006",
"Boot0007",
"Boot0008",
"Boot0009",
"Boot000A",
"Boot0010"
]
"@odata.etag": "\"1696896625\"",
"DisplayName": "DGX OS",
"Name": "Boot0000",
"UefiDevicePath": "HD(1,GPT,159C2E52-2329-40AC-9103-6C28DC1528B8,0x800,0x100000)∕\
,→\EFI\\UBUNTU\\SHIMX64.EFI"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Intel(R) Ethernet Controller X550",
"Name": "Boot0004",
"UefiDevicePath": "PciRoot(0x0)∕Pci(0x10,0x0)∕Pci(0x0,0x0)∕MAC(5CFF35FBDA09,0x1)∕
,→IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Nvidia Network Adapter - B8:3F:D2:E7:B1:6C",
"Name": "Boot0005",
"UefiDevicePath": "PciRoot(0x20)∕Pci(0x1,0x0)∕Pci(0x0,0x0)∕Pci(0x0,0x0)∕Pci(0x0,
,→0x0)∕Pci(0x0,0x0)∕Pci(0x0,0x0)∕MAC(B83FD2E7B16C,0x1)∕IPv4(0.0.0.0,0x0,DHCP,0.0.
,→0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Nvidia Network Adapter - B8:3F:D2:E7:B1:6D",
"Name": "Boot0006",
"UefiDevicePath": "PciRoot(0x20)∕Pci(0x1,0x0)∕Pci(0x0,0x0)∕Pci(0x0,0x0)∕Pci(0x0,
,→0x0)∕Pci(0x0,0x0)∕Pci(0x0,0x1)∕MAC(B83FD2E7B16D,0x1)∕IPv4(0.0.0.0,0x0,DHCP,0.0.
,→0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Nvidia Network Adapter - B8:3F:D2:E7:B0:9C",
"Name": "Boot0007",
"UefiDevicePath": "PciRoot(0x120)∕Pci(0x1,0x0)∕Pci(0x0,0x0)∕Pci(0x0,0x0)∕Pci(0x0,
,→0x0)∕Pci(0x0,0x0)∕Pci(0x0,0x0)∕MAC(B83FD2E7B09C,0x1)∕IPv4(0.0.0.0,0x0,DHCP,0.0.
,→0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Nvidia Network Adapter - B8:3F:D2:E7:B0:9D",
(continues on next page)
,→0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Intel(R) Ethernet Network Adapter E810-C-Q2",
"Name": "Boot0009",
"UefiDevicePath": "PciRoot(0x160)∕Pci(0x5,0x0)∕Pci(0x0,0x0)∕MAC(6CFE543D8F48,0x1)∕
,→IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 Intel(R) Ethernet Network Adapter E810-C-Q2",
"Name": "Boot000A",
"UefiDevicePath": "PciRoot(0x160)∕Pci(0x5,0x0)∕Pci(0x0,0x1)∕MAC(6CFE543D8F49,0x1)∕
,→IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)"
"@odata.etag": "\"1696896625\"",
"DisplayName": "ubuntu",
"Name": "Boot000F",
"UefiDevicePath": "HD(1,GPT,1E0EFF2A-2BF3-4DC6-8757-4075B1E5343D,0x800,0x100000)∕\
,→\EFI\\UBUNTU\\SHIMX64.EFI"
"@odata.etag": "\"1696896625\"",
"DisplayName": "UEFI: PXE IPv4 American Megatrends Inc.",
"Name": "Boot0010",
"UefiDevicePath": "PciRoot(0x0)∕Pci(0x14,0x0)∕USB(0xA,0x0)∕USB(0x2,0x1)∕
,→MAC(4E2A712C2451,0x0)∕IPv4(0.0.0.0,0x0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)"
Where
▶ The DisplayName string is the name of the drive or network adapter.
▶ The Name string is the boot device name.
▶ The MAC(<address>,0x1) value for the UefiDevicePath string is the corresponding MAC
address.
▶ The @odata.etag string is the etag number.
Identify the following information from the JSON output for the next step:
▶ The name of the device to be the boot device.
▶ The etag number to compose the header.
3. Update the boot order.
The following command uses the PATCH method to modify the BootOrder settings, specifying
the etag number and boot device names from step 2. The command generates a new order list
for BootOrder, which affects the next boot of the system.
$ curl -k -u <BMC username>:<BMC password> https://<BMC_IP_address>/redfish/v1/Systems/DGX/SD -H "content-type:application/json" -H 'If-None-Match: "<etag-number>"' -X PATCH -d '{"Boot": {"BootOrder": [
"Boot0004",
"Boot0000",
"Boot0005",
"Boot0006",
"Boot0007",
"Boot0008",
"Boot0009",
"Boot000A",
"Boot000F",
"Boot0010"
]}}'
Upon reboot, the system should attempt to boot from the network using the correct network interface.
This boot order change will remain until the next boot order update, which can be done by resetting
the SBIOS or running this procedure again.
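To confirm that the new order was applied, the same query used earlier to read the boot order can be repeated after the change:
$ curl -k -u <BMC username>:<BMC password> https://<BMC_IP_address>/redfish/v1/Systems/DGX/SD -H "content-type:application/json" -X GET -s | jq .Boot.BootOrder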
,→"1721382290"' -d '{"SecureBootEnable":false}' | jq
,→PK∕Certificates∕1 | jq
,→PK∕Certificates -d
'{
"CertificateString": "-----BEGIN CERTIFICATE-----\n ... \n-----END CERTIFICATE---
,→--",
"CertificateType": "PEM",
"UefiSignatureOwner": "<GUID-of-the-UEFI-signature-owner>"
}'
Where
▶ The CertificateString string is the certificate starting with -----BEGIN CERTIFICATE.
▶ The CertificateType string is the format of the certificate, a Privacy Enhanced Mail
(PEM)-encoded single certificate.
▶ The UefiSignatureOwner string (UUID) is the UEFI signature owner for this signature.
4. Reboot the system for the change to take effect.
curl -ks -u <bmc-user>:<password> -H "Content-Type: application/json" -X POST https://<bmc-ip-address>/redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset -d '{"ResetType": "<reset-type>"}'
9.3.6. Telemetry
▶ GPU tray sensors
curl -k -u <bmc-user>:<password> --location --request GET 'https://<bmc-ip-address>/redfish/v1/TelemetryService/MetricReportDefinitions/HGX_PlatformEnvironmentMetrics_0'
The endpoint returns 75 members at a time. To page through the results, use the URI in the Members@odata.nextLink field. For example, /redfish/v1/Chassis/DGX/Sensors?$skip=75.
9.3.7. Chassis
▶ Chassis Restart (IPMI chassis power cycle)
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "ForceRestart"}'
▶ Chassis Graceful Restart (IPMI chassis soft off, IPMI chassis power on)
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "GracefulRestart"}'
▶ Chassis Power Cycle (IPMI chassis power off, IPMI chassis power on)
curl -k -u <bmc-user>:<password> --request POST --location 'https://<bmc-ip-address>/redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset' --header 'Content-Type: application/json' --data '{"ResetType": "PowerCycle"}'
Note
The ForceRestart, GracefulRestart, and GracefulShutdown reset actions on HMC are not
supported for security reasons.
The endpoint returns 75 members at a time. To page through the results, use the URI in the Members@odata.nextLink field. For example, /redfish/v1/Managers/BMC/LogServices/SEL/Entries?$skip=75.
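A hedged example of reading the SEL entries through the log service path referenced above (jq is optional and used only for readability):
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/LogServices/SEL/Entries' | jq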
Note
You must perform a factory reset to restore the default settings before restoring the BMC config-
uration.
9.3.11. Collecting BMC Debug Data
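The example response below is returned by a request that starts BMC diagnostic data collection. A minimal sketch of such a request, assuming the LogService.CollectDiagnosticData action URI that appears in the response and a Manager diagnostic data type:
# Start BMC debug data collection; the response contains the task Id to monitor in the next step
curl -k -u <bmc-user>:<password> --request POST 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/LogServices/DiagnosticLog/Actions/LogService.CollectDiagnosticData' -H 'Content-Type: application/json' --data '{"DiagnosticDataType": "Manager"}' | jq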
Example response:
{
"@odata.context": "∕redfish∕v1∕$metadata#Task.Task",
"@odata.id": "∕redfish∕v1∕TaskService∕Tasks∕2",
"@odata.type": "#Task.v1_4_2.Task",
"Description": "Task for Manager CollectDiagnosticData",
"Id": "2",
"Name": "Manager CollectDiagnosticData",
"TaskState": "New"
}
2. Change the task number to the appropriate task Id returned from step 1, and monitor the task
for completion until PercentComplete reaches 100.
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/TaskService/Tasks/2' | jq
Example response:
{
"@odata.context": "∕redfish∕v1∕$metadata#Task.Task",
"@odata.etag": "\"1723565599\"",
"@odata.id": "∕redfish∕v1∕TaskService∕Tasks∕2",
"@odata.type": "#Task.v1_4_2.Task",
"Description": "Task for Manager CollectDiagnosticData",
"EndTime": "2024-08-13T16:28:15+00:00",
"Id": "2",
"Messages": [
{
"@odata.type": "#Message.v1_0_8.Message",
"Message": "Indicates that a DiagnosticDump of was created at ∕redfish∕
,→v1∕Managers∕BMC∕LogServices∕DiagnosticLog∕Attachment∕nvidiadiag-HT9buy.tar.gz",
"MessageArgs": [
"∕redfish∕v1∕Managers∕BMC∕LogServices∕DiagnosticLog∕Attachment∕
,→nvidiadiag-HT9buy.tar.gz"
],
"MessageId": "Ami.1.0.0.DiagnosticDumpCreated",
"Resolution": "None",
"Severity": "Warning"
},
{
"@odata.type": "#Message.v1_0_8.Message",
"Message": "Task ∕redfish∕v1∕Managers∕BMC∕LogServices∕DiagnosticLog∕
,→Actions∕LogService.CollectDiagnosticData has completed.",
"MessageArgs": [
"∕redfish∕v1∕Managers∕BMC∕LogServices∕DiagnosticLog∕Actions∕
,→LogService.CollectDiagnosticData"
],
"MessageId": "Task.1.0.Completed",
"Resolution": "None",
"Severity": "OK"
}
],
"Name": "Manager CollectDiagnosticData",
"PercentComplete": 100,
"StartTime": "2024-08-13T16:13:20+00:00",
"TaskState": "Completed",
"TaskStatus": "OK"
}
3. After the TaskState field reports Completed, use the path provided by MessageArgs to down-
load the attachment:
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/LogServices/DiagnosticLog/Attachment/nvidiadiag-HT9buy.tar.gz' --output nvidiadiag-HT9buy.tar.gz
Note
For BMC versions earlier than 24.09.17, use the following command:
curl -k -u <bmc-user>:<password> --request GET 'https://<bmc-ip-address>/redfish/v1/Managers/BMC/LogServices/DiagnosticLog/Entries/All/Attachment' --output debugBMC.tgz
9.3.13. Querying GPU Power Limit
Where
▶ <bmc> is the BMC IP address.
▶ <id> is the GPU instance number, from 1 to 8.
As shown in the following example output, the Reading field indicates the current power usage,
and the SetPoint field indicates the current GPU power limit.
...
"PowerLimitWatts": {
"AllowableMax": 700,
"AllowableMin": 200,
"ControlMode": "Automatic",
"DefaultSetPoint": 700,
"Reading": 64.388,
"SetPoint": 700
}
...
Example response:
{
"@odata.context": "∕redfish∕v1∕$metadata#NodeManager.NodeManager",
"@odata.etag": "\"1709588153\"",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager",
"@odata.type": "#NodeManager.v1_0_0.NodeManager",
(continues on next page)
"target": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Actions∕NodeManager.
,→ChangeState"
}
},
"Description": "Node Manager for BMC",
"Domains": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains"
},
"Id": "NodeManager",
"Name": "Node Manager",
"Policies": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Policies"
},
"Status": {
"Health": "OK",
"State": "Disabled"
},
"ThrottlingStatus": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕ThrottlingStatus"
},
"Triggers": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Triggers"
}
}
9.3.14.2 Domains
There are several predefined domains. If no domains are set, the default domains are shown.
▶ To get a list of domains:
curl -k -u <bmc-user>:<password> https://<bmcip>/redfish/v1/Managers/BMC/NodeManager/Domains
Example response:
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmDomainCollection.
,→ NvidiaNmDomainCollection",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NvidiaNmDomainCollection",
"@odata.type": "#NvidiaNmDomainCollection.NvidiaNmDomainCollection",
"Members": [
{
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0"
},
{
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕1"
},
{
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕4"
},
(continues on next page)
Example response:
{
"@odata.context": "/redfish/v1/$Metadata#NvidiaNmDomain.NvidiaNmDomain",
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/Domains/0",
"@odata.type": "#NvidiaNmDomain.v1_4_0.NvidiaNmDomain",
"Capabilities": {
"MaxCorrectionTimeInMs": 2000,
"MaxStatisticsReportingPeriod": "2000",
"Min": 5000,
"MinCorrectionTimeInMs": 1000,
"MinStatisticsReportingPeriod": "1000"
},
"Id": "0",
"Name": "protection",
"Policies": {
"@odata.context": "/redfish/v1/$Metadata#NvidiaNmPolicyCollection.NvidiaNmPolicyCollection",
"@odata.type": "#NvidiaNmPolicyCollection.NvidiaNmPolicyCollection",
"Members": [
{
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/Domains/0/Policies/0"
},
{
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/Domains/0/Policies/1"
},
{
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/Domains/0/Policies/2"
}
],
...
Example response:
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPolicy.NvidiaNmPolicy",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0∕Policies∕0",
"@odata.type": "#NvidiaNmPolicy.v1_2_0.NvidiaNmPolicy",
"AssociatedDomainID": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0"
},
"ComponentId": "COMP_CPU",
"Id": "0",
"Limit": 800,
"Name": "0",
"PercentageOfDomainBudget": 15,
"Status": {
"State": "Disabled"
}
}
In this example, policy 0 defines the percentage of budget for domain 0. The CPU budget for
both sockets is 800 W, which is equally divided. The PercentageOfDomainBudget field, which
indicates how much of the overall budget will be allocated to the CPUs, shows 15 percent for this
example.
To add a custom policy, use the following template and specify values for the highlighted fields. Cus-
tom domain ID starts from 10.
The engine sums the percentage values and the power values in the provided configuration fields.
Error messages are issued for the following conditions:
▶ Power exceeds the Max value or falls below the Min value of the domain power.
▶ The PercentageOfDomainBudget values add up to over 100 percent.
Template:
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmDomain.NvidiaNmDomain",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0",
"@odata.type": "#NvidiaNmDomain.v1_4_0.NvidiaNmDomain",
"Capabilities": {
"Max": 6000.0000,
"Min": 4000.0000
},
"Id": "0",
"Name": "custom4",
"Status": {
"State": "Enabled"
},
"Policies": {
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPolicyCollection.
,→NvidiaNmPolicyCollection",
"@odata.type": "#NvidiaNmPolicyCollection.NvidiaNmPolicyCollection",
"Members": [
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPolicy.NvidiaNmPolicy",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0∕Policies∕0",
"@odata.type": "#NvidiaNmPolicy.v1_2_0.NvidiaNmPolicy",
"AssociatedDomainID": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0"
},
"ComponentId": "COMP_CPU",
"Id": "0",
"Limit": 500.0000,
"PercentageOfDomainBudget": 15.0000,
"Name": "0"
},
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPolicy.NvidiaNmPolicy",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0∕Policies∕1",
"@odata.type": "#NvidiaNmPolicy.v1_2_0.NvidiaNmPolicy",
"ComponentId": "COMP_MEMORY",
"Id": "0",
"Limit": 500.0000,
"PercentageOfDomainBudget": 15.0000,
"Name": "0"
},
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPolicy.NvidiaNmPolicy",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0∕Policies∕2",
"@odata.type": "#NvidiaNmPolicy.v1_2_0.NvidiaNmPolicy",
"AssociatedDomainID": {
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕0"
},
"ComponentId": "COMP_GPU",
"Id": "0",
"Limit": 5000.0000,
"PercentageOfDomainBudget": 70.0000,
"Name": "0"
}
],
"[email protected]": 3,
"Name": "NvidiaNmPolicyCollection"
Example response:
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmDomain.NvidiaNmDomain",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕21",
"@odata.type": "#NvidiaNmDomain.v1_4_0.NvidiaNmDomain",
"Capabilities": {
"Max": 6000,
"MaxCorrectionTimeInMs": 0,
"MaxStatisticsReportingPeriod": "0",
"Min": 4000,
"MinCorrectionTimeInMs": 0,
"MinStatisticsReportingPeriod": "0"
},
"Id": "21",
"Name": "custom4",
"Policies": {
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPolicyCollection.
,→NvidiaNmPolicyCollection",
"@odata.type": "#NvidiaNmPolicyCollection.NvidiaNmPolicyCollection",
"Members": [
{
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕21∕
,→Policies∕0"
},
{
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕21∕
,→Policies∕1"
},
{
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕Domains∕21∕
,→Policies∕2"
}
],
"Name": "NvidiaNmPolicyCollection"
},
"Status": {
"State": "Enabled"
}
}
▶ To patch custom domain policies, provide only the configuration changes you want to make.
▶ To delete custom domain policies:
curl -k -u <bmc-user>:<password> -X DELETE https://<bmc-ip-address>/redfish/v1/Managers/BMC/NodeManager/Domains/<DomainID>
Example response:
{
"@odata.context": "/redfish/v1/$Metadata#NvidiaNmPSUPolicyCollection.NvidiaNmPSUPolicyCollection",
"@odata.id": "/redfish/v1/Managers/BMC/NvidiaNmPSUPolicyCollection",
"@odata.type": "#NvidiaNmPSUPolicyCollection.NvidiaNmPSUPolicyCollection",
"Members": [
{
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/PSUPolicies/0"
},
{
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/PSUPolicies/1"
},
{
"@odata.id": "/redfish/v1/Managers/BMC/NodeManager/PSUPolicies/2"
}
],
"Members@odata.count": 3,
"Name": "NvidiaNmPSUPolicyCollection"
}
Example response:
{
"@odata.context": "∕redfish∕v1∕$Metadata#NvidiaNmPSUPolicy.NvidiaNmPSUPolicy",
"@odata.id": "∕redfish∕v1∕Managers∕BMC∕NodeManager∕PSUPolicies∕0",
"@odata.type": "#NvidiaNmPSUPolicy.v1_2_0.NvidiaNmPSUPolicy",
"Id": "0",
"LimitMax": 6000,
"MaxPSU": 2,
"MinPSU": 2,
"Name": "Limp",
"Status": {
"State": "Disabled"
}
}
PSU policy 0 defines the number of PSUs and the power that will be allocated to the system with
a maximum of two PSUs.
Example output:
{
"@odata.id": "∕redfish∕v1∕TelemetryService∕MetricReports∕NvidiaNMMetrics_0",
"@odata.type": "#MetricReport.v1_4_2.MetricReport",
"Id": "NvidiaNMMetrics_0",
"MetricReportDefinition": {
"@odata.id": "∕redfish∕v1∕TelemetryService∕MetricReportDefinitions∕
,→NvidiaNMMetrics_0",
"MetricProperties": []
},
"MetricValues": [
{
"MetricId": "dcPlatformPower_avg",
"MetricValue": "2181.00",
"Timestamp": "2024-07-15T18:49:43+00:00"
},
{
"MetricId": "dcPlatformPowerDGX_avg",
"MetricValue": "1444.00",
"Timestamp": "2024-07-15T18:49:43+00:00"
},
{
"MetricId": "dcPlatformPowerHGX_avg",
"MetricValue": "736.00",
"Timestamp": "2024-07-15T18:49:43+00:00"
},
{
"MetricId": "dcPlatformEnergy",
"MetricValue": "2181.00",
"Timestamp": "2024-07-15T18:49:43+00:00"
},
...
{
"MetricId": "gpuPowerCapabilitiesMax_7",
"MetricValue": "700.00",
"Timestamp": "2024-07-15T18:49:43+00:00"
}
],
"Name": "NvidiaNMMetrics_0"
}
This section provides information about how to safely use the NVIDIA DGX™ H100/H200 system.
Indicates shock hazards that result in serious injury or death if safety instructions are not followed.
Shock hazard: The product might be equipped with multiple power cords.
▶ To remove all hazardous voltages, disconnect all power cords.
▶ High leakage current: a ground (earth) connection to the power supply is essential before connecting the supply.
The rail racks are designed to carry only the weight of the server system. Do not use rail-mounted
equipment as a workspace. Do not place additional load onto any rail-mounted equipment.
▶ In regions that are susceptible to electrical storms, we recommend you plug your system into a
surge suppressor and disconnect telecommunication lines to your modem during an electrical
storm.
▶ Provided with a properly grounded wall outlet.
▶ Provided with sufficient space to access the power supply cord(s), because they serve as the
product’s main power disconnect.
Caution
The power button, indicated by the stand-by power marking, DOES NOT completely turn off the
system AC power; standby power is active whenever the system is plugged in. To remove power
from system, you must unplug the AC power cord from the wall outlet. Make sure all AC power
cords are unplugged before you open the chassis, or add or remove any non hot-plug components.
Do not attempt to modify or use an AC power cord if it is not the exact type required. A separate AC
cord is required for each system power supply.
Some power supplies in servers use Neutral Pole Fusing. To avoid risk of shock use caution when
working with power supplies that use Neutral Pole Fusing.
The power supply in this product contains no user-serviceable parts. Do not open the power supply.
Hazardous voltage, current and energy levels are present inside the power supply. Return to manufac-
turer for servicing.
When replacing a hot-plug power supply, unplug the power cord to the power supply being replaced
before removing it from the server.
To avoid risk of electric shock, turn off the server and disconnect the power cords, telecommunications
systems, networks, and modems attached to the server before opening it.
Caution
To avoid electrical shock or fire, check the power cord(s) that will be used with the product as
follows:
▶ Do not attempt to modify or use the AC power cord(s) if they are not the exact type required
to fit into the grounded electrical outlets.
▶ The power cord(s) must meet the following criteria:
▶ The power cord must have an electrical rating that is greater than that of the electrical
current rating marked on the product.
▶ The power cord must have safety ground pin or contact that is suitable for the electrical
outlet.
▶ The power supply cord(s) is/are the main disconnect device to AC power. The socket outlet(s) must be near the equipment and readily accessible for disconnection.
▶ The power supply cord(s) must be plugged into socket-outlet(s) that is/are provided with a suitable earth ground.
Caution
If the server has been running, any installed processor(s) and heat sink(s) may be hot. Unless you
are adding or removing a hot-plug component, allow the system to cool before opening the covers.
To avoid the possibility of coming into contact with hot component(s) during a hot-plug installation,
be careful when removing or installing the hot-plug component(s).
Caution
To avoid injury, do not contact moving fan blades. Your system is supplied with a guard over the fan; do not operate the system without the fan guard in place.
Caution
ESD can damage drives, boards, and other parts. We recommend that you perform all procedures
at an ESD workstation. If one is not available, provide some ESD protection by wearing an antistatic
wrist strap attached to chassis ground (any unpainted metal surface) on your server when handling
parts.
Always handle boards carefully. They can be extremely sensitive to ESD. Hold boards only by their
edges. After removing a board from its protective wrapper or from the server, place the board com-
ponent side up on a grounded, static free surface. Use a conductive foam pad if available but not the
board wrapper. Do not slide board over any surface.
10.10.2. NICKEL
NVIDIA Bezel. The bezel’s decorative metal foam contains some nickel. The metal foam is not intended
for direct and prolonged skin contact. Please use the handles to remove, attach or carry the bezel.
While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you are
susceptible to nickel-related reactions.
Caution
There is the danger of explosion if the battery is incorrectly replaced. When replacing the battery,
use only the battery recommended by the equipment manufacturer.
Dispose of batteries according to local ordinances and regulations. Do not attempt to recharge a
battery.
Do not attempt to disassemble, puncture, or otherwise damage a battery.
Caution
Carefully route cables as directed to minimize airflow blockage and cooling problems. For proper
cooling and airflow, operate the system only with the chassis covers installed.
Operating the system without the covers in place can damage system parts. To install the covers:
▶ Check first to make sure you have not left loose tools or parts inside the system.
▶ Check that cables, add-in cards, and other components are properly installed.
▶ Attach the covers to the chassis according to the product instructions.
The equipment is intended for installation only in a Server Room/ Computer Room where both these
conditions apply:
▶ Access can only be gained by SERVICE PERSONS or by USERS who have been instructed about
the reasons for the restrictions applied to the location and about any precautions that shall be
taken.
▶ Access is through the use of a TOOL or lock and key, or other means of security, and is controlled
by the authority responsible for the location.
The NVIDIA DGX™ H100/H200 Server is compliant with the regulations listed in this section.
11.3. Canada
Innovation, Science and Economic Development Canada (ISED) CAN ICES-3(A)/NMB-3(A)
The Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment
Regulation.
Cet appareil numerique de la class A respecte toutes les exigences du Reglement sur le materiel
brouilleur du Canada.
11.4. CE
European Conformity; Conformité Européenne (CE)
This is a Class A product. In a domestic environment this product may cause radio frequency interfer-
ence in which case the user may be required to take adequate measures.
This device bears the CE mark in accordance with Directive 2014/53/EU. This device complies with the
following Directives:
▶ EMC Directive A, I.T.E Equipment.
▶ Low Voltage Directive for electrical safety.
▶ RoHS Directive for hazardous substances.
▶ Energy-related Products Directive (ErP).
The full text of EU declaration of conformity is available at the following URL: http://www.nvidia.com/
support
A copy of the Declaration of Conformity to the essential requirements may be obtained directly from
NVIDIA GmbH (Bavaria Towers – Blue Tower, Einsteinstrasse 172, D-81677 Munich, Germany).
This product meets the applicable EMC requirements for Class A, I.T.E equipment.
11.6. Brazil
INMETRO
11.7. Japan
Voluntary Control Council for Interference (VCCI)
Class A Equipment (Industrial Broadcasting & Communication Equipment). This equipment is Industrial (Class A) electromagnetic wave suitability equipment, and the seller or user should take notice of it; this equipment is to be used in places other than the home.
11.9. China
China Compulsory Certificate
No certification is needed for China. The NVIDIA DGX H100/H200 system is a server with power con-
sumption greater than 1.3 kW.
11.10. Taiwan
Bureau of Standards, Metrology & Inspection (BSMI)
11.11. Russia/Kazakhstan/Belarus
Customs Union Technical Regulations (CU TR)
This device complies with the technical regulations of the Customs Union (CU TR):
▶ Technical Regulation of the Customs Union "On the safety of low-voltage equipment" (TR CU 004/2011)
▶ Technical Regulation of the Customs Union "Electromagnetic compatibility of technical equipment" (TR CU 020/2011)
▶ Technical Regulation of the Eurasian Economic Union "On the restriction of the use of hazardous substances in electrical and radio-electronic products" (TR EAEU 037/2016)
This device complies with the rules set forth by the Federal Agency of Communications and the Ministry
of Communications and Mass Media.
Federal Security Service notification has been filed.
11.12. Israel
SII
11.13. India
Bureau of Indian Standards (BIS)
Authenticity may be verified by visiting the Bureau of Indian Standards website at http://www.bis.gov.in.
This product, as well as its related consumables and spares, complies with the reduction in hazardous
substances provisions of the “India E-waste (Management and Handling) Rule 2016”. It does not contain
lead, mercury, hexavalent chromium, polybrominated biphenyls, or polybrominated diphenyl ethers in
concentrations exceeding 0.1% by weight, or cadmium in concentrations exceeding 0.01% by weight,
except where allowed pursuant to the exemptions set in Schedule 2 of the Rule.
This NVIDIA product contains third-party software components that are made available to you under
their respective open source software licenses. Some of those licenses also require specific legal
information to be included in the product. This section provides that information.
INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF MTI
HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit
the exclusion or limitation of liability for consequential or incidental damages, the above limitation
may not apply to you.
TERMINATION OF THIS LICENSE: MTI may terminate this license at any time if you are in breach of
any of the terms of this Agreement. Upon termination, you will immediately destroy all copies of the
Software.
GENERAL: This Agreement constitutes the entire agreement between MTI and you regarding the sub-
ject matter hereof and supersedes all previous oral or written communications between the parties.
This Agreement shall be governed by the laws of the State of Idaho without regard to its conflict of
laws rules.
CONTACT: If you have any questions about the terms of this Agreement, please contact MTI’s legal
department at (208) 368-4500. By proceeding with the installation of the Software, you agree to the
terms of this Agreement. You must agree to the terms in order to install and use the Software.
13.1. Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a
certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no repre-
sentations or warranties, expressed or implied, as to the accuracy or completeness of the information
contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall
have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to
develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any
other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that
such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the
time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by
authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects
to applying any customer general terms and conditions with regards to the purchase of the NVIDIA
product referenced in this document. No contractual obligations are formed either directly or indirectly
by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military,
aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA
product can reasonably be expected to result in personal injury, death, or property or environmental
damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or
applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for
any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA.
It is customer’s sole responsibility to evaluate and determine the applicability of any information con-
tained in this document, ensure the product is suitable and fit for the application planned by customer,
and perform the necessary testing for the application in order to avoid a default of the application or
the product. Weaknesses in customer’s product designs may affect the quality and reliability of the
NVIDIA product and may result in additional or different conditions and/or requirements beyond those
contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or prob-
lem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is
contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other
NVIDIA intellectual property right under this document. Information published by NVIDIA regarding
third-party products or services does not constitute a license from NVIDIA to use such products or
services or a warranty or endorsement thereof. Use of such information may require a license from a
third party under the patents or other intellectual property rights of the third party, or a license from
NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA
in writing, reproduced without alteration and in full compliance with all applicable export laws and
regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE
BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WAR-
RANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CON-
SEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARIS-
ING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatso-
ever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein
shall be limited in accordance with the Terms of Sale for the product.
13.2. Trademarks
NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, DGX A100, DGX H100, DGX H200, DGX Station, and
DGX Station A100 are trademarks and/or registered trademarks of NVIDIA Corporation in the United
States and other countries. Other company and product names may be trademarks of the respective
companies with which they are associated.
Copyright
©2022-2024, NVIDIA Corporation