To create the infrastructure for tightly-coupled applications that scale across multiple nodes, you can create a cluster of virtual machine (VM) instances. This guide provides a high-level overview of the key considerations and steps for configuring such a cluster for high performance computing (HPC) workloads using dense resource allocation.
With H4D, Compute Engine adds support for running massive HPC workloads by treating an entire cluster of VM instances as a single computer. Using topology-aware placement of VMs lets you access many instances within a single networking superblock and minimizes network latency. You can also configure Cloud RDMA on these instances to maximize inter-node communication performance, which is crucial for tightly-coupled HPC workloads.
You create these HPC VM clusters with H4D by reserving blocks of capacity instead of individual resources. Using blocks of capacity for your cluster enables enhanced cluster management capabilities.
You can create HPC clusters with H4D instances either with or without enhanced cluster management capabilities. If you don't require enhanced cluster management capabilities with your H4D HPC cluster, or if you want to create HPC clusters using a machine series other than H4D, then use the following instructions for creating HPC instances or clusters:
Cluster terminology
When working with blocks of capacity, the following terms are used:
Overview of cluster creation process with H4D VMs
To create HPC clusters on reserved blocks of capacity, you must complete the following steps:
- Review available provisioning models
- Choose a consumption option and obtain capacity
- Choose a deployment option and orchestrator
- Choose the operating system or cluster image
- Create your cluster
Provisioning models for VM and cluster creation
When creating VM instances, you can use the provisioning models described in Compute Engine instances provisioning models.
To create tightly-coupled H4D instances, you must use one of the following provisioning models to obtain the necessary resources for creating compute instances:
- Reservation-bound: you can reserve resources at a discounted price for a future date and duration. At the start of your reservation period, you can use the reserved resources to create VMs or clusters. You have exclusive access to your reserved resources for the reservation period.
- Flex-start: you can request discounted resources for up to seven days. Compute Engine makes best-effort attempts to schedule the provisioning of your requested resources as soon as they're available. You have exclusive access to your obtained resources for your requested period.
- Spot: based on availability, you can immediately obtain deeply discounted resources. However, Compute Engine might stop or delete the VM instances at any time to reclaim capacity.
Reservation-bound provisioning model
The reservation-bound provisioning model links your created VM instances to the capacity that you previously reserved. When you reserve capacity, Compute Engine creates an empty reservation. Then, at the reservation start time, the following occurs:
- Compute Engine adds your reserved resources to the reservation. You have exclusive access to the reserved capacity until the reservation end time.
- Google Cloud charges you for the reserved capacity until the end of your reservation period, whether you use the capacity or not.
- You can then use the reserved resources to create VMs without additional charges. You only pay for resources that aren't included in the reservation, such as disks or IP addresses.
You can reserve resources for any number of VMs and for any duration, starting at a future date. You can then use the reserved resources to create and run VMs until the end of the reservation period. If you reserve resources for one year or longer, then you must purchase and attach a resource-based commitment.
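The billing behavior described in this section can be sketched as a small calculation. This is an illustration only: the `Reservation` type, the function names, the hourly rate, and the choice to treat "one year" as 365 days are assumptions made for the example, not Compute Engine pricing or API behavior.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 24
# Assumption: "one year" is treated as 365 days for this sketch.
COMMITMENT_THRESHOLD_DAYS = 365


@dataclass
class Reservation:
    vm_count: int        # number of VMs' worth of reserved capacity
    duration_days: int   # length of the reservation period
    hourly_rate: float   # hypothetical price per reserved VM per hour


def reserved_capacity_charge(res: Reservation) -> float:
    """Charge for the whole reservation period, whether or not VMs run."""
    return res.vm_count * res.duration_days * HOURS_PER_DAY * res.hourly_rate


def needs_commitment(res: Reservation) -> bool:
    """Reservations of one year or longer need a resource-based commitment."""
    return res.duration_days >= COMMITMENT_THRESHOLD_DAYS


example = Reservation(vm_count=16, duration_days=30, hourly_rate=2.0)
charge = reserved_capacity_charge(example)
print(charge, needs_commitment(example))  # prints: 23040.0 False
```

The charge doesn't depend on how many VMs you actually create, which is why the function takes no usage input. Resources outside the reservation, such as disks or IP addresses, would be billed separately and aren't modeled here.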
To provision resources using the reservation-bound provisioning model, see:
- For long-running, large-scale distributed workloads with densely allocated resources: Reserve capacity through your account team
- For short-running (up to 90 days) distributed workloads with densely allocated resources: Future reservation requests in calendar mode
You can use reservation-bound provisioning with H4D instances by specifying the reservation-bound provisioning model when creating individual VMs, an HPC cluster, or a group of VMs.
Flex-start provisioning model
To run short-duration workloads that require densely allocated resources, you can request compute resources for up to seven days by using Flex-start. Whenever resources are available, Compute Engine creates your requested number of VMs. You can stop standalone Flex-start VMs, but you can't stop Flex-start VMs that a managed instance group (MIG) creates through resize requests. The Flex-start VMs exist until you delete them, or until Compute Engine deletes the VMs at the end of their run duration.
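As a rough sketch of how a Flex-start request might be validated before you submit it, the following helper builds a scheduling fragment and enforces the seven-day limit described above. The helper name is hypothetical, and the field names (`provisioningModel`, `maxRunDuration`, `instanceTerminationAction`) reflect our understanding of the `scheduling` object in the Compute Engine API; treat them as assumptions and verify them against the API reference before use.

```python
from datetime import timedelta

# Limit stated in this guide: Flex-start requests run for up to seven days.
MAX_FLEX_START_RUN = timedelta(days=7)


def flex_start_scheduling(run_duration: timedelta) -> dict:
    """Build an illustrative scheduling fragment for a Flex-start VM."""
    if run_duration <= timedelta(0):
        raise ValueError("run duration must be positive")
    if run_duration > MAX_FLEX_START_RUN:
        raise ValueError("Flex-start requests are limited to seven days")
    return {
        # Assumed field names; check the Compute Engine API reference.
        "provisioningModel": "FLEX_START",
        "maxRunDuration": {"seconds": int(run_duration.total_seconds())},
        # The VM is deleted when its run duration ends.
        "instanceTerminationAction": "DELETE",
    }


fragment = flex_start_scheduling(timedelta(days=2))
print(fragment["maxRunDuration"])  # prints: {'seconds': 172800}
```

Rejecting an over-long duration locally is a convenience only; Compute Engine remains the authority on which requests are valid.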
Flex-start is ideal for workloads that can start at any time. The Flex-start provisioning model provisions resources from a secure capacity pool, so the resources are densely allocated to minimize network latency.
When you add Flex-start VMs to a managed instance group (MIG) by using resize requests, the MIG creates the VMs all at once. This approach helps you avoid unnecessary charges for partial capacity that Compute Engine might deliver while you wait for the full capacity needed to start your workload.
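To see why all-at-once creation avoids charges for partial capacity, consider a tightly-coupled workload that can only start when every VM is present. The sketch below counts the VM-hours that would sit idle if VMs were delivered one by one. The function name and the arrival times are invented for illustration and don't represent real provisioning behavior.

```python
def idle_vm_hours(arrival_hours: list[float]) -> float:
    """VM-hours spent waiting before the workload can start.

    arrival_hours[i] is when VM i becomes available. The workload starts
    once the last VM arrives, so every earlier VM idles until then.
    """
    if not arrival_hours:
        return 0.0
    start = max(arrival_hours)
    return sum(start - t for t in arrival_hours)


# Hypothetical arrival times (hours after the request) for four VMs.
incremental = [0.0, 1.0, 3.0, 6.0]
# With an all-at-once resize request, every VM appears at the same time.
all_at_once = [6.0, 6.0, 6.0, 6.0]

print(idle_vm_hours(incremental))  # prints: 14.0
print(idle_vm_hours(all_at_once))  # prints: 0.0
```

In this made-up scenario, incremental delivery would accrue 14 idle VM-hours before any work begins, while the all-at-once request accrues none.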
You can use Flex-start provisioning with H4D instances, using any available deployment model.
Spot provisioning model
To run fault-tolerant workloads, you can obtain compute resources immediately based on availability. You get resources at the lowest price possible. However, Compute Engine might stop or delete the created Spot VMs at any time to reclaim capacity. This process is called preemption.
Spot VMs are ideal for workloads where interruptions are acceptable, such as:
- Batch processing
- High performance computing (HPC)
- Data analytics
- Continuous integration and continuous deployment (CI/CD)
- Media encoding
You can use Spot VMs with any machine type, except A4X, X4, and bare metal machine types. Dense allocation depends on resource availability. To help ensure a closer allocation, you can apply a compact placement policy to the Spot VMs.
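The exclusion rule above can be expressed as a simple check. This sketch assumes that machine type names follow a `<series>-<class>-<size>` pattern and that bare metal types end in `-metal`; both are naming assumptions for the example, and the machine type names used as inputs are illustrative, so confirm them against the current machine type list.

```python
# Machine series that this guide lists as unavailable for Spot VMs.
SPOT_EXCLUDED_SERIES = {"a4x", "x4"}


def supports_spot(machine_type: str) -> bool:
    """Return True if the machine type can use the Spot provisioning model.

    Assumes names look like "<series>-<class>-<size>" and that bare metal
    machine types end in "-metal".
    """
    name = machine_type.strip().lower()
    series = name.split("-", 1)[0]
    if series in SPOT_EXCLUDED_SERIES:
        return False
    return not name.endswith("-metal")


for candidate in ("h4d-standard-192", "a4x-highgpu-4g", "c3-standard-192-metal"):
    print(candidate, supports_spot(candidate))
```

A check like this only answers whether Spot is allowed for a machine type. It says nothing about availability or about how densely the VMs are placed, which still depends on capacity and on any compact placement policy that you apply.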
You can use Spot VMs with the following dense deployment options:
- Create an HPC Slurm cluster with H4D
- Bulk create HPC-optimized instances with H4D
- Create an HPC MIG with H4D machine series
Choose a consumption option and obtain capacity
Consumption options determine how resources are obtained for your cluster. To create a cluster that uses enhanced cluster management capabilities, you must request blocks of capacity for a dense deployment.
The following table summarizes the key differences between the consumption options for blocks of capacity:
| Consumption option | Future reservations for capacity blocks | Future reservations for up to 90 days (in calendar mode) | Flex-start | Spot |
|---|---|---|---|---|
| Workload characteristics | Long-running, large-scale distributed workloads that require densely allocated resources | Short-duration workloads that require densely allocated resources | Short-duration workloads that require densely allocated resources | Fault-tolerant workloads |
| Lifespan | Any time | Up to 90 days | Up to 7 days | Any time, but subject to preemption |
| Preemptible | No | No | No | Yes |
| Capacity assurance | Very high | Very high | Best effort | Best effort |
| Quota | Check that you have enough quota before creating instances. | No quota is charged | Preemptible quota is charged. | Preemptible quota is charged. |
| Pricing | | | | |
| Resource allocation | Dense | Dense | Dense | Standard (Compact placement policy optional) |
| Provisioning model | Reservation-bound | Reservation-bound | Flex-start | Spot |
| Creation method | To create HPC clusters and VMs, you must do the following: | To create HPC clusters and VMs, you must do the following: | To create VMs, select one of the following options: When your requested capacity becomes available, Compute Engine provisions it. | You can immediately create VMs. See Choose a deployment option. |
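The consumption options compared in the table above can be turned into a first-pass selection rule. The following sketch is a simplification under stated assumptions: the function name, its parameters, and the order in which the rules are checked are ours, and only the 7-day and 90-day limits come from the table. Real decisions also depend on pricing, quota, and capacity assurance.

```python
def choose_consumption_option(duration_days: float,
                              fault_tolerant: bool,
                              flexible_start: bool) -> str:
    """Suggest a consumption option from coarse workload characteristics."""
    if duration_days <= 0:
        raise ValueError("duration_days must be positive")
    # Workloads that tolerate preemption can use the lowest-priced option.
    if fault_tolerant:
        return "Spot"
    # Flex-start covers up to 7 days, for workloads that can start anytime.
    if duration_days <= 7 and flexible_start:
        return "Flex-start"
    # Calendar-mode future reservations cover up to 90 days.
    if duration_days <= 90:
        return "Future reservations for up to 90 days (in calendar mode)"
    # Longer workloads use future reservations for capacity blocks.
    return "Future reservations for capacity blocks"


print(choose_consumption_option(3, fault_tolerant=False, flexible_start=True))
print(choose_consumption_option(45, fault_tolerant=False, flexible_start=False))
print(choose_consumption_option(200, fault_tolerant=False, flexible_start=False))
```

Checking fault tolerance first is a design choice for this example. A fault-tolerant workload might still be better served by a reservation if it needs dense allocation, because Spot VMs use standard allocation unless you add a compact placement policy.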
Choose a deployment option
High performance computing (HPC) workloads aggregate computing resources to gain performance greater than that of a single workstation, server, or computer. HPC is used to solve problems in academic research, science, design, simulation, and business intelligence.
For HPC clusters with enhanced cluster management capabilities, choose the H4D machine series. If you plan to use a different machine series, follow the documentation at Create an HPC-ready VM instance instead of using the deployment methods listed on this page.
Some of the available deployment options include the installation and configuration of an orchestrator for enhanced management of the HPC cluster.
For the most appropriate option to create your VMs or clusters for your use case, choose one of the following:
| Option | Use case |
|---|---|
| Cluster Toolkit | You want to use open-source software that simplifies the process for you to deploy both Slurm and Google Kubernetes Engine (GKE) clusters. Cluster Toolkit is designed to be highly customizable and extensible. To learn more, see the following: |
| GKE | You want maximum flexibility in configuring your Google Kubernetes Engine cluster based on the needs of your workload. To learn more, see Run HPC workloads with H4D. |
| Use Compute Engine | You want full control of the infrastructure layer so that you can set up your own orchestrator. To learn more, see the following: |
Choose the operating system image
The operating system (OS) image you choose depends on the service you use to deploy your cluster.
- For clusters on GKE: Use a GKE node image, such as Container-Optimized OS. If you use Cluster Toolkit to deploy your GKE cluster, a Container-Optimized OS image is used by default. For more information about node images, see Node images in the GKE documentation.
- For clusters on Compute Engine: You can use one of the following images:
  - HPC VM image: A Rocky Linux 8 image that is optimized for tightly-coupled HPC workloads.
  - OS image provided by Google Cloud: OS images that support H4D. You need to configure these images for your HPC workloads.
  - Custom images: You can create and use your own custom images. To include HPC-specific optimizations, we recommend that you create a custom image using the HPC VM image.
- For Slurm clusters: Cluster Toolkit deploys the Slurm cluster with an HPC VM image based on Rocky Linux 8 that is optimized for tightly-coupled HPC workloads.
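If you script your cluster setup, the image choices above can be captured in a small lookup table. The keys (`gke`, `compute-engine`, `slurm`) and the function name are hypothetical labels chosen for this sketch, and the values are descriptions taken from this section rather than actual image names or families.

```python
# Image choices per deployment service, as described in this section.
# Values are human-readable descriptions, not real image identifiers.
IMAGE_OPTIONS = {
    "gke": ["GKE node image, such as Container-Optimized OS"],
    "compute-engine": [
        "HPC VM image (Rocky Linux 8)",
        "OS image provided by Google Cloud that supports H4D",
        "Custom image",
    ],
    "slurm": ["HPC VM image (Rocky Linux 8)"],
}


def image_options(service: str) -> list[str]:
    """Return the image choices for a deployment service."""
    key = service.strip().lower()
    if key not in IMAGE_OPTIONS:
        raise ValueError(f"unknown deployment service: {service!r}")
    # Return a copy so callers can't modify the lookup table.
    return list(IMAGE_OPTIONS[key])


print(image_options("Compute-Engine"))
```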
Create your HPC cluster
After you review the cluster creation process and make preliminary decisions for your workload, create your cluster by using any of the deployment options.
Enhanced cluster management capabilities for your HPC cluster
When you create H4D instances with densely allocated resources using the deployment methods mentioned in Choose a deployment option, you can use enhanced HPC cluster management capabilities with your instances.
For more information about these capabilities, see Enhanced HPC cluster management with H4D instances.
What's next
- Learn more about Cluster Toolkit.
- Try the Quickstart tutorial Deploy an HPC cluster with Slurm.
- Review best practices for running HPC workloads.