0.19.31
Kubernetes
The kubernetes backend introduces many significant improvements and has now graduated from alpha to beta. It is much more stable and can be reliably used on GPU clusters for all kinds of workloads, including distributed tasks.
Here's what changed:
- Resource allocation now fully respects the user’s resources specification. Previously, it ignored certain aspects, especially the proper selection of GPU labels according to the specified gpu spec.
- Distributed tasks now fully work on Kubernetes clusters with fast interconnect enabled. Previously, this caused many issues.
- Added support for privileged.
We’ve also published a dedicated guide on how to get started with dstack on Kubernetes, highlighting important nuances.
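To illustrate the changes listed above, a distributed task targeting the kubernetes backend might be written as follows. This is a minimal sketch rather than a configuration from the release: the task name, command, GPU spec, and shm_size value are hypothetical placeholders, and it assumes the usual dstack task properties (nodes, privileged, resources).

```yaml
type: task
name: k8s-distributed-example   # hypothetical name
nodes: 2                        # distributed task across two nodes
privileged: true                # request a privileged container
backends: [kubernetes]
commands:
  - nvidia-smi                  # placeholder command
resources:
  gpu: H100:8                   # hypothetical GPU spec; use one available in your cluster
  shm_size: 16GB                # hypothetical size; /dev/shm is configured only if requested
```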
Warning
Be aware of breaking changes if you used the kubernetes backend before. The following properties in the Kubernetes backend configuration have been renamed:
- networking → proxy_jump
- ssh_host → hostname
- ssh_port → port
Additionally, the "proxy jump" pod and service names now include a dstack- prefix.
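As a rough before-and-after sketch of the rename, the backend entry in ~/.dstack/server/config.yml might change as shown below. The nesting of ssh_host and ssh_port under networking is an assumption, and the kubeconfig path, address, and port are hypothetical values.

```yaml
# Before 0.19.31 (old property names; nesting assumed)
- type: kubernetes
  kubeconfig:
    filename: ~/.kube/config   # hypothetical path
  networking:
    ssh_host: 203.0.113.10     # hypothetical address
    ssh_port: 32000            # hypothetical port

# From 0.19.31 (renamed properties)
- type: kubernetes
  kubeconfig:
    filename: ~/.kube/config
  proxy_jump:
    hostname: 203.0.113.10
    port: 32000
```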
GCP
A4 spot instances with B200 GPUs
The gcp backend now supports A4 spot instances equipped with B200 GPUs. This includes provisioning both standalone A4 instances and A4 clusters with high-performance RoCE networking.
To use A4 clusters with high-performance networking, you must configure multiple VPCs in your backend settings (~/.dstack/server/config.yml):
projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0   # regular, 1 subnet
    extra_vpcs:
    - my-vpc-1   # regular, 1 subnet
    roce_vpcs:
    - my-vpc-mrdma   # RoCE profile, 8 subnets

Then, provision a cluster using a fleet configuration:
type: fleet
nodes: 2
placement: cluster
availability_zones: [us-west2-c]
backends: [gcp]
spot_policy: spot
resources:
  gpu: B200:8

Each instance in the cluster will have 10 network interfaces: 1 regular interface in the main VPC, 1 regular interface in the extra VPC, and 8 RDMA interfaces in the RoCE VPC.
Note
Currently, the gcp backend only supports A4 spot instances. Support for other options, such as flex and calendar scheduling via Dynamic Workload Scheduler, is coming soon.
CLI
dstack project is now faster
The USER column in dstack project list is now shown only when the --verbose flag is used.
This significantly improves performance for users with many configured projects, reducing execution time from ~20 seconds to as little as 2 seconds in some cases.
What's changed
- [Kubernetes] Request resources according to RequirementsSpec by @un-def in #3127
- [GCP] Support A4 spot instances with the B200 GPU by @jvstme in #3100
- [CLI] Move USER to dstack project list --verbose by @jvstme in #3134
- [Kubernetes] Configure /dev/shm if requested by @un-def in #3135
- [Backward incompatible] Rename properties in Kubernetes backend config by @un-def in #3137
- Support GCP A4 clusters by @jvstme in #3142
- Kubernetes: add multi-node support by @un-def in #3141
- Fix duplicate server log messages by @jvstme in #3143
- [Docs] Improve Kubernetes documentation by @peterschmidt85 in #3138
Full changelog: 0.19.30...0.19.31