0.19.31

@un-def

Kubernetes

The kubernetes backend has received many significant improvements and has graduated from alpha to beta. It is now much more stable and can be used reliably on GPU clusters for all kinds of workloads, including distributed tasks.

Here's what changed:

Resource allocation now fully respects the user’s resources specification. Previously, some parts of the specification were ignored, most notably the selection of GPU labels matching the specified gpu spec.
Distributed tasks now fully work on Kubernetes clusters with fast interconnect enabled. Previously, this caused many issues.
Added support for `privileged`.
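
As a rough illustration of the points above, a distributed task on a Kubernetes cluster might be configured along these lines. This is only a sketch: the task name, the command, the GPU model and count, and the shm_size value are placeholder assumptions, not values taken from this release.

```yaml
type: task
# Hypothetical name, for illustration only
name: k8s-distrib-check

# Run the task on two nodes (distributed task)
nodes: 2

# Request a privileged container
privileged: true

commands:
  - nvidia-smi

resources:
  # Placeholder GPU spec; used to select nodes with matching GPU labels
  gpu: H100:8
  # Placeholder size for /dev/shm
  shm_size: 16GB
```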

We’ve also published a dedicated guide on how to get started with dstack on Kubernetes, highlighting important nuances.

Warning

Be aware of breaking changes if you used the kubernetes backend before. The following properties in the Kubernetes backend configuration have been renamed:

networking → proxy_jump
ssh_host → hostname
ssh_port → port

Additionally, the "proxy jump" pod and service names now include a dstack- prefix.
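
For an existing configuration, the rename might look like the sketch below. It assumes that ssh_host and ssh_port were nested under networking, and that hostname and port are nested under proxy_jump in the same way. The kubeconfig path, address, and port are placeholders.

```yaml
projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      filename: ~/.kube/config   # placeholder path
    # Before 0.19.31 (old property names):
    # networking:
    #   ssh_host: 203.0.113.10
    #   ssh_port: 32000
    # From 0.19.31 (renamed properties):
    proxy_jump:
      hostname: 203.0.113.10   # placeholder address
      port: 32000              # placeholder port
```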

GCP

A4 spot instances with B200 GPUs

The gcp backend now supports A4 spot instances equipped with B200 GPUs. This includes provisioning both standalone A4 instances and A4 clusters with high-performance RoCE networking.

To use A4 clusters with high-performance networking, you must configure multiple VPCs in your backend settings (~/.dstack/server/config.yml):

projects:
- name: main
  backends:
  - type: gcp
    project_id: my-project
    creds:
      type: default
    vpc_name: my-vpc-0   # regular, 1 subnet
    extra_vpcs:
    - my-vpc-1   # regular, 1 subnet
    roce_vpcs:
    - my-vpc-mrdma   # RoCE profile, 8 subnets

Then, provision a cluster using a fleet configuration:

type: fleet

nodes: 2
placement: cluster

availability_zones: [us-west2-c]
backends: [gcp]

spot_policy: spot

resources:
  gpu: B200:8

Each instance in the cluster will have 10 network interfaces: 1 regular interface in the main VPC, 1 regular interface in the extra VPC, and 8 RDMA interfaces in the RoCE VPC.

Note

Currently, the gcp backend supports A4 instances only as spot instances. Support for other provisioning options, such as flex and calendar scheduling via Dynamic Workload Scheduler, is coming soon.

CLI

`dstack project` is now faster

The USER column in dstack project list is now shown only when the --verbose flag is used.
This significantly improves performance for users with many configured projects, reducing execution time from ~20 seconds to as little as 2 seconds in some cases.

What's changed

[Kubernetes] Request resources according to RequirementsSpec by @un-def in #3127
[GCP] Support A4 spot instances with the B200 GPU by @jvstme in #3100
[CLI] Move USER to dstack project list --verbose by @jvstme in #3134
[Kubernetes] Configure /dev/shm if requested by @un-def in #3135
[Backward incompatible] Rename properties in Kubernetes backend config by @un-def in #3137
Support GCP A4 clusters by @jvstme in #3142
Kubernetes: add multi-node support by @un-def in #3141
Fix duplicate server log messages by @jvstme in #3143
[Docs] Improve Kubernetes documentation by @peterschmidt85 in #3138

Full changelog: 0.19.30...0.19.31
