Releases: AI-Hypercomputer/xpk
v0.14.0
What's Changed
New Features
- A4X support by @scaliby in #643
- Merge main to develop by @FIoannides in #657
- Feat: Add --skip-validation flag to bypass system dependency checks by @RexBearIU in #665
- Allow use of zonal clusters by resolving actual cluster region by @scaliby in #682
- feat: Add --sub-slicing flag to cluster create by @jamOne- in #689
- feat: Add sub_slicing_support to SystemCharacteristics by @jamOne- in #695
- feat: Bump up the default Kueue version to v0.14.1 by @jamOne- in #691
- Subslicing workload annotations by @scaliby in #712
- Release 0.14 by @FIoannides in #721
Bug fixes
- Update supported platforms by @scaliby in #663
- Pass autoprovisioning_args to Pathways workload yaml by @wstcliyu in #664
- Use map NodeSelector instead of string for Pathways workload by @wstcliyu in #669
- fix: Don't check flex in is_TAS_possible by @jamOne- in #675
- Update PathwaysJob version to v0.1.3 by @wstcliyu in #683
- Fix plain XPK command execution by @scaliby in #684
- No kueue admission checks for FSNQ by @scaliby in #686
New Contributors
- @wstcliyu made their first contribution in #664
- @RexBearIU made their first contribution in #665
Full Changelog: v0.13.0...v0.14.0
v0.13.0
Full Changelog: v0.12.0...v0.13.0
v0.12.0
What's Changed
New Features
- Add TAS support for workloads on DWS clusters by @FIoannides in #615
- feat: Add link to GCP Workloads list under xpk workload list by @jamOne- in #621
- feat: Make workloads link navigate to aiml jobsets list by @jamOne- in #624
- Increase kubectl wait times by @scaliby in #625
- Add pathways + nap support by @scaliby in #623
Bug fixes
- Fix error when updating a cluster that already has JobSet by @lukebaumann in #622
- Don't set --enable-queued-provisioning for flex multislice by @SikaGrr in #631
New Contributors
- @SikaGrr made their first contribution in #607
- @FIoannides made their first contribution in #615
- @jamOne- made their first contribution in #621
Full Changelog: v0.11.0...v0.12.0
v0.11.0
What's Changed
New Features
- "Select TPU by topology (#525)" + Fix errors by @sharabiani in #563
- feat: Added an update to CoreDNS by @DannyLiCom in #530
- feat: Added an update to CoreDNS by @DannyLiCom in #501
- Add tpu7x support by @scaliby in #586
- Release v0.11.0 by @scaliby in #613
Bug fixes
- Fix max-nodes when creating flex queued nodepool of tpus by @pawloch00 in #541
- Fix kueue version in yaml string and loosen dependency on cloud-storage by @pawloch00 in #546
- Remove RBAC container by @pawloch00 in #547
- Fix kjob.py pyink by @pawloch00 in #552
- Update Kueue to create Visibility folder by @SujeethJinesh in #556
- Update CPU limits to 750m by @SujeethJinesh in #558
- Update CPU limit for medium to large scale clusters by @SujeethJinesh in #571
- Fix cluster create when --enable-autoprovisioning is supplied by @scaliby in #589
- Fix provisioning 1x1 and 1x1x1 topologies by @scaliby in #595
- Reorder custom_nodepool_arguments for node-pool create command by @scaliby in #596
- NAP memory and cpu limit increased by @sharabiani in #597
- fix: only install JQ when not installed by @samos123 in #601
- Fix custom nodepool arguments append by @scaliby in #602
- Fix nodepool creation by @scaliby in #606
New Contributors
- @DannyLiCom made their first contribution in #530
- @scaliby made their first contribution in #581
- @samos123 made their first contribution in #601
Full Changelog: v0.10.1...v0.11.0
Release v0.10.1
v0.10.0
Highlights
DWS Flex support for GPUs and TPUs
Managed Lustre storage attach support
What's Changed
New Features
- Update PathwaysJob version to v0.1.2 by @RoshaniN in #507
- Update Cluster Toolkit Version by @pawloch00 in #503
- Managed Lustre storage attach support implemented by @sharabiani in #534
- Implement DWS for GPUs and TPUs by @pawloch00 in #467
Bug fixes
- Fix issue in control_plane_endpoints_config.dns_endpoint_config.allow… by @SujeethJinesh in #499
- Fix broken A3 High workloads by @gcie in #494
- Bring back shared_memory volume for A3 Mega and A3 High by @gcie in #512
- Provided the required permissions for JAX to list the pods by @sharabiani in #509
- fix the incorrect number of chips per VM for v5litepod-8 by @gcie in #513
- Update Kueue and Jobset controller default limit value by @ycchenzheng in #502
- Fix cluster creation from reservation by @pawloch00 in #522
New Contributors
- @ycchenzheng made their first contribution in #502
Full Changelog: v0.9.0...v0.10.0
v0.9.0
Highlights
GPUDirect-TCPX support for H100 accelerator (A3-High VMs)
A command to adapt a cluster to XPK expected config (xpk cluster adapt)
DWS Calendar Mode Reservations
What's Changed
New Features
- GPUDirect-TCPX support for H100 by @gcie in #459
- Add Multi-tier checkpointing support in XPK Cluster Creation by @abhinavclemson in #465
- Add Jobset controller patching for MTC cluster by @abhinavclemson in #475
- xpk cluster adapt by @gcie in #466
Bug fixes
- Merge main to develop by @gcie in #458
- Update pathways.py with worker component type. by @RoshaniN in #456
- Fix error when xpk storage attach --type=gcpfilestore without --mount-options by @gcie in #463
- Update README.md - text edit in Advanced usage section by @kzmyslona in #473
- Update PathwaysJob Version to v0.1.1 To Fix RM OOM by @SujeethJinesh in #477
- Placement Policy removed from A3-Mega blueprints with --spot by @sharabiani in #478
- Enable DNS Access to Prevent Connection Timeout Errors by @SujeethJinesh in #483
- Fix DWS Calendar Mode Reservations for A3 Mega by @gcie in #484
New Contributors
- @kzmyslona made their first contribution in #468
Full Changelog: v0.8.0...v0.9.0
v0.8.0
Highlights
- Support for provisioning A4 GKE clusters
- PathwaysJob integration
- Storage support (Parallelstore and Hyperdisk)
What's Changed
New Features
- Add the option to use Multi-tier checkpointing in workloads by @abhinavclemson in #447
- Integrate PathwaysJob into XPK. by @RoshaniN in #448
- Implement Parallelstore and Hyperdisk storages attach by @sharabiani in #436
- A4 support for prod by @gcie in #412
- Add --mount-options parameter to xpk storage attach/create by @gcie in #450
Bug fixes
- Update JOBSET_VERSION from 0.7.2 to 0.8.0 by @SujeethJinesh in #425
- fix yaml alignment when remote-python-sidecar-image is passed by @sadikneipp in #426
- Bring back manual manifest specification for attaching storage by @gcie in #427
- Fix XPK version in Pypi release by @sharabiani in #428
- Remove sudo requirement from make by @sharabiani in #435
- fix: workloads not scheduling on A3 Ultra clusters by @gcie in #441
- Disable creating additional networks for L4 and A2 clusters by @gcie in #444
- Fix xpk workload create for L4 and A100 by @gcie in #452
Full Changelog: v0.7.2...v0.8.0
v0.7.2
What's Changed
Bug fixes
- Remove sudo requirement from make by @sharabiani in #435
Full Changelog: v0.7.1...v0.7.2
v0.7.1
What's Changed
Bug fixes
- fix yaml alignment when remote-python-sidecar-image is passed by @sadikneipp in #426
- Bring back manual manifest specification for attaching storage by @gcie in #427
- Fix XPK version in Pypi release by @sharabiani in #428