Slurm in the Clouds
Nick Ihli
SchedMD
Copyright 2021 SchedMD LLC
https://schedmd.com
Slurm User Group Meeting 2021
Agenda
All times are US Mountain Daylight (UTC-6)
Time Speaker Title
9:00 - 9:50 Jason Booth Field Notes 5: From The Frontlines of Slurm Support
10:00 - 10:25 Nate Rini REST API and also Containers
10:30 - 10:50 Marshall Garey burst_buffer/lua and slurmscriptd
11:00 - 11:25 Nick Ihli Slurm in the Clouds
11:30 - 11:50 Tim Wickberg Slurm 21.08 and Beyond
Welcome
● Five separate presentations, five separate streams
● Presentations will remain available for at least two weeks after SLUG'21 concludes
● Presentations are available through the SchedMD Slurm YouTube channel
○ https://youtube.com/c/schedmdslurm
● Or through direct links from the agenda
○ https://slurm.schedmd.com/slurm_ug_agenda.html
Asking questions
● Feel free to ask questions throughout via YouTube's chat
● Chat is moderated by SchedMD staff
○ Tim McMullan, Ben Roberts, and Tim Wickberg
○ Also identified by the little wrench symbol next to their name
● Questions will be relayed to the presenter by the moderators
○ Some may be deferred to the end if they cannot be relayed in a timely fashion
○ Or some may be answered by the moderators in chat directly
Slurm in the Clouds
Nick Ihli
SchedMD
● New Power Save/Cloud-related features in 21.08
● Update on Slurm in public clouds
PowerSave/Cloud changes
● SlurmctldParameters=node_reg_mem_percent
○ Allows a node to register with as little as the given percentage of its configured memory.
○ Defaults:
■ 90% for cloud nodes
■ 100% for everything else
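A minimal slurm.conf sketch (the node names and the 90 are illustrative values, not recommendations):

    SlurmctldParameters=node_reg_mem_percent=90
    NodeName=cloud[1-10] CPUs=8 RealMemory=32000 State=CLOUD

With this, a cloud node reporting at least 90% of its configured RealMemory registers successfully rather than being marked invalid.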
Power State Transition - Resume
(Diagram) Job state: Configuring → Running → Completing. Node state: IDLE + POWERED_OFF (~) → ALLOCATED/MIXED + POWERING_ON (#) while the job configures → ALLOCATED/MIXED once the node registers and the job runs.
Power State Transition - Resume Failure
(Diagram) Job state: Configuring → Requeue. Node state: IDLE + POWERED_OFF (~) → ALLOCATED/MIXED + POWERING_ON (#); if the node fails to register before ResumeTimeout expires, it is marked DOWN, returns to POWERED_OFF (~), and the job is requeued.
Power State Transition - Suspend
(Diagram) Node state: a node that stays IDLE for SuspendTime becomes IDLE + POWERING_DOWN (%), and after SuspendTimeout it is IDLE + POWERED_OFF (~).
PowerSave/Cloud changes
● sinfo/sview/scontrol state == base_state + flags
○ IDLE~ vs IDLE+CLOUD+POWERED_DOWN
○ sinfo -O statecomplete
○ sview "StateComplete"
○ scontrol
■ State == base_state + flags
■ Used to be a shortened state plus only some of the flags
● Old: ALLOCATED#+CLOUD
● Now: ALLOCATED+POWERING_UP+CLOUD
○ States can be viewed through REST API
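For illustration (hypothetical node names), the long-form state can be inspected with:

    $ sinfo -N -O nodelist,statecomplete
    NODELIST   STATECOMPLETE
    cloud1     IDLE+CLOUD+POWERED_DOWN
    cloud2     ALLOCATED+CLOUD+POWERING_UP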
PowerSave/Cloud changes
● scontrol update nodename=<> state=power_down[_asap|_force]
○ Analogous to the scontrol reboot states.
○ power_down (!) - power down after the node becomes idle
○ power_down_asap - drain the node and power down after the currently running jobs complete
○ power_down_force - kill running jobs and power down immediately (timing depends on power_save_interval)
● These requests can now suspend nodes even if they are listed in SuspendExc<Parts|Nodes>
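Illustrative usage (hypothetical node names):

    $ scontrol update nodename=cloud1 state=power_down          # after the node goes idle
    $ scontrol update nodename=cloud[2-4] state=power_down_asap # drain, then power down
    $ scontrol update nodename=cloud5 state=power_down_force    # kill jobs, power down now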
PowerSave/Cloud changes
● LastBusy time in scontrol
○ A node is suspended once last_busy + SuspendTime < now
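A quick illustrative check (hypothetical node name; assumes the LastBusyTime field shown by scontrol show node in 21.08):

    $ scontrol show node cloud1 | grep -o 'LastBusyTime=[^ ]*'
    LastBusyTime=2021-09-21T11:05:00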
PowerSave/Cloud changes
● SuspendTime, SuspendTimeout, ResumeTimeout on partitions
○ SuspendTime on a partition can enable PowerSave even when it is disabled at the global level
○ This is helpful for "hybrid" (i.e. bursting from on-premises) scenarios
■ By default, PowerSave wants to suspend everything that isn't in SuspendExc<Nodes|Parts>
● You have to remember to keep those lists updated
■ With SuspendTime on the partition, you can instead disable PowerSave at the global level and enable it only on specific cloud partitions
■ e.g. (see the fuller sketch below)
● SuspendTime=INFINITE
● PartitionName=cloud … SuspendTime=300
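A minimal slurm.conf sketch of this pattern (node and partition names are hypothetical):

    # Disable power save globally
    SuspendTime=INFINITE
    # Enable it only for the cloud partition
    PartitionName=onprem Nodes=node[1-100] Default=YES
    PartitionName=cloud Nodes=cloud[1-100] SuspendTime=300 SuspendTimeout=120 ResumeTimeout=600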
PowerSave/Cloud changes
● JSON mapping of jobs to nodes available in ResumeProgram
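A minimal ResumeProgram sketch, assuming the 21.08 behavior of passing the JSON file path through the SLURM_RESUME_FILE environment variable (jq and the provisioning step are illustrative placeholders):

    #!/bin/bash
    # Slurm passes the hostlist of nodes to resume as $1
    nodes="$1"
    # Inspect the job-to-node JSON mapping when the controller provides one
    if [ -n "$SLURM_RESUME_FILE" ]; then
        jq . "$SLURM_RESUME_FILE"
    fi
    # ...call the cloud provider's API here to create/start "$nodes"...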
Burst Buffer
● Review Marshall’s presentation from earlier on Lua Burst Buffer
● Cloud potential - hybrid and all-in the cloud
○ Stage data in before jobs start and out after they complete, without wasting money on idle compute time
Cloud Partners
We have strong relationships with our public cloud partners and are working closely with them on development and consultative engagements to keep enhancing the experience of using Slurm on their clouds.
● AWS
● Microsoft Azure
● Google Cloud
Slurm on Google Cloud
Latest Updates and Features In Development
Open Source on SchedMD's GitHub:
https://github.com/schedmd/slurm-gcp
“Version 4” Features Available Today:
● Terraform support generally available
● Google Cloud HPC VM Image-based deployment reduces
deployment time to just a few minutes
● Placement policy support for low latency networking
● Bulk API reduces deployment time by reducing API calls and
performing “regional capacity finding” for large deployments, up
to 1,000 instances
● Instance templates simplify configuration and reusability
● Cloud Marketplace listing simplifies small Slurm cluster
deployment
Slurm on Google Cloud
Latest Updates and Features In Development
Coming soon with the 21.08 release:
● Intel Select Solution support built-in to improve compatibility and
performance
“Version 5” Features In Development:
● Billing insights provide visibility into Slurm and GCP billing
● Partition flexibility allows more flexible (re)configuration of
partitions
● Data migration integrates Slurm data migration abilities with GCS
● SMT configurability integrates GCP SMT controls with Slurm-GCP
● Cloud Foundations Toolkit simplifies and standardizes the
Slurm-GCP scripts
Azure CycleCloud + Slurm
• Fastest way to get started with Slurm on Azure
• Deploy a complete HPC cluster in just minutes
• Azure-enable workflows with no changes
• Cluster autoscaling
• Automatic or manual resource control
• Cost reporting and controls
• Near real-time cost reporting
• Link usage to spend
• Tools to manage and control costs
• Hybrid workflows
• Burst on-premises Slurm clusters to the cloud
• Authorization and governance:
• AD and Azure AD integration
• Audit and event logging
• RBAC authorization control
Visit https://azure.microsoft.com/en-us/features/azure-cyclecloud to learn more
Microsoft Azure Slurm Integration
Full support for Slurm 20.11+ with CycleCloud 8.2
Support for job topology of MPI jobs requiring InfiniBand
VMs with GPUs automatically configured with GRES settings
Easy to use job accounting configuration
Burst execute nodes from on-premises to Azure
COMING SOON: Custom slurm.conf settings set via the CycleCloud UI
Official project repository: https://github.com/Azure/cyclecloud-slurm
AWS and Slurm Updates
• Slurm 21.08 – Key new features
• Improved “all or nothing” allocation and scaling
• Support for Slurm REST API with Amazon Cognito
• Blog post: "Introducing ParallelCluster 3"
• ParallelCluster 3.0 – Launched Sep 10
• AWS-supported, open source cluster management tool that is the
simplest way to get started with Slurm and HPC on AWS
• Enables:
• Easy Cluster Management
• Automatic Resource Scaling
• Seamless Migration to the Cloud
• Integration with Slurm 21.08 coming soon
AWS ParallelCluster and Slurm
(Diagram) Stack components: OS images (Amazon Linux, CentOS 6/7, Ubuntu 16/18), tooling (DCV, EFA, OpenMPI, Intel MPI, NCCL), schedulers (Slurm, AWS Batch), storage (FSx, EFS, S3, EBS, RAID), and compute/networking (On-Demand, Spot, VPC & subnets)
Enable on-premises environment parity | Facilitate "lift and shift" and application migration over time.
Questions?
Next Session
● The next presentation is by Tim Wickberg: "Slurm 21.08 and Beyond"
● Starts at 11:30am Mountain Daylight Time (UTC-6)
● And is on a separate YouTube Live stream
● Please see the SchedMD Slurm YouTube channel for links
End Of Stream
● Thanks for watching!