Slurm in the Clouds

Nick Ihli
SchedMD

Copyright 2021 SchedMD LLC


https://schedmd.com
Slurm User Group Meeting 2021

Agenda

All times are US Mountain Daylight (UTC-6)

Time            Speaker          Title
9:00 - 9:50     Jason Booth      Field Notes 5: From The Frontlines of Slurm Support
10:00 - 10:25   Nate Rini        REST API and also Containers
10:30 - 10:50   Marshall Garey   burst_buffer/lua and slurmscriptd
11:00 - 11:25   Nick Ihli        Slurm in the Clouds
11:30 - 11:50   Tim Wickberg     Slurm 21.08 and Beyond

Welcome

● Five separate presentations, five separate streams


● Presentations will remain available for at least two weeks after SLUG'21 concludes
● Presentations are available through the SchedMD Slurm YouTube channel
○ https://youtube.com/c/schedmdslurm
● Or through direct links from the agenda
○ https://slurm.schedmd.com/slurm_ug_agenda.html

Asking questions

● Feel free to ask questions at any time through YouTube's chat


● Chat is moderated by SchedMD staff
○ Tim McMullan, Ben Roberts, and Tim Wickberg
○ Moderators are identified by the little wrench symbol next to their names
● Questions will be relayed to the presenter by the moderators
○ Some may be deferred to the end if they cannot be relayed in a timely fashion
○ Or some may be answered by the moderators in chat directly

Slurm in the Clouds

Nick Ihli
SchedMD

● New Power Save/Cloud-related features in 21.08
● Update on Slurm in public clouds

PowerSave/Cloud changes

● SlurmctldParameters=node_reg_mem_percent
○ Allows a node to register with only a percentage of its configured memory.
○ Defaults:
■ 90% for cloud nodes
■ 100% for everything else
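
The registration rule above can be pictured with a small sketch. This is only an illustration of the idea of "registering with a percentage of configured memory"; the function name, the argument names, and the exact comparison are assumptions and are not taken from the slurmctld source.

```python
# Hypothetical illustration of a node_reg_mem_percent style check.
# The comparison below is an assumed reading of the slide, not Slurm's code.

# Defaults quoted on the slide: 90% for cloud nodes, 100% for everything else.
DEFAULT_PERCENT = {"cloud": 90.0, "other": 100.0}


def can_register(reported_mb: int, configured_mb: int, reg_mem_percent: float) -> bool:
    """Return True if the reported memory reaches the required share of configured memory."""
    required_mb = configured_mb * reg_mem_percent / 100.0
    return reported_mb >= required_mb
```

With these assumptions, a cloud node configured with 64000 MB that reports 58000 MB would pass at 90% (the threshold is 57600 MB) but fail at 100%.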

Power State Transition - Resume

[Figure: timeline of job and node states during a successful resume]
Job state:  Configuring → Running → Completing
Node state: IDLE + POWERED_OFF (~) → ALLOCATED / MIXED + POWERING_ON (#) → ALLOCATED / MIXED

Power State Transition - Resume Failure

[Figure: timeline of job and node states when a resume fails]
Job state:  Configuring → Requeue
Node state: IDLE + POWERED_OFF (~) → ALLOCATED / MIXED + POWERING_ON (#) → DOWN (at ResumeTimeout) → POWERED_OFF (~)
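
The two resume timelines can be summarized in a short sketch. The function and argument names are hypothetical, and treating a registration exactly at the timeout as a success is an assumption; only the state names come from the slides.

```python
from typing import Optional, Tuple


def resume_outcome(registered_after: Optional[float], resume_timeout: float) -> Tuple[str, str]:
    """Return (job_state, node_state) once the resume attempt has been decided.

    registered_after: seconds until the node registered, or None if it never did.
    resume_timeout:   the ResumeTimeout value in seconds.
    """
    if registered_after is not None and registered_after <= resume_timeout:
        # Node came up in time: the job leaves Configuring and runs.
        return ("Running", "ALLOCATED / MIXED")
    # Node missed ResumeTimeout: the job is requeued and the node is marked DOWN
    # before it returns to POWERED_OFF.
    return ("Requeue", "DOWN")
```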

Power State Transition - Suspend

[Figure: timeline of node states during a suspend]
Node state: IDLE → (after SuspendTime) IDLE + POWERING_DOWN (%) → (after SuspendTimeout) IDLE + POWERED_OFF (~)

PowerSave/Cloud changes

● sinfo/sview/scontrol state == base_state + flags


○ IDLE~ vs IDLE+CLOUD+POWERED_DOWN
○ sinfo -O statecomplete
○ sview "StateComplete"
○ scontrol
■ State == base_state + flags
■ Used to be shortened state + some flags
● Old: ALLOCATED#+CLOUD
● Now: ALLOCATED+POWERING_UP+CLOUD
○ States can be viewed through REST API
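
Because the full state string is now simply base_state plus flags joined with "+", it is easy to split apart in a script. A minimal sketch, assuming the string has already been obtained (for example from scontrol output); the helper and type names are hypothetical.

```python
from typing import NamedTuple


class NodeState(NamedTuple):
    base: str          # e.g. "ALLOCATED"
    flags: frozenset   # e.g. {"POWERING_UP", "CLOUD"}


def parse_state(state: str) -> NodeState:
    """Split a 'BASE+FLAG+FLAG' state string into its base state and flags."""
    base, *flags = state.split("+")
    return NodeState(base, frozenset(flags))
```

For example, parse_state("ALLOCATED+POWERING_UP+CLOUD") yields the base state ALLOCATED with the flags POWERING_UP and CLOUD.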

PowerSave/Cloud changes

● scontrol update nodename=<> state=power_down[_asap|_force]


○ Similar to the scontrol reboot states.
○ power_down (!) - power down after the node becomes idle
○ power_down_asap - put the node in drain state and power down after currently running jobs finish
○ power_down_force - kill running jobs and power down immediately (timing depends on power_save_interval)
● Able to suspend nodes that are part of SuspendExc<Parts|Nodes>
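
A script that drives these states could build the scontrol command line as sketched below. The command syntax is the one shown on the slide; the helper name and the node name "cloud001" are hypothetical.

```python
from typing import List

# The three power-down variants listed on the slide.
POWER_DOWN_STATES = ("power_down", "power_down_asap", "power_down_force")


def power_down_command(nodename: str, state: str = "power_down") -> List[str]:
    """Build the argument list for 'scontrol update nodename=<> state=<>'."""
    if state not in POWER_DOWN_STATES:
        raise ValueError(f"unsupported power-down state: {state}")
    return ["scontrol", "update", f"nodename={nodename}", f"state={state}"]


# Example with a hypothetical node name.
cmd = power_down_command("cloud001", "power_down_asap")
```

The resulting list could be handed to subprocess.run(cmd, check=True) on a host where scontrol is available.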

PowerSave/Cloud changes

● LastBusy time in scontrol


○ last_busy + suspend_time < now == Suspend
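
The suspend condition on this slide translates directly into code. A minimal sketch, assuming all three values are expressed in seconds; the function name is hypothetical.

```python
def due_for_suspend(last_busy: float, suspend_time: float, now: float) -> bool:
    """Return True when the slide's condition 'last_busy + suspend_time < now' holds."""
    return last_busy + suspend_time < now
```

For instance, a node last busy at t=1000 with a suspend time of 300 becomes eligible once the clock passes t=1300.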

PowerSave/Cloud changes

● SuspendTime, SuspendTimeout, ResumeTimeout on partitions


○ SuspendTime can enable PowerSave if disabled at global level
○ This will be helpful for "hybrid" (i.e. bursting from on-premises) scenarios
■ e.g. by default PowerSave wants to suspend everything that isn't in SuspendExc<Nodes|Parts>
● You have to remember to update these
■ Now with SuspendTime on the partition, you can disable PowerSave at the global level and enable it on specific cloud partitions
■ e.g.
● SuspendTime=INFINITE
● PartitionName=cloud … SuspendTime=300
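
The effect of the example above can be sketched as a lookup in which a partition-level value takes precedence over the global one. This precedence rule is an assumption drawn from the slide text, and the helper names are hypothetical; nodes that belong to several partitions are not covered.

```python
import math
from typing import Optional

INFINITE = math.inf  # stands in for SuspendTime=INFINITE


def effective_suspend_time(global_value: float, partition_value: Optional[float] = None) -> float:
    """Use the partition's SuspendTime when it is set, otherwise the global value."""
    return global_value if partition_value is None else partition_value


def power_save_enabled(suspend_time: float) -> bool:
    """PowerSave is treated as disabled when SuspendTime is INFINITE."""
    return suspend_time != INFINITE


# Global SuspendTime=INFINITE, cloud partition SuspendTime=300.
on_prem = effective_suspend_time(INFINITE)
cloud = effective_suspend_time(INFINITE, 300)
```

Under these assumptions the on-premises nodes are never suspended, while nodes in the cloud partition are suspended after 300 seconds of idleness.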

PowerSave/Cloud changes

● JSON mapping of jobs to nodes available in ResumeProgram
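
A ResumeProgram written in Python could turn such a mapping into a dictionary as sketched below. The slide does not show the JSON layout or how it is delivered, so the "jobs", "job_id", and "nodes" keys and the sample document are purely hypothetical placeholders; check the Slurm documentation for the real format.

```python
import json
from typing import Dict


def nodes_by_job(resume_json: str) -> Dict[int, str]:
    """Map each job ID to its node list, using an ASSUMED schema (see above)."""
    data = json.loads(resume_json)
    return {job["job_id"]: job["nodes"] for job in data["jobs"]}


# Hypothetical sample document, not real Slurm output.
sample = '{"jobs": [{"job_id": 101, "nodes": "cloud[1-2]"}, {"job_id": 102, "nodes": "cloud3"}]}'
mapping = nodes_by_job(sample)
```

Knowing which job each node is being resumed for lets the program make per-job provisioning decisions instead of treating the node list as one undifferentiated batch.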

Burst Buffer

● Review Marshall’s presentation from earlier on Lua Burst Buffer


● Cloud potential - hybrid and all-in-the-cloud deployments
○ Stage data in before jobs start and out after they complete, without wasting $ on idle compute time

Cloud Partners

We have strong relationships with our public cloud partners and are working
with them closely on development and consultative engagements to continue
to enhance the experience of using Slurm on their clouds.

● AWS
● Microsoft Azure
● Google Cloud

Slurm on Google Cloud
Latest Updates and Features In Development

Open Source on SchedMD’s GitHub:


https://github.com/schedmd/slurm-gcp

“Version 4” Features Available Today:


● Terraform support generally available
● Google Cloud HPC VM Image-based deployment reduces
deployment time to just a few minutes
● Placement policy support for low latency networking
● Bulk API reduces deployment time by reducing API calls and
performing “regional capacity finding” for large deployments, up
to 1,000 instances
● Instance templates simplify configuration and reusability
● Cloud Marketplace listing simplifies small Slurm cluster
deployment
Slurm on Google Cloud
Latest Updates and Features In Development

21.08 release soon:


● Intel Select Solution support built-in to improve compatibility and
performance

“Version 5” Features In Development:


● Billing insights provide visibility into Slurm and GCP billing
● Partition flexibility allows more flexible (re)configuration of
partitions
● Data migration integrates Slurm data migration abilities with GCS
● SMT configurability integrates GCP SMT controls with Slurm-GCP
● Cloud Foundations Toolkit simplifies and standardizes the
Slurm-GCP scripts
Azure CycleCloud + Slurm
• Fastest way to get started with Slurm on Azure
  ○ Deploy a complete HPC cluster in just minutes
  ○ Azure-enable workflows with no changes
• Cluster autoscaling
  ○ Automatic or manual resource control
• Cost reporting and controls
  ○ Near real-time cost reporting
  ○ Link usage to spend
  ○ Tools to manage and control costs
• Hybrid workflows
  ○ Burst on-premises Slurm clusters to the cloud
• Authorization and governance:
  ○ AD and Azure AD integration
  ○ Audit and event logging
  ○ RBAC authorization control

Visit https://azure.microsoft.com/en-us/features/azure-cyclecloud to learn more


Microsoft Azure Slurm Integration

Full support for Slurm 20.11+ with CycleCloud 8.2

Support for job topology of MPI jobs requiring InfiniBand

VMs with GPUs automatically configured with GRES settings

Easy to use job accounting configuration

Burst execute nodes from on-premises to Azure

COMING SOON: Custom slurm.conf settings set via the CycleCloud UI

Official project repository: https://github.com/Azure/cyclecloud-slurm


AWS and Slurm Updates
• Slurm 21.08 – Key new features
  ○ Improved “all or nothing” allocation and scaling
  ○ Support for Slurm REST API with Amazon Cognito
• ParallelCluster 3.0 – Launched Sep 10
  ○ AWS-supported, open source cluster management tool that is the simplest way to get started with Slurm and HPC on AWS
  ○ Enables:
    ■ Easy Cluster Management
    ■ Automatic Resource Scaling
    ■ Seamless Migration to the Cloud
  ○ Integration with Slurm 21.08 coming soon

Blog post: Introducing ParallelCluster 3

© 2021, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
AWS ParallelCluster and Slurm

[Figure: components supported by AWS ParallelCluster]
ALINUX, CENTOS 6/7, UBUNTU 16/18, DCV, EFA
OPENMPI, INTELMPI, NCCL
SLURM, AWS BATCH
FSX, EFS, S3, EBS, RAID
ON-DEMAND, SPOT, VPC & SUBNETS

Enable on-premises environmental parity | Facilitate “lift and shift” and application migration over time.
Questions?

Next Session

● The next presentation is by Tim Wickberg: "Slurm 21.08 and Beyond"


● Starts at 11:30am Mountain Daylight Time (UTC-6)
● And is on a separate YouTube Live stream
● Please see the SchedMD Slurm YouTube channel for links

End Of Stream

● Thanks for watching!
