This project sets up an auto-scaling Slurm cluster. Slurm is a highly configurable open source workload manager. See the Slurm project site for an overview.
See Transitioning from 2.7 to 3.0 for more information.
The Slurm cluster deployed in CycleCloud includes a CLI called azslurm, which facilitates making cluster changes. After making any changes to the cluster, run the following command as root on the Slurm scheduler node to rebuild azure.conf and update the nodes in the cluster:
$ sudo -i
# azslurm scale
This creates the partitions with the correct number of nodes and the proper gres.conf, and restarts slurmctld.
As of 3.0.0, we no longer pre-create the nodes in CycleCloud. Nodes are created when azslurm resume is invoked, or by manually creating them in CycleCloud (via the CLI, for example).
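For example, you can bring specific nodes online by hand with the resume wrapper described later in this document (the node names are illustrative and assume the default htc partition):
$ sudo -i
# /opt/azurehpc/slurm/resume_program.sh htc-[1-4]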
The default template that ships with Azure CycleCloud has three partitions (hpc, htc and dynamic), and you can define custom nodearrays that map directly to Slurm partitions. For example, to create a GPU partition, add the following section to your cluster template:
[[nodearray gpu]]
MachineType = $GPUMachineType
ImageName = $GPUImageName
MaxCoreCount = $MaxGPUExecuteCoreCount
Interruptible = $GPUUseLowPrio
AdditionalClusterInitSpecs = $ExecuteClusterInitSpecs
[[[configuration]]]
slurm.autoscale = true
# Set to true if nodes are used for tightly-coupled multi-node jobs
slurm.hpc = false
[[[cluster-init cyclecloud/slurm:execute:3.0.4]]]
[[[network-interface eth0]]]
AssociatePublicIpAddress = $ExecuteNodesPublic
As of 3.0.1, we support dynamic partitions. You can make a nodearray map to a dynamic partition by adding the following.
Note that mydyn could be any valid Feature. It could also be more than one, separated by a comma.
[[[configuration]]]
slurm.autoscale = true
# Set to true if nodes are used for tightly-coupled multi-node jobs
slurm.hpc = false
# This is the minimum, but see slurmd --help and [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) for more information.
slurm.dynamic_config := "-Z --conf \"Feature=mydyn\""
This will generate a dynamic partition like the following:
# Creating dynamic nodeset and partition using slurm.dynamic_config=-Z --conf "Feature=mydyn"
Nodeset=mydynamicns Feature=mydyn
PartitionName=mydynamic Nodes=mydynamicns
By default, we define no nodes in the dynamic partition. Instead, you can start nodes via CycleCloud or by manually invoking azslurm resume, and they will join the cluster with whatever name you picked. However, Slurm does not know about these nodes, so it cannot autoscale them up.
Alternatively, you can pre-create node records like so, which allows Slurm to autoscale them up.
scontrol create nodename=f4-[1-10] Feature=mydyn State=CLOUD
One other advantage of dynamic partitions is that you can support multiple VM sizes in the same partition.
Simply add the VM Size name as a feature, and then azslurm can distinguish which VM size you want to use.
Note: The VM Size is added implicitly. You do not need to add it to slurm.dynamic_config.
scontrol create nodename=f4-[1-10] Feature=mydyn,Standard_F4 State=CLOUD
scontrol create nodename=f8-[1-10] Feature=mydyn,Standard_F8 State=CLOUD
Either way, once you have created these nodes in a State=Cloud, they are now available to autoscale like other nodes.
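With feature-tagged node records like the ones above in place, a job can request a specific VM size by constraint. A minimal sketch using standard sbatch options (the script name is illustrative; the partition and feature names come from the examples above):
$ sbatch -p mydynamic -C Standard_F4 -N 2 my_job.sh
Slurm will only place the job on nodes whose Features include Standard_F4, so azslurm starts VMs of that size.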
To support multiple VM sizes in a CycleCloud nodearray, you can alter the template to allow multiple VM sizes by adding Config.Multiselect = true.
[[[parameter DynamicMachineType]]]
Label = Dyn VM Type
Description = The VM type for Dynamic nodes
ParameterType = Cloud.MachineType
DefaultValue = Standard_F2s_v2
Config.Multiselect = true
By default, all nodes in the dynamic partition will scale down just like the other partitions. To disable this, see SuspendExcParts.
If cyclecloud_slurm detects that autoscale is disabled (SuspendTime=-1), it uses the FUTURE state to denote nodes that are powered down, instead of relying on the power state in Slurm. That is, when autoscale is enabled, powered-off nodes appear as idle~ in sinfo; when autoscale is disabled, powered-off nodes do not appear in sinfo at all. You can still see their definition with scontrol show nodes --future.
To start new nodes, run /opt/azurehpc/slurm/resume_program.sh node_list (e.g. htc-[1-10]).
To shut down nodes, run /opt/azurehpc/slurm/suspend_program.sh node_list (e.g. htc-[1-10]).
To start a cluster in this mode, simply add SuspendTime=-1 to the additional slurm config in the template.
To switch a cluster to this mode, add SuspendTime=-1 to the slurm.conf and run scontrol reconfigure. Then run azslurm remove_nodes && azslurm scale.
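A minimal sketch of that switch, assuming slurm.conf lives at /etc/slurm/slurm.conf on the scheduler (adjust the path for your installation):
$ sudo -i
# echo "SuspendTime=-1" >> /etc/slurm/slurm.conf
# scontrol reconfigure
# azslurm remove_nodes && azslurm scale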
To enable accounting in Slurm, MariaDB can now be started via cloud-init on the scheduler node, and slurmdbd can be configured to connect to the database without a password string. In the absence of a database URL and password, the slurmdbd configuration defaults to localhost. One way of doing this is to add the following lines in cluster-init:
#!/bin/bash
yum install -y mariadb-server
systemctl enable mariadb.service
systemctl start mariadb.service
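After the node configures, you can sanity-check the setup with standard tools (not specific to this project): confirm the database is running and that Slurm accounting responds.
# systemctl is-active mariadb
# sacct --starttime today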
In previous versions, we shipped with an embedded certificate to connect to Azure MariaDB and Azure MySQL instances. This is no longer required. However, if you wish to restore this behavior, select the 'AzureCA.pem' option from the dropdown for the 'Accounting Certificate URL' parameter in your cluster settings.
The azslurm CLI in the Slurm 3.0 project now includes a new experimental feature, azslurm cost, to display the costs of Slurm jobs. This requires CycleCloud 8.4 or newer, as well as Slurm accounting to be enabled.
usage: azslurm cost [-h] [--config CONFIG] [-s START] [-e END] -o OUT [-f FMT]
optional arguments:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
-s START, --start START
Start time period (yyyy-mm-dd), defaults to current
day.
-e END, --end END End time period (yyyy-mm-dd), defaults to current day.
-o OUT, --out OUT Directory name for output CSV
-f FMT, --fmt FMT Comma separated list of SLURM formatting options.
Otherwise defaults are applied
Cost reporting currently works only with retail Azure pricing, and hence may not reflect actual customer invoices.
To generate cost reports for a given time period:
azslurm cost -s 2023-03-01 -e 2023-03-31 -o march-2023
This will create a directory march-2023 and generate csv files containing costs for jobs and partitions.
[root@slurm301-2-scheduler ~]# ls march-2023/
jobs.csv partition.csv partition_hourly.csv
- jobs.csv: contains costs per job based on job runtime. Currently running jobs are included.
- partition.csv: contains costs per partition, based on total usage in each partition. For partitions where multiple VM sizes can be included, such as dynamic partitions, it includes a row for each VM size.
- partition_hourly.csv: contains an hourly cost report for each partition.
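For a quick look at one of the reports from the shell (standard coreutils; the directory name comes from the example above):
$ column -s, -t < march-2023/partition.csv | less -S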
Basic formatting support lets you customize the Slurm fields in the jobs report that are appended from sacct data. Cost reporting fields such as sku_name,region,spot,meter,meterid,metercat,rate,currency,cost are always appended, but the Slurm fields from sacct are customizable. Any field available in sacct -e is valid. To customize formatting:
azslurm cost -s 2023-03-01 -e 2023-03-31 -o march-2023 -f account,cluster,jobid,jobname,reqtres,start,end,state,qos,priority,container,constraints,user
This will append the supplied formatting options to the cost reporting fields and produce the jobs CSV file with the following columns:
account,cluster,jobid,jobname,reqtres,start,end,state,qos,priority,container,constraints,user,sku_name,region,spot,meter,meterid,metercat,rate,currency,cost
Formatting is only available for jobs and not for partition and partition_hourly data.
Note: azslurm cost relies on Slurm's AdminComment feature to associate specific VM size and meter information with jobs.
By default, this project uses a UID and GID of 11100 for the Slurm user and 11101 for the Munge user. If this causes a conflict with another user or group, these defaults may be overridden.
To override the UID and GID, click the Edit button for the scheduler node and for each nodearray (for example, the htc array), and add the following attributes at the end of the Configuration section:
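A sketch of those attributes, assuming the slurm.user.* and munge.user.* configuration keys used by this project; the user names and ID values shown are illustrative overrides:
slurm.user.name = slurm
slurm.user.uid = 11200
slurm.user.gid = 11200
munge.user.name = munge
munge.user.uid = 11201
munge.user.gid = 11201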
The following summarizes the changes when transitioning from 2.7 to 3.0:
- The installation folder changed from /opt/cycle/slurm to /opt/azurehpc/slurm.
- Logs are now in /opt/azurehpc/slurm/logs instead of /var/log/slurmctld. Note, slurmctld.log will still be in this folder.
- cyclecloud_slurm.sh no longer exists. Instead there is the azslurm CLI, which can be run as root. azslurm uses autocomplete.
  [root@scheduler ~]# azslurm
  usage:
      accounting_info        -
      buckets                - Prints out autoscale bucket information, like limits etc
      config                 - Writes the effective autoscale config, after any preprocessing, to stdout
      connect                - Tests connection to CycleCloud
      cost                   - Cost analysis and reporting tool that maps Azure costs to SLURM Job Accounting data. This is an experimental feature.
      default_output_columns - Output what are the default output columns for an optional command.
      generate_topology      - Generates topology plugin configuration
      initconfig             - Creates an initial autoscale config. Writes to stdout
      keep_alive             - Add, remove or set which nodes should be prevented from being shut down.
      limits                 -
      nodes                  - Query nodes
      partitions             - Generates partition configuration
      refresh_autocomplete   - Refreshes local autocomplete information for cluster specific resources and nodes.
      remove_nodes           - Removes the node from the scheduler without terminating the actual instance.
      resume                 - Equivalent to ResumeProgram, starts and waits for a set of nodes.
      resume_fail            - Equivalent to SuspendFailProgram, shuts down nodes
      retry_failed_nodes     - Retries all nodes in a failed state.
      scale                  -
      shell                  - Interactive python shell with relevant objects in local scope. Use --script to run python scripts
      suspend                - Equivalent to SuspendProgram, shuts down nodes
      wait_for_resume        - Wait for a set of nodes to converge.
- Nodes are no longer pre-populated in CycleCloud. They are only created when needed.
- All Slurm binaries are inside the azure-slurm-install-pkg*.tar.gz file, under slurm-pkgs. They are pulled from a specific binary release. The current binary release is 2023-08-07.
- For MPI jobs, the only network boundary that exists by default is the partition. Unlike 2.x, there are not multiple "placement groups" per partition, so you only have one colocated VMSS per partition. There is also no use of the topology plugin, which in turn removes the need for the job submission plugin. Instead, submitting to multiple partitions is now the recommended option for use cases that require submitting jobs to multiple placement groups (see the example below).
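For example, a tightly coupled job that can run in either of two placement-group-backed partitions can be submitted to both, and Slurm will use the first partition able to start it. A minimal sketch (the partition names hpc and hpc2 and the script name are illustrative):
$ sbatch -p hpc,hpc2 -N 16 --exclusive my_mpi_job.sh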
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.