Slurm Usage Guide
Concept
SSH flow: first connect to hanoi, then to login-sp.vinai-systems.com.
Log in with your AD account:
ssh hanoi
ssh <username>@login-sp.vinai-systems.com
Ex: ssh [email protected]
HOME_FOLDER_ISILON <=> /home/your_username (on the login node) <=> /vinai/your_username
SUPERPOD_STORAGE_DDN_FOLDER <=> /lustre/scratch/client (on all nodes)
PERSONAL_STORAGE_DDN_FOLDER <=> /lustre/scratch/client/vinai/user/your_username
Put your training data in the DDN storage; the ISILON home folder is for long-term archiving.
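The two-hop SSH flow above can be collapsed into a single command with an SSH config entry. This is a sketch that assumes `hanoi` is already reachable as shown and that your OpenSSH client supports ProxyJump (7.3+); the alias `superpod` is invented for illustration.

```shell
# Append a jump-host entry to ~/.ssh/config (sketch; the 'superpod' alias
# is hypothetical, the hostnames are the ones from this guide).
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host superpod
    HostName login-sp.vinai-systems.com
    User your_username
    ProxyJump hanoi
EOF
```

With this in place, `ssh superpod` performs both hops at once.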
Introduction
Slurm is an open-source job-scheduling system for Linux clusters, most frequently used for
high-performance computing (HPC) applications. This guide covers the basics of using Slurm
as a user; for more information, the Slurm docs are a good place to start.
After Slurm is deployed on a cluster, a slurmd daemon runs on each compute node. Users do
not log in to the compute nodes directly to do their work. Instead, they execute Slurm
commands (e.g. srun, sinfo, scancel, scontrol) from a Slurm login node. These commands
communicate with the slurmd daemons on each host to perform work.
Simple Commands
Cluster state with sinfo
To "see" the cluster, ssh to the slurm login node for your cluster and run the `sinfo`
command:
dgxuser@sdc2-hpc-login-mgmt001:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 1-00:00:00 8 idle sdc2-hpc-dgx-a100-[001-008]
batch* up 1-00:00:00 2 down sdc2-hpc-dgx-a100-[013,015]
Eight nodes in this system are idle and two are down. When a node is busy, its state changes
from idle to alloc; a node that is out of service shows down.
dgxuser@sdc2-hpc-login-mgmt001:~$ sinfo -lN
Fri Jul 16 10:47:52 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
sdc2-hpc-dgx-a100-001 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-002 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-003 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-004 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-005 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-006 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-007 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-008 1 batch* idle 256 2:64:2 103100 0 1 (null) none
sdc2-hpc-dgx-a100-013 1 batch* down 256 2:64:2 103100 0 1 (null) VinAI use
sdc2-hpc-dgx-a100-015 1 batch* down 256 2:64:2 103100 0 1 (null) VinAI use
The `sinfo` command can be used to output a lot more information about the cluster. Check out
the sinfo doc for more information.
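Because `sinfo` emits plain text, its output is easy to post-process. As a rough sketch, here is how to total the idle nodes with awk; the sample text mirrors the output above, and on the cluster you would pipe `sinfo -h` in directly:

```shell
# Sum the NODES column (field 4) for rows whose STATE (field 5) is "idle".
# The sample mirrors the sinfo output shown above.
sinfo_sample='batch* up 1-00:00:00 8 idle sdc2-hpc-dgx-a100-[001-008]
batch* up 1-00:00:00 2 down sdc2-hpc-dgx-a100-[013,015]'
echo "$sinfo_sample" | awk '$5 == "idle" {n += $4} END {print n}'
# prints: 8
```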
Running a job with srun
To run a job, use the srun command:
dgxuser@sdc2-hpc-login-mgmt001:~$ srun --partition=batch --gres=gpu:8 env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
dgxuser@sdc2-hpc-login-mgmt001:~$ srun --partition=batch --ntasks 8 -l hostname
5: sdc2-hpc-dgx-a100-001
2: sdc2-hpc-dgx-a100-001
7: sdc2-hpc-dgx-a100-001
6: sdc2-hpc-dgx-a100-001
0: sdc2-hpc-dgx-a100-001
3: sdc2-hpc-dgx-a100-001
1: sdc2-hpc-dgx-a100-001
4: sdc2-hpc-dgx-a100-001
Running an interactive job
Especially when developing and experimenting, it's helpful to run an interactive job, which
requests a resource and provides a command prompt as an interface to it (here with a
two-hour time limit). Note that srun options must come before the command, or they are
passed to the shell instead:
dgxuser@sdc2-hpc-login-mgmt001:~$ srun --partition=batch --time=02:00:00 --pty /bin/bash
dgxuser@sdc2-hpc-dgx-a100-001:~$ hostname
sdc2-hpc-dgx-a100-001
dgxuser@sdc2-hpc-dgx-a100-001:~$ exit
In interactive mode, the resource stays reserved until the prompt is exited (as shown
above), and commands can be run in succession.
Note: before starting an interactive session with srun, it may be helpful to create a
session on the login node with a tool like `tmux` or `screen`. This prevents losing an
interactive job if there is a network outage or the terminal is closed.
More Advanced Use
Run a batch job
While the srun command blocks the terminal until the job finishes, sbatch queues a job for
execution once resources become available in the cluster. A batch job also lets you queue up
several jobs that run as nodes free up. It's therefore good practice to encapsulate
everything that needs to run into a script and submit it with sbatch rather than srun:
Example: running a Python job
dgxuser@sdc2-hpc-login-mgmt001:~$ cat script.sh
#!/bin/bash -e
#SBATCH --job-name=demo                  # short name for your job
#SBATCH --output=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.out  # stdout file
#SBATCH --error=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.err   # stderr file
#SBATCH --partition=batch                # choose a partition: batch or phase2
#SBATCH --gpus=1                         # GPU count
#SBATCH --nodes=1                        # node count
#SBATCH --mem-per-cpu=2G                 # memory per CPU core (default is 4G)
#SBATCH --cpus-per-gpu=8                 # CPU cores per GPU
#SBATCH --mail-type=all                  # mail on: begin, end, fail, requeue, all
#SBATCH [email protected]  # your email

python3 demo.py
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch script.sh
Note: all #SBATCH directives must come before the first command in the script; put
`set -e` after them, or use `#!/bin/bash -e` as above.
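Because #SBATCH directives after the first command are silently ignored, it can be worth sanity-checking a script before submitting. A minimal sketch; the file name job.sh and its contents are illustrative, with the partition name taken from this guide:

```shell
# Write a minimal batch script, then count the directives it declares.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=batch
#SBATCH --gpus=1
#SBATCH --nodes=1
python3 demo.py
EOF
grep -c '^#SBATCH' job.sh   # directives found at the top of the script
# prints: 4
```

Submit with `sbatch job.sh` once the directive count matches what you expect.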
Resources can be requested in several different ways:
sbatch/srun Option   Description
-N, --nodes=         Total number of nodes to request
-n, --ntasks=        Total number of tasks to request
--ntasks-per-node=   Number of tasks per node
--gpus-per-node=     Number of GPUs per node
-G, --gpus=          Total number of GPUs to allocate for the job
--gpus-per-task=     Number of GPUs per task
--cpus-per-task=     Number of CPUs per task
--exclusive          Guarantee that nodes are not shared among jobs
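These options overlap, so the same allocation can be requested in several ways. A sketch using standard srun options, with `hostname` standing in for a real command:

```shell
srun --nodes=1 --gpus=8 hostname             # 8 GPUs in total, on one node
srun --nodes=1 --gpus-per-node=8 hostname    # the same request, expressed per node
srun --ntasks=8 --gpus-per-task=1 hostname   # 8 tasks, one GPU each
```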
Observing running jobs with squeue
To see which jobs are running in the cluster, use the `squeue` command:
dgxuser@sdc2-hpc-login-mgmt001:~$ squeue -a -l
Fri Jul 16 11:01:38 2021
JOBID PARTITION NAME USER    STATE    TIME TIME_LIMI  NODES NODELIST(REASON)
125   batch     demo dgxuser COMPLETI 0:09 1-00:00:00 1     sdc2-hpc-dgx-a100-001
Cancel a job with scancel
dgxuser@sdc2-hpc-login-mgmt001:~$ squeue          # find the JOBID
dgxuser@sdc2-hpc-login-mgmt001:~$ scancel JOBID
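Beyond cancelling a single job id, scancel can select jobs by owner or by name (standard scancel flags; the id 125 and job name demo are from the squeue example above):

```shell
scancel 125              # cancel one job by id (the id shown by squeue)
scancel --name=demo      # cancel jobs with a given name
scancel --user=$USER     # cancel all of your own jobs
```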
Running a job with modules
List the available modules:
dgxuser@sdc2-hpc-login-mgmt001:~$ module avail
-------------------------------------------------- /sw/modules/all --------------------------------------------------
mpi/3.0.6  python/2.7.18  python/3.6.10  python/3.8.10  python/miniconda3/miniconda3
python/pytorch/1.9.0+cu111  python/tensorflow/2.3.0
Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".
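Inside a batch script, a typical pattern is to reset the module environment and load exactly what the job needs (module names taken from the listing above):

```shell
module purge                   # start from a clean environment
module load python/3.8.10      # a module from the listing above
module list                    # confirm what is loaded
```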
Create your environment
dgxuser@sdc2-hpc-login-mgmt001:~$ module load python/miniconda3/miniconda3
dgxuser@sdc2-hpc-login-mgmt001:~$ conda create -p /lustre/scratch/client/vinai/users/youruser/yourfolder python=yourversion
dgxuser@sdc2-hpc-login-mgmt001:~$ conda activate /lustre/scratch/client/vinai/users/youruser/yourfolder
Note: an environment created with -p (a prefix) is activated by its path, not by a name.
Install the libraries and packages you need (pip is preferred). Export the proxy variables
if you have problems with the internet connection:
export HTTP_PROXY=http://proxytc.vingroup.net:9090/
export HTTPS_PROXY=http://proxytc.vingroup.net:9090/
export http_proxy=http://proxytc.vingroup.net:9090/
export https_proxy=http://proxytc.vingroup.net:9090/
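If you switch between networks, it can be convenient to wrap the four exports in small helper functions. This is a sketch, not part of the cluster setup; the proxy URL is the one given above:

```shell
# Hypothetical helpers to toggle the proxy variables in the current shell.
proxy_on() {
  local p=http://proxytc.vingroup.net:9090/
  export HTTP_PROXY="$p" HTTPS_PROXY="$p" http_proxy="$p" https_proxy="$p"
}
proxy_off() {
  unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
}
proxy_on
echo "$HTTPS_PROXY"
# prints: http://proxytc.vingroup.net:9090/
```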
Example: run a job on one A100 node with 4 GPUs:
dgxuser@sdc2-hpc-login-mgmt001:~$ cat conda.sh
#!/bin/bash -e
#SBATCH --job-name=py-job
#SBATCH --output=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.out
#SBATCH --error=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.err
#SBATCH --gpus=4
#SBATCH --nodes=1
#SBATCH --mem-per-gpu=36G
#SBATCH --cpus-per-gpu=8
#SBATCH --partition=batch                # or phase2
#SBATCH --mail-type=all                  # mail on: begin, end, fail, requeue, all
#SBATCH [email protected]  # your email
module purge
module load python/miniconda3/miniconda3
eval "$(conda shell.bash hook)"
conda activate /lustre/scratch/client/vinai/users/youruser/yourfolder
command ...
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch conda.sh
Running a job with a Docker container
List of available containers on harbor.vinai-systems.com:
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.0-cudnn7-ubuntu18.04
harbor.vinai-systems.com/library/cuda:10.0-cudnn7-ubuntu18.04
harbor.vinai-systems.com/library/pytorch:1.4.0-python3.7-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-tensorflow:1.14.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-python:3.6-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-tf-torch:1.15.0-1.4.0-python2.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/miniconda:3-cuda10.1-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-pytorch:1.4.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/dc-miniconda:3-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/miniconda:3-cuda10.0-cudnn7-ubuntu16.04
harbor.vinai-systems.com/library/pytorch:1.4.0-python3.7-cuda10.0-cudnn7-ubuntu16.04
You can build your own image from nvcr.io; a Dockerfile example is in the attached ZipFile.
On the login node:
docker login harbor.vinai-systems.com   # log in with your harbor account
docker tag your_image harbor.vinai-systems.com/library/your_image:your_tag
docker push harbor.vinai-systems.com/library/your_image:your_tag
Contact the admin if you want an account for harbor.vinai-systems.com.
For example, you can run a container job like this:
dgxuser@sdc2-hpc-login-mgmt001:~$ cat container.sh
#!/bin/bash -e
#SBATCH --job-name=container-job
#SBATCH --output=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.out
#SBATCH --error=/lustre/scratch/client/vinai/users/youruser/yourfolder/slurm_%A.err
#SBATCH --gpus=2
#SBATCH --nodes=1
#SBATCH --mem-per-gpu=36G
#SBATCH --cpus-per-gpu=8
#SBATCH --partition=batch
#SBATCH --mail-type=all
#SBATCH [email protected]  # your email
srun --container-image="harbor.vinai-systems.com#library/cuda:10.0-cudnn7-ubuntu18.04" \
--container-mounts=lustre_folder:container_folder \
python …
dgxuser@sdc2-hpc-login-mgmt001:~$ sbatch container.sh
Note: save your checkpoints to your Lustre folder; files written inside the container's own
filesystem do not persist after the job ends.
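For checkpoints specifically, the key is the --container-mounts option shown above: map a Lustre directory into the container and write there. A sketch; the image is from the list above, while the mount paths, train.py, and its --checkpoint-dir flag are hypothetical:

```shell
srun --container-image="harbor.vinai-systems.com#library/cuda:10.0-cudnn7-ubuntu18.04" \
     --container-mounts=/lustre/scratch/client/vinai/users/youruser/ckpt:/ckpt \
     python3 train.py --checkpoint-dir /ckpt
```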