|
| 1 | +# Using GPUs for NLP in Informatics: Quick Start |
| 2 | +## 11th October 2023 |
| 3 | +### Tom Sherborne |
| 4 | + |
| 5 | +This guide explains how to: |
| 6 | + - Log on to the cluster |
| 7 | + - Set up your bash environment |
| 8 | + - Set up a conda environment manager |
| 9 | + - Get some useful student-written scripts
| 10 | + - Work through examples of:
| 11 | +    - Interactive sessions with and without a GPU (srun)
| 12 | +    - Running scripts on the cluster non-interactively (sbatch)
| 13 | + - Set up Jupyter notebooks that run on the cluster but are easily accessible from anywhere
| 14 | + |
| 15 | +**New**: The lecture and demo now have accompanying written notes available [here](./ilcc_cluster_talk_11_10_23.md)
| 16 | + |
| 17 | +## First Steps |
| 18 | + |
| 19 | +The rest of this guide is designed to help you understand how to use the cluster. If you got lost during the talk, or you need a recap, then you are in the right place. Consider this document a written version of most of the talk (but not everything).
| 20 | + |
| 21 | +__At a minimum, it would be very helpful if you can do these THREE tasks:__ |
| 22 | + |
| 23 | + 1. Check you can SSH into `ilcc-cluster.inf.ed.ac.uk`. This will probably need to be from a DICE machine or a machine within the Informatics VPN (setup help for this is below). |
| 24 | + - If you cannot access this machine, please file a [Support ticket](https://www.inf.ed.ac.uk/systems/support/form/). You’ll be limited during the talk/demo if you cannot experience the cluster for yourself. |
| 25 | + - If you are not comfortable with shell access then read this [MIT guide](https://missing.csail.mit.edu/) for shells and [here](https://www.digitalocean.com/community/tutorials/how-to-use-ssh-to-connect-to-a-remote-server) for SSH. |
| 26 | + |
| 27 | + 2. Clone the [cluster-scripts](https://github.com/cdt-data-science/cluster-scripts) repository into your own workspace using Git. |
| 28 | + - We will use resources within this repository during the session so make sure you have it now. |
| 29 | +    - Run the following command in the directory where you want these scripts to live (usually the top-level directory)
| 30 | + ``` |
| 31 | + git clone https://github.com/cdt-data-science/cluster-scripts |
| 32 | + ``` |
| 33 | +
|
| 34 | + 3. Install Miniconda (conda) on the cluster:
| 35 | + - The clusters can’t use the same environment as DICE, so you need to reinstall this for `ilcc-cluster.inf.ed.ac.uk`. |
| 36 | + - More details are below but the gist is these two lines: |
| 37 | + ``` |
| 38 | + wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh |
| 39 | + bash Miniconda3-latest-Linux-x86_64.sh |
| 40 | + ``` |
| 41 | + - Installing Conda can be _very_ slow, so please get this done before we start. |
| 42 | +    - Note that copying `${HOME}/miniconda3` between clusters typically breaks symlinks and hard-coded paths. You should install `miniconda3` from scratch on each cluster and then export and recreate each environment (guidance [here](ilcc_cluster_quick_start_13_9_22.md)).
| 43 | +    - You can use `virtualenv` or `poetry` if desired, but I will assume you know what you are doing and don't need help.
| 44 | +
|
| 45 | +
|
| 46 | +## Important things to note first |
| 47 | +
|
| 48 | + |
| 49 | +This is an approximate picture of how a cluster is arranged, based on a different cluster (`albert`). The ILCC cluster we are using today is set up similarly.
| 50 | +
|
| 51 | +- The initial node you log into is called the __head node__ (named `escience6` at the time of writing) - __do not__ run heavy processes on here. This node is only used for sending jobs to other nodes in the cluster. |
| 52 | +- The filesystem you have access to when you log in is identical on all the nodes you can access - it is a __distributed__ filesystem. As such, it is very slow (because it must appear the same everywhere)! |
| 53 | + - Avoid reading and writing files frequently on this filesystem. |
| 54 | + - For the ILCC cluster, the file-system is not load-balanced. If you break it, it will go down quickly so be careful! |
| 55 | +  - Instead, when running a job on a node, use the scratch disk and only move files to the shared filesystem infrequently. The scratch disk is normally located at `/disk/scratch`.
| 56 | +- Please skim this page for best practice: http://computing.help.inf.ed.ac.uk/cluster-tips
| 57 | +
|
| 58 | +
|
| 59 | +## Quick Bash Environment Setup |
| 60 | +
|
| 61 | +1. Find the name of your cluster; it may be `cdtcluster`, `mlp`, or `ilcc-cluster` (see http://computing.help.inf.ed.ac.uk/cluster-computing for more). Throughout this guide I will assume you have set a variable called `CLUSTER_NAME` (or you'll just substitute the name in the instructions), e.g. `export CLUSTER_NAME=ilcc-cluster`.
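As a hedged sketch, the variable-based workflow from this step looks like the following (`ilcc-cluster` is just the example name; substitute your own):

```
# Set once per shell session, or add to your local ~/.bashrc.
# "ilcc-cluster" is only an example value - substitute your cluster's name.
export CLUSTER_NAME=ilcc-cluster

# The SSH target used throughout this guide is then:
echo "${USER}@${CLUSTER_NAME}.inf.ed.ac.uk"
```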
| 62 | +
|
| 63 | +2. Run this line to SSH into the cluster: `ssh ${USER}@${CLUSTER_NAME}.inf.ed.ac.uk` |
| 64 | +
|
| 65 | +3. Create a Bash profile file so that the correct Bash setup runs when you login: |
| 66 | + - `touch .bash_profile` |
| 67 | +    - Open the `.bash_profile` file and paste the following code into it. If you are using Vim:
| 68 | + - `vim .bash_profile` |
| 69 | +      - Press the `i` key so that you can insert text into the file (you'll see the word INSERT at the bottom of the screen)
| 70 | + - Paste: |
| 71 | + ``` |
| 72 | + if [ -f ~/.bashrc ]; then |
| 73 | + source ~/.bashrc |
| 74 | + fi |
| 75 | + ``` |
| 76 | +      - Press ESC, then type `:wq` and press Enter to save and exit Vim.
| 77 | +    - If you don't know how to use a command-line editor, try getting started with Vim [here](https://vim-adventures.com/).
| 78 | +
|
| 79 | +4. Install `miniconda3`: |
| 80 | + - Download and run Miniconda installer |
| 81 | + ``` |
| 82 | + wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh |
| 83 | + bash Miniconda3-latest-Linux-x86_64.sh |
| 84 | + ``` |
| 85 | + - Accept the licensing terms during installation. |
| 86 | +
|
| 87 | +5. Clone the `cluster-scripts` repository containing useful scripts and the demos.
| 88 | +    - Clone it into your cluster user space:
| 89 | + ``` |
| 90 | + git clone https://github.com/cdt-data-science/cluster-scripts |
| 91 | + cd ./cluster-scripts |
| 92 | + ``` |
| 93 | + - __Follow the instructions in `README.md`__ |
| 94 | +
|
| 95 | +6. Re-source your Bash profile |
| 96 | + ``` |
| 97 | + source ~/.bashrc |
| 98 | + ``` |
| 99 | +
|
| 100 | +7. You can now play around with commands on the cluster (try running `free-gpus`, `cluster-status`) |
| 101 | +
|
| 102 | +8. You are ready to go! |
| 103 | +
|
| 104 | +
|
| 105 | +## What's Next? Practical examples! |
| 106 | +
|
| 107 | +You can head straight to the cluster-scripts experiments repository [here](https://github.com/cdt-data-science/cluster-scripts/tree/master/experiments/examples/nlp) where you will find templates and practical examples to get you going. |
| 108 | +
|
| 109 | +There are further examples below to try afterwards. |
| 110 | +
|
| 111 | +## Further examples |
| 112 | +All the examples below assume you have completed the setup above.
| 113 | +
|
| 114 | +### Setup |
| 115 | +#### Create a conda environment with PyTorch |
| 116 | +
|
| 117 | +Make the conda environment. This can take a bit of time (it’s harder for the distributed filesystem to deal with lots of small files than for your local machine’s hard drive) - go get a cup of tea. |
| 118 | +
|
| 119 | +1. Check which versions of CUDA are available locally with `ls -d /opt/cu*`, and use the matching version for the `cudatoolkit=??.?` argument below (the command in the next step assumes 11.3; adjust it to match what you find).
| 120 | +
|
| 121 | +2. Run the command to create a conda environment called `pt`: |
| 122 | + `conda create -y -n pt python=3 pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch` (more info about PyTorch installation here if this goes wrong: https://pytorch.org/get-started/locally/) |
| 123 | +3. Activate the environment `conda activate pt` |
| 124 | +
|
| 125 | +#### Get some code to run MNIST experiments. |
| 126 | +
|
| 127 | +Get some code to run MNIST in PyTorch and run it: |
| 128 | + - `mkdir ~/projects` |
| 129 | + - `cd ~/projects` |
| 130 | + - `git clone https://github.com/pytorch/examples.git` |
| 131 | +
|
| 132 | +##### Interactive jobs (without a GPU) |
| 133 | +
|
| 134 | +1. Get an interactive session (you shouldn’t do processing on the head node) |
| 135 | +    - Find the partitions available to you (on some clusters, interactive partitions have "interactive" in the name). For example:
| 136 | + ``` |
| 137 | + $ sinfo -o '%R;%N;%l' | column -s';' -t |
| 138 | + PARTITION NODELIST TIMELIMIT |
| 139 | + ILCC_GPU barre,duflo,greider,levi,mcclintock,moser,nuesslein,ostrom 10-00:00:00 |
| 140 | + CDT_GPU arnold,strickland 10-00:00:00 |
| 141 | + ILCC_CPU bravas,kinloch,rockall,stkilda 10-00:00:00 |
| 142 | + M_AND_I_GPU bonsall,buccleuch,chatelet,davie,elion,gibbs,livy,nicolson,quarry,snippy,tangmere,tomorden,yonath 10-00:00:00 |
| 143 | + ``` |
| 144 | +
|
| 145 | + - Use srun to get an interactive session on that partition. For example: |
| 146 | + ``` |
| 147 | + srun --partition=ILCC_GPU --time=08:00:00 --mem=8000 --cpus-per-task=4 --pty bash |
| 148 | + ``` |
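As an aside, the `%R;%N;%l` format makes this output easy to script against. A sketch over a saved copy of part of the sample output above (on the cluster you would pipe `sinfo -o '%R;%N;%l'` directly instead of using the `sample` variable):

```
# Extract just the partition names from the ';'-separated sinfo output.
# "sample" reproduces part of the output shown above, for illustration only.
sample='PARTITION;NODELIST;TIMELIMIT
ILCC_GPU;barre,duflo,greider,levi;10-00:00:00
CDT_GPU;arnold,strickland;10-00:00:00'

echo "$sample" | tail -n +2 | cut -d';' -f1   # drops the header, keeps field 1
```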
| 149 | +
|
| 150 | +2. Run example MNIST code (you will find this very slow): |
| 151 | + - `cd ~/cluster-scripts/examples/mnist` |
| 152 | + - `conda activate pt` |
| 153 | + - `python main.py # Check this runs, but you can cancel it anytime with CTRL+C` |
| 154 | +
|
| 155 | +Please note: this is going to download data to the Distributed Filesystem (i.e. in your current working directory) and the code will access the data from there: this is not good practice on this cluster (because it will be very slow) - best practice says to store and access data in the scratch space of the node you’re running on. We will cover this in the lecture. |
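The scratch-space pattern described here can be sketched as a small staging helper. This is a hedged sketch, not cluster-provided tooling: `SCRATCH_BASE`, `SRC_DATA` and the `myjob` directory name are illustrative placeholders (the `/disk/scratch` default is the location mentioned earlier).

```
# Sketch of the scratch-disk pattern: stage input on node-local scratch,
# compute there, then copy results back to the shared filesystem once.
# SCRATCH_BASE, SRC_DATA and "myjob" are illustrative placeholders.
SCRATCH_BASE="${SCRATCH_BASE:-/disk/scratch}"    # node-local scratch root
SRC_DATA="${SRC_DATA:-${HOME}/projects/data}"    # data on the shared filesystem

stage_and_run() {
    local job_dir="${SCRATCH_BASE}/${USER}/myjob"
    mkdir -p "${job_dir}/data" "${job_dir}/output" "${SRC_DATA}-output"
    cp -a "${SRC_DATA}/." "${job_dir}/data/"           # stage input once
    # ... run training here, reading/writing only under ${job_dir} ...
    cp -a "${job_dir}/output/." "${SRC_DATA}-output/"  # copy results back once
}
```

Calling `stage_and_run` at the top of a job keeps all frequent reads and writes off the distributed filesystem.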
| 156 | +
|
| 157 | +3. Exit your interactive session by running `exit` |
| 158 | +
|
| 159 | +##### Interactive jobs (with a GPU) |
| 160 | +
|
| 161 | +1. Launch a similar `srun` command using the `gres` argument to request a GPU in your job: |
| 162 | + ``` |
| 163 | + srun --partition=ILCC_GPU --time=08:00:00 --mem=14000 --cpus-per-task=4 --gres=gpu:1 --pty bash |
| 164 | + ``` |
| 165 | +2. Run example MNIST code (should be much faster): |
| 166 | +
|
| 167 | + - `cd ~/cluster-scripts/examples/mnist` |
| 168 | + - `conda activate pt` |
| 169 | + - `python main.py` |
| 170 | +
|
| 171 | +3. Exit your interactive session by running `exit` |
| 172 | +
|
| 173 | +##### Batch Jobs (non-interactive) |
| 174 | +
|
| 175 | +Repeat the above, but this time use an `sbatch` script (non-interactive session). The command `sbatch` takes many of the same arguments as `srun`; for example, add `--gres=gpu:1` if you would like to use one GPU.
| 176 | +
|
| 177 | +- `cd ~/cluster-scripts/examples/mnist` |
| 178 | +- Create a bash script, `mnist_expt.sh`, for Slurm to run:
| 179 | + ``` |
| 180 | + #!/usr/bin/env bash |
| 181 | +    conda activate pt  # Alternatively: source /home/${USER}/miniconda3/bin/activate pt
| 182 | + python main.py |
| 183 | + ``` |
| 184 | + - Run this script by running: `sbatch --time=08:00:00 --mem=14000 --cpus-per-task=4 --gres=gpu:1 mnist_expt.sh` |
| 185 | + - Observe your job running with: `squeue` |
| 186 | + - You can get information about your jobs with `jobinfo -u ${USER}` |
| 187 | + - Check out the log file with `cat slurm-*.out`. This will be in the working directory where you ran the `sbatch` command from. |
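Rather than passing every resource request on the `sbatch` command line, the same options can be embedded in the script as `#SBATCH` directives. A sketch using the values from the example above (the activation line assumes the miniconda3 install location from earlier):

```
# Write an equivalent submission script with resources as #SBATCH directives,
# so a plain "sbatch mnist_expt.sh" is enough to submit it.
cat > mnist_expt.sh <<'EOF'
#!/usr/bin/env bash
#SBATCH --time=08:00:00
#SBATCH --mem=14000
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
source /home/${USER}/miniconda3/bin/activate pt
python main.py
EOF
```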
| 188 | +
|
| 189 | +## Useful Documentation and Links |
| 190 | +
|
| 191 | +### Computing support |
| 192 | + - Main page: http://computing.help.inf.ed.ac.uk/cluster-computing |
| 193 | + - Explanation of filesystem, and best practice: http://computing.help.inf.ed.ac.uk/cluster-tips |
| 194 | +
|
| 195 | +### Slurm docs |
| 196 | + - Quick start: https://slurm.schedmd.com/quickstart.html |
| 197 | + - sbatch: https://slurm.schedmd.com/sbatch.html |
| 198 | + - srun: https://slurm.schedmd.com/srun.html |
| 199 | + - array jobs: https://slurm.schedmd.com/job_array.html |
| 200 | +
|
| 201 | +### Other: |
| 202 | + - Setting up VPN for self managed machines/laptops: http://computing.help.inf.ed.ac.uk/openvpn |
| 203 | + - Logging in to the informatics machines remotely: http://computing.help.inf.ed.ac.uk/external-login |
| 204 | +
|
| 205 | +
|
| 206 | +## Example `.bashrc` file |
| 207 | +
|
| 208 | +You can keep one `~/.bashrc` and make an additional `~/.bash_profile` that just runs the `~/.bashrc` (as we did earlier). This file should be run every time you log in. Different files are run depending on whether you're logging in interactively or non-interactively, and to a login or non-login shell. For more information, see: https://www.gnu.org/software/bash/manual/html_node/Bash-Startup-Files.html.
| 209 | +
|
| 210 | +For more information about bash start-up files for DICE machines, see http://computing.help.inf.ed.ac.uk/dice-bash. |
| 211 | +
|
| 212 | +This is an example `~/.bashrc` you can use as guidance. |
| 213 | +
|
| 214 | +``` |
| 215 | +# Allow resourcing of this file without continually lengthening path |
| 216 | +# i.e. this resets path to original on every source of this file |
| 217 | +if [ -z "${orig_path}" ]; then
| 218 | + export orig_path=$PATH |
| 219 | +else |
| 220 | + export PATH=$orig_path |
| 221 | +fi |
| 222 | + |
| 223 | +# This is so jupyter lab doesn't give the permission denied error |
| 224 | +export XDG_RUNTIME_DIR="" |
| 225 | + |
| 226 | +# This part is added automatically by the conda installation ================ |
| 227 | +# >>> conda initialize >>> |
| 228 | +# !! Contents within this block are managed by 'conda init' !! |
| 229 | +__conda_setup="$('/home/${USER}/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" |
| 230 | +if [ $? -eq 0 ]; then |
| 231 | + eval "$__conda_setup" |
| 232 | +else |
| 233 | + if [ -f "/home/${USER}/miniconda3/etc/profile.d/conda.sh" ]; then |
| 234 | + . "/home/${USER}/miniconda3/etc/profile.d/conda.sh" |
| 235 | + else |
| 236 | + export PATH="/home/${USER}/miniconda3/bin:$PATH" |
| 237 | + fi |
| 238 | +fi |
| 239 | +unset __conda_setup |
| 240 | +# <<< conda initialize <<< |
| 241 | + |
| 242 | +conda activate |
| 243 | +# =========================================================================== |
| 244 | + |
| 245 | +# environment variable for your AFS home space |
| 246 | +export AFS_HOME=/afs/inf.ed.ac.uk/user/${USER:0:3}/$USER |
| 247 | + |
| 248 | +# Add cluster-scripts to path for easy use (explained in README) |
| 249 | +export PATH=/home/${USER}/cluster-scripts:$PATH |
| 250 | +source /home/${USER}/cluster-scripts/job-id-completion.sh |
| 251 | + |
| 252 | +# Useful auto args for ls |
| 253 | +alias ls='ls --color=auto' |
| 254 | +alias ll='ls -alhF' |
| 255 | +``` |
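One line in the example above that may look cryptic is the `${USER:0:3}` in `AFS_HOME`: it is bash substring expansion, taking the first three characters of your username. A quick illustration (the username is hypothetical):

```
# ${var:offset:length} is bash substring expansion.
# "tsherbor" is a hypothetical username used only for illustration.
example_user=tsherbor
echo "/afs/inf.ed.ac.uk/user/${example_user:0:3}/${example_user}"
# prints: /afs/inf.ed.ac.uk/user/tsh/tsherbor
```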