Scripts for launching hyperparameter sweeps on a SLURM cluster.
The Python scripts are configured to use the system installation of Python 3 (#!/usr/bin/python3); they therefore rely only on the standard library and are compatible with Python >= 3.6. The other scripts use bash.
To set up scripts in a new repository new_repo, run
./setup.sh new_repo
This will symlink scripts for running jobs and copy over example sweep configurations.
From the new repository, sweeps can be configured by creating a JSON file; see `example.json` for an example.
Each key in the JSON file corresponds to a separate command-line argument. The key's value can be a list of values to sweep over, a single value to set, or a dictionary.
If the key points to a dictionary, that dictionary can have the following key-value pairs:
- `key` must be a string. This option can be used to sweep multiple hyperparameters together under one `key`. For example, we may want to set `--dropout 0` when `--batchnorm` is set and `--dropout .5` when there is no batchnorm; see the entries with key `no_dropout_with_bn` in `example.json` for an example. Note that the value of `key` CANNOT conflict with the name of any other argument set in the JSON file.
- `values` must be a list of values to sweep over. It should have the same length for all arguments that share the same sweep `key`.
- `dist`, `start`, `stop`, and `num` can be specified instead of `values`. `dist` gives the distribution of the swept values: `lin`, `ln` (base e), or `log` (base 10); a custom log base can be specified by appending a number to `log`, e.g. `log2`, `log3`, `log1.5`. `start` and `stop` give the left and right endpoints for the values (inclusive), and `num` gives the number of values to use. `dtype` can be set to `float` (the default) or `int`.
- `one_hot_sweep` can be specified instead of `values`. This option sweeps over a `key` consisting entirely of boolean arguments by turning on exactly one of them at a time. For example, we may want to try `--batch_norm`, `--group_norm`, and `--layer_norm` individually.
If the key points to a boolean (true or false), the argument is passed as a bare flag, `--arg`, rather than as `--arg [value]`.
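For illustration, a hypothetical config combining these options might look like the sketch below. All argument names and values here are invented; `example.json` remains the authoritative reference for the exact format.

```json
{
    "seed": [0, 1, 2],
    "epochs": 100,
    "save_checkpoints": true,
    "lr": {
        "dist": "log",
        "start": 0.0001,
        "stop": 0.1,
        "num": 4
    },
    "hidden_dim": {
        "dist": "log2",
        "start": 64,
        "stop": 512,
        "num": 4,
        "dtype": "int"
    },
    "batchnorm": {
        "key": "no_dropout_with_bn",
        "values": [true, false]
    },
    "dropout": {
        "key": "no_dropout_with_bn",
        "values": [0, 0.5]
    }
}
```

In this sketch, `seed` sweeps over a list, `epochs` is fixed, `save_checkpoints` is passed as a bare `--save_checkpoints` flag, `lr` and `hidden_dim` are generated from `dist`/`start`/`stop`/`num` (log base 10 and log base 2, respectively), and `batchnorm`/`dropout` are tied together under the sweep key `no_dropout_with_bn`, so dropout is 0 exactly when batchnorm is enabled.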
A hyperparameter sweep can be launched as follows:
./batch.py PARTITION JOB_NAME FILE_TO_RUN SLURM_QOS CONFIG
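For example, assuming a hypothetical partition named gpu, a QOS named normal, and a training script train.py, the invocation might look like
./batch.py gpu dropout_sweep train.py normal example.json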
Run ./batch.py -h to see more options. By default, job output is saved under experiments/YYYY-MM-DD-HH-MM-SS, with a subdirectory for each hyperparameter configuration in the sweep.
Add the bash aliases defined in .bash_profile; then run q to see the SLURM queue for your own jobs and sq to see the SLURM queue for all jobs.
The file check.py is automatically copied into experiments/YYYY-MM-DD-HH-MM-SS for each sweep. Run ./check.py from within that directory to see the final line of output for each job in the sweep, and ./check.py -h to see the full list of options.
Run ./get.py YYYY-MM-DD-HH-MM-SS to download the experiment experiments/YYYY-MM-DD-HH-MM-SS from the cluster to your local machine via rsync. Run ./get.py -h to see the full list of options.
Run scancel -u $USER to cancel all of your jobs. Run ./cancel.sh jobid_start jobid_end to scancel jobs with IDs jobid_start through jobid_end (inclusive).
If a job is preempted after new changes to the repo have been pulled, it will run the newly updated code when it relaunches. To prevent potential problems, it is best practice to keep pulled changes backwards compatible while jobs are in progress.
Originally inspired by nng555/cluster_examples.