python-batchtools is a CLI for students and researchers to
submit GPU batch jobs through Kueue-managed GPU queues on an
OpenShift cluster. It provides an inexpensive and accessible way to use
GPU hardware without reserving dedicated GPU nodes.
Users submit GPU jobs with a single command:
batchtools br "./cuda_program"
The CLI will automatically:
- Create the batch job
- Submit it to the appropriate Kueue-managed LocalQueue
- Track job status
- Stream logs on completion

git clone https://github.com/memalhot/python-batchtools.git
cd python-batchtools
pip install -e .

- A Kueue-enabled OpenShift cluster with LocalQueues named v100-localqueue, a100-localqueue, h100-localqueue, and dummy-localqueue
- An OpenShift account
- The Python OpenShift client:
pip install openshift-client
- RBAC permissions that give the user access to jobs and to Kueue resources such as LocalQueues and ClusterQueues. See the rbac folder for setup.
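
The LocalQueue names listed in the prerequisites follow a <gpu>-localqueue pattern. As a minimal sketch, assuming the GPU type passed to --gpu maps directly onto that pattern, the lookup could be written as below. The helper local_queue_for and the KNOWN_GPU_TYPES tuple are hypothetical and are not part of batchtools; treating "dummy" as a selectable GPU type is also an assumption.

```python
# Hypothetical helper, not part of batchtools: maps a GPU type to the
# LocalQueue name that the prerequisites expect to exist on the cluster.
KNOWN_GPU_TYPES = ("v100", "a100", "h100", "dummy")  # assumed from the queue names


def local_queue_for(gpu_type: str) -> str:
    """Return the LocalQueue name for a GPU type, e.g. 'v100' -> 'v100-localqueue'."""
    gpu = gpu_type.strip().lower()
    if gpu not in KNOWN_GPU_TYPES:
        raise ValueError(
            f"unknown GPU type {gpu_type!r}; expected one of {', '.join(KNOWN_GPU_TYPES)}"
        )
    return f"{gpu}-localqueue"


if __name__ == "__main__":
    for gpu in KNOWN_GPU_TYPES:
        print(gpu, "->", local_queue_for(gpu))
```

Rejecting unknown GPU types up front gives a clearer error than a job that is submitted to a queue that does not exist.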
For any command you can run:
batchtools <command> -h or batchtools <command> --help

The br command submits batch jobs. It sends code intended to run on GPUs to Kueue, where the job is queued and then run. Its logs are stored in the RUNDIR, and the job is then deleted to conserve resources.
Here's how to use the br command:
First write a CUDA program and compile it :D
Then to submit your CUDA program to the GPU node:
batchtools br "./cuda-code"
Submit a program with arguments:
batchtools br './simulate --steps 1000'
Specify GPU type:
batchtools br --gpu v100 "./train_model"
Run without waiting for logs (for longer runs, similar to a more traditional batch system):
batchtools br --no-wait "./cuda_program"
WARNING
If you run br with the --no-wait flag, the job will not be cleaned up for you. You must delete it yourself by running batchtools bd <job-name> or oc delete job <job-name>.
But don't worry, running with --no-wait will give you a reminder to delete your jobs!
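
For scripted submissions, the br invocations above can be assembled programmatically. The following is a minimal sketch that only builds argument lists from the flags shown in this README (--gpu and --no-wait) and does not run anything. The function names build_br_command and build_bd_command are hypothetical, and combining --gpu with --no-wait in one call is an assumption.

```python
import shlex


def build_br_command(program, gpu=None, wait=True):
    """Build the argv for `batchtools br` (hypothetical helper, not part of batchtools).

    `program` is passed as a single argument, matching the quoted form
    used in the examples, e.g. './simulate --steps 1000'.
    """
    argv = ["batchtools", "br"]
    if gpu is not None:
        argv += ["--gpu", gpu]
    if not wait:
        # Jobs submitted with --no-wait are not cleaned up automatically.
        argv.append("--no-wait")
    argv.append(program)
    return argv


def build_bd_command(*job_names):
    """Build the argv for deleting specific jobs with `batchtools bd`."""
    if not job_names:
        # `batchtools bd` with no names deletes all jobs, so refuse to build it here.
        raise ValueError("at least one job name is required")
    return ["batchtools", "bd", *job_names]


if __name__ == "__main__":
    # Print shell-quoted commands instead of executing them.
    print(shlex.join(build_br_command("./simulate --steps 1000", gpu="v100")))
    print(shlex.join(build_br_command("./cuda_program", wait=False)))
    print(shlex.join(build_bd_command("job-a", "job-b")))
```

The resulting lists can be handed to subprocess.run as-is. Requiring at least one job name in build_bd_command guards against accidentally deleting every job from a cleanup script.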
And if you need help or want to see more flags:
batchtools br --h

List all jobs:
batchtools bj

Delete all jobs:
batchtools bd
To delete specific jobs:
batchtools bd job-a job-b

batchtools bps
Output will be empty if all nodes are free.
If some nodes are busy:
wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection
To always get output, run:
batchtools --verbose bps
This produces output like:
ctl-0: FREE
ctl-1: FREE
ctl-2: FREE
wrk-0: FREE
wrk-1: FREE
wrk-3: FREE
wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection
wrk-5: FREE
wrk-6: FREE
wrk-7: FREE
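
If you want to consume this output from a script, each line can be split into a node name, a FREE/BUSY state, and any trailing entries. The sketch below is based only on the sample output above. The names NodeStatus, parse_bps_line, and parse_bps are hypothetical, and treating the number after BUSY as a count followed by one entry per workload is an assumption about the format.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    name: str
    busy: bool
    count: int = 0  # number printed after BUSY (meaning assumed, kept as-is)
    workloads: list = field(default_factory=list)  # trailing entries, e.g. "test/fraud-detection"


def parse_bps_line(line: str) -> NodeStatus:
    """Parse one line such as 'wrk-4: BUSY 3 a/b c/d e/f' or 'ctl-0: FREE'."""
    name, sep, rest = line.partition(":")
    tokens = rest.split()
    if not sep or not name.strip() or not tokens or tokens[0] not in ("FREE", "BUSY"):
        raise ValueError(f"unrecognized bps line: {line!r}")
    if tokens[0] == "FREE":
        return NodeStatus(name=name.strip(), busy=False)
    count = int(tokens[1]) if len(tokens) > 1 else 0
    return NodeStatus(name=name.strip(), busy=True, count=count, workloads=tokens[2:])


def parse_bps(output: str) -> list:
    """Parse the full output, skipping blank lines."""
    return [parse_bps_line(line) for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = (
        "ctl-0: FREE\n"
        "wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection\n"
    )
    for node in parse_bps(sample):
        print(node.name, "busy" if node.busy else "free", node.workloads)
```

Because plain bps prints nothing when every node is free, a script that needs one record per node should parse the --verbose output instead.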

batchtools bl
For a specific pod:
batchtools bl pod-name

batchtools bq
Output will look like:
a100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 0 BestEffortFIFO
dummy-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 0 BestEffortFIFO
h100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 0 BestEffortFIFO
v100-clusterqueue admitted: 0 pending: 0 reserved: 0 GPUs: 3 BestEffortFIFO

Install uv:
pipx install uv
Install pre-commit:
pipx install pre-commit
Activate hooks:
pre-commit install

uv run pytest
Coverage report is generated at:
htmlcov/index.html