python-batchtools

Overview

python-batchtools is a CLI for students and researchers to submit GPU batch jobs through Kueue-managed GPU queues on an OpenShift cluster. It provides an inexpensive and accessible way to use GPU hardware without reserving dedicated GPU nodes.

Users submit GPU jobs with a single command:

batchtools br "./cuda_program"

The CLI automatically:

  • Creates the batch job (a sketch of the generated Job manifest is shown below)
  • Submits it to the appropriate Kueue-managed LocalQueue
  • Tracks job status
  • Streams logs on completion
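
Under the hood, this amounts to creating a Kubernetes Job that carries Kueue's queue-name label. Here is a minimal, hypothetical sketch of such a manifest in Python; the image name, defaults, and helper name are illustrative assumptions, not batchtools' actual values:

# A sketch (NOT the exact manifest batchtools generates) of a Kueue-labeled Job.
# The kueue.x-k8s.io/queue-name label routes the Job to a LocalQueue; Kueue
# admits the Job by unsuspending it.
def make_job_manifest(name: str, command: str, queue: str = "v100-localqueue") -> dict:
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            "labels": {"kueue.x-k8s.io/queue-name": queue},  # target LocalQueue
        },
        "spec": {
            "suspend": True,   # created suspended; Kueue unsuspends on admission
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "gpu-workload",
                        "image": "my-cuda-image:latest",  # assumption: user-supplied image
                        "command": ["/bin/sh", "-c", command],
                        "resources": {"limits": {"nvidia.com/gpu": 1}},  # request one GPU
                    }],
                }
            },
        },
    }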

For Users

Installation

Option 1: Use the provided container image (recommended)

Option 2: Install from source

git clone https://github.com/memalhot/python-batchtools.git
cd python-batchtools
pip install -e .

Prerequisites

  1. A Kueue-enabled OpenShift cluster with LocalQueues named v100-localqueue, a100-localqueue, h100-localqueue, and dummy-localqueue (a quick way to verify them is sketched after this list)
  2. An OpenShift account
  3. The Python OpenShift client:
pip install openshift-client
  4. RBAC permissions giving the user access to Jobs and to Kueue resources such as LocalQueues and ClusterQueues. See the rbac folder for setup.
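
As a quick sanity check, you can verify the LocalQueues from prerequisite 1 with the openshift-client library from prerequisite 3. This is a hypothetical helper, not part of batchtools, and it assumes you are already logged in via oc login:

# Hypothetical pre-flight check: confirm the expected LocalQueues exist in the
# current project.
import openshift_client as oc

EXPECTED = {"v100-localqueue", "a100-localqueue", "h100-localqueue", "dummy-localqueue"}

with oc.project(oc.get_project_name()):  # operate in the current project
    found = {q.name() for q in oc.selector("localqueues").objects()}

missing = EXPECTED - found
if missing:
    print(f"Missing LocalQueues: {', '.join(sorted(missing))}")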

Usage Examples

For any command, you can run batchtools <command> -h or batchtools <command> --help.

1. Submit a Batch Job --- br

The br command submits batch jobs. It sends code intended to run on GPUs to Kueue, where the job is queued and then run; its logs are stored in the RUNDIR, and the job is then deleted to conserve resources (this lifecycle is sketched below).
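
A rough sketch of that lifecycle using the openshift-client library (the function name is illustrative; this is not batchtools' actual implementation):

# Wait for the Job to finish, print its pods' logs, then delete it.
import time
import openshift_client as oc

def wait_print_delete(job_name: str) -> None:
    job_sel = oc.selector(f"job/{job_name}")
    while True:
        status = job_sel.object().model.status
        if status.succeeded or status.failed:  # Job reached a terminal state
            break
        time.sleep(5)
    # Pods created by a Job carry a job-name label pointing back at it.
    for pod in oc.selector("pods", labels={"job-name": job_name}).objects():
        for _, text in pod.logs().items():  # logs() maps source -> log text
            print(text)
    job_sel.delete()  # clean up the finished Job to conserve resources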

Here's how to use the br command:

First, write a CUDA program and compile it (for example, with nvcc) :D
Then submit the compiled binary to a GPU node:

batchtools br "./cuda-code"

Submit a program with arguments:

batchtools br './simulate --steps 1000'

Specify GPU type:

batchtools br --gpu v100 "./train_model"

Run without waiting for logs (for longer runs, similar to a more traditional batch system):

batchtools br --no-wait "./cuda_program"

WARNING
If you run br with the --no-wait flag, the job will not be cleaned up for you. You must delete it yourself by running batchtools bd <job-name> or oc delete job <job-name>. But don't worry, running with --no-wait will give you a reminder to delete your jobs!

And if you need help or want to see more flags:

batchtools br --help

2. List Jobs --- bj

List all jobs:

batchtools bj

3. Delete Jobs --- bd

Delete all jobs:

batchtools bd

To delete specific jobs:

batchtools bd job-a job-b

4. List active GPU pods per node --- bps

batchtools bps

Output will be empty if all nodes are free.

If some nodes are busy:

wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection

To always see a line for every node, run:

batchtools --verbose bps

To get output like:

ctl-0: FREE
ctl-1: FREE
ctl-2: FREE
wrk-0: FREE
wrk-1: FREE
wrk-3: FREE
wrk-4: BUSY 3 project-1/project-stuff testing/other-stuff test/fraud-detection
wrk-5: FREE
wrk-6: FREE
wrk-7: FREE
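
An illustrative sketch of the grouping behind this output (not the actual implementation; a real version would also filter for pods that request nvidia.com/gpu):

# Bucket pods across namespaces by the node they run on.
from collections import defaultdict
import openshift_client as oc

by_node = defaultdict(list)
for pod in oc.selector("pods", all_namespaces=True).objects():
    node = pod.model.spec.nodeName
    if node:
        by_node[node].append(f"{pod.model.metadata.namespace}/{pod.name()}")

# FREE nodes (as in the --verbose output) would come from a separate node listing.
for node, pods in sorted(by_node.items()):
    print(f"{node}: BUSY {len(pods)} " + " ".join(pods))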

5. Show pod logs --- bl

batchtools bl

For a specific pod:

batchtools bl pod-name

6. Show queue status --- bq

batchtools bq

Output will look like:

a100-clusterqueue       admitted: 0     pending: 0      reserved: 0     GPUs: 0 BestEffortFIFO
dummy-clusterqueue      admitted: 0     pending: 0      reserved: 0     GPUs: 0 BestEffortFIFO
h100-clusterqueue       admitted: 0     pending: 0      reserved: 0     GPUs: 0 BestEffortFIFO
v100-clusterqueue       admitted: 0     pending: 0      reserved: 0     GPUs: 3 BestEffortFIFO
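
The admitted and pending counts map onto fields in Kueue's ClusterQueue status. A hypothetical sketch of reading them with openshift-client (not the actual bq code):

import openshift_client as oc

for cq in oc.selector("clusterqueues").objects():
    s = cq.model.status  # Kueue ClusterQueue status
    # reserved counts and GPU totals would come from s.flavorsReservation /
    # s.flavorsUsage; only the workload counters are shown here
    print(f"{cq.name()}\tadmitted: {s.admittedWorkloads or 0}\t"
          f"pending: {s.pendingWorkloads or 0}")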

For Contributors

Tools

Install uv:

pipx install uv

Install pre-commit:

pipx install pre-commit

Activate hooks:

pre-commit install

Running Tests

uv run pytest

A coverage report is generated at:

htmlcov/index.html
