Jobman-v2 is a modular and extensible job management system for TPU VMs.
- 2025-09-21: You can now configure Jobman to send you an email when TPU resources are allocated. See here and here for details.
- 2025-09-08: I discovered tpu-pod-commander by Young Geng at DeepMind, which offers functionality very similar to Jobman's but is much simpler in design. Jobman tries to handle everything for the user, which can make it hard to understand or debug, whereas tpu-pod-commander stays much closer to the generic TPU interfaces. Feel free to switch to tpu-pod-commander if you prefer its design.
- 2025-09-06: Added an automatic checker to verify the bucket region and the TPU VM zone match, so that you can ensure your storage costs are minimized.
- 2025-08-28: Added support for conda and venv as well as unit tests.
- 2025-08-22: Added quota and storage viewer.
To use Jobman, first make sure `gcloud` is available on your machine. You may refer to the official doc to install it.
Afterwards, also install the alpha and beta components:
gcloud components install alpha beta
Log in with your gcloud account:
gcloud auth login
gcloud auth application-default login
Also make sure tmux is installed:
tmux -V
If not, follow the tmux wiki to install it.
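The prerequisite checks above can also be scripted. The following is a minimal sketch, not part of Jobman: `REQUIRED_TOOLS` and `missing_tools` are hypothetical names introduced here, and the check only confirms that the executables are on `PATH`, not that you are logged in.

```python
import shutil

# Tools this README asks you to install before using Jobman.
# (Hypothetical constant for this sketch; not defined by Jobman.)
REQUIRED_TOOLS = ("gcloud", "tmux")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.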
Lastly, build the jobman package from source:
python -m pip install --upgrade pip
pip install -e .
Try the following command to submit a minimal job:
jobman create configs/quick_start.yaml
Then check its status:
jobman list
Before you start using Jobman for your own jobs, be sure to go through GET_STARTED.md. It covers the setup you need before you can run them.
This section differs from the Get Started section in that it briefly explains how Jobman works. Jobman treats each job as a data structure (a class) with:
- a life cycle (queueing, running, idle, and dead) managed by a centralized data structure, `jobman`. Specifically, `jobman` creates and kills tmux sessions to manage the jobs in the backend.
- the corresponding TPU, SSH, gcsfuse, and environment configs as attributes.
- all logs saved to `jobs/<user_id>/<job_id>/logs`.
- Since jobs live as tmux sessions, it is best to run this tool on a remote host rather than a local machine, because tmux sessions may die when you shut down your machine.
- On the other hand, `jobman` lives as several local data files inside `jobs/.jobman` and uses a lock to keep them consistent. Therefore, please do not modify the files in `jobs/.jobman` unless you know what you're doing. (If you cannot find `jobs/.jobman`, that is normal: it is created after you run your first job.)
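To make the description above more concrete, here is a rough sketch of what such a job data structure could look like. It is an illustration only, not Jobman's actual code: the class names `JobState` and `Job`, the field names, and the use of plain dicts for the configs are all assumptions. Only the four state names and the log path layout come from this README.

```python
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path


class JobState(Enum):
    """The four job states mentioned in this README."""
    QUEUEING = "QUEUEING"
    RUNNING = "RUNNING"
    IDLE = "IDLE"
    DEAD = "DEAD"


@dataclass
class Job:
    # Hypothetical fields; Jobman's real class may be organized differently.
    user_id: str
    job_id: str
    state: JobState = JobState.QUEUEING
    # Placeholder configs for the TPU, SSH, gcsfuse, and environment settings.
    tpu: dict = field(default_factory=dict)
    ssh: dict = field(default_factory=dict)
    gcsfuse: dict = field(default_factory=dict)
    env: dict = field(default_factory=dict)

    @property
    def log_dir(self) -> Path:
        """Log directory following the jobs/<user_id>/<job_id>/logs layout."""
        return Path("jobs") / self.user_id / self.job_id / "logs"


if __name__ == "__main__":
    job = Job(user_id="alice", job_id="000001")
    print(job.state.value, job.log_dir.as_posix())
```

The sketch leaves out the tmux session handling and the lock on `jobs/.jobman`, since their details are not described here.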
Boya Zeng has created a comprehensive guide covering various problems and tips when using TPUs. It answers most of the questions you may have regarding TPUs. This project also provides a simple job management script.
The design of Jobman is somewhat complex, but it aims to provide the easiest user interface so that users unfamiliar with TPUs can get started quickly.
For a simpler setup tool, you may refer to other_resources/ultra_create_tpu.sh by Peter Tong.
Boyang Zheng has also developed a brilliant Slack chatbot that 1) automatically deletes dead TPU VMs and 2) profiles daily usage and sends the report to their Slack channel. You may refer to it at other_resources/slack_chatbot.
Coming soon
- Q: I ran `jobman create <config_path>` but nothing happens. What should I do?
  A: Under the hood, `jobman create` creates the job directory and starts the job process with tmux in the backend. If the job process fails, it fails silently since it runs inside tmux.
  The first debugging step is to run `jobman run <job_id>`, where `<job_id>` is the id of the job you just created. This runs the job in the foreground. If this gets stuck as well, please check whether the `gcloud` command works on your machine.
- Q: How can I validate the job status displayed in `jobman list`?
  A: Although in my use cases the four job states (QUEUEING, RUNNING, IDLE, DEAD) are mostly accurate, it's always a good idea to verify the TPU state on Google Cloud Console. If you observe inconsistencies between `jobman list` and Google Cloud Console, kindly open an issue and report the bug.
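For the first question above, one way to check whether `gcloud` responds at all is a quick call with a timeout. This is a sketch under assumptions, not something Jobman ships: `command_responds` is a hypothetical helper, and the 30-second timeout is an arbitrary choice.

```python
import subprocess


def command_responds(cmd, timeout_s=30.0):
    """Return True if `cmd` exits with status 0 within `timeout_s` seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing or non-executable binary;
        # TimeoutExpired covers a command that hangs.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # `gcloud --version` only prints the installed component versions.
    if command_responds(["gcloud", "--version"]):
        print("gcloud responds")
    else:
        print("gcloud is missing or not responding")
```

A passing check only shows that the binary runs; it does not confirm that your account is authenticated or that you have TPU quota.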
- If you have any issues with this project or want to contribute to it, please first open an issue in the `Issues` section. This will be of great help to the maintenance of this project!
- You may also contact Yufeng Xu [email protected] for further communication.
- Also, if you would like to contribute to this project, please refer to CONTRIBUTING.md.