Add multi-node verl SLURM job #1798
Conversation
@@ -0,0 +1,5 @@
working_dir: ./
Is this file needed? How would a user change the conda environment, etc.?
It was recommended in the verl guide. They could update the value in this file, but given the pattern we use in our job configs (copying our working directory to the Slurm cluster and installing it in the oumi conda env), hardcoding it makes sense IMO.
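For context, a minimal sketch of the setup pattern referenced above, assuming the launcher has already copied the local working directory onto the Slurm node and that the shared conda environment is named `oumi`; the conda path and the editable install are illustrative assumptions, not the job config's actual contents:

```sh
# Illustrative setup step only; the env name "oumi", the conda install
# location, and the editable install are assumptions, not taken from this PR.
source "$HOME/miniconda3/etc/profile.d/conda.sh"  # assumed conda location
conda activate oumi                               # assumed shared env name
pip install -e .                                  # install the copied working_dir
```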
- How about the environment "oumi"?
- I think this would not work with pip-installed oumi, right?
+1, how would this work for users who are pip-installing oumi from PyPI?
# - Run step 1 of verl quickstart: https://verl.readthedocs.io/en/latest/start/quickstart.html
#
# Usage:
# oumi launch up -c configs/examples/misc/grpo_verl_gsm8k_slurm_job.yaml --cluster $OUMI_SLURM_CONNECTIONS --user wizeng
Suggested change:
- # oumi launch up -c configs/examples/misc/grpo_verl_gsm8k_slurm_job.yaml --cluster $OUMI_SLURM_CONNECTIONS --user wizeng
+ # oumi launch up -c configs/examples/misc/grpo_verl_gsm8k_slurm_job.yaml --cluster $OUMI_SLURM_CONNECTIONS --user $USER
Description
- configs/examples/misc/slurm_ray_init.sh, which sets up a Ray cluster on Slurm nodes (see the sketch below for the general pattern). Reference: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
- configs/examples/misc/grpo_verl_gsm8k_slurm_job.yaml, for running verl on Slurm using an Oumi job config
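A minimal sketch of the Ray-on-Slurm setup pattern that a script like slurm_ray_init.sh typically performs, adapted from the Ray documentation linked above; the exact flags, port, and timing are illustrative assumptions, not the script's actual contents:

```bash
#!/bin/bash
# Illustrative Ray-on-Slurm bootstrap, following the pattern in the Ray docs
# linked above. This is a sketch, not the contents of slurm_ray_init.sh;
# the port and sleep duration are assumptions.

# Resolve the allocated nodes and pick the first one as the Ray head.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379  # assumed Ray head port

# Start the Ray head process on the head node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_ip" --port="$port" --block &
sleep 10  # give the head a moment to come up

# Start a Ray worker on each remaining node, joining the head.
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" \
        ray start --address="$head_ip:$port" --block &
done
```

Once the cluster is up, the verl training step defined in the job config would typically be launched from the head node against this Ray cluster.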
Related issues
Towards OPE-1338
Before submitting