
Support GRPO #3146


Open

wants to merge 8 commits into main

Conversation

@tastelikefeet commented Feb 16, 2025

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Compatible with GRPO:

  1. Able to select the GPU devices used by the turbomind engine
  2. Able to reload model weights

Modification

  1. turbomind.py: support reloading weights and disable the row-major weight-conversion optimization
  2. loader.py: support loading weights from a state_dict
  3. messages.py: add a devices argument

Please check modelscope/ms-swift#3126 for test results. A brief usage sketch of the new devices argument is given below.
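A minimal usage sketch, assuming this PR is applied. The devices field is the one added to TurbomindEngineConfig here; the model name is only an example, and the commented-out reload call is a placeholder name for the reload path added in turbomind.py / loader.py, not a confirmed public API.

from lmdeploy import pipeline, TurbomindEngineConfig

# Pin the turbomind engine to GPU 1 so a GRPO trainer can keep GPU 0,
# without changing CUDA_VISIBLE_DEVICES for the whole process.
backend_config = TurbomindEngineConfig(devices=[1])
pipe = pipeline('Qwen/Qwen2.5-7B-Instruct', backend_config=backend_config)
print(pipe('What is GRPO?'))

# After a training step, the updated weights would be pushed back into the
# engine; the method name below is a placeholder for the new reload path.
# pipe.engine.load_state_dict(policy_model.state_dict())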

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@lvhan028 (Collaborator)

Hi @tastelikefeet, thanks for your contribution. Could you resolve the conflicts in lmdeploy/turbomind/turbomind.py?

* main: (90 commits)
  Fix cogvlm and phi3vision (InternLM#3137)
  support release pipeline (InternLM#3069)
  [ci] fix some fail in daily testcase (InternLM#3134)
  Fix internvl2.5 error after eviction (InternLM#3122)
  fix UT of deepseek chat template (InternLM#3125)
  Update benchmark script and user guide (InternLM#3110)
  bump version to v0.7.0.post3 (InternLM#3115)
  fix postional argument (InternLM#3086)
  remove logitswarper (InternLM#3109)
  [Fix] fix the URL judgment problem in Windows (InternLM#3103)
  fix user guide about cogvlm deployment (InternLM#3088)
  add option max-concurrent-requests for api_server(InternLM#2961)
  bump version to v0.7.0.post2 (InternLM#3094)
  Fix xcomposer2d5 (InternLM#3087)
  Add system role to deepseek chat template (InternLM#3031)
  Update tokenizer (InternLM#3061)
  Add deepseek-r1 chat template (InternLM#3072)
  bump version to v0.7.0.post1 (InternLM#3076)
  More arguments in api_client, update docstrings (InternLM#3077)
  fix sliding window mgr (InternLM#3068)
  ...

# Conflicts:
#	lmdeploy/turbomind/turbomind.py
@tastelikefeet (Author)

Hi @tastelikefeet, thanks for your contribution. Could you resolve the conflicts in lmdeploy/turbomind/turbomind.py?

Solved

@lvhan028 (Collaborator)

There are linting errors, which can be fixed by:

pip install pre-commit
cd lmdeploy
pre-commit install
pre-commit run --all-files

@@ -211,6 +211,7 @@ class TurbomindEngineConfig:
     max_prefill_token_num: int = 8192
     num_tokens_per_iter: int = 0
     max_prefill_iters: int = 1
+    devices: List[int] = field(default_factory=lambda: [0])
Collaborator

Can we specify the CUDA devices via the env var CUDA_VISIBLE_DEVICES?

Author

No. trl specifies which GPU the model is loaded on:
https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L416
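To make that concrete, the placement pattern such a trainer relies on looks roughly like the following (an illustrative sketch, not the actual trl code): training processes occupy the first GPUs and the rollout engine is pinned to the next one, so the engine needs to accept an explicit device id rather than a process-wide CUDA_VISIBLE_DEVICES.

import torch

# Illustrative sketch (not trl's code): with N training processes on
# cuda:0..N-1, the rollout engine is pinned to cuda:N in the same job,
# so hiding GPUs via CUDA_VISIBLE_DEVICES would also hide them from the
# trainer.
num_training_processes = 2
rollout_device = torch.device(f"cuda:{num_training_processes}")  # -> cuda:2
print(rollout_device)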

@lvhan028 (Collaborator) Feb 19, 2025

https://github.com/vllm-project/vllm/blob/d0a7a2769d92619afdcdc3b91c78098eaa9e38c0/vllm/engine/arg_utils.py#L718
According to vllm's EngineArgs definition, the value of device can be one of the following:

DEVICE_OPTIONS = [
    "auto",
    "cuda",
    "neuron",
    "cpu",
    "openvino",
    "tpu",
    "xpu",
    "hpu",
]

I haven't found a case where the vllm engine is built with specific device ids. Could you please provide an example?

@lvhan028 (Collaborator)

Tensor parallelism cases hang, for instance:

lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --tp 2

model_dict[key] = value
else:
model_dict = model_path
for key, value in model_dict.items():
Collaborator

Where is model_dict stored? Is it in CPU memory or GPU memory?

@tastelikefeet (Author) Feb 19, 2025

@tastelikefeet (Author)

pre-commit install

Done

@lvhan028 (Collaborator)

I've noticed modelscope/ms-swift#3126. We'll study this feature and process this PR ASAP.

…oad_state_dict

* commit 'f6f7a5d707e3ccbc69af10babf1c9afcaf72a402':
  fix deepseekv2 has no attribute use_mla error (InternLM#3188)
  fix blocked fp8 moe (InternLM#3181)
  [Feature] support deepseek-vl2 for pytorch engine (InternLM#3149)
  make turbomind support gpu embedding inputs (InternLM#3177)
  fix temperature=0 (InternLM#3176)
  Update qwen2.py (InternLM#3174)
  Fix tool call prompt for InternLM and Qwen (InternLM#3156)
  Use pad_token_id as image_token_id for vl models (InternLM#3158)
  fix default temperature value (InternLM#3166)
  fix min length penalty (InternLM#3150)
  update cuda runtime package dependencies (InternLM#3142)
  fix typing (InternLM#3153)
  support deepseekv2 for maca backend. (InternLM#2918)
  fix the issue that stop_token may be less than defined in model.py (InternLM#3148)
  [fix] fix vl gradio, use pipeline api and remove interactive chat (InternLM#3136)
  [feature] add dlinfer w8a8 support. (InternLM#2988)
  Use aiohttp inside proxy server && add --disable-cache-status argument (InternLM#3020)
  support eos_token list in turbomind (InternLM#3044)
@lvhan028 (Collaborator)

@irexyc will work on this feature.

@irexyc (Collaborator) commented Feb 28, 2025

@tastelikefeet
May I ask: in your use case, is support for device_ids a hard requirement, or would setting CUDA_VISIBLE_DEVICES be enough?

@tastelikefeet (Author) commented Feb 28, 2025

@tastelikefeet May I ask: in your use case, is support for device_ids a hard requirement, or would setting CUDA_VISIBLE_DEVICES be enough?

There are currently two main directions for GRPO training setups:

  1. The trl approach targets smaller clusters; it is simpler to debug and use and sees heavier usage, but it requires the inference engine to support device selection.
  2. The veRL approach targets larger clusters and is harder to debug, but it does not require the inference engine to support device selection.

We currently follow the trl approach, but we are refactoring it to gain stronger scalability while keeping its ease of use and speed.
In practice, vLLM is widely used in GRPO-style training mainly because it:

  • supports many models
  • supports load_weights, or an SPMD mode
  • makes it easy to set devices

In our existing solution, to get the GRPO speedup quickly, we hand-hacked some code and got single-node multi-instance working with both vLLM and LMDeploy. Our future directions are:

  • support for larger models (MP/PP)
  • a more scalable architecture

So personally I think the better direction is to consider support for GRPO and other RL methods holistically, rather than only the needs of our own framework.
For reference~
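For context, the shape of the trl-style loop described above, seen from the inference engine's side, is roughly the following. This is a pure illustration with placeholder names; RolloutEngine, load_weights and generate are not LMDeploy, vLLM, or trl APIs.

class RolloutEngine:
    """Stand-in for a colocated inference engine (placeholder, not a real API)."""

    def load_weights(self, state_dict):
        # Needs a fast in-place weight reload so rollouts stay on-policy.
        ...

    def generate(self, prompts, n=8):
        # Returns a group of n completions per prompt; GRPO normalizes
        # rewards within each group to form the advantages.
        return [[f'{p} -> sample {i}' for i in range(n)] for p in prompts]


def grpo_rollout_step(policy_state_dict, engine, prompts):
    engine.load_weights(policy_state_dict)   # 1. sync trainer -> engine
    groups = engine.generate(prompts, n=8)   # 2. sample rollouts per prompt
    # 3. reward each completion, normalize within its group, and run the
    #    policy-gradient update in the trainer (omitted here).
    return groups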

@tastelikefeet (Author)

Actually, I submitted this PR to illustrate a requirement I have seen; since it is hacked code, it naturally has more issues overall. LMDeploy is an excellent framework, and we hope to build some great products for developers together with you.

@irexyc (Collaborator) commented Feb 28, 2025

@tastelikefeet

Internally we also have some requirements for weight updates and for offloading the inference engine. Since we found that building an empty model accounts for only a small share of the time, for the single-node multi-instance case our initial plan is to implement this by destroying and rebuilding the pipeline / server; we have written two demos, one for the pipeline and one for the server. If you use it through the pipeline API, it may indeed affect training, because CUDA_VISIBLE_DEVICES has to be set.

https://aicarrier.feishu.cn/wiki/VmDlwlqB9iGGOAkSOoucxEVwnPb
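A rough sketch of that destroy-and-rebuild idea, assuming an LMDeploy pipeline and assuming that dropping the old pipeline object plus emptying the CUDA cache is enough to release its GPU memory (the actual demos are in the linked doc; checkpoint_dir stands for wherever the trainer saved the updated weights).

import gc
import torch
from lmdeploy import pipeline, TurbomindEngineConfig


def rebuild_pipeline(checkpoint_dir, tp=1):
    # Assumption: the previous pipeline has already been released by the
    # caller (e.g. `del pipe`), so its GPU memory is free before the new
    # engine starts.
    gc.collect()
    torch.cuda.empty_cache()
    return pipeline(checkpoint_dir, backend_config=TurbomindEngineConfig(tp=tp))


# After each training save:
# del pipe                                   # drop the old engine first
# pipe = rebuild_pipeline('/path/to/grpo_ckpt')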

@tastelikefeet (Author)

@tastelikefeet

Internally we also have some requirements for weight updates and for offloading the inference engine. Since we found that building an empty model accounts for only a small share of the time, for the single-node multi-instance case our initial plan is to implement this by destroying and rebuilding the pipeline / server; we have written two demos, one for the pipeline and one for the server. If you use it through the pipeline API, it may indeed affect training, because CUDA_VISIBLE_DEVICES has to be set.

https://aicarrier.feishu.cn/wiki/VmDlwlqB9iGGOAkSOoucxEVwnPb

The problem with CUDA_VISIBLE_DEVICES is that it can only be set once per process.
As for reloading weights, I added a hacked version earlier, as shown in this PR, and it feels quite fast; the only problem is that the row-to-column (row-major) optimization has to be commented out.
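A small illustration of the "once per process" point, assuming a machine with two GPUs and using PyTorch only to trigger CUDA initialization (caching details vary across PyTorch versions, but the CUDA runtime fixes the visible device set when it initializes).

import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
torch.cuda.init()                    # the CUDA runtime reads the variable here
print(torch.cuda.device_count())     # -> 1

# Changing the variable later in the same process has no effect: the runtime
# has already enumerated its devices, so a colocated inference engine cannot
# be moved to another GPU this way.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
print(torch.cuda.device_count())     # still 1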
