Implement auxiliary-loss-free load balancing #11031
PR types
New features
PR changes
Models
Description
Implements the Auxiliary-Loss-Free Load Balancing mechanism described in the DeepSeek-V3 paper.
The mechanism adds a statistics-driven, expert-wise bias to the MoE gating scores: if an expert received more tokens than average in the previous batch, its bias is decreased; otherwise its bias is increased. This keeps the load balanced without an auxiliary loss.
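For reference, here is a minimal sketch of the gating step described above (an illustration, not the actual PaddleNLP implementation; the function name, tensor shapes, and the `e_score_correction_bias` parameter name are assumptions). The bias only influences which experts are selected, while the combine weights still come from the original scores, so no gradient flows through the bias.

```python
import paddle
import paddle.nn.functional as F

def noaux_tc_gate(logits, e_score_correction_bias, top_k):
    """Pick experts by (score + bias) but weight their outputs by the raw score."""
    scores = F.sigmoid(logits)                                    # [num_tokens, num_experts]
    # The bias-corrected scores are used only for the top-k selection.
    _, topk_idx = paddle.topk(scores + e_score_correction_bias, k=top_k, axis=-1)
    # Combine weights are taken from the original (uncorrected) scores.
    topk_weight = paddle.take_along_axis(scores, topk_idx, axis=-1)
    topk_weight = topk_weight / topk_weight.sum(axis=-1, keepdim=True)
    return topk_idx, topk_weight
```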
The parameters alpha and gamma are set according to the paper as follows:
Experimental results
Logs confirm that the bias is updated as expected: experts above the average load have their bias decreased, while experts below the average have it increased.
Expert access heatmap for a cold-start run with the PP4EP8, 29-layer, 32-expert configuration:

As the heatmaps show, the aux-loss-free method reaches a very well-balanced state starting from around step 300, whereas the default loss keeps showing a fairly severe imbalance.
Expert access heatmap for a warm-start run with the PP4EP8, 29-layer, 64-expert configuration:
Since the warm-start weights have already been trained, the imbalanced phase should be short: the aux-loss-free method reaches a balanced state within roughly 100 steps, whereas with the default loss the balance keeps getting worse and the run soon terminates with an OOM.
Implementation notes
Since the bias is not updated by gradients, it cannot go through the optimizer, so I implemented the update with a Callback.
The new MoECorrectionBiasAdjustCallback is designed for generality and is not limited to the DSV3 model: any MoE model that uses topk_method == "noaux_tc" can use it.
According to the DSV3 paper, the bias update rate is 0.001 for the first 14.3T tokens and 0.0 afterwards. Here I am less strict and fix it at 0.001; if a user really needs to train beyond 14.3T tokens, they can adjust it themselves.
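For illustration, a rough sketch of how such a callback could work, assuming an HF-style `TrainerCallback` with an `on_step_end` hook and a gate layer that keeps a per-expert token counter for the last batch. The class name, the `expert_usage` attribute, and the `update_rate` argument below are hypothetical, not the actual MoECorrectionBiasAdjustCallback interface.

```python
import paddle
from paddlenlp.trainer import TrainerCallback

class BiasAdjustCallbackSketch(TrainerCallback):
    """Illustrative only: after each step, lower the bias of overloaded experts
    and raise the bias of underloaded ones by a fixed update rate."""

    def __init__(self, update_rate=0.001):
        self.update_rate = update_rate  # gamma in the DeepSeek-V3 paper

    def on_step_end(self, args, state, control, model=None, **kwargs):
        for layer in model.sublayers():
            # Hypothetical interface: the gate exposes the bias tensor used by
            # the "noaux_tc" top-k plus a per-expert usage counter for the batch.
            if not hasattr(layer, "e_score_correction_bias"):
                continue
            load = layer.expert_usage.astype("float32")           # [num_experts]
            # A real implementation would all-reduce the counters across data /
            # expert parallel groups before comparing against the mean.
            err = load.mean() - load                              # > 0 for underloaded experts
            with paddle.no_grad():
                layer.e_score_correction_bias.add_(self.update_rate * paddle.sign(err))
            layer.expert_usage.zero_()                            # reset for the next batch
```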