🪙 [Experimental] Support GSPO-token#3820

Merged
qgallouedec merged 14 commits into huggingface:main from hjh0119:gspo-token
Sep 24, 2025
Conversation

@hjh0119 (Contributor) commented Jul 31, 2025

Support for GSPO-token as described in GSPO paper, Section 4.3.

Related issue: #3811

GSPO
$w_{i}^{\mathrm{GSPO}} = \left[ \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right]^{\frac{1}{|y_i|}} = \exp(\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_{\theta}(y_{i, t} \mid x, y_{i, <t})}{\pi_{\theta_{\mathrm{old}}}(y_{i, t} \mid x, y_{i, <t})})$

GSPO-token
$w_{i, t}^{\mathrm{GSPO\text{-}token}} = \mathrm{sg}\left[w_i^{\mathrm{GSPO}}\right] \cdot \frac{\pi_{\theta}(y_{i, t} \mid x, y_{i, < t})}{\mathrm{sg}\left[\pi_{\theta}(y_{i, t} \mid x, y_{i, < t})\right]}$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient (detach) operation.

💡 NOTE: GSPO-token enables support for fine-grained (token-level) advantages.
However, with the current advantage computation, all tokens within a sequence share the same value. In that case, GSPO and GSPO-token are theoretically equivalent, as shown in Equations (11) and (18) of the paper.
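The two definitions above can be sketched in PyTorch. This is a hypothetical helper for illustration, not TRL's actual implementation; the function name and argument layout are assumptions.

```python
import torch

def gspo_token_log_weights(log_ratio: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the GSPO-token log importance weight log w_{i,t} (hypothetical helper).

    log_ratio:       (B, T) per-token log pi_theta / pi_theta_old
    completion_mask: (B, T) 1.0 on completion tokens, 0.0 on padding
    """
    # Sequence-level log weight: length-normalized sum of token log ratios,
    # i.e. log w_i^GSPO from the first equation above.
    seq_log_w = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
    # sg[w_i] * pi_t / sg[pi_t] in log space: the second term is zero in value
    # but restores a token-level gradient (pi_theta_old is constant, so
    # differentiating log_ratio is the same as differentiating log pi_t).
    return seq_log_w.detach().unsqueeze(-1) + (log_ratio - log_ratio.detach())
```

In value, every token's weight equals the length-normalized sequence weight $w_i^{\mathrm{GSPO}}$; only the gradient is token-wise, which is exactly the stop-gradient construction in the equation above.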

@LeonEricsson (Collaborator)

Thanks for this. Since GSPO-token is a generalized version of vanilla GSPO, I suggest we fully transition to GSPO-token instead of supporting both versions. Consequently, we would rename/remove importance_sampling_level, as both methods operate at the token level.

elif self.importance_sampling_level == 'sequence_token':
# GSPO-token: sg[si(θ)] * πθ(yi,t)/sg[πθ(yi,t)]
seq_level_log_weight = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
seq_level_log_weight = seq_level_log_weight.unsqueeze(-1).detach() # Stop gradient

Suggested change:
- seq_level_log_weight = seq_level_log_weight.unsqueeze(-1).detach() # Stop gradient
+ seq_level_log_weight = seq_level_log_weight.detach().unsqueeze(-1) # Stop gradient

Collaborator:

(log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)

This op is common across GSPO and GSPO-token, would be good to have a single variable pointing to this value under an if condition like

if self.importance_sampling_level != 'token'
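Following the reviewer's suggestion, the shared term could be computed once and reused by both branches. A hypothetical sketch (the function name and the standalone `level` argument are assumptions made so the fragment is self-contained; in TRL this would live inside the trainer):

```python
import torch

def log_importance_weights(log_ratio: torch.Tensor, completion_mask: torch.Tensor, level: str) -> torch.Tensor:
    """Hypothetical refactor: the length-normalized sequence-level log weight
    is computed once and shared by the 'sequence' (GSPO) and
    'sequence_token' (GSPO-token) branches."""
    if level != "token":
        seq = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
        seq = seq.unsqueeze(-1)
    if level == "token":
        return log_ratio
    if level == "sequence":
        return seq
    if level == "sequence_token":
        # GSPO-token: sg[s_i(theta)] * pi_theta(y_t) / sg[pi_theta(y_t)] in log space
        return seq.detach() + log_ratio - log_ratio.detach()
    raise ValueError(f"Unknown importance_sampling_level: {level}")
```

With this layout, moving the invalid-value check into argument validation (as discussed below) would reduce the final branch to the three known levels.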

@hjh0119 (Contributor, author):

Makes sense. So shall we move the invalid-value check for importance_sampling_level into the model parameter initialization?


@hjh0119 (Contributor, author) commented Aug 2, 2025

> Thanks for this. Since GSPO-token is a generalized version of vanilla GSPO, I suggest we fully transition to GSPO-token instead of supporting both versions. Consequently, we would rename/remove importance_sampling_level, as both methods operate at the token level.

Agreed to keep GSPO-token. Should we retain this parameter for compatibility with previous usage, or introduce an additional parameter instead? Which is better?

@hjh0119 (Contributor, author) commented Aug 5, 2025

@qgallouedec @lewtun @edbeeching @kashif If there are any concerns or suggestions, please feel free to let me know. Thank you very much in advance!

@LeonEricsson (Collaborator) commented Aug 5, 2025

> Agreed to keep GSPO-token. Should we retain this parameter for compatibility with previous usage, or introduce an additional parameter instead? Which is better?

In my opinion it should be removed; however, since it has already been published as part of TRL v0.20, we may need to keep it for backward compatibility. I can't speak to it myself, so I'll leave it to someone else to decide.

@qgallouedec (Member) commented Aug 10, 2025

Thanks for the contribution, and apologies for the delay in reviewing it.

After reading the paper, I don't think this PR fully achieves GSPO-token. This variation is most relevant when you have a fine-grained advantage, i.e. when $\hat{A}_{i,t}$ varies with $t$, which isn't the case here, since $\hat{A}_{i,t}=\hat{A}_i$.

@hjh0119 (Contributor, author) commented Aug 11, 2025

> After reading the paper, I don't think this PR fully achieves GSPO-token. This variation is most relevant when you have a fine-grained advantage, i.e. when $\hat{A}_{i,t}$ varies with $t$, which isn't the case here, since $\hat{A}_{i,t}=\hat{A}_i$.

Sure. As I mentioned, when there is no fine-grained advantage, the GSPO-token gradient is equivalent to the original implementation.

However, should we implement this algorithm now to accommodate possible future fine-grained advantages, or to make it easier for downstream users to implement their own customized fine-grained advantages? In any case, thank you for your review.
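The equivalence claim can be checked numerically. A hypothetical sketch with stand-in tensors (not TRL code): when every token in a sequence shares one advantage, the GSPO loss (Eq. 11) and the GSPO-token loss (Eq. 18) produce identical gradients with respect to the policy log-probabilities.

```python
import torch

torch.manual_seed(0)
logp = torch.randn(2, 5, requires_grad=True)        # stand-in for log pi_theta
logp_old = logp.detach() + 0.1 * torch.randn(2, 5)  # frozen log pi_theta_old
mask = torch.ones(2, 5)
adv = torch.tensor([[1.0], [-0.5]]).expand(2, 5)    # uniform within each sequence

log_ratio = logp - logp_old
seq_log_w = (log_ratio * mask).sum(-1, keepdim=True) / mask.sum(-1, keepdim=True)

# GSPO: sequence-level weight, gradient flows through the whole length-normalized sum
loss_gspo = -(torch.exp(seq_log_w) * adv * mask).sum() / mask.sum()
g_gspo, = torch.autograd.grad(loss_gspo, logp, retain_graph=True)

# GSPO-token: sg[w_i] * pi_t / sg[pi_t], same value, token-wise gradient
w_tok = torch.exp(seq_log_w.detach() + logp - logp.detach())
loss_tok = -(w_tok * adv * mask).sum() / mask.sum()
g_tok, = torch.autograd.grad(loss_tok, logp)

# With uniform advantages both gradients are adv_i * exp(s_i) per token
assert torch.allclose(g_gspo, g_tok, atol=1e-6)
```

With a genuinely token-varying `adv`, the two gradients would differ, which is where GSPO-token adds value.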

@hjh0119 (Contributor, author) commented Aug 14, 2025

Any feedback before I close this PR, or should we go ahead and merge it?

@qgallouedec (Member)

Please leave it open, we are working hard to provide fast review for all the PRs 🙏

@qgallouedec changed the title from "support GSPO-token" to "🪙 [Experimental] Support GSPO-token" on Sep 24, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit d144e73 into huggingface:main Sep 24, 2025
kashif pushed a commit that referenced this pull request Sep 30, 2025
Co-authored-by: LeonEricsson <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>
5 participants