🪙 [Experimental] Support GSPO-token#3820

Merged
qgallouedec merged 14 commits into huggingface:main from hjh0119:gspo-token
Sep 24, 2025
Conversation

@hjh0119 (Contributor) commented Jul 31, 2025

Support for GSPO-token as described in GSPO paper, Section 4.3.

Related issue: #3811

GSPO
$w_{i}^{\mathrm{GSPO}} = \left[ \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right]^{\frac{1}{|y_i|}} = \exp(\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_{\theta}(y_{i, t} \mid x, y_{i, <t})}{\pi_{\theta_{\mathrm{old}}}(y_{i, t} \mid x, y_{i, <t})})$

GSPO-token
$w_{i, t}^{\mathrm{GSPO\text{-}token}} = \mathrm{sg}\left[w_i^{\mathrm{GSPO}}\right] \cdot \frac{\pi_{\theta}(y_{i, t} \mid x, y_{i, < t})}{\mathrm{sg}\left[\pi_{\theta}(y_{i, t} \mid x, y_{i, < t})\right]}$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient (detach) operation.

💡 NOTE: GSPO-token enables support for fine-grained (token-level) advantages.
However, with the current advantage computation, all tokens within a sequence share the same value. In that case, GSPO and GSPO-token are theoretically equivalent, as shown in Equations (11) and (18) of the paper.
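The two definitions above can be sketched in PyTorch. This is a hypothetical helper for illustration, not TRL's actual implementation; the function name and argument layout are assumptions.

```python
import torch

def gspo_token_log_weights(log_ratio: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the GSPO-token log importance weight log w_{i,t} (hypothetical helper).

    log_ratio:       (B, T) per-token log pi_theta / pi_theta_old
    completion_mask: (B, T) 1.0 on completion tokens, 0.0 on padding
    """
    # Sequence-level log weight: length-normalized sum of token log ratios,
    # i.e. log w_i^GSPO from the first equation above.
    seq_log_w = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
    # sg[w_i] * pi_t / sg[pi_t] in log space: the second term is zero in value
    # but restores a token-level gradient (pi_theta_old is constant, so
    # differentiating log_ratio is the same as differentiating log pi_t).
    return seq_log_w.detach().unsqueeze(-1) + (log_ratio - log_ratio.detach())
```

In value, every token's weight equals the length-normalized sequence weight $w_i^{\mathrm{GSPO}}$; only the gradient is token-wise, which is exactly the stop-gradient construction in the equation above.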

@LeonEricsson (Collaborator)

Thanks for this. Since GSPO-token is a generalized version of vanilla GSPO, I suggest we fully transition to GSPO-token instead of supporting both versions. Consequently, we would rename/remove importance_sampling_level, as both methods operate at the token level.

elif self.importance_sampling_level == 'sequence_token':
# GSPO-token: sg[si(θ)] * πθ(yi,t)/sg[πθ(yi,t)]
seq_level_log_weight = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
seq_level_log_weight = seq_level_log_weight.unsqueeze(-1).detach() # Stop gradient

Suggested change:
- seq_level_log_weight = seq_level_log_weight.unsqueeze(-1).detach() # Stop gradient
+ seq_level_log_weight = seq_level_log_weight.detach().unsqueeze(-1) # Stop gradient

Collaborator:

(log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)

This op is common across GSPO and GSPO-token, would be good to have a single variable pointing to this value under an if condition like

if self.importance_sampling_level != 'token'
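Following the reviewer's suggestion, the shared term could be computed once and reused by both branches. A hypothetical sketch (the function name and the standalone `level` argument are assumptions made so the fragment is self-contained; in TRL this would live inside the trainer):

```python
import torch

def log_importance_weights(log_ratio: torch.Tensor, completion_mask: torch.Tensor, level: str) -> torch.Tensor:
    """Hypothetical refactor: the length-normalized sequence-level log weight
    is computed once and shared by the 'sequence' (GSPO) and
    'sequence_token' (GSPO-token) branches."""
    if level != "token":
        seq = (log_ratio * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)
        seq = seq.unsqueeze(-1)
    if level == "token":
        return log_ratio
    if level == "sequence":
        return seq
    if level == "sequence_token":
        # GSPO-token: sg[s_i(theta)] * pi_theta(y_t) / sg[pi_theta(y_t)] in log space
        return seq.detach() + log_ratio - log_ratio.detach()
    raise ValueError(f"Unknown importance_sampling_level: {level}")
```

With this layout, moving the invalid-value check into argument validation (as discussed below) would reduce the final branch to the three known levels.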

@hjh0119 (Contributor, author):

Makes sense. So shall we move the invalid-value check for importance_sampling_level into the model parameter initialization?


@hjh0119 (Contributor, author) commented Aug 2, 2025

> Thanks for this. Since GSPO-token is a generalized version of vanilla GSPO, I suggest we fully transition to GSPO-token instead of supporting both versions. Consequently, we would rename/remove importance_sampling_level, as both methods operate at the token level.

Agreed to keep GSPO-token. Should we retain this parameter for compatibility with previous usage, or introduce an additional parameter instead? Which is better?

@hjh0119 (Contributor, author) commented Aug 5, 2025

@qgallouedec @lewtun @edbeeching @kashif If there are any concerns or suggestions, please feel free to let me know. Thank you very much in advance!

@LeonEricsson (Collaborator) commented Aug 5, 2025

> Agreed to keep GSPO-token. Should we retain this parameter for compatibility with previous usage, or introduce an additional parameter instead? Which is better?

In my opinion it should be removed; however, since it has already been published as part of TRL v0.20, we may need to keep it for backward compatibility. I can't speak to it myself, so I'll leave it to someone else to decide.

@qgallouedec (Member) commented Aug 10, 2025

Thanks for the contribution, and apologies for the delay in reviewing it.

After reading the paper, I don't think this PR fully achieves GSPO-token. This variation is most relevant when you have a fine-grained advantage, i.e. when $\hat{A}_{i,t}$ varies with $t$, which isn't the case here, since $\hat{A}_{i,t}=\hat{A}_i$.

@hjh0119 (Contributor, author) commented Aug 11, 2025

> After reading the paper, I don't think this PR fully achieves GSPO-token. This variation is most relevant when you have a fine-grained advantage, i.e. when $\hat{A}_{i,t}$ varies with $t$, which isn't the case here, since $\hat{A}_{i,t}=\hat{A}_i$.

Sure. As I mentioned, when there is no fine-grained advantage, the GSPO-token gradient is equivalent to the original implementation.

However, should we implement this algorithm now to accommodate possible future fine-grained advantages, or to make it easier for downstream users to implement their own customized fine-grained advantages? In any case, thank you for your review.
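The equivalence claim can be checked numerically. A hypothetical sketch with stand-in tensors (not TRL code): when every token in a sequence shares one advantage, the GSPO loss (Eq. 11) and the GSPO-token loss (Eq. 18) produce identical gradients with respect to the policy log-probabilities.

```python
import torch

torch.manual_seed(0)
logp = torch.randn(2, 5, requires_grad=True)        # stand-in for log pi_theta
logp_old = logp.detach() + 0.1 * torch.randn(2, 5)  # frozen log pi_theta_old
mask = torch.ones(2, 5)
adv = torch.tensor([[1.0], [-0.5]]).expand(2, 5)    # uniform within each sequence

log_ratio = logp - logp_old
seq_log_w = (log_ratio * mask).sum(-1, keepdim=True) / mask.sum(-1, keepdim=True)

# GSPO: sequence-level weight, gradient flows through the whole length-normalized sum
loss_gspo = -(torch.exp(seq_log_w) * adv * mask).sum() / mask.sum()
g_gspo, = torch.autograd.grad(loss_gspo, logp, retain_graph=True)

# GSPO-token: sg[w_i] * pi_t / sg[pi_t], same value, token-wise gradient
w_tok = torch.exp(seq_log_w.detach() + logp - logp.detach())
loss_tok = -(w_tok * adv * mask).sum() / mask.sum()
g_tok, = torch.autograd.grad(loss_tok, logp)

# With uniform advantages both gradients are adv_i * exp(s_i) per token
assert torch.allclose(g_gspo, g_tok, atol=1e-6)
```

With a genuinely token-varying `adv`, the two gradients would differ, which is where GSPO-token adds value.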

@hjh0119 (Contributor, author) commented Aug 14, 2025

Any feedback before I close this PR, or should we go ahead and merge it?

@qgallouedec (Member)

Please leave it open, we are working hard to provide fast review for all the PRs 🙏

@qgallouedec changed the title from "support GSPO-token" to "🪙 [Experimental] Support GSPO-token" on Sep 24, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit d144e73 into huggingface:main Sep 24, 2025
kashif pushed a commit that referenced this pull request Sep 30, 2025
Co-authored-by: LeonEricsson <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>
5 participants