implement adaptive-p sampler #17927
base: master
Conversation
Never mind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.
pnb left a comment:
This looks very interesting! I wish the original compared to XTC, since the goals seem highly similar.
As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.
src/llama-sampling.cpp (outdated):

```cpp
// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value = 3.0f;
```
Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?
The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.
I will post a comment in the main thread with an update once we've got it more figured out. Thank you!
Very interesting sampler, thank you for the implementation! I like the effect so far: it stays on topic even on long results. One question: if this sampler must be the last in the chain, why include it alongside other samplers? For now it looks like a user can make a mistake by putting it elsewhere, which is probably not what we want. Maybe it's worth adding it into the chain at the end […]
The idea is that you're supposed to configure your truncation samplers (like min-p) in such a way that removes garbage tokens from the candidates pool, before it even hits Power Law. It's the same for temperature - if you're using a high temperature you should cut out the nonsense before you apply it. (@z80maniac)
This is good feedback, thank you. I will consider how to change it so that the power law sampler is guaranteed to always be at the end of the chain, if it's active. (@MaggotHATE)
I took another look through the code and I think the choice of what is a tunable parameter vs. what is a fixed default is great. The knobs to tune make sense, and I tried playing with the other parameters (that are now constants) without seeing much obvious effect in the text. Overall I would say the effect of this sampler is a little subtle compared to XTC, but it is noticeable with a low target like 0.05, where lots of excessively popular adverbs disappear from the results.
This is addressed now. Gentle poke to @ggerganov: are there any more changes needed here? What are your thoughts?
Let's come back to this after we merge #17004 (ETA: end of the year) as it will reduce the amount of work for me on this part of the code. |
Really excited about this sampler getting merged. Looks very promising for creative writing. |
- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f - clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f - clamped_decay)`

This fixes a "cold start" problem with the moving average.
Barring any changes requested by reviewers/maintainers, I believe this implementation to be correct and finalized at this point. Just waiting on #17004. |
no functional changes
This PR implements a new sampler called adaptive-p that selects tokens near a configurable target probability over time.
How it works
The adaptive-p sampler transforms the token probability distribution to favor tokens that fall near a user-configurable probability target. Internally, the sampler maintains an exponential moving average of the original probabilities of selected tokens. It uses this, along with the user's set target, to compute an adapted target at each sampling step, steering the running average toward the configured target over time. If recent selections have been higher-probability than target, the sampler compensates by temporarily favoring lower-probability tokens, and vice versa.
Parameters
This sampler exposes two parameters:
- `target` (`--adaptive-target N`)
- `decay` (`--adaptive-decay N`)

In most cases, you can just play with `--adaptive-target`. The default decay of 0.9 (for a ~10 token history) works well.
Usage notes
adaptive-p selects a token ID rather than just mutating candidates, so it must be last in the sampler chain. It shares this behaviour with some existing samplers like mirostat, dist, and greedy (mirostat being the closest relative).
Only mild truncation before this sampler is recommended. We suggest applying min-p before adaptive-p as the only other active sampler in the chain (optionally with top-k as well).
Example usage:
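A hypothetical invocation following the usage notes above (mild min-p truncation, then adaptive-p). The binary name, model path, and parameter values are placeholders; only `--adaptive-target` and `--adaptive-decay` come from this PR.

```shell
# Placeholder command line; adjust the binary, model, and values to taste.
./llama-cli -m model.gguf \
    --min-p 0.05 \
    --adaptive-target 0.3 \
    --adaptive-decay 0.9
```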
Other notes
This sampler was previously called "power law" in earlier versions of this PR, named for the power law transform we were applying to logits. We are no longer applying that transform. We also experimented with a Gaussian transform, but ultimately settled on the current formula.
Acknowledgements