Releases: huggingface/trl
v0.28.0
Features
- [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in #4742
- Add retry strategy to vLLM Client for increased robustness by @apalmas-saifh in #4845
- Enable vLLM sleep mode for generation in Online DPO by @winglian in #4882
- Support tool call data in `is_conversational` by @qgallouedec in #4923
- [GRPO] Add parquet logging for completions with individual rewards by @qgallouedec in #4818
- Update wordle.py example with masking of env tokens by @sergiopaniego in #4895
- NeMo-Gym Integration by @cmunley1 in #4848
Experimental
- Refactor KTO coordinated with DPO [c/N]: Remove ref_model_init_kwargs by @albertvillanova in #4837
- Refactor KTO coordinated with DPO [e/N]: Remove label_pad_token_id by @albertvillanova in #4875
- Refactor KTO coordinated with DPO [d/N]: Remove base_model_attribute_name by @albertvillanova in #4862
- Fix type hint in `openenv/utils.py`: fallback for no vLLM installed case by @Datta0 in #4868
- Remove label_pad_token_id from experimental trainers by @albertvillanova in #4878
- GOLD training speed up by @141forever in #4888
- Remove ref_model_init_kwargs from experimental BCO by @albertvillanova in #4946
- Remove max_prompt_length from experimental PRM by @albertvillanova in #4963
- Remove max_prompt_length from experimental BCO by @albertvillanova in #4964
- Remove max_prompt_length from experimental CPO by @albertvillanova in #4965
- Remove max_prompt_length from experimental ORPO by @albertvillanova in #4966
- Remove padding_value from experimental CPO and use pad_token_id by @albertvillanova in #4962
Fixes
- Fix _patch_transformers_hybrid_cache for peft by @albertvillanova in #4844
- Refactor KTO [4/N]: Remove unused padding_value by @albertvillanova in #4839
- Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in #4852
- fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
- Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
- Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
- Fix import path for `get_open_port` based on vLLM version by @qgallouedec in #4883
- Fix RewardTrainer's results not reproducible by @liyc-ai in #4887
- `device_map` init consistency in GRPO/RLOO/KTO by @qgallouedec in #4909
- Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
- Fix PPO run_name parameter not taking effect by @mel3c in #4945
- Remove access to `warnings_issued` by @qgallouedec in #4960
- Revert change in GRPO from NeMo-Gym Integration by @qgallouedec in #4970
Documentation and Examples
- Add Nash Learning from Human Feedback paper to paper index by @kansalaman in #4860
- Update OpenEnv dependency to new version for hf jobs scripts by @sergiopaniego in #4843
- Enhance GRPO documentation with scaling notes by @javadtaghia in #4849
- Created new PTT integration docs as requested by @adityachallapally in #4907
- docs: add DoRA (2402.09353) to Paper Index by @billycrapediem in #4892
Deprecations
- Remove unused padding_value from BCO by @albertvillanova in #4846
- Remove deprecated parameters by @qgallouedec in #4847
- Deprecate parameters in `DPOConfig` by @qgallouedec in #4969
- Replace `warmup_ratio` with `warmup_steps` by @qgallouedec in #4983
CI Improvements
- Support triggering CI via push to ci-* branches by @albertvillanova in #4840
- Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in #4833
- Use pytest-datadir in CI tests by @albertvillanova in #4836
- Fix CI with dev dependencies: Mark Qwen3-VL tests as xfail by @albertvillanova in #4851
- Use pytest-datadir for accelerate config files by @albertvillanova in #4861
- Update transformer version checks and documentation for lr_scheduler_kwargs workaround by @qgallouedec in #4876
- Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` by @qgallouedec in #4823
- Mark ZeRO 2 as xfail in distributed tests due to current failure by @qgallouedec in #4885
- Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks by @qgallouedec in #4898
- Fix CI NotImplementedError for bfloat16 by @albertvillanova in #4902
- Fix CI AssertionError: Parameter has not changed by @albertvillanova in #4904
- Fix CI TypeError in llm-blender tests by @albertvillanova in #4919
- Fix CI AssertionError: assert not True by @albertvillanova in #4921
- Fix CI ValueError for 0 temperature by @albertvillanova in #4916
- Set model dtype to float32 in tests of trainers by @albertvillanova in #4924
- Set model dtype to float32 in experimental tests of trainers by @albertvillanova in #4925
- Add test for training with `compute_metrics` in `SFTTrainer` by @qgallouedec in #4950
- Add test for tool call data in `RewardTrainer` by @qgallouedec in #4959
- Add test for training with `compute_metrics` in `RewardTrainer` by @qgallouedec in #4958
- Fix test_train_with_chat_template_kwargs by @qgallouedec in #4971
Miscellaneous
- Update `CITATION.cff` by @qgallouedec in #4856
- Update generate_tiny_models.py: CohereForAI -> CohereLabs by @Michellehbn in #4877
- Refactor vLLM generation [1/N]: Extract vLLM generation by @albertvillanova in #4700
- Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` by @qgallouedec in #4911
- Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` by @qgallouedec in #4910
- Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer by @qgallouedec in #4913
- Remove gradient checkpointing option from various training scripts by @qgallouedec in #4905
- Remove chat template setup in dpo_vlm.py by @qgallouedec in #4906
- Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests by @qgallouedec in #4914
- Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models by @qgallouedec in #4912
- Require transformers<5 with PairRMJudge by @albertvillanova in #4926
- Move VLLMClient to generation module by @albertvillanova in #4928
- Fix profiling of VLLMGeneration.sync_weights by @albertvillanova in #4931
- Fix import statement for import_utils in vllm_client.py by @qgallouedec in #4932
- Set default top_k to 0 in VLLMClient by @albertvillanova in #4927
- Minor fix docs style by @albertvillanova in #4953
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4835
- Support triggering CI via push to ci-* branches by @albertvillanova in #4840
- Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in #4833
v0.27.2
What's Changed
- Remove access to `warnings_issued` by @qgallouedec in #4960
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
- Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908
Full Changelog: v0.27.1...v0.27.2
v0.27.1
What's Changed
- Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in #4852
- fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
- Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
- Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
- Fix RewardTrainer's results not reproducible by @liyc-ai in #4887
New Contributors
- @kdubovikov made their first contribution in #4873
- @liyc-ai made their first contribution in #4887
Full Changelog: v0.27.0...v0.27.1
v0.27.0
Features
- Add `vllm_group_port` argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
- Preserve truncated tokens in BFD packing by @qgallouedec in #4632
- Support async reward functions and parallelize calls to reward functions by @pramodith in #4567
- RLOO supports async rewards. by @pramodith in #4718
- Support vLLM 0.12.0 by @jiqing-feng in #4117
- feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
- 🎭 Up to 50% less VRAM during forward with `forward_masked_logits` function by @qgallouedec in #4729
- [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
- Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811
- Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785
Experimental
- Move `AutoModelForCausalLMWithValueHead` and `AutoModelForSeq2SeqLMWithValueHead` to experimental by @qgallouedec in #4654
- Move DPODataCollatorWithPadding to `experimental.utils` by @qgallouedec in #4667
- Move `DataCollatorForChatML` to `experimental.utils` by @qgallouedec in #4668
- Move `add_bos_token_if_needed` and `add_eos_token_if_needed` to `experimental.utils` by @qgallouedec in #4674
- Move `truncate_right` and `SIMPLE_CHAT_TEMPLATE` to `experimental.utils` by @qgallouedec in #4677
- Move `prepare_model_for_kbit_training`, `enable_gradient_checkpointing`, `prepare_peft_model` to `experimental.utils` by @qgallouedec in #4704
- Move `get_reward` function to `experimental.utils` by @qgallouedec in #4683
- Remove experimental imports from testing_utils by @albertvillanova in #4727
- ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
- Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
- [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
- Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
- Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808
Fixes
- Accounting for case `num_generations_eval=1` in the calculation of the advantage by @qgallouedec in #4662
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
- Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in #4682
- Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
- Include `generation_config` for tiny model uploads by @qgallouedec in #4643
- Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
- Overwrite model default generation config used by model.generate by @albertvillanova in #4647
- Fix: handle multiple tool calls in `qwen3_schema` by @mattbui in #4709
- Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
- Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
- Monkey patch for `HybridCache` in Liger-Kernel with transformers v5 by @qgallouedec in #4798
- [fix] GRPOTrainer: proper access `args` by @carlyou in #4801
- Fix vllm compat patches to be applied only to affected versions by @albertvillanova in #4815
- fix bug when sft calc outputs.token_accuracy by @kaixuanliu in #4814
- fix xpu vllm client server by @jiqing-feng in #4780
Documentation and Examples
- docs: add RapidFire AI integration section to SFT Trainer by @kamran-rapidfireAI in #4661
- Fix environment image name for BrowserGym example script by @sergiopaniego in #4680
- Docs(`grpo_trainer.md`): Added Qwen SAPO details under `Loss Types` by @casinca in #4681
- [docs] Adds GRPO, RSO and LoRA to Paper Index by @SSusantAchary in #4441
- Enable zero3 init and 16-bit model saving for ds ulysses config by @edbeeching in #4701
- Set version to packaged one in notebooks by @sergiopaniego in #4648
- BrowserGym example for LLMs (no vision) by @sergiopaniego in #4696
- docs: Add RapidFire AI cross-references to DPO and GRPO trainer docs by @kamran-rapidfireAI in #4705
- [docs] Fix RapidFire AI position in documentation by @qgallouedec in #4715
- Add inference example to GRPO agent training notebook by @sergiopaniego in #4710
- Upload FunctionGemma notebook by @sergiopaniego in #4721
- Update agents notebook dependencies by @sergiopaniego in #4724
- Add uv/hf jobs support to OpenEnv scripts by @sergiopaniego in #4720
- Add GRPO QLoRA free notebook by @sergiopaniego in #4660
- Hotfix for browsergym openenv notebook by @sergiopaniego in #4740
- docs: fix "Good Second Issue" redirection link by @casinca in #4749
- [Docs] Add SRL (Supervised Reinforcement Learning) to Community Tutorials by @s23deepak in #4758
- Add LFM2.5 to GRPO notebook by @sergiopaniego in #4793
- Sudoku GRPO example script using TextArena by @sergiopaniego in #4762
- [EXAMPLES] Update wordle to new openenv release by @burtenshaw in #4791
- Update the typos in docs/source/grpo_trainer.md by @Tianyi-Billy-Ma in #4804
- Update examples to new OpenEnv version by @sergiopaniego in #4796
- Update GRPO example to use Qwen2.5 instead of Qwen2 by @BurnyCoder in #4803
Deprecations
- Remove deprecated functions and parameters by @qgallouedec in #4651
- Remove `MergeModelCallback` from import structure by @qgallouedec in #4664
- Remove `ChatMlSpecialTokens` by @qgallouedec in #4666
- Remove unused `_win_rate_completions_df` function from callbacks by @qgallouedec in #4672
- Deprecate max_prompt_length in RLOOTrainer by @albertvillanova in #4703
- Small fix on contributing docs by @murilo-cunha in #4753
- Remove `DbrxForCausalLM` support by @qgallouedec in #4799
CI Improvements
- Hotfix CI due to generation config by setting tests as xfail by @albertvillanova in #4657
- Upgrade GitHub Actions to latest versions by @salmanmkc in #4734
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #4733
- Include data type for tiny models and update tests by @qgallouedec in #4728
- Change tiny model dtype from float16 to bfloat16 to fix CUDA error by @albertvillanova in #4745
- Add revision override mechanism for testing tiny models by @albertvillanova in #4769
- Hotfix: Set float32 as default dtype for testing tiny models by @albertvillanova in #4770
- Hotfix CI with dev dependencies: xfail test_training_vlm_and_liger by @albertvillanova in #4777
- Add initial multi-GPU CI tests for distributed training by @qgallouedec in #4784
- Set dtype default to float32 by @albertvillanova in #4778
- Test FSDP2 by @qgallouedec in #4813
- Test ZeRO Stage 3 by @qgallouedec in #4821
- Hotfix CI main test...
v0.26.2
What's Changed
- Overwrite model default generation config used by model.generate by @albertvillanova in #4647
Full Changelog: v0.26.1...v0.26.2
v0.26.1
What's Changed
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
- Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in #4682
New Contributors
- @apalmas-saifh made their first contribution in #4663
Full Changelog: v0.26.0...v0.26.1
v0.26.0
Features
🕵️♂️ GRPO: Agent training
GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.
```python
from datasets import Dataset
from trl import GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)


def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards


trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()
```

by @qgallouedec in #4300
ScaleRL: Add CISPO Loss
The CISPO loss was first introduced in the MiniMax-M1 paper; the ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.
GRPOTrainer now supports the CISPO loss via `loss_type="cispo"` in the `GRPOConfig`.
by @pramodith in #4495
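For reference, a minimal sketch of enabling it (the `output_dir` value is just an illustrative placeholder):

```python
from trl import GRPOConfig

# Select the CISPO objective in place of the default GRPO loss.
training_args = GRPOConfig(output_dir="Qwen3-0.6B-CISPO", loss_type="cispo")
```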
Add vLLM quantization option for colocate
When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.
by @sergiopaniego in #4496
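A rough sketch of how this fits together, assuming the policy is loaded in 4-bit via `model_init_kwargs` (the config values here are illustrative, not prescriptive):

```python
from transformers import BitsAndBytesConfig
from trl import GRPOConfig

# Load the policy in 4-bit with bitsandbytes; in colocate mode, vLLM now
# mirrors the quantization instead of loading full-precision weights.
training_args = GRPOConfig(
    output_dir="Qwen3-0.6B-GRPO-4bit",
    use_vllm=True,
    vllm_mode="colocate",
    model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)
```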
Reasoning reward
TRL now includes a reasoning reward function:

```python
from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]

reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0]
```

Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.
```python
from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)
```

Add shuffle_dataset option to SFTTrainer
You can now shuffle the dataset in SFTTrainer by setting the `shuffle_dataset` argument to `True` in `SFTConfig`. This is useful when the dataset features high similarity between consecutive samples.

```python
from trl import SFTTrainer, SFTConfig

SFTConfig(shuffle_dataset=True)
```

by @qgallouedec in #4564
Add SAPO Loss in GRPO
Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.
You can now use the SAPO loss in GRPOTrainer by setting `loss_type="sapo"` in the `GRPOConfig`.
by @pramodith in #4600
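A minimal sketch, analogous to the CISPO example above (`output_dir` is illustrative):

```python
from trl import GRPOConfig

# Swap the hard-clipped objective for the soft-gated SAPO loss.
training_args = GRPOConfig(output_dir="Qwen3-0.6B-SAPO", loss_type="sapo")
```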
Other Features
- Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in #4452
- Add support for images inside tables with Trackio completions logging by @taha-yassine in #4505
- Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in #4516
- Add target_parameters to LoraConfig by @jonnyli1125 in #4536
- [SFT] Log mean token accuracy from Liger kernel by @kashif in #4302
- Add `num_generations_eval` parameter for efficient evaluation by @mingxuetian in #4458
- [GRPO] Sequence-level TIS & MIS by @LeonEricsson in #4530
- TRL supports vLLM 0.11 by @qgallouedec in #4633
- feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in #4638
Experimental
- Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in #4485
- Move judges to experimental submodule by @behroozazarkhalili in #4439
- Add MiniLLM Trainer by @t1101675 in #4504
- refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in #4470
- Move GKDTrainer to experimental module by @behroozazarkhalili in #4474
- Move NashMDTrainer to experimental module by @behroozazarkhalili in #4477
- Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in #4482
- [ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in #4480
- Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in #4483
- Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in #4473
- Move `WinRateCallback` to experimental by @qgallouedec in #4558
- Move tests for GSPOTokenTrainer to experimental by @qgallouedec in #4572
- Raise FutureWarning for classes moved to experimental by @albertvillanova in #4605
- Move MergeModelCallback to experimental by @qgallouedec in #4608
- Raise FutureWarning for trainer moved to experimental by @albertvillanova in #4620
- Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in #4628
- Refactor suppression of warning at experimental import by @albertvillanova in #4629
- 🚚 Move KTO to trl.experimental by @neha222222 in #4575
Fixes
- Buffer samples based on group level stds. by @pramodith in #4492
- Fix bugs in CISPO conditions by @pramodith in #4499
- `device_map` and `dtype` to `"auto"` by default by @qgallouedec in #4509
- MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in #4518
- [Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in #4500
- Rename `flash-attn` to `flash-attn2` by @qgallouedec in #4514
- fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in #4526
- Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in #4553
- fix+docs: `device_map=None` for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in #4551
- Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in #4571
- fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in #4579
- Fix 'generation_config' AttributeError by @albertvillanova in #4596
- Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in #4603
- Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in #4635
Documentation and Examples
- docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in #4430
- [DOCS] update and fix openenv by @burtenshaw in #4490
- Fix link to OpenEnv docs by @lukehinds in #4502
- Tweak description for vLLM sleep mode by @lewtun in #4506
- Paper Index: Change `num_completions` to `num_generations` by @pramodith in https://gi...
v0.25.1
What's Changed
- Replace accelerate logging with stdlib in CLI by @lewtun in #4512
- Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 by @qgallouedec in #4513
Full Changelog: v0.25.0...v0.25.1
v0.25.0
Features
- 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in #4296
- Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in #4335
- Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in #4361
- Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in #4359
- Support chat_template_kwargs by @pramodith in #4350
- GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in #4303
- Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in #4446
- 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in #4450
Experimental
- 🚚 Move BCO to `trl.experimental` by @qgallouedec in #4312
- 👑 [experimental] GOLD Trainer by @kashif in #4349
- Add PAPOTrainer for preference-based optimization by @SolarWindRider in #4334
- [GFPO] fix the GFPO loss calculation error caused by unmodified old_per_token_logps by @Peter-Chou in #4454
- 🕹️ Add rollout function for OpenEnv integration by @lewtun in #4310
Fixes
- [Activation-checkpointing] add tensor dedup and param offloading by @kashif in #4247
- Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in #4322
- Hotfix: Fall back to config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in #4324
- Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in #4348
- Fix: `add_generation_prompt=True` for conversational only by @qgallouedec in #4362
- Remove ignored max_length parameter from PRMTrainer data collator by @albertvillanova in #4355
- Fix add_generation_prompt arg for paged transformers in GRPO and RLOO trainers by @albertvillanova in #4370
- Fix GKD Liger memory spike by @qgallouedec in #4140
- Fix GRPO with replay buffer by inserting images in the prompt by @albertvillanova in #4391
- fix: Remove chat template setting from non-SFT trainer scripts by @behroozazarkhalili in #4437
- 🖼️ Fix reporting images with vLLM by @qgallouedec in #4476
Documentation and Examples
- Added SFT LoRA notebook by @sergiopaniego in #4244
- Update notebooks README with latest additions by @sergiopaniego in #4316
- Add notebooks to Examples docs and restructure by @sergiopaniego in #4317
- Highlight OpenEnv in landing docs by @sergiopaniego in #4327
- Update OpenEnv docs by @sergiopaniego in #4328
- Add OpenEnv blog to landing by @sergiopaniego in #4333
- 🗞️ Update "What's New" by @qgallouedec in #4338
- Update Reducing Memory Consumption guide with more details by @sergiopaniego in #4332
- Fixed links inside Tips in docs by @sergiopaniego in #4360
- 🔥 docs: Add RapidFire AI integration guide by @kamran-rapidfireAI in #4340
- Fix paper link for "Towards Efficient and Exact Optimization of Language Model Alignment" by @qgallouedec in #4409
- Migrate experimental trl feature docs by @ethanknights in #4411
- Update SFT QLoRA notebook with 14B model on free Colab by @sergiopaniego in #4336
- Create "Talks" subsection by @sergiopaniego in #4414
- Openenv wordle example by @burtenshaw in #4357
- docs: Remove outdated conversational dataset conversion guidance by @behroozazarkhalili in #4422
- docs: List all trainers that support Liger Kernel by @behroozazarkhalili in #4432
- Add On-Policy Distillation from thinking labs to paper index. by @pramodith in #4410
- Upload notebook with T4 selected by @sergiopaniego in #4449
- Removed outdated warning about batch contamination by @Harras3 in #4423
- Removed Sentiment Tuning Examples by @Harras3 in #4424
- docs: Remove outdated notebooks by @behroozazarkhalili in #4435
- docs: Move Multi-Adapter RL section to PEFT integration by @behroozazarkhalili in #4436
- Update `max_length` explanation for VLM in online trainers by @sergiopaniego in #4220
- Updated OpenEnv docs by @sergiopaniego in #4418
- add llasa-tutorial by @Deep-unlearning in #4456
Deprecations
- Replace deprecated AutoModelForVision2Seq with AutoModelForImageTextToText by @albertvillanova in #4353
- Replace deprecated list with tuple indexing in PPOTrainer by @albertvillanova in #4356
- Remove liger loss in favor of liger kernel by @sergiopaniego in #4364
- 🐍 Drop Python 3.9 by @qgallouedec in #4183
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4293
- Update links to docs in README to latest packaged version by @sergiopaniego in #4084
- 🧺 [4/N] Refactor `_generate` in GRPO/RLOO: Move `forward_kwargs` outside generation method by @qgallouedec in #4154
- Fix missing CI slow tests: ImportError: vLLM is not installed by @albertvillanova in #4304
- Added SFT LoRA notebook by @sergiopaniego in #4244
- ⚰️ Remove deprecated by @qgallouedec in #4301
- Silence TRL experimental warnings in CI by @albertvillanova in #4307
- Filter expected setup_chat_format deprecation warning in CI by @albertvillanova in #4306
- [Activation-checkpointing] add tensor dedup and param offloading by @kashif in #4247
- Remove parameterized as test extra dependency by @albertvillanova in #4315
- Update notebooks README with latest additions by @sergiopaniego in #4316
- 🚚 Move BCO to `trl.experimental` by @qgallouedec in #4312
- 🧺 [5/N] Refactor `_generate` in GRPO/RLOO: Insert images in the prompt by @qgallouedec in #4155
- 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in #4296
- Replace unittest skipTest from transformers with pytest.skip by @albertvillanova in #4297
- Add notebooks to Examples docs and restructure by @sergiopaniego in #4317
- Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in #4322
- 🕹️ Add rollout function for OpenEnv integration by @lewtun in #4310
- Highlight OpenEnv in landing docs by @sergiopaniego in #4327
- Update OpenEnv docs by @sergiopaniego in #4328
- Move BCO tests to tests/experimental by @albertvillanova in #4326
- Hotfix: Fall back to config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in #4324
- Add OpenEnv blog to landing by @sergiopaniego in #4333
- 🗞️ Update "What's New" by @qgallouedec in #4338
- Update Reducing Memory Consumption guide with more details by @sergiopaniego in #4332
- Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in #4335
- [vllm] update comment about communication group host ip by @kashif in #4337
- Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in #4348
- Fixed links inside Tips in docs by @sergiopaniego in #4360
- Fix CI issue for vlm_gemma_3n model by @kaixuanliu in #4278
- Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in #4361
v0.24.0
Features
- Add accuracy reward by @pramodith in #4270
- Add support for `token_type_ids` in `DPOTrainer` by @aweers in #4285
- 💰 `RichProgressCallback` enhancement by @qgallouedec in #4245
- Include `chat_template_kwargs` in `apply_chat_template` by @cmpatino in #4233
- 🏷️ Account for `token_type_ids` in `DataCollatorForVisionLanguageModeling` by @qgallouedec in #4190
- 🎨 Support mixing image+text and text-only examples by @qgallouedec in #4203
- 🎁 `RewardTrainer` refactor by @qgallouedec in #4093
- 🎞️ Support sequence classification models in `clone_chat_template` by @qgallouedec in #4097
- ✨ Add logging for training completion and model saving in training scripts by @qgallouedec in #4048
- 🖨️ Print rich table for messages by @qgallouedec in #4160
- 😴 Add `vllm_enable_sleep_mode` to RLOO Trainer by @sergiopaniego in #4107
- 📽 Multi image support for GRPO/RLOO by @qgallouedec in #4113
- 👁️ Add VLM support to RLOO trainer by @behroozazarkhalili in #4067
- ℹ️ Enable XPU for vLLM client by @jiqing-feng in #4031
- 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in #4089
Fixes
- [Online-DPO] fix the completion_len == max_new_tokens crash by @kashif in #4193
- Fix entropy and accuracy calculation for prompt_tuning techniques. by @pramodith in #4196
- Fix prompt-completion labeling with add_generation_prompt and warning by @behroozazarkhalili in #4201
- 🌡️ Have vLLM return processed (temperature scaled) log probs by @YonatanGideoni in #4163
- Fix handling of f_divergence_type in DPO by @albertvillanova in #4171
- ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in #4170
- Pass required token_type_ids by @albertvillanova in #4148
- 👩🦯 Fix usage of VLM using text only by @SamuelBarryCS in #4080
- ⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely by @kashif in #4057
- 📤 Fix a dataset loading bug in scripts by @singing-cat in #4124
- 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in #4087
- [GKD] Fix `batchmean` reduce op in GKDTrainer's loss by @cmpatino in #4105
- Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in #4081
- Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
- ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in #4041
Documentation
- Remove logging.md: trainer-specific metrics documentation by @behroozazarkhalili in #4269
- Remove using_llama_models.md: outdated Llama2-specific documentation by @behroozazarkhalili in #4268
- Remove how_to_train.md: outdated training FAQ by @behroozazarkhalili in #4267
- Add Qwen3-VL notebooks (SFT, GRPO) by @sergiopaniego in #4275
- Remove obsolete research_projects directory by @behroozazarkhalili in #4243
- Add Efficient Online Training with GRPO and vLLM in TRL to community tutorials by @sergiopaniego in #4219
- Add trainers taxonomy to docs by @sergiopaniego in #4195
- Updated vLLM integration guide by @sergiopaniego in #4162
- [DOCS] Lora without regret by @burtenshaw in #4181
- Add docstring for OnlineTrainerState by @albertvillanova in #4166
- ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour of `pad_token_id` by @qgallouedec in #4006
- 🏞️ Context Parallelism benchmark guide by @sergiopaniego in #4075
- ▶️ Add video to community tutorials by @qgallouedec in #4090
- Reviewed HF jobs updated docs by @sergiopaniego in #4088
Deprecations
- Deprecate `BestOfNSampler` by @qgallouedec in #4291
- Raise deprecation warning for Python 3.9 by @albertvillanova in #4226
- Deprecate unused dataset_formatting module by @behroozazarkhalili in #4242
- Warnings pointing to RFC by @qgallouedec in #4224
- 🅰️ Remove apex by @qgallouedec in #4139
- 🗑️ Remove deprecated `AlignPropTrainer`, `DDPOTrainer` and `IterativeSFTTrainer` by @qgallouedec in #4068
Experimental
- 🧪 Add `trl.experimental` Submodule by @August-murr in #4073
- [GRPO]: Sample from a Replay Buffer To Substitute Groups with 0 std. by @pramodith in #4060
- 🪙 [Experimental] Support GSPO-token by @hjh0119 in #3820
- 🌪️ [GFPO]: implement GFPO in GRPOTrainer by @Peter-Chou in #3989
- 🌾 [Experimental] BEMA for ref model by @qgallouedec in #3898
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4054
- Remove redundant 'None' from docstrings by @albertvillanova in #4058
- Hotfix: Add ParallelismConfig fallback for transformers with old accelerate by @albertvillanova in #4063
- Fix CI failure in slow GRPO test due to missing pillow dependency by @albertvillanova in #4064
- 💡 Fix type hint to `make_parser` function in multiple scripts by @qgallouedec in #4050
- Improve docstring of AlignPropTrainer by @albertvillanova in #4059
- ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in #4041
- Set Ruff src for first-party imports by @albertvillanova in #4074
- 🧪 Add `trl.experimental` Submodule by @August-murr in #4073
- 🌾 [Experimental] BEMA for ref model by @qgallouedec in #3898
- ✂️ [GRPO VLM] Update split sizes to generalize by @zucchini-nlp in #4032
- 🛠️ Fix CI by @qgallouedec in #4076
- 🐳 Docker update + Simplify Jobs doc by @qgallouedec in #3931
- Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
- Reviewed HF jobs updated docs by @sergiopaniego in #4088
- 🗑️ Remove deprecated `AlignPropTrainer`, `DDPOTrainer` and `IterativeSFTTrainer` by @qgallouedec in #4068
- ▶️ Add video to community tutorials by @qgallouedec in #4090
- Align slow tests with regular tests by @albertvillanova in #4085
- Add support for testing experimental features by @albertvillanova in #4082
- Community Tutorials design adaptation for videos by @sergiopaniego in #4095
- 🏞️ Context Parallelism benchmark guide by @sergiopaniego in #4075
- ⌨️ Pin num2words by @lewtun in #4094
- Add deprecation warnings to docstrings by @albertvillanova in #4083
- 📜 Convert `set` to `list` of tags by @qgallouedec in #4092
- 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in #4089
- ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour of `pad_token_id` by @qgallouedec in #4006
- 🌪️ [GFPO]: implement GFPO in GRPOTrainer by @Peter-Chou in #3989
- ℹ️ feat: Add NPU and XPU support for activation offloading by @zilongzheng in #4056
- ℹ️ Enable XPU for vLLM client by @jiqing-feng in #4031
- Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https://github.com/huggingface/trl/pull/4081