Releases: huggingface/trl
v0.28.0
Features
- [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in #4742
- Add retry strategy to vLLM Client for increased robustness by @apalmas-saifh in #4845
- Enable vLLM sleep mode for generation in Online DPO by @winglian in #4882
- Support tool call data in `is_conversational` by @qgallouedec in #4923
- [GRPO] Add parquet logging for completions with individual rewards by @qgallouedec in #4818
- Update wordle.py example with masking of env tokens by @sergiopaniego in #4895
- NeMo-Gym Integration by @cmunley1 in #4848
Experimental
- Refactor KTO coordinated with DPO [c/N]: Remove ref_model_init_kwargs by @albertvillanova in #4837
- Refactor KTO coordinated with DPO [e/N]: Remove label_pad_token_id by @albertvillanova in #4875
- Refactor KTO coordinated with DPO [d/N]: Remove base_model_attribute_name by @albertvillanova in #4862
- Fix type hint in `openenv/utils.py`: fallback for no vLLM installed case by @Datta0 in #4868
- Remove label_pad_token_id from experimental trainers by @albertvillanova in #4878
- GOLD training speed up by @141forever in #4888
- Remove ref_model_init_kwargs from experimental BCO by @albertvillanova in #4946
- Remove max_prompt_length from experimental PRM by @albertvillanova in #4963
- Remove max_prompt_length from experimental BCO by @albertvillanova in #4964
- Remove max_prompt_length from experimental CPO by @albertvillanova in #4965
- Remove max_prompt_length from experimental ORPO by @albertvillanova in #4966
- Remove padding_value from experimental CPO and use pad_token_id by @albertvillanova in #4962
Fixes
- Fix _patch_transformers_hybrid_cache for peft by @albertvillanova in #4844
- Refactor KTO [4/N]: Remove unused padding_value by @albertvillanova in #4839
- Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in #4852
- fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
- Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
- Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
- Fix import path for `get_open_port` based on vLLM version by @qgallouedec in #4883
- Fix RewardTrainer's results not reproducible by @liyc-ai in #4887
- `device_map` init consistency in GRPO/RLOO/KTO by @qgallouedec in #4909
- Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
- Fix PPO run_name parameter not taking effect by @mel3c in #4945
- Remove access to `warnings_issued` by @qgallouedec in #4960
- Revert change in GRPO from NeMo-Gym Integration by @qgallouedec in #4970
Documentation and Examples
- Add Nash Learning from Human Feedback paper to paper index by @kansalaman in #4860
- Update OpenEnv dependency to new version for hf jobs scripts by @sergiopaniego in #4843
- Enhance GRPO documentation with scaling notes by @javadtaghia in #4849
- Created new PTT integration docs as requested by @adityachallapally in #4907
- docs: add DoRA (2402.09353) to Paper Index by @billycrapediem in #4892
Deprecations
- Remove unused padding_value from BCO by @albertvillanova in #4846
- Remove deprecated parameters by @qgallouedec in #4847
- Deprecate parameters in `DPOConfig` by @qgallouedec in #4969
- Replace `warmup_ratio` with `warmup_steps` by @qgallouedec in #4983
CI Improvements
- Support triggering CI via push to ci-* branches by @albertvillanova in #4840
- Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in #4833
- Use pytest-datadir in CI tests by @albertvillanova in #4836
- Fix CI with dev dependencies: Mark Qwen3-VL tests as xfail by @albertvillanova in #4851
- Use pytest-datadir for accelerate config files by @albertvillanova in #4861
- Update transformer version checks and documentation for lr_scheduler_kwargs workaround by @qgallouedec in #4876
- Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` by @qgallouedec in #4823
- Mark ZeRO 2 as xfail in distributed tests due to current failure by @qgallouedec in #4885
- Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks by @qgallouedec in #4898
- Fix CI NotImplementedError for bfloat16 by @albertvillanova in #4902
- Fix CI AssertionError: Parameter has not changed by @albertvillanova in #4904
- Fix CI TypeError in llm-blender tests by @albertvillanova in #4919
- Fix CI AssertionError: assert not True by @albertvillanova in #4921
- Fix CI ValueError for 0 temperature by @albertvillanova in #4916
- Set model dtype to float32 in tests of trainers by @albertvillanova in #4924
- Set model dtype to float32 in experimental tests of trainers by @albertvillanova in #4925
- Add test for training with `compute_metrics` in `SFTTrainer` by @qgallouedec in #4950
- Add test for tool call data in `RewardTrainer` by @qgallouedec in #4959
- Add test for training with `compute_metrics` in `RewardTrainer` by @qgallouedec in #4958
- Fix test_train_with_chat_template_kwargs by @qgallouedec in #4971
Miscellaneous
- Update `CITATION.cff` by @qgallouedec in #4856
- Update generate_tiny_models.py: CohereForAI -> CohereLabs by @Michellehbn in #4877
- Refactor vLLM generation [1/N]: Extract vLLM generation by @albertvillanova in #4700
- Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` by @qgallouedec in #4911
- Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` by @qgallouedec in #4910
- Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer by @qgallouedec in #4913
- Remove gradient checkpointing option from various training scripts by @qgallouedec in #4905
- Remove chat template setup in dpo_vlm.py by @qgallouedec in #4906
- Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests by @qgallouedec in #4914
- Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models by @qgallouedec in #4912
- Require transformers<5 with PairRMJudge by @albertvillanova in #4926
- Move VLLMClient to generation module by @albertvillanova in #4928
- Fix profiling of VLLMGeneration.sync_weights by @albertvillanova in #4931
- Fix import statement for import_utils in vllm_client.py by @qgallouedec in #4932
- Set default top_k to 0 in VLLMClient by @albertvillanova in #4927
- Minor fix docs style by @albertvillanova in #4953
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4835
- Support triggering CI via push to ci-* branches by @albertvillanova in #4840
- Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in #4833
v0.27.2
What's Changed
- Remove access to `warnings_issued` by @qgallouedec in #4960
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
- Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in #4908
Full Changelog: v0.27.1...v0.27.2
v0.27.1
What's Changed
- Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in #4852
- fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in #4857
- Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in #4880
- Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in #4873
- Fix RewardTrainer's results not reproducible by @liyc-ai in #4887
New Contributors
- @kdubovikov made their first contribution in #4873
- @liyc-ai made their first contribution in #4887
Full Changelog: v0.27.0...v0.27.1
v0.27.0
Features
- Add `vllm_group_port` argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
- Preserve truncated tokens in BFD packing by @qgallouedec in #4632
- Support async reward functions and parallelize calls to reward functions by @pramodith in #4567
- RLOO supports async rewards. by @pramodith in #4718
- Support vLLM 0.12.0 by @jiqing-feng in #4117
- feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
- 🎭 Up to 50% less VRAM during forward with `forward_masked_logits` function by @qgallouedec in #4729
- [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
- Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811
- Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785
Experimental
- Move `AutoModelForCausalLMWithValueHead` and `AutoModelForSeq2SeqLMWithValueHead` to experimental by @qgallouedec in #4654
- Move DPODataCollatorWithPadding to `experimental.utils` by @qgallouedec in #4667
- Move `DataCollatorForChatML` to `experimental.utils` by @qgallouedec in #4668
- Move `add_bos_token_if_needed` and `add_eos_token_if_needed` to `experimental.utils` by @qgallouedec in #4674
- Move `truncate_right` and `SIMPLE_CHAT_TEMPLATE` to `experimental.utils` by @qgallouedec in #4677
- Move `prepare_model_for_kbit_training`, `enable_gradient_checkpointing`, `prepare_peft_model` to `experimental.utils` by @qgallouedec in #4704
- Move `get_reward` function to `experimental.utils` by @qgallouedec in #4683
- Remove experimental imports from testing_utils by @albertvillanova in #4727
- ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
- Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
- [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
- Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
- Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808
Fixes
- Accounting for case `num_generations_eval=1` in the calculation of the advantage by @qgallouedec in #4662
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
- Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in #4682
- Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
- Include `generation_config` for tiny model uploads by @qgallouedec in #4643
- Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
- Overwrite model default generation config used by model.generate by @albertvillanova in #4647
- Fix: handle multiple tool calls in `qwen3_schema` by @mattbui in #4709
- Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
- Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
- Monkey patch for `HybridCache` in Liger-Kernel with transformers v5 by @qgallouedec in #4798
- [fix] GRPOTrainer: proper access `args` by @carlyou in #4801
- Fix vllm compat patches to be applied only to affected versions by @albertvillanova in #4815
- fix bug when sft calc outputs.token_accuracy by @kaixuanliu in #4814
- fix xpu vllm client server by @jiqing-feng in #4780
Documentation and Examples
- docs: add RapidFire AI integration section to SFT Trainer by @kamran-rapidfireAI in #4661
- Fix environment image name for BrowserGym example script by @sergiopaniego in #4680
- Docs(`grpo_trainer.md`): Added Qwen SAPO details under `Loss Types` by @casinca in #4681
- [docs] Adds GRPO, RSO and LoRA to Paper Index by @SSusantAchary in #4441
- Enable zero3 init and 16-bit model saving for ds ulysses config by @edbeeching in #4701
- Set version to packaged one in notebooks by @sergiopaniego in #4648
- BrowserGym example for LLMs (no vision) by @sergiopaniego in #4696
- docs: Add RapidFire AI cross-references to DPO and GRPO trainer docs by @kamran-rapidfireAI in #4705
- [docs] Fix RapidFire AI position in documentation by @qgallouedec in #4715
- Add inference example to GRPO agent training notebook by @sergiopaniego in #4710
- Upload FunctionGemma notebook by @sergiopaniego in #4721
- Update agents notebook dependencies by @sergiopaniego in #4724
- Add uv/hf jobs support to OpenEnv scripts by @sergiopaniego in #4720
- Add GRPO QLoRA free notebook by @sergiopaniego in #4660
- Hotfix for browsergym openenv notebook by @sergiopaniego in #4740
- docs: fix "Good Second Issue" redirection link by @casinca in #4749
- [Docs] Add SRL (Supervised Reinforcement Learning) to Community Tutorials by @s23deepak in #4758
- Add LFM2.5 to GRPO notebook by @sergiopaniego in #4793
- Sudoku GRPO example script using TextArena by @sergiopaniego in #4762
- [EXAMPLES] Update wordle to new openenv release by @burtenshaw in #4791
- Update the typos in docs/source/grpo_trainer.md by @Tianyi-Billy-Ma in #4804
- Update examples to new OpenEnv version by @sergiopaniego in #4796
- Update GRPO example to use Qwen2.5 instead of Qwen2 by @BurnyCoder in #4803
Deprecations
- Remove deprecated functions and parameters by @qgallouedec in #4651
- Remove `MergeModelCallback` from import structure by @qgallouedec in #4664
- Remove `ChatMlSpecialTokens` by @qgallouedec in #4666
- Remove unused `_win_rate_completions_df` function from callbacks by @qgallouedec in #4672
- Deprecate max_prompt_length in RLOOTrainer by @albertvillanova in #4703
- Small fix on contributing docs by @murilo-cunha in #4753
- Remove `DbrxForCausalLM` support by @qgallouedec in #4799
CI Improvements
- Hotfix CI due to generation config by setting tests as xfail by @albertvillanova in #4657
- Upgrade GitHub Actions to latest versions by @salmanmkc in #4734
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #4733
- Include data type for tiny models and update tests by @qgallouedec in #4728
- Change tiny model dtype from float16 to bfloat16 to fix CUDA error by @albertvillanova in #4745
- Add revision override mechanism for testing tiny models by @albertvillanova in #4769
- Hotfix: Set float32 as default dtype for testing tiny models by @albertvillanova in #4770
- Hotfix CI with dev dependencies: xfail test_training_vlm_and_liger by @albertvillanova in #4777
- Add initial multi-GPU CI tests for distributed training by @qgallouedec in #4784
- Set dtype default to float32 by @albertvillanova in #4778
- Test FSDP2 by @qgallouedec in #4813
- Test ZeRO Stage 3 by @qgallouedec in #4821
- Hotfix CI main test...
v0.26.2
What's Changed
- Overwrite model default generation config used by model.generate by @albertvillanova in #4647
Full Changelog: v0.26.1...v0.26.2
v0.26.1
What's Changed
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
- Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in #4682
New Contributors
- @apalmas-saifh made their first contribution in #4663
Full Changelog: v0.26.0...v0.26.1
v0.26.0
Features
🕵️♂️ GRPO: Agent training
GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.
```python
from datasets import Dataset
from trl import GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)


def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards


trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()
```

by @qgallouedec in #4300
ScaleRL: Add CISPO Loss
The CISPO loss was first introduced in the MiniMax-M1 paper; the ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.
GRPOTrainer now supports the CISPO loss via `loss_type="cispo"` in the `GRPOConfig`.
by @pramodith in #4495
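For reference, a minimal sketch of enabling it (the `output_dir` value is just an illustrative placeholder):

```python
from trl import GRPOConfig

# Select the CISPO objective in place of the default GRPO loss.
training_args = GRPOConfig(output_dir="Qwen3-0.6B-CISPO", loss_type="cispo")
```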
Add vLLM quantization option for colocate
When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.
by @sergiopaniego in #4496
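A rough sketch of how this fits together, assuming the policy is loaded in 4-bit via `model_init_kwargs` (the config values here are illustrative, not prescriptive):

```python
from transformers import BitsAndBytesConfig
from trl import GRPOConfig

# Load the policy in 4-bit with bitsandbytes; in colocate mode, vLLM now
# mirrors the quantization instead of loading full-precision weights.
training_args = GRPOConfig(
    output_dir="Qwen3-0.6B-GRPO-4bit",
    use_vllm=True,
    vllm_mode="colocate",
    model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)
```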
Reasoning reward
TRL now includes a reasoning reward function:

```python
from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]

reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0]
```

Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.
```python
from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)
```

Add shuffle_dataset option to SFTTrainer
You can now shuffle the dataset in SFTTrainer by setting the `shuffle_dataset` argument to `True` in `SFTConfig`. This is useful when the dataset features high similarity between consecutive samples.

```python
from trl import SFTTrainer, SFTConfig

SFTConfig(shuffle_dataset=True)
```

by @qgallouedec in #4564
Add SAPO Loss in GRPO
Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.
You can now use the SAPO loss in GRPOTrainer by setting `loss_type="sapo"` in the `GRPOConfig`.
by @pramodith in #4600
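A minimal sketch, analogous to the CISPO example above (`output_dir` is illustrative):

```python
from trl import GRPOConfig

# Swap the hard-clipped objective for the soft-gated SAPO loss.
training_args = GRPOConfig(output_dir="Qwen3-0.6B-SAPO", loss_type="sapo")
```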
Other Features
- Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in #4452
- Add support for images inside tables with Trackio completions logging by @taha-yassine in #4505
- Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in #4516
- Add target_parameters to LoraConfig by @jonnyli1125 in #4536
- [SFT] Log mean token accuracy from Liger kernel by @kashif in #4302
- Add `num_generations_eval` parameter for efficient evaluation by @mingxuetian in #4458
- [GRPO] Sequence-level TIS & MIS by @LeonEricsson in #4530
- TRL supports vLLM 0.11 by @qgallouedec in #4633
- feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in #4638
Experimental
- Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in #4485
- Move judges to experimental submodule by @behroozazarkhalili in #4439
- Add MiniLLM Trainer by @t1101675 in #4504
- refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in #4470
- Move GKDTrainer to experimental module by @behroozazarkhalili in #4474
- Move NashMDTrainer to experimental module by @behroozazarkhalili in #4477
- Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in #4482
- [ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in #4480
- Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in #4483
- Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in #4473
- Move `WinRateCallback` to experimental by @qgallouedec in #4558
- Move tests for GSPOTokenTrainer to experimental by @qgallouedec in #4572
- Raise FutureWarning for classes moved to experimental by @albertvillanova in #4605
- Move MergeModelCallback to experimental by @qgallouedec in #4608
- Raise FutureWarning for trainer moved to experimental by @albertvillanova in #4620
- Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in #4628
- Refactor suppression of warning at experimental import by @albertvillanova in #4629
- 🚚 Move KTO to trl.experimental by @neha222222 in #4575
Fixes
- Buffer samples based on group level stds. by @pramodith in #4492
- Fix bugs in CISPO conditions by @pramodith in #4499
- `device_map` and `dtype` to `"auto"` by default by @qgallouedec in #4509
- MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in #4518
- [Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in #4500
- Rename `flash-attn` to `flash-attn2` by @qgallouedec in #4514
- fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in #4526
- Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in #4553
- fix+docs: `device_map=None` for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in #4551
- Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in #4571
- fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in #4579
- Fix 'generation_config' AttributeError by @albertvillanova in #4596
- Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in #4603
- Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in #4635
Documentation and Examples
- docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in #4430
- [DOCS] update and fix openenv by @burtenshaw in #4490
- Fix link to OpenEnv docs by @lukehinds in #4502
- Tweak description for vLLM sleep mode by @lewtun in #4506
- Paper Index: Change `num_completions` to `num_generations` by @pramodith in https://gi...
v0.25.1
What's Changed
- Replace accelerate logging with stdlib in CLI by @lewtun in #4512
- Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 by @qgallouedec in #4513
Full Changelog: v0.25.0...v0.25.1
v0.25.0
Features
- 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in #4296
- Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in #4335
- Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in #4361
- Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in #4359
- Support chat_template_kwargs by @pramodith in #4350
- GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in #4303
- Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in #4446
- 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in #4450
Experimental
- 🚚 Move BCO to `trl.experimental` by @qgallouedec in #4312
- 👑 [experimental] GOLD Trainer by @kashif in #4349
- Add PAPOTrainer for preference-based optimization by @SolarWindRider in #4334
- [GFPO] fix the GFPO loss calculation error caused by unmodified old_per_token_logps by @Peter-Chou in #4454
- 🕹️ Add rollout function for OpenEnv integration by @lewtun in #4310
Fixes
- [Activation-checkpointing] add tensor dedup and param offloading by @kashif in #4247
- Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in #4322
- Hotfix: Fall back to config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in #4324
- Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in #4348
- Fix: `add_generation_prompt=True` for conversational only by @qgallouedec in #4362
- Remove ignored max_length parameter from PRMTrainer data collator by @albertvillanova in #4355
- Fix add_generation_prompt arg for paged transformers in GRPO and RLOO trainers by @albertvillanova in #4370
- Fix GKD Liger memory spike by @qgallouedec in #4140
- Fix GRPO with replay buffer by inserting images in the prompt by @albertvillanova in #4391
- fix: Remove chat template setting from non-SFT trainer scripts by @behroozazarkhalili in #4437
- 🖼️ Fix reporting images with vLLM by @qgallouedec in #4476
Documentation and Examples
- Added SFT LoRA notebook by @sergiopaniego in #4244
- Update notebooks README with latest additions by @sergiopaniego in #4316
- Add notebooks to Examples docs and restructure by @sergiopaniego in #4317
- Highlight OpenEnv in landing docs by @sergiopaniego in #4327
- Update OpenEnv docs by @sergiopaniego in #4328
- Add OpenEnv blog to landing by @sergiopaniego in #4333
- 🗞️ Update "What's New" by @qgallouedec in #4338
- Update Reducing Memory Consumption guide with more details by @sergiopaniego in #4332
- Fixed links inside Tips in docs by @sergiopaniego in #4360
- 🔥 docs: Add RapidFire AI integration guide by @kamran-rapidfireAI in #4340
- Fix paper link for "Towards Efficient and Exact Optimization of Language Model Alignment" by @qgallouedec in #4409
- Migrate experimental trl feature docs by @ethanknights in #4411
- Update SFT QLoRA notebook with 14B model on free Colab by @sergiopaniego in #4336
- Create "Talks" subsection by @sergiopaniego in #4414
- Openenv wordle example by @burtenshaw in #4357
- docs: Remove outdated conversational dataset conversion guidance by @behroozazarkhalili in #4422
- docs: List all trainers that support Liger Kernel by @behroozazarkhalili in #4432
- Add On-Policy Distillation from thinking labs to paper index. by @pramodith in #4410
- Upload notebook with T4 selected by @sergiopaniego in #4449
- Removed outdated warning about batch contamination by @Harras3 in #4423
- Removed Sentiment Tuning Examples by @Harras3 in #4424
- docs: Remove outdated notebooks by @behroozazarkhalili in #4435
- docs: Move Multi-Adapter RL section to PEFT integration by @behroozazarkhalili in #4436
- Update `max_length` explanation for VLM in online trainers by @sergiopaniego in #4220
- Updated OpenEnv docs by @sergiopaniego in #4418
- add llasa-tutorial by @Deep-unlearning in #4456
Deprecations
- Replace deprecated AutoModelForVision2Seq with AutoModelForImageTextToText by @albertvillanova in #4353
- Replace deprecated list with tuple indexing in PPOTrainer by @albertvillanova in #4356
- Remove liger loss in favor of liger kernel by @sergiopaniego in #4364
- 🐍 Drop Python 3.9 by @qgallouedec in #4183
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4293
- Update links to docs in README to latest packaged version by @sergiopaniego in #4084
- 🧺 [4/N] Refactor `_generate` in GRPO/RLOO: Move `forward_kwargs` outside generation method by @qgallouedec in #4154
- Fix missing CI slow tests: ImportError: vLLM is not installed by @albertvillanova in #4304
- Added SFT LoRA notebook by @sergiopaniego in #4244
- ⚰️ Remove deprecated by @qgallouedec in #4301
- Silence TRL experimental warnings in CI by @albertvillanova in #4307
- Filter expected setup_chat_format deprecation warning in CI by @albertvillanova in #4306
- [Activation-checkpointing] add tensor dedup and param offloading by @kashif in #4247
- Remove parameterized as test extra dependency by @albertvillanova in #4315
- Update notebooks README with latest additions by @sergiopaniego in #4316
- 🚚 Move BCO to `trl.experimental` by @qgallouedec in #4312
- 🧺 [5/N] Refactor `_generate` in GRPO/RLOO: Insert images in the prompt by @qgallouedec in #4155
- 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in #4296
- Replace unittest skipTest from transformers with pytest.skip by @albertvillanova in #4297
- Add notebooks to Examples docs and restructure by @sergiopaniego in #4317
- Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in #4322
- 🕹️ Add rollout function for OpenEnv integration by @lewtun in #4310
- Highlight OpenEnv in landing docs by @sergiopaniego in #4327
- Update OpenEnv docs by @sergiopaniego in #4328
- Move BCO tests to tests/experimental by @albertvillanova in #4326
- Hotfix: Fall back to config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in #4324
- Add OpenEnv blog to landing by @sergiopaniego in #4333
- 🗞️ Update "What's New" by @qgallouedec in #4338
- Update Reducing Memory Consumption guide with more details by @sergiopaniego in #4332
- Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in #4335
- [vllm] update comment about communication group host ip by @kashif in #4337
- Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in #4348
- Fixed links inside Tips in docs by @sergiopaniego in #4360
- Fix CI issue for vlm_gemma_3n model by @kaixuanliu in #4278
- Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in #4361
v0.24.0
Features
- Add accuracy reward by @pramodith in #4270
- Add support for `token_type_ids` in `DPOTrainer` by @aweers in #4285
- 💰 `RichProgressCallback` enhancement by @qgallouedec in #4245
- Include `chat_template_kwargs` in `apply_chat_template` by @cmpatino in #4233
- 🏷️ Account for `token_type_ids` in `DataCollatorForVisionLanguageModeling` by @qgallouedec in #4190
- 🎨 Support mixing image+text and text-only examples by @qgallouedec in #4203
- 🎁 `RewardTrainer` refactor by @qgallouedec in #4093
- 🎞️ Support sequence classification models in `clone_chat_template` by @qgallouedec in #4097
- ✨ Add logging for training completion and model saving in training scripts by @qgallouedec in #4048
- 🖨️ Print rich table for messages by @qgallouedec in #4160
- 😴 Add `vllm_enable_sleep_mode` to RLOO Trainer by @sergiopaniego in #4107
- 📽 Multi image support for GRPO/RLOO by @qgallouedec in #4113
- 👁️ Add VLM support to RLOO trainer by @behroozazarkhalili in #4067
- ℹ️ Enable XPU for vLLM client by @jiqing-feng in #4031
- 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in #4089
Fixes
- [Online-DPO] fix the completion_len == max_new_tokens crash by @kashif in #4193
- Fix entropy and accuracy calculation for prompt_tuning techniques. by @pramodith in #4196
- Fix prompt-completion labeling with add_generation_prompt and warning by @behroozazarkhalili in #4201
- 🌡️ Have vLLM return processed (temperature scaled) log probs by @YonatanGideoni in #4163
- Fix handling of f_divergence_type in DPO by @albertvillanova in #4171
- ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in #4170
- Pass required token_type_ids by @albertvillanova in #4148
- 👩🦯 Fix usage of VLM using text only by @SamuelBarryCS in #4080
- ⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely by @kashif in #4057
- 📤 Fix a dataset loading bug in scripts by @singing-cat in #4124
- 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in #4087
- [GKD] Fix `batchmean` reduce op in GKDTrainer's loss by @cmpatino in #4105
- Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in #4081
- Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
- ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in #4041
Documentation
- Remove logging.md: trainer-specific metrics documentation by @behroozazarkhalili in #4269
- Remove using_llama_models.md: outdated Llama2-specific documentation by @behroozazarkhalili in #4268
- Remove how_to_train.md: outdated training FAQ by @behroozazarkhalili in #4267
- Add Qwen3-VL notebooks (SFT, GRPO) by @sergiopaniego in #4275
- Remove obsolete research_projects directory by @behroozazarkhalili in #4243
- Add Efficient Online Training with GRPO and vLLM in TRL to community tutorials by @sergiopaniego in #4219
- Add trainers taxonomy to docs by @sergiopaniego in #4195
- Updated vLLM integration guide by @sergiopaniego in #4162
- [DOCS] Lora without regret by @burtenshaw in #4181
- Add docstring for OnlineTrainerState by @albertvillanova in #4166
- ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour of `pad_token_id` by @qgallouedec in #4006
- 🏞️ Context Parallelism benchmark guide by @sergiopaniego in #4075
- ▶️ Add video to community tutorials by @qgallouedec in #4090
- Reviewed HF jobs updated docs by @sergiopaniego in #4088
Deprecations
- Deprecate `BestOfNSampler` by @qgallouedec in #4291
- Raise deprecation warning for Python 3.9 by @albertvillanova in #4226
- Deprecate unused dataset_formatting module by @behroozazarkhalili in #4242
- Warnings pointing to RFC by @qgallouedec in #4224
- 🅰️ Remove apex by @qgallouedec in #4139
- 🗑️ Remove deprecated `AlignPropTrainer`, `DDPOTrainer` and `IterativeSFTTrainer` by @qgallouedec in #4068
Experimental
- 🧪 Add `trl.experimental` Submodule by @August-murr in #4073
- [GRPO]: Sample from a Replay Buffer To Substitute Groups with 0 std. by @pramodith in #4060
- 🪙 [Experimental] Support GSPO-token by @hjh0119 in #3820
- 🌪️ [GFPO]: implement GFPO in GRPOTrainer by @Peter-Chou in #3989
- 🌾 [Experimental] BEMA for ref model by @qgallouedec in #3898
What's Changed
- ⬆️ Bump dev version by @qgallouedec in #4054
- Remove redundant 'None' from docstrings by @albertvillanova in #4058
- Hotfix: Add ParallelismConfig fallback for transformers with old accelerate by @albertvillanova in #4063
- Fix CI failure in slow GRPO test due to missing pillow dependency by @albertvillanova in #4064
- 💡 Fix type hint to `make_parser` function in multiple scripts by @qgallouedec in #4050
- Improve docstring of AlignPropTrainer by @albertvillanova in #4059
- ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in #4041
- Set Ruff src for first-party imports by @albertvillanova in #4074
- 🧪 Add `trl.experimental` Submodule by @August-murr in #4073
- 🌾 [Experimental] BEMA for ref model by @qgallouedec in #3898
- ✂️ [GRPO VLM] Update split sizes to generalize by @zucchini-nlp in #4032
- 🛠️ Fix CI by @qgallouedec in #4076
- 🐳 Docker update + Simplify Jobs doc by @qgallouedec in #3931
- Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
- Reviewed HF jobs updated docs by @sergiopaniego in #4088
- 🗑️ Remove deprecated `AlignPropTrainer`, `DDPOTrainer` and `IterativeSFTTrainer` by @qgallouedec in #4068
- ▶️ Add video to community tutorials by @qgallouedec in #4090
- Align slow tests with regular tests by @albertvillanova in #4085
- Add support for testing experimental features by @albertvillanova in #4082
- Community Tutorials design adaptation for videos by @sergiopaniego in #4095
- 🏞️ Context Parallelism benchmark guide by @sergiopaniego in #4075
- ⌨️ Pin num2words by @lewtun in #4094
- Add deprecation warnings to docstrings by @albertvillanova in #4083
- 📜 Convert `set` to `list` of tags by @qgallouedec in #4092
- 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in #4089
- ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour of `pad_token_id` by @qgallouedec in #4006
- 🌪️ [GFPO]: implement GFPO in GRPOTrainer by @Peter-Chou in #3989
- ℹ️ feat: Add NPU and XPU support for activation offloading by @zilongzheng in #4056
- ℹ️ Enable XPU for vLLM client by @jiqing-feng in #4031
- Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https://github.com/huggingface/trl/pull/4081