Trl on AI Knowledge Base

Asynchronous GRPO

BCO Trainer

BEMA for Reference Model

Callbacks

Chat template utilities

Chat Templates

Command Line Interfaces (CLIs)

Community Tutorials

CPO Trainer

Data Utilities

Dataset formats and types

DeepSpeed Integration

Distillation Trainer

Distributing Training

DPO Trainer

Examples

Experimental

General Online Logit Distillation (GOLD) Trainer

Generalized Knowledge Distillation Trainer

GFPO

GRPO Trainer

GRPO With Replay Buffer

GSPO-token

Installation

Kernels Hub Integration and Usage

KTO Trainer

Liger Kernel Integration

LoRA Without Regret

MiniLLM Trainer

Nash-MD Trainer

NeMo Gym Integration

Online DPO Trainer

OpenEnv Integration for Training LLMs with Environments

OpenReward Integration for Training LLMs with Environments

ORPO Trainer

Paper Index

PAPO Trainer

PEFT Integration

Post-Training Toolkit Integration

PPO Trainer

PRM Trainer

Quickstart

RapidFire AI Integration

Reducing Memory Usage

Reward Functions

Reward Modeling

RLOO Trainer

Scripts Utilities

SDFT

SDPO

SFT Trainer

Supervised fine-tuning (SFT) is the simplest and most common way to adapt a model to your data, and SFTTrainer is where most TRL users begin. Pay close attention to dataset format: the trainer accepts both language-modeling and prompt-completion datasets and automatically applies the chat template to conversational data, so a dataset whose format does not match what you intended can train without errors and quietly degrade quality. Two gotchas are worth remembering: completion-only loss is on by default for prompt-completion datasets, and training adapters with PEFT usually calls for a higher learning rate, around 1e-4. Read the TRL overview first, and pair this page with the PEFT LoRA guide when you train adapters.
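To make the format and loss-masking points concrete, here is a minimal sketch in plain Python. It does not call TRL and is not TRL's implementation: `detect_format` and `completion_only_labels` are hypothetical helpers, the column names (`text`, `messages`, `prompt`, `completion`) are assumed to follow the usual dataset conventions, and the token IDs are made up for illustration.

```python
# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def detect_format(example: dict) -> str:
    """Classify one dataset row by its column names (hypothetical helper)."""
    if "prompt" in example and "completion" in example:
        return "prompt-completion"
    if "text" in example or "messages" in example:
        return "language-modeling"
    raise ValueError(f"unrecognized columns: {sorted(example)}")


def completion_only_labels(prompt_ids, completion_ids):
    """Build labels so that only completion tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    # Prompt positions are masked out; completion positions keep their IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


if __name__ == "__main__":
    row = {"prompt": "What is 2 + 2?", "completion": " 4"}
    print(detect_format(row))
    # Made-up token IDs: three prompt tokens, two completion tokens.
    print(completion_only_labels([11, 12, 13], [21, 22]))
```

Checking a few rows this way before training is a cheap guard against the silent mismatch described above: a row that lands in the wrong category, or labels that mask nothing, are visible immediately.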

Speeding Up Training

SSD

TPO Trainer

Trackio Integration

Training customization

Training with Jobs

TRL - Transformers Reinforcement Learning

This overview maps the whole TRL post-training stack (SFT, reward modeling, DPO, PPO, GRPO, and more), which makes it the decision page for choosing the trainer that fits your alignment goal. Focus on the taxonomy of online versus offline methods, since that split drives compute cost and data requirements more than any single hyperparameter. TRL integrates tightly with Transformers and PEFT, so you can train adapters rather than full models. Start here, then go to the SFT trainer, the most common starting point for instruction tuning.
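The online versus offline split can be sketched as a small lookup table. This is an illustrative grouping, not something taken from the TRL codebase: it assumes the usual reading that online methods sample completions from the current policy during training, while offline methods learn from a fixed dataset. The names `TRAINER_FAMILY`, `needs_generation`, and `by_family` are hypothetical.

```python
# Illustrative grouping of a few trainers named in this index.
# "online": completions are generated by the policy during training.
# "offline": training consumes a fixed, pre-collected dataset.
TRAINER_FAMILY = {
    "SFT": "offline",
    "DPO": "offline",
    "KTO": "offline",
    "ORPO": "offline",
    "PPO": "online",
    "GRPO": "online",
    "RLOO": "online",
    "Online DPO": "online",
}


def needs_generation(trainer: str) -> bool:
    """Return True when the method generates samples inside the training loop."""
    return TRAINER_FAMILY[trainer] == "online"


def by_family(family: str) -> list:
    """List the trainers in one family, sorted by name."""
    return sorted(name for name, fam in TRAINER_FAMILY.items() if fam == family)


if __name__ == "__main__":
    print(by_family("online"))
    print(needs_generation("DPO"))
```

The practical consequence is the one the overview stresses: an online method pays for generation at every step and needs prompts plus a reward signal, whereas an offline method needs labeled or preference data up front but no sampling during training.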

Unsloth Integration

Usage Stats Collection

Use model after training

vLLM Integration

XPO Trainer
