Quickstart
TRL is a comprehensive library for post-training foundation models using techniques such as Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).
Quick Examples#
Get started instantly with TRL’s most popular trainers. Each example uses compact models for quick experimentation.
Supervised Fine-Tuning#
from trl import SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()

Group Relative Policy Optimization#
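The example in this section passes a ready-made reward function, `accuracy_reward`, as `reward_funcs`. A custom reward can be sketched as a plain Python callable. The sketch assumes completions arrive as a list of plain strings and that the callable returns one float per completion. The function name `boxed_format_reward` and the `\boxed{...}` convention are hypothetical choices for illustration, not part of TRL.

```python
import re

# Hypothetical reward function: the name and the scoring rule are illustrative only.
# Assumption: `completions` is a list of plain-text strings, one per sampled completion.
def boxed_format_reward(completions, **kwargs):
    # Match a literal "\boxed{...}" with at least one character inside the braces.
    pattern = re.compile(r"\\boxed\{.+?\}")
    # One score per completion: 1.0 if a boxed answer is present, else 0.0.
    return [1.0 if pattern.search(text) else 0.0 for text in completions]

rewards = boxed_format_reward(["The answer is \\boxed{42}.", "I am not sure."])
print(rewards)  # [1.0, 0.0]
```

A function of this shape could be passed as `reward_funcs=boxed_format_reward`. With conversational datasets the completions may be lists of messages rather than strings, in which case the text would need to be extracted first.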
from trl import GRPOTrainer
from datasets import load_dataset
from trl.rewards import accuracy_reward

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
    reward_funcs=accuracy_reward,
)
trainer.train()

Direct Preference Optimization#
from trl import DPOTrainer
from datasets import load_dataset

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()

Reward Modeling#
from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()

Command Line Interface#
Skip the code entirely - train directly from your terminal:
# SFT: Fine-tune on instructions
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara

# DPO: Align with preferences
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized

# Reward: Train a reward model
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized

What’s Next?#
📚 Learn More#
- SFT Trainer - Complete SFT guide
- DPO Trainer - Preference alignment
- GRPO Trainer - Group relative policy optimization
🚀 Scale Up#
- Distributed Training - Multi-GPU setups
- Memory Optimization - Efficient training
- PEFT Integration - LoRA and QLoRA
💡 Examples#
- Example Scripts - Production-ready code
- Community Tutorials - External guides
Troubleshooting#
Out of Memory?#
Reduce batch size and enable optimizations:
from trl import SFTConfig, DPOConfig, GRPOConfig

# SFT
training_args = SFTConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
)

# DPO
training_args = DPOConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
)

# GRPO
training_args = GRPOConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
    num_generations=4,  # Reduce from default 8 (GRPO generates num_generations completions per prompt)
    max_completion_length=256,  # Tune based on task; longer sequences cost more memory
)

Loss not decreasing?#
Try adjusting the learning rate:
training_args = SFTConfig(learning_rate=2e-5)  # Good starting point

For more help, open an issue on GitHub.
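The out-of-memory settings above lower `per_device_train_batch_size` while raising `gradient_accumulation_steps`. The arithmetic behind that trade can be sketched as follows. The helper `effective_batch_size` and the "before" configuration (batch size 8, no accumulation) are hypothetical, chosen only to show that the product stays the same.

```python
# Hypothetical helper, not a TRL function: samples contributing to one optimizer step.
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices=1):
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Assumed starting point: 8 samples per device, no accumulation.
before = effective_batch_size(per_device_train_batch_size=8, gradient_accumulation_steps=1)
# Settings from the troubleshooting snippet: 1 sample per device, 8 accumulation steps.
after = effective_batch_size(per_device_train_batch_size=1, gradient_accumulation_steps=8)

print(before, after)  # 8 8
```

Each optimizer step still sees the same number of samples, but only one sample's activations are held in memory at a time, at the cost of more forward and backward passes per step.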
Link last verified: June 7, 2026.
Source: TRL Docs