RLOO Trainer

Overview

TRL supports the RLOO Trainer for training language models, as described in the paper Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs by Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün and Sara Hooker.

The abstract from the paper is the following:

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed “RL-free” methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics enables benefiting from online RL optimization at low cost.

This post-training method was contributed by Costa Huang and later refactored by Shirin Yamani.
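
For intuition about the name: RLOO stands for REINFORCE Leave-One-Out. Several completions are sampled for each prompt, and each completion's reward is compared against the mean reward of the other completions for that same prompt. The sketch below illustrates only that baseline in plain Python. `rloo_advantages` is a hypothetical helper written for this illustration, not part of TRL, and the trainer's actual computation may differ (for example, in how rewards are normalized or how a KL penalty is applied).

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k completions sampled from one prompt.

    Hypothetical illustration only; not TRL's implementation.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("a leave-one-out baseline needs at least 2 completions")
    total = sum(rewards)
    # The baseline for completion i is the mean reward of the other k - 1
    # completions; the advantage is the reward minus that baseline.
    return [r - (total - r) / (k - 1) for r in rewards]


# Made-up rewards for four completions of a single prompt.
advantages = rloo_advantages([1.0, 0.0, 0.5, 0.5])
```

Because each baseline is built from the other samples only, it does not depend on the completion it is applied to, and no separately learned value model is needed to reduce variance.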

Quick start

This example shows how to train a model with the RLOO method. We train a Qwen 0.5B Instruct model on the prompts from the DeepMath-103K dataset. You can browse the dataset here: