Reward Functions
This module contains some useful reward functions, primarily intended for use with the GRPOTrainer and RLOOTrainer.
accuracy_reward
trl.rewards.accuracy_reward
Reward function that checks if the completion matches the ground truth.
- If both gold and prediction are parseable → use math verification.
- If gold is not parseable → return None to skip the example.
Example:
>>> from trl.rewards import accuracy_reward
>>> solutions = [r"\frac{1}{3}", r"\frac{1}{3}"]
>>> completions = [
... [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{3}}"}],
... [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{2}}"}],
... ]
>>> accuracy_reward(completions, solutions)
[1.0, 0.0]
Parameters:
completions (list[list[dict[str, str]]]) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.
solution (list[str]) : List of the raw-text solutions to the questions/problems/prompts.
log_extra (callable, optional) : Callable to log extra columns to the completions table, provided automatically by the trainer. Defaults to None to allow calling the function directly outside of a trainer (e.g., for testing).
**kwargs : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.
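The calling convention described above (a list of single-message completions, a list of solutions, and unused keyword arguments) can be illustrated without the library. The following is a minimal sketch, not the implementation of accuracy_reward: exact_match_reward is a hypothetical function that swaps math verification for plain string comparison, and it treats an empty gold string as "not parseable" purely for illustration.

```python
def exact_match_reward(completions, solution, **kwargs):
    """Hypothetical reward: 1.0 on exact string match, None when the gold answer is unusable."""
    rewards = []
    for completion, gold in zip(completions, solution):
        if not gold.strip():
            # Assumption for this sketch: an empty gold answer counts as
            # "not parseable", so the example is skipped by returning None.
            rewards.append(None)
            continue
        # Each completion is a list holding one message dict with a "content" key.
        text = completion[0]["content"]
        rewards.append(1.0 if text.strip() == gold.strip() else 0.0)
    return rewards


completions = [
    [{"role": "assistant", "content": "42"}],
    [{"role": "assistant", "content": "41"}],
    [{"role": "assistant", "content": "7"}],
]
solution = ["42", "42", ""]

# Extra keyword arguments (here a made-up column) are accepted and ignored.
rewards = exact_match_reward(completions, solution, extra_column=["a", "b", "c"])
print(rewards)  # [1.0, 0.0, None]
```

Accepting `**kwargs` lets a trainer pass additional dataset columns without the function having to know about them.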
reasoning_accuracy_reward
trl.rewards.reasoning_accuracy_reward
Reward function that removes the reasoning content and checks if the final answer matches the ground truth.
- If both gold and prediction are parseable → use math verification.
- If gold is not parseable → return None to skip the example.
Example:
>>> from trl.rewards import reasoning_accuracy_reward
>>> reasoning_delimiters = ["</think>"]
>>> solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
>>> completions = [
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
...         }
...     ],
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
...         }
...     ],
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
...         }
...     ],
... ]
>>> reasoning_accuracy_reward(completions, solutions, reasoning_delimiters=reasoning_delimiters)
[1.0, 0.0, 0.0]
Parameters:
completions (list[list[dict[str, str]]]) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.
solution (list[str]) : List of the raw-text solutions to the questions/problems/prompts.
reasoning_delimiters (list[str], optional) : List of strings indicating where the reasoning content ends. The final answer is assumed to be after the last occurrence of any of these delimiters. If None, defaults to ["</think>"].
log_extra (callable, optional) : Callable to log extra columns to the completions table, provided automatically by the trainer. Defaults to None to allow calling the function directly outside of a trainer (e.g., for testing).
**kwargs : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.
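To make the role of reasoning_delimiters concrete, here is a minimal sketch of the "text after the last occurrence of any delimiter" rule. The helper extract_final_answer is hypothetical and is not part of trl.rewards; returning None when no delimiter is present is an assumption of this sketch, chosen to mirror the example in which a completion without a closing delimiter earns no reward.

```python
def extract_final_answer(text, reasoning_delimiters=None):
    """Hypothetical helper: return the text after the last delimiter, or None if none occurs."""
    if reasoning_delimiters is None:
        reasoning_delimiters = ["</think>"]  # default stated in the parameter description

    cut = -1
    for delimiter in reasoning_delimiters:
        index = text.rfind(delimiter)  # last occurrence of this delimiter
        if index != -1:
            # Keep the position that lies furthest into the text.
            cut = max(cut, index + len(delimiter))

    if cut == -1:
        return None  # reasoning never ended, so there is no final answer to score
    return text[cut:].strip()


finished = extract_final_answer("<think> Reasoning content </think> The final answer is 3")
unfinished = extract_final_answer("<think> Reasoning with a partial answer 3 but no final answer")
print(finished)    # The final answer is 3
print(unfinished)  # None
```

Only the extracted tail would then be compared with the ground truth, which is why an answer that appears solely inside the reasoning does not count.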
think_format_reward
trl.rewards.think_format_reward
Reward function that checks if the reasoning process is enclosed within "<think>" and "</think>" tags. The
function returns a reward of 1.0 if the format is correct, otherwise 0.0.
Example:
>>> from trl.rewards import think_format_reward
>>> completions = [
...     [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
...     [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
... ]
>>> think_format_reward(completions)
[1.0, 0.0]
Parameters:
completions (list[list[dict[str, str]]]) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.
**kwargs : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.
Returns:
list[float]
A list of rewards, where each reward is 1.0 if the completion matches the expected format, otherwise 0.0.
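A format check of this kind can be sketched with a regular expression. The pattern below is an assumption for illustration and may differ from what the library uses: it requires the completion to start with a `<think>` tag, forbids a second opening tag, and requires a closing `</think>` tag before the answer. The name check_think_format is hypothetical.

```python
import re

# Assumed pattern: opening tag at the very start, no further opening tag
# anywhere later, then a closing tag followed by arbitrary answer text.
THINK_PATTERN = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)


def check_think_format(completions, **kwargs):
    """Hypothetical re-implementation: 1.0 if the completion matches the pattern, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if THINK_PATTERN.match(content) else 0.0 for content in contents]


completions = [
    [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
    [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
]
format_rewards = check_think_format(completions)
print(format_rewards)  # [1.0, 0.0]
```

The `re.DOTALL` flag lets `.` match newlines, so reasoning that spans several lines is still recognized.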
get_soft_overlong_punishment
trl.rewards.get_soft_overlong_punishment
Returns a reward function that penalizes overlong completions. Completions within the length budget receive 0, so shorter completions are not rewarded. Reference: Eq. (13) from the DAPO paper (https://huggingface.co/papers/2503.14476)
$$
R_{\text{length}}(y) = \begin{cases}
0, & |y| \le L_{\max} - L_{\text{cache}} \\
\dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}}, & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\
-1, & L_{\max} < |y|
\end{cases}
$$
Example:
from trl.rewards import get_soft_overlong_punishment
soft_overlong_punishment = get_soft_overlong_punishment(max_completion_len=100, soft_punish_cache=20)
completion_ids = [[1] * 90] # simulating a completion with 90 tokens. 90 is between 80 and 100.
rewards = soft_overlong_punishment(completion_ids)
print(rewards)  # [-0.5]
Parameters:
max_completion_len (int) : Maximum length of the completion, $L_{\max}$.
soft_punish_cache (int) : Minimum length of the completion, $L_{\text{cache}}$. If set to 0, no minimum length is applied.
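The piecewise penalty from Eq. (13) of the DAPO paper has three regimes: 0 up to $L_{\max} - L_{\text{cache}}$, a linear ramp down to -1 between that threshold and $L_{\max}$, and a flat -1 beyond $L_{\max}$. The sketch below evaluates it for single lengths in plain Python. The function soft_overlong_penalty is a hypothetical standalone helper written for illustration, not the library's code.

```python
def soft_overlong_penalty(length, max_completion_len, soft_punish_cache):
    """Hypothetical standalone version of the piecewise length penalty."""
    threshold = max_completion_len - soft_punish_cache
    if length <= threshold:
        return 0.0  # within budget: no penalty and no bonus
    if length <= max_completion_len:
        # Linear ramp from 0 at the threshold down to -1 at the maximum length.
        return (threshold - length) / soft_punish_cache
    return -1.0  # beyond the maximum length: full penalty


lengths = [50, 80, 90, 100, 120]
penalties = [soft_overlong_penalty(n, max_completion_len=100, soft_punish_cache=20) for n in lengths]
print(penalties)  # [0.0, 0.0, -0.5, -1.0, -1.0]
```

With `max_completion_len=100` and `soft_punish_cache=20`, a 90-token completion sits halfway through the ramp, which matches the -0.5 in the example above.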