Reward Functions


This module provides reward functions intended primarily for use with the GRPOTrainer and RLOOTrainer.

accuracy_reward

Reward function that checks if the completion matches the ground truth.

  • If both gold and prediction are parseable → use math verification.
  • If gold is not parseable → return None to skip the example.

Example:

>>> from trl.rewards import accuracy_reward

>>> solutions = [r"\frac{1}{3}", r"\frac{1}{3}"]
>>> completions = [
...     [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{3}}"}],
...     [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{2}}"}],
... ]
>>> accuracy_reward(completions, solutions)
[1.0, 0.0]

Parameters:

completions (list[list[dict[str, str]]]) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.

solution (list[str]) : List of the raw-text solutions to the questions/problems/prompts.

log_extra (callable, optional) : Callable to log extra columns to the completions table, provided automatically by the trainer. Defaults to None to allow calling the function directly outside of a trainer (e.g., for testing).

**kwargs : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.
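The parameters above imply a calling convention: a list of single-message completions, a parallel list of solutions, and extra keyword arguments that are accepted but ignored. The sketch below illustrates only that convention. It is not accuracy_reward: the name `exact_match_reward` is hypothetical, plain string comparison stands in for math verification, and an empty solution stands in for an unparseable gold answer.

```python
def exact_match_reward(completions, solution, **kwargs):
    """Hypothetical reward: 1.0 on an exact (whitespace-stripped) match, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, solution):
        if not gold.strip():
            # Assumption: an empty gold answer plays the role of "not parseable",
            # so the example is skipped by returning None.
            rewards.append(None)
            continue
        # Each completion is a list holding one message dict with a "content" key.
        text = completion[0]["content"]
        rewards.append(1.0 if text.strip() == gold.strip() else 0.0)
    return rewards


completions = [
    [{"role": "assistant", "content": "42"}],
    [{"role": "assistant", "content": "41"}],
    [{"role": "assistant", "content": "7"}],
]
solution = ["42", "42", ""]

# Unused keyword arguments (here a made-up `prompts` column) are simply ignored.
print(exact_match_reward(completions, solution, prompts=["a", "b", "c"]))
# [1.0, 0.0, None]
```

Accepting `**kwargs` is what lets a function with this shape tolerate additional arguments without raising a TypeError.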

reasoning_accuracy_reward

Reward function that removes the reasoning content and checks if the final answer matches the ground truth.

  • If both gold and prediction are parseable → use math verification.
  • If gold is not parseable → return None to skip the example.

Example:

>>> from trl.rewards import reasoning_accuracy_reward

>>> reasoning_delimiters = ["</think>"]
>>> solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
>>> completions = [
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
...         }
...     ],
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
...         }
...     ],
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
...         }
...     ],
... ]
>>> reasoning_accuracy_reward(completions, solutions, reasoning_delimiters=reasoning_delimiters)
[1.0, 0.0, 0.0]

Parameters:

completions (list[list[dict[str, str]]]) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.

solution (list[str]) : List of the raw-text solutions to the questions/problems/prompts.

reasoning_delimiters (list[str], optional) : List of strings indicating where the reasoning content ends. The final answer is assumed to be after the last occurrence of any of these delimiters. If None, defaults to ["</think>"].

log_extra (callable, optional) : Callable to log extra columns to the completions table, provided automatically by the trainer. Defaults to None to allow calling the function directly outside of a trainer (e.g., for testing).

**kwargs : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.
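The rule that the final answer follows the last occurrence of any delimiter can be sketched in a few lines. The helper name `strip_reasoning` is hypothetical and this is not the library's code. It also assumes that a completion containing no delimiter has no final answer, which is consistent with the third completion in the example above scoring 0.0.

```python
def strip_reasoning(text, reasoning_delimiters=None):
    """Return the text after the last delimiter, or None if no delimiter occurs."""
    if reasoning_delimiters is None:
        reasoning_delimiters = ["</think>"]  # documented default
    last_end = -1
    for delimiter in reasoning_delimiters:
        index = text.rfind(delimiter)  # last occurrence of this delimiter
        if index != -1:
            last_end = max(last_end, index + len(delimiter))
    if last_end == -1:
        return None  # assumption: reasoning never ended, so there is no answer
    return text[last_end:]


print(strip_reasoning(r"<think> Reasoning </think> The final answer is \boxed{1}"))
print(strip_reasoning(r"<think> Reasoning with \boxed{1} but no final answer"))
```

The first call prints the text after the closing delimiter and the second prints None. Only the surviving text would then be passed on to answer verification, so a boxed value that appears inside the reasoning is never scored.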

think_format_reward

Reward function that checks if the reasoning process is enclosed within "<think>" and "</think>" tags. The function returns a reward of 1.0 if the format is correct, otherwise 0.0.

Example:

>>> from trl.rewards import think_format_reward

>>> completions = [
...     [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
...     [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
... ]
>>> think_format_reward(completions)
[1.0, 0.0]

Parameters:

completions (list[list[dict[str, str]]]) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.

**kwargs : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.

Returns:

list[float]

A list of rewards, where each reward is 1.0 if the completion matches the expected format, otherwise 0.0.
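A format check of this kind can be approximated with a regular expression. The sketch below assumes that "correct format" means the text starts with a single "<think>" block that is closed by "</think>". The function name `think_format_sketch` and the pattern are illustrative assumptions and may differ from what the library actually uses.

```python
import re

# Assumed pattern: the text starts with <think>, no second <think> follows,
# and a closing </think> appears somewhere after the reasoning.
THINK_PATTERN = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)


def think_format_sketch(completions, **kwargs):
    """Hypothetical check: 1.0 if the completion matches THINK_PATTERN, else 0.0."""
    texts = [completion[0]["content"] for completion in completions]
    return [1.0 if THINK_PATTERN.match(text) else 0.0 for text in texts]


completions = [
    [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
    [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
]
print(think_format_sketch(completions))  # [1.0, 0.0]
```

re.DOTALL lets `.` match newlines, which matters because reasoning text usually spans several lines.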

get_soft_overlong_punishment

Builds a reward function that penalizes overlong completions without rewarding shorter ones. Reference: Eq. (13) from the DAPO paper (https://huggingface.co/papers/2503.14476)

$$ R_{\text{length}}(y) = \begin{cases} 0, & |y| \le L_{\max} - L_{\text{cache}} \\ \dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}}, & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\ -1, & L_{\max} < |y| \end{cases} $$

Example:

from trl.rewards import get_soft_overlong_punishment

soft_overlong_punishment = get_soft_overlong_punishment(max_completion_len=100, soft_punish_cache=20)
completion_ids = [[1] * 90]  # simulating a completion with 90 tokens. 90 is between 80 and 100.
rewards = soft_overlong_punishment(completion_ids)
print(rewards)  # [-0.5]

Parameters:

max_completion_len (int) : Maximum length of the completion, \( L_{\max} \).

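soft_punish_cache (int) : Length of the soft punishment interval, \( L_{\text{cache}} \). Completions longer than \( L_{\max} - L_{\text{cache}} \) tokens start to be penalized. If set to 0, there is no soft interval and only completions longer than \( L_{\max} \) are penalized.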
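To make the piecewise definition concrete, the sketch below evaluates it for a few lengths with the same settings as the example (maximum length 100, cache 20). The function `soft_overlong_reward` is a hypothetical stand-alone helper written from the formula, not the library's implementation.

```python
def soft_overlong_reward(length, max_completion_len, soft_punish_cache):
    """Piecewise length penalty following Eq. (13) of the DAPO paper."""
    threshold = max_completion_len - soft_punish_cache
    if length <= threshold:
        return 0.0  # short enough: no penalty and no bonus
    if length <= max_completion_len:
        # Linear ramp from 0 at the threshold down to -1 at the maximum length.
        return (threshold - length) / soft_punish_cache
    return -1.0  # beyond the maximum length: full penalty


for length in (50, 80, 90, 100, 120):
    print(length, soft_overlong_reward(length, 100, 20))
# 50 0.0
# 80 0.0
# 90 -0.5
# 100 -1.0
# 120 -1.0
```

The penalty is continuous at both ends of the ramp: it is 0 at 80 tokens and reaches -1 at 100 tokens, which matches the -0.5 reported for the 90-token completion in the example.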

Link last verified June 7, 2026.
Source: TRL Docs