OpenReward Integration for Training LLMs with Environments
OpenReward is an open ecosystem for RL environments built on the Open Reward Standard (ORS) — a public, language-agnostic HTTP/SSE protocol for how an environment exposes its tasks, tools, sessions, and rewards. Because ORS is just a protocol, the same environment can run on the OpenReward platform, self-hosted on any container service, or locally on localhost for development. A catalog of ready-to-use environments is available at openreward.ai.
This guide covers how to integrate OpenReward with TRL. For more on the standard itself, see the ORS docs.
The integration lives at trl.experimental.openreward and is gated behind the trl[openreward] extra. It is imported lazily, so users who never touch it pay nothing.
When to use OpenReward environments#
GRPOTrainer supports environment-based training via the environment_factory slot — see OpenEnv for the general contract. Use OpenReward when you want to train against an ORS-speaking environment: the OpenReward catalog (e.g. Eigent/SETA, kanishk/EndlessTerminals, nebius/SWE-rebench-V2), an env you self-host on your own infra, or a local server you’re developing.
Installation#
pip install trl[openreward]
This installs the openreward Python SDK. The integration itself imports openreward lazily, so users who don’t touch trl.experimental.openreward aren’t affected.
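To confirm that the optional dependency is present without importing it, a standard-library check is enough. This is a minimal sketch: has_module is a hypothetical helper written for illustration, not part of TRL or the openreward SDK.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("openreward"):
        print("openreward SDK found")
    else:
        print("openreward SDK missing; run: pip install trl[openreward]")
```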
Quick start#
The OpenRewardSpec class wires a single ORS environment into the three TRL trainer slots — train_dataset, environment_factory, reward_funcs — by exposing properties that map 1:1 to those kwarg names:
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(
        num_generations=2,
        max_steps=5,
        max_tool_calling_iterations=20,
        log_completions=True,
    ),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
trainer.train()
Under the hood OpenRewardSpec does three things, lazily on first access:
- spec.train_dataset: derives a datasets.Dataset from the env’s task list (one HTTP roundtrip via the SDK). It has at minimum prompt and task_index columns, plus per-task metadata columns folded in.
- spec.environment_factory: returns a zero-arg callable that produces a fresh per-rollout adapter on each call. The adapter exposes one Python method per ORS tool, with a typed signature and docstring auto-generated from the env’s JSON Schema. TRL’s tool collector picks them up via inspect.getmembers.
- spec.reward_funcs: an outcome-only reward function (last non-null reward in the trajectory) suitable for sparse-reward envs like SETA.
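The "lazily on first access" behavior can be pictured with a compute-once property. The sketch below is illustrative only: LazySpec and its fetch_tasks argument are hypothetical stand-ins, and the real OpenRewardSpec may implement laziness differently.

```python
from functools import cached_property


class LazySpec:
    """Hypothetical stand-in that mimics lazy, compute-once properties."""

    def __init__(self, fetch_tasks):
        self._fetch_tasks = fetch_tasks  # callable that does the expensive work
        self.fetch_count = 0

    @cached_property
    def train_dataset(self) -> list[dict]:
        # Runs only on first access; the result is then cached on the instance.
        self.fetch_count += 1
        return [{"task_index": i, **task} for i, task in enumerate(self._fetch_tasks())]


spec = LazySpec(lambda: [{"target": "hello"}, {"target": "world"}])
assert spec.fetch_count == 0      # nothing is fetched at construction
rows = spec.train_dataset         # first access triggers the fetch
rows_again = spec.train_dataset   # second access is served from the cache
assert spec.fetch_count == 1
```

The practical consequence is that constructing a spec is cheap, and the cost of the task-list request is paid only when a trainer actually reads the property.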
Using a hub environment#
Pass an openreward.ai catalog name as the target. The SDK reads OPENREWARD_API_KEY from the environment for authentication.
spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)
Using a self-hosted environment#
Pass the URL directly. No API key is needed if your server doesn’t enforce one.
spec = OpenRewardSpec("https://my-org-my-env.hf.space", env_name="my_env")
The openreward SDK by default expects a two-subdomain platform layout (api.<host> for stateless calls and sessions.<host> for SSE-based session calls). For single-host self-hosted servers (one URL serving everything), set the override env vars below before constructing OpenRewardSpec:
import os

URL = "https://my-org-my-env.hf.space"
os.environ["OPENREWARD_API_URL"] = URL
os.environ["OPENREWARD_SESSION_URL"] = URL

spec = OpenRewardSpec(URL, env_name="my_env")
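If mutating os.environ globally is undesirable (for example in tests), the two overrides can be scoped with a context manager. This is a sketch under assumptions: single_host_overrides is a hypothetical helper, and since it is not stated here when the SDK reads these variables, the safest use is to keep the whole construction and training run inside the with block.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_host_overrides(url: str):
    """Point both SDK endpoint variables at one URL, restoring old values on exit."""
    names = ("OPENREWARD_API_URL", "OPENREWARD_SESSION_URL")
    previous = {name: os.environ.get(name) for name in names}
    os.environ.update({name: url for name in names})
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)  # variable was unset before
            else:
                os.environ[name] = old      # put the earlier value back
```

Usage would look like `with single_host_overrides(URL): ...`, with the OpenRewardSpec construction and trainer.train() placed inside the block.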
Running a minimal environment locally#
The fastest way to try the integration end-to-end without external dependencies is a tiny ORS server defined with the openreward SDK’s Environment + Server scaffolding. The example below is a complete echo environment — the model wins by calling echo(text=...) with the task’s target string.
# server.py
from pydantic import BaseModel

from openreward.environments import Environment, JSONObject, Server, TextBlock, ToolOutput, tool


class EchoTaskSpec(BaseModel):
    target: str


class EchoParams(BaseModel):
    text: str


class EchoEnvironment(Environment):
    def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
        super().__init__(task_spec)
        self.config = EchoTaskSpec.model_validate(task_spec)

    @classmethod
    def list_splits(cls) -> list[str]:
        return ["train"]

    @classmethod
    def list_tasks(cls, split: str) -> list[JSONObject]:
        return [{"target": "hello"}, {"target": "world"}]

    def get_prompt(self) -> list[TextBlock]:
        return [TextBlock(type="text", text=f"Echo '{self.config.target}' to win.")]

    @tool
    async def echo(self, params: EchoParams) -> ToolOutput:
        """Submit a string. Reward 1.0 + finished if it matches the target.

        Args:
            text: The string to echo back.
        """
        correct = params.text == self.config.target
        return ToolOutput(
            blocks=[TextBlock(type="text", text="match" if correct else "no match")],
            reward=1.0 if correct else 0.0,
            finished=correct,
        )


if __name__ == "__main__":
    Server([EchoEnvironment]).run(host="0.0.0.0", port=8000)
Run it:
pip install openreward fastapi uvicorn pydantic
python server.py  # listens on :8000
Then point OpenRewardSpec at it (with the URL overrides described above):
import os

URL = "http://127.0.0.1:8000"
os.environ["OPENREWARD_API_URL"] = URL
os.environ["OPENREWARD_SESSION_URL"] = URL

from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec(URL, env_name="echoenvironment")
print(spec.train_dataset)  # 2 rows, task_index + target columns
This is also the fixture pattern used by TRL’s own tests; see trl-internal-testing/openreward-echo-env for the deployed Space.
Selecting tasks#
OpenRewardSpec accepts either a count or an explicit index list:
spec = OpenRewardSpec("Eigent/SETA", num_tasks=10)  # first 10 tasks
spec = OpenRewardSpec("Eigent/SETA", indices=[0, 5, 13, 27])  # specific indices
spec = OpenRewardSpec("Eigent/SETA", indices=list(range(50, 100)))  # range
num_tasks and indices are mutually exclusive, and both fetch only the tasks they need (no full task list scan).
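The selection rule can be summarized in a few lines. The function below is a hypothetical sketch of the documented semantics (mutual exclusivity, "first N" for a count, explicit indices otherwise); the name resolve_selection, the error messages, and the non-negativity checks are assumptions rather than TRL's actual code.

```python
from typing import Optional, Sequence


def resolve_selection(
    num_tasks: Optional[int] = None,
    indices: Optional[Sequence[int]] = None,
) -> Optional[list[int]]:
    """Return the task indices to fetch, or None meaning "every task"."""
    if num_tasks is not None and indices is not None:
        raise ValueError("num_tasks and indices are mutually exclusive")
    if indices is not None:
        if any(i < 0 for i in indices):
            raise ValueError("task indices must be non-negative")
        return list(indices)
    if num_tasks is not None:
        if num_tasks < 0:
            raise ValueError("num_tasks must be non-negative")
        return list(range(num_tasks))  # a count means the first num_tasks tasks
    return None
```

Because the result is always an explicit index list (or None), a caller can request exactly those tasks without scanning the full task list.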
How tool binding works#
At construction the spec calls the env’s /tools endpoint to fetch a list of tool specs (each with a name, description, and JSON Schema for arguments). For each tool it generates a Python method on the per-rollout adapter with a typed signature and a docstring derived from the schema. So transformers.utils.get_json_schema and TRL’s inspect.getmembers(env, ismethod) both produce the right tool schema for the model with no per-env wrapper code.
If a tool description contains characters that aren’t safe to splice into Python source, the binder falls back to a sanitized form so binding never fails on real envs.
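To make the idea of schema-driven binding concrete, here is a minimal sketch that builds a function with a typed signature and docstring from a tool spec. It is not the integration's binder: it uses inspect.Signature instead of generating Python source, and the spec layout (the input_schema key), make_tool_function, and the call callback are all assumptions for illustration.

```python
import inspect
from typing import Any, Callable

# JSON Schema primitive types mapped to Python annotations (partial, for illustration).
JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}


def make_tool_function(spec: dict, call: Callable[[str, dict], Any]) -> Callable:
    """Build a function whose signature and docstring mirror one tool spec."""
    schema = spec.get("input_schema", {})
    props = schema.get("properties", {})
    required = set(schema.get("required", []))

    params = []
    doc = [spec.get("description", ""), "", "Args:"]
    for name, prop in props.items():
        params.append(
            inspect.Parameter(
                name,
                inspect.Parameter.KEYWORD_ONLY,
                default=inspect.Parameter.empty if name in required else None,
                annotation=JSON_TO_PY.get(prop.get("type"), Any),
            )
        )
        doc.append(f"    {name}: {prop.get('description', '')}")

    def tool(**kwargs):
        # Forward the call to the environment by tool name.
        return call(spec["name"], kwargs)

    tool.__name__ = spec["name"]
    tool.__doc__ = "\n".join(doc)
    tool.__signature__ = inspect.Signature(params)
    tool.__annotations__ = {p.name: p.annotation for p in params}
    return tool


echo_spec = {
    "name": "echo",
    "description": "Submit a string.",
    "input_schema": {
        "type": "object",
        "properties": {"text": {"type": "string", "description": "The string to echo back."}},
        "required": ["text"],
    },
}
echo = make_tool_function(echo_spec, call=lambda name, args: f"{name} called with {args}")
```

Introspection tools that read the signature and docstring then see echo as if it had been written by hand, which is the property the binding step relies on.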
Reward functions#
spec.reward_funcs defaults to an outcome-only reward — for each rollout it returns the last non-null reward observed during the trajectory. This is the right default for sparse-reward envs (e.g. SETA, where only submit_solution returns a non-null reward).
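The "last non-null reward" rule is simple to state in code. The sketch below is illustrative: outcome_reward is a hypothetical name, and the fallback of 0.0 when no step produced a reward is an assumption, not documented behavior.

```python
from typing import Optional, Sequence


def outcome_reward(step_rewards: Sequence[Optional[float]], default: float = 0.0) -> float:
    """Return the last non-null reward in a trajectory, or `default` if there is none."""
    for reward in reversed(step_rewards):
        if reward is not None:
            return float(reward)
    return default  # assumption: no observed reward counts as `default`
```

For a sparse-reward trajectory such as [None, None, 1.0, None], where only one tool call returns a reward, this yields 1.0.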
If you want a custom reward, write a regular TRL reward function and pass it directly:
def my_reward(environments, **kwargs) -> list[float]:
    return [env.reward * 2.0 for env in environments]  # double the env reward, etc.

trainer = GRPOTrainer(
    ...,
    reward_funcs=my_reward,
)
The per-rollout adapter exposes the running state TRL needs (env.reward, env.rewards, env.metadata, env.finished, env.last_output) for arbitrary post-hoc reward shaping.
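As one example of shaping on top of that state, the sketch below subtracts a small penalty from rollouts that never finished. It assumes env.reward may be None when no reward was observed and that env.finished is a boolean; the penalty value and the stand-in objects built with SimpleNamespace are purely illustrative.

```python
from types import SimpleNamespace

UNFINISHED_PENALTY = 0.1  # hypothetical value, tune per environment


def shaped_reward(environments, **kwargs) -> list[float]:
    """Env reward, minus a penalty when the episode did not finish."""
    scores = []
    for env in environments:
        score = float(env.reward or 0.0)  # treat a missing reward as 0.0 (assumption)
        if not env.finished:
            score -= UNFINISHED_PENALTY
        scores.append(score)
    return scores


# Stand-in adapters for a quick local check; real adapters come from the environment factory.
fake_envs = [
    SimpleNamespace(reward=1.0, finished=True),
    SimpleNamespace(reward=None, finished=False),
]
scores = shaped_reward(fake_envs)
```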
OpenRewardSpec#
trl.experimental.openreward.OpenRewardSpec#
Single spec object that wires an ORS environment into a TRL trainer.
Parameters:
target (str) : Either an openreward.ai catalog name (“Eigent/SETA”) or a URL pointing at any ORS server (“https://you-seta.hf.space”, “http://localhost:8080”). Auto-detected by the presence of :// in the string.
num_tasks (int, optional) : Cap on the number of tasks pulled into the dataset. None uses every task the env exposes.
split (str, optional, defaults to “train”) : Which split’s task list to draw from.
indices (list[int], optional) : Specific task indices to train on. Mutually exclusive with num_tasks. Useful for debugging or curriculum subsets.
api_key (str, optional) : OPENREWARD_API_KEY override. Only used when target is a catalog name.
secrets (dict[str, str], optional) : Per-session secrets forwarded to env.session(secrets=).
env_name (str, optional) : Override for the env name to look up on the server. Rarely needed.
include_metadata (bool, optional, defaults to True) : Fold per-task metadata (difficulty, category, tags, …) into the dataset rows so reward funcs can read them via TRL’s inputs argument.
discover_task_tools (bool, optional, defaults to True) : If True, opens a short-lived ORS session and uses session.list_tools() so that task-specific tools (GET …/task_tools per ORS, e.g. @tool(shared=False) and list_task_tools()) are bound for GRPO. If the probe fails, it falls back to environment.list_tools() only. Set to False to skip the extra session, in which case only shared tools are bound; this can also help with offline quirks.
task_tools_discovery_index (int, optional) : Task index used only for the discovery session when set; overrides multi-index probing below. When omitted and indices= is set, discovery opens one probe session per distinct entry in indices (sorted) and merges tool specs by name so task-specific tools from every listed task are bound. When omitted and num_tasks / full-list mode is used, probes task 0 only. Ignored when discover_task_tools=False.
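The discovery rules in the last two parameters can be sketched as two small functions: one choosing which task indices to probe, one merging the resulting tool specs by name. Both names are hypothetical, and keeping the first spec seen when two tasks report the same tool name is an assumption, since the text only says specs are merged by name.

```python
from typing import Optional, Sequence


def probe_indices(
    indices: Optional[Sequence[int]] = None,
    task_tools_discovery_index: Optional[int] = None,
    discover_task_tools: bool = True,
) -> list[int]:
    """Pick the task indices that get a discovery session."""
    if not discover_task_tools:
        return []                              # discovery disabled, nothing to probe
    if task_tools_discovery_index is not None:
        return [task_tools_discovery_index]    # explicit override wins
    if indices is not None:
        return sorted(set(indices))            # one probe per distinct index, sorted
    return [0]                                 # num_tasks / full-list mode probes task 0


def merge_tool_specs(per_task_specs: Sequence[Sequence[dict]]) -> list[dict]:
    """Merge tool specs from several probes, de-duplicating by tool name."""
    merged: dict[str, dict] = {}
    for specs in per_task_specs:
        for spec in specs:
            merged.setdefault(spec["name"], spec)  # first occurrence wins (assumption)
    return list(merged.values())
```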
Limitations#
- The integration is in trl.experimental, so APIs may change. Set TRL_EXPERIMENTAL_SILENCE=1 to silence the warning in CI logs.
- Currently exposes a single OpenRewardSpec covering one environment; multi-environment training (à la the OpenEnv “meta-environment” pattern) is not supported yet.
- Long-running rollouts (>15 min per episode) need a keepalive ping, which is not yet wired.