Evaluators ↗
noOriginal Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.
Understand the fundamentals of evaluators and reward functions in reinforcement fine-tuning
An evaluator (also called a reward function) is code that scores model outputs from 0.0 (worst) to 1.0 (best). During reinforcement fine-tuning, your evaluator guides the model toward better responses by providing feedback on its generated outputs.
Why evaluators matter#
Unlike supervised fine-tuning where you provide perfect examples, RFT uses evaluators to define what “good” means. This is powerful because:
- No perfect data required - Just prompts and a way to score outputs
- Encourages exploration - Models learn strategies, not just patterns
- Noise tolerant - Even noisy signals can improve model performance
- Encodes domain expertise - Complex rules and logic that are hard to demonstrate with examples
Anatomy of an evaluator#
Every evaluator has three core components:
1. Input data#
The prompt and any ground truth data needed for evaluation:
{
"messages": [
{"role": "system", "content": "You are a math tutor."},
{"role": "user", "content": "What is 15 * 23?"}
],
"ground_truth": "345" # Optional additional data
}2. Model output#
The assistant’s response to evaluate:
{
"role": "assistant",
"content": "Let me calculate that step by step:\n15 * 23 = 345"
}3. Scoring logic#
Code that compares the output to your criteria:
def evaluate(model_output: str, ground_truth: str) -> float:
# Extract answer from model's response
predicted = extract_number(model_output)
# Score it
if predicted == int(ground_truth):
return 1.0 # Perfect
else:
return 0.0 # WrongTypes of evaluators#
Rule-based evaluators#
Check if outputs match specific patterns or rules:
- Exact match - Output exactly equals expected value
- Contains - Output includes required text
- Regex - Output matches a pattern
- Format validation - Output follows required structure (e.g., valid JSON)
Start with rule-based evaluators. They’re simple, fast, and surprisingly effective.
Execution-based evaluators#
Run code or commands to verify correctness:
- Code execution - Run generated code and check results
- Test suites - Pass generated code through unit tests
- API calls - Execute commands and verify outcomes
- Simulations - Run agents in environments and measure success
LLM-as-judge evaluators#
Use another model to evaluate quality:
- Rubric scoring - Judge outputs against criteria
- Comparative ranking - Compare multiple outputs
- Natural language assessment - Evaluate subjective qualities like helpfulness
Scoring guidelines#
Your evaluator should return a score between 0.0 and 1.0:
| Score range | Meaning | Example |
|---|---|---|
| 1.0 | Perfect | Exact correct answer |
| 0.7-0.9 | Good | Right approach, minor error |
| 0.4-0.6 | Partial | Some correct elements |
| 0.1-0.3 | Poor | Wrong but attempted |
| 0.0 | Failure | Completely wrong |
Binary scoring (0.0 or 1.0) works well for many tasks. Use gradual scoring when you can meaningfully distinguish between partial successes.
Best practices#
# Start here
score = 1.0 if predicted == expected else 0.0
# Then refine if needed
score = calculate_similarity(predicted, expected)
```
Start with the simplest scoring approach that captures your core requirements. You can always add sophistication later based on training results.
</Accordion>
<Accordion title="Make evaluators fast">
Training generates many outputs to evaluate, so performance matters:
* **Cache expensive computations**: Store results of repeated calculations
* **Use timeouts for code execution**: Prevent hanging on infinite loops
* **Batch API calls when possible**: Reduce network overhead
* **Profile slow evaluators and optimize**: Identify and fix bottlenecks
Aim for evaluations that complete in seconds, not minutes. Slow evaluators directly increase training time and cost.
</Accordion>
<Accordion title="Handle edge cases">
Models will generate unexpected outputs, so build robust error handling:
```python
try:
result = execute_code(model_output)
score = check_result(result)
except TimeoutError:
score = 0.0 # Code ran too long
except SyntaxError:
score = 0.0 # Invalid code
except Exception as e:
score = 0.0 # Any other error
```
Anticipate and gracefully handle malformed outputs, syntax errors, timeouts, and edge cases specific to your domain.
</Accordion>
<Accordion title="Avoid reward hacking">
Models will exploit evaluation weaknesses, so design defensively:
**Example: Length exploitation**
If you score outputs by length, the model might generate verbose nonsense. Add constraints:
```python
# Bad: Model learns to write long outputs
score = min(len(output) / 1000, 1.0)
# Better: Require correctness AND reasonable length
if is_correct(output):
score = 1.0 if len(output) < 500 else 0.8
else:
score = 0.0
```
**Example: Format over substance**
If you only check JSON validity, the model might return valid but wrong JSON. Check content too:
```python
# Bad: Only checks format
score = 1.0 if is_valid_json(output) else 0.0
# Better: Check format AND content
if is_valid_json(output):
data = json.loads(output)
score = evaluate_content(data)
else:
score = 0.0
```
Always combine format checks with content validation to prevent models from gaming the system.
</Accordion>
</AccordionGroup>
## Debugging evaluators
Test your evaluator before training. Look for:
* **Correct scoring** - Good outputs score high, bad outputs score low
* **Reasonable runtime** - Each evaluation completes in reasonable time
* **Clear feedback** - Evaluation reasons explain scores
<span class="callout-start" data-callout-type="tip"></span>
Run your evaluator on manually created good and bad examples first. If it doesn't score them correctly, fix the evaluator before training.
<span class="callout-end"></span>
## Next steps
<span class="card-group-start" data-cols="2"></span>
<span class="card-start" data-card-title="Connect environments" data-card-icon="code" data-card-href="/fine-tuning/connect-environments"></span>
Connect to your environment for single and multi-turn agents
<span class="card-end"></span>
<span class="card-start" data-card-title="Quickstart: Math solver" data-card-icon="calculator" data-card-href="/fine-tuning/quickstart-math"></span>
Follow a complete example building and using an evaluator
<span class="card-end"></span>
<span class="card-group-end"></span>