Overview ↗

Original Documentation

Evaluators are the core of Pydantic Evals. They analyze task outputs and provide scores, labels, or pass/fail assertions.

When to Use Different Evaluators#

Deterministic Checks (Fast & Reliable)#

Use deterministic evaluators when you can define exact rules:

Evaluator	Use Case	Example
EqualsExpected	Exact output match	Structured data, classification
Equals	Equals specific value	Checking for sentinel values
Contains	Substring/element check	Required keywords, PII detection
IsInstance	Type validation	Format validation
MaxDuration	Performance threshold	SLA compliance
HasMatchingSpan	Behavior verification	Tool calls, code paths

Advantages:

Fast execution (microseconds to milliseconds)
Deterministic results
No cost
Easy to debug

When to use:

Format validation (JSON structure, type checking)
Required content checks (must contain X, must not contain Y)
Performance requirements (latency, token counts)
Behavioral checks (which tools were called, which code paths executed)

LLM-as-a-Judge (Flexible & Nuanced)#

Use LLMJudge when evaluation requires understanding or judgment:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(inputs='What is 2+2?', expected_output='4')],
    evaluators=[
        LLMJudge(
            rubric='Response is factually accurate based on the input',
            include_input=True,
        )
    ],
)

Advantages:

Can evaluate subjective qualities (helpfulness, tone, creativity)
Understands natural language
Can follow complex rubrics
Flexible across domains

Disadvantages:

Slower (seconds per evaluation)
Costs money
Non-deterministic
Can have biases

When to use:

Factual accuracy
Relevance and helpfulness
Tone and style
Completeness
Following instructions
RAG quality (groundedness, citation accuracy)

Custom Evaluators#

Custom evaluators can be useful if you want to make use of any evaluation logic we don’t provide with the framework. They are frequently useful for domain-specific logic:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ValidSQL(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        try:
            import sqlparse
            sqlparse.parse(ctx.output)
            return True
        except Exception:
            return False

When to use:

Domain-specific validation (SQL syntax, regex patterns, business rules)
External API calls (running generated code, checking databases)
Complex calculations (precision/recall, BLEU scores)
Integration checks (does API call succeed?)

Evaluation Types#

Detailed Return Types Guide

For full detail about precisely what custom Evaluators may return, see Custom Evaluator Return Types.

Evaluators essentially return three types of results:

1. Assertions (bool)#

Pass/fail checks that appear as ✔ or ✗ in reports:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class HasKeyword(Evaluator):
    keyword: str

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return self.keyword in ctx.output

Use for: Binary checks, quality gates, compliance requirements

2. Scores (int or float)#

Numeric metrics:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ConfidenceScore(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        # Analyze and return score
        return 0.87  # 87% confidence

Use for: Quality metrics, ranking, A/B testing, regression tracking

3. Labels (str)#

Categorical classifications:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SentimentClassifier(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> str:
        if 'error' in ctx.output.lower():
            return 'error'
        elif 'success' in ctx.output.lower():
            return 'success'
        return 'neutral'

Use for: Classification, error categorization, quality buckets

Multiple Results#

You can return multiple evaluations from a single evaluator:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ComprehensiveCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | float | str]:
        return {
            'valid_format': self._check_format(ctx.output),  # bool
            'quality_score': self._score_quality(ctx.output),  # float
            'category': self._classify(ctx.output),  # str
        }

    def _check_format(self, output: str) -> bool:
        return True

    def _score_quality(self, output: str) -> float:
        return 0.85

    def _classify(self, output: str) -> str:
        return 'good'

Combining Evaluators#

Mix and match evaluators to create comprehensive evaluation suites:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Fast deterministic checks first
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Slower LLM checks after
        LLMJudge(
            rubric='Response is accurate and helpful',
            include_input=True,
        ),
    ],
)

Case-specific evaluators#

Case-specific evaluators are one of the most powerful features for building comprehensive evaluation suites. You can attach evaluators to individual Case objects that only run for those specific cases:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='greeting_response',
            inputs='Say hello',
            evaluators=[
                # This evaluator only runs for this case
                LLMJudge(
                    rubric='Response is warm and friendly, uses casual tone',
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='formal_response',
            inputs='Write a business email',
            evaluators=[
                # Different requirements for this case
                LLMJudge(
                    rubric='Response is professional and formal, uses business language',
                    include_input=True,
                ),
            ],
        ),
    ],
    evaluators=[
        # This runs for ALL cases
        IsInstance(type_name='str'),
    ],
)

Why Case-Specific Evaluators Matter#

Case-specific evaluators solve a fundamental problem with one-size-fits-all evaluation: if you could write a single evaluator rubric that perfectly captured your requirements across all cases, you’d just incorporate that rubric into your agent’s instructions. (Note: this is less relevant in cases where you want to use a cheaper model in production and assess it using a more expensive model, but in many cases it makes sense to use the best model you can in production.)

The power of case-specific evaluation comes from the nuance:

Different cases have different requirements: A customer support response needs empathy; a technical API response needs precision
Avoid “inmates running the asylum”: If your LLMJudge rubric is generic enough to work everywhere, your agent should already be following it
Capture nuanced golden behavior: Each case can specify exactly what “good” looks like for that scenario

Building Golden Datasets with Case-Specific LLMJudge#

A particularly powerful pattern is using case-specific LLMJudge evaluators to quickly build comprehensive, maintainable evaluation suites. Instead of needing exact expected_output values, you can describe what you care about:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='handle_refund_request',
            inputs={'query': 'I want my money back', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Acknowledge the refund request empathetically
                    2. Ask for the reason for the refund
                    3. Mention our 30-day refund policy
                    4. NOT process the refund immediately (needs manager approval)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_shipping_question',
            inputs={'query': 'Where is my order?', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Confirm the order number
                    2. Provide tracking information
                    3. Give estimated delivery date
                    4. Be brief and factual (not overly apologetic)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_angry_customer',
            inputs={'query': 'This is completely unacceptable!', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Prioritize de-escalation with empathy
                    2. Avoid being defensive
                    3. Offer concrete next steps
                    4. Use phrases like "I understand" and "Let me help"
                    """,
                    include_input=True,
                ),
            ],
        ),
    ],
)

This approach lets you:

Build comprehensive test suites quickly: Just describe what you want per case
Maintain easily: Update rubrics as requirements change, without regenerating outputs
Cover edge cases naturally: Add new cases with specific requirements as you discover them
Capture domain knowledge: Each rubric documents what “good” means for that scenario

The LLM evaluator excels at understanding nuanced requirements and assessing compliance, making this a practical way to create thorough evaluation coverage without brittleness.

Async vs Sync#

Evaluators can be sync or async:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SyncEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return True


async def some_async_operation() -> bool:
    return True


@dataclass
class AsyncEvaluator(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext) -> bool:
        result = await some_async_operation()
        return result

Pydantic Evals handles both automatically. Use async when:

Making API calls
Running database queries
Performing I/O operations
Calling LLMs (like LLMJudge)

Evaluation Context#

All evaluators receive an EvaluatorContext:

ctx.inputs - Task inputs
ctx.output - Task output (to evaluate)
ctx.expected_output - Expected output (if provided)
ctx.metadata - Case metadata (if provided)
ctx.duration - Task execution time (seconds)
ctx.span_tree - OpenTelemetry spans (if logfire configured)
ctx.metrics - Custom metrics dict
ctx.attributes - Custom attributes dict

This gives evaluators full context to make informed assessments.

Error Handling#

If an evaluator raises an exception, it’s captured as an EvaluatorFailure:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


def risky_operation(output: str) -> bool:
    # This might raise an exception
    if 'error' in output:
        raise ValueError('Found error in output')
    return True


@dataclass
class RiskyEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # If this raises an exception, it will be captured
        result = risky_operation(ctx.output)
        return result

Failures appear in report.cases[i].evaluator_failures with:

Evaluator name
Error message
Full stacktrace

Use retry configuration to handle transient failures (see Retry Strategies).

Report Evaluators (Experiment-Wide)#

All the evaluators above run once per case. Report evaluators are different: they run once per experiment after all cases have been evaluated, and analyze the full set of results together.

Use report evaluators for experiment-wide statistics like:

Confusion matrices — visualize classification accuracy across classes
Precision-recall curves — assess ranking quality with AUC scores
Scalar metrics — overall accuracy, F1, BLEU, or any single number
Summary tables — per-class breakdowns, error category summaries

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import ConfusionMatrixEvaluator

dataset = Dataset(
    cases=[
        Case(inputs='meow', expected_output='cat'),
        Case(inputs='woof', expected_output='dog'),
    ],
    report_evaluators=[
        ConfusionMatrixEvaluator(
            predicted_from='output',
            expected_from='expected_output',
        ),
    ],
)

See: Report Evaluators for the full guide, including built-in report evaluators and how to write custom ones.

Next Steps#

Built-in Evaluators - Complete reference of all provided evaluators
LLM Judge - Deep dive on LLM-as-a-Judge evaluation
Custom Evaluators - Write your own evaluation logic
Report Evaluators - Experiment-wide analyses
Span-Based Evaluation - Evaluate using OpenTelemetry spans

Link last verified June 7, 2026. View original ↗

Source: Pydantic AI Docs

Link last verified: 2026-03-04