Overview

Evaluators are the core of Pydantic Evals. They analyze task outputs and provide scores, labels, or pass/fail assertions.

When to Use Different Evaluators

Deterministic Checks (Fast & Reliable)

Use deterministic evaluators when you can define exact rules:

Evaluator        | Use Case                | Example
-----------------|-------------------------|---------------------------------
EqualsExpected   | Exact output match      | Structured data, classification
Equals           | Equals specific value   | Checking for sentinel values
Contains         | Substring/element check | Required keywords, PII detection
IsInstance       | Type validation         | Format validation
MaxDuration      | Performance threshold   | SLA compliance
HasMatchingSpan  | Behavior verification   | Tool calls, code paths

Advantages:

  • Fast execution (microseconds to milliseconds)
  • Deterministic results
  • No cost
  • Easy to debug

When to use:

  • Format validation (JSON structure, type checking)
  • Required content checks (must contain X, must not contain Y)
  • Performance requirements (latency, token counts)
  • Behavioral checks (which tools were called, which code paths executed)

LLM-as-a-Judge (Flexible & Nuanced)

Use LLMJudge when evaluation requires understanding or judgment:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[Case(inputs='What is 2+2?', expected_output='4')],
    evaluators=[
        LLMJudge(
            rubric='Response is factually accurate based on the input',
            include_input=True,
        )
    ],
)

Advantages:

  • Can evaluate subjective qualities (helpfulness, tone, creativity)
  • Understands natural language
  • Can follow complex rubrics
  • Flexible across domains

Disadvantages:

  • Slower (seconds per evaluation)
  • Costs money
  • Non-deterministic
  • Can have biases

When to use:

  • Factual accuracy
  • Relevance and helpfulness
  • Tone and style
  • Completeness
  • Following instructions
  • RAG quality (groundedness, citation accuracy)

Custom Evaluators

Write a custom evaluator when you need evaluation logic that the framework doesn’t provide. Custom evaluators are most often used for domain-specific checks:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


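@dataclass
class ValidSQL(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        try:
            import sqlparse  # third-party; a missing install also yields False

            # sqlparse is a non-validating parser, so treat this as a coarse check
            sqlparse.parse(ctx.output)
            return True
        except Exception:
            return False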

When to use:

  • Domain-specific validation (SQL syntax, regex patterns, business rules)
  • External API calls (running generated code, checking databases)
  • Complex calculations (precision/recall, BLEU scores)
  • Integration checks (does API call succeed?)
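
For the "complex calculations" case, the metric logic can usually live in a plain function that a custom evaluator calls. The following is a minimal, standard-library sketch of precision, recall, and F1 over sets of items; the function name and the sample data are hypothetical and not part of Pydantic Evals:

```python
def precision_recall_f1(predicted: set[str], relevant: set[str]) -> dict[str, float]:
    """Compute set-based precision, recall, and F1 (0.0 when undefined)."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall, 'f1': f1}


# Hypothetical example: items the task returned vs. items that should be returned.
scores = precision_recall_f1({'a', 'b', 'c'}, {'b', 'c', 'd', 'e'})
print(scores)
```

A custom evaluator's `evaluate` method could return this dictionary directly, so that each key is reported as its own score (see Multiple Results).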

Evaluation Types

Detailed Return Types Guide: for full details on exactly what custom evaluators may return, see Custom Evaluator Return Types.

Evaluators essentially return three types of results:

1. Assertions (bool)

Pass/fail checks that appear as ✔ or ✗ in reports:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class HasKeyword(Evaluator):
    keyword: str

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return self.keyword in ctx.output

Use for: Binary checks, quality gates, compliance requirements

2. Scores (int or float)

Numeric metrics:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ConfidenceScore(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> float:
        # Analyze and return score
        return 0.87  # 87% confidence

Use for: Quality metrics, ranking, A/B testing, regression tracking

3. Labels (str)

Categorical classifications:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SentimentClassifier(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> str:
        if 'error' in ctx.output.lower():
            return 'error'
        elif 'success' in ctx.output.lower():
            return 'success'
        return 'neutral'

Use for: Classification, error categorization, quality buckets

Multiple Results

You can return multiple evaluations from a single evaluator:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ComprehensiveCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool | float | str]:
        return {
            'valid_format': self._check_format(ctx.output),  # bool
            'quality_score': self._score_quality(ctx.output),  # float
            'category': self._classify(ctx.output),  # str
        }

    def _check_format(self, output: str) -> bool:
        return True

    def _score_quality(self, output: str) -> float:
        return 0.85

    def _classify(self, output: str) -> str:
        return 'good'

Combining Evaluators

Mix and match evaluators to create comprehensive evaluation suites:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import (
    Contains,
    IsInstance,
    LLMJudge,
    MaxDuration,
)

dataset = Dataset(
    cases=[Case(inputs='test', expected_output='result')],
    evaluators=[
        # Fast deterministic checks first
        IsInstance(type_name='str'),
        Contains(value='required_field'),
        MaxDuration(seconds=2.0),
        # Slower LLM checks after
        LLMJudge(
            rubric='Response is accurate and helpful',
            include_input=True,
        ),
    ],
)

Case-specific evaluators

Case-specific evaluators are one of the most powerful features for building comprehensive evaluation suites. You can attach evaluators to individual Case objects that only run for those specific cases:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='greeting_response',
            inputs='Say hello',
            evaluators=[
                # This evaluator only runs for this case
                LLMJudge(
                    rubric='Response is warm and friendly, uses casual tone',
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='formal_response',
            inputs='Write a business email',
            evaluators=[
                # Different requirements for this case
                LLMJudge(
                    rubric='Response is professional and formal, uses business language',
                    include_input=True,
                ),
            ],
        ),
    ],
    evaluators=[
        # This runs for ALL cases
        IsInstance(type_name='str'),
    ],
)

Why Case-Specific Evaluators Matter

Case-specific evaluators solve a fundamental problem with one-size-fits-all evaluation: if you could write a single evaluator rubric that perfectly captured your requirements across all cases, you’d just incorporate that rubric into your agent’s instructions. (Note: this is less relevant in cases where you want to use a cheaper model in production and assess it using a more expensive model, but in many cases it makes sense to use the best model you can in production.)

The power of case-specific evaluation comes from the nuance:

  • Different cases have different requirements: A customer support response needs empathy; a technical API response needs precision
  • Avoid “inmates running the asylum”: If your LLMJudge rubric is generic enough to work everywhere, your agent should already be following it
  • Capture nuanced golden behavior: Each case can specify exactly what “good” looks like for that scenario

Building Golden Datasets with Case-Specific LLMJudge

A particularly powerful pattern is using case-specific LLMJudge evaluators to quickly build comprehensive, maintainable evaluation suites. Instead of needing exact expected_output values, you can describe what you care about:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='handle_refund_request',
            inputs={'query': 'I want my money back', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Acknowledge the refund request empathetically
                    2. Ask for the reason for the refund
                    3. Mention our 30-day refund policy
                    4. NOT process the refund immediately (needs manager approval)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_shipping_question',
            inputs={'query': 'Where is my order?', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Confirm the order number
                    2. Provide tracking information
                    3. Give estimated delivery date
                    4. Be brief and factual (not overly apologetic)
                    """,
                    include_input=True,
                ),
            ],
        ),
        Case(
            name='handle_angry_customer',
            inputs={'query': 'This is completely unacceptable!', 'order_id': '12345'},
            evaluators=[
                LLMJudge(
                    rubric="""
                    Response should:
                    1. Prioritize de-escalation with empathy
                    2. Avoid being defensive
                    3. Offer concrete next steps
                    4. Use phrases like "I understand" and "Let me help"
                    """,
                    include_input=True,
                ),
            ],
        ),
    ],
)

This approach lets you:

  • Build comprehensive test suites quickly: Just describe what you want per case
  • Maintain easily: Update rubrics as requirements change, without regenerating outputs
  • Cover edge cases naturally: Add new cases with specific requirements as you discover them
  • Capture domain knowledge: Each rubric documents what “good” means for that scenario

The LLM evaluator excels at understanding nuanced requirements and assessing compliance, making this a practical way to create thorough evaluation coverage without brittleness.

Async vs Sync

Evaluators can be sync or async:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SyncEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return True


async def some_async_operation() -> bool:
    return True


@dataclass
class AsyncEvaluator(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext) -> bool:
        result = await some_async_operation()
        return result

Pydantic Evals handles both automatically. Use async when:

  • Making API calls
  • Running database queries
  • Performing I/O operations
  • Calling LLMs (like LLMJudge)
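
The reason async suits I/O-bound checks is that waits can overlap. The sketch below uses only the standard library to show the general idea, with `asyncio.sleep` standing in for a network call; `fake_api_check` and `run_checks` are hypothetical names, and this does not claim to show how Pydantic Evals schedules evaluators internally:

```python
import asyncio
import time


async def fake_api_check(output: str, delay: float = 0.1) -> bool:
    """Stand-in for an I/O-bound check; passes when the output is non-blank."""
    await asyncio.sleep(delay)  # simulates waiting on a remote service
    return bool(output.strip())


async def run_checks(outputs: list[str]) -> list[bool]:
    # gather() starts all checks at once, so their waits overlap.
    return await asyncio.gather(*(fake_api_check(o) for o in outputs))


start = time.perf_counter()
results = asyncio.run(run_checks(['a', ' ', 'c']))
elapsed = time.perf_counter() - start

# Three 0.1s waits run concurrently, so elapsed is typically close to 0.1s, not 0.3s.
print(results, f'{elapsed:.2f}s')
```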

Evaluation Context

All evaluators receive an EvaluatorContext:

  • ctx.inputs - Task inputs
  • ctx.output - Task output (to evaluate)
  • ctx.expected_output - Expected output (if provided)
  • ctx.metadata - Case metadata (if provided)
  • ctx.duration - Task execution time (seconds)
  • ctx.span_tree - OpenTelemetry spans (if Logfire is configured)
  • ctx.metrics - Custom metrics dict
  • ctx.attributes - Custom attributes dict

This gives evaluators full context to make informed assessments.

Error Handling

If an evaluator raises an exception, it’s captured as an EvaluatorFailure:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


def risky_operation(output: str) -> bool:
    # This might raise an exception
    if 'error' in output:
        raise ValueError('Found error in output')
    return True


@dataclass
class RiskyEvaluator(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # If this raises an exception, it will be captured
        result = risky_operation(ctx.output)
        return result

Failures appear in report.cases[i].evaluator_failures with:

  • Evaluator name
  • Error message
  • Full stacktrace

Use retry configuration to handle transient failures (see Retry Strategies).

Report Evaluators (Experiment-Wide)

All the evaluators above run once per case. Report evaluators are different: they run once per experiment after all cases have been evaluated, and analyze the full set of results together.

Use report evaluators for experiment-wide statistics like:

  • Confusion matrices: visualize classification accuracy across classes
  • Precision-recall curves: assess ranking quality with AUC scores
  • Scalar metrics: overall accuracy, F1, BLEU, or any single number
  • Summary tables: per-class breakdowns, error category summaries

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import ConfusionMatrixEvaluator

dataset = Dataset(
    cases=[
        Case(inputs='meow', expected_output='cat'),
        Case(inputs='woof', expected_output='dog'),
    ],
    report_evaluators=[
        ConfusionMatrixEvaluator(
            predicted_from='output',
            expected_from='expected_output',
        ),
    ],
)

See: Report Evaluators for the full guide, including built-in report evaluators and how to write custom ones.
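
For intuition about what an experiment-wide statistic aggregates, here is a standard-library sketch that tallies a confusion matrix and overall accuracy from expected and predicted labels. It is an illustration of the underlying arithmetic only, not the implementation of `ConfusionMatrixEvaluator`; the function name and sample labels are hypothetical:

```python
from collections import Counter


def confusion_counts(expected: list[str], predicted: list[str]) -> Counter:
    """Count how often each (expected, predicted) label pair occurs."""
    if len(expected) != len(predicted):
        raise ValueError('expected and predicted must have the same length')
    return Counter(zip(expected, predicted))


# Hypothetical results collected across all cases of one experiment.
expected = ['cat', 'dog', 'cat', 'dog']
predicted = ['cat', 'cat', 'cat', 'dog']

matrix = confusion_counts(expected, predicted)
# Diagonal entries (expected == predicted) are the correct predictions.
accuracy = sum(n for (e, p), n in matrix.items() if e == p) / len(expected)

print(dict(matrix))
print(accuracy)
```

No single case can produce these numbers on its own, which is why they belong in a report evaluator rather than a per-case one.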

Link last verified June 7, 2026.
Source: Pydantic AI Docs
Link last verified: 2026-03-04