How to define a code evaluator

Code evaluators in the LangSmith UI allow you to write custom evaluation logic using Python or TypeScript code directly in the interface. Unlike LLM-as-a-judge evaluators that use a model to evaluate outputs, code evaluators use deterministic logic you define.

To define code evaluators programmatically using the SDK, refer to How to define a code evaluator (SDK).

Step 1. Create the evaluator#

Create an evaluator from one of the following pages in the LangSmith UI:
- In the playground or from a dataset: Select the + Evaluator button.
- Select Add rules, configure your rule and select Apply evaluator.
Give your evaluator a clear name that describes what it measures (e.g., “Exact Match”).
Select Create code evaluator from the evaluator type options.

Step 2. Write your evaluator code#

Custom code evaluators have the following restrictions.

Allowed Libraries: You can import all standard library functions, as well as the following public packages:

- numpy (v2.2.2): "numpy"
- pandas (v1.5.2): "pandas"
- jsonschema (v4.21.1): "jsonschema"
- scipy (v1.14.1): "scipy"
- sklearn (v1.26.4): "scikit-learn"

Network Access: You cannot access the internet from a custom code evaluator.

On the Add Custom Code Evaluator page, define your evaluation logic using Python or TypeScript.

Your evaluator function must be named perform_eval and should:

- Accept run and example parameters.
- Access data via run['inputs'], run['outputs'], and example['outputs'].
- Return a dictionary in which each key is a metric name and each value is the score for that metric. Each key becomes one piece of feedback on the run. For example, {"correctness": 1, "silliness": 0} would create two pieces of feedback.

Function signature#

def perform_eval(run, example):
    # Access the data
    inputs = run['inputs']
    outputs = run['outputs']
    reference_outputs = example['outputs']  # Optional: reference/expected outputs

    # Your evaluation logic here
    score = ...

    # Return a dict with your metric name
    return {"metric_name": score}
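
Because the returned dictionary can hold several keys, one evaluator can report more than one metric. The following is a minimal sketch of that idea, not an official example: the "answer" key and the metric names exact_match and non_empty are assumptions chosen for illustration.

```python
def perform_eval(run, example):
    """Report two metrics from a single evaluator."""
    # Assumption: both the run output and the reference output store
    # their text under an "answer" key.
    actual = str(run['outputs'].get('answer', ''))
    expected = str(example['outputs'].get('answer', ''))

    return {
        # 1 if the output equals the reference exactly, else 0.
        "exact_match": int(actual == expected),
        # 1 if the output contains any non-whitespace text, else 0.
        "non_empty": int(bool(actual.strip())),
    }
```

Each key in the result would appear as its own piece of feedback on the run.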

Example: Exact match evaluator#

def perform_eval(run, example):
    """Check if the answer exactly matches the expected answer."""
    actual = run['outputs']['answer']
    expected = example['outputs']['answer']

    is_correct = actual == expected
    return {"exact_match": is_correct}
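
The exact match example raises a KeyError when either side lacks an "answer" key, and it treats "Paris" and " paris" as different. If that is too strict for your data, a more forgiving variant might look like the sketch below. The metric name normalized_match and the choice to score missing answers as 0 are assumptions, not part of the LangSmith documentation.

```python
def perform_eval(run, example):
    """Compare answers after trimming whitespace and ignoring case."""
    actual = run['outputs'].get('answer')
    expected = example['outputs'].get('answer')

    # Assumption: a missing answer on either side counts as a mismatch.
    if actual is None or expected is None:
        return {"normalized_match": 0}

    # Convert to str so that non-string answers (e.g. numbers) compare safely.
    actual_norm = str(actual).strip().casefold()
    expected_norm = str(expected).strip().casefold()
    return {"normalized_match": int(actual_norm == expected_norm)}
```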

Example: Input-based evaluator#

def perform_eval(run, example):
    """Check if the input text contains toxic language."""
    text = run['inputs'].get('text', '').lower()
    toxic_words = ["idiot", "stupid", "hate", "awful"]

    is_toxic = any(word in text for word in toxic_words)
    return {"is_toxic": is_toxic}
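
Note that the substring check above also fires on words that merely contain a listed term; for instance, "whatever" contains "hate". One possible refinement, sketched here with the standard library re module, matches whole words only. The word list is the same illustrative one as above.

```python
import re


def perform_eval(run, example):
    """Flag input text that contains a listed word as a whole word."""
    text = run['inputs'].get('text', '')
    toxic_words = ["idiot", "stupid", "hate", "awful"]

    # \b anchors each alternative at word boundaries, so "whatever"
    # no longer matches "hate". re.escape guards against special characters.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in toxic_words) + r")\b"

    is_toxic = re.search(pattern, text, flags=re.IGNORECASE) is not None
    return {"is_toxic": is_toxic}
```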

Step 3. Test and save#

- Test your evaluator on example data to ensure it works as expected.
- Click Save to make the evaluator available for use.
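
Before pasting code into the UI, it can also help to run it locally against a few hand-written cases. The sketch below assumes that run and example behave like plain dictionaries with the keys used in this guide; the sample questions and answers are made up for illustration.

```python
def perform_eval(run, example):
    """Exact match evaluator under test (same logic as the example above)."""
    actual = run['outputs']['answer']
    expected = example['outputs']['answer']
    return {"exact_match": actual == expected}


# Hypothetical test cases: (run, example, expected exact_match value).
cases = [
    (
        {"inputs": {"question": "2 + 2?"}, "outputs": {"answer": "4"}},
        {"outputs": {"answer": "4"}},
        True,
    ),
    (
        {"inputs": {"question": "Capital of France?"}, "outputs": {"answer": "Lyon"}},
        {"outputs": {"answer": "Paris"}},
        False,
    ),
]

for run, example, want in cases:
    result = perform_eval(run, example)
    # The evaluator must return a dict of metric name -> score.
    assert isinstance(result, dict)
    assert result["exact_match"] == want

print("all cases passed")
```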

Use your code evaluator#

Once created, you can use your code evaluator:

- When running evaluations from the playground.
- As part of a dataset, to automatically run evaluations on experiments.

Related pages:

- LLM-as-a-judge evaluator (UI): Use an LLM to evaluate outputs.
- Composite evaluators: Combine multiple evaluator scores.

Link last verified June 7, 2026. View original ↗

Source: LangChain Docs

Link last verified: 2026-03-04