LangSmith Evaluation

LangSmith supports two types of evaluations based on when and where they run:

Offline Evaluation: Test before you ship

Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.

Online Evaluation: Monitor in production

Evaluate real user interactions in real-time to detect issues and measure quality on live traffic.

Evaluation workflow

For offline evaluation, start by creating a dataset with examples drawn from manually curated test cases, historical production traces, or synthetic data generation.
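As a rough illustration of what such a dataset holds, the sketch below models examples as plain Python objects. The `Example` class, its field names, and the `source` tag are hypothetical and exist only for this illustration; they are not the LangSmith SDK's types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One dataset row: inputs for the app plus an optional reference output."""
    inputs: dict
    reference_outputs: Optional[dict] = None
    source: str = "manual"  # hypothetical tag recording where the example came from


# Three examples, one per origin named in the text above.
dataset = [
    Example({"question": "What is the capital of France?"}, {"answer": "Paris"}),
    Example(
        {"question": "How do I reset my password?"},
        {"answer": "Use the account settings page."},
        source="production_trace",
    ),
    Example(
        {"question": "wat is teh capital of france"},
        {"answer": "Paris"},
        source="synthetic",
    ),
]

sources = sorted({ex.source for ex in dataset})
```

Keeping the origin alongside each example makes it easier to see later whether failures cluster in curated, production, or synthetic cases.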

Create evaluators to score performance:

* [Human](/langsmith/evaluation-concepts#human) review
* [Code](/langsmith/evaluation-concepts#code) rules
* [LLM-as-judge](/langsmith/llm-as-judge)
* [Pairwise](/langsmith/evaluate-pairwise) comparison
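Of the evaluator types above, a code rule is the simplest to sketch. The function below is a minimal example of one, assuming the application returns a dict with an `answer` field; the argument names and the returned `key`/`score` shape are assumptions for illustration, not a confirmed LangSmith signature.

```python
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Code-rule evaluator: score 1 when the answers match after
    trimming whitespace and lowercasing, otherwise 0."""
    predicted = str(outputs.get("answer", "")).strip().lower()
    expected = str(reference_outputs.get("answer", "")).strip().lower()
    return {"key": "exact_match", "score": int(predicted == expected)}


result = exact_match({"answer": " Paris "}, {"answer": "paris"})
```

Deterministic rules like this are cheap to run on every example; human review, LLM-as-judge, and pairwise comparison cover the cases a fixed rule cannot express.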

Execute your application on the dataset to create an experiment. Configure repetitions, concurrency, and caching to optimize runs.
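To make the roles of repetitions and concurrency concrete, here is a self-contained sketch of an experiment loop using only the standard library. The `target` application, the `matches_reference` evaluator, and `run_experiment` are all hypothetical stand-ins; this is not how LangSmith runs experiments internally, and caching is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def target(inputs: dict) -> dict:
    """Hypothetical application under test: a toy 'model' that uppercases text."""
    return {"answer": inputs["text"].upper()}


def matches_reference(outputs: dict, reference_outputs: dict) -> float:
    """Toy evaluator: 1.0 on exact match with the reference, else 0.0."""
    return float(outputs["answer"] == reference_outputs["answer"])


def run_experiment(dataset, target, evaluator, repetitions=1, max_concurrency=4):
    """Run every example `repetitions` times, at most `max_concurrency` at once."""
    jobs = [example for example in dataset for _ in range(repetitions)]

    def run_one(example):
        outputs = target(example["inputs"])
        score = evaluator(outputs, example["reference_outputs"])
        return {"inputs": example["inputs"], "outputs": outputs, "score": score}

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(run_one, jobs))


dataset = [
    {"inputs": {"text": "hello"}, "reference_outputs": {"answer": "HELLO"}},
    {"inputs": {"text": "bye"}, "reference_outputs": {"answer": "Bye"}},
]
results = run_experiment(dataset, target, matches_reference, repetitions=2)
mean_score = sum(r["score"] for r in results) / len(results)
```

Repetitions matter mostly for non-deterministic applications, where a single run per example can hide variance; concurrency only shortens wall-clock time.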

Compare experiments for benchmarking, unit tests, regression tests, or backtesting.
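A regression test, for instance, boils down to comparing per-example scores between a baseline experiment and a candidate. The sketch below assumes each experiment has already been reduced to a mapping from example ID to score; the function name, the IDs, and the scores are made up for illustration.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list:
    """Return IDs of examples whose score dropped by more than `tolerance`."""
    return sorted(
        example_id
        for example_id, base_score in baseline.items()
        if example_id in candidate and candidate[example_id] < base_score - tolerance
    )


# Illustrative scores only, not real results.
baseline = {"ex-1": 1.0, "ex-2": 0.5, "ex-3": 1.0}
candidate = {"ex-1": 1.0, "ex-2": 0.75, "ex-3": 0.0}
regressions = find_regressions(baseline, candidate)
```

Comparing per example rather than only on the aggregate catches the case where the average improves while individual examples get worse.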

For online evaluation, each production interaction creates a run without reference outputs.

Set up evaluators to run automatically on production traces: safety checks, format validation, quality heuristics, and reference-free LLM-as-judge. Apply filters and sampling rates to control costs.
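The interplay of a filter, a sampling rate, and a reference-free check can be sketched in plain Python. Everything here is an assumption for illustration: the run dict shape (`name`, `output`), the `chat` filter, and the function names are hypothetical, and in LangSmith these rules are configured on the platform rather than hand-written like this.

```python
import json
import random


def valid_json(run: dict) -> dict:
    """Reference-free format check: does the run's output parse as JSON?"""
    try:
        json.loads(run.get("output"))
        return {"key": "valid_json", "score": 1}
    except (TypeError, ValueError):
        return {"key": "valid_json", "score": 0}


def should_evaluate(run: dict, sampling_rate: float, rng: random.Random) -> bool:
    """Filter first (only runs named 'chat'), then sample a fraction of them."""
    return run.get("name") == "chat" and rng.random() < sampling_rate


rng = random.Random(42)  # seeded so the sketch is repeatable
runs = [
    {"name": "chat", "output": '{"reply": "ok"}'},
    {"name": "chat", "output": "not json"},
    {"name": "healthcheck", "output": "{}"},
]
# A sampling rate of 1.0 evaluates every run that passes the filter.
scores = [valid_json(r)["score"] for r in runs if should_evaluate(r, 1.0, rng)]
```

Lowering the sampling rate reduces evaluation cost roughly in proportion, which matters most for LLM-as-judge evaluators that incur a model call per run.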

Evaluators run automatically on runs or threads, providing real-time monitoring, anomaly detection, and alerting.

Add failing production traces to your dataset, create targeted evaluators, validate fixes with offline experiments, and redeploy.
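The first step of that loop, moving failing traces into the dataset, might look like the sketch below. The run and example shapes, the score threshold, and the function name are hypothetical; reference outputs are left empty because production runs have none until someone supplies them.

```python
import json


def add_failures_to_dataset(dataset: list, runs: list, threshold: float = 0.5) -> int:
    """Append inputs from low-scoring runs as new examples, skipping inputs
    the dataset already contains. Returns the number of examples added."""
    seen = {json.dumps(ex["inputs"], sort_keys=True) for ex in dataset}
    added = 0
    for run in runs:
        key = json.dumps(run["inputs"], sort_keys=True)
        if run["score"] < threshold and key not in seen:
            # No reference output yet: it must be written or reviewed later.
            dataset.append({"inputs": run["inputs"], "reference_outputs": None})
            seen.add(key)
            added += 1
    return added


dataset = [{"inputs": {"question": "hi"}, "reference_outputs": {"answer": "hello"}}]
runs = [
    {"inputs": {"question": "refund policy?"}, "score": 0.0},
    {"inputs": {"question": "hi"}, "score": 0.0},              # already in the dataset
    {"inputs": {"question": "opening hours?"}, "score": 1.0},  # passing run
    {"inputs": {"question": "refund policy?"}, "score": 0.2},  # duplicate failure
]
added = add_failures_to_dataset(dataset, runs)
```

Deduplicating on inputs keeps one frequent failure from dominating the dataset and skewing later offline experiments.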

For more on the differences between offline and online evaluation, refer to the Evaluation concepts page.

Get started

* Get started with offline evaluation.
* Create and manage datasets for evaluation through the UI or SDK.
* Explore evaluation types, techniques, and frameworks for comprehensive testing.
* View and analyze evaluation results, compare experiments, filter data, and export findings.
* Monitor production quality in real-time from the Observability tab.
* Learn by following step-by-step tutorials, from simple chatbots to complex agent evaluations.

To set up a LangSmith instance, visit the Platform setup section to choose between cloud, hybrid, or self-hosted. All options include observability, evaluation, prompt engineering, and deployment.

Link last verified June 7, 2026.

Source: LangChain Docs

Link last verified: 2026-03-04