Evaluation & Testing
Build a comprehensive evaluation practice for AI applications. This path spans 7 sources to cover the full evaluation landscape: foundational concepts, practical implementation, RAG-specific metrics, LLM-as-judge patterns, and agent evaluation challenges.
Evaluation cuts across every part of AI development, and each provider and framework takes a different approach. OpenAI provides hosted evals, RAGAS specializes in RAG metrics, DSPy uses metrics for optimization, LangSmith offers traceability, and W&B Weave treats evaluation as a core development primitive. This path helps you pick the right tools and combine them.
Steps
- Getting started with datasets
openai
beginner
Introduction to evaluation datasets — the foundation for systematic AI testing and the first step in eval-driven development.
Start with OpenAI's primer on evaluation concepts — datasets, metrics, and the eval lifecycle. This establishes the vocabulary and mental model you'll use across all providers: define what good looks like, measure it, iterate.
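The "define, measure, iterate" loop can be sketched in a few lines of plain Python. This is a minimal illustration and does not use OpenAI's API: the dataset rows, the `exact_match` metric, and the `toy_app` stand-in are all hypothetical names chosen for the example.

```python
from typing import Callable

# Hypothetical dataset: each row pairs an input with the output we consider "good".
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
]


def exact_match(output: str, expected: str) -> float:
    """Metric: 1.0 when the normalized strings agree, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def run_eval(app: Callable[[str], str], dataset: list) -> float:
    """Run the app over every row and return the mean metric score."""
    scores = [exact_match(app(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)


def toy_app(prompt: str) -> str:
    """Stand-in for a model call; it only knows two of the three answers."""
    answers = {"2 + 2": "4", "capital of France": "paris"}
    return answers.get(prompt, "unknown")


# Measure once, change the app, then measure again on the same dataset.
score = run_eval(toy_app, DATASET)
```

Keeping the dataset and metric fixed while the application changes is what makes scores comparable from one iteration to the next.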
- Evaluations overview
wandb
beginner
Evaluation-driven LLM application development: using evaluations to systematically improve applications.
Weave approaches evaluation as a first-class development primitive, not an afterthought. Understanding evaluations as core types alongside models and datasets is a mindset shift — your eval suite should evolve with your application.
- Evaluation Overview
dspy
beginner
DSPy's evaluation system is tightly integrated with its optimization loop: metrics drive automatic prompt tuning rather than only checking results after the fact. Compare this with how OpenAI (step 1) and W&B (step 2) treat evals.
- Working with evals
openai
intermediate
Build, run, and iterate on evaluations to systematically test and improve AI model outputs — OpenAI's practical guide to eval-driven development.
The practical implementation guide for building and running evaluations with OpenAI. Focus on how to structure evaluation datasets, choose grading criteria, and interpret results. This is where concepts from step 1 become code.
- Evaluation quickstart
langchain
beginner
LangSmith provides evaluation infrastructure that works across LLM providers. The key value is traceability — you can see exactly which retrieval step or prompt caused a failure, not just that the final answer was wrong.
- Evaluate a simple LLM application
ragas
intermediate
RAGAS is purpose-built for evaluating RAG pipelines with metrics like faithfulness, answer relevancy, and context precision. If you're building RAG applications, these domain-specific metrics catch failures that generic evals miss.
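To make one of these metrics concrete, here is a rough sketch of a rank-aware context precision score. It is an average-precision-style calculation written from scratch, not RAGAS's implementation: RAGAS derives relevance judgments with an LLM, while this example assumes hand-supplied binary labels, and the function name is hypothetical.

```python
def context_precision(relevance: list) -> float:
    """Average-precision-style score over ranked retrieved chunks.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    Relevant chunks ranked near the top raise the score.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, counted only at ranks that hold a relevant chunk
            weighted += hits / rank
    return weighted / total_relevant


# Same two relevant chunks, ranked first versus ranked last.
good_ranking = context_precision([1, 1, 0, 0])
poor_ranking = context_precision([0, 0, 1, 1])
```

A generic answer-correctness check would score both retrievals the same if the final answer happened to be right. A rank-aware retrieval metric separates them.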
- How to define an LLM-as-a-judge evaluator
langchain
intermediate
Using an LLM to evaluate LLM outputs is powerful but has subtle pitfalls — bias, inconsistency, and circular reasoning. LangSmith's guide covers how to design reliable LLM judges with rubrics and few-shot examples.
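A rubric-plus-few-shot judge might look roughly like the sketch below. It is not LangSmith's API: `call_llm` is a hypothetical callable standing in for whatever model client you use, and the rubric, examples, and median-of-samples step are illustrative assumptions meant to reduce inconsistency.

```python
import re
from typing import Callable, Optional

RUBRIC = (
    "Score the answer from 1 to 5.\n"
    "5: fully correct and responsive to the question.\n"
    "3: partially correct or missing key details.\n"
    "1: incorrect or off-topic."
)

# Hypothetical few-shot examples that anchor the judge's scale.
FEW_SHOT = [
    ("What is 2 + 2?", "4", 5),
    ("What is 2 + 2?", "Probably 5.", 1),
]


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble rubric, worked examples, and the case to grade."""
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}\nScore: {s}" for q, a, s in FEW_SHOT
    )
    return f"{RUBRIC}\n\n{shots}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"


def parse_score(reply: str) -> Optional[int]:
    """Take the first standalone digit 1-5 in the reply; None if absent."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str,
          call_llm: Callable[[str], str], samples: int = 3) -> Optional[int]:
    """Sample the judge several times and return the median valid score."""
    prompt = build_judge_prompt(question, answer)
    scores = [parse_score(call_llm(prompt)) for _ in range(samples)]
    valid = sorted(s for s in scores if s is not None)
    return valid[len(valid) // 2] if valid else None


# Stubbed model replies so the sketch runs without network access.
fake_replies = iter(["Score: 4", "4", "2"])
verdict = judge("What is the capital of France?", "Paris",
                lambda prompt: next(fake_replies))
```

Sampling several verdicts and taking the median is one simple way to dampen run-to-run noise. It does not address bias, which usually requires checking the judge against human-labeled examples.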
- Build an evaluation
wandb
beginner
Learn how to build an evaluation pipeline with Weave Models and Evaluations.
Hands-on tutorial for building a complete evaluation pipeline with Weave. The practical focus on scorers, datasets, and result visualization shows the full workflow from writing eval code to analyzing results.
- Metrics
dspy
intermediate
DSPy's metric system goes beyond pass/fail — metrics can be continuous, multi-dimensional, and used as optimization objectives. Understanding metric design here informs better eval practices across any framework.
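As a rough illustration, a continuous, multi-dimensional metric could be written as below. The `(example, pred, trace=None)` signature follows the convention described in DSPy's documentation as I understand it. The weights, the `answer` field, and the `SimpleNamespace` stand-ins (used so the sketch runs without DSPy installed) are assumptions for the example.

```python
from types import SimpleNamespace


def answer_quality(example, pred, trace=None):
    """Blend correctness and brevity into one score between 0 and 1."""
    exact = float(example.answer.strip().lower() == pred.answer.strip().lower())
    concise = float(len(pred.answer.split()) <= 20)
    score = 0.75 * exact + 0.25 * concise
    if trace is not None:
        # When used as an optimization objective, be strict: accept only
        # predictions that satisfy every dimension.
        return score >= 1.0
    return score


# Stand-ins for an example and two predictions (hypothetical data).
example = SimpleNamespace(answer="Paris")
good_pred = SimpleNamespace(answer="paris")
wrong_pred = SimpleNamespace(answer="Lyon")

good_score = answer_quality(example, good_pred)
wrong_score = answer_quality(example, wrong_pred)
strict_ok = answer_quality(example, good_pred, trace=[])
```

Returning a float for reporting and a strict boolean during optimization lets one function serve both as an eval score and as a filter for which examples an optimizer keeps.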
- Evaluation best practices
openai
advanced
Advanced evaluation patterns for production AI systems — handling ambiguous cases, scaling eval suites, avoiding eval gaming, and integrating evals into CI/CD pipelines.
These production concerns are what separate toy evaluations from reliable quality gates.
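One way to turn an eval suite into a CI/CD quality gate is to convert per-example scores into a process exit code. The sketch below is a generic pattern, not part of OpenAI's tooling, and the function names and thresholds are hypothetical defaults you would tune.

```python
def quality_gate(scores, min_pass_rate=0.9, pass_threshold=0.5):
    """Return (ok, pass_rate) for a list of per-example scores.

    An example passes when its score reaches pass_threshold; the gate
    is ok when the share of passing examples reaches min_pass_rate.
    """
    if not scores:
        raise ValueError("quality gate needs at least one score")
    passed = sum(1 for s in scores if s >= pass_threshold)
    pass_rate = passed / len(scores)
    return pass_rate >= min_pass_rate, pass_rate


def exit_code(scores, **gate_kwargs) -> int:
    """Map the gate result to a CI exit code: 0 lets the build continue."""
    ok, _ = quality_gate(scores, **gate_kwargs)
    return 0 if ok else 1


# In a CI script you would end with: sys.exit(exit_code(scores))
nightly_scores = [1.0, 1.0, 0.0, 1.0]
ok, rate = quality_gate(nightly_scores, min_pass_rate=0.7)
```

Gating on a pass rate rather than requiring every example to pass leaves room for genuinely ambiguous cases, while still failing the build on a broad regression.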
- Evaluate a simple RAG system
ragas
beginner
Apply RAGAS metrics to evaluate a real RAG system end-to-end. This practical tutorial connects the metrics concepts from step 6 to actual pipeline evaluation — a template you can adapt for your own RAG applications.
- Agent evals
openai
intermediate
Use agent evals to create datasets, configure graders, and track evaluation runs for your agents.
Evaluating agents is harder than evaluating single-turn LLM calls — agents have multi-step trajectories, tool use sequences, and state management. This guide covers the unique challenges and patterns for agentic evaluation.
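To show why trajectories need their own graders, here is a small sketch that scores a recorded agent run on tool order as well as the final answer. The trajectory format (a list of dicts with `tool` and `output` keys) and the grader names are assumptions for this example and are not OpenAI's agent eval schema.

```python
def is_subsequence(expected, actual) -> bool:
    """True if every expected item appears in actual, in order.

    Extra steps in between are allowed.
    """
    remaining = iter(actual)
    # `item in remaining` consumes the iterator up to the first match,
    # so later items must appear after earlier ones.
    return all(item in remaining for item in expected)


def grade_trajectory(trajectory, expected_tools, expected_answer):
    """Apply several independent graders to one recorded agent run."""
    actual = [step["tool"] for step in trajectory]
    final_output = trajectory[-1].get("output") if trajectory else None
    return {
        "exact_tool_match": actual == expected_tools,
        "in_order_tool_match": is_subsequence(expected_tools, actual),
        "final_answer_correct": final_output == expected_answer,
        "extra_steps": max(0, len(actual) - len(expected_tools)),
    }


# Hypothetical run: the agent searched twice before calculating.
trajectory = [
    {"tool": "search", "output": "no results"},
    {"tool": "search", "output": "6 and 7"},
    {"tool": "calculator", "output": "42"},
    {"tool": "final_answer", "output": "42"},
]
report = grade_trajectory(
    trajectory,
    expected_tools=["search", "calculator", "final_answer"],
    expected_answer="42",
)
```

Here the run fails a strict sequence match but passes the in-order check and the answer check, so reporting the graders separately shows whether a failure lies in the path taken or in the result.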