Testing on AI Knowledge Base

Evaluating and Debugging Generative AI Models

Mon, 01 Jan 0001 00:00:00 +0000

Covers evaluation metrics, debugging techniques, and systematic testing for generative AI applications using Weights & Biases. The practical companion to the Evaluation & Testing learning path — the course provides hands-on practice with evaluation tools, while the path covers the full evaluation landscape across providers.

TDD vs BDD vs SDD

Mon, 01 Jan 0001 00:00:00 +0000

Comparison of test-driven, behavior-driven, and specification-driven development, highlighting when each approach is most appropriate and how they complement each other.

Evaluation & Testing

Mon, 01 Jan 0001 00:00:00 +0000

Build a comprehensive evaluation practice for AI applications. This path spans 7 sources to cover the full evaluation landscape: foundational concepts, practical implementation, RAG-specific metrics, LLM-as-judge patterns, and agent evaluation challenges.

Evaluation is the most cross-cutting concern in AI development — every provider and framework has a different take. OpenAI provides hosted evals, RAGAS specializes in RAG metrics, DSPy uses metrics for optimization, LangSmith offers traceability, and W&B Weave treats evaluation as a core development primitive. This path helps you pick the right tools and combine them.

Adding to your CI pipeline with Pytest

Mon, 01 Jan 0001 00:00:00 +0000

Agent evals

Mon, 01 Jan 0001 00:00:00 +0000

Use agent evals to create datasets, configure graders, and track evaluation runs for your agents.

Agent Evaluation Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

AI Evaluations UI

Mon, 01 Jan 0001 00:00:00 +0000

Guide to using the AI Evaluations UI for model assessment

Aligning LLM Evaluators with Human Judgment

Mon, 01 Jan 0001 00:00:00 +0000

An Overview of the Developer Playground

Mon, 01 Jan 0001 00:00:00 +0000

The Cohere Playground is a powerful visual interface for testing Cohere’s generation and embedding language models without coding.

Application-specific evaluation approaches

Mon, 01 Jan 0001 00:00:00 +0000

Automatically run evaluators on experiments

Mon, 01 Jan 0001 00:00:00 +0000

Basic RAG

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to build a basic RAG system by combining retrieval and generation for AI-powered knowledge-based responses

Basic RAG: Retrieval-Augmented Generation with Cohere

Mon, 01 Jan 0001 00:00:00 +0000

This page describes how to work with Cohere’s basic retrieval-augmented generation functionality.

Braintrust

Mon, 01 Jan 0001 00:00:00 +0000

Braintrust integration for CrewAI with OpenTelemetry tracing and evaluation

Build an evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to build an evaluation pipeline with Weave Models and Evaluations

Building RAG models with Cohere

Mon, 01 Jan 0001 00:00:00 +0000

This page walks through building a retrieval-augmented generation model with Cohere.

Built-in Evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Case Lifecycle Hooks

Mon, 01 Jan 0001 00:00:00 +0000

CI/CD with Pinecone Local and GitHub Actions

Mon, 01 Jan 0001 00:00:00 +0000

Test Pinecone integration with CI/CD workflows.

Code Embeddings

Mon, 01 Jan 0001 00:00:00 +0000

Code embeddings enable retrieval, clustering, and analytics for code databases and coding assistants using Mistral AI’s API

Cohere's Command R7B Model

Mon, 01 Jan 0001 00:00:00 +0000

Command R7B is the smallest, fastest, and final model in our R family of enterprise-focused large language models. It excels at RAG, tool use, and agents.

Collect and track datasets

Mon, 01 Jan 0001 00:00:00 +0000

Organize, collect, track, and version examples for LLM application evaluation

Command Line Interface

Mon, 01 Jan 0001 00:00:00 +0000

Common workflows

Mon, 01 Jan 0001 00:00:00 +0000

Step-by-step guides for exploring codebases, fixing bugs, refactoring, testing, and other everyday tasks with Claude Code.

Compare and rank models

Mon, 01 Jan 0001 00:00:00 +0000

Compare and rank different model versions based on evaluation metrics

Compare LLMs using Ragas Evaluations

Mon, 01 Jan 0001 00:00:00 +0000

Compare model performance using the Evaluation Playground

Mon, 01 Jan 0001 00:00:00 +0000

Compare and evaluate model performance without code using Weave’s interactive playground, running evaluations with custom datasets and LLM judges to test system prompts, models, and scoring criteria in a visual interface.

Comparison Testing

Mon, 01 Jan 0001 00:00:00 +0000

Concurrency & Performance

Mon, 01 Jan 0001 00:00:00 +0000

Conversation Simulator

Mon, 01 Jan 0001 00:00:00 +0000

Conversation Simulator Custom Templates

Mon, 01 Jan 0001 00:00:00 +0000

Conversation Simulator Lifecycle Hooks

Mon, 01 Jan 0001 00:00:00 +0000

Conversation Simulator Model Callback

Mon, 01 Jan 0001 00:00:00 +0000

Conversation Simulator Simulation Graph

Mon, 01 Jan 0001 00:00:00 +0000

Conversation Simulator Stopping Logic

Mon, 01 Jan 0001 00:00:00 +0000

Core Concepts

Mon, 01 Jan 0001 00:00:00 +0000

Create and manage saved views

Mon, 01 Jan 0001 00:00:00 +0000

Customize how you interact with traced function calls and evaluations

Create dynamic Leaderboards in Evaluations

Mon, 01 Jan 0001 00:00:00 +0000

Dynamic Leaderboards let you configure, customize, save, and update Leaderboard views directly from an evaluation.

Crew Studio

Mon, 01 Jan 0001 00:00:00 +0000

Build new automations with AI assistance, a visual editor, and integrated testing.

Criteria

Mon, 01 Jan 0001 00:00:00 +0000

CSV RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘CSVSearchTool’ is a powerful RAG (Retrieval-Augmented Generation) tool designed for semantic searches within a CSV file’s content.

Custom Evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Custom Metrics

Mon, 01 Jan 0001 00:00:00 +0000

Custom Multi-hop Query

Mon, 01 Jan 0001 00:00:00 +0000

Custom Single-hop Query

Mon, 01 Jan 0001 00:00:00 +0000

Customizing Test Data Generation

Mon, 01 Jan 0001 00:00:00 +0000

Data Handling

Mon, 01 Jan 0001 00:00:00 +0000

Data modeling

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to structure records for efficient data retrieval and management in Pinecone.

Data Privacy

Mon, 01 Jan 0001 00:00:00 +0000

Data retrieval with GPT Actions

Mon, 01 Jan 0001 00:00:00 +0000

Learn about performing data retrieval using APIs, relational databases, and vector databases with GPT Actions.

Dataset Management

Mon, 01 Jan 0001 00:00:00 +0000

Dataset Serialization

Mon, 01 Jan 0001 00:00:00 +0000

Deep Dive Into Evaluating RAG Outputs

Mon, 01 Jan 0001 00:00:00 +0000

This page contains information on evaluating the output of RAG systems.

DeepEval

Mon, 01 Jan 0001 00:00:00 +0000

Define and log attributes

Mon, 01 Jan 0001 00:00:00 +0000

Use attributes to add metadata to your traces and evaluations.

Deploying Models in Private Environments

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to pull and test Cohere’s container images using a license with Docker and Kubernetes.

DeploymentSampler

Mon, 01 Jan 0001 00:00:00 +0000

Client-side tokenized sampling from inference deployments for training and evaluation.

Develop Tests

Mon, 01 Jan 0001 00:00:00 +0000

Development

Mon, 01 Jan 0001 00:00:00 +0000

Different Types of API Keys and Rate Limits

Mon, 01 Jan 0001 00:00:00 +0000

This page describes Cohere API rate limits for production and evaluation keys.

Directory RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘DirectorySearchTool’ is a powerful RAG (Retrieval-Augmented Generation) tool designed for semantic searches within a directory’s content.

End-to-end example of RAG with Chat, Embed, and Rerank

Mon, 01 Jan 0001 00:00:00 +0000

Guide on using Cohere’s Retrieval Augmented Generation (RAG) capabilities covering the Chat, Embed, and Rerank endpoints (API v2).

Environment Variables

Mon, 01 Jan 0001 00:00:00 +0000

Eval Tool

Mon, 01 Jan 0001 00:00:00 +0000

Evals

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate agent trajectories using deterministic matching or LLM-as-judge evaluators with AgentEvals and LangSmith.

Evals In Prod

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a chatbot

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a complex agent

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a hosted API model

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a hosted API model using infrastructure managed by CoreWeave

Evaluate a model checkpoint

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a vLLM-compatible model checkpoint using infrastructure managed by CoreWeave

Evaluate a New LLM

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a prompt

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a RAG application

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate a simple LLM application

Mon, 01 Jan 0001 00:00:00 +0000

This is the canonical first hands-on with RAGAS, and it matters because it shows the metric-driven evaluation loop on a simple LLM app before you add retrieval complexity. Focus on how RAGAS frames a sample, a metric, and a score — the same abstractions scale up to full RAG evaluation. A subtle gotcha is that many RAGAS metrics call an LLM under the hood, so scores carry cost and run-to-run variance you must account for. Read this before the RAG tutorial, which layers retrieval metrics on top.
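
The cost and variance point can be made concrete with a small sketch. It does not use the RAGAS API: `noisy_judge_score` is a hypothetical stand-in for an LLM-backed metric (a fixed base score plus random jitter), and `score_with_repeats` is an illustrative helper that scores one sample several times and reports the spread alongside the number of judge calls.

```python
import random
import statistics


def noisy_judge_score(answer: str, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM-backed metric: base score plus jitter."""
    base = 0.8 if "Paris" in answer else 0.2
    # Clamp to [0, 1] so the stub behaves like a bounded metric.
    return min(1.0, max(0.0, base + rng.uniform(-0.1, 0.1)))


def score_with_repeats(answer: str, repeats: int = 5, seed: int = 0) -> dict:
    """Score the same sample several times and summarize the spread."""
    rng = random.Random(seed)
    scores = [noisy_judge_score(answer, rng) for _ in range(repeats)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "calls": repeats,  # each repeat would be one billable judge call
    }


summary = score_with_repeats("The capital of France is Paris.")
print(summary)
```

Reporting the standard deviation next to the mean makes it easier to tell a real regression from judge noise, at the price of multiplying the number of model calls.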

Evaluate a simple RAG system

Mon, 01 Jan 0001 00:00:00 +0000

This is the practical RAG evaluation walkthrough and the page most teams should run first when they need to measure a retrieval pipeline rather than guess at it. Pay attention to the distinction between retrieval metrics like context precision and recall and generation metrics like faithfulness and answer relevancy, because a RAG system can fail at either stage and the fix differs entirely. A common mistake is optimizing answer quality while ignoring context recall, leaving the model fluent but ungrounded. Start with the simple-evals page first if you are new to RAGAS.
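
To illustrate the retrieval side of that distinction, here is a deliberately simplified sketch. It assumes each chunk has an ID and that a set of relevant IDs is known for the question; RAGAS itself defines context precision and recall differently (typically with an LLM in the loop), so the functions below are illustrative approximations, not the library's metrics.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that the retriever managed to return."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant)


# Hypothetical retrieval result: four chunks returned, two chunks needed.
retrieved_ids = ["c1", "c2", "c3", "c4"]
relevant_ids = {"c1", "c5"}

precision = context_precision(retrieved_ids, relevant_ids)  # 0.25
recall = context_recall(retrieved_ids, relevant_ids)        # 0.5
print(precision, recall)
```

Low recall here means the generator never saw chunk `c5`, so no amount of prompt tuning on the generation side can ground the answer in it.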

Evaluate a Text-to-SQL Agent

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate an AI Agent

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate an AI Workflow

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate and Improve a RAG App

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate answers

Mon, 01 Jan 0001 00:00:00 +0000

Measure assistant response quality with LLM-based evaluation.

Evaluate external models

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to run evals on non-OpenAI models, using the OpenAI platform.

Evaluate RAG applications

Mon, 01 Jan 0001 00:00:00 +0000

Build and evaluate RAG applications using Weave with LLM judges

Evaluate using local scorers

Mon, 01 Jan 0001 00:00:00 +0000

Small language models that run locally to evaluate AI system safety and quality

Evaluating Multi-turn Conversations

Mon, 01 Jan 0001 00:00:00 +0000

Evaluating Text Summarization Models

Mon, 01 Jan 0001 00:00:00 +0000

This page discusses how to evaluate a model’s text summarization.

Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Guide to evaluating LLMs for specific tasks with metrics, human, and LLM-based methods

Evaluation Arena Test Cases

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation benchmark catalog

Mon, 01 Jan 0001 00:00:00 +0000

Browse the evaluation benchmarks available through LLM Evaluation Jobs

Evaluation best practices

Mon, 01 Jan 0001 00:00:00 +0000

Advanced evaluation patterns for production AI systems — handling ambiguous cases, scaling eval suites, avoiding eval gaming, and integrating evals into CI/CD pipelines.

Evaluation Component Level LLM Evals

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation concepts

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Dataset

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Datasets

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation End To End LLM Evals

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation End To End Multi Turn

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation End To End Single Turn

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Flags And Configs

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Introduction

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation LLM Tracing

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation MCP

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Multiturn Test Cases

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation overview

Mon, 01 Jan 0001 00:00:00 +0000

Learn about evaluating the correctness and completeness of assistant responses.

Evaluation Overview

Mon, 01 Jan 0001 00:00:00 +0000

This is the foundational concept page for DSPy’s evaluate-then-optimize workflow, and it is essential reading before you touch any teleprompter. The key insight is that DSPy treats evaluation as a first-class input to compilation rather than an afterthought — your dev set and metric become the signal the optimizer uses to rewrite prompts. Start here, then move to the metrics page to define what good actually means for your task. Watch out for evaluating on the same examples you optimize against, which inflates scores and hides overfitting.
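
The overfitting warning comes down to keeping optimization and evaluation examples disjoint. A minimal sketch in plain Python, without the DSPy API: `split_examples` and the example IDs are hypothetical, and the only point is that the dev set is held out before any optimizer sees the data.

```python
import random


def split_examples(examples, dev_fraction=0.3, seed=0):
    """Shuffle once, then carve off a held-out dev set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    dev_size = round(len(shuffled) * dev_fraction)
    dev = shuffled[:dev_size]
    train = shuffled[dev_size:]
    return train, dev


examples = [f"q{i}" for i in range(10)]  # placeholder example IDs
train, dev = split_examples(examples)

# Guard against leakage: no example may appear in both sets.
assert set(train).isdisjoint(dev)
print(len(train), len(dev))  # 7 3
```

The optimizer would be given `train` only; scores reported on `dev` then reflect generalization rather than memorized examples.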

Evaluation Prompts

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation quickstart

Mon, 01 Jan 0001 00:00:00 +0000

This is the fastest path into LangSmith evaluation and the right starting point before the deeper evaluator guides. The key takeaway is the loop of dataset, target function, evaluator, and run, which is the mental model every other LangSmith eval feature builds on. Pay attention to how examples and the evaluation client are wired up, since that boilerplate carries over to LLM-as-judge work. A common beginner mistake is evaluating against a dataset that does not represent production traffic, which produces reassuring but meaningless scores.
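
The shape of that loop can be sketched in plain Python. This is not the LangSmith client API: `target`, `exact_match`, `run_eval`, and the dictionary layout of the examples are all hypothetical names chosen to mirror the concepts in the description.

```python
def target(inputs: dict) -> dict:
    """Hypothetical application under test: normalizes the question text."""
    return {"answer": inputs["question"].strip().upper()}


def exact_match(outputs: dict, reference: dict) -> dict:
    """Evaluator: compares the produced answer with the reference answer."""
    return {"key": "exact_match",
            "score": float(outputs["answer"] == reference["answer"])}


def run_eval(dataset, target_fn, evaluators):
    """One run: call the target on every example, then apply each evaluator."""
    results = []
    for example in dataset:
        outputs = target_fn(example["inputs"])
        for evaluator in evaluators:
            result = evaluator(outputs, example["outputs"])
            results.append({"inputs": example["inputs"], **result})
    return results


dataset = [
    {"inputs": {"question": " ping "}, "outputs": {"answer": "PING"}},
    {"inputs": {"question": "pong"}, "outputs": {"answer": "pong"}},
]

results = run_eval(dataset, target, [exact_match])
mean_score = sum(r["score"] for r in results) / len(results)
print(mean_score)  # 0.5
```

Swapping the dataset for one sampled from production traffic changes nothing in the loop, which is why the representativeness of the examples matters more than the plumbing.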

Evaluation Sample

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Test Cases

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation types

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation Unit Testing In CI/CD

Mon, 01 Jan 0001 00:00:00 +0000

Evaluations overview

Mon, 01 Jan 0001 00:00:00 +0000

Evaluation-driven LLM application development to systematically improve applications

Evaluations with Vertex AI models

Mon, 01 Jan 0001 00:00:00 +0000

Evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Understand the fundamentals of evaluators and reward functions in reinforcement fine-tuning

Export evaluation data

Mon, 01 Jan 0001 00:00:00 +0000

Programmatically export evaluation results using the Evaluation REST API.

FAQ

Mon, 01 Jan 0001 00:00:00 +0000

Fireworks Agent: Classification

Mon, 01 Jan 0001 00:00:00 +0000

Benchmark base models, fine-tune on labeled data, and pick the best classifier — automatically.

Fireworks Agent: Evaluator Authoring

Mon, 01 Jan 0001 00:00:00 +0000

Have Fireworks Agent generate a reusable evaluator from your dataset — for scoring candidates in an SFT sweep, or for use with Managed RFT.

Fireworks Agent: Preference Learning (DPO/ORPO)

Mon, 01 Jan 0001 00:00:00 +0000

Run preference fine-tuning end-to-end with optional base-model sweep, automatic pair generation, and pairwise evaluation.

Fireworks Agent: Supervised Fine-Tuning

Mon, 01 Jan 0001 00:00:00 +0000

Run end-to-end SFT with Fireworks Agent — dataset inspection, hyperparameter sweep, evaluator-guided selection, and a deployed winner.

Galileo

Mon, 01 Jan 0001 00:00:00 +0000

Galileo integration for CrewAI tracing and evaluation

Generate Parallel Queries for Better RAG Retrieval

Mon, 01 Jan 0001 00:00:00 +0000

Build an agentic RAG system that can expand a user query into a more optimized set of queries for retrieval.

Get latest invocations by keys

Mon, 01 Jan 0001 00:00:00 +0000

Returns the latest invocations for the given keys on a source.

Getting Started

Mon, 01 Jan 0001 00:00:00 +0000

This five-minute quickstart is the fastest way into DeepEval: install it, write a test case, pick a metric, and invoke the deepeval test run command, which feels like pytest for LLM outputs. The critical thing to set up first is an OPENAI_API_KEY, because nearly all DeepEval metrics are LLM-as-a-judge evaluators that call a model under the hood. If a run appears stuck, suspect rate limits or an exhausted quota before a framework bug; that is the most common early gotcha. DeepEval covers similar ground to RAGAS but with a pytest-style assertion workflow; read the metrics introduction next.

Getting Started Agents

Mon, 01 Jan 0001 00:00:00 +0000

Getting Started Chatbots

Mon, 01 Jan 0001 00:00:00 +0000

Getting Started LLM Arena

Mon, 01 Jan 0001 00:00:00 +0000

Getting Started MCP

Mon, 01 Jan 0001 00:00:00 +0000

Getting Started RAG

Mon, 01 Jan 0001 00:00:00 +0000

Getting started with datasets

Mon, 01 Jan 0001 00:00:00 +0000

Introduction to evaluation datasets — the foundation for systematic AI testing and the first step in eval-driven development.

Getting started with GPT Actions

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to set up and test GPT actions from scratch with the OpenAI API.

Golden Synthesizer

Mon, 01 Jan 0001 00:00:00 +0000

Graders

Mon, 01 Jan 0001 00:00:00 +0000

Learn about graders used for evals and fine-tuning.

Guides AI Agent Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Guides AI Agent Evaluation Metrics

Mon, 01 Jan 0001 00:00:00 +0000

Guides Answer Correctness Metric

Mon, 01 Jan 0001 00:00:00 +0000

Guides Building Custom Metrics

Mon, 01 Jan 0001 00:00:00 +0000

Guides LLM As A Judge

Mon, 01 Jan 0001 00:00:00 +0000

Guides LLM Observability

Mon, 01 Jan 0001 00:00:00 +0000

Guides Multi Turn Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Guides Multi Turn Evaluation Metrics

Mon, 01 Jan 0001 00:00:00 +0000

Guides Multi Turn Simulation

Mon, 01 Jan 0001 00:00:00 +0000

Guides Optimizing Hyperparameters

Mon, 01 Jan 0001 00:00:00 +0000

Guides RAG Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Guides RAG Triad

Mon, 01 Jan 0001 00:00:00 +0000

Guides Red Teaming

Mon, 01 Jan 0001 00:00:00 +0000

Guides Regression Testing In CI/CD

Mon, 01 Jan 0001 00:00:00 +0000

Guides Tracing AI Agents

Mon, 01 Jan 0001 00:00:00 +0000

Guides Tracing Multi Turn

Mon, 01 Jan 0001 00:00:00 +0000

Guides Tracing RAG

Mon, 01 Jan 0001 00:00:00 +0000

Guides Using Custom Embedding Models

Mon, 01 Jan 0001 00:00:00 +0000

Guides Using Custom LLMs

Mon, 01 Jan 0001 00:00:00 +0000

Guides Using Synthesizer

Mon, 01 Jan 0001 00:00:00 +0000

Handle Streaming Refusals

Mon, 01 Jan 0001 00:00:00 +0000

Streaming refusals present a unique UX challenge: tokens have already been sent to the client before the model decides to refuse, so you cannot simply suppress the response. This guide covers detection strategies and graceful recovery patterns for when Claude mid-stream determines a request violates safety guidelines. Pay close attention to the stop reason codes and how they differ from normal completion events — your streaming parser needs to handle refusal signals without crashing or displaying partial unsafe content. Implement these patterns early in development rather than retrofitting them after users encounter jarring truncated responses in production.
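
One possible recovery pattern is to treat the UI as a sequence of actions, so that a late refusal can replace text that was already shown. The sketch below simulates a stream with plain dictionaries; the event types (`text_delta`, `message_stop`), the `stop_reason` value `"refusal"`, and the notice text are assumptions made for illustration, not a transcription of the real streaming protocol.

```python
from typing import Iterable, Iterator

REFUSAL_NOTICE = "The assistant could not complete this request."


def render_stream(events: Iterable[dict]) -> Iterator[tuple[str, str]]:
    """Translate stream events into UI actions: append, replace, or done."""
    for event in events:
        if event["type"] == "text_delta":
            yield ("append", event["text"])
        elif event["type"] == "message_stop":
            if event.get("stop_reason") == "refusal":
                # Tokens were already sent, so swap them out instead of
                # leaving a truncated partial answer on screen.
                yield ("replace", REFUSAL_NOTICE)
            else:
                yield ("done", "")
            return
        # Unknown event types are ignored rather than crashing the parser.


def apply_actions(actions: Iterable[tuple[str, str]]) -> str:
    """Minimal stand-in for a client view that applies the UI actions."""
    shown = ""
    for kind, payload in actions:
        if kind == "append":
            shown += payload
        elif kind == "replace":
            shown = payload
    return shown


refused = [
    {"type": "text_delta", "text": "Sure, here is"},
    {"type": "ping"},
    {"type": "message_stop", "stop_reason": "refusal"},
]
normal = [
    {"type": "text_delta", "text": "Hello "},
    {"type": "text_delta", "text": "world"},
    {"type": "message_stop", "stop_reason": "end_turn"},
]

print(apply_actions(render_stream(refused)))
print(apply_actions(render_stream(normal)))
```

Keeping the refusal branch separate from normal completion means the client never has to guess from the text alone whether a response ended early.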

Haystack and Cohere (Integration Guide)

Mon, 01 Jan 0001 00:00:00 +0000

Build custom LLM applications with Haystack, now integrated with Cohere for embedding, generation, chat, and retrieval.

Hierarchical Process

Mon, 01 Jan 0001 00:00:00 +0000

A comprehensive guide to understanding and applying the hierarchical process within your CrewAI projects, updated to reflect the latest coding practices and functionalities.

How to add evaluators to an existing experiment (Python only)

Mon, 01 Jan 0001 00:00:00 +0000

How to audit evaluator scores

Mon, 01 Jan 0001 00:00:00 +0000

How to create a composite evaluator

Mon, 01 Jan 0001 00:00:00 +0000

How to define a code evaluator

Mon, 01 Jan 0001 00:00:00 +0000

How to define a summary evaluator

Mon, 01 Jan 0001 00:00:00 +0000

How to define a target function to evaluate

Mon, 01 Jan 0001 00:00:00 +0000

How to define an LLM-as-a-judge evaluator

Mon, 01 Jan 0001 00:00:00 +0000

LLM-as-judge is the workhorse evaluator for open-ended outputs where exact-match scoring is impossible, so this page becomes essential the moment you move past trivial test cases. Pay close attention to how you define the judge prompt and scoring schema — vague rubrics produce noisy, irreproducible scores, the most common pitfall here. This is conceptually the same technique RAGAS and OpenAI’s agent evals implement, but LangSmith binds the judge directly to traced runs. Read the evaluation quickstart first to understand datasets and runs.
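
A tight scoring schema is easier to enforce when the judge's reply is validated before it is recorded. The following sketch is independent of LangSmith: `RUBRIC`, `stub_judge`, and `parse_judgment` are hypothetical, and the stub returns a canned reply where a real model call would go.

```python
import json

# Hypothetical rubric: an explicit scale and output format reduce judge noise.
RUBRIC = (
    "Rate the answer for factual accuracy on a 1-5 integer scale. "
    'Reply with JSON only: {"score": <int 1-5>, "reason": "<one sentence>"}'
)


def stub_judge(prompt: str) -> str:
    """Stand-in for a model call; returns a canned, well-formed judgment."""
    return '{"score": 4, "reason": "Accurate but omits one detail."}'


def parse_judgment(raw: str) -> dict:
    """Reject anything that does not match the scoring schema exactly."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judgment must be a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    reason = data.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return {"score": score, "reason": reason.strip()}


judgment = parse_judgment(stub_judge(RUBRIC + "\n\nAnswer: ..."))
print(judgment["score"])  # 4
```

Failing loudly on an out-of-range or free-text score keeps malformed judgments out of the aggregate instead of silently skewing it.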

How to evaluate a graph

Mon, 01 Jan 0001 00:00:00 +0000

How to evaluate a runnable

Mon, 01 Jan 0001 00:00:00 +0000

How to evaluate an application's intermediate steps

Mon, 01 Jan 0001 00:00:00 +0000

How to evaluate an LLM application

Mon, 01 Jan 0001 00:00:00 +0000

How to evaluate with OpenTelemetry

Mon, 01 Jan 0001 00:00:00 +0000

How to evaluate with repetitions

Mon, 01 Jan 0001 00:00:00 +0000

How to evaluate your agent with trajectory evaluations

Mon, 01 Jan 0001 00:00:00 +0000

How to improve your evaluator with few-shot examples

Mon, 01 Jan 0001 00:00:00 +0000

How to retry failed runs in experiments (Python only)

Mon, 01 Jan 0001 00:00:00 +0000

How to return multiple scores in one evaluator

Mon, 01 Jan 0001 00:00:00 +0000

How to run a pairwise evaluation

Mon, 01 Jan 0001 00:00:00 +0000

How to run an evaluation asynchronously

Mon, 01 Jan 0001 00:00:00 +0000

How to run an evaluation locally (Python only)

Mon, 01 Jan 0001 00:00:00 +0000

How to run evaluations with pytest

Mon, 01 Jan 0001 00:00:00 +0000

How to run evaluations with Vitest/Jest

Mon, 01 Jan 0001 00:00:00 +0000

How to use prebuilt evaluators

Mon, 01 Jan 0001 00:00:00 +0000

How to use the REST API

Mon, 01 Jan 0001 00:00:00 +0000

HuggingFace Dataset Evaluations

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use Hugging Face dataset evaluations with W&B Weave

Implement a CI/CD pipeline using LangSmith Deployment and Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Improve LLM-as-judge evaluators using human feedback

Mon, 01 Jan 0001 00:00:00 +0000

Improvement

Mon, 01 Jan 0001 00:00:00 +0000

Increase Consistency

Mon, 01 Jan 0001 00:00:00 +0000

Output consistency matters most when Claude powers automated pipelines where downstream code parses its responses. This guide covers techniques like temperature reduction, few-shot examples, structured output formats, and explicit schemas that make Claude’s responses more deterministic. The single biggest lever is providing concrete output examples in your prompt – this anchors the model’s formatting far more reliably than verbal instructions alone. Read this before building any system that pipes Claude output into JSON parsers, database inserts, or multi-step agent workflows.

Instructor

Mon, 01 Jan 0001 00:00:00 +0000

Trace and evaluate structured data extraction from LLMs with Weave’s Instructor integration, capturing Pydantic model validation, retry logic, and JSON schema enforcement for reliable structured output workflows.

Integration testing

Mon, 01 Jan 0001 00:00:00 +0000

Test agents with real LLM APIs by organizing tests, managing keys, handling flakiness, and controlling costs.

Intro to Retrieval

Mon, 01 Jan 0001 00:00:00 +0000

Ground LLMs in your own data using retrieval-augmented generation.

Introduction

Mon, 01 Jan 0001 00:00:00 +0000

Introduction Comparisons

Mon, 01 Jan 0001 00:00:00 +0000

Introduction Design Philosophy

Mon, 01 Jan 0001 00:00:00 +0000

Introduction to Evaluations

Mon, 01 Jan 0001 00:00:00 +0000

An introduction to evaluations with W&B Weave

La Plateforme

Mon, 01 Jan 0001 00:00:00 +0000

Mistral AI’s La Plateforme offers pay-as-you-go API access to its latest models with flexible deployment options

LangGraph

Mon, 01 Jan 0001 00:00:00 +0000

LangSmith CLI

Mon, 01 Jan 0001 00:00:00 +0000

Query and manage LangSmith projects, traces, runs, datasets, evaluators, experiments, and threads from the terminal

LangSmith Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

LangSmith Polly

Mon, 01 Jan 0001 00:00:00 +0000

LangSmith skills

Mon, 01 Jan 0001 00:00:00 +0000

Use Agent Skills to work with LangSmith traces, datasets, and evaluators from your coding agent.

Learning DSPy

Mon, 01 Jan 0001 00:00:00 +0000

Three stages of building AI systems: programming, evaluation, and optimization

Let Claude use your computer from the CLI

Mon, 01 Jan 0001 00:00:00 +0000

Enable computer use in the Claude Code CLI so Claude can open apps, click, type, and see your screen on macOS. Test native apps, debug visual issues, and automate GUI-only tools without leaving your terminal.

LlamaIndex

Mon, 01 Jan 0001 00:00:00 +0000

Automatically trace and debug LlamaIndex applications with Weave, capturing all LLM calls, RAG pipelines, agent steps, and evaluations for comprehensive observability of your data-connected AI workflows.

LlamaIndex Agent Evaluation Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

LLM Benchmarking Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

LLM Evaluations

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to run LLM-as-a-Judge evaluations

LLM Judge

Mon, 01 Jan 0001 00:00:00 +0000

Local development & testing

Mon, 01 Jan 0001 00:00:00 +0000

Log evaluation data from your code

Mon, 01 Jan 0001 00:00:00 +0000

Flexible, incremental way to log evaluation data from Python and TypeScript code

Logfire Integration

Mon, 01 Jan 0001 00:00:00 +0000

Manage Weave Projects

Mon, 01 Jan 0001 00:00:00 +0000

Use Weave projects to organize related assets like traces, prompts, evaluations, models, and dashboards.

Maxim Integration

Mon, 01 Jan 0001 00:00:00 +0000

Start Agent monitoring, evaluation, and observability

Metrics

Mon, 01 Jan 0001 00:00:00 +0000

In DSPy a metric is the objective function that drives both evaluation and optimization, so this page matters more than a typical reference — your metric definition directly shapes how teleprompters compile and improve a program. Pay close attention to the difference between simple answer-matching metrics and metrics that themselves call an LM to judge quality, since the latter adds cost and variance you have to control. A common pitfall is returning a bare boolean where an optimizer expects a float score. Read the evaluation overview first, then pair this with the optimizers documentation.
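
As a rough illustration of a graded metric, the sketch below returns a float in [0, 1] instead of a bare boolean. The `(example, pred, trace=None)` signature is assumed to mirror DSPy's metric convention; the `keywords` and `answer` fields and the `SimpleNamespace` objects are hypothetical stand-ins so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def keyword_coverage(example, pred, trace=None) -> float:
    """Fraction of expected keywords that appear in the predicted answer."""
    gold = {word.lower() for word in example.keywords}
    if not gold:
        return 0.0
    answer = pred.answer.lower()
    found = sum(1 for word in gold if word in answer)
    # Always a float, so partial credit is visible to whatever consumes it.
    return found / len(gold)


# Stand-ins for an example and a prediction (hypothetical field names).
example = SimpleNamespace(keywords=["Paris", "France"])
pred = SimpleNamespace(answer="Paris is the capital.")

score = keyword_coverage(example, pred)
print(score)  # 0.5
```

A graded score distinguishes a half-right answer from a wrong one, which a boolean collapses into the same failure.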

Metrics & Attributes

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Answer Relevancy

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Arena G-Eval

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Argument Correctness

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Bias

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Contextual Precision

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Contextual Recall

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Contextual Relevancy

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Conversation Completeness

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Conversational DAG

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Conversational G-Eval

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Custom

Mon, 01 Jan 0001 00:00:00 +0000

Metrics DAG

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Exact Match

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Faithfulness

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Goal Accuracy

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Hallucination

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Introduction

Mon, 01 Jan 0001 00:00:00 +0000

This page introduces DeepEval’s fifty-plus metrics, each scored from 0 to 1 with reasoning, and it matters because choosing the right metrics is the whole game in LLM evaluation. The key discipline the docs push is restraint: use no more than about five metrics, roughly two or three generic plus one or two custom to your use case, so you prioritize what truly matters instead of drowning in numbers. Because the metrics are LLM-as-a-judge, expect real cost and some run-to-run variance. This parallels RAGAS’s metric suite; read getting-started first if you have not run an evaluation yet.

Metrics JSON Correctness

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Knowledge Retention

Mon, 01 Jan 0001 00:00:00 +0000

Metrics LLM Evals

Mon, 01 Jan 0001 00:00:00 +0000

Metrics MCP Task Completion

Mon, 01 Jan 0001 00:00:00 +0000

Metrics MCP Use

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Misuse

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Multi Turn MCP Use

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Non Advice

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Pattern Match

Mon, 01 Jan 0001 00:00:00 +0000

Metrics PII Leakage

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Plan Adherence

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Plan Quality

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Prompt Alignment

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Ragas

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Role Adherence

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Role Violation

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Step Efficiency

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Summarization

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Task Completion

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Tool Correctness

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Tool Use

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Topic Adherence

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Toxicity

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Turn Contextual Precision

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Turn Contextual Recall

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Turn Contextual Relevancy

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Turn Faithfulness

Mon, 01 Jan 0001 00:00:00 +0000

Metrics Turn Relevancy

Mon, 01 Jan 0001 00:00:00 +0000

Miscellaneous

Mon, 01 Jan 0001 00:00:00 +0000

Mitigate Jailbreaks

Mon, 01 Jan 0001 00:00:00 +0000

Jailbreak mitigation is essential for any production deployment where Claude interacts with untrusted user input. This guide covers defense-in-depth strategies including system prompt hardening, input validation, and output filtering. A common pitfall is relying solely on system prompt instructions for safety – attackers routinely bypass single-layer defenses, so layering multiple techniques is critical. Read this alongside the harmlessness screens documentation to understand how Anthropic’s built-in protections complement your application-level guardrails.
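
The layering idea can be sketched as independent checks wrapped around the model call. Everything below is illustrative: the regular expressions, the length limit, and `guarded_call` are hypothetical and far from a complete defense; they only show input validation and output filtering as separate layers alongside whatever the system prompt enforces.

```python
import re

MAX_INPUT_CHARS = 2000
# Illustrative patterns only; real deployments need much broader coverage.
SUSPICIOUS_INPUT = re.compile(
    r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
)
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")


def validate_input(user_text: str) -> tuple[bool, str]:
    """Layer 1: reject oversized or obviously adversarial input."""
    if len(user_text) > MAX_INPUT_CHARS:
        return False, "input too long"
    if SUSPICIOUS_INPUT.search(user_text):
        return False, "suspicious instruction override"
    return True, ""


def filter_output(model_text: str) -> str:
    """Layer 2: redact secret-looking strings before they reach the user."""
    return SECRET_PATTERN.sub("[REDACTED]", model_text)


def guarded_call(user_text: str, model) -> str:
    """Run both layers around a model callable (any str -> str function)."""
    ok, reason = validate_input(user_text)
    if not ok:
        return f"Request blocked: {reason}"
    return filter_output(model(user_text))


blocked = guarded_call("Please ignore previous instructions.", lambda t: t)
redacted = guarded_call("hi", lambda t: "key sk-abcdefgh1234")
print(blocked)
print(redacted)
```

Because each layer is a separate function, a bypass of one check still has to get past the other, which is the point of defense in depth.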

Model optimization

Mon, 01 Jan 0001 00:00:00 +0000

Ensure quality model outputs with evals and fine-tuning in the OpenAI platform.

Models Benchmarks

Mon, 01 Jan 0001 00:00:00 +0000

Mistral’s benchmarked models excel in reasoning, multilingual tasks, coding, and multimodal capabilities, outperforming competitors in key benchmarks

Multi-Run Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Coherence

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Editing

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Helpfulness

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Reference

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Text To Image

Mon, 01 Jan 0001 00:00:00 +0000

Non-English Testset Generation

Mon, 01 Jan 0001 00:00:00 +0000

Observability

Mon, 01 Jan 0001 00:00:00 +0000

Observability for LLMs ensures visibility, debugging, and performance optimization across prototyping, testing, and production

Online Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

OpenAI

Mon, 01 Jan 0001 00:00:00 +0000

Integrate OpenAI with Weave for tracing, evaluation, and monitoring

Opik Integration

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use Comet Opik to debug, evaluate, and monitor your CrewAI applications with comprehensive tracing, automated evaluations, and production-ready dashboards.

Optimizing LLM Accuracy

Mon, 01 Jan 0001 00:00:00 +0000

Learn strategies to enhance the accuracy of large language models using techniques like prompt engineering, retrieval-augmented generation, and fine-tuning.

Overview

Mon, 01 Jan 0001 00:00:00 +0000

Monitor, evaluate, and optimize your CrewAI agents with comprehensive observability tools

Overview

Mon, 01 Jan 0001 00:00:00 +0000

Overview

Mon, 01 Jan 0001 00:00:00 +0000

Patronus AI Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Monitor and evaluate CrewAI agent performance using Patronus AI’s comprehensive evaluation platform for LLM outputs and agent behaviors.

Performance

Mon, 01 Jan 0001 00:00:00 +0000

Single-node Chroma performance benchmarks and limitations.

Performance benchmarking

Mon, 01 Jan 0001 00:00:00 +0000

Measure and optimize your deployment’s performance with load testing

Persona Generation

Mon, 01 Jan 0001 00:00:00 +0000

Pin and compare runs

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use pinned and baseline runs to keep track of important runs and efficiently evaluate model experiments.

Prompt Evaluation Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

Prompt Optimization Copro

Mon, 01 Jan 0001 00:00:00 +0000

Prompt Optimization Gepa

Mon, 01 Jan 0001 00:00:00 +0000

Prompt Optimization Introduction

Mon, 01 Jan 0001 00:00:00 +0000

Prompt Optimization Miprov2

Mon, 01 Jan 0001 00:00:00 +0000

Prompt Optimization Simba

Mon, 01 Jan 0001 00:00:00 +0000

Prompting capabilities

Mon, 01 Jan 0001 00:00:00 +0000

Learn effective prompting techniques for classification, summarization, personalization, and evaluation with Mistral models

Prune Threads

Mon, 01 Jan 0001 00:00:00 +0000

Prune threads by ID. The ‘delete’ strategy removes threads entirely. The ‘keep_latest’ strategy prunes old checkpoints but keeps threads and their latest state.
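
To make the difference between the two strategies concrete, here is a minimal in-memory sketch. The `Store` layout (thread ID mapped to a list of checkpoints, oldest first) and the `prune_threads` function are assumptions invented for this example; the real endpoint's request shape and storage model may differ.

```python
from typing import Dict, List

# Hypothetical in-memory store: thread ID -> list of checkpoints, oldest first.
Store = Dict[str, List[dict]]


def prune_threads(store: Store, thread_ids: List[str], strategy: str = "delete") -> int:
    """Prune the given threads in place and return how many threads were affected."""
    if strategy not in ("delete", "keep_latest"):
        raise ValueError(f"unknown strategy: {strategy}")
    pruned = 0
    for tid in thread_ids:
        if tid not in store:
            continue  # unknown IDs are skipped in this sketch
        if strategy == "delete":
            del store[tid]  # the thread disappears entirely
        else:
            store[tid] = store[tid][-1:]  # keep the thread, drop all but the newest checkpoint
        pruned += 1
    return pruned
```

With 'keep_latest' the thread ID stays valid and can be resumed from its latest state, while 'delete' leaves nothing to resume.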

Quick Start

Mon, 01 Jan 0001 00:00:00 +0000

Quickstart: Retrieval Augmented Generation (RAG)

Mon, 01 Jan 0001 00:00:00 +0000

How to build a RAG workflow in under 5 mins!

RAG Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

RAG Tool

Mon, 01 Jan 0001 00:00:00 +0000

The ‘RagTool’ is a dynamic knowledge base tool for answering questions using Retrieval-Augmented Generation.

Red teaming

Mon, 01 Jan 0001 00:00:00 +0000

Learn how red teaming fits into AI evaluation, including Promptfoo open source and OpenAI Red Teaming for enterprise teams.

Reduce Hallucinations

Mon, 01 Jan 0001 00:00:00 +0000

Hallucination reduction is arguably the most impactful guardrail topic for practitioners building retrieval-augmented or factual applications with Claude. The guide covers grounding techniques such as providing source documents, instructing the model to quote directly, and asking it to flag uncertainty. A key gotcha is that simply telling Claude “don’t hallucinate” is far less effective than structuring prompts so the model can cite or decline – give it an explicit escape hatch like “say ‘I don’t know’ if the answer isn’t in the provided context.” Pair this with the evaluation techniques in the testing docs to measure hallucination rates systematically.
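
One way the "cite or decline" structure might look in practice is sketched below. The `build_grounded_prompt` helper, the `<document>` tag layout, and the exact instruction wording are assumptions chosen for illustration, not a format prescribed by the guide.

```python
from typing import List

# The explicit escape hatch the model is told to use when the context lacks an answer.
FALLBACK = "I don't know"


def build_grounded_prompt(question: str, documents: List[str]) -> str:
    """Assemble a prompt that supplies sources, asks for quotes, and allows declining."""
    numbered = "\n\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, start=1)
    )
    return (
        f"{numbered}\n\n"
        "Answer the question using only the documents above. "
        "Quote the passage that supports each claim and give its document index. "
        f'If the answer is not in the documents, reply exactly: "{FALLBACK}".\n\n'
        f"Question: {question}"
    )
```

Because the fallback string is fixed, an evaluation script can count exact matches to separate declined answers from attempted ones when measuring hallucination rates.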

Reduce Latency

Mon, 01 Jan 0001 00:00:00 +0000

Latency optimization directly impacts user experience and cost in production Claude deployments. This guide walks through techniques like prompt length reduction, streaming, model selection trade-offs, and caching strategies that can cut response times significantly. Start with the quick wins – enabling streaming and trimming unnecessary context from prompts – before moving to architectural changes like prompt caching. Be aware that some latency reduction techniques (such as using smaller models or shorter prompts) trade off against output quality, so always measure both metrics together.

Reduce Prompt Leak

Mon, 01 Jan 0001 00:00:00 +0000

Prompt leakage is one of the most common security concerns in production LLM applications, and this guide provides concrete techniques for preventing Claude from revealing system prompts to end users. Focus on the layered defense approach — no single technique is sufficient, so you need to combine prompt structure, output filtering, and behavioral instructions. A frequent mistake is relying solely on “do not reveal your instructions” directives, which are trivially bypassed by indirect extraction attacks. Read this alongside the general guardrails documentation to build a comprehensive safety posture before shipping user-facing agents.
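
The output-filtering layer mentioned above could, for example, check replies for runs of words copied from the system prompt. The sketch below is an assumption about how such a filter might be written: `leaks_system_prompt` and the five-word window are hypothetical choices, and this check only catches near-verbatim copying, so paraphrased or translated leaks would pass it.

```python
import re
from typing import Set, Tuple


def _ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase n-word runs in the text, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_system_prompt(output: str, system_prompt: str, n: int = 5) -> bool:
    """Return True if the output shares any n-word run with the system prompt."""
    return bool(_ngrams(output, n) & _ngrams(system_prompt, n))
```

A reply flagged by this check can be replaced with a generic refusal before it reaches the user, which is why it works as one layer among several rather than a complete defense.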

Remote Environment Setup

Mon, 01 Jan 0001 00:00:00 +0000

Implement the /init endpoint to run evaluations in your infrastructure

Replay Tasks from Latest Crew Kickoff

Mon, 01 Jan 0001 00:00:00 +0000

Replay tasks from the latest crew.kickoff(…)

Report Evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Retrieval

Mon, 01 Jan 0001 00:00:00 +0000

Retrieval

Mon, 01 Jan 0001 00:00:00 +0000

LangChain’s retrieval guide covers the foundational abstractions for document loading, splitting, embedding, and querying that underpin every RAG application built on the framework. Understanding the Retriever interface is critical because it is the common contract that vector stores, BM25 indexes, and custom retrieval strategies all implement. Focus on how retrievers compose with chains and agents, since the retrieval step is often the performance bottleneck in production RAG pipelines. Read this before the RAG-specific guide to ensure you understand the building blocks before seeing them assembled into a full application.

Retrieval

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to search your data using semantic similarity with the OpenAI API.

Retrieval Augmented Generation (RAG)

Mon, 01 Jan 0001 00:00:00 +0000

Guide on using Cohere’s Retrieval Augmented Generation (RAG) capabilities such as document grounding and citations.

Retrieval augmented generation (RAG) - Cohere on Azure AI Foundry

Mon, 01 Jan 0001 00:00:00 +0000

A guide for performing retrieval augmented generation (RAG) with Cohere’s Command models on Azure AI Foundry (API v2).

Retrieval augmented generation (RAG) - quickstart

Mon, 01 Jan 0001 00:00:00 +0000

A quickstart guide for performing retrieval augmented generation (RAG) with Cohere’s Command models (v2 API).

Retrieval evaluation using LLM-as-a-judge via Pydantic AI

Mon, 01 Jan 0001 00:00:00 +0000

This page contains a tutorial on how to evaluate retrieval systems using LLMs as judges via Pydantic AI.

Retrieval-Augmented Generation (RAG)

Mon, 01 Jan 0001 00:00:00 +0000

Retry Strategies

Mon, 01 Jan 0001 00:00:00 +0000

Review items in an annotation queue

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate trace items and submit structured feedback using a simplified review interface.

Routers

Mon, 01 Jan 0001 00:00:00 +0000

Distribute traffic across multiple deployments for A/B testing, traffic migration, and load distribution.

Rubric-Based Evaluation

Mon, 01 Jan 0001 00:00:00 +0000

Run an evaluation from the Playground

Mon, 01 Jan 0001 00:00:00 +0000

Run an evaluation from the prompt playground

Mon, 01 Jan 0001 00:00:00 +0000

Run an evaluation with multimodal content

Mon, 01 Jan 0001 00:00:00 +0000

Run backtests on a new version of an agent

Mon, 01 Jan 0001 00:00:00 +0000

Safety best practices

Mon, 01 Jan 0001 00:00:00 +0000

Comprehensive safety practices for responsible AI deployment — covering moderation, adversarial testing, human oversight, prompt engineering for safety, and production monitoring.

Scoring Overview

Mon, 01 Jan 0001 00:00:00 +0000

Evaluate AI outputs and return evaluation metrics with Weave Scorers

Set Latest Assistant Version

Mon, 01 Jan 0001 00:00:00 +0000

Set the latest version for an assistant.

Set up automations

Mon, 01 Jan 0001 00:00:00 +0000

Create event-driven automations that trigger actions based on monitor metrics and trace activity.

Set up composite online evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Set up guardrails

Mon, 01 Jan 0001 00:00:00 +0000

Ensure LLM safety and measure output quality in production applications

Set up LLM-as-a-judge online evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Set up monitors

Mon, 01 Jan 0001 00:00:00 +0000

Passively score production traffic to surface trends and issues

Set up multi-turn online evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Set up online code evaluators

Mon, 01 Jan 0001 00:00:00 +0000

Simple Validation

Mon, 01 Jan 0001 00:00:00 +0000

Single-hop Query Testset

Mon, 01 Jan 0001 00:00:00 +0000

Single-Node Performance

Mon, 01 Jan 0001 00:00:00 +0000

Single-node Chroma performance benchmarks and limitations.

Span-Based

Mon, 01 Jan 0001 00:00:00 +0000

Supported Models

Mon, 01 Jan 0001 00:00:00 +0000

Supported models for Evaluations

Swarm

Mon, 01 Jan 0001 00:00:00 +0000

Synthesizer Generate From Contexts

Mon, 01 Jan 0001 00:00:00 +0000

Synthesizer Generate From Docs

Mon, 01 Jan 0001 00:00:00 +0000

Synthesizer Generate From Goldens

Mon, 01 Jan 0001 00:00:00 +0000

Synthesizer Generate From Scratch

Mon, 01 Jan 0001 00:00:00 +0000

Synthetic Data Generation Introduction

Mon, 01 Jan 0001 00:00:00 +0000

Test

Mon, 01 Jan 0001 00:00:00 +0000

Strategies for testing LangChain agents, including unit tests, integration tests, and trajectory evaluations.

Test

Mon, 01 Jan 0001 00:00:00 +0000

Test

Mon, 01 Jan 0001 00:00:00 +0000

Strategies for testing LangChain agents, including unit tests, integration tests, and trajectory evaluations.

Test

Mon, 01 Jan 0001 00:00:00 +0000

Test a ReAct agent with Pytest/Vitest and LangSmith

Mon, 01 Jan 0001 00:00:00 +0000

Test Agent Card

Mon, 01 Jan 0001 00:00:00 +0000

Online tool to validate if a domain supports the A2A protocol and visualize agent card information. Enter any URL to check for A2A protocol support and parse the agent.json file.

Test deployed agents

Mon, 01 Jan 0001 00:00:00 +0000

Test multi-turn conversations

Mon, 01 Jan 0001 00:00:00 +0000

Test Pinecone at scale

Mon, 01 Jan 0001 00:00:00 +0000

Test Pinecone with a real-world dataset and semantic search workload.

Testing

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to test your CrewAI crew and evaluate its performance.

Testing

Mon, 01 Jan 0001 00:00:00 +0000

Testset Generation

Mon, 01 Jan 0001 00:00:00 +0000

Testset Generation for Agents or Tool use cases

Mon, 01 Jan 0001 00:00:00 +0000

Testset Generation for RAG

Mon, 01 Jan 0001 00:00:00 +0000

Testset Generation for RAG

Mon, 01 Jan 0001 00:00:00 +0000

Text Embeddings

Mon, 01 Jan 0001 00:00:00 +0000

Generate and use text embeddings with Mistral AI’s API for NLP tasks like similarity, classification, and retrieval

Text-to-SQL Evaluation Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

Together AI

Mon, 01 Jan 0001 00:00:00 +0000

Track and evaluate Together AI’s open source LLMs using Weave’s OpenAI SDK compatibility for seamless integration with model calls, fine-tuning workflows, and hosted models.

Trace and Evaluate a Computer Vision Pipeline with Weave

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to trace and evaluate a computer vision pipeline with W&B Weave

Trace grading

Mon, 01 Jan 0001 00:00:00 +0000

Use trace grading to create datasets, configure graders, and track evaluation runs for your models.

Tracing and logging evaluations with Observability tools

Mon, 01 Jan 0001 00:00:00 +0000

Training Overview

Mon, 01 Jan 0001 00:00:00 +0000

Launch RFT jobs using the eval-protocol CLI

Troubleshooting

Mon, 01 Jan 0001 00:00:00 +0000

TruLens

Mon, 01 Jan 0001 00:00:00 +0000

Using TruLens and Pinecone to evaluate grounded LLM applications

Tutorial Introduction

Mon, 01 Jan 0001 00:00:00 +0000

Tutorial Setup

Mon, 01 Jan 0001 00:00:00 +0000

TXT RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘TXTSearchTool’ is designed to perform a RAG (Retrieval-Augmented Generation) search within the content of a text file.

Unit testing

Mon, 01 Jan 0001 00:00:00 +0000

Test agent logic without API calls using fake chat models and in-memory persistence.

Unit testing

Mon, 01 Jan 0001 00:00:00 +0000

Test agent logic without API calls using fake chat models and in-memory persistence.

Use builtin scorers

Mon, 01 Jan 0001 00:00:00 +0000

Use Weave’s predefined scorers for evaluating your AI applications

Use Claude Code with Chrome (beta)

Mon, 01 Jan 0001 00:00:00 +0000

Connect Claude Code to your Chrome browser to test web apps, debug with console logs, automate form filling, and extract data from web pages.

Use server-side caching

Mon, 01 Jan 0001 00:00:00 +0000

Cache values server-side in your agent deployment using stale-while-revalidate and key-value cache APIs.
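
For readers unfamiliar with stale-while-revalidate, the sketch below shows the general pattern in plain Python. It is not the deployment's cache API: the `SWRCache` class, its parameters, and the explicit `revalidate` step (a synchronous stand-in for a background refresh) are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class SWRCache:
    """Serve fresh values directly, serve stale values while queuing a refresh."""

    def __init__(self, fresh_for: float, stale_for: float,
                 clock: Callable[[], float] = time.monotonic):
        self.fresh_for = fresh_for  # seconds a value counts as fresh
        self.stale_for = stale_for  # extra seconds a stale value may still be served
        self.clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}
        self._pending: Dict[str, Callable[[], Any]] = {}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age <= self.fresh_for:
                return value  # fresh hit
            if age <= self.fresh_for + self.stale_for:
                self._pending[key] = loader  # queue a refresh, answer immediately
                return value
        # Missing or too old: load synchronously.
        value = loader()
        self._entries[key] = (value, now)
        self._pending.pop(key, None)
        return value

    def revalidate(self) -> None:
        """Run queued refreshes; a real server would do this in the background."""
        for key, loader in list(self._pending.items()):
            self._entries[key] = (loader(), self.clock())
        self._pending.clear()
```

The caller never waits on the loader while a value is within the stale window, which is the latency benefit the pattern trades against briefly serving older data.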

User Simulation

Mon, 01 Jan 0001 00:00:00 +0000

Using GPT-5.2

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use and migrate to GPT-5.2 and the GPT-5 model family, the latest models in the OpenAI API.

Using Pre-chunked Data

Mon, 01 Jan 0001 00:00:00 +0000

Using Secrets

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to create secrets that can be utilized within your reward function.

Using standard tests

Mon, 01 Jan 0001 00:00:00 +0000

Using standard tests

Mon, 01 Jan 0001 00:00:00 +0000

Verdict

Mon, 01 Jan 0001 00:00:00 +0000

Use Verdict evaluation framework with Weave to trace and monitor your LLM evaluation pipelines

Verifiers

Mon, 01 Jan 0001 00:00:00 +0000

Track and debug Verifiers RL environments and LLM agent training with Weave, capturing multi-round conversations, evaluation rollouts, and model performance metrics for comprehensive observability of reinforcement learning workflows.

Vibe Coder Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

Vibe Coding

Mon, 01 Jan 0001 00:00:00 +0000

Weave Integration

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use Weights & Biases (W&B) Weave to track, experiment with, evaluate, and improve your CrewAI applications.

Website RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘WebsiteSearchTool’ is designed to perform a RAG (Retrieval-Augmented Generation) search within the content of a website.

What is Weave?

Mon, 01 Jan 0001 00:00:00 +0000

Learn about W&B Weave and how it helps you build, evaluate, and improve LLM applications

What's New

Mon, 01 Jan 0001 00:00:00 +0000

The latest updates and improvements to AG-UI

Why Evaluate Agents

Mon, 01 Jan 0001 00:00:00 +0000

Workflow Evaluation Quickstart

Mon, 01 Jan 0001 00:00:00 +0000

Working with evals

Mon, 01 Jan 0001 00:00:00 +0000

Build, run, and iterate on evaluations to systematically test and improve AI model outputs — OpenAI’s practical guide to eval-driven development.

XML RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘XMLSearchTool’ is designed to perform a RAG (Retrieval-Augmented Generation) search within the content of an XML file.

YouTube Channel RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘YoutubeChannelSearchTool’ is designed to perform a RAG (Retrieval-Augmented Generation) search within the content of a YouTube channel.