Experimentation ↗

Original Documentation

What is an experiment?#

An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses.

Principles of a Good Experiment#

Define measurable metrics: Use metrics like accuracy, precision, or recall to quantify the impact of your changes.
Systematic result storage: Ensure results are stored in an organized manner for easy comparison and tracking.
Isolate changes: Make one change at a time to identify its specific impact. Avoid making multiple changes simultaneously, as this can obscure the results.
Iterative process: Follow a structured approach: *Make a change → Run evaluations → Observe results →

graph LR
    A[Make a change] --> B[Run evaluations]
    B --> C[Observe results]
    C --> D[Hypothesize next change]
    D --> A

Experiments in Ragas#

Components of an Experiment#

Test dataset: The data used to evaluate the system.
Application endpoint: The application, component or model being tested.
Metrics: Quantitative measures to assess performance.

Execution Process#

Setup: Define the experiment parameters and load the test dataset.
Run: Execute the application on each sample in the dataset.
Evaluate: Apply metrics to measure performance.
Store: Save results for analysis and comparison.

Creating Experiments with Ragas#

Ragas provides an @experiment decorator to streamline the experiment creation process. If you prefer a hands-on intro first, see the Quick Start guide.

Basic Experiment Structure#

from ragas import experiment
import asyncio

@experiment()
async def my_experiment(row):
    # Process the input through your system
    response = await asyncio.to_thread(my_system_function, row["input"])

    # Return results for evaluation
    return {
        **row,  # Include original data
        "response": response,
        "experiment_name": "baseline_v1",
        # Add any additional metadata
        "model_version": "gpt-4o",
        "timestamp": datetime.now().isoformat()
    }

Running Experiments#

from ragas import Dataset

# Load your test dataset
dataset = Dataset.load(name="test_data", backend="local/csv", root_dir="./data")

# Run the experiment
results = await my_experiment.arun(dataset)

Parameterized Experiments#

You can create parameterized experiments to test different configurations:

@experiment()
async def model_comparison_experiment(row, model_name: str, temperature: float):
    # Configure your system with the parameters
    response = await my_system_function(
        row["input"], 
        model=model_name, 
        temperature=temperature
    )

    return {
        **row,
        "response": response,
        "experiment_name": f"{model_name}_temp_{temperature}",
        "model_name": model_name,
        "temperature": temperature
    }

# Run with different parameters
results_gpt4 = await model_comparison_experiment.arun(
    dataset, 
    model_name="gpt-4o", 
    temperature=0.1
)

results_gpt35 = await model_comparison_experiment.arun(
    dataset, 
    model_name="gpt-3.5-turbo", 
    temperature=0.1
)

Experiment Management Best Practices#

1. Consistent Naming#

Use descriptive names that include:

What changed (model, prompt, parameters)
Version numbers
Date/time if relevant

experiment_name = "gpt4o_v2_prompt_temperature_0.1_20241201"

2. Result Storage#

Experiments automatically save results to CSV files in the experiments/ directory with timestamps:

experiments/
├── 20241201-143022-baseline_v1.csv
├── 20241201-143515-gpt4o_improved_prompt.csv
└── 20241201-144001-comparison.csv

3. Metadata Tracking#

Include relevant metadata in your experiment results:

return {
    **row,
    "response": response,
    "experiment_name": "baseline_v1",
    "git_commit": "a1b2c3d",
    "environment": "staging",
    "model_version": "gpt-4o-2024-08-06",
    "total_tokens": response.usage.total_tokens,
    "response_time_ms": response_time
}

Advanced Experiment Patterns#

A/B Testing#

Test two different approaches simultaneously:

@experiment()
async def ab_test_experiment(row, variant: str):
    if variant == "A":
        response = await system_variant_a(row["input"])
    else:
        response = await system_variant_b(row["input"])

    return {
        **row,
        "response": response,
        "variant": variant,
        "experiment_name": f"ab_test_variant_{variant}"
    }

# Run both variants
results_a = await ab_test_experiment.arun(dataset, variant="A")
results_b = await ab_test_experiment.arun(dataset, variant="B")

Multi-Stage Experiments#

For complex systems with multiple components:

@experiment()
async def multi_stage_experiment(row):
    # Stage 1: Retrieval
    retrieved_docs = await retriever(row["query"])

    # Stage 2: Generation
    response = await generator(row["query"], retrieved_docs)

    return {
        **row,
        "retrieved_docs": retrieved_docs,
        "response": response,
        "num_docs_retrieved": len(retrieved_docs),
        "experiment_name": "multi_stage_v1"
    }

Error Handling in Experiments#

Handle errors gracefully to avoid losing partial results:

@experiment()
async def robust_experiment(row):
    try:
        response = await my_system_function(row["input"])
        error = None
    except Exception as e:
        response = None
        error = str(e)

    return {
        **row,
        "response": response,
        "error": error,
        "success": error is None,
        "experiment_name": "robust_v1"
    }

Integrating with Metrics#

Experiments work seamlessly with Ragas metrics:

from ragas.metrics import FactualCorrectness

@experiment()
async def evaluated_experiment(row):
    response = await my_system_function(row["input"])

    # Calculate metrics inline
    factual_score = FactualCorrectness().score(
        response=response,
        reference=row["expected_output"]
    )

    return {
        **row,
        "response": response,
        "factual_correctness": factual_score.value,
        "factual_reason": factual_score.reason,
        "experiment_name": "evaluated_v1"
    }

This integration allows you to automatically calculate and store metric scores alongside your experiment results, making it easy to track performance improvements over time.

Link last verified June 7, 2026. View original ↗

Source: RAGAS Docs

Link last verified: 2026-03-04