RAG Evaluation
The rag_eval template provides a complete RAG evaluation setup with custom metrics, dataset management, and experiment tracking.
Create the Project#
# Using uvx (no installation required)
uvx ragas quickstart rag_eval
cd rag_eval
# Or with ragas installed
ragas quickstart rag_eval
cd rag_eval
Install Dependencies#
uv sync
Or with pip:
pip install -e .
Set Your API Key#
For OpenAI (the default):
export OPENAI_API_KEY="your-openai-key"
For Anthropic:
export ANTHROPIC_API_KEY="your-anthropic-key"
Update evals.py:
import os
from anthropic import Anthropic
from ragas.llms import llm_factory
client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic", client=client)
For Google Gemini:
export GOOGLE_API_KEY="your-google-api-key"
Update evals.py:
import os
import google.generativeai as genai
from ragas.llms import llm_factory
genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
client = genai.GenerativeModel("gemini-2.0-flash")
llm = llm_factory("gemini-2.0-flash", provider="google", client=client)
For a local model served through an OpenAI-compatible endpoint (here, Ollama on its default port), update evals.py:
from openai import OpenAI
from ragas.llms import llm_factory
client = OpenAI(
    api_key="ollama",
    base_url="http://localhost:11434/v1"
)
llm = llm_factory("mistral", provider="openai", client=client)
Run the Evaluation#
uv run python evals.py
The evaluation will:
- Load test data from the load_dataset() function
- Query your RAG application with test questions
- Evaluate responses using custom metrics
- Display results in the console
- Save results to CSV in evals/experiments/
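The five steps above can be pictured with a minimal, self-contained sketch. It is not the template's code: `fake_rag`, `keyword_grade`, and `run_eval` are hypothetical stand-ins (a canned answer instead of a RAG call, substring matching instead of an LLM-based metric), and the file name `sketch.csv` is arbitrary.

```python
import csv
from pathlib import Path


def load_dataset():
    # Step 1: test data (hypothetical samples using the template's field names).
    return [
        {"question": "What is Ragas?", "grading_notes": "evaluation"},
        {"question": "How do experiments work?", "grading_notes": "compare runs"},
    ]


def fake_rag(question):
    # Step 2 stand-in: a real run would query the RAG application here.
    return "Ragas is an evaluation framework for LLM applications."


def keyword_grade(response, grading_notes):
    # Step 3 stand-in: the template uses an LLM-based metric instead.
    return "pass" if grading_notes.lower() in response.lower() else "fail"


def run_eval(out_dir):
    rows = []
    for sample in load_dataset():
        response = fake_rag(sample["question"])
        score = keyword_grade(response, sample["grading_notes"])
        rows.append({**sample, "response": response, "score": score})
        print(f"{sample['question']} -> {score}")  # Step 4: console output

    # Step 5: persist the results as a CSV file.
    out_path = Path(out_dir) / "sketch.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Calling `run_eval("evals/experiments")` would write one row per test case, with the input fields followed by the response and its score.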
Project Structure#
rag_eval/
├── README.md # Project documentation
├── pyproject.toml # Project configuration
├── rag.py # RAG application implementation
├── evals.py # Evaluation workflow
├── __init__.py # Python package marker
└── evals/
    ├── datasets/ # Test data files
    ├── experiments/ # Evaluation results (CSV)
    └── logs/ # Execution logs and traces
Understanding the Code#
The RAG Application (rag.py)#
A simple RAG implementation with:
- Document storage: In-memory document collection
- Keyword retrieval: Simple keyword matching for document retrieval
- Response generation: OpenAI API for generating answers
- Tracing: Logs each query for debugging
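To make the storage and retrieval bullets concrete, here is a rough sketch of in-memory storage with keyword matching. `KeywordStore` and `tokenize` are hypothetical names chosen for illustration; the actual classes in rag.py may be organized differently and may score documents another way.

```python
import re


def tokenize(text):
    # Lowercase the text and keep only alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KeywordStore:
    """Hypothetical in-memory store; not the class used in rag.py."""

    def __init__(self):
        self.documents = []

    def add(self, text):
        self.documents.append(text)

    def retrieve(self, query, top_k=2):
        # Score each document by the number of words it shares with the query.
        query_words = tokenize(query)
        scored = [(len(query_words & tokenize(doc)), doc) for doc in self.documents]
        # Keep only documents with at least one shared word, best match first.
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in matches[:top_k]]
```

The retrieved documents would then be passed to the LLM as context for the answer. Plain word overlap is easy to debug but ignores synonyms and word order, which is one reason an evaluation loop is useful for spotting retrieval failures.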
from openai import OpenAI
from rag import default_rag_client
# Initialize with an OpenAI client (reads OPENAI_API_KEY from the environment)
openai_client = OpenAI()
rag_client = default_rag_client(llm_client=openai_client, logdir="evals/logs")
# Query the RAG system
response = rag_client.query("What is Ragas?")
print(response["answer"])
The Evaluation Script (evals.py)#
The evaluation workflow:
- Dataset loading: Creates test cases with questions and grading notes
- Metric definition: Custom DiscreteMetric for pass/fail evaluation
- Experiment execution: Runs queries and evaluates responses
- Result storage: Saves to CSV for analysis
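A pass/fail metric is only useful if the judge's output is constrained to the allowed labels. The sketch below shows one way such a check could work. `parse_discrete` is a hypothetical helper for illustration, not part of the Ragas API, and it makes no claim about how DiscreteMetric validates output internally.

```python
def parse_discrete(raw_output, allowed_values):
    """Map raw judge text onto one of the allowed labels, or raise."""
    # Drop surrounding whitespace, quotes, and trailing periods, then lowercase.
    cleaned = raw_output.strip().strip("'\".").lower()
    for value in allowed_values:
        if cleaned == value.lower():
            return value
    raise ValueError(f"{raw_output!r} is not one of {allowed_values}")
```

Raising on unexpected output, rather than silently mapping it to "fail", keeps judge formatting problems from being counted as failures of the RAG system.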
from ragas import Dataset, experiment
from ragas.metrics import DiscreteMetric
# Define your metric
my_metric = DiscreteMetric(
    name="correctness",
    prompt="Check if the response contains points from grading notes...",
    allowed_values=["pass", "fail"],
)
# Run experiment
@experiment()
async def run_experiment(row):
    response = rag_client.query(row["question"])
    score = my_metric.score(llm=llm, response=response["answer"], ...)
    return {**row, "response": response["answer"], "score": score.value}
Customization#
Add Test Cases#
Edit the load_dataset() function in evals.py:
def load_dataset():
    dataset = Dataset(
        name="test_dataset",
        backend="local/csv",
        root_dir="evals",
    )
    data_samples = [
        {
            "question": "What is Ragas?",
            "grading_notes": "- evaluation framework - LLM applications",
        },
        {
            "question": "How do experiments work?",
            "grading_notes": "- track results - compare runs - store metrics",
        },
        # Add more test cases...
    ]
    for sample in data_samples:
        dataset.append(sample)
    dataset.save()
    return dataset
Modify the Metric#
Change evaluation criteria by updating the metric prompt:
my_metric = DiscreteMetric(
    name="quality",
    prompt="""Evaluate the response quality:
Response: {response}
Expected Points: {grading_notes}
Rate as:
- 'excellent': All points covered with clear explanation
- 'good': Most points covered
- 'poor': Missing key points
Rating:""",
    allowed_values=["excellent", "good", "poor"],
)
Add Multiple Metrics#
Create additional metrics for different evaluation aspects:
from ragas.metrics import DiscreteMetric, NumericMetric
correctness = DiscreteMetric(
    name="correctness",
    prompt="Is the response factually correct? {response}",
    allowed_values=["correct", "incorrect"],
)
relevance = NumericMetric(
    name="relevance",
    prompt="Rate relevance 1-5: {response} for question: {question}",
    allowed_values=(1, 5),
)
Use Your Own RAG System#
Replace the example RAG with your production system:
# In evals.py
from your_rag_module import YourRAGClient
rag_client = YourRAGClient(...)
@experiment()
async def run_experiment(row):
    # Call your RAG system
    response = await rag_client.query(row["question"])
    score = my_metric.score(
        llm=llm,
        response=response,
        grading_notes=row["grading_notes"],
    )
    return {
        **row,
        "response": response,
        "score": score.value,
    }
Viewing Results#
Results are saved to evals/experiments/ as CSV files. Each experiment run creates a new file with:
- Input data (questions, grading notes)
- Model responses
- Evaluation scores
- Timestamps
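Because each run creates a new file, a common first step is finding the most recent one. The sketch below does that with the standard library only and then computes a pass rate. The helper names `latest_csv` and `pass_rate` are hypothetical, and picking the file by modification time is an assumption, since the naming scheme of the experiment files is not described here. The `score` column and the `pass` label match the metric defined earlier.

```python
import csv
from pathlib import Path


def latest_csv(directory):
    # Pick the most recently modified CSV file, or None if there is none.
    files = sorted(Path(directory).glob("*.csv"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def pass_rate(csv_path, column="score", passing="pass"):
    # Fraction of rows whose score column equals the passing label.
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    return sum(row[column] == passing for row in rows) / len(rows)
```

For example, `pass_rate(latest_csv("evals/experiments"))` would report the share of passing rows in the newest run, provided at least one CSV file exists in that directory.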
import pandas as pd
# Load results
results = pd.read_csv("evals/experiments/your_experiment.csv")
# Calculate pass rate
pass_rate = (results["score"] == "pass").mean()
print(f"Pass rate: {pass_rate:.1%}")
Next Steps#
- Improve RAG Guide - Compare naive vs agentic RAG
- Custom Metrics - Write your own metrics
- Datasets - Learn about dataset management
- Experimentation - Advanced experiment tracking
Link last verified June 7, 2026.
Source: RAGAS Docs
Link last verified: 2026-03-04