Evaluate a simple RAG system

Editorial Notes

This is the practical RAG evaluation walkthrough, and the page most teams should run first when they need to measure a retrieval pipeline rather than guess at it. Pay attention to the distinction between retrieval metrics (context precision and context recall) and generation metrics (faithfulness and answer relevancy): a RAG system can fail at either stage, and the fix is different for each. A common mistake is optimizing answer quality while ignoring context recall, which leaves the model fluent but ungrounded. If you are new to RAGAS, read the simple-evals page first.


Original Documentation

In this tutorial, we will write a simple pipeline to evaluate a RAG (Retrieval-Augmented Generation) system. By the end of this tutorial, you will know how to evaluate and iterate on a RAG system using evaluation-driven development.

flowchart LR
    A["Query<br/>'What is Ragas 0.3?'"] --> B[Retrieval System]

    C[Document Corpus<br/> Ragas 0.3 Docs📄] --> B

    B --> D[LLM + Prompt]
    A --> D

    D --> E[Final Answer]

We will start by writing a simple RAG system that retrieves relevant documents from a corpus and generates an answer using an LLM.

python -m ragas_examples.rag_eval.rag
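To make the retrieve-then-generate flow concrete, here is a minimal sketch of a toy RAG client. It is not the code in `ragas_examples`: `ToyRAG`, `tokenize`, and the keyword-overlap ranking are hypothetical stand-ins, and the "generation" step simply echoes the retrieved text where a real system would call an LLM. The only detail taken from this tutorial is that `query` returns a dictionary with `answer` and `logs` keys, as the experiment loop expects.

```python
# Toy stand-in for a RAG client; every name here is hypothetical.
from dataclasses import dataclass, field


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()} - {""}


@dataclass
class ToyRAG:
    corpus: list[str]
    top_k: int = 1
    logs: list[str] = field(default_factory=list)

    def retrieve(self, query: str) -> list[str]:
        """Rank documents by how many tokens they share with the query."""
        query_tokens = tokenize(query)
        ranked = sorted(
            self.corpus,
            key=lambda doc: len(query_tokens & tokenize(doc)),
            reverse=True,
        )
        return ranked[: self.top_k]

    def generate(self, query: str, context: list[str]) -> str:
        """Placeholder for the LLM call: echoes the retrieved context."""
        return " ".join(context)

    def query(self, question: str) -> dict:
        """Run retrieval, then generation, and record a log line."""
        context = self.retrieve(question)
        self.logs.append(f"retrieved {len(context)} document(s) for {question!r}")
        return {"answer": self.generate(question, context), "logs": self.logs[-1]}


rag = ToyRAG(corpus=[
    "Ragas 0.3 is a library for evaluating LLM applications.",
    "Install Ragas from pip or from source.",
])
result = rag.query("How to install Ragas?")
print(result["answer"])
```

Keeping retrieval and generation as separate methods means each stage can be swapped or inspected on its own when an evaluation fails.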

Next, we will write down a few sample queries and the expected outputs for our RAG system, then save them to a CSV file.

import os

import pandas as pd

samples = [
    {"query": "What is Ragas 0.3?", "grading_notes": "- Ragas 0.3 is a library for evaluating LLM applications."},
    {"query": "How to install Ragas?", "grading_notes": "- install from source  - install from pip using ragas[examples]"},
    {"query": "What are the main features of Ragas?", "grading_notes": "organised around - experiments - datasets - metrics."}
]
# Create the output directory first; to_csv does not create missing folders.
os.makedirs("datasets", exist_ok=True)
pd.DataFrame(samples).to_csv("datasets/test_dataset.csv", index=False)
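Before running an evaluation, it can help to check that the dataset file has the columns the experiment reads. The sketch below uses only the standard library; `load_test_dataset` and `REQUIRED_COLUMNS` are hypothetical helpers, not part of Ragas, and the demonstration writes to a temporary file rather than `datasets/test_dataset.csv` so that it runs anywhere.

```python
import csv
import tempfile
from pathlib import Path

# Columns the experiment loop in this tutorial reads from each row.
REQUIRED_COLUMNS = {"query", "grading_notes"}


def load_test_dataset(path: Path) -> list[dict]:
    """Read the CSV and fail early if a column or value is missing."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        rows = list(reader)
    for index, row in enumerate(rows):
        for column in sorted(REQUIRED_COLUMNS):
            if not (row[column] or "").strip():
                raise ValueError(f"row {index} has an empty {column!r}")
    return rows


# Demonstration on a temporary file (hypothetical location).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "test_dataset.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["query", "grading_notes"])
        writer.writeheader()
        writer.writerow({
            "query": "What is Ragas 0.3?",
            "grading_notes": "- Ragas 0.3 is a library for evaluating LLM applications.",
        })
    dataset_rows = load_test_dataset(csv_path)

print(len(dataset_rows), "row(s) loaded")
```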

To evaluate the performance of our RAG system, we will define an LLM-based metric that compares the output of our RAG system with the grading notes and returns pass or fail.

from ragas.metrics import DiscreteMetric
my_metric = DiscreteMetric(
    name="correctness",
    prompt = "Check if the response contains points mentioned from the grading notes and return 'pass' or 'fail'.\nResponse: {response} Grading Notes: {grading_notes}",
    allowed_values=["pass", "fail"],
)
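The prompt above contains two placeholders, `{response}` and `{grading_notes}`, and the metric restricts the verdict to a fixed set of values. The sketch below illustrates that idea with plain string formatting and a small validator. It is an assumption about the general pattern, not a description of how Ragas implements `DiscreteMetric` internally; `render_prompt` and `parse_verdict` are hypothetical names.

```python
PROMPT_TEMPLATE = (
    "Check if the response contains points mentioned from the grading notes "
    "and return 'pass' or 'fail'.\n"
    "Response: {response} Grading Notes: {grading_notes}"
)
ALLOWED_VALUES = ["pass", "fail"]


def render_prompt(response: str, grading_notes: str) -> str:
    """Fill the two placeholders with the values for one test row."""
    return PROMPT_TEMPLATE.format(response=response, grading_notes=grading_notes)


def parse_verdict(raw: str) -> str:
    """Normalize the judge's raw text and reject anything outside the allowed set."""
    verdict = raw.strip().lower()
    if verdict not in ALLOWED_VALUES:
        raise ValueError(f"unexpected verdict: {raw!r}")
    return verdict


rendered = render_prompt(
    response="Ragas can be installed with pip.",
    grading_notes="- install from source  - install from pip using ragas[examples]",
)
print(rendered)
print(parse_verdict(" Pass\n"))
```

Constraining the output to a closed set keeps scores comparable across runs, since free-form judge text would be hard to aggregate.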

Next, we will write the experiment loop, which runs our RAG system on each row of the test dataset, scores the response with the metric, and stores the results in a CSV file.

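# Assumes `experiment`, `rag_client`, `llm`, and `my_metric` are already
# imported or defined elsewhere in the example module.
@experiment()
async def run_experiment(row):
    # Run the RAG system on the query from this dataset row.
    response = rag_client.query(row["query"])

    # Grade the answer against the row's grading notes (pass/fail).
    score = my_metric.score(
        llm=llm,
        response=response.get("answer", " "),
        grading_notes=row["grading_notes"]
    )

    # One result record per row: the inputs plus the answer, score, and log path.
    experiment_view = {
        **row,
        "response": response.get("answer", ""),
        "score": score.value,
        "log_file": response.get("logs", " "),
    }
    return experiment_view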
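Each call to the experiment function returns one record with a `score` of pass or fail, so a run can be summarized as a pass rate. The sketch below shows one way to do that in plain Python; `summarize` is a hypothetical helper and the rows are made-up illustrative data, not real evaluation results.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Count pass/fail scores and compute the pass rate for one run."""
    counts = Counter(row["score"] for row in results)
    total = len(results)
    return {
        "total": total,
        "passed": counts["pass"],
        "failed": counts["fail"],
        "pass_rate": counts["pass"] / total if total else 0.0,
    }


# Illustrative rows shaped like the records returned by the experiment loop.
example_results = [
    {"query": "What is Ragas 0.3?", "score": "pass"},
    {"query": "How to install Ragas?", "score": "fail"},
    {"query": "What are the main features of Ragas?", "score": "pass"},
]
summary = summarize(example_results)
print(summary)
```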

Now, whenever you make a change to your RAG pipeline, you can rerun the experiment and see how the change affects the performance of your RAG system.
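One way to see the effect of a change is to compare per-query scores between a baseline run and a new run. The sketch below assumes each run is a list of records with `query` and `score` fields, as in the experiment loop; `diff_runs` is a hypothetical helper and the data is made up for illustration.

```python
def diff_runs(baseline: list[dict], candidate: list[dict]) -> dict:
    """List queries whose score changed between two runs."""
    before_scores = {row["query"]: row["score"] for row in baseline}
    changes = {"regressed": [], "fixed": []}
    for row in candidate:
        before = before_scores.get(row["query"])
        after = row["score"]
        if before == "pass" and after == "fail":
            changes["regressed"].append(row["query"])
        elif before == "fail" and after == "pass":
            changes["fixed"].append(row["query"])
    return changes


# Illustrative data: the same two queries scored before and after a change.
baseline_run = [
    {"query": "What is Ragas 0.3?", "score": "pass"},
    {"query": "How to install Ragas?", "score": "fail"},
]
candidate_run = [
    {"query": "What is Ragas 0.3?", "score": "fail"},
    {"query": "How to install Ragas?", "score": "pass"},
]
changes = diff_runs(baseline_run, candidate_run)
print(changes)
```

Looking at regressions per query, and not only at the overall pass rate, helps catch a change that fixes one case while breaking another.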

Running the example end to end

  1. Set up your OpenAI API key

    export OPENAI_API_KEY="your_openai_api_key"
  2. Run the evaluation

    python -m ragas_examples.rag_eval.evals

Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the experiments/experiment_name.csv file.

Link last verified June 7, 2026.
Source: RAGAS Docs
Link last verified: 2026-03-04