Evaluate a prompt ↗
noOriginal Documentation
In this tutorial, we will write a simple evaluation pipeline to evaluate a prompt that is part of an AI system, here a movie review sentiment classifier. At the end of this tutorial you’ll learn how to evaluate and iterate on a single prompt using evaluation driven development.
flowchart LR
A["'This movie was amazing!<br/>Great acting and plot.'"] --> B["Classifier Prompt"]
B --> C["Positive"]We will start by testing a simple prompt that classifies movie reviews as positive or negative.
First, make sure you have installed ragas examples and setup your OpenAI API key:
pip install ragas[examples]
export OPENAI_API_KEY = "your_openai_api_key"Now test the prompt:
python -m ragas_examples.prompt_evals.promptThis will test the input "The movie was fantastic and I loved every moment of it!" and should output "positive".
💡 Quick Start: If you want to see the complete evaluation in action, you can jump straight to the end-to-end command that runs everything and generates the CSV results automatically.
Next, we will write down few sample inputs and expected outputs for our prompt. Then convert them to a CSV file.
import pandas as pd
samples = [{"text": "I loved the movie! It was fantastic.", "label": "positive"},
{"text": "The movie was terrible and boring.", "label": "negative"},
{"text": "It was an average film, nothing special.", "label": "positive"},
{"text": "Absolutely amazing! Best movie of the year.", "label": "positive"}]
pd.DataFrame(samples).to_csv("datasets/test_dataset.csv", index=False)Now we need to have a way to measure the performance of our prompt in this task. We will define a metric that will compare the output of our prompt with the expected output and outputs pass/fail based on it.
from ragas.metrics import discrete_metric
from ragas.metrics.result import MetricResult
@discrete_metric(name="accuracy", allowed_values=["pass", "fail"])
def my_metric(prediction: str, actual: str):
"""Calculate accuracy of the prediction."""
return MetricResult(value="pass", reason="") if prediction == actual else MetricResult(value="fail", reason="")Next, we will write the experiment loop that will run our prompt on the test dataset and evaluate it using the metric, and store the results in a csv file.
from ragas import experiment
@experiment()
async def run_experiment(row):
response = run_prompt(row["text"])
score = my_metric.score(
prediction=response,
actual=row["label"]
)
experiment_view = {
**row,
"response":response,
"score":score.value,
}
return experiment_viewNow whenever you make a change to your prompt, you can run the experiment and see how it affects the performance of your prompt.
Passing Additional Parameters#
You can pass additional parameters like models or configurations to your experiment function:
@experiment()
async def run_experiment(row, model):
response = run_prompt(row["text"], model=model)
score = my_metric.score(
prediction=response,
actual=row["label"]
)
experiment_view = {
**row,
"response": response,
"score": score.value,
}
return experiment_view
# Run with specific parameters
run_experiment.arun(dataset, "gpt-4")
# Or use keyword arguments
run_experiment.arun(dataset, model="gpt-4o")Running the example end to end#
Setup your OpenAI API key
export OPENAI_API_KEY = "your_openai_api_key"Run the evaluation
python -m ragas_examples.prompt_evals.evals
This will:
- Create the test dataset with sample movie reviews
- Run the sentiment classification prompt on each sample
- Evaluate the results using the accuracy metric
- Export everything to a CSV file with the results
Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the experiments/experiment_name.csv file.