Rubric-Based Evaluation

Rubric-based evaluation metrics allow you to evaluate LLM responses using custom scoring criteria. Ragas provides two types of rubric metrics:

  1. DomainSpecificRubrics: Uses the same rubric for all samples in a dataset (set at initialization)
  2. InstanceSpecificRubrics: Each sample can have its own unique rubric (passed per evaluation)

A rubric gives a description for each score, typically from 1 to 5. An LLM then evaluates the response and scores it according to these descriptions.
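Because a rubric is just a mapping from score to description, it can be worth checking its shape before spending LLM calls on it. The following is a minimal sketch, assuming the `score{n}_description` key format used in the examples on this page and a 1 to 5 range. `validate_rubric` is a hypothetical helper, not part of Ragas.

```python
import re

# Hypothetical helper (not part of Ragas): checks that a rubric dict has one
# non-empty description per score, using the "score{n}_description" key
# format that appears in this page's examples.
_KEY_PATTERN = re.compile(r"^score(\d+)_description$")


def validate_rubric(rubrics: dict, min_score: int = 1, max_score: int = 5) -> None:
    """Raise ValueError if a score is missing, out of range, or mis-named."""
    expected = set(range(min_score, max_score + 1))
    found = set()
    for key, description in rubrics.items():
        match = _KEY_PATTERN.match(key)
        if match is None:
            raise ValueError(f"Unexpected rubric key: {key!r}")
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"Empty description for {key!r}")
        found.add(int(match.group(1)))
    if found != expected:
        missing = sorted(expected - found)
        extra = sorted(found - expected)
        raise ValueError(f"Rubric scores mismatch: missing={missing}, extra={extra}")
```

Calling `validate_rubric(my_rubrics)` before constructing a metric surfaces a typo in a key name early, rather than as an odd score later.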

Domain-Specific Rubrics

Use DomainSpecificRubrics when you want to apply the same evaluation criteria across all samples. This is useful for domain-wide evaluations where the scoring criteria remain constant.

Example

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import DomainSpecificRubrics

# Setup
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

# Reference-free evaluation (default)
metric = DomainSpecificRubrics(llm=llm)
result = await metric.ascore(
    user_input="What's the longest river in the world?",
    response="The longest river in the world is the Nile, stretching approximately 6,650 kilometers through northeastern Africa.",
)
print(f"Score: {result.value}, Feedback: {result.reason}")

# Reference-based evaluation
metric_with_ref = DomainSpecificRubrics(llm=llm, with_reference=True)
result = await metric_with_ref.ascore(
    user_input="What's the longest river in the world?",
    response="The longest river in the world is the Nile.",
    reference="The Nile is a major north-flowing river in northeastern Africa.",
)
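The snippets above use top-level `await`, which works in a notebook but not in a plain Python script. One way to run the same call pattern from a script is to wrap it in a coroutine and hand it to `asyncio.run`. The sketch below uses `StubMetric` and `StubResult`, hypothetical stand-ins that make no LLM call, so that it runs without credentials. In real use you would construct `DomainSpecificRubrics(llm=llm)` as shown earlier.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    # Mirrors the two fields the examples on this page read: value and reason.
    value: int
    reason: str


class StubMetric:
    """Hypothetical stand-in for a rubric metric; it makes no LLM call."""

    async def ascore(self, user_input: str, response: str) -> StubResult:
        await asyncio.sleep(0)  # yield to the event loop, as a real async call would
        return StubResult(value=5, reason="stub: always returns the top score")


async def main() -> StubResult:
    metric = StubMetric()  # swap in DomainSpecificRubrics(llm=llm) in real use
    result = await metric.ascore(
        user_input="What's the longest river in the world?",
        response="The longest river in the world is the Nile.",
    )
    print(f"Score: {result.value}, Feedback: {result.reason}")
    return result


if __name__ == "__main__":
    asyncio.run(main())
```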

Custom Rubrics

You can define your own rubrics to customize the scoring criteria:

from ragas.metrics.collections import DomainSpecificRubrics

my_custom_rubrics = {
    "score1_description": "Answer and ground truth are completely different",
    "score2_description": "Answer and ground truth are somewhat different",
    "score3_description": "Answer and ground truth are somewhat similar",
    "score4_description": "Answer and ground truth are similar",
    "score5_description": "Answer and ground truth are exactly the same",
}

metric = DomainSpecificRubrics(llm=llm, rubrics=my_custom_rubrics, with_reference=True)
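If you prefer not to type the keys by hand, the same dictionary can be generated from an ordered list of descriptions. This is a small sketch: `build_rubric` is a hypothetical helper, not a Ragas function, and it assumes the `score{n}_description` key format shown above with descriptions ordered from lowest to highest score.

```python
def build_rubric(descriptions: list) -> dict:
    """Map descriptions, ordered from lowest to highest score, onto score keys."""
    if not descriptions:
        raise ValueError("At least one description is required")
    return {
        f"score{score}_description": text
        for score, text in enumerate(descriptions, start=1)
    }


# Produces the same dictionary as the hand-written example above.
my_custom_rubrics = build_rubric([
    "Answer and ground truth are completely different",
    "Answer and ground truth are somewhat different",
    "Answer and ground truth are somewhat similar",
    "Answer and ground truth are similar",
    "Answer and ground truth are exactly the same",
])
```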

With Retrieved Contexts

The metric also supports evaluation with retrieved contexts:

result = await metric.ascore(
    user_input="What's the longest river in the world?",
    response="Based on the context, the Nile is the longest river.",
    retrieved_contexts=[
        "Scientists debate whether the Amazon or the Nile is the longest river.",
        "The Nile River was central to Ancient Egyptians' wealth and power.",
    ],
)

Convenience Classes

For clearer intent, use the convenience classes:

from ragas.metrics.collections import (
    RubricsScoreWithoutReference,
    RubricsScoreWithReference,
)

# Reference-free
metric_no_ref = RubricsScoreWithoutReference(llm=llm)

# Reference-based
metric_with_ref = RubricsScoreWithReference(llm=llm)

Default Rubrics

Reference-Free Rubrics (Default)

Score  Description
1      The response is entirely incorrect and fails to address any aspect of the user input.
2      The response contains partial accuracy but includes major errors or significant omissions.
3      The response is mostly accurate but lacks clarity, thoroughness, or minor details.
4      The response is accurate and clear, with only minor omissions or slight inaccuracies.
5      The response is completely accurate, clear, and thoroughly addresses the user input.

Reference-Based Rubrics

Score  Description
1      The response is entirely incorrect, irrelevant, or does not align with the reference.
2      The response partially matches the reference but contains major errors or omissions.
3      The response aligns with the reference overall but lacks sufficient detail or clarity.
4      The response is mostly accurate, aligns closely with the reference with minor issues.
5      The response is fully accurate, completely aligns with the reference, clear and detailed.
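When only one or two levels need different wording, it may be easier to start from the defaults and override those levels. The sketch below transcribes the reference-free table into a dictionary and replaces the top level. `REFERENCE_FREE_DEFAULTS` is an illustrative constant written out here, not something imported from Ragas, and the `score{n}_description` keys are assumed from the custom rubric examples on this page. The replacement text for score 5 is made up for illustration.

```python
# Transcribed by hand from the reference-free table; illustrative only,
# not imported from Ragas.
REFERENCE_FREE_DEFAULTS = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the user input.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the user input.",
}

# Override only the level that should change; the other four are copied as-is.
stricter_rubrics = {
    **REFERENCE_FREE_DEFAULTS,
    "score5_description": "The response is completely accurate, clear, and cites its sources.",
}
```

The resulting `stricter_rubrics` can then be passed as the `rubrics` argument in the same way as `my_custom_rubrics` in the Custom Rubrics section.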

Instance-Specific Rubrics

Use InstanceSpecificRubrics when different samples require different evaluation criteria. This is useful when:

  • Different questions require different evaluation standards
  • You want to customize scoring based on specific task requirements
  • Evaluation criteria vary across your dataset

Example

from openai import AsyncOpenAI
from ragas.llms.base import llm_factory
from ragas.metrics.collections import InstanceSpecificRubrics

# Setup
client = AsyncOpenAI()
llm = llm_factory("gpt-4o-mini", client=client)

metric = InstanceSpecificRubrics(llm=llm)

# Each sample can have its own rubrics
email_rubrics = {
    "score1_description": "The email is unprofessional or inappropriate",
    "score2_description": "The email lacks proper formatting or tone",
    "score3_description": "The email is acceptable but could be improved",
    "score4_description": "The email is professional with minor issues",
    "score5_description": "The email is highly professional and well-written",
}

result = await metric.ascore(
    user_input="Write a professional email declining a meeting invitation",
    response="Dear John, Thank you for the invitation...",
    rubrics=email_rubrics,
)
print(f"Score: {result.value}, Feedback: {result.reason}")

# Different rubrics for a different type of task
code_rubrics = {
    "score1_description": "The code doesn't work or has critical bugs",
    "score2_description": "The code has significant issues or is poorly structured",
    "score3_description": "The code works but lacks optimization or best practices",
    "score4_description": "The code is good with minor improvements possible",
    "score5_description": "The code is excellent, efficient, and follows best practices",
}

result = await metric.ascore(
    user_input="Write a function to sort a list",
    response="def sort_list(arr): return sorted(arr)",
    rubrics=code_rubrics,
)
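With more than a handful of samples, passing rubrics by hand gets repetitive. One possible approach is to keep a registry of rubrics keyed by task type and build the keyword arguments for each `ascore` call from the sample. The sketch below assumes each sample carries a `task_type` field. That field, `RUBRICS_BY_TASK`, and `build_call_kwargs` are hypothetical names introduced for illustration, not part of Ragas.

```python
# Hypothetical registry keyed by task type; the rubric texts are the email
# and code rubrics from the example above.
RUBRICS_BY_TASK = {
    "email": {
        "score1_description": "The email is unprofessional or inappropriate",
        "score2_description": "The email lacks proper formatting or tone",
        "score3_description": "The email is acceptable but could be improved",
        "score4_description": "The email is professional with minor issues",
        "score5_description": "The email is highly professional and well-written",
    },
    "code": {
        "score1_description": "The code doesn't work or has critical bugs",
        "score2_description": "The code has significant issues or is poorly structured",
        "score3_description": "The code works but lacks optimization or best practices",
        "score4_description": "The code is good with minor improvements possible",
        "score5_description": "The code is excellent, efficient, and follows best practices",
    },
}


def build_call_kwargs(sample: dict) -> dict:
    """Return keyword arguments for one ascore call, with the matching rubric attached."""
    task_type = sample["task_type"]
    if task_type not in RUBRICS_BY_TASK:
        raise ValueError(f"No rubric registered for task type {task_type!r}")
    return {
        "user_input": sample["user_input"],
        "response": sample["response"],
        "rubrics": RUBRICS_BY_TASK[task_type],
    }


samples = [
    {
        "task_type": "email",
        "user_input": "Write a professional email declining a meeting invitation",
        "response": "Dear John, Thank you for the invitation...",
    },
    {
        "task_type": "code",
        "user_input": "Write a function to sort a list",
        "response": "def sort_list(arr): return sorted(arr)",
    },
]

calls = [build_call_kwargs(sample) for sample in samples]
# In real use, each entry would be passed on: await metric.ascore(**calls[0])
```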

With Reference and Contexts

result = await metric.ascore(
    user_input="Explain the water cycle",
    response="The water cycle involves evaporation, condensation, and precipitation.",
    reference="The water cycle describes how water evaporates from surfaces, rises into the atmosphere, condenses into clouds, and falls as precipitation.",
    retrieved_contexts=["Water cycle information from encyclopedia..."],
    rubrics={
        "score1_description": "Explanation is completely wrong",
        "score2_description": "Explanation has major inaccuracies",
        "score3_description": "Explanation is partially correct",
        "score4_description": "Explanation is mostly correct",
        "score5_description": "Explanation is comprehensive and accurate",
    },
)

Legacy API

Deprecated

The legacy API below is deprecated. Please use ragas.metrics.collections.DomainSpecificRubrics or ragas.metrics.collections.InstanceSpecificRubrics instead.

from ragas import evaluate
from datasets import Dataset

from ragas.metrics import rubrics_score_without_reference, rubrics_score_with_reference

rows = {
    "question": [
        "What's the longest river in the world?",
    ],
    "ground_truth": [
        "The Nile is a major north-flowing river in northeastern Africa.",
    ],
    "answer": [
        "The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa.",
    ],
    "contexts": [
        [
            "Scientists debate whether the Amazon or the Nile is the longest river in the world.",
            "The Nile River was central to the Ancient Egyptians' rise to wealth and power.",
        ],
    ]
}

dataset = Dataset.from_dict(rows)

result = evaluate(
    dataset,
    metrics=[
        rubrics_score_without_reference,
        rubrics_score_with_reference
    ],
)
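When moving off the legacy API, the column-oriented `rows` dictionary has to become one set of keyword arguments per sample. The sketch below shows one way to do that conversion. The name correspondence in `LEGACY_TO_NEW` is an assumption inferred from the examples on this page (`question` to `user_input`, `answer` to `response`, `ground_truth` to `reference`, `contexts` to `retrieved_contexts`), and `legacy_rows_to_samples` is a hypothetical helper, not a Ragas function.

```python
# Assumed correspondence between legacy column names and the keyword
# arguments used in the newer examples on this page.
LEGACY_TO_NEW = {
    "question": "user_input",
    "answer": "response",
    "ground_truth": "reference",
    "contexts": "retrieved_contexts",
}


def legacy_rows_to_samples(rows: dict) -> list:
    """Turn column-oriented legacy rows into one kwargs dict per sample."""
    unknown = set(rows) - set(LEGACY_TO_NEW)
    if unknown:
        raise ValueError(f"Unknown legacy columns: {sorted(unknown)}")
    lengths = {len(column) for column in rows.values()}
    if len(lengths) > 1:
        raise ValueError("All columns must have the same length")
    count = lengths.pop() if lengths else 0
    return [
        {LEGACY_TO_NEW[name]: column[i] for name, column in rows.items()}
        for i in range(count)
    ]


rows = {
    "question": ["What's the longest river in the world?"],
    "ground_truth": ["The Nile is a major north-flowing river in northeastern Africa."],
    "answer": ["The longest river in the world is the Nile."],
    "contexts": [["Scientists debate whether the Amazon or the Nile is the longest river in the world."]],
}

samples = legacy_rows_to_samples(rows)
# In real use, each entry would be passed on: await metric.ascore(**samples[0])
```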

Custom rubrics with the legacy API:

from ragas.metrics._domain_specific_rubrics import RubricsScore

my_custom_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}

rubrics_score = RubricsScore(rubrics=my_custom_rubrics)

Link last verified June 7, 2026.
Source: RAGAS Docs
Link last verified: 2026-03-04