Judge Alignment Quickstart ↗

Original Documentation

The judge_alignment template measures how well an LLM-as-judge aligns with human evaluation standards.

Create the Project#

ragas quickstart judge_alignment
cd judge_alignment

Install Dependencies#

uv sync

Set Your API Key#

export OPENAI_API_KEY="your-openai-key"

Run the Evaluation#

uv run python evals.py

Project Structure#

judge_alignment/
├── README.md              # Project documentation
├── pyproject.toml         # Project configuration
├── evals.py               # Evaluation workflow
├── __init__.py            # Python package marker
└── evals/
    ├── datasets/          # Test datasets
    ├── experiments/       # Evaluation results
    └── logs/              # Execution logs

What It Evaluates#

The template evaluates LLM judge alignment:

Scenario: Pre-existing responses are evaluated by an LLM judge
Human Labels: Ground truth pass/fail labels
LLM Judge: Evaluates same responses with grading criteria
Alignment Metric: Agreement between human and LLM judgments

Understanding the Code#

Judge Metrics (`evals.py`)#

Two judge implementations to compare:

# Baseline judge (simple prompt)
accuracy_metric = DiscreteMetric(
    name="accuracy",
    prompt="Check if response contains points from grading notes...",
    allowed_values=["pass", "fail"],
)

# Improved judge (enhanced with abbreviation guide)
accuracy_metric_v2 = DiscreteMetric(
    name="accuracy",
    prompt="""Evaluate if response covers ALL key concepts...

    ABBREVIATION GUIDE:
    • Financial: val=valuation, post-$=post-money, rev=revenue...
    • Business: mkt=market, reg=regulation...
    """,
    allowed_values=["pass", "fail"],
)

The Evaluation#

Tests alignment with human judgment:

@discrete_metric(name="alignment", allowed_values=["aligned", "misaligned"])
def alignment_metric(llm_judgment: str, human_judgment: str):
    # Compares LLM judge output with human label
    return "aligned" if llm_judgment == human_judgment else "misaligned"

Test Data#

The dataset includes:

Pre-evaluated responses
Human pass/fail labels
Grading notes with expected points
Various abbreviations and business terminology

Use Cases#

Compare Judge Versions#

Run experiments with both judges:

# Test baseline judge
results_v1 = await run_with_judge(accuracy_metric)

# Test improved judge
results_v2 = await run_with_judge(accuracy_metric_v2)

# Compare alignment rates

Improve Judge Quality#

Iterate on judge prompts to improve alignment:

Identify misalignment patterns
Update judge prompt with clearer criteria
Re-evaluate alignment
Repeat until satisfactory

Next Steps#

Prompt Evaluation - Compare different prompts
LLM Benchmarking - Compare different models

Link last verified June 7, 2026. View original ↗

Source: RAGAS Docs

Link last verified: 2026-03-04