AG-UI ↗

Original Documentation

AG-UI is an event-based protocol for streaming agent updates to user interfaces. The protocol standardizes message, tool-call, and state events, which makes it easy to plug different agent runtimes into visual frontends. The ragas.integrations.ag_ui module helps you transform those event streams into Ragas message objects and run experiments against live AG-UI endpoints using the modern @experiment decorator pattern.

This guide assumes you already have an AG-UI compatible agent running (for example, one built with Google ADK, PydanticAI, or CrewAI) and that you are familiar with creating datasets in Ragas.

Install the integration#

The AG-UI helpers live behind an optional extra. Install it together with the dependencies required by your evaluator LLM. When running inside Jupyter or IPython, include nest_asyncio so you can reuse the notebook’s event loop.

pip install "ragas[ag-ui]" python-dotenv nest_asyncio

Configure your evaluator LLM credentials. For example, if you are using OpenAI models:

# .env
OPENAI_API_KEY=sk-...

Load the environment variables inside Python before running the examples:

from dotenv import load_dotenv
import nest_asyncio

load_dotenv()

# If you're inside Jupyter/IPython, patch the running event loop once.
nest_asyncio.apply()

Build an experiment dataset#

Dataset can contain single-turn or multi-turn samples. With AG-UI you can test either pattern—single questions with free-form responses, or longer conversations that include tool calls.

Single-turn samples#

Use Dataset.from_pandas() with user_input and reference columns when you only need to grade the final answer text.

import pandas as pd
from ragas.dataset import Dataset

scientist_questions = Dataset.from_pandas(
    pd.DataFrame([
        {
            "user_input": "Who originated the theory of relativity?",
            "reference": "Albert Einstein originated the theory of relativity.",
        },
        {
            "user_input": "Who discovered penicillin and when?",
            "reference": "Alexander Fleming discovered penicillin in 1928.",
        },
    ]),
    name="scientist_questions",
    backend="inmemory",
)

Multi-turn samples with tool expectations#

When you want to grade intermediate agent behavior—like whether it calls tools correctly and achieves the user’s goal—use conversation lists as user_input. Provide expected tool calls as JSON and optionally a reference outcome for goal accuracy evaluation.

import json
import pandas as pd
from ragas.dataset import Dataset
from ragas.messages import HumanMessage

weather_queries = Dataset.from_pandas(
    pd.DataFrame([
        {
            "user_input": [HumanMessage(content="What's the weather in Paris?")],
            "reference_tool_calls": json.dumps([
                {"name": "get_weather", "args": {"location": "Paris"}}
            ]),
            # Expected outcome for AgentGoalAccuracyWithReference
            "reference": "The user received the current weather conditions for Paris.",
        },
        {
            "user_input": [HumanMessage(content="Is it raining in London right now?")],
            "reference_tool_calls": json.dumps([
                {"name": "get_weather", "args": {"location": "London"}}
            ]),
            "reference": "The user received the current weather conditions for London.",
        },
    ]),
    name="weather_queries",
    backend="inmemory",
)

Loading from CSV#

For larger datasets, store your test cases in CSV files and load them with the Dataset API:

from ragas.dataset import Dataset

dataset = Dataset.load(
    name="scientist_biographies",
    backend="local/csv",
    root_dir="./test_data",
)

Choose metrics and evaluator model#

The integration works with any Ragas metric. To unlock the modern collections portfolio (and mix in custom checks), build an Instructor-compatible LLM for the evaluator prompts and use a synchronous OpenAI client for embeddings.

from openai import AsyncOpenAI, OpenAI
from ragas.llms import llm_factory
from ragas.embeddings import embedding_factory
from ragas.metrics import DiscreteMetric
from ragas.metrics.collections import (
    AgentGoalAccuracyWithReference,
    AnswerRelevancy,
    FactualCorrectness,
    ToolCallF1,
)

async_llm_client = AsyncOpenAI()
evaluator_llm = llm_factory("gpt-4o-mini", client=async_llm_client)

# AnswerRelevancy's embeddings still run synchronously, so pair it with a sync client.
embedding_client = OpenAI()
evaluator_embeddings = embedding_factory(
    "openai", model="text-embedding-3-small", client=embedding_client, interface="modern"
)

conciseness_metric = DiscreteMetric(
    name="conciseness",
    allowed_values=["verbose", "concise"],
    prompt=(
        "Is the response concise and efficiently conveys information?\n\n"
        "Response: {response}\n\n"
        "Answer with only 'verbose' or 'concise'."
    ),
)

# Metrics for single-turn Q&A evaluation
qa_metrics = [
    FactualCorrectness(
        llm=evaluator_llm, mode="f1", atomicity="high", coverage="high"
    ),
    AnswerRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings, strictness=2),
    conciseness_metric,
]

# Metrics for multi-turn agent evaluation
# - ToolCallF1: Rule-based metric for tool call accuracy
# - AgentGoalAccuracyWithReference: LLM-based metric for goal achievement
tool_metrics = [
    ToolCallF1(),
    AgentGoalAccuracyWithReference(llm=evaluator_llm),
]

Run experiments with @experiment#

The AG-UI integration provides run_ag_ui_row() to call your endpoint and enrich each row with the agent’s response. Combine this with the @experiment decorator to build evaluation pipelines.

⚠️ The endpoint must expose the AG-UI SSE stream. Common paths include /chat, /agent, or /agentic_chat.

Basic single-turn evaluation#

In Jupyter or IPython, use top-level await (after nest_asyncio.apply()) instead of asyncio.run to avoid the “event loop is already running” error. For scripts you can keep asyncio.run.

from ragas import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.metrics.collections import FactualCorrectness

@experiment()
async def factual_experiment(row):
    # Call AG-UI endpoint and get enriched row
    enriched = await run_ag_ui_row(row, "http://localhost:8000/chat")

    # Score with metrics
    score = await FactualCorrectness(llm=evaluator_llm).ascore(
        response=enriched["response"],
        reference=row["reference"],
    )

    return {**enriched, "factual_correctness": score.value}

# Run the experiment against the dataset
# In Jupyter/IPython (after calling nest_asyncio.apply())
factual_result = await factual_experiment.arun(
    scientist_questions,
    name="scientist_qa_eval"
)

# In a standalone script, use:
# factual_result = asyncio.run(factual_experiment.arun(scientist_questions, name="scientist_qa_eval"))

factual_result.to_pandas()

The resulting dataframe includes per-sample scores, raw agent responses, and any retrieved contexts (tool results). Results are automatically saved by the experiment framework, and you can export to CSV through pandas.

Multi-turn tool evaluation#

For multi-turn datasets and tool evaluation, pass the messages and reference tool calls directly to the metrics:

import json
from ragas import experiment
from ragas.integrations.ag_ui import run_ag_ui_row
from ragas.messages import ToolCall
from ragas.metrics.collections import AgentGoalAccuracyWithReference, ToolCallF1

@experiment()
async def tool_experiment(row):
    # Call AG-UI endpoint and get enriched row
    enriched = await run_ag_ui_row(row, "http://localhost:8000/chat")

    # Parse reference_tool_calls from JSON string (e.g., from CSV)
    ref_tool_calls_raw = row.get("reference_tool_calls")
    if isinstance(ref_tool_calls_raw, str):
        ref_tool_calls = [ToolCall(**tc) for tc in json.loads(ref_tool_calls_raw)]
    else:
        ref_tool_calls = ref_tool_calls_raw or []

    # Score with tool metrics using the modern collections API
    f1_result = await ToolCallF1().ascore(
        user_input=enriched["messages"],
        reference_tool_calls=ref_tool_calls,
    )
    goal_result = await AgentGoalAccuracyWithReference(llm=evaluator_llm).ascore(
        user_input=enriched["messages"],
        reference=row.get("reference", ""),
    )

    return {
        **enriched,
        "tool_call_f1": f1_result.value,
        "agent_goal_accuracy": goal_result.value,
    }

# Run the experiment
# In Jupyter/IPython
tool_result = await tool_experiment.arun(
    weather_queries,
    name="weather_tool_eval"
)

# Or in a script
# tool_result = asyncio.run(tool_experiment.arun(weather_queries, name="weather_tool_eval"))

tool_result.to_pandas()

If a request fails, the experiment logs the error and returns placeholder values for that sample so the experiment can continue with remaining samples.

Working directly with AG-UI events#

Sometimes you may want to collect event logs separately—perhaps from a recorded run or a staging environment—and evaluate them offline. The conversion helpers expose the same parsing logic used by run_ag_ui_row().

from ragas.integrations.ag_ui import convert_to_ragas_messages
from ag_ui.core import TextMessageChunkEvent

events = [
    TextMessageChunkEvent(
        message_id="assistant-1",
        role="assistant",
        delta="Hello from AG-UI!",
        timestamp="2024-12-01T00:00:00Z",
    )
]

ragas_messages = convert_to_ragas_messages(events, metadata=True)

If you already have a MessagesSnapshotEvent you can skip streaming reconstruction and call convert_messages_snapshot.

from ragas.integrations.ag_ui import convert_messages_snapshot
from ag_ui.core import MessagesSnapshotEvent, UserMessage, AssistantMessage

snapshot = MessagesSnapshotEvent(
    messages=[
        UserMessage(id="msg-1", content="Hello?"),
        AssistantMessage(id="msg-2", content="Hi! How can I help you today?"),
    ]
)

ragas_messages = convert_messages_snapshot(snapshot)

The converted messages can be used to build custom evaluation workflows or passed directly to metric scoring functions.

Extraction helpers#

The integration provides helper functions to extract specific data from messages:

from ragas.integrations.ag_ui import (
    extract_response,    # Get concatenated AI response text
    extract_tool_calls,  # Get all tool calls from AI messages
    extract_contexts,    # Get tool results/contexts
)

messages = convert_to_ragas_messages(events)

response = extract_response(messages)      # "Hello! The weather is sunny."
tool_calls = extract_tool_calls(messages)  # [ToolCall(name="get_weather", args={"location": "SF"})]
contexts = extract_contexts(messages)      # ["Sunny, 72F in San Francisco"]

Tips for production experiments#

Custom headers: pass authentication tokens or tenant IDs via extra_headers parameter to run_ag_ui_row().
Timeouts: tune the timeout parameter if your agent performs long-running tool calls.
Metadata debugging: set metadata=True to keep AG-UI run, thread, and message IDs on every message for easier traceability.
Experiment naming: use descriptive name arguments to .arun() for easy identification of results.

For a complete production example, see examples/ragas_examples/ag_ui_agent_experiments/experiments.py which provides:

CLI arguments for endpoint configuration
CSV-based test datasets
Proper logging and error handling
Timestamped result output

An interactive walkthrough notebook is also available at howtos/integrations/ag_ui.ipynb.

API Reference#

Primary API#

run_ag_ui_row(row, endpoint_url, ...) - Run a single row against an AG-UI endpoint and return enriched data with response, messages, tool_calls, and contexts.

Conversion Functions#

convert_to_ragas_messages(events, metadata=False) - Convert AG-UI event sequences to Ragas messages
convert_messages_snapshot(snapshot, metadata=False) - Convert AG-UI message snapshots to Ragas messages
convert_messages_to_ag_ui(messages) - Convert Ragas messages to AG-UI format

Extraction Helpers#

extract_response(messages) - Extract concatenated AI response text
extract_tool_calls(messages) - Extract all tool calls from AI messages
extract_contexts(messages) - Extract tool results/contexts from messages

Low-Level#

call_ag_ui_endpoint(endpoint_url, user_input, ...) - Call an AG-UI endpoint and collect streaming events
AGUIEventCollector - Collect and reconstruct messages from streaming events

Link last verified June 7, 2026. View original ↗

Source: RAGAS Docs

Link last verified: 2026-03-04