LlamaIndex Agent Evaluation Quickstart
The llamaIndex_agent_evals template evaluates LlamaIndex workflow agents with tool call accuracy metrics.
Create the Project#
ragas quickstart llamaIndex_agent_evals
cd llamaIndex_agent_evals
Install Dependencies#
uv sync
Set Your API Keys#
export OPENAI_API_KEY="your-openai-key"
export GOOGLE_API_KEY="your-google-key" # For evaluator LLM
Run the Evaluation#
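Before running the evaluation, it can help to confirm that both keys from the export commands are actually set in the current shell. The following is a minimal sketch, not part of the template: the helper name `missing_api_keys` is hypothetical, and only the two variable names come from the commands above.

```python
import os

# Key names taken from the export commands in "Set Your API Keys".
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    # Fall back to the real process environment when no mapping is passed.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
example_missing = missing_api_keys({"OPENAI_API_KEY": "your-openai-key"})
print(example_missing)  # ['GOOGLE_API_KEY']
```

Calling `missing_api_keys()` with no argument checks the real environment. An empty list means both keys are present.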
uv run python evals.py
Project Structure#
llamaIndex_agent_evals/
├── README.md # Project documentation
├── pyproject.toml # Project configuration
├── llamaindex_agent.py # LlamaIndex agent with tools
├── evals.py # Evaluation workflow
├── __init__.py # Python package marker
└── evals/
    ├── datasets/
    │   └── contexts/ # Test context files (JSON)
    ├── experiments/ # Evaluation results
    └── logs/ # Execution logs
What It Evaluates#
The template evaluates a LlamaIndex agent’s tool calling accuracy:
- Agent: LlamaIndex FunctionAgent with list management tools (add, remove, list items)
- Test Cases: Complex scenarios such as duplicate additions and ambiguous removal requests
- Metrics: Tool call accuracy, response correctness
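To make the tool call accuracy metric concrete: the template's evals.py scores it as an F1 between predicted and ground-truth tool calls. The sketch below shows one way such a score could be computed. It is not the template's implementation. It assumes each call is a JSON-serializable dict (the `name` and `args` keys are hypothetical), that calls match only when they are exactly equal, that order is ignored, and that two empty lists score 1.0.

```python
import json
from collections import Counter
from typing import Dict, List


def _call_key(call: Dict) -> str:
    # Canonical string so that dict key order does not affect matching.
    return json.dumps(call, sort_keys=True, default=str)


def tool_call_f1(predicted_calls: List[Dict], ground_truth_calls: List[Dict]) -> float:
    """F1 between two lists of tool calls, using exact, order-independent matching."""
    if not predicted_calls and not ground_truth_calls:
        return 1.0  # Assumption: nothing expected and nothing called is a perfect score.
    predicted = Counter(_call_key(c) for c in predicted_calls)
    expected = Counter(_call_key(c) for c in ground_truth_calls)
    # Multiset intersection counts each matched call at most once per occurrence.
    true_positives = sum((predicted & expected).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_calls)
    recall = true_positives / len(ground_truth_calls)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent adds "milk" (expected) and "eggs" (not expected).
predicted = [
    {"name": "add_item", "args": {"item": "milk"}},
    {"name": "add_item", "args": {"item": "eggs"}},
]
ground_truth = [{"name": "add_item", "args": {"item": "milk"}}]
print(round(tool_call_f1(predicted, ground_truth), 3))  # 0.667
```

In the example, precision is 1/2 and recall is 1/1, so F1 = 2 · 0.5 · 1.0 / 1.5 ≈ 0.667. The extra call lowers the score even though the expected call was made.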
Understanding the Code#
The Agent (llamaindex_agent.py)#
LlamaIndex agent with simple tools:
from llama_index.core.agent.workflow import FunctionAgent
agent = FunctionAgent(
    name="list_manager",
    tools=[add_item, remove_item, list_items],
    llm=llm
)
The Evaluation (evals.py)#
Tests tool call accuracy using F1 score:
@numeric_metric(name="tool_call_accuracy")
def tool_call_accuracy_metric(predicted_calls: List[Dict], ground_truth_calls: List[Dict]):
    # Compares predicted vs ground truth tool calls
    # Returns F1 score between 0.0 and 1.0
    ...  # body omitted in this excerpt
Test Data#
The template includes JSON test contexts in evals/datasets/contexts/:
- ambiguous_removal_request.json: Tests handling of ambiguous requests
- duplicate_addition.json: Tests handling of duplicate operations
- repeated_removal.json: Tests repeated operations
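To inspect these test contexts outside the evaluation run, the files can be read with the standard library. This is a minimal sketch, assuming the script runs from the project root. The helper `load_contexts` is hypothetical and makes no assumption about the schema inside each JSON file.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_contexts(directory: str) -> Dict[str, Any]:
    """Map each JSON file's stem to its parsed content, in sorted filename order."""
    contexts = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            contexts[path.stem] = json.load(handle)
    return contexts


if __name__ == "__main__":
    # Path taken from the project structure; prints one context name per line.
    for name in load_contexts("evals/datasets/contexts"):
        print(name)
```

Keying the result by file stem makes it easy to pick a single scenario, for example `load_contexts("evals/datasets/contexts")["duplicate_addition"]`.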
Next Steps#
- Agent Evaluation - Evaluate general AI agents
- Workflow Evaluation - Evaluate complex workflows
Link last verified
June 7, 2026.
Source: RAGAS Docs
Link last verified: 2026-03-04