Workflow Evaluation Quickstart
The workflow_eval template evaluates a multi-step LLM workflow, using customer support email classification and routing as the example.
Create the Project#
ragas quickstart workflow_eval
cd workflow_eval
Install Dependencies#
uv sync
Set Your API Key#
export OPENAI_API_KEY="your-openai-key"
Run the Evaluation#
uv run python evals.py
Project Structure#
workflow_eval/
├── README.md # Project documentation
├── pyproject.toml # Project configuration
├── workflow.py # Workflow implementation
├── evals.py # Evaluation workflow
├── __init__.py # Python package marker
└── evals/
    ├── datasets/ # Test datasets
    ├── experiments/ # Evaluation results
    └── logs/ # Execution logs
What It Evaluates#
The template evaluates a customer support email classification workflow:
- Workflow: Multi-step email processing (classification → extraction → response)
- Categories: Bug Report, Feature Request, Billing
- Test Cases: Customer emails with expected categories and extracted fields
- Metric: Custom discrete metric checking classification accuracy
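To make the three-step shape (classification → extraction → response) concrete, here is a minimal sketch of such a pipeline. It is not the template's workflow.py: keyword rules and regular expressions stand in for the LLM calls, and `WorkflowResult`, `CATEGORY_KEYWORDS`, `classify`, `extract_fields`, and `respond` are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword rules standing in for an LLM classifier.
CATEGORY_KEYWORDS = {
    "Bug Report": ("bug", "error", "crash"),
    "Feature Request": ("feature", "would like", "please add"),
    "Billing": ("invoice", "charge", "refund"),
}


@dataclass
class WorkflowResult:
    category: str
    extracted_fields: dict = field(default_factory=dict)
    response: str = ""


def classify(email: str) -> str:
    """Step 1: pick the first category whose keywords appear in the email."""
    text = email.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Unknown"


def extract_fields(email: str, category: str) -> dict:
    """Step 2: pull category-specific fields (only bug reports are handled here)."""
    fields = {}
    if category == "Bug Report":
        version = re.search(r"version\s+(\d+(?:\.\d+)+)", email, re.IGNORECASE)
        error = re.search(r"error code\s+([A-Z]+-\d+)", email, re.IGNORECASE)
        if version:
            fields["product_version"] = version.group(1)
        if error:
            fields["error_code"] = error.group(1)
    return fields


def respond(category: str, fields: dict) -> str:
    """Step 3: build a templated reply from the outputs of steps 1 and 2."""
    details = ", ".join(f"{k}={v}" for k, v in fields.items()) or "no details"
    return f"Thanks for your {category} message ({details})."


def process_email(email: str) -> WorkflowResult:
    """Run the three steps in order and bundle their outputs."""
    category = classify(email)
    fields = extract_fields(email, category)
    return WorkflowResult(category, fields, respond(category, fields))


result = process_email("Hi, I'm getting error code XYZ-123 when using version 2.1.4...")
print(result.category, result.extracted_fields)
```

Because each step feeds the next, a wrong classification also changes which fields are extracted, which is why the evaluation checks the category and the extracted fields together.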
Understanding the Code#
The Workflow (workflow.py)#
Implements a customer support email workflow:
from workflow import default_workflow_client
workflow = default_workflow_client()
result = workflow.process_email("I found a bug in version 2.1.4...")
# Returns: category, extracted fields, response
The Evaluation (evals.py)#
Tests workflow accuracy against pass criteria:
def load_dataset():
    dataset_dict = [
        {
            "email": "Hi, I'm getting error code XYZ-123 when using version 2.1.4...",
            "pass_criteria": "category Bug Report; product_version 2.1.4; error_code XYZ-123",
        },
        # More test cases...
    ]
The metric evaluates if the workflow correctly:
- Classifies the email category
- Extracts relevant fields (version, error code, invoice number, etc.)
- Generates appropriate responses
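One way to picture a pass/fail check against a `pass_criteria` string is the deterministic sketch below. It is an assumption-laden stand-in, not the template's metric: it assumes the workflow result is a flat dict keyed by the same field names, it treats the first word of each `;`-separated clause as the field name, and it does not judge response quality. `parse_pass_criteria` and `score` are hypothetical helper names.

```python
def parse_pass_criteria(criteria: str) -> dict:
    """Turn 'category Bug Report; product_version 2.1.4' into a dict."""
    expected = {}
    for part in criteria.split(";"):
        part = part.strip()
        if not part:
            continue
        # The first word is the field name; the rest is the expected value.
        key, _, value = part.partition(" ")
        expected[key] = value.strip()
    return expected


def score(result: dict, criteria: str) -> str:
    """Return 'pass' only if every expected field matches the result."""
    expected = parse_pass_criteria(criteria)
    for key, value in expected.items():
        # Compare case-insensitively; a missing field counts as a mismatch.
        if str(result.get(key, "")).strip().lower() != value.lower():
            return "fail"
    return "pass"


criteria = "category Bug Report; product_version 2.1.4; error_code XYZ-123"
good = {"category": "Bug Report", "product_version": "2.1.4", "error_code": "XYZ-123"}
bad = {"category": "Billing", "product_version": "2.1.4", "error_code": "XYZ-123"}

print(score(good, criteria))  # pass
print(score(bad, criteria))   # fail
```

An exact-match check like this is strict about formatting, so free-text fields or generated responses would likely need a more tolerant (for example, LLM-judged) comparison.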
Test Cases#
The template includes diverse scenarios:
- Bug Reports: With version numbers and error codes
- Feature Requests: With urgency levels and product areas
- Billing Issues: With invoice numbers and amounts
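As a rough illustration of what one row per scenario might look like, the sketch below reuses the `email` / `pass_criteria` row shape from evals.py and counts how many cases cover each category. The feature-request and billing rows, their field names (`urgency`, `product_area`, `invoice_number`, `amount`), and the helpers `expected_category` and `category_coverage` are hypothetical and not taken from the template's dataset.

```python
from collections import Counter

KNOWN_CATEGORIES = {"Bug Report", "Feature Request", "Billing"}

# Hypothetical rows: only the first one appears in the template's documentation.
test_cases = [
    {
        "email": "Hi, I'm getting error code XYZ-123 when using version 2.1.4...",
        "pass_criteria": "category Bug Report; product_version 2.1.4; error_code XYZ-123",
    },
    {
        "email": "It would be great to have dark mode in the dashboard soon.",
        "pass_criteria": "category Feature Request; urgency low; product_area dashboard",
    },
    {
        "email": "I was charged $49.00 twice on invoice INV-0001.",
        "pass_criteria": "category Billing; invoice_number INV-0001; amount 49.00",
    },
]


def expected_category(pass_criteria: str) -> str:
    """Read the value of the leading 'category ...' clause."""
    first_clause = pass_criteria.split(";")[0].strip()
    prefix = "category "
    if not first_clause.startswith(prefix):
        raise ValueError(f"pass_criteria must start with a category clause: {pass_criteria!r}")
    return first_clause[len(prefix):]


def category_coverage(rows: list) -> Counter:
    """Count test cases per category and reject unknown categories."""
    counts = Counter()
    for row in rows:
        category = expected_category(row["pass_criteria"])
        if category not in KNOWN_CATEGORIES:
            raise ValueError(f"Unknown category: {category}")
        counts[category] += 1
    return counts


coverage = category_coverage(test_cases)
missing = KNOWN_CATEGORIES - set(coverage)
print(dict(coverage), "missing:", missing)
```

A coverage check of this kind can flag a dataset that never exercises one of the categories before an evaluation run is spent on it.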
Customization#
Add Your Own Workflow#
Replace the example workflow with your own:
from ragas import experiment
from your_workflow import YourWorkflow

workflow = YourWorkflow()

@experiment()
async def run_experiment(row):
    # Run your workflow on one dataset row
    result = await workflow.process(row["input"])
    # Evaluate result...
Next Steps#
- Agent Evaluation - Evaluate AI agents
- LlamaIndex Agent Evaluation - Evaluate LlamaIndex workflows
Link last verified June 7, 2026.
Source: RAGAS Docs
Link last verified: 2026-03-04