Test Agents
Explore how AI agents are evaluated on complex transportation workflows. Configure the evaluation harness below to see pre-computed results from a four-stage Road Safety Audit pipeline.
Test Harness Configuration
Solver
Claude Sonnet 4.6
Stage 1: Data Review & Conditions Assessment
- Crash Trend Analysis
- Corridor Profile
- Speed Assessment
- Infrastructure Inventory
Stage 2: Field Observation Analysis
- Field Observation Analysis
- Prompt List Evaluation
Stage 3: Issue Synthesis & Countermeasure Development
- Issue Synthesis
- Countermeasure Development
- Prioritization
Stage 4: Findings Package Assembly
- Findings Assembly
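One way to picture the solver configuration above is as an ordered list of stages, each with its own tasks. The sketch below is illustrative only: the names `PIPELINE`, `task_count`, and `ordered_tasks` are hypothetical and are not taken from the harness itself; only the stage and task titles come from this page.

```python
# Hypothetical data structure mirroring the stage list shown on this page.
# Only the stage and task names come from the page; the layout is an assumption.
PIPELINE = [
    {
        "stage": 1,
        "name": "Data Review & Conditions Assessment",
        "tasks": [
            "Crash Trend Analysis",
            "Corridor Profile",
            "Speed Assessment",
            "Infrastructure Inventory",
        ],
    },
    {
        "stage": 2,
        "name": "Field Observation Analysis",
        "tasks": ["Field Observation Analysis", "Prompt List Evaluation"],
    },
    {
        "stage": 3,
        "name": "Issue Synthesis & Countermeasure Development",
        "tasks": ["Issue Synthesis", "Countermeasure Development", "Prioritization"],
    },
    {
        "stage": 4,
        "name": "Findings Package Assembly",
        "tasks": ["Findings Assembly"],
    },
]


def task_count(pipeline):
    """Total number of tasks across all stages."""
    return sum(len(stage["tasks"]) for stage in pipeline)


def ordered_tasks(pipeline):
    """Return (stage number, task name) pairs in listed order."""
    return [(stage["stage"], task) for stage in pipeline for task in stage["tasks"]]


if __name__ == "__main__":
    for stage_number, task in ordered_tasks(PIPELINE):
        print(f"Stage {stage_number}: {task}")
```

Keeping the stages in a plain list preserves their order, which matters if later stages consume the output of earlier ones.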
Scorer
Tier 0: Gate
Schema validation per stage
Cost: Free · Type: Deterministic
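A deterministic schema gate of this kind can be sketched in a few lines. The example below assumes a stage output is a dictionary and checks only for required fields and their types. The schema `CRASH_TREND_SCHEMA` and its field names are hypothetical, since the actual per-stage schemas are not shown on this page.

```python
from typing import Any

# Hypothetical schema: required field name -> expected Python type.
# The real per-stage schemas are not published on this page.
CRASH_TREND_SCHEMA = {
    "corridor_id": str,
    "crash_patterns": list,
    "summary": str,
}


def validate_output(output: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of validation errors; an empty list means the output is valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in output:
            errors.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return errors


def passes_gate(output: dict[str, Any], schema: dict[str, type]) -> bool:
    """The gate passes only when there are no validation errors."""
    return not validate_output(output, schema)


if __name__ == "__main__":
    candidate = {"corridor_id": "C-12", "crash_patterns": "rear-end"}
    print(validate_output(candidate, CRASH_TREND_SCHEMA))
```

Because the check involves no model calls and no randomness, the same output always produces the same verdict, which is consistent with the "Free" and "Deterministic" labels above.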
Tier 1: Primary
Code-based scoring against ground truth
- Crash Pattern Recall
- Evidence Chain Completeness
- Issue Coverage (F1)
- Countermeasure Coverage (F1)
- Prioritization Alignment
Cost: Free · Type: Code-based
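The recall and F1 metrics listed above can be illustrated with a small set-based sketch. It assumes that both the agent's findings and the ground truth are reduced to sets of labels and matched exactly; how the harness actually matches items is not described on this page, and the function name `precision_recall_f1` and the example labels are hypothetical.

```python
def precision_recall_f1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from exact-match label sets.

    Assumes exact string matching, which is a simplification; a real
    scorer might normalize labels or allow partial matches.
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical issue labels, for illustration only.
    agent_issues = {"limited sight distance", "missing signage", "worn markings"}
    ground_truth = {"missing signage", "worn markings", "high speeds", "no crosswalk"}
    precision, recall, f1 = precision_recall_f1(agent_issues, ground_truth)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall alone rewards finding everything in the ground truth, while F1 also penalizes spurious items, which may be why coverage metrics here use F1 and Crash Pattern Recall uses recall.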
Tier 2: Judge
LLM-as-Judge qualitative evaluation
- Crash Analysis Quality
- Field Data Utilization
- Countermeasure Appropriateness
Cost: ~$0.10 each · Type: Model-based
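The tiers above suggest a cheapest-first ordering, which the following sketch illustrates. It rests on two assumptions that this page does not confirm: that a failed Tier 0 gate skips the later tiers, and that the roughly $0.10 cost applies once per judge criterion. The names `run_tiers` and `ASSUMED_COST_PER_JUDGE_CALL` are hypothetical.

```python
JUDGE_CRITERIA = [
    "Crash Analysis Quality",
    "Field Data Utilization",
    "Countermeasure Appropriateness",
]

# Assumption: the approximate cost applies once per judge criterion.
ASSUMED_COST_PER_JUDGE_CALL = 0.10


def run_tiers(gate_passed: bool, run_judge: bool = True) -> dict:
    """Report which tiers would run and the estimated model cost.

    Assumes a failed gate short-circuits the remaining tiers.
    """
    result = {"tiers_run": ["Tier 0: Gate"], "estimated_cost_usd": 0.0}
    if not gate_passed:
        return result  # Nothing beyond the free gate is executed.
    result["tiers_run"].append("Tier 1: Primary")
    if run_judge:
        result["tiers_run"].append("Tier 2: Judge")
        result["estimated_cost_usd"] = round(
            len(JUDGE_CRITERIA) * ASSUMED_COST_PER_JUDGE_CALL, 2
        )
    return result


if __name__ == "__main__":
    print(run_tiers(gate_passed=False))
    print(run_tiers(gate_passed=True))
```

Under these assumptions, malformed outputs are rejected before any paid judge call is made, so model cost is only incurred for runs that are already structurally valid.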