Test Agents

Explore how AI agents are evaluated on complex transportation workflows. Configure the evaluation harness below to see pre-computed results from a multi-stage Road Safety Audit pipeline.

Test Harness Configuration

Solver

Claude Sonnet 4.6

Stage 1: Data Review & Conditions Assessment

  • Crash Trend Analysis
  • Corridor Profile
  • Speed Assessment
  • Infrastructure Inventory

Stage 2: Field Observation Analysis

  • Field Observation Analysis
  • Prompt List Evaluation

Stage 3: Issue Synthesis & Countermeasure Development

  • Issue Synthesis
  • Countermeasure Development
  • Prioritization

Stage 4: Findings Package Assembly

  • Findings Assembly
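
As a rough illustration, the four solver stages above could be written as an ordered structure that a harness walks through. This is only a sketch: the stage and task names come from the configuration, but `PIPELINE`, `run_pipeline`, and the `run_task` callback are hypothetical names, and the idea that each task can see the outputs of earlier tasks is an assumption.

```python
# Hypothetical layout of the four-stage Road Safety Audit pipeline.
# Stage and task names mirror the configuration; the structure is assumed.
PIPELINE = [
    ("Data Review & Conditions Assessment",
     ["Crash Trend Analysis", "Corridor Profile",
      "Speed Assessment", "Infrastructure Inventory"]),
    ("Field Observation Analysis",
     ["Field Observation Analysis", "Prompt List Evaluation"]),
    ("Issue Synthesis & Countermeasure Development",
     ["Issue Synthesis", "Countermeasure Development", "Prioritization"]),
    ("Findings Package Assembly",
     ["Findings Assembly"]),
]


def run_pipeline(pipeline, run_task):
    """Run every task in stage order and collect outputs keyed by task name.

    run_task(stage_name, task_name, prior_outputs) is a caller-supplied
    callback (hypothetical); it receives a copy of all earlier outputs.
    """
    outputs = {}
    for stage_name, tasks in pipeline:
        for task in tasks:
            outputs[task] = run_task(stage_name, task, dict(outputs))
    return outputs
```

Passing a copy of the earlier outputs keeps each task from mutating what previous tasks produced, which makes per-stage scoring easier to reason about.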

Scorer

Tier 0: Gate

Schema validation per stage

Cost: Free · Type: Deterministic
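
A deterministic gate of this kind might look like the sketch below, which checks that a stage output has the expected fields with the expected types. The field names in `REQUIRED_FIELDS` and the function `passes_gate` are hypothetical; the actual per-stage schemas are not shown here.

```python
# Hypothetical schema: the real per-stage schemas are not given in the source.
REQUIRED_FIELDS = {"stage": str, "findings": list}


def passes_gate(output, required=REQUIRED_FIELDS):
    """Return True only if output is a dict with every required field
    present and of the expected type. No model calls, so the result is
    deterministic and free to compute."""
    if not isinstance(output, dict):
        return False
    return all(isinstance(output.get(key), expected)
               for key, expected in required.items())
```

For example, `passes_gate({"stage": "Stage 1", "findings": []})` is accepted, while an output that is missing `findings` or supplies it as a string is rejected.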

Tier 1: Primary

Code-based scoring against ground truth

  • Crash Pattern Recall
  • Evidence Chain Completeness
  • Issue Coverage (F1)
  • Countermeasure Coverage (F1)
  • Prioritization Alignment

Cost: Free · Type: Code-based
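
To make the recall and F1 metrics concrete, here is a minimal sketch that compares a set of predicted items against a ground-truth set. It assumes items are matched by exact label, which may be stricter than the matching the harness actually uses; the function names and the sample issue labels are hypothetical.

```python
def recall(predicted, truth):
    """Fraction of ground-truth items that appear in the prediction."""
    truth = set(truth)
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(predicted) & truth) / len(truth)


def f1(predicted, truth):
    """Harmonic mean of precision and recall over exact-match labels."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    rec = true_positives / len(truth)
    return 2 * precision * rec / (precision + rec)


# Hypothetical labels for illustration only.
predicted_issues = {"sight distance", "speeding", "lighting"}
truth_issues = {"speeding", "lighting", "crosswalk", "signage"}
issue_recall = recall(predicted_issues, truth_issues)  # 2 of 4 found
issue_f1 = f1(predicted_issues, truth_issues)          # precision 2/3, recall 1/2
```

In this example two of the four ground-truth issues are recovered, giving a recall of 0.5 and an F1 of 4/7 (about 0.57).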

Tier 2: Judge

LLM-as-Judge qualitative evaluation

  • Crash Analysis Quality
  • Field Data Utilization
  • Countermeasure Appropriateness

Cost: ~$0.10 each · Type: Model-based
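
One way the three tiers might fit together is sketched below: the free gate runs first, and the code-based and judge scorers run only if it passes, so model spend is avoided on malformed outputs. This ordering is an assumption suggested by the "Gate" label, not something the configuration states. The `evaluate` function and its arguments are hypothetical, and the cost constant assumes "each" means one judge criterion.

```python
# Approximate cost per judge call, taken from the configuration above;
# reading "each" as "per judge criterion" is an assumption.
JUDGE_COST_USD = 0.10


def evaluate(output, gate, code_scorers, judges):
    """Run tiers from cheapest to most expensive.

    gate:         callable returning True/False (Tier 0)
    code_scorers: dict of name -> callable (Tier 1, free)
    judges:       dict of name -> callable (Tier 2, billed per call)
    """
    result = {"gate_passed": bool(gate(output)), "scores": {}, "cost_usd": 0.0}
    if not result["gate_passed"]:
        return result  # skip paid tiers when the schema check fails
    for name, scorer in code_scorers.items():
        result["scores"][name] = scorer(output)
    for name, judge in judges.items():
        result["scores"][name] = judge(output)
        result["cost_usd"] += JUDGE_COST_USD
    return result
```

Under these assumptions, a run that passes the gate and is scored on all three judge criteria would cost roughly $0.30, while a run that fails the gate costs nothing.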