TB-Safety
Safety is a core mission for transportation agencies and a top AI use case. TB-Safety evaluates whether AI can reliably support safety workflows.
| Rank | Model | Knowledge | Tasks | TB-Safety Index |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 92% | 67% (Draft-Ready) | 73.3% |
| 2 | Claude Haiku 4.5 | 90.7% | 65% (Draft-Ready) | 71.4% |
| 3 | GPT-5.2 | 93.3% | 59% (Draft-Ready) | 67.6% |
| 4 | Claude Sonnet 4.6 | 94.7% | 56% (Draft-Ready) | 65.7% |
| 5 | Gemini 3 Pro Preview | 94.7% | 49% (Not Recommended) | 60.4% |
| 6 | GPT-5 mini | 73.3% | 53% (Draft-Ready) | 58.1% |
| 7 | Gemini 3 Flash Preview | 90.7% | 45% (Not Recommended) | 56.4% |
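The page does not state how the TB-Safety Index combines the two component scores. The sketch below checks one possible reading: that the index is a weighted average of 25% Knowledge and 75% Tasks. These weights are an assumption inferred from the published numbers in the table, not a documented formula, and the function and constant names are illustrative. Because the published percentages are rounded, agreement to within about 0.1 percentage point is the most that can be expected.

```python
# Published scores from the TB-Safety table: (knowledge %, tasks %, index %).
PUBLISHED = {
    "Claude Opus 4.6": (92.0, 67.0, 73.3),
    "Claude Haiku 4.5": (90.7, 65.0, 71.4),
    "GPT-5.2": (93.3, 59.0, 67.6),
    "Claude Sonnet 4.6": (94.7, 56.0, 65.7),
    "Gemini 3 Pro Preview": (94.7, 49.0, 60.4),
    "GPT-5 mini": (73.3, 53.0, 58.1),
    "Gemini 3 Flash Preview": (90.7, 45.0, 56.4),
}

# ASSUMPTION: these weights are inferred from the table, not stated by the source.
KNOWLEDGE_WEIGHT = 0.25
TASKS_WEIGHT = 0.75


def estimated_index(knowledge: float, tasks: float) -> float:
    """Weighted average of the two component scores under the assumed weights."""
    return KNOWLEDGE_WEIGHT * knowledge + TASKS_WEIGHT * tasks


def max_abs_error() -> float:
    """Largest gap between the estimated and published index across all models."""
    return max(
        abs(estimated_index(k, t) - published)
        for k, t, published in PUBLISHED.values()
    )


if __name__ == "__main__":
    for model, (k, t, published) in PUBLISHED.items():
        print(f"{model:24s} estimated={estimated_index(k, t):.2f} published={published}")
    print(f"max abs error: {max_abs_error():.3f}")
```

If the assumed weighting is right, the Tasks score dominates the ranking. That would explain why models with the highest Knowledge scores (94.7%) still place fourth and fifth.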
Knowledge Evaluation: RSP1 - Roadway Safety Professional Level 1
Test Results as of: 2026-03-06
Overall Scores
Performance by Knowledge Domain
Tasks Evaluation: Road Safety Audit (RSA)
Multi-model eval results · 7 models · Study: DelDOT US-13 Road Safety Audit
Model Detail: Claude Opus 4.6
Overall RSA Score
Weighted composite across all pipeline stages
Stage 1: Data Review
Did the agent correctly identify crash patterns from the data?
Stage 2: Field Analysis
Did the agent properly link field observations to evidence?
Stage 3: Synthesis
Did the agent find the right issues and recommend appropriate fixes?
Stage 4: Findings
Did the agent prioritize findings correctly?
What This Means For You
Practical guidance for transportation professionals based on AI performance across RSA workflow stages
The best-performing model scores 67% on the RSA task evaluation: a useful starting point, but its output still needs significant revision.
Stage 1: Data Review
Verify data extraction carefully
Models show inconsistent performance extracting crash patterns and geometric data from study inputs. Some models miss key severity classifications or mischaracterize contributing factors. Use AI-generated data summaries as a starting point, but independently verify crash counts, severity breakdowns, and geometric measurements against source records before proceeding to field analysis.
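The independent verification described above can be partly automated. The following is a minimal sketch, assuming crash records are available as dictionaries with a `severity` field coded on the KABCO scale and that the AI-generated summary has been reduced to a severity-to-count mapping. The record layout and the function names are hypothetical and not part of the TB-Safety evaluation.

```python
from collections import Counter


def severity_counts(crash_records):
    """Count crashes per severity code directly from the source records."""
    return Counter(record["severity"] for record in crash_records)


def find_discrepancies(ai_summary, crash_records):
    """Return {severity: (ai_count, source_count)} wherever the two disagree.

    A severity missing from either side is treated as a count of zero, so
    classes the AI omitted or invented both show up in the result.
    """
    source = severity_counts(crash_records)
    severities = set(ai_summary) | set(source)
    return {
        severity: (ai_summary.get(severity, 0), source.get(severity, 0))
        for severity in sorted(severities)
        if ai_summary.get(severity, 0) != source.get(severity, 0)
    }


if __name__ == "__main__":
    # Hypothetical source records and AI summary, for illustration only.
    records = [
        {"severity": "K"},
        {"severity": "A"},
        {"severity": "A"},
        {"severity": "O"},
    ]
    ai_summary = {"K": 1, "A": 1, "O": 1}
    for severity, (ai, source) in find_discrepancies(ai_summary, records).items():
        print(f"severity {severity}: AI reported {ai}, source records show {source}")
```

An empty result only confirms that the counts match. Geometric measurements and contributing-factor descriptions still need a manual check against the source records.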
Stage 2: Field Analysis
AI effectively correlates observations
Models correlate field observations with crash data effectively and identify contributing factors. Their output is useful for drafting field analysis sections, though site-specific conditions (such as drainage and sight distance nuances) still require practitioner validation.
Stage 3: Synthesis & Countermeasures
Human oversight recommended
Models generate plausible countermeasures but often miss proportionality: they recommend interventions that are out of scale with the severity or frequency of the identified issues. Always review countermeasure cost-effectiveness and feasibility.
Stage 4: Findings Report
Review narrative quality carefully
AI produces structurally sound reports, but narrative quality varies. Findings may lack the specificity and defensibility that reviewing agencies expect. Use AI-generated reports as a starting draft, then refine the language and strengthen the evidence citations.