TB-Safety

Safety is a core mission for transportation agencies and a top AI use case. TB-Safety evaluates whether AI can reliably support safety workflows.

Knowledge: Results Available
Tasks: Results Available
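| Rank | Model | Knowledge | Tasks | TB-Safety Index |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 92% | 67% (Draft-Ready) | 73.3% |
| 2 | Claude Haiku 4.5 | 90.7% | 65% (Draft-Ready) | 71.4% |
| 3 | GPT-5.2 | 93.3% | 59% (Draft-Ready) | 67.6% |
| 4 | Claude Sonnet 4.6 | 94.7% | 56% (Draft-Ready) | 65.7% |
| 5 | Gemini 3 Pro Preview | 94.7% | 49% (Not Recommended) | 60.4% |
| 6 | GPT-5 mini | 73.3% | 53% (Draft-Ready) | 58.1% |
| 7 | Gemini 3 Flash Preview | 90.7% | 45% (Not Recommended) | 56.4% |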
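The page does not say how the TB-Safety Index is derived from the Knowledge and Tasks scores. As a rough check, the published figures appear consistent (to within rounding) with a blend of about 25% Knowledge and 75% Tasks. The sketch below tests that guess against the leaderboard rows; the weights and the function names `blended_index` and `max_deviation` are assumptions made for illustration, not the benchmark's documented method.

```python
# Leaderboard rows as published: (model, knowledge %, tasks %, published index %).
ROWS = [
    ("Claude Opus 4.6",        92.0, 67.0, 73.3),
    ("Claude Haiku 4.5",       90.7, 65.0, 71.4),
    ("GPT-5.2",                93.3, 59.0, 67.6),
    ("Claude Sonnet 4.6",      94.7, 56.0, 65.7),
    ("Gemini 3 Pro Preview",   94.7, 49.0, 60.4),
    ("GPT-5 mini",             73.3, 53.0, 58.1),
    ("Gemini 3 Flash Preview", 90.7, 45.0, 56.4),
]

# ASSUMPTION: these weights are inferred from the numbers above.
# The source does not state the actual formula.
KNOWLEDGE_WEIGHT = 0.25
TASKS_WEIGHT = 0.75


def blended_index(knowledge: float, tasks: float) -> float:
    """Weighted blend of the two scores under the assumed weights."""
    return KNOWLEDGE_WEIGHT * knowledge + TASKS_WEIGHT * tasks


def max_deviation(rows) -> float:
    """Largest absolute gap between the blended value and the published index."""
    return max(abs(blended_index(k, t) - idx) for _, k, t, idx in rows)


if __name__ == "__main__":
    for model, k, t, idx in ROWS:
        print(f"{model:24s} blended={blended_index(k, t):6.2f}  published={idx:5.1f}")
    print(f"max deviation: {max_deviation(ROWS):.3f} points")
```

If the guess holds, the Tasks score dominates the ranking, which would explain why GPT-5 mini (73.3% Knowledge, 53% Tasks) places above Gemini 3 Flash Preview (90.7% Knowledge, 45% Tasks).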

Knowledge Evaluation: RSP1 - Roadway Safety Professional Level 1

Test Results as of: 2026-03-06

Overall Scores

Performance by Knowledge Domain

Tasks Evaluation: Road Safety Audit (RSA)

Multi-model eval results · 7 models · Study: DelDOT US-13 Road Safety Audit

Estimated Token Cost & Duration

Model Detail: Claude Opus 4.6

67%

Overall RSA Score

Draft-Ready: Useful starting point; significant revision needed

Weighted composite across all pipeline stages

Stage 1: Data Review

71%
Assist-Ready

Did the agent correctly identify crash patterns from the data?

Stage 2: Field Analysis

100%
Practitioner-Ready

Did the agent properly link field observations to evidence?

Stage 3: Synthesis

61%
Draft-Ready

Did the agent find the right issues and recommend appropriate fixes?

Stage 4: Findings

44%
Not Recommended

Did the agent prioritize findings correctly?

Not recommended — produce this stage from scratch
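The readiness labels above map a percentage score to a tier, but the page does not publish the cutoffs. The sketch below shows one way such a mapping could work. The thresholds (50, 70, 90) and the name `readiness_tier` are hypothetical; they were chosen only because they reproduce every score and label pair shown on this page, and the real boundaries may differ.

```python
# ASSUMPTION: cutoffs are hypothetical. The page only shows labelled examples,
# which constrain the boundaries loosely (e.g. 49% is Not Recommended, 53% is Draft-Ready).
TIER_CUTOFFS = [
    (90, "Practitioner-Ready"),
    (70, "Assist-Ready"),
    (50, "Draft-Ready"),
]


def readiness_tier(score_pct: float) -> str:
    """Return the first tier whose assumed cutoff the score meets or exceeds."""
    for cutoff, label in TIER_CUTOFFS:
        if score_pct >= cutoff:
            return label
    return "Not Recommended"


# Score and label pairs that appear on the page, used as a consistency check.
OBSERVED = {
    100: "Practitioner-Ready",
    71: "Assist-Ready",
    67: "Draft-Ready",
    61: "Draft-Ready",
    53: "Draft-Ready",
    49: "Not Recommended",
    44: "Not Recommended",
}

if __name__ == "__main__":
    for score, label in OBSERVED.items():
        print(score, readiness_tier(score), "(published:", label + ")")
```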

What This Means For You

Practical guidance for transportation professionals based on AI performance across RSA workflow stages

Current AI Readiness: Draft-Ready

The best-performing model, Claude Opus 4.6, scores 67% on the RSA task evaluation. At that level, its output is a useful starting point but needs significant revision.

Stage 1: Data Review

Verify data extraction carefully

Models show inconsistent performance extracting crash patterns and geometric data from study inputs. Some models miss key severity classifications or mischaracterize contributing factors. Use AI-generated data summaries as a starting point, but independently verify crash counts, severity breakdowns, and geometric measurements against source records before proceeding to field analysis.

Avg across models: 48%

Stage 2: Field Analysis

AI effectively correlates observations

Models reliably link field observations to crash data and identify contributing factors. This makes them useful for drafting field analysis sections, though site-specific conditions (drainage, sight distance nuances) still require practitioner validation.

Avg across models: 95%

Stage 3: Synthesis & Countermeasures

Human oversight recommended

Models generate plausible countermeasures but often miss proportionality, recommending interventions that are out of scale with the severity or frequency of the identified issues. Always review countermeasure cost-effectiveness and feasibility.

Avg across models: 51%

Stage 4: Findings Report

Review narrative quality carefully

AI produces structurally sound reports but narrative quality varies. Findings may lack the specificity and defensibility expected by reviewing agencies. Use AI-generated reports as a starting draft, then refine language and strengthen evidence citations.

Avg across models: 52%