TB-Safety

Safety is a core mission for transportation agencies and a top AI use case. TB-Safety evaluates whether AI can reliably support safety workflows.

Knowledge: Results Available

Tasks: Results Available

Rank	Model	Knowledge	Tasks	Tasks Rating	TB-Safety Index
1	Claude Opus 4.6	92%	67%	Draft-Ready	73.3%
2	Claude Haiku 4.5	90.7%	65%	Draft-Ready	71.4%
3	GPT-5.2	93.3%	59%	Draft-Ready	67.6%
4	Claude Sonnet 4.6	94.7%	56%	Draft-Ready	65.7%
5	Gemini 3 Pro Preview	94.7%	49%	Not Recommended	60.4%
6	GPT-5 mini	73.3%	53%	Draft-Ready	58.1%
7	Gemini 3 Flash Preview	90.7%	45%	Not Recommended	56.4%
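
The page does not state how the TB-Safety Index is derived from the Knowledge and Tasks scores. The published numbers appear consistent with a weighted average of roughly 25% Knowledge and 75% Tasks, but that weighting is an inference from the leaderboard, not a documented formula. A minimal sketch that checks the assumption against the rows above (the function and constant names are illustrative):

```python
# Scores copied from the leaderboard above:
# model -> (knowledge %, tasks %, reported TB-Safety Index %)
LEADERBOARD = {
    "Claude Opus 4.6": (92.0, 67.0, 73.3),
    "Claude Haiku 4.5": (90.7, 65.0, 71.4),
    "GPT-5.2": (93.3, 59.0, 67.6),
    "Claude Sonnet 4.6": (94.7, 56.0, 65.7),
    "Gemini 3 Pro Preview": (94.7, 49.0, 60.4),
    "GPT-5 mini": (73.3, 53.0, 58.1),
    "Gemini 3 Flash Preview": (90.7, 45.0, 56.4),
}

# ASSUMPTION: 25/75 weights are inferred from the published numbers,
# not taken from any TB-Safety documentation.
KNOWLEDGE_WEIGHT = 0.25
TASKS_WEIGHT = 0.75


def tb_safety_index(knowledge: float, tasks: float) -> float:
    """Weighted average of the two component scores (assumed formula)."""
    return KNOWLEDGE_WEIGHT * knowledge + TASKS_WEIGHT * tasks


def max_deviation() -> float:
    """Largest gap between the assumed formula and the reported index."""
    return max(
        abs(tb_safety_index(k, t) - reported)
        for k, t, reported in LEADERBOARD.values()
    )


if __name__ == "__main__":
    for model, (k, t, reported) in LEADERBOARD.items():
        print(f"{model}: computed {tb_safety_index(k, t):.1f}, reported {reported}")
```

Under this assumption every row lands within a tenth of a point of the reported index. The small residuals would be expected if the displayed Tasks percentages are rounded.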

Knowledge Evaluation: RSP1 - Roadway Safety Professional Level 1

Test Results as of: 2026-03-06

Overall Scores

Performance by Knowledge Domain

Tasks Evaluation: Road Safety Audit (RSA)

Multi-model eval results · 7 models · Study: DelDOT US-13 Road Safety Audit

Estimated Token Cost & Duration

Model Detail: Claude Opus 4.6

67%

Overall RSA Score

Draft-Ready: Useful starting point; significant revision needed

Weighted composite across all pipeline stages

Stage 1: Data Review

71%

Assist-Ready

Did the agent correctly identify crash patterns from the data?

Stage 2: Field Analysis

100%

Practitioner-Ready

Did the agent properly link field observations to evidence?

Stage 3: Synthesis

61%

Draft-Ready

Did the agent find the right issues and recommend appropriate fixes?

Stage 4: Findings

44%

Not Recommended

Did the agent prioritize findings correctly?

⚠ Not recommended: produce this stage from scratch
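
The readiness labels on this page (Not Recommended, Draft-Ready, Assist-Ready, Practitioner-Ready) behave like score bands, but the cutoffs are not published. The sketch below uses hypothetical cutoffs of 50, 70, and 90, chosen only because they reproduce every score and label pair shown on this page; the real thresholds may differ.

```python
# ASSUMPTION: these cutoffs are hypothetical. They were picked so that every
# (score, label) pair visible on this page is reproduced; TB-Safety does not
# publish its actual thresholds here.
ASSUMED_CUTOFFS = [
    (90, "Practitioner-Ready"),
    (70, "Assist-Ready"),
    (50, "Draft-Ready"),
]

# (score %, label) pairs as displayed in the leaderboard and the
# Claude Opus 4.6 stage breakdown.
OBSERVED = [
    (67, "Draft-Ready"),
    (65, "Draft-Ready"),
    (59, "Draft-Ready"),
    (56, "Draft-Ready"),
    (49, "Not Recommended"),
    (53, "Draft-Ready"),
    (45, "Not Recommended"),
    (71, "Assist-Ready"),
    (100, "Practitioner-Ready"),
    (61, "Draft-Ready"),
    (44, "Not Recommended"),
]


def readiness_label(score_pct: float) -> str:
    """Map a percentage score to a readiness label using the assumed bands."""
    for cutoff, label in ASSUMED_CUTOFFS:
        if score_pct >= cutoff:
            return label
    return "Not Recommended"


def matches_observed() -> bool:
    """True if the assumed bands reproduce every label shown on the page."""
    return all(readiness_label(score) == label for score, label in OBSERVED)
```

The observed data only constrain each boundary to a range (for example, the Draft-Ready floor lies somewhere above 49% and at or below 53%), so treat the exact cutoffs as placeholders.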

What This Means For You

Practical guidance for transportation professionals based on AI performance across RSA workflow stages

Current AI Readiness: Draft-Ready

The best-performing model scores 67% on the RSA task evaluation. Useful starting point; significant revision needed.

Stage 1: Data Review

Verify data extraction carefully

Models show inconsistent performance extracting crash patterns and geometric data from study inputs. Some models miss key severity classifications or mischaracterize contributing factors. Use AI-generated data summaries as a starting point, but independently verify crash counts, severity breakdowns, and geometric measurements against source records before proceeding to field analysis.

Avg across models: 48%

Stage 2: Field Analysis

AI effectively correlates observations

AI effectively correlates field observations with crash data and identifies contributing factors. Useful for generating draft field analysis sections, though site-specific conditions (drainage, sight distance nuances) still require practitioner validation.

Avg across models: 95%

Stage 3: Synthesis & Countermeasures

Human oversight recommended

Models generate plausible countermeasures but often miss proportionality, recommending interventions that are out of scale with the severity or frequency of the identified issues. Always review countermeasure cost-effectiveness and feasibility.

Avg across models: 51%

Stage 4: Findings Report

Review narrative quality carefully

AI produces structurally sound reports but narrative quality varies. Findings may lack the specificity and defensibility expected by reviewing agencies. Use AI-generated reports as a starting draft, then refine language and strengthen evidence citations.

Avg across models: 52%
