PSAI-Bench: Physical Security AI Triage Benchmark

Do frontier AI models actually benefit from video when triaging physical security alerts — or are they just reasoning about metadata?

PSAI-Bench is an evaluation framework that answers this question. It generates realistic security scenarios grounded in real datasets (UCF Crime, Caltech Pedestrian), scores model outputs on safety-weighted metrics, and measures the perception-reasoning gap — the difference between what a model achieves with metadata alone vs. metadata + video.

Key finding: GPT-4o achieves a near-perfect Threat Detection Rate (0.999) but collapses to a 0.580 Aggregate Score due to over-reliance on the SUSPICIOUS verdict (42.4% of responses). The model catches almost every threat but at the cost of drowning operators in false escalations — exactly the failure mode PSAI-Bench is designed to detect.

Results

UCF Crime Metadata Track (3,000 scenarios)

System	Accuracy	TDR	FASR	Safety Score	ECE	Aggregate
GPT-4o	0.645	0.999	0.792	0.947	0.186	0.580
Severity Heuristic	0.762	1.000	1.000	1.000	—	0.832
Majority Class	0.582	1.000	0.000	0.750	—	0.444
Random	0.326	0.656	0.307	0.569	—	0.446
Always SUSPICIOUS	0.179	1.000	0.000	0.750	—	-0.204

GPT-4o's raw accuracy (0.645) is beaten by the severity heuristic baseline (0.762) — a rule-based system that uses only the alert severity field. The model's strength is threat detection (TDR=0.999), but its SUSPICIOUS overuse (42.4% vs. the 30% penalty threshold) tanks the aggregate score.

GPT-4o Per-Difficulty Breakdown

Difficulty	Accuracy	Safety Score
Easy	0.723	—
Medium	0.587	—
Hard	0.625	—

Evaluation Tracks

PSAI-Bench supports three evaluation tracks with increasing perceptual requirements:

Track	Input	Scenarios	What It Tests
Metadata	Alert text, zone info, device stats, weather, time	8,000 (UCF + Caltech)	Pure reasoning over structured context
Visual	Metadata + video clip URI	2,900 (UCF Crime)	Whether video adds triage value
Multi-Sensor	Camera + PIR + badge + vibration + thermal	1,000 (synthetic)	Sensor fusion reasoning

Metrics

Metric	What It Measures
TDR (Threat Detection Rate)	Fraction of threats classified as THREAT or SUSPICIOUS
FASR (False Alarm Suppression Rate)	Fraction of benign alerts correctly classified as BENIGN
Safety Score	Weighted combination: `(w * TDR + FASR) / (w + 1)` at w=3
ECE (Expected Calibration Error)	How well confidence scores match actual accuracy
Aggregate Score	`0.4 * accuracy + 0.4 * safety_score + 0.2 * calibration_factor - suspicious_penalty`
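
The formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not the code in scorer.py: the function names, the assumption that ground-truth threats carry the label THREAT, and the 10 equal-width bins used for ECE are all assumptions. The text does not define calibration_factor or the size of suspicious_penalty, so both are passed in as plain arguments.

```python
from typing import Sequence

DETECTION_VERDICTS = {"THREAT", "SUSPICIOUS"}  # verdicts counted as a detection


def threat_detection_rate(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of ground-truth threats given a THREAT or SUSPICIOUS verdict.

    Assumption: ground-truth threats are labeled "THREAT".
    """
    verdicts = [p for t, p in zip(truth, pred) if t == "THREAT"]
    if not verdicts:
        raise ValueError("no ground-truth threats in the sample")
    return sum(p in DETECTION_VERDICTS for p in verdicts) / len(verdicts)


def false_alarm_suppression_rate(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of ground-truth benign alerts given a BENIGN verdict."""
    verdicts = [p for t, p in zip(truth, pred) if t == "BENIGN"]
    if not verdicts:
        raise ValueError("no ground-truth benign alerts in the sample")
    return sum(p == "BENIGN" for p in verdicts) / len(verdicts)


def safety_score(tdr: float, fasr: float, w: float = 3.0) -> float:
    """Weighted combination (w * TDR + FASR) / (w + 1)."""
    return (w * tdr + fasr) / (w + 1)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Standard binned ECE; the bin count is an assumption, not from the README."""
    bins: dict[int, list[tuple[float, bool]]] = {}
    for conf, ok in zip(confidences, correct):
        index = min(n_bins - 1, int(conf * n_bins))  # confidence 1.0 goes in the last bin
        bins.setdefault(index, []).append((conf, bool(ok)))
    total = len(confidences)
    ece = 0.0
    for members in bins.values():
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def aggregate_score(
    accuracy: float,
    safety: float,
    calibration_factor: float,
    suspicious_penalty: float,
) -> float:
    """0.4 * accuracy + 0.4 * safety + 0.2 * calibration_factor - penalty."""
    return 0.4 * accuracy + 0.4 * safety + 0.2 * calibration_factor - suspicious_penalty
```

As a consistency check, `safety_score(0.999, 0.792)` evaluates to about 0.94725, which rounds to the 0.947 Safety Score reported for GPT-4o in the results table.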

The SUSPICIOUS penalty activates when a system uses the SUSPICIOUS verdict for >30% of responses, preventing the degenerate strategy of hedging everything.
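
The threshold check itself is simple to sketch. The example below only decides whether the penalty would activate; the README does not state how large the penalty is, so no magnitude is computed. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence

SUSPICIOUS_THRESHOLD = 0.30  # from the text: penalty applies above 30% of responses


def suspicious_fraction(verdicts: Sequence[str]) -> float:
    """Share of responses that used the SUSPICIOUS verdict."""
    if not verdicts:
        raise ValueError("no verdicts to evaluate")
    return Counter(verdicts)["SUSPICIOUS"] / len(verdicts)


def penalty_applies(
    verdicts: Sequence[str], threshold: float = SUSPICIOUS_THRESHOLD
) -> bool:
    """True when SUSPICIOUS usage is strictly above the threshold."""
    return suspicious_fraction(verdicts) > threshold
```

A system at 42.4% SUSPICIOUS, like the GPT-4o run above, is over the threshold; a system at exactly 30% is not under this strict-inequality reading.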

Installation

pip install -e .

# With API client support (for running evaluations)
pip install -e ".[api]"

# With development tools
pip install -e ".[dev]"

Quick Start

Generate scenarios

# Metadata track — UCF Crime dataset (3,000 scenarios)
psai-bench generate --track metadata --source ucf --n 3000 --seed 42

# Metadata track — Caltech Pedestrian dataset (5,000 scenarios)
psai-bench generate --track metadata --source caltech --n 5000 --seed 42

# Multi-sensor track (1,000 scenarios)
psai-bench generate --track multi_sensor --n 1000 --seed 42

Run baselines

psai-bench baselines --scenarios data/generated/metadata_ucf_seed42.json

Evaluate a frontier model

# gpt-4o requires the OPENAI_API_KEY environment variable
psai-bench evaluate --scenarios data/generated/metadata_ucf_seed42.json --model gpt-4o --n 100

# Available models: claude-sonnet, gpt-4o, gemini-flash

Score outputs

psai-bench score --scenarios data/generated/metadata_ucf_seed42.json \
    --outputs results/evaluations/gpt-4o_ucf_metadata_run1.json

Compare two systems

psai-bench compare --scenarios data/generated/metadata_ucf_seed42.json \
    --outputs-a results/evaluations/gpt-4o_run1.json \
    --outputs-b results/baselines/ucf/severity_heuristic_outputs.json
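
One way to read a comparison like this is a confidence interval on the accuracy difference between the two systems. The sketch below is a generic paired percentile bootstrap in plain Python, offered as an illustration only. It is not taken from the repository, and the function name, the 0/1 correctness inputs, and the defaults are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for accuracy(A) - accuracy(B) on the same scenarios.

    correct_a[i] and correct_b[i] are 1 if the system got scenario i right, else 0.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample scenario indices with replacement, keeping the A/B pairing.
        sample = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in sample) / n)
    diffs.sort()
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return diffs[low_index], diffs[high_index]
```

Resampling scenario indices, rather than each system's outputs independently, keeps the pairing intact, which matters because both systems are scored on the same scenarios.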

Architecture

psai_bench/
  cli.py           # Click CLI with generate, baselines, evaluate, score, compare commands
  generators.py    # Scenario generators for all 3 tracks
  distributions.py # Realistic alert distribution models
  evaluators.py    # API wrappers for Claude, GPT-4o, Gemini
  scorer.py        # Scoring engine (TDR, FASR, Safety Score, ECE, Aggregate)
  baselines.py     # 4 statistical baselines
  statistics.py    # McNemar's test, bootstrap CIs, run consistency
  validation.py    # Scenario and submission validators
  schema.py        # JSON Schema definitions for alerts and outputs
  video_mapper.py  # Maps UCF Crime videos to visual track scenarios
  downloader.py    # HuggingFace dataset downloader
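
statistics.py is listed as providing McNemar's test. As a rough illustration of what that test does when two systems are scored on the same scenarios, here is an exact two-sided version in plain Python. The function name and inputs are assumptions, and the repository's implementation may differ (for example by using the chi-square approximation).

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Exact two-sided McNemar p-value from paired per-scenario correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same scenarios")
    # Only discordant pairs carry information about which system is better.
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    # Under the null, each discordant pair favors either system with p = 0.5.
    tail = sum(comb(n, k) for k in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Scenarios where both systems are right or both are wrong drop out of the calculation, so the test stays meaningful even when overall accuracies are close.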

Reproducibility

All generated data is deterministic given a seed. The canonical datasets are produced by:

psai-bench generate --track metadata --source ucf --n 3000 --seed 42
psai-bench generate --track metadata --source caltech --n 5000 --seed 42

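NumPy version note: Scenario generation uses numpy.random.RandomState for determinism. NEP 19 freezes the RandomState stream apart from bug fixes, so a given seed should reproduce the same scenarios across NumPy versions, though values can still differ at roundoff level or where a fix changed behavior. Results in this repository were generated with NumPy 1.24+ on Python 3.11.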
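
A quick way to confirm seeded determinism on your own setup is to draw twice from identically seeded RandomState instances and compare. This sketch is not the generator code; `draw_sample` and its parameters are hypothetical stand-ins for whatever the scenario generators draw.

```python
import numpy as np


def draw_sample(seed: int, size: int = 5, high: int = 3000) -> np.ndarray:
    """Draw a few integers from a freshly seeded legacy RandomState."""
    rng = np.random.RandomState(seed)
    return rng.randint(0, high, size=size)


if __name__ == "__main__":
    first = draw_sample(42)
    second = draw_sample(42)
    # Identical seeds should give identical draws within one environment.
    print("NumPy version:", np.__version__)
    print("Deterministic:", np.array_equal(first, second))
```

Recording the NumPy version next to generated data makes it easier to trace any mismatch back to an environment difference.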

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Citation

@misc{psai-bench-2025,
  title   = {PSAI-Bench: Physical Security AI Triage Benchmark},
  author  = {Apisar, Addison},
  year    = {2025},
  url     = {https://github.com/YuYongJu/psai-bench},
  note    = {Evaluates whether frontier AI models benefit from video in physical security triage}
}

License

Apache-2.0
