A systematic evaluation framework for measuring how well AI models perform medical diagnosis tasks. This project compares different AI systems by testing their ability to identify medical conditions and understand their severity.
This system presents medical cases to AI models and evaluates their diagnostic responses. It addresses a key problem in medical AI evaluation: traditional exact-match methods penalize correct but differently worded diagnoses. Our approach recognizes when diagnoses are medically equivalent even when the exact words differ.
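In outline, an evaluation run is a simple loop: present each case, collect the model's diagnosis, and score it against a reference. The sketch below is illustrative only; `Case`, `score_diagnosis`, and `evaluate` are hypothetical names rather than this repository's API, and the scorer shown is the strict exact-match tier that the semantic approach later relaxes.

```python
from dataclasses import dataclass

@dataclass
class Case:
    presentation: str         # clinical vignette shown to the model
    reference_diagnosis: str  # gold-standard answer

def score_diagnosis(predicted: str, reference: str) -> float:
    """Strictest tier: exact match. The semantic tier relaxes this."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

def evaluate(ask_model, cases: list[Case]) -> float:
    """Mean score for one model; `ask_model` maps a prompt to a diagnosis string."""
    return sum(
        score_diagnosis(ask_model(c.presentation), c.reference_diagnosis)
        for c in cases
    ) / len(cases)

# Stand-in "model" for demonstration:
cases = [Case("Thunderclap headache, neck stiffness", "subarachnoid hemorrhage")]
print(evaluate(lambda prompt: "subarachnoid hemorrhage", cases))  # 1.0
```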
### Evaluation System (`bench/`)
Three progressive approaches to testing AI diagnostic capabilities: the pipelines started with strict diagnostic-code matching and evolved to include semantic understanding, ensuring fair evaluation of medical expertise.
### Data Processing (`data29/`)
Manages nearly 10,000 medical cases from hospitals, medical exams, and rare disease databases. Creates balanced test sets that represent real medical diversity.
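As a rough illustration of how such balancing can work, the sketch below draws an equal-sized sample from each source. It assumes cases are dicts with a grouping field; the function name and parameters are hypothetical, not the repository's actual loader.

```python
import random
from collections import defaultdict

def balanced_sample(cases, group_key="source", per_group=100, seed=0):
    """Draw up to `per_group` cases from each stratum (hospital, exam,
    rare-disease database, ...) so no single source dominates the test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case in cases:
        groups[case[group_key]].append(case)
    sample = []
    for members in groups.values():
        rng.shuffle(members)
        sample.extend(members[:per_group])
    rng.shuffle(sample)  # avoid source-ordered output
    return sample
```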
### Reusable Tools (`utils/`)
Portable modules for semantic analysis, medical classification systems, and AI model integration. These can be extracted for use in other projects.
The project discovered that expert medical responses were being unfairly penalized. For example, a specialist's diagnosis of "aneurysmal subarachnoid hemorrhage" would score zero against "subarachnoid hemorrhage" despite being more precise. Our semantic safety net fixes this by recognizing medical equivalence.
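A minimal version of such a safety net can be built with off-the-shelf sentence embeddings: try an exact match first, then fall back to cosine similarity. The encoder choice and threshold below are illustrative assumptions, not the pipeline's actual configuration; a clinically tuned encoder would fit better than this general-purpose one.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose stand-in

def diagnoses_match(predicted: str, reference: str, threshold: float = 0.85) -> bool:
    """Exact match first, then embedding similarity, so that
    'aneurysmal subarachnoid hemorrhage' still credits 'subarachnoid hemorrhage'."""
    if predicted.strip().lower() == reference.strip().lower():
        return True
    emb = encoder.encode([predicted, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```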
- Create a Python environment and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  pip install -e .
  ```

- Configure API access in a `.env` file (see `.env.example`)
- Run an evaluation:

  ```bash
  cd bench/pipelines/pipeline_v2*
  python run.py
  ```

Testing revealed clear performance differences between AI models:
- Advanced visual models showed highest capability but variable consistency
- Standard language models provided reliable, stable performance
- Specialized medical models offered domain expertise with limitations
For deeper understanding:
- 📄 Complete Research Study - Comprehensive research paper with methodology, statistical analysis, and findings
- 🔬 Pipeline V4 - Advanced Evaluation System - State-of-the-art diagnostic AI evaluation pipeline with dual methodology
- Conceptual model and research findings
- Pipeline methodology details
- Data processing documentation
- Reusable utilities
This framework helps teams make informed decisions when selecting AI models for medical applications. It provides objective, reproducible metrics while respecting the nuanced nature of medical diagnosis.
A project focused on advancing responsible medical AI evaluation through transparent, clinically-relevant assessment methods.