A systematic evaluation framework for measuring how well AI models perform medical diagnosis tasks. This project compares different AI systems by testing their ability to identify medical conditions and understand their severity.
This system presents medical cases to AI models and evaluates their diagnostic responses. It addresses a key problem in medical AI evaluation: traditional exact-match methods penalize correct but differently worded diagnoses. Our approach recognizes when diagnoses are medically equivalent even when the exact words differ.
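In outline, an evaluation run is a simple loop: present each case, collect the model's diagnosis, and score it against a reference. The sketch below is illustrative only; `Case`, `score_diagnosis`, and `evaluate` are hypothetical names rather than this repository's API, and the scorer shown is the strict exact-match tier that the semantic approach later relaxes.

```python
from dataclasses import dataclass

@dataclass
class Case:
    presentation: str         # clinical vignette shown to the model
    reference_diagnosis: str  # gold-standard answer

def score_diagnosis(predicted: str, reference: str) -> float:
    """Strictest tier: exact match. The semantic tier relaxes this."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0

def evaluate(ask_model, cases: list[Case]) -> float:
    """Mean score for one model; `ask_model` maps a prompt to a diagnosis string."""
    return sum(
        score_diagnosis(ask_model(c.presentation), c.reference_diagnosis)
        for c in cases
    ) / len(cases)

# Stand-in "model" for demonstration:
cases = [Case("Thunderclap headache, neck stiffness", "subarachnoid hemorrhage")]
print(evaluate(lambda prompt: "subarachnoid hemorrhage", cases))  # 1.0
```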
### Evaluation System (`bench/`)
Three progressive approaches to testing AI diagnostic capabilities: the pipelines started with strict diagnostic-code matching and evolved to include semantic understanding, ensuring fair evaluation of medical expertise.
### Data Processing (`data29/`)
Manages nearly 10,000 medical cases from hospitals, medical exams, and rare disease databases. Creates balanced test sets that represent real medical diversity.
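As a rough illustration of how such balancing can work, the sketch below draws an equal-sized sample from each source. It assumes cases are dicts with a grouping field; the function name and parameters are hypothetical, not the repository's actual loader.

```python
import random
from collections import defaultdict

def balanced_sample(cases, group_key="source", per_group=100, seed=0):
    """Draw up to `per_group` cases from each stratum (hospital, exam,
    rare-disease database, ...) so no single source dominates the test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case in cases:
        groups[case[group_key]].append(case)
    sample = []
    for members in groups.values():
        rng.shuffle(members)
        sample.extend(members[:per_group])
    rng.shuffle(sample)  # avoid source-ordered output
    return sample
```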
### Reusable Tools (`utils/`)
Portable modules for semantic analysis, medical classification systems, and AI model integration. These can be extracted for use in other projects.
The project discovered that expert medical responses were being unfairly penalized. For example, a specialist's diagnosis of "aneurysmal subarachnoid hemorrhage" would score zero against "subarachnoid hemorrhage" despite being more precise. Our semantic safety net fixes this by recognizing medical equivalence.
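A minimal version of such a safety net can be built with off-the-shelf sentence embeddings: try an exact match first, then fall back to cosine similarity. The encoder choice and threshold below are illustrative assumptions, not the pipeline's actual configuration; a clinically tuned encoder would fit better than this general-purpose one.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose stand-in

def diagnoses_match(predicted: str, reference: str, threshold: float = 0.85) -> bool:
    """Exact match first, then embedding similarity, so that
    'aneurysmal subarachnoid hemorrhage' still credits 'subarachnoid hemorrhage'."""
    if predicted.strip().lower() == reference.strip().lower():
        return True
    emb = encoder.encode([predicted, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```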
- Create a Python environment and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  pip install -e .
  ```

- Configure API access in a `.env` file (see `.env.example`)
- Run an evaluation:

  ```bash
  cd bench/pipelines/pipeline_v2*
  python run.py
  ```

Testing revealed clear performance differences between AI models:
- Advanced visual models showed highest capability but variable consistency
- Standard language models provided reliable, stable performance
- Specialized medical models offered domain expertise with limitations
For deeper understanding:
- 📄 Complete Research Study - Comprehensive research paper with methodology, statistical analysis, and findings
- 🔬 Pipeline V4 - Advanced Evaluation System - State-of-the-art diagnostic AI evaluation pipeline with dual methodology
- Conceptual model and research findings
- Pipeline methodology details
- Data processing documentation
- Reusable utilities
This framework helps teams make informed decisions when selecting AI models for medical applications. It provides objective, reproducible metrics while respecting the nuanced nature of medical diagnosis.
A project focused on advancing responsible medical AI evaluation through transparent, clinically-relevant assessment methods.