Foundation models have shown great promise in processing electronic health records (EHRs), offering the flexibility to handle diverse medical data modalities such as text, time series, and images. This repository presents a comprehensive benchmark framework designed to evaluate the predictive performance, fairness, and interpretability of foundation models—both as unimodal encoders and multimodal learners—using the publicly available MIMIC-IV database.
To support consistent and reproducible evaluation, we developed a standardized data processing pipeline that harmonizes heterogeneous clin- ical records into an analysis-ready format. We systematically compared eight foundation models, encompassing both unimodal and multimodal models, as well as domain-specific and general-purpose variants.
The structure of this repository is detailed as follows:
scripts/1_...contains the scripts for data processing pipelinescripts/2_... & 3_...contains the scripts for evaluating foundation models as unimodal encodersscripts/4_..._...contains the scripts for evaluating foundation models as multimodal learners
For data access and description, please visit: https://mimic.mit.edu
MIMIC-IV v2.2 https://physionet.org/content/mimiciv/2.2
MIMIC-CXR https://physionet.org/content/mimic-cxr/2.0.0
MIMIC-IV-Note https://physionet.org/content/mimic-iv-note/2.2/note
The following sub-sections describe the workflow for our benchmark and how they should ideally be run.
For generating master dataset, run 1_1 Master Dataset Generation.ipynb
For generating benchmark dataset, we used
python "1_2 Data Processing Pipeline.py --input-path {input_path} --output-pkl-path {output_pkl_path} --output-csv-path {output_csv_path} --age-lower 18 --start-diff 0 --end-diff 24
Key Arguements:
input_path: Path to master dataset.output_pkl_path: Output path for processed ICU stay data.output_csv_path: Output path for metadata table.age_lower: Lower bound of patient age.start_diff: Time difference between the start of information collection period and ICU admission time(positive/negative) in hourend_diff: Time difference between the end of information collection period and ICU admission time(positive only) in hour- For the full list of arguments, please refer to the script
1_2 Data Processing Pipeline.py
For extracting representations from each data modality,
Demographics
python "2_1 Embedding Extraction for Demographics.py" --input-icu-path {input_icu_path} --input-metadata-path {input_metadata_path} --output-path {output_path}
input_icu_path: Path to processed ICU stay datainput_metadata_path: Path to metadata tableoutput_path: Output path for extracted embeddings
Time-series
python "2_2 Embedding Extraction for Time-series.py" --input-icu-path {input_icu_path} --input-metadata-path {input_metadata_path} --output-path {output_path} --emb-technique {emb_technique}
emb_technique: Select from three embedding techniques for time-series data: fixed time interval / gru / moment
Images
For extracting embeddings with CXR-Foundation, we recommend run 2_3_1 Embedding Extraction with CXR-Foundation.ipynb with Google Colab
For extracting embeddings with Swin Transformer,
python "2_3_2 Embedding Extraction with Swin Transformer.py" --input-icu-path {input_icu_path} --input-metadata-path {input_metadata_path} --output-path {output_path}
Notes
python "2_4 Embedding Extraction for Clinical Notes.py" --input-icu-path {input_icu_path} --input-metadata-path {input_metadata_path} --output-path {output_path} --emb-technique {emb_technique}
emb_technique: Select from two embedding techniques for free-text data: openai / radbert
For combine embeddings from each data modalities into a unified representation and train Logistic Regression for outcome prediction,
python "3_1 Classification.py" --input-icu-path {input_icu_path} --input-metadata-path {input_metadata_path} --embedding-path {embedding_path} --output-path {output_path}
--time-series-model {time_series_model} --image-model {image_model} --text-model {text_model} --outcome {outcome}
embedding_path: Path to extracted embeddingsoutput_path: Output path for predicted resultstime_series_model: Model used for extracting embeddings from time-series dataimage_model: Model used for extracting embeddings from image datatext_model: Model used for extracting embeddings from free-text dataoutcome: Outcome of interest: in-hospital motality/ length of stay
For generating prompts for large vision-language models, run 4_1 Generate QA Dataset.ipynb
- Kunyu Yu (Email: [email protected])
- Nan Liu (Email: [email protected])