MIMIC-III Clinical
Database
MIMIC-III CLINICAL DATASET
• MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand
patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
• The database includes information such as demographics, vital sign measurements made at the bedside, laboratory test results,
procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge).
• The MIMIC-III database was populated with data from several sources, including
• Archives from critical care information systems
• Hospital electronic health record databases
• Social Security Administration Death Master File.
• Two different critical care information systems were in place over the data collection period: Philips CareVue Clinical
Information System (Philips Healthcare, Andover, MA) and iMDsoft MetaVision ICU (iMDsoft, Needham, MA).
• Additional information was collected from hospital and laboratory health record systems, including:
• patient demographics and in-hospital mortality.
• laboratory test results.
• discharge summaries and reports of electrocardiogram and imaging studies.
• billing-related information such as International Classification of Diseases, 9th Revision (ICD-9) codes, Diagnosis Related
Group (DRG) codes, and Current Procedural Terminology (CPT) codes.
• Before data was incorporated into the MIMIC-III database, it was first deidentified in accordance with Health Insurance
Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. The deidentification
process for structured data required the removal of data elements listed in HIPAA, including fields such as patient name,
telephone number, address, and dates.
• Protected health information was also removed from free-text fields, such as diagnostic reports and physician notes.
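The date-shifting step can be illustrated with a minimal sketch. It assumes that each patient receives a single random offset that is applied to all of that patient's dates, so that intervals such as length of stay are preserved. The function names, the offset range, and the sample records are hypothetical and are not taken from the actual MIMIC-III deidentification code.

```python
import random
from datetime import datetime, timedelta


def make_offsets(patient_ids, seed=0, min_days=365, max_days=3650):
    """Draw one random shift per patient (the range is illustrative only)."""
    rng = random.Random(seed)
    return {
        pid: timedelta(days=rng.randint(min_days, max_days))
        for pid in sorted(patient_ids)
    }


def shift_dates(records, offsets):
    """Return copies of the records with every datetime field shifted
    by the offset assigned to the record's patient."""
    shifted = []
    for rec in records:
        delta = offsets[rec["subject_id"]]
        shifted.append({
            key: (value + delta if isinstance(value, datetime) else value)
            for key, value in rec.items()
        })
    return shifted


# Made-up records for two hypothetical patients.
records = [
    {"subject_id": 1,
     "admittime": datetime(2005, 3, 1, 8, 0),
     "dischtime": datetime(2005, 3, 4, 15, 30)},
    {"subject_id": 2,
     "admittime": datetime(2009, 7, 10, 22, 15),
     "dischtime": datetime(2009, 7, 12, 9, 0)},
]

offsets = make_offsets({r["subject_id"] for r in records})
shifted = shift_dates(records, offsets)

# Both dates of a patient move by the same amount, so the stay length is unchanged.
for before, after in zip(records, shifted):
    stay_before = before["dischtime"] - before["admittime"]
    stay_after = after["dischtime"] - after["admittime"]
    print(before["subject_id"], stay_before == stay_after)
```

Because the offset is fixed per patient, analyses that depend on elapsed time within a patient still work on the shifted data, while the absolute calendar dates no longer match the real ones.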
Data Description
• MIMIC-III is a relational database consisting of 26 tables. Tables are linked by identifiers which usually have the suffix ‘ID’.
• Charted events such as notes, laboratory tests, and fluid balance are stored in a series of ‘events’ tables.
• Five tables are used to define and track patient stays: ADMISSIONS; PATIENTS; ICUSTAYS; SERVICES; and
TRANSFERS. Another five tables are dictionaries for cross-referencing codes against their respective definitions: D_CPT;
D_ICD_DIAGNOSES; D_ICD_PROCEDURES; D_ITEMS; and D_LABITEMS.
• The current version of the database is MIMIC-III v1.4, released on 2 September 2016. It was a major release that improved data
quality and added a large amount of data for MetaVision patients.
• Reference link: https://physionet.org/content/mimiciv/2.2/
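The way the tables link through their 'ID' columns can be sketched with a small in-memory SQLite database. This is only an illustration: the tables are cut down to a few columns, every row is made up, and SQLite stands in for whatever database engine holds the real data. The identifiers SUBJECT_ID (patient), HADM_ID (hospital admission), ICUSTAY_ID (ICU stay), and ITEMID (dictionary key) follow the MIMIC-III naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients   (subject_id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE admissions (hadm_id INTEGER PRIMARY KEY, subject_id INTEGER,
                         admission_type TEXT);
CREATE TABLE icustays   (icustay_id INTEGER PRIMARY KEY, hadm_id INTEGER,
                         subject_id INTEGER);
CREATE TABLE d_labitems (itemid INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE labevents  (subject_id INTEGER, hadm_id INTEGER,
                         itemid INTEGER, valuenum REAL);

-- Toy rows; none of these values come from the real database.
INSERT INTO patients   VALUES (10, 'F'), (11, 'M');
INSERT INTO admissions VALUES (100, 10, 'EMERGENCY'), (101, 10, 'ELECTIVE'),
                              (102, 11, 'EMERGENCY');
INSERT INTO icustays   VALUES (1000, 100, 10), (1001, 102, 11);
INSERT INTO d_labitems VALUES (1, 'Creatinine'), (2, 'Glucose');
INSERT INTO labevents  VALUES (10, 100, 1, 1.2), (11, 102, 2, 140.0);
""")

# Follow a patient through admissions to ICU stays via the shared IDs.
stays = conn.execute("""
    SELECT p.subject_id, a.hadm_id, i.icustay_id
    FROM patients p
    JOIN admissions a ON a.subject_id = p.subject_id
    JOIN icustays  i ON i.hadm_id = a.hadm_id
    ORDER BY p.subject_id, a.hadm_id, i.icustay_id
""").fetchall()

# Resolve coded lab events against the D_LABITEMS dictionary table.
labs = conn.execute("""
    SELECT l.subject_id, d.label, l.valuenum
    FROM labevents l
    JOIN d_labitems d ON d.itemid = l.itemid
    ORDER BY l.subject_id, d.label
""").fetchall()

print(stays)
print(labs)
```

Admission 101 has no ICU stay in the toy data, so the inner join drops it. A LEFT JOIN would keep such admissions with an empty ICUSTAY_ID.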
Instructions for getting access to MIMIC-IV Dataset
1. Researchers seeking to use the database must:
• Complete a recognized course in protecting human research participants that includes Health Insurance Portability and
Accountability Act (HIPAA) requirements
• Sign a data use agreement, which outlines appropriate data usage and security standards, and forbids efforts to
identify individual patients.
2. To create a credentialed user account on PhysioNet, fill out the following form:
https://physionet.org/credential-application/.
3. When the application has been approved, the user will receive an email notification. Approval may take several business
days, and will be delayed if the request is missing any required information.
4. Users must complete the training course in human subjects research, which is available at
https://physionet.org/content/mimiciv/view-required-training/2.2/#1. Step-by-step instructions for completing the CITI
training are available at https://physionet.org/about/citi-course/
5. The last step is to sign the data use agreement for the project.
MIMIC-III and IV Dataset for Clinical SOAP Notes Generation
Team Cadence at MEDIQA-Chat 2023: Generating, augmenting and summarizing clinical dialogue with large
language models
Abstract: This paper describes Team Cadence’s winning submission to Task C of the MEDIQA-Chat 2023 shared tasks. Because only a
small volume of training data was available, the team adopted a data-augmentation-first approach to all three tasks by focusing
on the dialogue generation task, i.e., Task C. To generate synthetic patient-doctor conversations, a sample of one thousand
discharge summary notes was collected from the MIMIC-IV Note dataset (Johnson et al., 2023; Goldberger et al., 2000). The
resulting dialogue-note pairs were then added to the Task A and Task B training datasets provided by the organizers for
downstream data augmentation.
Link to the paper: https://aclanthology.org/2023.clinicalnlp-1.28.pdf
PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient
Dialogues to Medical Records
Abstract: This paper describes PULSAR, a system submission to the ImageCLEF 2023 MEDIQA-Sum task on summarizing
patient-doctor dialogues into clinical records. The proposed framework relies on domain-specific pre-training to produce a
specialized language model, which is then trained on task-specific natural data augmented by synthetic data generated by a
black-box LLM. To give the model sufficient medical knowledge, the team used MIMIC-III as a pre-training corpus of about
2 million records, including clinical records such as admission notes, discharge summaries, and lab results, to
pre-train a Flan-T5 model to predict missing medical terms in notes.
Link to the paper: https://arxiv.org/pdf/2307.02006.pdf
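The idea of predicting missing medical terms can be illustrated with a minimal sketch of how one training pair might be built. It assumes T5-style sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...) and a hand-written list of terms. The function name, the term list, and the sample note are hypothetical, and the paper's actual term extraction and pre-processing may differ.

```python
import re


def mask_terms(note, terms):
    """Replace each occurrence of a listed term with a T5-style sentinel.

    Returns (masked_input, target), where the target lists each sentinel
    followed by the text it replaced and ends with one final sentinel.
    """
    if not terms:
        return note, "<extra_id_0>"

    # Try longer terms first so "chest pain" is preferred over "pain".
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)),
        flags=re.IGNORECASE,
    )

    pieces, target, last, n = [], [], 0, 0
    for match in pattern.finditer(note):
        sentinel = f"<extra_id_{n}>"
        pieces.append(note[last:match.start()])
        pieces.append(sentinel)
        target.append(f"{sentinel} {match.group(0)}")
        last = match.end()
        n += 1
    pieces.append(note[last:])
    target.append(f"<extra_id_{n}>")  # closing sentinel
    return "".join(pieces), " ".join(target)


# Made-up note and term list for illustration.
note = "Patient reports chest pain and was started on aspirin."
masked, target = mask_terms(note, ["chest pain", "aspirin"])
print(masked)  # Patient reports <extra_id_0> and was started on <extra_id_1>.
print(target)  # <extra_id_0> chest pain <extra_id_1> aspirin <extra_id_2>
```

A sequence-to-sequence model trained on such pairs receives the masked note as input and learns to generate the target, which forces it to recover the medical terms from context.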
Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models
Abstract: This paper proposes a task setup that consists of: (1) real de-identified clinical notes datasets used to
train models, which in turn generate synthetic notes; (2) privacy measures used to estimate the privacy
preservation properties of the synthetic notes; and (3) utility benchmarks used to estimate the usefulness of the
notes. To compose the real clinical notes dataset, the paper uses MIMIC-III (v1.4) (Johnson et al., 2016), a large
de-identified database that comprises nearly 60,000 hospital admissions for 38,645 adult patients.
Link to the paper: https://arxiv.org/pdf/1905.07002.pdf