MIMIC in The OMOP Common Data Model
Objectives: In the era of big data, the intensive care unit (ICU) is very likely to
benefit from real-time computer analysis and modeling based on close patient
monitoring and Electronic Health Record data. MIMIC is the first open access database
in the ICU domain. Many studies have shown that common data models (CDMs)
improve database searching by allowing code, tools and experience to be shared.
OMOP-CDM is spreading all over the world. The objective was to evaluate the
difficulty of transforming MIMIC into an OMOP (MIMIC-OMOP) database and the
benefits of this transformation for analysts.
Material & Method: A documented, tested, versioned, exemplified and open
repository has been set up to support the transformation and improvement of the
MIMIC community’s source code. The resulting data set was evaluated over a
48-hour datathon.
Result: With an investment of 2 people for 500 hours, 64% of the data items of
the 26 MIMIC tables have been standardized into the OMOP CDM and 78% of the
source concepts mapped to reference terminologies. The model proved its ability to
support community contributions and was well received during the datathon, with
160 participants and 15,000 queries executed with a maximum duration of one
minute.
Conclusion: The resulting MIMIC-OMOP dataset is the first MIMIC-OMOP
dataset available free of charge with real de-identified data ready for replicable
intensive care research. This approach can be generalized to any medical field.
code and tools [8, 9]. However, some studies have shown that the results are not
fully reproducible from one CDM to another [10] or from one centre to another [11].
Some approaches argue that keeping the local conceptual model [12] and the local
structural model [13] leads to better results. On the one hand, keeping MIMIC in its
specific form will not solve the limitation for multicenter research; on the other
hand, a fully standardized form would introduce other disadvantages, such as loss of
data and lower computational performance. The ideal solution is probably in between,
allowing local or standardized analysis depending on the research question.

OMOP (Observational Medical Outcomes Partnership Common Data Model) is a CDM
originally designed for multicenter research on adverse drug events that now extends
to medical, laboratory and genomic use cases. OMOP provides structural and conceptual
models relying on reference terminologies such as SNOMED for diagnoses, RxNorm for
drugs and LOINC for laboratory results. Several examples of databases transformed into
OMOP have been published [14, 15] and OMOP stores 682 million patient records from
around the world [16]. Each clinical area is stored in its own dedicated tables. The
OMOP conceptual model is based on a closure table pattern 1 capable of ingesting
simple, hierarchical and graph terminologies such as SNOMED. In addition to local
terminologies, OMOP defines and maintains a set of standard terminologies to be mapped
unidirectionally (local to standard) by implementers. Although OMOP has proven its
reliability [17], the concept mapping process is known to have an impact on results
[18] and applying the same protocol to different data sources leads to different
results [11]. This shows the importance of keeping local terminologies so that local
analysis is still possible. Preliminary work has been done on the translation of MIMIC
into OMOP [19]. This work remains to be refined and updated for proper evaluation.

In a recent comparative study of different CDMs [8, 20], OMOP obtained the best
results for completeness, integrity, flexibility, simplicity of integration and
implementability, with a wider coverage of the structural and conceptual model, a more
systematic analysis thanks to an analytical library and visualization tools, and easier
access to data through SQL queries. In terms of conceptual approach, OMOP offers a
broader set of standard concepts. In terms of structure, it is very rigorous about how
data should be loaded into specific tables, while other CDMs such as i2b2 are very
flexible, with a general table that covers all data domains. This rigorous approach is
necessary for standardization. Previous work has been done to load MIMIC-III into i2b2
[21]; however, the work could not be finalized because of the difficulty of mapping
concepts to standard terminologies. OMOP has the advantage of not making the
terminology mapping step mandatory, since it keeps the local codes accessible to
analysts. Compared to the Fast Healthcare Interoperability Resources (FHIR) 2, OMOP
performs better as a conceptual CDM because the FHIR resources currently do not specify
the terminology to be used for most attributes. The OMOP relational model can be
materialized in csv format and stored in any relational database, whereas FHIR uses
json files and needs some processing and higher skills to exploit. Among the above
models, OMOP is the best candidate to overcome the MIMIC limitations mentioned earlier.

Our article was guided by the following two dimensions:
1. Data Transformations: evaluate the process of transforming MIMIC into OMOP in terms
   of time needed, skills required and quality of the result.
2. Data Analytics: evaluate the ability of the resulting dataset to support efficient,
   shareable and real-time analysis.

1 MATERIAL & METHOD

1.1 Data Transformations
The majority of the source code is implemented in PostgreSQL 9.6.9 (Postgres) because
it is the primary support for the MIMIC database, allows the community to reproduce our
work on limited resources without licensing costs and benefits from recent Postgres
improvements in the data processing area. Some more elaborate data transformations have
been implemented as Postgres functions.

The OMOP CDM version 5.3.3.1 (OMOP) tables were created from the provided scripts, with
some changes documented in our scripts. OMOP defines 15 standardized clinical data
tables, 3 health system data tables, 2 health economics data tables, 5 tables for
derived elements and 12 tables for standardized vocabulary. The vocabulary tables were
loaded from concepts downloaded from Athena 3 and the clinical and derived tables were
loaded from MIMIC.

MIMIC-III version 1.4.21 (MIMIC) was also loaded into Postgres with the provided
scripts. A subset of 100 patients out of the 46,000 MIMIC patients was selected for its
broad representativeness of the database and cloned into a second instance to serve as
a light and representative development set.

1 https://karwin.blogspot.com/2010/03/rendering-trees-with-closure-tables.html
2 https://www.hl7.org/fhir/
3 https://www.ohdsi.org/analytic-tools/athena-standardized-vocabularies/
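The closure table pattern mentioned in the introduction can be illustrated with a minimal sketch. It assumes a toy terminology with made-up concept ids, and a reduced version of OMOP's concept_ancestor table holding only the ancestor, the descendant and the minimum number of levels between them. The helper names (build_closure, descendants) are hypothetical and SQLite stands in for Postgres.

```python
import sqlite3
from collections import deque

# Hypothetical toy terminology (ids are illustrative, not real codes).
# Concept 3 has two parents, so this is a graph rather than a strict tree.
EDGES = [(1, 2), (2, 3), (1, 4), (4, 3)]  # (parent_id, child_id)


def build_closure(edges):
    """Expand parent/child edges into (ancestor, descendant, min_levels) rows."""
    children = {}
    nodes = set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))

    rows = []
    for start in sorted(nodes):
        depth = {start: 0}  # each concept is stored as its own ancestor
        queue = deque([start])
        while queue:  # breadth-first walk, so the first visit is the shortest path
            node = queue.popleft()
            for child in children.get(node, []):
                if child not in depth:
                    depth[child] = depth[node] + 1
                    queue.append(child)
        rows.extend((start, node, levels) for node, levels in depth.items())
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_ancestor ("
    " ancestor_concept_id INTEGER,"
    " descendant_concept_id INTEGER,"
    " min_levels_of_separation INTEGER)"
)
conn.executemany(
    "INSERT INTO concept_ancestor VALUES (?, ?, ?)", build_closure(EDGES)
)


def descendants(conn, concept_id):
    """Return concept_id and everything below it with a single lookup."""
    cur = conn.execute(
        "SELECT descendant_concept_id FROM concept_ancestor"
        " WHERE ancestor_concept_id = ? ORDER BY descendant_concept_id",
        (concept_id,),
    )
    return [row[0] for row in cur]
```

Because every ancestor/descendant pair is precomputed, a query such as descendants(conn, 1) needs no recursion at read time, which is what lets the same table serve flat, hierarchical and graph terminologies.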
medRxiv preprint doi: https://doi.org/10.1101/2020.08.14.20175141.this version posted August 17, 2020. The copyright holder for this
preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in
perpetuity. MIMIC in the OMOP Common Data Model
It is made available under a CC-BY 4.0 International license .
4 https://github.com/MIT-LCP/mimic-code
5 http://forums.ohdsi.org/
6 https://github.com/MIT-LCP/mimic-omop
7 https://www.ohdsi.org/web/achilles/
8 http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:concept
N.PARIS AND A.PARROT
same. In the second case, MIMIC data is not coded in the standard OMOP terminologies,
but the mapping is already provided by OMOP (e.g. ICD9/SNOMED-CT), so the domain tables
have been loaded accordingly. In the third case, terminology mapping is not provided,
but the terminology is small enough to be mapped manually in a few hours (such as
demographic status, signs and symptoms). In the fourth case, terminology mapping is not
provided and the terminology consists of a large set of local terms (admission
diagnosis, drugs). In that case, only a subset of the most represented codes was
manually mapped. When manual terminology mapping was required, a mapping csv file was
built. This solution is suited to medical users who have no training in database
engineering. The spreadsheet has several columns, such as local and standard labels,
ids, comments and evaluation metrics, and a script loads it into Postgres once
completed. We chose simple SQL queries that are flexible enough to be run on demand or
to generate a csv pre-filled with the best matches. They use Postgres full-text ranking
features and link local and standard candidates with a rating function based on their
labels. This work was performed under the supervision of an intensivist.

The evaluation phase was both quantitative and qualitative. The quantitative
evaluation measures the completeness of our work: the percentage of concepts that are
mapped to a standard. The qualitative evaluation assesses correctness. For newly
generated mappings, it consisted of manually tagging each mapping with a score between
0 and 1 and optionally writing a comment on each mapped concept. When the mapping was
provided by OMOP (automatic OMOP terminology mapping), the evaluation was performed on
a subset of concepts manually picked within each terminology.

1.2 Data Analytics
Beyond the model transformation and compliance with the OMOP standardisation process,
we carried out some analyses.

MIMIC provides a large number of SQL scripts, known as "contrib", to preprocess and
normalize data, calculate derived scores and define cohorts. Some of them have been
implemented on top of the OMOP format to load the OMOP derived tables.

A set of general denormalized tables has been built on top of the original OMOP
format; they hold the concept_name next to each concept_id column. The concept table is
a central element of OMOP and is therefore involved in many joins to obtain the concept
label. By precalculating the joins with the concept table, the denormalized tables
speed up computation and simplify SQL queries.

In addition, a set of specialized analytical tables has been built on the original
OMOP format. The OMOP microbiologicalevents table is a reorganization of the
measurement table data on microorganisms and the associated antibiotic susceptibility
tests, and is based on the MIMIC microbiologyevents table. The OMOP icustays table
gives quick access to the patients admitted to intensive care and is inspired by the
MIMIC icustays table.

The OMOP note_nlp table was originally designed to store final or intermediate
derived information and metadata from clinical notes. When definitive, the extracted
information is intended to be moved to the dedicated domain or table and then reused as
regular structured data. When the information is still intermediate, it is stored in
the note_nlp table and can be used for later analysis. To populate this table, we
provided two information extraction pipelines. The first pipeline extracted numerical
values such as weight, height, body mass index and left ventricular ejection fraction
from medical notes with a SQL script. The resulting structured numerical values were
loaded into the measurement or observation tables according to their domain. The
second pipeline, a section extractor based on the Apache UIMA framework, divides notes
into sections to help analysts include or exclude certain sections in their analysis.
Section templates (such as "Illness History") have been automatically extracted from
text with regular expressions, then filtered to keep only the most frequent ones
(frequency greater than 1%).

A 48-hour open access datathon9 was set up in Paris at AP-HP (Assistance Publique des
Hopitaux de Paris) in collaboration with MIT once the MIMIC-OMOP transformation was
ready for research. This datathon was organised to evaluate OMOP as an alternative data
model for accessing and analysing MIMIC data during a real event. Scientific questions
had been prepared in an online forum where participants could introduce themselves and
propose a topic or choose an existing one. OMOP was loaded into Apache Hive 1.2.1 in
ORC format. Users had access to the ORC dataset through Jupyter notebooks, a web
interface, with Python, R or Scala. A SQL web client allowed teams to query the same
dataset through Presto. The Hadoop cluster was based on 5 computers with 16 cores and
220 GB of RAM. The MIMIC-OMOP dataset was loaded from a Postgres instance into Hive
through Apache Sqoop 1.4.6, directly in ORC format. Participants also had access to a
SchemaSpy rendering of the OMOP physical data model, with table and column comments and
the primary/foreign key relationships between the tables. All queries have been logged.

2 RESULT
All transformation processes are freely accessible to the public via the MIMIC-OMOP
git repository maintained by MIT-LCP [6]. The repository is based on git and is
designed for sharing, improvement, collaboration and reproducible work. Moreover, it is
archived on a universal and durable software archive solution10. The git repository
centralizes the various resources of this work such as documentation, source code, unit
tests, as well as example queries, discussions and issues. It also points to web
resources such as the physical data model for the MIMIC11 and OMOP12 datasets and the
Achilles web client 13. All the code to create these statistics is provided on the
article’s framagit repository (see Repository Work section).

2.1 Data Transformations
The MIMIC to OMOP conversion was performed by two developers (a data engineer and an
intensivist) over 500 hours. This includes ETL, git documentation, concept mapping,
contributions and unit tests. The ETL (with unit tests and generation of a ready-to-load
archive) on a subset of 100 patients takes five minutes and enables fast development
cycles. The ETL takes 3 hours to process the whole MIMIC database. The resulting csv
archive is almost the same size as the original archive, and MIMIC-OMOP is also the
same size as MIMIC once loaded and indexed into Postgres.

2.1.1 Structural Mapping
The results of the structural mapping are presented in Table 2. Among the 37 OMOP
tables, those related to hospital costs were not applicable, some related to derived
data were not populated and some related to vocabulary were pre-loaded with terminology
information. The 26 tables of MIMIC have been distributed across 19 OMOP tables. The
reduced number of tables results from the differences in design of the two models. OMOP
stores all the terminologies in one table whereas MIMIC has one table for each
terminology. The same applies to fact data, which are grouped by nature in OMOP while
MIMIC tables are more specialized and respect the source EHR’s design. For example, the
measurement table gathers measured information and combines 4 source tables, resulting
in 365,181,104 rows, which is 20% more than the largest MIMIC table. To some extent
this is a regression in terms of performance. Two important tables are provided by OMOP
to represent the relationships between the data: concept_relationship and
fact_relationship. We used them to link the drugs that make up a solution, for
microbiology / antibiograms and for visit_detail / care_site links. The following SQL
query (listing 1) shows how a microorganism is linked to its susceptibility test by a
fact_relationship and illustrates the flexibility of the model. However, this
flexibility affects the simplicity and the performance of the model by increasing the
number of joins within SQL queries.

Listing 1. Original table microbiology SQL query
SELECT measurement_source_value
     , value_as_concept_id
     , concept_name
FROM measurement
JOIN concept resistance
  ON value_as_concept_id = concept_id
JOIN fact_relationship
  ON measurement_id = fact_id_2
JOIN
(
  SELECT measurement_id AS id_is_staph
  FROM measurement
  WHERE
        measurement_type_concept_id = 2000000007
        -- 'Labs - Culture Organisms'
    AND value_as_concept_id = 4149419
        -- 'Staph aureus coag +'
    AND measurement_concept_id = 46235217
        -- 'Bacteria identified in Blood product unit.autologous by Culture'
) staph ON id_is_staph = fact_id_1
WHERE TRUE
  AND measurement_type_concept_id = 2000000008
      -- 'Labs - Culture Sensitivity'

Table 3 presents the basic characterization of the MIMIC-OMOP population and assesses
the overall quality of the structural mapping. Fortunately, most statistics remain
similar between the two versions, with a few differences. Table 3 shows that MIMIC
contains 61,532 intensive care stays while OMOP contains 71,576. This represents a 16%
increase in stays. By design, MIMIC aggregates information from various systems. Thus
the transfer information is divided into several tables, such as admissions, transfers
and icustays. OMOP instead centralizes this information in the visit_detail table. We
also added emergency stays as a normal location for patients throughout their hospital
stay (unlike what had been done by MIMIC). The MIMIC icustays table has not been
transformed because it is derived from the transfers table14, and we decided to assign
a new visit_detail row for each ICU stay (based on the transfers table) while MIMIC
preferred to assign a new ICU stay if a new admission occurs > 24h after the end of the
previous stay.

This table also shows an increase in the number of laboratory measurements per
admission. This is because MIMIC-OMOP gathers laboratory data from both the dedicated
MIMIC laboratory table and the chartevents table, which is usually not considered for
this purpose. In OMOP a specimen (e.g. a blood sample) can be linked to many laboratory
results, because one blood sample can be used for several tests. We decided to create
as many specimen rows as laboratory tests because this information is not present in
MIMIC. The same was true when date information was not provided (start/end_datetime for
drug_exposure).

As shown in Table 4, from 20% to 80% of the source columns have not been kept. Almost
all were redundant with others or provided derived information. The main concern is the
loss of some timestamps. For example, the MIMIC chartevents table provides the
storetime and charttime columns, but OMOP only provides one column to store the
timestamp. Thus, the MIMIC storetime column, which was considered the less valuable,
was dropped during the ETL.

As mentioned in the methods, incorrect entries are not kept in the process. According
to Table 4, four MIMIC tables (inputevents_mv, chartevents, procedureevents_mv, note)
have rows deleted in the ETL process. All of them were tagged in MIMIC as erroneous or
cancelled.

A set of minor modifications of the OMOP table structure was made in order to fit the
data. All length-limited character columns have been changed to unlimited length, since
the limit could cause unpredictable truncation of content and removing it has no
negative impact on Postgres storage size or performance. The visit_occurrence and
visit_detail tables have been corrected following discussions on the OHDSI forum. The
note_nlp table has been extended with fields mentioned in the online documentation but
forgotten in the scripts. In addition, the offset column has been divided into two
integer columns, offset_begin and offset_end, because offset is a SQL reserved word and
it makes sense to fill the resulting columns with integer values.

All the PgTAP unit tests passed. Moreover OMOP had a 100% match of the integrity
constraints and the foreign key relationships of the data models. After 18

9 http://blogs.aphp.fr/dat-icu/
10 https://www.softwareheritage.org/
11 https://mit-lcp.github.io/mimic-omop/schemaspy-mimic
12 https://mit-lcp.github.io/mimic-omop/schemaspy-omop
13 https://mit-lcp.github.io/mimic-omop/AchillesWeb
14 https://mimic.physionet.org/mimictables/icustays/
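The Methods describe generating a csv pre-filled with the best standard candidates for each local label, ranked by a rating function on the labels. The sketch below illustrates that idea only. It swaps in a plain token-overlap (Jaccard) score for the Postgres full-text ranking the authors used, and the concept ids, labels and function names are all hypothetical.

```python
import csv
import io
import re

# Hypothetical standard concepts: (concept_id, label). Not real OMOP ids.
STANDARD_CONCEPTS = [
    (101, "Heart rate"),
    (102, "Respiratory rate"),
    (103, "Body weight"),
]
# Hypothetical local labels to be mapped.
LOCAL_LABELS = ["Heart Rate (bpm)", "Admission weight"]


def tokens(label):
    """Lower-case alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def label_score(local_label, standard_label):
    """Jaccard overlap between the token sets, between 0 and 1."""
    a, b = tokens(local_label), tokens(standard_label)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matches(local_label, standard_concepts, top_n=2):
    """Return up to top_n (score, concept_id, label) tuples, best first."""
    scored = [
        (label_score(local_label, label), concept_id, label)
        for concept_id, label in standard_concepts
    ]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


def prefilled_csv(local_labels, standard_concepts):
    """Build the csv text that a reviewer would then correct and comment."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(
        ["local_label", "standard_concept_id", "standard_label", "score", "comment"]
    )
    for local_label in local_labels:
        for score, concept_id, label in best_matches(local_label, standard_concepts):
            writer.writerow([local_label, concept_id, label, f"{score:.2f}", ""])
    return buffer.getvalue()
```

The empty comment column is left for the clinician's review, in line with the manual scoring step described in the evaluation phase.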
we know that some ICD-9-CM codes can have a one-to-many match with SNOMED15 (28%).

In several cases, OMOP had no suitable concepts for ICU-specific cases. In
particular, the visit_detail table does not yet introduce relevant information and
duplicates information from the visit_occurrence table. Therefore we extended the
concepts to track bed transfers and room transfers through the admitting_concept_id,
discharge_to_concept_id or visit_type_concept_id columns. These added concepts were
given concept_id values between 2 billion and 2.001 billion to distinguish them from
OMOP concepts (0 to 2 billion) and MIMIC local concepts (> 2.001 billion).

Some local concepts could not be mapped to standard ones. These unmapped concepts are
linked with concept_id = 0 and appear in different cases. In the first case, the local
concept has no equivalent in the standard concept set. In the second case, it has not
yet been mapped and may have a standard equivalent. In the third case, the value is
missing and cannot be mapped. In our opinion, although none of these cases can be used
for standard queries, they should have different concept identifiers in order to be
treated differently (not only concept_id = 0). Some domain_id values do not match the
table name; this makes sense because concepts of the observation domain can be stored
in the measurement table and vice versa. Although various types of information are
stored in the measurement table, the dedicated OMOP concepts for the
measurement_type_concept_id column were not sufficient to distinguish them. Therefore
we added some (Labs - Chemistry, Labs - Culture Organisms, etc.).

text conditions have been normalized and mapped to OMOP standard codes to meet the
conceptual model.

As indicated in the methods section, we have provided many derived values. Common
derived information was introduced and loaded: corrected serum calcium, corrected serum
potassium, P/F ratio, corrected osmolarity, SAPSII.

Denormalized derived tables improve SQL query performance and reduce verbosity. In
addition, the resulting tables are much more human readable, with the concept label
directly in the table, and greatly reduce joins. Therefore, a little denormalization
greatly improves the analyst’s experience of the data model and its simplicity, by
adding some redundancy in the data while not breaking existing SQL queries. Moreover,
these denormalized views are backward compatible and remain standardized, allowing the
creation of multicentric algorithms. We provide two examples of materialized
specialized views, derived from the MIMIC microbiologyevents and icustays tables, that
simplify the experience for scientists (listing 2). These results reflect the lack of
simplicity of the model in its original form, but this can be easily overcome with such
analytics tables.

Listing 2. Optimized and denormalized microbiology table SQL query
SELECT antibiotic_source_value
     , antibiotic_interpretation_concept_id
     , antibiotic_interpretation_concept_name
FROM microbiology
WHERE
      organism_concept_id = 4149419
      -- 'Staph aureus coag +'
  AND specimen_concept_id = 46235217
      -- 'Bacteria identified in Blood product unit.autologous by Culture'
;
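The concept_id ranges described for the conceptual mapping lend themselves to a small helper. The sketch below is an illustration only: the function name and the category labels are hypothetical, and the handling of the exact boundary values (2 billion and 2.001 billion) is an assumption, since the text does not say whether the bounds are inclusive.

```python
ADDED_RANGE_START = 2_000_000_000  # concepts added for this work start here
MIMIC_LOCAL_START = 2_001_000_000  # MIMIC local concepts lie above this value


def concept_origin(concept_id):
    """Classify a concept_id by the range convention described in the text."""
    if concept_id < 0:
        raise ValueError("concept_id must be non-negative")
    if concept_id == 0:
        return "unmapped"      # no standard concept assigned
    if concept_id < ADDED_RANGE_START:
        return "omop"          # concept distributed with OMOP vocabularies
    if concept_id <= MIMIC_LOCAL_START:
        return "added"         # assumption: upper bound treated as inclusive
    return "mimic_local"
```

For instance, the type concept 2000000007 used in listing 1 falls in the "added" range, while 4149419 is an OMOP concept.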
their sections from its standard, considering that these sections were not widely
used16.

The French Hospitals of Paris (AP-HP) organized a datathon with MIMIC-OMOP. 25 teams
totalling 160 participants had 48 hours to undertake a clinical project using the
MIMIC-OMOP database, running 15,000 queries with a maximum duration of one minute. They
had the opportunity to create mixed teams: clinicians brought the issues that required
data mining, as well as their data expertise; data scientists judged the technical
feasibility and finally implemented the various analyses needed. Writing standard
queries (i.e. with standard concepts) requires knowing the organization of relational
models (SQL) and also mastering the graph nature of certain terminologies such as
SNOMED-CT in order to capture all potential codes that might be related to the one
analysts think of first. Overall the teams quickly mastered the OMOP model and managed
to produce results by the end of the datathon. These results argue in favor of the
understandability and simplicity of the model.

3 DISCUSSION

3.1 Data Transformations
The choice of a simple SQL-based ETL over dedicated ETL software has several
advantages. Using SQL as the single language pools both people’s knowledge and computer
resources, allowing analysts to become implementers and to revise code or contribute to
transformations. SQL was also used for the semantic mapping; we did not use a dedicated
language algorithm, although such approaches have proven to be effective [25] and OHDSI
provides USAGI17. Using csv as the format for sharing information is simple and
universal. Both SQL and csv are standard, target a large community (physicians,
engineers and analysts) with translational profiles and are compatible with multiple
technologies.

The run time of the ETL on a Postgres instance on a modest personal computer is
compatible with community work, where a collaborator can clone the source code and
configure a development instance to reproduce or improve the work.

Choosing a public git repository to host documentation and source code allows
analysts to learn more about the project and how to contribute.

The highly active OMOP forum is rich in detail and training material. By contrast,
the implementation guide suffers from not being as detailed or as well maintained. We
believe that the OMOP community would greatly benefit from a systematic and concise
synchronization between the forum, mailing lists, source code repository and end user
documentation.

16 https://loinc.org/news/loinc-version-2-63-and-relma-version-6-22-are-now-available/
17 https://github.com/OHDSI/Usagi

Any data transformation is likely to generate bugs that can later have an impact on
medical research. The foundations of the relational database management system (RDBMS),
such as transactions, standardization and integrity constraints, are built-in
safeguards that have been useful throughout the process. In addition, the unit tests
ensure that past bugs do not reappear. An ideal but complex validation method [26]
would be to replicate existing MIMIC studies and ensure that the results are consistent
across data models. The OHDSI Achilles tool complements our quality assessment. It is
surprisingly slow to run, and its rules and their descriptions are difficult to
understand. More specific ones should be provided and described.

Another missing aspect is a set of quality tables for assessing and measuring data
quality. MIMIC has a column to keep track of corrupted information. It would be
interesting to be able to keep such erroneous data in OMOP and enable research in the
data cleaning/quality field. Although OMOP-CDM provides rules for naming columns, it
contains some mistakes that we had to correct. On the one hand, it is a problem for a
CDM to contain errors; on the other hand, it was easy to report these issues, which are
now corrected.

3.2 Data Analytics
It is important that OMOP maintains a level of standardisation in order to simplify
ETL and make it consistent. However, once this is done, it makes sense to give access
to scientific data through more denormalized and specialized tables. There are many
concerns about OMOP’s performance and optimization. However, there will never be a
perfect multi-purpose table, and it is the responsibility of data scientists to build
their own simplified, specialized tables for their research, to respond effectively and
clearly to their needs.

The derived data integrate quite well into OMOP. We used note_nlp to store
information derived from notes, measurement to store derived numerical information and
cohort_attribute to store derived scores. However, it is not yet clear whether derived
data should be stored by domain or in dedicated derived tables. We found that there are
no tables to track the source and description of these data.

The note section extractor pipeline we used was based on the Apache UIMA framework.
While some methods already exist to extract medical sections [27], the upfront effort
of describing sections was too high, and we opted for a naive approach.

Last but not least, as noted in the introduction, a good CDM for the ICU would allow
for near real-time early warning systems and inference modelling on fresh data. OMOP is
clearly designed to provide a static data set and does not have real-time ingestion and
data versioning control mechanisms like EHR
usually do. Analysis of static data sets is essential for reproducible results. However, when the algorithms need to be moved to the bedside, it is necessary to have fresh data and a way of re-identifying the patient, which OMOP does not yet provide. That said, a solution like HL7 FHIR is a great way to implement real-time inference from EHR data, and this is how FHIR and OMOP are complementary. This has already been studied¹⁸ but needs further optimisation.

The datathon showed that distributed platforms with basic hardware provide SQL tools for Online Analytical Processing (OLAP) with excellent performance that overcomes RDBMS weaknesses. They therefore benefit from SQL analysis functions such as grouping, windowing, assembling and mathematical functions, which are often missing in NoSQL databases. While some are open-source, these distributed technologies are not easily accessible, but cloud-based solutions are increasingly affordable for researchers.

The real-life test of the datathon revealed the strong need to make the physical data model accessible, including comments on columns and tables, and we discovered that an open-source tool called schemaspy was very helpful. In addition, we found that the git repository is the best place to document and interact with the community.

18 http://omoponfhir.org/

4 CONCLUSION

The transformation of MIMIC into the OMOP model is a success, and it required efforts that remain reasonable. It is and always will be a work in progress because standard concept mapping is an almost infinite process with constant improvements. Fortunately, the published version of MIMIC-OMOP is research-ready and already offers the same scope of data as the original MIMIC version, and even more with the derived data. It is publicly available on the git repository and has been designed to be easily revised, copied or enriched according to the OMOP or MIMIC philosophy by any user who knows SQL.

The OMOP model is powerful because it allows a broad spectrum of analysis, from specialized local models to evidence-based statistical analysis, in an easy-to-learn and accessible format. The major complexity of this model is intrinsically linked to the complexity of terminologies, with the use of its closure table [28].

Compared to the original MIMIC data model, working with OMOP offers the ability to write standard code and analyses that could benefit other international users.

As we have seen, the effectiveness of the OMOP model has some weaknesses because it seems to focus on consistency rather than performance. However, we have shown that it is easy to overcome these weaknesses and improve OMOP with a set of design or technology optimizations and a dedicated structure that ultimately remains standard and shareable because it derives from the original model.

The first major contribution of this study is to evaluate OMOP in the context of a freely accessible and well-known database, MIMIC. The second major contribution is to provide a freely accessible dataset in the OMOP format that could be useful to researchers. The third major contribution is to share with the OMOP community some useful transformations dedicated to intensive care that can be reused on any OMOP data set.

Future studies will evaluate structural and conceptual mapping through practical research studies on local and standard coding. In addition, we plan to enhance the USAGI OHDSI concept mapping tool to enable international concept mapping suggestions to transform other foreign ICU databases. Finally, research remains to be done on how to articulate FHIR and OMOP to get the best of both worlds (information at the patient level versus information at the multi-center level) and improve bedside care.

5 GRANT

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

6 REPOSITORY WORK

All the LaTeX files, statistics and the PDF of the article are provided online: https://framagit.org/aphp/mimic-omop-article.

7 REFERENCES

[1] D. C. Angus, M. A. Kelley, R. J. Schmitz, A. White, J. Popovich, “Caring for the critically ill patient. Current and projected workforce requirements for care of the critically ill and patients with pulmonary disease: can we meet the requirements of an aging population?” JAMA, vol. 284, no. 21, pp. 2762–2770 (2000 Dec).
[2] E. Azoulay, C. Alberti, I. Legendre, C. B. Buisson, J. R. Le Gall, “Post-ICU mortality in critically ill infected patients: an international study,” Intensive Care Med, vol. 31, no. 1, pp. 56–63 (2005 Jan).
[3] J. L. Vincent, “Is the current management of severe sepsis and septic shock really evidence based?” PLoS Med., vol. 3, no. 9, p. e346 (2006 Sep).
[4] M. K. Ross, W. Wei, L. Ohno-Machado, “‘Big data’ and the electronic health record,” Yearb Med Inform, vol. 9, pp. 97–104 (2014 Aug).
[5] Y. Zhang, S. L. Guo, L. N. Han, T. L. Li, “Application and Exploration of Big Data Mining in Clinical Medicine,” Chin. Med. J., vol. 129, no. 6, pp. 731–738 (2016 Mar).
[6] A. E. Johnson, T. J. Pollard, L. Shen, L. W. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, R. G. Mark, “MIMIC-III, a freely accessible critical care database,” Sci Data, vol. 3, p. 160035 (2016 May).
[7] M. G. Kahn, D. Batson, L. M. Schilling, “Data Model Considerations for Clinical Effectiveness Researchers,” Medical Care, vol. 50, pp. S60–S67 (2012 Jul.), doi:10.1097/MLR.0b013e318259bff4, URL https://insights.ovid.com/crossref?an=00005650-201207001-00013.
[8] J. J. Gagne, “Common Models, Different Approaches,” Drug Saf, vol. 38, no. 8, pp. 683–686 (2015 Aug).
[9] P. R, L. T, “Data enclaves for sharing information derived from clinical and administrative data,” JAMA (2018), doi:10.1001/jama.2018.9342, URL http://dx.doi.org/10.1001/jama.2018.9342.
[10] Y. Xu, X. Zhou, B. T. Suehs, A. G. Hartzema, M. G. Kahn, Y. Moride, B. C. Sauer, Q. Liu, K. Moll, M. K. Pasquale, V. P. Nair, A. Bate, “A Comparative Assessment of Observational Medical Outcomes Partnership and Mini-Sentinel Common Data Models and Analytics: Implications for Active Drug Safety Surveillance,” Drug Saf, vol. 38, no. 8, pp. 749–765 (2015 Aug).
[11] D. Madigan, P. B. Ryan, M. Schuemie, P. E. Stang, J. M. Overhage, A. G. Hartzema, M. A. Suchard, W. DuMouchel, J. A. Berlin, “Evaluating the impact of database heterogeneity on observational study results,” Am. J. Epidemiol., vol. 178, no. 4, pp. 645–651 (2013 Aug).
[12] H. Morgenstern, B. Rafaely, “Spatial Reverberation and Dereverberation Using an Acoustic Multiple-Input Multiple-Output System,” J. Audio Eng. Soc, vol. 65, no. 1/2, pp. 42–55 (2017 Jan./Feb.), doi:10.17743/jaes.2016.0063.
[13] O. H. Klungel, X. Kurz, M. C. de Groot, R. G. Schlienger, S. Tcherny-Lessenot, L. Grimaldi, L. Ibanez, R. H. Groenwold, R. F. Reynolds, “Multi-centre, multi-database studies with common protocols: lessons learnt from the IMI PROTECT project,” Pharmacoepidemiol Drug Saf, vol. 25 Suppl 1, pp. 156–165 (2016 Mar).
[14] C. Maier, L. Lang, H. Storf, P. Vormstein, R. Bieber, J. Bernarding, T. Herrmann, C. Haverkamp, P. Horki, J. Laufer, F. Berger, G. Honing, H. W. Fritsch, J. Schuttler, T. Ganslandt, H. U. Prokosch, M. Sedlmayr, “Towards Implementation of OMOP in a German University Hospital Consortium,” Appl Clin Inform, vol. 9, no. 1, pp. 54–61 (2018 Jan).
[15] F. FitzHenry, F. S. Resnic, S. L. Robbins, J. Denton, L. Nookala, D. Meeker, L. Ohno-Machado, M. E. Matheny, “Creating a Common Data Model for Comparative Effectiveness with the Observational Medical Outcomes Partnership,” Appl Clin Inform, vol. 6, no. 3, pp. 536–547 (2015).
[16] G. Hripcsak, J. D. Duke, N. H. Shah, C. G. Reich, V. Huser, M. J. Schuemie, M. A. Suchard, R. W. Park, I. C. Wong, P. R. Rijnbeek, J. van der Lei, N. Pratt, G. N. Noren, Y. C. Li, P. E. Stang, D. Madigan, P. B. Ryan, “Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers,” Stud Health Technol Inform, vol. 216, pp. 574–578 (2015).
[17] J. M. Overhage, P. B. Ryan, C. G. Reich, A. G. Hartzema, P. E. Stang, “Validation of a common data model for active safety surveillance research,” J Am Med Inform Assoc, vol. 19, no. 1, pp. 54–60 (2012).
[18] C. Reich, P. B. Ryan, P. E. Stang, M. Rocca, “Evaluation of alternative standardized terminologies for medical conditions within a network of observational healthcare databases,” J Biomed Inform, vol. 45, no. 4, pp. 689–696 (2012 Aug).
[19] J. G. Md Shamsuzzoha Bayzid, Vojtech Huser, “Conversion of MIMIC to OHDSI CDM,” (2016).
[20] M. Garza, G. Del Fiol, J. Tenenbaum, A. Walden, M. N. Zozus, “Evaluating common data models for use with a longitudinal community registry,” J Biomed Inform, vol. 64, pp. 333–341 (2016 Dec).
[21] C. Chronaki, A. Shahin, R. Mark, “Designing Reliable Cohorts of Cardiac Patients across MIMIC and eICU,” Comput Cardiol (2010), vol. 42, pp. 189–192 (2015).
[22] D. L. Moody, G. G. Shanks, “Improving the quality of data models: empirical validation of a quality management framework,” vol. 28, no. 6, pp. 619–650, doi:10.1016/S0306-4379(02)00043-1, URL http://linkinghub.elsevier.com/retrieve/pii/S0306437902000431.
[23] M. G. Kahn, D. Batson, L. M. Schilling, “Data Model Considerations for Clinical Effectiveness Researchers,” vol. 50, pp. S60–S67, doi:10.1097/MLR.0b013e318259bff4, URL https://insights.ovid.com/crossref?an=00005650-201207001-00013.
[24] D. Yoon, E. Ahn, M. Young Park, S. Yeon Cho, P. Ryan, M. J. Schuemie, D. H. Shin, H. Park, R. W. Park, “Conversion and Data Quality Assessment of Electronic Health Record Data at a Korean Tertiary Teaching Hospital to a Common Data Model for Distributed Network Research,” vol. 22, p. 54 (2016 Feb).
[25] P. A. Bernstein, J. Madhavan, E. Rahm, “Generic schema matching, ten years later,” PVLDB, p. 2011.
[26] A. E. W. Johnson, T. J. Pollard, R. G. Mark, “Reproducibility in critical care: a mortality prediction case study,” presented at the F. Doshi-Velez, J. Fackler, D. Kale, R. Ranganath, B. Wallace, J. Wiens (Eds.), Proceedings of the 2nd Machine Learning for Healthcare Conference, vol. 68 of Proceedings of Machine Learning Research, pp. 361–376 (2017