
medRxiv preprint doi: https://doi.org/10.1101/2020.08.14.20175141; this version posted August 17, 2020.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
It is made available under a CC-BY 4.0 International license.

MIMIC in the OMOP Common Data Model

Nicolas PARIS and Adrien PARROT



Objectives: In the era of big data, the intensive care unit (ICU) is very likely to benefit from real-time computer analysis and modeling based on close patient monitoring and Electronic Health Record data. MIMIC is the first open access database in the ICU domain. Many studies have shown that common data models (CDMs) improve database searching by allowing code, tools and experience to be shared. OMOP-CDM is spreading all over the world. The objective was to evaluate the difficulty of transforming MIMIC into an OMOP (MIMIC-OMOP) database and the benefits of this transformation for analysts.
Material & Method: A documented, tested, versioned, exemplified and open repository has been set up to support the transformation and improvement of the MIMIC community's source code. The resulting data set was evaluated over a 48-hour datathon.
Result: With an investment of 2 people for 500 hours, 64% of the data items of the 26 MIMIC tables have been standardized into the OMOP CDM and 78% of the source concepts mapped to reference terminologies. The model proved its ability to support community contributions and was well received during the datathon, with 160 participants and 15,000 requests executed with a maximum duration of one minute.
Conclusion: The resulting MIMIC-OMOP dataset is the first MIMIC-OMOP dataset available free of charge with real de-identified data ready for replicable intensive care research. This approach can be generalized to any medical field.

INTRODUCTION

Intensive care units (ICUs) are designed to provide comprehensive support to the most severely ill patients within a hospital [1]. Mortality is typically high among these patients, both during and after the hospital stay [2]. Understanding the effectiveness of interventions on patient outcomes remains a challenge, due to heterogeneity of patients, complexity of disease, and variation in care patterns. Intensivists rely on a limited level of evidence to guide decision making [3], whereas ICUs are a high-density environment for data production.

With the increasing adoption of electronic health record (EHR) systems around the world leading to large amounts of clinical data [4] and the development of data mining, innovation through health data is likely to play an important role in clinical medicine [5]. Indeed, building on this wealth of medical information, the expectations are to improve clinical outcomes and practices, enable personalized medicine, guide early warning systems, and easily enroll large, multicenter cohorts while minimizing costs.

MIMIC-III (Medical Information Mart for Intensive Care) is a high-granularity dataset of over 60,000 intensive care stays and 46,000 unique patients from two successive ICU systems at the Beth Israel Deaconess Medical Center in Boston, admitted from 2001 to 2012. It is the first freely available ICU database and has been intensively used in research, resulting in more than 300 international publications. However, its monocentric nature makes it difficult to generalize findings to other ICUs. The MIMIC relational data model reflects the original intensive care information systems, as evidenced by the two separate inputevents_mv and inputevents_cv tables [6] or the two separate terminologies for physiological data. This forces analysts (data scientists, statisticians, etc.) to reconcile the corresponding data and address this heterogeneity during the pre-processing step of each study.

For Kahn et al. [7], "databases modelling is the process of determining how data are to be stored in a database". It specifies data types, constraints, relationships and metadata definitions and provides a standardized way to represent resources/data and their relationships. Some studies have shown that using a common data model (CDM), which standardizes the structure (data model) and concepts (terminological model) of the database, allows large-scale multicenter research and the study of rare diseases or rare events, and catalyzes research by sharing practices, source
code and tools [8, 9]. However, some studies have shown that the results are not fully reproducible from one CDM to another [10] or from one centre to another [11]. Some approaches argue that keeping the local conceptual model [12] and the local structural model [13] leads to better results. On the one hand, keeping MIMIC in its specific form will not solve the limitation for multicenter research; on the other hand, a fully standardized form would introduce other disadvantages, such as loss of data and lower computational performance. The ideal solution probably lies in between, allowing local or standardized analysis depending on the research question.

OMOP (Observational Medical Outcomes Partnership Common Data Model) is a CDM originally designed for multicenter research on adverse drug events that now extends to medical, laboratory and genomic use cases. OMOP provides structural and conceptual models relying on reference terminologies such as SNOMED for diagnoses, RxNorm for drugs and LOINC for laboratory results. Several examples of databases transformed into OMOP have been published [14, 15] and OMOP stores 682 million patient records from around the world [16]. Each clinical area is stored in dedicated tables. The OMOP conceptual model is based on a closure table pattern (https://karwin.blogspot.com/2010/03/rendering-trees-with-closure-tables.html) capable of ingesting any simple, hierarchical or graph terminology such as SNOMED. In addition to local terminologies, OMOP defines and maintains a set of standard terminologies to be mapped unidirectionally (local to standard) by implementers. Although OMOP has proven its reliability [17], the concept mapping process is known to have an impact on results [18] and the application of the same protocol to different data sources leads to different results [11]. This shows the importance of keeping local terminologies so that local analysis is still possible. Preliminary work has been done on the translation of MIMIC into OMOP [19]. This work remains to be refined and updated for proper evaluation.

In a recent comparative study of different CDMs [8, 20], OMOP obtained the best results for completeness, integrity, flexibility, simplicity of integration and implementability, with a wider coverage of the structural and conceptual model, a more systematic analysis thanks to an analytical library and visualization tools, and easier access to data through SQL queries. In terms of conceptual approach, OMOP offers a broader set of standard concepts. In terms of structure, it is very rigorous about how data should be loaded into specific tables, while other CDMs such as i2b2 are very flexible, with a general table that serves all data domains. This rigorous approach is necessary for standardization. Previous work has been done to load MIMIC-III into i2b2 [21]; however, the work could not be finalized because of the difficulty of mapping concepts to standard terminologies. OMOP has the advantage of not making the terminology mapping step mandatory, by keeping the local codes accessible to analysts. Compared to the Fast Healthcare Interoperability Resources (FHIR) (https://www.hl7.org/fhir/), OMOP performs better as a conceptual CDM because the FHIR resources currently do not specify the terminology to be used for most of the attributes. The OMOP relational model can be materialized in csv format and stored in any relational database, whereas FHIR uses json files and needs some processing and higher skills to exploit. Among the above models, OMOP is the best candidate to overcome the MIMIC limitations mentioned earlier.

Our article was guided by the two following dimensions:

1. Data Transformations: evaluate the process of transforming MIMIC into OMOP in terms of time needed, skills required and quality of the result.
2. Data Analytics: evaluate the resulting dataset to support efficient, shareable and real-time analysis.

1 MATERIAL & METHOD

1.1 Data Transformations

The majority of the source code is implemented in PostgreSQL 9.6.9 (Postgres) because it is the primary platform supported by the MIMIC database, allows the community to reproduce our work on limited resources without licensing costs, and benefits from recent Postgres improvements in the data processing area. Some elaborate data transformations have been implemented as Postgres functions.

The OMOP CDM version 5.3.3.1 (OMOP) tables were created from the provided scripts with some changes documented in our scripts. OMOP defines 15 standardized clinical data tables, 3 health system data tables, 2 health economics data tables, 5 tables for derived elements and 12 tables for standardized vocabulary. The vocabulary tables were loaded from concepts downloaded from Athena (https://www.ohdsi.org/analytic-tools/athena-standardized-vocabularies/) and the clinical and derived tables were loaded from MIMIC.

MIMIC-III version 1.4.21 (MIMIC) was also loaded into Postgres with the provided scripts. A subset of 100 patients out of the 46,000 MIMIC patients was selected for its broad representativeness of the database and cloned into a second instance to serve as a light and representative development set.
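The closure table pattern that the introduction attributes to the OMOP conceptual model can be illustrated with a minimal sketch. It is not taken from the MIMIC-OMOP repository: SQLite stands in for Postgres, the three concepts and their ids are hypothetical, and the table keeps only a simplified subset of the columns of the OMOP concept_ancestor table (which also carries max_levels_of_separation). The point is only that every ancestor/descendant pair is stored explicitly, so a hierarchy of any depth can be queried with one join and no recursion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (
        concept_id   INTEGER PRIMARY KEY,
        concept_name TEXT NOT NULL
    );
    -- Closure table: one row per (ancestor, descendant) pair, self pairs included.
    CREATE TABLE concept_ancestor (
        ancestor_concept_id      INTEGER NOT NULL REFERENCES concept (concept_id),
        descendant_concept_id    INTEGER NOT NULL REFERENCES concept (concept_id),
        min_levels_of_separation INTEGER NOT NULL,
        PRIMARY KEY (ancestor_concept_id, descendant_concept_id)
    );
""")

# Hypothetical hierarchy: Infection > Bacterial infection > Staphylococcal infection.
conn.executemany("INSERT INTO concept VALUES (?, ?)", [
    (1, "Infection"),
    (2, "Bacterial infection"),
    (3, "Staphylococcal infection"),
])
conn.executemany("INSERT INTO concept_ancestor VALUES (?, ?, ?)", [
    (1, 1, 0), (2, 2, 0), (3, 3, 0),  # each concept is its own ancestor
    (1, 2, 1), (2, 3, 1),             # direct parent/child links
    (1, 3, 2),                        # transitive link stored explicitly
])


def descendants(ancestor_id: int) -> list[str]:
    """Return the names of all strict descendants, nearest first, with a single join."""
    rows = conn.execute(
        """
        SELECT c.concept_name
        FROM concept_ancestor ca
        JOIN concept c ON c.concept_id = ca.descendant_concept_id
        WHERE ca.ancestor_concept_id = ?
          AND ca.min_levels_of_separation > 0
        ORDER BY ca.min_levels_of_separation, c.concept_id
        """,
        (ancestor_id,),
    ).fetchall()
    return [name for (name,) in rows]


print(descendants(1))
```

The price of this design is storage: the closure table grows with the number of ancestor/descendant pairs rather than with the number of concepts, which is the usual trade-off for recursion-free reads.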


1.1.1 Structural Mapping

The structural mapping aims at moving the MIMIC data into the right place in OMOP with some data transformations. It is divided into three phases: conception, implementation and evaluation.

The conception phase consists of looping over each MIMIC table and choosing an equivalent location in OMOP for each column. In general, both projects were appropriately documented, but in several cases we needed clarification from MIMIC contributors on the dedicated MIMIC git repository (https://github.com/MIT-LCP/mimic-code) or from the OMOP community forum (http://forums.ohdsi.org/). Some trickier choices have been discussed in the MIMIC-OMOP git repository (https://github.com/MIT-LCP/mimic-omop) and can be tracked in the commit logs.

The implementation is done by an Extract-Transform-Load (ETL) process composed of Postgres scripts that extract information from the source or concept mapping tables, then transform it and load an OMOP target table. The scripts are run sequentially by a main program. As a last resort, some modifications of the OMOP structural model have been made. A dedicated script gathers all of them: column name modifications, new columns, column type modifications and database indexing modifications. In particular, each source table has been given a globally unique sequence, incremented from 0, that serves as the primary key and as the link in the OMOP target tables. As a result, every record is uniquely identified, which makes it possible to link the information with OMOP while simplifying primary/foreign key maintenance.

Although evaluating a structural model is difficult [22], several articles have attempted to assess the quality of a CDM [7, 20]. The criteria developed by Khan et al. [23], which refer to the Moody and Shanks metrics [22], have been adapted to assess the quality of the data transformation (Table 1).

Besides the Moody and Shanks metrics, we provide a set of controls to guarantee a correct transformation. In order to compare overall statistics, some SQL queries have been set up to compare MIMIC and MIMIC-OMOP, and we provide basic population characteristics. All tables have been covered and tested through simple counts, aggregate counts or distribution checks. We estimated the loss of information during the ETL process by measuring the percentage of both columns and rows lost in the process, as previous studies have done [15]. It is important to note that we have chosen not to keep irrelevant information: for example, some rows are known to be invalid in MIMIC and some information is redundant. Each ETL script has been tested using pgTAP, a unit test framework for Postgres. Each unit test script checked whether a particular OMOP target table was loaded correctly. Integrity constraints (primary keys, foreign keys, non-nullable columns) have been included to apply integrity checks at ETL runtime. The last axis of the structural evaluation is the Achilles software, an open-source analysis software produced by OHDSI (https://www.ohdsi.org/web/achilles/). Like many previous authors, we used Achilles to assess data quality [24]. This tool is used for data characterization, data quality assessment (Achilles' heel) and health observation data visualization. It has been common practice to run the Achilles tests and use them as a quality assessment in related studies. All the resulting tables are presented in the results section.

1.1.2 Conceptual Mapping

The conceptual mapping aims at aligning the MIMIC local terminologies with the OMOP standard ones. It consists of three phases: integration, alignment and evaluation.

The integration phase is about loading both kinds of terminologies into the OMOP vocabulary tables. The OMOP terminologies are provided by the Athena tool and were loaded with the associated programs. We have used an export with all terminologies without licensing limitations. The local terminologies have been extracted from the multiple MIMIC tables and loaded into the OMOP concept table. When possible, relevant information from the original MIMIC tables has been concatenated in the concept_name column. MIMIC local concepts were loaded with a concept_id identifier starting from 2 billion (lower numbers are reserved for OMOP terminologies, see http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:concept). In the OMOP concept table, MIMIC concepts can be distinguished by a vocabulary_id identifier equal to "MIMIC code" and a domain_id identifier targeting the OMOP table in which the corresponding data is stored. This domain information is used in the ETL to send the information to the proper table. We call this method "concept-driven dispatching". The OMOP documentation explains that conceptual mapping has to be done before the structural mapping because the nature of the OMOP standard concepts determines in which table (domain) the information should be stored. The concept-driven dispatching methodology enables changing the concept mapping after the transformation without modifying the underlying ETL code, because the latter is dynamically based on the concept table content.

The alignment phase, which standardizes local MIMIC codes into OMOP standard codes, had four distinct cases. In the first case, some MIMIC data is by chance already coded according to OMOP standard terminologies (e.g. LOINC laboratory results) and, therefore, the standard and local concepts are the
same. In the second case, MIMIC data is not coded in the standard OMOP terminologies, but the mapping is already provided by OMOP (e.g. ICD9/SNOMED-CT), so the domain tables have been loaded accordingly. In the third case, terminology mapping is not provided, but it is small enough to be done manually in a few hours (such as demographic status, signs and symptoms). In the fourth case, terminology mapping is not provided and consists of a large set of local terms (admission diagnosis, drugs). Then, only a subset of the most represented codes was manually mapped. When a manual terminology mapping is required, a mapping csv file has been built. This solution can be adapted to medical users who do not have training in database engineering. The spreadsheet has several columns such as local/standard labels, ids and also comments and evaluation metrics, and a script loads it into Postgres when completed. We have chosen to use simple SQL queries that are flexible enough to be queried on demand or to generate a pre-filled csv with the best matches. They use Postgres full-text ranking features and link local and standard candidates with a rating function based on their labels. This work was performed under the supervision of an intensivist.

The evaluation phase was both quantitative and qualitative. The quantitative evaluation measures the completeness of our work: the percentage of concepts that are mapped to a standard. The qualitative evaluation assesses correctness. For newly generated mappings, it consisted of manually tagging each mapping with a score between 0 and 1 and optionally writing a comment on each mapped concept. When the mapping was provided by OMOP (automatic OMOP terminology mapping), the evaluation was performed on a subset of concepts manually picked within each terminology.

Table 1. Transformation Quality Evaluation Metrics

Data Model Dimension                 Description
Completeness - structural mapping    Domain coverage: coverage of source domains that are accommodated by the standard OMOP model
Completeness - conceptual mapping    Data coverage: coverage of source data concepts that are mapped to standard OMOP concepts
Integrity                            "Meaningful data relationships and constraints that uphold the intent of the data's original purpose" [23]
Flexibility                          The ease of expanding the standard model for new data types and concepts
Integration                          The capacity of the standard model to use multiple terminologies and link them to the standard ones
Implementability                     The stability of the model, the community, the cost of adoption
Understandability                    The ease with which the standard model can be understood
Simplicity                           The ease of querying the standard model; the model should contain the minimum of concepts and relationships

1.2 Data Analytics

Beyond the model transformation and compliance with the OMOP standardisation process, we carried out some analyses.

MIMIC provides a large number of SQL scripts, known as "contrib", to preprocess and normalize data, calculate derived scores and define cohorts. Some of them have been implemented on top of the OMOP format to load the OMOP derived tables.

A set of general denormalized tables has been built on top of the original OMOP format; they carry the concept_name alongside the concept_id columns. The concept table is a central element of OMOP and, therefore, it is involved in many joins to obtain the concept label. By precalculating the joins with the concept table, the denormalized tables speed up computation and simplify SQL queries.

In addition, a set of specialized analytical tables has been built on the original OMOP format. The OMOP microbiologicalevents table is a reorganization of the measurement table data for microorganisms and the associated antibiotic susceptibility tests, and is based on the MIMIC microbiologyevents table. The OMOP icustays table makes it possible to quickly retrieve the patients admitted to intensive care and is inspired by the MIMIC icustays table.

The OMOP note_nlp table was originally designed to store final or intermediate derived information and metadata from clinical notes. When definitive, the extracted information is intended to be moved to the dedicated domain or table and then reused as regular structured data. When the information is still intermediate, it is stored in the note_nlp table and can be used for later analysis. To populate this table, we provided two information extraction pipelines. The first pipeline extracted numerical values such as weight, height, body mass index and left ventricular ejection fraction from medical notes with a SQL script. The resulting structured numerical values were loaded into the measurement or observation tables according to their domain. The
second pipeline, a section extractor based on the Apache UIMA framework, divides notes into sections to help analysts select or exclude certain sections in their analysis. Section templates (such as "Illness History") have been automatically extracted from the text with regular expressions, then filtered to keep only the most frequent ones (frequency greater than 1%).

A 48-hour open access datathon (http://blogs.aphp.fr/dat-icu/) was set up in Paris at AP-HP (Assistance Publique des Hopitaux de Paris) in collaboration with MIT once the MIMIC-OMOP transformation was ready for research. This datathon was organised to evaluate OMOP as an alternative data model for accessing and analysing MIMIC data during a real event. Scientific questions had been prepared in an online forum where participants could introduce themselves and propose a topic or choose an existing one. OMOP was loaded into Apache Hive 1.2.1 in ORC format. Users had access to the ORC dataset from a web interface of Jupyter notebooks with Python, R or Scala. A SQL web client allowed teams to write SQL with Presto against the same dataset. The Hadoop cluster was based on 5 computers with 16 cores and 220 GB of RAM. The MIMIC-OMOP dataset was loaded from a Postgres instance into Hive through Apache Sqoop 1.4.6, directly in ORC format. Participants also had access to the SchemaSpy documentation of the OMOP physical data model, with table/column comments and the primary/foreign key relationships between the tables. All queries were logged.

2 RESULT

All transformation processes are freely accessible to the public via the MIMIC-OMOP git repository maintained by MIT-LCP [6]. The repository is based on git and is designed for sharing, improvement, collaboration and reproducible work. Moreover, it is archived on a universal and durable software archive solution (https://www.softwareheritage.org/). The git repository centralizes the various resources of this work such as documentation, source code, unit tests, as well as query examples, discussions and issues. It also points to web resources such as the physical data model for the MIMIC (https://mit-lcp.github.io/mimic-omop/schemaspy-mimic) and OMOP (https://mit-lcp.github.io/mimic-omop/schemaspy-omop) datasets and the Achilles web client (https://mit-lcp.github.io/mimic-omop/AchillesWeb). All the code to create these statistics is provided on the article's framagit repository (see Repository Work section).

2.1 Data Transformations

The MIMIC to OMOP conversion was performed by two developers (a data engineer and an intensivist) over 500 hours. This includes ETL, git documentation, concept mapping, contributions and unit tests. The ETL (with unit tests and generation of a ready-to-load archive) on a subset of 100 patients takes five minutes and enables fast development cycles. The ETL takes 3 hours to process the whole MIMIC database. The resulting csv archive is almost the same size as the original archive, and MIMIC-OMOP is also the same size as MIMIC once loaded and indexed into Postgres.

2.1.1 Structural Mapping

The results of the structural mapping are presented in Table 2. Among the 37 OMOP tables, those related to hospital costs were not applicable, some related to derived data were not populated and some tables related to vocabulary were pre-loaded with terminology information. The 26 tables of MIMIC have been dispatched into 19 OMOP tables. The reduced number of tables results from the differences in design of both models. OMOP stores all the terminologies in one table whereas MIMIC has one table for each terminology, and the same applies to fact data, which are grouped by nature in OMOP while MIMIC tables are more specialized and respect the source EHR's design. For example, the measurement table gathers measured information and combines 4 source tables, resulting in 365,181,104 rows, which is 20% more than the largest MIMIC table. To some extent this is a regression in terms of performance. Two important tables are provided by OMOP to represent the relationships between data: concept_relationship and fact_relationship. We used them to bind the drugs into a solution, for microbiology / antibiograms and for visit_detail / caresite links. The following SQL query (Listing 1) shows how a microorganism is linked to its susceptibility test by a fact_relationship and illustrates the flexibility of the model. However, this flexibility affects the simplicity and the performance of the model by increasing the number of joins within SQL queries.

Listing 1. Original table microbiology SQL query

    SELECT measurement_source_value
         , value_as_concept_id
         , concept_name
    FROM measurement
    JOIN concept resistance
      ON value_as_concept_id = concept_id
    JOIN fact_relationship
      ON measurement_id = fact_id_2
    JOIN
    (
        SELECT measurement_id AS id_is_staph
        FROM measurement
        WHERE
            measurement_type_concept_id = 2000000007
            -- 'Labs - Culture Organisms'
            AND value_as_concept_id = 4149419
            -- 'Staph aureus coag +'
            AND measurement_concept_id = 46235217
            -- 'Bacteria identified in Blood product unit.autologous by Culture'
    ) staph ON id_is_staph = fact_id_1
    WHERE TRUE
      AND measurement_type_concept_id = 2000000008
      -- 'Labs - Culture Sensitivity'

Table 2. MIMIC to OMOP data flows

OMOP tables             Number of rows   MIMIC tables
CARE_SITE                           93   transfers, services
COHORT_ATTRIBUTE                228379   callout
CONCEPT                          30344   d_cpt, d_icd_procedures, d_items, d_labitems
CONDITION_OCCURRENCE            716595   admissions, diagnoses_icd
DEATH                            14849   patients, admissions
DRUG_EXPOSURE                 24934751   prescriptions, inputevents_cv, inputevents_mv
MEASUREMENT                  365181104   chart/lab/microbiology/in/output events
NOTE                           2082294   noteevents
NOTE_NLP                      16350855   noteevents
OBSERVATION                    6721040   admissions, chartevents, datetimeevents, drgcodes
OBSERVATION_PERIOD               58976   patients, admissions
PERSON                           46520   patients, admissions
PROCEDURE_OCCURRENCE           1063525   cptevents, procedureevents_mv, procedures_icd
PROVIDER                          7567   caregivers
SPECIMEN                      39874171   chartevents, labevents, microbiologyevents
VISIT_OCCURRENCE                 58976   admissions
VISIT_DETAIL                    271808   admissions, transfers, services

Table 3 presents the basic characterization of the MIMIC-OMOP population and assesses the overall quality of the structural mapping. Most statistics remain similar between the two versions, with a few differences. Table 3 shows that MIMIC contains 61,532 intensive care stays while OMOP contains 71,576 intensive care stays. This represents a 16% increase in stays. By design, MIMIC aggregates information from various systems. Thus the transfer information is spread over several tables, such as admissions, transfers and icustays, whereas OMOP centralizes this information in the visit_detail table. We also added emergency stays as a normal location for patients throughout their hospital stay (unlike what had been done by MIMIC). The MIMIC icustays table has not been transformed because it derives from the transfers table (https://mimic.physionet.org/mimictables/icustays/), and we decided to assign a new visit_detail row for each ICU stay (based on the transfers table), while MIMIC preferred to assign a new ICU stay if a new admission occurs > 24h after the end of the previous stay.

This table also shows an increase in the number of laboratory measurements per admission. This is because MIMIC-OMOP gathers laboratory data from both the dedicated MIMIC laboratory table and the chartevents table, which is usually not considered for this purpose. For laboratory tests, a specimen (i.e. a blood sample) can be shared by many laboratory results (because one blood sample can be used for several tests); we decided to create as many specimen rows as laboratory tests because the sample information is not present in MIMIC. The same was true when date information was not provided (start/end_datetime for drug_exposure).

As mentioned in Table 4, from 20% to 80% of the source columns have not been kept. Almost all were redundant with others or provided derived information. The main concern is the loss of some timestamps. For example, the MIMIC chartevents table provides the storetime and charttime columns, but OMOP only provides one column to store a timestamp. Thus, the MIMIC storetime column, which was considered the less valuable, was dropped during the ETL.

As mentioned in the methods, incorrect entries are not kept in the process. According to Table 4, four MIMIC tables (inputevents_mv, chartevents, procedureevents_mv, noteevents) have deleted rows in the ETL process. All of them were tagged in MIMIC as erroneous or cancelled.

A set of minor modifications of the OMOP table structure was made in order to fit the data. All character columns with limited length have been changed to unlimited length, since a length limit could cause unpredictable truncation of content while the change has no negative impact on Postgres storage size or performance. The visit_occurrence and visit_detail tables have been corrected according to discussions on the OHDSI forum. The note_nlp table has been extended with fields mentioned in the online documentation but missing from the scripts. In addition, the offset column has been split into two integer columns, offset_begin and offset_end, because offset is a SQL reserved word and it makes sense to fill the resulting columns with integer values.

All the pgTAP unit tests passed. Moreover, OMOP had a 100% match of the integrity constraints and the foreign key relationships of the data models. After 18


Table 3. Baseline characteristics, MIMIC versus MIMIC-OMOP

Item                                      MIMIC                          MIMIC-OMOP
Persons (number)                          46,520                         46,520
Admissions (number)                       58,976                         58,976
ICU stays (number)                        61,532                         71,575
Gender, female (number, %)                20,399                         20,399 (43%)
Age (mean)                                64 years, 4 months             64 years, 4 months
  0-5                                     8,110                          8,110
  6-15                                    1                              1
  16-25                                   1,434                          1,434
  26-45                                   5,962                          5,962
  46-65                                   17,375                         17,375
  66-80                                   15,793                         15,793
  >80                                     10,301                         10,301
Emergency                                 42,071                         42,071
Elective                                  7,706                          7,706
Surgical patients                         19,246                         19,246
Length of stay, hospital (median)         6.46 (Q1-Q3: 3.74-11.79)       6.59 (Q1-Q3: 3.84-11.88)
Length of stay, ICU (median)              2.09 (Q1-Q3: 1.10-4.48)        1.87 (Q1-Q3: 0.95-3.87)
Mortality, ICU (number, %)                5,814 (9%)                     5,815 (9%)
Mortality, hospital (number, %)           4,511 (7%)                     4,559 (6%)
Lab measurements per admission (mean)     478                            678
Procedures per admission (mean)           4.6                            4.6
Drugs per admission (mean)                82.8                           82.8
Exit diagnoses per admission (mean)       11.0                           11.0
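
The difference in ICU stay counts reported in Table 3 follows from two different counting rules described in the text. The sketch below only illustrates these rules on invented data. The `Transfer` record, the two helper functions and the toy timestamps are assumptions made for the example; they are not part of the MIMIC-OMOP ETL, which is written in SQL. It contrasts one row per ICU transfer with a rule that opens a new ICU stay only when the next ICU transfer of the same admission starts more than 24 hours after the previous one ended.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Transfer:
    """Hypothetical, simplified ICU transfer row (not the real MIMIC schema)."""
    hadm_id: int
    intime: datetime
    outtime: datetime


def count_per_transfer(transfers):
    """One visit_detail-like row per ICU transfer."""
    return len(transfers)


def count_merged(transfers, gap=timedelta(hours=24)):
    """Open a new ICU stay only if the next ICU transfer of the same
    admission starts more than `gap` after the previous one ended."""
    stays = 0
    last_out = {}  # hadm_id -> end time of the previous ICU transfer
    for t in sorted(transfers, key=lambda t: (t.hadm_id, t.intime)):
        prev = last_out.get(t.hadm_id)
        if prev is None or t.intime - prev > gap:
            stays += 1
        last_out[t.hadm_id] = t.outtime
    return stays


# Invented example: admission 1 has three ICU transfers, admission 2 has one.
toy = [
    Transfer(1, datetime(2020, 1, 1, 8), datetime(2020, 1, 2, 8)),
    Transfer(1, datetime(2020, 1, 2, 20), datetime(2020, 1, 3, 8)),  # 12 h gap: merged
    Transfer(1, datetime(2020, 1, 6, 8), datetime(2020, 1, 7, 8)),   # 72 h gap: new stay
    Transfer(2, datetime(2020, 2, 1, 0), datetime(2020, 2, 2, 0)),
]

print(count_per_transfer(toy), count_merged(toy))  # 4 3

# Relative increase between the two counts quoted in the text.
increase = (71576 - 61532) / 61532
print(f"{increase:.1%}")  # 16.3%
```

On real data the two rules diverge whenever a patient returns to the ICU shortly after leaving it, which is one plausible reading of why the per-transfer count is the larger one.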

hours of computations Achilles Heel issued 15 errors, 18 warnings and 8 notifications. This result is good compared to other studies [24].

Table 4. Data lost

Relation               Rows lost    Columns lost
admissions             -            30%
callout                -            80%
caregivers             -            50%
chartevents            0.04%        40%
cptevents              -            60%
datetimeevents         0.0001%      50%
diagnoses_icd          -            20%
drugcodes              -            60%
inputevents_cv         -            41%
inputevents_mv         10.00%       46%
labevents              -            34%
microbiologyevents     -            30%
noteevents             0.04%        19%
outputevents           -            39%
patients               -            50%
prescriptions          -            16%
procedureevents_mv     3.00%        70%
procedures_icd         -            40%
services               -            34%
transfers              -            47%

2.1.2 Conceptual Mapping

The results of the Conceptual Mapping's completeness are presented in Table 5.

We have often mapped many source concepts to a single standard concept_id because MIMIC provides a large number of equivalent concepts. For example, MIMIC provides 6 distinct concepts for body temperature: Temperature C, Temperature C (calc), Temperature F, Temperature F (calc), Temperature Fahrenheit and Temperature Celsius. All of them have been mapped to the LOINC "Body temperature" concept and the numerical values have been normalized.

OMOP's terminology coverage has already been rated as excellent [20]. We used the OMOP terminology mappings (NDC-RxNorm, ICD9-SNOMED, CPT4-SNOMED) to standardize a substantial set of MIMIC non-standard terminologies.

The automatic OMOP terminology mapping was evaluated by an intensivist. These results are in favor of a good integration of the model. We checked 100 elements for each mapping used (NDC, ICD9 and CPT4). ICD9 and CPT4 are correctly mapped to SNOMED (100%). However, only 85% of NDCs are linked to a correct RxNorm code. This is partly due to incorrect NDC drug codes in MIMIC and partly because only 78% of NDC codes are mapped to RxNorm. Moreover, even if this does not seem to have affected our ETL
N.PARIS AND A.PARROT

Table 5. Terminology mapping coverage

OMOP table (domain)       Records      % Mapped records   Source concepts   % Mapped source concepts
CARE_SITE                 144          100%               58                100%
CONDITION_OCCURRENCE      716595       90%                6984              94%
DRUG_EXPOSURE             24934751     38%                7398              56%
MEASUREMENT               40141521     73%                1035              76%
OBSERVATION               6721040      68%                1440              80%
PERSONS                   93040        100%               43                100%
PROCEDURE_OCCURRENCE      1063525      99%                2203              99%
SPECIMEN                  39874171     70%                92                77%
NOTE                      2082294      100%               15                100%
VISIT_OCCURRENCE          176928       100%               34                100%
VISIT_DETAIL              396932       100%               28                100%
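
The two percentage columns of Table 5 can be read as two different ratios: the share of rows that carry a mapped standard concept, and the share of distinct source codes that are mapped. The following is a minimal sketch under the assumption, stated in the text, that unmapped entries are linked to concept_id = 0. The `coverage` function, the toy rows and the concept identifiers 1001 and 1002 are hypothetical and are not taken from the ETL or from the OMOP vocabulary.

```python
def coverage(rows):
    """rows: iterable of (source_value, concept_id) pairs.

    Returns a pair:
      - share of records whose concept_id is not 0,
      - share of distinct source values with at least one mapped record.
    """
    rows = list(rows)
    if not rows:
        return 0.0, 0.0
    mapped_records = sum(1 for _, concept_id in rows if concept_id != 0)
    sources = {source for source, _ in rows}
    mapped_sources = {source for source, concept_id in rows if concept_id != 0}
    return mapped_records / len(rows), len(mapped_sources) / len(sources)


# Invented rows: two frequent mapped codes and two rare unmapped local codes.
toy_rows = [
    ("aspirin", 1001),
    ("aspirin", 1001),
    ("aspirin", 1001),
    ("heparin", 1002),
    ("local drug A", 0),
    ("local drug B", 0),
]

record_cov, concept_cov = coverage(toy_rows)
print(f"{record_cov:.0%} of records, {concept_cov:.0%} of source concepts")
```

The two ratios differ as soon as mapped and unmapped codes are not used equally often, which is why a table can show, for the same domain, a record coverage that is lower or higher than its source concept coverage.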

we know that some ICD-9-CM codes can have a one-to-several match with SNOMED (28%) (see footnote 15).

15 https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html

In several cases, OMOP had no suitable concepts for ICU-specific cases. In particular, the visit_detail table does not yet introduce relevant information and duplicates information from the visit_occurrence table. Therefore we extended the concepts to track bed transfers and room transfers through the admitting_concept_id, discharge_to_concept_id and visit_type_concept_id columns. These added concepts have been introduced with a concept_id between 2 billion and 2.001 billion to distinguish them from OMOP concepts (0 to 2 billion) and MIMIC local concepts (above 2.001 billion).

Some local concepts could not be mapped to standard ones. These unmapped concepts are linked to concept_id = 0 and appear in different cases. In the first case, the local concept has no equivalent in the standard concept set. In the second case, it has not yet been mapped and may have a standard equivalent. In the third case, the value is missing and cannot be mapped. In our opinion, although not all of these cases can be used for standard queries, they should have different concept identifiers in order to be treated differently (not only concept_id = 0). Some of the domain_id values do not match the table name; this makes sense because a concept of the observation domain can be stored in the measurement table and vice versa. Although various types of information are stored in the measurement table, the dedicated OMOP concepts for the measurement_type_concept_id column were not sufficient to distinguish them. Therefore we added some (Labs - Chemistry, Labs - Culture Organisms, etc.).

2.2 Data Analytics

Some raw MIMIC information has been transformed and added to meet the structural model. The laboratory textual values have been split into operators, numeric values and units, when needed, with a dedicated Postgres stored procedure. The free text conditions have been normalized and mapped to OMOP standard codes to meet the conceptual model.

As indicated in the methods section, we have provided many derived values. Common derived information was introduced and loaded: corrected serum calcium, corrected serum potassium, P/F ratio, corrected osmolarity, SAPS II.

Denormalized derived tables improve SQL query performance and reduce verbosity. In addition, the resulting tables are much more human-readable, with the concept label directly in the table, and greatly reduce joins. Therefore, a little denormalization greatly improves the analyst's experience of the data model and its simplicity, by adding some redundancy to the data without breaking existing SQL queries. Moreover, these views are backward compatible and remain standardized, allowing the creation of multicentric algorithms. We provide two examples of materialized specialized views, derived from the MIMIC microbiologyevents and icustays tables, that simplify the experience for scientists (Listing 2). These results reflect the lack of simplicity of the model in its original form, but this can easily be overcome with such analytics tables.

Listing 2. Optimized and denormalized microbiology table SQL query

    SELECT antibiotic_source_value,
           antibiotic_interpretation_concept_id,
           antibiotic_interpretation_concept_name
    FROM microbiology
    WHERE organism_concept_id = 4149419
          -- 'Staph aureus coag +'
      AND specimen_concept_id = 46235217;
          -- 'Bacteria identified in Blood product unit.autologous by Culture'

These results are in favor of a good flexibility of the model, which allows derived data to be stored.

The note section extraction pipeline resulted in 1200 sections that were collected and then manually filtered to exclude false positives. 400 similar groups were highlighted. The extracted sections have not been mapped to standard terminologies such as the LOINC Clinical Document Ontology (CDO). The reason for this is that the CDO LOINC decided not to maintain


their sections from its standard, considering that these sections were not widely used (see footnote 16).

16 https://loinc.org/news/loinc-version-2-63-and-relma-version-6-22-are-now-available/

The French Hospitals of Paris (AP-HP) organized a datathon with MIMIC-OMOP. 25 teams totalling 160 participants had 48 hours to undertake a clinical project using the MIMIC-OMOP database, through 15,000 requests with a maximum duration of one minute. They had the opportunity to create mixed teams: clinicians brought the issues that required data mining, as well as their data expertise; data scientists judged the technical feasibility and finally implemented the various analyses needed. Writing standard queries (i.e. with standard concepts) requires knowing the organization of relational models (SQL) and also mastering the graph nature of certain terminologies such as SNOMED-CT, in order to capture all potential codes that might be related to the one analysts think of first. Overall, the teams quickly mastered the OMOP model and managed to produce results by the end of the datathon. These results are in favor of a good understandability and simplicity of the model.

3 DISCUSSION

3.1 Data Transformations

The choice of a simple SQL-based ETL over dedicated ETL software has several advantages. SQL as the single language pools both people's knowledge and computer resources, allowing analysts to become implementers and to revise code or contribute to transformations. SQL was also used for the semantic mapping; we did not use language algorithms, although they have proven to be effective [25] and OHDSI provides USAGI (see footnote 17). The use of CSV as a format for sharing information is simple and universal. Both are standard, target a large community (physicians, engineers and analysts) with translational profiles, and are compatible with multiple technologies.

17 https://github.com/OHDSI/Usagi

The calculation time of the ETL on a Postgres instance running on a modest personal computer is compatible with community work in which a collaborator can clone the source code and configure a development instance to reproduce or improve the work.

Choosing a public git repository for documentation and source code allows analysts to learn more about the project and how to contribute.

The highly active OMOP forum is rich in details and useful for training. In contrast, the implementation guide suffers from not being as detailed or as well maintained. We believe that the OMOP community would greatly benefit from a systematic and concise synchronization between the forum, mailing lists, source code repository and end user documentation.

Any data transformation is likely to generate bugs that can later have an impact on medical research. The foundations of relational database management systems (RDBMS), such as transactions, standardization and integrity constraints, are built-in safeguards that have been useful throughout the process. In addition, the implemented unit tests ensure that past bugs stay behind us. An ideal but complex validation method [26] would be to replicate existing MIMIC studies and ensure that the results are consistent across data models. The OHDSI Achilles tool completes our quality assessment. It is a surprisingly slow tool to run. The rules and their descriptions are difficult to understand. More specific ones should be provided and described.

Another missing aspect is a set of quality tables for assessing and measuring data quality. MIMIC has a column to keep track of corrupted information. It would be interesting to be able to keep the disordered data in OMOP and enable research in the data cleaning/quality field. Although OMOP-CDM provides rules for naming columns, there are some mistakes and we had to modify them. On the one hand, it is a problem for a CDM to contain errors, but on the other hand it is easy to report issues, and these are now corrected.

3.2 Data Analytics

It is important that OMOP maintains a level of standardisation in order to simplify the ETL and make it consistent. However, once this is done, it makes sense to give access to scientific data through more denormalized and specialized tables. There are many concerns about OMOP's performance and optimization. However, there will never be a perfect multi-purpose table, and it is the responsibility of data scientists to build their own simplified, specialized tables for their research, in order to respond effectively and clearly to their needs.

The derived data integrate quite well into OMOP. We used note_nlp to store information derived from notes, measurement to store derived numerical information and cohort_attribute to store derived scores. However, it is not yet clear whether derived data should be stored by domain or in dedicated derived tables. We found that there are no tables to track the source and description of these data.

The note section extraction pipeline we used was based on the Apache UIMA framework. While some methods already exist to extract medical sections [27], the prior work of describing sections was too heavy, and we opted for a naive approach.

Last but not least, as noted in the introduction, a good CDM for the ICU would allow for near real-time early warning systems and inference modelling on fresh data. OMOP is clearly designed to provide a static data set and does not have real-time ingestion and data versioning control mechanisms like EHR

usually do. Analysis of static data sets is essential for reproducible results. However, when the algorithms need to be moved to the bedside, it is necessary to have fresh data and a way of re-identifying the patient, which OMOP does not yet provide. That said, a solution like HL7 FHIR is a great way to implement real-time inference from EHR data, and that is how FHIR and OMOP are complementary. This has already been studied (see footnote 18) but needs further optimisation.

18 http://omoponfhir.org/

The datathon showed that distributed platforms with basic hardware provide SQL tools for Online Analytical Processing (OLAP) with excellent performance that overcomes RDBMS weaknesses. They therefore take advantage of SQL analytic functions such as grouping, windowing, assembling and mathematical functions that are often missing in NoSQL databases. While some are open-source, these distributed technologies are not easily accessible, but cloud-based solutions are more and more affordable for researchers.

The real-life test of the datathon revealed the strong need to make the physical data model accessible, including comments on columns and tables, and we discovered that an open-source tool called schemaspy was very helpful. In addition, we found that the git repository is the best place to document and interact with the community.

4 CONCLUSION

The transformation of MIMIC into the OMOP model is a success. It has required efforts that remain reasonable. It is and always will be a work in progress because standard concept mapping is an almost endless process with constant improvements. Fortunately, the published version of MIMIC-OMOP is research-ready and already offers the same scope of data as the original MIMIC version, and even more with the derived data. It is publicly available on the git repository and has been designed to be easily revised, copied or enriched according to the OMOP or MIMIC philosophy by any user who knows SQL.

The OMOP model is powerful because it allows a broad spectrum of analysis, from specialized local models to evidence-based statistical analysis, in an easy-to-learn and accessible format. The major complexity of this model is intrinsically linked to the complexity of terminologies, with the use of its closure table [28].

Compared to the original MIMIC data model, working with OMOP offers the ability to write standard code and analyses that could benefit other international users.

As we have seen, the effectiveness of the OMOP model has some weaknesses because it seems to focus on consistency rather than performance. However, we have shown that it is easy to overcome the weaknesses and improve OMOP with a set of design or technology optimizations and a dedicated structure that ultimately remains standard and shareable because it derives from the original model.

The first major contribution of this study is to evaluate OMOP in the context of a freely accessible and well known database, MIMIC. The second major contribution is to provide a freely accessible dataset in the OMOP format that could be useful to researchers. The third major contribution is to share with the OMOP community some useful transformations dedicated to intensive care that can be reused on any OMOP data set.

Future studies on the evaluation of structural and conceptual mapping through practical research studies on local and standard coding will be carried out. In addition, we plan to enhance the OHDSI USAGI concept mapping tool to enable international concept mapping suggestions, in order to transform other foreign ICU databases. Finally, research on how to articulate FHIR and OMOP to get the best of both worlds (information at the patient level versus information at the multicenter level) and improve bedside care has to be done.

5 GRANT

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

6 REPOSITORY WORK

All the LaTeX files, statistics and PDF of the article are provided online: https://framagit.org/aphp/mimic-omop-article.

7 REFERENCES

[1] D. C. Angus, M. A. Kelley, R. J. Schmitz, A. White, J. Popovich, “Caring for the critically ill patient. Current and projected workforce requirements for care of the critically ill and patients with pulmonary disease: can we meet the requirements of an aging population?” JAMA, vol. 284, no. 21, pp. 2762–2770 (2000 Dec).
[2] E. Azoulay, C. Alberti, I. Legendre, C. B. Buisson, J. R. Le Gall, “Post-ICU mortality in critically ill infected patients: an international study,” Intensive Care Med, vol. 31, no. 1, pp. 56–63 (2005 Jan).
[3] J. L. Vincent, “Is the current management of severe sepsis and septic shock really evidence based?” PLoS Med., vol. 3, no. 9, p. e346 (2006 Sep).
[4] M. K. Ross, W. Wei, L. Ohno-Machado, “"Big data" and the electronic health record,” Yearb Med Inform, vol. 9, pp. 97–104 (2014 Aug).


[5] Y. Zhang, S. L. Guo, L. N. Han, T. L. Li, “Application and Exploration of Big Data Mining in Clinical Medicine,” Chin. Med. J., vol. 129, no. 6, pp. 731–738 (2016 Mar).
[6] A. E. Johnson, T. J. Pollard, L. Shen, L. W. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, R. G. Mark, “MIMIC-III, a freely accessible critical care database,” Sci Data, vol. 3, p. 160035 (2016 May).
[7] M. G. Kahn, D. Batson, L. M. Schilling, “Data Model Considerations for Clinical Effectiveness Researchers,” Medical Care, vol. 50, pp. S60–S67 (2012 Jul.), doi:10.1097/MLR.0b013e318259bff4, URL https://insights.ovid.com/crossref?an=00005650-201207001-00013.
[8] J. J. Gagne, “Common Models, Different Approaches,” Drug Saf, vol. 38, no. 8, pp. 683–686 (2015 Aug).
[9] P. R, L. T, “Data enclaves for sharing information derived from clinical and administrative data,” JAMA (2018), doi:10.1001/jama.2018.9342, URL http://dx.doi.org/10.1001/jama.2018.9342.
[10] Y. Xu, X. Zhou, B. T. Suehs, A. G. Hartzema, M. G. Kahn, Y. Moride, B. C. Sauer, Q. Liu, K. Moll, M. K. Pasquale, V. P. Nair, A. Bate, “A Comparative Assessment of Observational Medical Outcomes Partnership and Mini-Sentinel Common Data Models and Analytics: Implications for Active Drug Safety Surveillance,” Drug Saf, vol. 38, no. 8, pp. 749–765 (2015 Aug).
[11] D. Madigan, P. B. Ryan, M. Schuemie, P. E. Stang, J. M. Overhage, A. G. Hartzema, M. A. Suchard, W. DuMouchel, J. A. Berlin, “Evaluating the impact of database heterogeneity on observational study results,” Am. J. Epidemiol., vol. 178, no. 4, pp. 645–651 (2013 Aug).
[12] H. Morgenstern, B. Rafaely, “Spatial Reverberation and Dereverberation Using an Acoustic Multiple-Input Multiple-Output System,” J. Audio Eng. Soc, vol. 65, no. 1/2, pp. 42–55 (2017 Jan.Feb.), doi:https://doi.org/10.17743/jaes.2016.0063.
[13] O. H. Klungel, X. Kurz, M. C. de Groot, R. G. Schlienger, S. Tcherny-Lessenot, L. Grimaldi, L. Ibanez, R. H. Groenwold, R. F. Reynolds, “Multi-centre, multi-database studies with common protocols: lessons learnt from the IMI PROTECT project,” Pharmacoepidemiol Drug Saf, vol. 25 Suppl 1, pp. 156–165 (2016 Mar).
[14] C. Maier, L. Lang, H. Storf, P. Vormstein, R. Bieber, J. Bernarding, T. Herrmann, C. Haverkamp, P. Horki, J. Laufer, F. Berger, G. Honing, H. W. Fritsch, J. Schuttler, T. Ganslandt, H. U. Prokosch, M. Sedlmayr, “Towards Implementation of OMOP in a German University Hospital Consortium,” Appl Clin Inform, vol. 9, no. 1, pp. 54–61 (2018 01).
[15] F. FitzHenry, F. S. Resnic, S. L. Robbins, J. Denton, L. Nookala, D. Meeker, L. Ohno-Machado, M. E. Matheny, “Creating a Common Data Model for Comparative Effectiveness with the Observational Medical Outcomes Partnership,” Appl Clin Inform, vol. 6, no. 3, pp. 536–547 (2015).
[16] G. Hripcsak, J. D. Duke, N. H. Shah, C. G. Reich, V. Huser, M. J. Schuemie, M. A. Suchard, R. W. Park, I. C. Wong, P. R. Rijnbeek, J. van der Lei, N. Pratt, G. N. Noren, Y. C. Li, P. E. Stang, D. Madigan, P. B. Ryan, “Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers,” Stud Health Technol Inform, vol. 216, pp. 574–578 (2015).
[17] J. M. Overhage, P. B. Ryan, C. G. Reich, A. G. Hartzema, P. E. Stang, “Validation of a common data model for active safety surveillance research,” J Am Med Inform Assoc, vol. 19, no. 1, pp. 54–60 (2012).
[18] C. Reich, P. B. Ryan, P. E. Stang, M. Rocca, “Evaluation of alternative standardized terminologies for medical conditions within a network of observational healthcare databases,” J Biomed Inform, vol. 45, no. 4, pp. 689–696 (2012 Aug).
[19] J. G. Md Shamsuzzoha Bayzid, Vojtech Huser, “Conversion of MIMIC to OHDSI CDM,” (2016).
[20] M. Garza, G. Del Fiol, J. Tenenbaum, A. Walden, M. N. Zozus, “Evaluating common data models for use with a longitudinal community registry,” J Biomed Inform, vol. 64, pp. 333–341 (2016 12).
[21] C. Chronaki, A. Shahin, R. Mark, “Designing Reliable Cohorts of Cardiac Patients across MIMIC and eICU,” Comput Cardiol (2010), vol. 42, pp. 189–192 (2015).
[22] D. L. Moody, G. G. Shanks, “Improving the quality of data models: empirical validation of a quality management framework,” vol. 28, no. 6, pp. 619–650, doi:10.1016/S0306-4379(02)00043-1, URL http://linkinghub.elsevier.com/retrieve/pii/S0306437902000431.
[23] M. G. Kahn, D. Batson, L. M. Schilling, “Data Model Considerations for Clinical Effectiveness Researchers,” vol. 50, pp. S60–S67, doi:10.1097/MLR.0b013e318259bff4, URL https://insights.ovid.com/crossref?an=00005650-201207001-00013.
[24] D. Yoon, E. Ahn, M. Young Park, S. Yeon Cho, P. Ryan, M. J. Schuemie, D. H. Shin, H. Park, R. W. Park, “Conversion and Data Quality Assessment of Electronic Health Record Data at a Korean Tertiary Teaching Hospital to a Common Data Model for Distributed Network Research,” vol. 22, p. 54 (2016 02).
[25] P. A. Bernstein, J. Madhavan, E. Rahm, “Generic schema matching, ten years later,” PVLDB, p. 2011.
[26] A. E. W. Johnson, T. J. Pollard, R. G. Mark, “Reproducibility in critical care: a mortality prediction case study,” presented at the F. Doshi-Velez, J. Fackler, D. Kale, R. Ranganath, B. Wallace, J. Wiens (Eds.), Proceedings of the 2nd Machine Learning for Healthcare Conference, vol. 68 of Proceedings of Machine Learning Research, pp. 361–376 (2017

18–19 Aug), URL http://proceedings.mlr.press/v68/johnson17a.html.
[27] J. C. Denny, A. Spickard, K. B. Johnson, N. B. Peterson, J. F. Peterson, R. A. Miller, “Evaluation of a method to identify and categorize section headers in clinical documents,” J Am Med Inform Assoc, vol. 16, no. 6, pp. 806–815 (2009).
[28] “Bill Karwin’s blog Rendering Trees with Closure Tables,” https://karwin.blogspot.com/2010/03/rendering-trees-with-closure-tables.html.
