
Continuous Predictive Modeling of Clinical Notes and ICD Codes in Patient Health Records

Mireia Hernandez Caralt, Clarence Boon Liang Ng, Marek Rei


Imperial College London, United Kingdom
{mireia.hernandez-caralt22,clarence.ng21,marek.rei}@imperial.ac.uk

Abstract

Electronic Health Records (EHR) serve as a valuable source of patient information, offering insights into medical histories, treatments, and outcomes. Previous research has developed systems for detecting applicable ICD codes that should be assigned while writing a given EHR document, mainly focusing on discharge summaries written at the end of a hospital stay. In this work, we investigate the potential of predicting these codes for the whole patient stay at different time points during their stay, even before they are officially assigned by clinicians. The development of methods to predict diagnoses and treatments earlier in advance could open opportunities for predictive medicine, such as identifying disease risks sooner, suggesting treatments, and optimizing resource allocation. Our experiments show that predictions regarding final ICD codes can be made already two days after admission and we propose a custom model that improves performance on this early prediction task.

Category            # notes      # words
Discharge summ.        59,652   79,649,691
ECG                   209,051    5,625,393
Echo                   45,794   12,810,062
Nursing             1,046,053  185,856,841
Physician             141,624  322,961,183
Radiology             522,279  165,805,982
Respiratory            31,739   11,957,187
Other                  26,988   13,086,023

Table 1: The number of clinical notes from different categories, along with the number of words in those notes, in the MIMIC-III dataset.

1 Introduction

Electronic health records (EHR) are rich repositories of patient information, chronicling their medical history, diagnoses, treatment plans, medications and outcomes (Jensen et al., 2012; Johnson et al., 2016). The aggregation and modeling of this data over time presents a unique opportunity for revealing patterns that can improve patient care, operational efficiency, and healthcare delivery. Contained within the EHR are textual notes written by clinicians during patient encounters, which are essential for a comprehensive understanding of patient health. These free-text narratives stand out as a particularly rich source of nuanced information, but their unstructured format and domain-specific language use have left them largely underutilized, compared to more readily available structured data sources (Tayefi et al., 2021).

Health records are often accompanied by the International Classification of Diseases (ICD) codes – standardized codes that categorize diagnoses and procedures performed during clinical encounters (Cartwright, 2013). Assigning ICD codes manually is a highly time-consuming task necessary for billing, therefore previous research has been developing multi-label classification systems to detect applicable codes that should be assigned while writing a given document (Mullenbach et al., 2018; Liu et al., 2021). The research focus has been on classifying discharge summaries, which are written at the end of a patient's hospital stay (Ji et al., 2021; Dai et al., 2022). While this setup provides a useful proxy task, the complete EHR sequence is much longer, containing detailed reports from nursing and radiology, along with specialized notes on echographies and cardiograms (ECG), among others (Table 1). Recent work has argued that for most practical applications such code classification should be performed on earlier medical notes instead of the discharge summary (Cheng et al., 2023).

In this work, we investigate the potential of predicting ICD codes for the whole patient stay at different time points during their stay.
Beyond the task of detecting codes for a given note, we treat ICD codes as a structured summary of all the treatments provided and diagnoses assigned during a hospital stay. The development of models for predicting this information early in the clinical timeline, based on partial indicators even before these codes have been officially assigned, would open many possibilities for predictive medicine. Such systems would go beyond the post-discharge diagnostic practices and could be used for identifying early disease risks, suggesting potential treatments or optimizing hospital resource allocation.

We investigate the feasibility of this novel task and evaluate to what extent the final set of ICD codes can be predicted at earlier stages during the hospital stay. In addition, we propose a custom model for this task that is able to improve prediction accuracy at different time steps. Unlike previous ICD code prediction models, the architecture is designed with causal attention to ensure that representations at any point throughout a patient's hospital stay are constructed based on the notes available up to that point, without accessing information in the future. The model is then optimized to predict ICD codes after every additional note in the input sequence, instead of only at the discharge summary, teaching it to make predictions at any chosen time point during the hospital stay. This task poses additional challenges, as the length of the complete EHR sequence far exceeds that of the discharge summary and early notes have a weaker correlation with the final labels. We introduce a novel method that both augments the data during training and extends the context during inference, substantially enhancing the performance on early ICD code prediction. The code for the model and the experiments are available online.[1]

[1] https://github.com/mireiahernandez/icd-continuous-prediction

2 Related Work

The closest previous research to ours has been on automating ICD code assignment. Given a document mentioning diagnoses and treatments in free-form text, the aim is to detect the correct codes that should be assigned by the clinician. The first attempts at this task primarily relied on convolutional neural networks (CNNs) (Mullenbach et al., 2018; Li and Yu, 2020; Liu et al., 2021) and long short-term memory networks (LSTMs) (Vu et al., 2020; Yuan et al., 2022). These models utilized pre-trained word2vec embeddings (Mikolov et al., 2013) and combined neighbouring word representations using convolutional filters or recurrent architectures. Despite their simplicity, some of these models achieved very high-performance baselines that were difficult to surpass with transformer approaches (Ji et al., 2021).

Efforts to apply pre-trained transformers without further modifications to the ICD coding problem were unsuccessful (Ji et al., 2021; Dai et al., 2022). The discharge summary contains 3,594 tokens on average, while the combined set of notes contains an average of 21,916 tokens per patient stay (Ng et al., 2023). Crucial information to predict patient diagnoses is likely to be dispersed throughout these notes, thus models with limited context length risk overlooking a significant portion of relevant data. For this reason, subsequent studies focused on adapting transformer architectures to process longer textual sequences.

PubMedBERT-hier (Ji et al., 2021) employed hierarchical transformers to mitigate the length limitation issue, obtaining substantially better results. This approach segments the document into chunks of 512 tokens and employs a BERT-based model pre-trained on the biomedical domain to encode each segment (Gu et al., 2022). The segments are then combined using a hierarchical transformer running over the CLS-token embeddings. The TrLDC model (Dai et al., 2022) further improved performance by employing a RoBERTa-based model pre-trained from scratch on biomedical articles and clinical notes (Lewis et al., 2020).

The PAAT (Partition Attention) model (Kim, 2022) was able to surpass LSTM-based models like MSMN (Yuan et al., 2022) on the task of identifying the top 50 labels. PAAT combines the Clinical LongFormer and a bi-LSTM, employing partition-based label attention for improved performance. HiLAT (Hierarchical Label-Attention) (Liu et al., 2022) achieved strong results on the top 50 labels by utilizing ClinicalPlusXLNet, which outperforms other transformers like RoBERTa variants, with the downside being that the training speed is four times slower due to its bidirectional context capturing (Liu et al., 2022).
The HTDS (Hierarchical Transformer for Document Sequences) model (Ng et al., 2023) integrated earlier notes into the input context when making decisions about the discharge summary. This model employs a RoBERTa base transformer and a separate transformer layer running over the individual token representations, not only the CLS embeddings. They found that the earlier notes were indeed useful as additional evidence at the end of the hospital stay and provided performance improvements when classifying discharge summaries.

All this prior work has trained systems to assign ICD codes at the end of the hospital stay, whereas we investigate models for making predictions at any point during the stay. Furthermore, while previous work has focused on detecting explicit mentions of diagnoses and treatments in a given text, we investigate to what extent future labels can be inferred based on only earlier documents.

3 Architecture

We investigate a model architecture that can be trained to encode a long temporal sequence of many clinical notes and make predictions at any time point, only using earlier notes as context. The model breaks the sequence into smaller chunks and encodes them using a hierarchical transformer. These chunks are combined with causal label attention, which gathers evidence with label-specific attention heads while ensuring that representations at any time are constructed based on the notes available up to that point, without accessing information in the future. Finally, a probability distribution across the labels is predicted at each possible time point. We refer to the model as a Label-Attentive Hierarchical Sequence Transformer (LAHST) and describe it in more detail below. Figure 1 provides a diagram of the architecture.

Step 1. Document splitting. Each document within the EHR sequence of a patient is tokenized and split into chunks of T tokens. Each patient has a variable total number of chunks, and during training, a maximum of N chunks is selected based on the criteria described in the next section.

Step 2. Chunk encoding. Each of the chunks is encoded with a pre-trained language model (PLM), extracting the CLS-token embedding as the representation, yielding a tensor e ∈ R^{N×D}. We use the RoBERTa-base-PM-M3-Voc checkpoint, as it has been trained on two domains that match our task closely: 1) PubMed and PMC, which cover biomedical publications, and 2) MIMIC-III, which contains clinical health records (Lewis et al., 2020).

Step 3. Causal attention. We augment a transformer layer with causal attention (Choromanski et al., 2021) in order to combine temporal information from any previous step without providing access to information in the future steps. At the same time, the whole sequence can be efficiently processed in parallel by masking any attention connections on the right side of the target position. This component takes as input the sequence of chunk embeddings e ∈ R^{N×D} and generates a sequence of embeddings h ∈ R^{N×D} which combines information over past documents:

  h_i = CausalAttn(e_1, ..., e_i),  i ∈ [1, N]   (1)

Step 4. Masked multi-head label attention. We apply label-wise attention (Mullenbach et al., 2018) with two key modifications: the use of multiple attention heads, and the use of causal masking to obtain temporal label-wise document embeddings. For each temporal position t, we define an attention mask a_t ∈ R^{L×N} to prevent attention to future notes, which is constant in the label dimension, and nullifies attention weights beyond temporal position t:

  a_t[:, i] = 0 if i ≤ t, −∞ otherwise   (2)

We then combine this mask with multi-head attention (Vaswani et al., 2017) using learnable label embeddings q ∈ R^{L×D} as queries and the previously generated past context embeddings h ∈ R^{N×D} as keys and values:

  d_t = MultiHeadAttn(q, h, h, a_t) = Concat(head_1, ..., head_H) W^o   (3)

Here, each head inputs a linear projection of the key, query and value embeddings e_{k,i} = W_i^K q, e_{q,i} = W_i^Q h, e_{v,i} = W_i^V h and applies masked attention (Choromanski et al., 2021) as follows:

  head_i = Attention(e_{k,i}, e_{q,i}, e_{v,i}, mask = a_t) = SoftMax(e_{k,i} e_{q,i}^T / √(D/H) + a_t) e_{v,i}   (4)

This yields a sequence of label-wise document embeddings d_t ∈ R^{L×D} for each position t ∈ {1, ..., N}. In practice, we obtain all the embeddings d ∈ R^{N×L×D} efficiently in one pass by assigning the batch dimension to the temporal dimension.
Figure 1: LAHST (Label-Attentive Hierarchical Sequence Transformer) architecture. Clinical notes generated
throughout the hospital stay are split into chunks. Each chunk is encoded using a pre-trained language model (PLM)
to extract the CLS-token embedding. Next, a hierarchical transformer encoder is applied, utilizing causal masking
to combine information among past segment embeddings. Finally, the network generates a distinct document
representation for each label and temporal point combination and these are then transformed into probabilities by
the output layer.

Step 5. Temporal label-wise predictions. Finally, temporal probabilities are calculated by projecting the embedding using linear weights w ∈ R^{L×D} followed by a sigmoid activation. The probability at time t for label l is calculated using the label weight w_l ∈ R^D and the label document embedding at position t, denoted as d_{t,l} ∈ R^D:

  p_{t,l} = sigmoid(w_l · d_{t,l})   (5)

The output of the model is a probability matrix p ∈ R^{N×L}, containing probabilities for each label at each temporal point. The masking process within the transformer and label attention modules ensures that time t probability calculations consider only past documents. The model is trained using the binary cross-entropy loss.
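A short sketch of Step 5 and the training objective, under the same caveat that function and variable names are illustrative: each label-wise embedding is projected with its own weight vector and passed through a sigmoid, and the binary cross-entropy loss is applied against the same stay-level code set at every temporal position.

import torch
import torch.nn.functional as F

def temporal_label_probabilities(d, w):
    # d: label-wise document embeddings, shape (N, L, D); w: output weights, shape (L, D)
    logits = torch.einsum("nld,ld->nl", d, w)   # w_l . d_{t,l} for every position t and label l (Eq. 5)
    return torch.sigmoid(logits)                # probability matrix p, shape (N, L)

def temporal_bce_loss(d, w, y):
    # y: multi-hot vector of the stay's final ICD codes, shape (L,)
    logits = torch.einsum("nld,ld->nl", d, w)
    targets = y.float().unsqueeze(0).expand_as(logits)   # same codes supervised at each time step
    return F.binary_cross_entropy_with_logits(logits, targets)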
4 Extending the Context

Hierarchical transformer architectures break long inputs into smaller components and reduce the number of long-distance attention operations, thereby keeping memory and computation requirements more manageable when processing very long sequences. This makes them well-suited for ICD code classification, as local context is more important for this task and hierarchical models have been shown to outperform long-context models in this setting (Dai et al., 2022). However, even hierarchical models have difficulty with very long sequences, particularly during training. The gradient must be backpropagated through each individual chunk encoding, which can easily cause memory issues when the models are large and the number of segments exceeds a maximum limit.

For this reason, we propose a novel solution for applying hierarchical transformers to very long document sequences, such as the sequences of notes in EHR. We refer to this method as the Extended Context Algorithm (ECA). It consists of the following modifications to the training and inference loops.

Training (Algorithm 1). For each episode of training, the loop iterates over the training dataset Dtrain, processing each data sample (s, y), where s is the input sequence and y is the corresponding label. Within the loop, a random selection of notes is chosen to create a subset of the input sequence, with the maximum number of chunks set as Nmax. These sub-sequences s′ are then used for optimizing the model, each time sampling a slightly different training instance. Instead of trying to fit the whole sequence into the input during training, we sample notes and form multiple different shorter versions of the sequence for training the model. This has the added benefit of creating a data augmentation effect, as the model learns to make decisions based on different versions of the same datapoint.

Algorithm 1 ECA Training loop
  Dtrain ← training set (sequence-label pairs)
  Nmax ← max. number of chunks
  for each episode do
    for each (s, y) in Dtrain do
      m ← min(Nmax, len(s))
      select m random indices i1, ..., im
      sort i1, ..., im in ascending order
      s′ ← [s[i1], ..., s[im]]
      p, h ← model.forward(s′)
      L ← BCE(y, p)
      do backward pass and optimizer step
    end for
  end for

Inference (Algorithm 2). During inference, we process all the notes in the sequence in batches of Nmax chunks. Each sequence batch, denoted as sbatch, is encoded to obtain embeddings hbatch ∈ R^{Nmax×D}. Even if the full sequence does not fit into memory, it can be processed in separate batches to obtain all the hbatch embeddings. These embeddings are then concatenated along the batch dimension to obtain chunk embeddings for the complete sequence h ∈ R^{Ntotal×D}. Finally, the collected embeddings are passed through causal attention and masked multi-head label attention to obtain predictions p based on the complete sequence.

Algorithm 2 ECA Inference loop
  Dtest ← test set (no labels)
  Nmax ← max. number of chunks
  for each s in Dtest do
    hlist ← empty list
    for i in range(0, len(s), Nmax) do
      sbatch ← s[i : i + Nmax]
      pbatch, hbatch ← model.forward(sbatch)
      append hbatch to hlist
    end for
    h ← concatenate hlist along batch dim.
    p ← model.label_attention(h)
  end for

As the computation can be performed in separate batches and then combined, this allows for considerably longer sequences to be used as input during inference. Unlike other methods for extending the context of transformers that rely on reducing or compressing long-distance attention (Beltagy et al., 2020; Munkhdalai et al., 2024), this proposed method is also exact – the result is always the same as it would be with a single pass using infinite memory.
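As a complement to the pseudocode above, the following is a runnable Python sketch of Algorithms 1 and 2. The model interface (a forward method returning per-chunk predictions and embeddings, plus a separate label_attention head) is assumed purely for illustration and does not correspond to a specific released API.

import random
import torch
import torch.nn.functional as F

def eca_train_step(model, optimizer, s, y, n_max=16):
    # Algorithm 1: optimize on a random, temporally ordered subset of at most n_max chunks.
    # s: list of chunks for one stay; y: multi-hot float tensor of the stay's ICD codes, shape (L,)
    m = min(n_max, len(s))
    indices = sorted(random.sample(range(len(s)), m))
    s_sub = [s[i] for i in indices]
    p, _ = model.forward(s_sub)                           # predictions for the sampled sub-sequence
    loss = F.binary_cross_entropy(p, y.expand_as(p))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def eca_inference(model, s, n_max=16):
    # Algorithm 2: encode the full sequence in batches of n_max chunks, then
    # apply label attention once over the concatenated chunk embeddings.
    h_list = []
    for i in range(0, len(s), n_max):
        _, h_batch = model.forward(s[i:i + n_max])        # (<= n_max, D) chunk embeddings
        h_list.append(h_batch)
    h = torch.cat(h_list, dim=0)                          # (N_total, D)
    return model.label_attention(h)                       # probabilities for every time point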
5 Experiment Set-up

5.1 Evaluation framework

We investigate the novel task of temporal ICD code prediction, which requires the prediction of ICD codes at any point during the hospital stay using the notes available at that time, without relying on the discharge summary. To evaluate the performance, we will compare the predictive power of our model at different points throughout the EHR sequence. Our evaluation setup is inspired by the ClinicalBERT model (Huang et al., 2019), which evaluates the likelihood of readmission at different cut-off times since admission.

The cut-off times were selected to be the 25%, 50%, and 75% percentiles of the total volume of notes present in the training dataset, which are shown in Table 2. These correspond to 2, 5, and 13-day cut-offs, respectively. For example, in the 2-day setting, the model only has access to the notes written in the first 2 days in order to predict all the ICD codes that will be assigned to that patient by the end of their hospital stay. Where space allows, we additionally report on all the notes up to (but excluding) the discharge summary, indicating a setting where the model could be used to assist in the writing of the discharge summary itself. For comparison, we also report performance on the full sequence which includes the discharge summary, although this setting is retrospective and would not provide any predictive benefit. In line with widely used approaches to ICD coding (Mullenbach et al., 2018), we focus on Micro-F1, Micro-AUC and Precision@5 metrics, with additional metrics provided in the appendix.

Percentile   Days elapsed   # notes
25%          1.8            112,594
50%          5.2            225,160
75%          12.8           337,726

Table 2: Percentiles of the total volume of notes present in the training dataset. The number of days corresponding to the 25th, 50th, and 75th percentiles will be used as temporal evaluation points throughout this project.

5.2 Preprocessing

We use the MIMIC-III dataset (Johnson et al., 2016) for evaluation, as it contains a collection of Electronic Health Records with timestamped free-text reports by nurses and doctors, together with the corresponding ICD-9 labels. First, we follow the preprocessing steps outlined by the CAML approach (Mullenbach et al., 2018) to obtain a dataset of free-text clinical notes paired with ICD diagnoses and procedure codes, and we also extract their proposed train/dev/test splits. The label space is vast, so following their method, we focus on predicting the top 50 codes.

For our novel task, we perform some additional preprocessing steps. First, we extract the timestamps of each note and, in cases where the specific time is missing, assign it to 12:00:00 of that day. Moreover, we found that some patients had additional notes beyond the discharge summary document, such as other discharge summaries or nursing notes. We exclude these additional notes to ensure that our EHR sequence always concludes with a single discharge summary document. We also exclude 14 patients as their EHR contains no other notes besides the discharge summary. Table 3 displays the statistics of our dataset at various temporal cut-offs.

             # chunks / patient   # patients
2 days       17.9 ± 22.1          1,559
5 days       27.6 ± 33.9          1,559
13 days      35.8 ± 42.4          1,559
excl. DS     40.4 ± 47.7          1,559
last day     48.4 ± 48.1          1,573

Table 3: Length of EHR in number of chunks per patient (average and standard deviation) and count of patients of our dataset at different temporal cut-offs (dev set).

5.3 Implementation details

The model is implemented in PyTorch and was trained on an Nvidia GeForce GTX Titan Xp (12GB RAM) GPU, utilizing an average memory of 11.22 GB. The model processed 5 samples per second and training took an average time of 11 hours and 50 minutes. We used a super-convergence learning rate scheduler (Smith and Topin, 2019), based on its use in HTDS (Ng et al., 2023), and an early-stopping strategy with a 3 epoch patience and a maximum of 20 epochs. Chunk size T was set to 512 tokens as that is the largest size supported by RoBERTa-base-PM-M3-Voc. For the main experiments, a limit of Nmax = 16 was used during training, while the entire sequence (with up to 181 chunks) was used for inference. The tuning ranges and chosen hyperparameter values are included in Appendix A.
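To gather the implementation details in one place, a hypothetical configuration dictionary is given below; the key names are invented for illustration, and only values stated explicitly in Section 5.3 and Appendix A are filled in (the chosen values from the hyperparameter search are not repeated here).

# Illustrative configuration; key names are assumptions, values are those stated in the text.
config = {
    "plm_checkpoint": "RoBERTa-base-PM-M3-Voc",
    "chunk_size": 512,                        # T, the maximum input length of the PLM
    "train_max_chunks": 16,                   # Nmax sampled per stay during training
    "inference_max_chunks": 181,              # the entire sequence is used at inference
    "lr_schedule": "super-convergence (one-cycle)",
    "early_stopping_patience": 3,             # epochs
    "max_epochs": 20,
    "labels": "top 50 ICD-9 codes",
    "loss": "binary cross-entropy",
    "peak_lr_search_range": [1e-5, 5e-5, 1e-4],     # Appendix A
    "num_layers_search_range": [1, 2, 3],           # masked transformer
    "num_heads_search_range": [1, 2, 3],            # masked transformer and label attention
}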
Model     Last day            0-13 days           0-5 days            0-2 days
          F1    AUC   P@5     F1    AUC   P@5     F1    AUC   P@5     F1    AUC   P@5
TrLDC     70.1  93.7  65.9    -     -     -       -     -     -       -     -     -
PMB-H     67.2  91.5  63.0    30.7  68.0  30.2    31.3  68.4  31.0    31.7  68.7  31.5
HTDS      73.3  95.2  68.1    49.7  82.1  47.6    47.5  80.6  45.9    44.5  78.7  43.6
HTDS*     70.7  93.8  66.2    48.6  82.0  47.0    46.7  80.7  45.5    43.6  78.7  43.3
LAHST     70.3  94.6  67.5    52.9  87.0  53.0    50.3  85.4  50.7    46.0  82.9  47.1

Table 4: Evaluation on the early ICD code prediction task at increasingly challenging temporal cut-offs. TrLDC
(Dai et al., 2022) result is from the respective paper. We evaluated PubMedBERT-Hier (PMB-H; Ji et al., 2021) and
HTDS (Ng et al., 2023) at different early prediction points. HTDS* is a version of HTDS that is more comparable
to LAHST in terms of computation requirements. LAHST is the model described in Sections 3 and 4. Results for
PMB-H, HTDS, HTDS* and LAHST are averaged over 3 runs with different random seeds.

6 Results

In addition to the LAHST framework described in Sections 3 and 4, we also evaluate PubMedBERT-Hier (Ji et al., 2021) and HTDS (Ng et al., 2023) on the early prediction task. HTDS was trained to consider earlier notes in the context while making decisions about the discharge summary, making it the most likely existing model to also perform well on the early prediction task. In addition, HTDS results are very close to the state-of-the-art on the MIMIC-III dataset, making it a very strong baseline. However, HTDS is a larger model and requires considerably more GPU resources compared to LAHST. Therefore, we also report a modified version (HTDS*) which has a comparable number of parameters. We also include the performance of TrLDC (Dai et al., 2022) from the respective paper as an additional strong baseline on classification of the discharge reports.

In Table 4 we report the performance of these systems at increasingly challenging temporal cut-offs. LAHST shows strong performance at any time point, outperforming all the other models at every early prediction task. The results indicate that some of the diagnosis and treatment codes for the whole hospital stay can be predicted already within the first few days of admission. While the performance of all systems is expectedly lower in the more challenging settings, they are still able to reach 46% F1 and 82.9% AUC with only 2 days of information, which could provide useful predictions to the hospital staff. In all the early prediction settings, LAHST achieves the best results according to all metrics. While HTDS is trained to look at earlier documents and is also able to make competitive predictions, it is reliant on information in the discharge summary and therefore underperforms when this is not available. In contrast, the LAHST model is trained to make predictions based on varying amounts of evidence and achieves the best performance.

In the "Last day" setting, which includes the discharge summary, HTDS slightly outperforms LAHST – this is expected, as HTDS is a larger model and specifically trained for discharge summaries. However, when compared to the similarly-sized HTDS*, LAHST delivers comparable F1 along with improved AUC and P@5. Even though LAHST is not trained for this particular setting, the supervision on earlier time points helps it achieve good results also when classifying discharge summaries. In addition, it outperforms both PubMedBERT-Hier and TrLDC according to all metrics. We include larger results tables with additional metrics in Appendix B.

7 Analyzing the Extended Context

Selection of context during inference. The Extended Context Algorithm (ECA) allows the model to include much longer EHR sequences in the context during inference (with a generous 181 chunk cap applied in our experiments). We evaluate the effect of this algorithm compared to alternative strategies used in other hierarchical models. The "Last" setting uses the most recent 16 chunks of text, illustrating the setting where the sequence is truncated from the beginning in order to fit into the model. The "Random" setting samples a random subset of chunks from the sequence instead. The results in Table 5 show that processing the entire sequence with ECA yields substantial performance improvements (+12.7, +13.3 and +15.4 Micro-F1 score for 2 days, 5 days and 13 days) compared to truncating or sampling the sequence. This result highlights the importance of including all the available notes in the input. Only when the discharge note is available (in the "Last day" setting) do the previous notes become less important, and all the strategies give the same performance.

             Last          Random        ECA
0-2 days     33.5 ±0.6     29.2 ±0.5     46.2 ±0.1
0-5 days     37.6 ±0.3     35.1 ±0.4     50.9 ±0.1
0-13 days    38.2 ±0.1     37.7 ±0.5     53.6 ±0.2
Excl. DS     37.9 ±0.2     38.4 ±0.4     54.3 ±0.2
Last day     71.0 ±0.3     71.3 ±0.2     71.1 ±0.1

Table 5: Micro-F1 score on the development set, using the LAHST model with alternative strategies for context inclusion.
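The three inference-time strategies compared in Table 5 differ only in which chunk embeddings are kept before the label-attention layer; the small hypothetical helper below makes the distinction explicit (the function name and signature are illustrative).

import random

def select_chunk_indices(num_chunks, strategy, n_max=16):
    # "Last": truncate to the most recent n_max chunks (the usual hierarchical setup).
    # "Random": sample up to n_max chunks uniformly, keeping temporal order.
    # "ECA": keep the full sequence; batched encoding (Algorithm 2) makes this feasible.
    indices = list(range(num_chunks))
    if strategy == "Last":
        return indices[-n_max:]
    if strategy == "Random":
        return sorted(random.sample(indices, min(n_max, num_chunks)))
    return indices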
Selection of context during training. We investigate the effect of randomly sampling different sub-sequences of notes during training. We train an alternative version of LAHST by truncating the sequence to the most recent notes instead of sampling them randomly. During inference, both versions still receive all the notes as input, as described in Algorithm 2. The results in Table 6 show how training without random sampling substantially decreases performance across all evaluation points (-8.4, -8.4, -8.2, -8.0 and -0.9 F1, respectively). This indicates that randomly sampling different sub-sequences during optimization augments the training data with different variations, which helps the model better generalize to different temporal cut-offs, without increasing memory or computation requirements.

             Last          Random/ECA
0-2 days     37.8 ±0.2     46.2 ±0.1
0-5 days     42.5 ±0.1     50.9 ±0.1
0-13 days    45.4 ±0.1     53.6 ±0.2
Excl. DS     46.3 ±0.1     54.3 ±0.2
Last day     70.2 ±0.2     71.1 ±0.1

Table 6: Micro-F1 of LAHST on the development set, using alternative sampling strategies during training.

8 Model Interpretability

The attention weights in the label-attention layer of LAHST can potentially be used as an importance indicator of different input notes. A higher weight is associated with an increased relevance of the particular document to predicting a specific code. For an initial visualization, we average the weights across all the codes to find which document types are most important at different temporal cut-offs.

The results are shown in Fig. 2. Within the 2-day cut-off, all the reports that have diagnostic characteristics have received the highest attention weights. For example, the echocardiography report is the description of an ultrasound test to identify abnormalities in the heart structure and is used by cardiologists to diagnose heart diseases (Van et al., 2023). The radiology reports detail the results of imaging procedures such as X-rays and MRIs to diagnose diseases (Alarifi et al., 2021). All such reports are highly technical and are specifically created to assist physicians with diagnostic practices. In the absence of the discharge summary, they are the most valuable document types for making early predictions of ICD-9 codes and the network has correctly focused more attention on them. In the "Last day" setting, the discharge summary becomes available, containing an overview of the entire hospital stay, and the same model is able to switch most of its attention to it.

Figure 2: Average attention weight per document type at different temporal cut-offs: (a) 2-day cut-off, (b) last-day cut-off. The LAHST model processes the complete EHR sequence and focuses more on reports of diagnostic tests for early prediction, switching to the discharge summary when it is available.

9 Conclusions

In this study, we investigated the potential of predicting ICD codes for the whole patient stay at different time points during their stay. Being able to predict likely diagnoses and treatments in advance would have important applications for predictive medicine, by enabling early diagnosis, suggestions for treatments, and optimization of resource allocation. We designed a specialized architecture (LAHST) for this task, which uses a hierarchical structure combined with label attention and causal attention to efficiently make predictions at any possible time point in the EHR sequence. The Extended Context Algorithm was further proposed to allow the model to better handle very long sequences of notes. The system is trained by sampling different sub-sequences of notes, which allows the model to fit into memory while also augmenting the data with variations of available examples. During inference, the whole sequence is then processed separately in batches and combined together with a single attention layer, allowing for lossless representations of very long context to be calculated.

Our experiments showed that useful predictions regarding the final ICD codes for a patient can be made already soon after the hospital admission. The LAHST model substantially outperformed existing approaches on the early prediction task, while also achieving competitive results on the standard task of assigning codes to discharge summaries.
The model achieved 82.9% AUC already 2 days after admission, indicating that it is able to rank and suggest relevant ICD codes based on limited information very early into a hospital stay.

10 Limitations

The primary focus of this project was to investigate the feasibility of this novel task and explore a novel architecture for the early prediction of ICD codes. Even though this could open up new avenues for early disease detection and procedure forecasting, our work has some limitations that should be considered in future work.

Firstly, our study is limited to the MIMIC dataset as it is one of the largest and most established available datasets containing electronic health records and ICD codes. However, the findings based on this dataset may not generalise equally to every clinical setting. Therefore, new experiments would need to be conducted on representative data samples before considering applying such technology in practice.

Our experiments focused on PubMedBERT-Hier (Ji et al., 2021), HTDS (Ng et al., 2023) and LAHST. However, there are many other architectures and pre-trained models available which could be investigated in this setting.

Our model is based on a hierarchical transformer architecture which achieves good performance but is also quite computationally expensive compared to LSTM or CNN-based approaches (training the model took roughly 12 hours on a 12GB GPU). With our computational resources, we were limited to running experiments using the [CLS]-token representation and a maximum of 16 chunks in a batch. However, with additional resources this work could be further scaled up by retaining all token representations and increasing the model size to allow for the allocation of additional chunks.

Finally, our evaluation of the temporal ICD coding task is focused on reporting the aggregate metrics for the top 50 ICD-9 coding labels. Future work could investigate a larger number of labels, along with analysing the performance separately on individual labels and label types.

11 Ethics Statement

After careful consideration, we have determined that no ethical conflicts apply to this project. While clinical data is inherently sensitive, it is important to note that the MIMIC-III dataset has undergone a rigorous de-identification process, following the guidelines outlined by the Health Insurance Portability and Accountability Act (HIPAA). This de-identification process ensures that the dataset can be used for research purposes on an international scale (Johnson et al., 2016).

While no conflicts were identified, machine learning systems for ICD coding carry certain risks when deployed in hospitals. Firstly, automated approaches are trained in a supervised manner using data from hospitals, and therefore, they are susceptible to reproducing manual coding errors. These errors may include miscoding due to misunderstandings of abbreviations and synonyms or overbilling due to unbundling errors (Sonabend W et al., 2020). Moreover, automated systems may also suffer from distribution shifts, potentially affecting their portability across various EHR systems in different hospitals (Sonabend W et al., 2020). To address these concerns, it is important to build interpretable models and develop tools that enable human coders to supervise the decisions made by ICD coding models.

Acknowledgements

Mireia Hernandez Caralt acknowledges that the project that gave rise to these results received the support of a fellowship from "la Caixa" Foundation (ID 100010434). The fellowship code is LCF/BQ/EU22/11930076.

References

Mohammad Alarifi, Patrick Timothy, Jabour Abdulrahman, Min Wu, and Jake Luo. 2021. Understanding patient needs and gaps in radiology reports through online discussion forum analysis. Insights Imaging, 12(50).

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.

Donna J Cartwright. 2013. ICD-9-CM to ICD-10-CM codes: What? Why? How? Advances in Wound Care, 2(10):588–592.

Hua Cheng, Rana Jafari, April Russell, Russell Klopfer, Edmond Lu, Benjamin Striner, and Matthew Gormley. 2023. MDACE: MIMIC documents annotated with code evidence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7534–7550, Toronto, Canada. Association for Computational Linguistics.
Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamás Sarlós, Adrian Weller, and Thomas Weingarten. 2021. From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In International Conference on Machine Learning.

Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting transformer-based models for long document classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212–7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2022. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):1–23.

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: modeling clinical notes and predicting hospital readmission. Computing Research Repository, arXiv:1904.05342v3. Version 3.

Peter B Jensen, Lars J Jensen, and Søren Brunak. 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405.

Shaoxiong Ji, Matti Hölttä, and Pekka Marttinen. 2021. Does the magic of BERT apply to medical code assignment? A quantitative study. Computers in Biology and Medicine, 139:104998.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-Wei H. H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(160035).

Haanju Yoo, Daeseong Kim, Sewon Kim. 2022. An automatic ICD coding network using partition-based label attention. SSRN Electronic Journal.

Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov. 2020. Pretrained language models for biomedical and clinical tasks: understanding and extending the state-of-the-art. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, page 146.

Fei Li and Hong Yu. 2020. ICD coding from clinical text using multi-filter residual convolutional neural network. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8180.

Leibo Liu, Oscar Perez-Concha, Anthony Nguyen, Vicki Bennett, and Louisa Jorm. 2022. Hierarchical label-wise attention transformer model for explainable ICD coding. Journal of Biomedical Informatics, 133:104161.

Yang Liu, Hua Cheng, Russell Klopfer, Matthew R. Gormley, and Thomas Schaaf. 2021. Effective convolutional attention network for multi-label clinical document classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5941–5953, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, G.s Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR, 2013.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana. Association for Computational Linguistics.

Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. 2024. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143.

Boon Liang Clarence Ng, Diogo Santos, and Marek Rei. 2023. Modelling temporal document sequences for clinical ICD coding. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1640–1649, Dubrovnik, Croatia. Association for Computational Linguistics.

Leslie N Smith and Nicholay Topin. 2019. Very fast training of neural networks using large learning rate. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 1106:369–386.

Aaron Sonabend W, Winston Cai, Yuri Ahuja, Ashwin Ananthakrishnan, Zongqi Xia, Sheng Yu, and Chuan Hong. 2020. Automated ICD coding via unsupervised knowledge integration (UNITE). International Journal of Medical Informatics, 139:104135.

Maryam Tayefi, Phuong Ngo, Taridzo Chomutare, Hercules Dalianis, Elisa Salvi, Andrius Budrionis, and Fred Godtliebsen. 2021. Challenges and opportunities beyond structured data in analysis of electronic health records. Wiley Interdisciplinary Reviews: Computational Statistics, 13(6):e1549.

Phi Nguyen Van, Hieu Pham Huy, and Long Tran Quoc. 2023. Echocardiography segmentation using neural ODE-based diffeomorphic registration field. IEEE Transactions on Medical Imaging.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).

Thanh Vu, Dat Nguyen, and Anthony Nguyen. 2020. A label attention model for ICD coding from clinical text. In Proceedings of IJCAI. doi:10.24963/ijcai.2020/461.
Zheng Yuan, Chuanqi Tan, and Songfang Huang. 2022. Code synonyms do matter: Multiple synonyms matching network for automatic ICD coding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 808–814, Dublin, Ireland. Association for Computational Linguistics.

Appendix A

Hyper-parameter                   Range
Num. Layers (Mask. Transf.)       1, 2, 3
Num. Heads (Mask. Transf.)        1, 2, 3
Num. Heads (Label Atten.)         1, 2, 3
Peak LR                           1e-5, 5e-5, 1e-4

Table 7: The range of hyperparameters searched for tuning the model. The chosen value is shown in bold.

Appendix B
Detailed results tables using different time cut-offs.

0-2 days
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 44.5 39.7 78.7 77.5 43.6
HTDS* 43.6 40.0 78.7 76.1 43.3
LAHST 46.0 40.1 82.9 79.5 47.1

0-5 days
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 47.5 42.5 80.6 79.5 45.9
HTDS* 46.7 42.5 80.7 78.2 45.5
LAHST 50.3 44.6 85.4 82.2 50.7

0-13 days
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 49.7 44.6 82.1 81.2 47.6
HTDS* 48.6 44.6 82.0 79.7 47.0
LAHST 52.9 47.3 87.0 83.8 53.0

Excl DS
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 50.2 45.2 82.3 81.5 47.8
HTDS* 49.0 45.1 82.4 80.2 47.5
LAHST 53.5 47.8 87.3 84.1 53.7

Last Day
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) 68.1 63.3 90.8 88.6 64.4
TrLDC (Dai et al., 2022) 70.1 63.8 93.7 91.4 65.9
HTDS (Ng et al., 2023) 73.3 67.7 95.2 93.6 68.1
HTDS* 70.7 64.9 93.8 91.6 66.2
LAHST 70.3 64.3 94.6 92.6 67.5

