R^{L×D} followed by a sigmoid activation. The probability at time t for label l is calculated using the label weight w_l ∈ R^D and the label document embedding at position t, denoted as d_{t,l} ∈ R^D:

    p_{t,l} = sigmoid(w_l · d_{t,l})    (5)

The output of the model is a probability matrix p ∈ R^{N×L}, containing probabilities for each label at each temporal point. The masking process within the transformer and label attention modules ensures that the probability calculations at time t consider only past documents. The model is trained using the binary cross-entropy loss.

4 Extending the Context

Hierarchical transformer architectures break long inputs into smaller components and reduce the number of long-distance attention operations, thereby keeping memory and computation requirements more manageable when processing very long sequences. This makes them well-suited for ICD code classification, as local context is more important for this task and hierarchical models have been shown to outperform long-context models in this setting (Dai et al., 2022). However, even hierarchical models have difficulty with very long sequences, particularly during training. The gradient must be backpropagated through each individual chunk encoding, which can easily cause memory issues when the models are large and the number of segments exceeds a maximum limit.

For this reason, we propose a novel solution for applying hierarchical transformers to very long document sequences, such as the sequences of notes in EHR. We refer to this method as the Extended Context Algorithm (ECA). It consists of the following modifications to the training and inference loops.

Training (Algorithm 1). For each episode of training, the loop iterates over the training dataset D_train, processing each data sample (s, y), where s is the input sequence and y is the corresponding label. Within the loop, a random selection of
Algorithm 1 ECA Training loop
    D_train ← training set (sequence-label pairs)
    N_max ← max. number of chunks
    for each episode do
        for each (s, y) in D_train do
            m ← min(N_max, len(s))
            select m random indices i_1, ..., i_m
            sort i_1, ..., i_m in ascending order
            s' ← [s[i_1], ..., s[i_m]]
            p, h ← model.forward(s')
            L ← BCE(y, p)
            do backward pass and optimizer step
        end for
    end for

collected embeddings are passed through causal attention and masked multi-head label attention to obtain predictions p based on the complete sequence.

As the computation can be performed in separate batches and then combined, this allows for considerably longer sequences to be used as input during inference. Unlike other methods for extending the context of transformers that rely on reducing or compressing long-distance attention (Beltagy et al., 2020; Munkhdalai et al., 2024), this proposed method is also exact – the result is always the same as it would be with a single pass using infinite memory.
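To make this concrete, below is a minimal PyTorch sketch of the two ECA ingredients described here: random chunk subsampling during training (Algorithm 1) and batched chunk encoding followed by a single combined causal and label attention pass at inference. The module layout, dimensions, and the toy chunk encoder are illustrative assumptions for exposition, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class ToyLAHST(nn.Module):
        """Toy stand-in for the hierarchical label-attention model (illustration only)."""

        def __init__(self, vocab_size=30522, dim=128, n_labels=50, n_heads=4):
            super().__init__()
            # Stand-in chunk encoder: embeds one chunk of token ids and mean-pools it.
            self.chunk_encoder = nn.EmbeddingBag(vocab_size, dim, mode="mean")
            layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            self.causal_transformer = nn.TransformerEncoder(layer, num_layers=1)
            self.label_queries = nn.Parameter(torch.randn(n_labels, dim))
            self.label_attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.label_weights = nn.Parameter(torch.randn(n_labels, dim))

        def encode_chunks(self, chunks, batch_size=16):
            # Chunk encodings are independent of each other, so they can be computed
            # in separate mini-batches and concatenated afterwards.
            parts = [self.chunk_encoder(chunks[i:i + batch_size])
                     for i in range(0, len(chunks), batch_size)]
            return torch.cat(parts, dim=0)                     # (num_chunks, dim)

        def forward(self, chunk_embs):
            x = chunk_embs.unsqueeze(0)                        # (1, num_chunks, dim)
            n = x.size(1)
            causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
            h = self.causal_transformer(x, mask=causal_mask)   # causal document attention
            # Label attention over the document embeddings; for brevity this sketch
            # only computes the prediction for the final time step.
            q = self.label_queries.unsqueeze(0)                # (1, n_labels, dim)
            d, _ = self.label_attention(q, h, h)               # (1, n_labels, dim)
            logits = (d.squeeze(0) * self.label_weights).sum(dim=-1)
            return torch.sigmoid(logits)                       # Eq. (5), one value per label

    model = ToyLAHST()
    chunks = torch.randint(0, 30522, (181, 64))   # e.g. 181 chunks of 64 token ids each
    labels = torch.zeros(50)

    # Training step (Algorithm 1): a chronologically ordered random subset of chunks.
    n_max = 16
    idx, _ = torch.sort(torch.randperm(len(chunks))[:min(n_max, len(chunks))])
    p = model(model.encode_chunks(chunks[idx]))
    loss = nn.functional.binary_cross_entropy(p, labels)
    loss.backward()

    # Inference: encode *all* chunks batch by batch, then run one combined
    # causal + label attention pass over the full sequence (no truncation).
    with torch.no_grad():
        p_full = model(model.encode_chunks(chunks))

Because each chunk embedding is computed independently, splitting the encoding into mini-batches cannot change the combined result, which is why this way of extending the context is exact rather than approximate.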
Table 4: Evaluation on the early ICD code prediction task at increasingly challenging temporal cut-offs. TrLDC
(Dai et al., 2022) result is from the respective paper. We evaluated PubMedBERT-Hier (PMB-H; Ji et al., 2021) and
HTDS (Ng et al., 2023) at different early prediction points. HTDS* is a version of HTDS that is more comparable
to LAHST in terms of computation requirements. LAHST is the model described in Sections 3 and 4. Results for
PMB-H, HTDS, HTDS* and LAHST are averaged over 3 runs with different random seeds.
dictions to the hospital staff. In all the early prediction settings, LAHST achieves the best results according to all metrics. While HTDS is trained to look at earlier documents and is also able to make competitive predictions, it is reliant on information in the discharge summary and therefore underperforms when this is not available. In contrast, the LAHST model is trained to make predictions based on varying amounts of evidence and achieves the best performance.

In the "Last day" setting, which includes the discharge summary, HTDS slightly outperforms LAHST – this is expected, as HTDS is a larger model and specifically trained for discharge summaries. However, when compared to the similarly-sized HTDS*, LAHST delivers comparable F1 along with improved AUC and P@5. Even though LAHST is not trained for this particular setting, the supervision on earlier time points helps it achieve good results also when classifying discharge summaries. In addition, it outperforms both PubMedBERT-Hier and TrLDC according to all metrics. We include larger results tables with additional metrics in Appendix B.

7 Analyzing the Extended Context

Selection of context during inference. The Extended Context Algorithm (ECA) allows the model to include much longer EHR sequences in the context during inference (with a generous 181-chunk cap applied in our experiments). We evaluate the effect of this algorithm compared to alternative strategies used in other hierarchical models. The "Last" setting uses the most recent 16 chunks of text, illustrating the setting where the sequence is truncated from the beginning in order to fit into the model. The "Random" setting samples a random subset of chunks from the sequence instead. The results in Table 5 show that processing the entire sequence with ECA yields substantial performance improvements (+12.7, +13.3 and +15.4 Micro-F1 for 2 days, 5 days and 13 days) compared to truncating or sampling the sequence. This result highlights the importance of including all the available notes in the input. Only when the discharge note is available (in the "Last day" setting) do the previous notes become less important and all the strategies give the same performance.

               Last        Random      ECA
    0-2 days   33.5 ±0.6   29.2 ±0.5   46.2 ±0.1
    0-5 days   37.6 ±0.3   35.1 ±0.4   50.9 ±0.1
    0-13 days  38.2 ±0.1   37.7 ±0.5   53.6 ±0.2
    Excl. DS   37.9 ±0.2   38.4 ±0.4   54.3 ±0.2
    Last day   71.0 ±0.3   71.3 ±0.2   71.1 ±0.1

Table 5: Micro-F1 score on the development set, using the LAHST model with alternative strategies for context inclusion.

Selection of context during training. We investigate the effect of randomly sampling different sub-sequences of notes during training. We train an alternative version of LAHST by truncating the sequence to the most recent notes instead of sampling them randomly. During inference, both versions still receive all the notes as input, as described in Algorithm 2. The results in Table 6 show how training without random sampling substantially decreases performance across all evaluation points (-8.4, -8.4, -8.2, -8.0 and -0.9 F1, respectively). This indicates that randomly sampling different sub-sequences during optimization augments the training data with
               Last        Random/ECA
    0-2 days   37.8 ±0.2   46.2 ±0.1
    0-5 days   42.5 ±0.1   50.9 ±0.1
    0-13 days  45.4 ±0.1   53.6 ±0.2
    Excl. DS   46.3 ±0.1   54.3 ±0.2
    Last day   70.2 ±0.2   71.1 ±0.1

Table 6: Micro-F1 score on the development set when LAHST is trained on truncated note sequences ("Last") versus randomly sampled sub-sequences ("Random/ECA").
8 Model Interpretability

The attention weights in the label-attention layer of LAHST can potentially be used as an importance indicator of different input notes. A higher weight is associated with an increased relevance of the particular document to predicting a specific code. For an initial visualization, we average the weights across all the codes to find which document types are most important at different temporal cut-offs.

The results are shown in Fig. 2. Within the 2-day cut-off, the reports with diagnostic characteristics receive the highest attention weights. For example, the echocardiography report is the description of an ultrasound test to identify abnormalities in the heart structure and is used by cardiologists to diagnose heart diseases (Van et al., 2023). The radiology reports detail the results of imaging procedures such as X-rays and MRIs to diagnose diseases (Alarifi et al., 2021). All such reports are highly technical and are specifically created to assist physicians with diagnostic practices. In the absence of the discharge summary, they are the most valuable document types for making early predictions of ICD-9 codes, and the network has correctly focused more attention on them. In the "Last day" setting, the discharge summary becomes available, containing an overview of the entire hospital stay, and the same model is able to switch most of its attention to it.

[Figure 2; panel (b): Last day cut-off]
Figure 2: Average attention weight per document type at different temporal cut-offs. The LAHST model processes the complete EHR sequence and focuses more on reports of diagnostic tests for early prediction, switching to the discharge summary when it is available.

9 Conclusions

In this study, we investigated the potential of predicting ICD codes for the whole patient stay at different time points during their stay. Being able to predict likely diagnoses and treatments in advance would have important applications for predictive medicine, by enabling early diagnosis, suggestions for treatments, and optimization of resource allocation. We designed a specialized architecture (LAHST) for this task, which uses a hierarchical structure combined with label attention and causal attention to efficiently make predictions at any possible time point in the EHR sequence. The Extended Context Algorithm was further proposed to allow the model to better handle very long sequences of notes. The system is trained by sampling different sub-sequences of notes, which allows the model to fit into memory while also augmenting the data with variations of available examples. During inference, the whole sequence is then processed separately in batches and combined together with a single attention layer, allowing for lossless representations of very long context to be calculated.

Our experiments showed that useful predictions regarding the final ICD codes for a patient can be made already soon after the hospital admission. The LAHST model substantially outperformed existing approaches on the early prediction task, while also achieving competitive results on the standard task of assigning codes to discharge summaries. The model achieved 82.9% AUC already 2 days after admission, indicating that it is
able to rank and suggest relevant ICD codes based on limited information very early into a hospital stay.

10 Limitations

The primary focus of this project was to investigate the feasibility of this novel task and explore a novel architecture for the early prediction of ICD codes. Even though this could open up new avenues for early disease detection and procedure forecasting, our work has some limitations that should be considered in future work.

Firstly, our study is limited to the MIMIC dataset as it is one of the largest and most established available datasets containing electronic health records and ICD codes. However, the findings based on this dataset may not generalise equally to every clinical setting. Therefore, new experiments would need to be conducted on representative data samples before considering applying such technology in practice.

Our experiments focused on PubMedBERT-Hier (Ji et al., 2021), HTDS (Ng et al., 2023) and LAHST. However, there are many other architectures and pre-trained models available which could be investigated in this setting.

Our model is based on a hierarchical transformer architecture which achieves good performance but is also quite computationally expensive compared to LSTM or CNN-based approaches (training the model took roughly 12 hours on a 12GB GPU). With our computational resources, we were limited to running experiments using the [CLS]-token representation and a maximum of 16 chunks in a batch. However, with additional resources this work could be further scaled up by retaining all token representations and increasing the model size to allow for the allocation of additional chunks.

Finally, our evaluation of the temporal ICD coding task is focused on reporting the aggregate metrics for the top 50 ICD-9 coding labels. Future work could investigate a larger number of labels, along with analysing the performance separately on individual labels and label types.

11 Ethics Statement

After careful consideration, we have determined that no ethical conflicts apply to this project. While clinical data is inherently sensitive, it is important to note that the MIMIC-III dataset has undergone a rigorous de-identification process, following the guidelines outlined by the Health Insurance Portability and Accountability Act (HIPAA). This de-identification process ensures that the dataset can be used for research purposes on an international scale (Johnson et al., 2016).

While no conflicts were identified, machine learning systems for ICD coding carry certain risks when deployed in hospitals. Firstly, automated approaches are trained in a supervised manner using data from hospitals, and therefore, they are susceptible to reproducing manual coding errors. These errors may include miscoding due to misunderstandings of abbreviations and synonyms or overbilling due to unbundling errors (Sonabend W et al., 2020). Moreover, automated systems may also suffer from distribution shifts, potentially affecting their portability across various EHR systems in different hospitals (Sonabend W et al., 2020). To address these concerns, it is important to build interpretable models and develop tools that enable human coders to supervise the decisions made by ICD coding models.

Acknowledgements

Mireia Hernandez Caralt acknowledges that the project that gave rise to these results received the support of a fellowship from “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/EU22/11930076.

References

Mohammad Alarifi, Patrick Timothy, Jabour Abdulrahman, Min Wu, and Jake Luo. 2021. Understanding patient needs and gaps in radiology reports through online discussion forum analysis. Insights Imaging, 12(50).

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.

Donna J Cartwright. 2013. ICD-9-CM to ICD-10-CM codes: What? Why? How? Advances in Wound Care, 2(10):588–592.

Hua Cheng, Rana Jafari, April Russell, Russell Klopfer, Edmond Lu, Benjamin Striner, and Matthew Gormley. 2023. MDACE: MIMIC documents annotated with code evidence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7534–7550, Toronto, Canada. Association for Computational Linguistics.

Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamás Sarlós, Adrian
Weller, and Thomas Weingarten. 2021. From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked transformers. In International Conference on Machine Learning.

Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting transformer-based models for long document classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212–7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2022. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Transactions on Computing for Healthcare, 3(1):1–23.

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: modeling clinical notes and predicting hospital readmission. Computing Research Repository, arXiv:1904.05342v3. Version 3.

Peter B Jensen, Lars J Jensen, and Søren Brunak. 2012. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395–405.

Shaoxiong Ji, Matti Hölttä, and Pekka Marttinen. 2021. Does the magic of BERT apply to medical code assignment? A quantitative study. Computers in Biology and Medicine, 139:104998.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-Wei H. H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(160035).

Haanju Yoo, Daeseong Kim, and Sewon Kim. 2022. An Automatic ICD Coding Network Using Partition-Based Label Attention. SSRN Electronic Journal.

Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov. 2020. Pretrained language models for biomedical and clinical tasks: understanding and extending the state-of-the-art. Proceedings of the 3rd Clinical Natural Language Processing Workshop, page 146.

Fei Li and Hong Yu. 2020. ICD Coding from Clinical Text Using Multi-Filter Residual Convolutional Neural Network. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8180.

Leibo Liu, Oscar Perez-Concha, Anthony Nguyen, Vicki Bennett, and Louisa Jorm. 2022. Hierarchical label-wise attention transformer model for explainable ICD coding. Journal of Biomedical Informatics, 133:104161.

Yang Liu, Hua Cheng, Russell Klopfer, Matthew R. Gormley, and Thomas Schaaf. 2021. Effective convolutional attention network for multi-label clinical document classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5941–5953, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, G. S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana. Association for Computational Linguistics.

Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. 2024. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143.

Boon Liang Clarence Ng, Diogo Santos, and Marek Rei. 2023. Modelling temporal document sequences for clinical ICD coding. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1640–1649, Dubrovnik, Croatia. Association for Computational Linguistics.

Leslie N Smith and Nicholay Topin. 2019. Very fast training of neural networks using large learning rate. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 1106:369–386.

Aaron Sonabend W, Winston Cai, Yuri Ahuja, Ashwin Ananthakrishnan, Zongqi Xia, Sheng Yu, and Chuan Hong. 2020. Automated ICD coding via unsupervised knowledge integration (UNITE). International Journal of Medical Informatics, 139:104135.

Maryam Tayefi, Phuong Ngo, Taridzo Chomutare, Hercules Dalianis, Elisa Salvi, Andrius Budrionis, and Fred Godtliebsen. 2021. Challenges and opportunities beyond structured data in analysis of electronic health records. Wiley Interdisciplinary Reviews: Computational Statistics, 13(6):e1549.

Phi Nguyen Van, Hieu Pham Huy, and Long Tran Quoc. 2023. Echocardiography segmentation using neural ODE-based diffeomorphic registration field. IEEE Transactions on Medical Imaging, XX.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).

Thanh Vu, Dat Nguyen, and Anthony Nguyen. 2020. A label attention model for ICD coding from clinical text. In Proceedings of IJCAI. doi:10.24963/ijcai.2020/461.
Zheng Yuan, Chuanqi Tan, and Songfang Huang. 2022. Code synonyms do matter: Multiple synonyms matching network for automatic ICD coding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 808–814, Dublin, Ireland. Association for Computational Linguistics.
Appendix A
Hyper-parameter Range
Num. Layers (Mask. Transf.) 1,2,3
Num. Heads (Mask. Transf.) 1,2,3
Num. Heads (Label Atten.) 1,2,3
Peak LR 1e-5, 5e-5, 1e-4
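For illustration, the search space above corresponds to a grid of 3 × 3 × 3 × 3 = 81 candidate configurations, which could be enumerated as sketched below; the dictionary keys are descriptive names chosen here for readability, not identifiers from the authors' code.

    import itertools

    search_space = {
        "num_layers_masked_transformer": [1, 2, 3],
        "num_heads_masked_transformer": [1, 2, 3],
        "num_heads_label_attention": [1, 2, 3],
        "peak_learning_rate": [1e-5, 5e-5, 1e-4],
    }

    # Every combination of the values listed in the table above.
    configs = [dict(zip(search_space, values))
               for values in itertools.product(*search_space.values())]
    print(len(configs))  # 81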
Appendix B
Detailed results tables using different time cut-offs.
0-2 days
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 44.5 39.7 78.7 77.5 43.6
HTDS* 43.6 40.0 78.7 76.1 43.3
LAHST 46.0 40.1 82.9 79.5 47.1
0-5 days
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 47.5 42.5 80.6 79.5 45.9
HTDS* 46.7 42.5 80.7 78.2 45.5
LAHST 50.3 44.6 85.4 82.2 50.7
0-13 days
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 49.7 44.6 82.1 81.2 47.6
HTDS* 48.6 44.6 82.0 79.7 47.0
LAHST 52.9 47.3 87.0 83.8 53.0
Excl DS
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) - - - - -
TrLDC (Dai et al., 2022) - - - - -
HTDS (Ng et al., 2023) 50.2 45.2 82.3 81.5 47.8
HTDS* 49.0 45.1 82.4 80.2 47.5
LAHST 53.5 47.8 87.3 84.1 53.7
Last Day
Micro-F1 Macro-F1 Micro-AUC Macro-AUC P@5
PubMedBERT-Hier (Ji et al., 2021) 68.1 63.3 90.8 88.6 64.4
TrLDC (Dai et al., 2022) 70.1 63.8 93.7 91.4 65.9
HTDS (Ng et al., 2023) 73.3 67.7 95.2 93.6 68.1
HTDS* 70.7 64.9 93.8 91.6 66.2
LAHST 70.3 64.3 94.6 92.6 67.5
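The P@5 metric reported in these tables is commonly defined as the fraction of the five highest-scored codes that appear in the gold label set, averaged over examples. A minimal sketch of that usual computation is given below; it is not necessarily the authors' exact evaluation script.

    import numpy as np

    def precision_at_k(scores, gold, k=5):
        # scores: (n_examples, n_labels) predicted probabilities
        # gold:   (n_examples, n_labels) binary ground-truth matrix
        topk = np.argsort(-scores, axis=1)[:, :k]       # indices of the k highest scores
        hits = np.take_along_axis(gold, topk, axis=1)   # 1 where a top-k code is correct
        return hits.mean(axis=1).mean()

    # Tiny illustrative example with 2 admissions and 6 candidate codes.
    scores = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.1],
                       [0.2, 0.9, 0.1, 0.8, 0.3, 0.7]])
    gold = np.array([[1, 0, 1, 0, 0, 0],
                     [0, 1, 0, 1, 0, 0]])
    print(precision_at_k(scores, gold))  # 0.4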