LLMs
Zifeng Wang1∗ , Benjamin Danek1∗ , Ziwei Yang3 , Zheng Chen4 , Jimeng Sun1,2#
1 Department of Computer Science, University of Illinois Urbana-Champaign, Champaign,
IL
2 Carle Illinois College of Medicine, University of Illinois Urbana-Champaign, Champaign, IL
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University
4 Institute of Scientific and Industrial Research, Osaka University
# Corresponding author. Emails: [email protected]
Abstract
Data science plays a critical role in clinical research, but it requires professionals with
expertise in coding and medical data analysis. Large language models (LLMs) have shown
great potential in supporting medical tasks and performing well in general coding tests.
However, these tests do not assess LLMs’ ability to handle data science tasks in medicine,
nor do they explore their practical utility in clinical research. To address this, we de-
veloped a dataset consisting of 293 real-world data science coding tasks, based on 39
published clinical studies, covering 128 tasks in Python and 165 tasks in R. This dataset
simulates realistic clinical research scenarios using patient data. Our findings reveal that
cutting-edge LLMs struggle to generate perfect solutions, frequently failing to follow input
instructions, understand target data, and adhere to standard analysis practices. Conse-
quently, LLMs are not yet ready to fully automate data science tasks. We benchmarked
advanced adaptation methods and found two to be particularly effective: chain-of-thought
prompting, which provides a step-by-step plan for data analysis and led to a 60% improvement
in code accuracy; and self-reflection, which enables LLMs to iteratively refine their
code and yielded a 38% accuracy improvement. Building on these insights, we developed a
platform that integrates LLMs into the data science workflow for medical professionals. In
a user study with five medical doctors, we found that while LLMs cannot fully automate
coding tasks, they significantly streamline the programming process. We found that, on
average, 80% of the code in the submitted solutions was drawn from LLM-generated code, with up to
96% reuse in some cases. Our analysis highlights the potential of LLMs, when integrated
into expert workflows, to enhance data science efficiency in clinical research.
* These authors contributed equally to this work.
1 Introduction
In clinical research, data science plays a pivotal role in analyzing complex datasets, such
as clinical trial data and real-world data (RWD), which are critical for improving patient
care and advancing evidence-based medicine.1 For example, it has been reported that, for a
pharmaceutical company, insights from RWD analysis could unlock up to
$300M in value annually by optimizing clinical trial design and execution.2 Data scientists,
at the core of this process, require years of coding expertise, alongside a deep understand-
ing of diverse medical data types, including patient clinical data and omics data, while also
collaborating closely with medical professionals.3 However, the growing demand for data
science skills and the limited availability of experienced data scientists have become bot-
tlenecks in the clinical research process.4 Coding is central to the work of data scientists,
underpinning essential tasks such as statistical modeling, data cleaning, and visualization
using Python and R. Given this, exploring methods for streamlining the coding process
in clinical research data science is crucial to accelerate drug development and improve
patient outcomes.
Code generation has been extensively explored with the advent of large language mod-
els (LLMs), which have demonstrated strong capabilities in tasks like code completion.5
Continuous efforts have been made in developing more powerful code-specific LLMs,6–8 re-
fining prompting strategies,9 integrating external knowledge through retrieval-augmented
generation (RAG),10, 11 and enabling LLMs’ self-reflection.12 These advancements have further
led to LLM-based platforms for software development13 and data analysis.14 While
LLMs have been evaluated on general programming tests,5, 15–17 software engineering,18
and data analysis,19, 20 assessments specifically targeting clinical research data science
remain scarce. In medicine, recent works have introduced LLMs to automate machine
learning modeling21 and support bioinformatics tool development,22 but they do not cover
broad data science tasks. Therefore, this paper seeks to build a comprehensive code generation
dataset to assess the extent to which cutting-edge LLMs can automate clinical
research data analysis, modeling, and visualization.
Our objective was to evaluate the practical utility of LLMs in handling complex clinical
research data and performing the associated data science tasks. To this end, we identified
39 clinical studies published in medical journals that were linked to patient-level datasets
(Fig. 1a). We started by extracting and summarizing the analyses performed in these
studies, such as patient characteristic exploration and Kaplan-Meier curves. We then
developed the code necessary to reproduce these analyses and the reported results in these
studies. These coding tasks, along with their reference solutions, were all manually crafted
and cross-verified to ensure accuracy. The result was a collection of 293 diverse, high-
quality data science tasks, covering primary tools used in Python and R, e.g., lifelines
for survival analysis in Python and Bioconductor for biomedical data analysis in R.
Additionally, we categorized the difficulty of these tasks into Easy, Medium, and Hard, by
the number of “semantic lines” of code in the reference solutions (Fig. 1d). The semantic
lines metric aggregates lines of code that serve the same operation into a single unit,
providing a clear measure of task complexity. A detailed overview of the dataset and its
characteristics is provided in Fig. 1.
Here, we rigorously evaluated the extent to which clinical research data science tasks
can be automated by LLMs. We benchmarked six state-of-the-art LLMs using various
methods, including chain-of-thought prompting, few-shot prompting, automatic prompt-
ing, self-reflection, and retrieval-augmented generation (Fig. 1f). Our analysis focused on
both the accuracy and quality of the code generated by these models. While we found that
current LLMs are not yet capable of fully automating complex clinical data science tasks,
they do generate code that is highly similar to the correct final solutions. Building on this
insight, we investigated the development of a platform designed to facilitate collaboration
between human experts and artificial intelligence (AI) to streamline coding tasks. This
platform aims to enhance the productivity of clinical researchers by integrating LLMs into
established data science workflows, with a focus on user-friendliness and the reliability of
the outputs. Our results demonstrated that the platform significantly improved human
experts’ efficiency in executing data science tasks, highlighting the promising future of
human-AI collaboration in clinical research data science.
2 Results
2.1 Creating data science tasks from clinical studies
We curated our testing dataset, CliniDSBench, to reflect the real-world challenges that
data scientists face in clinical research. The dataset is grounded in published medical
studies and linked to patient-level datasets from cBioPortal.23 These patient datasets are
diverse, each containing data from hundreds to thousands of patients, including clinical
information such as demographics and survival data, lab results, and omics data like
gene expression, mutations, structural variants, and copy number alterations. Unlike
prior studies that mostly focus on single spreadsheets, each study in our dataset is linked
to multiple spreadsheets, up to five, providing a more complex and realistic basis for
evaluating data science workflows. This setup mirrors the multifaceted nature of real-
world clinical research, where data scientists must integrate and analyze information from
various sources to generate insights. The dataset building and evaluation framework is
illustrated in Fig. 1a.
We obtained the publications associated with these datasets from PubMed and re-
viewed them to extract the types of analyses conducted. Through this process, we iden-
tified common analyses frequently performed in clinical research, such as summarizing
patient baseline characteristics, plotting Kaplan-Meier curves to assess treatment effects
across groups, and creating mutational OncoPrints to highlight significant gene mutations
in specific patients. This extraction process allowed us to filter and refine the initial set of
studies, ensuring both diversity and comprehensiveness in the analyses covered. After this
refinement, we retained 39 studies, which were used to create the final testing dataset.
We designed a series of coding questions in a step-by-step format, mirroring the logical
progression of analyses in the original studies, ultimately leading to the main findings. For
example, a study may include exploratory data analysis, gene mutation analysis to detect
abnormal mutation patterns, survival analysis to visualize patient outcomes, and statisti-
cal tests to verify significance. Correspondingly, we developed coding questions for each
step, ensuring that earlier steps provide the necessary groundwork for subsequent analyses.
As such, each coding task consists of five components: (1) the input question, (2) patient
dataset schema description, (3) prerequisite code (called “prefix”), (4) reference solutions,
and (5) test cases. This design reflects the practical setup data scientists encounter in
real-world projects. In total, we manually curated 128 analysis tasks in Python and 165
in R based on the extracted analyses from 39 studies. As shown in Fig. 1c, the input
questions typically consist of 50-100 words describing the task and output requirements,
while the reference code solutions span more than 20 lines, sometimes exceeding 50 lines,
reflecting the complexity of the tasks.
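For illustration, a minimal sketch of this five-part structure is shown below; the file path, column names, and expected values are hypothetical and indicate only the format, not an actual benchmark item.

# (1) Input question: "Fit Kaplan-Meier estimators of overall survival grouped by sex
#     and store them in a dict named `km_by_sex` keyed by the value of the SEX column."
# (2) Dataset schema: data_clinical_patient.csv with columns PATIENT_ID, SEX, OS_MONTHS, OS_STATUS
import pandas as pd
from lifelines import KaplanMeierFitter

# (3) Prefix code: shared preprocessing that runs before the task-specific code
data_clinical_patient = pd.read_csv("/workdir/data_clinical_patient.csv")
data_clinical_patient = data_clinical_patient.dropna(subset=["OS_MONTHS", "OS_STATUS"])

# (4) Reference solution: one fitted estimator per sex group
km_by_sex = {}
for sex, group in data_clinical_patient.groupby("SEX"):
    kmf = KaplanMeierFitter()
    # an OS_STATUS value such as "1:DECEASED" is turned into a binary event indicator
    kmf.fit(group["OS_MONTHS"], event_observed=group["OS_STATUS"].str.startswith("1"))
    km_by_sex[sex] = kmf

# (5) Test cases appended to the generated code
assert set(km_by_sex) == {"Female", "Male"}
assert all(hasattr(k, "survival_function_") for k in km_by_sex.values())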
We quantitatively assessed the difficulty of each coding task by calculating the number
of semantic lines in the reference solutions. A semantic line aggregates multiple lines of
code that contribute to the same operation, as illustrated in Fig. 1d. This approach pre-
vents the difficulty assessment from being skewed by repetitive or tedious tasks that are
fundamentally simple. The statistics for semantic lines and difficulty levels are presented
in Fig. 1e. Our analysis shows that Python solutions tend to be more complex than R
solutions, particularly for Medium and Hard tasks. This is largely due to R’s rich ecosys-
tem of medical-specific packages, which often allow for more direct solutions. In contrast,
Python frequently requires additional customization and manual coding to achieve similar
outcomes, contributing to higher complexity in Python coding tasks. The pie charts in
Fig. 1b show the libraries frequently used in the reference answers.
2.2 LLMs are not yet ready for fully automated data science
[Figure 1: benchmark overview. a, workflow from extracting analyses in medical publications (e.g., participant baselines, Kaplan-Meier curves, statistical tests, mutation OncoPrints, ridge plots, volcano plots) and collecting multimodal patient data (omics, clinical, clinical sample) to building the coding benchmark with test cases and an execution sandbox; b, libraries used in the reference answers; c, question and solution length statistics; d, semantic-line example; e, difficulty-level statistics; f, the models, methods, and tasks covered by the evaluation.]
[Figure 2: code generation evaluation. a, model inputs (question, dataset description of the linked spreadsheets, and Vanilla, Manual, or Automatic instructions), candidate answer generation, and evaluation against unit tests in an execution sandbox with Pass@1 and Pass@5; b-e, results referenced in the text.]
As illustrated in Fig. 1f, our evaluation framework is composed of three key components: models, methods, and tasks. For the first component, we selected six cutting-edge
LLMs: GPT-4o,24 GPT-4o-mini,25 Sonnet,26 Opus,27 Gemini-Pro,28 and Gemini-Flash.28
These models represent a diverse set of advanced generalist LLMs capable of perform-
ing code generation based on input instructions. To explore their effectiveness and po-
tential room for improvement in clinical research data science tasks, we applied several
adaptation methods: chain-of-thought,29 few-shot prompting,30 automatic prompting,31
self-reflection,12 and retrieval-augmented generation (RAG).32
For evaluation, we assessed the combinations of models and adaptation methods across
three tasks: code generation, code debugging, and human-AI collaboration. The first two
tasks were used to measure the models’ accuracy in automating the coding process. We
employed Pass@k as the primary metric, where k represents the number of attempts the
model is allowed to make to solve a coding task. This metric estimates the probability
that at least one of the k attempts is correct. Specifically, we selected k = 1 as a strict
benchmark to evaluate how well LLMs can automate tasks on the first attempt, providing
insight into their immediate accuracy. Additionally, we used k = 5 as a more relaxed
metric to explore the model’s potential to improve given multiple attempts, offering a
broader assessment of its ability to generate correct solutions when allowed more tries.
We first evaluated the immediate accuracy of LLMs in generating code solutions on
their initial attempt. For each task, the LLM is provided with a raw question that de-
scribes the target task, as well as a dataset description (Fig. 2a). The dataset description
includes details such as spreadsheet names, column names, and common cell values, which
guide the LLM in identifying the correct spreadsheet, column, and values to work with.
Additionally, the instruction section offers supplementary guidance for the LLM during
code generation. We used three types of instructions: Vanilla, Manual, and Automatic.
The Vanilla instruction provides minimal guidance, merely instructing the LLM to solve
the task, while Manual and Automatic instructions are more detailed, either manually
crafted or optimized through automatic prompt generation.31 The generated code solu-
tions must pass all the testing cases to be judged correct.
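As a minimal sketch of this judging step (not the exact harness used in this work), the generated code can be concatenated with the prefix code and the test cases and executed in an isolated process:

import subprocess
import tempfile

def run_with_tests(prefix_code: str, generated_code: str, test_code: str, timeout: int = 120) -> bool:
    """Execute prefix code, the LLM-generated solution, and the unit tests in a
    separate Python process; return True only if every assertion passes."""
    script = "\n\n".join([prefix_code, generated_code, test_code])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(script)
        path = handle.name
    try:
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # treated as a timeout failure
    return result.returncode == 0  # a failed assert or runtime error gives a non-zero code

# Example usage on a trivial task
passed = run_with_tests("import pandas as pd", "n = 19", "assert isinstance(n, int)\nassert n == 19")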
From our experiments, we found that current LLMs cannot consistently produce per-
fect code for clinical research data science tasks across all difficulty levels. As shown in
Fig. 2d, for Python tasks, the Pass@1 scores vary significantly based on task difficulty.
For Easy tasks, most LLMs achieve Pass@1 rates in the range of 0.40-0.80. However,
for Medium tasks, the Pass@1 rates drop to 0.15-0.40, and for Hard tasks, they range
from 0.05 to 0.15. Performance differences also exist between different LLMs, particularly
within the same series. For instance, the lightweight variant GPT-4o-mini generally un-
derperforms compared to its larger counterpart, GPT-4o, with differences in performance
of up to twofold in many cases. This highlights the limitations of current LLMs, especially
as task complexity increases. The trend is similar in R, where performance declines with
increased task difficulty, though there is a significant difference in performance between
Python and R tasks (Fig. 2e).
Diving deeper into the variations across instruction types, we observed that (1) in
Python tasks (Fig. 2d), neither automatically generated prompts nor manually crafted
prompts consistently outperformed the Vanilla prompts. For example, Vanilla performed
better than AutoPrompt in 4 out of 6 LLMs for Easy tasks, 2 out of 6 for Medium tasks,
and 4 out of 6 for Hard tasks. A similar trend was observed for R tasks (Fig. 2e). (2) More
powerful models, such as GPT-4o and Gemini-Pro, showed greater benefits from carefully
crafted instructions, particularly in Easy and Medium Python tasks. In contrast, lighter
models like GPT-4o-mini and Gemini-Flash did not exhibit such improvements, and in
some cases, complex instructions even seemed to hinder performance. This suggests that
lightweight models may struggle to fully interpret and utilize complex instructions, which
can reduce their effectiveness in data science coding tasks.
We adjusted the temperature settings to sample multiple solutions from LLMs and
calculated Pass@5 scores for Python tasks (Fig. 2b). In most cases, increasing the tem-
perature allows LLMs to generate more creative and diverse solutions, resulting in higher
probabilities of producing a correct solution. This trend was consistent across all models,
suggesting the potential benefit of having LLMs brainstorm multiple solutions to reach
better outcomes. On average, LLMs solved more tasks when given five attempts compared
to just one, as measured by Pass@1. However, despite this improvement, the overall per-
formance remains far from perfect.
[Figure 3: a, self-reflection inputs (failed-test logs, runtime logs, and printed intermediate variables) and the proposed REASON/PROPOSAL/code revision; b-d, results of adaptation methods referenced in the text (chain-of-thought in b, self-reflection in c); e, error analysis for Python tasks; f, error analysis for R tasks; g, example error cases, such as refusing to answer, returning text mixed with code, an ImportError from lifelines.plotting, and a KeyError on the missing column 'IDH_1P19Q_SUBTYPE'.]
Next, we tested whether LLMs can repair their own code through self-reflection. For each failed solution, we collected three types of feedback from the first attempt at executing the code solutions: (1) results from running the test cases,
(2) runtime logs that capture any errors encountered during execution, and (3) additional
print statements that show the values and shapes of intermediate variables. These logs
were combined with the original code to help the LLM generate an explanation for the
errors and propose a corrective plan, including revised code (Fig. 3a). We tested self-
reflection over multiple rounds, from 1 to 5, and tracked the trend of Pass@1 performance
throughout the process. From the results, we observed significant improvements through
self-correction. After five rounds of correction, Pass@1 scores increased by an average
of around 0.2 across all task types. For Python tasks, LLMs could solve approximately
60-80% of Easy tasks, 40-50% of Medium tasks, and 20-25% of Hard tasks after self-
correction. For R tasks, LLMs achieved around 40-70% success on Easy tasks, 30-55%
on Medium tasks, and 50-60% on Hard tasks. This represents a substantial improvement
compared to their first attempt, demonstrating the effectiveness and potential of LLMs’
self-reflection capabilities. Notably, most of the improvement occurred within the first
two rounds, with diminishing returns in later rounds, indicating that early corrections are
the most impactful.
We conducted an in-depth analysis of the erroneous LLM-generated solutions for
Python (Fig. 3e) and R tasks (Fig. 3f), categorizing the errors into six types: Tests failure,
Data misoperation, Package misuse, Instruction misfollow, Invalid syntax, and Timeout.
Specifically, Data misoperation refers to errors arising from incorrect operations on the in-
put datasets, such as selecting from non-existing columns. Package misuse includes errors
from passing incorrect arguments to functions, importing incorrect packages, or calling
functions without proper imports. Instruction misfollow occurs when LLMs fail to follow
the provided instructions, leading to outputs that are a mixture of text and code or refus-
ing to answer the question. A couple of example error cases are shown in Fig. 3g. Overall,
most of the erroneous solutions failed the testing cases but could still be executed. The
next most common errors were Data misoperation and Package misuse. After applying
self-reflection, the most significant improvement came from LLMs resolving many of the
Data misoperation and Package misuse errors, making the code executable, which is re-
flected by the increase in errors related to Tests failure. This demonstrates the utility of
LLM self-correction in addressing relatively superficial errors identified through execution
logs. One notable anomaly was Gemini-Pro, which encountered a high number of Instruc-
tion Misfollow errors, especially after self-reflection. This was likely due to Gemini-Pro’s
strict safety policies, which caused the model to refuse to answer certain coding questions,
particularly when trying to do self-correction.
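The text above defines the six categories but does not prescribe an automatic assignment procedure; purely as an illustration, a keyword heuristic over the captured execution logs could look as follows (the rules are assumptions, not the categorization used in this study):

def categorize_error(extracted_code: str, stderr: str, timed_out: bool) -> str:
    """Illustrative heuristic for bucketing a failed solution into the six error types."""
    if timed_out:
        return "Timeout"
    if not extracted_code.strip():
        return "Instruction misfollow"   # no runnable code returned, e.g., a refusal or text-only reply
    if "SyntaxError" in stderr:
        return "Invalid syntax"
    if "KeyError" in stderr or "not in index" in stderr:
        return "Data misoperation"       # e.g., selecting a non-existing column
    if "ImportError" in stderr or "ModuleNotFoundError" in stderr or "TypeError" in stderr:
        return "Package misuse"          # wrong imports or wrong arguments to library functions
    return "Tests failure"               # the code executed but did not pass the unit tests

# Example: a KeyError traceback is bucketed as a data misoperation
print(categorize_error("df['IDH_1P19Q_SUBTYPE']", "KeyError: 'IDH_1P19Q_SUBTYPE'", timed_out=False))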
Upon reviewing the error cases, we found that many of the LLM-generated solutions,
while imperfect, are close to correct and only require minor manual edits. To quantify how
much these solutions can reduce the effort required for data science tasks, we compared the
LLM-generated solutions with the reference solutions (Fig. 2c). Specifically, we conducted
a difference analysis to compute the proportion of reference code that could be replicated
by the LLM-generated solutions. The results show that, despite their imperfections, LLM-
generated solutions are promising for streamlining data science workflows. For instance,
LLMs produced code that covered approximately 60% of the reference solutions for Easy
tasks, around 60% for Medium tasks in R and 40% in Python, and about 50% for Hard
tasks in R and 25% in Python. It is important to note that this metric underestimates
the code similarities, as AI-generated code may achieve the same function in a different
way from the reference solution.
[Figure 4 panel excerpts — survey ratings: Output quality 3.4 (2.5-4.3); Support & Integration 3.0 (2.2-3.8); System complexity 3.5 (2.7-4.3). Example user comments: “I would definitely like to continue working with the system and inputting my own data to learn more about its functionalities.” “I believe having a better understanding of python libraries and data science in general would have benefitted me more because my prompts would be a lot more nuanced to get the exact results I am looking for.”]
Figure 4: Overview of the user study. a, we compare the LLM-generated code captured
in our logging system and the user-submitted answers to highlight the modifications made by
users. b, statistics of difficulty levels of the user-faced coding tasks and the user operations
using our platform. c, the distributions of the proportions of user-submitted code that are
copied and pasted from LLM-generated code. d, the target study we asked users to work
on. e, the aggregated feedback obtained from the questionnaires we sent to users. f, example
collected users’ comments.
The code comparison analysis above (Fig. 2c) shows that LLM-generated code is often close to the correct solution, and Fig. 3e and Fig. 3f indicate
that most code executes successfully but fails only at the final testing stages. Additionally,
LLM self-reflection can resolve most of the bugs. These findings highlight the potential of
LLMs to assist data scientists in streamlining the coding process in clinical research.
To bridge the gap in utilizing LLMs for clinical research and leveraging the insights
from our experiments, we developed a platform that integrates LLMs into data science
projects for clinical research. The architecture of this platform is shown in Extended
Fig. 1. The platform is designed to offer an integrated interface for users to:
• Chat with LLMs to brainstorm and plan analyses, with the ability to query external
knowledge bases, including webpages, research papers, and other resources.
• Generate code for data science tasks through interactions with LLMs, allowing users to
streamline code writing for complex analyses.
• Identify and debug errors in the user-provided code, with LLMs proposing solutions to
improve the code.
The interface supports real-time interactions, allowing users to generate and execute code
in a sandbox environment with instant visualizations. This removes the need for users
to handle complex prompt crafting or manually switch between chat sessions and coding
platforms like Jupyter Notebook. By simplifying the data science workflow, the platform
empowers users with minimal coding expertise to perform complex data science tasks with
ease.
In our user study, we involved five medical doctors with varying levels of coding ex-
pertise. Each participant was assigned three studies,33–35 with approximately 10 coding
tasks per study (Fig. 4d). Users worked with LLMs on our platform to complete these
tasks and submitted their solutions once their code passed all the test cases. The difficulty
levels of the tasks were quantified, with the distribution shown in Fig. 4b. During the
study, we tracked two core actions: code generation and code improvement (debugging)
requests. The statistics of these user behaviors are depicted in Fig. 4b, where most users
completed the first two studies, and a few tackled the third. After the study, we analyzed
the logs to compare the LLM-generated code with the final code solutions submitted by
the users (Fig. 4a). Additionally, we conducted a survey to gather their feedback on the
platform and their experience working with LLMs. The survey questions were built based
on the Health Information Technology Usability Evaluation Scale (Health-ITUES).36
The results of the code comparison analysis are presented in Fig. 4c, showing the
distribution of the proportion of user-submitted code derived from LLM-generated solu-
tions. We found that a significant portion of the user-submitted code was drawn from
AI-generated code. For Easy tasks, the median proportions were 0.88, 0.87, and 0.84
across the three studies, indicating that users heavily relied on LLM-provided solutions
when crafting their final submissions. For Medium and Hard tasks, the ratios were gen-
erally lower: in Study 1, the proportions were 0.44 for Medium tasks and 0.96 for Hard
tasks, while in Study 2, the proportions were 0.75 for Medium and 0.28 for Hard tasks.
These findings demonstrate the potential of LLMs to streamline the data science process,
even for users without advanced coding expertise, with greater reliance on LLMs for easier
tasks and more mixed results for more complex ones.
The quantitative results from the user survey are summarized in Fig. 4e, where we
grouped the questions into four main categories. The average user ratings for each category
are: Output Quality (3.4), Support & Integration (3.0), System Complexity (3.5), and
System Usability (4.0). These ratings suggest that, overall, users had a positive experience
using the platform. Additionally, we collected qualitative feedback (Fig. 4f), where one
user expressed a strong interest in continuing to use AI for research on their own data,
highlighting the platform’s practical utility. Another user acknowledged the platform’s
value in helping them learn programming and data analysis, underscoring its potential as
an educational tool for those with limited coding experience. These insights reinforce the
platform’s ability to enhance both productivity and learning in data science workflows.
3 Discussion
In collaboration with medical experts, data scientists play a pivotal role in analyzing
complex datasets, such as real-world patient data, to derive insights that improve patient
care and inform evidence-based medicine. However, the rising demand for data science
expertise, combined with the limited availability of skilled professionals, has created a bot-
tleneck, slowing progress and hindering the full potential of data-driven clinical research.
This shortage is restricting the ability to fully harness the vast amount of data available
for advancing clinical research.
Large language models (LLMs) have emerged as powerful generalist artificial intelli-
gence (AI) capable of following human instructions to perform a wide range of tasks in
medicine.37–39 In parallel, LLMs have demonstrated strong capabilities in solving coding
challenges,5 completing software engineering tasks,13 and performing basic data analy-
sis.20 These advancements suggest that LLMs hold great promise for streamlining data
science projects in clinical research, a potential that has not yet been fully explored.
The primary goal of this study is to thoroughly evaluate the performance of cutting-
edge LLMs in handling complex clinical research data and performing data science pro-
gramming tasks. To achieve this, we developed a comprehensive coding benchmark,
CliniDSBench, comprising 39 published clinical studies and 293 diverse, practical, and
high-quality data analysis tasks in both Python and R. Based on these benchmarks, we
found that current LLMs are not yet capable of fully automating data science tasks. At
their first attempts, LLMs successfully solved only 40%-80% of Easy tasks, 15%-40% of
Medium tasks, and 5%-15% of Hard tasks. This highlights the necessity of human over-
sight and post-processing to prevent misinformation and incorrect results when relying on
LLMs for clinical data analysis.
Though imperfect, we found that much of the LLM-generated code was quite close to
the correct solution. This observation motivated us to explore advanced adaptation meth-
ods to improve LLM performance further. On the one hand, we found that LLMs could
self-correct a significant portion of erroneous code, leading to substantial improvements
over their initial attempts. On the other hand, involving human experts more directly
in the process, such as by providing concrete, step-by-step plans for data analysis tasks,
resulted in the best performance across all adaptation strategies.
Beyond automatic testing, we conducted a user study using our developed interface,
which integrates LLMs into the data science workflow. The study revealed that users
heavily relied on LLM-generated code when crafting their final solutions, validating the
effectiveness of LLMs in streamlining the coding process. This workflow typically followed
a pattern where LLMs provided an initial solution, users collaborated with the LLMs
for debugging, and then the users refined the final solution. The platform was highly
appreciated by users, not only for its practical utility in accelerating data science tasks
but also for its educational value in helping them improve their programming and data
analysis skills.
This study has several limitations. First, to ensure the quality of the benchmark, we
manually created all the questions and solutions for the analysis tasks, which restricted
our ability to scale the benchmark to cover more studies. A larger dataset with more
coding tasks would not only enhance evaluation but could also be used for training LLMs
specifically for data science tasks. Second, the user study results may be biased, as the
participants were primarily medical doctors who, while knowledgeable in their domain,
had varying levels of coding proficiency. The usage patterns might differ significantly if
data scientists were the users, as they possess more advanced coding skills but know less
about medicine. Third, the patient-level data in our testing set are publicly available, but
privacy risks must be carefully considered when deploying LLMs for real-world clinical
data analysis. A recommended approach would be to separate the environment running
code on sensitive patient data from the environment where LLMs are used. LLMs should
only access the dataset schema or global statistics without access to individual patient
data. Finally, the patient data used in our testing set were relatively clean, standardized,
and semantically meaningful. In real-world scenarios, data can be messier and more varied,
which could affect LLM performance. Future work should explore strategies to handle less
structured, real-world data effectively.
The findings from our study show that while LLMs are not yet capable of fully au-
tomating clinical research data science tasks, they can be valuable tools when used in
collaboration with human experts. This human-AI partnership can lead to the creation
of effective coding solutions, boost productivity, potentially accelerate drug development,
and improve patient outcomes. However, testing this hypothesis requires future prospec-
tive studies in clinical research to confirm the practical impact of such collaborations.
4 Methods
4.1 Dataset curation
We created the testing dataset, referred to as CliniDSBench, based on published clinical
studies and their associated patient-level datasets from cBioPortal.23 cBioPortal is a
comprehensive database for cancer and genomics research, providing access to hundreds of
studies with linked patient data. These datasets encompass various modalities, including
clinical data, clinical sample data, mutations, copy number alterations, structural variants,
RNA, mRNA, and tumor miRNA, among others. This setup ensures that the coding
tasks in CliniDSBench are closely aligned with the real-world challenges faced in clinical
research data science, using authentic data and analysis tasks.
We began by reviewing the studies listed on cBioPortal’s dataset page. For each study,
we labeled the types of analyses performed. These labels were then aggregated to identify
the most common analyses, ensuring the selected studies covered a comprehensive range
of tasks for the testing dataset. For each selected study, we manually created coding tasks
based on the extracted analyses, mirroring the sequence of data analysis steps that led to
the findings in the original studies. Each coding task represents one step in this process.
The tasks are structured with five key components: the input question, a description of
the patient dataset schema, prefix code, reference solutions, and test cases. An example
of the input coding task is shown in Extended Fig. 2.
To ensure the feasibility of automatic testing, it is crucial to maintain consistency
between the input question and the testing cases, particularly regarding the output name
and format. For instance, a simple question might be: “tell me the number of patients in
the dataset”. This question is inherently open-ended, allowing for a variety of answers.
The most straightforward approach is to calculate the unique number of patient IDs
in the dataset, such as num = df["patient id"].nunique(). However, for the testing
cases to work, it is essential that the variable num represents this number in the code.
Since the variable name can be arbitrary (e.g., n, num_patient, or number_of_patients),
a testing case inspecting the variable num will fail if the name differs. To avoid this
issue, each question is divided into two parts: the task description and the output format
requirement, ensuring a constrained answer. For example, the full question would also
specify the output requirement: “make sure the output number of patients is an integer
assigned to a variable named "num"”. Correspondingly, testing cases like assert num ==
20 are attached to the LLM-generated code to verify its correctness.
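Put together, a constrained question and its attached checks might look like the following sketch (the file path and expected count are hypothetical):

# Question: "Tell me the number of patients in the dataset. Make sure the output number
# of patients is an integer assigned to a variable named `num`."
import pandas as pd

df = pd.read_csv("/workdir/data_clinical_patient.csv")  # hypothetical path
num = int(df["patient id"].nunique())

# Testing cases appended to the LLM-generated code
assert isinstance(num, int)
assert num == 20  # expected value taken from the reference solution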
The prefix code refers to the prerequisite code necessary to run before addressing a
specific question. This approach mirrors the workflow of data scientists working in com-
putational notebook environments like Jupyter Notebook,40 where certain data processing
steps are required for multiple analyses. However, it would be redundant and inefficient
to repeat these steps for every coding task. For example, in one step, a data scientist
might merge the patient clinical data table with the mutation table to link patient out-
comes with gene mutation information. This merged dataset is then used in subsequent
analyses, such as survival analysis grouped by gene mutations. For these follow-up tasks,
the LLMs are not required to repeat the data merging process. Instead, the merging code
is provided as prefix code, allowing the LLMs to build on the processed data and focus on
the specific task at hand. This structure ensures efficiency and mimics how data scientists
typically manage code dependencies across related tasks.
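A small sketch of this prefix-code pattern is shown below; the file paths, join key, and gene name are illustrative assumptions:

import pandas as pd

# Prefix code shared by several downstream tasks: link patient outcomes with mutations
data_clinical_patient = pd.read_csv("/workdir/data_clinical_patient.csv")
data_mutations = pd.read_csv("/workdir/data_mutations.csv")
merged = data_mutations.merge(data_clinical_patient, on="PATIENT_ID", how="inner")

# A follow-up task then builds directly on `merged`, e.g.:
# "Count the distinct patients carrying a TP53 mutation; store the result in `n_tp53`."
n_tp53 = merged.loc[merged["Hugo_Symbol"] == "TP53", "PATIENT_ID"].nunique()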
To protect the privacy of patient records, it is crucial to handle how patient data is
passed in prompts to proprietary LLMs, such as OpenAI’s GPT models, for coding tasks.
Our approach avoids using individual-level patient records as input. Instead, we use a
template to generate a caption for each dataset. This caption includes the table’s name,
its dimensions (shape), the names of all columns, and representative values from each
column. This method ensures that no private or sensitive information about individuals
is shared, while still enabling LLMs to understand the dataset’s structure and content
sufficiently to synthesize code for data science.
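A minimal sketch of such a caption generator, assuming a simple template of our own wording rather than the exact one used here, is:

import pandas as pd

def describe_table(name: str, df: pd.DataFrame, n_examples: int = 3) -> str:
    """Build a schema-level caption (name, shape, columns, representative values)
    without exposing any complete patient record."""
    lines = [f"A table named `{name}` with shape {df.shape}.", "Columns and example values:"]
    for col in df.columns:
        examples = df[col].dropna().unique()[:n_examples]
        lines.append(f"- {col}: " + ", ".join(map(str, examples)))
    return "\n".join(lines)

# Example usage on a toy table
toy = pd.DataFrame({"PATIENT_ID": ["P1", "P2"], "SEX": ["Female", "Male"], "OS_MONTHS": [12.0, 30.5]})
print(describe_table("data_clinical_patient", toy))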
We developed specific testing cases for various types of outputs required to answer
the input data science questions, including numerical, categorical, dataframe, and object
outputs. For numerical outputs, such as the number of patients (integers) or average age
(continuous values), the testing cases check for exact matches in integers and verify that
the absolute difference for continuous values falls within an acceptable error range. For
categorical outputs, such as a list of the top 10 frequently mutated genes, the testing
cases ensure that the generated list matches the expected set. For dataframe outputs,
such as merging two tables, we clarify the expected column names and content in the
question. The testing cases then verify the shape of the resulting dataframe and check
global statistics for each column, depending on the variable types. When the expected
outputs are special objects, such as in visualization tasks, where the output is a figure,
we specify in the question which tools should be used to create the visualization. The
testing cases then check if the specified tools were used correctly. Additionally, we include
instructions to save the inputs used for plotting, allowing us to verify the correctness of
these inputs as a proxy for validating the accuracy of the visualizations.
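The sketch below illustrates these four kinds of checks with made-up variable names and expected values; the figure check is shown only as a comment because it depends on the plotting inputs saved by the solution:

import math
import pandas as pd

# Stand-ins for values a generated solution would have produced
num_patients = 20
mean_age = 61.38
top_genes = ["TP53", "KRAS", "EGFR"]
merged_df = pd.DataFrame({"PATIENT_ID": ["P1", "P2"], "OS_MONTHS": [10.0, 20.0]})

# Numerical outputs: exact match for integers, tolerance for continuous values
assert num_patients == 20
assert math.isclose(mean_age, 61.4, abs_tol=0.1)

# Categorical outputs: compared as an unordered set
assert set(top_genes) == {"TP53", "KRAS", "EGFR"}

# Dataframe outputs: shape plus per-column statistics
assert merged_df.shape == (2, 2)
assert math.isclose(merged_df["OS_MONTHS"].mean(), 15.0, abs_tol=0.1)

# Visualization outputs: the question asks the solution to save its plotting inputs
# (e.g., to a CSV file), and the test reloads that file and checks its columns.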
To estimate the difficulty level of each coding task, we calculated the number of seman-
tic lines in the reference solutions. This was done by using GPT-4o to analyze the input
code and decompose it into a sequence of operations. The unique number of operations
was used as an indicator of semantic lines. For Python tasks, we categorized those with
fewer than 10 semantic lines as Easy, 10-15 as Medium, and more than 15 as Hard. For
R tasks, we defined those with fewer than 6 semantic lines as Easy, 6-10 as Medium, and
more than 10 as Hard. The prompt used to extract these operations is shown in Extended
Fig. 7.
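Given the operation list returned for a reference solution, mapping it to a difficulty label is then a direct lookup against these thresholds, as in the sketch below:

def difficulty_level(operations: list, language: str) -> str:
    """Map the number of unique operations (semantic lines) to Easy/Medium/Hard
    using the thresholds stated above for Python and R tasks."""
    n = len(set(operations))
    if language == "Python":
        return "Easy" if n < 10 else ("Medium" if n <= 15 else "Hard")
    if language == "R":
        return "Easy" if n < 6 else ("Medium" if n <= 10 else "Hard")
    raise ValueError(f"Unsupported language: {language}")

# Example: a Python reference solution decomposed into 11 unique operations
print(difficulty_level([f"operation {i}" for i in range(11)], "Python"))  # -> Medium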
was the limited dataset scale, which was only sufficient for testing purposes. Additionally,
including publicly available code examples from sources like GitHub would likely offer
minimal benefit, as these LLMs have already been extensively trained on such data.
Chain-of-thought Research has shown that prompting LLMs to break down tasks
into multiple steps, rather than providing a direct answer, significantly improves per-
formance.29 These steps can either be generated by the LLM or provided by a human
expert. In our experiments, we implemented this technique by creating detailed step-by-
step instructions on how to solve data science tasks, referred to as the CoT variant. This
approach is reflected in the experiment results, as shown in Fig. 3b, where it consistently
outperformed direct answer generation by guiding the LLM through a structured process.
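As an illustration only (the exact instruction wording used in our experiments differs), a CoT prompt can be assembled by prepending an expert-written plan to the task:

def build_cot_prompt(question: str, dataset_description: str, steps: list) -> str:
    """Prepend an expert-written, step-by-step plan to the task so the model
    follows a structured analysis instead of answering in one shot."""
    plan = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(steps))
    return (
        f"Dataset schema:\n{dataset_description}\n\n"
        f"Task:\n{question}\n\n"
        "Solve the task by following this plan step by step, then return the code:\n"
        f"{plan}"
    )

# Illustrative plan for a survival-analysis task
prompt = build_cot_prompt(
    "Plot Kaplan-Meier curves of overall survival by treatment arm.",
    "data_clinical_patient: PATIENT_ID, ARM, OS_MONTHS, OS_STATUS",
    ["Load the clinical table", "Drop rows with missing survival data",
     "Fit one KaplanMeierFitter per arm", "Plot the curves on one axis"],
)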
Self-reflection LLMs can produce flawed outputs on their first attempt, but they
can improve through iterative feedback and refinement.49 This approach mirrors the
natural process humans follow when programming, testing, and debugging code.12 In our
experiments, we implemented self-reflection, allowing LLMs to attempt debugging their
incorrect code solutions, with results shown in Fig. 3c. The process involved executing the
initially generated code along with the testing cases and collecting output logs reflecting
(1) errors from failed tests, (2) runtime errors within the code, and (3) the printed values
and shapes of intermediate variables. We then prompted the LLM to explain why the
code was incorrect, propose a plan for correction, and provide a revised code solution.
This cycle was repeated up to five times, and in each round, only the unresolved questions
were carried forward for further self-reflection. This iterative method allowed LLMs to
gradually improve their code solutions over multiple attempts. The prompt used to enable
LLM’s self-reflection is in Extended Fig. 6.
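A minimal sketch of this loop is given below; `llm_generate` and `run_in_sandbox` are placeholders for the model call and the execution sandbox rather than actual APIs:

def self_reflect(task, llm_generate, run_in_sandbox, max_rounds: int = 5):
    """Iteratively debug a generated solution using execution feedback.
    `llm_generate(prompt) -> str` and `run_in_sandbox(code, tests) -> (passed, logs)`
    stand in for the model call and the execution sandbox."""
    code = llm_generate(task["question"])
    for _ in range(max_rounds):
        passed, logs = run_in_sandbox(task["prefix"] + "\n" + code, task["tests"])
        if passed:
            return code, True            # solved; stop early
        feedback_prompt = (
            f"{task['question']}\n\nPrevious code:\n{code}\n\n"
            f"Testing and runtime logs:\n{logs}\n\n"
            "Explain the error (REASON), propose a fix (PROPOSAL), "
            "and return the corrected code."
        )
        code = llm_generate(feedback_prompt)
    return code, False                    # unresolved after the allowed rounds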
Questionnaire Design We developed a questionnaire to gather feedback from users
following the user study. The survey was designed based on the Health Information
Technology Usability Evaluation Scale (Health-ITUES),36 originally created to assess the
usability of a web-based communication system for scheduling nursing staff. We adapted
the questions in line with the spirit of the scale, covering four main topics: output quality,
support & integration, system complexity, and system usability. The original 22-item
questionnaire was streamlined to 10 items, with users rating each on a 5-point Likert
scale, ranging from strongly disagree to strongly agree. Additionally, we included an open-
ended question, allowing users to provide free-text comments for further insights. This
format ensured a concise yet comprehensive collection of user feedback on the platform’s
performance and usability.
Pass@k ranges from 0 to 1 and estimates the probability that at least one of the k
generated code samples for a given task passes all the unit tests. In our study, we used
two metrics: Pass@1 and Pass@5. Pass@1 is a stricter metric, evaluating whether the
LLM can solve the task on the first attempt. To ensure reproducibility, we set the LLMs’
temperature to zero for this evaluation. For Pass@5, we allowed the LLMs to generate 10
solutions for each question to estimate the likelihood of producing a correct answer within
five attempts.
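Pass@k can be computed with the standard unbiased estimator of Chen et al.;5 a minimal sketch, assuming n sampled solutions of which c pass all tests, is:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples,
    drawn from n generated solutions of which c are correct, passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples per task, 3 of them correct, evaluated at k = 5
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917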
To compare the LLM-generated code with user-submitted or reference code solutions,
we first parse the code string using abstract syntax trees (AST) to extract operators and
variables, which allows for a structural analysis of the code. We then tokenize the code
based on this parsing result. Using Python’s difflib library, we compare the differences
between two text sequences.
Let s1 represent the tokenized LLM-generated code and s2 represent the tokenized
user-submitted code. We compute the length of overlapping tokens between the two
sequences, denoted as s̄. The ratio of user-submitted code copied from LLM-generated
code can then be calculated using the following formula:
Copy Ratio = length(s̄) / length(s2). (2)
This method quantifies the extent to which the user-submitted code overlaps with the
LLM-generated code, providing insight into the level of influence the LLM had on the
final solution.
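A simplified sketch of this computation is shown below; a regular-expression tokenizer stands in for the AST-based parsing described above:

import difflib
import re

def copy_ratio(llm_code: str, user_code: str) -> float:
    """Fraction of the user-submitted code whose tokens also appear, in matching blocks,
    in the LLM-generated code."""
    s1 = re.findall(r"[A-Za-z_]\w*|\S", llm_code)    # tokens of the LLM-generated code
    s2 = re.findall(r"[A-Za-z_]\w*|\S", user_code)   # tokens of the user-submitted code
    matcher = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
    overlap = sum(block.size for block in matcher.get_matching_blocks())
    return overlap / len(s2) if s2 else 0.0

# Example usage
print(copy_ratio("n = df['PATIENT_ID'].nunique()", "num = df['PATIENT_ID'].nunique()"))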
[Extended Fig. 1 screenshot: a, code generation panel with user inputs, language selection (Python or R), command history, execution output logs, and output figures; b, dataset panel with dataset selection, dataset overview, and dataset visualization.]
Extended Fig. 1: Overview of the developed clinical research data science platform.
a, code generation panel where users provide their requests, read the generated code, and
switch to execution results. Users can view the previously sent requests and switch back if
desired. b, Users can select one from a list of datasets to analyze. For each dataset, users can
preview the content.
Example coding task
For the significant genes, get the indicator of the mutation type for each patient, the
mutation types of interest are:
- silent
- missense
- splice site
- nonsense
- frame shift
- inframe indel
where the indicator `1` means a mutation and `0` means wild-type.
import pandas as pd
# Filter genes with mutation significance (-log10 q_value) larger than 1.0
significant_genes_df = data_mutsig[data_mutsig['q'] < 0.1]
Extended Fig. 2: An example of a Python coding task with the input question, prefix code, and
testing cases.
Prompt for Vanilla method
Dataset schema:
{data}
Return directly with the generated Python code wrapped by <code> tags:
<code>
... your code here ...
</code>
Prompt for Manual method
You are now the following python function:
The function takes two inputs: a `history_dict` and an optional `reference_code` string.
IMPORTANT: Always refer to this history and the `reference_code` when generating new code in order to properly use
existing variables and previously loaded resources, as well as to follow established coding patterns. DO NOT USE ECHARTS
TO GENERATE CHARTS when reference code is empty.
The function returns a string of raw Python code, wrapped within <code> and </code> tags. For example:
<code>
import pandas as pd
table = pd.read_csv("example.csv")
</code>
Feel free to leverage libraries such as pandas, numpy, math, matplotlib, sklearn, etc. in the code generation process. Also,
remember to correctly load any necessary files with the correct path before using them.
When it's appropriate to provide output for evaluation or visualization, make sure to use the print() function and plt.show()
respectively.
Returns:
Python code that should be the next steps in the execution according to the human question and using the history code
as the prefix.
\"\"\"
Respond exclusively with the generated code wrapped <code></code>. Ensure that the code you generate is executable
Python code that can be run directly in a Python environment, requiring no additional string encapsulation.
history_code = \"\"\"{history_code}\"\"\"
human_question = \"\"\"{question}
# DO NOT use function that will pop up a new window (e.g., PIL & Image.show() is NOT preferable, saving the PIL image is
better)
# However, feel free to use matplotlib.pyplot.show()\"\"\"
# Load the data referring to the data file path provided in the data schema
data = \"\"\"{data}\"\"\"
history_dict = {{
"history_code": history_code,
"human question": human_question,
"data": data,
}}
Prompt for AutoPrompt method
Consider this task as building a robust and flexible data processing module. With a natural language question and an
accompanying dataset schema at your disposal, generate thoroughly designed Python code to retrieve, process, and
analyze data precisely.
Your solution must follow the latest best practices in coding for efficiency, readability, and scalability. Ensure that initial
imports, data loading, schema parsing, data handling, and operation implementations:
1. complete with amendments for error handling and real-time operation logs
2. are made timelessly versatile. To round off, integrate segregated documentation within your code and detail logical
mappings relevant to user complexity boundaries and potential integrations including cross-language interoperability
hallmarks, where feasible.
Please write a Python script that performs the following tasks without using the main() function or the if __name__ ==
\"__main__\": construct.
Here is the Python code including comprehensive solutions for the given question and dataset schema: The code
should be written directly in the global scope. Return the code wrapped by the <code> and </code> tag.
Prompt for Self-reflection method
The data scientists need your help with the following code snippet.
REFERENCE_CODE = ```
{reference_code}
```
The code snippet either produced the following error message or received a user's request for improvements:
QUESTION_LOG = ```
{question}
```
#############
# OBJECTIVE #
Depending on the content of [QUESTION_LOG], either:
1. Debug the code to fix an error message, ensuring the code is executable and produces the expected output, or
2. Refine or adapt the code to improve its performance or functionality based on the user's request.
Your should insert printing statements into the code to display all the intermediate results for verification and debugging
purposes. You need to solve the problem step-by-step, providing the corrected or improved code snippet at the end.
1. REASON: Analyze the reason for the error or the potential for improvement as described in [QUESTION_LOG].
2. PROPOSAL: Describe the approaches you will take to either fix the error or improve the code.
3. CODE: Provide the corrected or enhanced code snippet.
#############
# RESPONSE: HTML #
Show your response in HTML format with the following structure:
<code>
# REASON: Short description of the reason for the error or the area for improvement using less than 100 tokens.
# PROPOSAL: High-level description of the approach to fix the error or enhance the code using less than 100 tokens.
# CODE: Corrected or improved code snippet to fix the error or enhance functionality
... your code starts here ...
</code>
Extended Fig. 6: Prompt for the Self-reflection method in code debugging and improvement.
Prompt for computing semantic lines
Decompose the code string into essential operations required to describe the steps needed to go from raw data to the final
outcome.
# RESPONSE FORMAT #
You are required to output a JSON object containing a list of operation descriptions. Each list item should be a string concisely
describing a single operation.
For example:
```json
{{
"operations": [
"operation 1",
"operation 2",
...
"operation n"
]
}}
```
# EXAMPLE #
An example decomposition of the code string is provided below:
```
import numpy as np
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd
figx = px.scatter(
tsne,
x="tsne1",
y="tsne2",
color="_primary_disease",
hover_name="_primary_disease",
width=970,
height=500,
template="ggplot2",
color_discrete_sequence= px.colors.qualitative.Alphabet,
size_max=0.1,
)
figx.show()
```
```
{{
"operations": [
"import numpy as np",
"import TSNE from sklearn.manifold",
"import plotly.express as px",
"import pandas as pd",
"drop columns `sample_type_id`, `sample_type`, and `_primary_disease` from merged_df to create X_filtered",
"apply TSNE to X_filtered with 2 components, auto learning rate, random initialization, and perplexity of 3 to get
X_embedded",
"create a DataFrame tsne with columns `tsne1` and `tsne2` from X_embedded",
"concatenate the `_primary_disease` column from merged_df to tsne DataFrame",
"sort the tsne DataFrame by the `_primary_disease` column",
"create a scatter plot using Plotly Express with `tsne1` and `tsne2` as axes, coloring by `_primary_disease`, and
customizing the plot appearance",
"display the scatter plot"
]
}}
```
# CODE STRING #
Decompose the following based on the instructions above:
```
{code}
```
Extended Fig. 7: Prompt for extracting and computing the number of semantic lines for
estimating coding task difficulty.
References
[1] Radenkovic, D., Keogh, S. B. & Maruthappu, M. Data science in modern evidence-
based medicine. Journal of the Royal Society of Medicine 112, 493–494 (2019).
[2] Champagne, D., Devereson, A., Pérez, L. & Saunders, D. Cre-
ating value from next-generation real-world evidence. https:
//www.mckinsey.com/industries/life-sciences/our-insights/
creating-value-from-next-generation-real-world-evidence. Accessed:
2024-09-16.
[3] Ellis, L. D. To meet future needs, health care leaders must
look at the data (science). https://www.hsph.harvard.edu/ecpe/
to-meet-future-needs-health-care-leaders-must-look-at-the-data-science/.
Accessed: 2024-09-16.
[4] Meyer, M. A. Healthcare data scientist qualifications, skills, and job focus: a content
analysis of job postings. Journal of the American Medical Informatics Association
26, 383–391 (2019).
[5] Chen, M. et al. Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374 (2021).
[6] Li, Y. et al. Competition-level code generation with alphacode. Science 378, 1092–
1097 (2022).
[7] Luo, Z. et al. Wizardcoder: Empowering code large language models with evol-
instruct. In The Twelfth International Conference on Learning Representations
(2023).
[8] Lozhkov, A. et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint
arXiv:2402.19173 (2024).
[9] Zhang, F. et al. Repocoder: Repository-level code completion through iterative
retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
[10] Parvez, M. R., Ahmad, W., Chakraborty, S., Ray, B. & Chang, K.-W. Retrieval
augmented code generation and summarization. In Findings of the Association for
Computational Linguistics: EMNLP 2021, 2719–2734 (2021).
[11] Wang, Z. Z. et al. CodeRAG-Bench: Can retrieval augment code generation? arXiv
preprint arXiv:2406.14497 (2024).
[12] Chen, X., Lin, M., Schärli, N. & Zhou, D. Teaching large language models to self-
debug. In The Twelfth International Conference on Learning Representations (2024).
[13] Wang, X. et al. OpenDevin: An open platform for ai software developers as generalist
agents. arXiv preprint arXiv:2407.16741 (2024).
[14] Zheng, T. et al. Opencodeinterpreter: Integrating code generation with execution
and refinement. arXiv preprint arXiv:2402.14658 (2024).
[15] Austin, J. et al. Program synthesis with large language models. arXiv preprint
arXiv:2108.07732 (2021).
[16] Hendrycks, D. et al. Measuring coding challenge competence with apps. In Thirty-
fifth Conference on Neural Information Processing Systems Datasets and Benchmarks
Track.
[17] Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by chatgpt really
correct? rigorous evaluation of large language models for code generation. Advances
in Neural Information Processing Systems 36 (2024).
[18] Jimenez, C. E. et al. Swe-bench: Can language models resolve real-world github
issues? In The Twelfth International Conference on Learning Representations (2024).
[19] Huang, J. et al. Execution-based evaluation for data science code generation models.
In Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop
(Language Advances), 28–36 (2022).
[20] Lai, Y. et al. DS-1000: A natural and reliable benchmark for data science code
generation. In International Conference on Machine Learning, 18319–18345 (PMLR,
2023).
[21] Tayebi Arasteh, S. et al. Large language models streamline automated machine
learning for clinical studies. Nature Communications 15, 1603 (2024).
[22] Tang, X. et al. Biocoder: a benchmark for bioinformatics code generation with large
language models. Bioinformatics 40, i266–i276 (2024).
[23] cbioportal for cancer genomics. https://www.cbioportal.org/datasets. Accessed:
2024-09-17.
[24] Hello gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2024-09-17.
[25] Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/
gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed: 2024-09-17.
[26] Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet. Ac-
cessed: 2024-09-17.
[27] Introducing the next generation of claude. https://www.anthropic.com/news/
claude-3-family. Accessed: 2024-09-17.
[28] Reid, M. et al. Gemini 1.5: Unlocking multimodal understanding across millions of
tokens of context. arXiv preprint arXiv:2403.05530 (2024).
[29] Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models.
Advances in Neural Information Processing Systems 35, 24824–24837 (2022).
[30] Brown, T. B. et al. Language models are few-shot learners. In Proceedings of the
34th International Conference on Neural Information Processing Systems, NIPS ’20
(Curran Associates Inc., Red Hook, NY, USA, 2020).
[31] Khattab, O. et al. Dspy: Compiling declarative language model calls into self-
improving pipelines. arXiv preprint arXiv:2310.03714 (2023).
[32] Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.
Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
[33] Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective
clinical sequencing of 10,000 patients. Nature Medicine 23, 703–713 (2017).
[34] Welch, J. S. et al. Tp53 and decitabine in acute myeloid leukemia and myelodysplastic
syndromes. New England Journal of Medicine 375, 2023–2036 (2016).
[35] Mostavi, M., Chiu, Y.-C., Huang, Y. & Chen, Y. Convolutional neural network
models for cancer type prediction based on gene expression. BMC Medical Genomics
13, 1–13 (2020).
[36] Yen, P.-Y., Wantland, D. & Bakken, S. Development of a customizable health it
usability evaluation scale. In AMIA Annual Symposium Proceedings, vol. 2010, 917
(American Medical Informatics Association, 2010).
[37] Wang, Z. et al. Accelerating clinical evidence synthesis with large language models.
arXiv preprint arXiv:2406.17755 (2024).
[38] Lin, J., Xu, H., Wang, Z., Wang, S. & Sun, J. Panacea: A foundation model
for clinical trial search, summarization, design, and recruitment. medRxiv 2024–06
(2024).
[39] Jin, Q. et al. Matching patients to clinical trials with large language models. ArXiv
(2023).
[40] Jupyter. https://jupyter.org/. Accessed: 2024-09-23.
[41] Openai codex. https://openai.com/index/openai-codex/. Accessed: 2024-09-23.
[42] Nijkamp, E. et al. A conversational paradigm for program synthesis. arXiv preprint
arXiv:2203.13474 30 (2022).
[43] Fried, D. et al. Incoder: A generative model for code infilling and synthesis. In The
Eleventh International Conference on Learning Representations (2023).
[44] Van Veen, D. et al. Adapted large language models can outperform medical experts
in clinical text summarization. Nature Medicine 30, 1134–1142 (2024).
[45] Nie, F., Chen, M., Zhang, Z. & Cheng, X. Improving few-shot performance of
language models via nearest neighbor calibration. arXiv preprint arXiv:2212.02216
(2022).
[46] New embedding models and api updates. https://openai.com/index/
new-embedding-models-and-api-updates/. Accessed: 2024-09-23.
[47] Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. Autoprompt: Eliciting
knowledge from language models with automatically generated prompts. In Proceed-
ings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP) (Association for Computational Linguistics, 2020).
[48] Vertex ai search. https://cloud.google.com/enterprise-search?hl=en. Ac-
cessed: 2024-09-23.
[49] Madaan, A. et al. Self-refine: Iterative refinement with self-feedback. Advances in
Neural Information Processing Systems 36 (2024).