
Can Large Language Models Replace Data Scientists in Clinical Research?

Zifeng Wang1∗ , Benjamin Danek1∗ , Ziwei Yang3 , Zheng Chen4 , Jimeng Sun1,2#
1 Department of Computer Science, University of Illinois Urbana-Champaign, Champaign,

IL
2 Carle Illinois College of Medicine, University of Illinois Urbana-Champaign, Champaign, IL
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University
4 Institute of Scientific and Industrial Research, Osaka University
# Corresponding author. Emails: [email protected]

Abstract
Data science plays a critical role in clinical research, but it requires professionals with
expertise in coding and medical data analysis. Large language models (LLMs) have shown
great potential in supporting medical tasks and performing well in general coding tests.
However, these tests do not assess LLMs’ ability to handle data science tasks in medicine,
nor do they explore their practical utility in clinical research. To address this, we de-
veloped a dataset consisting of 293 real-world data science coding tasks, based on 39
published clinical studies, covering 128 tasks in Python and 165 tasks in R. This dataset
simulates realistic clinical research scenarios using patient data. Our findings reveal that
cutting-edge LLMs struggle to generate perfect solutions, frequently failing to follow input
instructions, understand target data, and adhere to standard analysis practices. Conse-
quently, LLMs are not yet ready to fully automate data science tasks. We benchmarked
advanced adaptation methods and found two to be particularly effective: chain-of-thought
prompting, which provides a step-by-step plan for data analysis and led to a 60% improvement in code accuracy; and self-reflection, which enables LLMs to iteratively refine their
code, yielding a 38% accuracy improvement. Building on these insights, we developed a
platform that integrates LLMs into the data science workflow for medical professionals. In
a user study with five medical doctors, we found that while LLMs cannot fully automate
coding tasks, they significantly streamline the programming process. We found that 80%
of their submitted code solutions were incorporated from LLM-generated code, with up to
96% reuse in some cases. Our analysis highlights the potential of LLMs, when integrated
into expert workflows, to enhance data science efficiency in clinical research.

* These authors contributed equally to this work.

1 Introduction
In clinical research, data science plays a pivotal role in analyzing complex datasets, such
as clinical trial data and real-world data (RWD), which are critical for improving patient
care and advancing evidence-based medicine.1 For example, it was reported that for a
pharmaceutical company, insights from RWD analysis could unlock up to $300M in value annually by optimizing clinical trial design and execution.2 Data scientists,
at the core of this process, require years of coding expertise, alongside a deep understand-
ing of diverse medical data types, including patient clinical data and omics data, while also
collaborating closely with medical professionals.3 However, the growing demand for data
science skills and the limited availability of experienced data scientists have become bot-
tlenecks in the clinical research process.4 Coding is central to the work of data scientists,
underpinning essential tasks such as statistical modeling, data cleaning, and visualization
using Python and R. Given this, exploring methods for streamlining the coding process
in clinical research data science is crucial to accelerate drug development and improve
patient outcomes.
Code generation has been extensively explored with the advent of large language mod-
els (LLMs), which have demonstrated strong capabilities in tasks like code completion.5
Continuous efforts have been made in developing more powerful code-specific LLMs,6–8 re-
fining prompting strategies,9 integrating external knowledge through retrieval-augmented
generation (RAG),10, 11 and enabling LLMs’ self-reflection.12 These advancements further
lead to the LLM-based platform for software development13 and data analysis.14 While
LLMs have been evaluated for general programming tests,5, 15–17 software engineering,18
and data analysis,19, 20 assessments specifically targeting clinical research data science
remain scarce. In medicine, recent works have introduced LLMs to automate machine
learning modeling21 and support bioinformatics tool development,22 but they do not cover
broad data science tasks. Therefore, this paper seeks to build a comprehensive code generation dataset to assess the extent to which cutting-edge LLMs can automate clinical
research data analysis, modeling, and visualization.
Our objective was to evaluate the practical utility of LLMs in handling complex clinical
research data and performing the associated data science tasks. To this end, we identified
39 clinical studies published in medical journals that were linked to patient-level datasets
(Fig. 1a). We started by extracting and summarizing the analyses performed in these
studies, such as patient characteristic exploration and Kaplan-Meier curves. We then
developed the code necessary to reproduce these analyses and the reported results in these
studies. These coding tasks, along with their reference solutions, were all manually crafted
and cross-verified to ensure accuracy. The result was a collection of 293 diverse, high-
quality data science tasks, covering primary tools used in Python and R, e.g., lifelines
for survival analysis in Python and Bioconductor for biomedical data analysis in R.
Additionally, we categorized the difficulty of these tasks into Easy, Medium, and Hard, by
the number of “semantic lines” of code in the reference solutions (Fig. 1d). The semantic
lines metric aggregates lines of code that serve the same operation into a single unit,
providing a clear measure of task complexity. A detailed overview of the dataset and its
characteristics is provided in Fig. 1.
Here, we rigorously evaluated the extent to which clinical research data science tasks
can be automated by LLMs. We benchmarked six state-of-the-art LLMs using various
methods, including chain-of-thought prompting, few-shot prompting, automatic prompt-
ing, self-reflection, and retrieval-augmented generation (Fig. 1f). Our analysis focused on
both the accuracy and quality of the code generated by these models. While we found that
current LLMs are not yet capable of fully automating complex clinical data science tasks,
they do generate code that is highly similar to the correct final solutions. Building on this
insight, we investigated the development of a platform designed to facilitate collaboration
between human experts and artificial intelligence (AI) to streamline coding tasks. This
platform aims to enhance the productivity of clinical researchers by integrating LLMs into
established data science workflows, with a focus on user-friendliness and the reliability of
the outputs. Our results demonstrated that the platform significantly improved human
experts’ efficiency in executing data science tasks, highlighting the promising future of
human-AI collaboration in clinical research data science.

2 Results
2.1 Creating data science tasks from clinical studies
We curated our testing dataset, CliniDSBench, to reflect the real-world challenges that
data scientists face in clinical research. The dataset is grounded in published medical
studies and linked to patient-level datasets from cBioPortal.23 These patient datasets are
diverse, each containing data from hundreds to thousands of patients, including clinical
information such as demographics and survival data, lab results, and omics data like
gene expression, mutations, structural variants, and copy number alterations. Unlike
prior studies that mostly focus on single spreadsheets, each study in our dataset is linked
to multiple spreadsheets, up to five, providing a more complex and realistic basis for
evaluating data science workflows. This setup mirrors the multifaceted nature of real-
world clinical research, where data scientists must integrate and analyze information from
various sources to generate insights. The dataset building and evaluation framework is
illustrated in Fig. 1a.
We obtained the publications associated with these datasets from PubMed and re-
viewed them to extract the types of analyses conducted. Through this process, we iden-
tified common analyses frequently performed in clinical research, such as summarizing
patient baseline characteristics, plotting Kaplan-Meier curves to assess treatment effects
across groups, and creating mutational OncoPrints to highlight significant gene mutations
in specific patients. This extraction process allowed us to filter and refine the initial set of
studies, ensuring both diversity and comprehensiveness in the analyses covered. After this
refinement, we retained 39 studies, which were used to create the final testing dataset.
We designed a series of coding questions in a step-by-step format, mirroring the logical
progression of analyses in the original studies, ultimately leading to the main findings. For
example, a study may include exploratory data analysis, gene mutation analysis to detect
abnormal mutation patterns, survival analysis to visualize patient outcomes, and statisti-
cal tests to verify significance. Correspondingly, we developed coding questions for each
step, ensuring that earlier steps provide the necessary groundwork for subsequent analyses.
As such, each coding task consists of five components: (1) the input question, (2) patient
dataset schema description, (3) prerequisite code (called “prefix”), (4) reference solutions,
and (5) test cases. This design reflects the practical setup data scientists encounter in
real-world projects. In total, we manually curated 128 analysis tasks in Python and 165
in R based on the extracted analyses from 39 studies. As shown in Fig. 1c, the input
questions typically consist of 50-100 words describing the task and output requirements,
while the reference code solutions span more than 20 lines, sometimes exceeding 50 lines,
reflecting the complexity of the tasks.
We quantitatively assessed the difficulty of each coding task by calculating the number
of semantic lines in the reference solutions. A semantic line aggregates multiple lines of
code that contribute to the same operation, as illustrated in Fig. 1d. This approach pre-
vents the difficulty assessment from being skewed by repetitive or tedious tasks that are
fundamentally simple. The statistics for semantic lines and difficulty levels are presented
in Fig. 1e. Our analysis shows that Python solutions tend to be more complex than R
solutions, particularly for Medium and Hard tasks. This is largely due to R’s rich ecosys-
tem of medical-specific packages, which often allow for more direct solutions. In contrast,
Python frequently requires additional customization and manual coding to achieve similar
outcomes, contributing to higher complexity in Python coding tasks. The pie charts in
Fig. 1b show the libraries frequently used in the reference answers.

2.2 LLMs are not yet ready for fully automated data science
As illustrated in Fig. 1f, our evaluation framework is composed of three key compo-
nents: models, methods, and tasks. For the first component, we selected six cutting-edge

[Figure 1. Panel a: pipeline from medical publications (extracted analyses such as participant baselines, Kaplan-Meier curves, statistical tests, mutation OncoPrints, ridgeplots, and volcano plots) and multimodal patient data to coding tasks, test cases, and an execution sandbox. Panel b: Python, 14 studies, 128 analyses; R, 25 studies, 165 analyses. Panel d: example pandas code grouped into semantic lines (e.g., loading and filtering the dataset; computing and displaying grouped summary statistics). Panel f: models (Sonnet, Opus, GPT-4o, GPT-4o-mini, Gemini-pro, Gemini-flash), methods (chain-of-thought, few-shot prompting, automatic prompting, self-reflection, retrieval-augmented generation), and tasks (code generation, code debugging, human-AI collaboration).]

Figure 1: Framework overview. a, we created a data science coding dataset based on the extracted analyses from medical publications. b, the total number of analysis tasks and
studies in the testing data, which also covers a diverse set of tools and libraries. c, illustration
of the complexity of the tasks by the distributions of question length and answer length. d,
an example of semantic lines. e, the distribution of semantic lines in the reference answers
across different difficulty levels. f, the selected models, adaptation methods, and coding tasks
in this study.

[Figure 2. Panel a: the model inputs (a question such as "How many patients are in the dataset? Return the output in a variable `n`", a table description of the linked spreadsheets, and a Vanilla, Manual, or Automatic instruction), the sampling of candidate answers, and their execution against unit tests in a sandbox to compute pass@1 and pass@5. Panels b-e are plots summarized in the caption below.]

Figure 2: Assessment of different models and adaptation methods in automating clinical research data science tasks. a, the inputs for LLMs to generate the code and the
associated evaluation process. b, the pass@5 of three LLMs with varying temperatures across
difficulty levels in the Python coding dataset. c, the proportions of the reference solution
code that can be drawn directly from the LLM-generated code. d and e show the pass@1 of
six LLMs across difficulty levels in Python and R coding datasets, respectively.

LLMs: GPT-4o,24 GPT-4o-mini,25 Sonnet,26 Opus,27 Gemini-pro,28 and Gemini-flash.28
These models represent a diverse set of advanced generalist LLMs capable of perform-
ing code generation based on input instructions. To explore their effectiveness and po-
tential room for improvement in clinical research data science tasks, we applied several
adaptation methods: chain-of-thought,29 few-shot prompting,30 automatic prompting,31
self-reflection,12 and retrieval-augmented generation (RAG).32
For evaluation, we assessed the combinations of models and adaptation methods across
three tasks: code generation, code debugging, and human-AI collaboration. The first two
tasks were used to measure the models’ accuracy in automating the coding process. We
employed Pass@k as the primary metric, where k represents the number of attempts the
model is allowed to make to solve a coding task. This metric estimates the probability that at least one of the k attempts is correct. Specifically, we selected k = 1 as a strict
benchmark to evaluate how well LLMs can automate tasks on the first attempt, providing
insight into their immediate accuracy. Additionally, we used k = 5 as a more relaxed
metric to explore the model’s potential to improve given multiple attempts, offering a
broader assessment of its ability to generate correct solutions when allowed more tries.
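For reference, the unbiased Pass@k estimator commonly used in code-generation benchmarks can be computed as in the sketch below; the paper does not spell out its exact estimator, so this is an illustration rather than the study's implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k solutions drawn from n generated
    # candidates (c of which pass all tests) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 sampled solutions for a task, 2 of which pass the tests
print(pass_at_k(n=5, c=2, k=1))  # 0.4
print(pass_at_k(n=5, c=2, k=5))  # 1.0
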
We first evaluated the immediate accuracy of LLMs in generating code solutions on
their initial attempt. For each task, the LLM is provided with a raw question that de-
scribes the target task, as well as a dataset description (Fig. 2a). The dataset description
includes details such as spreadsheet names, column names, and common cell values, which
guide the LLM in identifying the correct spreadsheet, column, and values to work with.
Additionally, the instruction section offers supplementary guidance for the LLM during
code generation. We used three types of instructions: Vanilla, Manual, and Automatic.
The Vanilla instruction provides minimal guidance, merely instructing the LLM to solve
the task, while Manual and Automatic instructions are more detailed, either manually
crafted or optimized through automatic prompt generation.31 The generated code solu-
tions must pass all the testing cases to be judged correct.
From our experiments, we found that current LLMs cannot consistently produce per-
fect code for clinical research data science tasks across all difficulty levels. As shown in
Fig. 2d, for Python tasks, the Pass@1 scores vary significantly based on task difficulty.
For Easy tasks, most LLMs achieve Pass@1 rates in the range of 0.40-0.80. However,
for Medium tasks, the Pass@1 rates drop to 0.15-0.40, and for Hard tasks, they range
from 0.05 to 0.15. Performance differences also exist between different LLMs, particularly
within the same series. For instance, the lightweight variant GPT-4o-mini generally un-
derperforms compared to its larger counterpart, GPT-4o, with differences in performance
of up to twofold in many cases. This highlights the limitations of current LLMs, especially
as task complexity increases. The trend is similar in R, where performance declines with
increased task difficulty, though there is a significant difference in performance between
Python and R tasks (Fig. 2e).
Diving deeper into the variations across instruction types, we observed that (1) in
Python tasks (Fig. 2d), neither automatically generated prompts nor manually crafted
prompts consistently outperformed the Vanilla prompts. For example, Vanilla performed
better than AutoPrompt in 4 out of 6 LLMs for Easy tasks, 2 out of 6 for Medium tasks,
and 4 out of 6 for Hard tasks. A similar trend was observed for R tasks (Fig. 2e). (2) More
powerful models, such as GPT-4o and Gemini-Pro, showed greater benefits from carefully
crafted instructions, particularly in Easy and Medium Python tasks. In contrast, lighter
models like GPT-4o-mini and Gemini-Flash did not exhibit such improvements, and in
some cases, complex instructions even seemed to hinder performance. This suggests that
lightweight models may struggle to fully interpret and utilize complex instructions, which
can reduce their effectiveness in data science coding tasks.
We adjusted the temperature settings to sample multiple solutions from LLMs and
calculated Pass@5 scores for Python tasks (Fig. 2b). In most cases, increasing the tem-
perature allows LLMs to generate more creative and diverse solutions, resulting in higher
probabilities of producing a correct solution. This trend was consistent across all models,
suggesting the potential benefit of having LLMs brainstorm multiple solutions to reach
better outcomes. On average, LLMs solved more tasks when given five attempts compared

6
to just one, as measured by Pass@1. However, despite this improvement, the overall per-
formance remains far from perfect.

2.3 Unlocking the power of LLMs through strategic adaptations


Motivated by the varied performances of LLMs with different instruction levels, we hy-
pothesized that tailored adaptations for LLMs in clinical research data science could lead
to greater improvements. To test this, we introduced two key dimensions of adaptation:
(1) enhancing LLM inference and reasoning by incorporating advanced instructions or
external knowledge, and (2) employing multiple rounds of trial-and-error, allowing LLMs
to iteratively correct their errors. The results of these adaptations are shown in Fig. 3.
For the first dimension of adaptation, in addition to Manual (ManualPrompt) and
Automatic prompt optimization (AutoPrompt), we introduced three additional strate-
gies: chain-of-thought (CoT), few-shot prompting (Few-shot), and retrieval-augmented
generation (RAG). ManualPrompt incorporates human knowledge into the instructions,
offering additional hints such as common error cases, key columns like unique patient iden-
tifiers, and specific guidance for certain analyses. AutoPrompt utilizes the dspy prompt
optimizer,31 which generates prompts via an LLM and selects the best one. We optimized
prompts using three studies from the training set, keeping 11 studies as the test set. RAG
equips LLMs with a Google search engine, enabling them to look up package documenta-
tion, StackOverflow discussions, and clinical knowledge before generating code solutions.
Few-shot prompting adds several example question-and-answer pairs from the training set
to guide the model. For CoT, we enriched the instructions with step-by-step guidance,
asking the LLM to follow concrete steps toward the final solution. These instructions were
manually created to ensure accuracy, mimicking scenarios where proficient data scientists
provide more detailed input.
The comparison of adaptation strategies based on GPT-4o is illustrated in Fig. 3b.
Each data point represents the average Pass@1 score achieved for coding tasks in a given
study. A point on the diagonal line indicates equivalent performance between the adap-
tation and the vanilla method. The results can be categorized into three patterns:
• AutoPrompt overfitted on the training tasks and struggled to generalize effectively on
the testing tasks. The diversity of analyses in our dataset led to substantial differences
between the training and testing tasks, which AutoPrompt failed to navigate. This
limitation is further verified by the results from Few-shot, which also did not show
improvements when incorporating examples from the training tasks.
• RAG performed similarly to Vanilla, despite incorporating external knowledge into the
inputs. We hypothesize this is because GPT-4o was likely trained on a wide range
of public sources, including package documentation, webpages, medical articles, and
online guidelines. As a result, the additional information retrieved by RAG offered
minimal benefit, as much of it was already within the model’s pre-existing knowledge.
Furthermore, the retrieval process can sometimes introduce noise, embedding irrelevant
or distracting context into the prompt, which negatively affects performance.
• ManualPrompt provided a modest improvement, boosting Pass@1 by an average of 10%
across studies and outperforming Vanilla in 7 out of 11 cases. This demonstrates the
effectiveness of incorporating expert knowledge to better adapt LLMs to specific tasks.
However, the benefit remains limited, as LLMs often struggle to process nuanced hints
and apply them accurately to the tasks at hand. In contrast, CoT led to substan-
tial improvements, outperforming Vanilla in 8 out of 11 studies, with improvements
ranging from double to triple the Pass@1 scores. These results highlight the potential
of human-AI collaboration. When LLMs are guided with more structured, step-by-
step instructions from human experts, they can perform significantly better than when
generating solutions independently.
We conducted further experiments to evaluate whether LLMs can solve more problems
through self-reflection. The results are shown in Fig. 3c for Python tasks and Fig. 3d for R
tasks. To enable self-reflection, we provided LLMs with three types of logs captured from

[Figure 3. Panel a: self-reflection inputs (testing logs such as a failed assertion, runtime logs such as a TypeError, and printed intermediate variables) together with the proposed corrected solution. Panel g: example errors for the three types, instruction misfollow (refusing to answer or wrapping code in extra text), package misuse (an ImportError from lifelines.plotting), and data misoperation (a KeyError on a nonexistent column). Panels b-f are plots summarized in the caption below.]

Figure 3: Exploration of strategic adaptations and their effectiveness. a, the inputs for LLMs’ self-reflection are the testing logs, runtime logs, and print statements from the initial code; the output is the proposed corrected solution. b, study-level comparison of different
adaptations versus vanilla methods. c and d, the Pass@1 with increasing rounds of self-
reflections for Python and R tasks, respectively. e and f, the outcome classifications of code
solutions before and after self-reflection for Python and R tasks, respectively. g, demonstra-
tions of three error types.

the first attempt at executing the code solutions: (1) results from running the test cases,
(2) runtime logs that capture any errors encountered during execution, and (3) additional
print statements that show the values and shapes of intermediate variables. These logs
were combined with the original code to help the LLM generate an explanation for the
errors and propose a corrective plan, including revised code (Fig. 3a). We tested self-
reflection over multiple rounds, from 1 to 5, and tracked the trend of Pass@1 performance
throughout the process. From the results, we observed significant improvements through
self-correction. After five rounds of correction, Pass@1 scores increased by an average
of around 0.2 across all task types. For Python tasks, LLMs could solve approximately
60-80% of Easy tasks, 40-50% of Medium tasks, and 20-25% of Hard tasks after self-
correction. For R tasks, LLMs achieved around 40-70% success on Easy tasks, 30-55%
on Medium tasks, and 50-60% on Hard tasks. This represents a substantial improvement
compared to their first attempt, demonstrating the effectiveness and potential of LLMs’
self-reflection capabilities. Notably, most of the improvement occurred within the first
two rounds, with diminishing returns in later rounds, indicating that early corrections are
the most impactful.
We conducted an in-depth analysis of the erroneous LLM-generated solutions for
Python (Fig. 3e) and R tasks (Fig. 3f), categorizing the errors into six types: Tests failure,
Data misoperation, Package misuse, Instruction misfollow, Invalid syntax, and Timeout.
Specifically, Data misoperation refers to errors arising from incorrect operations on the in-
put datasets, such as selecting from non-existing columns. Package misuse includes errors
from passing incorrect arguments to functions, importing incorrect packages, or calling
functions without proper imports. Instruction misfollow occurs when LLMs fail to follow
the provided instructions, leading to outputs that are a mixture of text and code or refus-
ing to answer the question. A couple of example error cases are shown in Fig. 3g. Overall,
most of the erroneous solutions failed the testing cases but could still be executed. The
next most common errors were Data misoperation and Package misuse. After applying
self-reflection, the most significant improvement came from LLMs resolving many of the
Data misoperation and Package misuse errors, making the code executable, which is re-
flected by the increase in errors related to Tests failure. This demonstrates the utility of
LLM self-correction in addressing relatively superficial errors identified through execution
logs. One notable anomaly was Gemini-Pro, which encountered a high number of Instruc-
tion Misfollow errors, especially after self-reflection. This was likely due to Gemini-Pro’s
strict safety policies, which caused the model to refuse to answer certain coding questions,
particularly when trying to do self-correction.
Upon reviewing the error cases, we found that many of the LLM-generated solutions,
while imperfect, are close to correct and only require minor manual edits. To quantify how
much these solutions can reduce the effort required for data science tasks, we compared the
LLM-generated solutions with the reference solutions (Fig. 2c). Specifically, we conducted
a difference analysis to compute the proportion of reference code that could be replicated
by the LLM-generated solutions. The results show that, despite their imperfections, LLM-
generated solutions are promising for streamlining data science workflows. For instance,
LLMs produced code that covered approximately 60% of the reference solutions for Easy
tasks, around 60% for Medium tasks in R and 40% in Python, and about 50% for Hard
tasks in R and 25% in Python. It is important to note that this metric underestimates
the code similarities, as AI-generated code may achieve the same function in a different
way from the reference solution.
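One way such a difference analysis could be implemented is a line-level diff between the reference and generated solutions, as in the minimal sketch below; the paper's exact procedure is not specified here, so the details are assumptions:

import difflib

def reference_coverage(reference_code: str, generated_code: str) -> float:
    # Fraction of reference-solution lines that also appear, in order, in the
    # LLM-generated code (whitespace-stripped, blank lines ignored).
    ref = [l.strip() for l in reference_code.splitlines() if l.strip()]
    gen = [l.strip() for l in generated_code.splitlines() if l.strip()]
    matcher = difflib.SequenceMatcher(a=ref, b=gen, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(ref), 1)
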

2.4 Human-AI collaboration boosts productivity for data science in clinical research
Thus far, the two critical findings from our experiments are: (1) When human experts
provide more detailed, step-by-step instructions, the quality of LLM-generated code sig-
nificantly improves, as demonstrated by the superior performance of Chain-of-Thought
(CoT) prompting (Fig. 3b). (2) Although LLM-generated code is often imperfect, it
serves as a strong starting point for human experts to refine. Evidence from Fig. 2c shows

[Figure 4. Panel a: side-by-side comparison of AI-generated code and user-submitted code, marking what users added, deleted, and changed. Panel d: the three target studies and their analyses: Study 1 (Zehir et al., 2017; 9 analyses including data processing, data exploration, ridgeplot, survival analysis, and heatmap), Study 2 (Welch et al., 2016; 10 analyses including box plot, bar plot, statistical test, and survival regression), and Study 3 (Mostavi et al., 2020; 10 analyses including dimension reduction, heatmap, clustering, machine learning, and scatter plot). Panel e: aggregated survey ratings on a 1 (strongly disagree) to 5 (strongly agree) scale: output quality 3.4 (2.5-4.3), support & integration 3.0 (2.2-3.8), system complexity 3.5 (2.7-4.3), and system usability 4.0 (3.2-4.8). Panel f: example user comments on confidence in the system, the learning curve, and the overall experience. Panels b and c are plots summarized in the caption below.]

Figure 4: Overview of the user study. a, we compare the LLM-generated code captured
in our logging system and the user-submitted answers to highlight the modifications made by
users. b, statistics of difficulty levels of the user-faced coding tasks and the user operations
using our platform. c, the distributions of the proportions of user-submitted code that are
copied and pasted from LLM-generated code. d, the target studies we asked users to work on. e, the aggregated feedback obtained from the questionnaires we sent to users. f, examples of collected user comments.

that LLM-generated code is close to the correct solution, and Fig. 3e and Fig. 3f indicate
that most code executes successfully but fails only at the final testing stages. Additionally,
LLM self-reflection can resolve most of the bugs. These findings highlight the potential of
LLMs to assist data scientists in streamlining the coding process in clinical research.
To bridge the gap in utilizing LLMs for clinical research and leveraging the insights
from our experiments, we developed a platform that integrates LLMs into data science
projects for clinical research. The architecture of this platform is shown in Extended
Fig. 1. The platform is designed to offer an integrated interface for users to:
• Chat with LLMs to brainstorm and plan analyses, with the ability to query external
knowledge bases, including webpages, research papers, and other resources.
• Generate code for data science tasks through interactions with LLMs, allowing users to
streamline code writing for complex analyses.
• Identify and debug errors in the user-provided code, with LLMs proposing solutions to
improve the code.
The interface supports real-time interactions, allowing users to generate and execute code
in a sandbox environment with instant visualizations. This removes the need for users
to handle complex prompt crafting or manually switch between chat sessions and coding
platforms like Jupyter Notebook. By simplifying the data science workflow, the platform
empowers users with minimal coding expertise to perform complex data science tasks with
ease.
In our user study, we involved five medical doctors with varying levels of coding ex-
pertise. Each participant was assigned three studies,33–35 with approximately 10 coding
tasks per study (Fig. 4d). Users worked with LLMs on our platform to complete these
tasks and submitted their solutions once their code passed all the test cases. The difficulty
levels of the tasks were quantified, with the distribution shown in Fig. 4b. During the
study, we tracked two core actions: code generation and code improvement (debugging)
requests. The statistics of these user behaviors are depicted in Fig. 4b, where most users
completed the first two studies, and a few tackled the third. After the study, we analyzed
the logs to compare the LLM-generated code with the final code solutions submitted by
the users (Fig. 4a). Additionally, we conducted a survey to gather their feedback on the
platform and their experience working with LLMs. The survey questions were built based
on the Health Information Technology Usability Evaluation Scale (Health-ITUES).36
The results of the code comparison analysis are presented in Fig. 4c, showing the
distribution of the proportion of user-submitted code derived from LLM-generated solu-
tions. We found that a significant portion of the user-submitted code was drawn from
AI-generated code. For Easy tasks, the median proportions were 0.88, 0.87, and 0.84
across the three studies, indicating that users heavily relied on LLM-provided solutions
when crafting their final submissions. For Medium and Hard tasks, the ratios were more variable: in Study 1, the proportions were 0.44 for Medium tasks and 0.96 for Hard
tasks, while in Study 2, the proportions were 0.75 for Medium and 0.28 for Hard tasks.
These findings demonstrate the potential of LLMs to streamline the data science process,
even for users without advanced coding expertise, with greater reliance on LLMs for easier
tasks and more mixed results for more complex ones.
The quantitative results from the user survey are summarized in Fig. 4e, where we
grouped the questions into four main categories. The average user ratings for each category
are: Output Quality (3.4), Support & Integration (3.0), System Complexity (3.5), and
System Usability (4.0). These ratings suggest that, overall, users had a positive experience
using the platform. Additionally, we collected qualitative feedback (Fig. 4f), where one
user expressed a strong interest in continuing to use AI for research on their own data,
highlighting the platform’s practical utility. Another user acknowledged the platform’s
value in helping them learn programming and data analysis, underscoring its potential as
an educational tool for those with limited coding experience. These insights reinforce the
platform’s ability to enhance both productivity and learning in data science workflows.

3 Discussion
In collaboration with medical experts, data scientists play a pivotal role in analyzing
complex datasets, such as real-world patient data, to derive insights that improve patient
care and inform evidence-based medicine. However, the rising demand for data science
expertise, combined with the limited availability of skilled professionals, has created a bot-
tleneck, slowing progress and hindering the full potential of data-driven clinical research.
This shortage is restricting the ability to fully harness the vast amount of data available
for advancing clinical research.
Large language models (LLMs) have emerged as powerful generalist artificial intelli-
gence (AI) capable of following human instructions to perform a wide range of tasks in
medicine.37–39 In parallel, LLMs have demonstrated strong capabilities in solving coding
challenges,5 completing software engineering tasks,13 and performing basic data analy-
sis.20 These advancements suggest that LLMs hold great promise for streamlining data
science projects in clinical research, a potential that has not yet been fully explored.
The primary goal of this study is to thoroughly evaluate the performance of cutting-
edge LLMs in handling complex clinical research data and performing data science pro-
gramming tasks. To achieve this, we developed a comprehensive coding benchmark,
CliniDSBench, comprising 39 published clinical studies and 293 diverse, practical, and
high-quality data analysis tasks in both Python and R. Based on these benchmarks, we
found that current LLMs are not yet capable of fully automating data science tasks. At
their first attempts, LLMs successfully solved only 40%-80% of Easy tasks, 15%-40% of
Medium tasks, and 5%-15% of Hard tasks. This highlights the necessity of human over-
sight and post-processing to prevent misinformation and incorrect results when relying on
LLMs for clinical data analysis.
Though imperfect, we found that much of the LLM-generated code was quite close to
the correct solution. This observation motivated us to explore advanced adaptation meth-
ods to improve LLM performance further. On the one hand, we found that LLMs could
self-correct a significant portion of erroneous code, leading to substantial improvements
over their initial attempts. On the other hand, involving human experts more directly
in the process, such as by providing concrete, step-by-step plans for data analysis tasks,
resulted in the best performance across all adaptation strategies.
Beyond automatic testing, we conducted a user study using our developed interface,
which integrates LLMs into the data science workflow. The study revealed that users
heavily relied on LLM-generated code when crafting their final solutions, validating the
effectiveness of LLMs in streamlining the coding process. This workflow typically followed
a pattern where LLMs provided an initial solution, users collaborated with the LLMs
for debugging, and then the users refined the final solution. The platform was highly
appreciated by users, not only for its practical utility in accelerating data science tasks
but also for its educational value in helping them improve their programming and data
analysis skills.
This study has several limitations. First, to ensure the quality of the benchmark, we
manually created all the questions and solutions for the analysis tasks, which restricted
our ability to scale the benchmark to cover more studies. A larger dataset with more
coding tasks would not only enhance evaluation but could also be used for training LLMs
specifically for data science tasks. Second, the user study results may be biased, as the
participants were primarily medical doctors who, while knowledgeable in their domain,
had varying levels of coding proficiency. The usage patterns might differ significantly if
data scientists were the users, as they possess more advanced coding skills but know less
about medicine. Third, the patient-level data in our testing set are publicly available, but
privacy risks must be carefully considered when deploying LLMs for real-world clinical
data analysis. A recommended approach would be to separate the environment running
code on sensitive patient data from the environment where LLMs are used. LLMs should
only access the dataset schema or global statistics without access to individual patient
data. Finally, the patient data used in our testing set were relatively clean, standardized,
and semantically meaningful. In real-world scenarios, data can be messier and more varied,
which could affect LLM performance. Future work should explore strategies to handle less
structured, real-world data effectively.
The findings from our study show that while LLMs are not yet capable of fully au-
tomating clinical research data science tasks, they can be valuable tools when used in
collaboration with human experts. This human-AI partnership can lead to the creation
of effective coding solutions, boost productivity, potentially accelerate drug development,
and improve patient outcomes. However, testing this hypothesis requires future prospec-
tive studies in clinical research to confirm the practical impact of such collaborations.

4 Methods
4.1 Dataset curation
We created the testing dataset, referred to as CliniDSBench, based on published clinical
studies and their associated patient-level datasets from cBioPortal.23 cBioPortal is a
comprehensive database for cancer and genomics research, providing access to hundreds of
studies with linked patient data. These datasets encompass various modalities, including
clinical data, clinical sample data, mutations, copy number alterations, structural variants,
RNA, mRNA, and tumor miRNA, among others. This setup ensures that the coding
tasks in CliniDSBench are closely aligned with the real-world challenges faced in clinical
research data science, using authentic data and analysis tasks.
We began by reviewing the studies listed on cBioPortal’s dataset page. For each study,
we labeled the types of analyses performed. These labels were then aggregated to identify
the most common analyses, ensuring the selected studies covered a comprehensive range
of tasks for the testing dataset. For each selected study, we manually created coding tasks
based on the extracted analyses, mirroring the sequence of data analysis steps that led to
the findings in the original studies. Each coding task represents one step in this process.
The tasks are structured with five key components: the input question, a description of
the patient dataset schema, prefix code, reference solutions, and test cases. An example
of the input coding task is shown in Extended Fig. 2.
To ensure the feasibility of automatic testing, it is crucial to maintain consistency
between the input question and the testing cases, particularly regarding the output name
and format. For instance, a simple question might be: “tell me the number of patients in
the dataset”. This question is inherently open-ended, allowing for a variety of answers.
The most straightforward approach is to calculate the unique number of patient IDs
in the dataset, such as num = df["patient id"].nunique(). However, for the testing
cases to work, it is essential that the variable num represents this number in the code.
Since the variable name can be arbitrary (e.g., n, num patient, or number of patients),
a testing case inspecting the variable num will fail if the name differs. To avoid this
issue, each question is divided into two parts: the task description and the output format
requirement, ensuring a constrained answer. For example, the full question would also
specify the output requirement: “make sure the output number of patients is an integer
assigned to a variable named "num"”. Correspondingly, testing cases like assert num ==
20 are attached to the LLM-generated code to verify its correctness.
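Putting these pieces together, a task of this kind could look like the sketch below; the file name, column name, and expected count are illustrative assumptions, not items from the benchmark:

import pandas as pd

# Task description: "How many patients are in the dataset?"
# Output requirement: "assign the number of patients as an integer to a variable named `num`."
df = pd.read_csv("data_clinical_patient.csv")   # file name assumed for illustration
num = int(df["PATIENT_ID"].nunique())           # column name assumed for illustration

# Test case appended to the (LLM-generated) solution before execution:
assert num == 20   # expected value taken from the curated reference answer
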
The prefix code refers to the prerequisite code necessary to run before addressing a
specific question. This approach mirrors the workflow of data scientists working in com-
putational notebook environments like Jupyter Notebook,40 where certain data processing
steps are required for multiple analyses. However, it would be redundant and inefficient
to repeat these steps for every coding task. For example, in one step, a data scientist
might merge the patient clinical data table with the mutation table to link patient out-
comes with gene mutation information. This merged dataset is then used in subsequent
analyses, such as survival analysis grouped by gene mutations. For these follow-up tasks,
the LLMs are not required to repeat the data merging process. Instead, the merging code
is provided as prefix code, allowing the LLMs to build on the processed data and focus on
the specific task at hand. This structure ensures efficiency and mimics how data scientists
typically manage code dependencies across related tasks.
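As an illustration, the prefix code for the merging example above might resemble the sketch below; the file and column names are assumptions rather than the benchmark's actual schema:

import pandas as pd

# Prefix code: executed before the task-specific question is posed.
data_clinical_patient = pd.read_csv("data_clinical_patient.csv")
data_mutations = pd.read_csv("data_mutations.csv")

# Link patient outcomes with gene mutation information.
merged = data_clinical_patient.merge(data_mutations, on="PATIENT_ID", how="inner")

# Follow-up tasks (e.g., survival analysis grouped by gene mutations) receive
# `merged` as given and only implement the new analysis step.
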
To protect the privacy of patient records, it is crucial to handle how patient data is
passed in prompts to proprietary LLMs, such as OpenAI’s GPT models, for coding tasks.
Our approach avoids using individual-level patient records as input. Instead, we use a
template to generate a caption for each dataset. This caption includes the table’s name,
its dimensions (shape), the names of all columns, and representative values from each
column. This method ensures that no private or sensitive information about individuals
is shared, while still enabling LLMs to understand the dataset’s structure and content
sufficiently to synthesize code for data science.
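A minimal sketch of such a caption template is shown below; the exact wording and the number of representative values are assumptions:

import pandas as pd

def describe_table(df: pd.DataFrame, name: str, n_examples: int = 3) -> str:
    # Build a schema-level caption: table name, shape, column names, and a few
    # representative values per column; no individual patient rows are included.
    lines = [f"A table named `{name}` with shape {df.shape}.", "Columns:"]
    for col in df.columns:
        examples = df[col].dropna().unique()[:n_examples]
        values = ", ".join(str(v) for v in examples)
        lines.append(f"- {col} (dtype {df[col].dtype}); example values: {values}")
    return "\n".join(lines)
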
We developed specific testing cases for various types of outputs required to answer
the input data science questions, including numerical, categorical, dataframes, and object
outputs. For numerical outputs, such as the number of patients (integers) or average age
(continuous values), the testing cases check for exact matches in integers and verify that
the absolute difference for continuous values falls within an acceptable error range. For
categorical outputs, such as a list of the top 10 frequently mutated genes, the testing
cases ensure that the generated list matches the expected set. For dataframe outputs,
such as merging two tables, we clarify the expected column names and content in the
question. The testing cases then verify the shape of the resulting dataframe and check
global statistics for each column, depending on the variable types. When the expected
outputs are special objects, such as in visualization tasks, where the output is a figure,
we specify in the question which tools should be used to create the visualization. The
testing cases then check if the specified tools were used correctly. Additionally, we include
instructions to save the inputs used for plotting, allowing us to verify the correctness of
these inputs as a proxy for validating the accuracy of the visualizations.
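Hedged examples of what such test cases can look like for the different output types are sketched below; the variable names and expected values are illustrative, not drawn from the benchmark:

import math

# Toy outputs standing in for what a generated solution would produce.
num_patients = 128                       # integer output
mean_age = 61.38                         # continuous output
top10_genes = ["TP53", "KRAS", "EGFR"]   # categorical output (shortened)

# Numerical outputs: exact match for integers, tolerance for continuous values.
assert num_patients == 128
assert math.isclose(mean_age, 61.4, abs_tol=0.1)

# Categorical outputs: compare as sets so ordering does not matter.
assert set(top10_genes) == {"TP53", "KRAS", "EGFR"}

# Dataframe outputs additionally check the shape and per-column statistics, and
# visualization tasks check that the specified plotting tool was used and that
# the saved plotting inputs match the expected values.
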
To estimate the difficulty level of each coding task, we calculated the number of seman-
tic lines in the reference solutions. This was done by using GPT-4o to analyze the input
code and decompose it into a sequence of operations. The unique number of operations
was used as an indicator of semantic lines. For Python tasks, we categorized those with
fewer than 10 semantic lines as Easy, 10-15 as Medium, and more than 15 as Hard. For
R tasks, we defined those with fewer than 6 semantic lines as Easy, 6-10 as Medium, and
more than 10 as Hard. The prompt used to extract these operations is shown in Extended
Fig. 7.
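The stated thresholds map directly onto a small helper such as the one below; the semantic-line count itself comes from the GPT-4o decomposition described above:

def difficulty_level(semantic_lines: int, language: str) -> str:
    # Map a task's semantic-line count to Easy/Medium/Hard using the thresholds
    # defined for Python (<10, 10-15, >15) and R (<6, 6-10, >10) tasks.
    if language.lower() == "python":
        easy_max, medium_max = 9, 15
    else:
        easy_max, medium_max = 5, 10
    if semantic_lines <= easy_max:
        return "Easy"
    if semantic_lines <= medium_max:
        return "Medium"
    return "Hard"

print(difficulty_level(12, "python"))  # Medium
print(difficulty_level(12, "r"))       # Hard
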

4.2 Large language models and adaptation methods


We investigated a diverse range of transformer-based large language models (LLMs) for
data science code generation tasks, focusing on cutting-edge proprietary models. These in-
clude OpenAI’s GPT-4o24 and GPT-4o-mini,25 Google’s Gemini-Pro and Gemini-Flash,28
as well as Anthropic’s Opus-327 and Sonnet-3.5.26 Each of these models is a flagship
proprietary LLM known for its strong performance in medical and biomedical tasks. Ad-
ditionally, all these models feature long context windows, enabling them to handle large
inputs efficiently: GPT-4o and GPT-4o-mini support up to 128K tokens, Gemini-Pro up
to 2M tokens, Gemini-Flash up to 1M tokens, and both Opus-3 and Sonnet-3.5 support
up to 200K tokens. This extended context capacity is essential for processing complex
datasets and tasks typical in clinical research and data science.
No open-source code LLMs were included in this study for several key reasons. Firstly,
most open-source code LLMs have limited context lengths, typically ranging from 2K to 8K
tokens, which is insufficient for many of the data science tasks in our dataset. These tasks
not only require input questions but also detailed dataset schema descriptions, sometimes
spanning multiple tables with hundreds of columns. Secondly, previous studies have shown
that open-source code LLMs significantly underperform compared to proprietary models,
even on simpler tasks. For instance, in DS-1000,20 proprietary models like Codex41 out-
performed open-source models such as CodeGen42 and InCoder43 by four to five times.
Similarly, in BioCoder,22 GPT-4 achieved a Pass@1 rate of approximately 40%, while
open-source models like StarCoder, even at 15.5 billion parameters,8 scored below 10%,
despite fine-tuning. Given these findings, the proprietary LLMs used in our study can be
considered to represent the upper bound of current LLM performance.
In this study, we explored adaptation methods to guide pre-trained generalist LLMs
for specific tasks without fine-tuning the models. The primary reason for this approach
was the limited dataset scale, which was only sufficient for testing purposes. Additionally,
including publicly available code examples from sources like GitHub would likely offer
minimal benefit, as these LLMs have already been extensively trained on such data.

In-context learning LLMs exhibit a remarkable ability to comprehend input requests and follow provided instructions during code generation. A key concept in this
process is in-context learning (ICL), which allows LLMs to learn from examples and task
instructions provided within the input context at inference time.30 ICL has become a
major technique for adapting LLMs to medical tasks.38, 39, 44 In this study, we imple-
mented ICL across all methods, as each input question contains specific instructions for
the expected output format, which the LLMs use to generate responses that aim to pass
testing cases. The Vanilla method uses minimal prompt engineering, simply asking the LLMs to answer the input coding question (Extended Fig. 3). We further enhanced this
by incorporating additional expert knowledge into the prompts, which we refer to as
the ManualPrompt variant, as shown in Fig. 2 and Fig. 3. The details of this prompt are
shown in Extended Fig. 4. Additionally, few-shot prompting, a form of ICL, was employed
to guide LLMs to produce both high-quality and correctly formatted outputs. This was
achieved by adding demonstrations of example input questions and output code solutions
into the prompt, following the five-shot prompting technique. Consistent with prior find-
ings,45 we observed that using relevant examples is more effective than random ones. To
optimize this, we dynamically retrieved examples most relevant to the input question by
computing semantic similarity using OpenAI’s embedding model.46 This ensured that the
examples provided in the prompt closely aligned with the task at hand, improving LLM
performance. This approach was identified as the Few-shot variant in experiments shown
in Fig. 3b.
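A minimal sketch of this retrieval step is shown below, assuming the openai v1 Python client; the embedding model name is an assumption, since the text cites OpenAI's embedding model without naming it here:

import numpy as np
from openai import OpenAI

client = OpenAI()                        # assumes OPENAI_API_KEY is set
EMBED_MODEL = "text-embedding-3-small"   # model name assumed for illustration

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([item.embedding for item in resp.data])

def top_k_examples(question: str, train_questions: list[str], k: int = 5) -> list[int]:
    # Return indices of the k training questions most similar to the input
    # question, ranked by cosine similarity of their embeddings.
    vectors = embed(train_questions + [question])
    train_vecs, query_vec = vectors[:-1], vectors[-1]
    sims = train_vecs @ query_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return list(np.argsort(sims)[::-1][:k])
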

Chain-of-thought Research has shown that prompting LLMs to break down tasks
into multiple steps, rather than providing a direct answer, significantly improves per-
formance.29 These steps can either be generated by the LLM or provided by a human
expert. In our experiments, we implemented this technique by creating detailed step-by-
step instructions on how to solve data science tasks, referred to as the CoT variant. This
approach is reflected in the experiment results, as shown in Fig. 3b, where it consistently
outperformed direct answer generation by guiding the LLM through a structured process.
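An illustrative CoT-style instruction for a survival-analysis task might read as follows; the actual step-by-step plans in the benchmark were manually written by the authors and may differ in wording and detail:

cot_instruction = """Solve the task by following these steps:
1. Load data_clinical_patient.csv and keep only the columns needed for the analysis.
2. Split patients into groups by the treatment variable.
3. Fit a Kaplan-Meier estimator per group using overall survival time and event status.
4. Plot the survival curves and run a log-rank test between the groups.
5. Assign the test p-value to a variable named `p_value`."""
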

Automatic prompting LLMs have demonstrated strong capabilities in generating text, including the input prompts themselves, which describe target tasks. This opens up
the possibility for LLMs to generate and optimize their own prompts.47 We implemented
an automatic prompt optimization pipeline using DsPy’s31 Optimizer for instruction re-
finement. This system works in two parts: a prompt generator, which proposes new
prompts in each iteration, and an output evaluator, which assesses the quality of the
LLM’s output based on these candidate prompts. The evaluator, also an LLM, evalu-
ates the generated answers and returns a score. Through repeated iterations, the prompt
generator refines and proposes increasingly better prompts, supervised by the evaluator’s
feedback. This method is referred to as the AutoPrompt variant in the results shown in
Fig. 2 and Fig. 3. The automatically generated prompt is shown in Extended Fig. 5.
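For illustration, the following sketch mirrors the two-part structure described above, with the prompt generator and the output evaluator passed in as generic callables; it is a simplified stand-in for, not a reproduction of, the DSPy Optimizer pipeline.

# Sketch of the propose-and-evaluate loop behind automatic prompt optimization.
# `call_llm` (a text-in/text-out LLM call) and `score_outputs` (an LLM-based judge
# returning a numeric score) are injected callables; this is not the DSPy API itself.
from typing import Callable, Sequence

def optimize_prompt(
    seed_prompt: str,
    dev_questions: Sequence[str],
    call_llm: Callable[[str], str],
    score_outputs: Callable[[Sequence[str], Sequence[str]], float],
    n_iterations: int = 5,
    n_candidates: int = 4,
) -> str:
    best_prompt, best_score = seed_prompt, float("-inf")
    for _ in range(n_iterations):
        # 1) Prompt generator: propose rewrites of the current best instruction.
        candidates = [
            call_llm(f"Rewrite this instruction so the generated code is more accurate:\n{best_prompt}")
            for _ in range(n_candidates)
        ]
        for candidate in candidates:
            # 2) Output evaluator: score the answers produced under each candidate prompt.
            answers = [call_llm(f"{candidate}\n\n{q}") for q in dev_questions]
            score = score_outputs(answers, dev_questions)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt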

Retrieval-augmented generation LLMs that rely solely on their internal knowledge
often produce erroneous outputs, particularly due to outdated information or hal-
lucinations. Retrieval-Augmented Generation (RAG) addresses this issue by dynamically
incorporating external knowledge into prompts during generation.32 In our experiments,
we implemented RAG through an external API that connects to the Google search en-
gine via Vertex AI Search.48 We restricted searches to medical-related sources such as
PubMed, as well as coding-related platforms like GitHub and StackOverflow. The top 10
most relevant search results were retrieved and incorporated into the LLM’s prompt to
assist with solving coding tasks. This forms the foundation of the RAG variant shown in
the experiment results in Fig. 3.
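A simplified sketch of how retrieved snippets are folded into the prompt is shown below; the search callable stands in for the Vertex AI Search API, and the result fields (title, snippet) are assumptions.

# Sketch: fold retrieved snippets into the code-generation prompt. `search` is an
# injected retrieval callable (in our study, Vertex AI Search restricted to PubMed,
# GitHub, and StackOverflow); the result dictionary keys below are assumptions.
from typing import Callable, Dict, List, Sequence

ALLOWED_SITES = ["pubmed.ncbi.nlm.nih.gov", "github.com", "stackoverflow.com"]

def build_rag_prompt(
    question: str,
    schema: str,
    search: Callable[[str, Sequence[str], int], List[Dict[str, str]]],
    top_k: int = 10,
) -> str:
    results = search(question, ALLOWED_SITES, top_k)
    context = "\n\n".join(
        f"[{i + 1}] {r['title']}\n{r['snippet']}" for i, r in enumerate(results)
    )
    return (
        "Use the following retrieved references when helpful:\n"
        f"{context}\n\n"
        f"Task:\n{question}\n\nDataset schema:\n{schema}\n\n"
        "Return the solution wrapped in <code> tags."
    )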

Self-reflection LLMs can produce flawed outputs on their first attempt, but they
can improve through iterative feedback and refinement.49 This approach mirrors the
natural process humans follow when programming, testing, and debugging code.12 In our
experiments, we implemented self-reflection, allowing LLMs to attempt debugging their
incorrect code solutions, with results shown in Fig. 3c. The process involved executing the
initially generated code along with the testing cases and collecting output logs reflecting
(1) errors from failed tests, (2) runtime errors within the code, and (3) the printed values
and shapes of intermediate variables. We then prompted the LLM to explain why the
code was incorrect, propose a plan for correction, and provide a revised code solution.
This cycle was repeated up to five times, and in each round, only the unresolved questions
were carried forward for further self-reflection. This iterative method allowed LLMs to
gradually improve their code solutions over multiple attempts. The prompt used to enable
the LLM’s self-reflection is shown in Extended Fig. 6.
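The sketch below outlines this execute-reflect-retry loop; the helper callables (test runner, LLM call, code extractor) are placeholders for the corresponding components of our pipeline, and the inline prompt is a condensed paraphrase of Extended Fig. 6.

# Sketch of the iterative self-reflection loop. `run_tests` executes the code plus
# the unit tests in the sandbox and returns (passed, logs); `call_llm` and
# `extract_code` are injected callables standing in for pipeline components.
from typing import Callable, Tuple

def self_reflect(
    question: str,
    code: str,
    run_tests: Callable[[str], Tuple[bool, str]],
    call_llm: Callable[[str], str],
    extract_code: Callable[[str], str],
    max_rounds: int = 5,
) -> str:
    for _ in range(max_rounds):
        passed, logs = run_tests(code)  # assertion failures, runtime errors, printed variables
        if passed:
            return code  # all testing cases passed; stop early
        # Ask the LLM to explain the failure, propose a fix plan, and return revised code
        # (condensed paraphrase of the prompt in Extended Fig. 6).
        prompt = (
            f"Task:\n{question}\n\nCurrent code:\n{code}\n\n"
            f"Execution and test logs:\n{logs}\n\n"
            "Explain why the code is incorrect, propose a correction plan, "
            "and return a revised code solution wrapped in <code> tags."
        )
        code = extract_code(call_llm(prompt))
    return code  # best attempt after the allotted rounds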

4.3 Experimental setup


All experiments were run in Python v3.10. The versions of key software are: anthropic
v.0.34.2, boto3 v.1.35.16, openai v.1.44.1, google-generativeai v.0.7.2, google-cloud-aiplatform
v.1.65.0, dspy-ai v.2.4.14, langchain v.0.2.16, and docker v.7.1.0 with Python v.3.10.
Specifically, we accessed OpenAI’s models via OpenAI’s platform, the Google models
through Vertex AI on Google Cloud, and Anthropic’s models through AWS’s
Bedrock APIs.
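As a minimal illustration of how a coding question is sent to one of these APIs, the sketch below uses the OpenAI Python client; the model identifier is a placeholder rather than the exact model version evaluated.

# Minimal sketch of querying one proprietary LLM through its official SDK.
# The model name below is a placeholder for the specific versions evaluated.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def generate_code(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",      # placeholder model identifier
        temperature=0,       # deterministic decoding, as used for Pass@1 evaluation
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content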

Sandbox development We established a sandbox environment to enable the automatic
execution and testing of code generated by LLMs. This was accomplished by
creating a standardized Docker image that hosts both Python and R environments, al-
lowing scripts in either language to be run via the command line. We utilized Pipenv to
manage the Python environment, installing packages like pandas and matplotlib. Sim-
ilarly, for R, we defined necessary packages such as dplyr and survival to be installed
when building the image. The sandbox interface dynamically builds Docker containers
based on the defined image, accepts code strings from LLMs, converts them into Python
or R scripts, and then executes them. The sandbox also accepts dataset uploads, enabling
parallel real-time code execution without impacting the main experimental environment.
This setup ensures a controlled and isolated environment for running and testing LLM-
generated code safely and efficiently.
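A minimal sketch of this sandbox interface, based on the Docker SDK for Python, is shown below; the image name and mount paths are assumptions rather than the exact configuration used.

# Sketch: run an LLM-generated script inside an isolated container using the
# Docker SDK for Python. The image name and mount paths are assumptions.
import pathlib
import tempfile
import docker

def run_in_sandbox(code: str, language: str = "python", dataset_dir: str = "./data") -> str:
    client = docker.from_env()
    suffix = ".py" if language == "python" else ".R"
    workdir = tempfile.mkdtemp()
    script = pathlib.Path(workdir) / f"solution{suffix}"
    script.write_text(code)
    command = (
        ["python", "/workdir/solution.py"]
        if language == "python"
        else ["Rscript", "/workdir/solution.R"]
    )
    # Mount the script read-write and the selected dataset read-only, then capture logs.
    logs = client.containers.run(
        image="clinical-sandbox:latest",  # assumed image hosting both Python and R
        command=command,
        volumes={
            workdir: {"bind": "/workdir", "mode": "rw"},
            str(pathlib.Path(dataset_dir).resolve()): {"bind": "/workdir/data", "mode": "ro"},
        },
        remove=True,
        stdout=True,
        stderr=True,
    )
    return logs.decode("utf-8")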

Platform Development For the user study, we developed a platform to facilitate
human-AI collaborative coding for data science projects, as illustrated in Extended Fig. 1.
The platform is designed to spare users from setting up their own coding environment,
provide code suggestions based on natural language requests, and enable real-time
code execution and feedback. The primary window features a user input box where
users can choose from various platform commands, which are categorized into two types:
brainstorming and programming. In the brainstorming mode, users can interact with the
LLM assistant to search medical publications from PubMed or perform general searches
via Google. They can also collaborate with LLMs to develop plans for data analysis tasks.
In the programming mode, users can either ask the LLM to generate code from scratch or
request improvements or corrections to existing code. Users have the option to generate
code in either R or Python. Once the code is generated, users can execute it within a
sandbox environment. The platform provides execution logs and any produced artifacts,
such as figures, directly to the frontend, allowing users to receive immediate feedback on
the results. This process maximizes the utility of LLMs by enabling users to collaboratively
plan data analyses and then guide LLMs to generate accurate code solutions. In the second
window, users can select patient datasets and apply their generated data analysis code to
gain insights. The platform allows users to preview tables, columns, and values from the
selected dataset, providing a streamlined experience for conducting data science tasks in
a collaborative, AI-assisted environment.

Questionnaire Design We developed a questionnaire to gather feedback from users
following the user study. The survey was designed based on the Health Information
Technology Usability Evaluation Scale (Health-ITUES),36 originally created to assess the
usability of a web-based communication system for scheduling nursing staff. We adapted
the questions in line with the spirit of the scale, covering four main topics: output quality,
support & integration, system complexity, and system usability. The original 22-item
questionnaire was streamlined to 10 items, with users rating each on a 5-point Likert
scale, ranging from strongly disagree to strongly agree. Additionally, we included an open-
ended question, allowing users to provide free-text comments for further insights. This
format ensured a concise yet comprehensive collection of user feedback on the platform’s
performance and usability.

4.4 Evaluation metrics and statistical analysis


The Pass@k metric was used to evaluate the performance of code generation in our study.
Here, n represents the total number of code solutions generated, and c is the number of
correct solutions, where c ≤ n. Correct samples are those that pass all unit tests. The
unbiased estimator5 for Pass@k is given by:

\[
\text{Pass@}k = \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]. \qquad (1)
\]

Pass@k ranges from 0 to 1 and estimates the probability that at least one of the k
generated code samples for a given task passes all the unit tests. In our study, we used
two metrics: Pass@1 and Pass@5. Pass@1 is a stricter metric, evaluating whether the
LLM can solve the task on the first attempt. To ensure reproducibility, we set the LLMs’
temperature to zero for this evaluation. For Pass@5, we allowed the LLMs to generate 10
solutions for each question to estimate the likelihood of producing a correct answer within
five attempts.
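For reference, the unbiased estimator in Equation (1) can be computed with the standard numerically stable product form, as in the sketch below.

# Unbiased Pass@k estimator, computed in a numerically stable form:
# 1 - C(n-c, k)/C(n, k) = 1 - prod_{i=n-c+1}^{n} (1 - k/i).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: generated samples per task, c: samples that pass all unit tests.
    if n - c < k:
        return 1.0  # at least one correct sample is guaranteed in any k draws
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 10 generations, 3 correct, estimate Pass@5.
print(round(pass_at_k(n=10, c=3, k=5), 3))  # -> 0.917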
To compare the LLM-generated code with user-submitted or reference code solutions,
we first parse the code string using abstract syntax trees (AST) to extract operators and
variables, which allows for a structural analysis of the code. We then tokenize the code
based on this parsing result. Using Python’s difflib library, we compare the differences
between two text sequences.
Let s1 represent the tokenized LLM-generated code and s2 represent the tokenized
user-submitted code. We compute the length of overlapping tokens between the two
sequences, denoted as s̄. The ratio of user-submitted code copied from LLM-generated
code can then be calculated using the following formula:

\[
\text{Copy Ratio} = \frac{\text{length}(\bar{s})}{\text{length}(s_2)}. \qquad (2)
\]

This method quantifies the extent to which the user-submitted code overlaps with the
LLM-generated code, providing insight into the level of influence the LLM had on the
final solution.
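The sketch below illustrates this AST-based tokenization and overlap computation; the exact set of node types collected in our pipeline may differ slightly from this simplified version.

# Sketch: tokenize two code strings via their ASTs and measure how much of the
# user-submitted code overlaps with the LLM-generated code.
import ast
import difflib

def ast_tokens(code: str) -> list[str]:
    # Collect variable names, attribute/function names, and operator node types.
    tokens = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            tokens.append(node.id)
        elif isinstance(node, ast.Attribute):
            tokens.append(node.attr)
        elif isinstance(node, (ast.operator, ast.cmpop, ast.boolop)):
            tokens.append(type(node).__name__)
    return tokens

def copy_ratio(llm_code: str, user_code: str) -> float:
    s1, s2 = ast_tokens(llm_code), ast_tokens(user_code)
    matcher = difflib.SequenceMatcher(None, s1, s2)
    overlap = sum(block.size for block in matcher.get_matching_blocks())
    return overlap / len(s2) if s2 else 0.0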

[Extended Fig. 1 screenshot: a, code generation panel with user input box, language selection (Python or R), command history, code execution, and execution output logs and figures; b, dataset panel with dataset selection, dataset overview, and dataset visualization.]
Extended Fig. 1: Overview of the developed clinical research data science platform.
a, code generation panel where users provide their requests, read the generated code, and
switch to execution results. Users can view the previously sent requests and switch back if
desired. b, Users can select one from a list of datasets to analyze. For each dataset, users can
preview the content.

Example coding task

For the significant genes, get the indicator of the mutation type for each patient, the
mutation types of interest are:
- silent
- missense
- splice site
- nonsense
- frame shift
- inframe indel

The output should be dataframe named `mutation_indicator`, with the columns


- PATIENT_ID
- Silent
- Missense
- Splice site
- Nonsense
- Frame shift
- In frame indel
- Hugo_Symbol

where the indicator `1` means mutations, `0` means wild-type.
Question

import pandas as pd

# Load the data


data_mutsig = pd.read_csv("/workdir/data_mutsig.csv")

# Filter genes with mutation significance (-log10 q_value) larger than 1.0
significant_genes_df = data_mutsig[data_mutsig['q'] < 0.1]

# Sort the genes by their significance


significant_genes_df = significant_genes_df.sort_values(by='q', ascending=True)

# Extract the gene names


significant_genes = significant_genes_df['gene'].tolist()

# Save the list to a file


with open("significant_genes.txt", "w") as f:
for gene in significant_genes:
f.write(f"{gene}")

# Print the list of significant genes


print(significant_genes)
Prefix code

assert mutation_indicator["PATIENT_ID"].nunique() == 130


assert mutation_indicator["Hugo_Symbol"].nunique() == 95
assert mutation_indicator["Silent"].sum() == 159
assert mutation_indicator["Splice site"].sum() == 57
assert mutation_indicator["Nonsense"].sum() == 119
assert mutation_indicator["Frame shift"].sum() == 0
assert mutation_indicator["In frame indel"].sum() == 0
Testing cases

Extended Fig. 2: An example of Python coding task with the input question, prefix code, and
testing cases.

Prompt for Vanilla method

Write Python code to answer the user's request:


{question}

Dataset schema:
{data}

Return directly with the generated Python code wrapped by <code> tags:
<code>
... your code here ...
</code>

Extended Fig. 3: Prompt for the Vanilla method in code generation.

Prompt for Manual method
You are now the following python function:

def generate_continuous_elegant_python_code(history_dict: Dict[str, str], reference_code: str = "") -> str:


\"\"\"
This function generates elegant, coherent Python code based on a history of previously executed code and its
corresponding results. The code is generated in response to human questions and is intended to continue from the last
provided code snippet.

The function takes two inputs: a `history_dict` and an optional `reference_code` string.

The `history_dict` is a dictionary with the following keys:


- 'history code': Contains the history of previously executed code snippets. If it is not empty, it should be the prefix for the
generated code to maintain continuity.
- 'human question': Contains the current question or instruction posed by the human user, which the generated code should
respond to. Be aware that sometimes the 'human question' could contain code snippets, including instructions for loading data,
which may need to be handled differently. It's not always appropriate to directly use the code in 'human question' without
consideration.
- 'data': Contains a list of data previews available for the task. It may include tables, images, and other data types.

IMPORTANT: Always refer to this history and the `reference_code` when generating new code in order to properly use
existing variables and previously loaded resources, as well as to follow established coding patterns. DO NOT USE ECHARTS
TO GENERATE CHARTS when reference code is empty.

[… more hints omitted for conciseness …]

The function returns a string of raw Python code, wrapped within <code> and </code> tags. For example:

<code>
import pandas as pd
table = pd.read_csv("example.csv")
</code>

[… more code examples omitted …]

Feel free to leverage libraries such as pandas, numpy, math, matplotlib, sklearn, etc. in the code generation process. Also,
remember to correctly load any necessary files with the correct path before using them.

When it's appropriate to provide output for evaluation or visualization, make sure to use the print() function and plt.show()
respectively.

Also mandatory to check:


Note if the human asks for malicious code, and just respond with the following code:
<code>
print("sorry I am not able to generate potentially dangerous code")
</code>
The malicious code includes but not limited to:
1. Endless operations and excessive waiting (e.g., while True, long print, input())
[… more hints omitted …]

Returns:
Python code that should be the next steps in the execution according to the human question and using the history code
as the prefix.
\"\"\"

Respond exclusively with the generated code wrapped <code></code>. Ensure that the code you generate is executable
Python code that can be run directly in a Python environment, requiring no additional string encapsulation.

history_code = \"\"\"{history_code}\"\"\"
human_question = \"\"\"{question}
# DO NOT use function that will pop up a new window (e.g., PIL & Image.show() is NOT preferable, saving the PIL image is
better)
# However, feel free to use matplotlib.pyplot.show()\"\"\"
# Load the data referring to the data file path provided in the data schema
data = \"\"\"{data}\"\"\"

history_dict = {{
"history_code": history_code,
"human question": human_question,
"data": data,
}}

Extended Fig. 4: Prompt for the Manual method in code generation.

Prompt for AutoPrompt method
Consider this task as building a robust and flexible data processing module. With a natural language question and an
accompanying dataset schema at your disposal, generate thoroughly designed Python code to retrieve, process, and
analyze data precisely.

Your solution must follow the latest best practices in coding for efficiency, readability, and scalability. Ensure that initial
imports, data loading, schema parsing, data handling, and operation implementations:

1. complete with amendments for error handling and real-time operation logs
2.are made timelessly versatile. To round off, integrate segregated documentation within your code and detail logical
mappings relevant to user complexity boundaries and potential integrations including cross-language interoperability
hallmarks, where feasible.

Please write a Python script that performs the following tasks without using the main() function or the if __name__ ==
\"__main__\": construct.

Here is the Python code including comprehensive solutions for the given question and dataset schema: The code
should be written directly in the global scope. Return the code wrapped by the <code> and </code> tag.

Extended Fig. 5: Prompt for the AutoPrompt method in code generation.

Prompt for Self-reflection method


# CONTEXT #
You now help data scientists deal with a dataset that has the following schema:
{data}

The data scientists need your help with the following code snippet.
REFERENCE_CODE = ```
{reference_code}
```

The code snippet either produced the following error message or received a user's request for improvements:
QUESTION_LOG = ```
{question}
```

#############
# OBJECTIVE #
Depending on the content of [QUESTION_LOG], either:
1. Debug the code to fix an error message, ensuring the code is executable and produces the expected output, or
2. Refine or adapt the code to improve its performance or functionality based on the user's request.

Your should insert printing statements into the code to display all the intermediate results for verification and debugging
purposes. You need to solve the problem step-by-step, providing the corrected or improved code snippet at the end.
1. REASON: Analyze the reason for the error or the potential for improvement as described in [QUESTION_LOG].
2. PROPOSAL: Describe the approaches you will take to either fix the error or improve the code.
3. CODE: Provide the corrected or enhanced code snippet.

#############
# RESPONSE: HTML #
Show your response in HTML format with the following structure:

<code>
# REASON: Short description of the reason for the error or the area for improvement using less than 100 tokens.
# PROPOSAL: High-level description of the approach to fix the error or enhance the code using less than 100 tokens.
# CODE: Corrected or improved code snippet to fix the error or enhance functionality
... your code starts here ...
</code>
```

Extended Fig. 6: Prompt for the Self-reflection method in code debugging and improvement.

Prompt for computing semantic lines
Decompose the code string into essential operations required to describe the steps needed to go from raw data to the final
outcome.

# RESPONSE FORMAT #
You are required to output a JSON object containing a list of operation descriptions. Each list item should be a string concisely
describing a single operation.
For example:
```json
{{
"operations": [
"operation 1",
"operation 2",
...
"operation n"
]
}}
```

# EXAMPLE #
An example decomposition of the code string is provided below:
```
import numpy as np
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd

X_filtered = merged_df.drop(["sample_type_id", "sample_type", "_primary_disease"], axis=1)

X_embedded = TSNE(n_components=2, learning_rate='auto',


init='random', perplexity=3).fit_transform(X_filtered)
X_embedded.shape

tsne = pd.DataFrame(X_embedded, columns = ["tsne1", "tsne2"])


tsne = pd.concat([tsne, merged_df["_primary_disease"].reset_index(drop=True)], axis = 1, sort = False)
tsne = tsne.sort_values(by = "_primary_disease")

figx = px.scatter(
tsne,
x="tsne1",
y="tsne2",
color="_primary_disease",
hover_name="_primary_disease",
width=970,
height=500,
template="ggplot2",
color_discrete_sequence= px.colors.qualitative.Alphabet,
size_max=0.1,
)

figx.show()
```

```
{{
"operations": [
"import numpy as np",
"import TSNE from sklearn.manifold",
"import plotly.express as px",
"import pandas as pd",
"drop columns `sample_type_id`, `sample_type`, and `_primary_disease` from merged_df to create X_filtered",
"apply TSNE to X_filtered with 2 components, auto learning rate, random initialization, and perplexity of 3 to get
X_embedded",
"create a DataFrame tsne with columns `tsne1` and `tsne2` from X_embedded",
"concatenate the `_primary_disease` column from merged_df to tsne DataFrame",
"sort the tsne DataFrame by the `_primary_disease` column",
"create a scatter plot using Plotly Express with `tsne1` and `tsne2` as axes, coloring by `_primary_disease`, and
customizing the plot appearance",
"display the scatter plot"
]
}}
```

# CODE STRING #
Decompose the following based on the instructions above:
```
{code}
```

Extended Fig. 7: Prompt for extracting and computing the number of semantic lines for
estimating coding task difficulty.

References
[1] Radenkovic, D., Keogh, S. B. & Maruthappu, M. Data science in modern evidence-
based medicine. Journal of the Royal Society of Medicine 112, 493–494 (2019).
[2] Champagne, D., Devereson, A., Pérez, L. & Saunders, D. Creating value from next-generation real-world evidence. https://www.mckinsey.com/industries/life-sciences/our-insights/creating-value-from-next-generation-real-world-evidence. Accessed: 2024-09-16.
[3] Ellis, L. D. To meet future needs, health care leaders must look at the data (science). https://www.hsph.harvard.edu/ecpe/to-meet-future-needs-health-care-leaders-must-look-at-the-data-science/. Accessed: 2024-09-16.
[4] Meyer, M. A. Healthcare data scientist qualifications, skills, and job focus: a content
analysis of job postings. Journal of the American Medical Informatics Association
26, 383–391 (2019).
[5] Chen, M. et al. Evaluating large language models trained on code. arXiv preprint
arXiv:2107.03374 (2021).
[6] Li, Y. et al. Competition-level code generation with alphacode. Science 378, 1092–
1097 (2022).
[7] Luo, Z. et al. Wizardcoder: Empowering code large language models with evol-
instruct. In The Twelfth International Conference on Learning Representations
(2023).
[8] Lozhkov, A. et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint
arXiv:2402.19173 (2024).
[9] Zhang, F. et al. Repocoder: Repository-level code completion through iterative
retrieval and generation. arXiv preprint arXiv:2303.12570 (2023).
[10] Parvez, M. R., Ahmad, W., Chakraborty, S., Ray, B. & Chang, K.-W. Retrieval
augmented code generation and summarization. In Findings of the Association for
Computational Linguistics: EMNLP 2021, 2719–2734 (2021).
[11] Wang, Z. Z. et al. CodeRAG-Bench: Can retrieval augment code generation? arXiv
preprint arXiv:2406.14497 (2024).
[12] Chen, X., Lin, M., Schärli, N. & Zhou, D. Teaching large language models to self-
debug. In The Twelfth International Conference on Learning Representations (2024).
[13] Wang, X. et al. OpenDevin: An open platform for ai software developers as generalist
agents. arXiv preprint arXiv:2407.16741 (2024).
[14] Zheng, T. et al. Opencodeinterpreter: Integrating code generation with execution
and refinement. arXiv preprint arXiv:2402.14658 (2024).
[15] Austin, J. et al. Program synthesis with large language models. arXiv preprint
arXiv:2108.07732 (2021).
[16] Hendrycks, D. et al. Measuring coding challenge competence with apps. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2021).
[17] Liu, J., Xia, C. S., Wang, Y. & Zhang, L. Is your code generated by chatgpt really
correct? rigorous evaluation of large language models for code generation. Advances
in Neural Information Processing Systems 36 (2024).
[18] Jimenez, C. E. et al. Swe-bench: Can language models resolve real-world github
issues? In The Twelfth International Conference on Learning Representations (2024).
[19] Huang, J. et al. Execution-based evaluation for data science code generation models.
In Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop
(Language Advances), 28–36 (2022).

[20] Lai, Y. et al. DS-1000: A natural and reliable benchmark for data science code
generation. In International Conference on Machine Learning, 18319–18345 (PMLR,
2023).
[21] Tayebi Arasteh, S. et al. Large language models streamline automated machine
learning for clinical studies. Nature Communications 15, 1603 (2024).
[22] Tang, X. et al. Biocoder: a benchmark for bioinformatics code generation with large
language models. Bioinformatics 40, i266–i276 (2024).
[23] cbioportal for cancer genomics. https://www.cbioportal.org/datasets. Accessed:
2024-09-17.
[24] Hello gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2024-09-17.
[25] Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed: 2024-09-17.
[26] Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet. Ac-
cessed: 2024-09-17.
[27] Introducing the next generation of claude. https://www.anthropic.com/news/claude-3-family. Accessed: 2024-09-17.
[28] Reid, M. et al. Gemini 1.5: Unlocking multimodal understanding across millions of
tokens of context. arXiv preprint arXiv:2403.05530 (2024).
[29] Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models.
Advances in Neural Information Processing Systems 35, 24824–24837 (2022).
[30] Brown, T. B. et al. Language models are few-shot learners. In Proceedings of the
34th International Conference on Neural Information Processing Systems, NIPS ’20
(Curran Associates Inc., Red Hook, NY, USA, 2020).
[31] Khattab, O. et al. Dspy: Compiling declarative language model calls into self-
improving pipelines. arXiv preprint arXiv:2310.03714 (2023).
[32] Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.
Advances in Neural Information Processing Systems 33, 9459–9474 (2020).
[33] Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective
clinical sequencing of 10,000 patients. Nature Medicine 23, 703–713 (2017).
[34] Welch, J. S. et al. Tp53 and decitabine in acute myeloid leukemia and myelodysplastic
syndromes. New England Journal of Medicine 375, 2023–2036 (2016).
[35] Mostavi, M., Chiu, Y.-C., Huang, Y. & Chen, Y. Convolutional neural network
models for cancer type prediction based on gene expression. BMC Medical Genomics
13, 1–13 (2020).
[36] Yen, P.-Y., Wantland, D. & Bakken, S. Development of a customizable health it
usability evaluation scale. In AMIA Annual Symposium Proceedings, vol. 2010, 917
(American Medical Informatics Association, 2010).
[37] Wang, Z. et al. Accelerating clinical evidence synthesis with large language models.
arXiv preprint arXiv:2406.17755 (2024).
[38] Lin, J., Xu, H., Wang, Z., Wang, S. & Sun, J. Panacea: A foundation model
for clinical trial search, summarization, design, and recruitment. medRxiv 2024–06
(2024).
[39] Jin, Q. et al. Matching patients to clinical trials with large language models. ArXiv
(2023).
[40] Jupyter. https://jupyter.org/. Accessed: 2024-09-23.
[41] Openai codex. https://openai.com/index/openai-codex/. Accessed: 2024-09-23.
[42] Nijkamp, E. et al. A conversational paradigm for program synthesis. arXiv preprint
arXiv:2203.13474 30 (2022).
[43] Fried, D. et al. Incoder: A generative model for code infilling and synthesis. In The
Eleventh International Conference on Learning Representations (2023).

[44] Van Veen, D. et al. Adapted large language models can outperform medical experts
in clinical text summarization. Nature Medicine 30, 1134–1142 (2024).
[45] Nie, F., Chen, M., Zhang, Z. & Cheng, X. Improving few-shot performance of
language models via nearest neighbor calibration. arXiv preprint arXiv:2212.02216
(2022).
[46] New embedding models and api updates. https://openai.com/index/new-embedding-models-and-api-updates/. Accessed: 2024-09-23.
[47] Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E. & Singh, S. Autoprompt: Eliciting
knowledge from language models with automatically generated prompts. In Proceed-
ings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP) (Association for Computational Linguistics, 2020).
[48] Vertex ai search. https://cloud.google.com/enterprise-search?hl=en. Ac-
cessed: 2024-09-23.
[49] Madaan, A. et al. Self-refine: Iterative refinement with self-feedback. Advances in
Neural Information Processing Systems 36 (2024).

