C3M1 - Assignment: 1 Estimating Treatment Effect Using Machine Learning

This document describes an assignment analyzing data from a randomized controlled trial (RCT) comparing the effects of Levamisole and Fluorouracil drug treatment versus chemotherapy alone on colon cancer patients. The document outlines calculating basic statistics from the dataset, including the treatment probability and empirical 5-year death probabilities for the treated and untreated groups. Exercises are provided to have the student implement functions to calculate these values. Packages for machine learning, statistics, and data processing are also imported for use in the assignment analysis.

Uploaded by

Sarah Mendes

C3M1_Assignment

September 13, 2020

1 Estimating Treatment Effect Using Machine Learning

Welcome to the first assignment of AI for Medical Treatment!
You will be using different methods to evaluate the results of a randomized controlled trial (RCT).
You will learn:
• How to analyze data from a randomized controlled trial using both:
  – traditional statistical methods
  – more recent machine learning techniques
• Interpreting Multivariate Models
  – Quantifying treatment effect
  – Calculating baseline risk
  – Calculating predicted risk reduction
• Evaluating Treatment Effect Models
  – Comparing predicted and empirical risk reductions
  – Computing C-statistic-for-benefit
• Interpreting ML models for Treatment Effect Estimation
  – Implementing a T-learner

1.1 Packages
We’ll first import all the packages that we need for this assignment.

• pandas is what we’ll use to manipulate our data
• numpy is a library for mathematical and scientific operations
• matplotlib is a plotting library
• sklearn contains many efficient tools for machine learning and statistical modeling
• random allows us to generate random numbers in Python
• lifelines is an open-source survival analysis library that implements the c-statistic
• itertools will help us with the hyperparameter search

1.2 Import Packages

Run the next cell to import all the necessary packages, dependencies and custom util functions.

In [1]: import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import sklearn
import random
import lifelines
import itertools

plt.rcParams['figure.figsize'] = [10, 7]

## 1 Dataset
### 1.1 Why RCT?

In this assignment, we’ll be examining data from an RCT, measuring the effect of a particular
drug combination on colon cancer. Specifically, we’ll be looking at the effect of Levamisole and
Fluorouracil on patients who have had surgery to remove their colon cancer. After surgery, the
curability of the patient depends on the remaining residual cancer. In this study, it was found that
this particular drug combination had a clear beneficial effect when compared with chemotherapy.

### 1.2 Data Processing
In this first section, we will load in the dataset and calculate basic statistics. Run the next
cell to load the dataset. We also do some preprocessing to convert categorical features to
one-hot representations.

In [2]: data = pd.read_csv("levamisole_data.csv", index_col=0)

Let’s look at our data to familiarize ourselves with the various fields.

In [3]: print(f"Data Dimensions: {data.shape}")

data.head()

Data Dimensions: (607, 14)

Out[3]:
   sex  age  obstruct  perfor  adhere  nodes  node4  outcome  TRTMT  \
1    1   43         0       0       0    5.0      1        1   True
2    1   63         0       0       0    1.0      0        0   True
3    0   71         0       0       1    7.0      1        1  False
4    0   66         1       0       0    6.0      1        1   True
5    1   69         0       0       0   22.0      1        1  False

   differ_2.0  differ_3.0  extent_2  extent_3  extent_4
1           1           0         0         1         0
2           1           0         0         1         0
3           1           0         1         0         0
4           1           0         0         1         0
5           1           0         0         1         0

Below is a description of all the fields (one-hot means a different field for each level):
• sex (binary): 1 if Male, 0 otherwise
• age (int): age of patient at start of the study
• obstruct (binary): obstruction of colon by tumor
• perfor (binary): perforation of colon
• adhere (binary): adherence to nearby organs
• nodes (int): number of lymph nodes with detectable cancer
• node4 (binary): more than 4 positive lymph nodes
• outcome (binary): 1 if died within 5 years
• TRTMT (binary): treated with levamisole + fluorouracil
• differ (one-hot): differentiation of tumor
• extent (one-hot): extent of local spread
In particular pay attention to the TRTMT and outcome columns. Our primary endpoint for our
analysis will be the 5-year survival rate, which is captured in the outcome variable.
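The one-hot fields described above can be illustrated with a small sketch. The raw frame below is hypothetical (the original un-encoded data is not shown here), and the use of `drop_first=True` is an assumption inferred from the column names, since no `differ_1.0` or `extent_1` column appears in the loaded data.

```python
import pandas as pd

# Hypothetical raw records: 'differ' and 'extent' hold categorical codes.
raw = pd.DataFrame({
    "age": [43, 63, 71, 66],
    "differ": [2.0, 1.0, 3.0, 2.0],
    "extent": [3, 1, 2, 4],
})

# One indicator column per level. drop_first=True omits the lowest level of
# each feature, which then acts as the implicit reference category.
encoded = pd.get_dummies(raw, columns=["differ", "extent"],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

A row whose indicators are all zero for a feature belongs to that feature's dropped reference level.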
### Exercise 01
Since this is an RCT, the treatment column is randomized. Let’s warm up by finding what the
treatment probability is.
$$p_{\text{treatment}} = \frac{n_{\text{treatment}}}{n}$$

• $n_{\text{treatment}}$ is the number of patients where TRTMT = True
• $n$ is the total number of patients.

In [4]: # UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

def proportion_treated(df):
    """
    Compute proportion of trial participants who have been treated

    Args:
        df (dataframe): dataframe containing trial results. Column
                        'TRTMT' is 1 if patient was treated, 0 otherwise.

    Returns:
        result (float): proportion of patients who were treated
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # Count treated rows and divide by the total number of rows
    proportion = len(df[(df.TRTMT == 1)]) / len(df.TRTMT)

    ### END CODE HERE ###

    return proportion
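As a cross-check on the formula for the treatment probability, the same quantity can be obtained as the mean of the treatment indicator, because the mean of a 0/1 column is the fraction of ones. This is a minimal sketch; the function name `proportion_treated_mean` is hypothetical and not part of the assignment.

```python
import pandas as pd

def proportion_treated_mean(df):
    """Proportion treated, computed as the mean of the TRTMT indicator."""
    # astype(bool) accepts both 0/1 and True/False encodings of TRTMT.
    return float(df["TRTMT"].astype(bool).mean())

example_df = pd.DataFrame({"outcome": [0, 1, 1, 1], "TRTMT": [0, 1, 1, 1]})
print(proportion_treated_mean(example_df))  # 0.75
```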

Test Case

In [5]: print("dataframe:\n")
example_df = pd.DataFrame(data=[[0, 0],
                                [1, 1],
                                [1, 1],
                                [1, 1]], columns=['outcome', 'TRTMT'])
print(example_df)
print("\n")
treated_proportion = proportion_treated(example_df)
print(f"Proportion of patient treated: computed {treated_proportion}, expected: 0.75")

dataframe:

outcome TRTMT
0 0 0
1 1 1
2 1 1
3 1 1

Proportion of patient treated: computed 0.75, expected: 0.75

Next let’s run it on our trial data.

In [6]: p = proportion_treated(data)
print(f"Proportion Treated: {p} ~ {int(p*100)}%")

Proportion Treated: 0.49093904448105435 ~ 49%

### Exercise 02
Next, we can get a preliminary sense of the results by computing the empirical 5-year death
probability for the treated arm versus the control arm.
The probability of dying for patients who received the treatment is:

$$p_{\text{treatment, death}} = \frac{n_{\text{treatment, death}}}{n_{\text{treatment}}}$$

• $n_{\text{treatment, death}}$ is the number of patients who received the treatment and died.
• $n_{\text{treatment}}$ is the number of patients who received treatment.

The probability of dying for patients in the control group (who did not receive treatment) is:

$$p_{\text{control, death}} = \frac{n_{\text{control, death}}}{n_{\text{control}}}$$

• $n_{\text{control, death}}$ is the number of patients in the control group (did not receive the treatment) who died.
• $n_{\text{control}}$ is the number of patients in the control group (did not receive treatment).

In [7]: # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)

def event_rate(df):
    '''
    Compute empirical rate of death within 5 years
    for treated and untreated groups.

    Args:
        df (dataframe): dataframe containing trial results.
                        'TRTMT' column is 1 if patient was treated, 0 otherwise.
                        'outcome' column is 1 if patient died within 5 years, 0 otherwise.

    Returns:
        treated_prob (float): empirical probability of death given treatment
        control_prob (float): empirical probability of death given control
    '''

    treated_prob = 0.0
    control_prob = 0.0

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # Deaths among treated patients divided by the number of treated patients
    treated_prob = len(df[(df.TRTMT == 1) & (df.outcome == 1)]) / len(df[df.TRTMT == 1])

    # Deaths among control patients divided by the number of control patients
    control_prob = len(df[(df.TRTMT == 0) & (df.outcome == 1)]) / len(df[df.TRTMT == 0])

    ### END CODE HERE ###

    return treated_prob, control_prob
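The two per-arm death rates can also be computed in one pass by grouping on the treatment indicator and averaging the outcome. This is a sketch of an alternative, not the assignment's reference solution; the name `event_rate_groupby` is hypothetical.

```python
import pandas as pd

def event_rate_groupby(df):
    """Return (treated_rate, control_rate) of death using a single groupby."""
    # Cast TRTMT to int so the group keys are 0/1 whether the column
    # holds booleans or integers.
    rates = df.groupby(df["TRTMT"].astype(int))["outcome"].mean()
    return float(rates.loc[1]), float(rates.loc[0])

example_df = pd.DataFrame({
    "outcome": [0, 1, 1, 0, 1, 1, 1, 0],
    "TRTMT":   [1, 1, 1, 1, 0, 0, 0, 0],
})
print(event_rate_groupby(example_df))  # (0.5, 0.75)
```

Note that this version raises a `KeyError` if one of the arms is empty, whereas the filter-based version would raise a `ZeroDivisionError`.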

Test Case

In [8]: print("TEST CASE\ndataframe:\n")

example_df = pd.DataFrame(data=[[0, 1],
                                [1, 1],
                                [1, 1],
                                [0, 1],
                                [1, 0],
                                [1, 0],
                                [1, 0],
                                [0, 0]], columns=['outcome', 'TRTMT'])
#print("dataframe:\n")
print(example_df)
print("\n")
treated_prob, control_prob = event_rate(example_df)
print(f"Treated 5-year death rate, expected: 0.5, got: {treated_prob:.4f}")
print(f"Control 5-year death rate, expected: 0.75, got: {control_prob:.4f}")

TEST CASE
dataframe:

outcome TRTMT
0 0 1
1 1 1
2 1 1
3 0 1
4 1 0
5 1 0
6 1 0
7 0 0

Treated 5-year death rate, expected: 0.5, got: 0.5000

Control 5-year death rate, expected: 0.75, got: 0.7500

Now let’s try the function on the real data.

In [9]: treated_prob, control_prob = event_rate(data)

print(f"Death rate for treated patients: {treated_prob:.4f} ~ {int(treated_prob*100)}%")
print(f"Death rate for untreated patients: {control_prob:.4f} ~ {int(control_prob*100)}%")

Death rate for treated patients: 0.3725 ~ 37%

Death rate for untreated patients: 0.4822 ~ 48%

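On average, treatment appears to have had a positive effect: the 5-year death rate is about 11
percentage points lower in the treated arm (0.3725) than in the control arm (0.4822).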

Sanity checks: It’s important to compute these basic summary statistics as a sanity check for
more complex models later on. If those models strongly disagree with these robust summaries and
there isn’t a good reason, then there might be a bug.
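The gap between the two empirical death rates reported above can be summarized as an absolute risk reduction and a relative risk. The helper below is a minimal sketch: the name `risk_reduction` is hypothetical, and the inputs are the rounded rates from the printed output rather than values recomputed from the data.

```python
def risk_reduction(treated_prob, control_prob):
    """Summarize the gap between control and treated event rates."""
    arr = control_prob - treated_prob   # absolute risk reduction
    rr = treated_prob / control_prob    # relative risk (treated vs. control)
    return arr, rr

# Rounded empirical rates taken from the output above.
arr, rr = risk_reduction(0.3725, 0.4822)
print(f"Absolute risk reduction: {arr:.4f}")
print(f"Relative risk: {rr:.4f}")
```

These crude figures ignore patient covariates, which is why they serve as a baseline for checking the model-based estimates rather than as a replacement for them.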

1.2.1 Train test split

We’ll now try to quantify the impact more precisely using statistical models. Before we get started
fitting models to analyze the data, let’s split it using the train_test_split function from sklearn.
While a hold-out test set isn’t required for logistic regression, it will be useful for comparing its
performance to the ML models later on.

In [10]: # As usual, split into dev and test set

from sklearn.model_selection import train_test_split
np.random.seed(18)
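
A minimal sketch of how a split with `train_test_split` might look, using a synthetic frame in place of the trial data. The `test_size` and `random_state` values and all variable names here are assumptions for illustration, not the assignment's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trial data (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(30, 80, size=100),
    "TRTMT": rng.integers(0, 2, size=100).astype(bool),
    "outcome": rng.integers(0, 2, size=100),
})

# Separate the features (including the treatment flag) from the label.
X = toy.drop(columns="outcome")
y = toy["outcome"]

# test_size and random_state are illustrative choices.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_dev.shape, X_test.shape)
```

Keeping the `TRTMT` column among the features means a model fit on the dev set can later be queried with the treatment flag switched on or off.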
