Project 2

The document outlines a project to analyze a dataset related to patient appointment no-shows, consisting of 110,527 records and 14 columns. It poses various questions regarding factors such as gender, alcoholism, scholarship status, and age that may influence whether a patient shows up for their scheduled appointment. The document includes data wrangling steps, visualizations, and initial findings about the dataset's characteristics and patient demographics.

Project: Investigate a Dataset - [No-show]

Table of Contents

Introduction
Dataset Description:
We will analyze the No-show dataset, which contains 110,527 rows and 14 columns.

Question(s) for Analysis
Q1: Can Gender be considered a factor to predict if a patient will show up for their scheduled appointment?

Q2: Can Gender and Alcoholism be considered factors to predict if a patient will show up for their scheduled appointment?

Q3: Can Gender and Scholarship be considered factors to predict if a patient will show up for their scheduled appointment?

Q4: Can Gender and Handicap be considered factors to predict if a patient will show up for their scheduled appointment?

Q5: Can Gender and SMS_received be considered factors to predict if a patient will show up for their scheduled appointment?

Q6: Can Alcoholism and diseases (Hypertension, Diabetes) be considered factors to predict if a patient will show up for their scheduled appointment?

Q7: Can Age be considered a factor to predict if a patient will show up for their scheduled appointment?

Q8: Can Gender be considered a factor to predict if a patient will show up for their scheduled appointment?

Q9: Can Scholarship be considered a factor to predict if a patient will show up for their scheduled appointment?

Q10: Can Alcoholism be considered a factor to predict if a patient will show up for their scheduled appointment?

Q11: Can SMS_received be considered a factor to predict if a patient will show up for their scheduled appointment?

Q12: Can Handicap be considered a factor to predict if a patient will show up for their scheduled appointment?

Q13: Can Diabetes be considered a factor to predict if a patient will show up for their scheduled appointment?

Q14: Can Hypertension be considered a factor to predict if a patient will show up for their scheduled appointment?

Q15: Can the waiting period between the scheduled day and the appointment day be considered a factor to predict if a patient will show up for their scheduled appointment?

# Use this cell to set up import statements for all of the packages that you
# plan to use.
# Remember to include a 'magic word' so that your visualizations are plotted
# inline with the notebook. See this page for more:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Loading Data from CSV file


# load dataset
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
# show the first 5 rows
df.head()

      PatientId  AppointmentID Gender          ScheduledDay  \
0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z
1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z
2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z
3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z
4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z

         AppointmentDay  Age      Neighbourhood  Scholarship  Hipertension  \
0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1
1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0
2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0
3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0
4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1

   Diabetes  Alcoholism  Handcap  SMS_received No-show
0         0           0        0             0      No
1         0           0        0             0      No
2         0           0        0             0      No
3         0           0        0             0      No
4         1           0        0             0      No

At first glance, we can see that the column names are messy.
Data Wrangling
Tip: In this section of the report, you will load in the data, check for cleanliness, and
then trim and clean your dataset for analysis. Make sure that you document your data
cleaning steps in mark-down cells precisely and justify your cleaning decisions.

General Properties
Tip: You should not perform too many operations in each cell. Create cells freely to
explore your data. One option that you can take with this project is to do a lot of
explorations in an initial notebook. These don't have to be organized, but make sure
you use enough comments to understand the purpose of each code cell. Then, after
you're done with your analysis, create a duplicate notebook where you will trim the
excess and organize your steps so that you have a flowing, cohesive report.

I prefer to fix the column names first so that the later steps are easier, so I will rename the columns.
# Rename columns that are misspelled or awkward to handle
df = df.rename(columns={"ScheduledDay": "Scheduled_Day",
                        "AppointmentDay": "Appointment_Day",
                        "Hipertension": "Hypertension",
                        "Handcap": "Handicap",
                        "No-show": "No_show"})
# check that the columns were renamed
df.head()

      PatientId  AppointmentID Gender         Scheduled_Day  \
0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z
1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z
2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z
3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z
4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z

        Appointment_Day  Age      Neighbourhood  Scholarship  Hypertension  \
0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1
1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0
2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0
3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0
4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1

   Diabetes  Alcoholism  Handicap  SMS_received No_show
0         0           0         0             0      No
1         0           0         0             0      No
2         0           0         0             0      No
3         0           0         0             0      No
4         1           0         0             0      No

# number of rows and columns


df.shape

(110527, 14)

# show information for dataset


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PatientId 110527 non-null float64
1 AppointmentID 110527 non-null int64
2 Gender 110527 non-null object
3 Scheduled_Day 110527 non-null object
4 Appointment_Day 110527 non-null object
5 Age 110527 non-null int64
6 Neighbourhood 110527 non-null object
7 Scholarship 110527 non-null int64
8 Hypertension 110527 non-null int64
9 Diabetes 110527 non-null int64
10 Alcoholism 110527 non-null int64
11 Handicap 110527 non-null int64
12 SMS_received 110527 non-null int64
13 No_show 110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

General look of data


I prefer to visualize and explore the data before changing it, so that it is easier to understand.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))  # two pie charts side by side

# pie chart of the Gender counts (value_counts sorts by frequency, largest first)
sorted_counts = df.Gender.value_counts()
ax1.pie(sorted_counts, labels=['Female', 'Male'], startangle=0,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax1.axis('square')
ax1.set_title('Gender Distribution')  # title set on ax1, not on the last axes

# pie chart of the Scholarship counts
sorted_counts = df.Scholarship.value_counts()
ax2.pie(sorted_counts, labels=['No Scholarship', 'Scholarship'], startangle=200,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax2.axis('square')
ax2.set_title('Scholarship Distribution');

We can see that female patients outnumber male patients, and that few patients have a scholarship.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))

# pie chart of the Hypertension counts
sorted_counts = df.Hypertension.value_counts()
ax1.pie(sorted_counts, labels=['No Hypertension', 'Hypertension'], startangle=0,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax1.axis('square')
ax1.set_title('Hypertension Distribution')

# pie chart of the Diabetes counts
sorted_counts = df.Diabetes.value_counts()
ax2.pie(sorted_counts, labels=['No Diabetes', 'Diabetes'], startangle=200,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax2.axis('square')
ax2.set_title('Diabetes Distribution');
We can see that most patients have neither Hypertension nor Diabetes.
# pie chart of the Alcoholism counts
sorted_counts = df.Alcoholism.value_counts()
plt.pie(sorted_counts, labels=['No Alcoholism', 'Alcoholism'], startangle=200,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
plt.axis('square')
plt.title('Alcoholism Distribution');

We can see that most patients do not have Alcoholism.


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))

# pie chart of the SMS_received counts
sorted_counts = df.SMS_received.value_counts()
ax1.pie(sorted_counts, labels=['No SMS_received', 'SMS_received'], startangle=0,
        counterclock=False, autopct='%1.2f%%')
ax1.axis('square')
ax1.set_title('SMS_received Distribution')

# replace 'No' by 0 and 'Yes' by 1 (assign back instead of chained inplace replace)
df['No_show'] = df['No_show'].map({'No': 0, 'Yes': 1})
sorted_counts = df.No_show.value_counts()
ax2.pie(sorted_counts, labels=['show', 'No show'], startangle=200,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax2.axis('square')
ax2.set_title('Show Status');

The charts above show the share of patients who received an SMS and the share of patients who showed up for their appointment.
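To illustrate the No/Yes to 0/1 encoding used above, here is a minimal sketch on a toy column. The five values are invented for illustration and are not taken from the dataset. Once the column holds 0 and 1, its mean is the no-show fraction.

```python
import pandas as pd

# Hypothetical miniature of the No_show column (values invented for illustration).
toy = pd.DataFrame({"No_show": ["No", "Yes", "No", "No", "Yes"]})

# Map the labels to integers and assign the result back to the column.
toy["No_show"] = toy["No_show"].map({"No": 0, "Yes": 1})

# Any label missing from the mapping would become NaN, so count those.
unmapped = int(toy["No_show"].isna().sum())

# With a 0/1 column, the mean is the fraction of missed appointments.
no_show_rate = toy["No_show"].mean() * 100
print(unmapped, no_show_rate)
```

Checking for unmapped values is a cheap guard against labels that are spelled differently from what the mapping expects.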
df.Handicap.value_counts()

0 108286
1 2042
2 183
3 13
4 3
Name: Handicap, dtype: int64

df.nunique()

PatientId 62299
AppointmentID 110527
Gender 2
Scheduled_Day 103549
Appointment_Day 27
Age 104
Neighbourhood 81
Scholarship 2
Hypertension 2
Diabetes 2
Alcoholism 2
Handicap 5
SMS_received 2
No_show 2
dtype: int64

df.duplicated(['PatientId','No_show']).sum()

38710

df.duplicated(['PatientId','Gender','Appointment_Day','No_show']).sum()

7378
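The two counts above differ because the duplicate check depends on which columns are passed as the subset. A small sketch on invented rows (the toy frame below is hypothetical, not part of the dataset) shows how adding Appointment_Day to the subset narrows what counts as a duplicate:

```python
import pandas as pd

# Hypothetical toy data: patient 1 has three appointments, two on the same day.
toy = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2],
    "Appointment_Day": ["2016-05-02", "2016-05-02", "2016-05-03",
                        "2016-05-02", "2016-05-02"],
    "No_show": [0, 0, 0, 1, 0],
})

# Rows that repeat an earlier (PatientId, No_show) pair.
dup_patient_status = int(toy.duplicated(["PatientId", "No_show"]).sum())

# Rows that repeat an earlier (PatientId, Appointment_Day, No_show) triple.
dup_patient_day_status = int(
    toy.duplicated(["PatientId", "Appointment_Day", "No_show"]).sum()
)
print(dup_patient_status, dup_patient_day_status)
```

A repeat visit on a different day is counted by the first check but not by the second, which is why the stricter subset is the safer basis for dropping rows.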

df['Age'].describe()

count 110527.000000
mean 37.088874
std 23.110205
min -1.000000
25% 18.000000
50% 37.000000
75% 55.000000
max 115.000000
Name: Age, dtype: float64

ind = df.index[df['Age'] < 0]  # index labels of the rows with a negative age
df.loc[ind]                    # .loc, because ind holds labels, not positions

          PatientId  AppointmentID Gender         Scheduled_Day  \
99832  4.659432e+14        5775010      F  2016-06-06T08:58:13Z

            Appointment_Day  Age  Neighbourhood  Scholarship  Hypertension  \
99832  2016-06-06T00:00:00Z   -1          ROMÃO            0             0

       Diabetes  Alcoholism  Handicap  SMS_received  No_show
99832         0           0         0             0        0

# get the index labels of the rows with age == 115
ind = df.index[df['Age'] == 115]
df.loc[ind]  # .loc, because ind holds labels, not positions

          PatientId  AppointmentID Gender         Scheduled_Day  \
63912  3.196321e+13        5700278      F  2016-05-16T09:17:44Z
63915  3.196321e+13        5700279      F  2016-05-16T09:17:44Z
68127  3.196321e+13        5562812      F  2016-04-08T14:29:17Z
76284  3.196321e+13        5744037      F  2016-05-30T09:44:51Z
97666  7.482346e+14        5717451      F  2016-05-19T07:57:56Z

            Appointment_Day  Age  Neighbourhood  Scholarship  Hypertension  \
63912  2016-05-19T00:00:00Z  115     ANDORINHAS            0             0
63915  2016-05-19T00:00:00Z  115     ANDORINHAS            0             0
68127  2016-05-16T00:00:00Z  115     ANDORINHAS            0             0
76284  2016-05-30T00:00:00Z  115     ANDORINHAS            0             0
97666  2016-06-03T00:00:00Z  115       SÃO JOSÉ            0             1

       Diabetes  Alcoholism  Handicap  SMS_received  No_show
63912         0           0         1             0        1
63915         0           0         1             0        1
68127         0           0         1             0        1
76284         0           0         1             0        0
97666         0           0         0             1        0

(df['Appointment_Day'][63912], df['Appointment_Day'][63915],
 df['Appointment_Day'][68127], df['Appointment_Day'][76284])

('2016-05-19T00:00:00Z',
'2016-05-19T00:00:00Z',
'2016-05-16T00:00:00Z',
'2016-05-30T00:00:00Z')

Conclusion:
The dataset needs further changes before it is useful for analysis: rename some columns, change
some data types, drop duplicated samples, drop samples with invalid data such as Age = -1, and
update columns such as Handicap.

Data Cleaning
Tip: Make sure that you keep your reader informed on the steps that you are taking in
your investigation. Follow every code cell, or every set of related code cells, with a
markdown cell to describe to the reader what was found in the preceding cell(s). Try to
make it so that the reader can then understand what they will be seeing in the
following cell(s).

# drop the row with a -1 age


df.drop(df.query('Age < 0').index, inplace=True)
df.query('Age < 0')
Empty DataFrame
Columns: [PatientId, AppointmentID, Gender, Scheduled_Day,
Appointment_Day, Age, Neighbourhood, Scholarship, Hypertension,
Diabetes, Alcoholism, Handicap, SMS_received, No_show]
Index: []

Row was dropped successfully.

# if the value is greater than 1 change it to 1, otherwise keep it


df['Handicap'] = np.where(df['Handicap'] > 1, 1, df['Handicap'])
df.Handicap.value_counts()

0 108285
1 2241
Name: Handicap, dtype: int64
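The Handicap step above collapses the values 2, 3 and 4 into 1. As a minimal sketch on an invented Series, the same result can be obtained with np.where or with Series.clip, and the merged count can be checked against the value counts printed earlier (2042 + 183 + 13 + 3 = 2241):

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering every Handicap level seen in the data (0 to 4).
handicap = pd.Series([0, 1, 2, 3, 4, 0], name="Handicap")

# np.where: values above 1 become 1, everything else is kept as-is.
binary = pd.Series(np.where(handicap > 1, 1, handicap), name="Handicap")

# clip(upper=1) gives the same result for non-negative integer levels.
clipped = handicap.clip(upper=1)

# Counts for levels 1 to 4 reported by value_counts() before the change.
merged_count = sum([2042, 183, 13, 3])
print(binary.tolist(), merged_count)
```

Note that this treats Handicap as a yes/no flag and discards the level information, which is a modeling choice rather than a correction of bad data.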

df['Scheduled_Day'] = pd.to_datetime(df['Scheduled_Day'])      # convert dtype to datetime
df['Appointment_Day'] = pd.to_datetime(df['Appointment_Day'])  # convert dtype to datetime
# check again
print(df.dtypes)
df.head()

PatientId float64
AppointmentID int64
Gender object
Scheduled_Day datetime64[ns, UTC]
Appointment_Day datetime64[ns, UTC]
Age int64
Neighbourhood object
Scholarship int64
Hypertension int64
Diabetes int64
Alcoholism int64
Handicap int64
SMS_received int64
No_show int64
dtype: object

PatientId AppointmentID Gender Scheduled_Day \


0 2.987250e+13 5642903 F 2016-04-29 18:38:08+00:00
1 5.589978e+14 5642503 M 2016-04-29 16:08:27+00:00
2 4.262962e+12 5642549 F 2016-04-29 16:19:04+00:00
3 8.679512e+11 5642828 F 2016-04-29 17:29:31+00:00
4 8.841186e+12 5642494 F 2016-04-29 16:07:23+00:00

Appointment_Day Age Neighbourhood Scholarship \


0 2016-04-29 00:00:00+00:00 62 JARDIM DA PENHA 0
1 2016-04-29 00:00:00+00:00 56 JARDIM DA PENHA 0
2 2016-04-29 00:00:00+00:00 62 MATA DA PRAIA 0
3 2016-04-29 00:00:00+00:00 8 PONTAL DE CAMBURI 0
4 2016-04-29 00:00:00+00:00 56 JARDIM DA PENHA 0

Hypertension Diabetes Alcoholism Handicap SMS_received No_show

0 1 0 0 0 0 0

1 0 0 0 0 0 0

2 0 0 0 0 0 0

3 0 0 0 0 0 0

4 1 1 0 0 0 0

Data type changed successfully
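As a small illustration of what the conversion does, the sketch below parses two timestamp strings in the same format as the dataset (the first pair is taken from row 0 shown above). The trailing Z marks UTC, so the parsed values are timezone-aware:

```python
import pandas as pd

# Two strings in the dataset's ISO 8601 format (trailing "Z" means UTC).
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-04-29T00:00:00Z"])

# Parse the strings into timezone-aware datetimes.
parsed = pd.to_datetime(raw)

# After parsing, date parts are available through the .dt accessor.
hours = parsed.dt.hour.tolist()
days = parsed.dt.day.tolist()
print(parsed.dt.tz, hours, days)
```

The Scheduled_Day values carry a time of day while the Appointment_Day values are all at midnight, which matters when the two columns are subtracted.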

# check the appointments days


df['Appointment_Day'].max()- df['Appointment_Day'].min()

Timedelta('40 days 00:00:00')

# There are 7378 duplicate cases with the same patient and the same
# Appointment_Day; we will drop them.
df.drop_duplicates(['PatientId', 'Gender', 'Appointment_Day', 'No_show'],
                   inplace=True)
# ensure the duplicated rows were dropped
df.shape

(103148, 14)

# Drop columns that we do not need in the analysis
df.drop(columns=['AppointmentID', 'PatientId'], inplace=True)
df.head()

  Gender             Scheduled_Day           Appointment_Day  Age  \
0      F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00   62
1      M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00   56
2      F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00   62
3      F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00    8
4      F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00   56

       Neighbourhood  Scholarship  Hypertension  Diabetes  Alcoholism  \
0    JARDIM DA PENHA            0             1         0           0
1    JARDIM DA PENHA            0             0         0           0
2      MATA DA PRAIA            0             0         0           0
3  PONTAL DE CAMBURI            0             0         0           0
4    JARDIM DA PENHA            0             1         1           0

   Handicap  SMS_received  No_show
0         0             0        0
1         0             0        0
2         0             0        0
3         0             0        0
4         0             0        0

Exploratory Data Analysis


Tip: Now that you've trimmed and cleaned your data, you're ready to move on to
exploration. Compute statistics and create visualizations with the goal of addressing
the research questions that you posed in the Introduction section. You should
compute the relevant statistics throughout the analysis when an inference is made
about the data. Note that at least two or more kinds of plots should be created as part
of the exploration, and you must compare and show trends in the varied visualizations.
Tip: - Investigate the stated question(s) from multiple angles. It is recommended that
you be systematic with your approach. Look at one variable at a time, and then follow
it up by looking at relationships between variables. You should explore at least three
variables in relation to the primary question. This can be an exploratory relationship
between three variables of interest, or looking at how two independent variables
relate to a single dependent variable of interest. Lastly, you should perform both
single-variable (1d) and multiple-variable (2d) explorations.

df.hist(figsize=(18,15));
Research Question 1 (Can Gender be considered a factor to predict if a patient will show up for their scheduled appointment?)
# percentages of no-show patients based on gender
# (select No_show before taking the mean so non-numeric columns are not averaged)
no_show_perc_gender = df.groupby('Gender')['No_show'].mean() * 100
# plot a bar chart
no_show_perc_gender.plot.bar();
plt.title('The Percentages of No Show Patients Based on Gender')
plt.xticks([0, 1], ['Female', 'Male'])
plt.ylabel('No Show Percentage');
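The percentage computed above relies on No_show being coded as 0/1, so that the group mean is the fraction of missed appointments. A minimal sketch on invented rows (not dataset values) makes the arithmetic explicit:

```python
import pandas as pd

# Hypothetical toy data: 4 female patients (1 no-show), 2 male patients (1 no-show).
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M"],
    "No_show": [0, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = fraction of no-shows; times 100 = percentage.
rate_by_gender = toy.groupby("Gender")["No_show"].mean() * 100
print(rate_by_gender)
```

Here the female rate is 1 of 4 and the male rate is 1 of 2. Comparing rates rather than raw counts keeps the larger group from dominating the picture.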
Research Question 2 (Can Gender and Alcoholism be considered factors to predict if a patient will show up for their scheduled appointment?)
# percentages of no-show patients based on gender and Alcoholism
# (select No_show before taking the mean so non-numeric columns are not averaged)
no_show_perc_gender_Alcoholism = df.groupby(['Gender', 'Alcoholism'])['No_show'].mean() * 100
# plot a bar chart
no_show_perc_gender_Alcoholism.plot.bar();
plt.title('The Percentages of No Show Patients Based on Gender And Alcoholism')
plt.ylabel('No Show Percentage');
Research Question 3 (Can Gender and Scholarship be considered factors to predict if a patient will show up for their scheduled appointment?)
# percentages of no-show patients based on gender and Scholarship
# (select No_show before taking the mean so non-numeric columns are not averaged)
no_show_perc_gender_Scholarship = df.groupby(['Gender', 'Scholarship'])['No_show'].mean() * 100
# plot a bar chart
no_show_perc_gender_Scholarship.plot.bar();
plt.title('The Percentages of No Show Patients Based on Gender And Scholarship')
plt.ylabel('No Show Percentage');
Research Question 4 (Can Gender and Handicap be considered factors to predict if a patient will show up for their scheduled appointment?)
# percentages of no-show patients based on gender and Handicap
# (select No_show before taking the mean so non-numeric columns are not averaged)
no_show_perc_gender_Handicap = df.groupby(['Gender', 'Handicap'])['No_show'].mean() * 100
# plot a bar chart
no_show_perc_gender_Handicap.plot.bar();
plt.title('The Percentages of No Show Patients Based on Gender And Handicap')
plt.ylabel('No Show Percentage');
Research Question 5 (Can Gender and SMS_received be considered factors to predict if a patient will show up for their scheduled appointment?)
# percentages of no-show patients based on gender and SMS_received
# (select No_show before taking the mean so non-numeric columns are not averaged)
no_show_perc_gender_SMS_received = df.groupby(['Gender', 'SMS_received'])['No_show'].mean() * 100
# plot a bar chart
no_show_perc_gender_SMS_received.plot.bar();
plt.title('The Percentages of No Show Patients Based on Gender And SMS_received')
plt.ylabel('No Show Percentage');
Research Question 6 (Can Alcoholism and diseases be considered factors to predict if a patient will show up for their scheduled appointment?)
# percentages of no-show patients based on Alcoholism and diseases
no_show_perc_Alcoholism_diseases = df.groupby(['Alcoholism', 'Hypertension', 'Diabetes']).No_show.mean() * 100
# plot a bar chart
no_show_perc_Alcoholism_diseases.plot.bar();
plt.title('The Percentages of No Show Patients Based on Alcoholism and diseases')
plt.ylabel('No Show Percentage');
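When grouping by several factors at once, some combinations may contain very few patients, and a percentage from a tiny group is easy to over-read. One way to keep that visible, sketched here on invented rows with only two of the grouping columns, is to report the group size next to the rate:

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy = pd.DataFrame({
    "Alcoholism":   [0, 0, 0, 0, 1, 1],
    "Hypertension": [0, 0, 1, 1, 1, 1],
    "No_show":      [0, 1, 0, 0, 1, 0],
})

# For each combination, compute the no-show fraction and the number of rows.
summary = toy.groupby(["Alcoholism", "Hypertension"])["No_show"].agg(["mean", "count"])

# Convert the fraction to a percentage for readability.
summary["no_show_pct"] = summary["mean"] * 100
print(summary)
```

Only combinations that actually occur appear in the result, so the number of rows in the summary also shows which combinations are missing.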
Research Question 7 (Can Age be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Age')
    plt.xlabel('Age')
    plt.ylabel('Patient Number');

age_attendance(df, 'Age')
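The histogram above shows patient counts per age range. To look at no-show percentages per age group instead, one option is to bin the ages with pd.cut and take the group mean. The bin edges, labels and toy values below are assumptions chosen for illustration, not something the report defines:

```python
import pandas as pd

# Hypothetical toy data (ages and outcomes invented for illustration).
toy = pd.DataFrame({
    "Age":     [5, 12, 25, 31, 47, 52, 68, 80],
    "No_show": [0, 1, 1, 0, 0, 0, 0, 1],
})

# Assumed age groups; right=False makes each bin include its left edge only.
bins = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
toy["age_group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# No-show percentage within each age group.
rate_by_age = toy.groupby("age_group", observed=True)["No_show"].mean() * 100
print(rate_by_age)
```

The upper edge of 120 is chosen to cover the maximum age of 115 seen in the data.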
Research Question 8 (Can Gender be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Gender')
    plt.xlabel('Gender')
    plt.ylabel('Patient Number');

age_attendance(df, 'Gender')

Research Question 9 (Can Scholarship be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Scholarship')
    plt.xlabel('Scholarship')
    plt.ylabel('Patient Number');

age_attendance(df, 'Scholarship')

Research Question 10 (Can Alcoholism be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Alcoholism')
    plt.xlabel('Alcoholism')
    plt.ylabel('Patient Number');

age_attendance(df, 'Alcoholism')
Research Question 11 (Can SMS_received be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on SMS_received')
    plt.xlabel('SMS_received')
    plt.ylabel('Patient Number');

age_attendance(df, 'SMS_received')

Research Question 12 (Can Handicap be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Handicap')
    plt.xlabel('Handicap')
    plt.ylabel('Patient Number');

age_attendance(df, 'Handicap')

Research Question 13 (Can Diabetes be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Diabetes')
    plt.xlabel('Diabetes')
    plt.ylabel('Patient Number');

age_attendance(df, 'Diabetes')
Research Question 14 (Can Hypertension be considered a factor to predict if a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Hypertension')
    plt.xlabel('Hypertension')
    plt.ylabel('Patient Number');

age_attendance(df, 'Hypertension')
Research Question 15 (Can the waiting period between the scheduled day and the appointment day be considered a factor to predict if a patient will show up for their scheduled appointment?)
# calculate days between scheduled_day and appointment_day
days_between = (df['Appointment_Day'] - df['Scheduled_Day']).dt.days
# insert a new column (days_between) before column 3
df.insert(3, 'days_between', days_between)
df.head()

  Gender             Scheduled_Day           Appointment_Day  days_between  \
0      F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00            -1
1      M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00            -1
2      F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00            -1
3      F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00            -1
4      F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00            -1

   Age      Neighbourhood  Scholarship  Hypertension  Diabetes  Alcoholism  \
0   62    JARDIM DA PENHA            0             1         0           0
1   56    JARDIM DA PENHA            0             0         0           0
2   62      MATA DA PRAIA            0             0         0           0
3    8  PONTAL DE CAMBURI            0             0         0           0
4   56    JARDIM DA PENHA            0             1         1           0

   Handicap  SMS_received  No_show
0         0             0        0
1         0             0        0
2         0             0        0
3         0             0        0
4         0             0        0

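# drop the rows whose days_between is negative
# note: appointments booked on the appointment day itself appear as -1 here,
# because Scheduled_Day keeps the time of day while Appointment_Day is at
# midnight, so those rows are dropped as well
negative_days = df.query('days_between < 0')
df.drop(negative_days.index, inplace=True)
df.query('days_between < 0')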

Empty DataFrame
Columns: [Gender, Scheduled_Day, Appointment_Day, days_between, Age,
Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism,
Handicap, SMS_received, No_show]
Index: []
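The -1 values in days_between come from subtracting a timestamp with a time of day from a midnight timestamp on the same date. As a sketch of an alternative, not the method used in this report, both columns can be truncated to midnight with dt.normalize() before subtracting, so that a same-day booking counts as 0 days. The first pair of timestamps is row 0 from the table above; the second pair is invented:

```python
import pandas as pd

# Row 0 of the dataset, plus one invented booking made two days ahead.
scheduled = pd.to_datetime(pd.Series(["2016-04-29T18:38:08Z", "2016-04-27T09:00:00Z"]))
appointment = pd.to_datetime(pd.Series(["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"]))

# Direct subtraction: the time of day pushes a same-day booking to -1.
raw_days = (appointment - scheduled).dt.days

# Truncate both sides to midnight first, then subtract calendar dates.
calendar_days = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days
print(raw_days.tolist(), calendar_days.tolist())
```

With calendar dates, a remaining negative value would mean the appointment day precedes the scheduling day, which is the case the negative-value filter is presumably meant to catch.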

show = df.No_show == 0    # mask of patients who showed up (No_show == 0)
noshow = df.No_show == 1  # mask of patients who did not show up (No_show == 1)

# function that takes a DataFrame and a column name
def age_attendance(df, col_name):
    # overlaid histograms of the column for show and no-show patients
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No Show Patients Based on Waiting Days')
    plt.xlabel('days_between')
    plt.ylabel('Patient Number');

age_attendance(df, 'days_between')

Conclusions
Limitation:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Untitled2.ipynb'])

4294967295
