Project 2

The document outlines a project to analyze a dataset related to patient appointment no-shows, consisting of 110,527 records and 14 columns. It poses various questions regarding factors such as gender, alcoholism, scholarship status, and age that may influence whether a patient shows up for their scheduled appointment. The document includes data wrangling steps, visualizations, and initial findings about the dataset's characteristics and patient demographics.

Uploaded by

محمود بطران


Project: Investigate a Dataset - [No-show]

Table of Contents

Introduction
Dataset Description:
We will analyze the No-show appointments dataset, which contains 110,527 rows and 14 columns.

Question(s) for Analysis
Q1: Can Gender be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q2: Can Gender and Alcoholism be considered factors to predict whether a patient will show up for their scheduled appointment?
Q3: Can Gender and Scholarship be considered factors to predict whether a patient will show up for their scheduled appointment?
Q4: Can Gender and Handicap be considered factors to predict whether a patient will show up for their scheduled appointment?
Q5: Can Gender and SMS_received be considered factors to predict whether a patient will show up for their scheduled appointment?
Q6: Can Alcoholism and diseases (Hypertension, Diabetes) be considered factors to predict whether a patient will show up for their scheduled appointment?
Q7: Can Age be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q8: Can Gender be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q9: Can Scholarship be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q10: Can Alcoholism be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q11: Can SMS_received be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q12: Can Handicap be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q13: Can Diabetes be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q14: Can Hypertension be considered a factor to predict whether a patient will show up for their scheduled appointment?
Q15: Can the waiting period between the scheduled day and the appointment day be considered a factor to predict whether a patient will show up for their scheduled appointment?

# Use this cell to set up import statements for all of the packages that you
# plan to use.
# Remember to include a 'magic word' so that your visualizations are plotted
# inline with the notebook. See this page for more:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Loading Data from CSV file

# load dataset
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
# show the first 5 rows
df.head()

      PatientId  AppointmentID Gender          ScheduledDay  \
0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z
1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z
2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z
3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z
4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z

         AppointmentDay  Age      Neighbourhood  Scholarship  Hipertension  \
0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1
1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0
2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0
3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0
4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1

   Diabetes  Alcoholism  Handcap  SMS_received No-show
0         0           0        0             0      No
1         0           0        0             0      No
2         0           0        0             0      No
3         0           0        0             0      No
4         1           0        0             0      No

At first glance, the column names are inconsistent and some of them are misspelled.
Data Wrangling
Tip: In this section of the report, you will load in the data, check for cleanliness, and
then trim and clean your dataset for analysis. Make sure that you document your data
cleaning steps in mark-down cells precisely and justify your cleaning decisions.

General Properties
Tip: You should not perform too many operations in each cell. Create cells freely to
explore your data. One option that you can take with this project is to do a lot of
explorations in an initial notebook. These don't have to be organized, but make sure
you use enough comments to understand the purpose of each code cell. Then, after
you're done with your analysis, create a duplicate notebook where you will trim the
excess and organize your steps so that you have a flowing, cohesive report.

I prefer to rename the columns first so that the later steps are easier to write.
# Rename columns that are misspelled or awkward to work with
df = df.rename(columns={"ScheduledDay": "Scheduled_Day",
                        "AppointmentDay": "Appointment_Day",
                        "Hipertension": "Hypertension",
                        "Handcap": "Handicap",
                        "No-show": "No_show"})
# check that the columns were renamed
df.head()

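      PatientId  AppointmentID Gender         Scheduled_Day  \
0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z
1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z
2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z
3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z
4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z

        Appointment_Day  Age      Neighbourhood  Scholarship  Hypertension  \
0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1
1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             0
2  2016-04-29T00:00:00Z   62      MATA DA PRAIA            0             0
3  2016-04-29T00:00:00Z    8  PONTAL DE CAMBURI            0             0
4  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0             1

   Diabetes  Alcoholism  Handicap  SMS_received No_show
0         0           0         0             0      No
1         0           0         0             0      No
2         0           0         0             0      No
3         0           0         0             0      No
4         1           0         0             0      No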

# number of rows and columns

df.shape

(110527, 14)

# show information for dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PatientId 110527 non-null float64
1 AppointmentID 110527 non-null int64
2 Gender 110527 non-null object
3 Scheduled_Day 110527 non-null object
4 Appointment_Day 110527 non-null object
5 Age 110527 non-null int64
6 Neighbourhood 110527 non-null object
7 Scholarship 110527 non-null int64
8 Hypertension 110527 non-null int64
9 Diabetes 110527 non-null int64
10 Alcoholism 110527 non-null int64
11 Handicap 110527 non-null int64
12 SMS_received 110527 non-null int64
13 No_show 110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

A general look at the data

I prefer to visualize and explore the data before changing it, so that it is easier to understand.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))  # two side-by-side axes

# pie chart of gender counts (value_counts sorts descending: F, then M)
sorted_counts = df.Gender.value_counts()
ax1.pie(sorted_counts, labels=['Female', 'Male'], startangle=0,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax1.axis('square')
ax1.set_title('Gender')
# pie chart of scholarship counts (0 = no scholarship, 1 = scholarship)
sorted_counts = df.Scholarship.value_counts()
ax2.pie(sorted_counts, labels=['No Scholarship', 'Scholarship'],
        startangle=200, counterclock=False, autopct='%1.2f%%',
        explode=(0.1, 0))
ax2.axis('square')
ax2.set_title('Scholarship');

We can see that female patients outnumber male patients, and that only a small share of patients have a scholarship.
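The shares drawn in the pie charts can also be read as plain numbers. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`) that only mimics the `Gender` and `Scholarship` columns; its values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# hypothetical toy data standing in for the real df
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M"],
    "Scholarship": [0, 0, 1, 0],
})

# normalize=True returns fractions instead of raw counts
gender_share = toy["Gender"].value_counts(normalize=True) * 100
scholarship_share = toy["Scholarship"].value_counts(normalize=True) * 100

print(gender_share.round(2))
print(scholarship_share.round(2))
```

Running the same two `value_counts(normalize=True)` calls on the real `df` would give the percentages that `autopct` prints on the charts.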
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))

# pie chart of hypertension counts
sorted_counts = df.Hypertension.value_counts()
ax1.pie(sorted_counts, labels=['No Hypertension', 'Hypertension'],
        startangle=0, counterclock=False, autopct='%1.2f%%',
        explode=(0.1, 0))
ax1.axis('square')
ax1.set_title('Hypertension')

# pie chart of diabetes counts
sorted_counts = df.Diabetes.value_counts()
ax2.pie(sorted_counts, labels=['No Diabetes', 'Diabetes'],
        startangle=200, counterclock=False, autopct='%1.2f%%',
        explode=(0.1, 0))
ax2.axis('square')
ax2.set_title('Diabetes');
We can see that most patients have neither hypertension nor diabetes.
# pie chart of alcoholism counts
sorted_counts = df.Alcoholism.value_counts()
plt.pie(sorted_counts, labels=['No Alcoholism', 'Alcoholism'],
        startangle=200, counterclock=False, autopct='%1.2f%%',
        explode=(0.1, 0))
plt.axis('square')
plt.title('Alcoholism');

We can see that most patients do not have alcoholism.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 10))
# pie chart of SMS_received counts
sorted_counts = df.SMS_received.value_counts()
ax1.pie(sorted_counts, labels=['No SMS_received', 'SMS_received'],
        startangle=0, counterclock=False, autopct='%1.2f%%')
ax1.axis('square')
ax1.set_title('SMS_received')
# recode No_show: 'No' becomes 0 (showed up), 'Yes' becomes 1 (missed)
df['No_show'] = df['No_show'].map({'No': 0, 'Yes': 1})
# pie chart of show status counts
sorted_counts = df.No_show.value_counts()
ax2.pie(sorted_counts, labels=['show', 'No show'], startangle=200,
        counterclock=False, autopct='%1.2f%%', explode=(0.1, 0))
ax2.axis('square')
ax2.set_title('Show Status');

We can see that most patients did not receive an SMS, and that most patients showed up for their appointment.
df.Handicap.value_counts()

0 108286
1 2042
2 183
3 13
4 3
Name: Handicap, dtype: int64

df.nunique()

PatientId 62299
AppointmentID 110527
Gender 2
Scheduled_Day 103549
Appointment_Day 27
Age 104
Neighbourhood 81
Scholarship 2
Hypertension 2
Diabetes 2
Alcoholism 2
Handicap 5
SMS_received 2
No_show 2
dtype: int64

df.duplicated(['PatientId','No_show']).sum()

38710

df.duplicated(['PatientId','Gender','Appointment_Day','No_show']).sum()

7378
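To make the two counts above easier to interpret, here is a small sketch of how `duplicated` behaves with a column subset. The DataFrame `toy` and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# hypothetical toy appointments; values are made up for illustration
toy = pd.DataFrame({
    "PatientId": [1, 1, 1, 2],
    "Appointment_Day": ["2016-05-01", "2016-05-01", "2016-05-02", "2016-05-01"],
    "No_show": [0, 0, 0, 1],
})

# duplicated() marks every repeat of a key combination after its first occurrence
by_patient = toy.duplicated(["PatientId", "No_show"]).sum()
by_patient_day = toy.duplicated(["PatientId", "Appointment_Day", "No_show"]).sum()

print(by_patient, by_patient_day)
```

Adding more columns to the subset makes the match stricter, which is why the second count in the notebook is smaller than the first.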

df['Age'].describe()

count 110527.000000
mean 37.088874
std 23.110205
min -1.000000
25% 18.000000
50% 37.000000
75% 55.000000
max 115.000000
Name: Age, dtype: float64

# get the index labels of rows with a negative age
ind = df.index[df['Age'] < 0]
df.loc[ind]

          PatientId  AppointmentID Gender         Scheduled_Day  \
99832  4.659432e+14        5775010      F  2016-06-06T08:58:13Z

            Appointment_Day  Age Neighbourhood  Scholarship  Hypertension  \
99832  2016-06-06T00:00:00Z   -1         ROMÃO            0             0

       Diabetes  Alcoholism  Handicap  SMS_received  No_show
99832         0           0         0             0        0

# get the index labels of rows with age == 115
ind = df.index[df['Age'] == 115]
df.loc[ind]

          PatientId  AppointmentID Gender         Scheduled_Day  \
63912  3.196321e+13        5700278      F  2016-05-16T09:17:44Z
63915  3.196321e+13        5700279      F  2016-05-16T09:17:44Z
68127  3.196321e+13        5562812      F  2016-04-08T14:29:17Z
76284  3.196321e+13        5744037      F  2016-05-30T09:44:51Z
97666  7.482346e+14        5717451      F  2016-05-19T07:57:56Z

            Appointment_Day  Age Neighbourhood  Scholarship  Hypertension  \
63912  2016-05-19T00:00:00Z  115    ANDORINHAS            0             0
63915  2016-05-19T00:00:00Z  115    ANDORINHAS            0             0
68127  2016-05-16T00:00:00Z  115    ANDORINHAS            0             0
76284  2016-05-30T00:00:00Z  115    ANDORINHAS            0             0
97666  2016-06-03T00:00:00Z  115      SÃO JOSÉ            0             1

       Diabetes  Alcoholism  Handicap  SMS_received  No_show
63912         0           0         1             0        1
63915         0           0         1             0        1
68127         0           0         1             0        1
76284         0           0         1             0        0
97666         0           0         0             1        0

(df['Appointment_Day'][63912], df['Appointment_Day'][63915],
 df['Appointment_Day'][68127], df['Appointment_Day'][76284])

('2016-05-19T00:00:00Z',
'2016-05-19T00:00:00Z',
'2016-05-16T00:00:00Z',
'2016-05-30T00:00:00Z')

Conclusion:
This dataset needs several changes before it is useful for analysis: rename some columns, change some data types, drop duplicated records, drop records with invalid data such as Age = -1, and recode Handicap so that it holds only 0 or 1.

Data Cleaning
Tip: Make sure that you keep your reader informed on the steps that you are taking in
your investigation. Follow every code cell, or every set of related code cells, with a
markdown cell to describe to the reader what was found in the preceding cell(s). Try to
make it so that the reader can then understand what they will be seeing in the
following cell(s).

# drop the row with a -1 age

df.drop(df.query('Age < 0').index, inplace=True)
df.query('Age < 0')
Empty DataFrame
Columns: [PatientId, AppointmentID, Gender, Scheduled_Day,
Appointment_Day, Age, Neighbourhood, Scholarship, Hypertension,
Diabetes, Alcoholism, Handicap, SMS_received, No_show]
Index: []

Row was dropped successfully.

# if the value is greater than 1 change it to 1, otherwise keep it

df['Handicap'] = np.where(df['Handicap'] > 1, 1, df['Handicap'])
df.Handicap.value_counts()

0 108285
1 2241
Name: Handicap, dtype: int64
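The same capping can be written with `Series.clip`. The sketch below compares both approaches on a made-up Series (`levels`) that mimics the 0 to 4 coding of Handicap; it is an illustration, not a change to the notebook's method.

```python
import numpy as np
import pandas as pd

# hypothetical handicap levels, mimicking the 0-4 coding seen above
levels = pd.Series([0, 1, 2, 3, 4, 0])

capped_where = pd.Series(np.where(levels > 1, 1, levels))  # approach used in this notebook
capped_clip = levels.clip(upper=1)                          # same result with clip

print(capped_clip.value_counts().sort_index())
```

Either form turns Handicap into a yes/no flag, at the cost of discarding the distinction between the higher levels.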

# convert both date columns from string to datetime
df['Scheduled_Day'] = pd.to_datetime(df['Scheduled_Day'])
df['Appointment_Day'] = pd.to_datetime(df['Appointment_Day'])
# check again
print(df.dtypes)
df.head()

PatientId float64
AppointmentID int64
Gender object
Scheduled_Day datetime64[ns, UTC]
Appointment_Day datetime64[ns, UTC]
Age int64
Neighbourhood object
Scholarship int64
Hypertension int64
Diabetes int64
Alcoholism int64
Handicap int64
SMS_received int64
No_show int64
dtype: object

      PatientId  AppointmentID Gender             Scheduled_Day  \
0  2.987250e+13        5642903      F 2016-04-29 18:38:08+00:00
1  5.589978e+14        5642503      M 2016-04-29 16:08:27+00:00
2  4.262962e+12        5642549      F 2016-04-29 16:19:04+00:00
3  8.679512e+11        5642828      F 2016-04-29 17:29:31+00:00
4  8.841186e+12        5642494      F 2016-04-29 16:07:23+00:00

            Appointment_Day  Age      Neighbourhood  Scholarship  \
0 2016-04-29 00:00:00+00:00   62    JARDIM DA PENHA            0
1 2016-04-29 00:00:00+00:00   56    JARDIM DA PENHA            0
2 2016-04-29 00:00:00+00:00   62      MATA DA PRAIA            0
3 2016-04-29 00:00:00+00:00    8  PONTAL DE CAMBURI            0
4 2016-04-29 00:00:00+00:00   56    JARDIM DA PENHA            0

   Hypertension  Diabetes  Alcoholism  Handicap  SMS_received  No_show
0             1         0           0         0             0        0
1             0         0           0         0             0        0
2             0         0           0         0             0        0
3             0         0           0         0             0        0
4             1         1           0         0             0        0

The data types were changed successfully.

# check the appointments days

df['Appointment_Day'].max()- df['Appointment_Day'].min()

Timedelta('40 days 00:00:00')

# There are 7378 duplicate records with the same patient, gender,
# appointment day and show status; we will drop them.
df.drop_duplicates(['PatientId','Gender','Appointment_Day','No_show'],
                   inplace=True)
# confirm that the duplicates were dropped
df.shape

(103148, 14)

# Drop columns that are not needed in the analysis
df.drop(columns=['AppointmentID', 'PatientId'], inplace=True)
df.head()

  Gender             Scheduled_Day           Appointment_Day  Age  \
0      F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00   62
1      M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00   56
2      F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00   62
3      F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00    8
4      F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00   56

       Neighbourhood  Scholarship  Hypertension  Diabetes  Alcoholism  \
0    JARDIM DA PENHA            0             1         0           0
1    JARDIM DA PENHA            0             0         0           0
2      MATA DA PRAIA            0             0         0           0
3  PONTAL DE CAMBURI            0             0         0           0
4    JARDIM DA PENHA            0             1         1           0

   Handicap  SMS_received  No_show
0         0             0        0
1         0             0        0
2         0             0        0
3         0             0        0
4         0             0        0

Exploratory Data Analysis

Tip: Now that you've trimmed and cleaned your data, you're ready to move on to
exploration. Compute statistics and create visualizations with the goal of addressing
the research questions that you posed in the Introduction section. You should
compute the relevant statistics throughout the analysis when an inference is made
about the data. Note that at least two or more kinds of plots should be created as part
of the exploration, and you must compare and show trends in the varied visualizations.
Tip: - Investigate the stated question(s) from multiple angles. It is recommended that
you be systematic with your approach. Look at one variable at a time, and then follow
it up by looking at relationships between variables. You should explore at least three
variables in relation to the primary question. This can be an exploratory relationship
between three variables of interest, or looking at how two independent variables
relate to a single dependent variable of interest. Lastly, you should perform both
single-variable (1d) and multiple-variable (2d) explorations.

df.hist(figsize=(18,15));
Research Question 1 (Can Gender be considered a factor to predict whether a patient will show up for their scheduled appointment?)
# percentage of no-show patients by gender
# (select No_show before taking the mean so non-numeric columns are not averaged)
no_show_perc_gender = df.groupby('Gender')['No_show'].mean() * 100
# plot a bar chart
no_show_perc_gender.plot.bar()
plt.title('Percentage of No-Show Patients by Gender')
plt.xticks([0, 1], ['Female', 'Male'])
plt.ylabel('No Show Percentage');
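The group percentages behind this bar chart can also be tabulated with `pd.crosstab`. This is a minimal sketch on a made-up DataFrame (`toy`); the numbers are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# hypothetical toy data; No_show uses the notebook's coding (1 = missed appointment)
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M"],
    "No_show": [0, 0, 0, 1, 0, 1],
})

# normalize="index" makes each row (gender) sum to 1, so cells are within-gender shares
rates = pd.crosstab(toy["Gender"], toy["No_show"], normalize="index") * 100

print(rates.round(1))
```

Reading the table row by row shows both the show and no-show share for each gender, which a single bar per group does not.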
Research Question 2 (Can Gender and Alcoholism be considered factors to predict whether a patient will show up for their scheduled appointment?)
# percentage of no-show patients by gender and alcoholism
no_show_perc_gender_Alcoholism = (
    df.groupby(['Gender', 'Alcoholism'])['No_show'].mean() * 100)
# plot a bar chart
no_show_perc_gender_Alcoholism.plot.bar()
plt.title('Percentage of No-Show Patients by Gender and Alcoholism')
plt.ylabel('No Show Percentage');
Research Question 3 (Can Gender and Scholarship be considered factors to predict whether a patient will show up for their scheduled appointment?)
# percentage of no-show patients by gender and scholarship
no_show_perc_gender_Scholarship = (
    df.groupby(['Gender', 'Scholarship'])['No_show'].mean() * 100)
# plot a bar chart
no_show_perc_gender_Scholarship.plot.bar()
plt.title('Percentage of No-Show Patients by Gender and Scholarship')
plt.ylabel('No Show Percentage');
Research Question 4 (Can Gender and Handicap be considered factors to predict whether a patient will show up for their scheduled appointment?)
# percentage of no-show patients by gender and handicap
no_show_perc_gender_Handicap = (
    df.groupby(['Gender', 'Handicap'])['No_show'].mean() * 100)
# plot a bar chart
no_show_perc_gender_Handicap.plot.bar()
plt.title('Percentage of No-Show Patients by Gender and Handicap')
plt.ylabel('No Show Percentage');
Research Question 5 (Can Gender and SMS_received be considered factors to predict whether a patient will show up for their scheduled appointment?)
# percentage of no-show patients by gender and SMS_received
no_show_perc_gender_SMS_received = (
    df.groupby(['Gender', 'SMS_received'])['No_show'].mean() * 100)
# plot a bar chart
no_show_perc_gender_SMS_received.plot.bar()
plt.title('Percentage of No-Show Patients by Gender and SMS_received')
plt.ylabel('No Show Percentage');
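When two grouping columns are used, the result is easier to read as a small table with one factor in the rows and the other in the columns. A sketch of that, assuming a made-up DataFrame (`toy`) with illustrative values:

```python
import pandas as pd

# hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Gender":       ["F", "F", "F", "F", "M", "M", "M", "M"],
    "SMS_received": [0, 0, 1, 1, 0, 0, 1, 1],
    "No_show":      [0, 1, 0, 0, 0, 0, 1, 1],
})

# the mean of a 0/1 column is the share of 1s; unstack pivots SMS_received into columns
table = toy.groupby(["Gender", "SMS_received"])["No_show"].mean().unstack() * 100

print(table)
```

Calling `table.plot.bar()` on such a frame draws grouped bars, one color per `SMS_received` value, which may be easier to compare than the flat bar chart.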
Research Question 6 (Can Alcoholism and diseases be considered factors to predict whether a patient will show up for their scheduled appointment?)
# percentage of no-show patients by alcoholism and diseases
no_show_perc_Alcoholism_diseases = (
    df.groupby(['Alcoholism', 'Hypertension', 'Diabetes']).No_show.mean() * 100)
# plot a bar chart
no_show_perc_Alcoholism_diseases.plot.bar()
plt.title('Percentage of No-Show Patients by Alcoholism and Diseases')
plt.ylabel('No Show Percentage');
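With three grouping columns some combinations can contain very few patients, so a percentage alone may be misleading. The sketch below reports the group size next to the rate; `toy` and its values are made up for illustration.

```python
import pandas as pd

# hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Alcoholism":   [0, 0, 0, 0, 1],
    "Hypertension": [0, 0, 1, 1, 1],
    "No_show":      [0, 1, 0, 0, 1],
})

# aggregate the no-show share and the number of rows per group in one pass
summary = toy.groupby(["Alcoholism", "Hypertension"])["No_show"].agg(["mean", "size"])
summary["no_show_pct"] = summary["mean"] * 100

print(summary)
```

In this toy example the 100% rate comes from a single row, which is exactly the kind of case the `size` column is meant to expose.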
Research Question 7 (Can Age be considered a factor to predict whether a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Age Group')
    plt.xlabel('Age')
    plt.ylabel('Patient Number')

age_attendance(df, 'Age')
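The histogram above shows patient counts, so age groups with more patients dominate the picture. A no-show rate per age group can be sketched with `pd.cut`; the bin edges, labels, and the `toy` data below are assumptions chosen only for illustration.

```python
import pandas as pd

# hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Age":     [5, 10, 25, 30, 45, 70, 80, 90],
    "No_show": [1, 0, 1, 1, 0, 0, 0, 1],
})

# illustrative bin edges; right=False makes each bin include its left edge
bins = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
toy["age_group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# share of missed appointments within each age group, in percent
rate_by_group = toy.groupby("age_group", observed=False)["No_show"].mean() * 100

print(rate_by_group.round(1))
```

Comparing rates rather than counts separates "this group misses more appointments" from "this group simply has more appointments".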
Research Question 8 (Can Gender be considered a factor to predict whether a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Gender')
    plt.xlabel('Gender')
    plt.ylabel('Patient Number')

age_attendance(df, 'Gender')

Research Question 9 (Can Scholarship be considered a factor to predict whether a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Scholarship')
    plt.xlabel('Scholarship')
    plt.ylabel('Patient Number')

age_attendance(df, 'Scholarship')

Research Question 10 (Can Alcoholism be considered a factor to predict whether a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Alcoholism')
    plt.xlabel('Alcoholism')
    plt.ylabel('Patient Number')

age_attendance(df, 'Alcoholism')
Research Question 11 (Can SMS_received be considered a factor to predict whether a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by SMS_received')
    plt.xlabel('SMS_received')
    plt.ylabel('Patient Number')

age_attendance(df, 'SMS_received')

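Research Question 12 (Can Handicap be considered a factor to predict whether a patient will show up for their scheduled appointment?)
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Handicap')
    plt.xlabel('Handicap')
    plt.ylabel('Patient Number')

age_attendance(df, 'Handicap')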

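Research Question 13 (Can Diabetes be considered a factor to predict whether a patient will show up for their scheduled appointment?)
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Diabetes')
    plt.xlabel('Diabetes')
    plt.ylabel('Patient Number')

age_attendance(df, 'Diabetes')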
Research Question 14 (Can Hypertension be considered a factor to predict whether a patient will show up for their scheduled appointment?)
show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Hypertension')
    plt.xlabel('Hypertension')
    plt.ylabel('Patient Number')

age_attendance(df, 'Hypertension')
Research Question 15 (Can the waiting period between the scheduled day and the appointment day be considered a factor to predict whether a patient will show up for their scheduled appointment?)
# calculate days between Scheduled_Day and Appointment_Day
# note: Scheduled_Day includes a time of day while Appointment_Day is midnight,
# so appointments booked on the same day come out as -1
days_between = (df['Appointment_Day'] - df['Scheduled_Day']).dt.days
# insert a new column (days_between) at position 3
df.insert(3, 'days_between', days_between)
df.head()

  Gender             Scheduled_Day           Appointment_Day  days_between  \
0      F 2016-04-29 18:38:08+00:00 2016-04-29 00:00:00+00:00            -1
1      M 2016-04-29 16:08:27+00:00 2016-04-29 00:00:00+00:00            -1
2      F 2016-04-29 16:19:04+00:00 2016-04-29 00:00:00+00:00            -1
3      F 2016-04-29 17:29:31+00:00 2016-04-29 00:00:00+00:00            -1
4      F 2016-04-29 16:07:23+00:00 2016-04-29 00:00:00+00:00            -1

   Age      Neighbourhood  Scholarship  Hypertension  Diabetes  Alcoholism  \
0   62    JARDIM DA PENHA            0             1         0           0
1   56    JARDIM DA PENHA            0             0         0           0
2   62      MATA DA PRAIA            0             0         0           0
3    8  PONTAL DE CAMBURI            0             0         0           0
4   56    JARDIM DA PENHA            0             1         1           0

   Handicap  SMS_received  No_show
0         0             0        0
1         0             0        0
2         0             0        0
3         0             0        0
4         0             0        0

# drop rows where the waiting period is negative
# (this also removes same-day appointments, which were computed as -1)

negative_days = df.query('days_between < 0')
df.drop(negative_days.index, inplace=True)
df.query('days_between < 0')

Empty DataFrame
Columns: [Gender, Scheduled_Day, Appointment_Day, days_between, Age,
Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism,
Handicap, SMS_received, No_show]
Index: []
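Because `Scheduled_Day` carries a time of day and `Appointment_Day` is always midnight, an appointment booked on its own day gets a waiting period of -1 and is removed by the filter above. One possible alternative, shown only as a sketch and not as the method this notebook uses, is to strip the time with `Series.dt.normalize()` before subtracting. The first row of `toy` reuses the timestamps of row 0 shown earlier; the second row is made up.

```python
import pandas as pd

# row 0 mirrors the first record shown above; row 1 is a made-up example
toy = pd.DataFrame({
    "Scheduled_Day":   pd.to_datetime(["2016-04-29T18:38:08Z", "2016-04-27T08:00:00Z"]),
    "Appointment_Day": pd.to_datetime(["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"]),
})

# subtraction with the time of day kept, as in the notebook
raw_days = (toy["Appointment_Day"] - toy["Scheduled_Day"]).dt.days
# subtraction after resetting Scheduled_Day to midnight
date_days = (toy["Appointment_Day"] - toy["Scheduled_Day"].dt.normalize()).dt.days

print(raw_days.tolist(), date_days.tolist())
```

With the normalized version, same-day appointments get a waiting period of 0, so a `days_between < 0` filter would remove only records whose appointment date precedes the scheduling date.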

show = df.No_show == 0    # mask: patient showed up
noshow = df.No_show == 1  # mask: patient did not show up

# plot overlaid histograms of one column for show and no-show patients
def age_attendance(df, col_name):
    plt.figure(figsize=[16, 4])
    df[col_name][show].hist(alpha=0.75, bins=10, color='blue', label='show')
    df[col_name][noshow].hist(alpha=1, bins=10, color='red', label='no_show')
    plt.legend()
    plt.title('Number of Show and No-Show Patients by Waiting Period')
    plt.xlabel('days_between')
    plt.ylabel('Patient Number')

age_attendance(df, 'days_between')

Conclusions
Limitation:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Untitled2.ipynb'])

4294967295

