DA 1 (Datamining)

The document outlines a data analysis project using the Stroke Prediction Dataset from Kaggle, focusing on descriptive statistics and data visualization techniques. It includes Python code for calculating statistical measures and generating visualizations such as scatter plots, bar charts, histograms, and line graphs. The analysis aims to explore relationships between various features in the dataset, particularly related to stroke prediction.

Uploaded by

MINTU CHOWDARY


24/07/2025, 12:15 Welcome to Colab - Colab

DA-1 Submission By: C.Yashwanth Chowdary 22BDS0063

Consider the “Stroke Prediction Dataset”, which is available on Kaggle at:
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

from google.colab import files

uploaded = files.upload()


1. Apply the descriptive statistical functions in Python/R to find the minimum value,
maximum value, mean value, median value, mode value, quantile, standard deviation,
variance, and summary of the above dataset by considering appropriate features.

import pandas as pd

# Load the dataset

df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Display the first 5 rows

print("First 5 rows of the dataset:")
print(df.head())

# Display column information

print("\nColumn information:")
print(df.info())

# Identify numerical and categorical columns for descriptive statistics

numerical_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

print("\nDescriptive Statistics:")

# Process numerical columns

for col in numerical_cols:
    print(f"\n--- {col} (Numerical) ---")
    print(f"Minimum: {df[col].min()}")
    print(f"Maximum: {df[col].max()}")
    print(f"Mean: {df[col].mean():.2f}")
    print(f"Median: {df[col].median():.2f}")
    print(f"Standard Deviation: {df[col].std():.2f}")
    print(f"Variance: {df[col].var():.2f}")
    print(f"Quantiles (25th, 50th, 75th percentile):\n{df[col].quantile([0.25, 0.5, 0.75])}")

# Process categorical columns for mode

for col in categorical_cols:
    # Mode can return multiple values if there's a tie, so we take the first one
    mode_val = df[col].mode()[0]
    print(f"\n--- {col} (Categorical) ---")
    print(f"Mode: {mode_val}")

# Overall summary using .describe() for numerical columns

print("\nOverall Summary of Numerical Features:")
print(df[numerical_cols].describe())

# Overall summary using .describe(include='object') for categorical columns

print("\nOverall Summary of Categorical Features:")
print(df[categorical_cols].describe())

75% 114.090000 33.100000 0.000000
max 271.740000 97.600000 1.000000

Overall Summary of Categorical Features:

        gender  ever_married  work_type  Residence_type  smoking_status
count     5110          5110       5110            5110            5110
unique       3             2          5               2               4
top     Female           Yes    Private           Urban    never smoked
freq      2994          3353       2925            2596            1892
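
The loop above reports a mode only for categorical columns and keeps just the first value when several are tied. As a minimal sketch of how all the requested statistics, including every tied mode, could be gathered for one column, the example below uses a tiny synthetic sample. The values and the helper name `describe_numeric` are assumptions made for illustration and are not taken from the Kaggle file or the original notebook.

```python
import pandas as pd

# Tiny synthetic sample (hypothetical values, NOT rows from the Kaggle dataset)
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 79.0, 49.0],
    "smoking_status": ["never smoked", "smokes", "never smoked",
                       "smokes", "formerly smoked", "Unknown"],
})


def describe_numeric(series: pd.Series) -> dict:
    """Collect the statistics requested in question 1 for one numeric column.

    Hypothetical helper written for this sketch only.
    """
    return {
        "min": series.min(),
        "max": series.max(),
        "mean": series.mean(),
        "median": series.median(),
        "modes": series.mode().tolist(),  # every tied value, sorted ascending
        "quantiles": series.quantile([0.25, 0.5, 0.75]).tolist(),
        "std": series.std(),  # sample standard deviation (ddof=1)
        "var": series.var(),  # sample variance (ddof=1)
    }


age_stats = describe_numeric(sample["age"])

# Two categories are tied here, so mode() returns both of them
smoking_modes = sample["smoking_status"].mode().tolist()

print(age_stats)
print(smoking_modes)
```

Keeping the full list returned by `mode()` makes ties visible, whereas indexing with `[0]` silently reports only the first tied value in sorted order.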

2. Demonstrate different data visualization techniques on the above dataset, such as Scatter Plot,
Horizontal Bar Chart, Histogram, and Line Graph, by considering appropriate features.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame

# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='age', y='avg_glucose_level', hue='stroke', palette='v
plt.title('Scatter Plot of Age vs. Average Glucose Level (colored by Stroke)')
plt.xlabel('Age')
plt.ylabel('Average Glucose Level')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show() # Use plt.show() to display the plot in an interactive environment


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame

# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

plt.figure(figsize=(10, 6))
work_type_counts = df['work_type'].value_counts().sort_values(ascending=True)
sns.barplot(x=work_type_counts.values, y=work_type_counts.index, palette='Blues_d
plt.title('Distribution of Work Type')
plt.xlabel('Count')
plt.ylabel('Work Type')
plt.show() # Use plt.show() to display the plot in an interactive environment


/tmp/ipython-input-4-3625557923.py:10: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in

sns.barplot(x=work_type_counts.values, y=work_type_counts.index, palette='Bl

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame

# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Handle missing BMI values by dropping them for the histogram calculation
df_bmi_cleaned = df.dropna(subset=['bmi'])

plt.figure(figsize=(10, 6))
sns.histplot(df_bmi_cleaned['bmi'], bins=30, kde=True, color='purple')
plt.title('Histogram of BMI Distribution')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.show() # Use plt.show() to display the plot in an interactive environment
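
Dropping missing BMI values changes how many rows feed the histogram, so it can help to count them first. The sketch below shows that check and the binning step itself with `numpy.histogram`, without plotting. The BMI values are synthetic and chosen for illustration, and three bins are used instead of 30 so the result is easy to read.

```python
import numpy as np
import pandas as pd

# Synthetic BMI column with two missing entries (hypothetical values)
bmi = pd.Series([36.6, np.nan, 32.5, 34.4, 24.0, np.nan, 29.0, 27.4], name="bmi")

n_missing = int(bmi.isna().sum())  # rows that dropna() will discard
bmi_clean = bmi.dropna()

# Equal-width bins between the smallest and largest remaining value
counts, edges = np.histogram(bmi_clean, bins=3)

print(f"Missing BMI values dropped: {n_missing}")
print(f"Rows used for the histogram: {len(bmi_clean)}")
print("Bin edges:", edges)
print("Bin counts:", counts)
```

The bin counts always sum to the number of non-missing rows, which gives a quick sanity check on the plotted histogram.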


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame

# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Create age bins for better visualization

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = [f'{i}-{i+9}' for i in bins[:-1]]
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

avg_glucose_per_age_group = df.groupby('age_group')['avg_glucose_level'].mean().

# Ensure the age groups are sorted for the line plot
avg_glucose_per_age_group['age_group'] = pd.Categorical(avg_glucose_per_age_grou
avg_glucose_per_age_group = avg_glucose_per_age_group.sort_values('age_group')

plt.figure(figsize=(12, 6))
sns.lineplot(data=avg_glucose_per_age_group, x='age_group', y='avg_glucose_level
plt.title('Average Glucose Level Across Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Average Glucose Level')
plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show() # Use plt.show() to display the plot in an interactive environment

/tmp/ipython-input-6-3438248326.py:13: FutureWarning: The default of observed=

avg_glucose_per_age_group = df.groupby('age_group')['avg_glucose_level'].mea
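
The warning above appears because `age_group` is a categorical column and `groupby` was called without an explicit `observed` argument. A minimal sketch of the binning and grouping step with `observed=False` passed explicitly, which keeps the current behavior (empty age groups stay in the result), is shown below. The sample rows are synthetic and chosen for illustration.

```python
import pandas as pd

# Synthetic rows (hypothetical values, NOT the Kaggle data)
sample = pd.DataFrame({
    "age": [3, 9, 10, 45, 49, 82, 90],
    "avg_glucose_level": [95.0, 85.0, 100.0, 110.0, 130.0, 150.0, 200.0],
})

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = [f"{i}-{i+9}" for i in bins[:-1]]  # '0-9', '10-19', ..., '80-89'

# right=False makes each bin left-closed: [0, 10), [10, 20), ...
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels, right=False)

# Passing observed explicitly silences the FutureWarning;
# observed=False keeps age groups that contain no rows (their mean is NaN)
mean_glucose = (
    sample.groupby("age_group", observed=False)["avg_glucose_level"]
    .mean()
    .reset_index()
)

print(mean_glucose)
print("Rows outside every bin:", int(sample["age_group"].isna().sum()))
```

With these bin edges, an age of 90 or more falls outside the last interval and gets no age group, so it is excluded from the averages. Passing `observed=True` instead would drop the empty groups from the result.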
