

DA 1 (Datamining)

The document outlines a data analysis project using the Stroke Prediction Dataset from Kaggle, focusing on descriptive statistics and data visualization techniques. It includes Python code for calculating statistical measures and generating visualizations such as scatter plots, bar charts, histograms, and line graphs. The analysis aims to explore relationships between various features in the dataset, particularly related to stroke prediction.

Uploaded by

MINTU CHOWDARY

24/07/2025, 12:15 Welcome to Colab - Colab

DA-1 Submission By: C.Yashwanth Chowdary 22BDS0063

Consider the “Stroke Prediction Dataset”, which is available on Kaggle. The link for the dataset is
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

from google.colab import files


uploaded = files.upload()


1. Apply the descriptive statistical functions in Python/R to find the minimum value,
maximum value, mean value, median value, mode value, quantile, standard deviation,
variance, and summary of the above dataset by considering appropriate features.

import pandas as pd

# Load the dataset


df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Display the first 5 rows


print("First 5 rows of the dataset:")
print(df.head())

# Display column information


print("\nColumn information:")
print(df.info())

# Identify numerical and categorical columns for descriptive statistics


numerical_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

print("\nDescriptive Statistics:")

# Process numerical columns


for col in numerical_cols:
    print(f"\n--- {col} (Numerical) ---")
    print(f"Minimum: {df[col].min()}")
    print(f"Maximum: {df[col].max()}")
    print(f"Mean: {df[col].mean():.2f}")
    print(f"Median: {df[col].median():.2f}")
    print(f"Standard Deviation: {df[col].std():.2f}")
    print(f"Variance: {df[col].var():.2f}")
    # The source line is cut off after "0."; the quantile list is completed from the label
    print(f"Quantiles (25th, 50th, 75th percentile):\n{df[col].quantile([0.25, 0.5, 0.75])}")

# Process categorical columns for mode


for col in categorical_cols:
    # Mode can return multiple values if there's a tie, so we take the first one
    mode_val = df[col].mode()[0]
    print(f"\n--- {col} (Categorical) ---")
    print(f"Mode: {mode_val}")

# Overall summary using .describe() for numerical columns


print("\nOverall Summary of Numerical Features:")
print(df[numerical_cols].describe())

# Overall summary using .describe(include='object') for categorical columns


print("\nOverall Summary of Categorical Features:")
print(df[categorical_cols].describe())

     avg_glucose_level        bmi    stroke
75%         114.090000  33.100000  0.000000
max         271.740000  97.600000  1.000000

Overall Summary of Categorical Features:
        gender  ever_married  work_type  Residence_type  smoking_status
count     5110          5110       5110            5110            5110
unique       3             2          5               2               4
top     Female           Yes    Private           Urban    never smoked
freq      2994          3353       2925            2596            1892
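
The per-column loop in the cell above can also be written more compactly. The following is a minimal sketch, assuming a small made-up DataFrame named `toy` (its values are illustrative and are not taken from the Kaggle file). It uses `DataFrame.agg` for the numerical measures and shows why the cell takes `mode()[0]`: `mode()` returns a Series, which holds several values when there is a tie.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of the stroke dataset
toy = pd.DataFrame({
    "age": [20, 30, 30, 40, 80],
    "gender": ["Female", "Male", "Female", "Male", "Other"],
})

# min, max, mean, median, sample std and sample variance in one call
summary = toy[["age"]].agg(["min", "max", "mean", "median", "std", "var"])
print(summary)

# Quartiles of the numerical column
quartiles = toy["age"].quantile([0.25, 0.5, 0.75])
print(quartiles)

# mode() returns a Series: one value when there is a clear winner ...
age_mode = toy["age"].mode()
# ... and several (sorted) values when two categories are equally frequent
gender_modes = toy["gender"].mode()
print(age_mode.tolist(), gender_modes.tolist())
```

Note that pandas computes `std` and `var` with `ddof=1` by default, so both are sample statistics rather than population statistics.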

2. Demonstrate different data visualization techniques on the above dataset, such as Scatter Plot,
Horizontal Bar Chart, Histogram, and Line Graph, by considering appropriate features.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame


# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

plt.figure(figsize=(10, 6))
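# The original call is cut off after "palette='v"; the lost arguments are omitted here,
# so seaborn's default palette is used
sns.scatterplot(data=df, x='age', y='avg_glucose_level', hue='stroke')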
plt.title('Scatter Plot of Age vs. Average Glucose Level (colored by Stroke)')
plt.xlabel('Age')
plt.ylabel('Average Glucose Level')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show() # Use plt.show() to display the plot in an interactive environment
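
A scatter plot shows a relationship visually; a number or two can back it up. The following is a minimal sketch, assuming a hypothetical five-row DataFrame named `toy` with the same three column names as the plot (the values are invented for illustration). It computes the Pearson correlation between the two axes and the mean of each axis per `stroke` group.

```python
import pandas as pd

# Hypothetical toy rows; the real values come from healthcare-dataset-stroke-data.csv
toy = pd.DataFrame({
    "age": [25, 40, 55, 70, 80],
    "avg_glucose_level": [85.0, 90.0, 110.0, 150.0, 200.0],
    "stroke": [0, 0, 0, 1, 1],
})

# Pearson correlation between the two plotted axes (a value in [-1, 1])
r = toy["age"].corr(toy["avg_glucose_level"])

# Mean of each axis within each hue group (stroke = 0 / 1)
group_means = toy.groupby("stroke")[["age", "avg_glucose_level"]].mean()

print(f"Pearson r: {r:.3f}")
print(group_means)
```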


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame


# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

plt.figure(figsize=(10, 6))
work_type_counts = df['work_type'].value_counts().sort_values(ascending=True)
sns.barplot(x=work_type_counts.values, y=work_type_counts.index, palette='Blues_d')
plt.title('Distribution of Work Type')
plt.xlabel('Count')
plt.ylabel('Work Type')
plt.show() # Use plt.show() to display the plot in an interactive environment


/tmp/ipython-input-4-3625557923.py:10: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in

sns.barplot(x=work_type_counts.values, y=work_type_counts.index, palette='Bl

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame


# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Handle missing BMI values by dropping them for the histogram calculation
df_bmi_cleaned = df.dropna(subset=['bmi'])

plt.figure(figsize=(10, 6))
sns.histplot(df_bmi_cleaned['bmi'], bins=30, kde=True, color='purple')
plt.title('Histogram of BMI Distribution')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.show() # Use plt.show() to display the plot in an interactive environment
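
Dropping rows is one way to handle the missing BMI values; it helps to check how many rows are affected first. The following is a minimal sketch, assuming a hypothetical five-row DataFrame named `toy` (values invented for illustration). It counts the missing entries, applies the same `dropna` as the cell above, and contrasts it with median imputation, which is shown only as an alternative and is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing BMI values
toy = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, 26.0, np.nan]})

# How many values are missing before any cleaning
n_missing = int(toy["bmi"].isna().sum())

# Option 1 (used in the cell above): drop rows with missing BMI
dropped = toy.dropna(subset=["bmi"])

# Option 2 (an alternative, not the notebook's choice): fill with the median
median_bmi = toy["bmi"].median()
filled = toy.assign(bmi=toy["bmi"].fillna(median_bmi))

print(n_missing, len(dropped), len(filled), median_bmi)
```

Dropping keeps only observed values, so the histogram is not distorted by a spike at the imputed value; imputation keeps every row, which matters more for later modelling than for a single plot.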


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your loaded DataFrame


# df = pd.read_csv('healthcare-dataset-stroke-data.csv')

# Create age bins for better visualization


bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = [f'{i}-{i+9}' for i in bins[:-1]]
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

# The ends of the three long lines below are cut off in the source; they are
# completed here from how the result is used further down
avg_glucose_per_age_group = df.groupby('age_group')['avg_glucose_level'].mean().reset_index()

# Ensure the age groups are sorted for the line plot
avg_glucose_per_age_group['age_group'] = pd.Categorical(avg_glucose_per_age_group['age_group'], categories=labels, ordered=True)
avg_glucose_per_age_group = avg_glucose_per_age_group.sort_values('age_group')

plt.figure(figsize=(12, 6))
sns.lineplot(data=avg_glucose_per_age_group, x='age_group', y='avg_glucose_level')
plt.title('Average Glucose Level Across Age Groups')
plt.xlabel('Age Group')
plt.ylabel('Average Glucose Level')
plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show() # Use plt.show() to display the plot in an interactive environment

/tmp/ipython-input-6-3438248326.py:13: FutureWarning: The default of observed=


avg_glucose_per_age_group = df.groupby('age_group')['avg_glucose_level'].mea
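
The warning above concerns grouping by a categorical column such as the one `pd.cut` produces. The following is a minimal sketch, assuming a hypothetical five-row DataFrame named `toy` and a shorter bin list than the notebook's (values invented for illustration). It shows how `right=False` assigns boundary ages and how passing `observed` explicitly both silences the warning and decides whether empty age groups are kept.

```python
import pandas as pd

# Hypothetical toy rows; nobody falls in the 10-19 bin
toy = pd.DataFrame({
    "age": [5, 9, 20, 29, 30],
    "avg_glucose_level": [80.0, 90.0, 100.0, 120.0, 95.0],
})

bins = [0, 10, 20, 30, 40]
labels = [f"{i}-{i + 9}" for i in bins[:-1]]  # '0-9', '10-19', '20-29', '30-39'

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# observed=False keeps empty categories (the mean is NaN for '10-19')
with_empty = toy.groupby("age_group", observed=False)["avg_glucose_level"].mean()

# observed=True drops categories that have no rows
only_seen = toy.groupby("age_group", observed=True)["avg_glucose_level"].mean()

print(with_empty)
print(only_seen)
```

For a line plot, `observed=False` leaves a visible gap at an empty age group, while `observed=True` connects the neighbouring groups directly; which one is preferable depends on whether the gap is meaningful.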
