ML-Project

Introduction

A stroke is a medical emergency where blood flow to the brain is interrupted, causing brain cells to die due to lack of oxygen and nutrients. It’s a leading cause of death and disability worldwide but is often preventable and treatable with prompt action. Every minute a stroke goes untreated, ~1.9 million brain cells die.

Immediate action can save lives and reduce long-term disability. Awareness of symptoms and risk factors is critical.

🛠 GitHub Repository: ML-Project
📋 Trello Board: Project Tasks & Timeline
Presentation: Stroke

Datasets Used

We use a single raw data file.

Stroke data (healthcare-dataset): client demographics and health indicators such as age, gender, and hypertension.

A metadata dictionary describes the content of each column and guides the analysis:

id: Every client's unique ID.
gender: Specifies the client's gender.
age: Indicates the age of the client.
hypertension: High blood pressure (Yes/No)
heart_disease: Issue with heart (Yes/No)
ever_married: Marital status (Yes/No)
work_type: Employment status (Private/Self-employed/Govt_job/children/Never_worked).
Residence_type: Living area (Urban/Rural).
avg_glucose_level: Average glucose level in blood.
bmi: Body Mass Index.
smoking_status: Smoking status (never smoked/formerly smoked/Unknown/smokes)
stroke: Stroke status (True/False)
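
As a quick sanity check, the column list above can be compared against a loaded frame. This is a minimal sketch: the helper names `missing_columns` and `summarize` are hypothetical, and the two-row frame is made-up stand-in data rather than the real file.

```python
import numpy as np
import pandas as pd

# Column names as listed in the data dictionary above.
EXPECTED_COLUMNS = [
    "id", "gender", "age", "hypertension", "heart_disease", "ever_married",
    "work_type", "Residence_type", "avg_glucose_level", "bmi",
    "smoking_status", "stroke",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype and number of missing values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })


# Made-up rows for illustration only (not taken from the dataset).
demo = pd.DataFrame({
    "id": [1, 2],
    "gender": ["Male", "Female"],
    "age": [67.0, 45.0],
    "bmi": [np.nan, 24.5],
    "stroke": [1, 0],
})

absent = missing_columns(demo)
summary = summarize(demo)
print(absent)
print(summary)
```

With the real file, `demo` would be replaced by the frame returned from `pd.read_csv` on the downloaded CSV.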

Comments on the Data: The data looks clean. Some missing values can be filled or removed to limit their impact on the analysis.

Business Problem & Hypothesis

Predict stroke from a patient's health metrics so that at-risk patients can be identified early.

Question:
Conclusion:

Business recommendation:

This model can be used by hospitals and telehealth providers to screen patients quickly for stroke risk based on health metrics.
Helps prioritize care and enables early intervention in emergencies.
Potential to integrate into electronic medical record (EMR) systems for real-time alerts.

Methodology

Our methodology involved several key steps: data preprocessing, model selection, model training, model evaluation, and tuning.

1. Data preprocessing:

Datasets were downloaded from Kaggle.
Data Cleaning:
- Map categorical values to numerical ones, drop the "id" column, and exclude rows with gender "Other".
- Fill missing bmi values with the average value (open question: is another approach better?).
- Reduce some outliers in age, gender, and bmi.
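
The cleaning steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the sample frame is made up, the function name `clean` is hypothetical, and the numeric encodings are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Made-up sample that mimics a subset of the columns (not real data).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Other", "Female"],
    "age": [67.0, 45.0, 30.0, 80.0],
    "ever_married": ["Yes", "No", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban", "Rural"],
    "bmi": [36.0, np.nan, 22.0, 28.0],
    "stroke": [1, 0, 0, 1],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps and return a new frame."""
    out = df.drop(columns=["id"])
    # Exclude the gender category "Other".
    out = out[out["gender"] != "Other"].copy()
    # Illustrative encodings; any consistent mapping would work.
    out["gender"] = out["gender"].map({"Male": 0, "Female": 1})
    out["ever_married"] = out["ever_married"].map({"No": 0, "Yes": 1})
    out["Residence_type"] = out["Residence_type"].map({"Rural": 0, "Urban": 1})
    # Fill missing bmi with the mean of the remaining rows.
    out["bmi"] = out["bmi"].fillna(out["bmi"].mean())
    return out.reset_index(drop=True)


cleaned = clean(sample)
print(cleaned)
```

Regarding the open question on imputation: the median (`out["bmi"].median()`) is a common alternative that is less sensitive to extreme values. Whichever statistic is used, computing it on the training split only avoids leaking test information into the model.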

2. EDA

Generic EDA on the following columns:
- Age: Older patients have a significantly higher risk.
- Hypertension & Heart Disease: Strong positive correlation with stroke.
- Glucose Level & BMI: Higher values may indicate risk but with some variability.
- Smoking Status: Formerly smoked and smokes groups show increased risk.
Apply a power transformer to all numerical columns, or to glucose level only, to observe the distribution.
Check the relationship with the target column "stroke": heatmap for numerical columns, chi-square test for categorical columns.
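
The three checks named above (chi-square test, power transform, correlation matrix for a heatmap) might look like the sketch below. It assumes scipy and scikit-learn are available and runs on synthetic stand-in data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; values are generated, not observed.
eda = pd.DataFrame({
    "hypertension": rng.integers(0, 2, size=n),
    "avg_glucose_level": rng.lognormal(mean=4.6, sigma=0.4, size=n),
})
stroke_prob = np.where(eda["hypertension"] == 1, 0.30, 0.05)
eda["stroke"] = (rng.random(n) < stroke_prob).astype(int)

# Chi-square test of independence: categorical feature vs. target.
table = pd.crosstab(eda["hypertension"], eda["stroke"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}")

# Yeo-Johnson power transform to reduce skew in a numerical column.
pt = PowerTransformer(method="yeo-johnson")
glucose_t = pt.fit_transform(eda[["avg_glucose_level"]]).ravel()

# Correlation matrix; this is the input a heatmap would visualize.
corr = eda.corr()
print(corr.round(2))
```

The matrix `corr` can be passed to `seaborn.heatmap(corr, annot=True)` to draw the heatmap. `PowerTransformer` standardizes its output by default, so the transformed column has zero mean and unit variance.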

3. Model selection:

KNN
Logistic Regression
Random Forest

4. Model training:

Each model was fitted on the training set and then used to predict the classes of the test set.
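
A minimal sketch of this fit-and-predict loop with scikit-learn is shown below. The data is generated with `make_classification` as a stand-in for the cleaned stroke data. The split ratio, the scaling step, and all hyperparameters are assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data standing in for the cleaned stroke data.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)

# KNN and logistic regression are scale-sensitive, so they get a scaler.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, "predicted positives:", int(predictions[name].sum()))
```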

5. Model evaluation: Classification metrics (the target is categorical)

Accuracy
Recall
Precision
F1 -> Focus on F1 because the target "stroke" is imbalanced -> Best-performing model: Logistic Regression
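
To illustrate why F1 is more informative than accuracy on an imbalanced target, the sketch below scores two sets of predictions on ten made-up labels (eight negatives, two positives). The labels are invented for the example and are not project results.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that finds one of the two positives and raises one false alarm.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
# A trivial baseline that always predicts "no stroke".
y_naive = [0] * 10


def score(y_t, y_p):
    """Return the four classification metrics as a dict."""
    return {
        "accuracy": accuracy_score(y_t, y_p),
        "recall": recall_score(y_t, y_p, zero_division=0),
        "precision": precision_score(y_t, y_p, zero_division=0),
        "f1": f1_score(y_t, y_p, zero_division=0),
    }


model_scores = score(y_true, y_pred)
naive_scores = score(y_true, y_naive)
print(model_scores)  # accuracy 0.8, recall 0.5, precision 0.5, f1 0.5
print(naive_scores)  # accuracy 0.8, recall 0.0, precision 0.0, f1 0.0
```

Both prediction sets reach the same accuracy, but only F1 (with recall and precision) shows that the baseline never detects a positive case.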

6. Model tuning: Grid search vs Random Search
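
The two search strategies can be compared with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`, as in the sketch below. The tuned parameter (`C` of a logistic regression), its ranges, the fold count, and the synthetic data are all assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Imbalanced synthetic data standing in for the cleaned stroke data.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0,
)

# Grid search: evaluates every value in a fixed list.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)

# Random search: samples a fixed number of values from a distribution.
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="f1",
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print("grid best:", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search cost grows with the product of all list lengths, while random search cost is fixed by `n_iter`. That makes random search the cheaper option when several hyperparameters are tuned at once.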

7. Insights:

Older adults with medical conditions should be prioritized.
Lifestyle factors matter — smoking still plays a role.
Medical metrics like glucose and BMI are useful but work best when combined with age or disease.
Tree-based models performed best and are interpretable.

Data Analysis Tools and Libraries

Python: The primary programming language for data manipulation and analysis.
Pandas: Essential for data loading, cleaning, and transformation.
Matplotlib / Seaborn: Used for creating various visualizations (bar charts, line graphs).

Repository Structure

ML-Project/
├── data/                        # Raw and cleaned CSV files
├── figures/                     # Sketching of structure in dataset
├── notebooks/                   # Python notebooks with analysis
├── README.md                    # This file
└── slides                       # URL of presentation

👥 Team Members

Robert, Jessica, Guilherme, Egbe
