Obesity Data Analysis Report

Marios Natsios

2025-04-10

Table of Contents
Introduction
1. Subsetting the Dataset
2. Data Loading and Preparation
3. Data Exploration
3.1 Missing Values
3.2 Outlier Detection
4. Univariate & Bivariate Visualizations
4.1 Obesity Class Distribution
4.2 Age Distribution by Obesity Class
4.3 Weight vs Height Scatter Plot
4.4 Alcohol Consumption Frequency
4.5 Weight Density Plot by Obesity
5. Correlation and Multivariate Relationships
5.1 Correlation Matrix
5.2 Pair Plot
6. Frequency Tables for Categorical Variables
7. Classification Models
7.1 Train/Test Split
7.2 Decision Tree Model
7.3 K-Nearest Neighbors (KNN)
7.4 K Tuning & Error Rate
7.5 PCA Plot
8. Additional Plot
Conclusion

Introduction
Obesity is a growing public health concern and a complex medical condition involving
excessive body fat. It is associated with serious conditions like cardiovascular diseases,
diabetes, and joint problems. This report performs a detailed data analysis of obesity using
Exploratory Data Analysis (EDA), correlation assessment, and classification models
(Decision Tree and K-Nearest Neighbors). The goal is to understand which variables
influence obesity and how well we can predict it using machine learning.

1. Subsetting the Dataset

# Load data
obesity_data <- read.csv("C:/Users/Marios/OneDrive/Documents/Obesity.csv")

# Set seed using your student ID
set.seed(6385883)

# Generate a random subset of 1000 observations
subset_obesity <- obesity_data[sample(nrow(obesity_data), 1000), ]

# View the subset
head(subset_obesity)

##      Gender Age Height Weight family_history_with_overweight FAVC FCVC  NCP
## 643    Male  18   1.72  52.06                            yes  yes 1.95 3.00
## 1020 Female  29   1.53  65.03                            yes   no 2.00 1.00
## 1270   Male  18   1.79 108.24                            yes  yes 2.00 2.14
## 1253   Male  23   1.79 105.14                            yes  yes 2.26 3.00
## 1060 Female  34   1.66  80.39                            yes  yes 2.00 3.00
## 1379   Male  33   1.78 103.77                            yes  yes 2.90 1.73
##           CAEC SMOKE CH2O SCC  FAF   TUE      CALC                MTRANS
## 643  Sometimes    no 1.75  no 0.20 1.743 Sometimes Public_Transportation
## 1020 Sometimes    no 1.00  no 0.26 0.000        no Public_Transportation
## 1270 Sometimes    no 2.94  no 1.00 0.103 Sometimes Public_Transportation
## 1253 Sometimes    no 2.03  no 0.32 0.509 Sometimes Public_Transportation
## 1060 Sometimes    no 2.64  no 0.29 1.503        no            Automobile
## 1379 Sometimes    no 2.60  no 2.11 0.585 Sometimes            Automobile
##               NObeyesdad
## 643  Insufficient_Weight
## 1020 Overweight_Level_II
## 1270      Obesity_Type_I
## 1253      Obesity_Type_I
## 1060 Overweight_Level_II
## 1379      Obesity_Type_I

# Save the subset to a new CSV (optional)
write.csv(subset_obesity, "Obesity_Subset.csv", row.names = FALSE)

Note: This step draws a random sample of 1000 records to reduce processing time while
keeping the sample diverse. The analysis from Section 2 onward reloads the full file
(2111 rows) rather than this subset.
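
The role of set.seed() in the sampling step can be checked with a minimal sketch. The
data frame `toy` below is a hypothetical stand-in for the obesity data, not the real file;
only the seed value and the sample size are taken from the report.

```r
# Hypothetical stand-in for the obesity data: 2111 numbered rows
toy <- data.frame(id = 1:2111)

# Same seed, same call: the sampled row indices repeat exactly
set.seed(6385883)
first_draw <- sample(nrow(toy), 1000)
set.seed(6385883)
second_draw <- sample(nrow(toy), 1000)

identical(first_draw, second_draw)   # TRUE
anyDuplicated(first_draw)            # 0, because sample() draws without replacement
```

Resetting the seed immediately before the draw is what makes the subset reproducible;
any random call placed between set.seed() and sample() would change the rows selected.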

2. Data Loading and Preparation

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(ggcorrplot)
library(GGally)
library(caret)
library(rpart)
library(rpart.plot)
library(class)
library(pROC)
library(reshape2)
library(corrplot)
library(summarytools)

data <- read.csv("C:/Users/nares/OneDrive/Documents/Obesity.csv")

data$NObeyesdad <- as.factor(data$NObeyesdad)
str(data)

## 'data.frame': 2111 obs. of 17 variables:
## $ Gender                        : chr "Female" "Female" "Male" "Male" ...
## $ Age                           : int 21 21 23 27 22 29 23 22 24 22 ...
## $ Height                        : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight                        : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC                          : chr "no" "no" "no" "no" ...
## $ FCVC                          : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP                           : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC                          : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE                         : chr "no" "yes" "no" "no" ...
## $ CH2O                          : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC                           : chr "no" "yes" "no" "no" ...
## $ FAF                           : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE                           : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC                          : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS                        : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...

summary(data)

## Gender Age Height Weight

## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:20.00 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :23.00 Median :1.700 Median : 83.00
## Mean :24.32 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.770 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
##
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.390
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
##
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.660 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.480
## Max. :4.000 Max. :3.000
##
## SCC FAF TUE CALC
## Length:2111 Min. :0.000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.125 1st Qu.:0.0000 Class :character
## Mode :character Median :1.000 Median :0.6250 Mode :character
## Mean :1.010 Mean :0.6579
## 3rd Qu.:1.670 3rd Qu.:1.0000
## Max. :3.000 Max. :2.0000
##
## MTRANS NObeyesdad
## Length:2111 Insufficient_Weight:272
## Class :character Normal_Weight :287
## Mode :character Obesity_Type_I :351
## Obesity_Type_II :297
## Obesity_Type_III :324
## Overweight_Level_I :290
## Overweight_Level_II:290

3. Data Exploration

3.1 Missing Values

colSums(is.na(data))

## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad
## 0

Insight: No missing values were found in any of the 17 columns, so no imputation is needed before modeling.

3.2 Outlier Detection

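# Keep only the numeric columns
numeric_cols <- select_if(data, is.numeric)

# One base-R boxplot per numeric variable, four panels per page
par(mfrow = c(2, 2))
for (i in colnames(numeric_cols)) {
  boxplot(numeric_cols[[i]], main = i, col = "lightblue")
}
par(mfrow = c(1, 1))  # restore the default single-panel layout

# Same comparison in one ggplot: gather() reshapes to long format (key, value)
long_data <- gather(numeric_cols)
ggplot(long_data, aes(x = key, y = value)) +
  geom_boxplot(fill = "orange") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Boxplot for Outlier Detection")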

Observation: Some numeric variables (especially weight and height) contain
outliers which may affect model performance.
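
The boxplots flag outliers visually; the same idea can be expressed as a count. The
sketch below applies Tukey's 1.5 * IQR rule, which is close to (though not identical
with) the whisker rule boxplot() uses. The helper `count_outliers` and the `ages`
vector are hypothetical illustrations, not part of the original analysis.

```r
# Count values lying more than 1.5 * IQR outside the quartiles
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Hypothetical ages: most in the twenties, two much older values
ages <- c(18, 19, 20, 21, 21, 22, 23, 23, 24, 26, 55, 61)
count_outliers(ages)   # 2 (the values 55 and 61)
```

Applied to the report's data, sapply(numeric_cols, count_outliers) would return one
count per numeric column.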

4. Univariate & Bivariate Visualizations

4.1 Obesity Class Distribution

ggplot(data, aes(x = NObeyesdad, fill = NObeyesdad)) +
  geom_bar() +
  labs(title = "Obesity Class Distribution", x = "Obesity Class", y = "Count") +
  theme_minimal()

Summary: The seven classes are fairly balanced, ranging from 272 records
(Insufficient_Weight) to 351 (Obesity_Type_I), so class imbalance is only a minor
concern in model evaluation.

4.2 Age Distribution by Obesity Class

ggplot(data, aes(x = NObeyesdad, y = Age, fill = NObeyesdad)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Obesity Class")

Insight: Normal and overweight individuals are often younger; higher age
variability appears in obesity classes.

4.3 Weight vs Height Scatter Plot

ggplot(data, aes(x = Height, y = Weight, color = NObeyesdad)) +
  geom_point() +
  labs(title = "Weight vs Height by Obesity Class")

Insight: A clear trend shows that higher weights at fixed height correlate with
more severe obesity.
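
Weight at a fixed height is what the body mass index measures: weight in kilograms
divided by the square of height in meters. A minimal sketch using the six rows printed
by head(subset_obesity) in Section 1; the `bmi` column is an illustration only and is
not a column of the dataset.

```r
# Height (m), Weight (kg) and class of the six rows shown in Section 1
rows <- data.frame(
  Height = c(1.72, 1.53, 1.79, 1.79, 1.66, 1.78),
  Weight = c(52.06, 65.03, 108.24, 105.14, 80.39, 103.77),
  NObeyesdad = c("Insufficient_Weight", "Overweight_Level_II", "Obesity_Type_I",
                 "Obesity_Type_I", "Overweight_Level_II", "Obesity_Type_I")
)

# BMI = weight in kilograms / (height in meters)^2
rows$bmi <- rows$Weight / rows$Height^2

# Sorted by BMI, the classes appear in order of severity
rows[order(rows$bmi), c("NObeyesdad", "bmi")]
```

For these six records the BMI ordering matches the class ordering, from
Insufficient_Weight (about 17.6) up to Obesity_Type_I (about 33.8).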

4.4 Alcohol Consumption Frequency

ggplot(data, aes(x = CALC)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Alcohol Consumption Frequency", x = "CALC")

Summary: The majority of participants consume alcohol occasionally or not at
all.

4.5 Weight Density Plot by Obesity

ggplot(data, aes(x = Weight, fill = NObeyesdad)) +
  geom_density(alpha = 0.5) +
  ggtitle("Density Plot of Weight")

Summary: Higher weights shift the density toward obesity classes,
confirming weight as a strong indicator.

5. Correlation and Multivariate Relationships

5.1 Correlation Matrix
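
cor_matrix <- cor(numeric_cols)
ggcorrplot(cor_matrix, lab = TRUE, title = "Correlation Heatmap")

Key Finding: The heatmap covers the eight numeric columns (Age, Height, Weight, FCVC,
NCP, CH2O, FAF, TUE). BMI is not a column in this dataset, so its correlation with
weight cannot be read from this matrix; it would first have to be derived as
Weight / Height^2. Weight remains a critical feature for obesity prediction.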

5.2 Pair Plot

ggpairs(data, columns = c("Age", "Weight", "Height", "FCVC"),
        mapping = aes(color = NObeyesdad))

Insight: The pair plot shows clear class separation across these variable pairs,
which aids classification.

6. Frequency Tables for Categorical Variables

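# Only NObeyesdad was converted to a factor; the other categorical columns are
# still character, so select both types (is.factor alone matches one column)
cat_cols <- select_if(data, function(x) is.character(x) || is.factor(x))
for (col in colnames(cat_cols)) {
  print(freq(data[[col]]))
}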

## Error in table(names(candidates))[["tested"]]: subscript out of bounds

## Frequencies
##
##                             Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------------- ------ --------- -------------- --------- --------------
##       Insufficient_Weight    272     12.88          12.88     12.88          12.88
##             Normal_Weight    287     13.60          26.48     13.60          26.48
##            Obesity_Type_I    351     16.63          43.11     16.63          43.11
##           Obesity_Type_II    297     14.07          57.18     14.07          57.18
##          Obesity_Type_III    324     15.35          72.52     15.35          72.52
##        Overweight_Level_I    290     13.74          86.26     13.74          86.26
##       Overweight_Level_II    290     13.74         100.00     13.74         100.00
##                      <NA>      0                               0.00         100.00
##                     Total   2111    100.00         100.00    100.00         100.00

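Observation: The table shown is for the target NObeyesdad: each of the seven classes
holds between 12.88% and 16.63% of the records. The other categorical variables
(e.g., MTRANS, FAVC) also show useful variation across classes, although their
tables are not reproduced here.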
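
The percentage columns of such a frequency table follow directly from the counts. A
minimal base-R sketch, using the class counts reported by summary(data) in Section 2;
the variable names are illustrative.

```r
# Class counts as reported by summary(data)
counts <- c(Insufficient_Weight = 272, Normal_Weight = 287, Obesity_Type_I = 351,
            Obesity_Type_II = 297, Obesity_Type_III = 324,
            Overweight_Level_I = 290, Overweight_Level_II = 290)

pct <- 100 * counts / sum(counts)   # share of each class
cum_pct <- cumsum(pct)              # running total, ends at 100

round(cbind(Freq = counts, Pct = pct, Cum = cum_pct), 2)
```

For a column of the loaded data, prop.table(table(data$MTRANS)) gives the same shares
as proportions without requiring summarytools.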

7. Classification Models

7.1 Train/Test Split

# Stratified 70/30 split: createDataPartition samples within each class
set.seed(123)
splitIndex <- createDataPartition(data$NObeyesdad, p = 0.7, list = FALSE)
train_data <- data[splitIndex, ]
test_data <- data[-splitIndex, ]
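
What a stratified split does can be sketched in base R. This is an illustration of the
idea on hypothetical labels, not caret's actual implementation.

```r
# Hypothetical labels: three classes of unequal size
set.seed(123)
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices inside each class, then combine them
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Class proportions are preserved in both parts
table(labels[train_idx])    # A 35, B 21, C 14
table(labels[-train_idx])   # A 15, B 9, C 6
```

Sampling within each class keeps the 50/30/20 mix in both the training and the test
part, which a plain sample() over all rows would only match approximately.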

7.2 Decision Tree Model

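# Fit a classification tree on all predictors
dt_model <- rpart(NObeyesdad ~ ., data = train_data, method = "class")
# extra = 104 shows per-class probabilities and node percentages; extra = 106
# is intended for two-class responses
rpart.plot(dt_model, extra = 104, fallen.leaves = TRUE)
# Predict on the held-out set and compare with the true labels
dt_pred <- predict(dt_model, test_data, type = "class")
confusionMatrix(dt_pred, test_data$NObeyesdad)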

## Confusion Matrix and Statistics

##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 75 15 0
## Normal_Weight 6 53 0
## Obesity_Type_I 0 0 103
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
## Overweight_Level_I 0 14 0
## Overweight_Level_II 0 4 2
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 4
## Obesity_Type_I 6 0 0
## Obesity_Type_II 83 0 0
## Obesity_Type_III 0 97 0
## Overweight_Level_I 0 0 76
## Overweight_Level_II 0 0 7
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 0
## Obesity_Type_I 4
## Obesity_Type_II 0
## Obesity_Type_III 0
## Overweight_Level_I 0
## Overweight_Level_II 83
##
## Overall Statistics
##
## Accuracy : 0.9019
## 95% CI : (0.876, 0.924)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8854
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.9259 0.61628
## Specificity 0.9728 0.98168
## Pos Pred Value 0.8333 0.84127
## Neg Pred Value 0.9889 0.94200
## Prevalence 0.1282 0.13608
## Detection Rate 0.1187 0.08386
## Detection Prevalence 0.1424 0.09968
## Balanced Accuracy 0.9494 0.79898
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.9810 0.9326
## Specificity 0.9810 1.0000
## Pos Pred Value 0.9115 1.0000
## Neg Pred Value 0.9961 0.9891
## Prevalence 0.1661 0.1408
## Detection Rate 0.1630 0.1313
## Detection Prevalence 0.1788 0.1313
## Balanced Accuracy 0.9810 0.9663
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 0.8736
## Specificity 1.0000 0.9743
## Pos Pred Value 1.0000 0.8444
## Neg Pred Value 1.0000 0.9797
## Prevalence 0.1535 0.1377
## Detection Rate 0.1535 0.1203
## Detection Prevalence 0.1535 0.1424
## Balanced Accuracy 1.0000 0.9239
## Class: Overweight_Level_II
## Sensitivity 0.9540
## Specificity 0.9761
## Pos Pred Value 0.8646
## Neg Pred Value 0.9925
## Prevalence 0.1377
## Detection Rate 0.1313
## Detection Prevalence 0.1519
## Balanced Accuracy 0.9651
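
The headline statistics can be recomputed by hand from the confusion matrix counts
above. The sketch below rebuilds the decision tree table in base R; the object names
are illustrative and the counts are copied from the printed output.

```r
classes <- c("Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
             "Obesity_Type_II", "Obesity_Type_III",
             "Overweight_Level_I", "Overweight_Level_II")

# matrix() fills column by column, so each line below is one Reference class
# (rows are Predictions)
cm <- matrix(c(
  75,  6,   0,  0,  0,  0,  0,   # Insufficient_Weight
  15, 53,   0,  0,  0, 14,  4,   # Normal_Weight
   0,  0, 103,  0,  0,  0,  2,   # Obesity_Type_I
   0,  0,   6, 83,  0,  0,  0,   # Obesity_Type_II
   0,  0,   0,  0, 97,  0,  0,   # Obesity_Type_III
   0,  4,   0,  0,  0, 76,  7,   # Overweight_Level_I
   0,  0,   4,  0,  0,  0, 83    # Overweight_Level_II
), nrow = 7, dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct / total
sensitivity <- diag(cm) / colSums(cm)    # per class: correct / actual members
pos_pred    <- diag(cm) / rowSums(cm)    # per class: correct / predicted members

round(accuracy, 4)                        # 0.9019
round(sensitivity[["Normal_Weight"]], 5)  # 0.61628
```

Accuracy is the diagonal (570 correct) over all 632 test rows. Normal_Weight has the
lowest sensitivity because 33 of its 86 members are assigned to neighboring classes.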

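varImpPlot <- as.data.frame(varImp(dt_model))

ggplot(varImpPlot, aes(x = reorder(rownames(varImpPlot), Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  ggtitle("Decision Tree Variable Importance")

Interpretation: Weight and FCVC play key roles in the decision tree splits. Note that
FCVC is the frequency of vegetable consumption (physical activity is recorded in FAF),
and that BMI is not a column in this dataset, so the tree can only approximate it
through splits on Weight and Height.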

7.3 K-Nearest Neighbors (KNN)

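# KNN uses the numeric predictors only
train_X <- train_data %>% select_if(is.numeric)
test_X <- test_data %>% select_if(is.numeric)
train_Y <- train_data$NObeyesdad
test_Y <- test_data$NObeyesdad

# Standardize with the training means and SDs, and reuse them for the test set.
# (The output below came from scaling test_X on its own, so a rerun may differ slightly.)
train_X <- scale(train_X)
test_X <- scale(test_X, center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))

knn_pred <- knn(train = train_X, test = test_X, cl = train_Y, k = 5)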

cm_knn <- confusionMatrix(knn_pred, test_Y)
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 67 8 0
## Normal_Weight 8 44 2
## Obesity_Type_I 0 2 92
## Obesity_Type_II 0 0 2
## Obesity_Type_III 0 0 1
## Overweight_Level_I 3 22 3
## Overweight_Level_II 3 10 5
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 2
## Normal_Weight 0 0 3
## Obesity_Type_I 0 0 9
## Obesity_Type_II 88 0 2
## Obesity_Type_III 1 97 0
## Overweight_Level_I 0 0 57
## Overweight_Level_II 0 0 14
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 1
## Normal_Weight 5
## Obesity_Type_I 9
## Obesity_Type_II 2
## Obesity_Type_III 0
## Overweight_Level_I 9
## Overweight_Level_II 61
##
## Overall Statistics
##
## Accuracy : 0.8006
## 95% CI : (0.7673, 0.8311)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.767
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.8272 0.51163
## Specificity 0.9800 0.96703
## Pos Pred Value 0.8590 0.70968
## Neg Pred Value 0.9747 0.92632
## Prevalence 0.1282 0.13608
## Detection Rate 0.1060 0.06962
## Detection Prevalence 0.1234 0.09810
## Balanced Accuracy 0.9036 0.73933
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.8762 0.9888
## Specificity 0.9620 0.9890
## Pos Pred Value 0.8214 0.9362
## Neg Pred Value 0.9750 0.9981
## Prevalence 0.1661 0.1408
## Detection Rate 0.1456 0.1392
## Detection Prevalence 0.1772 0.1487
## Balanced Accuracy 0.9191 0.9889
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 0.65517
## Specificity 0.9963 0.93211
## Pos Pred Value 0.9798 0.60638
## Neg Pred Value 1.0000 0.94424
## Prevalence 0.1535 0.13766
## Detection Rate 0.1535 0.09019
## Detection Prevalence 0.1566 0.14873
## Balanced Accuracy 0.9981 0.79364
## Class: Overweight_Level_II
## Sensitivity 0.70115
## Specificity 0.94128
## Pos Pred Value 0.65591
## Neg Pred Value 0.95176
## Prevalence 0.13766
## Detection Rate 0.09652
## Detection Prevalence 0.14715
## Balanced Accuracy 0.82122

Insight: KNN with k = 5 reaches 80.1% accuracy using only the eight numeric features,
about ten points below the decision tree (90.2%). Normal_Weight and Overweight_Level_I
are the weakest classes (sensitivity 0.51 and 0.66); further tuning may improve this.

# Heatmap of the KNN confusion matrix
cm_data <- as.data.frame(cm_knn$table)
ggplot(cm_data, aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  scale_fill_gradient(low = "blue", high = "red") +
  ggtitle("KNN Confusion Matrix Heatmap")

knn_acc <- cm_knn$overall['Accuracy']
knn_acc

## Accuracy
## 0.8006329
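
KNN is distance-based, so training and test features should sit on the same scale. A
common approach is to standardize the test set with the training set's means and
standard deviations. A minimal sketch on made-up numbers; `toy_train` and `toy_test`
are hypothetical matrices.

```r
# Hypothetical feature matrices: two numeric columns on very different scales
toy_train <- cbind(Weight = c(60, 80, 100, 120), Height = c(1.5, 1.6, 1.7, 1.8))
toy_test  <- cbind(Weight = c(90, 110), Height = c(1.65, 1.75))

# Standardize the training set and keep its centers and scales
train_scaled <- scale(toy_train)
ctr <- attr(train_scaled, "scaled:center")
sds <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set
test_scaled <- scale(toy_test, center = ctr, scale = sds)

test_scaled[1, "Weight"]   # 0, because 90 equals the training mean weight
```

Scaling the test set with its own mean and standard deviation would map the same raw
value to different standardized values in the two sets, which distorts the distances.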

7.4 K Tuning & Error Rate

# Test-set error rate for k = 1..20
error_rate <- numeric(20)
for (k in 1:20) {
  pred_k <- knn(train = train_X, test = test_X, cl = train_Y, k = k)
  error_rate[k] <- mean(pred_k != test_Y)
}
plot(1:20, error_rate, type = "b", col = "blue", pch = 19,
     xlab = "K Value", ylab = "Error Rate",
     main = "K Value vs Error Rate")

Tuning Tip: The minimum error rate may occur around k = 3 to 7. Because this curve is
computed on the test set, the final k should be chosen by cross-validation on the
training data instead.
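
One way to search for k without touching the test set is leave-one-out
cross-validation with knn.cv() from the class package. The sketch below uses the
built-in iris data as a stand-in, since the obesity file is not bundled with R; the
variable names are illustrative.

```r
library(class)

# Built-in iris data as a stand-in: four scaled numeric features, three classes
X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out error for k = 1..20, computed on the training data only
set.seed(123)  # knn.cv breaks ties at random
cv_error <- sapply(1:20, function(k) mean(knn.cv(X, cl = y, k = k) != y))

best_k <- which.min(cv_error)
best_k
```

In the report's setting, X and y would be train_X and train_Y, and the chosen k would
then be evaluated once on the test set.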

7.5 PCA Plot

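# Project the scaled training features onto the first two principal components
pca <- prcomp(train_X)
pca_df <- data.frame(pca$x[, 1:2], Class = train_Y)
ggplot(pca_df, aes(x = PC1, y = PC2, color = Class)) +
  geom_point(size = 2) +
  ggtitle("PCA Plot of Obesity Classes (Training Set)") +
  theme_minimal()

Observation: The first two principal components show good separation between obesity
classes, which supports the KNN model's classification potential. The points are
colored by the true training labels, not by KNN predictions.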
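
How much a two-component plot can show depends on the share of variance that PC1 and
PC2 carry. A minimal sketch of that check, again using iris as a stand-in for the
scaled training features; `pca_demo` and `var_explained` are illustrative names.

```r
# Built-in iris measurements as a stand-in for the scaled training features
pca_demo <- prcomp(scale(iris[, 1:4]))

# Proportion of total variance carried by each principal component
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
round(var_explained, 3)

# Share captured by the two components a 2-D plot can show
sum(var_explained[1:2])
```

Running the same two lines on the report's pca object would show how much of the
eight-feature variance the PC1/PC2 scatter plot actually represents.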

8. Additional Plot

ggplot(data, aes(x = Age, y = Weight, color = NObeyesdad)) +
  geom_point(size = 2) +
  ggtitle("Age vs Weight by Obesity Class")

Insight: Age combined with weight contributes to understanding obesity
trends better than either variable alone.

Conclusion
This report analyzes the obesity data through outlier checks, distribution plots, and
multivariate visualizations, which point to weight, height, age, and physical activity
as key predictors. Two classifiers were fitted: the Decision Tree reached 90.2% test
accuracy and KNN (k = 5) reached 80.1%. Both were weakest on the Normal_Weight class
(sensitivity 0.62 and 0.51). Future work could incorporate ensemble methods and
feature selection for enhanced accuracy.
