Obesity Data Analysis Report

Marios Natsios

2025-04-10

Table of Contents
Introduction
1. Subsetting the Dataset
2. Data Loading and Preparation
3. Data Exploration
    3.1 Missing Values
    3.2 Outlier Detection
4. Univariate & Bivariate Visualizations
    4.1 Obesity Class Distribution
    4.2 Age Distribution by Obesity Class
    4.3 Weight vs Height Scatter Plot
    4.4 Alcohol Consumption Frequency
    4.5 Weight Density Plot by Obesity
5. Correlation and Multivariate Relationships
    5.1 Correlation Matrix
    5.2 Pair Plot
6. Frequency Tables for Categorical Variables
7. Classification Models
    7.1 Train/Test Split
    7.2 Decision Tree Model
    7.3 K-Nearest Neighbors (KNN)
    7.4 K Tuning & Error Rate
    7.5 PCA Plot
8. Additional Plot
Conclusion

Introduction
Obesity is a growing public health concern and a complex medical condition involving
excessive body fat. It is associated with serious conditions like cardiovascular diseases,
diabetes, and joint problems. This report performs a detailed data analysis of obesity using
Exploratory Data Analysis (EDA), correlation assessment, and classification models
(Decision Tree and K-Nearest Neighbors). The goal is to understand which variables
influence obesity and how well we can predict it using machine learning.

1. Subsetting the Dataset


# Load data
obesity_data <- read.csv("C:/Users/Marios/OneDrive/Documents/Obesity.csv")

# Set seed using your student ID


set.seed(6385883)

# Generate a random subset of 1000 observations


subset_obesity <- obesity_data[sample(nrow(obesity_data), 1000), ]

# View the subset


head(subset_obesity)

##      Gender Age Height Weight family_history_with_overweight FAVC FCVC  NCP
## 643    Male  18   1.72  52.06                            yes  yes 1.95 3.00
## 1020 Female  29   1.53  65.03                            yes   no 2.00 1.00
## 1270   Male  18   1.79 108.24                            yes  yes 2.00 2.14
## 1253   Male  23   1.79 105.14                            yes  yes 2.26 3.00
## 1060 Female  34   1.66  80.39                            yes  yes 2.00 3.00
## 1379   Male  33   1.78 103.77                            yes  yes 2.90 1.73
##           CAEC SMOKE CH2O SCC  FAF   TUE      CALC                MTRANS
## 643  Sometimes    no 1.75  no 0.20 1.743 Sometimes Public_Transportation
## 1020 Sometimes    no 1.00  no 0.26 0.000        no Public_Transportation
## 1270 Sometimes    no 2.94  no 1.00 0.103 Sometimes Public_Transportation
## 1253 Sometimes    no 2.03  no 0.32 0.509 Sometimes Public_Transportation
## 1060 Sometimes    no 2.64  no 0.29 1.503        no            Automobile
## 1379 Sometimes    no 2.60  no 2.11 0.585 Sometimes            Automobile
##               NObeyesdad
## 643  Insufficient_Weight
## 1020 Overweight_Level_II
## 1270      Obesity_Type_I
## 1253      Obesity_Type_I
## 1060 Overweight_Level_II
## 1379      Obesity_Type_I

# Save the subset to a new CSV (optional)


write.csv(subset_obesity, "Obesity_Subset.csv", row.names = FALSE)

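Note: The random sample of 1000 records is intended to reduce computation
while preserving diversity in the data. However, the sections that follow
reload Obesity.csv, and str() reports 2111 observations, so the analysis
below runs on the full dataset rather than on this subset.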

2. Data Loading and Preparation


library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(ggcorrplot)
library(GGally)
library(caret)
library(rpart)
library(rpart.plot)
library(class)
library(pROC)
library(reshape2)
library(corrplot)
library(summarytools)

data <- read.csv("C:/Users/nares/OneDrive/Documents/Obesity.csv")


data$NObeyesdad <- as.factor(data$NObeyesdad)
str(data)

## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : int 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...

summary(data)

## Gender Age Height Weight


## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:20.00 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :23.00 Median :1.700 Median : 83.00
## Mean :24.32 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.770 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
##
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.390
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
##
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.660 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.480
## Max. :4.000 Max. :3.000
##
## SCC FAF TUE CALC
## Length:2111 Min. :0.000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.125 1st Qu.:0.0000 Class :character
## Mode :character Median :1.000 Median :0.6250 Mode :character
## Mean :1.010 Mean :0.6579
## 3rd Qu.:1.670 3rd Qu.:1.0000
## Max. :3.000 Max. :2.0000
##
## MTRANS NObeyesdad
## Length:2111 Insufficient_Weight:272
## Class :character Normal_Weight :287
## Mode :character Obesity_Type_I :351
## Obesity_Type_II :297
## Obesity_Type_III :324
## Overweight_Level_I :290
## Overweight_Level_II:290
3. Data Exploration
3.1 Missing Values
colSums(is.na(data))

## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad
## 0

Insight: No missing values were found in any of the 17 columns, so no imputation is needed before modeling.

3.2 Outlier Detection


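# Keep only the numeric columns
numeric_cols <- select_if(data, is.numeric)

# Base R: one boxplot per numeric variable, four panels per page
par(mfrow = c(2, 2))
for (i in colnames(numeric_cols)) {
  boxplot(numeric_cols[[i]], main = i, col = "lightblue")
}
par(mfrow = c(1, 1))  # restore the single-panel layout

# ggplot2: all variables side by side (gather() yields key/value pairs)
long_data <- gather(numeric_cols)
ggplot(long_data, aes(x = key, y = value)) +
  geom_boxplot(fill = "orange") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Boxplot for Outlier Detection")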

Observation: Some numeric variables (especially weight and height) contain
outliers which may affect model performance.
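
The points drawn beyond the whiskers in a boxplot are those lying more than 1.5 × IQR outside the quartiles. A minimal sketch of counting such points per column follows. The helper name `count_iqr_outliers` and the toy values are assumptions made for illustration, and `quantile()` can differ slightly from the hinges that `boxplot()` uses internally.

```r
# Hypothetical helper: count values lying more than k * IQR beyond the quartiles,
# the same idea boxplots use to mark individual outlier points.
count_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy data standing in for two numeric columns (values are illustrative only)
toy <- data.frame(
  Weight = c(52, 65, 80, 83, 87, 90, 105, 108, 173),
  Height = c(1.53, 1.62, 1.66, 1.70, 1.72, 1.78, 1.79, 1.80, 1.98)
)

# One count per column
outlier_counts <- sapply(toy, count_iqr_outliers)
print(outlier_counts)
```

Applied to the real data, the same call would be `sapply(numeric_cols, count_iqr_outliers)`, which gives a number to go with each boxplot.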

4. Univariate & Bivariate Visualizations


4.1 Obesity Class Distribution
ggplot(data, aes(x = NObeyesdad, fill = NObeyesdad)) +
geom_bar() +
labs(title = "Obesity Class Distribution", x = "Obesity Class", y =
"Count") +
theme_minimal()
Summary: The seven classes are fairly balanced, ranging from 272 records
(Insufficient_Weight) to 351 (Obesity_Type_I) out of 2111. Severe class
imbalance is not a concern, though per-class metrics are still worth reporting.

4.2 Age Distribution by Obesity Class


ggplot(data, aes(x = NObeyesdad, y = Age, fill = NObeyesdad)) +
geom_boxplot() +
labs(title = "Age Distribution by Obesity Class")
Insight: Normal and overweight individuals are often younger; higher age
variability appears in obesity classes.

4.3 Weight vs Height Scatter Plot


ggplot(data, aes(x = Height, y = Weight, color = NObeyesdad)) +
geom_point() +
labs(title = "Weight vs Height by Obesity Class")
Insight: A clear trend shows that higher weights at fixed height correlate with
more severe obesity.

4.4 Alcohol Consumption Frequency


ggplot(data, aes(x = CALC)) +
geom_bar(fill = "steelblue") +
labs(title = "Alcohol Consumption Frequency", x = "CALC")
Summary: The majority of participants report consuming alcohol sometimes or
not at all.

4.5 Weight Density Plot by Obesity


ggplot(data, aes(x = Weight, fill = NObeyesdad)) +
geom_density(alpha = 0.5) +
ggtitle("Density Plot of Weight")
Summary: Higher weights shift the density toward obesity classes,
confirming weight as a strong indicator.

5. Correlation and Multivariate Relationships


5.1 Correlation Matrix
cor_matrix <- cor(numeric_cols)
ggcorrplot(cor_matrix, lab = TRUE, title = "Correlation Heatmap")
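Key Finding: BMI and weight are expected to be strongly correlated, making
both useful features in obesity prediction. Note that BMI is not among the 17
columns listed by str(), so it does not appear in this heatmap unless it is
first derived from Weight and Height.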
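
A minimal sketch of deriving BMI and checking its correlation with weight, assuming Weight is in kilograms and Height in metres (the ranges in summary() are consistent with this, but the source does not state units). The six rows are copied from the head() output in Section 1. Six rows are far too few to support conclusions, so this only illustrates the mechanics.

```r
# Six rows copied from the head(subset_obesity) output in Section 1
rows <- data.frame(
  Height = c(1.72, 1.53, 1.79, 1.79, 1.66, 1.78),
  Weight = c(52.06, 65.03, 108.24, 105.14, 80.39, 103.77)
)

# Derive BMI as weight / height^2 (assumes kilograms and metres)
rows$BMI <- rows$Weight / rows$Height^2

# Pairwise Pearson correlations among Height, Weight and the derived BMI
bmi_cor <- cor(rows)
print(round(rows$BMI, 2))
print(round(bmi_cor, 2))
```

On the full data, adding the column with `data$BMI <- data$Weight / data$Height^2` before building `numeric_cols` would bring BMI into the heatmap.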

5.2 Pair Plot


ggpairs(data, columns = c("Age", "Weight", "Height", "FCVC"), aes(color =
NObeyesdad))
Insight: Clear class separation in multivariate relationships, aiding
classification.

6. Frequency Tables for Categorical Variables


cat_cols <- select_if(data, is.factor)
for (col in colnames(cat_cols)) {
print(freq(data[[col]]))
}

## Error in table(names(candidates))[["tested"]]: subscript out of bounds

## Frequencies
##
##                         Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------- ------ --------- -------------- --------- --------------
##   Insufficient_Weight    272     12.88          12.88     12.88          12.88
##         Normal_Weight    287     13.60          26.48     13.60          26.48
##        Obesity_Type_I    351     16.63          43.11     16.63          43.11
##       Obesity_Type_II    297     14.07          57.18     14.07          57.18
##      Obesity_Type_III    324     15.35          72.52     15.35          72.52
##    Overweight_Level_I    290     13.74          86.26     13.74          86.26
##   Overweight_Level_II    290     13.74         100.00     13.74         100.00
##                  <NA>      0                               0.00         100.00
##                 Total   2111    100.00         100.00    100.00         100.00

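Observation: Only NObeyesdad is tabulated above, because it is the only
column stored as a factor. The other categorical variables (e.g., MTRANS,
FAVC) were read as character columns, so select_if(data, is.factor) skips
them; they need to be converted or selected explicitly before their variation
across classes can be inspected.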
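
One way to cover the character-typed columns is to select on `is.character` as well as `is.factor`. The sketch below uses base R `table()` rather than `freq()`, on toy rows whose FAVC and MTRANS values are taken from the head() output in Section 1. The data frame `toy` is a stand-in for `data`.

```r
# Toy rows with character-typed categorical columns, as read.csv() produced them
toy <- data.frame(
  FAVC = c("yes", "no", "yes", "yes", "yes", "yes"),
  MTRANS = c("Public_Transportation", "Public_Transportation",
             "Public_Transportation", "Public_Transportation",
             "Automobile", "Automobile"),
  Age = c(18, 29, 18, 23, 34, 33),
  stringsAsFactors = FALSE
)

# Flag columns that are either character or factor
is_cat <- sapply(toy, function(x) is.character(x) || is.factor(x))

# One frequency table per categorical column
cat_tables <- lapply(toy[is_cat], table)
print(cat_tables)
```

Crossing a column with the outcome, for example `table(data$MTRANS, data$NObeyesdad)`, would then show how each category is distributed across the obesity classes.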

7. Classification Models
7.1 Train/Test Split
set.seed(123)
splitIndex <- createDataPartition(data$NObeyesdad, p = 0.7, list = FALSE)
train_data <- data[splitIndex, ]
test_data <- data[-splitIndex, ]
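
createDataPartition() samples within each class of a factor outcome, so the training and test parts keep roughly the same class mix. The sketch below illustrates that idea in base R. It is not caret's implementation, and the class labels and sizes are made up for illustration.

```r
set.seed(123)

# Toy outcome with three classes of unequal size (stand-in for NObeyesdad)
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_y <- y[train_idx]
test_y <- y[-train_idx]

# Class proportions should match between the two parts
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```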

7.2 Decision Tree Model


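# Fit a classification tree on all predictors
dt_model <- rpart(NObeyesdad ~ ., data = train_data, method = "class")
# extra = 104: per-class probabilities plus node percentages (106 is meant for two-class responses)
rpart.plot(dt_model, extra = 104, fallen.leaves = TRUE)
# Evaluate on the held-out test set
dt_pred <- predict(dt_model, test_data, type = "class")
confusionMatrix(dt_pred, test_data$NObeyesdad)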

## Confusion Matrix and Statistics


##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 75 15 0
## Normal_Weight 6 53 0
## Obesity_Type_I 0 0 103
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
## Overweight_Level_I 0 14 0
## Overweight_Level_II 0 4 2
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 4
## Obesity_Type_I 6 0 0
## Obesity_Type_II 83 0 0
## Obesity_Type_III 0 97 0
## Overweight_Level_I 0 0 76
## Overweight_Level_II 0 0 7
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 0
## Obesity_Type_I 4
## Obesity_Type_II 0
## Obesity_Type_III 0
## Overweight_Level_I 0
## Overweight_Level_II 83
##
## Overall Statistics
##
## Accuracy : 0.9019
## 95% CI : (0.876, 0.924)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8854
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.9259 0.61628
## Specificity 0.9728 0.98168
## Pos Pred Value 0.8333 0.84127
## Neg Pred Value 0.9889 0.94200
## Prevalence 0.1282 0.13608
## Detection Rate 0.1187 0.08386
## Detection Prevalence 0.1424 0.09968
## Balanced Accuracy 0.9494 0.79898
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.9810 0.9326
## Specificity 0.9810 1.0000
## Pos Pred Value 0.9115 1.0000
## Neg Pred Value 0.9961 0.9891
## Prevalence 0.1661 0.1408
## Detection Rate 0.1630 0.1313
## Detection Prevalence 0.1788 0.1313
## Balanced Accuracy 0.9810 0.9663
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 0.8736
## Specificity 1.0000 0.9743
## Pos Pred Value 1.0000 0.8444
## Neg Pred Value 1.0000 0.9797
## Prevalence 0.1535 0.1377
## Detection Rate 0.1535 0.1203
## Detection Prevalence 0.1535 0.1424
## Balanced Accuracy 1.0000 0.9239
## Class: Overweight_Level_II
## Sensitivity 0.9540
## Specificity 0.9761
## Pos Pred Value 0.8646
## Neg Pred Value 0.9925
## Prevalence 0.1377
## Detection Rate 0.1313
## Detection Prevalence 0.1519
## Balanced Accuracy 0.9651
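
The headline numbers in the output above can be re-derived by hand from the confusion matrix. Accuracy is the diagonal sum over the total, sensitivity is each diagonal entry over its column (reference) total, and positive predictive value is each diagonal entry over its row (prediction) total. A short check, with the counts copied from the table:

```r
classes <- c("Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
             "Obesity_Type_II", "Obesity_Type_III",
             "Overweight_Level_I", "Overweight_Level_II")

# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(
  75, 15,   0,  0,  0,  0,  0,
   6, 53,   0,  0,  0,  4,  0,
   0,  0, 103,  6,  0,  0,  4,
   0,  0,   0, 83,  0,  0,  0,
   0,  0,   0,  0, 97,  0,  0,
   0, 14,   0,  0,  0, 76,  0,
   0,  4,   2,  0,  0,  7, 83
), nrow = 7, byrow = TRUE,
dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)      # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per-class recall
pos_pred_value <- diag(cm) / rowSums(cm) # per-class precision

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```

This reproduces the reported accuracy of 0.9019 (570 of 632 test cases) and the Normal_Weight sensitivity of 0.61628 (53 of 86), the weakest class for the tree.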

varImpPlot <- as.data.frame(varImp(dt_model))


ggplot(varImpPlot, aes(x = reorder(rownames(varImpPlot), Overall), y =
Overall)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + ggtitle("Decision Tree Variable Importance")

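Interpretation: Weight and FCVC are reported as playing key roles in decision
tree splits. Note that FCVC records vegetable consumption frequency (physical
activity is FAF), and BMI is not a column in this dataset, so it cannot appear
among the split variables.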

7.3 K-Nearest Neighbors (KNN)


train_X <- train_data %>% select_if(is.numeric)
test_X <- test_data %>% select_if(is.numeric)
train_Y <- train_data$NObeyesdad
test_Y <- test_data$NObeyesdad
train_X <- scale(train_X)
test_X <- scale(test_X)

knn_pred <- knn(train = train_X, test = test_X, cl = train_Y, k = 5)


cm_knn <- confusionMatrix(knn_pred, test_Y)
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 67 8 0
## Normal_Weight 8 44 2
## Obesity_Type_I 0 2 92
## Obesity_Type_II 0 0 2
## Obesity_Type_III 0 0 1
## Overweight_Level_I 3 22 3
## Overweight_Level_II 3 10 5
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 2
## Normal_Weight 0 0 3
## Obesity_Type_I 0 0 9
## Obesity_Type_II 88 0 2
## Obesity_Type_III 1 97 0
## Overweight_Level_I 0 0 57
## Overweight_Level_II 0 0 14
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 1
## Normal_Weight 5
## Obesity_Type_I 9
## Obesity_Type_II 2
## Obesity_Type_III 0
## Overweight_Level_I 9
## Overweight_Level_II 61
##
## Overall Statistics
##
## Accuracy : 0.8006
## 95% CI : (0.7673, 0.8311)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.767
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.8272 0.51163
## Specificity 0.9800 0.96703
## Pos Pred Value 0.8590 0.70968
## Neg Pred Value 0.9747 0.92632
## Prevalence 0.1282 0.13608
## Detection Rate 0.1060 0.06962
## Detection Prevalence 0.1234 0.09810
## Balanced Accuracy 0.9036 0.73933
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.8762 0.9888
## Specificity 0.9620 0.9890
## Pos Pred Value 0.8214 0.9362
## Neg Pred Value 0.9750 0.9981
## Prevalence 0.1661 0.1408
## Detection Rate 0.1456 0.1392
## Detection Prevalence 0.1772 0.1487
## Balanced Accuracy 0.9191 0.9889
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 0.65517
## Specificity 0.9963 0.93211
## Pos Pred Value 0.9798 0.60638
## Neg Pred Value 1.0000 0.94424
## Prevalence 0.1535 0.13766
## Detection Rate 0.1535 0.09019
## Detection Prevalence 0.1566 0.14873
## Balanced Accuracy 0.9981 0.79364
## Class: Overweight_Level_II
## Sensitivity 0.70115
## Specificity 0.94128
## Pos Pred Value 0.65591
## Neg Pred Value 0.95176
## Prevalence 0.13766
## Detection Rate 0.09652
## Detection Prevalence 0.14715
## Balanced Accuracy 0.82122

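Insight: KNN with k = 5 reaches 80.1% test accuracy (Kappa 0.767) using only
the numeric predictors. It is weakest on Normal_Weight (sensitivity 0.51) and
Overweight_Level_I (0.66); tuning k may improve performance.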
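
In the code above, scale(test_X) standardizes the test set with its own means and standard deviations, so training and test features end up on slightly different scales. A common alternative is to reuse the training-set statistics when scaling the test set. The sketch below shows this on toy matrices; `train_raw` and `test_raw` are hypothetical stand-ins for the unscaled numeric features. The accuracy figures reported in this section come from the original code and would likely shift somewhat under this change.

```r
set.seed(1)

# Toy numeric features standing in for the unscaled train and test matrices
train_raw <- matrix(rnorm(40, mean = 80, sd = 20), ncol = 2,
                    dimnames = list(NULL, c("Weight", "Age")))
test_raw <- matrix(rnorm(10, mean = 80, sd = 20), ncol = 2,
                   dimnames = list(NULL, c("Weight", "Age")))

# Fit the centering and scaling on the training set only
train_scaled <- scale(train_raw)
train_center <- attr(train_scaled, "scaled:center")
train_sd <- attr(train_scaled, "scaled:scale")

# Apply the training-set statistics to the test set
test_scaled <- scale(test_raw, center = train_center, scale = train_sd)

print(round(colMeans(train_scaled), 10))  # zero by construction
print(round(colMeans(test_scaled), 3))    # generally not exactly zero
```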
cm_data <- as.data.frame(cm_knn$table)
ggplot(cm_data, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white", size = 5) +
scale_fill_gradient(low = "blue", high = "red") +
ggtitle("KNN Confusion Matrix Heatmap")
knn_acc <- cm_knn$overall['Accuracy']
knn_acc

## Accuracy
## 0.8006329

7.4 K Tuning & Error Rate


error_rate <- c()
for (k in 1:20) {
pred_k <- knn(train = train_X, test = test_X, cl = train_Y, k = k)
error_rate[k] <- mean(pred_k != test_Y)
}
plot(1:20, error_rate, type = "b", col = "blue", pch = 19,
xlab = "K Value", ylab = "Error Rate",
main = "K Value vs Error Rate")
Tuning Tip: Minimum error rate may occur at k = 3 to 7, suggesting optimal K
should be searched using cross-validation.
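
The loop above scores each k on the test set, which makes the chosen k look better than it would on new data. A minimal sketch of choosing k by cross-validation on training data alone follows, using leave-one-out CV via knn.cv() from the class package. The built-in iris data is used purely as a stand-in for the scaled training features, so the selected k says nothing about the obesity data.

```r
library(class)  # provides knn.cv(); class usually ships with standard R installations

set.seed(123)

# iris is only a stand-in for the scaled training features and labels
X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out cross-validated error for k = 1..20, using training data only
k_values <- 1:20
cv_error <- sapply(k_values, function(k) {
  pred <- knn.cv(train = X, cl = y, k = k)
  mean(pred != y)
})

best_k <- k_values[which.min(cv_error)]
print(data.frame(k = k_values, cv_error = round(cv_error, 3)))
print(best_k)
```

With the report's objects, the equivalent call would be `knn.cv(train = train_X, cl = train_Y, k = k)`, leaving the test set untouched until the final evaluation.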

7.5 PCA Plot


pca <- prcomp(train_X)
pca_df <- data.frame(pca$x[, 1:2], Class = train_Y)
ggplot(pca_df, aes(x = PC1, y = PC2, color = Class)) +
geom_point(size = 2) +
ggtitle("PCA Plot of KNN Classes") +
theme_minimal()
Observation: PCA shows good separation between obesity classes,
validating the KNN model’s classification potential.
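
How much a two-dimensional PCA plot can show depends on the share of variance carried by the first two components. A minimal sketch of that check, again with iris as a stand-in for the scaled numeric training features (the printed shares therefore describe iris, not the obesity data):

```r
# iris stands in for the scaled numeric training features (train_X)
X <- scale(iris[, 1:4])
pca <- prcomp(X)

# Proportion of total variance carried by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

print(round(var_explained, 3))
print(round(cum_var[2], 3))  # share of variance visible in a PC1-vs-PC2 plot
```

If the first two components carry only a small share, classes that overlap in the plot may still be separable in the full feature space, and the reverse can also hold.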

8. Additional Plot
ggplot(data, aes(x = Age, y = Weight, color = NObeyesdad)) +
geom_point(size = 2) +
ggtitle("Age vs Weight by Obesity Class")
Conclusion: Age combined with weight contributes to understanding obesity
trends better than either alone.

Conclusion
This report analyzes obesity data through outlier checks, distribution patterns, and
multivariate visualizations, which point to weight, height, age, and activity as key
predictors. Two classifiers were evaluated on a 30% hold-out set: the Decision Tree
reached 90.2% accuracy (Kappa 0.885) and KNN with k = 5 reached 80.1% (Kappa 0.767).
Future work could incorporate ensemble methods and feature selection for enhanced
accuracy.
