Data Processing Script

SLIDE 3: INTRODUCTION

Life expectancy is the average number of years a person is expected
to live. It depends on many factors, such as health, income,
education, and access to medical care.

This project aims to explore two central questions:

1. Which variables have the greatest impact on life expectancy?


2. Can we predict life expectancy using available data?

To answer these questions, public data from the World Health
Organization is used.

SLIDE 4: DATASET OVERVIEW

The dataset contains information on 193 countries between 2000 and
2015. It includes 22 variables, grouped as follows:

• Demographic: population, child and adult deaths
• Economic: GDP, income level
• Health: vaccination rates, alcohol use, HIV/AIDS
• Social: years of schooling, development status

The goal is to study how these relate to life expectancy, the main
variable we want to explain and predict.

SLIDE 5: DATA PREPROCESSING

Before starting the analysis, the dataset was cleaned and prepared.

First, I checked the structure and missing values. The year 2015 had
too many missing values, so I removed it from the dataset.

Next, I filled in the remaining missing values:

• For numerical variables, I used the median of each country.
• For categorical variables, I used the mode of each country.

Some variables were still missing for entire countries, so I filled
those with the global average across all countries.
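
The imputation steps above can be sketched roughly as follows. This is only an illustration using pandas, not necessarily the tool used in the project; the function name `impute_by_country` and the column name `Country` are assumptions.

```python
import pandas as pd

def impute_by_country(df, country_col="Country"):
    """Fill gaps per country (median / mode), then fall back to the global mean.

    Hypothetical helper: the column name and the fallback order are assumptions.
    """
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = [c for c in out.columns if c not in num_cols and c != country_col]

    # Numerical variables: median of the same country
    for col in num_cols:
        country_median = out.groupby(country_col)[col].transform("median")
        out[col] = out[col].fillna(country_median)

    # Categorical variables: most frequent value (mode) of the same country
    for col in cat_cols:
        modes = {}
        for country, values in out.groupby(country_col)[col]:
            m = values.mode()  # mode() ignores missing values
            if not m.empty:
                modes[country] = m.iloc[0]
        out[col] = out[col].fillna(out[country_col].map(modes))

    # Variables missing for a whole country: global mean over all rows
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    return out
```

Note that in this sketch the global mean is computed after the per-country step, so it already includes the country-level medians.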

SLIDE 6: EXPLORATORY DATA ANALYSIS

I started by making scatter plots of each variable against life
expectancy, which showed the general trends. I also computed a
correlation matrix to find which variables were most strongly related
to life expectancy in a linear way.
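
A minimal sketch of how the correlation ranking could be computed, assuming the data sits in a pandas DataFrame and that the target column is called `Life expectancy` (both are assumptions, as is the helper name `top_correlated`):

```python
import pandas as pd

def top_correlated(df, target="Life expectancy", k=8):
    """Return the k numeric columns with the largest |Pearson r| against target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Rank by absolute value so strong negative relationships are kept too
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(k)
```

Ranking by absolute value matters here because variables such as HIV/AIDS or adult mortality are expected to correlate negatively with life expectancy.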

I plotted the 8 variables most correlated with life expectancy and
noticed that schooling and income composition contained zero values
that looked incorrect. I replaced these using values from nearby
years and removed the countries that had no valid data.
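
One possible way to do this repair is sketched below. The exact method used in the project is not stated, so linear interpolation within each country is an assumption, as are the helper name `repair_zeros` and the column names `Country` and `Year`.

```python
import numpy as np
import pandas as pd

def repair_zeros(df, cols, country_col="Country", year_col="Year"):
    """Treat zeros in cols as missing and fill them from nearby years.

    Assumption: 'nearby years' is implemented as interpolation inside each
    country; countries with no valid value at all are dropped.
    """
    out = df.sort_values([country_col, year_col]).copy()
    for col in cols:
        # Zeros are considered invalid placeholders, so mark them as missing
        out[col] = out[col].astype(float).replace(0, np.nan)
        # Fill each gap from the surrounding years of the same country
        out[col] = out.groupby(country_col)[col].transform(
            lambda s: s.interpolate(limit_direction="both")
        )
    # Rows still missing belong to countries without any valid data
    return out.dropna(subset=cols)
```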

I also noticed a bimodal shape in the plot of adult mortality vs.
life expectancy. I could not find a clear reason for it, so for
prediction I decided to remove the lower group of points. (Include
many of these plots.)

Then I focused on the two most correlated variables (schooling and
income) and saw a fairly clear linear trend between each of them and
life expectancy.

I also found some interesting results in plots grouped by region. For
easier visualization, I chose one representative country from each
major region:

In this case, the distributions were clearly different between
countries. Schooling and income both vary by country and are strongly
related to life expectancy: the more developed a country is, the
higher its schooling, income, and life expectancy.

Another clear pattern appeared when grouping by development status
(developed or developing countries).

For the 5 representative countries, I saw a slight increase in life
expectancy over time, although there were some outliers.

Next, I tested four types of models (linear, quadratic, cubic, and
logarithmic) to see which worked best for each variable, since until
then I had only looked at linear correlations. I used R² to compare
the models for each variable. I also removed outliers from some
variables (BMI, Diphtheria, Polio, and Adult Mortality) to improve
the fits.
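
The comparison of the four functional forms can be sketched as below, using NumPy least-squares fits. The helper names `r_squared` and `compare_forms` are hypothetical, and fitting the logarithmic form only when all x values are positive is an assumption.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def compare_forms(x, y):
    """Fit linear, quadratic, cubic and logarithmic models; return R² per form."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = {}
    for name, degree in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
        coeffs = np.polyfit(x, y, degree)
        scores[name] = r_squared(y, np.polyval(coeffs, x))
    # y = a * ln(x) + b is only defined for strictly positive x
    if np.all(x > 0):
        a, b = np.polyfit(np.log(x), y, 1)
        scores["logarithmic"] = r_squared(y, a * np.log(x) + b)
    return scores
```

One caveat: on the data used for fitting, a cubic can never score a lower R² than a quadratic or linear fit, because the lower-degree models are special cases of it. Picking purely by in-sample R² therefore tends to favor the cubic form, so a small gain may not justify the extra terms.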

Because of this, the correlation rankings changed, so I selected the
top 8 variables again and created new final plots using the best
model for each one.

SLIDE 7: PREDICTIVE MODEL

The objective of this stage was to build a regression model that
predicts life expectancy from the eight most strongly related
variables, each transformed using its best-fitting functional form
(linear, quadratic, cubic, or logarithmic).

Methodology:
• Selected the 8 variables with the highest R² in the previous
  analysis.
• Applied the best-fitting transformation to each variable to
  maximize model performance.
• Split the dataset randomly into 80% training and 20% testing
  subsets.
• Trained a multiple regression model on the transformed variables.
• Evaluated the model using MSE and RMSE on the test set.
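
The methodology above can be sketched with plain NumPy as follows. This is an illustrative outline rather than the project's actual code: the function names, the fixed random seed, and the use of ordinary least squares via `np.linalg.lstsq` are assumptions.

```python
import numpy as np

def split_80_20(X, y, test_frac=0.2, seed=0):
    """Randomly split rows into training and testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

def fit_ols(X, y):
    """Multiple linear regression; returns [intercept, coef_1, ..., coef_p]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply the fitted coefficients to new rows."""
    return beta[0] + X @ beta[1:]

def mse_rmse(y, y_hat):
    """Mean squared error and its square root, in the units of y."""
    mse = float(np.mean((y - y_hat) ** 2))
    return mse, mse ** 0.5
```

Here `X` would hold the already transformed predictors (for example a column of log values, or squared and cubed columns for polynomial forms), and the metrics would be computed on the test rows only: `mse_rmse(y_test, predict(fit_ols(X_train, y_train), X_test))`.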

Model performance:

• R² = 0.8663, Adjusted R² = 0.8658: the model explains 86.6% of the
  variance in life expectancy, a very strong result in the context of
  health and development data.
• RMSE = 3.72 years: the typical prediction error is about 3.7 years,
  which is reasonable given that life expectancy values typically
  range from 40 to 85 years.
• Statistical significance: the F-statistic and its p-value
  (< 2.2e−16) confirm the model's global significance.
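
For reference, adjusted R² penalizes R² for the number of predictors. The standard formula is sketched below; the function name is hypothetical, and the number of observations and predictors in the project's model is not stated here, so no attempt is made to reproduce the reported value.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof
```

When the number of observations is large relative to the number of predictors, the two values are nearly equal, which is consistent with the small gap between 0.8663 and 0.8658.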

Most impactful variables:

• Significant predictors (low p-values):
   • HIV/AIDS: strong negative effect (−4.08)
   • Income composition of resources: strong positive effect
     (+27.25)
   • Adult Mortality: negative influence (−0.014)
   • Other significant variables: Percentage expenditure,
     Diphtheria, Polio, BMI.
• Schooling was not significant (p = 0.604), likely due to
  multicollinearity or loss of variation after transformation.
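
One common way to check the multicollinearity hypothesis is the variance inflation factor (VIF), sketched below with NumPy. This check is not described in the project; the function name `vif` is hypothetical, and the usual rule of thumb that values above roughly 5 to 10 signal a problem is a convention, not a strict threshold.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R²_j),
    where R²_j comes from regressing column j on all other columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])  # intercept + others
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)
```

If schooling and income composition both showed high VIF values, that would support the idea that they carry overlapping information, so the model assigns the shared effect to one of them.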

To answer the two initial questions: for the first one, the variables
with the greatest impact are mainly income composition, schooling,
HIV/AIDS, and immunization. Schooling is included because of its
strong correlation with life expectancy, even though it was not
significant in the final model.

For the second question, on prediction: yes, life expectancy can be
predicted from this data with relatively high accuracy.
