MAKERERE UNIVERSITY
COLLEGE OF BUSINESS AND MANAGEMENT SCIENCES
SCHOOL OF STATISTICS AND PLANNING
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCES
DATA ANALYSIS 3 COURSEWORK REPORT
NAME                         STUDENT NO.   REG. NO.
NALUTAAYA AGNES LUNKUSE      1800700023    18/U/023
TUMWEBAZE CLARITY            1800700022    18/U/022
KAGGA IVAN CLIFF MAZZI       217005264     17/U/4354/PS
TUHAIRWE DUNCAN              1800723176    18/U/23176/PS
SSEKABIRA CAROL NANDAWULA    216004779     16/U/11552/PS
MUNEZERO BONIVENTURE         1800723165    18/U/23165/PS
ATWIINE GLORIA               1800741655    18/U/41655
TSIKHABI JOSHUA I MASAWI     1800714171    18/U/14171/PS
INTRODUCTION
The dataset was downloaded from https://www.kaggle.com/datasets and saved to the desktop.
Because the site provides the data as a CSV file, it was converted to an Excel workbook,
imported into Stata and then used in the analysis.
It contains 18,207 observations (rows) and 80 features (variables). In this exercise we examine
how international reputation, skill moves, long shots, strength and vision affect the wage that
each footballer in the dataset earns.
We propose the following multiple linear regression model:
Wage = β + α*InternationalReputation + κ*SkillMoves + γ*Strength + θ*LongShots + ρ*Vision
Where:
Wage is the dependent variable.
InternationalReputation, SkillMoves, Strength, LongShots and Vision are the independent
variables.
β is the intercept.
α, κ, γ, θ and ρ are the coefficients that measure how strongly Wage depends on
InternationalReputation, SkillMoves, Strength, LongShots and Vision respectively.
PART ONE: DATA CLEANING
First, the codebook command was run; it gave a description of the data for every variable in
the dataset.
• Dealing with duplicates
A duplicates report was generated and there were no duplicates in the entire dataset. The
command and output were:
. duplicates report
Duplicates in terms of all variables

    copies   observations   surplus
         1          18207         0
• Cleaning variable Wage and generating Salary
The dependent variable Wage carried a euro currency symbol (€) and a “K” suffix for thousands,
so Stata read it as a string variable and it had to be cleaned. The euro symbol was split off
and the remaining text was destringed while ignoring the “K” at the end. The new variable was
named salary. It was then multiplied by 1000 to convert the figures from thousands of euros to
euros. The commands and output were:
. split Wage , parse(€) generate (wage_y)
variables created as string:
wage_y1 wage_y2
.
. destring wage_y2, generate(salary) ignore("K")
wage_y2: character K removed; salary generated as int
(241 missing values generated)
.
. replace salary = salary * 1000
variable salary was int now long
(17,966 real changes made)
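The cleaning steps above can be illustrated outside Stata with a minimal Python sketch. It is not part of the original do-file: the function name parse_wage and the sample strings are assumptions made for illustration. Unlike the Stata commands, which multiply every value by 1000, this sketch multiplies only when a "K" suffix is present.

```python
def parse_wage(raw):
    """Convert a wage string such as '€565K' to euros.

    Returns None when the text cannot be read as a number,
    which mirrors the missing values produced by destring.
    """
    text = raw.strip().lstrip("€")   # drop the currency symbol
    multiplier = 1
    if text.endswith("K"):           # "K" marks thousands
        multiplier = 1000
        text = text[:-1]
    try:
        return int(float(text) * multiplier)
    except ValueError:
        return None


# Hypothetical sample values in the same format as the Wage column.
wages = ["€565K", "€1K", ""]
salaries = [parse_wage(w) for w in wages]
print(salaries)  # [565000, 1000, None]
```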
The multiple linear regression model is therefore restated with salary as the dependent variable:
salary = β + α*InternationalReputation + κ*SkillMoves + γ*Strength + θ*LongShots + ρ*Vision
Where:
salary is the dependent variable.
InternationalReputation, SkillMoves, Strength, LongShots and Vision are the independent
variables.
β is the intercept.
α, κ, γ, θ and ρ are the coefficients that measure how strongly salary depends on
InternationalReputation, SkillMoves, Strength, LongShots and Vision respectively.
• Dropping and keeping variables
Of all the variables, six were needed for the model (salary and the five independent variables);
together with Name and A this makes eight. The keep command was therefore used so that Stata
kept these eight and dropped the rest. The command was:
. keep A Name InternationalReputation SkillMoves Strength Vision LongShots salary
• Cleaning variable Name
The variable Name was encoded from a string into a labelled numeric variable that Stata can
manipulate. The new variable generated was Names. The command was:
. encode Name, generate(Names)
• Dealing with missing values in InternationalReputation and SkillMoves variables
To handle missing values in InternationalReputation and SkillMoves, which take the discrete
integer values 1, 2, 3, 4 and 5, the average of these five rank values was calculated in Stata.
The command and output were:
. display (1+2+3+4+5)/5
3
This average rank, 3, was then substituted for the missing values in these two variables. The
commands were:
. replace InternationalReputation = 3 if InternationalReputation == .
(48 real changes made)
. replace SkillMoves = 3 if SkillMoves == .
(48 real changes made)
• Dealing with missing values in variable salary
To handle missing values in salary, the mean of the observed values in the column was
calculated and substituted for the missing values. The command and output for the mean were:
. mean salary
Mean estimation Number of obs = 17,966
Mean Std. Err. [95% Conf. Interval]
salary 9861.85 165.0083 9538.418 10185.28
The command for replacing the missing values in the salary variable was:
. replace salary = 9861.85 if salary == .
variable salary was long now double
(241 real changes made)
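The mean-imputation step can be sketched in Python as follows. This is an illustration only, not the original Stata workflow: the function name impute_mean and the small sample list are hypothetical, and None stands in for Stata's missing value.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones.

    Returns the filled list together with the mean that was used.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    return filled, mean


# Hypothetical salaries with one missing entry.
sample = [1000, None, 5000, 3000]
filled, mean = impute_mean(sample)
print(filled, mean)  # [1000, 3000.0, 5000, 3000] 3000.0
```

Mean imputation leaves the column mean unchanged, which is why the salary mean of 9861.85 reappears in the summary table after cleaning, but it reduces the spread of the variable because every imputed value sits exactly at the mean.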
• Generating variable Time from variable A
The variable A was renamed Time because the Durbin-Watson d statistic, used later to test for
autocorrelation, requires a time variable. This is why the variable was not dropped at the
initial stages: it is needed later in the analysis. The commands were:
. label variable A "time"
.
. rename A Time
• Dealing with missing values in variable Strength
The variable Strength had missing values, so the mean of the column was calculated and
substituted for them. The commands and output were:
. mean Strength
Mean estimation Number of obs = 18,159
Mean Std. Err. [95% Conf. Interval]
Strength 65.31197 .0931837 65.12932 65.49462
. replace Strength = 65.31197 if Strength == .
variable Strength was byte now float
(48 real changes made)
• Dealing with missing values in variable LongShots
The variable LongShots had missing values, so the mean of the column was calculated and
substituted for them. The commands and output were:
. mean LongShots
Mean estimation Number of obs = 18,159
Mean Std. Err. [95% Conf. Interval]
LongShots 47.10997 .1429296 46.82982 47.39013
. replace LongShots = 47.10997 if LongShots == .
variable LongShots was byte now float
(48 real changes made)
• Dealing with missing values in variable Vision
The variable Vision had missing values, so the mean of the column was calculated and
substituted for them. The commands and output were:
. mean Vision
Mean estimation Number of obs = 18,159
Mean Std. Err. [95% Conf. Interval]
Vision 53.4009 .104982 53.19513 53.60668
. replace Vision = 53.4009 if Vision == .
variable Vision was byte now float
(48 real changes made)
Conclusion:
The data was summarized after the cleaning procedure. The command and output were:
. summarize Time InternationalReputation SkillMoves Strength LongShots Vision salary Names
Variable Obs Mean Std. Dev. Min Max
Time 18,207 9103 5256.053 0 18206
Internatio~n 18,207 1.118196 .4052307 1 5
SkillMoves 18,207 2.362992 .7558765 1 5
Strength 18,207 65.31197 12.54044 17 97
LongShots 18,207 47.10997 19.23512 3 94
Vision 18,207 53.4009 14.12822 10 94
salary 18,207 9861.85 21970.4 1000 565000
Names 18,207 8562.34 4944.38 1 17194
The total number of observations for all variables is 18,207.
• Time: The mean is 9103, standard deviation is 5256.053, minimum is 0 and maximum is
18206.
• InternationalReputation: The mean is 1.118196, standard deviation is 0.4052307,
minimum is 1 and maximum is 5.
• SkillMoves: The mean is 2.362992, standard deviation is 0.7558765, minimum is 1 and
maximum is 5.
• Strength: The mean is 65.31197, standard deviation is 12.54044, minimum is 17 and
maximum is 97.
• LongShots: The mean is 47.10997, standard deviation is 19.23512, minimum is 3 and
maximum is 94.
• Vision: The mean is 53.4009, standard deviation is 14.12822, minimum is 10 and
maximum is 94.
• salary: The mean is 9861.85, standard deviation is 21970.4, minimum is 1000 and
maximum is 565000.
• Names: Its mean, standard deviation, minimum and maximum are not meaningful, because the
values are numeric codes that stand for player names (the variable was encoded from a
string).
PART TWO: MULTIPLE LINEAR REGRESSION AND TESTING ASSUMPTIONS
Running the multiple linear regression model.
The command and output were:
. regress salary i.InternationalReputation i.SkillMoves Strength LongShots Vision
Source             SS          df          MS        Number of obs  =    18,207
                                                     F(11, 18195)   =   1772.09
Model      4.5453e+12          11  4.1321e+11        Prob > F       =    0.0000
Residual   4.2427e+12      18,195   233177953        R-squared      =    0.5172
                                                     Adj R-squared  =    0.5169
Total      8.7880e+12      18,206   482698402        Root MSE       =     15270

                 salary        Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
InternationalReputation
                      2     19552.05    471.2686    41.49    0.000     18628.32    20475.78
                      3        56872    844.8184    67.32    0.000     55216.07    58527.92
                      4     158901.5    2168.295    73.28    0.000     154651.5    163151.6
                      5       286925    6338.772    45.27    0.000     274500.4    299349.6
             SkillMoves
                      2     -4862.77     481.343   -10.10    0.000    -5806.248   -3919.292
                      3    -3332.904    595.9187    -5.59    0.000    -4500.961   -2164.847
                      4     7805.529    819.2849     9.53    0.000     6199.654    9411.405
                      5     9032.809    2289.064     3.95    0.000     4546.028    13519.59
               Strength     177.3617     9.44834    18.77    0.000     158.8421    195.8814
              LongShots     52.97948     11.2732     4.70    0.000     30.88294    75.07603
                 Vision     173.3927    13.10883    13.23    0.000     147.6981    199.0872
                  _cons    -13400.04    807.4742   -16.60    0.000    -14982.77   -11817.31
For the categorical variables (InternationalReputation and SkillMoves), the i. prefix made
Stata generate dummy (indicator) variables, with level 1 as the omitted base category.
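The summary statistics in the header of the regression output follow from the sums of squares in the ANOVA part of the table. The Python sketch below recomputes them from the rounded figures printed by Stata, so small differences in the last digits are expected. The function name fit_summary is hypothetical.

```python
def fit_summary(ss_model, ss_resid, n, k):
    """Recompute R-squared, adjusted R-squared and F from sums of squares.

    n is the number of observations and k the number of estimated
    slope coefficients (11 here, counting each dummy variable).
    """
    ss_total = ss_model + ss_resid
    r2 = ss_model / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = (ss_model / k) / (ss_resid / (n - k - 1))
    return r2, adj_r2, f_stat


# Rounded values copied from the regression output.
r2, adj_r2, f_stat = fit_summary(4.5453e12, 4.2427e12, n=18207, k=11)
print(round(r2, 4))      # 0.5172
print(round(adj_r2, 4))  # 0.5169
print(f_stat)            # close to the reported 1772.09
```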
Interpretation
The total number of observations is 18,207.
The F statistic is 1772.09 with Prob > F = 0.0000, which is less than 0.05, so we reject the
hypothesis that all coefficients in this linear regression model are zero.
The adjusted R^2 is 0.5169, which means that 51.69% of the variation in salary is explained by
InternationalReputation, SkillMoves, Strength, LongShots and Vision.
Each coefficient below is interpreted holding the other variables constant. For the categorical
variables, level 1 is the base category.
A change in InternationalReputation from 1 to 2 increases salary by 19552.05.
A change in InternationalReputation from 1 to 3 increases salary by 56872.
A change in InternationalReputation from 1 to 4 increases salary by 158901.5.
A change in InternationalReputation from 1 to 5 increases salary by 286925.
A change in SkillMoves from 1 to 2 decreases salary by 4862.77.
A change in SkillMoves from 1 to 3 decreases salary by 3332.904.
A change in SkillMoves from 1 to 4 increases salary by 7805.529.
A change in SkillMoves from 1 to 5 increases salary by 9032.809.
A unit increase in Strength increases salary by 177.3617.
A unit increase in LongShots increases salary by 52.97948.
A unit increase in Vision increases salary by 173.3927.
All p-values are 0.000, meaning that all variables are significant in this model.
The intercept is -13400.04.
The multiple linear model is:
salary = -13400.04 + 19552.05*InternationalReputation2 + 56872*InternationalReputation3 +
158901.5*InternationalReputation4 + 286925*InternationalReputation5
- 4862.77*SkillMoves2 - 3332.904*SkillMoves3 + 7805.529*SkillMoves4 + 9032.809*SkillMoves5
+ 177.3617*Strength + 52.97948*LongShots + 173.3927*Vision
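To show how the dummy-coded model is applied, the Python sketch below evaluates the fitted equation for a single player. The coefficients are copied from the regression output; the function name predict_salary, the dictionary names and the example player are assumptions made for illustration. Level 1 of each categorical variable is the base category, so its contribution is zero.

```python
INTERCEPT = -13400.04

# Dummy-variable coefficients; level 1 is the base category.
REPUTATION = {1: 0.0, 2: 19552.05, 3: 56872.0, 4: 158901.5, 5: 286925.0}
SKILL = {1: 0.0, 2: -4862.77, 3: -3332.904, 4: 7805.529, 5: 9032.809}

# Slopes of the continuous variables.
SLOPES = {"Strength": 177.3617, "LongShots": 52.97948, "Vision": 173.3927}


def predict_salary(reputation, skill, strength, long_shots, vision):
    """Fitted salary from the OLS model reported above."""
    return (INTERCEPT
            + REPUTATION[reputation]
            + SKILL[skill]
            + SLOPES["Strength"] * strength
            + SLOPES["LongShots"] * long_shots
            + SLOPES["Vision"] * vision)


# Hypothetical player with ratings near the sample means.
base = predict_salary(1, 2, 65, 47, 53)
higher_reputation = predict_salary(2, 2, 65, 47, 53)
print(base, higher_reputation - base)
```

The difference between the two predictions equals the InternationalReputation level 2 coefficient, which is what "a change from 1 to 2, holding the other variables constant" means in the interpretation.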
TESTING ASSUMPTIONS
1) Your dependent variable must be measured at a continuous level/ scale.
We checked this assumption by tabulating and summarizing salary. The commands and output were:
. tab salary
. summarize salary
Variable Obs Mean Std. Dev. Min Max
salary 18,207 9861.85 21970.4 1000 565000
Conclusion: The variable salary is continuous.
2) You have two or more independent variables measured at continuous or categorical
level.
We checked this assumption by using multiple one-way frequency tables for
InternationalReputation, SkillMoves, Strength, LongShots and Vision using the command:
. tab1 InternationalReputation SkillMoves Strength LongShots Vision
From this we conclude that Strength, LongShots and Vision are continuous, while
InternationalReputation and SkillMoves are categorical.
Summary statistics were also produced for the independent variables and salary.
The command and output were:
. summarize InternationalReputation SkillMoves Strength LongShots Vision salary
Variable Obs Mean Std. Dev. Min Max
Internatio~n 18,207 1.118196 .4052307 1 5
SkillMoves 18,207 2.362992 .7558765 1 5
Strength 18,207 65.31197 12.54044 17 97
LongShots 18,207 47.10997 19.23512 3 94
Vision 18,207 53.4009 14.12822 10 94
salary 18,207 9861.85 21970.4 1000 565000
3) There needs to be a linear relationship between:
a) the dependent variable and each independent variable individually, and
b) the dependent variable and the independent variables collectively.
Testing assumption 3a): There needs to be a linear relationship between the dependent variable
and each independent variable.
A natural log transformation of salary was also generated, to compress the wide range of salary
(1000 to 565000) so that the line of best fit can be compared with the points more easily in all
plots done for this assumption. The command was:
. generate logsalary = ln( salary)
Strength
It was tested by using a scatter plot of salary and Strength.
The command and output before the log transformation were:
. twoway (scatter salary Strength) (lfit salary Strength)
[Figure: scatter plot of salary against Strength, with fitted values line]
The command and output after the log transformation were:
. twoway (scatter logsalary Strength) (lfit logsalary Strength)
[Figure: scatter plot of logsalary against Strength, with fitted values line]
Conclusion: There is linearity between Strength and salary.
LongShots
It was tested by using a scatter plot of salary and LongShots.
The command and output before the log transformation were:
. twoway (scatter salary LongShots ) (lfit salary LongShots )
[Figure: scatter plot of salary against LongShots, with fitted values line]
The command and output after the log transformation were:
. twoway (scatter logsalary LongShots ) (lfit logsalary LongShots )
[Figure: scatter plot of logsalary against LongShots, with fitted values line]
Conclusion: There is linearity between LongShots and salary.
Vision
It was tested by using a scatter plot of salary and Vision.
The command and output before the log transformation were:
. twoway (scatter salary Vision ) (lfit salary Vision )
[Figure: scatter plot of salary against Vision, with fitted values line]
The command and output after the log transformation were:
. twoway (scatter logsalary Vision ) (lfit logsalary Vision )
[Figure: scatter plot of logsalary against Vision, with fitted values line]
Conclusion: There is linearity between Vision and salary.
Testing assumption 3b): There should be a linear relationship between the dependent
variable and the independent variables collectively.
This assumption is tested using partial regression plots, also called added-variable plots. The
avplots command produces an added-variable plot for every term in the multiple linear
regression model, including the dummy variables.
The command was:
. avplots
Conclusion: There is linearity between salary and all independent variables.
4) Your data must not show multicollinearity, which occurs when two or more independent
variables are highly correlated with each other.
This assumption was tested using variance inflation factors. The command and output were:
. estat vif
Variable          VIF       1/VIF
Internatio~n
           2     1.12    0.894554
           3     1.07    0.933454
           4     1.03    0.975212
           5     1.03    0.967540
SkillMoves
           2     4.51    0.221882
           3     6.43    0.155575
           4     2.51    0.398925
           5     1.14    0.875024
Strength         1.10    0.912298
LongShots        3.67    0.272388
Vision           2.68    0.373396

Mean VIF         2.39
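Conclusion: All the variance inflation factors (VIF) are below 10, the usual rule-of-thumb
threshold (the largest is 6.43, for SkillMoves level 3), so there is at most moderate correlation
among the independent variables and no severe multicollinearity.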
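The two columns of the VIF table are linked by a simple formula. The VIF of a regressor is 1/(1 - R^2), where R^2 comes from regressing that regressor on all the others, and the 1/VIF column (the tolerance) is therefore 1 - R^2. The Python sketch below checks this against one row of the table. The function names vif and tolerance are hypothetical.

```python
def vif(r_squared):
    """Variance inflation factor from the auxiliary-regression R-squared."""
    return 1.0 / (1.0 - r_squared)


def tolerance(vif_value):
    """Tolerance is the reciprocal of the VIF (the 1/VIF column)."""
    return 1.0 / vif_value


# 1/VIF reported for SkillMoves level 3 in the table above.
reported_tolerance = 0.155575
r2_aux = 1 - reported_tolerance      # implied auxiliary R-squared
print(round(vif(r2_aux), 2))         # 6.43
```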
5) There should be homoscedasticity
This assumption was tested using a residual versus fitted values plot (rvfplot). The command and
output were:
. rvfplot, yline(0)
[Figure: residuals plotted against fitted values, with a reference line at 0]
Conclusion: The residuals spread out from the reference line at 0 rather than forming a band of
constant width, so the residual variance is not constant. The homoscedasticity assumption is
violated (heteroscedasticity is present).
6) There should be no autocorrelation
The assumption was tested using the Durbin-Watson d statistic. Since the test only applies to
time series data, the data was first declared as a time series using the Time variable and then
the test was applied. The commands and output were:
. tsset Time
time variable: Time, 0 to 18206
delta: 1 unit
. estat dwatson
Durbin-Watson d-statistic( 12, 18207) = 1.421951
Conclusion: The Durbin-Watson statistic ranges between 0 and 4, and a value of 2 indicates no
autocorrelation. The value here is 1.421951, which is below 2 and therefore suggests positive
autocorrelation in the residuals.
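For reference, the d statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The Python sketch below implements that definition on two small, made-up residual series; the function name durbin_watson is hypothetical and the series are not taken from the dataset.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Identical neighbours (strong positive autocorrelation) give d near 0.
print(durbin_watson([1, 1, 1, 1]))    # 0.0
# Alternating signs (negative autocorrelation) push d towards 4.
print(durbin_watson([1, -1, 1, -1]))  # 3.0
```

Because the statistic depends on the order of the observations, the result is only meaningful if the ordering variable (here Time, the original row index A) reflects a real sequence.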
7) The residuals should be normally distributed.
To test this assumption, studentized residuals were generated and plotted in a histogram with
a normal density curve superimposed. The commands and output were:
. predict stres, rstudent
. histogram stres, normal
(bin=42, start=-21.936043, width=.95808297)
[Figure: histogram of studentized residuals (density scale) with normal density curve]
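Conclusion: The studentized residuals are approximately normally distributed around zero,
although the range of the histogram (starting at -21.94) shows that a few residuals lie far out
in the tails.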
8) There should be no significant outliers, high leverage points or highly influential
points
Outliers
To check for outliers, a stem-and-leaf display was generated and the extreme values at both
ends were identified. The large negative residuals are few, so they are all listed. The large
positive residuals are many, so only 10 are listed in this document; in Stata all of them are
output, since the do-file contains the command for that.
The commands and output were:
i) Large negative studentized residuals (stres <= -5)
. list InternationalReputation SkillMoves Strength LongShots Vision salary stres if stres <= -5
Intern~n SkillM~s Strength LongSh~s Vision salary stres
23. 5 1 80 16 70 130000 -12.34081
42. 4 1 69 13 50 77000 -5.966961
69. 4 4 67 86 86 100000 -5.60563
77. 4 4 58 71 93 21000 -10.7845
109. 4 2 86 56 48 57000 -7.300009
110. 5 5 86 82 79 15000 -21.93604
207. 4 4 86 79 74 55000 -8.657025
222. 4 5 60 71 84 72000 -7.453211
281. 4 4 68 66 86 67000 -7.738674
315. 4 4 55 78 78 62000 -7.867957
318. 4 1 63 11 53 60000 -7.052768
319. 4 1 70 13 65 10000 -10.61177
379. 4 4 92 90 75 25000 -10.77981
548. 4 2 72 55 68 20000 -9.823951
551. 4 3 75 77 81 11000 -10.79005
553. 4 3 78 80 82 13000 -10.71431
677. 3 4 66 71 81 1000 -5.238172
ii) Large positive studentized residuals (stres >= 9, first 21 observations only)
. list InternationalReputation SkillMoves Strength LongShots Vision salary stres if stres >= 9 in 1/21
Intern~n SkillM~s Strength LongSh~s Vision salary stres
1. 5 4 59 94 94 565000 18.30344
5. 4 4 75 91 94 355000 11.10393
6. 4 4 66 80 89 340000 10.3053
7. 4 4 58 82 92 420000 15.72673
8. 5 3 83 85 84 455000 10.90395
9. 4 3 83 59 63 380000 13.90365
12. 4 3 73 92 86 355000 11.96119
15. 3 2 76 69 79 225000 10.23313
19. 3 1 79 10 69 240000 11.1949
21. 4 3 77 54 87 315000 9.366471
Conclusion: There are outliers in the model.
Leverage points
In order to check for leverage points, they were predicted using the following command:
. predict leverage, leverage
Then the cut-off for leverage points was calculated. The formula is (2k + 2)/n, where k is the
number of variables used in the multiple linear regression model (taken as 6 here) and n is the
total number of observations. The command and output were:
. display ((2*6)+2)/18207
.00076894
A list of observations with leverage greater than the cut-off of 0.00076894 was then generated.
Only the first 10 are shown in this document; all of them appear in Stata, since the command
is in the do-file. The command and output were:
. list InternationalReputation SkillMoves Strength LongShots Vision salary leverage if leverage > .00076894 in 1/10
Intern~n SkillM~s Strength LongSh~s Vision salary leverage
1. 5 4 59 94 94 565000 .1726412
2. 5 5 79 93 82 405000 .1719912
3. 5 5 49 82 87 290000 .1720172
4. 4 1 64 12 68 260000 .02072
5. 4 4 75 91 94 355000 .0202646
6. 4 4 66 80 89 340000 .0202131
7. 4 4 58 82 92 420000 .0202777
8. 5 3 83 85 84 455000 .172036
9. 4 3 83 59 63 380000 .0201346
10. 3 1 78 12 70 94000 .0040192
Conclusion: There are leverage points in the model.
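The leverage cut-off can be reproduced with the short Python sketch below. The function names are hypothetical, the first three leverage values are copied from the listing above, and the fourth (0.0005) is made up to show an observation that is not flagged.

```python
def leverage_cutoff(k, n):
    """Rule-of-thumb leverage cut-off (2k + 2) / n."""
    return (2 * k + 2) / n


def flag_high_leverage(leverages, k, n):
    """Return the positions of observations above the cut-off."""
    cutoff = leverage_cutoff(k, n)
    return [i for i, h in enumerate(leverages) if h > cutoff]


print(leverage_cutoff(6, 18207))   # about 0.00076894, as in the report

# Three values from the listing plus one hypothetical low value.
sample = [0.1726412, 0.02072, 0.0040192, 0.0005]
print(flag_high_leverage(sample, 6, 18207))  # [0, 1, 2]
```

The report takes k = 6. If k is instead counted as the 11 regressors that appear once the dummy variables are expanded, the cut-off becomes 24/18207, about 0.00132, and every observation in the listing above still exceeds it.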
Influential points
In order to check for influential points, DFITS (difference in fits) values were predicted
using the following command:
. predict dfits, dfits
Then the cut off for influential points was calculated. The formula is (2*sqrt(k/n)) where k is the
number of variables used in the multiple linear regression model and n is the total number of
observations. The command and output were:
. display 2*sqrt(6/18207)
.03630667
A list of observations with DFITS greater than the cut-off of 0.03630667 was then generated.
Only the first 10 are shown in this document; all of them appear in Stata, since the command
is in the do-file. The command and output were:
. list InternationalReputation SkillMoves Strength LongShots Vision salary dfits if dfits>.03630667 in 1/11
Intern~n SkillM~s Strength LongSh~s Vision salary dfits
1. 5 4 59 94 94 565000 8.360995
2. 5 5 79 93 82 405000 2.931811
4. 4 1 64 12 68 260000 .8741187
5. 4 4 75 91 94 355000 1.596951
6. 4 4 66 80 89 340000 1.480169
7. 4 4 58 82 92 420000 2.26254
8. 5 3 83 85 84 455000 4.970357
9. 4 3 83 59 63 380000 1.993044
10. 3 1 78 12 70 94000 .099716
11. 4 4 84 84 77 205000 .1810446
Conclusion: There are influential points in the model.
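The DFITS cut-off can be checked in the same way. In the Python sketch below the function names are hypothetical, the first two DFITS values are copied from the listing above, and the last two are made up. The sketch compares the absolute value of DFITS with the cut-off, which is the usual form of the criterion; the list command in this report filters on dfits > cut-off and so picks up only positive values.

```python
import math


def dfits_cutoff(k, n):
    """Rule-of-thumb DFITS cut-off 2 * sqrt(k / n)."""
    return 2 * math.sqrt(k / n)


def flag_influential(dfits_values, k, n):
    """Return positions whose absolute DFITS exceeds the cut-off."""
    cutoff = dfits_cutoff(k, n)
    return [i for i, d in enumerate(dfits_values) if abs(d) > cutoff]


print(dfits_cutoff(6, 18207))  # about 0.03630667, as in the report

# Two values from the listing, then two hypothetical ones.
sample = [8.360995, 0.099716, 0.01, -0.5]
print(flag_influential(sample, 6, 18207))  # [0, 1, 3]
```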
Treatment of unusual points like outliers.
In order to treat unusual points like outliers, a robust regression was used.
Robust regression is an iterative procedure that seeks to identify outliers and minimize their
impact on the coefficient estimates.
For this document, the robust regression output is shown without the iteration log. The full
output appears in Stata, since the command is in the do-file.
The command and output were:
. rreg salary i.InternationalReputation i.SkillMoves Strength LongShots Vision
Robust regression                                    Number of obs  =    18,203
                                                     F( 11, 18191)  =  22023.11
                                                     Prob > F       =    0.0000

                 salary        Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
InternationalReputation
                      2     6796.234    94.90796    71.61    0.000     6610.205    6982.262
                      3     48224.09    170.1606   283.40    0.000     47890.56    48557.62
                      4     156367.1    436.6914   358.07    0.000     155511.1    157223.1
                      5    -3.67e-10    2221.061    -0.00    1.000    -4353.488    4353.488
             SkillMoves
                      2    -1504.482    96.93651   -15.52    0.000    -1694.486   -1314.477
                      3     -229.601    120.0113    -1.91    0.056    -464.8344    5.632478
                      4     3018.416    165.0164    18.29    0.000     2694.968    3341.864
                      5     2914.861    467.8699     6.23    0.000     1997.792     3831.93
               Strength     58.55993    1.902957    30.77    0.000     54.82995     62.2899
              LongShots     23.77543     2.27064    10.47    0.000     19.32476     28.2261
                 Vision     33.79158    2.640441    12.80    0.000     28.61607    38.96709
                  _cons    -2412.282    162.6184   -14.83    0.000     -2731.03   -2093.535
Some levels of the categorical variables are not significant in this model
(InternationalReputation level 5 with p = 1.000 and SkillMoves level 3 with p = 0.056), so one
of the categorical variables, SkillMoves, was dropped and the robust regression was run again.
The command and output were:
. rreg salary i.InternationalReputation Strength LongShots Vision
Robust regression                                    Number of obs  =    18,202
                                                     F( 7, 18194)   =  34959.35
                                                     Prob > F       =    0.0000

                 salary        Coef.   Std. Err.        t    P>|t|     [95% Conf. Interval]
InternationalReputation
                      2     11701.48    95.58984   122.41    0.000     11514.11    11888.84
                      3      50277.6    171.8411   292.58    0.000     49940.78    50614.43
                      4     156413.4    445.3501   351.21    0.000     155540.5    157286.4
                      5     284990.4    3159.159    90.21    0.000     278798.2    291182.7
               Strength     47.29808     1.90894    24.78    0.000     43.55638    51.03979
              LongShots     16.49655    1.869364     8.82    0.000     12.83242    20.16068
                 Vision     51.34202    2.598845    19.76    0.000     46.24804      56.436
                  _cons    -3127.501    163.6803   -19.11    0.000    -3448.329   -2806.672
Interpretation
There are 18,202 observations rather than 18,207, because the robust regression dropped five
unusual points.
The F statistic is 34959.35 with Prob > F = 0.0000, which is less than 0.05, so not all
coefficients in this robust regression model are zero.
Each coefficient below is interpreted holding the other variables constant.
A change in InternationalReputation from 1 to 2 increases salary by 11701.48.
A change in InternationalReputation from 1 to 3 increases salary by 50277.6.
A change in InternationalReputation from 1 to 4 increases salary by 156413.4.
A change in InternationalReputation from 1 to 5 increases salary by 284990.4.
A unit increase in Strength increases salary by 47.29808.
A unit increase in LongShots increases salary by 16.49655.
A unit increase in Vision increases salary by 51.34202.
All p-values are 0.000, meaning that all variables are significant in this model.
The intercept is -3127.501.
The robust model is:
salary = -3127.501 + 11701.48*InternationalReputation2 + 50277.6*InternationalReputation3 +
156413.4*InternationalReputation4 + 284990.4*InternationalReputation5 +
47.29808*Strength + 16.49655*LongShots + 51.34202*Vision
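The idea of robust regression as an iterative procedure that reduces the impact of outliers can be sketched with iteratively reweighted least squares. The Python example below is a simplified illustration with one predictor and Huber weights. It is not Stata's rreg algorithm, which also screens gross outliers first and finishes with biweight iterations. All function names, the tuning constant and the data are assumptions made for illustration.

```python
def weighted_line(xs, ys, ws):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def median(values):
    s = sorted(values)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2


def huber_fit(xs, ys, c=1.345, iterations=20):
    """Refit repeatedly, down-weighting points with large residuals."""
    ws = [1.0] * len(xs)
    for _ in range(iterations):
        a, b = weighted_line(xs, ys, ws)
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
        # Robust scale estimate from the median absolute residual.
        scale = median([abs(r) for r in resid]) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 inside the threshold, shrinking outside it.
        ws = [1.0 if abs(r) <= c * scale else c * scale / abs(r)
              for r in resid]
    return weighted_line(xs, ys, ws)


# Made-up data on the line y = 1 + 2x with one gross outlier at x = 9.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
ys[9] = 100

ols_intercept, ols_slope = weighted_line(xs, ys, [1.0] * len(xs))
rob_intercept, rob_slope = huber_fit(xs, ys)
print(ols_slope, rob_slope)
```

In this made-up example the ordinary least-squares slope is pulled well above 2 by the single outlier, while the reweighted fit stays close to 2, which is the kind of behaviour that motivates using robust regression on the salary data.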
END