
Introduction to statistics in R

Rémy Beugnon & Malte Jochum


US

§ Rémy Beugnon
https://remybeugnon.netlify.app
@BeugnonRemy

§ Christian Ristok
@ChristianRistok

§ Malte Jochum
http://maltejochum.de/
@MalteJochum
Summary

In this lecture:

1. The stepwise process to analyze your data

2. Application

3. Practical on your own

4. Conclusion

Focus on linear models with continuous predictors.

[Figure: response variable plotted against an explanatory variable]
Steps to analyze your data

Check your data structure
1. What are your variables?
   i. What is your response variable?
   ii. What is your explanatory variable?
2. How are your data distributed?
3. How do you expect your response variable to be distributed?

Build your hypothesis
1. What do you want to test?
2. What data distribution will you use?
3. What are the statistical hypotheses?

Build your model
1. Build your model with R
2. Fit your model with R

Check model fit
1. Check model quality
2. Are statistical assumptions met?
   NO: What can you do?
      i. Double check your data
      ii. Data transformation
      Then go back to "Build your model".
   YES: Extract your results

How to do that using RStudio

§ You need
   § RStudio
   § R version 4.0 or higher
   § The following packages:
      § Data handling: dplyr
      § Model quality checks: performance (requires the see package for its plots)
      § Extract your results: ggeffects
      § Plot: ggplot2 (join the course from Steph for more details)
   § A dataset to analyze

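As a minimal sketch of checking this setup, the snippet below tests the R version and reports which of the listed packages are missing. The variable names (required, installed, missing.pkgs) are illustrative, and nothing is installed automatically.

```r
# Warn if the running R version is older than the one the lecture asks for.
if (getRversion() < "4.0.0") {
  warning("R 4.0 or higher is recommended for this lecture.")
}

# Packages named in the lecture (see is needed by performance for its plots).
required = c("dplyr", "performance", "see", "ggeffects", "ggplot2")

# requireNamespace() returns TRUE when a package is installed, without attaching it.
installed = vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing.pkgs = required[!installed]

if (length(missing.pkgs) > 0) {
  message("Missing packages: ", paste(missing.pkgs, collapse = ", "),
          ". Install them with install.packages().")
} else {
  message("All packages are available.")
}
```

Once everything is installed, each package can be attached with library(), for example library(performance).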
Example: tree diversity effect on litterfall and decomposition

[Figure: panels relating tree species richness to litterfall biomass (g/m2) and litter species richness]

[Figure: panels of C loss (%) and N loss (%) against litter species richness and against litterfall]

Example: tree diversity effect on litterfall abundance

[Figure: litterfall abundance (g/m2) against tree species richness]

Check your data structure

1. Load your data into a data frame called df:

File type   R function [package]                                  Example
.csv        read.csv(file = "name.csv")                           df = read.csv(file = "my-data.csv")
.txt        read.delim(file = "name.txt")                         df = read.delim(file = "my-data.txt")
.xlsx       read_xlsx(path = "name.xlsx", sheet = "sheet.name")   df = read_xlsx(path = "my-data.xlsx", sheet = "rawdata")
            [package: readxl]

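To make the loading step concrete, here is a small self-contained sketch. It writes a made-up table to a temporary .csv file and reads it back with read.csv(). The file name and the values are invented for illustration; with real data you would pass your own file path.

```r
# Write a small example table (made-up values) to a temporary .csv file.
example.file = tempfile(fileext = ".csv")
write.csv(
  data.frame(TSP = c("A1", "A2", "A3"),
             litterfall = c(120.5, 87.2, 301.0),
             neigh.sp.rich = c(1L, 4L, 8L)),
  file = example.file, row.names = FALSE
)

# Load it the same way the lecture loads "my-data.csv".
df = read.csv(file = example.file)

head(df)   # first rows of the loaded data
dim(df)    # number of rows and columns
```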
Check your data structure

1. Load your data into a data frame called df
2. What are your variables?

Inspect the structure of the data with: str(df)

Variable name   Measure                                          Type        Expected range     Expected distribution
TSP             Sample name                                      Character   All sample names   -
litterfall      Quantity of litter, in grams, falling on 1 m2    Numeric     0 – 500 g/m2       Normal
neigh.sp.rich   Number of species in the surroundings            Integer     [1;12]             -

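The variable table can be checked against the data directly. The sketch below builds a toy data frame with the three columns from the lecture (the values are made up) and compares types and ranges with what is expected.

```r
# Toy data with the lecture's three columns (made-up values).
df = data.frame(
  TSP = c("A1", "A2", "A3", "A4"),
  litterfall = c(120.5, 87.2, 301.0, 45.9),
  neigh.sp.rich = c(1L, 4L, 8L, 12L),
  stringsAsFactors = FALSE
)

str(df)                    # type and first values of every column
sapply(df, class)          # one class per column
range(df$litterfall)       # compare with the expected 0 to 500 g/m2
range(df$neigh.sp.rich)    # compare with the expected [1;12]
```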
Check your data structure

DANGER ZONE
Your data do not have to be normally distributed, but your residuals should be!

Let's take people's height as an example: eating your soup makes you grow taller.

[Figure: distribution of height in the whole population: NOT NORMAL]

For a given amount of soup, height should follow a normal distribution.
Therefore, your residuals should follow a normal distribution.
Your population as a whole DOES NOT follow a normal distribution.

(The same goes for other distribution types!)

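A small simulation can illustrate this point. The sketch below invents two soup-intake groups and a height that depends on soup plus normal noise; all numbers are made up. The pooled heights are clearly not normal (two separate humps), while the residuals of the fitted model are.

```r
set.seed(42)

# Two made-up soup intake levels, 100 people each.
soup = rep(c(0, 10), each = 100)

# Height depends on soup, with a normal error around the expected value.
height = 150 + 3 * soup + rnorm(200, mean = 0, sd = 2)

mod = lm(height ~ soup)

# Pooled heights show two humps; residuals are centred on zero.
hist(height, main = "Heights: not normal")
hist(residuals(mod), main = "Residuals: approximately normal")

sd(height)           # large: dominated by the difference between the two groups
sd(residuals(mod))   # small: close to the simulated error sd
```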
Check your data structure

1. Load your data into a data frame called df
2. What are your variables?
3. How are your variables distributed?
   1. Missing values

WARNING DANGER ZONE
Only keep complete rows:
df = df[complete.cases(df), ]

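As a short sketch of the missing-value step, the example below uses a toy data frame with invented values and two NA entries. It counts the missing values first, keeps a copy, and only then drops incomplete rows.

```r
# Toy data with missing values (made-up numbers).
df = data.frame(
  TSP = c("A1", "A2", "A3", "A4"),
  litterfall = c(120.5, NA, 301.0, 45.9),
  neigh.sp.rich = c(1L, 4L, NA, 12L)
)

colSums(is.na(df))              # number of missing values per column

df.raw = df                     # keep a copy before dropping rows
df = df[complete.cases(df), ]   # keep only the rows without any NA

nrow(df.raw) - nrow(df)         # number of rows dropped
```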
Quick and dirty overview of all variables:
plot(df)

Distribution of a single variable:
boxplot(df$litterfall)

[Figure: boxplots of litterfall and neigh.sp.rich]

1. Control data out of range:

df[df$litterfall < 0 | df$litterfall > 500, ]

The condition before the comma selects the rows; leaving the part after the comma empty keeps all columns.

df[df$neigh.sp.rich < 1 | df$neigh.sp.rich > 12, ]

2. Correct the typos, or remove the rows

Write the opposite condition:

df[df$neigh.sp.rich >= 1 & df$neigh.sp.rich <= 12, ]

Or let R negate it for you:

df[!(df$neigh.sp.rich < 1 | df$neigh.sp.rich > 12), ]

WARNING DANGER ZONE
You will overwrite your data in R: keep a safe copy!

df.raw = df

df = df[!(df$neigh.sp.rich < 1 | df$neigh.sp.rich > 12), ]
df = df[!(df$litterfall < 0 | df$litterfall > 500), ]

[Figure: distributions of litterfall and neigh.sp.rich]

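The range checks can be sketched end to end on toy data. The values below are invented so that three rows fall outside the expected ranges, and the names bad.litterfall and bad.richness are illustrative helpers, not part of the lecture.

```r
# Toy data with three out-of-range rows (made-up values).
df = data.frame(
  TSP = c("A1", "A2", "A3", "A4", "A5"),
  litterfall = c(120.5, -3.0, 301.0, 45.9, 950.0),
  neigh.sp.rich = c(1L, 4L, 80L, 12L, 2L)
)
df.raw = df   # safe copy before overwriting df

# Logical vectors flagging the rows outside the expected ranges.
bad.litterfall = df$litterfall < 0 | df$litterfall > 500
bad.richness = df$neigh.sp.rich < 1 | df$neigh.sp.rich > 12

# Look at the suspicious rows first: typo to correct, or row to remove?
df[bad.litterfall | bad.richness, ]

# Keep only the rows that pass both checks.
df = df[!(bad.litterfall | bad.richness), ]
df
```

If a column still contains NA, the condition is NA for that row and the subset returns a row of NA values, so it is safer to handle missing values before this step.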
Build your hypothesis

1. What do you want to test?

[Figure: litterfall abundance (g/m2) against tree species richness]

Tree species richness increases litterfall

"litterfall" increases with "neigh.sp.rich"

$\mathrm{litterfall} \sim \mu + \alpha \times \mathrm{neigh.sp.rich} + \varepsilon$

H0: $\alpha = 0$, $\mathrm{litterfall} \sim \mu + \varepsilon$

H1: $\alpha \neq 0$, $\mathrm{litterfall} \sim \mu + \alpha \times \mathrm{neigh.sp.rich} + \varepsilon$

Take a look at your data:
plot(df$litterfall ~ df$neigh.sp.rich)

2. What distribution will you use? How do you expect your data to fall around your mean?

$\mathrm{litterfall} \sim \mu + \alpha \times \mathrm{neigh.sp.rich} + \varepsilon$

$\varepsilon \sim N(0, \sigma)$

3. What are your statistical hypotheses (model assumptions)?

Most of them are controlled by the structure of your experiment:

i. Independence
ii. Random sampling
iii. Normally distributed error: $\varepsilon \sim N(0, \sigma)$
iv. Equal variances (homoscedasticity)
v. Linearity
vi. Predictors are fixed

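One way to read these assumptions is as a recipe for generating data. The sketch below simulates a dataset that satisfies them by construction. The parameter values (mu, alpha, sigma) and the sample size are made up for illustration and are not taken from the lecture's dataset.

```r
set.seed(1)
n = 60

# Predictor values chosen by the experimenter (vi. predictors are fixed).
neigh.sp.rich = rep(1:12, times = 5)

# Made-up parameter values.
mu = 50
alpha = 20
sigma = 30

# Independent draws (i), normal errors (iii), same sd everywhere (iv).
epsilon = rnorm(n, mean = 0, sd = sigma)

# Straight-line relationship between predictor and response (v. linearity).
litterfall = mu + alpha * neigh.sp.rich + epsilon

df = data.frame(litterfall, neigh.sp.rich)
plot(litterfall ~ neigh.sp.rich, data = df)
```

Breaking one assumption in the simulation, for example letting sigma grow with neigh.sp.rich, is a simple way to see what a violation looks like in the diagnostic plots.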
Build your model in R

1. Build your model

Function: lm() (glm() for other residual distributions)

Formula: y ~ x

Together: lm(formula = litterfall ~ neigh.sp.rich, data = df)

2. Fit the model to your data:

mod = lm(formula = litterfall ~ neigh.sp.rich, data = df)

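A runnable version of this step needs some data, so the sketch below simulates a toy df (made-up intercept, slope and noise) and then fits the model exactly as written above.

```r
set.seed(1)

# Toy data standing in for the lecture's dataset (made-up values).
df = data.frame(neigh.sp.rich = rep(1:12, times = 5))
df$litterfall = 50 + 20 * df$neigh.sp.rich + rnorm(nrow(df), mean = 0, sd = 30)

# Build and fit the linear model in one call.
mod = lm(formula = litterfall ~ neigh.sp.rich, data = df)

coef(mod)    # estimated intercept (mu) and slope (alpha)
class(mod)   # the fitted object is of class "lm"
```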
Check the model fit

Check the model quality and the assumptions: the performance package

check_model(mod)
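If the performance package is not available, similar checks can be sketched with base R only. This is an alternative to check_model(), not what the lecture uses: plot() on an lm object draws the standard diagnostic plots, and shapiro.test() is one possible normality test for the residuals. The data are simulated with made-up values.

```r
set.seed(1)

# Toy data and model (made-up values).
df = data.frame(neigh.sp.rich = rep(1:12, times = 5))
df$litterfall = 50 + 20 * df$neigh.sp.rich + rnorm(nrow(df), mean = 0, sd = 30)
mod = lm(formula = litterfall ~ neigh.sp.rich, data = df)

# Base R diagnostic plots: residuals vs fitted (linearity), normal Q-Q
# (normality), scale-location (equal variances), residuals vs leverage.
par(mfrow = c(2, 2))
plot(mod, which = c(1, 2, 3, 5))
par(mfrow = c(1, 1))

# Formal normality test on the residuals (not on the raw data).
shapiro = shapiro.test(residuals(mod))
shapiro$p.value   # a small p-value suggests non-normal residuals
```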

Data transformation and outliers

Check outliers with performance: check_outliers(mod)

Data transformation: why?
- to make non-linear relationships linear
- to make non-normal distributions normal

[Figures: Barry et al 2019]

Data transformation: log-transformation of the explanatory variable

Compare the quality of the models: compare_performance(mod, mod.log)

AIC: fit quality, penalized by the number of variables
BIC: fit quality, penalized by the number of variables and the sample size
R2: fit quality, proportion of variance explained
RMSE: root mean squared error, the standard deviation of the residuals
Sigma: residual standard error

$\Delta AIC$ above a threshold between 2 and 8 (depending on the convention): the models are different

Can be complemented by an ANOVA (see the following lecture)

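The comparison can be sketched with base R on simulated data. The example below assumes that mod.log is the same model with a log-transformed predictor, which matches the formula shown in the results section but is not spelled out on the slides. The data are made up so that litterfall saturates with richness.

```r
set.seed(1)

# Toy data in which litterfall increases with log(richness) (made-up values).
df = data.frame(neigh.sp.rich = rep(1:12, times = 5))
df$litterfall = 50 + 55 * log(df$neigh.sp.rich) + rnorm(nrow(df), mean = 0, sd = 10)

# Untransformed model and model with a log-transformed predictor.
mod = lm(formula = litterfall ~ neigh.sp.rich, data = df)
mod.log = lm(formula = litterfall ~ log(neigh.sp.rich), data = df)

AIC(mod, mod.log)   # lower AIC means a better trade-off between fit and complexity
BIC(mod, mod.log)

# Difference in AIC between the two models.
delta.aic = AIC(mod) - AIC(mod.log)
delta.aic

# Proportion of variance explained by each model.
c(linear = summary(mod)$r.squared, log = summary(mod.log)$r.squared)
```

With the performance package installed, compare_performance(mod, mod.log) reports these indices in a single table.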
Extract your results

summary(mod)

$\mathrm{litterfall} \sim \mu + \alpha \times \log(\mathrm{neigh.sp.rich}) + \varepsilon$

Mean litterfall when log(neigh.sp.rich) = 0, i.e. with a single species = 50.852 +/- 18.304 g/m2 (Estimate +/- 1.96 x SE)
Effect of species richness = 53.960 +/- 15.958 g/m2 per log(number of species)

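The "Estimate +/- 1.96 x SE" intervals can be computed from the coefficient table. The sketch below does this on simulated data (made-up values, so the numbers differ from those quoted above) and compares the result with confint(), which uses t quantiles instead of 1.96.

```r
set.seed(1)

# Toy data and log model (made-up values).
df = data.frame(neigh.sp.rich = rep(1:12, times = 5))
df$litterfall = 50 + 55 * log(df$neigh.sp.rich) + rnorm(nrow(df), mean = 0, sd = 10)
mod.log = lm(formula = litterfall ~ log(neigh.sp.rich), data = df)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs = summary(mod.log)$coefficients

estimate = coefs[, "Estimate"]
half.width = 1.96 * coefs[, "Std. Error"]   # normal approximation of a 95% interval

data.frame(estimate,
           lower = estimate - half.width,
           upper = estimate + half.width)

# t-based 95% intervals, slightly wider than the 1.96 x SE approximation.
ci = confint(mod.log)
ci
```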
Extract your results

summary(mod)

DANGER ZONE: the factors

With a categorical predictor (species A, B, C, D):
lm(formula = litterfall ~ species, data = df)

$\mathrm{litterfall} \sim \alpha_1 \times \mathrm{species}_1 + \alpha_2 \times \mathrm{species}_2 + \alpha_3 \times \mathrm{species}_3 + \alpha_4 \times \mathrm{species}_4 + \varepsilon$

$\mathrm{species}_i$ is 0 or 1

The coefficients reported by summary(mod) are:
$\alpha_1$
$\alpha_2 - \alpha_1$
$\alpha_3 - \alpha_1$
$\alpha_4 - \alpha_1$

If you would like to test the differences between the levels of a factor, you need to do an ANOVA and a Tukey test:
mod = lm(formula = litterfall ~ species, data = df)
mod.aov = aov(mod)
TukeyHSD(mod.aov)

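The factor coding can be checked on a small simulated example. The group means below are made up, and aov() is called with the formula directly, which fits the same model as the lm() call. With R's default contrasts, the intercept is the mean of the first level and the other coefficients are differences from it.

```r
set.seed(1)

# Toy data: four species, ten samples each (made-up group means).
df = data.frame(species = factor(rep(c("A", "B", "C", "D"), each = 10)))
group.means = c(A = 100, B = 120, C = 150, D = 200)
df$litterfall = unname(group.means[as.character(df$species)]) +
  rnorm(nrow(df), mean = 0, sd = 15)

mod = lm(formula = litterfall ~ species, data = df)
coef(mod)   # (Intercept) = mean of A; speciesB = mean of B minus mean of A; ...

# Check against the observed group means.
tapply(df$litterfall, df$species, mean)

# ANOVA followed by Tukey's test for all pairwise differences.
mod.aov = aov(formula = litterfall ~ species, data = df)
summary(mod.aov)
tukey = TukeyHSD(mod.aov)
tukey   # one row per pair of species: difference, interval, adjusted p-value
```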
Extract your results

summary(mod)

Extract the coefficients: summary(mod)$coefficients

To extract the predictions from your model, use the ggeffects package:

pred = ggpredict(model = mod, terms = "neigh.sp.rich")

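For readers without ggeffects, predictions with a confidence band can also be sketched with base R's predict(). This is a substitute for ggpredict(), not the lecture's method, and the data and the helper name new.data are made up for illustration.

```r
set.seed(1)

# Toy data and model (made-up values).
df = data.frame(neigh.sp.rich = rep(1:12, times = 5))
df$litterfall = 50 + 20 * df$neigh.sp.rich + rnorm(nrow(df), mean = 0, sd = 30)
mod = lm(formula = litterfall ~ neigh.sp.rich, data = df)

# Predictor values at which to predict.
new.data = data.frame(neigh.sp.rich = 1:12)

# Predicted mean litterfall with a 95% confidence interval.
pred = predict(mod, newdata = new.data, interval = "confidence", level = 0.95)
pred = cbind(new.data, pred)   # columns: neigh.sp.rich, fit, lwr, upr
head(pred)

# Data, fitted line and confidence band.
plot(litterfall ~ neigh.sp.rich, data = df)
lines(pred$neigh.sp.rich, pred$fit)
lines(pred$neigh.sp.rich, pred$lwr, lty = 2)
lines(pred$neigh.sp.rich, pred$upr, lty = 2)
```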
Summary

In this lecture:

1. The stepwise process to analyze your data

2. Application on an example with R

3. Practical on your own

Your time to play

[Figure: panels of C loss (%) and N loss (%) against litter species richness and against litterfall]