The document analyzes the relationship between genetic diversity and migration distance using linear regression. It finds a strong negative correlation, with diversity decreasing as distance increases. The regression model explains 84.65% of variability in diversity. However, some assumptions of the model are not fully met, as variability is greater at higher distances and some outliers exist. The regression equation predicts that for every 1,000,000 km increase in distance, diversity decreases by 8 units. It also provides a prediction for an unsampled population's diversity that migrated 15,000 km, including a 95% confidence and prediction interval. An alternative ANOVA model to test for differences in mean diversity between regions is also described and its assumptions are checked, finding normality is met

Uploaded by

Joaquin

Question 1.

Analyze the relationship between Diversity and Migration Distance using linear
regression.

The correlation between these two variables is obtained by the following sequence: Stats / Basic
Statistics / Correlation.

Table 1.

Correlation between Diversity and Migration Distance.

Pearson correlation    -0.920
P-value                 0.000

There is a strong negative linear correlation between the Diversity and Migration Distance
variables (r = -0.920, p < .001): as the migration distance increases, the genetic diversity measure
decreases.
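
The significance of the correlation can be checked by hand. The sketch below converts r into a t statistic with n - 2 degrees of freedom. The sample size n = 46 is an assumption, inferred from the total of 45 degrees of freedom in the regression output of Table 2. The result should be close to the t-value reported for the Distance coefficient there, since both test the same hypothesis in a simple regression.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, referred to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


r = -0.920  # Pearson correlation from Table 1
n = 46      # assumption: inferred from the total DF of 45 in Table 2

t_stat = correlation_t_statistic(r, n)
print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

The small difference from the reported -15.58 comes from the correlation being rounded to three decimals.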

[Figure: Scatterplot for Distance and Diversity. x-axis: Distance (0 to 30000); y-axis: Diversity (0.50 to 0.80). Data from a study of human genetic diversity.]

Figure 1.
Scatterplot for migration distance and genetic diversity measure.

Table 2.

Regression Analysis

Regression Analysis: Diversity versus Distance

Analysis of Variance
Source           DF    Adj SS    Adj MS  F-Value  P-Value
Regression        1  0.112752  0.112752   242.66    0.000
  Distance        1  0.112752  0.112752   242.66    0.000
Error            44  0.020445  0.000465
  Lack-of-Fit    43  0.020375  0.000474     6.81    0.297
  Pure Error      1  0.000070  0.000070
Total            45  0.133197

Model Summary
S            R-sq    R-sq(adj)  R-sq(pred)
0.0215558  84.65%       84.30%      80.15%

Coefficients
Term           Coef   SE Coef  T-Value  P-Value   VIF
Constant    0.79861   0.00567   140.90    0.000
Distance  -0.000008  0.000000   -15.58    0.000  1.00

Regression Equation
Diversity = 0.79861 - 0.000008 Distance

Fits and Diagnostics for Unusual Observations
Obs  Diversity      Fit     Resid  Std Resid
  7    0.68880  0.62301   0.06579       3.24  R
  8    0.61590  0.59751   0.01839       0.93     X
  9    0.57060  0.58769  -0.01709      -0.88     X
 10    0.50180  0.58712  -0.08532      -4.40  R  X

R  Large residual
X  Unusual X

In Table 2 above, it is observed that the value of the coefficient of determination is R² = 0.8465.
Therefore, 84.65% of the total variability of diversity is explained by the regression model obtained.
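
As a consistency check, the summary statistics can be recomputed from the sums of squares printed in Table 2. This is only a sketch using the rounded values shown in the output, so the last digits may differ slightly.

```python
import math

# Values copied from the Analysis of Variance section of Table 2
ss_regression, df_regression = 0.112752, 1
ss_error, df_error = 0.020445, 44
ss_total, df_total = 0.133197, 45

ms_error = ss_error / df_error                       # mean square error
r_squared = ss_regression / ss_total                 # share of variability explained
r_squared_adj = 1.0 - ms_error / (ss_total / df_total)
f_value = (ss_regression / df_regression) / ms_error
s = math.sqrt(ms_error)                              # residual standard error

print(f"R-sq      = {100 * r_squared:.2f}%")
print(f"R-sq(adj) = {100 * r_squared_adj:.2f}%")
print(f"F-Value   = {f_value:.2f}")
print(f"S         = {s:.7f}")
```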

a. Are there any problems with fitting and testing the relationship?

In Figure 1 above, the shape of the point cloud confirms the existence of a strong negative linear
relationship between distance and the genetic diversity measure. However, the variability of
diversity seems to be greater at large distances, so the assumption of homoscedasticity does not
appear to be fully met.

Additionally, the Fits and Diagnostics for Unusual Observations section of Table 2 above lists four
observations: two of them have large residuals (observations 7 and 10), while three of them have been
identified as unusual X (observations 8, 9 and 10), that is, they have high leverage. Observation 10 is
flagged on both counts. Such points can affect the quality of the estimated regression model.

Finally, in the normal probability plot of the residuals shown in Figure 2 below, the residuals at
both ends move away from the line corresponding to the normal distribution, so the assumption of
normality of the residuals may not hold.

[Figure: Normal Probability Plot (response is Diversity). x-axis: Residual; y-axis: Percent.]

Figure 2.

Normal probability plot for residuals.

b. State the statistical model used and the Regression Equation

The statistical model corresponds to the following expression:

$$\text{Diversity} = \beta_0 + \beta_1\,\text{Distance} + u$$

Where

$\beta_0$: intercept parameter.

$\beta_1$: slope parameter.

$u$: unobserved random error or disturbance term.

The regression equation is given by:

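$$\widehat{\text{Diversity}} = 0.79861 - 0.000008 \times \text{Distance}$$

The slope is negative: for each additional 1,000 kilometers of migration distance, the predicted genetic
diversity measure decreases by about 0.008. This is arithmetically the same as 8 units per 1,000,000
kilometers, but that scale lies far outside the data, since the distance axis in Figure 1 extends only to
30,000 km and diversity ranges from about 0.50 to 0.80.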
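
The slope can also be read on a per-1,000 km scale, which matches the range of the data better. The sketch below evaluates the printed equation. It also backs out an approximate unrounded slope from the fit of 0.683409 reported at 15,000 km in Table 3. That implied slope is an inference from rounded output, not a value stated in the source.

```python
INTERCEPT = 0.79861
SLOPE_ROUNDED = -0.000008  # per km, as printed in the regression equation


def predict_diversity(distance_km: float, slope: float = SLOPE_ROUNDED) -> float:
    """Predicted diversity from the fitted line at a given migration distance."""
    return INTERCEPT + slope * distance_km


change_per_1000_km = SLOPE_ROUNDED * 1000

# Assumption: the fit reported in Table 3 uses the unrounded slope,
# so that slope can be approximately recovered from it.
reported_fit_15000 = 0.683409
implied_slope = (reported_fit_15000 - INTERCEPT) / 15000

print(f"Change per 1,000 km (rounded slope): {change_per_1000_km:.3f}")
print(f"Prediction at 15,000 km, rounded slope: {predict_diversity(15000):.5f}")
print(f"Implied unrounded slope: {implied_slope:.3e} per km")
```

With the rounded slope the prediction at 15,000 km is 0.67861, slightly below the reported fit, which is why predictions should be taken from the software output rather than from the printed equation.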

c. State the hypotheses tested and your conclusions on the significance of the regression equation.

When verifying the joint significance of the model, the following hypotheses are tested:

Ho: β1 = 0 vs. H1: β1 ≠ 0 (with a single predictor, the joint test reduces to a test of the slope)

The Analysis of Variance section of Table 2 above summarizes the results of the joint significance
test of the regression model. The p-value in the far right column of the table is compared with the
significance level of the test, which has been set at 0.05. Based upon the ANOVA results, we reject
the null hypothesis (F = 242.66; df = 1, 44; p < .001). Migration distance explains a significant share of
the variation in the genetic diversity measure.

The Coefficients section of Table 2 above details the tests of individual significance for the
coefficients of the regression model. The following hypotheses are tested:

For the constant, Ho: β0 = 0 vs. H1: β0 ≠ 0

Based upon the t-test result, the null hypothesis is rejected (t = 140.90, p < .001). The
intercept is significantly different from zero and should be included in the model to predict the
genetic diversity measure.

For distance, Ho: β1 = 0 vs. H1: β1 ≠ 0

Based upon the t-test result, the null hypothesis is rejected (t = -15.58, p < .001). The
regression coefficient associated with distance is significantly different from zero, so distance should
be included in the model to predict the genetic diversity measure.
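
Two quick checks tie the t-tests to the rest of the output. In a simple regression the overall F statistic equals the square of the slope's t statistic, and each t-value is the coefficient divided by its standard error. The sketch uses the rounded values printed in Table 2, so agreement is only approximate.

```python
# Values copied from Table 2
t_distance = -15.58
f_regression = 242.66
coef_constant, se_constant = 0.79861, 0.00567

# With one predictor, F for the regression equals t squared for the slope
t_squared = t_distance ** 2

# A t-value is the estimated coefficient divided by its standard error
t_constant = coef_constant / se_constant

print(f"t^2 for Distance = {t_squared:.2f} (reported F = {f_regression})")
print(f"t for Constant   = {t_constant:.2f} (reported 140.90)")
```

The same division cannot be reproduced for the slope, because its standard error is printed as 0.000000 after rounding.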

d. Predict diversity for a single unsampled population that would have had to migrate 15,000km
and provide an interval in which you are 95% confident it should fall.

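The prediction for diversity is detailed in Table 3 below. The point value predicted by the model
for diversity at 15,000 km is 0.683409. The 95% confidence interval for the mean diversity at that
distance is (0.675002, 0.691817). Because the question concerns a single unsampled population, the
relevant interval is the 95% prediction interval, (0.639160, 0.727658).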

Table 3.

Confidence interval and prediction interval for diversity.

Prediction for Diversity

Regression Equation
Diversity = 0.79861 - 0.000008 Distance

Settings
Variable  Setting
Distance    15000

Prediction
     Fit     SE Fit                95% CI                95% PI
0.683409  0.0041718  (0.675002, 0.691817)  (0.639160, 0.727658)
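
The two intervals differ only in the standard error used. The confidence interval uses the standard error of the fit, while the prediction interval adds the residual variance S² for a new observation. The sketch below reproduces both from the printed Fit, SE Fit and S. The critical value 2.0154 is hard-coded as an assumption, being the approximate 97.5th percentile of the t distribution with 44 degrees of freedom.

```python
import math

fit = 0.683409       # Fit from Table 3
se_fit = 0.0041718   # SE Fit from Table 3
s = 0.0215558        # S from the Model Summary in Table 2
t_crit = 2.0154      # assumption: approximate t(0.975, df = 44)


def interval(center: float, half_width: float) -> tuple[float, float]:
    """Symmetric interval around a center value."""
    return (center - half_width, center + half_width)


# Confidence interval for the mean response at Distance = 15000
ci = interval(fit, t_crit * se_fit)

# Prediction interval for a single new observation at Distance = 15000
pi = interval(fit, t_crit * math.sqrt(s ** 2 + se_fit ** 2))

print(f"95% CI: ({ci[0]:.6f}, {ci[1]:.6f})")
print(f"95% PI: ({pi[0]:.6f}, {pi[1]:.6f})")
```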

Question 2. An alternative approach would be to test for differences in mean diversity between the
regions (Africa, Asia, et cetera) by ANOVA.

a. State the model you are fitting,

Following Montgomery (2013, p. 69), the following means model will be fitted:

$$y_{ij} = \mu_i + \epsilon_{ij}$$

i = 1, 2, …, 6

j = 1, …, $n_i$

Where:

i: region.

j: observation number within the region.

$y_{ij}$: j-th observation of region i.

$\mu_i$: mean of diversity for the i-th region.

$\epsilon_{ij}$: random error component.

$n_i$: number of observations in region i.

It is necessary to emphasize that the number of observations taken within each region is different.
Therefore the design is unbalanced.
b. Check all the assumptions.

The assumptions of the model, as expressed by Montgomery (2013), are the following: “the
errors are normally and independently distributed with mean zero and constant but unknown variance”
(Montgomery, 2013, p. 80).

The assumption of normality is checked with the normal probability plot and the histogram of the
residuals presented in Figure 3 below. The normal probability plot is similar to the residual plot
presented in Figure 2. Additionally, the histogram of the residuals is approximately symmetric and
centered at zero. Therefore, the assumption of normality is considered to be met.

To verify the assumption of independence of the residuals, the residuals are plotted against the
order in which the data were collected. This is the Versus Order chart in Figure 3 below. No
systematic pattern is observed in the residuals, so there is no evidence against this assumption.

To verify that the assumption of homoscedasticity is fulfilled, the Versus Fits chart in Figure 3
below is used. This graph presents a funnel pattern, which shows that the variance of the residuals
decreases as the fitted value increases; therefore, the constant variance assumption is not
fulfilled. Consequently, a transformation of the data is needed.

[Figure: Residual Plots for Diversity, four panels. Normal Probability Plot (Percent vs. Residual); Versus Fits (Residual vs. Fitted Value); Histogram (Frequency vs. Residual); Versus Order (Residual vs. Observation Order).]

Figure 3.

Residual plots for diversity.
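
The funnel pattern can be cross-checked against the group standard deviations reported in Table 5. The sketch below lists the regions by mean and computes the ratio of the largest to the smallest standard deviation. The values assume that digits wrapped onto the next line in Table 5 belong to the StDev column (for example 0.00724 for Africa). Comparing this ratio with a threshold of about 2 is only an informal rule of thumb, not a formal test.

```python
# (mean, standard deviation) per region, read from Table 5.
# Assumption: wrapped digits in the table belong to the StDev column.
region_stats = {
    "Africa": (0.77608, 0.00724),
    "Americas": (0.5989, 0.0688),
    "Asia": (0.73097, 0.01737),
    "Europe": (0.75253, 0.00334),
    "Middle_East": (0.75737, 0.00742),
    "Oceania": (0.6836, 0.0216),
}

# List regions from lowest to highest mean to see how the spread changes
for region, (mean, sd) in sorted(region_stats.items(), key=lambda kv: kv[1][0]):
    print(f"{region:<12} mean = {mean:.5f}  sd = {sd:.5f}")

largest = max(region_stats, key=lambda r: region_stats[r][1])
smallest = min(region_stats, key=lambda r: region_stats[r][1])
sd_ratio = region_stats[largest][1] / region_stats[smallest][1]

print(f"Largest SD: {largest}, smallest SD: {smallest}, ratio = {sd_ratio:.1f}")
```

The region with the lowest mean (the Americas) has by far the largest spread, which agrees with the variance decreasing as the fitted value increases.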

c. State the hypotheses you tested and your conclusions

The following hypotheses are tested:

Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6 vs. H1: µi ≠ µj for at least one pair (i, j).

Based upon the ANOVA results shown in Table 4 below, we reject the null hypothesis (F = 32.67;
df = 5, 40; p < .001). The treatment means differ; that is, region significantly affects the mean of
the genetic diversity measure.

Table 4.

Analysis of Variance and model summary.

Source  DF   Adj SS    Adj MS  F-Value  P-Value
Region   5  0.10700  0.021400    32.67    0.000
Error   40  0.02620  0.000655
Total   45  0.13320

Model Summary
S            R-sq    R-sq(adj)  R-sq(pred)
0.0255923  80.33%       77.87%      70.62%

d. Provide an appropriate summary of which means are different from which others.

The 95% confidence intervals for the mean diversity of each region are shown in Table 5.
These intervals give a first indication of which regions have different means: intervals that do not
overlap correspond to regions with different mean diversity. This can be better appreciated in
Figure 5. Additionally, Tukey's multiple comparisons test, which is detailed in Table 6, indicates that
the regions of Africa, Middle East and Europe have similar means and form group A. In addition, the
regions of Middle East, Europe and Asia have similar means and form group B. The regions of Asia
and Oceania form group C, and finally the Americas form group D and differ from all the other regions.

The highest mean diversity is associated with the populations of Africa, Europe and the
Middle East, which supports the idea that greater diversity corresponds to shorter migration distances
from Africa, while the lowest means correspond to Asia, Oceania and the Americas, which are the
regions farthest from Africa.

Table 5.

Descriptive statistics by region.

Region        N     Mean    StDev              95% CI
Africa        5  0.77608  0.00724  (0.75295, 0.79921)
Americas      5   0.5989   0.0688    (0.5758, 0.6220)
Asia         22  0.73097  0.01737  (0.71995, 0.74200)
Europe        8  0.75253  0.00334  (0.73424, 0.77081)
Middle_East   4  0.75737  0.00742  (0.73151, 0.78324)
Oceania       2   0.6836   0.0216    (0.6471, 0.7202)

Pooled StDev = 0.0255923
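
The ANOVA F statistic can be rebuilt from the group sizes, means and standard deviations in Table 5 alone. This is a sketch under the assumption that the wrapped digits in Table 5 are read as N = 22 for Asia and as the last digit of each StDev; the result should be close to the values in Table 4, up to rounding.

```python
# (n, mean, standard deviation) per region, read from Table 5.
# Assumption: wrapped digits are assigned to the N and StDev columns.
groups = {
    "Africa": (5, 0.77608, 0.00724),
    "Americas": (5, 0.5989, 0.0688),
    "Asia": (22, 0.73097, 0.01737),
    "Europe": (8, 0.75253, 0.00334),
    "Middle_East": (4, 0.75737, 0.00742),
    "Oceania": (2, 0.6836, 0.0216),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-group and within-group sums of squares from summary statistics
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_value = (ss_between / df_between) / (ss_within / df_within)

print(f"SS Region = {ss_between:.5f} (DF = {df_between})")
print(f"SS Error  = {ss_within:.5f} (DF = {df_within})")
print(f"F-Value   = {f_value:.2f}")
```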
Table 6.

Mean comparison by region.

Tukey Pairwise Comparisons


Grouping Information Using the Tukey Method and 95% Confidence
Region        N     Mean  Grouping
Africa        5  0.77608  A
Middle_East   4  0.75737  A B
Europe        8  0.75253  A B
Asia         22  0.73097    B C
Oceania       2   0.6836      C
Americas      5   0.5989        D
Means that do not share a letter are significantly different.

[Figure: Interval Plot of Diversity vs Region, 95% CI for the Mean. x-axis: Region (Africa, Americas, Asia, Europe, Middle_East, Oceania); y-axis: Diversity. The pooled standard deviation is used to calculate the intervals.]

Figure 5.

Interval plot of diversity by region.
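
Because the intervals use the pooled standard deviation, their widths depend only on the group size. The sketch below recomputes them as mean ± t × (pooled StDev / √n). The critical value 2.0211 is hard-coded as an assumption, being the approximate 97.5th percentile of the t distribution with 40 degrees of freedom.

```python
import math

pooled_sd = 0.0255923  # Pooled StDev from Table 5
t_crit = 2.0211        # assumption: approximate t(0.975, df = 40)

# (n, mean) per region, read from Table 5
groups = {
    "Africa": (5, 0.77608),
    "Americas": (5, 0.5989),
    "Asia": (22, 0.73097),
    "Europe": (8, 0.75253),
    "Middle_East": (4, 0.75737),
    "Oceania": (2, 0.6836),
}


def pooled_ci(n: int, mean: float) -> tuple[float, float]:
    """95% CI for a group mean using the pooled standard deviation."""
    half_width = t_crit * pooled_sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)


intervals = {region: pooled_ci(n, mean) for region, (n, mean) in groups.items()}

for region, (low, high) in intervals.items():
    print(f"{region:<12} ({low:.5f}, {high:.5f})")
```

Asia has the narrowest interval because it has the most observations, and Oceania the widest because it has only two.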

Question 3. Discuss which of regression or ANOVA is the better in this case. Are both valid and
equally useful in respect of the aim of the study?

In the first question, a simple linear regression model was constructed with one explanatory
variable and one response variable. Both variables are continuous, so the regression model was
the most adequate procedure. However, if the variable region were included in the regression model,
the conclusions would be similar to those of question 2. The ANOVA was adequate to
construct a model with a continuous response variable and a qualitative explanatory variable, with the
advantage over the simple regression model in question 1 that conclusions can be obtained regarding the
groups defined by the regions. Both models are valid and useful for the aim of the study.

Reference
Montgomery, D. C. (2013). Design and analysis of experiments (Eighth ed.). Hoboken: John Wiley &
Sons, Inc.
