0% found this document useful (0 votes)

13 views51 pages

Statistics, Statistical Modeling and Data Analytics

SSMDA IPU Unit 2

Uploaded by

cedotes304

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views51 pages

Statistics, Statistical Modeling and Data Analytics

SSMDA IPU Unit 2

Uploaded by

cedotes304

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 51

Subject: SSMDS Code:DA-304T

Unit-2:Statistical Modelling

An introduction to linear models

A linear model is a function, like that is fit to a set of data, often to model a process that
generated the data or something like the data. Linear models are mathematical models that
describe the relationship between a dependent variable and one or more independent variables
using a linear equation. These models are widely used in statistics, machine learning, and
econometrics for prediction and inference.

1. Simple Linear Regression (SLR)

Equation:

Y=β0+β1X+ε

Where:

• Y= Dependent variable (response)

• X= Independent variable (predictor)

• β0 = Intercept
• β1= Slope
• ε = Error term

Use Cases:

• Predicting sales based on advertising spend

• Estimating house prices based on size

2. Multiple Linear Regression (MLR)

Equation:

Y=β0+β1X1+β2X2+⋯+βnXn+ε

Where:

• X1,X2,...,Xn are multiple independent variables

• β1,β2,...,βn are their corresponding coefficients
Use Cases:

• Predicting customer churn based on demographics, usage, and behavior

• Estimating stock prices using multiple economic indicators

3. Generalized Linear Models (GLM)

Extends linear regression to allow for response variables that have distributions other than
normal. Common types:
• Logistic Regression: Used for binary outcomes (e.g., pass/fail, spam/not spam)

• Poisson Regression: Used for count data (e.g., number of customer complaints)

4. Assumptions of Linear Models

1. Linearity: The relationship between independent and dependent variables is linear.

2. Independence: Observations are independent of each other.

3. Homoscedasticity: Constant variance of residuals (errors).

4. Normality: Residuals are normally distributed.

5. No Multicollinearity: Independent variables should not be highly correlated.

5. Applications of Linear Models

• Finance: Predicting stock prices, credit scoring

• Healthcare: Estimating patient recovery time based on treatment

• Marketing: Understanding the impact of advertising spend on sales

All students are familiar with the idea of a linear model from learning the equation of a line,
which is

Y=mX+b (8.1)
where mm is the slope of the line and bb is the YY-intercept. It is useful to think of
equation (8.1) as a function that maps values of XX to values of YY. Using this function, if we
input some value of XX, we always get the same value of Y as the output.

A linear model is a function, like that in equation (8.1), that is fit to a set of data, often to model
a process that generated the data or something like the data. The line in Figure 8.1A is just that,
a line, but the line in Figure 8.1B is a linear model fit to the data in Figure 8.1B.
Figure 8.1: A line vs. a linear model. (A) the line y=−3.48X+105.7y=−3.48X+105.7 is drawn.
(B) A linear model fit to the data. The model coefficients are numerically equal to the slope and
intercept of the line in A.

8.1 Two specifications of a linear model

8.1.1 The “error draw” specification

In introductory textbooks, a linear model is typically specified using an error-draw scheme.

Y=β0+β1X+εε∼N(0,σ2)(8.2)Y=β0+β1X+ε(8.2)ε∼N(0,σ2)

The first line of this specification has two components: the linear
predictor Y=β0+β1XY=β0+β1X and the error εε. The linear predictor component looks like
the equation for a line except that 1) β0β0 is used for the intercept and β1β1 for the slope and
2) the intercept term precedes the slope term. This re-labeling and re-arrangement make the
notation for a linear model more flexible for more complicated linear models. For
example Y=β0+β1X1+β2X2+εY=β0+β1X1+β2X2+ε is a model where YY is a function of
two XX variables.

The linear predictor is the deterministic or systematic part of the specification. As with the
equation for a line, the linear predictor component of a linear model is a function that maps a
specific value of XX to a unique value of YY. This mapped value is the expected value, or
expectation, given a specific input value of XX. The expectation is often written
as E[Y|X]E[Y|X], which is read as “the expected value of YY given XX”, where “given X”
means a specific value of X. This text will often use the word conditional in place of “given”.
For example, I would read E[Y|X]E[Y|X] as “the expected value of YY conditional on XX”. It
is important to recognize that E[Y|X]E[Y|X] is a conditional mean – it is the mean value
of YY when we observe that XX has some specific value xx (that is X=xX=x).

The second line of the specification (8.2) is read as “epsilon is distributed as Normal with mean
zero and variance sigma squared”. This line explicitly specifies the distribution of the error
component of line 1. The error component of a linear model is a random “draw” from a normal
distribution with mean zero and variance σ2σ2. The second line shows that the error component
of the first line is stochastic. Using the error-model specification, we can think of any
measurement of YY as an expected value plus some random value sampled from a normal
distribution with a specified variance. Because the stochastic part of this specification draws
an “error” from a population, I refer to this as the error-draw specification of the linear model.

8.1.2 The “conditional draw” specification

A second way of specifying a linear model is using a conditional-draw scheme.

yi∼N(μi,σ2)E(Y|X)=μμi=β0+β1xi(8.3)yi∼N(μi,σ2)E(Y|X)=μ(8.3)μi=β0+β1xi

The first line states that the response variable YY is a random variable independently drawn
from a normal distribution with mean μμ and variance σ2σ2. This first line is
the stochastic part of the statistical model. The second line simply states that μμ (the greek
letter “mu”) from the first line is the conditional mean (or expectation). The third line is the
liner predictor, which states how μiμi is generated given that X=xiX=xi. Again, the linear
predictor is the systematic (or deterministic) part of the statistical model. It is systematic
because the same value of xixi will always generate the same μiμi. With the conditional-draw
specification, we can think of a measurement (yiyi) as a random draw from the specified
distribution. Because it is YY and not some “error” that is drawn from the specified
distribution, I refer to this as the conditional-draw specification of the linear model.
8.1.3 Comparing the error-draw and conditional-draw ways of specifying the linear
model
These two ways of specifying the model encourage slightly different ways of thinking about
how the data (the response varible YY) were generated. The error-draw specification
“generates” data by 1) constructing what yiyi “should be” given xixi (this is the conditional
expection), then 2) adding some error eiei drawn from a normal distribution with mean zero
and some specified variance. The conditional-draw specification “generates” data by 1)
constructing what yiyi “should be” given xixi, then 2) drawing a random variable from some
specified distribution whose mean is this expectation. This random draw is not “error” but the
measured value yiyi. For the error draw generation, we need only one hat of random numbers,
but for the conditional draw generation, we need a hat for each value of xixi.

Here is a short script that generates data by implementing both the error-draw and conditional-
draw specifications. See if you can follow the logic of the code and match it to the meaning of
these two ways of specifying a linear model.
n <- 5
b_0 <- 10.0

b_1 <- 1.2

sigma <- 0.4

x <- 1:n
y_expected <- b_0 + b_1*x

# error-draw. Note that the n draws are all from the same distribution

set.seed(1)

y_error_draw <- y_expected + rnorm(n, mean = 0, sd = sigma)

# conditional-draw. Note that the n draws are each from a different

# distribution because each has a different mean.
set.seed(1)

y_conditional_draw <- rnorm(n, mean = y_expected, sd = sigma)

data.table(X = x,

"Y (error draw)" = y_error_draw,

"Y (conditional draw)" = y_conditional_draw)

## X Y (error draw) Y (conditional draw)

## <int> <num> <num>

## 1: 1 10.94942 10.94942

## 2: 2 12.47346 12.47346

## 3: 3 13.26575 13.26575

## 4: 4 15.43811 15.43811

## 5: 5 16.13180 16.13180

rnorm() is a pseudorandom number generator that simulates random draws from a normal
distribution with the specified mean and variance. The algorithm to generate the numbers is
entirely deterministic – the numbers are not truly random but are “pseudorandom”. The list of
numbers returned closely approximates a set of true, random numbers. The sequence of
numbers returned is determined by the “seed”, which can be set with the set.seed() function (R
will use an internal seed if not set by the user).
The error-draw specification is not very useful for thinking about data generation for data
analyzed by generalized linear models, which are models that allow one to specify distribution
families other than Normal (such as the binomial, Poisson, and Gamma families). In fact,
thinking about a model as a predictor plus error can lead to the misconception that, in a
generalized linear model, the error (or residuals from the fit) has a distribution from the non-
Normal distribution modeled. This cannot be true because the distributions modeled using
generalized linear models (other than the Normal) do not have negative values (some residuals
must have negative values since the mean of the residuals is zero). Introductory biostatistics
textbooks typically only introduce the error-draw specification because introductory textbooks
recommend data transformation or non-parametric tests if the data are not approximately
normal. This is unfortunate because generalized linear models are extremely useful for real
biological data.

Although a linear model (or statistical model more generally) is a model of a data-generating
process, linear models are not typically used to actually generate any data. Instead, when we
use a linear model to understand something about a real dataset, we think of our data as one
realization of a process that generates data like ours. A linear model is a model of that process.
That said, it is incredibly useful to use linear models to create fake datasets for at least two
reasons: to probe our understanding of statistical modeling generally and, more specifically, to
check that a model actually creates data like that in the real dataset that we are analyzing.

8.1.4 ANOVA notation of a linear model

Many textbooks treat ANOVA differently from regression and express a linear model as an
ANOVA model (and generally do not use the phrase “linear model”). ANOVA models are all
variations of

yij=μ+τi+εij(8.4)(8.4)yij=μ+τi+εij

Unlike the error and conditional draw specifications above, the ANOVA model doesn’t have a
linear predictor in the form of a regression equation (or the equation for a line) – that is, there
are neither XX variables nor coefficients (ββ). Instead, the ANOVA model is made up of a
linear combination of means and deviations from means. In (8.4), μμ is the grand mean (the
mean of the means of the groups), τiτi is the deviation of the mean of group ii from the grand
mean (these are the effects), and εijεij is the deviation (or error) of individual jj from the mean
of group ii. Traditional ANOVA computes effects and the statistics for inference by computing
means and deviations from means. Modern linear models compute effects and the statistics for
inference by solving for the coefficients of a regression model.

8.2 A linear model can be fit to data with continuous, discrete, or categorical XX variables
In the linear model fit to the data in Figure 8.1B, the XX variable is continuous, which can
take any real number between the minimum XX and maximum XX in the data. For biological
data, most variables that are continuous are positive, real numbers (a zero is not physically
possible but could be recorded in the data if the true value is less than the minimum measurable
amount). One exception is a composition (the fraction of a total), which can be zero. Negative
values can occur with variables in which negative represent a direction (work, electrical
potential) or a rate. Discrete variables are numeric but limited to certain real numbers. Most
biological variables that are discrete are counts, and can be zero, but not negative. Categorical
variables are non-numeric descriptions of a measure. Many of the categorical variables in this
text will be the experimentally controlled treatment variable of interest (the
variable treatmenttreatment containing the values “wild type” and “knockout”) but some are
measured covariates (the variable sexsex containing the values “female” and “male”).

8.2.1 Fitting linear models to experimental data in which the XX variable is continuous
or discrete

A linear model fit to data with a numeric (continous or discrete) XX is classical regression and
the result is typically communicated by a regression line. The experiment introduced in
Chapter ?? [Linear models with a single, continuous X] is a good example. In this experiment,
the researchers designed an experiment to measure the effect of warming on the timing of
photosynthetic activity. Temperature was experimentally controlled at one of five settings (0,
2.25, 4.5, 6.75, or 9 °C above ambient temperature) within twelve, large enclosures. The
response variable in the illustrated example is Autumn “green-down”, which is the day of year
(DOY) of the transition to loss of photosynthesis. The intercept and slope parameters of the
regression line (Figure 8.2) are the coefficients of the linear model. The slope (4.98 days per 1
°C added warmth) estimates the effect of warming on green-down DOY. What is not often
appreciated at the introductory biostatistics level is that the slope is a difference in conditional
means. Any point on a regression line is the expected value of YY at a specified value of XX,
that is, the conditional mean E(Y|X)E(Y|X). The slope is the difference in expected values
for a pair of points that differ in XX by one unit.

b1=E(Y|X=x+1)−E(Y|X=x+1)b1=E(Y|X=x+1)−E(Y|X=x+1)

I show this in Figure 8.2 using the points on the regression line at x=5x=5 and x=6x=6.
Thinking about a regression coefficient as a difference in conditional means is especially useful
for understanding the coefficients of a categorical XX variable, as described below.
Figure 8.2: Illustration of the slope in a linear model with a numeric X. The slope (the
coefficient of X) is the difference in expected value for any two X that are one unit apart. This
is illustrated for the points on the line at x = 5 and x = 6.

8.2.2 Fitting linear models to experimental data in which the XX variable is categorical
Linear models can be fit to experimental data in which the XX variable is categorical – this is
the focus of this text! For the model fit to the data in Figure 8.1B, the coefficient of XX is the
slope of the line. Perhaps surprisingly, 1) we can fit a model like equation (8.2) to data in which
the XX variable is categorical and 2) the coefficient of XX is a slope. How is this possible?
The slope of a line is y2−y1x2−x1y2−y1x2−x1 where (x1,y1)(x1,y1) and (x2,y2)(x2,y2) are
the graph coordinates of any two points on the line. What is the denominator of the slope
function (x2−x1)(x2−x1) when XX is categorical?

The solution to using a linear model with categorical XX is to recode the factor levels into
numbers. An example of this was outlined in Chapter ?? (Analyzing experimental data with a
linear model). The value of XX for individual mouse i is a number that indicates the treatment
assignment – a value of 0 is given to mice with a functional ASK1 gene and a value of 1 is
given to mice with a knocked out gene. The regression line goes through the two group means
(Figure 8.3). With the (0, 1)
coding, ¯¯¯xASK1Δadipo−¯¯¯xASK1F/F=1x¯ASK1Δadipo−x¯ASK1F/F=1, so the
denominator of the slope is equal to one and the slope is simply equal to the
numerator ¯¯¯yASK1Δadipo−¯¯¯yASK1F/Fy¯ASK1Δadipo−y¯ASK1F/F. The coefficient
(which is a slope!) is the difference in conditional means.
Figure 8.3: Illustration of the slope in a linear model with categorical X. The slope (the
coefficient of X) is the difference in conditional means.
8.3 Statistical models are used for prediction, explanation, and description

Researchers typically use statistical models to understand relationships between one or

more YY variables and one or more XX variables. These relationships include

1. Descriptive modeling. Sometimes a researcher merely wants to describe the

relationship between YY and a set of XX variables, perhaps to discover patterns. For
example, the arrival of a spring migrant bird (YY) as a function of sex (X1X1) and age
(X2X2) might show that males and younger individuals arrive earlier. Importantly, if
another XX variable is added to the model (or one dropped), the coefficients, and
therefore, the precise description, will change. That is, the interpretation of a coefficient
as a descriptor is conditional on the other covariates (XX variables) in the model. In a
descriptive model, there is no implication of causal effects and the goal is not prediction.
Nevertheless, it is very hard for humans to discuss a descriptive model without using
causal language, which probably means that it is hard for us to think of these models
as mere description. Like natural history, descriptive models are useful as patterns in
want of an explanation, using more explicit causal models including experiments.
2. Predictive modeling. Predictive modeling is very common in applied research. For
example, fisheries researchers might model the relationship between population density
and habitat variables to predict which subset of ponds in a region are most suitable for
brook trout (Salvelinus fontinalis) reintroduction. The goal is to build a model with
minimal prediction error, which is the error between predicted and actual values for a
future sample. In predictive modeling, the XX (“predictor”) variables are largely
instrumental – how these are related to YY is not a goal of the modeling, although
sometimes an investigator may be interested in the relative importance among
the XX for predicting YY (for example, collecting the data may be time consuming, or
expensive, or enviromentally destructive, so know which subset of XX are most
important for predicting YY is a useful strategy).

3. Explanatory (causal) modeling. Very often, researchers are explicitly interested

in how the XX variables are causally related to YY. The fisheries researchers that want
to reintroduce trout may want to develop and manage a set of ponds to maintain healthy
trout populations. This active management requires intervention to change habitat traits
in a direction, and with a magnitude, to cause the desired response. This model is
predictive – a specific change in XX predicts a specific response in YY – because the
coefficients of the model provide knowledge on how the system functions – how
changes in the inputs cause change in the output. Causal interpretation of model
coefficients requires a set of strong assumptions about the XX variables in the model.
These assumptions are typically met in experimental designs but not observational
designs.

With observational designs, biologists are often not very explicit about which of these is the
goal of the modeling and use a combination of descriptive, predictive, and causal language to
describe and discuss results. Many papers read as if the researchers intend explanatory
inference but because of norms within the biology community, mask this intention with
“predictive” language. Here, I advocate embracing explicit, explanatory modeling by being
very transparent about the model’s goal and assumptions.

8.4 What is the interpretation of a regression coefficient?

A regression coefficient is the difference in YY that we expect to see if we see a one unit
difference in X, but we see no difference in any other covariate (the other X).
8.5 What do we call the XX and YY variables?

The inputs to a linear model (the XX variables) have many names. In this text, the XX variables
are typically
• treatment variables – this term makes sense only for categorical variables and is often
used for variables that are a factor containing the treatment assignment (for example
“control” and “knockout”)

• factor variables (or simply, factors) – again, this term makes sense only for categorical
variables

• covariates – this term is usually used for the non-focal XX variables in a statistical
model.

A linear model is a regression model and in regression modeling, the XX variables are typically
called
• independent variables (often shortened to IV) – “independent” in the sense that in a
statistical model at least, the XX are not a function of YY.

• predictor variables (or simply, “predictors”) – this makes the most sense in prediction
models.

• explanatory variables – this term is usually applied in observational designs and is

best used if the explicit goal is causal modeling.

In this text, the output of a linear model (the YY variable or variables if the model is
multivariate) will most often be calle either of

• response variable (or simply, “response”)

• outcome variable (or simply, “outcome”)

These terms have a causal connotation in everyday english. These terms are often used in
regression modeling with observational data, even if the model is not explicitly causal. On
other term, common in introductory textbooks, is
• dependent variable – “dependent” in the sense that in a statistical model at least,
the YY is a function of the XX.

8.6 Modeling strategy

A “best practice” sequence of steps used throughout this text to analyze experimental data is

1. examine the data using exploratory plots to

• examine individual points and identify outliers that are likely due to data transcription
errors or measurement blunders

• examine outlier points that are biologically plausible, but raise ref flags about undue
influe on fit models. This information is used to inform the researcher on the strategy
to handle outliers in the statistical analysis, including algorithms for excluding data or
implementation of robust methods.
• provide useful information for initial model filtering (narrowing the list of potential
models that are relevant to the question and data). Statistical modeling includes a
diverse array of models, yet almost all methods used by researchers in biology, and all
models in this book, are generalizations of the linear model specified in (8.3). For some
experiments, there may be multiple models that are relevant to the question and data.
Model checking (step 3) can help decide which model to ultimately use.

2. fit the model, in order to estimate the model parameters and the uncertainty in these
estimates.

3. check the model, which means to use a series of diagnostic plots and computations of
model output to check that the fit model reasonably approximates the data. If the
diagnostic plots suggest a poor approximation, then choose a different model and go
back to step 2.
4. inference from the model, which means to use the fit parameters to learn, with
uncertainty, about the system, or to predict future observations, with uncertainty.

5. plot the model, which means to plot the data, which may be adjusted, and the estimated
parameters (or other results dervived from the estimates) with their uncertainty.

Note that step 1 (exploratory plots) is not data mining, or exploring the data for patterns to test.

8.7 Predictions from the model

For the linear model specified in Model (8.2), the fit model is

yi=b0+b1xi+ei(8.5)(8.5)yi=b0+b1xi+ei

where b0b0 and b1b1 are the coefficients of the fit model and the eiei are the residuals of the
fit model. We can use the coefficients and residuals to recover the yiyi, although this would
rarely be done. More commonly, we could use the coefficients to calculate conditional means
(the mean conditional on a specified value of XX).

^yi=b0+b1xi(8.6)(8.6)y^i=b0+b1xi

The conditional means are typically called fitted values, if the XX are the XX used to fit the
model, or predicted values, if the XX are new. “Predicted values” is often shortened to “the
prediction”.

8.8 Inference from the model

If our goal is inference, we want to use the fit parameters to learn, with uncertainty, about the
system. Using equation (8.5), the coefficients b0b0 and b1b1 are point estimates of the true,
generating parameters β0β0 and β1β1, the eiei are estimates of εiεi (the true, biological
“noise”), and ∑e2iN−2∑ei2N−2 is an estimate of the true, population variance σ2σ2 (this will
be covered more in chapter xxx but you may recognize that ∑e2iN−2∑ei2N−2 is the formula
for a variance). And, using equation (8.6), ^yiy^i is the point estimate of the parameter μiμi (the
true mean conditional on X=xiX=xi). Throughout this text, Greek letters refer to a theoretical
parameter and Roman letters refer to point estimates.

Our uncertainty in the estimates of the parameters due to sampling is the standard error of the
estimate. It is routine to report standard errors of means and coefficients of the model. While a
standard error of the estimate of σσ is available, this is effectively never reported, at least in
the experimental biology literature, presumably because the variance is thought of as a nuisance
parameter (noise) and not something worthy of study. This is a pity. Certainly treatments can
effect the variance in addition to the mean.

Parametric inference assumes that the response is drawn from some probability distribution
(Normal, or Poisson, or Bernouli, etc.). Throughout this text, I emphasize reporting and
interpreting point estimates and interval estimates of the point estimate. A confidence
interval is a type of interval estimate. A confidence interval of a parameter is a measure of the
uncertainty in the estimate. A 95% confidence interval has a 95% probability (in the sense of
long-run frequency) of containing the parameter. This probability is a property of the
population of intervals that could be computed using the same sampling and measuring
procedure. It is not correct, without further assumptions, to state that there is a 95% probability
that the parameter lies within the interval. Perhaps a more useful interpretation is that the
interval is a compatability interval in that it contains the range of estimates that are
compatible with the data, in the sense that a tt-test would not reject the null hypothesis of a
difference between the estimate and any value within the interval (this interpretation does not
imply anything about the true value).

Another kind of inference is a significance test, which is the computation of the probability of
“seeing the data” or something more extreme than the data, given a specified null hypothesis.
This probability is the p-value, which can be reported with the point estimate and confidence
interval. There are some reasonable arguments made by very influential statisticians that p-
values are not useful and lead researchers into a quagmire of misconceptions that impede good
science. Nevertheless, the current methodology in most fields of Biology have developed in a
way to become completely dependent on p-values. I think at this point, a p-value can be a
useful, if imperfect tool in inference, and will show how to compute p-values throughout this
text.
Somewhat related to a significance test is a hypothesis test, or a Null-Hypothesis Signficance
Test (NHST), in which the pp-value from a significance test is compared to a pre-specified
error rate called αα. Hypothesis testing was developed as a formal means of decision making
but this is rarely the use of NHST in experimental biology. For almost all applications of p-
values that I see in the literature that I read in ecology, evolution, physiology, and wet-bench
biology, comparing a pp-value to αα adds no value to the communication of the results.

8.8.1 Assumptions for inference with a statistical model

1. The data were generated by a process that is “linear in the parameters”, which means
that the different components of the model are added together. This additive part of the
model containing the parameters is the linear predictor in
specifications (8.2) and (8.3) above. For example, a cubic polynomial model

E(Y|X)=β0+β1X+β2X2+β3X3E(Y|X)=β0+β1X+β2X2+β3X3

is a linear model, even though the function is non-linear, because the different components are
added. Because a linear predictor is additive, it can be compactly defined using matrix algebra

E(Y|X)=XβE(Y|X)=Xβ
where mathbfXmathbfX is the model matrix and ββ is the vector of parameters. We discuss
these more in chapter xxx.

A Generalized Linear Model (GLM) has the form g(μi)=ηig(μi)=ηi where ηη (the Greek
letter “eta”) is the linear predictor

η=Xβη=Xβ
GLMs are extensions of linear models. There are non-linear models that are not linear in the
parameters, that is, the predictor is not a simple dot product of the model matrix and a vector
of parameters. For example, the Michaelis-Menten model is a non-linear model

E(Y|X)=β1Xβ2+XE(Y|X)=β1Xβ2+X

that is non-linear in the parameters because the parts are not added together. This text covers
linear models and generalized linear models, but not non-linear models that are also non-linear
in the parameters.
2. The draws from the probability distribution are independent. Independence
implies uncorrelated YY conditional on the XX, that is, for any two YY with the same
value of XX, we cannot predict the value of one given the value of the other. For
example, in the ASK1 data above, “uncorrelated” implies that we cannot predict the
glucose level of one mouse within a specific treatment combination given the glucose
level of another mouse in that combination. For linear models, this assumption is often
stated as “independent errors” (the εε in model (8.2)) instead of independent
observations.

There are lots of reasons that conditional responses might be correlated. In the mouse example,
correlation within treatment group could arise if subsets of mice in a treatment group are
siblings or are housed in the same cage. More generally, if there are measures both within and
among experimental units (field sites or humans or rats) then we’d expect the measures within
the same unit to err from the model in the same direction. Multiple measures within
experimental units (a site or individual) creates “clustered” observations. Lack of independence
or clustered observations can be modeled using models with random effects. These models go
by many names including linear mixed models (common in Ecology), hierarchical models,
multilevel models, and random effects models. A linear mixed model is a variation of
model (8.2). This text introduces linear mixed models in chapter xxx.

Measures that are taken from sites that are closer together or measures taken closer in time or
measures from more closely related biological species will tend to have more similar values
than measures taken from sites that are further apart or from times that are further apart or from
species that are less closely related. Space and time and phylogeny create spatial and temporal
and phylogenetic autocorrelation. Correlated error due to space or time or phylogeny can be
modeled with Generalized Least Squares (GLS) models. A GLS model is a variation of
model (8.2).

8.8.2 Specific assumptions for inference with a linear model

1. Constant variance or homoskedasticity. The most common way of thinking about

this is the error term εε has constant variance, which is a short way of saying that
random draws of εε in model (8.2) are all from the same (or identical) distribution.
This is explicitly stated in the second line of model specification (8.2). If we were to
think about this using model specification (8.3), then homoskedasticity means
that σσ in N(μ,σ)N(μ,σ) is constant for all observations (or that
the conditional probability distributions are identical, where conditional would mean
adjusted for μμ)

Many biological processes generate data in which the error is a function of the mean. For
example, measures of biological variables that grow, such as lengths of body parts or population
size, have variances that “grow” with the mean. Or, measures of counts, such as the number of
cells damaged by toxin, the number of eggs in a nest, or the number of mRNA transcripts per
cell have variances that are a function of the mean. Heteroskedastic error can be modeled
with Generalized Least Squares, a generalization of the linear model, and with Generalized
Linear Models (GLM), which are “extensions” of the classical linear model.

2. Normal or Gaussian probability distribution. As above, the most common way of

thinking about this is the error term εε is Normal. Using model specification (8.3), we’d
say the conditional probablity distribution of the response is normal. A normal
probability distribution implies that 1) the response is continuous and 2) the conditional
probability is symmetric around muimui. If the conditional probability distribution has
a long left or right tail it is skewed left or right. Counts (number of cells, number of
eggs, number of mRNA transcripts) and binary responses (sucessful escape or sucessful
infestation of host) are not continuous and often often have asymmetric probablity
distributions that are skewed to the right and while sometimes both can be reasonably
modeled using a linear model they are more often modeled using generalized linear
models, which, again, is an extension of the linear model in equation (8.3). A classical
linear model is a specific case of a GLM.

A common misconception is that inference from a linear model assumes that the raw response
variable is normally distributed. Both the error-draw and conditional-draw specifications of a
linear model show precisely why this conception is wrong. Model (??) states explicitly that it
is the error that has the normal distribution – the distribution of YY is a mix of the distribution
of XX and the error. Model (8.3) states that the conditional outcome has a normal distribution,
that is, the distribution after adjusting for variation in XX.

8.9 “linear model,”regression model”, or “statistical model”?

Statistical modeling terminology can be confusing. The XX variables in a statistical model may
be quantitative (continuous or integers) or categorical (names or qualitative amounts) or some
mix of the two. Linear models with all quantitative independent variables are often called
“regression models.” Linear models with all categorical independent variables are often called
“ANOVA models.” Linear models with a mix of quantitative and categorical variables are often
called “ANCOVA models” if the focus is on one of the categorical XX or “regression models”
if there tend to be many independent variables.

This confusion partly results from the history of the development of regression for the analysis
of observational data and ANOVA for the analysis of experimental data. The math underneath
classical regression (without categorical variables) is the linear model. The math underneath
classical ANOVA is the computation of sums of squared deviations from a group mean, or
“sums of squares”. The basic output from a regression is a table of coefficients with standard
errors. The basic ouput from ANOVA is an ANOVA table, containing the sums of squares along
with mean-squares, F-ratios, and p-values. Because of these historical differences in usage,
underlying math, and output, many textbooks in biostatistics are organized around regression
“vs.” ANOVA, presenting regression as if it is “for” observational studies and ANOVA as if it
is “for” experiments.

It has been recognized for many decades that experiments can be analyzed using the technique
of classical regression if the categorical variables are coded as numbers (again, this will be
explained later) and that both regression and ANOVA are variations of a more general, linear
model. Despite this, the “regression vs. ANOVA” way-of-thinking dominates the teaching of
biostatistics.

Regression Analysis
Regression analysis is a statistical method used to model the relationship between a
dependent variable and one or more independent variables. It helps in prediction, trend
analysis, and understanding how variables influence each other.

1. Types of Regression Analysis

1.1 Simple Linear Regression

• Models the relationship between one independent variable (X) and one dependent
variable (Y).

• Equation: Y=β0+β1X+ε

• Use Cases: Predicting house prices based on size, sales based on advertising spend.

1.2 Multiple Linear Regression

• Extends simple regression to multiple independent variables.

• Equation: Y=β0+β1X1+β2X2+⋯+βnXn+ε

• Use Cases: Predicting salary based on education, experience, and location.

1.3 Polynomial Regression

• Models non-linear relationships by including polynomial terms

• Equation: Y=β0+β1X+β2X2+β3Xn+⋯+ε

• Use Cases: Predicting economic growth trends, modeling complex real-world

behaviors.
1.4 Logistic Regression (For classification)

• Used when the dependent variable is categorical (e.g., 0 or 1).

• Equation: log(p1−p)=β0+β1X1+β2X2+⋯+βnXn

• Use Cases: Predicting whether a customer will buy a product (Yes/No).

1.5 Ridge & Lasso Regression (Regularization Techniques)

•
Use Cases: Feature selection, high-dimensional datasets.

2. Assumptions of Regression Analysis

1. Linearity: Relationship between dependent and independent variables is linear.

2. Independence: Observations are independent.

3. Homoscedasticity: Constant variance of residuals (errors).

4. Normality: Residuals follow a normal distribution.

5. No Multicollinearity: Independent variables should not be highly correlated.

3. Applications of Regression Analysis

• Finance: Predicting stock prices, risk assessment

• Marketing: Estimating ad effectiveness

• Healthcare: Disease progression modeling

• Economics: Forecasting GDP growth

Analysis of Variance (ANOVA)

1. Introduction
Analysis of Variance (ANOVA) is a statistical method used to compare means
across multiple groups to determine if at least one group differs significantly. It
extends the t-test, which compares only two groups, to handle more than two
groups.
2. Types of ANOVA
1. One-Way ANOVA – Compares means across one categorical independent
variable with multiple levels.
o Example: Examining test scores across different teaching methods
(Method A, B, C).
2. Two-Way ANOVA – Compares means across two categorical independent
variables.
o Example: Studying the effect of teaching methods and gender on
test scores.
3. Repeated Measures ANOVA – Compares means when the same subjects
are tested multiple times.
o Example: Measuring blood pressure before, during, and after
treatment.
4. MANOVA (Multivariate ANOVA) – Extends ANOVA to multiple
dependent variables.
o Example: Studying the effect of diet on both weight loss and
cholesterol levels.

3. Assumptions of ANOVA
• Independence – Observations are independent.
• Normality – Data follows a normal distribution.
• Homogeneity of Variance (Homoscedasticity) – Variances in different
groups should be similar.

4. One-Way ANOVA: Formula and Steps

The test statistic for ANOVA is the F-ratio:
Where:
• Between-group variation measures how much the group means differ
from the overall mean.
• Within-group variation measures how much the individual observations
vary within each group.
Steps to Perform One-Way ANOVA
1. State the Hypotheses
o H0 (Null Hypothesis): All group means are equal.
o Ha (Alternative Hypothesis): At least one group mean is different.
2. Calculate the Mean for Each Group and Overall Mean
3. Compute the Sum of Squares
4. Compute the Degrees of Freedom
5. Compute the Mean Squares (MS)
6. Compute the F-Statistic
7. Compare with Critical Value or P-Value

6. Applications of ANOVA
• Medical Research: Comparing drug effectiveness.
• Education: Evaluating different teaching methods.
• Manufacturing: Testing variations in production processes.
• Marketing: Assessing the impact of different advertising strategies.
• One-Way ANOVA checks for differences across one factor.
• Two-Way ANOVA checks for the impact of two factors and interactions.
• Interpretation of FFF-statistic and p-value determines significance.

Gauss-Markov Theorem
1. Introduction
The Gauss-Markov theorem states that in a linear regression model
where the errors satisfy certain conditions, the Ordinary Least Squares
(OLS) estimator is the Best Linear Unbiased Estimator (BLUE).
This means that among all possible linear unbiased estimators, OLS has
the smallest variance.

Assumptions of the Gauss-Markov Theorem (CLASSICAL LINEAR

REGRESSION MODEL - CLRM)
For the OLS estimator to be BLUE, the following assumptions must
hold:
o

No Perfect Multicollinearity
o The independent variables are not perfectly correlated.

4. Conclusion
Since OLS estimators satisfy the conditions of unbiasedness and have the
minimum variance among all linear estimators, they are Best Linear
Unbiased Estimators (BLUE).

5. Importance of the Gauss-Markov Theorem

• Ensures OLS estimators are efficient and reliable.
• Helps in hypothesis testing and confidence interval estimation.
• Justifies using OLS in econometrics and statistics.
Geometry of Least Squares
1.Introduction: The least squares method is used in regression analysis
to find the best-fitting line by minimizing the sum of squared residuals.
This process has a geometric interpretation in terms of vector spaces
and projections.
Subspace Formulation of Linear Models
1. Introduction
A linear regression model can be formulated using the concepts of vector
spaces and subspaces. The key idea is that the observed response vector Y
belongs to a high-dimensional space, and we approximate it using a lower-
dimensional subspace spanned by the columns of the design matrix X.
Orthogonal Projections in Linear Regression
5. Conclusion

Orthogonal projections help decompose a vector into a model (fit) and

error (residual).
The hat matrix H projects onto the model space C(X).
The residuals are orthogonal to the model space, ensuring the best least
squares fit.
Choosing the Right Regression Model

Scenario Suitable Regression Model

Predicting a continuous outcome Linear Regression

Predicting a binary outcome Logistic Regression

Predicting count data Poisson Regression

Dealing with multicollinearity Ridge/Lasso Regression

Modeling a non-linear relationship Polynomial Regression

Predicting future values in time series ARIMA, VAR

Conclusion
Regression models are powerful tools for data analysis. The choice of model
depends on the type of data, assumptions, and objective.

Factorial Experiments in Statistics

A factorial experiment is a statistical design where multiple factors
(independent variables) are varied simultaneously to study their effects on a
response variable. It allows for analyzing both main effects and interaction
effects efficiently.

1. Basics of Factorial Experiments

• A factor is an independent variable that influences the response.
• Each factor has multiple levels (e.g., Temperature: Low, Medium, High).
• A factorial design tests all possible combinations of factor levels.
For example, in a 2 × 3 factorial experiment:
• Factor A (2 levels): Low, High
• Factor B (3 levels): Low, Medium, High
• Total treatments = 2×3=62 \times 3 = 62×3=6 combinations.

2. Types of Factorial Designs

A. Full Factorial Design
• Tests all possible combinations of factors and levels.
• Provides complete information on main effects and interactions.
• Example: A 232^323 design means 3 factors, each with 2 levels (Total =
23=82^3 = 823=8 runs).
B. Fractional Factorial Design
• Uses only a subset of full factorial runs.
• Useful when full factorial is expensive or time-consuming.
• Example: Instead of a 24=162^4 = 1624=16 design, a fractional design
might use only 8 runs.
C. Nested and Split-Plot Designs
• Used when factors are not fully randomized.
• Nested design: Some factors are only tested within certain levels of
another factor.
• Split-plot design: Used when certain factors are difficult to change
during the experiment.

3. Analyzing Factorial Experiments

A. Main Effects
• The effect of each factor individually on the response.
• Example: Does Temperature alone affect product yield?
B. Interaction Effects
• Occurs when the effect of one factor depends on the level of another
factor.
• Example: If Temperature and Pressure together influence yield
differently than expected from their separate effects.
C. Analysis of Variance (ANOVA)
• Used to analyze factorial experiments.
• Breaks variance into main effects, interactions, and error components.
• Example: Two-Way ANOVA for a 2×22 \times 22×2 factorial design.

4. Example: A 2×2 Factorial Experiment

Suppose we study the effect of Temperature (Low/High) and Pressure
(Low/High) on product quality.

Temperature Pressure Quality (%)

Low Low 75

Low High 80

High Low 85

High High 95

• Main Effects:
o Temperature Effect: (Avg. Quality at High – Avg. Quality at Low)
o Pressure Effect: (Avg. Quality at High – Avg. Quality at Low)
• Interaction Effect:
o If (High, High) quality is higher/lower than expected based on
individual effects of Temperature and Pressure.

5. Applications of Factorial Experiments

• Manufacturing: Optimize process conditions.
• Agriculture: Study effects of fertilizers and irrigation.
• Pharmaceuticals: Drug formulation testing.
• Marketing: A/B testing for advertisements.

6. Conclusion
Factorial experiments provide efficient, cost-effective analysis of multiple
factors and their interactions.

Analysis of Covariance (ANCOVA)

1. Introduction
Analysis of Covariance (ANCOVA) is a statistical technique that combines
ANOVA (Analysis of Variance) and Regression. It is used to:
• Compare means of dependent variables across groups (like ANOVA).
• Control for the effect of continuous covariates (like regression).
For example, if we are studying the effect of a new drug on blood pressure, but
some patients have different initial blood pressure levels, ANCOVA can adjust
for these differences to make a fair comparison.

2. Model of ANCOVA
The general ANCOVA model is:
How ANCOVA Works
1. Removes variability in YYY due to ZZZ (covariate).
2. Adjusts means for the categorical groups based on the covariate.
3. Tests for group differences after controlling for ZZZ.

3. Assumptions of ANCOVA
For ANCOVA to be valid, these assumptions must hold:

Linearity: The covariate should have a linear relationship with the

dependent variable.
Homogeneity of Regression Slopes: The effect of the covariate should be
the same across all groups.
Normality: The residuals should be normally distributed.
Homoscedasticity: Variance of residuals should be the same across groups.

4. Example: ANCOVA in Action

Scenario:
We are testing the effect of two teaching methods (Traditional vs. New) on
students' final test scores, but students had different pre-test scores.

Teaching Method Pre-Test Score (Covariate) Final Test Score

Traditional 50 65
Teaching Method Pre-Test Score (Covariate) Final Test Score

Traditional 60 72

New Method 50 75

New Method 60 82

Analysis
• The covariate (Pre-Test Score) affects the dependent variable (Final
Score).
• ANCOVA removes this effect and tests whether the New Method is
significantly better.

5. Steps in Conducting ANCOVA

1. Fit a regression model using the covariate.
2. Check if the covariate significantly affects the dependent variable.
3. Perform ANOVA on adjusted values to check for group differences.
4. Interpret adjusted means to compare groups fairly.

Regression Diagnostics
Regression diagnostics is the process of evaluating how well a regression
model fits the data and checking if it meets key assumptions. If these
assumptions are violated, predictions and inferences may be unreliable.

1. Key Regression Assumptions

To ensure a valid regression model, these assumptions should be checked:
Linearity: The relationship between independent variables (XXX) and
dependent variable (YYY) must be linear.
Homoscedasticity: The variance of residuals should be constant across all
levels of XXX.
Normality of Residuals: Residuals should be normally distributed.
Independence of Errors: Residuals should not be correlated (especially in
time series data).
No Multicollinearity: Independent variables should not be highly
correlated with each other.

2. Regression Diagnostics Methods

A. Checking Linearity
• Scatter Plots: Plot YYY vs. XXX to check if the relationship looks
linear.
• Residual vs. Fitted Plot: Residuals should be randomly scattered around
zero.

Solution if violated: Try polynomial regression or transformation (e.g., log

or square root).

B. Checking Homoscedasticity
• Residual Plot: Plot residuals against fitted values.
• Breusch-Pagan Test: A statistical test for homoscedasticity.

Solution if violated: Use weighted least squares (WLS) regression or

transform the dependent variable.

C. Checking Normality of Residuals

• Histogram or Q-Q Plot: Residuals should resemble a normal
distribution.
• Shapiro-Wilk Test: A formal test for normality.

Solution if violated: Apply transformations (e.g., log, Box-Cox) or use non-

parametric regression.

D. Checking Independence of Errors

• Durbin-Watson Test: Checks for autocorrelation in residuals (especially
for time series data).

Solution if violated: Use time-series models like ARIMA.

E. Checking Multicollinearity
• Variance Inflation Factor (VIF): A VIF > 10 indicates severe
multicollinearity.

Solution if violated: Remove highly correlated predictors or use

Ridge/Lasso regression.

Residuals in Regression Analysis

1. A residual is the difference between the observed value and the predicted
value in a regression model. It represents the error or deviation from the fitted
line.

Residuals help in diagnosing whether the model fits the data well.

2. Properties of Residuals
• The sum of residuals in OLS regression is zero:

• Residuals should be randomly distributed (no pattern).

• They should have constant variance (homoscedasticity).
• They should be normally distributed for valid hypothesis testing.
3. Types of Residuals

Raw Residuals:

(basic difference between actual and predicted values).

Standardized Residuals: Adjusted for standard deviation, useful for
detecting outliers.
Studentized Residuals: Further adjusted for variance at each data point.
PRESS Residuals: Used in cross-validation to check model accuracy.

4. Checking Residuals in Regression

A. Residual Plot (Residuals vs. Fitted Values)

• Random pattern → Good model fit

• Curved pattern → Model might be missing a key variable

• Funnel shape → Heteroscedasticity (non-constant variance)

B. Normality Check
• Histogram or Q-Q Plot: Residuals should resemble a normal
distribution.
• Shapiro-Wilk Test: A statistical test for normality.
C. Autocorrelation Check
• Durbin-Watson Test: Used in time series regression to check for
correlated errors.

Influence diagnostics
1.Influence diagnostics identify data points that have a large impact on the
regression model. A few extreme observations (outliers or high-leverage points)
can skew the model, making predictions unreliable.

2. Key Concepts in Influence Diagnostics

A. Leverage (Hat Matrix, hi)
• Measures how much an observation influences its own predicted value.
• High leverage means the observation is far from the mean of the
independent variables.

• Rule of thumb: If the point has high leverage.

B. Residuals and Outliers
• Standardized residuals: Large residuals indicate outliers.
• Studentized residuals: Adjusts residuals by their standard deviation.
• If ∣ri∣>2, the point may be an outlier.
C. Influence Measures
These metrics determine how much a data point affects the overall regression
results.
1. Cook’s Distance (Di)
• Measures how much removing an observation changes the regression.
• Large Di suggests a highly influential observation.
• Rule: If Di>1, the point may be influential.
2. DFBETAS
• Measures how much each coefficient changes if a point is removed.
• If

it indicates high influence.

3. DFFITS
• Measures how much the predicted value changes if a point is removed
• If

the point is influential.

4. Covariance Ratio (COVRATIO)

• Checks if removing a point significantly affects the covariance matrix of

coefficients.
• If COVRATIO deviates significantly from 1, the point may be influential.

3. Detecting Influential Points

• Plot Cook’s Distance: Identify points above the threshold.
• Leverage vs. Residuals Plot: Shows high-leverage outliers.
• DFBETA and DFFITS: Identify which points affect specific
coefficients.

Transformations in Regression Analysis

1. What are Transformations?
Transformations in regression are used to improve the fit of the model by
addressing issues such as:
Non-linearity: If the relationship between variables is not linear.
Heteroscedasticity: If the variance of residuals is not constant.
Non-Normality: If residuals are not normally distributed.
Multicollinearity: When independent variables are highly correlated.

2. Types of Transformations
A. Log Transformation (Y’=logY or X’=logX)
Used when the relationship is exponential.
• Helps stabilize variance and make residuals more normal.
• Example:
• Commonly used in economics, biology, and finance.
B. Square Root Transformation :

• Useful for count data or when residuals show a funnel shape.

• Reduces right-skewness but less drastic than a log transformation.
• Example: Population studies, Poisson regression.
C. Box-Cox Transformation

• Finds the best power transformation for normality.

D. Reciprocal Transformation (Y′=1/Y)
• Used when there is a strong non-linearity (hyperbolic relationship).
• Can be useful for rates or proportions.
E. Polynomial Transformations (Y′=X2,X3)
• Used for quadratic or cubic relationships.
• Can capture non-linear trends in data.
F. Interaction Transformations (Y=aX1X2+b)
• Used when two predictors interact to influence Y.
• Example: In marketing, price × advertising interaction.

3. When to Use Transformations?

Issue Recommended Transformation

Non-linearity Log, Polynomial, Reciprocal

Heteroscedasticity Log, Square Root, Box-Cox

Non-normal residuals Log, Box-Cox, Square Root

Skewed data Log (for right skew), Square Root, Reciprocal

Count data Square Root, Log

Interaction effects Polynomial, Interaction Terms

Box-Cox Transformation in Regression

The Box-Cox transformation is a statistical technique used to stabilize
variance, normalize residuals, and improve the linearity of a regression
model. It is particularly useful when the dependent variable (YYY) is non-
normal or exhibits heteroscedasticity (unequal variance).

1. Box-Cox Transformation Formula

The Box-Cox transformation is defined as:

2. Why Use Box-Cox Transformation?

Fixes Non-Normality: Makes residuals more normally distributed.
Stabilizes Variance: Reduces heteroscedasticity (unequal spread of
residuals).
Improves Linearity: Makes relationships between X and Y more linear.
Enhances Model Fit: Leads to more reliable statistical inferences.

3. Choosing the Best λ\lambdaλ Value

The optimal λ\lambdaλ value is found using Maximum Likelihood
Estimation (MLE). Common values of λ include:

4. When to Use Box-Cox?

• If residuals in a regression model are not normally distributed.
• If variance of residuals increases with larger values of YYY.
• If a log transformation is considered but a better alternative is needed.

Model Selection and Model Building Strategies

Model selection and model building strategies help in choosing the best
regression model that balances predictive accuracy, interpretability,
and generalization. The goal is to find a model that fits the data well
without overfitting or underfitting.

1. Steps in Model Building

A. Define the Objective
• What is the goal? (Prediction, inference, explanation)
• What type of response variable? (Continuous → Regression, Categorical
→ Classification)
B. Data Preprocessing
• Handling missing values, outliers, and transformations (e.g., Box-Cox,
log).
• Feature scaling (Standardization/Normalization).
• Encoding categorical variables (One-Hot, Label Encoding).
C. Exploratory Data Analysis (EDA)
• Visualize relationships between variables.
• Check for multicollinearity, outliers, and non-linearity.

2. Model Selection Strategies

A. Subset Selection Methods
1. Best Subset Selection
o Evaluates all possible models and selects the best one.
o Computationally expensive for large datasets.
2. Stepwise Selection (Forward/Backward Selection)
o Forward Selection: Starts with no variables, adds one at a time
based on significance.
o Backward Selection: Starts with all variables, removes least
significant ones.
o Stepwise Regression: A combination of forward and backward
selection.

B. Regularization Methods (Shrinkage Methods)

Used to reduce overfitting and handle multicollinearity.
1. Ridge Regression (L2 Regularization)
o Penalizes large coefficients to prevent overfitting.
o Keeps all variables but shrinks their importance.
2. Lasso Regression (L1 Regularization)
o Shrinks some coefficients to zero (performs feature selection).
o Useful when there are many irrelevant predictors.
3. Elastic Net
o Combines Ridge and Lasso for improved feature selection.

C. Information Criteria Methods

1. Akaike Information Criterion (AIC)
o Penalizes model complexity; lower AIC is better.
o Good for comparing non-nested models.
2. Bayesian Information Criterion (BIC)
o More strict penalty than AIC (favors simpler models).
o Preferred when sample size is large.

D. Cross-Validation for Model Selection

1. K-Fold Cross-Validation
o Splits data into K subsets and trains on K-1 while testing on the
remaining 1.
o More stable than train-test split.
2. Leave-One-Out Cross-Validation (LOOCV)
o Uses every data point as a test set once.
o Computationally expensive but unbiased.

3. Model Evaluation Metrics

Regression Classification
Metric
Models Models

How well the

R² (Coefficient of
model explains N/A
Determination)
variance

Adjusts R² for
Adjusted R² number of N/A
predictors

Mean Squared Penalizes large

N/A
Error (MSE) errors

Mean Absolute Less sensitive to

N/A
Error (MAE) outliers than MSE

Root Mean Square root of

Squared Error MSE, N/A
(RMSE) interpretable

Measures
AUC-ROC N/A classification
performance

Balances precision
F1-Score N/A
and recall
4. Model Building Strategies
1. Start with a Simple Model
o Fit a baseline model first.
o Avoid using too many variables initially.
2. Iterative Feature Selection
o Use correlation analysis, VIF (Variance Inflation Factor), and
domain knowledge.
3. Compare Models Using Cross-Validation
o Use train-validation-test split to ensure model generalization.
4. Check for Assumptions
o Linearity → Scatter plots.
o Homoscedasticity → Residual plots.
o Normality of Residuals → Q-Q plot.
5. Hyperparameter Tuning
o Use Grid Search or Random Search to optimize model
parameters.

Logistic Regression Model

1. Logistic Regression?
Logistic regression is a classification algorithm used when the
dependent variable (Y) is binary (0/1, Yes/No, True/False). Instead of
predicting a continuous outcome (as in linear regression), logistic
regression predicts the probability that an observation belongs to a
certain class.
2. Why Not Use Linear Regression for Classification?
Linear regression is not suitable for classification because:
It can produce probabilities outside [0,1].
It assumes a linear relationship, but classification problems are often
non-linear.
Logistic regression applies the sigmoid (logit) function to constrain
outputs between 0 and 1.

3. Types of Logistic Regression

Type Use Case

Binary Logistic Predicts two outcomes (e.g., Spam/Not

Regression Spam)

Multinomial Logistic Predicts more than two categories

Regression without order (e.g., Cat, Dog, Bird)

Ordinal Logistic Predicts ordered categories (e.g., Low,

Regression Medium, High)

4. Logistic Regression Model Building

A. Assumptions

Binary Outcome: The dependent variable should be 0 or 1.

Independent Observations: No autocorrelation between
observations.
No Multicollinearity: Predictor variables should not be highly
correlated.
Linearity in Log-Odds: Independent variables should have a linear
relationship with the log-odds of the dependent variable.
B. Model Fitting Steps
1.Preprocess Data (Handle missing values, outliers, encoding categorical
variables)
2.Split Data (Train-Test Split, usually 80-20%)
3⃣. Train Model using Maximum Likelihood Estimation (MLE)
4. Evaluate Model using performance metrics
5.Fine-Tune Model (Feature selection, regularization)

5. Model Evaluation Metrics

Metric Purpose

Accuracy Overall correctness of predictions

Measures how many predicted positives are

Precision
actually positive

Recall Measures how many actual positives were

(Sensitivity) correctly identified

F1-Score Balances Precision and Recall

ROC-AUC Evaluates the model’s ability to distinguish

Curve between classes

Regularization in Logistic Regression

To prevent overfitting, regularization is applied:
1. L1 Regularization (Lasso Regression)
o Shrinks some coefficients to zero (feature selection).
2. L2 Regularization (Ridge Regression)
o Shrinks coefficients but keeps all variables.
3. Elastic Net
o Combination of L1 and L2 regularization.

Poisson Regression Models

1.Poisson Regression
Poisson regression is used when the dependent variable is a count (i.e., non-
negative integers: 0,1,2,3…). It models the rate at which events occur over time
or space.
Mathematical Formulation:
The Poisson regression model assumes that the response variable YYY follows
a Poisson distribution:

2. When to Use Poisson Regression?

Dependent variable is a count (non-negative integer: 0,1,2,3...)
Counts follow a Poisson distribution (mean ≈ variance)
Predictors can be continuous or categorical
Example Use Cases:

• Number of customer calls per hour

• Number of accidents per day

• Number of emails received per day

• Number of diseases reported per region

3. Assumptions of Poisson Regression

Count data (non-negative integer values)

Mean ≈ Variance (Equidispersion assumption: E(Y)=Var(Y))

Log-linear relationship between predictors and response

Observations are independent
No excess zeros (if too many zeros, consider Zero-Inflated Poisson or
Negative Binomial Regression)

4. Types of Poisson Regression Models

Model Description Use Case

Standard Poisson
Assumes mean = variance Counts of emails per day
Regression

Zero-Inflated Defects in a
Accounts for excess zeros
Poisson (ZIP) manufacturing process

Negative Binomial Handles overdispersion Disease outbreaks per

Regression (variance > mean) city
5. Model Evaluation Metrics

Metric Purpose

Measures model fit (similar to R² in linear

Deviance
regression)

AIC (Akaike Information

Used for model comparison (lower is better)
Criterion)

Log-Likelihood Measures how well the model fits the data

Pearson’s Chi-Square Test Tests goodness-of-fit

MATH6183 Introduction+Regression
No ratings yet
MATH6183 Introduction+Regression
70 pages
Ms 236 N 0
No ratings yet
Ms 236 N 0
63 pages
Stat Modelling Notes
No ratings yet
Stat Modelling Notes
49 pages
Linear Regression
No ratings yet
Linear Regression
31 pages
Lecture Notes On High Dimensional Linear Regression
No ratings yet
Lecture Notes On High Dimensional Linear Regression
73 pages
Session CLRM Review 1
No ratings yet
Session CLRM Review 1
47 pages
ch12 0
No ratings yet
ch12 0
43 pages
7 OLS Assumptions
No ratings yet
7 OLS Assumptions
37 pages
CS ELEC 4 Finals Module
No ratings yet
CS ELEC 4 Finals Module
57 pages
Econometrics 1: Classical Linear Regression Analysis
No ratings yet
Econometrics 1: Classical Linear Regression Analysis
20 pages
1 - Linear Models
No ratings yet
1 - Linear Models
22 pages
Previewpdf
No ratings yet
Previewpdf
27 pages
Linear Regression for Marketers
No ratings yet
Linear Regression for Marketers
49 pages
STAT 714 Linear Statistical Models: Lecture Notes
No ratings yet
STAT 714 Linear Statistical Models: Lecture Notes
150 pages
Econometrics 2 Notes
No ratings yet
Econometrics 2 Notes
14 pages
Classical LinearReg 000
No ratings yet
Classical LinearReg 000
41 pages
Week 5 Notes
No ratings yet
Week 5 Notes
175 pages
SM Notes 2020
No ratings yet
SM Notes 2020
139 pages
Unit - 1
No ratings yet
Unit - 1
8 pages
Simple Regression Model: Erbil Technology Institute
No ratings yet
Simple Regression Model: Erbil Technology Institute
9 pages
Simple Linear Regression1
No ratings yet
Simple Linear Regression1
36 pages
Limited Dependent Variables
No ratings yet
Limited Dependent Variables
17 pages
Sta 3
No ratings yet
Sta 3
9 pages
Unit5 R
No ratings yet
Unit5 R
5 pages
Linear Regression Models 2018
No ratings yet
Linear Regression Models 2018
68 pages
Linear Model 1
No ratings yet
Linear Model 1
71 pages
2 Modele Lineare
No ratings yet
2 Modele Lineare
43 pages
CH 2
No ratings yet
CH 2
31 pages
Econometrics Module 2
No ratings yet
Econometrics Module 2
38 pages
Basic Econometrics Health
No ratings yet
Basic Econometrics Health
183 pages
Notes 23 Regression R
No ratings yet
Notes 23 Regression R
5 pages
Multiple Linear Regression
No ratings yet
Multiple Linear Regression
18 pages
Chapter Three
No ratings yet
Chapter Three
35 pages
Session 1: Simple Linear Regression: Figure 1 - Supervised and Unsupervised Learning Methods
No ratings yet
Session 1: Simple Linear Regression: Figure 1 - Supervised and Unsupervised Learning Methods
16 pages
LM Week1 1 2019
No ratings yet
LM Week1 1 2019
28 pages
Statlearn PDF
No ratings yet
Statlearn PDF
123 pages
Chap 5
No ratings yet
Chap 5
13 pages
Linear Regression 1st Edition David J. Olive (Auth.) Instant Download
No ratings yet
Linear Regression 1st Edition David J. Olive (Auth.) Instant Download
119 pages
Module05 Notes
No ratings yet
Module05 Notes
19 pages
DMV Unit 3 PPT - RSK - 250419 - 125620 Jfhuehiwhu
No ratings yet
DMV Unit 3 PPT - RSK - 250419 - 125620 Jfhuehiwhu
89 pages
Linear Regression
No ratings yet
Linear Regression
19 pages
1.1 Simple Linear Regression Model
100% (1)
1.1 Simple Linear Regression Model
15 pages
Ordinary Least Squares
No ratings yet
Ordinary Least Squares
54 pages
Linear Regression Model - Applied - Part 1&2
No ratings yet
Linear Regression Model - Applied - Part 1&2
69 pages
Regression Analysis Essentials
No ratings yet
Regression Analysis Essentials
55 pages
Lectureslides Chap6-Annot PDF
No ratings yet
Lectureslides Chap6-Annot PDF
30 pages
StatLearning2r PDF
No ratings yet
StatLearning2r PDF
267 pages
Using R For Linear Regression
No ratings yet
Using R For Linear Regression
9 pages
WQU - Econometrics - Module2 - Compiled Content
100% (2)
WQU - Econometrics - Module2 - Compiled Content
73 pages
Week 2 DrBuddhananda Banerjee Vector RV
No ratings yet
Week 2 DrBuddhananda Banerjee Vector RV
10 pages
Finance Students' Matlab Guide
No ratings yet
Finance Students' Matlab Guide
3 pages
Problem-Set - 1 Practise Problems From Textbook
No ratings yet
Problem-Set - 1 Practise Problems From Textbook
2 pages
UNIT 1 Linear Discrimnat
No ratings yet
UNIT 1 Linear Discrimnat
7 pages
Linear Regression
No ratings yet
Linear Regression
53 pages
1.1 Normal Linear Model: 1.1.1 Theoretical (Population) Model and Fitted (Sample) Model
No ratings yet
1.1 Normal Linear Model: 1.1.1 Theoretical (Population) Model and Fitted (Sample) Model
63 pages
Multiple Regression
No ratings yet
Multiple Regression
22 pages
7772 LectureNotes
No ratings yet
7772 LectureNotes
120 pages
Bootstrap Methodology
No ratings yet
Bootstrap Methodology
33 pages
J Apor 2021 102711
No ratings yet
J Apor 2021 102711
17 pages
Probabilistic Programming in Python Using PyMC
No ratings yet
Probabilistic Programming in Python Using PyMC
19 pages
Statistics For The Behavioral and Social Sciences A Brief Course 5th Edition Arthur Aron Download PDF
100% (26)
Statistics For The Behavioral and Social Sciences A Brief Course 5th Edition Arthur Aron Download PDF
84 pages
Qualitative Data Models Guide
No ratings yet
Qualitative Data Models Guide
34 pages
Statistics For The Behavioral Sciences 2nd Edition Gregory J Privitera Download
100% (1)
Statistics For The Behavioral Sciences 2nd Edition Gregory J Privitera Download
83 pages
Chapter4 Sampling Stratified Sampling
No ratings yet
Chapter4 Sampling Stratified Sampling
43 pages
Financial Econometrics Guide
100% (1)
Financial Econometrics Guide
483 pages
An Aggregatedisaggregate Intermittent Demand Approach Adida To Forecasting
No ratings yet
An Aggregatedisaggregate Intermittent Demand Approach Adida To Forecasting
11 pages
Barndorff-Nielsen, Hansen, Lunde & Shephard (2009)
No ratings yet
Barndorff-Nielsen, Hansen, Lunde & Shephard (2009)
32 pages
SRM Notes
No ratings yet
SRM Notes
38 pages
The Most Important Probability Distribution in Statistics
100% (1)
The Most Important Probability Distribution in Statistics
57 pages
Lecture 3 Slides (ANOVA)
No ratings yet
Lecture 3 Slides (ANOVA)
19 pages
Introduction To Estimation
No ratings yet
Introduction To Estimation
9 pages
STA301 Subjective Questions Short Notes DOWNLOADPDF
No ratings yet
STA301 Subjective Questions Short Notes DOWNLOADPDF
22 pages
Regress
No ratings yet
Regress
11 pages
IE 511 Experimental Design in Engineering PSU, Fall 2021
No ratings yet
IE 511 Experimental Design in Engineering PSU, Fall 2021
6 pages
Regression Analysis for Economists
No ratings yet
Regression Analysis for Economists
57 pages
Cheat Sheet
No ratings yet
Cheat Sheet
5 pages
MBA Statistics for Management Course Guide
No ratings yet
MBA Statistics for Management Course Guide
31 pages
Business Statistics For Contemporary Decision Making 7th Edition by Ken Black Ebook and TestBank Bundle Verified PDF
No ratings yet
Business Statistics For Contemporary Decision Making 7th Edition by Ken Black Ebook and TestBank Bundle Verified PDF
410 pages
Variance Component Estimation & Best Linear Unbiased Prediction (Blup)
100% (1)
Variance Component Estimation & Best Linear Unbiased Prediction (Blup)
16 pages
CH04 Cheat Sheet
No ratings yet
CH04 Cheat Sheet
3 pages
Software Estimation Quality Metric
No ratings yet
Software Estimation Quality Metric
4 pages
Shifting Economic Foundation Trad Marriage Erosion
No ratings yet
Shifting Economic Foundation Trad Marriage Erosion
28 pages
Bayesian and Maximum Likelihood Estimations of The Dagum Parameters Under Combined-Unified Hybrid Censoring
No ratings yet
Bayesian and Maximum Likelihood Estimations of The Dagum Parameters Under Combined-Unified Hybrid Censoring
22 pages
Correcting Heterogeneous and Biased Forecast Error at Intel For Supply Chain Optimization
No ratings yet
Correcting Heterogeneous and Biased Forecast Error at Intel For Supply Chain Optimization
14 pages
Au B.com Business Statistics
No ratings yet
Au B.com Business Statistics
221 pages
Behavioral Statistics Estimation
No ratings yet
Behavioral Statistics Estimation
5 pages
Two-Variable Regression Model: The Problem of Estimation: Gujarati 4e, Chapter 3
No ratings yet
Two-Variable Regression Model: The Problem of Estimation: Gujarati 4e, Chapter 3
15 pages