Negative Binomial Distribution

A random variable X follows a negative binomial distribution with parameters r and p if its probability function is of the form:

P(X = k) = C(r+k−1, k) · p^r · q^k ,  k = 0, 1, 2, ... ,

where p is the probability of success, q = 1 − p is the probability of failure, and C(n, x) denotes the number of combinations of x objects among n.
In the situation where a "success–failure" random experiment is repeated in an independent way, the probability of success is denoted by p and the probability of failure by q = 1 − p. The experiment is repeated as many times as required to obtain r successes. The number of failures obtained before attaining this goal is a random variable following the negative binomial distribution described above. The negative binomial distribution is a discrete probability distribution.

HISTORY
The first to treat the negative binomial distribution was Pascal, Blaise (1679). Then de Montmort, P.R. (1714) applied the negative binomial distribution to represent the number of times that a coin should be flipped to obtain a certain number of heads. Student (1907) used the negative binomial distribution as an alternative to the Poisson distribution. Greenwood, M. and Yule, G.U. (1920) and Eggenberger, F. and Polya, G. (1923) found applications of the negative binomial distribution. Ever since, there has been an increasing number of applications of this distribution, and the statistical techniques based on it have been developed in a parallel way.

MATHEMATICAL ASPECTS
If X1, X2, ..., Xr are r independent random variables each following a geometric distribution, then the random variable

X = X1 + X2 + ... + Xr

follows a negative binomial distribution.
To calculate the expected value of X, the following property is used, where Y and Z are random variables:

E[Y + Z] = E[Y] + E[Z] .

We therefore have:

E[X] = E[X1 + X2 + ... + Xr]
     = E[X1] + E[X2] + ... + E[Xr]
     = q/p + q/p + ... + q/p
     = r · q/p .
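This expectation can be checked numerically: since each geometric component here counts failures before a success (mean q/p), the pmf should sum to 1 and have mean r·q/p. A minimal sketch (function names are ours, not from the entry):

```python
from math import comb

def nb_pmf(k, r, p):
    """P(X = k): probability of k failures before the r-th success."""
    q = 1 - p
    return comb(r + k - 1, k) * p**r * q**k

r, p = 3, 0.5
# Truncated summation; terms decay geometrically, so 500 terms are ample here.
total = sum(nb_pmf(k, r, p) for k in range(500))
mean = sum(k * nb_pmf(k, r, p) for k in range(500))

print(round(total, 6))  # 1.0  (the pmf sums to 1)
print(round(mean, 6))   # 3.0  = r * q / p
```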
To calculate the variance of X, the following property is used, where Y and Z are independent random variables:

Var(Y + Z) = Var(Y) + Var(Z) .

We therefore have:

Var(X) = Var(X1 + X2 + ... + Xr)
       = Var(X1) + Var(X2) + ... + Var(Xr)
       = q/p² + q/p² + ... + q/p²
       = r · q/p² .

DOMAINS AND LIMITATIONS
Among the specific fields in which the negative binomial distribution has been applied are accident statistics, biological sciences, ecology, market studies, medical research, and psychology.

EXAMPLES
A coin is flipped several times. We are interested in the probability of obtaining heads for the third time on the sixth throw.
We then have:

Number of successes: r = 3
Number of failures: k = 6 − r = 3
Probability of one success (obtaining heads): p = 1/2
Probability of one failure (obtaining tails): q = 1/2

The probability of obtaining k tails before the third heads is given by:

P(X = k) = C(3+k−1, k) · (1/2)³ · (1/2)^k .

The probability of obtaining the third heads on the sixth throw, meaning the probability of obtaining tails three times before obtaining the third heads, is therefore equal to:

P(X = 3) = C(3+3−1, 3) · (1/2)³ · (1/2)³
         = C(5, 3) · (1/2)³ · (1/2)³
         = (5!/(3!(5−3)!)) · (1/2)³ · (1/2)³
         = 10/64 ≈ 0.1563 .

FURTHER READING
Bernoulli distribution
Binomial distribution
Discrete probability distribution
Poisson distribution

REFERENCES
Eggenberger, F., Polya, G.: Über die Statistik verketteter Vorgänge. Zeitschrift für Angewandte Mathematik und Mechanik 3, 279–289 (1923)
Greenwood, M., Yule, G.U.: An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Roy. Stat. Soc. Ser. A 83, 255–279 (1920)
Montmort, P.R. de: Essai d'analyse sur les jeux de hasard, 2nd edn. Quillau, Paris (1713)
Pascal, B.: Varia Opera Mathematica. D. Petri de Fermat. Tolosae (1679)
Gosset, S.W. ("Student"): On the error of counting with a haemacytometer. Biometrika 5, 351–360 (1907)
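The coin example above can be reproduced in a few lines (a sketch; the variable names are ours):

```python
from math import comb

# Coin example: probability of k = 3 tails before the r = 3rd heads, p = 1/2
r, k, p = 3, 3, 0.5
prob = comb(r + k - 1, k) * p**r * (1 - p)**k

print(prob)  # 0.15625, i.e. the 0.1563 of the worked example, before rounding
```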
Nelder, John A.

John Nelder was born in 1924 at Dulverton (Somerset, England). After receiving his diploma in mathematical statistics at Cambridge, he was named head of the Section of Statistics of the National Vegetable Research Station at Wellesbourne, a position he held from 1951 to 1968. He earned the title of Doctor of Science at Birmingham and was then named head of the Statistical Department at Rothamsted Experimental Station in Harpenden, a position he held from 1968 to 1984. He is now visiting professor at the Imperial College of Science, Technology and Medicine in London. He was elected a member of the Royal Society in 1984 and was president of two prestigious societies: the International Biometric Society and the Royal Statistical Society.
Nelder is the author of the statistical computing systems Genstat and GLIM, now used in more than 70 countries. The author of more than 140 articles published in statistical and biological journals, he is also the author of Computers in Biology and coauthor, with Peter McCullagh, of a book treating the generalized linear model. Nelder also developed the idea of generally balanced designs.

Selected works and articles of John A. Nelder:

1965 (with Mead, R.) A simplex method for function minimization. Computer Journal, 7, 308–313.
1971 Computers in Biology. Wykeham, London and Winchester, pp. 149.
1972 (with Wedderburn, R.W.M.) Generalized linear models. J. Roy. Stat. Soc. Ser. A 135, 370–384.
1974 Genstat: a statistical system. In: COMPSTAT: Proceedings in Computational Statistics. Physica, Vienna.
1983 (with McCullagh, P.) Generalized Linear Models. Chapman & Hall, London, pp. 261.

FURTHER READING
Statistical software

Newcomb, Simon

Newcomb, Simon was born in 1835 in Nova Scotia (Canada) and died in 1909. This astronomer of Canadian origin contributed to the treatment of outliers in statistics, to the application of probability theory to the interpretation of data, and to the development of what we today call robust methods in statistics.
Until the age of 16 he studied essentially by consulting the books that his father found for him. After that, he began studying medicine with plans of becoming a doctor. In 1857, he joined the American Ephemeris and Nautical Almanac in Cambridge, in the state of Massachusetts. At the same time, he studied at Harvard, where he graduated in 1858. In 1861, he became professor of mathematics at the Naval Observatory.
The collection Notes on the Theory of Probabilities (Mathematics Monthly, 1859–61) constitutes a work that is still considered modern today. Newcomb's most remarkable contribution to statistics is his approach to the treatment of outliers in astronomical data; affirming that the normal distribution did not fit, he invented the contaminated normal distribution.

Selected works and articles of Newcomb, Simon:

1859–61 Notes on the theory of probabilities. Mathematics Monthly, 1, 136–139, 233–235, 331–355, 349–350; 2, 134–140, 272–275; 3, 68, 119–125, 341–349.

Neyman, Jerzy

Neyman, Jerzy (1894–1981) was one of the founders of modern statistics. One of his outstanding contributions was establishing, with Pearson, Egon Sharpe, the basis of the theory of hypothesis testing. Born in Russia, Neyman studied physics and mathematics at Kharkov University. In 1921, he went to Poland, his ancestral country of origin, where he worked for a while as a statistician at the National Institute of Agriculture of Bydgoszcz. In 1924, he spent some time in London, where he studied at University College under the direction of Pearson, Karl. At this time he met Pearson, Egon Sharpe, Gosset, William Sealy, and Fisher, Ronald Aylmer. In 1937, at the end of a trip to the United States, where he presented papers at many conferences, he accepted the position of professor at the University of California at Berkeley. He created the Department of Statistics and finished his brilliant career at this university.

Selected articles of Neyman, Jerzy:

1928 (with Pearson, E.S.) On the use and interpretation of certain test criteria for purposes of statistical inference, I, II. Biometrika 20A, 175–240, 263–295.
1933 (with Pearson, E.S.) Testing of statistical hypotheses in relation to probabilities a priori. Proc. Camb. Philos. Soc. 29, 492–510.
1934 On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. Roy. Stat. Soc. 97, 558–606, discussion pp. 607–625.
1938 Contribution to the theory of sampling human populations. J. Am. Stat. Assoc. 33, 101–116.

FURTHER READING
Hypothesis testing

Nonlinear Regression

An analysis of regression where the dependent variable Y depends on one or many independent variables X1, ..., Xk is called a nonlinear regression if the equation of regression

Y = f(X1, ..., Xk; β0, ..., βp)

is not linear in the parameters β0, ..., βp.

HISTORY
Nonlinear regression dates back to the 1920s, to Fisher, Ronald Aylmer and Mackenzie, W.A. However, the use and more detailed investigation of these models had to wait for advances in automatic calculation in the 1970s.

MATHEMATICAL ASPECTS
Let Y1, Y2, ..., Yr and X11, ..., X1r, X21, ..., X2r, ..., Xk1, ..., Xkr be the r observations (respectively) of the dependent variable Y and of the independent variables X1, ..., Xk.
The goal is to estimate the parameters β0, ..., βp of the model:

Yi = f(X1i, ..., Xki; β0, ..., βp) + εi ,  i = 1, ..., r .
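To make the setup concrete, here is a small sketch under assumptions of our own (the model f, the data, and all names below are illustrative, not from the entry): for a nonlinear f, every choice of parameters yields a sum of squared errors, which is the quantity minimized in what follows.

```python
from math import exp

# Hypothetical nonlinear model f(x; b0, b1) = b0 * exp(b1 * x)
def f(x, b0, b1):
    return b0 * exp(b1 * x)

def sse(params, xs, ys):
    """Sum of squared errors S(b0, b1) over the r observations."""
    b0, b1 = params
    return sum((y - f(x, b0, b1)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [f(x, 2.0, 0.5) for x in xs]   # noise-free data from b0 = 2, b1 = 0.5

print(sse((2.0, 0.5), xs, ys))      # 0.0 at the true parameters
print(sse((2.0, 0.6), xs, ys) > 0)  # True: any other parameter value fits worse
```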
By the least-squares method and under the corresponding hypotheses concerning the errors (see analysis of residuals), we estimate the parameters β0, ..., βp by the values β̂0, ..., β̂p that minimize the sum of squared errors, denoted by

S(β0, ..., βp) = Σ_{i=1}^{r} (Yi − f(X1i, ..., Xki; β0, ..., βp))² ,

that is,

min_{β0,...,βp} Σ_{i=1}^{r} εi² = min_{β0,...,βp} S(β0, ..., βp) .

If f is nonlinear, the resolution of the normal equations (none of which is linear),

Σ_{i=1}^{r} (Yi − f(X1i, ..., Xki; β0, ..., βp)) · ∂f(X1i, ..., Xki; β0, ..., βp)/∂βj = 0 ,  j = 0, ..., p ,

can be difficult. The problem is that the given equations do not necessarily have a solution, or may have more than one.
In what follows, we discuss different approaches to the resolution of this problem.
1. Gauss–Newton
The procedure, iterative in nature, consists in the linearization of f with the help of a Taylor expansion and in the successive estimation of the parameters by linear regression. We will develop this method in more detail in the following example.
2. Steepest descent
The idea of this method is to fix, in an approximate way, initial estimates of the parameters βj and then to add to them an approximation of

−∂S(β0, ..., βp)/∂βj ,

which corresponds to the direction of steepest descent of the function S. Iterating, we determine the parameters that minimize S.
3. Marquardt–Levenberg
This method, not developed here, integrates the advantages of the two procedures mentioned above, forcing and accelerating the convergence of the approximations of the parameters to be estimated.

DOMAINS AND LIMITATIONS
In multiple linear regression, there are functions that are not linear in the parameters but that become linear after a transformation. If this is not the case, we call these functions nonlinear and treat them with nonlinear methods.
An example of a nonlinear function is given by:

Y = (β1/(β1 − β2)) · [exp(−β2 X) − exp(−β1 X)] + ε .

EXAMPLES
Let P1 = (x1, y1), P2 = (x2, y2), and P3 = (x3, y3) be three points on a graph and d1, d2, and d3 the approximate distances (the "approximations" being distributed according to the Gauss distribution with the same variance) between these data points and a point P = (x, y) whose coordinates are unknown and sought.
A possible regression model to estimate the coordinates of P is:

di = f(xi, yi; x, y) + εi ,  i = 1, 2, 3 ,

where the function f represents the distance between point Pi and point P, that is:

f(xi, yi; x, y) = √((x − xi)² + (y − yi)²) .
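As a sketch of the Gauss–Newton procedure applied to this distance model (using the numerical data of the concrete example developed below), each iteration linearizes f at the current estimate and solves the resulting two-parameter linear least-squares problem. The helper names are ours, and this is only an illustration under those assumptions, not the book's own computation.

```python
from math import hypot

# Data of the worked example: three anchor points and their noisy distances
pts = [(0.0, 5.0), (5.0, 0.0), (10.0, 10.0)]
d = [6.0, 4.0, 6.0]

def gauss_newton(x, y, n_iter=50):
    """Iteratively re-solve the linearized least-squares problem."""
    for _ in range(n_iter):
        # Jacobian rows (df/dx, df/dy) and residuals d_i - f_i
        J, r = [], []
        for (xi, yi), di in zip(pts, d):
            fi = hypot(x - xi, y - yi)
            J.append(((x - xi) / fi, (y - yi) / fi))
            r.append(di - fi)
        # Normal equations of the linearized model: (J'J) delta = J'r
        a = sum(jx * jx for jx, _ in J)
        b = sum(jx * jy for jx, jy in J)
        c = sum(jy * jy for _, jy in J)
        u = sum(jx * ri for (jx, _), ri in zip(J, r))
        v = sum(jy * ri for (_, jy), ri in zip(J, r))
        det = a * c - b * b
        dx, dy = (c * u - b * v) / det, (a * v - b * u) / det
        x, y = x + dx, y + dy
        if dx * dx + dy * dy < 1e-18:   # step negligible: converged
            break
    return x, y

x_hat, y_hat = gauss_newton(5.0, 5.0)  # start at the centroid-like point (5, 5)
print(x_hat, y_hat)
```

At a Gauss–Newton fixed point the gradient of the sum of squares vanishes, which is a simple self-consistency check on the returned estimate.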
The distance function is clearly nonlinear and cannot be transformed into a linear form, so the appropriate regression is nonlinear. We apply the Gauss–Newton method.
Considering f as a function of the parameters x and y to be estimated, the Taylor development of f around a point (x0, y0), up to the linear term, is given by:

f(xi, yi; x, y) ≈ f(xi, yi; x0, y0)
  + [∂f(xi, yi; x, y)/∂x at (x0, y0)] · (x − x0)
  + [∂f(xi, yi; x, y)/∂y at (x0, y0)] · (y − y0) ,

where

∂f(xi, yi; x, y)/∂x at (x0, y0) = (x0 − xi)/f(xi, yi; x0, y0)

and

∂f(xi, yi; x, y)/∂y at (x0, y0) = (y0 − yi)/f(xi, yi; x0, y0) .

So let (x0, y0) be a first estimate of P, found, for example, geometrically.
The linearized regression model is thus expressed by:

di = f(xi, yi; x0, y0)
  + ((x0 − xi)/f(xi, yi; x0, y0)) · (x − x0)
  + ((y0 − yi)/f(xi, yi; x0, y0)) · (y − y0) + εi ,
  i = 1, 2, 3 .

With the help of a multiple linear regression (without constant), we can estimate the parameters (Δx, Δy) = (x − x0, y − y0) of the following model (the "observations" of the model appear on the left-hand side):

di − f(xi, yi; x0, y0) = Δx · (x0 − xi)/f(xi, yi; x0, y0)
                       + Δy · (y0 − yi)/f(xi, yi; x0, y0) + εi ,
                       i = 1, 2, 3 .

Let (Δx̂, Δŷ) be the estimate of (Δx, Δy) that is found. We thus get a better estimate of the coordinates of P: (x0 + Δx̂, y0 + Δŷ).
Using this new point as our point of departure, we can apply the same procedure again. The successive estimates converge to the desired point, that is, to the point that fits the given distances best.
Let us now take a concrete example: P1 = (0, 5); P2 = (5, 0); P3 = (10, 10); d1 = 6; d2 = 4; d3 = 6.
If we choose as the initial point of the Gauss–Newton procedure the point

P = ((0+5+10)/3, (5+0+10)/3) = (5, 5) ,

we obtain the following sequence of coordinates:

(5, 5)
(6.2536, 4.7537)
(6.1704, 4.7711)
(6.1829, 4.7621)
(6.1806, 4.7639)

which converges to the point (6.1806, 4.7639). The distances between the point found and P1, P2, and P3 are respectively 6.1856, 4.9078, and 6.4811.

FURTHER READING
Least squares
Model
Multiple linear regression
Normal equations
Regression analysis
Simple linear regression

REFERENCES
Bates, D.M., Watts, D.G.: Nonlinear Regression Analysis and Its Applications. Wiley, New York (1988)
Draper, N.R., Smith, H.: Applied Regression Analysis, 3rd edn. Wiley, New York (1998)
Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. Roy. Soc. Lond. Ser. A 222, 309–368 (1922)
Fisher, R.A., Mackenzie, W.A.: The manurial response of different potato varieties. J. Agricult. Sci. 13, 311–320 (1923)
Fletcher, R.: Practical Methods of Optimization. Wiley, New York (1997)
Seber, G.A.F., Wild, C.J.: Nonlinear Regression. Wiley, New York (2003)

Nonparametric Statistics

Statistical procedures that allow us to process data from small samples, on variables about which nothing is known concerning their distribution. Specifically, nonparametric statistical methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). Nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Nonparametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from the data. Therefore, these methods are also sometimes called parameter-free methods or distribution-free methods.

HISTORY
The term nonparametric was first used by Wolfowitz in 1942.
See also nonparametric test.

DOMAINS AND LIMITATIONS
Nonparametric methods range from the analysis of a one-way classification model for comparing treatments to regression and curve-fitting problems.
The most frequently used nonparametric tests are the Anderson–Darling test, the chi-square test, Kendall's tau, the Kolmogorov–Smirnov test, the Kruskal–Wallis test, the Wilcoxon rank sum test, Spearman's rank correlation coefficient, and the Wilcoxon signed rank test.
Nonparametric tests have less power than the appropriate parametric tests, but are more robust when the assumptions underlying the parametric test are not satisfied.

EXAMPLES
A histogram is a simple nonparametric estimate of a probability distribution. A generalization of the histogram is the kernel smoothing technique, by which a very smooth probability density function can be constructed from a given data set.

See also nonparametric tests.
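The kernel smoothing idea just mentioned can be sketched in a few lines with a Gaussian kernel (the sample data and bandwidth below are our own illustration):

```python
from math import exp, pi, sqrt

def kde(x, data, h):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    n = len(data)
    k = lambda u: exp(-0.5 * u * u) / sqrt(2 * pi)
    return sum(k((x - xi) / h) for xi in data) / (n * h)

data = [4.1, 4.5, 4.8, 5.0, 5.1, 5.4, 5.9, 6.3]   # illustrative sample
# Riemann check: the estimated density integrates to ~1 over a wide grid
grid = [i * 0.01 for i in range(0, 1100)]
area = sum(kde(x, data, h=0.4) * 0.01 for x in grid)
print(round(area, 3))  # 1.0
```

Unlike a histogram, the estimate is smooth in x and does not depend on a choice of bin edges, only on the bandwidth h.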
FURTHER READING
Nonparametric test

REFERENCES
Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065–1076 (1962)
Savage, I.R.: Bibliography of Nonparametric Statistics and Related Topics: Introduction. J. Am. Stat. Assoc. 48, 844–849 (1953)
Gibbons, J.D., Chakraborti, S.: Nonparametric Statistical Inference, 4th edn. CRC Press (2003)

Nonparametric Test

A nonparametric test is a type of hypothesis testing in which it is not necessary to specify the form of the distribution of the population under study. In any event, we should have independent observations, that is, the selection of the individuals included in the sample must not influence the choice of the other individuals.

HISTORY
The first nonparametric test appeared in the work of Arbuthnott, J. (1710), who introduced the sign test. But most nonparametric tests were developed between 1940 and 1955. We make special mention of the articles of Kolmogorov, Andrey Nikolaevich (1933), Smirnov, Nikolai Vasilyevich (1939), Wilcoxon, F. (1945, 1947), Mann, H.B. and Whitney, D.R. (1947), Mood, A.M. (1950), and Kruskal, W.H. and Wallis, W.A. (1952). Later, many other articles were added to this list. Savage, I.R. (1962) published a bibliography of about 3000 articles, written before 1962, concerning nonparametric tests.

DOMAINS AND LIMITATIONS
The fast development of nonparametric tests can be explained by the following points:
• Nonparametric methods require few assumptions concerning the population under study, such as the assumption of normality.
• Nonparametric methods are often easier to understand and to apply than the equivalent parametric tests.
• Nonparametric methods are applicable in situations where parametric tests cannot be used, for example, when the variables are measured only on ordinal scales.
• Despite the fact that, at first view, nonparametric methods seem to sacrifice an essential part of the information contained in the samples, theoretical research has shown that nonparametric tests are only slightly inferior to their parametric counterparts when the distribution of the studied population is specified, for example, as the normal distribution. Nonparametric tests, on the other hand, are superior to parametric tests when the distribution of the population is far from the specified (normal) distribution.

EXAMPLES
See Kolmogorov–Smirnov test, Kruskal–Wallis test, Mann–Whitney test, Wilcoxon signed test, sign test.

FURTHER READING
Goodness of fit test
Hypothesis testing
Kolmogorov–Smirnov test
Kruskal–Wallis test
Mann–Whitney test
Sign test
Spearman rank correlation coefficient
Test of independence
Wilcoxon signed test
Wilcoxon test

REFERENCES
Arbuthnott, J.: An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes. Philos. Trans. 27, 186–190 (1710)
Kolmogorov, A.N.: Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, 83–91 (1933)
Kruskal, W.H., Wallis, W.A.: Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 47, 583–621 (1952); errata, ibid. 48, 907–911
Mann, H.B., Whitney, D.R.: On a test whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947)
Mood, A.M.: Introduction to the Theory of Statistics. McGraw-Hill, New York (1950)
Savage, I.R.: Bibliography of Nonparametric Statistics. Harvard University Press, Cambridge, MA (1962)
Smirnov, N.V.: Estimate of deviation between empirical distribution functions in two independent samples (in Russian). Bull. Moscow Univ. 2(2), 3–16 (1939)
Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1, 80–83 (1945)
Wilcoxon, F.: Some rapid approximate statistical procedures. American Cyanamid, Stamford Research Laboratories, Stamford, CT (1957)

Norm of a Vector

The norm of a vector indicates the length of this vector, defined from the scalar product of the vector with itself: the norm is obtained by taking the square root of this scalar product.
A vector of norm 1 is called a unit vector.

MATHEMATICAL ASPECTS
For a given vector

x = (x1, x2, ..., xn)' ,

we define the scalar product of x with itself by:

x · x = Σ_{i=1}^{n} xi² .

The norm of the vector x thus equals:

‖x‖ = √(x · x) = √(Σ_{i=1}^{n} xi²) .

DOMAINS AND LIMITATIONS
As the norm of a vector is calculated with the help of the scalar product, the properties of the norm come from those of the scalar product. Thus:
• The norm of a vector x is strictly positive if the vector is not zero, and zero if it is:

‖x‖ > 0 if x ≠ 0 and ‖x‖ = 0 if and only if x = 0 .
• The norm of the result of the multiplication of a vector x by a scalar k equals the product of the norm of x and the absolute value of the scalar:

‖k · x‖ = |k| · ‖x‖ .

• For two vectors x and y, we have:

‖x + y‖ ≤ ‖x‖ + ‖y‖ ,

that is, the norm of a sum of two vectors is smaller than or equal to the sum of the norms of these vectors (the triangle inequality).
• For two vectors x and y, the absolute value of the scalar product of x and y is smaller than or equal to the product of the norms:

|x · y| ≤ ‖x‖ · ‖y‖ .

This result is called the Cauchy–Schwarz inequality.
The norm of a vector is also used to obtain a unit vector having the same direction as the given vector. It is enough in this case to divide the initial vector by its norm:

y = x/‖x‖ .

EXAMPLES
Consider the following vector defined in the Euclidean space of three dimensions:

x = (12, 15, 16)' .

The scalar product of this vector with itself equals:

x · x = 12² + 15² + 16² = 144 + 225 + 256 = 625 ,

from which we get the norm of x:

‖x‖ = √(x · x) = √625 = 25 .

The unit vector having the direction of x is obtained by dividing each component of x by 25:

y = (0.48, 0.6, 0.64)' ,

and we can verify that ‖y‖ = 1. Indeed:

‖y‖² = 0.48² + 0.6² + 0.64² = 0.2304 + 0.36 + 0.4096 = 1 .

FURTHER READING
Least squares
L1 estimation
Vector

REFERENCES
Dodge, Y.: Mathématiques de base pour économistes. Springer, Berlin Heidelberg New York (2002)
Dodge, Y., Rousson, V.: Multivariate L1-mean. Metrika 49, 127–134 (1999)

Normal Distribution

A random variable X is distributed according to a normal distribution if it has a density function of the form:

f(x) = (1/(σ√(2π))) · exp(−(x − μ)²/(2σ²)) ,  σ > 0 .

[Figure: density curve of the normal distribution, μ = 0, σ = 1]
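As a quick numerical sketch (ours, not part of the original entry), the density above can be evaluated directly; at x = μ it equals 1/(σ√(2π)) ≈ 0.3989 in the standard case μ = 0, σ = 1:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(round(normal_pdf(0.0), 4))  # 0.3989, the peak of the standard normal
print(round(normal_pdf(1.0), 4))  # 0.242, equal to normal_pdf(-1.0) by symmetry
```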
We say that X follows a normal distribution with mean μ and variance σ². The normal distribution is a continuous probability distribution.

HISTORY
The normal distribution is often attributed to Laplace, P.S. and Gauss, C.F., whose name it bears. However, its origin dates back to the works of Bernoulli, J., who in his work Ars Conjectandi (1713) provided the first basic elements of the law of large numbers.
In 1733, de Moivre, A. was the first to obtain the normal distribution, as an approximation of the binomial distribution. This work was written in Latin and published in an English version in 1738. De Moivre, A. called what he found a "curve"; he discovered this curve while calculating the probabilities of gain for different games of chance.
Laplace, P.S., after de Moivre, A., studied this distribution and obtained a more formal and general result than the de Moivre approximation. In 1774 he obtained the normal distribution as an approximation of the hypergeometric distribution.
Gauss, C.F. studied this distribution through problems of measurement in astronomy. His works of 1809 and 1816 established techniques based on the normal distribution that became standard methods during the 19th century. Note that even if the first approximation of this distribution is due to de Moivre, Galileo had already found that the errors of observation were distributed in a symmetric way and tended to regroup around their true value.
Many names for the normal distribution can be found in the literature. Quetelet, A. (1796–1874) spoke of the "curve of possibilities" or the "distribution of possibilities". Note also that Galton, F. spoke of the "frequency of error distribution" or of the "distribution of deviations from a mean". Stigler, S. (1980) presents a more detailed discussion of the different names of this curve.

MATHEMATICAL ASPECTS
The expected value of the normal distribution is given by:

E[X] = μ .

The variance is equal to:

Var(X) = σ² .

If the mean μ is equal to 0 and the variance σ² is equal to 1, then we obtain the centered and reduced normal distribution (or standard normal distribution), whose density function is given by:

f(x) = (1/√(2π)) · exp(−x²/2) .

If a random variable X follows a normal distribution with mean μ and variance σ², then the random variable

Z = (X − μ)/σ

follows a centered and reduced normal distribution (with mean 0 and variance 1).

DOMAINS AND LIMITATIONS
The normal distribution is the most famous continuous probability distribution. It plays a central role in the theory of probability and its statistical applications. Many measurements, such as the size or weight of individuals, the diameter of a piece of machinery, the results of an IQ test, etc.,
approximately follow a normal distribution. The normal distribution is frequently used as an approximation, either when normality is attributed to a distribution in the construction of a model or when a known distribution is replaced by a normal distribution with the same expected value and variance. It is used for the approximation of the chi-square distribution, the Student distribution, and discrete probability distributions such as the binomial distribution and the Poisson distribution.
The normal distribution is also a fundamental element of the theory of sampling, where its role is important in the study of correlation, regression analysis, variance analysis, and covariance analysis.

FURTHER READING
Continuous probability distribution
Lognormal distribution
Normal table

REFERENCES
Bernoulli, J.: Ars Conjectandi, Opus Posthumum. Accedit Tractatus de Seriebus infinitis, et Epistola Gallice scripta de ludo Pilae recticularis. Impensis Thurnisiorum, Fratrum, Basel (1713)
Gauss, C.F.: Theoria Motus Corporum Coelestium. Werke, 7 (1809)
Gauss, C.F.: Bestimmung der Genauigkeit der Beobachtungen, vol. 4, pp. 109–117 (1816). In: Gauss, C.F. Werke (published in 1880). Dieterichsche Universitäts-Druckerei, Göttingen
Laplace, P.S. de: Mémoire sur la probabilité des causes par les événements. Mem. Acad. Roy. Sci. (presented by various scientists) 6, 621–656 (1774) (or Laplace, P.S. de (1891) Œuvres complètes, vol. 8. Gauthier-Villars, Paris, pp. 27–65)
Moivre, A. de: Approximatio ad summam terminorum binomii (a + b)^n, in seriem expansi. Supplementum II to Miscellanea Analytica, pp. 1–7 (1733). Photographically reprinted in a rare pamphlet on Moivre and some of his discoveries. Published by Archibald, R.C. Isis 8, 671–683 (1926)
Moivre, A. de: The Doctrine of Chances: or, A Method of Calculating the Probability of Events in Play. Pearson, London (1718)
Stigler, S.: Stigler's law of eponymy. Transactions of the New York Academy of Sciences, 2nd series 39, 147–157 (1980)

Normal Equations

Normal equations are the equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); the normal equations allow one to estimate the parameters of a multiple linear regression.

HISTORY
See analysis of regression.

MATHEMATICAL ASPECTS
Consider a general model of multiple linear regression:

Yi = β0 + Σ_{j=1}^{p−1} βj · Xji + εi ,  i = 1, ..., n ,

where Yi is the dependent variable, Xji, j = 1, ..., p−1, are the independent variables, εi is the nonobservable random error term,
βj, j = 0, ..., p−1, are the parameters to be estimated, and n is the number of observations.
To apply the method of least squares, we should minimize the sum of squared errors:

f(β0, β1, ..., βp−1) = Σ_{i=1}^{n} εi²
                     = Σ_{i=1}^{n} (Yi − β0 − Σ_{j=1}^{p−1} βj · Xji)² .

Setting equal to zero the p partial derivatives with respect to the parameters to be estimated, we get the p normal equations:

∂f(β0, β1, ..., βp−1)/∂β0 = 0 ,
∂f(β0, β1, ..., βp−1)/∂β1 = 0 ,
...
∂f(β0, β1, ..., βp−1)/∂βp−1 = 0 .

We can also express the same model in matrix form:

Y = Xβ + ε ,

where Y is the (n × 1) vector of observations of the dependent variable (n observations), X is the (n × p) design matrix of the independent variables, ε is the (n × 1) vector of errors, and β is the (p × 1) vector of the parameters to be estimated.
By the least-squares method, we should minimize:

ε'ε = (Y − Xβ)'(Y − Xβ)
    = Y'Y − β'X'Y − Y'Xβ + β'X'Xβ
    = Y'Y − 2β'X'Y + β'X'Xβ .

Setting to zero the derivative with respect to β, corresponding to the partial derivatives but written in vector form, we find the normal equations:

X'Xβ̂ = X'Y .

DOMAINS AND LIMITATIONS
We can generalize the concept of normal equations to the case of a nonlinear regression (with r observations and p + 1 parameters to estimate). They are expressed by:

Σ_{i=1}^{r} (Yi − f(X1i, ..., Xki; β0, ..., βp)) · ∂f(X1i, ..., Xki; β0, ..., βp)/∂βj = 0 ,

where j = 0, ..., p. Because f is nonlinear in the parameters to be estimated, the normal equations are also nonlinear and so can be very difficult to solve. Moreover, the system of equations can have more than one solution, corresponding to the possibility of several minima of the sum of squares. Normally, we try to develop an iterative procedure to solve the equations, or we refer to the techniques explained under nonlinear regression.

EXAMPLES
If the model contains, for example, three parameters β0, β1, and β2, the normal equations are the following:

∂f(β0, β1, β2)/∂β0 = 0 ,
∂f(β0, β1, β2)/∂β1 = 0 ,
∂f(β0, β1, β2)/∂β2 = 0 .
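As an illustrative sketch (the data and helper names below are our own, not from the entry), the matrix normal equations X'Xβ̂ = X'Y can be built and solved directly for a small two-regressor model:

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]            # partial pivoting
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 2.0, 2.0], [1.0, 3.0, 1.0]]
Y = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in X]

# Normal equations: (X'X) beta = X'Y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
XtY = [sum(r[i] * y for r, y in zip(X, Y)) for i in range(3)]
beta = solve(XtX, XtY)

print([round(b, 6) for b in beta])  # recovers [1.0, 2.0, 3.0]
```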
That is:

β0·n + β1·ΣX1i + β2·ΣX2i = ΣYi ,
β0·ΣX1i + β1·ΣX1i² + β2·ΣX1i·X2i = ΣX1i·Yi ,
β0·ΣX2i + β1·ΣX2i·X1i + β2·ΣX2i² = ΣX2i·Yi ,

all sums running over i = 1, . . . , n.
See simple linear regression.

FURTHER READING
Error
Least squares
Multiple linear regression
Nonlinear regression
Parameter
Simple linear regression

REFERENCES
Legendre, A.M.: Nouvelles méthodes pour la détermination des orbites des comètes. Courcier, Paris (1805)

Normal Probability Plot
A normal probability plot allows one to verify whether a data set is distributed according to a normal distribution. When this is the case, it is also possible to estimate the mean and the standard deviation from the plot.
The cumulated frequencies are plotted on the ordinate on a Gaussian scale, graduated by putting the value of the distribution function F(t) of the (centered and reduced) normal distribution opposite the point located at a distance t from the origin. On the abscissa, the observations are represented on an arithmetic scale. In practice, the normal probability papers sold on the market are used.

MATHEMATICAL ASPECTS
On an axis system, the choice of the origin and of the size of the unit length is made according to the observations that are to be represented. The abscissa is arithmetically graduated, like millimeter paper.
On the ordinate, the values of the distribution function F(t) of the normal distribution are transcribed at the height t from a fictive origin placed at mid-height of the sheet of paper. Therefore the correspondences

F(t)      t
0.9772    2
0.8413    1
0.5000    0
0.1587   −1
0.0228   −2

will be placed, and only the scale on the left, which will be graduated from 0 to 1 (or from 0 to 100 if written in %), will be kept.
Consider a set of n data points that are supposed to be classified in increasing order. Let us call this series x1, x2, . . . , xn; for each abscissa xi (i = 1, . . . , n), the ordinate is calculated as

100 · (i − 3/8)/(n + 1/4)

and represented on the normal probability plot. It is also possible to transcribe

100 · (i − 1/2)/n

as a function of xi on the plot. This second approximation of the cumulated frequency is explained as follows. If the surface located under the normal curve (with area equal to
[Figure: normal probability paper; vertical scale graduated from 0.01% to 99.99%]
1) is divided into n equal parts, it can be supposed that each of our n observations falls in one of these parts. Therefore the ith observation xi (in increasing order) is located in the middle of the ith part, which in terms of cumulated frequency corresponds to

(i − 1/2)/n .

The normality of the observed distribution can then be judged by examining the alignment of the points: if the points are aligned, then the distribution is normal.

DOMAINS AND LIMITATIONS
By representing the cumulated frequencies as a function of the values of the statistical variable on a normal probability plot, it is possible to verify whether the observed distribution follows a normal distribution. Indeed, if this is the case, the points are approximately aligned.
Therefore the graphical adjustment of a line to the set of points allows one to:
• Visualize the normal character of the distribution in a better way than with a histogram.
• Make a graphical estimation of the mean and the standard deviation of the observed distribution.
The line of least squares cuts the ordinate 0.50 (or 50%) at the estimate of the mean (which can be read on the abscissa). The estimated standard deviation is given by the inverse of the slope of the line.
It is easy to determine the slope by simply considering two points. Let us choose (μ̂, 0.5) and the point at the ordinate

0.9772 = F(2) = F((X0.9772 − μ̂)/σ̂) ,

whose abscissa X0.9772 is read on the line. We have:

1/σ̂ = (F⁻¹(0.9772) − F⁻¹(0.5))/(X0.9772 − μ̂)
     = slope of the line of least squares ,

from which we finally obtain:

σ̂ = (X0.9772 − μ̂)/2 .

EXAMPLES
Consider the diameters of ten pieces fabricated by a machine, taken at random:

9.36 , 9.69 , 11.10 , 8.72 , 10.13 ,
9.98 , 11.42 , 10.33 , 9.71 , 10.96 .

These pieces, supposedly distributed according to a normal distribution, are classified in increasing order, and for each diameter

fi = (i − 3/8)/(10 + 1/4)

is calculated, i being the order number of each observation:

xi       fi
8.72    5/82
9.36   13/82
9.69   21/82
9.71   29/82
9.98   37/82
xi       fi
10.13  45/82
10.33  53/82
10.96  61/82
11.10  69/82
11.42  77/82

The graphical representation of fi with respect to xi on a piece of normal probability paper provides, for our random sample:

estimated mean of the diameters: μ̂ = 10.2 ,
estimated standard deviation: σ̂ = 0.9 .

FURTHER READING
Distribution function
Frequency
Frequency distribution
Normal distribution

Normal Table
The normal table, also called a Gauss table or a normal distribution table, gives the values of the distribution function of a random variable following a standard normal distribution, that is, with mean equal to 0 and variance equal to 1.

HISTORY
de Laplace, Pierre Simon (1774) obtained the normal distribution from the hypergeometric distribution, and in a second work (dated 1778 but published in 1781) he gave a normal table. Pearson, Egon Sharpe and Hartley, H.O. (1948) edited a normal table based on the values calculated by Sheppard, W.F. (1903, 1907). Exhaustive lists of the normal tables on the market were given by the National Bureau of Standards (until 1952) and by Greenwood, J.A. and Hartley, H.O. (until 1958).

DOMAINS AND LIMITATIONS
See central limit theorem.

MATHEMATICAL ASPECTS
Let the random variable Z follow a normal distribution of mean 0 and variance 1. Its density function f(t) equals:

f(t) = (1/√(2π)) · exp(−t²/2) .

The distribution function of the random variable Z is defined by:

F(z) = P(Z ≤ z) = ∫ from −∞ to z of f(t) dt .

The normal table represented here gives the values of F(z) for values of z between 0 and 3.5.
The normal distribution being symmetric relative to the mean, we have:

F(z) = 1 − F(−z) ,

which allows one to determine F(z) for a negative value of z.
Sometimes we must use the normal table in the inverse direction, to find the value of z corresponding to a given probability. We generally denote by z = zα the value of the random variable Z for which

P(Z ≤ zα) = 1 − α .
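The table lookups and critical values described here can be reproduced with Python's standard library; in this sketch, `NormalDist().cdf` and `NormalDist().inv_cdf` play the roles of F and F⁻¹.

```python
# Sketch: normal-table lookups via the standard library's NormalDist,
# whose cdf / inv_cdf correspond to F and its inverse in this entry.
from statistics import NormalDist

F = NormalDist().cdf          # distribution function F(z) of N(0, 1)
F_inv = NormalDist().inv_cdf  # inverse, i.e. reading the table "backwards"

print(round(F(2.5), 4))        # 0.9938, the table value for z = 2.5
print(round(1 - F(-2.5), 4))   # same value via the symmetry F(z) = 1 - F(-z)
print(round(F_inv(0.95), 3))   # z_0.05 (= 1.645 in the table)
print(round(F_inv(0.975), 3))  # z_0.025 (= 1.960 in the table)
```

This replaces a table lookup by a function call, but the values agree with the printed table to the four decimals shown.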
EXAMPLES
See Appendix G.
Examples of normal table use:

P(Z ≤ 2.5) = 0.9938 ,

P(1 ≤ Z ≤ 2) = P(Z ≤ 2) − P(Z ≤ 1)
             = 0.9772 − 0.8413 = 0.1359 ,

P(Z ≤ −1) = 1 − P(Z ≤ 1)
          = 1 − 0.8413 = 0.1587 ,

P(−0.5 ≤ Z ≤ 1.5) = P(Z ≤ 1.5) − P(Z ≤ −0.5)
                  = P(Z ≤ 1.5) − [1 − P(Z ≤ 0.5)]
                  = 0.9332 − [1 − 0.6915]
                  = 0.9332 − 0.3085 = 0.6247 .

Conversely, we can use the normal table to determine the limit zα for which the probability that the random variable Z takes a value smaller than that limit is equal to a fixed value (1 − α), that is:

P(Z ≤ zα) = 0.95 ⇒ zα = 1.645 ,
P(Z ≤ zα) = 0.975 ⇒ zα = 1.960 .

This allows us, in hypothesis testing, to determine the critical value zα relative to a given significance level α.

Example of Application
In the case of a one-tailed test, if the significance level α equals 5%, then we can determine the critical value z0.05 in the following manner:

P(Z ≤ z0.05) = 1 − α = 1 − 0.05 = 0.95
⇒ z0.05 = 1.645 .

In the case of a two-tailed test, we should find the value zα/2. With a significance level of 5%, the critical value z0.025 is obtained in the following manner:

P(Z ≤ z0.025) = 1 − α/2 = 1 − 0.025 = 0.975
⇒ z0.025 = 1.960 .

FURTHER READING
Hypothesis testing
Normal distribution
One-sided test
Statistical table
Two-sided test

REFERENCES
Greenwood, J.A., Hartley, H.O.: Guide to Tables in Mathematical Statistics. Princeton University Press, Princeton, NJ (1962)
Laplace, P.S. de: Mémoire sur la probabilité des causes par les événements. Mem. Acad. Roy. Sci. (presented by various scientists) 6, 621–656 (1774) (or Laplace, P.S. de (1891) Œuvres complètes, vol. 8. Gauthier-Villars, Paris, pp. 27–65)
Laplace, P.S. de: Mémoire sur les probabilités. Mem. Acad. Roy. Sci. Paris, 227–332 (1781) (or Laplace, P.S. de (1891) Œuvres complètes, vol. 9. Gauthier-Villars, Paris, pp. 385–485)
National Bureau of Standards: A Guide to Tables of the Normal Probability Integral. U.S. Department of Commerce, Applied Mathematics Series 21 (1952)
Pearson, E.S., Hartley, H.O.: Biometrika Tables for Statisticians, vol. 1, 2nd edn. Cambridge University Press, Cambridge (1948)
Normalization 387
Sheppard, W.F.: New tables of the proba- Here are the means and the standard devia-
bility integral. Biometrika 2, 174–190 tions of each exam calculated on all the stu-
(1903) dents:
Sheppard, W.F.: Table of deviates of the nor- Exam 1: μ1 = 35 , σ1 = 4 ,
mal curve. Biometrika 5, 404–406 (1907) Exam 2: μ2 = 45 , σ2 = 1.5 ,
A student named Marc got the following
Normalization results:
Normalization is the transformation from Exam 1: X1 = 41 ,
a normally distributed random variable to Exam 2: X2 = 48 .
a random variable following a standard nor-
The question is to know which exam Marc
mal distribution. It allows to calculate and
scored better on relative to the other stu-
compare the values belonging to the nor-
dents. We cannot directly comparetheresults
mal curves of the mean and of different
of two exams because the results belong to
variances on the basis of a reference nor-
distributions with different means and stan-
mal distribution, that is, the standard normal
dard deviations.
distribution.
Is we simply examine the difference of each
note and compare that to the mean of its
MATHEMATICAL ASPECTS distribution, we get:
Let the random variable
X
follow the nor-
mal distribution N μ, σ 2 . Its normaliza- Exam 1: X1 − μ1 = 41 − 35 = 6 ,
tion gives us a random variable Exam 2: X2 − μ2 = 48 − 45 = 3 .
N
X−μ Note that Marc’s score was 6 points higher
Z=
σ than the mean on exam 1 and only 3 points
higher than the mean on exam 2. A hasty con-
that follows a standard normal distribution
clusion would suggest to us that Marc did
N (0, 1).
better on exam 1 than on exam 2, relative to
Each value of the distribution N μ, σ 2 can
the other students.
be transformed into standard variable Z,
This conclusion takes into account only the
each Z representing a deviation from the
difference of each result from the mean. It
mean expressed in units of standard devi-
does not consider the dispersion of student
ation.
grades around the mean. We divide the dif-
ference from the mean by the standard devi-
EXAMPLES ation to make the results comparable:
The students of a professional school have
X 1 − μ1 6
had two exams. Each exam was graded on Exam 1: Z1 = = = 1.5 ,
a scale of 1 to 60, and the grades are consid- σ1 4
X 2 − μ 2 3
ered the realizations of two random variables Exam 1: Z2 = = = 2.
of a normal distribution. Let us compare σ2 1.5
the grades received by the students on these By this calculation we have normalized the
two exams. scores X1 and X2 . We can conclude that the
normalized value is greater for exam 2 (Z2 = 2) than for exam 1 (Z1 = 1.5) and that Marc therefore did better on exam 2, relative to the other students.

FURTHER READING
Normal distribution

Null Hypothesis
In hypothesis testing, the null hypothesis is the hypothesis that is to be tested. It is designated by H0. The opposite hypothesis is called the alternative hypothesis; it is the alternative hypothesis that will be accepted if the test leads to rejecting the null hypothesis.

HISTORY
See hypothesis and hypothesis testing.

MATHEMATICAL ASPECTS
In hypothesis testing on a parameter of a population, the null hypothesis is usually a supposition on the presumed value of this parameter. The null hypothesis is then presented as:

H0: θ = θ0 ,

θ being the unknown parameter of the population and θ0 the presumed value of this parameter. This parameter can be, for example, the mean of the population.
When hypothesis testing aims at comparing two populations, the null hypothesis generally supposes that the parameters are equal:

H0: θ1 = θ2 ,

where θ1 is the parameter of the population from which the first sample came and θ2 that of the population from which the second sample came.

EXAMPLES
We are going to examine the null hypothesis in three examples of hypothesis testing:
1. Hypothesis testing on the percentage of a sample
   A candidate running for office wants to know if he will receive more than 50% of the vote. The null hypothesis for this problem can be posed thus:

   H0: π = 0.5 ,

   where π is the percentage of the population to be estimated.
2. Hypothesis testing on the mean of a population
   A manufacturer wants to test the precision of a new machine that should produce pieces 8 mm in diameter. We can pose the following null hypothesis:

   H0: μ = 8 ,

   where μ is the mean of the population to be estimated.
3. Hypothesis testing on the comparison of the means of two populations
   An insurance company has decided to equip its offices with computers. It wants to buy the computers from two different suppliers if there is no significant difference between the reliability of the two brands. It draws a sample of PCs coming from each brand and records, for each PC, the time elapsed before the first problem occurred.
   According to the null hypothesis, the mean time elapsed until the first problem is the same for each brand:

   H0: μ1 − μ2 = 0 ,
where μ1 and μ2 are the respective means of the two populations. This hypothesis can also be written as:

H0: μ1 = μ2 .

FURTHER READING
Alternative hypothesis
Analysis of variance
Hypothesis
Hypothesis testing
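An illustrative sketch of the second example above (testing H0: μ = 8 for the machine), using a two-sided z-test with the critical value z0.025 = 1.960 from the normal table. The sample of diameters is hypothetical, and the normal approximation for the standardized mean is assumed.

```python
# Hypothetical sketch: two-sided test of H0: mu = 8 at the 5% significance
# level, using the normal-table critical value z_0.025 = 1.960.
from statistics import NormalDist, mean, stdev

diameters = [8.02, 7.97, 8.05, 7.99, 8.01, 7.96, 8.04, 8.00]  # invented sample
mu0 = 8.0
n = len(diameters)

# Standardized distance of the sample mean from the presumed value mu0.
z = (mean(diameters) - mu0) / (stdev(diameters) / n ** 0.5)

z_crit = NormalDist().inv_cdf(0.975)  # two-sided critical value, about 1.960
reject_h0 = abs(z) > z_crit
print(reject_h0)                      # False: H0 is not rejected here
```

With these invented values the standardized mean stays well inside the acceptance region, so the null hypothesis is not rejected; a real application with this small a sample would normally use the Student distribution instead of the normal approximation.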