Introduction to Statistics

Danny M. Vaughn, Ph.D., CMS

Introduction

The nature of statistical applications is introduced throughout the two Spectral
courses, and while a detailed treatment is beyond the course’s principal thesis,
there is a need to introduce these quantitative measures as a means of
understanding the analysis of multispectral data. This is a short discussion, and
many of the details, including more advanced formulas and spatial statistical
applications, are available upon request for those interested in penetrating deeper
into the quantitative aspects of digital image processing, Geographic Information
Systems, spatial modeling, and analysis.

Statistics

Statistics deals with the collection, classification, description, presentation,
and analysis (interpretation) of data (numerical information). It is based upon
observations or measurements. Statistics is inductive: specific observations and
measurements yield a more general conclusion. Statistics relies upon some notion
of repetition, so that estimates can be derived, and the variation and uncertainty
of an estimate understood, from repeated observations. Statistics are often used
to describe and summarize data. They can also help to generalize complex spatial
patterns such as clustered, uniform, regular, and random patterns. Probabilities
or estimates of outcomes can also be determined for an event at a given location
within specified limits. Statistics is sometimes referred to as the study of
variation in data sets.

Statistics allow us to make inferences (conclusions most often based upon known
and accepted facts) about a population of numerical data from a sample.
Populations are groups or aggregates of data. Sampling statistics are computed
from a portion of the total population data set. An estimate (statistic) is a
property of a sample drawn at random (by chance) from a population. Estimates are
expressed by roman letters: a sample standard deviation is symbolized by the
letter $s$, the mean by $\bar{X}$ (Xbar), and the variance by $s^2$. These
measures are discussed in more detail in later sections.

Selected Vocabulary

• Data – Numerical information.

• Data set – Groups of data in tabular format. A data set may consist of
observations, variables, and variates.

• Observation – Elements of phenomena (e.g. individuals, actions, processes).

• Variable – A property that can be measured, classified, or counted, e.g.
spectral values, male/female, discharge, velocity, etc.

• Variate – A particular value of a variable, e.g. a digital number in
multispectral imagery, discharge in m³ s⁻¹ (cubic meters per second), velocity
in m s⁻¹ (meters per second).

• Descriptive Statistics – A concise, numerical or quantitative summary is
reported for a variable or data set. Statistics include:

    • Measures of central tendency.

    • Measures of dispersion and variability.

    • Measures of shape or relative position.

    • Spatial data.

    • Location issues.

• Inferential Statistics – A reported result (generalization) is derived from a
sample from a larger population. Inferential Statistics are based upon probability
theory.

• Population Statistics – Groups or aggregates of data.

• Sampling Statistics – A portion of the total set of population data.

• Parameter – A property descriptive of a population. Population parameters are
expressed by Greek letters: $\sigma$ is the population standard deviation and
$\sigma^2$ is the population variance.

• Estimate (statistic) – A property of a sample drawn at random from a
population. Estimates are expressed by roman letters: the standard deviation is
$s$, the mean is $\bar{X}$ (Xbar), and the variance is $s^2$. Estimates are
usually reported together with confidence intervals.

• Function – A relationship between two variables such that the values of one
depend upon the values of the other. If the functional relationship is not
known, causal conclusions cannot be inferred.

• Hypothesis testing – Examples include Z tests and t tests.

Characteristics of Geographic Data

• Primary Data – Data acquired directly from an original source, e.g. from
a field observation or measurement.

• Secondary Data – Pre-existing data from an agency or other source.

Variables of the Data Set

• Continuous Variable – A variable that can take any value within a specified
range. Values which belong to a continuous series include height, weight,
chronological time, discharge, velocity, etc.

• Discontinuous (discrete) Variable – A variable that takes only specific values
(counted and limited to whole numbers). An eight-bit spectral digital number
ranges from 0 to 255. The size of a family (3) implies the exact size of the
group. Other examples include school enrollment and the number of books in a
library.

Levels of Measurement

• Nominal Variable – A qualitative property of equality or difference in
established categories. Categories must be exhaustive and mutually exclusive.

• Ordinal Variable – A property of equality or difference and rank order within
the data.

• Interval Variable – A property of equality or difference and order, with no
true 0 (starting) point within the data (e.g. temperature in °F or °C).

• Ratio Variable – A property of equality or difference and order, with a true 0
(starting) point within the data. Interval data may be transformed to ratio data
by taking the differences between variates, which cancels out the arbitrary
origin.
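
The effect of taking differences can be checked numerically. The following is a
minimal Python sketch, assuming three hypothetical temperature readings that do
not come from these notes. It suggests that ratios of raw interval values change
with the scale used, while ratios of differences do not.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0

# Hypothetical readings in degrees Celsius (illustrative assumption).
temps_c = [10.0, 20.0, 40.0]
temps_f = [c_to_f(t) for t in temps_c]

# Ratios of raw interval values depend on each scale's arbitrary zero point.
raw_ratio_c = temps_c[2] / temps_c[1]
raw_ratio_f = temps_f[2] / temps_f[1]

# Ratios of differences cancel the arbitrary origin and agree across scales.
diff_ratio_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])
diff_ratio_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])

print(raw_ratio_c, raw_ratio_f)    # the two raw ratios differ
print(diff_ratio_c, diff_ratio_f)  # the two difference ratios match
```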

Measurement Concepts

• Precision – The degree of exactness, or a measure of repeatability: how
closely repeated positions are clustered. Precision is based on a relative
reference, e.g. a circle 1 inch in diameter.

• Accuracy – The closeness of a position to a known absolute reference system.

• Validity – Credibility based upon operational definitions of acceptance. A
subjective parameter, e.g. level of poverty, quality of …, etc.

• Reliability – How consistent, repeatable, or stable the data are as spatial
patterns change over time.

Basic Statistical Properties

Constant – A property common to all members of a group.

• Property 1 – Multiplying each score ($X_i$) by a constant ($c$) and summing is
equal to adding all the scores and multiplying the total by the constant:

$\sum cX_i = cX_1 + cX_2 + cX_3 + \dots + cX_n = c \sum X_i$

• Property 2 – The sum of a constant taken $N$ times equals $N$ times the
constant. If a given constant ($C$) equals 4, and there are 5 terms ($N$), then:

$\sum C = C + C + C + C + C = NC$

$4 + 4 + 4 + 4 + 4 = 20$, and $N(5) \times C(4) = 20$.

• Property 3 – The summation ($\sum$) of the sum of any number of terms is the
sum of the summations of these terms taken separately:

$\sum (X_i + Y_i + Z_i) = (X_1 + Y_1 + Z_1) + (X_2 + Y_2 + Z_2) + (X_3 + Y_3 + Z_3) + \dots = \sum X_i + \sum Y_i + \sum Z_i$

• Property 4 – The sum of the products of two sets of paired numbers is:

$\sum X_i Y_i = X_1 Y_1 + X_2 Y_2 + \dots + X_n Y_n$

• Property 5 – Given a set of values ($X_n$), the sum of the squared values is
$\sum X_n^2$, where:

    X_n        X_n^2
     3           9
     2           4
     5          25
     6          36
     4          16
    ______     ______
    $\sum X_n = 20$     $\sum X_n^2 = 90$

• Property 6 – Given a set of values ($X_n$), the square of the sum of the
values is $(\sum X_n)^2$, where:

$\sum X_n = 20$

$(\sum X_n)^2 = (20)^2 = 400$

Note that the sum of the squared values (90) is not the same quantity as the
square of the sum (400).
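
The six summation properties can be checked numerically. The following is a
minimal Python sketch that uses the X values from the Property 5 table; the
paired scores Y and the reuse of the constant c = 4 are illustrative assumptions
and are not part of the original notes.

```python
# X comes from the Property 5 table; Y is a hypothetical paired set of scores.
X = [3, 2, 5, 6, 4]
Y = [1, 4, 2, 3, 5]
c = 4
N = len(X)

# Property 1: sum(c * Xi) equals c * sum(Xi).
prop1_lhs = sum(c * x for x in X)
prop1_rhs = c * sum(X)

# Property 2: adding the constant N times equals N * c.
prop2 = sum(c for _ in range(N))

# Property 3: sum(Xi + Yi) equals sum(Xi) + sum(Yi).
prop3_lhs = sum(x + y for x, y in zip(X, Y))
prop3_rhs = sum(X) + sum(Y)

# Property 4: sum of the products of the paired scores.
sum_xy = sum(x * y for x, y in zip(X, Y))

# Property 5 (sum of squares) versus Property 6 (square of the sum).
sum_of_squares = sum(x ** 2 for x in X)
square_of_sum = sum(X) ** 2

print(prop1_lhs, prop1_rhs, prop2, prop3_lhs, prop3_rhs, sum_xy)
print(sum_of_squares, square_of_sum)
```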

The Normal Distribution of Scores

Frequency curves are conceptualized as extending along the x axis from minus
infinity to plus infinity, although in practice the tails taper off until they
lie barely above the x axis. The curve never intersects the x axis, yet the
total area under it is finite, and it is taken as unity (1) so that areas under
the curve can be read as proportions or probabilities. The Normal Curve
(Figure 1) is written in standard score (Z score) form with a mean equal to 0,
variance equal to 1, and standard deviation equal to 1.

Figure 1. A normal distribution illustrating the area percentages for plus or minus
three standard deviations.

Standard Scores (Z scores) are derived as a transformation from raw scores
(variates) to standard deviation units, which are used to compare a score
(variate) with a collection of scores (variates) derived from different
procedures (e.g. an English vs. a Mathematics test). Relative position is
considered rather than the magnitude and measurement units of the scores. The
formula for computing a Z score from a raw score is:

$Z = (X_i - \bar{X}) / s$

where $X_i$ is a variate, $\bar{X}$ (Xbar) is the mean, and $s$ is the standard
deviation. Measures of central tendency and measures of dispersion are presented
later in this set of statistics notes.

Properties of a Z score:

• If a raw score is greater than the mean ($X_i > \bar{X}$), it has a positive
Z score.

• If a raw score is less than the mean ($X_i < \bar{X}$), it has a negative
Z score.

• If a raw score equals the mean ($X_i = \bar{X}$), it has a Z score of 0.
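
The Z score transformation and its sign properties can be sketched in Python as
follows. The test scores, means, and standard deviations are hypothetical values
chosen for illustration; only the formula itself comes from these notes.

```python
def z_score(raw, mean, std_dev):
    """Standard score: distance of a raw score from the mean in standard deviation units."""
    return (raw - mean) / std_dev

# Hypothetical results for one student on two differently scaled tests.
english_z = z_score(raw=80.0, mean=70.0, std_dev=10.0)
maths_z = z_score(raw=75.0, mean=60.0, std_dev=5.0)

# Sign properties: above the mean, below the mean, and equal to the mean.
above = z_score(raw=65.0, mean=60.0, std_dev=5.0)
below = z_score(raw=55.0, mean=60.0, std_dev=5.0)
at_mean = z_score(raw=60.0, mean=60.0, std_dev=5.0)

print(english_z, maths_z, above, below, at_mean)
```

Under these assumed values the lower raw score (75 in Mathematics) has the
higher relative position, which is the kind of comparison standard scores are
meant to support.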

Standard scores (Z scores) have a mean = 0 and a standard deviation = 1, thus
they are readily amenable to algebraic manipulation. After computing a standard
score, locate the Z value in a table of Z scores (refer to any statistics
textbook). This gives the area under the curve between the mean and the Z score.
If the raw score is greater than the mean, add this area (expressed as a
percentage) to 50 to obtain the percentile rank. If the raw score is lower than
the mean, subtract the area (as a percentage) from 50.
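
In place of a printed Z table, the area between the mean and a Z score can be
computed from the error function. The following is a sketch that assumes the
scores are normally distributed; the function names are chosen here for
illustration.

```python
import math

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (0) and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

def percentile_rank_from_z(z):
    """Add the area (as a percentage) to 50 above the mean, subtract it below."""
    area_pct = 100.0 * area_mean_to_z(z)
    return 50.0 + area_pct if z >= 0 else 50.0 - area_pct

for z in (-1.0, 0.0, 1.0):
    print(z, round(percentile_rank_from_z(z), 2))
```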

Frequency Distribution

A Scatterplot is a graphic distribution of paired values (points) of two
variables, e.g. brightness values (Digital Numbers/DN) for two spectral bands.
A frequency distribution shows the number of times each value occurs, and
arranges scores from lowest to highest.

Given a frequency distribution of values (for example brightness values of 56,
57, 67, 99, 120), a histogram plots frequency (counts) on the vertical axis and
the values of the variable (brightness values in this case) on the horizontal
axis.

The Ogive or cumulative frequency (Cf) plot is a running count of frequencies
for each brightness value (BV) at or below a given level. The cumulative
percentage frequency (Cf%) is the ratio of the cumulative frequency to the total
number of BVs (200 in this example), so that Cf% = Cf / 200, expressed here as a
proportion.

    BV      Frequency    Cf      Cf%
    56          5          5     0.025
    57         34         39     0.195
    67        100        139     0.695
    99         45        184     0.920
    120        16        200     1.000
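
The cumulative columns of the table can be reproduced with a short Python
sketch. The brightness values and frequencies are taken from the table above;
the variable names are illustrative.

```python
from itertools import accumulate

# Brightness values and their frequencies, as listed in the table.
brightness_values = [56, 57, 67, 99, 120]
frequencies = [5, 34, 100, 45, 16]

total = sum(frequencies)
cum_freq = list(accumulate(frequencies))    # running count at or below each BV
cum_prop = [cf / total for cf in cum_freq]  # Cf% expressed as a proportion

for bv, f, cf, cp in zip(brightness_values, frequencies, cum_freq, cum_prop):
    print(f"{bv:>4} {f:>5} {cf:>5} {cp:>7.3f}")
```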

A percentile rank is the percentile corresponding to a raw score. For example,
a score at the 90th percentile (a percentile rank of 90) is interpreted as 90
percent of the scores being at or below this value, while 10 percent of the
scores are above this value. To obtain the percentile rank for a given score,
e.g. 67:

1. Calculate the lower true limit of the score (67) by subtracting 0.5 unit from the
score (66.5).

2. Subtract the lower limit (66.5) from the score whose percentile rank is being
estimated (67).

3. Multiply the result by the frequency of scores with a value of 67 (100).

4. Divide the result by the width of the class interval (1 in this case).

5. Add the result to the cumulative frequency below the lower limit of the
score (39).

6. Divide the result by the total number of frequencies (200).

$(((67 - 66.5)(100) / 1) + 39) / 200 = 0.445$, and $0.445 \times 100 = 44.5\%$.
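
The six steps can be sketched as a small Python function. This is an
illustration rather than a definitive implementation: it assumes the cumulative
frequency entering step 5 is the count of scores below the lower true limit of
the score (39 in the table above), which is the usual grouped-data convention,
and the function and parameter names are hypothetical.

```python
def percentile_rank(score, freq_at_score, cum_freq_below, total, width=1.0):
    """Percentile rank of a score by interpolation within its class interval."""
    lower_limit = score - 0.5                        # step 1: lower true limit
    distance = score - lower_limit                   # step 2
    within_interval = distance * freq_at_score / width   # steps 3 and 4
    count_below = within_interval + cum_freq_below   # step 5
    return 100.0 * count_below / total               # step 6, as a percentage

# Score 67: frequency 100, 39 scores below its lower limit, 200 scores in total.
pr_67 = percentile_rank(67, freq_at_score=100, cum_freq_below=39, total=200)
print(pr_67)
```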

In a given distribution of brightness values (frequency distribution), it is
important to recognize a variety of properties about the distribution.

Measures of Central Tendency

Central tendency describes the central location around which the scores of a
frequency distribution cluster. The measures discussed here are the mode, the
median, and the arithmetic mean, together with the deviation from the mean.

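Arithmetic Mean ($\bar{X}$, Xbar) – The sum of each value $X_i$ multiplied by
the frequency of its occurrence ($f_i$), divided by the number of measurements
($N$). The mean equals the arithmetic average:

$\bar{X} = \sum X_i / N$

also

$\bar{X} = (f_1 X_1 + f_2 X_2 + \dots + f_n X_n) / N = \sum f_i X_i / N$

where the sum over all $N$ individual scores equals the frequency-weighted sum
over the distinct values, $\sum X_i = \sum f_i X_i$.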
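
As an illustration, both forms of the mean can be computed in Python for the
brightness value frequency table given earlier in these notes (values 56, 57,
67, 99, 120 with frequencies 5, 34, 100, 45, 16). The variable names are
illustrative assumptions.

```python
# Distinct brightness values and their frequencies from the frequency table.
brightness_values = [56, 57, 67, 99, 120]
frequencies = [5, 34, 100, 45, 16]

n = sum(frequencies)  # number of measurements, N

# Frequency-weighted form: sum(fi * Xi) / N.
weighted_sum = sum(f * x for f, x in zip(frequencies, brightness_values))
mean_from_table = weighted_sum / n

# Score-by-score form: expand the table into N individual scores, then average.
expanded = [x for x, f in zip(brightness_values, frequencies) for _ in range(f)]
mean_from_scores = sum(expanded) / len(expanded)

print(mean_from_table, mean_from_scores)
```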

Deviation from the mean ($x_i$) – The difference between a particular score
($X_i$) and the mean ($\bar{X}$):

$x_i = X_i - \bar{X}$

• Property 1 of the mean – The sum of the deviations of all the measurements in
a set from their arithmetic mean equals 0:

$\sum (X_i - \bar{X}) = \sum X_i - N\bar{X} = N\bar{X} - N\bar{X} = 0$

since $\bar{X} = \sum X_i / N$, so that $\sum X_i = N\bar{X}$.

• Property 2 of the mean – The sum of squared deviations from the arithmetic
mean, $\sum (X_i - \bar{X})^2$ or $\sum x_i^2$, is less than the sum of squared
deviations from any other value.

• Property 3 of the mean – The mean is that measure of central tendency about
which the sum of squares is a minimum. It follows that the mean is a measure of
central location in the least squares sense.
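
These properties can be illustrated numerically. The following Python sketch
reuses the five scores from the summation examples earlier in these notes
(3, 2, 5, 6, 4) as an assumed data set; the alternative centers 3 and 5 are
arbitrary choices for comparison.

```python
X = [3, 2, 5, 6, 4]
N = len(X)
mean = sum(X) / N

def sum_sq_dev(values, center):
    """Sum of squared deviations of the values about a chosen center."""
    return sum((v - center) ** 2 for v in values)

# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - mean for x in X)

# Properties 2 and 3: the sum of squares is smallest about the mean.
ss_about_mean = sum_sq_dev(X, mean)
ss_about_3 = sum_sq_dev(X, 3)
ss_about_5 = sum_sq_dev(X, 5)

print(mean, sum_of_deviations, ss_about_mean, ss_about_3, ss_about_5)
```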

Median – The point on the number scale such that half of the observations fall
above it and half below it.

Mode – The most frequently occurring value. If the frequency of occurrence is
equal for each value, there is no mode. Where two values share the highest
frequency, the mode is determined by adding the brightness values of the two
that occur equally, and dividing by the number of tied values (2); when the two
values are far apart, the distribution is usually described as bimodal instead.
The mode represents the highest point on a curve (histogram). In a normal
distribution (symmetric bell curve, Figure 2), the mean, mode, and median are
the same value.
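
The median and mode can be computed with Python's standard `statistics` module
(`multimode` requires Python 3.8 or later). The two small samples below are
hypothetical brightness values chosen for illustration, and the averaging of
tied modes follows the convention described in these notes rather than a
universal rule.

```python
import statistics

# Hypothetical sample with a single most frequent value.
sample = [56, 57, 57, 67, 67, 67, 99]
median_value = statistics.median(sample)
modes = statistics.multimode(sample)

# Hypothetical sample in which two values tie for the highest frequency.
tied = [56, 56, 57, 57, 60]
tied_modes = statistics.multimode(tied)

# Convention from the notes: add the tied values and divide by their number.
averaged_mode = sum(tied_modes) / len(tied_modes)

print(median_value, modes, tied_modes, averaged_mode)
```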

Figure 2. A normal curve illustrating the relationship between the mean, median,
and mode.

When a distribution is not symmetric, it is skewed. When the tail extends long
to the right, the distribution is positively skewed (toward the high end of the
distribution). When the tail trends to the left, it is negatively skewed (toward
the low end of the distribution).

In a positively skewed distribution the mean, median, and mode are distributed
as illustrated below in Figure 3: typically the mode lies at the peak, with the
median and then the mean pulled progressively toward the long right tail.

Figure 3. A positively skewed distribution illustrating the relationship between
the mean, median, and mode.

In a negatively skewed distribution the mean, median, and mode are distributed
as illustrated below in Figure 4: typically the mode lies at the peak, with the
median and then the mean pulled progressively toward the long left tail.

Figure 4. A negatively skewed distribution illustrating the relationship between
the mean, median, and mode.

Measures of Dispersion and Variation

The sample (unbiased) variance ($s^2$) is the sum of the squared deviations
around the mean divided by $N - 1$, which is approximately the mean (average) of
the squared deviations:

$s^2 = \sum (X_i - \bar{X})^2 / (N - 1)$

A sample variance includes degrees of freedom (df), defined as the total number
of observations ($N$) minus the number of constraints placed on the data, i.e.
the number of values free to vary. For example, given 5 measurements which sum
to 100, four are free to vary, but the last must be the value which, when
combined with the other 4, gives 100, thus:

df = N - 1 = 4

In many cases a variate (e.g. a brightness value) can be less than the mean of
the distribution, resulting in a negative deviation. Squaring the deviations
removes the negative signs, so the variance is never negative, but it is
expressed in squared units. A preferred way to report a measure of variability
is therefore the standard deviation ($s$), the square root of the variance,
which is a measure of variation in the units of the original measurements.

The sample standard deviation is written as:

$s = \sqrt{\sum (X_i - \bar{X})^2 / (N - 1)}$
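
The variance and standard deviation formulas can be sketched in Python as
follows. The five scores (3, 2, 5, 6, 4) are reused from the summation examples
earlier in these notes as an assumed data set, and the result is cross-checked
against the standard library's `statistics.variance`, which also divides by
N - 1.

```python
import math
import statistics

X = [3, 2, 5, 6, 4]
N = len(X)
mean = sum(X) / N

# Sum of squared deviations about the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in X)

# Sample variance uses N - 1 degrees of freedom.
variance = sum_sq_dev / (N - 1)

# Standard deviation returns to the units of the original measurements.
std_dev = math.sqrt(variance)

print(variance, std_dev, statistics.variance(X))
```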

Standard deviation is a measure of the spread of data (variability) about the
mean value in the distribution of a data set (Figure 1). The area under the
normal curve is divided into standard deviation units such that the interval
from the mean to plus one (+1.0) standard deviation accounts for a proportion of
0.3413 (34.13%, since the total area under the curve equals unity or 1).
Likewise, the interval from the mean to minus one (-1.0) standard deviation also
accounts for 0.3413 (34.13%) of the total area under the curve. It follows that,
for normally distributed data, approximately 68% of the pixel values in the
distribution are found between minus 1.0 and plus 1.0 standard deviations of the
mean. For example, TM band 5 in the “Forest” class statistics has a mean DN
value of 62.0 with a standard deviation of 7.0. This indicates that
approximately 68% (68.2%) of the 702 pixels in the training data set for band 5
are found in the DN value range 55-69 (62 - 7 and 62 + 7). Within 2.0 standard
deviations of the mean, approximately 95% (95.4%) of the pixels in band 5 of the
702 pixel data set are found, with DN values that range between 48 (62 - 14) and
76 (62 + 14). Three standard deviations from the mean account for over 99%
(99.7%) of a data distribution, within the range of 41 (62 - 21) and 83
(62 + 21).
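
The DN ranges in the band 5 example can be reproduced with a short Python
sketch. It assumes the pixel values are normally distributed and computes the
proportion within k standard deviations from the error function; the function
and variable names are illustrative.

```python
import math

mean_dn = 62.0  # TM band 5 "Forest" class mean from the example
std_dn = 7.0    # standard deviation from the example

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

ranges = []
for k in (1, 2, 3):
    low = mean_dn - k * std_dn
    high = mean_dn + k * std_dn
    ranges.append((low, high))
    print(f"+/-{k} s: DN {low:.0f} to {high:.0f}, "
          f"{100 * proportion_within(k):.1f}% of pixels")
```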

Correlation*

When two variables are related such that the values of one are dependent upon
the values of the other, this relationship is termed a function. Correlation is
a measure of the degree of linear relationship between variables. The
correlation coefficient ranges from r = +1 (direct relationship) to r = -1
(inverse relationship), with |r| = 1 as a perfect predictor and all points
plotted on a straight line. The closer |r| approaches one, the higher the
correlation. A correlation of zero indicates no linear relationship between the
variables; variables that are independent of each other are uncorrelated. In
image processing, a high correlation between two bands suggests that using only
one of the bands would account for a majority of the variability in the spectral
values throughout the entire scene. It follows that correlation can be used to
reduce the dimensionality of the data (use fewer bands in a final
classification) to a more manageable number. This has important applications
when using hyperspectral data scenes with over 200 spectral bands, since
computer processing time grows with the number of bands used in the
classification process.
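
As an illustration of band-to-band correlation, the following Python sketch
computes the Pearson correlation coefficient directly from deviations about the
means. The digital numbers for the three bands are hypothetical values, not data
from these notes, and `pearson_r` is a name chosen here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical digital numbers for the same five pixels in three bands.
band_a = [50, 55, 60, 65, 70]
band_b = [52, 57, 61, 66, 72]             # rises with band_a
band_c = [200 - 2 * dn for dn in band_a]  # exact inverse linear function

r_ab = pearson_r(band_a, band_b)
r_ac = pearson_r(band_a, band_c)
print(r_ab, r_ac)
```

With values this strongly related, one of band_a and band_b carries most of the
variability of the pair, which is the situation in which dropping a band is
considered.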

Covariance*

Covariance is the joint variation of two variables about their respective means,
and may be positive or negative. It may also be stated as how much two random
variables change together. When plotted for normally distributed data, the joint
distribution forms a bivariate normal probability surface (the two-variable
counterpart to the normal distribution). The volume under any part of the
surface may be expressed as the probability of an individual pairing of digital
numbers (brightness values) between two spectral bands occurring at that
location under the delineated surface.
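
A minimal Python sketch of the covariance between two bands follows. The digital
numbers are hypothetical, and the N - 1 divisor is an assumption made here for
consistency with the sample variance defined earlier in these notes. The last
line shows how the correlation coefficient is obtained by scaling the
covariance.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: mean product of paired deviations, using N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical digital numbers for the same five pixels in two bands.
band_a = [50, 55, 60, 65, 70]
band_b = [52, 57, 61, 66, 72]

cov_ab = sample_covariance(band_a, band_b)
var_a = sample_covariance(band_a, band_a)  # covariance of a band with itself
var_b = sample_covariance(band_b, band_b)

# Correlation is the covariance scaled by both standard deviations.
r_ab = cov_ab / math.sqrt(var_a * var_b)
print(cov_ab, var_a, var_b, r_ab)
```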

* A more comprehensive treatment can be requested by contacting Dr. Danny M.
Vaughn at [email protected].
