
Unit 1
Statistical data and Concepts

A. The Statistical Method

Statistical analysis is the process of collecting and analyzing data in order to discern patterns and trends. It is a method for removing bias from the evaluation of data by employing numerical analysis. This technique is useful for collecting the interpretations of research, developing statistical models, and planning surveys and studies.

Conclusions drawn using statistical analysis facilitate decision-making and help businesses make future predictions on the basis of past trends. Statistical analysis can be defined as a science of collecting and analyzing data to identify trends and patterns and presenting them. It involves working with numbers and is used by businesses and other institutions to make use of data and derive meaningful information.

• PPDAC (Problem, Plan, Data, Analysis, and Conclusion): PPDAC stands for five stages of project development: Problem, Plan, Data, Analysis, and Conclusion/Communication. A summary of a revised PPDAC approach is outlined in the stages below.

1. Problem

Defining the problem is the first stage in this model. This stage includes determining what issues need to be addressed, what data will likely be needed, and what tools and procedures will be used. Defining the problem is a process that might require multiple drafts. However, it is important to have the problem clearly defined before beginning the following stages. Defining the problem is not always a single event in this process. The initial specifications of a problem may be altered after obtaining preliminary results, technical considerations, and unforeseen events. Because there is usually more than one party involved and interested in the outcome, it can be beneficial to maintain the initial outline of the problem. However, when a change is necessary each party should be in agreement.

2. Plan

Once the problem has been defined, the planning stage helps create an approach that is used to address the problem. This is completed by creating a detailed project plan, which may include the estimated cost of data, equipment required, an outline of tasks that should be completed, timelines for completing the work, and a list of software tools that may be used. This often involves determining the major tasks, breaking them down into subtasks, and planning what needs to be done and the resources required to accomplish those tasks.

3. Data

As one executes the analysis plan, data must be collected to perform the desired analysis. Because not all datasets will be of the same quality, considerations must be made in order to account for issues that may exist, such as missing data, differences in availability, cost, resolution, formatting, or errors. GIS packages now frequently have tool sets in order to overcome these issues. For example, breaklines can handle lack of continuity in data and coding schemes can be used in order to overcome missing data.

4. Analysis

Analysis can be seen as a multi-part exercise, moving through the following steps: review and manipulation of collected data to make it useable, study of the data to identify patterns relevant to the problem at hand, and use of data to make conclusions or to alter the plan made in the previous stage.

5. Conclusion

The conclusion stage, simply put, is where conclusions are drawn from data analysis and presented to others. Implementation of the conclusions is not included in this process. Conclusions can lead to more problems to be answered, additional analysis to finish answering the problem at hand, a complete overhaul of the plan to answer the problem at hand, or a plan to apply the conclusions to the problem.

As stated by Mackay and Oldford: “The purpose of the Conclusion stage is to report the results of the study in the language of the Problem. Concise numerical summaries and presentation graphics [tabulations, maps, geovisualizations] should be used to clarify the discussion. Statistical jargon should be avoided. As well, the Conclusion provides an opportunity to discuss the strengths and weaknesses of the Plan, Data and Analysis especially in regards to possible errors that may have arisen”.

B. Misuse, Misinterpretation and bias

In some instances the misuse has been simply lack of awareness of the
kinds of problems that may be encountered, in others carelessness or lack
of caution and review, whilst on occasion this misuse is deliberate. One
reason for this has been the growth of so-called evidence-based policy
making — using research results to guide and justify political, economic
and social decision-making. Whilst carefully designed, peer-reviewed and
repeatable research does provide a strong foundation for decision-making,
weak research or selective presentation of results can have profoundly
damaging consequences. In this section we provide guidance on the kinds
of problems that may be encountered, and comment on how some of
these can be avoided or minimized. The main categories of misuse can be
summarized as:
• inadequate or unrepresentative data

• misleading visualization of results

• inadequate reasoning on the basis of results

• deliberate falsification of data

Inadequate or unrepresentative data

This is probably the most common reason for 'statistics' and statistical analysis falling short of acceptable standards. Problems typically relate to inadequacies in sampling, i.e. in the initial design of the data collection, selection or extraction process. This results in the sample, from which inferences about the population are made, being biased or simply inadequate. The following list includes some of the main situations which lead to such problems:

• datasets and sample sizes — there are many situations where the dataset or sample size analyzed is simply too small to address the questions being posed, or is not large enough for use with the proposed statistical technique, or is used in a misleading fashion. Smaller sample sizes are also more prone to bias from missing data and non-responses in surveys and similar research exercises.

• clustered sampling — this issue relates to the collection of data in a manner that is known in advance to be biased, but is not subsequently adjusted for this bias. Examples include the deliberate decision to over-sample minority social groups because of expected lower response rates or due to a need to focus on some characteristic of these groups which is of particular interest.
• self-selection and pre-screening — this is a widespread group of problems in sampling and the subsequent reporting of events. Surveys that invite respondents to participate, rather than randomly selecting individuals and ensuring that the resulting survey sample is truly representative, are especially common. For example, surveys that rely on opting in, such as those placed in magazines or via the Internet, provide a set of data from those who read the publication or view the Internet site, which is a first category of selection, and from this set the individuals who choose to respond are then self-selecting. This group may represent those with a particular viewpoint, those with strong views (so greater polarization of responses) or simply those who have the time and inclination to respond.

• exclusions — the process of research design and/or sampling may inadvertently or deliberately exclude certain groups or datasets. An example is the use of telephone interviewing, which effectively pre-selects respondents by telephone ownership. If the proportion of exclusions is very small (e.g. in this example, the current proportion of people with telephones in a given country may be very high) this may not be a significant issue. A different category of exclusion is prevalent where some data is easier to collect than others.

• pre-conceptions — researchers in scientific and social research frequently have a particular research focus, experience and possibly current norms or paradigms of their discipline or society at large. This may result in inadvertent use of techniques or survey questions that influence the outcome of the research. A common problem is that the wording of questions may lead the respondent to respond in a particular manner. Pre-conceptions may easily also lead to weak or incorrect reasoning from the data to conclusions.

• data trawling — with large multi-variate datasets there is a high probability that statistically significant findings can be discovered somewhere in the data — brute-force processing of datasets looking for significant results that relate to a particular area of research interest, with or without explicit pre-conceptions, will often succeed but may well be entirely spurious. Techniques such as data-mining, cluster-hunting and factor analysis may all be 'misused' in this way.

• temporal or spatial effects — the temporal or spatial sequence or arrangement of samples may be of critical importance, for many reasons.

• over- and under-scoring — the responses individuals provide to questions or tasks often show a distinct bias. When asked to state how confident the respondent is in the answer they have given, almost always the confidence level is over-stated, typically by 10-20% based on the relative frequency of correct responses.

• deliberate bias — the judicious selection, combination, arrangement and/or reporting of data (which may have been extremely carefully collected) is an important and serious area of misuse.

Inadequate reasoning

Drawing conclusions from research findings is always a complex process, often subject to debate. The confidence that can be placed on conclusions will depend, in part, on the nature and quality of the data collected and analyzed, and the quality of the reasoning applied to the interpretation of the findings. Certain types of reasoning may appear entirely plausible but on closer examination can be seen as fundamentally flawed. The list below provides a number of commonly encountered problems of this type.

• Correlation versus causation — it is extremely easy to assume that because there is a close (perhaps highly significant) relationship between two variables, one causes the other. This may occur in many ways and can be quite subtle (obvious examples are much easier to spot). Take the following example: "Girls at single-sex schools do better than girls in mixed schools, therefore single-sex schools are better for girls". Based on test results in the UK and in a number of other countries the first part of this statement is well documented, but is the second part, which is a conclusion implying causality, actually correct? Without closer examination it is difficult to know.

• Misunderstanding of the nature of randomness and chance — there are a number of ways in which the natural randomness of events can be misunderstood, leading to incorrect judgments or conclusions. A simple example is misjudging the effect of sample size. Suppose that a large hospital has 40 births per day on average, with 50% of these being boys. A smaller hospital nearby has 10 births/day, also 50% being boys on average. On some days the proportion of boys will be higher, on others lower. Which hospital would you expect to have the most days in a year with at least 60% of births being boys? The answer is the smaller hospital, because its records will exhibit inherently more variability — a change from 5 boys to 6 is sufficient to raise the proportion to 60%, whereas the larger hospital would need to have at least 4 more boys than girls born to result in a 60%+ result, which is less likely to occur (a short simulation sketch of this example follows this list).

• Ecological fallacy — this fallacy involves ascribing characteristics to members of a group when only the overall group characteristics are known (special statistical techniques have been devised to address certain problems of this type). A simple example is the suggestion that most individuals in a given census area earn $50,000 p.a. based on the census return figure for the area in question, whereas there may be no individuals at all in this area matching this description.

• Atomistic fallacy — this fallacy involves ascribing characteristics to members of a group based on a potentially unrepresentative sample of members. As such it can be regarded as a central issue in statistical research, often related to sampling that is far too small or unrepresentative to enable such conclusions to be reached.

• Misinterpretation of visualizations — there is endless scope for misinterpretation and there are many books on what makes for good and bad visualizations. The emphasis should always be on clarity of communication, often achieved through simplicity in design and labeling. However, the apparently simple and clear chart can easily provide scope for confused reporting.
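The effect of sample size on variability in the hospital example above can be checked with a short simulation. This is a minimal sketch, not part of the original text: it assumes 40 and 10 births per day respectively, a 50% chance of a boy for each birth, and simply counts the days in one simulated year on which at least 60% of births were boys.

    import numpy as np

    rng = np.random.default_rng(0)
    days = 365

    def days_with_60pct_boys(births_per_day):
        # Simulate one year of daily boy counts and count the days on which
        # boys made up at least 60% of the births.
        boys = rng.binomial(births_per_day, 0.5, size=days)
        return int(np.sum(boys / births_per_day >= 0.6))

    print("Large hospital (40 births/day):", days_with_60pct_boys(40))
    print("Small hospital (10 births/day):", days_with_60pct_boys(10))
    # The smaller hospital records noticeably more such days, reflecting the
    # greater variability of small samples.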
Deliberate falsification of data

There are occasions when data is deliberately falsified. This may be as a result of a rogue individual scientist or group, a commercial enterprise, or even government agencies.

C. Sampling and sample size

Sampling is central to the discipline of statistics. Typically samples are made in order to obtain a picture of the population as a whole without the need to make observations on every member of that population. This saves time and cost, and may be the only feasible approach, perhaps because the population is infinite or very large or is dynamic, so sampling provides a snapshot at a particular moment in time. Ideally we wish to make a sample that provides an extremely good representation of the population under study, whilst at the same time involving as few observations as possible. These two objectives are clearly related, since a perfect representation is only possible if the entire population is measured or if the population is completely uniform. This latter point highlights the fact that larger and more carefully structured samples may be required to obtain an estimate of a given quality if the population is highly variable. The difference between the measured value for an attribute in a sample and the 'true value' in the population is termed sampling error.

Sample size

There are many factors that affect the choice of sample size. In public opinion surveys it is very common to hear that the sample taken was of around 1000-1500 people. This figure is obtained from a relatively simplistic calculation, based on achieving an approximately 95% confidence level in the results with estimation of a proportion, p, within a range of roughly +/-3% (see also, our discussion on confidence intervals). The figure of 1000-1500 arises from these two requirements — using a Binomial distribution the standard error (SE) of the proportion, p, is √(pq/n). Note that the term √(pq) is maximized for any given n when p=q=0.5, so this assumption provides an upper bound of 1/2 on √(pq) and thereby on the range of expected variation in our estimate. Now from the Normal distribution, which is the limit of the Binomial for large n (and a reasonably rapid approximation if p and q are similar in size), we know that 95% of the distribution is included within roughly +/- 2 standard deviations. Thus the sample size needed to ensure an error in an estimate of x is obtained by setting the bound for 2 SEs, i.e. 1/√n, equal to x. This gives the result n = 1/x², so for x=5% (x=0.05) we have n=400, and for 3% we have just over 1100. For a 1% range at 95%+ confidence a sample size of 10,000 would be required, so the choice of 1000-1500 is a compromise between the quality of estimation and the cost and time involved in undertaking the survey.
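The 1/x² rule of thumb above can be turned into a short calculation. The sketch below is illustrative only; it uses the worst-case assumption p = q = 0.5 and the '+/- 2 standard errors for 95% confidence' approximation used in the text.

    import math

    def sample_size(margin, p=0.5):
        # Smallest n for which 2*SE = 2*sqrt(p*q/n) is no larger than the
        # desired margin of error (95% confidence via the +/- 2 SE rule).
        q = 1.0 - p
        return math.ceil(4.0 * p * q / margin ** 2)

    for m in (0.05, 0.03, 0.01):
        print(f"margin +/-{m:.0%}: n = {sample_size(m)}")
    # margin +/-5%: n = 400
    # margin +/-3%: n = 1112  (i.e. just over 1100)
    # margin +/-1%: n = 10000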
D. Data preparation and cleaning

Careful data preparation is an essential part of statistical analysis. This step assumes the data have been collected, coded and recorded, and now reside in some form of data store. This will often be in the form of a simple table (sometimes referred to as a data matrix) or SQL-compatible database. Depending on how the data were coded and stored, variables may or may not have suitable descriptions, coding or data types assigned, and if not, this is one of the first tasks to carry out. At this stage it will become apparent if any data items are incompatible with the data type assignments, for example text coded in numeric fields, or missing data that requires entry of a suitable 'missing data' code. Data provided from third parties, whether governmental or commercial, should include a metadata document that describes all aspects of the data in a standardized and thorough manner.

A related task is the treatment of duplicate records. Depending on the nature and validity of the duplicates, decisions have to be made on how they are to be treated. In some instances data will need to be deduplicated, in others data will be retained unchanged, whilst in some instances additional data will be added to records to ensure the duplicates are separately identifiable. When analyzing duplicates using Exploratory Data Analysis (EDA) tools, duplicates may be hidden. Zero and null (missing data) occurrences form a special group of duplicates that apply to one or more variables being studied. EDA methods will also tend to highlight exceptional data values, anomalies and outliers (specific measurements or entire cases) that require separate examination and/or removal from subsequent analysis.

Once a dataset has undergone preliminary inspection and cleaning, further amendments may be made in order to support subsequent analyses and the use of specific statistical models. It may be desirable for such amendments to result in the creation of a new data table or data layer, thereby ensuring that the source data remains untouched and available for re-inspection and analysis.
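The kinds of checks described in this section can be sketched with a small, entirely hypothetical table. The column names, codes and values below are invented for illustration and are not from the original text; pandas is assumed to be available.

    import pandas as pd

    # Hypothetical survey extract; columns, codes and values are illustrative only.
    df = pd.DataFrame({
        "respondent_id": ["001", "002", "002", "003"],
        "age":           ["34", "51", "51", "unknown"],   # text coded in a numeric field
        "income":        [32000, -999, -999, 41000],      # -999 used as a 'missing data' code
    })

    # 1. Fix data type assignments: invalid numeric entries become missing values.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # 2. Recode the explicit 'missing data' code to a true missing value.
    df["income"] = df["income"].replace(-999, pd.NA)

    # 3. Inspect duplicates before deciding whether to de-duplicate, retain or annotate them.
    dupes = df[df.duplicated(subset="respondent_id", keep=False)]
    print(len(dupes), "potentially duplicated records")

    # 4. Keep the source data untouched: write the amended copy to a new table.
    df.to_csv("survey_returns_cleaned.csv", index=False)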

E. Missing data and data errors

Missing data is a wide-ranging term for the absence of expected data from a recorded sample, which may occur for many reasons and in many ways. It can be a small, easily managed problem, or a larger problem that raises questions about the usability of the dataset. Missing data can involve all relevant data for a record or set of records, or it could be missing values from a set of measurements relating to individual records (incomplete records). Each different situation requires separate consideration, both as to the seriousness of the problem, and to the means by which such difficulties are to be addressed. In the following paragraphs we discuss some examples of the key issues and approaches to resolving them. We then look in more detail at some of the techniques and tools provided within statistical software packages that are designed to assist in these situations.

Another common reason for missing data is incorrect data recording, coding or subsequent processing. The precise reason for such errors and the scale of the problem are important to determine. Incorrect data coding by researchers and data preparation staff can often be checked through systematic verification, for example by taking a sample of each block of survey returns and having these independently recoded and compared with the original coding. Incorrect interpretation of survey questions, or incorrect recording of data by surveyed individuals, needs to be identified through inspection and validation techniques, thereby identifying the scale and nature of any problems, and implementing changes or corrections to the data gathering and/or subsequent processing of the data.

Handling missing values — techniques and tools

This section provides a brief summary of the main approaches for handling missing values. In most instances these are procedures offered within software packages, but it remains the responsibility of the researcher to select the method used and to document and justify such selection when reporting the results of analysis. In many cases estimating missing values will apply to real-valued measurements, but some procedures may apply to binary or categorical data.

Ignoring entire records:
This is the most commonly available approach for handling missing data. As noted above, this is only acceptable if the number of records is relatively large, the number of records with missing data is relatively small (<5%), and the missing records can be shown to occur completely at random (MCAR) or are missing at random (MAR) within well-defined subsets of the data. In general this approach cannot be used in small sample balanced trials nor for time series.

Setting missing values to a fixed value:
Many packages allow missing values to be replaced with a fixed value (e.g. 0) or a user-provided value for each instance. The problems of adopting these approaches are obvious.

Single estimation procedures:
A very common approach to missing values is to use some form of estimation based on the characteristics of the data that has been successfully collected. For example, the SPSS Transform operation, Missing Values option, offers the following options for estimating such values: (i) use of the mean or median value of nearby points (by which it means neighboring records, with the number of such records used selectable by the researcher); (ii) use of the overall series mean, i.e. the mean based on all records; (iii) linear interpolation, which applies to data series and uses the two adjacent nonmissing values to fill in the gap or gaps; (iv) linear regression, which is similar to linear interpolation but uses a larger number of neighboring points and calculates a best fit line through these as its estimator for intermediate missing values. Other software packages may provide additional options — for example, a variety of model-based interpolation options are available in the SAS/ETS (Economic and Time Series) software. Similar procedures are provided in some other packages, but often it remains the researcher's responsibility to provide or compute estimates for missing values as a part of the data cleaning and preparation stage.

Multiple imputation (MI):
Multiple imputation (MI) methods vary depending on the type of data that is missing and the software tools used to estimate (impute) the missing values. Essentially there are 3 stages to MI:

o the missing data are filled in m times to create m complete datasets (m is typically 5)

o the m complete datasets are analyzed separately, in the usual manner

o the results from the multiple analyses are combined in order to provide statistical inferences regarding the data

Depending on the type and pattern of missing data, SAS/STAT and SPSS will generate estimates for the missing values using some form of regression analysis of the valid data (single, multiple or logistic regression), or MCMC methods under an assumption of multivariate Normality, for more general missing values. The latter approach is of the general form:

(a) initialize estimates for the missing values for all records and variables by drawing random values from a Normal distribution with mean and variance that match the nonmissing data (or use a multinomial distribution for categorical data, with proportions in each class defined by the proportions in the non-missing data);

(b) using all the data, except for missing data on the jth variable, use a univariate method (e.g. regression) to impute the missing values in that variable;

(c) iterate across all variables and track the convergence of both the mean and variance of the imputed missing values.
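The iterative scheme (a)-(c) outlined above can be illustrated with a much-simplified sketch. This is not the SAS/STAT or SPSS implementation; it is a toy version that uses ordinary least squares for step (b), with synthetic data generated purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data matrix (rows = records, columns = variables) with ~10% of values missing.
    X = rng.normal(size=(200, 3))
    X[:, 1] += 0.8 * X[:, 0]                # give variable 1 some structure worth recovering
    missing = rng.random(X.shape) < 0.10
    X[missing] = np.nan

    Xf = X.copy()
    # (a) initialize each missing value by drawing from a Normal matched to the observed data
    for j in range(Xf.shape[1]):
        gaps = np.isnan(Xf[:, j])
        obs = Xf[~gaps, j]
        Xf[gaps, j] = rng.normal(obs.mean(), obs.std(), gaps.sum())

    # (b)/(c) iterate: re-impute each variable by regressing it on the other variables
    for _ in range(10):
        for j in range(Xf.shape[1]):
            gaps = np.isnan(X[:, j])        # positions originally missing for variable j
            if not gaps.any():
                continue
            A = np.column_stack([np.ones(len(Xf)), np.delete(Xf, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A[~gaps], Xf[~gaps, j], rcond=None)
            Xf[gaps, j] = A[gaps] @ beta    # regression-based imputed values

    print("Any gaps left:", bool(np.isnan(Xf).any()))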
F. Statistical error

When approaching any form of data analysis many types and sources of error may be considered:

• The data collection procedure may contain errors; there may be gross data capture or encoding errors; there may be errors in the approach adopted to selecting data, designs or in analysis.

• However, none of these relates to the rather special use of the term statistical error. This term refers specifically to non-systematic or random errors that are observed during measurement.

• There may be many reasons why such random variations occur, but in general the assumption is made that these reasons are unknowable and therefore cannot be readily removed.

• Systematic variations and gross errors can, at least in theory, be separated out and either accounted for or removed, leaving simply statistical error.
G. Statistical Modelling

Statistical (or stochastic) modeling is the process of finding a suitable mathematical model that can be used to describe or 'fit' an observed dataset, where the observations are subject to uncertainty or randomness, and can be regarded as having been drawn from one or more probability distributions. For this purpose a mathematical model is treated as a set of one or more equations, in one or more independent variables and one (or occasionally more) dependent variables. The dependent variables are sometimes referred to as response variables or endogenous variables. The independent variables are likewise sometimes described as explanatory, predictor or exogenous variables.

Statistical models may involve discrete-valued (or categorical) data and/or continuous-valued data. The general class of statistical models for which all variables are continuous and the dependent variable is a vector are known as regression models. With discrete models the population being studied is meaningfully divided into distinct groups or categories (which may have one or more levels). In many cases discrete models are analyzed using analysis of variance methods (ANOVA).

Linear models are linear in their parameters, not in the independent variables. Thus a model of the form:

y = β0 + β1x1 + β2x2 + e

in which β represents a vector of parameters to be determined, is a simple example of a linear statistical model (a regression model in this case). An expression such as

y = β0 + β1x1 + β2x1² + e

is also linear, since it remains linear in the coefficients, β, even if it is not linear in the predictor variables, x.

A typical example of a linear discrete model would be of the form:

yij = μ + Ti + eij

which states that the observed or measured value y (observation j in group i) is a linear combination of an overall mean value, μ, plus a treatment or group effect, T, plus some unexplained random variation, or error, e. The most general form of the linear model (the general linear model or GLM) can be represented by the matrix equation:

Y = XB + E

where Y holds the observations, X the values of the predictor variables, B the parameters to be estimated and E the errors.
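As a worked illustration of the matrix form Y = XB + E above, the sketch below fits a linear model by ordinary least squares to synthetic data (the data and coefficient values are invented for illustration). Note the x1² column, which keeps the model linear in the coefficients even though it is not linear in the predictor.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic example of y = X.beta + e with a constant term and three predictor columns.
    n = 100
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2, x1 ** 2])   # linear in beta despite the x1^2 term
    beta_true = np.array([1.0, 2.0, -0.5, 0.3])
    y = X @ beta_true + rng.normal(scale=0.5, size=n)    # e: random (statistical) error

    # Ordinary least squares estimate of the parameter vector beta.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.round(beta_hat, 2))   # close to [1.0, 2.0, -0.5, 0.3]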

H. Computational Statistics

Computational statistics is not the same as statistical computing or statistical software. It refers to the use of computational power (processing power, storage, digital displays) as a means of analyzing complex and large datasets using largely statistical procedures. It embraces a wide range of techniques and tools and draws heavily on developments in fields such as computer science, knowledge-based engineering and visualization, in addition to classical statistics. Although a relatively new area of statistics, it may well become the dominant approach for addressing many categories of problem formerly addressed by traditional analysis.

Examples of approaches that may be described as methods of computational statistics include:

• data mining and advanced visualization techniques (exploratory data analysis techniques)

• a wide range of simulation techniques that utilize randomization, notably Monte Carlo simulation and random permutation procedures

• function and kernel density estimation

• local rather than global analysis techniques (e.g. local regression techniques such as Loess and GWR, cluster hunting etc.)

• resampling and cross-validation of data modeling (e.g. using procedures such as jackknifing and bootstrapping; see the short bootstrap sketch after this list)
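As a small illustration of the resampling idea in the final bullet above, the following sketch bootstraps an interval for a mean; the data are simulated purely for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    sample = rng.exponential(scale=2.0, size=50)   # a hypothetical observed sample

    # Bootstrap: resample the data with replacement many times and study the
    # variability of the statistic of interest (here, the sample mean).
    boot_means = np.array([
        rng.choice(sample, size=sample.size, replace=True).mean()
        for _ in range(5000)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"sample mean = {sample.mean():.2f}, 95% bootstrap interval = ({lo:.2f}, {hi:.2f})")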
I. Inference

Statistical inference is the process whereby we seek to draw conclusions from samples that apply to a finite or infinite population from which these samples are assumed to have been drawn (see also, the closely related topic of estimation). As such, inferential statistics is essentially analytical, requiring a representative (random) sample from a larger population, together with some model of how data obtained from such samples relate to the population as a whole. This approach is distinct from descriptive statistics, which seeks to measure and report (numerically or graphically) key features of the data. Statistical inference typically leads to some form of proposition regarding the population, for example that the population mean has an estimated value of x with a confidence or belief interval of [a,b].

In the frequentist view of statistics it is imagined that the particular sample selected is one of many possible such samples that could be obtained by repeated random sampling (or simulated repeated sampling). The particular sample in question can thus be identified with respect to its likelihood of occurring, and from this inferences can be made regarding the population as a whole (for example, obtaining a confidence interval for the mean).

J. Bias

The term bias, in a statistical context, has a variety of meanings. These include: selection bias, recall bias, estimation bias, systematic bias and observer bias. Each of the terms relates to a specific technical aspect of the overall concept of bias, but excludes the broader questions of bias in the conceptualization of a problem or in the general manner in which results are collected, processed, analyzed and interpreted.

Selection bias: The most common usage of the term bias relates to selection bias. This refers to the selection of individuals or entities in a manner that is not representative of the population of interest. In general selection bias can be minimized by careful study design, but there are many pitfalls in this process and it is easy for bias to occur without it being at all obvious. One example of selection bias is self-selection — sample surveys are typically only completed by willing participants, and by definition there is selection bias occurring. Sampling 'easy to collect' data is another example of possible bias.

Recall bias: Recalled results are not as reliable as measured or monitored results. For example, when completing questionnaires respondents often over-estimate or over-elaborate in some areas and under-estimate or omit important topics in others. In medical research, cohort studies are generally not as susceptible to bias as case-control studies, since the latter may be subject to selection bias or recall bias (being asked to recall information after the event).

Estimation bias: Bias is also used to refer to the difference between the true or population value of a parameter being estimated from a sample and the sample value, i.e. estimation bias. For example, the variance of a truly representative sample of size n from a much larger population will be less than or equal to the population variance, since there will generally be a wider range of values in the population than in the sample. To obtain an unbiased estimate the sample variance is adjusted by a factor n/(n-1), which tends to 1 as the sample size is increased.
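The n/(n-1) adjustment described under estimation bias can be checked numerically. A minimal sketch, with the population and samples simulated for illustration:

    import numpy as np

    rng = np.random.default_rng(4)
    population = rng.normal(loc=10.0, scale=3.0, size=100_000)
    sigma2 = population.var()                       # population variance

    n = 5
    samples = rng.choice(population, size=(20_000, n), replace=True)
    raw = samples.var(axis=1, ddof=0).mean()        # divide by n: biased low
    adjusted = samples.var(axis=1, ddof=1).mean()   # equivalent to multiplying by n/(n-1)

    print(f"population variance       : {sigma2:.2f}")
    print(f"mean of raw estimates     : {raw:.2f}  (systematically too small)")
    print(f"mean of adjusted estimates: {adjusted:.2f}  (close to the population value)")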
Systematic bias: In more common usage bias is applied to deliberate or unintentional systematic distortion of information. This usage does have practical importance within the field of statistical analysis, for example in studies that involve measurements using specialized equipment. It is not unusual for equipment to be either incorrectly calibrated (e.g. the zero point is not precisely located at zero) or to experience drift over time or in response to environmental factors (e.g. temperature fluctuations). Careful study design and regular device calibration are approaches to addressing such problems, although many effects may not be possible to control (e.g. solar activity impacting data recorded via satellite remote sensing devices).

Observer bias: Similar issues arise with so-called observer bias. In this case those involved in designing, administering and/or analyzing an experiment or other data collection exercise influence the results, typically unintentionally. For example, the presence of an observer might alter the results in an animal or human behavior study (see, for example, the so-called Hawthorne effect), or in recording very small changes in temperature in an enclosed space. Such effects can generally be minimized by careful experimental design, by use of multiple independent observers, and in some experiments, by ensuring that the observers are unaware of which treatments have been applied to which cases.

K. Confounding

The term confounding in statistics usually refers to variables that have been omitted from an analysis but which have an important association (correlation) with both the independent and dependent variable. Thus confounding is closely connected to the notion of causality and cause-effect relationships. For example, in many cases the evidence in favor of breastfeeding is either poorly established or affected by confounding factors, notably the kind of mother who chooses to breastfeed.

Causal factors — confounding variables may be surrogates for underlying causal factors. Although confounding factors are very important, identifying relevant factors and then controlling for their effects can be difficult. A common approach is to stratify the analysis by the confounding factor of interest, e.g. separate a problem into a series of problems, each corresponding to a separate age group, and/or to analyze the data using time as a variable.

The term confounding is also used in a similar context in the design of experiments. Here the concern is with the design of an experiment that seeks to identify the effect of a treatment, for example an additive to a fuel to increase its efficiency, whilst removing the effect of other factors on the analysis.

L. Hypothesis testing

A hypothesis is the general term for a proposed explanation for some observed phenomena, which may or may not be the subject of further scrutiny in order to confirm or reject its veracity. Scientific studies investigate hypotheses by conducting experiments and analyzing data. Within the context of scientific analysis, statistics is utilized to assist this evaluation process by helping to identify how likely particular observations are given that the hypothesis in question is true, i.e. do they support the hypothesis or suggest it is unlikely to be true? More specifically, a statistical hypothesis is "an assertion regarding the distribution(s) of one or several random variables. It may concern the parameters of the distribution, or the probability distribution of a population under study." Two simple examples are:

H0: the distribution of trees in a study region is random

H0: the population mean value (lifetime) of a sample of electronic devices is 500 hours
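As a sketch of how the second hypothesis above might be examined in practice, the following uses a one-sample t-test; the lifetimes are simulated purely for illustration, and scipy is assumed to be available.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    # Hypothetical lifetimes (hours) for a sample of 30 electronic devices.
    lifetimes = rng.normal(loc=510.0, scale=40.0, size=30)

    # Test H0: the population mean lifetime is 500 hours.
    result = stats.ttest_1samp(lifetimes, popmean=500.0)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
    # A small p-value is evidence against H0; a large one indicates the data are
    # consistent with a mean of 500 hours.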

M. Types of error

In the context of statistical hypothesis testing the expression type of error refers specifically to two main types of error that can occur: false negatives and false positives. Both types of error can often be reduced by increasing the sample size, but this typically involves additional cost and/or time and may not be possible for other practical reasons. A false negative means that we reject a hypothesis that we are testing, such as the mean of our data equals 2, when in fact it is true (also known as a Type I error — see further, below). A false positive result means that we accept our hypothesis (or fail to reject it) when it is in fact false (also known as a Type II error).

More formally, in statistical analysis we say:

Type I errors occur when a false negative result is obtained in terms of the null hypothesis by obtaining a false positive measurement, i.e. we reject H0 when in fact it is true. This type of error is often denoted with the Greek letter α.

Type II errors occur when a false positive result is obtained in terms of the null hypothesis by obtaining a false negative measurement, i.e. we accept H0 when in fact it is false. This type of error is often denoted with the Greek letter β.
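Both error rates can be estimated by simulation. The sketch below repeatedly tests H0: mean = 0 at α = 0.05, first with H0 true and then with a true mean of 0.5; all of the values used are illustrative assumptions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    alpha, n, trials = 0.05, 25, 2000

    def rejection_rate(true_mean):
        # Proportion of simulated samples for which H0: mean = 0 is rejected at level alpha.
        rejections = 0
        for _ in range(trials):
            x = rng.normal(loc=true_mean, scale=1.0, size=n)
            if stats.ttest_1samp(x, popmean=0.0).pvalue < alpha:
                rejections += 1
        return rejections / trials

    print("Type I error rate  (H0 true) :", rejection_rate(0.0))      # close to alpha = 0.05
    print("Type II error rate (H0 false):", 1 - rejection_rate(0.5))  # beta for a shift of 0.5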
N. Statistical significance

The term significance in a statistical context does not equate to its meaning in general usage. The latter implies some idea of overall importance, whereas statistical significance is a purely probabilistic statement regarding the chance of observing a particular result. If it is estimated that the chance is very low — perhaps 1 in 100 — we might state that the result is statistically significant. This ignores any measure of true importance, which might be measurable in some way (i.e. using some measure of the size of the observed effect) or may not be readily measurable — for example, publishing results may have a large social or political effect which is not readily understood or measurable at the time. Statistical measures of the size of an effect include correlation coefficients, absolute differences, odds ratios, relative risks, and measures of association for count data.

Significance does not provide any indication of causation, despite the fact that a highly significant relationship may exist between two variables.

We are generally more interested in large differences — for example a new anticancer drug might be shown to provide a statistically significant improvement in life expectancy in 80% of patients, but if this improvement is from 20 weeks to 22 weeks we might regard it as of very limited value. If the significance level, α, is small (e.g. 0.001) it requires a much larger difference between the observed and expected results than a larger α value would. However, because it allows larger differences to be regarded as not significant, it increases the chances of not identifying a real difference that exists — this is known as a Type II error (failure to reject the Null Hypothesis (H0) when it is in fact false — see further Type I and Type II errors). Note that the p-value obtained when conducting a test is not a measure of whether H0 is true; it is simply a quantitative measure of the strength of evidence against H0. How to interpret specific p-values remains unclear, and many statisticians regard the use of p-values, and even the term significance testing, as unsatisfactory.

O. Confidence Interval

When a sample has been taken, and some parameter or measure such as the mean value has been calculated, this provides an estimate of the population parameter. However, it remains an estimate, and different samples might yield a range of different values for the parameter. It is helpful to have some idea as to the size of this range, since the expectation is that the true or population value will lie within this range, but without taking a very large number of samples the possible range remains unknown. However, if the distribution of the observations is known it is possible to provide an estimate of upper and lower limits which will include the population parameter with a given level of probability. Such bounds are known as confidence limits, and the range or interval as the confidence interval for the parameter in question.

In many instances such limits are sought for an estimate of the mean value or for a proportion. In general confidence intervals for a mean value are obtained based on an assumption that the observations are Normally distributed, even if the data do not fully conform to this assumption. For other parameters, such as the variance, confidence limits are affected more by the underlying distribution.

Mean values

If a sample of size n is drawn from a population with known mean, μ, and standard deviation, σ, one can construct confidence intervals for the sample mean based on this information. If the population distribution is Normal, the sample mean will lie in the range [μ-kσ/√n, μ+kσ/√n] with a probability that can be computed from the Normal distribution depending on the size of k. But in general the population mean is not known, and the standard deviation is also not known in advance.

Proportions

In the case of simple proportions, a similar calculation is carried out. If x is the number of events observed in a sample of size n, the proportion of these events is thus p = x/n and we define q = (1-p). Then an estimate of the true range of values that the population proportion, P, is expected to have (for a reasonably large sample, with p not too close to 0 or 1) is:

p +/- k√(pq/n)

where k is obtained from the Normal distribution for the chosen confidence level (k is approximately 2 for 95% confidence). This interval estimate (known as a Wald interval) can be significantly improved upon for some 'rogue' values of n and p, even when these are large (see further, the section on tests for a proportion). If the population is of finite size, N, the second term in these expressions is adjusted by a factor (N-n)/(N-1), although if n is small in comparison to N (<5%) the adjustment has little effect.

Odds ratios

In the earlier topic on probability we introduced the terms odds and odds ratio (OR). With reference to a simple 2x2 table of the form shown below, this enabled us to define four odds proportions: p1=A/(A+B), p2=C/(C+D), q1=B/(A+B), and q2=D/(C+D), and thus the odds ratio (OR) can be written in various ways as:

OR = (p1/q1)/(p2/q2) = (A/B)/(C/D) = AD/BC

                 Exposed    Not exposed
  Infected          A            C
  Not infected      B            D

OR and LOR (log odds ratio) values are point estimates and provide no information on the range of values this estimate might take. Confidence intervals for the odds ratio can be estimated using the following expressions, which utilize the estimated standard error (SE) of the log odds ratio (where exp() is the exponential function and A, B, C and D are not small):

SE(ln OR) = √(1/A + 1/B + 1/C + 1/D)

CI(OR) = exp( ln(OR) +/- k × SE(ln OR) ), with k approximately 2 for a 95% interval.
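Using the expressions above, an odds ratio and its approximate confidence interval can be computed directly; the counts below are hypothetical and k = 1.96 is used for a 95% interval.

    import math

    # Hypothetical 2x2 table counts (Exposed/Not exposed vs Infected/Not infected).
    A, B, C, D = 30, 70, 12, 88

    odds_ratio = (A * D) / (B * C)
    se_log_or = math.sqrt(1 / A + 1 / B + 1 / C + 1 / D)   # SE of the log odds ratio

    lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    print(f"OR = {odds_ratio:.2f}, approximate 95% CI = ({lo:.2f}, {hi:.2f})")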


P. Power and robustness

1) Power

Power is a term used in a technical sense in statistics to refer to the probability that a test will reject a null hypothesis when it should do so, i.e. when the hypothesis is false. This equates to requiring that the probability of a Type II error is small, and with the probability of a Type II error being denoted by β, the power of a test is often denoted by 1-β. The desired power of a test (often set at 0.8 or 80%), given the level of Type I error that is acceptable (typically set at 0.05 or 5%, or 0.01 or 1%), can be used to help identify the type of test and size of sample to choose. The Neyman-Pearson Lemma emphasizes the importance of tests based on the likelihood ratio as a means of ensuring that the power of a test is as great as possible. In general, parametric tests (i.e. tests that are based on the assumption of an underlying probability distribution with estimated parameters) are regarded as more powerful (in this sense) than non-parametric tests carried out on datasets of similar size.
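The relationship between α, power and sample size described above can be approximated using standard Normal quantiles. The function below is a rough one-sample z-test approximation, a sketch rather than an exact power analysis.

    import math
    from scipy import stats

    def required_n(effect, alpha=0.05, power=0.8):
        # Approximate sample size for a two-sided one-sample z-test with
        # standardized effect size 'effect', Type I error alpha and power 1 - beta.
        z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the chosen alpha
        z_beta = stats.norm.ppf(power)            # quantile corresponding to the desired power
        return math.ceil(((z_alpha + z_beta) / effect) ** 2)

    print(required_n(0.5))   # roughly 32 observations for an effect of half a standard deviation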
2) Robustness

Robustness is a term that relates to how sensitive a descriptive statistic or test is to extreme values or underlying assumptions. Some descriptive statistics, such as the median, the inter-quartile range and the trimmed mean, are more robust than others, such as the arithmetic mean and the range. Likewise, a statistical test or procedure (e.g. regression) is described as being robust if it is not especially sensitive to small changes in the data or assumptions, and in particular if the effect of outliers is not too great.

Q. Degrees of freedom

The term degrees of freedom, often denoted DF or df, was introduced by R A Fisher in 1925. The simplest way to understand the concept is to consider a simple arithmetic expression, such as the sum of a set of n positive integers. If you know the total or the mean value (which is the total divided by n), then choosing any (n-1) integers will determine what the one remaining value must be in order to equal the total. In other words, you have the freedom to choose n-1 of the numbers, so there are n-1 degrees of freedom. If you know two such facts about a set of numbers, then there will be n-2 degrees of freedom. For certain probability distributions, notably the t-distribution, F-distribution and chi-square distribution, the number of degrees of freedom is the parameter (or parameters) that determines the distribution shape, and hence the df-value is required if these distributions are to be used in hypothesis testing and inference.

In some instances (certain forms of statistical modeling) the degrees of freedom are needed in a computation, but the determination of the correct value to use is not immediately obvious. Typically a procedure is applied (often involving the use of the matrix trace operator) that produces a value to use that is described as the effective degrees of freedom.
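A tiny numerical check of the 'n-1 free choices' argument above, using an arbitrary set of values:

    import numpy as np

    values = np.array([4, 9, 2, 7, 3])
    total = values.sum()

    # Knowing the total (or mean) fixes the last value once the first n-1 are chosen,
    # so only n-1 values are free: df = n - 1.
    determined = total - values[:-1].sum()
    print(determined == values[-1])   # True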

R. Non-parametric analysis

Non-parametric methods of statistical analysis are a large collection of techniques that do not involve making prior assumptions about the distribution of the variables being investigated. They are often considered as less powerful than parametric methods, but are much more widely applicable since they involve few if any assumptions and are often more robust. Although non-parametric methods have been developed over a very long period, in recent years they have been augmented by visualization tools and a substantial number of new techniques that are essentially computationally driven.

Many of the more familiar non-parametric methods and tests are described in this Handbook, for example: goodness of fit tests, rank correlation methods, contingency table analysis, non-parametric ANOVA and non-parametric regression methods (e.g. the use of smoothing functions). In many cases data values are replaced by their ranks, i.e. the data are arranged in order and the observed values are replaced with their ordinal value or rank.
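As a small illustration of replacing data values by their ranks, the sketch below computes Spearman's rank correlation (a familiar non-parametric method) on simulated, non-Normal data; scipy is assumed to be available.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.exponential(size=40)                 # skewed data; no Normality assumption is made
    y = 2 * x + rng.normal(scale=0.5, size=40)

    # Replace the observed values by their ranks and correlate the ranks:
    # this is Spearman's rank correlation.
    rho, p_value = stats.spearmanr(x, y)
    print(f"Spearman rank correlation = {rho:.2f} (p = {p_value:.3f})")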
