Survey Design and Data Analysis Guide
Data Coding
Data coding in research methodology is a preliminary step in analyzing data. Data obtained
from surveys, experiments or secondary sources is in raw form. It needs to be refined and
organized before it can be evaluated and conclusions drawn. Data coding is not an easy job, and
the person or persons involved in data coding must have knowledge and experience of it.
What is a code?
A code in research methodology is a short word or phrase describing the meaning and context
of the whole sentence, phrase or paragraph. The code makes the process of data analysis
easier. Numerical quantities can be assigned to codes and thus these quantities can be
interpreted. Codes help quantify qualitative data and give meaning to raw data.
What is data coding?
Data coding is the process of deriving codes from the observed data. In qualitative research the
data is obtained from observations, interviews or questionnaires. The purpose of
data coding is to bring out the essence and meaning of the data that respondents have
provided. The data coder extracts preliminary codes from the observed data; the preliminary
codes are then filtered and refined to obtain more accurate, precise and concise codes. Later,
in the evaluation of data, the researcher assigns values, percentages or other numerical
quantities to these codes to draw inferences. It should be kept in mind that the purpose of data
coding is not just to eliminate excessive data but to summarize it meaningfully. The data
coder should ascertain that none of the important points of the data have been lost in data
coding.
Coding Examples
A few examples are given here to illustrate data coding.
“I prefer to shop from a store that provides a large inventory of the same product, every brand
and every style in that product range. Usually in these stores you get maximum range of
products you want to purchase. You get profits through deals and sales.”
The data coder can assign different codes to what the respondent narrated above. These codes
might be as follows:
“Preference for horizontal markets”
“Horizontal integration”
“Shopping preference”
Preliminary codes
When a data coder assigns codes to the observed data, he cannot manage to assign well-refined
codes in the first instance. He has to assign some preliminary codes first so that the data
becomes concise. He later refines the codes further to get the final codes. It must be kept in
mind that preliminary codes are not the final words or phrases on the basis of which evaluation
will be made. The researcher will filter the preliminary codes and then derive the final codes.
He needs a pattern on the basis of which he can categorize human behavior, actions, or likes and
dislikes.
Final codes
The final codes will help you observe a better pattern in the data. This pattern is necessary to
reach the final evaluation or analysis stage of the data. The final codes in data coding
mean finding out meaningful words and phrases from the observed data. The respondents
often do not choose meaningful words in their responses. The coder needs to extract the
meaning out of the respondent’s wording. The codes in their final stage are like topics and
themes; these themes generate a whole discussion that leads to the final results. Sometimes the
interviewer or the observer writes down some codes as he observes the behavior of the
respondent. Such codes are really worthy in the research because these codes cannot be
derived from the written responses that the respondents provide. The data coder should look
for the verbs and the actions that the respondent has mentioned in the text. He should also
observe the behavior and, wherever possible, derive codes. It should be kept in mind
that qualitative data analysis is all about finding meanings and interpretations, so the
coder should have an eye for such things.
Categories
The codes are given meaningful names and they are put in categories. These categories help
refine the research a lot. When data is coded again and again, it gets refined. The refined data
itself leads to patterns and themes. The patterns are the key to finding the true results of the
research. These patterns or categories show where the bulk of the data
inclines.
Steps in data management
Prepare the data collection instrument and collect the data;
Prepare the data dictionary or codebook;
Prepare the data matrix worksheets;
Prepare instructions for data entry and data analysis.
Why Coding?
All research collects data of some sort. In order to make sense of the data, it must be
analyzed. Analysis begins with the labeling of data as to its source, how it was collected, the
information it contains, etc.
Working with original data, however, can be very cumbersome, whether it is hundreds of
mailed questionnaires, figures on yearly accident rates for the fifty states, or observations of
classroom behavior of school children. For this reason, data are often coded.
Codes allow the researcher to reduce large quantities of information into a form that can be
more easily handled, especially by computer programs. Not all data need to be coded. For
example, the accident rates for the fifty states would not be coded, but each state could be
assigned a number (1 through 50) instead of using the state name. There are also content
analysis computer programs that help researchers to code textual data for qualitative or
quantitative analysis.
A. Prepare the data collection instrument and collect the data. Example: Quality of
Work Life Questionnaire
1. Name of Division where you work: _____________________________
2. How long have you been an employee in this company? _______years
3. How many county-sponsored training sessions have you attended? _____
4. What is your job classification?
_____Management
_____Technical
_____Administrative
_____Clerical
5. Is your position
_____supervisory
_____non-supervisory
6. Sex
_____male
_____female
7. In what area would you like to receive additional training? ___________
Tips on Coding:
1. Use numbers to represent response categories. For example,
2. Use zero and one to code variables with binary response categories, such as:
Are you a supervisor? No=0 Yes=1
Sex: Male=0 Female=1
Are you at headquarters or in the field? Headquarters=0 Field=1
(Be sure to use the number zero, and not the letter "O"; and the number one, not the letter
"L").
3. The same data can be coded in more than one way. For example, the following data on
what materials the library should acquire can be coded in two different ways:
4. One question on a questionnaire can yield more than one variable. For example: What type
of training would you like to receive?
_____supervising _____budgeting _____computers _____personnel
The researcher has to try to anticipate how the data will look. A good idea of this can be
gained from doing a pilot test of the instrument, and a dry run of the data collection process. It
is important to be sure to leave enough columns to properly code the information for each
variable, and to provide enough variables to capture all the richness, complexity, and variety
of data that has been collected.
If a sample of college students is asked about barriers they encounter in attempting to use
the campus library, will students be asked to list the one main barrier, to rank order all the
barriers, or to choose only the barriers relevant to them? And what if the students do not
follow the instructions? Depending on what shape the data come in, the researcher will have
to decide how to code this information, using one, two, or many variables.
Each single numeral or character that is entered into a computer program takes up one column
of space. Each datum can be found by knowing its location by column number in the matrix.
Columns 1 through 3 taken together represent the person's employee ID number.
Column 4 represents the division worked in.
Columns 5-6 represent the length of time employed.
Columns 7-8 represent the number of training classes taken (note that the information on
number of classes taken is missing for person number 003).
Column 9 represents the person's job classification.
Column 10 indicates whether the person is a supervisor or not.
Column 11 indicates whether the person is male or female.
Column 12 indicates what type of training the person wants in the future.
Each record, case, questionnaire, or other unit of analysis is represented by a single row of
data across the matrix. For example, person 001 is found in row 1; person 002 in row 2; and
person 003 in row 3.
Each record must be entered in exactly the same way. If the data are to be
entered in fixed column positions, this is referred to as fixed-field format. If data are missing for a
record on any of the variables, something must still be entered into that field. Usually this is a
number indicating that the data is missing. For a 1-column field, use the number 9; for a two-
column field, use 99; and so forth. Just make sure that "9" or "99" is not also a valid response.
In that case, use some other number; some computer programs will allow you to use a period
(".") as a placeholder that is also an indicator of missing data.
When you ask the computer, for example, to compute the average length of time employed
of all the employees in your survey, the computer will look in columns 5-6 of each record. It
will take whatever it finds there, and attempt to compute an average. It is important, therefore,
that all length of employment data be in columns 5-6 for every record, and that no other type
of data be in columns 5-6. The computer will disregard missing data codes (i.e., values of
"99") in computing the average, provided those codes have been declared to the program as
missing values.
Many computer programs have a limitation of a total of 80 columns of data per record. This is
a holdover from when data were punched on cardboard cards that were fed into card readers,
rather than entering data directly into the computer. If your data require more than 80
columns, you will have to construct additional data matrices to record the remainder of the
information for each record.
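The fixed-field layout described above can be sketched in a few lines of Python. This is only an illustration: the four records, the field names, and the helper functions are hypothetical, and the column positions follow the 12-column layout listed earlier. Note that the missing-data code has to be stated explicitly, otherwise "99" would be averaged as if it were 99 years.

```python
# Hypothetical fixed-field records using the 12-column layout described above.
RECORDS = [
    "001105021101",
    "002212043012",
    "003308992003",  # training classes missing (99 in columns 7-8)
    "004299014011",  # length of employment missing (99 in columns 5-6)
]

# Field name -> (first column, last column), 1-indexed and inclusive.
FIELDS = {
    "employee_id": (1, 3),
    "division": (4, 4),
    "years_employed": (5, 6),
    "trainings": (7, 8),
    "job_class": (9, 9),
    "supervisor": (10, 10),
    "sex": (11, 11),
    "training_wanted": (12, 12),
}


def parse_record(line):
    """Slice one fixed-field record into a dict of raw string values."""
    return {name: line[start - 1:end] for name, (start, end) in FIELDS.items()}


def mean_ignoring_missing(records, field, missing_code):
    """Average a numeric field, skipping records that hold the missing code."""
    values = [
        int(rec[field])
        for rec in map(parse_record, records)
        if rec[field] != missing_code
    ]
    return sum(values) / len(values)


avg_years = mean_ignoring_missing(RECORDS, "years_employed", "99")
print(f"Average years employed: {avg_years:.2f}")
```

Only records 001 to 003 contribute to the average here, because record 004 carries the missing-data code in columns 5-6.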
Characteristics of a Binomial
A random variable has a binomial distribution if all of the following conditions are met:
1. There are a fixed number of trials (n).
2. Each trial has two possible outcomes: success or failure.
3. The probability of success (call it p) is the same for each trial.
4. The trials are independent, meaning the outcome of one trial does not influence that of any
other.
Let X equal the total number of successes in n trials; if all of the above conditions are met, X
has a binomial distribution with probability of success equal to p.
Checking the binomial conditions step by step
You flip a fair coin 10 times and count the number of heads.
Does this represent a binomial random variable? You can check by reviewing your responses
to the questions and statements in the list that follows:
1. Are there a fixed number of trials?
You’re flipping the coin 10 times, which is a fixed number. Condition 1 is met, and n = 10.
2. Does each trial have only two possible outcomes —success or failure?
The outcome of each flip is either heads or tails, and you’re interested in counting the number
of heads, so flipping a head represents success and flipping a tail is a failure. Condition 2 is
met.
3. Is the probability of success the same for each trial?
Because the coin is fair the probability of success (getting a head) is p = 1⁄2 for each trial. You
also know that 1 – 1⁄2 = 1⁄2 is the probability of failure (getting a tail) on each trial. Condition
3 is met.
4. Are the trials independent?
We assume the coin is being flipped the same way each time, which means the outcome of
one flip does not affect the outcome of subsequent flips. Condition 4 is met.
\[ \binom{n}{x} = \frac{n!}{x!\,(n-x)!} \]
The notation n! stands for n-factorial, the number of ways to rearrange n
items. To calculate n!, you multiply n(n – 1)(n – 2) . . . (2)(1). For example 3! is 3(2)(1) = 6;
2! is 2(1) = 2; and 1! is 1. By convention, 0! equals 1. To calculate “3 choose 2,” you do the
following:
\[ \binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{3 \cdot 2 \cdot 1}{(2 \cdot 1)(1)} = \frac{6}{2} = 3 \]
Suppose you cross three traffic lights on your way to work, and the probability of each of
them being red is 0.30. (Assume the lights are independent.) You let X be the number of red
lights you encounter and you want to find the probability distribution for X. You know p =
probability of red light = 0.30; 1 – p = probability of a non-red light = 1 – 0.30 = 0.70; and the
number of non-red lights is 3 – X. Using the formula, you obtain the probabilities for X = 0, 1,
2, and 3 red lights:
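The probabilities for the traffic-light example can be computed with a short Python sketch. It assumes the standard binomial probability formula P(X = x) = C(n, x) p^x (1 – p)^(n – x), which is not written out in the text above; the function name `binomial_pmf` is only illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 3, 0.30  # three traffic lights, each red with probability 0.30
distribution = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, prob in distribution.items():
    print(f"P(X = {x}) = {prob:.3f}")
# P(X = 0) = 0.343
# P(X = 1) = 0.441
# P(X = 2) = 0.189
# P(X = 3) = 0.027
```

The four probabilities sum to 1, as they must for a complete probability distribution.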
You need not worry about whether to include an “equal to” in a less-than or greater-than
probability because the probability of a continuous random variable equaling one number
exactly is zero. (There is no area under the curve at one specific point.)
Suppose, for example, that you enter a fishing contest. The contest takes place in a pond
where the fish lengths have a normal distribution with mean = 16 inches and standard
deviation = 4 inches.
Problem 1: What is the chance of catching a small fish — say, less than 8 inches?
Problem 2: Suppose a prize is offered for any fish over 24 inches. What is the chance of
catching a fish at least that size?
Problem 3: What is the chance of catching a fish between 16 and 24 inches?
To solve these problems, first draw a picture of the distribution.
Figure 5-3 shows a picture of X’s distribution for fish lengths. You can see where each of the
fish lengths mentioned in each of the three fish problems falls.
Step 3 says change the x-values to z-values using the Z-formula \( Z = \frac{X-\mu}{\sigma} \). For Problem 1 of
the fish example, you have \( P(X < 8) = P\left(Z < \frac{8-16}{4}\right) = P(Z < -2) \). Similarly for Problem 2,
P(X > 24) becomes P(Z > 2). Problem 3 translates from P(16 < X < 24) to P(0 < Z < 2). Figure 5-4
shows a comparison of the X-distribution and Z-distribution for the values x = 8, 16, and 24,
which transform into z = –2, 0, and +2, respectively.
Now that you have changed x-values to z-values, you move to Step 4 and find probabilities for
those z-values using the Z-table (Table A-1 in the appendix). In Problem 1 of the fish
example, you want P(Z < –2); go to the Z-table and look at the row for –2.0 and the column
for 0.00, intersect them, and you find 0.0228 — according to Step 5a you’re done. So, the
chance of a fish being less than 8 inches is equal to 0.0228.
For Problem 2, find P(Z > 2.00). Because it’s a “greater-than” problem, this calls for Step 5b.
To be able to use the Z-table you need to rewrite this in terms of a “less-than” statement.
Because the entire probability for the Z-distribution equals 1, we know P(Z > 2.00) = 1 – P(Z
< 2.00) = 1 – 0.9772 = 0.0228.
So, the chance that a fish is greater than 24 inches is 0.0228. (Note the answers to Problems 1
and 2 are the same because the Z-distribution is symmetric; see Figure 5-3.)
In Problem 3, you find P(0 < Z < 2.00); this requires Step 5c. First find P(Z < 2.00), which is
0.9772 from the Z-table, and then subtract off the part you don’t want, which is P(Z < 0) =
0.500 from the Z-table. This gives you 0.9772 – 0.500 = 0.4772. So the chance of a fish being
between 16 and 24 inches is 0.4772.
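As a cross-check on the Z-table lookups, the three fish problems can be computed directly with Python's standard library. This is a sketch that replaces the table with `statistics.NormalDist` (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Fish lengths: normal with mean 16 inches and standard deviation 4 inches.
fish = NormalDist(mu=16, sigma=4)

p_small = fish.cdf(8)                    # Problem 1: P(X < 8)
p_prize = 1 - fish.cdf(24)               # Problem 2: P(X > 24)
p_between = fish.cdf(24) - fish.cdf(16)  # Problem 3: P(16 < X < 24)

print(f"P(X < 8)       = {p_small:.4f}")
print(f"P(X > 24)      = {p_prize:.4f}")
print(f"P(16 < X < 24) = {p_between:.4f}")
```

The results agree with the table-based answers of 0.0228, 0.0228 and 0.4772 to four decimal places.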
The Poisson distribution is the discrete probability distribution of the number of events
occurring in a given time period, given the average number of times the event occurs over that
time period.
A certain fast-food restaurant gets an average of 3 visitors to the drive-through per minute.
This is just an average, however. The actual amount can vary.
A Poisson distribution can be used to analyze the probability of various events regarding how
many customers go through the drive-through. It can allow one to calculate the probability of
a lull in activity (when there are 0 customers coming to the drive-through) as well as the
probability of a flurry of activity (when there are 5 or more customers coming to the drive-
through). This information can, in turn, help a manager plan for these events with staffing and
scheduling.
The Poisson distribution is applicable only when several conditions hold.
Conditions for Poisson Distribution:
An event can occur any number of times during a time period.
Events occur independently. In other words, if an event occurs, it does not affect the
probability of another event occurring in the same time period.
The rate of occurrence is constant; that is, the rate does not change based on time.
The probability of an event occurring is proportional to the length of the time period.
For example, it should be twice as likely for an event to occur in a 2 hour time period
as it is for an event to occur in a 1 hour period.
For example, the Poisson distribution is appropriate for modeling the number of phone calls
an office would receive during the noon hour, if they know that they average 4 calls per hour
during that time period.
Although the average is 4 calls, they could theoretically get any number of calls
during that time period.
The events are effectively independent since there is no reason to expect a caller to
affect the chances of another person calling.
The occurrence rate may be assumed constant.
It is reasonable to assume that (for example) the probability of getting a call in the first
half hour is the same as the probability of getting a call in the final half hour.
Of course, this situation isn't a perfect theoretical fit for the Poisson distribution.
For instance, the office certainly cannot receive a trillion calls during the period, as
there are fewer than a trillion people alive to be making calls. Practically speaking, the situation
is close enough that the Poisson distribution does a good job of modeling the situation's
behavior.
Probabilities with the Poisson Distribution
Given that a situation follows a Poisson distribution, there is a formula, which allows one to
calculate the probability of observing k events over a period of time for any non-negative
integer value of k.
Let X be the discrete random variable that represents the number of events observed over a
given time period. Let λ be the expected value (average) of X. If X follows a Poisson
distribution, then the probability of observing k events over the time period is
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \]
where e is Euler's number.
Example: In the World Cup, an average of 2.5 goals are scored each game. Modeling this
situation with a Poisson distribution, what is the probability that k goals are scored in a game?
In this instance, λ=2.5. The above formula applies directly:
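\[ P(X=0) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{2.5^0 e^{-2.5}}{0!} \approx 0.082 \]
\[ P(X=1) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{2.5^1 e^{-2.5}}{1!} \approx 0.205 \]
\[ P(X=2) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{2.5^2 e^{-2.5}}{2!} \approx 0.257 \]
\[ P(X=3) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{2.5^3 e^{-2.5}}{3!} \approx 0.214 \]
\[ P(X=4) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{2.5^4 e^{-2.5}}{4!} \approx 0.134 \]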
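The World Cup probabilities can be reproduced with a small Python sketch of the Poisson formula. The function name `poisson_pmf` is illustrative, not part of any library.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


lam = 2.5  # average goals per game
goal_probs = [poisson_pmf(k, lam) for k in range(5)]

for k, prob in enumerate(goal_probs):
    print(f"P(X = {k}) = {prob:.3f}")
```

Computing from the formula rather than copying table values avoids small rounding slips in the third decimal place.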
A fast food restaurant gets an average of 2.8 customers approaching the register every minute.
Assuming the number of customers approaching the register per minute follows a Poisson
distribution, what is the probability that 4 customers approach the register in the next minute?
Round your answer to 3 decimal places.
The Poisson distribution can be used to calculate the probabilities of "less than" and "more
than" using the rule of sum and complement probabilities.
Example: A statistician records the number of cars that approach an intersection. He finds that
an average of 1.6 cars approach the intersection every minute. Assuming the number of cars
that approach this intersection follows a Poisson distribution, what is the probability that 3 or
more cars will approach the intersection within a minute?
For this problem, λ = 1.6. The goal of this problem is to find P(X ≥ 3), the probability that there
are 3 or more cars approaching the intersection within a minute. Since there is no upper limit
on the value of k, this probability cannot be computed directly. However, its complement,
P(X ≤ 2), can be computed to give P(X ≥ 3):
\[ P(X=0) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{1.6^0 e^{-1.6}}{0!} \approx 0.202 \]
\[ P(X=1) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{1.6^1 e^{-1.6}}{1!} \approx 0.323 \]
\[ P(X=2) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{1.6^2 e^{-1.6}}{2!} \approx 0.258 \]
\[ \Rightarrow P(X \le 2) = P(X=0) + P(X=1) + P(X=2) \approx 0.783 \]
\[ \Rightarrow P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.783 \approx 0.217 \]
Therefore, the probability that there are 3 or more cars approaching the intersection within a
minute is approximately 0.217.
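The complement approach used in the intersection example can be sketched in Python as follows. The helper names `poisson_pmf` and `poisson_cdf` are illustrative.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k), obtained by summing the probabilities for 0, 1, ..., k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


lam = 1.6  # average cars per minute
p_at_most_2 = poisson_cdf(2, lam)
p_at_least_3 = 1 - p_at_most_2  # complement rule

print(f"P(X <= 2) = {p_at_most_2:.3f}")
print(f"P(X >= 3) = {p_at_least_3:.3f}")
```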
When a computer disk manufacturer tests a disk, it writes to the disk and then tests it using a
certifier. The certifier counts the number of missing pulses or errors. The number of errors in
a test area on a disk has a Poisson distribution with λ=0.2. What percentage of test areas have
two or fewer errors?
There are other applications of the Poisson distribution that come from more open-ended
problems. For example, it can be used to help determine the amount of staffing that is needed
in a call center.
Example: A call center receives an average of 4.5 calls every 5 minutes. Each agent can
handle one of these calls over the 5 minute period. If a call is received, but no agent is
available to take it, then that caller will be placed on hold. Assuming that the calls follow a
Poisson distribution, what is the minimum number of agents needed on duty so that calls are
placed on hold at most 10% of the time?
In order for all calls to be taken, the number of agents on duty should be greater than or equal
to the number of calls received. If X is the number of calls received and k is the number of
agents, then k should be set such that P(X > k) ≤ 0.1, or equivalently, P(X ≤ k) ≥ 0.9.
The average number of calls is 4.5, so λ=4.5:
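\[ P(X=0) = \frac{\lambda^k e^{-\lambda}}{k!} = \frac{4.5^0 e^{-4.5}}{0!} \approx 0.011 \]
\[ P(X=1) = \frac{4.5^1 e^{-4.5}}{1!} \approx 0.050 \;\Rightarrow\; P(X \le 1) \approx 0.061 \]
\[ P(X=2) = \frac{4.5^2 e^{-4.5}}{2!} \approx 0.112 \;\Rightarrow\; P(X \le 2) \approx 0.174 \]
\[ P(X=3) = \frac{4.5^3 e^{-4.5}}{3!} \approx 0.169 \;\Rightarrow\; P(X \le 3) \approx 0.342 \]
\[ P(X=4) = \frac{4.5^4 e^{-4.5}}{4!} \approx 0.190 \;\Rightarrow\; P(X \le 4) \approx 0.532 \]
\[ P(X=5) = \frac{4.5^5 e^{-4.5}}{5!} \approx 0.171 \;\Rightarrow\; P(X \le 5) \approx 0.703 \]
\[ P(X=6) = \frac{4.5^6 e^{-4.5}}{6!} \approx 0.128 \;\Rightarrow\; P(X \le 6) \approx 0.831 \]
\[ P(X=7) = \frac{4.5^7 e^{-4.5}}{7!} \approx 0.082 \;\Rightarrow\; P(X \le 7) \approx 0.913 \]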
If the goal is to make sure that calls are placed on hold at most 10% of the time, then 7 agents
should be on duty, since k = 7 is the smallest value for which P(X ≤ k) reaches 0.9.
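The staffing search can be written as a short loop that accumulates Poisson probabilities until the chance of a call being held drops to the target. This is a sketch; the function name `min_agents` and its parameters are illustrative.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


def min_agents(lam, max_hold_prob):
    """Smallest k such that P(X > k) <= max_hold_prob."""
    k = 0
    cumulative = poisson_pmf(0, lam)  # P(X <= 0)
    while 1 - cumulative > max_hold_prob:
        k += 1
        cumulative += poisson_pmf(k, lam)
    return k


agents = min_agents(lam=4.5, max_hold_prob=0.10)
print(f"Agents needed: {agents}")  # Agents needed: 7
```

Changing `max_hold_prob` shows how sensitive the staffing level is to the service target; a stricter target requires more agents.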
Properties of the Poisson Distribution
The expected value of a Poisson distribution should come as no surprise, as each Poisson
distribution is defined by its expected value.
Expected Value of Poisson Random Variable:
Given a discrete random variable X that follows a Poisson distribution with parameter λ, the
expected value of this variable is E[X] = λ.
HYPOTHESIS
Hypothesis testing is a decision-making process for evaluating claims about a population.
We must define the population under study
state the particular hypotheses that will be investigated
give the significance level
select a sample from the population
collect the data
perform the calculations required for the statistical test
reach a conclusion.
A statistical test uses the data obtained from a sample to make a decision about whether or
not the null hypothesis should be rejected.
The numerical value obtained from a statistical test is called the test value.
In the hypothesis-testing situation, there are four possible outcomes.
In reality, the null hypothesis may or may not be true, and a decision is made to reject or
not to reject it on the basis of the data obtained from a sample.
                     H0 True             H0 False
Reject H0            Type I error        Correct decision
Do not reject H0     Correct decision    Type II error
A type I error occurs if one rejects the null hypothesis when it is true.
A type II error occurs if one does not reject the null hypothesis when it is false.
The level of significance is the maximum probability of committing a type I error. This
probability is symbolized by α (Greek letter alpha). That is, P(type I error)=α.
P(type II error) = β (Greek letter beta).
Typical significance levels are: 0.10, 0.05, and 0.01.
For example, when α = 0.10, there is a 10% chance of rejecting a true null hypothesis.
The critical value(s) separates the critical region from the noncritical region.
The symbol for critical value is C.V.
The critical or rejection region is the range of values of the test value that indicates that
there is a significant difference and that the null hypothesis should be rejected.
The noncritical or non-rejection region is the range of values of the test value that indicates
that the difference was probably due to chance and that the null hypothesis should not be
rejected.
A one-tailed test (right or left) indicates that the null hypothesis should be rejected when the
test value is in the critical region on one side of the mean.
In a two-tailed test, the null hypothesis should be rejected when the test value is in either of
the two critical regions.
Example 2:
A national magazine claims that the average college student watches less television than the
general public. The national average is 29.4 hours per week, with a standard deviation of 2
hours. A sample of 30 college students has a mean of 27 hours. Is there enough evidence to
support the claim at α = 0.01?
Solution:
Step 1: State the hypotheses and identify the claim. H0: µ ≥ 29.4 H1: µ < 29.4 (claim)
Step 2: Find the critical value. Since α = 0.01 and the test is a left-tailed test, the critical value
is z = –2.33.
Step 3: Compute the test value.
z = [27– 29.4]/[2/√30] = – 6.57.
Step 4: Make the decision. Since the test value, – 6.57, falls in the critical region, the decision
is to reject the null hypothesis.
Step 5: Summarize the results. There is enough evidence to support the claim that college
students watch less television than the general public.
Example 3:
The Medical Rehabilitation Education Foundation reports that the average cost of
rehabilitation for stroke victims is ₦24,672. To see if the average cost of rehabilitation is
different at a large hospital, a researcher selected a random sample of 35 stroke victims and
found that the average cost of their rehabilitation is ₦25,226.
The standard deviation of the population is ₦3,251. At α = 0.01, can it be concluded that the
average cost at a large hospital is different from ₦24,672?
Solution:
Step 1: State the hypotheses and identify the claim. H0: µ = ₦24,672 H1: µ ≠ ₦24,672
(claim)
Step 2: Find the critical values. Since α = 0.01 and the test is a two-tailed test, the critical
values are z = –2.58 and +2.58.
Step 3: Compute the test value.
z = [25,226 – 24,672]/[3,251/√35] = 1.01.
Step 4: Make the decision. Do not reject the null hypothesis, since the test value falls in the
noncritical region.
Step 5: Summarize the results. There is not enough evidence to support the claim that the
average cost of rehabilitation at the large hospital is different from ₦24,672.
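The z-test procedure in Examples 2 and 3 can be sketched as one Python function. This is an illustration under the same assumptions as the examples (known population standard deviation, normal test statistic); the function name `z_test` and its `tail` argument are hypothetical, and critical values come from `statistics.NormalDist` rather than a printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha, tail):
    """Return (test value, reject H0?) for a one-sample z test.

    tail is "left", "right", or "two" (hypothetical labels for this sketch).
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "left":
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "right":
        reject = z > std_normal.inv_cdf(1 - alpha)
    else:
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    return z, reject


# Example 2: television hours, left-tailed test at alpha = 0.01.
z_tv, reject_tv = z_test(27, 29.4, 2, 30, 0.01, "left")
# Example 3: rehabilitation cost, two-tailed test at alpha = 0.01.
z_cost, reject_cost = z_test(25_226, 24_672, 3_251, 35, 0.01, "two")

print(f"Example 2: z = {z_tv:.2f}, reject H0: {reject_tv}")
print(f"Example 3: z = {z_cost:.2f}, reject H0: {reject_cost}")
```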
Example 1:
A job placement director claims that the average starting salary for nurses is ₦24,000. A
sample of 10 nurses has a mean of ₦23,450 and a standard deviation of ₦400. Is there enough
evidence to reject the director’s claim at α = 0.05?
Solution:
Step 1: State the hypotheses and identify the claim. H0: µ = ₦24,000 (claim) H1: µ ≠ ₦24,000
Step 2: Find the critical value. Since α = 0.05 and the test is a two-tailed test, the
critical values are t = –2.262 and +2.262 with d.f. = 9.
Step 3: Compute the test value. t = [23,450 – 24,000]/[400/√10] = – 4.35.
Step 4: Reject the null hypothesis, since – 4.35 < – 2.262.
Step 5: There is enough evidence to reject the claim that the starting salary of nurses is
₦24,000.
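The t-test arithmetic in the nurses' salary example can be checked with a few lines of Python. The critical value 2.262 is taken from the example itself (two-tailed, α = 0.05, d.f. = 9), since the standard library has no t-distribution; the variable names are illustrative.

```python
from math import sqrt

sample_mean = 23_450   # sample mean starting salary
mu0 = 24_000           # claimed population mean
sample_sd = 400        # sample standard deviation
n = 10                 # sample size
t_critical = 2.262     # from the example: two-tailed, alpha = 0.05, d.f. = n - 1 = 9

# Test value: distance between sample mean and claim, in standard-error units.
t_value = (sample_mean - mu0) / (sample_sd / sqrt(n))
reject_h0 = abs(t_value) > t_critical

print(f"t = {t_value:.2f}, reject H0: {reject_h0}")
```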
A bivariate distribution, put simply, gives the probability of each combination of outcomes when
there are two random variables in your scenario. For example, having two bowls,
each filled with two different types of candies, and pulling one candy from each bowl gives
you two random variables (independent ones, in this case), the two different candies. Since you
are pulling one candy from each bowl at the same time, you have a bivariate distribution when
calculating your probability of ending up with particular kinds of candies.
Some examples:
– Height (X) and weight (Y ) are measured for each individual in a sample.
– Stock market valuation (X) and quarterly corporate earnings (Y ) are recorded for each
company in a sample.
– A cell culture is treated with varying concentrations of a drug, and the growth rate (X) and
drug concentration (Y ) are recorded for each trial.
– Temperature (X) and precipitation (Y ) are measured on a given day at a set of weather
stations.
Be clear about the difference between bivariate data and two sample data. In two sample data,
the X and Y values are not paired, and there aren’t necessarily the same number of X and Y
values.
Two-sample data:
Sample 1: 3,2,5,1,3,4,2,3
Sample 2: 4,4,3,6,5
What Does It Look Like?
So, what does a bivariate distribution look like? Such a distribution actually doesn't have a
standard look. You can create a table with these distributions or you can list each probability
out one by one. In any case, you always have two random variables in any given
scenario.
Here is what a bivariate distribution looks like in table form.
This bivariate distribution shows you the probability of picking red or blue candies from a red
bowl and a blue bowl if you pick one candy from each bowl and there are an equal number of
red and blue candies in each bowl.
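The table for the candy example can be generated with a short Python sketch. It relies on the assumptions stated above (each bowl holds equal numbers of red and blue candies) plus the assumption that the two draws are independent, so each joint probability is the product of the two individual probabilities. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Probability of each candy color when drawing one candy from each bowl.
red_bowl = {"red": Fraction(1, 2), "blue": Fraction(1, 2)}
blue_bowl = {"red": Fraction(1, 2), "blue": Fraction(1, 2)}

# Joint (bivariate) distribution under independence: multiply the two probabilities.
joint = {
    (a, b): red_bowl[a] * blue_bowl[b]
    for a, b in product(red_bowl, blue_bowl)
}

for (from_red_bowl, from_blue_bowl), prob in joint.items():
    print(f"red bowl: {from_red_bowl:<4}  blue bowl: {from_blue_bowl:<4}  probability: {prob}")
```

Each of the four combinations has probability 1/4, and the four entries sum to 1.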
MODULE FIVE
POINT AND INTERVAL ESTIMATES
Estimation theory
Estimation theory is a branch of statistics that deals with estimating the values of parameters
based on measured/empirical data that has a random component.
An estimate is a single value that is calculated based on samples and used to estimate a
population value.
An estimator is a function that maps the sample space to a set of estimates.
The entire purpose of estimation theory is to arrive at an estimator, which takes the sample as
input and produces an estimate of the parameters with the corresponding accuracy.
Point Estimator
A point estimator is a statistic (that is, a function of the data) that is used to infer the value of
an unknown parameter in a statistical model.
A point estimate is one of the possible values a point estimator can assume.
Mathematically, suppose there is a fixed parameter θ that needs to be estimated and X is a
random variable corresponding to the observed data. Then an estimator of θ, usually denoted
by the symbol \(\hat{\theta}\), is a function of the random variable X, and hence itself a random
variable \(\hat{\theta}(X)\).
A point estimate for a particular observed dataset (i.e. for X = x) is then \(\hat{\theta}(x)\), which is a
fixed value.
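The distinction between an estimator and an estimate can be illustrated in Python. In this sketch the estimator is the sample mean (one common choice for estimating a population mean, chosen here only as an example), and the observed data are hypothetical.

```python
from statistics import mean


def mean_estimator(sample):
    """Estimator: a rule (function) that maps any sample to an estimate."""
    return mean(sample)


# Hypothetical observed dataset x.
observed = [3, 2, 5, 1, 3, 4, 2, 3]

# Point estimate: the single fixed value the estimator yields for this dataset.
point_estimate = mean_estimator(observed)
print(f"Point estimate of the population mean: {point_estimate}")
```

A different sample would give a different estimate from the same estimator, which is why the estimator itself is treated as a random variable.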
MODULE SIX
MATHEMATICAL EXPECTATION
Mathematical expectation, also known as the expected value, is the sum (for a discrete random
variable) or integral (for a continuous one) of the possible values of a random variable, each
weighted by its probability. Each term is the product of the probability of an event occurring,
denoted P(x), and the value corresponding to the actual observed occurrence of the event. The
expected value is a useful property of any random variable. Usually notated as E(X), the
expected value can be computed by summing over all the distinct values that the random variable
can take. The mathematical expectation is given by the formula
\( E(X) = \sum_{i=1}^{n} x_i p_i = x_1 p_1 + x_2 p_2 + \dots + x_n p_n \), where X is a
random variable with probability function f(x), \(p_i\) is the probability of the value \(x_i\), and
n is the number of all possible values. An indicator variable takes the value one if an event A
occurs and zero if it does not, so the mathematical expectation of an indicator variable equals
the probability of A. Thus, it is a useful tool to find the probability of event A.
The second property is that the mathematical expectation of the product of the two random
variables will be the product of the mathematical expectation of those two variables, provided
that the two variables are independent in nature. In other words, E(XY)=E(X)E(Y).
The generalization of this property states that the mathematical expectation of the product of
the n number of independent random variables is equal to the product of the mathematical
expectation of the n independent random variables.
The third property states that the mathematical expectation of the product of a constant and
a function of a random variable is equal to the product of the constant and the mathematical
expectation of the function of that random variable, provided that the expectation
exists. The third property also states that the mathematical expectation of the sum of a
constant and a function of a random variable is equal to the sum of the constant and the
mathematical expectation of the function of that random variable, provided that the
expectation exists. In other words, E(a f(X)) = a E(f(X)) and
E(a + f(X)) = a + E(f(X)), where a is a constant and f(X) is the function.
The fourth property combines the two parts of the third: if a random variable is multiplied by
one constant and a second constant is added, the mathematical expectation of the result is the
first constant times the mathematical expectation of the random variable, plus the second
constant, provided that the expectation
exists. In other words, E(aX + b) = aE(X) + b, where a and b are constants.
The fifth property states that the mathematical expectation of a linear combination of n
random variables is equal to the same linear combination of their mathematical
expectations. In other words, \( E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i) \).
Here, \(a_i\) (i = 1, …, n) are constants.
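Two of these properties can be verified exactly on a small example. The sketch below uses a fair six-sided die as the random variable (an illustrative choice) and exact fractions, checking E(aX + b) = aE(X) + b and, for two independent dice, E(XY) = E(X)E(Y). The helper name `expectation` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Distribution of a fair six-sided die: value -> probability.
die = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(dist, g=lambda x: x):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in dist.items())


e_x = expectation(die)  # 7/2

# Fourth property: E(aX + b) = a E(X) + b.
a, b = 3, 2
lhs_linear = expectation(die, lambda x: a * x + b)
rhs_linear = a * e_x + b

# Second property: E(XY) = E(X) E(Y) for independent X and Y (two separate dice).
e_xy = sum(x * y * px * py for (x, px), (y, py) in product(die.items(), repeat=2))

print(f"E(X) = {e_x}")
print(f"E(aX + b) = {lhs_linear}, aE(X) + b = {rhs_linear}")
print(f"E(XY) = {e_xy}, E(X)E(Y) = {e_x * e_x}")
```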
Example 1
A casino is considering a dice game that would pay the winner of the game $10. The game is
similar to craps, the participant would roll two fair, 6-sided dice and if they sum to 7 or 11,
the participant wins; otherwise they lose. What is the expected payout the casino will make as
each game is played?
Solution:
One first needs to identify the probability distribution f(x) of the sum of two dice.
Below is a table that identifies these probabilities:
Sum          2     3     4     5    6     7    8     9    10    11    12
Probability  1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36
Since the casino stands to lose $10 each time a contestant rolls a 7 or 11, the mathematical
expected value (or expected cost) of this game to the casino is:
\[ \tfrac{1}{36}(0) + \tfrac{1}{18}(0) + \tfrac{1}{12}(0) + \tfrac{1}{9}(0) + \tfrac{5}{36}(0) + \tfrac{1}{6}(10) + \tfrac{5}{36}(0) + \tfrac{1}{9}(0) + \tfrac{1}{12}(0) + \tfrac{1}{18}(10) + \tfrac{1}{36}(0) = \tfrac{20}{9} \approx 2.22 \]
The expected value of this game is $2.22.
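The dice table and the expected payout can be derived by enumerating all 36 equally likely rolls. This Python sketch uses exact fractions; the names `sum_dist` and `payout` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely rolls give each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
sum_dist = {s: Fraction(c, 36) for s, c in sorted(counts.items())}


def payout(total):
    """The casino pays $10 when the dice sum to 7 or 11, otherwise nothing."""
    return 10 if total in (7, 11) else 0


expected_payout = sum(payout(s) * p for s, p in sum_dist.items())
print(f"Expected payout per game: ${float(expected_payout):.2f}")  # $2.22
```

The exact value is 20/9 dollars, so any entry fee above about $2.22 would favor the casino in the long run.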
Example 4:
In a contest sponsored by 7Up Bottling co, you win a prize if the cap on your bottle of
minerals says “WINNER”; however, you may only claim one prize. Eager to win, you blow
all your savings on minerals; as a result, you have a 0.05% chance of winning ₦1,000,000, a
1% chance of winning ₦20,000, and a 90% chance of winning ₦10. Ignoring the money you
spent on minerals, what is your expected value and standard deviation?
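One way to set up this calculation is sketched below in Python. It assumes that the remaining probability (100% minus the three stated chances) corresponds to winning nothing, which the problem statement implies but does not say outright; the variable names are illustrative.

```python
from math import sqrt

# (prize in naira, probability); 0.05% is written as 0.0005.
prizes = [(1_000_000, 0.0005), (20_000, 0.01), (10, 0.90)]

# Assumption: the leftover probability means no prize at all.
p_nothing = 1 - sum(p for _, p in prizes)
outcomes = prizes + [(0, p_nothing)]

expected_value = sum(v * p for v, p in outcomes)
variance = sum((v - expected_value) ** 2 * p for v, p in outcomes)
std_dev = sqrt(variance)

print(f"Expected value:     {expected_value:,.2f}")
print(f"Standard deviation: {std_dev:,.2f}")
```

The standard deviation is far larger than the expected value because almost all of the variability comes from the rare ₦1,000,000 prize.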