RM Record
Classification: The collected data, also known as raw data or ungrouped data, are usually in
an unorganized form and need to be organized and presented in a meaningful and readily
comprehensible form in order to facilitate further statistical analysis. It is, therefore, essential
for an investigator to condense a mass of data into a more comprehensible and assimilable
form. The process of grouping data into different classes or sub-classes according to some
characteristics is known as classification. Thus, classification is the first step in tabulation.
Objects of classification: The following are the main objectives of classifying the data:
1. It condenses the mass of data in an easily assimilable form.
2. It eliminates unnecessary details.
3. It facilitates comparison and highlights the significant aspects of the data.
4. It enables one to get a mental picture of the information collected.
5. It helps in the statistical treatment of the information collected.
Types of classification: Statistical data are classified in respect of their characteristics.
Broadly there are four basic types of classification, namely chronological, geographical,
qualitative and quantitative classification.
Chronological classification: In chronological classification the collected data are
arranged according to the order of time, expressed in years, months, weeks, etc. The data
are generally classified in ascending order of time.
Eg: Data related to the population, the sales of a firm, and the imports and exports of a
country are always subjected to chronological classification.
Geographical classification: In geographical classification the data are classified
according to place or region.
Eg: Yield of wheat by country:

    Country                    America  China  Denmark  France  India
    Yield of wheat (kg/acre)      1925    893      225     439    862
Qualitative classification: In this type of classification, data are classified on the basis of
some attribute or quality like sex, literacy, religion, employment, etc. Such attributes
cannot be measured along a scale. For example, if the population is to be classified in
respect of one attribute, say sex, then we can classify it into two classes, namely males
and females. Similarly, it can also be classified into 'employed' and 'unemployed' on the
basis of another attribute, 'employment'. Thus, when the classification is done with respect
to one attribute, which is dichotomous in nature, two classes are formed, one possessing
the attribute and the other not possessing it. This type of classification is called simple
dichotomous classification.
The classification where two or more attributes are considered and several classes are
formed is called a manifold classification.
Eg: If we classify the population simultaneously with respect to two attributes, e.g. sex and
employment, then the population is first classified with respect to 'sex' into males and
females. Each of these classes may then be further classified into 'employed' and
'unemployed' on the basis of the attribute 'employment', and as such the population is
classified into four classes, namely
i. Male employed
ii. Male unemployed
iii. Female employed
iv. Female unemployed
The classification may be further extended by considering other attributes like marital
status, etc. This can be shown by a chart: the population is split into male and female, and
each of these is further split into employed and unemployed.
Quantitative classification: Quantitative classification refers to the classification of data
according to some characteristic that can be measured, such as height, weight, etc.
A variable may be either continuous (e.g. heights of persons such as 160.2-165.2 or
150.1-152.4, and weights such as 53.2-54.2 kg or 66.4-67.4 kg) or discrete (e.g. number of
rooms, number of machines: 2, 3, 4, 6).
TABULATION OF RAW DATA
Tabulation is the process of summarizing classified or grouped data in the form of a table so
that it is easily understood, and an investigator is quickly able to locate the desired
information. A table is a systematic arrangement of classified data in columns and rows.
Thus, a statistical table makes it possible for the investigator to present a huge mass of data in
a detailed and orderly form. It facilitates comparison and often reveals certain patterns in data
which are otherwise not obvious. Classification and tabulation, as a matter of fact, are not
two distinct processes; actually, they go together. Before tabulation, data are classified and
then displayed under different columns and rows of a table. Tabulation is essential because of
the following reasons.
1. It conserves space and reduces explanatory and descriptive statements to a minimum.
2. It facilitates the process of comparison.
3. It facilitates the summation of items and the detection of errors and omissions.
4. It provides a basis for various statistical computations.
Generally accepted principles of tabulation: The principles of tabulation, particularly of
constructing statistical tables, can be briefly stated as follows:
1. Every table should have a clear, concise and adequate title so as to make the table
intelligible without reference to the text and this title should always be placed just
above the body of the table.
2. Every table should be given a distinct number to facilitate easy reference.
3. The column headings (captions) and the row headings (stubs) of the table should be
clear and brief.
4. The units of measurements under each heading or sub-heading must always be indicated.
5. Explanatory footnotes, if any, concerning the table should be placed directly
beneath the table, along with the reference symbols used in the table.
6. Source or sources from where the data in the table have been obtained must be
indicated just below the table.
7. Usually, the columns are separated from one another by lines which make the table
more readable and attractive. Lines are always drawn at the top and bottom of the table
and below the captions.
8. There should be thick lines to separate the data under one class from those under
another class, and the lines separating the sub-divisions of the classes should be
comparatively thin.
9. The columns may be numbered to facilitate reference.
10. Those columns whose data are to be compared should be kept side by side. Similarly,
percentages and/or averages must also be kept close to the data.
11. It is generally considered better to approximate figures before tabulation, as this
reduces unnecessary details in the table itself.
12. In order to emphasize the relative significance of certain categories, different kinds of
type, spacing and indentations may be used.
13. It is important that all column figures be properly aligned. Decimal points and (+) or
(-) signs should be in perfect alignment.
14. Abbreviations should be avoided to the extent possible and ditto marks should not be
used in the table.
15. Miscellaneous and exceptional items, if any, should be usually placed in the last row
of the table.
16. The table should be made as logical, clear, accurate and simple as possible. If the data
happen to be very large, they should not be crowded into a single table, for that would
make the table unwieldy and inconvenient.
17. Totals of rows should normally be placed in the extreme right column, and totals of
columns should be placed at the bottom.
18. The arrangement of the categories in a table may be chronological, geographical,
alphabetical or according to magnitude to facilitate comparison.
Above all, the table must suit the needs and requirements of an investigation.
Advantages Of Tabulation:
Statistical data arranged in a tabular form serve following objectives:
1. It simplifies complex data and the data presented is easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like averages, dispersion,
correlation etc.
4. It presents facts in minimum possible space and unnecessary repetitions and
explanations are avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references, and they make it easier to present the
information in the form of graphs and diagrams.
Preparing A Table: The making of a compact table is itself an art. It should contain all the
information needed within the smallest possible space. What the purpose of tabulation is and
how the tabulated information is to be used are the main points to be kept in mind while
preparing a statistical table.
Format Of a Table:

    Title
    Head note
    +--------------+-----------------------------+
    | Stub heading | Column headings (captions)  |
    +--------------+-----------------------------+
    | Stub entries |           Body              |
    +--------------+-----------------------------+
1. Table number: A table should be numbered for easy reference and identification. This
number, if possible, should be written in the centre at the top of the table. Sometimes it
is also written just before the title of the table.
2. Title of the table: A good table should have a clearly worded, brief but unambiguous
title explaining the nature of the data contained in the table. It should also state
arrangement of data and the period covered. The title should be placed centrally on the
top of a table just below the table number (or just after table number in the same line).
3. Captions or column headings: Captions in a table stand for brief and self-explanatory
headings of vertical columns. Captions may involve headings and sub-headings as
well. The unit of the data contained should also be given for each column. Usually, a
relatively less important and shorter classification should be tabulated in the columns.
4. Stubs or row designations: Stubs stand for brief and self-explanatory headings of
horizontal rows. Normally, a relatively more important classification is given in rows.
Also, a variable with a large number of classes is usually represented in rows. For
example, rows may stand for score classes and columns for data related to the sex of
students. In that case, there will be many rows for score classes but only two
columns, for male and female students.
5. Body of the table: The body of the table contains the numerical information of
frequency of observations in different cells.
6. Footnotes: Footnotes are given at the foot of the table for explanation of any fact or
information included in the table which needs some explanation. Thus, they are meant
for explaining or providing further details about the data that have not been covered in
the title, captions and stubs.
7. Sources of data: Lastly, one should also mention the source of the information from
which the data are taken. This may preferably include the name of the author, volume,
page and year of publication. It should also state whether the data contained in the
table are of a primary or secondary nature.
Requirements of a Good Table: A good statistical table is not merely a careless grouping of
columns and rows but should be such that it summarizes the total information in an easily
accessible form in the minimum possible space. Thus, while preparing a table, one must have
a clear idea of the information to be presented, the facts to be compared and the points to be
stressed. Though there is no hard and fast rule for forming a table, the general principles of
tabulation listed above should be kept in mind.
Types Of Tables:
Tables can be classified according to their purpose, stage of enquiry, nature of data or number
of characteristics used.
On the basis of the number of characteristics, tables may be classified as follows:
a. Simple and complex tables
b. General purpose and special purpose (summary) tables
Simple or one-way table: In this type only one characteristic is shown, e.g. a table showing
the number of adults in different occupations in a locality.
Two-way table: A table which contains data on two characteristics is called a two-way
table. In such a case, therefore, either the stub or the caption is divided into two co-ordinate
parts. In the table given as an example, the caption may be further divided in respect of
'sex'. This subdivision is shown in a two-way table, which now contains two characteristics,
namely occupation and sex.
Manifold or higher order table: More and more complex tables can be formed by including
3 or more characteristics. For example, we may further classify the caption sub-heading in the
above table in respect of 'marital status', 'religion' and 'socio-economic status' etc. A table
which has more than two characteristics is termed as manifold table.
General purpose tables, also known as reference tables or repository tables, provide
information for general use or reference. They usually contain detailed information and are
not constructed for a specific discussion.
Special purpose tables, also known as summary tables, provide information for particular
discussion. When attached to a report they are found in the body of the text.
DIAGRAMMATIC AND GRAPHICAL REPRESENTATION
OF RAW DATA
One of the most convincing and appealing ways in which statistical results may be presented
is through diagrams and graphs. Just one diagram can represent given data more effectively
than a thousand words. Diagrams are easy to understand, even for ordinary people.
A diagram is a visual form for the presentation of statistical data, highlighting their basic
facts and relationships. If we draw diagrams on the basis of the data collected, they will
easily be understood and appreciated by all. Diagrams are readily intelligible and save a
considerable amount of time and energy.
Types of Diagrams
One-dimensional (bar) diagrams: In such diagrams, only a one-dimensional
measurement, i.e. height, is used, and the width is not considered.
1. Line Diagram.
2. Simple Bar Diagram.
3. Multiple Bar Diagram.
4. Sub-divided Bar Diagram.
5. Percentage Bar Diagram.
Line Diagram is used in cases where there are many items to be shown and there is not much
difference in their values. Such a diagram is prepared by drawing a vertical line for each item
according to the scale. The distance between the lines is kept uniform. A line diagram makes
comparison easy, but it is less attractive.
Two-dimensional diagrams: In two-dimensional diagrams, the area represents the data, and
so the length and the breadth both have to be considered. Such diagrams are also called
Area Diagrams or Surface Diagrams.
Rectangles
Rectangles are used to represent the relative magnitude of two or more values.
The area of the rectangles is kept in proportion to the values.
Rectangles are placed side by side for comparison.
We may represent the figures as they are given or may convert them to percentages
and then subdivide the length into various components.
Squares
The rectangular method of diagrammatic presentation is difficult to use where the
values of items vary widely.
The method of drawing a square diagram is very simple.
One has to take the square root of the values of various items that are to be shown in
the diagrams and then select a suitable scale to draw the squares.
Pie Diagram or Circular Diagram
In such a diagram, both the total and the component parts or sectors can be shown. The
area of a circle is proportional to the square of its radius.
While making comparisons, pie diagrams should be used on a percentage basis and
not on an absolute basis.
Three-dimensional diagrams consist of cubes, cylinders, spheres, etc. In such diagrams
three things, namely length, width and height, have to be taken into account. Of all these
figures, the cube is the easiest to make. The side of a cube is drawn in proportion to the cube
root of the magnitude of the data. Cube roots of figures can be ascertained with the help of
logarithms: the logarithm of the figure is divided by 3, and the antilog of that value is the
cube root.
PICTOGRAMS AND CARTOGRAMS
Pictograms are not abstract presentations such as lines or bars but actually depict the
kind of data we are dealing with. Pictures are attractive and easy to comprehend, and as such
this method is particularly useful in presenting statistics to the layman. When pictograms are
used, data are represented through pictorial symbols that are carefully selected.
Graphs
A graph is a visual form of presentation of statistical data. A graph is more attractive than a
table of figures. Even a common man can grasp the message of data from a graph.
Comparisons can be made between two or more phenomena very easily with the help of a
graph.
Graphs are divided into:
A. Histogram.
B. Frequency Polygon.
C. Frequency Curve.
D. Ogive.
E. Lorenz Curve.
A. Histogram: A histogram is drawn by marking the class intervals on the horizontal axis
and the frequencies on the vertical axis, and erecting adjacent rectangles whose areas are
proportional to the class frequencies.
B. Frequency Polygon: If we mark the midpoints of the top horizontal sides of the
rectangles in the histogram and then join them by straight lines, the figure so formed is
called a Frequency Polygon. This is done under the assumption that the frequencies in a
class interval are evenly distributed throughout the class. The area of the polygon is equal
to the area of the histogram, because the area left outside is exactly equal to the area
included in it.
C. Frequency Curve: If the middle points of the upper boundaries of the rectangles of a
histogram are connected by a smooth freehand curve, the diagram so obtained is called a
Frequency Curve. The curve should begin and end at the base line.
D. Ogives: For a set of observations, we know how to construct a frequency distribution. In
some cases, we may require the number of observations less than a given value or more
than a given value. This is obtained by accumulating (adding) the frequencies up to (or
above) the given value. This accumulated frequency is called Cumulative Frequency.
These cumulative frequencies are then listed in a table called a Cumulative Frequency
Table. The curve obtained by plotting the cumulative frequencies is called a
Cumulative Frequency Curve or an Ogive.
ARITHMETIC MEAN
In ungrouped or raw data, individual items are given. The average of n numbers is
obtained by finding their sum (by adding) and then dividing it by n:

    x̄ = (x1 + x2 + x3 + … + xn)/n = (Σ xi)/n
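As a quick sketch, the raw-data mean can be computed in Python (the helper name is illustrative; the sample values are the blood-pressure figures used later in these notes):

```python
def arithmetic_mean(values):
    # Mean of ungrouped (raw) data: sum of the observations divided by n.
    return sum(values) / len(values)

# Diastolic blood pressures of 10 individuals (from the example below)
bp = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]
print(arithmetic_mean(bp))  # 81.0
```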
Merits
1. It can be easily calculated.
2. Its calculations are based on all the observations.
3. It is easy to understand.
Demerits
1. It may not be represented in actual data and so it is theoretical.
2. The extreme values have greater effect on mean.
3. It cannot be calculated if all the values are not known.
Uses of Arithmetic mean
1. A common man uses the mean for calculating the average marks obtained by a student.
2. It is extensively used in practical statistics.
3. Estimates are always obtained by the mean.
Direct method: If the n observations in the raw data consist of distinct values x1, x2,
x3, …, xn of the observed variable x occurring with frequencies f1, f2, f3, …, fn respectively,
then the arithmetic mean of the variable x is given by

    x̄ = (f1x1 + f2x2 + f3x3 + … + fnxn)/(f1 + f2 + f3 + … + fn) = (Σ fixi)/(Σ fi) = (Σ fixi)/N

where N = Σ fi = f1 + f2 + … + fn = sum of the frequencies.
Short-cut method:

    x̄ = a + (1/N) Σ fidi

where a is the assumed mean and di = xi − a.
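The direct method can be sketched as follows (the small frequency table is hypothetical):

```python
def mean_from_frequencies(xs, fs):
    # Direct method: x_bar = sum(f_i * x_i) / sum(f_i) = sum(f_i * x_i) / N
    n = sum(fs)
    return sum(f * x for f, x in zip(fs, xs)) / n

# Hypothetical distinct values and their frequencies
xs = [1, 2, 3, 4]
fs = [2, 5, 2, 1]
print(mean_from_frequencies(xs, fs))  # 2.2
```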
The word "average" implies a value in the distribution around which the other values are
distributed; it gives a mental picture of the central value. There are several kinds of averages,
of which the commonly used are:
1. The arithmetic mean.
2. The median.
3. The mode.
Many of the questions in patient satisfaction surveys include rating scales, which require
calculating means and standard deviations for data analysis. This can be done using
popular spreadsheet software, such as Microsoft Excel, or even online calculators. If neither
of these is readily available, both the mean and the standard deviation of a data set can be
calculated using arithmetic formulae. Following are brief descriptions of the mean and
standard deviation with examples of how to calculate each.
The Mean: For a data set, the mean is the sum of the observations divided by the number of
observations. It identifies the central location of the data, sometimes referred to in English
as the average. For example, for the responses 3, 2, 4, 1, 4, 4:

    M = (3 + 2 + 4 + 1 + 4 + 4)/6 = 18/6 = 3
The arithmetic mean is widely used in statistical calculation. It is sometimes simply called
the Mean. To obtain the mean, the individual values are first added together and then
divided by the number of observations. The operation of adding together is called
'summation' and is denoted by the sign Σ. The individual observations are denoted by the
sign X.
The arithmetic mean works well for values that fit the normal distribution. It is sensitive to
extreme values, which makes it work poorly for data that are highly skewed.
Eg: The mean is calculated thus: the diastolic blood pressures of 10 individuals are
83, 75, 81, 79, 71, 95, 75, 77, 84, 90. The total is 810, so the mean is 810/10 = 81.
The advantage of the mean is that it is easy to calculate. The disadvantage is that it may
sometimes be unduly influenced by abnormal values in the distribution, and it may not
correspond to any actual observed value.
The mean is the average of all the numbers and is sometimes called the arithmetic mean. To
calculate it, add together all of the numbers in a set and then divide the sum by the total
count of numbers.
Calculation of the arithmetic mean for individual observations: The arithmetic mean is
computed by summing up the observations and dividing the sum by the number of
observations.
Eg: The following is the monthly income (in Rs) of 10 employees in an office: 1780,
1760, 1810, 1680, 1940, 1790, 1890, 1960, 1810, 1050.
Calculate the mean income. Here Σx = 17470, so x̄ = 17470/10 = Rs 1747.
As x̄ = ΣfX/N, x̄ = 756/60 = 12.6.

Eg: Calculation of the mean for a continuous series:

    X             CL          f    Mid value m     fm
    Less than 20  10 - 20      2        15           30
    20 - 30       20 - 30      8        25          200
    30 - 50       30 - 50     20        40          800
    50 - 90       50 - 90     25        70         1750
    90 - 120      90 - 120    16       105         1680
    Above 120     120 - 150    9       135         1215
                  N = 80                    Σfm = 5675

    x̄ = Σfm/N = 5675/80 = 70.94

Note: The first interval has been taken equal to the second, and the last equal to the
penultimate one. We can use any method to find x̄ that is used for a continuous series.
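The table computation can be checked with a short Python sketch, using the midpoints and frequencies from the table above (note that Σf = 80):

```python
# Class midpoints (m) and frequencies (f) from the continuous-series table
mids = [15, 25, 40, 70, 105, 135]
freqs = [2, 8, 20, 25, 16, 9]

N = sum(freqs)                                     # total frequency
sum_fm = sum(f * m for f, m in zip(freqs, mids))   # sum of f * m
mean = sum_fm / N
print(N, sum_fm, round(mean, 2))  # 80 5675 70.94
```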
Uses of the arithmetic mean
a. A common man can use it to calculate averages.
b. It is extensively used in practical statistics.
c. Estimates are always obtained by the mean.
d. A businessman uses it to find out the operation cost, profit per unit of article, and
output per man and per machine.
Merits:
a. It can be easily calculated
b. Its calculation is based on all the observations
c. It is easy to understand
d. It is rightly defined by the mathematical formula
e. It is least affected by sampling fluctuations
f. It is the best average to compare two or more series.
g. It is the average obtained by calculation, and it does not depend upon any position.
Demerits:
1. It may not be represented in the actual data.
2. The extreme values have a greater effect on the mean.
3. It cannot be calculated if all the values are not known.
4. It cannot be determined for qualitative data such as love, beauty, honesty.
5. It may lead to fallacious conclusions in the absence of the original observations.
Standard Deviation
The standard deviation is the most common measure of variability, measuring the spread of
the data and the relationship of the mean to the rest of the data. If the data points are close to
the mean, indicating that the responses are fairly uniform, then the standard deviation will be
small. Conversely, if many data points are far from the mean, indicating that there is a wide
variance in the responses, then the standard deviation will be large. If all the data values are
equal, then the standard deviation will be zero.
The standard deviation is calculated using the following formula:

    S = √[Σ(X − M)² / (n − 1)]

where Σ = sum, X = individual score, M = mean of all scores, and n = sample size (number
of scores). For the rating-scale responses above (3, 2, 4, 1, 4, 4 with M = 3),
Σ(X − M)² = 8, so S = √(8/5) = 1.265.
The standard deviation is a measure of dispersion. It is the square root of the arithmetic
average of the squares of the deviations from the arithmetic mean. It is an important and
popular measure of dispersion and was introduced by Karl Pearson. It is known as the root
mean square deviation because it is the square root of the mean of the squared deviations
from the arithmetic mean. It is denoted by the small Greek letter sigma (σ).
The coefficient of variation expresses the standard deviation as a percentage of the mean:

    CV = (S/M) × 100 = (1.265/3) × 100 ≈ 42%
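The standard deviation and coefficient of variation for the rating-scale example can be verified with a short sketch (function names are illustrative):

```python
import math

def sample_sd(values):
    # Sample standard deviation: sqrt(sum((x - m)^2) / (n - 1))
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

data = [3, 2, 4, 1, 4, 4]          # rating-scale responses from the example
s = sample_sd(data)
m = sum(data) / len(data)
cv = s / m * 100                   # coefficient of variation, in percent
print(round(s, 3), round(cv, 1))   # 1.265 42.2
```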
MEDIAN
Median is defined as the middle most or the central value of the variable in a set of
observations, when the observations are arranged either in ascending or in descending order
of their magnitudes. It divides the arranged series in two equal parts. Median is a position
average, whereas the arithmetic mean is a calculated average.
Calculation of Median
If the given data are ungrouped, arrange the n values of the variable in ascending (or
descending) order of magnitude.
When the data are ungrouped:
Case 1: When n is odd, the ((n + 1)/2)th term is the median:

    Median: Md or M = value of the ((n + 1)/2)th term

Case 2: When n is even, there are two middle terms, the (n/2)th term and the (n/2 + 1)th
term. The median is the average of these two terms:

    Median: Md or M = [(n/2)th term + (n/2 + 1)th term] / 2
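The two cases can be sketched together (a hypothetical helper):

```python
def median(values):
    # Middle value of the sorted data; average of the two middle terms when n is even.
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # ((n + 1)/2)th term, 1-based
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of (n/2)th and (n/2 + 1)th terms

print(median([7, 3, 9]))     # 7
print(median([7, 3, 9, 1]))  # 5.0
```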
MODE
Modal class: It is the class in a grouped frequency distribution in which the mode lies.
The modal class can be determined either by inspection or with the help of a grouping table.

    Mode = l + (fm − f1)/(2fm − f1 − f2) × i

or equivalently

    Mode = l + Δ1/(Δ1 + Δ2) × i

where Δ1 = fm − f1, Δ2 = fm − f2, l is the lower limit of the modal class, fm is the frequency
of the modal class, f1 and f2 are the frequencies of the classes preceding and succeeding it,
and i is the width of the class interval.
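A minimal sketch of the grouped-data mode formula, with a hypothetical modal class:

```python
def grouped_mode(l, fm, f1, f2, i):
    # Mode = l + (fm - f1) / (2*fm - f1 - f2) * i for the modal class
    return l + (fm - f1) / (2 * fm - f1 - f2) * i

# Hypothetical distribution: modal class 30-40 (fm = 20), preceded by f1 = 12,
# followed by f2 = 8, class width i = 10
print(grouped_mode(l=30, fm=20, f1=12, f2=8, i=10))  # 34.0
```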
MERITS, DEMERITS AND USES OF MODE
Merits
1. It can be easily understood.
2. It can be located in some cases by inspection.
3. It is capable of being ascertained graphically.
Demerits
1. There are different formulae for its calculation which ordinarily give different answers.
2. The mode is indeterminate: some series have two or more than two modes.
3. It is an unsuitable measure, as it is affected more by sampling fluctuations.
Uses
The mode is used when the most typical or most frequent value of a distribution is required.
DECILES
The value of the variable which divides the series, when arranged in ascending order, into 10
equal parts is called a decile. Deciles are denoted by D1, D2, D3, …, D9. The fifth decile, D5,
is the median of the given data.
Computation of Deciles
Case 1: Computation of deciles for an individual series. In this case, the kth decile is given by

    Dk = value of the [k(n + 1)/10]th term,  k = 1, 2, 3, …, 9

when the series is arranged in ascending order.
Case 2: For a grouped series, find the cumulative frequency just greater than iN/10. The
corresponding class contains the ith decile Di, i = 1, 2, 3, …, 9, and

    Di = L + ((iN/10 − C)/f) × h

where L is the lower limit of the decile class, C is the cumulative frequency of the preceding
class, f is the frequency of the decile class, and h is the class width.
Computation of Percentiles
Case 1: Computation of percentiles for an individual series. In this case, the kth percentile
is given by

    Pk = value of the [k(n + 1)/100]th term
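Deciles and percentiles use the same [k(n + 1)/parts]th-term rule, so one sketch covers both; linear interpolation at fractional positions is an assumption, since the text does not specify how to handle them:

```python
def kth_value(values, k, parts):
    # Value of the [k(n + 1)/parts]th term; parts = 10 for deciles, 100 for percentiles.
    # Linear interpolation at fractional positions is an assumption.
    s = sorted(values)
    pos = k * (len(s) + 1) / parts   # 1-based position in the sorted series
    lo = int(pos)
    frac = pos - lo
    if lo < 1:
        return s[0]
    if lo >= len(s):
        return s[-1]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

data = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(kth_value(data, 5, 10))    # D5 = median = 50.0
print(kth_value(data, 25, 100))  # P25 = 25.0
```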
    r = Σxy / (N σx σy)

Here,
x = (X − X̄)
y = (Y − Ȳ)
σx = standard deviation of series X
σy = standard deviation of series Y
N = number of pairs of observations
r = the (product moment) correlation coefficient.
This method is to be applied only where the deviations of items are taken from the actual
mean and not from an assumed mean.
The value of the coefficient of correlation as obtained by the above formula always lies
between −1 and +1. When r = +1, there is perfect positive correlation between the
variables. When r = −1, there is perfect negative correlation between the variables.
When r = 0, there is no relationship between the two variables.
However, in practice such values of r as +1, −1 and 0 are rare. We normally get values
which lie between +1 and −1, such as +0.8, −0.26, etc.
The coefficient of correlation describes not only the magnitude of correlation but also its
direction. Thus, + 0.8 would mean the correlation is positive because the sign of r is + and the
magnitude of correlation is 0.8. Similarly, -0.26, means low degree of negative correlation.
The above formula for computing Pearson's coefficient of correlation can be transformed to
the following form which is easier to apply.
    r = Σxy / √(Σx² × Σy²)

where x = (X − X̄) and y = (Y − Ȳ).
It is obvious that while applying this formula we have not to calculate separately the standard
deviation of X and Y series as is required by the formula.
This simplifies greatly the task of calculating correlation coefficient.
Steps In Calculating Correlation Coefficient
1. Take the deviations of the X series from the mean of X and denote these deviations by x.
2. Square these deviations and obtain the total, i.e., Σx².
3. Take the deviations of the Y series from the mean of Y and denote these deviations by y.
4. Square these deviations and obtain the total, i.e., Σy².
5. Multiply the deviations of the X and Y series and obtain the total, i.e., Σxy.
6. Substitute the values of Σxy, Σx² and Σy² in the above formula.
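The six steps can be sketched directly (illustrative data):

```python
import math

def pearson_r(X, Y):
    # Product-moment r = sum(xy) / sqrt(sum(x^2) * sum(y^2)),
    # where x and y are deviations from the respective means.
    mx = sum(X) / len(X)
    my = sum(Y) / len(Y)
    xs = [v - mx for v in X]               # step 1
    ys = [v - my for v in Y]               # step 3
    sxx = sum(a * a for a in xs)           # step 2
    syy = sum(b * b for b in ys)           # step 4
    sxy = sum(a * b for a, b in zip(xs, ys))  # step 5
    return sxy / math.sqrt(sxx * syy)      # step 6

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfect positive correlation)
```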
The coefficient of rank correlation is based on the ranks of the various values of the variates
and is denoted by R. It is applied to problems in which the data cannot be measured
quantitatively but qualitative assessment is possible, such as beauty, honesty, etc. In this
case the best individual is given rank number 1, the next rank 2, and so on. The coefficient
of rank correlation is given by the formula:

    R = 1 − 6ΣD² / (n(n² − 1))
where D² is the square of the difference of the corresponding ranks and n is the number of
pairs of observations.
When the ranks are given, the difference of the ranks of X from the corresponding ranks of
Y is calculated, giving the column D; by squaring these terms the column D² is obtained,
and these values are substituted in the given formula.
When only the data are given and the ranks are not mentioned, ranks are first assigned
to both the series X and Y by giving rank 1 to the highest value in each series, and so on.
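A sketch of the rank-correlation formula, assuming the ranks are already assigned and untied:

```python
def spearman_R(rank_x, rank_y):
    # R = 1 - 6 * sum(D^2) / (n * (n^2 - 1)), D = difference of paired ranks.
    # Assumes ranks are already assigned and there are no ties.
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Two judges' rankings of 5 candidates (hypothetical)
print(spearman_R([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```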
CALCULATION OF "t" TEST AND ITS INTERPRETATION
In biological experiments it often becomes essential to compare the means of two samples to
draw a conclusion. Visual inspection of the difference between two sample means usually
fails to establish whether the difference is significant. Therefore, the degree or level of
significance of the difference between two means has to be quantified to reach a definite
conclusion.
To test the significance of the difference between the means of two samples, W. S. Gosset
in 1908 applied a statistical tool called the t test. The pen name of Gosset was 'Student', and
hence this test is called Student's t test. The t ratio is the ratio of the difference between two
means to its standard error. R. A. Fisher later developed Student's t test and explained it in
various ways.
In Student's t test we make a choice between two alternatives:
1. To accept the null hypothesis (no difference between the two means).
2. To reject the null hypothesis, i.e. the difference between the means of the two samples
is statistically significant.
Determination of Significance:
Probability of occurrence of any calculated value of t is determined by comparing it with the
value given in the t table corresponding to the combined degree of freedom derived from the
number of observations in the samples under study. If the calculated value of t exceeds the
value given at p = 0.05 in the table (5% level), it is said to be significant.
If the calculated value is less than the value given in the table, it is not significant.
Degree of Freedom
The quantity in the denominator, which is one less than the number of independent
observations in a sample, is called the degree of freedom.
In a paired t test, df = N − 1; in an unpaired t test, df = N1 + N2 − 2 (N1 and N2 are the
numbers of observations in each of the two series).
Application of the t-distribution: The following are some of the examples to test the
significance of the various results obtained from small samples.
1. Testing the significance of the mean of a random sample:

    t = (x̄ − μ)√n / S

where

    S = √[Σ(x − x̄)² / (n − 1)] = √[(Σd² − n·d̄²) / (n − 1)]

(d being deviations taken from an assumed mean).
If the calculated value of t is more than the table value t0.05, the difference between x̄ and μ
is significant at the 5% level of significance.
If the calculated value of t is less than the table value t0.05, the difference between x̄ and μ is
not significant at the 5% level of significance; hence the sample might have been drawn
from a population with mean = μ.
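A sketch of this one-sample test, reusing the blood-pressure data from the mean example against a hypothetical μ = 75:

```python
import math

def one_sample_t(values, mu):
    # t = (x_bar - mu) * sqrt(n) / S, with S the sample standard deviation
    n = len(values)
    xbar = sum(values) / n
    S = math.sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))
    return (xbar - mu) * math.sqrt(n) / S

# Diastolic blood pressures (x_bar = 81), tested against mu = 75 (hypothetical)
bp = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]
t = one_sample_t(bp, 75)
print(round(t, 3))  # 2.593, with df = n - 1 = 9
```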
Fiducial limits of the population mean: Assuming that the sample is a random sample from
a normal population of unknown mean, the 95% fiducial limits of the population mean (μ)
are:

    x̄ ± (S/√n) t0.05
2. Testing the difference between the means of two independent samples:

    t = (X̄1 − X̄2)/S × √(n1n2 / (n1 + n2))
where,
X1 = mean of the first sample
X2 = mean of the second sample
n1 = no. of observations in the first sample
n2 = no. of observations in the second sample
S= combined standard deviation.
The value of S is calculated by the following formula:
    S = √[(Σ(X1 − X̄1)² + Σ(X2 − X̄2)²) / (n1 + n2 − 2)]
when we are given with the number of observations and standard deviation of the samples,
the pooled estimate of the standard deviation can be obtained as follows:
    S = √[((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)]
Interpretation of the results:
If the calculated value of t is more than t0.05 (or t0.01), the difference between the sample
means is said to be significant at the 5% (1%) level of significance; otherwise, if the
calculated value of t is less than the table value, no significant difference exists between the
sample means.
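The unpaired test and the pooled S can be sketched together (the readings for the two groups are hypothetical):

```python
import math

def two_sample_t(x1, x2):
    # Unpaired t-test: t = (X1_bar - X2_bar) / S * sqrt(n1*n2 / (n1 + n2)),
    # with S the combined (pooled) standard deviation.
    n1, n2 = len(x1), len(x2)
    m1 = sum(x1) / n1
    m2 = sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    S = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / S * math.sqrt(n1 * n2 / (n1 + n2))

# Hypothetical readings for two independent groups
t = two_sample_t([10, 12, 14], [11, 13, 15, 17])
print(round(t, 3))  # df = n1 + n2 - 2 = 5
```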
3. Testing the difference between means of two samples (Dependent sample or matched
paired observations)
Two samples are said to be dependent when the elements in one sample are related to those
in the other in any significant or meaningful manner, e.g., to find out the effect of training on
some employees, to find out the efficacy of a coaching class, or to determine whether there
is a significant difference in the efficacy of two drugs, one made within the country and the
other imported. The t-test based on paired observations is defined by the following formula:
t = d̄√n / S
Where,
d̄ = the mean of the differences
S = the standard deviation of the differences
(df = n − 1, where n is the number of pairs)
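A minimal Python sketch of the paired t-test, using hypothetical before/after scores:

```python
import math

def paired_t(before, after):
    # t = d-bar·√n / S, where d is the per-pair difference and S its SD (n − 1 divisor)
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    dbar = sum(diffs) / n
    s = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))
    return dbar * math.sqrt(n) / s

# Hypothetical scores before and after training; df = n − 1 pairs
t = paired_t([60, 62, 65], [63, 64, 66])
```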
4. Testing the significance of an observed correlation coefficient:
t = r√(n − 2) / √(1 − r²)
df = n − 2
If the calculated value of t is more than t0.05, the value of r is significant at the 5% level.
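This significance test for a correlation coefficient can be sketched directly from the formula; the values of r and n below are hypothetical:

```python
import math

def corr_t(r, n):
    # t = r·√(n − 2) / √(1 − r²), compared with the table value for df = n − 2
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical: r = 0.6 observed in a sample of n = 27
t = corr_t(0.6, 27)  # t = 0.6·5/0.8 = 3.75
```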
A t-test is a comparison of the means of two populations through statistical examination; a
t-test with two samples is commonly used with small sample sizes, testing the difference
between the samples when the variances of the two normal distributions are not known.
Significance Testing: The terms "significance level" or "level of significance" refer to the
likelihood that the random sample you choose (e.g., test scores) is not representative of the
population. The lower the significance level, the more confident you can be in replicating
your results. The significance levels most commonly used in educational research are the .05
and .01 levels.
E.g., .05 is another way of saying that 95 times out of 100 the sample represents the
population; similarly, .01 suggests that 99 times out of 100 the sample represents the
population. These numbers come from significance testing, which begins with the null hypothesis.
Part I: The Null Hypothesis
The traditional way to test this question involves:
Step 1. Develop a research question.
Step 2. Find previous research to support, refute, or suggest ways of testing the question.
Step 3. Construct a hypothesis by revising the research question:
The Chi-square (χ²) Test
The χ² test is one of the simplest and most widely used nonparametric tests in statistical work.
χ² was first used by Karl Pearson in the year 1900.
The quantity χ² describes the magnitude of the discrepancy between theory and observation.
It is defined as χ² = Σ(O − E)²/E
Where,
O = Observed frequencies
E = Expected frequencies
STEPS: To determine the value of χ², the steps required are:
a) Calculate the expected frequencies. In general, the expected frequency for any cell can be
calculated from the following equation:
E = (RT × CT)/N
where,
E = expected frequency
RT = the row total for the row containing the cell
CT = the column total for the column containing the cell
N = the total number of observations.
b) Take the differences between the observed and expected frequencies and obtain the squares
of these differences, i.e., obtain the values of (O − E)².
c) Divide the values of (O − E)² obtained in step (b) by the respective expected frequencies
and obtain the total, Σ(O − E)²/E. This gives the value of χ², which can range from zero to
infinity. If χ² is zero, it means that the observed and expected frequencies completely coincide.
The greater the discrepancy between the observed and expected frequencies, the greater shall
be the value of χ².
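The steps above can be sketched in Python for a contingency table; the table entries are hypothetical:

```python
def chi_square(table):
    # χ² = Σ(O − E)²/E, with E = RT × CT / N for each cell
    rows = [sum(r) for r in table]          # row totals (RT)
    cols = [sum(c) for c in zip(*table)]    # column totals (CT)
    n = sum(rows)                           # total observations (N)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical 2x2 table; df = (r − 1)(c − 1) = 1
chi2 = chi_square([[10, 20], [30, 20]])
```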
The calculated value of χ² is compared with the table value of χ² for the given degrees of
freedom at a certain specified level of significance.
If, at the stated level (generally the 5% level is selected), the calculated value of χ² is more
than the table value of χ², the difference between theory and observation is considered to be
significant.
If, on the other hand, the calculated value of χ² is less than the table value, the difference
between theory and observation is not considered significant.
The computed value of χ² is a random variable which takes on different values from sample to
sample; that is, χ² has a sampling distribution.
It should be noted that the value of χ² is always positive and its upper limit is infinity. Also, since
χ² is derived from observations, it is a statistic and not a parameter.
The chi-square test (χ² test) is, therefore, termed non-parametric.
Degrees of Freedom
While comparing the calculated value of χ² with the table value we must determine the
degrees of freedom. By degrees of freedom, we mean the number of classes to which
values can be assigned arbitrarily, or at will, without violating the restrictions or limitations
placed.
The number of degrees of freedom is obtained by subtracting from the number of classes the
number of degrees of freedom lost in fitting. For a contingency table with r rows and c
columns, df = (r − 1)(c − 1).
Interpretation
The chi square test is one of the most popular statistical inference procedures today.
It is applicable to a very large number of problems in practice, which can be summed up under
the following heads:
a) χ² test as a test of independence:
With the help of the chi-square test, we can find out whether two or more attributes are
associated or not.
Suppose we have N observations classified according to some attributes; we may ask whether
the attributes are related or independent.
In order to test whether or not the attributes are associated we take the null hypothesis that
there is no association in the attributes under study or, in other words, the two attributes are
independent.
If the calculated value of χ² is less than the table value at a certain level of significance
(generally the 5% level), we say that the results of the experiment provide no evidence for
doubting the hypothesis or, in other words, the hypothesis that the attributes are not
associated holds good.
On the other hand, if the calculated value of χ² is greater than the table value at a certain level
of significance, we say that the results of the experiment do not support the hypothesis or, in
other words, the attributes are associated.
It should be noted that χ² is not a measure of the degree or form of relationship. It only tells
us whether two principles of classification are or are not significantly related, without
reference to any assumptions concerning the form of the relationship.
b) χ² test as a test of goodness of fit:
The χ² test is popularly known as a test of goodness of fit because it enables us to
ascertain how closely theoretical distributions such as the Binomial, Poisson, Normal,
etc., fit empirical distributions, i.e., those obtained from sample data.
When an ideal frequency curve whether normal or some other type is fitted to the data, we are
interested in finding out how well this curve fits with the observed facts.
A test of the goodness of fit of the two can be made just by inspection, but such a test is
obviously inadequate. Precision can be secured by applying the x2 test.
The following are the steps in testing the goodness of fit:
1. A null and alternative hypothesis are established, and a significance level is selected for
rejection of the null hypothesis.
2. A random sample of observations is drawn from a relevant statistical population.
3. A set of expected or theoretical frequencies is derived under the assumption that the null
hypothesis is true. This generally takes the form of assuming that a particular probability
distribution is applicable to the statistical population under consideration.
4. The observed frequencies are compared with the expected, or theoretical frequencies.
5. If the calculated value of χ² is less than the table value at a certain level of
significance (generally the 5% level) and for certain degrees of freedom, the fit is considered
to be good, i.e., the divergence between the actual and expected frequencies is attributed to
fluctuations of simple sampling. On the other hand, if the calculated value of χ² is greater
than the table value, the fit is poor, i.e., the divergence cannot be attributed to fluctuations of
simple sampling; rather it is due to the inadequacy of the theory to fit the observed facts.
6. It should be borne in mind that in repeated sampling too good a fit is just as likely as too
bad a fit. When the computed chi-square is too close to zero, we should suspect the
possibility that the two sets of frequencies have been manipulated to force them to agree and,
therefore, the design of our experiment should be thoroughly checked.
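The goodness-of-fit procedure above can be sketched with a simple example; the null hypothesis, observed counts, and expected counts (a fair die over 60 hypothetical rolls) are illustrative assumptions:

```python
def gof_chi_square(observed, expected):
    # Goodness-of-fit χ² = Σ(O − E)²/E;
    # df = number of classes − 1 − (parameters estimated from the data)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: 60 die rolls; under the null each face is expected 10 times
chi2 = gof_chi_square([8, 12, 9, 11, 10, 10], [10] * 6)
# compare chi2 with the table value for 5 df at the chosen significance level
```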
F = Between-column variance / Within-column variance
Symbolically,
F = S1² / S2²
The F-distribution (named after the famous statistician R.A. Fisher) measures the ratio of the
variance between groups to the variance within groups. The variance between the sample
means is the numerator and the variance within the samples is the denominator. If there
is no real difference from group to group, any sample difference will be explainable by
random variation, and the variance between groups should be close to the variance within
groups. However, if there is a real difference between the groups, the variance between groups
will be significantly larger than the variance within groups.
4. Compare the calculated value of F with the table value of F for the difference at a
certain critical level (generally, we take the 5% level of significance). If the calculated value
of F is greater than the table value, it is concluded that the difference in sample means is
significant, i.e., it could not have arisen due to fluctuations of simple sampling or, in other
words, the samples do not come from the same population. On the other hand, if the
calculated value of F is less than the table value, the difference is not significant and has
arisen due to fluctuations of simple sampling.
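The F-ratio for a one-way classification can be sketched as follows; the three groups of observations are hypothetical:

```python
def f_ratio(groups):
    # F = between-column variance / within-column variance (one-way ANOVA)
    k = len(groups)                                  # number of groups
    n = sum(len(g) for g in groups)                  # total observations
    grand = sum(sum(g) for g in groups) / n          # grand mean
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    ms_between = ss_between / (k - 1)                # df = k − 1
    ms_within = ss_within / (n - k)                  # df = n − k
    return ms_between / ms_within

# Hypothetical groups; compare F with the table value for (k − 1, n − k) df
F = f_ratio([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```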
It is customary to summarise the calculations for the sums of squares, together with the
numbers of df and mean squares, in a table called the analysis of variance table, generally
abbreviated ANOVA. A specimen ANOVA table is given below:
Analysis of variance (ANOVA) table: One way classification Model