01.basic Statistics
UNIT - I
What is Statistics?
The word 'Statistics' traces its root to the Latin word 'Status', the Italian word 'Statista' or
the German word 'Statistik', each of which means a 'political state'.
The word 'Statistics' was primarily associated with the presentation of facts and figures
pertaining to the demographic, social and political situations prevailing in a state/government. Its
evolution over time formed the basis for most of the science and art disciplines.
Statistics is used in the developmental phases of both theoretical and applied areas,
encompassing the fields of Industry, Agriculture, Medicine, Sports and Business analytics.
Meaning of Statistics
Statistics is concerned with the systematic collection of numerical data and its
interpretation.
Definition of Statistics
Limitation of Statistics
Scope of Statistics
Information, especially facts or numbers collected for decision making is called data.
Data may be numerical or categorical. Data may also be generated through a variable.
Variable: A variable is an entity that varies from place to place, person to person,
trial to trial and so on. For instance, height is a variable and domicile is a variable, since they
vary from person to person.
Types of Data
1. Quantitative data
2. Qualitative data
Quantitative Data
Quantitative data (variables) are measurements that are collected or recorded as numbers.
Usual examples are height, weight, etc.
Qualitative Data
Qualitative data are measurements that cannot be measured on a natural numerical scale.
For example, blood types are categorized as O, A, B along with the Rh factors. They can only
be classified into one of the pre-assigned or pre-designated categories.
1. Primary data
2. Secondary data
Primary data are information collected for the first time, from a survey, an
observational study or through experimentation. For example:
A survey is conducted to identify the reasons parents give for selecting a particular
school for their children in a locality.
Information is collected from the observations made by customers based on the service
they received.
To test the efficacy of a drug, a randomized controlled trial is conducted using a particular
drug and a placebo.
(i) Survey data: The investigator or his agency meets the respondents and gets the required data.
(iii) Observational data: In the case of a psychological study or in a medical situation, the
investigator simply observes and records the information about the respondent. In other words, the
investigator behaves like a spectator.
As the name says, the investigator himself goes to the field, meets the respondents and
gets the required information. In this method, the investigator personally interviews the
respondent either directly or through phone or through any electronic media. This method is
suitable when the scope of investigation is small and greater accuracy is needed.
Merits
This method ensures accuracy because of personal interaction with the investigator.
This method enables the interviewer to suitably adjust the situations with the respondent.
Limitations
When the field of enquiry is vast, this method is more expensive, time consuming and
cumbersome.
In this type of survey, there is a chance of personal bias by the investigator in terms of
asking 'leading questions'.
In the present age of communication explosion, telephones and mobile phones are
extensively used to collect data from the respondents. This saves the cost and time of collecting
the data with a good amount of accuracy.
With the widespread use of computers, telephone interviewing can be combined with
immediate entry of the response into a data file by means of terminals, personal computers, or
voice data entry. Computer-Assisted Telephone Interviewing (CATI) is used in market research
organizations throughout the world.
The indirect method is used in cases where it is delicate or difficult to get the information
from the respondents due to unwillingness or indifference. The information about the respondent
is collected by interviewing the third party who knows the respondent well.
Instances for this type of data collection include information on addiction, marriage
proposal, economic status, witnesses in court, criminal proceedings etc. The shortcoming of this
method is genuineness and accuracy of the information, as it completely depends on the third
party.
Advantages
In a short span of time, vast geographical area can be covered.
It involves less labor.
Limitations
For instance, the Central Statistical Organization (CSO) of Government of India has local
correspondents NSSO. Through them they get the required data. Newspaper publishers appoint
agents to collect news for their dailies. These people collect data in their locality on behalf of the
publisher and transmit them to the head office.
In this method, the trained enumerators or interviewers take the schedules themselves,
contact the informants, get replies and fill them in their own handwriting. Thus, schedules are
filled by the enumerator whereas questionnaires are filled by the respondents. The enumerators
are paid an honorarium. This method is suitable when the respondents include illiterates. The
success of this method depends on the training imparted to the enumerators. The voters' list
preparation, information on ration cards for public distribution in India, etc., follow this method of
data collection. The National Sample Survey Office (NSSO) collects information using schedules
depending on the theme.
2. Secondary data
Secondary data are collected and processed by some other agency, but the investigator uses
them for his study. They can be obtained from published sources such as government reports,
documents, newspapers, books written by economists or from any other source, for example
websites. Use of secondary data saves time and cost. Before using secondary data, scrutiny
must be done to assess the suitability, reliability, adequacy, and accuracy of the data.
The secondary data come from two main sources, namely published and unpublished.
Data which are not published are also available in the files and office records of
Government and private organizations. The different sources described above are schematically
described below.
Primary data | Secondary data
It is collected for the first time | Compiled from already existing sources
It is collected directly by the investigator or by his team | Compiled by persons other than the persons who collected the data
Classification of Data
Classification is the process of arranging the primary data in a definite pattern and
presenting in a systematic form.
Definition: Classification is the process of arranging the data into sequences and groups
according to their common characteristics or separating them into different but related parts.
Objectives of Classification
Classification of data has manifold objectives. The salient features among them are the
following:
Types of Classification
The method of classifying data with reference to geographical location such as countries,
states, cities, districts, etc., is called classification by space or spatial classification. It is also
termed as geographical classification. The following are some examples:
Examples of attributes include nationality, religion, gender, marital status, literacy and so
on.
When the characteristics are measured on numerical scale, they may be classified on the
basis of their magnitude. Such a classification is known as classification by size or quantitative
classification.
For example data relating to the characteristics such as height, weight, age, income,
marks of students, production and consumption, etc., which are quantitative in nature, come
under this category.
There are certain rules to be followed for classifying the data which are given below.
The classes must be exhaustive, i.e., it should be possible to include each of the data
points in one or the other group or class.
The classes must be mutually exclusive, i.e., there should not be any overlapping.
It must be ensured that the number of classes is neither too large nor too small.
Generally, the number of classes may be fixed between 4 and 15.
The magnitude or width of all the classes should be equal in the entire classification.
The system of open end classes may be avoided.
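The rules above can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the list of marks, the class width of 10 and the helper names make_classes and classify are illustrative assumptions. It builds equal-width classes, checks that their number lies between 4 and 15, and places every value in exactly one class, so that the classes are exhaustive and mutually exclusive.

```python
def make_classes(lower, upper, width):
    """Return equal-width class intervals [a, b) that cover lower..upper."""
    classes = []
    a = lower
    while a < upper:
        classes.append((a, a + width))
        a += width
    return classes


def classify(values, classes):
    """Count the values in each class; fail if a value fits no class."""
    counts = [0] * len(classes)
    for v in values:
        # Half-open intervals a <= v < b cannot overlap (mutually exclusive).
        hits = [i for i, (a, b) in enumerate(classes) if a <= v < b]
        if len(hits) != 1:
            raise ValueError(f"{v} does not fall in exactly one class")
        counts[hits[0]] += 1
    return counts


# Hypothetical marks of 12 students, used only for illustration.
marks = [12, 47, 33, 58, 21, 39, 45, 50, 8, 27, 36, 41]

classes = make_classes(0, 60, 10)
frequencies = classify(marks, classes)

# Rule of thumb from the text: between 4 and 15 classes.
assert 4 <= len(classes) <= 15

for (a, b), f in zip(classes, frequencies):
    print(f"{a}-{b}: {f}")
```

Half-open intervals are one common way to keep equal-width classes from overlapping; other conventions (for example inclusive class limits) would need a different membership test.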
Types of Tables
General Tables
Summary Tables.
General tables contain a collection of detailed information including all that is relevant
to the subject or theme.
The main purpose of such tables is to present all the information available on a certain
problem at one place for easy reference and they are usually placed in the appendices of reports.
Summary tables are designed to serve some specific purposes. They are smaller in size
than general tables, emphasize some aspect of the data and are generally incorporated within the
text.
The summary tables are also called derivative tables because they are derived from the
general tables. The information contained in the summary table aims at analysis and inference.
Hence, they are also known as interpretative tables.
The statistical tables may further be classified into two broad classes namely simple
tables and complex tables. A simple table summarizes information on a single characteristic and
is also called a univariate table.
Components of a Table
Table Number
Title of the Table
Stub heading | Caption (Column headings) | Total
Stub (Row entries) | Body |
Total | |
Diagrams
A diagram is a visual form for presenting statistical data for highlighting the basic facts
and relationship which are inherent in the data. The diagrammatic presentation is more
understandable and it is appreciated by everyone. It attracts the attention and it is a quicker way
of grasping the results saving the time. It is very much required, particularly, in presenting
qualitative data.
Graphs
The quantitative data is usually represented by graphs. Though it is not quite attractive
and understandable by a layman, the classification and tabulation techniques will reduce the
complexity of presenting the data using graphs. Statisticians have understood the importance of
graphical presentation to present the data in an interpretable way. The graphs are drawn
manually on graph papers.
Diagrams and graphs are extremely useful due to the following reasons:
While constructing diagrams for statistical data, the following guidelines are to be kept in
mind:
Types of Diagrams
Simple bar diagram can be drawn either on horizontal or vertical base. But, bars on
vertical base are more common. Bars are erected along the axis with uniform width and space
between the bars must be equal. While constructing a simple bar diagram, the scale is determined
as proportional to the highest value of the variable. The bars can be coloured to make the
diagram attractive. This diagram is mostly drawn for categorical variable. It is more useful to
present the data related to the fields of Business and Economics.
Example 1
Year | Production cost (in lakhs of ₹)
2010 | 55
2011 | 40
2012 | 30
2013 | 25
2014 | 35
2015 | 70
Solution:
(i) We represent the above data by simple bar diagram in the following manner:
Step-1: Years are marked along the X-axis and labelled as 'Year'.
Step-2: Values of Production Cost are marked along the Y-axis and labelled as 'Production Cost
(in lakhs of ₹)'.
Step-3: Vertical rectangular bars are erected on the years marked and whose height is
proportional to the magnitude of the respective production cost.
(ii) (a) The maximum production cost of the company was in the year 2015.
(b) The minimum production cost of the company was in the year 2013.
(c) The production cost of the company during the period 2012- 2014 is less than 40
lakhs.
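Average production cost = (55 + 40 + 30 + 25 + 35 + 70) / 6 = 255 / 6 = 42.5 lakhs
Production cost of 2015 as a percentage of that of 2014 = (70 / 35) × 100 = 200%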
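The readings taken from the simple bar diagram in Example 1 can be reproduced with a short Python sketch. This is only an illustration: the variable names are assumptions, and the bars are printed horizontally as text (one '#' per 5 lakhs) rather than drawn vertically on graph paper.

```python
# Production cost (in lakhs) for each year, taken from Example 1.
production_cost = {2010: 55, 2011: 40, 2012: 30, 2013: 25, 2014: 35, 2015: 70}

# Text version of a simple bar diagram: bar length is proportional to the value.
for year, cost in production_cost.items():
    print(f"{year} | {'#' * (cost // 5)} {cost}")

# Year with the tallest bar and year with the shortest bar.
max_year = max(production_cost, key=production_cost.get)
min_year = min(production_cost, key=production_cost.get)

# Average production cost over the six years.
average = sum(production_cost.values()) / len(production_cost)

# Cost of 2015 expressed as a percentage of the cost of 2014.
pct_2015_of_2014 = production_cost[2015] * 100 / production_cost[2014]

print(max_year, min_year, average, pct_2015_of_2014)
```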
Multiple bar diagram is used for comparing two or more sets of statistical data. Bars with
equal width are placed adjacently for each cluster of values of the variable. There should be
equal space between clusters. In order to distinguish bars in each cluster, they may be either
differently coloured or shaded. Legends should be provided.
Example 2
The table given below shows the profit obtained before and after tax payment (in lakhs of
rupees) by a business man on selling cars from the year 2014 to 2017.
Year | Profit before tax | Profit after tax
2014 | 195 | 80
2015 | 200 | 87
2016 | 165 | 45
2017 | 140 | 32
(i) Construct a multiple bar diagram for the above data.
(ii) In which year, the company earned maximum profit before paying the tax?
(iii) In which year, the company earned minimum profit after paying the tax?
(iv) Find the difference between the average profit earned by the company before paying
the tax and after paying the tax.
Solution:
Since we are comparing the profit earned before and after paying the tax by the same
Company, the multiple bar diagram is drawn. The diagram is drawn following the procedure
presented below:
Step 1 : Years are marked along the X-axis and labeled as “Year”.
Step 2 : Values of Profit before and after paying the tax are marked along the Y-axis and labeled
as “Profit (in lakhs of ₹)”.
Step 3 :Vertical rectangular bars are erected on the years marked, whose heights are proportional
to the respective profit. The vertical bars corresponding to the profit earned before and
after paying the tax in each year are placed adjacently.
Step 4 :The vertical bars drawn corresponding to the profit earned before paying the tax are
filled with one type of colour. The vertical bars drawn corresponding to the profit
earned after paying the tax are filled with another type of colour. The colouring
procedure should be applied to all the years uniformly.
Step 5 :Legends are displayed to describe the different colours applied to the bars drawn for
profit earned before and after paying the tax.
Figure 2: Multiple Bar Diagram for Profit by the Company earned before and after
paying the Tax
(ii) The company earned the maximum profit before paying the tax in the year 2015.
(iii) The company earned the minimum profit after paying the tax in the year 2017.
(iv) The average profit earned before paying the tax = 700/4 = 175 lakhs
The average profit earned after paying the tax = 244/4 = 61 lakhs
Hence, the difference between the average profit earned by the company before paying the
tax and after paying the tax is 175 - 61 = 114 lakhs.
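The arithmetic of Example 2 can be checked with the following Python sketch. The dictionary layout and the variable names are illustrative assumptions; the figures are those of the table in the example, read as (profit before tax, profit after tax) for each year.

```python
# Profit in lakhs: year -> (before tax, after tax), from Example 2.
profit = {
    2014: (195, 80),
    2015: (200, 87),
    2016: (165, 45),
    2017: (140, 32),
}

# Each cluster of a multiple bar diagram holds two adjacent bars per year.
for year, (before, after) in profit.items():
    print(f"{year} before | {'#' * (before // 10)} {before}")
    print(f"{year} after  | {'#' * (after // 10)} {after}")

# Year of maximum profit before tax and year of minimum profit after tax.
max_before_year = max(profit, key=lambda y: profit[y][0])
min_after_year = min(profit, key=lambda y: profit[y][1])

# Averages over the four years and their difference.
avg_before = sum(b for b, _ in profit.values()) / len(profit)
avg_after = sum(a for _, a in profit.values()) / len(profit)
difference = avg_before - avg_after

print(max_before_year, min_after_year, avg_before, avg_after, difference)
```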
A component bar diagram is used for comparing two or more sets of statistical data,
like the multiple bar diagram. But, unlike the multiple bar diagram, the bars are stacked in component
bar diagrams. In the construction of a sub-divided bar diagram, bars are drawn with equal width
such that the heights of the bars are proportional to the magnitude of the total frequency. The
bars are positioned with equal space. Each bar is sub-divided into various parts in proportion to
the values of the components. The subdivisions are distinguished by different colours or shades.
If the number of clusters and the categories in the clusters are large, the multiple bar diagram is
not attractive because of the large number of bars. In such situations, a component bar diagram is preferred.
Example 3
Total expenditure incurred on various heads of two schools in a year is given below.
Draw a suitable bar diagram.
Solution
Since we are comparing the amount spent by two schools in a year towards various
expenditures with respect to their total expenditures, a component bar diagram is drawn.
Step 1 : Schools are marked along the X-axis and labeled as “School”.
Step 2 : Expenditure Heads are marked along the Y-axis and labeled as “Expenditure (₹ in
lakhs)”.
Step 3 : Vertical rectangular bars are erected for each school, whose heights are proportional to
their respective total expenditure.
Step 4 : Each vertical bar is split into components in the order of the list of expenditure heads.
Area of each rectangular box is proportional to the frequency of the respective
expenditure head/component. Rectangular boxes for each school are coloured with
different colours. Same colours are applied to the similar expenditure heads for each
school.
Step 5 : Legends are displayed to describe the colours applied to the rectangular boxes drawn for
various expenditure heads.
Percentage bar diagram is another form of component bar diagram. Here, the heights of the
components do not represent the actual values, but percentages. The main difference between
sub-divided bar diagram and percentage bar diagram is that, in the former, the height of the bars
corresponds to the magnitude of the value. But, in the latter, it corresponds to the percentages.
Thus, in the component bar diagram, heights of the bars are different, whereas in the percentage
bar diagram, heights are equal corresponding to 100%. Hence, percentage bar diagram will be
more appealing than sub-divided bar diagram. Also, comparison between components is much
easier using percentage bar diagram.
Example 4. Total expenditure incurred on various heads of two schools in a year is given
below. Draw the percentage sub-divided bar diagram.
Expenditure head | School I | School II
Construction/Repairs | 80 | 90
Computers | 35 | 50
Laboratory | 30 | 25
Watering plants | 45 | 40
Library books | 40 | 30
Total | 230 | 235
Also find (i) the percentage of the amount spent on computers in School I and (ii) the
expenditure heads on which School II spent more than School I.
Solution:
Since we are comparing the amount spent by two schools in a year towards various expenditures
with respect to their total expenditures in percentages, a percentage bar diagram is drawn.
Step 1 : Schools are marked along the X-axis and labeled as “School”.
Step 2 : Amounts spent in percentages are marked along the Y-axis and labeled as “Percentage of
Expenditure (₹ in lakhs)”.
Step 3 : Vertical rectangular bars are erected for each school, whose heights are taken to be
hundred.
Step 4 : Each vertical bar is split into components in the order of the list of percentage
expenditure heads. Area of each rectangular box is proportional to the percentage of frequency of
the respective expenditure head/component. Rectangular boxes for each school are coloured with
different colours. Same colours are applied to the similar expenditure heads for each school.
Step 5 : Legends are displayed to describe the colours applied to the rectangular boxes drawn for
various expenditure heads.
(ii) School II spent about 38% of its total expenditure on Construction/Repairs, which is more than School I.
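The heights of the components in a percentage bar diagram are obtained by dividing each amount by the school's total and multiplying by 100. The Python sketch below illustrates this for Example 4. It assumes that the first column of figures belongs to School I and the second to School II; the dictionary and variable names are also assumptions made for illustration.

```python
# Expenditure (in lakhs) per head: (School I, School II), from Example 4.
expenditure = {
    "Construction/Repairs": (80, 90),
    "Computers": (35, 50),
    "Laboratory": (30, 25),
    "Watering plants": (45, 40),
    "Library books": (40, 30),
}

total_1 = sum(a for a, _ in expenditure.values())  # total for School I
total_2 = sum(b for _, b in expenditure.values())  # total for School II

# Percentage share of each head, so that every bar has height 100.
percent = {
    head: (a * 100 / total_1, b * 100 / total_2)
    for head, (a, b) in expenditure.items()
}

for head, (p1, p2) in percent.items():
    print(f"{head}: School I {p1:.1f}%, School II {p2:.1f}%")

# (i) Share of computers in School I.
computers_school_1 = percent["Computers"][0]

# (ii) Heads on which School II spent a larger amount than School I.
school_2_more = [head for head, (a, b) in expenditure.items() if b > a]
print(round(computers_school_1, 1), school_2_more)
```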
5) Pie Diagram
The Pie diagram is a circular diagram. As the diagram looks like a pie, it is given this
name. A circle, which has 360°, is divided into different sectors. Angles of the sectors, subtended
at the center, are proportional to the magnitudes of the frequency of the components.
Procedure:
The following procedure can be followed to draw a Pie diagram for a given data:
(ii) Compute the angle for each component using the formula
Angle = (Class frequency / N) × 360°, where N is the total of the class frequencies.
(iv) Draw the first sector in the anti-clockwise direction at an angle calculated for the first
component.
(v) Draw the second sector adjacent to the first sector at an angle corresponding to the
second component.
Example 5.
Draw a pie diagram for the following data (in hundreds) of household expenditure of a
family.
Items Expenditure
Food 87
Clothing 24
Recreation 11
Education 13
Rent 25
Miscellaneous 20
Also find
(i) The central angle of the sector corresponding to the expenditure incurred on Education
(ii) By how much percentage the recreation cost is less than the Rent.
Solution
The following procedure is followed to draw a Pie diagram for a given data:
(ii) Compute angles for each component food, clothing, recreation, education, rent and
miscellaneous using the formula (Class frequency / N) × 360°.
(iv) Draw the first sector in the anti-clockwise direction at an angle calculated for the first
component food.
(v) Draw the second sector adjacent to the first sector at an angle corresponding to the
second component clothing.
(vi) This process is continued for all the components namely recreation, education, rent
and miscellaneous.
The total expenditure is 87 + 24 + 11 + 13 + 25 + 20 = 180. The central angle of the sector
corresponding to the expenditure incurred on Education is (13 / 180) × 360° = 26°.
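The sector angles for Example 5 follow from the formula (class frequency / N) × 360°. The Python sketch below is an illustration with assumed variable names. The last line gives one reading of part (ii), namely the shortfall of Recreation expressed as a percentage of Rent.

```python
# Household expenditure (in hundreds), from Example 5.
expenditure = {
    "Food": 87,
    "Clothing": 24,
    "Recreation": 11,
    "Education": 13,
    "Rent": 25,
    "Miscellaneous": 20,
}

total = sum(expenditure.values())  # N, the total of all components

# Central angle of each sector in degrees.
angles = {item: value * 360 / total for item, value in expenditure.items()}

for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")

# (i) Central angle for Education.
education_angle = angles["Education"]

# (ii) Shortfall of Recreation relative to Rent, as a percentage of Rent.
recreation_below_rent = (
    (expenditure["Rent"] - expenditure["Recreation"]) * 100 / expenditure["Rent"]
)
print(education_angle, recreation_below_rent)
```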
Types of Graphs
Graphical representation can be advantageous to bring out the statistical nature of the
frequency distribution of quantitative variable, which may be discrete or continuous.
1. Histogram
2. Frequency Polygon
3. Frequency Curve
4. Cumulative Frequency Curves (Ogives)
1. Histogram
Limitations:
We cannot construct a histogram for distribution with open-ended classes. The histogram
is also quite misleading, if the distribution has unequal intervals.
Example 1
The following table shows the time taken (in minutes) by 100 students to travel to school on a
particular day
Solution
Since we are displaying the distribution of time taken (in minutes) by 100 students to
travel to school on a particular day in visual form, the histogram is drawn.
Step 1 : Time taken are marked along the X-axis and labeled as “Time (in minutes)”.
Step 2 : Number of students are marked along the Y-axis and labeled as “No. of students”.
Step 3 : Corresponding to each time taken, a vertical attached bar is drawn whose height is
proportional to the number of students.
2. Frequency Polygon
Frequency polygon is drawn after drawing histogram for a given frequency distribution.
The area covered under the polygon is equal to the area of the histogram. Vertices of the polygon
represent the class frequencies. Frequency polygon helps to determine the classes with higher
frequencies. It displays the tendency of the data. The following procedure can be followed to
draw frequency polygon:
(i) Mark the midpoints at the top of each vertical bar in the histogram representing the
classes.
Example 2
A firm reported that its Net Worth in the years 2011-2016 is as follows.
Solution:
Since we are displaying the distribution of Net worth in the years 2011-2016, the
Frequency polygon is drawn to determine the classes with higher frequencies. It displays the
tendency of the data.
Step 1: Years are marked along the X-axis and labeled as 'Year'.
Step 2: Net worth is marked along the Y-axis and labeled as 'Net Worth (in lakhs of ₹)'.
Step 3: Mark the midpoints at the top of each vertical bar in the histogram representing the year.
Step 4:Connect the midpoints by line segments. The Frequency polygon is presented in Figure 2.
3. Frequency Curve
Example 3
The ages of group of pensioners are given in the table below. Draw the Frequency curve
to the following data.
Since we are displaying the distribution of Age and Number of Pensioners, the Frequency
curve is drawn, to provide better understanding about the age and number of pensioners than
frequency polygon.
Step 1 : Ages are marked along the X-axis and labeled as 'Age'.
Step 2 : Number of pensioners are marked along the Y-axis and labeled as 'No. of Pensioners'.
Step 3 : Mark the midpoints at the top of each vertical bar in the histogram representing the age.
Step 4 : Connect the midpoints by line segments by smoothing the vertices of the frequency
polygon.
'more than' cumulative frequencies. The following procedure can be followed to draw the ogive
curves:
Less than Ogive: Less than cumulative frequency of each class is marked against the
corresponding upper limit of the respective class. All the points are joined by a free-hand curve
to draw the less than ogive curve.
More than Ogive: More than cumulative frequency of each class is marked against the
corresponding lower limit of the respective class. All the points are joined by a free-hand curve
to draw the more than ogive curve.
Example 4.1
Draw the less than Ogive curve for the following data:
Daily Wages (in Rs.) 70-80 80-90 90-100 100-110 110-120 120-130 130-140 140-150
No. of workers 12 18 35 42 50 45 20 8
Also, find
(ii) The number of workers whose daily wages are less than ₹ 125.
Solution:
Since we are displaying the distribution of Daily Wages and No. of workers, the Ogive
curve is drawn, to provide better understanding about the wages and No. of workers.
The following procedure can be followed to draw Less than Ogive curve:
Step 1 : Daily wages are marked along the X-axis and labeled as “Wages (in ₹)”.
Step 2 : No. of Workers are marked along the Y-axis and labeled as “No. of workers”.
Step 3 : Find the less than cumulative frequency, by taking the upper class-limit of daily wages.
The cumulative frequency corresponding to any upper class-limit of daily wages is the
sum of all the frequencies less than the limit of daily wages.
Step 4 : The less than cumulative frequency of Number of workers are plotted as points against
the daily wages (upper-limit). These points are joined to form less than ogive curve.
Daily wages (less than) | No. of workers
80 12
90 30
100 65
110 107
120 157
130 202
140 222
150 230
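The less than cumulative frequencies of Example 4.1 can be generated with a running total, as in the Python sketch below. The function name read_ogive and the other names are assumptions. The text joins the points by a free-hand curve; the sketch instead joins them by straight lines, so the value read at 125 is only an approximation of what would be read from the drawn curve.

```python
from itertools import accumulate

# Class boundaries and frequencies from Example 4.1.
lowest_limit = 70
upper_limits = [80, 90, 100, 110, 120, 130, 140, 150]
workers = [12, 18, 35, 42, 50, 45, 20, 8]

# Less than cumulative frequency: running total of the class frequencies.
less_than_cf = list(accumulate(workers))

# Points of the ogive, starting with zero workers at the lowest boundary.
points = list(zip([lowest_limit] + upper_limits, [0] + less_than_cf))


def read_ogive(points, x):
    """Read the curve at x, joining neighbouring points by straight lines."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the plotted range")


# Approximate number of workers with daily wages below 125.
below_125 = read_ogive(points, 125)
print(less_than_cf, below_125)
```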
Example 4.2
The following table shows the marks obtained by 120 students of class IX in cycle test-I.
Draw the more than Ogive curve for the following data:
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100
No. of students 2 6 8 20 30 22 18 8 4 2
Also, find
Solution:
Since we are displaying the distribution marks and No. of students, the more than Ogive
curve is drawn, to provide better understanding about the marks of the students and No. of
students.
The following procedure can be followed to draw More than Ogive curve:
Step 1 : Marks of the students are marked along the X-axis and labeled as 'Marks'.
Step 2 : No. of students are marked along the Y-axis and labeled as 'No. of students'.
Step 3 : Find the more than cumulative frequency, by taking the lower class-limit of marks. The
cumulative frequency corresponding to any lower class limit of marks is the sum of all
the frequencies above the limit of marks.
Step 4 : The more than cumulative frequency of number of students are plotted as points against
the marks (lower-limit). These points are joined to form more than ogive curve.
Marks (more than) | No. of students
0 120
10 118
20 112
30 104
40 84
50 54
60 32
70 14
80 6
90 2
Figure 2. More than Ogive curve for Marks and No. of students
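The more than cumulative frequencies of Example 4.2 can be obtained by accumulating the class frequencies from the last class backwards. The following Python sketch shows one way to do this; the variable names are assumptions.

```python
from itertools import accumulate

# Lower class limits and frequencies from Example 4.2.
lower_limits = list(range(0, 100, 10))
students = [2, 6, 8, 20, 30, 22, 18, 8, 4, 2]

# Running total taken from the highest class down, then put back in order.
more_than_cf = list(accumulate(reversed(students)))[::-1]

# Each pair (lower limit, cumulative frequency) is a point on the ogive.
for limit, cf in zip(lower_limits, more_than_cf):
    print(f"more than {limit}: {cf}")
```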
Example 3
Draw the Less than and More than Ogive curves for the following data. Also, find the median
using the Ogive curves. Graphically,
(i) find the number of trees which yield mangoes of less than 55 kg.
(ii) find the number of trees which yield mangoes of more than 75 kg.
Yield (in kg) 40-50 50-60 60-70 70-80 80-90 90-100 Total
No. of trees 10 15 17 14 12 2 70
Solution:
Since we are displaying the distribution of Yield and No. of trees, the Ogive curve is
drawn, to provide better understanding about the Yield and No. of trees
Step 1 : Yield of mangoes is marked along the X-axis and labeled as 'Yield (in Kg.)'.
Step 2 : No. of trees are marked along the Y-axis and labeled as 'No. of trees'.
Step 3 : Find the less than cumulative frequency, by taking the upper class-limit of Yield of
mangoes. The cumulative frequency corresponding to any upper class-limit of Mangoes
is the sum of all the frequencies less than the limit of mangoes.
Step 4 : Find the more than cumulative frequency, by taking the lower class-limit of Yield of
mangoes. The cumulative frequency corresponding to any lower class-limit of Mangoes
is the sum of all the frequencies above the limit of mangoes.
Step 5 : The less than cumulative frequency of Number of trees are plotted as points against the
yield of mangoes (upper-limit). These points are joined to form less than ogive curve.
Step 6 : The more than cumulative frequency of Number of trees are plotted as points against the
yield of mangoes (lower-limit). These points are joined to form more than ogive curve.
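The graphical readings for the mango data can be approximated numerically. The Python sketch below is an illustration under two assumptions: the points of each ogive are joined by straight lines rather than a free-hand curve, and the median is read where the less than ogive reaches half of the total number of trees, which is also where the two ogives cross. The function names y_at and x_at are hypothetical.

```python
from itertools import accumulate

# Class boundaries and frequencies of the mango yield data.
limits = [40, 50, 60, 70, 80, 90, 100]
trees = [10, 15, 17, 14, 12, 2]
total = sum(trees)

# Less than ogive: cumulative totals plotted against upper limits.
less_than = list(zip(limits, [0] + list(accumulate(trees))))
# More than ogive: reverse cumulative totals plotted against lower limits.
more_than = list(zip(limits, list(accumulate(reversed(trees)))[::-1] + [0]))


def y_at(points, x):
    """Read the curve height at x using straight-line segments."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the plotted range")


def x_at(points, y):
    """Read the x value at which a rising curve reaches height y."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= y <= y1 and y1 != y0:
            return x0 + (x1 - x0) * (y - y0) / (y1 - y0)
    raise ValueError("y lies outside the plotted range")


trees_below_55 = y_at(less_than, 55)   # (i) yield less than 55 kg
trees_above_75 = y_at(more_than, 75)   # (ii) yield more than 75 kg
median = x_at(less_than, total / 2)    # yield at which the ogives cross

print(trees_below_55, trees_above_75, round(median, 2))
```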
Data may be presented in the form of tables as well as using diagrams and graphs. Tables
can be compared with graphs and diagrams on the basis of various characteristics as follows:
(i) Table contains precise and accurate information, whereas graphs and diagrams give
only an approximate idea.
(ii) More information can be presented in tables than in graphs and diagrams.
(iii) Tables require careful reading and are difficult to interpret, whereas diagrams and
graphs are easily interpretable.
(iv) For common men, graphs and diagrams are attractive and more appealing than tables.
(v) Diagrams and graphs exhibit the inherent trends in the distribution easily on
comparable mode than the tables.
(i) Diagrams can be drawn on plain papers, whereas graphs require graph papers.
(ii) Diagrams are appropriate and effective to present information about one or more
variables. Normally, it is difficult to draw graphs for more than one variable in the
same graph.
(iii) Graphs can be used for interpolation and/or extrapolation, but diagrams cannot be
used for this purpose.
(iv) Median can be determined using graphs, but not using diagrams.
(v) Diagrams can be used for comparison of data/variables, whereas graphs can be used
for determining the relationship between variables.