Chapter One - Introduction To Statistics
Meaning of Statistics
A.L. Bowley defined statistics as “the science of counting”. He later redefined it as
“the science of averages”.
Characteristics of statistics
The features or characteristics of statistics are as follows:
1. Statistics means an aggregate of facts.
Statistics are a group of facts. A single fact is not called statistics; a collection of
many facts is. Facts can be analysed only when there is more than one of them.
For example – the ages of a group of persons, the heights of students in a class, the price
of a product over a number of periods, the profits of a group of companies, and the profit
of a company over a number of periods are called statistics.
Functions of Statistics
The important functions of statistics are as follows:
1. It presents the data (facts) in a definite form
It presents the data or facts in a simple and definite form. Facts, statements or
results expressed in numbers are more convincing and clearer than those expressed
in qualitative terms.
3. It facilitates comparison
Once the data have been simplified, statistics makes it possible to compare them.
Statistical measures such as averages, ratios and coefficients are used for the purpose
of comparison.
Limitations of statistics
The following are the important limitations of statistics:
1. Statistics does not deal with qualitative data. It deals only with quantitative data
Statistics can be applied only to quantitative data, i.e., data that can be measured in
numbers, such as price, salary, income, expenses, height, weight etc.
Statistics cannot be applied to qualitative data, i.e., data that cannot be measured in
numbers, such as honesty, integrity, loyalty, taste, culture, friendship, wisdom etc.
Conclusion
Statistics has limitations, but it is still useful and helpful in studying various
problems. It must, however, be used by experts with proper care and caution.
Units or Individuals
The objects whose characteristics are studied in a statistical survey are called units or
individuals.
Population or Universe
The totality or collection of all units or individuals (objects) under consideration is
called the population or universe.
For example – all the students in a particular school or college, all the
companies in a particular region etc.
Finite population
A population which contains a finite (limited) number of units is called a finite population.
For example – the students in a school or college, the text books in a
library etc.
Infinite population
A population which contains an unlimited number of units, or so many that they cannot
practically be counted, is called an infinite population.
For example – the stars in the sky, the fish in the sea or ocean etc.
Characteristics
The units or individuals (objects) to be studied have some characteristics. These can be
quantitative or qualitative characteristics.
Quantitative characteristics
A characteristic which is numerically measurable is called a quantitative characteristic.
For example – marks of a student, attendance of a student, price of a product, height
of a person, weight of a person etc.
Qualitative characteristic
A characteristic which is not numerically measurable is called qualitative
characteristic.
For example – taste of a fruit, skin colour of a person etc.
Variable
A quantitative characteristic which varies from unit to unit is called variable.
For example – marks, height, weight, price etc.
Attribute
A qualitative characteristic which varies from unit to unit is called attribute.
For example – region, colour, taste etc.
Discrete variable
A variable which assumes only specified values in a given range is called discrete
variable.
Continuous variable
A variable which assumes all the values in the range is called continuous variable.
Data
Data means information (facts and figures) collected, from which conclusions are
drawn.
Quantitative data
Data which are expressed in numbers are called quantitative data.
Qualitative data
Data which are not expressed in numbers are called qualitative data.
Sample survey
A study or investigation which is based on a part of the population is called a sample
survey.
Sample
A sample is the part of the population or universe which is selected for the purpose of
study or investigation.
A sample is meant to be a representative part of the population.
Sampling
The process of extracting a sample from a population is called sampling
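As a minimal sketch of sampling, the Python snippet below draws a simple random sample from a small finite population. The population of 50 numbered students, the sample size of 5 and the helper name draw_sample are assumptions made only for illustration, and simple random sampling is just one of several possible sampling methods.

```python
import random

def draw_sample(population, size, seed=None):
    """Return a simple random sample (without replacement) of the given size."""
    rng = random.Random(seed)   # a fixed seed makes the draw repeatable
    return rng.sample(population, size)

# Hypothetical finite population: roll numbers of 50 students
population = list(range(1, 51))
sample = draw_sample(population, size=5, seed=1)
print(sample)  # five distinct roll numbers taken from the population
```

Each unit of the population has the same chance of being selected, which is what makes the sample likely to be representative.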
Primary data
Primary data are fresh data collected for the first time directly from the field by the
investigator. They are also called first-hand data.
Secondary data
Secondary data are data which the investigator does not collect directly from the field.
They are data which have been collected by others for some other purpose.
For example
Journals, newspapers, periodicals
Websites on the internet etc.
Investigator
Investigator is the person who conducts the statistical enquiry.
Enumerator
Enumerator is the person who collects the information for the investigator.
Respondents (informants)
Respondents are the persons from whom the information is collected.
Series
Series refers to an arrangement of data in a logical or specific order such as size, time
of occurrence or any other characteristics (measurable or non-measurable)
Types of series
There are three types of series
Individual series
Discrete series
Continuous series
Individual series
It is a series of the values of each unit or individual observation.
The individual series can be arranged in two different ways
In ascending order
Arranging the data from the smallest value to largest value.
In descending order
Arranging the data from the largest value to the smallest value.
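The two arrangements of an individual series can be sketched in Python as follows. The marks used here are hypothetical values chosen only for illustration.

```python
# Hypothetical individual series: marks of five students
marks = [45, 72, 38, 90, 61]

ascending = sorted(marks)                 # smallest value to largest value
descending = sorted(marks, reverse=True)  # largest value to smallest value

print(ascending)   # [38, 45, 61, 72, 90]
print(descending)  # [90, 72, 61, 45, 38]
```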
Discrete series
It is a series which shows the specific values of the variable and the corresponding
frequencies. The values of the variable are not repeated in a discrete series.
Continuous series
It is a series which shows the values of the variable grouped into class intervals and the
corresponding frequencies.
Frequency distribution
A systematic presentation of the values of a variable and the corresponding frequency
is called frequency distribution.
Frequency
It refers to the number of times a value of the variable is repeated in the series.
Class frequency
It refers to the number of observations relating to a particular class.
Frequency table
A tabular presentation of frequency distribution is called frequency table.
Discrete frequency distribution
A discrete frequency distribution is a presentation of the specific values of a variable and
the corresponding frequencies.
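A discrete frequency distribution can be built by counting how many times each value occurs. The sketch below does this with Python's collections.Counter; the data (number of children in ten families) are hypothetical.

```python
from collections import Counter

# Hypothetical data: number of children in 10 families
children = [2, 1, 3, 2, 0, 2, 1, 3, 2, 1]

frequency = Counter(children)   # maps each value to its frequency

# Print the frequency table with the values in ascending order
for value in sorted(frequency):
    print(value, frequency[value])
```

Each value of the variable appears once in the table, with its frequency beside it, and the frequencies add up to the number of observations.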
Class Intervals
In a frequency distribution, if the range is vast, it is divided into sub-ranges (groups)
called class intervals.
Class
Each of the sub-ranges (groups) is called a class.
Class limit
The lowest and the highest values taken to define the boundaries of each class are
called class limits.
Lower limit
The lowest value of the class is called lower limit.
Upper limit
The highest value of the class is called upper limit.
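The following sketch groups raw observations into class intervals and counts the class frequencies. It assumes classes of equal width in which the lower limit is included and the upper limit is excluded; this convention, the function name class_frequencies and the weights are assumptions made for illustration.

```python
def class_frequencies(values, start, width, n_classes):
    """Count the observations falling in each class [lower, upper)."""
    table = []
    for i in range(n_classes):
        lower = start + i * width    # lower limit of the class
        upper = lower + width        # upper limit of the class (excluded)
        count = sum(1 for v in values if lower <= v < upper)
        table.append((lower, upper, count))
    return table

# Hypothetical weights (in kg) of 12 persons
weights = [42, 47, 51, 55, 58, 60, 63, 66, 69, 71, 74, 88]
table = class_frequencies(weights, start=40, width=10, n_classes=5)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```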
Data
Data is defined as a collection of numbers, words, characters, images and other items that
can be arranged in some manner to form meaningful information.
Types of data
The data is divided into two categories based on the source from which they are obtained
1. Primary data
2. Secondary data
1. Primary Data
Primary data are the data that are collected for the first time by an investigator for a
specific purpose. Primary data are the fresh data that are directly collected from the
field. They are the first hand data
c) Telephonic interview
The investigator gets the data or information over the telephone. This method is
quick and provides accurate information.
e) Method of questionnaire
A questionnaire is a list of questions to be filled in by the informants; their
answers are the data or information required for the investigation. The
questionnaire is sent to the informants by mail or otherwise. The informants are
required to fill in the questionnaire and send it back to the investigator. Thus
the investigator obtains the required data or information.
2. Secondary data
Secondary data are the data which the investigator does not collect directly from the
field. They are the data which he borrows from others who have collected them for some
other purpose.
Comparison of primary data and secondary data
Definition
Primary data: Data that are collected for the first time.
Secondary data: Data that have already been collected by some other person.
Originality
Primary data: They are original because they are collected by the investigator for the
first time.
Secondary data: They are not original because they were collected by some other person
for their own purpose.
Nature of data
Primary data: They are in the form of raw material.
Secondary data: They are in finished form.
Reliability and suitability
Primary data: They are more reliable and suitable because they are collected for a
particular purpose.
Secondary data: They are less reliable because they were collected by some other person
for their own purpose, which may not match that of the investigator.
Time and money
Primary data: Collecting them is quite expensive because it requires both time and money.
Secondary data: They are economical because they require less time and money.
Precaution and editing
Primary data: Precaution and editing are not required because the data are collected for
a particular purpose.
Secondary data: Precaution and editing are required because the data were collected by
some other person for their own purpose.
Process
Primary data: Collecting them involves a lengthy process.
Secondary data: Collecting them does not involve much process; it is quick and easy.
Sources
Primary data: Surveys, experiments, observations, personal interviews etc.
Secondary data: Websites, government publications, journals, articles etc.
Classification
Classification is a systematic grouping of units according to their common characteristics.
Each of these groups is called class.
Types of classification
There are four types of classification. They are
1. Quantitative classification
2. Qualitative classification
3. Spatial classification
4. Temporal classification
1. Quantitative classification
Classification of units on the basis of quantitative characteristics (variable) such as
age, height, weight, income, etc. is quantitative classification.
For example
Weight:          40-50   50-60   60-70   70-80   80-90   90-100
No. of persons:  50      200     260     360     90      40
2. Qualitative classification
Classification of units on the basis of qualitative characteristics (attribute) such as
gender, literacy, colour, taste etc., is qualitative classification.
2. Manifold classification
Classification of units on the basis of two or more characteristics is called manifold
classification.
Tabulation
Tabulation is the process of systematically arranging data in the rows and columns of a
table. It is a neat form of presentation of classified data.
Objectives of Tabulation
1. To present the data in a simple and understandable manner.
2. To facilitate comparison of the data
3. To give an identity to the data
4. To facilitate quick location of required data.
Parts of a table
1. Table Number
A number should be given for each table when there are large numbers of tables. It is
for identification and future reference.
2. Title
A title should be given to the table. The title should describe the content of the table.
It should be clear and brief.
3. Headnote
It is a brief note that applies to all or a major part of the data in the table.
For example: units of measurement such as in lakhs, in crores, in kilograms, in rupees, in
millions, etc.
4. Stub
It refers to the row headings. It explains what the row represents.
5. Caption
It refers to the column headings. It explains what the column represents.
7. Footnote
It is a brief and precise explanation clarifying anything in the table.
For example – abbreviations etc.
8. Source
It indicates the source from which the data are collected.
Types of diagram
1. Simple bar diagram
2. Component bar diagram
3. Percentage bar diagram
4. Multiple bar diagram
5. Pictogram
6. Pie diagram or pie chart
5. Pictogram
Pictograms are diagrammatic representations of statistical data using pictures that
resemble the items represented. They are very useful in attracting attention and are
easily understood.
Graph
A graph is a mathematical diagram which shows the relationship between two or more sets of
numbers or measurements.
A graph is a pictorial representation or diagram that represents data or values in an organised
manner.
Types of Graph
1. Histogram
2. Frequency polygon
3. Frequency curve
4. Ogives (cumulative frequency curves)
1. Histogram
A histogram is drawn for a continuous frequency distribution. It is a set of
adjacent rectangles whose widths are proportional to the widths of the class intervals and
whose heights (when the class intervals are of equal width) are proportional to the class
frequencies. Class intervals are taken on the X axis and frequencies on the Y axis. The
graph formed by this series of rectangles adjacent to one another is the histogram.
2. Frequency polygon
A frequency polygon is a graph that displays the frequencies of data values as points
connected by straight lines. It is obtained by plotting midvalues of class interval (or
midpoints) on the x-axis and their corresponding frequencies on the y-axis.
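The points of a frequency polygon can be worked out as in the sketch below, which uses the weight classes and frequencies from the quantitative classification example in this chapter. Only the calculation of the points is shown; drawing the graph is left to whatever plotting tool is at hand, and the variable names are illustrative.

```python
# Class limits and frequencies from the weight classification example
class_limits = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
frequencies = [50, 200, 260, 360, 90, 40]

# Mid-value of a class = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Each point of the polygon is (mid-value on the x-axis, frequency on the y-axis)
points = list(zip(midpoints, frequencies))
print(points)
```

Joining these points in order with straight lines gives the frequency polygon.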
3. Frequency curve
A frequency curve is a smoothed line graph that displays the frequencies of data values
as points connected by a smooth curve. It is obtained by plotting the mid-values of the
class intervals (or midpoints) on the x-axis and their corresponding frequencies on the
y-axis.
Diagram: suitable for showing categorical and geographical data
Graph: suitable for showing time series and frequency distributions
Central Tendency
The property of concentration of the observations around a central value is called central
tendency. The central value around which there is concentration is called measure of central
tendency.
Objectives of Averaging
1. To present the entire data in a single value which describes the characteristics of the
entire data
2. To facilitate comparison of data
3. To facilitate further statistical treatment of data
4. To provide data for decision making
1. Arithmetic mean
Arithmetic mean is the quotient obtained by dividing the sum of the observations by
the number of observations.
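The definition translates directly into a short calculation. The sketch below is one way to write it in Python; the marks are hypothetical and the function name arithmetic_mean is chosen only for illustration.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

# Hypothetical marks of five students
marks = [40, 50, 60, 70, 80]
mean_marks = arithmetic_mean(marks)
print(mean_marks)  # 60.0  (300 / 5)
```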
2. Median
Median of a set of values is the middle most value when they are arranged in the
ascending order of magnitude.
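A minimal sketch of locating the median is given below. When the number of values is even there is no single middle value, so the sketch takes the average of the two middle values; that is the usual convention, but it is an assumption here since the definition above does not state it.

```python
def median(values):
    """Middle value of the data after arranging them in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the two middle values

print(median([7, 3, 9, 5, 1]))  # 5    (ordered: 1, 3, 5, 7, 9)
print(median([7, 3, 9, 5]))     # 6.0  (ordered: 3, 5, 7, 9)
```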
Merits of median
a) It is easy to understand and calculate
b) It is useful in the case of open-end classes
c) It is not influenced by the magnitude of extreme deviation from it
d) It is most appropriate average of qualitative data
e) It indicates the value of the middle item in the distribution
Demerits of median
a) It is necessary to arrange the data for calculating median
b) It is a positional average, therefore each and every observation is not considered
c) It may not always be representative of the observations as it ignores the extreme
values.
d) It is erratic if the number of items is small
e) It is not capable of further algebraic treatment as it is not based on mathematical
property.
Mode
Mode is the value which has the highest frequency, i.e., the most frequently occurring
value. It is the value which is repeated the maximum number of times.
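The mode can be located by counting frequencies, as in the sketch below. The function name modes and the data are illustrative; the function returns every value that shares the highest frequency, because a series may have more than one such value.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 5, 3, 2]))  # [3]     3 occurs three times
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  two values tie for the highest frequency
```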
Merits of mode
a) It is simple to calculate. In most cases, it is located by inspection
b) It is not unduly affected by extreme values.
c) It can be determined even in open-end classes
d) It is used to determine average of qualitative data
e) It can also be determined graphically
Demerits of mode
a) It cannot always be determined.
b) It is not capable of algebraic treatment.
c) It is not based on each and every item of the series
d) It is not rigidly defined. There are several formulae for determining the mode, which
give different answers.
Symmetrical distribution
In a symmetrical distribution the values of mean, median and mode coincide. The spread of
the frequencies (observations) is the same on both sides of the centre point of the curve.
Asymmetrical distribution
A distribution which is not symmetrical is called asymmetrical or skewed distribution. Such a
distribution can be positively skewed distribution or negatively skewed distribution.
1. Zero skewness
It means the majority of the data points or values are concentrated around the average. Its
left and right sides are mirror images. The frequency (number of observations) rises
gradually to the highest point and then falls gradually at the same rate. It is a perfect
bell-shaped curve. The values of the mean, median and mode are equal.
Mean = Median = Mode
2. Positive skewness
It means the majority of the data points or values are concentrated on the left side of the
distribution, below the mean. The frequency (number of observations) rises quickly and,
after reaching the highest point, falls slowly, so the distribution has a long tail on the
right side. The value of the mean is highest, the mode is lowest, and the median lies
between the mean and the mode.
Mean > Median > Mode
3. Negative skewness
It means the majority of the data points or values are concentrated on the right side of the
distribution, above the mean. The frequency (number of observations) rises slowly and,
after reaching the highest point, falls quickly, so the distribution has a long tail on the
left side. The value of the mode is highest, the mean is lowest, and the median lies
between the mean and the mode.
Mean < Median < Mode
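The relationships between mean, median and mode described above can be checked on small data sets. The sketch below uses Python's statistics module; the function name describe_skew and the three data sets are hypothetical, and the comparison is only a rule of thumb rather than a formal measure of skewness.

```python
import statistics

def describe_skew(values):
    """Classify skewness by comparing mean, median and mode (rule of thumb)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    if mean > median > mode:
        return "positive"
    if mean < median < mode:
        return "negative"
    if mean == median == mode:
        return "zero"
    return "unclear"

right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 8, 10]    # mean 4, median 3, mode 2
left_tailed = [11, 10, 10, 10, 9, 9, 8, 7, 4, 2]  # mean 8, median 9, mode 10
symmetrical = [1, 2, 2, 3, 3, 3, 4, 4, 5]         # mean 3, median 3, mode 3

print(describe_skew(right_tailed))  # positive
print(describe_skew(left_tailed))   # negative
print(describe_skew(symmetrical))   # zero
```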
Kurtosis
Kurtosis is a statistical measure of the shape of a frequency distribution. It
provides information about the tails and peak of the frequency distribution compared to the
normal distribution. Tail means the values at the extremes and peak means the values around
the average.
Kurtosis refers to the extent of the presence of extreme values (tails) in the frequency
distribution. It tells us how the data points or values are spread around the tails, and so
describes the shape of the frequency distribution. Kurtosis is a measure of whether
the data are heavy-tailed or light-tailed.
Types of kurtosis
1. Leptokurtic (K > 3)
2. Platykurtic (K < 3)
3. Mesokurtic (K = 3)
1. Leptokurtic (K > 3)
Its kurtosis is greater than that of a mesokurtic distribution and it has longer, heavier
tails. It means more values are located at the tails (extremes or outliers) than in a
normal distribution.
The tails are long and the peak is high.
2. Platykurtic (K < 3)
Its kurtosis is lower than that of a mesokurtic distribution and it has shorter, lighter
tails. It means fewer values are located at the tails (extremes or outliers) than in a
normal distribution.
The tails are short and the peak is flat.
3. Mesokurtic (K = 3)
It lies between leptokurtic and platykurtic, i.e., it is the same as the normal distribution.
The tails are moderate and the peak is moderate.
It is a symmetrical distribution, which means both the extreme ends are similar.
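The notes compare K with 3 but do not give a formula for it. The sketch below assumes the usual moment coefficient of kurtosis, K = m4 / m2², where m2 and m4 are the second and fourth moments about the mean; the function names and the two small data sets are hypothetical.

```python
def kurtosis_coefficient(values):
    """Return K = m4 / m2**2 using moments about the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2

def kurtosis_type(k, tolerance=1e-9):
    """Name the type of kurtosis by comparing K with 3."""
    if abs(k - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

flat = [1, 2, 3, 4, 5]                        # K = 6.8 / 2**2 = 1.7
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -3, 3]  # K is about 5

print(kurtosis_type(kurtosis_coefficient(flat)))          # platykurtic
print(kurtosis_type(kurtosis_coefficient(heavy_tailed)))  # leptokurtic
```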
3. Risk Management
Statistics helps managers to assess and manage risks and take business decisions.
It provides tools such as risk analysis, Monte Carlo simulation, sensitivity
analysis and decision trees, which help managers to evaluate various risks, simulate
various scenarios and take decisions which minimize risks and maximize opportunities.
4. Forecasting and prediction
Statistics helps managers to forecast future trends and outcomes based on historical
data. It provides tools such as time series analysis, regression analysis and forecasting
models, which help in predicting future trends, demand patterns and business conditions.
This information is essential for planning, budgeting, resource allocation and setting
realistic goals.
9. Strategic planning
Statistics helps managers in strategic planning initiatives. It provides details
relating to market conditions, competitive dynamics and industry trends. Statistical
analysis helps managers to identify opportunities, assess market potential and formulate
strategies to achieve the goals and objectives of the organization.
1. Descriptive statistics
a) Descriptive statistics summarise and describe the features of a data set.
b) Measures of central tendency such as mean, median and mode help the manager
to understand the average or most typical value of a group.
c) Measures of dispersion such as range, quartile deviation, mean deviation, standard
deviation and coefficient of variation show the variability in the data. They
help the manager to assess the consistency and reliability of processes or
outcomes.
d) Managers use descriptive statistics such as mean, median, mode, mean deviation,
standard deviation and coefficient of variation to understand historical performance,
assess current situations and identify trends or patterns.
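A few of the measures of dispersion named above can be computed as in the sketch below. It treats the data as a whole population (divisor n) when computing the standard deviation; that choice, the function name dispersion_summary and the monthly output figures are assumptions for illustration.

```python
import statistics

def dispersion_summary(values):
    """Return the range, standard deviation and coefficient of variation (%)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population standard deviation (divisor n)
    return {
        "range": max(values) - min(values),
        "standard_deviation": sd,
        "coefficient_of_variation": 100 * sd / mean,
    }

# Hypothetical monthly output of a machine (units)
output = [10, 20, 30, 40, 50]
summary = dispersion_summary(output)
print(summary)
```

A lower coefficient of variation indicates more consistent output, which is how a manager might compare two processes with different averages.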
3. Tabulation
a) Tabulation is the process of systematically arranging data in the rows and columns
of a table. It is a neat form of presentation of classified data.
b) Tabulation reduces large volumes of data into a concise and structured form. This
enables managers to grasp the information quickly and efficiently.
c) Managers can compare different variables and trends with the help of tabulated data.
This comparative analysis helps managers to understand the data effectively
and take decisions.
d) Presentation of data through tables helps in communicating findings to
stakeholders, executives and team members. This promotes
clarity and consensus and supports proper decisions.
4. Correlation analysis
a) Correlation refers to the relationship or association between two or more
variables.
b) Correlation allows managers to identify relationships between different factors or
variables that may influence business outcomes.
c) Correlation allows managers to make predictions about future outcomes based on
historical data by examining the strength and direction of the correlation between
variables.
d) Correlation helps managers to allocate resources more efficiently.
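The strength and direction of a correlation can be expressed as a single number. The sketch below assumes Karl Pearson's coefficient of correlation, which the notes do not name explicitly; the advertising and sales figures and the function name pearson_r are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's coefficient of correlation between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: advertising spend and sales for five periods
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(advertising, sales)
print(round(r, 4))  # 0.7746, a fairly strong positive correlation
```

The sign of r gives the direction of the relationship and its size (between -1 and +1) gives the strength.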
5. Regression analysis
a) Regression analysis examines the relationship between a dependent variable and one or
more independent variables and makes predictions based on this relationship. It includes
linear regression and multiple regression.
b) Linear regression – it establishes the relationship between one independent variable
and the dependent variable, and predicts the dependent variable from changes in the
independent variable. For example, predicting sales based on advertising spend.
c) Multiple regression – it establishes the relationship between several independent
variables or factors and their impact on the dependent variable. It helps managers in
understanding the combined effect of various independent variables or factors on
the dependent variable or outcome.
d) Managers use regression analysis to understand the impact of independent
variables on the dependent variable. It helps in forecasting future trends, optimizing
resource allocation, and decision making.
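As an illustration of linear regression, the sketch below fits a straight line, sales = a + b × advertising, by the method of least squares and uses it to make a prediction. The method of least squares is the standard fitting technique but is an assumption here, since the notes do not name one; the data and the function name fit_line are hypothetical.

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                       # b: change in y per unit change in x
    intercept = mean_y - slope * mean_x     # a: value of y when x is zero
    return intercept, slope

# Hypothetical data: advertising spend (independent) and sales (dependent)
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

a, b = fit_line(advertising, sales)
predicted_sales = a + b * 6   # predicted sales for an advertising spend of 6
print(round(a, 2), round(b, 2), round(predicted_sales, 2))  # 2.2 0.6 5.8
```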
8. Decision analysis
a) Decision analysis supports decision making under uncertainty
and risk by using techniques like decision trees, sensitivity analysis and scenario
analysis.
b) Managers use various techniques like decision trees, Monte Carlo simulation,
sensitivity analysis and scenario analysis in evaluating alternative courses of action,
assessing risks and choosing the optimal strategy or best course of action, the one which
maximizes profitability.
9. Inferential Statistics
a) Inferential statistics involves making inferences or drawing conclusions about a
population based on sample data, using techniques like hypothesis testing and confidence
intervals.
b) Hypothesis testing involves a null hypothesis and an alternative hypothesis, which are
assumptions or statements about the characteristics of the population.
c) There are various tests of hypothesis, such as the Z-test, t-test, chi-square test and
ANOVA, chosen depending on the nature of the data and the hypothesis. The result is
interpreted to decide whether or not to reject the null hypothesis.
d) A confidence level is specified. The confidence interval is the range of values within
which a population parameter is estimated to lie at that level of confidence, usually
95% or 99%.
e) A significance level (α) is specified. It is the probability of rejecting the null
hypothesis when it is actually true. It is usually set at 5% or 1%.
f) Managers use inferential statistics to draw conclusions about a population based
on sample data, make decisions with a stated level of confidence and lead the organization
to success.
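The ideas of a confidence interval and a significance level can be sketched with a Z-test for a population mean. The example assumes that the population standard deviation is known and that the sample mean is approximately normally distributed; the figures (sample mean 50, standard deviation 10, sample size 100, hypothesised mean 48) and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

def z_test(sample_mean, mu0, sigma, n):
    """Two-tailed Z-test of the null hypothesis: population mean = mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

lower, upper = confidence_interval(sample_mean=50, sigma=10, n=100)
z, p_value = z_test(sample_mean=50, mu0=48, sigma=10, n=100)

alpha = 0.05                      # 5% significance level
reject_null = p_value < alpha     # reject the null hypothesis if p-value < alpha
print(round(lower, 2), round(upper, 2))   # 48.04 51.96
print(z, reject_null)                     # 2.0 True
```

Here the hypothesised mean of 48 lies outside the 95% confidence interval, which agrees with the decision to reject the null hypothesis at the 5% level.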