FACULTY OF BUSINESS & COMMERCE
DEPARTMENT OF MANAGEMENT
Master of Business Administration (MBA) : First Semester (25COT-608)
Quantitative Techniques for Business Decisions
Dr. Neera Kumari
(Ph.D. in Statistics from Delhi University)
Associate Professor, CU-UP
Dr. Neera Kumari
CU, Unnao, U.P. Page 1
Syllabus
Unit 1: Role of Statistics in Managerial Decision Making
Descriptive Statistics, Scope, Functions and Limitations of Statistics, Measures of Central
Tendency – Mean, Median, Mode, Percentiles, Quartiles, Measures of Dispersion – Range,
Interquartile Range, Mean Deviation, Mean Absolute Deviation, Standard Deviation, Variance,
Coefficient of Variation. Measures of Shape and Relative Location: Skewness.
Importance and relevance of Artificial Intelligence in Business Statistics.
Unit-I
Role of Statistics in Managerial Decision Making
In this unit, we will explore six key topics to understand the fundamental principles of
statistics/Quantitative techniques and their practical applications. These topics cover everything
from the basics of organizing and summarizing data to analyzing its distribution and variability.
1. Introduction to Quantitative Technique for Business Decisions/Introduction to Statistics,
Definitions, Application, Uses, Limitation
2. Types of Data/Sources of Data/Collection of Data/Measurement of Scales
3. Classification and Tabulation of Data
4. Measure of Central Tendency with Examples
5. Measure of Dispersion with Examples
6. Measure of Position with Examples
1. Introduction to Quantitative Technique
Quantitative techniques (QT) are systematic, data-driven mathematical and statistical methods used
to analyze numerical data and make informed decisions in business, economics, finance, and other
fields. They integrate mathematical, statistical, and operational research tools to convert
real-world phenomena into numerical models, model complex business problems, and identify optimal
solutions. These techniques help managers improve efficiency, reduce costs, and maximize profits by
providing a structured approach to problem-solving.
Here are a few widely accepted definitions:
General Definition:
Quantitative techniques are a collection of mathematical, statistical, and computational methods
used to analyze quantitative data and make objective decisions based on numerical evidence.
Business Context:
Quantitative techniques in business are analytical tools used to evaluate data, optimize operations,
predict trends, and support strategic decision-making. These methods include statistical analysis,
optimization, forecasting, and simulation.
Operational Research Perspective:
Quantitative techniques are systematic and scientific approaches that use mathematical models to
solve complex operational problems and improve decision quality.
Statistical Definition:
In the context of statistics, quantitative techniques refer to methods for collecting, organizing,
analyzing, and interpreting numerical data to identify patterns and relationships.
Introduction to Statistics
Definition: Statistics, or statistical method, is a branch of science which deals with the
collection, classification, presentation and interpretation of data.
Webster’s Definition – Statistics are classified facts representing the condition of the people in
a state, especially those facts which can be stated in numbers or in tables of numbers or in any
tabular or classified arrangement.
This definition covers the presentation of facts and figures about the condition of people but says
nothing about methodology or any mathematical treatment of the data.
A.W. Bowley – Statistics may be called the science of counting, or the science of averages, or the
science of the measurement of the social organism regarded as a whole in all its manifestations.
This definition speaks of the measurement of the social organism. It does not say much about
collection, methods of analysis, etc.
Yule and Kendall – By statistics we mean quantitative data affected to a marked extent by a
multiplicity of causes.
This definition speaks only of quantitative data affected by many causes and does not mention the
collection, analysis or interpretation of data, which is its demerit.
Horace Secrist – By statistics we mean aggregates of facts affected to a marked extent by a
multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable
standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in
relation to each other.
Many phenomena in nature, activities and experiments are subject to measurement. The variations
which are bound to occur and are unavoidable, owing to causes beyond the control of the
experimenter, are called errors.
Functions of Statistics
1. It presents facts in a definite form.
2. It simplifies a mass of figures.
3. It facilitates comparison.
4. It helps in formulating and testing hypotheses.
5. It helps in prediction and planning.
6. It helps in the formulation of suitable policies.
7. It helps in estimation.
Importance of Statistics:
1. Statistical methods make it possible to condense data. They also make it possible to construct
functions and to simulate, that is, to study the effect on a phenomenon of changes in
conditions.
2. Statistical methods provide tools of comparison.
3. Estimation and prediction with statistical tools help to reduce possible losses.
4. Graphical distributions show the shape, symmetry and spread of the data.
5. The inter-relation between two or more variables can be measured with statistical techniques,
e.g. the correlation coefficient.
6. Statistical methods help in planning, controlling and decision-making.
7. The use of statistical methods saves a considerable amount of time, money, manpower and
valuable national resources.
8. Under uncertainty, risk can be reduced so as to get reliable results with minimum loss.
9. Statistical methods provide systematic methods of data collection and investigation.
Scope of Statistics:
The tools and techniques provided by statistical methods have diverse applications in almost all
fields of study and research, and at several phases of work. A comprehensive list of the fields
where statistics is used cannot be given. However, some important fields where the use of
statistics is indispensable are industry, agriculture, business, commerce, demography,
biotechnology, fisheries, food science, economics, education, government agencies, social sciences,
medical science, management science, etc. A few of these fields are described below:
Industry: Statistics is used in industry right from the establishment of a factory, which requires
information about the availability of resources, transportation, communication, labour and the
consumer market.
In production it is used as ‘statistical quality control’ to maintain the quality of the product.
New production machinery and techniques can be compared with old ones. Market survey analysis is
carried out for particular products. ISO 9000 certification of companies relies on statistics to a
large extent. Estimation, forecasting, controlling, etc. are used by administrators and planners.
Biological Science
Agriculture and Horticulture: In agricultural field-crop experiments, design of experiments and
statistical methods are used in estimation and forecasting. The consistency of performance of
different varieties over locations (called the space or spatial effect) is compared using pooled
analysis. In extension studies, tests such as the ‘F’, ‘t’ and χ² tests are used to judge the
effect of one attribute on another; for example, the effect of education and training on knowledge
and adoption can be tested. Regression equations and forecasting methods are useful in various
agronomic and fertilizer application studies and cultivation practices. Various methods of
irrigation, such as drip, pot, flood and others, can be compared using statistical tools.
Forestry: The number of trees in a jungle, the volume of wood, forest density and the number of
animals can be estimated using statistical techniques such as sampling.
Fisheries: Statistics is used to estimate the volume of fish in a lake. In animal science, for
poultry and other animal growth and nutrition studies, statistical tools such as the correlation
coefficient, analysis of variance and regression analysis are important.
Medical: Here statistics has played an important role in determining the average life expectancy
of humans, which is useful in life insurance and other studies. Demography is the science which
deals with human population dynamics, revealing several characteristics such as the infant
mortality rate, death rate, birth rate, growth rate, etc. Estimation of the sex ratio is possible
using statistical tools. These population rates are major indicators of the health of the country
concerned.
Economics: Huge amounts of data are collected for pre-determined objectives, using pre-designed
sampling designs, trained expert manpower and pre-tested questionnaires. The data may be on
national income, per capita income, the poverty line, industrial production or the cost of
cultivation. Index numbers of the cost of living, exports and imports, area, production and
productivity trends, and prices are economic barometers used to measure economic conditions. The
share index number is the best tool for getting an idea of the changes in the prices of shares in
stock markets. Demand and supply can be studied over time using time series analysis, which helps
in the estimation of trends.
Social Science: Statistics is the science for measuring the behaviour of the social organism as a
whole, where we test the association between i) education and criminality, ii) education and
marriage adjustment, iii) sex and education, iv) income and criminality.
Management Science: Most managerial functions make use of statistics. For the efficient working
of sections such as production, sales, marketing, etc., various statistical tools are used to
improve the current situation.
1) Statistical tools such as estimation, forecasting, tests of significance, index numbers, time
series analysis and statistical quality control play a vital role in management activities.
2) Operational Research and Linear Programming are of prime use in transportation, job
assignment, sequencing and inventory control. Risk is measured using standard deviation and
covariance analysis. The regression coefficient called the beta index of a portfolio is used
in decision-making.
Limitations of Statistics:
Despite the usefulness of statistics in many fields, one should not carry the impression that
statistics are like magical devices which always provide the correct solution to problems. Unless
the data are properly collected and critically interpreted, there is every likelihood of drawing
wrong conclusions. Therefore, it is also necessary to know the limitations and the possible
misuses of statistics. The following are the important limitations of the science of statistics:
1) Statistics does not deal with Individual Measurements: Since statistics deals with aggregates
of facts, the study of individual measurements lies outside its scope. Data are statistical when
they relate to the measurement of masses, and not statistical when they relate to an individual
item or event as a separate entity. For example, the wage earned by an individual worker at any
one time, taken by itself, is not a statistical datum. But the wages of the workers of a factory
can be used statistically.
2) Statistics Deals only with Quantitative Characteristics: Statistics are numerical statements
of facts. Characteristics that cannot be expressed in numbers are incapable of direct statistical
analysis. Such qualitative characteristics can be studied only indirectly, through a related
numerical measure. For example, we may study the intelligence of boys on the basis of the marks
obtained by them in an examination.
3) Statistical Results are true only on an Average: The conclusions obtained statistically are
not universally true; they are true only under certain conditions. This is because statistics as
a science is less exact than the natural sciences.
4) Statistics is only one of the methods of studying a problem: Statistical tools do not provide
the best solution under all circumstances. Very often, it is necessary to consider a problem in
the light of a country’s culture, religion and philosophy. Statistics cannot be of much help in
studying such problems. Hence statistical conclusions should be supplemented by other
evidence.
5) Statistics can be Misused: The greatest limitation of statistics is that it is liable to be
misused. The misuse of statistics may arise for several reasons. For example, if statistical
conclusions are based on incomplete information, one may arrive at fallacious conclusions.
Statistics are like clay and can be molded in any manner so as to establish right or wrong
conclusions. In this context, W. I. King pointed out: “One of the shortcomings of statistics is
that they do not bear on their face the label of their quality.”
Statistics and Business
With the growing size of business enterprises and ever-increasing competition, the problems of
business are becoming more complex, and firms are using more and more statistics in
decision-making. However, the employment of statistical methods in the solution of business
problems belongs almost exclusively to the 20th century. In earlier days, when business firms were
small, the owners were directly engaged in almost all areas of business activity. The owner of a
small firm might then act as the store’s manager, accountant, salesman, purchaser, etc. It was
possible for him to make personal contact with his customers and know exactly what they wanted
from him. With the growth in the size of business firms, it has often become impossible for owners
to maintain personal contact with thousands and lakhs of customers. Management has become a
specialized job, and a manager is called upon to plan, organize, supervise and control the
operations of the business house. Since very little personal contact with customers is possible
these days, a modern business firm faces a much greater degree of uncertainty concerning future
operations than it did when the size of business was small. Therefore, a very careful study of the
market is necessary. If the businessman is to be successful in his decision-making, he must be
able to deal systematically with the uncertainty itself by careful evaluation and application of
statistical methods to the business activities. The higher the degree of accuracy of a
businessman’s estimates, the greater is the success attending his business. In recent years it has
become increasingly evident that statistics and statistical methods have provided the businessman
with one of his most valuable tools for decision-making.
Business activities can be grouped under the following heads:
Production
Sale
Purchase
Finance
Personnel
Accounting
Market and Product Research
Quality Control
With the help of statistical methods, abundant quantitative information can be obtained in respect
of each of the above areas, which can be of immense use in formulating suitable policies. The
information might be in the form of reports or computer printouts, or it might simply consist of
records that are kept in ledgers or other books, or in file folders in a filing cabinet.
However, it should be remembered that though statistical methods are extremely useful in taking
decisions, they are not a perfect substitute for common sense. A practitioner of business
statistics must, therefore, combine knowledge of the business environment in which he operates
and its technological characteristics with a heavy dose of common sense and the ability to
interpret statistical methods to non-statisticians.
Distrust of Statistics:
By distrust of statistics we mean lack of confidence in statistical statements and statistical
methods. People often remark, “Statistics can prove anything,” or “Statistics are like
miniskirts; they cover up the essentials but give you the ideas.” The following three main reasons
account for such views about statistics:
1. Figures are convincing and, therefore, people are easily led to believe them.
2. They can be manipulated in such a manner as to establish foregone conclusions.
3. Even if correct figures are used, they may be presented in such a manner that the reader is
misled.
2. Collection of Data
Data may be obtained either from a primary source or from a secondary source. A primary source
is one that itself collects the data; a secondary source is one that makes available data which
were collected by some other agency. For example, the data collected by the Ministry of Industries
and made available through various publications constitute a primary source. However, if the
Ministry of Industries uses data collected by some other organization, say, the National Sample
Survey Organization, this will constitute a secondary source for the Ministry.
It is preferable to make use of the primary source wherever possible for the following reasons:
1. The secondary source may contain mistakes due to error in transcription made when the
figures were copied from the primary source.
2. The primary source frequently includes definitions of terms and units used.
3. The primary source often includes a copy of the schedule and a description of the procedure
used in selecting the sample and in collecting the data.
4. The primary source usually shows data in greater detail.
Depending on the source, statistical data are classified under two categories:
1. Primary Data 2. Secondary Data
Primary and Secondary Data
Primary data are obtained by a study specifically designed to fulfill the data needs of the
problem at hand. Such data are original in character and are generated in a large number of
surveys conducted mostly by Government and also by some individuals, institutions and research
bodies. For example, data obtained in a population census by the Office of the Registrar General
and Census Commissioner, Ministry of Home Affairs, are primary data.
Data which are not originally collected but rather obtained from published sources are known as
secondary data. For example, for the Office of the Registrar General and Census Commissioner
the census data are primary, whereas for all others who use such data they are secondary. The
secondary data constitute the chief material on the basis of which statistical work is carried out
in many investigations. In fact, before collecting primary data it is desirable to go through the
existing literature and learn what is already known of the general area in which the specific
problem falls, along with any surrounding information that may give leads and lessons.
This can help in getting an idea of the possible pitfalls and in avoiding duplication of effort
and waste of resources.
The difference between primary and secondary data is only one of degree: data which are primary in
the hands of one become secondary in the hands of another. Data are primary for the individual,
agency or institution collecting them, whereas for the rest of the world they are secondary.
Suppose an investigator wants data about the spending habits of the students of Delhi University.
If he collects the data himself or through his agent, adopting any suitable method such as
contacting and interviewing students or circulating a questionnaire, the data would constitute
primary data for him. On the other hand, if the students’ union has already made a similar survey
and the investigator obtains the data from the union office, such data would constitute secondary
data for him.
Secondary data offers the following advantages:
1. It is highly convenient to use information which someone else has compiled. There is no
need for printing data collection forms. Researchers alone or with some clerical assistance
may obtain information from published records compiled by somebody else.
2. If secondary data are available, they are much quicker to obtain than primary data.
3. Secondary data may be available on some subjects where it would be impossible to collect
primary data. For example, census data cannot be collected by an individual or a research
organization, but can only be obtained from Government publications.
However, two major problems are encountered in using secondary data:
1. The first problem is the difficulty of finding data which exactly fit the need of the present
project.
2. The second problem is finding data which are sufficiently accurate.
Choice between Primary and Secondary Data
The choice between primary and secondary data depends on the following considerations:
1. Nature and scope of enquiry
2. Availability of financial resources
3. Availability of time
4. Degree of accuracy desired
5. The collecting agency, i. e. whether an individual, an institution or a Government body.
It may be pointed out that most statistical analysis rests upon secondary data. Primary data are
generally used in those cases where the secondary data do not provide an adequate basis for
analysis. In certain cases, both primary and secondary data may be employed. The reason why
secondary data are being increasingly used is that published statistics are now available
covering diverse fields, so that an investigator finds the required data readily available to him
in many cases.
Methods of Collecting Primary Data
1. Direct personal interviews
2. Indirect oral interviews
3. Information from correspondents
4. Mailed questionnaire method
5. Schedules sent through enumerators
Drafting Questionnaire
Drafting a questionnaire is a highly specialized job and requires a great deal of skill and experience.
The following general principles may be helpful in framing a questionnaire:
1) Covering Letter
2) Number of Questions should be small
3) Questions should be arranged logically
4) Questions should be short and simple to understand
5) Ambiguous questions ought to be avoided
6) Personal questions should be avoided
7) Instructions to the informants
8) Questions should be capable of Objective answer
9) “Yes” or “No” question
10) Specific Information Questions and Open – end Questions
11) Questionnaire should Look attractive
12) Questions Requiring Calculations should be avoided
13) Pre – testing the Questionnaire
14) Cross – checks
15) Method of Tabulation
Sources of Secondary Data:
In most studies the investigator finds it impracticable to collect first-hand information on all
related issues and, as such, makes use of the data collected by others. There are vast amounts of
published information from which statistical studies may be made, and fresh statistics are
constantly in a state of production. The sources of secondary data can broadly be classified under
two heads:
1) Published Sources
a. Reports and official publications
i. International bodies such as ‘World Bank’, ‘International Labour Organisation’,
‘Statistical Office of the United Nations’.
ii. Central and State Governments such as Abstract of the Indian Union, Economic Survey,
Government of India, Ministry of Finance.
iii. Reports of Committees and Commissions appointed by the Government, such as the Report
of the Committee on Corporate Governance.
b. Semi – official publications of various local bodies such as Municipal Corporations and
District Boards.
c. Publications of autonomous and private institutes, such as
i. Trade and professional bodies, such as the Federation of Indian Chambers of Commerce and
Industry, the Institute of Chartered Accountants and the Institute of Foreign Trade. Prestigious
journals of these institutes are, respectively, ‘Economic Trends’, ‘The Chartered
Accountant’ and ‘Foreign Trade Review’.
ii. Financial and economic journals such as ‘Indian Economic Review’, ‘Reserve Bank of
India Bulletin’, ‘Indian Finance’.
iii. Annual Reports of Joint Stock Companies and Corporations.
iv. Publications brought out by various autonomous Research Institutes and Scholars, such
as the Institute of Economic Growth, Delhi; National Council of Applied Economic
Research, New Delhi; Institute of Politics and Economics, Pune.
2) Unpublished Sources
Not all statistical material is published. There are various sources of unpublished data, such
as records maintained by various Government and private offices, studies made by research
institutions, scholars etc. Such sources can be used where necessary.
3. Classification and Tabulation of Data
The collected data are usually contained in schedules and questionnaires, but not in an easily
assimilable form. The answers will require some analysis if their salient points are to be brought
out. As a rule, the first step in the analysis is to classify and tabulate the information
collected or, if published statistics have been employed, to rearrange these into new groups and
tabulate the new arrangement.
Classification is the grouping of related facts into classes. Facts in one class differ from those
of another class with respect to some characteristic, called the basis of classification. Sorting
facts on one basis of classification and then on another basis is called cross-classification.
When students seek admission to a college they submit applications to the office.
The application forms contain particulars about their performance in the previous examinations,
their date of birth, sex, nationality, etc. If one is interested in finding out how many first,
second and third class students have joined the college, one may look into each and every form and
note whether it relates to a first class student, second class student, etc. One may find that out
of 1000 students who took admission, 50 had first class, 800 second class and 150 third class. The
process by which this information is obtained in a summary form is called the classification of
data.
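The idea of classification and cross-classification can be sketched in a few lines of Python. This is only an illustration: the eight admission records below are hypothetical (they are not the 1000 students of the example above), and the names `applications`, `by_division` and `cross` are chosen for this sketch.

```python
from collections import Counter

# Hypothetical admission records: (division obtained, sex of the student).
applications = [
    ("first", "F"), ("second", "M"), ("second", "F"), ("third", "M"),
    ("second", "M"), ("first", "M"), ("second", "F"), ("third", "F"),
]

# Classification on one basis: count students in each division.
by_division = Counter(division for division, _ in applications)

# Cross-classification: sort on division first, then on sex.
cross = Counter(applications)

for division in ("first", "second", "third"):
    print(division, by_division[division],
          "F:", cross[(division, "F")], "M:", cross[(division, "M")])
```

Each form is examined once and placed in exactly one class, which is what the clerk in the example does by hand.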
3.1 Object of Classification
The principal objectives of classification are:
1. To condense the mass of data in such a manner that similarities and dissimilarities can be
readily apprehended. Millions of figures can thus be arranged in a few classes having common
features.
2. To facilitate comparison.
3. To pinpoint the most significant features of the data at a glance.
4. To give prominence to the important information gathered while dropping out the unnecessary
elements.
5. To enable a statistical treatment of the material collected.
3.2 Types of Classification
1. Geographical, i. e. area wise, e. g., cities, districts, etc.
2. Chronological, i.e. on the basis of time.
3. Qualitative, i.e., according to some attributes.
4. Quantitative, i.e., in terms of magnitudes.
1) Geographical, i. e. area wise, e. g., cities, districts, etc.
In this type of classification data are classified on the basis of geographical or locational
differences between the various items, like Countries, States, Cities, Regions, Zones, Areas, etc.
For instance, data on the production and per capita availability of milk in India up to the year
2013 – 14 are given in the following table:
Milk Production and Per Capita Availability in India
Year           Production (million tonnes)   Per Capita Availability (grams/day)
2000 – 2001    80.6                          217
2001 – 2002    84.4                          222
2002 – 2003    86.2                          224
2003 – 2004    88.1                          225
2004 – 2005    92.5                          233
2005 – 2006    97.1                          241
2006 – 2007    102.6                         251
2007 – 2008    107.9                         260
2008 – 2009    112.2                         266
2009 – 2010    116.4                         273
2010 – 2011    121.8                         281
2011 – 2012    127.9                         290
2012 – 2013    132.4                         299
2013 – 2014    137.7                         307
(Source: Various issues of Basic Animal Husbandry Statistics, MoA, GoI)
Geographical classifications are usually listed in alphabetical order for easy reference. Items may
also be listed by size to emphasize the important areas as in ranking the States by population.
2) Chronological, i.e. on the basis of time.
When data are observed over a period of time, the type of classification is known as chronological
classification. For example, the figures of population are as follows.
Population of India from 1951 to 2011
Year    Population (in crore)
1951    36.11
1961    43.92
1971    54.82
1981    68.33
1991    84.64
2001    102.87
2011    121.00
Time series are usually listed in chronological order, normally starting with the earliest period.
When the major emphasis falls on the most recent events, a reverse time order may be used.
3) Qualitative, i.e., according to some attributes.
In qualitative classification data are classified on the basis of some attribute or quality such as sex,
colour of hair, literacy, religion, etc.
4) Quantitative, i.e., in terms of magnitudes.
Quantitative classification refers to the classification of data according to some characteristics that
can be measured, such as height, weight, income, sales, profits, production, etc. For example, the
students of a college may be classified according to weight as follows:
Weight (lb.) No. of Students
90 – 100 50
100 – 110 200
110 – 120 260
120 – 130 360
130 – 140 90
140 – 150 40
Total 1000
Such a distribution is known as empirical frequency distribution or simple frequency distribution.
3.3 Frequency Distribution: A frequency distribution refers to data classified on the basis of
some variable that can be measured such as prices, wages, age, number of units produced or
consumed. The term ‘variable’ refers to the characteristics that vary in amount or magnitude in a
frequency distribution.
A variable may be either continuous or discrete.
Discrete Frequency Distribution
No. of Children    No. of Families
0                  10
1                  40
2                  80
3                  100
4                  250
5                  150
6                  50
Total              680
Continuous Frequency Distribution
Weight (lb.)       No. of Persons
100 – 110          10
110 – 120          15
120 – 130          40
130 – 140          45
140 – 150          20
150 – 160          4
Total              134
3.3.1 Formation of a Discrete Frequency Distribution
The number of times a particular value is repeated is called the frequency of that value. A
frequency distribution or frequency table is simply a table in which the data are grouped into
classes and the number of cases falling in each class is recorded. The number in each class is
referred to as the frequency. When the number of items in each class is expressed as a proportion
of the total, the table is usually referred to as a ‘relative frequency distribution’ or simply a
‘percentage distribution’.
Example 1: In a survey of 35 families in a village, the number of children per family was recorded
and the following data obtained:
1 0 2 3 4 5 6 7 2 3 4 0 2 5
8 4 5 12 6 3 2 7 6 5 3 3 7 8
9 7 9 4 5 4 3
Represent the data in the form of discrete frequency distribution.
Solution: Frequency Distribution of the Number of Children
No. of Children    Tally Bars (||||/ = a bundle of five)    Frequency
0                  ||                                       2
1                  |                                        1
2                  ||||                                     4
3                  ||||/ |                                  6
4                  ||||/                                    5
5                  ||||/                                    5
6                  |||                                      3
7                  ||||                                     4
8                  ||                                       2
9                  ||                                       2
10                 -                                        0
11                 -                                        0
12                 |                                        1
Total                                                       35
It is clear from the table that the number of children varied from 0 to 12. There were 2 families with
no child, 5 families with 4 children each and only one family with 12 children.
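The tallying in Example 1 can be checked with a short Python sketch. The data list is copied from the example; the function name `discrete_frequency` is chosen for this illustration, and the choice to list unobserved values (10 and 11) with frequency 0 simply mirrors the table above.

```python
from collections import Counter

# Number of children per family, copied from Example 1 (35 families).
children = [
    1, 0, 2, 3, 4, 5, 6, 7, 2, 3, 4, 0, 2, 5,
    8, 4, 5, 12, 6, 3, 2, 7, 6, 5, 3, 3, 7, 8,
    9, 7, 9, 4, 5, 4, 3,
]

def discrete_frequency(values):
    """Return {value: frequency} for every integer from the smallest to the
    largest observed value, including values that never occur (frequency 0)."""
    counts = Counter(values)
    return {v: counts.get(v, 0) for v in range(min(values), max(values) + 1)}

freq = discrete_frequency(children)

for value, f in freq.items():
    print(f"{value:>2}  {f}")
print("Total:", sum(freq.values()))
```

Counting by machine replaces the tally bars, but the result is the same table of values and frequencies.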
3.3.2 Formation of Continuous Frequency Distribution
This type of classification is most popular in practice. The following technical terms are important
when a continuous frequency distribution is formed or data are classified according to class –
intervals:
1. Class Limits: The class limits are the lowest and the highest values that can be included in the
class. For example, take the class 20 – 40. The lowest value of the class is 20 and the highest 40.
2. Class Intervals: The difference between Upper and Lower Limit of a class is known as class
interval of that class. For example, in the class 100 – 200, the class interval is 100 (i. e., 200 minus
100).
A simple formula to obtain an estimate of the appropriate class interval, i, is:
$$ i = \frac{L - S}{k} $$
where L = largest item, S = smallest item and k = the number of classes.
3. Class Frequency: The number of observations corresponding to a particular class is known as
the frequency of that class or the class frequency.
4. Class Mid-point or Class Mark: It is the value lying half-way between the lower and upper
class limits of a class interval. Mid-point of a class = (Upper limit + Lower limit) / 2.
3.4 Method of Classification
Exclusive Method: When the class interval is so fixed that the upper limit of one class is the lower
limit of the next class it is known as the exclusive method of classification. Upper limit of one class
is the lower limit of the next class. Example: 10000 – 15000, 15000 – 20000, 20000 – 25000, 25000
- 30000, 30000 – 35000 and 35000 – 40000.
Inclusive Method: Under the inclusive method of classification, the upper limit of one class is
included in that class itself. Examples: 10000 – 14999, 15000 – 19999, 20000 – 24999, 25000 -
29999, 30000 – 34999 and 35000 – 39999.
To decide whether to use the inclusive or the exclusive method it is important to determine whether
the variable under observation is a continuous or discrete one. In case of continuous variables, the
upper limit exclusive method must be used. For example, the variable height being inherently a
continuous one should be stated as 60” and under 62”, 62” and under 64”, and so on. The inclusive method
should, in general, be used in case of discrete variables. Thus, in classifying factories according to
number of workers, the limits should be stated as, for example, 100 – 199 employees, 200 – 299
employees and not 100 – 200, 200 – 300, etc.
Note: The inclusive method of grouping can be converted to the exclusive method of grouping.
3.5 Relative Frequency Distribution:
At times it may be desirable to convert class frequencies to relative class frequencies to show the
percentage of the total number of observations in each class. For example, we may be interested in
knowing the percentage of employees earning less than Rs. 10, 000 per month.
In order to convert a frequency distribution to a relative frequency distribution, each of the class
frequencies is divided by the total number of observations, so that the relative frequencies
always total 1.
3.7 Cumulative Frequency Distribution
The frequency is the number of times an event occurs within a given scenario. Cumulative frequency
is defined as the running total of frequencies. It is the sum of all the previous frequencies up to the
current point. It is easily understandable through a Cumulative Frequency Table.
Marks No. of Students Cumulative Frequency
0 -5 2 2
5 – 10 10 12
10 – 15 5 17
15 – 20 5 22
1. Cumulative frequency distribution of less than type
Marks No. of Students Marks Group Cumulative Frequency
0–5 2 Less than 5 2
5 – 10 10 Less than 10 12
10 – 15 5 Less than 15 17
15 - 20 5 Less than 20 22
2. Cumulative frequency distribution of more than type
Marks No. of Students Mark Group Cumulative Frequency
0 -5 2 More Than 0 22
5 – 10 10 More Than 5 20
10 – 15 5 More Than 10 10
15 – 20 5 More Than 15 5
Note: In a less-than cumulative frequency distribution, the cumulative frequency refers to the upper
limit of the corresponding class; in a more-than cumulative frequency distribution, it refers to the
lower limit of the corresponding class.
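The relative, less-than and more-than cumulative frequencies for the marks table above can be reproduced with running totals. This is an illustrative Python sketch; the variable names are assumptions made here and do not come from the text.

```python
from itertools import accumulate

# Class frequencies for Marks 0-5, 5-10, 10-15, 15-20 (table in Section 3.7).
lower_limits = [0, 5, 10, 15]
upper_limits = [5, 10, 15, 20]
frequencies = [2, 10, 5, 5]
total = sum(frequencies)

# Relative frequency: class frequency divided by the total frequency.
relative = [f / total for f in frequencies]

# Less-than type: running total starting from the first class.
less_than = list(accumulate(frequencies))

# More-than type: running total starting from the last class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for lo, hi, lt, mt in zip(lower_limits, upper_limits, less_than, more_than):
    print(f"Less than {hi}: {lt:>2}    More than {lo}: {mt:>2}")
```

The less-than totals attach to upper class limits and the more-than totals to lower class limits, which is why the two lists are printed against different limits.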
Q.1 Present the following data of the percentage marks of 60 students in the form of a frequency
table with 10 classes of equal width, one class being 40-49:
41 17 83 63 54 92 60 58 70 06 67 82 33 44 92
57 49 34 73 54 63 36 52 32 75 60 33 09 79 88
28 30 42 93 43 80 03 32 57 67 24 64 63 11 62
35 82 10 23 00 41 60 32 72 53 60 33 40 57 55
Sol. Here the lowest value is 0 and the highest value is 93, and the required magnitude of the class
interval is 10, so the classes taken by the inclusive method will be:
0-9, 10-19, 20-29, …, 80-89, 90-99.
Taking every value of the raw data in turn and putting a tally mark against the corresponding class
interval it falls in, we get the frequency distribution as given in the following table.
Frequency Distribution by Inclusive Method
Class Interval Tally Marks Frequency (f)
0-9 4
10-19 3
20-29 3
30-39 10
40-49 7
50-59 9
60-69 11
70-79 5
80-89 5
90-99 3
Total 60
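The grouping in Q.1 can be checked with a short Python sketch. It assumes inclusive classes of width 10 starting at 0, so that a mark m falls in class number m // 10; the names `marks` and `frequencies` are illustrative.

```python
# Percentage marks of the 60 students in Q.1 (leading zeros dropped).
marks = [
    41, 17, 83, 63, 54, 92, 60, 58, 70, 6, 67, 82, 33, 44, 92,
    57, 49, 34, 73, 54, 63, 36, 52, 32, 75, 60, 33, 9, 79, 88,
    28, 30, 42, 93, 43, 80, 3, 32, 57, 67, 24, 64, 63, 11, 62,
    35, 82, 10, 23, 0, 41, 60, 32, 72, 53, 60, 33, 40, 57, 55,
]

width = 10          # class width
num_classes = 10    # classes 0-9, 10-19, ..., 90-99

# Integer division assigns each mark to its inclusive class.
frequencies = [0] * num_classes
for m in marks:
    frequencies[m // width] += 1

for k, f in enumerate(frequencies):
    print(f"{k * width}-{k * width + width - 1}: {f}")
print("Total:", sum(frequencies))
```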
Q.2 The following table gives the scholastic aptitude scores of 50 departmental students of a
certain department in a certain university:
345 472 530 475 556 610 354 586 590 823
395 691 515 520 479 465 494 468 420 545
563 624 444 582 629 540 440 578 485 505
505 523 604 575 490 420 445 605 605 527
402 461 406 440 730 585 506 420 516 384
Construct a frequency distribution table with appropriate class limits and class boundaries (Take
the length of the class equal to 30 units).
Sol. In the given array, the lowest value is 345, and the upper value is 730. We have to divide the
whole range of 385 (= 730 - 345) into classes of 30 units each. We may therefore divide the whole
range into 13 classes. Thus the classes are: 340-370, 370-400, 400-430, and so on. The class limits
are therefore 340 and 370, respectively the lower limit and upper limit for the first class, and the like.
Here the class boundaries may be taken as 340.5-370.5, 370.5-400.5, 400.5-430.5, and so on. The
width of each class is still 30, but now we would face no difficulty in placing 370 or 400 if we
come across them.
Frequency Distribution
Class Limits Class Boundaries Tally Marks Frequency (f)
340-370 340.5-370.5 2
370-400 370.5-400.5 2
400-430 400.5-430.5 5
430-460 430.5-460.5 4
460-490 460.5-490.5 8
490-520 490.5-520.5 7
520-550 520.5-550.5 5
550-580 550.5-580.5 5
580-610 580.5-610.5 8
610-640 610.5-640.5 2
640-670 640.5-670.5 --
670-700 670.5-700.5 1
700-730 700.5-730.5 1
Q.3 The bonus to be paid by XYZ Company is as under:
Salary (Rs.) Bonus (Rs.) Salary (Rs.) Bonus (Rs.)
500-1000 100 2000-2500 400
1000-1500 200 2500-3000 500
1500-2000 300 3000-3500 600
Actual salaries of the employees, (in Rupees,) are as under:
875 1125 1875 2390 2625 3250 2850 2255
1910 1400 1685 1250 1875 3050 2325 2650
2400 1600 3190 2575 1125 2260 2150 1725
3455 2355 2250 2950 1050 780
Find out the total bonus paid to the employees.
Sol. As we are given the amount of bonus to be paid to the employees according to the salaries of
the employees within classes 500-1000, 1000-1500, …, 3000-3500, we shall convert the given
distribution of salaries into a frequency distribution with these classes.
Frequency Distribution of Salaries of 30 Employees
Salary (in Rs.) Tally Marks No. of Employees (f) Bonus in Rs. (X) Total Bonus (fX)
500-1000 2 100 200
1000-1500 5 200 1000
1500-2000 6 300 1800
2000-2500 8 400 3200
2500-3000 5 500 2500
3000-3500 4 600 2400
Total Σf = 30 ΣfX = 11100
Hence the total bonus paid to the employees = ΣfX = Rs. 11100.
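The classification and the total bonus in Q.3 can be verified as follows. This is a hedged sketch: it assumes the exclusive method (a salary equal to an upper limit belongs to the next class) and that every salary lies between Rs. 500 and Rs. 3500; the variable names are illustrative.

```python
from bisect import bisect_right

salaries = [
    875, 1125, 1875, 2390, 2625, 3250, 2850, 2255,
    1910, 1400, 1685, 1250, 1875, 3050, 2325, 2650,
    2400, 1600, 3190, 2575, 1125, 2260, 2150, 1725,
    3455, 2355, 2250, 2950, 1050, 780,
]

# Lower limits of the classes 500-1000, 1000-1500, ..., 3000-3500
# and the bonus attached to each class.
lower_limits = [500, 1000, 1500, 2000, 2500, 3000]
bonuses = [100, 200, 300, 400, 500, 600]

# bisect_right counts how many lower limits are <= salary, so subtracting 1
# gives the index of the class that contains the salary.
counts = [0] * len(lower_limits)
for s in salaries:
    counts[bisect_right(lower_limits, s) - 1] += 1

total_bonus = sum(f * x for f, x in zip(counts, bonuses))
print(counts, total_bonus)
```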
Q.4 Prepare frequency tables of the marks in statistics and Law separately obtained by the students
from the following data: (You may choose class-intervals of 10 marks viz., 10-20, 20-30 etc.):
Marks in Law Marks in Statistics Marks in Law Marks in Statistics
11 25 27 40
18 16 14 22
25 23 30 42
27 25 31 43
16 12 21 25
29 28 34 27
30 32 20 32
20 18 37 29
26 27 23 32
12 15 37 40
25 33 36 37
28 34 32 23
19 20 19 33
13 18 35 35
30 35 34 32
22 23 33 33
23 20 32 22
29 37 40 36
30 36 42 44
36 28 15 24
22 27 41 35
25 39 38 30
Ans. Frequency Distribution Table
Marks Law Statistics
Tally Bars Frequency Tally Bars Frequency
10-20 9 5
20-30 16 17
30-40 16 17
40-50 3 5
Total 44 44
Q.5 Construct a cumulative frequency distribution table showing both:
(i) less-than type and (ii) more-than type from the following data:
Frequency Distribution
Class-Interval Frequency
(Age in Years) (No. of persons)
10-20 8
20-30 24
30-40 40
40-50 22
50-60 6
Total 100
Sol.
(i) Table 1: Cumulative Frequency Distribution Table
(Less-than type and more-than type on Class-interval Basis)
Class-Interval Frequency Cumulative Frequency
(Age in years) (No. of persons) Less-than More-than
10-20 8 8 = 8 6+22+40+24+8 = 100
20-30 24 8+24 = 32 6+22+40+24 = 92
30-40 40 8+24+40 = 72 6+22+40 = 68
40-50 22 8+24+40+22 = 94 6+22 = 28
50-60 6 8+24+40+22+6 = 100 6 = 6
Total 100
(ii) Table 2: Cumulative Frequency Distribution Table
(Less-than type and more-than type on Class-boundary Basis)
Class-Boundary Cumulative Class-Boundary Cumulative
(Age in years) Frequency (Age in years) Frequency
(No. of persons) (No. of persons)
Less than 10 0 More than 10 100
Less than 20 8 More than 20 92
Less than 30 32 More than 30 68
Less than 40 72 More than 40 28
Less than 50 94 More than 50 6
Less than 60 100 More than 60 0
4. Measure of Central Tendency
4.1 Introduction
As the foundation for understanding data, descriptive statistical analysis is essential to machine
learning. Statistics mostly helps with inferring conclusions from data, which is an essential first
step, whereas machine learning concentrates on predictions. We explore the foundational ideas of
descriptive statistics in this chapter, providing insights to improve your comprehension of your
data. You will improve your machine learning models and broaden your understanding in general
by understanding these fundamental concepts.
Everything from data collection and processing to summation and presentation is covered
by descriptive statistics. Their importance stems from their capacity to articulate data as actionable
information and to substantiate conclusions derived from data.
Three essential descriptive statistics metrics are covered in this chapter: position, variation,
and central tendency. In addition to a variety of tools for data visualization and summarization, we
thoroughly examine these measurements. Additionally, we walk through the ways that machine
learning and descriptive statistics interact, highlighting the uses and importance of both in
improving data interpretation and predictive modelling.
4.2 Analyzing Central Trends
Finding a single value that summarizes the properties of the full quantity of unmanageable data is
one of the main goals of statistical analysis. This kind of number is known as the variable's
expected value, central value, or average.
“Average is an effort to identify a single figure that best represents the entire set of figures.”
A single value that sums up a set of values is called an average. This kind of value is very important
because it represents the traits of the entire group. Since it represents all of the data, an average's
value falls between the two extremes, i.e., the largest and the smallest items. This is the reason why
the term “measure of central tendency” is commonly applied to an average.
Types of Central Tendency Measure
There are three different types of measure of central tendency: mean, median and mode.
[Figure: Measures of Central Tendency: 1. Mean (A. M., G. M., H. M.); 2. Median; 3. Mode]
4.2.1 Average Analysis
What statisticians refer to as the mean, and what most laypeople refer to as an “average”, is the
most widely used way to represent all of the data by a single value. To find its
value, add up all the items and divide the sum by the total number of items.
There are three different types of means: arithmetic, geometric, harmonic.
4.2.1.1 Arithmetic Mean
The average of a group of numbers is called the arithmetic mean. To compute it, first add up all of
the set's numbers, and then divide the total by the number of numbers in the set. It is the measure
of central tendency that is most frequently applied.
Symbolically, $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$ or $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the number of observations.
Mathematical Properties of Arithmetical Mean
1. The sum of the squared deviations of the items from the arithmetic mean is a minimum, i.e., it is
less than the sum of the squared deviations from any other value. This minimum sum, divided by the
number of items, is the variance.
2. Mean of combined series: The following formula can be used to calculate the combined average
of two or more related groups if we know the arithmetic mean and the number of items in each
group:
$$\bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
Where, $\bar{x}_{12}$ = combined mean of the two groups, $\bar{x}_1$ = arithmetic mean of the first group,
$\bar{x}_2$ = arithmetic mean of the second group, $n_1$ = number of items in the first group,
$n_2$ = number of items in the second group.
3. The sum of the deviations of the items from the A. M. is always zero, i.e., $\sum (x - \bar{x}) = 0$.
Q.6 Calculate the average bonus paid per member from the following data:
Bonus (in Rs.) 500 600 700 800 900 1000 1100
No. of persons 1 3 5 7 6 2 1
Ans.
Calculations for Average Bonus
X f fX d = X - 800 fd u = (X - 800)/100 fu
500 1 500 -300 -300 -3 -3
600 3 1800 -200 -600 -2 -6
700 5 3500 -100 -500 -1 -5
800 7 5600 0 0 0 0
900 6 5400 100 600 1 6
1000 2 2000 200 400 2 4
1100 1 1100 300 300 3 3
Total N = Σf = 25 ΣfX = 19900 Σfd = -100 Σfu = -1
(i) Direct method: $\bar{X} = \frac{\sum fX}{N} = \frac{19900}{25} = 796$
(ii) Short-cut method: $\bar{X} = A + \frac{\sum fd}{N} = 800 + \frac{(-100)}{25} = 796$
(iii) Step-deviation method: $\bar{X} = A + \frac{\sum fu}{N} \times h = 800 + \frac{(-1)}{25} \times 100 = 796$
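The three methods used in Q.6 necessarily give the same mean, which a few lines of Python can confirm. This is a sketch under the same choices as the solution (assumed mean A = 800, common factor h = 100); the variable names are illustrative.

```python
values = [500, 600, 700, 800, 900, 1000, 1100]   # bonus X (Rs.)
freqs = [1, 3, 5, 7, 6, 2, 1]                    # number of persons f
N = sum(freqs)

A = 800   # assumed mean
h = 100   # common factor for step deviations

# (i) Direct method: sum(fX) / N
direct = sum(f * x for f, x in zip(freqs, values)) / N

# (ii) Short-cut method: A + sum(fd) / N, with d = X - A
shortcut = A + sum(f * (x - A) for f, x in zip(freqs, values)) / N

# (iii) Step-deviation method: A + (sum(fu) / N) * h, with u = (X - A) / h
step = A + sum(f * ((x - A) / h) for f, x in zip(freqs, values)) / N * h

print(direct, shortcut, step)
```

The short-cut and step-deviation methods only shift and rescale the data, so they change the arithmetic effort but not the result.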
Q.7 The following table gives the male population (in lakhs) of city X and city Y in certain year:
Age-group (years) City X City Y
0-5 14 9
5-10 13 8
10-15 13 8
15-20 13 7
20-30 33 15
30-40 29 12
40-50 17 9
50-60 7 6
60-80 4 4
Calculate the average age of males in city X and city Y separately.
Ans. Computation of average age in city X and city Y
Age Group Mid-value (m) d = m - 25 Population (in lakhs)
City X: f1, f1 d City Y: f2, f2 d
0-5 2.5 -22.5 14 -315 9 -202.5
5-10 7.5 -17.5 13 -227.5 8 -140
10-15 12.5 -12.5 13 -162.5 8 -100
15-20 17.5 -7.5 13 -97.5 7 -52.5
20-30 25 0 33 0 15 0
30-40 35 10 29 290 12 120
40-50 45 20 17 340 9 180
50-60 55 30 7 210 6 180
60-80 70 45 4 180 4 180
Total N1 = 143 Σf1 d = 217.5 N2 = 78 Σf2 d = 165
The average age of males in City X is:
$\bar{X}_1 = A + \frac{\sum f_1 d}{N_1} = 25 + \frac{217.5}{143} = 26.5$
The average age of males in City Y is:
$\bar{X}_2 = A + \frac{\sum f_2 d}{N_2} = 25 + \frac{165}{78} = 27.1$
The average age of males in City X is 26.5 years and the average age of males in City Y is 27.1 years.
Therefore, the average age of males in City X is less than that of the males in City Y.
Q.8 Find the class intervals if the arithmetic mean of the following distribution is 20 and assumed
mean 22.
Step deviation: -4 -3 -2 -1 0 1 2 3 4
Frequency: 11 13 16 14 14 9 17 6 4
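Ans. Here the given step deviation is the deviation $u = \frac{x - A}{h}$, so that
$$\bar{X} = A + \frac{\sum fu}{N} \times h, \quad A = 22 \text{ (given)} \qquad (1)$$
Computation of Class Intervals
Step Deviation (u) Frequency (f) fu X = A + hu Class Interval
-4 11 -44 6 4-8
-3 13 -39 10 8-12
-2 16 -32 14 12-16
-1 14 -14 18 16-20
0 14 0 22 20-24
1 9 9 26 24-28
2 17 34 30 28-32
3 6 18 34 32-36
4 4 16 38 36-40
Total N = 104 Σfu = -52
From eq. (1), $20 = 22 + \frac{(-52)}{104} h = 22 - 0.5h \Rightarrow 2 = 0.5h \Rightarrow h = 4$.
With h = 4, each mid-value is X = A + hu = 22 + 4u, and each class interval extends h/2 = 2 on either
side of its mid-value.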
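The value of h in Q.8 follows from rearranging equation (1), and the class intervals follow from the mid-values A + hu. The sketch below uses the frequencies from the computation table (N = 104); the variable names are illustrative, and treating each mid-value as the centre of a class of width h is the same assumption the solution makes.

```python
u = [-4, -3, -2, -1, 0, 1, 2, 3, 4]        # step deviations
f = [11, 13, 16, 14, 14, 9, 17, 6, 4]      # frequencies from the computation table
A = 22                                     # assumed mean
mean = 20                                  # given arithmetic mean

N = sum(f)
sum_fu = sum(fi * ui for fi, ui in zip(f, u))

# Rearranging mean = A + (sum_fu / N) * h for the class width h.
h = (mean - A) * N / sum_fu

# Mid-value of each class and the class limits h/2 either side of it.
mids = [A + h * ui for ui in u]
classes = [(m - h / 2, m + h / 2) for m in mids]

print(N, sum_fu, h)
print(classes)
```

Recomputing the mean from `mids` and `f` gives back 20, which is a quick consistency check on the intervals.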
Q.9 (Combined Mean) The average salary of male employees in a firm was Rs. 5200 and that of
female was Rs. 4200. The mean salary of all the employees was Rs. 5000. Find the percentage of
male and female employees.
Sol. Let $n_1$ and $n_2$ denote respectively the number of male and female employees in the firm, and
$\bar{x}_1$ and $\bar{x}_2$ denote respectively their average salaries (in rupees). Let $\bar{x}$ denote the average salary of
all the workers in the firm.
We are given that: $\bar{x}_1 = 5200$, $\bar{x}_2 = 4200$ and $\bar{x} = 5000$.
Also we know, $\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \Rightarrow 5000(n_1 + n_2) = 5200 n_1 + 4200 n_2$
$\Rightarrow (5200 - 5000) n_1 = (5000 - 4200) n_2 \Rightarrow 200 n_1 = 800 n_2 \Rightarrow \frac{n_1}{n_2} = \frac{4}{1}$
The percentage of male employees in the firm = $\frac{4}{4 + 1} \times 100 = 80$
and the percentage of female employees in the firm = $\frac{1}{4 + 1} \times 100 = 20$
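The combined-mean relation used in Q.9 can be sketched in Python. The helper `combined_mean` and the variable `p` (the proportion of male employees) are names introduced here for illustration only.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Combined mean of two groups: (n1*mean1 + n2*mean2) / (n1 + n2)."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

mean_male, mean_female, mean_all = 5200, 4200, 5000

# If p is the proportion of males, then
#   mean_all = p * mean_male + (1 - p) * mean_female,
# which rearranges to the expression below.
p = (mean_all - mean_female) / (mean_male - mean_female)

percent_male = 100 * p
percent_female = 100 * (1 - p)
print(percent_male, percent_female)

# Check: 80 males and 20 females reproduce the overall mean.
print(combined_mean(80, mean_male, 20, mean_female))
```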
Q.10 Twenty passengers were found ticketless on a bus. The sum of squares and the S.D. of the
amount found in their pockets were Rs. 2000 and Rs. 6 respectively. If the total fine imposed on
these passengers is equal to the total amount recovered from them and fine imposed is uniform,
what is the amount each one of them has to pay as fine? What difficulties do you visualize if such
a system of penalty were imposed?
Sol. Let $x_i$, i = 1, 2, ..., 20 be the amount (in Rs.) found in the pocket of the ith passenger. Then we
are given:
$$n = 20, \quad \sum_{i=1}^{20} x_i^2 = 2000 \quad \text{and} \quad \text{S.D.} (\sigma) = \text{Rs. } 6 \qquad (i)$$
The total fine imposed on the ticketless passengers is given to be equal to the total amount
recovered from them.
Total fine imposed on the 20 passengers = $\sum_{i=1}^{20} x_i$
Further, since the fine imposed is uniform among all the 20 passengers,
Fine to be paid by each passenger = $\frac{1}{20}\sum_{i=1}^{20} x_i = \bar{x}$ (ii)
We have: $\sigma^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 \Rightarrow \bar{x}^2 = \frac{1}{n}\sum x_i^2 - \sigma^2$
$\bar{x}^2 = \frac{2000}{20} - 6^2 = 100 - 36 = 64 \Rightarrow \bar{x} = \text{Rs. } 8$ [From (i)]
Hence, using (ii), the fine paid by each of the passengers is Rs. 8.
If among these ticketless passengers, there are a few rich persons with large sums of money
in their pockets, then an obvious shortcoming of this system of imposing penalty is that, it will
give undue heavy penalty to the poor passengers (with smaller amounts of money in their pockets).
Q.11 The average salary paid to 100 employees of an establishment was found to be Rs. 5424.
Later on, after disbursement of salary, it was discovered that the salary of two employees was
wrongly entered as Rs. 8910 and Rs. 4950. Their correct salaries were Rs. 5910 and Rs. 5550. Find
the correct arithmetic average.
Sol. Let the variable X denote the salary (in Rs.) of an employee. Then we are given:
$\bar{X} = 5424$ or $\frac{\sum X}{100} = 5424 \Rightarrow \sum X = \text{Rs. } 542400$
Thus the total salary disbursed to all the employees in the establishment is Rs. 542400.
After incorporating the corrections, we have
Corrected $\sum X$ = 542400 - (Sum of wrong salaries) + (Sum of correct salaries)
= 542400 – (8910 + 4950) + (5910 + 5550)
= 5,40,000
Correct average salary = 5,40,000 / 100 = Rs. 5400.
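The correction in Q.11 amounts to adjusting the total and dividing again. A minimal Python sketch, with illustrative variable names:

```python
n = 100
reported_mean = 5424

wrong_entries = [8910, 4950]     # salaries as wrongly entered
correct_entries = [5910, 5550]   # their correct values

# Recover the reported total, remove the wrong entries, add the correct ones.
reported_total = n * reported_mean
corrected_total = reported_total - sum(wrong_entries) + sum(correct_entries)
corrected_mean = corrected_total / n

print(reported_total, corrected_total, corrected_mean)
```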
Merits of Arithmetic Mean
1. It is based on every item in the series.
2. Due to its rigid mathematical definition, it is capable of further algebraic treatment.
3. It is a reliable value.
4. Its value does not depend on the position of the items in the series.
Demerits of Arithmetic Mean
1. The arithmetic mean is based on each and every item of the distribution. Therefore, it is
unduly affected by extreme items.
2. In a distribution with open end classes, the value of mean cannot be calculated without
appropriate assumption of class interval.
3. The arithmetic mean is not always a good measure of central tendency. It is a characteristic
value only when most of the values lie near it; therefore it is a good measure for a bell-shaped
curve, but it is not useful for a U-shaped curve or distribution.
4. It cannot be determined by inspection in all series.
5. It cannot be determined in case of qualitative characters.
Weighted Arithmetic Mean
The arithmetic mean, as previously explained, has a drawback in that it assigns equal weight to
each item. However, there are instances in which the relative importance of the various items differs.
We calculate the weighted arithmetic mean when this is the case. The relative importance of the
various items is represented by the term "weight." The following is the formula to find the weighted
arithmetic mean:
$$\bar{x}_w = \frac{\sum wx}{\sum w}$$
It should be remembered that
1. If the weights are equal, the simple arithmetic mean and the weighted arithmetic mean are
equal. In symbolic terms, $\bar{x}_w = \bar{x}$ if $w_1 = w_2$.
2. The weighted arithmetic mean is greater than the simple arithmetic mean if larger values are
given larger weights and smaller values are given smaller weights. In symbolic terms,
$\bar{x} < \bar{x}_w$ if $(w_1 - w_2)(x_1 - x_2) > 0$.
3. If smaller weights are assigned to higher values and larger weights to lower values, the simple
arithmetic mean is greater than the weighted arithmetic mean. In symbolic terms,
$\bar{x} > \bar{x}_w$ if $(w_1 - w_2)(x_1 - x_2) < 0$.
Q.12 (Weighted Mean) The following table shows the number of workers in various trade
categories who worked from Monday to Friday in a week for varying number of hours each day.
If the hourly pay for categories A, B, C, D and E workers be respectively Rs. 0.97, Rs. 0.77, Rs.
1.01, Rs. 0.67 and Rs. 0.75, calculate the average wage per hour per worker for the whole week
for all categories together.
NUMBER OF WORKERS
Categories Mon. Tues. Wed. Thurs. Fri.
of workers (7 hours) (6 hours) (5 hours) (4 hours) (5 hours)
A 30 20 25 15 30
B 25 25 30 20 20
C 30 25 30 25 20
D 20 20 20 20 25
E 25 20 25 15 25
Ans. Total hours under category A=30×7+20×6+25×5+15×4+30×5=665
Similarly, we can obtain the total hours under categories B, C, D and E.
Category Hourly Rate in Rs. (X) Total hours (w) wX
A 0.97 665 645.05
B 0.77 655 504.35
C 1.01 710 717.10
D 0.67 565 378.55
E 0.75 605 453.75
Total Σw = 3200 ΣwX = 2698.80
The average wage per hour per worker for the whole week for the workers of all categories is
given by:
$\bar{X}_w = \frac{\sum wX}{\sum w} = \frac{2698.80}{3200}$ = Re. 0.84 per hour
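The weights in Q.12 are the total worker-hours in each category, so the whole calculation can be sketched in Python as below. The dictionary names `workers`, `rates` and `total_hours` are illustrative choices, not notation from the text.

```python
hours_per_day = [7, 6, 5, 4, 5]   # Mon, Tue, Wed, Thu, Fri

# Number of workers present on each day, by category.
workers = {
    "A": [30, 20, 25, 15, 30],
    "B": [25, 25, 30, 20, 20],
    "C": [30, 25, 30, 25, 20],
    "D": [20, 20, 20, 20, 25],
    "E": [25, 20, 25, 15, 25],
}
rates = {"A": 0.97, "B": 0.77, "C": 1.01, "D": 0.67, "E": 0.75}  # Rs. per hour

# Weight w for each category = total worker-hours over the week.
total_hours = {
    cat: sum(n * hrs for n, hrs in zip(counts, hours_per_day))
    for cat, counts in workers.items()
}

# Weighted mean = sum(w * X) / sum(w)
weighted_mean = (
    sum(rates[cat] * total_hours[cat] for cat in rates) / sum(total_hours.values())
)
print(total_hours)
print(round(weighted_mean, 2))
```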
4.2.1.2 Geometric Mean
The nth root of the product of n items or values is the definition of the geometric mean. We take
the square root when there are two items, the cube root when there are three, and so forth.
Symbolically, $G.M. = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$
Where the various items in the series are denoted by $x_1, x_2, x_3$, etc. When there are three or
more items, it becomes laborious to multiply the numbers and extract the root; hence,
logarithms are employed to simplify calculations. The geometric mean is then computed using the
formula below:
$$\log G.M. = \frac{\log x_1 + \log x_2 + \dots + \log x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} \log x_i \quad \text{and} \quad G.M. = \text{Antilog}\left[\frac{1}{n}\sum_{i=1}^{n} \log x_i\right]$$
Benefits of Geometric Mean
1. It is based on every item in the series.
2. It can be applied to average percentage increases in population, production, sales, or other
commercial or economic data.
3. This average is useful in situations, common in the social and economic spheres, where it is
necessary to give more weight to small items and less weight to large items.
4. In theory, G. M. is regarded as the ideal average for creating index numbers.
Limitations of Geometric Mean
1. It is difficult to understand.
2. It is difficult to calculate and interpret.
3. It cannot be computed when there are both positive and negative values in a series, or when one
or more of the values is zero.
Q.13 Find the geometric mean of 1, 3, 5, 7 and 9.
Solution. The G.M. is given by $(x_1 \times x_2 \times \dots \times x_n)^{1/n} = (1 \times 3 \times 5 \times 7 \times 9)^{1/5} = (945)^{1/5} = 3.936$
Q.14 Find the geometric mean for the following data:
Classes 100-200 200-300 300-400 400-500 500-600
Frequency 15 18 30 20 17
Sol. We form the following table,
Classes Mid Value (x) Frequency (f) log10 x f log10 x
100-200 150 15 2.1761 32.6415
200-300 250 18 2.3979 43.1622
300-400 350 30 2.5441 76.3230
400-500 450 20 2.6532 53.0640
500-600 550 17 2.7404 46.5868
N = 100 Sum = 251.7775
Using the formula for geometric mean,
$\log_{10} G = \frac{1}{N}\sum_{i=1}^{n} f_i \log_{10} x_i = \frac{1}{100}(251.7775) = 2.5178$
$G = \text{Antilog}(2.5178) = 329.5$
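The log-based calculation in Q.14 can be reproduced in Python. This sketch uses `math.log10` at full floating-point precision instead of four-figure log tables, so the final digit can differ slightly from the hand calculation (it gives roughly 329.4 rather than 329.5); the variable names are illustrative.

```python
import math

mids = [150, 250, 350, 450, 550]   # class mid-values x
freqs = [15, 18, 30, 20, 17]       # class frequencies f
N = sum(freqs)

# log10(G) = (1/N) * sum(f * log10(x))
log_gm = sum(f * math.log10(x) for f, x in zip(freqs, mids)) / N

# Taking the antilog recovers the geometric mean itself.
gm = 10 ** log_gm

print(round(log_gm, 4), round(gm, 1))
```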
4.2.1.3 Harmonic Mean
The harmonic mean is based on the reciprocals of the numbers averaged. It is defined as the
reciprocal of the arithmetic mean of the reciprocals of the individual observations.
Symbolically, $H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$
Merits of Harmonic Mean
1. Its value is based on every item of series.
2. It is suitable for further algebraic manipulation.
3. It is rigidly defined.
Limitations of Harmonic Mean
1. It is difficult to understand.
2. It is challenging to calculate.
3. It assigns the smallest items the most weight.
4. When a series contains both +ve and -ve values, or when one or more of the values are zero, the
computation is not possible.
Problem 15. At four distinct times of the year, milk is sold for 8, 10, 12, and 15 rupees per litre.
Find the average price per month in rupees, assuming that a household spends the same amount
on milk each of the four months.
Solution. Due to the family's equal spending during all four months, the harmonic mean of 8, 10,
12, and 15 determines the average monthly price of milk.
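Monthly average price of milk
$= \frac{1}{\frac{1}{4}\left(\frac{1}{8} + \frac{1}{10} + \frac{1}{12} + \frac{1}{15}\right)} = \text{Rs. } \frac{4 \times 120}{15 + 12 + 10 + 8} = \text{Rs. } \frac{4 \times 120}{45} = \text{Rs. } 10.67$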
Q.16 Calculate the harmonic mean for the following data.
Marks 0-10 10-20 20-30 30-40 40-50
No. of Students 4 6 10 7 3
Sol. The harmonic mean is calculated below:
Marks Mid value (x) Frequency (f) 1/x f × (1/x)
0-10 5 4 1/5 4/5
10-20 15 6 1/15 6/15
20-30 25 10 1/25 10/25
30-40 35 7 1/35 7/35
40-50 45 3 1/45 3/45
Here, N = 30 and
$\sum_{i=1}^{n} \frac{f_i}{x_i} = \frac{4}{5} + \frac{6}{15} + \frac{10}{25} + \frac{7}{35} + \frac{3}{45} = 0.8 + 0.4 + 0.4 + 0.2 + 0.067 = 1.867$
$H = \frac{N}{\sum_{i=1}^{n} f_i / x_i} = \frac{30}{1.867} = 16.07$
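Because rounding the reciprocals affects the second decimal place in Q.16, it can help to carry the sum as an exact fraction. The following Python sketch does so with `fractions.Fraction`; the variable names are illustrative.

```python
from fractions import Fraction

mids = [5, 15, 25, 35, 45]    # class mid-values x
freqs = [4, 6, 10, 7, 3]      # class frequencies f
N = sum(freqs)

# Exact value of sum(f / x), with no rounding of the reciprocals.
sum_f_over_x = sum(Fraction(f, x) for f, x in zip(freqs, mids))

# Harmonic mean of a frequency distribution: N / sum(f / x)
hm = N / sum_f_over_x

print(sum_f_over_x, hm, round(float(hm), 2))
```

Here the sum of f/x is exactly 28/15, so the harmonic mean is 225/14, about 16.07.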
Q.17 An aeroplane flies round a square the sides of which measure 100 miles each. The
aeroplane covers at a speed of, 100 miles per hour the first side, at 200 m.p.h. the second side,
at 300 m.p.h. the third side and 400 m.p.h. the fourth side. What is the average speed of the
aeroplane around the square?
Sol. Here we are to find average speed and the distance is kept fixed, therefore, we shall find harmonic
mean to get the correct average speed
Thus, average speed (harmonic mean),
$H = \frac{400}{\frac{100}{100} + \frac{100}{200} + \frac{100}{300} + \frac{100}{400}} = \frac{4800}{25} = 192$ m.p.h.
Q.18 (a) (Harmonic Mean) Milk is sold at the rates 8, 10, 12 and 15 rupees per litre in four
different months. Assuming that equal amounts are spent on milk by a family in the four months,
find the average price in rupees per month.
(b) An individual purchases three qualities of pencils. The relevant data are given below:
Quality Price Per Pencil (Rs.) Money Spent (Rs.)
A 1.00 50
B 1.50 30
C 2.00 20
Calculate the average price per pencil.
Sol. (a) Since equal amounts of money are spent by the family for each of the four months, the
average price of milk per month is given by the harmonic mean of 8, 10, 12 and 15.
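Average price of milk per month
$= \frac{1}{\frac{1}{4}\left(\frac{1}{8} + \frac{1}{10} + \frac{1}{12} + \frac{1}{15}\right)} = \text{Rs. } \frac{4 \times 120}{15 + 12 + 10 + 8} = \text{Rs. } \frac{4 \times 120}{45} = \text{Rs. } 10.67$
(b) Here we are given: Total expenditure = Rs. (50 + 30 + 20) = Rs. 100
Total number of pencils purchased = $\frac{50}{1} + \frac{30}{1.50} + \frac{20}{2} = 50 + 20 + 10 = 80$
Average price per pencil = $\frac{\text{Total expenditure}}{\text{Total no. of pencils}} = \frac{100}{80}$ = Rs. 1.25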
Note: Average price of Rs. 1.25 can also be obtained by finding the weighted harmonic mean
(H.M.) of 1, 1.5 and 2 with corresponding weights 50, 30 and 20 respectively.
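The weighted harmonic mean mentioned in the note can be sketched in Python. The function name `weighted_harmonic_mean` is introduced here for illustration; the same function also reproduces the average speed in Q.17, where the weights are the equal distances flown at each speed.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted H.M. = sum(w) / sum(w / x)."""
    return sum(weights) / sum(w / x for x, w in zip(values, weights))

# Q.18 (b): prices per pencil weighted by the money spent on each quality.
pencil_price = weighted_harmonic_mean([1.00, 1.50, 2.00], [50, 30, 20])

# Q.17: speeds weighted by the distance (100 miles) covered at each speed.
average_speed = weighted_harmonic_mean([100, 200, 300, 400], [100] * 4)

print(round(pencil_price, 2), round(average_speed, 2))
```

In both cases the weight divided by the value (money / price, distance / speed) is the quantity actually being accumulated (pencils, hours), which is why the harmonic mean is the appropriate average.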
4.2.1.4 Relationship among the Averages
When the items in a distribution (all positive) differ in size, the values of A. M., G. M., and H.
M. will also differ and will be in the following order: A. M. ≥ G. M. ≥ H. M.
i.e., the arithmetic mean is at least as large as the geometric mean, which in turn is at least as large
as the harmonic mean. The equality signs hold only when all the numbers $x_1, x_2, ..., x_n$ are the same.
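The ordering of the three averages can be illustrated with Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The data set 1, 3, 5, 7, 9 is the one from Q.13; the second data set of equal values is an added illustration of the equality case.

```python
from statistics import geometric_mean, harmonic_mean, mean

values = [1, 3, 5, 7, 9]

am = mean(values)             # arithmetic mean
gm = geometric_mean(values)   # geometric mean, about 3.936
hm = harmonic_mean(values)    # harmonic mean

print(am, round(gm, 3), round(hm, 3))
print(am >= gm >= hm)

# When every item is the same, the three averages coincide
# (up to floating-point rounding).
same = [4, 4, 4]
print(mean(same), geometric_mean(same), harmonic_mean(same))
```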
4.2.2 The Middle Value Assessment/Median
The median is the value of the item that divides the series into two equal parts, one part containing
all values greater than the median and the other all values less than it. In this way, the
median splits the series in half, with 50% of the observations falling below and 50% falling above
it.
Steps for Individual Series
1) Sort the data according to magnitude, either ascending or descending.
2) The middle item's value is the median.
Odd Number of Series
The value of the Median is [(n + 1) / 2]th item if the n is odd.
Even Number of Series
The average value of the (n / 2)th and [(n / 2) + 1]th item if n is even.
Steps for Continuous Series
1. Prepare the cumulative frequency distribution of the less-than type.
2. Find n and n/2 values.
3. Locate the class whose cumulative frequency first equals or exceeds n/2; this is known as the
median class.
4. The corresponding value of the variable gives the value of the median. In case of a continuous
frequency distribution, use the following formula:
$$\text{Median} = L + \frac{(n/2) - c.f.}{f} \times i$$
L = Lower limit of median class, n = Total frequency
i = Class interval of median class, f = Frequency of median class
c.f. = Cumulative frequency of the class just preceding the median class
The median class is the class in which the (n / 2)th item lies. The above formula assumes the
following:
1. The variable's distribution is continuous and uses an exclusive type of grouping.
2. The observations are evenly distributed within each class.
3. In case of an inclusive type of class intervals, it must first be converted to the exclusive type of grouping.
Q.19 (i) Find out the median of the marks obtained by a batch of 9 students in a class test.
Nos. 1 2 3 4 5 6 7 8 9
Marks 33 32 55 47 21 50 27 12 48
(ii) If an additional student scored 67 marks, then find the median.
Sol. (i) We arrange the marks of 9 students in ascending order of magnitude as given below:
12, 21, 27, 32, 33, 47, 48, 50, 55
Now, Median = $\left(\frac{N + 1}{2}\right)$th item = $\left(\frac{9 + 1}{2}\right)$th item = 5th item = 33
(ii) With the additional (10th) student scoring 67 marks, the median will be
Median = $\left(\frac{N + 1}{2}\right)$th item = $\left(\frac{10 + 1}{2}\right)$th item = 5.5th item
= $\frac{\text{5th item} + \text{6th item}}{2} = \frac{33 + 47}{2}$
Thus, median is 40.
Q.20 Obtain the median from the following data of weights of children in a particular locality.
Weights 0-4 4-8 8-12 12-16 16-20 20-24
No. of 3 9 18 20 16 7
Children
Sol. We form the following table,
Weights No. of Children Cumulative Frequency
0-4 3 3
4-8 9 12
8-12 18 30
12-16 20 50
16-20 16 66
20-24 7 73
We know, N = 73.
Median item = (N/2)th item = (73/2)th item = 36.5th item. Here, the 36.5th item lies in the class whose
cumulative frequency is 50, i.e., the class 12-16. Thus, 12-16 is the median class, in which
L = 12, i = 4, f = 20, C.F. = 30. Then,
$\text{Median} = L + \frac{(N/2) - C.F.}{f} \times i = 12 + \frac{36.5 - 30}{20} \times 4 = 12 + 1.3 = 13.3$
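The median calculations in Q.19 and Q.20 can be sketched in Python. `statistics.median` handles the individual series; `grouped_median` is a hypothetical helper written here to apply the formula Median = L + ((n/2 - c.f.)/f) × i, assuming exclusive classes of equal width.

```python
from statistics import median

def grouped_median(lower_limits, width, freqs):
    """Median of a continuous frequency distribution with equal class width."""
    n = sum(freqs)
    half = n / 2
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        # The median class is the first class whose cumulative
        # frequency reaches n/2.
        if f > 0 and cum + f >= half:
            return lower + (half - cum) / f * width
        cum += f
    raise ValueError("frequencies must contain at least one positive value")

# Q.19: individual series with odd and even numbers of items.
q19_marks = [33, 32, 55, 47, 21, 50, 27, 12, 48]
median_9 = median(q19_marks)             # odd n: the middle item
median_10 = median(q19_marks + [67])     # even n: mean of the two middle items

# Q.20: weights 0-4, 4-8, ..., 20-24.
q20_median = grouped_median([0, 4, 8, 12, 16, 20], 4, [3, 9, 18, 20, 16, 7])

print(median_9, median_10, round(q20_median, 1))
```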
Merits of Median
1. It is simple to understand and to calculate. Sometimes it can be found just by
inspection.
2. It is also recommended when the class intervals are unequal.
3. It is not influenced by extreme items.
Limitation of Median
1. When the number of observations is even, the median cannot be determined exactly. We merely
estimate it by taking the mean of the two middle terms.
2. It is not capable of further algebraic treatment.
3. It is affected more by sampling fluctuations than the mean is.
4.2.3 Most Frequent Value Assessment/Mode
In a set of observations, the mode is the value that appears the most frequently; in a series, the
mode is the value of the variable that is predominant. In a discrete frequency distribution, the mode
is the value of x corresponding to the maximum frequency.
When dealing with a continuous frequency distribution, the mode can be found using the following
formula:
$$\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$$
Where, L = lower limit of the modal class, f1 = frequency of the modal class, f0 = frequency of the
class preceding the modal class, f2 = frequency of the class succeeding the modal class and i = width
of the modal class.
Merits of Mode
1) Mode is not at all affected by extreme values.
2) It is best measure for qualitative type of data.
3) It is typical or representative value.
Limitations of Mode
1) It cannot always be determined uniquely, e.g., in bimodal or multimodal series.
2) It is not capable of further algebraic manipulation.
3) It is not a rigidly defined measure.
4) It is not based on each and every item of the distribution.
5) It is not a good measure for quantitative data.
Mode in case of individual series:
Q.21 (i) Find out mode from the following data of sizes of shoes sold at a shop in one day.
5, 10, 7, 8, 7, 4, 7, 5, 3, 7, 2, 6
(ii) Find out the mode from the following data:
Wages 200 250 275 350 325 225
(Rs.)
Frequency 4 6 10 18 9 1
Sol. (i) For the sake of convenience, we arrange the data in an array and observe which term occurs
most frequently:
2, 3, 4, 5, 5, 6, 7, 7, 7, 7, 8, 10
It is clear that the term 7 occurs the maximum number of times, i.e., 4 times. Thus, the mode is 7.
(ii) In this question the maximum frequency is 18 and it belongs to the value 350. Thus, Rs. 350 is
the modal wage.
Q.22 The scores that various players received during a match are displayed in the table. What are
the supplied data's mean, median, and mode?
S.N. 1 2 3 4 5 6 7
Name Mukesh Akash Devdutt Rajat Dhruv Ravindra Rohit
Rune Scored 70 42 30 42 60 0 7
Solution. The mean is given by,
1 7 70 42 30 42 60 0 7 251
x
n i 1
xi
7
7
35.85 36.
Median: To determine the median, we first arrange the data in ascending order.
Name Ravindra Rohit Devdutt Rajat Akash Dhruv Mukesh
Runs Scored 0 7 30 42 42 60 70
The number of observations is odd (n = 7), so the median is the [(n+1)/2]th = [(7+1)/2]th = 4th
observation, which is 42. The mode, i.e., the most frequent value, is also 42.
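As a quick check of Q.22, the same three averages can be obtained with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

runs = [70, 42, 30, 42, 60, 0, 7]

mean_runs = statistics.mean(runs)      # 251 / 7
median_runs = statistics.median(runs)  # middle value of the sorted data
mode_runs = statistics.mode(runs)      # most frequent value

print(round(mean_runs, 2), median_runs, mode_runs)
```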
Mode in grouped data:
Q.23 Find the mode from the following data:
Class 0-5 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45
Frequency 29 195 241 117 52 10 6 3 2
Sol. The highest frequency lies in the class 10-15. Hence, 10-15 is the modal class. To determine
the mode, we use the formula,
$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
Where, L = 10, f1 = 241, f0 = 195, f2 = 117 and i = 5
$\text{Mode} = 10 + \frac{241 - 195}{2 \times 241 - 195 - 117} \times 5 = 10 + \frac{230}{170} = 11.35$
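The grouped-mode formula can be written as a small Python function. This is a minimal sketch that assumes equal class widths and a single modal class; the function name `grouped_mode` and its parameters are illustrative.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * i for equal-width classes."""
    k = max(range(len(freqs)), key=freqs.__getitem__)  # index of modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                  # preceding class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0     # succeeding class
    return lower_limits[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width

# Data of Q.23
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35, 40]
freqs = [29, 195, 241, 117, 52, 10, 6, 3, 2]
mode_q23 = grouped_mode(lower_limits, freqs, 5)
print(round(mode_q23, 2))
```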
Q.24 Find out the mode from the following data:
Classes 0-10 10-20 20-40 40-50 50-70 70-80
Frequency 5 15 40 32 28 5
Sol. In the above distribution the class intervals are unequal, so they must first be converted into
equal class intervals. If we merge classes, only two classes, namely 0-40 and 40-80, will be formed,
which is not justified. Hence, in this case we split the wider classes into classes of size 10 and
divide their frequencies equally.
Classes 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Frequency 5 15 20 20 32 14 14 5
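Here, the maximum frequency is 32, which belongs to the 40-50 class. Hence, 40-50 is the modal class,
in which
L = 40, i = 10, f1 = 32, f0 = 20, f2 = 14
Then, $\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = 40 + \frac{32 - 20}{2 \times 32 - 20 - 14} \times 10 = 40 + \frac{120}{30} = 40 + 4 = 44$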
Relationship between Mean, Median and Mode
A symmetrical distribution is one in which the mean, median, and mode values coincide (i.e., mean
= median = mode). On the other hand, a distribution is said to be asymmetrical or skewed when
the values of the mean, median, and mode are not equal. An important relationship between the
mean, median, and mode holds in distributions that are moderately skewed or asymmetrical.
As the figure illustrates, in such a distribution the gap between the mean and the median is roughly
one-third of the gap between the mean and the mode.
According to Karl Pearson, this relationship is as follows:
Mean - Mode = 3(Mean - Median), i.e., Mode = 3 Median - 2 Mean
Median = Mode + (2/3)(Mean - Mode)
This is also known as the empirical relationship. It is used to find one of the three measures when
the other two are known for a given set of data. By rearranging its terms, the relationship can be
written in a variety of ways.
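The empirical relationship can be expressed as two one-line functions. The sketch below uses hypothetical values (mean 30, median 28) purely for illustration; the relation itself is only approximate and applies to moderately skewed data.

```python
def empirical_mode(mean, median):
    """Mode = 3 Median - 2 Mean (Karl Pearson's empirical relation)."""
    return 3 * median - 2 * mean

def empirical_median(mean, mode):
    """Median = Mode + (2/3)(Mean - Mode), the same relation rearranged."""
    return mode + (2 / 3) * (mean - mode)

# Hypothetical values, not taken from the text
mode_est = empirical_mode(mean=30, median=28)
median_back = empirical_median(mean=30, mode=mode_est)
print(mode_est, median_back)
```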
Applications of Central Tendency Measure in Machine Learning
1. Missing data is a prevalent problem in machine learning datasets. When imputing missing values
in features, measures of central tendency like the mean or median can yield an acceptable estimate
based on the available data.
2. Measures of central tendency can be used to quickly summarise the data distribution when
analyzing and visualizing datasets. Analysts and modelers can better grasp the central tendency of
various features and spot possible trends or outliers by using plots like box plots or histograms
overlaid with metrics like the mean or median.
3. Feature scaling is crucial in many machine learning techniques to ensure that features contribute
evenly to model training. The mean is used directly in standardization (subtracting the mean and
dividing by the standard deviation), whereas min-max scaling (scaling features to a given range)
relies on the minimum and maximum values rather than on a central value.
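A minimal sketch of points 1 and 3 in plain Python follows. The feature values are hypothetical, and the choice of the median for imputation is an assumption made here because the sample contains one extreme value.

```python
import statistics

# Hypothetical feature with two missing entries (None)
raw = [12.0, None, 15.0, 11.0, None, 40.0]

# 1. Impute missing values with the median of the observed values
observed = [v for v in raw if v is not None]
fill = statistics.median(observed)
imputed = [fill if v is None else v for v in raw]

# 3a. Standardization: subtract the mean, divide by the standard deviation
mu = statistics.mean(imputed)
sigma = statistics.pstdev(imputed)
standardized = [(v - mu) / sigma for v in imputed]

# 3b. Min-max scaling to the range [0, 1]
lo, hi = min(imputed), max(imputed)
min_max = [(v - lo) / (hi - lo) for v in imputed]

print(fill, [round(v, 2) for v in min_max])
```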
5. Measure of Dispersion and Variability
5.1 A measure of central tendency gives us a single number that sums up the data, but unless all
the observations are the same, this figure cannot fully characterize a set of data. A description of
the variability or dispersion of the observations is also required. Dispersion measures how far the
items deviate from a central value.
Some descriptions of dispersion:
1. Dispersion is the measure of the variation of the items.
2. The degree to which numerical data tend to spread about an average value is called the
variation or dispersion of the data.
3. Dispersion or spread is the degree of scatter or variation of the variable about a central
value.
The practical application of dispersion analysis is demonstrated by the example that follows:
S.N. 1 2 3 4 5 Total Mean
Series A 200 200 200 200 200 1000 200
Series B 200 205 202 203 190 1000 200
Series C 2 698 80 100 120 1000 200
Because the arithmetic means of the three series are identical, one might conclude that the series
are alike. They are not: Series A shows no variation at all, Series B varies only slightly around 200,
and Series C varies widely (from 2 to 698). A measure of variation or dispersion quantifies how far
the individual observations stray from a central or average value.
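The point of the table can be verified numerically. The sketch below computes the mean, range and (population) standard deviation of the three series; the dictionary layout and names are illustrative.

```python
import statistics

series = {
    "A": [200, 200, 200, 200, 200],
    "B": [200, 205, 202, 203, 190],
    "C": [2, 698, 80, 100, 120],
}

summary = {}
for name, values in series.items():
    mean = statistics.mean(values)
    data_range = max(values) - min(values)
    sd = statistics.pstdev(values)  # population standard deviation
    summary[name] = (mean, data_range, sd)
    print(name, mean, data_range, round(sd, 2))
```

All three means are equal, yet the range and standard deviation grow from Series A to Series C.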
The different measures of dispersion are: Range, Quartile Deviation, Percentile Range, Mean
Deviation, Variance and Standard Deviation.
Significance of Measuring Variation
There are four main reasons why measures of variation are necessary:
1. To determine the reliability of an average.
2. To serve as a basis for the control of variability.
3. To compare the variability of two or more series.
4. To facilitate the use of other statistical measures.
Absolute and Relative Measure of Variation
Absolute measures of dispersion are expressed in the same statistical unit (such as rupees,
kilograms, tonnes, etc.) as the original data. Examples: Range, Quartile Deviation and Standard
Deviation.
A relative measure of dispersion is the ratio of an absolute measure of dispersion to a suitable
average.
5.2.1 Investigating Data Extremes
The range is the simplest measure of dispersion. It is defined as the difference between the value
of the largest item and the value of the smallest item in the distribution.
Symbolically, Range = L - S,
Where, L = Largest item and S = Smallest item.
The relative measure corresponding to the range, called the coefficient of range, is obtained from
the following formula:
$\text{Coefficient of Range} = \frac{L - S}{L + S}$
Limitations
1. The range is not based on each and every item of the distribution.
2. It is subject to considerable fluctuation from sample to sample.
3. The range tells us nothing about the nature of the distribution between the two extreme
observations. As a result, it is a poor indicator of how the values are dispersed within
a distribution.
4. Range cannot be computed in case of open-end distributions.
Uses: The range is most commonly used in industrial quality control. It is also used by
meteorological departments in weather forecasting, etc.
5.2.2 Quartile Spread Analysis
The difference between the third and first quartiles of a frequency distribution is known as the
interquartile range. Half of this difference is the quartile deviation, also called the
semi-interquartile range. It is significant because it describes the spread of the middle 50% of the
observations.
$\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2}$
Where, Q1 = First Quartile (lower quartile) and Q3 = Third Quartile (upper quartile)
First Quartile
When the data are organized in ascending order, this is the value that 25% of the observations lie
below and 75% of the observations lie above.
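First Quartile in the individual series,
$Q_1 = \left(\frac{n + 1}{4}\right)^{\text{th}}$ observation
First Quartile for grouped data,
$Q_1 = L + \frac{(N/4) - C.F.}{f} \times i$
Where, L = lower limit of the quartile class, C.F. = cumulative frequency of the class preceding the
quartile class, f = frequency of the quartile class and i = width of the quartile class.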
Third Quartile
When data are sorted in ascending order, this value is such that 75% of observations lie below it
and 25% of observations lie above it.
Third Quartile in the individual series,
$Q_3 = 3\left(\frac{n + 1}{4}\right)^{\text{th}}$ observation
Third Quartile for grouped data,
$Q_3 = L + \frac{(3N/4) - C.F.}{f} \times i$
$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
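For an individual series, the position rules above can be sketched in Python. The text does not say how to handle a fractional position such as 3.25, so linear interpolation between neighbouring observations is assumed here; the helper `nth_item` is illustrative, and the data are the shoe sizes of Q.21.

```python
def nth_item(sorted_values, position):
    """Value at a 1-based, possibly fractional position.
    Fractional positions are linearly interpolated (an assumption)."""
    lower = int(position)
    frac = position - lower
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    a, b = sorted_values[lower - 1], sorted_values[lower]
    return a + frac * (b - a)

data = sorted([5, 10, 7, 8, 7, 4, 7, 5, 3, 7, 2, 6])
n = len(data)

q1 = nth_item(data, (n + 1) / 4)      # ((n+1)/4)th observation
q3 = nth_item(data, 3 * (n + 1) / 4)  # 3((n+1)/4)th observation
qd = (q3 - q1) / 2                    # quartile deviation
coeff_qd = (q3 - q1) / (q3 + q1)      # coefficient of quartile deviation

print(q1, q3, qd, round(coeff_qd, 3))
```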
Advantages of Quartile
1. It is particularly useful for measuring variation in open-end distributions or in those where the
data can be ranked but not measured quantitatively.
2. Extreme values have no effect on the quartile deviation.
Limitations of Quartile
Quartile deviation has limited value as a measure of dispersion because it ignores 50% of the
items, namely the first 25% and the last 25%. In addition, it:
1. Needs arrangement of observations in orderly form, either ascending or descending.
2. Is affected considerably by sampling fluctuations.
3. Is not suitable for further mathematical treatment.
5.2.3 Positional Percentile Analysis
The percentile range is also used as a measure of dispersion. The percentile range of a set of data is
defined as: Percentile Range = P90 - P10, where P90 and P10 are the 90th and 10th percentiles
respectively. The semi-percentile range, i.e., $\frac{P_{90} - P_{10}}{2}$, can also be used, but is not
commonly employed.
5.2.4 Exploring Mean Deviation
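The mean deviation is also called the average deviation. It is the average of the absolute differences
between the items of a distribution and the mean or median of the series.
Ungrouped Data:
$M.D. = \frac{1}{n}\sum |x - A| = \frac{\sum |D|}{n}$ and $\text{Coefficient of M.D.} = \frac{M.D.}{\text{Median}}$
Where, A is the average from which the deviations are taken (here, the median) and |D| = |x - A|
denotes the deviation from the median, ignoring signs.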
Merits of Mean Deviation
1. It is less affected by extreme values.
2. Because it is based on deviations from an average, it gives a better idea of the scatter around
that average.
Limitations of Mean Deviation
1. The signs of the deviations are ignored, which is mathematically objectionable.
2. It is not a fully satisfactory measure and is of limited use for practical comparisons.
3. It cannot be computed for distributions with open-end classes.
5.2.5 Data Spread Evaluation
The variance is the square of the standard deviation. The concept of variance is crucial in advanced
work, where the total variation can be split into several parts, each attributed to one of the factors
causing variation in the original series.
$\text{Variance} = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
In a frequency distribution where deviations are taken from an assumed mean, the variance can be
calculated directly as follows:
$\sigma^2 = \left[\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2\right] \times i^2$, where $d = \frac{x - A}{i}$ and i = class interval
5.2.6 Data Spread Standardization
The idea of the standard deviation was first proposed in [1]. It is by far the most important and
most widely used measure of dispersion. Because it is the square root of the mean of the squared
deviations from the arithmetic mean, the SD is also called the root mean square deviation. The
Greek letter σ (read as sigma) is used to represent the SD.
Calculation of Standard Deviation – Individual Observation
Two Methods
1. By taking deviations of the items from the actual mean
2. By taking deviations of the items from an assumed mean
Deviations taken from Actual Mean: $\sigma = \sqrt{\frac{\sum x^2}{n}}$, where $x = X - \bar{X}$
Deviations taken from Assumed Mean: $\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2}$, where $d = X - A$
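Both methods give the same answer, which the following sketch illustrates on Series B from the earlier table. The assumed mean A = 202 is an arbitrary choice made here for illustration.

```python
import math

X = [200, 205, 202, 203, 190]  # Series B
n = len(X)

# Method 1: deviations from the actual mean
mean = sum(X) / n
sd_actual = math.sqrt(sum((x - mean) ** 2 for x in X) / n)

# Method 2: deviations from an assumed mean A (any convenient value)
A = 202
d = [x - A for x in X]
sd_assumed = math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)

print(round(sd_actual, 3), round(sd_assumed, 3))
```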
Mathematical Properties of Standard Deviation
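1. Combined Standard Deviation
We have, $\sigma_{12} = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$, σ12 = Combined Standard Deviation
σ1 = Standard Deviation of first group, σ2 = Standard Deviation of second group
$d_1 = \bar{x}_1 - \bar{x}_{12}$ and $d_2 = \bar{x}_2 - \bar{x}_{12}$, where $\bar{x}_{12}$ is the combined mean
2. Standard Deviation of the first n Natural Numbers
$\sigma = \sqrt{\frac{1}{12}(n^2 - 1)}$, for the positive integers 1, 2, 3, ..., n.
3. The sum of the squares of the deviations of the items from their arithmetic mean is a minimum,
i.e., it is smaller than the sum of squared deviations taken from any other value.
4. The SD makes it possible to state, with a high degree of accuracy, where the values of a
frequency distribution lie.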
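Properties 1 and 2 can be checked numerically. In the sketch below the two groups are hypothetical data chosen only for illustration; the combined SD from the formula is compared with the SD of the pooled observations.

```python
import math
import statistics

# Hypothetical groups, not taken from the text
g1 = [10, 12, 14]
g2 = [20, 22, 24, 26]

n1, n2 = len(g1), len(g2)
m1, m2 = statistics.mean(g1), statistics.mean(g2)
s1, s2 = statistics.pstdev(g1), statistics.pstdev(g2)

m12 = (n1 * m1 + n2 * m2) / (n1 + n2)  # combined mean
d1, d2 = m1 - m12, m2 - m12
s12 = math.sqrt((n1 * s1**2 + n2 * s2**2 + n1 * d1**2 + n2 * d2**2) / (n1 + n2))

pooled = statistics.pstdev(g1 + g2)    # SD computed directly on all items

# Property 2: SD of the first n natural numbers
n = 10
sd_natural = math.sqrt((n**2 - 1) / 12)
sd_direct = statistics.pstdev(list(range(1, n + 1)))

print(round(s12, 4), round(pooled, 4), round(sd_natural, 4))
```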
Relation between Measures of Dispersion
The three most widely used measures of dispersion have a fixed relationship when the distribution
is normal. The standard deviation is the largest, followed by the mean deviation and the quartile
deviation, in approximately the following proportions:
$Q.D. = \frac{2}{3}\sigma$ or $\sigma = \frac{3}{2}\,Q.D.$ and $M.D. = \frac{4}{5}\sigma$ or $\sigma = \frac{5}{4}\,M.D.$
The measures can also be compared by the percentage of items included within one Q.D., M.D., or
S.D. measured on both sides of the mean. In a normal distribution,
$\bar{X} \pm Q.D.$ includes 50 percent of the items, $\bar{X} \pm M.D.$ includes 57.51 percent of the items and
$\bar{X} \pm S.D.$ includes 68.27 percent of the items.
5.2.7 Data Variation Ratio Analysis
The SD is an absolute measure of dispersion. The corresponding relative measure is the coefficient
of variation. This measure, developed by Karl Pearson, is the most commonly used measure of
relative variation.
$C.V. = \frac{\sigma}{\bar{X}} \times 100$
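Because the coefficient of variation is unit-free, it allows series to be compared directly. The sketch below applies it to Series B and Series C from the earlier table; the function name is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """C.V. = (population SD / mean) * 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

series_b = [200, 205, 202, 203, 190]
series_c = [2, 698, 80, 100, 120]

cv_b = coefficient_of_variation(series_b)
cv_c = coefficient_of_variation(series_c)
print(round(cv_b, 2), round(cv_c, 2))
```

The series with the larger C.V. (here Series C) is the more variable, i.e., the less consistent one.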
Q.25 Find out Quartile deviation and its coefficient from the following data:
Class 4-8 8-12 12-16 16-20 20-24 24-28 28-32 32-36 36-40
Frequency 6 10 18 30 15 12 10 6 2
Sol. We first form the table of cumulative frequencies:
Class Frequency C.F. Class Frequency C.F.
4-8 6 6 24-28 12 91
8-12 10 16 28-32 10 101
12-16 18 34 32-36 6 107
16-20 30 64 36-40 2 109
20-24 15 79
Q1 No. = (N/4)th item = (109/4)th item = 27.25th item
27.25 lies within the C.F. 34, which belongs to the class 12-16. Thus, 12-16 is the required class.
$Q_1 = L + \frac{(N/4) - C.F.}{f} \times i = 12 + \frac{4}{18}(27.25 - 16) = 12 + 2.5 = 14.5$
Q3 No. = (3N/4)th item = (3 × 109/4)th item = (327/4)th item = 81.75th item
81.75 lies within the C.F. 91, which belongs to the class 24-28. Thus, 24-28 is the required class.
$Q_3 = L + \frac{(3N/4) - C.F.}{f} \times i = 24 + \frac{4}{12}(81.75 - 79) = 24 + 0.92 = 24.92$
$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{24.92 - 14.5}{2} = \frac{10.42}{2} = 5.21$
$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{24.92 - 14.5}{24.92 + 14.5} = \frac{10.42}{39.42} = 0.264$
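The working of Q.25 can be reproduced with a small helper that applies the grouped-data formula for any fraction of N. This is a sketch assuming equal class widths; `grouped_quantile` is an illustrative name.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """L + ((fraction*N - C.F.) / f) * i for the class containing the item."""
    target = fraction * sum(freqs)
    cum = 0  # cumulative frequency of the preceding classes
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= target:
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("fraction must lie between 0 and 1")

# Data of Q.25
lower_limits = [4, 8, 12, 16, 20, 24, 28, 32, 36]
freqs = [6, 10, 18, 30, 15, 12, 10, 6, 2]

q1 = grouped_quantile(lower_limits, freqs, 4, 0.25)
q3 = grouped_quantile(lower_limits, freqs, 4, 0.75)
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)
print(round(q1, 2), round(q3, 2), round(qd, 2), round(coeff_qd, 3))
```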
Q.26 The following table gives the weekly wages (in rupees) in a certain commercial organization.
Find (i) The median and the first quartile.
(ii) The percentage of wage-earners receiving between Rs. 370 and Rs. 470 per week.
Weekly wages (Rs.) Frequency
300-320 3
320-340 8
340-360 24
360-380 31
380-400 50
400-420 61
420-440 38
440-460 21
460-480 12
480-500 2
Ans. (i) Here N = 250, so N/2 = 125 and N/4 = 62.5.
The c.f. just greater than N/2 = 125 is 177 and therefore the corresponding class 400-420 (frequency
61, class width 20) contains the median.
$Md = 400 + \frac{20}{61}(125 - 116) = 400 + 2.951 = 402.95$
The c.f. just greater than N/4 = 62.5 is 66 and therefore the corresponding class 360-380 contains
the first quartile Q1.
$Q_1 = 360 + \frac{20}{31}(62.5 - 35) = 360 + 17.742 = 377.74$
Computation of first Quartile and Median
Class-boundary Frequency (f) Less than (c.f)
300-320 3 3
320-340 8 11
340-360 24 35
360-380 31 66
380-400 50 116
400-420 61 177
420-440 38 215
440-460 21 236
460-480 12 248
480-500 2 250=N
(ii) Number of persons with wages between Rs. 370 and Rs. 470 is given by
$\frac{380 - 370}{20} \times 31 + 50 + 61 + 38 + 21 + \frac{470 - 460}{20} \times 12$
$= \frac{1}{2} \times 31 + 50 + 61 + 38 + 21 + \frac{1}{2} \times 12 = 191.5$
Hence the percentage of workers getting wages between Rs. 370 and Rs. 470 is:
$\frac{191.5}{250} \times 100 = 76.6$
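Q.26 can be checked with the sketch below. It uses the frequencies that agree with the cumulative frequency table (61 for the class 400-420, so that N = 250) and assumes that wages are spread evenly within each class. The helper names are illustrative.

```python
lowers = list(range(300, 500, 20))              # 300, 320, ..., 480
freqs = [3, 8, 24, 31, 50, 61, 38, 21, 12, 2]   # N = 250
width = 20
N = sum(freqs)

def grouped_quantile(fraction):
    """Interpolated value below which the given fraction of items lies."""
    target, cum = fraction * N, 0
    for lower, f in zip(lowers, freqs):
        if cum + f >= target:
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("fraction must lie between 0 and 1")

def count_below(x):
    """Estimated number of earners with wage below x (uniform within class)."""
    total = 0.0
    for lower, f in zip(lowers, freqs):
        covered = min(max(x - lower, 0), width)  # part of the class below x
        total += f * covered / width
    return total

median = grouped_quantile(0.5)
q1 = grouped_quantile(0.25)
between = count_below(470) - count_below(370)
pct = between / N * 100
print(round(median, 2), round(q1, 2), between, round(pct, 1))
```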
Q.27 From the following data, calculate mean deviation from median and its coefficient:
Age (in years) No. of persons
1-5 7
6-10 10
11-15 16
16-20 32
21-25 24
26-30 18
31-35 10
36-40 5
41-45 1
Ans.
Computation of Mean Deviation from Median
Age (in years)  Mid-value (m)  No. of persons (f)  Cumulative Frequency  |d| = |m - 18|  Total Deviations f|d|
1-5              3              7                    7                    15              105
6-10             8             10                   17                    10              100
11-15           13             16                   33                     5               80
16-20           18             32                   65                     0                0
21-25           23             24                   89                     5              120
26-30           28             18                  107                    10              180
31-35           33             10                  117                    15              150
36-40           38              5                  122                    20              100
41-45           43              1                  123                    25               25
Total                         123                                                         860
Median item = (1/2)(123) = 61.5th item, which lies in the 16-20 class (class boundaries 15.5-20.5).
$Md = l_1 + \frac{h}{f}\left(\frac{N}{2} - C\right) = 15.5 + \frac{5}{32}(61.5 - 33) = 15.5 + 4.453 = 19.953$
Since the median value comes out to be a fraction, we can do the same question conveniently by
taking the deviations from an arbitrary point A = 18 (say), lying in the median class, and adjusting:
$M.D. \text{ (about } Md) = \frac{1}{N}\left[\sum f|d| + (Md - A)(N_1 - N_2)\right]$
Where N1 = number of items smaller than the actual median, i.e., the sum of the frequencies up to
and including the median class = 7 + 10 + 16 + 32 = 65, and
N2 = N - N1 = 123 - 65 = 58
$M.D. \text{ (about } Md) = \frac{1}{123}\left[860 + (19.95 - 18)(65 - 58)\right] = \frac{860 + 1.95 \times 7}{123} = \frac{873.65}{123} = 7.103$
$\text{Coeff. of M.D.} = \frac{\text{Mean deviation (about } Md)}{Md} = \frac{7.103}{19.95} = 0.356$
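The shortcut with A = 18 can be cross-checked by computing the deviations from the median directly. The sketch below assumes class boundaries 0.5 below and above the stated limits (e.g., 15.5-20.5), as in the solution; variable names are illustrative.

```python
mids = [3, 8, 13, 18, 23, 28, 33, 38, 43]   # mid-values of the age classes
freqs = [7, 10, 16, 32, 24, 18, 10, 5, 1]
h = 5                                       # class width
N = sum(freqs)

# Median by interpolation inside the median class
cum = 0
for m, f in zip(mids, freqs):
    if cum + f >= N / 2:
        median = (m - h / 2) + (N / 2 - cum) / f * h
        break
    cum += f

# Mean deviation about the median, using mid-values directly
md = sum(f * abs(m - median) for m, f in zip(mids, freqs)) / N
coeff_md = md / median
print(round(median, 3), round(md, 3), round(coeff_md, 3))
```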
Q.28 The first two moments of a distribution about the value 1 are 2 and 25 respectively. Find the
mean and standard deviation of the distribution.
Ans. In the usual notation, we are given that:
A = 1, $\mu_1' = 2$ and $\mu_2' = 25$
We know that: Mean $(\bar{X}) = A + \mu_1' = 1 + 2 = 3$
Variance $(\sigma^2) = \mu_2 = \mu_2' - (\mu_1')^2 = 25 - 4 = 21$
And S.D. $(\sigma) = \sqrt{21} = 4.583$
Q.29 Ten students obtained the following marks (out of 100). Calculate mean deviation.
5, 10, 20, 25, 40, 42, 45, 48, 70, 80
Sol. Here, the mean is 38.5. Then
$M.D. \text{ about Mean} = \frac{1}{N}\sum_{i=1}^{n} |x_i - \text{mean}|$
$= \frac{1}{10}[33.5 + 28.5 + 18.5 + 13.5 + 1.5 + 3.5 + 6.5 + 9.5 + 31.5 + 41.5] = \frac{188}{10} = 18.8$
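The arithmetic of Q.29 can be confirmed in a few lines of Python; the variable names are illustrative.

```python
marks = [5, 10, 20, 25, 40, 42, 45, 48, 70, 80]
n = len(marks)

mean = sum(marks) / n
# Mean deviation about the mean: average of the absolute deviations
md_mean = sum(abs(x - mean) for x in marks) / n
print(mean, md_mean)
```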
Q.30 Calculate M.D. from median and its coefficient from the following data.
Size x 10 11 12 13 14
Frequency 3 12 18 12 3
Sol.
Size (x)  Frequency (f)  C.F.  |x - Md|  f|x - Md|
10         3              3     2          6
11        12             15     1         12
12        18             33     0          0
13        12             45     1         12
14         3             48     2          6
          N=48                            Sum=36
Median = value of (N/2)th item = value of (48/2)th item = 24th item = 12
$M.D. \text{ about Median} = \frac{1}{N}\sum_{i=1}^{n} f_i |x_i - \text{Median}| = \frac{1}{48} \times 36 = 0.75$
$\text{Coefficient of M.D.} = \frac{M.D.}{\text{Median}} = \frac{0.75}{12} = 0.0625$
Relationship between Variance and Standard Deviation
It is evident from the formula that these two measures are closely related: variance = σ².
Variance is defined as the average squared deviation from the arithmetic mean, and the standard
deviation is its square root. The smaller the value of σ², the lower the variability, i.e., the more
homogeneous the population.
Applications of Dispersion/Variation Measure in AI/ML
1. Measures of dispersion are useful for evaluating the variability and dispersion of data features
in data quality assessments. Finding features with a large variance or standard deviation can point
to more variability in the data, necessitating feature engineering or additional research into
potential problems with data quality.
2. Measures of dispersion are also important in feature selection. Features with low variability
(low variance or standard deviation) may be less useful for predictive modelling and may be
eliminated during feature selection in order to simplify models and reduce overfitting.
3. The presence of outliers or aberrant data points within features may be indicated by high
variance or standard deviation values. The robustness and generalization performance of machine
learning models can be enhanced by the use of dispersion metrics, which can be used to identify
and possibly lessen the influence of outliers.
4. Model evaluation metrics such as Mean Squared Error (MSE) or Root Mean Squared Error
(RMSE) in regression tasks are closely related to measures of dispersion. These metrics quantify
the spread of the prediction errors and so indicate how closely the model's predictions match the
target variable.
5. Measures of dispersion are used in data transformation methods to obtain the desired data
characteristics for machine learning algorithms. For example, certain algorithms perform better
when features are standardized by scaling them to have a mean of zero and a standard deviation of
one (Z-score normalization).
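Points 2 and 4 can be illustrated with a short sketch in plain Python. The feature table, the threshold of zero variance and the prediction values are all hypothetical, chosen only to show the calculations.

```python
import math
import statistics

# Point 2: drop features whose variance does not exceed a threshold
features = {                      # hypothetical feature columns
    "age": [23, 35, 41, 29, 52],
    "constant_flag": [1, 1, 1, 1, 1],
    "income": [30, 45, 28, 80, 52],
}
threshold = 0.0
kept = [name for name, col in features.items()
        if statistics.pvariance(col) > threshold]

# Point 4: MSE and RMSE measure the spread of the prediction errors
y_true = [3.0, 5.0, 2.5, 7.0]     # hypothetical targets
y_pred = [2.5, 5.0, 4.0, 8.0]     # hypothetical predictions
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)

print(kept, mse, round(rmse, 4))
```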
6. Measure of Position/Data Distribution Assessment
The shape (or pattern) of the distribution of the data within a dataset is described by measures of
shape. The distribution of data values may be symmetric or asymmetric. The 'normal distribution'
and the 'skewed distribution' are typical examples of symmetry and asymmetry respectively.
6.1 Analysing Skewed Distribution
Skewness and kurtosis are two further characteristics that help us understand a distribution
better. Skewness means “lack of symmetry”. We study skewness to get an idea of the shape of the
curve that can be drawn from the given data. A data set that is not symmetrically distributed is
said to have a skewed distribution, and its skewness may be positive or negative.
In a skewed distribution, the frequency curve is not a symmetric bell-shaped curve but is stretched
more to one side than the other. In other words, it has a longer tail on one side (left or right) than
on the other. A frequency distribution is positively skewed if its curve has a longer tail towards
the right, and negatively skewed if it has a longer tail towards the left (see figure).
In a skewed distribution, the values of the mean, median and mode fall at different points, i.e.,
they do not coincide.
Some important measures of skewness:
6.1.1 Karl Pearson's Coefficient of Skewness.
6.1.2 Coefficient of Skewness based on Moments.
6.1.3 Bowley’s Coefficient of Skewness.
6.1.1 Karl Pearson's Coefficient of Skewness
According to Karl Pearson, the coefficient of skewness is
$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma} = \frac{M - M_o}{\sigma}$
However, the mode is frequently ill-defined, making it difficult to find. For a moderately
asymmetrical (skewed) distribution, we use the following empirical relationship
between the mean, median, and mode: Mode = 3 Median - 2 Mean
We get, $S_k = \frac{\text{Mean} - (3\,\text{Median} - 2\,\text{Mean})}{\sigma} = \frac{3(\text{Mean} - \text{Median})}{\sigma}$
Skewness is related to the moments about the mean, which are defined as follows:
First Moment: $\mu_1 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$ (Average of deviations from the mean)
Second Moment: $\mu_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (Biased variance)
Third Moment: $\mu_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3$ (Average of cubed deviations from the mean)
Fourth Moment: $\mu_4 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4$
6.1.2 Measure of Skewness based on Moments
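The measure of skewness based on moments is denoted by β1 and is given by
$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$
When β1 = 0, the distribution is symmetric. Because β1 can never be negative, the direction of
skewness is read from $\gamma_1 = \pm\sqrt{\beta_1} = \mu_3 / \mu_2^{3/2}$, which takes the sign of μ3:
γ1 < 0: Negatively skewed (to the left) and γ1 > 0: Positively skewed (to the right).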
Applications of Skewness
1. In a truly symmetrical (unimodal) distribution, the mean, mode, and median, are equal.
2. In a skewed distribution the mode remains unchanged, however, the median is displaced in the
direction to which the distribution is skewed.
3. The arithmetic mean will also be displaced in the same direction and to the outside (distal end)
of the median.
4. It follows that in a skewed distribution the mean is highly sensitive to the degree of skewness
and ceases to describe a “central value” or typical value.
6.1.3 Data Distribution Peakedness/Kurtosis
To characterize a frequency distribution, we have so far examined three kinds of measures:
central tendency, dispersion and skewness. Even when all three are known, the distribution is not
fully described. A diagram of three symmetric curves with differently shaped humps makes the
argument clearer. All three curves are symmetrical about the mean and have the same variation
(range). Prof. Karl Pearson called the additional measure required to fully identify a distribution
the “convexity of the curve” or its “kurtosis”. While skewness describes the right or left tail of the
frequency curve, kurtosis gives us insight into the shape of the hump, or middle portion, of a
frequency distribution. In other words, kurtosis concerns the flatness or peakedness of the
frequency curve.
A curve that is neither flat nor peaked is called mesokurtic. The normal curve is of this type, and
the shape of its hump is taken as the standard, i.e., as normal kurtosis.
A leptokurtic curve is more peaked than the normal curve and is said to have positive kurtosis or
excess kurtosis.
A platykurtic curve, by contrast, is flatter than the normal curve and is said to have negative
kurtosis, i.e., to be lacking in kurtosis.
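As a measure of kurtosis, Karl Pearson gave the coefficient beta two (β2) or its derivative gamma
two (γ2), defined as follows:
$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4}$ and $\gamma_2 = \beta_2 - 3 = \frac{\mu_4}{\sigma^4} - 3 = \frac{\mu_4 - 3\sigma^4}{\sigma^4}$
For a normal or mesokurtic curve, β2 = 3 or γ2 = 0. For a leptokurtic curve, β2 > 3 or γ2 > 0,
and for a platykurtic curve, β2 < 3 or γ2 < 0.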
Q.31 Karl Pearson’s coefficient of skewness for a distribution is -0.4 and its coefficient of variation
is 40%. Its mean is 50, find standard deviation, median and mode.
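Ans. Karl Pearson’s coefficient of skewness is:
$S_k = \frac{\text{Mean} - M_o}{\sigma} = \frac{50 - M_o}{\sigma} = -0.4$ (given) ... (i)
$C.V. = \frac{\sigma}{\text{Mean}} \times 100 = \frac{\sigma}{50} \times 100 = 40$ (given)
$\sigma = \frac{50 \times 40}{100} = 20$
Substituting in (i), we get
$\frac{50 - \text{Mode}}{20} = -0.4 \Rightarrow \text{Mode} = 50 + 20 \times 0.4 = 58.$
Finally, the median is obtained from the empirical relationship, viz.,
Mean - Mode = 3(Mean - Median)
Or 50 - 58 = 3(50 - Median)
Or -8 = 150 - 3 Median
Median = 158/3 = 52.67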
Q.32 The first four moments of a distribution about the value 5 are 7, 70, 140 and 175
respectively. Calculate β1 and β2. Comment on the nature of the distribution.
Solution. Let $\mu_r'$ be the rth order raw moment about the value 5 and $\mu_r$ be the rth central moment.
Given: $\mu_1' = 7$, $\mu_2' = 70$, $\mu_3' = 140$ and $\mu_4' = 175$
$\mu_2 = \mu_2' - (\mu_1')^2 = 70 - 49 = 21$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = 140 - 3 \times 7 \times 70 + 2 \times 7^3 = -644$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 175 - 4 \times 7 \times 140 + 6 \times 70 \times 7^2 - 3 \times 7^4 = 9632$
$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{414736}{9261} = 44.7831$; $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{9632}{441} = 21.8413$
Skewness: $\gamma_1 = -\sqrt{\beta_1} = -6.692$ (negative because μ3 < 0), and Kurtosis: $\gamma_2 = \beta_2 - 3 = 18.8413$
Since γ1 < 0, the distribution is negatively skewed; and since γ2 > 0, the distribution is leptokurtic.
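The conversion from raw moments to central moments in Q.32 can be sketched as follows. The function name is illustrative. Note that the formula gives a negative third central moment, so γ1 is given the sign of μ3 here.

```python
import math

def central_from_raw(m1, m2, m3, m4):
    """Central moments mu2, mu3, mu4 from raw moments about any point."""
    mu2 = m2 - m1**2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1**3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4
    return mu2, mu3, mu4

mu2, mu3, mu4 = central_from_raw(7, 70, 140, 175)

beta1 = mu3**2 / mu2**3
beta2 = mu4 / mu2**2
gamma1 = math.copysign(math.sqrt(beta1), mu3)  # sign follows mu3
gamma2 = beta2 - 3

print(mu2, mu3, mu4)
print(round(beta1, 4), round(beta2, 4), round(gamma1, 3), round(gamma2, 4))
```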
Q.33 An analysis of production rejects resulted in the following figures:
No. of rejects per operator No. of operators
21-25 5
26-30 15
31-35 28
36-40 42
41-45 15
46-50 12
51-55 3
Compute mean, mode, standard deviation and coefficient of skewness and comment on the results.
Ans.
Computation of Mean and S.D.
No. of rejects per operator  No. of operators (f)  Mid-value (X)  u = (X - 38)/5  fu   fu²
21-25                          5                   23             -3              -15   45
26-30                         15                   28             -2              -30   60
31-35                         28                   33             -1              -28   28
36-40                         42                   38              0                0    0
41-45                         15                   43              1               15   15
46-50                         12                   48              2               24   48
51-55                          3                   53              3                9   27
Total                        120                                                  -25  223
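The frequencies add up to N = 120.
$\text{Mean } (\bar{X}) = A + h\,\frac{\sum fu}{N} = 38 + \frac{5 \times (-25)}{120} = 36.96$
Here the maximum frequency is 42. Thus the corresponding class 36-40 is the modal class, the class
boundaries of which are 35.5-40.5. (Grouping would also give the same result.)
$\text{Mode } (M_o) = l_1 + h\,\frac{\Delta_1}{\Delta_1 + \Delta_2} = 35.5 + 5 \times \frac{(42 - 28)}{(42 - 28) + (42 - 15)} = 35.5 + \frac{5 \times 14}{14 + 27} = 37.207$
$S.D. \ (\sigma) = h\left[\frac{\sum fu^2}{N} - \left(\frac{\sum fu}{N}\right)^2\right]^{1/2} = 5 \times \left[\frac{223}{120} - \left(\frac{-25}{120}\right)^2\right]^{1/2}$
$= 5 \times (1.8583 - 0.0434)^{1/2} = 5 \times 1.3472 = 6.736$
Hence, Karl Pearson’s coefficient of skewness based on mean, mode and S.D. is given by:
$S_k = \frac{\bar{X} - M_o}{\sigma} = \frac{36.958 - 37.207}{6.736} = -0.037$
Since Sk is negative and very close to zero, the distribution is very slightly negatively skewed,
i.e., nearly symmetrical.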
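The step-deviation working of Q.33 can be reproduced with the sketch below, using the frequencies as given in the question (which sum to N = 120). Variable names are illustrative.

```python
import math

mids = [23, 28, 33, 38, 43, 48, 53]   # mid-values of the classes
freqs = [5, 15, 28, 42, 15, 12, 3]
A, h = 38, 5                          # assumed mean and class width
N = sum(freqs)

u = [(m - A) / h for m in mids]       # step deviations
sum_fu = sum(f * ui for f, ui in zip(freqs, u))
sum_fu2 = sum(f * ui**2 for f, ui in zip(freqs, u))

mean = A + h * sum_fu / N
sd = h * math.sqrt(sum_fu2 / N - (sum_fu / N) ** 2)

k = max(range(len(freqs)), key=freqs.__getitem__)  # modal class index
lower_boundary = mids[k] - h / 2                   # 35.5
d1 = freqs[k] - freqs[k - 1]
d2 = freqs[k] - freqs[k + 1]
mode = lower_boundary + h * d1 / (d1 + d2)

sk = (mean - mode) / sd               # Karl Pearson's coefficient
print(N, round(mean, 2), round(mode, 3), round(sd, 3), round(sk, 3))
```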
Q.34 The mean, median and the coefficient of variation of 100 observations are found to be 90, 84
and 80 respectively. Find the coefficient of skewness of the above system of 100 observations.
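Ans. We know, Coefficient of variation $= \frac{S.D.}{\text{Mean}} \times 100 = 80$
Or $\frac{S.D.}{90} \times 100 = 80 \Rightarrow S.D. = \frac{80 \times 90}{100} = 72$
Karl Pearson’s coefficient of skewness is:
$S_k = \frac{3(\text{Mean} - \text{Median})}{S.D.} = \frac{3(90 - 84)}{72} = \frac{18}{72} = 0.25$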
Applications of Skewness and Kurtosis in AI/ML
1. Skewness and kurtosis are useful feature engineering methods that can be applied to either build
new features or modify pre-existing ones. For instance, developing features that represent the
skewness or kurtosis of particular variables may give machine learning models extra data,
particularly when the distributional characteristics of features are important for tasks involving
prediction.
2. Skewness and kurtosis are two metrics that can be used in outlier detection algorithms to find
observations that depart significantly from the expected distribution. Extreme skewness or
kurtosis values can signal the presence of outliers that should be investigated further or removed
in order to increase the robustness of the model.
3. Algorithms for feature selection can be informed by skewness and kurtosis measurements,
which point to features that display unusual distributional characteristics. Features with high
skewness or kurtosis values can influence feature selection in machine learning pipelines, either
by making a disproportionate contribution to model performance or by offering unique
information not captured by other features.