Statistics
Historically, statistics were used in governance for taxation, military planning, and land revenue
assessments. This early application was known as Political Arithmetic, pioneered by Sir William Petty and
John Graunt.
Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting numerical data
to aid in decision-making. It applies to various fields, including business, economics, health sciences, and
social sciences.
Statistics plays a crucial role in business by providing quantitative data and analytical tools for decision-
making.
• Quantitative Information: Provides insights on production levels, sales performance, revenue, and
expenditures.
• Time Series Analysis: Helps predict future trends based on past data (e.g., forecasting sales for the next
quarter).
• Decision Theory: Reduces uncertainty by applying probability models to business choices.
• Statistical Quality Control (SQC): Used in manufacturing industries to maintain product quality.
3. Functions of Statistics
1. Condensation: Summarizes large datasets into manageable forms (e.g., averages, graphs).
2. Comparison: Allows businesses and researchers to compare trends over time or between different
groups.
3. Forecasting: Predicts future outcomes based on existing data (e.g., demand forecasting).
4. Hypothesis Testing: Tests assumptions in research and business strategies.
5. Expectation: Aids in decision-making by estimating probable future events.
Website: [email protected] Contact us: 7889296332
4. Importance of Statistics
• Business & Industry: Used for production control, demand analysis, and financial planning.
• Science & Research: Helps in experimental design, hypothesis validation, and result interpretation.
• Banking & Insurance: Determines credit risk, customer behavior, and insurance premium calculations.
• Government & State Planning: Aids in population studies, economic planning, and national security
decisions.
5. Scope of Statistics
6. Limitations of Statistics
• Quantitative Nature: It focuses only on numerical data and cannot fully explain qualitative aspects like
emotions or opinions.
• Approximate Results: Conclusions are based on probability and sampling, which may not be 100%
accurate.
• Group Data Focus: It deals with mass data and may not be useful for studying individual cases.
• Requires Expertise: Misuse of statistical techniques can lead to incorrect conclusions.
Data in statistics is categorized based on its source, characteristics, and structure. Understanding these
classifications helps in selecting the appropriate statistical tools for analysis. Below is a detailed
explanation of each type:
• Definition: Data collected directly from the source for a specific purpose.
• Methods of Collection:
o Surveys: Questionnaires, interviews, polls.
o Experiments: Scientific or controlled testing.
o Observations: Directly recording behavior or events.
• Example:
o A company conducting a customer feedback survey to assess product satisfaction.
• Definition: Data that has already been collected by someone else for a different purpose.
• Sources:
o Books, research papers, government reports, newspapers.
o Online databases (World Bank, IMF, Census Reports).
• Example:
o Using World Bank economic reports for GDP analysis.
• Definition: Data involving two related variables, analyzed to determine relationships or correlations.
• Example:
o Income vs. Expenditure (Does higher income lead to more spending?).
o Advertising cost vs. Sales revenue (Does increased advertising increase sales?).
• Statistical Tools Used:
o Correlation coefficient (to measure relationship strength).
o Regression analysis (to predict one variable based on another).
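The two tools listed above can be sketched in a few lines. The advertising and sales figures below are hypothetical, and the helper names `pearson_r` and `fit_line` are illustrative rather than taken from these notes.

```python
from math import sqrt

# Hypothetical data: advertising cost and sales revenue for five periods.
advertising = [10, 20, 30, 40, 50]
sales = [25, 44, 68, 85, 103]

def pearson_r(x, y):
    """Pearson correlation coefficient: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

r = pearson_r(advertising, sales)
slope, intercept = fit_line(advertising, sales)

# Regression used for prediction: expected sales at an advertising cost of 60.
predicted_sales = intercept + slope * 60
print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A value of r close to +1 suggests a strong positive linear relationship; it does not by itself show that advertising causes the higher sales.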
• Definition: Data that describes qualities, characteristics, or attributes without numerical values.
• Types:
o Nominal Data: Categorical data with no specific order.
▪ Example: Gender (Male, Female, Other), Blood Group (A, B, O, AB).
o Ordinal Data: Categorical data that has a meaningful order but lacks precise numerical
differences.
▪ Example: Customer satisfaction (Low, Medium, High), Education levels (Primary,
Secondary, Higher).
• Population: The entire set of individuals or items under study (e.g., all students in a university).
• Sample: A subset of the population selected for analysis (e.g., a survey of 500 students).
Purpose of Sampling:
Studying an entire population is often impractical due to time and cost constraints. Sampling provides a
manageable and cost-effective alternative.
• Parameter: A numerical characteristic of a population (e.g., the average income of all households in a
country).
• Statistic: A numerical characteristic of a sample (e.g., the average income of 1,000 randomly selected
households).
• Descriptive Statistics: Summarizes and describes data using measures like mean, median, mode, and
graphical representations. Example: A census report.
• Inferential Statistics: Makes predictions or generalizations about a population based on sample data.
Example: Poll results predicting election outcomes.
• Null Hypothesis (H₀): Assumes no effect or difference. Example: "There is no difference in test scores
between online and offline students."
• Alternative Hypothesis (H₁): Suggests a significant effect. Example: "Students taking online courses
perform better."
• Type I Error: Rejecting a true null hypothesis (false positive). Its probability is denoted by α, the
significance level.
• Type II Error: Failing to reject a false null hypothesis (false negative). Its probability is denoted by β.
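A small simulation can make the meaning of α concrete. The sketch below assumes a two-sided z-test with a known standard deviation of 1 and samples of size 30; these choices, and the helper name `rejects_null`, are illustrative assumptions. Because every sample is drawn from a population where the null hypothesis is true, each rejection is a Type I error.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the run is repeatable

ALPHA = 0.05
# Two-sided critical value of the standard normal distribution (about 1.96).
critical_z = NormalDist().inv_cdf(1 - ALPHA / 2)

def rejects_null(sample, mu0=0.0, sigma=1.0):
    """Return True if a z-test rejects H0: population mean == mu0."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return abs(z) > critical_z

trials = 2000
# H0 is true in every trial: the data really do come from a mean of 0.
false_positives = sum(
    rejects_null([random.gauss(0.0, 1.0) for _ in range(30)])
    for _ in range(trials)
)
type_i_rate = false_positives / trials
print(f"Observed Type I error rate: {type_i_rate:.3f} (nominal alpha = {ALPHA})")
```

Over many trials the observed rejection rate should settle near the chosen α.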
1. Population:
o The complete set of individuals, items, or data being studied.
o Example: All students in a university.
2. Sample:
o A subset of the population used to represent the whole.
o Example: 200 students selected from the university.
3. Sampling Frame:
o A complete list of all units in the population from which a sample is drawn.
o Example: A university’s student database.
4. Sampling Error:
o The difference between the sample estimate and the true population parameter.
o It occurs because a sample, rather than the full population, is studied.
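The four terms above can be illustrated together. The population below is hypothetical (randomly generated marks for 10,000 students); the sampling error is simply the sample statistic minus the population parameter.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: marks of 10,000 students.
population = [random.randint(40, 100) for _ in range(10_000)]

# A simple random sample of 200 students drawn without replacement.
sample = random.sample(population, 200)

parameter = mean(population)   # population mean (usually unknown in practice)
statistic = mean(sample)       # sample mean (what is actually observed)
sampling_error = statistic - parameter

print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}, "
      f"sampling error = {sampling_error:+.2f}")
```

Drawing a different sample would give a different sampling error, which is why the error is described in terms of probability rather than as a fixed amount.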
Sampling methods are categorized into Probability Sampling (where every element has a known chance of
being selected) and Non-Probability Sampling (where elements are selected based on convenience or
judgment).
Key Takeaway
Definition: Every unit in the population has an equal and independent chance of being selected.
Methods:
3. Stratified Sampling
• Definition: The population is divided into subgroups (strata), and samples are drawn proportionally from
each stratum.
• Example: A company divides employees into departments (Finance, HR, Marketing) and samples
proportionally from each.
• Merits: Increases accuracy by ensuring representation.
• Demerits: Requires detailed knowledge of population characteristics.
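A minimal sketch of proportional stratified sampling, using the department example above. The staff list, the department sizes, and the helper name `stratified_sample` are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical staff list: (employee_id, department).
employees = (
    [(f"F{i}", "Finance") for i in range(50)]
    + [(f"H{i}", "HR") for i in range(30)]
    + [(f"M{i}", "Marketing") for i in range(20)]
)

def stratified_sample(units, key, total_size):
    """Draw a random sample from each stratum in proportion to its size."""
    strata = {}
    for unit in units:
        strata.setdefault(key(unit), []).append(unit)
    chosen = []
    for members in strata.values():
        # Proportional allocation; with awkward proportions the rounded
        # stratum sizes may not add up exactly to total_size.
        k = round(total_size * len(members) / len(units))
        chosen.extend(random.sample(members, k))
    return chosen

sample = stratified_sample(employees, key=lambda e: e[1], total_size=20)

counts = {}
for _, department in sample:
    counts[department] = counts.get(department, 0) + 1
print(counts)
```

Each department appears in the sample in the same proportion as in the workforce, which is the representation benefit noted under Merits.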
4. Cluster Sampling
• Definition: The population is divided into clusters, and entire clusters are randomly selected.
• Example: Randomly selecting schools from a district and surveying all students in those schools.
• Merits: Cost-effective for large populations.
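• Demerits: Usually less precise than simple random sampling, because units within a cluster tend to
resemble one another, which increases sampling error.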
5. Multi-Stage Sampling
• Definition: Combines multiple sampling methods, often involving multiple stages of selection.
Example:
(Used when a complete sampling frame is unavailable or when cost/time constraints exist.)
1. Convenience Sampling
3. Quota Sampling
• Definition: The population is divided into groups, and a specific number of participants are selected from
each.
• Example: Selecting an equal number of male and female respondents.
• Merits: Ensures representation.
• Demerits: May be biased if selection is not random.
4. Snowball Sampling
5. Self-Selection Sampling
b) Quota sampling.
c) Stratified Sampling.
d) Cluster sampling.
a) Judgmental sampling.
b) Convenience sampling.
c) Extensive sampling.
d) Cluster sampling.
a) Probability sampling.
c) Quota sampling.
d) Extensive sampling.
Q6. Which of the following would generally require the largest sample size?
a) Cluster sampling.
c) Systematic sampling.
Q7. In which sampling method is the population divided into different strata, with a sample taken from
each stratum?
a) Quota Sampling.
c) Stratified sampling.
d) Purposive Sampling.
Q8. In which sampling method does the investigator have complete freedom to choose the sample
according to personal judgment?
a) Quota Sampling.
c) Stratified sampling.
a) Increases.
d) Decreases.
c) Constant.
Ans: d) Decreases.
Q10. Which of the following would usually require the smallest sample size because of its efficiency?
a) Cluster sampling.
c) Quota sampling.
d) Stratified sampling.
a) Instantaneous sampling.
b) Natural sampling.
c) Helps in achieving a higher degree of accuracy if the populations to be studied are homogeneous in nature.
b) Economical in nature.
Q14. A technique of building up a list or a sample of a special population by using an initial set of
members as informants is called
a) Quota sampling.
b) Convenience Sampling.
d) Purposive sampling.
Q15. Statement R: Stratified sampling is used in situations where the population can be easily divided into groups.
Explanation:
• Statement R: "Stratified sampling is used in a situation where the population can be easily divided into
groups."
o In stratified sampling, the population is divided into homogeneous groups (strata) based on
shared characteristics (e.g., age, gender, income level).
• Statement S: "Elements within a group are heterogeneous with respect to characteristics."
o In stratified sampling, elements within a group (stratum) are homogeneous, meaning they share
similar characteristics.
o The heterogeneity exists between different strata, not within them.
(a) Estimation
Explanation:
• Inferential statistics is the branch of statistics that studies unknown aspects of a population
distribution by using data from a sample. It involves estimation and hypothesis testing to make
predictions or inferences about a larger population.
• Descriptive statistics (d) only summarizes and describes the features of a dataset without making
predictions.
Q16. Statistics that estimate the values for a population from a sample of that population are called _____.
A. Inferential
B. Descriptive
C. Predictive
D. Factual claim
Explanation:
• Inferential statistics is used to estimate population values based on a sample. It involves drawing
conclusions, making predictions, and testing hypotheses about a population using sample data.
• Descriptive statistics (B) only summarizes or describes data but does not make predictions.
• Predictive statistics (C) is not a standard statistical term but may relate to predictive modeling, which
uses data to forecast future outcomes.
• Factual claim (D) is not related to statistical estimation.
o The arithmetic mean is highly sensitive to outliers or extreme values. A single very large or very small value
can significantly distort the mean, making it unrepresentative of the data set.
o The arithmetic mean cannot be calculated for open-ended data sets, where the class intervals lack
specific upper or lower limits (e.g., "Above 100" or "Below 20").
3. Non-graphical:
o Unlike the median or mode, the arithmetic mean does not lend itself to a clear graphical representation,
which can make its interpretation more difficult visually.
4. Misleading Results:
o In certain cases, the mean may provide misleading results, particularly when the data contains outliers.
For example, in a salary dataset where most salaries are around $30,000, but a few employees earn
millions, the mean will be skewed upwards and may not represent the central tendency accurately.
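The salary case above can be checked numerically. The figures are hypothetical: five salaries near $30,000 and one of $2,000,000.

```python
from statistics import mean, median

# Hypothetical salaries: most are around $30,000, one is an extreme value.
salaries = [28_000, 29_000, 30_000, 31_000, 32_000, 2_000_000]

typical_mean = mean(salaries[:-1])   # mean without the extreme value
skewed_mean = mean(salaries)         # mean with the extreme value included
middle_value = median(salaries)      # median with the extreme value included

print(f"mean without outlier: {typical_mean:,.0f}")
print(f"mean with outlier:    {skewed_mean:,.0f}")
print(f"median with outlier:  {middle_value:,.0f}")
```

One extreme salary pulls the mean far above what any typical employee earns, while the median stays close to $30,000.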
1. Simplicity:
o It is the easiest and most straightforward measure of central tendency. The calculation involves simply
adding up the values and dividing by the number of observations.
2. Comprehensive:
o The arithmetic mean takes all observations in the data set into account, making it a comprehensive
measure. Every value influences the final mean.
3. Equal Weight:
o The arithmetic mean gives equal weight to every data point. Each observation has the same impact on the
final value, unlike the weighted mean where some observations have more influence.
4. Calculated Value:
o Unlike the median, which depends on the position of data points, and the mode, which depends on their
frequency, the arithmetic mean is a calculated value that uses all the data in the set.
5. Objectivity:
o The mean is not influenced by personal biases. It’s based purely on the numerical data, making it a purely
objective measure.
6. Mathematical Utility:
o The arithmetic mean is amenable to further algebraic treatment (for example, the means of several groups
can be combined into an overall mean), making it applicable in various statistical analyses.
7. Comparative Base:
o The arithmetic mean serves as a strong basis for comparing multiple datasets. It gives a consistent
measure of central tendency that can be used to compare different data sets or distributions.
8. Stability:
o The arithmetic mean is a stable measure of central tendency. Minor changes in the data typically have a
minimal impact on the mean, making it a reliable measure in many cases.
Individual Series: Data points are not grouped; each value is treated individually.
Discrete Series: Data values are repeated; frequencies are considered in the calculation.
Continuous Series: Data values are grouped into intervals; midpoints of these intervals are used for the
calculation.
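The three cases can be sketched as follows, using the direct method in each. All values, frequencies, and class intervals below are hypothetical.

```python
# Individual series: each value counted once.
individual_values = [10, 20, 30]
individual_mean = sum(individual_values) / len(individual_values)

# Discrete series: each value x occurs f times, so mean = sum(f * x) / sum(f).
values = [10, 20, 30]
frequencies = [2, 3, 5]
discrete_mean = sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)

# Continuous series: class midpoints stand in for the values in each interval.
classes = [(0, 10), (10, 20), (20, 30)]
class_frequencies = [2, 3, 5]
midpoints = [(lower + upper) / 2 for lower, upper in classes]
continuous_mean = (
    sum(f * m for m, f in zip(midpoints, class_frequencies)) / sum(class_frequencies)
)

print(individual_mean, discrete_mean, continuous_mean)
```

The continuous-series result is an approximation, since it assumes the observations in each class are centred on the midpoint.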
• GM is generally used when data is multiplicative or when you need to average ratios, percentages, or
growth rates.
• HM is often used in scenarios like calculating average rates or speeds, where the data points are
inversely related (i.e., the smaller values have a greater impact).
• AM ≥ GM ≥ HM holds true for any set of positive numbers, with equality only when all values are equal.
• This relationship is useful in various fields such as economics, finance, and statistics, where different
types of averages are used to analyze data with different characteristics.
• The inequality reflects the fact that the arithmetic mean is pulled most strongly by large values, while the
harmonic mean gives more weight to smaller values and is often used in situations where rates
are involved.
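The relationship can be verified with Python's standard `statistics` module. The two values used here are an arbitrary illustration.

```python
from statistics import mean, geometric_mean, harmonic_mean

data = [2, 8]  # any positive numbers would do

am = mean(data)            # (2 + 8) / 2
gm = geometric_mean(data)  # square root of 2 * 8
hm = harmonic_mean(data)   # 2 / (1/2 + 1/8)

print(f"AM = {am}, GM = {gm:.4f}, HM = {hm:.4f}")

# With identical values the three means coincide.
equal_data = [5, 5, 5]
all_equal = (
    mean(equal_data),
    round(geometric_mean(equal_data), 9),
    harmonic_mean(equal_data),
)
```

For exactly two positive values the means are also linked by GM² = AM × HM (here 16 = 5 × 3.2).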
The median can also be found graphically using a cumulative frequency curve, or ogive. Plot the cumulative
frequency on the vertical axis against the upper bounds of the class intervals on the horizontal axis. Then locate
N/2 (half the total frequency) on the vertical axis, draw a horizontal line to the curve, and drop a perpendicular to
the horizontal axis; the point where it meets the axis is the median.
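Reading the median off an ogive amounts to linear interpolation along the curve. The sketch below assumes a "less than" ogive drawn with straight segments between points; the class bounds, frequencies, and the helper name `median_from_ogive` are hypothetical.

```python
from itertools import accumulate

# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_bound = 0
upper_bounds = [10, 20, 30, 40, 50]
frequencies = [5, 8, 12, 10, 5]
cumulative = list(accumulate(frequencies))  # heights of the ogive points

def median_from_ogive(lower_bound, upper_bounds, cumulative):
    """Find the x-value where the ogive reaches half the total frequency."""
    target = cumulative[-1] / 2  # the N/2 position on the vertical axis
    prev_x, prev_cf = lower_bound, 0
    for x, cf in zip(upper_bounds, cumulative):
        if cf >= target:
            # Interpolate along the straight segment that crosses N/2.
            return prev_x + (target - prev_cf) / (cf - prev_cf) * (x - prev_x)
        prev_x, prev_cf = x, cf

ogive_median = median_from_ogive(lower_bound, upper_bounds, cumulative)
print(f"cumulative frequencies: {cumulative}")
print(f"median read from the ogive: {ogive_median:.2f}")
```

Here N = 40, so the target height is 20, which falls in the 20-30 class.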
Merits of Median:
1. Resistant to Outliers: The median is not influenced by extreme values, making it a better measure for skewed
distributions.
2. Simple to Calculate: Especially useful for ordinal data and when values are arranged in order.
3. Applicable to Skewed Data: Provides a better representation of the central value for skewed distributions than
the mean.
Demerits of Median:
1. Ignores Data Variability: The median only considers the middle value and ignores the spread or variation of the
rest of the data.
2. Not Suitable for Further Calculations: Unlike the mean, the median cannot be used in most algebraic
operations.
3. Difficult for Large Datasets: Sorting the data to find the median can be time-consuming for large datasets.
• Median: Better when data is skewed or contains outliers, as it is less affected by extreme values.
• Mean: Best when data is symmetrically distributed or when every data point is important.
The mode is the value that occurs most frequently in a data set. It is a measure of central tendency that identifies the
peak or most frequent value in a distribution.
• Most Frequent Value: The mode is the value or values that appear most frequently in a dataset.
• Useful for Categorical Data: It is particularly useful for nominal data, where the mean and median may not be
applicable.
• Can Be Multimodal: A dataset may have one mode (unimodal), more than one mode (bimodal or multimodal), or
no mode at all if all values occur with equal frequency.
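The cases above can be told apart with `statistics.multimode`, which returns every value tied for the highest frequency. The helper `describe_mode` and the sample data are illustrative.

```python
from statistics import multimode

def describe_mode(values):
    """Classify a dataset as unimodal, bimodal, multimodal, or having no mode."""
    modes = multimode(values)
    # If every distinct value is tied for the top frequency, nothing stands out.
    if len(modes) > 1 and len(modes) == len(set(values)):
        return "no mode", []
    labels = {1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes), "multimodal"), modes

one_mode = describe_mode([1, 2, 2, 3])
two_modes = describe_mode([1, 1, 2, 2, 3])
no_mode = describe_mode([1, 2, 3])

# The mode also works for nominal (categorical) data such as blood groups.
blood_groups = describe_mode(["A", "O", "O", "B", "AB", "O"])

print(one_mode, two_modes, no_mode, blood_groups)
```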
The mode can also be determined graphically using a Histogram or Frequency Polygon. Here's how it can
be done:
1. Histogram:
o A histogram represents the frequencies of the classes with bars. The mode is located at the
highest peak of the histogram (i.e., the tallest bar).
2. Frequency Polygon:
o The mode can also be found at the peak of the frequency polygon, which is a line graph created
by joining the midpoints of the tops of the bars in the histogram.
Merits of Mode:
1. Unaffected by Extreme Values: The mode is not influenced by extreme values or outliers in the data.
2. Best Indicative Value: It best indicates the most frequent value in the data set, which is useful in
understanding common occurrences.
3. Ease of Calculation: It is relatively easy to calculate, especially in situations where you only need to
know the most frequent value.
4. Graphical Determination: The mode can be determined both mathematically and graphically, making
it versatile.
5. Applicability to Open-ended Classes: The mode can be determined even for open-ended classes,
unlike the mean or median.
Demerits of Mode:
1. Not Based on All Values: The mode is not based on all the values of a series, making it less reliable in
some cases.
2. Lacks Algebraic Treatment: The mode cannot be used for further algebraic calculations and is less
suitable for advanced statistical operations.
3. Difficulty in Ascertainment: In some cases, the mode may be difficult to determine, especially in
bimodal or multimodal distributions or where several frequencies are similar.
4. Fails to Represent Small Values: The mode may fail to represent small values in a series, which might
affect its usefulness as a central tendency measure.
Dispersion in statistics
Merits of Range:
• Simplest Measure of Dispersion – Easy to understand and compute.
• Quick Calculation – Requires only the highest and lowest values.
Limitations of Range
• Not Based on All Observations – Only considers the highest and lowest values, ignoring the rest of the
data.
• Highly Affected by Sampling Fluctuations – A small change in extreme values can significantly alter the
range.
• Unreliable Measure of Dispersion – Does not provide a complete picture of data variability.
• Cannot Be Used for Open-End Classifications – If a class interval is unbounded (e.g., "50 and above"),
range cannot be accurately calculated.
Uses of Range:
• Quality Control – Used in manufacturing to monitor product consistency.
• Stock Market Analysis – Helps track fluctuations in share prices.
• Weather Forecasting – Measures temperature variations over a period.
• Everyday Life – Used in analyzing income disparity, test scores, etc.
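As a quick illustration, the range is the highest value minus the lowest value. The temperatures below are hypothetical, and the helper name `value_range` is illustrative.

```python
# Hypothetical daily temperatures in degrees Celsius.
temperatures = [21, 24, 19, 30, 26]

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

original_range = value_range(temperatures)   # 30 - 19

# Changing only one extreme value changes the range, no matter how many
# observations lie in between.
with_heatwave = temperatures + [41]
heatwave_range = value_range(with_heatwave)  # 41 - 19

print(original_range, heatwave_range)
```

A single new extreme doubles the range here, which shows both how quick the measure is to compute and why it is listed as unreliable.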
Example:
Given data: 5, 10, 15, 20, 25
Question:
Calculate the Standard Deviation (SD) for the following data set:
10, 12, 15, 18, 20
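One way to work this through is shown below. The question does not say whether the data are a full population or a sample, so the sketch computes both conventions: dividing by N (population SD) and by N − 1 (sample SD).

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [10, 12, 15, 18, 20]

# Step 1: the mean of the data.
data_mean = mean(data)                                  # 75 / 5 = 15

# Step 2: squared deviations from the mean and their total.
squared_deviations = [(x - data_mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)                # 25 + 9 + 0 + 9 + 25 = 68

# Step 3: divide by N (population) or N - 1 (sample), then take the square root.
population_sd = sqrt(sum_of_squares / len(data))        # sqrt(13.6)
sample_sd = sqrt(sum_of_squares / (len(data) - 1))      # sqrt(17)

# Cross-check against the standard library.
library_population_sd = pstdev(data)
library_sample_sd = stdev(data)

print(f"population SD = {population_sd:.4f}, sample SD = {sample_sd:.4f}")
```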
1. Always Non-Negative – SD is zero or positive; zero only if all values are identical.
2. Measures Spread Around Mean – Indicates how much data deviates from the average.
3. Higher SD = More Variation – Greater spread means a larger SD.
4. Sensitive to Outliers – Extreme values can inflate SD.
5. Useful in Normal Distribution – Helps determine confidence intervals (68%-95%-99.7% Rule).
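The 68%-95%-99.7% rule in point 5 can be checked against the standard normal distribution. This sketch uses `statistics.NormalDist` from Python's standard library; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

The rule applies to data that are approximately normal; for strongly skewed data the proportions can differ noticeably.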
Lorenz Curve
The Lorenz curve is a graphical measure of the degree of inequality in the distribution of income and wealth.
It can be used to compare inequality between countries.
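A minimal sketch of how the points of such a curve can be computed: sort the incomes, then pair the cumulative share of people with the cumulative share of income. The four incomes and the helper name `lorenz_points` are hypothetical.

```python
from itertools import accumulate

def lorenz_points(incomes):
    """Return (cumulative share of people, cumulative share of income) pairs."""
    ordered = sorted(incomes)          # poorest first
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]              # the curve starts at the origin
    for i, running_total in enumerate(accumulate(ordered), start=1):
        points.append((i / n, running_total / total))
    return points

incomes = [40, 10, 30, 20]             # hypothetical incomes of four people
points = lorenz_points(incomes)
for people_share, income_share in points:
    print(f"poorest {people_share:.0%} of people hold {income_share:.0%} of income")
```

Under perfect equality every point would lie on the diagonal where the two shares are equal; the further the curve sags below that line, the greater the inequality.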
Q1. For a symmetrical distribution, Q1 and Q3 are 20 and 60 respectively. The value of the median will be:
A. 20
B. 30
C. 40
D. 50
Q2. The arithmetic mean of the marks obtained by 50 students was calculated as 44. It was later
discovered that a score of 36 was misread as 56. Find the correct value of arithmetic mean of the
marks obtained by the students.
A. 43
B. 43.6
C. 45
D. 50
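The correction described in Q2 can be worked through step by step: recover the recorded total from the mean, swap the misread score for the true one, and divide again. The variable names are illustrative.

```python
students = 50
reported_mean = 44
misread_score = 56   # value wrongly used in the original calculation
actual_score = 36    # value that should have been used

# Total implied by the reported mean.
reported_total = students * reported_mean               # 2200

# Remove the wrong score and add the right one.
corrected_total = reported_total - misread_score + actual_score

corrected_mean = corrected_total / students
print(f"corrected total = {corrected_total}, corrected mean = {corrected_mean}")
```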
A. applied statistics
B. mathematical statistics
C. industry statistics
D. both a and b
Q4. If each observation of the data is increased by 5, then what happens to its mean?
A. is increased by 4
B. is increased by 5
C. is decreased by 4
D. is decreased by 5
Q5. The measurement of the spread or scatter of the individual values around the central point is called:
A. Measures of dispersion
B. Measures of central tendency
C. Measures of skewness
D. Measures of kurtosis
Q6. The correct relationship among Arithmetic Mean (AM), Geometric Mean (GM) and Harmonic Mean
(HM) is
A. GM = AM + HM
B. AM < HM
C. (GM)² = HM × AM
D. AM = GM – HM
Q7. The degree to which numerical data tend to spread about an average value is called:
A. Constant
B. Flatness
C. Variation
D. Skewness
Q8. If the mean is 20 and the median is 25, what is the mode?
a) 15
b) 30
c) 35
d) 40