Things to Remember
I - What is statistics
○ “By Statistics, we mean methods specially adapted to the elucidation of quantitative data
affected to a marked extent by a multiplicity of causes”.
Yule and Kendall
● Difference between descriptive and inferential statistics
Basis of comparison: Meaning
○ Descriptive Statistics: seeks to describe the data, but does not attempt to make inferences from the sample to the whole population
○ Inferential Statistics: deals with making inferences about a population from a sample
Basis of comparison: What it does
○ Descriptive Statistics: summarizes, organizes and presents the data in a meaningful way
○ Inferential Statistics: draws conclusions and makes predictions from the data
II - Data
Data vs Information - Analysts are often bewildered by a plethora of data that makes no sense
on the surface, so they look for methods of classifying the data that convey meaning and help them
draw the right conclusions. Data needs to be arranged into information.
Raw Data - Raw data are the numbers and facts in the original format in which they were
collected. Raw data must be converted into information before it can support decision making.
Types of Data:
A good understanding of the different data types, also called measurement scales, is a crucial
prerequisite for doing Exploratory Data Analysis (EDA).
Categorical Data (represents characteristics; also called qualitative data)
● Nominal Data - Represents discrete units and is used to label variables that have no quantitative value. Nominal data has no order.
  ○ Example: What is your gender? Male, Female
  ○ Example: What programming languages do you know? Python, R, SAS
● Ordinal Data - Represents discrete and ordered units. Order is important for this data.
  ○ Example: Winners in a hackathon: First, Second, Third
  ○ Example: Proficiency in programming: High, Medium, Low
● Visualization methods: Bar chart and Pie chart
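As a minimal sketch of the nominal/ordinal distinction, assuming a small made-up survey (the responses and the `PROFICIENCY_ORDER` list are illustrative, not taken from any real dataset): nominal data can only be counted, while ordinal data can also be sorted by its defined order.

```python
from collections import Counter

# Hypothetical survey responses (illustrative values only).
languages = ["Python", "R", "Python", "SAS", "Python", "R"]      # nominal
proficiency = ["Low", "High", "Medium", "High", "Low", "High"]   # ordinal

# Nominal data: counting categories is meaningful, ordering them is not.
language_counts = Counter(languages)
most_common_language = language_counts.most_common(1)[0][0]

# Ordinal data: the categories have a rank, so sorting by that rank is meaningful.
PROFICIENCY_ORDER = ["Low", "Medium", "High"]
sorted_proficiency = sorted(proficiency, key=PROFICIENCY_ORDER.index)

print(language_counts)
print(most_common_language)
print(sorted_proficiency)
```

The category counts are the numbers a bar chart or pie chart would display.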
Numerical Data
● Discrete Data - Data which can’t be measured but can be counted. It can take on only certain values.
  ○ Example: Team members in a cricket team. It can be 11 but not 11.5.
● Continuous Data - Data which can be measured but can’t be counted. There are two types of continuous data:
  ○ Interval Data - Ordered units have the same difference, but there is no true zero point.
    Example: Temperature of a particular place: -10, -5, 0, 5, 10 (here 0 has no true meaning)
  ○ Ratio Data - Same as interval data, with the difference that the values do have an absolute zero.
    Example: Equal differences: 0, 5, 10, 15
● Visualization technique: Boxplots and Histogram
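The difference between interval and ratio data can be sketched as follows. The temperatures come from the example above; the lengths in metres are a hypothetical ratio-scale quantity added purely for illustration. On an interval scale the zero is arbitrary, so a ratio changes when the unit changes; on a ratio scale it does not.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval data: temperature in Celsius has an arbitrary zero.
temp_a_c, temp_b_c = 10.0, 5.0
temp_a_f = celsius_to_fahrenheit(temp_a_c)
temp_b_f = celsius_to_fahrenheit(temp_b_c)

ratio_in_celsius = temp_a_c / temp_b_c
ratio_in_fahrenheit = temp_a_f / temp_b_f   # differs from the Celsius ratio

# Ratio data: a length has a true zero, so ratios survive a change of units.
length_a_m, length_b_m = 10.0, 5.0          # hypothetical lengths in metres
ratio_in_metres = length_a_m / length_b_m
ratio_in_centimetres = (length_a_m * 100) / (length_b_m * 100)

print(ratio_in_celsius, ratio_in_fahrenheit)
print(ratio_in_metres, ratio_in_centimetres)
```

Because the two temperature ratios disagree, saying that 10 degrees is "twice as hot" as 5 degrees is not meaningful, whereas 10 m is twice 5 m in any unit.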
III - Measures of Central Tendency
● Mean - The mean is simply the average, and is often considered the most reliable measure of central tendency. It is computed as the sum of all values divided by the number of values.
  ○ Example: Uber rating. After every ride, you give a rating for your experience, and the final rating for the driver is calculated using the mean.
● Median - The median is the “middle” value or midpoint in your data.
  ○ Example: With 10,000 people, the median salary might be $45,000, but the range is $20,000 to $3,000,000 with a mean of $100,000. The mean is affected by extreme values, so when the data contain outliers the median is calculated to get a realistic figure.
● Mode - The mode is the value or category that occurs most often within the data.
  ○ Example: Which is the most popular video on YouTube? How will you find out? Answer: the one which has the maximum likes.
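A minimal sketch of the three measures using Python's standard `statistics` module. The ratings and salaries are made-up values chosen to show how a single extreme value pulls the mean but not the median.

```python
import statistics

# Hypothetical ride ratings on a 1-5 scale (illustrative values only).
ratings = [5, 4, 5, 3, 5, 4]
mean_rating = statistics.mean(ratings)   # sum of the values / number of values
mode_rating = statistics.mode(ratings)   # the most frequent value

# Hypothetical salaries with one extreme value.
salaries = [20_000, 30_000, 45_000, 50_000, 3_000_000]
mean_salary = statistics.mean(salaries)      # pulled upwards by the outlier
median_salary = statistics.median(salaries)  # middle value, unaffected by it

print(mean_rating, mode_rating)
print(mean_salary, median_salary)
```

Here the mean salary is far above what most people in the list earn, while the median stays at the middle value.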
IV - Measures of Dispersion
Meaning - Dispersion refers to the variability within your data. It answers the question
"What is the magnitude of departure from the average value for different groups having identical
averages?".
Different types of measures of dispersion
1) Range is the simplest of all measures of dispersion. It is calculated as the difference between the
maximum and minimum value in the data set.
Range = Largest Value − Lowest Value
The range is also the measure most affected by outliers, as it uses only the extreme values. It is
advisable to use the range only for very small distributions with no outliers.
2) Interquartile Range is the distance between the lower and upper quartiles of a dataset.
IQR = Q3 - Q1
IQR is considered a good measure of variation in skewed datasets as it is resistant to outliers.
3) Standard deviation is a measure of how much the data values deviate from the mean.
The larger the standard deviation, the greater the amount of variation.
SD = √( Σ (Data value − arithmetic mean)² / Total number of values in the dataset )
Standard deviation is a good measure of variability for normal distributions or distributions that aren’t
extremely skewed.
4) Coefficient of variation is equal to the standard deviation divided by the mean. It is a
useful measure for comparing the variability of two different datasets. For example, if
we need to compare the sales of Apple mobile phones in India and the US, the
coefficient of variation would be used, as it is a relative measure free of units of
measurement.
Standard deviation would not be useful here, as sales in India are given in INR and sales in the US in
dollars, so comparing them gives no meaningful result. The coefficient of variation is
also called the relative standard deviation.
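The four measures above can be sketched with the standard `statistics` module. The sales figures and the fixed conversion rate of 80 are assumptions made up for illustration. Two conventions are also assumed: `pstdev` is used because the SD formula above divides by the total number of values, and quartiles use the `"inclusive"` method (other quartile conventions give slightly different values).

```python
import statistics


def value_range(values):
    """Range: largest value minus lowest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (a unit-free ratio)."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical sales figures in USD (illustrative values only).
sales_usd = [2, 4, 4, 4, 5, 5, 7, 9]
# The same sales expressed in another currency at an assumed rate of 80 per USD.
sales_converted = [value * 80 for value in sales_usd]

print(value_range(sales_usd))
print(interquartile_range(sales_usd))
print(statistics.pstdev(sales_usd), statistics.pstdev(sales_converted))
print(coefficient_of_variation(sales_usd), coefficient_of_variation(sales_converted))
```

Changing the currency multiplies the standard deviation by the conversion rate but leaves the coefficient of variation unchanged, which is why it suits comparisons across units.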
V - Boxplot
A boxplot displays the five-number summary, the five numbers that help describe the centre, spread and shape of the data:
● Xsmallest
● First Quartile (Q1)
● Median (Q2)
● Third Quartile (Q3)
● Xlargest
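A minimal sketch of computing the five numbers, assuming the `"inclusive"` quartile method of the standard `statistics` module (plotting libraries may use a slightly different quartile convention). The function name `five_number_summary` and the data are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (Xsmallest, Q1, median, Q3, Xlargest) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical, evenly spaced data (illustrative values only).
data = [1, 3, 5, 7, 9, 11, 13]
summary = five_number_summary(data)
print(summary)
```

In a boxplot, Q1 and Q3 form the edges of the box, the median is the line inside it, and the whiskers typically extend towards Xsmallest and Xlargest.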
VI - Skewness
Skewness refers to a lack of symmetry. It results in unequal values of the mean, median and mode,
and the lower and upper quartiles are not situated at equal distances from the median.
● Skewness may be positive or negative
● In case of positive skewness for a distribution
○ Mean > Median > Mode
○ ( Q3 - Median) > ( Median - Q1 )
● In case of negative skewness for a distribution
○ Mean < Median < Mode
○ ( Q3 - median) < ( median - Q1 )
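The conditions for positive skewness can be checked on a small sample. This is a sketch on made-up data with a long right tail, again assuming the `"inclusive"` quartile method; real data will not always satisfy every inequality at once.

```python
import statistics

# Hypothetical sample with a long right tail (illustrative values only).
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Positive skewness: Mean > Median > Mode
mean_median_mode_ordered = mean > median > mode
# Positive skewness: (Q3 - Median) > (Median - Q1)
upper_quartile_further = (q3 - median) > (median - q1)

print(mean, median, mode)
print(mean_median_mode_ordered, upper_quartile_further)
```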
Relationships among the five-number summary and distribution shape
● Left-Skewed
  ○ Median – Xsmallest > Xlargest – Median
  ○ Q1 – Xsmallest > Xlargest – Q3
  ○ Median – Q1 > Q3 – Median
● Symmetric
  ○ Median – Xsmallest ≈ Xlargest – Median
  ○ Q1 – Xsmallest ≈ Xlargest – Q3
  ○ Median – Q1 ≈ Q3 – Median
● Right-Skewed
  ○ Median – Xsmallest < Xlargest – Median
  ○ Q1 – Xsmallest < Xlargest – Q3
  ○ Median – Q1 < Q3 – Median
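These relationships can be turned into a rough shape check. The function below is a hypothetical helper, not a standard test: it applies the three comparisons to a five-number summary and reports a shape only when all three agree. The `tolerance` parameter, which decides what counts as "approximately equal", is an assumption.

```python
def shape_from_five_numbers(x_smallest, q1, median, q3, x_largest, tolerance=0.0):
    """Classify distribution shape from the three five-number comparisons."""
    # Each pair is (left-hand distance, right-hand distance).
    comparisons = [
        (median - x_smallest, x_largest - median),
        (q1 - x_smallest, x_largest - q3),
        (median - q1, q3 - median),
    ]
    if all(left - right > tolerance for left, right in comparisons):
        return "left-skewed"
    if all(right - left > tolerance for left, right in comparisons):
        return "right-skewed"
    if all(abs(left - right) <= tolerance for left, right in comparisons):
        return "symmetric"
    return "mixed"  # the three comparisons disagree


# Hypothetical five-number summaries (illustrative values only).
print(shape_from_five_numbers(1, 4, 7, 10, 13))
print(shape_from_five_numbers(1, 2, 3, 4.75, 20))
print(shape_from_five_numbers(0, 15.25, 17, 18, 19))
```

With real data the comparisons often disagree, so the "mixed" result is a prompt to look at a boxplot or histogram rather than a conclusion.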