MMW REVIEWER BA SOCIO 1A
Descriptive Statistics (DS)- used to summarize and describe data (i.e., organize,
summarize, simplify, and describe it).
Under DS- (1) measure of dispersion, (2) measure of frequency, (3) measure of
central tendency, and (4) measure of relative position.
Measure of dispersion- helps us know the spread of a data set.
Range- difference between the highest value (HV) and the lowest value (LW)
Pros:
Easy to compute
Easy to understand
Cons:
It can be distorted by a single extreme value (outlier)
Only two values are used for calculation
Formula: R=HV-LW
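A short sketch of the range formula, using made-up scores, shows how a single outlier distorts the result. The function name value_range and the data are illustrative only.

```python
def value_range(data):
    """Return the range R = HV - LW of a non-empty list of numbers."""
    return max(data) - min(data)


scores = [70, 72, 75, 78, 80]          # hypothetical scores
scores_with_outlier = scores + [150]   # same scores plus one extreme value

print(value_range(scores))               # 10
print(value_range(scores_with_outlier))  # 80
```

Only the highest and lowest values enter the calculation, so adding one extreme score changes the range from 10 to 80.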
Variance and standard deviation – the most common and useful measures of variability.
They provide information about how the data vary about the mean.
Standard Deviation (σ) – the most widely used measure of dispersion. The more spread
apart the data, the higher the deviation. It is the square root of the variance: σ = √σ²
Variance (σ²) – a measure of variation that considers the position of each observation
relative to the mean of the set. It is the average of the squared deviations from the
population (or sample) mean.
Sample standard deviation: s = √( ∑(x − x̄)² / (n − 1) )
Population standard deviation: σ = √( ∑(x − μ)² / n )
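The two formulas can be checked with a small worked example. This is a sketch with hypothetical data; it computes the sample version (divide by n − 1) and the population version (divide by n) by hand, then compares them with Python's standard statistics module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n  # 5.0

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)  # 32.0

population_variance = squared_deviations / n    # divide by n
sample_variance = squared_deviations / (n - 1)  # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_sd)  # 4.0 2.0
print(sample_variance, sample_sd)

# The standard library gives the same results.
assert math.isclose(population_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))
```

The sample values come out slightly larger than the population values because the same sum is divided by n − 1 instead of n.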
Measure of central tendency- numerical descriptive measure which indicates the
center of the distribution (E.g. Mean, mode, median).
In layman’s terms it is an “average”
Mean- the most popular measure of central tendency; also called the arithmetic mean
Formula: Mean = (Sum of all the observations/Total number of observations)
Properties:
1. Mean can be applied to interval and ratio data.
2. A set of data has a unique mean.
3. All values in the data set are included in computing the mean.
4. The mean is affected by the outliers.
5. It cannot be computed for the data in a frequency distribution with an open-
ended class.
6. Mean is most appropriate for symmetrical data.
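As an illustration of the formula and of property 4, the sketch below computes the mean of a small hypothetical data set, then adds one outlier and computes it again.

```python
def mean(data):
    """Arithmetic mean: sum of all observations / number of observations."""
    return sum(data) / len(data)


values = [10, 12, 14, 16, 18]   # hypothetical data
with_outlier = values + [100]   # same data plus one extreme value

print(mean(values))        # 14.0
print(mean(with_outlier))  # roughly 28.3, pulled upward by the outlier
```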
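Weighted mean- used when various classes or groups contribute differently to the
total. Each value is multiplied by its weight, the products are added, and the sum is
divided by the total of the weights: Weighted mean = ∑(w·x) / ∑w. When the values are
group means and the weights are the group sizes, the result is the mean of all the
observations combined. A weighted average is most often computed to account for how
frequently each value occurs in a data set.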
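A minimal sketch of a weighted mean, using the standard definition ∑(w·x) / ∑w. The grades and units below are hypothetical, and weighted_mean is an illustrative name.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


grades = [90, 80, 70]  # hypothetical grades in three subjects
units = [3, 2, 1]      # hypothetical unit load (weight) of each subject

print(weighted_mean(grades, units))   # (270 + 160 + 70) / 6, about 83.3
print(weighted_mean(grades, [1, 1, 1]))  # 80.0, equal weights give the plain mean
```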
Median- middle value of the data array when the data are arranged in order
If n is odd, the median is the middle value. If n is even, add the two middle values
then divide by 2
Properties
1. Can be used for ordinal, interval and ratio data, but is more valuable in an
ordinal type of data.
2. Median is unique for a given data set.
3. Median is not affected by outliers.
4. It can be computed for an open-ended frequency distribution.
5. Median is most appropriate for skewed data.
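The odd and even cases can be sketched as follows. The data are hypothetical; the last call illustrates property 3, since an extreme value does not move the median.

```python
def median(data):
    """Middle value of the sorted data (average of the two middle
    values when the number of data points is even)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of two


print(median([3, 1, 2]))            # 2
print(median([4, 1, 3, 2]))         # 2.5
print(median([1, 2, 3, 4, 1000]))   # 3
```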
Mode- number/s or term/s that appear most frequently in a data set.
Types of mode
Unimodal- one mode
Bimodal- two modes
Multimodal – more than two modes
Note: a data set can have no mode
Properties:
1. Mode is found by locating the most frequently occurring value.
2. The easiest average to compute.
3. There can be more than one mode or even no mode in any given data set.
4. Mode is not affected by the extreme small or large values.
5. Mode can be applied for nominal, ordinal, interval, and ratio data.
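A sketch of finding the mode or modes by counting. One assumption is made here: a data set in which every distinct value appears equally often (and there is more than one distinct value) is reported as having no mode. The function name modes is illustrative.

```python
from collections import Counter


def modes(data):
    """Return the list of most frequent values, or [] when no value
    occurs more often than the others (assumed 'no mode' convention)."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    if len(winners) == len(counts) and len(counts) > 1:
        return []   # every value is equally frequent: no mode
    return winners


print(modes([1, 2, 2, 3]))        # [2]        unimodal
print(modes([1, 1, 2, 2, 3]))     # [1, 2]     bimodal
print(modes([1, 2, 3]))           # []         no mode
print(modes(["a", "b", "a"]))     # ['a']      works for nominal data too
```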
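Measure of relative position- indicates where a value stands relative to the other
values in the data set (e.g., quartiles, percentiles, deciles, z-scores).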
Quartiles
Q1 (First Quartile): Also known as the lower quartile.
Q2 (Second Quartile): This is the median of the data set, dividing it into two equal
halves.
Q3 (Third Quartile): Also called the upper quartile
IQR (Interquartile Range): The difference between the third and first quartile (Q3 -
Q1) is called the interquartile range.
For finding the Median:
If the number of data points is odd, the median is the middle number.
If the number of data points is even, the median is the average of the two middle
numbers.
This splits the data into two halves.
For large data sets, you can use a formula to determine the exact
position of the quartiles:
Q1 position = (n+1) * ¼
Q3 position = (n+1) * ¾
Where “n” is the number of data points.
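The position formulas can be sketched in code. The notes do not say what to do when a position is not a whole number, so this sketch assumes linear interpolation between the two neighbouring values. The function names and data are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data; fractional positions
    are linearly interpolated between the two neighbouring values."""
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part (0 if exact)
    upper = min(lower + 1, len(ordered))
    low_value = ordered[lower - 1]
    high_value = ordered[upper - 1]
    return low_value + fraction * (high_value - low_value)


def quartiles(data):
    """Return (Q1, Q3, IQR) using the (n + 1) position formulas."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 data points")
    q1 = value_at_position(ordered, (n + 1) * 1 / 4)
    q3 = value_at_position(ordered, (n + 1) * 3 / 4)
    return q1, q3, q3 - q1


# n = 7, so Q1 is at position 2 and Q3 is at position 6
print(quartiles([1, 3, 5, 7, 9, 11, 13]))   # (3.0, 11.0, 8.0)
```

Other textbooks and software use slightly different quartile conventions, so results on small data sets may differ from this sketch.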
A percentile is a measure used in statistics to describe the relative standing of a
value within a data set.
Key points about percentile
Percentile Rank: If a value is at the pth percentile, it means that p% of the data
points are less than or equal to that value.
Percentile Rank Formula: To calculate the percentile rank for a specific data point in
a data set: P = (k/n) × 100
Where k is the number of data points less than or equal to the value and n is the
total number of data points.
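A small sketch of the percentile rank formula P = k/n × 100. It assumes k counts the data points less than or equal to the value, which matches the definition of percentile rank given above; some textbooks count only the values strictly below. The scores are hypothetical.

```python
def percentile_rank(data, value):
    """Percentile rank P = k / n * 100, where k is the number of data
    points less than or equal to value (assumed convention)."""
    k = sum(1 for x in data if x <= value)
    return k * 100 / len(data)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical scores

print(percentile_rank(scores, 80))    # 60.0
print(percentile_rank(scores, 100))   # 100.0
```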
Deciles are similar to percentiles, but instead of dividing the data into 100 equal
parts (percentiles), deciles divide the data into 10 equal parts. Each decile
represents 10% of the data.
Key Points about Deciles:
D1: The 1st decile represents the point below which 10% of the data fall.
D2: The 2nd decile represents the point below which 20% of the data fall.
D3: The 3rd decile represents the point below which 30% of the data fall.
And so on, up to D9, which represents the point below which 90% of the data fall.
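As a sketch, the nine decile cut points can be obtained with statistics.quantiles from Python's standard library by asking for 10 equal parts. The data here (the integers 1 to 99) are hypothetical and chosen so the cut points are easy to read.

```python
import statistics

data = list(range(1, 100))  # hypothetical data: the integers 1 to 99

# n=10 splits the data into 10 equal parts and returns the 9 cut points
deciles = statistics.quantiles(data, n=10)

for i, cut in enumerate(deciles, start=1):
    print(f"D{i} = {cut}")
```

For this data set D1 is 10, D5 is 50 (the median), and D9 is 90.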
A z-score, also known as a standard score, is a statistical measurement that
describes a data point’s position in relation to the mean of a group of values.
It tells you how many standard deviations a value is from the mean.
Z-scores can be positive or negative:
A positive z-score means the data point is above the mean.
A negative z-score means the data point is below the mean.
Z-Score Formula for a Population:
z = (x − μ) / σ
Where:
x = the data point
μ = the population mean
σ = the population standard deviation
Z-Score Formula for a Sample:
z = (x − x̄) / s
Where:
x = the data point
x̄ = the sample mean
s = the sample standard deviation
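A quick sketch of the z-score calculation. The same function serves both formulas, since only the source of the mean and standard deviation differs. The exam numbers are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical exam with mean 70 and standard deviation 10
print(z_score(85, 70, 10))   # 1.5   above the mean
print(z_score(55, 70, 10))   # -1.5  below the mean
print(z_score(70, 70, 10))   # 0.0   exactly at the mean
```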
Boxplot
John Wilder Tukey (1915-2000) introduced the boxplot in the 1970s. A boxplot (or
box-and-whisker plot) is a graph of a data set obtained by drawing a horizontal line
from the minimum data value to the first quartile (Q_1), drawing a horizontal line from
the third quartile (Q_3) to the maximum data value, and drawing a box whose vertical
sides pass through Q_1 and Q_3, with a vertical line inside the box passing through
the median or second quartile (Q_2).
Skewness
A box plot can often be used to identify skewness. A distribution of data is skewed
if it is not symmetric and extends more to one side than to the other.
The boxplot will give the following information:
1. If the median is near the center of the box, the distribution is approximately
symmetric.
2. If the median falls to the right of the center of the box, the distribution is
negatively skewed.
3. If the median falls to the left of the center of the box, the distribution is
positively skewed.
4. If the lines are about the same length, the distribution is approximately
symmetric.
5. If the left line is longer than the right line, the distribution is negatively skewed.
6. If the right line is longer than the left line, the distribution is positively skewed.
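Rules 1 to 3 can be sketched as a small function that looks at where the median sits inside the box. The notes do not define how close counts as "near the center", so the tolerance parameter below (5% of the box width) is an assumption, and the function name is illustrative.

```python
def skew_from_box(q1, median, q3, tolerance=0.05):
    """Classify skewness from the position of the median inside the box.

    tolerance is a hypothetical cut-off: the median counts as 'near the
    center' when it is within this fraction of the box width from it.
    """
    width = q3 - q1
    if width == 0:
        return "approximately symmetric"
    center = (q1 + q3) / 2
    offset = (median - center) / width
    if abs(offset) <= tolerance:
        return "approximately symmetric"
    if offset > 0:
        return "negatively skewed"   # median right of center
    return "positively skewed"       # median left of center


print(skew_from_box(10, 20, 30))   # approximately symmetric
print(skew_from_box(10, 28, 30))   # negatively skewed
print(skew_from_box(10, 12, 30))   # positively skewed
```

The same idea could be applied to rules 4 to 6 by comparing the lengths of the two lines (minimum to Q1, and Q3 to maximum).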
PROBABILITIES AND NORMAL DISTRIBUTION
The normal distribution or Gaussian distribution is a continuous probability
distribution that describes data that clusters around a mean. The graph of the
associated probability density function is bell-shaped, with a peak at the mean, and
is known as the Gaussian function or bell curve. The normal curve was developed
mathematically in 1733 by Abraham de Moivre (1667-1754) as an approximation to
the binomial distribution. Carl Friedrich Gauss (1777-1855) used the normal curve to
analyze astronomical data in 1809.
The properties of the normal distribution are as follows:
1. The distribution is bell shaped.
2. The mean, median, and mode are equal and are located at the center of the
distribution.
3. The normal distribution is unimodal.
4. The normal distribution curve is symmetric about the mean.
5. The normal distribution is continuous.
6. The normal distribution is asymptotic (it never touches the x-axis)
7. The total area under the normal distribution curve is 1.00 or 100%.
8. The area under the part of a normal curve that lies within 1 standard
deviation of the mean is about 68%; within 2 standard deviations, about 95%; and
within 3 standard deviations, about 99.7%.
• About 68% of the area under the curve falls within 1 standard deviation of the
mean
• About 95% of the area under the curve falls within 2 standard deviations of the
mean
• About 99.7% of the area under the curve falls within 3 standard deviations of the
mean
Every normal curve follows the “empirical rule” (also called the 68-95-99.7 rule).
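The 68-95-99.7 figures can be checked numerically. This sketch uses NormalDist from Python's standard statistics module with mean 0 and standard deviation 1. The helper name area_within is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k
    standard deviations from the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k) * 100:.2f}%")
```

The printed percentages round to the 68%, 95%, and 99.7% stated in the rule.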
Statistics, or statistical procedures, refer to a set of mathematical procedures to
organise, summarise and interpret data.
Descriptive statistics are used to summarise and describe data (information that
has been collected). Data are usually organised and presented in tables or graphs
that summarise information, such as histograms, pie charts, bar charts or scatterplots.
• Descriptive statistics are only descriptive and, thus, do not involve generalising
beyond the data that has been collected.
Inferential statistics aim to test hypotheses and explore relationships between
variables, and can be used to make predictions about the population.
• It is also used to draw conclusions and inferences; that is, to make valid
generalisations from samples.
Two main uses:
a) Making estimates about population (for example, the mean SAT score of all
11th graders in the US)
b) Testing hypotheses to draw conclusions about populations (for example, the
relationship between SAT scores and family income).
Hypothesis testing is used to assess the credibility of a hypothesis
by using sample data. The test provides evidence concerning the credibility of the
hypothesis, given the data.
• Analysts use a random sample from the population to test two competing hypotheses:
the null hypothesis and the alternative hypothesis.
A t-test is an inferential statistic used to determine if there is a significant
difference between the means of two groups and how they are related.
• t-tests are used when the data sets follow a normal distribution and have unknown
variances, like the data set recorded from flipping a coin 100 times.
The independent-sample t-test (t-test independent) compares the
means between two unrelated groups on the same continuous, dependent variable.
Examples,
✔ The effectiveness of two different diets on two different groups of individuals.
✔ Comparing the height of students in two different schools.
✔ Comparing employee satisfaction at Company A and Company B.
✔ Comparing reading scores for a classroom of 3rd-grade students to a classroom of
5th-grade students.
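The notes do not give a formula for the independent-sample t-test. As a rough sketch, the version below assumes the common pooled-variance (equal variances) form of the t statistic and uses hypothetical scores for two unrelated groups. It returns only the t statistic and the degrees of freedom; looking up the p-value is left out.

```python
import math
import statistics


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two
    unrelated groups (assumes roughly equal variances)."""
    n1, n2 = len(group_a), len(group_b)
    mean1, mean2 = statistics.mean(group_a), statistics.mean(group_b)
    var1, var2 = statistics.variance(group_a), statistics.variance(group_b)

    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    return (mean1 - mean2) / standard_error, df


school_a = [1, 2, 3, 4, 5]   # hypothetical scores, group A
school_b = [2, 3, 4, 5, 6]   # hypothetical scores, group B

t, df = independent_t(school_a, school_b)
print(t, df)
```

A t statistic far from 0 (compared with the critical value for the given degrees of freedom) suggests a significant difference between the two group means.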