ADS MODULE 2
Data
Exploration
Q What is Statistics? Explain types of statistics
ANS
Statistics is the science of collecting, analyzing, presenting, and
interpreting data, as well as of making decisions based on such analyses.
Statistics is at the heart of data analytics and deals with every aspect of data. It guides how data collection is planned, through experimental designs and statistical surveys. Statistics is considered a mathematical science that works with numerical data. Statistical knowledge helps in choosing the proper method of collecting data and employing those samples in the correct analysis process so as to produce results effectively. In short, statistics is a crucial discipline that helps in making decisions based on data.
TYPES OF STATISTICS
Inferential Statistics consists of methods that use sample results to help
make decisions or predictions about a population.
Descriptive Statistics consists of methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures.
Descriptive and Inferential Statistics
Descriptive and inferential statistics are two fields of statistics.
Descriptive statistics is used to describe data and inferential statistics is
used to make predictions. Descriptive and inferential statistics have
different tools that can be used to draw conclusions about the data.
In descriptive and inferential statistics, the former uses tools such as
central tendency, and dispersion while the latter makes use of hypothesis
testing, regression analysis, and confidence intervals.
The purpose of descriptive and inferential statistics is to analyze different
types of data using different tools. Descriptive statistics helps to describe
and organize known data using charts, bar graphs, etc., while inferential
statistics aims at making inferences and generalizations about the
population data.
Table 1: Descriptive and Inferential Statistics
Descriptive Statistics
Descriptive statistics are a part of statistics that can be used to describe
data. It is used to summarize the attributes of a sample in such a way that
a pattern can be drawn from the group. It enables researchers to present
data in a more meaningful way such that easy interpretations can be made.
Descriptive statistics uses two tools to organize and describe data. These
are given as follows:
● Measures of Central Tendency - These help to describe the central
position of the data by using measures such as mean, median,
and mode.
● Measures of Dispersion - These measures help to see how spread out
the data is in a distribution with respect to a central point. Range,
standard deviation, variance, quartiles, and absolute deviation are the
measures of dispersion.
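These two kinds of descriptive measures can be computed directly with Python's standard library; the sample scores below are purely illustrative.

```python
import statistics

# Illustrative sample of scores (hypothetical data).
scores = [4, 7, 7, 8, 10, 12]

# Measures of central tendency
mean = statistics.mean(scores)      # arithmetic average -> 8
median = statistics.median(scores)  # middle value -> 7.5
mode = statistics.mode(scores)      # most frequent value -> 7

# Measures of dispersion
value_range = max(scores) - min(scores)   # range -> 8
variance = statistics.pvariance(scores)   # population variance
std_dev = statistics.pstdev(scores)       # population standard deviation

print(mean, median, mode, value_range)
```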
Inferential Statistics
Inferential statistics is a branch of statistics that is used to make inferences
about the population by analyzing a sample. When the population data is
very large it becomes difficult to use it. In such cases, certain samples are
taken that are representative of the entire population. Inferential statistics
draws conclusions regarding the population using these samples.
Sampling strategies such as simple random sampling, cluster sampling,
stratified sampling, and systematic sampling, need to be used in order to
choose correct samples from the population. Some methodologies used in
inferential statistics are as follows:
● Hypothesis Testing - This technique involves the use of hypothesis
tests such as the z test, f test, t test, etc. to make inferences about the
population data. It requires setting up the null hypothesis, alternative
hypothesis, and testing the decision criteria.
● Regression Analysis - Such a technique is used to check the
relationship between dependent and independent variables. The most
commonly used type of regression is linear regression.
Q Hypothesis testing AND steps involved in Hypothesis testing
ANS
• Hypothesis testing can be defined as a statistical tool that is used to
identify if the results of an experiment are meaningful or not. It
involves setting up a null hypothesis and an alternative hypothesis.
• These two hypotheses will always be mutually exclusive.
• This means that if the null hypothesis is true then the alternative
hypothesis is false, and vice versa.
• An example of hypothesis testing is setting up a test to check if a
new medicine works on a disease in a more efficient manner.
Null Hypothesis
• The null hypothesis is a concise mathematical statement that is
used to indicate that there is no difference between two
possibilities.
• In other words, there is no difference between certain
characteristics of data.
• This hypothesis assumes that the outcomes of an experiment are
based on chance alone.
• It is denoted as H0.
• Hypothesis testing is used to conclude if the null hypothesis can be
rejected or not.
• Suppose an experiment is conducted to check if girls are shorter
than boys at the age of 5. The null hypothesis will say that they are
the same height.
Alternative Hypothesis
• The alternative hypothesis is an alternative to the null hypothesis.
• It is used to show that the observations of an experiment are due to
some real effect.
• It indicates that there is a statistical significance between two
possible outcomes and can be denoted as H1 or Ha.
• For the above-mentioned example, the alternative hypothesis
would be that girls are shorter than boys at the age of 5.
Hypothesis Testing P Value
• In hypothesis testing, the p value is used to indicate whether the
results obtained after conducting a test are statistically significant or
not.
• It also indicates the probability of making an error in rejecting or
not rejecting the null hypothesis.
• This value is always a number between 0 and 1.
• The p value is compared to an alpha level, α or significance level.
• The alpha level can be defined as the acceptable risk of incorrectly
rejecting the null hypothesis.
• The alpha level is usually chosen between 1% to 5%.
P-value Formula
• P-value is short for probability value.
• The P-value is the probability of getting a result at least as extreme
as the one actually observed.
• The P-value represents the probability of occurrence of the given
event.
• The P-value is used as an alternative to the rejection point; it gives
the smallest level of significance at which the null hypothesis would
be rejected.
• The smaller the P-value, the stronger the evidence in favor of the
alternative hypothesis.
• P-value is an important statistical measure, that helps to determine
whether the hypothesis is correct or not.
• The P-value always lies between 0 and 1.
• The level of significance(α) is a predefined threshold that should
be set by the researcher. It is generally fixed as 0.05.
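The comparison of a p-value with α can be sketched with the standard library's `NormalDist`; the test statistic value of 2.1 below is assumed purely for illustration.

```python
from statistics import NormalDist

alpha = 0.05   # significance level chosen in advance
z_stat = 2.1   # hypothetical test statistic

# Two-tailed p-value: probability of a result at least as extreme as
# z_stat in either direction, assuming the null hypothesis is true.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Because 0.0357 < 0.05 here, the null hypothesis would be rejected at the 5% level.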
Hypothesis Testing Critical Region
• All sets of values that lead to rejecting the null hypothesis lie in the
critical region. The value that separates the critical region from the
non-critical region is known as the critical value.
Depending upon the type of data available and the size, different types
of hypothesis testing are used to determine whether the null hypothesis
can be rejected or not.
Types of Hypothesis Testing
• Hypothesis Testing Z Test
• Hypothesis Testing t Test
• Hypothesis Testing Chi Square
• One Tailed Hypothesis Testing
• Two Tailed Hypothesis Testing
• Hypothesis testing can be easily performed in five simple steps.
• The most important step is to correctly set up the hypotheses and
identify the right method for hypothesis testing.
• The basic steps to perform hypothesis testing are as follows:
• Step 1: Set up the null hypothesis by correctly identifying whether
it is the left-tailed, right-tailed, or two-tailed hypothesis testing.
• Step 2: Set up the alternative hypothesis.
• Step 3: Choose the correct significance level, α, and find the
critical value.
• Step 4: Calculate the correct test statistic (z, t, or χ²) and the p-value.
• Step 5: Compare the test statistic with the critical value or compare
the p-value with α to arrive at a conclusion.
• In other words, decide if the null hypothesis is to be rejected or
not.
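The five steps above can be sketched for a two-tailed z test using only the standard library. The population figures (μ = 100, σ = 15) and the sample (n = 50, sample mean 103) are hypothetical.

```python
import math
from statistics import NormalDist

# Step 1 & 2: H0: mu = 100 (two-tailed); H1: mu != 100
mu0, sigma = 100, 15

# Step 3: choose alpha and find the critical value
alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05

# Step 4: compute the test statistic and p-value for a hypothetical sample
n, x_bar = 50, 103
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: compare and conclude
reject = abs(z) > critical        # equivalently, p_value < alpha
print(round(z, 3), round(p_value, 4), reject)
```

Here |z| ≈ 1.414 does not exceed the critical value, so the null hypothesis is not rejected.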
Q Hypothesis Testing types
ANS
Types of Hypothesis Testing
• Hypothesis Testing Z Test
• Hypothesis Testing t Test
• Hypothesis Testing Chi Square
• One Tailed Hypothesis Testing
• Two Tailed Hypothesis Testing
• Hypothesis Testing Z Test
A z test is a way of hypothesis testing that is used for a large sample size (n ≥
30). It is used to determine whether there is a difference between the
population mean and the sample mean when the population standard
deviation is known. It can also be used to compare the mean of two samples.
Hypothesis Testing t Test
• The t test is another method of hypothesis testing that is used for a small
sample size (n < 30).
• It is also used to compare the sample mean and population mean.
• However, the population standard deviation is not known.
• Instead, the sample standard deviation is known.
• The mean of two samples can also be compared using the t test.
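A one-sample t test can be run with `scipy.stats.ttest_1samp`, assuming scipy is available; the battery-lifetime sample and the hypothesized mean of 10 hours below are invented for illustration.

```python
from scipy import stats

# Hypothetical small sample (n < 30) of battery lifetimes in hours.
# H0: population mean = 10.0; sigma is unknown, so a t test is used.
sample = [9.8, 10.2, 9.5, 10.1, 9.9, 10.4, 9.7, 10.0]

t_stat, p_value = stats.ttest_1samp(sample, popmean=10.0)

print(round(t_stat, 3), round(p_value, 3))
# A large p-value here means H0 (mean = 10) cannot be rejected.
```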
• Hypothesis Testing Chi Square
• The Chi square test is a hypothesis testing method that is used to check
whether the variables in a population are independent or not.
• It is used when the test statistic is chi-squared distributed.
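A chi-square test of independence can be sketched with `scipy.stats.chi2_contingency` (assuming scipy is available); the 2×2 contingency counts below are made up.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = two groups,
# columns = counts preferring option A vs option B.
observed = [[20, 30],
            [30, 20]]

# correction=False disables Yates' continuity correction so the
# statistic matches the plain chi-square formula sum((O - E)^2 / E).
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(chi2, round(p_value, 4), dof)
```

With p ≈ 0.0455 < 0.05, the hypothesis that the two variables are independent would be rejected at the 5% level.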
• One Tailed Hypothesis Testing
• One tailed hypothesis testing is done when the rejection region is only in one
direction.
• It can also be known as directional hypothesis testing because the effects
can be tested in one direction only.
• This type of testing is further classified into the right tailed test and left
tailed test.
• Right Tailed Hypothesis Testing
• The right tail test is also known as the upper tail test.
• This test is used to check whether the population parameter is greater than
some value. The null and alternative hypotheses for this test are given as
follows:
• H0: The population parameter is ≤ some value
• H1: The population parameter is > some value.
• If the test statistic has a greater value than the critical value, then the
null hypothesis is rejected.
Left Tailed Hypothesis Testing
• The left tail test is also known as the lower tail test.
• It is used to check whether the population parameter is less than some value.
• The hypotheses for this hypothesis testing can be written as follows:
• H0: The population parameter is ≥ some value
• H1: The population parameter is < some value.
• The null hypothesis is rejected if the test statistic has a value lesser than the
critical value.
• Two Tailed Hypothesis Testing
In this hypothesis testing method, the critical region lies on both sides of the
sampling distribution.
• It is also known as a non-directional hypothesis testing method.
• The two-tailed test is used when it needs to be determined whether the
population parameter is different from some value.
• The hypotheses can be set up as follows:
• H0: the population parameter = some value
• H1: the population parameter ≠ some value
• The null hypothesis is rejected if the test statistic falls in either critical
region, i.e., if its absolute value exceeds the critical value.
Q Explain Type I and Type –II errors
ANS
TYPE I AND TYPE II ERRORS
Type I and Type II errors relate to the outcome of a test of the null hypothesis. In
a type I (type-1) error, the null hypothesis is rejected even though it is true,
whereas in a type II (type-2) error the null hypothesis is not rejected even though
the alternative hypothesis is true. A type I error is known as a "false positive"
and a type II error as a "false negative". A lot of statistical theory revolves
around reducing one or both of these errors; still, the total elimination of both
is a statistical impossibility.
Type I Error
A type I error occurs when the null hypothesis (H0) of an experiment is true but is
nevertheless rejected. It amounts to stating that something is present when it is
not, a false hit. A type I error is often called a false positive (an event that
indicates a given condition is present when it is absent). In the language of folk
tales, a person may "see" the bear when there is none (raising a false alarm),
where the null hypothesis (H0) contains the statement: "There is no bear".
The type I error rate, or significance level, is the probability of rejecting the
null hypothesis given that it is true. It is represented by the Greek letter α
(alpha) and is also known as the alpha level. Usually, the significance level, that
is, the probability of a type I error, is set to 0.05 (5%), meaning it is accepted
as satisfactory to have a 5% probability of incorrectly rejecting the null
hypothesis.
Type II Error
A type II error occurs when the null hypothesis is false but mistakenly fails to be
rejected. It amounts to failing to state what is present, a miss. A type II error
is also known as a false negative (where a real hit is missed by the test), in an
experiment checking for a condition with a final outcome of true or false.
A type II error occurs when a true alternative hypothesis goes unrecognized. In
other words, an investigator may fail to discover the bear when a bear is in fact
present (and hence fails to raise the alarm). Again, H0, the null hypothesis,
consists of the statement "There is no bear"; if a bear is indeed present, failing
to detect it is a type II error on the part of the investigator. The bear either
exists or does not exist within the given circumstances; the question is whether it
is correctly identified, neither missed when it is present nor "detected" when it
is absent.
The rate level of the type II error is represented by the Greek letter β (beta) and
linked to the power of a test (which equals 1−β).
What is Type I Error (False Positive)?
Type I error, also known as a false positive, occurs in statistical hypothesis testing
when a null hypothesis that is actually true is rejected. In other words, it's the error
of incorrectly concluding that there is a significant effect or difference when there
isn't one in reality.
In hypothesis testing, there are two competing hypotheses:
Null Hypothesis (H0): This hypothesis represents a default assumption that there
is no effect, no difference, or no relationship in the population being studied.
Alternative Hypothesis (H1): This hypothesis represents the opposite of the null
hypothesis. It suggests that there is a significant effect, difference, or relationship
in the population.
A Type I error occurs when the null hypothesis is rejected based on the sample
data, even though it is actually true in the population.
What is Type II Error (False Negative)?
Type II error, also known as a false negative, occurs in statistical hypothesis
testing when a null hypothesis that is actually false is not rejected. In other words,
it's the error of failing to detect a significant effect or difference when one exists in
reality.
A Type II error occurs when the null hypothesis is not rejected based on the sample
data, even though it is actually false in the population. In other words, it's a failure
to recognize a real effect or difference.
Suppose a medical researcher is testing a new drug to see if it's effective in treating
a certain condition. The null hypothesis (H0) states that the drug has no effect,
while the alternative hypothesis (H1) suggests that the drug is effective.
If the researcher conducts a statistical test and fails to reject the null hypothesis
(H0), concluding that the drug is not effective, when in fact it does have an effect,
this would be a Type II error.
Type I and Type II Errors Examples
Examples of Type I Error
Some examples of Type I errors include:
Medical Testing: Suppose a medical test is designed to diagnose a particular
disease. The null hypothesis (H0) is that the person does not have the disease, and
the alternative hypothesis (H1) is that the person does have the disease. A Type I
error occurs if the test incorrectly indicates that a person has the disease (rejects
the null hypothesis) when they do not actually have it.
Legal System: In a criminal trial, the null hypothesis (H0) is that the defendant is
innocent, while the alternative hypothesis (H1) is that the defendant is guilty. A
Type I error occurs if the jury convicts the defendant (rejects the null hypothesis)
when they are actually innocent.
Quality Control: In manufacturing, quality control inspectors may test products
to ensure they meet certain specifications. The null hypothesis (H0) is that the
product meets the required standard, while the alternative hypothesis (H1) is that
the product does not meet the standard. A Type I error occurs if a product is
rejected (null hypothesis is rejected) as defective when it actually meets the
required standard.
Examples of Type II Error
Using the same H0 and H1, some examples of type II error include:
Medical Testing: In a medical test designed to diagnose a disease, a Type II error
occurs if the test incorrectly indicates that a person does not have the disease
(fails to reject the null hypothesis) when they actually do have it.
Legal System: In a criminal trial, a Type II error occurs if the jury acquits the
defendant (fails to reject the null hypothesis) when they are actually guilty.
Quality Control: In manufacturing, a Type II error occurs if a defective product
is accepted (fails to reject the null hypothesis) as meeting the required standard.
Factors Affecting Type I and Type II Errors
Some of the common factors affecting errors are:
Sample Size: In statistical hypothesis testing, larger sample sizes generally
reduce the probability of both Type I and Type II errors. With larger samples, the
estimates tend to be more precise, resulting in more accurate conclusions.
Significance Level: The significance level (α) in hypothesis testing determines
the probability of committing a Type I error. Choosing a lower significance level
reduces the risk of Type I error but increases the risk of Type II error, and vice
versa.
Effect Size: The magnitude of the effect or difference being tested influences the
probability of Type II error. Smaller effect sizes are more challenging to detect,
increasing the likelihood of failing to reject the null hypothesis when it's false.
Statistical Power: The power of a test (1 − β) is the probability of correctly
rejecting a false null hypothesis; it is the complement of the probability of
committing a Type II error. As the power of the test rises, the chance of a Type II
error drops.
How to Minimize Type I and Type II Errors
To minimize Type I and Type II errors in hypothesis testing, there are several
strategies that can be employed based on the information from the sources
provided:
Minimizing Type I Error
To reduce the probability of a Type I error (rejecting a true null hypothesis), one
can choose a smaller level of significance (alpha) at the beginning of the study.
By setting a lower significance level, the chances of incorrectly rejecting the null
hypothesis decrease, thus minimizing Type I errors.
Minimizing Type II Error
The probability of a Type II error (failing to reject a false null hypothesis) can be
minimized by increasing the sample size or choosing a "threshold" alternative
value of the parameter further from the null value.
Increasing the sample size reduces the variability of the statistic, making it less
likely to fall in the non-rejection region when it should be rejected, thus
minimizing Type II errors.
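The link between α and the Type I error rate can be checked by simulation. This sketch repeatedly tests a null hypothesis that is actually true (samples really drawn from mean 0) and counts the false rejections; the trial count and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean

random.seed(42)                     # arbitrary seed for reproducibility
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Simulate many experiments in which H0 is actually true
# (mu = 0, sigma = 1) and count how often it is wrongly rejected.
trials, n = 2000, 30
false_positives = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) / (1 / n ** 0.5)   # sigma is known to be 1
    if abs(z) > z_crit:
        false_positives += 1

type1_rate = false_positives / trials
print(type1_rate)   # should land near alpha = 0.05
```

Lowering `alpha` in this sketch lowers the observed false-positive rate, which is exactly the trade-off described above.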
Q Explain ANOVA Test
Ans
ANOVA Test
ANOVA Test is used to analyze the differences among the means of various groups
using certain estimation procedures. ANOVA means analysis of variance. ANOVA
test is a statistical significance test that is used to check whether the null hypothesis
can be rejected or not during hypothesis testing.
An ANOVA test can be either one-way or two-way depending upon the number of
independent variables.
What is ANOVA Test?
ANOVA test, in its simplest form, is used to check whether the means of three or
more populations are equal or not. The ANOVA test applies when there are more
than two independent groups. The goal of the ANOVA test is to check for
variability within the groups as well as the variability among the groups. The
ANOVA test statistic is given by the f test.
ANOVA test can be defined as a type of test used in hypothesis testing to compare
whether the means of two or more groups are equal or not. This test is used to
check if the null hypothesis can be rejected or not depending upon the statistical
significance exhibited by the parameters. The decision is made by comparing the
ANOVA test statistic with the critical value.
ANOVA Test Example
Suppose it needs to be determined if consumption of a certain type of tea will
result in a mean weight loss. Let there be three groups using three types of tea -
green tea, earl grey tea, and jasmine tea. Thus, to compare if there was any mean
weight loss exhibited by a certain group, the ANOVA test (one way) will be used.
Suppose a survey was conducted to check if there is an interaction between income
and gender with anxiety level at job interviews. To conduct such a test a two-way
ANOVA will be used.
One Way ANOVA
The one way ANOVA test is used to determine whether there is any difference
between the means of three or more groups. A one way ANOVA will have only
one independent variable. The hypothesis for a one way ANOVA test can be set up
as follows:
Null Hypothesis, H0: μ1 = μ2 = μ3 = ... = μk
Alternative Hypothesis, H1: The means are not all equal
Decision Rule: If test statistic > critical value then reject the null hypothesis and
conclude that the means of at least two groups are statistically significant.
The steps to perform the one way ANOVA test are given below:
• Step 1: Calculate the mean for each group.
• Step 2: Calculate the total mean. This is done by adding all the means
and dividing it by the total number of means.
• Step 3: Calculate the SSB.
• Step 4: Calculate the between groups degrees of freedom.
• Step 5: Calculate the SSE.
• Step 6: Calculate the degrees of freedom of errors.
• Step 7: Determine the MSB and the MSE.
• Step 8: Find the f test statistic.
• Step 9: Using the f table for the specified level of significance, α, find
the critical value. This is given by F(α, df1, df2).
• Step 10: If f > F then reject the null hypothesis.
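The steps above are bundled into scipy's `f_oneway` (assuming scipy is available). The weight-loss figures for the three tea groups from the earlier example are invented for illustration.

```python
from scipy.stats import f_oneway

# Hypothetical weight loss (kg) for the three tea groups.
green     = [3.2, 2.8, 3.5, 3.0, 2.9]
earl_grey = [2.1, 2.4, 1.9, 2.3, 2.2]
jasmine   = [2.8, 3.1, 2.6, 3.0, 2.7]

f_stat, p_value = f_oneway(green, earl_grey, jasmine)

print(round(f_stat, 2), p_value < 0.05)
# p_value < alpha means at least two group means differ significantly.
```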
Limitations of One Way ANOVA Test
The one way ANOVA is an omnibus test statistic. This implies that the test will
determine whether the means of the various groups are statistically significant or
not. However, it cannot distinguish the specific groups that have a statistically
significant mean. Thus, to find the specific group with a different mean, a post hoc
test needs to be conducted.
Two Way ANOVA
The two way ANOVA has two independent variables. Thus, it can be thought of as
an extension of a one way ANOVA where only one variable affects the dependent
variable. A two way ANOVA test is used to check the main effect of each
independent variable and to see if there is an interaction effect between them. To
examine the main effect, each factor is considered separately as done in a one way
ANOVA. Furthermore, to check the interaction effect, all factors are considered at
the same time. There are certain assumptions made for a two way ANOVA test.
These are given as follows:
• The samples drawn from the population must be independent.
• The population should be approximately normally distributed.
• The groups should have the same sample size.
• The population variances are equal.
Suppose in the two way ANOVA example, as mentioned above, the income groups
are low, middle, high. The gender groups are female, male, and transgender. Then
there will be 9 treatment groups and the three hypotheses can be set up as follows:
H01: All income groups have equal mean anxiety.
H11: Not all income groups have equal mean anxiety.
H02: All gender groups have equal mean anxiety.
H12: Not all gender groups have equal mean anxiety.
H03: An interaction effect does not exist.
H13: An interaction effect exists.
Important Notes on ANOVA Test
• ANOVA test is used to check whether the means of three or more groups
are different or not by using estimation parameters such as the variance.
• An ANOVA table is used to summarize the results of an ANOVA test.
• There are two types of ANOVA tests - one way ANOVA and two way
ANOVA
• One way ANOVA has only one independent variable while a two way
ANOVA has two independent variables.
Q Explain Measures of Shape: Skewness and Kurtosis
Ans
• The measure of central tendency and measure of dispersion can
describe the distribution but they are not sufficient to describe the
nature of the distribution. Skewness and Kurtosis are the two
important characteristics of distribution that are studied in descriptive
statistics
1-Skewness
• Skewness is a statistical number that tells us if a distribution is
symmetric or not.
• A distribution is symmetric if the right side of the distribution is similar to
the left side of the distribution. If a distribution is symmetric, then the
Skewness value is 0.
• If a distribution is symmetric (e.g. a normal distribution): mean = median =
mode, and the skewness value is 0.
• If skewness is greater than 0, the distribution is right-skewed: the right
tail is longer than the left tail.
• If skewness is less than 0, the distribution is left-skewed: the left tail is
longer than the right tail.
2-Kurtosis
• Kurtosis is a statistical number that tells us if a distribution is taller or
shorter than a normal distribution. If a distribution is similar to the normal
distribution, the Kurtosis value is 0.
• If Kurtosis is greater than 0, then it has a higher peak compared to the
normal distribution.
• If Kurtosis is less than 0, then it is flatter than a normal distribution.
• There are three types of distributions:
• Leptokurtic: sharply peaked with fat (heavy) tails.
• Mesokurtic: medium peak, similar to the normal distribution.
• Platykurtic: flattest peak, with the values highly dispersed.
• The coefficient of skewness can be defined as a measure that is used to
determine the strength and direction of the skewness of a sample distribution
by using descriptive statistics such as the mean, median, or mode.
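Both measures of shape can be computed with scipy (assumed available); the right-skewed sample below is hypothetical, with one large value stretching the right tail.

```python
from scipy.stats import skew, kurtosis

# Hypothetical right-skewed sample: the value 12 stretches the right tail.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]

s = skew(data)      # > 0 -> right-skewed (mean pulled above the median)
k = kurtosis(data)  # excess kurtosis; 0 for a normal distribution

print(round(s, 2), round(k, 2))
```

Positive values for both confirm a right-skewed, leptokurtic shape for this sample.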
Q Explain Different Types of Data
Ans
Data is referred to as a collection of information gathered by
observations, measurements, research, or analysis. It may
comprise facts, figures, numbers, names, or even general descriptions of
things. Data can be organized in the form of graphs, charts, or tables for
ease in our study. Data scientists help in analyzing the collected data
through data mining. For example, information gathered can be
represented in the form of data as given below,
• A set of numbers such as 1, 2, 3, 4, 5
• The list of student names in a class
• Physical attributes such as age, height, weight, etc.
Different Types of Data
Data can be grouped on the basis of the qualitative or quantitative aspect
of the gathered information. Typically, data can be classified into the
types given below:
• Qualitative Data
• Quantitative Data
Qualitative Data - As the name suggests, this type of data describes
qualities or characteristics. It can be observed and recorded but not
measured numerically. For example: gender, phone numbers, citizenship,
etc. Qualitative data is further classified as:
1. Nominal Data - It is a type of data mostly used for naming something
or labeling or in classification. It is also called "named data". For
example gender, country, race, eye color, hair color, hairstyle, etc
2. Ordinal Data - It is a type of data that is both named and ordered, with
a scale (range). For example, ranking in a class: First, Second, Third.
Quantitative Data - As the name suggests it deals with quantity and
quantity is related to numbers. It is also known as numeric data. For
example the number of bananas in a dozen, the number of candies in a
box, etc.
Quantitative Data is further classified as:
1. Discrete Data - This type of data takes values that can be counted
such as counting the number of fruits on a tree, the number of students
in a class, etc.
2. Continuous Data - This type of data can take any value within a specific
range and is measured rather than counted, such as weight, length,
temperature, speed, etc.
Primary And Secondary Data
Primary Data: Primary Data is a type of data that is gathered by the
individual. For example, data recorded by a student in a lab experiment,
teacher giving an oral test and writing the marks, letters, records,
autobiographies, etc.
Secondary Data: Secondary data is a type of data that is gathered by
someone else and used somewhere else. For example, Another teacher
using the marks obtained in an oral test for evaluation, newspapers,
encyclopedias, biographies, etc
Q Explain Spearman’s Rank Correlation Coefficient
Ans
• The Spearman’s rank coefficient of correlation, or Spearman
correlation coefficient, is a nonparametric measure of rank correlation
(statistical dependence of the ranking between two variables). Named
after Charles Spearman, it is often denoted by the Greek letter ‘ρ’ (rho)
and is primarily used for data analysis. It measures the strength and
direction of the association between two ranked variables.
• But before we talk about the Spearman correlation coefficient, it is
important to understand Pearson’s correlation first.
Pearson correlation
• A Pearson correlation is a statistical measure of the strength of a linear
relationship between paired data. For the calculation and significance
testing of the ranking variable, it requires the following data assumption
to hold true:
• Interval or ratio level
• Linearly related
• Bivariate normally distributed
• If your data doesn’t meet the above assumptions, then you would need
Spearman’s Coefficient.
• It is necessary to know what a monotonic function is in order to
understand the Spearman correlation coefficient.
• A monotonic function is one that either never decreases or never
increases as its independent variable increases.
• Three cases can be distinguished:
• Monotonically increasing: When the ‘x’ variable increases and the ‘y’
variable never decreases.
• Monotonically decreasing: When the ‘x’ variable increases but the ‘y’
variable never increases
• Not monotonic: When the ‘x’ variable increases and the ‘y’ variable
sometimes increases and sometimes decreases.
• The Spearman coefficient is given by ρ = 1 − (6 Σ di²) / (n(n² − 1)), where:
• n = number of data points of the two variables
• di = difference in ranks of the “ith” element
• The Spearman Coefficient,⍴, can take a value between +1 to -1 where,
• A ⍴ value of +1 means a perfect association of rank
• A ⍴ value of 0 means no association of ranks
• A ⍴ value of -1 means a perfect negative association between ranks.
• The closer the ⍴ value is to 0, the weaker the association between the two ranks.
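The contrast between Pearson and Spearman shows up on monotonic but non-linear data; this sketch uses y = x³ and assumes scipy is available.

```python
from scipy.stats import pearsonr, spearmanr

# Monotonic but non-linear relationship: y = x**3.
x = [1, 2, 3, 4, 5, 6, 7]
y = [v ** 3 for v in x]

rho, _ = spearmanr(x, y)   # rank-based: a perfect monotonic match gives 1.0
r, _ = pearsonr(x, y)      # linear correlation: strong, but below 1.0

print(rho, round(r, 3))
```

Spearman reports a perfect association (ρ = 1.0) because the ranks match exactly, while Pearson is pulled below 1 by the non-linearity.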
Question: Explain Measures of Central Tendency.
Ans
• A measure of central tendency is a descriptive statistic that describes the
average, or typical value of a set of scores
• There are three common measures of central tendency:
• the mode
• the median
• the mean
• The Mode:- The mode is the score that occurs most frequently in a set of
data. When a distribution has two modes, it is called bimodal. If a
distribution has more than two modes, it is called multimodal.
• The mode is primarily used with nominally scaled data
• It is the only measure of central tendency that is appropriate for
nominally scaled data
The Median:- The median is simply another name for the 50th percentile
• It is the score in the middle; half of the scores are larger than the
median and half of the scores are smaller than the median. The median
is often used when the distribution of scores is either positively or
negatively skewed
• The few really large scores (positively skewed) or really small scores
(negatively skewed) will not overly influence the median
How To Calculate the Median
• Conceptually, it is easy to calculate the median
• There are many minor problems that can occur; it is best to let a
computer do it
• Sort the data from highest to lowest
• Find the score in the middle
• middle = (N + 1) / 2
• If N, the number of scores, is even the median is the average of the
middle two scores
• What is the median of the following scores:
10 8 14 15 7 3 3 8 12 10 9
• Sort the scores:
15 14 12 10 10 9 8 8 7 3 3
• Determine the middle score:
middle = (N + 1) / 2 = (11 + 1) / 2 = 6
• Middle score = median = 9
• What is the median of the following scores:
24 18 19 42 16 12
• Sort the scores:
42 24 19 18 16 12
• Determine the middle score:
middle = (N + 1) / 2 = (6 + 1) / 2 = 3.5
• Median = average of 3rd and 4th scores:
(19 + 18) / 2 = 18.5
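The steps above can be sketched in code (the `median` helper is illustrative), using the two sets of scores from the worked examples:

```python
def median(scores):
    """Median: sort, then take the middle score (or average the middle two)."""
    s = sorted(scores)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                     # odd N: single middle score
    return (s[mid - 1] + s[mid]) / 2      # even N: average of the middle two

print(median([10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]))  # 9
print(median([24, 18, 19, 42, 16, 12]))                # 18.5
```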
• The Mean:- The mean is:
• the arithmetic average of all the scores: ΣX / N
• the number, m, that makes Σ(X − m) equal to 0
• the number, m, that makes Σ(X − m)² a minimum
• The mean of a population is represented by the Greek letter μ; the mean of a
sample is represented by X̄
When To Use the Mean
• You should use the mean when
• the data are interval or ratio scaled
• Many people will use the mean with ordinally scaled data too
• and the data are not skewed
• The mean is preferred because it is sensitive to every score
• If you change one score in the data set, the mean will change
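A brief sketch of these properties, using hypothetical scores: the mean is ΣX / N, and the deviations from the mean always sum to zero:

```python
scores = [10, 8, 14, 15, 7, 3, 3, 8, 12, 10, 9]  # hypothetical data

mean = sum(scores) / len(scores)  # arithmetic average, sum(X) / N
print(mean)  # 9.0

# Sum of deviations from the mean is (up to rounding) zero:
print(round(sum(x - mean for x in scores), 10))  # 0.0
```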
Relations Between the Measures of Central Tendency
• In symmetrical distributions, the median and mean are equal
• For normal distributions, mean = median = mode
• In positively skewed distributions, the mean is greater than the median
• In negatively skewed distributions, the mean is smaller than the median
Median
The median is the value of the variable which divides the whole set of data
into two equal parts. It is the value such that, in a set of observations,
50% of the observations are above it and 50% are below it. Hence
the median is a positional average.
For example, if the middle observation in a sorted list of student weights is
40 kg, the median weight of the students is 40 kg.
Q15: Explain Measures of Central Tendency and Measures of
Dispersion
Ans
• Measures of central tendency (the mode, the median, and the mean) are
described in the previous answer.
• Measures of dispersion are descriptive statistics that describe how similar a
set of scores are to each other
• The more similar the scores are to each other, the lower the measure
of dispersion will be
• The less similar the scores are to each other, the higher the measure of
dispersion will be
• In general, the more spread out a distribution is, the larger the measure
of dispersion will be
• There are three main measures of dispersion:
• The range
• The semi-interquartile range (SIR)
• Variance / standard deviation
• The Range:- The range is defined as the difference between the largest
score in the set of data and the smallest score in the set of data, XL - XS
• The range is used when
• you have ordinal data or
• you are presenting your results to people with little or no knowledge
of statistics
• The semi-interquartile range (or SIR) is defined as half the difference
between the third and first quartiles
• The first quartile is the 25th percentile
• The third quartile is the 75th percentile
• SIR = (Q3 - Q1) / 2
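A minimal sketch of both measures on hypothetical data; note that quartile conventions vary, and the standard library's `statistics.quantiles` with `method="inclusive"` is just one common choice:

```python
from statistics import quantiles

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]  # hypothetical scores

# Range: largest score minus smallest score (XL - XS)
data_range = max(data) - min(data)
print(data_range)  # 16

# Semi-interquartile range: (Q3 - Q1) / 2
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
sir = (q3 - q1) / 2
print(sir)  # 4.0
```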
• Variance:- Variance is defined as the average of the squared deviations:
σ² = Σ(X − μ)² / N
What Does the Variance Formula Mean?
• First, it says to subtract the mean from each of the scores
• This difference is called a deviate or a deviation score
• The deviate tells us how far a given score is from the typical, or
average, score
• Thus, the deviate is a measure of dispersion for a given score
• Variance is the mean of the squared deviation scores
• The larger the variance is, the more the scores deviate, on average, away
from the mean
• The smaller the variance is, the less the scores deviate, on average, from the
mean
• Standard Deviation:- When the deviate scores are squared in variance, their
unit of measure is squared as well
• E.g. If people’s weights are measured in pounds, then the variance of
the weights would be expressed in pounds² (squared pounds)
• Since squared units of measure are often awkward to deal with, the square
root of variance is often used instead
• The standard deviation is the square root of variance
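The two definitions above can be sketched as follows, using the population formulas and hypothetical scores:

```python
from math import sqrt

scores = [4, 8, 6, 2, 10]  # hypothetical population of scores

mu = sum(scores) / len(scores)  # population mean
# Variance: mean of the squared deviations from the mean
variance = sum((x - mu) ** 2 for x in scores) / len(scores)
# Standard deviation: square root of variance, back in the original units
std_dev = sqrt(variance)

print(variance)            # 8.0
print(round(std_dev, 3))   # 2.828
```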
Q16 Explain correlation coefficient
Ans
• In Statistics, the correlation coefficient is a measure defined
between the numbers -1 and +1 and represents the linear
interdependence of the set of data.
• The correlation coefficient is used to measure the strength of the
relationship between two variables.
• The value of the correlation coefficient ranges from -1.0 to +1.0
• This means that any value beyond this range will be the result of
an error in correlation measurement.
• A correlation of value -1.0 means a perfect negative correlation,
while a correlation of +1.0 means a perfect positive correlation.
• A correlation of 0.0 means no linear relationship between the
movement of the two variables.
• How to Calculate Correlation Coefficient?
• The correlation coefficient can be calculated by first determining
the covariance of the given variables.
• This value is then divided by the product of standard deviations for
these variables.
• The equation below summarizes this concept:
r = Cov(X, Y) / (σX × σY)
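Assuming population formulas throughout, a minimal sketch of this calculation (the `pearson_r` name and the sample data are illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient: covariance divided by the product of std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # perfect positive linear relationship
print(round(pearson_r(x, y), 6))  # 1.0
```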
Q17: Explain Measures of Dispersion
Ans
• Measures of dispersion (the range, the semi-interquartile range, and
variance/standard deviation) are described in the answer to Q15 above.
Q18: What is a quartile? Explain how to find the quartiles (Q1, Q2, Q3) for
discrete grouped data.
Ans
SOLN (STEPS FOR UNDERSTANDING)
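Since the worked steps are not reproduced above, here is a sketch of one common convention for a discrete frequency distribution: Q_k is the value whose cumulative frequency first reaches the k(N + 1)/4-th position (the `quartile` helper and the marks/frequency data are illustrative assumptions):

```python
def quartile(values, freqs, k):
    """Q_k for a discrete frequency distribution: the value whose cumulative
    frequency first reaches the k(N + 1)/4-th position (one common convention)."""
    n = sum(freqs)
    position = k * (n + 1) / 4
    cum = 0
    for value, f in zip(values, freqs):
        cum += f
        if cum >= position:
            return value
    return values[-1]

# Hypothetical distribution: marks and their frequencies (N = 40)
marks = [10, 20, 30, 40, 50]
freqs = [ 4,  7, 15,  8,  6]

print(quartile(marks, freqs, 1))  # Q1 = 20
print(quartile(marks, freqs, 2))  # Q2 = 30 (the median)
print(quartile(marks, freqs, 3))  # Q3 = 40
```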