Unit-3
Descriptive analytics
• Descriptive analytics refers to the interpretation of historical data to better
understand changes that occur in a business.
• It describes the use of a range of historical data to draw comparisons with other
reporting periods for the same company (e.g., quarterly or annually) or with
other companies within the same industry.
• The following measures all describe what has occurred in a business during a set
period:
• Financial metrics, including year-over-year (YOY) pricing changes, month-over-month
sales growth, the number of users, and the total revenue per subscriber.
Descriptive statistics
• Descriptive statistics summarize and organize characteristics of a data set. A data set
is a collection of responses or observations from a sample or entire population.
• In quantitative research, after collecting data, the first step of statistical analysis is
to describe characteristics of the responses, such as the average of one variable
(e.g., age), or the relation between two variables (e.g., age and creativity).
• The next step is inferential statistics, which helps decide whether:
• the data confirms or refutes the hypothesis
• it is generalizable to a larger population
Data summarized using descriptive analytics
• Financial statements
• Surveys
• Social media engagement
• Website traffic
• Scientific findings
• Weather reports
• Traffic data
Advantages of descriptive analytics
1. High Degree of Objectivity
• Descriptive analysis uses data to draw conclusions and to identify
characteristics that match current trends.
• Analysts avoid implementing changes when the data does not fit
current market trends.
• The high degree of objectivity in this analysis ensures data is quantifiable
and usable.
2. Describes Relationships Between Variables
• It summarizes relationships between two or more variables using tools like scatter diagrams.
• Businesses compare the relationships between variables through both bivariate and
multivariate tools. Following this, they can find market patterns and act accordingly.
3. Follows a Wider Approach
• It follows the broadest approach among all types of analysis as it can use any number of
variables for analysis.
• Billions of data points can be clustered together to draw conclusions.
• This helps businesses make decisions that consider every aspect of the data set.
4. Identifies New Hypotheses
• With descriptive analysis, analysts can identify new hypotheses and investigate them
through experimental studies.
• This leads to a smaller margin of error since trends are directly taken from data sets.
5. Flexible
• Analysts can use both quantitative and qualitative data to interpret results.
• Therefore, descriptive analysis is highly flexible and provides a more workable picture
of events.
6. Aids in Identifying Business and Market Trends
• When an analyst collects a large amount of data for the study, the
conclusions provide them with helpful market trends.
• This helps in identifying areas to focus efforts on.
• Business operations are broken down by how they affect functionality.
• By using descriptive analysis, analysts identify business trends and
compare current positions to past ones.
Disadvantages of descriptive analytics
• Data sets could be summarized, but these may not tell a complete story.
• It cannot be used to test a hypothesis or understand why the data look
the way they do.
• It cannot be used to predict what may happen in the future.
• The findings cannot be generalized to a broader population.
• Descriptive analytics tells you nothing about the data collection
methodology, meaning the data set may include errors.
The Five Steps of Descriptive Analytics
Step 1 – State the Business Metrics
Step 2 – Identify the Data Required
Step 3 – Extract and Prepare the Data
Step 4 – Analyze the Data
Step 5 – Present the Data
Types of Descriptive Analysis
1. Frequency
• Frequency is a metric that shows how often an event or value occurs in the data.
• This is useful when a vast data set needs compilation for easy access.
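A frequency count can be sketched in a few lines of Python. The survey responses below are made-up illustrative data, not taken from these notes.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data only).
responses = ["yes", "no", "yes", "yes", "no", "maybe"]

# Count how often each distinct value occurs.
frequency = Counter(responses)

# Relative frequency: each count divided by the number of observations.
relative = {value: count / len(responses) for value, count in frequency.items()}

for value, count in frequency.most_common():
    print(f"{value}: {count} ({relative[value]:.0%})")
```

Sorting by count, as `most_common()` does, is one simple way to compile a large data set into a compact summary.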
2. Central Tendency
• Central tendency identifies the typical, or central, value of a data set.
• This allows analysts to present descriptive metrics while understanding market trends.
• Mean, median, and mode are the measures of central tendency that companies commonly use in
descriptive analysis.
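As a small sketch, the three measures can be computed with Python's standard `statistics` module. The data reuses the ten weights from the percentile example in this unit; the variable names are only illustrative.

```python
import statistics

# Weights of ten persons, as used in the percentile example of this unit.
weights = [50, 55, 40, 60, 100, 95, 90, 60, 80, 75]

mean = statistics.mean(weights)      # sum of values / number of values
median = statistics.median(weights)  # middle value of the sorted data
mode = statistics.mode(weights)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
```

For this data the mean (70.5) is above the median (67.5), which hints that the larger weights pull the average upward.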
[Figure: 1. Frequency, which describes a dataset with a single variable; 2. Central Tendency]
3. Position
• Position measures allow analysts to identify a single
value’s position compared to others.
• Analysts use percentiles and quartiles to understand how data points stand
relative to one another.
4. Dispersion
• Measures of dispersion, such as standard deviation, help analysts understand
how spread out the data is.
• It assists in identifying how much data spreads across a range.
• These variations help analysts understand how far or near the business is from
an ideal positioning.
• Percentile – It is used to determine the performance of a person in
comparison with others.
• It is used mostly in schools, when test results are reported, to show where a
person stands relative to others.
• Percentile(x) = (Number of values below ‘x’ / Total number of values) × 100
• P = (n/N) × 100
Where,
P is the percentile
n – Number of values below ‘x’
N – Total count of values in the population
Example: What is the percentile of the value 60 in the following population of weights of persons: 50, 55, 40, 60, 100, 95,
90, 60, 80, 75?
Solution:
The given data is not sorted, so first sort it in ascending order.
Sorted data: 40, 50, 55, 60, 60, 75, 80, 90, 95, 100
Number of values below 60 (n) = 3
Total count of values (N) = 10
Percentile = (n/N) × 100
= (3/10) × 100
= 30
The percentile of the value 60 in the given population is 30.
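The worked example above can be checked with a short Python sketch of the formula P = (n/N) × 100. The function name `percentile_rank` is an illustrative choice, not a standard library call.

```python
def percentile_rank(values, x):
    """Return P = (n / N) * 100, where n counts the values strictly below x."""
    n = sum(1 for v in values if v < x)
    return n * 100 / len(values)

weights = [50, 55, 40, 60, 100, 95, 90, 60, 80, 75]
rank = percentile_rank(weights, 60)
print(rank)  # 30.0
```

Sorting is not needed in code because the count of values below x does not depend on order; sorting only makes the manual count easier.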
• Quartiles are three values (Q1, Q2, Q3) that divide the sorted data points into four
equal-sized groups.
• The quartile formula is an important topic in statistics: it helps
us study large amounts of data by dividing the data values into four equal quarters.
• Quantile tables are statistical tools used to represent and analyze the distribution of data.
• They summarize data by dividing it into equal-sized groups
(quantiles) and showing the values at specific points (e.g., quartiles: 25th,
50th, 75th percentiles).
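A minimal sketch of computing quartiles with Python's standard library follows. Note the assumption: `statistics.quantiles` interpolates between data points, and its default ("exclusive") method is only one of several common conventions, so hand calculations or other tools may give slightly different Q1 and Q3.

```python
import statistics

weights = [50, 55, 40, 60, 100, 95, 90, 60, 80, 75]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(weights, n=4)

print(f"Q1={q1}, Q2 (median)={q2}, Q3={q3}")
print(f"Interquartile range = {q3 - q1}")
```

Q2 always coincides with the median, which gives a quick sanity check on any quartile calculation.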
• In statistics, dispersion (also called variability, scatter, or spread) is the
extent to which a distribution is stretched or squeezed.
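As an illustration, the range, variance, and standard deviation can be computed as follows. The sketch treats the ten weights from the percentile example as a whole population (hence `pvariance` and `pstdev`); for a sample, `statistics.variance` and `statistics.stdev` would be the usual choice.

```python
import statistics

weights = [50, 55, 40, 60, 100, 95, 90, 60, 80, 75]

data_range = max(weights) - min(weights)   # distance between the extremes
variance = statistics.pvariance(weights)   # mean squared deviation from the mean
std_dev = statistics.pstdev(weights)       # square root of the variance

print(f"range={data_range}, variance={variance}, std dev={std_dev:.2f}")
```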
Techniques Used for Descriptive Analysis
1. Specialized Descriptive Techniques
• Analysts often use specialized techniques to measure inequality, discrimination, and
segregation between two or more variables.
• These techniques help understand the social processes of a business.
• For example, a business can use key performance indicators (KPIs) to understand its
sales growth.
• Upon conducting the analysis, the business will have all the information it needs
regarding past sales to determine the future trajectory (moving path) of the
business.
2. Tables
• Constructing quantile tables, dispersion tables, cross-tabulations, etc., helps in
exploring more advanced hypotheses.
• Such hypotheses help businesses understand the difference between different
subdivisions of the company.
• For example, a business can use tables to determine product price dispersions.
• The analyst can determine how close or far the product price is from the industry
average.
• Upon getting results, the business can increase or decrease the price accordingly.
3. Crosstabs
• Crosstabs are two-way tabulations that help analysts determine the proportions of different
components.
• Each variable is split into categories, and each cell shows the count or proportion for one
combination of categories, which the business can use to make further decisions.
• For example, an analyst can tabulate the proportion of employees that work at the middle
management level but also have health insurance benefits.
• A crosstab of management level and health insurance shows the relationship between
the two.
• After understanding the association between the two, the business can increase or
decrease health insurance benefits.
• Cross-tabulation is also referred to as crosstab. It is a statistical
technique used to organize and analyze the relationship between two
or more categorical variables.
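The management-level example can be sketched as a small cross-tabulation. The employee records below are hypothetical, and the sketch uses only the standard library; a data-analysis library would normally provide a ready-made crosstab function.

```python
from collections import Counter

# Hypothetical records: (management level, has health insurance).
employees = [
    ("middle", True), ("middle", True), ("middle", False),
    ("senior", True), ("junior", False), ("junior", True),
]

# Each (level, insured) pair is one cell of the two-way table.
table = Counter(employees)
levels = sorted({level for level, _ in employees})

print(f"{'level':<8}{'insured':>8}{'uninsured':>10}")
for level in levels:
    print(f"{level:<8}{table[(level, True)]:>8}{table[(level, False)]:>10}")

# Proportion of middle managers who have health insurance.
middle_total = table[("middle", True)] + table[("middle", False)]
middle_insured_share = table[("middle", True)] / middle_total
print(f"Insured share at middle level: {middle_insured_share:.0%}")
```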
4. Tables of Means
• An analyst can use a table of means to determine differences between
subdivisions.
• This helps businesses with inferences for future decisions.
• For example, this can be used if a business wants to understand the gap in
earnings across different management levels.
• A table of means will help identify differences between all levels of management.
• The differences will likely be based on the experience level, hours of work, the
complexity of work, etc.
• However, discrepancies in expected results can help guide company decisions.
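A table of means for the earnings example might be built as below. The salary figures and level names are invented purely for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (management level, annual earnings) records.
earnings = [
    ("junior", 30000), ("junior", 34000),
    ("middle", 50000), ("middle", 54000),
    ("senior", 80000),
]

# Group the earnings by management level.
by_level = defaultdict(list)
for level, salary in earnings:
    by_level[level].append(salary)

# One mean per subdivision forms the table of means.
means = {level: mean(salaries) for level, salaries in by_level.items()}

for level, value in means.items():
    print(f"{level:<8}{value:>10.0f}")
```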
Probability Distribution
Probability
• Experiment: An experiment could be something like — whether it rains in
Delhi on a daily basis or not.
• Outcome: Outcome is the result of a single trial. If it rains today, the outcome
of today’s trial is “it rained”.
• Event: An event is one or more outcomes of an experiment. For the
experiment of whether it rains in Delhi every day, the event could be “it
rained” or “it didn’t rain”.
• Probability: This is simply the likelihood of an event. So if there’s a 60% chance
of it raining today, the probability of rain is 0.6.
Probability- Definition
• “Probability is a mathematical term for the likelihood that something
will occur. It is the ability to understand and estimate the possibility of a
different combination of outcomes.”
• Probability is basically the degree to which something can happen. In
order to determine the probability of a single event occurring, first of
all, we need to know the total number of possible outcomes.
Probability Formula
• The probability of an event is defined as the ratio of the number of favorable outcomes to the
total number of outcomes.
• For any event (E), this can be shown as
P(E) = Number of favorable outcomes / Total number of outcomes
Or
P(E) = n(E) / n(S)
where,
P(E) is the probability of an event 'E'.
n(E) is the number of favorable outcomes of an event 'E'.
n(S) is the total number of outcomes in the sample space.
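As a quick sketch of the formula, consider rolling a fair six-sided die and asking for the probability of an even number. The die example is an assumption added for illustration; it presumes all outcomes are equally likely.

```python
from fractions import Fraction

# Sample space S: all outcomes of one roll of a fair die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event E: the roll is even (the favorable outcomes).
event = {outcome for outcome in sample_space if outcome % 2 == 0}

# P(E) = number of favorable outcomes / total number of outcomes.
probability = Fraction(len(event), len(sample_space))
print(probability, float(probability))  # 1/2 0.5
```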
Probability distribution
• A probability distribution is a mathematical function that describes the probability
of different possible values of a variable. Probability distributions are often depicted
using graphs or probability tables.
Example: Probability distribution
We can describe the probability distribution of one coin flip using a probability table:
Outcome  Probability
Heads    .5
Tails    .5
• A probability distribution is an idealized frequency distribution.
• A frequency distribution describes a specific sample or dataset. It’s the
number of times each possible value of a variable occurs in the dataset.
• The number of times a value occurs in a sample is determined by
its probability of occurrence. Probability is a number between 0 and 1 that
says how likely something is to occur:
• 0 means it’s impossible.
• 1 means it’s certain.
• The higher the probability of a value, the higher its frequency in a sample.
Common probability distributions include the
• binomial distribution
• Poisson distribution
• uniform distribution
Certain types of probability distributions are used in hypothesis testing,
including
• standard normal distribution
• F distribution
• Student’s t distribution
Binomial distribution
• A binomial distribution can be thought of as simply the probability of a
SUCCESS or FAILURE outcome in an experiment or survey that is repeated
multiple times.
• The binomial is a type of distribution that has two possible outcomes (the
prefix “bi” means two, or twice).
• For example, a coin toss has only two possible outcomes: heads or tails and
taking a test could have two possible outcomes: pass or fail.
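For the coin-toss case, the binomial probability of exactly k successes in n independent trials is C(n, k) × p^k × (1 − p)^(n − k). A minimal sketch follows, assuming a fair coin (p = 0.5) tossed five times; the function name is illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 0.5  # five tosses of a fair coin (assumed values)
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in distribution.items():
    print(f"P({k} heads) = {prob:.5f}")
```

The probabilities over k = 0..n add up to one, as they must for any discrete distribution.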
Binomial Distribution Examples
• Finding the quantity of raw and used materials while making a product.
• Taking a survey of positive and negative reviews from the public for any
specific product or place.
• Using a YES/NO survey, we can count how many
people view a particular channel.
• Finding the number of male and female employees in an organisation.
• Counting the votes collected by a candidate in an election, where each vote is
recorded as 0 or 1.
Poisson distribution
• A Poisson distribution is a discrete probability distribution.
• It gives the probability of an event happening a certain number of
times (k) within a given interval of time or space.
• The Poisson distribution has only one parameter, λ (lambda), which is
the mean number of events.
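The Poisson probability of exactly k events is e^(−λ) × λ^k / k!. The sketch below assumes a hypothetical rate of λ = 3 text messages per hour.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3  # assumed mean number of text messages per hour
for k in range(6):
    print(f"P({k} messages) = {poisson_pmf(k, lam):.4f}")

# Summing over enough values of k should approach 1.
total = sum(poisson_pmf(k, lam) for k in range(50))
```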
Examples of Poisson distributions
• Since Bortkiewicz’s time, Poisson distributions have been used to
describe many other things.
• For example, a Poisson distribution could be used to explain or predict:
• Text messages per hour
• Machine malfunctions per year
• Website visitors per month
• Influenza cases per year
Uniform distribution
• The uniform distribution is a continuous probability distribution that describes
events which are equally likely to occur.
• It is defined by two parameters, a and b, where a = minimum value and b =
maximum value. It is generally denoted by U(a, b).
• The probability density function of a continuous random variable X with a uniform
distribution is f(x) = 1/(b − a) for a < x < b, where a and b are constants, and 0
otherwise. This is written as
X ∼ U(a, b)
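A short sketch of the uniform density, writing the minimum as a and the maximum as b. The waiting-time range of 0 to 60 seconds is an assumed example, and both function names are illustrative.

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): constant 1 / (b - a) inside the range, 0 outside."""
    return 1 / (b - a) if a < x < b else 0.0

def uniform_interval_probability(c, d, a, b):
    """P(c < X < d) for X ~ U(a, b): overlap length divided by range length."""
    low, high = max(c, a), min(d, b)
    return max(high - low, 0) / (b - a)

a, b = 0, 60  # assumed: waiting time at a red light, in seconds
density = uniform_pdf(20, a, b)
p_15_to_30 = uniform_interval_probability(15, 30, a, b)
print(density, p_15_to_30)
```

Because the density is flat, the probability of an interval depends only on its length, not on where it sits inside the range.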
Two types of probability distributions
1. Discrete probability distributions
2. Continuous probability distributions
Discrete probability distributions
• A discrete probability distribution is a probability distribution of a categorical or discrete
variable.
• Discrete probability distributions only include the probabilities of values that are possible.
• In other words, a discrete probability distribution doesn’t include any values with a
probability of zero.
• For example, a probability distribution of dice rolls doesn’t include 2.5 since it’s not a
possible outcome of dice rolls.
• The probabilities of all possible values in a discrete probability distribution add up to one.
• It’s certain (i.e., a probability of one) that an observation will have one of the possible
values.
Probability tables
• A probability table represents the discrete probability distribution of
a categorical variable. Probability tables can also represent a discrete
variable with only a few possible values or a continuous variable that’s
been grouped into class intervals.
• A probability table is composed of two columns:
• The values or class intervals
• Their probabilities
Example: Probability table
• A robot greets people using a random greeting. The probability distribution
of the greetings is described by the following probability table:
Greeting                             Probability
“Greetings, human!”                  .6
“Hi!”                                .1
“Salutations, organic life-form.”    .2
“Hello!”                             .1
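The robot's probability table can be represented as a dictionary and used to draw greetings at random. This is only a sketch of one way to do it; the seed is fixed so that repeated runs give the same draws.

```python
import random

# Probability table from the example: greeting -> probability.
greetings = {
    "Greetings, human!": 0.6,
    "Hi!": 0.1,
    "Salutations, organic life-form.": 0.2,
    "Hello!": 0.1,
}

# The probabilities of a discrete distribution must add up to one.
total = sum(greetings.values())

random.seed(0)
# Draw 1000 greetings, each chosen according to its probability.
draws = random.choices(list(greetings), weights=list(greetings.values()), k=1000)
share = draws.count("Greetings, human!") / len(draws)
print(f"total={total:.1f}, observed share of most likely greeting={share:.2f}")
```

With many draws, the observed share of each greeting should land close to its probability in the table.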
Common discrete probability distributions
Distribution | Description | Example
Binomial | Describes variables with two possible outcomes. It’s the probability distribution of the number of successes in n trials with p probability of success. | The number of times a coin lands on heads when you toss it five times
Discrete uniform | Describes events that have equal probabilities. | The suit of a randomly drawn playing card
Poisson | Describes count data. It gives the probability of an event happening k number of times within a given interval of time or space. | The number of text messages received per day
Continuous probability distributions
• A continuous probability distribution is the probability distribution of a continuous
variable.
• A continuous variable can have any value between its lowest and highest values.
Therefore, continuous probability distributions include every number in the variable’s
range.
• The probability that a continuous variable will have any specific value is so
infinitesimally small (a value approaching zero) that it’s considered to have a probability of
zero.
• However, the probability that a value will fall within a certain interval of values within its
range is greater than zero.
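The point-versus-interval idea can be illustrated with a standard normal distribution (mean 0, standard deviation 1), chosen here only as a convenient example. The probability of an interval is the difference of the cumulative distribution function at its two ends.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def interval_probability(dist, low, high):
    """P(low < X < high) as a difference of cumulative probabilities."""
    return dist.cdf(high) - dist.cdf(low)

# An interval has positive probability...
p_within_one_sd = interval_probability(standard_normal, -1, 1)
# ...while a single exact value (an interval of zero width) has probability zero.
p_single_point = interval_probability(standard_normal, 1, 1)

print(f"{p_within_one_sd:.4f}", p_single_point)
```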
Common continuous probability distributions
Distribution | Description | Example
Normal distribution | Describes data with values that become less probable the farther they are from the mean, with a bell-shaped probability density function. | SAT scores
Continuous uniform | Describes data for which equal-sized intervals have equal probability. | The amount of time cars wait at a red light
Log-normal | Describes right-skewed data. It’s the probability distribution of a random variable whose logarithm is normally distributed. | The average body weight of different mammal species
Exponential | Describes data that has higher probabilities for small values than large values. It’s the probability distribution of time between independent events. | Time between earthquakes
Probability distribution formulas
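Distribution | Formula | Type of formula
Binomial | P(X = k) = C(n, k) p^k (1 − p)^(n − k) | Probability mass function
Discrete uniform | P(X = x) = 1/N for each of N possible values | Probability mass function
Poisson | P(X = k) = e^(−λ) λ^k / k! | Probability mass function
Normal | f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)) | Probability density function
Continuous uniform | f(x) = 1/(b − a) for a < x < b, otherwise 0 | Probability density function
Exponential | f(x) = λ e^(−λx) for x ≥ 0 | Probability density function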
SAMPLING AND ESTIMATION
• The population is the entire group that you want to draw conclusions
about.
• The sample is the specific group of individuals that you will collect
data from.
Sampling frame
• The sampling frame is the actual list of individuals that the sample will be
drawn from.
• Ideally, it should include the entire target population (and nobody who is not
part of that population).
Example: Sampling frame
• You are doing research on working conditions at a social media marketing
company. Your population is all 1000 employees of the company. Your sampling
frame is the company’s HR database, which lists the names and contact details
of every employee.
Sample size
• The number of individuals you should include in your sample depends on
various factors, including the size and variability of the population and your
research design.
• There are different sample size calculators and formulas depending on what
you want to achieve with statistical analysis.
• Example tool: AI-Therapy, Statistics for Psychologists, sample size calculator.
Sampling
• Sampling is a method that allows us to get information about the
population based on the statistics from a subset of the population
(sample), without having to investigate every individual.
Why do we need Sampling?
• Sampling is done to draw conclusions about populations from samples,
and it enables us to determine a population’s characteristics by directly
observing only a portion (or sample) of the population.
• Studying a sample requires less time than studying every item in a
population
• Sample selection is a cost-efficient method
• Analysis of the sample is less cumbersome and more practical than an
analysis of the entire population
Steps involved in Sampling
Different Types of Sampling Techniques
Probability Sampling:
• In probability sampling, every element of the population has a known, non-zero
chance of being selected (an equal chance in simple random sampling).
• Probability sampling gives us the best chance to create a sample that is
truly representative of the population
Non-Probability Sampling:
• In non-probability sampling, elements do not all have a known or equal chance
of being selected.
• Consequently, there is a significant risk of ending up with a non-
representative sample which does not produce generalizable results
Simple Random Sampling
• Every individual is chosen entirely by chance and each member of
the population has an equal chance of being selected.
Systematic Sampling
• In this type of sampling, the first individual is selected randomly and
others are selected using a fixed ‘sampling interval’.
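Both techniques can be sketched with Python's `random` module. The population of 20 numbered individuals and the sample size of 5 are assumptions made for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 21))  # hypothetical individuals numbered 1..20
sample_size = 5

# Simple random sampling: every individual has an equal chance of selection.
simple_random = random.sample(population, sample_size)

# Systematic sampling: random starting point, then every k-th individual.
interval = len(population) // sample_size  # sampling interval k = 4
start = random.randrange(interval)         # random start within the first interval
systematic = population[start::interval]

print("Simple random:", sorted(simple_random))
print("Systematic:   ", systematic)
```

In the systematic sample only the starting point is random; once it is chosen, the rest of the sample is fully determined by the interval.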
Stratified Sampling
Cluster Sampling
• In the above example, we have divided our population into 5
clusters.
• Each cluster consists of 4 individuals and we have taken the
4th cluster in our sample.
• We can include more clusters as per our sample size.
• This type of sampling is used when we focus on a specific region
or area.
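The cluster example (5 clusters of 4 individuals each) can be sketched as follows. The individuals are represented by the numbers 1 to 20, and the chosen cluster is picked at random rather than fixed to the 4th; both are assumptions for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

population = list(range(1, 21))  # hypothetical individuals numbered 1..20
cluster_size = 4

# Divide the population into 5 clusters of 4 individuals each.
clusters = [population[i:i + cluster_size]
            for i in range(0, len(population), cluster_size)]

# Select whole clusters at random; raise k to include more clusters.
chosen = random.sample(clusters, k=1)

# Every individual in a chosen cluster becomes part of the sample.
sample = [person for cluster in chosen for person in cluster]
print(sample)
```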
Non-Probability Sampling
Convenience Sampling
Quota Sampling
Judgment Sampling
Snowball Sampling
ESTIMATION
Estimation
• Estimation is any of numerous procedures used to calculate the value of some
property of a population from observations of a sample drawn
from the population.
• Estimation refers to the process by which one makes inferences about a
population, based on information obtained from a sample.
• The procedure of making judgment or decision about a population
parameter is referred to as statistical estimation or simply estimation.
• Statistical estimation procedures provide estimates of population
parameters with a desired degree of confidence. The degree of
confidence can be controlled in part
1.by the size of the sample (the larger the sample, the greater the accuracy of the estimate) and
2.by the type of the estimate made.
• Population parameters are estimated from
sample data because it is usually impracticable to examine the entire
population in order to make an exact determination.
Types of Estimates
Point estimate.
• A point estimate of a population parameter is a single value of a statistic.
• For example, the sample mean x̄ is a point estimate of the population mean μ.
Similarly, the sample proportion p is a point estimate of the population
proportion P.
Interval estimate.
• An interval estimate is defined by two numbers, between which a population
parameter is said to lie.
• For example, a < μ < b is an interval estimate of the population mean μ. It
indicates that the population mean is greater than a but less than b.
Confidence Intervals
Statisticians use a confidence interval to express the precision and
uncertainty associated with a particular sampling method.
A confidence interval consists of three parts.
• A confidence level.
• A statistic.
• A margin of error.
• The confidence level describes the uncertainty of a sampling method.
• The statistic and the margin of error define an interval estimate that
describes the precision of the method.
• The interval estimate of a confidence interval is defined by the sample
statistic ± margin of error.
• Confidence intervals are preferred to point estimates, because confidence
intervals indicate
• (a) the precision of the estimate and
• (b) the uncertainty of the estimate.
Confidence Level
• The probability part of a confidence interval is called a confidence level.
• The confidence level describes the likelihood that a particular sampling method will
produce a confidence interval that includes the true population parameter.
• Suppose we collected all possible samples from a given population, and computed
confidence intervals for each sample. Some confidence intervals would include the
true population parameter; others would not. A 95% confidence level means that
95% of the intervals contain the true population parameter; a 90% confidence level
means that 90% of the intervals contain the population parameter; and so on.
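The meaning of a 95% confidence level can be illustrated by simulation rather than by collecting all possible samples. The sketch below makes several assumptions: a normal population with a known mean of 50 and standard deviation of 10, samples of size 30, and a z-based interval for the mean.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN, SIGMA = 50.0, 10.0  # assumed population parameters
N, TRIALS = 30, 2000           # sample size and number of repeated samples

# Critical value for a 95% confidence level (about 1.96).
z = NormalDist().inv_cdf(0.975)
margin = z * SIGMA / N ** 0.5

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    sample_mean = sum(sample) / N
    # Does this sample's interval contain the true population mean?
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        hits += 1

coverage = hits / TRIALS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share printed should be close to 0.95: most intervals contain the true mean, and a few do not.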
Margin of Error
• In a confidence interval, the range of values above and below the sample
statistic is called the margin of error.
• For example, suppose the local newspaper conducts an election survey
and reports that the independent candidate will receive 30% of the vote.
The newspaper states that the survey had a 5% margin of error and a
confidence level of 95%. These findings result in the following confidence
interval: We are 95% confident that the independent candidate will
receive between 25% and 35% of the vote.
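The newspaper example can be reproduced in a few lines. The second half of the sketch shows where a margin of error of about 5% could come from; the sample size of 323 respondents is an assumption, since the survey size is not stated, and the formula used is the common normal approximation for a proportion.

```python
from statistics import NormalDist

def confidence_interval(statistic, margin):
    """Interval estimate: sample statistic plus or minus the margin of error."""
    return statistic - margin, statistic + margin

# Survey result in percentage points: 30% support, 5% margin of error.
low, high = confidence_interval(30, 5)
print(f"95% confident the candidate receives between {low}% and {high}%")

# Margin of error for a proportion: z * sqrt(p * (1 - p) / n).
p, n = 0.30, 323                 # n is a hypothetical number of respondents
z = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence
margin = z * (p * (1 - p) / n) ** 0.5
print(f"Margin of error with n={n}: {margin:.3f}")
```

A larger sample shrinks the margin of error, which matches the earlier point that sample size partly controls the accuracy of an estimate.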
Analysis of descriptive analytics
Calculate Descriptive Statistics
1.Age: 1 (<20), 2 (20-25), 3 (25-30), 4 (30-40), 5 (>40)
2.Gender: 1 (Male), 2 (Female)
3.Education: 1 (High School), 2 (Graduate in Arts and Science), 3 (Graduate in
professional degree), 4 (Post Graduate degree)
4.Working experience (years): 1 (<1 year), 2 (1-5 years), 3 (5-10 years),
4 (10-20 years), 5 (>20 years)
• Enter your own data set (minimum 20 records) in the Data View of
SPSS, then calculate the descriptive statistics and graphically represent the
variables in the form of a bar chart.
AIM:
• To enter the data and apply descriptive statistics test to analyze the variables
using SPSS software.
PROCEDURE:
1.Start SPSS.
2.Select File > New > Data.
3.Enter the variable names in the Variable View, then enter the decimals, labels, and
values.
4.Enter the data in the Data View.
5.Select Analyze.
6.Select Descriptive Statistics and choose the required procedure (for example,
Frequencies, which can also draw bar charts).
7.Select the variables you want to analyze and click the arrow button to move the
variables into the variables list.
8.Click OK.
Output