Decision Science & Data Analysis
❖ Overview of Data
● Variable: A characteristic or a quantity of interest that can take on different values.
● Dichotomous Variable
● Discrete Variable
● Continuous Variable
e.g. variables are Symbol, Industry, Share Price and Volume
● Data are the facts and figures collected, analysed and summarized for presentation and
interpretation.
● Observation is a set of values corresponding to a set of variables.
● Random Variable is a quantity whose values are not known with certainty.
● By focusing on decisions as the unit of analysis, decision science provides
a unique framework for understanding public health problems, and for
improving policies to address those problems.
Data Scientists:
● Data Scientists are looking to understand, interpret and analyze data with the goal of
building better products. Therefore, data quality, statistical rigor and measurement
perfection are often their trademarks.
● For Data Scientists, the analysis, statistical rigor and understanding come first.
Business challenges come second.
● Data Scientists think about data in terms of data patterns, processing, algorithms and
statistics. Often, data scientists are conducting deep analysis and experimental statistics.
● They are obsessed with finding causal relationships.
● They are deeply focused on data quality as it relates to their product area, because
better data quality results in more thorough statistical analysis.
● They frame data analysis in terms of algorithms, machine learning, statistics and
experimentation.
● They are looking to bring order to big data to find insights and learnings as they relate
to their product/focus area.
● Their north star goal: use high-quality data and robust statistics to support product
development.
Decision Scientists:
● Decision Scientists frame data analysis in terms of the decision making process.
● They are looking at the various ways of analyzing data as it relates to a specific
business question posed by their stakeholder/s.
● Other names for this role may include: data analytics, analyst and applied analytics.
● The Data Scientist focuses on finding insights and relationships via statistics. The
Decision Scientist is looking to find insights as they relate to the decision at hand.
Example decisions might include: age groups to focus on, the best way to spend a yearly
budget, etc.
● For Decision Scientists, the business problem comes first. Analysis follows and is
dependent on the question or business decision that needs to be made.
● The Decision Scientist needs to consider the type of analysis, visualization methods and
behavioral understanding that can help a stakeholder make a specific decision. Decision
Scientists need to make insights usable.
● They need to be able to work with a variety of data sources and inputs, each selected
based on its ability to help answer the business question.
● Their north star goal: use data and statistics to support business decision making,
budgeting and marketing spend.
Virtually every area of business uses statistics in decision making. Here are some
recent examples:
A Deloitte Retail “Green” survey of 1080 adults revealed that 54% agreed that
plastic, non-compostable shopping bags should be banned.
A survey of 1007 adults by RBC Capital Markets showed that 37% of adults
would be willing to drive 5 to 10 miles to save 20 cents on a gallon of gas.
Statistics is the science concerned with developing and studying methods for
collecting, analyzing, interpreting and presenting empirical data. Statistics
is a highly interdisciplinary field; research in statistics finds applicability in
virtually all scientific fields, and research questions in the various scientific
fields motivate the development of new statistical methods and theory. In
developing methods and studying the theory that underlies the methods,
statisticians draw on a variety of mathematical and computational tools.
❖ Types of Statistics
There are two main types of statistics:
• Descriptive Statistics
• Inferential Statistics
Descriptive statistics describe a data set, or a collection of data, in summary form.
Inferential statistics are then used to explain that summary and to draw conclusions from it.
Both types are widely used.
Descriptive statistics:
● Histograms, pie charts, bar charts and scatter plots are common ways to summarise data
and present it in tables or graphs.
● Descriptive statistics are just that: descriptive. They do not generalise beyond the
data collected.
Inferential statistics:
● Inferential statistics are used to test hypotheses and study correlations between
variables, and they can also be used to predict population sizes.
● Inferential statistics are used to derive conclusions and inferences from samples, i.e.
to create accurate generalisations.
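The contrast between the two types can be sketched in a few lines of Python. The sample values are made up, and the 95% interval uses the normal approximation (z = 1.96), which is an assumption chosen for brevity; a t-based interval would suit such a small sample better.

```python
import math
import statistics

# Made-up sample of 10 measurements (illustrative only).
sample = [52, 48, 55, 60, 47, 51, 58, 49, 53, 57]

# Descriptive: summarise the data actually collected.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# Inferential: use the sample to say something about the population mean.
# Assumption: normal approximation with z = 1.96 for a rough 95% interval.
standard_error = sample_sd / math.sqrt(len(sample))
interval = (sample_mean - 1.96 * standard_error,
            sample_mean + 1.96 * standard_error)

print("mean:", sample_mean)
print("standard deviation:", round(sample_sd, 2))
print("approx. 95% interval for the population mean:",
      tuple(round(v, 2) for v in interval))
```

The first two numbers only describe the ten observed values; the interval is a statement about the wider population the sample is assumed to come from.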
Probability and statistics form the basis of data science. Probability theory is
essential for making predictions, and estimates and predictions are an important
part of data science. Statistical methods are used to make estimates for further
analysis, so they depend heavily on the theory of probability. Both probability and
statistics, in turn, depend on data.
❖ Data
● Data is the collected information (observations) we have about something, or
facts and statistics collected together for reference or analysis.
● Data: a collection of facts (numbers, words, measurements, observations,
etc.) that has been translated into a form that computers can process.
Data matters because important information can be inferred from it. Now
let's look at how data is categorized.
Data can be of two types: categorical and numerical.
For example, in a bank, region, occupation class and gender are categorical,
because each takes one of a fixed set of values. Balance, credit score, age and
tenure in months are numerical, because they are measured on a numeric scale and
can take a wide range of values.
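As a rough illustration of the bank example, the sketch below labels each field of one customer record. The record, its field names and the type-based rule are all hypothetical simplifications, not part of the original notes.

```python
# Hypothetical bank customer record; field names and values are illustrative.
customer = {
    "region": "North",
    "occupation_class": "Salaried",
    "gender": "F",
    "balance": 15230.75,
    "credit_score": 712,
    "age": 41,
    "tenure_months": 36,
}


def data_type(value):
    """Label a value as numerical or categorical from its Python type."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "categorical"


field_types = {field: data_type(value) for field, value in customer.items()}
for field, kind in field_types.items():
    print(f"{field}: {kind}")
```

A type check is only a first approximation: numeric codes such as branch numbers or postal codes are stored as numbers but behave as categories.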
Statistics is built on variables. A variable is a countable or measurable
characteristic or attribute of an item. For example, a car can
have variables such as make, model, year, mileage, color, or condition. By
combining the variables across a set of data (e.g. the colors of all cars in a given
parking lot), statistics allows us to better understand trends and outcomes.
tend to cluster. For example: The average starting salary for social workers is
$15,000 per year and it gives some idea of how much variety or heterogeneity
there is in the distribution.)
● It should be based on all the observations: The average should depend
upon each and every observation, so that if any observation is dropped
the average itself is altered.
● It should be rigidly defined: An average should be properly defined so
that it has one and only one interpretation. It should preferably be defined
by an algebraic formula so that if different people compute the average from
the same figures they all get the same answer (barring arithmetical
mistakes).
● It should be capable of further algebraic treatment: We should prefer to
have an average that could be used for further statistical computations. For
example: If we are given separately the figures of average income and
number of employees of two or more companies, we should be able to
compute the combined average.
● It should have sampling stability: We should prefer to get a value which
has what the statisticians call 'sampling stability'. This means that if we
pick 10 different groups of college students, and compute the average of
each group, we should expect to get approximately the same values.
● It should not be unduly affected by the presence of extreme values:
Although each and every observation should influence the value of the
average, none of the observations should influence it unduly. If one or two
very small or very large observations unduly affect the average, i.e., either
increase its value or reduce its value, the average cannot be really typical of
the entire set of data.
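Two of these properties can be checked numerically. The sketch below uses invented company figures to compute a combined average from group averages, and then shows how a single extreme value moves the mean far more than the median.

```python
import statistics

# Further algebraic treatment: combined mean from group means and group sizes.
# The company figures are invented for illustration.
mean_a, n_a = 30000, 40  # company A: average income, number of employees
mean_b, n_b = 45000, 60  # company B
combined_mean = (mean_a * n_a + mean_b * n_b) / (n_a + n_b)
print("combined mean:", combined_mean)

# Effect of extreme values: add one very large income to a small data set.
incomes = [7000, 7200, 7500, 7600, 7800]
with_outlier = incomes + [70000]

mean_plain = statistics.mean(incomes)
mean_outlier = statistics.mean(with_outlier)
median_plain = statistics.median(incomes)
median_outlier = statistics.median(with_outlier)

print("mean:", mean_plain, "->", mean_outlier)
print("median:", median_plain, "->", median_outlier)
```

The mean more than doubles when the extreme value is added, while the median barely changes, which is why the mean fails the last property in the list and the median does not.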
II. Median
III. Mode
Q. What are the various types of averages or means?
● Arithmetic mean
● Geometric mean
● Harmonic mean
Ungrouped data:
● Data that give information on each member of the population or sample individually
are called ungrouped data.
● Direct method: if X1, X2, ..., XN represent the values of N items or observations, the
arithmetic mean, denoted by x̄, is defined as:
$$\bar{x} = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$$
● The short-cut method takes more time than the direct method. However, this is
true only for ungrouped data.
Grouped data:
● Grouped data are presented in the form of a frequency distribution table, i.e. class
intervals with their respective frequencies.
● Direct method: the formula for estimating the average from grouped data by the direct
method is:
$$\bar{x} = \frac{\sum f_i m_i}{\sum f_i}$$
where m_i is the mid-point of the i-th class and f_i is its frequency.
● In the case of grouped data, a considerable saving in time is possible by adopting the
short-cut method.
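A small sketch of the direct and short-cut methods follows. The data are invented, and the short-cut formula used here is the usual assumed-mean form, x̄ = A + Σf(m − A)/Σf, which is an assumption because the notes do not spell it out.

```python
# Ungrouped data, direct method: mean = (sum of values) / (number of values)
values = [12, 15, 18, 20, 25]
mean_ungrouped = sum(values) / len(values)

# Grouped data: class mid-points m and frequencies f (invented frequency table)
midpoints = [5, 15, 25, 35, 45]
freqs = [4, 6, 10, 6, 4]
total = sum(freqs)

# Direct method: mean = sum(f * m) / sum(f)
mean_direct = sum(f * m for f, m in zip(freqs, midpoints)) / total

# Short-cut method with an assumed mean A (assumption: usual textbook form):
# mean = A + sum(f * (m - A)) / sum(f)
assumed_mean = 25
mean_shortcut = assumed_mean + sum(
    f * (m - assumed_mean) for f, m in zip(freqs, midpoints)
) / total

print("ungrouped mean:", mean_ungrouped)
print("grouped mean, direct:", mean_direct)
print("grouped mean, short-cut:", mean_shortcut)
```

Both grouped methods give the same answer; the short-cut method only replaces large products f × m with smaller deviations from the assumed mean, which matters mostly for hand calculation.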
❖ Merits
● All values are used
● It has a unique value and is easy to calculate
● The sum of the deviations from the mean is zero.
❖ Demerits
● The mean is affected by extreme values
❖ Median
The median is a point in a distribution of scores above and below which exactly
half of the cases fall. This is a value which appears in the middle of ordered
sequence of values. This is also known as positional average. The term ‘position’
refers to the place of a value in a series.
Example: If the incomes of five persons are $7000, $7200, $7500, $7600 and $7800, then the
median income is $7500.
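The positional idea can be sketched directly: sort the values and take the middle one. The helper name `positional_median` is hypothetical, and averaging the two middle values for an even count is the usual convention, assumed here.

```python
import statistics


def positional_median(values):
    """Return the middle value of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values (assumed convention).
    return (ordered[mid - 1] + ordered[mid]) / 2


incomes = [7800, 7000, 7600, 7200, 7500]  # deliberately unordered
median_income = positional_median(incomes)
print(median_income)

# Cross-check against the standard library.
print(statistics.median(incomes))
```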
Merits
● Median is unique
● Median is less affected by extreme values as compared to mean
● It can be used for open–end distribution
● Graphical presentation of median is possible
● Median is used for studying qualitative attributes
Demerits
● For median, it is necessary to arrange the data
● It is not capable of further algebraic treatment
● It does not use each and every observation of the data set
where the symbols have their usual meanings and interpretation.
Question: What is meant by Mode?
Answer: Mode refers to the most common value in a distribution or the largest
category of a variable. It may also be defined as the value which occurs the maximum
number of times, i.e. the value having the maximum frequency.
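A distribution containing more than one mode is called bimodal or multimodal.
For grouped data, the mode is calculated as:
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$$
where,
L = Lower limit of the modal class
f1 = Frequency of the modal class
f0 = Frequency of the class preceding the modal class
f2 = Frequency of the class succeeding the modal class
h = Width of the modal class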
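The symbols listed here match the usual grouped-data formula Mode = L + (f1 − f0) / (2·f1 − f0 − f2) × h. The sketch below assumes that formula; the class width h and the frequency table are additions for illustration. It also shows the standard library's `multimode` for ungrouped data.

```python
import statistics


def grouped_mode(L, f1, f0, f2, h):
    """Mode of grouped data (assumed formula; h is the modal class width)."""
    return L + (f1 - f0) / (2 * f1 - f0 - f2) * h


# Invented table: classes 0-10, 10-20, 20-30, 30-40 with frequencies
# 5, 8, 12, 7. The modal class is 20-30 (highest frequency, 12).
mode_value = grouped_mode(L=20, f1=12, f0=8, f2=7, h=10)
print(round(mode_value, 2))

# Ungrouped data: multimode returns every value with the highest frequency,
# so a bimodal data set yields two modes.
modes = statistics.multimode([1, 2, 2, 3, 3, 4])
print(modes)
```

The computed mode always falls inside the modal class, closer to whichever neighbouring class has the higher frequency.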
Question: What are the merits of mode?
Answer:
● Like median, the mode is not affected by extreme values and its value can be
obtained in open-end distributions without ascertaining the class limits.
● Mode can be easily used to describe qualitative phenomena. For example,
when we want to compare consumer preferences for different types of
products, say, soaps, toothpastes, etc., or for different media of advertising,
we should compare the modal preferences.
● In distributions where there is an outstandingly large frequency, the mode
is meaningful as an average.
Question: What are the limitations of mode?
Answer: Mode is not a rigidly defined measure, as there are several formulae
for calculating the mode, which usually give somewhat different answers.
The value of the mode cannot always be determined uniquely, for example in the case of
bimodal distributions.
[Table: Individual Series, Discrete Series, Continuous Series]
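● Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
● Standard Deviation (Population): $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
● Standard Deviation (Sample): $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
● Coefficient of Variation: $\text{C.V} = \frac{\sigma}{\bar{x}} \times 100$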
● Standard deviation is an absolute measure of dispersion.
● C.V is a relative measure of dispersion corresponding to the standard deviation.
● C.V is used to compare the variability of two or more data sets.
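These measures can be computed with Python's standard `statistics` module. The data set is made up, and the coefficient of variation is taken as standard deviation divided by mean, times 100, which is the usual definition and assumed here.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

mean = statistics.mean(data)
population_variance = statistics.pvariance(data)  # divides by N
population_sd = statistics.pstdev(data)           # square root of the above
sample_sd = statistics.stdev(data)                # divides by n - 1

# Coefficient of variation as a percentage (assumed definition: sd / mean * 100)
cv_percent = population_sd / mean * 100

print("mean:", mean)
print("population variance:", population_variance)
print("population SD:", population_sd)
print("sample SD:", round(sample_sd, 3))
print("C.V (%):", round(cv_percent, 1))
```

Because C.V is unit-free, it can compare the spread of data sets measured on different scales, for example salaries in dollars against ages in years.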
❖ Skewness
● Skewness means lack of symmetry or departure from symmetry.
● A symmetric distribution has its mean, median, mode equal and the
frequency curve is symmetrically situated about these values.
● When distribution has longer tail on right side, it is positively skewed.
● When longer tail is on left side, it is negatively skewed.
❖ Symmetric distribution
If there is no skewness, or the distribution is symmetric like the bell-shaped
normal curve, then mean = median = mode.
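The direction of skewness can be checked numerically. The sketch below uses the moment coefficient of skewness, m3 / m2^1.5, which is one common measure and an assumption here, since the notes do not state which formula they use. The data sets are invented.

```python
def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed measure)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]        # long tail on the right
left_tailed = [-x for x in right_tailed]        # mirror image: tail on the left

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(name, round(moment_skewness(data), 3))
```

A positive result corresponds to a longer right tail, a negative result to a longer left tail, and zero to a symmetric data set.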
❖ Shape of a Distribution
It is given by the formula:
❖ Kurtosis
The measure of kurtosis describes the degree of concentration of frequencies in a
given distribution, that is, whether the observed values are concentrated more
around the mode (a peaked curve) or away from the mode towards both tails. The
degree of kurtosis of a distribution is measured relative to the peakedness of a
normal curve.
There are three possibilities:
(i) If a curve is more peaked than the normal curve, it is said to be Leptokurtic.
(ii) If a curve is less peaked than the normal curve, it is said to be Platykurtic.
(iii) If a curve is as peaked as the normal curve, it is said to be Mesokurtic.
Measure of kurtosis
Kurtosis is measured by β2. The formula for β2 is given by:
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
where μ2 and μ4 are the second and fourth central moments. For a normal curve, β2 = 3.
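A minimal sketch of the measure, assuming the standard definition β2 = μ4 / μ2² with μ2 and μ4 the second and fourth central moments, and the usual benchmark β2 = 3 for the normal curve. The data sets and the helper names are invented for illustration.

```python
def beta2(values):
    """Kurtosis measure beta2 = m4 / m2**2 (assumed standard definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2


def kurtosis_label(b2, tolerance=1e-9):
    """Classify relative to the normal curve's value of 3."""
    if abs(b2 - 3) <= tolerance:
        return "Mesokurtic"
    return "Leptokurtic" if b2 > 3 else "Platykurtic"


flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]     # most values at the centre

for name, data in [("flat", flat), ("peaked", peaked)]:
    b2 = beta2(data)
    print(name, round(b2, 2), kurtosis_label(b2))
```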
❖ Weighted Average
An average in which some of the numbers are assigned more importance, or weight,
than others:
$$\bar{x}_w = \frac{\sum w x}{\sum w}$$
where w is the weight of the data value x.
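A short sketch of a weighted average, computed as Σ(w·x) / Σw. The scores and weights are invented, and the function name is hypothetical.

```python
def weighted_average(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Invented example: three exam scores, with the final exam weighted most.
scores = [80, 90, 70]
weights = [2, 3, 5]

weighted = weighted_average(scores, weights)
unweighted = weighted_average(scores, [1, 1, 1])  # equal weights = plain mean

print("weighted:", weighted)
print("unweighted:", unweighted)
```

With equal weights the result reduces to the ordinary arithmetic mean; here the heavily weighted low score pulls the weighted average below it.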
Measure of Association between two variables
CORRELATION & REGRESSION ANALYSIS
A distribution in which each individual or unit of the set is made up of two
values is called a bivariate distribution.
'Correlation' is a statistical tool which studies the relationship
between two variables, and Correlation Analysis involves various methods and
techniques used for studying and measuring the extent of the relationship between
the two variables.
"Two variables are said to be in correlation if the change in one of the variables
results in a change in the other variable".
❖ CORRELATION
● When the relationship is of a quantitative nature, the appropriate statistical
tool for discovering and measuring the relationship and expressing it in a
brief formula is known as correlation.
● The measure of correlation, called the coefficient of correlation, indicates the
strength and direction of the relationship between two variables.
● The coefficient between two variables x and y is denoted by r, rxy or ρ.
● It lies between −1 and +1.
● If r = 0, the variables are said to be uncorrelated, i.e. there is no linear
relationship between them. This alone does not guarantee that they are independent.
❖ TYPES OF CORRELATION
1. Based on Direction:
● Positive Correlation: When an increase/decrease in the value of one variable
results in a corresponding increase/decrease in the value of the other variable.
● Negative Correlation: When an increase/decrease in the value of one variable
results in a corresponding decrease/increase in the value of the other variable.
2. Based on Degree:
● High
● Moderate
● Low
If we wish to label the strength of the association, for absolute values of r, 0-0.19
is regarded as very weak, 0.2-0.39 as weak, 0.40-0.59 as moderate, 0.6-0.79 as
strong and 0.8-1 as very strong correlation, but these are rather arbitrary limits,
and the context of the results should be considered.
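The labelling rule can be written as a small lookup. The function name is hypothetical, and values falling between the printed bands (for example 0.195) are assigned to the lower band, which is an assumption.

```python
def correlation_strength(r):
    """Label |r| using the bands quoted in the notes (arbitrary limits)."""
    size = abs(r)
    if size > 1:
        raise ValueError("r must lie between -1 and +1")
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.6:
        return "moderate"
    if size < 0.8:
        return "strong"
    return "very strong"


for r in (0.1, 0.2, -0.45, 0.79, -0.85):
    print(r, correlation_strength(r))
```

The sign of r is ignored on purpose: it gives the direction of the relationship, while the label describes only its strength.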
3. Spearman's Rank Correlation Coefficient.
1. SCATTER DIAGRAM
● The simplest method for studying correlation in two variables is a special
type of dot chart called a Scatter Diagram.
● In this method the given data are plotted in the form of dots, one for each pair of X
and Y.
● The more the plotted points scatter over the chart, the lesser is the degree of
relationship between the two variables.
● The more nearly the points come to a straight line, the higher the degree of
relationship.
● If the points are very close to each other, a fairly good amount of correlation
can be expected between the two variables. On the other hand, if they are
widely scattered, a poor correlation can be expected between them.
Advantages:
● It is readily comprehensible and enables us to form a rough idea of the nature
of the relationship between the two variables x and y.
● It is not affected by extreme observations.
Disadvantages:
● It is not a suitable method if the number of observations is fairly large.
● It is only a rough measure of correlation, where the exact magnitude cannot
be known.
❖ PEARSON FORMULA
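The correlation coefficient is denoted by r and is given by the formula:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$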
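A minimal sketch of the coefficient, assuming the standard product-moment form r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The data are invented, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumed product-moment form)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # y rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises

# y = x**2 on a symmetric range: y depends entirely on x, yet r is zero,
# because r only measures the linear part of a relationship.
symmetric_x = [-2, -1, 0, 1, 2]
r_curved = pearson_r(symmetric_x, [v ** 2 for v in symmetric_x])

print(r_positive, r_negative, r_curved)
```

The last case is a useful caution when reading r: a value of zero rules out a linear relationship, not every relationship.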