Data science
using R
By
Abinaya
INTRODUCTION
• Data Science is about data gathering, analysis and decision-making.
• Data Science is about finding patterns in data through analysis and using them to make predictions about the future.
By using Data Science, companies are able to make:
• Better decisions (should we choose A or B)
• Predictive analysis (what will happen next?)
• Pattern discoveries (find patterns, or hidden information, in the data)
Where is data science
needed
Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing.
Examples of where Data Science is needed:
• For route planning: to discover the best shipping routes
• To foresee delays for flights, ships, trains, etc. (through predictive analysis)
• To create promotional offers
• To find the best time to deliver goods
• To forecast a company's revenue for the next year
• To analyze the health benefits of training
• To predict who will win elections
Data Science can be applied in nearly every part of a business where data is available.
Examples are:
• Consumer goods
• Stock markets
• Industry
• Politics
• Logistics companies
• E-commerce
Introduction to statistics
• Statistics is a field of mathematics that deals with the collection, tabulation, and interpretation of numerical data. Put simply, statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation.
• It is a form of mathematical analysis that uses quantitative models to describe a set of experimental data or real-life studies. Statistics deals with how data can be used to solve complex problems.
Some people consider statistics to be a distinct
mathematical science rather than a branch of
mathematics.
Statistics simplifies work by summarizing data into a clear picture of what you do on a regular basis.
Statistics is used in a wide variety of sciences and applications, including weather forecasting, the study of the stock market, the insurance sector, the betting industry, and data science.
Descriptive statistics
Descriptive statistics quantitatively describe the data set that is being analyzed. The description can take two forms:
1. Graphical Representation
2. Tabular Representation
Graphical statistics
Plot Type: Bar Plot
Variable Type: Only one categorical variable, or one categorical variable and one continuous measure
Description: A bar plot is a chart that presents categorical data with rectangular bars whose heights or lengths are proportional to the values that they represent. Visually represents frequency distribution.
Plot Type: Stacked Bar Plot
Variable Type: Two categorical variables
Description: A stacked bar chart, also known as a stacked bar graph, is a graph that is used to break down a category by another category and compare parts of a whole. Each bar in the chart represents one category as a whole, and segments in the bar represent different parts or categories of that whole. Visually represents cross-tabulation data.

Plot Type: Histogram
Variable Type: Only one continuous variable
Description: A histogram is an approximate representation of the distribution of numerical data. It is created by converting a continuous variable into a categorical one by binning (bucketing) it.
Plot Type: Distribution Plot (Density Plot)
Variable Type: Only one continuous variable
Description: A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. It is a smoothed version of the histogram. Visually shows skewness in data.

Plot Type: Box Plot (Box and Whisker Plot)
Variable Type: Only one continuous variable, or one continuous and one categorical variable
Description: The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. The minimum and maximum shown by the whiskers are not the raw extremes but a Lower Control Limit (LCL) and an Upper Control Limit (UCL), commonly set at 1.5 × IQR below the first quartile and above the third quartile. Any data point beyond the LCL or UCL is typically considered an outlier. Quickly helps find outliers in data.
Plot Type: Line Plot
Variable Type: One dimension has to be time, and the second dimension a continuous variable
Description: A line plot is a type of chart that displays information as a series of data points called 'markers' connected by straight line segments. Visually shows trends in time series data.

Plot Type: Scatter Plot
Variable Type: Two continuous variables
Description: A graph in which the values of two variables are plotted along two axes. The pattern of the resulting points visually depicts the existence of correlation between the two variables. Quickly helps find correlation.

Plot Type: Pie Chart
Variable Type: One categorical variable associated with a continuous variable
Description: A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportions. Quickly helps compare parts of a whole.
Bar plot
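A minimal sketch of a bar plot in base R. The `fruit` vector is made-up illustration data, not data from this deck.

```r
# Hypothetical categorical data
fruit <- c("apple", "banana", "apple", "cherry", "banana", "apple")

# Count how often each category occurs
counts <- table(fruit)

# Draw one bar per category; bar height is the frequency
barplot(counts, main = "Fruit frequency", xlab = "Fruit", ylab = "Count")
```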
Stacked bar plot
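One way to draw a stacked bar plot in base R is to cross-tabulate two categorical variables and pass the result to `barplot()`. The `region` and `product` vectors below are hypothetical.

```r
# Hypothetical survey: product preference by region
region  <- c("North", "North", "South", "South", "South", "North")
product <- c("A", "B", "A", "A", "B", "A")

# Cross-tabulate: rows are products, columns are regions
xtab <- table(product, region)

# beside = FALSE (the default) stacks the product segments inside each region bar
barplot(xtab, beside = FALSE, legend.text = TRUE,
        main = "Product preference by region", ylab = "Count")
```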
Histogram
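A short sketch of a histogram in base R, assuming a simulated sample of heights (the data and the choice of about 10 bins are illustrative only).

```r
set.seed(1)                                  # make the random sample reproducible
heights <- rnorm(200, mean = 170, sd = 8)    # hypothetical heights in cm

# hist() bins the continuous values and draws one bar per bin;
# breaks = 10 is a suggestion, R may pick a nearby "pretty" number of bins
h <- hist(heights, breaks = 10, main = "Heights", xlab = "Height (cm)")

# The returned object holds the bin edges and the count in each bin
h$breaks
h$counts
```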
Distribution
plot
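A density plot can be sketched in base R with `density()`, which computes a kernel density estimate. The skewed `income` sample below is simulated for illustration.

```r
set.seed(1)
income <- rexp(500, rate = 1 / 30000)   # hypothetical right-skewed incomes

d <- density(income)                    # kernel density estimate (Gaussian kernel by default)
plot(d, main = "Income density", xlab = "Income")

# A mean above the median is one simple sign of right skew
mean(income) > median(income)
```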
Box and whisker
plot
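A minimal box plot sketch in base R. The `scores` vector is hypothetical and deliberately includes one extreme value so that an outlier shows up. By default `boxplot()` ends the whiskers at the most extreme points within 1.5 × IQR of the box.

```r
set.seed(1)
scores <- c(rnorm(50, mean = 60, sd = 5), 95)   # hypothetical scores plus one extreme value

b <- boxplot(scores, main = "Scores", ylab = "Score")

b$stats   # lower whisker, first quartile, median, third quartile, upper whisker
b$out     # points drawn individually as outliers
```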
Line plot
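A line plot of a time series could look like this in base R. The monthly `sales` figures are invented for the example.

```r
# Hypothetical monthly sales for one year
month <- seq(as.Date("2023-01-01"), by = "month", length.out = 12)
sales <- c(12, 14, 13, 17, 19, 22, 21, 24, 23, 27, 29, 31)

# type = "b" draws both the markers and the connecting line segments
plot(month, sales, type = "b", main = "Monthly sales",
     xlab = "Month", ylab = "Sales (units)")
```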
Scatter plot
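A scatter plot sketch using `mtcars`, a data set that ships with R (it is used here only as a convenient example, not as data from this deck).

```r
# Car weight (wt) against fuel efficiency (mpg)
plot(mtcars$wt, mtcars$mpg, main = "Weight vs. fuel efficiency",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# The Pearson correlation coefficient puts a number on the visual pattern
r <- cor(mtcars$wt, mtcars$mpg)
r
```

The points slope downward and `r` is strongly negative: heavier cars in this data set tend to have lower fuel efficiency.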
Pie chart
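A small pie chart sketch in base R. The `share` values are hypothetical market shares.

```r
# Hypothetical market shares (percent)
share <- c(A = 45, B = 30, C = 25)

# Label each slice with its name and its percentage of the total
slice_labels <- paste0(names(share), " (", round(100 * share / sum(share)), "%)")
pie(share, labels = slice_labels, main = "Market share")
```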
Tabular statistics
• The tabular method of data presentation is widespread in all spheres of human life. These methods are used to summarize data from a sample or population in table format.
• Data is grouped into categories and the number (or
frequency) of observations in each category is
obtained.
• Frequency distribution is a type of tabular method. A
frequency distribution is a tabular summary of data
showing the frequency of items in each of several
non-overlapping classes.
• The objective is to provide insights about the data that
cannot be quickly obtained by looking only at the
original data.
Example
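As an illustration, a frequency distribution can be built in base R with `cut()` and `table()`. The `marks` vector and the class width of 20 are assumptions made for this sketch.

```r
# Hypothetical exam marks for 12 students
marks <- c(35, 42, 58, 61, 47, 73, 88, 55, 66, 49, 91, 52)

# cut() assigns each mark to one of several non-overlapping classes;
# right = FALSE makes each class include its lower bound: [20,40), [40,60), ...
classes <- cut(marks, breaks = seq(20, 100, by = 20), right = FALSE)

# table() counts the observations in each class
freq <- table(classes)
freq
```

With these marks the four classes hold 1, 6, 3, and 2 observations.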
Probability
• Probability denotes how likely an outcome of a random event is.
• It measures the extent to which an event is likely to happen.
• For example, when we flip a coin in the air, what is the probability of getting a head? The answer depends on the number of possible outcomes.
• Here there are two equally likely outcomes, head or tail, so the probability of getting a head is 1/2.
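The coin example can be checked with a quick simulation. This is only a sketch: the number of flips and the seed are arbitrary choices.

```r
set.seed(42)

# Theoretical probability: favourable outcomes / possible outcomes
p_head <- 1 / 2

# Simulate 10,000 fair coin flips and measure the share of heads
flips <- sample(c("head", "tail"), size = 10000, replace = TRUE)
observed <- mean(flips == "head")
observed   # close to 0.5, but usually not exactly 0.5
```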
Probability
distribution
• A probability distribution is a statistical function that describes
all the possible values and likelihoods that a random variable
can take within a given range.
• This range will be bounded between the minimum and
maximum possible values, but precisely where the possible
value is likely to be plotted on the probability distribution
depends on a number of factors.
• These factors include the distribution's mean (average),
standard deviation, skewness, and kurtosis.
Key takeaways
•A probability distribution depicts the expected
outcomes of possible values for a given data-generating
process.
•Probability distributions come in many shapes with
different characteristics, as defined by the mean,
standard deviation, skewness, and kurtosis.
•Investors use probability distributions to anticipate
returns on assets such as stocks over time and to hedge
their risk.
How probability
distribution works
• Perhaps the most common probability distribution is the normal
distribution, or "bell curve," although several distributions exist that
are commonly used.
• Typically, the data-generating process of some phenomenon will dictate its probability distribution. For a continuous variable, the function that describes this distribution is called the probability density function (PDF).
• Probability distributions can also be used to create cumulative
distribution functions (CDFs), which add up the probability of
occurrences cumulatively and will always start at zero and end at
100%.
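For the normal distribution, base R provides `dnorm()` for the density and `pnorm()` for the cumulative distribution function. The sketch below uses the standard normal (mean 0, standard deviation 1) as an assumed example.

```r
x <- seq(-4, 4, by = 0.1)

pdf_vals <- dnorm(x, mean = 0, sd = 1)   # density: height of the bell curve
cdf_vals <- pnorm(x, mean = 0, sd = 1)   # cumulative probability up to x

plot(x, pdf_vals, type = "l", main = "Standard normal PDF", ylab = "Density")
plot(x, cdf_vals, type = "l", main = "Standard normal CDF", ylab = "P(X <= x)")

pnorm(0)               # 0.5: half the probability lies below the mean
pnorm(1) - pnorm(-1)   # about 0.68: within one standard deviation of the mean
```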
Hypothesis
testing
Hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is also used to assess the relationship between two statistical variables.
Here are a few real-life examples of statistical hypotheses:
• A teacher assumes that 60% of his college's students come from lower-middle-class families.
• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.
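The teacher's 60% assumption could be tested with an exact binomial test, for instance. The survey numbers below (52 of 100 students) are invented for this sketch.

```r
# Hypothetical sample: 100 students surveyed, 52 from lower-middle-class families
successes <- 52
n <- 100

# H0: the true proportion is 0.60; H1: it differs from 0.60
result <- binom.test(successes, n, p = 0.60, alternative = "two.sided")

result$estimate   # observed proportion in the sample
result$p.value    # probability of data at least this extreme if H0 is true
result$conf.int   # 95% confidence interval for the true proportion
```

A small p-value (commonly below 0.05) would be evidence against the teacher's assumption. A larger one means the sample is compatible with it.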
Now that you know about hypothesis testing, look at the
two types of hypothesis testing in statistics.
Statistical tests
Statistical tests are used in hypothesis testing. They can
be used to:
•Determine whether a predictor variable has a statistically
significant relationship with an outcome variable.
•Estimate the difference between two or more groups.
• Statistical tests assume a null hypothesis of no
relationship or no difference between groups.
• Then they determine whether the observed data fall
outside of the range of values predicted by the null
hypothesis.
• If you already know what types of variables you’re
dealing with, you can use the flowchart to choose the
right statistical test for your data.
Types of statistical
tests
• Z-Test
• T-Test
• Paired T-Test
• Independent T-Test
• One-sample T-Test
• ANOVA Test
• Non-parametric statistical test
• Chi-Square test
Z-Test
• A z-test is a statistical test used to
determine whether two population
means are different when the
variances are known and the sample
size is large.
• A z-test compares means. The parameters used are the population mean and the population standard deviation.
• A z-test is used to validate the hypothesis that the sample drawn belongs to the same population.
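Base R does not ship a dedicated z-test function, so a one-sample z-test can be sketched by hand with `pnorm()`. The population values and the sample summary below are assumptions for illustration.

```r
# Hypothetical setting: known population mean and standard deviation
mu0   <- 100
sigma <- 15

# Hypothetical sample summary: 50 observations with a mean of 104
n    <- 50
xbar <- 104

# z statistic: distance of the sample mean from mu0 in standard-error units
z <- (xbar - mu0) / (sigma / sqrt(n))

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

z         # about 1.89
p_value   # about 0.059
```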
T-Test
• In a t-test, the means of two given samples are compared.
• A t-test is used when the population
parameters (mean and standard deviation)
are not known.
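The simplest case is the one-sample t-test, which compares one sample mean with a reference value. A minimal sketch with `t.test()`, using made-up package weights:

```r
# Hypothetical sample of 10 package weights in grams
weights <- c(498, 502, 505, 497, 501, 499, 503, 500, 496, 504)

# H0: the true mean weight is 500 g; the population sd is unknown, so use a t-test
res <- t.test(weights, mu = 500)

res$statistic   # t value
res$parameter   # degrees of freedom (n - 1)
res$p.value
```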
Paired T-Test
• Tests for the difference between two related measurements taken on the same subjects (for example, pre- and post-test scores).
• For example, in a training program, the performance score of each trainee before and after completing the program.
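The training example might be analysed as follows. The `before` and `after` scores are hypothetical.

```r
# Hypothetical performance scores for 8 trainees
before <- c(62, 70, 58, 65, 73, 60, 68, 66)
after  <- c(68, 74, 63, 66, 79, 65, 70, 72)

# paired = TRUE tests whether the mean of the per-trainee differences is zero
res <- t.test(after, before, paired = TRUE)

res$estimate    # mean improvement per trainee
res$parameter   # degrees of freedom: number of pairs - 1
res$p.value
```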
Independent T-Test
• The independent t-test, also called the two-sample t-test or Student's t-test, is a statistical test that determines whether there is a statistically significant difference between the means of two unrelated groups.
• For example, comparing the mean of some measurement between boys and girls in a population.
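A sketch of the two-group comparison with `t.test()`. The scores are invented. Note that R runs Welch's variant by default, and the classic Student's version needs `var.equal = TRUE`.

```r
# Hypothetical test scores for two unrelated groups
boys  <- c(72, 65, 80, 68, 75, 70, 77, 66)
girls <- c(78, 74, 82, 69, 80, 76, 85, 73)

# Welch's t-test (R's default) does not assume equal variances
welch <- t.test(boys, girls)

# Student's two-sample t-test assumes equal variances in both groups
student <- t.test(boys, girls, var.equal = TRUE)

welch$p.value
student$parameter   # degrees of freedom: n1 + n2 - 2 = 14
```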
ANOVA Test
• Analysis of variance (ANOVA) is a
statistical technique that is used to
check if the means of two or more
groups are significantly different
from each other.
• ANOVA checks the impact of one
or more factors by comparing the
means of different samples.
• Running repeated t-tests instead of ANOVA when there are more than two groups is unreliable, because every additional pairwise comparison increases the chance of a false positive (Type I error).
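A one-way ANOVA could be run in base R with `aov()`. The yields and the three fertilizer groups are hypothetical.

```r
# Hypothetical crop yields under three fertilizers (5 plots each)
yield <- c(20, 22, 19, 24, 25,
           28, 30, 27, 26, 29,
           18, 20, 17, 21, 19)
fertilizer <- factor(rep(c("A", "B", "C"), each = 5))

# One-way ANOVA: does the mean yield differ between the fertilizer groups?
fit <- aov(yield ~ fertilizer)
summary(fit)   # F statistic and p-value appear in the 'fertilizer' row

# If the F test is significant, Tukey's HSD shows which pairs of groups differ
TukeyHSD(fit)
```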
Non-Parametric
statistical test
Non-parametric tests are used when the data is not normally distributed. Non-parametric tests include the chi-square test.
Chi-square Test
• The chi-square test is used to compare two categorical variables.
• Calculating the chi-square statistic and comparing it against a critical value from the chi-square distribution lets us assess whether the observed frequencies are significantly different from the expected frequencies.
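The comparison against a critical value can be sketched with `chisq.test()` and `qchisq()`. The 2x2 table of counts below is made up for the example.

```r
# Hypothetical counts: age group by preferred product
observed <- matrix(c(30, 10,
                     20, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(age = c("young", "old"),
                                   product = c("A", "B")))

# H0: the two categorical variables are independent
res <- chisq.test(observed)   # applies Yates' continuity correction for 2x2 tables

res$expected    # frequencies expected under independence
res$statistic   # chi-square statistic
res$p.value

# Compare the statistic against the critical value at the 5% level
critical <- qchisq(0.95, df = res$parameter)
res$statistic > critical
```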