Unit – 4
Statistics With R
Random Forest:
The Random Forest algorithm is a robust ensemble learning method primarily
used for classification and regression tasks in data mining.
At its core, the Random Forest is constructed by combining multiple decision
trees into a cohesive, forest-like structure. These individual decision trees are
built using a technique called bagging (bootstrap aggregating). Here's a
breakdown of the technical aspects:
Bootstrap Aggregating (Bagging): Random Forest employs the bagging
method to create diverse subsets of the training data. It generates multiple
random samples with replacement from the original dataset. Each sample is
used to train a separate decision tree.
Decision Trees: These trees are built using a subset of features randomly
chosen at each node of the tree. This randomness introduces diversity among
the trees, preventing them from being too correlated.
Voting Mechanism: During the prediction phase, each tree in the forest
independently predicts the outcome based on its specific subset of data and
features. For classification tasks, the final prediction of the Random Forest is
the mode (most frequent prediction) of the individual trees; for regression
tasks, it is the average of their predictions.
Feature Importance: Random Forest also provides a measure of feature
importance. By assessing how much each feature contributes to decreasing the
impurity or increasing the information gain across all trees in the forest, it helps
in understanding which features are most influential in the model's predictions.
Ensemble Strength: The strength of the Random Forest lies in its ability to
reduce overfitting. By aggregating predictions from multiple trees and
considering their collective decision through voting or averaging, it generally
improves generalization and handles noise or outliers better than individual
decision trees.
Tuning Parameters: Parameters like the number of trees in the forest, the
depth of individual trees, and the number of features considered at each split are
essential to tune for optimal performance.
The Random Forest algorithm operates by creating an ensemble of diverse decision
trees, each trained on random subsets of the data and features. The collective
wisdom of these trees results in robust predictions, reduced overfitting, and
increased accuracy, making it a powerful and widely used technique in data
mining and machine learning.
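A minimal sketch in R of how such a forest can be fitted, assuming the randomForest add-on package is installed; the built-in iris data set and the parameter values (ntree, mtry) are illustrative choices, not prescribed ones:

# Install once if needed: install.packages("randomForest")
library(randomForest)

set.seed(42)                                  # reproducible bootstrap samples
rf_model <- randomForest(Species ~ .,         # predict Species from all other columns
                         data = iris,
                         ntree = 500,         # number of trees in the forest
                         mtry = 2,            # features considered at each split
                         importance = TRUE)   # record feature importance

print(rf_model)                          # out-of-bag error and confusion matrix
importance(rf_model)                     # which features are most influential
predict(rf_model, newdata = head(iris))  # majority-vote predictions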
Normal Distribution:
In a normal distribution, data is symmetrically distributed with no skew. When
plotted on a graph, the data follows a bell shape, with most values clustering
around a central region and tapering off as they go further away from the centre.
Normal distributions are also called Gaussian distributions or bell curves
because of their shape.
When plotted, the y-axis shows the frequency, i.e., the number of observations
in the data set, and the x-axis shows the values of the numerical attribute.
Normal distributions have key characteristics that are easy to spot in graphs:
The mean, median and mode are exactly the same.
The distribution is symmetric about the mean—half the values fall below
the mean and half above the mean.
The distribution can be described by two values: the mean and
the standard deviation.
The mean determines where the peak of the curve is centered. Increasing the
mean moves the curve right, while decreasing it moves the curve left.
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values
lie in a normal distribution:
Around 68% of values are within 1 standard deviation from the mean.
Around 95% of values are within 2 standard deviations from the mean.
Around 99.7% of values are within 3 standard deviations from the mean.
You collect SAT scores from students in a new test preparation course. The data
follows a normal distribution with a mean score (M) of 1150 and a standard
deviation (SD) of 150.
Following the empirical rule:
Around 68% of scores are between 1,000 and 1,300, 1 standard deviation
above and below the mean.
Around 95% of scores are between 850 and 1,450, 2 standard deviations
above and below the mean.
Around 99.7% of scores are between 700 and 1,600, 3 standard
deviations above and below the mean.
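The empirical rule for this example can be checked in R with pnorm(); this is a small illustrative sketch using the mean and standard deviation given above:

m <- 1150   # mean SAT score
s <- 150    # standard deviation

pnorm(m + 1*s, mean = m, sd = s) - pnorm(m - 1*s, mean = m, sd = s)   # ~0.68  (1000 to 1300)
pnorm(m + 2*s, mean = m, sd = s) - pnorm(m - 2*s, mean = m, sd = s)   # ~0.95  (850 to 1450)
pnorm(m + 3*s, mean = m, sd = s) - pnorm(m - 3*s, mean = m, sd = s)   # ~0.997 (700 to 1600)

# A histogram of simulated scores shows the bell shape
hist(rnorm(10000, mean = m, sd = s), breaks = 50,
     main = "Simulated SAT scores", xlab = "Score")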
Binomial Distribution:
The binomial distribution is a probability distribution that describes the
outcomes of a certain experiment or situation where there are only two possible
outcomes, often referred to as success and failure.
Two Outcomes: In the context of the binomial distribution, there are only two
possible outcomes for each trial. For example, it could be a coin flip where the
outcomes are "heads" or "tails," or it could represent whether a student passes or
fails an exam.
Independent Trials: Each trial is independent, meaning that the outcome of
one trial doesn’t affect the outcome of another. For instance, in a series of coin
flips, getting heads on one flip doesn't impact the next flip's result.
Probability of Success: There is a fixed probability of success (let's call it "p")
associated with each trial. This probability remains constant for every trial. For
example, if you're flipping a fair coin, the probability of getting heads (success)
is 0.5.
The binomial distribution formula helps calculate the probability of obtaining a
certain number of successes (say, "x") in a fixed number of trials (n), given a
fixed probability of success (p):
P(X = x) = nCx * p^x * (1 - p)^(n - x)
where nCx is the number of ways of choosing x successes out of n trials.
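As a small sketch in R, the coin-flip example above can be worked out with dbinom() and pbinom(); the choice of 10 flips and 6 heads is only illustrative:

n <- 10    # number of independent coin flips
p <- 0.5   # probability of heads (success) on each flip

dbinom(6, size = n, prob = p)   # P(exactly 6 heads)
pbinom(6, size = n, prob = p)   # P(6 or fewer heads)

round(dbinom(0:n, size = n, prob = p), 4)   # full distribution over 0..10 heads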
Time Series Analysis:
Time series analysis is a statistical method used to understand, interpret, and
predict patterns or trends in data that are collected and recorded over regular
intervals of time.
Let's take a simple example of time series data: daily temperature recordings in
a city over the course of a year. We'll explore the components of time series
analysis using this data.
Date Temperature (in Celsius)
-------------------------------------
2023-01-01 10
2023-01-02 12
2023-01-03 11
... ...
2023-12-30 9
2023-12-31
Components of Time Series Analysis:
Trend: The general direction in which the data is moving over time. In our
temperature data, the trend might show whether temperatures are generally
rising, falling, or staying stable throughout the year.
Seasonality: Patterns that repeat at regular intervals. For instance, if the
temperature rises during summer and falls during winter consistently each year,
it exhibits seasonality.
Cycle: Repeating patterns that are not of fixed frequency like seasonality.
Cycles might span longer periods and don't necessarily repeat in a consistent
manner. Economic cycles (like business cycles) could be an example.
Irregularity/Noise: Random fluctuations or irregular variations that can't be
attributed to the trend, seasonality, or cycles. Sudden spikes or dips in
temperature that don’t follow a pattern might be considered as noise.
Analysis Steps for Time Series Data:
Visual Exploration: Plot the temperature data over time using line charts. This
helps in visually identifying trends, seasonality, and irregularities.
Decomposition: Break down the time series into its components—trend,
seasonality, and residual (irregularities)—using methods like moving averages,
seasonal decomposition, or mathematical models.
Trend Analysis: Use techniques like moving averages or regression to identify
the overall trend. Is the temperature generally increasing, decreasing, or
remaining constant over time?
Seasonality Identification: Analyze if there's a repeating pattern within
specific time intervals (like daily, weekly, or yearly). For instance, is there a
regular pattern of temperature change each summer?
Forecasting: Utilize forecasting methods such as ARIMA, Exponential
Smoothing, or machine learning models to predict future temperature trends
based on historical data patterns.
For instance, you might notice the daily temperature has been increasing
gradually over the years (trend) and there's a repeating pattern of higher
temperatures in the summer months (seasonality).
Time series analysis enables us to understand and utilize the underlying
patterns within the data, facilitating better decision-making and predictions
about future trends based on historical observations.
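To make these steps concrete, here is an illustrative sketch in base R; the daily temperatures are simulated values (not real recordings), and the AR(1) model at the end is only a minimal stand-in for a fuller forecasting model such as ARIMA:

set.seed(42)
days  <- 1:730                                   # two years of daily readings
temps <- 15 +                                    # baseline temperature
         10 * sin(2 * pi * days / 365) +         # yearly seasonal cycle
         0.002 * days +                          # slow upward trend
         rnorm(length(days), sd = 2)             # irregular noise

temp_ts <- ts(temps, frequency = 365)            # one period = one year

plot(temp_ts)                                    # visual exploration
parts <- decompose(temp_ts)                      # trend + seasonal + random components
plot(parts)

fit <- arima(temp_ts, order = c(1, 0, 0))        # simple AR(1) model
predict(fit, n.ahead = 7)                        # forecast the next 7 days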
Linear Regression and Multiple Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence it is called
linear regression. Since linear regression shows a linear relationship, it finds
how the value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables.
y = a0 + a1x + ε
y = dependent variable
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The observed values of the x and y variables form the training dataset used to
fit the Linear Regression model.
Types of Linear Regression:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
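Both variants can be fitted in R with lm(); this sketch uses the built-in mtcars data purely for illustration:

# Simple linear regression: predict mpg from car weight
simple_fit <- lm(mpg ~ wt, data = mtcars)
summary(simple_fit)      # a0 (intercept), a1 (slope), and fit statistics

# Multiple linear regression: predict mpg from weight and horsepower
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(multi_fit)

# Predictions for new (hypothetical) cars
predict(multi_fit, newdata = data.frame(wt = c(2.5, 3.5), hp = c(100, 150)))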
Logistic Regression:
Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting
the categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value such as Yes or No,
0 or 1, True or False, etc. However, instead of giving the exact values 0 and 1,
it gives probabilistic values which lie between 0 and 1.
Logistic Regression is quite similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
In logistic regression, instead of fitting a straight regression line, we fit an
"S"-shaped logistic function, whose output is bounded between the two limiting
values 0 and 1.
The curve from the logistic function indicates the likelihood of something such
as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and
discrete datasets.
Logistic Regression can be used to classify observations using different types
of data and can easily determine the most effective variables for the
classification. The logistic (sigmoid) function that produces this S-shaped
curve is described below.
Logistic Regression Equation:
The output of logistic regression must lie between 0 and 1 and cannot go beyond
this limit, so it forms a curve like the letter "S". This S-shaped curve is
called the sigmoid function or the logistic function:
p = 1 / (1 + e^-(a0 + a1x))
In logistic regression, we use the concept of a threshold value, which separates
the two classes: values above the threshold tend towards 1, and values below the
threshold tend towards 0.
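A minimal sketch of logistic regression in R with glm(), again using the built-in mtcars data as an illustrative stand-in; the 0.5 threshold is an assumption, not a fixed rule:

# Predict transmission type (am: 0 = automatic, 1 = manual) from weight and horsepower
logit_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_fit)

probs <- predict(logit_fit, type = "response")   # sigmoid output between 0 and 1

preds <- ifelse(probs > 0.5, 1, 0)               # apply the threshold value
table(predicted = preds, actual = mtcars$am)     # simple confusion matrix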
Survival Analysis:
Survival analysis is a statistical technique used to analyze time-to-event data. It's
commonly employed in medical research, social sciences, engineering, and many
other fields where the focus is on the time it takes for an event to occur. This
method allows researchers to estimate the probability of an event happening over
time.
The primary concept in survival analysis is the "survival function," which
represents the probability that an event has not occurred by a certain time. The
event of interest could be anything, such as death, disease recurrence, equipment
failure, etc.
Survival analysis helps in estimating survival probabilities, comparing survival
curves between groups, identifying risk factors influencing time-to-event, and
predicting future events based on observed data.
The Kaplan-Meier curve is a graphical representation of survival probabilities in
survival analysis. It's a non-parametric method used to estimate the survival
function from time-to-event data, particularly in the presence of censored
observations.
The Kaplan-Meier curve illustrates the probability that the event of interest
(e.g., death, equipment failure) has not yet occurred, i.e., the survival
probability, over time.
It starts at time zero with all individuals "alive" or not having experienced the
event.
At each event time, it calculates the probability of survival, considering
whether an event occurred or whether the observation was censored at that time.
These probabilities are multiplied together, providing a stepwise estimate of
the survival function.
The x-axis represents time, typically in intervals or specific time points.
The y-axis represents the estimated survival probability.
The curve displays a step-like pattern that decreases as time progresses, indicating
the declining probability of surviving without experiencing the event.
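A short illustrative sketch of a Kaplan-Meier estimate in R, assuming the survival package is available; it uses the package's built-in lung data set, where time is survival time in days and status records whether the event (death) was observed or censored:

library(survival)

# Kaplan-Meier estimate of the survival function, split by sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km_fit, times = c(90, 180, 365))   # survival probabilities at chosen times

# Step-shaped Kaplan-Meier curves: time on the x-axis, survival probability on y
plot(km_fit, xlab = "Days", ylab = "Estimated survival probability",
     col = c("blue", "red"))
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "red"), lty = 1)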