Unit – 4
Statistics With R
Random Forest:
The Random Forest algorithm is a robust ensemble learning method primarily
used for classification and regression tasks in data mining.
At its core, the Random Forest is constructed by combining multiple decision
trees into a cohesive, forest-like structure. These individual decision trees are
built using a technique called bagging (bootstrap aggregating). Here's a
breakdown of the technical aspects:
Bootstrap Aggregating (Bagging): Random Forest employs the bagging
method to create diverse subsets of the training data. It generates multiple
random samples with replacement from the original dataset. Each sample is
used to train a separate decision tree.
Decision Trees: These trees are built using a subset of features randomly
chosen at each node of the tree. This randomness introduces diversity among
the trees, preventing them from being too correlated.
Voting Mechanism: During the prediction phase, each tree in the forest
independently predicts the outcome based on its specific subset of data and
features. For classification tasks, the final prediction of the Random Forest is
the mode (most frequent prediction) of the individual trees; for regression
tasks, it is the average of their predictions.
Feature Importance: Random Forest also provides a measure of feature
importance. By assessing how much each feature contributes to decreasing the
impurity or increasing the information gain across all trees in the forest, it helps
in understanding which features are most influential in the model's predictions.
Ensemble Strength: The strength of the Random Forest lies in its ability to
reduce overfitting. By aggregating predictions from multiple trees and
considering their collective decision through voting or averaging, it generally
improves generalization and handles noise or outliers better than individual
decision trees.
Tuning Parameters: Parameters like the number of trees in the forest, the
depth of individual trees, and the number of features considered at each split are
essential to tune for optimal performance.
The Random Forest algorithm operates by creating an ensemble of diverse decision
trees, each trained on random subsets of the data and features. The collective
wisdom of these trees results in robust predictions, reduced overfitting, and
increased accuracy, making it a powerful and widely used technique in data
mining and machine learning.
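A minimal sketch in R of how such a forest can be fitted, assuming the randomForest add-on package is installed; the built-in iris data set and the parameter values (ntree, mtry) are illustrative choices, not prescribed ones:

# Install once if needed: install.packages("randomForest")
library(randomForest)

set.seed(42)                                  # reproducible bootstrap samples
rf_model <- randomForest(Species ~ .,         # predict Species from all other columns
                         data = iris,
                         ntree = 500,         # number of trees in the forest
                         mtry = 2,            # features considered at each split
                         importance = TRUE)   # record feature importance

print(rf_model)                          # out-of-bag error and confusion matrix
importance(rf_model)                     # which features are most influential
predict(rf_model, newdata = head(iris))  # majority-vote predictions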
Normal Distribution:
In a normal distribution, data is symmetrically distributed with no skew. When
plotted on a graph, the data follows a bell shape, with most values clustering
around a central region and tapering off as they go further away from the centre.
Normal distributions are also called Gaussian distributions or bell curves
because of their shape.
When plotted, the y-axis shows the frequency, i.e., the number of observations
in the data set, and the x-axis shows the values of the numerical attribute.
Normal distributions have key characteristics that are easy to spot in graphs:
The mean, median and mode are exactly the same.
The distribution is symmetric about the mean—half the values fall below
the mean and half above the mean.
The distribution can be described by two values: the mean and
the standard deviation.
The mean determines where the peak of the curve is centered. Increasing the
mean moves the curve right, while decreasing it moves the curve left.
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values
lie in a normal distribution:
Around 68% of values are within 1 standard deviation from the mean.
Around 95% of values are within 2 standard deviations from the mean.
Around 99.7% of values are within 3 standard deviations from the mean.
You collect SAT scores from students in a new test preparation course. The data
follows a normal distribution with a mean score (M) of 1150 and a standard
deviation (SD) of 150.
Following the empirical rule:
Around 68% of scores are between 1,000 and 1,300, 1 standard deviation
above and below the mean.
Around 95% of scores are between 850 and 1,450, 2 standard deviations
above and below the mean.
Around 99.7% of scores are between 700 and 1,600, 3 standard
deviations above and below the mean.
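The empirical rule for this example can be checked in R with pnorm(); this is a small illustrative sketch using the mean and standard deviation given above:

m <- 1150   # mean SAT score
s <- 150    # standard deviation

pnorm(m + 1*s, mean = m, sd = s) - pnorm(m - 1*s, mean = m, sd = s)   # ~0.68  (1000 to 1300)
pnorm(m + 2*s, mean = m, sd = s) - pnorm(m - 2*s, mean = m, sd = s)   # ~0.95  (850 to 1450)
pnorm(m + 3*s, mean = m, sd = s) - pnorm(m - 3*s, mean = m, sd = s)   # ~0.997 (700 to 1600)

# A histogram of simulated scores shows the bell shape
hist(rnorm(10000, mean = m, sd = s), breaks = 50,
     main = "Simulated SAT scores", xlab = "Score")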
Binomial Distribution:
The binomial distribution is a probability distribution that describes the
outcomes of a certain experiment or situation where there are only two possible
outcomes, often referred to as success and failure.
Two Outcomes: In the context of the binomial distribution, there are only two
possible outcomes for each trial. For example, it could be a coin flip where the
outcomes are "heads" or "tails," or it could represent whether a student passes or
fails an exam.
Independent Trials: Each trial is independent, meaning that the outcome of
one trial doesn’t affect the outcome of another. For instance, in a series of coin
flips, getting heads on one flip doesn't impact the next flip's result.
Probability of Success: There is a fixed probability of success (let's call it "p")
associated with each trial. This probability remains constant for every trial. For
example, if you're flipping a fair coin, the probability of getting heads (success)
is 0.5.
The binomial distribution formula helps calculate the probability of obtaining a
certain number of successes (say, "x") in a fixed number of trials (n), given a
fixed probability of success (p):
P(X = x) = nCx * p^x * (1 - p)^(n - x)
where nCx is the number of ways of choosing x successes out of n trials.
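As a small sketch in R, the coin-flip example above can be worked out with dbinom() and pbinom(); the choice of 10 flips and 6 heads is only illustrative:

n <- 10    # number of independent coin flips
p <- 0.5   # probability of heads (success) on each flip

dbinom(6, size = n, prob = p)   # P(exactly 6 heads)
pbinom(6, size = n, prob = p)   # P(6 or fewer heads)

round(dbinom(0:n, size = n, prob = p), 4)   # full distribution over 0..10 heads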
Time Series Analysis:
Time series analysis is a statistical method used to understand, interpret, and
predict patterns or trends in data that are collected and recorded over regular
intervals of time.
Let's take a simple example of time series data: daily temperature recordings in
a city over the course of a year. We'll explore the components of time series
analysis using this data.
Date Temperature (in Celsius)
-------------------------------------
2023-01-01 10
2023-01-02 12
2023-01-03 11
... ...
2023-12-30 9
2023-12-31
Components of Time Series Analysis:
Trend: The general direction in which the data is moving over time. In our
temperature data, the trend might show whether temperatures are generally
rising, falling, or staying stable throughout the year.
Seasonality: Patterns that repeat at regular intervals. For instance, if the
temperature rises during summer and falls during winter consistently each year,
it exhibits seasonality.
Cycle: Repeating patterns that are not of fixed frequency like seasonality.
Cycles might span longer periods and don't necessarily repeat in a consistent
manner. Economic cycles (like business cycles) could be an example.
Irregularity/Noise: Random fluctuations or irregular variations that can't be
attributed to the trend, seasonality, or cycles. Sudden spikes or dips in
temperature that don’t follow a pattern might be considered as noise.
Analysis Steps for Time Series Data:
Visual Exploration: Plot the temperature data over time using line charts. This
helps in visually identifying trends, seasonality, and irregularities.
Decomposition: Break down the time series into its components—trend,
seasonality, and residual (irregularities)—using methods like moving averages,
seasonal decomposition, or mathematical models.
Trend Analysis: Use techniques like moving averages or regression to identify
the overall trend. Is the temperature generally increasing, decreasing, or
remaining constant over time?
Seasonality Identification: Analyze if there's a repeating pattern within
specific time intervals (like daily, weekly, or yearly). For instance, is there a
regular pattern of temperature change each summer?
Forecasting: Utilize forecasting methods such as ARIMA, Exponential
Smoothing, or machine learning models to predict future temperature trends
based on historical data patterns.
For instance, you might notice the daily temperature has been increasing
gradually over the years (trend) and there's a repeating pattern of higher
temperatures in the summer months (seasonality).
Time series analysis enables us to understand and utilize the underlying
patterns within the data, facilitating better decision-making and predictions
about future trends based on historical observations.
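To make these steps concrete, here is an illustrative sketch in base R; the daily temperatures are simulated values (not real recordings), and the AR(1) model at the end is only a minimal stand-in for a fuller forecasting model such as ARIMA:

set.seed(42)
days  <- 1:730                                   # two years of daily readings
temps <- 15 +                                    # baseline temperature
         10 * sin(2 * pi * days / 365) +         # yearly seasonal cycle
         0.002 * days +                          # slow upward trend
         rnorm(length(days), sd = 2)             # irregular noise

temp_ts <- ts(temps, frequency = 365)            # one period = one year

plot(temp_ts)                                    # visual exploration
parts <- decompose(temp_ts)                      # trend + seasonal + random components
plot(parts)

fit <- arima(temp_ts, order = c(1, 0, 0))        # simple AR(1) model
predict(fit, n.ahead = 7)                        # forecast the next 7 days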
Linear Regression and Multiple Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence it is called
linear regression. Since linear regression shows a linear relationship, it finds
how the value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables.
y = a0 + a1x + ε
y = dependent variable
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The observed values of the x and y variables form the training dataset used to
fit the Linear Regression model.
Types of Linear Regression:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
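Both variants can be fitted in R with lm(); this sketch uses the built-in mtcars data purely for illustration:

# Simple linear regression: predict mpg from car weight
simple_fit <- lm(mpg ~ wt, data = mtcars)
summary(simple_fit)      # a0 (intercept), a1 (slope), and fit statistics

# Multiple linear regression: predict mpg from weight and horsepower
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(multi_fit)

# Predictions for new (hypothetical) cars
predict(multi_fit, newdata = data.frame(wt = c(2.5, 3.5), hp = c(100, 150)))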
Logistic Regression:
Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for predicting
the categorical dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value such as Yes or No,
0 or 1, True or False, etc. However, instead of giving the exact values 0 and 1,
it gives probabilistic values which lie between 0 and 1.
Logistic Regression is quite similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
In logistic regression, instead of fitting a straight regression line, we fit an
"S"-shaped logistic function, whose output is bounded between the two limiting
values 0 and 1.
The curve from the logistic function indicates the likelihood of something such
as whether the cells are cancerous or not, a mouse is obese or not based on its
weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and
discrete datasets.
Logistic Regression can be used to classify observations using different types
of data and can easily determine the most effective variables for the
classification. The logistic (sigmoid) function that produces this S-shaped
curve is described below.
Logistic Regression Equation:
The output of logistic regression must lie between 0 and 1 and cannot go beyond
this limit, so it forms a curve like the letter "S". This S-shaped curve is
called the sigmoid function or the logistic function:
p = 1 / (1 + e^-(a0 + a1x))
In logistic regression, we use the concept of a threshold value, which separates
the two classes: values above the threshold tend towards 1, and values below the
threshold tend towards 0.
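A minimal sketch of logistic regression in R with glm(), again using the built-in mtcars data as an illustrative stand-in; the 0.5 threshold is an assumption, not a fixed rule:

# Predict transmission type (am: 0 = automatic, 1 = manual) from weight and horsepower
logit_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_fit)

probs <- predict(logit_fit, type = "response")   # sigmoid output between 0 and 1

preds <- ifelse(probs > 0.5, 1, 0)               # apply the threshold value
table(predicted = preds, actual = mtcars$am)     # simple confusion matrix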
Survival Analysis:
Survival analysis is a statistical technique used to analyze time-to-event data. It's
commonly employed in medical research, social sciences, engineering, and many
other fields where the focus is on the time it takes for an event to occur. This
method allows researchers to estimate the probability of an event happening over
time.
The primary concept in survival analysis is the "survival function," which
represents the probability that an event has not occurred by a certain time. The
event of interest could be anything, such as death, disease recurrence, equipment
failure, etc.
Survival analysis helps in estimating survival probabilities, comparing survival
curves between groups, identifying risk factors influencing time-to-event, and
predicting future events based on observed data.
The Kaplan-Meier curve is a graphical representation of survival probabilities in
survival analysis. It's a non-parametric method used to estimate the survival
function from time-to-event data, particularly in the presence of censored
observations.
The Kaplan-Meier curve illustrates the probability that the event of interest
(e.g., death, equipment failure) has not yet occurred, i.e., the survival
probability, over time.
It starts at time zero with all individuals "alive" or not having experienced the
event.
At each event time, it calculates the probability of survival, considering
whether an event occurred or whether the observation was censored at that time.
These probabilities are multiplied together, providing a stepwise estimate of
the survival function.
The x-axis represents time, typically in intervals or specific time points.
The y-axis represents the estimated survival probability.
The curve displays a step-like pattern that decreases as time progresses, indicating
the declining probability of surviving without experiencing the event.
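A short illustrative sketch of a Kaplan-Meier estimate in R, assuming the survival package is available; it uses the package's built-in lung data set, where time is survival time in days and status records whether the event (death) was observed or censored:

library(survival)

# Kaplan-Meier estimate of the survival function, split by sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)
summary(km_fit, times = c(90, 180, 365))   # survival probabilities at chosen times

# Step-shaped Kaplan-Meier curves: time on the x-axis, survival probability on y
plot(km_fit, xlab = "Days", ylab = "Estimated survival probability",
     col = c("blue", "red"))
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "red"), lty = 1)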