Notes on Random Forest
What is Random Forest?
Random Forest is a powerful and versatile supervised machine learning
algorithm that is used for both classification and regression. As an "ensemble
learning" method, it operates by constructing a multitude of decision trees
during training. For classification tasks, the output is the class chosen by
most trees, while for regression, it is the mean prediction of the individual
trees. The name "Random Forest" comes from its use of a collection of
decision trees, each grown with a degree of randomness.
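As a quick, hedged illustration, the sketch below fits one classifier and one
regressor with scikit-learn (assuming scikit-learn is available); the synthetic
datasets and hyperparameters are arbitrary choices for demonstration only.

    # Minimal sketch with scikit-learn (assumed installed); data and
    # hyperparameters are arbitrary and chosen only for illustration.
    from sklearn.datasets import make_classification, make_regression
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classification: the forest predicts the class chosen by most trees.
    X_clf, y_clf = make_classification(n_samples=500, n_features=10, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_clf, y_clf)
    print("Predicted classes:", clf.predict(X_clf[:5]))

    # Regression: the forest predicts the mean of the individual trees' outputs.
    X_reg, y_reg = make_regression(n_samples=500, n_features=10, random_state=0)
    reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_reg, y_reg)
    print("Predicted values:", reg.predict(X_reg[:5]))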
How it Works
The algorithm's power comes from reducing overfitting and improving
predictive accuracy by combining the predictions of many decision trees,
each of which is a weaker model on its own. The key steps are as follows
(a from-scratch sketch of these steps follows the list):
1. Bootstrapping: For each individual tree, the algorithm draws a random
sample of the training data with replacement, typically the same size as
the original dataset. This resampling is the "bootstrap" part of bagging
(bootstrap aggregating) and ensures that each tree is trained on a
slightly different dataset.
2. Feature Randomness: When building each tree, instead of considering
all features for the best split, the algorithm only considers a random
subset of features at each node (a common default is the square root of
the total number of features for classification). This further
decorrelates the trees, making the ensemble more robust.
3. Building the Forest: These two randomization techniques—
bootstrapping the data and randomizing the features—ensure that
each tree in the forest is unique and not simply a copy of the others.
4. Prediction: To make a final prediction for a new data point, each tree
in the forest makes its own prediction.
   - For Classification: The final prediction is determined by a majority
     vote among all the trees.
   - For Regression: The final prediction is the average of the predictions
     from all the trees.
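To make these four steps concrete, here is a hedged from-scratch sketch that
combines bootstrapping, per-split feature randomness, and majority voting on
top of scikit-learn's DecisionTreeClassifier. The class name TinyRandomForest
and its parameters are invented for illustration; this is not the scikit-learn
implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class TinyRandomForest:
        """Illustrative ensemble sketch; assumes integer class labels."""

        def __init__(self, n_trees=25, random_state=0):
            self.n_trees = n_trees
            self.rng = np.random.default_rng(random_state)
            self.trees = []

        def fit(self, X, y):
            n_samples = X.shape[0]
            for _ in range(self.n_trees):
                # Step 1: bootstrap sample (same size as the data, drawn with replacement).
                idx = self.rng.integers(0, n_samples, size=n_samples)
                # Step 2: feature randomness via max_features="sqrt", so each split
                # considers only a random subset of the features.
                tree = DecisionTreeClassifier(
                    max_features="sqrt",
                    random_state=int(self.rng.integers(1_000_000)),
                )
                # Step 3: every tree sees a different sample, so the trees differ.
                tree.fit(X[idx], y[idx])
                self.trees.append(tree)
            return self

        def predict(self, X):
            # Step 4: majority vote across trees (for regression, average instead).
            votes = np.stack([tree.predict(X) for tree in self.trees])
            return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

For regression, the same structure applies with DecisionTreeRegressor and a
mean over the trees' predictions instead of a vote.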
Key Concepts
Ensemble Learning: The general method of combining multiple
individual models to obtain a single, more robust, and accurate
prediction.
Bagging (Bootstrap Aggregating): A technique that trains multiple
models on different bootstrap samples of the training data and then
aggregates their outputs. This reduces the variance of the ensemble's
predictions.
Feature Importance: Random Forest can be used to rank the
importance of each feature in the prediction process. This is done by
measuring how much each feature contributes to the reduction of
impurity (e.g., Gini impurity or entropy) across all trees.
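As an example, scikit-learn exposes these impurity-based scores through a
fitted forest's feature_importances_ attribute; the sketch below ranks the
features of an arbitrary synthetic dataset by that score.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data chosen only for illustration.
    X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                               random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    # feature_importances_ holds each feature's mean impurity reduction across
    # all trees, normalized to sum to 1.
    ranking = np.argsort(forest.feature_importances_)[::-1]
    for i in ranking:
        print(f"feature {i}: importance {forest.feature_importances_[i]:.3f}")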
Strengths and Weaknesses
Strengths:
High Accuracy: Random Forests typically achieve higher accuracy than a
single decision tree.
Robustness to Overfitting: The averaging of multiple trees reduces
the risk of overfitting, which is a major weakness of individual decision
trees.
Handles Large Datasets: It can work with a large number of features
and data points.
No Feature Scaling Required: Like decision trees, Random Forests
do not require features to be scaled.
Weaknesses:
Less Interpretable: While individual decision trees are easy to
interpret, the combined result of a Random Forest is less transparent,
making it a "black box" model.
Computationally Expensive: Training many trees can be
computationally intensive and slower than simpler algorithms.
Memory Intensive: Storing multiple decision trees requires more
memory than a single tree.
Use Cases
Finance: Predicting stock prices and detecting fraudulent transactions.
Healthcare: Disease diagnosis and predicting patient risk.
E-commerce: Recommendation engines and customer segmentation.