Lecture 18 | Evaluation Metrics for Different Models
Our previous topic was Linear Regression. Linear Regression is used to
predict a continuous number from nominal or numeric input values. The
number we predict can be any number: positive or negative, high or low.
A graphical representation of linear regression was also shown. It
consists of a plot on a two-dimensional plane, the data points, and a
line that should have most of the data points close to it. To measure
how well the line fits, we also discussed the Mean Squared Error (MSE)
and, to undo the effect of squaring, the Root Mean Squared Error
(RMSE).
In Colab we trained a model on a data set. An issue arose during
its execution: the resulting Mean Squared Error was very
high. Today's topic starts by finding the reasons for
this high value of Mean Squared Error.
We will first find the reasons for the high value of Mean Squared
Error and then the actions that reduce it. Remember that a small
amount of loss will remain even after we make every effort to
reduce it. We can apply a number of
Prepared by: Tahir Mahmood 0321-5111997, What up 0312-8536439
regression techniques to train the model, so how do we find which
model performs better? It is the Mean Squared Error that helps us
select the better model: the model with the lower Mean Squared Error
performs better.
The line in linear regression that we discussed lies in two dimensions.
In three-dimensional linear regression the line becomes a flat surface,
like a whiteboard, which is called a plane in mathematics. Linear
regression in more than three dimensions cannot be visualized by the
human brain. Even if there are n independent variables and one
dependent variable, linear regression is still possible, and we get
a whiteboard-like structure (a hyperplane) extending in n
directions.
One thing we can do to reduce the Mean Squared Error is to bring the
different attributes of the data onto a comparable scale. As we saw
while training the model, the values of some attributes (features,
columns) were very high and the values of others were very low. With
such values, the model tends to focus more on the high values. It
means that the model learns less from attributes with small values
between 0 and 1. If the model had given due importance to these small
values too, it could have performed better.
What is the cure for this problem? When we discussed data
preprocessing we mentioned data transformation and data reduction. We
applied data reduction rigorously to the Titanic data set. By applying
correlation, we separated the columns that were suitable for model
training from those that were not. So, dimensionality reduction in
data preprocessing means that we drop the unnecessary columns from
the input data set, enabling our model to use only useful data for
its learning.
Now we put data transformation into practice. As discussed earlier,
our data set contains some high values (in the thousands) and some low
values (from 0 to 1). Due to this huge difference, our model does not
give due weight to the low values between 0 and 1. So we need to
transform the data in such a way that the model gives equal
importance to all attributes of the data set. When the model gives
equal attention to all attributes, it has more opportunity to
learn. Data transformation is therefore used to put the features of
the data set on an equal footing.
Suppose we have to measure the performance of players in a team. We
have data about the weight of the players (60 kg, 65 kg, 55 kg, 50 kg)
and their height (163 cm, 165 cm, 173 cm, etc.). The model trains only
on the numeric values and pays no attention to units. The model will
give more weight to
high values and ignore or pay less attention to low values. This
affects the model badly. To rectify this, we limit the values
to a range and have the model use only values within the specified
range.
Let us discuss a few methods of data transformation.
1. Z-score standardization
The z-score represents the number of standard deviations a value
lies away from the mean. We use the z-score to ensure our feature
distributions have mean = 0 and std = 1. It is useful when there
are a few outliers, but not so extreme that we need clipping.
2. Min-max scaling
Also known as rescaling or min-max normalization, this is the
simplest method and consists of rescaling the range of each
feature to [0, 1] or [−1, 1]. The choice of target range depends
on the nature of the data.
Now the question arises whether scaling or standardization should
be applied only to high-value features or to all features in the
data set, and whether the target variable should also be part of
that scaling or standardization.
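The two transformations above can be sketched in a few lines of NumPy. This is only an illustration: the weights reuse the player example, the fourth height value (170 cm) and the helper names z_score and min_max are assumptions made for the sketch. np.std uses the population standard deviation by default, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Player data from the example above; the last height (170) is an assumed value.
weight_kg = np.array([60.0, 65.0, 55.0, 50.0])
height_cm = np.array([163.0, 165.0, 173.0, 170.0])


def z_score(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    return (values - values.mean()) / values.std()


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    return (values - values.min()) / (values.max() - values.min())


weight_z = z_score(weight_kg)
height_mm = min_max(height_cm)

print("z-scores of weight:", weight_z)
print("min-max scaled height:", height_mm)
```

After either transformation, weight and height live on comparable scales, so neither column dominates simply because its raw numbers are larger.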
We now discuss different regression algorithms.
Suppose we have to predict the score of a batsman from a data set. We
divide the data set by score range and distribute the parts to
different groups of people to predict. Suppose rows with scores from
1-10 are given to a group of 25 people to predict, rows with scores
from 11-50 to another group, and so on. We define the range of scores
that a particular group of people will predict.
The case of the Decision Tree Regressor is similar. Each branch of the
tree predicts a different set of values. Each branch has a range of
values from which the resultant value is predicted.
For now, the only thing to understand is that decision trees
can be used to perform both regression and classification.
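The grouping idea can be made concrete with a very small tree. The sketch below uses hypothetical toy data (not the lecture's data set) and limits the tree to a single split with max_depth=1, so it forms exactly two groups and predicts the average target value of whichever group a new row falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: low values cluster near 2, high values near 11.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# max_depth=1 allows a single split, so the tree has two leaves (groups).
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

# Each prediction is the mean target value of the group the input falls into.
low_pred = tree.predict([[2.5]])[0]
high_pred = tree.predict([[11.5]])[0]
print(low_pred, high_pred)
```

Deeper trees simply keep splitting each group into smaller ranges, each with its own predicted value.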
Random Forest regression is a form of decision tree regression. It
can be explained with an example: suppose we face a problem in a
forest, and each tree in the forest is like a friend. We ask each
tree (friend) for a suggestion about the problem. We can simply say
that it is like voting: we decide after considering the opinion of
the majority of friends. For regression, the forest averages the
numeric predictions of its trees.
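For regression, the "vote" of the forest is an average of the individual trees' predictions. The sketch below checks this on synthetic data; the data, the coefficients, and the choice of 10 trees are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumed for illustration): two features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# Ask every tree ("friend") separately, then average their answers.
X_new = X[:5]
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the manual average.
forest_pred = forest.predict(X_new)
print(np.allclose(manual_average, forest_pred))
```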
Gradient Boosting regression is another extension of the Decision Tree
Regressor. In this regressor we act on the opinion given and, if it does
not work, we come back, take the opinion again, and avoid the actions
that produced the wrong results.
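Reading the description above as gradient boosting (an interpretation, since the notes do not name the method at this point), the "come back and correct the mistakes" idea can be sketched with two small trees: the second tree is trained on the errors left by the first. The data and tree depth are assumptions; a real gradient boosting model repeats this for many stages and shrinks each correction with a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, non-linear data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1.0, size=200)

# Stage 1: a shallow tree makes a first, rough prediction.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred1 = tree1.predict(X)

# Stage 2: a second tree learns the mistakes (residuals) of the first.
residual1 = y - pred1
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual1)
pred2 = pred1 + tree2.predict(X)

# Training error before and after the correction step.
mse_stage1 = np.mean((y - pred1) ** 2)
mse_stage2 = np.mean((y - pred2) ** 2)
print(mse_stage1, mse_stage2)
```

On the training data the second stage cannot make the squared error worse, because each leaf of the second tree predicts the mean of the remaining errors in that leaf.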
Now we move to the Colab notebook. Our practice session in Colab
starts with the question of which features scaling should be
applied to. Scaling can be applied to all features, including the
target feature. However, in this Colab notebook scaling is applied
only to the input features; scaling all features including the target
will be our task in the assignment.
OBJECTIVE 2: MACHINE LEARNING
Next, I will feed these features into various regression
algorithms to determine the best performance using a simple
framework: Split, Fit, Predict, Score It.
Target Variable Splitting
We will split the full dataset into input and target variables.
The input is also called the feature variables; the output refers to
the target variable.
# Split data to be used in the models
# Create matrix of features
x = full_data.drop('Price', axis=1)  # grabs everything else but 'Price'
# Create target variable
y = full_data['Price']  # y is the column we're trying to predict
Before we train the models, it's essential to split our data into
training and testing sets. This ensures that we have a separate
dataset to evaluate the performance of our trained models. The
common practice is to use a certain portion of our data for
training (e.g., 70-80%) and the remaining portion for testing
(e.g., 20-30%)
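from sklearn import preprocessing
# Fit the scaler once (learns each column's mean and standard deviation)
pre_process = preprocessing.StandardScaler().fit(x)
# The scaler is already fitted, so transform is enough here
x_transform = pre_process.transform(x)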
Now we are applying feature scaling to our feature matrix x using
the StandardScaler from scikit-learn. Feature scaling is a common
preprocessing step in machine learning to standardize or
normalize the features so that they have a mean of 0 and a
standard deviation of 1. This can be helpful, especially for
algorithms that are sensitive to the scale of input features.
In the code above, we first create an instance of StandardScaler
and then fit it to our data (x) using the fit method. After that,
we apply the transformation to the feature matrix using the
transform method, which gives us x_transform with scaled features.
Remember, when we use the same scaling parameters for both the
training and testing sets, it ensures that the features are
scaled consistently, which is crucial for accurate model
performance evaluation.
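One common way to keep the scaling consistent is to learn the scaling parameters from the training rows only and then reuse them on the test rows. The notebook itself fits the scaler on all of x before splitting, so the sketch below is an alternative arrangement, shown on hypothetical synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (thousands vs. 0 to 1).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(1000, 9000, 100), rng.uniform(0, 1, 100)])
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=101)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the same mean and standard deviation for the test rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The scaled test columns will not have a mean of exactly 0, which is expected: they are scaled with the training statistics, just as unseen data would be.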
# pipe = make_pipeline(StandardScaler(), LogisticRegression())
# pipe.fit(X_train, y_train)
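The commented-out lines above sketch a machine learning pipeline
built with scikit-learn's make_pipeline function. This pipeline
combines feature scaling using StandardScaler() and a logistic
regression model. Note that LogisticRegression is a classifier; for
predicting a continuous value such as Price, a regressor such as
LinearRegression would take its place.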
Let's break down the code:
1. make_pipeline(StandardScaler(), LogisticRegression()): This
function creates a pipeline that first applies
StandardScaler() for feature scaling and then fits a
LogisticRegression model on the scaled features. The pipeline
ensures that the feature scaling is consistently applied to
both the training and testing data.
2. pipe.fit(X_train, y_train): This line of code fits the created
pipeline to our training data X_train and corresponding
target variable y_train. This means that the feature scaling
and logistic regression model will be trained together as
part of the pipeline.
After we run the fit method, our pipeline (pipe) will be trained
and ready to make predictions on new data.
Here's a summary of what the pipeline does:
1. Scales the features in X_train using StandardScaler.
2. Fits a logistic regression model to the scaled features with
the corresponding target variable y_train.
We can now use the trained pipeline to make predictions on new
data or evaluate its performance on the test set (X_test and
y_test).
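For a regression target, the same pipeline pattern might look like the sketch below, with LinearRegression in place of LogisticRegression. The synthetic data and its coefficients are assumptions made so that the example runs on its own; they are not the lecture's data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): one large-scale and one small-scale feature.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(1000, 9000, 200), rng.uniform(0, 1, 200)])
y = 50.0 * X[:, 0] + 20000.0 * X[:, 1] + rng.normal(0, 100.0, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=101)

# The pipeline scales the features and then fits the regressor in one call.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# At prediction time the fitted scaler is applied automatically.
y_pred = pipe.predict(X_test)
r2 = pipe.score(X_test, y_test)  # R-squared on the test set
print(r2)
```

Because the pipeline receives unscaled data, it should not be combined with a feature matrix that has already been scaled by hand.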
# x Represents the Features
x_transform.shape
x_transform
The two expressions above let us check the shape and the contents of
the feature matrix x_transform after applying the StandardScaler
transformation. x_transform should be the scaled version of
our original feature matrix x.
The output will show us the shape of x_transform, which will be a
2-dimensional numpy array with the same number of rows as the
original feature matrix x and the number of columns representing
the number of features.
The contents of x_transform will be the scaled values of our
original features, where each column will have a mean of 0 and a
standard deviation of 1. Note that the exact values will depend
on the distribution and scaling of our original features.
[Output of the above code: the scaled feature matrix x_transform]
Now our values have been standardized: each column has a mean of 0
and a standard deviation of 1, and no very high values remain.
y # y represents the Target
y.shape
(5000,)
# Use x and y variables to split the data into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x_transform, y, test_size=0.10, random_state=101)
We are using the transformed feature matrix x_transform and the
target variable y to split our data into training and testing
sets. The train_test_split function from scikit-learn is commonly
used for this purpose. This allows us to have separate datasets
for training and evaluating our machine learning models.
The code provided will split the data into training and testing
sets, with 90% of the data used for training and 10% for testing.
After running this code, we will have the following datasets:
x_train: The transformed feature matrix for training our
machine learning models.
x_test: The transformed feature matrix for evaluating our
trained models.
y_train: The target variable corresponding to the training
data.
y_test: The target variable corresponding to the testing
data.
Now we can use x_train and y_train to train our models and then
evaluate their performance on x_test and y_test. The test_size
parameter controls the proportion of data that goes into the
testing set. In this case, it's set to 0.10, meaning 10% of the
data will be used for testing, while the remaining 90% will be
used for training. The random_state parameter is set to 101, which
is an arbitrary seed to ensure reproducibility. We can change it
to any other value or set it to None for a random split.
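The effect of test_size and random_state can be checked on a small stand-in array. The sketch below assumes 5000 rows (matching the y.shape shown earlier) and two made-up feature columns; the feature values themselves are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 5000 rows and 2 feature columns.
X = np.arange(10000).reshape(5000, 2)
y = np.arange(5000)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=101)
print(x_train.shape, x_test.shape)  # (4500, 2) (500, 2)

# Repeating the call with the same random_state gives the same split.
_, x_test_again, _, _ = train_test_split(
    X, y, test_size=0.10, random_state=101)
same_split = np.array_equal(x_test, x_test_again)
print(same_split)
```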
LINEAR REGRESSION
Model Training
# Fit
# Import model
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Create instance of model
lin_reg = LinearRegression()
# Pass training data into model
lin_reg.fit(x_train, y_train)
# pipe = make_pipeline(StandardScaler(), LinearRegression())
# pipe.fit(x_train, y_train)
LinearRegression
LinearRegression()
We are fitting a linear regression model to our training data
using scikit-learn.
Let's break down the code:
The above code creates an instance of the LinearRegression model
and then fits it to the scaled training data (x_train) along with
the corresponding target variable (y_train). This process trains
the model to learn the relationship between the features and the
target variable.
We can now use the trained lin_reg model to make predictions on
new data or evaluate its performance on the test set.
The commented-out make_pipeline lines show an alternative: a pipeline
that encapsulates the feature scaling step (StandardScaler) and the
linear regression model. Since x_train has already been scaled and the
lin_reg model is already trained, there is no need to fit the pipeline
on the same data (x_train and y_train). Instead, we can use the
lin_reg model directly for further analysis.
Prediction
# Predict
y_pred = lin_reg.predict(x_test)
print(y_pred.shape)
print(y_pred)
We've successfully used the trained linear regression model
(lin_reg) to make predictions on the test data (x_test). The
predicted values are stored in the variable y_pred.
Let's break down the code:
The output will show us the shape of the y_pred array, which will
be a 1-dimensional numpy array containing the predicted values
for each sample in the test data. The number of elements in
y_pred will be equal to the number of samples in x_test.
The second print statement will display the actual predicted
values.
Keep in mind that the y_pred array contains the predictions made
by the linear regression model for the corresponding samples in
the test set. We can use these predicted values to evaluate the
model's performance and compare them to the true target values
(y_test).
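import matplotlib.pyplot as plt
import seaborn as sns

# Each point pairs an actual value (x-axis) with its prediction (y-axis)
sns.scatterplot(x=y_test, y=y_pred, color='blue', label='Actual Data points')
# Diagonal on which predicted values equal actual values
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)],
         color='red', label='Ideal Line')
plt.legend()
plt.show()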
We use the Seaborn library to create a scatter
plot comparing the actual target values (y_test) with the
predicted values (y_pred) from our linear regression model.
Additionally, we add a red line representing the ideal
line, where the predicted values perfectly match the actual
values.
Here's a breakdown of the code:
In the scatter plot, each data point represents a sample from the
test set. The x-axis represents the actual target values
(y_test), and the y-axis represents the predicted values (y_pred)
from our linear regression model. Each blue point pairs an actual
value with its prediction, while the red line represents the ideal line.
If the points align closely along the red line, it suggests that
the model's predictions are close to the actual values,
indicating a good fit. However, if the points are scattered away
from the red line, it indicates that the model's predictions
deviate from the actual values.
Visualization like this can give us a quick visual understanding
of how well our linear regression model is performing. For a more
quantitative evaluation, we can use metrics such as Mean Squared
Error (MSE), Mean Absolute Error (MAE), or R-squared (coefficient
of determination). These metrics provide insights into how well
the model is capturing the variation in the data.
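The three metrics named above can be computed by hand on a few numbers, which makes their definitions concrete. The actual and predicted values below are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat

# Mean Squared Error: average of the squared errors.
mse = np.mean(errors ** 2)
# Root Mean Squared Error: back in the units of the target.
rmse = np.sqrt(mse)
# Mean Absolute Error: average size of the errors, ignoring sign.
mae = np.mean(np.abs(errors))
# R-squared: 1 minus (unexplained variation / total variation).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

Here the errors are -10, 10, -20 and 20, so the MSE is 250, the MAE is 15, and R-squared is 0.98.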
import numpy as np

# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))
# Printing the results
print("Actual Values | Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"{actual:14.2f} | {predicted:12.2f}")
We've successfully combined the actual target values (y_test) and
the predicted values (y_pred) side by side using NumPy's
column_stack function. The code above prints the results in a
formatted table, showing the actual values in one column and the
corresponding predicted values in another column.
The output will be a table with two columns, displaying the
actual target values in the left column and the corresponding
predicted values in the right column. The numbers will be
formatted with two decimal places for better readability.
This kind of side-by-side comparison allows us to visually assess
how well our model's predictions align with the actual target
values. If the predicted values are close to the actual values,
we should see similar numbers in both columns. However, if there
are substantial differences, it indicates that the model may not
be performing well on certain samples or might need further
improvements.
Residual Analysis
Residual analysis in linear regression is a way to check how well
the model fits the data. It involves looking at the differences
(residuals) between the actual data points and the predictions
from the model.
In a good model, the residuals should be randomly scattered
around zero on a plot. If there are patterns or a fan-like shape,
it suggests the model may not be the best fit. Outliers, points
far from the others, can also affect the model.
Residual analysis helps ensure the model's accuracy and whether
it meets the assumptions of linear regression. If issues are
found, adjustments to the model may be needed to improve its
performance.
actual = np.asarray(y_test)  # actual target values as a NumPy array
residual = actual - y_pred.reshape(-1)
print(residual)
We've computed the residuals by subtracting the predicted values (y_pred)
from the actual target values (actual) and stored the result in the residual
array. Residuals represent the differences between the actual values and
the corresponding predicted values.
In the code, y_pred.reshape(-1) is used to ensure that the predicted values
are in the same shape as actual, so they can be directly subtracted. The
result is an array of residuals, where each element corresponds to the
difference between the actual value and the predicted value for a specific
sample.
By examining the values in the residual array, we can gain insights into
how well the model is performing. Ideally, the residuals should be close
to zero, indicating that the model's predictions are accurate. Positive
residuals indicate that the model underestimates the target variable,
while negative residuals suggest overestimation.
Analysing the distribution of residuals and looking for patterns can help
us identify areas where the model might be performing poorly and guide us
in making further improvements to our model or data preprocessing.
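A tiny made-up example shows both the sign convention and why the shapes matter. It assumes the predictions arrive as a column of shape (3, 1), which is the situation in which reshape(-1) is needed; if the predictions are already one-dimensional, the reshape changes nothing.

```python
import numpy as np

# Made-up actual prices and column-shaped predictions.
actual = np.array([250000.0, 310000.0, 180000.0])
predicted = np.array([[240000.0], [330000.0], [180000.0]])  # shape (3, 1)

# Flatten the predictions so both arrays have shape (3,).
residual = actual - predicted.reshape(-1)

# Positive residual: model predicted too low (underestimate).
underestimated = residual > 0
# Negative residual: model predicted too high (overestimate).
overestimated = residual < 0

# Without the reshape, broadcasting silently produces a 3 x 3 matrix.
wrong = actual - predicted

print(residual)
print(wrong.shape)
```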
If we want to get additional information about the overall performance of
the model, we may consider computing metrics such as Mean Squared Error
(MSE) or Mean Absolute Error (MAE) using the residuals. These metrics
provide a quantitative measure of how well the model is predicting the
target variable across the entire dataset.
# Distribution plot for the residuals (difference between actual and predicted values)
sns.histplot(residual, kde=True)
We use Seaborn's histplot function, the current replacement for the
older distplot function, to create a distribution plot of the
residuals. The residuals represent the differences between the actual
target values and the corresponding predicted values from our linear
regression model.
The histplot function creates a histogram of the residuals,
and with kde=True it also overlays a kernel density
estimate (KDE) to visualize the shape of the distribution.
In the plot, the x-axis represents the range of residual values,
and the y-axis shows the density or frequency of occurrences of
each residual value. The distribution plot provides insights into
the distribution of errors made by our linear regression model.
Ideally, we would want the residuals to be centered around zero
with a symmetric distribution, indicating that the model's
predictions are unbiased and accurate. Deviations from this
pattern might indicate areas where the model is not performing
well.
Common patterns to look for in the distribution plot include:
1. Symmetry: A symmetric distribution around zero suggests the
model is making unbiased predictions.
2. Skewness: If the distribution is skewed, it indicates that
the model is systematically overestimating or
underestimating the target variable.
3. Outliers: Unusually large or small residuals (outliers) may
indicate specific data points that the model is struggling
to predict accurately.
By examining the distribution of residuals, we can gain insights
into the overall performance of our model and identify potential
areas of improvement.
The plot shows that our model's residuals are not skewed, as the
distribution is centered. Note, however, the scale of the X and Y
axes: the values are expressed with a power-of-ten exponent of 6. This
means the difference between actual and predicted values was high,
but it has been reduced to some extent, which is good.
Model Evaluation
Linear Regression
# Score It
from sklearn.metrics import mean_squared_error
print('Linear Regression Model')
# Results
print('--'*30)
# mean_squared_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
# Print evaluation metrics
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
We've evaluated the performance of our linear regression model
using the Mean Squared Error (MSE) and Root Mean Squared Error
(RMSE). These metrics are common measures used to assess how well
a regression model's predictions match the actual target values.
The mean_squared_error function from scikit-learn calculates the
MSE between the actual target values (y_test) and the predicted
values (y_pred). MSE measures the average squared difference
between predicted and actual values, penalizing larger errors
more heavily.
The RMSE is simply the square root of the MSE. It represents the
average magnitude of the errors in the same units as the target
variable. RMSE is a commonly used metric for regression tasks
because it is more interpretable and easier to relate to the
scale of the original target variable.
By displaying both the MSE and RMSE, we get a sense of how well
the model is performing on the test data. Smaller values for
these metrics indicate better performance, as they suggest that
the model's predictions are closer to the actual values.
Now, we have quantified the performance of our linear regression
model using the evaluation metrics, which can help us compare
this model to other models or assess its suitability for our
specific task.
Linear Regression Model
------------------------------------------------------------
Mean Squared Error: 9839952411.801708
Root Mean Squared Error: 99196.53427313732
# Linear Regression Model
# ------------------------------------------------------------
# Mean Squared Error: 10100187858.864885
# Root Mean Squared Error: 100499.69083964829
# 10170939558
The output above gives the evaluation metrics for the current run of
our linear regression model:
Mean Squared Error (MSE): 9,839,952,411.80
Root Mean Squared Error (RMSE): 99,196.53
The commented-out lines record an earlier run, with an MSE of
10,100,187,858.86 and an RMSE of 100,499.69.
The MSE is a measure of the average squared difference between
the actual target values and the predicted values. In this case,
it suggests that, on average, the squared difference between the
predicted and actual values is quite large.
The RMSE, which is the square root of the MSE, gives an estimate
of the typical error in the same units as the target variable. In
this case, it indicates that the model's predictions are off by
roughly 100,000 in the units of the target variable.
Please note that these error values depend on the scale and units
of the target variable. If our target variable is measured in
larger units, it is not unusual to have larger values for the
error metrics.
The last value "10170939558" seems to be a standalone number with
no context provided.
s = 10100187858 - 9839952411
print(s)
260235447
y_train.shape
(4500,)
Decision Tree
from sklearn.tree import DecisionTreeRegressor
# Note: the variable names (rf_regressor, y_pred_rf) are reused across models
rf_regressor = DecisionTreeRegressor()
rf_regressor.fit(x_train, y_train)
# Predict prices for the test set
y_pred_rf = rf_regressor.predict(x_test)
# Decision Tree Regression MSE on the test set (true values first)
DTr = mean_squared_error(y_test, y_pred_rf)
print('Decision Tree Regression : ', DTr)
We've created a Decision Tree Regressor and fitted it to our
training data (x_train and y_train). We then used this Decision
Tree Regressor to predict the target values for the test set
(x_test) and stored the predictions in y_pred_rf.
Finally, we calculated the Mean Squared Error (MSE) between the
predicted values (y_pred_rf) and the actual target values (y_test)
using scikit-learn's mean_squared_error function, and we assigned
the result to DTr.
Here's the code breakdown:
The DecisionTreeRegressor is a regression model that builds a
decision tree to predict the target variable. In this code, we
used the Decision Tree Regressor to make predictions (y_pred_rf)
on the test set. The mean_squared_error function was then used to
calculate the MSE between the predicted values (y_pred_rf) and the
actual target values (y_test).
The printed output will show the Mean Squared Error for Decision
Tree Regression with the test set.
Keep in mind that Decision Trees have some limitations, such as
the tendency to overfit, and may not always provide the best
performance for regression tasks. Random Forests, on the other
hand, are an extension of Decision Trees that can often offer
better generalization and predictive performance. If we want to
try Random Forest Regression, we can use RandomForestRegressor from
scikit-learn in a similar manner.
Decision Tree Regression : 31316806651.827576
Random Forest
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
rf_regressor.fit(x_train, y_train)
# Predict prices for the test set
y_pred_rf = rf_regressor.predict(x_test)
# Random Forest Regression MSE on the test set (true values first)
RFr = mean_squared_error(y_test, y_pred_rf)
print('Random Forest Regression : ', RFr)
Now we've created a Random Forest Regressor, fitted it to our
training data (x_train and y_train), and used it to predict the
target values for the test set (x_test). We stored the
predictions in y_pred_rf.
Next, we calculated the Mean Squared Error (MSE) between the
predicted values (y_pred_rf) and the actual target values (y_test)
using scikit-learn's mean_squared_error function, and we assigned
the result to RFr.
Here's the code breakdown:
Random Forest Regression is an ensemble learning method that
combines multiple decision trees to improve predictive
performance and reduce overfitting. The RandomForestRegressor class
in scikit-learn implements the Random Forest algorithm for
regression tasks.
The printed output will show the Mean Squared Error for Random
Forest Regression with the test set.
Comparing the MSE values for Decision Tree Regression and Random
Forest Regression can give us insights into which model performs
better on this specific task. Generally, Random Forest Regression
tends to perform better than a single Decision Tree, especially
when the dataset is complex and prone to overfitting.
Random Forest Regression : 14315329749.65445
Gradient Boosting Regression
from sklearn.ensemble import GradientBoostingRegressor
# Note: the variable names (rf_regressor, y_pred_rf) are reused across models
rf_regressor = GradientBoostingRegressor()
rf_regressor.fit(x_train, y_train)
# Predict prices for the test set
y_pred_rf = rf_regressor.predict(x_test)
# Gradient Boosting Regression MSE on the test set (true values first)
GBr = mean_squared_error(y_test, y_pred_rf)
print('Gradient Boosting Regression : ', GBr)
Now we've created a Gradient Boosting Regressor, fitted it
to our training data (x_train and y_train), and used it to predict
the target values for the test set (x_test). We stored the
predictions in y_pred_rf.
Next, we calculated the Mean Squared Error (MSE) between the
predicted values (y_pred_rf) and the actual target values (y_test)
using scikit-learn's mean_squared_error function, and we assigned
the result to GBr.
Here's the code breakdown:
Gradient Boosting is another ensemble learning method that
combines multiple weak learners (typically decision trees) to
create a strong predictive model. The GradientBoostingRegressor
class in scikit-learn implements the Gradient Boosting algorithm
for regression tasks.
The printed output will show the Mean Squared Error for Gradient
Boosting Regression with the test set.
Comparing the MSE values for Decision Tree Regression, Random
Forest Regression, and Gradient Boosting Regression can give us
insights into which model performs better on this specific task.
Gradient Boosting, like Random Forests, tends to perform well on
various tasks, making it a powerful choice for regression
problems.
Gradient Boosting Regression : 12029643835.717766
# Sample model scores (replace these with our actual model scores)
model_scores = {
    "Linear Regression": 9839952411.801708,
    "Descison Tree": 29698988724.82603,
    "Random Forest": 14315329749.65445,
    "Gradient Boosting": 12029643835.717766,
}
# Sort the model scores in ascending order (lower values first)
sorted_scores = sorted(model_scores.items(), key=lambda x: x[1])
# Display the ranking of the models
print("Model Rankings (lower values are better):")
for rank, (model_name, score) in enumerate(sorted_scores, start=1):
    print(f"{rank}. {model_name}: {score}")
The code above takes the MSE scores of the different regression
models, sorts them in ascending order (lower values first), and then
displays the rankings.
The code first sorts the model scores in ascending order based on
their values, and then it displays the ranking of the models from
best to worst. In this ranking, models with lower scores (e.g.,
MSE values) are considered better performers, as they indicate
closer predictions to the actual values.
Model Rankings (lower values are better):
1. Linear Regression: 9839952411.801708
2. Gradient Boosting: 12029643835.717766
3. Random Forest: 14315329749.65445
4. Descison Tree: 29698988724.82603
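The Split, Fit, Predict, Score It steps used for each model can also be written as one loop, so that the scores are collected automatically instead of being copied by hand. The sketch below runs on synthetic data (an assumption, so the numbers and ranking will differ from the results above); the fixed random_state values are only there to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=101)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model, predict on the test set, and store its MSE.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))

# Rank the models from lowest (best) to highest MSE.
for rank, (name, score) in enumerate(
        sorted(scores.items(), key=lambda item: item[1]), start=1):
    print(f"{rank}. {name}: {score:.3f}")
```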