MODULE TITLE: Introduction to Artificial Intelligence
MODULE NUMBER: CIS2205
DEPARTMENT: School of Computing Engineering
COURSEWORK: Report
MODULE LEADER: Quratul-Ain-Mahesar
Abstract
Machine learning techniques that increase the accuracy of real estate valuation are central
to real estate market analysis. The dataset used here contains numerical and categorical
variables, including property dimensions, geographic location, and other features. To
preserve data integrity, missing numerical values are replaced with the median and missing
categorical values with the mode. A Scikit-learn ColumnTransformer combines OneHotEncoder
for encoding categorical variables with StandardScaler for standardising numeric variables.
Linear regression and random forest models are both trained on 70% of the data and tested
on the remaining 30%. Performance is evaluated with the mean absolute error (MAE), root
mean square error (RMSE), and coefficient of determination (R²). The results indicate that
the random forest model outperforms linear regression, which is attributed to its ability
to model complex interactions and non-linear relationships. Scatter plots and feature
importance plots help show how well each model performs and which features are likely to
be key drivers of real estate prices. The work highlights the value of thorough
preprocessing combined with ensemble methods in predictive analytics. It provides a basis
for machine learning models in real estate that could be further optimised through
additional features and hyperparameter tuning.
List of Contents
Abstract
List of Contents
Table of Figures
1. Objective
2. Introduction
3. Approach
4. Key Insights
4.1. Data Preprocessing
4.2. Performance of models
4.3. Feature Analysis
4.4. Visualisations
4.5. Optimisation
5. Code Guidelines
5.1. Needed Imports
6. Loading the dataset
7. Addressing Missing Values
8. Divide the dataset
9. Preprocessor for Data
10. Divide the data into testing and training
10.1. Linear Regression Model
10.2. Random Forest Model
11. Metrics for Evaluation
12. Comparison of Expected and Real Prices
13. Significance of Features
14. Linear Regression: Forecasted vs Real Costs
14.1. Results of Linear Regression
14.2. Results of Random Forest
15. Findings
16. Recommendations
17. Conclusion
18. References
Table of Figures
Figure 1 Actual vs Predicted prices using linear regression model
Figure 2 Actual vs Predicted prices using random forest model
1. Objective
This research aims to predict house prices using the Kaggle dataset "House Prices -
Advanced Regression Techniques". Linear Regression (LR) and Random Forest (RF) algorithms
are both used to estimate the target variable SalePrice. The workflow comprises data
preprocessing, model training, evaluation, and visualisation.
2. Introduction
Linear regression is one of the most fundamental and widely applied techniques in machine
learning (Maulud & Abdulazeez, 2020). It is a mathematical model used to determine the
relationship between the variables involved in an analysis (Akgün & Öğüdücü, 2015; Maulud
& Abdulazeez, 2020). Linear regression is also widely used in mathematical research to
estimate outcomes and predict values from several input variables (Lim, 2019). The method
involves analysing and modelling data to determine the linear relationship between
independent and dependent variables, and it carries those relationships from analysis and
learning through to real-world results. The details of each process, such as the
algorithms used, databases, precision, and performance, are usually hidden (Sarkar et al.,
2015). The rapid development of remote sensing technologies, especially platforms,
sensors, and information infrastructure, has significantly increased the accessibility of
Earth observation data for geospatial analysis (Sarkar et al., 2015; Seo et al., 2017).
Land use and land cover mapping applications are expanding, along with the need to update
available maps, which creates new avenues for innovative land use classification methods
in various land management sectors. These advancements aim to address challenges that are
local, regional, and global in nature (Karan & Samadder, 2018; Ramo & Chuvieco, 2017).
3. Approach
In the pre-processing stage, missing values were imputed with the median for numerical
features and with the mode for categorical features. Numerical features were then
standardised with StandardScaler, and categorical variables were encoded with
OneHotEncoder. A ColumnTransformer combined these operations in a single preprocessing
step. The data were split into a training set containing 70% of the data and a testing
set containing the remaining 30%. Two machine learning techniques, linear regression and
random forest, were then fitted as Scikit-learn pipelines that include both the
preprocessing and the modelling steps. The models were compared using the mean absolute
error, the root mean square error, and the coefficient of determination (R²). This
systematic procedure gives a meaningful picture of how well each model predicts real
estate values (Dimopoulos et al., 2018; Mathotaarachchi et al., 2024).
4. Key Insights
4.1. Data Preprocessing
• Preprocessing is important because it is the basic process that ensures the quality and
reliability of the data used in an analysis or model.
• Feature scaling and encoding make the data compatible with machine learning algorithms.
4.2. Performance of models
• The Random Forest model performs better than linear regression: it is more accurate and
generalises better because it can represent non-linear relationships.
• The Random Forest model achieved a lower mean absolute error and root mean square error
and a higher coefficient of determination than linear regression.
4.3. Feature Analysis
• The Random Forest algorithm makes it easy to identify the factors that influence a
property's price the most, which is valuable information for real estate professionals
(Jui et al., 2020).
4.4. Visualisations
• As the scatter plots show, the random forest predictions correlate more closely with the
actual values than the linear regression predictions do.
4.5. Optimisation
• More advanced techniques such as gradient boosting and XGBoost could be used, together
with hyperparameter optimisation and further feature engineering (Rey-Blanco et al., 2024).
5. Code Guidelines
5.1. Needed Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
Goal: Import the data handling (pandas), machine learning (scikit-learn), and
visualisation (matplotlib, seaborn) libraries.
6. Loading the dataset
Dataset link: dataset
data = pd.read_csv('/content/drive/MyDrive/AiLabs/house_prices.csv')
• Goal: Load the dataset into a DataFrame for analysis.
• Note: Replace the file path with the actual path to your dataset.
7. Addressing Missing Values
# 'SalePrice' is the target variable, so drop it from the list of numeric features.
numeric_features = data.select_dtypes(include=['number']).drop(columns=['SalePrice']).columns
categorical_features = data.select_dtypes(include=['object']).columns
# Fill numeric gaps with the column median and categorical gaps with the column mode.
data[numeric_features] = data[numeric_features].fillna(data[numeric_features].median())
data[categorical_features] = data[categorical_features].fillna(data[categorical_features].mode().iloc[0])
• Numerical Columns: The median of each column is used to replace missing values.
• Categorical Columns: The mode, or most frequent value, of each column is used to
replace missing data.
• Why: For numerical data, the median is less sensitive to outliers than the mean. For
categorical data, the mode is the most likely value and keeps the column within its
existing categories.
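The effect of this imputation choice can be illustrated on a small example. The sketch below uses a toy DataFrame whose column names and values are hypothetical (they are not taken from the report's dataset), and it selects the categorical columns with exclude=['number'] so that it behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; one numeric outlier and one gap per column.
toy = pd.DataFrame({
    "LotArea": [8000.0, 9500.0, np.nan, 7200.0, 250000.0],
    "Street": ["Pave", "Pave", None, "Grvl", "Pave"],
})

num_cols = toy.select_dtypes(include=["number"]).columns
# Assumption: every non-numeric column is treated as categorical.
cat_cols = toy.select_dtypes(exclude=["number"]).columns

mean_fill = toy["LotArea"].mean()      # pulled upwards by the outlier
median_fill = toy["LotArea"].median()  # stays close to the typical values

imputed = toy.copy()
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())
imputed[cat_cols] = imputed[cat_cols].fillna(imputed[cat_cols].mode().iloc[0])

print("mean:", mean_fill, "median:", median_fill)
print(imputed)
```

In this toy case a single extreme lot size drags the mean far above the typical values, while the median remains representative, which is the reason given above for preferring the median.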
8. Divide the dataset
X = data.drop('SalePrice', axis=1)  # Features
y = data['SalePrice']  # Target
• Goal: Separate the target variable (y) from the features (X) for supervised learning.
9. Preprocessor for Data
• Numerical Features: Standardised with StandardScaler (mean = 0, std = 1).
• Categorical Features: Converted to binary indicator columns with OneHotEncoder.
• Goal: Ensure that every feature is represented numerically and scaled appropriately for
the machine learning models.
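The report describes the preprocessor but does not list its definition. A minimal sketch of how such a ColumnTransformer could be assembled is given below. The toy columns, the transformer labels 'num' and 'cat', and the handle_unknown='ignore' setting are assumptions made for illustration and may differ from the code actually used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the feature matrix X.
X_toy = pd.DataFrame({
    "GrLivArea": [1500.0, 2000.0, 1200.0, 1800.0],
    "OverallQual": [5.0, 7.0, 4.0, 6.0],
    "Neighborhood": ["A", "B", "A", "C"],
})
numeric_features = ["GrLivArea", "OverallQual"]
categorical_features = ["Neighborhood"]

preprocessor = ColumnTransformer(transformers=[
    # Numeric transformer first, categorical second.
    ("num", StandardScaler(), numeric_features),
    # Assumption: categories unseen during fit are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

X_out = preprocessor.fit_transform(X_toy)
# The result may be sparse depending on its density; convert for inspection.
X_out = X_out.toarray() if hasattr(X_out, "toarray") else np.asarray(X_out)

print(X_out.shape)                          # 2 scaled columns + 3 indicator columns
print(X_out[:, :2].mean(axis=0).round(6))   # scaled columns are centred on zero
```

Listing the numeric transformer before the categorical one matters if later code indexes preprocessor.transformers_ by position.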
10. Divide the data into testing and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
• Goal: Separate the data into training and testing sets.
• Training Set: Models are trained using 70% of the data.
• Testing Set: The remaining 30% of the data is held out to assess the models'
performance.
• Random Seed: Makes the split reproducible.
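The 70/30 proportions and the role of the random seed can be checked on synthetic data. The sketch below uses hypothetical arrays (X_demo, y_demo) rather than the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features each; the target is just the row index.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
# Repeating the call with the same seed should reproduce the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
same_split = np.array_equal(X_te, X_te2) and np.array_equal(y_tr, y_tr2)
no_overlap = set(y_tr).isdisjoint(set(y_te))
print("reproducible:", same_split, "disjoint:", no_overlap)
```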
10.1. Linear Regression Model
# Linear Regression Model
lr_pipeline = Pipeline(steps=[('preprocessing', preprocessor),
                              ('model', LinearRegression())])
lr_pipeline.fit(X_train, y_train)
lr_pred = lr_pipeline.predict(X_test)
• Pipeline: Integrates the preprocessing (preprocessor) and the model (LinearRegression).
• Fit: Use the training data to train the pipeline.
• Predict: Generate predictions for the test data.
10.2. Random Forest Model
# Random Forest Model
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RandomForestRegressor(random_state=42))])
rf_pipeline.fit(X_train, y_train)
rf_pred = rf_pipeline.predict(X_test)
• Pipeline: Combines the RandomForestRegressor model with preprocessing. A Random Forest
is a reliable model for capturing non-linear interactions.
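The claim that a random forest captures non-linear structure that a straight line cannot may be illustrated with synthetic data. The sketch below is not a reproduction of the report's experiment: the quadratic target, the noise level, and the sample size are assumptions chosen only to make the contrast visible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, purely non-linear relationship: y = x^2 plus a little noise.
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(500, 1))
y_syn = X_syn[:, 0] ** 2 + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

lr = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

lr_r2 = r2_score(y_te, lr.predict(X_te))
rf_r2 = r2_score(y_te, rf.predict(X_te))
print(f"Linear Regression R²: {lr_r2:.3f}")
print(f"Random Forest R²:     {rf_r2:.3f}")
```

On data like this the gap is large by construction. On the housing data the relationship is closer to linear, so a much smaller difference between the two models is to be expected.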
11. Metrics for Evaluation
# Note: squared=False is accepted by older scikit-learn releases; newer releases
# provide root_mean_squared_error instead.
print("Linear Regression:")
print("MAE:", mean_absolute_error(y_test, lr_pred))
print("RMSE:", mean_squared_error(y_test, lr_pred, squared=False))
print("R²:", r2_score(y_test, lr_pred))
print("Random Forest:")
print("MAE:", mean_absolute_error(y_test, rf_pred))
print("RMSE:", mean_squared_error(y_test, rf_pred, squared=False))
print("R²:", r2_score(y_test, rf_pred))
• MAE: Measures the average magnitude of the prediction errors.
• RMSE: Penalises larger errors more severely.
• R²: Shows how much of the variance in the data the model accounts for.
• Goal: Evaluate how well the Random Forest and Linear Regression models perform.
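For reference, the three metrics can be computed by hand from their definitions. The sketch below uses four made-up prices and predictions (not values from the report) and only NumPy, so each step of the arithmetic is visible.

```python
import numpy as np

# Hypothetical actual and predicted prices.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 280000.0, 250000.0])

errors = y_true - y_hat

# MAE: mean of the absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; squaring weights large errors more.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
```

Because the single 20,000 error dominates the squared terms, the RMSE comes out larger than the MAE here, which shows the heavier penalty on large errors described above.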
12. Comparison of Expected and Real Prices
plt.scatter(y_test, rf_pred, alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Predicted vs. Actual Prices in Random Forest')
plt.show()
• Scatter Plot: Shows the relationship between actual and predicted values. Points on the
diagonal (y = x) correspond to perfect predictions.
13. Significance of Features
# Importances from the fitted forest, then the matching feature names.
importances = rf_pipeline.named_steps['model'].feature_importances_
num_features = preprocessor.transformers_[0][2]
cat_features = list(preprocessor.transformers_[1][1].get_feature_names_out(categorical_features))
features = list(num_features) + cat_features
sns.barplot(x=importances, y=features)
plt.title('Feature Importance')
plt.show()
• Helpful: Identifies which features have the greatest predictive power.
14. Linear Regression: Forecasted vs Real Costs
plt.figure(figsize=(8, 6))
plt.scatter(y_test, lr_pred, alpha=0.5)
# Dashed diagonal (y = x) marks perfect predictions.
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Predicted vs. Actual Prices in Linear Regression')
plt.show()
• Scatter plot with reference line: The dashed diagonal (y = x) shows how close the
predictions are to the actual values.
The results for these can be seen in Figure 1 and Figure 2.
Figure 1 Comparison of Actual Prices to Predicted Prices Using a Linear Regression Model.
14.1. Results of Linear Regression:
MAE: 18247.00205131123
RMSE: 28007.160173400105
R²: 0.8875909312985639
Figure 2 Comparing Actual and Predicted Prices with the Random Forest Model.
14.2. Results of Random Forest:
MAE: 17028.310616438357
RMSE: 27234.863837233843
R²: 0.8937048091293024
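To put the two sets of results side by side, the short calculation below works out the absolute and relative differences. The metric values are copied from the results above; the dictionary and variable names are illustrative.

```python
# Metric values as reported above for each model.
lr_metrics = {"MAE": 18247.00205131123, "RMSE": 28007.160173400105, "R2": 0.8875909312985639}
rf_metrics = {"MAE": 17028.310616438357, "RMSE": 27234.863837233843, "R2": 0.8937048091293024}

# Absolute and relative reduction in error when moving from LR to RF.
mae_drop = lr_metrics["MAE"] - rf_metrics["MAE"]
mae_drop_pct = 100 * mae_drop / lr_metrics["MAE"]
rmse_drop = lr_metrics["RMSE"] - rf_metrics["RMSE"]
rmse_drop_pct = 100 * rmse_drop / lr_metrics["RMSE"]

# Gain in explained variance.
r2_gain = rf_metrics["R2"] - lr_metrics["R2"]

print(f"MAE reduction:  {mae_drop:,.0f} ({mae_drop_pct:.1f}%)")
print(f"RMSE reduction: {rmse_drop:,.0f} ({rmse_drop_pct:.1f}%)")
print(f"R² gain:        {r2_gain:.4f}")
```

On these figures the random forest lowers the MAE by about 6.7% and the RMSE by about 2.8%, with an R² gain of roughly 0.006, so the improvement is consistent across metrics but modest in size.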
15. Findings
The study applied machine learning-based predictive modelling to property price
forecasting (Khobragade et al., 2018). The data were analysed using both linear regression
and random forest models, each of which has specific advantages and disadvantages. Linear
regression provides an explicit and easily interpretable model that emphasises linear
relationships between the explanatory and dependent variables. Because it is restricted to
linear dynamics, it reached an MAE of 18,247, an RMSE of 28,007, and an R² of 0.888. The
random forest model, which can capture complex patterns and interactions between
variables, reduced the MAE to 17,028 and the RMSE to 27,235 and raised R² to 0.894. These
results indicate that the Random Forest method was more accurate than linear regression on
this test set, although the margin is modest, and it was therefore the better predictor of
real estate values (Mathotaarachchi et al., 2024).
Furthermore, the examination of feature importance helped identify the attributes that
most influence property pricing. These insights have considerable implications for
participants in the real estate market, notably developers, investors, and property owners.
The scatter plots illustrated each model's accuracy, and the feature importance plot showed
the relative weight of the features, highlighting the practical utility of Random Forest
methods in real-world scenarios.
16. Recommendations
To further improve the model:
• Perform hyperparameter tuning for the Random Forest model (Probst et al., 2019).
• Explore additional ensemble methods like Gradient Boosting or XGBoost.
• Engineer new features that capture interactions between existing variables.
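As a starting point for the first recommendation, hyperparameter tuning could be sketched with a small grid search. The example below runs on synthetic regression data, and the grid values, scoring choice, and fold count are assumptions for illustration rather than tuned settings for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Small illustrative grid; a real search would usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximises, hence the negated MAE
    cv=3,
)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

When the model sits inside a Pipeline with a step named 'model', the grid keys take the step name as a prefix, for example model__n_estimators.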
17. Conclusion
The study assesses the effectiveness of two machine learning algorithms, Random Forest
and Linear Regression, in estimating residential property values. The code lays a strong
foundation for predictive modelling through careful data preparation: handling missing
values and applying relevant transformations such as one-hot encoding for categorical data
and standardisation for numerical attributes. The performance metrics quantify how well
each model predicts. The RF model outperforms the LR model on every metric: R² is higher,
and MAE and RMSE are lower. This is because the RF model, as an ensemble of decision
trees, can identify complex, non-linear relationships between features. It therefore
predicts home values more accurately. The scatter plots provide a visual comparison of the
models by showing the difference between actual and predicted prices for each. The Random
Forest model predicts better because its data points lie closer to the ideal prediction
line. The feature importance plot is a useful tool for explaining which attributes matter
most for residential property values. It is particularly helpful for home owners,
investors, and real estate professionals who want to understand the primary drivers of
property valuation. While informative, the machine learning models are not flawless. More
advanced algorithms such as Gradient Boosting or XGBoost, and more diverse or complete
training data, could further improve performance. The approach nevertheless forms a good
foundation for a predictive model in real estate because it shows the capacity of data
science methods to transform raw housing data into meaningful and actionable property
valuation insights. This code illustrates:
1. A complete workflow for predictive modelling from start to finish.
2. Proactive handling of missing values and feature transformation, which are essential
for successful data analysis.
3. Training and comparison of two models (Linear Regression and Random Forest).
4. Visualisation of model performance and feature importance.
Both models provide insights into the dataset, but Random Forest performs better at
capturing complex patterns.
18. References
Akgün, B., & Öğüdücü, Ş. G. (2015). Streaming linear regression on spark MLlib and
MOA. Proceedings of the 2015 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining 2015,
Dimopoulos, T., Tyralis, H., Bakas, N. P., & Hadjimitsis, D. (2018). Accuracy
measurement of Random Forests and Linear Regression for mass appraisal models
that estimate the prices of residential apartments in Nicosia, Cyprus. Advances in
geosciences, 45, 377-382.
Jui, J. J., Imran Molla, M., Bari, B. S., Rashid, M., & Hasan, M. J. (2020). Flat price
prediction using linear and random forest regression based on machine learning
techniques. Embracing Industry 4.0: Selected Articles from MUCET 2019,
Karan, S. K., & Samadder, S. R. (2018). Improving accuracy of long-term land-use change
in coal mining areas using wavelets and Support Vector Machines. International
Journal of Remote Sensing, 39(1), 84-100.
Khobragade, A. N., Maheswari, N., & Sivagami, M. (2018). Analyzing the housing rate in
a real estate informative system: A prediction analysis. Int. J. Civil Engine. Technol,
9(5), 1156-1164.
Mathotaarachchi, K. V., Hasan, R., & Mahmood, S. (2024). Advanced Machine Learning
Techniques for Predictive Modeling of Property Prices. Information, 15(6), 295.
Maulud, D., & Abdulazeez, A. M. (2020). A review on linear regression comprehensive in
machine learning. Journal of Applied Science and Technology Trends, 1(2), 140-
147.
Probst, P., Wright, M. N., & Boulesteix, A. L. (2019). Hyperparameters and tuning
strategies for random forest. Wiley Interdisciplinary Reviews: data mining and
knowledge discovery, 9(3), e1301.
Ramo, R., & Chuvieco, E. (2017). Developing a random forest algorithm for MODIS
global burned area classification. Remote Sensing, 9(11), 1193.
Rey-Blanco, D., Zofío, J. L., & González-Arias, J. (2024). Improving hedonic housing
price models by integrating optimal accessibility indices into regression and
random forest analyses. Expert Systems with Applications, 235, 121059.
Sarkar, M. R., Rabbani, M. G., Khan, A. R., & Hossain, M. M. (2015). Electricity demand
forecasting of Rajshahi City in Bangladesh using fuzzy linear regression model.
2015 International Conference on Electrical Engineering and Information
Communication Technology (ICEEICT),
Seo, D. K., Kim, Y. H., Eo, Y. D., Park, W. Y., & Park, H. C. (2017). Generation of
radiometric, phenological normalized image based on random forest regression for
change detection. Remote Sensing, 9(11), 1163.