FOREIGN TRADE UNIVERSITY
HO CHI MINH CAMPUS
FINAL REPORT
Course name: Artificial Intelligence in Era of Digital Transformation
Date: 14/07/2025 – Class code: 24 – Course code: AIDE300
No. Full Name ID Peer Evaluation
(0%-100%)
1 Nguyễn Mai Anh 2312255005 100%
2 Nguyễn Thiên Trang 2312255072 100%
3 Võ Lâm Thanh Trúc 2312255074 100%
Grade (in number) Grade (in words)
Examiner 1’s signature Examiner 2’s signature
TABLE OF CONTENTS
CHAPTER I. INTRODUCTION AND OVERVIEW
1. Executive Summary
2. Introduction
2.1. Identify problems
2.2. Report Objective
3. Disclaimer
3.1. AI Prompt
3.2. Dataset
CHAPTER II. PREDICTIVE MODEL BUILDING
1. Data Cleaning Process
2. Data Description
3. Model Development
3.1. Data Preparation
3.2. Model Selection Rationale
3.3. Feature Engineering
4. Model Training
4.1. DXG.VN
4.2. PDR.VN
4.3. VHM.VN
4.4. NVL.VN
4.5. KDH.VN
5. Evaluation & selection
5.1. MAE (Mean Absolute Error)
5.2. MSE (Mean Squared Error)
5.3. R² (R-squared)
6. 5-Day Forecasting (03/07/2025 – 08/07/2025)
6.1. DXG.VN
6.2. PDR.VN
6.3. VHM.VN
6.4. NVL.VN
6.5. KDH.VN
7. Discussion of Potential Causes for the Identified Trends
CHAPTER III. CONCLUSION
1. Findings
2. Recommendations and Future Work
CHAPTER I. INTRODUCTION AND OVERVIEW
1. Executive Summary
Vietnam's real estate market has been highly unstable, affected by COVID-19 and
stricter credit policies, which makes predicting stock prices in this industry increasingly
important. Without effective forecasting tools and data-driven strategies, investors risk
emotionally driven decisions, higher risk exposure, and lost opportunities.
This report addresses that gap by applying and comparing four Machine Learning
models (Linear Regression, Decision Tree, Random Forest, and Support Vector
Regression) to predict the stock prices of five real estate companies in Vietnam: DXG,
PDR, VHM, NVL, and KDH. These were chosen for their diversity in scale,
capitalization, and business models, representative of Vietnam’s real estate sector.
Expanding beyond one single stock enhances model robustness and generalizability, as
supported by Domingos (2012) and Goodfellow et al. (2016).
Historical prices and technical indicators were preprocessed, and the models were
then trained on them. Performance was evaluated using MAE, MSE, and R².
This research aims to give investors a baseline for decision-making and to serve as a step
toward more advanced Deep Learning applications in financial prediction.
2. Introduction
2.1. Identify problems
The real estate sector in Vietnam has experienced a long period of instability in
the post-COVID recovery period because of credit tightening, legal barriers, and
fluctuations in interest rates, among other factors. Official data show that in Q1 2020
only 1,300 real estate units were sold, or 14.3 percent of total supply, the lowest
absorption rate in four years (ThS. Nguyen Thi Hoa, 2020). Meanwhile, credit growth to
the real estate sector slowed significantly, from 26 percent in 2018 to 12 percent in 2020,
indicating tighter lending policies and limited access to funds (Phan Nam, 2022). These
changes have increased market uncertainty and tested the decision-making of both
institutional and retail investors.
In this regard, there is an increasing need for data-driven tools that can facilitate
more precise and timely investment decisions. Machine Learning (ML), which can
capture complex patterns and nonlinear relationships, has become an interesting method
of stock price prediction. Nevertheless, its performance in a turbulent and noisy market
such as real estate is not well-researched in the Vietnamese setting.
2.2. Report Objective
This study aims to apply Machine Learning models to predict real estate stock
prices over the next five trading days. Specifically, the team will:
- Compare the effectiveness of four machine learning models (Linear Regression,
Decision Tree, Random Forest, and Support Vector Regression) using common
evaluation metrics: MAE, MSE, and R².
- Determine the best-performing prediction model for each stock code to give
individual investors a firmer basis for decision-making.
- Lay the foundation for the development of a Deep Learning model with long-term
prediction capabilities and better adaptability to unstructured factors such as
market events. This model will be tested on highly volatile stocks such as NVIDIA
in the next phase of the course.
3. Disclaimer
3.1. AI Prompt
The code and models in this report were built on Google Colab using historical
data of five major Vietnamese real estate stocks (DXG - Dat Xanh Group Joint Stock
Company, PDR - Phat Dat Real Estate Development Joint Stock Company, VHM -
Vinhomes Joint Stock Company, NVL - No Va Land Investment Group Corporation -
Novaland, KDH - Khang Dien House Trading and Investment Joint Stock Company)
from 01/07/2015 to 01/07/2025, retrieved through the yfinance API. Core Python
libraries including pandas, numpy, scikit-learn, matplotlib, and seaborn were used for
data processing, model training, evaluation, and visualization. The code structure
includes:
- Calculation of technical indicators (e.g., EMA_10, EMA_20, MACD, Signal Line,
ATR, Upper Band);
- Removal of multicollinearity via correlation analysis;
- Model training and testing using Linear Regression, Decision Tree, Random
Forest, and SVR;
- Model comparison using MAE, MSE, and R²;
- Forecasting and visualizing the next 5 trading days’ prices for each stock.
The code was partially generated with the help of AI tools (ChatGPT-4o, Gemini 2.0,
DeepSeek V3, Grok 3.0) and later refined by the team to fit the project’s needs. It is
intended purely for academic use, not as financial advice.
3.2. Dataset
The five price series differ in historical depth, with available records starting at
different points between 2015 and 2023. This reflects real-world forecasting, where data
availability differs across assets. The variation makes it possible to evaluate model
flexibility and generalization across different levels of data richness, in line with
recommendations to test model robustness under diverse conditions (Géron, 2019).
CHAPTER II. PREDICTIVE MODEL BUILDING
1. Data Cleaning Process
Figure 1. Data cleaning Code
Missing values were handled with a forward fill followed by a backward fill, which
preserves continuity without dropping rows. One duplicate row was identified and
removed to protect data integrity. The key columns (open, high, low, close, price, and
volume) were cast to float so that they can be used in numerical calculations.
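The cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the code from Figure 1: the helper name clean_prices, the yfinance-style column names, and the toy data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Fill gaps, drop duplicate rows, and cast price columns to float."""
    # Forward fill first, then backward fill any gaps left at the start.
    cleaned = df.ffill().bfill()
    # Remove rows that are exact duplicates of an earlier row.
    cleaned = cleaned.drop_duplicates()
    # Column names are assumed here; adjust them to the actual dataset.
    numeric_cols = ["Open", "High", "Low", "Close", "Volume"]
    cleaned[numeric_cols] = cleaned[numeric_cols].astype(float)
    return cleaned

# Toy data with two missing values and one duplicated row (illustrative only).
raw = pd.DataFrame({
    "Open":   [np.nan, 10.0, 10.0, 11.0],
    "High":   [10.5, 10.8, 10.8, np.nan],
    "Low":    [9.8, 9.9, 9.9, 10.7],
    "Close":  [10.2, 10.4, 10.4, 11.1],
    "Volume": [1000, 1200, 1200, 900],
})
clean = clean_prices(raw)
print(clean)
```

Filling forward before filling backward means a gap is first covered by the most recent known value, and the backward fill only affects gaps at the very start of a series.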
2. Data Description
Figure 2. Historical Closing Prices of Selected Stocks (2015–2025)
The chart shows the differences in price trends and volatility of the chosen
stocks throughout time. Such differences underline the need to train an individual
model per stock and select features that reflect each stock’s unique nature to
enhance prediction accuracy.
Figure 3. Distribution of Closing Prices (2015–2025)
Most closing prices fall between 10,000 and 30,000 VND, with DXG and KDH
appearing more frequently in this range. In contrast, VHM shows a broader and higher
price spread, reflecting stronger historical valuation. These variations may affect how
each model learns and predicts stock behavior.
Figure 4. Cumulative Returns Over Time (2015–2025)
Stocks like KDH and DXG demonstrate strong and consistent long-term growth,
while others such as PDR, VHM, and NVL display shorter time horizons and greater
volatility. Consistent trends support stable predictions, whereas noisy or limited data
challenge the model’s ability to generalize.
Figure 5. 20-Day Exponential Moving Averages (EMA) of Stock Prices
The EMA plot shows strong upward trends in VHM, DXG, and KDH, while NVL
and PDR display lower and more volatile patterns. Clearer trends may aid model
learning, whereas volatile series require more robust algorithms to capture patterns.
Figure 6. Rolling Standard Deviation of Daily Returns for Selected Stocks
The rolling standard deviation reveals volatility differences across stocks. DXG
and KDH show broader historical coverage and cyclical patterns, while PDR, VHM, and
NVL have shorter, more volatile windows from 2023. These distinctions highlight
varying risk profiles and their impact on model generalization.
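The quantities plotted in Figures 4 to 6 (cumulative returns, the 20-day EMA, and the rolling standard deviation of daily returns) can be computed along the following lines. This is a hedged sketch on a toy price series. The 3-day rolling window is an assumption chosen so the short sample produces values, since the report does not state the window it used.

```python
import pandas as pd

# Toy closing prices (illustrative values, not real market data).
close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0, 108.0])

# Daily simple returns; the first entry is NaN because it has no previous day.
daily_ret = close.pct_change()

# Cumulative return relative to the first observation.
cum_ret = (1 + daily_ret).cumprod() - 1

# 20-day exponential moving average of the closing price.
ema_20 = close.ewm(span=20, adjust=False).mean()

# Rolling volatility: standard deviation of daily returns over a window.
# The window length (3) is an assumption for this short example.
rolling_vol = daily_ret.rolling(window=3).std()

print(cum_ret.iloc[-1])
print(rolling_vol)
```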
3. Model Development
3.1. Data Preparation
We used a supervised learning setup, where the target variable was the next-day
closing price. A Target column was created by shifting the Close column by one time
step. The last row was dropped due to a missing target value from the shift. Each ticker
was printed to verify dataset integrity after cleansing.
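As a small illustration of the target construction described above, the sketch below shifts the Close column up by one row so that each row's Target is the following day's close, then drops the final row. The toy values are assumptions.

```python
import pandas as pd

# Toy closing prices (illustrative only).
df = pd.DataFrame({"Close": [10.0, 10.5, 10.2, 10.8]})

# shift(-1) aligns each row with the NEXT day's closing price.
df["Target"] = df["Close"].shift(-1)

# The last row has no next-day value, so it is removed.
df = df.dropna(subset=["Target"])
print(df)
```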
Each dataset was split into 80% training and 20% testing without shuffling to
preserve time-series order. For example, DXG.VN had 1,982 training rows and 496
testing rows, which gives the models enough training data while keeping every test
observation later in time than the training observations.
The same pattern was applied to all five real estate stocks. Processed datasets
(X_train, X_test, y_train, y_test) were stored in a structured dictionary for easy access
during model training and evaluation.
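A chronological 80/20 split of this kind can be sketched as below. The helper name chronological_split and the placeholder arrays are assumptions. The row count of 2,478 is the sum of the DXG.VN training and testing rows quoted above.

```python
import numpy as np

def chronological_split(X, y, train_frac=0.8):
    """Split without shuffling: the earliest rows train, the latest rows test."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]

n_rows = 2478  # 1,982 + 496 rows, as quoted for DXG.VN
X = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y = np.arange(n_rows)                         # placeholder targets in time order

X_train, X_test, y_train, y_test = chronological_split(X, y)

# Store the pieces per ticker in a dictionary for later training and evaluation.
datasets = {
    "DXG.VN": {
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
    }
}
print(len(X_train), len(X_test))
```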
Figure 7. Data splitting
3.2. Model Selection Rationale
To evaluate model performance in stock price prediction, four supervised learning
algorithms were trained and compared: Linear Regression (LR), Decision Tree (DT),
Random Forest (RF), and Support Vector Regressor (SVR) with a linear kernel for speed
and interpretability.
For each stock, models were trained on (X_train, y_train) and evaluated on
(X_test, y_test). Performance was assessed using MAE, MSE, and R² — capturing
average error, penalized error magnitude, and explained variance, respectively. Each
model was re-fitted per stock to account for differences in market behavior.
This evaluation framework provided insight into how each algorithm performed
across different data volumes and volatility patterns, helping identify the most suitable
model for each stock.
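The comparison loop described in this section might look like the sketch below. It uses scikit-learn estimators on synthetic data. The hyperparameters (for example n_estimators=50 and the random seeds) and the synthetic features are assumptions for illustration, not the settings used in the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for one stock's features and next-day target.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Chronological 80/20 split (no shuffling).
cut = int(n * 0.8)
X_train, X_test, y_train, y_test = X[:cut], X[cut:], y[:cut], y[cut:]

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "SVR (linear)": SVR(kernel="linear"),
}

# Fit each model on the training part and score it on the held-out part.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, scores)
```

In practice the same loop would be repeated once per ticker, so that each model is re-fitted on that stock's own training data.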
3.3. Feature Engineering
A set of eight technical indicators commonly used in finance and research was
selected to train the forecasting model. These include EMA_10 and EMA_20 for short-
and medium-term trends, MACD and its Signal Line for momentum, Bollinger Upper
Band for price range, ATR for market volatility, High price for daily fluctuation
amplitude, and Closing price.
Feature selection focused on avoiding mathematical and conceptual overlap
among technical indicators. For example, EMA and SMA both use moving averages,
while MACD and RSI are momentum-based. Including highly correlated features can
increase multicollinearity, skew learning, and reduce interpretability. Thus, only
indicators with statistical and conceptual independence were selected.
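The indicators listed above can be derived from OHLC data roughly as follows. The parameter choices are assumptions, because the report does not state them: MACD from 12- and 26-day EMAs with a 9-day signal line, a 20-day Bollinger band at two standard deviations, and a 14-day ATR taken as a simple rolling mean of the true range. The helper name add_indicators and the synthetic prices are also illustrative.

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Append trend, momentum, band, and volatility indicators to OHLC data."""
    out = df.copy()
    close = out["Close"]

    # Short- and medium-term trend.
    out["EMA_10"] = close.ewm(span=10, adjust=False).mean()
    out["EMA_20"] = close.ewm(span=20, adjust=False).mean()

    # Momentum: MACD and its signal line (assumed 12/26/9 parameters).
    ema_12 = close.ewm(span=12, adjust=False).mean()
    ema_26 = close.ewm(span=26, adjust=False).mean()
    out["MACD"] = ema_12 - ema_26
    out["Signal_Line"] = out["MACD"].ewm(span=9, adjust=False).mean()

    # Bollinger upper band (assumed 20-day window, 2 standard deviations).
    out["Upper_Band"] = close.rolling(20).mean() + 2 * close.rolling(20).std()

    # Average True Range (assumed 14-day simple mean of the true range).
    prev_close = close.shift(1)
    true_range = pd.concat(
        [
            out["High"] - out["Low"],
            (out["High"] - prev_close).abs(),
            (out["Low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    out["ATR"] = true_range.rolling(14).mean()

    # Drop the warm-up rows where rolling windows are not yet full.
    return out.dropna()

# Synthetic prices for illustration only.
rng = np.random.default_rng(7)
n = 120
close = 20.0 + np.cumsum(rng.normal(0.0, 0.3, size=n))
spread = rng.uniform(0.1, 0.6, size=n)
prices = pd.DataFrame({"High": close + spread, "Low": close - spread, "Close": close})

features = add_indicators(prices)
cols = ["EMA_10", "EMA_20", "MACD", "Signal_Line", "Upper_Band", "ATR", "High", "Close"]

# Pairwise correlations give one way to inspect overlap between features.
corr = features[cols].corr()
print(corr.round(2))
```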
Figure 8. Feature Selections
4. Model Training
The four models were applied to each stock, using historical features to predict
next-day closing prices. Predicted and actual values during the test period were visualized
on line graphs, helping to assess how well each model captured trends, fluctuations, and
patterns over time.
4.1. DXG.VN
Figure 9. Model Training for DXG.VN
The Linear Regression and SVR forecasts stay close to the actual prices and are
steady and smooth. Decision Tree and Random Forest, by contrast, show signs of
overfitting in periods of high fluctuation. Hence, Linear Regression is the best model for
DXG in terms of generalization and long-term predictive power.
4.2. PDR.VN
Figure 10. Model Training for PDR.VN
The Linear Regression and SVR forecasts are smooth and follow the price
movements closely. The Decision Tree and Random Forest forecasts are jagged, which
indicates possible overfitting. Accordingly, Linear Regression is the most appropriate
choice for PDR, balancing accuracy and stability.
4.3. VHM.VN
Figure 11. Model Training for VHM.VN
Linear Regression fits well and tracks the actual price. SVR also performs well,
although its predictions are occasionally slightly off. Decision Tree and Random
Forest are insensitive and miss short-term variations. Visually, Linear Regression
is the most effective model for VHM because it captures both long-term
and short-term trends well.
4.4. NVL.VN
Figure 12. Model Training for NVL.VN
Both Linear Regression and SVR produce natural, smooth prediction curves.
Decision Tree and Random Forest tend to flatten fluctuations, which loses detailed
signals. Based on this graph, Linear Regression is the best choice for NVL, thanks to
its high accuracy and smooth curve.
4.5. KDH.VN
Figure 13. Model Training for KDH.VN
All models closely followed the actual price, but Linear Regression and SVR were
smoother and more stable. Decision Tree was slightly noisy, while Random Forest tended
to overfit. Thus, Linear Regression was the best model for KDH due to its stable
performance and practical simplicity.
Graphically, Linear Regression and Support Vector Regressor showed the best
results, with predicted lines mostly matching the actual prices. However, a final decision
cannot be made on visual results alone. To choose the most appropriate model, the
evaluation metrics must also be taken into account.
5. Evaluation & selection
According to the No Free Lunch theorem (Wolpert & Macready, 1997), no single
model works best for all cases, so four different models were tested on the same data. To
ensure objective selection, three common regression evaluation metrics were used:
5.1. MAE (Mean Absolute Error)
Figure 14. Mean Absolute Error (MAE) - Training Model
Based on MAE, Linear Regression was the most accurate and stable, especially for
DXG, NVL, and KDH. Decision Tree and Random Forest performed poorly on VHM
and PDR, likely due to overfitting. SVR was steady but less effective than Linear
Regression.
5.2. MSE (Mean Squared Error)
Figure 15. Mean Squared Error (MSE) - Training Model
Linear Regression recorded the lowest and least variable MSE on stocks such as
DXG, NVL, and KDH, which further confirms its effectiveness. By contrast, Decision
Tree and Random Forest performed poorly on VHM and PDR, probably because of
overfitting. SVR came close to Linear Regression but did not outperform it.
5.3. R² (R-squared)
Figure 15. R² (R-squared) - Training Model
Linear Regression achieved strong R² scores across most stocks, including DXG,
NVL, VHM, and KDH, showing strong explanatory power. SVR performed slightly
lower, while Decision Tree and Random Forest struggled with PDR and VHM due to
poor generalization on volatile data. Overall, Linear Regression was the most reliable in
terms of R².
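For reference, the three metrics used in this section can be written out directly. The sketch below uses plain Python and made-up numbers. It also shows how R² becomes negative when a model's predictions are worse than simply predicting the mean of the actual values.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average squared error, penalizing large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only.
actual = [10.0, 12.0, 14.0, 16.0]
close_fit = [10.5, 11.5, 14.5, 15.5]   # every prediction is off by 0.5
poor_fit = [16.0, 10.0, 16.0, 10.0]    # worse than predicting the mean

print(mae(actual, close_fit), mse(actual, close_fit), r2(actual, close_fit))
print(r2(actual, poor_fit))
```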
6. 5-Day Forecasting (03/07/2025 – 08/07/2025)
Figure 16. 5-Day Forecasting (03/07/2025 – 08/07/2025)
The code in Figure 16 applies Linear Regression, the best-performing model, to
predict stock prices over the 5-trading-day interval (03/07/2025 – 08/07/2025). The
output contains the predicted closing prices of the five real estate stocks.
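The report's forecasting code is shown only as a figure, so the sketch below is not a reproduction of it. It illustrates one common way to extend a next-day model to several days, namely recursive forecasting: each predicted close is appended to the history, the features are recomputed, and the model predicts again. The reduced feature set (Close and EMA_10), the ordinary least squares fit via numpy, the helper names, and the synthetic prices are all assumptions.

```python
import numpy as np
import pandas as pd

def make_features(close: pd.Series) -> pd.DataFrame:
    """Reduced, assumed feature set: last close and its 10-day EMA."""
    return pd.DataFrame({
        "Close": close,
        "EMA_10": close.ewm(span=10, adjust=False).mean(),
    })

def recursive_forecast(close: pd.Series, horizon: int = 5) -> list:
    """Fit a linear model for the next-day close, then roll it forward."""
    feats = make_features(close)
    X = feats.iloc[:-1].to_numpy()   # features on day t
    y = close.iloc[1:].to_numpy()    # closing price on day t + 1

    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    history = close.reset_index(drop=True)
    preds = []
    for _ in range(horizon):
        # Recompute features on the history, including earlier predictions.
        last = make_features(history).iloc[-1].to_numpy()
        next_close = float(coef[0] + last @ coef[1:])
        preds.append(next_close)
        history = pd.concat([history, pd.Series([next_close])], ignore_index=True)
    return preds

# Synthetic price path for illustration only.
rng = np.random.default_rng(1)
close = pd.Series(20.0 + np.cumsum(rng.normal(0.05, 0.3, size=250)))
forecast = recursive_forecast(close)
print(forecast)
```

Because each step feeds on the previous prediction, errors can accumulate over the horizon, and a linear model rolled forward in this way tends to produce smooth paths that do not anticipate sudden jumps.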
Figure 17. 5-Day Forecasting Graph
6.1. DXG.VN
Table 1. 5-Day Forecasting Metrics - DXG.VN
In the first days the model predicted fairly accurately, but it strongly
under-predicted on 8 July, when the actual price rose sharply. Linear Regression
was not responsive enough to sudden changes and could not follow the rapid
upward movement.
6.2. PDR.VN
Table 2. 5-Day Forecasting Metrics - PDR.VN
The model tracked PDR more closely, and the error remained small throughout the
period. On 7 July the prediction was slightly too high, yet the model followed the
actual trend relatively closely, consistent with the stock's moderate fluctuations.
6.3. VHM.VN
Table 3. 5-Day Forecasting Metrics - VHM.VN
The model frequently underestimated the stock price, most notably on 8 July,
when the price rose sharply. This indicates that the model failed to keep pace with
the rapid growth and was conservative when predicting highly volatile stocks.
6.4. NVL.VN
Table 4. 5-Day Forecasting Metrics - NVL.VN
The prediction matched the actual price closely during the first few days, but it
failed to capture the falling trend on 8 July, which resulted in an over-prediction.
The model assumed a continuing upward trend, whereas the market actually turned
down.
6.5. KDH.VN
Table 5. 5-Day Forecasting Metrics - KDH.VN
As with VHM, the model consistently underestimated the actual price,
particularly when the price rose significantly towards the end of the period. The
model is appropriate only in a stable market and is not sensitive enough to capture
sharp rebounds.
7. Discussion of Potential Causes for the Identified Trends
The patterns observed in the forecast accuracy for the five real estate stocks may
be explained by a few general factors.
The volatility and price range of each stock had a strong influence. The model
generalised well on stocks with steady, moderate variance but produced larger errors
when predicting stocks with sudden spikes or highly volatile prices.
The linearity of the selected model, Linear Regression, limited its ability to
detect nonlinear dynamics or sudden shifts in the market. The model was able to
follow general trends but had problems responding to abrupt directional changes,
particularly towards the end of the forecast period.
The model relies only on technical signals derived from past prices and does not
incorporate external qualitative data such as financial news or policy revisions. This
left it less sensitive to near-term market drivers that can move real estate stocks.
Lastly, the length and quality of the data affected the performance of the
model. Stocks with longer and more consistent historical records could be modeled
more accurately and generalized better. Forecasts for stocks with limited or volatile
datasets, on the other hand, were less reliable.
Taken together, these factors point to the need for more adaptive modeling
strategies and richer input features in order to improve predictive performance in a
complex and volatile market such as Vietnam, where the real estate industry plays a
prominent role.
CHAPTER III. CONCLUSION
1. Findings
While no single model proved to be universally optimal, Linear Regression and
Support Vector Regression consistently performed well across most tickers. Among
these, Linear Regression demonstrated the lowest MAE and MSE, along with relatively
high R-squared values, indicating strong predictive ability in stable market conditions.
In contrast, Decision Tree and Random Forest models occasionally suffered from
overfitting, particularly when market patterns were noisy or inconsistent. In some cases,
these models even produced negative R-squared scores, suggesting poor generalization
on the test set. The performance gaps among models highlighted the inherent limitations
of traditional machine learning models, particularly their reduced effectiveness in
short-term forecasts under conditions of high volatility or sudden, news-driven market
shifts.
Overall, the modeling pipeline, which included feature engineering, technical
indicator extraction, and sequential train-test splitting, was successful in generating
reliable baseline forecasts for relatively stable periods. However, the ability to capture
sharp price movements remains limited under the current framework.
2. Recommendations and Future Work
For future development, it is recommended to explore deep learning architectures,
especially LSTM (Long Short-Term Memory) models, which are well-suited for
capturing temporal dependencies and could offer improved accuracy for volatile stocks
such as NVIDIA.
Consider integrating quantitative data (historical stock prices or technical
indicators) with qualitative inputs such as news headlines, market sentiment, and social
media signals. This multimodal approach could enhance model responsiveness to
external shocks and market narratives.
Development of hybrid models (e.g., Bi-directional LSTM or BERT combined
with numerical features) is suggested to better capture both sequential dependencies and
semantic patterns from text-based data sources.
Finally, implementing an early warning system that reacts to real-time information
such as surprise product releases (e.g., the launch of DeepSeek) would increase model
adaptability and forecasting robustness in high-impact, short-horizon scenarios.
REFERENCES
1. Phan Nam. (2022, May 10). Tín dụng bất động sản: “Nắn” chứ không nên “siết.”
Nhịp Sống Kinh Tế Việt Nam & Thế Giới.
https://vneconomy.vn/techconnect//tin-dung-bat-dong-san-nan-chu-khong-nen-siet.htm?utm_source=chatgpt.com
2. ThS. Nguyễn Thị Hoa. (2020). Thị trường bất động sản Việt Nam trong cơn bão
Covid-19: Đón chờ lực bật mới. Consosukien.vn.
https://consosukien.vn/thi-truong-bat-dong-san-viet-nam-trong-con-bao-covid-19-don-cho-luc-bat-moi.htm
3. Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for
optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.
https://doi.org/10.1109/4235.585893
4. Domingos, P. (2012). A few useful things to know about machine learning.
Communications of the ACM, 55(10), 78–87.
https://doi.org/10.1145/2347736.2347755
5. Murphy, J. J. (1999). Technical analysis of the financial markets: A comprehensive
guide to trading methods and applications. New York Institute of Finance.
6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
https://www.deeplearningbook.org/
7. Kaufman, P. J. (2013). Trading systems and methods (5th ed.). Wiley.
8. Zhang, X., Qu, Y., & Li, R. (2017). Stock price prediction via discovering
multi-frequency trading patterns. Proceedings of the 26th International
Conference on World Wide Web, 1231–1240.
https://doi.org/10.1145/3097983.3098131
9. Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and
TensorFlow (2nd ed.). O'Reilly Media.
10. [AIDE300] GROUP 4 - MIDTERM COLAB NOTEBOOK
GOOGLE COLAB
https://colab.research.google.com/drive/1sBobAz4spemHli97SZjQGWN0h7wILyCF?usp=sharing&fbclid=IwY2xjawLiAGFleHRuA2FlbQIxMQABHpU_d-DujWbtIJ8_VH5YgYAOOlQKGY8v60sMqXqxDzpQNu9ahaSl1sdw6RLb_aem_sD0Com7R9MOQh7IgUeiv2g