
Hindawi

Scientific Programming
Volume 2022, Article ID 4758698, 12 pages
https://doi.org/10.1155/2022/4758698

Research Article
Research on Stock Price Time Series Prediction Based on Deep
Learning and Autoregressive Integrated Moving Average

Daiyou Xiao¹ and Jinxia Su²
¹School of Finance, Central University of Finance and Economics, Beijing, China
²School of Business, Central University of Finance and Economics, Beijing, China

Correspondence should be addressed to Daiyou Xiao; [email protected]

Received 7 December 2021; Revised 24 January 2022; Accepted 21 February 2022; Published 31 March 2022

Academic Editor: Hangjun Che

Copyright © 2022 Daiyou Xiao and Jinxia Su. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Different from traditional algorithms and models, machine learning is a systematic and comprehensive application of computer algorithms and statistical models, and it has been widely used in many fields. In finance, machine learning is mainly used to study the future trend of capital market prices. In this paper, to predict stock time-series data, we apply a traditional model and a machine learning model to the linear and nonlinear parts of the problem, respectively. First, stock samples from 2010 to 2019 on the New York Stock Exchange are collected. Next, the ARIMA (autoregressive integrated moving average) model and the LSTM (long short-term memory) neural network model are applied to train on and predict stock prices and stock price correlations. Finally, we evaluate the proposed model with several indicators, and the experimental results show that: (1) stock prices and stock price correlations are accurately predicted by the ARIMA model and the LSTM model; (2) compared with ARIMA, the LSTM model performs better in prediction; and (3) the ensemble ARIMA-LSTM model significantly outperforms the other benchmark methods. Therefore, our proposed method provides theoretical support and a methodological reference for investors trading in the Chinese stock market.

1. Introduction

Stock market forecasting attempts to determine the future value of corporate stocks or other financial instruments traded on exchanges. A successful forecast of the future stock price can yield considerable profit. According to the EMH (efficient market hypothesis), stock prices reflect all existing information, so any price change not based on newly released information cannot be forecast. Although other people disagree with this hypothesis, some supporters of the view hold countless methods and techniques that supposedly give them access to future price information. Stock market forecasting is especially difficult, given the nonlinearity, volatility, and complexity of the market. Before the emergence of machine learning technology, stock market forecasts were generally made through fundamental and technical analysis. As computer technologies such as machine learning emerged and developed in business [1], deep learning, and especially the neural network model, has become the current hot spot of stock prediction modeling. Meanwhile, stock market forecasting has become more convenient and efficient thanks to these technologies [2]. At present, stock forecasting models usually fall into traditional linear models and models represented by deep learning. However, since time-series data have both linear and nonlinear parts, forecasts produced by a single model are usually not very reliable. Therefore, many experts and scholars combine various single models to significantly improve the accuracy and stability of the forecasting results.

In addition, the coefficient of association between the stock index and its constituent stocks can reflect the sensitivity of the constituent stocks to changes in the stock index, that is, the correlation between the constituent stocks and the stock index (also known as "stock character"), which investors can refer to when adapting their investment strategies.
According to the market trend forecast, extra income can be expected by choosing stocks with different β coefficients. Moreover, the stock index and its constituent stock prices often keep their trends in sync in the global stock market. Therefore, besides predicting the stock index and single stock prices, better portfolio strategies can be worked out by forecasting the correlation coefficients of the expected constituent stocks of the stock index for higher returns on investment.

Based on all of this, this paper takes the highly representative S&P 500 stock index and its constituent stocks as the research object, forecasts the future trend of the S&P 500 stock index through forecasting models, and then predicts the correlation coefficient between its constituent stocks and the stock index, so as to formulate an optimal investment strategy that investors can refer to, at least to a certain extent.

Over the past few decades, much social science research has focused on predicting social and economic development trends with quantitative methods. Many feasible methods in time-series analysis, each with advantages and disadvantages, can be interpreted as techniques for using past data to build forecasts and strategies about future values.

First, research on linear models: as early as the 1990s, the ARIMA (autoregressive integrated moving average) method was already being used by scholars to forecast the capital market. Some researchers used ARIMA and its coefficients to predict stock market data [3], and in their experiments they found that the result was better than the prediction under the null hypothesis of random fluctuations around the base value. The ARIMA model has since been used in many fields, including temperature prediction, electricity price prediction, and wind speed. Some studies adopted an ARIMA process in their research [4]. Yang et al. selected the Shanghai Composite Index to structure an ARIMA model [5]. Kim and Sayama developed a new method aiming to forecast the future trend of the S&P 500 index by establishing a complex network of time series based on the S&P 500 index and then linking the network to interconnected weights [6]. The study showed that adding network measurement results to ARIMA can improve prediction accuracy. Khashei and Hajirahimi consider that the time series in a hybrid model is divided into linear and nonlinear parts [7]; therefore, ARIMA and MLP (multilayer perceptron) are chosen to build hybrid models. They also found that, on the whole, the ANN-ARIMA hybrid model can be adopted to achieve more accurate results. Unggara et al. used the Firefly algorithm to optimize the ARIMA (p, d, q) model and determined the best ARIMA model by looking for the smallest AIC (Akaike information criterion) value [8]. As a result, the ARIMA model optimized by the Firefly algorithm has better forecasting performance.

Second, research on neural network models: the LSTM (long short-term memory) network, which has achieved further success in processing large data sets, is mainly used for deeper learning. Although the LSTM model is limited in the number of inputs, Siami-Namini and Namin attempted to use the LSTM on financial data sets [9]. Experimental results indicate that the proposed method performs well in predicting economic and financial time series. Other researchers put forward a stock price prediction method using deep learning models [10]: 14 different DL methods similar to LSTM are comprehensively applied to S&P stocks and the BSE-BANKEX stock index, forecasting one or even four steps ahead. It is found that the DL methods proposed in their research obtain good prediction results for stock prices. Joo Il and Seung-Ho proposed a stock price forecast model based on a bidirectional LSTM recurrent neural network, which adds a hidden layer in the opposite direction of the data flow to deal with the limitations of the previous RNN-based model [11]. It was found that, compared with the nonbidirectional LSTM recurrent neural network, the stock price prediction model using the bidirectional LSTM recurrent neural network has higher accuracy. To get rid of the high noise in stock data, researchers applied the wavelet threshold denoising method to preprocess the initial data sets [12]. In their study, the soft/hard threshold method used for data preprocessing had a significant effect on noise suppression. Based on this research, a new multioptimal combination wavelet transform (MOCWT) was proposed, and the research finally showed that MOCWT is more accurate in forecasting than traditional methods. Researchers also proposed the LSTM model and employed it for intraday stock forecasts [13]. Chen and Ge explored the attention mechanism in LSTM-based stock price movement forecasting and found that it significantly improved the forecasting performance [14].

Third, research on hybrid models: Zhang used an ARIMA and ANN hybrid method to study time series estimation [15]. Narendra Babu and Eswara Reddy proposed a linear hybrid model that can simultaneously maintain prediction accuracy and the trend of the data [16]. Baek and Kim proposed a novel data enhancement method for stock market index prediction based on the ModAugNet framework [17]. The method includes an overfitting-prevention LSTM module and a prediction LSTM module, and the analysis found that the test performance depends entirely on the latter. An ensemble method combining LSTM with GARCH has also been proposed [18]; it has high predictive ability and good applicability. Chen et al. proposed a new ensemble model for portfolio selection problems with skewness and kurtosis [19].

Through the analysis of the recent literature, it can be found that domestic and foreign forecasting models can be roughly divided into linear, nonlinear, and hybrid models. In general, the current state of research at home and abroad can be summarized as follows: research on linear models mainly focuses on the ARIMA model; in recent studies, many researchers place more trust in the predictive performance of nonlinear models than in that of linear models; and the hybrid model is the best predictive model of all, since it can not only process the linear part of time-series data but also handle its nonlinear part well. Therefore, in our study, a single method is first used to predict the trend of the stock index, and then a hybrid one is adopted to predict the correlation coefficients between the stock index and its constituent stocks, so as to provide investors with guidance for profit to a certain extent.

2. Methods

In this section, we first introduce the ARIMA model and the LSTM model, respectively, and finally introduce our proposed integration model.

2.1. Autoregressive Integrated Moving Average Model. In ARIMA (p, d, q), p is the number of autoregressive terms, q is the number of moving average terms, and d is the number of differences taken until the time series becomes stationary. The prediction results can be tuned by adjusting the three parameters d, p, and q so as to obtain the optimal model. The model is calculated as follows:

y_t = θ_0 + Φ_1 y_(t−1) + · · · + Φ_p y_(t−p) + ε_t − θ_1 ε_(t−1) − θ_2 ε_(t−2) − · · · − θ_q ε_(t−q),  (1)

where y_t and ε_t are the actual value and the random error in time period t, respectively; Φ_i (i = 1, 2, ..., p) and θ_j (j = 1, 2, ..., q) are the model parameters; and p and q, the orders of the model (both integers), are the parameters mentioned earlier. The random error ε_t, whose mean is 0, is assumed to be independent and identically distributed in the model, with constant variance denoted σ². Equation (1) contains several important special cases of the ARIMA family of models. If q = 0, equation (1) simplifies to an AR model of order p; when p = 0, the model simplifies to an MA model of order q. The model order (p, q) is the key link in ARIMA model construction and determines the accuracy of the model's predictions. The parameters of the AR and MA operations are defined as (p) and (q), respectively, and these two parameters need to be determined from the autocorrelation graph (ACF).

ARIMA modeling includes the following steps:

Step 1: Data diagnosis and checking: In the first step, it is necessary to check the stationarity of the given time-series data, which is essential for improving the accuracy of forecasting. A stationary time series is one whose statistical properties, such as mean, variance, and covariance, do not depend on time.

Step 2: Model parameter estimation: To stabilize a nonstationary time series, an appropriate degree of differencing (d) is applied, and the stationarity test is repeated until a stable series is obtained. (d) is a positive integer that gives the degree of differencing. If the differencing operation is performed (d) times, the integration parameter of the ARIMA model is set to (d), and the resulting stationary data are then identified. In this process, the ACF graph and the partial autocorrelation graph (PACF graph) of the model are determined.

Step 3: Model identification and selection: After ensuring that the input variable is a stationary series, the parameter d has been determined. Next, calculation algorithms are used to estimate the parameters and find the coefficients most suitable for the selected ARIMA model. The AIC or BIC criterion is then used to test the model and to select the one with the minimum criterion value. The essence of both criteria is maximum likelihood estimation or nonlinear least squares estimation, and the AIC criterion is chosen in this article.

Step 4: Model testing: Model testing checks whether the estimated model meets the requirements of a stationary univariate process. In particular, the residuals of the model output should be independent of each other, and their mean and variance should not change with time. If the estimate is insufficient, the modeling process must be resumed to build a better model.

Step 5: Data prediction: After the ARIMA model with the minimum AIC value is obtained, the data can be input into the model to predict their linear part.

2.2. Long Short-Term Memory Model. Many researchers have found that different models are good at dealing with different types of prediction problems. This provides a basis for using the ARIMA-LSTM hybrid model, which handles both the linear and the nonlinear parts, to produce better results than a single method. Figure 1 shows the internal structure of the cells in the LSTM neural network.

Our study used the standard LSTM, which includes four interacting neural network components (the forget gate, input gate, input candidate gate, and output gate):

f_t = σ(W_f × [h_(t−1), x_t] + b_f),  (2)

where σ represents the sigmoid activation function,

σ(x) = 1 / (1 + e^(−x)).  (3)

Then, a new unit state C_t is obtained from the input gate; this state is used as the updated unit state in the next time step. The input gate employs σ as its activation function and produces i_t and C̃_t as outputs; i_t is employed to determine which features of C̃_t are reflected in C_t:

i_t = σ(W_i × [h_(t−1), x_t] + b_i),
C̃_t = tanh(W_c × [h_(t−1), x_t] + b_c).  (4)

The σ function outputs a value in the range 0 to 1, and tanh outputs a value in the range −1 to 1.

Next, the value emitted as h_t is selected by the output gate o_t together with the cell state C_t:

o_t = σ(W_o × [h_(t−1), x_t] + b_o),  (5)

C_t = f_t × C_(t−1) + i_t × C̃_t,  (6)

h_t = o_t × tanh(C_t).  (7)

Equations (6) and (7) produce C_t and h_t, which are passed to the next time step. The experiment in this article is a regression problem, and the range of the output value of the proposed model is −1 to 1; therefore, the last element is activated by the tanh function.

Figure 1: The internal structure of the cells in the LSTM neural network.
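To make the gate computations in equations (2)-(7) concrete, the following is a minimal NumPy sketch of a single LSTM cell step; the weight shapes and variable names here are illustrative assumptions, not the exact configuration used in this paper.

```python
import numpy as np

def sigmoid(x):
    # Equation (3): squashes values into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step following equations (2)-(7).

    x_t: input at time t, shape (input_size,)
    h_prev, c_prev: previous hidden and cell states, shape (hidden_size,)
    Each W_* has shape (hidden_size, hidden_size + input_size).
    """
    z = np.concatenate([h_prev, x_t])        # [h_(t-1), x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate, eq. (2)
    i_t = sigmoid(W_i @ z + b_i)             # input gate, eq. (4)
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate state, eq. (4)
    o_t = sigmoid(W_o @ z + b_o)             # output gate, eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state, eq. (6)
    h_t = o_t * np.tanh(c_t)                 # new hidden state, eq. (7)
    return h_t, c_t
```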

2.3. Ensemble Prediction Model. Unlike single algorithms, a combination of multiple methods can obtain better estimation results [20]. These hybrid models are based on supervised machine learning algorithms, so they can be used for training and prediction purposes. Moreover, ensemble methods improve applicability to the problem being solved and obtain better performance [21, 22]. Traditional econometric models and machine-learning-based models have both been widely used in time-series prediction research. In our study, because the stock market time series contains a large number of linear and nonlinear relations, a single model of the previous kind would find it difficult to handle this type of prediction. Therefore, in our study, we make a combined prediction based on ARIMA and LSTM, respectively, for the different characteristics of stock market data, in order to obtain a better prediction effect. Recently, ensemble methods based on ARIMA and LSTM have been applied to fields such as business and energy and have achieved great success [23-25].

Even if the results obtained using the mixed model and the results obtained using each model alone are not correlated with each other, it has been demonstrated that the mixed model reduces the prediction error. Therefore, the hybrid model is considered to be the most successful type of model for prediction tasks [26]. To make predictions, many ensemble methods composed of linear and nonlinear models are employed by different researchers. In our experiment, historical data in the time series are used to predict the future. Figure 2 introduces the structure of the proposed ensemble method. Each observation of the time series is decomposed as

y_t = L_t + N_t,  (8)

where, in our time-series data sets, L_t represents the linear part and N_t represents the nonlinear part. Figure 2 shows the ARIMA-LSTM hybrid model. In the mixed model, the linear component L_t is calculated through the ARIMA model first, and then the LSTM model is used to predict the nonlinear component N_t of the time series. Finally, the mean error values of the two models are obtained; they are given in formulas (9) and (10):

LSTM_error = LSTM_mean[error],  (9)

ARIMA_error = ARIMA_mean[error].  (10)

The weights of the two models are calculated from these error values using equations (11) and (12):

LSTM_weight = (1 − LSTM_error / (LSTM_error + ARIMA_error)) × 2,  (11)

ARIMA_weight = 2 − LSTM_weight.  (12)

Equation (13) then combines the weight values of the models with their forecasts to obtain each predicted value of the final mixed model:

Hybrid_predict[i] = (LSTM_weight[i] × LSTM_predict[i] + ARIMA_weight[i] × ARIMA_predict[i]) / 2.  (13)

Figure 2: ARIMA-LSTM hybrid model.
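As a minimal sketch of the error-weighted combination in equations (9)-(13), under the reading given above in which the less accurate model receives the smaller weight, the following routine might be used; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def hybrid_forecast(arima_pred, lstm_pred, arima_err, lstm_err):
    """Combine ARIMA and LSTM forecasts with validation-error-based weights.

    arima_pred, lstm_pred: forecasts of the two single models for the same points.
    arima_err, lstm_err: mean errors of each model on a held-out window, eqs. (9)-(10).
    """
    # Eq. (11): LSTM's weight grows as ARIMA's share of the total error grows.
    lstm_w = (1.0 - lstm_err / (lstm_err + arima_err)) * 2.0
    # Eq. (12): the two weights always sum to 2.
    arima_w = 2.0 - lstm_w
    # Eq. (13): weighted average of the two forecasts (the division by 2
    # compensates for the weights summing to 2).
    return (lstm_w * np.asarray(lstm_pred) + arima_w * np.asarray(arima_pred)) / 2.0
```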

3. ARIMA-LSTM Hybrid Model Design and Evaluation

To evaluate the performance of the proposed ensemble model, we employ some commonly used metrics, such as MAE, MSE, and RMSE, to compare the ensemble method with several benchmark methods.

3.1. Data Set Selection and Preprocessing. In this study, two stock index forecasting models, ARIMA and LSTM, are constructed first. The S&P 500 stock index is selected as the empirical data, and daily trading data are selected over the sample interval from January 1, 2010, to December 31, 2019, giving 2519 observations in total. Among them, the first 90% are used for model training, and the remaining 10% are used for model prediction. The S&P 500 stock index sequence is shown in Figure 3. It can be seen from the figure that, within the selected time range, the S&P 500 index generally shows a steadily increasing trend.

Figure 3: S&P 500 stock index closing price sequence (daily, January 4, 2010, onward).

3.2. Comparative Analysis of Stock Index Forecast Model Results. After obtaining the prediction data set, the aforementioned evaluation metrics are used in this study to assess the output of each forecasting method. Table 1 reports the loss values obtained for the predictions of the ARIMA model and the LSTM (RNN) neural network model.

Table 1: The experimental results of ARIMA and LSTM model forecasting.

        MSE       MAE       RMSE
ARIMA   0.000101  0.007333  0.043788
LSTM    0.000096  0.007184  0.028828

Table 1 shows the fitting results based on the loss values of the prediction results of each model under different loss functions. It can be seen from the table that the loss function values of the LSTM model are all smaller than those of the ARIMA model, because the LSTM model can not only describe the nonlinear relationships of time-series data but also has a certain processing capability for the linear part, despite being less stable than the ARIMA model. Generally speaking, however, both models obtain very low loss values, indicating that both perform relatively well in terms of prediction accuracy. Figure 4 shows the prediction results using LSTM and ARIMA, respectively.
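As a concrete illustration of how the loss values in Table 1 can be computed, the following is a minimal sketch that scores two forecast series against the actual (normalised) index values; the array names are placeholders rather than the exact variables used in this study.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    return mse(y_true, y_pred) ** 0.5

# y_actual, y_arima, y_lstm are assumed to hold the last 10% of the normalised
# closing prices and the corresponding model forecasts.
# for name, pred in [("ARIMA", y_arima), ("LSTM", y_lstm)]:
#     print(name, mse(y_actual, pred), mae(y_actual, pred), rmse(y_actual, pred))
```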
3.3. The Design of the Stock Price Correlation Coefficient Prediction Ensemble Model

3.3.1. The Design of ARIMA for Stock Price Correlation Coefficient Prediction

(1) In the correlation coefficient prediction experiment, the adjusted closing prices of the constituent stocks of the S&P 500 index are selected, and the sample interval is still set from January 1, 2010, to December 31, 2019, based on New York Stock Exchange daily transaction records. The data are mainly acquired with the Python language's Beautiful Soup library through crawler technology. The trading data of the constituent stocks originate from the Quandl database, and the industry information of the constituent stocks comes from Wikipedia.
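As a rough illustration of the acquisition step described above, the sketch below pulls the constituent list and sector information from Wikipedia with Beautiful Soup; the URL, table id, and column positions are assumptions about the page layout, and the price download from Quandl is only indicated in a comment, since it requires an API key.

```python
import requests
from bs4 import BeautifulSoup

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"  # assumed page

def fetch_constituents():
    """Scrape ticker symbols and GICS sectors of S&P 500 constituents."""
    html = requests.get(WIKI_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "constituents"})  # assumed table id
    rows = table.find_all("tr")[1:]                      # skip the header row
    constituents = {}
    for row in rows:
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        if len(cells) >= 3:
            ticker, sector = cells[0], cells[2]          # assumed column order
            constituents[ticker] = sector
    return constituents

# Daily adjusted closes would then be pulled per ticker, e.g. via the Quandl API:
# quandl.get("WIKI/" + ticker, start_date="2010-01-01", end_date="2019-12-31")
```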
Figure 4: Predicted results using LSTM (left) and ARIMA (right): actual data points versus testing predictions (normalised values over the number of days).

After preprocessing the data, the program randomly generates 150 stock pairs from the remaining 446 assets and calculates the correlation coefficient of each pair of assets over a 100-day time window. In order to diversify the data, 5 sets of data are set up in this article with starting points every 20 days: day 1, day 21, day 41, day 61, and day 81. Each value corresponds to a rolling 100-day window, advancing in 100-day time steps until the end of the data set. In this process, a total of 55,875 sets of time-series data were generated, and each set has 24 time steps. The development, test1, and test2 sets are produced from these 55,875 × 24 data sets. In the model evaluation stage, this paper divides the data in this way to achieve forward optimization.

(2) The parameters of the model should be determined before fitting the ARIMA model. In ARIMA (p, d, q), d is the easiest to determine. Differencing aims to make the series used for modeling a time series that tends to be stationary, which can improve forecasting accuracy. As mentioned in the previous section, the S&P 500 index and its constituent stocks generally show a steadily increasing trend. The data tend to be stationary after one difference, so the parameter d here can be set to 1. Determining the parameters p and q requires the ACF and PACF of the data.

The ACF and PACF dropping to zero after a certain order is called truncation. The running results show that most data sets exhibit an oscillating pattern, as shown in Table 2. There are also notable patterns covering rising/falling trends, occasional large drops when the correlation coefficient stabilizes, and stable periods with mixed oscillations. Although the ACF and PACF images show that most of the data sets are close to white noise, they also show that five groups of parameters can be effectively used in the prediction of the ARIMA model. These five types of sequences are used in this article to test the ARIMA model, and a total of 55,875 data sets are trained. What is more, for each data set, we select the model with the smallest AIC value after training.

AIC (Akaike information criterion) is a commonly used criterion for the prediction performance of ARIMA models. The AIC is calculated as follows:

AIC = −2 ln(L) + 2N,  (14)

where L represents the maximum likelihood and N represents the number of parameters.

The AIC criterion was proposed by the Japanese statistician Akaike, so it is named directly after the initials of his name. To evaluate the performance of the ARIMA model with the AIC criterion, the maximum likelihood and the model parameters are used to judge its prediction effect. Specifically, the larger the maximum likelihood value, the better the prediction effect; theoretically speaking, the more model parameters are used, the easier it is to fit the data relationship, or the better the fit will be. However, too many parameters also complicate the model structure, which may make parameter estimation more difficult and thereby reduce the model's prediction accuracy. Therefore, the ideal ARIMA model should be the optimal combination of maximum likelihood and number of parameters. The AIC criterion considers these two indicators together and can comprehensively evaluate the ARIMA model. Therefore, when optimizing the ARIMA model, the parameter combination with the smallest AIC value is selected.

If the ARIMA model is used to predict future data, the generated data stay within the ARIMA framework. In other words, the underlying process generating the time series has only a linear correlation structure, and the nonlinear relationships in the experimental data cannot be described. The ARIMA method therefore still has certain limitations in predicting complex real-world problems. In this regard, the NN model can be employed to analyze the nonlinear parts that the ARIMA model cannot handle.

After fitting the ARIMA model to the linear part of the data, this article generates a new data set by calculating the residual values of the remaining nonlinear part at every 21 time steps, as shown in Figure 5.
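The order selection described above can be sketched with statsmodels as below; this is a minimal illustration assuming a pandas Series of correlation coefficients, and the small (p, q) search grid is an assumption rather than the exact grid used in the paper.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def fit_best_arima(series: pd.Series, d: int = 1, max_p: int = 3, max_q: int = 3):
    """Grid-search (p, d, q) and keep the fit with the smallest AIC, eq. (14)."""
    best_aic, best_fit, best_order = np.inf, None, None
    for p in range(max_p + 1):
        for q in range(max_q + 1):
            try:
                fit = ARIMA(series, order=(p, d, q)).fit()
            except Exception:
                continue                    # skip orders that fail to converge
            if fit.aic < best_aic:
                best_aic, best_fit, best_order = fit.aic, fit, (p, d, q)
    return best_order, best_fit

# Residuals left for the LSTM stage: actual values minus the in-sample ARIMA fit.
# order, fit = fit_best_arima(corr_series)
# linear_part = fit.predict(start=0, end=len(corr_series) - 1)
# residuals = corr_series - linear_part
```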
Table 2: ARIMA model's ACF/PACF. For each type of series (oscillatory; oscillatory and steady; dip; decreasing; increasing), the table shows the data plot, the ACF plot, and the PACF plot.

Since the input consists of the nonlinear residuals left after the ARIMA processing, the residual distributions of the X and Y data sets all fall between 0 and 1. The newly generated X and Y segmented data sets are used as the input values for the subsequent nonlinear LSTM model for training.

3.3.2. Forecast Design Based on the LSTM Stock Price Correlation Coefficient. (1) Data Selection and Acquisition: After the ARIMA model processes the linear part of the 150 randomly generated pairs of assets, the remaining nonlinear part is calculated as residual values and used as the input of the LSTM model, as shown in Figure 5. The input data set of the LSTM model is likewise divided into X and Y train sets, X and Y development sets, and two pairs of X and Y test sets (test set 1 and test set 2). The input data are stored in the X and Y data sets as shown in Figure 6. Each X data set is a 55,874 × 20 matrix, and each X time series corresponds to a Y data set.

(2) Training the LSTM Model: The model structure constructed in this paper is an improved LSTM model based on an RNN, which contains 25 units. The final outputs of the cells are combined into a single value by a fully connected layer. This value is then output as the final predicted value through the tanh activation function of a two-layer network, which can simply be understood as the tanh function magnified by a factor of two. Figure 7 shows the simplified architecture of the method.

3.4. Prediction Results Analysis

3.4.1. Forecasting Performance Evaluation. This paper aims to fit the parameters of the model so that the optimal parameters can be used to predict various assets in different time periods. Therefore, only the first window is trained, and the trained model is then applied to the data of the three time intervals of the validation set and the two test sets.
Figure 5: Residual data distribution of the training set.

Figure 6: Construction of the input time-series data: the raw data are split into 5 sections, the correlation coefficients of 11,175 stock pairs are computed per section, and the resulting 55,875 lines of data are divided into Train, Dev, Test1, and Test2 X/Y sets.
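The window construction summarized in Figure 6 might look roughly as follows in pandas; the 100-day window, 20-step inputs, and later targets follow the description above, while the variable names and helper functions are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def pair_correlation_series(price_a: pd.Series, price_b: pd.Series,
                            window: int = 100, stride: int = 100) -> np.ndarray:
    """Correlation of two price series over consecutive 100-day windows."""
    rolling = price_a.rolling(window).corr(price_b)      # trailing pairwise correlation
    return rolling.iloc[window - 1::stride].to_numpy()   # one value per non-overlapping window

def make_xy(corr_seq: np.ndarray, n_inputs: int = 20):
    """Split a 24-step correlation sequence into a 20-step input X and later targets Y."""
    x = corr_seq[:n_inputs]
    y = corr_seq[n_inputs:]          # e.g. dev / test1 / test2 targets
    return x, y

# Example, assuming a data frame `prices` with one column of adjusted closes per ticker:
# corr_seq = pair_correlation_series(prices["AAPL"], prices["MSFT"])
# x, y = make_xy(corr_seq)
```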


Figure 7: The structure of the LSTM model.
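A compact Keras sketch of the architecture described above (a 25-unit LSTM over 20 time steps, a fully connected output, and a tanh output scaled by two) is given below; the optimizer, loss, and other training settings are assumptions where the paper does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def double_tanh(x):
    # "tanh magnified by two": keeps outputs within (-2, 2), which covers the
    # correlation-coefficient range [-1, 1] with some headroom.
    return 2.0 * tf.math.tanh(x)

def build_correlation_lstm(n_steps: int = 20) -> tf.keras.Model:
    model = models.Sequential([
        layers.LSTM(25, input_shape=(n_steps, 1)),  # 25 recurrent units over 20 residual steps
        layers.Dense(1),                            # fully connected combination of cell outputs
        layers.Activation(double_tanh),             # scaled tanh output layer
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])  # assumed settings
    return model

# model = build_correlation_lstm()
# model.fit(train_x[..., None], train_y,
#           validation_data=(dev_x[..., None], dev_y), epochs=350)
```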

In addition, when the prediction results of the model's correlation coefficient over the two time periods are relatively ideal, some classic financial prediction models are selected to analyze the prediction effect of each model and to test the model in this article. The MSE and MAE values of the four financial comparison models are calculated in this article.

3.4.2. Forecast Results and Analysis. After the data are processed, in the hybrid model prediction experiment, the ARIMA method is first employed in this article to process the linear aspect of the S&P 500 index constituent stocks, and then the residual values of the nonlinear part of the data left by this first step are used as the input data of the LSTM model. Finally, model establishment, data training, and testing are carried out. The final prediction results of the correlation coefficients between the 150 randomly generated asset portfolios and the S&P 500 index over the next 20 time steps are shown in Figure 8.

Figure 8: Prediction results of the correlation coefficient over 20 time steps.

3.5. Control Group Forecasting Model. Predicting the results with the hybrid model alone is not enough to show that the model has clear advantages in forecasting research objects such as correlation coefficients. In order to compare the proposed hybrid model with other models on the accuracy of financial sequence forecasting, other commonly used forecasting models are introduced as a reference group. Many studies have shown that the full-sequence model performs poorly when predicting financial sequences, so three other commonly used prediction models are also discussed and compared with the prediction results of the hybrid model.

3.5.1. Full-Sequence Model (FS). Adopting the full-sequence algorithm is the easiest way to estimate portfolio correlation. All past correlation values are used by the model to predict the future correlation coefficient:

ρ̂_ij(t) = ρ_ij = ρ̂_ij(t−1).  (15)

However, compared with other equivalent models, the prediction quality of this model is relatively poor.

3.5.2. Constant Coefficient Correlation Model (CCC). The CCC model takes the average of the correlation coefficients of all asset pairs as the estimate for the asset pair to be predicted. Therefore, all asset pairs in the portfolio share the same correlation coefficient under this model:

ρ_ij(t) = Σ_(i>j) ρ_ij(t−1) / (n(n − 1)/2).  (16)

3.5.3. Single-Index Model (SI). The single-index model is a simple asset pricing approach, which is usually used to evaluate the risk and return of stocks. To facilitate calculation and analysis, the single-index model uses a single macro factor, such as the S&P 500 index, to measure the risk and return of stocks. The single-index model assumes that the rate of return on an asset and the "single index," that is, the market rate of return, change in the same direction. In order to quantify the volatility of assets and market returns, it is necessary to specify the market returns themselves. This specification is called the "market model":

R_(i,t) = α_i + β_i R_(m,t) + ε_(i,t),  (17)

where R_(i,t) represents the return of asset i at time t; in the same way, R_(m,t) represents the market return m at time t; α_i represents the excess return of asset i after risk adjustment; β_i represents the sensitivity of asset i to the market; and ε_(i,t) represents the residual return of asset i at time t, also called the error term. It then holds that

E(ε_i) = 0,
Var(ε_i) = σ²_(εi),  (18)
Cov(R_i, R_j) = ρ_ij σ_i σ_j = β_i β_j σ²_m,

where σ_i and σ_j respectively represent the standard deviations of asset i and asset j, and σ_m represents the standard deviation of market returns. In the single-index model, the estimated value of the correlation coefficient ρ̂_ij(t) can be expressed as

ρ̂_ij(t) = β_i β_j σ²_m / (σ_i σ_j).  (19)
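For reference, the three simpler estimators in equations (15), (16), and (19) can be sketched as below; this is an illustrative reading of the formulas, with pandas/NumPy names chosen here rather than taken from the paper.

```python
import numpy as np
import pandas as pd

def full_sequence_estimate(past_corrs: pd.Series) -> float:
    """Eq. (15): carry the correlation observed over all past data forward."""
    return float(past_corrs.iloc[-1])

def constant_correlation_estimate(corr_matrix: pd.DataFrame) -> float:
    """Eq. (16): average the correlations of all asset pairs (upper triangle only)."""
    values = corr_matrix.to_numpy()
    iu = np.triu_indices_from(values, k=1)    # strictly upper-triangular entries
    return float(values[iu].mean())

def single_index_estimate(r_i, r_j, r_market) -> float:
    """Eq. (19): rho_ij = beta_i * beta_j * var(market) / (sigma_i * sigma_j)."""
    var_m = np.var(r_market, ddof=1)
    beta_i = np.cov(r_i, r_market, ddof=1)[0, 1] / var_m
    beta_j = np.cov(r_j, r_market, ddof=1)[0, 1] / var_m
    return beta_i * beta_j * var_m / (np.std(r_i, ddof=1) * np.std(r_j, ddof=1))
```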
Figure 9: The learning curve of the ensemble model (ARIMA-LSTM): mean squared error and mean absolute error versus epochs for the TRAIN and DEV sets.

Figure 10: The prediction curve of the ensemble model (ARIMA-LSTM): mean squared error and mean absolute error versus epochs for the TEST1 and TEST2 sets.

3.5.4. Multisequence Model. The industry sector of the asset is considered in the multisequence model. The model assumes that assets in the same industry generally tend to fluctuate in the same direction, so the correlation coefficient of an asset pair can be taken to be equal to the average of the correlation coefficients within the corresponding industry sectors. For example, if companies A and B belong to industry sectors α and β, respectively, then their correlation coefficient is set equal to the average correlation coefficient of all asset pairs in their respective industry sector combination (α, β). Depending on whether the two industrial sectors α and β are the same, the prediction formula is slightly different. The equation is as follows:

ρ̂_ij(t) = Σ_(i∈α) Σ_(j∈β; i≠j) ρ_ij(t−1) / (n_α (n_β − 1)),   if α = β,

ρ̂_ij(t) = Σ_(i∈α) Σ_(j∈β; i≠j) ρ_ij(t−1) / (n_α n_β),   if α ≠ β,   (20)

where α and β respectively represent different industry sectors in the stock market, and n_α and n_β represent the number of companies in sector α and sector β, respectively.

3.6. Experimental Results and Evaluation. From Figures 9 and 10, it can be seen that the learning curves of the training data set and the development data set begin to converge after a certain period of learning and training (about 350 epochs), and both data sets obtain small MSE and MAE loss function values.

Table 3: ARIMA-LSTM mixed-model loss function evaluation table.

                      Develop data set        Test1 data set          Test2 data set
                      MSE    RMSE   MAE       MSE    RMSE   MAE       MSE    RMSE   MAE
ARIMA-LSTM            0.103  0.320  0.250     0.101  0.319  0.248     0.116  0.341  0.266
Full history          0.479  0.692  0.558     0.463  0.681  0.554     0.432  0.657  0.523
Constant correlation  0.277  0.526  0.462     0.214  0.463  0.400     0.281  0.530  0.430
Single-index model    0.420  0.624  0.540     0.411  0.641  0.534     0.247  0.497  0.482
Multigroup            0.307  0.554  0.451     0.291  0.539  0.455     0.297  0.536  0.448

Table 3 shows that, when the MSE, RMSE, and MAE of the predicted values are calculated, the loss values of the ARIMA-LSTM ensemble model designed in our study on the validation (develop), test1, and test2 sets are all smaller than those of the compared models. Therefore, it can be considered that the accuracy of the ensemble method has been improved, and the model can be extended to other applications of stock market prediction.
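A comparison like Table 3 could be assembled with a small evaluation loop such as the one below; `hybrid_forecast`, the benchmark estimators, and the three data splits are assumed to be available from the earlier sketches, and the dictionary-based layout is illustrative rather than the paper's actual code.

```python
import numpy as np

def evaluate_models(splits: dict, predictors: dict) -> dict:
    """splits: {"develop": (x, y), "test1": (x, y), "test2": (x, y)};
    predictors: {"ARIMA-LSTM": fn, "Full history": fn, ...}, each fn mapping x -> y_hat.
    Returns {model: {split: {"MSE": ..., "RMSE": ..., "MAE": ...}}}."""
    table = {}
    for model_name, predict in predictors.items():
        table[model_name] = {}
        for split_name, (x, y) in splits.items():
            err = np.asarray(y) - np.asarray(predict(x))
            table[model_name][split_name] = {
                "MSE": float(np.mean(err ** 2)),
                "RMSE": float(np.sqrt(np.mean(err ** 2))),
                "MAE": float(np.mean(np.abs(err))),
            }
    return table
```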
4. Conclusion

First, the two single models show good applicability to single-dimensional data. The loss functions are used to evaluate the prediction results of the proposed models, and we found that both the ARIMA and the LSTM model achieve low loss function values in stock index prediction. Comparing the loss function values of all methods indicates that the three loss function indexes of the LSTM model are superior to those of the ARIMA model. Moreover, the prediction accuracy of the ARIMA-LSTM hybrid model is better than that of the other financial models. In this paper, we proposed the hybrid ARIMA-LSTM model, in which the linearity is filtered out by ARIMA modeling and the nonlinear trends are predicted by the LSTM recurrent neural network. The loss function test results show that the MSE, MAE, and RMSE of the ARIMA-LSTM hybrid model are smaller than those of the other control models. Therefore, the ARIMA-LSTM model is feasible for predicting the correlation coefficients used in portfolio optimization. Although the prediction results in this paper are basically consistent with the results expected before the experiment, the time series before 2010 is not considered, since only data after 2010 are selected. Therefore, the model's ability to predict the special financial situations before 2010 needs to be further tested. What is more, as financial anomalies and noise are common, not all special trends can be covered by the model. Therefore, in the next step, it is necessary for researchers to further study how to deal with black swan events in the financial world.

Data Availability

The experimental data of this research are available from the corresponding author upon request.

Conflicts of Interest

All the authors declare that they have no conflicts of interest regarding this study.

References

[1] Z. Bao and C. Wang, "A multi-agent knowledge integration process for enterprise management innovation from the perspective of neural network," Information Processing & Management, vol. 59, no. 2, Article ID 102873, 2022.
[2] S. Deng, X. Huang, J. Shen, H. Yu, and C. Wang, "Prediction and trading in crude oil markets using multi-class classification and multi-objective optimization," IEEE Access, vol. 7, no. 99, p. 1, 2019.
[3] G. Caginalp and G. Constantine, "Statistical inference and modelling of momentum in stock prices," Applied Mathematical Finance, vol. 2, no. 4, 1995.
[4] T. Zheng, J. Farrish, and M. Kitterlin, "Performance trends of hotels and casino hotels through the recession: an ARIMA with intervention analysis of stock indices," Journal of Hospitality Marketing & Management, vol. 25, no. 1, pp. 49–68, 2016.
[5] B. Yang, C. Li, D. Wang, and X. He, "Research on the risk of Shanghai Composite Index based on VaR and GARCH model," in Proceedings of the 2017 3rd International Conference on Economics, Social Science, Arts, Education and Management Engineering (ESSAEME 2017), Huhhot, China, 2017.
[6] M. Kim and H. Sayama, "Predicting stock market movements using network science: an information theoretic approach," Applied Network Science, vol. 2, no. 1, p. 35, 2017.
[7] M. Khashei and Z. Hajirahimi, "A comparative study of series ARIMA/MLP hybrid models for stock price forecasting," Communications in Statistics - Simulation and Computation, vol. 48, no. 9, pp. 2625–2640, 2019.
[8] I. Unggara, A. Musdholifah, and K. S. Anny, "Optimization of ARIMA forecasting model using firefly algorithm," IJCCS (Indonesian Journal of Computing and Cybernetics Systems), vol. 13, no. 2, 2019.
[9] S. Siami-Namini and A. S. Namin, "Forecasting economics and financial time series: ARIMA vs. LSTM," 2018, https://arxiv.org/abs/1803.06386.
[10] A. Jayanth Balaji, D. S. Harish Ram, and B. B. Nair, "Applicability of deep learning models for stock price forecasting: an empirical study on BANKEX data," Procedia Computer Science, vol. 143, pp. 947–953, 2018.
[11] T. Joo Il and C. Seung-Ho, "Stock prediction model based on bidirectional LSTM recurrent neural network," Journal of Korea Institute of Information, Electronics, and Communication Technology, vol. 11, no. 2, pp. 204–208, 2018.
[12] X. Liang, Z. Ge, L. Sun, M. He, and H. Chen, "LSTM with wavelet transform based data preprocessing for stock price prediction," Mathematical Problems in Engineering, vol. 2019, Article ID 1340174, 8 pages, 2019.
[13] S. Borovkova and I. Tsiamas, "An ensemble of LSTM neural networks for high-frequency stock market classification," SSRN Electronic Journal, 2018.
[14] S. Chen and L. Ge, "Exploring the attention mechanism in LSTM-based Hong Kong stock price movement prediction," Taylor & Francis Journals, Milton Park, UK, 2019.
[15] G. P. Zhang, "Time series forecasting using a hybrid ARIMA and neural network model," Neurocomputing, vol. 50, 2003.
[16] C. Narendra Babu and B. Eswara Reddy, "Prediction of selected Indian stock using a partitioning-interpolation based ARIMA-GARCH model," Applied Computing and Informatics, vol. 11, no. 2, pp. 130–143, 2015.
[17] Y. Baek and H. Y. Kim, "ModAugNet: a new forecasting framework for stock market index value with an overfitting prevention LSTM module and a prediction LSTM module," Expert Systems with Applications, vol. 113, pp. 457–480, 2018.
[18] H. Y. Kim and C. H. Won, "Forecasting the volatility of stock price index: a hybrid model integrating LSTM with multiple GARCH-type models," Expert Systems with Applications, vol. 103, pp. 25–37, 2018.
[19] B. Chen, J. Zhong, and Y. Chen, "A hybrid approach for portfolio selection with higher-order moments: empirical evidence from Shanghai Stock Exchange," Expert Systems with Applications, vol. 145, Article ID 113104, 2019.
[20] D. Opitz and R. Maclin, "Popular ensemble methods: an empirical study," Journal of Artificial Intelligence Research, vol. 11, pp. 169–198, 1999.
[21] J. J. García Adeva, U. Cerviño Beresi, and R. A. Calvo, "Accuracy and diversity in ensembles of text categorisers," CLEI Electronic Journal, vol. 8, no. 2, pp. 1–12, 2005.
[22] M. Oliveira and L. Torgo, "Ensembles for time series forecasting," in Proceedings of the Asian Conference on Machine Learning (ACML 2014), pp. 360–370, Nha Trang City, Vietnam, 2015.
[23] E. Dave, A. Leonardo, M. Jeanice, and N. Hanafiah, "Forecasting Indonesia exports using a hybrid model ARIMA-LSTM," Procedia Computer Science, vol. 179, no. 1, pp. 480–487, 2021.
[24] Y. Deng, H. Fan, and S. Wu, "A hybrid ARIMA-LSTM model optimized by BP in the forecast of outpatient visits," Journal of Ambient Intelligence and Humanized Computing, 2020.
[25] Z. Wang, J. Qu, X. Fang, H. Li, T. Zhong, and H. Ren, "Prediction of early stabilization time of electrolytic capacitor based on ARIMA-Bi_LSTM hybrid model," Neurocomputing, vol. 403, 2020.
[26] M. Khashei and M. Bijari, "Improving forecasting performance of financial variables by integrating linear and nonlinear ARIMA and artificial," QJER, vol. 8, no. 2, pp. 83–100, 2008.
