
Predicting Mutual Funds’ Performance using Deep Learning and Ensemble Techniques

arXiv:2209.09649v3 [q-fin.ST] 31 Jul 2023

Nghia Chu1
Vega Corporation, Tang 3, 106 Hoang Quoc Viet, Nghia Do, Cau Giay, Hanoi, Vietnam
[email protected]

Binh Dao
Hanoi University, Km 9, Nguyen Trai, Nam Tu Liem, Hanoi, Vietnam
[email protected]

Nga Pham
Monash Centre for Financial Studies, 30 Collins St, Melbourne, VIC 3000, Australia
[email protected]

Huy Nguyen
The University of Melbourne, Melbourne Connect 700 Swanston St, Carlton, VIC 3053, Australia
[email protected]

Hien Tran
Hanoi University, Km 9, Nguyen Trai, Nam Tu Liem, Hanoi, Vietnam
[email protected]

Abstract
Predicting fund performance is beneficial to both investors and fund managers and yet is a challenging task. In this
paper, we have tested whether deep learning models can predict fund performance more accurately than traditional
statistical techniques. Fund performance is typically evaluated by the Sharpe ratio, which represents the risk-adjusted
performance to ensure meaningful comparability across funds. We calculated the annualized Sharpe ratios based on the
monthly returns time series data for more than 600 open-end mutual funds investing in listed large-cap equities in the
United States. With five different forecast horizons ranging from 6 to 24 months, we find that long short-term memory
(LSTM) and gated recurrent units (GRU) deep learning methods, both trained using modern Bayesian optimization,
provide higher accuracy in forecasting funds’ Sharpe ratios than traditional statistical ones. An ensemble method that
combines forecasts from LSTM and GRU achieves the best performance of all models. These results suggest that
deep learning and ensembling offer promising solutions to the challenge of fund performance forecasting.
Keywords: deep learning, forecasting, time series, fund performance, LSTM, GRU, ensemble.

1. Introduction

Investment funds are important intermediaries of the world's financial market. Globally, there were more than 126,000 regulated mutual funds in 2020. In the United States alone, there were around 9,000 mutual funds with 24 trillion dollars of assets under management (AUM) in 2020, according to the Investment Company Fact Book (ICI, 2021). Choosing the right investment fund has always been challenging for institutional and individual investors. Fund performance is driven by various factors, including the fund managers' understanding of the macro environment, industry factors, stock performance, and trading skills. While it is always written in most funds' prospectuses that past performance is not indicative of future performance, there is ample empirical literature documenting fund performance persistence, for example, among fixed income funds as in (Kahn and Rudd, 1995) and among equity funds and real estate funds (Arnold et al., 2019). Some attribute the performance persistence of mutual funds to managers' skills (Berk and van Binsbergen, 2015; Kacperczyk et al., 2014), whereas others explain performance persistence by common factors affecting stock returns, persistent differences in fund expenses and transaction costs (Carhart, 1997), and capital flows (Lou, 2012).

While the asset pricing literature centers around factors that explain fund performance and performance predictability, the topic of forecasting fund performance seems to be under-explored. It is, therefore, interesting to answer whether future fund performance can be forecast given past performance data, especially with the advanced techniques in deep learning for time-series forecasting.

1 Corresponding author


Being able to forecast fund performance with confidence will allow asset owners, typically individual investors, endowments, foundations, pension funds and sovereign funds, to make more informed decisions about their future asset manager selection. It is well documented in the literature that investors recognize better-skilled managers and reward outperformance with more fund flows (Berk and van Binsbergen, 2015). Better performance predictability will drive fund flows in the market to outperforming funds and, at the same time, create pressure for underperforming funds to improve. This process enhances market efficiency.

Despite its significance, there has been a lack of research in forecasting fund performance, especially research that involves forecasting the performance of multiple mutual funds as multiple simultaneous time series using one model. This research is intended to fill this gap. We aim to evaluate various techniques, including modern deep learning algorithms, traditional statistical techniques, and ensemble methods that combine model forecasts in different ways.

This work investigates whether deep learning techniques can better forecast future fund performance based on past performance data than conventional statistical methods. Deep learning comprises sophisticated neural network-based methods designed to learn patterns from data. Deep learning has demonstrated its effectiveness in many prediction problems that involve natural language and vision datasets. While there have been certain advances for time series data, there is still a lack of convincing evidence that deep learning outperforms other simpler, statistics-based methods across different problem domains.

Past research concluded that neural networks were not well suited for time series forecasting (Hyndman, 2018). This may be because numerous deep learning-based algorithms require large amounts of data to perform well. For example, the famous BERT model, a transformer used for natural language processing, was pre-trained on an extremely large text corpus, including the English Wikipedia with over 2,500M words (Kenton and Toutanova, 2019). The nature of time series data does not permit such a massive amount of data to be available for training deep learning models.

Contrarily, recent research in deep learning applied to the time series domain shows more promising results. In the M4 forecasting competition, a hybrid approach integrating exponential smoothing into advanced neural networks outperformed other statistical methods and won the competition (Smyl, 2020). (Hewamalage et al., 2021) investigates the effectiveness of recurrent neural networks in forecasting various datasets. The study concludes that RNNs are competitive alternatives to state-of-the-art univariate benchmark methods such as ARIMA and exponential smoothing (ETS).

These studies have in common that they investigate recurrent neural network (RNN) algorithms, which are well-known approaches in sequence modelling. They also develop global deep learning models capable of learning from multiple time series to improve prediction accuracy for individual ones. This technique is known as cross-learning, which has been found to be a promising alternative to traditional forecasting (Semenoglou et al., 2021).

Our study addresses the question of whether cross-learning with carefully tuned RNNs, specifically LSTM and GRU, works more effectively than traditional univariate methods in the problem of forecasting performance for multiple mutual funds. The sample of mutual funds is relatively homogeneous in that they adopt a similar investment strategy in the United States, which increases the chance that information in one time series would help predict the future values of other time series.

We compare the accuracy of the ARIMA, Theta, ETS and Naive statistical methods with deep learning algorithms in predicting fund performance. While ARIMA and ETS are popular univariate statistical forecasting models, Theta was the best-performing forecasting algorithm in the M3 competition, according to (Makridakis and Hibon, 2000). The Naive method is included for comparison as, despite its simplicity, it works well for various financial and economic time series (Hyndman and Athanasopoulos, 2018). In addition to individual deep learning and statistical algorithms, we investigate how various ensembling methods for combining model forecasts lead to different performance gains.

In summary, the contributions of our research are:

• To the best of our knowledge, our research is the first that addresses the problem of forecasting the performance of multiple mutual funds simultaneously using modern deep learning approaches.

• We compare deep learning performance against well-known statistical methods using rolling origin cross-validation with well-defined data splitting methods and multiple forecast horizons.

• We experiment with different ensembling methods for combining model forecasts, including local out-of-sample performance per time series per algorithm and global out-of-sample performance per algorithm in computing weighted averages of forecasts.

• We present a detailed methodology for training deep learning models, specifically for the LSTM and GRU algorithms, using modern Bayesian optimization. We believe the methodology benefits machine learning practitioners aiming to study the hyper-parameter optimization process of deep learning models.

2. Study background

This section presents the literature on predicting the performance of mutual funds, traditional univariate forecasting models, and deep learning models for time series forecasting.
2.1. Predicting mutual fund performance

Fund performance is fundamentally assessed based on two aspects: (1) expected return and (2) risk level, i.e., the variability of returns, measured by the standard deviation of return. Although the expected rate of return of a fund is a critical indicator, it alone does not provide an accurate picture of how the fund has performed relative to other funds, as higher expected returns come at the cost of a higher level of risk. Different funds could exhibit different variability of returns due to differences in risk-taking strategies, diversification levels or management skills.

Since the 1960s, the Sharpe ratio, introduced by (Sharpe, 1966) and modified in (Sharpe, 1994), has become the most popular performance statistic of mutual funds for considering the trade-off between a fund's risk and return. The Sharpe ratio is also known as the reward-to-variability ratio, as a fund's excess return (the rate of return over and above the risk-free rate of return) is assessed against its level of risk, i.e., the variability of returns. Therefore, the Sharpe ratio is unaffected by scale. The implied assumption of the Sharpe ratio is that all investors can invest funds at the risk-free interest rate, representing the opportunity cost of investing in an investment fund or the cost of the capital invested. Fund managers are under constant pressure to deliver a high Sharpe ratio, otherwise risking losing capital allocated to their fund.

While other performance indicators are introduced in the academic literature, the Sharpe ratio has always been among the preferred choices of industry practitioners when it comes to fund performance evaluation, thanks to its simplicity. It is reported by most asset managers and included in large fund databases such as Morningstar or Lipper, as cited in reports from (Elton and Gruber, 2020). Therefore, in this paper, we use the Sharpe ratio as an indicator of fund performance.
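For concreteness, the annualized Sharpe ratio that we forecast can be computed from monthly data along the following lines. This is a minimal sketch: the sqrt(12) annualization and the sample standard deviation are standard conventions assumed here, not details spelled out in the text.

    import numpy as np

    def annualized_sharpe(monthly_returns, monthly_riskfree):
        # Excess return of the fund over the risk-free rate, month by month
        excess = np.asarray(monthly_returns) - np.asarray(monthly_riskfree)
        # Annualize the monthly mean/std ratio by sqrt(12) (assumed convention)
        return np.sqrt(12) * excess.mean() / excess.std(ddof=1)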
Given the scant literature on forecasting fund performance, the selection of a reasonable forecast horizon remains an open question. Our considerations include a wide range of time windows in which performance persistence is found to hold, as identified in the persistence literature stream. Persistence studies investigate whether funds' relative performance is sustained over adjacent periods. Some find persistence to be short-lived (Bollen and Busse, 2005), with performance persistence observed only within six months (Kacperczyk et al., 2014) or one year (Carhart, 1997). (Lou, 2012) finds that evidence for funds' persistence reverses after six to twelve quarters. However, (Irvine et al., 2022) provide evidence that outperformance persists for at least 17 to 24 months, depending on the asset pricing model. For longer time windows, Berk and van Binsbergen report that large cross-sectional differences in managers' skills persist for as long as ten years. Based on the diverse predictability and persistence time windows observed in the literature, we decided to use forecast horizons of 6, 9, 12, 18 and 24 months, with funds' performance being forecast for every single month during a forecast horizon.

Most studies on forecasting fund performance have used traditional time series forecasting models. Starting from the 1980s and 1990s, more advanced neural network techniques were used to address financial forecasting questions. Specifically, the use of neural networks to forecast fund performance was introduced in the papers of (Chiang et al., 1996) and (Indro et al., 1999). (Chiang et al., 1996) employed neural networks to forecast mutual fund net asset value, while (Indro et al., 1999) used the same approach to forecast mutual fund performance in terms of percentage returns.

A more recent paper is (Wang and Huang, 2010). While (Wang and Huang, 2010) also use the Sharpe ratio to measure fund performance, like ours, they compare the fast adaptive neural network classifier (FANNC) with the Back Propagation neural network (BPN) model and report that the FANNC is more efficient than the BPN in both classifying and predicting the performance of mutual funds. It should also be noted that (Wang and Huang, 2010) use managers' momentum strategies and herding behavior as input variables for the prediction of funds' Sharpe ratios whereas, in our approach, past Sharpe ratios are the input used to forecast future ones.

The study of (Pan et al., 2019) is similar to the paper of (Wang and Huang, 2010) in that they use three inputs (fund size, annual management fee and fund custodian fee) and the fund return as the output for data envelopment analysis, a Back Propagation Neural Network and GABPN (a new evolutionary method) for 17 open-end balanced stock funds over the period August 31, 2015 to July 1, 2016. This study also uses five accuracy measures to evaluate forecast performance, analyzes the rate of return and builds a mutual fund net worth prediction model. Compared to (Pan et al., 2019), our paper examines a substantially larger number of funds over a much longer time window using extensive forecasting methods such as deep learning, ensembles, and traditional statistical techniques.

2.2. Traditional univariate forecasting models

Time series forecasting, especially stock price and fund performance forecasting, has traditionally been a popular research topic in statistics and econometrics. Traditional methods range from simple seasonal decomposition (STL) and simple exponential smoothing (SES) to more complex ones such as ETS (Hyndman et al., 2008) and ARIMA (Newbold, 1983; Box et al., 2015). Non-linear autoregressive time series using a numerical forecasting procedure (m-step-ahead prediction) is also used in (Cai, 2005), and automatic ARIMA in (Melard and Pasteels, 2000) using the Time Series Expert TSE-AX. Combining several traditional methods can be seen in (Bergmeir and Benítez, 2012), which used six traditional forecasting methods, namely autoregression (AR), moving average (MA), combinations of those models (ARMA) and integrated ARMA (ARIMA), Seasonal and Trend decomposition using Loess (STL), as well as TAR (Threshold AR), based on cross-validation and evaluation using the series' last part. They suggest that the use of a blocked form of cross-validation for time series evaluation should be the standard procedure.
The traditional univariate methods dominated other, more complex computational methods at many forecasting competitions, including M2, M3, and M4, as in (Ahmed et al., 2010); (Crone et al., 2011); (Makridakis and Hibon, 2000). A recent paper, (Fiorucci and Louzada, 2020), proposes a methodology for combining time series forecast models. The authors use a cross-validation scheme, named generalized rolling origin evaluation, that assigns higher weights to methods with more accurate in-sample predictions. The methodology is used to combine forecasts from traditional forecasting methods such as the Theta, Simple Exponential Smoothing (SES), and ARIMA models. (Makridakis et al., 2020) reported 61 forecasting methods used in the M4 competition with 100,000 time series. The paper of (Li et al., 2020) used the ARIMA model with deep learning to forecast high-frequency financial time series.

The traditional univariate methods have been shown to work well when the data volume is not extensive (Bandara et al., 2020; Xu et al., 2019; Sezer et al., 2020) and the number of parameters to be estimated is low.

2.3. Deep learning forecasting models

Recently, several papers have combined traditional methods with deep learning (Li et al., 2020; Pannakkong et al., 2017; Xu et al., 2019). (Li et al., 2020) work on the Chinese CSI 300 index and compare methods such as Monte Carlo numerical simulation, ARIMA, support vector machine (SVM), long short-term memory (LSTM) and ARIMA-SVM models. Their results show that the enhanced ARIMA model based on LSTM not only improves the forecasting accuracy of the single ARIMA model in both fitting and forecasting but also reduces the computational complexity of a single deep learning model. (Pannakkong et al., 2017) propose a hybrid forecasting model involving the autoregressive integrated moving average (ARIMA), artificial neural networks (ANNs) and k-means clustering. (Xu et al., 2019) use all linear models and deep belief network (DBN) models.

Several other papers, (Jianwei et al., 2019), (Cao and Wang, 2019), and (Sezer et al., 2020), use mixture models for financial commodity and stock forecasting. (Jianwei et al., 2019) employ independent component analysis (ICA) and a gated recurrent unit neural network (GRUNN), named ICA-GRUNN, among others to estimate gold prices. The paper shows that ICA-GRUNN produces predictions with high accuracy and outperforms the benchmark methods, namely ARIMA, the radial basis function neural network (RBFNN), the long short-term memory neural network (LSTM), GRUNN and ICA-LSTM. (Cao and Wang, 2019) propose a convolutional neural network (CNN) and a CNN-support vector machine (SVM) to forecast a stock index. The paper concludes that the neural network method for financial prediction can handle continuous and categorical variables and yields good prediction accuracy. (Sezer et al., 2020) present a review of financial time series forecasting methods. Machine Learning (ML) models (Artificial Neural Networks (ANNs), Evolutionary Computations (ECs), Genetic Programming (GP), and Agent-based models) and Deep Learning (DL) models (Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), and Long Short-Term Memory (LSTM)) have appeared within the field, with results that significantly outperform their traditional ML counterparts for data like market indices, foreign exchange rates, and commodity prices.

(Hewamalage et al., 2021) provide a big picture of the prospective directions for time series forecasting using recurrent neural networks. The authors present an extensive study and provide a software framework of existing RNN architectures for forecasting, such as the Stacked architecture and Sequence to Sequence (S2S), for three RNN units including the Elman RNN, LSTM and GRU. The main difference between S2S and Stacked with the dense layer and the moving window input format is that the former calculates the error at each time step whereas the latter calculates the error only for the last time step.

In (Chen et al., 2021), a multivariate model is proposed to explain the Bitcoin price with other economic and technology variables, using both traditional statistical and deep learning models. They use an RNN (Recurrent Neural Network) and RF (Random Forest) in the first stage to detect the importance of input variables. They propose a wide range of models for the second stage, such as ARIMA (Autoregressive Integrated Moving Average), SVR (Support Vector Regression), GA (Genetic Algorithm), LSTM (Long Short-Term Memory) and ANFIS (Adaptive Network Fuzzy Inference System), for Bitcoin forecasts over different periods.

Our research fills the gap of the lack of modern deep learning research for predicting funds' risk-adjusted performance measured by the Sharpe ratio. The results of our research show that deep learning and ensembling are promising approaches to addressing the challenge of predicting mutual funds' performance.

3. Methodology

In this section, we present the methodology employed for this research, from the cross-validation scheme and model selection to model training and other techniques used to ensure the robustness of our results.

3.1. Cross-validation scheme

Cross-validation has been widely used in machine learning to estimate models' prediction errors, as described in (Ruppert, 2004). This method estimates the mean out-of-sample error, also known as the average generalization error (Ruppert, 2004).
For normal machine learning problems, where there is no time dimension, k-fold cross-validation is the common cross-validation technique applied in practice. This technique requires the dataset to be randomly divided into a train set and a validation set, in k different settings. The random division makes each element of the dataset exist in the validation set exactly once and in the training set at least once.

In time series forecasting, there is a time dimension, and care must be taken to ensure that time-dependent characteristics are preserved during dataset setup for model evaluations. For example, we cannot simply and randomly assign an element to the validation set, since this may interrupt the time order between observations and lead to data leakage, where future data may be used to predict past data.

The study of (Bergmeir and Benítez, 2012) showed that the use of cross-validation in time series forecasting led to more robust model selection. In this research, we follow the rolling origin scheme to divide our dataset into six cross-validation splits. Our choice of six cross-validation splits is identical to the study of (Fiorucci and Louzada, 2020). To keep the computational costs of training deep learning models within reasonable limits, it is beyond the scope of our study to evaluate the effects of different numbers of cross-validation splits on the study results.

Models are trained on each training set and then evaluated on the respective validation set. The error measured in each validation set is the estimated generalization error for that particular fold. Averaging the generalization errors estimated for all the folds yields the estimated generalization error of the model. Model selection is then carried out by comparing average generalization errors across all the models trained.

Specifically, our cross-validation scheme is a slightly adjusted version of Generalized Rolling Origin Evaluation (GROE), as described in (Fiorucci and Louzada, 2020). Below are the properties of this GROE scheme in connection to our study:

• p denotes the number of origins, where an origin is the index of the last observation of a train set. The origins are referred to as n1, n2, ..., np, which can be found recursively through the equation ni+1 = ni + m. In this equation, m represents the number of data points between two consecutive origins. In our study, the number of origins p equals six, which means the original data are split in six different ways to produce six train/validation set pairs.

• H represents the different forecast horizons, which are 6, 9, 12, 18, and 24 months in our study.

• n is the length of each time series, and n1 is equal to n − H if n − H is greater than or equal to a certain threshold.

• m can be calculated as the biggest integer lower than or equal to H/p, which is problem dependent.

We recognize that strictly following GROE may lead to validation sets of different sizes. We believe that setting the length of all the validation sets equal, and equal to the forecast horizon, makes the validation sets better represent the actual forecast dataset encountered when applying a forecasting model in practice. Therefore, we adjust the method slightly by first determining the last train/validation sets and then working backwards to determine the other train/validation sets. In our method, the last validation set is defined to consist of the last H observations in the dataset, and the corresponding train set consists of all the observations in the dataset prior to the validation set. We then shift the origin of the last train set, which is n − H, back in time m data points to find the origins of the previous train sets, i.e. using ni = ni+1 − m.

This modification ensures all validation sets have the same length, which is exactly equal to the forecast horizon. As an example, for the forecast horizon of 18 months, our cross-validation scheme results in the division of train and validation sets shown in Figure 1.

Figure 1: Train and validation sets for a forecast horizon of 18 months
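To make the adjusted scheme concrete, the following sketch (with hypothetical helper names) derives the train/validation index ranges by working backwards from the last origin, exactly as described above:

    def rolling_origin_splits(n, H, p=6):
        """Index ranges for the adjusted GROE scheme: p train/validation
        pairs, each validation set exactly H observations long."""
        m = H // p                     # biggest integer <= H / p
        splits = []
        origin = n - H                 # origin of the last train set
        for _ in range(p):
            train = (0, origin)                 # all observations up to the origin
            validation = (origin, origin + H)   # the next H observations
            splits.append((train, validation))
            origin -= m                # shift the origin back m data points
        return list(reversed(splits))

    # For example, an 18-month horizon (H = 18, m = 3) on a 240-month series
    # yields a last validation set covering months 223-240, with each earlier
    # origin shifted back 3 months at a time.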
3.2. Recurrent neural networks

In this research, we employ two modern types of recurrent neural networks (RNNs), known as LSTM and GRU, and compare their effectiveness in forecasting fund performance against their statistical counterparts. In this section, we present the unit design and mathematical fundamentals of these techniques. In addition to LSTM and GRU, we also describe vanilla RNNs, which provide the foundation for the more advanced RNN architectures.

3.2.1. Vanilla recurrent neural networks

RNNs are designed to address the problem of learning from and predicting sequences. They have been widely applied in the field of natural language processing. In the following figure and equations, we illustrate the internal workings of RNNs. The basic RNN cell has the structure shown in Figure 2.
Figure 2: A basic recurrent cell

Ht = ϕ(Ht−1 · Wh1 + Xt · Wh2 + bh)   (1)

Ot = Ht · Wo + bo   (2)

In equation 1, Ht denotes the hidden state of the RNN cell, and Xt denotes the input of the cell at the current time step (i.e. time step t). Wh1 and Wh2 are matrices of weights, whereas bh denotes the bias vector for the hidden state. ϕ is the activation function of the hidden state. This equation describes the recurrent computation which gives rise to the term "recurrent", where the current hidden state Ht is computed based on the previous hidden state Ht−1 and the current input Xt. This recurrence in computation gives the hidden state Ht the ability to store information of the sequence up to time step t. In a similar way, the hidden state of the next time step Ht+1 is computed based on Ht and the input Xt+1 at the next time step.

Equation 2 shows how the output is computed using the hidden state of the current time step. In equation 2, Ot denotes the output of the cell at time step t, Wo is the weight matrix, and bo represents the bias vector for the cell output.

Even though the recurrent computation makes it possible for RNNs to carry past information into the current time step, RNNs have limited capability in handling long-term dependencies in sequential data. This type of network suffers from the problems of vanishing and exploding gradients, making learning over long sequences difficult. Vanishing gradients happen when gradients during backpropagation become vanishingly small and the weights cannot be updated adequately, whereas exploding gradients occur when large gradients accumulate during backpropagation, which results in unstable models as model weights receive very large updates.
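As a concrete illustration of equations (1) and (2), here is a minimal NumPy sketch of a forward pass through a basic recurrent cell, assuming tanh as the activation ϕ; the names and shapes are chosen for illustration only:

    import numpy as np

    def rnn_forward(X, Wh1, Wh2, bh, Wo, bo):
        """Run a basic RNN over a sequence X of shape (T, input_dim)."""
        H = np.zeros(Wh1.shape[0])       # initial hidden state
        outputs = []
        for x_t in X:
            # Equation (1): hidden state from previous state and current input
            H = np.tanh(H @ Wh1 + x_t @ Wh2 + bh)
            # Equation (2): output from the current hidden state
            outputs.append(H @ Wo + bo)
        return np.array(outputs), H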
3.2.2. Long short-term memory

Long short-term memory (LSTM), introduced in (Hochreiter and Schmidhuber, 1997), extends the capacity of vanilla RNNs to remember and effectively handle longer sequences. The computation of the hidden state in an LSTM is illustrated in Figure 3.

Figure 3: Computation of the hidden state in LSTM

In Figure 3, there are three gates designed to regulate information flow through the memory cell. Specifically, the input gate controls how much information to read into the cell, the forget gate decides how much of the past information to forget and how much to retain, and the output gate reads output from the cell. This design enables the cell to decide when to remember and when to ignore inputs, which is essential in remembering useful information and forgetting less useful information.

Mathematically, these gates are computed as follows:

It = σ(Ht−1 · Wi1 + Xt · Wi2 + bi)
Ft = σ(Ht−1 · Wf1 + Xt · Wf2 + bf)
Ot = σ(Ht−1 · Wo1 + Xt · Wo2 + bo)

In the above equations, It, Ft and Ot represent the input gate, forget gate and output gate at the current time step t. Xt is the input at time t and Ht−1 is the hidden state at the previous time step. The Ws represent the weight matrices and the bs are bias vectors. σ is the sigmoid activation function, which makes the values of these gates lie in the range (0, 1).

The memory cell and the hidden state are computed as follows:

C̃t = tanh(Ht−1 · Wc1 + Xt · Wc2 + bc)
Ct = Ft ⊙ Ct−1 + It ⊙ C̃t
Ht = Ot ⊙ tanh(Ct)

In the above equations, C̃t represents the candidate memory cell and Ct represents the memory cell content. As in the equations of the gates, the Ws represent the weight matrices and the bs are bias vectors. The computation of the candidate memory cell is similar to those of the gates, except that it uses the tanh activation function instead of the sigmoid.

The candidate memory cell captures both past information from Ht−1 and information from the current input Xt.
The computation of the memory cell content Ct is based on the past memory cell state, represented by Ct−1, and the current memory cell candidate, which serves as current input. The element-wise matrix multiplication denoted by ⊙ makes it possible to control how much of the past information is forgotten (by multiplying Ft by Ct−1) and how much of the current input is retained (by multiplying It by C̃t). Both Ft and It use the sigmoid as the activation function, which produces values in the range (0, 1) to control how much information is discarded/retained in the element-wise matrix multiplication. For example, when the values of Ft are close to zero, the element-wise matrix multiplication between Ft and Ct−1 makes past information close to zero; in other words, past information becomes forgotten.

The computation of the hidden state Ht depends on the memory cell Ct, and how much of the memory cell is passed as output is controlled by the output gate Ot.

Our study uses the above-described design for LSTM, in contrast to (Hewamalage et al., 2021), which uses LSTM with peephole connections. We also use stacked architectures for both LSTM and GRU, where the term "stacked" means that different recurrent layers can be stacked on top of each other.

3.2.3. Gated recurrent units

Gated recurrent units (GRU) (Cho et al., 2014) were introduced almost two decades after LSTM. GRU possess a similar but simpler architecture than that of LSTM, which gives them the advantage of faster computation. GRU also use gates to regulate information flow, but contain only the reset gate and the update gate. The computation of the hidden state in GRU is described in Figure 4.

Figure 4: Computation of the hidden state in GRU

Zt = σ(Ht−1 · Wz1 + Xt · Wz2 + bz)
Rt = σ(Ht−1 · Wr1 + Xt · Wr2 + br)

The above two equations show the mathematical mechanisms of the two gates in GRU: the update gate (Zt) and the reset gate (Rt). The computation of both the update gate and the reset gate at the current time step t is based on the hidden state at the previous time step Ht−1 and the current input Xt. The Ws are the weight matrices and the bs are the bias vectors. σ is the sigmoid activation function, which makes the values of these gates lie in the range (0, 1).

The next equations show the computation of the candidate hidden state (H̃t) and the hidden state (Ht):

H̃t = tanh((Rt ⊙ Ht−1) · Wh1 + Xt · Wh2 + bh)
Ht = (1 − Zt) ⊙ H̃t + Zt ⊙ Ht−1

The candidate hidden state at the current time step, H̃t, captures both past information from the previous hidden state Ht−1 and information from the current input Xt. The element-wise matrix multiplication of Ht−1 and Rt makes it possible for the reset gate to control how much of the past information in Ht−1 is retained, as the value of Rt is in the range (0, 1). If, for example, Rt equals one, all of the information of Ht−1 is retained, and if Rt equals zero, all previous hidden state information is discarded. The candidate hidden state, therefore, contains the current input information plus parts of the previous hidden state information, as controlled by the reset gate.

In the second equation, the current hidden state Ht is computed from the current candidate hidden state H̃t and the hidden state from the previous time step, Ht−1. The update gate is used to control how much information from these two components is passed to the current hidden state. When the update gate Zt is close to one, the current hidden state takes most of its information from the previous hidden state, whereas when the update gate Zt is close to zero, the current hidden state takes most of its information from the current candidate hidden state.

A previous study that evaluated the GRU and LSTM algorithms on the sequence modelling tasks of polyphonic music modelling and speech signal modelling found GRU to be comparable to LSTM (Chung et al., 2014). However, it is unclear which of these algorithms performs better in forecasting fund performance. Therefore, our research aims to evaluate and compare these two deep learning methods in the context of fund performance forecasting. Our study implements these methods using the Keras and Tensorflow libraries in Python, where Bayesian optimization is used for hyper-parameter optimization, which will be described in a later section.
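For illustration, a minimal Keras sketch of a stacked two-layer GRU forecaster of the kind described above. The layer sizes, dropout rate and window lengths are placeholders, not the values selected by our hyper-parameter optimization:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    input_len, horizon, n_units = 20, 18, 32   # placeholder values

    def rmse(y_true, y_pred):
        # Root mean square error, the training loss used in this study
        return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

    model = keras.Sequential([
        # First recurrent layer over a window of past Sharpe ratios
        layers.GRU(n_units, return_sequences=True, input_shape=(input_len, 1)),
        layers.Dropout(0.2),
        layers.GRU(n_units),        # second (stacked) recurrent layer
        layers.Dense(horizon),      # multi-step output vector
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss=rmse)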
3.2.4. Ensemble methods

Ensemble methods are ways to combine forecasts from different models with the aim of improving performance. There are different methods to combine forecasts. The simplest method would be averaging the model predictions of the different algorithms. Other methods can make use of weighted averages. The study of (Fiorucci and Louzada, 2020) combines time series forecast models with weights proportional to the quality of each model's in-sample predictions in a cross-validation process. Their findings reveal that forecast results improve compared to methods that use equal weights for combination. However, all of the methods used for combination in their study are statistical methods, and in-sample prediction qualities are used to determine the weights.

Our approach to ensembling forecast results differs from the above study in that, in addition to combining forecasts of top-performing models, we also combine forecasts for models belonging to different approaches, i.e. all deep learning models combined with all statistical models. According to (Petropoulos et al., 2022), combined forecasts work more effectively if the methods that generate the forecasts are diverse, which leads to fewer correlated errors. Another difference is that we use out-of-sample model prediction qualities to determine the weights, which we believe are more objectively representative of models' performance than in-sample prediction qualities. The reason is that deep learning models may easily overfit the training data and produce high-quality in-sample predictions while the corresponding performance can drop significantly on out-of-sample data.

In this study, we compare three ensembling methods (a sketch of the weighted variants follows the list):

• The first method uses equal weights in averaging model forecasts. This is the simplest form of combining forecasts, known as simple averages.

• The second method calculates a weighted average of model forecasts where the weights are inversely proportional to the final out-of-sample MASE (our main forecasting performance metric used for the evaluation of forecasting models) result of each algorithm. The weights represent the overall (or global) average out-of-sample model quality across all time series for each algorithm.

• The third method calculates a weighted average of model forecasts where the weights are inversely proportional to the average out-of-sample MASE result of each algorithm for the particular time series in question. The weights represent the specific (or local) average out-of-sample model quality for each time series in each algorithm.
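A minimal sketch of the inverse-MASE weighting, using hypothetical variable names: forecasts holds one forecast vector per model, and mase holds each model's out-of-sample MASE, computed globally for the second method or per series for the third:

    import numpy as np

    def inverse_mase_ensemble(forecasts, mase):
        """Combine model forecasts with weights inversely
        proportional to each model's out-of-sample MASE."""
        weights = 1.0 / np.asarray(mase, dtype=float)
        weights /= weights.sum()               # normalize weights to sum to one
        return weights @ np.asarray(forecasts) # weighted average forecast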
3.3. Traditional statistical models for time series forecasting

To compare against the deep learning algorithms, we select three well-known traditional statistical methods for time series forecasting, which are ARIMA, ETS, and Theta, as well as the Naive method, which is known to work well for numerous financial and economic time series. The Theta method of forecasting, introduced by (Assimakopoulos and Nikolopoulos, 2000), is a special case of simple exponential smoothing with drift. These four methods are recommended by (Petropoulos et al., 2022) when benchmarking new forecasting methods. Further detail about the four models can be found in (Hyndman and Athanasopoulos, 2018).

In our experiments, we use an optimized version of the Theta model proposed by (Fiorucci et al., 2016). We use the well-known forecast R package, introduced in (Hyndman and Khandakar, 2008), for training the ETS, ARIMA and Naive models, and the forecTheta R package for training the optimized Theta models. These models are trained using the parallel processing capabilities provided by the furrr R package.

4. Experiment setup

4.1. Data description

Our dataset started with the monthly returns of more than 1200 mutual funds investing in listed large-cap equities in the United States, available on Morningstar Direct. Daily returns series are not available. We required at least 20 years of data up to October 2021 and obtained 634 funds. These funds are relatively homogeneous as they share the same investment strategy and are all domiciled in the United States. From the monthly returns series, we computed the annualised Sharpe ratios for each fund on a rolling basis. Table 1 summarises the statistics of these time series.

Table 1: Time series descriptive statistics (annualised returns in percent)

                        Mean      Sd       Min       Max
us45890c7395  Return    4.351   27.069  -129.960    62.540
              Sharpe    0.213    0.542    -1.444     2.132
us5529815731  Return    7.522   23.741   -97.820    70.040
              Sharpe    0.295    0.555    -1.430     1.790
us82301q7759  Return    8.140   22.494  -100.080    69.620
              Sharpe    0.301    0.571    -1.444     2.507
us6409175063  Return    8.804   29.045  -114.100    93.900
              Sharpe    0.304    0.584    -1.288     3.129
us1253254072  Return   15.065   35.222  -156.560   123.460
              Sharpe    0.276    0.533    -1.188     4.768
Riskfree                1.464    1.713     0.011     6.356
Overall       Return    8.226   23.272  -176.220   126.320
Overall       Sharpe    0.290    0.556    -2.446     4.768

Looking at the details, the table illustrates five fund examples with their annualised return (percent) and annualised Sharpe ratio, in terms of their mean, standard deviation, minimum and maximum values. The five funds are respectively the minimum average return, the 25th percentile, the 50th percentile, the 75th percentile and the maximum average return over the period. The risk-free rate of return is the market yield on 3-month U.S. Treasury Securities.
The last two rows represent the overall average return and the overall Sharpe ratio of all the funds used in the sample.

Within the primary "large-cap equities" strategy that categorizes the funds, there are three major sub-strategies, namely growth, value and blend. Those adopting a "growth" sub-strategy focus on growth stocks listed in the United States, i.e., stocks with strong earnings growth potential, whereas a "value" sub-strategy means the investment targets value stocks, those evaluated to be undervalued. "Blend" represents a mix of the growth and value sub-strategies. The study sample of 634 funds includes all three sub-categories mentioned above.

4.2. Data preprocessing

4.2.1. Data splitting

The time series data are split into six train and validation sets using the cross-validation scheme described in section 3.1 of the Methodology. This cross-validation scheme allows for robust and reliable model selection based on the models' average out-of-sample performance.

4.2.2. Removing trend and seasonality

We transform each time series in the train set of each cross-validation split into its stationary form by two commonly used techniques known as log transformation and differencing. Log transformation stabilizes the variance of the time series (Hyndman and Athanasopoulos, 2013). Since negative Sharpe ratios exist, an offset is added to all time series, to ensure each one of them is positive, before applying the log transformation.

While log transformation is effective in handling time series variances, differencing can help remove changes in the level of a time series and is therefore effective in eliminating (or reducing) trend and seasonality (Hyndman and Athanasopoulos, 2013). Subsequent to the log transformations, we examine each time series to see which ones need differencing and the number of differences required to transform them into stationary form. The results show that the majority of the time series are stationary after log transformation, while 68 of them need first differencing. These 68 time series receive first differencing in each cross-validation split, and their last train observation is saved for the later inverse transformation back to their original scales.

4.2.3. Data postprocessing

Since each time series has been transformed prior to entering the modelling stage, the models' outputs are not in the original scale, and this does not allow for a direct comparison of the models' outputs with the validation sets to obtain error metrics. Therefore, the models' predictions are transformed back to the original scales following the inverse of the order applied in the preprocessing steps. In detail, predictions for those time series that were previously differenced receive inverse differencing; after this step, all time series are inverse log-transformed, and then the offset that was previously added is subtracted from each time series.
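A minimal sketch of this preprocessing and its inverse for one series, assuming a NumPy array; the offset choice and the per-series differencing decision are simplified for illustration:

    import numpy as np

    def preprocess(x):
        """Offset to positivity, log-transform, then first-difference."""
        offset = 1.0 - x.min() if x.min() <= 0 else 0.0
        z = np.log(x + offset)
        last_obs = z[-1]            # saved for the inverse transformation
        return np.diff(z), offset, last_obs

    def postprocess(pred_diff, offset, last_obs):
        """Invert differencing (cumulative sum from the last train
        observation), then invert the log transform and the offset."""
        z = last_obs + np.cumsum(pred_diff)
        return np.exp(z) - offset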
4.2.4. Forecasting multiple outputs

Unlike univariate statistical methods, deep learning algorithms can potentially take advantage of cross-learning, in which patterns from multiple time series are learnt to improve the forecast accuracy for individual time series. In this work, we aim to compare forecasting methods for predicting 634 mutual funds, which involves 634 time series of a homogeneous group of funds. Each time series requires multi-step-ahead forecasts for the applicable forecast horizon. Therefore, we approach the problem as a multiple-output and multi-step-ahead forecasting problem, where inputs are fed into the neural networks and, after training, vectors of outputs are produced directly by the model for six, nine, 12, 18 and 24 months ahead. This strategy has been demonstrated to be effective by the works of (Taieb et al., 2012) and (Wen et al., 2017) and was also adopted by (Hewamalage et al., 2021).

Supervised deep learning maps inputs to outputs, where in our time series context, inputs are fed into the model and outputs are produced by the model following a sliding window approach. This sliding window scheme is also used in the work of (Hewamalage et al., 2021). This approach considers each data sample as consisting of an input window of past observations of all time series that is mapped to an output window of future observations which immediately follow the input window in terms of time order. In our experiments, each train set is preprocessed to form multiple sliding input and output windows with these characteristics, where each sliding window at the next consecutive time step has the same form as the previous one but is shifted by one time step. This scheme allows for the use of past (lagged) time series values in predicting future time series values in each data observation.

As patterns are learnt from inputs to predict outputs, we believe it is necessary to set the length of the input sequences to be at least equal to the length of the output sequences. This should give the deep learning algorithms sufficient information from the past to predict the future. The length of each input window should not be too large, since the use of overly long input windows would significantly reduce the number of training samples and probably affect model performance. In our study, we set the lengths of the input windows equal to the length of the forecast horizon plus two months, given that the lengths of the output windows are determined by the chosen forecast horizons. These choices balance the need to cover sufficient information in the input windows and the need to retain as many training samples as possible.
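A minimal sketch of the sliding-window construction for a single series (in our setup, input_len equals the horizon plus two months; the same construction is applied to every series in the train set):

    import numpy as np

    def sliding_windows(series, input_len, horizon):
        """Build (input window, output window) pairs, shifting one step at a time."""
        X, Y = [], []
        n_samples = len(series) - input_len - horizon + 1
        for start in range(n_samples):
            X.append(series[start : start + input_len])
            Y.append(series[start + input_len : start + input_len + horizon])
        return np.array(X), np.array(Y)

    # e.g. an 18-month horizon uses 20-month input windows:
    # X, Y = sliding_windows(train_series, input_len=20, horizon=18)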
4.3. Training deep learning models

The performance of deep learning models depends on a number of factors, including, among others, the right choice of the loss function, the optimization procedure within the training loop, and the outer optimization procedure that selects the best combination (or configuration) of hyper-parameters. Our training procedure emphasizes extensive hyper-parameter tuning based on careful consideration of the hyper-parameter configurations. This section describes these training and optimization procedures in detail.
4.3.1. Loss function

We use RMSE as the loss function for training the deep learning models. This loss function is optimized when training mini-batches of data samples. This type of loss function has the advantage of penalizing large, outlier errors, since large errors result in very large squared errors. This loss function is also the metric that we optimize in the Bayesian optimization loops.

4.3.2. Hyper-parameter search setup

Hyper-parameters refer to the parameters that models cannot learn during the training process but that may have an effect on the models' performance. For RNNs, the number of hidden layers, the units in each hidden layer, the learning rate, etc., are important hyper-parameters. The process of identifying which hyper-parameter configuration results in optimal generalization performance is known as hyper-parameter optimization. We employ Bayesian optimization for this procedure, which will be detailed in the next section.

In our experiments, the following hyper-parameters are optimized:

• Learning rate

• Number of hidden layers

• Units in a hidden layer

• Dropout rate for inputs

• Dropout rate for hidden layers

• Mini-batch size

• Whether to use batch normalization or not

• If using batch normalization, placing it either before or after dropout layers

• Activation functions

• Weight decay

• Number of epochs

Unlike (Hewamalage et al., 2021), we consider the dropout rate an important hyper-parameter for optimization. Dropout is an effective technique for addressing the overfitting problem in deep neural networks (Srivastava et al., 2014). This technique may be even more appropriate for time series forecasting, where datasets are usually relatively small compared to those in other fields such as NLP or Computer Vision.

We also provide an option to employ batch normalization (Ioffe and Szegedy, 2015) to further regularize and improve training speed, which allows for the possible training of deeper networks. Originally, batch normalization was understood to work by reducing internal covariate shift. Later research demonstrated that this understanding was mistaken: the work of (Santurkar et al., 2018) clarified that the effectiveness of batch normalization lies in its ability to make the optimization landscape significantly smoother.

In a training process that uses batch normalization, this normalization is performed for each mini-batch. By using batch normalization, we can use higher learning rates and be more relaxed about network initialization. As a regularizer, in some cases, we may not need to use dropout if batch normalization is already in use. In our work, it is, however, uncertain whether batch normalization can replace the need for dropout, especially when our training data are limited. Therefore, we choose to optimize both of these hyper-parameters.

There is also an additional consideration for the order of batch normalization and dropout in the actual implementation. The combination of batch normalization and dropout works well in some cases but decreases model performance in others, and thus their order should be carefully considered (Li et al., 2019). To work around this issue, and to possibly take advantage of both dropout and batch normalization, in our experiments we add another hyper-parameter that controls whether batch normalization layers are added to the model before or after the dropout layers. The range of dropout rates in our experiments includes zero, which means the case of no dropout is covered.

Another regularization method tuned is weight decay. Weight decay, also called L2 regularization, is probably the most common technique for regularizing parametric machine learning models (Zhang et al., 2021). This technique works by adding a penalty term (or regularization term) to the model's loss function, so that the learning process minimizes the prediction loss plus the penalty term. The updated loss function with weight decay is then given by:

L = Error + ψ Σ(i=1..p) wi²

where the second term is the L2 regularization penalty. Applied to our context, Error denotes the root mean square error from the network outputs, and wi represents the trainable parameters of the network, with p the number of all such parameters.
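In Keras, for instance, this penalty can be attached per layer. A sketch, where the coefficient psi stands in for the tuned weight-decay value rather than our selected one:

    from tensorflow import keras
    from tensorflow.keras import layers

    psi = 1e-4   # weight decay coefficient (a tuned hyper-parameter)
    gru_layer = layers.GRU(
        32,
        kernel_regularizer=keras.regularizers.l2(psi),     # penalize input weights
        recurrent_regularizer=keras.regularizers.l2(psi),  # penalize recurrent weights
    )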
Another important hyper-parameter is the number of hidden layers, which controls the network depth. The use of the regularization methods described above can help reduce the models' overfitting, which in turn may allow deeper networks to be trained. Therefore, we set the range of the number of hidden layers from one to five, with five hidden layers representing a quite deep network given the limited amount of training data.
number of hidden layers from one to five, with five hidden 4.3.6. Bayesian optimization
layers representing a quite deep network given the limited Bayesian optimization (BO) is a modern hyper-parameter
amount of training data. optimization technique that is effective for searching through
We use Adam as the optimizer for the model train- a large search space that may involve a large number of
ing process. Adam is a computationally efficient gradient- hyper-parameters. It has demonstrated superior perfor-
based optimization method for optimizing stochastic ob- mance to the random search and grid search approach in a
jective functions, which is demonstrated to work well in variety of settings (Wu et al., 2019; Putatunda and Rama,
practice and compares favorably to other stochastic opti- 2019).
mization approaches (Kingma and Ba, 2014). The essence of BO is the construction of a probabilistic
In addition to the above, we choose to optimize other surrogate model that models the true objective function,
hyper-parameters including learning rate, mini-batch size, and the use of this surrogate model together with the ac-
units which represent the dimensionality of the output quisition function to guide the search process. Its proce-
space in a hidden layer, number of epochs, and activation dure first defines an objective function to optimize, which
functions. For each numeric hyper-parameter, as appro- may be the loss function, or some other function deemed
priate, we try to include a wide range of values for the more appropriate for model selection. In training models
search, without making the computation cost too high. where the evaluation of the objective function is costly,
For example, the learning rate ranges between 0.001 and the surrogate which is cheaper to evaluate is used as an
0.1, and the dropout rate ranges between 0.0 and 1.0. To approximation of the objective function (Bergstra et al.,
control the computation cost within reasonable limits, we 2011).
limit the range of the number of epochs to between 1 and The whole process of Bayesian optimization is sequen-
30. tial by nature since the determination of the next promis-
ing points to search is dependent on what is already known
4.3.3. Hyper-parameter optimization about the previously searched points. In addition to the
Prior studies have investigated different methods for surrogate model, another important component of BO is
executing this optimization procedure. Among these, grid the acquisition function, which is designed to guide the
search, random search and Bayesian optimization provide search toward potential low values of the objective func-
three alternatives for optimizing hyper-parameters. Bayesian tion (Archetti and Candelieri, 2019). The acquisition func-
optimization has been demonstrated to be more effective tion allows for the balance between exploitation and ex-
and will be chosen in our experiments. ploration, where exploitation means that searching is per-
formed near the region of the current best points and ex-
4.3.4. Grid search ploration refers to the searching in the regions that have
The grid search method finds the optimal hyper-parameter not been explored.
configuration based on a predetermined grid of values. The initial probabilistic surrogate model is constructed
This requires careful consideration in the choice of grid by fitting a probability model over a sample of points se-
values, as there might be no limit to the number of possi- lected by random search or some other sampling method.
ble hyper-parameter configurations. For example, a learn- In the next step, a new promising point to search is iden-
ing rate that ranges between 0.001 and 0.1 can take count- tified using the acquisition function. The objective func-
less possible values, and when combined with several other tion is then evaluated and then the probabilistic surrogate
hyper-parameters, the search space would be so vast, such model is updated with the new information. The next step
that covering all possible configurations in the search is im- uses the acquisition function to suggest a further promis-
possible. To limit the choice of hyper-parameters in a grid, ing point. This process is repeated in a sequential manner
we can rely on expert knowledge and experience, but this until some termination condition is satisfied.
cannot guarantee success in many cases, especially when Gaussian Process (GP) is a popular choice for the sur-
there are a large number of hyper-parameters for tuning rogate model. The GP can be understood as a collection
in deep learning models. of random variables which satisfies the condition that if
4.3.5. Random search
Random search narrows down the number of possible hyper-parameter configurations to evaluate by selecting a random subset of all possible configurations. By narrowing down the search space, this algorithm reduces training time and has proved to be effective in many cases (Putatunda and Rama, 2019). The studies of Bergstra and Bengio (2012) and Putatunda and Rama (2019) show that random search is more effective than grid search as a hyper-parameter optimization method.
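The contrast between the two strategies can be made concrete with scikit-learn's parameter utilities (an illustration over a toy grid of our own choosing, not the configuration used in our experiments):

    import numpy as np
    from sklearn.model_selection import ParameterGrid, ParameterSampler

    # A toy grid: 3 x 3 x 30 = 270 candidate configurations.
    grid = {
        "learning_rate": [0.001, 0.01, 0.1],
        "dropout_rate": [0.0, 0.25, 0.5],
        "epochs": list(range(1, 31)),
    }

    # Grid search evaluates every combination exhaustively.
    all_configs = list(ParameterGrid(grid))

    # Random search evaluates only a random subset of the space.
    some_configs = list(ParameterSampler(grid, n_iter=20,
                                         random_state=np.random.RandomState(42)))

Evaluating 20 sampled configurations instead of all 270 cuts the training budget by more than an order of magnitude, which is the practical appeal of random search.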
4.3.6. Bayesian optimization
Bayesian optimization (BO) approximates an objective function with a probabilistic surrogate model. The objective may be the loss function, or some other function deemed more appropriate for model selection. In training models where the evaluation of the objective function is costly, the surrogate, which is cheaper to evaluate, is used as an approximation of the objective function (Bergstra et al., 2011).
The whole process of Bayesian optimization is sequential by nature, since the determination of the next promising points to search depends on what is already known about the previously searched points. In addition to the surrogate model, another important component of BO is the acquisition function, which is designed to guide the search toward potential low values of the objective function (Archetti and Candelieri, 2019). The acquisition function balances exploitation and exploration, where exploitation means that the search is performed near the region of the current best points, and exploration refers to searching in regions that have not yet been explored.
The initial probabilistic surrogate model is constructed by fitting a probability model over a sample of points selected by random search or some other sampling method. In the next step, a new promising point is identified using the acquisition function. The objective function is then evaluated at that point, and the probabilistic surrogate model is updated with the new information. The next step uses the acquisition function to suggest a further promising point. This process is repeated sequentially until some termination condition is satisfied.
The Gaussian Process (GP) is a popular choice for the surrogate model. A GP can be understood as a collection of random variables satisfying the condition that any finite number of them has a joint Gaussian distribution (Archetti and Candelieri, 2019). Another surrogate model is the Tree-structured Parzen Estimator (TPE). Unlike the GP, which models P(y|x) directly, the TPE approach models P(x|y) and P(y) (Bergstra et al., 2011), where x represents the hyper-parameters and y represents the associated evaluation score of the objective function. The hyper-parameter search space can be defined by a generative process, in which the TPE replaces the prior distributions of the hyper-parameter configuration with specific non-parametric densities; the substitutions become a learning algorithm that then creates various densities over the search space (Bergstra et al., 2011).
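Independently of the choice of surrogate, the sequential procedure described above can be written schematically as follows (a sketch only: the objective, sampler, surrogate fitter and acquisition maximizer are passed in as placeholders rather than taken from any particular library):

    # Schematic Bayesian optimization loop (minimization).
    def bayesian_optimization(objective, sample_random, fit_surrogate,
                              maximize_acquisition, n_init=5, n_iter=50):
        # Initialize the surrogate's data by evaluating the costly
        # objective on a few randomly sampled configurations.
        points = [sample_random() for _ in range(n_init)]
        scores = [objective(p) for p in points]
        for _ in range(n_iter):
            # Fit the cheap probabilistic surrogate (e.g. a GP or TPE).
            surrogate = fit_surrogate(points, scores)
            # The acquisition function balances exploitation against
            # exploration and proposes the next promising configuration.
            candidate = maximize_acquisition(surrogate)
            # Evaluate the costly objective only at the proposed point,
            # then loop: the new observation updates the surrogate.
            points.append(candidate)
            scores.append(objective(candidate))
        best = min(range(len(scores)), key=scores.__getitem__)
        return points[best], scores[best]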
The TPE algorithm is implemented in the well-known hyperopt Python package (Bergstra et al., 2013). In our experiments, we utilize this library for hyper-parameter optimization using the TPE algorithm within the Bayesian optimization framework, where the number of iterations is set to 800 for each deep learning method. All deep learning models are trained on a server with the following characteristics: 8-core CPU, 16 GB RAM and Linux Ubuntu 20.04.4.
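For reference, a tuning call of this kind takes roughly the following form (a sketch: the two-parameter space and the analytic objective are toy stand-ins; in the experiments the objective trains a network and returns its validation loss):

    import math
    from hyperopt import Trials, fmin, hp, tpe

    space = {
        "learning_rate": hp.loguniform("learning_rate",
                                       math.log(0.001), math.log(0.1)),
        "dropout_rate": hp.uniform("dropout_rate", 0.0, 1.0),
    }

    def objective(config):
        # Toy stand-in for the real objective (a validation loss).
        return (config["learning_rate"] - 0.01) ** 2 \
            + 0.1 * config["dropout_rate"]

    best = fmin(fn=objective, space=space,
                algo=tpe.suggest,    # Tree-structured Parzen Estimator
                max_evals=800,       # 800 iterations per method, as above
                trials=Trials())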
4.4. Forecast accuracy measures
In this section, we present the metrics used to compute forecast accuracy, which enable performance comparison among the models studied.
Let X_t denote the observation at time t and F_t denote the forecast of X_t. The forecast error is then e_t = X_t - F_t. Suppose we have k forecasts and the corresponding observations at each forecast period. Following the notation of Hyndman and Koehler (2006), we write mean(x_t) for the sample mean of {x_t} over the sample and, analogously, median(x_t) for the sample median.
The most commonly used scale-dependent measures are based on absolute or squared errors:

    Mean Square Error: MSE = mean(e_t^2)
    Root Mean Square Error: RMSE = sqrt(mean(e_t^2))
    Mean Absolute Error: MAE = mean(|e_t|)

Often, the RMSE is preferred to the MSE as it is on the same scale as the data. Historically, the RMSE and MSE have been popular, largely because of their theoretical relevance in statistical modelling. However, they are more sensitive to outliers than the MAE or the MDAE (Median Absolute Error).
Compared to absolute errors, percentage errors have the advantage of being scale-independent, and so are frequently used to compare forecast performance across different datasets. Define the percentage error as p_t = 100 e_t / X_t. The Symmetric Median Absolute Percentage Error is

    SMDAPE = median(200 |X_t - F_t| / (X_t + F_t))

The problems arising from small values of X_t may be less severe for SMDAPE. However, when X_t is close to zero, F_t is usually also likely to be close to zero, so the measure still involves division by a number close to zero.
The scaled error is defined as

    q_t = e_t / ( (1/(n-1)) * sum_{i=2..n} |X_i - X_{i-1}| )

which is independent of the scale of the data. A scaled error is less than one if it arises from a better forecast than the average one-step Naive forecast computed in-sample. Conversely, it is greater than one if the forecast is worse than the average one-step Naive forecast computed in-sample (see Hyndman and Koehler, 2006). The best-known scaled-error measure is the Mean Absolute Scaled Error:

    MASE = mean(|q_t|)

When MASE < 1, the proposed method gives, on average, smaller errors than the one-step errors from the Naive method. If multi-step forecasts are being computed, it is possible to scale by the in-sample MAE computed from multi-step Naive forecasts.
Kim and Kim (2016) investigated the mean arctangent absolute percentage error (MAAPE) and demonstrated its practical advantages:

    MAAPE = (1/N) * sum_{t=1..N} AAPE_t,  where AAPE_t = arctan(|(X_t - F_t) / X_t|)

MAAPE remains finite even when the response variable (i.e., X_t) equals zero, and it has a convenient trigonometric representation. However, because MAAPE's value is expressed in radians, it is less intuitive to interpret. Note that MAAPE does not require a symmetric version, since division by zero is no longer a concern. MAAPE is also scale-free because its values are expressed in radians.
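A direct NumPy implementation of these measures might look as follows (our own helper functions, written from the definitions above; insample denotes the in-sample series used to scale the errors for MASE):

    import numpy as np

    def rmse(actual, forecast):
        return np.sqrt(np.mean((actual - forecast) ** 2))

    def mae(actual, forecast):
        return np.mean(np.abs(actual - forecast))

    def smdape(actual, forecast):
        # Symmetric Median Absolute Percentage Error.
        return np.median(200 * np.abs(actual - forecast)
                         / (actual + forecast))

    def mase(actual, forecast, insample):
        # Scale by the in-sample MAE of the one-step Naive forecast.
        scale = np.mean(np.abs(np.diff(insample)))
        return np.mean(np.abs((actual - forecast) / scale))

    def maape(actual, forecast):
        # Mean Arctangent Absolute Percentage Error, in radians;
        # arctan keeps the value finite even when actual is zero.
        return np.mean(np.arctan(np.abs((actual - forecast) / actual)))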
5. Findings and analysis
In this section, we run and train six models for five different forecast horizons, i.e., 30 models in total. The forecast horizons are 6 months, 9 months, 12 months, 18 months, and 24 months. The six models comprise two deep learning models (LSTM, GRU), three traditional statistical models (ARIMA, ETS, Theta), and the Naive model as a benchmark.
The first subsection below presents the average results over the five forecast horizons for the six models and the ensemble models. The second subsection illustrates further details using the 18-month forecast horizon.

5.1. Average of multiple forecast horizons
5.1.1. Comparison of deep learning vs statistical models
Table 2 provides the averages of the accuracy measures over the five forecast horizons for each of the five accuracy measures (MASE, RMSE, MAE, SMDAPE and MAAPE) and each of the six models. It is evident that the LSTM outperforms all other models, followed by the GRU, so the deep learning models take the first two places. The next places go to the ARIMA, ETS, and Theta models, respectively. The worst model, perhaps surprisingly, is the Naive model. Regardless of the forecast horizon, the LSTM model is the best model on average and produces the most consistent results across different accuracy measures, as shown in Table 6 in the appendix.

    Algorithm   MASE    RMSE    MAE     SMDAPE    MAAPE
    LSTM        1.510   0.441   0.363    74.204   59.340
    GRU         1.546   0.453   0.371    81.481   59.791
    ARIMA       1.815   0.533   0.435    96.504   63.207
    ETS         1.835   0.535   0.441   105.028   65.012
    Theta       1.961   0.566   0.471   106.827   67.567
    Naive       4.176   1.938   1.437   120.213   75.317

Table 2: Average accuracy measures over five forecast horizons.
Figure 5: Comparison of MASE across all algorithms and forecast horizons.

Figure 5 shows the MASE accuracy measure; the RMSE and MAE results are reported in Table 6 in the appendix. The results confirm the outperformance of the LSTM model, as its MASE stays lowest across the six models and the different forecast horizons, except for the six-month horizon, for which the GRU slightly outperformed the LSTM. Across the five horizons, the LSTM produces its lowest MASE (1.363) for the 12-month forecast horizon.
Table 6 also reports the RMSE, MAE, SMDAPE and MAAPE results. We conclude that LSTM is the best model across different algorithms and across different forecast horizons. Again, the 12-month horizon receives the best forecasting accuracy for the first two metrics and the nine-month horizon for the last two. The Naive model consistently produces the least accurate forecasts across algorithms and forecast horizons, especially for the longest horizon of 24 months.

5.1.2. Ensemble models

    Algorithm          MASE    RMSE    MAE     SMDAPE    MAAPE
    All algorithms
      simple average   2.208   0.678   0.530    94.532   63.984
      global weights   1.852   0.553   0.445    91.495   62.839
      local weights    1.785   0.529   0.428    91.004   62.567
    Deep learning
      simple average   1.469   0.431   0.353    74.315   58.429
      global weights   1.468   0.430   0.352    74.261   58.418
      local weights    2.056   0.594   0.493   127.001   67.196
    Stats
      simple average   2.721   0.841   0.654   110.821   68.312
      global weights   2.119   0.631   0.509   108.158   66.826
      local weights    2.074   0.609   0.498   122.540   66.737

Table 3: Average accuracy measures of the ensemble methods over five forecast horizons.

Table 3 provides the results of the ensemble methods applied to combine forecasts by different groups of models: all models (All algorithms in the table), LSTM and GRU (Deep learning), and the statistical models (Stats). For each ensembling group, we use a set of combination methods comprising the simple average, the weighted average with global weights (global weights in the table), and the weighted average with local weights (local weights in the table), sketched in code below.
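As an illustration, the three combination schemes can be sketched as follows (our own notation: forecasts holds one row of predictions per model, and the weights are assumed to have been estimated beforehand, for example from each model's accuracy over a validation period; the exact weight-estimation procedure is not reproduced here):

    import numpy as np

    # forecasts: array of shape (n_models, horizon), one row per model.
    def simple_average(forecasts):
        return forecasts.mean(axis=0)

    def weighted_global(forecasts, model_weights):
        # One weight per model, shared across all series and horizons.
        w = np.asarray(model_weights, dtype=float)
        return (w / w.sum()) @ forecasts

    def weighted_local(forecasts, weight_matrix):
        # A separate weight per model and per step (or per series);
        # each column of weight_matrix is assumed to sum to one.
        return np.sum(np.asarray(weight_matrix) * forecasts, axis=0)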
The best results come from the ensemble models of LSTM and GRU; the second best is All algorithms. The ensembles of All algorithms provide better accuracy on all error metrics than ensembling using only the three statistical methods. Furthermore, for the deep learning methods, the simple average and global weights ensembles provide lower error metrics than the individual models. For example, the means (across the five forecast horizons) of MASE for LSTM and GRU are 1.510 and 1.546 respectively, whereas the means for the simple average and global weights ensembles are only 1.469 and 1.468. The same pattern holds for the other error metrics (RMSE, MAE, SMDAPE, and MAAPE). However, the local weights ensemble of GRU and LSTM is worse than the individual models (true for MASE, RMSE, MAE, SMDAPE, and MAAPE), so for this weighting scheme the ensemble over All algorithms results in better accuracy measures.
While the ensemble of All algorithms provides much better accuracy measures than the ensemble of statistical models, it is not as good as the original individual deep learning models.
Initially, one would expect the ensemble of all models to provide the best accuracy measures since, theoretically, a more diverse set of models can result in less correlated errors. However, as shown by our analysis, the ensemble of GRU and LSTM using global weights provides the best model, with the lowest mean for all error metrics.
The relative performance of the ensembling methods differs across ensembling groups and forecast horizons. For the All algorithms ensembles, across all forecast horizons (see Table 7), the local weights method yields the lowest error measures. For the ensembles of the deep learning models, the best choice is global weights for all forecast horizons except the 24-month horizon. For the ensembles of the Stats models, global weights is the best choice only for the 6, 9 and 12-month horizons.
Further details on each forecast horizon for the six original models and the nine ensemble models (three ensembling groups and three weighting methods) are provided in Tables 6 and 7 in the appendix.
The next subsection further illustrates the detailed results obtained for the forecast horizon of 18 months. Detailed analysis for the other forecast horizons can be performed in a similar manner using the results provided in the appendices.

5.2. Forecast horizon of 18 months
5.2.1. Comparison of deep learning vs statistical models
For each algorithm, we first calculate the average accuracy metrics for each time series across the cross-validation splits. The results for all time series of an algorithm are then averaged to yield the final metric values representing the performance of that particular algorithm. For the latter step, we also report the median and standard deviation in addition to the mean.
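This two-stage aggregation can be sketched with pandas (a sketch under the assumption that the per-split results are stored in a long-format DataFrame with columns algorithm, series_id and one column per metric; the storage format in our experiments may differ):

    import pandas as pd

    def summarize(results: pd.DataFrame, metric: str = "mase") -> pd.DataFrame:
        # Stage 1: average the metric over the cross-validation splits
        # within each (algorithm, time series) pair.
        per_series = (results
                      .groupby(["algorithm", "series_id"])[metric]
                      .mean()
                      .reset_index())
        # Stage 2: summarize across time series for each algorithm,
        # reporting the mean, median and standard deviation.
        return per_series.groupby("algorithm")[metric] \
                         .agg(["mean", "median", "std"])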
Table 4 presents the accuracy metrics measured for the two deep learning methods (LSTM, GRU), the three statistical methods (ARIMA, ETS, and Theta), and the benchmark Naive model.

    Algorithm   LSTM     GRU      ARIMA    ETS      Theta    Naive
    MASE
      mean       1.425    1.428    1.539    1.721    1.762    2.760
      median     1.374    1.384    1.500    1.700    1.699    1.684
      sd         0.177    0.153    0.172    0.159    0.225    3.234
    RMSE
      mean       0.422    0.422    0.454    0.503    0.513    0.830
      median     0.415    0.412    0.445    0.498    0.498    0.497
      sd         0.045    0.044    0.044    0.043    0.053    1.019
    MAE
      mean       0.342    0.343    0.370    0.414    0.424    0.660
      median     0.334    0.335    0.363    0.411    0.411    0.408
      sd         0.038    0.033    0.037    0.034    0.049    0.765
    SMDAPE
      mean      95.775  109.261  111.320  133.133  132.561  137.438
      median    88.499  108.577  107.999  130.480  129.424  131.305
      sd        20.007   13.229   19.848   10.339   10.804   17.906
    MAAPE
      mean      74.979   76.324   77.763   81.963   82.685   85.286
      median    74.674   75.891   76.613   82.901   83.061   82.188
      sd         3.839    3.825    4.402    3.217    3.873   12.389

Table 4: Accuracy measures of the six forecasting models for the 18-month horizon.
Overall, LSTM has the best performance, with the smallest mean, median and standard deviation for almost all accuracy metrics, except for the SMDAPE metric, for which the GRU model yields a lower mean and median. The results clearly show that LSTM is significantly superior to the other four forecasting methods. While ARIMA is the best-performing traditional statistical approach, it produces a much lower accuracy level than the GRU model. The metrics' medians and standard deviations confirm the same observation. Therefore, in the following table we present only the mean values of the five accuracy metrics.
It can be concluded that the deep learning methods provide significantly more accurate forecasts than the traditional statistical ones, which confirms our research hypothesis.

5.2.2. Ensemble models for the 18-month horizon

    Algorithm          MASE    RMSE    MAE     SMDAPE    MAAPE
    All algorithms
      simple average   1.535   0.463   0.369   111.130   76.021
      global weights   1.513   0.455   0.363   110.183   75.772
      local weights    1.469   0.440   0.353   109.933   75.565
    Deep learning
      simple average   1.322   0.393   0.318    96.948   73.896
      global weights   1.322   0.393   0.318    96.939   73.900
      local weights    1.519   0.472   0.365   141.321   74.531
    Stats
      simple average   1.794   0.536   0.431   131.529   79.995
      global weights   1.756   0.524   0.422   130.779   79.70
      local weights    1.639   0.493   0.394   142.974   77.915

Table 5: Accuracy measures of the ensemble models for the 18-month horizon.

The best results come from the ensemble models of GRU and LSTM, followed by the All algorithms and Stats ensembling groups. Furthermore, for the deep learning methods, the simple average and global weights ensembles provide lower error metrics than the individual models. For example, the means of MASE for LSTM and GRU are respectively 1.425 and 1.428, whereas the mean for the simple average and global weights ensembles is only 1.322. The same results are also obtained for the other accuracy measures (RMSE, MAE, SMDAPE, and MAAPE). However, the local weights ensemble of GRU and LSTM is worse than the individual models (true for MASE, RMSE, MAE and SMDAPE).

6. Conclusion
The main purpose of this study is to address the challenge of forecasting the performance of multiple mutual funds simultaneously using modern deep learning approaches, with a comparison against popular traditional statistical approaches. The deep learning approaches are studied from the cross-learning perspective, which means that information from various time series is used to improve the predictions of individual time series, and no external features are added to the models. In addition, we use different ensemble methods to combine forecasts generated by models of the traditional and modern approaches. The results show that the ranking order of model quality for the studied methods is: ensemble of deep learning models, LSTM, GRU, ARIMA, ETS, Theta, and Naive.
The results among the ensemble methods vary depending on which models are combined. The best model is the ensemble of the LSTM and GRU models using a weighted average with global weights.
In our study, both the LSTM and GRU models are trained with Bayesian optimization, a modern approach to hyper-parameter optimization that can be effective when evaluating the model for individual configurations of hyper-parameters is costly, a property particularly true for deep learning problems. In this paper, we have outlined a detailed methodology for training these deep learning models using Bayesian optimization, which we believe could be valuable for other research.
We conclude that deep learning models and their ensembles offer promising solutions to the challenge of forecasting the performance of multiple mutual funds as measured by Sharpe ratios.

7. Appendix
    Algorithm    6m       9m       12m      18m      24m
    MASE
      LSTM       1.863    1.483    1.363    1.425    1.417
      GRU        1.835    1.587    1.453    1.428    1.429
      ARIMA      2.252    2.035    1.779    1.539    1.470
      ETS        2.073    1.726    2.104    1.721    1.552
      Theta      2.055    1.687    2.119    1.762    2.183
      Naive      2.977    3.517    2.302    2.760   18.326
    RMSE
      LSTM       0.516    0.438    0.414    0.422    0.416
      GRU        0.519    0.462    0.443    0.422    0.418
      ARIMA      0.637    0.619    0.529    0.454    0.428
      ETS        0.576    0.522    0.610    0.503    0.461
      Theta      0.574    0.504    0.613    0.513    0.628
      Naive      0.890    1.021    0.668    0.830    6.280
    MAE
      LSTM       0.447    0.356    0.328    0.342    0.340
      GRU        0.439    0.381    0.349    0.343    0.343
      ARIMA      0.539    0.488    0.427    0.370    0.353
      ETS        0.496    0.414    0.505    0.414    0.373
      Theta      0.492    0.405    0.509    0.424    0.524
      Naive      0.715    0.847    0.553    0.660    4.409
    SMDAPE
      LSTM      50.581   42.983   60.205   95.775  121.474
      GRU       56.180   45.544   73.905  109.261  122.516
      ARIMA     71.258   76.036   99.108  111.320  124.797
      ETS       59.843   55.429  133.069  133.133  143.664
      Theta     56.510   53.094  132.822  132.561  159.146
      Naive     73.452   85.090  136.191  137.438  168.894
    MAAPE
      LSTM      43.561   40.625   56.486   74.979   81.050
      GRU       40.669   43.733   56.506   76.324   81.720
      ARIMA     46.255   45.339   63.897   77.763   82.783
      ETS       45.489   41.978   71.679   81.963   83.952
      Theta     45.335   42.205   71.980   82.685   95.627
      Naive     51.922   59.664   72.535   85.286  107.177

Table 6: Accuracy measures across the six algorithms and five forecast horizons.

    Horizon  Algorithm          MASE    RMSE    MAE     SMDAPE    MAAPE
    6m       All algorithms
               simple average   2.026   0.566   0.485    57.088   44.819
               global weights   2.017   0.562   0.483    56.973   44.700
               local weights    2.011   0.560   0.482    56.818   44.655
             Deep learning
               simple average   1.737   0.486   0.416    49.372   39.781
               global weights   1.737   0.486   0.416    49.378   39.771
               local weights    3.026   0.817   0.725   118.567   59.910
             Stats
               simple average   2.206   0.623   0.528    62.855   47.206
               global weights   2.195   0.618   0.526    62.800   47.143
               local weights    2.546   0.694   0.610    86.404   51.459
    9m       All algorithms
               simple average   1.754   0.524   0.422    51.380   43.182
               global weights   1.715   0.513   0.412    50.520   42.522
               local weights    1.689   0.506   0.406    49.718   42.029
             Deep learning
               simple average   1.526   0.446   0.366    44.548   42.095
               global weights   1.524   0.446   0.366    44.508   42.042
               local weights    2.406   0.696   0.578   101.304   53.860
             Stats
               simple average   2.040   0.604   0.490    65.303   47.650
               global weights   1.959   0.582   0.470    62.085   46.262
               local weights    2.104   0.626   0.505    82.616   46.953
    12m      All algorithms
               simple average   1.699   0.507   0.408    93.867   63.387
               global weights   1.684   0.503   0.404    92.321   63.099
               local weights    1.680   0.502   0.404    92.168   62.985
             Deep learning
               simple average   1.365   0.419   0.328    61.133   55.451
               global weights   1.363   0.419   0.328    60.908   55.453
               local weights    1.948   0.558   0.468   125.655   70.640
             Stats
               simple average   2.009   0.588   0.483   127.343   69.327
               global weights   2.003   0.586   0.481   126.859   69.202
               local weights    2.039   0.590   0.490   139.245   70.686
    18m      All algorithms
               simple average   1.535   0.463   0.369   111.130   76.021
               global weights   1.513   0.455   0.363   110.183   75.772
               local weights    1.469   0.440   0.353   109.933   75.565
             Deep learning
               simple average   1.322   0.393   0.318    96.948   73.896
               global weights   1.322   0.393   0.318    96.939   73.900
               local weights    1.519   0.472   0.365   141.321   74.531
             Stats
               simple average   1.794   0.536   0.431   131.529   79.995
               global weights   1.756   0.524   0.422   130.779   79.70
               local weights    1.639   0.493   0.394   142.974   77.915
    24m      All algorithms
               simple average   4.025   1.329   0.968   159.198   92.513
               global weights   2.331   0.729   0.560   147.479   88.100
               local weights    2.073   0.634   0.498   146.383   87.601
             Deep learning
               simple average   1.395   0.409   0.335   119.573   80.923
               global weights   1.395   0.409   0.335   119.573   80.924
               local weights    1.382   0.430   0.332   148.157   77.038
             Stats
               simple average   5.558   1.853   1.337   167.076   97.383
               global weights   2.683   0.843   0.645   158.265   91.822
               local weights    2.042   0.640   0.490   161.460   86.667

Table 7: Accuracy measures of the ensemble models across five forecast horizons.
8. Acknowledgement
We acknowledge the support of Hanoi University in providing a server for training the models, and we thank Ms Hoang Anh and Mr Nguyen Ngoc Hieu for their assistance in the initial phase of the project.
References

Ahmed, N. K., Atiya, A. F., Gayar, N. E., and El-Shishiny, H. (2010). An empirical comparison of machine learning models for time series forecasting. Econometric Reviews, 29(5-6):594–621.
Archetti, F. and Candelieri, A. (2019). Bayesian Optimization and Data Science. Springer.
Arnold, T. R., Ling, D. C., and Naranjo, A. (2019). Private equity real estate funds: returns, risk exposures, and persistence. The Journal of Portfolio Management, 45(7):24–42.
Assimakopoulos, V. and Nikolopoulos, K. (2000). The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4):521–530.
Bandara, K., Bergmeir, C., and Smyl, S. (2020). Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Systems with Applications, 140:112896.
Bergmeir, C. and Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences, 191:192–213.
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2).
Bergstra, J., Yamins, D., Cox, D. D., et al. (2013). Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, volume 13, page 20. Citeseer.
Berk, J. B. and van Binsbergen, J. H. (2015). Measuring skill in the mutual fund industry. Journal of Financial Economics, 118(1):1–20.
Bollen, N. P. B. and Busse, J. A. (2005). Short-term persistence in mutual fund performance. The Review of Financial Studies, 18(2):569–597.
Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. John Wiley & Sons.
Cai, Y. (2005). A forecasting procedure for nonlinear autoregressive time series models. Journal of Forecasting, 24(5):335–351.
Cao, J. and Wang, J. (2019). Stock price forecasting model based on modified convolution neural network and financial time series analysis. International Journal of Communication Systems, 32(12):e3987.
Carhart, M. M. (1997). On persistence in mutual fund performance. The Journal of Finance, 52(1):57–82.
Chen, W., Xu, H., Jia, L., and Gao, Y. (2021). Machine learning model for Bitcoin exchange rate prediction using economic and technology determinants. International Journal of Forecasting, 37(1):28–43.
Chiang, W.-C., Urban, T., and Baldridge, G. (1996). A neural network approach to mutual fund net asset value forecasting. Omega, 24(2):205–215.
Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Crone, S. F., Hibon, M., and Nikolopoulos, K. (2011). Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27(3):635–660.
Elton, E. J. and Gruber, M. J. (2020). A review of the performance measurement of long-term mutual funds. Financial Analysts Journal, 76(3):22–37.
Fiorucci, J. A. and Louzada, F. (2020). GROEC: combination method via generalized rolling origin evaluation. International Journal of Forecasting, 36(1):105–109.
Fiorucci, J. A., Pellegrini, T. R., Louzada, F., Petropoulos, F., and Koehler, A. B. (2016). Models for optimising the theta method and their relationship to state space models. International Journal of Forecasting, 32(4):1151–1161.
Hewamalage, H., Bergmeir, C., and Bandara, K. (2021). Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hyndman, R. (2018). A brief history of time series forecasting competitions. URL https://robjhyndman.com/hyndsight/forecasting-competitions.
Hyndman, R. and Athanasopoulos, G. (2013). Forecasting: principles and practice [e-book].
Hyndman, R., Koehler, A. B., Ord, J. K., and Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media.
Hyndman, R. J. and Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts.
Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27:1–22.
Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688.
ICI (2021). 2021 Investment Company Fact Book. Investment Company Institute.
Indro, D., Jiang, C., Patuwo, B., and Zhang, G. (1999). Predicting mutual fund performance using artificial neural networks. Omega, 27(3):373–380.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
Irvine, P. J., Kim, J. H. J., and Ren, J. (2022). The beta anomaly and mutual fund performance. Working paper.
Jianwei, E., Ye, J., and Jin, H. (2019). A novel hybrid model on the prediction of time series and its application for the gold price analysis and forecasting. Physica A: Statistical Mechanics and its Applications, 527:121454.
Kacperczyk, M., Nieuwerburgh, S. V., and Veldkamp, L. (2014). Time-varying fund manager skill. The Journal of Finance, 69(4):1455–1484.
Kahn, R. N. and Rudd, A. (1995). Does historical performance predict future performance? Financial Analysts Journal, 51(6):43–52.
Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Kim, S. and Kim, H. (2016). A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32(3):669–679.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, X., Chen, S., Hu, X., and Yang, J. (2019). Understanding the disharmony between dropout and batch normalization by variance shift.
Li, Z., Han, J., and Song, Y. (2020). On the forecasting of high-frequency financial time series based on ARIMA model improved by deep learning. Journal of Forecasting, 39(7):1081–1097.
Lou, D. (2012). A flow-based explanation for return predictability. The Review of Financial Studies, 25(12):3457–3489.
Makridakis, S. and Hibon, M. (2000). The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451–476.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74.
Melard, G. and Pasteels, J.-M. (2000). Automatic ARIMA modeling including interventions, using time series expert software. International Journal of Forecasting, 16(4):497–508.
Newbold, P. (1983). ARIMA model building and the time series analysis approach to forecasting. Journal of Forecasting, 2(1):23–35.
Pan, W.-T., Han, S.-Z., Yang, H.-L., and Chen, X.-Y. (2019). Prediction of mutual fund net value based on data mining model. Cluster Computing, 22:9455–9460.
Pannakkong, W., Pham, V.-H., and Huynh, V.-N. (2017). A novel hybridization of ARIMA, ANN, and k-means for time series forecasting. International Journal of Knowledge and Systems Science (IJKSS), 8(4):30–53.
Petropoulos, F., Apiletti, D., Assimakopoulos, V., Babai, M. Z., Barrow, D. K., Taieb, S. B., Bergmeir, C., Bessa, R. J., Bijak, J., Boylan, J. E., et al. (2022). Forecasting: theory and practice. International Journal of Forecasting.
Putatunda, S. and Rama, K. (2019). A modified Bayesian optimization based hyper-parameter tuning approach for extreme gradient boosting. In 2019 Fifteenth International Conference on Information Processing (ICINPRO), pages 1–6. IEEE.
Ruppert, D. (2004). The elements of statistical learning: data mining, inference, and prediction.
Santurkar, S., Tsipras, D., Ilyas, A., and Mądry, A. (2018). How does batch normalization help optimization? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 2488–2498.
Semenoglou, A.-A., Spiliotis, E., Makridakis, S., and Assimakopoulos, V. (2021). Investigating the accuracy of cross-learning time series forecasting methods. International Journal of Forecasting, 37(3):1072–1084.
Sezer, O. B., Gudelek, M. U., and Ozbayoglu, A. M. (2020). Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing, 90:106181.
Sharpe, W. F. (1966). Mutual fund performance. The Journal of Business, 39(1):451–476.
Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21(1):49–58.
Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75–85.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Taieb, S. B., Bontempi, G., Atiya, A. F., and Sorjamaa, A. (2012). A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications, 39(8):7067–7083.
Wang, K. and Huang, S. (2010). Using fast adaptive neural network classifier for mutual fund performance evaluation. Expert Systems with Applications, 37(8):6007–6011.
Wen, R., Torkkola, K., Narayanaswamy, B., and Madeka, D. (2017). A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053.
Wu, J., Chen, X.-Y., Zhang, H., Xiong, L.-D., Lei, H., and Deng, S.-H. (2019). Hyperparameter optimization for machine learning models based on Bayesian optimization. Journal of Electronic Science and Technology, 17(1):26–40.
Xu, W., Peng, H., Zeng, X., Zhou, F., Tian, X., and Peng, X. (2019). A hybrid modelling method for time series forecasting based on a linear regression model and deep learning. Applied Intelligence, 49(8):3002–3015.
Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2021). Dive into deep learning. arXiv preprint arXiv:2106.11342.
