MF Performance Prediction
Nghia Chu
Vega Corporation, Tang 3, 106 Hoang Quoc Viet, Nghia Do, Cau Giay, Hanoi, Vietnam
[email protected]
Binh Dao
Hanoi University, Km 9, Nguyen Trai, Nam Tu Liem, Hanoi, Vietnam
[email protected]
Nga Pham
Monash Centre for Financial Studies, 30 Collins St, Melbourne, VIC 3000, Australia
[email protected]
Huy Nguyen
The University of Melbourne, Melbourne Connect 700 Swanston St, Carlton, VIC 3053, Australia
[email protected]
Hien Tran
Hanoi University, Km 9, Nguyen Trai, Nam Tu Liem, Hanoi, Vietnam
[email protected]
Abstract
Predicting fund performance is beneficial to both investors and fund managers and yet is a challenging task. In this
paper, we have tested whether deep learning models can predict fund performance more accurately than traditional
statistical techniques. Fund performance is typically evaluated by the Sharpe ratio, which represents the risk-adjusted
performance to ensure meaningful comparability across funds. We calculated the annualized Sharpe ratios based on the
monthly returns time series data for more than 600 open-end mutual funds investing in listed large-cap equities in the
United States. With five different forecast horizons ranging from 6 to 24 months, we find that long short-term memory
(LSTM) and gated recurrent units (GRU) deep learning methods, both trained using modern Bayesian optimization,
provide higher accuracy in forecasting funds’ Sharpe ratios than traditional statistical ones. An ensemble method which
combines forecasts from LSTM and GRU achieves the best performance of all models. This evidence suggests that
deep learning and ensembling offer promising solutions to the challenge of fund performance forecasting.
Keywords: deep learning, forecasting, time series, fund performance, LSTM, GRU, ensemble.
(Ruppert, 2004). This method estimates the mean out-of-sample error, also known as the average generalization error (Ruppert, 2004).

For standard machine learning problems with no time dimension, k-fold cross-validation is the cross-validation technique commonly applied in practice. This technique requires the dataset to be randomly divided into a train set and a validation set, in k different settings. The random division ensures that each element of the dataset appears in the validation set exactly once and in the training set at least once.

In time series forecasting there is a time dimension, and care must be taken to ensure that time-dependent characteristics are preserved when setting up datasets for model evaluation. For example, we cannot simply assign an element to the validation set at random, since this may break the time order between observations and lead to data leakage, where future data may be used to predict past data.

The study of (Bergmeir and Benítez, 2012) showed that the use of cross-validation in time series forecasting led to more robust model selection. In this research, we follow the rolling origin scheme to divide our dataset into six cross-validation splits. Our choice of six cross-validation splits is identical to the study of (Fiorucci and Louzada, 2020). To keep the computational cost of training deep learning models within reasonable limits, evaluating the effect of different numbers of cross-validation splits on the results is beyond the scope of our study.

Models are trained on each training set and then evaluated on the respective validation set. The error measured on each validation set is the estimated generalization error for that particular fold. Averaging the generalization errors estimated over all the folds yields the estimated generalization error of the model. Model selection is then carried out by comparing average generalization errors across all the models trained.

Specifically, our cross-validation scheme is a slightly adjusted version of Generalized Rolling Origin Evaluation (GROE), as described in (Fiorucci and Louzada, 2020). Below are the properties of this GROE scheme in connection to our study:

• p denotes the number of origins, where an origin is the index of the last observation of a train set. The origins are referred to as n_1, n_2, ..., n_p, which can be found recursively through the equation n_{i+1} = n_i + m. In this equation, m represents the number of data points between two consecutive origins. In our study, the number of origins p equals six, which means the original data are split in six different ways to produce six train/validation set pairs.

• H represents the forecast horizon, which takes the values 6, 9, 12, 18, and 24 months in our study.

• n is the length of each time series, and n_1 is equal to n − H if n − H is greater than or equal to a certain threshold.

• m can be calculated as the largest integer lower than or equal to H/p, which is problem dependent.

We recognize that strictly following GROE may lead to validation sets of different sizes. We believe that setting the length of every validation set equal to the forecast horizon makes the validation sets better represent the actual forecast dataset encountered when a forecasting model is applied in practice. Therefore, we adjust the method slightly by first determining the last train/validation sets, and then working backwards to determine the other train/validation sets. In our method, the last validation set is defined to consist of the last H observations in the dataset, and the corresponding train set consists of all the observations in the dataset prior to the validation set. We then shift the origin of the last train set, which is n − H, back in time by m data points to find the origin of the previous train set, i.e. using n_i = n_{i+1} − m.

This modification ensures that all validation sets have the same length, which is exactly equal to the forecast horizon. As an example, for the forecast horizon of 18 months, our cross-validation scheme results in the division of train and validation sets shown in Figure 1.

Figure 1: Train and validation sets for a forecast horizon of 18 months
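To make the scheme concrete, the following Python sketch (a hypothetical helper, not the exact implementation used in our experiments) generates such splits under the assumptions stated above: m = ⌊H/p⌋ with p = 6 origins, every validation window of length H, and the last window ending at the final observation.

```python
import numpy as np

def rolling_origin_splits(series, horizon, n_origins=6):
    """Sketch of the adjusted GROE scheme: every validation window has length
    `horizon`, the last window ends at the final observation, and consecutive
    origins are m = horizon // n_origins data points apart."""
    n = len(series)
    m = max(horizon // n_origins, 1)             # spacing between consecutive origins
    last_origin = n - horizon                    # origin of the last train set
    origins = [last_origin - i * m for i in range(n_origins)][::-1]
    splits = []
    for origin in origins:
        train = series[:origin]                  # all observations before the origin
        valid = series[origin:origin + horizon]  # the next `horizon` observations
        splits.append((train, valid))
    return splits

# Example: six splits for an 18-month horizon on a 240-month series
for train, valid in rolling_origin_splits(np.arange(240), horizon=18):
    print(len(train), len(valid))                # every validation set has length 18
```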
3.2. Recurrent neural networks

In this research, we employ two modern types of recurrent neural networks (RNNs), known as LSTM and GRU, and compare their effectiveness in forecasting fund performance against their statistical counterparts. In this section, we present the unit design and mathematical fundamentals of these techniques. In addition to LSTM and GRU, we also describe the vanilla RNN, which provides the foundation for more advanced RNN architectures.

3.2.1. Vanilla recurrent neural networks

RNNs are designed to address the problem of learning from and predicting sequences. They have been widely applied in the field of natural language processing. In the following figure and equations, we illustrate the internal workings of RNNs. The basic RNN cell has the structure shown in Figure 2.
Figure 2: A Basic Recurrent Cell
3.2.4. Ensemble methods

Ensemble methods are ways to combine forecasts from different models with the aim of improving performance. There are different methods to combine forecasts. The simplest method is to average the predictions of the different algorithms. Other methods make use of weighted averages. The study of (Fiorucci and Louzada, 2020) combines time series forecast models with weights proportional to the quality of each model's in-sample predictions in a cross-validation process. Their findings reveal that forecast results improve compared to methods that use equal weights for combination. However, all of the methods used for combination in their study are statistical methods, and in-sample prediction qualities are used to determine the weights.

Our approach to ensembling forecast results differs from the above study in that, in addition to combining forecasts of top-performing models, we also combine forecasts from models belonging to different approaches, i.e. all deep learning models combined with all statistical models. According to (Petropoulos et al., 2022), combined forecasts work more effectively if the methods that generate the forecasts are diverse, which leads to fewer correlated errors. Another difference is that we use out-of-sample prediction quality to determine the weights, which we believe is more objectively representative of a model's performance than in-sample prediction quality. The reason is that deep learning models may easily overfit the training data and produce high-quality in-sample predictions while their performance drops significantly on out-of-sample data.

In this study, we compare three ensembling methods (a small sketch follows the list):

• The first method uses equal weights when averaging model forecasts. This is the simplest form of combining forecasts, known as simple averaging.

• The second method calculates a weighted average of model forecasts where the weights are inversely proportional to the final out-of-sample MASE (our main forecasting performance metric used for the evaluation of forecasting models) of each algorithm. The weights represent the overall (or global) average out-of-sample model quality across all time series for each algorithm.

• The third method calculates a weighted average of model forecasts where the weights are inversely proportional to the average out-of-sample MASE of each algorithm for the particular time series in question. The weights represent the specific (or local) average out-of-sample model quality of each algorithm for each time series.
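As an illustration of the three schemes, the following Python sketch combines hypothetical arrays of forecasts using inverse-MASE weights, either averaged over all series (global) or taken per series (local); the array shapes and names are assumptions made for the example only.

```python
import numpy as np

# Hypothetical shapes: forecasts[model, series, step] and mase[model, series]
# hold each model's out-of-sample forecasts and its MASE on every time series.
def combine_forecasts(forecasts, mase, method="global"):
    """Sketch of the three combination schemes described above."""
    n_models = forecasts.shape[0]
    if method == "simple":
        weights = np.full((n_models, 1), 1.0 / n_models)     # equal weights
    elif method == "global":
        inv = 1.0 / mase.mean(axis=1, keepdims=True)         # one weight per model
        weights = inv / inv.sum(axis=0, keepdims=True)
    elif method == "local":
        inv = 1.0 / mase                                     # one weight per model and series
        weights = inv / inv.sum(axis=0, keepdims=True)
    else:
        raise ValueError(method)
    # Broadcast the weights over the forecast steps and average across models.
    return (weights[..., np.newaxis] * forecasts).sum(axis=0)

# Example with 2 models, 3 series and a 6-step horizon
rng = np.random.default_rng(0)
fc = rng.normal(size=(2, 3, 6))
mase = np.abs(rng.normal(loc=1.5, scale=0.2, size=(2, 3)))
print(combine_forecasts(fc, mase, method="local").shape)     # (3, 6)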
3.3. Traditional statistical models for time series forecasting

To compare against the deep learning algorithms, we select three well-known traditional statistical methods for time series forecasting, namely ARIMA, ETS, and Theta, as well as the Naive method, which is known to work well for numerous financial and economic time series. The Theta method of forecasting, introduced by (Assimakopoulos and Nikolopoulos, 2000), is a special case of simple exponential smoothing with drift. These four methods are recommended by (Petropoulos et al., 2022) when benchmarking new forecasting methods. Further detail about the four models can be found in (Hyndman and Athanasopoulos, 2018).

In our experiments, we use an optimized version of the Theta model proposed by (Fiorucci et al., 2016). We use the well-known forecast R package introduced in (Hyndman and Khandakar, 2008) for training the ETS, ARIMA and Naive models, and the forecTheta R package for training the optimized Theta models. These models are trained using the parallel processing capabilities provided by the furrr R package.

4. Experiment setup

4.1. Data description

Our dataset started with the monthly returns of more than 1200 mutual funds investing in listed large-cap equities in the United States, available on Morningstar Direct. Daily return series are not available. We required at least 20 years of data up to October 2021 and obtained 634 funds. These funds are relatively homogeneous as they share the same investment strategy and are all domiciled in the United States. From the monthly return series, we computed the annualised Sharpe ratios for each fund on a rolling basis. Table 1 summarises the statistics of these time series.

                         Mean      Sd        Min        Max
us45890c7395  Return     4.351    27.069   -129.960    62.540
              Sharpe     0.213     0.542     -1.444     2.132
us5529815731  Return     7.522    23.741    -97.820    70.040
              Sharpe     0.295     0.555     -1.430     1.790
us82301q7759  Return     8.140    22.494   -100.080    69.620
              Sharpe     0.301     0.571     -1.444     2.507
us6409175063  Return     8.804    29.045   -114.100    93.900
              Sharpe     0.304     0.584     -1.288     3.129
us1253254072  Return    15.065    35.222   -156.560   123.460
              Sharpe     0.276     0.533     -1.188     4.768
Riskfree                 1.464     1.713      0.011     6.356
Overall       Return     8.226    23.272   -176.220   126.320
Overall       Sharpe     0.290     0.556     -2.446     4.768

Table 1: Time series descriptive statistics

Looking at the details, the table illustrates five fund examples with the annualised return (percent) and the annualised Sharpe ratio, in terms of their mean, standard deviation, minimum and maximum values. The five funds are, respectively, the minimum average return, the 25th percentile, the 50th percentile, the 75th percentile and the maximum average return over the period. The risk-free rate of return is the market yield on 3-month U.S. Treasury Securities.
The last two rows show the overall average return and the overall average Sharpe ratio of all the funds in the sample.

Within the primary "large-cap equities" strategy that categorizes the funds, there are three major sub-strategies, namely growth, value and blend. Funds adopting a "growth" sub-strategy focus on growth stocks listed in the United States, i.e., stocks with strong earnings growth potential, whereas a "value" sub-strategy means the investment targets value stocks, i.e., those evaluated to be undervalued. "Blend" represents a mix of the growth and value sub-strategies. The study sample of 634 funds includes all three sub-categories mentioned above.
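For reference, a minimal sketch of how an annualised Sharpe ratio can be computed from monthly returns is shown below. It assumes the common sqrt(12) annualisation of the mean and standard deviation of monthly excess returns, and it uses a hypothetical fund and risk-free series; the exact rolling-window convention is not spelled out in the text.

```python
import numpy as np

def annualized_sharpe(monthly_returns, monthly_riskfree):
    """Sketch: annualised Sharpe ratio from monthly percentage returns, assuming
    the usual sqrt(12) annualisation of the mean/std of monthly excess returns."""
    excess = np.asarray(monthly_returns) - np.asarray(monthly_riskfree)
    return np.sqrt(12) * excess.mean() / excess.std(ddof=1)

# Hypothetical example: one fund, 36 months of returns and 3-month T-bill yields
rng = np.random.default_rng(1)
fund = rng.normal(loc=0.7, scale=4.0, size=36)   # monthly returns in percent
rf = np.full(36, 0.1)                            # monthly risk-free rate in percent
print(round(annualized_sharpe(fund, rf), 3))
```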
4.2. Data preprocessing

4.2.1. Data splitting

The time series data are split into 6 train and validation sets using the cross-validation scheme described in Section 3.1 of the Methodology. This cross-validation scheme allows for robust and reliable model selection based on average out-of-sample performance.

4.2.2. Removing trend and seasonality

We transform each time series in the train set of each cross-validation split into its stationary form using two commonly used techniques: log transformation and differencing. Log transformation stabilizes the variance of the time series (Hyndman and Athanasopoulos, 2013). Since negative Sharpe ratios exist, an offset is added to all time series, to ensure each of them is strictly positive, before applying the log transformation.

While log transformation is effective in handling time series variance, differencing can help remove changes in the level of a time series and is therefore effective in eliminating (or reducing) trend and seasonality (Hyndman and Athanasopoulos, 2013). Subsequent to the log transformation, we examine each time series to see which ones need differencing and how many differences are required to make them stationary. The result shows that the majority of the time series are already stationary after the log transformation and 68 of them need first differencing. These 68 time series are first-differenced in each cross-validation split and their last train observation is saved for the later inverse transformation back to the original scale.

4.2.3. Data postprocessing

Since each time series has been transformed prior to entering the modelling stage, the models' outputs are not in the original scale, and this does not allow for a direct comparison of the models' outputs with the validation sets to obtain error metrics. Therefore, the models' predictions are transformed back to the original scales in the inverse order of the preprocessing steps. In detail, predictions for those time series that were previously differenced receive inverse differencing; after this step, all time series are inverse log-transformed, and then the offset that was previously added is subtracted from each time series.
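The following Python sketch illustrates the offset-plus-log transform and first differencing described above, together with the corresponding inverse applied to model outputs. The helper names and the offset value are hypothetical; the paper does not state the offset actually used.

```python
import numpy as np

OFFSET = 5.0  # hypothetical constant that keeps every Sharpe-ratio series positive

def preprocess(train, difference=False):
    """Sketch: offset + log transform of a train series, with optional first
    differencing. The last log-scale train observation is kept for inversion."""
    logged = np.log(np.asarray(train) + OFFSET)
    if difference:
        return np.diff(logged), logged[-1]
    return logged, None

def postprocess(forecasts, last_log_obs=None):
    """Sketch: invert differencing (if it was applied), then the log transform
    and finally subtract the offset."""
    logged = np.asarray(forecasts, dtype=float)
    if last_log_obs is not None:
        logged = last_log_obs + np.cumsum(logged)   # undo first differencing
    return np.exp(logged) - OFFSET

# Toy round trip: difference a series, then rebuild future values from the differences
series = np.array([-0.5, 0.1, 0.4, 0.2, 0.6])
diffs, last = preprocess(series[:3], difference=True)        # "train" part
future_diffs = np.diff(np.log(series[2:] + OFFSET))          # pretend these were predicted
print(np.allclose(postprocess(future_diffs, last), series[3:]))  # True
```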
4.2.4. Forecasting multiple outputs

Unlike univariate statistical methods, deep learning algorithms can potentially take advantage of cross-learning, in which patterns from multiple time series are learnt to improve forecast accuracy for individual time series. In this work, we aim to compare forecasting methods for predicting 634 mutual funds, which involves 634 time series from a homogeneous group of funds. Each time series requires multi-step-ahead forecasts for the applicable forecast horizon. Therefore, we approach the problem as a multiple-output and multi-step-ahead forecasting problem, where inputs are fed into the neural networks and, after training, vectors of outputs are produced directly by the model for six, nine, 12, 18 and 24 months ahead. This strategy has been demonstrated to be effective by the works of (Taieb et al., 2012) and (Wen et al., 2017) and was also adopted by (Hewamalage et al., 2021).

Supervised deep learning maps inputs to outputs, where in our time series context, inputs are fed into the model and outputs are produced by the model following a sliding window approach. This sliding window scheme is also used in the work of (Hewamalage et al., 2021). The approach considers each data sample as consisting of an input window of past observations that is mapped to an output window of future observations which immediately follow the input window in time order. In our experiments, each train set is preprocessed to form multiple sliding input and output windows with these characteristics, where each window at the next time step has the same form as the previous one but is shifted by one time step. This scheme allows past (lagged) values of the time series to be used to predict their future values in each data observation.

As patterns are learnt from inputs to predict outputs, we believe it is necessary to set the length of the input sequences to be at least equal to the length of the output sequences, so that the deep learning algorithms have sufficient information from the past to predict the future. The length of each input window should not be too large either, since using very long input windows would significantly reduce the number of training samples and probably affect model performance. In our study, we set the length of the input windows equal to the length of the forecast horizon plus two months, given that the lengths of the output windows are determined by the chosen forecast horizons. These choices balance the need to cover sufficient information in the input windows against the need to retain as many training samples as possible.
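A minimal sketch of this sliding-window construction is shown below for a single series (the hypothetical helper ignores the stacking of multiple series used for cross-learning); the input length is the forecast horizon plus two months, as described above.

```python
import numpy as np

def make_windows(series, horizon):
    """Sketch: build (input, output) pairs with a stride of one time step, using an
    input window of `horizon + 2` months and an output window of `horizon` months."""
    input_len = horizon + 2
    inputs, outputs = [], []
    for start in range(len(series) - input_len - horizon + 1):
        inputs.append(series[start:start + input_len])
        outputs.append(series[start + input_len:start + input_len + horizon])
    return np.array(inputs), np.array(outputs)

# Example: an 18-month horizon gives 20-month inputs mapped to 18-month outputs
X, y = make_windows(np.arange(240, dtype=float), horizon=18)
print(X.shape, y.shape)   # (203, 20) (203, 18)
```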
4.3. Training deep learning models

The performance of deep learning models depends on
a number of factors, including the choice of the loss function, the optimization procedure within the training loop, and the outer optimization procedure that selects the best combination (or configuration) of hyper-parameters. Our training procedure emphasizes extensive hyper-parameter tuning based on careful consideration of hyper-parameter configurations. This section describes these training and optimization procedures in detail.

4.3.1. Loss function

We use RMSE as the loss function for training the deep learning models. This loss function is optimized when training on mini-batches of data samples. This type of loss function has the advantage of discouraging large, outlier errors, since large errors produce very large squared errors. This loss function is also the metric that we optimize in the Bayesian optimization loops.

4.3.2. Hyper-parameter search setup

Hyper-parameters refer to the parameters that models cannot learn during the training process but that may have an effect on model performance. For RNNs, the number of hidden layers, the number of units in each hidden layer, and the learning rate, among others, are important hyper-parameters. The process of identifying which hyper-parameter configuration results in optimal generalization performance is known as hyper-parameter optimization. We employ Bayesian optimization for this procedure, which is detailed in the next section.

In our experiments, the following hyper-parameters are optimized:

• Learning rate

• Number of hidden layers

• Units in a hidden layer

• Dropout rate for inputs

• Dropout rate for hidden layers

• Mini-batch size

• Whether to use batch normalization or not

• If using batch normalization, placing it either before or after dropout layers

• Activation functions

• Weight decay

• Number of epochs

Unlike (Hewamalage et al., 2021), we consider the dropout rate an important hyper-parameter for optimization. Dropout is an effective technique for addressing the overfitting problem in deep neural networks (Srivastava et al., 2014). This technique may be even more appropriate for time series forecasting, where datasets are usually relatively small compared to those in other fields such as NLP or Computer Vision.

We also provide an option to employ batch normalization (Ioffe and Szegedy, 2015) to further regularize and improve training speed, which allows for the possible training of deeper networks. Originally, batch normalization was understood to work by reducing internal covariate shift. Later research demonstrated that this was a misunderstanding: the work of (Santurkar et al., 2018) clarified that the effectiveness of batch normalization lies in its ability to make the optimization landscape significantly smoother.

In a training process that uses batch normalization, the normalization is performed for each mini-batch. By using batch normalization, we can use higher learning rates and be more relaxed about network initialization. As a regularizer, it may in some cases remove the need for dropout. In our work, it is, however, uncertain whether batch normalization can replace the need for dropout, especially when our training data are limited. Therefore, we choose to optimize both of these hyper-parameters.

There is also an additional consideration concerning the order of batch normalization and dropout in an actual implementation. The combination of batch normalization and dropout works well in some cases but decreases model performance in others, and thus their order should be carefully considered (Li et al., 2019). To work around this issue, and to possibly take advantage of both dropout and batch normalization, we add another hyper-parameter in our experiments that controls whether batch normalization layers are added to the model before or after the dropout layers. The range of dropout rates in our experiments includes zero, which means the case of no dropout is covered.

Another regularization method tuned is weight decay. Weight decay, also called L2 regularization, is probably the most common technique for regularizing parametric machine learning models (Zhang et al., 2021). This technique works by adding a penalty term (or regularization term) to the model's loss function so that the learning process minimizes the prediction loss plus the penalty term. The updated loss function with weight decay is then given by:

$$L = \mathrm{Error} + \underbrace{\psi \sum_{i=1}^{p} w_i^2}_{\text{L2 regularization}}$$

Applied to our context, Error denotes the root mean square error of the network outputs, w_i represents the trainable parameters of the network, and p is the number of all such parameters.

Another important hyper-parameter is the number of hidden layers, which controls the network depth. The use of the regularization methods described above can help reduce overfitting, which in turn may allow deeper networks to be trained. Therefore, we set the range of the
number of hidden layers from one to five, with five hidden layers representing a fairly deep network given the limited amount of training data.

We use Adam as the optimizer for the model training process. Adam is a computationally efficient gradient-based optimization method for stochastic objective functions, which has been demonstrated to work well in practice and to compare favorably to other stochastic optimization approaches (Kingma and Ba, 2014).

In addition to the above, we optimize further hyper-parameters including the learning rate, mini-batch size, units (which represent the dimensionality of the output space of a hidden layer), number of epochs, and activation functions. For each numeric hyper-parameter, as appropriate, we try to include a wide range of values in the search without making the computation cost too high. For example, the learning rate ranges between 0.001 and 0.1, and the dropout rate ranges between 0.0 and 1.0. To keep the computation cost within reasonable limits, we limit the range of the number of epochs to between 1 and 30.

4.3.3. Hyper-parameter optimization

Prior studies have investigated different methods for executing this optimization procedure. Among these, grid search, random search and Bayesian optimization provide three alternatives for optimizing hyper-parameters. Bayesian optimization has been demonstrated to be more effective and is chosen in our experiments.

4.3.4. Grid search

The grid search method finds the optimal hyper-parameter configuration based on a predetermined grid of values. This requires careful consideration in the choice of grid values, as there might be no limit to the number of possible hyper-parameter configurations. For example, a learning rate that ranges between 0.001 and 0.1 can take countless possible values, and when combined with several other hyper-parameters, the search space becomes so vast that covering all possible configurations in the search is impossible. To limit the choice of hyper-parameters in a grid, we can rely on expert knowledge and experience, but this cannot guarantee success in many cases, especially when there are a large number of hyper-parameters to tune in deep learning models.

4.3.5. Random search

Random search narrows down the number of possible hyper-parameter configurations to search by selecting a random subset of all possible configurations. By narrowing down the search space, this algorithm reduces training time and has proved to be effective in many cases (Putatunda and Rama, 2019). The research of (Bergstra and Bengio, 2012) and (Putatunda and Rama, 2019) shows that random search is more effective than grid search as a hyper-parameter optimization method.

4.3.6. Bayesian optimization

Bayesian optimization (BO) is a modern hyper-parameter optimization technique that is effective for searching through a large search space that may involve a large number of hyper-parameters. It has demonstrated superior performance to the random search and grid search approaches in a variety of settings (Wu et al., 2019; Putatunda and Rama, 2019).

The essence of BO is the construction of a probabilistic surrogate model of the true objective function, and the use of this surrogate model together with an acquisition function to guide the search process. Its procedure first defines an objective function to optimize, which may be the loss function or some other function deemed more appropriate for model selection. When training models where the evaluation of the objective function is costly, the surrogate, which is cheaper to evaluate, is used as an approximation of the objective function (Bergstra et al., 2011).

The whole process of Bayesian optimization is sequential by nature, since the determination of the next promising points to search depends on what is already known about the previously searched points. In addition to the surrogate model, another important component of BO is the acquisition function, which is designed to guide the search toward potentially low values of the objective function (Archetti and Candelieri, 2019). The acquisition function balances exploitation and exploration, where exploitation means that the search is performed near the region of the current best points, and exploration refers to searching in regions that have not yet been explored.

The initial probabilistic surrogate model is constructed by fitting a probability model over a sample of points selected by random search or some other sampling method. In the next step, a new promising point to search is identified using the acquisition function. The objective function is then evaluated and the probabilistic surrogate model is updated with the new information. The next step uses the acquisition function to suggest a further promising point. This process is repeated in a sequential manner until some termination condition is satisfied.

The Gaussian Process (GP) is a popular choice for the surrogate model. The GP can be understood as a collection of random variables satisfying the condition that any finite number of these random variables has a joint Gaussian distribution (Archetti and Candelieri, 2019). Another surrogate model is the Tree-structured Parzen Estimator (TPE). Unlike the GP, which models P(y|x) directly, the TPE approach models P(x|y) and P(y) (Bergstra et al., 2011), where x represents the hyper-parameters and y represents the associated evaluation score of the objective function. The hyper-parameter search space can be defined by a generative process, in which the TPE replaces the prior distributions of the hyper-parameter configuration with specific non-parametric
densities; the substitutions become a learning algorithm that then creates various densities over the search space (Bergstra et al., 2011).

The TPE algorithm is implemented in the well-known hyperopt Python package (Bergstra et al., 2013). In our experiments, we utilize this library for hyper-parameter optimization using the TPE algorithm of the Bayesian optimization framework, with the number of iterations set to 800 for each deep learning method. All deep learning models are trained on a server with the following characteristics: 8-core CPU, 16 GB RAM and Linux Ubuntu 20.04.4.
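A condensed sketch of this setup is shown below. The search space and the toy objective are hypothetical placeholders (in the real setup the objective would train the RNN and return its cross-validated RMSE), but the hyperopt calls, fmin with tpe.suggest and 800 evaluations, follow the library's standard interface.

```python
from hyperopt import Trials, fmin, hp, tpe

# Hypothetical search space mirroring the hyper-parameters listed in Section 4.3.2
space = {
    "learning_rate": hp.loguniform("learning_rate", -6.9, -2.3),  # roughly 0.001-0.1
    "hidden_layers": hp.choice("hidden_layers", [1, 2, 3, 4, 5]),
    "units": hp.choice("units", [16, 32, 64, 128]),
    "dropout": hp.uniform("dropout", 0.0, 1.0),
    "batch_norm": hp.choice("batch_norm", [True, False]),
    "epochs": hp.quniform("epochs", 1, 30, 1),
}

def objective(params):
    # Placeholder objective: a toy quadratic so the sketch runs end to end.
    # In the real procedure this would train the network with `params` and
    # return the average out-of-sample RMSE over the cross-validation splits.
    return (params["learning_rate"] - 0.01) ** 2 + 0.1 * params["dropout"]

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=800, trials=trials)
print(best)
```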
4.4. Forecast accuracy measures

In this section, we present the metrics used to compute forecast accuracy, which enable performance comparison among the models studied.

Let X_t denote the observation at time t and F_t denote the forecast of X_t. Then the forecast error is e_t = X_t − F_t. Suppose we have k forecasts together with the corresponding observations at each forecast period. Following the notation of (Hyndman and Koehler, 2006), we write mean(x_t) for the sample mean of {x_t} over the sample and, analogously, median(x_t) for the sample median.

The most commonly used scale-dependent measures are based on absolute or squared errors. The Mean Square Error is MSE = mean(e_t^2), the Root Mean Square Error is

$$RMSE = \sqrt{\mathrm{mean}(e_t^2)},$$

and the Mean Absolute Error is MAE = mean(|e_t|). Often, the RMSE is preferred to the MSE as it is on the same scale as the data. Historically, the RMSE and MSE have been popular, largely because of their theoretical relevance in statistical modelling. However, they are more sensitive to outliers than the MAE or the MDAE (Median Absolute Error).

Compared to absolute errors, percentage errors have the advantage of being scale-independent, and so are frequently used to compare forecast performance across different datasets. Define the percentage error as p_t = 100 e_t / X_t. The Symmetric Median Absolute Percentage Error is

$$SMDAPE = \mathrm{median}\left(200\,|X_t - F_t| / (X_t + F_t)\right).$$

The problems arising from small values of X_t may be less severe for SMDAPE. However, when X_t is close to zero, F_t is usually also likely to be close to zero, so the measure still involves division by a number close to zero.

Scaled errors are defined as

$$q_t = \frac{e_t}{\frac{1}{n-1}\sum_{i=2}^{n} |X_i - X_{i-1}|},$$

which is independent of the scale of the data. A scaled error is less than one if it arises from a better forecast than the average one-step Naive forecast computed in-sample. Conversely, it is greater than one if the forecast is worse than the average one-step Naive forecast computed in-sample (see Hyndman and Koehler, 2006). The best-known scaled error measure is the Mean Absolute Scaled Error:

$$MASE = \mathrm{mean}(|q_t|).$$

When MASE < 1, the proposed method gives, on average, smaller errors than the one-step in-sample errors of the Naive method. If multi-step forecasts are being computed, it is possible to scale by the in-sample MAE computed from multi-step Naive forecasts.

The recent paper of (Kim and Kim, 2016) investigated and demonstrated practical advantages of a newer accuracy measure, MAAPE, the mean arctangent absolute percentage error:

$$MAAPE = \frac{1}{N}\sum_{t=1}^{N} AAPE_t = \frac{1}{N}\sum_{t=1}^{N} \arctan\left(\left|\frac{X_t - F_t}{X_t}\right|\right).$$

MAAPE is finite even when the actual value X_t equals zero and has a convenient trigonometric representation. However, because its value is expressed in radians, MAAPE is less intuitive. Note that MAAPE does not have a symmetric version, since division by zero is no longer a concern. MAAPE is also scale-free because its values are expressed in radians.
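For concreteness, a small sketch of the two scale-free measures defined above is given below (MASE with the in-sample one-step Naive scaling, and MAAPE); the toy arrays are illustrative only.

```python
import numpy as np

def mase(insample, actual, forecast):
    """Mean Absolute Scaled Error: absolute forecast errors scaled by the
    in-sample mean absolute error of the one-step Naive forecast."""
    scale = np.mean(np.abs(np.diff(insample)))
    return np.mean(np.abs(actual - forecast)) / scale

def maape(actual, forecast):
    """Mean Arctangent Absolute Percentage Error (Kim and Kim, 2016), in radians."""
    return np.mean(np.arctan(np.abs((actual - forecast) / actual)))

# Toy example
insample = np.array([0.2, 0.3, 0.1, 0.4, 0.5])
actual   = np.array([0.6, 0.4, 0.3])
forecast = np.array([0.5, 0.5, 0.2])
print(round(mase(insample, actual, forecast), 3), round(maape(actual, forecast), 3))
```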
5. Results

In this section, we train and evaluate six models for five different forecast horizons, i.e., 30 models in total. The forecast horizons are 6, 9, 12, 18, and 24 months. The six models comprise two deep learning models (LSTM, GRU), three traditional statistical models (ARIMA, ETS, Theta), and the Naive model as a benchmark.

The first subsection below presents the results averaged over the five forecast horizons for the six models and the ensemble models. The second subsection illustrates further details using the 18-month forecast horizon.

5.1. Average of multiple forecast horizons

5.1.1. Comparison of deep learning vs statistical models

Table 2 provides the averages of the accuracy measures over the five forecast horizons for each of the five accuracy measures (MASE, RMSE, MAE, SMDAPE and MAAPE) and each of the six models. It is evident that the LSTM outperforms all other models, followed by the GRU, so the deep
learning models take the first places. They are followed by the ARIMA model, ETS, and Theta, respectively. The worst model, perhaps surprisingly, is the Naive model. Regardless of the forecast horizon, the LSTM model is the best model on average and produces the most consistent results across the different accuracy measures, as shown in Table 6 in the appendix.

Algorithm    MASE    RMSE    MAE     SMDAPE    MAAPE
LSTM         1.510   0.441   0.363    74.204   59.340
GRU          1.546   0.453   0.371    81.481   59.791
ARIMA        1.815   0.533   0.435    96.504   63.207
ETS          1.835   0.535   0.441   105.028   65.012
Theta        1.961   0.566   0.471   106.827   67.567
Naive        4.176   1.938   1.437   120.213   75.317

Table 2: Average accuracy measures over five forecast horizons

Figure 5: Comparison of MASE across all algorithms and forecast horizons.

Figure 5 shows the MASE accuracy measure; RMSE and MAE are reported in Table 6 in the appendix. The results confirm the outperformance of the LSTM model, as its MASE stays lowest across the six models and the different forecast horizons, except for the six-month horizon, for which GRU slightly outperformed LSTM. Across the five horizons, LSTM produces its lowest MASE (1.363) for the forecast horizon of 12 months.

Table 6 also reports the RMSE, MAE, SMDAPE and MAAPE results. We conclude that LSTM is the best model across the different algorithms and forecast horizons. Again, the 12-month horizon receives the best forecasting accuracy for the first two metrics and the nine-month horizon for the last two. The Naive model consistently produces the least accurate forecasts across algorithms and forecast horizons, especially for the longest horizon of 24 months.

5.1.2. Ensemble models

Algorithm           MASE    RMSE    MAE     SMDAPE    MAAPE
All algorithms
  simple average    2.208   0.678   0.530    94.532   63.984
  global weights    1.852   0.553   0.445    91.495   62.839
  local weights     1.785   0.529   0.428    91.004   62.567
Deep learning
  simple average    1.469   0.431   0.353    74.315   58.429
  global weights    1.468   0.430   0.352    74.261   58.418
  local weights     2.056   0.594   0.493   127.001   67.196
Stats
  simple average    2.721   0.841   0.654   110.821   68.312
  global weights    2.119   0.631   0.509   108.158   66.826
  local weights     2.074   0.609   0.498   122.540   66.737

Table 3: Average accuracy measures of the ensemble methods

Table 3 provides the results of the ensemble methods applied to combine forecasts from different groups of models: all models (All algorithms in the table), LSTM and GRU (Deep learning), and the statistical models (Stats). For each ensembling group, we use the set of combination methods comprising the simple average, the weighted average with global weights (referred to as global weights in the table), and the weighted average with local weights (referred to as local weights in the table).

The best results come from the ensemble models of LSTM and GRU. The second best is All algorithms. The ensembles of All algorithms provide better accuracy on all error metrics than ensembling using only the three statistical methods. Furthermore, for the simple average and global weights methods, ensembling the deep learning models provides lower error metrics than the individual models. For example, the means (across five forecast horizons) of MASE for LSTM and GRU are 1.510 and 1.546, respectively, whereas the means for the simple average and global weights ensembles are only 1.469 and 1.468. The same pattern is also observed for the other error metrics, RMSE, MAE, SMDAPE, and MAAPE. However, the local weights ensemble of GRU and LSTM is worse than the individual models (for MASE, RMSE, MAE, SMDAPE, and MAAPE), so for local weights the ensemble over All algorithms results in better accuracy measures.

While the ensemble of All algorithms provides much better accuracy measures than the ensemble of the statistical models, it is not as good as the original individual deep learning models.

Initially, one would expect that the ensemble of all models should provide the best accuracy, as, theoretically, a more diverse set of models can result in fewer correlated errors. However, as shown by our analysis, the ensembles of GRU and LSTM using global weights provide the best model, with the lowest mean of all error metrics.

The relative performance of the ensembling methods differs across ensembling groups and forecast horizons. For the All algorithms ensembles, across all forecast horizons (see Table 7), the local weights method yields the lowest error measures. For the ensembling of the deep learning models, the best choice is the global weights method for all
the forecast horizons, except the 24-month horizon. For the ensembling of the statistical models, the global weights method is the best choice only for the 6, 9 and 12-month horizons.

Further details for each forecast horizon, for the six original models and the nine ensemble models (three ensembling groups and three weighting methods), are provided in Tables 6 and 7 in the appendix.

The next subsection further illustrates the detailed results obtained for the forecast horizon of 18 months. Detailed analysis for the other forecast horizons can be performed in a similar manner using the results provided in the appendices.

5.2. Forecast horizon of 18 months

5.2.1. Comparison of deep learning vs statistical models

For each algorithm, we first calculate the average accuracy metrics for each time series across the cross-validation splits. The results for all time series of an algorithm are then averaged to yield the final metric values representing the performance of that particular algorithm. For the latter step, we also report the median and standard deviation in addition to the mean.

Table 4 presents the accuracy metrics measured for the two deep learning methods (LSTM, GRU), the three statistical methods (ARIMA, ETS, and Theta) and the benchmark Naive model.

Algorithm     LSTM      GRU      ARIMA     ETS      Theta     Naive
MASE
  mean         1.425     1.428    1.539     1.721    1.762     2.760
  median       1.374     1.384    1.500     1.700    1.699     1.684
  sd           0.177     0.153    0.172     0.159    0.225     3.234
RMSE
  mean         0.422     0.422    0.454     0.503    0.513     0.830
  median       0.415     0.412    0.445     0.498    0.498     0.497
  sd           0.045     0.044    0.044     0.043    0.053     1.019
MAE
  mean         0.342     0.343    0.370     0.414    0.424     0.660
  median       0.334     0.335    0.363     0.411    0.411     0.408
  sd           0.038     0.033    0.037     0.034    0.049     0.765
SMDAPE
  mean        95.775   109.261  111.320   133.133  132.561   137.438
  median      88.499   108.577  107.999   130.480  129.424   131.305
  sd          20.007    13.229   19.848    10.339   10.804    17.906
MAAPE
  mean        74.979    76.324   77.763    81.963   82.685    85.286
  median      74.674    75.891   76.613    82.901   83.061    82.188
  sd           3.839     3.825    4.402     3.217    3.873    12.389

Table 4: Mean accuracy measures of the six forecasting models for the 18-month horizon

Overall, LSTM has the best performance, with the smallest mean, median and standard deviation for almost all accuracy metrics, except for the SMDAPE metric, for which the GRU model yields a lower mean and median. The results clearly show that LSTM is significantly superior to the other four forecasting methods. While ARIMA is the best-performing traditional statistical approach, it produces a much lower accuracy level than the GRU model. The metrics' medians and standard deviations confirm the same observation; therefore, in the following table, we present only the mean values of the five accuracy metrics.

It can be concluded that the deep learning methods provide significantly more accurate forecasts than the traditional statistical ones, which confirms our research hypothesis.

5.2.2. Ensemble models for the 18-month horizon

Algorithm           MASE    RMSE    MAE     SMDAPE    MAAPE
All algorithms
  simple average    1.535   0.463   0.369   111.130   76.021
  global weights    1.513   0.455   0.363   110.183   75.772
  local weights     1.469   0.440   0.353   109.933   75.565
Deep learning
  simple average    1.322   0.393   0.318    96.948   73.896
  global weights    1.322   0.393   0.318    96.939   73.900
  local weights     1.519   0.472   0.365   141.321   74.531
Stats
  simple average    1.794   0.536   0.431   131.529   79.995
  global weights    1.756   0.524   0.422   130.779   79.70
  local weights     1.639   0.493   0.394   142.974   77.915

Table 5: Accuracy measures of the ensemble models for the 18-month horizon

The best results come from the ensemble models of GRU and LSTM, followed by the All algorithms ensembles and the Stats ensembles. Furthermore, for the simple average and global weights methods, ensembling the deep learning models provides lower error metrics than the individual models. For example, the means of MASE for LSTM and GRU are 1.425 and 1.428, respectively, whereas the mean for the simple average and global weights ensembles is only 1.322. The same results are also obtained for the other accuracy measures, RMSE, MAE, SMDAPE, and MAAPE. However, the local weights ensemble of GRU and LSTM is worse than the individual models (for MASE, RMSE, MAE and SMDAPE).

6. Conclusion

The main purpose of this study is to address the challenge of forecasting the performance of multiple mutual funds simultaneously using modern deep learning approaches, with a comparison against popular traditional statistical approaches. The deep learning approaches are studied from the cross-learning perspective, which means information from various time series is used to improve the predictions of individual time series, and no external features are added to the models. In addition, we use different ensemble methods to combine forecasts generated by models of the traditional and modern approaches. The results show that the ranking order of model quality for the studied methods is: ensemble of deep learning models, LSTM, GRU, ARIMA, ETS, Theta, and Naive.

The results among the ensemble methods vary depending on which models are combined. The best model comes from the ensemble of the LSTM and GRU models using weighted averages with global weights.
In our study, both the LSTM and GRU models are trained with Bayesian optimization, a modern approach to hyper-parameter optimization that can be effective when evaluating the model for individual hyper-parameter configurations is costly, a property particularly true for deep learning problems. In this paper, we have outlined a detailed methodology for training these deep learning models using Bayesian optimization, which we believe could be valuable for other research.

We conclude that deep learning models and their ensembles offer promising solutions to the challenging problem of forecasting the performance of multiple mutual funds as measured by Sharpe ratios.

7. Appendix

Algorithm     6m       9m       12m      18m      24m
MASE
  LSTM        1.863    1.483    1.363    1.425    1.417
  GRU         1.835    1.587    1.453    1.428    1.429
  ARIMA       2.252    2.035    1.779    1.539    1.470
  ETS         2.073    1.726    2.104    1.721    1.552
  Theta       2.055    1.687    2.119    1.762    2.183
  Naive       2.977    3.517    2.302    2.760   18.326
RMSE
  LSTM        0.516    0.438    0.414    0.422    0.416
  GRU         0.519    0.462    0.443    0.422    0.418
  ARIMA       0.637    0.619    0.529    0.454    0.428
  ETS         0.576    0.522    0.610    0.503    0.461
  Theta       0.574    0.504    0.613    0.513    0.628
  Naive       0.890    1.021    0.668    0.830    6.280
MAE
  LSTM        0.447    0.356    0.328    0.342    0.340
  GRU         0.439    0.381    0.349    0.343    0.343
  ARIMA       0.539    0.488    0.427    0.370    0.353
  ETS         0.496    0.414    0.505    0.414    0.373
  Theta       0.492    0.405    0.509    0.424    0.524
  Naive       0.715    0.847    0.553    0.660    4.409
SMDAPE
  LSTM       50.581   42.983   60.205   95.775  121.474
  GRU        56.180   45.544   73.905  109.261  122.516
  ARIMA      71.258   76.036   99.108  111.320  124.797
  ETS        59.843   55.429  133.069  133.133  143.664
  Theta      56.510   53.094  132.822  132.561  159.146
  Naive      73.452   85.090  136.191  137.438  168.894
MAAPE
  LSTM       43.561   40.625   56.486   74.979   81.050
  GRU        40.669   43.733   56.506   76.324   81.720
  ARIMA      46.255   45.339   63.897   77.763   82.783
  ETS        45.489   41.978   71.679   81.963   83.952
  Theta      45.335   42.205   71.980   82.685   95.627
  Naive      51.922   59.664   72.535   85.286  107.177

Table 6: Accuracy measures across six algorithms and five forecast horizons

Horizon  Algorithm           MASE    RMSE    MAE     SMDAPE    MAAPE
6m       All algorithms
           simple average    2.026   0.566   0.485    57.088   44.819
           global weights    2.017   0.562   0.483    56.973   44.700
           local weights     2.011   0.560   0.482    56.818   44.655
         Deep learning
           simple average    1.737   0.486   0.416    49.372   39.781
           global weights    1.737   0.486   0.416    49.378   39.771
           local weights     3.026   0.817   0.725   118.567   59.910
         Stats
           simple average    2.206   0.623   0.528    62.855   47.206
           global weights    2.195   0.618   0.526    62.800   47.143
           local weights     2.546   0.694   0.610    86.404   51.459
9m       All algorithms
           simple average    1.754   0.524   0.422    51.380   43.182
           global weights    1.715   0.513   0.412    50.520   42.522
           local weights     1.689   0.506   0.406    49.718   42.029
         Deep learning
           simple average    1.526   0.446   0.366    44.548   42.095
           global weights    1.524   0.446   0.366    44.508   42.042
           local weights     2.406   0.696   0.578   101.304   53.860
         Stats
           simple average    2.040   0.604   0.490    65.303   47.650
           global weights    1.959   0.582   0.470    62.085   46.262
           local weights     2.104   0.626   0.505    82.616   46.953
12m      All algorithms
           simple average    1.699   0.507   0.408    93.867   63.387
           global weights    1.684   0.503   0.404    92.321   63.099
           local weights     1.680   0.502   0.404    92.168   62.985
         Deep learning
           simple average    1.365   0.419   0.328    61.133   55.451
           global weights    1.363   0.419   0.328    60.908   55.453
           local weights     1.948   0.558   0.468   125.655   70.640
         Stats
           simple average    2.009   0.588   0.483   127.343   69.327
           global weights    2.003   0.586   0.481   126.859   69.202
           local weights     2.039   0.590   0.490   139.245   70.686
18m      All algorithms
           simple average    1.535   0.463   0.369   111.130   76.021
           global weights    1.513   0.455   0.363   110.183   75.772
           local weights     1.469   0.440   0.353   109.933   75.565
         Deep learning
           simple average    1.322   0.393   0.318    96.948   73.896
           global weights    1.322   0.393   0.318    96.939   73.900
           local weights     1.519   0.472   0.365   141.321   74.531
         Stats
           simple average    1.794   0.536   0.431   131.529   79.995
           global weights    1.756   0.524   0.422   130.779   79.70
           local weights     1.639   0.493   0.394   142.974   77.915
24m      All algorithms
           simple average    4.025   1.329   0.968   159.198   92.513
           global weights    2.331   0.729   0.560   147.479   88.100
           local weights     2.073   0.634   0.498   146.383   87.601
         Deep learning
           simple average    1.395   0.409   0.335   119.573   80.923
           global weights    1.395   0.409   0.335   119.573   80.924
           local weights     1.382   0.430   0.332   148.157   77.038
         Stats
           simple average    5.558   1.853   1.337   167.076   97.383
           global weights    2.683   0.843   0.645   158.265   91.822
           local weights     2.042   0.640   0.490   161.460   86.667

Table 7: Accuracy measures of the ensemble models by forecast horizon

8. Acknowledgement

We acknowledge the support from Hanoi University for providing a server for training models, and from Ms Hoang Anh and Mr Nguyen Ngoc Hieu for their assistance in the initial phase of the project.
References

Ahmed, N. K., Atiya, A. F., Gayar, N. E., and El-Shishiny, H. (2010). An empirical comparison of machine learning models for time series forecasting. Econometric Reviews, 29(5-6):594–621.
Archetti, F. and Candelieri, A. (2019). Bayesian optimization and data science. Springer.
Arnold, T. R., Ling, D. C., and Naranjo, A. (2019). Private equity real estate funds: returns, risk exposures, and persistence. The Journal of Portfolio Management, 45(7):24–42.
Assimakopoulos, V. and Nikolopoulos, K. (2000). The theta model: a decomposition approach to forecasting. International journal of forecasting, 16(4):521–530.
Bandara, K., Bergmeir, C., and Smyl, S. (2020). Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert systems with applications, 140:112896.
Bergmeir, C. and Benítez, J. M. (2012). On the use of cross-validation for time series predictor evaluation. Information Sciences, 191:192–213.
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in neural information processing systems, 24.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of machine learning research, 13(2).
Bergstra, J., Yamins, D., Cox, D. D., et al. (2013). Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in science conference, volume 13, page 20. Citeseer.
Berk, J. B. and van Binsbergen, J. H. (2015). Measuring skill in the mutual fund industry. Journal of Financial Economics, 118(1):1–20.
Bollen, N. P. B. and Busse, J. A. (2005). Short-term persistence in mutual fund performance. The Review of financial studies, 18(2):569–597.
Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M. (2015). Time series analysis: forecasting and control. John Wiley & Sons.
Cai, Y. (2005). A forecasting procedure for nonlinear autoregressive time series models. Journal of Forecasting, 24(5):335–351.
Cao, J. and Wang, J. (2019). Stock price forecasting model based on modified convolution neural network and financial time series analysis. International Journal of Communication Systems, 32(12):e3987.
Carhart, M. M. (1997). On persistence in mutual fund performance. The Journal of Finance, 52(1):57–82.
Chen, W., Xu, H., Jia, L., and Gao, Y. (2021). Machine learning model for bitcoin exchange rate prediction using economic and technology determinants. International Journal of Forecasting, 37(1):28–43.
Chiang, W.-C., Urban, T., and Baldridge, G. (1996). A neural network approach to mutual fund net asset value forecasting. Omega, 24(2):205–215.
Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Crone, S. F., Hibon, M., and Nikolopoulos, K. (2011). Advances in forecasting with neural networks? empirical evidence from the nn3 competition on time series prediction. International Journal of forecasting, 27(3):635–660.
Elton, E. J. and Gruber, M. J. (2020). A review of the performance measurement of long-term mutual funds. Financial analysts journal, 76(3):22–37.
Fiorucci, J. A. and Louzada, F. (2020). Groec: combination method via generalized rolling origin evaluation. International Journal of Forecasting, 36(1):105–109.
Fiorucci, J. A., Pellegrini, T. R., Louzada, F., Petropoulos, F., and Koehler, A. B. (2016). Models for optimising the theta method and their relationship to state space models. International Journal of Forecasting, 32(4):1151–1161.
Hewamalage, H., Bergmeir, C., and Bandara, K. (2021). Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
Hyndman, R. (2018). A brief history of time series forecasting competitions. URL https://robjhyndman.com/hyndsight/forecasting-competitions.
Hyndman, R. and Athanasopoulos, G. (2013). Forecasting: principles and practice. [e-book].
Hyndman, R., Koehler, A. B., Ord, J. K., and Snyder, R. D. (2008). Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media.
Hyndman, R. J. and Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of statistical software, 27:1–22.
Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of forecast accuracy. International journal of forecasting, 22(4):679–688.
ICI, R. S. P. (2021). 2021 Investment Company FACT BOOK.
Indro, D., Jiang, C., Patuwo, B., and Zhang, G. (1999). Predicting mutual fund performance using artificial neural networks. Omega, 27(3):373–380.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR.
Irvine, P. J., Kim, J. H. J., and Ren, J. (2022). The beta anomaly and mutual fund performance [working paper].
Jianwei, E., Ye, J., and Jin, H. (2019). A novel hybrid model on the prediction of time series and its application for the gold price analysis and forecasting. Physica A: Statistical Mechanics and its Applications, 527:121454.
Kacperczyk, M., Nieuwerburgh, S. V., and Veldkamp, L. (2014). Time-varying fund manager skill. The Journal of Finance, 69(4):1455–1484.
Kahn, R. N. and Rudd, A. (1995). Does historical performance predict future performance? Financial analysts journal, 51(6):43–52.
Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Kim, S. and Kim, H. (2016). A new metric of absolute percentage error for intermittent demand forecasts. International Journal of Forecasting, 32(3):669–679.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, X., Chen, S., Hu, X., and Yang, J. (2019). Understanding the disharmony between dropout and batch normalization by variance shift.
Li, Z., Han, J., and Song, Y. (2020). On the forecasting of high-frequency financial time series based on arima model improved by deep learning. Journal of Forecasting, 39(7):1081–1097.
Lou, D. (2012). A flow-based explanation for return predictability. The Review of Financial Studies, 25(12):3457–3489.
Makridakis, S. and Hibon, M. (2000). The m3-competition: results, conclusions and implications. International journal of forecasting, 16(4):451–476.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). The m4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74.
Melard, G. and Pasteels, J.-M. (2000). Automatic arima modeling
including interventions, using time series expert software. Inter-
national Journal of Forecasting, 16(4):497–508.
Newbold, P. (1983). Arima model building and the time series anal-
ysis approach to forecasting. Journal of forecasting, 2(1):23–35.
Pan, W.-T., Han, S.-Z., Yang, H.-L., and Chen, X.-Y. (2019). Pre-
diction of mutual fund net value based on data mining model.
Cluster computing, 22:9455–9460.
Pannakkong, W., Pham, V.-H., and Huynh, V.-N. (2017). A novel
hybridization of arima, ann, and k-means for time series forecast-
ing. International Journal of Knowledge and Systems Science
(IJKSS), 8(4):30–53.
Petropoulos, F., Apiletti, D., Assimakopoulos, V., Babai, M. Z., Bar-
row, D. K., Taieb, S. B., Bergmeir, C., Bessa, R. J., Bijak, J.,
Boylan, J. E., et al. (2022). Forecasting: theory and practice.
International Journal of Forecasting.
Putatunda, S. and Rama, K. (2019). A modified bayesian optimiza-
tion based hyper-parameter tuning approach for extreme gradient
boosting. In 2019 Fifteenth International Conference on Infor-
mation Processing (ICINPRO), pages 1–6. IEEE.
Ruppert, D. (2004). The elements of statistical learning: data min-
ing, inference, and prediction.
Santurkar, S., Tsipras, D., Ilyas, A., and Mądry, A. (2018). How does
batch normalization help optimization? In Proceedings of the
32nd international conference on neural information processing
systems, pages 2488–2498.
Semenoglou, A.-A., Spiliotis, E., Makridakis, S., and Assimakopou-
los, V. (2021). Investigating the accuracy of cross-learning time
series forecasting methods. International Journal of Forecasting,
37(3):1072–1084.
Sezer, O. B., Gudelek, M. U., and Ozbayoglu, A. M. (2020). Financial
time series forecasting with deep learning: A systematic literature
review: 2005–2019. Applied Soft Computing, 90:106181.
Sharpe, W. F. (1966). Mutual fund performance. The Journal of
Business, 39(1):451–476.
Sharpe, W. F. (1994). The sharpe ratio. Journal of portfolio man-
agement, 21(1):49–58.
Smyl, S. (2020). A hybrid method of exponential smoothing and
recurrent neural networks for time series forecasting. International
Journal of Forecasting, 36(1):75–85.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and
Salakhutdinov, R. (2014). Dropout: a simple way to prevent neu-
ral networks from overfitting. The journal of machine learning
research, 15(1):1929–1958.
Taieb, S. B., Bontempi, G., Atiya, A. F., and Sorjamaa, A. (2012).
A review and comparison of strategies for multi-step ahead time
series forecasting based on the nn5 forecasting competition. Expert
systems with applications, 39(8):7067–7083.
Wang, K. and Huang, S. (2010). Using fast adaptive neural network
classifier for mutual fund performance evaluation. Expert systems
with applications, 37(8):6007–6011.
Wen, R., Torkkola, K., Narayanaswamy, B., and Madeka, D. (2017).
A multi-horizon quantile recurrent forecaster. arXiv preprint
arXiv:1711.11053.
Wu, J., Chen, X.-Y., Zhang, H., Xiong, L.-D., Lei, H., and Deng,
S.-H. (2019). Hyperparameter optimization for machine learning
models based on bayesian optimization. Journal of Electronic
Science and Technology, 17(1):26–40.
Xu, W., Peng, H., Zeng, X., Zhou, F., Tian, X., and Peng, X. (2019).
A hybrid modelling method for time series forecasting based on
a linear regression model and deep learning. Applied Intelligence,
49(8):3002–3015.
Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2021). Dive into
deep learning. arXiv preprint arXiv:2106.11342.