Frequency-domain MLPs are More Effective Learners in Time Series Forecasting
Kun Yi¹, Qi Zhang², Wei Fan³, Shoujin Wang⁴, Pengyang Wang⁵, Hui He¹, Defu Lian⁶, Ning An⁷, Longbing Cao⁸, Zhendong Niu¹∗

¹Beijing Institute of Technology, ²Tongji University, ³University of Oxford, ⁴University of Technology Sydney, ⁵University of Macau, ⁶USTC, ⁷HeFei University of Technology, ⁸Macquarie University

{yikun, hehui617, zniu}@bit.edu.cn, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Time series forecasting plays a key role in a variety of industries, including finance, traffic, energy, and healthcare. While the existing literature has designed many sophisticated architectures based on RNNs, GNNs, or Transformers, another line of approaches based on multi-layer perceptrons (MLPs) has been proposed with simple structure, low complexity, and superior performance. However, most MLP-based forecasting methods suffer from point-wise mappings and an information bottleneck, which largely hinders forecasting performance. To overcome this problem, we explore a novel direction of applying MLPs in the frequency domain for time series forecasting. We investigate the learned patterns of frequency-domain MLPs and discover two inherent characteristics that benefit forecasting: (i) global view: the frequency spectrum gives MLPs a complete view of the signal, making it easier to learn global dependencies; and (ii) energy compaction: frequency-domain MLPs concentrate on the smaller key part of frequency components, where the signal energy is compacted. We then propose FreTS, a simple yet effective architecture built upon frequency-domain MLPs for time series forecasting. FreTS mainly involves two stages: (i) Domain Conversion, which transforms time-domain signals into complex numbers in the frequency domain; and (ii) Frequency Learning, which applies our redesigned MLPs to learn the real and imaginary parts of the frequency components. Operating these two stages on both inter-series and intra-series scales further contributes to channel-wise and time-wise dependency learning. Extensive experiments on 13 real-world benchmarks (including 7 benchmarks for short-term forecasting and 6 benchmarks for long-term forecasting) demonstrate our consistent superiority over state-of-the-art methods. Code is available at this repository: https://github.com/aikunyi/FreTS.
1 Introduction
Time series forecasting has played a critical role in a variety of real-world industries, such as climate condition estimation [1, 2, 3], traffic state prediction [4, 5, 6], and economic analysis [7, 8]. In the early stages, many traditional statistical forecasting methods were proposed, such as exponential smoothing [9] and auto-regressive moving averages (ARMA) [10]. Recently, the development of deep learning has fostered many deep forecasting models, including Recurrent Neural Network-based methods (e.g., DeepAR [11], LSTNet [12]), Convolutional Neural Network-based methods (e.g., TCN [13], SCINet [14]), Transformer-based methods (e.g., Informer [15], Autoformer [16]), and Graph Neural Network-based methods (e.g., MTGNN [17], StemGNN [18], AGCRN [19]).
∗ Corresponding author
While these deep models have achieved promising forecasting performance in certain scenarios, their sophisticated network architectures usually incur an expensive computation burden at training or inference time. Besides, the robustness of such heavily parameterized models can easily degrade, especially when the available training data is limited [15, 20]. Therefore, methods based on multi-layer perceptrons (MLPs) have recently been introduced with simple structure, low complexity, and superior forecasting performance, such as N-BEATS [21], LightTS [22], and DLinear [23]. However, these MLP-based methods rely on point-wise mappings to capture temporal patterns, which cannot handle global dependencies of time series. Moreover, they suffer from an information bottleneck with regard to the volatile and redundant local momenta of time series, which largely hinders their forecasting performance.
To overcome the above problems, we explore a novel direction of applying MLPs in the frequency domain for time series forecasting. We investigate the learned patterns of frequency-domain MLPs in forecasting and discover their two key advantages: (i) global view: operating on spectral components acquired from series transformation, frequency-domain MLPs can capture a more complete view of signals, making it easier to learn global spatial/temporal dependencies; (ii) energy compaction: frequency-domain MLPs concentrate on the smaller key part of frequency components where the signal energy is compacted, and thus can preserve clearer patterns while filtering out the influence of noise. Experimentally, we observe in Figure 1(a) that frequency-domain MLPs capture much more obvious global periodic patterns than time-domain MLPs, which highlights their ability to recognize global signals. Also, from Figure 1(b), we note a much clearer diagonal dependency in the learned weights of frequency-domain MLPs, compared with the more scattered dependency learned by time-domain MLPs. This illustrates the great potential of frequency-domain MLPs to identify the most important features and key patterns while handling complicated and noisy information.
To fully utilize these advantages, we propose FreTS, a simple yet effective architecture of Frequency-
domain MLPs for Time Series forecasting. The core idea of FreTS is to learn the time series
forecasting mappings in the frequency domain. Specifically, FreTS mainly involves two stages: (i)
Domain Conversion: the original time-domain series signals are first transformed into frequency-
domain spectrum on top of Discrete Fourier Transform (DFT) [24], where the spectrum is composed of
several complex numbers as frequency components, including the real coefficients and the imaginary
coefficients. (ii) Frequency Learning: given the real/imaginary coefficients, we redesign MLPs for complex numbers by separately considering the real and imaginary mappings. The respective real/imaginary parts of the output, learned by two distinct MLPs, are then stacked to recover the frequency components for the final forecast. Also, FreTS performs the above two stages on both inter-series and intra-series scales, which further contributes to learning channel-wise and time-wise dependencies in the frequency domain for better forecasting performance.
We conduct extensive experiments on 13 benchmarks under different settings, covering 7 benchmarks
for short-term forecasting and 6 benchmarks for long-term forecasting, which demonstrate our
consistent superiority compared with state-of-the-art methods.
2 Related Work
Forecasting in the Time Domain Traditionally, statistical methods have been proposed for forecast-
ing in the time domain, including ARMA [10], VAR [25], and ARIMA [26]. Recently, deep learning
based methods have been widely used in time series forecasting due to their capability of extracting
nonlinear and complex correlations [27, 28]. These methods have learned the dependencies in the
time domain with RNNs (e.g., DeepAR [11], LSTNet [12]) and CNNs (e.g., TCN [13], SCINet [14]).
In addition, GNN-based models have been proposed with good forecasting performance because of
their good abilities to model series-wise dependencies among variables in the time domain, such as
TAMP-S2GCNets [5], AGCRN [19], MTGNN [17], and GraphWaveNet [29]. Besides, Transformer-
based forecasting methods have been introduced due to their attention mechanisms for long-range
dependency modeling ability in the time domain, such as Reformer [20] and Informer [15].
Forecasting in the Frequency Domain Several recent time series forecasting methods have ex-
tracted knowledge of the frequency domain for forecasting [30]. Specifically, SFM [31] decomposes
the hidden state of LSTM into frequencies by Discrete Fourier Transform (DFT). StemGNN [18]
performs graph convolutions based on Graph Fourier Transform (GFT) and computes series corre-
lations based on the Discrete Fourier Transform. Autoformer [16] replaces self-attention with an auto-correlation mechanism implemented via Fast Fourier Transforms (FFT). FEDformer [32]
proposes a DFT-based frequency enhanced attention, which obtains the attentive weights by the
spectrums of queries and keys, and calculates the weighted sum in the frequency domain. CoST [33]
uses DFT to map the intermediate features to the frequency domain to enable interactions in the learned representations. FiLM [34] utilizes Fourier analysis to preserve historical information and remove noisy signals.
Unlike these efforts, which leverage frequency techniques to improve upon original architectures such as Transformers and GNNs, in this paper we propose a new frequency learning architecture that learns
both channel-wise and time-wise dependencies in the frequency domain.
MLP-based Forecasting Models Several studies have explored the use of MLP-based networks in
time series forecasting. N-BEATS [21] utilizes stacked MLP layers together with doubly residual
learning to process the input data and iteratively forecast the future. DEPTS [35] applies the Fourier transform to extract periods and uses MLPs to model periodic dependencies for univariate forecasting.
LightTS [22] uses lightweight sampling-oriented MLP structures to reduce complexity and com-
putation time while maintaining accuracy. N-HiTS [36] combines multi-rate input sampling and
hierarchical interpolation with MLPs for univariate forecasting. LTSF-Linear [37] proposes a set
of embarrassingly simple one-layer linear models to learn temporal relationships between input and
output sequences. These studies demonstrate the effectiveness of MLP-based networks in time series
forecasting tasks, and inspire the development of our frequency-domain MLPs in this paper.
3 FreTS
In this section, we elaborate on our proposed novel approach, FreTS, based on our redesigned MLPs
in the frequency domain for time series forecasting. First, we present the detailed frequency learning
architecture of FreTS in Section 3.1, which mainly includes two frequency learners with domain conversions. Then, we introduce in detail our redesigned frequency-domain MLPs adopted by the above frequency learners in Section 3.2. Besides, we also theoretically analyze their superior properties of global view and energy compaction, as mentioned in Section 1.
Problem Definition Let $[X_1, X_2, \cdots, X_T] \in \mathbb{R}^{N \times T}$ denote the regularly sampled multivariate time series dataset with $N$ series and $T$ timestamps, where $X_t \in \mathbb{R}^N$ denotes the multivariate values of the $N$ distinct series at timestamp $t$. We consider a time series lookback window of length $L$ at timestamp $t$ as the model input, namely $\mathbf{X}_t = [X_{t-L+1}, X_{t-L+2}, \cdots, X_t] \in \mathbb{R}^{N \times L}$; also, we consider a horizon window of length $\tau$ at timestamp $t$ as the prediction target, denoted as $\mathbf{Y}_t = [X_{t+1}, X_{t+2}, \cdots, X_{t+\tau}] \in \mathbb{R}^{N \times \tau}$. Then the time series forecasting task is to use the historical observations $\mathbf{X}_t$ to predict future values $\hat{\mathbf{Y}}_t$, where the typical forecasting model $f_\theta$ parameterized by $\theta$ produces the forecast via $\hat{\mathbf{Y}}_t = f_\theta(\mathbf{X}_t)$.
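To make the windowing concrete, below is a minimal PyTorch sketch of how the lookback/horizon pairs defined above can be sliced from a raw series; the function name and loop structure are illustrative assumptions, not taken from the official FreTS code.

```python
import torch

def make_windows(series: torch.Tensor, L: int, tau: int):
    """Slice a (N, T) multivariate series into lookback/horizon pairs (X_t, Y_t)."""
    N, T = series.shape
    xs, ys = [], []
    for t in range(L - 1, T - tau):
        xs.append(series[:, t - L + 1 : t + 1])    # X_t = [X_{t-L+1}, ..., X_t], shape (N, L)
        ys.append(series[:, t + 1 : t + 1 + tau])  # Y_t = [X_{t+1}, ..., X_{t+tau}], shape (N, tau)
    return torch.stack(xs), torch.stack(ys)        # (B, N, L), (B, N, tau)
```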
3.1 Frequency Learning Architecture
The frequency learning architecture of FreTS is depicted in Figure 2, which mainly involves Domain
Conversion/Inversion stages, Frequency-domain MLPs, and the corresponding two learners, i.e.,
the Frequency Channel Learner and the Frequency Temporal Learner. Besides, before the input is fed to the learners, we apply a dimension extension block to enhance the model
capability.

Figure 2: The framework overview of FreTS: the Frequency Channel Learner focuses on modeling inter-series dependencies with frequency-domain MLPs operating on the channel dimension; the Frequency Temporal Learner captures temporal dependencies by performing frequency-domain MLPs on the time dimension.

Specifically, the input lookback window $\mathbf{X}_t \in \mathbb{R}^{N \times L}$ is multiplied with a learnable weight vector $\phi_d \in \mathbb{R}^{1 \times d}$ to obtain a more expressive hidden representation $\mathbf{H}_t \in \mathbb{R}^{N \times L \times d}$, yielding $\mathbf{H}_t = \mathbf{X}_t \times \phi_d$, which brings more semantic information, inspired by word embeddings [38].
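As a rough illustration, the dimension extension step can be sketched as below; the module name and initialization scale are our own assumptions rather than details from the official implementation.

```python
import torch
import torch.nn as nn

class DimensionExtension(nn.Module):
    """Lift (N, L) inputs to (N, L, d) via a learnable vector phi_d."""
    def __init__(self, d: int):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(d) * 0.02)  # phi_d in R^{1 x d}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., N, L) -> H: (..., N, L, d), i.e., H_t = X_t x phi_d
        return x.unsqueeze(-1) * self.phi
```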
Frequency Channel Learner Considering channel dependencies is important for time series forecasting because it allows the model to capture interactions and correlations between different variables, leading to more accurate predictions. The frequency channel learner enables communication between different channels; it operates on each timestamp and shares the same weights across the $L$ timestamps to learn channel dependencies. Concretely, the frequency channel learner takes $\mathbf{H}_t \in \mathbb{R}^{N \times L \times d}$ as input. Given the $l$-th timestamp $\mathbf{H}_t^{:,(l)} \in \mathbb{R}^{N \times d}$, we perform the frequency channel learner by:
$$
\begin{aligned}
\mathcal{H}_{chan}^{:,(l)} &= \mathrm{DomainConversion}_{(chan)}(\mathbf{H}_t^{:,(l)}) \\
\mathcal{Z}_{chan}^{:,(l)} &= \mathrm{FreMLP}(\mathcal{H}_{chan}^{:,(l)}, \mathcal{W}^{chan}, \mathcal{B}^{chan}) \\
\mathbf{Z}^{:,(l)} &= \mathrm{DomainInversion}_{(chan)}(\mathcal{Z}_{chan}^{:,(l)})
\end{aligned}
\tag{3}
$$
where $\mathcal{H}_{chan}^{:,(l)} \in \mathbb{C}^{\frac{N}{2} \times d}$ denotes the frequency components of $\mathbf{H}_t^{:,(l)}$; $\mathrm{DomainConversion}_{(chan)}$ and $\mathrm{DomainInversion}_{(chan)}$ indicate that these operations are performed along the channel dimension. FreMLP denotes the frequency-domain MLPs proposed in Section 3.2, which take $\mathcal{W}^{chan} = \mathcal{W}_r^{chan} + j\mathcal{W}_i^{chan} \in \mathbb{C}^{d \times d}$ as the complex-valued weight matrix, with $\mathcal{W}_r^{chan} \in \mathbb{R}^{d \times d}$ and $\mathcal{W}_i^{chan} \in \mathbb{R}^{d \times d}$, and $\mathcal{B}^{chan} = \mathcal{B}_r^{chan} + j\mathcal{B}_i^{chan} \in \mathbb{C}^{d}$ as the complex-valued biases, with $\mathcal{B}_r^{chan} \in \mathbb{R}^{d}$ and $\mathcal{B}_i^{chan} \in \mathbb{R}^{d}$. The output $\mathcal{Z}_{chan}^{:,(l)} \in \mathbb{C}^{\frac{N}{2} \times d}$ of FreMLP, also in the frequency domain, is converted back to the time domain as $\mathbf{Z}^{:,(l)} \in \mathbb{R}^{N \times d}$. Finally, we assemble the $\mathbf{Z}^{:,(l)}$ of all $L$ timestamps into the whole output $\mathbf{Z}_t \in \mathbb{R}^{N \times L \times d}$.
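A hedged sketch of Eq. (3) using PyTorch's `torch.fft` follows. Note that `torch.fft.rfft` over $N$ points returns $\lfloor N/2 \rfloor + 1$ frequency components rather than the $N/2$ written above, a minor convention difference; the `FreMLP` module is sketched after Eq. (7) in Section 3.2.

```python
import torch

def frequency_channel_learner(H: torch.Tensor, fre_mlp) -> torch.Tensor:
    """Sketch of Eq. (3). H: (B, N, L, d), real-valued."""
    N = H.size(1)
    # DomainConversion along the channel dimension: real signal -> complex spectrum.
    H_freq = torch.fft.rfft(H, dim=1)           # (B, N//2 + 1, L, d), complex
    # One FreMLP shared across all L timestamps (it acts on the last dimension).
    Z_freq = fre_mlp(H_freq)
    # DomainInversion: back to a real-valued tensor in the original domain.
    return torch.fft.irfft(Z_freq, n=N, dim=1)  # (B, N, L, d)
```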
Frequency Temporal Learner The frequency temporal learner aims to learn temporal patterns in the frequency domain. It is also built on frequency-domain MLPs, operating on each channel and sharing the weights across the $N$ channels. Specifically, it takes the frequency channel learner output $\mathbf{Z}_t \in \mathbb{R}^{N \times L \times d}$ as input, and for the $n$-th channel $\mathbf{Z}_t^{(n),:} \in \mathbb{R}^{L \times d}$ we apply the frequency temporal learner by:
$$
\begin{aligned}
\mathcal{Z}_{temp}^{(n),:} &= \mathrm{DomainConversion}_{(temp)}(\mathbf{Z}_t^{(n),:}) \\
\mathcal{S}_{temp}^{(n),:} &= \mathrm{FreMLP}(\mathcal{Z}_{temp}^{(n),:}, \mathcal{W}^{temp}, \mathcal{B}^{temp}) \\
\mathbf{S}^{(n),:} &= \mathrm{DomainInversion}_{(temp)}(\mathcal{S}_{temp}^{(n),:})
\end{aligned}
\tag{4}
$$
where $\mathcal{Z}_{temp}^{(n),:} \in \mathbb{C}^{\frac{L}{2} \times d}$ is the corresponding frequency spectrum of $\mathbf{Z}_t^{(n),:}$; $\mathrm{DomainConversion}_{(temp)}$ and $\mathrm{DomainInversion}_{(temp)}$ indicate that the calculations are applied along the time dimension. $\mathcal{W}^{temp} = \mathcal{W}_r^{temp} + j\mathcal{W}_i^{temp} \in \mathbb{C}^{d \times d}$ is the complex-valued weight matrix, with $\mathcal{W}_r^{temp} \in \mathbb{R}^{d \times d}$ and $\mathcal{W}_i^{temp} \in \mathbb{R}^{d \times d}$, and $\mathcal{B}^{temp} = \mathcal{B}_r^{temp} + j\mathcal{B}_i^{temp} \in \mathbb{C}^{d}$ are the complex-valued biases, with $\mathcal{B}_r^{temp} \in \mathbb{R}^{d}$ and $\mathcal{B}_i^{temp} \in \mathbb{R}^{d}$. The output $\mathcal{S}_{temp}^{(n),:} \in \mathbb{C}^{\frac{L}{2} \times d}$ of FreMLP is converted back to the time domain as $\mathbf{S}^{(n),:} \in \mathbb{R}^{L \times d}$. Finally, we incorporate all channels into the output $\mathbf{S}_t \in \mathbb{R}^{N \times L \times d}$.
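The temporal learner of Eq. (4) is symmetric, with the DFT applied along the time dimension instead; a matching sketch under the same assumptions as above:

```python
import torch

def frequency_temporal_learner(Z: torch.Tensor, fre_mlp) -> torch.Tensor:
    """Sketch of Eq. (4). Z: (B, N, L, d), real-valued."""
    L = Z.size(2)
    Z_freq = torch.fft.rfft(Z, dim=2)           # (B, N, L//2 + 1, d), complex
    S_freq = fre_mlp(Z_freq)                    # weights shared across N channels
    return torch.fft.irfft(S_freq, n=L, dim=2)  # (B, N, L, d)
```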
Projection Finally, we use the learned channel and temporal dependencies to make predictions for the future $\tau$ timestamps $\hat{\mathbf{Y}}_t \in \mathbb{R}^{N \times \tau}$ by a two-layer feed-forward network (FFN) with one forward step, which avoids error accumulation:
$$
\hat{\mathbf{Y}}_t = \sigma(\mathbf{S}_t \phi_1 + b_1)\phi_2 + b_2
\tag{5}
$$
where $\mathbf{S}_t \in \mathbb{R}^{N \times L \times d}$ is the output of the frequency temporal learner, $\sigma$ is the activation function, $\phi_1 \in \mathbb{R}^{(L \cdot d) \times d_h}$ and $\phi_2 \in \mathbb{R}^{d_h \times \tau}$ are the weights, $b_1 \in \mathbb{R}^{d_h}$ and $b_2 \in \mathbb{R}^{\tau}$ are the biases, and $d_h$ is the inner-layer dimension size.
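A possible implementation sketch of the projection head in Eq. (5); the choice of ReLU for $\sigma$ is our assumption, not a detail confirmed by the text.

```python
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Two-layer FFN head of Eq. (5), forecasting all tau steps in one shot."""
    def __init__(self, L: int, d: int, d_h: int, tau: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(L * d, d_h),  # phi_1, b_1
            nn.ReLU(),              # sigma (ReLU is our assumption)
            nn.Linear(d_h, tau),    # phi_2, b_2
        )

    def forward(self, S: torch.Tensor) -> torch.Tensor:
        B, N, L, d = S.shape
        # Flatten the (L, d) features per channel, then map to tau future steps.
        return self.ffn(S.reshape(B, N, L * d))  # (B, N, tau)
```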
3.2 Frequency-domain MLPs

As shown in Figure 3, we elaborate our novel frequency-domain MLPs in FreTS, which are redesigned for the complex-valued frequency components in order to effectively capture key time series patterns with a global view and energy compaction, as mentioned in Section 1.
Definition 1 (Frequency-domain MLPs). Formally, for a complex-valued input $\mathcal{H} \in \mathbb{C}^{m \times d}$, given a complex-valued weight matrix $\mathcal{W} \in \mathbb{C}^{d \times d}$ and a complex-valued bias $\mathcal{B} \in \mathbb{C}^{d}$, the frequency-domain MLPs can be formulated as:
$$
\begin{aligned}
\mathcal{Y}^{\ell} &= \sigma(\mathcal{Y}^{\ell-1}\mathcal{W}^{\ell} + \mathcal{B}^{\ell}) \\
\mathcal{Y}^{0} &= \mathcal{H}
\end{aligned}
\tag{6}
$$
where $\mathcal{Y}^{\ell} \in \mathbb{C}^{m \times d}$ is the final output, $\ell$ denotes the $\ell$-th layer, and $\sigma$ is the activation function.
As both $\mathcal{H}$ and $\mathcal{W}$ are complex numbers, according to the rule of multiplication of complex numbers (details can be seen in Appendix C), we further expand Equation (6) to:
$$
\mathcal{Y}^{\ell} = \sigma\big(\mathrm{Re}(\mathcal{Y}^{\ell-1})\mathcal{W}_r^{\ell} - \mathrm{Im}(\mathcal{Y}^{\ell-1})\mathcal{W}_i^{\ell} + \mathcal{B}_r^{\ell}\big) + j\sigma\big(\mathrm{Re}(\mathcal{Y}^{\ell-1})\mathcal{W}_i^{\ell} + \mathrm{Im}(\mathcal{Y}^{\ell-1})\mathcal{W}_r^{\ell} + \mathcal{B}_i^{\ell}\big)
\tag{7}
$$
where $\mathcal{W}^{\ell} = \mathcal{W}_r^{\ell} + j\mathcal{W}_i^{\ell}$ and $\mathcal{B}^{\ell} = \mathcal{B}_r^{\ell} + j\mathcal{B}_i^{\ell}$. According to this equation, we implement the MLPs in the frequency domain (abbreviated as FreMLP) by computing the real and imaginary parts of the frequency components separately, and then stacking them into a complex number to obtain the final result. The specific implementation process is shown in Figure 3.
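The following is a minimal PyTorch sketch of one FreMLP layer implementing Eq. (7); the ReLU activation and the initialization scale are our assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class FreMLP(nn.Module):
    """One frequency-domain MLP layer per Eq. (7); a sketch, not the official code."""
    def __init__(self, d: int):
        super().__init__()
        self.W_r = nn.Parameter(torch.randn(d, d) * 0.02)  # real part of W
        self.W_i = nn.Parameter(torch.randn(d, d) * 0.02)  # imaginary part of W
        self.B_r = nn.Parameter(torch.zeros(d))            # real part of B
        self.B_i = nn.Parameter(torch.zeros(d))            # imaginary part of B

    def forward(self, Y: torch.Tensor) -> torch.Tensor:
        # Y is complex with shape (..., d). Expand the complex product into four
        # real matrix products, apply sigma to each part, then stack the parts
        # back into a complex tensor, exactly as in Eq. (7).
        real = torch.relu(Y.real @ self.W_r - Y.imag @ self.W_i + self.B_r)
        imag = torch.relu(Y.real @ self.W_i + Y.imag @ self.W_r + self.B_i)
        return torch.complex(real, imag)
```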
Theorem 1. Suppose that $H$ is the representation of the raw time series and $\mathcal{H}$ is the corresponding frequency components of its spectrum. Then the energy of a time series in the time domain is equal to the energy of its representation in the frequency domain. Formally, with the above notation:
$$
\int_{-\infty}^{\infty} |H(v)|^2 \, dv = \int_{-\infty}^{\infty} |\mathcal{H}(f)|^2 \, df
\tag{8}
$$
where $\mathcal{H}(f) = \int_{-\infty}^{\infty} H(v)\, e^{-j2\pi f v} \, dv$, $v$ is the time/channel dimension, and $f$ is the frequency dimension.

The proof is shown in Appendix D.2. Therefore, the operations of FreMLP, i.e., $\mathcal{H}\mathcal{W} + \mathcal{B}$, are equal to the operations $(H * W + B)$ in the time domain. This implies that the operations of frequency-domain MLPs can be viewed as global convolutions in the time domain.
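The discrete analogue of Theorem 1 is Parseval's theorem for the DFT, which can be checked numerically; the $1/n$ factor below comes from PyTorch's unnormalized FFT convention.

```python
import torch

# Energy in the time domain vs. energy of the DFT spectrum.
h = torch.randn(128)
H = torch.fft.fft(h)
energy_time = (h ** 2).sum()
energy_freq = (H.abs() ** 2).sum() / h.numel()  # 1/n from the unnormalized DFT
assert torch.allclose(energy_time, energy_freq, atol=1e-3)
```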
4 Experiments
To evaluate the performance of FreTS, we conduct extensive experiments on thirteen real-world time
series benchmarks, covering short-term forecasting and long-term forecasting settings to compare
with corresponding state-of-the-art methods.
Datasets Our experiments are performed on datasets from various domains, including traffic, energy, web, electrocardiogram, and healthcare. Specifically, for the task of short-term forecasting, we adopt the Solar², Wiki [39], Traffic [39], Electricity³, ECG [18], METR-LA [40],
and COVID-19 [5] datasets, following previous forecasting literature [18]. For the task of long-
term forecasting, we adopt Weather [16], Exchange [12], Traffic [16], Electricity [16], and ETT
datasets [15], following previous long time series forecasting works [15, 16, 32, 41]. We preprocess
all datasets following [18, 15, 16] and normalize them with the min-max normalization. We split the
datasets into training, validation, and test sets by the ratio of 7:2:1 except for the COVID-19 datasets
with 6:2:2. More dataset details are in Appendix B.1.
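A rough sketch of this preprocessing under our own assumptions (global min-max scaling and a chronological split; the authors' exact per-dataset procedure may differ):

```python
import numpy as np

def minmax_split(series: np.ndarray, train: float = 0.7, val: float = 0.2):
    """Min-max normalize a (N, T) series and split it chronologically."""
    x = (series - series.min()) / (series.max() - series.min() + 1e-8)
    T = x.shape[-1]
    n_tr, n_va = int(train * T), int(val * T)
    return x[..., :n_tr], x[..., n_tr:n_tr + n_va], x[..., n_tr + n_va:]
```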
Baselines We compare our FreTS with representative and state-of-the-art models for both short-term and long-term forecasting to evaluate its effectiveness. For short-term forecasting, we compare FreTS against VAR [25], SFM [31], LSTNet [12], TCN [13], GraphWaveNet [29], DeepGLO [39], StemGNN [18], MTGNN [17], and AGCRN [19]. We also include TAMP-S2GCNets
[5], DCRNN [40] and STGCN [42], which require pre-defined graph structures, for comparison. For
long-term forecasting, we include Informer [15], Autoformer [16], Reformer [20], FEDformer [32],
LTSF-Linear [37], and the more recent PatchTST [41] for comparison. Additional details about the
baselines can be found in Appendix B.2.
² https://www.nrel.gov/grid/solar-power-data.html
³ https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
Implementation Details Our model is implemented with PyTorch 1.8 [43], and all experiments are conducted on a single NVIDIA RTX 3080 10GB GPU. We take MSE (Mean Squared Error) as the loss function and report MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) as the evaluation metrics. For additional implementation details, please refer to Appendix B.3.
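For concreteness, the loss and metrics described above correspond to the following standard definitions (a sketch, not the authors' evaluation code):

```python
import torch

def mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Squared Error, used as the training loss."""
    return ((pred - target) ** 2).mean()

def mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Absolute Error (reported metric)."""
    return (pred - target).abs().mean()

def rmse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Root Mean Squared Error (reported metric)."""
    return ((pred - target) ** 2).mean().sqrt()
```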
Table 1: Short-term forecasting comparison. Each cell reports MAE/RMSE. The best results are in bold, and the second best results are underlined. Full benchmarks of short-term forecasting are in Appendix F.1.

| Models | Solar | Wiki | Traffic | ECG | Electricity | COVID-19 |
|---|---|---|---|---|---|---|
| VAR | 0.184/0.234 | 0.052/0.094 | 0.535/1.133 | 0.120/0.170 | 0.101/0.163 | 0.226/0.326 |
| SFM | 0.161/0.283 | 0.081/0.156 | 0.029/0.044 | 0.095/0.135 | 0.086/0.129 | 0.205/0.308 |
| LSTNet | 0.148/0.200 | 0.054/0.090 | 0.026/0.057 | 0.079/0.115 | 0.075/0.138 | 0.248/0.305 |
| TCN | 0.176/0.222 | 0.094/0.142 | 0.052/0.067 | 0.078/0.107 | 0.057/0.083 | 0.317/0.354 |
| DeepGLO | 0.178/0.400 | 0.110/0.113 | 0.025/0.037 | 0.110/0.163 | 0.090/0.131 | 0.169/0.253 |
| Reformer | 0.234/0.292 | 0.047/0.083 | 0.029/0.042 | 0.062/0.090 | 0.078/0.129 | 0.152/0.209 |
| Informer | 0.151/0.199 | 0.051/0.086 | 0.020/0.033 | 0.056/0.085 | 0.074/0.123 | 0.200/0.259 |
| Autoformer | 0.150/0.193 | 0.069/0.103 | 0.029/0.043 | 0.055/0.081 | 0.056/0.083 | 0.159/0.211 |
| FEDformer | 0.139/0.182 | 0.068/0.098 | 0.025/0.038 | 0.055/0.080 | 0.055/0.081 | 0.160/0.219 |
| GraphWaveNet | 0.183/0.238 | 0.061/0.105 | 0.013/0.034 | 0.093/0.142 | 0.094/0.140 | 0.201/0.255 |
| StemGNN | 0.176/0.222 | 0.190/0.255 | 0.080/0.135 | 0.100/0.130 | 0.070/0.101 | 0.421/0.508 |
| MTGNN | 0.151/0.207 | 0.101/0.140 | 0.013/0.030 | 0.090/0.139 | 0.077/0.113 | 0.394/0.488 |
| AGCRN | 0.123/0.214 | 0.044/0.079 | 0.084/0.166 | 0.055/0.080 | 0.074/0.116 | 0.254/0.309 |
| FreTS (Ours) | 0.120/0.162 | 0.041/0.074 | 0.011/0.023 | 0.053/0.078 | 0.050/0.076 | 0.123/0.167 |
Table 2: Long-term forecasting comparison. We set the lookback window size L to 96 and the prediction length to τ ∈ {96, 192, 336, 720}, except for the Traffic dataset, whose prediction lengths are τ ∈ {48, 96, 192, 336}. Each cell reports MAE/RMSE. The best results are in bold and the second best are underlined. Full results of long-term forecasting are included in Appendix F.2.

| Dataset | τ | FreTS | PatchTST | LTSF-Linear | FEDformer | Autoformer | Informer | Reformer |
|---|---|---|---|---|---|---|---|---|
| Weather | 96 | 0.032/0.071 | 0.034/0.074 | 0.040/0.081 | 0.050/0.088 | 0.064/0.104 | 0.101/0.139 | 0.108/0.152 |
| Weather | 192 | 0.040/0.081 | 0.042/0.084 | 0.048/0.089 | 0.051/0.092 | 0.061/0.103 | 0.097/0.134 | 0.147/0.201 |
| Weather | 336 | 0.046/0.090 | 0.049/0.094 | 0.056/0.098 | 0.057/0.100 | 0.059/0.101 | 0.115/0.155 | 0.154/0.203 |
| Weather | 720 | 0.055/0.099 | 0.056/0.102 | 0.065/0.106 | 0.064/0.109 | 0.065/0.110 | 0.132/0.175 | 0.173/0.228 |
| Exchange | 96 | 0.037/0.051 | 0.039/0.052 | 0.038/0.052 | 0.050/0.067 | 0.050/0.066 | 0.066/0.084 | 0.126/0.146 |
| Exchange | 192 | 0.050/0.067 | 0.055/0.074 | 0.053/0.069 | 0.064/0.082 | 0.063/0.083 | 0.068/0.088 | 0.147/0.169 |
| Exchange | 336 | 0.062/0.082 | 0.071/0.093 | 0.064/0.085 | 0.080/0.105 | 0.075/0.101 | 0.093/0.127 | 0.157/0.189 |
| Exchange | 720 | 0.088/0.110 | 0.132/0.166 | 0.092/0.116 | 0.151/0.183 | 0.150/0.181 | 0.117/0.170 | 0.166/0.201 |
| Traffic | 48 | 0.018/0.036 | 0.016/0.032 | 0.020/0.039 | 0.022/0.036 | 0.026/0.042 | 0.023/0.039 | 0.035/0.053 |
| Traffic | 96 | 0.020/0.038 | 0.018/0.035 | 0.022/0.042 | 0.023/0.044 | 0.033/0.050 | 0.030/0.047 | 0.035/0.054 |
| Traffic | 192 | 0.019/0.038 | 0.020/0.039 | 0.020/0.040 | 0.022/0.042 | 0.035/0.053 | 0.034/0.053 | 0.035/0.054 |
| Traffic | 336 | 0.020/0.039 | 0.021/0.040 | 0.021/0.041 | 0.021/0.040 | 0.032/0.050 | 0.035/0.054 | 0.035/0.055 |
| Electricity | 96 | 0.039/0.065 | 0.041/0.067 | 0.045/0.075 | 0.049/0.072 | 0.051/0.075 | 0.094/0.124 | 0.095/0.125 |
| Electricity | 192 | 0.040/0.064 | 0.042/0.066 | 0.043/0.070 | 0.049/0.072 | 0.072/0.099 | 0.105/0.138 | 0.121/0.152 |
| Electricity | 336 | 0.046/0.072 | 0.043/0.067 | 0.044/0.071 | 0.051/0.075 | 0.084/0.115 | 0.112/0.144 | 0.122/0.152 |
| Electricity | 720 | 0.052/0.079 | 0.055/0.081 | 0.054/0.080 | 0.055/0.077 | 0.088/0.119 | 0.116/0.148 | 0.120/0.151 |
| ETTh1 | 96 | 0.061/0.087 | 0.065/0.091 | 0.063/0.089 | 0.072/0.096 | 0.079/0.105 | 0.093/0.121 | 0.113/0.143 |
| ETTh1 | 192 | 0.065/0.091 | 0.069/0.094 | 0.067/0.094 | 0.076/0.100 | 0.086/0.114 | 0.103/0.137 | 0.120/0.148 |
| ETTh1 | 336 | 0.070/0.096 | 0.073/0.099 | 0.070/0.097 | 0.080/0.105 | 0.088/0.119 | 0.112/0.145 | 0.124/0.155 |
| ETTh1 | 720 | 0.082/0.108 | 0.087/0.113 | 0.082/0.108 | 0.090/0.116 | 0.102/0.136 | 0.125/0.157 | 0.126/0.155 |
| ETTm1 | 96 | 0.052/0.077 | 0.055/0.082 | 0.055/0.080 | 0.063/0.087 | 0.081/0.109 | 0.070/0.096 | 0.065/0.089 |
| ETTm1 | 192 | 0.057/0.083 | 0.059/0.085 | 0.060/0.087 | 0.068/0.093 | 0.083/0.112 | 0.082/0.107 | 0.081/0.108 |
| ETTm1 | 336 | 0.062/0.089 | 0.064/0.091 | 0.065/0.093 | 0.075/0.102 | 0.091/0.125 | 0.090/0.119 | 0.100/0.128 |
| ETTm1 | 720 | 0.069/0.096 | 0.070/0.097 | 0.072/0.099 | 0.081/0.108 | 0.093/0.126 | 0.115/0.149 | 0.132/0.163 |
Short-Term Time Series Forecasting Table 1 presents the forecasting accuracy of our FreTS
compared to thirteen baselines on six datasets, with an input length of 12 and a prediction length
of 12. The best results are highlighted in bold and the second-best results are underlined. From the
table, we observe that FreTS outperforms all baselines on MAE and RMSE across all datasets, and on average it achieves an improvement of 9.4% in MAE and 11.6% in RMSE. We credit this to the fact
that FreTS explicitly models both channel and temporal dependencies, and it flexibly unifies channel
and temporal modeling in the frequency domain, which can effectively capture the key patterns with
the global view and energy compaction. We further report the complete benchmarks of short-term
forecasting under different steps on different datasets (including METR-LA dataset) in Appendix F.1.
Long-term Time Series Forecasting Table 2 showcases the long-term forecasting results of FreTS
compared to six representative baselines on six benchmarks with various prediction lengths. For
the traffic dataset, we select 48 as the lookback window size L with the prediction lengths τ ∈
{48, 96, 192, 336}. For the other datasets, the input lookback window length is set to 96 and the
prediction length is set to τ ∈ {96, 192, 336, 720}. The results demonstrate that FreTS outperforms
all baselines on all datasets. Quantitatively, compared with the best results of Transformer-based
models, FreTS achieves an average decrease of more than 20% in MAE and RMSE. Compared with the more recent LTSF-Linear [37] and the SOTA PatchTST [41], FreTS still outperforms them in general. In addition, we provide further comparisons of FreTS with other baselines and report performance under different lookback window sizes in Appendix F.2. Combining Tables 1 and 2, we conclude that FreTS achieves competitive performance in both short-term and long-term forecasting tasks.
Frequency Channel and Temporal Learners We analyze the effects of the frequency channel and temporal learners under both short-term and long-term experimental settings in Table 3. We consider two variants: FreCL, where we remove the frequency temporal learner from FreTS, and FreTL, where we remove the frequency channel learner from FreTS. From the comparison, we observe that the frequency channel learner plays a more important role in short-term forecasting, while in long-term forecasting the frequency temporal learner is more effective. In Appendix E.1, we also conduct these experiments and report performance on other datasets. Interestingly, we find that the channel learner can lead to worse performance in some long-term forecasting cases. A potential explanation is that the channel independence strategy [41] brings more benefit to forecasting.

Table 3: Ablation studies of the frequency channel and temporal learners in both short-term (Electricity, METR-LA) and long-term (Exchange, Weather) forecasting. 'I/O' indicates lookback window size/prediction length; each cell reports MAE/RMSE.

| Variant | Electricity (I/O 12/12) | METR-LA (I/O 12/12) | Exchange (I/O 96/336) | Weather (I/O 96/336) |
|---|---|---|---|---|
| FreCL | 0.054/0.080 | 0.086/0.168 | 0.067/0.086 | 0.051/0.094 |
| FreTL | 0.058/0.086 | 0.085/0.167 | 0.065/0.085 | 0.047/0.091 |
| FreTS | 0.050/0.076 | 0.080/0.166 | 0.062/0.082 | 0.046/0.090 |
FreMLP vs. MLP We further study the effectiveness of FreMLP in time series forecasting. We
use FreMLP to replace the original MLP component in the existing SOTA MLP-based models (i.e.,
DLinear and NLinear [37]), and compare their performances with the original DLinear and NLinear
under the same experimental settings. The experimental results are presented in Table 4. From the
table, we easily observe that for any prediction length, the performance of both DLinear and NLinear
models has been improved after replacing the corresponding MLP component with our FreMLP.
Quantitatively, incorporating FreMLP into the DLinear model brings an average improvement of
6.4% in MAE and 11.4% in RMSE on the Exchange dataset, and 4.9% in MAE and 3.5% in RMSE
on the Weather dataset. A similar improvement has also been achieved on the two datasets with regard
to NLinear, according to Table 4. These results again confirm the effectiveness of FreMLP over standard MLPs, and we include more implementation details and analysis in Appendix B.5.
Table 4: Ablation study on the Exchange and Weather datasets with a lookback window size of 96 and prediction lengths τ ∈ {96, 192, 336, 720}. DLinear (FreMLP)/NLinear (FreMLP) means that we replace the MLPs in DLinear/NLinear with FreMLP. Each cell reports MAE/RMSE. The best results are in bold.

| Models | Exchange 96 | Exchange 192 | Exchange 336 | Exchange 720 | Weather 96 | Weather 192 | Weather 336 | Weather 720 |
|---|---|---|---|---|---|---|---|---|
| DLinear | 0.037/0.051 | 0.054/0.072 | 0.071/0.095 | 0.095/0.119 | 0.041/0.081 | 0.047/0.089 | 0.056/0.098 | 0.065/0.106 |
| DLinear (FreMLP) | 0.036/0.049 | 0.053/0.071 | 0.063/0.071 | 0.086/0.101 | 0.038/0.078 | 0.045/0.086 | 0.055/0.097 | 0.061/0.100 |
| NLinear | 0.037/0.051 | 0.051/0.069 | 0.069/0.093 | 0.115/0.146 | 0.037/0.081 | 0.045/0.089 | 0.052/0.098 | 0.058/0.106 |
| NLinear (FreMLP) | 0.036/0.050 | 0.049/0.067 | 0.067/0.091 | 0.109/0.139 | 0.035/0.076 | 0.043/0.084 | 0.050/0.094 | 0.057/0.103 |
Efficiency Analysis The complexity of our proposed FreTS is O(N log N + L log L). We perform efficiency comparisons
with some state-of-the-art GNN-based methods and Transformer-based models under different
numbers of variables N and prediction lengths τ , respectively. On the Wiki dataset, we conduct
experiments over N ∈ {1000, 2000, 3000, 4000, 5000} under the same lookback window size of 12