Ref 27
ABSTRACT Accurate system marginal price and load forecasts play a pivotal role in economic power
dispatch, system reliability and planning. Price forecasting helps electricity buyers and sellers in an energy
market to make effective decisions when preparing their bids and making bilateral contracts. Despite
considerable research work in this domain, load and price forecasting remains a complicated
task. Various uncertain elements contribute to electricity price and demand volatility, such as changes in
weather conditions, outages of large power plants, fuel cost, complex bidding strategies and transmission
congestion in the power system. Thus, to deal with such difficulties, we propose a novel hybrid deep
learning method based upon bidirectional long short-term memory (BiLSTM) and a multi-head self-attention
mechanism that can accurately forecast locational marginal price (LMP) and system load on a day-ahead
basis. Additionally, ensemble empirical mode decomposition (EEMD), a data-driven algorithm, is used for
the extraction of hidden features from the load and price time series. Besides that, an intuitive understanding
of how the proposed model works under the hood is demonstrated using different interpretability use cases.
The performance of the presented method is comprehensively compared with existing well-known techniques
for short-term electricity load and price forecasting. The proposed method produces considerably more
accurate results than the existing benchmarks.
INDEX TERMS Deep learning, energy markets, ensemble empirical mode decomposition, multi-head self-
attention, short-term load and price forecasting, Kalman smoothing, model interpretability.
and profit maximization. As specified by a research study in [2], a 1% improvement in load forecasting
error can save up to $300,000 annually for a 1 gigawatt (GW) peak load. If the price forecast is
integrated, the savings can rise to $600,000 annually as a ballpark estimate. In developed energy
markets, load-serving entities respond to wholesale market price signals by using the Demand Response
(DR) program. Electricity consumers who participate in a DR program adapt their usage patterns to
electricity prices in exchange for financial incentives in the form of reduced electricity bills. The
effectiveness of a DR program in a smart grid depends on precise load and price forecasts in a dynamic
environment. In the proper functioning of electric utilities, forecasting is applied to control actions
and decisions such as unit commitment, real-time dispatch and fuel scheduling. Large forecasting errors
contribute to increased operating costs. The nature of the sequential data, together with its underlying
context, is a significant factor in the performance of the forecasting technique employed.
A wide range of methods, varying in complexity and prediction procedure, have been introduced to
improve forecasting accuracy. On the basis of the time period, forecasting can be classified as short,
medium and long term. Short-term load forecasting (STLF) is based upon intra-day and day-ahead power
system operations. It is mainly applicable in unit commitment, real-time supply-demand management,
optimal load flow, demand response programs and the preparation of bidding strategies. Medium-term load
forecasting (MTLF) is primarily concerned with developing maintenance schedules and making fuel import
decisions. Long-term load forecasting (LTLF) is mainly utilized in power generation and transmission
system planning. Because of the ever-increasing complexity of the power system, forecasting load and
price with maximum accuracy and minimum computational effort is still an open research problem. Over
the years, various methods of electricity load and price forecasting have been suggested by scholars
across the globe. Electricity price forecasting is a relatively less investigated topic in comparison to
load forecasting. In current research, load and price forecasting are considered as two separate
problems. However, electricity load and price are closely intertwined and have a direct relationship,
that is, a change in demand affects the electricity price.
Among present-day forecasting techniques, artificial neural networks (ANN) are regarded as more advanced
and revolutionary. In addition, existing deep learning (DL) research is mostly based on black-box
techniques and does not give much insight into how different factors affect forecasting accuracy.
Therefore, the main focus of this paper is to address the challenges of multi-horizon load and price
forecasting. We propose a unique sequence-to-sequence (S2S) framework based upon bidirectional long
short-term memory (BiLSTM) and Multi-Head Self-Attention (MHSA) mechanisms, which can learn internal
representations from sequential data for accurate prediction. Also, special consideration is given to
addressing the black box problem in DL by providing insights about the internal working of the model.
The notable contributions of this research work are: (1) development of a hybrid DL model that can be
used for accurate day-ahead multi-horizon load and price forecasting in a liberalised energy market;
(2) use of signal processing techniques to capture hidden features from complex electricity time series
data; (3) identification of globally important variables and significant events or regimes within the
data using a novel interpretable strategy; (4) performance analysis of the suggested technique and its
comparison with existing popular DL models.
The remainder of the paper is organized as follows. Section II describes the existing literature on
various time series forecasting methods. In Section III, we formulate a novel DL framework for the time
series forecasting task. Section IV deals with the implementation and performance evaluation. Section V
presents an analysis of different interpretability use cases. Finally, Section VI presents the
conclusions and future work directions.
II. RELATED WORK
Various research studies have been introduced in the realm of short-term load and price forecasting. As
a whole, electricity forecasting models may be grouped into two classes, namely classical time series
models and computational intelligence models. Classical time series models such as statistical and
regression techniques are mostly efficacious for predicting stationary or linear time series data.
Computational intelligence methods, such as ANN, Evolutionary Algorithm (EA) and fuzzy modelling, are
trained on the data, and the underlying structure is discovered through training. Regression-based
methods are popular in electricity price and load forecasting tasks and are used for modelling the
linear relationship between a target variable $y$ and multiple independent variables
$x_1, x_2, \ldots, x_n$. Auto-Regressive Integrated Moving Average (ARIMA) is a commonly employed
regression technique that is specifically designed to handle data that display non-stationarity. The
ARIMA method combines the autoregression (AR), integration (I) and moving average (MA) components. In
the deregulated energy market, ARIMA-based techniques have been widely used for price forecasting [3]
with good accuracy. A study presented in [4] uses a modified ARIMA method for short-term prediction of
hourly loads of weekdays, weekends and public holidays. It uses the operator estimate as an initial
forecast and combines it with temperature and load data parameters to obtain better forecasting
accuracy.
Classical time series methods, including ARIMA, are mainly effective on univariate data and most of the
time give a simplified solution. They are not robust to fluctuations and disturbances. Moreover, they
do not work for categorical feature data. Many research studies on electricity time series forecasting
focus on building prediction strategies using classical machine learning methods like Support Vector
Regression (SVR). SVR can come up with a more generalized solution to the problem than the traditional
regression
approaches. However, the drawback of SVR is that it can predict just one step ahead by default. To
solve this, the authors in [5] presented a direct strategy of using multiple SVR models in parallel to
predict the next 24 hours of load simultaneously. This approach can be computationally intensive, given
the number of parallel models.
Real-world time series signals are complex and can have various causes. Each of these causes may take
place at particular periods or frequencies. Time series decomposition can provide useful insights at
these different time intervals and frequencies for accurate forecasting. To enhance prediction
performance, hybrid methods built on signal decomposition techniques have gained great interest
recently. Wavelet Transform (WT) and Empirical Mode Decomposition (EMD) are two broadly used time series
decomposition techniques, which decompose the signal into distinct frequency components and extract
hidden features of the signal to discover the fluctuations. The WT method decomposes a sequence into its
constituent parts using a predefined mother wavelet. The authors in [6] utilize WT to decompose the load
time series into different frequency components and use SVR for forecasting. However, the WT method has
some detection constraints. The decomposition deteriorates under noisy conditions and the mother wavelet
is non-adaptive [7]. In comparison with other signal processing algorithms, EMD is a more robust
technique and has proven to be more feasible in different domains for analyzing time series signals that
exhibit non-linearity and non-stationarity. EMD breaks down a time series signal without leaving the
time domain, which helps to preserve the physical meaning of the data. EMD employs a sifting process to
decompose the time series signal into various frequency elements termed Intrinsic Mode Functions (IMFs),
in addition to a residue component.
In the present literature, most of the approaches [8], [9] combine the EMD algorithm with SVR and DL
techniques to enhance the predictive power of the existing model. These hybrid strategies are based upon
the concept of ‘‘divide and conquer’’. A divide and conquer approach iteratively splits a problem into
two or more sub-problems of a similar type until these become simple enough to be solved directly. In
the end, these simpler solutions are aggregated to provide the solution of the original problem [10],
[11]. A considerable downside of standard EMD is the common occurrence of the mode mixing problem. In
mode mixing, individual IMFs contain multiple local frequencies, which is often the consequence of
signal intermittency and noise. To overcome this issue, an ensemble empirical mode decomposition (EEMD)
technique is introduced in paper [12]. The EEMD method adds a white noise series to the original signal
and decomposes the resulting sequence into distinct IMFs using the sifting process.
Nowadays, DL-based algorithms are the prominent approaches for tackling time series forecasting
problems. Ever-growing data accessibility and computing power have made DL an essential part of advanced
time series prediction methods. Recurrent neural networks (RNN) are a class of DL models that add
built-in support for input data comprised of a sequence of observations and are better suited to
modeling problems such as time series analysis and forecasting [13], [14]. In practical scenarios, when
the gradient is passed back through many timesteps, it tends to vanish or explode, which results in the
vanishing or exploding gradient problems of standard RNNs [15], [16]. Because of this, it can be
challenging to train standard RNNs for complex problems that involve learning extended temporal
dependencies. Long short-term memory (LSTM) networks are a variant of RNN that can learn longer-term
dependencies and thus can tackle issues like vanishing and exploding gradients. Multiple forecasting
frameworks built around LSTM have been proposed for electricity price forecasting [17], [18] and load
forecasting [19]. In study [20], the authors investigated the performance of the LSTM network in
comparison to support vector machines for different forecasting horizons on an electricity price
dataset. The study in [21] focuses on analyzing the key factors that influence electricity load and
price forecasting. The results show that exogenous variables like weather and hour of the day have a
meaningful influence on model performance.
Electricity price time series at daily, weekly and monthly intervals exhibit seasonality along with
sudden, momentary and normally unexpected price fluctuations [22]. Such challenges have motivated
researchers to concentrate their endeavors on the development of improved electricity price forecasting
methods. Moreover, due to variation in household activities, discrete domestic loads continuously affect
the overall system demand profile. An LSTM-based approach utilizing individual resident behaviors is
presented in [23], which takes household appliance consumption patterns into consideration. The authors
in the study [18] presented an optimized EEMD-LSTM hybrid prediction model that amply utilizes the
hidden information inside the price dataset for better prediction. BiLSTM is an extension of the
described LSTM model which processes the input data in two directions: once in a forward direction (from
past to future) and again in the reverse direction (from future to past). In paper [24], the authors
compared BiLSTM with regular LSTM and reported improved performance of BiLSTM over the standard
unidirectional LSTM model. The Gated Recurrent Unit (GRU) is a newer generation of RNN that is closely
related to LSTM. A research study in [25] investigated the impact of electricity price on STLF using
GRU. In comparison to LSTM, the GRU gives better computational performance with equal or higher
accuracy. In paper [26], the GRU and LSTM methods are integrated for the distribution feeder’s LTLF. The
suggested approach demonstrated superior performance in a practical study of West Canada’s urban grid in
comparison to regression methods and feed-forward neural networks.
To address the slow training speed problem related to ANNs, the authors in [27] introduced a novel
learning algorithm named Extreme Learning Machines (ELM). A unique
characteristic feature of the standard ELM algorithm is that it has only one large hidden layer, which
remains untrained. It has some advantages, like faster network training and reduced overfitting, but
comes with some feature learning constraints [28]. The authors in [29] put forward a novel method named
recurrent extreme learning machines, which combines ELM with RNN. The presented approach exhibits
promising results with much lower training time than traditional back-propagation-based feed-forward
neural network techniques. The study in [30] presents a unique ensemble model in which ELM is integrated
with WT and partial least squares regression methods for STLF. It uses distinct wavelet combinations to
construct an ensemble of independent forecasters. Discrete sub-components obtained from the wavelet
decomposition and independent outputs of 24 parallel ELM models are aggregated with the help of the
partial least squares regression process to forecast the hourly load of the following day.
The authors in [31], [32] considered a CNN-based approach for load forecasting of individual households
and reported better performance in comparison to other DL methods. Intelligent forecasting techniques
based on hybrid CNN-LSTM methods convincingly outperform baseline CNN or LSTM architectures across a
wide range of forecasting tasks [33]. Hybrid CNN-LSTM methods utilize convolutional layers for
extracting useful feature information and learning the core structure of time series data, together with
the benefits of LSTM layers for capturing shorter and longer-term dependencies. In recent years,
Temporal Convolutional Networks (TCN), a special CNN variant, have gained traction due to their
distinctive structure that makes them suitable for handling sequential data. TCNs avert information
leakage from future to past by employing causal connections [34] and use one-dimensional dilated
convolutions, which enlarge the receptive field by skipping input values with a fixed step [35].
Furthermore, TCNs present various benefits over RNN-based methods, including reduced memory
requirements, parallel processing of longer sequences, flexible receptive field size and more stable
gradients [36].
Fuzzy neural networks (FNN) combine fuzzy logic with the learning capability of traditional ANNs. In
[37], the authors presented an hourly day-ahead STPF approach by employing a novel neuro-fuzzy modelling
technique and tested its practicality on the ISO New England energy market price dataset. In some cases,
gradient-based training approaches can become trapped in local minima, making Evolutionary Algorithms
(EA) a promising choice. Particle swarm optimization (PSO) is an EA method built upon the idea of the
social interactions of humans. The research study in [38] introduces a neural network whose weights are
adjusted through the PSO algorithm for faster training and better model convergence. A hybrid approach
that integrates PSO, WT and an adaptable fuzzy-based prediction network has been presented in [39] for
predicting Portugal’s short-term wind power. The results verify the efficacy of the given approach for
the wind power prediction task.
Auto-Encoders or S2S models, a class of deep ANN, have been broadly explored for time series prediction
tasks as they are capable of discovering complex patterns deep-rooted in the input data. The research
study in [40] presents stacked denoising auto-encoders for STPF. It reports better results for online
hourly and day-ahead forecasts than non-deep-learning models like SVR. S2S-based methods have been a hot
spot of DL research. However, only a few attempts have so far been made to apply these methods to
electricity load and price forecasting.
The attention mechanism is a recent trend in DL which helps to understand what part of the historical
information is crucial in predicting future behavior. The core idea behind the attention mechanism is to
ignore irrelevant information and focus only on the details most relevant to the task at hand, similar
to the human brain’s attention. Traditional STLF and STPF approaches built upon S2S models treat all
input features equally and overlook the fact that individual input features affect the output in
different ways, which results in poor forecasting performance, especially if the input sequences are too
long. As a more sophisticated approach, methods based on attention architectures have been employed for
different time series forecasting tasks [41], [42]. In [43], the authors used an attention-based RNN
encoder-decoder method that outperforms traditional S2S models on machine translation tasks. In 2017, a
study presented by the Google team [44] introduced the Transformer framework, built entirely from
self-attention layers, to learn text representations. DL-based approaches are generally black boxes, as
they do not provide much insight into the internal working of the model. The attention mechanism has the
advantage of being more interpretable than other DL models, thanks to the attention weights. Recently,
attention-based approaches have been adopted for time series with interpretability motivations. In [45],
the Google Cloud AI team investigated different interpretability use cases, such as examining variable
importance and identifying significant events within the time series dataset using average attention
weights. From a practical viewpoint, these interpretable insights can help model developers further
improve the accuracy of the forecasting model.
Currently, there are few research methodologies that forecast both electricity load and price in an
efficient manner. Besides, most of the existing approaches do not consider the impact of electricity
load on its price. Because load and price are closely intertwined in an electricity market, forecasting
both these quantities should be viewed as a unified problem. Apart from that, most present-day
electricity time series forecasting techniques do not offer any explanation of the outcome of the
models. The proposed research work undertakes the problem of multi-horizon electricity price and load
forecasting in a wholesale energy market. It also attempts to address the black box problem associated
with DL methods with the help of different interpretability scenarios for better model transparency and
visibility.
Algorithm 1 EEMD
1: Add white noise to the signal $X(t)$ to generate $X_i = X + \alpha N_i$, where $N_i$,
$i = 1, 2, 3, \ldots, n$, is a white noise series with zero mean and unit variance.
2: For each $i$, decompose $X_i$ using the sifting process to obtain the corresponding intrinsic mode
functions $\mathrm{IMF}_{im}$, where $m = 1, 2, 3, \ldots, M$.
3: Repeat the above steps, adding a different white noise series to the original signal each time.
4: Obtain the ensemble means of the corresponding IMFs of all decompositions as the final result:
$\mathrm{IMF}_m = \frac{1}{n} \sum_{i=1}^{n} \mathrm{IMF}_{im}$.
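The ensemble-averaging steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the sifting process is replaced by a hypothetical placeholder, `toy_decompose`, which only peels off moving-average residuals and is not a real EMD. The names `eemd`, `toy_decompose`, `n_trials` and `alpha` are likewise assumptions made for illustration. In practice, a dedicated EMD library would supply the decomposition step.

```python
import random


def moving_average(x, window):
    """Centered moving average; the window is truncated at the series edges."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def toy_decompose(x, n_modes=2):
    """Hypothetical stand-in for the sifting process (NOT a real EMD).

    Each pass subtracts a wider moving average, so the returned
    n_modes components plus the final residue sum back to x.
    """
    components, current = [], list(x)
    for m in range(n_modes):
        trend = moving_average(current, 2 ** (m + 1) + 1)
        components.append([c - t for c, t in zip(current, trend)])
        current = trend
    return components + [current]


def eemd(x, decompose, n_trials=50, alpha=0.1, seed=0):
    """Ensemble averaging of Algorithm 1 around a pluggable decomposition."""
    rng = random.Random(seed)
    sums = None
    for _ in range(n_trials):
        # Step 1: X_i = X + alpha * N_i, with N_i zero-mean, unit-variance noise.
        noisy = [v + alpha * rng.gauss(0.0, 1.0) for v in x]
        # Step 2: decompose the noisy copy (step 3 is the loop itself).
        comps = decompose(noisy)
        if sums is None:
            sums = [[0.0] * len(x) for _ in comps]
        for acc, comp in zip(sums, comps):
            for j, v in enumerate(comp):
                acc[j] += v
    # Step 4: ensemble mean of each component over all trials.
    return [[v / n_trials for v in acc] for acc in sums]
```

Because the placeholder always returns the same number of components, the averaging in step 4 is well defined. A real sifting process may yield a different number of IMFs per trial, which an actual implementation has to handle.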
III. MULTI-HORIZON LOAD AND PRICE FORECASTING USING HYBRID DEEP LEARNING APPROACH
In recent times, DL techniques have been used extensively for multi-horizon time series forecasting
tasks because of their capacity to extract complex patterns and to better express the intricate
relationship of electricity load or price with other exogenous variables. Incorporating price forecasts
along with load forecasts can further improve decision-making and return on investment. Multi-horizon
time series forecasting estimates the target variable at distinct future timesteps across the
forecasting horizon. A multi-horizon time series prediction involves forecasting the future $k$ values
$[y_{n+1}, \ldots, y_{n+k}]$ of a historic time series $[y_1, \ldots, y_n]$ comprised of $n$
observations, where $k > 1$ indicates the forecast horizon. Multi-horizon forecast use cases generally
have access to a broad range of input data, as illustrated in Figure 1, comprising known details
regarding the future and historical time series. Moreover, the heterogeneity among various data inputs
further increases the complexity of multi-horizon time series forecasting tasks, especially as the
forecast horizon increases.
A. EEMD
The standard EMD algorithm may encounter problems of mode mixing in cases where large amplitude signals
include fluctuations with higher frequencies. In the mode mixing phenomenon, different frequencies
coexist in the same IMF. Moreover, if the first decomposed component is defective, the later ones will
show the same distorted results as well, and incorrect IMFs may be obtained. To resolve this difficulty,
a noise-assisted signal analysis technique called ensemble empirical mode decomposition (EEMD) [12] is
applied for the extraction of hidden features from time series data. In the EEMD method, a uniformly
distributed white noise series is added to the original signal before the decomposition process to
mitigate the effect of mode mixing. In the presented work,
B. KALMAN SMOOTHING
In multi-horizon time series forecasting, the presence of noisy and messy data can hurt the final
predictions. Denoising and smoothing reduce the noise in time series data and generally allow us to see
patterns or trends more clearly. The Kalman filter [46] is probably one of the most famous and widely
used estimation algorithms for noisy systems that continuously change their states [47]. Kalman filters
estimate the current state of a noisy system, based on previous observations, using a two-step
‘‘predict-update’’ process. It is a recursive data processing algorithm that updates its predictions
whenever new information arrives. Kalman smoothing is a denoising method that utilizes both past and
future observations to make estimates. It is a forward-backward pass algorithm. In the forward pass, the
algorithm computes $y_{t+1|t}$ and $y_{t+1|t+1}$ for $0 \le t < T$, similar to Kalman filtering. In the
backward pass, it computes $y_{t|T}$ for $0 \le t < T$. More specifically, the smoother uses all the
available data to re-estimate the prediction. Smoothness can be controlled by adjusting the noise
covariance $\sigma^2$. In the proposed method, we applied the Kalman smoothing technique to denoise the
raw time series data. This choice proved to be favourable in terms of forecasting accuracy. Also, it is
light on memory and fast to execute.
C. PROPOSED MHSA-S2S ALGORITHM
Multi-horizon time series forecasting can be framed as an S2S or many-to-many estimation problem because
it accepts a sequence as input and outputs a sequence prediction. S2S models can handle variable-length
input and output sequences and are good at learning fine-grained details from the data. The main
drawback of traditional S2S models is that if the encoder produces a poor summary of the input data, the
decoder translation will be poor as well. In the case of longer time series sequences, this problem is
more evident, which makes it difficult for
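The forward-backward procedure described in Section III-B (Kalman smoothing) can be sketched in a few lines of Python. The sketch assumes a scalar local-level (random walk plus noise) model, which is an illustrative choice and not necessarily the state-space model used in this work. The function name `kalman_smooth`, the process-noise variance `q`, the observation-noise variance `r` and the initial prior are all assumptions.

```python
def kalman_smooth(y, q=0.01, r=1.0):
    """Kalman filter followed by a Rauch-Tung-Striebel backward pass.

    Assumed model: x_t = x_{t-1} + w_t, w_t ~ N(0, q)  (hidden level)
                   y_t = x_t + v_t,     v_t ~ N(0, r)  (noisy observation)
    """
    n = len(y)
    x_pred, p_pred = [0.0] * n, [0.0] * n  # one-step-ahead estimates
    x_filt, p_filt = [0.0] * n, [0.0] * n  # filtered estimates

    # Forward pass: predict, then update with the new observation.
    for t in range(n):
        if t == 0:
            x_pred[t], p_pred[t] = y[0], r  # assumed initial prior
        else:
            x_pred[t], p_pred[t] = x_filt[t - 1], p_filt[t - 1] + q
        gain = p_pred[t] / (p_pred[t] + r)
        x_filt[t] = x_pred[t] + gain * (y[t] - x_pred[t])
        p_filt[t] = (1.0 - gain) * p_pred[t]

    # Backward pass: revise each estimate using the later observations.
    x_smooth = list(x_filt)
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + g * (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth
```

A smaller ratio `q / r` makes the model trust the observations less and yields a smoother output, which corresponds to controlling smoothness through the noise covariance.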
2) DECODER NETWORK
The decoder network at each time step focuses on different parts of the encoder. A decoder with two
stacked BiLSTM layers and skip connections is used for decoding the hidden states output from the
encoder. The attention mechanism calculates a separate source context vector at each decoding time-step,
unlike the traditional S2S model, which uses the same context vector for every decoder hidden state. The
decoder network uses the temporal context vector $h_a$ as an initial state and reconstructs the past
time series as a target sequence of fixed length. The temporal context vector is combined with the
previous decoder output and passed to the decoder BiLSTM cell to produce a new decoder hidden state.
Thus, the temporal context vector and the decoder target hidden state at time-step $t$ are concatenated
to produce the hidden vector $S_t$, which is then passed to the next layer and is given as:
$S_t = \mathrm{concatenate}[s_t, c_t].$ (7)
The parameters inside the BiLSTM layers are updated by the back-propagation algorithm. Simply stacking
several hidden layers can easily become computationally expensive. The deeper the network, the harder it
is to train, because the further a gradient has to travel, the more prone the network is to vanishing
and exploding gradient problems. One of the solutions for avoiding vanishing or exploding gradient
issues is to use skip or residual connections. Even though there is no solid theoretical explanation, in
practice long skip connections are incredibly useful in dense prediction tasks. The additional paths
provided by skip connections are beneficial for model convergence and for avoiding the vanishing
gradient problem in deeper networks. Also, they provide simpler interconnections if additional
complexity is not required.
3) INTERPRETABLE MHSA MECHANISM
To augment the learning capability of the architecture, an interpretable MHSA layer is employed. The
MHSA mechanism is better at modeling internal context or dependencies between different parts of the
sequence than the standard attention method. It helps to learn the important relationships between
values of a sequence, just as the human brain understands the words in a sentence. The MHSA mechanism
takes the data processed by the decoder network as input and passes it to the concluding fully connected
layer, which finally outputs the forecast value. It is designed to further correct the decoder output
and direct the model to look only at relevant frames for an improved prediction. Furthermore, it helps
to avoid overlooking crucial information by acting as an interface between the decoder and the output
layer. The primary goal of the MHSA mechanism is to achieve superior forecasting results that are
interpretable to end-users at the same time. The MHSA mechanism allows the input values of the same
sequence to interact with one another to figure out which of them deserve more attention. On the basis
of these interactions and attention scores, concatenated outputs are produced.
A standard self-attention mechanism takes the input in the form of a query (Q), key (K) and value (V).
In order to obtain Q, K and V, each input is multiplied with three sets of small randomly initialized
weights, separate for keys, queries and values. We first take a dot product between the query of input 1
and all keys, which results in an attention score against
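The query, key and value computation outlined in Section III-C.3 can be made concrete with a small Python sketch. It assumes the scaled dot-product formulation of the Transformer [44], in which scores are divided by the square root of the key dimension and normalized with a softmax. The surviving text does not confirm that this exact scaling is used here, so it should be read as an assumption. All function names, dimensions and the weight initialization range are illustrative.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single head: project inputs to Q, K, V, score, normalize, mix values."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Dot product of every query with every key (scaling is an assumption).
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    weights = [softmax(r) for r in scores]  # each row sums to 1
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads):
    """Run several heads independently and concatenate their outputs."""
    outputs = [self_attention(x, *head)[0] for head in heads]
    return [sum((o[t] for o in outputs), []) for t in range(len(x))]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights, standing in for learned projection matrices."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
```

The `weights` matrix returned by `self_attention` is the quantity that interpretability analyses typically inspect: row `t` shows how strongly time step `t` attends to every other step.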
TABLE 1. Set of input features for system load and locational marginal
price target variables.
C. IMPLEMENTATION DETAILS
As described earlier, the attention-based S2S algorithm is
introduced to forecast multi-horizon load and price values for
FIGURE 6. Correlation between system load and LMP. a wholesale energy market. We divided the dataset into three
subsets: a training dataset for learning, a validation dataset for
tracking the performance of the model and a testing set for
evaluation in proportion to 60%, 20%, and 20% respectively.
In order to perform the 24-hours day-ahead load and price
forecasts, the model is fed 168-hours previous feature values
using the sample generation process. The MHSA-S2S model
as seen in Figure 8 has two BiLSTM layers and an attention
layer in the encoder network. Bahdanau attention layer pro-
duces a context vector which is passed as an initial hidden
state to the decoder. Identical BiLSTM layers are employed in
the decoder network as well and each layer with 128 neurons
is followed by a skip or residual connection layer.
FIGURE 7. Correlation between average natural gas price and electricity LMP.

LMP etc. with an hourly resolution. To get the most out of the datasets, additional features such as hour of the day, day of the week, month, previous-day average load or LMP, same-hour previous-day load or LMP, and previous-week average load or LMP are extracted from the existing feature variables. As natural gas is a primary fuel for NE generators, the monthly average natural gas price is also included in the price dataset. The final sets of input features from the load and price datasets used for training the model are shown in Table 1. To handle categorical variables in the dataset, the target encoding technique [52] is used. Target encoding is a Bayesian encoding method that does not increase the dimensions of the dataset. It computes the average of the target variable values for every category and substitutes each category value with this average. Datasets are scaled to ensure that all the input values are on the same scale. The mean µ is subtracted from every data value, and the result is divided by the standard deviation σ. The standardization of data can be expressed as:

$$x' = \frac{x - \mu}{\sigma}. \quad (12)$$

Slow convergence of deep neural networks motivated us to add residual connections between hidden layers of the network for faster gradient propagation. Such connections can skip over any surplus or unused parts of the architecture, providing adaptive depth and complexity to accommodate different kinds of datasets and scenarios (i.e., the network can fall back to a simpler model if extra complexity is not needed). At first, system load or LMP sequences are decomposed by the EEMD process into several IMFs. These IMFs, along with the raw input features, are passed to the CatBoost algorithm, which selects only the best features on the basis of its feature importance function. The significant features are selected based on how much, on average, the loss value changes when a particular feature value is varied. The selected data is then passed to the Kalman smoothing algorithm for noise removal. Using a custom sample generation process, sequences are transformed into three-dimensional data (i.e., samples, time steps, features). We selected mean squared error (MSE) as the loss function and tanh as the activation function for the hidden units of the BiLSTM layers. We used Adam as the optimizer and trained the model for 200 epochs. To achieve an optimal number of learnable parameters along with quick model training, the MHSA layer employs 8 heads. The hyperparameter settings are identical for both the system load and LMP forecasting tasks, which supports the generalization ability of the proposed model across different kinds of datasets. Open source libraries such as TensorFlow, Keras, scikit-learn, pyEMD and pyKalman are used in the implementation of
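The preprocessing steps described in this section (target encoding, standardization as in (12), and generation of three-dimensional samples) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `target_encode`, `standardize` and `make_samples` are hypothetical, the target encoding is the plain per-category mean without smoothing, and the statistics are computed on the full input, whereas in practice they would be fitted on the training split only.

```python
from statistics import mean, pstdev


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it.

    Assumption: plain per-category mean, with no smoothing toward the
    global mean.
    """
    sums, counts = {}, {}
    for cat, tgt in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + tgt
        counts[cat] = counts.get(cat, 0) + 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories]


def standardize(values):
    """Apply x' = (x - mu) / sigma using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def make_samples(rows, n_steps):
    """Slide a window of n_steps rows over an hourly series.

    rows is a list of per-hour feature vectors. The result is a nested
    list shaped (samples, time steps, features).
    """
    return [rows[i:i + n_steps] for i in range(len(rows) - n_steps + 1)]


# Hypothetical toy data: day-of-week labels and hourly load values.
day_of_week = ["Mon", "Tue", "Mon", "Tue"]
load = [10.0, 20.0, 14.0, 24.0]

encoded_day = target_encode(day_of_week, load)
scaled_load = standardize(load)
rows = [[d, l] for d, l in zip(encoded_day, scaled_load)]
samples = make_samples(rows, n_steps=2)
```

With four hourly rows and a window of two time steps, `samples` holds three windows, each containing two time steps of two features. The window length used in the paper is not stated in this section, so `n_steps=2` is purely illustrative.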
TABLE 2. Average RMSE score of various seasons resulted by MHSA-S2S and other five existing DL-based methods for ISO New England day-ahead
system load forecast.
FIGURE 10. The average MAE score of 24-hours system load forecasting
with respect to various seasons resulted by the MHSA-S2S model and
other five existing DL methods.
FIGURE 13. The average MAE score of 24-hours LMP forecasting with
respect to various seasons resulted from the MHSA-S2S model and the
other five existing DL methods.
TABLE 3. Average RMSE score of various seasons resulted by MHSA-S2S and other five existing DL-based methods for ISO New England day-ahead LMP
forecast.
which indicate lesser forecasting error of the MHSA-S2S model in comparison to the other existing method.
TABLE 4. Feature variables attention weights magnitude.
deploy the forecasting model on resource-constrained Internet of Things (IoT) production environment by decreasing the storage and computational complexity.

REFERENCES
[1] W.-C. Hong, Y. Dong, W. Y. Zhang, L.-Y. Chen, and B. K. Panigrahi, “Cyclic electric load forecasting by seasonal SVR with chaotic genetic algorithm,” Int. J. Electr. Power Energy Syst., vol. 44, no. 1, pp. 604–614, Jan. 2013.
[2] T. Hong, “Crystal ball lessons in predictive analytics,” EnergyBiz Mag., vol. 12, no. 2, pp. 35–37, 2015.
[3] J. Contreras, R. Espinola, F. J. Nogales, and A. J. Conejo, “ARIMA models to predict next-day electricity prices,” IEEE Trans. Power Syst., vol. 18, no. 3, pp. 1014–1020, Aug. 2003.
[4] N. Amjady, “Short-term hourly load forecasting using time-series modeling with peak load estimation capability,” IEEE Trans. Power Syst., vol. 16, no. 3, pp. 498–505, Aug. 2001.
[5] E. Ceperic, V. Ceperic, and A. Baric, “A strategy for short-term load forecasting by support vector regression machines,” IEEE Trans. Power Syst., vol. 28, no. 4, pp. 4356–4364, Nov. 2013.
[6] J. Pahasa and N. Theera-Umpon, “Short-term load forecasting using wavelet transform and support vector machines,” in Proc. Int. Power Eng. Conf. (IPEC), Dec. 2008, pp. 47–52.
[7] S. R. Mohanty, N. Kishor, and D. K. Singh, “Comparison of empirical mode decomposition and wavelet transform for power quality assessment in FPGA,” in Proc. IEEE Int. Conf. Power Electron., Drives Energy Syst. (PEDES), Dec. 2018, pp. 1–6.
[8] J. Bedi and D. Toshniwal, “Empirical mode decomposition based deep learning for electricity demand forecasting,” IEEE Access, vol. 6, pp. 49144–49156, 2018.
[9] Y. Yaslan and B. Bican, “Empirical mode decomposition based denoising method with support vector regression for time series prediction: A case study for electricity load forecasting,” Measurement, vol. 103, pp. 52–61, Jun. 2017.
[10] X. Qiu, Y. Ren, P. N. Suganthan, and G. A. J. Amaratunga, “Empirical mode decomposition based ensemble deep learning for load demand time series forecasting,” Appl. Soft Comput., vol. 54, pp. 246–255, May 2017.
[11] M. R. Haq and Z. Ni, “A new hybrid model for short-term electricity load forecasting,” IEEE Access, vol. 7, pp. 125413–125423, 2019.
[12] Z. Wu and N. E. Huang, “Ensemble empirical mode decomposition: A noise-assisted data analysis method,” Adv. Adapt. Data Anal., vol. 1, no. 1, pp. 1–41, Jan. 2009.
[13] A. Rahman, V. Srikumar, and A. D. Smith, “Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks,” Appl. Energy, vol. 212, pp. 372–385, Feb. 2018.
[14] M. K. Shehzad, L. Rose, and M. Assaad, “Dealing with CSI compression to reduce losses and overhead: An artificial intelligence approach,” 2021, arXiv:2104.00189. [Online]. Available: http://arxiv.org/abs/2104.00189
[15] U. Ugurlu, I. Oksuz, and O. Tas, “Electricity price forecasting using recurrent neural networks,” Energies, vol. 11, no. 5, p. 1255, May 2018.
[16] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[17] L. Jiang and G. Hu, “Day-ahead price forecasting for electricity market using long-short term memory recurrent neural network,” in Proc. 15th Int. Conf. Control, Autom., Robot. Vis. (ICARCV), Nov. 2018, pp. 949–954.
[18] S. Zhou, L. Zhou, M. Mao, H.-M. Tai, and Y. Wan, “An optimized heterogeneous structure LSTM network for electricity price forecasting,” IEEE Access, vol. 7, pp. 108161–108173, 2019.
[19] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, “Short-term residential load forecasting based on LSTM recurrent neural network,” IEEE Trans. Smart Grid, vol. 10, no. 1, pp. 841–851, Jan. 2019.
[20] Y. Zhu, R. Dai, G. Liu, Z. Wang, and S. Lu, “Power market price forecasting via deep learning,” in Proc. 44th Annu. Conf. IEEE Ind. Electron. Soc. (IECON), Oct. 2018, pp. 4935–4939.
[21] S. Hassan, A. Khosravi, J. Jaafar, and M. Q. Raza, “Electricity load and price forecasting with influential factors in a deregulated power industry,” in Proc. 9th Int. Conf. Syst. Syst. Eng. (SOSE), Jun. 2014, pp. 79–84.
[22] R. Weron, “Electricity price forecasting: A review of the state-of-the-art with a look into the future,” Int. J. Forecasting, vol. 30, no. 4, pp. 1030–1081, Oct. 2014.
[23] W. Kong, Z. Y. Dong, D. J. Hill, F. Luo, and Y. Xu, “Short-term residential load forecasting based on resident behaviour learning,” IEEE Trans. Power Syst., vol. 33, no. 1, pp. 1087–1088, Jan. 2018.
[24] S. Siami-Namini, N. Tavakoli, and A. S. Namin, “The performance of LSTM and BiLSTM in forecasting time series,” in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2019, pp. 3285–3292.
[25] W. Wu, W. Liao, J. Miao, and G. Du, “Using gated recurrent unit network to forecast short-term load considering impact of electricity price,” Energy Procedia, vol. 158, pp. 3369–3374, Feb. 2019.
[26] M. Dong and L. Grumbach, “A hybrid distribution feeder long-term load forecasting method based on sequence prediction,” IEEE Trans. Smart Grid, vol. 11, no. 1, pp. 470–482, Jan. 2020.
[27] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, Dec. 2006.
[28] M. Akhtaruzzaman, M. K. Hasan, S. R. Kabir, S. N. H. S. Abdullah, M. J. Sadeq, and E. Hossain, “HSIC bottleneck based distributed deep learning model for load forecasting in smart grid with a comprehensive survey,” IEEE Access, vol. 8, pp. 222977–223008, 2020.
[29] Ö. F. Ertugrul, “Forecasting electricity load by a novel recurrent extreme learning machines approach,” Int. J. Electr. Power Energy Syst., vol. 78, pp. 429–435, Jun. 2016.
[30] S. Li, L. Goel, and P. Wang, “An ensemble approach for short-term load forecasting by extreme learning machine,” Appl. Energy, vol. 170, pp. 22–29, May 2016.
[31] K. Amarasinghe, D. L. Marino, and M. Manic, “Deep neural networks for energy load forecasting,” in Proc. IEEE 26th Int. Symp. Ind. Electron. (ISIE), Jun. 2017, pp. 1483–1488.
[32] M. Alhussein, K. Aurangzeb, and S. I. Haider, “Hybrid CNN-LSTM model for short-term individual household load forecasting,” IEEE Access, vol. 8, pp. 180544–180557, 2020.
[33] F. U. M. Ullah, A. Ullah, I. U. Haq, S. Rho, and S. W. Baik, “Short-term prediction of residential power energy consumption via CNN and multi-layer Bi-directional LSTM networks,” IEEE Access, vol. 8, pp. 123369–123380, 2020.
[34] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” 2018, arXiv:1803.01271. [Online]. Available: http://arxiv.org/abs/1803.01271
[35] J. Chen, D. Chen, and G. Liu, “Using temporal convolution network for remaining useful lifetime prediction,” Eng. Rep., vol. 3, no. 3, Mar. 2021, Art. no. e12305.
[36] P. Lara-Benítez, M. Carranza-García, J. M. Luna-Romera, and J. C. Riquelme, “Temporal convolutional networks applied to energy-related time series forecasting,” Appl. Sci., vol. 10, no. 7, p. 2322, Mar. 2020.
[37] A. Alshejari, V. S. Kodogiannis, and S. Leonidis, “Development of neurofuzzy architectures for electricity price forecasting,” Energies, vol. 13, no. 5, p. 1209, Mar. 2020.
[38] Azzam-ul-Asar, S. R. U. Hassnain, and A. Khan, “Short term load forecasting using particle swarm optimization based ANN approach,” in Proc. Int. Joint Conf. Neural Netw., Aug. 2007, pp. 1476–1481.
[39] H. M. I. Pousinho, V. M. F. Mendes, and J. P. S. Catalão, “A hybrid PSO–ANFIS approach for short-term wind power prediction in Portugal,” Energy Convers. Manage., vol. 52, no. 1, pp. 397–402, Jan. 2011.
[40] L. Wang, Z. Zhang, and J. Chen, “Short-term electricity price forecasting with stacked denoising autoencoders,” IEEE Trans. Power Syst., vol. 32, no. 4, pp. 2673–2681, Jul. 2017.
[41] S.-Y. Shih, F.-K. Sun, and H.-Y. Lee, “Temporal pattern attention for multivariate time series forecasting,” Mach. Learn., vol. 108, nos. 8–9, pp. 1421–1441, Sep. 2019.
[42] M. Jurasovic, E. Franklin, M. Negnevitsky, and P. Scott, “Day ahead load forecasting for the modern distribution network—A tasmanian case study,” in Proc. Australas. Universities Power Eng. Conf. (AUPEC), Nov. 2018, pp. 1–6.
[43] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014, arXiv:1409.0473. [Online]. Available: http://arxiv.org/abs/1409.0473
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017, arXiv:1706.03762. [Online]. Available: http://arxiv.org/abs/1706.03762
[45] B. Lim, S. O. Arik, N. Loeff, and T. Pfister, “Temporal fusion transformers for interpretable multi-horizon time series forecasting,” 2019, arXiv:1912.09363. [Online]. Available: http://arxiv.org/abs/1912.09363
[46] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Trans. ASME, J. Basic Eng., vol. 82D, pp. 35–45, Mar. 1960.
[47] M. K. Shehzad, L. Rose, and M. Assaad, “A novel algorithm to report CSI in MIMO-based wireless networks,” 2021, arXiv:2104.00200. [Online]. Available: http://arxiv.org/abs/2104.00200
[48] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” 2015, arXiv:1508.04025. [Online]. Available: http://arxiv.org/abs/1508.04025
[49] A. Motamedi, H. Zareipour, and W. D. Rosehart, “Electricity price and demand forecasting in smart grids,” IEEE Trans. Smart Grid, vol. 3, no. 2, pp. 664–674, Jun. 2012.
[50] ISO New England Energy, Load, and Demand Reports. Accessed: May 28, 2021. [Online]. Available: https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/net-ener-peak-load
[51] G. V. Welie, “The rapid transformation of new England’s power system and implications for wholesale electricity markets,” presented at the Harvard Kennedy School, Cambridge, MA, USA, Apr. 8, 2019. Accessed: May 28, 2021. [Online]. Available: https://projects.iq.harvard.edu/energyconsortium/event/rapid-transformation-new-englands-power-system-and-implications-wholesale
[52] D. Micci-Barreca, “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems,” ACM SIGKDD Explor. Newslett., vol. 3, no. 1, pp. 27–32, Jul. 2001.

MUHAMMAD SHAHZAD YOUNIS received the bachelor’s degree from the National University of Sciences and Technology, Islamabad, Pakistan, in 2002, the master’s degree from the University of Engineering and Technology, Taxila, Pakistan, in 2005, and the Ph.D. degree from Universiti Teknologi PETRONAS, Perak, Malaysia, in 2009. He is currently working as an Assistant Professor at the Department of Electrical Engineering, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST). He has published more than 30 articles in domestic and international journals and conferences. His research interests include deep learning, statistical signal processing, adaptive filters, convex optimization, biomedical signal processing, wireless communication modeling, smart grid, and digital signal processing.