The document presents a novel hybrid deep learning framework combining bidirectional long short-term memory (BiLSTM) and multi-head self-attention mechanisms for accurate day-ahead forecasting of locational marginal price (LMP) and system load in electricity markets. It addresses the complexities of load and price forecasting by utilizing ensemble empirical mode decomposition (EEMD) to extract hidden features from time series data, while also providing interpretability of the model's inner workings. The proposed method outperforms existing forecasting techniques, demonstrating significant improvements in accuracy and insights into the influencing factors of electricity demand and pricing.


Received April 11, 2021, accepted May 17, 2021, date of publication June 3, 2021, date of current version June 21, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3086039

Multi-Horizon Electricity Load and Price Forecasting Using an Interpretable Multi-Head Self-Attention and EEMD-Based Framework
MUHAMMAD FURQAN AZAM AND MUHAMMAD SHAHZAD YOUNIS
School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
Corresponding author: Muhammad Shahzad Younis ([email protected])

ABSTRACT Accurate system marginal price and load forecasts play a pivotal role in economic power dispatch, system reliability and planning. Price forecasting helps electricity buyers and sellers in an energy market to make effective decisions when preparing their bids and making bilateral contracts. Despite considerable research work in this domain, load and price forecasting remains a complicated task. Various uncertain elements contribute to electricity price and demand volatility, such as changes in weather conditions, outages of large power plants, fuel cost, complex bidding strategies and transmission congestion in the power system. To deal with such difficulties, we propose a novel hybrid deep learning method based upon bidirectional long short-term memory (BiLSTM) and multi-head self-attention mechanisms that can accurately forecast locational marginal price (LMP) and system load on a day-ahead basis. Additionally, ensemble empirical mode decomposition (EEMD), a data-driven algorithm, is used for the extraction of hidden features from the load and price time series. In addition, an intuitive understanding of how the proposed model works under the hood is demonstrated using different interpretability use cases. The performance of the presented method is comprehensively compared with existing well-known techniques applied for short-term electricity load and price forecasting. The proposed method produces considerably more accurate results than existing benchmarks.

INDEX TERMS Deep learning, energy markets, ensemble empirical mode decomposition, multi-head self-
attention, short-term load and price forecasting, Kalman smoothing, model interpretability.

The associate editor coordinating the review of this manuscript and approving it for publication was Seyyed Ali Pourmousavi Kani.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

M. F. Azam, M. S. Younis: Multi-Horizon Electricity Load and Price Forecasting Using Interpretable Multi-Head Self-Attention

VOLUME 9, 2021

I. INTRODUCTION
Over the last couple of decades, numerous countries have introduced deregulated energy markets to address the problem of ever-increasing electricity prices through competition. In a deregulated energy market, electricity is traded just like a normal commodity using different contracts. The wholesale energy market starts with power plants or generators, which, after the necessary approval from an independent system operator (ISO), are connected to the electric grid for electricity production. The electricity produced by power plants is bought by an entity that will often, in turn, resell that power to meet consumer demand. The electricity pool and bilateral trading serve as two distinct energy market designs. The pool is centered around promoting competition among electricity generators, whereas bilateral trading is more targeted at encouraging greater interactivity between the supply and demand sides. In most countries, different variations of these two models exist side by side. The transition towards low-carbon electricity production is making energy markets more complex. Electricity markets need to be remodelled in a way that ensures affordable electricity for both households and industries.

Load and price forecasting are two key components of modern energy markets. Accurate and efficient load forecasting can help participants to better plan their generation schedules, optimize spinning reserves and ensure system stability [1]. In energy markets, electricity prices are described in terms of the Locational Marginal Price (LMP). The LMP comprises three components: per-unit generation cost, congestion cost and transmission losses. Nowadays, price forecasting is also becoming more important for various market participants intending to plan their bids in the day-ahead energy markets for better risk management and profit maximization. As specified by a research study in [2], a 1% improvement in load forecasting error can save up to $300,000 annually for a 1 gigawatt (GW) peak load. If the price forecast is integrated, the savings can jump to $600,000 annually as a ballpark estimate. In developed energy markets, load-serving entities respond to wholesale market price signals by using the Demand Response (DR) program. Electricity consumers who participate in the DR program adapt their usage patterns with respect to electricity prices in exchange for financial incentives in the form of reduced electricity bills. The effectiveness of the DR program in a smart grid is based on precise load and price forecasts in a dynamic environment. In the proper functioning of electric utilities, forecasting is applied to control actions and decisions like unit commitment, real-time dispatch, fuel scheduling, etc. Large errors in forecasting contribute to increased operating costs. The sort of sequential data together with the underlying context is the significant element that impacts the performance of the employed forecasting techniques.

A wide range of methods, varying in complexity and prediction procedures, have been introduced for improving forecasting accuracy. On the basis of the time period, forecasting can be classified as short, medium and long term. Short-term load forecasting (STLF) is based upon intra-day and day-ahead power system operations. It is mainly applicable in unit commitment, real-time supply-demand management, optimal load flow, demand response programs and preparing bidding strategies. Medium-term load forecasting (MTLF) is primarily concerned with developing maintenance schedules and making fuel import decisions. Long-term load forecasting (LTLF) is mainly utilized in power generation and transmission system planning. Because of the ever-increasing complexity of the power system, forecasting load and price with maximum accuracy and minimum computational effort is still an open research problem. Over the years, various methods of electricity load and price forecasting have been suggested by scholars across the globe. Electricity price forecasting is a relatively less investigated topic in comparison to load forecasting. In existing research, load and price forecasting are considered as two separate problems. However, electricity load and price are mutually intertwined and have a direct relationship, that is, a change in demand affects the electricity price.

Among present-day forecasting techniques, artificial neural networks (ANN) are regarded as more advanced and revolutionary. However, existing deep learning (DL) research is mostly based on black-box techniques and does not give much insight into how different factors affect forecasting accuracy. Therefore, the main focus of this paper is to address the challenges of multi-horizon load and price forecasting. We propose a unique sequence-to-sequence (S2S) framework based upon bidirectional long short-term memory (BiLSTM) and Multi-Head Self-Attention (MHSA) mechanisms, which can learn internal representations from sequential data for accurate prediction. Also, special consideration is given to addressing the black box problem in DL by providing insights about the internal working of the model. The notable contributions of this research work are: (1) development of a hybrid DL model that can be used for accurate day-ahead multi-horizon load and price forecasting in a liberalised energy market; (2) use of signal processing techniques to capture hidden features from complex electricity time series data; (3) identification of globally important variables and significant events or regimes within the data using a novel interpretable strategy; (4) performance analysis of the suggested technique and its comparison with existing popular DL models.

The remainder of the paper is organized as follows. Section II describes the existing literature on various time series forecasting methods. In Section III, we formulate a novel DL framework for the time series forecasting task. Section IV deals with the implementation and performance evaluation. Section V presents an analysis of different interpretability use cases. Finally, Section VI presents the conclusions and future work directions.

II. RELATED WORK
Various research studies have been introduced in the realm of short-term load and price forecasting. As a whole, electricity forecasting models may be grouped into two classes, namely classical time series models and computational intelligence models. Classical time series models such as statistical and regression techniques are mostly efficacious for predicting stationary or linear time series data. Computational intelligence methods, among them ANN, Evolutionary Algorithm (EA) and fuzzy modelling, are trained on the data, and the underlying structure is found after training. Regression-based methods are popular in electricity price and load forecasting tasks and are used for modelling the linear relationship between a target variable $y$ and multiple independent variables $x_1, x_2, \ldots, x_n$. Auto-Regressive Integrated Moving Average (ARIMA) is a commonly employed regression technique that is specifically designed to cater for data that display non-stationarity. The ARIMA method combines the autoregression (AR), integration (I) and moving average (MA) components. In the deregulated energy market, ARIMA-based techniques have been widely used for price forecasting [3] with good accuracy. A study presented in [4] uses a modified ARIMA method for short-term prediction of hourly loads on weekdays, weekends and public holidays. It uses an operator estimate as an initial forecast and combines it with temperature and load data parameters to obtain better forecasting accuracy.

Classical time series methods, including ARIMA, are mainly effective on univariate data and most of the time give a simplified solution. They are not robust to fluctuations and disturbances. Moreover, they do not work for categorical feature data. Many research studies on electricity time series forecasting focus on building prediction strategies using classical machine learning methods like Support Vector Regression (SVR). SVR can come up with a more generalized solution to the problem than the traditional regression approaches. However, the drawback of SVR is that it can predict just a one-step-ahead value by default. To solve this, the authors in [5] presented a direct strategy of using multiple SVR models in parallel to predict the next 24 hours of load simultaneously. This approach can be computationally intensive, taking into account the number of parallel models.

Real-world time series signals are complex and can have various causes. Each of these causes may take place at particular periods or frequencies. Time series decomposition can provide useful insights at these different time intervals and frequencies for accurate forecasting. In order to enhance prediction performance, hybrid methods built on signal decomposition techniques have gained great interest recently. Wavelet Transform (WT) and Empirical Mode Decomposition (EMD) are two broadly used time series decomposition techniques, which decompose the signal into distinct frequency components and extract hidden features of the signal to discover the fluctuations. The WT method decomposes a sequence into its constituent parts using a predefined mother wavelet. The authors in [6] utilize WT for decomposing load time series into different frequency components and use SVR for forecasting. However, the WT method has some detection constraints. The decomposition deteriorates under noisy scenarios and the mother wavelet has a non-adaptive nature [7]. In comparison with other signal processing algorithms, EMD is a more robust technique and has proven to be more feasible in different domains for analyzing time series signals that exhibit non-linearity and non-stationarity. EMD breaks down a temporal series signal without leaving the time domain, which helps to preserve the physical meaning of the data. EMD employs a sifting process to decompose a time series signal into various frequency elements termed Intrinsic Mode Functions (IMFs), in addition to a residue component.

In the present literature, most of the approaches [8], [9] combine the EMD algorithm with SVR and DL techniques to enhance the predictive power of the existing model. These hybrid strategies are based upon the concept of "divide and conquer". A divide and conquer approach iteratively splits a problem into two or more sub-problems of a similar type until these become simple enough to be solved directly. In the end, these simpler solutions are aggregated to provide the solution of the original problem [10], [11]. A considerable downside of standard EMD is the common occurrence of the mode mixing problem. In mode mixing, individual IMFs contain multiple local frequencies, which is often the consequence of signal intermittency and noise. To overcome this issue, an ensemble empirical mode decomposition (EEMD) technique is introduced in [12]. The EEMD method adds a white noise series to the original signal and decomposes the resulting sequence into distinct IMFs using the sifting process.

Nowadays, DL-based algorithms are the prominent approaches for tackling time series forecasting problems. Ever-growing data accessibility and computing power have made DL an essential part of advanced time series prediction methods. The recurrent neural network (RNN) is a class of DL models that adds built-in support for input data comprised of a sequence of observations and is better suited to modeling problems such as temporal series analysis and forecasting [13], [14]. In practical scenarios, when the gradient is passed back through many timesteps, it tends to vanish or explode, which results in the vanishing or exploding gradient problems of standard RNNs [15], [16]. Because of this, it can be challenging to train standard RNNs for complex problems that involve learning extended temporal dependencies. Long short-term memory (LSTM) networks are a variant of RNN which can learn longer-term dependencies and thus can tackle issues like vanishing and exploding gradients. Multiple forecasting frameworks premised around LSTM have been proposed for electricity price forecasting [17], [18] and load forecasting [19]. In [20], the authors investigated the performance of an LSTM network in comparison to support vector machines for different forecasting horizons on an electricity price dataset. The study in [21] focuses on analyzing the key factors that influence electricity load and price forecasting. The results show that exogenous variables like weather and hour of the day have a meaningful influence on model performance.

Electricity price time series at daily, weekly and monthly intervals exhibit seasonality along with sudden, momentary and normally unexpected price fluctuations [22]. Such challenges have motivated researchers to concentrate their endeavors on the development of improved electricity price forecasting methods. Moreover, due to variation in household activities, discrete domestic loads continuously affect the overall system demand profile. An LSTM-based approach utilizing individual resident behaviors is presented in [23], which takes household appliance consumption patterns into consideration. The authors in [18] presented an optimized EEMD-LSTM hybrid prediction model that amply utilizes the hidden information inside the price dataset for better prediction. BiLSTM is an extension of the described LSTM model which processes the input data in two ways: once in the forward direction (from past to future) and again in the reverse direction (from future to past). In [24], the authors compared BiLSTM with regular LSTM and reported improved performance of BiLSTM over the standard unidirectional LSTM model. The Gated Recurrent Unit (GRU) is a newer generation of RNN which is almost identical to LSTM. A research study in [25] investigated the impact of electricity price on STLF using GRU. In comparison to LSTM, the GRU gives better computational performance with equal or higher accuracy. In [26], the GRU and LSTM methods are integrated for the distribution feeder's LTLF. The suggested approach demonstrated superior performance in a practical study of West Canada's urban grid in comparison to regression methods and feed-forward neural networks.

To address the slow training speed problem related to ANNs, the authors in [27] introduced a novel learning algorithm named Extreme Learning Machines (ELM). A unique characteristic of the standard ELM algorithm is that it has only one large hidden layer, which remains untrained. It has some advantages, such as training the network more quickly and reduced overfitting, but comes with some feature learning constraints [28]. The authors in [29] put forward a novel method named recurrent extreme learning machines, which incorporates ELM with RNN. The presented approach exhibits promising results with much lower training time than traditional back-propagation-based feed-forward neural network techniques. The study in [30] presents a unique ensemble model in which ELM is integrated with WT and partial least squares regression methods for STLF. It uses distinct wavelet combinations to construct an ensemble of independent forecasters. Discrete sub-components obtained from the wavelet decomposition and independent outputs from 24 parallel ELM models are aggregated with the help of the partial least squares regression process to forecast the hourly load of the following day.

The authors in [31], [32] considered a convolutional neural network (CNN)-based approach for load forecasting of individual households and reported better performance in comparison to other DL methods. Intelligent forecasting techniques based on hybrid CNN-LSTM methods convincingly outperform baseline CNN or LSTM architectures across a wide range of forecasting tasks [33]. Hybrid CNN-LSTM methods utilize convolutional layers for extracting useful feature information and learning the core structure of time series data, together with the benefits of LSTM layers for capturing shorter- and longer-term dependencies. In recent years, Temporal Convolutional Networks (TCN), a special CNN variant, have gained traction due to their distinctive structure that makes them suitable for handling sequential data. TCNs avert information leakage from future to past by employing causal connections [34] and use one-dimensional dilated convolutions, which enlarge the receptive field by skipping input values with a fixed step [35]. Furthermore, TCNs present various benefits over RNN-based methods, including reduced memory requirements, parallel processing of longer sequences, flexible receptive field size and more stable gradients [36].

Fuzzy neural networks (FNN) combine fuzzy logic with the learning capability of traditional ANNs. In [37], the authors presented an hourly day-ahead short-term price forecasting (STPF) approach by employing a novel neuro-fuzzy modelling technique and tested its practicality on the ISO New England energy market price dataset. In some cases, gradient-based training approaches can become trapped in local minima, making an Evolutionary Algorithm (EA) a promising choice. Particle swarm optimization (PSO) is an EA method built upon the idea of the social interactions of humans. The research study in [38] introduces a neural network whose weights are adjusted through the PSO algorithm for faster training and better model convergence. A hybrid approach that integrates PSO, WT and an adaptable fuzzy-based prediction network has been presented in [39] for predicting Portugal's short-term wind power. The results verify the efficacy of the given approach for the wind power prediction task.

Auto-encoders or S2S models, a class of deep ANN, have been broadly explored for time series prediction tasks as they are capable of discovering complex patterns deep-rooted in the input data. The research study in [40] presents stacked denoising auto-encoders for STPF. It reports improved results for online hourly and day-ahead forecasts over non-deep-learning models like SVR. S2S-based methods have been a hot spot of DL for various research studies. But currently, few attempts have been made to implement these methods for electricity load and price forecasting.

The attention mechanism is a recent trend in DL which helps to understand what part of the historical information is crucial in predicting future behaviors. The core idea behind the attention mechanism is to ignore irrelevant information and focus only on the details that are most relevant to the task at hand, similar to the human brain's attention. Traditional STLF and STPF approaches built upon S2S models treat all input features equally and overlook the idea that individual input features affect the output in different ways, which results in poor forecasting performance, especially if the input sequences are too long. As a more sophisticated approach, methods based on the attention architecture have been employed for different time series forecasting tasks [41], [42]. In [43], the authors used an attention-based RNN encoder-decoder method that outperforms traditional S2S models on machine translation tasks. In 2017, a study presented by the Google team [44] introduced the Transformer framework, built entirely from self-attention layers, to learn text representations. DL-based approaches are generally black boxes as they do not provide much insight into the internal working of the model. The attention mechanism has the advantage of being more interpretable than other DL models, thanks to the attention weights. Recently, attention-based approaches have been adopted for time series with interpretability motivations. In [45], the Google Cloud AI team investigated different interpretability use cases, such as examining variable importance and identifying various significant events within the time series dataset using average attention weights. From a practical viewpoint, these interpretable insights can be beneficial to model developers for further improving the accuracy of the forecasting model.

Currently, there are few research methodologies that forecast both electricity load and price in an efficient manner. Besides, most of the existing approaches do not consider the impact of electricity load on its price. Because load and price are mutually intertwined in an electricity market, forecasting both these quantities should be viewed as a unified problem. Apart from that, most present-day electricity time series forecasting techniques do not offer any explanation of the outcome of the models. The proposed research work undertakes the problem of multi-horizon electricity price and load forecasting in a wholesale energy market. It also attempts to address the black box problem associated with DL methods with the help of different interpretability scenarios for better model transparency and visibility.
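The role of attention weights in interpretability, as discussed in this section, can be illustrated with a small sketch. It is not taken from [43]-[45] or from the proposed model: it assumes a plain dot-product score followed by a softmax, and the names `softmax`, `attend` and `average_weights` as well as the toy vectors are hypothetical. It only shows how weights over past hidden states are formed, how they produce a context vector, and how averaging them over several forecasts gives a rough importance profile.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, hidden_states):
    """Score each hidden state against the query (dot product, an assumption)
    and return the attention weights with the weighted-sum context vector."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


def average_weights(weight_rows):
    """Average attention weights over several forecasts, one row per forecast."""
    count = len(weight_rows)
    return [sum(column) / count for column in zip(*weight_rows)]


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states
    weights, context = attend([2.0, 0.0], states)
    print("weights:", [round(w, 3) for w in weights])
    print("context:", [round(c, 3) for c in context])
```

Hidden states that align with the query receive larger weights, which is what makes the averaged weights usable as a coarse indicator of which past timesteps mattered.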


Algorithm 1 EEMD
1: Add white noise to the signal $X(t)$ to generate $X^i = X + \alpha N^i$, where $N^i$, $i = 1, 2, 3, \ldots, n$, is a white noise series with zero mean and unit variance.
2: For each $i$, decompose $X^i$ using the sifting process to obtain the corresponding intrinsic mode functions $\mathrm{IMF}^i_m$, where $m = 1, 2, 3, \ldots, M$.
3: Repeat the above steps, adding a different white noise series to the original signal each time.
4: Obtain the ensemble means of all corresponding IMFs of the decompositions as the final result: $\mathrm{IMF}_m = \frac{1}{n} \sum_{i=1}^{n} \mathrm{IMF}^i_m$
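The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated simplifications, not the authors' implementation: the envelopes are piecewise linear (EMD implementations commonly use cubic splines), sifting runs for a fixed number of iterations instead of using a convergence criterion, and the names `emd`, `eemd`, `trials`, `alpha` and `sift_iters` are hypothetical.

```python
import math
import random


def local_extrema(x):
    """Return indices of strict interior local maxima and minima of x."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i] < x[i - 1] and x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through x at idx, pinned to both end points."""
    knots = [0] + idx + [len(x) - 1]
    env = []
    for a, b in zip(knots, knots[1:]):
        for t in range(a, b):
            env.append(x[a] + (x[b] - x[a]) * (t - a) / (b - a))
    env.append(x[-1])
    return env


def emd(x, max_imfs=3, sift_iters=10):
    """Simplified EMD: returns (imfs, residue) with exactly max_imfs IMFs."""
    residue = list(x)
    imfs = []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if not maxima or not minima:
            imfs.append([0.0] * len(residue))  # nothing left to extract
            continue
        h = list(residue)
        for _ in range(sift_iters):  # sifting: subtract the mean envelope
            maxima, minima = local_extrema(h)
            if not maxima or not minima:
                break
            upper = envelope(h, maxima)
            lower = envelope(h, minima)
            h = [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]
        imfs.append(h)
        residue = [r - v for r, v in zip(residue, h)]
    return imfs, residue


def eemd(x, trials=20, alpha=0.2, max_imfs=3, seed=0):
    """Ensemble EMD: average the IMFs of several noise-perturbed copies of x."""
    rng = random.Random(seed)
    n = len(x)
    mean_imfs = [[0.0] * n for _ in range(max_imfs)]
    for _ in range(trials):
        # Step 1: X^i = X + alpha * N^i with zero-mean, unit-variance noise
        noisy = [v + alpha * rng.gauss(0.0, 1.0) for v in x]
        # Step 2: decompose the noisy copy by sifting
        imfs, _ = emd(noisy, max_imfs)
        # Step 4: accumulate the ensemble mean of each IMF
        for m in range(max_imfs):
            for t in range(n):
                mean_imfs[m][t] += imfs[m][t] / trials
    return mean_imfs


if __name__ == "__main__":
    signal = [math.sin(2 * math.pi * t / 50) + 0.5 * math.sin(2 * math.pi * t / 5)
              for t in range(200)]
    for m, imf in enumerate(eemd(signal), start=1):
        print(f"IMF{m}: max amplitude {max(abs(v) for v in imf):.3f}")
```

In this sketch the amplitude `alpha` and the ensemble size `trials` play the roles of $\alpha$ and $n$ in Algorithm 1. For a single noisy copy, the IMFs and the residue sum back to that copy by construction.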

the EEMD method is used for the decomposition of sys-


FIGURE 1. Multi-horizon time series forecasting generally includes an tem load and LMP target variables. An illustration of signal
intricate set of inputs – comprising historical observed independent
feature inputs, known future information (e.g. holiday dates) and past decomposition using EEMD is shown in Figure 2.
target values.

B. KALMAN SMOOTHING
III. MULTI-HORIZON LOAD AND PRICE FORECASTING In multi-horizon time series forecasting, the presence of noisy
USING HYBRID DEEP LEARNING APPROACH and messy data can hurt the final predictions. Denoising and
In recent times, DL techniques are utilized extensively for smoothing minimize the noise element in a time series data
multi horizon time series forecasting tasks because of their and generally allow us to see patterns or trends more clearly.
capacity to extract complex patterns and to better express the Kalman Filters [46] is probably one of the most famous
intricate relationship of electricity load or price with other and widely used estimation algorithm for noisy systems that
exogenous variables. Incorporating price forecasts along with continuously change their states [47]. Kalman filters esti-
load forecasts can further improve decision-making and mate the current state of a noisy system, based on previous
return-on-investments. Multi-horizon time series forecasting observations using a two-step ‘‘predict-update’’ process. It is
estimates the target variable at distinct future timesteps across a recursive data processing algorithm in which once new
the forecasting horizon. A multi-horizon time series predic- information arrives, it updates predictions. Kalman smooth-
tion involves forecasting the future k values [yn+1 , . . . , yn+k ] ing is a denoising method, which utilizes both past and future
of a historic time series [y1 , . . . , yn ] comprised of n obser- observations to make estimates. It is a forward-backward pass
vation, where k > 1 indicates the forecast horizon. algorithm. In forward pass step, algorithm computes yt+1|t
Multi-horizon forecast use cases generally have accessibility and yt+1|t+1 for 0 ≤ t < T, similar to Kalman filtering.
to a broad range of input data, as illustrated in Figure 1, While in the backward pass, it computes yt|T for 0 ≤ t <
comprising known details regarding the future and historical T. More specifically, the smoother uses all the available data
time series. Moreover, the heterogeneity among various data to re-estimate the prediction. Smoothness can be controlled
inputs further increases the complexity of multi-horizon time by adjusting the covariance matrix of noise σ 2 . In the pro-
series forecasting tasks, especially as the forecast horizon posed method, we applied the Kalman smoothing technique
increases. to denoise the raw time series data which helps to reduce
the presence of noise. This choice proved to be favourable
A. EEMD in terms of forecasting accuracy. Also, it is light on memory
The standard EMD algorithm may encounter problems of and execution time is fast.
mode mixing in cases where large amplitude signals include
fluctuations with higher frequencies. In the mode mixing C. PROPOSED MHSA-S2S ALGORITHM
phenomenon, different frequencies coexist in the same IMF. Multi-horizon time series forecasting can be framed as S2S
Moreover, if the first decomposed component is defective, or many-to-many estimation problem because it accepts
the later one will show the same distorted results as well a sequence as input and outputs a sequence prediction.
and may obtain incorrect IMFs. To resolve this difficulty, S2S models can handle variable-length input and output
a noise-assisted signal analysis technique called ensemble sequences and they are good at learning fine-grained details
empirical mode decomposition (EEMD) [12] is applied for from the data. The main drawback of traditional S2S mod-
the extraction of hidden features from time series data. In the els is that if the encoder somehow made a bad sum-
EEMD method, a uniformly distributed white noise series is mary of the input data then the decoder translation will be
added to original signal before the decomposition process to bad as well. In the case of longer time series sequences,
mitigate the effect of mode mixing. In the presented work, this problem is more evident which makes it difficult for

85922 VOLUME 9, 2021


M. F. Azam, M. S. Younis: Multi-Horizon Electricity Load and Price Forecasting Using Interpretable Multi-Head Self-Attention

models to perform forecasting accurately. In addition, traditional S2S models treat all the values in the input sequence equally while creating a context vector. However, an input sequence may also contain trivial feature values, so weighting every value equally is not a good idea. Bahdanau et al. [43] and Luong et al. [48] both employed attention in encoder-decoder architectures to maximize the utility of the available context and enhance accuracy in language translation tasks. It is therefore reasonable to anticipate similar gains for the time series forecasting problem. To address the multi-horizon forecasting problem, we propose the MHSA-S2S architecture shown in Figure 3. The proposed model comprises three parts: a BiLSTM-based encoder network with an attention-based context layer, a BiLSTM-based decoder network, and an interpretable multi-head self-attention mechanism. The interpretable multi-head self-attention mechanism highlights key features of the decoder sequence using attention weights. Lastly, a fully connected dense layer forecasts the load or price values at multiple time steps. A detailed explanation of the proposed method is presented below.

FIGURE 2. Example of IMFs obtained after applying EEMD over electricity load data.

1) ENCODER NETWORK
The encoder consists of two stacked BiLSTM layers and an attention layer. BiLSTM layers are capable of capturing relevant temporal information from both the past and future context at the same time. The attention layer can pick significant encoder hidden states across all time steps, which helps to improve the representation ability of the model and increase the multi-step forecasting accuracy. It acts as an interface between encoder and decoder. In the encoding stage, the input sequences are delivered to the network in a sequential manner. The encoder, which has an initial hidden state $h_0$, accepts an input sequence $x_1, \ldots, x_T$ of arbitrary length $T$ and generates a sequence of encoded information, $h_1, \ldots, h_T$. As the network reads the input sequence values, the cell state and the hidden state are updated. Instead of traditional LSTM layers, we employ BiLSTM layers, which process sequential data concurrently in both forward and reverse directions with the help of two interconnected hidden layers. At each time step, a BiLSTM layer updates its hidden state $h_t$ as it reads the input $x_t$ as follows:

$$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}), \tag{1}$$

$$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t-1}), \tag{2}$$

$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t, \tag{3}$$

where $\overrightarrow{h}_t$ is the output of the forward LSTM, $\overleftarrow{h}_t$ is the output of the backward LSTM, and combining the two gives the final output of the BiLSTM. The attention layer attends to these encoder hidden states and generates the temporal context vector by calculating a weighted aggregate of the intermediate hidden states. The temporal context vector uses the encoder hidden states and the decoder hidden states together with the alignment vector. It helps to determine which hidden states in the encoder and decoder networks are most related to one another. The temporal context vector of the attention layer is represented by $h_a$:

$$h_a = \sum_{j=1}^{T} \alpha_{ij} h_j, \tag{4}$$

where the attention weights $\alpha_{ij}$ tell how much emphasis to put on the ith input frame at timestep t to estimate the jth value of the output vector. To get the attention weights, a score for every encoder hidden state is obtained using a score function. The score function is a dot product between the encoder and decoder hidden states. Subsequently, all the scores are passed to the softmax layer to obtain attention weights in the form of a probability distribution:

$$\alpha_{ij} = \frac{\exp(e_{i,t})}{\sum_{w=1}^{T} \exp(e_{i,w})}. \tag{5}$$

Note that the encoder time index is written $j$ in (4) and $t$ in (5) and (6). An alignment score, on the basis of which the encoder hidden states are aligned with the decoder hidden states, is computed. The alignment score uses all encoder hidden states and the decoder hidden state of the previous time step:

$$e_{i,t} = \mathrm{align}(s_{i-1}, h_t), \tag{6}$$

where $e_{i,t}$ denotes the alignment score between the decoder layer's hidden state $s_{i-1}$ and the encoder layer's hidden state $h_t$. The attention weights for less significant features will be small in the calculation of the temporal context vector. Because BiLSTM layers converge slowly, residual or skip connections are implemented between subsequent BiLSTM layers for faster gradient propagation.
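The attention step of Eqs. (4) to (6) can be illustrated with a minimal NumPy sketch. It assumes the dot-product score function stated above; the function name `temporal_context`, the 168-step window and the hidden size of 8 are hypothetical choices made only for this illustration and are not taken from the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector (Eq. 5)."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def temporal_context(enc_states, prev_dec_state):
    """Return the context vector h_a and the attention weights.

    enc_states:     (T, d) encoder hidden states h_1..h_T
    prev_dec_state: (d,)   previous decoder hidden state s_{i-1}
    """
    scores = enc_states @ prev_dec_state   # dot-product alignment scores, Eq. (6)
    weights = softmax(scores)              # attention weights, Eq. (5)
    context = weights @ enc_states         # weighted sum of encoder states, Eq. (4)
    return context, weights

# Illustrative random inputs (assumed sizes, not the paper's data).
rng = np.random.default_rng(0)
H = rng.normal(size=(168, 8))
s_prev = rng.normal(size=8)
h_a, alpha = temporal_context(H, s_prev)
```

Because the weights come from a softmax, they are non-negative and sum to one, so encoder states with low alignment scores contribute little to `h_a`, as described in the text.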


FIGURE 3. Flow diagram of proposed MHSA-S2S framework.
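Both the encoder and the decoder in this framework rely on the bidirectional encoding of Eqs. (1) to (3). The sketch below illustrates only the forward pass, the backward pass and the concatenation. As a loudly labeled simplification, a plain tanh recurrent cell stands in for the LSTM cell, and the names `rnn_pass` and `bidirectional` as well as all sizes are hypothetical.

```python
import numpy as np

def rnn_pass(x, W_x, W_h, reverse=False):
    """Run a plain tanh recurrent cell (a stand-in for an LSTM cell) over x.

    x: (T, n_in) input sequence; returns (T, n_hid) hidden states,
    stored at the time index of the input that produced them.
    """
    T = x.shape[0]
    h = np.zeros(W_h.shape[0])
    out = np.zeros((T, W_h.shape[0]))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(W_x @ x[t] + W_h @ h)
        out[t] = h
    return out

def bidirectional(x, fwd, bwd):
    """Concatenate forward and backward states at every time step (Eq. 3)."""
    h_fwd = rnn_pass(x, *fwd)                # Eq. (1): reads x_1 .. x_T
    h_bwd = rnn_pass(x, *bwd, reverse=True)  # Eq. (2): reads x_T .. x_1
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Illustrative random weights and inputs (assumed sizes).
rng = np.random.default_rng(1)
n_in, n_hid, T = 3, 4, 6
fwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)))
bwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)))
x = rng.normal(size=(T, n_in))
h = bidirectional(x, fwd, bwd)
```

Each row of `h` therefore carries a summary of the past (first half) and of the future (second half) of the window, which is the property the text attributes to BiLSTM layers.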

2) DECODER NETWORK
At each time step, the decoder network focuses on different parts of the encoder output. A decoder with two stacked BiLSTM layers and skip connections is used for decoding the hidden states output by the encoder. The attention mechanism calculates a separate source context vector at each decoding time step, unlike the traditional S2S model, which uses the same context vector for every decoder hidden state. The decoder network uses the temporal context vector $h_a$ as an initial state and reconstructs the past time series as a target sequence of fixed length. The temporal context vector is combined with the previous decoder output and passed to the decoder BiLSTM cell to produce a new decoder hidden state. Thus, the temporal context vector $c_t$ and the decoder target hidden state $s_t$ at time step $t$ are concatenated to produce the hidden vector $S_t$, which is then passed to the next layer:

$$S_t = \mathrm{concatenate}[s_t, c_t]. \tag{7}$$

The parameters inside the BiLSTM layers are updated by the back-propagation algorithm. Simply stacking several hidden layers quickly becomes computationally expensive, and the deeper the network, the harder it is to train: the further a gradient has to travel, the more prone the network is to vanishing and exploding gradients. One solution for avoiding vanishing or exploding gradients is to use skip or residual connections. Although there is no solid theoretical explanation, in practice long skip connections are very useful in dense prediction tasks. The additional paths provided by skip connections are beneficial for model convergence and for avoiding the vanishing gradient problem in deeper networks. They also provide simpler interconnections if additional complexity is not required.

3) INTERPRETABLE MHSA MECHANISM
To augment the learning capability of the architecture, an interpretable MHSA layer is employed. Compared with the standard attention method, the MHSA mechanism is better at modeling internal context or dependencies between different parts of the sequence. It helps to learn the important relationships between values of a sequence, much as the human brain understands the words in a sentence. The MHSA mechanism takes the data processed by the decoder network as input and passes it to the concluding fully connected layer, which outputs the forecast value. It is designed to further correct the decoder output and to direct the model to look only at relevant frames for an improved prediction. Furthermore, it helps to avoid overlooking crucial information by acting as an interface between the decoder and the output layer. The primary goal of the MHSA mechanism is to achieve superior forecasting results that are at the same time interpretable to end users. The MHSA mechanism allows the input values of the same sequence to interact with one another to determine which of them deserve more attention. On the basis of these interactions and attention scores, concatenated outputs are produced.
A standard self-attention mechanism takes its input in the form of a query (Q), key (K) and value (V). To obtain Q, K and V, each input is multiplied by three sets of small randomly initialized weights, one each for keys, queries and values. We first take the dot product between the query of input 1 and all keys, which results in an attention score against
each key vector. The dot product of these vectors measures how relevant they are to each other. By applying the softmax function across these scores, we obtain a softmax score. Each input's softmax score is multiplied by its respective value to get a weighted vector, and summing all the weighted vectors element-wise yields output 1, which is based on the interaction of input 1's query with all the keys. Likewise, we take the queries of the remaining inputs and perform the same operations to get all the corresponding outputs. It can be overwhelming for a single self-attention mechanism to learn all dependencies between different values of a long sequence. In the MHSA architecture, we apply this same self-attention mechanism using different heads (or brains) and combine the information explored by these individual heads. In this way, we attend not only to different values in a sequence but to different segments of values as well. We divide the decoder output vectors into $H$ chunks and apply self-attention to each chunk, resulting in $H$ context vectors for each decoder output. The final context vector is obtained by concatenating all these $H$ context vectors:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}(H_1, H_2, \ldots, H_i) W^{o}, \tag{8}$$

$$H_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}), \tag{9}$$

where $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the weights for the queries, keys and values of a specific head. Each head focuses on different information in the sequence, and combining this knowledge results in a better contextual representation.
Because each head uses different values, the attention weights alone would not be sufficient to determine the importance of a particular feature. In view of this fact, we adapt the MHSA mechanism to share values across heads and to utilize additive summation of all heads. We borrow the approach used in [45]:

$$\mathrm{MHSA}_{\mathrm{interpretable}}(Q, K, V) = \hat{H} \cdot \hat{W}_H, \tag{10}$$

$$\hat{H} = \hat{A}(Q, K)\, V \cdot W_V, \tag{11}$$

where the value weights shared across all heads are denoted by $W_V$. Each head can pick up different temporal patterns while looking at a shared group of feature vectors. This may be described as a standard ensemble over the attention weights obtained by each head, gathered in a combined matrix $\hat{A}(Q, K)$. This approach increases the representation capacity of the MHSA mechanism in an efficient way. Masking is applied to the MHSA layer to preserve the causal information flow. The MHSA layer assigns variable attention weights to each of the decoder's output features and identifies important features by the magnitude of their attention weights: important features have higher attention weights. The attention weight patterns learned by the model are used to explain which previous time steps are most significant for the model's decision-making. By analyzing such patterns, we can identify significant features in the data and learn about the inner working of the model.

FIGURE 4. System load time series.

FIGURE 5. Locational marginal price time series.

IV. EXPERIMENTAL SETUP AND RESULTS
In this section, we analyze the forecasting capability of the novel MHSA-S2S model using the ISO New England (NE) load and price datasets. The predictive performance of the proposed framework is validated by comparing it with well-known advanced DL models.

A. CHARACTERISTICS OF ELECTRICITY LOAD AND PRICE DATA
In this paper, ISO-NE's load and price time series datasets are used as a case study. ISO-NE operates the 32 GW bulk electricity generation and transmission system of the New England (NE) region of the United States. The datasets include hourly system load and LMP values together with temperature and dew point information. These public datasets have already been used in various case studies [21], [49]. The system load, with its seasonal patterns over the years, is plotted in Figure 4. In the ISO-NE energy market, natural gas is the primary fuel for almost 50% of the generators, and wholesale electricity prices are therefore directly linked to it [50], [51]. The price of this single fuel decides the energy market price most of the time. A large number of price fluctuations, mainly due to spikes in natural gas prices during winter seasons, are evident in Figure 5. The scatter plot in Figure 6 illustrates a high positive correlation between the system load and the LMP values of the same time period. As electricity consumption rises during peak hours, the price also increases during this peak demand interval. Figure 7 demonstrates a persistent correlation between the LMP and average natural gas prices: increased natural gas prices result in higher power tariffs.

B. DATA PREPROCESSING AND FEATURE ENGINEERING
The datasets include various features like overall system load, dry bulb temperature, dew point and day-ahead
LMP etc. with an hourly resolution. To get the most out of the datasets, additional features such as hour of the day, day of the week, month, previous-day average load or LMP, same-hour previous-day load or LMP and previous-week average load or LMP are extracted from the existing feature variables. As natural gas is a primary fuel for NE generators, the monthly average natural gas price is also included in the price dataset. The final sets of input features from the load and price datasets used for training the model are shown in Table 1. Categorical variables in the dataset are handled with the target encoding technique [52]. Target encoding is a Bayesian encoding method that does not increase the dimensions of the dataset. It computes the average of the target variable values for every category and substitutes each category value with this average. The datasets are scaled to ensure that all input values are on the same scale: the mean $\mu$ is subtracted from every data value, and the result is divided by the standard deviation $\sigma$. The standardization of data can be expressed as

$$x' = \frac{x - \mu}{\sigma}. \tag{12}$$

TABLE 1. Set of input features for system load and locational marginal price target variables.

FIGURE 6. Correlation between system load and LMP.

FIGURE 7. Correlation between average natural gas price and electricity LMP.

C. IMPLEMENTATION DETAILS
As described earlier, the attention-based S2S algorithm is introduced to forecast multi-horizon load and price values for a wholesale energy market. We divided the dataset into three subsets, namely a training set for learning, a validation set for tracking the performance of the model and a testing set for evaluation, in proportions of 60%, 20% and 20% respectively. To perform the 24-hour day-ahead load and price forecasts, the model is fed the previous 168 hours of feature values using the sample generation process. The MHSA-S2S model, as seen in Figure 8, has two BiLSTM layers and an attention layer in the encoder network. The Bahdanau attention layer produces a context vector, which is passed as an initial hidden state to the decoder. Identical BiLSTM layers are employed in the decoder network as well, and each layer, with 128 neurons, is followed by a skip or residual connection layer.
The slow convergence of deep neural networks motivated us to add residual connections between hidden layers of the network for faster gradient propagation. Such connections can skip over any surplus or unused parts of the architecture, delivering adaptive depth and network complexity to accommodate different kinds of datasets and scenarios (i.e., the network can adapt to a simpler model if extra complexity is not needed). At first, system load or LMP sequences are decomposed by the EEMD process into several IMFs. These IMFs, along with the raw input features, are passed to the CatBoost algorithm, which selects only the best features on the basis of its feature importance function. The significant features are selected based on how much, on average, the loss value changes when a particular feature value is varied. The selected data is then passed to the Kalman smoothing algorithm for noise removal. Using a custom sample generation process, sequences are transformed into three-dimensional data (i.e., samples, time steps, features). We selected mean squared error (MSE) as the loss function and the tanh activation function for the hidden units of the BiLSTM layers. We used Adam as the optimizer and trained the model for 200 epochs. To balance the number of learnable parameters against quick model training, the MHSA layer employs eight heads. The hyperparameter settings are identical for the system load and LMP forecasting tasks, which indicates that the proposed model generalizes across different kinds of datasets. Open source libraries such as TensorFlow, Keras, scikit-learn, PyEMD and pykalman are used in the implementation of
this framework. The model is trained and evaluated on the data in a rolling fashion, and the predictions are taken in the same manner. Once trained, the model can produce predictions for large batches of data with an hourly resolution. All experiments are conducted on an Intel Core i7 Windows PC with an Nvidia Tesla T4 GPU.

FIGURE 8. Summary of proposed architecture. The input sequences are passed to an attention-based encoder-decoder network that employs BiLSTM layers. Skip connections yield alternative paths to the gradient for faster model convergence. The MHSA layer gives directions to the network based on attention weights for producing a better day-ahead time series forecast.

D. EVALUATION METRICS
In this work, the mean absolute error (MAE) and the root mean square error (RMSE) are used as accuracy measures for the multi-horizon load and price forecasts. These metrics are extensively used to assess model performance on regression tasks. Lower values of MAE and RMSE indicate superior forecasting accuracy. Both evaluation metrics are given by the following equations:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|, \tag{13}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}, \tag{14}$$

where $y_i$ is the true load or price value, $\hat{y}_i$ is the value forecast by the model and $N$ denotes the number of sample values.

E. RESULTS AND PERFORMANCE COMPARISON
This section offers a performance analysis of the proposed method along with a comparison against well-known DL methods, to show its robustness for load and price prediction tasks. The results are compared with five well-known DL models, namely stacked BiLSTM, stacked bidirectional GRU (BiGRU), ensemble network, TCN and standard S2S network.

1) MULTI-HORIZON LOAD FORECASTING
The proposed MHSA-S2S method is evaluated on a testing dataset for the year 2019. The model produces an average MAE score of 0.0233 and an average RMSE score of 0.0322 on the overall out-of-sample data. Figure 9 illustrates how closely the MHSA-S2S predicted values track the true load values, which indicates the high accuracy of this method for the multi-horizon load forecasting task. Moreover, it can be seen that when the trend of the system load values changes suddenly, particularly in the summer season (samples 2207 to 4392), the proposed MHSA-S2S model responds swiftly to the changes in the time series, which considerably improves the overall accuracy.

FIGURE 9. Actual versus MHSA-S2S method predicted system load values.

As electricity consumption patterns change with weather conditions, a season-based 24-hour-ahead load forecast with hourly resolution is performed. Figure 10 presents a comparison of DL-based methods with the proposed MHSA-S2S model for different seasons on the basis of the average MAE score. The suggested model surpasses the other five existing techniques in accuracy across all four seasons. Table 2 summarizes the results attained by all the methods on the testing data in terms of the RMSE metric. The proposed framework results in a lower average RMSE score than the other DL methods for all seasons. Signal processing techniques like EEMD and Kalman smoothing proved to be notable factors in further improving the model accuracy. The proposed model without these techniques still performs better than the existing methods, as indicated by the "MHSA-S2S w/o EEMD" column in Table 2.
Individual out-of-sample 24-hour load forecasting results are plotted in Figure 11. Pictorial comparison of the graphs
in the given subplots shows the excellent prediction performance of the proposed method compared to existing techniques. The summer season load sequence, shown in subplot 11a, rises consistently until 17:00 hours due to increased cooling requirements during the daytime. For the winter season, a clear typical daily load pattern can be observed in subplot 11b, where the load demand is high during the early morning and evening periods due to increased heating requirements. Besides summer, the winter season also serves as a peaking period in the ISO-NE market, as heating is mostly electrified in winter.

TABLE 2. Average RMSE score of various seasons resulted by MHSA-S2S and other five existing DL-based methods for ISO New England day-ahead system load forecast.

FIGURE 10. The average MAE score of 24-hours system load forecasting with respect to various seasons resulted by the MHSA-S2S model and other five existing DL methods.

FIGURE 11. Comparative analysis between the presented MHSA-S2S method and existing well-known DL techniques on 24-hours future system load forecast. From graphs (a) and (b), we notice that the MHSA-S2S method outputs a prediction pattern (red) that nearly corresponds to the true system load values (blue).

FIGURE 12. Actual versus MHSA-S2S method predicted LMP values.

FIGURE 13. The average MAE score of 24-hours LMP forecasting with respect to various seasons resulted from the MHSA-S2S model and the other five existing DL methods.

2) MULTI-HORIZON PRICE FORECASTING
The proposed MHSA-S2S method is evaluated on a testing dataset from summer 2018 to spring 2019. The model produces an average MAE score of 0.0292 and an average RMSE score of 0.0509 on the overall out-of-sample data. As mentioned earlier, the ISO-NE market price data is highly volatile and depends heavily on the natural gas price, which is why the price forecasting errors are marginally higher than the load forecast errors. Moreover, several variables that depend on real-time operating conditions, such as congestion cost and transmission, decisively affect the LMP values. During the cold winter season, natural gas prices soar in NE because of constraints in the natural gas pipeline infrastructure, and natural gas-fired power plants must compete for fuel against heating demand. The increased prices during the winter season (samples 4392 to 6552) result in large price spikes, which can be observed in Figure 12.
We can observe from Figure 13 that the proposed method leads to a lower price prediction error than the other methods in the different seasons. Due to various fuel mixes and extended cold conditions, prices do not follow any systematic trend in winter, which results in a high forecast error during this period for all the methods. The RMSE scores of day-ahead price forecasting for the different seasons are stated in Table 3,
which indicates a lower forecasting error for the MHSA-S2S model in comparison to the other existing methods.

TABLE 3. Average RMSE score of various seasons resulted by MHSA-S2S and other five existing DL-based methods for ISO New England day-ahead LMP forecast.

TABLE 4. Feature variables attention weights magnitude.
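Since the interpretability analysis rests on the attention weights of the interpretable MHSA layer, Eqs. (10) and (11) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the per-head attention matrices are averaged (a scaled form of the additive combination described in Section III), scores use the common $1/\sqrt{d_k}$ scaling, and the function name `interpretable_mhsa`, the weight shapes and all sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_mhsa(X, W_q, W_k, W_v, W_h, causal=True):
    """Self-attention with values shared across heads.

    X:        (T, d)            decoder outputs
    W_q, W_k: (n_heads, d, d_k) per-head query and key weights
    W_v:      (d, d_v)          value weights shared by all heads (W_V)
    W_h:      (d_v, d)          final linear map (W_H)
    """
    T = X.shape[0]
    V = X @ W_v  # one shared value projection for every head
    # Causal mask: a step may not attend to later steps.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1) if causal else None
    heads = []
    for Wq, Wk in zip(W_q, W_k):
        # Assumed scaled dot-product scores for this head.
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
        if mask is not None:
            scores = np.where(mask, -np.inf, scores)
        heads.append(softmax(scores))
    A_hat = np.mean(heads, axis=0)  # combined attention matrix A_hat(Q, K)
    H_hat = A_hat @ V               # Eq. (11)
    return H_hat @ W_h, A_hat       # Eq. (10) and the interpretable weights

# Illustrative random inputs; eight heads as in the implementation details.
rng = np.random.default_rng(2)
T, d, d_k, n_heads = 24, 6, 4, 8
X = rng.normal(size=(T, d))
W_q = rng.normal(size=(n_heads, d, d_k))
W_k = rng.normal(size=(n_heads, d, d_k))
W_v = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
out, A_hat = interpretable_mhsa(X, W_q, W_k, W_v, W_h)
mean_attention = A_hat.mean(axis=0)  # average weight each time step receives
```

Because every head multiplies the same values, the single matrix `A_hat` can be read directly as importance weights; averaging its columns gives a rough per-time-step profile of the kind analyzed in the following section.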

V. MODEL INTERPRETABILITY (A USE CASE)
Interpretability in machine learning is the process of trying to understand the inner working of a model. A lack of interpretability raises a serious question about trust in DL-based methods, particularly in high-risk forecasting sectors such as competitive energy markets, self-driving vehicles, medical care and stock markets. A good model not only provides an accurate prediction but also gives insight into which inputs are driving the results and how the model arrives at a certain conclusion. To learn something from the model, we have to be able to interpret which factors affect the model results the most. The model information, when interpreted accurately, can reveal new correlations and expose potential vulnerabilities. Moreover, the interpretation can be given on the basis of model parameters or in the form of the input features utilized by the model. Despite multiple research studies, understanding and interpreting the behavior of DL models remains a work in progress. The MHSA mechanism in the proposed architecture is a tool that allows us to interpret the general relationships the model has learned. A practical evaluation shows that the proposed framework is capable of supporting different kinds of interpretability scenarios. We demonstrate three interpretability scenarios:
• Analyzing the importance of individual features.
• Visualizing static temporal attention patterns.
• Identifying significant events.
In comparison to other attention-based interpretability approaches proposed in the existing literature, this work is based on a different strategy, inspired by [45], that combines the patterns within the whole dataset, resulting in generic intuitions and insights regarding different temporal elements. We believe that this is the first study that is solely focused on model interpretability for the electricity load and price forecasting problem.

FIGURE 14. Average attention pattern at horizon-1 forecast.

A. ANALYZING THE IMPORTANCE OF INDIVIDUAL FEATURE VARIABLES
The proposed model selects the relevant feature variables based on the learned weights, which are shared across all time steps. Table 4 shows that on average, for any forecast value, the previous time step value of the target variable (system load) is most decisive in generating the final prediction, while the hour of the day in which that value was recorded is considered the second most important variable. Previous values of the target variable are significant, as one would expect, because forecasts are extrapolations of historical observed values.
It should be noted that the variable importance value of a single input feature variable, viewed in isolation, is not very meaningful. For instance, the mean value of 278.21 for the hour variable does not tell us much on its own. What is useful is the difference in variable importance between input features, i.e., the relative value. The relative variable importance shows which variables and associated patterns are considered to be more strongly linked with an accurate model forecast. As an example of an associated pattern, a high relative importance of the hour-of-day variable implies that a 24-hour cycle exists in the given time series data. On the other hand, the time index has less relative importance, which suggests that a pattern of linear increase or decrease is present. This makes sense, as time is always increasing. So, we expect to see generally upward or downward trends in the target variable with respect to time in those cases where time is an important feature variable. In this way, we can say that the model predominantly relies on a handful of important input features that, in an intuitive way, play a key part in the final forecast. Thus, selecting variables based on their importance enables the model to focus its learning capacity on the most crucial features and to ignore the variables with less predictive power.

B. VISUALIZING STATIC TEMPORAL ATTENTION PATTERNS
Time series patterns are not so different from visual patterns. Weights learned by the attention mechanism can
provide insights about the dominant past time steps on the basis of which the model makes its decision. The examination of static temporal patterns like periodicity or seasonality is often crucial for interpreting time-dependent patterns existing in the dataset. A time series attention function is visualized in Figure 14. In the plot shown, the average importance of the variables over time is given as "Mean Attention Weights". We can notice a 24-hour periodic trend, which implies that the hour of the day has a strong association with the predicted value. The most important information is concentrated around the spike area. Likewise, we observe a drop in the magnitude of the attention weights at time indices greater than 150, which indicates that input variables from the distant past are not very dominant when forecasting the target variable. Moreover, the difference in attention magnitude from one hour of the day to the next is evident in Figure 14, which indicates that the model focuses more on the peak demand hour. The plot shows a peak at one specific hour, followed by a gradual decrease in attention magnitude for the subsequent hours.

C. IDENTIFYING SIGNIFICANT EVENTS
Identification of abrupt transitions in temporal patterns could be extremely helpful, as temporary changes may happen owing to major incidents like a peak hour, a generator breakdown or a transmission system fault. Significant events can be detected by measuring the deviation of the attention pattern at each time step from a global "average" attention pattern, as shown in Figure 15. During such event periods, we do not see a uniform distribution of attention weights, which means that the model does not assign equal attention weight to all feature variables. The dist(t) metric indicates the distance between the attention vector at each point and the global average pattern, aggregated over all horizons. In Figure 16, the system load and the attention weights (yellow) are plotted together for the period from 26 July to 01 August 2017. As described earlier, during significant regime or event periods we do not see a uniform distribution of attention weights. The increased magnitude of the attention weights on the sharp downward trend from 27 to 29 July indicates a significant event or regime, as illustrated in Figure 16. Also, the model pays roughly the same attention to all feature variables from 29 July onward, when the system load is steady. From the plots above, we conclude that the model continually changes its behavior between different regimes, putting uniform attention on historical inputs where randomness or volatility is small and placing more focus on sudden transitions over high-volatility intervals, which implies dissimilarities in the temporal dynamics learned.

FIGURE 15. System load versus dist(t) for ISO-NE data. Substantial deviations in attention patterns can clearly be noticed near intervals of abrupt changes corresponding to the peaks seen in the dist(t) curve illustrated by the red line-plot. We used a threshold of dist(t) > 0.3 to show the significant regimes, as shaded in green.

FIGURE 16. Identifying significant and normal events.

VI. CONCLUSION & FUTURE WORKS
Accurate prediction of electricity load and price is essential for improving power system stability, planning, maintenance and energy market operation. In this context, we presented a hybrid DL architecture founded on BiLSTM and an interpretable MHSA mechanism for multi-horizon load and price forecasting tasks. To improve the connectivity between the BiLSTM encoder and decoder, the Bahdanau attention mechanism was added. The MHSA mechanism rectifies the decoder's output and learns long-term relationships across different time steps. In this study, DL techniques are combined with EEMD, a signal decomposition method, to enhance the forecasting accuracy of the model. Residual connections are used to speed up model training, improve convergence and provide adaptive model depth. The empirical results corroborate that the suggested modifications have a considerable impact on improving model accuracy. Furthermore, an analysis of model interpretability is provided to further explore the underlying reasons for the superior performance of the presented method. The developed framework is evaluated on publicly available ISO-NE load and price datasets and compared with well-known DL techniques to verify its forecasting performance.
In future work, we plan to extend the presented technique to longer time horizons such as a month or a year. A successful DR program depends critically on the quality of price forecasting at the market side and of load forecasting at the utility side as well as the end consumer side. Hence, the proposed research work can be used to identify the consumption behaviours of individual customers and to prepare optimal DR pricing strategies. At present, the model is deployable only on a server or PC, thus we aim to
85930 VOLUME 9, 2021


M. F. Azam, M. S. Younis: Multi-Horizon Electricity Load and Price Forecasting Using Interpretable Multi-Head Self-Attention

deploy the forecasting model in a resource-constrained Internet of Things (IoT) production environment by decreasing the storage and computational complexity.
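
As an illustration of the event-detection step described in Section C, the following is a minimal sketch of how a dist(t) metric could be computed from stored attention weights. The text does not state which distance function is used, so the Bhattacharyya-based distance below (one common choice, used for example in the Temporal Fusion Transformer work) is an assumption. The array layout `(time, horizon, attention_vector)`, the function names `attention_distance` and `flag_events`, and the mean-plus-n-sigma threshold are likewise hypothetical and serve only to make the idea concrete.

```python
import numpy as np


def attention_distance(alpha: np.ndarray) -> np.ndarray:
    """Distance of each time step's attention pattern from the global average.

    alpha has shape (T, H, L): T forecast times, H horizons, and an attention
    vector of length L that is non-negative and sums to 1 on the last axis.
    Returns an array of shape (T,), averaged over the H horizons.
    """
    # Global "average" attention pattern per horizon, shape (H, L).
    mean_pattern = alpha.mean(axis=0)
    # Bhattacharyya coefficient between each pattern and the average, shape (T, H).
    rho = np.sqrt(alpha * mean_pattern).sum(axis=-1)
    # Assumed distance: sqrt(1 - rho); clip guards against tiny negative round-off.
    kappa = np.sqrt(np.clip(1.0 - rho, 0.0, None))
    # Aggregate over all horizons.
    return kappa.mean(axis=1)


def flag_events(dist: np.ndarray, n_sigma: float = 2.0) -> np.ndarray:
    """Mark time steps whose distance exceeds mean + n_sigma * std (assumed rule)."""
    return dist > dist.mean() + n_sigma * dist.std()


# Synthetic demo: uniform attention everywhere except one concentrated step.
T, H, L = 48, 3, 6
alpha = np.full((T, H, L), 1.0 / L)
alpha[20] = np.array([0.9, 0.02, 0.02, 0.02, 0.02, 0.02])

dist = attention_distance(alpha)
events = np.flatnonzero(flag_events(dist))
print("flagged time steps:", events.tolist())
```

In this toy setting the only non-uniform attention pattern is the one that stands out from the average, which mirrors the observation above that attention weights stop being uniformly distributed during significant events.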

MUHAMMAD SHAHZAD YOUNIS received the bachelor's degree from the National University of Sciences and Technology, Islamabad, Pakistan, in 2002, the master's degree from the University of Engineering and Technology, Taxila, Pakistan, in 2005, and the Ph.D. degree from Universiti Teknologi PETRONAS, Perak, Malaysia, in 2009. He is currently working as an Assistant Professor at the Department of Electrical Engineering, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST). He has published more than 30 articles in domestic and international journals and conferences. His research interests include deep learning, statistical signal processing, adaptive filters, convex optimization, biomedical signal processing, wireless communication modeling, smart grid, and digital signal processing.

MUHAMMAD FURQAN AZAM received the B.S. degree in electronics engineering from International Islamic University Islamabad, Pakistan, in 2016. He is currently pursuing the M.S. degree in electrical engineering with the School of Electrical Engineering and Computer Sciences (SEECS), National University of Sciences and Technology (NUST), Islamabad. He is currently a Power System Engineer at National Transmission and Dispatch Company, NTDC, Pakistan. His research interests include deep learning, power system fault detection and analysis, time series forecasting, and smart grid.
