Explainable Multivariate Time Series Classification: A Deep Neural Network Which Learns to Attend to Important Variables As Well As Time Intervals

Tsung-Yu Hsieh, Suhang Wang, Yiwei Sun, Vasant Honavar
The Pennsylvania State University, University Park, PA, USA

ABSTRACT

Many real-world applications, e.g., healthcare, present multivariate time series prediction problems. In such settings, in addition to the predictive accuracy of the models, model transparency and explainability are paramount. We consider the problem of building explainable classifiers from multivariate time series data. A key criterion to understand such predictive models involves elucidating and quantifying the contribution of time-varying input variables to the classification. Hence, we introduce a novel, modular, convolution-based feature extraction and attention mechanism that simultaneously identifies the variables as well as the time intervals which determine the classifier output. We present results of extensive experiments with several benchmark data sets that show that the proposed method outperforms the state-of-the-art baseline methods on the multivariate time series classification task. The results of our case studies demonstrate that the variables and time intervals identified by the proposed method make sense relative to available domain knowledge.

CCS CONCEPTS

• Computing methodologies → Neural networks; Feature selection.

KEYWORDS

Multivariate time series; attentive convolution; explainability

ACM Reference Format:
Tsung-Yu Hsieh, Suhang Wang, Yiwei Sun, and Vasant Honavar. 2021. Explainable Multivariate Time Series Classification: A Deep Neural Network Which Learns to Attend to Important Variables As Well As Time Intervals. In Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM '21), March 8–12, 2021, Virtual Event, Israel. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3437963.3441815

1 INTRODUCTION

Recent advances in high-throughput sensors and digital technologies for data storage and processing have resulted in the availability of complex multivariate time series (MTS) data, i.e., measurements from multiple sensors, in the simplest case sampled at regularly spaced time points, that offer traces of complex behaviors as they unfold over time. There is much interest in effective methods for classification of MTS data [3] across a broad range of application domains including finance [58], meteorology [8], graph mining [55, 60], audio representation learning [17, 54], healthcare [13, 34], and human activity recognition [38, 57], among others. The impressive success of deep neural networks on a broad range of applications [31] has spurred the development of several deep neural network models for MTS classification [15]. For example, recurrent neural networks and their variants LSTM and GRU are the state-of-the-art methods for modeling the complex temporal and variable relationships [10, 27].

In high-stakes applications of machine learning, the ability to explain a machine-learned predictive model is a prerequisite for establishing trust in the model's predictions, and for gaining scientific insights that enhance our understanding of the domain [28, 39]. MTS classification models are no exception: in healthcare applications, e.g., monitoring and detection of epileptic seizures, it is important for clinicians to understand how and why an MTS classifier classifies an EEG signal as indicative of the onset of a seizure [50]. Similarly, in human activity classification, it is important to be able to explain why an MTS classifier detects activity that may be considered suspicious or abnormal [62]. Although there has been much recent work on explaining black box predictive models and their predictions [23, 28, 39], the existing methods are not directly applicable to MTS classifiers.

Developing explainable MTS classifiers presents several unique challenges: unlike in the case of classifiers trained on static data samples, MTS data encode the patterns of variable progression over time. For example, compare the brain wave signals (electroencephalogram or EEG recordings) from a healthy patient with those from a patient suffering from epileptic seizure as shown in Figure 1 [22, 50]. The two EEG recordings differ with respect to the temporal patterns
in the signals [37]. Because EEG measurements obtained at high temporal resolution suffer from a low signal-to-noise ratio, the EEG recordings from healthy patients (see Figure 1(a)) display some of the spike-like signals that are similar to those indicative of seizure in Figure 1(b). However, the temporal pattern of EEG signals over a larger time window shows clear differences between healthy and seizure activity. Thus, undue attention to local, point-wise observations, without consideration of the entire temporal pattern of activity [32], would result in failure to correctly recognize abnormal EEG recordings that are indicative of seizure. In contrast, focusing on the temporal pattern of activity over the relevant time windows, as shown in Figure 1, would make it easy to distinguish the EEG recordings indicative of healthy brain activity from those that are indicative of seizure, and to explain how they differ from each other.

Figure 1: Normal and seizure brain wave signal examples. (a) Normal example. (b) Seizure example.

In the case of MTS data, each variable offers different amounts of information that is relevant to the classification task. Furthermore, different variables may provide discriminative information during different time intervals. Hence, we hypothesize that MTS classifiers that can simultaneously identify not only important variables but also the time intervals during which the variables facilitate effective discrimination between different classes can not only improve the accuracy of MTS classifiers, but also enhance their explainability.

Hence, we introduce a novel, modular, convolution-based feature extraction and attention mechanism that simultaneously (i) identifies informative variables and the time intervals during which they contain informative patterns for classification; and (ii) leverages the informative variables and time intervals to perform MTS classification. Specifically, we propose the Locality Aware eXplainable Convolutional ATtention network (LAXCAT), a novel MTS classifier which consists of a dedicated convolution-based feature extraction network and dual attention networks. The convolutional feature extraction network extracts and encodes information from a local temporal neighborhood. The dual attention networks help identify the informative variables and the time intervals in which each variable helps discriminate between classes. Working in concert, the convolution-based feature extraction network and the dual attention networks maximize the predictive performance and the explainability of the MTS classifier. The major contributions of this work are as follows:

• We consider the novel problem of simultaneously selecting informative variables and time intervals with informative patterns for discrimination between the classes to optimize the accuracy and explainability of MTS classifiers;
• We describe a novel modular architecture consisting of a convolution-based feature extraction network and dual attention networks to effectively address this problem;
• We present results of extensive experiments with several benchmark data sets and show that LAXCAT outperforms the state-of-the-art baseline methods for MTS classification;
• We present results of case studies and demonstrate that the variables and time intervals identified by the proposed model are in line with the available domain knowledge.

The rest of the paper is organized as follows. Section 2 reviews related work; Section 3 introduces the problem definition; Section 4 describes our proposed solution; Section 5 describes our experiments and case studies; Section 6 concludes with a brief summary and a discussion of some directions for further research.

2 RELATED WORK

Multivariate Time Series Classification. Multivariate time series classification has received much attention in recent years. Such methods can be broadly grouped into two categories: distance-based methods [1] and feature-based methods [21]. Distance-based methods classify a given time series based on the label(s) of the time series in the training set that are most similar to it or closest to it, where closeness is defined by some distance measure. Dynamic time warping (DTW) [5] is perhaps the most common distance measure for assessing the similarity between time series. DTW, combined with the nearest neighbors (NN) classifier, is a very strong baseline method for MTS classification [3]. Feature-based methods extract a collection of informative features from the time series data and encode the time series using a feature vector. The simplest such encoding involves representing the sampled time series values by a vector of numerical feature values. Other examples of time series features include various statistics such as sample mean and variance, energy values from the Fourier transform coefficients, power spectrum bands [7], wavelets [43], and shapelets [61], among others. Once time series data are encoded using finite dimensional feature vectors, the resulting data can be used to train a classifier using any standard supervised machine learning method [26]. The success of deep neural networks on a wide range of classification problems [31] has inspired much work on variants of deep neural networks for time series classification (see [15] for a review). However, as noted earlier, the black box nature of deep neural networks makes them difficult to understand. Deep neural network models for MTS classification are no exceptions.

Explainable Models. There has been much recent work on methods for explaining black box predictive models (reviewed in [28, 39]), typically by attributing the responsibility for the model's predictions to different input features. Such post hoc model explanation techniques include methods for visualizing the effect of the model inputs on its outputs [65, 67], methods for extracting simplified rules or feature interactions from black box models [20, 41], methods that score features according to their importance in prediction [2, 9, 36], gradient-based methods that assess how changes in inputs impact the model predictions [9, 51, 52], and methods for approximating local decision surfaces in the neighborhood of the input sample via localized regression [6, 47, 49].
An alternative to post hoc analysis is explainability by design, which includes, in particular, methods that identify an informative subset of features to build parsimonious, and hence easier to understand, models. Such methods can be further categorized into global methods, which discover a single, instance-agnostic subset of relevant variables, and local methods, which discover instance-specific subsets of relevant features. Yoon et al. [63] proposed a principal component analysis-based recursive variable elimination approach to identify an informative subset of variables on an fMRI classification task. Han et al. [25] use class separability to select an optimal subset of variables in an MTS classification task. When the data set is heterogeneous, it may be hard to identify a single set of features that are relevant for classification over the entire data set [64]. Such a setting calls for local methods that can identify instance-specific features. One such local method uses an attention mechanism [4]. Choi et al. [11] proposed RETAIN, an explainable predictive model based on a two-level neural attention mechanism which identifies significant clinical variables and influential visits to the hospital in the context of electronic health records classification. RAIM [59] introduced a multi-channel attention mechanism guided by discrete clinical events to jointly analyze continuous monitoring data and discrete clinical events. Qin et al. [46] proposed a dual-stage attention-based encoder-decoder RNN to select the time series variables that drive the model predictions. Guo et al. [24] explored the structure of LSTM networks to learn variable-wise hidden states to understand the role of each variable in the prediction. A key limitation of the existing body of work on explaining black box neural network models for MTS classification is that they focus on either identifying a subset of relevant time series, or a subset of discrete time points. However, many practical applications of MTS classification require identifying not only the relevant subset of the time series variables, but also the time intervals during which the variables help discriminate between the classes.
3 PROBLEM DEFINITION

Let X = {x^(1), ..., x^(T)} be a multivariate time series sequence, where x^(t) ∈ R^P denotes the P-dimensional observation at time point t, and x_i^(t) ∈ R is the value of the i-th variable sampled at time point t. We use 𝒳 = {(X_1, y_1), ..., (X_N, y_N)} to denote a set of N input sequences along with their true labels, where X_i is the i-th multivariate time series sequence and y_i is its corresponding label. Based on the context, t can be used to index either a time point or a time interval. In multivariate time series classification (MTSC), the goal is to predict the label y of an MTS data instance X. For example, given sequences of EEG recordings of a subject from multiple channels corresponding to different locations on the brain surface, the task is to predict whether the recording denotes healthy or seizure activity. As noted earlier, in each multivariate time series, not all the features equally inform the classification. In addition, for the important variables, only a few key time intervals are typically important for discrimination between the different classes. Hence, the problem of explainable MTSC is formally defined as follows:

Given an MTS training data set 𝒳 = {(X_1, y_1), ..., (X_N, y_N)}, learn a function f that can simultaneously predict the label of an MTS data instance, and identify the informative variables and the time intervals over which their values inform the class label.
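To make the notation concrete, the following is a minimal sketch (our own illustration, not part of the original paper) of how an MTS data set with N instances, T time points, and P variables can be laid out; the array names are arbitrary.

```python
import numpy as np

# A toy explainable-MTSC data set in the notation of Section 3:
# N instances, each a T x P multivariate sequence X_i with a label y_i.
N, T, P = 405, 641, 15          # e.g., the Motor Task setting reported in Table 1
X = np.random.randn(N, T, P)    # X[i, t, :] is the P-dimensional observation x^(t)
y = np.random.randint(0, 2, N)  # class labels
print(X.shape, y.shape)         # (405, 641, 15) (405,)
```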
4 THE PROPOSED FRAMEWORK - LAXCAT

We proceed to describe the Locality Aware eXplainable Convolutional ATtention network (LAXCAT). Figure 2 provides an overview of the LAXCAT architecture. LAXCAT consists of three components: (i) a convolutional module that extracts time-interval based features from the input multivariate time series sequence; (ii) a variable attention module, which assigns weights to variables according to their importance in classification; and (iii) a temporal attention module, which identifies the time intervals over which the variables identified by the variable attention module inform the classifier output. The LAXCAT architecture is designed to learn a representation of the MTS data that not only suffices for accurate prediction of the class label for each MTS data instance, but also helps explain the assigned class label in terms of the variables and the time intervals over which the values they assume inform the classification. We now proceed to describe each module of LAXCAT in detail.

Figure 2: Proposed LAXCAT model framework. The framework is comprised of three major components: a CNN feature extraction module and two attention modules. The CNN layer extracts informative features within each time interval of interest (TOI). The two attention modules work together to identify informative variables and key TOIs.
4.1 Feature Extraction via Convolutional Layer

The first step is to extract useful features from the input time series. The key idea of the feature extraction module is to incorporate the temporal pattern of values assumed by a time series variable, as opposed to focusing only on point-wise observations. Given a multivariate time series input sequence X = {x^(1), ..., x^(T)}, with x^(t) ∈ R^P, where T is the length of the sequence and P is the number of covariates, we adopt a convolutional layer to automatically extract features from the time series. Specifically, a 1-d convolutional layer with kernel size 1 × L is applied to each input variable, where L is the length of the time interval of interest. The kernel window slides through the temporal domain with overlap. The convolutional weights are shared along the temporal domain, and each input variable has its own dedicated feature extraction convolutional layer. In our model, we adopt a convolutional layer with J filters so that a J-dimensional feature vector is extracted for each variable from each time interval. The convolutional layer encodes the multivariate input sequence by:

    c_{i,t} = CNN_i(x_i),  i = 1, ..., P    (1)

where c_{i,t} ∈ R^J is the feature vector for x_i extracted from the t-th time interval of interest, t = 1, ..., l. The number of intervals, l, depends on the convolution kernel length L and the convolution stride size.

The convolution-based feature extraction yields features that incorporate the temporal pattern of values assumed by the input variables within a local context (determined by the convolution window). The attention mechanism applied to such features measures the importance of the targeted time interval, as opposed to specific time points. Thus, the convolutional layers can learn to adapt to the dynamics of each input time series variable while ensuring that the attention scores are attached to the corresponding input variables. The multiple filters attend to different aspects of the signal and jointly extract a rich feature vector that encodes the relevant information from the time series in the time interval of interest. Note that for each variable, the convolution computation on each time interval can be carried out in parallel, as opposed to the sequential processing in canonical RNN models. Furthermore, the number of effective time points is significantly reduced by considering intervals as opposed to discrete time points. This also reduces the computational complexity of the downstream attention mechanism. While we limit ourselves to the simple convolution structure described above, the LAXCAT architecture can accommodate more sophisticated, e.g., dilated [42], convolution structures for more flexible feature extraction from MTS data.

The feature extraction module accepts an input time series {x_1, ..., x_T} and produces a sequence of feature matrices {C_1, ..., C_l}, where C_t ∈ R^{P×J}. Each row in C_t stores the feature vector specific to each variable within time interval t in the input sequence, i.e., C_t = [c_{1,t}, ..., c_{P,t}]^T. The variable attention module (see below) considers the feature matrices at each time interval so as to obtain a local context embedding vector h_t, t = 1, ..., l, for each interval. The temporal attention module constructs the summary embedding z, which is used to encode the MTS data for classification. In the model, temporal attention measures the contribution of each time interval to the embedding, whereas variable attention controls the extent to which each variable is important within each interval. We proceed to discuss the details of the two attention modules.
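The per-variable convolutional feature extraction of Eq. (1) can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' released implementation; the class name PerVariableConvExtractor and the use of a grouped Conv1d to give each variable its own filter bank are our assumptions, and the stride follows the 50%-of-kernel setting reported later in Section 5.2.

```python
import torch
import torch.nn as nn

class PerVariableConvExtractor(nn.Module):
    """Extract a J-dim feature vector per variable and per time interval (Eq. 1).

    Each of the P variables gets its own bank of J 1-d filters; this is
    implemented here with a grouped Conv1d (groups=P).
    """
    def __init__(self, P, J, L):
        super().__init__()
        self.P, self.J = P, J
        # stride set to 50% of the kernel size, as in the experimental setup
        self.conv = nn.Conv1d(in_channels=P, out_channels=P * J,
                              kernel_size=L, stride=max(L // 2, 1), groups=P)

    def forward(self, x):
        # x: (batch, P, T)  ->  features: (batch, l, P, J)
        out = self.conv(x)                      # (batch, P*J, l)
        b, _, l = out.shape
        out = out.view(b, self.P, self.J, l)    # group channels back per variable
        return out.permute(0, 3, 1, 2)          # one feature matrix C_t per interval

x = torch.randn(4, 15, 641)                     # e.g., Movement data: P=15, T=641
feats = PerVariableConvExtractor(P=15, J=8, L=32)(x)
print(feats.shape)                              # (4, l, 15, 8)
```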
4.2 Variable Attention Module

The variable attention module evaluates variable attention and constructs the local context embedding. Specifically, the local context embedding is an aggregation of the feature vectors weighted by their relative importance measures within the specific time interval. The context vector h_t ∈ R^J for the t-th time interval is obtained by

    h_t = Σ_{i=1}^{P} α_{it} c_{i,t}    (2)

where α_{it} is the attention score for c_{i,t} and is, equivalently, the attention score dedicated to variable x_i in the t-th time interval. To precisely evaluate the importance of each variable, we use a feed-forward network to learn the attention score vectors a_t = [α_{1t}, ..., α_{Pt}], t = 1, ..., l. The network can be characterized by

    a_t = softmax(s_t),  s_t = σ_2( σ_1( W_1^(V) C_t + B_1^(V) ) W_2^(V) + B_2^(V) )    (3)

where W_1^(V) ∈ R^{1×P}, W_2^(V) ∈ R^{J×P}, B_1^(V) ∈ R^{1×J}, and B_2^(V) ∈ R^{1×P} are the model parameters, and σ_1(·), σ_2(·) are non-linear activation functions such as tanh or ReLU, among others. In the feature extraction stage in Sec. 4.1, each time series input variable is processed independently and the correlations among the variables have not been considered. The input data to the variable attention network is the feature matrix C_t, which contains all feature vectors in time interval t. The attention network considers the multivariate correlation and distributes attention weights to each variable so as to maximize the predictive performance. Note that the local context embeddings in different intervals can be constructed independently of each other (and hence processed in parallel). In addition, the parameters of the variable attention module are shared among all intervals to ensure parsimony with respect to the model parameters.

The preceding process yields local context embeddings for each of the time intervals by considering the relative variable importance. The result is a context matrix H = [h_1, ..., h_l]^T ∈ R^{l×J} consisting of the context vector at each interval. In the next subsection, we describe how to compose the summary embedding of the MTS instance using the temporal attention mechanism.

4.3 Temporal Attention Module

The goal of the temporal attention module is to identify key segments of the signals which contain information that can discriminate between classes. The summary vector z is composed by aggregating the context embedding vectors weighted by their relative temporal contribution as follows:

    z = Σ_{t=1}^{l} β_t h_t    (4)

where β_t is the temporal attention score for the context vector h_t, and it quantifies the contribution of the information carried in interval t. Similarly, the temporal attention module is instantiated by a feed-forward network. The vector of temporal attention scores b = [β_1, ..., β_l] is learned using the following procedure:

    b = softmax(u),  u = σ_2( σ_1( W_1^(T) H + B_1^(T) ) W_2^(T) + B_2^(T) )    (5)

where W_1^(T) ∈ R^{1×l}, W_2^(T) ∈ R^{J×l}, B_1^(T) ∈ R^{1×J}, and B_2^(T) ∈ R^{1×l} are model parameters, and σ_1(·), σ_2(·) are non-linear activation functions. The input to the temporal attention module is the entire context matrix H. The module takes into account the correlations among time intervals and the predictive performance of each interval to distribute attention scores.
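The two attention modules (Eqs. (2)-(5)) can be sketched compactly as a single PyTorch module. This is again our own illustration rather than the authors' code: the name DualAttention and the choice of tanh and ReLU for σ_1 and σ_2 are assumptions, while the parameter shapes follow the equations above. Given the feature tensor produced by the extractor sketched in Section 4.1, it returns the summary embedding z along with the variable scores α and temporal scores β used later for explanation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Variable attention (Eqs. 2-3) followed by temporal attention (Eqs. 4-5)."""
    def __init__(self, P, J, l):
        super().__init__()
        # variable attention parameters (shared across all intervals)
        self.W1v = nn.Parameter(torch.randn(1, P) * 0.1)
        self.B1v = nn.Parameter(torch.zeros(1, J))
        self.W2v = nn.Parameter(torch.randn(J, P) * 0.1)
        self.B2v = nn.Parameter(torch.zeros(1, P))
        # temporal attention parameters
        self.W1t = nn.Parameter(torch.randn(1, l) * 0.1)
        self.B1t = nn.Parameter(torch.zeros(1, J))
        self.W2t = nn.Parameter(torch.randn(J, l) * 0.1)
        self.B2t = nn.Parameter(torch.zeros(1, l))

    def forward(self, C):
        # C: (batch, l, P, J) -- one feature matrix C_t per interval
        s = torch.relu(torch.tanh(self.W1v @ C + self.B1v) @ self.W2v + self.B2v)
        alpha = F.softmax(s, dim=-1)            # (batch, l, 1, P): variable scores a_t
        H = (alpha @ C).squeeze(2)              # (batch, l, J): context matrix H
        u = torch.relu(torch.tanh(self.W1t @ H + self.B1t) @ self.W2t + self.B2t)
        beta = F.softmax(u, dim=-1)             # (batch, 1, l): temporal scores b
        z = (beta @ H).squeeze(1)               # (batch, J): summary embedding z
        return z, alpha.squeeze(2), beta.squeeze(1)
```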
This concludes the description of the three modules of the LAXCAT architecture. To summarize, the convolutional feature extraction module extracts a rich set of features from the time series data. The variable and temporal attention modules construct an embedding of the MTS data to be classified by attending to the relevant variables and time intervals.

4.4 Learning LAXCAT

Given the encoding z, which captures the important variables and time intervals of the input multivariate time series sequence, we can predict the class label of the sequence as follows:

    y = f(z; W)    (6)

where W denotes the weights of f(·), a fully connected network.¹ Given a set of training instances {(X_1, y_1), ..., (X_N, y_N)}, the parameters Θ of the variable and temporal attention modules and the classifier network can be jointly learned by optimizing the following objective function:

    min_Θ (1/N) Σ_{i=1}^{N} L(X_i, y_i; Θ) + α ‖Θ‖_F²    (7)

where ‖Θ‖_F² is the Frobenius norm on the weights to alleviate overfitting and α is a scalar that controls the effect of the regularization term. In this study, L is chosen to be the cross-entropy loss function. The resulting objective function is smooth and differentiable, allowing it to be minimized using standard gradient back-propagation updates of the model parameters. We used the Adam optimizer [29] to train the model, and the hyperparameters are set to their default values.

¹ Although here we focus on classifying MTS data, the LAXCAT framework can be readily applied to forecasting and other related tasks by choosing an appropriate f(·).
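Putting the pieces together, a minimal training step for the objective in Eq. (7) might look as follows, assuming the extractor and attention modules sketched above. It uses the cross-entropy loss, an explicit squared-norm penalty weighted by α, and the Adam optimizer with default hyperparameters, as described in the text; the helper name train_step and the classifier sizes are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

def train_step(extractor, attention, classifier, optimizer, x, y, alpha=0.01):
    """One gradient update for Eq. 7: cross-entropy + alpha * ||Theta||_F^2."""
    optimizer.zero_grad()
    C = extractor(x)                     # (batch, l, P, J)
    z, _, _ = attention(C)               # summary embedding (batch, J)
    logits = classifier(z)               # (batch, num_classes)
    loss = nn.functional.cross_entropy(logits, y)
    reg = sum((p ** 2).sum() for p in list(extractor.parameters())
              + list(attention.parameters()) + list(classifier.parameters()))
    (loss + alpha * reg).backward()
    optimizer.step()
    return loss.item()

# usage sketch (shapes follow the Movement data set: P=15, T=641, 2 classes)
extractor = PerVariableConvExtractor(P=15, J=8, L=32)
l = extractor(torch.randn(1, 15, 641)).shape[1]
attention = DualAttention(P=15, J=8, l=l)
classifier = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
params = (list(extractor.parameters()) + list(attention.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params)
train_step(extractor, attention, classifier, optimizer,
           torch.randn(40, 15, 641), torch.randint(0, 2, (40,)))
```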
Figure 3: Explainability of the LAXCAT model.

Explainability of LAXCAT. LAXCAT is designed to accurately classify MTS data and also to facilitate instance-level explanation of the predicted class label. Given an input sequence, we can extract two attention measures from the model, namely the temporal attention scores and the variable attention scores. As shown in Figure 3, a summary of the importance of each variable in each interval is given by the product of the two attention scores,

    JointAtt_{i,t} = α_{it} × β_t    (8)

for i = 1, ..., P, t = 1, ..., l. These results can then be compared against domain knowledge or used to guide further experiments.
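The per-instance explanation of Eq. (8) reduces to an elementwise combination of the two attention vectors; a small sketch, with our own helper name joint_attention, follows.

```python
import torch

def joint_attention(alpha, beta):
    """Per-instance importance map JointAtt[i, t] = alpha_it * beta_t (Eq. 8).

    alpha: (l, P) variable attention scores; beta: (l,) temporal attention scores.
    Returns a (P, l) matrix that can be rendered as a heat map over variables x intervals.
    """
    return (alpha * beta.unsqueeze(1)).T

alpha = torch.softmax(torch.randn(39, 15), dim=1)   # toy scores: l=39 intervals, P=15 variables
beta = torch.softmax(torch.randn(39), dim=0)
heatmap = joint_attention(alpha, beta)
print(heatmap.shape)                                 # torch.Size([15, 39])
```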
5 EXPERIMENTS AND RESULTS

We proceed to describe our experiments aimed at evaluating the performance of LAXCAT in terms of the accuracy of MTS classification as well as the explainability of the resulting classifications.

5.1 Datasets

We used three publicly available real-world multivariate time series data sets; a summary of the data sets is provided in Table 1.

Table 1: A summary of the datasets.
Dataset      # Var. (P)   # Time Points   # Classes   # Samples
PM2.5 w/     8            24              6           1013
PM2.5 w/o    7            24              6           1013
Seizure      23           1025            2           272
Motor Task   15           641             2           405

• PM2.5 data set [35] contains hourly PM2.5 values and the associated meteorological measurements in Beijing. Given the measurements on one day, the task is to predict the PM2.5 level on the next day at 8 am, during the peak commute. The PM2.5 value is categorized into six levels according to the United States Environmental Protection Agency standard, i.e., good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous. We arranged this data set into two versions: the first one contains PM2.5 recordings as one of the covariates, called PM2.5 w/, and the second one excludes PM2.5 recordings, denoted as PM2.5 w/o. Aside from PM2.5 values, the meteorological variables include dew point, temperature, pressure, wind direction, wind speed, hours of snow, and hours of rain. We keep the measurements for weekdays and exclude the measurements for weekends, yielding a data set of 1013 MTS instances in total.

• Seizure data set [22, 50] consists of electroencephalogram (EEG) recordings from pediatric subjects with intractable seizures, collected at the Children's Hospital Boston. EEG signals at 23 positions, as shown in Figure 4(a), placed according to the international 10-20 system, were recorded at 256 samples per second with 16-bit resolution. Each instance is a four-second recording containing either a seizure attack period or a non-seizure period.

• Movement data set [22, 48] consists of EEG recordings of subjects opening and closing the left or right fist. EEG signals were recorded at 160 samples per second, and 15 electrode locations were used in this study, covering the central-parietal, frontal, and occipital regions as shown in Figure 4(b). Each instance contains 4 seconds of recordings; the subjects were at rest during the first two seconds and performed the fist movement during the latter two seconds. The task is to distinguish between left and right fist movement based on the 15-channel EEG recordings.

Figure 4: Variable and EEG location correspondence. (a) Seizure data. (b) Movement data.

5.2 Baseline Methods and Evaluation Setup

We compare the classification performance of LAXCAT with representative and state-of-the-art baselines:

• kNN-DTW [19, 40] is the dynamic time warping (DTW) distance measure combined with a k-nearest neighbor (kNN) classifier. DTW provides a similarity score between two time series by warping the time axes of the sequences to achieve alignment. The classification phase is carried out by the kNN classifier (a minimal DTW sketch is given at the end of this subsection).

• LR is the logistic regression classifier. For multivariate time series input, we concatenate all variables, and the input to the LR model is a multivariate vector.

• LSTM [27] is the long short-term memory recurrent neural network. An LSTM network with one hidden layer is adopted to learn an encoding from the multivariate time series data, and the classification phase is carried out by a feed-forward neural network.

• DARNN [46] is a dual attention RNN model. It uses an encoder-decoder structure, where the encoder is applied to learn attentions and the decoder is adopted for the prediction task.

• IMV-LSTM [24] is the interpretable multivariate LSTM model. It explores the structure of LSTM networks to learn variable-wise hidden states. With the hidden states, a mixture attention mechanism is exploited to model the generative process of the target.

We implemented the proposed model and the deep learning baseline methods with PyTorch. We used the Adam optimizer [29] to train the networks with default parameter settings, and the mini-batch size is 40. The number of filters in the feature extraction step is chosen from {8, 16, 32}. The kernel size L is selected from {2, 3, 5}, {16, 32, 64}, and {16, 32, 64} in PM2.5, Seizure, and Motor Task, respectively, and the stride size is set to 50% of the kernel size. For the number of hidden nodes in the classifier feed-forward network, we conduct a grid search over {8, 16, 32}. The coefficient for the regularization term is chosen from {0.001, 0.01, 0.1}. In the case of the kNN-DTW method, k is set to 1, yielding a one-nearest-neighbor classifier. In the case of the LSTM baseline, the number of hidden nodes is selected from {8, 16, 32, 64}. For IMV-LSTM, we implemented IMV-Tensor as it was reported to perform better [24]. The parameter selection for the baseline methods DARNN and IMV-LSTM follows the guidelines provided in the respective papers. We train the models using 70% of the samples, and 15% of the samples are used for validation. The remaining 15% are used as the test set. We repeat the experiment five times and report the average performance.
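For reference, a minimal NumPy sketch of the kNN-DTW baseline (with k = 1) follows. It is a textbook dynamic time warping recursion with a Euclidean local cost; the exact DTW variant used in the reported experiments (e.g., any warping-window constraint) is not specified in the text, so treat this as an illustration only.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two multivariate sequences.

    a: (T1, P) array, b: (T2, P) array.
    """
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # Euclidean local cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

def one_nn_dtw(train_X, train_y, query):
    """1-NN classification with the DTW distance (k = 1, as in the experiments)."""
    dists = [dtw_distance(x, query) for x in train_X]
    return train_y[int(np.argmin(dists))]
```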
5.3 Performance of LAXCAT

Classification Accuracy. We compared LAXCAT with the baseline methods on MTS classification using the different benchmark data sets described above and report the results in Table 2. The results of our experiments show that the deep learning-based methods outperform the other two simple baseline methods, 1NN-DTW and LR. LR mostly outperforms 1NN-DTW, with the only exception on the Seizure data set. Among the deep learning based methods, those equipped with an attention mechanism achieve better classification accuracy than the canonical LSTM model. Among the attention based deep neural network models, LAXCAT outperforms the other two attention based deep neural network baselines. We further note that, in the case of the two PM2.5 data sets, perhaps not surprisingly, all models consistently make better future PM2.5 value predictions when past PM2.5 value recordings are included as an input.

Table 2: Classification results (Accuracy ± std) of different algorithms on the four data sets.
Dataset     1NN-DTW        LR             LSTM           DARNN          IMV-LSTM       Proposed
PM2.5 w/    36.05 ± 3.24   38.29 ± 0.86   40.40 ± 1.89   41.19 ± 2.89   48.16 ± 3.30   50.66 ± 4.58
PM2.5 w/o   30.39 ± 3.85   35.92 ± 2.11   37.06 ± 3.04   38.98 ± 5.43   39.34 ± 4.40   45.53 ± 6.20
Seizure     53.66 ± 7.72   52.20 ± 5.88   70.24 ± 2.67   71.31 ± 2.78   72.19 ± 2.78   76.59 ± 4.08
Movement    53.44 ± 5.13   71.80 ± 8.31   75.32 ± 3.26   83.28 ± 1.37   84.09 ± 1.12   87.21 ± 1.37

Time Complexity. We also compare the computational complexity of the deep learning based methods in terms of run-time per training iteration and run-time per testing iteration. As reported in Table 3, the LSTM baseline does not include any attention mechanism to track variable and temporal importance and hence takes the least amount of run-time in each training and test iteration across all the data sets. Among the attention based deep neural network models, LAXCAT has the shortest run-time. The difference in execution time between LAXCAT and the two baselines is quite substantial in the case of the Seizure and Motor Task data sets, due to the lengths of the time series in question: each sequence in the Seizure data set contains 1025 sampling points, while sequences in Motor Task contain 641 sampling points. DARNN and IMV-LSTM evaluate time point-based attention, which places a greater computational burden compared to the time interval-based attention in LAXCAT. We further note that, in contrast to LSTM based methods, which are inherently sequential, many aspects of LAXCAT are parallelizable.

Table 3: Run-time (per iteration in seconds) comparison.
Dataset            PM2.5 w/   Seizure   Movement
LSTM (Train)       0.5        4         2.8
DARNN (Train)      4.8        430       218
IMV-LSTM (Train)   3.3        430       150
Ours (Train)       1.4        4.5       3.5
LSTM (Test)        0.001      0.01      0.01
DARNN (Test)       0.08       24        13
IMV-LSTM (Test)    0.06       20        4.3
Ours (Test)        0.02       0.03      0.03

5.4 Explaining the LAXCAT Predictions

We proceed to describe several case studies designed to evaluate the effectiveness of LAXCAT in producing useful explanations of its classifications. For qualitative analysis, we report the meaningful variables and time intervals identified by the attention mechanism and compare them with domain findings in the related literature. To quantitatively assess the effectiveness of the allocation of attention, we define the attention allocation measure (AAM)

    AAM = (Amount of attention allocated correctly / Total amount of attention) × 100%    (9)

This measure is only applied to the cases where a solid understanding of the important variables and time intervals is present.
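Computing the AAM of Eq. (9) from the joint attention map of Eq. (8) is straightforward; a hedged sketch, with our own helper name and a target_mask marking the variable/interval region known to be relevant, is given below.

```python
import torch

def attention_allocation_measure(joint_att, target_mask):
    """AAM (Eq. 9): percentage of total attention falling inside the target region.

    joint_att:   (P, l) joint attention map from Eq. 8.
    target_mask: (P, l) boolean mask marking the variables/intervals known to matter,
                 e.g., variable 1 within the square-wave interval of the synthetic data.
    """
    return 100.0 * joint_att[target_mask].sum().item() / joint_att.sum().item()
```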
5.4.1 Case Study I: Synthetic Data. To thoroughly examine the attention mechanism, we constructed a synthetic data set that reflects concrete prior knowledge regarding the key variables and the time intervals that determine the class labels. This synthetic data set consists of 3 time series variables, i.e., x_1^(t) = cos(2πt) + ε, x_2^(t) = sin(2πt) + ε, and x_3^(t) = exp(t) + ε, where ε is Gaussian noise and t takes values from a vector of 50 linearly equally spaced points between 0 and 1. To generate two classes, we randomly select half of the instances and manipulate the first variable x_1^(t) by adding a square wave signal to the raw sequence. The square wave is controlled by three random variables: the starting point, the length, and the magnitude of the square wave. We treat instances with the square wave as positive and those without the square wave as negative. For the synthetic data, we define correct attention allocation as the attention assigned to variable 1 within the interval of the square wave. The AAM scores on the synthetic data are reported in Table 4. From the table, we observe that LAXCAT outperforms DARNN and IMV-LSTM, suggesting that LAXCAT can better identify important variables and the relevant time intervals. We give an illustration of a positive instance and a negative instance with the attention allocation by LAXCAT in Figure 5. We observe that the proposed model distributes most of its attention to variable 1 and, specifically, to the interval that covers the location of the square wave in both positive and negative class instances.

Table 4: AAM score on Synthetic data and Motor Task.
Dataset      DARNN    IMV-LSTM   Ours
Synthetic    5.42%    7.91%      10.93%
Motor Task   19.91%   22.08%     24.17%

Figure 5: Positive and negative synthetic examples are drawn in solid lines and the heat maps of the attention allocation are depicted in the background. The attention for variable 1 is located in the top row, variable 2 in the center row, and variable 3 in the bottom row. (a) Positive Sample. (b) Negative Sample. (Best viewed in color)
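The synthetic data set of this case study can be reproduced approximately as follows. The noise standard deviation and the ranges of the square wave's start, length, and magnitude are not specified in the text, so the values below are our assumptions; the overall construction follows the description above.

```python
import numpy as np

def make_synthetic(n_instances=500, T=50, seed=0):
    """Generate the 3-variable synthetic data set described in Case Study I."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, T)
    base = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t), np.exp(t)])   # (3, T)
    X = np.repeat(base[None], n_instances, axis=0)
    X += rng.normal(0, 0.1, X.shape)                      # Gaussian noise (assumed scale)
    y = np.zeros(n_instances, dtype=int)
    positive = rng.choice(n_instances, n_instances // 2, replace=False)
    y[positive] = 1
    for idx in positive:                                  # square wave on variable 1
        start = rng.integers(0, T - 10)
        length = rng.integers(5, 15)
        magnitude = rng.uniform(1.0, 3.0)
        X[idx, 0, start:start + length] += magnitude
    return X, y

X, y = make_synthetic()
print(X.shape, y.mean())    # (500, 3, 50) 0.5
```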
5.4.2 Case Study II: PM2.5. The PM2.5 value is the concentration of particles with a diameter of 2.5 micrometers or less suspended in air. Studies have found a close link between exposure to fine particles and premature death from heart and lung disease [18]. We report the attention learning results in Figure 6. Variable-wise, as shown in Figure 6, when past PM2.5 recordings are available for future prediction, IMV-LSTM ranks PM2.5 as the most important variable, and the LAXCAT method ranks it as the second most important variable. DARNN consistently selects dew point, snow, and rain as important predictive variables. When the PM2.5 value is not available, wind speed, wind direction, and pressure are ranked high by IMV-LSTM, which is consistent with what is reported in [24]. LAXCAT attends to wind direction and wind speed besides the PM2.5 value. According to [45], wind direction and speed are critical factors that affect the amount of pollutant transport and dispersion between Beijing and the surrounding areas.

Figure 6: Average attention allocation on the PM2.5 datasets. (a) PM2.5 w/. (b) PM2.5 w/o.

5.4.3 Case Study III: Movement Data. On the Movement data set, we use EEG recordings to distinguish whether the subject is moving the left or the right hand. A subset of the attention results is reported in Figure 7. Extensive research has shown that the motor cortex is involved in the planning, control, and execution of voluntary movements [12, 44]. The motor cortex, located in the rear portion of the frontal lobe, is closest to the locations of variables 1 to 7 in our empirical analysis. In Figures 7(a)-7(d), the heatmaps of accumulated attention over the entire time period from 2 subjects are depicted. We observe that most attention is distributed to the channels around the central region, namely the C1, C3, C4, and C6 channels. On the two left hand movement examples, i.e., Figures 7(a) and 7(c), the proposed model allocates attention to the right brain. On the contrary, LAXCAT assigns most attention to the left brain during right hand
movements, as shown in Figures 7(b) and 7(d). This observation is in line with the current theory of contralateral brain function, which states that the brain controls the opposite side of the body. For temporal attention, as reported in Figure 7(e), LAXCAT puts most attention on the time intervals after the motor task onset. For quantitative analysis, we define correct attention assignment as the attention distributed to channels C5, C3, C1, Cz, C2, C4, C6 in the time period after the motor task onset. As reported in Table 4, LAXCAT assigns about 24% of its attention to the target zone, as compared to around 20% for DARNN and 22% for IMV-LSTM.

Figure 7: Important channels and time intervals identified by LAXCAT on the Movement data set. The darker the color, the more attention allocated at the location. (a) Left hand 1. (b) Right hand 1. (c) Left hand 2. (d) Right hand 2. (e) Temporal attention.

5.5 Ablation Study

We also conducted an ablation study to examine the relative contributions of variable attention and temporal attention in LAXCAT. Specifically, we remove the variable attention module and obtain local context embeddings by averaging the feature vectors in each time interval (the second row in Table 5). Similarly, we remove the temporal attention module and obtain the summary embedding vector by averaging over the context embedding vectors (the third row in Table 5). Lastly, we remove both attention modules (the last row of Table 5). We conclude that both the variable and temporal attention modules contribute to the improved classification accuracy of LAXCAT.

Table 5: Ablation study on variable (var.) attention and temporal time-interval (temp.) attention.
Model                       PM2.5 w/   PM2.5 w/o
LAXCAT                      50.66      45.53
LAXCAT - var. attention     40.35      42.50
LAXCAT - temp. attention    41.67      41.71
LAXCAT - both attentions    40.26      38.68

5.6 Parameter Sensitivity Analysis

We investigated how the kernel size (interval length), the number of filters, and the number of hidden nodes in the classifier neural network affect classification accuracy. Due to space limitations, we only report the results on the Movement data set, shown in Figure 8. The results show no clear pattern as to how the numbers of filters and hidden nodes affect the predictive performance. As for the kernel size, 16 and 64 consistently yield better results than 32. When we set the kernel size to 1 (corresponding to time point based temporal attention) while fixing the number of filters and hidden nodes to 8, the classification accuracy falls to around 75%, which further underscores the benefits of interval-based temporal attention.

Figure 8: Parameter analysis on the Movement data set. (a) 8 hidden nodes. (b) 16 hidden nodes. (c) 32 hidden nodes.

6 SUMMARY

We considered the problem of MTS classification in settings where, besides achieving high accuracy, it is important to identify both the key variables that drive the classification and the time intervals during which their values provide information that helps discriminate between the classes. We introduced LAXCAT, a novel, modular architecture for explainable MTS classification. LAXCAT consists of a convolution-based feature extraction module along with a variable based and a temporal interval based attention mechanism. LAXCAT is trained to optimize classification accuracy while simultaneously selecting the variables and time intervals over which the pattern of values they assume drives the classifier output. We presented results of extensive experiments with several benchmark data sets and showed that the proposed method outperforms the state-of-the-art baseline methods for explainable MTS classification. The case studies demonstrate that the variables and time intervals identified by LAXCAT are in line with the available domain knowledge. Some directions for ongoing and future research include generalizations of the LAXCAT framework to settings with transfer learning [56, 68], multi-modal [14] or multi-view [53, 66], sparsely and irregularly observed [30, 33, 34], and multi-scale [16] MTS data.

ACKNOWLEDGMENTS

This work was funded in part by the NIH NCATS grant UL1 TR002014 and by NSF grants 2041759, 1636795, 1909702, and 1955851, the Edward Frymoyer Endowed Professorship at Pennsylvania State University, and the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science funded by the Pratiksha Trust at the Indian Institute of Science (both held by Vasant Honavar).

REFERENCES
[1] Amaia Abanda, Usue Mori, and Jose A Lozano. 2019. A review on distance based time series classification. Data Mining and Knowledge Discovery 33, 2 (2019), 378–412.
[2] Marco Ancona, Cengiz Oztireli, and Markus Gross. 2019. Explaining Deep Neural Networks with a Polynomial Time Algorithm for Shapley Value Approximation. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, Long Beach, California, USA, 272–281. http://proceedings.mlr.press/v97/ancona19a.html
[3] Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. 2017. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 31, 3 (2017), 606–660.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[5] Donald J Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In KDD workshop, Vol. 10. Seattle, WA, 359–370.
[6] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. 2020. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 648–657.
[7] Peter Bloomfield. 2004. Fourier analysis of time series: an introduction. John Wiley & Sons.
[8] Prithwish Chakraborty, Manish Marwah, Martin Arlitt, and Naren Ramakrishnan. 2012. Fine-grained photovoltaic output prediction using a bayesian ensemble. In AAAI.
[9] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. 2018. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. In ICML. 883–892.
[10] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259 (2014).
[11] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. 2016. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In NeurIPS. 3504–3512.
[12] BA Conway, DM Halliday, SF Farmer, U Shahani, P Maas, AI Weir, and JR Rosenberg. 1995. Synchronization between motor cortex and spinal motoneuronal pool during the performance of a maintained motor task in man. The Journal of Physiology 489, 3 (1995), 917–924.
[13] Enyan Dai, Yiwei Sun, and Suhang Wang. 2020. Ginger Cannot Cure Cancer: Battling Fake Health News with a Comprehensive Data Repository. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14. 853–862.
[14] Vijay Ekambaram, Kushagra Manglik, Sumanta Mukherjee, Surya Shravan Kumar Sajja, Satyam Dwivedi, and Vikas Raykar. 2020. Attention based Multi-Modal New Product Sales Time-series Forecasting. In KDD. 3110–3118.
[15] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 33, 4 (2019), 917–963.
[16] Garrett M Fitzmaurice, Nan M Laird, and James H Ware. 2012. Applied longitudinal analysis. Vol. 998. John Wiley & Sons.
[17] Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. 2019. Unsupervised scalable representation learning for multivariate time series. In NeurIPS. 4650–4661.
[18] Meredith Franklin, Petros Koutrakis, and Joel Schwartz. 2008. The role of particle composition on the association between PM2.5 and mortality. Epidemiology (Cambridge, Mass.) 19, 5 (2008), 680.
[19] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer Series in Statistics, New York.
[20] Nicholas Frosst and Geoffrey Hinton. 2017. Distilling a neural network into a soft decision tree. arXiv:1711.09784 (2017).
[21] Ben D Fulcher and Nick S Jones. 2014. Highly comparative feature-based time-series classification. IEEE TKDE 26, 12 (2014), 3026–3037.
[22] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 23 (2000), e215–e220.
[23] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51, 5 (2018), 1–42.
[24] Tian Guo, Tao Lin, and Nino Antulov-Fantulin. 2019. Exploring interpretable LSTM neural networks over multi-variable data. In ICML. 2494–2504.
[25] Min Han and Xiaoxin Liu. 2013. Feature selection techniques with class separability for multivariate time series. Neurocomputing 110 (2013), 29–34.
[26] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
[27] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[28] Aria Khademi and Vasant Honavar. 2020. A Causal Lens for Peeking into Black Box Predictive Models: Predictive Model Interpretation via Causal Attribution. arXiv:2008.00357 (2020).
[29] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980 (2014).
[30] Thanh Le and Vasant Honavar. 2020. Dynamical Gaussian Process Latent Variable Model for Representation Learning from Longitudinal Data. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference. 183–188.
[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[32] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In NeurIPS. 5244–5254.
[33] Junjie Liang, Yanting Wu, Dongkuan Xu, and Vasant Honavar. 2021. Longitudinal Deep Kernel Gaussian Process Regression. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. In press.
[34] Junjie Liang, Dongkuan Xu, Yiwei Sun, and Vasant G Honavar. 2020. LMLFM: Longitudinal Multi-Level Factorization Machine. In AAAI.
[35] Xuan Liang, Tao Zou, Bin Guo, Shuo Li, Haozhe Zhang, Shuyi Zhang, Hui Huang, and Song Xi Chen. 2015. Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proc. R. Soc. A: Mathematical, Physical and Engineering Sciences 471, 2182 (2015), 20150257.
[36] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In NeurIPS. 4765–4774.
[37] Philippe Major and Elizabeth A Thiele. 2007. Seizures in Children: Laboratory. Pediatrics in Review 28, 11 (2007), 405.
[38] Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In CVPR. IEEE, 4674–4683.
[39] Shane T Mueller, Robert R Hoffman, William Clancey, Abigail Emrey, and Gary Klein. 2019. Explanation in human-AI systems: A literature meta-review, synopsis of key ideas and publications, and bibliography for explainable AI. arXiv:1902.01876 (2019).
[40] Meinard Müller. 2007. Dynamic time warping. Information retrieval for music and motion (2007), 69–84.
[41] W James Murdoch, Peter J Liu, and Bin Yu. 2018. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. arXiv:1801.05453 (2018).
[42] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv:1609.03499 (2016).
[43] Donald B Percival and Andrew T Walden. 2000. Wavelet methods for time series analysis. Vol. 4. Cambridge University Press.
[44] Tue Hvass Petersen, Maria Willerslev-Olsen, Bernard A Conway, and Jens Bo Nielsen. 2012. The motor cortex drives the muscles during walking in human subjects. The Journal of Physiology 590, 10 (2012), 2443–2452.
[45] Wei-wei Pu, Xiu-juan Zhao, Xiao-ling Zhang, and Zhi-qiang Ma. 2011. Effect of meteorological factors on PM2.5 during July to September of Beijing. Procedia Earth and Planetary Science 2 (2011), 272–277.
[46] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison W Cottrell. 2017. A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction. In IJCAI.
[47] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In KDD. ACM, 1135–1144.
[48] Gerwin Schalk, Dennis J McFarland, Thilo Hinterberger, Niels Birbaumer, and Jonathan R Wolpaw. 2004. BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE TBME 51, 6 (2004), 1034–1043.
[49] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE ICCV. 618–626.
[50] Ali Hossam Shoeb. 2009. Application of machine learning to epileptic seizure onset detection and treatment. Ph.D. Dissertation. Massachusetts Institute of Technology.
[51] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In ICML. JMLR.org, 3145–3153.
[52] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. 2016. Not just a black box: Learning important features through propagating activation differences. arXiv:1605.01713 (2016).
[53] Yiwei Sun, Ngot Bui, Tsung-Yu Hsieh, and Vasant Honavar. 2018. Multi-view network embedding via graph factorization clustering and co-regularized multi-view agreement. In ICDM Workshop. IEEE, 1006–1013.
[54] Yiwei Sun and Shabnam Ghaffarzadegan. 2020. An Ontology-Aware Framework for Audio Event Classification. In ICASSP. IEEE, 321–325.
[55] Yiwei Sun, Suhang Wang, Tsung-Yu Hsieh, Xianfeng Tang, and Vasant Honavar. 2019. MEGAN: a generative adversarial network for multi-view network embedding. In IJCAI. AAAI Press, 3527–3533.
[56] Xianfeng Tang, Yandong Li, Yiwei Sun, Huaxiu Yao, Prasenjit Mitra, and Suhang Wang. 2020. Transferring Robustness for Graph Neural Network Against Poisoning Attacks. In WSDM. 600–608.
[57] Xianfeng Tang, Huaxiu Yao, Yiwei Sun, Charu C Aggarwal, Prasenjit Mitra, and Suhang Wang. 2020. Joint Modeling of Local and Global Temporal Dynamics for Multivariate Time Series Forecasting with Missing Values. In AAAI. 5956–5963.
[58] Yue Wu, José Miguel Hernández Lobato, and Zoubin Ghahramani. 2013. Dynamic covariance models for multivariate financial time series. In ICML. III–558.
[59] Yanbo Xu, Siddharth Biswal, Shriprasad R Deshpande, Kevin O Maher, and Jimeng Sun. 2018. Raim: Recurrent attentive and intensive model of multimodal patient monitoring data. In KDD. 2565–2573.
[60] Xiang Xuan and Kevin Murphy. 2007. Modeling changing dependency structure in multivariate time series. In ICML. 1055–1062.
[61] Lexiang Ye and Eamonn Keogh. 2009. Time series shapelets: a new primitive for data mining. In KDD. 947–956.
[62] Jie Yin, Qiang Yang, and Jeffrey Junfeng Pan. 2008. Sensor-based abnormal human-activity detection. IEEE TKDE 20, 8 (2008), 1082–1090.
[63] Hyunjin Yoon and Cyrus Shahabi. 2006. Feature subset selection on multivariate time series with extremely large spatial features. In ICDM Workshop. IEEE, 337–342.
[64] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. INVASE: Instance-wise variable selection using neural networks. In ICLR.
[65] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. 2015. Understanding neural networks through deep visualization. arXiv:1506.06579 (2015).
[66] Ye Yuan, Guangxu Xun, Fenglong Ma, Yaqing Wang, Nan Du, Kebin Jia, Lu Su, and Aidong Zhang. 2018. Muvan: A multi-view attention network for multivariate temporal data. In ICDM. IEEE, 717–726.
[67] Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV. Springer, 818–833.
[68] Xi Sheryl Zhang, Fengyi Tang, Hiroko H Dodge, Jiayu Zhou, and Fei Wang. 2019. Metapred: Meta-learning for clinical risk prediction with limited patient electronic health records. In KDD. 2487–2495.