
Machine learning for streaming data: state of the art, challenges, and opportunities

Heitor Murilo Gomes (Department of Computer Science, University of Waikato, Hamilton, New Zealand; LTCI, Télécom ParisTech), [email protected]
Jesse Read (LIX, École Polytechnique, Palaiseau, France), jesse.read@polytechnique.edu
Albert Bifet (Department of Computer Science, University of Waikato, Hamilton, New Zealand; LTCI, Télécom ParisTech), [email protected]
Jean Paul Barddal (PPGIA, Pontifical Catholic University of Paraná, Curitiba, Brazil), jean.barddal@ppgia.pucpr.br
João Gama (LIAAD, INESC TEC, University of Porto, Porto, Portugal), [email protected]

ABSTRACT

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., Velocity and Volume. Given the current industry needs, there are many challenges to be addressed before existing methods can be efficiently applied to real-world problems. In this work, we focus on elucidating the connections among the current state-of-the-art on related fields; and clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate the current research opportunities, highlighting the relationship of different subareas and suggesting courses of action when possible.

1. INTRODUCTION

Data sources are becoming increasingly ubiquitous and faster in comparison to the previous decade. These characteristics motivated the development of several machine learning algorithms for data streams. We are now on the verge of moving these methods out of the research labs and into industry, similarly to what happened to traditional machine learning methods in the recent past. This movement requires the development and adaptation of techniques that are adjacent to the learning algorithms, i.e., it is necessary to develop not only efficient and adaptive learners, but also methods to deal with data preprocessing and other practical tasks. On top of that, there is a need to objectively reassess the underlying assumptions of some techniques developed under hypothetical scenarios, to clearly understand when and how they are applicable in practice.

Machine learning for streaming data research yielded several works on supervised learning [19], especially classification, mostly focused on addressing the problem of changes to the underlying data distribution over time, i.e., concept drifts [119]. In general, these works focus on one specific challenge: developing methods that maintain an accurate decision model with the ability to learn and forget concepts incrementally.

The focus given to supervised learning has shifted towards other tasks in the past few years, mostly to accommodate more general requirements of real-world problems. Nowadays one can find work on data stream clustering, pattern mining, anomaly detection, feature selection, multi-output learning, semi-supervised learning, novel class detection, and others. Nevertheless, some fields were less developed than others; e.g., drift detection and recovery has been thoroughly investigated for streaming data where labels are immediately (and fully) available. In contrast, not as much research has been conducted on the efficiency of drift detection methods for streaming data where labels arrive with a non-negligible delay or when some (or all) labels never arrive (semi-supervised and unsupervised learning).

Previous works have shown the importance of some of these problems and their research directions. Krempl et al. [73] discuss important issues, such as how to evaluate data stream algorithms, privacy, and the gap between algorithms and full decision support systems. Machine learning for data streams is a recurrent topic in Big Data surveys [44; 127], as it is related to the Velocity and Volume characteristics of the traditional 3V's of big data (Volume, Variety, Velocity). Nevertheless, there is no consensus about how learning from streaming data should be tackled, and depending on the application (and the research group) different abstractions and solutions are going to be used. For example, Stoica et al. [115] discuss continual learning, and from their point of view, learning from heterogeneous environments where changes are expected to occur is better addressed by reinforcement learning. In summary, some reviews and surveys focus on specific tasks or techniques, such as rule mining for data streams [66], activity recognition [1] or ensemble learning [53].
In this paper, we focus on providing an updated view of the field of machine learning for data streams, highlighting the state-of-the-art and possible research (and development) opportunities. Our contributions focus on aspects that were not thoroughly discussed before in similar works [73]; thus, when appropriate, we direct the readers to works that better introduce the original problems while we highlight more complex challenges.

In summary, the topics discussed are organized as follows. The first three sections of the manuscript address the important topics of preprocessing (section 2), learning (section 3), and adaptation (section 4). In the last three sections, we turn our attention to evaluating algorithms (section 5), streaming data and related topics in AI (section 6), and existing tools for exploring machine learning for streaming data (section 7). Finally, section 8 outlines the main takeaways regarding research opportunities and challenges discussed throughout this paper.

2. DATA PREPROCESSING

Data preparation is an essential part of a machine learning solution. Real-world problems require transformations to raw data, preprocessing steps, and usually a further selection of 'relevant' data before it can be used to build machine learning models.

Even trivial preprocessing, such as normalizing a feature, can be complicated in a streaming setting. The main reason is that statistics about the data are unknown a priori, e.g., the minimum and maximum values a given feature can exhibit. There are different approaches to scale and discretize features, as discussed in the next subsections; still, as we move into more complex preprocessing, it is usually unknown territory or one which has not been explored in depth.

There are mainly two reasons for performing data preprocessing before training a machine learning model:

1. To allow learning algorithms to handle the data;
2. To improve learning by extracting or keeping only the most relevant data.

The first reason is more restrictive, as some algorithms will not be able to digest data if it is not in the expected format, i.e., if the data types do not match. Other algorithms will perform poorly if the data is not normalized; examples include algorithms that rely on Stochastic Gradient Descent and distance-based algorithms (such as nearest neighbors). Feature engineering governs the second aspect, as accurate machine learning solutions often rely on a well-thought-out feature transformation, selection, or reduction of the raw data.

In a batch learning pipeline, preprocessing, fitting, and testing a model are distinct phases. They are applied in order, and the output of this process is a fitted model that can be used for future data. These same operations are required in a streaming setting. The main difference is that streaming data needs the continuous application of the whole pipeline while it is being used. Consequently, all these phases are interleaved in an online process, which requires an intricate orchestration of the pipeline and ideally does not rely on performing some tasks offline.

García et al. [51] discussed the challenges concerning preprocessing techniques for big data environments, focusing on how different big data frameworks, such as Hadoop, Spark, and Flink, deal with them, and on which methods are implemented for a wide range of preprocessing tasks, including feature selection, instance selection, discretization, and others. Ramírez-Gallego et al. [103] focus specifically on preprocessing techniques for data streams, mentioning both existing methods and open challenges. Critical remarks were made in [103], such as: the relevance of proposing novel discretization techniques that do not rely solely on quantiles and that perform better in the presence of concept drifts; expanding existing preprocessing techniques that deal with concept drift to also account for recurrent drifts; and the need for procedures that address non-conventional problems, e.g., multi-label classification.

In this section, we focus on issues and techniques that were not thoroughly discussed in previous works, such as summarization sketches. Other aspects that can be considered part of preprocessing, notably dealing with imbalanced data, are discussed in further sections.

2.1 Feature transformation

2.1.1 Summarization Sketches

Working with limited memory in streaming data is non-trivial, since data streams produce insurmountable quantities of raw data, which are often not useful as individual instances, but essential when aggregated. These aggregated data structures can be used for further analysis, such as "what is the average age of all customers that visited a given website in the last hour?", or to train a machine learning model.

The summary created to avoid storing and maintaining a large amount of data is often referred to as a 'sketch' of the data. Sketches are probabilistic data structures that summarize streams of data, such that any two sketches of individual streams can be combined into the sketch of the combined stream in a space-efficient way. Using sketches requires a number of compromises: sketches must be created using a constrained amount of memory and processing time, and if the sketches are not accurate, then the information derived from them may be misleading.

Over the last decades, many summarization sketches have been proposed, ranging from simple membership techniques such as Bloom filters [22] and counting strategies to more advanced methods such as CM-Sketch [30] and ADA-Sketches [112]. Bloom filters are probabilistic data structures used to test whether an element is a member of a set, using hash functions. CM-Sketch is essentially an extension of Bloom filters used to count the frequency of different elements in a stream.

Sketches are becoming a popular approach to cope with data streams [4; 108; 128]. In the previous decade, the adoption of sketching for stream processing was met with some skepticism; for example, Gaber et al. [45] suggest using dimensionality reduction techniques, such as Principal Components Analysis, as a more sustainable approach for stream processing. Novel sketches have since been proposed based on the idea of building a meta-sketch using several sketches as components; the Slim-Fat Sketch [128] (SF-Sketch) is an example of this that outperforms single sketches.

Sketches can also be used inside machine learning methods for data streams. For example, the Graphical Model Sketch [75] is used inside Bayesian networks, or in the Naive Bayes classifier, to reduce the amount of memory used. How to use sketches inside other machine learning methods is an open question.
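To make the counting mechanism concrete, the following is a minimal Count-Min-style sketch; the width, depth, and MD5-based hashing are illustrative choices, not the implementation of [30], which relies on faster pairwise-independent hash functions.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts for stream elements in fixed memory."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row: more width, fewer collisions
        self.depth = depth    # independent rows: the min reduces collision error
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows
        # is the tightest available estimate and never undercounts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
for page in ["home", "cart", "home", "checkout", "home"]:
    sketch.add(page)
print(sketch.estimate("home"))  # 3, or slightly more under collisions
```

Because collisions can only inflate counters, the estimate never undercounts. Two sketches built with the same dimensions and hash functions can also be merged by element-wise addition of their tables, which is the combination property mentioned above.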
2.1.2 Feature Scaling

Feature scaling consists of transforming the features' domain so that they are on a similar scale. Commonly, scaling refers to normalizing, i.e., transforming features such that their mean is x̄ = 0 and their standard deviation is σ = 1. In batch learning, feature scaling is both an important and an uninteresting topic, often added to the data transformation pipeline without much thought about the process. It is important when fitting learners that rely on gradient descent (e.g., neural networks), as these converge faster if features are on about the same scale, and learners that rely on distances among instances (e.g., k-means, k-nearest neighbors, and others), as it prevents one dimension with a wide range of values from dominating the others when calculating distances. The two most popular approaches consist of (i) centralizing the data by subtracting the mean and dividing by the standard deviation, or (ii) dividing each value by the range (max - min).

It is unfeasible to perform feature scaling for data streams in this manner, as the aggregate statistics must be estimated throughout the execution. For landmark window approaches there are exact solutions for incrementally computing the mean and standard deviation without storing all the points seen so far. We need to maintain only three numbers: the number of data points seen so far, n; the sum of the data points, ∑x; and the sum of the squares of the data points, ∑x². These statistics are easy to compute incrementally. The mean is given by (∑x)/n, and the variance is given by (∑x² - (∑x)²/n)/(n - 1).
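As an illustration, the sketch below standardizes a stream using exactly the three statistics above over a landmark window; for very long streams, a numerically stabler update (e.g., Welford's algorithm) would be preferable.

```python
class RunningStandardizer:
    """Standardize stream values using the three incremental statistics
    described above: n, sum of x, and sum of x squared (landmark window)."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def update(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mean(self):
        return self.sum_x / self.n

    def variance(self):
        # (sum(x^2) - (sum(x))^2 / n) / (n - 1), as in the text.
        return (self.sum_x2 - self.sum_x ** 2 / self.n) / (self.n - 1)

    def transform(self, x):
        # Early values are scaled with rough estimates, which is
        # unavoidable in a single pass over the stream.
        if self.n < 2:
            return 0.0
        std = self.variance() ** 0.5
        return (x - self.mean()) / std if std > 0 else 0.0

scaler = RunningStandardizer()
for value in [12.0, 15.0, 11.0, 14.0, 30.0]:
    scaler.update(value)
    print(round(scaler.transform(value), 3))
```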
In landmark windows, it is easy and fast to maintain exact statistics by storing a few numbers. However, in time-changing streams this adaptation is too slow. To deal with change, sliding window models are more appropriate. The problem is that the exact computation of the mean or variance over a stream in a sliding window model requires storing all the data points inside the window. Approximate solutions using logarithmic space are the exponential histograms [31]. Exponential histograms store data in buckets of exponentially growing size: 2⁰, 2¹, 2², 2³, and so on. For a window of size W, only log(W) space is required. Recent data are stored at fine granularity, while past data are stored in an aggregated form. The statistics computed using exponential histograms are approximate, with error bounds. The error comes from the last bucket, where it is not possible to guarantee that all of its data is inside the window.

The lack of attention to feature scaling by the data stream mining research community may be justified by the pervasiveness of the Hoeffding Tree algorithm [37]. Hoeffding trees maintain the characteristic of conventional decision trees of being resilient to variations in the features' range of values. Even if there is not much room for theoretical advances in data stream feature scaling, practical implementations would be welcomed by the community (see section 7). The most immediate challenge is to provide efficient implementations of these feature scaling methods that integrate with other operators (e.g., drift detection) in the streaming machine learning pipeline. Finally, another aspect is related to how some learning algorithms were tested on datasets whose features were scaled in an offline process. This certainly affects the results obtained (see section 5), and an online transformation would provide more realistic results.

2.1.3 Feature Discretization

Discretization is a process that divides numeric features into categorical ones using intervals. Depending on the application and the predictive model being used, discretization can bring several benefits, including faster computation, as discrete variables are usually easier to handle than numeric ones, and a decreased chance of overfitting, since the feature space becomes less complex.

Targeting feature discretization from data streams, a significant milestone was the Partition Incremental Discretization algorithm (PiD) [101]. PiD discretizes numeric features in two layers. The first layer is responsible for computing a high number of intervals given the arriving data, while the second uses the statistics calculated in the first layer to compute equal-frequency partitions.

Webb [122] proposed two different schemes for feature discretization: the Incremental Discretization Algorithm (IDA) and IDAW (where W stands for windowing). IDA uses quantile-based discretization on the entire data stream using random sampling, while IDAW maintains a window of the most recent values for an attribute and discretizes these. IDAW requires more computational time than IDA since it must be updated more frequently.

The ChiMerge discretization algorithm [77] stores the features' values in a binary search tree, which makes it more robust to noise in comparison to previous methods. Pfahringer et al. [99] compared a range of discretization schemes for Hoeffding Trees; based on empirical evaluations, the Gaussian approximation was indicated as the best method in terms of accuracy and tree growth.

Finally, similarly to feature scaling, the effort on feature discretization should target the provision of efficient implementations that integrate with the different parts of the streaming process, such as classification systems, drift detection, and evaluation [40].

2.2 Invalid entries handling

Invalid entries may refer to missing values, noise, or other issues that may arise (e.g., unknown formats). Characterizing what is an invalid entry depends on the problem, the algorithms, and the software. For example, categorical features may be deemed invalid by many machine learning tools (e.g., scikit-learn), not because they are inherently wrong, but because the software was designed in a way that does not account for them.

Despite the technical issues related to invalid entries, the most well-known, and studied, problem is missing data. Krempl et al. [73] commented on the relevance of this problem and discussed missing values for the output feature as part of this discussion. We prefer to include this latter case under our discussion of semi-supervised learning (see section 3.2) and solely concentrate on the input data in this section.

To address missing values, imputation methods are relatively standard in batch learning [38]. Imputation methods have not been thoroughly investigated for streaming data yet. This is mostly because these techniques often rely on observing the whole data before imputing the values. Such techniques are feasible in batch learning, but not for streaming data. For example, mean and median imputation will encounter issues such as: how to estimate the mean and the median for evolving data? The issues with mean estimation were previously discussed in section 2.1.2.
An option to avoid aggregate calculations is to apply imputation using a learner. For example, a windowed k-nearest neighbors model can be used, such that the values of the k neighbors are used to infer the value of a missing feature in a given instance.
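A minimal sketch of this idea follows; the window size, the choice of k, and the restriction to a single missing numeric value per instance are simplifying assumptions.

```python
from collections import deque
import math

class WindowedKNNImputer:
    """Impute a missing feature (None) from the k nearest neighbors
    within a sliding window of recent, fully observed instances."""

    def __init__(self, k=3, window_size=500):
        self.k = k
        self.window = deque(maxlen=window_size)

    def _distance(self, a, b, skip):
        # Euclidean distance over every dimension except the missing one.
        return math.sqrt(sum((a[i] - b[i]) ** 2
                             for i in range(len(a)) if i != skip))

    def impute(self, x):
        missing = [i for i, v in enumerate(x) if v is None]
        if len(missing) == 1 and len(self.window) >= self.k:
            j = missing[0]
            neighbors = sorted(self.window,
                               key=lambda w: self._distance(x, w, skip=j))[:self.k]
            x = list(x)
            x[j] = sum(w[j] for w in neighbors) / self.k  # mean of neighbors
        if None not in x:
            self.window.append(tuple(x))  # only complete instances enter the window
        return x

imputer = WindowedKNNImputer(k=2, window_size=100)
for row in [(1.0, 2.0), (1.1, 2.2), (0.9, 1.9), (1.0, None)]:
    print(imputer.impute(row))
```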
2.3 Dimensionality reduction

Dimensionality reduction tackles the retention of the patterns in the input data that are relevant to the learning task. We report the works and gaps on dimensionality reduction techniques that apply transformations to the input data, e.g., Principal Component Analysis (PCA) and Random Projections, while the next section discusses feature selection techniques tailored for data streams and their shortcomings.

Mitliagkas et al. [86] introduced a memory-limited approximation of PCA based on sampling and sketches that can be calculated under reasonable error bounds. Yu et al. [129] proposed a single-pass randomized PCA method; yet, the method has been evaluated solely on a single image dataset. Zhou et al. [133] presented an incremental feature learning algorithm to determine the optimal model complexity for online data based on the denoising autoencoder.

Another set of techniques that are important for dimensionality reduction are those tailored for text data. A notable implementation of dimensionality reduction in such scenarios is the hashing trick provided in Vowpal Wabbit [76]. The hashing trick facilitates the processing of text data: conventional Bag-of-Words and n-grams are unappealing for streaming scenarios, since the entire lexicon for a domain must be known before the learning process starts, an assumption that is hardly met in streaming domains, as out-of-vocabulary words may appear over time. With the hashing trick, each word (feature) in the original text data is converted into a key using a hash function, and this key is used to increment counters in a reduced dimensionality space that is later fed to the learner. Such a process has a vital downside, as reverse lookups are not possible, i.e., one cannot determine which words are the most important for predictions.
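The following minimal example illustrates the mechanism; it is not Vowpal Wabbit's implementation (which uses a much faster hash function), but it shows how a fixed-size representation absorbs previously unseen words.

```python
import hashlib

def hashing_trick(tokens, n_buckets=256):
    """Map a token stream to a fixed-size count vector.
    No lexicon is needed, so out-of-vocabulary words are handled
    naturally, but the mapping cannot be reversed to recover words."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode()).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector

# Two documents, one containing words never seen before: both map
# into the same 256-dimensional space without rebuilding anything.
print(sum(hashing_trick("the spam filter flagged the email".split())))
print(sum(hashing_trick("a brand-new out-of-vocabulary token".split())))
```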
Finally, Pham et al. [100] proposed an ensemble that combines different random projection techniques (Bernoulli, Achlioptas, and Gaussian projections) with Hoeffding Trees, and results have shown that the technique is feasible and competitive against bagging methods when applied to real-world data.

Further investigation is necessary for all the techniques discussed in this section, specifically to investigate the effect of concept drifts. For instance, most of these methods are single-pass; yet, they have been applied to datasets in a batch processing scheme. In real-world streaming scenarios, drifts in the original data will induce changes in the feature transformation outputs of random projections, so closer analysis is required to investigate how classifiers behave according to such changes.
2.4 Feature selection

Feature selection targets the identification of which features are relevant to the learning task. In contrast to dimensionality reduction techniques, feature selection does not apply transformations to the data, and thus the features can still be interpreted. As a by-product, feature selection is also known in batch machine learning for potentially improving computation time, reducing computational requirements, and enhancing the generalization rates of classification systems, as these become less prone to overfitting.

Feature selection methods tailored for batch settings require the entire dataset to determine which features are the most important according to some goodness-of-fit criterion. Nevertheless, this is a requirement that does not hold in streaming scenarios, as new data becomes available over time. Targeting data streams, Barddal et al. [10] showed that Hoeffding (decision) trees [37] and decision rules [71] are the major representatives of classification and regression systems that can incrementally identify which features are the most important.

Incrementally identifying which features are important is a relevant subject in data streams. New methods must be designed so that the feature selection process can identify and adapt to changes in the relevance of features, a phenomenon called feature drift (see section 4.2).

Another critical gap of feature selection in streaming scenarios regards the evaluation of feature selectors. There are different factors to account for when evaluating feature selection proposals. Throughout the years, different quantitative measures, such as accuracy and scalability, and subjective ones, such as "ease of use", have been used to highlight the efficiency of feature selectors [47]. First, it is crucial to assess the behavior of feature selection algorithms when combined with different learners, as each learner builds its predictive model differently despite being fed the same subset of features. On the other hand, it is important to make sure that the feature selection process is accurate, i.e., that the selected subset of features matches the features that are indeed relevant [47].

Finally, feature selectors are expected to be "stable", meaning that they should select the same features despite being trained with different subsets of data [91]. In batch learning, stability metrics measure whether the subsets of features selected across different data samples of the same distribution match [74]. Stable methods are preferred as they facilitate learning a model from the data, i.e., the subset of features is fixed. However, an open challenge in the streaming setting is the contradiction between feature stability and selection accuracy: if the features' importance shifts over time (feature drifts), then the feature selection method will need to compromise either stability or accuracy.

3. THE LEARNING PROCESS

Learning from streaming data requires real-time (or near real-time) updates to a model. There are many important aspects to consider in this 'learning phase', such as dealing with verification latency. In this section, we discuss the relationship between data streams and time series; the problem of dealing with partially and delayed labels; ensemble learning; imbalanced data streams; and the essential issue of detecting anomalies in streaming data.

3.1 Time series

Time series data may commonly arrive in the form of online data, and thus can be treated as a data stream. Another way of seeing it: data streams may often involve temporal dependence and thus be considered as time series. A comparison of data stream and time series methods is given in Žliobaite et al. [121].
It was pointed out that many benchmark datasets used in data streams research exhibit time series elements, as exemplified in Fig. 1.

[Figure 1: A small section of the well-known Electricity dataset, a common benchmark in data stream evaluations. A time series nature is clearly visible with regard to temporal dependence, both in the features (plotted in solid lines) and the class labels (y_t = 0 and y_t = 1, shown above the features).]
Unlike a regular data stream, where instances are assumed to be independently and identically distributed (i.i.d.), data points in a time series are expected to exhibit strong temporal dependence. (The 'identically distributed' assumption may be relaxed in the presence of concept drift; in any case, we may say i.i.d. with regard to a particular concept.)

Data stream methods can be adapted to such scenarios using relatively simple strategies, such as aggregating the labels of earlier instances into the instance space of the current instance [121]. There are special considerations in terms of evaluation under concept drifting streams, in particular regarding the selection of benchmark classifiers (see, again, Žliobaite et al. [121]). Further discussion on evaluation strategies is given in section 5.

Considering a moving window of instances can be viewed as removing time dependence, and thus converting a time series into an ordinary data stream. If P(y_t | x_t, x_{t-1}) = P(y_t | x_t) does not hold, this indicates temporal dependence in the data stream. The idea is to produce a new stream of instances x'_t := [x_t, x_{t-1}, ..., x_{t-ℓ}] over a window of size ℓ sufficient such that P(y_t | x'_t) = P(y_t | x'_t, x'_{t-1}), thus producing a temporally-independent (i.e., 'regular') data stream.
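A minimal sketch of this construction follows; the window length and the univariate example data are arbitrary choices.

```python
from collections import deque

def windowed_stream(stream, window=2):
    """Turn a temporally dependent stream into (approximately) i.i.d.
    instances by concatenating each x_t with its `window` predecessors,
    i.e., x'_t = [x_t, x_{t-1}, ..., x_{t-window}]."""
    history = deque(maxlen=window + 1)
    for x_t, y_t in stream:
        history.appendleft(x_t)
        if len(history) == window + 1:
            x_prime = [v for x in history for v in x]  # flatten the lags
            yield x_prime, y_t

# Univariate example: each emitted instance carries its two predecessors.
data = [((0.1,), 0), ((0.4,), 0), ((0.9,), 1), ((0.8,), 1)]
for x, y in windowed_stream(data, window=2):
    print(x, y)
```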
It is also possible to use a memory device to embed an arbitrarily long series into such an instance representation, for example, by using an echo state network or another kind of recurrent neural network (see section 6.2 for further consideration in the context of streams). Experimentation with such an approach in data streams was carried out in [83] with Hoeffding tree methods.

We can note that the filtering task of sequential state-space models, such as the hidden Markov model (HMM), is directly applicable to classification in data streams. Indeed, one can see HMMs as a sequential version of naive Bayes (a common data-streams benchmark); see Fig. 2 for a graphical intuition. Kalman filters and particle filters can similarly be considered under the continuous output (i.e., regression) scenario. See [8; 41] for a comprehensive overview of these methods.

[Figure 2: A probabilistic graphical model representation of a generative model (such as a hidden Markov model) where temporal dependence is considered, with observations x_1, ..., x_4 and labels y_1, ..., y_4, exemplified over four time points.]

We do remark that, unlike in a typical scenario for these models, learning cannot be done on a full forward-backward pass (such as using the Baum-Welch algorithm [13]), because in a stream there is no fixed end to the sequence, and predictions are needed at the current timestep, not retrospectively. It can, however, be done over a window, and we are not aware of any work that considers this explicitly in the data-stream context: an open challenge.

The open challenge relating to data stream learning more broadly is to draw more solid links to the existing time series literature. On the theoretical level, connections between models in the respective areas should be formalized, thus clarifying which time series models are applicable in data streams and in which contexts. An empirical comparison of methods across these domains is needed, not only to reveal the extent of the time series nature within standard data sources found in the data streams literature, but also to indicate the practical applicability of methods from the rich and diverse time-series literature. Undoubtedly, data stream techniques (in particular those of drift detection) could also be used to enhance time-series methods.

3.2 Semi-Supervised learning

Semi-supervised learning (SSL) problems are challenging, appear in a multitude of domains, and are particularly relevant to streaming applications, where data are abundant but labeled data may be rare (such streams are also referred to as 'partially labelled streams' or 'infinitely delayed streams'). To address SSL problems, one can either ignore the unlabeled data and focus on the labeled data; try to leverage the unlabeled data; or assume some labels are available per request (active learning). The first implies a supervised problem; the second relies on finding and exploiting a specific characteristic of the data; while the third depends on an external agent to provide the required labels on time.

In this section, we focus the discussion on the last two options: leveraging unlabeled data and active learning. Still, it is essential to consider the first option, supervised learning, for practical applications, as discussed by Oliver et al. [92], since a robust supervised model trained only on the labeled data may outperform intricate semi-supervised methods. On top of that, active learning might not always be feasible, as labeling the data can be costly both financially and in terms of time.

Even for supervised problems, it is reasonable to assume that immediately labeled data will not be available. For example, in a real-world data stream problem it is often the case that the algorithm is required to predict x, and only after several time units its true label y becomes available. This problem setting leads to a situation where verification latency, or delay, takes place. One can treat a delayed labeled stream as an SSL problem and ignore that some (or all) of the labels will become available at some point in the future.
Therefore, delayed labeled streams can be tackled with SSL techniques. Figure 3 presents a categorization of how supervised learning, semi-supervised learning, and verification latency are related w.r.t. label arrival time.

[Figure 3: Stream learning according to label arrival time [54]. Labels may arrive immediately, with a delay (fixed or varying), or never (unsupervised); when labels do arrive, either all instances are labeled (supervised) or only some are (semi-supervised).]

Plasse and Adams [102] introduce a taxonomy to determine the delay mechanism and magnitude; present real-world applications where delayed labels occur, e.g., credit scoring; introduce notation for the delayed labeling setting; and show how the set of delayed labels can be used to pre-update the classifier. However, Plasse and Adams [102] do not introduce a specific evaluation procedure that accounts for verification latency.

Žliobaite [134] raises the critical questions of if and when it is possible to detect a concept drift from delayed labeled data. The work is motivated by the large number of real-world problems in which labels are delayed due to the problem characteristics. For example, the ground truth for credit default prediction can only be obtained several months, or years, after a decision was made. The work also discusses the relationship between delayed labeling and active learning: the former concerns when new labels are needed, while the latter is related to which instances must be labeled to minimize cost and maximize prediction performance. It was concluded that both problems are complementary [134].

The SSL techniques for streaming data include unsupervised learning combined with supervised learning, e.g., clustering the unlabeled data and pseudo-labeling it based on the clusters and the existing labeled data [109; 85; 64]; active learning [29]; and hybrid approaches [95]. Each of these approaches makes assumptions about the problem and the data distribution, but these assumptions are not often explicitly discussed.

Active learning is a popular choice for streaming data. Active learning promises to reduce the amount of labeled data required in supervised learning to achieve a given level of predictive performance. In simple terms, the algorithm decides which instances should be labeled given some criteria. There are a few concerns regarding this approach: it presumably assumes that any instance can be labeled, which may not be true (it depends on the domain); and it includes an outsider (usually a human) in the learning process, i.e., someone who is going to provide the labels required by the algorithm. Assuming a traditional evolving stream setting, by the time a label is provided, the concept may have already changed, or the volume of data to be labeled may exceed the capabilities of those responsible for the labeling. For example, even if an algorithm performs well with 5% labeled data, labeling 5% of a data stream that generates thousands of instances per day is still a difficult, and costly, task in a variety of domains.

The amount of literature concerning how to exploit unlabeled instances and how to deal with verification latency has increased in the past years. Still, open issues include: how to effectively evaluate algorithms when labels arrive with delay; how to deal with out-of-order data [78]; and the fundamental theoretical aspects behind existing SSL methods proposed for non-stationary problems. On top of that, some strategies developed for batch data have not been thoroughly explored in a streaming scenario, including multi-view learning and co-training [23], and transductive support vector machines [67; 113]. Finally, transfer learning is a somewhat popular method to alleviate the problem of few labeled instances [94], and it has not been widely adopted for the streaming setting yet.

3.3 Ensemble learning

Ensemble learning receives much attention in data stream learning, as ensembles can be integrated with drift detection algorithms and incorporate dynamic updates, such as the selective removal or addition of base models [53]. On top of that, several issues, such as concept evolution, feature evolution, semi-supervised learning, and anomaly detection, are often approached with an ensemble approach for data streams.

The most common use of ensemble models is to allow recovery from concept drifts. Ensemble models can rely on reactive or active strategies to cope with concept drift. Reactive strategies continuously update the ensemble, often assigning different weights to base models according to their prediction performance. This weighting function may take into account the recency of prediction mistakes, such that correct predictions on the latest instances receive a higher weight than correct predictions on older instances. Examples of these strategies include the Streaming Ensemble Algorithm (SEA) [116] and Dynamic Weighted Majority (DWM) [70].

A canonical example of an active strategy is ADWIN Bagging, i.e., the combination of ADWIN [17] and online bagging [93]. In ADWIN Bagging, each base model's classification output is monitored by an ADWIN instance, and whenever a drift is flagged, the corresponding model is reset.
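The general pattern can be sketched as follows; `make_learner` and `make_detector` are placeholders for concrete components such as a Hoeffding tree and an ADWIN detector, and the `update`/`drift_detected` interface is assumed for illustration rather than taken from an actual library.

```python
import random

def poisson1():
    # Knuth's method for sampling k ~ Poisson(lambda = 1).
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= 0.36787944117144233:  # e^-1
            return k
        k += 1

class DriftResetBagging:
    """Online bagging [93] where each member's errors feed a drift
    detector, and a flagged drift resets that member, mirroring the
    ADWIN Bagging scheme described above."""

    def __init__(self, make_learner, make_detector, n_models=10):
        self.make_learner = make_learner
        self.make_detector = make_detector
        self.models = [make_learner() for _ in range(n_models)]
        self.detectors = [make_detector() for _ in range(n_models)]

    def predict(self, x):
        votes = [model.predict(x) for model in self.models]
        return max(set(votes), key=votes.count)  # majority vote

    def learn_one(self, x, y):
        for i in range(len(self.models)):
            self.detectors[i].update(1 if self.models[i].predict(x) != y else 0)
            if self.detectors[i].drift_detected():
                self.models[i] = self.make_learner()      # reset drifted member
                self.detectors[i] = self.make_detector()
            for _ in range(poisson1()):                   # online bagging weight
                self.models[i].learn_one(x, y)
```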
Recently, ensembles that combine reactive and active strategies have been proposed in the literature. Examples include the Adaptive Random Forest (ARF) [54] and Streaming Random Patches (SRP) [57], which are adaptations of the Random Forest [24] and Random Patches [82] algorithms, respectively, to streaming data, with the addition of drift detectors and weighting functions based on the models' prediction performance. The main difference between ARF and SRP is that ARF is based on local subspace randomization, while SRP uses a global subspace randomization strategy. In [57], the authors showed that the global subspace strategy yields a more flexible model, which increases the diversity among base models and the overall accuracy of the ensemble.

An ensemble-based method is often used along with other techniques to address concept evolution. Even though the ensemble may not be used directly to detect the novel classes, it is useful for dynamically incorporating novel class instances into the whole learning model without major changes to the already learned models, i.e., the other ensemble members.
For instance, ensemble approaches combined with a One-Versus-All (OVA) approach to address concept evolution include OVA Decision Trees [63], Learn++.NC, and Learn++.UDNC [35].

Ensemble strategies have been used to address feature drift problems by removing/adding the influence of specific features through single-feature classifiers, such that if a feature disappears or is identified as irrelevant, its influence can be wholly removed from the system by removing the classifier associated with it. This approach is similar to the one mentioned previously to cope with concept evolution (one classifier per class), and it is the approach used in HSMiner [96], with the addition of using different classifiers according to the feature domain. On top of that, using single-feature classifiers, or a limited number of features per learner, improves the algorithm's scalability, as its processing can be distributed among multiple machines using a map-reduce approach [61].

The flexibility that an ensemble strategy allows (i.e., adding and removing models) makes it an attractive strategy to deal with partially labeled (i.e., semi-supervised) streams. An example is the SluiceBox AnyMeans (SluiceBoxAM) algorithm [97]. SluiceBoxAM is based on the SluiceBox algorithm [95], which already combined different methods to address problems such as multi-domain features, concept drift, novel class detection, and feature evolution. Besides using a clustering method (AnyMeans) capable of discovering non-spherical clusters, SluiceBoxAM can be used with other ensemble classifiers; e.g., Parker and Khan [97] report the performance of SluiceBoxAM combined with the leveraging bagging algorithm [21]. The overall idea behind SluiceBoxAM is shared by other ensemble methods that are combined with clustering methods.

Open issues related to the deployment of ensemble methods on practical streaming data include overly complex models and massive computational resource demands. Some algorithms are too complex, as they contain different learning strategies and heuristics to cope with various problems. While addressing different problems is a good trait, these models are often too complicated for a framework developer to understand all of their idiosyncrasies, which makes them less attractive to implement and deploy in practice. On top of that, as discussed by Gomes et al. [53], the combination of too many heuristics raises the question: "Does the ensemble perform well because of the combination of all its methods, or simply because some of them are very effective, while others are effectively useless or even harmful?"

To address these questions, it is important to present in-depth analyses of how each of the strategies embedded into the ensemble behaves individually and in combination with the others. This requires creating experiments that go beyond measuring the overall ensemble classification performance. In general, it is difficult to isolate aspects of a method for comparison, but it is worthwhile to verify whether this is possible when proposing a novel method, especially if it lacks theoretical guarantees.

The computational resources used by a machine learning algorithm developed for data streams are of critical importance. An accurate, yet inefficient, method might not be fit for use in environments with strictly limited resources. Some ensemble algorithms approach this problem by removing redundant models when their current predictions are too similar [55; 56] or when their coverage overlaps during training [109]. These techniques may not enhance the ensemble from the learning performance perspective; in fact, they might negatively impact it. Algorithms that solve more problems are often more challenging to manage, for example, algorithms that combine clustering and ensembles to address partially labeled streams. One approach is to distribute the computation using multiple threads [54]. However, there are limits to what can be accomplished with algorithms executed on a single machine, even if they are multi-threaded. As a consequence, the machine learning community is investing effort into scalable and distributed systems.

The challenge is how to maintain the characteristics of the ensemble methods while efficiently distributing them over several machines. Some ensemble methods are straightforward to adapt to distributed environments (e.g., bagging), while others are more complicated (e.g., random forests). Efforts have been driven towards integrated platforms for stream learning in this context, which resulted in frameworks (or libraries) such as Apache Scalable Advanced Massive Online Analysis (SAMOA) [33] and StreamDM. Currently, there are efforts in deploying stream learning algorithms (ensembles included) in a distributed setting. Examples include the Streaming Parallel Decision Tree [14], HSMiner [61], and the Vertical Hoeffding Tree (VHT) [72]. Ensembles are attractive techniques, as discussed in this section, and they are probably going to play an essential role in stream processing software, such as Apache Spark Streaming [130] and Apache Flink [25].

3.4 Imbalanced Learning

Imbalanced datasets are characterized by one class outnumbering the instances of the other [80]. The latter is referred to as the minority class, while the former is identified as the majority class. These concepts can be generalized to multi-class classification and other learning tasks, e.g., regression [118]. The imbalance may be inherent to the problem (intrinsic) or caused by some fault in the data acquisition (extrinsic). Learning from imbalanced datasets is challenging, as most learning algorithms are designed to optimize for generalization, and as a consequence, the minority class may be completely ignored.
The approaches for dealing with imbalanced datasets commonly rely on cost-sensitive learning, resampling methods (oversampling and undersampling), and ensemble learning. Cost-sensitive strategies rely on assigning different weights to incorrect predictions. This can be used to increase the cost of a minority class error, which 'biases' the learning process in its favor. Resampling methods rely on removing instances from the majority class (undersampling) or creating synthetic instances for the minority class (oversampling). These methods tend to be costly even for batch learning, as they generally require multiple distance computations among training instances, e.g., SMOTE [26]. Finally, ensemble strategies for imbalanced learning use the ensemble structure alongside cost-sensitive learning or resampling strategies [46].
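As a rough illustration of how oversampling can be emulated online without storing or synthesizing instances, the sketch below replicates training updates for currently rare labels, in the spirit of Poisson-based online bagging; the rate heuristic and the `learn_one` learner interface are assumptions, not a published method.

```python
import random

def poisson(lam):
    # Knuth's method; adequate for the small rates used here.
    k, p, threshold = 0, 1.0, 2.718281828459045 ** -lam
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

class OnlineOversampler:
    """Wrap an incremental `learner` (anything exposing learn_one) and
    replicate training updates for labels that are currently rare,
    emulating oversampling without storing or synthesizing instances."""

    def __init__(self, learner):
        self.learner = learner
        self.counts = {}

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
        # Rate grows as the share of label y shrinks relative to the
        # most frequent label; balanced streams keep the rate near 1.
        rate = max(self.counts.values()) / self.counts[y]
        for _ in range(max(1, poisson(rate))):
            self.learner.learn_one(x, y)
```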
Besides the issues related to dataset imbalance, other challenges may arise in a streaming scenario. For example, given two class labels whose overall distribution is balanced, one of them may be underrepresented for a given period, thus leading to a 'temporary' imbalance. Another possibility is that the class distribution variations indicate a concept drift (e.g., a period of transition in a gradual drift) or perhaps a concept evolution (e.g., one of the classes is disappearing). How to differentiate between these situations and propose general strategies to address them is still an open issue. This motivates the development of methods to address imbalanced streaming data; examples include Learn++.NSE [42], SMOTE [36], REA [27], and an adapted Neural Network [52]. Finally, another challenge is how to develop algorithms that are effective in addressing the imbalance problem without compromising computational resources. To this end, the cost-sensitive and ensemble strategies seem to be more effective than the resampling strategies (especially oversampling).

3.5 Anomaly detection

A significant task in imbalanced learning is that of anomaly detection. Supervised anomaly detection, where labeled normal examples and anomalies are part of the dataset, is indistinguishable from a standard imbalanced learning problem, where the anomalous instances belong to the underrepresented (minority) class. However, in most practical scenarios, it is not feasible to get verified labels, particularly for non-stationary data streams. Therefore, in a real scenario, one might need to choose between unsupervised learning and semi-supervised learning.

For the sake of generality, many methods assume that no labeled data is available, and the problem is tackled essentially as a clustering or density-estimation task. In this case, instances 'too far' from the centers of established clusters, or densities, are considered anomalies. Existing clustering algorithms for streaming data, such as CluStream [3], can be used for this purpose. However, a challenge is that some of these methods rely on an offline step where the actual clustering method, e.g., k-means, is executed; the online step is responsible only for updating the data structures that summarize the incoming data. Salehi and Rashidi [110] present a recent survey on anomaly detection for evolving data, with a particular focus on unsupervised learning approaches.
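A deliberately simplified sketch of this 'distance from established clusters' idea is shown below; real systems such as CluStream maintain richer micro-cluster statistics and a separate offline clustering phase, both omitted here.

```python
class StreamingCentroidScorer:
    """Score each instance by its distance to incrementally maintained
    centroids; being far from every centroid suggests an anomaly."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold
        self.centroids = []  # each entry: [center_vector, instance_count]

    def score(self, x):
        if not self.centroids:
            return float("inf")
        return min(sum((a - b) ** 2 for a, b in zip(c, x)) ** 0.5
                   for c, _ in self.centroids)

    def update(self, x):
        if self.score(x) > self.threshold:
            self.centroids.append([list(x), 1])  # open a new micro-cluster
            return
        nearest = min(self.centroids,
                      key=lambda cn: sum((a - b) ** 2 for a, b in zip(cn[0], x)))
        nearest[1] += 1
        for i, value in enumerate(x):            # incremental mean update
            nearest[0][i] += (value - nearest[0][i]) / nearest[1]

scorer = StreamingCentroidScorer(threshold=2.0)
for point in [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (9.0, 9.0)]:
    print(round(scorer.score(point), 2), end=" ")  # inf 0.22 0.34 11.31
    scorer.update(point)
```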
In some scenarios, a small set of labeled normal instances and anomalies is available. This case can be characterized neither as supervised nor as unsupervised, but as semi-supervised learning (see section 3.2). In semi-supervised anomaly detection, it is assumed that some normal examples and anomalies will not be labeled; beyond that, some anomalies might not even be known beforehand, while others might cease to exist altogether. The critical aspect in such a scenario is the evolution of the labels over time. This can take the form of adversarial machine learning [65], where an adversarial opponent actively attempts to jeopardize the learning model using several types of attacks; Barreno et al. [12] present a taxonomy of such attacks. Furthermore, a problem where labels appear and disappear over time can be formulated as an evolving concept (or novel class detection) problem [88] (see section 4).

We highlight the need for further discussion around the intersection among anomaly detection, adversarial learning, semi-supervised learning, and novel class detection. Finally, some of the latest proposed algorithms for data stream anomaly detection are RS-Forest [125], the Robust Random Cut Forest [59], the Threaded Ensemble of Autoencoders [39], and OnCAD [28].

4. REASONING ABOUT THE LEARNING PROCESS

Learning from data streams is a continuous process. Learning systems that act in dynamic environments, where working conditions change and evolve, need to monitor those conditions. They need to monitor the learning process for change detection, the emergence of novel classes, changes in the relevance of features, changes in the optimal parameter settings, and others. The goal of this section is to discuss the design aspects of learning systems that can monitor their own performance. In a general sense, these learning systems should be capable of self-diagnosis when performance degrades, identifying the possible causes of the degradation, and of self-repairing or self-reconfiguring to recover to a stable status.

Much research has been devoted to characterizing concept drift [123], detecting and recovering from it [50], and recurrent concepts [48; 49]. Since this is a frequent topic when discussing learning from data streams, we refrain from reviewing it entirely and focus mostly on current issues related to it, such as feature drifts and their relationship to feature selection; drift detection under unsupervised/semi-supervised and delayed labeled scenarios; and hyperparameter tuning.

4.1 Concept drift and label availability

Novel concept drift detection algorithms are proposed each year, and their performance is assessed using different methods (see section 5). Most of these algorithms are applied to the univariate stream of correct/incorrect predictions of a learner. To achieve detections in a timely fashion, this requires that the ground truth be available almost immediately after the prediction is made. This 'immediate' setting can be characterized by the ground truth y_t of instance x_t being available before the next instance x_{t+1} arrives (see section 3.2). Algorithms such as ADWIN [17] and EDDM [7] were tested under the aforementioned assumption. However, if the ground truth is not immediately available, then these algorithms' ability to detect drifts might be severely decreased.

Algorithms focusing on drift detection on delayed or partially labeled streams exist. Examples include SUN [126] and the method from Klinkenberg [69] based on support vector machines. The former uses a clustering algorithm to produce 'concept clusters' at the leaves of an incremental decision tree, and drifts are identified according to the deviation between historical concept clusters and the current clusters.

Žliobaite [134] presents an analytical view of the conditions that must be met to allow concept drift detection in a delayed labeled setting. Three types of concept drift are analytically studied, and two of them are also empirically evaluated. Unfortunately, one of the least investigated cases, when the change occurs in the input data distribution, was not empirically investigated. Therefore, the proposed methods to detect changes in the input data, such as parametric and non-parametric multivariate two-sample tests, were not discussed in depth. We further address the problem of identifying changes in the input data distribution in section 4.2.
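As an illustration, the sketch below monitors the univariate stream of correct/incorrect predictions with a naive windowed comparison; it is a toy stand-in for detectors such as ADWIN or EDDM, which rely on principled statistical tests. It also happens to match the detector interface assumed by the ensemble sketch in section 3.3.

```python
from collections import deque

class ErrorRateMonitor:
    """Flag a drift when the error rate over a recent window exceeds
    the long-run error rate by a fixed margin."""

    def __init__(self, window_size=100, margin=0.15):
        self.recent = deque(maxlen=window_size)
        self.errors = 0
        self.seen = 0
        self.margin = margin

    def update(self, error):          # error: 1 = misclassified, 0 = correct
        self.recent.append(error)
        self.errors += error
        self.seen += 1

    def drift_detected(self):
        if self.seen < 2 * self.recent.maxlen:
            return False              # not enough history yet
        recent_rate = sum(self.recent) / len(self.recent)
        overall_rate = self.errors / self.seen
        return recent_rate > overall_rate + self.margin
```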
4.2 Feature drift

Data streams are subject to different types of concept drift. Examples include (i) changes in the values of a feature and their association with the class, (ii) changes in the domain of features, and (iii) changes in the subset of features that are used to label an instance.

Despite being considered in seminal works of the area [124], works on this last type of drift have gained traction only recently. A feature drift occurs when a subset of features becomes, or ceases to be, relevant to the learning task [10]. Following the definition provided by Zhao et al. [132], letting S_i = X \ {x_i}, a feature x_i is deemed relevant if there exists S_i' ⊂ S_i such that P(Y | x_i, S_i') > P(Y | S_i') holds, and irrelevant otherwise. Given this definition, the removal of a relevant feature decreases the prediction power of a classifier. Also, there are two possibilities for a feature to be relevant: (i) it is strongly correlated with the class, or (ii) it forms a subset with other features, and this subset is correlated with the class [132]. An example of feature importance varying over time can be visualized in Fig. 4, where the Information Gain of two features, w.r.t. the target variable, is plotted over time for the SPAM CORPUS [68] dataset.

[Figure 4: Information Gain variation over time of two features ('directed' and 'listinfo') across the first 10,000 instances processed of the SPAM CORPUS. Adapted from [9].]

As in conventional drifts, changes in the relevant subset of features affect the class-conditional probabilities P(Y | X), as the decision boundary across classes changes. Therefore, streaming algorithms should be able to detect these changes, enabling the learning algorithm to focus on the relevant features, leading to lighter-weight and less overfit models.

To address feature drifts, few proposals have been presented in the literature. Barddal et al. [10] showed that Hoeffding Adaptive Trees [18] are the state-of-the-art learners for identifying changes in the relevance of features and adapting the model on the fly. Another important work that explicitly focuses on performing feature selection during the progress of data streams was HEFT-Stream [90], where new features are selected as new batches of arriving data become available for training.

Finally, the assessment of feature drifting scenarios should not only account for the accuracy rates of learners, but also for whether the feature selection process correctly flags the changes in the relevant subset of features and whether it identifies the features it should. Given that, the evaluation of feature selectors should also be dynamic, as the ground-truth subset of relevant features may drift over time.
Examples include (i) changes in the values of a feature and
their association with the class, (ii) changes in the domain 4.4 Concept evolution
of features, (iii) changes in the subset of features that are Concept evolution is a problem intrinsically related to oth-
used to label an instance, and so on. ers, such as anomaly detection for streaming data [43]. In
Despite considered in seminal works of the area [124], only general terms, concept evolution refers to the appearance
recently works on the above-emphasized type of drift have or disappearance of class labels. This is natural in some
gained traction. A feature drift occurs when a subset of fea- domains, for example, the interest of users in news media
tures becomes, or ceases to be, relevant to the learning task change over time, with new topics appearing and older ones
[10]. Following the definition provided by Zhao et al. [132], disappearing. Another example is Intrusion Detection Sys-
a feature xi is deemed relevant if ∃Si = X \ {xi }, Si′ ⊂ Si , tems, where new threats appear as attackers evolve. Ideally,
such that P (Y |xi , Si′ ) > P (Y |Si′ ) holds; and irrelevant oth- these threats must first be identified and then used for im-
erwise. Given the previous definition, the removal of a rel- proving the model, however doing it automatically is a dif-
evant feature decreases the prediction power of a classifier. ficult task. Informally, the challenge is to discern between
Also, there are two possibilities for a feature to be relevant: concept drifts, noise and the formation of a novel concept.
(i) it is strongly correlated with the class, or (ii) it forms a Examples of algorithms that address concept evolution in-
subset with other features, and this subset is correlated with cludes: ECSMiner [84], CLAM [5], and MINAS [32].
the class [132]. An example of features importance varying A major challenge here is the definition of evaluation setup
over time can be visualized in Fig. 4, where the Information and metrics to assess algorithms that detect concept evolu-
Gain for two features, w.r.t the target variable, is plotted tion.
over time for the SPAM CORPUS [68] dataset.
Concept evolution is relevant in many application domains: for example, the interest of users in news media changes over time, with new topics appearing and older ones disappearing. Another example is Intrusion Detection Systems, where new threats appear as attackers evolve. Ideally, these threats must first be identified and then used to improve the model; however, doing so automatically is a difficult task. Informally, the challenge is to discern between concept drifts, noise, and the formation of a novel concept. Examples of algorithms that address concept evolution include ECSMiner [84], CLAM [5], and MINAS [32]. A major challenge here is the definition of evaluation setups and metrics to assess algorithms that detect concept evolution.

4.5 Hyperparameter tuning for evolving data streams

Hyperparameter tuning (or optimization) is often treated as a manual task where experienced users define a subset of hyperparameters and their corresponding ranges of possible values to be tested exhaustively (Grid Search), randomly (Random Search), or according to some other criterion [11]. The brute-force approach of trying all possible combinations of hyperparameters and their values is time-consuming, but it can be efficiently executed in parallel in a batch setting. However, it is difficult to emulate this approach in an evolving streaming scenario. A naive approach is to separate an initial buffer from the first instances seen and perform an offline tuning of the model hyperparameters on it. Nevertheless, this makes the strong assumption that the selected hyperparameter values will remain optimal even if the concept drifts. The challenge is to design an approach that incorporates hyperparameter tuning as part of the continual learning process, which might involve data preprocessing, drift detection, drift recovery, and others.
Losing et al. [81] present a review and comparison of incremental learners, including SVM variations, tree ensembles, instance-based models, and others. Interestingly, this is one of the first works to benchmark incremental learners using a strategy to perform hyperparameter optimization. To perform the tuning, a minimum of 20% of the training data (or 1,000 instances, whichever is reached first) was gathered. Assuming a stationary distribution, this approach performs reasonably well. Experiments with non-stationary streams are also briefly covered by the authors, but since the algorithms used were not designed for this setting, it was not possible to draw further conclusions about the efficiency of performing hyperparameter optimization on drifting streams.
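The buffer-then-tune strategy can be emulated with any incremental learner. Below is a minimal sketch, with an assumed sklearn-style partial_fit learner and an illustrative grid; it performs a prequential (test-then-train) pass over the initial buffer for each configuration and keeps the best one.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import SGDClassifier

def tune_on_buffer(X_buf, y_buf, grid):
    """Grid search with a prequential pass over an initial buffer."""
    best_err, best_cfg = float("inf"), None
    for values in product(*grid.values()):
        cfg = dict(zip(grid, values))
        model, errors = SGDClassifier(**cfg), 0
        model.partial_fit(X_buf[:1], y_buf[:1], classes=np.unique(y_buf))
        for x, y in zip(X_buf[1:], y_buf[1:]):
            errors += int(model.predict([x])[0] != y)   # test ...
            model.partial_fit([x], [y])                 # ... then train
        if errors < best_err:
            best_err, best_cfg = errors, cfg
    return best_cfg   # assumed to stay optimal after the buffer ends

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] > 0).astype(int)
print(tune_on_buffer(X, y, {"alpha": [1e-4, 1e-2],
                            "loss": ["hinge", "perceptron"]}))
```

The comment on the return statement is exactly the stationarity assumption criticized above: nothing revisits the chosen configuration if the concept later drifts.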
A recent work [120] formulates the problem of hyperparameter tuning as an optimization problem and uses the Nelder-Mead algorithm to explore the parameter space. The Nelder-Mead method [89], or downhill simplex method, is a numerical method used to find the minimum or maximum of a function in a multidimensional space.
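To make the optimization view concrete, the sketch below minimizes a prequential error with SciPy's Nelder-Mead implementation over (the log of) a single learning-rate-like hyperparameter. The data, model, and objective are illustrative assumptions; this is not the implementation of [120].

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic concept

def prequential_error(params):
    model = SGDClassifier(alpha=10.0 ** params[0], random_state=0)
    model.partial_fit(X[:10], y[:10], classes=[0, 1])
    errors = 0
    for i in range(10, 1000):                    # test-then-train on a buffer
        errors += int(model.predict(X[i:i + 1])[0] != y[i])
        model.partial_fit(X[i:i + 1], y[i:i + 1])
    return errors / 990

# Downhill simplex search over the (here one-dimensional) parameter space.
result = minimize(prequential_error, x0=[-4.0], method="Nelder-Mead")
print("best log10(alpha):", result.x[0])
```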
5. EVALUATION PROCEDURES AND DATA SOURCES

As the field evolves and practitioners, besides researchers, start to apply the methods, it is critical to verify whether or not the currently established evaluation metrics and benchmark datasets fit real-world problems. Selecting appropriate benchmark data is important to avoid making assumptions about an algorithm's quality based on empirical tests over data that might not reflect realistic scenarios.

Existing evaluation frameworks address issues such as imbalanced data, temporal dependences, cross-validation, and others [16]. For example, when the input data stream exhibits temporal dependences, a useful benchmark model is a naive No Change learner. This learner always predicts that the next label will be the same as the previous label and, surprisingly, it may surpass advanced learning algorithms that build complex models from the input data. Žliobaitė et al. [135] propose the κ-temporal statistic, which incorporates the No Change learner into the evaluation metric.
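The No Change baseline and the resulting statistic are easy to state in code. Below is a minimal sketch following the definition in [135], $\kappa_{per} = (p_0 - p_e')/(1 - p_e')$, where $p_0$ is the classifier's accuracy and $p_e'$ is the accuracy of the No Change learner; the toy labels are illustrative.

```python
def kappa_temporal(y_true, y_pred):
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # No Change learner: predict, for each instance, the previous label.
    pe = sum(y_true[i] == y_true[i - 1]
             for i in range(1, len(y_true))) / (len(y_true) - 1)
    return (p0 - pe) / (1.0 - pe) if pe < 1.0 else float("nan")

y_true = [1, 1, 1, 0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
# Values <= 0 mean the classifier is no better than predicting No Change.
print(kappa_temporal(y_true, y_pred))
```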
However, one crucial issue related to the evaluation of streaming algorithms is the lack of appropriate approaches to evaluate delayed labeling problems. As previously discussed (see section 3.2), in a delayed setting there is a non-negligible delay between the input data x and the ground-truth label y, which can vary from a few minutes or hours up to years, depending on the application. A naive approach to evaluating the quality of such solutions is to record the learner's prediction ŷ when x is presented and then compare it against y once it becomes available. One issue with this approach is that in some applications, such as peer-to-peer lending and airline delay prediction, the learner will be polled several times with the same x before y is available, potentially improving its performance as time goes by, as other labels are made available and used to update the model. Ideally, the learner should be capable of outputting good results from the first prediction, when x is presented; however, how can we measure its ability to improve over time before y is made available? Despite works that address delayed labeling, evaluation frameworks have only recently been proposed [58].
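Evaluation under delayed labels can be prototyped with a queue of pending predictions. The sketch below records the prediction when x arrives and scores it only when the (simulated, fixed-delay) label arrives; the toy majority-class learner and its predict/learn interface are assumptions for illustration. Note that it scores only the first prediction; measuring how a learner improves on a pending x before its label arrives, the open question above, would require re-querying the model for all pending instances at every step.

```python
from collections import Counter, deque

class MajorityClass:
    """Toy incremental learner with an assumed predict/learn interface."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

def delayed_prequential(model, stream, delay):
    """stream yields (x, y); y is only revealed `delay` steps after x."""
    pending, correct, total = deque(), 0, 0
    for x, y in stream:
        pending.append((x, model.predict(x), y))   # record prediction now
        if len(pending) > delay:                   # oldest label arrives now
            x_old, y_hat, y_old = pending.popleft()
            correct += int(y_hat == y_old)
            total += 1
            model.learn(x_old, y_old)              # train only on labeled data
    return correct / max(total, 1)

stream = [((i,), int(i > 50)) for i in range(100)]
print(delayed_prequential(MajorityClass(), stream, delay=10))
```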
5.1 Benchmark data

Data stream algorithms are usually assessed using a benchmark that is a combination of synthetic generators and real-world datasets. The synthetic data is used to show how the method performs on specific problems (e.g., concept drifts, concept evolution, feature drifts, and so forth) in a controlled environment. The real-world datasets are used to justify the application of the method beyond hypothetical situations; however, they are often used without guarantees that the issues addressed by the algorithm are actually present. For example, it is difficult to check if and when a concept drift takes place in a real dataset. The problem is that some of the synthetic streams can be considered too straightforward and perhaps outdated, e.g., the STAGGER [111] and SEA [116] datasets.

When it comes to real-world data streams, some researchers use datasets that do not represent data streams, or synthetic data masquerading as real datasets. An example that covers both concerns (not a stream and actually synthetic) is the dataset named Pokerhand (https://archive.ics.uci.edu/ml/datasets/Poker+Hand); at some point in the past it was used to assess the performance of streaming classifiers, probably because of its volume. However, it is neither 'real' nor a representation of a stream of data, and it remains in use today without any reasonable justification. Even benchmark datasets that can be interpreted as actual data streams display some unrealistic characteristics that are often not discussed. Electricity [62] depicts the energy market of the Australian New South Wales Electricity Market; even though the data was generated over time, the dataset version commonly used was preprocessed offline to normalize the features, which might benefit some algorithms or at least 'leak' some future characteristics (see section 2.1.2).

Issues with evaluation frameworks are not limited to supervised learning in a streaming context. For instance, assessing concept drift detection algorithms is also subject to controversy. A standard approach to evaluate a novel concept drift detector is to combine it with a classification algorithm and assess its detection capabilities indirectly, by observing the classification performance of the accompanying algorithm. The problem is that this evaluation is indirect; thus, actual characteristics of the drift detection algorithm, such as the lag between drift and detection, cannot be observed from it. This issue is detailed in a recent work [15].

Why are we not using real data streams to assess the performance of stream learning algorithms? One possible answer is the difficulty of preparing sensor data. Even though such data is abundant, it is still necessary to transform it into a suitable format, and often this means converting from a multivariate time series (see section 3.1) to a data stream. Another possible answer is that realistic data stream sources can be complicated to configure and to replicate.
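The normalization leak mentioned above for Electricity can be avoided by computing statistics strictly online. A minimal sketch, using Welford's running mean and variance (an assumption; any single-pass estimator would do):

```python
class OnlineStandardizer:
    """Standardize each value using statistics of already-seen data only."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):          # Welford's update for mean and variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        std = (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0
        return (x - self.mean) / std if std > 0 else 0.0

scaler = OnlineStandardizer()
for x in [5.0, 6.0, 4.0, 5.5]:
    z = scaler.transform(x)   # no information from future instances 'leaks'
    scaler.update(x)
```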
5.2 Handling real streaming data

For actual implementations, an important aspect of streaming data is how the data is made available to the system. High-latency data sources will 'hold' the whole system, and there is nothing the learning algorithm can do to solve that. Differently from batch learning, the data source for streaming data is often harder for beginners to grasp. It is not merely a self-contained file or a well-defined database; in fact, it has to allow the appearance of new data with low latency, in a way that the learner is updated as soon as possible when new data is available. At the vanguard of stream processing there are frameworks such as Apache Spark and Flink.

Recently, the team behind Apache Spark introduced a novel API to handle streaming data, namely the Structured Streaming API [6], which overshadows the previous Spark Streaming API [130]. Similar to its predecessor, Structured Streaming is mainly based on micro-batches: instead of immediately presenting new rows of data to the user code, the rows are combined into small logical batches. This facilitates manipulating the data, as truly incremental systems can be both difficult for the user to handle and hard for the framework developer to implement efficiently. Besides implementation details, the main difference between Spark Streaming and the new Structured Streaming API is that the latter assumes there is a structure to the streaming data, which makes it possible to manipulate the data using SQL under the abstraction of an unbounded table.
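To make the unbounded-table abstraction concrete, below is a minimal PySpark sketch of the canonical streaming word count; the local socket source and console sink are assumptions for illustration. The query is written as if `lines` were a static table, and Spark compiles it into an incremental plan executed over micro-batches.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# `lines` behaves like a table to which new rows are continuously appended.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()   # ordinary SQL-style aggregation

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```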
6. STREAMING DATA IN ARTIFICIAL INTELLIGENCE

In this section, we look at stream mining in recent advanced topics in artificial intelligence, by which we mean tasks that fall outside the traditional single-target classification or regression scenario.

6.1 Prediction of Structured Outputs

In structured output prediction, values for multiple target variables are predicted simultaneously (for each instance). A particularly well-known special case is multi-label classification [106; 34; 131], where multiple labels are associated with each data point: a natural point of departure for many text and image labeling tasks.

Methods can be applied directly in a 'problem transformation' scenario or adapted in an 'algorithm adaptation' scheme [104]. However, obtaining scalable models is inherently more challenging, since the output space is $K^L$ for $L$ label variables each taking $K$ values (e.g., $2^{10} = 1024$ label combinations for just ten binary labels), as opposed to $K$ for a $K$-class (single-label) problem.
In other words, the output space may be of the same range of variety and dimensionality as the input space. As such, we can consider the issues and techniques outlined in sections 4.2 and 4.3.

We emphasize that in multi-output data streams there is an additional complication involving concept drift, which now covers an additional dimension: models are inherently more complex and more difficult to learn, so there is even greater motivation to adapt them as well as possible to the new concept when drift occurs, rather than resetting them. This is further encouraged by the consideration that supervised labeling is less likely to be complete under this scenario.

Structured-output learning is the case of multi-output learning where some structure is assumed in the problem domain. For example, in image segmentation, local dependence is assumed among pixel variables, and in modeling sequences, temporal dependence is often assumed among the output variables. However, essentially all multi-label and multi-output problems have some underlying structure and thus are in fact structured-output problems in the strict sense. Indeed, many sequence prediction and time-series models can be applied practically as-is to multi-label problems, and vice-versa [105]. This could include recurrent neural networks, which we review in section 6.2, or the methods already mentioned in section 3.1.

Therefore, the main challenges are dealing with the inherently more complex models and drift patterns of streams with structured outputs. Complex structured output prediction tasks, such as captioning, have yet to be approached in a data-stream context.

6.2 Recurrent Neural Networks

Many structured-output tasks can be approached with recurrent neural networks (RNNs). These are inherently robust and well suited to dealing with sequential data, particularly text (natural language) and signals with high temporal dependence. See, e.g., Du and Swamy [41] for a detailed overview.

RNNs are notoriously difficult to train. Obtaining good results on batch data can already require exhaustive experimentation with parameter settings, which is not easily affordable in the streaming context. Long Short-Term Memory networks (LSTMs) have gained recent popularity, but they are still challenging to train on many tasks.
There are simplified RNNs which have been very effective, such as Time Delay Neural Networks, which simply include a window of the input as individual instances; these were considered, for example, by Žliobaitė et al. [121] in the context of streams.
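The Time Delay idea amounts to a sliding window over the series; a couple of lines of numpy suffice (the window width and next-value target are illustrative choices):

```python
import numpy as np

def windowed_instances(series, width):
    """Turn a univariate series into fixed-size instances for any learner."""
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = np.array(series[width:])           # target: the next value
    return X, y

X, y = windowed_instances(np.sin(np.linspace(0, 10, 200)), width=5)
```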
Another useful variety of RNN, more suited to data streams, is the Echo State Network (ESN). The weights of the hidden layer of an ESN are randomly initialized and not trained; only the output layer (usually a linear one) is trained, and stochastic gradient descent will suffice in many contexts. ESNs are an interesting way to embed signals into vectors, making them a good starting point for converting a time series into an i.i.d. data stream which can be processed by traditional methods (see also the discussion in section 3.1).
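A minimal numpy sketch of the ESN recipe follows: the reservoir is random and fixed (the size, scaling, and spectral radius below are illustrative assumptions), and only a linear readout is updated, one SGD step per arriving observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 50
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))      # fixed input weights
W = rng.uniform(-0.5, 0.5, (n_res, n_res))        # fixed reservoir weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1
w_out = np.zeros(n_res)                           # trainable linear readout
state = np.zeros(n_res)

lr, series = 0.01, np.sin(np.linspace(0, 20, 500))
for t in range(len(series) - 1):
    state = np.tanh(W_in @ series[t:t + 1] + W @ state)  # embed the signal
    y_hat = w_out @ state                                # readout prediction
    w_out += lr * (series[t + 1] - y_hat) * state        # one SGD step
```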
RNNs are naturally deployed in a streaming scenario for prediction, but training them under concept drift has, to the best of our knowledge, not been widely approached.

Finally, we remark that neuro-evolution is popular as a training method for RNNs in some areas, such as reinforcement learning, in particular in policy search approaches (where the policy map is represented as a neural network); see, for example, Stanley and Miikkulainen [114]. The structure of the network is evolved over time (rather than trained by backward propagation of errors), and hence it is arguably a more intuitive option in online tasks.

As training options become easier, we expect RNNs to become a more common option as a data-streams method.

6.3 Reinforcement learning

Reinforcement learning is inherently a task of learning from data streams. Observations arrive on a time-step basis, in a stream, and are typically treated either on an episode basis (here we can make an analogy with batch-incremental methods) or on a time-step basis (i.e., instance-incremental streaming). For a complete introduction to the topic see, for example, Sutton and Barto [117].
[Figure 5: The Mountain Car problem is a typical benchmark in reinforcement learning. The goal is to drive the car to the top. It can be treated as a streaming problem.]

The goal in reinforcement learning is to learn a policy, which is essentially a mapping from inputs (i.e., observations) to outputs (i.e., actions). This mapping is conceptually similar to that desired from a machine learning model (mapping inputs to outputs). However, the peculiarity is that ground-truth training pairs are never presented to the model, unlike in the typical supervised learning scenario. Rather, a reward signal is given instead of true labels. The reward signal is often sparse across time and, most challengingly, the reward at a particular time step may correspond to an action taken many time steps earlier, which makes it difficult to break learning down onto a time-step basis. Nevertheless, in certain environments, it is possible to consider training pairs on an episode level.
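As an illustration of how naturally RL fits the instance-incremental view, below is a minimal tabular Q-learning sketch. The environment is assumed to follow the classic gym-style interface (reset() returns an observation; step() returns (obs, reward, done, info)), and `discretize` is a user-supplied state-binning function; both are assumptions, not part of any surveyed library.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, discretize, n_actions, episodes=100,
               alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):                  # episode basis
        s, done = discretize(env.reset()), False
        while not done:                        # time-step basis
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(Q[s])))
            obs, r, done, _ = env.step(a)      # a reward, not a true label
            s2 = discretize(obs)
            # Update is incremental: one observation, one update, no revisit.
            Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```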
Despite the similarities with data streams, there has been relatively little overlap in the literature. It is not difficult to conceive of scenarios where a reinforcement learning agent needs to detect concept drift in its environment, just as any classification or regression model does.

Reinforcement learning is still in its infancy relative to tasks such as supervised classification, especially in terms of industrial applications, which may explain the lack of consideration of the additional complications, such as concept drift, typically considered in data streams. Nevertheless, we would expect such overlap to increase as a wider variety of real-world application domains is considered.

7. SOFTWARE PACKAGES AND FRAMEWORKS
In this section, we present existing tools for applying machine learning to data streams, for both research and practical applications. Initially, frameworks were designed to facilitate collaboration among research groups and to allow researchers to test their ideas directly. Nowadays, tools such as Massive Online Analysis (MOA) [20] can be adapted to deploy models in practice, depending on the problem requirements.

Massive Online Analysis (MOA) [20] (http://moa.cms.waikato.ac.nz). The MOA framework includes several algorithms for multiple tasks concerning data stream analysis. It features a large community of researchers that continuously add new algorithms and tasks to it. The current tasks included in MOA are classification, regression, multi-label, multi-target, clustering, outlier detection, concept drift detection, active learning, and others. Besides learning algorithms, MOA also provides: data generators (e.g., AGRAWAL, Random Tree Generator, and SEA); evaluation methods (e.g., periodic holdout, test-then-train, prequential); and statistics (CPU time, RAM-hours, Kappa). MOA can be used through a GUI (Graphical User Interface) or via the command line, which facilitates running batches of tests. The implementation is in Java and shares many characteristics with the WEKA framework [60], such as allowing users to extend the framework by inheriting from abstract classes. Very often, researchers make their source code available as an MOA extension (http://moa.cms.waikato.ac.nz/moa-extensions/).
Advanced Data mining And Machine learning System (ADAMS) [107] (https://adams.cms.waikato.ac.nz). ADAMS is a workflow engine designed to prototype and maintain complex knowledge workflows. ADAMS is not a data stream learning tool per se, but it can be combined with MOA and other frameworks, like SAMOA and WEKA, to perform data stream analysis.

Scalable Advanced Massive Online Analysis (SAMOA) [33] (http://samoa.incubator.apache.org). SAMOA combines stream mining and distributed computing (i.e., MapReduce), and is described as a framework as well as a library. As a framework, SAMOA allows users to abstract away the underlying stream processing execution engine and focus on the learning problem at hand. Currently, it is possible to use Storm (http://storm.apache.org) and Samza (http://samza.incubator.apache.org). SAMOA provides adapted versions of stream learners for distributed processing, including the Vertical Hoeffding Tree algorithm [72], bagging, and boosting.

Vowpal Wabbit (VW) (https://github.com/VowpalWabbit/vowpal_wabbit). VW is an open-source machine learning library with an efficient and scalable implementation that includes several learning algorithms. VW has been used to learn from a terafeature dataset using 1000 nodes in approximately an hour [2].
StreamDM (http://huawei-noah.github.io/streamDM/). StreamDM is an open-source framework for big data stream mining that uses the Spark Streaming [130] extension of the core Spark API. One advantage of StreamDM in comparison to existing frameworks is that it directly benefits from the Spark Streaming API, which handles many of the complex problems of the underlying data sources, such as out-of-order data and recovery from failures.

Scikit-multiflow [87] (https://github.com/scikit-multiflow/scikit-multiflow). Scikit-multiflow is an extension of the popular scikit-learn [98], inspired by the MOA framework. It is designed to accommodate multi-label, multi-output, and single-output data stream mining algorithms. One advantage of scikit-multiflow is its API similarity to scikit-learn, which is widely used worldwide.
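As an example of how compact stream experiments become with these tools, below is a minimal scikit-multiflow sketch of prequential evaluation on a synthetic SEA stream. Class names follow recent releases (e.g., 0.5); older versions use HoeffdingTree and require stream.prepare_for_use().

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
from skmultiflow.evaluation import EvaluatePrequential

stream = SEAGenerator(random_state=1)          # synthetic SEA concepts [116]
learner = HoeffdingTreeClassifier()
evaluator = EvaluatePrequential(max_samples=20000,
                                metrics=['accuracy', 'kappa_t'],
                                show_plot=False)
evaluator.evaluate(stream=stream, model=learner)   # test-then-train loop
```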
Ray RLlib [79] (https://ray.readthedocs.io/en/latest/rllib.html). RLlib is a reinforcement learning library that features reference implementations of algorithms and facilitates the creation of new algorithms through a set of scalable primitives. RLlib is part of the open-source Ray project. Ray is a high-performance distributed execution framework that allows Python tasks to be distributed across large clusters.
8. CONCLUSIONS

We have discussed several challenges that pertain to machine learning for streaming data. In some cases, these challenges have been addressed (often partially) by existing research, which we discussed while pointing out the shortcomings. All the topics covered in this work are important, but some have a broader impact or have been less investigated. Further developing the following topics in the near future will help the development of the field:

• Explore the relationships between other AI developments (e.g., recurrent neural networks, reinforcement learning, etc.) and adaptive stream mining algorithms;

• Characterize and detect drifts in the absence of immediately labeled data;

• Develop adaptive learning methods that take into account verification latency;

• Incorporate pre-processing techniques that continuously transform the raw data.

It is also important to develop software that allows the application of data stream mining techniques in practice. In recent years, many frameworks have been proposed, and they are constantly being updated and maintained by the community. Finally, it is unfeasible to cover all topics related to machine learning and streaming data in a single paper. Therefore, we were only able to scratch the surface of some topics that deserve further analysis in the future, such as regression; unsupervised learning; evolving graph data; image, text, and other non-structured data sources; and pattern mining.

9. REFERENCES
[1] Z. S. Abdallah, M. M. Gaber, B. Srinivasan, and S. Krishnaswamy. Activity recognition with evolving data streams: A review. ACM Computing Surveys (CSUR), 51(4):71, 2018.
[2] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In International Conference on Very Large Data Bases (VLDB), pages 81–92, 2003.
[4] C. C. Aggarwal and P. S. Yu. On classification of high-cardinality data streams. In SIAM International Conference on Data Mining, pages 802–813, 2010.
[5] T. Al-Khateeb, M. M. Masud, L. Khan, C. C. Aggarwal, J. Han, and B. M. Thuraisingham. Stream classification with recurring and novel class detection using class-based ensemble. In ICDM, pages 31–40, 2012.
[6] M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured Streaming: A declarative API for real-time applications in Apache Spark. In International Conference on Management of Data, pages 601–613, 2018.
[7] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno. Early drift detection method. 2006.
[8] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[9] J. P. Barddal, H. M. Gomes, and F. Enembreck. Analyzing the impact of feature drifts in streaming learning. In International Conference on Neural Information Processing, pages 21–28. Springer, 2015.
[10] J. P. Barddal, H. M. Gomes, F. Enembreck, and B. Pfahringer. A survey on feature drift adaptation: Definition, benchmark, challenges and future directions. Journal of Systems and Software, 127:278–294, 2017.
[11] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.
[12] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In ACM Symposium on Information, Computer and Communications Security, pages 16–25, 2006.
[13] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.
[14] Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. The Journal of Machine Learning Research, 11:849–872, 2010.
[15] A. Bifet. Classifier concept drift detection and the illusion of progress. In International Conference on Artificial Intelligence and Soft Computing, pages 715–725. Springer, 2017.
[16] A. Bifet, G. de Francisci Morales, J. Read, G. Holmes, and B. Pfahringer. Efficient online evaluation of big data stream classifiers. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68, 2015.
[17] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SIAM International Conference on Data Mining, pages 443–448, 2007.
[18] A. Bifet and R. Gavaldà. Adaptive learning from evolving data streams. In International Symposium on Intelligent Data Analysis, pages 249–260. Springer, 2009.
[19] A. Bifet, R. Gavaldà, G. Holmes, and B. Pfahringer. Machine Learning for Data Streams: with Practical Examples in MOA. Adaptive Computation and Machine Learning series. MIT Press, 2018.
[20] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis. The Journal of Machine Learning Research, 11:1601–1604, 2010.
[21] A. Bifet, G. Holmes, and B. Pfahringer. Leveraging bagging for evolving data streams. In PKDD, pages 135–150, 2010.
[22] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[23] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Computational Learning Theory, pages 92–100, 1998.
[24] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[25] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[27] S. Chen and H. He. Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach. Evolving Systems, 2(1):35–50, 2011.
[28] M. Chenaghlou, M. Moshtaghi, C. Leckie, and M. Salehi. Online clustering for evolving data streams with online anomaly detection. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 508–521. Springer, 2018.
[29] W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng. Unbiased online active learning in data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 195–203, 2011.
[30] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[31] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794–1813, 2002.
[32] E. R. de Faria, A. C. P. de Leon Ferreira de Carvalho, and J. Gama. MINAS: multiclass learning algorithm for novelty detection in data streams. Data Mining and Knowledge Discovery, 30(3):640–680, 2016.
[33] G. De Francisci Morales and A. Bifet. SAMOA: Scalable Advanced Massive Online Analysis. Journal of Machine Learning Research, 16:149–153, 2015.
[34] K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, 2012.
[35] G. Ditzler, M. D. Muhlbaier, and R. Polikar. Incremental learning of new classes in unbalanced datasets: Learn++.UDNC. In International Workshop on Multiple Classifier Systems, pages 33–42, 2010.
[36] G. Ditzler, R. Polikar, and N. Chawla. An incremental learning algorithm for non-stationary environments and class imbalance. In International Conference on Pattern Recognition, pages 2997–3000, 2010.
[37] P. Domingos and G. Hulten. Mining high-speed data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.
[38] A. R. T. Donders, G. J. Van Der Heijden, T. Stijnen, and K. G. Moons. A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10):1087–1091, 2006.
[39] Y. Dong and N. Japkowicz. Threaded ensembles of autoencoders for stream learning. Computational Intelligence, 34(1):261–281, 2018.
[40] D. M. dos Reis, P. Flach, S. Matwin, and G. Batista. Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1545–1554, 2016.
[41] K.-L. Du and M. N. Swamy. Neural Networks and Statistical Learning. Springer Publishing Company, Incorporated, 2013.
[42] R. Elwell and R. Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517–1531, 2011.
[43] M. A. Faisal, Z. Aung, J. R. Williams, and A. Sanchez. Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study. IEEE Systems Journal, 9(1):31–44, 2015.
[44] W. Fan and A. Bifet. Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsletter, 14(2):1–5, 2013.
[45] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. ACM SIGMOD Record, 34(2):18–26, 2005.
[46] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463–484, 2012.
[47] S. Galelli, G. B. Humphrey, H. R. Maier, A. Castelletti, G. C. Dandy, and M. S. Gibbs. An evaluation framework for input variable selection algorithms for environmental data-driven models. Environmental Modelling & Software, 62:33–51, 2014.
[48] J. Gama and P. Kosina. Learning about the learning process. In International Symposium on Intelligent Data Analysis, pages 162–172, 2011.
[49] J. Gama and P. Kosina. Recurrent concepts in data streams classification. Knowledge and Information Systems, 40(3):489–507, 2014.
[50] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.
[51] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1):9, 2016.
[52] A. Ghazikhani, R. Monsefi, and H. S. Yazdi. Ensemble of online neural networks for non-stationary and imbalanced data streams. Neurocomputing, 122:535–544, 2013.
[53] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet. A survey on ensemble learning for data stream classification. ACM Computing Surveys, 50(2):23:1–23:36, 2017.
[54] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, and T. Abdessalem. Adaptive random forests for evolving data stream classification. Machine Learning, 106(9-10):1469–1495, 2017.
[55] H. M. Gomes and F. Enembreck. SAE: Social Adaptive Ensemble classifier for data streams. In IEEE Symposium on Computational Intelligence and Data Mining, pages 199–206, 2013.
[56] H. M. Gomes and F. Enembreck. SAE2: Advances on the Social Adaptive Ensemble classifier for data streams. In Proceedings of the 29th Annual ACM Symposium on Applied Computing (SAC), pages 199–206, 2014.
[57] H. M. Gomes, J. Read, and A. Bifet. Streaming random patches for evolving data stream classification. In IEEE International Conference on Data Mining. IEEE, 2019.
[58] M. Grzenda, H. M. Gomes, and A. Bifet. Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery, to appear.
[59] S. Guha, N. Mishra, G. Roy, and O. Schrijvers. Robust random cut forest based anomaly detection on streams. In International Conference on Machine Learning, pages 2712–2721, 2016.
[60] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[61] A. Haque, B. Parker, L. Khan, and B. Thuraisingham. Evolving big data stream classification with MapReduce. In International Conference on Cloud Computing (CLOUD), pages 570–577, 2014.
[62] M. Harries and N. S. Wales. Splice-2 comparative evaluation: Electricity pricing. 1999.
[63] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari. Adapted one-versus-all decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering, 21(5):624–637, 2009.
[64] M. J. Hosseini, A. Gholipour, and H. Beigy. An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowledge and Information Systems, 46(3):567–597, 2016.
[65] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar. Adversarial machine learning. In ACM Workshop on Security and Artificial Intelligence, pages 43–58, 2011.
[66] N. Jiang and L. Gruenwald. Research issues in data stream association rule mining. ACM SIGMOD Record, 35(1):14–19, 2006.
[67] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999.
[68] I. Katakis, G. Tsoumakas, E. Banos, N. Bassiliades, and I. Vlahavas. An adaptive personalized news dissemination system. Journal of Intelligent Information Systems, 32(2):191–212, 2009.
[69] R. Klinkenberg. Using labeled and unlabeled data to learn drifting concepts. In IJCAI Workshop on Learning from Temporal and Spatial Data, pages 16–24, 2001.
[70] J. Z. Kolter, M. Maloof, et al. Dynamic weighted majority: A new ensemble method for tracking concept drift. In ICDM, pages 123–130, 2003.
[71] P. Kosina and J. Gama. Very fast decision rules for classification in data streams. Data Mining and Knowledge Discovery, 29(1):168–202, 2015.
[72] N. Kourtellis, G. D. F. Morales, A. Bifet, and A. Murdopo. VHT: Vertical Hoeffding Tree. In IEEE International Conference on Big Data, pages 915–922, 2016.
[73] G. Krempl, I. Žliobaitė, D. Brzeziński, E. Hüllermeier, M. Last, V. Lemaire, T. Noack, A. Shaker, S. Sievi, M. Spiliopoulou, et al. Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter, 16(1):1–10, 2014.
[74] L. I. Kuncheva. A stability index for feature selection. In International Multi-Conference: Artificial Intelligence and Applications, AIAP'07, pages 390–395, 2007.
[75] B. Kveton, H. H. Bui, M. Ghavamzadeh, G. Theocharous, S. Muthukrishnan, and S. Sun. Graphical model sketch. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 81–97, 2016.
[76] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit, 2007.
[77] P. Lehtinen, M. Saarela, and T. Elomaa. Online ChiMerge algorithm. In Data Mining: Foundations and Intelligent Paradigms, pages 199–216. 2012.
[78] M. Li, M. Liu, L. Ding, E. A. Rundensteiner, and M. Mani. Event stream processing with out-of-order data arrival. In International Conference on Distributed Computing Systems Workshops, pages 67–67, 2007.
[79] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica. Ray RLlib: A composable and scalable reinforcement learning library. arXiv preprint arXiv:1712.09381, 2017.
[80] V. López, A. Fernández, S. García, V. Palade, and F. Herrera. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250:113–141, 2013.
[81] V. Losing, B. Hammer, and H. Wersing. Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing, 275:1261–1274, 2018.
[82] G. Louppe and P. Geurts. Ensembles on random patches. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 346–361. Springer, 2012.
[83] D. Marron, J. Read, A. Bifet, T. Abdessalem, E. Ayguade, and J. Herrero. Echo state Hoeffding tree learning. In R. J. Durrant and K.-E. Kim, editors, Asian Conference on Machine Learning, volume 63, pages 382–397, 2016.
[84] M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering, 23(6):859–874, 2011.
[85] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham. A practical approach to classify evolving data streams: Training with limited amount of labeled data. In ICDM, pages 929–934. IEEE, 2008.
[86] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, pages 2886–2894, 2013.
[87] J. Montiel, J. Read, A. Bifet, and T. Abdessalem. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72), 2018.
[88] M. D. Muhlbaier, A. Topalis, and R. Polikar. Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks, 20(1):152–168, 2009.
[89] J. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7:308–313, 1965.
[90] H.-L. Nguyen, Y.-K. Woon, W.-K. Ng, and L. Wan. Heterogeneous ensemble for feature drifts in data streams. In P.-N. Tan, S. Chawla, C. K. Ho, and J. Bailey, editors, Advances in Knowledge Discovery and Data Mining, pages 1–12, 2012.
[91] S. Nogueira and G. Brown. Measuring the stability of feature selection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457. Springer, 2016.
[92] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3238–3249, 2018.
[93] N. Oza. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345, 2005.
[94] S. J. Pan, Q. Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[95] B. Parker and L. Khan. Rapidly labeling and tracking dynamically evolving concepts in data streams. In IEEE International Conference on Data Mining Workshops, pages 1161–1164, 2013.
[96] B. Parker, A. M. Mustafa, and L. Khan. Novel class detection and feature via a tiered ensemble approach for stream mining. In IEEE International Conference on Tools with Artificial Intelligence, volume 1, pages 1171–1178, 2012.
[97] B. S. Parker and L. Khan. Detecting and tracking concept class drift and emergence in non-stationary fast data streams. In AAAI Conference on Artificial Intelligence, 2015.
[98] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[99] B. Pfahringer, G. Holmes, and R. Kirkby. Handling numeric attributes in Hoeffding trees. In Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 296–307, 2008.
[100] X. C. Pham, M. T. Dang, S. V. Dinh, S. Hoang, T. T. Nguyen, and A. W. C. Liew. Learning from data stream based on random projection and Hoeffding tree classifier. In International Conference on Digital Image Computing: Techniques and Applications, pages 1–8, 2017.
[101] C. Pinto and J. Gama. Partition incremental discretization. In Portuguese Conference on Artificial Intelligence, pages 168–174, 2005.
[102] J. Plasse and N. Adams. Handling delayed labels in temporally evolving data streams. In IEEE International Conference on Big Data, pages 2416–2424, 2016.
[103] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, and F. Herrera. A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing, 239:39–57, 2017.
[104] J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1-2):243–272, 2012.
[105] J. Read, L. Martino, and J. Hollmén. Multi-label methods for prediction with sequential data. Pattern Recognition, 63:45–55, 2017.
[106] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
[107] P. Reutemann and J. Vanschoren. Scientific workflow management with ADAMS. In Machine Learning and Knowledge Discovery in Databases, pages 833–837. Springer, 2012.
[108] P. Roy, A. Khan, and G. Alonso. Augmented sketch: Faster and more accurate stream processing. In International Conference on Management of Data, pages 1449–1463, 2016.
[109] J. Rushing, S. Graves, E. Criswell, and A. Lin. A coverage based ensemble algorithm (CBEA) for streaming data. In IEEE International Conference on Tools with Artificial Intelligence, pages 106–112, 2004.
[110] M. Salehi and L. Rashidi. A survey on anomaly detection in evolving data: with application to forest fire risk prediction. ACM SIGKDD Explorations Newsletter, 20(1):13–23, 2018.
[111] J. C. Schlimmer and R. H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.
[112] A. Shrivastava, A. C. Konig, and M. Bilenko. Time adaptive sketches (Ada-sketches) for summarizing data streams. In International Conference on Management of Data, pages 1417–1432, 2016.
[113] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, pages 824–831, 2005.
[114] K. O. Stanley and R. Miikkulainen. Efficient reinforcement learning through evolving neural network topologies. In Genetic and Evolutionary Computation Conference, page 9, 2002.
[115] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. Gonzalez, et al. A Berkeley view of systems challenges for AI. arXiv preprint arXiv:1712.05855, 2017.
[116] W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382, 2001.
[117] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1st edition, 1998.
[118] L. Torgo, R. P. Ribeiro, B. Pfahringer, and P. Branco. SMOTE for regression. In Portuguese Conference on Artificial Intelligence, pages 378–389. Springer, 2013.
[119] A. Tsymbal. The problem of concept drift: definitions and related work. Technical report, 2004.
[120] B. Veloso, J. Gama, and B. Malheiro. Self hyper-parameter tuning for data streams. In International Conference on Discovery Science, to appear, 2018.
[121] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, and G. Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455–482, 2014.
[122] G. I. Webb. Contrary to popular belief incremental discretization can be sound, computationally efficient and extremely useful for streaming data. In ICDM, pages 1031–1036. IEEE, 2014.
[123] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean. Characterizing concept drift. Data Mining and Knowledge Discovery, 30(4):964–994, 2016.
[124] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.
[125] K. Wu, K. Zhang, W. Fan, A. Edwards, and S. Y. Philip. RS-Forest: A rapid density estimator for streaming anomaly detection. In ICDM, pages 600–609. IEEE, 2014.
[126] X. Wu, P. Li, and X. Hu. Learning from concept drifting data streams with unlabeled data. Neurocomputing, 92:145–155, 2012.
[127] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding. Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1):97–107, 2014.
[128] T. Yang, L. Liu, Y. Yan, M. Shahzad, Y. Shen, X. Li, B. Cui, and G. Xie. SF-sketch: A fast, accurate, and memory efficient data structure to store frequencies of data items. In ICDE, pages 103–106, 2017.
[129] W. Yu, Y. Gu, and J. Li. Single-pass PCA of large high-dimensional data. In International Joint Conference on Artificial Intelligence, pages 3350–3356, 2017.
[130] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In ACM Symposium on Operating Systems Principles, pages 423–438, 2013.
[131] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
[132] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research. ASU Feature Selection Repository, pages 1–28, 2010.
[133] G. Zhou, K. Sohn, and H. Lee. Online incremental feature learning with denoising autoencoders. In Artificial Intelligence and Statistics, pages 1453–1461, 2012.
[134] I. Žliobaitė. Change with delayed labeling: When is it detectable? In IEEE International Conference on Data Mining Workshops, pages 843–850, 2010.
[135] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, and G. Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455–482, 2015.
