[Figure 4: Two features' IG variation over time for SPAM CORPUS. Adapted from [9]. The plot shows Information Gain (y-axis, 0.0 to 0.4) against the number of instances processed (x-axis, 0 to 10,000).]

Data streams are subject to different types of concept drifts. Examples include (i) changes in the values of a feature and their association with the class, (ii) changes in the domain of features, and (iii) changes in the subset of features that are used to label an instance.
Despite being considered in seminal works of the area [124], works on the last type of drift have only recently gained traction. A feature drift occurs when a subset of features becomes, or ceases to be, relevant to the learning task [10]. Following the definition provided by Zhao et al. [132], a feature $x_i$ is deemed relevant if $\exists S_i' \subset S_i$, where $S_i = X \setminus \{x_i\}$, such that $P(Y \mid x_i, S_i') > P(Y \mid S_i')$ holds; it is deemed irrelevant otherwise. Given this definition, the removal of a relevant feature decreases the prediction power of a classifier. Also, there are two possibilities for a feature to be relevant: (i) it is strongly correlated with the class, or (ii) it forms a subset with other features, and this subset is correlated with the class [132]. An example of feature importance varying over time can be visualized in Fig. 4, where the Information Gain of two features w.r.t. the target variable is plotted over time for the SPAM CORPUS [68] dataset.
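A curve like the one in Fig. 4 can be obtained with a sliding-window estimate of each feature's Information Gain. The following is a minimal sketch; the window length, the number of discretization bins, and the helper names are our own illustrative choices, not part of [9]:

```python
import numpy as np
from collections import deque

def information_gain(x, y, n_bins=10):
    """Empirical IG(Y; X) = H(Y) - H(Y|X) for one (discretized) feature."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    bins = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
    return entropy(y) - sum(
        np.mean(bins == b) * entropy(y[bins == b]) for b in np.unique(bins))

window = deque(maxlen=1000)  # most recent (x, y) pairs

def on_new_instance(x_vec, y_label):
    """Track per-feature IG over the window; a sustained drop toward zero
    suggests the feature ceased to be relevant (a feature drift), as in Fig. 4."""
    window.append((x_vec, y_label))
    if len(window) == window.maxlen:
        X = np.array([x for x, _ in window])
        y = np.array([lbl for _, lbl in window])
        return [information_gain(X[:, j], y) for j in range(X.shape[1])]
```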
As in conventional drifts, changes in the relevant subset of features affect the class-conditional probabilities $P(Y \mid X)$, as the decision boundary across classes changes. Therefore, streaming algorithms should be able to detect these changes, enabling the learning algorithm to focus on the relevant features, leading to lighter-weight and less overfit models.
To address feature drifts, few proposals have been presented in the literature. Barddal et al. [10] showed that Hoeffding Adaptive Trees [18] are the state-of-the-art learners for identifying changes in the relevance of features and adapting the model on the fly. Another important work that explicitly focuses on performing feature selection as the data stream progresses is HEFT-Stream [90], where new features are selected as new batches of arriving data become available for training.
Finally, the assessment of feature drifting scenarios should account not only for the accuracy rates of learners, but also for whether the feature selection process correctly flags the changes in the relevant subset of features and identifies the features it should. Given that, the evaluation of feature selectors should also be dynamic, as the ground-truth subset of relevant features changes over time.
If a new feature is detected and deemed relevant, one may argue that a feature drift has occurred, and this feature could then be incorporated into the learning process. Similarly, if a feature becomes unavailable, then all of its values might be treated as missing values, and the learning model should ignore its existence. Most of the existing frameworks we will discuss in section 7, e.g., MOA [20], SAMOA [33], Scikit-multiflow [87], do not account for changes in the input vector of streaming data. Therefore, in dynamic scenarios where features may appear and disappear over time, the data stream computational representation in these frameworks will either remain static or require external updates. Developing an efficient dynamic input vector representation for streaming data is an important and difficult task, and given its relevance to some problem domains it deserves attention from machine learning framework developers.

4.4 Concept evolution
Concept evolution is a problem intrinsically related to others, such as anomaly detection for streaming data [43]. In general terms, concept evolution refers to the appearance or disappearance of class labels. This is natural in some domains; for example, the interests of users in news media change over time, with new topics appearing and older ones disappearing. Another example is Intrusion Detection Systems, where new threats appear as attackers evolve. Ideally, these threats should first be identified and then used to improve the model; however, doing this automatically is a difficult task. Informally, the challenge is to discern between concept drifts, noise, and the formation of a novel concept. Examples of algorithms that address concept evolution include ECSMiner [84], CLAM [5], and MINAS [32].
A major challenge here is the definition of evaluation setups and metrics to assess algorithms that detect concept evolution.
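To make the distinction between drift, noise, and novelty concrete, the sketch below flags instances that are far from every known class region and only declares a novel-class candidate once enough such outliers accumulate. This is a toy illustration of the general idea behind methods such as ECSMiner and MINAS, not a faithful implementation of either; all thresholds and names are our own:

```python
import numpy as np

class NoveltyMonitor:
    """Toy sketch: flag potential novel classes by distance to known-class
    centroids; declare a novel-class candidate when enough far-away
    instances pile up (isolated outliers are treated as noise)."""

    def __init__(self, radius=3.0, min_outliers=50):
        self.centroids = {}          # class label -> (mean vector, count)
        self.radius = radius
        self.outliers = []
        self.min_outliers = min_outliers

    def learn(self, x, y):
        mean, n = self.centroids.get(y, (np.zeros_like(x, dtype=float), 0))
        self.centroids[y] = (mean + (x - mean) / (n + 1), n + 1)  # running mean

    def check(self, x):
        if self.centroids and min(
            np.linalg.norm(x - m) for m, _ in self.centroids.values()
        ) <= self.radius:
            return "known"
        self.outliers.append(x)
        if len(self.outliers) >= self.min_outliers:
            # A dense group of far-away instances suggests concept evolution
            # rather than noise; a real method would also test their cohesion.
            self.outliers.clear()
            return "novel-class candidate"
        return "outlier"
```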
4.5 Hyperparameter tuning for evolving data streams
Hyperparameter tuning (or optimization) is often treated as a manual task where experienced users define a subset of hyperparameters and their corresponding ranges of possible values, to be tested exhaustively (Grid Search), randomly (Random Search), or according to some other criterion [11]. The brute-force approach of trying all possible combinations of hyperparameters and their values is time-consuming, but it can be efficiently executed in parallel in a batch setting. However, it is difficult to emulate this approach in an evolving streaming scenario. A naive approach is to separate an initial buffer from the first instances seen and perform an offline tuning of the model hyperparameters on it. Nevertheless, this makes the strong assumption that the selected hyperparameter values will remain optimal even if the concept drifts. The challenge is to design an approach that incorporates hyperparameter tuning as part of the continual learning process, which might involve data preprocessing, drift detection, drift recovery, and others.
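The naive strategy just described can be sketched as follows, assuming a generator `stream` that yields (x, y) pairs; scikit-learn's SGDClassifier is only an example of an incremental learner, and the buffer size and grid are arbitrary:

```python
from itertools import product
import numpy as np
from sklearn.linear_model import SGDClassifier

def tune_on_initial_buffer(stream, n_buffer=1000, grid=None):
    """Buffer the first instances, grid-search offline, then learn online
    with the chosen configuration (the stationarity assumption in the text)."""
    grid = grid or {"alpha": [1e-4, 1e-3, 1e-2], "eta0": [0.01, 0.1]}
    buffer = [next(stream) for _ in range(n_buffer)]
    X = np.array([x for x, _ in buffer])
    y = np.array([lbl for _, lbl in buffer])
    split = int(0.7 * n_buffer)                  # holdout inside the buffer
    best, best_acc = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        clf = SGDClassifier(learning_rate="constant", **params)
        clf.fit(X[:split], y[:split])
        acc = clf.score(X[split:], y[split:])
        if acc > best_acc:
            best, best_acc = params, acc
    model = SGDClassifier(learning_rate="constant", **best).fit(X, y)
    for x, label in stream:                      # continue learning online;
        model.partial_fit([x], [label])          # hyperparameters stay fixed
    return model
```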
Losing et al. [81] present a review and comparison of incremental learners, including SVM variations, tree ensembles, instance-based models, and others. Interestingly, this is one of the first works to benchmark incremental learners using a strategy to perform hyperparameter optimization. To perform the tuning, a minimum of 20% of the training data (or 1,000 instances, whichever is reached first) was gathered. Assuming a stationary distribution, this approach performs reasonably well. Experiments with non-stationary streams are also briefly covered by the authors, but since the algorithms used were not designed for this setting, it was not possible to draw further conclusions about the efficiency of performing hyperparameter optimization on drifting streams.
A recent work [120] formulates the problem of parameter tuning as an optimization problem and uses the Nelder-Mead algorithm to explore the space of the parameters. The Nelder-Mead method [89], or downhill simplex method, is a numerical method used to find the minimum or maximum of a function in a multidimensional space.
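For continuous hyperparameters, the downhill simplex search can be driven by an off-the-shelf implementation such as SciPy's. The sketch below minimizes validation error over a hyperparameter vector; `make_model` is a hypothetical factory mapping a point of the search space to a learner, and this only shows the general idea, not the specific procedure of [120]:

```python
import numpy as np
from scipy.optimize import minimize

def tune_nelder_mead(make_model, X_tr, y_tr, X_val, y_val, theta0):
    """Downhill simplex search over a continuous hyperparameter vector.
    `make_model(theta)` must return a learner with fit/score methods."""
    def objective(theta):
        model = make_model(theta).fit(X_tr, y_tr)
        return 1.0 - model.score(X_val, y_val)    # validation error
    res = minimize(objective, x0=np.asarray(theta0), method="Nelder-Mead")
    return res.x                                  # best hyperparameters found
```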
5. EVALUATION PROCEDURES AND DATA SOURCES
As the field evolves and practitioners, besides researchers, start to apply the methods, it is critical to verify whether or not the currently established evaluation metrics and benchmark datasets fit real-world problems. Selecting appropriate benchmark data is important to avoid making claims about an algorithm's quality based on empirical tests on data that might not reflect realistic scenarios.
Existing evaluation frameworks address issues such as imbalanced data, temporal dependences, cross-validation, and others [16]. For example, when the input data stream exhibits temporal dependences, a useful benchmark model is a naive No Change learner. This learner always predicts the next label to be the previous label and, surprisingly, it may surpass advanced learning algorithms that build complex models from the input data. Žliobaitė et al. [135] propose the $\kappa$-temporal statistic, which incorporates the No Change learner into the evaluation metric.
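Both the No Change baseline and the $\kappa$-temporal statistic of [135] are simple to compute from a stream of predictions and labels; a minimal sketch:

```python
def kappa_temporal(preds, labels):
    """Kappa-temporal statistic [135]: compares a classifier's prequential
    accuracy p against the accuracy p_nc of the naive No Change baseline,
    kappa_t = (p - p_nc) / (1 - p_nc). Values <= 0 mean the classifier is
    no better than always predicting the previous label."""
    correct = nc_correct = 0
    prev = None
    for y_hat, y in zip(preds, labels):
        correct += (y_hat == y)
        nc_correct += (prev == y)   # No Change learner: predict last label
        prev = y
    p, p_nc = correct / len(labels), nc_correct / len(labels)
    return (p - p_nc) / (1 - p_nc) if p_nc < 1 else 0.0
```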
However, one crucial issue related to the evaluation of streaming algorithms is the lack of appropriate approaches to evaluate delayed labeled problems. As previously discussed (see section 3.2), in a delayed setting there is a non-negligible delay between the input data $x$ and the ground-truth label $y$, which can vary from a few minutes or hours up to years, depending on the application. A naive approach to evaluating the quality of such solutions is to record the learner's prediction $\hat{y}$ when $x$ is presented and then compare it against $y$ once it is available. One issue with this approach is that in some applications, such as peer-to-peer lending and airline delay prediction, the learner will be polled several times with the same $x$ before $y$ is available, potentially improving its performance as time goes by, as other labels are made available and used to update the model. Ideally, the learner should be capable of outputting good results from the first prediction, when $x$ is presented; however, how can we measure its ability to improve over time before $y$ is made available? Despite works that address delayed labeling, evaluation frameworks have only recently been proposed [58].
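A minimal harness for the naive protocol stores the prediction made when $x$ arrives and scores it once $y$ becomes available; keeping the latest re-prediction as well exposes how much the model improved while waiting. The class and attribute names below are our own sketch, not the protocol of [58]:

```python
from collections import defaultdict

class DelayedLabelEvaluator:
    """Score predictions against labels that arrive with arbitrary delay."""

    def __init__(self):
        self.first_pred = {}     # instance id -> prediction at arrival time
        self.latest_pred = {}    # instance id -> most recent re-prediction
        self.scores = defaultdict(list)

    def predict_event(self, instance_id, y_hat):
        self.first_pred.setdefault(instance_id, y_hat)
        self.latest_pred[instance_id] = y_hat   # model may be polled again

    def label_event(self, instance_id, y_true):
        self.scores["first"].append(self.first_pred.pop(instance_id) == y_true)
        self.scores["latest"].append(self.latest_pred.pop(instance_id) == y_true)

    def accuracy(self, which="first"):
        s = self.scores[which]
        return sum(s) / len(s) if s else float("nan")
```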
5.1 Benchmark data
Data stream algorithms are usually assessed using a benchmark that combines synthetic generators and real-world datasets. The synthetic data is used to show how the method performs on specific problems (e.g., concept drifts, concept evolution, feature drifts, and so forth) in a controlled environment. The real-world datasets are used to justify the application of the method beyond hypothetical situations; however, they are often used without guarantees that the issues addressed by the algorithm are actually present. For example, it is difficult to check if and when a concept drift takes place in a real dataset. A further problem is that some of the synthetic streams can be considered too straightforward and perhaps outdated, e.g., the STAGGER [111] and SEA [116] datasets.
When it comes to real-world data streams, some researchers use datasets that do not represent data streams or that are synthetic data masquerading as real datasets. An example that covers both concerns (not a stream and actually synthetic) is the dataset named Pokerhand⁴; at some point in the past it was used to assess the performance of streaming classifiers, probably because of its volume. However, it is neither "real" nor a representation of a stream of data, yet to this day it is still in use without any reasonable explanation. Even benchmark datasets that can be interpreted as actual data streams display some unrealistic characteristics that are often not discussed. Electricity [62] depicts the Australian New South Wales Electricity Market and, even though the data was generated over time, the commonly used version of the dataset was preprocessed offline to normalize the features, which might benefit some algorithms or at least 'leak' some future characteristics (see section 2.1.2).

⁴ https://archive.ics.uci.edu/ml/datasets/Poker+Hand
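One way to avoid such leakage is to normalize each arriving instance using only statistics of the instances seen so far. A minimal sketch using Welford's online algorithm (our own illustration; not how the distributed Electricity version was produced):

```python
class RunningStandardizer:
    """Online z-score normalization using only past instances, avoiding the
    'future leak' of normalizing the whole dataset offline."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features   # sum of squared deviations (Welford)

    def transform(self, x):
        self.n += 1
        out = []
        for j, v in enumerate(x):
            delta = v - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (v - self.mean[j])
            std = (self.m2[j] / self.n) ** 0.5 if self.n > 1 else 1.0
            out.append((v - self.mean[j]) / std if std > 0 else 0.0)
        return out
```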
Issues with evaluation frameworks are not limited to supervised learning in a streaming context. For instance, assessing concept drift detection algorithms is also subject to controversy. A standard approach to evaluating novel concept drift detectors is to combine them with a classification algorithm and assess the detection capabilities of the concept drift method indirectly, by observing the classification performance of the accompanying algorithm. The problem with this evaluation is precisely that it is indirect; thus the actual characteristics of the drift detection algorithm, such as the lag between drift and detection, cannot be observed from it. This issue is detailed in a recent work [15].
Why are we not using real data streams to assess the performance of stream learning algorithms? One possible answer is the difficulty of preparing sensor data. Even though such data is abundant, it is still necessary to transform it into a suitable format, and often this means converting from a multivariate time series (see section 3.1) to a data stream. Another possible answer is that realistic data stream sources can be complicated to configure and to replicate.

5.2 Handling real streaming data
For actual implementations, an important aspect of streaming data is that the way the data is made available to the system is significant. High-latency data sources will 'hold' the whole system, and there is nothing the learning algorithm can do to solve it.
Differently from batch learning, the data source for streaming data is often harder for beginners to grasp. It is not merely a self-contained file or a well-defined database; in fact, it has to allow the appearance of new data with low latency, in such a way that the learner is updated as soon as possible when new data is available. At the vanguard of stream processing are frameworks such as Apache Spark and Flink.
Recently, the team behind Apache Spark introduced a novel API to handle streaming data, namely the Structured Streaming API [6], which overshadows the previous Spark Streaming [130] API. Similar to its predecessor, Structured Streaming is mainly based on micro-batches, i.e., instead of immediately presenting new rows of data to the user code, the rows are combined into small logical batches. This facilitates manipulating the data, as truly incremental systems can be both difficult for the user to handle and difficult for the framework developer to implement efficiently. Besides implementation details, the main difference between Spark Streaming and the new Structured Streaming API is that the latter assumes that there is a structure to the streaming data, which makes it possible to manipulate the data using SQL, and uses the abstraction of an unbounded table.
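The unbounded-table abstraction is visible in the canonical streaming word-count example from the Spark documentation, sketched below; the socket source on localhost:9999 is just a stand-in for a real data source:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamSketch").getOrCreate()

# `lines` behaves like a DataFrame (an unbounded table) to which new rows
# are appended as micro-batches arrive from the source.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()   # incrementally maintained aggregate

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```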
6. STREAMING DATA IN ARTIFICIAL INTELLIGENCE
In this section, we look at stream mining in recent advanced topics in artificial intelligence, by which we mean tasks that fall outside of the traditional single-target classification or regression scenario.

6.1 Prediction of Structured Outputs
In structured output prediction, values for multiple target variables are predicted simultaneously (for each instance). A particularly well-known special case is multi-label classification [106; 34; 131], where multiple labels are associated with each data point – a natural point of departure for many text and image labeling tasks.
Methods can be applied directly in a 'problem transformation' scenario or adapted in an 'algorithm adaptation' scheme [104]; however, obtaining scalable models is inherently more challenging, since the output space is $K^L$ for $L$ label variables each taking $K$ values, as opposed to $K$ for a $K$-class (single-label) problem. In other words, the output space may be of the same range of variety and dimensionality as an input space. As such, we can consider the issues and techniques outlined in sections 4.2 and 4.3.
We can emphasize that in multi-output data streams there is an additional complication involving concept drift, which now covers an additional dimension – models are inherently more complex and more difficult to learn, and thus there is even greater motivation to adapt them as best as possible to the new concept when drift occurs, rather than resetting them. This is further encouraged by the consideration that supervised labeling is less likely to be complete under this scenario.
Structured-output learning is the case of multi-output learning where some structure is assumed in the problem domain. For example, in image segmentation, local dependence is assumed among pixel variables, and in modeling sequences, temporal dependence is often assumed among the output variables. However, essentially all multi-label and multi-output problems will have some underlying structure and thus are in fact structured-output problems in the strict sense. Indeed, many sequence prediction and time-series models can be applied practically as-is to multi-label problems and vice versa [105]. This could include recurrent neural networks, which we review in section 6.2, or the methods already mentioned in section 3.1.
Therefore, the main challenges are dealing with the inherently more complex models and the drift patterns of streams with structured outputs. Complex structured output prediction tasks such as captioning have yet to be approached in a data-stream context.
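The simplest problem-transformation scheme, binary relevance, carries over directly to streams: one incremental binary classifier per label. A minimal sketch, with SGDClassifier as an arbitrary choice of base learner:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class StreamingBinaryRelevance:
    """Problem-transformation sketch: one incremental binary classifier per
    label, updated instance by instance. Any learner exposing `partial_fit`
    could be substituted for SGDClassifier."""

    def __init__(self, n_labels):
        self.models = [SGDClassifier() for _ in range(n_labels)]

    def partial_fit(self, x, y_vec):             # y_vec in {0, 1}^L
        for model, y in zip(self.models, y_vec):
            model.partial_fit([x], [y], classes=[0, 1])
        return self

    def predict(self, x):                        # call only after some training
        return np.array([int(m.predict([x])[0]) for m in self.models])
```

Note that this scheme ignores label dependence; it is the streaming analogue of the batch baseline, and scales linearly in $L$ rather than in $K^L$.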
6.2 Recurrent Neural Networks
Many structured-output problems can be approached with recurrent neural networks (RNNs). These are inherently robust and well suited to dealing with sequential data, particularly text (natural language) and signals with high temporal dependence; see, e.g., Du and Swamy [41] for a detailed overview.
RNNs are notoriously difficult to train. Obtaining good results on batch data can already require exhaustive experimentation with parameter settings, which is not easily affordable in the streaming context. Long Short-Term Memory neural networks (LSTMs) have gained recent popularity, but are still challenging to train on many tasks.
There are simplified RNNs which have been very effective, such as Time Delay Neural Networks, which simply include a window of input as individual instances; these were considered, for example, by Žliobaitė et al. [121] in the context of streams. Another useful variety of RNN, more suited to data streams, is the Echo State Network (ESN). The weights of the hidden layer of an ESN are randomly initialized and not trained; only the output layer (usually a linear one) is trained, and stochastic gradient descent will suffice in many contexts. ESNs are an interesting way to embed signals into vectors – making them a good starting point for converting a time series into an i.i.d. data stream which can be processed by traditional methods (see also the discussion in section 3.1).
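A minimal ESN sketch makes the division of labor explicit: a fixed random reservoir produces the embedding, and only a linear readout is updated, one SGD step per arriving instance. The reservoir size, leak rate, and spectral radius below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

class EchoStateNetwork:
    """Minimal ESN: fixed random input and reservoir weights, a leaky
    reservoir state, and a linear readout trained online (regression)."""

    def __init__(self, n_in, n_res=100, leak=0.3, rho=0.9, lr=0.01):
        self.w_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        w = rng.uniform(-0.5, 0.5, (n_res, n_res))
        self.w = w * (rho / max(abs(np.linalg.eigvals(w))))  # set spectral radius
        self.state = np.zeros(n_res)
        self.w_out = np.zeros(n_res)   # the only trained parameters
        self.leak, self.lr = leak, lr

    def embed(self, x):
        pre = self.w_in @ x + self.w @ self.state
        self.state = (1 - self.leak) * self.state + self.leak * np.tanh(pre)
        return self.state              # fixed-length embedding of the signal

    def learn(self, x, y):
        s = self.embed(x)
        err = self.w_out @ s - y
        self.w_out -= self.lr * err * s    # one SGD step on squared error
        return err ** 2
```

The `embed` output can also be fed, as an i.i.d. instance, to any traditional stream learner, which is the conversion alluded to above.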
RNNs are naturally deployed in a streaming scenario for prediction, but training them under concept drift has, to the best of our knowledge, not been widely approached.
Finally, we remark that neuro-evolution is popular as a training method for RNNs in some areas, such as reinforcement learning, in particular in policy search approaches (where the policy map is represented as a neural network); see, for example, Stanley and Miikkulainen [114]. The structure of the network is evolved over time (rather than trained by backward propagation of errors), and hence this is arguably a more intuitive option in online tasks. As training options become easier, we expect RNNs to become a more common option as a data-streams method.

6.3 Reinforcement learning
Reinforcement learning is inherently a task of learning from data streams. Observations arrive on a time-step basis, in a stream, and are typically treated either on an episode basis (here we can make an analogy with batch-incremental methods) or on a time-step basis (i.e., instance-incremental streaming). For a complete introduction to the topic, see, for example, Sutton and Barto [117].
[Figure 5: The Mountain Car problem is a typical benchmark in reinforcement learning. The goal is to drive the car to the top. It can be treated as a streaming problem.]

The goal in reinforcement learning is to learn a policy, which is essentially a mapping from inputs (i.e., observations) to outputs (i.e., actions). This mapping is conceptually similar to that desired from a machine learning model (mapping inputs to outputs). However, the peculiarity is that ground-truth training pairs are never presented to the model, unlike in the typical supervised learning scenario; rather, a reward signal is given instead of true labels. The reward signal is often sparse across time and – the most significant challenge – the reward at a particular time step may correspond to an action taken many time steps ago, which makes it difficult to break the learning problem down onto a time-step basis. Nevertheless, in certain environments, it is possible to consider training pairs on an episode level.
Despite the similarities with data streams, there has been relatively little overlap in the literature. It is not difficult to conceive of scenarios where a reinforcement-learning agent needs to detect concept drift in its environment, just as any classification or regression model does.
Reinforcement learning is still in its infancy relative to tasks such as supervised classification – especially in terms of industrial applications – which may explain the lack of consideration of additional complications typically studied in data streams, such as concept drift. Nevertheless, we would expect such overlap to increase as a wider variety of real-world application domains is considered.
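Viewed as a stream, an environment such as Mountain Car (Fig. 5) simply emits one observation per time step. The sketch below shows the instance-incremental interaction loop using the classic OpenAI Gym API (in the newer gymnasium package, reset() and step() have slightly different signatures); the random policy is a placeholder:

```python
import gym

env = gym.make("MountainCar-v0")
obs = env.reset()
for t in range(10_000):                   # observations arrive as a stream
    action = env.action_space.sample()    # placeholder for a learned policy
    obs, reward, done, info = env.step(action)
    # An instance-incremental agent would update its policy here, and could
    # also run a drift detector on statistics of (obs, reward).
    if done:                              # episode boundary ~ batch-incremental
        obs = env.reset()
env.close()
```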
7. SOFTWARE PACKAGES AND FRAMEWORKS
In this section, we present existing tools for applying machine learning to data streams, for both research and practical applications. Initially, frameworks were designed to facilitate collaboration among research groups and to allow researchers to test their ideas directly. Nowadays, tools such as Massive Online Analysis (MOA) [20] can be adapted to deploy models in practice, depending on the problem requirements.
Massive Online Analysis (MOA)⁵ [20]. The MOA framework includes several algorithms for multiple tasks concerning data stream analysis. It features a large community of researchers that continuously add new algorithms and tasks to it. The current tasks included in MOA are classification, regression, multi-label, multi-target, clustering, outlier detection, concept drift detection, active learning, and others. Besides learning algorithms, MOA also provides: data generators (e.g., AGRAWAL, Random Tree Generator, and SEA); evaluation methods (e.g., periodic holdout, test-then-train, prequential); and statistics (CPU time, RAM-hours, Kappa). MOA can be used through a GUI (Graphical User Interface) or via the command line, which facilitates running batches of tests. The implementation is in Java and shares many characteristics with the WEKA framework [60], such as allowing users to extend the framework by inheriting abstract classes. Very often researchers make their source code available as an MOA extension⁶.
Advanced Data mining And Machine learning System (ADAMS)⁷ [107]. ADAMS is a workflow engine designed to prototype and maintain complex knowledge workflows. ADAMS is not a data stream learning tool per se, but it can be combined with MOA, and other frameworks like SAMOA and WEKA, to perform data stream analysis.
Scalable Advanced Massive Online Analysis (SAMOA)⁸ [33]. SAMOA combines stream mining and distributed computing (i.e., MapReduce), and is described as a framework as well as a library. As a framework, SAMOA allows users to abstract away the underlying stream processing execution engine and focus on the learning problem at hand. Currently, it is possible to use Storm (http://storm.apache.org) and Samza (http://samza.incubator.apache.org). SAMOA provides adapted versions of stream learners for distributed processing, including the Vertical Hoeffding Tree algorithm [72], bagging, and boosting.
Vowpal Wabbit (VW)⁹. VW is an open source machine learning library with an efficient and scalable implementation that includes several learning algorithms. VW has been used to learn from a terafeature dataset using 1000 nodes in approximately an hour [2].
StreamDM¹⁰. StreamDM is an open source framework for big data stream mining that uses the Spark Streaming [130] extension of the core Spark API. One advantage of StreamDM in comparison to existing frameworks is that it directly benefits from the Spark Streaming API, which handles much of the complexity of the underlying data sources, such as out-of-order data and recovery from failures.
Scikit-multiflow¹¹ [87]. Scikit-multiflow is an extension of the popular scikit-learn [98], inspired by the MOA framework. It is designed to accommodate multi-label, multi-output, and single-output data stream mining algorithms. One advantage of scikit-multiflow is its API similarity to scikit-learn, which is widely used worldwide.
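For illustration, a typical prequential (test-then-train) experiment in scikit-multiflow looks roughly as follows; the class names are those of releases up to about 0.5, and later versions renamed some of them (e.g., HoeffdingTree to HoeffdingTreeClassifier):

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTree
from skmultiflow.evaluation import EvaluatePrequential

stream = SEAGenerator(noise_percentage=0.1)   # synthetic SEA concept
# Older releases additionally require: stream.prepare_for_use()
learner = HoeffdingTree()
evaluator = EvaluatePrequential(max_samples=20000, pretrain_size=200,
                                metrics=['accuracy', 'kappa'])
evaluator.evaluate(stream=stream, model=learner)  # test-then-train loop
```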
Ray RLlib¹² [79]. RLlib is a reinforcement learning library that features reference implementations of algorithms and facilitates the creation of new algorithms through a set of scalable primitives. RLlib is part of the open source Ray project. Ray is a high-performance distributed execution framework that allows Python tasks to be distributed across large clusters.

⁵ http://moa.cms.waikato.ac.nz
⁶ http://moa.cms.waikato.ac.nz/moa-extensions/
⁷ https://adams.cms.waikato.ac.nz
⁸ http://samoa.incubator.apache.org
⁹ https://github.com/VowpalWabbit/vowpal_wabbit
¹⁰ http://huawei-noah.github.io/streamDM/
¹¹ https://github.com/scikit-multiflow/scikit-multiflow
¹² https://ray.readthedocs.io/en/latest/rllib.html
8. CONCLUSIONS
We have discussed several challenges that pertain to machine learning for streaming data. In some cases, these challenges have been addressed (often partially) by existing research, which we discussed while pointing out the shortcomings. All the topics covered in this work are important, but some have a broader impact or have been less investigated. Further developing these in the near future will help the development of the field:

• Explore the relationships between other AI developments (e.g., recurrent neural networks, reinforcement learning, etc.) and adaptive stream mining algorithms;
• Characterize and detect drifts in the absence of immediately labeled data;
• Develop adaptive learning methods that take into account verification latency;
• Incorporate pre-processing techniques that continuously transform the raw data.

It is also important to develop software that allows the application of data stream mining techniques in practice. In recent years, many frameworks have been proposed, and they are constantly being updated and maintained by the community. Finally, it is unfeasible to cover all topics related to machine learning and streaming data in a single paper. Therefore, we were only able to scratch the surface of some topics that deserve further analysis in the future, such as regression; unsupervised learning; evolving graph data; image, text, and other non-structured data sources; and pattern mining.
9. REFERENCES

[1] Z. S. Abdallah, M. M. Gaber, B. Srinivasan, and S. Krishnaswamy. Activity recognition with evolving data streams: A review. ACM Computing Surveys (CSUR), 51(4):71, 2018.
[2] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In International Conference on Very Large Data Bases (VLDB), pages 81–92, 2003.
[4] C. C. Aggarwal and P. S. Yu. On classification of high-cardinality data streams. In SIAM International Conference on Data Mining, pages 802–813, 2010.
[5] T. Al-Khateeb, M. M. Masud, L. Khan, C. C. Aggarwal, J. Han, and B. M. Thuraisingham. Stream classification with recurring and novel class detection using class-based ensemble. In ICDM, pages 31–40, 2012.
[6] M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured streaming: A declarative API for real-time applications in Apache Spark. In International Conference on Management of Data, pages 601–613, 2018.
[7] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno. Early drift detection method. 2006.
[8] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[9] J. P. Barddal, H. M. Gomes, and F. Enembreck. Analyzing the impact of feature drifts in streaming learning. In International Conference on Neural Information Processing, pages 21–28. Springer, 2015.
[10] J. P. Barddal, H. M. Gomes, F. Enembreck, and B. Pfahringer. A survey on feature drift adaptation: Definition, benchmark, challenges and future directions. Journal of Systems and Software, 127:278–294, 2017.
[11] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.
[12] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In ACM Symposium on Information, Computer and Communications Security, pages 16–25, 2006.
[13] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.
[14] Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. The Journal of Machine Learning Research, 11:849–872, 2010.
[15] A. Bifet. Classifier concept drift detection and the illusion of progress. In International Conference on Artificial Intelligence and Soft Computing, pages 715–725. Springer, 2017.
[16] A. Bifet, G. de Francisci Morales, J. Read, G. Holmes, and B. Pfahringer. Efficient online evaluation of big data stream classifiers. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68, 2015.
[17] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SIAM International Conference on Data Mining, pages 443–448, 2007.
[18] A. Bifet and R. Gavaldà. Adaptive learning from evolving data streams. In International Symposium on Intelligent Data Analysis, pages 249–260. Springer, 2009.
[19] A. Bifet, R. Gavaldà, G. Holmes, and B. Pfahringer. Machine Learning for Data Streams: with Practical Examples in MOA. Adaptive Computation and Machine Learning series. MIT Press, 2018.
[20] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. The Journal of Machine Learning Research, 11:1601–1604, 2010.
[21] A. Bifet, G. Holmes, and B. Pfahringer. Leveraging bagging for evolving data streams. In PKDD, pages 135–150, 2010.
[22] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[23] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Computational Learning Theory, pages 92–100, 1998.
[24] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[25] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[27] S. Chen and H. He. Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach. Evolving Systems, 2(1):35–50, 2011.
[28] M. Chenaghlou, M. Moshtaghi, C. Leckie, and M. Salehi. Online clustering for evolving data streams with online anomaly detection. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 508–521. Springer, 2018.
[29] W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng. Unbiased online active learning in data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 195–203, 2011.
[30] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[31] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794–1813, 2002.
[32] E. R. de Faria, A. C. P. de Leon Ferreira de Carvalho, and J. Gama. MINAS: Multiclass learning algorithm for novelty detection in data streams. Data Mining and Knowledge Discovery, 30(3):640–680, 2016.
[33] G. De Francisci Morales and A. Bifet. SAMOA: Scalable advanced massive online analysis. Journal of Machine Learning Research, 16:149–153, 2015.
[34] K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, 2012.
[35] G. Ditzler, M. D. Muhlbaier, and R. Polikar. Incremental learning of new classes in unbalanced datasets: Learn++.UDNC. In International Workshop on Multiple Classifier Systems, pages 33–42, 2010.
[36] G. Ditzler, R. Polikar, and N. Chawla. An incremental learning algorithm for non-stationary environments and class imbalance. In International Conference on Pattern Recognition, pages 2997–3000, 2010.
[37] P. Domingos and G. Hulten. Mining high-speed data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.
[38] A. R. T. Donders, G. J. Van Der Heijden, T. Stijnen, and K. G. Moons. A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10):1087–1091, 2006.
[39] Y. Dong and N. Japkowicz. Threaded ensembles of autoencoders for stream learning. Computational Intelligence, 34(1):261–281, 2018.
[40] D. M. dos Reis, P. Flach, S. Matwin, and G. Batista. Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1545–1554, 2016.
[41] K.-L. Du and M. N. Swamy. Neural Networks and Statistical Learning. Springer, 2013.
[42] R. Elwell and R. Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517–1531, 2011.
[43] M. A. Faisal, Z. Aung, J. R. Williams, and A. Sanchez. Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study. IEEE Systems Journal, 9(1):31–44, 2015.
[44] W. Fan and A. Bifet. Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsletter, 14(2):1–5, 2013.
[45] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. ACM SIGMOD Record, 34(2):18–26, 2005.
[46] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463–484, 2012.
[47] S. Galelli, G. B. Humphrey, H. R. Maier, A. Castelletti, G. C. Dandy, and M. S. Gibbs. An evaluation framework for input variable selection algorithms for environmental data-driven models. Environmental Modelling & Software, 62:33–51, 2014.
[48] J. Gama and P. Kosina. Learning about the learning process. In International Symposium on Intelligent Data Analysis, pages 162–172, 2011.
[49] J. Gama and P. Kosina. Recurrent concepts in data streams classification. Knowledge and Information Systems, 40(3):489–507, 2014.
[50] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.
[51] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1):9, 2016.
[52] A. Ghazikhani, R. Monsefi, and H. S. Yazdi. Ensemble of online neural networks for non-stationary and imbalanced data streams. Neurocomputing, 122:535–544, 2013.
[53] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet. A survey on ensemble learning for data stream classification. ACM Computing Surveys, 50(2):23:1–23:36, 2017.
[54] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, and T. Abdessalem. Adaptive random forests for evolving data stream classification. Machine Learning, 106(9-10):1469–1495, 2017.
[55] H. M. Gomes and F. Enembreck. SAE: Social adaptive ensemble classifier for data streams. In IEEE Symposium on Computational Intelligence and Data Mining, pages 199–206, 2013.
[56] H. M. Gomes and F. Enembreck. SAE2: Advances on the social adaptive ensemble classifier for data streams. In Proceedings of the 29th Annual ACM Symposium on Applied Computing (SAC), pages 199–206, 2014.
[57] H. M. Gomes, J. Read, and A. Bifet. Streaming random patches for evolving data stream classification. In IEEE International Conference on Data Mining. IEEE, 2019.
[58] M. Grzenda, H. M. Gomes, and A. Bifet. Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery, to appear.
[59] S. Guha, N. Mishra, G. Roy, and O. Schrijvers. Robust random cut forest based anomaly detection on streams. In International Conference on Machine Learning, pages 2712–2721, 2016.
[60] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[61] A. Haque, B. Parker, L. Khan, and B. Thuraisingham. Evolving big data stream classification with MapReduce. In International Conference on Cloud Computing (CLOUD), pages 570–577, 2014.
[62] M. Harries. SPLICE-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.
[63] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari. Adapted one-versus-all decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering, 21(5):624–637, 2009.
[64] M. J. Hosseini, A. Gholipour, and H. Beigy. An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowledge and Information Systems, 46(3):567–597, 2016.
[65] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar. Adversarial machine learning. In ACM Workshop on Security and Artificial Intelligence, pages 43–58, 2011.
[66] N. Jiang and L. Gruenwald. Research issues in data stream association rule mining. ACM SIGMOD Record, 35(1):14–19, 2006.
[67] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999.
[68] I. Katakis, G. Tsoumakas, E. Banos, N. Bassiliades, and I. Vlahavas. An adaptive personalized news dissemination system. Journal of Intelligent Information Systems, 32(2):191–212, 2009.
[69] R. Klinkenberg. Using labeled and unlabeled data to learn drifting concepts. In IJCAI Workshop on Learning from Temporal and Spatial Data, pages 16–24, 2001.
[70] J. Z. Kolter, M. Maloof, et al. Dynamic weighted majority: A new ensemble method for tracking concept drift. In ICDM, pages 123–130, 2003.
[71] P. Kosina and J. Gama. Very fast decision rules for classification in data streams. Data Mining and Knowledge Discovery, 29(1):168–202, 2015.
[72] N. Kourtellis, G. D. F. Morales, A. Bifet, and A. Murdopo. VHT: Vertical Hoeffding tree. In IEEE International Conference on Big Data, pages 915–922, 2016.
[73] G. Krempl, I. Žliobaitė, D. Brzeziński, E. Hüllermeier, M. Last, V. Lemaire, T. Noack, A. Shaker, S. Sievi, M. Spiliopoulou, et al. Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter, 16(1):1–10, 2014.
[74] L. I. Kuncheva. A stability index for feature selection. In International Multi-Conference: Artificial Intelligence and Applications, pages 390–395, 2007.
[75] B. Kveton, H. H. Bui, M. Ghavamzadeh, G. Theocharous, S. Muthukrishnan, and S. Sun. Graphical model sketch. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 81–97, 2016.
[76] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit, 2007.
[77] P. Lehtinen, M. Saarela, and T. Elomaa. Online ChiMerge algorithm. In Data Mining: Foundations and Intelligent Paradigms, pages 199–216. 2012.
[78] M. Li, M. Liu, L. Ding, E. A. Rundensteiner, and M. Mani. Event stream processing with out-of-order data arrival. In International Conference on Distributed Computing Systems Workshops, pages 67–67, 2007.
[79] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica. Ray RLlib: A composable and scalable reinforcement learning library. arXiv preprint arXiv:1712.09381, 2017.
[80] V. López, A. Fernández, S. García, V. Palade, and F. Herrera. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250:113–141, 2013.
[81] V. Losing, B. Hammer, and H. Wersing. Incremental on-line learning: A review and comparison of state of the art algorithms. Neurocomputing, 275:1261–1274, 2018.
[82] G. Louppe and P. Geurts. Ensembles on random patches. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 346–361. Springer, 2012.
[83] D. Marron, J. Read, A. Bifet, T. Abdessalem, E. Ayguade, and J. Herrero. Echo state Hoeffding tree learning. In R. J. Durrant and K.-E. Kim, editors, Asian Conference on Machine Learning, volume 63, pages 382–397, 2016.
[84] M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering, 23(6):859–874, 2011.
[85] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham. A practical approach to classify evolving data streams: Training with limited amount of labeled data. In ICDM, pages 929–934. IEEE, 2008.
[86] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, pages 2886–2894, 2013.
[87] J. Montiel, J. Read, A. Bifet, and T. Abdessalem. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72), 2018.
[88] M. D. Muhlbaier, A. Topalis, and R. Polikar. Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks, 20(1):152–168, 2009.
[89] J. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7:308–313, 1965.
[90] H.-L. Nguyen, Y.-K. Woon, W.-K. Ng, and L. Wan. Heterogeneous ensemble for feature drifts in data streams. In P.-N. Tan, S. Chawla, C. K. Ho, and J. Bailey, editors, Advances in Knowledge Discovery and Data Mining, pages 1–12, 2012.
[91] S. Nogueira and G. Brown. Measuring the stability of feature selection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457. Springer, 2016.
[92] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 3238–3249, 2018.
[93] N. Oza. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345, 2005.
[94] S. J. Pan, Q. Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[95] B. Parker and L. Khan. Rapidly labeling and tracking dynamically evolving concepts in data streams. In IEEE International Conference on Data Mining Workshops, pages 1161–1164, 2013.
[96] B. Parker, A. M. Mustafa, and L. Khan. Novel class detection and feature via a tiered ensemble approach for stream mining. In IEEE International Conference on Tools with Artificial Intelligence, volume 1, pages 1171–1178, 2012.
[97] B. S. Parker and L. Khan. Detecting and tracking concept class drift and emergence in non-stationary fast data streams. In AAAI Conference on Artificial Intelligence, 2015.
[98] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[99] B. Pfahringer, G. Holmes, and R. Kirkby. Handling numeric attributes in Hoeffding trees. In Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 296–307, 2008.
[100] X. C. Pham, M. T. Dang, S. V. Dinh, S. Hoang, T. T. Nguyen, and A. W. C. Liew. Learning from data stream based on random projection and Hoeffding tree classifier. In International Conference on Digital Image Computing: Techniques and Applications, pages 1–8, 2017.
[101] C. Pinto and J. Gama. Partition incremental discretization. In Portuguese Conference on Artificial Intelligence, pages 168–174, 2005.
[102] J. Plasse and N. Adams. Handling delayed labels in temporally evolving data streams. In IEEE International Conference on Big Data, pages 2416–2424, 2016.
[103] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, and F. Herrera. A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing, 239:39–57, 2017.
[104] J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1-2):243–272, 2012.
[105] J. Read, L. Martino, and J. Hollmén. Multi-label methods for prediction with sequential data. Pattern Recognition, 63:45–55, 2017.
[106] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
[107] P. Reutemann and J. Vanschoren. Scientific workflow management with ADAMS. In Machine Learning and Knowledge Discovery in Databases, pages 833–837. Springer, 2012.
[108] P. Roy, A. Khan, and G. Alonso. Augmented sketch: Faster and more accurate stream processing. In International Conference on Management of Data, pages 1449–1463, 2016.
[109] J. Rushing, S. Graves, E. Criswell, and A. Lin. A coverage based ensemble algorithm (CBEA) for streaming data. In IEEE International Conference on Tools with Artificial Intelligence, pages 106–112, 2004.
[110] M. Salehi and L. Rashidi. A survey on anomaly detection in evolving data: with application to forest fire risk prediction. ACM SIGKDD Explorations Newsletter, 20(1):13–23, 2018.
[111] J. C. Schlimmer and R. H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.
[112] A. Shrivastava, A. C. Konig, and M. Bilenko. Time adaptive sketches (Ada-sketches) for summarizing data streams. In International Conference on Management of Data, pages 1417–1432, 2016.
[113] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, pages 824–831, 2005.
[114] K. O. Stanley and R. Miikkulainen. Efficient reinforcement learning through evolving neural network topologies. In Genetic and Evolutionary Computation Conference, San Francisco, 2002.
[115] I. Stoica, D. Song, R. A. Popa, D. Patterson, M. W. Mahoney, R. Katz, A. D. Joseph, M. Jordan, J. M. Hellerstein, J. E. Gonzalez, et al. A Berkeley view of systems challenges for AI. arXiv preprint arXiv:1712.05855, 2017.
[116] W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377–382, 2001.
[117] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1st edition, 1998.
[118] L. Torgo, R. P. Ribeiro, B. Pfahringer, and P. Branco. SMOTE for regression. In Portuguese Conference on Artificial Intelligence, pages 378–389. Springer, 2013.
[119] A. Tsymbal. The problem of concept drift: definitions and related work. Technical report, 2004.
[120] B. Veloso, J. Gama, and B. Malheiro. Self hyper-parameter tuning for data streams. In International Conference on Discovery Science, to appear, 2018.
[121] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, and G. Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455–482, 2014.
[122] G. I. Webb. Contrary to popular belief incremental discretization can be sound, computationally efficient and extremely useful for streaming data. In ICDM, pages 1031–1036. IEEE, 2014.
[123] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, and F. Petitjean. Characterizing concept drift. Data Mining and Knowledge Discovery, 30(4):964–994, 2016.
[124] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.
[125] K. Wu, K. Zhang, W. Fan, A. Edwards, and S. Y. Philip. RS-Forest: A rapid density estimator for streaming anomaly detection. In ICDM, pages 600–609. IEEE, 2014.
[126] X. Wu, P. Li, and X. Hu. Learning from concept drifting data streams with unlabeled data. Neurocomputing, 92:145–155, 2012.
[127] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding. Data mining with big data. IEEE Transactions on Knowledge and Data Engineering, 26(1):97–107, 2014.
[128] T. Yang, L. Liu, Y. Yan, M. Shahzad, Y. Shen, X. Li, B. Cui, and G. Xie. SF-sketch: A fast, accurate, and memory efficient data structure to store frequencies of data items. In ICDE, pages 103–106, 2017.
[129] W. Yu, Y. Gu, and J. Li. Single-pass PCA of large high-dimensional data. In International Joint Conference on Artificial Intelligence, pages 3350–3356, 2017.
[130] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In ACM Symposium on Operating Systems Principles, pages 423–438, 2013.
[131] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
[132] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research. ASU Feature Selection Repository, pages 1–28, 2010.
[133] G. Zhou, K. Sohn, and H. Lee. Online incremental feature learning with denoising autoencoders. In Artificial Intelligence and Statistics, pages 1453–1461, 2012.
[134] I. Žliobaitė. Change with delayed labeling: When is it detectable? In IEEE International Conference on Data Mining Workshops, pages 843–850, 2010.
[135] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, and G. Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455–482, 2015.