Active Learning From Imbalanced Data: A Solution of Online
Weighted Extreme Learning Machine
ABSTRACT:
It is well known that active learning can simultaneously improve the quality of the
classification model and decrease the complexity of training instances. However, several
previous studies have indicated that the performance of active learning is easily disrupted by
an imbalanced data distribution. Some existing imbalanced active learning approaches also
suffer from either low performance or high time consumption. To address these problems,
this paper describes an efficient solution based on the extreme learning machine (ELM)
classification model, called active online-weighted ELM (AOW-ELM). The main
contributions of this paper include: 1) the reasons why active learning can be disrupted by an
imbalanced instance distribution and its influencing factors are discussed in detail; 2) the
hierarchical clustering technique is adopted to select initially labeled instances in order to
avoid the missed cluster effect and cold start phenomenon as much as possible; 3) the
weighted ELM (WELM) is selected as the base classifier to guarantee the impartiality of
instance selection during active learning, and an efficient online update mode of
WELM is theoretically derived; and 4) an early stopping criterion that is similar to but more
flexible than the margin exhaustion criterion is presented. The experimental results on 32
binary-class data sets with different imbalance ratios demonstrate that the proposed AOW-
ELM algorithm is more effective and efficient than several state-of-the-art active learning
algorithms that are specifically designed for the class imbalance scenario.
INTRODUCTION:
Active learning is a popular machine learning paradigm that is frequently deployed in
scenarios where large-scale instances are easily collected, but labeling them is expensive
and/or time-consuming [1]. By adopting active learning, a classification model can iteratively
interact with human experts to select only the most significant instances for labeling and
thereby improve its performance as quickly as possible. Therefore, the merits of active
learning lie in decreasing both the burden on human experts and the complexity of the
training instances while still acquiring a classification model that delivers performance
superior or comparable to that of a model trained with all instances labeled.
EXISTING SYSTEM:
There exist a large number of active learning models, and several different taxonomies can
be used to organize them. Based on how the unlabeled data arrive, active learning can be
divided into pool-based and stream-based models: the former collects and prepares all
unlabeled instances in advance, while the latter can only visit a batch of newly arrived
unlabeled data at each specific time point. In addition, several different significance
measures are available to rank unlabeled instances, including uncertainty, representativeness,
inconsistency, variance, and error.
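To make the uncertainty measure concrete, the following minimal sketch ranks a pool of
unlabeled instances by least confidence; the classifier, function name, and batch size are
illustrative assumptions, not part of any of the surveyed algorithms.

import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confidence_query(model, X_pool, batch_size=10):
    # Uncertainty = 1 - max posterior probability: the instances the model
    # is least confident about are the most informative to label next.
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)
    return np.argsort(uncertainty)[-batch_size:]   # most uncertain indices

# Usage (illustrative): fit any probabilistic classifier on the labeled seed
# set, then query the next batch from the unlabeled pool.
# model = LogisticRegression().fit(X_labeled, y_labeled)
# batch = least_confidence_query(model, X_pool)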
In the past decade, active learning has also been deployed in a variety of real-world
applications, such as video annotation, image retrieval, text classification, remote sensing
image annotation, speech recognition, network intrusion detection, and bioinformatics.
Active learning is undoubtedly effective, but several recent studies have indicated that it
tends to fail when applied to data with a skewed class distribution. That is, similar to
traditional supervised learning, active learning also has to confront the class imbalance
problem. Several previous studies have tried to address this problem with different
techniques. In particular, cost-sensitive SVM (CS-SVM) was employed as the base learner,
empirical costs were assigned according to the prior imbalance ratio, and two traditional
stopping criteria, i.e., the minimum error and the maximum confidence, were adopted to find
an appropriate stopping condition for active learning. This method is robust and effective;
however, it is also time-consuming because of the high time complexity of training an SVM
and the absence of online learning.
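A hedged sketch of the cost assignment described above, using scikit-learn's SVC class
weights as a stand-in for CS-SVM; weighting each class by the prior imbalance ratio is one
common choice and an assumption here, not necessarily the exact scheme of that study.

import numpy as np
from sklearn.svm import SVC

def fit_cs_svm(X, y):
    # Give each class an empirical misclassification cost inversely
    # proportional to its frequency, so errors on the minority class
    # are penalized according to the prior imbalance ratio.
    classes, counts = np.unique(y, return_counts=True)
    cost = {c: counts.max() / n for c, n in zip(classes, counts)}
    return SVC(kernel="rbf", class_weight=cost, probability=True).fit(X, y)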
Tomanek and Hahn proposed two methods based on the inconsistency significance measure:
balanced-batch active learning (AL-BAB) and active learning with boosted disagreement
(AL-BOOD). The former selects n labeled instances that are class-balanced from 5n newly
labeled instances on each round of active learning, while the latter modifies the equation of
voting entropy to make instance selection focus on the minority class. AL-BOOD must
deploy many diverse base learners (ensemble learning) to calculate the voting entropy of the
predicted labels, which inevitably increases the computational burden. Ertekin et al.
indicated that near the boundary of two different classes, the imbalance ratio is generally
much lower than the overall ratio, so adopting active learning can effectively alleviate the
negative effects of an imbalanced data distribution; in other words, they consider active
learning to be a specific sampling strategy. In addition, because they selected SVM as the
base learner, a margin exhaustion criterion was proposed as an early stopping criterion to
determine the stopping condition. Summarizing the existing active learning algorithms
applied in the scenario of imbalanced data distributions, we found that they suffer from
either low classification performance or high time consumption.
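For reference, a minimal sketch of the standard vote entropy that AL-BOOD modifies; the
array shapes and function name are assumptions made for illustration.

import numpy as np

def vote_entropy(votes, n_classes):
    # votes: (n_members, n_instances) array of committee predictions.
    # High entropy means strong disagreement among the ensemble, i.e., an
    # informative instance; AL-BOOD reshapes this measure so that selection
    # leans toward the minority class.
    p = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)])
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.where(p > 0, p * np.log(p), 0.0).sum(axis=0)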
Disadvantages:
1) Active learning tends to fail when it is applied to data with a skewed class distribution.
2) Like traditional supervised learning, active learning must confront the class imbalance problem.
3) Existing remedies suffer from high time consumption.
PROPOSED SYSTEM:
We propose an effective and efficient algorithm named active online-weighted ELM
(AOW-ELM), which is applied in the pool-based batch-mode active learning scenario with
an uncertainty significance measure and an ELM classifier. We select ELM as the base
classifier in active learning based on three observations: 1) it always has generalization
ability and classification performance better than, or at least comparable to, those of SVM
and MLP; 2) it tremendously reduces training time compared to other classifiers; and 3) it
has an effective strategy for conducting active learning.
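As background, a minimal NumPy sketch of standard ELM training under common
assumptions (sigmoid hidden layer, ridge-regularized closed-form output weights); the
hyperparameters and function names are illustrative.

import numpy as np

def train_elm(X, T, n_hidden=100, C=1.0, seed=0):
    # ELM: the hidden layer (W, b) is random and never trained; only the
    # output weights beta are solved in closed form,
    # beta = (I/C + H^T H)^(-1) H^T T, which is why training is so fast.
    # T is the one-hot target matrix of shape (n_samples, n_classes).
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # hidden-layer output matrix
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                  # class scores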
In AOW-ELM, we first take advantage of the idea of cost-sensitive learning by selecting the
weighted ELM (WELM) as the base learner to address the class imbalance problem in the
active learning procedure. Then, we adopt the AL-ELM algorithm to construct an active
learning framework. Next, we theoretically derive an efficient online learning mode of
WELM and design an effective weight update rule. Finally, benefiting from the idea of the
margin exhaustion criterion, we present a more flexible and effective early stopping
criterion.
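A hedged sketch of the batch WELM solution that the online mode builds on, using the
common weighting scheme in which each instance receives weight 1/n_c for its class size
n_c; the paper's actual online update rule is derived in theory there and is not reproduced
here.

import numpy as np

def train_welm(H, T, y, C=1.0):
    # Weighted ELM: a diagonal weight matrix W (stored here as the vector w)
    # raises the cost of minority-class errors. Closed form:
    # beta = (I/C + H^T W H)^(-1) H^T W T,
    # where H is the hidden-layer output matrix and T the one-hot targets.
    classes, counts = np.unique(y, return_counts=True)
    w = np.empty(len(y))
    for c, n_c in zip(classes, counts):
        w[y == c] = 1.0 / n_c                        # minority gets a larger weight
    L = H.shape[1]
    beta = np.linalg.solve(np.eye(L) / C + H.T @ (H * w[:, None]),
                           H.T @ (T * w[:, None]))
    return beta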
Moreover, we briefly discuss why active learning can be disturbed by a skewed instance
distribution, further investigating the influence of three main distribution factors: the class
imbalance ratio, class overlapping, and small disjuncts. Specifically, we suggest adopting
clustering techniques to select the initially labeled seed set in advance, thereby avoiding the
missed cluster effect and the cold start phenomenon as much as possible.
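A minimal sketch of this seed-selection idea using scikit-learn's hierarchical
(agglomerative) clustering; picking the instance nearest each cluster centroid is one
reasonable representative choice and is an assumption here, not necessarily the paper's
exact procedure.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_seed_set(X_pool, n_seeds=10):
    # Cluster the whole unlabeled pool, then take one representative per
    # cluster so every cluster is touched by the initial seed set -- this is
    # what guards against the missed cluster effect and the cold start.
    labels = AgglomerativeClustering(n_clusters=n_seeds).fit_predict(X_pool)
    seeds = []
    for k in range(n_seeds):
        idx = np.flatnonzero(labels == k)
        centroid = X_pool[idx].mean(axis=0)
        dists = np.linalg.norm(X_pool[idx] - centroid, axis=1)
        seeds.append(idx[np.argmin(dists)])
    return np.array(seeds)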
Experiments are conducted on 32 binary-class imbalanced data sets, and the results
demonstrate that the proposed algorithmic framework is generally more effective and
efficient than several state-of-the-art active learning algorithms that were specifically
designed for the class imbalance scenario. The rest of this work is organized as follows:
(1) introduces some prior knowledge related to similar work; (2) constructs several
representative synthetic data sets with different distributions to analyze why active learning
can be disrupted by a skewed instance distribution; and (3) presents our proposed
algorithmic framework in detail.
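Putting the pieces together, the skeleton below sketches one pool-based round with a
margin-exhaustion-style early stop; the hooks query_most_uncertain and oracle_label, and
the threshold value, are hypothetical placeholders rather than the paper's exact interface.

def run_active_learning(query_most_uncertain, oracle_label,
                        stop_threshold=0.05, max_rounds=200):
    # Each round queries the single most uncertain pool instance; once even
    # that instance falls below the uncertainty threshold, the margin is
    # considered "exhausted" and learning stops early instead of labeling
    # the whole pool.
    for _ in range(max_rounds):
        idx, uncertainty = query_most_uncertain()
        if uncertainty < stop_threshold:
            break                      # flexible early stopping criterion
        oracle_label(idx)              # expert labels it; WELM is updated online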
Advantages:
1) Better performance
2) Saves training time
3) More effective and efficient than earlier approaches.
SYSTEM REQUIREMENTS:
Hardware Requirements:
System : Intel Core i5, 3 GHz
Memory : 16 GB
Hard Disk : 250 GB
GPU : NVIDIA GeForce GTX 1050 Ti, 4 GB
Software Requirements:
Operating System : Windows 7 / 8 or above.
Language : Python 3
Tool : Anaconda
CONCLUSION:
In this paper, we explore the problem of active learning in the class imbalance scenario and
present a solution based on online WELM, named the AOW-ELM algorithm. We find that
the harm caused by a skewed data distribution is related to multiple factors and can be seen
as a combination of these factors. Hierarchical clustering can be effectively used to extract
representative instances into an initial seed set in advance, addressing the potential missed
cluster effect and cold start phenomenon. The comparison between the proposed AOW-ELM
algorithm and some other benchmark algorithms indicates that AOW-ELM is an effective
strategy to address the problem of active learning in a class imbalance scenario. The merits of
the AOW-ELM algorithm can be summarized as follows.
1) It has a robust weight update rule.
2) Its running time is fast and scales linearly with the number of training instances.
3) It has a flexible early stopping criterion.
4) It is appropriate for various types of data sets.
In future work, we will focus on the problem of active learning on multiclass imbalanced
data sets. In addition, active learning strategies that address imbalanced, unlabeled data
streams while handling concept drift will also be investigated.