Active Learning

H.-L. Hsu, Y.I. Chang, R.-B. Chen

Article history:
Received 22 December 2017
Received in revised form 13 August 2018
Accepted 14 August 2018
Available online 27 August 2018

Keywords:
Active learning algorithm
D-efficiency criterion
Forward selection
Graft optimization

Abstract

We study a logistic model-based active learning procedure for binary classification problems, in which we adopt a batch subject selection strategy with a modified sequential experimental design method. Moreover, accompanying the proposed subject selection scheme, we simultaneously conduct a greedy variable selection procedure such that we can update the classification model with all labeled training subjects. The proposed algorithm repeatedly performs both subject and variable selection steps until a prefixed stopping criterion is reached. Our numerical results show that the proposed procedure has competitive performance, with a smaller training size and a more compact model, compared with that of the classifier trained with all variables and a full data set. We also apply the proposed procedure to a well-known wave data set (Breiman et al., 1984) and a MAGIC gamma telescope data set to confirm the performance of our method.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction
To train a classification model, labeled data are essential when a training/testing framework is adopted, and its
classification performance relies on both the size and the quality of the training subjects used for learning. In a Big Data
scenario, we can easily encounter a huge data set; however, the labeled information may be limited, while an abundance of
unlabeled subjects is available. For example, to prevent money laundering, Deng et al. (2009) studied a method for building a detection
model using bank account data. This is a good example because, in this situation, the label of interest (money laundering
account) is limited in a regular bank account data set. It would require a huge amount of time and resources to verify whether
an account is suspicious or non-suspicious, even though the majority of the transactions in a bank account should be
normal. Efficiently determining the potential risks within a bank account, in addition to effectively and efficiently using the
unlabeled subjects to improve the classification rule, is the key issue, and the concept of active learning can be applied to this
situation.
When we train a classifier in an active learning manner, we need to annotate the unlabeled data and recruit them into
the training set, which can be done with the information of a model built on the labeled data at the current stage. In the
literature, it is observed that people can usually learn a satisfactory model economically with such a procedure (Cohn et
al., 1994a; Settles, 2011, 2012). There are many classification performance indexes, and it is clear that this subject selection
process may depend on the targeted index (Settles, 2009; Hsu, 2010; Settles, 2011). For example, Culver et al. (2006) studied
active learning procedures that maximize the area under the ROC curve (AUC), Long et al. (2010) were interested in the
ranking of the data, and Deng et al. (2009) conducted an active learning study focusing on efficiently selecting the most informative
subjects to join the training set in order to construct an accurate classification model via experimental design methods. However,
when there are many redundant variables (predictors), it is difficult to obtain an effective design, and in such a situation, the
classification model tends to over-fit the training subjects, which usually leads to high prediction uncertainty. A common
approach to improving its prediction stability is to increase the size of the training set at the cost of protracting the training
stage. Thus, a procedure that can identify a compact classification model during its training course is preferred.
In this paper, we propose a logistic model-based active learning procedure with a batch sampling to address binary
classification problems under big data scenarios. In addition to the subject selection scheme, we also integrate a variable
selection step into this procedure for systematically improving the prediction ability and avoiding the over-fitting of our
final classification model. Hence, the proposed algorithm is an iterative algorithm, in which we select a batch of new samples
at each iteration and then update our binary classification model via a variable selection approach. Both subject selection
and variable selection are featured in this novel active learning procedure.
We organize the rest of this paper as follows. Section 2 presents the details of the subject selection and variable selection
steps, and then we describe our active learning algorithm with an integrated subject and variable selection steps. Section 3
presents numerical results, where in addition to the simulation studies, we apply our algorithm to a well-known wave data
set used in Breiman et al. (1984) and a MAGIC gamma telescope data set. We present a brief discussion and conclusion in
Section 4.
2. Methodology
We consider a pool-based active learning procedure as studied in Lewis and Gale (1994) and assume that to obtain those
unlabeled data is cheap and to query their label information is expensive. Hence, we should rationally select the unlabeled
subjects from this large pool to be labeled, to reduce the overall cost of model learning. We state the general framework
of the pool-based active learning methods as follows.
It is known that when there is a long vector of variables, training a classifier with the entire set of variables may diminish
its prediction power and protract its training time; a compact classification model is usually preferred when the underlying
model is sparse. Hence, in our algorithm, we also emphasize a variable selection strategy in the learning process, in addition to the
common subject selection found in active learning procedures reported in the literature.
The variable selection frame has two common approaches: forward selection and backward elimination (Whitney, 1971).
Because active learning procedures will usually start from a small size of training samples and keep accumulating according
to a pre-specified selection rule, we can only obtain satisfactory estimation results for a small number of parameters. Hence,
in our study, we use a forward selection scheme that increases the size of the variable set by adding one new variable to
the current model at a time, and we adopt a greedy selection approach. Based on the characteristics discussed above, we propose
a modified active learning algorithm that integrates both batch subject selection and greedy variable selection features,
and we refer to this Greedy AcTivE learning algorithm as the GATE learning (or GATE) algorithm throughout this paper.
Basically, we implement a variable selection step once we have an updated training set, and in each iteration of GATE, we
add more labeled samples and re-adjust our classification model.
Let ΞS denote the index set of the whole sample points and Ξs be the current training index set with labeled data. Thus,
Ξsc = ΞS \ Ξs is the pool of the unlabeled data. In this section, we focus on how to identify the batch unlabeled subjects
from Ξsc for binary classification based on a logistic model and, meanwhile, propose a two-stage query procedure that puts
uncertainty sampling together with an optimal design criterion. Afterward, we introduce a greedy forward selection
to update the current model by selecting a candidate variable from Ξvc = ΞV \ Ξv , where ΞV is the index set of the whole
variables and Ξv denotes the index set of the current active variables in the logistic model.
Assume that the ith individual variate Yi ∈ {0, 1} is a binary variable with
P(Yi = 1) = pi = E(Yi ) and P(Yi = 0) = 1 − pi ,
and let the feature values of the ith subject be xi,p = (xi,j )⊤ , i ∈ Ξs , j ∈ Ξv whose dimension is equal to p. Then, we can fit
this data set with the logistic regression model below:

pi = F(xi,p | βp) = exp{βp⊤ xi,p} / (1 + exp{βp⊤ xi,p}),
where βp is a p × 1 unknown parameter vector. The log-likelihood function of the logistic model using the labeled (training)
set, {(Yi , xi,p ), i ∈ Ξs }, is
L = ∑_{i∈Ξs} { Yi ln F(xi,p | βp) + (1 − Yi) ln(1 − F(xi,p | βp)) }.
Thus, we can use maximum likelihood estimation (MLE) to estimate βp. It is known that there is no closed-form expression
for the MLE, and a numerical optimization approach, such as Newton's method or iteratively re-weighted least
squares (IRLS), is commonly used in this case. Once we obtain the estimate of βp, denoted β̂p, we can predict the label of the ith
observation by
Ŷi = 1 if p̂i = exp{β̂p⊤ xi,p} / (1 + exp{β̂p⊤ xi,p}) > α,

with a pre-specified value α, say, for example, α = 0.5.
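For concreteness, the following is a minimal numpy sketch (not the authors' code) of fitting the logistic model by IRLS and predicting labels with the threshold α; the function names and the toy data are illustrative only.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression by iteratively re-weighted least squares (IRLS).

    X : (n, p) design matrix (include a column of ones for an intercept).
    y : (n,) binary responses in {0, 1}.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))    # F(x | beta)
        w = prob * (1.0 - prob)                     # IRLS weights
        # Newton/IRLS update: solve (X' W X) delta = X'(y - prob)
        XtWX = X.T @ (X * w[:, None]) + 1e-10 * np.eye(p)
        delta = np.linalg.solve(XtWX, X.T @ (y - prob))
        beta += delta
        if np.max(np.abs(delta)) < tol:
            break
    return beta

def predict_labels(X, beta, alpha=0.5):
    """Predict Y = 1 whenever the fitted probability exceeds alpha."""
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (prob > alpha).astype(int)

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([0.5, -2.0, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))
beta_hat = fit_logistic_irls(X, y)
print(beta_hat, predict_labels(X, beta_hat)[:10])
```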
If we use the experimental design methodologies properly to query the next points, the information concealed in a large
data set will be extracted quickly. Using some well-developed techniques in design theories (see Atkinson, 1996; Fedorov,
1972), we can effectively select samples, and some analytic results of optimal designs for parameter estimation in logistic
models have been derived. In the active learning literature, there are already many optimal design-based active learning
approaches for recruiting new samples into training sets, for example, see Cohn et al. (1994b), Cohn (1996) and Deng et
al. (2009). (For general information about the optimal design theory, please refer to Silvey, 1980). Because in our current
problem we only have unlabeled samples instead of a compact design space as in conventional design problems, it is
hard to apply these analytic designs to our problem, in particular when we consider variable selection as part of an active
learning process. How to quickly locate data points that are close to the analytic ones within a huge data set is the main
issue when applying design criteria to active learning processes. Thus, it becomes a computational problem instead of a
construction problem in this area.
Suppose {xi,p = (xi,j )⊤ , i ∈ Ξs , j ∈ Ξv } denotes the current labeled point set of size n with a variable length equal to p.
Following the definition in the optimal design (Silvey, 1980), we set a design ξn,p at points xi,p , i ∈ Ξs with equal weights
1/n. Given the parameter estimate β̂p, the information matrix of the logistic model with respect to β̂p is

M(ξn,p, β̂p) = (1/n) ∑_{i∈Ξs} F(xi,p | β̂p)(1 − F(xi,p | β̂p)) xi,p xi,p⊤.

For an unlabeled candidate point xt,p, t ∈ Ξsc, let ξn+1,p denote the (n + 1)-point design obtained by adding xt,p to ξn,p. We then measure the gain of this candidate by the relative D-efficiency

reDeff(xt,p) = |M(ξn+1,p, β̂p)|^{1/p} / |M(ξn,p, β̂p)|^{1/p},   (2)

and query the point

x* = arg max_{xt,p, t∈Ξsc} reDeff(xt,p),   (3)

which maximizes (2) among all points xt,p, t ∈ Ξsc, based on the labeled training set and the current logistic model. Since
the denominator of (2) does not depend on the candidate point, to select a point satisfying (3) is equivalent to the selection
under the locally D-optimal criterion. Because |M(ξn,p, β̂p)|^{1/p} can be viewed
as the geometric mean of the eigenvalues, we apply the pth root in (2) to facilitate interpretation. It follows that Eq. (2) is
the ratio of the average information of ξn,p and ξn+1,p. For further details regarding the relative D-efficiency, please refer
to Berger and Wong (2009) and Montgomery (2009).
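As a rough illustration of the quantities above, the sketch below computes the average information matrix of the logistic model and a relative D-efficiency score for one candidate point. The exact form of reDeff here (ratio of the pth roots of the two determinants) follows our reading of Eq. (2) and should be treated as an assumption, not the authors' implementation.

```python
import numpy as np

def info_matrix(X, beta):
    """Average Fisher information of the logistic model at the design points in X."""
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = prob * (1.0 - prob)
    return (X * w[:, None]).T @ X / X.shape[0]

def re_deff(X_labeled, x_new, beta):
    """Relative D-efficiency of the (n+1)-point design that adds x_new:
    the ratio of the p-th roots of the two determinants (our reading of Eq. (2))."""
    p = X_labeled.shape[1]
    M0 = info_matrix(X_labeled, beta)
    M1 = info_matrix(np.vstack([X_labeled, x_new]), beta)
    return (np.linalg.det(M1) / np.linalg.det(M0)) ** (1.0 / p)

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta = np.array([0.5, -1.0, 0.8])
candidate = rng.normal(size=3)
print(re_deff(X, candidate, beta))
```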
Because an exhaustive search is computationally inefficient when the size of the unlabeled data set is huge, and
uncertainty sampling can reduce the search time, the idea of uncertainty sampling is widely used in many
active learning processes in the literature. Here, we also adopt this strategy and propose a two-stage procedure. Before
applying the methods of experimental design, we first identify a candidate set based on the current logistic model
with a pre-specified threshold value α as follows. For each unlabeled point x, we define d(x | β̂) = |F̂(x | β̂) − α| to measure
its uncertainty with respect to the current model, F̂(· | β̂). We then collect the most uncertain candidate subjects from the
unlabeled data set {xt,p, t ∈ Ξsc}, i.e., the set ΞI = {t ∈ Ξsc : d(xt,p | β̂) ≤ d(h)}, where d(h) is the hth order statistic of the d values.
Remark 1. For given α ∈ (0, 1), the decision boundary of a logistic model can be defined as lα (x) = {x : F (x|β) = α}.
Deng et al. (2009) treated binary classification as a separation boundary estimation problem; thus, they chose a
few candidates close to the estimated decision boundary based on the current learning model, and then selected their next
sample using a locally D-optimal criterion. Their two-stage procedure integrates the concept of uncertainty sampling (Lewis
and Gale, 1994) and an optimal design method.
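A possible numpy sketch of the two-stage query is given below, under the assumption that stage one keeps the h most uncertain unlabeled points and stage two picks the reDeff maximizer among them; the function and variable names are ours.

```python
import numpy as np

def two_stage_query(X_pool, X_labeled, beta, alpha=0.5, h=200):
    """Two-stage query: keep the h unlabeled points closest to the decision
    boundary (smallest |F(x|beta) - alpha|), then pick the one maximizing
    the relative D-efficiency of the augmented design."""
    prob_pool = 1.0 / (1.0 + np.exp(-(X_pool @ beta)))
    d = np.abs(prob_pool - alpha)                      # uncertainty scores
    cand = np.argsort(d)[:min(h, len(d))]              # stage 1: uncertainty sampling

    prob_lab = 1.0 / (1.0 + np.exp(-(X_labeled @ beta)))
    w = prob_lab * (1.0 - prob_lab)
    n, p = X_labeled.shape
    M0 = (X_labeled * w[:, None]).T @ X_labeled / n    # average information, current design
    det0 = np.linalg.det(M0)

    best, best_eff = None, -np.inf
    for t in cand:                                     # stage 2: locally D-optimal choice
        x = X_pool[t]
        w_new = prob_pool[t] * (1.0 - prob_pool[t])
        M1 = (M0 * n + w_new * np.outer(x, x)) / (n + 1)
        eff = (np.linalg.det(M1) / det0) ** (1.0 / p)
        if eff > best_eff:
            best, best_eff = t, eff
    return best

# toy usage
rng = np.random.default_rng(3)
X_lab = rng.normal(size=(40, 3))
X_pool = rng.normal(size=(500, 3))
print(two_stage_query(X_pool, X_lab, np.array([0.3, -1.0, 0.6]), h=50))
```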
Fitting a logistic regression model with a large number of variables, many of which are redundant, causes computational
difficulties in parameter estimation and enlarges the prediction variation. Since having a large number of variables, P, is
common in this big data era, we want to identify a compact model for the binary classifier under a sparse model assumption.
In this paper, we adopt a greedy forward selection algorithm for variable selection for computational
reasons. In fact, to reduce the computational cost, Efron and Hastie (2016) also suggest a forward selection
approach to identify a proper classification model.
Singh et al. (2009) introduced a single feature optimization (SFO) procedure for logistic regression models, which is
a greedy-type feature selection procedure. Suppose xp is a candidate variable and the current logistic
model already contains x1, . . . , xp−1. Then, instead of re-estimating the whole coefficient vector βp = (β1, . . . , βp), SFO learns an
approximate model by fixing the original parameters for x1, . . . , xp−1 and optimizing only the parameter of the new variable, βp,
via the log-likelihood function L with respect to xp, i.e.,

β̂p = arg max_{βp} L(β̂1, . . . , β̂p−1, βp).   (4)

By this method, merely P − (p − 1) approximate models need to be created at each iteration of forward selection. The
estimate of βp is computed using Newton's method. To identify the next added variable, Singh et al. (2009)
proposed scoring the new feature variable xp by evaluating the approximate model with a proper evaluation index, such as AIC
or prediction error.
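A hedged sketch of the SFO idea follows: the current coefficients enter only through a fixed offset, and a one-dimensional Newton iteration optimizes the coefficient of the candidate variable; scoring by the approximate log-likelihood is one of the evaluation indexes mentioned above. The names are ours.

```python
import numpy as np

def sfo_score(X_cur, beta_cur, x_new, y, n_iter=20):
    """Single feature optimization (approximate): hold the current coefficients
    fixed and optimize only the coefficient of the candidate column x_new
    by one-dimensional Newton steps on the log-likelihood."""
    offset = X_cur @ beta_cur                    # fixed part of the linear predictor
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(offset + b * x_new)))
        grad = np.sum(x_new * (y - prob))
        hess = np.sum(x_new ** 2 * prob * (1.0 - prob))
        if hess < 1e-12:
            break
        b += grad / hess
    prob = 1.0 / (1.0 + np.exp(-(offset + b * x_new)))
    loglik = np.sum(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12))
    return b, loglik   # score candidates by the approximate log-likelihood (or AIC)

# toy usage
rng = np.random.default_rng(4)
Xc = np.column_stack([np.ones(100), rng.normal(size=100)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(Xc @ np.array([0.2, 1.0])))))
print(sfo_score(Xc, np.array([0.2, 1.0]), rng.normal(size=100), y))
```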
Instead of evaluating the variable effect based on the approximation of parameter estimation, Perkins et al. (2003)
proposed another greedy forward selection approach, called the grafting technique, based on the gradient of the log-
likelihood function for the newly added variable. With fixed parameters β1 , . . . , βp−1 , the variable with the largest
magnitude of gradient is added to the model, i.e.,
arg max_{p∈Ξvc} | ∂L/∂βp | = arg max_{p∈Ξvc} | ∑_{i∈Ξs} xi,p (Yi − pi) |.   (5)
Note that based on (5), the grafting technique is similar to matching pursuit (Mallat and Zhang, 1993) (or the weak greedy
algorithm) used in variable selection for regression models, and in this greedy forward selection algorithm, we only
need to compute the inner product of xp and the current residual vector.
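Because the grafting score of Eq. (5) reduces to one matrix–vector product, a minimal sketch (with our own function names) looks as follows.

```python
import numpy as np

def grafting_scores(X_all, y, prob_hat, active_idx):
    """Gradient-based grafting scores of Eq. (5):
    g_u = | sum_i x_{i,u} (Y_i - p_hat_i) | for every candidate column u."""
    resid = y - prob_hat                     # residuals under the current fitted model
    scores = np.abs(X_all.T @ resid)         # one inner product per column
    scores[list(active_idx)] = -np.inf       # exclude variables already in the model
    return scores

# the next variable to add is np.argmax of these scores
rng = np.random.default_rng(5)
Xa = rng.normal(size=(200, 10))
ya = rng.binomial(1, 0.5, size=200)
print(int(np.argmax(grafting_scores(Xa, ya, np.full(200, 0.5), active_idx=[0]))))
```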
Because both SFO and the grafting technique sequentially add one variable into the current logistic model, the GATE
algorithm typically has a small model when we start with a null model. These approaches can take advantage of the parallel
computing techniques (Kubica et al., 2011). However, SFO contains a series of one-dimensional optimization problems,
while the grafting technique only involves the inner product operator. Due to the computational complexity consideration,
we use the grafting technique in our GATE learning algorithm.
The proposed GATE algorithm has two parts: (1) identify the proper subjects for labeling, and (2) find a compact
classification model. To identify the proper variable to be added at each iteration, additional information is required, and
relying on only one newly labeled point is not enough. Thus, we include a batch of size nq instead, which we believe
is appropriate for most big data applications. Of course, the batch size nq can be a tuning parameter in the GATE algorithm
and may vary according to the application. Our query procedure relies on the D-optimal criterion. After one iteration of the
GATE algorithm, we will have nq additional points and include one new variable in the model. Then, we use the D-efficiency
criterion to measure the improvement of each iteration. As mentioned before, this criterion is a geometric mean and is
therefore comparable across different model sizes. Hence, we stop the iterations of the GATE learning algorithm when the relative difference
between the D-efficiencies is small enough.
Let ξ0 be the design with nq additional points in the current model with k variables, and ξ1 be the design with nq additional
points with respect to the one more selected variable, i.e., k + 1 variables. We then stop the GATE algorithm when the design
with the additional subjects cannot significantly provide enough information to support the larger model size. Let ε be a
pre-specified threshold value; we stop when the following inequality is satisfied:
| |M(ξ0, β̂k)|^{1/k} − |M(ξ1, β̂k+1)|^{1/(k+1)} | / |M(ξ0, β̂k)|^{1/k} < ε.   (6)
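A small sketch of the stopping rule in Eq. (6), assuming the two information matrices M0 and M1 have already been computed as described in the text, is given below.

```python
import numpy as np

def stop_criterion(M0, M1, k):
    """Relative difference of the D-efficiencies in Eq. (6):
    | |M0|^{1/k} - |M1|^{1/(k+1)} | / |M0|^{1/k}."""
    d0 = np.linalg.det(M0) ** (1.0 / k)
    d1 = np.linalg.det(M1) ** (1.0 / (k + 1))
    return abs(d0 - d1) / d0

# toy usage: stop the GATE iterations once the value drops below eps (e.g. eps = 1e-2)
M0, M1 = np.eye(3), 0.9 * np.eye(4)
print(stop_criterion(M0, M1, k=3))
```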
The details of each step of the GATE algorithm are stated as Algorithm 1.
1 Initialization: Learn the current logistic classifier by estimating β̂0 based on the initial labeled set with n0 points;
2 Let k = #Ξv and set Crit = 1;
3 while Crit ≥ ε and k ≤ P do
4     Estimate β̂k^(0) = (β̂v, v ∈ Ξv)⊤ based on {xi,v, i ∈ Ξs, v ∈ Ξv};
5     (batch active subject learning)
6     for 1 ≤ t ≤ nq do
7         for j ∈ Ξsc do
8             Calculate dj = |F̂(xj,k | β̂k^(t−1)) − α|;
9         end
10        Set d0 = d(h), the hth order statistic of the dj, and ΞI = {i | i ∈ Ξsc, di ≤ d0};
11        Identify the point xit,v = arg max_{xt,v, t∈ΞI} reDeff(xt,v);
12        Query xit,v and denote Yit as its label;
13        Update Ξs = Ξs ∪ {it}; re-estimate β̂k^(t) based on {xi,v, i ∈ Ξs, v ∈ Ξv};
14    end
15    Define β̂k* = β̂k^(nq);
16    Compute M0 = |M(ξ0, β̂k*)|^{1/k}, where ξ0 equally supports {xi,v, i ∈ Ξs, v ∈ Ξv} and p̂i = F̂(xi,v | β̂k*);
17    (variable selecting)
18    for u ∈ Ξvc do
19        Compute gu = |∑_{i∈Ξs} xi,u(Yi − p̂i)|;
20    end
21    Select u* = arg max_{u∈Ξvc} gu and update Ξv′ = Ξv ∪ {u*};
22    Re-estimate β̂k+1*;
23    Obtain M1 = |M(ξ1, β̂k+1*)|^{1/(k+1)}, where ξ1 equally supports {xi,v′, i ∈ Ξs, v′ ∈ Ξv′};
24    Compute Crit = |M1 − M0| / M0;
25    if Crit < ε then
26        Ξv = Ξv′ \ {u*};
27    else
28        Update Ξv = Ξv′ and k = k + 1;
29    end
30 end
31 Finally, re-estimate β̂* based on {xi,v, i ∈ Ξs, v ∈ Ξv} with the selected training data set;
32 Estimate F̂(xj,v | β̂*) with xj,v = (xj,v, v ∈ Ξv)⊤ for all j in the testing data set;
33 Obtain the estimated labels for the testing data set, that is, Ŷj = 1 when F̂(xj,v | β̂*) > α;
Algorithm 1: Greedy Active Learning Algorithm
Remark 2. Consider the computational complexity of the GATE algorithm. Suppose that at the current iteration, there are p
variables chosen in the logistic model. Let n be the number of labeled points in the current training set and nr be the number of
unlabeled points. First, the computational complexity of identifying the next labeled point in the subject selection step is at
most nr × O(p^3), where O(p^3) is the complexity of computing the determinant of a p × p information matrix. Second, for
the variable selection part, because the grafting technique only involves the inner product operator, it takes O(nP) to add
one additional variable, where P is the total number of candidate variables. Therefore, with batch size nq, the GATE
algorithm will take, for one iteration, at most nq × nr × O(p^3) + O(nP).
Remark 3. We know that conventional variable selection schemes do not consider subject selection strategies in their
algorithms. However, subject selection is the highlight of active learning procedures. In this study, we incorporate
variable selection and subject selection into one algorithm; hence, it is different from those conventional problems. Ideally,
one can replace the proposed forward selection scheme with other variable selection approaches. If we adopt a backward
elimination scheme in an active learning algorithm, we also need to adjust the subject selection procedure accordingly. In this
case, the computational complexity will increase for data sets with large p, as discussed here. On the other hand, if we
adopt a step-wise selection method, then the model will vary a lot during the subject selection of an active learning procedure
compared to that of conventional step-wise selection with fixed samples. Hence, directly replacing the current forward
selection scheme with other methods may not take full advantage of them, and we would need to modify our subject selection
accordingly, which remains a potential research problem.
3. Numerical examples
We illustrate the GATE algorithm using some synthesized data sets, the MAGIC gamma telescope data set
(Dheeru and Karra Taniskidou, 2017) and the well-known wave data set from Breiman et al. (1984), and evaluate it in terms of
the classification rate, the training sample size used and the variable selection status.
Let xi = (xi,1, . . . , xi,P)⊤ be a P × 1 predictor vector and β be the corresponding parameter vector. For the ith predictor
vector xi, we assume that xi,1 = 1 and that xi,j, j > 1, are independently generated from a normal distribution with mean µj and
unit standard deviation, where µj is drawn from the uniform distribution on (−1, 1). Let P be 100; then, we generate
the responses, Yi's, from the Bernoulli distribution with probability

pi = exp{β⊤ xi} / (1 + exp{β⊤ xi}),
and consider the following sparse model scenarios with the different parameter vectors:
Case 1: The first five parameters are (β1, β2, . . . , β5)⊤ = (0.5, −2.0, −0.6, 0.5, 1.2), and the other
βi, i > 5, are set to zero.
Case 2: The parameter vector is (β1, β2, . . . , β5)⊤ = (5, −20, −6, 5, 12), which equals 10 times the vector in
Case 1.
Case 3: The number of nonzero parameters is 6, and these 6 parameters are (β1, β2, . . . , β6)⊤ = (1, −4, −2, 2,
3, 7). The other parameters are fixed at zero.
For each logistic regression model, we generate 20,000 independent samples, which contains a training and a testing data
set with sample sizes Nt = 15,000 and Nv = 5000, respectively. There are 1000 replications for each case.
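As an illustration of this simulation setup, the following numpy sketch generates one data set under Case 1; the random seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2018)
N, P = 20_000, 100

# Case 1 parameters: first five nonzero, the rest zero
beta = np.zeros(P)
beta[:5] = [0.5, -2.0, -0.6, 0.5, 1.2]

# x_1 = 1 (intercept); x_j, j > 1, ~ N(mu_j, 1) with mu_j ~ Uniform(-1, 1)
mu = rng.uniform(-1, 1, size=P)
X = rng.normal(loc=mu, scale=1.0, size=(N, P))
X[:, 0] = 1.0

prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)

# split into a 15,000-point training pool and a 5,000-point test set
X_train, y_train = X[:15_000], y[:15_000]
X_test, y_test = X[15_000:], y[15_000:]
```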
We randomly select n0 = 100 points from the potential training set as an initial labeled training subject set for the GATE
algorithm and pretend that, in this simulation study, the remaining unselected training points remain unlabeled. At each
iteration, we set h = 200 as the scope for the pool of the candidate subjects and the batch size nq = 30. Thus, we sequentially
select 30 points using a two-stage query procedure. The label information of points is revealed once they are selected but
not before the selection. We assume no prior knowledge for the samples; hence, we set a threshold value, α = 0.5, for the
logistic classification probability as follows:
Ŷi = 1 if p̂i = exp{β̂⊤ xi} / (1 + exp{β̂⊤ xi}) > α (= 0.5),
and we set ε = 10−2 for the stopping criterion stated in (6).
We use both Accuracy (ACC), defined as the proportion of true results (both true positives and true negatives) among the
total number of cases examined and Area Under the Curve (AUC) of the receiver operating characteristic (ROC) to evaluate
the performance in all cases. Both measurements target the fraction of all instances that are correctly categorized. Thus, they
are larger-the-better characteristics.
To assess the variable selection of the proposed algorithm, we first let J be the set of true active variables and Ĵ be the
set of variables identified via the considered learning procedure, i.e.,

J = {j : βj ≠ 0} and Ĵ = {k : β̂k ≠ 0}.

We then define the True Positive Rate (TPR) and the False Positive Rate (FPR) as

TPR = #{J ∩ Ĵ} / #{J},
FPR = #{Ĵ ∩ Jc} / #{Jc},

and use them to evaluate the variable selection performance. Because TPR is the rate of active variables identified correctly,
a larger value is better, whereas FPR is the rate of inactive variables that are included in the model, so a smaller
value of FPR indicates a better performance.
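A short sketch of how TPR and FPR can be computed from the true and estimated coefficient vectors is given below; the helper name is ours.

```python
import numpy as np

def tpr_fpr(beta_true, beta_hat, tol=0.0):
    """Variable selection rates: TPR = #(J ∩ Ĵ)/#J and FPR = #(Ĵ ∩ J^c)/#J^c."""
    J = np.abs(beta_true) > tol          # true active set
    J_hat = np.abs(beta_hat) > tol       # selected set
    tpr = np.sum(J & J_hat) / np.sum(J)
    fpr = np.sum(~J & J_hat) / np.sum(~J)
    return tpr, fpr

# toy usage: one true zero is wrongly selected, both true actives are found
print(tpr_fpr(np.array([0.5, -2.0, 0.0, 0.0]), np.array([0.4, -1.8, 0.1, 0.0])))
```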
Table 1 summarizes the results of the 1000 simulations with respect to three different parameter cases. Among these
three cases, the first case should be the one with the worst classification rate due to the small non-zero parameter values.
As shown in Table 1, both classification measurements of Case 1, ACC and AUC are still larger than 0.82, and on average, we
only take around 530 points from the whole candidate set, which contains 15,000 points. The TPR values to evaluate the
variable selection performance for these three cases are higher than 0.94. This means that most of the true active variables
are included in the final model; however, over-selection problems do exist due to the potential weakness of the greedy
forward selection algorithm. Finally, without taking the labeling cost into account, the proposed algorithm is quite efficient
because the average CPU times are less than 4.5 mins.
In order to illustrate the advantages of the proposed GATE algorithm, we compare the classification performance with
those of the three different approaches. The first one is to implement the logistic classification with 15,000 training samples
and 100 variables. Second, we compare with an approach in which, in each replication, following the same
sample size n determined by the proposed active learning procedure, we randomly sample the labeled points from the
training set and then implement the logistic classifier based on the same active variables identified by the grafting forward
selection. In the third approach, all 100 variables are taken into the logistic model, and then, matching the sample size used in
each of our 1000 replications, we randomly select the labeled points from the 15,000 candidate training points to learn
the corresponding logistic classifier. Tables 2–4 show the classification results with respect to the different approaches.
In addition to summarizing the results in the tables, we also illustrate them in Figs. 1–3 with respect to the different
parameter cases. In each figure, we not only display the selection frequencies of the predictors (variables) for our GATE
learning approach but also show the 1000 ROC curves for all replications with respect to the four different approaches and
we then compare our approaches with three scenarios: (1) the data with whole samples and variables (‘‘FUv-FUn’’); (2)
the data with the random subjects and selected variables (‘‘GAv-GARn’’) and (3) the data with random subjects and whole
variables (‘‘FUv-GARn’’).
Because the GATE algorithm simultaneously selects subjects for training and identifies a proper model during its training
course, and there is a lack of active learning algorithms with both features, we only compare our results with active learning
approaches that use the reDeff criterion for subject selection alone. Hence, we compare our approach further with two other
scenarios: (4) the data using the whole set of variables with the reDeff criterion to select subjects, with a pre-fixed size
equal to that of GATE (''FUv-GAn-RE''), and (5) the data using a pre-fixed model with subjects selected by reDeff
(''GAv-REn-RE'').
FUv-FUn scenario. Table 2 shows the classification results based on 1000 replications for each parameter case. We treat
the results in Table 2 as the baselines for comparison. For the two classification measurements, ACC and AUC,
we focus on the classification results for the testing sets. The lowest value of the relative ACC of the proposed GATE approach to
the baseline is 0.821/0.832 = 0.987, and the lowest relative AUC value is 0.886/0.898 = 0.986. Both relative values
are close to 1.000, which means that the classification performances of the two approaches are similar. However, our GATE
learning approach takes fewer than 530 points on average, which is around 1/30 of the whole sample, and a more compact
model is used in the corresponding logistic classifier. This demonstrates the efficiency of the proposed GATE
learning approach, both in the proposed query strategy and in the greedy forward selection approach. The sub-figures (b)
and (c) of Figs. 1–3 show that both approaches have similar classification results; however, the results under FUv-FUn are
more stable over the 1000 replications.
GAv-GARn scenario. In each replication, we fix the model selected by our GATE learning approach, and the labeled subject
set is obtained by randomly sampling subjects from the whole sample, with the same size as the one used in our GATE
learning. Then, we learn the corresponding logistic classifier. The results over 1000 replications are recorded in Table 3.
Consider the classification results for the testing set. The relative values of ACC and AUC of our GATE learning algorithm to
this method are all slightly larger than 1; that is, our GATE algorithm still performs better in these three cases. In addition,
this is evidence in support of our greedy selection approach because we can still achieve good classification results based on
these selected models. The sub-figures (b) and (d) in Figs. 1–3 also indicate that both approaches have almost the same
classification performance.
FUv-GARn scenario. In this classification approach, we take all 100 variables into the logistic regression model. To learn the
binary classifier, we randomly select the points from the whole samples with the same sample size used in GATE for each replication.
Table 1
The values of ACC, AUC, TPR, FPR and average CPU times of GATE based on 1000 simulations.

Case  po  n          Training            Testing             TPR      FPR      #{xd}    Ave. time (min)
                     ACC       AUC       ACC       AUC
1     5   528.460    0.825     0.887     0.821     0.886     0.941    0.111    14.282   4.470
          (190.255)  (0.029)   (0.015)   (0.029)   (0.015)   (0.148)  (0.061)  (6.342)
2     5   314.290    0.981     0.998     0.980     0.998     0.999    0.033    7.143    1.549
          (65.241)   (0.005)   (0.003)   (0.005)   (0.003)   (0.026)  (0.023)  (2.175)
3     6   360.250    0.946     0.987     0.944     0.987     0.993    0.040    8.675    1.950
          (78.794)   (0.016)   (0.012)   (0.017)   (0.012)   (0.068)  (0.027)  (2.626)

Note: po represents the number of non-zero parameters in the true generating model, and the standard deviation of the 1000 simulated repeats is shown in parentheses.
Table 2
The values of ACC and AUC under the FUv-FUn scenario based on 1000 simulations.

Case  po  n        Training            Testing
                   ACC       AUC       ACC       AUC
1     5   15 000   0.832     0.898     0.832     0.898
                   (0.025)   (0.005)   (0.025)   (0.005)
2     5   15 000   0.982     0.999     0.982     0.999
                   (0.004)   (0.000)   (0.004)   (0.000)
3     6   15 000   0.949     0.990     0.949     0.989
                   (0.009)   (0.001)   (0.010)   (0.002)

Note: po represents the number of non-zero parameters in the true generating model, and the standard deviation of the 1000 simulated repeats is shown in parentheses.
Table 4 shows the average classification results based on 1000 replications. Compared with the ACC and AUC
values shown in Table 1, our active learning approach significantly outperforms this random selection approach. This means
that to obtain better classification results, we either need more labeled samples if we still want to take all variables into
the model or, with a smaller sample set, we need a compact logistic model. Thus, one possible solution is our
proposed active learning approach, because it not only sequentially selects the next labeled points but also identifies the
active variables for the binary classifier. The sub-figures (e) in these three figures clearly illustrate the instability of this
classification approach.
FUv-GAn-RE scenario. Table 5 summarizes the results of the logistic models under the FUv-GAn-RE scenario, with the sample size
fixed to that of the GATE algorithm, based on 1000 replications. We found that the overall performance is worse than that under
FUv-GARn. This may be due to the selection criterion, since the reDeff criterion is model-dependent. Because the logistic
model with 100 variables is very different from the true model, these results imply that the points identified by the reDeff
criterion cannot properly improve the logistic model fitting.
GAv-REn-RE scenario. Under the GAv-REn-RE setup, we stop the active learning when the relative difference between the D-
efficiencies of two consecutive iterations is less than 10−2. Table 6 shows the results of 1000 replications under GAv-REn-RE.
The subject size for Case 1 is smaller than that of the GATE algorithm, and the other two sizes are larger than that of GATE.
However, all these differences are less than one standard deviation of the sample sizes based on 1000 replications. We know
from Table 1 that the model identified by the GATE algorithm is very close to the true one. Thus, with this advantage, the
performance of GAv-REn-RE is also close to that of the GATE algorithm.
Remark 4. To use our GATE algorithm, we need to specify (1) the initial sample size n0, (2) the batch size nq, (3) the uncertainty
sampling scope h and (4) the threshold value ε in the stopping criterion. Selecting the most suitable parameters for a given data
set is always an important and time-consuming issue, and cross validation (CV) is a commonly used tuning procedure.
Taking Case 2 as an example, we simply fix ε = 10−2 and choose three candidate parameter vectors, (n0, nq, h) =
(50, 20, 200), (100, 30, 200) and (150, 40, 300). We then use a 5-fold CV for parameter tuning, and the selection principle is
the average ACC in CV. After 100 replications, the average ACC for the testing set is 0.981, and the TPR for variable selection
is 1.000. Both measurements are slightly better than what we showed in Table 1, but the differences are not significant. The
corresponding CPU time is 26.6999 mins because we need to run the GATE algorithm 3 × 5 + 1 = 16 times.
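A possible skeleton of the 5-fold CV tuning described in this remark is sketched below; run_gate is a hypothetical wrapper around Algorithm 1 (not provided in the paper), so the stub only fixes its interface, and the grid matches the three candidate settings above.

```python
import numpy as np
from sklearn.model_selection import KFold

def run_gate(X, y, n0, nq, h, eps=1e-2):
    """Hypothetical wrapper around Algorithm 1; expected to return the fitted
    coefficients and the indices of the selected variables."""
    raise NotImplementedError

def tune_gate(X, y, grid, eps=1e-2, n_splits=5, seed=0):
    """Pick (n0, nq, h) by 5-fold CV, scoring each setting by the average ACC."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best, best_acc = None, -np.inf
    for n0, nq, h in grid:
        accs = []
        for tr, te in kf.split(X):
            beta_hat, active = run_gate(X[tr], y[tr], n0, nq, h, eps)
            prob = 1.0 / (1.0 + np.exp(-(X[te][:, active] @ beta_hat)))
            accs.append(np.mean((prob > 0.5).astype(int) == y[te]))
        if np.mean(accs) > best_acc:
            best, best_acc = (n0, nq, h), np.mean(accs)
    return best

# candidate settings from the text
grid = [(50, 20, 200), (100, 30, 200), (150, 40, 300)]
```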
We apply the GATE algorithm to two data sets with the same tuning parameters stated in Section 3.1. That is, the size
of the initial training set is n0 = 100 points, the order statistic for the subject candidates is h = 200, the batch size is nq = 30,
and the threshold value for the stopping criterion is ε = 10−2.
Table 3
The values of ACC and AUC under the GAv-GARn scenario based on 1000 simulations.

Case  po  n          Training            Testing
                     ACC       AUC       ACC       AUC
1     5   528.460    0.820     0.885     0.819     0.884
          (190.255)  (0.029)   (0.014)   (0.029)   (0.015)
2     5   314.290    0.974     0.997     0.974     0.997
          (65.241)   (0.006)   (0.003)   (0.007)   (0.003)
3     6   360.250    0.941     0.986     0.940     0.986
          (78.794)   (0.016)   (0.012)   (0.017)   (0.012)

Note: po represents the number of non-zero parameters in the true generating model, and the standard deviation of the 1000 simulated repeats is shown in parentheses.
Table 4
The values of ACC and AUC under the FUv-GARn scenario based on 1000 simulations.

Case  po  n          Training            Testing
                     ACC       AUC       ACC       AUC
1     5   528.460    0.763     0.816     0.759     0.812
          (190.255)  (0.055)   (0.061)   (0.056)   (0.061)
2     5   314.290    0.858     0.922     0.856     0.921
          (65.241)   (0.030)   (0.033)   (0.031)   (0.033)
3     6   360.250    0.851     0.917     0.848     0.915
          (78.794)   (0.031)   (0.034)   (0.031)   (0.034)

Note: po represents the number of non-zero parameters in the true generating model, and the standard deviation of the 1000 simulated repeats is shown in parentheses.
Table 5
The values of ACC and AUC under the FUv-GAn-RE scenario based on 1000 simulations.

Case  po  n          Training            Testing
                     ACC       AUC       ACC       AUC
1     5   528.460    0.660     0.676     0.650     0.674
          (190.255)  (0.089)   (0.091)   (0.088)   (0.089)
2     5   314.290    0.841     0.900     0.838     0.898
          (65.241)   (0.085)   (0.087)   (0.086)   (0.088)
3     6   360.250    0.841     0.903     0.837     0.901
          (78.794)   (0.078)   (0.081)   (0.078)   (0.081)

Note: po represents the number of non-zero parameters in the true generating model, and the standard deviation of the 1000 simulated repeats is shown in parentheses.
Table 6
The values of ACC and AUC under the GAv-REn-RE scenario based on 1000 simulations.

Case  po  n          Training            Testing
                     ACC       AUC       ACC       AUC
1     5   435.550    0.824     0.887     0.821     0.886
          (140.150)  (0.030)   (0.016)   (0.030)   (0.016)
2     5   396.070    0.981     0.998     0.980     0.998
          (108.453)  (0.005)   (0.003)   (0.006)   (0.003)
3     6   396.040    0.947     0.987     0.945     0.987
          (102.529)  (0.016)   (0.012)   (0.017)   (0.012)

Note: po represents the number of non-zero parameters in the true generating model, and the standard deviation of the 1000 simulated repeats is shown in parentheses.
Fig. 1. The performances of the different approaches for Case 1 with 1000 replications.
Fig. 2. The performances of the different approaches for Case 2 with 1000 replications.
The MAGIC gamma telescope data set contains 19,020 data points in 2 classes (gamma and hadron classes) with 10 quantitative variables and is available at the Machine Learning
Repository website (Dheeru and Karra Taniskidou, 2017), where readers can find more references about it.
Fig. 3. The performances of the different approaches for Case 3 with 1000 replications.
Our results are based on 100 replications, and a 5-fold cross validation is adopted. In each replication, the data set is
randomly partitioned into 5 disjoint subsets, and we use 4 of them together as the training set and compute the ACC and
AUC using the remaining set, which is not used for training. The averages of 5 ACCs and AUCs are reported.
Table 7
The results of the MAGIC example.

Method     n          5-fold cross-validation    Variable size
                      ACC       AUC              #{xd}
GATE       397.900    0.788     0.816            9.048
           (37.088)   (0.003)   (0.006)          (1.667)
FUv-FUn               0.791     0.839
                      (0.000)
GAv-GARn   397.900    0.784     0.826
           (37.088)   (0.004)   (0.006)
FUv-GARn   397.900    0.783     0.829
           (37.088)   (0.003)   (0.004)
In addition, the other three methods (FUv-FUn, GAv-GARn and FUv-GARn), as described in Section 3.1, are implemented
for comparison purposes. The results are summarized in Table 7. On average, the GATE algorithm selects fewer than 400
samples among the 15,216 candidate points, and overall, 9 variables are included to learn the classifier. The average ACC and
AUC values are 0.788 and 0.816, respectively, over the 100 5-fold cross-validation replications.
Using the results of the logistic model with all variables and the full sample (FUv-FUn) as a baseline and comparing them with
the results of the GATE algorithm, we found that GATE only takes about 400/15,216 ≃ 2.63% of the training samples, with the
corresponding relative ACC and AUC equal to 0.788/0.791 ≃ 0.996 and 0.816/0.839 ≃ 0.973, which are close to 1. This
means that the GATE algorithm exhibits competitive performance with only a small fraction of the labeled samples. Under
GAv-GARn scenario as mentioned in Section 3.1, taking advantage of the compact model, its performance should be similar
to what the GATE algorithm does. Thus, from Table 7, the performance of GAv-GARn is just as good as expected because its
ACC and AUC values are all close to ours and those of FUv-FUn. On the other hand, the performance of FUv-GARn is similar
to that of the GATE algorithm; the GATE algorithm nearly selects the whole set of variables; therefore, the sparse learning
model offers no advantage in this case.
Table 8
The results of the wave example.

Method     n           Training            Testing             Variable size
                       ACC       AUC       ACC       AUC       #{xd}
GATE       743.500     0.920     0.978     0.920     0.980     22.45
           (444.264)   (0.002)   (0.001)   (0.002)   (0.001)   (14.809)
FUv-FUn    10 000      0.921     0.980                         121
The selection frequencies of each variable are shown in Fig. 4. Suppose we set 0.8 as a threshold value for the
selection frequencies. Then, it is clear that our grafting approach can also yield the same set of active variables. For these
15 selected variables, except V13, the selection frequencies are higher than or equal to 95%. In fact, the frequency at which
V13 is identified is at least 83%. For the 100 noise variables, the selection frequencies are essentially lower than those of the first
21 original variables.
4. Discussion and conclusion

In this study, we propose a logistic model-based active learning procedure for binary response data, named the GATE
algorithm. In addition to the common subject selection feature in active learning procedures, our algorithm can also identify
the proper classification model with the given data. We propose a two-stage subject selection procedure combining the ideas
of uncertainty sampling and relative D-efficiency (reDeff) criterion from the experimental design methods. In our algorithm,
we use the grafting technique, a greedy forward selection procedure, and adopt a sequential batch selection strategy at each
stage. Hence, the proposed active learning algorithm will repeat the subject selection and forward variable selection steps
until a stopping criterion is fulfilled. Both numerical results with our synthesized data and two well-known data sets support
the success of the proposed method.
We provide some discussion of the stopping criterion, Eq. (6), in the following. According to the definition of the D-
efficiency, it is not easy to theoretically establish the relation between |M(ξ0, β̂k)|^{1/k} and |M(ξ1, β̂k+1)|^{1/(k+1)}. Due to the
consistency of the β estimate, we could assume that β̂k+1 = (β̂k⊤, β̂k+1)⊤, where β̂k+1 is the estimate of the coefficient of the (k + 1)th variable.
Therefore, we decompose the information matrix M(ξ1, β̂k+1) as

M(ξ1, β̂k+1) = [ M(ξ0, β̂k)   Q
                 Q⊤           Dk+1,k+1 ],

where Q is a k × 1 vector and Dk+1,k+1 is a real number, and we have that

|M(ξ1, β̂k+1)| = |M(ξ0, β̂k)| (Dk+1,k+1 − Q⊤ M(ξ0, β̂k)^{-1} Q).

Hence, the difference between the average information of ξ0 and ξ1 should be small when we add more and more variables
into the classification model; that is, Eq. (6) should be small when we have more variables. In addition, we re-visit Case 2 in
Section 3.1 to illustrate the trend of our stopping criterion values, which is shown in Fig. 5. It seems that after 4 iterations, the
relative difference becomes smaller and smaller until its value is less than ε = 10−2.

Fig. 5. The trend of the stopping criterion values with respect to the number of iterations of the GATE algorithm.
Albert and Anderson (1984) noted that the MLE of the regression coefficients of a logistic model might not exist if the
observed data are completely or quasi-completely separated by a hyperplane spanned by the variables considered in the
model. A Bayesian approach with a proper prior is one possible way to overcome this fitting problem. Another possible
solution is to reduce the number of variables in the model, which can usually be accomplished with a forward or stepwise
selection approach, as stated in SAS usage note 22599, http://support.sas.com/kb/22/599.html. Thus, we modify the grafting
procedure in the following way: when we add a new variable, we fit the logistic model based on this new model,
and if the MLE of the parameter vector does not exist after adding this variable, we remove this newly added variable
and add the next candidate variable based on the grafting technique. We continue this procedure until we
obtain a model such that the MLE of the parameter vector does exist.
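One way to operationalize the check for a non-existent MLE is to monitor divergence of the IRLS iterations; the sketch below is a heuristic of our own, not the authors' implementation.

```python
import numpy as np

def mle_exists(X, y, n_iter=50, max_coef=1e4):
    """Heuristic check for (quasi-)complete separation: run IRLS and flag the fit
    as failed when the coefficients diverge or the weights collapse to zero."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = prob * (1.0 - prob)
        if np.max(w) < 1e-10:                    # all fitted probabilities at 0 or 1
            return False, beta
        XtWX = X.T @ (X * w[:, None]) + 1e-10 * np.eye(p)
        beta = beta + np.linalg.solve(XtWX, X.T @ (y - prob))
        if np.max(np.abs(beta)) > max_coef:      # coefficients running off to infinity
            return False, beta
    return True, beta

# in the modified grafting step: if mle_exists(...) returns False after adding a
# variable, drop it and try the next-ranked candidate from Eq. (5).
```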
Because we consider subject and variable selection simultaneously, the proposed algorithm tends to over-select variables,
which is a common phenomenon in forward selection-based procedures. A natural approach to avoid this situation is to
consider a stepwise selection approach instead of a simple forward selection procedure. Once users adopt the stepwise
approach, they have the chance to remove redundant variables from their classification models at the backward
elimination step. Certainly, the computational time will be substantially extended for this kind of approach; thus, users should
take this extra computational cost into consideration when choosing the most proper approach in their applications. The L1-
regularized logistic regression approach, such as GLMNET (Friedman et al., 2010), may be another possibility. However, from
the computational perspective, the grafting technique still has its advantages for our algorithm.
Acknowledgments
The research of Prof. Hsu was partly supported by the Ministry of Science and Technology (MOST) of Taiwan with Grant
No., MOST 105-2118-M-390-004. The research of Dr. Chang was partly supported by the Ministry of Science and Technology
(MOST) of Taiwan with Grant No., MOST 106-2118-M-001-007-MY2 and MOST 103-2118-M-001-002-MY3. The research
of Prof. Chen was partially supported by the Ministry of Science and Technology (MOST) of Taiwan with Grant No., MOST
105-2628-M-006-002-MY2 and the Mathematics Division of the National Center for Theoretical Sciences in Taiwan.
References
Albert, A., Anderson, J.A., 1984. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71, 1–10.
Atkinson, A.C., 1996. The usefulness of optimum experimental designs. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1), 59–76.
Berger, M.P.F., Wong, W.K., 2009. An Introduction to Optimal Design for Social and Biomedical Research. John Wiley & Sons, Chichester, UK.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Wadsworth Statistical Press, Belmont, CA.
Cohn, D.A., 1996. Neural network exploration using optimal experiment design. Neural Netw. 9 (6), 1071–1083.
Cohn, D., Atlas, L., Ladner, R., 1994a. Improving generalization with active learning. Mach. Learn. 15 (2), 201–221.
Cohn, D.A., Ghahramani, Z., Jordan, M.I., 1994b. Active learning with statistical models. In: Proceedings of the 7th International Conference on Neural
Information Processing Systems. NIPS’94, MIT Press, Cambridge, MA, USA, pp. 705–712.
Culver, M., Kun, D., Scott, S., 2006. Active learning to maximize area under the ROC curve. In: Sixth International Conference on Data Mining, 2006. ICDM
’06.
Deng, X., Joseph, V.R., Sudjianto, A., Wu, C.F.J., 2009. Active learning through sequential design, with applications to detection of money laundering. J. Amer.
Statist. Assoc. 104 (487), 969–981.
Dheeru, D., Karra Taniskidou, E., 2017. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
Efron, B., Hastie, T., 2016. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, first ed. Cambridge University Press, New York, NY,
USA.
Fedorov, V., 1972. Theory of Optimal Experiments. In: Probability and Mathematical Statistics, Elsevier Science.
Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 (1), 1–22.
Hsu, D.J., 2010. Algorithms for Active Learning (Ph.D. thesis), Columbia University.
Kubica, J., Singh, S., Sorokina, D., 2011. Parallel large-scale feature selection. In: Scaling Up Machine Learning: Parallel and Distributed Approaches.
Lewis, D.D., Gale, W.A., 1994. A sequential algorithm for training text classifiers. In: Proceedings of the 17th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval. SIGIR ’94, Springer-Verlag New York, Inc., New York, NY, USA, pp. 3–12.
Long, B., Bian, J., Chapelle, O., Zhang, Y., Inagaki, Y., Chang, Y., 2010. Active learning for ranking through expected loss optimization. In: SIGIR 2010.
Mallat, S.G., Zhang, Z., 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 (12), 3397–3415.
Montgomery, D.C., 2009. Design and Analysis of Experiments, seventh ed. John Wiley & Sons, Hoboken, NJ, USA.
Perkins, S., Lacker, K., Theiler, J., 2003. Grafting: Fast, incremental feature selection by gradient descent in function space. J. Mach. Learn. Res. 3, 1333–1356.
Rakotomalala, R., 2005. TANAGRA: a free software for research and academic purposes. In: Proceedings of European Grid Conference 2005, RNTI-E-3,
Vol. 2. pp. 697–702.
Settles, B., 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
Settles, B., 2011. From theories to queries: Active learning in practice. In: Journal of Machine Learning Research, Workshop on Active Learning and
Experimental Design, Workshop and Conference Proceedings, Vol. 16, pp. 1–18.
Settles, B., 2012. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, San Rafael.
Silvey, S.D., 1980. Optimal Design: An Introduction to the Theory for Parameter Estimation. Chapman and Hall, London; New York.
Singh, S., Kubica, J., Larsen, S., Sorokina, D., 2009. Parallel large scale feature selection for logistic regression. In: SDM. SIAM, pp. 1172–1183.
Whitney, A.W., 1971. A direct method of nonparametric measurement selection. IEEE Trans. Comput. 20 (9), 1100–1103.