Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Combining ADMM and the Augmented Lagrangian Method
for Efficiently Handling Many Constraints

Joachim Giesen¹ and Sören Laue*¹,²

¹Friedrich-Schiller-Universität Jena
²Data Assessment Solutions
*Contact Author

Abstract

Many machine learning methods entail minimizing a loss function that is the sum of the losses for the individual data points. The form of the loss function is exploited algorithmically, for instance in stochastic gradient descent (SGD) and in the alternating direction method of multipliers (ADMM). However, there are also machine learning methods where the entailed optimization problem features the data points not in the objective function but in the form of constraints, typically one constraint per data point. Here, we address the problem of solving convex optimization problems with many convex constraints. Our approach is an extension of ADMM. The straightforward implementation of ADMM for solving constrained optimization problems in a distributed fashion solves constrained subproblems on different compute nodes that are aggregated until a consensus solution is reached. Hence, the straightforward approach has three nested loops: one for reaching consensus, one for the constraints, and one for the unconstrained problems. Here, we show that solving the costly constrained subproblems can be avoided. In our approach, we combine the ability of ADMM to solve convex optimization problems in a distributed setting with the ability of the augmented Lagrangian method to solve constrained optimization problems. Consequently, our algorithm needs only two nested loops. We prove that it inherits the convergence guarantees of both ADMM and the augmented Lagrangian method. Experimental results corroborate our theoretical findings.

1 Introduction

Optimization problems with many constraints typically arise from large data sets. An illustrative example is the core vector machine [Tsang et al., 2007], where the smallest enclosing ball for a given set of data points has to be computed. The objective function here is the radius of the ball, which needs to be minimized, and every data point contributes a convex constraint, namely that the distance of the point from the center must be at most the radius.

The increasing availability of distributed hardware suggests addressing such problems by distributing the constraints over different compute nodes. Unfortunately, to the best of our knowledge, algorithmic schemes for distributing convex constraints are known only in a few special cases. Such a scheme has not even been discussed for the well-researched smallest enclosing ball (core vector machine) problem. The situation is vastly different when there is not a large number of constraints but a large number of parameters. For instance, the alternating direction method of multipliers (ADMM), proposed decades ago by [Glowinski and Marroco, 1975] and by [Gabay and Mercier, 1976], has received considerable attention because it makes it possible to solve convex optimization problems with a large number of parameters in a distributed setting [Boyd et al., 2011]. For instance, in any log-likelihood maximization problem, such as logistic or ordinary least squares regression, the data points enter only the objective function. The loss function of such problems, that is, the negative log-likelihood, is the sum of the losses for the individual data points. In this case ADMM lends itself to a distributed implementation where the data points are distributed over different compute nodes.

Surprisingly, general convex inequality constraints have so far not been considered directly in the context of ADMM, although, in principle, standard ADMM can also be used for solving constrained optimization problems. A distributed implementation of the straightforward extension of ADMM leads to non-trivial constrained optimization subproblems that have to be solved in every iteration. Solving constrained problems is typically reduced to solving a sequence of unconstrained problems. Hence, this approach features three nested loops: the outer loop for reaching consensus, one loop for the constraints, and an inner loop for solving unconstrained problems. Alternatively, one could use the standard augmented Lagrangian method, originally known as the method of multipliers [Hestenes, 1969], which has been specifically designed for solving constrained optimization problems. Combining the augmented Lagrangian method with ADMM makes it possible to solve general constrained problems in a distributed fashion by running the augmented Lagrangian method in an outer loop and ADMM in an inner loop. Again, we end up


with three nested loops: the outer loop for the augmented Lagrangian method and the standard two nested inner loops for ADMM. Thus, one could assume that any distributed solver for constrained optimization problems needs at least three nested loops: one for reaching consensus, one for the constraints, and one for the unconstrained problems. The key contribution of our paper is showing that this is not the case. One of the nested loops can be avoided by merging the loops for reaching consensus and dealing with the constraints. Our approach, which needs only two nested loops, combines ADMM with the augmented Lagrangian method differently than the direct approach of running the augmented Lagrangian method in the outer and ADMM in the inner loop. The latter combination, which to our surprise has not been discussed in the literature before, still provides us with a good baseline in the experimental section.

Related Work

To the best of our knowledge, our extension of ADMM is the first distributed algorithm for solving general convex optimization problems with no restrictions on the type of constraints or assumptions on the structure of the problem. The only special case that we are aware of are quadratically constrained quadratic problems, which have been addressed by [Huang and Sidiropoulos, 2016]. However, their approach, which builds on consensus ADMM, does not scale, since every constraint gives rise to a new subproblem.

[Mosk-Aoyama et al., 2010] have designed and analyzed a distributed algorithm for solving convex optimization problems with separable objective function and linear equality constraints. Their algorithm blends a gossip-based information-spreading, iterative gradient ascent method with the barrier method from interior-point algorithms. It is similar to ADMM and can also handle only linear constraints.

[Zhu and Martínez, 2012] have introduced a distributed multiagent algorithm for minimizing a convex function that is the sum of local functions subject to a global equality or inequality constraint. Their algorithm involves projections onto local constraint sets that are usually as hard to compute as solving the original problem with general constraints. For instance, it is well known via standard duality theory that the feasibility problem for linear programs is as hard as solving linear programs. This holds true for general convex optimization problems with vanishing duality gap.

In principle, standard ADMM can also handle convex constraints by transforming them into indicator functions that are added to the objective function. However, this leads to subproblems, to be solved in each iteration, that entail computing a projection onto the feasible region. This entails the same issues as the method by [Zhu and Martínez, 2012], since computing these projections can be as hard as solving the original problem.

The recent literature on ADMM is vast. Most papers on ADMM stay in the standard framework of optimizing a function or a sum of functions subject to linear constraints. Exemplarily for many others, we just mention [Zhang and Kwok, 2014], who provide convergence guarantees for asynchronous ADMM, and [Ghadimi et al., 2015], who study optimal penalty parameter selection.

2 Alternating Direction Method of Multipliers

Here, we briefly review the alternating direction method of multipliers (ADMM) and discuss how it can be adapted for dealing with distributed data, before we extend it for handling convex constraints in the next section.

ADMM is an iterative algorithm that in its most general form can solve convex optimization problems of the form

    min_{x,z}  f_1(x) + f_2(z)
    s.t.       Ax + Bz − c = 0,                                  (1)

where f_1 : R^{n_1} → R ∪ {∞} and f_2 : R^{n_2} → R ∪ {∞} are convex functions, A ∈ R^{m×n_1} and B ∈ R^{m×n_2} are matrices, and c ∈ R^m.

ADMM can obviously deal with linear equality constraints, but it can also handle linear inequality constraints. The latter are reduced to linear equality constraints by replacing constraints of the form Ax ≤ b by Ax + s = b, adding the slack variable s to the set of optimization variables, and setting f_2(s) = 1_{R^m_+}(s), where

    1_{R^m_+}(s) = 0 if s ≥ 0, and ∞ otherwise,

is the indicator function of the set R^m_+ = {x ∈ R^m | x ≥ 0}. Note that f_1 and f_2 are allowed to take the value ∞.

Recently, ADMM has regained a lot of attention because it makes it possible to solve problems with a separable objective function in a distributed setting. Such problems are typically given as

    min_x  Σ_i f_i(x),

where f_i corresponds to the i-th data point (or, more generally, the i-th data block) and x is a weight vector that describes the data model. This problem can be transformed into an equivalent optimization problem with individual weight vectors x_i for each data point (data block) that are coupled through an equality constraint,

    min_{x_i,z}  Σ_i f_i(x_i)
    s.t.         x_i − z = 0  ∀ i,

which is a special case of Problem 1 that can be solved by ADMM in a distributed setting by distributing the data.

3 ADMM Extension

Adding convex inequality constraints to Problem 1 does not destroy convexity of the problem, but so far ADMM cannot deal with such constraints. Note that the problem only remains convex if all equality constraints are induced by affine functions. That is, we cannot in general add convex equality constraints without destroying convexity. Hence, here we consider convex optimization problems in their most general form,

    min_{x,z}  f_1(x) + f_2(z)
    s.t.       g_0(x) ≤ 0                                        (2)
               h_1(x) + h_2(z) = 0,

where f_1 and f_2 are as in Problem 1, g_0 : R^{n_1} → R^p is convex in every component, and h_1 : R^{n_1} → R^m and h_2 : R^{n_2} → R^m are affine functions. In the following we assume that the problem is feasible, i.e., that a feasible solution exists, and that strong duality holds.
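To make the consensus formulation of Section 2 concrete: the following is a minimal, single-process sketch of standard consensus ADMM in scaled dual form (u_i = λ_i/ρ) for a least-squares objective split over three blocks. The quadratic losses and the resulting closed-form x_i-updates are illustrative choices of ours, not part of the extension developed in this paper.

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, iters=300):
    """Consensus ADMM for min_x sum_i ||A_i x - b_i||^2, where block i would
    live on its own compute node and keeps a local copy x_i coupled via x_i = z.
    Scaled dual form: u_i = lambda_i / rho."""
    n = As[0].shape[1]
    m = len(As)
    xs = [np.zeros(n) for _ in range(m)]
    us = [np.zeros(n) for _ in range(m)]
    z = np.zeros(n)
    for _ in range(iters):
        for i in range(m):
            # x_i-update: argmin ||A_i x - b_i||^2 + (rho/2) ||x - z + u_i||^2,
            # which for a quadratic loss has the closed form below
            H = 2.0 * As[i].T @ As[i] + rho * np.eye(n)
            xs[i] = np.linalg.solve(H, 2.0 * As[i].T @ bs[i] + rho * (z - us[i]))
        # z-update: averaging on the central node
        z = np.mean([xs[i] + us[i] for i in range(m)], axis=0)
        # dual updates, local to each block
        for i in range(m):
            us[i] += xs[i] - z
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true
As, bs = np.split(A, 3), np.split(b, 3)   # three "compute nodes"
z = consensus_admm(As, bs)
```

Since the linear system is consistent, the consensus iterate z converges to the global least-squares solution x_true.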


A sufficient condition for strong duality is that the interior of the feasible region is non-empty. This condition is known as Slater's condition for convex optimization problems [Slater, 1950].

Our extension of ADMM for solving Problem 2 and its convergence analysis work with an equivalent reformulation of Problem 2, where we replace g_0(x) by

    g(x) = max{0, g_0(x)}²,

with componentwise maximum, and thus turn the convex inequality constraints into convex equality constraints. Hence, in the following we consider optimization problems of the form

    min_{x,z}  f_1(x) + f_2(z)
    s.t.       g(x) = 0                                          (3)
               h_1(x) + h_2(z) = 0,

where g(x) = max{0, g_0(x)}², which by construction is again convex in every component. Note, though, that the constraint g(x) = 0 is no longer affine. However, we show in the following that Problem 3 can still be solved efficiently.

Analogously to ADMM, our extension builds on the augmented Lagrangian for Problem 3, which is the function

    L_ρ(x, z, µ, λ) = f_1(x) + f_2(z) + (ρ/2) ||g(x)||² + µᵀ g(x)
                      + (ρ/2) ||h_1(x) + h_2(z)||² + λᵀ (h_1(x) + h_2(z)),

where µ ∈ R^p and λ ∈ R^m are Lagrange multipliers, ρ > 0 is some constant, and ||·|| denotes the Euclidean norm. The Lagrange multipliers are also referred to as dual variables.

Algorithm 1 is our extension of ADMM for solving instances of Problem 3. It runs in iterations. In the (k+1)-th iteration the primal variables x^k and z^k as well as the dual variables µ^k and λ^k are updated.

Algorithm 1: ADMM for problems with non-linear constraints
 1: input: instance of Problem 3
 2: output: approximate solution x ∈ R^{n_1}, z ∈ R^{n_2}, µ ∈ R^p, λ ∈ R^m
 3: initialize x^0 = 0, z^0 = 0, µ^0 = 0, λ^0 = 0, and ρ to some constant > 0
 4: repeat
 5:     x^{k+1} := argmin_x L_ρ(x, z^k, µ^k, λ^k)
 6:     z^{k+1} := argmin_z L_ρ(x^{k+1}, z, µ^k, λ^k)
 7:     µ^{k+1} := µ^k + ρ g(x^{k+1})
 8:     λ^{k+1} := λ^k + ρ (h_1(x^{k+1}) + h_2(z^{k+1}))
 9: until convergence
10: return x^k, z^k, µ^k, λ^k

4 Convergence Analysis

From duality theory we know that for all x ∈ R^{n_1} and z ∈ R^{n_2},

    L_0(x*, z*, µ*, λ*) ≤ L_0(x, z, µ*, λ*),                     (4)

where L_0 is the Lagrangian of Problem 3 and x*, z*, µ*, and λ* are optimal primal and dual variables. Note that x*, z*, µ*, and λ* are not necessarily unique; here, they refer to just one optimal solution. Also note that the Lagrangian is identical to the augmented Lagrangian with ρ = 0. Given that strong duality holds, the optimal value of the original Problem 3 is identical to the optimal value of the Lagrangian dual.

We need a few more definitions. Let f^k = f_1(x^k) + f_2(z^k) be the objective function value at the k-th iterate (x^k, z^k) and let f* be the optimal function value. Let r_g^k = g(x^k) be the residual of the nonlinear equality constraints, i.e., the constraints originating from the convex inequality constraints, and let r_h^k = h_1(x^k) + h_2(z^k) be the residual of the linear equality constraints in iteration k. Our goal in this section is to prove the following theorem.

Theorem 1. When Algorithm 1 is applied to an instance of Problem 3, then

    lim_{k→∞} r_g^k = 0,   lim_{k→∞} r_h^k = 0,   and   lim_{k→∞} f^k = f*.

The theorem states primal feasibility and convergence of the primal objective function value. Note, however, that convergence to primal optimal points x* and z* cannot be guaranteed. This is the case for the original ADMM as well. Additional assumptions on the problem, like, for instance, a unique optimum, are necessary to guarantee convergence to the primal optimal points. However, the points x^k, z^k will be primal optimal and feasible up to an arbitrarily small error for sufficiently large k.

The proof of Theorem 1 can be found in the full version of this paper [Giesen and Laue, 2016].

5 Distributing Constraints

Finally, we are ready to discuss the main problem that we set out to address in this paper, namely solving general convex optimization problems with many constraints in a distributed setting by distributing the constraints. That is, we want to address optimization problems of the form

    min_x  f(x)
    s.t.   g_i(x) ≤ 0,  i = 1 … p                                (5)
           h_i(x) = 0,  i = 1 … m,

where f : R^n → R and g_i : R^n → R^{p_i} are convex functions, and h_i : R^n → R^{m_i} are affine functions. In total, we have p_1 + p_2 + … + p_p inequality constraints that are grouped together into p batches and m_1 + m_2 + … + m_m equality constraints that are subdivided into m groups. For distributing the constraints we can assume without loss of generality that m = p. That is, we have m batches that each contain p_i inequality and m_i equality constraints.

Again, it is easier to work with an equivalent reformulation of Problem 5, where each batch of equality and inequality constraints shares the same variables x_i, namely problems of the form

    min_{x_i,z}  Σ_{i=1}^m f(x_i)
    s.t.         max{0, g_i(x_i)}² = 0,  i = 1 … m                (6)
                 h_i(x_i) = 0,           i = 1 … m
                 x_i = z,

where all the variables x_i are coupled through the affine constraints x_i = z.
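Algorithm 1 can be sketched in a few lines of code. The tiny one-dimensional instance of Problem 3 below is our own illustrative choice, and scipy's general-purpose minimize stands in for the two argmin oracles in lines 5 and 6; this is a sketch, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny instance of Problem 3 (our own illustrative choice):
#   min (x - 2)^2 + z^2   s.t.  0.5 - x <= 0,   x - z = 0,
# i.e., f1(x) = (x-2)^2, f2(z) = z^2, g0(x) = 0.5 - x, h1(x) + h2(z) = x - z.
# The optimum is x* = z* = 1 with value f* = 2 (the inequality is inactive there).
f1 = lambda x: (x - 2.0) ** 2
f2 = lambda z: z ** 2
g = lambda x: max(0.0, 0.5 - x) ** 2       # reformulated equality constraint g(x) = 0
h = lambda x, z: x - z                     # affine equality constraint

rho = 1.0
x, z, mu, lam = 0.0, 0.0, 0.0, 0.0

def L_aug(x, z, mu, lam):
    # augmented Lagrangian of Problem 3
    return (f1(x) + f2(z)
            + 0.5 * rho * g(x) ** 2 + mu * g(x)
            + 0.5 * rho * h(x, z) ** 2 + lam * h(x, z))

for k in range(300):
    # lines 5-8 of Algorithm 1
    x = minimize(lambda v: L_aug(v[0], z, mu, lam), [x]).x[0]
    z = minimize(lambda v: L_aug(x, v[0], mu, lam), [z]).x[0]
    mu += rho * g(x)
    lam += rho * h(x, z)
```

On this instance the iterates approach x = z = 1 with objective value 2, and both residuals g(x) and x − z vanish, matching Theorem 1.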


To keep our exposition simple, the objective function has been scaled by m in the reformulation.

For specializing our extension of ADMM to instances of Problem 6 we need the augmented Lagrangian of this problem, which reads as

    L_ρ(x_i, z, µ_{i,g}, µ_{i,h}, λ) =
        Σ_{i=1}^m f(x_i)
        + (ρ/2) Σ_{i=1}^m ||max{0, g_i(x_i)}²||²  +  Σ_{i=1}^m (µ_{i,g})ᵀ max{0, g_i(x_i)}²
        + (ρ/2) Σ_{i=1}^m ||h_i(x_i)||²           +  Σ_{i=1}^m (µ_{i,h})ᵀ h_i(x_i)
        + (ρ/2) Σ_{i=1}^m ||x_i − z||²            +  Σ_{i=1}^m (λ_i)ᵀ (x_i − z),

where µ_{i,g}, µ_{i,h}, and λ_i are the Lagrange multipliers (dual variables).

Note that the Lagrangian is separable. Hence, the update of the x variables in Line 5 of Algorithm 1 decomposes into the following m independent updates,

    x_i^{k+1} = argmin_{x_i}  f(x_i) + (ρ/2) ||max{0, g_i(x_i)}²||² + (µ_{i,g}^k)ᵀ max{0, g_i(x_i)}²
                              + (ρ/2) ||h_i(x_i)||² + (µ_{i,h}^k)ᵀ h_i(x_i)
                              + (ρ/2) ||x_i − z^k||² + (λ_i^k)ᵀ (x_i − z^k),

which can be solved in parallel once the constraints g_i(x_i) and h_i(x_i) have been distributed over m different compute nodes. Note that each update is an unconstrained, convex optimization problem, because the functions that need to be minimized are sums of convex functions. The only two summands where this might not be obvious are

    (ρ/2) ||max{0, g_i(x_i)}²||²   and   (µ_{i,g}^k)ᵀ max{0, g_i(x_i)}².

For the first term, note that the squared norm of a non-negative, convex function is again convex. The second term is convex because it can be shown by induction that the µ_{i,g}^k are always non-negative.

The update of the z variable in Line 6 of Algorithm 1 amounts to solving the following unconstrained optimization problem,

    z^{k+1} = argmin_z  Σ_{i=1}^m (ρ/2) ||x_i^{k+1} − z||² + Σ_{i=1}^m (λ_i^k)ᵀ (x_i^{k+1} − z)
            = (ρ Σ_{i=1}^m x_i^{k+1} + Σ_{i=1}^m λ_i^k) / (ρ · m),

and the updates of the dual variables µ_i and λ_i are as follows:

    µ_{i,g}^{k+1} = µ_{i,g}^k + ρ max{0, g_i(x_i^{k+1})}²,
    µ_{i,h}^{k+1} = µ_{i,h}^k + ρ h_i(x_i^{k+1}),
    λ_i^{k+1}     = λ_i^k + ρ (x_i^{k+1} − z^{k+1}).

That is, in each iteration there are m independent, unconstrained minimization problems that can be solved in parallel on different compute nodes. The solutions of the independent subproblems are then combined on a central node through the update of the z variables and the Lagrange multipliers. Actually, since the Lagrange multipliers µ_{i,g} and µ_{i,h} are local, i.e., involve only the variables x_i^{k+1} for any given index i, they can also be updated in parallel on the same compute nodes where the x_i^k updates take place. Only the variable z and the Lagrange multipliers λ_i need to be updated centrally.

Looking at the update rules, it becomes apparent that Algorithm 1, when applied to instances of Problem 6, is basically a combination of the standard augmented Lagrangian method [Hestenes, 1969; Powell, 1969] for solving convex constrained optimization problems and ADMM for solving convex optimization problems in a distributed fashion.

6 Experiments

We have implemented our extension of ADMM in Python using the NumPy and SciPy libraries, and tested this implementation on the robust SVM problem [Shivaswamy et al., 2006], which has a second order cone constraint for every data point. In our experiments we distributed these constraints onto different compute nodes, where we had to solve an unconstrained optimization problem in every iteration. Since there is no other approach available that can deal with a large number of arbitrary constraints in a distributed manner, we compare our approach to the baseline approach of running an augmented Lagrangian method in an outer loop and standard ADMM in an inner loop. Note that this approach has three nested loops. The outer loop turns the constrained problem into a sequence of unconstrained problems (augmented Lagrangians), the next loop distributes the problem using distributed ADMM, and the final inner loop solves the unconstrained subproblems using the L-BFGS-B algorithm [Morales and Nocedal, 2011; Zhu et al., 1997] in our implementation.

Robust SVMs

The robust SVM problem has been designed for binary classification problems whose input is not just labeled data points (x⁽¹⁾, y⁽¹⁾), …, (x⁽ⁿ⁾, y⁽ⁿ⁾), where the x⁽ⁱ⁾ are feature vectors and the y⁽ⁱ⁾ are binary labels, but a distribution over the feature vectors. That is, the labels are assumed to be known precisely and the uncertainty is only in the features. The idea behind the robust SVM is to replace the constraints (for feature vectors without uncertainty) of the standard linear soft-margin SVM by their probabilistic counterparts

    Pr[ y⁽ⁱ⁾ wᵀ x⁽ⁱ⁾ ≥ 1 − ξ_i ] ≥ 1 − δ_i,

which require the now random variable x⁽ⁱ⁾ to be on the correct side of the hyperplane with normal vector w with probability at least 1 − δ_i ≥ 0. Shivaswamy et al. show that the probabilistic constraints can be written as second order cone constraints

    y⁽ⁱ⁾ wᵀ x̄⁽ⁱ⁾ ≥ 1 − ξ_i + √(δ_i/(1 − δ_i)) ||Σ_i^{1/2} w||,

under the assumption that the mean of the random variable x⁽ⁱ⁾ is the empirical mean x̄⁽ⁱ⁾ and the covariance matrix of x⁽ⁱ⁾ is Σ_i.
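The per-node updates derived above for Problem 6 can be sketched as follows. The two-batch toy instance is our own illustrative assumption, and scipy's general-purpose minimize stands in for the per-node argmin solvers; the z-update and the multiplier updates use the closed forms given earlier in this section.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of Problem 6 (our own illustrative choice): two constraint
# batches for  min ||x - (2, 0)||^2  subject to
#   batch 1:  x[0] - x[1] = 0   (affine equality)
#   batch 2:  x[0] - 5  <= 0    (convex inequality, inactive at the optimum)
# The optimal point is x* = (1, 1).
m, n, rho = 2, 2, 1.0
c = np.array([2.0, 0.0])
f = lambda x: np.sum((x - c) ** 2) / m                               # objective, split over the m batches
gs = [lambda x: np.array([0.0]), lambda x: np.array([x[0] - 5.0])]   # inequality batches
hs = [lambda x: np.array([x[0] - x[1]]), lambda x: np.array([0.0])]  # equality batches

xs = [np.zeros(n) for _ in range(m)]
mus_g = [np.zeros(1) for _ in range(m)]
mus_h = [np.zeros(1) for _ in range(m)]
lams = [np.zeros(n) for _ in range(m)]
z = np.zeros(n)

def local_obj(x, i):
    # the x_i-part of the augmented Lagrangian of Problem 6
    gq = np.maximum(0.0, gs[i](x)) ** 2
    return (f(x)
            + 0.5 * rho * np.sum(gq ** 2) + mus_g[i] @ gq
            + 0.5 * rho * np.sum(hs[i](x) ** 2) + mus_h[i] @ hs[i](x)
            + 0.5 * rho * np.sum((x - z) ** 2) + lams[i] @ (x - z))

for k in range(400):
    # x_i-updates: independent, would run on separate compute nodes
    for i in range(m):
        xs[i] = minimize(lambda v, i=i: local_obj(v, i), xs[i]).x
    # closed-form z-update on the central node
    z = (rho * sum(xs) + sum(lams)) / (rho * m)
    # dual updates (the mu's are local to each node, the lambda's central)
    for i in range(m):
        mus_g[i] = mus_g[i] + rho * np.maximum(0.0, gs[i](xs[i])) ** 2
        mus_h[i] = mus_h[i] + rho * hs[i](xs[i])
        lams[i] = lams[i] + rho * (xs[i] - z)
```

On this instance the local copies x_i and the consensus variable z approach the optimum (1, 1), illustrating the single-loop combination of augmented Lagrangian steps and consensus ADMM.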


Figure 1: Various statistics for the performance of the distributed ADMM extension on an instance of the robust SVM problem. The convergence proof only states that the value V^k must be monotonically decreasing, which can also be observed experimentally in the figure on the bottom right. Neither the primal function value nor the residuals need to be monotonically decreasing, and as can be seen in the figures on the top, they indeed do not decrease monotonically.

The robust SVM problem is then the following SOCP (second order cone program):

    min_{w,ξ}  (1/2) ||w||² + c Σ_{i=1}^n ξ_i
    s.t.       y⁽ⁱ⁾ wᵀ x̄⁽ⁱ⁾ ≥ 1 − ξ_i + √(δ_i/(1 − δ_i)) ||Σ_i^{1/2} w||
               ξ_i ≥ 0,  i = 1 … n.

This problem can be reformulated into the form of Problem 6 and is thus amenable to a distributed implementation of our extension of ADMM.

Experimental Setup

We generated random data sets similarly to [Andersen et al., 2012], where an interior point solver for the robust SVM problem has been described. The set of feature vectors was sampled from a uniform distribution on [−1, 1]ⁿ. The covariance matrices Σ_i were randomly chosen from the cone of positive semidefinite matrices with entries in the interval [−1, 1], and δ_i was set to 1/2. Each data point contributes exactly one constraint to the problem and is assigned to only one of the compute nodes.

In the following, the primal optimization variables are w and ξ, the consensus variables for the primal optimization variables w are still denoted as z, and the dual variables are still denoted as λ for the consensus constraints and µ for the convex constraints, respectively.

Convergence Results

Figure 1 shows the primal objective function value f^k, the norms of the residuals r_g^k and r_h^k, the distances ||z^k − z*|| and ||λ^k − λ*||, and the value V^k for one run of our algorithm on two compute nodes. Note that only V^k must be strictly monotonically decreasing according to our convergence analysis. The proof does not make any statement about the monotonicity of the other values, and as can be seen in Figure 1, such statements would actually not be true. All values decrease in the long run, but not necessarily monotonically.

As can be seen in Figure 1 (top left), the function value f^k actually increases for the first few iterations, while the residuals r_g^k for the inequality constraints become very small, see Figure 1 (top middle). That is, within the first iterations each compute node finds a solution to its share of the data that is almost feasible but has a higher function value than the true optimal solution. This basically means that the errors ξ_i for the data points are over-estimated. After a few more iterations the primal function value drops and the inequality residuals increase, meaning that the error terms ξ_i as well as the individual estimators w_i converge to their optimal values.

In the long run, the local estimators at the different compute nodes converge to the same solution. This is witnessed in Figure 1 (top right), where one can see that the residuals r_h^k for the consensus constraints converge to zero, i.e., consensus among the compute nodes is reached in the long run.

Finally, it can be seen that the consensus estimator z^k converges to its unique optimal point z*. Note that in general we cannot guarantee such convergence, since the optimal point need not be unique. In the special case that the optimal point is unique we always have convergence to this point.

Scalability Results

Figure 2 shows the scalability of our extension of ADMM in terms of the number of compute nodes, the number of data points, and the approximation quality, respectively. All running times were measured in terms of iterations and averaged over runs for ten randomly generated data sets.



Figure 2: Running times of the algorithm on the robust SVM problem. The figure on the left shows that the number of iterations increases mildly with the number of compute nodes. The middle figure shows that the number of iterations decreases with an increasing number of data points. The figure on the right shows the dependency of the number of iterations on the distance ||z^k − z*||_∞ of the consensus estimator z^k in iteration k to the optimal estimator z*. It can be seen that our extension of ADMM outperforms the baseline approach with three nested loops.
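The second order cone constraint of the robust SVM, as stated in the SOCP above, can be evaluated directly. The helper below is a hypothetical function of ours (not from the paper's code); it uses the identity ||Σ^{1/2} w|| = √(wᵀ Σ w) for a positive semidefinite covariance Σ.

```python
import numpy as np

# Hypothetical helper evaluating the left-hand side of the robust SVM
# second order cone constraint, written in the form g0 <= 0:
#   g0 = 1 - xi_i + sqrt(delta_i / (1 - delta_i)) * ||Sigma_i^{1/2} w|| - y_i * w^T x_bar_i
def soc_constraint(w, xi_i, x_bar_i, y_i, Sigma_i, delta_i):
    kappa = np.sqrt(delta_i / (1.0 - delta_i))
    # ||Sigma^{1/2} w|| = sqrt(w^T Sigma w) for PSD Sigma
    return 1.0 - xi_i + kappa * np.sqrt(w @ Sigma_i @ w) - y_i * (w @ x_bar_i)

# With delta_i = 1/2 (as in the experiments) the factor kappa equals 1.
w = np.array([3.0, 4.0])
val = soc_constraint(w, 0.0, np.array([1.0, 0.0]), 1.0, np.eye(2), 0.5)
# here ||w|| = 5 and w @ x_bar = 3, so val = 1 - 0 + 5 - 3 = 3.0
```

A positive value indicates a violated constraint; in the reformulation of Problem 6 this is exactly the quantity fed into the squared hinge max{0, g0}².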

For the baseline we used the straightforward approach with three nested loops.

(1) For measuring the scalability in terms of the number of employed compute nodes, we generated 10,000 data points with 10,000 features. As stopping criterion we used ||z^k − z*||_∞ ≤ 5 · 10⁻³, i.e., the predictor z^k had to be close to the optimum. Here we use the infinity norm to be independent of the number of dimensions. The data set was split into four, eight, twelve, and 16 equally sized batches that were distributed among the compute nodes. Note that every batch had many fewer data points than features, and thus the optimal solutions to the respective problems at the compute nodes were quite different from each other. Nevertheless, our algorithm converged very well to the globally optimal solution. Only the convergence speed was affected by the diversity of the local solutions at the different compute nodes. Since we kept the total number of data points in our experiments fixed, the diversity was increasing with the number of compute nodes, as each node was assigned fewer data points. Hence it was expected that the convergence speed decreases, i.e., that the number of iterations increases, with a growing number of compute nodes. The expected behavior can be seen in Figure 2 (left).

(2) For measuring the scalability in terms of the number of data points, we increased the number of data points but kept the number of features fixed at 200. The stopping criterion for our algorithm was again ||z^k − z*||_∞ ≤ 5 · 10⁻³. We used eight compute nodes to compute the solutions, and again the points were distributed equally among them. This time one would expect a decreasing running time with an increasing number of data points, because the number of data points per machine increases and thus the diversity of the local solutions at the different compute nodes decreases. That is, with an increasing number of data points, it should take fewer iterations to reach an approximate consensus about the global solution among the compute nodes. The results of the experiment, shown in Figure 2 (middle), confirm this expectation: the number of iterations indeed decreases with a growing number of data points. This is similar to [Shalev-Shwartz and Srebro, 2008], who have also observed that an increasing number of data points can decrease the work required for a good predictor.

(3) For measuring the scalability in terms of the approximation quality, we generated 8,000 data points in 200 dimensions. Again, eight compute nodes were used for the experiments, whose results are shown in Figure 2 (right). As expected, the number of iterations (running time) increases with increasing approximation quality, which was again measured in terms of the infinity norm. In this paper we do not provide a theoretical convergence rate analysis, which we leave for future work, but the experimental results shown here already provide some intuition on the dependency of the number of iterations on the approximation quality: it seems that our extension of ADMM can solve problems to medium accuracy within a reasonable number of iterations, but higher accuracy requires a significant increase in the number of iterations. Such behavior is well known for standard ADMM without constraints. In the context of our example application, robust SVMs, medium accuracy is usually sufficient, as higher accuracy solutions often do not provide better predictors, a phenomenon also known as regularization by early stopping.

7 Conclusions

Despite the vast literature on ADMM, to the best of our knowledge, no scheme for distributing general convex constraints has been studied before. Here we have closed this gap by combining ADMM and the augmented Lagrangian method for solving general convex optimization problems with many convex constraints. The straightforward combination of ADMM and the augmented Lagrangian method entails three nested loops: an outer loop for reaching consensus, one loop for the constraints, and an inner loop for solving unconstrained problems. Our main contribution is showing that the loops for reaching consensus and for handling constraints can be merged, resulting in a scheme with only two nested loops. We provide the first convergence proof for such a lazy algorithmic scheme.

Acknowledgments

This work has been supported by DFG grant GI-711/5-1 within the Priority Program 1736 Algorithms for Big Data.


References

[Andersen et al., 2012] Martin Andersen, Joachim Dahl, Zhang Liu, and Lieven Vandenberghe. Interior-Point Methods for Large-Scale Cone Programming, pages 55–84. MIT Press, 2012.

[Boyd et al., 2011] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[Gabay and Mercier, 1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

[Ghadimi et al., 2015] Euhanna Ghadimi, André Teixeira, Iman Shames, and Mikael Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Trans. Automat. Contr., 60(3):644–658, 2015.

[Giesen and Laue, 2016] Joachim Giesen and Sören Laue. Distributed convex optimization with many convex constraints. CoRR, abs/1610.02967, 2016.

[Glowinski and Marroco, 1975] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis – Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975.

[Hestenes, 1969] Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

[Huang and Sidiropoulos, 2016] Kejun Huang and Nicholas D. Sidiropoulos. Consensus-ADMM for general quadratically constrained quadratic programming. IEEE Transactions on Signal Processing, 64(20):5297–5310, 2016.

[Morales and Nocedal, 2011] José Luis Morales and Jorge Nocedal. Remark on "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization". ACM Trans. Math. Softw., 38(1):7:1–7:4, 2011.

[Mosk-Aoyama et al., 2010] Damon Mosk-Aoyama, Tim Roughgarden, and Devavrat Shah. Fully distributed algorithms for convex optimization problems. SIAM Journal on Optimization, 20(6):3260–3279, 2010.

[Powell, 1969] M. J. D. Powell. Algorithms for nonlinear constraints that use Lagrangian functions. Mathematical Programming, 14(1):224–248, 1969.

[Shalev-Shwartz and Srebro, 2008] Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: inverse dependence on training set size. In International Conference on Machine Learning (ICML), pages 928–935, 2008.

[Shivaswamy et al., 2006] Pannagadatta K. Shivaswamy, Chiranjib Bhattacharyya, and Alexander J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.

[Slater, 1950] Morton Slater. Lagrange multipliers revisited. Cowles Foundation Discussion Papers 80, Cowles Foundation for Research in Economics, Yale University, 1950.

[Tsang et al., 2007] Ivor W. Tsang, András Kocsor, and James T. Kwok. Simpler core vector machines with enclosing balls. In International Conference on Machine Learning (ICML), pages 911–918, 2007.

[Zhang and Kwok, 2014] Ruiliang Zhang and James T. Kwok. Asynchronous distributed ADMM for consensus optimization. In International Conference on Machine Learning (ICML), pages 1701–1709, 2014.

[Zhu and Martínez, 2012] Minghui Zhu and Sonia Martínez. On distributed convex optimization under inequality and equality constraints. IEEE Trans. Automat. Contr., 57(1):151–164, 2012.

[Zhu et al., 1997] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, 1997.