Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)
with three nested loops, the outer loop for the augmented Lagrangian method and the standard two nested inner loops for ADMM. Thus, one could assume that any distributed solver for constrained optimization problems needs at least three nested loops: one for reaching consensus, one for the constraints, and one for the unconstrained problems. The key contribution of our paper is showing that this is not the case. One of the nested loops can be avoided by merging the loops for reaching consensus and dealing with the constraints. Our approach, which needs only two nested loops, combines ADMM with the augmented Lagrangian method differently than the direct approach of running the augmented Lagrangian method in the outer and ADMM in the inner loop. The latter combination, which to our surprise has not been discussed in the literature before, still provides us with a good baseline in the experimental section.

Related Work
To the best of our knowledge, our extension of ADMM is the first distributed algorithm for solving general convex optimization problems with no restrictions on the type of constraints or assumptions on the structure of the problem. The only special case that we are aware of are quadratically constrained quadratic problems, which have been addressed by [Huang and Sidiropoulos, 2016]. However, their approach, which builds on consensus ADMM, does not scale, since every constraint gives rise to a new subproblem.
[Mosk-Aoyama et al., 2010] have designed and analyzed a distributed algorithm for solving convex optimization problems with a separable objective function and linear equality constraints. Their algorithm blends a gossip-based information-spreading, iterative gradient ascent method with the barrier method from interior-point algorithms. It is similar to ADMM and can also handle only linear constraints.
[Zhu and Martínez, 2012] have introduced a distributed multiagent algorithm for minimizing a convex function that is the sum of local functions subject to a global equality or inequality constraint. Their algorithm involves projections onto local constraint sets that are usually as hard to compute as solving the original problem with general constraints. For instance, it is well known via standard duality theory that the feasibility problem for linear programs is as hard as solving linear programs. This holds true for general convex optimization problems with vanishing duality gap.
In principle, the standard ADMM can also handle convex constraints by transforming them into indicator functions that are added to the objective function. However, this leads to subproblems, to be solved in each iteration, that entail computing a projection onto the feasible region. This entails the same issues as the method by [Zhu and Martínez, 2012], since computing these projections can be as hard as solving the original problem.
The recent literature on ADMM is vast. Most papers on ADMM stay in the standard framework of optimizing a function or a sum of functions subject to linear constraints. Exemplarily for many others, we just mention [Zhang and Kwok, 2014], who provide convergence guarantees for asynchronous ADMM, and [Ghadimi et al., 2015], who study optimal penalty parameter selection.

2 Alternating Direction Method of Multipliers
Here, we briefly review the alternating direction method of multipliers (ADMM) and discuss how it can be adapted for dealing with distributed data, before we extend it for handling convex constraints in the next section.
ADMM is an iterative algorithm that in its most general form can solve convex optimization problems of the form

  min_{x,z}  f_1(x) + f_2(z)
  s.t.  Ax + Bz − c = 0,                                       (1)

where f_1 : R^{n_1} → R ∪ {∞} and f_2 : R^{n_2} → R ∪ {∞} are convex functions, A ∈ R^{m×n_1} and B ∈ R^{m×n_2} are matrices, and c ∈ R^m.
ADMM can obviously deal with linear equality constraints, but it can also handle linear inequality constraints. The latter are reduced to linear equality constraints by replacing constraints of the form Ax ≤ b by Ax + s = b, adding the slack variable s to the set of optimization variables, and setting f_2(s) = 1_{R^m_+}(s), where

  1_{R^m_+}(s) = 0 if s ≥ 0, and ∞ otherwise,

is the indicator function of the set R^m_+ = {x ∈ R^m | x ≥ 0}. Note that f_1 and f_2 are allowed to take the value ∞.
Recently, ADMM has regained a lot of attention, because it makes it possible to solve problems with a separable objective function in a distributed setting. Such problems are typically given as

  min_x  Σ_i f_i(x),

where f_i corresponds to the i-th data point (or, more generally, the i-th data block) and x is a weight vector that describes the data model. This problem can be transformed into an equivalent optimization problem with individual weight vectors x_i for each data point (data block) that are coupled through an equality constraint,

  min_{x_i,z}  Σ_i f_i(x_i)
  s.t.  x_i − z = 0  ∀ i,

which is a special case of Problem 1 that can be solved by ADMM in a distributed setting by distributing the data.

3 ADMM Extension
Adding convex inequality constraints to Problem 1 does not destroy the convexity of the problem, but so far ADMM cannot deal with such constraints. Note that the problem only remains convex if all equality constraints are induced by affine functions. That is, we cannot add convex equality constraints in general without destroying convexity. Hence, here we consider convex optimization problems in their most general form

  min_{x,z}  f_1(x) + f_2(z)
  s.t.  g_0(x) ≤ 0                                             (2)
        h_1(x) + h_2(z) = 0,

where f_1 and f_2 are as in Problem 1, g_0 : R^{n_1} → R^p is convex in every component, and h_1 : R^{n_1} → R^m and h_2 : R^{n_2} → R^m are affine functions. In the following we assume that the problem is feasible, i.e., that a feasible solution
exists, and that strong duality holds. A sufficient condition for strong duality is that the interior of the feasible region is non-empty. This condition is known as Slater's condition for convex optimization problems [Slater, 1950].
Our extension of ADMM for solving Problem 2 and its convergence analysis work with an equivalent reformulation of Problem 2, where we replace g_0(x) by

  g(x) = max{0, g_0(x)}²,

with componentwise maximum, and turn the convex inequality constraints into convex equality constraints. Thus, in the following we consider optimization problems of the form

  min_{x,z}  f_1(x) + f_2(z)
  s.t.  g(x) = 0                                               (3)
        h_1(x) + h_2(z) = 0,

where g(x) = max{0, g_0(x)}², which by construction is again convex in every component. Note, though, that the constraint g(x) = 0 is no longer affine. However, we show in the following that Problem 3 can still be solved efficiently.
Analogously to ADMM, our extension builds on the Augmented Lagrangian for Problem 3, which is the following function:

  L_ρ(x, z, µ, λ) = f_1(x) + f_2(z) + (ρ/2) ‖g(x)‖² + µ^⊤ g(x)
                    + (ρ/2) ‖h_1(x) + h_2(z)‖² + λ^⊤ (h_1(x) + h_2(z)),

where µ ∈ R^p and λ ∈ R^m are Lagrange multipliers, ρ > 0 is some constant, and ‖·‖ denotes the Euclidean norm. The Lagrange multipliers are also referred to as dual variables.
Algorithm 1 is our extension of ADMM for solving instances of Problem 3. It runs in iterations. In the (k+1)-th iteration the primal variables x^k and z^k as well as the dual variables µ^k and λ^k are updated.

Algorithm 1 ADMM for problems with non-linear constraints
 1: input: instance of Problem 3
 2: output: approximate solution x ∈ R^{n_1}, z ∈ R^{n_2}, µ ∈ R^p, λ ∈ R^m
 3: initialize x^0 = 0, z^0 = 0, µ^0 = 0, λ^0 = 0, and ρ to some constant > 0
 4: repeat
 5:   x^{k+1} := argmin_x L_ρ(x, z^k, µ^k, λ^k)
 6:   z^{k+1} := argmin_z L_ρ(x^{k+1}, z, µ^k, λ^k)
 7:   µ^{k+1} := µ^k + ρ g(x^{k+1})
 8:   λ^{k+1} := λ^k + ρ (h_1(x^{k+1}) + h_2(z^{k+1}))
 9: until convergence
10: return x^k, z^k, µ^k, λ^k

4 Convergence Analysis
From duality theory we know that for all x ∈ R^{n_1} and z ∈ R^{n_2}

  L_0(x^∗, z^∗, µ^∗, λ^∗) ≤ L_0(x, z, µ^∗, λ^∗),               (4)

where L_0 is the Lagrangian of Problem 3 and x^∗, z^∗, µ^∗, and λ^∗ are optimal primal and dual variables. Note that x^∗, z^∗, µ^∗, and λ^∗ are not necessarily unique. Here, they refer just to one optimal solution. Also note that the Lagrangian is identical to the Augmented Lagrangian with ρ = 0. Given that strong duality holds, the optimal solution to the original Problem 3 is identical to the optimal solution of the Lagrangian dual.
We need a few more definitions. Let f^k = f_1(x^k) + f_2(z^k) be the objective function value at the k-th iterate (x^k, z^k) and let f^∗ be the optimal function value. Let r_g^k = g(x^k) be the residual of the nonlinear equality constraints, i.e., the constraints originating from the convex inequality constraints, and let r_h^k = h_1(x^k) + h_2(z^k) be the residual of the linear equality constraints in iteration k.
Our goal in this section is to prove the following theorem.
Theorem 1. When Algorithm 1 is applied to an instance of Problem 3, then

  lim_{k→∞} r_g^k = 0,  lim_{k→∞} r_h^k = 0,  and  lim_{k→∞} f^k = f^∗.

The theorem states primal feasibility and convergence of the primal objective function value. Note, however, that convergence to primal optimal points x^∗ and z^∗ cannot be guaranteed. This is the case for the original ADMM as well. Additional assumptions on the problem, like, for instance, a unique optimum, are necessary to guarantee convergence to the primal optimal points. However, the points x^k, z^k will be primal optimal and feasible up to an arbitrarily small error for sufficiently large k.
The proof of Theorem 1 can be found in the full version of this paper [Giesen and Laue, 2016].

5 Distributing Constraints
Finally, we are ready to discuss the main problem that we set out to address in this paper, namely solving general convex optimization problems with many constraints in a distributed setting by distributing the constraints. That is, we want to address optimization problems of the form

  min_x  f(x)
  s.t.  g_i(x) ≤ 0,  i = 1 ... p                               (5)
        h_i(x) = 0,  i = 1 ... m,

where f : R^n → R and g_i : R^n → R^{p_i} are convex functions, and h_i : R^n → R^{m_i} are affine functions. In total, we have p_1 + p_2 + ... + p_p inequality constraints that are grouped together into p batches and m_1 + m_2 + ... + m_m equality constraints that are subdivided into m groups. For distributing the constraints we can assume without loss of generality that m = p. That is, we have m batches that each contain p_i inequality and m_i equality constraints.
Again it is easier to work with an equivalent reformulation of Problem 5, where each batch of equality and inequality constraints shares the same variables x_i, namely problems of the form

  min_{x_i,z}  Σ_{i=1}^m f(x_i)
  s.t.  max{0, g_i(x_i)}² = 0,  i = 1 ... m                    (6)
        h_i(x_i) = 0,  i = 1 ... m
        x_i = z,
where all the variables x_i are coupled through the affine constraints x_i = z. To keep our exposition simple, the objective function has been scaled by m in the reformulation.
For specializing our extension of ADMM to instances of Problem 6 we need the Augmented Lagrangian of this problem, which reads as

  L_ρ(x_i, z, µ_{i,g}, µ_{i,h}, λ) = Σ_{i=1}^m f(x_i)
      + (ρ/2) Σ_{i=1}^m ‖max{0, g_i(x_i)}²‖² + Σ_{i=1}^m (µ_{i,g})^⊤ max{0, g_i(x_i)}²
      + (ρ/2) Σ_{i=1}^m ‖h_i(x_i)‖² + Σ_{i=1}^m (µ_{i,h})^⊤ h_i(x_i)
      + (ρ/2) Σ_{i=1}^m ‖x_i − z‖² + Σ_{i=1}^m (λ_i)^⊤ (x_i − z),

where µ_{i,g}, µ_{i,h}, and λ_i are the Lagrange multipliers (dual variables).
Note that the Lagrange function is separable. Hence, the update of the x variables in Line 5 of Algorithm 1 decomposes into the following m independent updates

  x_i^{k+1} = argmin_{x_i}  f(x_i) + (ρ/2) ‖max{0, g_i(x_i)}²‖² + (µ_{i,g}^k)^⊤ max{0, g_i(x_i)}²
              + (ρ/2) ‖h_i(x_i)‖² + (µ_{i,h}^k)^⊤ h_i(x_i)
              + (ρ/2) ‖x_i − z^k‖² + (λ_i^k)^⊤ (x_i − z^k),

that can be solved in parallel once the constraints g_i(x_i) and h_i(x_i) have been distributed on m different, distributed compute nodes. Note that each update is an unconstrained, convex optimization problem, because the functions that need to be minimized are sums of convex functions. The only two summands where this might not be obvious are

  (ρ/2) ‖max{0, g_i(x_i)}²‖²  and  (µ_{i,g}^k)^⊤ max{0, g_i(x_i)}².

For the first term note that the squared norm of a non-negative, convex function is always convex again. The second term is convex, because it can be shown by induction that the µ_{i,g}^k are always non-negative.
The update of the z variable in Line 6 of Algorithm 1 amounts to solving the following unconstrained optimization problem

  z^{k+1} = argmin_z  Σ_{i=1}^m (ρ/2) ‖x_i^{k+1} − z‖² + Σ_{i=1}^m (λ_i^k)^⊤ (x_i^{k+1} − z)
          = (ρ Σ_{i=1}^m x_i^{k+1} + Σ_{i=1}^m λ_i^k) / (ρ · m),

and the updates of the dual variables µ_i and λ_i are as follows:

  µ_{i,g}^{k+1} = µ_{i,g}^k + ρ max{0, g_i(x_i^{k+1})}²,
  µ_{i,h}^{k+1} = µ_{i,h}^k + ρ h_i(x_i^{k+1}),
  λ_i^{k+1}     = λ_i^k + ρ (x_i^{k+1} − z^{k+1}).

That is, in each iteration there are m independent, unconstrained minimization problems that can be solved in parallel on different compute nodes. The solutions of the independent subproblems are then combined on a central node through the update of the z variables and the Lagrange multipliers. Actually, since the Lagrange multipliers µ_{i,g} and µ_{i,h} are also local, i.e., involve only the variables x_i^{k+1} for any given index i, they can also be updated in parallel on the same compute nodes where the x_i^k updates take place. Only the variable z and the Lagrange multipliers λ_i need to be updated centrally.
Looking at the update rules it becomes apparent that Algorithm 1, when applied to instances of Problem 6, is basically a combination of the standard Augmented Lagrangian method [Hestenes, 1969; Powell, 1969] for solving convex, constrained optimization problems and ADMM for solving convex optimization problems in a distributed fashion.

6 Experiments
We have implemented our extension of ADMM in Python using the NumPy and SciPy libraries, and tested this implementation on the robust SVM problem [Shivaswamy et al., 2006], which has a second order cone constraint for every data point. In our experiments we distributed these constraints onto different compute nodes, where we had to solve an unconstrained optimization problem in every iteration. Since there is no other approach available that could deal with a large number of arbitrary constraints in a distributed manner, we compare our approach to the baseline approach of running an Augmented Lagrangian method in an outer loop and standard ADMM in an inner loop. Note that this approach has three nested loops. The outer loop turns the constrained problem into a sequence of unconstrained problems (Augmented Lagrangians), the next loop distributes the problem using distributed ADMM, and the final inner loop solves the unconstrained subproblems using the L-BFGS-B algorithm [Morales and Nocedal, 2011; Zhu et al., 1997] in our implementation.

Robust SVMs
The robust SVM problem has been designed to deal with binary classification problems whose input is not just labeled data points (x^(1), y^(1)), ..., (x^(n), y^(n)), where the x^(i) are feature vectors and the y^(i) are binary labels, but a distribution over the feature vectors. That is, the labels are assumed to be known precisely and the uncertainty is only in the features. The idea behind the robust SVM is replacing the constraints (for feature vectors without uncertainty) of the standard linear soft-margin SVM by their probabilistic counterparts

  Pr( y^(i) w^⊤ x^(i) ≥ 1 − ξ_i ) ≥ 1 − δ_i,

which require the now random variable x^(i) to be on the correct side of the hyperplane with normal vector w with probability at least 1 − δ_i ≥ 0. Shivaswamy et al. show that the probabilistic constraints can be written as second order cone constraints

  y^(i) w^⊤ x̄^(i) ≥ 1 − ξ_i + √(δ_i / (1 − δ_i)) ‖Σ_i^{1/2} w‖,

under the assumption that the mean of the random variable x^(i) is the empirical mean x̄^(i) and the covariance matrix of
Figure 1: Various statistics for the performance of the distributed ADMM extension on an instance of the robust SVM problem. The convergence proof only states that the value V^k must be monotonically decreasing. This can also be observed experimentally in the figure on the bottom right. Neither the primal function value nor the residuals need to be monotonically decreasing, and as can be seen in the figures on the top, they actually do not decrease monotonically.
x(i) is Σi . The robust SVM problem is then the following for two compute nodes. Note, that only V k must be strictly
SOCP (second order cone program) monotonically decreasing according to our convergence anal-
n
ysis. The proof does not make any statement about the mono-
1 X tonicity of the other values, and as can be seen in Figure 1,
min kwk2 + c ξi
w,ξ 2 such statements would actually not be true. All values de-
i=1
crease in the long run, but are not necessarily monotonically
1/2
s.t. y (i) w> x̄(i) ≥ 1 − ξi +
p
δi /(1 − δi ) Σi w decreasing.
ξi ≥ 0, i = 1 . . . n. As can be seen in Figure 1 (top-left), the function value
f k is actually increasing for the first few iterations, while the
This problem can be reformulated into the form of Problem 6 residuals rgk for the inequality constraints become very small,
and is thus amenable to a distributed implementation of our see Figure 1 (top-middle). That is, within the first iterations
extension of ADMM. each compute node finds a solution to its share of the data
that is almost feasible but has a higher function value than
Experimental Setup the true optimal solution. This basically means that the errors
We generated random data sets similarly to [Andersen et al., ξi for the data points are over-estimated. After a few more
2012], where an interior point solver has been described for iterations the primal function value drops and the inequality
solving the robust SVM problem. The set of feature vectors residuals increase meaning that the error terms ξi as well as
was sampled from a uniform distribution on [−1, 1]n . The the individual estimators wi converge to their optimal values.
covariance matrices Σi were randomly chosen from the cone In the long run, the local estimators at the different com-
of positive semidefinite matrices with entries in the interval pute nodes converge to the same solution. This is witnessed
[−1, 1] and δi has been set to 21 . Each data point contributes in Figure 1 (top-right), where one can see that the residuals rhk
exactly one constraint to the problem and is assigned to only for the consensus constraints converge to zero, i.e., consensus
one of the compute nodes. among the compute nodes is reached in the long run.
In the following, the primal optimization variables are w Finally, it can be seen that the consensus estimator z k con-
and ξ, the consensus variables for the primal optimization verges to its unique optimal point z ∗ . Note that in general we
variables w are still denoted as z, and also the dual variables cannot guarantee such a convergence since the optimal point
are still denoted as λ for the consensus constraints and µ for does not need to be unique. In the special case that the opti-
the convex constraints, respectively. mal point is unique we always have convergence to this point.
Convergence Results Scalability Results
Figure 1 shows the primal objective function value f k , the Figure 2 shows the scalability of our extension of ADMM
norm of the residuals rgk and rhk , the distances kz k − z ∗ k, in terms of the number of compute nodes, data points, and
kλk − λ∗ k, and the value V k of one run of our algorithm approximation quality, respectively. All running times were
Figure 2: Running times of the algorithm on the robust SVM problem. The figure on the left shows that the number of iterations increases mildly with the number of compute nodes. The middle picture shows that the number of iterations is decreasing with increasing number of data points. The figure on the right shows the dependency of the number of iterations on the distance ‖z^k − z^∗‖_∞ of the consensus estimator z^k in iteration k to the optimal estimator z^∗. It can be seen that our extension of ADMM outperforms the baseline approach with three nested loops.
measured in terms of iterations and averaged over runs for ten randomly generated data sets. For the baseline we used the straightforward three-nested-loops approach.
(1) For measuring the scalability in terms of employed compute nodes, we generated 10,000 data points with 10,000 features. As stopping criterion we used ‖z^k − z^∗‖_∞ ≤ 5 · 10^−3, i.e., the predictor z^k had to be close to the optimum. Here we use the infinity norm to be independent from the number of dimensions. The data set was split into four, eight, twelve, and 16 equal sized batches that were distributed among the compute nodes. Note that every batch had much fewer data points than features, and thus the optimal solutions to the respective problems at the compute nodes were quite different from each other. Nevertheless, our algorithm converged very well to the globally optimal solution. Only the convergence speed was affected by the diversity of the local solutions at the different compute nodes. Since we kept the total number of data points in our experiments fixed, the diversity was increasing with the number of compute nodes that were assigned fewer data points each. Hence it was expected that the convergence speed decreases, i.e., the number of iterations increases, with a growing number of compute nodes. The expected behavior can be seen in Figure 2 (left).
(2) For measuring the scalability in terms of the number of data points, we increased the number of data points but kept the number of features fixed at 200. The stopping criterion for our algorithm was again ‖z^k − z^∗‖_∞ ≤ 5 · 10^−3. We used eight compute nodes to compute the solutions. Again, the points were distributed equally among the compute nodes. This time one would expect a decreasing running time with an increasing number of data points, because the number of data points per machine is increasing and thus the diversity of the local solutions at the different compute nodes is decreasing. That is, with an increasing number of data points it should take fewer iterations to reach an approximate consensus about the global solution among the compute nodes. The results of the experiment, shown in Figure 2 (middle), confirm this expectation. The number of iterations indeed decreases with a growing number of data points. This is similar to [Shalev-Shwartz and Srebro, 2008], who have also observed that an increasing number of data points can decrease the work required for a good predictor.
(3) For measuring the scalability in terms of the approximation quality, we generated 8000 data points in 200 dimensions. Again, eight compute nodes were used for the experiments, whose results are shown in Figure 2 (right). As expected, the number of iterations (running time) increases with increasing approximation quality, which was again measured in terms of the infinity norm. In this paper we are not providing a theoretical convergence rate analysis, which we leave for future work, but the experimental results shown here already provide some intuition on the dependency of the number of iterations on the approximation quality: It seems that our extension of ADMM can solve problems to a medium accuracy within a reasonable number of iterations, but higher accuracy requires a significant increase in the number of iterations. Such a behavior is well known for standard ADMM without constraints. In the context of our example application, robust SVMs, medium accuracy usually is sufficient, as higher accuracy solutions often do not provide better predictors, a phenomenon that is also known as regularization by early stopping.

7 Conclusions
Despite the vast literature on ADMM, to the best of our knowledge, no scheme for distributing general convex constraints has been studied before. Here we have closed this gap by combining ADMM and the augmented Lagrangian method for solving general convex optimization problems with many convex constraints. The straightforward combination of ADMM and the augmented Lagrangian method entails three nested loops, an outer loop for reaching consensus, one loop for the constraints, and an inner loop for solving unconstrained problems. Our main contribution is showing that the loops for reaching consensus and for handling constraints can be merged, resulting in a scheme with only two nested loops. We provide the first convergence proof for such a lazy algorithmic scheme.

Acknowledgments
This work has been supported by the DFG grant GI-711/5-1 within the Priority Program 1736 Algorithms for Big Data.
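Since the update rules of Section 5 are all either closed-form or unconstrained minimizations, the scheme can be stated in a few lines of code. Below is a minimal, single-process Python sketch of Algorithm 1 applied to a toy instance of Problem 6 with two constraint batches and no linear equality constraints h_i. The quadratic objective, the two box-style constraints, the choice ρ = 5, and the fixed iteration count are illustrative assumptions, not the paper's robust-SVM setup; only the update formulas follow the text. As in the paper's implementation, the unconstrained x_i-subproblems are solved with SciPy's L-BFGS-B.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance (illustrative, not the paper's experiments):
# minimize ||x - a||^2 subject to x[0] <= 0.3 and x[1] <= 0.5,
# with the two inequality constraints distributed onto m = 2 "nodes".
a = np.array([0.6, 0.8])
f = lambda x: np.dot(x - a, x - a)
g = [lambda x: np.array([x[0] - 0.3]),  # constraint batch of node 1
     lambda x: np.array([x[1] - 0.5])]  # constraint batch of node 2

m, n, rho = 2, 2, 5.0
xs = np.zeros((m, n))                  # local primal variables x_i
z = np.zeros(n)                        # consensus variable
mus = [np.zeros(1) for _ in range(m)]  # multipliers for max{0, g_i}^2 = 0
lams = np.zeros((m, n))                # multipliers for x_i - z = 0

def local_lagrangian(x, i):
    # Terms of the Augmented Lagrangian that involve only node i.
    v = np.maximum(0.0, g[i](x)) ** 2  # squared hinge: g(x) = max{0, g_0(x)}^2
    return (f(x)
            + rho / 2 * np.dot(v, v) + mus[i] @ v
            + rho / 2 * np.dot(x - z, x - z) + lams[i] @ (x - z))

for k in range(800):
    # x-updates: m independent unconstrained problems (parallelizable).
    for i in range(m):
        xs[i] = minimize(local_lagrangian, xs[i], args=(i,),
                         method="L-BFGS-B").x
    # z-update: closed form, performed on the central node.
    z = (rho * xs.sum(axis=0) + lams.sum(axis=0)) / (rho * m)
    # Dual updates; each mu_i stays non-negative and, like lambda_i,
    # involves only node i's variables.
    for i in range(m):
        mus[i] = mus[i] + rho * np.maximum(0.0, g[i](xs[i])) ** 2
        lams[i] = lams[i] + rho * (xs[i] - z)

print(z)  # close to the constrained optimum (0.3, 0.5)
```

In an actual distributed deployment, the x_i- and µ_i-updates would run on the m compute nodes and only the z- and λ_i-updates would be performed centrally, as described in Section 5. The slow tightening of the constraint residual as the µ_i grow also illustrates why high accuracy requires many iterations.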