
The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Federated Learning for Face Recognition with Gradient Correction

Yifan Niu, Weihong Deng


Beijing University of Posts and Telecommunications
[email protected], [email protected]

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

With increasing appeal to privacy issues in face recognition, federated learning has emerged as one of the most prevalent approaches to study the unconstrained face recognition problem with private decentralized data. However, the conventional decentralized federated algorithm, which shares the whole parameters of the network among clients, suffers from privacy leakage in the face recognition scene. In this work, we introduce a framework, FedGC, to tackle federated learning for face recognition and guarantee higher privacy. We explore a novel idea of correcting gradients from the perspective of backward propagation and propose a softmax-based regularizer to correct the gradients of class embeddings by precisely injecting a cross-client gradient term. Theoretically, we show that FedGC constitutes a valid loss function similar to standard softmax. Extensive experiments have been conducted to validate the superiority of FedGC, which can match the performance of conventional centralized methods that utilize the full training dataset, on several popular benchmark datasets.

Introduction

Face recognition has been the prominent biometric technique for identity authentication and has been widely applied in many areas. Recently, a variety of data-driven approaches using Deep Convolutional Neural Networks (DCNNs) (Taigman et al. 2014; Kim, Park, and Shin 2020; Deng et al. 2020; Duan, Lu, and Zhou 2019; Marriott, Romdhani, and Chen 2021) have been proposed to improve face identification and verification accuracy. A large-scale dataset with diverse variance is crucial for discriminative face representation learning. Although existing datasets (Cao et al. 2018; Guo et al. 2016; Kemelmacher-Shlizerman et al. 2016; Wang et al. 2018a; Yi et al. 2014; Zhu et al. 2021) were created aiming to study the unconstrained face recognition problem, they are still biased compared with the real-world data distribution. Considering privacy issues, we are not authorized to get access to mass face data in the real world. Thus, it is vital to train a model with private decentralized face data to study the unconstrained face recognition problem in the real-world scene.

Federated methods on object classification tasks are all under a common setting where a shallow network is adopted as the backbone and a shared fully-connected layer is applied for final classification, which is likely to lead to privacy leakage. Therefore, these methods are not applicable to face recognition. Once the private class embedding is obtained, one client's private high-fidelity face images can be easily synthesized by other clients via optimizing random noise, as in DeepInversion (Yin et al. 2020). Moreover, many GAN-based face generation techniques have also been proposed to generate a frontal photorealistic face image from face embeddings. On the other hand, existing federated methods mainly focus on shallow networks (e.g., a 2-layer fully-connected network); we found that these methods may easily cause network collapse when applied to deeper network structures on facial datasets. Thus, we rethink the federated learning problem of face recognition with respect to privacy issues, and remodel the conventional Federated Averaging algorithm (FedAvg) (McMahan et al. 2017) by ensuring that each client holds a private fully-connected layer, which not only guarantees higher privacy but also contributes to network convergence.

In general, each client commonly holds a small-scale non-IID local dataset. When we follow the above setting, once the k-th client solves the optimization problem locally, the classification task is relatively uncomplicated and the network tends to overfit and suffers from degradation of generalization ability. This leads to a phenomenon where class embeddings (the parameters of the last fully-connected layer) of the same client are almost orthogonal to each other, but part of the class embeddings of different clients are highly similar.

To solve the aforementioned problem, we should constitute a new training strategy to train a model with private decentralized non-IID (Non Identically and Independently Distributed) facial data. In this work, we first propose FedGC, a novel and powerful federated learning framework for face recognition, which combines local optimization and cross-client optimization injected by our proposed softmax regularizer. FedGC is a privacy-preserving federated learning framework which guarantees that each client holds private class embeddings. In face recognition, several variants of softmax-based objective functions (Deng et al. 2019; Deng, Zhou, and Zafeiriou 2017; Simonyan and Zisserman 2014; Sun, Wang, and Tang 2015; Taigman et al. 2014; Wang et al. 2018b; Wolf, Hassner, and Maoz 2011) have been proposed in centralized methods. Hence, we propose a softmax-based regularizer aiming to correct the gradients of the local softmax and precisely introduce cross-client gradients to ensure that cross-client class embeddings are fully spread out, and it can be readily extended to other forms. Additionally, we give a theoretical analysis to show that FedGC constitutes a valid loss function similar to standard softmax. Our contributions can be summarized as follows:
• We propose a federated learning framework, FedGC, for face recognition that guarantees higher privacy. It addresses the missing local optimization problems for face-specific softmax-based loss functions.

• We start from a novel perspective of back propagation to correct gradients and introduce cross-client gradients to ensure the network updates in the direction of standard softmax. We also give a theoretical analysis to show the effectiveness and significance of our method.

• Extensive experiments and ablation studies have been conducted and demonstrate the superiority of the proposed FedGC on several popular benchmark datasets.

Related Work

Face Recognition. Face Recognition (FR) has been the prominent biometric technique for identity authentication and has been widely applied in many areas (Wang and Deng 2018). Recently, face recognition has achieved a series of promising breakthroughs based on deep face representation learning and has performed far beyond humans. Conventional face-recognition approaches have been proposed, such as Gabor wavelets (Liu and Wechsler 2002) and LBP (Ahonen, Hadid, and Pietikainen 2006). Schroff (Schroff, Kalenichenko, and Philbin 2015) proposed the triplet loss to minimize intra-class variance and maximize inter-class variance. Various softmax-based loss functions have also emerged, such as L-Softmax (Liu et al. 2016), CosFace (Wang et al. 2018c), SphereFace (Liu et al. 2017), AM-Softmax (Wang et al. 2018b), and ArcFace (Deng et al. 2019). Circle Loss (Sun et al. 2020) proposed a flexible optimization manner via re-weighting less-optimized similarity scores. GroupFace (Kim et al. 2020) proposed a novel face-recognition architecture learning group-aware representations. However, these data-driven approaches aim to learn discriminative face representations on the premise of having access to full private facial statistics. Publicly available training databases (Cao et al. 2018; Guo et al. 2016; Kemelmacher-Shlizerman et al. 2016; Wang et al. 2018a; Yi et al. 2014) are mostly collected from the photos of celebrities due to privacy issues, so they are still biased. Furthermore, with increasing appeal to privacy issues in society, existing public face datasets may turn out to be illegal.

Federated Learning. Federated Learning (FL) is a machine learning setting where many clients collaboratively train a model under the orchestration of a central server while keeping the training data decentralized; it aims to transfer traditional deep learning methods to a privacy-preserving setting. Existing works seek to improve model performance, efficiency, and fairness in the training and communication stages. FedAvg (McMahan et al. 2017) was proposed as the basic algorithm of federated learning. FedProx (Li et al. 2018) was proposed as a generalization and re-parametrization of FedAvg with a proximal term. SCAFFOLD (Karimireddy et al. 2019) uses control variates to correct the 'client-drift' in local updates. FedAC (Yuan and Ma 2020) was proposed to improve convergence speed and communication efficiency. FedAwS (Yu et al. 2020) investigated a new setting where each client has access to the positive data associated with only a single class. However, most of these methods mainly focus on shallow networks and suffer from privacy leakage in face recognition. Recently, there have also emerged some works (Bai et al. 2021; Aggarwal, Zhou, and Jain 2021) focusing on federated face recognition.

Methodology

In this section, we first provide the formulation of federated learning and its variant for face recognition. We start by analysing it, and then illustrate how we are motivated to propose FedGC.

Problem Formulation

We consider a C-class classification problem defined over a compact space X and a label space Y. Let K be the number of clients, suppose the k-th client holds the data {x_i^k, y_i^k} which is distributed over S_k : X_k × Y_k, and ensure the identity mutual exclusion of clients, Y_k ∩ Y_z = ∅, where k, z ∈ [K], k ≠ z, such that S = ∪_{k∈[K]} S_k. In this work, we consider the following distributed optimization model:

    \min_w F(w) ≜ \sum_{k=1}^{K} p_k F_k(w),    (1)

where p_k is the weight of the k-th client. Let the k-th client hold n_k training data and \sum_{k=1}^{K} n_k = N, where N is the total number of data samples. We define p_k as n_k/N; then we have \sum_{k=1}^{K} p_k = 1.

Consider an "embedding-based" discriminative model: given an input data x ∈ X, a neural network G : X → R^d parameterized by θ embeds the data x into a d-dimensional vector G(x; θ) ∈ R^d. Finally, the logits of an input data x in the k-th client, f_k(x) ∈ R^{C_k}, can be expressed as:

    f_k(x) = W_k^T G(x; θ),    (2)

where the matrix W_k ∈ R^{d×C_k} is the class embeddings of the k-th client. Then Eq. 1 can be reformulated as:

    \min_{W,θ} F(W, θ) ≜ \sum_{k=1}^{K} p_k \frac{1}{n_k} \sum_{i=1}^{n_k} ℓ_k(f_k(x_i^k), y_i^k),    (3)

where ℓ_k(·, ·) is the loss function of the k-th client and W = [W_1, · · · , W_K]^T. To provide a stricter privacy guarantee, we modified FedAvg (McMahan et al. 2017) by keeping the last fully-connected layer private in each client. We term this privacy-preserving version of FedAvg Federated Averaging with Private Embedding (FedPE). In FedPE, each client only has access to its own final class embeddings and the shared backbone parameters. Note that differential privacy (Abadi et al. 2016) for federated methods can be readily employed in FedPE by adding noise to the parameters from each client to enhance security.
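To make the FedPE setting concrete, the following PyTorch-style sketch (an illustration under assumed dimensions and class counts, not the authors' released code) keeps the backbone parameters θ shareable while the bias-free class-embedding matrix W_k stays private to its client, so the server only ever sees backbone state.

```python
import torch
import torch.nn as nn

class ClientHead(nn.Module):
    """Private class embeddings W_k of one client: logits f_k(x) = W_k^T G(x; theta)."""
    def __init__(self, feat_dim: int, num_local_classes: int):
        super().__init__()
        # Bias is omitted, matching the bias-free softmax formulation (Eq. 4).
        self.W = nn.Parameter(torch.randn(feat_dim, num_local_classes) * 0.01)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return features @ self.W          # (batch, C_k) logits over local classes only


class FedPEClient(nn.Module):
    """Shared backbone G(.; theta) plus a private head; only backbone params are shared."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_local_classes: int):
        super().__init__()
        self.backbone = backbone          # parameters theta, aggregated by the server
        self.head = ClientHead(feat_dim, num_local_classes)  # parameters W_k, kept private

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))
```

In such a setup the server would only request something like `client.backbone.state_dict()`, never the head, which is what keeps the class embeddings private.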
Figure 1: An illustration of our method. In communication round t, the server broadcasts the model parameters (θ^t, W_k^t) to the selected clients. Then clients locally compute an update to the model with their local data asynchronously, and send the new model (θ^{t+1}, W_k^{t+1}) back. Finally, the server collects an aggregate of the client updates and applies cross-client optimization. (a) Client optimization: clients seek to obtain more discriminative and more compact features. (b) Server optimization: correct gradients and make cross-client embeddings spread out.

Algorithm 1: FedGC.
1: Input. The K clients are indexed by k and hold local data distributed over S_k; η is the learning rate.
2: Server initializes model parameters θ^0, W^0.
3: for each round t = 0, 1, ..., T − 1 do
4:    Server initializes the k-th client model with θ^t, W_k^t.
5:    for each client k = 1, 2, ..., K do
6:       The k-th client computes the local softmax loss,
7:       (θ_k^{t+1}, W_k^{t+1}) ← (θ^t, W_k^t) − η ∇ℓ_k(x_i^k, y_i^k),
8:       and sends (θ_k^{t+1}, W_k^{t+1}) to the server.
9:    end for
10:   Server aggregates the model parameters:
11:      θ^{t+1} ← \sum_{k=1}^{K} (n_k / n) θ_k^{t+1}
12:      W̃^{t+1} = [W_1^{t+1}, . . . , W_K^{t+1}]^T
13:   Server applies gradient correction:
14:      W^{t+1} ← W̃^{t+1} − λη ∇_{W̃^{t+1}} Reg(W̃^{t+1})
15: end for
16: Output. θ^T, W^T

Observation and Motivation

Softmax Loss. The softmax loss is the most widely used classification loss function in face recognition. For convenience, we omit the bias b_j. In the k-th client, the optimization objective is to minimize the following cross-entropy loss:

    L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{k,y_i}^T X_{k,i}}}{\sum_{j=1}^{C_k} e^{W_{k,j}^T X_{k,i}}},    (4)

where X_{k,i} ∈ R^d denotes the deep feature of the i-th sample, belonging to the y_i-th class. In each individual client, the optimization objective is to minimize inter-class similarity and maximize intra-class similarity over the local class space C_k. We define this optimization in FedPE within a client as local optimization. However, centralized training on the full training set solves the problem over the global class space C. We define the centralized method as global optimization.
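As a rough illustration of the local optimization step (Algorithm 1, lines 5-8), each client minimizes the cross-entropy of Eq. 4 over its own class space with plain SGD. The hyperparameter defaults mirror the experimental settings reported later but are otherwise placeholders; this is a sketch, not the authors' training script.

```python
import torch
import torch.nn.functional as F

def local_update(client, loader, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """One round of local optimization on a single client's private, non-IID data.

    `client` is assumed to be a FedPEClient-style module: a shared backbone plus a
    private head whose logits cover only the client's local classes (labels 0..C_k-1).
    """
    opt = torch.optim.SGD(client.parameters(), lr=lr,
                          momentum=momentum, weight_decay=weight_decay)
    client.train()
    for images, local_labels in loader:
        logits = client(images)                       # (batch, C_k), Eq. 2
        loss = F.cross_entropy(logits, local_labels)  # local softmax loss, Eq. 4
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The new (theta_k, W_k) would then be sent back to the server (Algorithm 1, line 8).
    return client.state_dict()
```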
In local optimization, the local softmax is to force W_{k,y_i}^T X_{k,i} > max_{j∈C_k, j≠y_i} (W_{k,j}^T X_{k,i}). However, in global optimization, the softmax is to force W_{y_i}^T X_i > max_{j∈C, j≠y_i} (W_j^T X_i). Thus, it is obvious that the model solving the classification problem as in Eq. 3 only applies within-client optimization and omits cross-client optimization, lacking the constraint W_{k,y_i}^T X_{k,i} > max_{j∈C_z, z≠k} (W_{z,j}^T X_{k,i}).

Therefore, the objective function in Eq. 3 leads the model to a convergence state where class embeddings of the same client are almost orthogonal to each other, but part of the class embeddings of different clients may be highly similar. This results in overlapping of the feature space among cross-client classes. Furthermore, only applying local optimization is more likely to cause overfitting on small-scale local datasets.

Cross-Client Separability with Gradient Correction

It is hard to mimic the global softmax with a set of local softmaxes. To address the missing optimization, as illustrated in Fig. 1, a heuristic approach to minimize similarity among cross-client class embeddings is to constrain the cross-client embeddings with a regularization term. Considering the additivity of gradients and the unique properties of the softmax loss gradient, we are motivated to address this issue from a new perspective of back propagation. Following the form of softmax, we define a regularization term, namely the softmax regularizer, on the class embeddings W ∈ R^{d×C} as:

    Reg(W) = -\sum_{k=0}^{K} \sum_{i=0}^{C_k} \log \frac{e^{W_{k,i}'^{T} W_{k,i}'}}{e^{W_{k,i}'^{T} W_{k,i}'} + \sum_{z≠k} \sum_{j=0}^{C_z} e^{W_{z,j}^T W_{k,i}'}},    (5)

where W_{k,i} is the i-th class embedding of the k-th client, and (·)' indicates that the vector does not require gradient (its gradient is set to zero). We precisely limit the gradient of the loss function with the softmax regularizer in order to push the network update towards the direction of standard softmax.
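A small PyTorch sketch of how the softmax regularizer of Eq. 5 could be evaluated is given below; the stop-gradient of the primed vectors is expressed with detach(), the per-client list layout of W is a simplifying assumption, and the snippet is an illustrative reading of the formula rather than the reference implementation.

```python
import torch

def softmax_regularizer(class_embeddings):
    """Softmax regularizer of Eq. 5.

    class_embeddings: list of per-client matrices W_k with shape (C_k, d).
    Gradients flow only into the cross-client embeddings W_{z,j}; every W_{k,i}
    that plays the role of the anchor is detached (the primed vectors in Eq. 5).
    """
    reg = 0.0
    for k, Wk in enumerate(class_embeddings):
        anchors = Wk.detach()                                  # W'_{k,i}, no gradient
        pos = (anchors * anchors).sum(dim=1, keepdim=True)     # W'_{k,i}^T W'_{k,i}, (C_k, 1)
        others = torch.cat([Wz for z, Wz in enumerate(class_embeddings) if z != k], dim=0)
        neg = anchors @ others.t()                             # W_{z,j}^T W'_{k,i}, (C_k, sum C_z)
        logits = torch.cat([pos, neg], dim=1)
        # -log of the softmax probability assigned to the "self" term, summed over anchors.
        reg = reg - torch.log_softmax(logits, dim=1)[:, 0].sum()
    return reg
```

On the server, this quantity would be evaluated once per round on the collected private embeddings, and a single step of size λη taken along its negative gradient (Algorithm 1, lines 13-14).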
Figure 2: Update steps of class embeddings on a single client. (a) The divergence between FedPE and SGD becomes much larger without correction. (b) The gradient correction term ensures the update moves towards the true optimum.

In addition to FedPE, the server performs an additional optimization step on the class embedding matrix W ∈ R^{d×C} to ensure that cross-client class embeddings are separated from each other. The Federated Averaging with Gradient Correction (FedGC) algorithm is summarized in Algorithm 1. In communication round t, the server broadcasts the model parameters (θ^t, W_k^t) to the k-th client. Then clients locally compute an update with respect to the local softmax loss function with their local data asynchronously, and send the new model (θ^{t+1}, W_k^{t+1}) back. Finally, the server collects an aggregate of the client updates and applies cross-client optimization. Note that differential privacy can also be applied to FedGC to prevent privacy leakage, like FedPE.
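The server side of one communication round (Algorithm 1, lines 10-14) can then be sketched as below. The regularizer is passed in as a callable (for instance the softmax_regularizer sketch above); the default lr and lam values echo the settings used in the experiments, and the whole function is an assumption-laden illustration, not the paper's code.

```python
import torch

def server_round(client_backbones, client_embeddings, client_sizes, regularizer,
                 lr=0.1, lam=20.0):
    """Aggregate client updates and apply one gradient-correction step (Algorithm 1).

    client_backbones:  list of backbone state_dicts (theta_k^{t+1}) returned by the clients
    client_embeddings: list of private class-embedding matrices W_k^{t+1}, shape (C_k, d)
    client_sizes:      list of n_k, used as aggregation weights
    regularizer:       callable on the list of W_k, e.g. the softmax regularizer of Eq. 5
    """
    total = float(sum(client_sizes))

    # Line 11: theta^{t+1} <- sum_k (n_k / n) * theta_k^{t+1}
    theta = {}
    for name in client_backbones[0]:
        theta[name] = sum((n / total) * sd[name].float()
                          for sd, n in zip(client_backbones, client_sizes))

    # Lines 12-14: collect W_k, then W <- W - lam * lr * grad Reg(W)
    W = [w.detach().clone().requires_grad_(True) for w in client_embeddings]
    reg = regularizer(W)                  # gradients land only on cross-client terms
    reg.backward()
    with torch.no_grad():
        corrected = [w - lam * lr * w.grad for w in W]

    return theta, corrected               # each corrected W_k is sent back to its client only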
We will theoretically analyze how FedGC works and how it pushes the network to update in the direction of the global standard softmax. Note that FedGC effectively seeks to collaboratively minimize the following objective with the softmax regularizer Reg(W):

    F(W, θ) ≜ \sum_{k=1}^{K} p_k \frac{1}{n_k} \sum_{i=1}^{n_k} ℓ_k(f_k(x_i^k), y_i^k) + λ · Reg(W).    (6)

For convenience, we assume that every client holds n_1 = · · · = n_K = N/K data and c_1 = · · · = c_K = C/K classes, every class holds a_1 = · · · = a_C = N/C images, and λ = 1/C. The objective function can then be reformulated as:

    F(W, θ) = \frac{1}{N} \sum_{k=1}^{K} \sum_{(x_i, y_i) ∈ S_k} ℓ_eq(f_k(x_i^k), y_i^k)
            = -\frac{1}{N} \sum_{k=1}^{K} \sum_{(x_i, y_i) ∈ S_k} \Big[ \log \frac{e^{W_{k,y_i}^T X_{k,i}}}{\sum_{j=1}^{C_k} e^{W_{k,j}^T X_{k,i}}} + \log \frac{e^{W_{k,y_i}'^{T} W_{k,y_i}'}}{e^{W_{k,y_i}'^{T} W_{k,y_i}'} + \sum_{z≠k} \sum_{j=0}^{C_z} e^{W_{z,j}^T W_{k,y_i}'}} \Big].    (7)

Thus, the FedGC objective in Eq. 6 equals the empirical risk with respect to the loss function ℓ_eq(f_k(x_i^k), y_i^k). Our analysis easily extends to the unbalanced distribution by involving a weighted form.

Considering the collaborative effect of all the terms in ℓ_eq, we give an interpretation from the perspective of backward propagation. For standard softmax in global optimization, the gradients ∂L/∂W_{y_i} and ∂L/∂W_j are computed as follows:

    \frac{∂L}{∂W_{y_i}} = \Big( \frac{e^{W_{y_i}^T X_i}}{\sum_{j=1}^{C} e^{W_j^T X_i}} - 1 \Big) X_i,    (8)

    \frac{∂L}{∂W_j} = \frac{e^{W_j^T X_i}}{\sum_{j'=1}^{C} e^{W_{j'}^T X_i}} X_i, where j ≠ y_i.    (9)

Similarly, for FedGC we also calculate the gradient of ℓ_eq. Then ∂ℓ_eq/∂W_{k,j}, ∂ℓ_eq/∂W_{z,j} and ∂ℓ_eq/∂W_{k,y_i} can be expressed as:

    \frac{∂ℓ_eq}{∂W_{k,j}} = \frac{e^{W_{k,j}^T X_{k,i}}}{\sum_{j'=1}^{C_k} e^{W_{k,j'}^T X_{k,i}}} X_{k,i},    (10)

    \frac{∂ℓ_eq}{∂W_{z,j}} = \frac{e^{W_{z,j}^T W_{k,y_i}} W_{k,y_i}}{e^{W_{k,y_i}^T W_{k,y_i}} + \sum_{z≠k} \sum_{j'=0}^{C_z} e^{W_{z,j'}^T W_{k,y_i}}},    (11)

    \frac{∂ℓ_eq}{∂W_{k,y_i}} = \Big( \frac{e^{W_{k,y_i}^T X_{k,i}}}{\sum_{j=1}^{C_k} e^{W_{k,j}^T X_{k,i}}} - 1 \Big) X_{k,i}.    (12)

Let D_{k,i} denote the difference between W_{k,y_i} and X_{k,i}, D_{k,i} = W_{k,y_i} − X_{k,i}. We assume a well-trained feature on the local data due to its easy convergence on local data, i.e. D_{k,i} → 0; then we have W_{k,y_i} → X_{k,i}. We can approximate:

    \frac{∂ℓ_eq}{∂W_{z,j}} ≈ \frac{e^{W_{z,j}^T X_{k,i}} X_{k,i}}{e^{W_{k,y_i}^T X_{k,i}} + \sum_{z≠k} \sum_{j'=0}^{C_z} e^{W_{z,j'}^T X_{k,i}}}.    (13)

The parameters are updated by SGD as w_k' = w_k − η ∂ℓ_eq/∂w_k, where η is the step size. Here, for simplicity, we simplify Eq. 8 as ∂L/∂W_{y_i} = αX_i, Eq. 9 as ∂L/∂W_j = βX_i, Eq. 12 as ∂ℓ_eq/∂W_{k,y_i} = α'X_{k,i}, Eq. 10 as ∂ℓ_eq/∂W_{k,j} = β'X_{k,i}, and Eq. 13 as ∂ℓ_eq/∂W_{z,j} = γ'X_{k,i}. We consider the direction of the gradients; thus Eq. 12 will act as a substitute for Eq. 8 in within-client optimization, and Eq. 10 will act as a substitute for Eq. 9 in within-client optimization. The collaborative effect of both terms acts as the local gradient in Fig. 2. The mismatch of the magnitude can be alleviated by adjusting the learning rate of the class embeddings.

More importantly, Eq. 13 performs cross-client optimization and acts as a correction term in Fig. 2 to correct the gradient in cross-client optimization, introducing a gradient of cross-client samples to Eq. 10. Eq. 13 has the same direction as Eq. 9. As for magnitude, the denominator of Eq. 13 lacks the term \sum_{j=0, j≠i}^{C_k} e^{W_{k,j}^T X_{k,i}} compared to standard SGD, but with a well-done local optimization we have \sum_{j=0, j≠i}^{C_k} e^{W_{k,j}^T X_{k,i}} ≪ \sum_{z≠k} \sum_{j=0}^{C_z} e^{W_{z,j}^T X_{k,i}}. Therefore, the magnitudes of Eq. 13 and Eq. 9 are approximately equal.
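The cross-client gradient of Eq. 11 can be checked numerically. The toy snippet below (random embeddings and made-up dimensions, purely illustrative) builds the regularizer term for a single anchor W'_{k,i} and verifies that autograd reproduces the analytic softmax-weighted gradient on the cross-client embeddings, while the detached anchor receives none.

```python
import torch

torch.manual_seed(0)
d, Cz = 8, 6                                   # toy embedding dim and cross-client class count
anchor = torch.randn(d)                        # W'_{k,i}: detached, receives no gradient
Wz = torch.randn(Cz, d, requires_grad=True)    # cross-client embeddings W_{z,j}

# One term of Eq. 5 for this anchor: -log( e^{a.a} / (e^{a.a} + sum_j e^{Wz_j . a}) )
logits = torch.cat([(anchor * anchor).sum().view(1), Wz @ anchor])
reg = -torch.log_softmax(logits, dim=0)[0]
reg.backward()

# Eq. 11: each row's gradient is the softmax weight of its logit times the anchor vector.
p = torch.softmax(logits.detach(), dim=0)[1:]
analytic = p.unsqueeze(1) * anchor.unsqueeze(0)
print(torch.allclose(Wz.grad, analytic, atol=1e-6))   # True: autograd matches Eq. 11
```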
Thus, Eq. 13 together with Eq. 11 can act as a substitute for Eq. 9 and add the missing cross-client item. Therefore, FedGC can push the class embeddings toward a similar direction as standard SGD while guaranteeing higher privacy.

Remark: Another simple way to introduce a cross-client constraint is to minimize \sum_{z≠k} \sum_{j=0}^{C_z} W_{z,j}^T W_{k,i}; we call it the cosine regularizer. For a particular W_{z,j}, the cosine regularizer introduces the gradient ∂ℓ/∂W_{z,j} = \sum_{z≠k} W_{k,i}. We show that our proposed softmax regularizer can act as a correction term for the local softmax and can also be regarded as a weighted version of \sum_{z≠k} \sum_{j=0}^{C_z} W_{z,j}^T W_{k,i} from the perspective of backward propagation. Our proposed softmax regularizer generates gradients of larger magnitude for more similar embeddings (hard examples), so it can also be regarded as a regularization term with hard example mining. In addition, we defined the softmax regularizer following the form of softmax. Thus, several loss functions which are variants of softmax (e.g. ArcFace, CosFace) can be obtained with minor modifications of the softmax regularizer.

Extend FedGC to a More General Case

In the above analysis, we adopt the identity mutual exclusion assumption Y_k ∩ Y_z = ∅. In fact, FedGC solves the problem of missing cross-client optimization, so it can also be applied to the general case. We generalize the above-mentioned situations, that is, some IDs are mutually exclusive and some IDs are shared. For example, suppose there is an identity l shared by a client group K_l. After each round of communication, the server takes the average of W_{n,l}^t, n ∈ K_l, and applies our proposed softmax regularizer (only exclusive clients are introduced, in this case the clients K − K_l) to correct its gradient. In this way, W_l is updated in a direction similar to the standard softmax. With minor modifications to the above analysis, we can prove the applicability of FedGC in the general case.
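A rough sketch of that shared-identity handling is given below; the helper and its indexing convention (a map from client index to the identity's row in that client's W_k) are hypothetical, introduced only to illustrate the averaging step described above.

```python
import torch

def merge_shared_identity(client_embeddings, owners, row_index):
    """Average the embedding of one shared identity across the clients that hold it.

    client_embeddings: list of per-client matrices W_k, shape (C_k, d)
    owners:            indices of the client group K_l that shares the identity
    row_index:         dict mapping client index -> row of that identity in the client's W_k
    """
    shared = torch.stack([client_embeddings[n][row_index[n]] for n in owners]).mean(dim=0)
    with torch.no_grad():
        for n in owners:
            client_embeddings[n][row_index[n]] = shared
    # The softmax regularizer is then applied against the remaining (exclusive) clients only.
    return shared
```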
Relation to Other Methods

Multi-task learning. Multi-task learning combines several tasks into one system aiming to improve generalization ability (Seltzer and Droppo 2013). Considering a multi-task learning system with input data x_i, the overall objective function is a combination of several sub-objective loss functions, written as L = \sum_j L_j(θ, W_j, x_i), where θ denotes generic parameters and W_j, j ∈ [1, 2, · · · ], are task-specific parameters. In FedGC, Eq. 3 can also be regarded as a combination of many class-dependent changing tasks, L_k = \frac{1}{n_k} \sum_{j=1}^{n_k} ℓ_k(f_k(x_j^k), y_j^k), k ∈ [1, · · · , K]. In general, multi-task learning is conducted end-to-end and trained on a single device, while in FedGC the model is trained with class-exclusive decentralized non-IID data. Thus, our method can also be regarded as a decentralized version of multi-task learning.

Generative Adversarial Nets (GAN). Based on the idea of game theory, GAN is essentially a two-player minimax problem, min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))], which converges to a Nash equilibrium. In FedGC, client optimization and server optimization can be regarded as a process of adversarial learning, where clients tend to minimize the similarity of within-client class embeddings, L_k = \frac{1}{n_k} \sum_{j=1}^{n_k} ℓ_k(f_k(x_j^k), y_j^k), but the server tends to minimize the similarity of cross-client class embeddings and encourages within-client class embeddings to be more compact, Reg(W). By performing adversarial learning similar to GAN, the network can learn more discriminative representations of class embeddings.

Experiments

Implementation Details

Datasets. Considering that federated learning is extremely time-consuming, we employ CASIA-WebFace (Yi et al. 2014) as the training set. CASIA-WebFace is collected from the internet and contains about 10,000 subjects and 500,000 images. To simulate the federated learning setting, we randomly divide the training set into 36 clients. For testing, we explore the verification performance of the proposed FedGC on benchmark datasets (LFW (Huang et al. 2008), CFP-FP (Sengupta et al. 2016), AgeDB-30 (Moschoglou et al. 2017), SLLFW (Deng et al. 2017), CPLFW (Zheng and Deng 2018), CALFW (Zheng, Deng, and Hu 2017), and VGG2-FP (Cao et al. 2018)). We also evaluate on large-scale image datasets (MegaFace (Kemelmacher-Shlizerman et al. 2016), IJB-B (Whitelam et al. 2017) and IJB-C (Maze et al. 2018)).

Experimental Settings. In data preprocessing, we use five facial landmarks for similarity transformation, then crop and resize the faces to 112×112. We employ ResNet-34 (He et al. 2016) as the backbone. We train the model with 2 synchronized 1080Ti GPUs on PyTorch. The learning rate is set to a constant of 0.1; it is kept constant without decay, which is similar to recent federated works. The batch size is set to 256. For fair comparison, the learning rate is also kept at 0.1 in centralized standard SGD. We set the momentum to 0.9 and the weight decay to 5e-4.

Ablation Study

Fraction of participants. We compare the fraction of participants C ∈ [0, 1]. In each communication round, C · K clients conduct optimization in parallel; we report results on LFW. Table 1 shows the impact of varying C for face recognition models. We train the models with the guidance of softmax. It is shown that with increasing client participation C, the performance of the model also increases, and FedGC still outperforms the baseline model by a notable margin.

Regularizer multiplier λ. We perform an analysis of the learning rate multiplier of the softmax regularizer, λ, on LFW. As shown in Table 2, FedGC achieves the best performance when λ is 20. A large multiplier can also cause network collapse, as it makes within-client class embeddings collapse to one point. When λ is very small, the model degenerates into the baseline model FedPE.

Balanced v.s. Unbalanced Partition. We compare the verification performance according to the partition of the dataset. Here we construct an unbalanced partition by a logarithmic normal distribution: ln X ∼ N(0, 1). We perform an analysis on the model with the softmax loss function on LFW. Table 3 shows that the unbalanced partition even improves the performance of the network to some extent.
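To illustrate how such a simulation could be set up, the sketch below partitions a set of identity labels into disjoint clients, either evenly or with log-normally distributed shares (ln X ∼ N(0, 1)) as in the unbalanced ablation; it is a hedged illustration of the described protocol, not the authors' data pipeline.

```python
import numpy as np

def partition_identities(identity_ids, num_clients=36, unbalanced=False, seed=0):
    """Split disjoint identity labels across clients (Y_k ∩ Y_z = ∅).

    Returns a list of identity-id arrays, one per client. With `unbalanced=True`,
    client shares are drawn from a log-normal distribution as in the ablation study.
    """
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.asarray(identity_ids))
    if unbalanced:
        shares = rng.lognormal(mean=0.0, sigma=1.0, size=num_clients)  # ln X ~ N(0, 1)
        assignment = rng.choice(num_clients, size=len(ids), p=shares / shares.sum())
        return [ids[assignment == k] for k in range(num_clients)]
    counts = np.full(num_clients, len(ids) // num_clients)
    counts[: len(ids) % num_clients] += 1
    return np.split(ids, np.cumsum(counts)[:-1])

# Example: clients = partition_identities(range(10575), unbalanced=True)
```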
Method           C = 0.25   C = 0.5   C = 0.75   C = 1
Softmax-FedPE    93.12      93.83     94.32      94.77
Softmax-FedGC    97.07      97.98     98.13      98.40

Table 1: Verification performance on LFW for different participation fractions C with the softmax loss function.

Method           λ = 1   λ = 20   λ = 50
Softmax-FedGC    95.20   98.40    97.42

Table 2: Verification performance on LFW for different learning rate multipliers λ with the softmax loss function.

Method              LFW     CFP-FP   AgeDB
Balanced-FedPE      94.77   81.90    78.38
Balanced-FedGC      98.40   90.20    85.85
Unbalanced-FedPE    96.27   85.26    81.22
Unbalanced-FedGC    98.80   91.56    88.78

Table 3: Verification performance of different data partitions with the softmax loss function.

        FedPE   FedCos   FedGC
LFW     94.77   96.63    98.40

Table 4: Verification performance on LFW of different forms of regularization with the softmax loss function.

We find that the clients which hold larger-scale datasets than average contribute significantly to the network and make it generate more discriminative representations. FedGC still outperforms the baseline model on both balanced and unbalanced datasets.

Regularizer v.s. Fixed. It has been proved that randomly initialized vectors in high-dimensional spaces (512 in this case) are almost orthogonal to each other. A naïve way to prevent the class embeddings from collapsing into an overlapping space is to keep the class embeddings fixed at their initialization. Table 5 shows that the proposed FedGC outperforms the model with the fully-connected layer fixed ("-Fixed"). For the softmax loss function, simply fixing the last fully-connected layer leads to a better accuracy. However, for ArcFace and CosFace, which introduce a more strict constraint, the performance of the model is even worse than the baseline model. Intuitively, randomly initialized orthogonal vectors lack semantic information, and this confuses the network in a more difficult classification task. Thus, it is shown that the performance is further increased with adaptive optimization (FedGC).

Cosine v.s. Softmax Regularizer. We replace the softmax regularizer with the cosine regularizer, namely FedCos: \sum_{z≠k} \sum_{j=0}^{C_z} W_{z,j}^T W_{k,i}, guided by the softmax loss function. We show the verification results on LFW in Table 4. Although the cosine regularizer shows a better accuracy than FedPE, it is still worse than FedGC, because the softmax regularizer can be regarded as a hard-sample-mining version of the cosine regularizer and also matches the gradient of standard softmax. Thus, the superiority of the softmax regularizer is proved experimentally.
tion. We show the verification result on LFW in Table 4. poor performance worse than FedPE. And FedGC can also
Although cosine regularizer shows a better accuracy than match the performance of conventional centralized methods.
FedPE, it is still worse than FedGC. Because softmax reg- MegaFace. The MegaFace dataset (Kemelmacher-
ularizer can be regarded as a hard sample mining version Shlizerman et al. 2016) includes 1M images of 690K
of cosine regularizer, and also match the gradient in stan- different individuals as the gallery set and 100K photos of
dard softmax. Thus, the superiority of softmax regularizer is 530 unique individuals from FaceScrub (Ng and Winkler
proved experimentally. 2014) as the probe set. It measures TPR at 1e-6 FPR for
verification and Rank-1 retrieval performance for identifica-
Visualization tion. In Table 6, adopting FaceScrub as probe set and using
To show the effectiveness of FedGC, the visualization com- the wash list provided by DeepInsight (Deng et al. 2019),
parisons are conducted at feature level. We select four pairs FedGC outperforms the baseline model FedPE by a large
of classes to compare FedGC and FedPE. In each pair, the margin in different loss functions on both verification and
two classes are from different clients and their correspond- identification tasks. Some centralized methods (Softmax,
ing class embeddings are highly similar in FedPE model. ArcFace(m = 0.5)) even show a poor performance when
The features are extracted from softmax model and visu- the learning rate is 0.1. It shows that FedGC can match the
alized by t-SNE (Maaten and Hinton 2008), as shown in performance of conventional centralized methods.
Fig. 3(a) and Fig. 3(b), the representations of the 4 pairs IJB-B and IJB-C. The IJB-B dataset (Whitelam et al.
tends to gather to a point and form 4 clusters in FedPE, 2017) contains 1, 845 subjects with 21.8K still images and
but the representations tends to spreadout and clustered by 55K frames from 7, 011 videos. In total, there are 12, 115
themselves in FedGC. We also illustrate the angle distribu- templates with 10, 270 genuine matches and 8M impostor
tions of all 8 selected cross-client classes. For each pair, we matches. The IJB-C dataset (Maze et al. 2018) is a further

2004
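The verification numbers above follow the usual protocol of reporting the true positive rate at a fixed false positive rate. A generic sketch of that computation from genuine and impostor similarity scores is shown below; it illustrates the metric only and is not the official MegaFace or IJB evaluation toolkit.

```python
import numpy as np

def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=1e-6):
    """TPR at a fixed FPR: threshold chosen so the impostor pass rate is at most target_fpr."""
    impostor = np.sort(np.asarray(impostor_scores))
    # Highest threshold that still lets at most target_fpr of impostor pairs through.
    idx = int(np.ceil((1.0 - target_fpr) * len(impostor))) - 1
    threshold = impostor[min(max(idx, 0), len(impostor) - 1)]
    genuine = np.asarray(genuine_scores)
    return float((genuine > threshold).mean())
```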
Method                 LFW     CFP-FP   AgeDB   CALFW   CPLFW   SLLFW   VGG2-FP   Average
Softmax∗               99.84   89.39    87.62   84.83   76.08   92.33   88.18     88.32
-FedPE                 94.77   81.90    78.38   74.15   64.40   80.42   80.32     79.19
-FedPE+Fixed           96.11   83.67    80.28   77.95   66.27   84.23   82.70     81.60
-FedGC                 98.40   90.20    85.85   81.47   71.88   90.38   87.64     86.55
CosFace(m = 0.35)∗     99.10   90.79    91.37   89.53   80.20   95.95   89.10     90.86
-FedPE                 98.17   86.90    86.28   83.68   72.67   91.15   85.24     86.30
-FedPE+Fixed           96.35   73.01    81.77   79.25   62.15   86.57   75.16     79.18
-FedGC                 98.83   88.60    90.00   87.82   76.72   94.02   85.74     88.82
ArcFace(m = 0.5)∗      97.62   90.50    83.37   77.33   70.95   86.28   89.40     85.06
-FedPE                 98.18   87.23    86.13   82.47   71.77   91.05   85.70     86.08
-FedPE+Fixed           95.85   64.43    79.15   77.53   58.63   85.67   66.70     75.42
-FedGC                 98.65   87.77    89.27   86.47   75.17   93.58   84.80     87.96

Table 5: Verification results (%) of different loss functions (Softmax, CosFace, ArcFace) and methods on 7 verification datasets. FedGC surpasses the others and enhances the average accuracy. ∗ indicates the re-implementation by our code with η constant at 0.1.

Figure 3: Visualization of 8 selected classes from the training set. (a)(b) t-SNE (Maaten and Hinton 2008) data distribution; (c)(d) histogram of pairwise cosine similarity (horizontal axis: cosine similarity, vertical axis: number of pairs).

Method                 Ver.(%)   Id.(%)
Softmax∗               61.21     59.65
-FedPE                 36.83     34.08
-FedGC                 69.87     61.26
CosFace(m = 0.35)∗     83.30     79.09
-FedPE                 62.62     57.91
-FedGC                 72.82     70.96
ArcFace(m = 0.5)∗      50.51     35.18
-FedPE                 64.53     58.12
-FedGC                 71.96     68.75

Table 6: Verification TPR (@FPR=1e-6) and identification Rank-1 on the MegaFace Challenge 1.

Method                 IJB-B Ver.(%)   IJB-B Id.(%)   IJB-C Ver.(%)   IJB-C Id.(%)
Softmax∗               72.60           74.81          75.06           76.05
-FedPE                 54.33           64.44          57.85           65.35
-FedGC                 69.23           78.52          71.33           79.52
CosFace(m = 0.35)∗     76.79           78.35          79.45           79.90
-FedPE                 74.24           78.10          77.12           79.10
-FedGC                 80.28           82.10          83.40           83.44
ArcFace(m = 0.5)∗      56.64           60.14          59.38           59.79
-FedPE                 73.42           76.40          75.74           76.82
-FedGC                 75.11           78.33          78.13           79.28

Table 7: Verification TPR (@FPR=1e-3) and identification Rank-1 on the IJB-B and IJB-C benchmarks.

IJB-B and IJB-C. The IJB-B dataset (Whitelam et al. 2017) contains 1,845 subjects with 21.8K still images and 55K frames from 7,011 videos. In total, there are 12,115 templates with 10,270 genuine matches and 8M impostor matches. The IJB-C dataset (Maze et al. 2018) is a further extension of IJB-B, having 3,531 subjects with 31.3K still images and 117.5K frames from 11,779 videos. In total, there are 23,124 templates with 19,557 genuine matches and 15,639K impostor matches. The verification TPR at 1e-3 FPR and identification Rank-1 are reported in Table 7. FedGC shows significant improvements and surpasses all candidates by a large margin. Compared with the centralized method on all three loss functions, FedGC can match the performance of conventional centralized methods on both the IJB-B and IJB-C datasets.

Conclusion

In this paper, we rethink the federated learning problem for face recognition with respect to privacy issues, and introduce a novel face-recognition-specialized federated learning framework, FedGC, which consists of a set of local softmaxes and a softmax-based regularizer, to effectively learn discriminative face representations with decentralized face data. FedGC can effectively enhance the discriminative power of cross-client class embeddings and enable the network to update in the same direction as standard SGD. Extensive experiments have been conducted over popular benchmarks to validate the effectiveness of FedGC, which can match the performance of centralized methods.

Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grant 61871052.
References

Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318.
Aggarwal, D.; Zhou, J.; and Jain, A. K. 2021. FedFace: Collaborative Learning of Face Recognition Model. arXiv preprint arXiv:2104.03008.
Ahonen, T.; Hadid, A.; and Pietikainen, M. 2006. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12): 2037–2041.
Bai, F.; Wu, J.; Shen, P.; Li, S.; and Zhou, S. 2021. Federated Face Recognition. arXiv preprint arXiv:2105.02501.
Cao, Q.; Shen, L.; Xie, W.; Parkhi, O. M.; and Zisserman, A. 2018. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 67–74. IEEE.
Deng, J.; Guo, J.; Liu, T.; Gong, M.; and Zafeiriou, S. 2020. Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces. In European Conference on Computer Vision, 741–757. Springer.
Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4690–4699.
Deng, J.; Zhou, Y.; and Zafeiriou, S. 2017. Marginal loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 60–68.
Deng, W.; Hu, J.; Zhang, N.; Chen, B.; and Guo, J. 2017. Fine-grained face verification: FGLFW database, baselines, and human-DCMN partnership. Pattern Recognition, 66: 63–73.
Duan, Y.; Lu, J.; and Zhou, J. 2019. UniformFace: Learning deep equidistributed representation for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3415–3424.
Guo, Y.; Zhang, L.; Hu, Y.; He, X.; and Gao, J. 2016. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, 87–102. Springer.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Huang, G. B.; Mattar, M.; Berg, T.; and Learned-Miller, E. 2008. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
Karimireddy, S. P.; Kale, S.; Mohri, M.; Reddi, S. J.; Stich, S. U.; and Suresh, A. T. 2019. Scaffold: Stochastic controlled averaging for federated learning. arXiv preprint arXiv:1910.06378.
Kemelmacher-Shlizerman, I.; Seitz, S. M.; Miller, D.; and Brossard, E. 2016. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4873–4882.
Kim, Y.; Park, W.; Roh, M.-C.; and Shin, J. 2020. GroupFace: Learning Latent Groups and Constructing Group-based Representations for Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5621–5630.
Kim, Y.; Park, W.; and Shin, J. 2020. BroadFace: Looking at Tens of Thousands of People at Once for Face Recognition. In European Conference on Computer Vision, 536–552. Springer.
Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2018. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.
Liu, C.; and Wechsler, H. 2002. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing, 11(4): 467–476.
Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; and Song, L. 2017. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 212–220.
Liu, W.; Wen, Y.; Yu, Z.; and Yang, M. 2016. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, 7.
Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov): 2579–2605.
Marriott, R. T.; Romdhani, S.; and Chen, L. 2021. A 3D GAN for Improved Large-pose Facial Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13445–13455.
Maze, B.; Adams, J.; Duncan, J. A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A. K.; Niggel, W. T.; Anderson, J.; Cheney, J.; et al. 2018. IARPA Janus Benchmark-C: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), 158–165. IEEE.
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.
Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; and Zafeiriou, S. 2017. AgeDB: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 51–59.
Ng, H.-W.; and Winkler, S. 2014. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), 343–347. IEEE.
Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 815–823.
Seltzer, M. L.; and Droppo, J. 2013. Multi-task learning in deep neural networks for improved phoneme recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6965–6969. IEEE.
Sengupta, S.; Chen, J.-C.; Castillo, C.; Patel, V. M.; Chellappa, R.; and Jacobs, D. W. 2016. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–9. IEEE.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; and Wei, Y. 2020. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6398–6407.
Sun, Y.; Wang, X.; and Tang, X. 2015. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2892–2900.
Taigman, Y.; Yang, M.; Ranzato, M.; and Wolf, L. 2014. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1701–1708.
Wang, F.; Chen, L.; Li, C.; Huang, S.; Chen, Y.; Qian, C.; and Change Loy, C. 2018a. The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), 765–780.
Wang, F.; Cheng, J.; Liu, W.; and Liu, H. 2018b. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7): 926–930.
Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; and Liu, W. 2018c. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5265–5274.
Wang, M.; and Deng, W. 2018. Deep Face Recognition: A Survey. arXiv preprint arXiv:1804.06655.
Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Adams, J.; Miller, T.; Kalka, N.; Jain, A. K.; Duncan, J. A.; Allen, K.; et al. 2017. IARPA Janus Benchmark-B face dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 90–98.
Wolf, L.; Hassner, T.; and Maoz, I. 2011. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011, 529–534. IEEE.
Yi, D.; Lei, Z.; Liao, S.; and Li, S. Z. 2014. Learning face representation from scratch. arXiv preprint arXiv:1411.7923.
Yin, H.; Molchanov, P.; Alvarez, J. M.; Li, Z.; Mallya, A.; Hoiem, D.; Jha, N. K.; and Kautz, J. 2020. Dreaming to distill: Data-free knowledge transfer via DeepInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8715–8724.
Yu, F. X.; Rawat, A. S.; Menon, A. K.; and Kumar, S. 2020. Federated Learning with Only Positive Labels. arXiv preprint arXiv:2004.10342.
Yuan, H.; and Ma, T. 2020. Federated Accelerated Stochastic Gradient Descent. arXiv preprint arXiv:2006.08950.
Zheng, T.; and Deng, W. 2018. Cross-Pose LFW: A database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep, 5.
Zheng, T.; Deng, W.; and Hu, J. 2017. Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197.
Zhu, Z.; Huang, G.; Deng, J.; Ye, Y.; Huang, J.; Chen, X.; Zhu, J.; Yang, T.; Lu, J.; Du, D.; et al. 2021. WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10492–10502.
