max and precisely introduce cross-client gradients to ensure that cross-client class embeddings are fully spread out, and it can be readily extended to other forms. Additionally, we give a theoretical analysis to show that FedGC constitutes a valid loss function similar to the standard softmax. Our contributions can be summarized as follows:

• We propose a federated learning framework, FedGC, for face recognition that guarantees higher privacy. It addresses the missing local optimization problem for face-specific softmax-based loss functions.
• We start from a novel perspective of back propagation to correct gradients and introduce cross-client gradients to ensure that the network updates in the direction of the standard softmax. We also give a theoretical analysis to show the effectiveness and significance of our method.
• Extensive experiments and ablation studies have been conducted and demonstrate the superiority of the proposed FedGC on several popular benchmark datasets.

Related Work

Face Recognition. Face Recognition (FR) has been the prominent biometric technique for identity authentication and has been widely applied in many areas (Wang and Deng 2018). Recently, face recognition has achieved a series of promising breakthroughs based on deep face representation learning and has performed far beyond humans. Conventional face-recognition approaches were proposed, such as Gabor wavelets (Liu and Wechsler 2002) and LBP (Ahonen, Hadid, and Pietikainen 2006). Schroff (Schroff, Kalenichenko, and Philbin 2015) proposed the triplet loss to minimize intra-class variance and maximize inter-class variance. Various softmax-based loss functions have also emerged, such as L-Softmax (Liu et al. 2016), CosFace (Wang et al. 2018c), SphereFace (Liu et al. 2017), AM-Softmax (Wang et al. 2018b), and ArcFace (Deng et al. 2019). Circle Loss (Sun et al. 2020) proposed a flexible optimization manner via re-weighting less-optimized similarity scores. GroupFace (Kim et al. 2020) proposed a novel face-recognition architecture learning group-aware representations. However, these data-driven approaches aim to learn discriminative face representations on the premise of having access to full private facial statistics. Publicly available training databases (Cao et al. 2018; Guo et al. 2016; Kemelmacher-Shlizerman et al. 2016; Wang et al. 2018a; Yi et al. 2014) are mostly collected from photos of celebrities due to privacy issues, so they are still biased. Furthermore, with the increasing appeal to privacy issues in society, existing public face datasets may become illegal.

Federated Learning. Federated Learning (FL) is a machine learning setting where many clients collaboratively train a model under the orchestration of a central server while keeping the training data decentralized; it aims to transfer traditional deep learning methods to a privacy-preserving setting. Existing works seek to improve model performance, efficiency, and fairness in the training and communication stages. FedAvg (McMahan et al. 2017) was proposed as the basic algorithm of federated learning. FedProx (Li et al. 2018) was proposed as a generalization and re-parametrization of FedAvg with a proximal term. SCAFFOLD (Karimireddy et al. 2019) uses control variates to correct the 'client-drift' in local updates. FedAC (Yuan and Ma 2020) was proposed to improve convergence speed and communication efficiency. FedAwS (Yu et al. 2020) investigated a new setting where each client has access to the positive data associated with only a single class. However, most of these works focus on shallow networks and suffer from privacy leakage in face recognition. Recently, some works focusing on federated face recognition have also emerged (Bai et al. 2021; Aggarwal, Zhou, and Jain 2021).

Methodology

In this section, we first provide the formulation of federated learning and its variant for face recognition. We start by analysing it, and then illustrate how we are motivated to propose FedGC.

Problem Formulation

We consider a C-class classification problem defined over a compact space X and a label space Y. Let K be the number of clients, and suppose the k-th client holds data {x_i^k, y_i^k} distributed over S_k : X_k × Y_k. We ensure the identity mutual exclusion of clients, Y_k ∩ Y_z = ∅, where k, z ∈ [K], k ≠ z, such that S = ∪_{k∈[K]} S_k. In this work, we consider the following distributed optimization model:

\min_{w} F(w) \triangleq \sum_{k=1}^{K} p_k F_k(w),   (1)

where p_k is the weight of the k-th client. Let the k-th client hold n_k training samples with \sum_{k=1}^{K} n_k = N, where N is the total number of data samples. We define p_k as n_k / N, so that \sum_{k=1}^{K} p_k = 1.

Consider an "embedding-based" discriminative model: given an input data x ∈ X, a neural network G : X → R^d parameterized by θ embeds the data x into a d-dimensional vector G(x; θ) ∈ R^d. Finally, the logits of an input data x in the k-th client, f_k(x) ∈ R^{C_k}, can be expressed as:

f_k(x) = W_k^{T} G(x; \theta),   (2)

where the matrix W_k ∈ R^{d×C_k} contains the class embeddings of the k-th client. Then Eq. 1 can be reformulated as:

\min_{W,\theta} F(W, \theta) \triangleq \sum_{k=1}^{K} p_k \frac{1}{n_k} \sum_{i=1}^{n_k} \ell_k\left(f_k(x_i^k), y_i^k\right),   (3)

where ℓ_k(·, ·) is the loss function of the k-th client and W = [W_1, · · · , W_K]^T. To provide a stricter privacy guarantee, we modify FedAvg (McMahan et al. 2017) by keeping the last fully-connected layer private in each client. We term this privacy-preserving version of FedAvg Federated Averaging with Private Embedding (FedPE). In FedPE, each client only has access to its own final class embeddings and the shared backbone parameters. Note that differential privacy (Abadi et al. 2016) for federated methods can be readily employed in FedPE by adding noise to the parameters sent from each client to enhance security.
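To make the FedPE client step concrete, the following PyTorch-style sketch shows one client's local optimization of the shared backbone θ and its private class embeddings W_k (Eqs. 2-3). This is an illustrative sketch under our own naming (local_update, loader, etc.), not the authors' released code; the plain softmax loss is shown, and margin-based variants such as CosFace or ArcFace would only modify the logits.

```python
# Minimal sketch of one FedPE client step (illustrative; not the paper's code).
import torch
import torch.nn.functional as F

def local_update(backbone, W_k, loader, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """Optimize the shared backbone theta and the private class embeddings W_k
    (shape d x C_k) on one client's data; W_k never leaves the client."""
    W_k = W_k.clone().requires_grad_(True)
    opt = torch.optim.SGD(list(backbone.parameters()) + [W_k],
                          lr=lr, momentum=momentum, weight_decay=weight_decay)
    for x, y in loader:
        feat = backbone(x)                 # G(x; theta) in R^d
        logits = feat @ W_k                # f_k(x) = W_k^T G(x; theta), Eq. (2)
        loss = F.cross_entropy(logits, y)  # local softmax term of Eq. (3)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return backbone.state_dict(), W_k.detach()
```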
Figure 1: An illustration of our method. In communication round t, the server broadcasts the model parameters (θ^t, W_k^t) to the selected clients. Then the clients locally compute an update to the model with their local data asynchronously and send the new model (θ^{t+1}, W_k^{t+1}) back. Finally, the server collects an aggregate of the client updates and applies cross-client optimization. (a) Client Optimization: clients seek to obtain more discriminative and more compact features. (b) Server Optimization: correct gradients and make cross-client embeddings spread out.
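The steps in Figure 1 can be summarized in a round-level sketch like the one below. It is illustrative only: local_update_fn stands for a client step such as the one sketched above, and correction_fn stands for the server-side cross-client optimization described in the next subsection; both callables, and all names here, are our assumptions rather than the paper's interface.

```python
# Sketch of one FedGC communication round (illustrative; our own naming).
import copy

def fedgc_round(server_backbone, client_W, client_loaders, p,
                local_update_fn, correction_fn):
    """p[k] = n_k / N are the aggregation weights from Eq. (1)."""
    states, new_W = [], []
    for k, loader in enumerate(client_loaders):
        # (1) server broadcasts theta^t and W_k^t; (2) client runs local training
        backbone_k = copy.deepcopy(server_backbone)
        state_k, W_k = local_update_fn(backbone_k, client_W[k], loader)
        # (3) client sends (theta_k^{t+1}, W_k^{t+1}) back to the server
        states.append(state_k)
        new_W.append(W_k)
    # (4) server aggregates the backbones and corrects the class embeddings
    avg = {}
    for key, val in states[0].items():
        avg[key] = (sum(p[k] * s[key] for k, s in enumerate(states))
                    if val.is_floating_point() else val)  # keep integer buffers
    server_backbone.load_state_dict(avg)
    new_W = correction_fn(new_W)  # spread cross-client class embeddings apart
    return server_backbone, new_W
```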
Cross-Client Separability with Gradient Correction

It is hard to mimic the global softmax with a set of local softmax. To address the missing optimization, as illustrated

(5)

where W_{k,i} is the i-th class embedding of the k-th client, and (·)′ indicates that the vector does not require gradient (its gradient is set to zero). We precisely limit the gradient of the loss function with the softmax regularizer in order to push the network update towards the direction of the standard softmax.

In addition to FedPE, the server performs an additional optimization step on the class embedding matrix W ∈ R^{d×C} to ensure that cross-client class embeddings are separated from each other. The Federated Averaging with Gradient Correction (FedGC) algorithm is summarized in Algorithm 1. In communication round t, the server broadcasts the model parameters (θ^t, W_k^t) to the k-th client. Then the clients locally compute an update with respect to the local softmax loss function on their local data asynchronously and send the new model (θ^{t+1}, W_k^{t+1}) back. Finally, the server collects an aggregate of the client updates and applies cross-client optimization. Note that differential privacy can also be applied to FedGC to prevent privacy leakage, as in FedPE.

We will theoretically analyze how FedGC works and how it pushes the network to update in the direction of the global softmax.

Figure 2: Update steps of class embeddings on a single client. (a) The divergence between FedPE and SGD becomes much larger without correction. (b) The gradient correction term ensures the update moves towards the true optimum.
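As a rough illustration of the server-side step, the sketch below spreads cross-client class embeddings apart with a few gradient steps on W. The penalty shown is the cosine regularizer discussed in the Remark below, ∑_{z≠k} ∑_j W_{z,j}^T W_{k,i} (computed here on L2-normalized embeddings); the paper's actual softmax regularizer of Eq. 5, which is not reproduced in this excerpt, reweights these terms in a softmax form, so this code is a simplified stand-in rather than the exact FedGC update.

```python
# Illustrative server-side cross-client correction (not the exact Eq. 5 update).
import torch
import torch.nn.functional as F

def cross_client_correction(client_W, lam=20.0, lr=0.1, steps=1):
    """Push cross-client class embeddings apart; each element of client_W is a
    d x C_k matrix whose columns are the class embeddings of client k.
    lam plays the role of the regularizer multiplier studied in Table 2."""
    W = [w.clone().requires_grad_(True) for w in client_W]
    opt = torch.optim.SGD(W, lr=lr)
    for _ in range(steps):
        reg = 0.0
        for k in range(len(W)):
            for z in range(len(W)):
                if z == k:
                    continue
                # cosine similarities between every cross-client embedding pair
                sim = F.normalize(W[z], dim=0).t() @ F.normalize(W[k], dim=0)
                reg = reg + sim.sum()
        opt.zero_grad()
        (lam * reg).backward()
        opt.step()
    return [w.detach() for w in W]
```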
Eq. 13 together with Eq. 11 can act as a substitute for Eq. 9 and adds the missing cross-client term. Therefore, FedGC can push the class embeddings in a similar direction to standard SGD while guaranteeing higher privacy.

Remark: Another simple way to introduce a cross-client constraint is to minimize ∑_{z≠k} ∑_{j=0}^{C_z} W_{z,j}^T W_{k,i}; we call it the cosine regularizer. For a particular W_{z,j}, the cosine regularizer introduces the gradient ∂ℓ/∂W_{z,j} = ∑_{z≠k} W_{k,i}. We show that our proposed softmax regularizer can act as a correction term for the local softmax and can also be regarded as a weighted version of ∑_{z≠k} ∑_{j=0}^{C_z} W_{z,j}^T W_{k,i} from the perspective of backward propagation. Our proposed softmax regularizer generates gradients of larger magnitude for more similar embeddings (hard examples), so it can also be regarded as a regularization term with hard example mining. In addition, we defined the softmax regularizer following the form of the softmax. Thus, several loss functions that are variants of the softmax (e.g., ArcFace, CosFace) can be obtained with minor modifications of the softmax regularizer.

Extend FedGC to More General Case

In the above analysis, we adopt the identity mutual exclusion assumption Y_k ∩ Y_z = ∅. In fact, FedGC solves the problem of missing cross-client optimization, so it can also be applied to the general case. We generalize the above-mentioned situation, that is, some IDs are mutually exclusive and some IDs are shared. For example, suppose there is an identity l shared by a client group K_l. After each round of communication, the server takes the average of W_{n,l}^t, n ∈ K_l, and applies our proposed softmax regularizer (only exclusive clients are introduced, in this case the clients K − K_l) to correct its gradient. In this way, W_l is updated in a direction similar to the standard softmax. With minor modifications to the above analysis, we can prove the applicability of FedGC in the general case.

Relation to Other Methods

Multi-task learning. Multi-task learning combines several tasks into one system, aiming to improve generalization ability (Seltzer and Droppo 2013). Considering a multi-task learning system with input data x_i, the overall objective function is a combination of several sub-objective loss functions, written as L = ∑_j L_j(θ, W_j, x_i), where θ are generic parameters and W_j, j ∈ [1, 2, · · · ], are task-specific parameters. In FedGC, Eq. 3 can also be regarded as a combination of many class-dependent changing tasks L_k = (1/n_k) ∑_{j=1}^{n_k} ℓ_k(f_k(x_j^k), y_j^k), k ∈ [1, · · · , K]. In general, multi-task learning is conducted end-to-end and trained on a single device, while in FedGC the model is trained with class-exclusive, decentralized, non-IID data. Thus, our method can also be regarded as a decentralized version of multi-task learning.

Generative Adversarial Nets (GAN). Based on the idea of game theory, GAN is essentially a two-player minimax problem, min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))], which converges to a Nash equilibrium. In FedGC, client optimization and server optimization can be regarded as a process of adversarial learning, where clients tend to minimize the similarity of within-client class embeddings, L_k = (1/n_k) ∑_{j=1}^{n_k} ℓ_k(f_k(x_j^k), y_j^k), while the server tends to minimize the similarity of cross-client class embeddings and encourages within-client class embeddings to be more compact, Reg(W). By performing adversarial learning similar to GAN, the network can learn more discriminative representations of the class embeddings.

Experiments

Implementation Details

Datasets. Considering that federated learning is extremely time-consuming, we employ CASIA-WebFace (Yi et al. 2014) as the training set. CASIA-WebFace is collected from the internet and contains about 10,000 subjects and 500,000 images. To simulate the federated learning setting, we randomly divide the training set into 36 clients. For testing, we explore the verification performance of the proposed FedGC on benchmark datasets (LFW (Huang et al. 2008), CFP-FP (Sengupta et al. 2016), AgeDB-30 (Moschoglou et al. 2017), SLLFW (Deng et al. 2017), CPLFW (Zheng and Deng 2018), CALFW (Zheng, Deng, and Hu 2017), and VGG2-FP (Cao et al. 2018)). We also explore large-scale image datasets (MegaFace (Kemelmacher-Shlizerman et al. 2016), IJB-B (Whitelam et al. 2017), and IJB-C (Maze et al. 2018)).

Experimental Settings. In data preprocessing, we use five facial landmarks for similarity transformation, then crop and resize the faces to 112×112. We employ ResNet-34 (He et al. 2016) as the backbone. We train the model with 2 synchronized 1080Ti GPUs on PyTorch. The learning rate is set to a constant of 0.1 and is kept constant without decay, similar to recent federated works. The batch size is set to 256. For a fair comparison, the learning rate is also kept at 0.1 for centralized standard SGD. We set the momentum to 0.9 and the weight decay to 5e-4.

Ablation Study

Fraction of participants. We compare the fraction of participants C ∈ [0, 1]. In each communication round, C · K clients conduct optimization in parallel, and we evaluate on LFW. Table 1 shows the impact of varying C for face recognition models. We train the models with the guidance of the softmax loss. It shows that as client participation C increases, the performance of the model also increases, and FedGC still outperforms the baseline model by a notable margin.

Regularizer multiplier λ. We perform an analysis of the learning rate multiplier λ of the softmax regularizer on LFW. As shown in Table 2, FedGC achieves the best performance when λ is 20. It shows that a large multiplier can also cause the network to collapse, as it makes within-client class embeddings collapse to one point. When λ is very small, the model degenerates into the baseline model FedPE.

Balanced v.s. Unbalanced Partition. We compare the verification performance according to the partition of the datasets. Here we construct an unbalanced partition from a log-normal distribution: ln X ∼ N(0, 1). We perform an analysis of the model with the softmax loss function on LFW. Table 3 shows that the unbalanced partition even improves the performance of the network to some extent.
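For reference, an unbalanced identity partition of this kind can be generated as in the sketch below, which draws per-client shares from a log-normal distribution (ln X ∼ N(0, 1)) and assigns each identity to one of the 36 clients accordingly; the function name and sampling details are our own illustrative choices, not the paper's exact procedure.

```python
# Illustrative log-normal (unbalanced) identity partition across clients.
import numpy as np

def unbalanced_partition(num_identities, num_clients=36, seed=0):
    """Split identity indices into clients with log-normally distributed sizes."""
    rng = np.random.default_rng(seed)
    shares = rng.lognormal(mean=0.0, sigma=1.0, size=num_clients)  # ln X ~ N(0, 1)
    shares = shares / shares.sum()          # convert to per-client probabilities
    owner = rng.choice(num_clients, size=num_identities, p=shares)
    return [np.where(owner == k)[0] for k in range(num_clients)]

# A balanced counterpart could simply use:
# np.array_split(np.random.default_rng(0).permutation(num_identities), 36)
```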
Method           C = 0.25   C = 0.5   C = 0.75   C = 1
Softmax-FedPE    93.12      93.83     94.32      94.77
Softmax-FedGC    97.07      97.98     98.13      98.40

Table 1: Verification performance on LFW for different participation fractions C with the softmax loss function.

Method           λ = 1   λ = 20   λ = 50
Softmax-FedGC    95.20   98.40    97.42

Table 2: Verification performance on LFW for different learning rate multipliers λ with the softmax loss function.

Method             LFW     CFP-FP   AgeDB
Balanced-FedPE     94.77   81.90    78.38
Balanced-FedGC     98.40   90.20    85.85
Unbalanced-FedPE   96.27   85.26    81.22
Unbalanced-FedGC   98.80   91.56    88.78

Table 3: Verification performance of different data partitions with the softmax loss function.

         FedPE   FedCos   FedGC
LFW      94.77   96.63    98.40
Method LFW CFP-FP AgeDB CALFW CPLFW SLLFW VGG2-FP Average
Softmax∗ 99.84 89.39 87.62 84.83 76.08 92.33 88.18 88.32
-FedPE 94.77 81.90 78.38 74.15 64.40 80.42 80.32 79.19
-FedPE+Fixed 96.11 83.67 80.28 77.95 66.27 84.23 82.70 81.60
-FedGC 98.40 90.20 85.85 81.47 71.88 90.38 87.64 86.55
CosFace(m = 0.35)∗ 99.10 90.79 91.37 89.53 80.20 95.95 89.10 90.86
-FedPE 98.17 86.90 86.28 83.68 72.67 91.15 85.24 86.30
-FedPE+Fixed 96.35 73.01 81.77 79.25 62.15 86.57 75.16 79.18
-FedGC 98.83 88.60 90.00 87.82 76.72 94.02 85.74 88.82
ArcFace(m = 0.5)∗ 97.62 90.50 83.37 77.33 70.95 86.28 89.40 85.06
-FedPE 98.18 87.23 86.13 82.47 71.77 91.05 85.70 86.08
-FedPE+Fixed 95.85 64.43 79.15 77.53 58.63 85.67 66.70 75.42
-FedGC 98.65 87.77 89.27 86.47 75.17 93.58 84.80 87.96
Table 5: Verification results (%) of different loss functions (Softmax, CosFace, ArcFace) and methods on 7 verification datasets. FedGC surpasses the others and enhances the average accuracy. ∗ indicates re-implementation with our code; η is a constant 0.1.
References

Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 308–318.

Aggarwal, D.; Zhou, J.; and Jain, A. K. 2021. FedFace: Collaborative Learning of Face Recognition Model. arXiv preprint arXiv:2104.03008.

Ahonen, T.; Hadid, A.; and Pietikainen, M. 2006. Face description with local binary patterns: Application to face recognition. IEEE transactions on pattern analysis and machine intelligence, 28(12): 2037–2041.

Bai, F.; Wu, J.; Shen, P.; Li, S.; and Zhou, S. 2021. Federated Face Recognition. arXiv preprint arXiv:2105.02501.

Cao, Q.; Shen, L.; Xie, W.; Parkhi, O. M.; and Zisserman, A. 2018. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 67–74. IEEE.

Deng, J.; Guo, J.; Liu, T.; Gong, M.; and Zafeiriou, S. 2020. Sub-center arcface: Boosting face recognition by large-scale noisy web faces. In European Conference on Computer Vision, 741–757. Springer.

Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4690–4699.

Deng, J.; Zhou, Y.; and Zafeiriou, S. 2017. Marginal loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 60–68.

Deng, W.; Hu, J.; Zhang, N.; Chen, B.; and Guo, J. 2017. Fine-grained face verification: FGLFW database, baselines, and human-DCMN partnership. Pattern Recognition, 66: 63–73.

Duan, Y.; Lu, J.; and Zhou, J. 2019. Uniformface: Learning deep equidistributed representation for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3415–3424.

Guo, Y.; Zhang, L.; Hu, Y.; He, X.; and Gao, J. 2016. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European conference on computer vision, 87–102. Springer.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Huang, G. B.; Mattar, M.; Berg, T.; and Learned-Miller, E. 2008. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition.

Karimireddy, S. P.; Kale, S.; Mohri, M.; Reddi, S. J.; Stich, S. U.; and Suresh, A. T. 2019. Scaffold: Stochastic controlled averaging for federated learning. arXiv preprint arXiv:1910.06378.

Kemelmacher-Shlizerman, I.; Seitz, S. M.; Miller, D.; and Brossard, E. 2016. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4873–4882.

Kim, Y.; Park, W.; Roh, M.-C.; and Shin, J. 2020. GroupFace: Learning Latent Groups and Constructing Group-based Representations for Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5621–5630.

Kim, Y.; Park, W.; and Shin, J. 2020. BroadFace: Looking at Tens of Thousands of People at Once for Face Recognition. In European Conference on Computer Vision, 536–552. Springer.

Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2018. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.

Liu, C.; and Wechsler, H. 2002. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image processing, 11(4): 467–476.

Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; and Song, L. 2017. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 212–220.

Liu, W.; Wen, Y.; Yu, Z.; and Yang, M. 2016. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, 7.

Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov): 2579–2605.

Marriott, R. T.; Romdhani, S.; and Chen, L. 2021. A 3D GAN for Improved Large-pose Facial Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13445–13455.

Maze, B.; Adams, J.; Duncan, J. A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A. K.; Niggel, W. T.; Anderson, J.; Cheney, J.; et al. 2018. Iarpa janus benchmark-c: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), 158–165. IEEE.

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.

Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; and Zafeiriou, S. 2017. Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 51–59.

Ng, H.-W.; and Winkler, S. 2014. A data-driven approach to cleaning large face datasets. In 2014 IEEE international conference on image processing (ICIP), 343–347. IEEE.

Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 815–823.

Seltzer, M. L.; and Droppo, J. 2013. Multi-task learning in deep neural networks for improved phoneme recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6965–6969. IEEE.

Sengupta, S.; Chen, J.-C.; Castillo, C.; Patel, V. M.; Chellappa, R.; and Jacobs, D. W. 2016. Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–9. IEEE.

Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; and Wei, Y. 2020. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6398–6407.

Sun, Y.; Wang, X.; and Tang, X. 2015. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2892–2900.

Taigman, Y.; Yang, M.; Ranzato, M.; and Wolf, L. 2014. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1701–1708.

Wang, F.; Chen, L.; Li, C.; Huang, S.; Chen, Y.; Qian, C.; and Change Loy, C. 2018a. The devil of face recognition is in the noise. In Proceedings of the European Conference on Computer Vision (ECCV), 765–780.

Wang, F.; Cheng, J.; Liu, W.; and Liu, H. 2018b. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7): 926–930.

Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; and Liu, W. 2018c. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5265–5274.

Wang, M.; and Deng, W. 2018. Deep Face Recognition: A Survey. arXiv preprint arXiv:1804.06655.

Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Adams, J.; Miller, T.; Kalka, N.; Jain, A. K.; Duncan, J. A.; Allen, K.; et al. 2017. Iarpa janus benchmark-b face dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 90–98.

Wolf, L.; Hassner, T.; and Maoz, I. 2011. Face recognition in unconstrained videos with matched background similarity. In CVPR 2011, 529–534. IEEE.

Yi, D.; Lei, Z.; Liao, S.; and Li, S. Z. 2014. Learning face representation from scratch. arXiv preprint arXiv:1411.7923.

Yin, H.; Molchanov, P.; Alvarez, J. M.; Li, Z.; Mallya, A.; Hoiem, D.; Jha, N. K.; and Kautz, J. 2020. Dreaming to distill: Data-free knowledge transfer via deepinversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8715–8724.

Yu, F. X.; Rawat, A. S.; Menon, A. K.; and Kumar, S. 2020. Federated Learning with Only Positive Labels. arXiv preprint arXiv:2004.10342.

Yuan, H.; and Ma, T. 2020. Federated Accelerated Stochastic Gradient Descent. arXiv preprint arXiv:2006.08950.

Zheng, T.; and Deng, W. 2018. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep, 5.

Zheng, T.; Deng, W.; and Hu, J. 2017. Cross-age lfw: A database for studying cross-age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197.

Zhu, Z.; Huang, G.; Deng, J.; Ye, Y.; Huang, J.; Chen, X.; Zhu, J.; Yang, T.; Lu, J.; Du, D.; et al. 2021. WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10492–10502.