
Towards an Explainable Comparison and Alignment of Feature Embeddings

Mohammad Jalali 1  Bahar Dibaei Nia 2  Farzan Farnia 1

arXiv:2506.06231v1 [cs.LG] 6 Jun 2025

1 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. 2 Sharif University of Technology, Tehran, Iran. Correspondence to: Mohammad Jalali <[email protected]>, Bahar Dibaei Nia <[email protected]>, Farzan Farnia <[email protected]>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the Spectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The code is available at github.com/mjalali/embedding-comparison.

1. Introduction

Several mainstream frameworks in computer vision and natural language processing rely on embedding models to map raw image and text inputs into spaces with semantically meaningful features (Radford et al., 2021; Liu et al., 2023; Alayrac et al., 2022; Li et al., 2023). The application of pre-trained embeddings has enabled scalable solutions for many downstream tasks, particularly in scenarios where the available sample size and compute resources are significantly limited. In such cases, the embedded data can be used to train simple models, such as linear or k-nearest neighbors (KNN) classifiers, to achieve satisfactory results. Additionally, features extracted by standard embedding models are widely employed for the automated evaluation of generative models (Heusel et al., 2017; Kynkäänniemi et al., 2023; Stein et al., 2023), providing accurate rankings of generative modeling architectures without requiring time-intensive human assessments.

While recent advancements in the machine learning community have introduced various embedding models that achieve remarkable results on standard image, text, and video domains, comparisons of these embeddings have primarily focused on evaluating their performance in standard downstream tasks, such as classification accuracy on benchmark datasets (e.g. ImageNet). However, such comparisons often lack interpretability and do not reveal how differently the embeddings behave in recognizing various sample types (Boggust et al., 2022). A more fine-grained comparison is necessary to disclose explainable differences between embedding models, particularly in identifying which samples are clustered differently according to the models. Understanding these differences can aid in interpreting and debugging embeddings and can also be leveraged to align multiple embeddings (Simhi & Markovitch, 2023; Dar et al., 2023). Furthermore, interpreting the discrepancies between embeddings can be utilized to select representation models for downstream applications such as generative model evaluation (Stein et al., 2023; Kynkäänniemi et al., 2023).
[Figure 1: An overview of the SPEC method — kernel matrices for embedding X (CLIP) and embedding Y (DINOv2) on the reference dataset (FFHQ), the kernel difference matrix, UMAP visualizations of CLIP and DINOv2, and the top 3 SPEC-identified clusters for CLIP → DINOv2 and for DINOv2 → CLIP.]

Figure 1. Overview of the Spectral Pairwise Embedding Comparison (SPEC) framework: SPEC performs an eigendecomposition of the difference of the kernel matrices of the two compared embeddings (e.g., DINOv2 and CLIP image embeddings) on a given reference dataset. Every eigenvector can be interpreted as a sample cluster captured differently by the two embeddings, and the corresponding eigenvalue quantifies the difference between the cluster frequencies in the embedding spaces.

In this work, we propose a spectral approach called Spectral Pairwise Embedding Comparison (SPEC) for the fine-grained comparison of two embeddings. The SPEC framework detects differences in sample clusters assigned by two embeddings, identifying major data groups that are clustered differently by one embedding compared to the other one. To achieve this, we adopt standard spectral clustering, which leverages the eigendecomposition of the kernel similarity matrix, and propose analyzing the principal eigenvectors of the difference of the embeddings' kernel matrices to interpret the differences in cluster assignments. Our analysis suggests that the SPEC framework can effectively detect cluster differences between two embeddings.

To address the computational challenges of performing eigendecomposition on large-scale datasets, we develop a scalable implementation of SPEC. A direct eigendecomposition of the n × n difference kernel matrix requires O(n³) computations for a dataset with n samples, which is computationally expensive for large datasets. Assuming a bounded feature dimension d for the applied kernel function, we prove that the eigenspace of the difference kernel matrix can be computed using O(max{d³, n}) operations, resulting in a scalable algorithm under a moderate dimension d value. Furthermore, we extend this scalable computation method to shift-invariant kernel functions, e.g. the Gaussian kernel, by employing the framework of random Fourier features (RFF) (Rahimi & Recht, 2007a), where the size of the RFF proxy-feature map can be controlled for a more efficient application of SPEC.

We also explore the application of the SPEC framework to define a distance measure between two embeddings. We define the SPEC-diff distance as the spectral radius of the kernel difference matrix, which aims to quantify the weight of the most differently captured cluster in one embedding that is not strongly clustered by the other model. We discuss scalable computations of this distance and its gradient with respect to the embedding parameters. Using the power method and the calculated left and right eigenvectors of the differential covariance matrix, we enable gradient-based optimization of the distance measure for aligning embedding models. This gradient-based approach leads to a method we call SPEC-align, aligning embedding models by minimizing their differences in clustering a reference dataset. SPEC-align is particularly useful for aligning cross-modality embeddings, such as CLIP (Radford et al., 2021), with a state-of-the-art single-modality embedding. Such a spectral alignment can improve the performance of cross-modality embeddings in capturing concepts specific to individual modalities.

Finally, we present numerical experiments on several standard image and text embeddings using benchmark datasets. Our results demonstrate the scalability of the SPEC framework in revealing differences in sample clusters across embeddings over large-scale datasets. In our experiments, we tested the SPEC algorithm's application with both cosine-similarity and shift-invariant Gaussian kernels, where we leverage random Fourier features for the latter case. Additionally, we discuss the application of SPEC-align to align the CLIP model with single-modality embeddings. The empirical results highlight the effectiveness of SPEC-align in reducing the differences between CLIP's image embeddings and specialized image-domain embeddings. The following is a summary of our work's main contributions:

• Proposing the SPEC framework for explainable comparison of two embeddings,
• Providing a scalable SPEC implementation with computational cost growing linearly in the sample size,
• Developing the gradient-based SPEC-align method to align two embeddings and match their sample clusters,
• Demonstrating the successful application of SPEC in comparing and aligning embeddings on benchmark datasets.

2. Related Work

Spectral Clustering, Kernel PCA, and Random Fourier Features. Kernel PCA (Schölkopf et al., 1998) is a widely recognized technique for dimensionality reduction that relies on the eigendecomposition of the kernel matrix. Several studies (Bengio et al., 2003b;a) have explored the relationship between kernel PCA and spectral clustering. Also, the analysis of random Fourier features (Rahimi & Recht, 2007b) for performing scalable kernel PCA has been studied by Chitta et al. (2012); Ghashami et al. (2016); Ullah et al. (2018); Sriperumbudur & Sterge (2022); Gedon et al. (2023). In this paper, we introduce a spectral approach for comparing two embeddings, leveraging the random Fourier features framework to address computational challenges. Unlike Laplacian spectral clustering (Ng et al., 2001), which uses the graph Laplacian, our method uses the kernel matrix, similar to kernel PCA.

Evaluation and Comparison of Embeddings. Embedding evaluation is typically conducted using a limited set of downstream tasks (Chen et al., 2013; Santos et al., 2020; Perone et al., 2018; Choi et al., 2021). Existing NLP benchmarks (Gao et al., 2021; Reimers & Gurevych, 2019) focus on limited tasks. Muennighoff et al. (2023) introduce MTEB, standardizing text embedder evaluation across diverse NLP tasks. For image embeddings, Kynkäänniemi et al. (2023); Stein et al. (2023) compared different image embeddings and showed how they can influence different tasks, specifically the evaluation of generative models. Another line of research is probing methods (Belinkov, 2022; Pimentel et al., 2020; Adi et al., 2017; Rogers et al., 2021), which analyze model embeddings by training small models on them to understand what information is encoded. These methods help assess how well embeddings capture specific features, although they are not focused on embedding comparison. Darrin et al. (2024) propose a new metric for comparing embeddings without labeled data and introduce the concept of information sufficiency (IS) to quantify the information required to simulate one embedding from another. Our work offers a complementary, explainable method for comparing embeddings by detecting the different sample clusters assigned by embeddings and providing a method for aligning them.

A different yet related line of work is the evaluation of generative models. Bińkowski et al. (2018); Jalali et al. (2023); Ospanov et al. (2024); Jalali et al. (2024); Ospanov & Farnia (2024) leverage the eigenspectrum of kernel matrices to quantify diversity. The papers (Jiralerspong et al., 2023; Zhang et al., 2024) explore novelty evaluation, analyzing how generated samples differ from those of a reference distribution. In particular, Zhang et al. (2024; 2025) propose a spectral method for measuring the entropy of the novel modes of a generative model with respect to a reference model, relying on the eigendecomposition of kernel similarity matrices.

Embedding Alignment. There are many works on embedding alignment for multimodal models (Bellagente et al., 2023; Lu et al., 2024; Han et al., 2024; Wang et al., 2023b; Girdhar et al., 2023; Grave et al., 2019). Salman et al. (2024); Eslami & de Melo (2025) introduced a method which demonstrates that adversarial perturbations can force text embeddings to align with any image in multimodal models, exposing security vulnerabilities in vision-language learning. Ye et al. (2024) proposed ModalChorus, an interactive system that visualizes and corrects misalignments in multi-modal embeddings, improving interpretability and optimization. Focusing on fine-grained alignment, Yin et al. (2024) introduced a method for explicitly aligning individual word embeddings with corresponding visual features, leveraging cross-modal attention to refine token-image associations. In contrast, our work focuses on aligning embeddings in a kernel setting specifically to match their sample clusters, leading to a different approach to embedding comparison.

3. Preliminaries

3.1. Embedding maps and spaces

Consider a data vector x ∈ X in the space X. An embedding map ψ : X → S maps an input x to the embedding space S, which is supposed to provide a more meaningful representation of the input data vector. Throughout this work, we focus on the problem of characterizing and interpreting the differences of two embedding maps ψ1 : X → S1 and ψ2 : X → S2, which can map the input x ∈ X to different embedding spaces S1, S2.

3.2. Kernel Functions and Covariance Matrix

A kernel function k : X × X → R maps two inputs x, x′ to a similarity score k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩ ∈ [0, 1] that is the inner product of the representations of x, x′ characterized by ϕ : X → R^d. This definition implies that for every sequence of data points x1, . . . , xn ∈ X, the following kernel matrix is positive semi-definite (PSD):

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \succeq 0 \qquad (1)$$

Well-known examples of kernel functions include the cosine-similarity kernel $k_{\mathrm{cosine}}(x, y) = \frac{x^\top y}{\|x\|_2 \|y\|_2}$ and the Gaussian (RBF) kernel defined for bandwidth σ as:

$$k(x, y) = \exp\Bigl(\frac{-\|x - y\|_2^2}{2\sigma^2}\Bigr)$$

Both these examples are normalized kernels where k(x, x) = 1 holds for every x ∈ X. Note that the kernel matrix in (1) can be written as K = ΦΦ⊤ where Φ ∈ R^{n×d} contains ϕ(x_i) as its ith row for i ∈ {1, . . . , n}. Then, the kernel covariance matrix C_X ∈ R^{d×d} can be defined by reversing the matrix multiplication order as:

$$C_X := \frac{1}{n}\Phi^\top \Phi = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)\phi(x_i)^\top \qquad (2)$$

Therefore, C_X = (1/n)Φ⊤Φ and (1/n)K = (1/n)ΦΦ⊤ share the same non-zero eigenvalues, since they represent the products of matrices with flipped multiplication orders.
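As a quick illustration of this shared-spectrum property, the following minimal numpy sketch (with a hypothetical random feature matrix Φ standing in for a kernel feature map) checks that the n × n matrix (1/n)ΦΦ⊤ and the d × d covariance (1/n)Φ⊤Φ have identical non-zero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16                      # n samples, feature dimension d
Phi = rng.standard_normal((n, d))   # rows are phi(x_i); a stand-in feature map

K_over_n = Phi @ Phi.T / n          # (1/n) K, an n x n matrix of rank <= d
C = Phi.T @ Phi / n                 # kernel covariance C_X, a d x d matrix

eig_K = np.sort(np.linalg.eigvalsh(K_over_n))[::-1][:d]  # top d eigenvalues
eig_C = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(eig_K, eig_C)    # identical non-zero spectra
```

This is the trick that makes SPEC scalable: all spectral computations can be carried out on the small d × d covariance side rather than the n × n kernel side.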

4. SPEC: A Spectral Identification of Embeddings' Mismatches

Consider a set of n data points x1, . . . , xn ∈ X and two embedding maps ψ1 : X → S1 and ψ2 : X → S2. Also, suppose k1 : S1 × S1 → R and k2 : S2 × S2 → R are kernel functions to be applied to the embedding spaces of ψ1, ψ2, respectively.

To compare the two embeddings, note that their corresponding spaces S1 and S2 may have different dimensions, and therefore a sample-specific comparison of the embedded vectors ψ1(x), ψ2(x) for each individual data point x will not provide a meaningful comparison of the embedding maps. Therefore, a more relevant approach is to compare the embeddings' outputs over the entire set of data {x1, . . . , xn} and investigate which structures are dissimilar between the sets of embedded data following the embeddings. Here, we consider a spectral approach and particularly focus on the difference of kernel matrices between the two embeddings. In the following, we discuss how the eigenspace of the kernel difference matrix can help identify the points clustered differently by the two embeddings.

To do this, consider the kernel matrix of the first embedding $K_{\psi_1} = \bigl[k_1(\psi_1(x_i), \psi_1(x_j))\bigr]_{(i,j)=(1,1)}^{(n,n)}$ and the kernel matrix of the second embedding $K_{\psi_2} = \bigl[k_2(\psi_2(x_i), \psi_2(x_j))\bigr]_{(i,j)=(1,1)}^{(n,n)}$.

Definition 1. We define the normalized kernel difference matrix Λψ1,ψ2 ∈ R^{n×n} as follows:

$$\Lambda_{\psi_1,\psi_2} := \frac{1}{n}\bigl(K_{\psi_1} - K_{\psi_2}\bigr) \qquad (3)$$

We propose the framework of Spectral Pairwise Embedding Comparison (SPEC), where the two embeddings ψ1 and ψ2 are compared using the eigendirections of the difference kernel matrix Λψ1,ψ2. As we will show, the principal eigenvectors can be interpreted as the clusters of samples assigned by embedding ψ1 that are less strongly grouped by the second embedding ψ2. In what follows, we first show a theoretical result supporting the mentioned property of Λψ1,ψ2's eigenvectors. Next, we provide a method for computing the eigenspace of Λψ1,ψ2 that scales linearly with the sample size n.

Theorem 1 proves that under the following two conditions on the sample index set I ⊂ {1, . . . , n}, the eigendirections of Λψ1,ψ2 can separate the clustered sample indices from the rest of the samples. Note that the notation I^c denotes the complement index set of I, and K[I, J] denotes the sub-matrix of K with rows in I and columns in J.

• Condition 1: Suppose the sample set X_I characterized by index set I is separated from the rest of the samples by embedding ψ1, where the normalized block kernel matrix (1/n)Kψ1[I, I^c] has a bounded Frobenius norm: ∥(1/n)Kψ1[I, I^c]∥_F ≤ ϵ1.

• Condition 2: Suppose the sample set X_I characterized by index set I is weakly grouped by embedding ψ2, where the normalized block kernel matrix (1/n)Kψ2[I, I] has a bounded ℓ2-operator norm (or maximum eigenvalue for this PSD matrix): ∥(1/n)Kψ2[I, I]∥_2 ≤ ϵ2.

Theorem 1. Consider the difference kernel matrix Λψ1,ψ2 in (3). Suppose Conditions 1 and 2 hold. Let v1, . . . , vn be the unit-norm eigenvectors of Λψ1,ψ2 corresponding to eigenvalues λ1, . . . , λn. For every i ∈ {1, . . . , n}, we define λ_i^I and λ_i^{I^c} to be the closest eigenvalues of Λψ1,ψ2[I, I] and Λψ1,ψ2[I^c, I^c] to λi. Then, the following holds for ξ = 4(ϵ1² + ϵ2):

$$\sum_{i=1}^{n} \bigl(\lambda_i - \lambda_i^{I}\bigr)^2 \bigl\|v_i[I]\bigr\|_2^2 + \bigl(\lambda_i - \lambda_i^{I^c}\bigr)^2 \bigl\|v_i[I^c]\bigr\|_2^2 \;\le\; \xi$$

Proof. We defer the proof of the theoretical statements to Appendix A.1.

Corollary 1. In the setting of Theorem 1, suppose v is an eigenvector of Λψ1,ψ2 for eigenvalue λ whose gap with the maximum eigenvalue of the sub-matrix Λψ1,ψ2[I^c, I^c] satisfies λ − λmax(Λψ1,ψ2[I^c, I^c]) ≥ γ > 0. Then,

$$\bigl\|v[I^c]\bigr\|_2 \;\le\; \frac{2\sqrt{\epsilon_1^2 + \epsilon_2}}{\gamma}.$$

The above corollary proves that if an eigenvalue λ of the kernel difference matrix Λψ1,ψ2 is sufficiently large, such that its gap with the maximum eigenvalue of the block Λψ1,ψ2[I^c, I^c] (with the complement of the samples in I clustered by ψ1 yet not by ψ2) exceeds the threshold γ, then the I^c-entries of the corresponding unit-norm eigenvector v will be bounded, reflecting the lack of I^c samples in the samples clustered differentially by embeddings ψ1 and ψ2. Based on the above theoretical results, we propose considering the principal eigendirections of the difference kernel matrix, and using their significant-value entries to find the subset of samples clustered by embedding ψ1 but not grouped by ψ2.

Since the difference kernel matrix is of size n × n, a standard eigendecomposition will cost O(n³) computations. Proposition 1 shows that the computation cost will be lower for embeddings with bounded feature maps. In fact, this result shows the computation of the eigenspace can be performed using a linearly growing computation cost O(n).

Proposition 1. Consider the difference kernel matrix Λψ1,ψ2 in (3). This matrix shares the same non-zero eigenvalues with the following matrix:

$$\Gamma_{\psi_1,\psi_2} = \begin{bmatrix} C_{\psi_1} & C_{\psi_1,\psi_2} \\ -C_{\psi_1,\psi_2}^\top & -C_{\psi_2} \end{bmatrix} \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)} \qquad (4)$$

where Cψ1 ∈ R^{d1×d1}, Cψ2 ∈ R^{d2×d2} are the kernel covariance matrices of ψ1, ψ2, respectively, and Cψ1,ψ2 ∈ R^{d1×d2} is the cross-covariance matrix, defined as:

$$C_{\psi_1} := \frac{1}{n}\sum_{i=1}^{n} \phi_1\bigl(\psi_1(x_i)\bigr)\phi_1\bigl(\psi_1(x_i)\bigr)^\top, \qquad C_{\psi_2} := \frac{1}{n}\sum_{i=1}^{n} \phi_2\bigl(\psi_2(x_i)\bigr)\phi_2\bigl(\psi_2(x_i)\bigr)^\top, \qquad C_{\psi_1,\psi_2} := \frac{1}{n}\sum_{i=1}^{n} \phi_1\bigl(\psi_1(x_i)\bigr)\phi_2\bigl(\psi_2(x_i)\bigr)^\top.$$

We also note that for every eigenvector v ∈ R^{d1+d2} of the matrix Γψ1,ψ2 in (4), which we call the differential covariance matrix, we can find the corresponding eigenvector u of the difference kernel matrix Λψ1,ψ2 using the following:

$$u = \begin{bmatrix} \phi_1(\psi_1(x_1)) & \phi_2(\psi_2(x_1)) \\ \vdots & \vdots \\ \phi_1(\psi_1(x_n)) & \phi_2(\psi_2(x_n)) \end{bmatrix} v$$

As can be seen, the computation of the matrix Γψ1,ψ2 can be performed with O(n) computations, growing linearly in the sample size; the eigendecomposition of Γψ1,ψ2 can be handled via O((d1 + d2)³) computations, depending on the dimensions of the kernel feature maps ϕ1, ϕ2; and finally, the eigenvector mapping from Γψ1,ψ2 to Λψ1,ψ2 will be O(n). Therefore, the entire eigenvector computation of Λψ1,ψ2 can be handled using O(n + (d1 + d2)³) computations. Algorithm 1 contains the main steps of computing the SPEC eigendirections using the above approach. As detailed in this algorithm, the computation of the differential kernel covariance matrix can be run over samples in a cascade, avoiding the need for storing a large dataset.

Algorithm 1 Spectral Pairwise Embedding Comparison (SPEC)
1: Input: Sample set {x1, . . . , xn}, embeddings ψ1 and ψ2, kernel feature maps ϕ1 and ϕ2
2: Initialize Cψ1 = 0_{d1×d1}, Cψ2 = 0_{d2×d2}, Cψ1,ψ2 = 0_{d1×d2}
3: for i ∈ {1, . . . , n} do
4:   Update Cψ1 ← Cψ1 + (1/n) ϕ1(ψ1(xi)) ϕ1(ψ1(xi))⊤
5:   Update Cψ2 ← Cψ2 + (1/n) ϕ2(ψ2(xi)) ϕ2(ψ2(xi))⊤
6:   Update Cψ1,ψ2 ← Cψ1,ψ2 + (1/n) ϕ1(ψ1(xi)) ϕ2(ψ2(xi))⊤
7: end for
8: Construct Γψ1,ψ2 as in Equation (4)
9: Compute the eigendecomposition Γψ1,ψ2 = V diag(λ) V⊤
10: for i ∈ {1, . . . , n} do
11:   Map eigenvector ui = [ϕ1(ψ1(X)) ϕ2(ψ2(X))] vi
12: end for
13: Output: Eigenvalues λ1, . . . , λn, eigenvectors u1, . . . , un.
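Under the simplifying assumption of linear kernel feature maps (ϕ1 and ϕ2 both the identity, so the feature matrices are just the embedded samples stacked row-wise), Algorithm 1 can be sketched in a few lines of numpy; the function name and the absolute-value cluster readout below are our own illustrative choices, not the released implementation:

```python
import numpy as np

def spec(Phi1, Phi2, top_k=10, samples_per_cluster=9):
    """Minimal sketch of Algorithm 1 for linear kernel feature maps,
    with phi1(psi1(x_i)) and phi2(psi2(x_i)) given as rows of Phi1, Phi2."""
    n, d1 = Phi1.shape
    d2 = Phi2.shape[1]
    # Covariance accumulation (Algorithm 1 runs this as a streaming loop)
    C1 = Phi1.T @ Phi1 / n
    C2 = Phi2.T @ Phi2 / n
    C12 = Phi1.T @ Phi2 / n
    Gamma = np.block([[C1, C12], [-C12.T, -C2]])  # Eq. (4)
    lam, V = np.linalg.eig(Gamma)                 # Gamma is non-symmetric, but
    lam, V = lam.real, V.real                     # its spectrum matches the
    order = np.argsort(-lam)                      # symmetric Lambda (real)
    lam, V = lam[order], V[:, order]
    # Map eigenvectors of Gamma back to eigenvectors of Lambda (step 11)
    U = Phi1 @ V[:d1, :] + Phi2 @ V[d1:, :]
    # Large positive eigenvalues: clusters captured by psi1 but not psi2.
    # Read off the samples with the largest-magnitude eigenvector entries
    # (absolute values, since an eigenvector's sign is arbitrary).
    clusters = [np.argsort(-np.abs(U[:, j]))[:samples_per_cluster]
                for j in range(top_k)]
    return lam[:top_k], clusters
```

Calling spec(Phi1, Phi2) on, e.g., CLIP and DINOv2 features of a reference dataset returns the top eigenvalues together with the indices of the most strongly weighted samples in each principal eigendirection; for shift-invariant kernels, Phi1 and Phi2 would be replaced by the random Fourier feature maps described next.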

Applying the standard linear and cosine-similarity kernels, the kernel feature dimension will match that of the embedding, which is usually bounded by 1000 for standard image and text embeddings. In the case of shift-invariant kernels, e.g. the Gaussian (RBF) kernel, whose feature dimension is infinite, we can leverage random Fourier features (RFFs) (Rahimi & Recht, 2007b) to reduce the kernel feature dimension for a proper proxy kernel function characterized by the random Fourier features. According to the RFF framework, given a kernel function k(x, y) = κ(x − y) that is normalized, i.e. κ(0) = 1, we draw m independent Fourier features ω1, . . . , ωm ∼ κ̂ from the probability density function κ̂, which denotes the Fourier transform of κ defined as

$$\hat{\kappa}(\omega) = \frac{1}{(2\pi)^d}\int_{\mathcal{X}} \kappa(x)\exp(-i\langle\omega, x\rangle)\, dx$$

Then, the RFF method approximates the shift-invariant kernel as $k(x, y) \approx \hat{k}(x, y) = \hat{\phi}(x)^\top \hat{\phi}(y)$ where

$$\hat{\phi}(x) = \frac{1}{\sqrt{m}}\Bigl[\cos(\omega_1^\top x),\, \sin(\omega_1^\top x),\, \ldots,\, \cos(\omega_m^\top x),\, \sin(\omega_m^\top x)\Bigr]$$

Theorem 2. Consider normalized shift-invariant kernels k1(x, y) = κ1(x − y) and k2(x′, y′) = κ2(x′ − y′). Drawing m Fourier features ω_i^{(1)} ∼ κ̂1 and ω_i^{(2)} ∼ κ̂2, we form the RFF-proxy kernel functions k̂1, k̂2. Then, considering the eigenvalues λ̂1, . . . , λ̂n and eigenvectors v̂1, . . . , v̂n of the proxy Λ̂ψ1,ψ2, for every δ > 0, the following holds with probability at least 1 − δ:

$$\sum_{i=1}^{n} \bigl\|\Lambda_{\psi_1,\psi_2}\hat{v}_i - \hat{\lambda}_i \hat{v}_i\bigr\|_2^2 \;\le\; \frac{128\log(2/\delta)}{m}$$

Proof. We defer the proof to Appendix A.3.

As the above theorem suggests, the eigenvectors of the RFF-proxy difference kernel matrix Λ̂ψ1,ψ2 provide an approximation of the eigenspace of the target difference kernel matrix Λψ1,ψ2. On the other hand, when forming the differential covariance matrix for the RFF-proxy kernel, the dimension of Γ̂ψ1,ψ2 will be 4m × 4m, which is finite, unlike for the target shift-invariant kernel. As a result, one can apply the eigenspace equivalence in Proposition 1 to the proxy kernel function to reduce the computational complexity to O(m³ + n) computations with m features and n samples.
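For the Gaussian kernel with bandwidth σ, the Fourier transform κ̂ is the N(0, σ⁻²I) density, so the RFF construction above reduces to a few lines of numpy. The following is a sketch of the standard construction (not the paper's released code); the feature ordering differs from the interleaved cos/sin layout above, which leaves the inner product unchanged:

```python
import numpy as np

def rff_map(X, m, sigma, seed=0):
    """Random Fourier feature map for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); returns an (n, 2m) matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(scale=1.0 / sigma, size=(d, m))  # omega_i ~ N(0, I / sigma^2)
    proj = X @ Omega                                    # (n, m) matrix of omega_i^T x
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(m)

# Sanity check: the proxy kernel approximates the exact Gaussian kernel.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
F = rff_map(X, m=2000, sigma=2.0)
K_hat = F @ F.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 2.0 ** 2))
print(np.abs(K_hat - K).max())  # small, and shrinking as m grows
```

Feeding such RFF matrices for the two embeddings into the Algorithm 1 sketch above yields the O(m³ + n) procedure discussed in this section.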

5. SPEC-based Quantification of Embedding Differences

As discussed earlier, the eigenspace of the difference kernel matrix Λψ1,ψ2 provides information on the samples clustered differently by the two embeddings. Therefore, the SPEC approach motivates measuring the difference of two embeddings using the eigenspectrum of Λψ1,ψ2. Here, we specifically focus on the spectral radius of Λψ1,ψ2, i.e., its maximum absolute eigenvalue ρ(Λψ1,ψ2) = max_{1≤i≤n} |λi(Λψ1,ψ2)|, where ρ(A) denotes A's spectral radius. Note that Λψ1,ψ2 is by definition a symmetric matrix with a zero trace, and therefore its eigenvalues are all real and add up to 0. The following definition states the difference measure, which we call the SPEC-diff score:

$$\text{SPEC-diff}(\psi_1, \psi_2) := \rho\bigl(\Lambda_{\psi_1,\psi_2}\bigr). \qquad (5)$$

Since SPEC-diff is only a function of Λψ1,ψ2's non-zero eigenvalues, Proposition 1 shows that SPEC-diff(ψ1, ψ2) = ρ(Γψ1,ψ2) is equal to the spectral radius of the differential covariance matrix Γψ1,ψ2; therefore, it is a symmetric pseudo-distance whose computation cost scales linearly with the sample size n.

While the SPEC-diff measure can be used to quantify the mismatches of two embeddings, it can be further optimized in the training or fine-tuning of an embedding map ψ1,θ's parameters θ in order to align the embedding's clusters with another reference embedding ψ2. The optimization problem to be solved for such an alignment of the embeddings will be the following, which we call the SPEC-align problem:

$$\min_{\theta\in\Theta}\; L(\psi_{1,\theta}) + \beta\cdot \text{SPEC-diff}(\psi_{1,\theta}, \psi_2) \qquad (6)$$

In the above, L(ψ1,θ) denotes the original loss function for training the embedding ψ1,θ, and β denotes the coefficient of the penalty function SPEC-diff(ψ1,θ, ψ2), penalizing the mismatch with the reference embedding ψ2. To apply a gradient-based optimization algorithm to solve (6), one needs to efficiently compute the gradient of SPEC-diff(ψ1,θ, ψ2) with respect to the parameter θ. The following proposition shows that the gradient computation can be run in O(nB) over a batch of size nB.

Proposition 2. Consider the definitions in (3), (4), (5). Then, assuming a unique top eigenvalue (in terms of absolute value) for Γψ1,θ,ψ2 with left and right eigenvectors u_left, u_right, we will have:

$$\nabla_\theta\, \text{SPEC-diff}(\psi_{1,\theta}, \psi_2) = \nabla_\theta\Bigl(u_{\mathrm{left}}^\top\, \Gamma_{\psi_{1,\theta},\psi_2}\, u_{\mathrm{right}}\Bigr) \qquad (7)$$

Therefore, the above proposition suggests computing the top left and right eigenvectors of the (d1 + d2) × (d1 + d2) differential covariance matrix, which can be done using the power method, and subsequently taking the gradient of the scalar function u_left⊤ Γψ1,θ,ψ2 u_right, which is the absolute value of the mean of the per-sample function values over x1, . . . , xn. This property is especially suitable for applying stochastic gradient methods.
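A minimal PyTorch sketch of this gradient computation is given below, under linear feature-map assumptions; here phi1 is the differentiable feature matrix of the trainable embedding ψ1,θ on a batch, phi2 holds the frozen reference features, and the eigenvectors are estimated by the power method with gradient tracking disabled, so that gradients flow only through the bilinear form in (7). The normalization making u⊤v = 1 is our own choice (the standard eigenvalue perturbation formula); the paper's exact implementation may differ:

```python
import torch

def spec_diff_surrogate(phi1, phi2, n_iter=100):
    """Differentiable surrogate for SPEC-diff via Eq. (7): a sketch, assuming
    phi1 (n, d1) requires grad and phi2 (n, d2) is a frozen reference."""
    n = phi1.shape[0]
    C1, C2 = phi1.T @ phi1 / n, phi2.T @ phi2 / n
    C12 = phi1.T @ phi2 / n
    Gamma = torch.cat([torch.cat([C1, C12], 1),
                       torch.cat([-C12.T, -C2], 1)], 0)   # Eq. (4)
    with torch.no_grad():                 # eigenvectors treated as constants
        G = Gamma.detach()
        v = torch.randn(G.shape[0], dtype=G.dtype)
        u = torch.randn(G.shape[0], dtype=G.dtype)
        for _ in range(n_iter):           # power method; spectrum of Gamma is
            v = G @ v;  v = v / v.norm()  # real, matching the symmetric Lambda
            u = G.T @ u; u = u / u.norm() # left eigenvector via Gamma^T
        u = u / (u @ v)                   # scale so that u^T v = 1
    return torch.abs(u @ (Gamma @ v))     # ~ rho(Gamma) = SPEC-diff

# In a SPEC-align step, per Eq. (6):
# loss = task_loss + beta * spec_diff_surrogate(phi1, phi2); loss.backward()
```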

6. Numerical Results

In this section, we first discuss the experimental settings and then apply the SPEC algorithm to compare different image and text embeddings across various large-scale datasets. Finally, we explore the use of the SPEC-align method to match the sample clusters of the embeddings.

Datasets. In our experiments on image data, we used four datasets: AFHQ (Choi et al., 2020) (15K animal faces in categories of cats, wildlife, and dogs), FFHQ (Karras et al., 2019) (70K human-face images), ImageNet-1K (Deng et al., 2009) (1.4 million images across 1,000 labels), and MS-COCO 2017 (Lin et al., 2015) (≈110K samples of diverse scenes with multiple objects). Additionally, similar to (Materzynska et al., 2022), we created a custom dataset derived from 10 selected classes of ImageNet-1K, where we overlaid text labels directly on the images.

Embeddings. The feature embeddings tested in this study include the image embeddings CLIP (Radford et al., 2021), DINOv2 (Oquab et al., 2024), Inception-V3 (Szegedy et al., 2016), and SwAV (Caron et al., 2021), and the text embeddings RoBERTa (Liu et al., 2020), CLIP (Radford et al., 2021), and E5-V2 (Wang et al., 2023a). All embeddings were extracted using pre-trained models, and standard pre-processing was applied for uniformity across datasets.

Experimental settings. In our experiments, we computed the SPEC differential kernel covariance matrix using m = 2000 independent random Fourier features for a Gaussian kernel. To determine the Gaussian kernel bandwidth σ, we followed the kernel-based evaluation of generative models in (Jalali et al., 2023; Pasarkar & Dieng, 2024) and selected the embeddings' bandwidths such that the difference between the top eigenvalues is less than 0.01. We provide the detailed SPEC algorithm in Algorithm 1. The experiments were performed on two RTX-4090 GPUs.

[Figure 2: Top 3 SPEC-identified clusters for CLIP → DINOv2 (eigenvalues 0.025, 0.023, 0.019) and for DINOv2 → CLIP (eigenvalues 0.034, 0.026, 0.013), with UMAP visualizations of the top 10 clusters per embedding and K-means/SPEC AMI annotations (e.g., DINOv2: 0.94 ± 0.0003 vs. CLIP: 0.65 ± 0.0014, and CLIP: 0.78 ± 0.0006 vs. DINOv2: 0.44 ± 0.0002).]

Figure 2. Comparison of different embeddings on 15K samples from the AFHQ dataset, consisting of 5K cats, 5K wildlife, and 5K dogs. The number at the top of each image represents the eigenvalue of the corresponding SPEC cluster. The last two images in each row show the UMAP representation of the SPEC clusters for each embedding individually.

SPEC comparison of different embeddings. To evaluate SPEC, we compared various image embeddings using the AFHQ dataset. As shown in Figure 2, we employed SPEC for pairwise comparisons to analyze the differences between these embeddings. We reported the top 9 images that correspond to the maximum entries of the top three eigenvectors in the SPEC approach. Subsequently, we found and visualized the top 100 samples (with maximum entries) from each of the top 10 eigenvectors (i.e., SPEC-identified clusters). To confirm whether these samples were clustered by the first embedding and not by the second embedding, we used UMAP plots (McInnes et al., 2018) to validate the differently captured sample groups identified by SPEC for the two embeddings. In the Appendix, we further provide the t-SNE (Van der Maaten & Hinton, 2008) and PaCMAP (Wang et al., 2021) plots of the FFHQ and AFHQ experiments. Also, we analyzed the found clusters using violin plots to visualize normalized distances between data points within each cluster. The plots also suggest that the first embedding can cluster the points more strongly compared to the second embedding. In addition, we ran the K-means clustering algorithm 50 times on each embedding's features and computed the averaged (across the 50 runs) Adjusted Mutual Information (AMI) (Vinh et al., 2009) between the K-means labels and the SPEC-identified labels. The results indicate that the first embedding aligns more strongly with the K-means labels.
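This validation step can be sketched with scikit-learn as follows; features and spec_labels are hypothetical placeholders for an embedding's feature matrix and the SPEC-derived cluster assignments:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def kmeans_spec_ami(features, spec_labels, n_runs=50):
    """Average AMI between K-means labels computed on an embedding's
    features and the SPEC-identified cluster labels."""
    k = len(np.unique(spec_labels))
    scores = [adjusted_mutual_info_score(
                  spec_labels,
                  KMeans(n_clusters=k, n_init=10, random_state=s)
                      .fit_predict(features))
              for s in range(n_runs)]
    return np.mean(scores), np.std(scores)
```

A higher AMI for one embedding indicates that its own K-means partition agrees more closely with the SPEC-identified clusters, as observed for the first embedding in Figure 2.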

Furthermore, to highlight clustering differences between embeddings, we conducted a sanity check on two of the top five SPEC clusters from the DINOv2 → CLIP comparison on AFHQ. We computed the center of the top four images of each cluster in both the DINOv2 and CLIP embeddings. Then, we calculated the cosine similarity between the center and a set of eight test images: four additional images from the same cluster and four random images that do not belong to the cluster. As shown in Figure 12, DINOv2 separates the cluster images well from the random samples, assigning the highest similarity scores to cluster-specific samples while keeping random samples significantly lower. However, in CLIP, some random images rank higher in similarity than the cluster-specific samples. A similar experiment was performed on SPEC clusters from the DINOv2 → CLIP comparison on the FFHQ dataset (Figure 13). Details are in Appendix B.4.

To further evaluate SPEC's performance in comparing embeddings, we apply a typographic attack on CLIP embeddings. As studied by Materzynska et al. (2022), CLIP prioritizes text added to a custom dataset over the image content. We selected 10 classes from the ImageNet-1K dataset and overlaid different text labels directly onto the images. The top four SPEC-identified clusters are presented in Figure 3, where we observe that CLIP clusters are based on the overlaid text, whereas DINOv2 clusters them based on visual features. Additionally, the top 10 principal clusters in CLIP are not well-clustered by DINOv2, and vice versa, demonstrating that SPEC effectively highlights differences between embeddings. We compared CLIP and DINOv2 embeddings under the same settings on the ImageWoof dataset (Howard, 2019), which consists of various dog breeds from ImageNet-1K. The SPEC principal clusters show that DINOv2 primarily clusters images based on dog breeds, whereas CLIP groups them based on the animals' gestures. Additional details are provided in Figure 11 of Appendix B.3.

[Figure 3: Top 4 SPEC principal clusters for CLIP → DINOv2 (eigenvalues 0.030, 0.021, 0.016, 0.013) and for DINOv2 → CLIP (eigenvalues 0.034, 0.024, 0.016, 0.011), with UMAP visualizations, normalized intra-cluster distances, and K-means/SPEC AMI annotations (CLIP: 0.68 ± 0.0005 vs. DINOv2: 0.16 ± 0.0001, and DINOv2: 0.96 ± 0.0005 vs. CLIP: 0.33 ± 0.0016).]

Figure 3. Top 4 SPEC-identified clusters comparing CLIP and DINOv2 embeddings on 10 ImageNet classes with overlaid text labels. The last row shows the UMAP representation of the top 10 SPEC-identified clusters for each embedding.

SPEC comparison of embeddings on different image and text datasets. To check the performance of SPEC on text embeddings, we generated 10K samples from GPT-4o across different categories, including profession, object, gender, and emotion. We compared the CLIP and RoBERTa text embeddings in Figure 4 and observed that the top four clusters in CLIP focus on the objects in the sentences, while RoBERTa clusters based on profession and gender. We also noted from the UMAP visualization of the samples that the top clusters of each embedding are not well clustered by the other, indicating that they focus on different aspects of the sentences. We also compared CLIP with E5 and observed the same results in Figure 9 of Appendix B.2. In Figure 14, we compare different image embeddings on the MS-COCO 2017 training set with 120K samples, which we discuss in Appendix B.3.

[Figure 4: Top 3 SPEC principal clusters for CLIP → RoBERTa (eigenvalues 0.026, 0.014, 0.009; sentences grouped by the mentioned object, e.g., a camera, an airplane, or a frying pan) and for RoBERTa → CLIP (eigenvalues 0.022, 0.015, 0.009; sentences grouped by profession and gender, e.g., female firefighters, male carpenters, and male artists), with UMAP visualizations of the top 10 clusters.]

Figure 4. Top 4 SPEC-identified clusters by comparing CLIP and RoBERTa text embeddings on a dataset generated from GPT-4o.

Aligning embeddings using SPEC-align. In this section, we discuss how to use the SPEC-align method to align the differential kernel covariance of two embeddings. As observed in the comparison of CLIP and DINOv2 in Figures 3 and 11, DINOv2 successfully captures certain clusters that CLIP fails to distinguish. To enhance CLIP's performance in these tasks, we aligned CLIP with DINOv2 using the ImageNet training set. Specifically, we incorporated an alignment term into the CLIP loss function, as formulated in (6), and computed the gradient using (7). The learning parameters are detailed in Appendix B.5.

We provide the kernel matrices for the four clusters in this experiment in Figure 5, corresponding to the results in Figure 3. Notably, the SPEC-aligned CLIP kernel captures the top four clusters based on image content rather than the overlaid text labels. The clusters and their UMAP visualizations are shown in Figure 21.

[Figure 5: CLIP kernel matrix, SPEC-align CLIP kernel matrix, and DINOv2 kernel matrix.]

Figure 5. Comparison of kernel matrices after using SPEC-align to match the sample clusters of CLIP to DINOv2.

To further evaluate SPEC-align's performance, we conducted an experiment similar to (Oquab et al., 2024), where feature quality is assessed by training a simple classifier on a frozen backbone without fine-tuning its weights. In this setting, SPEC-align CLIP achieved 73.93% top-1 accuracy on ImageNet-1K, outperforming the standard CLIP model, which reached 67.20%. For reference, DINOv2 achieved 78.99% on the same task, indicating that SPEC-align brings CLIP substantially closer to DINOv2 performance.
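The frozen-backbone evaluation protocol used here can be sketched as follows, with a logistic-regression probe standing in for the simple classifier (variable names are hypothetical; the paper does not specify the probe's exact form):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen embedding features and report
    top-1 accuracy; the backbone itself is never updated."""
    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return (clf.predict(test_feats) == test_labels).mean()
```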

7. Conclusion

In this paper, we proposed the spectral SPEC approach to the comparison of embedding maps. The SPEC method aims to identify groups of samples clustered by one embedding model that are not grouped by another model. We formulated a scalable algorithm with O(n) computations to apply SPEC to a dataset of size n. We also discussed the application of SPEC for measuring the mismatches of two embeddings and for their alignment. We note that the SPEC approach operates based on the assumption that the differently clustered samples can be detected by the spectral method. Extending the clustering-based approach to non-spectral clustering frameworks will be interesting for future exploration. In addition, extending the framework to compare cross-modal embeddings such as CLIP, BLIP, and ALIGN will be a future direction of this work.

Acknowledgments

The work of Farzan Farnia is partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China, Project 14209920, and is partially supported by CUHK Direct Research Grants with CUHK Project No. 4055164 and 4937054. Finally, the authors sincerely thank the anonymous reviewers for their useful feedback and constructive suggestions.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., and Goldberg, Y. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations, 2017.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022.

Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 04 2022. ISSN 0891-2017. doi: 10.1162/coli_a_00422.

Bellagente, M., Brack, M., Teufel, H., Friedrich, F., Deiseroth, B., Eichenberg, C., Dai, A. M., Baldock, R., Nanda, S., Oostermeijer, K., Cruz-Salinas, A. F., Schramowski, P., Kersting, K., and Weinbach, S. Multifusion: Fusing pre-trained models for multi-lingual, multi-modal image generation. In Advances in Neural Information Processing Systems, volume 36, pp. 59502–59521. Curran Associates, Inc., 2023.

Bengio, Y., Vincent, P., Paiement, J.-F., Delalleau, O., Ouimet, M., and Le Roux, N. Spectral clustering and kernel PCA are learning eigenfunctions, volume 1239. Citeseer, 2003a.

Bengio, Y., Vincent, P., Paiement, J.-F., Delalleau, O., Ouimet, M., and LeRoux, N. Learning eigenfunctions of similarity: linking spectral clustering and kernel PCA. Technical Report 1232, Département d'Informatique et Recherche Opérationnelle ..., 2003b.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.

Boggust, A., Carter, B., and Satyanarayan, A. Embedding comparator: Visualizing differences in global structure and local neighborhoods via small multiples. In Proceedings of the 27th International Conference on Intelligent User Interfaces, IUI '22, pp. 746–766, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450391443. doi: 10.1145/3490099.3511122. URL https://doi.org/10.1145/3490099.3511122.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments, 2021.

Chen, Y., Perozzi, B., Al-Rfou, R., and Skiena, S. The expressive power of word embeddings, 2013. URL https://arxiv.org/abs/1301.3226.

Chitta, R., Jin, R., and Jain, A. K. Efficient kernel clustering using random fourier features. In 2012 IEEE 12th International Conference on Data Mining, pp. 161–170. IEEE, 2012.

Choi, H., Kim, J., Joe, S., and Gwon, Y. Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5482–5487, 2021. doi: 10.1109/ICPR48806.2021.9412102.

Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Dar, G., Geva, M., Gupta, A., and Berant, J. Analyzing transformers in embedding space. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16124–16170, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.893. URL https://aclanthology.org/2023.acl-long.893/.

Darrin, M., Formont, P., Ayed, I. B., Cheung, J. C., and Piantanida, P. When is an embedding model more promising than another? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=VqFz7iTGcl.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

Eslami, S. and de Melo, G. Mitigate the gap: Improving cross-modal alignment in CLIP. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=aPTGvFqile.

Gao, T., Yao, X., and Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL https://aclanthology.org/2021.emnlp-main.552/.

Gedon, D., Ribeiro, A. H., Wahlström, N., and Schön, T. B. Invertible kernel PCA with random fourier features. IEEE Signal Processing Letters, 30:563–567, 2023.

Ghashami, M., Perry, D. J., and Phillips, J. Streaming kernel principal component analysis. In Artificial intelligence and statistics, pp. 1365–1374. PMLR, 2016.

Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., and Misra, I. Imagebind: One embedding space to bind them all. In CVPR, 2023.

Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with wasserstein procrustes. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 1880–1890. PMLR, 16–18 Apr 2019. URL https://proceedings.mlr.press/v89/grave19a.html.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.

Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., and Yue, X. Onellm: One framework to align all modalities with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

Howard, J. ImageWoof: a subset of 10 classes from imagenet that aren't so easy to classify, March 2019. URL https://github.com/fastai/imagenette#imagewoof.

Jalali, M., Li, C. T., and Farnia, F. An information-theoretic evaluation of generative models in learning multi-modal distributions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=PdZhf6PiAb.

Jalali, M., Ospanov, A., Gohari, A., and Farnia, F. Conditional Vendi Score: An information-theoretic approach to diversity evaluation of prompt-based generative models. arXiv preprint arXiv:2411.02817, 2024. URL https://arxiv.org/abs/2411.02817.

Jiralerspong, M., Bose, J., Gemp, I., Qin, C., Bachrach, Y., and Gidel, G. Feature likelihood score: Evaluating the generalization of generative models using samples. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=l2VKZkolT7.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4396–4405, 2019. doi: 10.1109/CVPR.2019.00453.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pp. 1895–1904. PMLR, 2017.

Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., and Lehtinen, J. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4oXTQ6m_ws8.

Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.

Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. Microsoft COCO: Common objects in context, 2015.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pp. 34892–34916. Curran Associates, Inc., 2023.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2020. URL https://openreview.net/forum?id=SyxS0T4tvS.

Lu, S., Li, Y., Chen, Q.-G., Xu, Z., Luo, W., Zhang, K., and Ye, H.-J. Ovis: Structural embedding alignment for multimodal large language model. arXiv:2405.20797, 2024.

Materzynska, J., Torralba, A., and Bau, D. Disentangling visual and written concepts in clip. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

McInnes, L., Healy, J., and Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. MTEB: Massive text embedding benchmark. In Vlachos, A. and Augenstein, I. (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148/.

Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Dietterich, T., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. URL https://proceedings.neurips.cc/paper_files/paper/2001/file/801272ee79cfde7fa5960571fee36b9b-Paper.pdf.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. Dinov2: Learning robust visual features without supervision, 2024.

Ospanov, A. and Farnia, F. On the statistical complexity of estimating Vendi Scores from empirical data, 2024. URL https://arxiv.org/abs/2410.21719.

Ospanov, A., Zhang, J., Jalali, M., Cao, X., Bogdanov, A., and Farnia, F. Towards a scalable reference-free evaluation of generative models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Ex3rPvEct8.

Pasarkar, A. P. and Dieng, A. B. Cousins of the vendi score: A family of similarity-based diversity metrics for science and machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 3808–3816. PMLR, 2024.

Perone, C. S., Silveira, R., and Paula, T. S. Evaluation of sentence embeddings in downstream and linguistic probing tasks. ArXiv, abs/1806.06259, 2018. URL https://api.semanticscholar.org/CorpusID:49306018.

Pimentel, T., Valvoda, J., Maudslay, R. H., Zmigrod, R., Williams, A., and Cotterell, R. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4609–4622, Online, July 2020. Association for Computational Linguistics.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007a.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007b.

Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics, November 2019. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410/.

Rogers, A., Kovaleva, O., and Rumshisky, A. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 01 2021. ISSN 2307-387X.

Salman, S., Shams, M. M. B., and Liu, X. Unaligning everything: Or aligning any text to any image in multimodal models. arXiv preprint arXiv:2407.01157, 2024.

Santos, J., Consoli, B., and Vieira, R. Word embedding evaluation in downstream tasks and semantic analogies. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 4828–4834, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.594/.

France, May 2020. European Language Resources As- Wang, Z., Zhao, Y., Cheng, X., Huang, H., Liu, J., Tang,
sociation. ISBN 979-10-95546-34-4. URL https: L., Li, L., Wang, Y., Yin, A., Zhang, Z., and Zhao, Z.
//aclanthology.org/2020.lrec-1.594/. Connecting multi-modal contrastive representations. In
Proceedings of the 37th International Conference on Neu-
Schölkopf, B., Smola, A., and Müller, K.-R. Nonlinear ral Information Processing Systems, NIPS ’23. Curran
Component Analysis as a Kernel Eigenvalue Problem. Associates Inc., 2023b.
Neural Computation, 10(5):1299–1319, July 1998. ISSN
0899-7667. doi: 10.1162/089976698300017467. URL Ye, Y., Xiao, S., Zeng, X., and Zeng, W. Modalchorus: Vi-
https://ieeexplore.ieee.org/document/ sual probing and alignment of multi-modal embeddings
6790375. Conference Name: Neural Computation. via modal fusion map. IEEE Transactions on Visualiza-
tion and Computer Graphics, 2024.
Simhi, A. and Markovitch, S. Interpreting embedding
spaces by conceptualization. In The 2023 Conference Yin, Y., Zhao, Y., Zhang, Y., Lin, K., Wang, J., Tao, X.,
on Empirical Methods in Natural Language Processing, Wan, P., Zhang, D., Yin, B., and Zhang, W. Sea: Super-
2023. URL https://openreview.net/forum? vised embedding alignment for token-level visual-textual
id=sPpft5DQJN. integration in mllms. arXiv preprint arXiv:2408.11813,
2024.
Sriperumbudur, B. K. and Sterge, N. Approximate kernel
PCA: Computational versus statistical trade-off. The Zhang, J., Li, C. T., and Farnia, F. An interpretable evalu-
Annals of Statistics, 50(5):2713–2736, 2022. ation of entropy-based novelty of generative models. In
Proceedings of the 41st International Conference on Ma-
Stein, G., Cresswell, J., Hosseinzadeh, R., Sui, Y., Ross, B., chine Learning, volume 235 of Proceedings of Machine
Villecroze, V., Liu, Z., Caterini, A. L., Taylor, E., and Learning Research, pp. 59148–59172. PMLR, 21–27 Jul
Loaiza-Ganem, G. Exposing flaws of generative model 2024.
evaluation metrics and their unfair treatment of diffusion
models. In Advances in Neural Information Processing Zhang, J., Jalali, M., Li, C. T., and Farnia, F. Unveiling
Systems, volume 36, pp. 3732–3784. Curran Associates, differences in generative models: A scalable differential
Inc., 2023. clustering approach. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, (CVPR), 2025. URL https://arxiv.org/abs/
Z. Rethinking the inception architecture for computer vi- 2405.02700.
sion. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2818–2826, 2016.

Ullah, E., Mianjy, P., Marinov,


√ T. V., and Arora, R. Stream-
ing kernel PCA with o( n) random features. Advances
in Neural Information Processing Systems, 31, 2018.

Van der Maaten, L. and Hinton, G. Visualizing data using


t-sne. Journal of machine learning research, 9(11), 2008.

Vinh, N. X., Epps, J., and Bailey, J. Information theoretic


measures for clusterings comparison: is a correction for
chance necessary? In Proceedings of the 26th annual
international conference on machine learning, pp. 1073–
1080, 2009.

Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and
Wei, F. Improving text embeddings with large language
models. arXiv preprint arXiv:2401.00368, 2023a.

Wang, Y., Huang, H., Rudin, C., and Shaposhnik, Y. Un-


derstanding how dimension reduction tools work: An
empirical approach to deciphering t-sne, umap, trimap,
and pacmap for data visualization. Journal of Machine
Learning Research, 22(201):1–73, 2021. URL http:
//jmlr.org/papers/v22/20-1061.html.


A. Proofs
A.1. Proof of Theorem 1
For simplicity, we adopt the following notations in this proof:
$$K^{(1)}_{11} := \frac{1}{n}K_{\psi_1}[I,I], \quad K^{(1)}_{12} := \frac{1}{n}K_{\psi_1}[I,I^c], \quad K^{(1)}_{22} := \frac{1}{n}K_{\psi_1}[I^c,I^c]$$
$$K^{(2)}_{11} := \frac{1}{n}K_{\psi_2}[I,I], \quad K^{(2)}_{12} := \frac{1}{n}K_{\psi_2}[I,I^c], \quad K^{(2)}_{22} := \frac{1}{n}K_{\psi_2}[I^c,I^c]$$
$$\Lambda_{11} := \Lambda_{\psi_1,\psi_2}[I,I], \quad \Lambda_{12} := \Lambda_{\psi_1,\psi_2}[I,I^c], \quad \Lambda_{22} := \Lambda_{\psi_1,\psi_2}[I^c,I^c]$$

As a result, the following holds by definition:

$$\Lambda_{11} = K^{(1)}_{11} - K^{(2)}_{11}, \quad \Lambda_{12} = K^{(1)}_{12} - K^{(2)}_{12}, \quad \Lambda_{22} = K^{(1)}_{22} - K^{(2)}_{22}$$
According to Condition 1, we have the Frobenius norm bound $\|K^{(1)}_{12}\|_F \le \epsilon_1$. Also, due to Condition 2, we have the $\ell_2$-operator norm bound $\|K^{(2)}_{11}\|_2 \le \epsilon_2$. Note that the matrix $\frac{1}{n}K_{\psi_2}$ is positive semi-definite (PSD). Therefore, the Schur complement of its block representation following the indices in $I$ and $I^c = \{1,\ldots,n\} - I$ must be a PSD matrix, i.e.,

$$K^{(2)}_{22} - K^{(2)\top}_{12}\, K^{(2)\,-1}_{11}\, K^{(2)}_{12} \succeq 0.$$

Therefore, the above Schur complement has a non-negative trace, implying that

$$\mathrm{Tr}\Bigl(K^{(2)}_{22} - K^{(2)\top}_{12}\, K^{(2)\,-1}_{11}\, K^{(2)}_{12}\Bigr) \ge 0 \;\Longrightarrow\; \mathrm{Tr}\Bigl(K^{(2)\top}_{12}\, K^{(2)\,-1}_{11}\, K^{(2)}_{12}\Bigr) \le \mathrm{Tr}\bigl(K^{(2)}_{22}\bigr).$$

Therefore, we will have the following:

$$\begin{aligned}
1 = \mathrm{Tr}\Bigl(\frac{1}{n}K_{\psi_2}\Bigr) &\ge \mathrm{Tr}\bigl(K^{(2)}_{22}\bigr) \\
&\ge \mathrm{Tr}\Bigl(K^{(2)\top}_{12}\, K^{(2)\,-1}_{11}\, K^{(2)}_{12}\Bigr) \\
&\ge \mathrm{Tr}\Bigl(K^{(2)\top}_{12}\, \frac{1}{\lambda_{\max}\bigl(K^{(2)}_{11}\bigr)}\, I \, K^{(2)}_{12}\Bigr) \\
&= \frac{1}{\lambda_{\max}\bigl(K^{(2)}_{11}\bigr)}\, \mathrm{Tr}\Bigl(K^{(2)\top}_{12} K^{(2)}_{12}\Bigr) \\
&= \frac{1}{\bigl\|K^{(2)}_{11}\bigr\|_2}\, \bigl\|K^{(2)}_{12}\bigr\|_F^2.
\end{aligned}$$

The above means that Condition 2 implies

$$\bigl\|K^{(2)}_{12}\bigr\|_F^2 \le \bigl\|K^{(2)}_{11}\bigr\|_2 \le \epsilon_2.$$

As a result, we can apply Young's inequality to show that

$$\|\Lambda_{12}\|_F^2 \le 2\Bigl(\bigl\|K^{(1)}_{12}\bigr\|_F^2 + \bigl\|K^{(2)}_{12}\bigr\|_F^2\Bigr) \le 2(\epsilon_1^2 + \epsilon_2).$$

Therefore, defining the block-diagonal matrix $\widetilde{\Lambda}_{\psi_1,\psi_2} := \begin{bmatrix} \Lambda_{11} & 0 \\ 0 & \Lambda_{22} \end{bmatrix}$, we will have the following:

$$\Bigl\|\Lambda_{\psi_1,\psi_2} - \widetilde{\Lambda}_{\psi_1,\psi_2}\Bigr\|_F^2 = 2\,\|\Lambda_{12}\|_F^2 \le 4(\epsilon_1^2 + \epsilon_2).$$


Now, since $\Lambda_{\psi_1,\psi_2}$ is a symmetric matrix, we can apply the spectral decomposition to write it as $\Lambda_{\psi_1,\psi_2} = V\,\mathrm{diag}(\lambda)\,V^\top$, where every column $v_i$ of matrix $V$ is an eigenvector of $\Lambda_{\psi_1,\psi_2}$ with corresponding eigenvalue $\lambda_i$, sorted as $\lambda_1 \ge \cdots \ge \lambda_n$. Note that the eigenvalues are real and sum up to 0, as the trace of $\Lambda_{\psi_1,\psi_2}$ is zero. Then, using the identity $\sum_{i=1}^n v_i v_i^\top = I$, we can write:

$$\begin{aligned}
\sum_{i=1}^n \bigl\|\widetilde{\Lambda}_{\psi_1,\psi_2} v_i - \Lambda_{\psi_1,\psi_2} v_i\bigr\|_2^2
&= \sum_{i=1}^n \bigl\|\bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) v_i\bigr\|_2^2 \\
&= \sum_{i=1}^n v_i^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) v_i \\
&= \sum_{i=1}^n \mathrm{Tr}\Bigl( v_i v_i^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)\Bigr) \\
&= \mathrm{Tr}\Bigl( \Bigl(\sum_{i=1}^n v_i v_i^\top\Bigr) \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)\Bigr) \\
&= \bigl\|\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr\|_F^2 \\
&\le 4\bigl(\epsilon_1^2 + \epsilon_2\bigr)
\end{aligned}$$


As a result, we can write

$$\begin{aligned}
4\bigl(\epsilon_1^2 + \epsilon_2\bigr) &\ge \sum_{i=1}^n \bigl\|\widetilde{\Lambda}_{\psi_1,\psi_2} v_i - \Lambda_{\psi_1,\psi_2} v_i\bigr\|_2^2 \\
&= \sum_{i=1}^n \bigl\|\widetilde{\Lambda}_{\psi_1,\psi_2} v_i - \lambda_i v_i\bigr\|_2^2 \\
&= \sum_{i=1}^n \Bigl( \bigl\|\Lambda_{11} v_i[I] - \lambda_i v_i[I]\bigr\|_2^2 + \bigl\|\Lambda_{22} v_i[I^c] - \lambda_i v_i[I^c]\bigr\|_2^2 \Bigr) \\
&\ge \sum_{i=1}^n \Bigl( \bigl(\lambda_i - \lambda_i^I\bigr)^2 \bigl\|v_i[I]\bigr\|_2^2 + \bigl(\lambda_i - \lambda_i^{I^c}\bigr)^2 \bigl\|v_i[I^c]\bigr\|_2^2 \Bigr).
\end{aligned}$$

In the above, the last inequality holds since for every symmetric matrix $A$ and vector $v$, we have $\|Av - \lambda v\|_2 \ge |\lambda_j - \lambda|\,\|v\|_2$, where $\lambda_j$ is the eigenvalue of $A$ with the minimum absolute difference $|\lambda_j - \lambda|$. Therefore, the proof of Theorem 1 is complete.
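As a quick numerical sanity check of the Frobenius-norm step above (not part of the proof), the following NumPy sketch verifies the bound $\|\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\|_F^2 \le 4(\epsilon_1^2 + \epsilon_2)$ on random normalized features; the feature construction and the index split are illustrative assumptions, with $\epsilon_1$, $\epsilon_2$ set to the realized Condition-1 and Condition-2 quantities so that both conditions hold:

```python
# Sanity check of ||Lam_tilde - Lam||_F^2 <= 4 (eps1^2 + eps2) on random data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
X1 = rng.normal(size=(n, d))
X2 = rng.normal(size=(n, d))
X1 /= np.linalg.norm(X1, axis=1, keepdims=True)  # normalize so k(x, x) = 1
X2 /= np.linalg.norm(X2, axis=1, keepdims=True)
K1, K2 = X1 @ X1.T / n, X2 @ X2.T / n            # (1/n)-scaled kernel matrices
Lam = K1 - K2                                    # difference kernel Lambda
I = np.arange(n // 2)                            # an arbitrary index subset I
Ic = np.arange(n // 2, n)                        # its complement I^c
eps1 = np.linalg.norm(K1[np.ix_(I, Ic)], "fro")  # Condition-1 quantity eps1
eps2 = np.linalg.norm(K2[np.ix_(I, I)], 2)       # Condition-2 quantity eps2
Lam_tilde = np.zeros_like(Lam)                   # block-diagonal approximation
Lam_tilde[np.ix_(I, I)] = Lam[np.ix_(I, I)]
Lam_tilde[np.ix_(Ic, Ic)] = Lam[np.ix_(Ic, Ic)]
lhs = np.linalg.norm(Lam_tilde - Lam, "fro") ** 2
print(lhs, "<=", 4 * (eps1**2 + eps2), lhs <= 4 * (eps1**2 + eps2))
```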

A.2. Proof of Proposition 1


Note that we can write $\Lambda_{\psi_1,\psi_2}$ using the following matrix multiplication:

$$\Lambda_{\psi_1,\psi_2} = \frac{1}{n}\begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix}\begin{bmatrix} \Phi_{\psi_1}^\top \\ -\Phi_{\psi_2}^\top \end{bmatrix}$$

In the above, we define $\Phi_{\psi_1} \in \mathbb{R}^{n\times d_1}$ to be the embedding of the dataset $x_1,\ldots,x_n$ with embedding map $\psi_1$, i.e., its $i$th row is $\phi_1(\psi_1(x_i))$, and similarly we let $\Phi_{\psi_2} \in \mathbb{R}^{n\times d_2}$ be the embedding of the dataset with embedding map $\psi_2$, with its $i$th row being $\phi_2(\psi_2(x_i))$. Therefore, if we define $A = \begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix}$ and $B = \frac{1}{n}\begin{bmatrix} \Phi_{\psi_1}^\top \\ -\Phi_{\psi_2}^\top \end{bmatrix}$, then we will have $\Lambda_{\psi_1,\psi_2} = AB$.
On the other hand, we know that for every matrix $A \in \mathbb{R}^{n\times(d_1+d_2)}$ and $B \in \mathbb{R}^{(d_1+d_2)\times n}$, $AB$ and $BA$ share the same non-zero eigenvalues. In this case, the matrix $BA$, sharing the non-zero eigenvalues with $\Lambda_{\psi_1,\psi_2} = AB$, can be calculated as

$$BA = \frac{1}{n}\begin{bmatrix} \Phi_{\psi_1}^\top \\ -\Phi_{\psi_2}^\top \end{bmatrix}\begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix} = \begin{bmatrix} \frac{1}{n}\Phi_{\psi_1}^\top\Phi_{\psi_1} & \frac{1}{n}\Phi_{\psi_1}^\top\Phi_{\psi_2} \\[2pt] -\frac{1}{n}\Phi_{\psi_2}^\top\Phi_{\psi_1} & -\frac{1}{n}\Phi_{\psi_2}^\top\Phi_{\psi_2} \end{bmatrix} = \begin{bmatrix} C_{\psi_1} & C_{\psi_1,\psi_2} \\ -C_{\psi_1,\psi_2}^\top & -C_{\psi_2} \end{bmatrix} = \Gamma_{\psi_1,\psi_2}.$$

In addition, for every eigenvector $v$ (corresponding to a non-zero eigenvalue) of $\Gamma_{\psi_1,\psi_2} = BA$, we have that $u = Av$ is an eigenvector of $\Lambda_{\psi_1,\psi_2} = AB$, which is

$$u = \begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix} v = \begin{bmatrix} \phi_1(\psi_1(x_1))^\top & \phi_2(\psi_2(x_1))^\top \\ \vdots & \vdots \\ \phi_1(\psi_1(x_n))^\top & \phi_2(\psi_2(x_n))^\top \end{bmatrix} v$$

Therefore, the proof is complete.
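As an illustration of this argument, the following minimal NumPy sketch (with random feature matrices standing in for the embedded data) checks that the $n \times n$ matrix $\Lambda_{\psi_1,\psi_2}$ and the $(d_1+d_2) \times (d_1+d_2)$ matrix $\Gamma_{\psi_1,\psi_2}$ share their non-zero eigenvalues:

```python
# Numerical check of Proposition 1 on random features (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2 = 100, 8, 12
Phi1 = rng.normal(size=(n, d1))      # ith row plays the role of phi1(psi1(xi))
Phi2 = rng.normal(size=(n, d2))      # ith row plays the role of phi2(psi2(xi))
Lam = (Phi1 @ Phi1.T - Phi2 @ Phi2.T) / n
Gamma = np.block([
    [Phi1.T @ Phi1 / n, Phi1.T @ Phi2 / n],
    [-(Phi1.T @ Phi2 / n).T, -(Phi2.T @ Phi2 / n)],
])
ev_lam = np.linalg.eigvalsh(Lam)            # real spectrum of symmetric Lambda
ev_gam = np.real(np.linalg.eigvals(Gamma))  # real by Proposition 1
nonzero = ev_lam[np.abs(ev_lam) > 1e-9]     # generically d1 + d2 of them
print(np.allclose(np.sort(nonzero), np.sort(ev_gam), atol=1e-8))
```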

A.3. Proof of Theorem 2


As stated in the theorem, we consider independent random Fourier features $\omega_1,\ldots,\omega_m \sim \widehat{\kappa}$, where the proxy feature map is:

$$\widehat{\phi}(x) = \frac{1}{\sqrt{m}}\Bigl[\cos(\omega_1^\top x),\; \sin(\omega_1^\top x),\; \ldots,\; \cos(\omega_m^\top x),\; \sin(\omega_m^\top x)\Bigr]$$

Based on the assumption, the shift-invariant kernel $k(x,y) = \kappa(x-y)$ is normalized, with $k(x,x) = 1$ for every $x \in \mathcal{X}$. Therefore, the Fourier synthesis equation gives $k(x,y) = \mathbb{E}_{\omega\sim\widehat{\kappa}}\bigl[\cos(\omega^\top(x-y))\bigr] = \mathbb{E}_{\omega\sim\widehat{\kappa}}\bigl[\cos(\omega^\top x)\cos(\omega^\top y) + \sin(\omega^\top x)\sin(\omega^\top y)\bigr]$, and the RFF-proxy kernel function can be viewed as

$$\widehat{k}(x,y) = \frac{1}{m}\sum_{i=1}^m \cos\bigl(\omega_i^\top(x-y)\bigr).$$

As a result, if we consider the kernel matrix $K_{\psi_1,\omega_i}$, where $k_{\psi_1,\omega_i}(x,y) = \cos\bigl(\omega_i^\top(\psi_1(x)-\psi_1(y))\bigr)$, we can simplify the proxy kernel matrix as

$$\frac{1}{n}\widehat{K}_{\psi_1} = \frac{1}{m}\sum_{i=1}^m \frac{1}{n}K_{\psi_1,\omega_i}$$

where we note that $\mathbb{E}_{\omega_i\sim\widehat{\kappa}}\bigl[\frac{1}{n}K_{\psi_1,\omega_i}\bigr] = \frac{1}{n}K_{\psi_1}$, as $\omega$ is drawn from the Fourier transform $\widehat{\kappa}$. Also, $\bigl\|\frac{1}{n}K_{\psi_1,\omega_i}\bigr\|_F \le 1$ holds, because the kernel function is assumed to be normalized and $|k_{\psi_1}(x,y)| \le 1$ for every $x, y$.
Noting that the Frobenius norm $\|\cdot\|_F$ can be written as the Euclidean norm of the vectorized matrix, the application of the vector Bernstein inequality (Gross, 2011; Kohler & Lucchi, 2017) proves for any $0 \le \epsilon \le 2$:

$$\mathbb{P}\Bigl(\Bigl\|\frac{1}{m}\sum_{i=1}^m \frac{1}{n}K_{\psi_1,\omega_i} - \frac{1}{n}K_{\psi_1}\Bigr\|_F \ge \epsilon\Bigr) \le \exp\Bigl(\frac{8 - m\epsilon^2}{32}\Bigr).$$

Therefore, we will have

$$\mathbb{P}\Bigl(\Bigl\|\frac{1}{n}\widehat{K}_{\psi_1} - \frac{1}{n}K_{\psi_1}\Bigr\|_F \ge \epsilon\Bigr) \le \exp\Bigl(\frac{8 - m\epsilon^2}{32}\Bigr)$$


Similarly, we can show that for the embedding $\psi_2$ we will have

$$\mathbb{P}\Bigl(\Bigl\|\frac{1}{n}\widehat{K}_{\psi_2} - \frac{1}{n}K_{\psi_2}\Bigr\|_F \ge \epsilon\Bigr) \le \exp\Bigl(\frac{8 - m\epsilon^2}{32}\Bigr)$$
Next, we note that, since $\widehat{\Lambda}_{\psi_1,\psi_2}\widehat{v}_i = \widehat{\lambda}_i\widehat{v}_i$ and $\sum_{i=1}^n \widehat{v}_i\widehat{v}_i^\top = I$,

$$\begin{aligned}
\sum_{i=1}^n \bigl\|\Lambda_{\psi_1,\psi_2}\widehat{v}_i - \widehat{\lambda}_i \widehat{v}_i\bigr\|_2^2
&= \sum_{i=1}^n \bigl\|\Lambda_{\psi_1,\psi_2}\widehat{v}_i - \widehat{\Lambda}_{\psi_1,\psi_2}\widehat{v}_i\bigr\|_2^2 \\
&= \sum_{i=1}^n \bigl\|\bigl(\Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2}\bigr)\widehat{v}_i\bigr\|_2^2 \\
&= \sum_{i=1}^n \mathrm{Tr}\Bigl(\widehat{v}_i\widehat{v}_i^\top \bigl(\Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2}\bigr)^\top\bigl(\Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2}\bigr)\Bigr) \\
&= \mathrm{Tr}\Bigl(\bigl(\Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2}\bigr)^\top\bigl(\Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2}\bigr)\Bigr) \\
&= \bigl\|\Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2}\bigr\|_F^2 \\
&\le 2\Bigl\|\frac{1}{n}K_{\psi_1} - \frac{1}{n}\widehat{K}_{\psi_1}\Bigr\|_F^2 + 2\Bigl\|\frac{1}{n}K_{\psi_2} - \frac{1}{n}\widehat{K}_{\psi_2}\Bigr\|_F^2
\end{aligned}$$

The last line in the above inequalities follows from Young's inequality, showing that $\|A+B\|_F^2 \le 2\|A\|_F^2 + 2\|B\|_F^2$. Then, setting $\delta = 2\exp\bigl(\frac{8-m\epsilon^2}{32}\bigr)$, implying $\epsilon = \sqrt{\frac{32\log(2e^{1/4}/\delta)}{m}}$, we will have

$$\mathbb{P}\Bigl(\Bigl\|\frac{1}{n}\widehat{K}_{\psi_1} - \frac{1}{n}K_{\psi_1}\Bigr\|_F \le \sqrt{\tfrac{32\log(2e^{1/4}/\delta)}{m}}\Bigr) \ge 1 - \frac{\delta}{2}, \qquad \mathbb{P}\Bigl(\Bigl\|\frac{1}{n}\widehat{K}_{\psi_2} - \frac{1}{n}K_{\psi_2}\Bigr\|_F \le \sqrt{\tfrac{32\log(2e^{1/4}/\delta)}{m}}\Bigr) \ge 1 - \frac{\delta}{2}$$

where by applying the union bound we can show

$$\mathbb{P}\Bigl(\Bigl\|\frac{1}{n}\widehat{K}_{\psi_1} - \frac{1}{n}K_{\psi_1}\Bigr\|_F \le \sqrt{\tfrac{32\log(2e^{1/4}/\delta)}{m}} \;\text{ and }\; \Bigl\|\frac{1}{n}\widehat{K}_{\psi_2} - \frac{1}{n}K_{\psi_2}\Bigr\|_F \le \sqrt{\tfrac{32\log(2e^{1/4}/\delta)}{m}}\Bigr) \ge 1 - \delta$$

which shows that

$$\mathbb{P}\Bigl(\sum_{i=1}^n \bigl\|\Lambda_{\psi_1,\psi_2}\widehat{v}_i - \widehat{\lambda}_i\widehat{v}_i\bigr\|_2^2 \le \frac{128\log(2e^{1/4}/\delta)}{m}\Bigr) \ge 1 - \delta$$

The above completes the theorem’s proof.
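For intuition, the following minimal sketch (assuming a Gaussian kernel, for which $\widehat{\kappa}$ is itself Gaussian, and illustrative random data) shows the Frobenius error of the RFF-proxy kernel shrinking roughly at the $1/\sqrt{m}$ rate implied by the bound above:

```python
# Monte-Carlo behaviour of the RFF proxy kernel for a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 150, 10, 1.0
X = rng.normal(size=(n, d))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma**2)) / n             # exact (1/n)-scaled kernel
for m in (100, 1_000, 10_000):
    W = rng.normal(scale=1.0 / sigma, size=(m, d))  # omega_i ~ N(0, sigma^-2 I)
    Z = X @ W.T                                  # inner products omega_i^T x
    feats = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)  # rows are phi_hat(x)
    K_hat = feats @ feats.T / n                  # (1/n)-scaled proxy kernel
    print(m, np.linalg.norm(K_hat - K, "fro"))   # shrinks roughly like 1/sqrt(m)
```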

A.4. Proof of Proposition 2


To show this statement, we leverage the fact that the eigenvalues of the matrix $\Gamma_{\psi_{1,\theta},\psi_2}$ are real, as they are shared with the symmetric matrix $\Lambda_{\psi_{1,\theta},\psi_2}$. Then, we can leverage the Jordan canonical form to write the matrix $\Gamma_{\psi_{1,\theta},\psi_2}$ as follows:

$$\Gamma_{\psi_{1,\theta},\psi_2} = UJU^{-1}$$

In the above, the matrix $U \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}$ includes the right generalized eigenvectors of $\Gamma_{\psi_{1,\theta},\psi_2}$ as its columns, and $U^{-1}$ includes the left generalized eigenvectors of $\Gamma_{\psi_{1,\theta},\psi_2}$ in its rows. Also, $J$ is the Jordan normal form containing one block matrix on the diagonal for every eigenvalue. Assuming that the eigenvalue with the maximum absolute value (which determines the spectral radius) has multiplicity 1, the Jordan canonical form has only one diagonal entry $\lambda_{\max}$ for the top eigenvalue, and so we can write the decomposition as:

$$\Gamma_{\psi_{1,\theta},\psi_2} = U^{(-i_{\max})}\, J^{(-i_{\max})}\, \bigl(U^{-1}\bigr)^{(-i_{\max})} + \lambda_{\max}\, u_{\mathrm{right}}\, u_{\mathrm{left}}^\top$$

where $u_{\mathrm{right}}$ and $u_{\mathrm{left}}$ are the right and left eigenvectors of the top eigenvalue, and the superscript $(-i_{\max})$ drops the column/row corresponding to the top eigenvalue. Due to the bi-orthogonality of $u_{\mathrm{left}}$, $u_{\mathrm{right}}$ with the rest of the right and left generalized eigenvectors, respectively, we will have

$$\lambda_{\max}\bigl(\Gamma_{\psi_{1,\theta},\psi_2}\bigr) = u_{\mathrm{left}}^\top\, \Gamma_{\psi_{1,\theta},\psi_2}\, u_{\mathrm{right}}$$

Taking the partial derivative with respect to $\theta$ of the above identity proves the proposition.
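As a numerical illustration of this identity (not part of the proof), the following sketch builds a $\theta$-dependent $\Gamma$ matrix from random features, as in Proposition 1, and checks the resulting gradient formula against finite differences:

```python
# Finite-difference check of d(lambda_max)/d(theta) = u_left^T (dGamma/dtheta) u_right,
# with the bi-orthogonal normalization u_left^T u_right = 1 (illustrative data).
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2 = 60, 4, 5
P1 = rng.normal(size=(n, d1))
D = rng.normal(size=(n, d1))
P2 = rng.normal(size=(n, d2))

def Gamma(theta):
    Phi1 = P1 + theta * D                 # theta-dependent embedding features
    return np.block([
        [Phi1.T @ Phi1 / n, Phi1.T @ P2 / n],
        [-(Phi1.T @ P2 / n).T, -(P2.T @ P2 / n)],
    ])

def top_eig(M):
    # top eigenvalue (real, by Proposition 1) with right/left eigenvectors
    ev, Vr = np.linalg.eig(M)
    i = np.argmax(np.abs(ev))
    evl, Vl = np.linalg.eig(M.T)          # left eigenvectors = right of M^T
    j = np.argmin(np.abs(evl - ev[i]))
    u_r, u_l = np.real(Vr[:, i]), np.real(Vl[:, j])
    return np.real(ev[i]), u_l / (u_l @ u_r), u_r

theta, h = 0.3, 1e-6
lam, u_l, u_r = top_eig(Gamma(theta))
dGamma = (Gamma(theta + h) - Gamma(theta - h)) / (2 * h)
lam_p, _, _ = top_eig(Gamma(theta + h))
lam_m, _, _ = top_eig(Gamma(theta - h))
print(u_l @ dGamma @ u_r, (lam_p - lam_m) / (2 * h))  # should agree closely
```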

B. Additional Numerical Results


In this section, we provide additional numerical results on embedding comparisons and alignment using the SPEC framework,
further illustrating its effectiveness in identifying clustering differences across various datasets and embedding methods.

B.1. Ablation Study on the Comparison of Visualization Algorithms


To further validate that the SPEC method can distinguish between two embeddings, we supplemented the UMAP visualization in Figure 2 with additional techniques, PaCMAP (Wang et al., 2021) and t-SNE. These visualizations assess how well the clusters identified by SPEC align across different dimensionality reduction methods. As shown in Figure 6, the alternative techniques confirm that SPEC correctly isolates distinct clusters in one embedding, whereas the other embedding fails to produce clean separations.
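A minimal sketch of this ablation (assuming an n × d array `features` for one embedding and SPEC cluster labels `spec_labels`, and the scikit-learn, umap-learn, and pacmap packages) is:

```python
# Project an embedding with three reduction methods and color by SPEC cluster.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import umap
import pacmap

reducers = {
    "UMAP": umap.UMAP(n_components=2, random_state=0),
    "PaCMAP": pacmap.PaCMAP(),  # 2-D output by default
    "t-SNE": TSNE(n_components=2, random_state=0),
}
fig, axes = plt.subplots(1, len(reducers), figsize=(15, 4))
for ax, (name, reducer) in zip(axes, reducers.items()):
    coords = reducer.fit_transform(features)   # `features` assumed given
    ax.scatter(coords[:, 0], coords[:, 1], c=spec_labels, s=2, cmap="tab10")
    ax.set_title(f"{name} of top SPEC clusters")
plt.tight_layout()
plt.show()
```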

B.2. Comparison of text embeddings


Synthetic text dataset. To evaluate the performance of SPEC on text embeddings, we generated a dataset of 10K text
samples using GPT-4o, covering diverse categories such as profession, objects, gender, and emotions. We then applied SPEC
to compare CLIP, RoBERTa, and E5 text embeddings. As shown in Figures 8 and 9, CLIP primarily clusters sentences
based on objects mentioned in the text, whereas RoBERTa organizes them according to profession and gender. For instance,
CLIP’s first cluster consists of sentences related to cameras and photography, while RoBERTa’s first cluster groups sentences
about female firefighters and female carpenters. This suggests that CLIP embeddings do not cluster professions based on
gender but excel at grouping objects, which aligns with its training focus. Additionally, the t-SNE visualization reveals that
the principal clusters identified in one embedding are not well-separated in the other, indicating that each model captures
different semantic aspects of the text. In addition to the previous experiments, we used the MS-COCO 2017 training-set captions (≈120K samples) to compare RoBERTa and E5-Large-V2 embeddings. As shown in Figure 10, E5 manages to cluster captions in which two or more animals are interacting.
The prompt used to generate the text dataset: “You are an expert prompt optimizer for text-to-image models. Text-to-image
models take a text prompt as input and generate images. Your task is to generate a prompt describing a person in [Profession],
[Emotion], and [Gender] performing [Action] with [Object]. You can randomly choose the categories from the attributes:
Professions: Chef, doctor, journalist, scientist, carpenter, engineer, pilot, artist, teacher, firefighter. Emotions: Excited, calm,
angry, serious, curious, confident, focused, determined, happy, tired. Genders: Male, female. Actions: Designing, adjusting,
sitting, crouching, climbing, carrying, holding, standing, inspecting, juggling. Objects: Camera, frying pan, painting, laptop,
photograph, syringe, golden throne, desk, guitar, airplane.”
Real-world text dataset. To further compare the text embeddings, we validated our approach on a large-scale real text dataset: WikiText-2 (Merity et al., 2016). We split the dataset into 10K samples, each containing 100 tokens, and then used SPEC to compare CLIP and RoBERTa embeddings. As shown in Figure 7, RoBERTa better clustered Military Operations & Infrastructure, Ecology & Species Biology, Historical Figures, and Music, while CLIP embeddings more strongly clustered Entertainment & Sports and Science.
In addition to the t-SNE plots, we examined the distribution of pairwise distances within each cluster to verify that one embedding successfully captured these clusters while the other was less inclined to do so. Also, we ran the K-means clustering algorithm 50 times on each embedding's features and computed the average (across the 50 runs) Normalized Mutual Information (NMI) between the K-means labels and the SPEC-identified labels, as sketched below. The results demonstrate that one embedding achieves considerably stronger alignment with the K-means labels.
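A minimal sketch of this agreement score (assuming arrays `features` and `spec_labels` for an embedding and its SPEC-identified cluster labels; names are illustrative):

```python
# Average NMI between K-means labels and SPEC labels over repeated runs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def kmeans_spec_nmi(features, spec_labels, n_runs=50):
    k = len(np.unique(spec_labels))  # match K to the number of SPEC clusters
    scores = [
        normalized_mutual_info_score(
            spec_labels,
            KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features),
        )
        for seed in range(n_runs)
    ]
    return np.mean(scores), np.std(scores)
```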


Figure 6. Comparison of SPEC-identified clusters across different visualization methods (UMAP, PaCMAP, t-SNE) for the experiment in Figure 2. One embedding shows clear cluster separation, while the other fails to distinguish the groups.


[Figure 7 panels: top 4 SPEC principal clusters for RoBERTa − CLIP (eigenvalues 0.035, 0.026, 0.014, 0.011) and for CLIP − RoBERTa (0.032, 0.021, 0.018, 0.013) on WikiText-2, with t-SNE visualizations of the top 10 clusters, normalized intra-cluster distance histograms, and K-means/SPEC NMI scores (RoBERTa: 0.62 ± 0.0004 vs. CLIP: 0.31 ± 0.0005; CLIP: 0.63 ± 0.0011 vs. RoBERTa: 0.32 ± 0.0003).]

Figure 7. Top 4 SPEC-identified clusters by comparing CLIP and RoBERTa text embeddings on the WikiText-2 dataset with the
visualization of the top 10 SPEC-identified clusters using t-SNE.


[Figure 8 panels: top 4 SPEC principal clusters for CLIP − RoBERTa (eigenvalues 0.026, 0.014, 0.009, 0.007) and for RoBERTa − CLIP (0.022, 0.015, 0.009, 0.008) on the GPT-4o synthetic text dataset, with t-SNE/UMAP visualizations of the top 10 clusters, normalized intra-cluster distances, and K-means/SPEC AMI scores (CLIP: 0.68 ± 0.0006 vs. RoBERTa: 0.34 ± 0.0010; RoBERTa: 0.72 ± 0.0010 vs. CLIP: 0.45 ± 0.0014).]

Figure 8. Top 4 SPEC-identified clusters by comparing CLIP and RoBERTa text embeddings on a dataset of 10K samples generated from
GPT-4o with the visualization of the top 10 SPEC-identified clusters using t-SNE and UMAP.


[Figure 9 panels: top 4 SPEC principal clusters for CLIP − E5 (eigenvalues 0.025, 0.013, 0.009, 0.007) and for E5 − CLIP (0.010, 0.007, 0.005, 0.004) on the GPT-4o synthetic text dataset, with t-SNE/UMAP visualizations of the top 10 clusters, normalized intra-cluster distances, and K-means/SPEC AMI scores (CLIP: 0.68 ± 0.0003 vs. E5: 0.43 ± 0.0022; CLIP: 0.55 ± 0.0008 vs. E5: 0.32 ± 0.0004).]

Figure 9. Top 4 SPEC-identified clusters by comparing CLIP and E5-Large-V2 text embeddings on a dataset of 10K samples generated
from GPT-4o with the visualization of the top 10 SPEC-identified clusters using t-SNE.


[Figure 10 panels: top 4 SPEC principal clusters for RoBERTa − E5 (eigenvalues 0.018, 0.009, 0.006, 0.004) and for E5 − RoBERTa (0.010, 0.007, 0.005, 0.004) on MS-COCO 2017 captions, with t-SNE visualizations of the top 10 clusters for each embedding.]

Figure 10. Top 4 SPEC-identified clusters by comparing RoBERTa and E5-Large-V2 text embeddings on MS-COCO 2017 train captions (≈120K prompts), with the visualization of the top 10 SPEC-identified clusters using t-SNE.


B.3. Comparison of image embeddings


To explore the capability of CLIP’s image embedding, we analyze the top 4 SPEC-identified clusters comparing CLIP and
DINOv2 embeddings on ImageNet-1k dog breeds in Figure 11, highlighting their different clustering strategies. CLIP
primarily groups dogs based on their posture and gestures rather than their breed. For example, in cluster two, all dogs are
standing, but they belong to different breeds. This suggests that CLIP focuses more on high-level visual features like body
position and orientation. In contrast, DINOv2 forms clusters based on dog breeds, grouping visually similar dogs together
regardless of their posture. The last row presents the UMAP representation of the top 10 SPEC-identified clusters for each embedding, further illustrating their distinct clustering behaviors.
We analyzed the clustering behavior of different embeddings on 120K samples from the MS-COCO 2017 dataset and
observed similar trends in how different models organize visual concepts in Figure 14. For instance, SWAV demonstrates
a strong ability to cluster grid-like images, suggesting its emphasis on structural patterns in images. Meanwhile, CLIP
excels at differentiating activities like surfing, capturing fine-grained semantic details that may not be as distinct in SWAV or
DINOv2. However, CLIP struggles to cluster certain sports, such as tennis, as effectively as SWAV or DINOv2, highlighting
its relative limitations in capturing specific action-based similarities. These findings further illustrate the varying strengths
of different embeddings in organizing visual content.
We also analyzed SPEC on 70,000 samples from the FFHQ dataset. As observed in Figure 15, DINOv2 better distinguishes, as a cluster, images in which two people are present with one appearing only partially. In contrast, CLIP is more effective at identifying children as a distinct group.

B.4. Comparing similarity ranking for SPEC clusters


In Figure 12, the leftmost images show the top 4 samples of SPEC-identified clusters on AFHQ. The first and second clusters
correspond to black cats and black & white cats, respectively. For each cluster, cosine similarity is computed between the
mean of these samples and both 4 additional cluster members (green-bordered) and 4 random images (red-bordered). Images
are sorted left to right by similarity scores. The first row represents DINOv2, the second CLIP. Unlike DINOv2, CLIP ranks
some random cats more similar to the cluster mean than the actual cluster members.
For the FFHQ dataset, as shown in Figure 13, the first cluster corresponds to people wearing graduation caps, and the
second to people wearing sunglasses. While DINOv2 maintains a significant similarity gap, clearly distinguishing the
clusters, CLIP assigns higher similarity to some samples that lack these defining features.
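A small sketch of this ranking procedure (operating on precomputed image embeddings; the function and argument names are illustrative):

```python
# Rank candidate images by cosine similarity to a SPEC cluster's mean embedding.
import numpy as np

def rank_by_cluster_similarity(top4_embs, candidate_embs):
    """Sort candidates by cosine similarity to the mean of the top-4 cluster
    samples; for a well-separating embedding, cluster members should outrank
    random images."""
    mean = top4_embs.mean(axis=0)
    mean /= np.linalg.norm(mean)
    cand = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = cand @ mean
    order = np.argsort(-sims)  # descending similarity
    return order, sims[order]
```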

B.5. SPEC-align Experiments


SPEC-align fine-tuning experiment parameters. We used the OpenCLIP GitHub repository (link) and the MS-COCO 2017 training set, which consists of ≈120K text-image pairs. We ran SPEC-align with the parameters listed in Table 1 and chose DINOv2-ViT-B/14 as our reference model. A minimal sketch of the resulting training objective is given below.
Comparison of kernel matrices for SPEC-align. As shown in Figure 21, the SPEC-aligned CLIP kernel captures the top four clusters based on image content rather than the overlaid text labels. Furthermore, in the t-SNE plots of the models, the SPEC-align cluster of fish images lies close to the cluster of images overlaid with the text "fish", showing that SPEC-align captures the text-image similarity in this experiment while clustering based on the ground-truth image content.
Aligning text embeddings. In addition, we aligned CLIP text features to the T5-XL model. In Figure 20, we can observe that the CLIP kernel becomes more similar to the T5-XL kernel, and the SPEC-diff decreases during training.
Cluster comparison for SPEC-align. We provide additional results by comparing the top 8 Kernel-PCA (Gaussian RBF kernel) clusters of CLIP, DINOv2, and SPEC-align CLIP, where the SPEC-align model is CLIP aligned with DINOv2 on the ImageNet training set. We compare the clustering of these embeddings on ImageWoof and the text-overlaid dataset of Figure 3. Figure 18 shows the top 8 clusters on the text-overlaid dataset: DINOv2 clusters are based on the images, while CLIP clusters are based on both images and texts and in some cases fail to cluster based on the image. SPEC-align CLIP, on the other hand, clusters based on the images while focusing on images with the same overlaid text, as expected. The top 8 clusters of ImageWoof in Figure 19 similarly show that CLIP clusters the dogs based on their gesture, their interaction with humans, or the number of dogs, while DINOv2 clusters them only based on their breed. SPEC-align CLIP clusters dogs based on their breeds while still reflecting the gesture, the number of dogs, and their interactions with humans.


[Figure 11 panels: top 4 SPEC principal clusters for CLIP − DINOv2 (eigenvalues 0.032, 0.011, 0.009, 0.006) and for DINOv2 − CLIP (0.016, 0.013, 0.011, 0.009) on ImageNet-1k dog breeds, with UMAP visualizations of the top 10 clusters, normalized intra-cluster distances, and K-means/SPEC AMI scores (CLIP: 0.59 ± 0.0009 vs. DINOv2: 0.23 ± 0.0003; DINOv2: 0.87 ± 0.0003 vs. CLIP: 0.51 ± 0.0024).]

Figure 11. Top 4 SPEC-identified clusters comparing CLIP and DINOv2 embeddings on ImageNet-1k dog breeds. The last row shows the
UMAP representation of the top 10 SPEC-identified clusters for each embedding.



Figure 12. Comparing similarity ranking for SPEC clusters in DINOv2-CLIP on the AFHQ dataset. The leftmost images show the top 4
samples of two SPEC-identified clusters. Cosine similarity is computed with 4 cluster members (green-bordered) and 4 random images
(red-bordered), sorted by score. Unlike DINOv2, CLIP ranks some random samples higher than cluster members.

Figure 13. Comparing similarity ranking for SPEC clusters in DINOv2-CLIP on the FFHQ dataset. The leftmost images show the top 4
samples of two SPEC-identified clusters. Cosine similarity is computed with 4 cluster members (green-bordered) and 4 random images
(red-bordered), sorted by score. Unlike DINOv2, CLIP ranks some random samples higher than cluster members.


[Figure 14 panels: top 3 SPEC clusters for each of the comparisons DINOv2 − CLIP, CLIP − DINOv2, CLIP − SWAV, SWAV − CLIP, Inception − SWAV, and SWAV − Inception on MS-COCO 2017.]
Figure 14. Comparing different embeddings on ≈120K samples from the MS-COCO 2017 dataset.

[Figure 15 panels: top 3 SPEC clusters for DINOv2 − CLIP (eigenvalues 0.007, 0.004, 0.003) and for CLIP − DINOv2 (0.010, 0.007, 0.005), with UMAP visualizations of the top 10 clusters and K-means/SPEC AMI scores (DINOv2: 0.80 ± 0.0002 vs. CLIP: 0.38 ± 0.0004; CLIP: 0.67 ± 0.0007 vs. DINOv2: 0.33 ± 0.0003).]

Figure 15. Comparing embeddings on 70K FFHQ samples. Top numbers show SPEC cluster eigenvalues. Last two images per row
display UMAP representations of SPEC clusters for each embedding.


[Figure 16 panels: top 3 SPEC clusters for Inception − CLIP (eigenvalues 0.015, 0.008, 0.006) and for CLIP − Inception (0.037, 0.018, 0.011), with UMAP visualizations of the top 10 clusters and K-means/SPEC AMI scores (Inception: 0.76 ± 0.0002 vs. CLIP: 0.20 ± 0.0001; CLIP: 0.80 ± 0.0006 vs. Inception: 0.25 ± 0.0002).]

Figure 16. Comparing embeddings on 70K FFHQ samples. Top numbers show SPEC cluster eigenvalues. Last two images per row
display UMAP representations of SPEC clusters for each embedding.

Parameter Value
accum freq 1
alignment loss weight 0.1
batch size 128
clip alignment contrastive loss weight 0.9
coca contrastive loss weight 1.0
distributed True
epochs 10
lr 1e-05
lr scheduler cosine
model ViT-B-32
name Vit-B-32 laion2b e16 freeze 5
precision amp
pretrained laion2b e16
seed 0
wd 0.2

Table 1. Configuration parameters used in the experiments.

Table 2. Linear evaluation of frozen features on fine-grained benchmarks.

Model | Architecture | Data | ImageNet-1K
OpenCLIP | ViT-B/32 | LAION-400M | 73.50
SPEC-align OpenCLIP | ViT-B/32 | LAION-400M | 76.45
DINOv2 | ViT-B/14 | LVD-142M | 78.99


[Figure 17 panels: top 3 SPEC clusters for SWAV − DINOv2 (eigenvalues 0.051, 0.021, 0.014) and for DINOv2 − SWAV (0.027, 0.021, 0.017) on AFHQ, with UMAP visualizations of the top 10 clusters and K-means/SPEC AMI scores (1.00 ± 0.0000 and 0.86 ± 0.0006 for one direction; 0.78 ± 0.0005 and 0.54 ± 0.0004 for the other).]

Figure 17. Comparison of different embeddings on 15K samples from the AFHQ dataset, consisting of 5K cats, 5K wildlife, and 5K dogs.
The number at the top of each image represents the eigenvalue of the corresponding SPEC cluster. The last two images in each row show
the UMAP representation of the SPEC clusters for each embedding individually.



Figure 18. Top 8 Kernel-PCA (Gaussian RBF kernel) clusters for CLIP, DINOv2, and CLIP aligned with DINOv2 on the ImageNet training set, evaluated on the text-overlaid dataset.



Figure 19. Top 8 Kernel-PCA (Gaussian RBF kernel) clusters for CLIP, DINOv2, and CLIP aligned with DINOv2 on the ImageNet training set, evaluated on the ImageWoof dataset.


Figure 20. Comparison of kernel matrices after using SPEC-align to match the sample clusters of CLIP to T5-XL, with SPEC-diff measured during training.


Figure 21. Comparison of kernel matrices after using SPEC-align to align CLIP to DINOv2.
