Towards An Explainable Comparison and Alignment of Feature Embeddings
Abstract
While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their

raw image and text inputs into spaces with semantically meaningful features (Radford et al., 2021; Liu et al., 2023; Alayrac et al., 2022; Li et al., 2023). The application of pre-trained embeddings has enabled scalable solutions for many downstream tasks, particularly in scenarios where the avail-
arXiv:2506.06231v1 [cs.LG] 6 Jun 2025
[Figure 1 panels: Reference Dataset (FFHQ Dataset); Eigenvalues to measure the difference; Top 3 SPEC-identified clusters for CLIP - DINOv2; Top 3 SPEC-identified clusters for DINOv2 - CLIP.]
Figure 1. Overview of the Spectral Pairwise Embedding Comparison (SPEC) framework: The SPEC performs an eigendecomposition
of the difference of kernel matrices following the two compared embeddings (e.g., DINOv2 and CLIP image embeddings) on a given
reference dataset. Every eigenvector can be interpreted as a differently captured sample cluster by the embeddings, and the corresponding
eigenvalue quantifies the difference between the cluster frequencies in the embedding spaces.
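The computation summarized in the Figure 1 caption can be sketched on synthetic data. The following is a minimal illustration rather than the authors' implementation: the two "embeddings" `Z1` and `Z2` are hypothetical random matrices, the cosine-similarity kernel is assumed, and the helper names `cosine_kernel` and `spec_direct` are introduced here only for illustration. A group of 40 samples is planted so that it is tightly clustered in `Z1` but not in `Z2`, and the top eigenvector of the normalized kernel difference matrix is expected to concentrate on that group.

```python
import numpy as np

def cosine_kernel(Z):
    """Cosine-similarity kernel matrix of the rows of Z."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def spec_direct(Z1, Z2):
    """Eigendecomposition of (K1 - K2) / n, eigenvalues in descending order."""
    n = Z1.shape[0]
    Lam = (cosine_kernel(Z1) - cosine_kernel(Z2)) / n
    eigvals, eigvecs = np.linalg.eigh(Lam)  # Lam is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
n, d, n_cluster = 200, 64, 40

# Hypothetical embedding 1: the first 40 samples lie near one common direction.
Z1 = rng.normal(size=(n, d))
Z1[:n_cluster] = 1.0 + 0.05 * rng.normal(size=(n_cluster, d))
# Hypothetical embedding 2: no group structure at all.
Z2 = rng.normal(size=(n, d))

eigvals, eigvecs = spec_direct(Z1, Z2)
top_vec = eigvecs[:, 0]

# Samples with the largest-magnitude entries form the SPEC-identified cluster.
members = np.argsort(-np.abs(top_vec))[:n_cluster]
cluster_mass = float(np.sum(top_vec[:n_cluster] ** 2))

print("top eigenvalue:", round(float(eigvals[0]), 3))
print("share of the top eigenvector on the planted group:", round(cluster_mass, 3))
```

This direct route forms an n × n matrix, so it is only practical for small reference sets. Because both kernels here have unit diagonal, the eigenvalues sum to zero up to floating-point error.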
the difference of embeddings’ kernel matrices to interpret the differences in cluster assignments. Our analysis suggests that the SPEC framework can effectively detect cluster differences between two embeddings.

To address the computational challenges of performing eigendecomposition on large-scale datasets, we develop a scalable implementation of SPEC. A direct eigendecomposition of the n × n difference kernel matrix requires O(n^3) computations for a dataset with n samples, which is computationally expensive for large datasets. Assuming a bounded feature dimension d for the applied kernel function, we prove that the eigenspace of the difference kernel matrix can be computed using O(max{d^3, n}) operations, resulting in a scalable algorithm under a moderate dimension d value. Furthermore, we extend this scalable computation method to shift-invariant kernel functions, e.g. the Gaussian kernel, by employing the framework of random Fourier features (RFF) (Rahimi & Recht, 2007a), where the size of RFF proxy-feature map can be controlled for a more efficient application of SPEC.

We also explore the application of the SPEC framework to define a distance measure between two embeddings. We define the SPEC-diff distance as the spectral radius of the kernel difference matrix, which aims to quantify the weight of the most differently captured cluster in one embedding that is not strongly clustered by the other model. We discuss scalable computations of this distance and its gradient with respect to the embedding parameters. Using the power method and the calculated left and right eigenvectors of the differential covariance matrix, we enable gradient-based optimization of the distance measure for aligning embedding models. This gradient-based approach leads to a method we call SPEC-align, aligning embedding models by minimizing their differences in clustering a reference dataset. SPEC-align is particularly useful for aligning cross-modality embeddings, such as CLIP (Radford et al., 2021), with a state-of-the-art single-modality embedding. Such a spectral alignment can improve the performance of cross-modality embeddings in capturing concepts specific to individual modalities.

Finally, we present numerical experiments on several standard image and text embeddings using benchmark datasets. Our results demonstrate the scalability of the SPEC framework in revealing differences in sample clusters across embeddings over large-scale datasets. In our experiments, we tested the SPEC algorithm’s application with both cosine similarity and shift-invariant Gaussian kernels, where we leverage random Fourier features for the latter case. Additionally, we discuss the application of SPEC-align to align
the CLIP model with single-modality embeddings. The empirical results highlight the effectiveness of SPEC-align in reducing the differences between CLIP’s image embeddings and specialized image-domain embeddings. The following is a summary of our work’s main contributions:
• Proposing the SPEC framework for explainable comparison of two embeddings,
• Providing a scalable SPEC implementation whose computational cost grows linearly with the sample size,
• Developing the gradient-based SPEC-align method to align two embeddings and match their sample clusters,
• Demonstrating the successful application of SPEC in comparing and aligning embeddings on benchmark datasets.

2. Related Work
Spectral Clustering, Kernel PCA, and Random Fourier Features. Kernel PCA (Schölkopf et al., 1998) is a widely recognized technique for dimensionality reduction that relies on the eigendecomposition of the kernel matrix. Several studies (Bengio et al., 2003b;a) have explored the relationship between kernel PCA and spectral clustering. Also, the analysis of random Fourier features (Rahimi & Recht, 2007b) for performing scalable kernel PCA has been studied by Chitta et al. (2012); Ghashami et al. (2016); Ullah et al. (2018); Sriperumbudur & Sterge (2022); Gedon et al. (2023). In this paper, we introduce a spectral approach for comparing two embeddings, leveraging the random Fourier features framework to address computational challenges. Unlike Laplacian spectral clustering (Ng et al., 2001) that uses the graph Laplacian, our method uses the kernel matrix similar to Kernel PCA.

Evaluation and Comparison of Embeddings. Embedding evaluation is typically conducted using a limited set of downstream tasks (Chen et al., 2013; Santos et al., 2020; Perone et al., 2018; Choi et al., 2021). Existing NLP benchmarks (Gao et al., 2021; Reimers & Gurevych, 2019) focus on limited tasks. Muennighoff et al. (2023) introduce MTEB, standardizing text embedder evaluation across diverse NLP tasks. For image embeddings, Kynkäänniemi et al. (2023); Stein et al. (2023) compared different image embeddings and showed how they can influence different tasks, specifically the evaluation of generative models. Another line of research is probing methods (Belinkov, 2022; Pimentel et al., 2020; Adi et al., 2017; Rogers et al., 2021), which analyze model embeddings by training small models on them to understand what information is encoded. These methods help assess how well embeddings capture specific features, although they are not focused on embedding comparison. Darrin et al. (2024) propose a new metric for comparing embeddings without labeled data and propose the concept of information sufficiency (IS) to quantify the required information to simulate one embedding from another. Our work offers a complementary, explainable method for comparing embeddings by detecting different sample clusters assigned by embeddings and providing a method for aligning them.

A different yet related line of work is the evaluation of generative models. Several works (Bińkowski et al., 2018; Jalali et al., 2023; Ospanov et al., 2024; Jalali et al., 2024; Ospanov & Farnia, 2024) leverage the eigenspectrum of kernel matrices to quantify diversity. The papers (Jiralerspong et al., 2023; Zhang et al., 2024) explore novelty evaluation, analyzing how generated samples differ from those of a reference distribution. In particular, (Zhang et al., 2024; 2025) propose a spectral method for measuring the entropy of the novel modes of a generative model with respect to a reference model, relying on the eigendecomposition of kernel similarity matrices.

Embeddings Alignment. There are many works on embedding alignment for multimodal models (Bellagente et al., 2023; Lu et al., 2024; Han et al., 2024; Wang et al., 2023b; Girdhar et al., 2023; Grave et al., 2019). Salman et al. (2024); Eslami & de Melo (2025) introduced a method, which demonstrates that adversarial perturbations can force text embeddings to align with any image in multimodal models, exposing security vulnerabilities in vision-language learning. Ye et al. (2024) proposed ModalChorus, an interactive system that visualizes and corrects misalignments in multi-modal embeddings, improving interpretability and optimization. Focusing on fine-grained alignment, Yin et al. (2024) introduced a method for explicitly aligning individual word embeddings with corresponding visual features, leveraging cross-modal attention to refine token-image associations. In contrast, our work focuses on aligning embeddings in a kernel setting specifically to match their sample clusters, leading to a different approach to embedding comparison.

3. Preliminaries
3.1. Embedding maps and spaces
Consider a data vector x ∈ X in the space X. An embedding map ψ : X → S maps an input x to the embedding space S, which is supposed to provide a more meaningful representation of the input data vector. Throughout this work, we focus on the problem of characterizing and interpreting the differences of two embedding maps ψ1 : X → S1 and ψ2 : X → S2, which can map the input x ∈ X to different embedding spaces S1, S2.

3.2. Kernel Functions and Covariance Matrix
A kernel function k : X × X → R maps two inputs x, x′ to a similarity score k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩ ∈ [0, 1] that is the inner product of the representation of x, x′ characterized by ϕ : X → R^d. This definition implies that for every sequence
of data points x1, ..., xn ∈ X, the following kernel matrix is positive semi-definite (PSD):
\[
K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \succeq 0 \tag{1}
\]
Well-known examples of kernel functions include the cosine-similarity kernel k_cosine(x, y) = x^⊤y / (∥x∥_2 ∥y∥_2) and the Gaussian (RBF) kernel defined for bandwidth σ as:
\[
k(x, y) = \exp\Big( \frac{-\|x - y\|_2^2}{2\sigma^2} \Big)
\]
Both these examples are normalized kernels where k(x, x) = 1 holds for every x ∈ X. Note that the kernel matrix in (1) can be written as K = ΦΦ^⊤ where Φ ∈ R^{n×d} contains ϕ(xi) as its ith row for i ∈ {1, ..., n}. Then, the kernel covariance matrix C_X ∈ R^{d×d} can be defined by reversing the matrix multiplication order as:
\[
C_X := \frac{1}{n} \Phi^\top \Phi = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i)\phi(x_i)^\top \tag{2}
\]
Therefore, C_X = (1/n)Φ^⊤Φ and (1/n)K = (1/n)ΦΦ^⊤ share the same non-zero eigenvalues, since they represent the products of matrices with flipped multiplication orders.

4. SPEC: A Spectral Identification of Embeddings’ Mismatches
Consider a set of n data points x1, ..., xn ∈ X and two embedding maps ψ1 : X → S1 and ψ2 : X → S2. Also, suppose k1 : S1 × S1 → R and k2 : S2 × S2 → R are kernel functions to be applied to the embedding spaces for ψ1, ψ2, respectively.

To compare the two embeddings, note that their corresponding spaces S1 and S2 may have different dimensions, and therefore a sample-specific comparison of the embedded vectors ψ1(x), ψ2(x) for each individual data point x will not provide a meaningful comparison of the embedding maps. Therefore, a more relevant approach is to compare the embeddings’ outputs over the entire set of data {x1, ..., xn} and investigate which structures are dissimilar between the sets of embedded data following the embeddings. Here, we consider a spectral approach and particularly focus on the difference of kernel matrices between the two embeddings. In the following, we discuss how the eigenspace of the kernel difference matrix can help identify the differently clustered points by the two embeddings.

To do this, consider the kernel matrix of the first embedding
\[
K_{\psi_1} = \big[ k_1(\psi_1(x_i), \psi_1(x_j)) \big]_{(i,j)=(1,1)}^{(n,n)}
\]
and the kernel matrix of the second embedding
\[
K_{\psi_2} = \big[ k_2(\psi_2(x_i), \psi_2(x_j)) \big]_{(i,j)=(1,1)}^{(n,n)} .
\]

Definition 1. We define the normalized kernel difference matrix Λψ1,ψ2 ∈ R^{n×n} as follows:
\[
\Lambda_{\psi_1,\psi_2} := \frac{1}{n}\big( K_{\psi_1} - K_{\psi_2} \big) \tag{3}
\]

We propose the framework of Spectral Pairwise Embedding Comparison (SPEC) where the two embeddings ψ1 and ψ2 are compared using the eigendirections of the difference kernel matrix Λψ1,ψ2. As we will show, the principal eigenvectors can be interpreted as the clusters of samples assigned by embedding ψ1 that are less strongly grouped by the second embedding ψ2. In what follows, we first show a theoretical result supporting the mentioned property of Λψ1,ψ2’s eigenvectors. Next, we provide a scalable computation method for computing the eigenspace of Λψ1,ψ2 that linearly scales with the sample size n.

Theorem 1 proves that under the following two conditions on the sample index set I ⊂ {1, ..., n}, the eigendirections of Λψ1,ψ2 can separate the clustered sample indices from the rest of samples. Note that the notation I^c denotes the complement index set of I, and K[I, J] denotes the sub-matrix of K with rows in I and columns in J.

• Condition 1: Suppose the sample set X_I characterized by index set I is separated from the rest of samples by embedding ψ1, where the normalized block kernel matrix (1/n)Kψ1[I, I^c] has a bounded Frobenius norm ∥(1/n)Kψ1[I, I^c]∥_F ≤ ϵ1.
• Condition 2: Suppose the sample set X_I characterized by index set I is weakly grouped by embedding ψ2, where the normalized block kernel matrix (1/n)Kψ2[I, I] has a bounded ℓ2-operator norm (or maximum eigenvalue for this PSD matrix) ∥(1/n)Kψ2[I, I]∥_2 ≤ ϵ2.

Theorem 1. Consider the difference kernel matrix Λψ1,ψ2 in (3). Suppose Conditions 1 and 2 hold. Let v1, ..., vn be the unit-norm eigenvectors of Λψ1,ψ2 corresponding to eigenvalues λ1, ..., λn. For every i ∈ {1, ..., n}, we define λ_i^I and λ_i^{I^c} to be the closest eigenvalue of Λψ1,ψ2[I, I] and Λψ1,ψ2[I^c, I^c] to λi. Then, the following holds for ξ = 4(ϵ1^2 + ϵ2):
\[
\sum_{i=1}^{n} \Big[ \big(\lambda_i - \lambda_i^{I}\big)^2 \big\| v_i[I] \big\|_2^2 + \big(\lambda_i - \lambda_i^{I^c}\big)^2 \big\| v_i[I^c] \big\|_2^2 \Big] \le \xi
\]

Proof. We defer the proof of the theoretical statements to the Appendix A.1.

Corollary 1. In the setting of Theorem 1, suppose v is an eigenvector of Λψ1,ψ2 for eigenvalue λ whose gap with the maximum eigenvalue of the sub-matrix Λψ1,ψ2[I^c, I^c] satisfies λ − λmax(Λψ1,ψ2[I^c, I^c]) ≥ γ > 0. Then,
\[
\big\| v[I^c] \big\|_2 \le \frac{2\sqrt{\epsilon_1^2 + \epsilon_2}}{\gamma} .
\]
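The quantities appearing in Conditions 1 and 2 and in Corollary 1 can be computed directly for a small synthetic example. The sketch below is only a numerical illustration under stated assumptions: `Z1` and `Z2` are hypothetical random embeddings, the cosine-similarity kernel is used for both, and the index sets `I` and `Ic` are planted by construction. It evaluates ϵ1, ϵ2, and γ and compares the norm of the top eigenvector outside I with the bound of the corollary.

```python
import numpy as np

def cosine_kernel(Z):
    """Cosine-similarity kernel matrix of the rows of Z."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

rng = np.random.default_rng(0)
n, d = 300, 256
I = np.arange(150)       # indices clustered by embedding 1 only
Ic = np.arange(150, n)   # complement index set

# Hypothetical embedding 1: samples in I sit near one direction,
# the remaining samples are spread out and orthogonal to it.
Z1 = rng.normal(size=(n, d))
Z1[Ic, 0] = 0.0
Z1[I] = 0.01 * rng.normal(size=(len(I), d))
Z1[I, 0] += 1.0
# Hypothetical embedding 2: no sample is grouped with any other.
Z2 = rng.normal(size=(n, d))

K1, K2 = cosine_kernel(Z1), cosine_kernel(Z2)
Lam = (K1 - K2) / n                      # normalized kernel difference matrix

eigvals, eigvecs = np.linalg.eigh(Lam)   # ascending eigenvalues
lam_top, v = eigvals[-1], eigvecs[:, -1]

eps1 = np.linalg.norm(K1[np.ix_(I, Ic)] / n, "fro")   # Condition 1
eps2 = np.linalg.norm(K2[np.ix_(I, I)] / n, 2)        # Condition 2
gamma = lam_top - np.linalg.eigvalsh(Lam[np.ix_(Ic, Ic)])[-1]

bound = 2.0 * np.sqrt(eps1**2 + eps2) / gamma
mass_outside = np.linalg.norm(v[Ic])
print(f"||v[I^c]||_2 = {mass_outside:.3f}, bound = {bound:.3f}")
```

The bound is only informative when it is below 1, which requires a cluster that occupies a sizable fraction of the data and a second embedding that barely groups it; the planted sizes above were chosen with that in mind.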
The above corollary proves that if an eigenvalue λ of the kernel difference matrix Λψ1,ψ2 is sufficiently large, such that its gap with the maximum eigenvalue of the block Λψ1,ψ2[I^c, I^c] (with the complement of samples in I clustered by ψ1 yet not by ψ2) is higher than the threshold γ, then the I^c-entries of the corresponding unit-norm eigenvector v will be bounded, reflecting the lack of I^c samples in the differentially clustered samples by embedding ψ1 and ψ2. Based on the above theoretical results, we propose considering the principal eigendirections of the difference kernel matrix, and using their significant-value entries to find the subset of samples clustered by embedding ψ1 but not grouped by ψ2.

Since the difference kernel matrix is of size n × n, a standard eigendecomposition will cost O(n^3) computations. Proposition 1 shows that the computation cost will be lower for embeddings with bounded feature maps. In fact, this result shows the computation of the eigenspace can be performed using linearly growing computation cost O(n).

Algorithm 1 Spectral Pairwise Embedding Comparison (SPEC)
1: Input: Sample set {x1, ..., xn}, embeddings ψ1 and ψ2, kernel feature maps ϕ1 and ϕ2
2: Initialize Cψ1 = 0_{d1×d1}, Cψ2 = 0_{d2×d2}, Cψ1,ψ2 = 0_{d1×d2}
3: for i ∈ {1, ..., n} do
4:   Update Cψ1 ← Cψ1 + (1/n) ϕ1(ψ1(xi)) ϕ1(ψ1(xi))^⊤
5:   Update Cψ2 ← Cψ2 + (1/n) ϕ2(ψ2(xi)) ϕ2(ψ2(xi))^⊤
6:   Update Cψ1,ψ2 ← Cψ1,ψ2 + (1/n) ϕ1(ψ1(xi)) ϕ2(ψ2(xi))^⊤
7: end for
8: Construct Γψ1,ψ2 as in Equation (4)
9: Compute eigendecomposition Γψ1,ψ2 = V diag(λ) V^⊤
10: for i ∈ {1, ..., n} do
11:   Map eigenvector ui = [ϕ1(ψ1(X)) ϕ2(ψ2(X))] vi
12: end for
13: Output: Eigenvalues λ1, ..., λn, eigenvectors u1, ..., un.
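The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the rows of `F1` and `F2` stand for the already-computed features ϕ1(ψ1(xi)) and ϕ2(ψ2(xi)) and are hypothetical random vectors here, the function name `spec_via_covariance` is introduced for illustration, and a general (non-symmetric) eigensolver is used because the block matrix of Equation (4) is not symmetric.

```python
import numpy as np

def spec_via_covariance(F1, F2):
    """Eigenpairs of (F1 F1^T - F2 F2^T) / n obtained from the
    (d1 + d2) x (d1 + d2) block matrix instead of the n x n matrix."""
    n, d1 = F1.shape
    d2 = F2.shape[1]
    C1 = np.zeros((d1, d1))
    C2 = np.zeros((d2, d2))
    C12 = np.zeros((d1, d2))
    # One pass over the samples; only the small matrices are kept in memory.
    for f1, f2 in zip(F1, F2):
        C1 += np.outer(f1, f1) / n
        C2 += np.outer(f2, f2) / n
        C12 += np.outer(f1, f2) / n
    Gamma = np.block([[C1, C12], [-C12.T, -C2]])
    eigvals, V = np.linalg.eig(Gamma)       # Gamma is not symmetric
    eigvals, V = eigvals.real, V.real       # spectrum is real in exact arithmetic
    order = np.argsort(-np.abs(eigvals))
    eigvals, V = eigvals[order], V[:, order]
    # Map each eigenvector v of Gamma to u = [F1 F2] v in sample space.
    U = np.hstack([F1, F2]) @ V
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    return eigvals, U

rng = np.random.default_rng(0)
n, d1, d2 = 50, 4, 3
F1 = rng.normal(size=(n, d1))
F2 = rng.normal(size=(n, d2))
F1 /= np.linalg.norm(F1, axis=1, keepdims=True)   # unit rows: cosine kernel
F2 /= np.linalg.norm(F2, axis=1, keepdims=True)

eigvals, U = spec_via_covariance(F1, F2)
print("eigenvalues sorted by magnitude:", np.round(eigvals, 4))
```

The largest absolute value in `eigvals` is the spectral radius of the block matrix. For a real dataset the loop would typically be replaced by batched matrix products, which accumulate the same three matrices.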
Proposition 1. Consider the difference kernel matrix Λψ1,ψ2 in (3). This matrix shares the same non-zero eigenvalues with the following matrix:
\[
\Gamma_{\psi_1,\psi_2} = \begin{bmatrix} C_{\psi_1} & C_{\psi_1,\psi_2} \\ -C_{\psi_1,\psi_2}^\top & -C_{\psi_2} \end{bmatrix} \in \mathbb{R}^{(d_1+d_2)\times(d_1+d_2)} \tag{4}
\]
where Cψ1 ∈ R^{d1×d1}, Cψ2 ∈ R^{d2×d2} are the kernel covariance matrices of ψ1, ψ2, respectively, and Cψ1,ψ2 ∈ R^{d1×d2} is the cross-covariance matrix, defined as:
\[
C_{\psi_1} := \frac{1}{n}\sum_{i=1}^{n} \phi_1\big(\psi_1(x_i)\big)\,\phi_1\big(\psi_1(x_i)\big)^\top, \quad
C_{\psi_2} := \frac{1}{n}\sum_{i=1}^{n} \phi_2\big(\psi_2(x_i)\big)\,\phi_2\big(\psi_2(x_i)\big)^\top, \quad
C_{\psi_1,\psi_2} := \frac{1}{n}\sum_{i=1}^{n} \phi_1\big(\psi_1(x_i)\big)\,\phi_2\big(\psi_2(x_i)\big)^\top .
\]
We also note that for every eigenvector v ∈ R^{d1+d2} of the matrix Γψ1,ψ2 in (4), which we call the differential covariance matrix, we can find the corresponding vector u of difference kernel matrix Λψ1,ψ2 using the following:
\[
u = \begin{bmatrix} \phi_1(\psi_1(x_1)) & \phi_2(\psi_2(x_1)) \\ \vdots & \vdots \\ \phi_1(\psi_1(x_n)) & \phi_2(\psi_2(x_n)) \end{bmatrix} v
\]

mapping from Γψ1,ψ2 to Λψ1,ψ2 will be O(n). Therefore, the entire eigenvector computation of Λψ1,ψ2 can be handled using O(n + (d1 + d2)^3) computations. Algorithm 1 contains the main steps of computing the SPEC-eigendirections using the above approach. As detailed in this algorithm, the computation of the differential kernel covariance matrix can be run over samples in a cascade, avoiding the need for storing a large dataset.

Applying the standard linear and cosine-similarity kernels, the kernel feature dimension will match that of the embedding, which is usually bounded by 1000 for standard image and text embeddings. In the case of shift-invariant kernels, e.g. the Gaussian (RBF) kernel, whose feature dimension is infinite, we can leverage the random Fourier features (RFFs) (Rahimi & Recht, 2007b) to reduce the kernel feature dimension for a proper proxy kernel function characterized by the random Fourier features. According to the RFF framework, given a kernel function k(x, y) = κ(x − y) that is normalized, i.e. κ(0) = 1, we draw m independent Fourier features ω1, ..., ωm ∼ κ̂ from the probability density function κ̂, which denotes the Fourier transform of κ defined as
\[
\widehat{\kappa}(\omega) = \frac{1}{(2\pi)^d} \int_{\mathcal{X}} \kappa(x) \exp\big(-i\langle \omega, x\rangle\big)\, dx
\]
Then, the RFF method approximates the shift-invariant kernel k(x, y) ≈ k̂(x, y) = ϕ̂(x)^⊤ ϕ̂(y) where
drawing m Fourier features ω_i^{(1)} ∼ κ̂1 and ω_i^{(2)} ∼ κ̂2, we form the RFF-proxy kernel functions k̂1, k̂2. Then, considering eigenvalues λ̂1, ..., λ̂n and eigenvectors v̂1, ..., v̂n of proxy Λ̂ψ1,ψ2, for every δ > 0, the following holds with probability at least 1 − δ:
\[
\sum_{i=1}^{n} \big\| \Lambda_{\psi_1,\psi_2} \widehat{v}_i - \widehat{\lambda}_i \widehat{v}_i \big\|_2^2 \le \frac{128 \log(2/\delta)}{m}
\]

Proof. We defer the proof to the Appendix A.3.

As the above theorem suggests, the eigenvectors of the RFF-proxy difference kernel matrix Λ̂ψ1,ψ2 provide an approximation of the eigenspace of the target difference kernel matrix Λψ1,ψ2. On the other hand, while forming the differential covariance matrix for the proxy-RFF kernel, the dimension of Γ̂ψ1,ψ2 will be 4m × 4m, which is finite unlike the target shift-invariant kernel. As a result, one can apply the eigenspace equivalence in Proposition 1 to the proxy kernel function to reduce the computational complexity to O(m^3 + n) computations with m features and n samples.

5. SPEC-based Quantification of Embedding Differences
As discussed earlier, the eigenspace of the difference kernel matrix Λψ1,ψ2 provides information on the differently clustered samples by the two embeddings. Therefore, the SPEC approach motivates measuring the difference of two embeddings using the eigenspectrum of Λψ1,ψ2. Here, we specifically focus on the spectral radius of Λψ1,ψ2, i.e., its maximum absolute eigenvalue ρ(Λψ1,ψ2) = max_{1≤i≤n} |λi(Λψ1,ψ2)| (ρ(A) denotes A’s spectral radius). Note that Λψ1,ψ2 is by definition a symmetric matrix with a zero trace, and therefore its eigenvalues are all real and add up to 0. The following definition states the difference measure, which we call SPEC-diff score:
\[
\text{SPEC-diff}(\psi_1, \psi_2) := \rho(\Lambda_{\psi_1,\psi_2}). \tag{5}
\]
Since SPEC-diff is only a function of Λψ1,ψ2’s non-zero eigenvalues, Proposition 1 shows that SPEC-diff(ψ1, ψ2) = ρ(Γψ1,ψ2), i.e., it is equal to the spectral radius of the differential covariance matrix Γψ1,ψ2. Therefore, it is a symmetric pseudo-distance whose computation cost scales linearly with the sample size n.

While the SPEC-diff measure can be used to quantify the mismatches of two embeddings, it can be further optimized in the training or fine-tuning of an embedding map ψ1,θ’s parameters θ in order to align the embedding’s clusters with another reference embedding ψ2. The optimization problem to be solved for such an alignment of the embeddings will be the following, which we call the SPEC-align problem:
\[
\min_{\theta \in \Theta} \; \mathcal{L}(\psi_{1,\theta}) + \beta \cdot \text{SPEC-diff}(\psi_{1,\theta}, \psi_2) \tag{6}
\]
In the above, L(ψ1,θ) denotes the original loss function of training embedding ψ1,θ and β denotes the coefficient of the penalty function SPEC-diff(ψ1,θ, ψ2), penalizing the mismatch with reference embedding ψ2. To apply a gradient-based optimization algorithm to solve (6), one needs to efficiently compute the gradient of SPEC-diff(ψ1,θ, ψ2) with respect to parameter θ. The following proposition shows that the gradient computation can be run in O(n_B) over a batch size n_B.

Proposition 2. Consider the definitions in (3), (4), (5). Then, assuming a unique top eigenvalue (in terms of absolute value) for Γψ1,θ,ψ2 with the left and right eigenvectors u_left, u_right, we will have:
\[
\nabla_\theta\, \text{SPEC-diff}(\psi_{1,\theta}, \psi_2) = \nabla_\theta \big( u_{\text{left}}^\top\, \Gamma_{\psi_{1,\theta},\psi_2}\, u_{\text{right}} \big) \tag{7}
\]

Therefore, the above proposition suggests computing the top left and right eigenvectors of the (d1 + d2) × (d1 + d2) differential covariance matrix, which can be computed using the power method, and subsequently taking the gradient of the scalar function u_left^⊤ Γψ1,θ,ψ2 u_right, which is the absolute value of the mean of a function evaluated on each individual sample x1, ..., xn. This property is especially suitable for applying stochastic gradient methods.

6. Numerical Results
In this section, we first discuss the experimental settings and then apply the SPEC algorithm to compare different image and text embeddings across various large-scale datasets. Finally, we explore the use of the SPEC-align method to match the sample clusters of the embeddings.

Datasets. In our experiments on image data, we used four datasets: AFHQ (Choi et al., 2020) (15K animal faces in categories of cats, wildlife, and dogs), FFHQ (Karras et al., 2019) (70K human-face images), ImageNet-1K (Deng et al., 2009) (1.4 million images across 1,000 labels), and MS-COCO 2017 (Lin et al., 2015) (≈110K samples of diverse scenes with multiple objects). Additionally, similar to (Materzynska et al., 2022), we created a custom dataset derived from 10 selected classes from ImageNet-1K, where we overlaid text labels directly on images.

Embeddings. The feature embeddings tested in this study include the image embeddings: CLIP (Radford et al., 2021), DINOv2 (Oquab et al., 2024), Inception-V3 (Szegedy et al., 2016), and SWAV (Caron et al., 2021), and the text embeddings: RoBERTa (Liu et al., 2020), CLIP (Radford et al., 2021), and E5-V2 (Wang et al., 2023a). All embeddings were extracted using pre-trained models, and standard pre-processing was applied for uniformity across datasets.

Experimental settings. In our experiments, we computed the SPEC differential kernel covariance matrix using m =
[Figure 2 panels: SPEC-identified clusters and UMAP plots for the DINOv2 and CLIP embeddings. Panel titles, in order: DINOv2 KMeans & SPEC AMI: 0.94 ± 0.0003; CLIP KMeans & SPEC AMI: 0.65 ± 0.0014; CLIP KMeans & SPEC AMI: 0.78 ± 0.0006; DINOv2 KMeans & SPEC AMI: 0.44 ± 0.0002.]
Figure 2. Comparison of different embeddings on 15K samples from the AFHQ dataset, consisting of 5K cats, 5K wildlife, and 5K dogs.
The number at the top of each image represents the eigenvalue of the corresponding SPEC cluster. The last two images in each row show
the UMAP representation of the SPEC clusters for each embedding individually.
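Cluster members such as those displayed in Figure 2 are read off the SPEC eigenvectors by taking the samples with the largest entries. A minimal sketch of that selection step is given below; the function name `top_samples_per_cluster` and the toy eigenvector matrix are hypothetical, and the sign handling is an assumption added here because an eigensolver may return either v or −v.

```python
import numpy as np

def top_samples_per_cluster(eigvecs, num_clusters=3, samples_per_cluster=9):
    """For each of the leading eigenvectors (columns of eigvecs), return the
    indices of the samples with the largest entries."""
    clusters = []
    for j in range(num_clusters):
        v = eigvecs[:, j]
        # Orient the eigenvector so that its dominant entry is positive.
        if v[np.argmax(np.abs(v))] < 0:
            v = -v
        clusters.append(np.argsort(-v)[:samples_per_cluster])
    return clusters

# Toy example: 4 samples, 2 eigenvectors (columns).
toy_eigvecs = np.array([
    [0.9,  0.0],
    [0.1, -0.8],
    [0.4,  0.1],
    [0.0, -0.6],
])
toy_clusters = top_samples_per_cluster(toy_eigvecs, num_clusters=2, samples_per_cluster=2)
print([c.tolist() for c in toy_clusters])
```

With the defaults, the function returns nine indices for each of the three leading eigenvectors; passing 10 and 100 instead gives the larger groups that are typically visualized with UMAP.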
2000 independent random Fourier features for a Gaussian kernel. To determine the Gaussian kernel bandwidth σ, we followed the kernel-based evaluation of generative models in (Jalali et al., 2023; Pasarkar & Dieng, 2024) and selected the embeddings’ bandwidths such that the difference between the top eigenvalues is less than 0.01. We provide the detailed SPEC algorithm in Algorithm 1. The experiments were performed on two RTX-4090 GPUs.

SPEC comparison of different embeddings. To evaluate SPEC, we compared various image embeddings using the AFHQ dataset. As shown in Figure 2, we employed SPEC for pairwise comparisons to analyze the difference between these embeddings. We reported the top 9 images that correspond to the maximum entries of the top three eigenvectors in the SPEC approach. Subsequently, we found and visualized the top 100 samples (with maximum entries) from each of the top 10 eigenvectors (i.e. SPEC-identified clusters). To confirm whether these samples were clustered by the first embedding and not by the second embedding, we used UMAP maps (McInnes et al., 2018) to validate the SPEC-identified differently captured sample groups by the two embeddings. In the Appendix, we further provide the t-SNE (Van der Maaten & Hinton, 2008) and PaCMAP (Wang et al., 2021) plots of the FFHQ and AFHQ experiments. Also, we have analyzed the found clusters using violin plots to visualize normalized distances between data points within each cluster. The plots also suggest that the first embedding can cluster the points more strongly compared to the second embedding. Also, we ran the K-means clustering algorithm 50 times on each of the embedding’s features and computed the averaged (across the 50 runs) Adjusted Mutual Information (AMI) (Vinh et al., 2009) between the K-means labels and the SPEC-identified labels. The results indicate that the first embedding aligns more strongly with K-means labels.

Furthermore, to highlight clustering differences between embeddings, we conducted a sanity check on two of the top five SPEC clusters from the DINOv2 - CLIP on AFHQ. We computed the center of the top four images in each cluster in both DINOv2 and CLIP embeddings. Then, we calculated the cosine similarity between the center and a set of eight test images: four additional images from the same cluster and four random images that do not belong to the cluster. As shown in Figure 12, DINOv2 separates the cluster images well from random samples, assigning the highest similarity scores to cluster-specific samples while keeping random samples significantly lower. However, in CLIP, some random images rank higher in similarity than the cluster-specific samples. A similar experiment was performed on SPEC clusters from the DINOv2 - CLIP on the
[Figure 3 panels: Comparison of top 10 SPEC-identified clusters for CLIP - DINOv2 and for DINOv2 - CLIP; UMAP visualizations of CLIP and DINOv2; Normalized intra-cluster distances of the top 10 SPEC-identified clusters for CLIP - DINOv2 and for DINOv2 - CLIP. Panel titles, in order: CLIP KMeans & SPEC AMI: 0.68 ± 0.0005; DINOv2 KMeans & SPEC AMI: 0.16 ± 0.0001; DINOv2 KMeans & SPEC AMI: 0.96 ± 0.0005; CLIP KMeans & SPEC AMI: 0.33 ± 0.0016.]
Figure 3. Top 4 SPEC-identified clusters comparing CLIP and DINOv2 embeddings on 10 ImageNet classes with overlaid text labels. The
last row shows the UMAP representation of the top 10 SPEC-identified clusters for each embedding.
FFHQ dataset (Figure 13). Details are in Appendix B.4.

To further evaluate SPEC’s performance in comparing embeddings, we apply a typographic attack on CLIP embeddings. As studied by Materzynska et al. (2022), CLIP prioritizes text added to a custom dataset over the image content. We selected 10 classes from the ImageNet-1K dataset and overlaid different text labels directly onto the images. The top four SPEC-identified clusters are presented in Figure 3, where we observe that CLIP clusters are based on the overlaid text, whereas DINOv2 clusters them based on visual features. Additionally, the top 10 principal clusters in CLIP are not well-clustered by DINOv2, and vice versa, demonstrating that SPEC effectively highlights differences between embeddings. We compared CLIP and DINOv2 embeddings under the same settings on the ImageWoof dataset (Howard, 2019), which consists of various dog breeds from ImageNet-1K. SPEC principal clusters show that DINOv2 primarily clusters images based on dog breeds, whereas CLIP groups them based on the animals’ gestures. Additional details are provided in Figure 11 of Appendix B.3.

SPEC comparison of embeddings on different image and text datasets. To check the performance of SPEC on text embeddings, we generated 10K samples from GPT-4o across different categories, including profession, object, gender, and emotion. We compared CLIP and RoBERTa text embeddings in Figure 4 and observed that the top four clusters in CLIP focused on objects in sentences, while RoBERTa clustered based on profession and gender. We
[Figure 4 panels: SPEC Principal Clusters for CLIP - RoBERTa; UMAP visualization of top 10 SPEC clusters (CLIP, RoBERTa).]
Cluster #1 (eigenval=0.026):
  A serious female artist is adjusting a camera.
  A determined male engineer is holding a camera.
  A happy female doctor is adjusting a photograph.
  A calm female pilot is designing a photograph.
Cluster #2 (eigenval=0.014):
  A calm female teacher is sitting an airplane.
  A focused female doctor is adjusting an airplane.
  A curious male teacher is adjusting an airplane.
  A calm female doctor is inspecting an airplane.
Cluster #3 (eigenval=0.009):
  A serious female doctor is carrying a frying pan.
  An angry male carpenter is adjusting a frying pan.
  A determined female chef is adjusting a frying pan.
  A tired male artist is adjusting a frying pan.
Figure 4. Top 4 SPEC-identified clusters by comparing CLIP and RoBERTa text embeddings on a dataset generated from GPT-4o.
Figure 5. Comparison of Kernel matrices after using SPEC-align to match the sample clusters of CLIP to DINOv2.
also noted from the UMAP visualization of the samples that the top clusters of each embedding are not well clustered by the other, indicating that they focus on different aspects of the sentences. We also compared CLIP with E5 and observed the same results in Figure 9 of the Appendix B.2. In Figure 14, we compared different image embeddings on the MS-COCO 2017 training set with 120K samples, which we discuss in the Appendix B.3.

Aligning embeddings using SPEC-align. In this section, we discuss how to use the SPEC-align method to align the differential kernel covariance of two embeddings. As observed in the comparison of CLIP and DINOv2 in Figures 3 and 11, DINOv2 successfully captures certain clusters that CLIP fails to distinguish. To enhance CLIP’s performance in these tasks, we aligned CLIP with DINOv2 using the ImageNet training set. Specifically, we incorporated an alignment term into the CLIP loss function, as formulated in (6), and computed the gradient using (7). The learning parameters are detailed in the Appendix B.5.

We provide the kernel matrices for the four clusters in this experiment in Figure 5, corresponding to the results in Figure 3. Notably, the SPEC-aligned CLIP kernel captures the top four clusters based on image content rather than the overlaid text labels. The clusters and their UMAP visualizations are shown in Figure 21. To further evaluate SPEC-align’s performance, we conducted an experiment similar to (Oquab et al., 2024), where feature quality was assessed by training a simple classifier on a frozen backbone without fine-tuning its weights. In this setting, SPEC-align CLIP achieved 73.93% top-1 accuracy on ImageNet-1K, outperforming the standard CLIP model, which reached 67.20%. For reference, DINOv2 achieved 78.99% on the same task, indicating that SPEC-align brings CLIP substantially closer to DINOv2 performance.

7. Conclusion
In this paper, we proposed the spectral SPEC approach to the comparison of embedding maps. The SPEC method aims to identify groups of samples that are clustered by one embedding model but not grouped by another model. We formulated a scalable algorithm with O(n) computations to apply SPEC to a dataset of size n. We also discussed the application of SPEC for measuring the mismatches of two embeddings and their alignment. We note that the SPEC approach operates based on the assumption that the differently clustered samples can be detected by the spectral method. Extending the clustering-based approach to non-spectral clustering frameworks will be interesting for future exploration. In addition, extending the framework to compare cross-modal embeddings such as CLIP, BLIP, and ALIGN will be a future direction for this work.
Eslami, S. and de Melo, G. Mitigate the gap: Improving cross-modal alignment in CLIP. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=aPTGvFqile.
Gao, T., Yao, X., and Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL https://aclanthology.org/2021.emnlp-main.552/.
Gedon, D., Ribeiro, A. H., Wahlström, N., and Schön, T. B. Invertible kernel PCA with random Fourier features. IEEE Signal Processing Letters, 30:563–567, 2023.
Ghashami, M., Perry, D. J., and Phillips, J. Streaming kernel principal component analysis. In Artificial intelligence and statistics, pp. 1365–1374. PMLR, 2016.
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., and Misra, I. Imagebind: One embedding space to bind them all. In CVPR, 2023.
Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with Wasserstein Procrustes. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 1880–1890. PMLR, 16–18 Apr 2019. URL https://proceedings.mlr.press/v89/grave19a.html.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., and Yue, X. Onellm: One framework to align all modalities with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
Howard, J. ImageWoof: a subset of 10 classes from ImageNet that aren't so easy to classify, March 2019. URL https://github.com/fastai/imagenette#imagewoof.
Jalali, M., Li, C. T., and Farnia, F. An information-theoretic evaluation of generative models in learning multi-modal distributions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=PdZhf6PiAb.
Jalali, M., Ospanov, A., Gohari, A., and Farnia, F. Conditional Vendi Score: An information-theoretic approach to diversity evaluation of prompt-based generative models. arXiv preprint arXiv:2411.02817, 2024. URL https://arxiv.org/abs/2411.02817.
Jiralerspong, M., Bose, J., Gemp, I., Qin, C., Bachrach, Y., and Gidel, G. Feature likelihood score: Evaluating the generalization of generative models using samples. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=l2VKZkolT7.
Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4396–4405, 2019. doi: 10.1109/CVPR.2019.00453.
Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pp. 1895–1904. PMLR, 2017.
Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., and Lehtinen, J. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4oXTQ6m_ws8.
Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. Microsoft COCO: Common objects in context, 2015.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pp. 34892–34916. Curran Associates, Inc., 2023.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2020. URL https://openreview.net/forum?id=SyxS0T4tvS.
Lu, S., Li, Y., Chen, Q.-G., Xu, Z., Luo, W., Zhang, K., and Ye, H.-J. Ovis: Structural embedding alignment for multimodal large language model. arXiv:2405.20797, 2024.
Materzynska, J., Torralba, A., and Bau, D. Disentangling visual and written concepts in CLIP. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. MTEB: Massive text embedding benchmark. In Vlachos, A. and Augenstein, I. (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148/.
Pasarkar, A. P. and Dieng, A. B. Cousins of the vendi score: A family of similarity-based diversity metrics for science and machine learning. In International Conference on Artificial Intelligence and Statistics, pp. 3808–3816. PMLR, 2024.
Perone, C. S., Silveira, R., and Paula, T. S. Evaluation of sentence embeddings in downstream and linguistic probing tasks. ArXiv, abs/1806.06259, 2018. URL https://api.semanticscholar.org/CorpusID:49306018.
Pimentel, T., Valvoda, J., Maudslay, R. H., Zmigrod, R., Williams, A., and Cotterell, R. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4609–4622, Online, July 2020. Association for Computational Linguistics.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007a.
France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.594/.
Schölkopf, B., Smola, A., and Müller, K.-R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5):1299–1319, July 1998. ISSN 0899-7667. doi: 10.1162/089976698300017467. URL https://ieeexplore.ieee.org/document/6790375.
Simhi, A. and Markovitch, S. Interpreting embedding spaces by conceptualization. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=sPpft5DQJN.
Sriperumbudur, B. K. and Sterge, N. Approximate kernel PCA: Computational versus statistical trade-off. The Annals of Statistics, 50(5):2713–2736, 2022.
Stein, G., Cresswell, J., Hosseinzadeh, R., Sui, Y., Ross, B., Villecroze, V., Liu, Z., Caterini, A. L., Taylor, E., and Loaiza-Ganem, G. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems, volume 36, pp. 3732–3784. Curran Associates, Inc., 2023.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368, 2023a.
Wang, Z., Zhao, Y., Cheng, X., Huang, H., Liu, J., Tang, L., Li, L., Wang, Y., Yin, A., Zhang, Z., and Zhao, Z. Connecting multi-modal contrastive representations. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23. Curran Associates Inc., 2023b.
Ye, Y., Xiao, S., Zeng, X., and Zeng, W. Modalchorus: Visual probing and alignment of multi-modal embeddings via modal fusion map. IEEE Transactions on Visualization and Computer Graphics, 2024.
Yin, Y., Zhao, Y., Zhang, Y., Lin, K., Wang, J., Tao, X., Wan, P., Zhang, D., Yin, B., and Zhang, W. Sea: Supervised embedding alignment for token-level visual-textual integration in mllms. arXiv preprint arXiv:2408.11813, 2024.
Zhang, J., Li, C. T., and Farnia, F. An interpretable evaluation of entropy-based novelty of generative models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 59148–59172. PMLR, 21–27 Jul 2024.
Zhang, J., Jalali, M., Li, C. T., and Farnia, F. Unveiling differences in generative models: A scalable differential clustering approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. URL https://arxiv.org/abs/2405.02700.
A. Proofs
A.1. Proof of Theorem 1
For simplicity, we adopt the following notations in this proof:
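\[
K^{(1)}_{11} := \tfrac{1}{n} K_{\psi_1}[I, I], \quad K^{(1)}_{12} := \tfrac{1}{n} K_{\psi_1}[I, I^c], \quad K^{(1)}_{22} := \tfrac{1}{n} K_{\psi_1}[I^c, I^c],
\]
\[
K^{(2)}_{11} := \tfrac{1}{n} K_{\psi_2}[I, I], \quad K^{(2)}_{12} := \tfrac{1}{n} K_{\psi_2}[I, I^c], \quad K^{(2)}_{22} := \tfrac{1}{n} K_{\psi_2}[I^c, I^c],
\]
\[
\Lambda_{11} := \Lambda_{\psi_1,\psi_2}[I, I], \quad \Lambda_{12} := \Lambda_{\psi_1,\psi_2}[I, I^c], \quad \Lambda_{22} := \Lambda_{\psi_1,\psi_2}[I^c, I^c].
\]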
As a result, the following holds by definition:
\[
\Lambda_{11} = K^{(1)}_{11} - K^{(2)}_{11}, \quad \Lambda_{12} = K^{(1)}_{12} - K^{(2)}_{12}, \quad \Lambda_{22} = K^{(1)}_{22} - K^{(2)}_{22}.
\]
According to Condition 1, we know the Frobenius norm bound $\|K^{(1)}_{12}\|_F \le \epsilon_1$. Also, due to Condition 2, we know the $\ell_2$-operator norm bound $\|K^{(2)}_{11}\|_2 \le \epsilon_2$. Note that the matrix $\frac{1}{n} K_{\psi_2}$ is positive semi-definite (PSD). Therefore, the Schur complement of its block representation following indices in $I$ and $I^c = \{1, \ldots, n\} \setminus I$ must be a PSD matrix, i.e.,
\[
K^{(2)}_{22} - K^{(2)\top}_{12} \bigl(K^{(2)}_{11}\bigr)^{-1} K^{(2)}_{12} \succeq 0.
\]
Therefore, the above Schur complement has a non-negative trace, implying that
\[
\mathrm{Tr}\Bigl(K^{(2)}_{22} - K^{(2)\top}_{12} \bigl(K^{(2)}_{11}\bigr)^{-1} K^{(2)}_{12}\Bigr) \ge 0 \;\Longrightarrow\; \mathrm{Tr}\Bigl(K^{(2)\top}_{12} \bigl(K^{(2)}_{11}\bigr)^{-1} K^{(2)}_{12}\Bigr) \le \mathrm{Tr}\bigl(K^{(2)}_{22}\bigr).
\]
Now, since $\Lambda_{\psi_1,\psi_2}$ is a symmetric matrix, we can apply spectral decomposition to write it as $\Lambda_{\psi_1,\psi_2} = V \mathrm{diag}(\lambda) V^\top$, where every column $v_i$ of matrix $V$ is an eigenvector of $\Lambda_{\psi_1,\psi_2}$ with corresponding eigenvalue $\lambda_i$, sorted as $\lambda_1 \ge \cdots \ge \lambda_n$. Note that the eigenvalues are real and sum up to 0 as the trace of $\Lambda_{\psi_1,\psi_2}$ is zero. Then, we can write:
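\begin{align*}
\sum_{i=1}^{n} \bigl\| \widetilde{\Lambda}_{\psi_1,\psi_2} v_i - \Lambda_{\psi_1,\psi_2} v_i \bigr\|_2^2
&= \sum_{i=1}^{n} \bigl\| \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) v_i \bigr\|_2^2 \\
&= \sum_{i=1}^{n} v_i^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) v_i \\
&= \sum_{i=1}^{n} \mathrm{Tr}\Bigl( v_i^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) v_i \Bigr) \\
&= \sum_{i=1}^{n} \mathrm{Tr}\Bigl( v_i v_i^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) \Bigr) \\
&= \mathrm{Tr}\Bigl( \Bigl(\sum_{i=1}^{n} v_i v_i^\top\Bigr) \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) \Bigr) \\
&= \mathrm{Tr}\Bigl( \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr)^\top \bigl(\widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2}\bigr) \Bigr) \\
&= \bigl\| \widetilde{\Lambda}_{\psi_1,\psi_2} - \Lambda_{\psi_1,\psi_2} \bigr\|_F^2 \\
&\le 4\bigl(\epsilon_1^2 + \epsilon_2\bigr).
\end{align*}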
In the above, the last inequality holds as we know for every PSD matrix $A$ and vector $v$, we have $\|Av - \lambda v\|_2 \ge |\lambda_j - \lambda| \, \|v\|_2$, where $\lambda_j$ is the eigenvalue of $A$ with the minimum absolute difference $|\lambda_j - \lambda|$. Therefore, the proof of Theorem 1 is complete.
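The identity at the heart of the chain above, namely that summing $\|(\widetilde{\Lambda} - \Lambda) v_i\|_2^2$ over an orthonormal eigenbasis gives $\|\widetilde{\Lambda} - \Lambda\|_F^2$, can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the matrices `lam` and `lam_tilde` are random symmetric stand-ins chosen for illustration, and the helper name `residual_sum` is hypothetical.

```python
import numpy as np

def residual_sum(lam, lam_tilde):
    """Return sum_i ||(lam_tilde - lam) v_i||_2^2 over the eigenvectors v_i of lam."""
    # np.linalg.eigh returns orthonormal eigenvectors as the columns of V.
    _, V = np.linalg.eigh(lam)
    diff = lam_tilde - lam
    # Column i of diff @ V is (lam_tilde - lam) v_i.
    return float(np.sum(np.linalg.norm(diff @ V, axis=0) ** 2))

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
lam = (M + M.T) / 2                      # random symmetric stand-in for Lambda
E = rng.standard_normal((n, n))
lam_tilde = lam + 0.1 * (E + E.T) / 2    # perturbed stand-in for tilde-Lambda

lhs = residual_sum(lam, lam_tilde)
rhs = float(np.linalg.norm(lam_tilde - lam, "fro") ** 2)
print(abs(lhs - rhs))  # difference between the two sides of the identity
```

The two sides agree up to floating-point round-off because the eigenvectors of a symmetric matrix form an orthonormal basis, so $\sum_i v_i v_i^\top = I$.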
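\[
\Lambda_{\psi_1,\psi_2} = \frac{1}{n} \begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix} \begin{bmatrix} \Phi_{\psi_1}^\top \\ -\Phi_{\psi_2}^\top \end{bmatrix}
\]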
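In the above, we define $\Phi_{\psi_1} \in \mathbb{R}^{n \times d_1}$ to be the embedding of dataset $x_1, \ldots, x_n$ with embedding map $\psi_1$, i.e., its $i$th row will be $\phi_1(\psi_1(x_i))$, and similarly we let $\Phi_{\psi_2} \in \mathbb{R}^{n \times d_2}$ be the embedding of the dataset with embedding map $\psi_2$, with its $i$th row being $\phi_2(\psi_2(x_i))$. Therefore, if we define $A = \begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix}$ and $B = \frac{1}{n} \begin{bmatrix} \Phi_{\psi_1}^\top \\ -\Phi_{\psi_2}^\top \end{bmatrix}$, then we will have
\[
\Lambda_{\psi_1,\psi_2} = AB.
\]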
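On the other hand, we know that for every matrix $A \in \mathbb{R}^{n \times (d_1 + d_2)}$ and $B \in \mathbb{R}^{(d_1 + d_2) \times n}$, $AB$ and $BA$ share the same non-zero eigenvalues. In this case, the matrix $BA$ sharing the non-zero eigenvalues with $\Lambda_{\psi_1,\psi_2} = AB$ can be calculated as
\begin{align*}
BA &= \frac{1}{n} \begin{bmatrix} \Phi_{\psi_1}^\top \\ -\Phi_{\psi_2}^\top \end{bmatrix} \begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix} \\
&= \begin{bmatrix} \frac{1}{n} \Phi_{\psi_1}^\top \Phi_{\psi_1} & \frac{1}{n} \Phi_{\psi_1}^\top \Phi_{\psi_2} \\ -\frac{1}{n} \Phi_{\psi_2}^\top \Phi_{\psi_1} & -\frac{1}{n} \Phi_{\psi_2}^\top \Phi_{\psi_2} \end{bmatrix} \\
&= \begin{bmatrix} C_{\psi_1} & C_{\psi_1,\psi_2} \\ -C_{\psi_1,\psi_2}^\top & -C_{\psi_2} \end{bmatrix} \\
&= \Gamma_{\psi_1,\psi_2}.
\end{align*}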
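In addition, for every eigenvector $v$ (corresponding to a non-zero eigenvalue) of $\Gamma_{\psi_1,\psi_2} = BA$, we have that $u = Av$ is an eigenvector of $\Lambda_{\psi_1,\psi_2} = AB$, which is
\[
u = \begin{bmatrix} \Phi_{\psi_1} & \Phi_{\psi_2} \end{bmatrix} v = \begin{bmatrix} \phi_1(\psi_1(x_1)) & \phi_2(\psi_2(x_1)) \\ \vdots & \vdots \\ \phi_1(\psi_1(x_n)) & \phi_2(\psi_2(x_n)) \end{bmatrix} v.
\]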
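The argument above can be illustrated numerically: the small $(d_1 + d_2) \times (d_1 + d_2)$ matrix $BA$ carries the non-zero spectrum of the $n \times n$ matrix $AB$, and $u = Av$ maps an eigenvector of the former to an eigenvector of the latter. The sketch below uses NumPy with random Gaussian matrices `phi1` and `phi2` as stand-ins for $\Phi_{\psi_1}$ and $\Phi_{\psi_2}$; the sizes and variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 50, 3, 4
phi1 = rng.standard_normal((n, d1))    # stand-in for Phi_{psi_1}
phi2 = rng.standard_normal((n, d2))    # stand-in for Phi_{psi_2}

A = np.hstack([phi1, phi2])            # shape n x (d1 + d2)
B = np.vstack([phi1.T, -phi2.T]) / n   # shape (d1 + d2) x n

big = A @ B      # n x n, plays the role of Lambda
small = B @ A    # (d1 + d2) x (d1 + d2), plays the role of Gamma

eigvals, eigvecs = np.linalg.eig(small)
# The non-zero eigenvalues are real because `big` is symmetric;
# taking .real only discards round-off imaginary parts, if any.
k = int(np.argmax(np.abs(eigvals)))
lam = eigvals[k].real
v = eigvecs[:, k].real

u = A @ v                                      # lifted eigenvector of `big`
residual = np.linalg.norm(big @ u - lam * u)   # should be at round-off level
print(residual)
```

Working with `small` instead of `big` is what keeps the cost independent of an $n \times n$ eigendecomposition: only products with the $n \times (d_1 + d_2)$ feature matrix are needed.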
\[
\widehat{\phi}(x) = \frac{1}{\sqrt{m}} \Bigl[ \cos(\omega_1^\top x), \sin(\omega_1^\top x), \ldots, \cos(\omega_m^\top x), \sin(\omega_m^\top x) \Bigr]
\]
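Based on the assumption, the shift-invariant kernel $k(x, y) = \kappa(x - y)$ is normalized, i.e., $k(x, x) = 1$ for every $x \in \mathcal{X}$. Therefore, the Fourier synthesis equation gives
\[
k(x, y) = \mathbb{E}_{\omega \sim \widehat{\kappa}} \bigl[ \cos(\omega^\top (x - y)) \bigr] = \mathbb{E}_{\omega \sim \widehat{\kappa}} \bigl[ \cos(\omega^\top x) \cos(\omega^\top y) + \sin(\omega^\top x) \sin(\omega^\top y) \bigr].
\]
The RFF-proxy kernel function can be viewed as
\[
\widehat{k}(x, y) = \frac{1}{m} \sum_{i=1}^{m} \cos(\omega_i^\top (x - y)).
\]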
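As a result, if we consider kernel matrix $K_{\psi_1,\omega_i}$ where $k_{\psi_1,\omega_i}(x, y) = \cos(\omega_i^\top (\psi_1(x) - \psi_1(y)))$, we can simplify the proxy kernel matrix as
\[
\frac{1}{n} \widehat{K}_{\psi_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n} K_{\psi_1,\omega_i}.
\]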
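To make the RFF-proxy kernel concrete, the sketch below builds the random Fourier feature map and compares the resulting proxy kernel matrix with the exact kernel. It assumes a Gaussian kernel $\exp(-\|x - y\|^2 / (2\sigma^2))$, for which the frequencies are drawn from $\mathcal{N}(0, I/\sigma^2)$; the bandwidth, sample sizes, and the helper name `rff_features` are illustrative choices, not values from the paper.

```python
import numpy as np

def rff_features(x, omegas):
    """Random Fourier features of the rows of x, scaled by 1/sqrt(m).

    The cos and sin coordinates are grouped rather than interleaved;
    the inner products, and hence the proxy kernel, are unchanged.
    """
    proj = x @ omegas.T                       # shape (n, m)
    m = omegas.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m, sigma = 40, 5, 20000, 1.5
x = rng.standard_normal((n, d))               # stand-in for embedded samples
omegas = rng.standard_normal((m, d)) / sigma  # frequencies for the Gaussian kernel

feats = rff_features(x, omegas)
k_proxy = feats @ feats.T                     # (1/m) * sum_i cos(w_i^T (x - y))

sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
k_exact = np.exp(-sq_dists / (2 * sigma ** 2))
max_err = float(np.max(np.abs(k_proxy - k_exact)))
print(max_err)  # entrywise gap, shrinking roughly like 1/sqrt(m)
```

The diagonal of `k_proxy` equals 1 exactly, since $\cos^2 + \sin^2 = 1$ for every frequency, which matches the normalization $k(x, x) = 1$ assumed above.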
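\begin{align*}
&= \bigl\| \Lambda_{\psi_1,\psi_2} - \widehat{\Lambda}_{\psi_1,\psi_2} \bigr\|_F^2 \\
&\le 2 \Bigl\| \tfrac{1}{n} K_{\psi_1} - \tfrac{1}{n} \widehat{K}_{\psi_1} \Bigr\|_F^2 + 2 \Bigl\| \tfrac{1}{n} K_{\psi_2} - \tfrac{1}{n} \widehat{K}_{\psi_2} \Bigr\|_F^2
\end{align*}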
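The last line in the above inequalities follows from Young's inequality, which shows that $\|A + B\|_F^2 \le 2\|A\|_F^2 + 2\|B\|_F^2$. Then, setting $\delta = 2 \exp\bigl(\frac{8 - m\epsilon^2}{32}\bigr)$, implying $\epsilon = \sqrt{\frac{32 \log(2e^{-1/4}/\delta)}{m}}$, we will have
\[
P\Biggl( \Bigl\| \tfrac{1}{n} \widehat{K}_{\psi_1} - \tfrac{1}{n} K_{\psi_1} \Bigr\|_F \le \sqrt{\frac{32 \log(2e^{-1/4}/\delta)}{m}} \Biggr) \ge 1 - \frac{\delta}{2}, \qquad
P\Biggl( \Bigl\| \tfrac{1}{n} \widehat{K}_{\psi_2} - \tfrac{1}{n} K_{\psi_2} \Bigr\|_F \le \sqrt{\frac{32 \log(2e^{-1/4}/\delta)}{m}} \Biggr) \ge 1 - \frac{\delta}{2},
\]
where by applying the union bound we can show
\[
P\Biggl( \Bigl\| \tfrac{1}{n} \widehat{K}_{\psi_1} - \tfrac{1}{n} K_{\psi_1} \Bigr\|_F \le \sqrt{\frac{32 \log(2e^{-1/4}/\delta)}{m}} \;\text{ and }\; \Bigl\| \tfrac{1}{n} \widehat{K}_{\psi_2} - \tfrac{1}{n} K_{\psi_2} \Bigr\|_F \le \sqrt{\frac{32 \log(2e^{-1/4}/\delta)}{m}} \Biggr) \ge 1 - \delta,
\]
which shows that
\[
P\Biggl( \sum_{i=1}^{n} \bigl\| \Lambda_{\psi_1,\psi_2} \widehat{v}_i - \lambda_i \widehat{v}_i \bigr\|_2^2 \le \frac{128 \log(2e^{-1/4}/\delta)}{m} \Biggr) \ge 1 - \delta.
\]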
\[
\Gamma_{\psi_{1,\theta},\psi_2} = U J U^{-1}
\]
In the above, matrix $U \in \mathbb{R}^{(d_1 + d_2) \times (d_1 + d_2)}$ includes the right generalized eigenvectors of matrix $\Gamma_{\psi_{1,\theta},\psi_2}$ as its columns, and $U^{-1}$ includes the left generalized eigenvectors of matrix $\Gamma_{\psi_{1,\theta},\psi_2}$ in its rows. Also, $J$ is the Jordan normal form containing one block matrix on the diagonal for every eigenvalue. Assuming that the eigenvalue with the maximum absolute value (which determines the spectral radius) has multiplicity 1, the Jordan canonical form has only one diagonal entry $\lambda_{\max}$ for the top eigenvalue, and so we can write the decomposition as:
\[
\Gamma_{\psi_{1,\theta},\psi_2} = U^{(-i_{\max})} J^{(-i_{\max})} \bigl(U^{-1}\bigr)^{(-i_{\max})} + \lambda_{\max} \, u_{\mathrm{right}} u_{\mathrm{left}}^\top
\]
Due to the bi-orthogonality of $u_{\mathrm{left}}, u_{\mathrm{right}}$ with the rest of the right and left generalized eigenvectors, respectively, we will have
\[
\lambda_{\max}\bigl(\Gamma_{\psi_{1,\theta},\psi_2}\bigr) = u_{\mathrm{left}}^\top \, \Gamma_{\psi_{1,\theta},\psi_2} \, u_{\mathrm{right}}.
\]
Taking the partial derivative with respect to $\theta$ from the above identity proves the proposition.
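The resulting gradient formula, $\partial \lambda_{\max} / \partial \theta = u_{\mathrm{left}}^\top (\partial \Gamma / \partial \theta) \, u_{\mathrm{right}}$ under the normalization $u_{\mathrm{left}}^\top u_{\mathrm{right}} = 1$, can be compared against a finite-difference estimate. The sketch below is an illustration under assumptions that are not from the paper: a toy linear embedding $\Phi_{\psi_{1,\theta}} = X(W_0 + \theta W_1)$ with a scalar parameter $\theta$, random data, and hypothetical helper names `gamma`, `dgamma`, and `top_eig_triplet`.

```python
import numpy as np

def gamma(theta, x, w0, w1, phi2):
    """Gamma = B A for a toy linear embedding phi1 = x (w0 + theta * w1)."""
    n = x.shape[0]
    phi1 = x @ (w0 + theta * w1)
    a = np.hstack([phi1, phi2])
    b = np.vstack([phi1.T, -phi2.T]) / n
    return b @ a

def dgamma(theta, x, w0, w1, phi2):
    """Analytic derivative of gamma with respect to theta (product rule)."""
    n = x.shape[0]
    phi1 = x @ (w0 + theta * w1)
    dphi1 = x @ w1
    zeros = np.zeros_like(phi2)
    a = np.hstack([phi1, phi2])
    da = np.hstack([dphi1, zeros])
    b = np.vstack([phi1.T, -phi2.T]) / n
    db = np.vstack([dphi1.T, zeros.T]) / n
    return db @ a + b @ da

def top_eig_triplet(g):
    """Return (lambda_max, u_left, u_right), normalized so u_left @ u_right = 1."""
    vals, right = np.linalg.eig(g)
    k = int(np.argmax(np.abs(vals)))
    vals_t, left = np.linalg.eig(g.T)   # eigenvectors of g^T are left eigenvectors of g
    kt = int(np.argmin(np.abs(vals_t - vals[k])))
    u_right = right[:, k].real
    u_left = left[:, kt].real
    return vals[k].real, u_left / (u_left @ u_right), u_right

rng = np.random.default_rng(0)
n, p, d1, d2 = 200, 6, 4, 3
x = rng.standard_normal((n, p))
w0 = rng.standard_normal((p, d1))
w1 = rng.standard_normal((p, d1))
phi2 = 0.3 * rng.standard_normal((n, d2))   # stand-in for the fixed embedding
theta, h = 0.3, 1e-5

lam, u_left, u_right = top_eig_triplet(gamma(theta, x, w0, w1, phi2))
analytic = float(u_left @ dgamma(theta, x, w0, w1, phi2) @ u_right)
lam_plus = top_eig_triplet(gamma(theta + h, x, w0, w1, phi2))[0]
lam_minus = top_eig_triplet(gamma(theta - h, x, w0, w1, phi2))[0]
finite_diff = (lam_plus - lam_minus) / (2 * h)
print(analytic, finite_diff)
```

The derivative terms involving $\partial u_{\mathrm{left}} / \partial \theta$ and $\partial u_{\mathrm{right}} / \partial \theta$ cancel because $u_{\mathrm{left}}^\top u_{\mathrm{right}} = 1$ is held fixed, which is why only $\partial \Gamma / \partial \theta$ appears in the formula.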
Panel labels: DINOv2, CLIP.
Figure 6. Comparison of SPEC-identified clusters across different visualization methods (PaCMAP, t-SNE, UMAP) for the experiment in Figure 2. One embedding shows clear cluster separation, while the other fails to distinguish groups.
t-SNE visualization of top 10 SPEC clusters with RoBERTa embedding t-SNE visualization of top 10 SPEC clusters with CLIP embedding
RoBERTa KMeans & SPEC NMI: 0.62 ± 0.0004 CLIP KMeans & SPEC NMI : 0.31 ± 0.0005
t-SNE visualization of top 10 SPEC clusters with CLIP embedding t-SNE visualization of top 10 SPEC clusters with RoBERTa embedding
CLIP KMeans & SPEC NMI : 0.63 ± 0.0011 RoBERTa KMeans & SPEC NMI: 0.32 ± 0.0003
Figure 7. Top 4 SPEC-identified clusters by comparing CLIP and RoBERTa text embeddings on the WikiText-2 dataset with the
visualization of the top 10 SPEC-identified clusters using t-SNE.
CLIP KMeans & SPEC AMI : 0.68 ± 0.0006 RoBERTa KMeans & SPEC AMI: 0.34 ± 0.0010
RoBERTa KMeans & SPEC AMI: 0.72 ± 0.0010 CLIP KMeans & SPEC AMI : 0.45 ± 0.0014
Figure 8. Top 4 SPEC-identified clusters by comparing CLIP and RoBERTa text embeddings on a dataset of 10K samples generated from
GPT-4o with the visualization of the top 10 SPEC-identified clusters using t-SNE and UMAP.
CLIP KMeans & SPEC AMI : 0.68 ± 0.0003 E5 KMeans & SPEC AMI: 0.43 ± 0.0022
CLIP KMeans & SPEC AMI : 0.55 ± 0.0008 E5 KMeans & SPEC AMI: 0.32 ± 0.0004
Figure 9. Top 4 SPEC-identified clusters by comparing CLIP and E5-Large-V2 text embeddings on a dataset of 10K samples generated
from GPT-4o with the visualization of the top 10 SPEC-identified clusters using t-SNE.
Figure 10. Top 4 SPEC-identified clusters by comparing RoBERTa and E5-Large-V2 text embeddings on MS-COCO 2017 train captions (~120K prompts) with the visualization of the top 10 SPEC-identified clusters using t-SNE.
Comparison of top 10 SPEC-identified clusters for CLIP DINOv2 Comparison of top 10 SPEC-identified clusters for DINOv2 CLIP
UMAP visualization of CLIP UMAP visualization of DINOv2 UMAP visualization of DINOv2 UMAP visualization of CLIP
Normalized Intra-cluster Distances of top 10 SPEC-identified clusters CLIP DINOv2 Normalized Intra-cluster Distances of top 10 SPEC-identified clusters DINOv2 CLIP
CLIP KMeans & SPEC AMI : 0.59 ± 0.0009 DINOv2 KMeans & SPEC AMI: 0.23 ± 0.0003 DINOv2 KMeans & SPEC AMI: 0.87 ± 0.0003 CLIP KMeans & SPEC AMI : 0.51 ± 0.0024
Figure 11. Top 4 SPEC-identified clusters comparing CLIP and DINOv2 embeddings on ImageNet-1k dog breeds. The last row shows the
UMAP representation of the top 10 SPEC-identified clusters for each embedding.
Panel labels: DINOv2, CLIP. Cosine similarity scores recovered from the panels: 0.9006, 0.9005, 0.8726, 0.8637, 0.1418, -0.0259, -0.0724.
Figure 12. Comparing similarity ranking for SPEC clusters in DINOv2-CLIP on the AFHQ dataset. The leftmost images show the top 4
samples of two SPEC-identified clusters. Cosine similarity is computed with 4 cluster members (green-bordered) and 4 random images
(red-bordered), sorted by score. Unlike DINOv2, CLIP ranks some random samples higher than cluster members.
Figure 13. Comparing similarity ranking for SPEC clusters in DINOv2-CLIP on the FFHQ dataset. The leftmost images show the top 4
samples of two SPEC-identified clusters. Cosine similarity is computed with 4 cluster members (green-bordered) and 4 random images
(red-bordered), sorted by score. Unlike DINOv2, CLIP ranks some random samples higher than cluster members.
Panel titles: DINOv2 - CLIP; CLIP - DINOv2; CLIP - SWAV; SWAV - CLIP; Inception - SWAV; SWAV - Inception.
Figure 14. Comparing different embeddings on 120K samples from the MS-COCO 2017 dataset.
Panel titles: DINOv2 – CLIP; CLIP – DINOv2.
DINOv2 KMeans & SPEC AMI: 0.80 ± 0.0002 CLIP KMeans & SPEC AMI : 0.38 ± 0.0004 CLIP KMeans & SPEC AMI : 0.67 ± 0.0007 DINOv2 KMeans & SPEC AMI: 0.33 ± 0.0003
Figure 15. Comparing embeddings on 70K FFHQ samples. Top numbers show SPEC cluster eigenvalues. Last two images per row
display UMAP representations of SPEC clusters for each embedding.
Panel titles: Inception – CLIP; CLIP – Inception.
Inception KMeans & SPEC AMI: 0.76 ± 0.0002 CLIP KMeans & SPEC AMI : 0.20 ± 0.0001 CLIP KMeans & SPEC AMI: 0.80 ± 0.0006 Inception KMeans & SPEC AMI : 0.25 ± 0.0002
Figure 16. Comparing embeddings on 70K FFHQ samples. Top numbers show SPEC cluster eigenvalues. Last two images per row
display UMAP representations of SPEC clusters for each embedding.
Parameter                                 Value
accum freq                                1
alignment loss weight                     0.1
batch size                                128
clip alignment contrastive loss weight    0.9
coca contrastive loss weight              1.0
distributed                               True
epochs                                    10
lr                                        1e-05
lr scheduler                              cosine
model                                     ViT-B-32
name                                      Vit-B-32 laion2b e16 freeze 5
precision                                 amp
pretrained                                laion2b e16
seed                                      0
wd                                        0.2
Panel title: SWAV DINOv2.
DINOv2 KMeans & SPEC AMI : 1.00 ± 0.0000 SWAV KMeans & SPEC AMI: 0.86 ± 0.0006
DINOv2 KMeans & SPEC AMI: 0.78 ± 0.0005 DINOv2 KMeans & SPEC AMI : 0.54 ± 0.0004
Figure 17. Comparison of different embeddings on 15K samples from the AFHQ dataset, consisting of 5K cats, 5K wildlife, and 5K dogs.
The number at the top of each image represents the eigenvalue of the corresponding SPEC cluster. The last two images in each row show
the UMAP representation of the SPEC clusters for each embedding individually.
CLIP Clusters
SPEC-align Clusters
DINOv2 Clusters
Figure 18. Top 8 Kernel-PCA (Gaussian RBF kernel) clusters for CLIP, DINOv2, and CLIP aligned with DINOv2, trained on the ImageNet
dataset.
CLIP Clusters
SPEC-align Clusters
DINOv2 Clusters
Figure 19. Top 8 Kernel-PCA (Gaussian RBF kernel) clusters for CLIP, DINOv2, and CLIP aligned with DINOv2, trained on the ImageNet
dataset.
Figure 20. Comparison of kernel matrices after using SPEC-align to match the sample clusters of CLIP to T5-XL, with SPEC-diff measured during training.
Figure 21. Comparison of Kernel matrices after using SPEC-align to align CLIP to DINOv2.