SVFit: Efficient Fine-Tuning with Singular Values
Abstract—Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks. However, fully fine-tuning these models poses substantial memory challenges, particularly

I. INTRODUCTION

the existing model parameters or an entirely new set [25]–[27]. The main goal is to retain the knowledge embedded in LPMs while adapting them to specific tasks, thereby minimizing the risk of catastrophic forgetting. Among these methods, Low-Rank Adaptation (LoRA)
Additionally, inspired by recent advances in image generation that demonstrate the effectiveness of learnable scaling factors for improved domain adaptation [32], SVFit uses the most critical singular values obtained from SVD as trainable parameters. Only the top r singular values Σr within Wr are trained, with the fundamental subspaces derived from SVD scaled to promote rapid adaptation to new domains, as illustrated in Fig. 2(c) and Fig. 3. This approach enhances the learning of new domain knowledge for downstream tasks while preserving pre-trained information and significantly reducing the number of trainable parameters.

Extensive experiments have been conducted to verify the effectiveness of SVFit across various tasks and models. Specifically, RoBERTa-base and RoBERTa-large were used for natural language understanding tasks, ViT-base and ViT-large for image classification tasks, and Stable Diffusion v1.5 for the subject-driven text-to-image task. The experimental results demonstrate that SVFit achieves superior performance with a significantly reduced number of parameters. For instance, in the image classification task using the ViT-large model, SVFit outperforms LoRA with only 0.036M trainable parameters, compared to LoRA's 0.8M.

The primary contributions of this paper are as follows:

• We present SVFit, a novel PEFT method that enhances the initialization process by utilizing SVD to initialize low-rank matrices derived from pre-trained weights. SVFit focuses on training only the most significant top-r singular values, significantly reducing the number of trainable parameters while achieving efficient fine-tuning and preserving the model's core capabilities.
• We offer a theoretical analysis to uncover the mechanisms underlying SVFit. This analysis demonstrates how leveraging singular values enables rapid adaptation by effectively capturing the essential information from pre-trained models and efficiently learning new domain-specific knowledge with minimal parameters.
• SVFit is evaluated on a range of tasks, such as natural language understanding, image classification, and subject-driven text-to-image generation. It consistently outperforms LoRA and other recent state-of-the-art techniques in terms of parameter efficiency and overall performance.

II. RELATED WORK

A. Parameter-Efficient Fine-Tuning

With the development of LPMs, adapting models with billions of parameters to specific downstream tasks has become increasingly challenging due to their complexity and computational demands [33]–[36]. Parameter-efficient fine-tuning (PEFT) has garnered significant attention in recent years for its ability to minimize the parameters and memory requirements needed while maintaining efficiency and accuracy, achieving performance comparable to full fine-tuning. Certain PEFT methods achieve fine-tuning by incorporating supplementary modules or optimizing prompts and prefixes. For instance, Adapter [37] integrates lightweight trainable parameters between pre-trained layers while maintaining fixed pre-trained weights. Prefix tuning [26] involves appending prefix parameters to the hidden states across all layers of the model. Prompt tuning [27] utilizes templates to reconstruct prompts, updating only parameters relevant to prompt comprehension. Despite their significant performance gains, these approaches unavoidably introduce additional overhead during inference.
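To make the adapter pattern concrete, here is a minimal sketch of a bottleneck adapter in the spirit of [37] (our illustration, not the reference implementation; the class name and bottleneck size are assumptions):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight trainable module inserted between frozen pre-trained layers."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # trainable down-projection
        self.up = nn.Linear(bottleneck, d_model)    # trainable up-projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen pre-trained representation
        # intact; only the small bottleneck pair is updated during fine-tuning.
        return x + self.up(self.act(self.down(x)))
```

Because such modules remain in the forward pass at inference time, they are the source of the extra overhead noted above.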
B. Low-Rank Adaptation

Low-Rank Adaptation (LoRA) [19] introduces two low-rank matrices to approximate weight updates during fine-tuning, seamlessly integrating incremental updates into pre-trained weights without causing noticeable delays in inference. To be specific, when provided with a pre-trained weight matrix W ∈ R^{d1×d2}, after full fine-tuning on a specific domain task, the new weight matrix is W + W′, where W′ represents the update containing domain knowledge. LoRA is designed to
Fig. 2. Visual comparison among LoRA, PiSSA, and SVFit. (a) LoRA introduces two low-rank matrices A and B to approximate weight updates during fine-tuning. (b) PiSSA initializes A and B with the principal components of the pre-trained weight W, freezing the residual matrix during fine-tuning. (c) SVFit initializes low-rank matrices through the SVD of W and trains only the most significant top-r singular values (for simplicity, d1 ≤ d2 is assumed).
progressively update W′ and break down W′ through the matrix multiplication of two low-rank matrices A and B,

$$W' = AB, \tag{1}$$

where $A \in \mathbb{R}^{d_1 \times r}$ and $B \in \mathbb{R}^{r \times d_2}$ with intrinsic rank $r \ll \min(d_1, d_2)$. For h = Wx, the modified computation of the forward pass can be represented as

$$h = (W + W')x = (W + AB)x. \tag{2}$$

In the initial training phase, A undergoes random Gaussian initialization, while B is initialized to zeros, as shown in Fig. 2(a). This method freezes W and specifically updates A and B, significantly reducing the number of trainable parameters for downstream tasks compared to full fine-tuning. Additionally, LoRA integrates the values of matrices A and B into W during the inference phase, ensuring that this adaptation does not introduce additional delays.
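Eqs. (1)–(2) and the post-training merge can be sketched as follows (a minimal illustration, not LoRA's reference implementation; the usual scaling factor α/r is omitted for brevity):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes h = (W + AB)x with W frozen; only A and B are trained."""

    def __init__(self, d1: int, d2: int, r: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d1, d2), requires_grad=False)  # pre-trained, frozen
        self.A = nn.Parameter(torch.randn(d1, r) * 0.02)  # random Gaussian initialization
        self.B = nn.Parameter(torch.zeros(r, d2))         # zeros, so AB = 0 at the start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.W + self.A @ self.B).T

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the learned update into W, so inference adds no extra latency.
        self.W += self.A @ self.B
        self.B.zero_()  # the update is now zero; merging twice is harmless
```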
C. LoRA's Variants

Building upon LoRA, several studies have proposed modifying the parameter update strategy. LoRA-FA [29] freezes the low-rank matrix A within LoRA, significantly reducing trainable parameters and activation memory costs without increasing computational overhead. Delta-LoRA [20] updates both low-rank matrices A and B, propagating these changes to the pre-trained weight W using the delta of their product. PiSSA [38] initializes A and B with the principal components of the original matrix W and places the remaining components into a residual matrix, which is kept frozen during fine-tuning. Another strategy to improve LoRA is to allow for the adaptable adjustment of the LoRA rank. AdaLoRA [39] utilizes SVD to parameterize incremental updates and dynamically distributes the parameter budget among weight matrices based on their importance score. SoRA [40] employs an optimizable gate with a proximal gradient method to control sparsity, expanding the optimization space and improving parameter efficiency. Similarly, Zhang et al. [41] suggest IncreLoRA, an incremental parameter allocation method that adaptively incorporates trainable parameters during training according to the importance scores of each module.

III. METHOD

In this section, we present SVFit, a novel PEFT approach for LPMs that leverages singular values for improved model adaptation. Unlike conventional methods that preserve the knowledge of LPMs by freezing the pre-trained weight matrix W, SVFit aims to embed this knowledge into low-rank matrices and retain it permanently. SVFit achieves this by performing SVD on the pre-trained weight matrix W, using the most critical singular values as trainable parameters, thereby optimizing the initialization of low-rank matrices for more effective fine-tuning.

A. Fundamental Subspaces Derived from SVD

We begin by introducing key concepts related to Singular Value Decomposition (SVD) and fundamental subspaces, a technique widely used in pattern recognition and data compression [42], [43]. SVD transforms a dataset from a high-dimensional space to a lower-dimensional space by ranking the singular values according to their significance [44], [45]. Dimensionality reduction is achieved by discarding the less significant singular values, while the remaining singular values, when combined with their corresponding singular vectors, define the reduced-dimensional space.

Definition 1 (Range space). Given a matrix W ∈ R^{d1×d2}, the range space of matrix W is the vector space spanned by the columns of W. In other words, the range space is the set of all possible linear combinations of the column vectors of W. The range space is often denoted as R(W). Formally, the range space is defined as:

$$\mathcal{R}(W) = \{Wx \mid x \in \mathbb{R}^{d_2}\} \subseteq \mathbb{R}^{d_1}. \tag{3}$$

Similarly, the range space of W^T is a subspace of R^{d2}, denoted as R(W^T). That is,

$$\mathcal{R}(W^T) = \{W^T y \mid y \in \mathbb{R}^{d_1}\} \subseteq \mathbb{R}^{d_2}. \tag{4}$$

Definition 2 (Null space). Given a matrix W ∈ R^{d1×d2}, the null space of matrix W is the set of all vectors x ∈ R^{d2} that are mapped to the zero vector in R^{d1} when multiplied by W. Formally, the null space is defined as:

$$\mathcal{N}(W) = \{x \mid Wx = 0\} \subseteq \mathbb{R}^{d_2}. \tag{5}$$
Fig. 3. Illustration of the SVD of matrix W and its fundamental subspaces: the pre-trained weight matrix W ∈ R^{d1×d2} is decomposed into singular values and vectors as W = U diag(Σ)V^T. The decomposition yields a rank-r approximation matrix Wr and a residual matrix We. Specifically, the range space of W is spanned by Ur, and its null space is spanned by Ve. Conversely, the range space of W^T is spanned by Vr, and its null space is spanned by Ue.
Similarly, the null space of matrix W^T consists of all vectors y ∈ R^{d1} that are mapped to the zero vector in R^{d2} when multiplied by W^T. The formal definition of the null space of W^T is as follows:

$$\mathcal{N}(W^T) = \{y \mid W^T y = 0\} \subseteq \mathbb{R}^{d_1}. \tag{6}$$

Given a matrix W ∈ R^{d1×d2} of rank r, its SVD is denoted as W = U diag(Σ)V^T, where diag(Σ) is a d1 × d2 diagonal matrix containing the singular values {λi}_{1≤i≤r} of W in descending order, with r ≪ min(d1, d2). The matrices U = [u1, u2, ..., u_{d1}] ∈ R^{d1×d1} and V = [v1, v2, ..., v_{d2}] ∈ R^{d2×d2} are orthogonal matrices. We can partition U and V into two parts by columns and denote these partitions as Ur, Ue and Vr, Ve, respectively:

$$U_r = [u_1, u_2, \cdots, u_r], \quad U_e = [u_{r+1}, u_{r+2}, \cdots, u_{d_1}], \tag{7}$$
$$V_r = [v_1, v_2, \cdots, v_r], \quad V_e = [v_{r+1}, v_{r+2}, \cdots, v_{d_2}]. \tag{8}$$

Then the four fundamental subspaces associated with matrix W can be obtained:

• Ur is the orthonormal basis for the range space of W, i.e., R(Ur) = R(W);
• Ue is the orthonormal basis for the null space of W^T, i.e., R(Ue) = N(W^T);
• Vr is the orthonormal basis for the range space of W^T, i.e., R(Vr) = R(W^T);
• Ve is the orthonormal basis for the null space of W, i.e., R(Ve) = N(W).

To verify that R(Ve) = N(W), one would show that the subspace spanned by Ve consists precisely of the vectors that W maps to the zero vector. Similar arguments can be applied to verify the other subspaces.

Proof. Assume the rank of the matrix W ∈ R^{d1×d2} is r and d1 ≤ d2. Thus, the singular values are ordered as λ1 ≥ λ2 ≥ ··· ≥ λr > λ_{r+1} = ··· = λ_{d1} = 0. Since

$$W = U\,\mathrm{diag}(\Sigma)\,V^T = \sum_{i=1}^{r} \lambda_i u_i v_i^T,$$

we can express Wx as $Wx = \sum_{i=1}^{r} \lambda_i u_i v_i^T x$. Let $z = \sum_{i=r+1}^{d_2} \beta_i v_i$; then, by the orthonormality of the right singular vectors,

$$Wz = \left(\sum_{i=1}^{r} \lambda_i u_i v_i^T\right)\left(\sum_{i=r+1}^{d_2} \beta_i v_i\right) = 0.$$

Therefore, the null space of W is N(W) = {x | Wx = 0} = R(Ve).

The connection between the SVD of the matrix W and the four fundamental subspaces is depicted in Fig. 3.

B. SVFit: PEFT of LPMs Using Singular Values

SVFit performs SVD on the initial pre-trained weight matrix W ∈ R^{d1×d2} within the self-attention and multilayer perceptron layers to obtain the best rank-r approximation

$$W_r = U_r\,\mathrm{diag}(\Sigma_r)\,V_r^T, \tag{9}$$

and the residual matrix

$$W_e = U_e\,\mathrm{diag}(\Sigma_e)\,V_e^T, \tag{10}$$

where Ur = [u1, u2, ···, ur] ∈ R^{d1×r} and Vr = [v1, v2, ···, vr] ∈ R^{d2×r} are the matrices of singular vectors corresponding to the top r singular values, and Ue = [u_{r+1}, u_{r+2}, ···, u_{d1}] ∈ R^{d1×(d1−r)} and Ve = [v_{r+1}, v_{r+2}, ···, v_{d2}] ∈ R^{d2×(d2−r)} are the matrices of singular vectors corresponding to the residual singular values. As discussed in Section III-A, R(Ur) = R(W), R(Vr) = R(W^T), R(Ue) = N(W^T), and R(Ve) = N(W). The singular values are arranged in descending order, with Σr = [λ1, λ2, ..., λr] and Σe = [λ_{r+1}, λ_{r+2}, ..., λ_{d1}]. Consequently, the pre-trained weight matrix W can be expressed as

$$W = W_r + W_e = U_r\,\mathrm{diag}(\Sigma_r)\,V_r^T + U_e\,\mathrm{diag}(\Sigma_e)\,V_e^T = \sum_{i=1}^{r} \lambda_i u_i v_i^T + \sum_{i=r+1}^{d_1} \lambda_i u_i v_i^T, \tag{11}$$

as shown in Fig. 2(c) and Fig. 3.
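The decomposition in Eqs. (9)–(11) can be verified numerically in a few lines (a sketch with NumPy; the random matrix merely stands in for a pre-trained weight):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 64, 128, 8
W = rng.standard_normal((d1, d2))  # stand-in for a pre-trained weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=True)
Ur, Ue = U[:, :r], U[:, r:]          # d1 x r and d1 x (d1 - r)
Vr, Ve = Vt[:r].T, Vt[r:].T          # d2 x r and d2 x (d2 - r)
Sr, Se = S[:r], S[r:]

Wr = Ur @ np.diag(Sr) @ Vr.T         # best rank-r approximation, Eq. (9)
De = np.zeros((d1 - r, d2 - r))      # rectangular diag(Sigma_e)
np.fill_diagonal(De, Se)
We = Ue @ De @ Ve.T                  # residual matrix, Eq. (10)

print(np.allclose(W, Wr + We))       # True: Eq. (11)
# Eckart-Young: the rank-r error equals the energy in the residual singular values.
print(np.isclose(np.linalg.norm(W - Wr), np.sqrt(np.sum(Se ** 2))))  # True
```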
As observed in Fig. 1, we performed SVD on the Fishstar image to reconstruct it using various subsets of singular values: the top 8, 16, 32, 64, 128, and 256 singular values, as well as the smallest 8, 16, 32, 64, 128, and 256 singular values. The results highlight that larger singular values are crucial for preserving image quality. Notably, the top 10% or even 1% of singular values account for over 99% of the total sum of singular values. This observation indicates that the pre-trained weight matrix W can be effectively approximated by focusing on the most significant singular values above a given threshold r, while the smaller singular values have minimal impact on the overall structure. Consequently, in SVFit, We is frozen during training, and the focus is placed on adapting Wr.
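This concentration is easy to reproduce for any matrix with a decaying spectrum (a sketch; the synthetic power-law spectrum below is our assumption, standing in for the pre-trained weights examined in Fig. 1):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal factors
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
spectrum = 1.0 / np.arange(1, d + 1) ** 2         # assumed fast-decaying singular values
W = U @ np.diag(spectrum) @ V.T

s = np.linalg.svd(W, compute_uv=False)            # recovered in descending order
share = np.cumsum(s) / s.sum()
k = max(1, int(0.01 * d))                         # top 1% of singular values
print(f"top 1% of singular values account for {share[k - 1]:.1%} of the total sum")
```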
Inspired by recent advancements in image generation, such as learnable scaling factors for improved domain adaptation [32], SVFit utilizes the most critical singular values obtained from SVD as the trainable parameters. Specifically, it trains only the top r singular values Σr in Wr, while scaling the fundamental subspace derived from SVD to facilitate rapid adaptation to new domains. In short, the matrices Ur, Vr, Ue, Ve, and Σe are kept frozen, and training focuses exclusively on the most significant top-r singular values Σr, as demonstrated in Fig. 2(c) and Fig. 3. This method allows features to be projected onto a low-rank subspace defined by the orthogonal columns of U and V, enabling efficient layer-wise adaptation with a reduced number of trainable parameters. Moreover, the adaptation introduces no additional inference latency, as the module can be seamlessly integrated into the original matrix post-training.
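Putting the pieces together, a minimal sketch of such a layer could look as follows (our reconstruction from the description above, not the authors' released code; the class name is an assumption, and the frozen factors are stored as buffers so they receive no gradient):

```python
import torch
import torch.nn as nn

class SVFitLinear(nn.Module):
    """Computes h = (U_r diag(sigma_r) V_r^T + W_e) x, training only sigma_r."""

    def __init__(self, W: torch.Tensor, r: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        # Frozen: top-r singular vectors and the residual matrix W_e.
        self.register_buffer("Ur", U[:, :r])
        self.register_buffer("Vr", Vh[:r])
        self.register_buffer("We", U[:, r:] @ torch.diag(S[r:]) @ Vh[r:])
        # Trainable: only the top-r singular values (r parameters in total).
        self.sigma_r = nn.Parameter(S[:r].clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        Wr = self.Ur @ torch.diag(self.sigma_r) @ self.Vr
        return x @ (Wr + self.We).T
```

Note that each adapted matrix contributes only r trainable parameters, compared with r(d1 + d2) for LoRA at the same rank.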
IV. EXPERIMENT

• Full fine-tuning involves updating the entire set of model parameters, initialized with pre-trained weights and biases, through gradient descent [53]. Although this approach is straightforward and robust, it demands substantial computational resources.
• LoRA [19] employs two low-rank matrices to learn incremental updates, reducing GPU memory cost. We replicate their experimental setup for a fair comparison.
• DyLoRA [24] dynamically selects a random rank r for LoRA modules during training.
• AdaLoRA [39] addresses the challenge of optimal rank selection for incremental updates by adaptively pruning singular values based on their magnitudes, resulting in different ranks for different layers.
• PiSSA [38], structurally similar to LoRA, initializes the adapter matrices A and B using the principal components of the original weight matrix W, while the remaining components form a residual matrix that is kept frozen during fine-tuning.

TABLE I
Hyperparameter setup of SVFit for the GLUE benchmark. All tasks share the following settings: Rank 768, Warmup Ratio 0.06, LR Scheduler Linear, Weight Decay 0.1, Trainable Matrices WQ and WV, Max Sequence Length 128.

| Hyperparameter | CoLA | STS-B | RTE | MRPC | SST-2 | QNLI |
|---|---|---|---|---|---|---|
| Epochs | 40 | 20 | 40 | 40 | 10 | 20 |
| Learning Rate | 0.01 | 0.01 | 0.01 | 0.003 | 0.001 | 0.002 |
TABLE II
Performance comparison of various fine-tuning methods on the GLUE benchmark using RoBERTa-base and RoBERTa-large models. Metrics include MCC for CoLA, PCC for STS-B, and Acc for RTE, MRPC, SST-2, and QNLI. Results are reported as the median of 5 runs with different random seeds. The highest score for each dataset is highlighted in bold, with higher values indicating better performance across all metrics.
TABLE III
Performance comparison of various fine-tuning methods on the image classification task using ViT-base and ViT-large models across different datasets. Accuracy (%) is reported after ten epochs. Avg. represents the average accuracy across all datasets. The best performance for each dataset is highlighted in bold.

| Model | Method | # Trainable Parameters | OxfordPets | CIFAR10 | DTD | EuroSAT | RESISC45 | StanfordCars | FGVC | CIFAR100 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | Head | – | 90.3 | 96.4 | 69.8 | 88.7 | 74.2 | 25.8 | 17.4 | 84.3 | 68.4 |
| Base | FT | 85.8M | 93.1 | 98.9 | 77.7 | 99.1 | 96.1 | 79.8 | 54.8 | 92.4 | 86.5 |
| Base | LoRA | 0.3M | 93.2 | 98.8 | 75.0 | 98.4 | 92.7 | 45.4 | 25.2 | 92.0 | 77.6 |
| Base | PiSSA | 0.3M | 95.9 | 98.6 | 78.7 | 98.7 | 95.5 | 67.1 | 47.6 | 91.2 | 84.2 |
| Base | SVFit | 0.018M | 97.0 | 98.8 | 80.5 | 98.6 | 93.0 | 67.2 | 47.9 | 91.6 | 84.3 |
| Large | Head | – | 91.1 | 97.8 | 73.3 | 92.6 | 82.0 | 37.9 | 24.6 | 84.3 | 73.0 |
| Large | FT | 303.3M | 94.4 | 99.2 | 81.8 | 99.0 | 96.4 | 88.9 | 68.3 | 93.6 | 90.2 |
| Large | LoRA | 0.8M | 94.8 | 99.1 | 81.8 | 98.6 | 94.7 | 73.3 | 42.3 | 94.9 | 84.9 |
| Large | PiSSA | 0.8M | 96.7 | 98.8 | 77.8 | 98.8 | 95.7 | 86.7 | 62.6 | 92.6 | 88.7 |
| Large | SVFit | 0.036M | 97.8 | 99.3 | 83.4 | 98.7 | 95.2 | 83.3 | 57.8 | 93.9 | 88.7 |
head. For both models, the rank of LoRA is set to 8, and the rank of SVFit is set to 768. Due to time constraints and budget limitations, we omit the time-intensive MNLI and QQP tasks, thereby forgoing the MNLI trick for MRPC, RTE, and STS-B tasks. Consistent with prior work [21], [60], we report the number of trainable parameters in the fine-tuned layers, explicitly excluding the classification head, which is trained in a standard way. The results are averaged over five different random seeds. Additional details are provided in Table I.

TABLE IV
Hyperparameter setup for image classification of SVFit. Both models share the following settings: Rank 768, Epochs 10, LR Scheduler Linear, Weight Decay 0.01, Trainable Matrices WQ and WV. Per-dataset learning rates:

| Model | OxfordPets | CIFAR10 | DTD | EuroSAT | RESISC45 | StanfordCars | FGVC | CIFAR100 |
|---|---|---|---|---|---|---|---|---|
| Base | 0.04 | 0.04 | 0.04 | 0.2 | 0.2 | 0.4 | 0.2 | 0.04 |
| Large | 0.06 | 0.03 | 0.03 | 0.1 | 0.07 | 0.4 | 0.3 | 0.1 |

Results. Based on the results presented in Table II, our proposed SVFit method demonstrates superior performance in several key areas compared to both traditional fine-tuning (FT) and other PEFT methods such as LoRA, DyLoRA, AdaLoRA, and PiSSA. For RoBERTa-base, SVFit achieves the highest Matthews correlation coefficient (MCC) for CoLA and the highest Pearson correlation coefficient (PCC) for STS-B, indicating its effectiveness in classification and regression tasks. While it falls slightly behind in accuracy (ACC) for MRPC and SST-2, it maintains competitive performance across all tasks, resulting in a robust overall average. For RoBERTa-large, SVFit outperforms other methods on CoLA, RTE, and MRPC and achieves comparable performance on other tasks. Its significant improvement in MCC for CoLA is noteworthy, reflecting its strength in complex language understanding tasks. The results suggest that SVFit's approach to fine-tuning, with its unique parameterization and efficient use of trainable parameters, offers a balanced and effective alternative to existing methods. Additionally, the method's reduced number of trainable parameters demonstrates its potential for achieving high performance with lower computational costs.
Fig. 4. Randomly selected samples from DreamBooth, LoRA, and SVFit for the subject-driven generation task.
Fig. 5. Performance of SVFit fine-tuning for the ViT-base model on image classification tasks across different parameter budget levels. The x-axis represents the rank, and the y-axis the evaluation metric for each dataset.
both methods. For fine-tuning, we use the dataset introduced in DreamBooth [51], which includes five or six images per subject for training.

Implementation Details. Both LoRA and our method use the same loss function as in DreamBooth. For DreamBooth and LoRA, we apply the best hyperparameter setup from the original paper [51].

Results. In Fig. 4, we present a comparative analysis of image generation using DreamBooth, LoRA, and SVFit across multiple scenarios. SVFit consistently demonstrates superior subject detail preservation and context integration. For instance, in the "vase with a colorful flower bouquet" scenario, SVFit accurately retains the intricate details of the flowers and vase while seamlessly blending them with the background. In contrast, the results from DreamBooth and LoRA exhibit noticeable artifacts and inconsistencies in subject representation and background interaction. Overall, the visual comparison clearly illustrates SVFit's enhanced capability to generate high-fidelity images that closely follow the given prompts, effectively preserving subject details and ensuring realistic environmental interactions.

E. Different Budget Levels

We analyzed the performance of SVFit fine-tuning on the ViT-base model for image classification tasks under different parameter budget levels. The specific results are shown in Fig. 5. We employed various ranks r = {8, 16, 32, 64, 128, 256, 512, 768}, corresponding to 0.2K, 0.4K, 0.8K, 1.5K, 3.1K, 6.1K, 12.3K, and 18.4K trainable parameters, respectively. For LoRA, we used a baseline rank of r = 8, corresponding to 294.9K trainable parameters. The experimental results demonstrate that our method effectively balances the number of trainable parameters and accuracy. For instance, on the OxfordPets dataset, our method achieves an accuracy of approximately 93.2% at the lowest rank of 8 while maintaining or improving accuracy with higher ranks. Similarly, for the CIFAR10 dataset, our method achieves an accuracy of around 98.8% at rank 8, with further gains observed as the rank increases. This trend is consistent across other datasets. Our method significantly improves over the baseline across various datasets and rank levels, maintaining high performance with fewer trainable parameters.
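The budget figures above follow directly from the design: each adapted matrix contributes exactly r trainable singular values. A quick sanity check of the quoted numbers (assuming, per Table IV, that WQ and WV are adapted in every layer, with the standard 12-layer ViT-base and 24-layer ViT-large depths):

```python
def svfit_params(layers: int, matrices_per_layer: int, r: int) -> int:
    # SVFit trains only the r singular values of each adapted matrix.
    return layers * matrices_per_layer * r

def lora_params(layers: int, matrices_per_layer: int, d: int, r: int) -> int:
    # LoRA trains A (d x r) and B (r x d) for each adapted matrix.
    return layers * matrices_per_layer * 2 * d * r

print(svfit_params(12, 2, 8))       # 192     ~ 0.2K   (ViT-base, rank 8)
print(svfit_params(12, 2, 768))     # 18432   ~ 0.018M (ViT-base, rank 768, Table III)
print(svfit_params(24, 2, 768))     # 36864   ~ 0.036M (ViT-large, rank 768)
print(lora_params(12, 2, 768, 8))   # 294912  ~ 294.9K (LoRA baseline at rank 8)
```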
F. Analysis of Learning Rate

Adjusting the learning rate is a crucial step in fine-tuning. Our method requires a larger learning rate than LoRA: since our initialization strategy already endows most of the model's parameters with pre-trained information, a larger learning rate helps the remaining trainable parameters adapt quickly to new tasks. We perform a learning rate search using our method, as shown in Fig. 6. A learning rate 10× greater than that used in pre-training yields the best results across multiple datasets. For instance, on the OxfordPets dataset, accuracy peaks at approximately 93.2% with the 10× learning rate, while both lower (0.5×, 1×, 5×) and higher (50×, 100×, 150×) learning rates result in decreased performance. Similarly, for EuroSAT, an accuracy of around 98.6% is achieved at 10×, with lower and higher rates underperforming. These findings demonstrate that while a substantial increase in the learning rate relative to pre-training is generally beneficial, the optimal rate can vary significantly depending on the specific dataset.
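In practice, the larger rate can be confined to the singular-value parameters via optimizer parameter groups (a sketch reusing the illustrative SVFitLinear module from Section III-B; the base rate and the AdamW choice are assumptions):

```python
import torch
import torch.nn as nn

# Tiny stand-in model: one SVFit-adapted matrix plus a classification head.
model = nn.Sequential(SVFitLinear(torch.randn(64, 64), r=8), nn.Linear(64, 10))

base_lr = 1e-3  # stand-in for the pre-training learning rate
sigma = [p for n, p in model.named_parameters() if "sigma_r" in n]
other = [p for n, p in model.named_parameters() if "sigma_r" not in n]

optimizer = torch.optim.AdamW([
    {"params": sigma, "lr": 10 * base_lr},  # 10x rate for the trainable singular values
    {"params": other, "lr": base_lr},       # e.g., the classification head
])
```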
Fig. 6. Performance of SVFit fine-tuning for the ViT-base model on image classification tasks across different learning rates. The x-axis represents the learning rate, and the y-axis the evaluation metric for each dataset.
V. CONCLUSION

In this work, we introduced SVFit, a novel PEFT method that enhances initialization by leveraging SVD to initialize low-rank matrices derived from pre-trained weights. SVFit focuses on training only the most significant top-r singular values, significantly reducing the number of trainable parameters while ensuring efficient fine-tuning and preserving the model's core capabilities. Our theoretical analysis demonstrates how this approach enables rapid adaptation by effectively capturing essential information from pre-trained models and efficiently learning new domain-specific knowledge with minimal parameters. SVFit has been evaluated across various tasks, including natural language understanding, image classification, and subject-driven text-to-image generation, consistently outperforming LoRA and other recent state-of-the-art techniques such as PiSSA in both efficiency and effectiveness. Future work will focus on extending SVFit to more complex tasks, optimizing singular value selection, and further enhancing performance through dynamic parameter budget allocation to broaden its applicability across diverse domains.

ACKNOWLEDGEMENTS

This research was partially funded by the National Natural Science Foundation of China under grants 62220106008, 62306067, and U20B2063, as well as the Sichuan Science and Technology Program under grant 2024NSFSC1463. Additional support was provided by the Sichuan Province Innovative Talent Funding Project for Postdoctoral Fellows (Project BX202311) and the China Postdoctoral Science Foundation (Project 2022M720660).

REFERENCES

[1] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv:1907.11692, 2019.
[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, ICLR, 2021.
[3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[4] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," J Mach Learn Res, vol. 21, no. 140, pp. 1–67, 2020.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018.
[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[7] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, "Tokens-to-token vit: Training vision transformers from scratch on imagenet," in IEEE International Conference on Computer Vision, ICCV, 2021, pp. 558–567.
[8] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and J. Feng, "Deepvit: Towards deeper vision transformer," arXiv:2103.11886, 2021.
[9] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point transformer," in IEEE International Conference on Computer Vision, ICCV, 2021, pp. 16259–16268.
[10] J. Wei, Y. Yang, X. Xu, J. Song, G. Wang, and H. T. Shen, "Less is better: Exponential loss for cross-modal matching," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 5271–5280, 2023.
[11] Z.-Y. Wang, X. P. Li, H. C. So, and A. M. Zoubir, "Adaptive rank-one matrix completion using sum of outer products," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 4868–4880, 2023.
[12] W. Yan, M. Yang, and Y. Li, "Robust low rank and sparse representation for multiple kernel dimensionality reduction," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 1–15, 2023.
[13] Y. Xu, J. Wei, Y. Bin, Y. Yang, Z. Ma, and H. T. Shen, "Set of diverse queries with uncertainty regularization for composed image retrieval," IEEE Trans. Circuits Syst. Video Technol., pp. 1–1, 2024.
[14] S. Kim, I. Kang, and N. Kwak, "Semantic sentence matching with densely-connected recurrent and co-attentive information," in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, vol. 33, no. 01, 2019, pp. 6586–6593.
[15] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, NeurIPS, vol. 35, pp. 27730–27744, 2022.
[16] K. Hambardzumyan, H. Khachatrian, and J. May, "Warp: Word-level adversarial reprogramming," arXiv:2101.00121, 2021.
[17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Transformers: State-of-the-art natural language processing," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2020, pp. 38–45.
[18] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nat. Mach. Intell., vol. 5, no. 3, pp. 220–235, 2023.
[19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," in International Conference on Learning Representations, ICLR, 2022.
[20] B. Zi, X. Qi, L. Wang, J. Wang, K.-F. Wong, and L. Zhang, "Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices," arXiv:2309.02411, 2023.
[21] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano, "Vera: Vector-based random matrix adaptation," in International Conference on Learning Representations, ICLR, 2024.
[22] Y. Gu, X. Han, Z. Liu, and M. Huang, "Ppt: Pre-trained prompt tuning for few-shot learning," in Annual Meeting of the Association for Computational Linguistics, ACL, 2022, pp. 8410–8423.
[23] P. Ren, C. Shi, S. Wu, M. Zhang, Z. Ren, M. de Rijke, Z. Chen, and J. Pei, "Mini-ensemble low-rank adapters for parameter-efficient fine-tuning," arXiv:2402.17263, 2024.
[24] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, "Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation," arXiv:2210.07558, 2022.
[25] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, "Learning multiple visual domains with residual adapters," Advances in Neural Information Processing Systems, NeurIPS, vol. 30, pp. 506–516, 2017.
[26] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," in Annual Meeting of the Association for Computational Linguistics, ACL, 2021.
[27] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2021, pp. 3045–3059.
[28] C. Li, H. Farkhoor, R. Liu, and J. Yosinski, "Measuring the intrinsic dimension of objective landscapes," arXiv:1804.08838, 2018.
[29] L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li, "Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning," arXiv:2308.03303, 2023.
[30] V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari, "What's hidden in a randomly weighted neural network?" in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2020, pp. 11893–11902.
[31] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, 2017.
[32] E. Xie, L. Yao, H. Shi, Z. Liu, D. Zhou, Z. Liu, J. Li, and Z. Li, "Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning," in IEEE International Conference on Computer Vision, ICCV, 2023, pp. 4230–4239.
[33] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, NeurIPS, vol. 33, 2020, pp. 1877–1901.
[34] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[35] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv:2109.01652, 2021.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, NeurIPS, 2017.
[37] Z. Lin, A. Madotto, and P. Fung, "Exploring versatile generative language model via parameter-efficient transfer learning," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2020, pp. 441–459.
[38] F. Meng, Z. Wang, and M. Zhang, "Pissa: Principal singular values and singular vectors adaptation of large language models," arXiv:2404.02948, 2024.
[39] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao, "Adaptive budget allocation for parameter-efficient fine-tuning," in International Conference on Learning Representations, ICLR, 2023.
[40] N. Ding, X. Lv, Q. Wang, Y. Chen, B. Zhou, Z. Liu, and M. Sun, "Sparse low-rank adaptation of pre-trained language models," arXiv:2311.11696, 2023.
[41] F. Zhang, L. Li, J. Chen, Z. Jiang, B. Wang, and Y. Qian, "Increlora: Incremental parameter allocation method for parameter-efficient fine-tuning," arXiv:2308.12043, 2023.
[42] C.-W. Sun, T.-Z. Huang, T. Xu, and L.-J. Deng, "Nf-3dlogtnn: An effective hyperspectral and multispectral image fusion method based on nonlocal low-fibered-rank regularization," Appl Math Model, vol. 118, pp. 780–797, 2023.
[43] T. Xu, T.-Z. Huang, L.-J. Deng, and N. Yokoya, "An iterative regularization method based on tensor subspace representation for hyperspectral image super-resolution," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–16, 2022.
[44] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," Siam Rev, vol. 53, no. 2, pp. 217–288, 2011.
[45] A. G. Akritas and G. I. Malaschonok, "Applications of singular-value decomposition (svd)," Math Comput Simul, vol. 67, no. 1-2, pp. 15–31, 2004.
[46] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," in International Conference on Learning Representations, ICLR, 2019.
[47] T. Su, D. Feng, M. Wang, and M. Chen, "Dual discriminative low-rank projection learning for robust image classification," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 12, pp. 7708–7722, 2023.
[48] H. Liu, Y. Jia, J. Hou, and Q. Zhang, "Global-local balanced low-rank approximation of hyperspectral images for classification," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, pp. 2013–2024, 2022.
[49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2022, pp. 10684–10695.
[50] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, "An image is worth one word: Personalizing text-to-image generation using textual inversion," arXiv:2208.01618, 2022.
[51] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2023, pp. 22500–22510.
[52] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Huggingface's transformers: State-of-the-art natural language processing," arXiv:1910.03771, 2019.
[53] S. Huang, D. Xu, I. E. Yen, Y. Wang, S.-E. Chang, B. Li, S. Chen, M. Xie, S. Rajasekaran, H. Liu et al., "Sparse progressive distillation: Resolving overfitting under pretrain-and-finetune paradigm," in Annual Meeting of the Association for Computational Linguistics, ACL, 2022.
[54] A. Warstadt, A. Singh, and S. R. Bowman, "Neural network acceptability judgments," Trans. Assoc. Comput. Linguist., vol. 7, pp. 625–641, 2019.
[55] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation," arXiv:1708.00055, 2017.
[56] D. Giampiccolo, B. Magnini, I. Dagan, and W. B. Dolan, "The third pascal recognizing textual entailment challenge," in Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 1–9.
[57] B. Dolan and C. Brockett, "Automatically constructing a corpus of sentential paraphrases," in Third International Workshop on Paraphrasing (IWP2005), 2005.
[58] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2013, pp. 1631–1642.
[59] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "Squad: 100,000+ questions for machine comprehension of text," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2016, pp. 2383–2392.