SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values

Chengwei Sun, Jiwei Wei∗, Yujia Wu, Yiming Shi, Shiyuan He, Zeyu Ma, Ning Xie, and Yang Yang

arXiv:2409.05926v1 [cs.LG] 9 Sep 2024

∗Corresponding author. The authors are with the Center for Future Media and the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.

Abstract—Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks. However, fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, mitigate this issue by adjusting only a small subset of parameters. Nevertheless, these methods typically employ random initialization for the low-rank matrices, which can lead to inefficiencies in gradient descent and diminished generalizability due to suboptimal starting points. To address these limitations, we propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to initialize low-rank matrices, using the most critical singular values as trainable parameters. Specifically, SVFit performs SVD on the pre-trained weight matrix to obtain the best rank-r approximation matrix, emphasizing the most critical singular values that capture over 99% of the matrix's information. These top-r singular values are then used as trainable parameters to scale the fundamental subspaces of the matrix, facilitating rapid domain adaptation. Extensive experiments across various pre-trained models in natural language understanding, text-to-image generation, and image classification tasks reveal that SVFit outperforms LoRA while requiring 16 times fewer trainable parameters.

Index Terms—Large pre-trained model, parameter-efficient fine-tuning, singular values.

I. INTRODUCTION

Large pre-trained models (LPMs), such as RoBERTa [1] with 125 million trainable parameters, ViT [2] with 354 million parameters, and LLaMA [3] with 7 billion to 65 billion parameters, have become indispensable tools in natural language processing [4]–[6] and computer vision [7]–[13], showcasing remarkable performance across a spectrum of tasks. The industry's relentless pursuit of scaling model parameters to the billion or even trillion range continues to push the boundaries of large models [14]–[16]. Nonetheless, the immense size and computational demands of these models present significant challenges for adapting them to specific downstream tasks, particularly in resource-constrained environments [17], [18].

In response to this challenge, parameter-efficient fine-tuning (PEFT) methods have emerged as a promising solution to reduce memory requirements [19]–[24]. These methods update a limited subset of parameters, either a portion of the existing model parameters or an entirely new set [25]–[27]. The main goal is to retain the knowledge embedded in LPMs while adapting them to specific tasks, thereby minimizing the risk of catastrophic forgetting. Among these methods, Low-Rank Adaptation (LoRA) [19] has garnered particular attention for its stable performance across diverse downstream tasks. LoRA is based on the hypothesis that LPMs can still learn effectively when projected onto a smaller subspace because of the low intrinsic dimensionality of the tasks [28]. It introduces two low-rank matrices, A and B, to approximate weight updates during fine-tuning. Specifically, LoRA initializes A using a Gaussian distribution and sets B to zero, as illustrated in Fig. 2(a). This initialization scheme enables incremental updates to be seamlessly integrated into the pre-trained weights without causing significant delays in inference. Building on LoRA's foundation, several subsequent approaches [20], [21], [23], [29] have adhered to this paradigm, experimenting with various initialization strategies, such as Kaiming uniform [30], to further enhance computational efficiency. However, this raises the question of whether random initialization is indeed optimal. Specifically, random initialization may prevent the main components of the pre-trained weight matrix W from being effectively updated during the initial stages of fine-tuning. This limitation can lead to inefficiencies in gradient descent, potentially resulting in suboptimal local minima and impaired generalization performance.

This paper introduces SVFit, an innovative PEFT strategy for LPMs. SVFit employs a distinctive initialization strategy: it applies SVD to the pre-trained weight matrix W, yielding two components, the best rank-r approximation matrix Wr and a residual matrix We, with We capturing the smaller singular values. Our analysis verifies that the top 10%, or even 1%, of singular values contribute over 99% of the total singular-value sum. As depicted in Fig. 1, we performed SVD on the Fishstar image: the first set of experiments arranged the singular values in descending order and reconstructed the image using the largest 8, 16, 32, 64, 128, and 256 singular values (first row), while the second set arranged them in ascending order and reconstructed the image using the smallest 8, 16, 32, 64, 128, and 256 singular values (second row). These findings underscore the critical role of the larger singular values in preserving image quality. Therefore, the pre-trained weight matrix W can be effectively approximated using only the most significant singular values above a threshold r, while the smaller singular values contribute minimally to the overall structure. As a result, Wr retains the essential pre-trained knowledge, and We is kept frozen during training.
Fig. 1. SVD-based reconstruction results of the Fishstar image (256 × 256) [31]. The first row shows the reconstruction using the top 8, 16, 32, 64, 128, and 256 largest singular values, sorted in descending order (r = 8, 16, 32, 64, 128, 256). The second row displays the reconstruction using the smallest 8, 16, 32, 64, 128, and 256 singular values, sorted in ascending order. This comparison highlights the pivotal role of dominant singular values in maintaining image quality, while the smallest singular values have minimal impact on the overall structure.
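The reconstruction experiment in Fig. 1 takes only a few lines to reproduce. The NumPy sketch below is our illustration, not the authors' code; the random array merely stands in for the Fishstar image.

    import numpy as np

    def svd_reconstruct(img: np.ndarray, k: int, largest: bool = True) -> np.ndarray:
        """Rebuild `img` from k singular triplets, the largest or the smallest (cf. Fig. 1)."""
        U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
        idx = np.arange(k) if largest else np.arange(len(s) - k, len(s))
        # Sum of the selected rank-one components: sum_i s_i * u_i v_i^T.
        return (U[:, idx] * s[idx]) @ Vt[idx, :]

    # Random data standing in for the 256 x 256 Fishstar image:
    img = np.random.default_rng(0).random((256, 256))
    top64 = svd_reconstruct(img, 64, largest=True)
    bottom64 = svd_reconstruct(img, 64, largest=False)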

Additionally, inspired by recent advances in image generation that demonstrate the effectiveness of learnable scaling factors for improved domain adaptation [32], SVFit uses the most critical singular values obtained from SVD as trainable parameters. Only the top r singular values Σr within Wr are trained, with the fundamental subspaces derived from SVD scaled to promote rapid adaptation to new domains, as illustrated in Fig. 2(c) and Fig. 3. This approach enhances the learning of new domain knowledge for downstream tasks while preserving pre-trained information and significantly reducing the number of trainable parameters.

Extensive experiments have been conducted to verify the effectiveness of SVFit across various tasks and models. Specifically, RoBERTa-base and RoBERTa-large were used for natural language understanding tasks, ViT-base and ViT-large were used for image classification tasks, and Stable Diffusion v1.5 was used for the subject-driven text-to-image task. The experimental results demonstrate that SVFit achieves superior performance with a significantly reduced number of parameters. For instance, in the image classification task using the ViT-large model, SVFit outperforms LoRA with only 0.036M trainable parameters, compared to LoRA's 0.8M.

The primary contributions of this paper are as follows:
• We present SVFit, a novel PEFT method that enhances the initialization process by utilizing SVD to initialize low-rank matrices derived from pre-trained weights. SVFit focuses on training only the most significant top-r singular values, significantly reducing the number of trainable parameters while achieving efficient fine-tuning and preserving the model's core capabilities.
• We offer a theoretical analysis to uncover the mechanisms underlying SVFit. This analysis demonstrates how leveraging singular values enables rapid adaptation by effectively capturing the essential information from pre-trained models and efficiently learning new domain-specific knowledge with minimal parameters.
• SVFit is evaluated on a range of tasks, such as natural language understanding, image classification, and subject-driven text-to-image generation. It consistently outperforms LoRA and other recent state-of-the-art techniques in terms of parameter efficiency and overall performance.

II. RELATED WORK

A. Parameter-Efficient Fine-Tuning

With the development of LPMs, adapting models with billions of parameters to specific downstream tasks has become increasingly challenging due to their complexity and computational demands [33]–[36]. Parameter-efficient fine-tuning (PEFT) has garnered significant attention in recent years for its ability to minimize the parameter and memory requirements of adaptation while maintaining efficiency and accuracy, achieving performance comparable to full fine-tuning. Certain PEFT methods achieve fine-tuning by incorporating supplementary modules or optimizing prompts and prefixes. For instance, Adapter [37] integrates lightweight trainable parameters between pre-trained layers while keeping the pre-trained weights fixed. Prefix tuning [26] appends prefix parameters to the hidden states across all layers of the model. Prompt tuning [27] utilizes templates to reconstruct prompts, updating only parameters relevant to prompt comprehension. Despite their significant performance gains, these approaches unavoidably introduce additional overhead during inference.
Fig. 2. Visual comparison among LoRA, PiSSA, and SVFit. (a) LoRA introduces two low-rank matrices A and B to approximate weight updates during fine-tuning. (b) PiSSA initializes A and B with the principal components of the pre-trained weight W, freezing the residual matrix during fine-tuning. (c) SVFit initializes low-rank matrices through SVD of W and trains only the most significant top-r singular values (for simplicity, d1 ≤ d2 is assumed).

B. Low-Rank Adaptation

Low-Rank Adaptation (LoRA) [19] introduces two low-rank matrices to approximate weight updates during fine-tuning, seamlessly integrating incremental updates into the pre-trained weights without causing noticeable delays in inference. To be specific, given a pre-trained weight matrix W ∈ R^{d1×d2}, after full fine-tuning on a specific domain task the new weight matrix is W + W′, where W′ represents the update containing domain knowledge. LoRA progressively updates W′, decomposing it as the product of two low-rank matrices A and B,

    W′ = AB,    (1)

where A ∈ R^{d1×r} and B ∈ R^{r×d2}, with intrinsic rank r ≪ min(d1, d2). For h = Wx, the modified computation of the forward pass can be represented as

    h = (W + W′)x = (W + AB)x.    (2)

In the initial training phase, A undergoes random Gaussian initialization while B is initialized to zeros, as shown in Fig. 2(a). This method freezes W and updates only A and B, significantly reducing the number of trainable parameters for downstream tasks compared with full fine-tuning. Additionally, LoRA integrates the values of matrices A and B into W for the inference phase, ensuring that the adaptation does not introduce additional delays.
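As a concrete reading of Eqs. (1) and (2), a minimal PyTorch-style sketch of such a layer might look as follows. This is our illustration, not code from the paper; the class name and the 0.02 Gaussian scale are placeholders.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Sketch of a LoRA layer following Eqs. (1)-(2): h = (W + AB)x."""

        def __init__(self, d1: int, d2: int, r: int):
            super().__init__()
            # Frozen pre-trained weight W in R^{d1 x d2} (loaded from a checkpoint in practice).
            self.W = nn.Parameter(torch.empty(d1, d2), requires_grad=False)
            # A in R^{d1 x r}: random Gaussian initialization (0.02 is a placeholder scale).
            self.A = nn.Parameter(0.02 * torch.randn(d1, r))
            # B in R^{r x d2}: zeros, so the update AB vanishes at the start of training.
            self.B = nn.Parameter(torch.zeros(r, d2))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (..., d2). h = Wx + A(Bx), computed without materializing AB.
            return x @ self.W.T + (x @ self.B.T) @ self.A.T

        @torch.no_grad()
        def merge(self) -> None:
            # Fold AB into W after training, so inference incurs no extra latency.
            self.W += self.A @ self.B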
C. LoRA's Variants

Building upon LoRA, several studies have proposed modifying the parameter update strategy. LoRA-FA [29] freezes the low-rank matrix A within LoRA, significantly reducing trainable parameters and activation memory costs without increasing computational overhead. Delta-LoRA [20] updates both low-rank matrices A and B, propagating these changes to the pre-trained weight W using the delta of their product. PiSSA [38] initializes A and B with the principal components of the original matrix W and places the remaining components into a residual matrix, which is kept frozen during fine-tuning. Another strategy for improving LoRA is to allow adaptable adjustment of the LoRA rank. AdaLoRA [39] utilizes SVD to parameterize incremental updates and dynamically distributes the parameter budget among weight matrices based on their importance scores. SoRA [40] employs an optimizable gate with a proximal gradient method to control sparsity, expanding the optimization space and improving parameter efficiency. Similarly, Zhang et al. [41] propose IncreLoRA, an incremental parameter allocation method that adaptively incorporates trainable parameters during training according to the importance scores of each module.

III. METHOD

In this section, we present SVFit, a novel PEFT approach for LPMs that leverages singular values for improved model adaptation. Unlike conventional methods that preserve the knowledge of LPMs by freezing the pre-trained weight matrix W, SVFit aims to embed this knowledge into low-rank matrices and retain it permanently. SVFit achieves this by performing SVD on the pre-trained weight matrix W, using the most critical singular values as trainable parameters, thereby optimizing the initialization of the low-rank matrices for more effective fine-tuning.

A. Fundamental Subspaces Derived from SVD

We begin by introducing key concepts related to singular value decomposition (SVD) and fundamental subspaces, a technique widely used in pattern recognition and data compression [42], [43]. SVD transforms a dataset from a high-dimensional space to a lower-dimensional space by ranking the singular values according to their significance [44], [45]. Dimensionality reduction is achieved by discarding the less significant singular values, while the remaining singular values, combined with their corresponding singular vectors, define the reduced-dimensional space.

Definition 1 (Range space). Given a matrix W ∈ R^{d1×d2}, the range space of W is the vector space spanned by the columns of W; in other words, it is the set of all linear combinations of the column vectors of W, often denoted R(W). Formally,

    R(W) = {Wx | x ∈ R^{d2}} ⊆ R^{d1}.    (3)

Similarly, the range space of W^T is a subspace of R^{d2}, denoted R(W^T). That is,

    R(W^T) = {W^T y | y ∈ R^{d1}} ⊆ R^{d2}.    (4)

Definition 2 (Null space). Given a matrix W ∈ R^{d1×d2}, the null space of W is the set of all vectors x ∈ R^{d2} that are mapped to the zero vector in R^{d1} when multiplied by W. Formally,

    N(W) = {x | Wx = 0} ⊆ R^{d2}.    (5)
Fig. 3. Illustration of the SVD of matrix W and its fundamental subspaces. The pre-trained weight matrix W ∈ R^{d1×d2} is decomposed into singular values and vectors as W = U diag(Σ) V^T. The decomposition yields a rank-r approximation matrix Wr and a residual matrix We. Specifically, the range space of W is spanned by Ur, and its null space is spanned by Ve. Conversely, the range space of W^T is spanned by Vr, and its null space is spanned by Ue.

Similarly, the null space of W^T consists of all vectors y ∈ R^{d1} that are mapped to the zero vector in R^{d2} when multiplied by W^T. Formally,

    N(W^T) = {y | W^T y = 0} ⊆ R^{d1}.    (6)

Given a matrix W ∈ R^{d1×d2} of rank r, its SVD is denoted as W = U diag(Σ) V^T, where diag(Σ) is a d1 × d2 diagonal matrix containing the singular values {λi}_{1≤i≤r} of W in descending order, with r ≪ min(d1, d2). The matrices U = [u1, u2, ..., u_{d1}] ∈ R^{d1×d1} and V = [v1, v2, ..., v_{d2}] ∈ R^{d2×d2} are orthogonal. We can partition U and V into two parts by columns and denote these partitions as Ur, Ue and Vr, Ve, respectively:

    Ur = [u1, u2, ..., ur],  Ue = [u_{r+1}, u_{r+2}, ..., u_{d1}],    (7)
    Vr = [v1, v2, ..., vr],  Ve = [v_{r+1}, v_{r+2}, ..., v_{d2}].    (8)

Then the four fundamental subspaces associated with W can be obtained:
• Ur is the orthonormal basis for the range space of W, i.e., R(Ur) = R(W);
• Ue is the orthonormal basis for the null space of W^T, i.e., R(Ue) = N(W^T);
• Vr is the orthonormal basis for the range space of W^T, i.e., R(Vr) = R(W^T);
• Ve is the orthonormal basis for the null space of W, i.e., R(Ve) = N(W).

To verify that R(Ve) = N(W), one shows that the subspace spanned by Ve consists precisely of the vectors that W maps to the zero vector. Similar arguments apply to the other subspaces.

Proof. Assume the matrix W ∈ R^{d1×d2} has rank r and d1 ≤ d2. Thus, the singular values are ordered as λ1 ≥ λ2 ≥ ... ≥ λr > λ_{r+1} = ... = λ_{d1} = 0. Since W = U diag(Σ) V^T = Σ_{i=1}^{r} λi ui vi^T, we can write Wx = Σ_{i=1}^{r} λi ui (vi^T x). Let z = Σ_{i=r+1}^{d2} βi vi; then Wz = (Σ_{i=1}^{r} λi ui vi^T)(Σ_{i=r+1}^{d2} βi vi) = 0, since vi^T vj = 0 for i ≤ r < j. Hence R(Ve) ⊆ N(W), and because dim N(W) = d2 − r = dim R(Ve) by the rank–nullity theorem, the null space of W is N(W) = {x | Wx = 0} = R(Ve).

The connection between the SVD of the matrix W and the four fundamental subspaces is depicted in Fig. 3.

B. SVFit: PEFT of LPMs Using Singular Values

SVFit performs SVD on the initial pre-trained weight matrix W ∈ R^{d1×d2} within the self-attention and multilayer perceptron layers to obtain the best rank-r approximation

    Wr = Ur diag(Σr) Vr^T,    (9)

and the residual matrix

    We = Ue diag(Σe) Ve^T,    (10)

where Ur = [u1, u2, ..., ur] ∈ R^{d1×r} and Vr = [v1, v2, ..., vr] ∈ R^{d2×r} are the matrices of singular vectors corresponding to the top r singular values, and Ue = [u_{r+1}, u_{r+2}, ..., u_{d1}] ∈ R^{d1×(d1−r)} and Ve = [v_{r+1}, v_{r+2}, ..., v_{d2}] ∈ R^{d2×(d2−r)} are the matrices of singular vectors corresponding to the residual singular values. As discussed in Section III-A, R(Ur) = R(W), R(Vr) = R(W^T), R(Ue) = N(W^T), and R(Ve) = N(W). The singular values are arranged in descending order, with Σr = [λ1, λ2, ..., λr] and Σe = [λ_{r+1}, λ_{r+2}, ..., λ_{d1}]. Consequently, the pre-trained weight matrix W can be expressed as

    W = Wr + We = Ur diag(Σr) Vr^T + Ue diag(Σe) Ve^T
      = Σ_{i=1}^{r} λi ui vi^T + Σ_{i=r+1}^{d1} λi ui vi^T,    (11)

as shown in Fig. 2(c) and Fig. 3.
As observed in Fig. 1, we performed SVD on the Fishstar image to reconstruct it using various subsets of singular values: the top 8, 16, 32, 64, 128, and 256 singular values, as well as the smallest 8, 16, 32, 64, 128, and 256 singular values. The results highlight that larger singular values are crucial for preserving image quality. Notably, the top 10% or even 1% of singular values contribute over 99% of the total singular-value sum. This observation indicates that the pre-trained weight matrix W can be effectively approximated by focusing on the most significant singular values above a given threshold r, while the smaller singular values have minimal impact on the overall structure. Consequently, in SVFit, We is frozen during training, and the focus is placed on adapting Wr. Inspired by recent advancements in image generation, such as learnable scaling factors for improved domain adaptation [32], SVFit utilizes the most critical singular values obtained from SVD as the trainable parameters. Specifically, it trains only the top r singular values Σr in Wr, scaling the fundamental subspaces derived from SVD to facilitate rapid adaptation to new domains. In short, the matrices Ur, Vr, Ue, Ve, and Σe are kept frozen, and training focuses exclusively on the most significant top-r singular values Σr, as demonstrated in Fig. 2(c) and Fig. 3. This method allows features to be projected onto a low-rank subspace defined by the orthogonal columns of U and V, enabling efficient layer-wise adaptation with a reduced number of trainable parameters.
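The 99% figure can be checked directly on any weight matrix. The small NumPy helper below is our illustration; it measures the share of the singular-value sum captured by a given top fraction:

    import numpy as np

    def singular_value_mass(W: np.ndarray, fraction: float = 0.01) -> float:
        """Share of the total singular-value sum captured by the top `fraction` of values."""
        s = np.linalg.svd(W, compute_uv=False)  # singular values, in descending order
        k = max(1, int(round(len(s) * fraction)))
        return float(s[:k].sum() / s.sum())

    # e.g., for a weight matrix W extracted from a pre-trained checkpoint:
    # print(singular_value_mass(W, 0.01), singular_value_mass(W, 0.10))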


In comparison to other LoRA variants, SVFit uniquely employs the most critical singular values from SVD initialization as trainable parameters, leading to more effective learning of new domain knowledge for downstream tasks while preserving pre-trained information. This approach significantly reduces the number of trainable parameters without introducing additional computational overhead or latency during inference, as the module can be seamlessly integrated into the original matrix post-training.
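Because only Σr changes during training, deployment reduces to a single dense update. The following sketch is our illustration of the post-training merge described above; it folds the trained singular values back into one weight matrix:

    import torch

    @torch.no_grad()
    def merge_svfit(U_r: torch.Tensor, sigma_r: torch.Tensor,
                    V_r: torch.Tensor, W_e: torch.Tensor) -> torch.Tensor:
        """Fold the trained top-r singular values back into one dense weight
        matrix, so inference uses a plain linear layer with no added latency."""
        return U_r @ torch.diag(sigma_r) @ V_r.T + W_e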
IV. EXPERIMENT

In this section, we conduct extensive experiments to evaluate SVFit in the contexts of natural language understanding and computer vision. We fine-tune the RoBERTa-base and RoBERTa-large models [1] on the GLUE benchmark [46] and apply SVFit to fine-tune ViT-base and ViT-large models [2] for image classification tasks [47], [48]. Additionally, we fine-tune Stable Diffusion v1.5 [49] to generate diverse images of a subject instance in various environments [50], [51]. We also vary the rank of our method on one task to examine how performance scales with the number of trainable parameters, and we analyze the influence of the learning rate. Our experiments are conducted using PyTorch, with pre-trained weights and configuration files obtained from HuggingFace [52], on NVIDIA A6000 GPUs.

A. Baselines

We compare SVFit with full fine-tuning and popular PEFT methods, including LoRA, DyLoRA, AdaLoRA, and PiSSA.
• Full fine-tuning updates the entire set of model parameters, initialized with the pre-trained weights and biases, through gradient descent [53]. Although this approach is straightforward and robust, it demands substantial computational resources.
• LoRA [19] employs two low-rank matrices to learn incremental updates, reducing GPU memory cost. We replicate its experimental setup for a fair comparison.
• DyLoRA [24] dynamically selects a random rank r for LoRA modules during training.
• AdaLoRA [39] addresses the challenge of optimal rank selection for incremental updates by adaptively pruning singular values based on their magnitudes, resulting in different ranks for different layers.
• PiSSA [38], structurally similar to LoRA, initializes the adapter matrices A and B using the principal components of the original weight matrix W, while the remaining components form a residual matrix that is kept frozen during fine-tuning.

B. Natural Language Understanding

Models and Datasets. We evaluate our method on the GLUE (General Language Understanding Evaluation) benchmark, a comprehensive natural language understanding assessment that encompasses tasks such as sentence relationship recognition, sentiment analysis, and natural language inference [46]. For systematic evaluation, we select six tasks: CoLA [54], STS-B [55], RTE [56], MRPC [57], SST-2 [58], and QNLI [59]. We implement SVFit for fine-tuning RoBERTa-base, which has 12 layers with a hidden size of 768, totaling 125 million parameters, and RoBERTa-large, which has 24 layers with a hidden size of 1024, totaling 356 million parameters [1].

Implementation Details. For all six GLUE datasets, we tune the hyperparameters for learning rates and scaling values. Following the experimental setup of previous studies [19], [20], we fine-tune only the query and value weights in each transformer block while fully fine-tuning the classification head. For both models, the rank of LoRA is set to 8 and the rank of SVFit to 768. Due to time constraints and budget limitations, we omit the time-intensive MNLI and QQP tasks, thereby forgoing the MNLI trick for MRPC, RTE, and STS-B. Consistent with prior work [21], [60], we report the number of trainable parameters in the fine-tuned layers, explicitly excluding the classification head, which is trained in the standard way. The results are averaged over five different random seeds. Additional details are provided in Table I.

TABLE I
HYPERPARAMETER SETUP OF SVFIT FOR THE GLUE BENCHMARK

RoBERTa-base (Rank 768; Warmup Ratio 0.06; LR Scheduler Linear; Weight Decay 0.1; Trainable Matrices WQ, WV; Max Sequence Length 512):

  Hyperparameter  CoLA   STS-B  RTE    MRPC   SST-2  QNLI
  Epochs          80     40     80     30     60     25
  Batch Size      64     64     16     64     64     64
  Learning Rate   0.01   0.01   0.01   0.02   0.003  0.003

RoBERTa-large (Rank 768; Batch Size 32; Warmup Ratio 0.06; LR Scheduler Linear; Weight Decay 0.1; Trainable Matrices WQ, WV; Max Sequence Length 128):

  Hyperparameter  CoLA   STS-B  RTE    MRPC   SST-2  QNLI
  Epochs          40     20     40     40     10     20
  Learning Rate   0.01   0.01   0.01   0.003  0.001  0.002
TABLE II
PERFORMANCE COMPARISON OF VARIOUS FINE-TUNING METHODS ON THE GLUE BENCHMARK USING ROBERTA-BASE AND ROBERTA-LARGE MODELS. METRICS INCLUDE MCC FOR COLA, PCC FOR STS-B, AND ACC FOR RTE, MRPC, SST-2, AND QNLI. RESULTS ARE REPORTED AS THE MEDIAN OF 5 RUNS WITH DIFFERENT RANDOM SEEDS; HIGHER VALUES INDICATE BETTER PERFORMANCE ACROSS ALL METRICS.

  Model          Method   # Trainable  CoLA   STS-B  RTE    MRPC   SST-2  QNLI   Avg.
                          Parameters   (MCC)  (PCC)  (ACC)  (ACC)  (ACC)  (ACC)
  RoBERTa-base   FT       125M         63.6   91.2   78.7   90.2   94.8   92.8   85.2
  RoBERTa-base   LoRA     0.3M         63.4   91.5   78.4   89.7   95.1   93.3   85.2
  RoBERTa-base   DyLoRA   0.3M         61.1   91.1   78.7   89.5   94.3   92.2   84.5
  RoBERTa-base   AdaLoRA  0.3M         62.0   90.5   81.0   88.7   94.5   93.1   85.0
  RoBERTa-base   PiSSA    0.3M         63.8   90.8   75.5   89.2   94.7   92.5   84.4
  RoBERTa-base   SVFit    0.018M       64.8   92.4   78.0   90.0   94.4   90.8   85.1
  RoBERTa-large  FT       356M         68.0   92.4   86.6   90.9   96.4   94.7   88.2
  RoBERTa-large  LoRA     0.8M         68.2   92.3   85.2   90.2   96.2   94.8   87.8
  RoBERTa-large  PiSSA    0.8M         69.0   92.9   85.2   90.2   96.7   95.1   88.2
  RoBERTa-large  SVFit    0.036M       71.4   92.0   86.3   90.9   96.2   94.4   88.5

TABLE III
PERFORMANCE COMPARISON OF VARIOUS FINE-TUNING METHODS ON THE IMAGE CLASSIFICATION TASK USING VIT-BASE AND VIT-LARGE MODELS ACROSS DIFFERENT DATASETS. ACCURACY (%) IS REPORTED AFTER TEN EPOCHS; AVG. IS THE AVERAGE ACCURACY ACROSS ALL DATASETS.

  Model      Method  # Trainable  OxfordPets  CIFAR10  DTD   EuroSAT  RESISC45  StanfordCars  FGVC  CIFAR100  Avg.
                     Parameters
  ViT-base   Head    -            90.3        96.4     69.8  88.7     74.2      25.8          17.4  84.3      68.4
  ViT-base   FT      85.8M        93.1        98.9     77.7  99.1     96.1      79.8          54.8  92.4      86.5
  ViT-base   LoRA    0.3M         93.2        98.8     75.0  98.4     92.7      45.4          25.2  92.0      77.6
  ViT-base   PiSSA   0.3M         95.9        98.6     78.7  98.7     95.5      67.1          47.6  91.2      84.2
  ViT-base   SVFit   0.018M       97.0        98.8     80.5  98.6     93.0      67.2          47.9  91.6      84.3
  ViT-large  Head    -            91.1        97.8     73.3  92.6     82.0      37.9          24.6  84.3      73.0
  ViT-large  FT      303.3M       94.4        99.2     81.8  99.0     96.4      88.9          68.3  93.6      90.2
  ViT-large  LoRA    0.8M         94.8        99.1     81.8  98.6     94.7      73.3          42.3  94.9      84.9
  ViT-large  PiSSA   0.8M         96.7        98.8     77.8  98.8     95.7      86.7          62.6  92.6      88.7
  ViT-large  SVFit   0.036M       97.8        99.3     83.4  98.7     95.2      83.3          57.8  93.9      88.7

Results. Based on the results presented in Table II, our proposed SVFit method demonstrates superior performance in several key areas compared with both traditional full fine-tuning (FT) and other PEFT methods such as LoRA, DyLoRA, AdaLoRA, and PiSSA. For RoBERTa-base, SVFit achieves the highest Matthews correlation coefficient (MCC) for CoLA and the highest Pearson correlation coefficient (PCC) for STS-B, indicating its effectiveness in classification and regression tasks. While it falls slightly behind in accuracy (ACC) on MRPC and SST-2, it maintains competitive performance across all tasks, resulting in a robust overall average. For RoBERTa-large, SVFit outperforms the other methods on CoLA, RTE, and MRPC and achieves comparable performance on the remaining tasks. Its significant improvement in MCC for CoLA is noteworthy, reflecting its strength in complex language understanding tasks. These results suggest that SVFit's approach to fine-tuning, with its unique parameterization and efficient use of trainable parameters, offers a balanced and effective alternative to existing methods. Additionally, the method's reduced number of trainable parameters demonstrates its potential for achieving high performance at lower computational cost.

TABLE IV
HYPERPARAMETER SETUP FOR IMAGE CLASSIFICATION OF SVFIT

Shared settings for ViT-base and ViT-large across all eight datasets: Rank 768; Epochs 10; LR Scheduler Linear; Weight Decay 0.01; Trainable Matrices WQ, WV. Per-dataset learning rates:

  Model      OxfordPets  CIFAR10  DTD   EuroSAT  RESISC45  StanfordCars  FGVC  CIFAR100
  ViT-base   0.04        0.04     0.04  0.2      0.2       0.4           0.2   0.04
  ViT-large  0.06        0.03     0.03  0.1      0.07      0.4           0.3   0.1
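The trainable-parameter counts reported for SVFit follow directly from this setup; the arithmetic below is our sanity check rather than a derivation given in the paper. With rank r = 768, SVFit trains only the r top singular values of each adapted matrix, and only WQ and WV are adapted in every transformer block. For the 12-layer base models this gives 12 × 2 × 768 = 18,432 ≈ 0.018M trainable parameters; for the 24-layer large models, 24 × 2 × 768 = 36,864 ≈ 0.036M, matching the counts in Tables II and III.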
Fig. 4. Randomly selected samples from DreamBooth, LoRA, and SVFit for the subject-driven generation task. Columns compare the input images with generations from DreamBooth, LoRA, and SVFit for the prompts "A [V] vase with a colorful flower bouquet", "A [V] vase in the ocean", "A [V] dog with a hat", and "A [V] dog wearing sunglasses".

C. Image Classification

Models and Datasets. We assess SVFit on the image classification task using both the base and large versions of the widely adopted Vision Transformer (ViT) model [2], pre-trained on the ImageNet-21K dataset [61]. To ensure a comprehensive evaluation, we employ a diverse set of datasets, including OxfordPets [62], CIFAR10 [63], DTD [64], EuroSAT [65], RESISC45 [66], StanfordCars [67], FGVC [68], and CIFAR100 [63].

Implementation Details. We evaluate the performance of LoRA, PiSSA, and SVFit applied to the query and value layers of the ViT, in addition to two baseline approaches: full fine-tuning (FT) and training only the classification head (referred to as Head). Consistent with our GLUE benchmark setup, the rank of LoRA is set to 8 and the rank of SVFit to 768 for both models. Learning rates were carefully tuned for all methods, with the maximum training epoch limited to 10. The reported parameter counts exclude the classification head, which is trained across all methods. Further details are provided in Table IV.

Results. Table III illustrates the performance of various fine-tuning methods on image classification tasks using ViT-base and ViT-large models across different datasets. Notably, our proposed SVFit method demonstrates strong performance with both model sizes. For ViT-base, SVFit achieves the highest accuracy on OxfordPets (97.0%) and DTD (80.5%) and performs competitively on the other datasets, resulting in an overall average accuracy of 84.3%. While FT achieves the highest average accuracy (86.5%), it requires significantly more trainable parameters (85.8M) than SVFit's minimal 0.018M. For ViT-large, SVFit again stands out, achieving the highest accuracy on OxfordPets (97.8%), CIFAR10 (99.3%), and DTD (83.4%), and performing well across the other datasets with an average accuracy of 88.7%. FT still yields the highest average accuracy (90.2%), but at the cost of training 303.3M parameters, whereas SVFit uses only 0.036M. These results underscore the efficiency and effectiveness of SVFit, particularly in scenarios where computational resources and parameter efficiency are critical. By maintaining competitive performance while dramatically reducing the number of trainable parameters, SVFit offers a compelling alternative to traditional fine-tuning methods.

D. DreamBooth

Models and Datasets. Following [51], we evaluate our method on the subject-driven text-to-image generation task. By fine-tuning Stable Diffusion v1.5 [49] with DreamBooth, we can generate diverse images of a subject instance in various environments, maintaining high preservation of subject details and realistic interactions between the scene and the subject. We compare our approach with LoRA and DreamBooth [51], ensuring fairness by randomly selecting the generated images from both methods.
Fig. 5. Performance of SVFit fine-tuning for the ViT-base model on image classification tasks across different parameter budget levels. The x-axis represents the rank, and the y-axis is the evaluation metric for each dataset. Each panel marks the LoRA (r = 8) baseline for comparison: OxfordPets 93.2, CIFAR10 98.8, DTD 75.0, EuroSAT 98.4, RESISC45 92.7, StanfordCars 45.4, FGVC 25.2, CIFAR100 92.0.

For fine-tuning, we use the dataset introduced in DreamBooth [51], which includes five or six images per subject for training.

Implementation Details. Both LoRA and our method use the same loss function as DreamBooth. For DreamBooth and LoRA, we apply the best hyperparameter setup from the original paper [51].

Results. In Fig. 4, we present a comparative analysis of image generation using DreamBooth, LoRA, and SVFit across multiple scenarios. SVFit consistently demonstrates superior subject-detail preservation and context integration. For instance, in the "vase with a colorful flower bouquet" scenario, SVFit accurately retains the intricate details of the flowers and vase while seamlessly blending them with the background. In contrast, the results from DreamBooth and LoRA exhibit noticeable artifacts and inconsistencies in subject representation and background interaction. Overall, the visual comparison clearly illustrates SVFit's enhanced capability to generate high-fidelity images that closely follow the given prompts, effectively preserving subject details and ensuring realistic environmental interactions.

E. Different Budget Levels

We analyze the performance of SVFit fine-tuning on the ViT-base model for image classification tasks under different parameter budget levels; the results are shown in Fig. 5. We employ ranks r = {8, 16, 32, 64, 128, 256, 512, 768}, corresponding to 0.2K, 0.4K, 0.8K, 1.5K, 3.1K, 6.1K, 12.3K, and 18.4K trainable parameters, respectively. For LoRA, we use a baseline rank of r = 8, corresponding to 294.9K trainable parameters. The experimental results demonstrate that our method effectively balances the number of trainable parameters and accuracy. For instance, on the OxfordPets dataset, our method achieves an accuracy of approximately 93.2% at the lowest rank of 8 while maintaining or improving accuracy at higher ranks. Similarly, on the CIFAR10 dataset, our method achieves an accuracy of around 98.8% at rank 8, with further gains as the rank increases. This trend is consistent across the other datasets: our method significantly improves over the baseline across various datasets and rank levels, maintaining high performance with fewer trainable parameters.
environmental interactions. For instance, on the OxfordPets dataset, the accuracy peaks
at approximately 93.2% with a 10× learning rate, while both
lower (0.5×, 1×, 5×) and higher learning rates (50×, 100×,
E. Different Budget Levels
150×) result in decreased performance. However, for EuroSAT,
We analyzed the performance of SVFit fine-tuning on an accuracy of around 98.6% is achieved at 10×, with lower
the ViT-base model for image classification tasks un- and higher rates underperforming. These findings demonstrate
der different parameter budget levels. The specific re- that while a substantial increase in the learning rate from
sults are shown in Fig. 5. We employed various ranks pre-training is generally beneficial, the optimal rate can vary
r = {8, 16, 32, 64, 128, 256, 512, 768}, corresponding to 0.2K, significantly depending on the specific dataset.
0.4K, 0.8K, 1.5K, 3.1K, 6.1K, 12.3K, and 18.4K trainable
parameters, respectively. For LoRA, we used a baseline rank V. C ONCLUSION
of r = 8, corresponding to 294.9K trainable parameters. The In this work, we introduced SVFit, a novel PEFT method
experimental results demonstrate that our method effectively that enhances initialization by leveraging SVD to initialize
Fig. 6. Performance of SVFit fine-tuning for the ViT-base model on image classification tasks across different learning rates. The x-axis represents the learning rate as a multiple of the pre-training learning rate (0.5×, 1×, 5×, 10×, 50×, 100×, 150×), and the y-axis is the evaluation metric for each dataset (OxfordPets, CIFAR10, DTD, EuroSAT, RESISC45, StanfordCars, FGVC, CIFAR100).

V. CONCLUSION

In this work, we introduced SVFit, a novel PEFT method that enhances initialization by leveraging SVD to initialize low-rank matrices derived from pre-trained weights. SVFit focuses on training only the most significant top-r singular values, significantly reducing the number of trainable parameters while ensuring efficient fine-tuning and preserving the model's core capabilities. Our theoretical analysis demonstrates how this approach enables rapid adaptation by effectively capturing essential information from pre-trained models and efficiently learning new domain-specific knowledge with minimal parameters. SVFit has been evaluated across various tasks, including natural language understanding, image classification, and subject-driven text-to-image generation, consistently outperforming LoRA and other recent state-of-the-art techniques such as PiSSA in both efficiency and effectiveness. Future work will focus on extending SVFit to more complex tasks, optimizing singular value selection, and further enhancing performance through dynamic parameter budget allocation to broaden its applicability across diverse domains.

ACKNOWLEDGEMENTS

This research was partially funded by the National Natural Science Foundation of China under grants 62220106008, 62306067, and U20B2063, as well as the Sichuan Science and Technology Program under grant 2024NSFSC1463. Additional support was provided by the Sichuan Province Innovative Talent Funding Project for Postdoctoral Fellows (Project BX202311) and the China Postdoctoral Science Foundation (Project 2022M720660).

REFERENCES

[1] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv:1907.11692, 2019.
[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, ICLR, 2021.
[3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[4] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018.
[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[7] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, "Tokens-to-token ViT: Training vision transformers from scratch on ImageNet," in IEEE International Conference on Computer Vision, ICCV, 2021, pp. 558–567.
[8] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and J. Feng, "DeepViT: Towards deeper vision transformer," arXiv:2103.11886, 2021.
[9] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point transformer," in IEEE International Conference on Computer Vision, ICCV, 2021, pp. 16259–16268.
[10] J. Wei, Y. Yang, X. Xu, J. Song, G. Wang, and H. T. Shen, "Less is better: Exponential loss for cross-modal matching," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 5271–5280, 2023.
[11] Z.-Y. Wang, X. P. Li, H. C. So, and A. M. Zoubir, "Adaptive rank-one matrix completion using sum of outer products," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 4868–4880, 2023.
[12] W. Yan, M. Yang, and Y. Li, "Robust low rank and sparse representation for multiple kernel dimensionality reduction," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 1–15, 2023.
[13] Y. Xu, J. Wei, Y. Bin, Y. Yang, Z. Ma, and H. T. Shen, "Set of diverse queries with uncertainty regularization for composed image retrieval," IEEE Trans. Circuits Syst. Video Technol., pp. 1–1, 2024.
[14] S. Kim, I. Kang, and N. Kwak, "Semantic sentence matching with densely-connected recurrent and co-attentive information," in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, vol. 33, no. 01, 2019, pp. 6586–6593.
[15] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, NeurIPS, vol. 35, pp. 27730–27744, 2022.
[16] K. Hambardzumyan, H. Khachatrian, and J. May, "WARP: Word-level adversarial reprogramming," arXiv:2101.00121, 2021.
[17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Transformers: State-of-the-art natural language processing," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2020, pp. 38–45.
[18] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nat. Mach. Intell., vol. 5, no. 3, pp. 220–235, 2023.
[19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, ICLR, 2022.
[20] B. Zi, X. Qi, L. Wang, J. Wang, K.-F. Wong, and L. Zhang, "Delta-LoRA: Fine-tuning high-rank parameters with the delta of low-rank matrices," arXiv:2309.02411, 2023.
[21] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano, "VeRA: Vector-based random matrix adaptation," in International Conference on Learning Representations, ICLR, 2024.
[22] Y. Gu, X. Han, Z. Liu, and M. Huang, "PPT: Pre-trained prompt tuning for few-shot learning," in Annual Meeting of the Association for Computational Linguistics, ACL, 2022, pp. 8410–8423.
[23] P. Ren, C. Shi, S. Wu, M. Zhang, Z. Ren, M. de Rijke, Z. Chen, and J. Pei, "Mini-ensemble low-rank adapters for parameter-efficient fine-tuning," arXiv:2402.17263, 2024.
[24] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, "DyLoRA: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation," arXiv:2210.07558, 2022.
[25] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, "Learning multiple visual domains with residual adapters," Advances in Neural Information Processing Systems, NeurIPS, vol. 30, pp. 506–516, 2017.
[26] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," in Annual Meeting of the Association for Computational Linguistics, ACL, 2021.
[27] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2021, pp. 3045–3059.
[28] C. Li, H. Farkhoor, R. Liu, and J. Yosinski, "Measuring the intrinsic dimension of objective landscapes," arXiv:1804.08838, 2018.
[29] L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li, "LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning," arXiv:2308.03303, 2023.
[30] V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari, "What's hidden in a randomly weighted neural network?" in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2020, pp. 11893–11902.
[31] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, 2017.
[32] E. Xie, L. Yao, H. Shi, Z. Liu, D. Zhou, Z. Liu, J. Li, and Z. Li, "DiffFit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning," in IEEE International Conference on Computer Vision, ICCV, 2023, pp. 4230–4239.
[33] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, NeurIPS, vol. 33, 2020, pp. 1877–1901.
[34] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[35] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv:2109.01652, 2021.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, NeurIPS, 2017.
[37] Z. Lin, A. Madotto, and P. Fung, "Exploring versatile generative language model via parameter-efficient transfer learning," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2020, pp. 441–459.
[38] F. Meng, Z. Wang, and M. Zhang, "PiSSA: Principal singular values and singular vectors adaptation of large language models," arXiv:2404.02948, 2024.
[39] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao, "Adaptive budget allocation for parameter-efficient fine-tuning," in International Conference on Learning Representations, ICLR, 2023.
[40] N. Ding, X. Lv, Q. Wang, Y. Chen, B. Zhou, Z. Liu, and M. Sun, "Sparse low-rank adaptation of pre-trained language models," arXiv:2311.11696, 2023.
[41] F. Zhang, L. Li, J. Chen, Z. Jiang, B. Wang, and Y. Qian, "IncreLoRA: Incremental parameter allocation method for parameter-efficient fine-tuning," arXiv:2308.12043, 2023.
[42] C.-W. Sun, T.-Z. Huang, T. Xu, and L.-J. Deng, "NF-3DLogTNN: An effective hyperspectral and multispectral image fusion method based on nonlocal low-fibered-rank regularization," Appl. Math. Model., vol. 118, pp. 780–797, 2023.
[43] T. Xu, T.-Z. Huang, L.-J. Deng, and N. Yokoya, "An iterative regularization method based on tensor subspace representation for hyperspectral image super-resolution," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–16, 2022.
[44] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Rev., vol. 53, no. 2, pp. 217–288, 2011.
[45] A. G. Akritas and G. I. Malaschonok, "Applications of singular-value decomposition (SVD)," Math. Comput. Simul., vol. 67, no. 1-2, pp. 15–31, 2004.
[46] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in International Conference on Learning Representations, ICLR, 2019.
[47] T. Su, D. Feng, M. Wang, and M. Chen, "Dual discriminative low-rank projection learning for robust image classification," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 12, pp. 7708–7722, 2023.
[48] H. Liu, Y. Jia, J. Hou, and Q. Zhang, "Global-local balanced low-rank approximation of hyperspectral images for classification," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, pp. 2013–2024, 2022.
[49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2022, pp. 10684–10695.
[50] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, "An image is worth one word: Personalizing text-to-image generation using textual inversion," arXiv:2208.01618, 2022.
[51] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2023, pp. 22500–22510.
[52] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "HuggingFace's Transformers: State-of-the-art natural language processing," arXiv:1910.03771, 2019.
[53] S. Huang, D. Xu, I. E. Yen, Y. Wang, S.-E. Chang, B. Li, S. Chen, M. Xie, S. Rajasekaran, H. Liu et al., "Sparse progressive distillation: Resolving overfitting under pretrain-and-finetune paradigm," in Annual Meeting of the Association for Computational Linguistics, ACL, 2022.
[54] A. Warstadt, A. Singh, and S. R. Bowman, "Neural network acceptability judgments," Trans. Assoc. Comput. Linguist., vol. 7, pp. 625–641, 2019.
[55] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation," arXiv:1708.00055, 2017.
[56] D. Giampiccolo, B. Magnini, I. Dagan, and W. B. Dolan, "The third PASCAL recognizing textual entailment challenge," in Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 2007, pp. 1–9.
[57] B. Dolan and C. Brockett, "Automatically constructing a corpus of sentential paraphrases," in Third International Workshop on Paraphrasing (IWP2005), 2005.
[58] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2013, pp. 1631–1642.
[59] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Conference on Empirical Methods in Natural Language Processing, EMNLP, 2016, pp. 2383–2392.
[60] Z. Gao, Q. Wang, A. Chen, Z. Liu, B. Wu, L. Chen, and J. Li, "Parameter-efficient fine-tuning with discrete Fourier transform," arXiv:2405.03003, 2024.
[61] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor, "ImageNet-21K pretraining for the masses," arXiv:2104.10972, 2021.
[62] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "Cats and dogs," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2012, pp. 3498–3505.
[63] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," Master's thesis, University of Toronto, 2009.
[64] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2014, pp. 3606–3613.
[65] P. Helber, B. Bischke, A. Dengel, and D. Borth, "EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification," IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol. 12, no. 7, pp. 2217–2226, 2019.
[66] G. Cheng, J. Han, and X. Lu, "Remote sensing image scene classification: Benchmark and state of the art," Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
[67] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D object representations for fine-grained categorization," in IEEE International Conference on Computer Vision, ICCV, 2013, pp. 554–561.
[68] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," arXiv:1306.5151, 2013.
