In recent years, the development of diffusion models has led to significant progress in image, video, and 3D generation, with pre-trained models such as the Stable Diffusion series playing a crucial role. However, a key challenge remains in downstream applications: how to adapt pre-trained diffusion models to new tasks effectively and efficiently. Inspired by model pruning, which lightens large pre-trained models by removing unimportant parameters, we propose a novel fine-tuning method that makes full use of these ineffective parameters to endow the pre-trained model with new task-specific capabilities. We first investigate the importance of parameters in pre-trained diffusion models and discover that the smallest 10% to 20% of parameters by absolute value do not contribute to the generation process; this ineffectiveness stems from training instabilities rather than inherent model properties. Based on this observation, we propose a fine-tuning method termed SaRA that re-utilizes these temporarily ineffective parameters, which amounts to optimizing a sparse weight matrix to learn task-specific knowledge. To mitigate potential overfitting, we introduce a nuclear-norm-based low-rank sparse training scheme for efficient fine-tuning. Furthermore, we design a progressive parameter adjustment strategy to make full use of the fine-tuned parameters. Finally, we propose a novel unstructural backpropagation strategy that significantly reduces memory costs during fine-tuning and also benefits other selective PEFT methods. Our method enhances the generative capabilities of pre-trained models in downstream applications and outperforms traditional fine-tuning methods such as LoRA in maintaining the model's generalization ability. We validate our approach through fine-tuning experiments on Stable Diffusion 1.5, 2.0, and 3.0, demonstrating significant improvements, and we further compare it against previous fine-tuning approaches on various downstream tasks, including domain transfer, customization, and video generation, confirming its effectiveness and generalization performance. SaRA also offers a practical advantage: it requires modifying only a single line of code for efficient implementation and is seamlessly compatible with existing methods.
We propose SaRA, a novel fine-tuning method for pre-trained diffusion models that trains only the parameters with relatively small absolute values.
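To make the core idea concrete, below is a minimal PyTorch sketch of the two central steps as described in the abstract: selecting the smallest ~10% of weights by absolute value and learning a sparse, nuclear-norm-regularized update on those positions only. This is an illustration, not the released implementation; the ratio, learning rate, penalty weight, initialization, and the stand-in task loss are all placeholder choices.

```python
import torch

def build_sara_mask(weight: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Mark the `ratio` fraction of entries with the smallest |value| as trainable."""
    k = max(1, int(ratio * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() <= threshold

# Toy example on a single frozen weight matrix.
weight = torch.randn(64, 64)
mask = build_sara_mask(weight, ratio=0.1)

# Learn a sparse update only on the masked (low-magnitude) positions.
# Small random init avoids a degenerate SVD in the nuclear-norm backward pass.
delta = (1e-4 * torch.randn_like(weight)).requires_grad_()
optimizer = torch.optim.AdamW([delta], lr=1e-2)

for step in range(100):
    effective = weight + delta * mask      # frozen weights + sparse update
    task_loss = effective.pow(2).mean()    # stand-in for the real task loss
    # Nuclear-norm penalty keeps the sparse update low-rank (anti-overfitting).
    nuclear = torch.linalg.matrix_norm(delta * mask, ord="nuc")
    loss = task_loss + 1e-3 * nuclear
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```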
Fig.1 Comparison between our SaRA (d) and previous parameter-efficient fine-tuning methods, including (a) additive fine-tuning, (b) reparameterized fine-tuning, and (c) selective fine-tuning.
SaRA can be implemented by modifying a single line of code: simply replace the original optimizer with the corresponding SaRA optimizer.
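In a typical PyTorch training script, the change is confined to the optimizer construction, as sketched below. The `sara` module, optimizer class, and keyword argument are hypothetical placeholders; please refer to the official repository for the exact API.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the diffusion UNet

# Standard fine-tuning:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# With SaRA, only this line changes. The module, class, and keyword below
# are hypothetical placeholders; see the official release for the real API:
# optimizer = sara.AdamW(model.parameters(), lr=1e-4, threshold_ratio=0.1)
```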
SaRA can improve the performance of pre-trained models on the main task (the original task the model was trained on) by optimizing the initially ineffective parameters so that they become effective, thereby increasing the number of effective parameters.
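One rough way to observe this effect is to measure the fraction of near-zero parameters before and after fine-tuning; the magnitude threshold below is an illustrative choice, not a value from the paper.

```python
import torch

def ineffective_fraction(model: torch.nn.Module, threshold: float = 1e-3) -> float:
    """Fraction of parameters whose absolute value lies below `threshold`."""
    total = sum(p.numel() for p in model.parameters())
    small = sum((p.abs() < threshold).sum().item() for p in model.parameters())
    return small / total

model = torch.nn.Linear(256, 256)  # stand-in for a pre-trained diffusion model
print(f"Ineffective parameters: {ineffective_fraction(model):.2%}")
# After SaRA fine-tuning, this fraction should drop: the previously
# ineffective parameters now carry task-specific knowledge.
```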
Fig.2 Quantitative comparisons among different PEFT methods for backbone fine-tuning on the ImageNet, FFHQ, and CelebA-HQ datasets. Our method achieves the best FID scores, indicating that it effectively improves the performance of pre-trained models on the main task.
In this experiment, we choose five widely used datasets from CIVITAI, each in a different style, to conduct the fine-tuning experiments: Barbie Style, Cyberpunk Style, Elementfire Style, Expedition Style, and Hornify Style (from top to bottom). Our method learns the target domain style accurately while achieving good alignment between the generated images and the text prompts.
Fig.3 Qualitative comparison on Stable Diffusion 1.5. Our method learns domain-specific knowledge well while generating images that are consistent with the given prompts.
Tab.1 Comparisons with different parameter-efficient fine-tuning methods, along with full-parameter fine-tuning, on Stable Diffusion 1.5, 2.0, and 3.0. Under most settings, our model achieves the best FID and VLHI scores, indicating that it learns domain-specific knowledge successfully while preserving the prior information well. Bold and underline denote the best and second-best results, respectively.
Since DreamBooth requires fine-tuning the UNet, SaRA can be employed to fine-tune the UNet and achieve image customization.
Fig.4 Qualitative comparisons among different PEFT methods on image customization by fine-tuning the UNet in DreamBooth. Our model accurately captures the target features while avoiding overfitting, outperforming DreamBooth combined with other PEFT methods as well as Textual Inversion.
We further investigate the effectiveness of our method in fine-tuning video generation models (AnimateDiff) on datasets with different camera motions, including ZoomIn, ZoomOut, PanLeft, and PanRight. SaRA preserves the model prior well and learns accurate camera motions.
@article{hu2024sara,
  title={SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation},
  author={Hu, Teng and Zhang, Jiangning and Yi, Ran and Huang, Hongrui and Wang, Yabiao and Ma, Lizhuang},
  journal={arXiv preprint},
  year={2024}
}