Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog -> a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real inputs without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, where concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.
Our code builds on the `diffusers` library. To set up the environment, run:

```bash
conda env create -f environment.yaml
conda activate ScalingConcept
```

or install the requirements directly:

```bash
pip install -r requirements.txt
```
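To verify the setup, you can check that the core dependencies import cleanly (a quick sanity check, assuming `torch` is among the pinned requirements; this step is not required by the repository):

```bash
python -c "import diffusers, torch; print(diffusers.__version__, torch.__version__)"
```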
We provide a minimal example to explore the effects of concept scaling. The `examples_images/` directory contains three sample images demonstrating different applications: canonical pose generation, face attribute editing, and anime sketch enhancement. To get started, try running:

```bash
python demo.py
```

The default setting is configured for canonical pose generation. For optimal results with other applications, adjust the prompt and relevant hyperparameters as noted in the code comments.
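The hyperparameters to adjust are the inversion prompt plus three scaling knobs. The block below is an illustrative sketch of such a settings edit; the variable names follow this README, but the actual layout inside `demo.py` may differ:

```python
# Illustrative settings block (hypothetical layout; see demo.py's comments
# for the authoritative names and locations).
prompt = "anime"  # inversion prompt naming the concept to scale
omega = 5         # concept-scaling strength
gamma = 3         # fidelity-preservation weight
t_exit = 25       # denoising step at which concept scaling stops
```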
Our ScalingConcept method supports various applications, each customizable by adjusting the scaling parameters within `pipe_inference`. Below are recommended configurations for each application:

- Canonical Pose Generation / Object Stitching:
  `prompt = [object_name]`, `omega = 5`, `gamma = 3`, `t_exit = 15`
- Weather Manipulation:
  `prompt = '(heavy) fog'` or `'(heavy) rain'`, `omega = 5`, `gamma = 3`, `t_exit = 15`
- Creative Enhancement:
  `prompt = [concept to enhance]`, `omega = 3`, `gamma = 3`, `t_exit = 15`
- Face Attribute Scaling:
  `prompt = [face attribute, e.g., 'young face' or 'old face']`, `omega = 3`, `gamma = 3`, `t_exit = 15`
- Anime Sketch Enhancement:
  `prompt = 'anime'`, `omega = 5`, `gamma = 3`, `t_exit = 25`
In general, a larger `omega` increases the strength of concept scaling, while higher `gamma` and `t_exit` values better preserve fidelity to the input. Note that the choice of inversion prompt is crucial, as the model is sensitive to prompt wording.
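For convenience, the recommended configurations above can be collected into a small preset table. This helper is illustrative and not part of the shipped code; `pipe_inference` is the pipeline named in this README, and its exact call signature may differ:

```python
# Recommended hyperparameter presets, transcribed from the list above.
PRESETS = {
    "canonical_pose": {"omega": 5, "gamma": 3, "t_exit": 15},
    "weather":        {"omega": 5, "gamma": 3, "t_exit": 15},
    "creative":       {"omega": 3, "gamma": 3, "t_exit": 15},
    "face_attribute": {"omega": 3, "gamma": 3, "t_exit": 15},
    "anime_sketch":   {"omega": 5, "gamma": 3, "t_exit": 25},
}

# Assumed usage (keyword names may differ in the actual pipeline):
# result = pipe_inference(prompt="anime", **PRESETS["anime_sketch"])
```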
This code builds upon the `diffusers` library. Additionally, we borrow code from the following repositories:
- Pix2PixZero for noise regularization.
- sdxl_inversions for the initial implementation of DDIM inversion in SDXL.
- ReNoise-Inversion for a precise inversion technique.
If you use this code for your research, please cite the following work:
```bibtex
@misc{huang2024scaling,
  title={Scaling Concept With Text-Guided Diffusion Models},
  author={Chao Huang and Susan Liang and Yunlong Tang and Yapeng Tian and Anurag Kumar and Chenliang Xu},
  year={2024},
  eprint={2410.24151},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```