
arXiv:2403.12963v1 [cs.CV] 19 Mar 2024

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Linjiang Huang1,2⋆, Rongyao Fang1⋆, Aiping Zhang3, Guanglu Song4,
Si Liu5, Yu Liu4, and Hongsheng Li1,2 B

1 CUHK-SenseTime Joint Laboratory, The Chinese University of Hong Kong
2 Centre for Perceptual and Interactive Intelligence
3 Sun Yat-Sen University
4 SenseTime Research
5 Beihang University
[email protected], {rongyaofang@link, hsli@ee}.cuhk.edu.hk

⋆ Equal contribution. B Corresponding author.

Abstract. In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing the persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address this issue, we introduce FouriScale, an innovative, training-free approach built on frequency-domain analysis. We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation, intending to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. By using FouriScale as guidance, our method successfully balances the structural integrity and fidelity of generated images, achieving an impressive capacity for arbitrary-size, high-resolution, and high-quality generation. With its simplicity and compatibility, our method can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images. The code will be released at https://github.com/LeonHLJ/FouriScale.

Keywords: Diffusion Model · Training Free · High-Resolution Synthesis

1 Introduction
Recently, diffusion models [19, 36] have emerged as the predominant generative models, surpassing the popularity of GANs [13] and autoregressive models [11, 33]. Text-to-image generation models based on diffusion models, such as Stable Diffusion (SD) [36], Stable Diffusion XL (SDXL) [32], Midjourney [29], and Imagen [37], have shown an astonishing capacity to generate high-quality, high-fidelity images under the guidance of text prompts. To
ensure efficient processing on existing hardware and stable model training, these


[Figure 1: SDXL at 2048×2048; columns, left to right: Vanilla, Attn-Entro, ScaleCrafter, Ours. Prompt: "An anime man in flight uniform with hyper detailed digital artwork and an art style inspired by Klimt, Nixeu, Ian Sprigger, Wlop, and Krenz Cushart."]

Fig. 1: Visualization of pattern repetition issue of higher-resolution image synthesis


using pre-trained SDXL [32] (Train: 1024×1024; Inference:2048×2048). Attn-Entro [25]
fails to address this problem and ScaleCrafter [15] still struggles with this issue in
image details. Our method successfully handles this problem and generates high-quality
images without model retraining.

models are typically trained at one or a few specific image resolutions. For in-
stance, SD models are often trained using images of 512 × 512 resolution, while
SDXL models are typically trained with images close to 1024 × 1024 pixels.
However, as shown in Fig. 1, directly employing pre-trained diffusion models
to generate an image at a resolution higher than what the models were trained
on will lead to significant issues, including repetitive patterns and unforeseen
artifacts. Some studies [3, 24, 26] have attempted to create larger images by uti-
lizing pre-trained diffusion models to stitch together overlapping patches into a
panoramic image. Nonetheless, the absence of a global direction for the whole im-
age restricts their ability to generate images focused on specific objects and fails
to address the problem of repetitive patterns, where a unified global structure is
essential. Recent work [25] has explored adapting pre-trained diffusion models for
generating images of various sizes by examining attention entropy. Nevertheless,
ScaleCrafter [15] found that the key point of generating high-resolution images
lies in the convolution layers. They introduce a re-dilation operation and a con-
volution disperse operation to enlarge kernel sizes of convolution layers, largely
mitigating the problem of pattern repetition. However, their conclusion stems from empirical findings and lacks a deeper exploration of this issue. Additionally, their method requires an initial offline computation of a linear transformation between the original convolutional kernel and the enlarged kernel, falling short in compatibility and scalability when the kernel sizes of the UNet or the desired target resolution of images vary.
In this work, we present FouriScale, an innovative and effective approach that
handles the issue through the perspective of frequency domain analysis, suc-
cessfully demonstrating its effectiveness through both theoretical analysis and
experimental results. FouriScale substitutes the original convolutional layers in
pre-trained diffusion models by simply introducing a dilation operation coupled
with a low-pass operation, aimed at achieving structural and scale consistency
across resolutions, respectively. Equipped with a padding-then-crop strategy, our
method allows for flexible text-to-image generation of different sizes and aspect
ratios. Furthermore, by utilizing FouriScale as guidance, our approach attains

remarkable capability in producing high-resolution images of any size, with inte-


grated image structure alongside superior quality. The simplicity of FouriScale
eliminates the need for any offline pre-computation, facilitating compatibility
and scalability. We envision FouriScale providing significant contributions to the
advancement of ultra-high-resolution image synthesis in future research.

2 Related Work

2.1 Text-to-Image Synthesis

Text-to-image synthesis [9, 20, 36, 37] has seen a significant surge in interest due
to the development of diffusion probabilistic models [19, 40]. These innovative
models operate by generating data from a Gaussian distribution and refining
it through a denoising process. With their capacity for high-quality generation,
they have made significant leaps over traditional models like GANs [9, 13], espe-
cially in producing more realistic images. The Latent Diffusion Model (LDM) [36]
integrates the diffusion process within a latent space, achieving astonishing realism in image generation and spurring significant interest in generation via latent spaces [5, 16, 27, 31, 46]. To ensure efficient processing on existing hardware and stable model training, these models are typically trained at one or a few specific image resolutions. For instance, Stable Diffusion (SD) [36]
is trained using 512 × 512 pixel images, while SDXL [32] models are typically
trained with images close to 1024 × 1024 resolution, accommodating various
aspect ratios simultaneously.

2.2 High-Resolution Synthesis via Diffusion Models

High-resolution synthesis has always received widespread attention. Prior works


mainly focus on refining the noise schedule [7, 22], developing cascaded ar-
chitectures [20, 37, 42] or mixtures-of-denoising-experts [2] for generating high-
resolution images. Despite their impressive capabilities, diffusion models were
often limited by specific resolution constraints and did not generalize well across
different aspect ratios and resolutions. Some methods have tried to address these
issues by accommodating a broader range of resolutions. For example, Any-size
Diffusion [50] fine-tunes a pre-trained SD on a set of images with a fixed range of
aspect ratios, similar to SDXL [32]. FiT [28] views the image as a sequence of tokens and adaptively pads image tokens to a predefined maximum token limit, ensuring hardware-friendly training and flexible resolution handling. However,
these models require model training, overlooking the inherent capability of the
pre-trained models to handle image generation with varying resolutions. Most
recently, some methods [3, 24, 26] have attempted to generate panoramic images
by utilizing pre-trained diffusion models to stitch together overlapping patches.
Recent work [25] has explored adapting pre-trained diffusion models for gen-
erating images of various sizes by examining attention entropy. ElasticDiff [14]
uses the estimation of default resolution to guide the generation of arbitrary-size
[Figure 2: the FouriScale pipeline (orange path). The original UNet kernel is dilated; the high-resolution input feature is low-pass filtered and convolved with the dilated kernel to give the output feature at high resolution, while the down-sampled feature convolved with the original kernel gives the output feature at the original resolution, enforcing consistency across resolutions.]
Fig. 2: The overview of FouriScale (orange line), which includes a dilation convolution
operation (Sec. 3.2) and a low-pass filtering operation (Sec. 3.3) to achieve structural
consistency and scale consistency across resolutions, respectively.

images. However, ScaleCrafter [15] finds that the key point of generating high-
resolution images by pre-trained diffusion models lies in convolution layers. They
present a re-dilation and a convolution disperse operation to expand convolution
kernel sizes, which requires an offline calculation of a linear transformation from
the original convolutional kernel to the expanded one. In contrast, we deeply
investigate the issue of repetitive patterns and handle it through the perspective
of frequency domain analysis. The simplicity of our method eliminates the need
for any offline pre-computation, facilitating its compatibility and scalability.

3 Method

Diffusion models, also known as score-based generative models [19, 40], belong
to a category of generative models that follow a process of progressively intro-
ducing Gaussian noise into the data and subsequently generating samples from
this noise through a reverse denoising procedure. The key denoising step is typ-
ically carried out by a U-shaped Network (UNet), which learns the underlying
denoising function that maps from noisy data to its clean counterpart. The UNet
architecture, widely adopted for this purpose, comprises stacked convolution lay-
ers, self-attention layers, and cross-attention layers. Some previous works have
explored the degradation of performance when the generated resolution becomes larger, attributing it to the change in the number of attention tokens [25] and the reduced relative receptive field of convolution layers [15]. Based on empirical evidence in [15], convolutional layers are more sensitive to changes in resolution. Therefore, we primarily focus on studying the impact brought about by
the convolutional layers. In this section, we will introduce FouriScale, as shown
in Fig. 2. It includes a dilation convolution operation (Sec. 3.2) and a low-pass
filtering operation (Sec. 3.3) to achieve structural consistency and scale consis-
tency across resolutions, respectively. With the tailored padding-then-cropping
strategy (Sec. 3.4), FouriScale can generate images of arbitrary aspect ratios.
By utilizing FouriScale as guidance (Sec. 3.5), our approach attains remarkable
capability in generating high-resolution and high-quality images.

3.1 Notation
2D Discrete Fourier Transform (2D DFT). Given a two-dimensional discrete
signal F (m, n) with dimensions M × N , the two-dimensional discrete Fourier
transform (2D DFT) is defined as:

\label {eq:2D_DFT} F(p, q) = \frac {1}{MN} \sum _{m=0}^{M-1} \sum _{n=0}^{N-1} F(m, n) e^{-j2\pi \left (\frac {pm}{M} + \frac {qn}{N}\right )}. (1)

2D Dilated Convolution. A dilated version of a kernel k(m, n), denoted as k_{d_h, d_w}(m, n), is formed by introducing zeros between the elements of the original kernel such that:

k_{d_h, d_w}(m, n) = \begin {cases} k(\frac {m}{d_h}, \frac {n}{d_w}) & \text {if } m \operatorname {\%} d_h = 0 \text { and } n \operatorname {\%} d_w = 0, \\ 0 & \text {otherwise}, \end {cases} (2)

where d_h and d_w are the dilation factors along the height and width, respectively, m and n are the indices in the dilated space, and % represents the modulo operation.
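As a concrete illustration, the following NumPy sketch builds a dilated kernel exactly as in Eq. (2), padding zeros to the right and bottom as in Fig. 3 later, and evaluates the 2D DFT of Eq. (1); the function name and the example sizes are illustrative, not from the paper.

```python
import numpy as np

def dilate_kernel(k: np.ndarray, d_h: int, d_w: int) -> np.ndarray:
    """Eq. (2): k_{d_h,d_w}(m, n) = k(m/d_h, n/d_w) when m % d_h == 0 and n % d_w == 0, else 0.
    Zeros are also padded to the right/bottom, so the output size is (d_h*M, d_w*N)."""
    M, N = k.shape
    k_d = np.zeros((d_h * M, d_w * N), dtype=k.dtype)
    k_d[::d_h, ::d_w] = k
    return k_d

k = np.random.randn(5, 5)
k_d = dilate_kernel(k, 4, 4)          # 20 x 20 dilated kernel

# 2D DFT of Eq. (1); NumPy's fft2 omits the 1/(MN) factor, so divide explicitly.
K = np.fft.fft2(k) / k.size
```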

3.2 Structural Consistency via Dilated Convolution


The diffusion model’s denoising network, denoted as ϵθ , is generally trained on
images or latent spaces at a specific resolution of h × w. This network is often
constructed using a U-Net architecture. Our target is to generate an image of
a larger resolution of H × W at the inference stage using the parameters of
denoising network ϵθ without retraining.
As previously discussed, the convolutional layers within the U-Net are largely
responsible for the occurrence of pattern repetition when the inference resolution
becomes larger. To prevent structural distortion at the inference resolution, we
resort to establishing structural consistency between the default resolution and
high resolutions, as shown in Fig. 2. In particular, for a convolutional layer Convk
in the UNet with its convolution kernel k, and the high-resolution input feature
map F , the structural consistency can be formulated as follows:

\label {eq:structural_consistency} \operatorname {Down}_{s}(F) \circledast k = \operatorname {Down}_{s}(F \circledast k'), (3)

where Down_s denotes the down-sampling operation with scale s,¹ and ⊛ represents the convolution operation. This equation implies the need to customize a new convolution kernel k′ for the larger resolution. However, finding an appropriate k′ can be challenging due to the variability of the feature map F. The recent
ScaleCrafter [15] method uses structure-level and pixel-level calibrations to learn
a linear transformation between k and k ′ , but learning a new transformation for
each new kernel size and new target resolution can be cumbersome.
¹ For simplicity, we assume equal down-sampling scales for height and width. Our method can also accommodate different down-sampling scales in this context through our padding-then-cropping strategy (Section 3.4).

In this work, we propose to handle the structural consistency from a frequency perspective. Suppose the input F(x, y), which is a two-dimensional discrete spatial signal, belongs to R^{H_f×W_f×C}. The sampling rates along the x and y axes are given by Ω_x and Ω_y, respectively. The Fourier transform of F(x, y) is represented by F(u, v) ∈ R^{H_f×W_f×C}. In this context, the highest frequencies along the u and v axes are denoted as u_max and v_max, respectively. Additionally, the Fourier transform of the down-sampled feature map Down_s(F(x, y)), which is dimensionally reduced to R^{(H_f/s)×(W_f/s)×C}, is denoted as F′(u, v).

Theorem 1. Spatial down-sampling leads to a reduction in the range of frequencies that the signal can accommodate, particularly at the higher end of the spectrum. This process causes high frequencies to be folded back to low frequencies and superposed onto the original low frequencies. For a one-dimensional signal down-sampled with stride s, this superposition of high and low frequencies can be mathematically formulated as

F'(u) = \mathbb {S}(F(u), F\left (u + \frac {a \Omega _x}{s}\right )) \mid u \in \left (0, \frac {\Omega _x}{s}\right ), (4)

where S denotes the superposition operator, Ω_x is the sampling rate along the x axis, and a = 1, . . . , s − 1.

Lemma 1. For an image, spatial down-sampling with stride s can be viewed as partitioning the Fourier spectrum into s × s equal patches and then uniformly superimposing these patches with an average scaling of 1/s².

\label {eq:superpose} \operatorname {DFT}\left (\operatorname {Down}_{s} (F(x, y))\right ) = \frac {1}{s^2} \sum _{i=0}^{s-1} \sum _{j=0}^{s-1} F_{(i,j)}(u, v), (5)

where F_(i,j)(u, v) is the sub-matrix obtained by equally splitting F(u, v) into s × s non-overlapping patches, with i, j ∈ {0, 1, . . . , s − 1}.
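As a quick numerical sanity check of Lemma 1, the following NumPy sketch verifies Eq. (5) on a random feature map (illustrative sizes; note that the 1/s² factor in Eq. (5) corresponds to NumPy's unnormalized FFT convention):

```python
import numpy as np

M = N = 64
s = 4
F = np.random.randn(M, N)

D = F[::s, ::s]                                   # spatial down-sampling with stride s

F_hat = np.fft.fft2(F)                            # full-resolution spectrum
# Split the spectrum into an s x s grid of (M/s) x (N/s) patches F_(i,j).
patches = F_hat.reshape(s, M // s, s, N // s).transpose(0, 2, 1, 3)

lhs = np.fft.fft2(D)                              # DFT(Down_s(F))
rhs = patches.sum(axis=(0, 1)) / s**2             # (1/s^2) * sum of the patches, Eq. (5)
print(np.allclose(lhs, rhs))                      # True
```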

The proofs of Theorem 1 and Lemma 1 are provided in the Appendix (Sec. A.1 and Sec. A.2). They describe the shuffling and superposing [34, 47, 51] in the frequency domain imposed by spatial down-sampling. If we transform Eq. (3) to the frequency domain and follow the conclusion of Lemma 1, we can obtain:
\label{eq:eq_to_frequency_domain}
\left( \frac{1}{s^2} \sum_{i=0}^{s-1} \sum_{j=0}^{s-1} F_{(i,j)}(u, v) \right) \odot k(u, v)    ← Left side of Eq. (3)
  = \frac{1}{s^2} \sum_{i=0}^{s-1} \sum_{j=0}^{s-1} \left( F_{(i,j)}(u, v) \odot k(u, v) \right)    (6)
  = \frac{1}{s^2} \sum_{i=0}^{s-1} \sum_{j=0}^{s-1} \left( F_{(i,j)}(u, v) \odot k'_{(i,j)}(u, v) \right),    ← Right side of Eq. (3)
[Figure 3: a random 5×5 convolution kernel, its 20×20 dilated version (dilation factor 4), the DFT of the 5×5 kernel, the DFT of the dilated kernel, and a cropped DFT, illustrating the periodic repetition.]

Fig. 3: We visualize a random 5 × 5 kernel for clarity. The Fourier spectrum of its dilated kernel, with a dilation factor of 4, clearly demonstrates a periodic character. Note that we also pad zeros to the right and bottom sides of the dilated kernel, which differs from the conventional use; this does not impact the outcome in practical applications.

where k(u, v) and k′(u, v) denote the Fourier transforms of the kernels k and k′, respectively, and ⊙ is element-wise multiplication. Eq. (6) suggests that the Fourier spectrum of the ideal convolution kernel k′ should be the one stitched together from s × s copies of the Fourier spectrum of the convolution kernel k. In other words, there should be a periodic repetition in the Fourier spectrum of k′, and the repetitive pattern is the Fourier spectrum of k.
Fortunately, the widely used dilated convolution perfectly meets this requirement. Suppose a kernel k(m, n) has size M × N; its dilated version is k_{d_h,d_w}(m, n), with dilation factors (d_h, d_w). For any integer multiple of d_h, namely p' = pd_h, and any integer multiple of d_w, namely q' = qd_w, the exponential term of the dilated kernel in the 2D DFT (Eq. (1)) becomes:

e^{-j2\pi \left (\frac {p'm}{d_hM} + \frac {q'n}{d_wN}\right )} = e^{-j2\pi \left (\frac {pm}{M} + \frac {qn}{N}\right )}, (7)

which is periodic in the frequency domain, with a period of M along the frequency axis indexed by p and a period of N along the axis indexed by q. It indicates that a dilated convolution kernel parameterized by the original kernel k, with dilation factor (H/h, W/w), is the ideal convolution kernel k′. In Fig. 3, we visually demonstrate the periodic repetition in the spectrum of a dilated kernel. We note that [15] also uses a dilated operation. In contrast to [15], which stems from empirical observation, our work begins with frequency analysis and provides theoretical justification for its effectiveness.
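This periodic tiling can be checked numerically; below is a short, self-contained NumPy sketch (sizes and variable names are illustrative, not from the paper):

```python
import numpy as np

k = np.random.randn(5, 5)

# Dilate with factor 4 (zeros inserted between taps, padded to 20 x 20 as in Fig. 3).
k_d = np.zeros((20, 20))
k_d[::4, ::4] = k

# The spectrum of the dilated kernel is the spectrum of k tiled 4 x 4 times.
print(np.allclose(np.fft.fft2(k_d), np.tile(np.fft.fft2(k), (4, 4))))   # True
```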

3.3 Scale Consistency via Low-pass Filtering

However, in practice, dilated convolution alone cannot fully resolve the issue of pattern repetition. As shown in Fig. 4a (top left), the repetition is significantly reduced, but certain fine details, like the horse's legs, still present issues.
[Figure 4: prompt "a professional photograph of an astronaut riding a horse". Panels: (a) images generated with only the dilated convolution vs. with the dilated convolution plus the low-pass filter; (b)(c) relative log amplitude vs. frequency for features of the down, middle, and up blocks at reverse steps 1, 25, and 50, comparing low-resolution, high-resolution, and filtered high-resolution features, without (b) and with (c) filtering.]

Fig. 4: (a) Visual comparisons between the images created at a resolution of 2048 ×
2048: with only the dilated convolution, and with both the dilated convolution and the
low-pass filtering. (b)(c) Fourier relative log amplitudes of input features from three
distinct layers from the down blocks, mid blocks, and up blocks of UNet, respectively,
are analyzed. We also include features at reverse steps 1, 25, and 50. (b) Without the
application of the low-pass filter. There is an evident distribution gap of the frequency
spectrum between the low resolution and high resolution. (c) With the application of
the low-pass filter. The distribution gap is largely reduced.

This phenomenon arises from the aliasing effect caused by spatial down-sampling, which widens the distribution gap between the features at the low resolution and the features down-sampled from the high resolution, as presented in Fig. 4b. Aliasing alters the fundamental frequency components of the original signal, breaking its consistency across scales.
In this paper, we introduce a low-pass filtering operation, or spectral pool-
ing [35] to remove high-frequency components that might cause aliasing, intend-
ing to construct scale consistency across different resolutions. Let F (m, n) be a
two-dimensional discrete signal with resolution M \times N . Spatial down-sampling of
F (m, n), by factors s_h and s_w along the height and width respectively, alters the
Nyquist limits to M/(2s_h) and N/(2s_w) in the frequency domain, corresponding
to half the new sampling rates along each dimension. The expected low-pass filter
should remove frequencies above these new Nyquist limits to prevent aliasing.
Therefore, the optimal mask size (assuming the frequency spectrum is central-
ized) for passing low frequencies in a low-pass filter is M/s_h \times N/s_w . This filter
design ensures the preservation of all valuable frequencies within the downscaled
resolution while preventing aliasing by filtering out higher frequencies.
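As a concrete illustration, here is a minimal NumPy sketch of such an ideal mask in a centered (fftshifted) layout; the function name, sizes, and centering convention are illustrative assumptions:

```python
import numpy as np

def ideal_lowpass_mask(M, N, s_h, s_w):
    """Keep an (M/s_h) x (N/s_w) box of low frequencies around the centered DC component;
    everything above the new Nyquist limits M/(2*s_h) and N/(2*s_w) is removed."""
    mask = np.zeros((M, N))
    ch, cw = M // 2, N // 2
    hh, hw = M // (2 * s_h), N // (2 * s_w)
    mask[ch - hh:ch + hh, cw - hw:cw + hw] = 1.0
    return mask

# Filter a stand-in feature map of a 4x up-scaled generation (s_h = s_w = 4).
x = np.random.randn(256, 256)
H = ideal_lowpass_mask(256, 256, 4, 4)
x_low = np.fft.ifft2(np.fft.ifftshift(H * np.fft.fftshift(np.fft.fft2(x)))).real
```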

Algorithm 1 Pseudo-code of FouriScale

Data: Input: F ∈ R^{C×H_f×W_f}. Original size: h_f × w_f.
Result: Output: F_conv ∈ R^{C×H_f×W_f}
r = max(⌈H_f / h_f⌉, ⌈W_f / w_f⌉)
F_pad ← Zero-Pad(F) ∈ R^{C×r·h_f×r·w_f}        ▷ Zero padding
F_dft ← DFT(F_pad) ∈ C^{C×r·h_f×r·w_f}         ▷ Discrete Fourier transform
F_low ← H ⊙ F_dft                              ▷ Low-pass filtering
F_idft ← iDFT(F_low)                           ▷ Inverse Fourier transform
F_crop ← Crop(F_idft) ∈ R^{C×H_f×W_f}          ▷ Cropping
F_conv ← Conv_{k′}(F_crop)                     ▷ Dilation factor of k′ is r
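For reference, a minimal PyTorch sketch of the operation summarized in Algorithm 1 is given below; the function name, the [B, C, H_f, W_f] tensor layout, and the way the low-pass mask is supplied are our own assumptions for illustration, not the released implementation.

```python
import math
import torch
import torch.nn.functional as TF

def fouriscale_conv(feat, weight, bias, h_f, w_f, lowpass):
    """Sketch of Algorithm 1. `feat` is [B, C, H_f, W_f]; `weight`/`bias` come from the
    pre-trained Conv_k; `lowpass` is the filter H with shape [r*h_f, r*w_f] in the
    unshifted frequency layout (assumptions, see the lead-in)."""
    B, C, H_f, W_f = feat.shape
    r = max(math.ceil(H_f / h_f), math.ceil(W_f / w_f))

    # Zero-pad (right/bottom) to r*h_f x r*w_f.
    feat = TF.pad(feat, (0, r * w_f - W_f, 0, r * h_f - H_f))

    # iDFT(H ⊙ DFT(F)), then crop back to the incoming spatial size.
    feat = torch.fft.ifft2(lowpass * torch.fft.fft2(feat)).real
    feat = feat[..., :H_f, :W_f]

    # Convolution with the dilated kernel k' (dilation factor r), "same" padding for odd kernels.
    k = weight.shape[-1]
    return TF.conv2d(feat, weight, bias, padding=(k - 1) * r // 2, dilation=r)
```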

As illustrated in Fig. 4c, the application of the low-pass filter results in a


closer alignment of the frequency distribution between high and low resolutions.
This ensures that the left side of Eq. (3) produces a plausible image structure.
Additionally, since our target is to rectify the image structure, low-pass filtering
would not be harmful because it generally preserves the structural information of
a signal, which predominantly resides in the lower frequency components [30,48].
Subsequently, the final kernel k ∗ is obtained by applying low-pass filtering to
the dilated kernel. Considering the periodic nature of the Fourier spectrum asso-
ciated with the dilated kernel, the Fourier spectrum of the new kernel k ∗ involves
expanding the spectrum of the original kernel k by inserting zero frequencies.
Therefore, this expansion avoids the introduction of new frequency components
into the new kernel k ∗ . In practice, we do not directly calculate the kernel k ∗
but replace the original Convk with the following equivalent operation to ensure
computational efficiency:

\label {eq:final_operation} \operatorname {Conv}_k(F) \rightarrow \operatorname {Conv}_{k'}(\operatorname {iDFT}(H \odot \operatorname {DFT}(F))), (8)

where H denotes the low-pass filter. Fig. 4a (bottom left) illustrates that the
combination of dilated convolution and low-pass filtering resolves the issue of
pattern repetition.

3.4 Adaption to Arbitrary-size Generation


The derived conclusion is applicable only when the aspect ratios of the high-
resolution image and the low-resolution image used in training are identical.
From Eq. (5) and Eq. (6), it becomes apparent that when the aspect ratios
vary, meaning the dilation rates along the height and width are different, the
well-constructed structure in the low-resolution image would be distorted and
compressed, as shown in Fig. 5 (a). Nonetheless, in real-world applications, the
ideal scenario is for a pre-trained diffusion model to have the capability of gen-
erating arbitrary-size images.
We introduce a straightforward yet efficient approach, termed padding-then-
cropping, to solve this problem. Fig. 5 (b) demonstrates its effectiveness. In
essence, when a layer receives an input feature at a standard resolution of hf ×wf ,
[Figure 5: prompt "A car in a garden, with a lake and Eiffel Tower"; (a) without and (b) with the padding-then-cropping strategy.]

Fig. 5: Visual comparisons between the images created at a resolution of 2048 × 1024:
(a) without the application of padding-then-cropping strategy, and (b) with the appli-
cation of padding-then-cropping strategy. The Stable Diffusion 2.1 utilized is initially
trained on images of 512 × 512 resolution.

and this input feature increases to a size of H_f × W_f during inference, our first step is to zero-pad the input feature to a size of rh_f × rw_f. Here, r is defined as the maximum of ⌈H_f/h_f⌉ and ⌈W_f/w_f⌉, with ⌈·⌉ representing the ceiling operation.
The padding operation assumes that we aim to generate an image of size rh×rw,
where certain areas are filled with zeros. Subsequently, we apply Eq. (8) to rectify
the issue of repetitive patterns in the higher-resolution output. Ultimately, the
obtained feature is cropped to restore its intended spatial size. This step is
necessary to not only negate the effects of zero-padding but also control the
computational demands when the resolution increases, particularly those arising
from the self-attention layers in the UNet architecture. Taking computational
efficiency into account, our equivalent solution is outlined in Algorithm 1.
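As a small worked example of this step (pixel sizes are used for clarity; the actual UNet features are proportionally smaller):

```python
import math

h, w = 512, 512           # training resolution of SD 2.1
H, W = 2048, 1024         # target resolution with a different aspect ratio
r = max(math.ceil(H / h), math.ceil(W / w))
print(r, (r * h, r * w))  # 4 (2048, 2048): pad as if generating 2048x2048, then crop to 2048x1024
```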

3.5 FouriScale Guidance


FouriScale effectively mitigates structural distortion when generating high-res
images. However, it would introduce certain artifacts and unforeseen patterns in
the background, as depicted in Fig. 6 (b). Based on our empirical findings, we
identify that the main issue stems from the application of low-pass filtering when
generating the conditional estimation in classifier-free guidance [21]. This process
often leads to a ringing effect and a loss of detail. To improve image quality and reduce artifacts, as shown in Fig. 6 (a), we develop a guided version of FouriScale, which uses the FouriScale estimation as a reference and aligns a detail-rich output with it. Specifically, beyond the unconditional and conditional estimations derived from the UNet modified by FouriScale, we further generate an extra conditional estimation. This one is subjected to identical dilated convolutions but utilizes milder low-pass filters to accommodate more frequencies. We substitute the attention maps of its attention layers with those from the conditional estimation processed through FouriScale, in a similar spirit to image editing [6, 12, 17]. Given that UNet's
attention maps hold a wealth of positional and structural information [44, 45,
49], this strategy allows for the incorporation of correct structural information
derived from FouriScale to guide the generation, simultaneously mitigating the
[Figure 6 (a): diagram of FouriScale guidance. The conditional estimation ε_θ(x_t, c_+) is produced by the FouriScale-modified denoising net with a milder low-pass filter, with its attention scores replaced by those of the standard FouriScale conditional branch; the unconditional estimation ε_θ(x_t, c_-) uses the negative condition c_- (or ∅); both are combined by CFG to obtain x_{t-1}. Prompt for (b)(c): "Two little dogs looking a large pizza sitting on a table".]

Fig. 6: (a) Overview of FouriScale guidance. CFG denotes Classifier-Free Guidance. (b)(c) Visual comparisons between the images created at 2048 × 2048 by SD 2.1: (b) without the application of FouriScale guidance, where ➊ has unexpected artifacts in the background and ➋➌ are wrong details; (c) with the application of FouriScale guidance.

decline in image quality and loss of details typically induced by low-pass filtering.
The final noise estimation is determined using both the unconditional and the
newly conditional estimations following classifier-free guidance. As we can see in
Fig. 6 (c), the aforementioned issues are largely mitigated.
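To make the final combination explicit, here is a small sketch of the classifier-free guidance step with the three estimations described above (the tensors are random stand-ins; the guidance scale w and the latent shape are illustrative assumptions):

```python
import torch

# Stand-ins for one denoising step on a 2048x2048 generation (SD latent shape 4 x 256 x 256):
eps_uncond = torch.randn(1, 4, 256, 256)  # unconditional branch, FouriScale with the standard low-pass filter
eps_cond   = torch.randn(1, 4, 256, 256)  # conditional branch, FouriScale (source of the attention maps)
eps_guided = torch.randn(1, 4, 256, 256)  # conditional branch with the milder filter, attention maps swapped in

w = 7.5  # classifier-free guidance scale (assumed value)
eps_final = eps_uncond + w * (eps_guided - eps_uncond)  # final noise estimation used for the update
```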

3.6 Detailed Designs


Annealing dilation and filtering. Since the image structure is primarily outlined in the early reverse steps, while the subsequent steps focus on enhancing details, we implement an annealing approach for both the dilated convolution and the low-pass filtering. Initially, for the first S_init steps, we employ the ideal dilated convolution and low-pass filtering. During the span from S_init to S_stop, we progressively decrease the dilation factor and r (as detailed in Algorithm 1) down to 1. After S_stop steps, the original UNet is utilized to refine image details further.
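A sketch of one possible annealing schedule follows; the paper only states that the dilation factor and r are progressively decreased between S_init and S_stop, so the linear decay and rounding below are our assumptions:

```python
def annealed_factor(step: int, r_max: int, s_init: int, s_stop: int) -> int:
    """Dilation factor / r used at a given reverse step (assumed linear decay)."""
    if step < s_init:
        return r_max                      # ideal dilated convolution and low-pass filtering
    if step >= s_stop:
        return 1                          # original UNet refines the details
    frac = (step - s_init) / (s_stop - s_init)
    return max(1, round(r_max + frac * (1 - r_max)))

# SD 2.1 at the 16x setting (r_max = 4) with S_init = 20, S_stop = 35 over 50 steps:
print([annealed_factor(t, 4, 20, 35) for t in (0, 20, 27, 34, 35, 49)])  # [4, 4, 3, 1, 1, 1]
```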

Settings for SDXL. Stable Diffusion XL [32] (SDXL) is generally trained on im-
ages with a resolution close to 1024 × 1024 pixels, accommodating various aspect
ratios simultaneously. Our observations reveal that using an ideal low-pass filter
leads to suboptimal outcomes for SDXL. Instead, a gentler low-pass filter, which modulates rather than completely eliminates high-frequency elements using a coefficient σ ∈ [0, 1] (set to 0.6 in our method), delivers superior visual quality. This phenomenon can be attributed to SDXL's ability to handle changes in
scale effectively, negating the need for an ideal low-pass filter to maintain scale
consistency, which confirms the rationale of incorporating low-pass filtering to
address scale variability. Additionally, for SDXL, we calculate the scale factor r (refer to Algorithm 1) by determining the training resolution whose aspect ratio is closest to that of the target resolution.

4 Experiments

Experimental setup. We follow [15] to report results on three text-to-image models, including SD 1.5 [12], SD 2.1 [10], and SDXL 1.0 [32], on generating images at
four higher resolutions. The resolutions tested are 4×, 6.25×, 8×, and 16× the
pixel count of their respective training resolutions. For both SD 1.5 and SD 2.1
models, the original training resolution is set at 512×512 pixels, while the infer-
ence resolutions are 1024×1024, 1280×1280, 2048×1024, and 2048×2048. In the
case of the SDXL model, it is trained at resolutions close to 1024×1024 pixels,
with the higher inference resolutions being 2048×2048, 2560×2560, 4096×2048,
and 4096×4096. We use FreeU [39] by default in all experimental settings.

Testing dataset and evaluation metrics. Following [15], we assess performance


using the Laion-5B dataset [38], which comprises 5 billion pairs of images and
their corresponding captions. For tests conducted at an inference resolution of
1024×1024, we select a subset of 30,000 images, each paired with randomly
chosen text prompts from the dataset. Given the substantial computational de-
mands, our sample size is reduced to 10,000 images for tests at inference resolu-
tions exceeding 1024×1024. We evaluate the quality and diversity of the gener-
ated images by measuring the Frechet Inception Distance (FID) [18] and Kernel
Inception Distance (KID) [4] between generated images and real images, denoted
as FIDr and KIDr . To show the methods’ capacity to preserve the pre-trained
model’s original ability at a new resolution, we also follow [15] to evaluate the
metrics between the generated images at the base training resolution and the
inference resolution, denoted as FIDb and KIDb .
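These metrics can be computed with standard implementations; for instance, a minimal torchmetrics-based sketch (the dummy uint8 batches stand in for real Laion-5B images and generated samples; batch sizes, subset_size, and the exact API usage are our assumptions):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # stand-in for real images
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # stand-in for generated images

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=32)   # subset_size must not exceed the sample count

fid.update(real, real=True);  fid.update(fake, real=False)
kid.update(real, real=True);  kid.update(fake, real=False)

print("FID_r:", fid.compute().item())
kid_mean, _ = kid.compute()
print("KID_r x 100:", 100 * kid_mean.item())    # the paper reports KID scaled by 10^2
```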

4.1 Quantitative Results

We compare our method with the vanilla text-to-image diffusion model (Vanilla),
the training-free approach [25] (Attn-Entro) that accounts for variations in at-
tention entropy between low and high resolutions, and ScaleCrafter [15], which
modifies convolution kernels through re-dilation and adopts linear transforma-
tions for kernel enlargement. We show the experimental results in Tab. 1. Com-
pared to the vanilla diffusion models, our method obtains much better results because it eliminates the issue of repetitive patterns. The Attn-Entro does
not work at high upscaling levels because it fails to fundamentally consider the
structural consistency across resolutions. Due to the absence of scale consis-
tency consideration in ScaleCrafter, it performs worse than our method on the
majority of metrics. Additionally, we observe that ScaleCrafter often struggles
Table 1: Quantitative comparisons among training-free methods. The best and second
best results are highlighted in bold and underline. The values of KIDr and KIDb are
scaled by 10^2.

                          |        SD 1.5          |        SD 2.1          |        SDXL 1.0
Resolution   Method       | FIDr↓ KIDr↓ FIDb↓ KIDb↓ | FIDr↓ KIDr↓ FIDb↓ KIDb↓ | FIDr↓  KIDr↓ FIDb↓  KIDb↓
4× 1:1       Vanilla      | 26.96 1.00  15.72 0.42  | 29.90 1.11  19.21 0.54  | 49.81  1.84  32.90  0.92
             Attn-Entro   | 26.78 0.97  15.64 0.42  | 29.65 1.10  19.17 0.54  | 49.72  1.84  32.86  0.92
             ScaleCrafter | 23.90 0.95  11.83 0.32  | 25.19 0.98  13.88 0.40  | 49.46  1.73  36.22  1.07
             Ours         | 23.62 0.92  10.62 0.29  | 25.17 0.98  13.57 0.40  | 33.89  1.21  20.10  0.47
6.25× 1:1    Vanilla      | 41.04 1.28  31.47 0.77  | 45.81 1.52  37.80 1.04  | 68.87  2.79  54.34  1.92
             Attn-Entro   | 40.69 1.31  31.25 0.76  | 45.77 1.51  37.75 1.04  | 68.50  2.76  54.07  1.91
             ScaleCrafter | 37.71 1.34  25.54 0.67  | 35.13 1.14  23.68 0.57  | 55.03  2.02  45.58  1.49
             Ours         | 30.27 1.00  16.71 0.34  | 30.82 1.01  18.34 0.42  | 44.13  1.64  37.09  1.16
8× 1:2       Vanilla      | 50.91 1.87  44.65 1.45  | 57.80 2.26  51.97 1.81  | 90.23  4.20  79.32  3.42
             Attn-Entro   | 50.72 1.86  44.49 1.44  | 57.42 2.26  51.67 1.80  | 89.87  4.15  79.00  3.40
             ScaleCrafter | 35.11 1.22  29.51 0.81  | 41.72 1.42  35.08 1.01  | 106.57 5.15  108.67 5.23
             Ours         | 35.04 1.19  26.55 0.72  | 37.19 1.29  27.69 0.74  | 71.77  2.79  70.70  2.65
16× 1:1      Vanilla      | 67.90 2.37  66.49 2.18  | 84.01 3.28  82.25 3.05  | 116.40 5.45  109.19 4.84
             Attn-Entro   | 67.45 2.35  66.16 2.17  | 83.68 3.30  81.98 3.04  | 113.25 5.44  106.34 4.81
             ScaleCrafter | 32.00 1.01  27.08 0.71  | 40.91 1.32  33.23 0.90  | 84.58  3.53  85.91  3.39
             Ours         | 30.84 0.95  23.29 0.57  | 39.49 1.27  28.14 0.73  | 56.66  2.18  49.59  1.63

to produce acceptable images for SDXL, leading to much lower performance


than ours. Conversely, our method is capable of generating images with plausi-
ble structures and rich details at various high resolutions, compatible with any
pre-trained diffusion models.
Furthermore, our method achieves better inference speed compared with
ScaleCrafter [15]. For example, under the 16× setting for SDXL, ScaleCrafter
takes an average of 577 seconds to generate an image, whereas our method,
employing a single NVIDIA A100 GPU, averages 540 seconds per image.

4.2 Qualitative Results

Fig. 7 presents a comprehensive visual comparison across various upscaling fac-


tors (4×, 8×, and 16×) with different pre-trained diffusion models (SD 1.5,
2.1, and SDXL 1.0). Our method demonstrates superior performance in pre-
serving structural integrity and fidelity compared to ScaleCrafter [15] and Attn-
Entro [25]. Besides, FouriScale maintains its strong performance across all three
pre-trained models, demonstrating its broad applicability and robustness. At 4×
upscaling, FouriScale faithfully reconstructs fine details like the intricate patterns
on the facial features of the portrait, and textures of the castle architecture. In
contrast, ScaleCrafter and Attn-Entro often exhibit blurring and loss of details.
As we move to more extreme 8× and 16× upscaling factors, the advantages of
FouriScale become even more pronounced. Our method consistently generates
images with coherent global structures and locally consistent textures across
diverse subjects, from natural elements to artistic renditions. The compared
methods still struggle with repetitive artifacts and distorted shapes.
[Figure 7: columns 4×, 8×, 16×; rows SD 1.5, SD 2.1, SDXL. Prompts (SD 1.5): "Side-view blue-ice sneaker inspired by Spiderman created by Weta FX.", "The image is titled 'Queen of the Robots' created by artists Greg Rutowski, Victo Ngai, and Alphonse Mucha.", "A painting of a koala wearing a princess dress and crown, with a confetti background."; (SD 2.1): "a castle is in the middle of a eurpean city", "A teddy bear mad scientist mixing chemicals depicted in oil painting style.", "Portrait of an anime maid by Krenz Cushart, Alphonse Mucha, and Ilya Kuvshinov."; (SDXL): "A watercolor portrait of a woman by Luke Rueda Studios and David Downton.", "A nighstand topped with a white landline phone, remote control, a metallic lamp, and a black hardcover book.", "Two cats, grey and black, are wearing steampunk attire and standing in front of a ship in a heavily detailed painting."]

Fig. 7: Visual comparisons between ➊ ours, ➋ ScaleCrafter [15] and ➌ Attn-Entro [25],
under settings of 4×, 8×, and 16×, employing three distinct pre-trained diffusion mod-
els: SD 1.5, SD 2.1, and SDXL 1.0.

4.3 Ablation Study

To validate the contributions of each component in our proposed method, we


conduct ablation studies on the SD 2.1 model generating 2048 × 2048 images.
First, we analyze the effect of using FouriScale Guidance as described in
Sec. 3.5. We compare the default FouriScale which utilizes guidance versus re-
moving the guidance and solely relying on the conditional estimation from the
FouriScale-modified UNet. As shown in Tab. 2, employing guidance improves the
FIDr by 4.26, demonstrating its benefits for enhancing image quality. The guid-
ance allows incorporating structural information from the FouriScale-processed
estimation to guide the generation using a separate conditional estimation with
milder filtering. This balances between maintaining structural integrity and pre-
venting loss of details.
Furthermore, we analyze the effect of the low-pass filtering operation de-
scribed in Sec. 3.3. Using the FouriScale without guidance as the baseline, we
additionally remove the low-pass filtering from all modules. As shown in Tab. 2,
this further deteriorates the FIDr to 46.74. The low-pass filtering is crucial for
Table 2: Ablation studies on FouriScale components on the SD 2.1 model under the 16× 1:1 setting.

Method                      FIDr
FouriScale                  39.49
w/o guidance                43.75
w/o guidance & filtering    46.74

[Figure 8: prompt "A tall giraffe in a zoo eating branches"; panels show mask sizes M/2×N/2, M/4×N/4 (Ours), and M/6×N/6.]
Fig. 8: Comparison of mask sizes for passing low frequencies when generating 2048² images by SD 2.1. M, N denote the height and width of the target resolution.

maintaining scale consistency across resolutions and preventing aliasing effects


that introduce distortions. Without it, the image quality degrades significantly.
A visual result of comparing the mask sizes for passing low frequencies is
depicted in Fig. 8. The experiment utilizes SD 2.1 (trained with 512×512 images)
to generate images of 2048×2048 pixels, setting the default mask size to M/4 ×
N/4. We can find that the optimal visual result is achieved with our default
settings. As the low-pass filter changes, there is an evident deterioration in the
visual appearance of details, which underscores the validity of our method.

5 Conclusion and Limitation

We present FouriScale, a novel approach that enhances the generation of high-


resolution images from pre-trained diffusion models. By addressing key chal-
lenges such as repetitive patterns and structural distortions, FouriScale intro-
duces a training-free method based on frequency domain analysis, improving
structural and scale consistency across different resolutions by a dilation oper-
ation and a low-pass filtering operation. The incorporation of a padding-then-
cropping strategy and the application of FouriScale guidance enhance the flex-
ibility and quality of text-to-image generation, accommodating different aspect
ratios while maintaining structural integrity. FouriScale’s simplicity and adapt-
ability, avoiding any extensive pre-computation, set a new benchmark in the
field. FouriScale still faces challenges in generating ultra-high-resolution sam-
ples, such as 4096×4096 pixels, which typically exhibit unintended artifacts.
Additionally, its focus on operations within convolutions limits its applicability
to purely transformer-based diffusion models.

Appendix

A Proof

A.1 Proof of Theorem 1

Let us consider f(x) as a one-dimensional signal. Its down-sampled counterpart is represented by f′(x) = Down_s(f). To understand the connection between f′(x) and f(x), we base our analysis on the underlying continuous signal g(x) from which both are produced using a sampling function. Note that the sampling function sa(x, ∆T) is an infinite train of impulse units, with adjacent impulses separated by intervals of ∆T:

\label {eq:1d_sampling} sa(x, \Delta T) = \sum _{n=-\infty }^{\infty } \delta (x - n\Delta T). (1)

Based on Eq. (1), f (x) and f ′ (x) can be formulated as

\begin {aligned} f(x) &= g(x) sa(x, \Delta T), \\ f'(x) &= g(x) sa(x, s\Delta T). \end {aligned}
(2)

Based on the Fourier transform and the convolution theorem, the spatial
sampling described above can be represented in the Fourier domain as follows:

\label{eq:freq_sampling}
F(u) = G(u) \circledast SA(u, \Delta T)
     = \int_{-\infty}^{\infty} G(\tau) SA(u - \tau, \Delta T) d\tau
     = \frac{1}{\Delta T} \sum_{n} \int_{-\infty}^{\infty} G(\tau) \delta\left( u - \tau - \frac{n}{\Delta T} \right) d\tau    (3)
     = \frac{1}{\Delta T} \sum_{n} G\left( u - \frac{n}{\Delta T} \right),

where G(u) and SA(u, \Delta T) are the Fourier transforms of g(x) and sa(x, \Delta T), respectively. From the above equation, it can be observed that spatial sampling introduces periodicity into the spectrum, with period \frac{1}{\Delta T}.
Note that the sampling rates of f(x) and f'(x) are \Omega_x and \Omega'_x; the relationship between them can be written as

\Omega _x = \frac {1}{\Delta T}, \quad \Omega '_x = \frac {1}{s\Delta T} = \frac {1}{s}\Omega _x. (4)

With the down-sampling process in consideration, we presume that f(x)


complies with the Nyquist sampling theorem, suggesting that u_{max} < \frac {\Omega _x}{2} .

Following down-sampling, as per the Nyquist sampling theorem, the entire sub-frequency range is confined to (0, \frac{\Omega_x}{s}). The resulting frequency band is a composite of s initial bands, expressed as:

\label {eq:s_superpose} F'(u) = \mathbb {S}(F(u), F(\tilde {u}_1), \ldots , F(\tilde {u}_{s-1})), (5)

where \tilde{u}_i represents the frequencies higher than the sampling rate, while u denotes the frequencies that are lower than the sampling rate. The symbol \mathbb{S} stands for the superposition operator. To simplify the discussion, \tilde{u} will be used to denote \tilde{u}_i in subsequent sections.
(1) In the sub-band where u \in (0, \frac{\Omega_x}{2s}), \tilde{u} should satisfy

\label {eq:aliasing_theorem1} \quad \tilde {u} \in \left (\frac {\Omega _x}{2s}, u_{max}\right ). (6)

According to the aliasing theorem, the high frequency \tilde{u} is folded back to the low frequency:

\label {eq:aliasing_theorem2} \hat {u} = \left | \tilde {u} - (k + 1)\frac {\Omega '_x}{2} \right |, \quad k\frac {\Omega '_x}{2} \leq \tilde {u} \leq (k + 2)\frac {\Omega '_x}{2} (7)

where k = 1, 3, 5, \ldots and \hat{u} is the folded result of \tilde{u}.
According to Eq. 6 and Eq. 7, we have

\label {eq:u_hat_range} \hat {u} = \frac {a\Omega _x}{s} - \tilde {u} \quad \text {and} \quad \hat {u} \in \left (\frac {\Omega _x}{s} - u_{max}, \frac {\Omega _x}{2s}\right ), (8)

where a = (k+1) / 2 = 1, 2, \ldots . According to Eq. (5) and Eq. (8), we can attain

\label {eq:fu_case} F'(u) = \begin {cases} F(u) & \text {if } u \in (0, \frac {\Omega _x}{s} - u_{max}), \\ \mathbb {S}(F(u), F(\frac {a\Omega _x}{s} - u)) & \text {if } u \in (\frac {\Omega _x}{s} - u_{max}, \frac {\Omega _x}{2s}). \end {cases} (9)

According to Eq. (3), F(u) is symmetric with respect to u = \frac {\Omega _x}{2} :

\label {eq:symmetry} F(\frac {\Omega _x}{2} - u) = F(u + \frac {\Omega _x}{2}). (10)

Therefore, we can rewrite F(\frac{a\Omega_x}{s} - u) as:

\label {eq:symmetry_transfer} \begin {aligned} &F(\frac {\Omega _x}{2} - (\frac {\Omega _x}{2}+u-\frac {a\Omega _x}{s})) \\ = &F(\frac {\Omega _x}{2} + (\frac {\Omega _x}{2}+u-\frac {a\Omega _x}{s})) \\ = &F(u + \Omega _x -\frac {a\Omega _x}{s}) \\ = &F(u + \frac {a\Omega _x}{s}) \end {aligned}

(11)

since a = 1, 2, \ldots , s-1. Additionally, for s = 2, the condition u \in (0, \frac {\Omega _x}{s} - u_{max})
results in F(u + \frac {\Omega _x}{s}) = 0. When s > 2, the range u \in (0, \frac {\Omega _x}{s} - u_{max}) typically
becomes non-existent. Thus, in light of Eq. (11) and the preceding analysis,
Eq. (9) can be reformulated as

\label {eq:theorem_prove1} F'(u) = \mathbb {S}(F(u), F(u + \frac {a\Omega _x}{s})) \mid u \in (0, \frac {\Omega _x}{2s}). (12)

(2) In the sub-band where u \in (\frac{\Omega_x}{2s}, \frac{\Omega_x}{s}), different from (1), \tilde{u} should satisfy

\tilde {u} \in (\frac {\Omega _x}{s} - u_{max}, \frac {\Omega _x}{2s}). (13)

Similarly, we can obtain:

\label {eq:theorem_prove2} F'(u) = \mathbb {S}(F(\tilde {u}), F(u + \frac {a\Omega _x}{s})) \mid u \in (\frac {\Omega _x}{2s}, \frac {\Omega _x}{s}). (14)

Combining Eq. (12) and Eq. (14), we obtain

F'(u) = \mathbb {S}(F(u), F(u + \frac {a\Omega _x}{s})) \mid u \in (0, \frac {\Omega _x}{s}), (15)

where a = 1, 2, \ldots , s-1.

A.2 Proof of Lemma 1


Based on Eq. (3), it can be determined that the amplitude of F' is \frac{1}{s} times that of F. Hence, F'(u) can be expressed as:

F'(u) = \frac {1}{s}F(u) + \sum _a \frac {1}{s}F\left (u + \frac {a\Omega _x}{s}\right ) \mid u \in \left (0, \frac {\Omega _x}{s}\right ). (16)

Based on the dual principle, we can prove F'(u, v) in the whole sub-band

F'(u,v) = \frac {1}{s^2} \left (\sum _{a,b=0}^{s-1} F\left (u + \frac {a\Omega _x}{s}, v + \frac {b\Omega _y}{s} \right )\right ), (17)

where u \in \left (0, \frac {\Omega _x}{s}\right ) , v \in \left (0, \frac {\Omega _y}{s}\right ) .

B Implementation Details
B.1 Low-pass Filter Definition
In Fig. 1, we show the design of a low-pass filter used in FouriScale. Inspired
by [34, 41], we define the low-pass filter as the outer product between two 1D
filters (depicted in the left of Fig. 1), one along the height dimension and one
along the width dimension. We define the function of the 1D filter for the height
[Figure 1: left, the 1D filter profile along the positive axis, with the low-pass region up to H/(2s_h), a smooth region of width R_h, and the high-frequency coefficient σ; right, a toy 2D low-pass filter with W = H = 64, s_h = s_w = 4, R_h = R_w = 8, σ = 0.]
Fig. 1: Visualization of the design of a low-pass filter. (a) 1D filter for the positive axis. (b) 2D low-pass filter, constructed by mirroring the 1D filters and performing an outer product between the two 1D filters, in accordance with the settings of the 1D filter.

dimension as follows, filters for the width dimension can be obtained in the same
way:

\text {mask}^h_{(s_{h},R_h,\sigma )} = \min \left ( \max \left ( \frac {1 - \sigma }{R_h} \left ( \frac {H}{s_{h}} + 1 - i \right ) + 1, \sigma \right ), 1 \right ), i \in [0,\frac {H}{2}], (18)

where sh denotes the down-sampling factor between the target and original res-
olutions along the height dimension. Rh controls the smoothness of the filter and
σ is the modulation coefficient for high frequencies. Exploiting the conjugate symmetry of the spectrum, we only consider the positive axis; the whole 1D filter can be obtained by mirroring. We build the 2D low-pass filter as the outer product between the two 1D filters:

\text {mask}(s_{h}, s_{w}, R_h, R_w, \sigma ) = \text {mask}^h_{(s_{h},R_h,\sigma )} \otimes \text {mask}^w_{(s_{w},R_w,\sigma )}, (19)

where ⊗ denotes the outer product operation. Likewise, the whole 2D filter can
be obtained by mirroring along the height and width axes. A toy example of a
2D low-pass filter is shown in the right of Fig. 1.
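A minimal NumPy transcription of Eq. (18) and Eq. (19) is given below for reference; the mirroring convention (centered layout with the DC component at index size//2) and the function names are our assumptions:

```python
import numpy as np

def mask_1d(size, stride, R, sigma):
    """Eq. (18) on the positive half-axis, then mirrored to the full (centered) axis."""
    i = np.arange(size // 2 + 1)
    half = np.clip((1 - sigma) / R * (size / stride + 1 - i) + 1, sigma, 1.0)
    return np.concatenate([half[:0:-1], half])[:size]

def mask_2d(H, W, s_h, s_w, R_h, R_w, sigma):
    """Eq. (19): outer product of the height and width 1D filters."""
    return np.outer(mask_1d(H, s_h, R_h, sigma), mask_1d(W, s_w, R_w, sigma))

# Toy example matching Fig. 1: H = W = 64, s_h = s_w = 4, R_h = R_w = 8, sigma = 0.
H_filter = mask_2d(64, 64, 4, 4, 8, 8, 0.0)
# Apply to a centered spectrum, e.g.:
#   np.fft.ifft2(np.fft.ifftshift(H_filter * np.fft.fftshift(np.fft.fft2(x)))).real
```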

B.2 Hyper-parameter Settings


In this section, we detail our choice of hyper-parameters; the exact values are listed in Tab. 1. Additionally, Fig. 2 provides a visual guide to the precise positioning of the various blocks within the U-Net architecture employed in our model.
The dilation factor used in each FouriScale layer is determined by the max-
imum value of the height and width scale relative to the original resolution. As
stated in our main manuscript, we employ an annealing strategy. For the first
[Figure 2: UNet block layout, with relative feature scales 1×, 1/4×, 1/16×, and 1/64×, for blocks DB0, DB1, DB2, DB3, MB, UB0, UB1, UB2, UB3.]

Fig. 2: Reference block names of stable diffusion in the following experiment details.

Sinit steps, we employ the ideal dilation convolution and low-pass filtering. Dur-
ing the span from Sinit to Sstop , we progressively decrease the dilation factor and
r (as detailed in Algorithm 1 of our main manuscript) down to 1. After Sstop
steps, the original UNet is utilized to refine image details further. The settings
for Sinit and Sstop are shown in Tab. 1.

Table 1: Experiment settings for SD 1.5, SD 2.1, and SDXL 1.0.

Params                SD 1.5 & SD 2.1                          SDXL 1.0
FouriScale blocks     [DB2, DB3, MB, UB0, UB1, UB2]            [DB2, MB, UB0, UB1]
inference timesteps   50                                       50
[S_init, S_stop]      [10, 30] (4× 1:1 and 6.25× 1:1)          [20, 35]
                      [20, 35] (8× 1:2 and 16× 1:1)

C More Experiments

C.1 Comparison with Diffusion Super-Resolution Method

In this section, we compare the performance of our proposed method with a cas-
caded pipeline, which uses SD 2.1 to generate images at the default resolution
of 512×512, and upscale them to 2048×2048 by a pre-trained diffusion super-
resolution model, specifically the Stable Diffusion Upscaler-4× [43]. We apply
this super-resolution model to a set of 10,000 images generated by SD 2.1. We
then evaluate the FIDr and KIDr scores of these upscaled images and compare
them with images generated at 2048×2048 resolution using SD 2.1 equipped
with our FouriScale. The results of this comparison are presented in Tab. 2. Our
method obtains somewhat worse results than the cascaded method. However,
our method is capable of generating high-resolution images in only one stage,
without the need for a multi-stage process. Besides, our method does not need

Fig. 3: Visual comparison with SD+SR. Left: 2048×2048 image upscaled by SD+SR
from 512×512 SD 2.1 generated image. Right: 2048×2048 image generated by our
FouriScale with SD 2.1.

Table 2: Comparison with SD + Super-Resolution.

Method                   FIDr    KIDr
SD + Super-Resolution    25.94   0.91
Ours                     39.49   1.27

Table 3: Comparison with ElasticDiffusion on the SDXL 2048×2048 setting.

Method                   FIDr    KIDr    FIDb    KIDb
ElasticDiffusion [14]    52.02   3.03    40.46   2.22
Ours                     33.89   1.21    20.10   0.47

model re-training, while the SR model demands extensive data and computa-
tional resources for training. More importantly, as shown in Fig. 3, we find that
our method can generate much better details than the cascaded pipeline. Due to a lack of generative prior knowledge, the super-resolution model can only exploit information within the single image when upscaling it, resulting in an over-smooth appearance. However, our method can effectively upscale images
and fill in details using generative priors with a pre-trained diffusion model, pro-
viding valuable insights for future explorations into the synthesis of high-quality
and ultra-high-resolution images.

C.2 Comparison with ElasticDiffusion

We observe that the recent approach, ElasticDiffusion [14], has established a


technique to equip pre-trained diffusion models with the capability to generate
images of arbitrary sizes, both smaller and larger than the resolution used during
training. Here, we provide the comparison with ElasticDiffusion [14] on the SDXL
2048×2048 setting. The results are shown in Tab. 3. First, it is important to note that the inference times for ElasticDiffusion are approximately 4 to 5 times longer than ours. Moreover, our method demonstrates superior
Fig. 4: Visualization of high-resolution images (from 1024×1024 up to 4096×4096) generated by SD 2.1 integrated with customized LoRAs (images in red rectangles) and images generated by a personalized diffusion model, AnimeArtXL [1], which is based on SDXL.

performance across all evaluation metrics, achieving lower FID and KID scores
compared to ElasticDiffusion, indicating better image quality and diversity.
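
As a note on evaluation, the paper does not prescribe a specific FID/KID implementation; the sketch below uses torch-fidelity as one common option, with placeholder folder names and an assumed KID subset size.

    import torch_fidelity

    # Compute FID and KID between generated images and a reference set.
    metrics = torch_fidelity.calculate_metrics(
        input1="outputs_fouriscale_2048",  # placeholder: folder of generated images
        input2="reference_images",         # placeholder: folder of reference images
        fid=True,
        kid=True,
        kid_subset_size=1000,              # assumption, not taken from the paper
    )
    print(metrics["frechet_inception_distance"],
          metrics["kernel_inception_distance_mean"])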

D More Visualizations

D.1 LoRAs

In Fig. 4, we present the high-resolution images produced by SD 2.1 integrated with customized LoRAs [23] from Civitai [8]. As can be seen, our method can be effectively applied to diffusion models equipped with LoRAs.
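
As an illustration, attaching a community LoRA in the diffusers library requires only one extra call. The repository path, LoRA file name, and the apply_fouriscale hook below are hypothetical placeholders; the paper does not prescribe a specific integration API.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
    ).to("cuda")

    # Load a customized LoRA downloaded from Civitai (placeholder file name).
    pipe.load_lora_weights("path/to/civitai_lora.safetensors")

    # apply_fouriscale is a hypothetical helper that would patch the UNet's
    # convolutions (dilation + low-pass filtering) as described in the main text:
    # apply_fouriscale(pipe.unet, blocks=["DB2", "DB3", "MB", "UB0", "UB1", "UB2"])

    image = pipe("an anime-style castle at dusk", height=2048, width=2048).images[0]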

D.2 Other Resolutions

In Fig. 5, we present more images generated by SD 2.1 at resolutions beyond the 4×, 6.25×, 8×, and 16× settings. Our approach is capable of generating high-quality images of arbitrary aspect ratios and sizes.

References
1. AnimeArtXL: (2024), https://civitai.com/models/117259/anime-art-diffusion-xl, accessed: 17, 01, 2024 22
2. Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila,
T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an
ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022) 3
3. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths
for controlled image generation. arXiv preprint arXiv:2302.08113 (2023) 2, 3
4. Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans.
In: International Conference on Learning Representations (2018) 12
Fig. 5: More generated images using FouriScale and SD 2.1 with arbitrary resolutions (including 512×1280, 768×768, 768×2816, 1024×2048, 1152×768, 1536×1280, 1920×640, 1920×1408, 2048×768, and 3072×768).

5. Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis,
K.: Align your latents: High-resolution video synthesis with latent diffusion models.
In: CVPR. pp. 22563–22575 (2023) 3
6. Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mu-
tual self-attention control for consistent image synthesis and editing. arXiv preprint
arXiv:2304.08465 (2023) 10
7. Chen, T.: On the importance of noise scheduling for diffusion models. arXiv
preprint arXiv:2301.10972 (2023) 3
8. Civitai: (2024), https://civitai.com/, accessed: 17, 01, 2024 22
9. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS
34, 8780–8794 (2021) 3
10. Diffusion, S.: Stable diffusion 2-1 base. https://huggingface.co/stabilityai/
stable-diffusion-2-1-base/blob/main/v2-1_512-ema-pruned.ckpt (2022) 12
11. Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao,
Z., Yang, H., et al.: Cogview: Mastering text-to-image generation via transformers.
NeurIPS 34, 19822–19835 (2021) 1
12. Epstein, D., Jabri, A., Poole, B., Efros, A.A., Holynski, A.: Diffusion self-guidance
for controllable image generation. arXiv preprint arXiv:2306.00986 (2023) 10, 12
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., Bengio, Y.: Generative adversarial nets. NeurIPS 27 (2014) 1, 3
14. Haji-Ali, M., Balakrishnan, G., Ordonez, V.: Elasticdiffusion: Training-free arbi-
trary size image generation. arXiv preprint arXiv:2311.18822 (2023) 3, 21
15. He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen,
Q., Shan, Y.: Scalecrafter: Tuning-free higher-resolution visual generation with
diffusion models. arXiv preprint arXiv:2310.07702 (2023) 2, 4, 5, 7, 12, 13, 14
16. He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion mod-
els for high-fidelity video generation with arbitrary lengths. arXiv preprint
arXiv:2211.13221 (2022) 3
17. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-or, D.:
Prompt-to-prompt image editing with cross-attention control. In: ICLR (2022) 10
18. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained
by a two time-scale update rule converge to a local nash equilibrium. NeurIPS 30
(2017) 12
19. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33,
6840–6851 (2020) 1, 3, 4
20. Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded dif-
fusion models for high fidelity image generation. The Journal of Machine Learning
Research 23(1), 2249–2281 (2022) 3
21. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint
arXiv:2207.12598 (2022) 10
22. Hoogeboom, E., Heek, J., Salimans, T.: simple diffusion: End-to-end diffusion for
high resolution images. arXiv preprint arXiv:2301.11093 (2023) 3
23. Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.:
Lora: Low-rank adaptation of large language models. In: International Conference
on Learning Representations (2021) 22
24. Jiménez, Á.B.: Mixture of diffusers for scene composition and high resolution image
generation. arXiv preprint arXiv:2302.02412 (2023) 2, 3
25. Jin, Z., Shen, X., Li, B., Xue, X.: Training-free diffusion model adaptation for
variable-sized text-to-image synthesis. arXiv preprint arXiv:2306.08645 (2023) 2,
3, 4, 12, 13, 14

26. Lee, Y., Kim, K., Kim, H., Sung, M.: Syncdiffusion: Coherent montage via syn-
chronized joint diffusions. NeurIPS 36 (2024) 2, 3
27. Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumb-
ley, M.D.: Audioldm: Text-to-audio generation with latent diffusion models. arXiv
preprint arXiv:2301.12503 (2023) 3
28. Lu, Z., Wang, Z., Huang, D., Wu, C., Liu, X., Ouyang, W., Bai, L.: Fit: Flexible
vision transformer for diffusion model. arXiv preprint arXiv:2402.12376 (2024) 3
29. Midjourney: (2024), https://www.midjourney.com, accessed: 17, 01, 2024 1
30. Pattichis, M.S., Bovik, A.C.: Analyzing image structure by multidimensional fre-
quency modulation. IEEE TPAMI 29(5), 753–766 (2007) 9
31. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV. pp.
4195–4205 (2023) 3
32. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna,
J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image
synthesis. arXiv preprint arXiv:2307.01952 (2023) 1, 2, 3, 11, 12
33. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M.,
Sutskever, I.: Zero-shot text-to-image generation. In: ICML. pp. 8821–8831. PMLR
(2021) 1
34. Riad, R., Teboul, O., Grangier, D., Zeghidour, N.: Learning strides in convolutional
neural networks. In: ICLR (2021) 6, 18
35. Rippel, O., Snoek, J., Adams, R.P.: Spectral representations for convolutional neu-
ral networks. NeurIPS 28 (2015) 8
36. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
1, 3
37. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour,
K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-
to-image diffusion models with deep language understanding. NeurIPS 35, 36479–
36494 (2022) 1, 3
38. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M.,
Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy,
S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open
large-scale dataset for training next generation image-text models (2022) 12
39. Si, C., Huang, Z., Jiang, Y., Liu, Z.: Freeu: Free lunch in diffusion u-net. arXiv
preprint arXiv:2309.11497 (2023) 12
40. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International
Conference on Learning Representations (2020) 3, 4
41. Sukhbaatar, S., Grave, É., Bojanowski, P., Joulin, A.: Adaptive attention span in
transformers. In: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. pp. 331–335 (2019) 18
42. Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., Tang, J.: Relay
diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv
preprint arXiv:2309.03350 (2023) 3
43. Upscaler, S.D.: (2024), https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler, accessed: 17, 01, 2024 20
44. Wang, J., Li, X., Zhang, J., Xu, Q., Zhou, Q., Yu, Q., Sheng, L., Xu, D.: Diffu-
sion model is secretly a training-free open vocabulary semantic segmenter. arXiv
preprint arXiv:2309.02773 (2023) 10
45. Xiao, C., Yang, Q., Zhou, F., Zhang, C.: From text to mask: Localizing en-
tities using the attention of text-to-image diffusion models. arXiv preprint
arXiv:2309.04109 (2023) 10

46. Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., Kreis,
K.: Lion: Latent point diffusion models for 3d shape generation. arXiv preprint
arXiv:2210.06978 (2022) 3
47. Zhang, R.: Making convolutional networks shift-invariant again. In: ICML. pp.
7324–7334. PMLR (2019) 6
48. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using
very deep residual channel attention networks. In: ECCV. pp. 286–301 (2018) 9
49. Zhao, W., Rao, Y., Liu, Z., Liu, B., Zhou, J., Lu, J.: Unleashing text-to-image
diffusion models for visual perception. ICCV (2023) 10
50. Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., Xu, H.: Any-size-diffusion:
Toward efficient text-driven synthesis for any-size hd images. arXiv preprint
arXiv:2308.16582 (2023) 3
51. Zhu, Q., Zhou, M., Huang, J., Zheng, N., Gao, H., Li, C., Xu, Y., Zhao, F.: Fourid-
own: Factoring down-sampling into shuffling and superposing. In: NeurIPS (2023)
6
