High-Res Image Synthesis Method
2 Centre for Perceptual and Interactive Intelligence   3 Sun Yat-Sen University
4 SenseTime Research   5 Beihang University
[email protected], {rongyaofang@link, hsli@ee}.cuhk.edu.hk
1 Introduction
Recently, diffusion models [19, 36] have emerged as the predominant generative models, surpassing GANs [13] and autoregressive models [11, 33] in popularity. Text-to-image generation models based on diffusion, such as Stable Diffusion (SD) [36], Stable Diffusion XL (SDXL) [32], Midjourney [29], and Imagen [37], have shown an astonishing capacity to generate high-quality, high-fidelity images under the guidance of text prompts.
⋆ Equal contribution. B Corresponding author.
[Fig. 1: SDXL generation at 2048 × 2048. Prompt: “An anime man in flight uniform with hyper detailed digital artwork and an art style inspired by Klimt, Nixeu, Ian Sprigger, Wlop, and Krenz Cushart.”]
To ensure efficient processing on existing hardware and stable model training, these models are typically trained at one or a few specific image resolutions. For instance, SD models are often trained using images of 512 × 512 resolution, while SDXL models are typically trained with images close to 1024 × 1024 pixels.
However, as shown in Fig. 1, directly employing pre-trained diffusion models
to generate an image at a resolution higher than what the models were trained
on will lead to significant issues, including repetitive patterns and unforeseen
artifacts. Some studies [3, 24, 26] have attempted to create larger images by uti-
lizing pre-trained diffusion models to stitch together overlapping patches into a
panoramic image. Nonetheless, without global guidance over the whole image, these methods struggle to generate images centered on specific objects and fail to address the problem of repetitive patterns, for which a unified global structure is essential. Recent work [25] has explored adapting pre-trained diffusion models to generate images of various sizes by examining attention entropy. ScaleCrafter [15], in contrast, found that the key to generating high-resolution images lies in the convolution layers. It introduces a re-dilation operation and a convolution disperse operation to enlarge the kernel sizes of convolution layers, largely mitigating the problem of pattern repetition. However, this conclusion stems from empirical findings and lacks a deeper exploration of the issue. Moreover, the method requires an initial offline computation of a linear transformation between the original convolutional kernel and the enlarged kernel, which limits its compatibility and scalability when the kernel sizes of the UNet or the desired target resolution change.
In this work, we present FouriScale, an innovative and effective approach that addresses this issue from the perspective of frequency domain analysis, and we demonstrate its effectiveness through both theoretical analysis and experimental results. FouriScale substitutes the original convolutional layers in pre-trained diffusion models by simply introducing a dilation operation coupled with a low-pass operation, aimed at achieving structural and scale consistency across resolutions, respectively. Equipped with a padding-then-crop strategy, our method allows flexible text-to-image generation of different sizes and aspect ratios. Furthermore, by utilizing FouriScale as guidance, our approach attains remarkable capability in generating high-resolution and high-quality images.
2 Related Work
Text-to-image synthesis [9, 20, 36, 37] has seen a significant surge in interest due
to the development of diffusion probabilistic models [19, 40]. These innovative
models operate by generating data from a Gaussian distribution and refining
it through a denoising process. With their capacity for high-quality generation,
they have made significant leaps over traditional models like GANs [9, 13], espe-
cially in producing more realistic images. The Latent Diffusion Model (LDM) [36] performs the diffusion process within a latent space, achieving astonishing realism in image generation and spurring significant interest in latent-space generation [5, 16, 27, 31, 46]. To ensure efficient processing on existing hardware and stable model training, these models are typically trained at one or a few specific image resolutions. For instance, Stable Diffusion (SD) [36] is trained using 512 × 512 pixel images, while SDXL [32] models are typically trained with images close to 1024 × 1024 resolution, accommodating various aspect ratios simultaneously.
Fig. 2: The overview of FouriScale (orange line), which includes a dilation convolution
operation (Sec. 3.2) and a low-pass filtering operation (Sec. 3.3) to achieve structural
consistency and scale consistency across resolutions, respectively.
However, ScaleCrafter [15] finds that the key to generating high-
resolution images by pre-trained diffusion models lies in convolution layers. They
present a re-dilation and a convolution disperse operation to expand convolution
kernel sizes, which requires an offline calculation of a linear transformation from
the original convolutional kernel to the expanded one. In contrast, we deeply
investigate the issue of repetitive patterns and handle it through the perspective
of frequency domain analysis. The simplicity of our method eliminates the need
for any offline pre-computation, facilitating its compatibility and scalability.
3 Method
Diffusion models, also known as score-based generative models [19, 40], belong
to a category of generative models that follow a process of progressively intro-
ducing Gaussian noise into the data and subsequently generating samples from
this noise through a reverse denoising procedure. The key denoising step is typ-
ically carried out by a U-shaped Network (UNet), which learns the underlying
denoising function that maps from noisy data to its clean counterpart. The UNet
architecture, widely adopted for this purpose, comprises stacked convolution lay-
ers, self-attention layers, and cross-attention layers. Some previous works have
explored the degradation of performance when the generated resolution becomes larger, attributing it to the change in the number of attention tokens [25] and the reduced relative receptive field of convolution layers [15]. Based on empirical
evidence in [15], convolutional layers are more sensitive to changes in resolu-
tion. Therefore, we primarily focus on studying the impact brought about by
the convolutional layers. In this section, we will introduce FouriScale, as shown
in Fig. 2. It includes a dilation convolution operation (Sec. 3.2) and a low-pass
filtering operation (Sec. 3.3) to achieve structural consistency and scale consis-
tency across resolutions, respectively. With the tailored padding-then-cropping
strategy (Sec. 3.4), FouriScale can generate images of arbitrary aspect ratios.
By utilizing FouriScale as guidance (Sec. 3.5), our approach attains remarkable
capability in generating high-resolution and high-quality images.
3.1 Notation
2D Discrete Fourier Transform (2D DFT). Given a two-dimensional discrete
signal F (m, n) with dimensions M × N , the two-dimensional discrete Fourier
transform (2D DFT) is defined as:
\label {eq:2D_DFT} F(p, q) = \frac {1}{MN} \sum _{m=0}^{M-1} \sum _{n=0}^{N-1} F(m, n) e^{-j2\pi \left (\frac {pm}{M} + \frac {qn}{N}\right )}. (1)
Dilated convolution. Given a convolution kernel $k(m, n)$, its dilated version with dilation factors $(d_h, d_w)$ is defined as:
k_{d_h, d_w}(m, n) = \begin {cases} k(\frac {m}{d_h}, \frac {n}{d_w}) & \text {if } m \operatorname {\%} d_h = 0 \text { and } n \operatorname {\%} d_w = 0, \\ 0 & \text {otherwise}, \end {cases} (2)
where $d_h$ and $d_w$ are the dilation factors along height and width, respectively, $m$ and $n$ are the indices in the dilated space, and % denotes the modulo operation.
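As a concrete illustration (a sketch in NumPy, not part of the original method; the helper name dilate_kernel is ours), the dilated kernel of Eq. (2) can be built by scattering the original taps onto a zero grid; following the convention noted for Fig. 3, zeros are kept on the right and bottom so the dilated kernel has size $d_h M \times d_w N$:

```python
import numpy as np

def dilate_kernel(k: np.ndarray, d_h: int, d_w: int) -> np.ndarray:
    """Dilate a 2D kernel per Eq. (2): non-zero only where m % d_h == 0 and n % d_w == 0."""
    M, N = k.shape
    k_dilated = np.zeros((d_h * M, d_w * N), dtype=k.dtype)
    k_dilated[::d_h, ::d_w] = k
    return k_dilated

k = np.arange(9, dtype=float).reshape(3, 3)
print(dilate_kernel(k, 2, 2).shape)  # (6, 6)
```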
To preserve the structure observed at the original (training) resolution, the convolution of the down-sampled high-resolution feature $F$ with the original kernel $k$ should be equivalent to down-sampling the convolution of $F$ with a new kernel $k'$, i.e.,
\label {eq:structural_consistency} \operatorname {Down}_{s}(F) \circledast k = \operatorname {Down}_{s}(F \circledast k'), (3)
where $\operatorname{Down}_s$ denotes the down-sampling operation with scale $s$¹, and $\circledast$ represents the convolution operation. This equation implies the need to customize a new convolution kernel $k'$ for a larger resolution. However, finding an appropriate $k'$ is challenging due to the variability of the feature map $F$. The recent ScaleCrafter [15] method uses structure-level and pixel-level calibrations to learn a linear transformation between $k$ and $k'$, but learning a new transformation for each new kernel size and target resolution is cumbersome.
¹ For simplicity, we assume equal down-sampling scales for height and width. Our method can also accommodate different down-sampling scales in this context through our padding-then-cropping strategy (Sec. 3.4).
Theorem 1. Let $F(u)$ denote the Fourier spectrum of a one-dimensional signal sampled at rate $\Omega_x$, and let $F'(u)$ be the spectrum of the same signal after spatial down-sampling by a factor $s$. Then
F'(u) = \mathbb {S}(F(u), F\left (u + \frac {a \Omega _x}{s}\right )) \mid u \in \left (0, \frac {\Omega _x}{s}\right ), (4)
where $\mathbb{S}$ denotes the superposing operator, $\Omega_x$ is the sampling rate along the $x$ axis, and $a = 1, \ldots, s-1$.
Lemma 1. For a two-dimensional signal $F(x, y)$, spatial down-sampling by a factor $s$ superposes shifted replicas of its spectrum:
\label {eq:superpose} \operatorname {DFT}\left (\operatorname {Down}_{s} (F(x, y))\right ) = \frac {1}{s^2} \sum _{i=0}^{s-1} \sum _{j=0}^{s-1} F_{(i,j)}(u, v), (5)
where $F_{(i,j)}(u, v) = F\left(u + \frac{i \Omega_x}{s}, v + \frac{j \Omega_y}{s}\right)$ denotes the spectrum shifted by $(\frac{i \Omega_x}{s}, \frac{j \Omega_y}{s})$.
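Lemma 1 can also be checked numerically; the sketch below (ours, using NumPy's unnormalized FFT convention) verifies that the spectrum of a down-sampled signal equals the average of the $s \times s$ shifted tiles of the original spectrum:

```python
import numpy as np

s, M, N = 2, 8, 8
F = np.random.randn(M, N)

# Spectrum of the spatially down-sampled signal (keep every s-th sample).
G_hat = np.fft.fft2(F[::s, ::s])

# Superposition of the s*s shifted replicas of the original spectrum (Lemma 1).
F_hat = np.fft.fft2(F)
superposed = sum(
    F_hat[i * M // s:(i + 1) * M // s, j * N // s:(j + 1) * N // s]
    for i in range(s) for j in range(s)
) / s**2

assert np.allclose(G_hat, superposed)
```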
The proof of Theorem 1 and Lemma 1 are provided in the Appendix (Sec. A.1
and Sec. A.2). They describe the shuffling and superposing [34, 47, 51] in the
frequency domain imposed by spatial down-sampling. If we transform Eq. (3) to
the frequency domain and follow the conclusion of Lemma 1, we obtain:
\label {eq:eq_to_frequency_domain} &\left ( \frac {1}{s^2} \sum _{i=0}^{s-1} \sum _{j=0}^{s-1} F_{(i,j)}(u, v) \right ) \odot k(u, v) \leftarrow \text {Left side of Eq.~\eqref {eq:structural_consistency}} \nonumber \\ &= \frac {1}{s^2} \sum _{i=0}^{s-1} \sum _{j=0}^{s-1} \left (F_{(i,j)}(u, v) \odot k(u, v) \right ) \\ &= \frac {1}{s^2} \sum _{i=0}^{s-1} \sum _{j=0}^{s-1} \left (F_{(i,j)}(u, v) \odot k'_{(i,j)}(u, v) \right ), \leftarrow \text {Right side of Eq.~\eqref {eq:structural_consistency}} \nonumber
[Fig. 3 panels: 5×5 Conv Kernel, DFT of 5×5 Conv Kernel, 20×20 Dilated Conv Kernel, DFT of 20×20 Dilated Conv Kernel, Cropped DFT.]
Fig. 3: We show a random 5 × 5 kernel for better visualization. The Fourier spec-
trum of its dilated kernel, with a dilation factor of 4, clearly demonstrates a periodic
character. It should be noted that we also pad zeros to the right and bottom sides
of the dilated kernel, which differs from the conventional use. However, this does not
impact the outcome in practical applications.
where $k(u, v)$ and $k'(u, v)$ denote the Fourier transforms of kernels $k$ and $k'$, respectively, and $\odot$ is element-wise multiplication. Eq. (6) suggests that the Fourier spectrum of the ideal convolution kernel $k'$ should be one stitched together from $s \times s$ copies of the Fourier spectrum of the convolution kernel $k$. In other words, there should be a periodic repetition in the Fourier spectrum of $k'$, and the repeating pattern is the Fourier spectrum of $k$.
Fortunately, the widely used dilated convolution perfectly meets this requirement. Suppose a kernel $k(m, n)$ of size $M \times N$; its dilated version is $k_{d_h, d_w}(m, n)$ with dilation factors $(d_h, d_w)$. For any integer multiple of $d_h$, namely $p' = p d_h$, and any integer multiple of $d_w$, namely $q' = q d_w$, the exponential term of the dilated kernel in the 2D DFT (Eq. (1)) becomes:
e^{-j2\pi \left (\frac {p'm}{d_hM} + \frac {q'n}{d_wN}\right )} = e^{-j2\pi \left (\frac {pm}{M} + \frac {qn}{N}\right )}, (7)
which equals the exponential term of the original kernel $k$ at frequency $(p, q)$. Consequently, the Fourier spectrum of the dilated kernel is a periodic repetition of the Fourier spectrum of $k$, as visualized in Fig. 3.
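The periodicity illustrated in Fig. 3 can be verified numerically as well; the following sketch (ours) checks that the spectrum of the dilated, zero-padded kernel is exactly a $d \times d$ tiling of the original kernel's spectrum:

```python
import numpy as np

M, N, d = 5, 5, 4
k = np.random.randn(M, N)

# Dilated kernel, zero-padded to d*M x d*N as in Fig. 3.
k_d = np.zeros((d * M, d * N))
k_d[::d, ::d] = k

K = np.fft.fft2(k)      # spectrum of the original kernel
K_d = np.fft.fft2(k_d)  # spectrum of the dilated kernel

# The dilated spectrum is a periodic (d x d) repetition of the original spectrum.
assert np.allclose(K_d, np.tile(K, (d, d)))
```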
However, in practice, dilated convolution alone cannot well mitigate the issue of
pattern repetition. As shown in Fig. 4a (top left), the issue of pattern repetition
is significantly reduced, but certain fine details, like the horse’s legs, still present
[Fig. 4 plots: Fourier relative log amplitude versus frequency for low-res, high-res, and filtered high-res up-block features at reverse steps 1, 25, and 50; panels: (a) visual comparisons (dilated convolution vs. dilated convolution + low-pass filter), (b) without filtering, (c) with filtering.]
Fig. 4: (a) Visual comparisons between the images created at a resolution of 2048 ×
2048: with only the dilated convolution, and with both the dilated convolution and the
low-pass filtering. (b)(c) Fourier relative log amplitudes of input features from three
distinct layers from the down blocks, mid blocks, and up blocks of UNet, respectively,
are analyzed. We also include features at reverse steps 1, 25, and 50. (b) Without the
application of the low-pass filter. There is an evident distribution gap of the frequency
spectrum between the low resolution and high resolution. (c) With the application of
the low-pass filter. The distribution gap is largely reduced.
issues. This phenomenon arises from the aliasing effect of spatial down-sampling, which widens the distribution gap between the features at low resolution and the features down-sampled from high resolution, as presented in Fig. 4b.
Aliasing alters the fundamental frequency components of the original signal,
breaking its consistency across scales.
In this paper, we introduce a low-pass filtering operation, also known as spectral pooling [35], to remove high-frequency components that might cause aliasing, thereby establishing scale consistency across different resolutions. Let $F(m, n)$ be a
two-dimensional discrete signal with resolution M \times N . Spatial down-sampling of
F (m, n), by factors s_h and s_w along the height and width respectively, alters the
Nyquist limits to M/(2s_h) and N/(2s_w) in the frequency domain, corresponding
to half the new sampling rates along each dimension. The expected low-pass filter
should remove frequencies above these new Nyquist limits to prevent aliasing.
Therefore, the optimal mask size (assuming the frequency spectrum is central-
ized) for passing low frequencies in a low-pass filter is M/s_h \times N/s_w . This filter
design ensures the preservation of all valuable frequencies within the downscaled
resolution while preventing aliasing by filtering out higher frequencies.
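A minimal sketch of such an ideal low-pass filter (helper names are ours; we assume the spectrum is centered with fftshift, matching the centralized-spectrum assumption above):

```python
import numpy as np

def ideal_lowpass_mask(M: int, N: int, s_h: int, s_w: int) -> np.ndarray:
    """Pass only the central (M/s_h) x (N/s_w) window of a centered spectrum."""
    mask = np.zeros((M, N))
    h, w = M // s_h, N // s_w
    top, left = (M - h) // 2, (N - w) // 2
    mask[top:top + h, left:left + w] = 1.0
    return mask

def spectral_lowpass(F: np.ndarray, s_h: int, s_w: int) -> np.ndarray:
    """Remove frequencies above the down-sampled Nyquist limits (spectral-pooling style)."""
    M, N = F.shape
    spec = np.fft.fftshift(np.fft.fft2(F))
    spec *= ideal_lowpass_mask(M, N, s_h, s_w)
    return np.fft.ifft2(np.fft.ifftshift(spec)).real
```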
Overall, FouriScale replaces the original convolution operation $\operatorname{Conv}_k$ with
\label {eq:final_operation} \operatorname {Conv}_k(F) \rightarrow \operatorname {Conv}_{k'}(\operatorname {iDFT}(H \odot \operatorname {DFT}(F))), (8)
where $H$ denotes the low-pass filter and $k'$ is the dilated kernel. Fig. 4a (bottom left) illustrates that the combination of dilated convolution and low-pass filtering resolves the issue of pattern repetition.
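A minimal PyTorch sketch of Eq. (8), assuming the inference-time feature is $s_h \times s_w$ times larger than at training time, an ideal low-pass mask, and a layer that originally used "same" padding; fouriscale_conv2d is our own illustrative name, not the released implementation:

```python
import torch
import torch.nn.functional as nnf

def fouriscale_conv2d(feat, weight, bias, s_h, s_w, base_padding=1):
    """Sketch of Eq. (8): low-pass filter the feature in the frequency domain,
    then apply the dilated convolution that plays the role of k'."""
    B, C, H, W = feat.shape

    # Low-pass: keep the central (H/s_h) x (W/s_w) window of the centered spectrum.
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    mask = torch.zeros(H, W, device=feat.device)
    h, w = H // s_h, W // s_w
    top, left = (H - h) // 2, (W - w) // 2
    mask[top:top + h, left:left + w] = 1.0
    feat = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

    # Dilated convolution with dilation equal to the resolution scale.
    return nnf.conv2d(feat, weight, bias,
                      padding=(base_padding * s_h, base_padding * s_w),
                      dilation=(s_h, s_w))
```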
Fig. 5: Visual comparisons between images created at a resolution of 2048 × 1024: (a) without and (b) with the padding-then-cropping strategy. The Stable Diffusion 2.1 model used here was originally trained on images of 512 × 512 resolution.
Suppose a convolution layer takes an input feature of size $h_f \times w_f$ at the training resolution, and this input feature increases to a size of $H_f \times W_f$ during inference. Our first step is to zero-pad the input feature to a size of $r h_f \times r w_f$. Here, $r$ is defined as the maximum of $\lceil H_f / h_f \rceil$ and $\lceil W_f / w_f \rceil$, with $\lceil \cdot \rceil$ representing the ceiling operation.
The padding operation assumes that we aim to generate an image of size rh×rw,
where certain areas are filled with zeros. Subsequently, we apply Eq. (8) to rectify
the issue of repetitive patterns in the higher-resolution output. Ultimately, the
obtained feature is cropped to restore its intended spatial size. This step is
necessary to not only negate the effects of zero-padding but also control the
computational demands when the resolution increases, particularly those arising
from the self-attention layers in the UNet architecture. Taking computational
efficiency into account, our equivalent solution is outlined in Algorithm 1.
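A simplified stand-in for Algorithm 1 (a sketch; fouriscale_op is a hypothetical callable that applies Eq. (8) with scale r):

```python
import math

def padding_then_crop(feat, h_f, w_f, fouriscale_op):
    """Padding-then-cropping strategy (Sec. 3.4).
    feat: (B, C, H_f, W_f) inference-time feature; (h_f, w_f): training-time feature size."""
    B, C, H_f, W_f = feat.shape
    r = max(math.ceil(H_f / h_f), math.ceil(W_f / w_f))

    # Zero-pad to r*h_f x r*w_f (extra area on the right/bottom is filled with zeros).
    padded = feat.new_zeros(B, C, r * h_f, r * w_f)
    padded[..., :H_f, :W_f] = feat

    out = fouriscale_op(padded, r)

    # Crop back to the intended spatial size.
    return out[..., :H_f, :W_f]
```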
To offset the decline in image quality and loss of details typically induced by low-pass filtering, we further employ FouriScale as guidance (Sec. 3.5), which yields a new conditional noise estimation. The final noise estimation is determined from both the unconditional and this new conditional estimation following classifier-free guidance. As we can see in Fig. 6 (c), the aforementioned issues are largely mitigated.
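The section above only states that the unconditional and the new conditional estimations are combined following classifier-free guidance; assuming the standard combination rule, a sketch of that final step (with an illustrative guidance scale) would be:

```python
def guided_noise_estimate(eps_uncond, eps_cond_guided, guidance_scale=7.5):
    """Standard classifier-free guidance combination (assumed form).
    eps_cond_guided: conditional estimate obtained with FouriScale guidance."""
    return eps_uncond + guidance_scale * (eps_cond_guided - eps_uncond)
```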
Settings for SDXL. Stable Diffusion XL [32] (SDXL) is generally trained on im-
ages with a resolution close to 1024 × 1024 pixels, accommodating various aspect
ratios simultaneously. Our observations reveal that using an ideal low-pass filter leads to suboptimal outcomes for SDXL. Instead, a gentler low-pass filter, which modulates rather than completely eliminates high-frequency components using a coefficient σ ∈ [0, 1] (set to 0.6 in our method), delivers superior visual quality. This phenomenon can be attributed to SDXL's ability to handle changes in scale effectively, negating the need for an ideal low-pass filter to maintain scale consistency, and it supports our rationale of incorporating low-pass filtering to address scale variability. Additionally, for SDXL, we calculate the scale factor r (refer to Algorithm 1) by determining the training resolution whose aspect ratio is closest to that of the target resolution.
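As a sketch of this selection (assuming the SDXL training resolutions are available as a list of (h, w) pairs, and that r is then obtained with the same ceiling rule as in Algorithm 1):

```python
import math

def sdxl_scale_factor(target_hw, train_resolutions):
    """Pick the training resolution whose aspect ratio is closest to the target,
    then derive the scale factor r."""
    th, tw = target_hw
    h, w = min(train_resolutions, key=lambda hw: abs(hw[0] / hw[1] - th / tw))
    return max(math.ceil(th / h), math.ceil(tw / w))
```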
4 Experiments
We compare our method with the vanilla text-to-image diffusion model (Vanilla),
the training-free approach [25] (Attn-Entro) that accounts for variations in at-
tention entropy between low and high resolutions, and ScaleCrafter [15], which
modifies convolution kernels through re-dilation and adopts linear transforma-
tions for kernel enlargement. We show the experimental results in Tab. 1. Compared to the vanilla diffusion models, our method obtains much better results because it eliminates the issue of repetitive patterns. Attn-Entro does not work at high upscaling levels because it fails to fundamentally address structural consistency across resolutions. Due to the absence of scale consistency consideration in ScaleCrafter, it performs worse than our method on the
majority of metrics. Additionally, we observe that ScaleCrafter often struggles
Table 1: Quantitative comparisons among training-free methods. The best and second best results are highlighted in bold and underline. The values of KIDr and KIDb are scaled by 10².
                          |        SD 1.5          |        SD 2.1          |       SDXL 1.0
Resolution  Method        | FIDr↓ KIDr↓ FIDb↓ KIDb↓ | FIDr↓ KIDr↓ FIDb↓ KIDb↓ | FIDr↓ KIDr↓ FIDb↓ KIDb↓
4× 1:1      Vanilla       | 26.96 1.00  15.72 0.42  | 29.90 1.11  19.21 0.54  |  49.81 1.84  32.90 0.92
            Attn-Entro    | 26.78 0.97  15.64 0.42  | 29.65 1.10  19.17 0.54  |  49.72 1.84  32.86 0.92
            ScaleCrafter  | 23.90 0.95  11.83 0.32  | 25.19 0.98  13.88 0.40  |  49.46 1.73  36.22 1.07
            Ours          | 23.62 0.92  10.62 0.29  | 25.17 0.98  13.57 0.40  |  33.89 1.21  20.10 0.47
6.25× 1:1   Vanilla       | 41.04 1.28  31.47 0.77  | 45.81 1.52  37.80 1.04  |  68.87 2.79  54.34 1.92
            Attn-Entro    | 40.69 1.31  31.25 0.76  | 45.77 1.51  37.75 1.04  |  68.50 2.76  54.07 1.91
            ScaleCrafter  | 37.71 1.34  25.54 0.67  | 35.13 1.14  23.68 0.57  |  55.03 2.02  45.58 1.49
            Ours          | 30.27 1.00  16.71 0.34  | 30.82 1.01  18.34 0.42  |  44.13 1.64  37.09 1.16
8× 1:2      Vanilla       | 50.91 1.87  44.65 1.45  | 57.80 2.26  51.97 1.81  |  90.23 4.20  79.32 3.42
            Attn-Entro    | 50.72 1.86  44.49 1.44  | 57.42 2.26  51.67 1.80  |  89.87 4.15  79.00 3.40
            ScaleCrafter  | 35.11 1.22  29.51 0.81  | 41.72 1.42  35.08 1.01  | 106.57 5.15 108.67 5.23
            Ours          | 35.04 1.19  26.55 0.72  | 37.19 1.29  27.69 0.74  |  71.77 2.79  70.70 2.65
16× 1:1     Vanilla       | 67.90 2.37  66.49 2.18  | 84.01 3.28  82.25 3.05  | 116.40 5.45 109.19 4.84
            Attn-Entro    | 67.45 2.35  66.16 2.17  | 83.68 3.30  81.98 3.04  | 113.25 5.44 106.34 4.81
            ScaleCrafter  | 32.00 1.01  27.08 0.71  | 40.91 1.32  33.23 0.90  |  84.58 3.53  85.91 3.39
            Ours          | 30.84 0.95  23.29 0.57  | 39.49 1.27  28.14 0.73  |  56.66 2.18  49.59 1.63
Fig. 7: Visual comparisons between ➊ ours, ➋ ScaleCrafter [15] and ➌ Attn-Entro [25], under settings of 4×, 8×, and 16×, employing three distinct pre-trained diffusion models: SD 1.5, SD 2.1, and SDXL 1.0. Prompts (4× / 8× / 16× columns):
SD 1.5: “Side-view blue-ice sneaker inspired by Spiderman created by Weta FX.” / “The image is titled ‘Queen of the Robots’ created by artists Greg Rutowski, Victo Ngai, and Alphonse Mucha.” / “A painting of a koala wearing a princess dress and crown, with a confetti background.”
SD 2.1: “a castle is in the middle of a eurpean city” / “A teddy bear mad scientist mixing chemicals depicted in oil painting style.” / “Portrait of an anime maid by Krenz Cushart, Alphonse Mucha, and Ilya Kuvshinov.”
SD XL: “A watercolor portrait of a woman by Luke Rueda Studios and David Downton.” / “A nighstand topped with a white landline phone, remote control, a metallic lamp, and a black hardcover book.” / “Two cats, grey and black, are wearing steampunk attire and standing in front of a ship in a heavily detailed painting.”
Table 2: Ablation studies on FouriScale components on the SD 2.1 model under the 16× 1:1 setting.
Method                     FIDr
FouriScale                 39.49
w/o guidance               43.75
w/o guidance & filtering   46.74

Fig. 8: Comparison of mask sizes for passing low frequencies when generating 2048² images by SD 2.1; M and N denote the height and width of the target resolution. Compared sizes: M/2×N/2, M/4×N/4 (Ours), M/6×N/6.
Appendix
A Proof
A.1 Proof of Theorem 1
We model spatial sampling with the impulse train
\label {eq:1d_sampling} sa(x, \Delta T) = \sum _{n=-\infty }^{\infty } \delta (x - n\Delta T). (1)
Sampling a continuous signal $g(x)$ at interval $\Delta T$ and at the $s$-times larger interval $s\Delta T$ gives
\begin {aligned} f(x) &= g(x)\, sa(x, \Delta T), \\ f'(x) &= g(x)\, sa(x, s\Delta T). \end {aligned} (2)
Based on the Fourier transform and the convolution theorem, the spatial
sampling described above can be represented in the Fourier domain as follows:
\label {eq:freq_sampling} F(u) &= G(u) \circledast SA(u, \Delta T) \nonumber \\ &= \int _{-\infty }^{\infty } G(\tau )SA(u - \tau , \Delta T)d\tau \nonumber \\ &= \frac {1}{\Delta T} \sum _{n} \int _{-\infty }^{\infty } G(\tau ) \delta \left ( u - \tau - \frac {n}{\Delta T} \right ) d\tau \\ & = \frac {1}{\Delta T} \sum _{n} G \left ( u - \frac {n}{\Delta T} \right ), \nonumber (3)
where G(u) and SA(u, ∆T ) are the Fourier transform of g(x) and sa(x, ∆T ).
From the above equation, it can be observed that spatial sampling introduces periodicity into the spectrum, with period $\frac{1}{\Delta T}$.
Note that the sampling rates of $f(x)$ and $f'(x)$ are $\Omega_x$ and $\Omega'_x$, respectively; the relationship between them can be written as
\Omega _x = \frac {1}{\Delta T}, \quad \Omega '_x = \frac {1}{s\Delta T} = \frac {1}{s}\Omega _x. (4)
\label {eq:s_superpose} F'(u) = \mathbb {S}(F(u), F(\tilde {u}_1), \ldots , F(\tilde {u}_{s-1})), (5)
where \protect \tilde {u}_i represents the frequencies higher than the sampling rate, while u de-
notes the frequencies that are lower than the sampling rate. The symbol \protect \mathbb {S} stands
for the superposition operator. To simplify the discussion, \protect \tilde {u} will be used to de-
note \protect \tilde {u}_i in subsequent sections.
(1) In the sub-band, where u \in (0, \frac {\Omega _x}{2s}) , \protect \tilde {u} should satisfy
\label {eq:aliasing_theorem1} \quad \tilde {u} \in \left (\frac {\Omega _x}{2s}, u_{max}\right ). (6)
According to the aliasing theorem, the high frequency \protect \tilde {u} is folded back to the
low frequency:
\label {eq:aliasing_theorem2} \hat {u} = \left | \tilde {u} - (k + 1)\frac {\Omega '_x}{2} \right |, \quad k\frac {\Omega '_x}{2} \leq \tilde {u} \leq (k + 2)\frac {\Omega '_x}{2} (7)
where k = 1, 3, 5, \ldots and \protect \hat {u} is folded results by \protect \tilde {u}.
According to Eq. 6 and Eq. 7, we have
\label {eq:u_hat_range} \hat {u} = \frac {a\Omega _x}{s} - \tilde {u} \quad \text {and} \quad \hat {u} \in \left (\frac {\Omega _x}{s} - u_{max}, \frac {\Omega _x}{2s}\right ), (8)
where $a = (k+1)/2 = 1, 2, \ldots$. According to Eq. (5) and Eq. (8), we obtain
\label {eq:fu_case} F'(u) = \begin {cases} F(u) & \text {if } u \in (0, \frac {\Omega _x}{s} - u_{max}), \\ \mathbb {S}(F(u), F(\frac {a\Omega _x}{s} - u)) & \text {if } u \in (\frac {\Omega _x}{s} - u_{max}, \frac {\Omega _x}{2s}). \end {cases} (9)
According to Eq. (3), F(u) is symmetric with respect to u = \frac {\Omega _x}{2} :
\label {eq:symmetry} F(\frac {\Omega _x}{2} - u) = F(u + \frac {\Omega _x}{2}). (10)
Applying Eq. (10) to the folded component $F(\frac{a\Omega_x}{s} - u)$ gives
\label {eq:symmetry_transfer} \begin {aligned} &F(\frac {\Omega _x}{2} - (\frac {\Omega _x}{2}+u-\frac {a\Omega _x}{s})) \\ = &F(\frac {\Omega _x}{2} + (\frac {\Omega _x}{2}+u-\frac {a\Omega _x}{s})) \\ = &F(u + \Omega _x -\frac {a\Omega _x}{s}) \\ = &F(u + \frac {a\Omega _x}{s}) \end {aligned} (11)
since a = 1, 2, \ldots , s-1. Additionally, for s = 2, the condition u \in (0, \frac {\Omega _x}{s} - u_{max})
results in F(u + \frac {\Omega _x}{s}) = 0. When s > 2, the range u \in (0, \frac {\Omega _x}{s} - u_{max}) typically
becomes non-existent. Thus, in light of Eq. (11) and the preceding analysis,
Eq. (9) can be reformulated as
\label {eq:theorem_prove1} F'(u) = \mathbb {S}(F(u), F(u + \frac {a\Omega _x}{s})) \mid u \in (0, \frac {\Omega _x}{2s}). (12)
(2) In the sub-band, where u \in (\frac {\Omega _x}{2s}, \frac {\Omega _x}{s}), different from (1), \protect \tilde {u} should satisfy
\tilde {u} \in (\frac {\Omega _x}{s} - u_{max}, \frac {\Omega _x}{2s}). (13)
Similarly to case (1), applying the aliasing relation in this sub-band yields
\label {eq:theorem_prove2} F'(u) = \mathbb {S}(F(\tilde {u}), F(u + \frac {a\Omega _x}{s})) \mid u \in (\frac {\Omega _x}{2s}, \frac {\Omega _x}{s}). (14)
Combining Eq. (12) and Eq. (14), over the whole sub-band we have
F'(u) = \mathbb {S}(F(u), F(u + \frac {a\Omega _x}{s})) \mid u \in (0, \frac {\Omega _x}{s}), (15)
where the superposition can be written explicitly as
F'(u) = \frac {1}{s}F(u) + \sum _a \frac {1}{s}F\left (u + \frac {a\Omega _x}{s}\right ) \mid u \in \left (0, \frac {\Omega _x}{s}\right ). (16)
Based on the dual principle, the result extends to two dimensions over the whole sub-band:
F'(u,v) = \frac {1}{s^2} \left (\sum _{a,b=0}^{s-1} F\left (u + \frac {a\Omega _x}{s}, v + \frac {b\Omega _y}{s} \right )\right ), (17)
where u \in \left (0, \frac {\Omega _x}{s}\right ) , v \in \left (0, \frac {\Omega _y}{s}\right ) .
B Implementation Details
B.1 Low-pass Filter Definition
In Fig. 1, we show the design of a low-pass filter used in FouriScale. Inspired
by [34, 41], we define the low-pass filter as the outer product between two 1D
filters (depicted in the left of Fig. 1), one along the height dimension and one
along the width dimension. We define the function of the 1D filter for the height
[Fig. 1 annotations: high-frequency coefficient σ; low-pass region and smooth region; toy setting W = H = 64, s_h = s_w = 4, R_h = R_w = 8, σ = 0.]
Fig. 1: Visualization of the design of a low-pass filter. (a) 1D filter for the positive axis. (b) 2D low-pass filter, constructed by mirroring the 1D filters and taking the outer product between them, in accordance with the settings of the 1D filter.
dimension as follows; the filter for the width dimension can be obtained in the same way:
\text {mask}^h_{(s_{h},R_h,\sigma )} = \min \left ( \max \left ( \frac {1 - \sigma }{R_h} \left ( \frac {H}{s_{h}} + 1 - i \right ) + 1, \sigma \right ), 1 \right ), i \in [0,\frac {H}{2}], (18)
where sh denotes the down-sampling factor between the target and original res-
olutions along the height dimension. Rh controls the smoothness of the filter and
σ is the modulation coefficient for high frequencies. Exploiting the conjugate symmetry of the spectrum, we only consider the positive axis; the whole 1D filter can be obtained by mirroring it. We build the 2D
low-pass filter as the outer product between the two 1D filters:
\text {mask}(s_{h}, s_{w}, R_h, R_w, \sigma ) = \text {mask}^h_{(s_{h},R_h,\sigma )} \otimes \text {mask}^w_{(s_{w},R_w,\sigma )}, (19)
where ⊗ denotes the outer product operation. Likewise, the whole 2D filter can
be obtained by mirroring along the height and width axes. A toy example of a
2D low-pass filter is shown in the right of Fig. 1.
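A NumPy sketch of Eqs. (18) and (19) (function names are ours): it evaluates Eq. (18) on the positive frequency axis, mirrors it, and takes the outer product, using the toy setting of Fig. 1:

```python
import numpy as np

def mask_1d(H, s_h, R_h, sigma):
    """1D low-pass profile on the positive frequency axis, Eq. (18)."""
    i = np.arange(H // 2 + 1)
    vals = (1.0 - sigma) / R_h * (H / s_h + 1 - i) + 1.0
    return np.clip(vals, sigma, 1.0)  # min(max(., sigma), 1)

def mask_2d(H, W, s_h, s_w, R_h, R_w, sigma):
    """2D low-pass filter as the outer product of two mirrored 1D profiles, Eq. (19)."""
    mh, mw = mask_1d(H, s_h, R_h, sigma), mask_1d(W, s_w, R_w, sigma)
    # Mirror the positive-axis profiles to cover the negative frequencies.
    mh_full = np.concatenate([mh, mh[-2:0:-1]])
    mw_full = np.concatenate([mw, mw[-2:0:-1]])
    return np.outer(mh_full, mw_full)

# Toy example from Fig. 1: H = W = 64, s_h = s_w = 4, R_h = R_w = 8, sigma = 0.
mask = mask_2d(64, 64, 4, 4, 8, 8, 0.0)
```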
[Fig. 2 annotations: 1×, 1/4×, 1/16×, 1/64×.]
Fig. 2: Reference block names of Stable Diffusion in the following experiment details.
During the initial Sinit steps, we employ the ideal dilation convolution and low-pass filtering. Dur-
ing the span from Sinit to Sstop , we progressively decrease the dilation factor and
r (as detailed in Algorithm 1 of our main manuscript) down to 1. After Sstop
steps, the original UNet is utilized to refine image details further. The settings
for Sinit and Sstop are shown in Tab. 1.
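A sketch of this inference-time schedule (the linear decay between S_init and S_stop is our assumption; the released settings may use a different decay rule):

```python
def annealed_scale(step, s_init, s_stop, r_max):
    """Return the dilation/scale factor to use at a given reverse step:
    full FouriScale before s_init, decayed toward 1 until s_stop, original UNet after."""
    if step < s_init:
        return r_max
    if step >= s_stop:
        return 1
    t = (step - s_init) / max(s_stop - s_init, 1)
    return max(1, round(r_max - t * (r_max - 1)))
```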
C More Experiments
In this section, we compare the performance of our proposed method with a cas-
caded pipeline, which uses SD 2.1 to generate images at the default resolution
of 512×512, and upscale them to 2048×2048 by a pre-trained diffusion super-
resolution model, specifically the Stable Diffusion Upscaler-4× [43]. We apply
this super-resolution model to a set of 10,000 images generated by SD 2.1. We
then evaluate the FIDr and KIDr scores of these upscaled images and compare
them with images generated at 2048×2048 resolution using SD 2.1 equipped
with our FouriScale. The results of this comparison are presented in Tab. 2. Our
method obtains somewhat worse results than the cascaded method. However,
our method is capable of generating high-resolution images in only one stage,
without the need for a multi-stage process. Besides, our method does not need
Fig. 3: Visual comparison with SD+SR. Left: 2048×2048 image upscaled by SD+SR
from 512×512 SD 2.1 generated image. Right: 2048×2048 image generated by our
FouriScale with SD 2.1.
model re-training, while the SR model demands extensive data and computa-
tional resources for training. More importantly, as shown in Fig. 3, we find that
our method can generate much better details than the cascaded pipeline. Due to a lack of generative prior knowledge, the super-resolution model can only exploit information within the single image when upscaling it, resulting in an over-smoothed appearance. However, our method can effectively upscale images
and fill in details using generative priors with a pre-trained diffusion model, pro-
viding valuable insights for future explorations into the synthesis of high-quality
and ultra-high-resolution images.
Our method achieves better performance across all evaluation metrics, with lower FID and KID scores than ElasticDiffusion, indicating better image quality and diversity.
D More Visualizations
D.1 LoRAs
References
1. AnimeArtXL: (2024), https://civitai.com/models/117259/anime-art-diffusion-xl, accessed: 17, 01, 2024 22
2. Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila,
T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an
ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022) 3
3. Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths
for controlled image generation. arXiv preprint arXiv:2302.08113 (2023) 2, 3
4. Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans.
In: International Conference on Learning Representations (2018) 12
Fig. 5: More generated images using FouriScale and SD 2.1 with arbitrary resolutions (2048×768, 3072×768, 1920×1408, 768×2816, 1920×640, 1152×768, 768×768).
5. Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis,
K.: Align your latents: High-resolution video synthesis with latent diffusion models.
In: CVPR. pp. 22563–22575 (2023) 3
6. Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mu-
tual self-attention control for consistent image synthesis and editing. arXiv preprint
arXiv:2304.08465 (2023) 10
7. Chen, T.: On the importance of noise scheduling for diffusion models. arXiv
preprint arXiv:2301.10972 (2023) 3
8. Civitai: (2024), https://civitai.com/, accessed: 17, 01, 2024 22
9. Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS
34, 8780–8794 (2021) 3
10. Diffusion, S.: Stable diffusion 2-1 base. https://huggingface.co/stabilityai/
stable-diffusion-2-1-base/blob/main/v2-1_512-ema-pruned.ckpt (2022) 12
11. Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao,
Z., Yang, H., et al.: Cogview: Mastering text-to-image generation via transformers.
NeurIPS 34, 19822–19835 (2021) 1
12. Epstein, D., Jabri, A., Poole, B., Efros, A.A., Holynski, A.: Diffusion self-guidance
for controllable image generation. arXiv preprint arXiv:2306.00986 (2023) 10, 12
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., Bengio, Y.: Generative adversarial nets. NeurIPS 27 (2014) 1, 3
14. Haji-Ali, M., Balakrishnan, G., Ordonez, V.: Elasticdiffusion: Training-free arbi-
trary size image generation. arXiv preprint arXiv:2311.18822 (2023) 3, 21
15. He, Y., Yang, S., Chen, H., Cun, X., Xia, M., Zhang, Y., Wang, X., He, R., Chen,
Q., Shan, Y.: Scalecrafter: Tuning-free higher-resolution visual generation with
diffusion models. arXiv preprint arXiv:2310.07702 (2023) 2, 4, 5, 7, 12, 13, 14
16. He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion mod-
els for high-fidelity video generation with arbitrary lengths. arXiv preprint
arXiv:2211.13221 (2022) 3
17. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-or, D.:
Prompt-to-prompt image editing with cross-attention control. In: ICLR (2022) 10
18. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained
by a two time-scale update rule converge to a local nash equilibrium. NeurIPS 30
(2017) 12
19. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33,
6840–6851 (2020) 1, 3, 4
20. Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded dif-
fusion models for high fidelity image generation. The Journal of Machine Learning
Research 23(1), 2249–2281 (2022) 3
21. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint
arXiv:2207.12598 (2022) 10
22. Hoogeboom, E., Heek, J., Salimans, T.: simple diffusion: End-to-end diffusion for
high resolution images. arXiv preprint arXiv:2301.11093 (2023) 3
23. Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.:
Lora: Low-rank adaptation of large language models. In: International Conference
on Learning Representations (2021) 22
24. Jiménez, Á.B.: Mixture of diffusers for scene composition and high resolution image
generation. arXiv preprint arXiv:2302.02412 (2023) 2, 3
25. Jin, Z., Shen, X., Li, B., Xue, X.: Training-free diffusion model adaptation for
variable-sized text-to-image synthesis. arXiv preprint arXiv:2306.08645 (2023) 2,
3, 4, 12, 13, 14
26. Lee, Y., Kim, K., Kim, H., Sung, M.: Syncdiffusion: Coherent montage via syn-
chronized joint diffusions. NeurIPS 36 (2024) 2, 3
27. Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumb-
ley, M.D.: Audioldm: Text-to-audio generation with latent diffusion models. arXiv
preprint arXiv:2301.12503 (2023) 3
28. Lu, Z., Wang, Z., Huang, D., Wu, C., Liu, X., Ouyang, W., Bai, L.: Fit: Flexible
vision transformer for diffusion model. arXiv preprint arXiv:2402.12376 (2024) 3
29. Midjourney: (2024), https://www.midjourney.com, accessed: 17, 01, 2024 1
30. Pattichis, M.S., Bovik, A.C.: Analyzing image structure by multidimensional fre-
quency modulation. IEEE TPAMI 29(5), 753–766 (2007) 9
31. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV. pp.
4195–4205 (2023) 3
32. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna,
J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image
synthesis. arXiv preprint arXiv:2307.01952 (2023) 1, 2, 3, 11, 12
33. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M.,
Sutskever, I.: Zero-shot text-to-image generation. In: ICML. pp. 8821–8831. PMLR
(2021) 1
34. Riad, R., Teboul, O., Grangier, D., Zeghidour, N.: Learning strides in convolutional
neural networks. In: ICLR (2021) 6, 18
35. Rippel, O., Snoek, J., Adams, R.P.: Spectral representations for convolutional neu-
ral networks. NeurIPS 28 (2015) 8
36. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
1, 3
37. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour,
K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-
to-image diffusion models with deep language understanding. NeurIPS 35, 36479–
36494 (2022) 1, 3
38. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M.,
Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy,
S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open
large-scale dataset for training next generation image-text models (2022) 12
39. Si, C., Huang, Z., Jiang, Y., Liu, Z.: Freeu: Free lunch in diffusion u-net. arXiv
preprint arXiv:2309.11497 (2023) 12
40. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International
Conference on Learning Representations (2020) 3, 4
41. Sukhbaatar, S., Grave, É., Bojanowski, P., Joulin, A.: Adaptive attention span in
transformers. In: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. pp. 331–335 (2019) 18
42. Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., Tang, J.: Relay
diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv
preprint arXiv:2309.03350 (2023) 3
43. Upscaler, S.D.: (2024), https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler, accessed: 17, 01, 2024 20
44. Wang, J., Li, X., Zhang, J., Xu, Q., Zhou, Q., Yu, Q., Sheng, L., Xu, D.: Diffu-
sion model is secretly a training-free open vocabulary semantic segmenter. arXiv
preprint arXiv:2309.02773 (2023) 10
45. Xiao, C., Yang, Q., Zhou, F., Zhang, C.: From text to mask: Localizing en-
tities using the attention of text-to-image diffusion models. arXiv preprint
arXiv:2309.04109 (2023) 10
46. Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., Kreis,
K.: Lion: Latent point diffusion models for 3d shape generation. arXiv preprint
arXiv:2210.06978 (2022) 3
47. Zhang, R.: Making convolutional networks shift-invariant again. In: ICML. pp.
7324–7334. PMLR (2019) 6
48. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using
very deep residual channel attention networks. In: ECCV. pp. 286–301 (2018) 9
49. Zhao, W., Rao, Y., Liu, Z., Liu, B., Zhou, J., Lu, J.: Unleashing text-to-image
diffusion models for visual perception. ICCV (2023) 10
50. Zheng, Q., Guo, Y., Deng, J., Han, J., Li, Y., Xu, S., Xu, H.: Any-size-diffusion:
Toward efficient text-driven synthesis for any-size hd images. arXiv preprint
arXiv:2308.16582 (2023) 3
51. Zhu, Q., Zhou, M., Huang, J., Zheng, N., Gao, H., Li, C., Xu, Y., Zhao, F.: Fourid-
own: Factoring down-sampling into shuffling and superposing. In: NeurIPS (2023)
6