3 University of Chinese Academy of Sciences, 4 Nanyang Technological University, 5 Harbin Institute of Technology
Paper | Project Page | LoRA Weights
We propose TACA, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
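As a rough illustration of what rebalancing cross-modal attention can mean, the toy sketch below scales the pre-softmax scores of text keys by a factor `gamma` for a single image-token query attending over a joint text-plus-image key sequence. Everything here is an assumption made for illustration (the function names, the `gamma` parameter, the toy values); it is not the code in this repository and should not be read as the exact TACA formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rebalanced_attention(logits, is_text_key, gamma=1.0):
    """Attention weights for one image-token query over joint [text; image] keys.

    logits:      pre-softmax scores, one per key.
    is_text_key: booleans marking which keys are text tokens.
    gamma:       hypothetical scaling factor (an assumption of this sketch);
                 gamma = 1.0 reduces to plain softmax attention.
    """
    scaled = [gamma * s if t else s for s, t in zip(logits, is_text_key)]
    return softmax(scaled)

# Toy query: 2 text keys followed by 6 image keys, all with the same positive score.
logits = [1.0] * 8
is_text = [True, True] + [False] * 6

plain = rebalanced_attention(logits, is_text, gamma=1.0)
boosted = rebalanced_attention(logits, is_text, gamma=1.5)

text_mass_plain = sum(w for w, t in zip(plain, is_text) if t)
text_mass_boosted = sum(w for w, t in zip(boosted, is_text) if t)
print(f"text attention mass: {text_mass_plain:.3f} -> {text_mass_boosted:.3f}")
```

With equal scores, the share of attention going to text is set purely by token counts, so a short prompt is easily outweighed by many image tokens; a factor above 1 on positive text scores shifts mass back toward the text. Letting `gamma` vary with the denoising step would be one way to make such a rebalancing dynamic, but no schedule is specified here.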
For Stable Diffusion 3.5, simply run:

    python infer/infer_sd3.py

For FLUX.1, run:

    python infer/infer_flux.py

Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models.
| Model | Attribute Binding: Color | Attribute Binding: Shape | Attribute Binding: Texture | Object Relationship: Spatial | Object Relationship: Non-Spatial | Complex |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA (…) | 0.7843 | 0.5362 | 0.6872 | 0.2405 | 0.3041 | 0.4494 |
| FLUX.1-Dev + TACA (…) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | 0.3046 | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA (…) | 0.8074 | 0.5938 | 0.7522 | 0.2678 | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA (…) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | 0.3111 | 0.4505 |
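
To read the table as gains over the baseline, a small sketch like the following computes absolute and relative differences per column. The scores are copied from the FLUX.1-Dev row and the first FLUX.1-Dev + TACA row above; the helper name `gains` and the variable names are illustrative and not part of this repository.

```python
# Scores copied from the T2I-CompBench table above, in table column order.
columns = ["Color", "Shape", "Texture", "Spatial", "Non-Spatial", "Complex"]
baseline = [0.7678, 0.5064, 0.6756, 0.2066, 0.3035, 0.4359]   # FLUX.1-Dev
with_taca = [0.7843, 0.5362, 0.6872, 0.2405, 0.3041, 0.4494]  # first FLUX.1-Dev + TACA row

def gains(base, new):
    """Return a list of (absolute gain, relative gain in percent), one per column."""
    return [(n - b, 100.0 * (n - b) / b) for b, n in zip(base, new)]

for name, (abs_gain, rel_gain) in zip(columns, gains(baseline, with_taca)):
    print(f"{name:<12} {abs_gain:+.4f}  ({rel_gain:+.1f}%)")
```

Swapping in the SD3.5-Medium rows gives the same comparison for the other base model.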