Plug-and-play Example

This folder contains plug-and-play inference scripts for five Diffusers video models. The scripts replace SDPA with sageattn.

Supported models:

We can replace scaled_dot_product_attention easily. We will take CogVideoX as an example.

Just add the following code and run it:

from sageattention import sageattn
import torch.nn.functional as F

# Route every later call to F.scaled_dot_product_attention through SageAttention.
F.scaled_dot_product_attention = sageattn
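The one-line patch works because Python looks up `F.scaled_dot_product_attention` each time the function is called. The stdlib-only sketch below shows what that means in practice. It uses a stand-in module so it runs without torch. `fake_functional`, `original_sdpa` and `fake_sageattn` are hypothetical placeholders, not torch or SageAttention objects.

Code that calls through the module picks up the patch. A reference captured before the patch keeps the original function. An example of such a reference is `from torch.nn.functional import scaled_dot_product_attention`.

```python
import types

# Hypothetical stand-in for torch.nn.functional, so this runs without torch.
F = types.ModuleType("fake_functional")


def original_sdpa(q, k, v, **kwargs):
    return "sdpa"


def fake_sageattn(q, k, v, **kwargs):
    return "sage"


F.scaled_dot_product_attention = original_sdpa

# A caller that binds the function object once, before the patch.
early_bound = F.scaled_dot_product_attention


def attention_via_module(q, k, v):
    # Resolves the attribute on every call, like `F.scaled_dot_product_attention(...)`.
    return F.scaled_dot_product_attention(q, k, v)


# The same kind of assignment as in the snippet above.
F.scaled_dot_product_attention = fake_sageattn

print(attention_via_module(1, 2, 3))  # sage
print(early_bound(1, 2, 3))           # sdpa
```

So the assignment should run before importing any code that binds the name directly. Code that calls through `F.` is covered regardless of import order.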

Specifically, for CogVideoX:

cd example
python cogvideox_infer.py --model cogvideox-2b --compile --attention_type sage

The generated video is saved in ./example/videos/<model>/<attention_type>/. It is lossless and is produced faster than with --attention_type sdpa.

Note: If you set --compile, the first run is slower than subsequent runs because it includes compilation. Run the script twice to measure the speed accurately.
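The note above amounts to "discard warm-up runs before timing". The minimal sketch below makes that explicit. `benchmark` is a hypothetical helper, not part of these scripts.

- It times a zero-argument callable.
- It drops the first `warmup` runs, which is where `torch.compile` would pay its compilation cost.
- It reports the median of the remaining runs.

On a GPU, the optional `sync` hook is meant to be `torch.cuda.synchronize`. That way, asynchronous kernel launches are counted inside the timed region.

```python
import time
from statistics import median
from typing import Callable, Optional


def benchmark(fn: Callable[[], object], warmup: int = 1, repeats: int = 3,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time fn(), discarding `warmup` runs (e.g. torch.compile compilation)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    def timed() -> float:
        if sync is not None:
            sync()  # drain previously queued GPU work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return time.perf_counter() - start

    warmup_times = [timed() for _ in range(warmup)]
    steady_times = [timed() for _ in range(repeats)]
    return {
        "warmup": warmup_times,
        "steady": steady_times,
        "steady_median": median(steady_times),
    }
```

For example, assume `run_pipeline` is a hypothetical closure that calls the pipeline once. You would then call `benchmark(run_pipeline, warmup=1, repeats=3, sync=torch.cuda.synchronize)`. Run this once with `--attention_type sage` and once with `--attention_type sdpa`, then compare the `steady_median` values.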

Note: torch.compile is generally incompatible with enable_sequential_cpu_offload(). Don't use them together.

Modify Attention From Source Code

To have finer control over where to use SageAttention, you can modify a small subset of the source code. For example, in modify_mochi.py, you can replace the MochiAttnProcessor2_0 from diffusers with your own attention class.
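A lighter alternative is to limit the global patch to one region of code, for example the transformer denoising loop. This is a different technique from editing a processor class and is not what modify_mochi.py does.

The sketch below assumes a hypothetical `patched` context manager. It swaps a module attribute and always restores it on exit. `unittest.mock.patch.object` from the standard library behaves the same way. The namespace `F` and `fake_sageattn` are hypothetical stand-ins so the example runs without torch.

```python
import contextlib
import types
from typing import Any, Iterator


@contextlib.contextmanager
def patched(module: Any, name: str, replacement: Any) -> Iterator[None]:
    """Temporarily replace module.<name>; restore the original on exit, even on error."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        setattr(module, name, original)


def original_sdpa(q, k, v, **kwargs):
    return "sdpa"


def fake_sageattn(q, k, v, **kwargs):
    return "sage"


# Hypothetical stand-in for torch.nn.functional.
F = types.SimpleNamespace(scaled_dot_product_attention=original_sdpa)


def denoise_step():
    # Calls through the namespace, so it sees whatever is installed right now.
    return F.scaled_dot_product_attention(None, None, None)


with patched(F, "scaled_dot_product_attention", fake_sageattn):
    inside = denoise_step()
outside = denoise_step()
print(inside, outside)  # sage sdpa
```

With the real libraries, the call would presumably look like `with patched(torch.nn.functional, "scaled_dot_product_attention", sageattn): ...` around the part of the pipeline you want accelerated. Any attention outside the `with` block keeps using SDPA.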

Note: In Diffusers pipelines, HunyuanVideo passes an attention_mask, which the sageattn API does not support. As a workaround, follow SageAttention Issue #115 and modify the official attention implementation from the HunyuanVideo repo so that it splits text tokens from image tokens. Then apply SageAttention only to the large, mask-free image-token self-attention, and keep the masked text part on SDPA/FlashAttention.
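A simpler safeguard than the Issue #115 rewrite is possible, though it does not implement the text/image split. The idea is to patch SDPA with a dispatcher that sends only mask-free calls to the fast kernel. Any call with a mask, dropout or an explicit scale falls back to the original SDPA, so a masked call never reaches sageattn.

This is a sketch under two assumptions:
- `sageattn` accepts an `is_causal` keyword.
- Its default tensor layout matches SDPA's `(batch, heads, seq_len, head_dim)`.

`make_sdpa_dispatch` is a hypothetical helper. The callables are injected so the logic can be checked without a GPU.

```python
from typing import Any, Callable


def make_sdpa_dispatch(fast_attn: Callable[..., Any],
                       fallback_sdpa: Callable[..., Any]) -> Callable[..., Any]:
    """Send mask-free, dropout-free calls to fast_attn; everything else to fallback_sdpa."""

    def dispatch(query, key, value, attn_mask=None, dropout_p=0.0,
                 is_causal=False, scale=None, **kwargs):
        if attn_mask is None and dropout_p == 0.0 and scale is None and not kwargs:
            # Mask-free path (assumed to be the large image-token self-attention).
            return fast_attn(query, key, value, is_causal=is_causal)
        # Masked or otherwise unsupported call: keep the original implementation.
        extra = dict(kwargs)
        if scale is not None:
            extra["scale"] = scale  # only forward `scale` when the caller set it
        return fallback_sdpa(query, key, value, attn_mask=attn_mask,
                             dropout_p=dropout_p, is_causal=is_causal, **extra)

    return dispatch
```

With the real libraries, the wiring would presumably be `F.scaled_dot_product_attention = make_sdpa_dispatch(sageattn, F.scaled_dot_product_attention)`. Capture the original SDPA before the assignment, so the fallback does not call the dispatcher recursively.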

Parallel SageAttention Inference

Install xDiT (xfuser >= 0.3.5), build diffusers (>= 0.32.0.dev0) from source, then run:

# install latest xDiT(xfuser).
pip install "xfuser[flash_attn]"
# install latest diffusers (>=0.32.0.dev0), needed by latest xDiT.
git clone https://github.com/huggingface/diffusers.git
# build and install the wheel in a subshell so the working directory stays unchanged
(cd diffusers && python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl)
# then run parallel sage attention inference.
./run_parallel.sh
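Before running `run_parallel.sh`, it may help to confirm that the installed versions meet the minimums listed above. The sketch below is a hypothetical helper, not part of the repo. It reads installed versions with the standard library's `importlib.metadata`.

It compares only the numeric release part, so `0.32.0.dev0` counts as `0.32.0` and pre-release tags are ignored. For exact PEP 440 ordering, `packaging.version.Version` is the usual tool.

```python
import re
from importlib import metadata
from typing import Callable, Dict, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Numeric release part of a version string: '0.32.0.dev0' -> (0, 32, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at a non-numeric segment such as 'dev0'
        parts.append(int(match.group()))
        if match.end() != len(piece):
            break  # stop after a mixed segment such as '2rc1'
    return tuple(parts)


def at_least(installed: str, minimum: str) -> bool:
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.4' compares like '0.4.0'.
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


# Minimums taken from the install instructions above.
REQUIREMENTS = {"xfuser": "0.3.5", "diffusers": "0.32.0.dev0"}


def check(requirements: Dict[str, str] = REQUIREMENTS,
          get_version: Callable[[str], str] = metadata.version) -> Dict[str, bool]:
    results = {}
    for name, minimum in requirements.items():
        try:
            results[name] = at_least(get_version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = False  # not installed at all
    return results


if __name__ == "__main__":
    for name, ok in check().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```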