
Don't disable uneven k to support more headdims #21

Open
njhill wants to merge 1 commit into vllm-project:main-archive from njhill:uneven_k

Conversation

@njhill
Member

njhill commented Sep 27, 2024

Uneven K was originally disabled by @WoosukKwon in eee8e47.

@WoosukKwon
Collaborator

WoosukKwon commented Sep 30, 2024

Adding the flag increased the wheel size of vllm-flash-attn from 107 MB to 143 MB (I'm not sure whether this is before or after compression).

@njhill
Member Author

njhill commented Sep 30, 2024

Thanks @WoosukKwon, I guess that's fairly significant.

The motivation here is that we currently depend on flash attention for certain features like multi-step scheduling, so it's a big performance downside for models that have these other head sizes (in particular, some of the IBM Granite models use head_dim 80).

@WoosukKwon
Collaborator

@njhill We can just add head size 80 to

HEAD_DIMENSIONS = [32, 64, 96, 128, 160, 192, 224, 256]

Would that work for you?
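To make the trade-off concrete, here is a deliberately simplified Python model of the dispatch question being discussed. It assumes that one kernel is compiled per entry in `HEAD_DIMENSIONS`, that a model's `head_dim` would be served by the smallest compiled size that is at least as large, and that "even K" means the two match exactly. The function `pick_kernel_head_dim` and its `allow_uneven_k` parameter are hypothetical illustrations, not the real flash-attention or vLLM code, and this sketch does not capture whatever made the head-size-80 approach fail in practice.

```python
from typing import Optional, Sequence

# List quoted in the discussion above.
HEAD_DIMENSIONS = [32, 64, 96, 128, 160, 192, 224, 256]


def pick_kernel_head_dim(
    head_dim: int, head_dims: Sequence[int], allow_uneven_k: bool
) -> Optional[int]:
    """Hypothetical: return the compiled head dim that would serve `head_dim`.

    Simplifying assumption: the smallest compiled head dim >= `head_dim` is
    used. If it differs from `head_dim`, the "uneven K" path is required.
    Returns None when no compiled kernel can serve the request.
    """
    for kernel_dim in sorted(head_dims):
        if kernel_dim >= head_dim:
            if kernel_dim == head_dim or allow_uneven_k:
                return kernel_dim
            # Only an uneven-K kernel would fit, and that path is disabled.
            return None
    # head_dim is larger than every compiled head dim.
    return None


if __name__ == "__main__":
    print(pick_kernel_head_dim(80, HEAD_DIMENSIONS, allow_uneven_k=False))  # None
    print(pick_kernel_head_dim(80, HEAD_DIMENSIONS, allow_uneven_k=True))  # 96
    print(pick_kernel_head_dim(80, HEAD_DIMENSIONS + [80], allow_uneven_k=False))  # 80
```

Under these assumptions, head_dim 80 is unsupported with uneven K disabled, falls back to the 96 kernel if uneven K is enabled (the option this PR proposes, at the wheel-size cost noted above), or matches exactly once 80 is added to the list (the alternative suggested here).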

@njhill
Member Author

njhill commented Sep 30, 2024

@WoosukKwon yes, I think for our immediate needs that would be great, if you're sure that will be sufficient!

@njhill
Member Author

njhill commented Oct 2, 2024

@WoosukKwon I've opened #22 for this.

@WoosukKwon
Collaborator

@njhill Since the idea above doesn't work, can you check how much this PR affects the vllm wheel size?
