About KV-Cache #7

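Hello, I read this project's paper, 'MoH: Multi-Head Attention as Mixture-of-Head Attention', and I have some questions about how well it fits with KV-Cache, a technique commonly used in LLMs. Because attention heads are sparsely activated, the KV values are also computed sparsely. In other words, when a KV-Cache is used, if a previous token was not processed by the current head, the corresponding KV entries are missing from the cache unless they are recomputed on the spot. The 2022 paper 'Mixture of Attention Heads: Selecting Attention Heads Per Token' addresses this with shared KV: all heads in a layer share a single pair of K and V. However, this greatly reduces the diversity of the KV computed for each head and therefore limits the capability of each expert.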
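To make the concern concrete, here is a minimal sketch that counts how many cached KV entries would be missing under per-token head routing. It is only an illustration of the premise in this question, not MoH's actual implementation: the function name `simulate_cache_misses`, the random router, and the assumption that a token's K/V are computed only for the heads it activates are all hypothetical.

```python
import random


def simulate_cache_misses(num_tokens=16, num_heads=8, top_k=4, seed=0):
    """Count KV entries missing from a per-head cache under per-token routing.

    Assumption (illustrative only): K/V for a token are computed and cached
    only for the heads that the token activates.
    Returns (needed, missing): the number of past KV entries the active heads
    want to read, and how many of those were never cached.
    """
    rng = random.Random(seed)
    # cached[h] holds the positions whose K/V exist for head h.
    cached = [set() for _ in range(num_heads)]
    needed = 0
    missing = 0
    for t in range(num_tokens):
        # Hypothetical router: token t activates top_k heads chosen at random.
        active = rng.sample(range(num_heads), top_k)
        for h in active:
            # Head h attends over all previous positions 0..t-1.
            needed += t
            missing += t - len(cached[h])
            # Token t's own K/V are now cached for head h.
            cached[h].add(t)
    return needed, missing


if __name__ == "__main__":
    needed, missing = simulate_cache_misses()
    print(f"missing {missing} of {needed} cached KV entries")
```

With `top_k` equal to `num_heads` every head sees every token, so nothing is missing. With sparse routing, each missing entry would have to be recomputed at decode time, which is the extra cost asked about here.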

How does this project integrate with KV-Cache? If the two cannot be combined, what is the impact on compute cost and running time?

I look forward to your reply.
