
DingWeiPeng/Transformer-decoder-only-with-KV-Cache

This repository implements a decoder-only Transformer model that incorporates both Key-Value (KV) caching and absolute positional encoding. The code was written with reference to Figure 1 of the seminal paper "Attention Is All You Need"; however, the implementation covers only the right half of Figure 1, namely the Decoder, while preserving an interface to the Encoder within the Decoder.

Figure 1. The Transformer architecture from "Attention Is All You Need"; only the right half (the Decoder) is implemented here.
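The core idea behind KV caching can be illustrated with a short PyTorch sketch. This is not the repository's code; tensor names and shapes are illustrative. During incremental decoding, the keys and values of all previous tokens are kept in a cache, so each new step only attends from the newest token instead of recomputing attention over the whole sequence.

```python
# A minimal sketch of KV caching for a single attention module (illustrative only).
import math
import torch

def cached_attention_step(q_new, k_new, v_new, kv_cache):
    """q_new, k_new, v_new: (batch, 1, d_model) projections of the newest token.
    kv_cache: dict holding the concatenated keys/values of all earlier tokens."""
    if kv_cache["k"] is None:
        kv_cache["k"], kv_cache["v"] = k_new, v_new
    else:
        kv_cache["k"] = torch.cat([kv_cache["k"], k_new], dim=1)  # (batch, t, d_model)
        kv_cache["v"] = torch.cat([kv_cache["v"], v_new], dim=1)
    d_model = q_new.size(-1)
    # The newest query attends to every cached position; no causal mask is needed
    # because the cache only ever contains past (and current) tokens.
    scores = q_new @ kv_cache["k"].transpose(1, 2) / math.sqrt(d_model)  # (batch, 1, t)
    weights = torch.softmax(scores, dim=-1)
    return weights @ kv_cache["v"]  # (batch, 1, d_model)

cache = {"k": None, "v": None}
for _ in range(4):                      # pretend we decode four tokens
    q = k = v = torch.randn(2, 1, 16)   # stand-in projections of the newest token
    out = cached_attention_step(q, k, v, cache)
print(out.shape)  # torch.Size([2, 1, 16])
```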

At the end of the program, the decoder-only Transformer is used for autoregressive causal sequence prediction. We also verify whether the sequences generated with KV caching are identical to those generated without it. The experimental results show that, when generating hundreds to thousands of tokens, the two sequences are exactly the same.
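The consistency check can be sketched as follows, assuming a hypothetical `generate(prompt, n_tokens, use_cache=...)` interface; the actual interface in KV_Cache_Transformer.py may differ.

```python
# A hedged sketch of the cached-vs-uncached consistency check (hypothetical interface).
import torch

def sequences_match(model, prompt, n_tokens):
    """Return True if greedy decoding with and without the KV cache yields
    token-for-token identical sequences."""
    with torch.no_grad():
        cached = model.generate(prompt, n_tokens, use_cache=True)
        uncached = model.generate(prompt, n_tokens, use_cache=False)
    return torch.equal(cached, uncached)
```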

The program was written following the principle of high cohesion and low coupling: each module in the Decoder is implemented as its own class, and these classes call each other without any obvious containment relationships. Consequently, when first reading KV_Cache_Transformer.py and attention.py, you may find it challenging to grasp the program's logic and appreciate its design. If the program seems confusing, please read it several times and step through it in a debugger.

Running KV_Cache_Transformer.py requires only PyTorch. Make sure that the number of tokens does not exceed the number of vocabulary entries (tokens ≤ num_vocbs).
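As a minimal preflight sketch of that constraint (the values are placeholders and the script's own checks may differ):

```python
# Placeholder preflight check; num_vocbs and num_tokens are assumed values.
import torch  # the only dependency required to run KV_Cache_Transformer.py

num_vocbs = 1024    # assumed vocabulary size of the toy model
num_tokens = 511    # number of tokens to generate, as in Figure 2
assert num_tokens <= num_vocbs, "tokens must not exceed num_vocbs"
```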

Last but not least, the program also verifies the model's token generation speed. Figure 2 shows that when generating 511 tokens in a causal autoregressive way, the sequences produced with and without KV caching are EXACTLY IDENTICAL. It is particularly noteworthy that KV caching speeds up generation by a factor of ten on the CPU. Figures 2 and 3 show that a one-order-of-magnitude speedup also holds when the number of tokens reaches the thousands.

Figure 2. Speed and sequence comparison with and without KV caching on the CPU when generating 511 tokens.

Figure 3. Speed and sequence comparison with and without KV caching on the CPU when generating 1023 tokens.
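A speed comparison like the one in Figures 2 and 3 can be timed with a simple wrapper, again assuming the hypothetical `generate(prompt, n_tokens, use_cache=...)` interface from the sketch above; the repository's own benchmark may be organized differently.

```python
# A sketch of timing generation with and without the KV cache (hypothetical interface).
import time
import torch

def time_generation(model, prompt, n_tokens, use_cache):
    """Return the generated sequence and the wall-clock time it took."""
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(prompt, n_tokens, use_cache=use_cache)
    return out, time.perf_counter() - start

# Example usage (commented out because it needs a concrete model and prompt):
# seq_cached, t_cached = time_generation(model, prompt, 511, use_cache=True)
# seq_plain,  t_plain  = time_generation(model, prompt, 511, use_cache=False)
# print(t_plain / t_cached, torch.equal(seq_cached, seq_plain))
```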
