Loosely follows https://arxiv.org/pdf/1706.03762 (Attention Is All You Need), but implements only the decoder part, since GPT is a decoder-only model
- Video that walks through all the steps and answers most questions you may have: https://www.youtube.com/watch?v=kCc8FmEb1nY
- Uses a simple character encoder
- For learning purposes only, to get more insight into GPT models
- Layer norms and residual connections are applied differently in GPT than in the paper
- Uses masked attention in every head; open question: is the mask really needed for each one?
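
To make the masked attention point above concrete, here is a minimal pure-Python sketch of one causally masked attention head. The function name `causal_attention` and the toy inputs are hypothetical and not taken from this repo. Each head computes its own attention weights, so in this sketch the mask is applied inside the head: position `t` only scores positions `0..t`.

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats). Hypothetical helper
    for illustration; a real model would use batched tensor operations.
    """
    T, d = len(q), len(q[0])
    outputs, weights = [], []
    for t in range(T):
        # Score only against positions <= t; future positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        # Masked (future) positions get weight exactly 0.
        w = [e / z for e in exps] + [0.0] * (T - t - 1)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))]
        )
    return outputs, weights

# Toy input: 3 positions, 2 dimensions, with q = k = v for brevity.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = causal_attention(x, x, x)
```

Without the mask, the first position could attend to later tokens, which would leak the targets during next-token training.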
See training for the results
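
The simple character encoder mentioned above can be sketched roughly as follows. The sample text and the names `stoi`, `itos`, `encode`, and `decode` are assumptions for illustration; the actual code may build its vocabulary from a different corpus.

```python
# Hypothetical sample text; a real run would read the training corpus.
text = "hello world"

# Vocabulary: every distinct character, sorted so ids are stable.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character

def encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of integer ids back to the original string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
roundtrip = decode(ids)
```

A character-level vocabulary stays tiny, at the cost of longer sequences than a subword tokenizer would produce.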