A PyTorch implementation of GPT-2 training from scratch, featuring modern optimizations and best practices.
- Pure PyTorch implementation of GPT-2 architecture
- Flash Attention support for improved performance (see the sketch after this list)
- Mixed precision training with bfloat16
- Weight sharing between input embeddings and output layer
- Configurable model size and architecture
- Support for loading pretrained weights from HuggingFace
- Efficient data loading with tiktoken tokenization
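The attention path is the core of the Flash Attention and causal-masking support listed above. A minimal sketch of what it might look like; the class and attribute names (`CausalSelfAttention`, `c_attn`, `c_proj`) are assumptions, not necessarily what `train_gpt2.py` uses. PyTorch's `F.scaled_dot_product_attention` dispatches to a Flash Attention kernel when one is available:

```python
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with causal masking (illustrative sketch)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused query/key/value projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so each head attends independently
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # scaled_dot_product_attention uses a Flash Attention kernel when available;
        # is_causal=True applies the causal mask inside the fused kernel
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble head outputs
        return self.c_proj(y)
```

Passing `is_causal=True` applies the causal mask inside the fused kernel, so no explicit mask tensor needs to be materialized.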
Install the dependencies:

```
pip install torch tiktoken transformers
```

The implementation includes (a sketch of how these pieces fit together follows the list):
- Multi-head self-attention with causal masking
- Position-wise feed-forward networks
- Layer normalization
- Residual connections
- Weight sharing between input and output embeddings
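As a rough sketch of how these components combine (pre-norm layer normalization, residual connections, and the position-wise feed-forward network), assuming the `CausalSelfAttention` module sketched above; names are illustrative rather than taken from the repo:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Position-wise feed-forward network: 4x expansion with a GELU nonlinearity."""
    def __init__(self, n_embd):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    """Pre-norm transformer block: layer norm, then attention/MLP, each with a residual."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)  # module from the attention sketch above
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the feed-forward network
        return x
```

Weight sharing then ties the token embedding matrix to the output projection, e.g. `self.transformer.wte.weight = self.lm_head.weight` in the model constructor (attribute names are an assumption here), so the same matrix maps tokens in and logits out.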
```python
import torch
from train_gpt2 import GPT, GPTConfig, DataLoaderLite

# Initialize model and data loader
model = GPT(GPTConfig())
train_loader = DataLoaderLite(B=4, T=32)

# Training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(50):
    x, y = train_loader.next_batch()
    optimizer.zero_grad()
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
```

To load pretrained weights from HuggingFace:

```python
model = GPT.from_pretrained('gpt2')  # or 'gpt2-medium', 'gpt2-large', 'gpt2-xl'
```

The model can be configured through GPTConfig:
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # Maximum sequence length
    vocab_size: int = 50257  # Vocabulary size (GPT-2 default)
    n_layer: int = 6         # Number of transformer layers
    n_head: int = 6          # Number of attention heads
    n_embd: int = 384        # Embedding dimension
```

Training optimizations (a sketch of the mixed-precision setup follows this list):
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- High-precision matrix multiplication
- Efficient data loading with tiktoken
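A minimal sketch of how these optimizations are typically enabled in PyTorch, assuming "high-precision matrix multiplication" refers to `torch.set_float32_matmul_precision('high')` and that the GPU supports bfloat16; the exact flags and their placement in `train_gpt2.py` may differ:

```python
import torch
from train_gpt2 import GPT, GPTConfig, DataLoaderLite

# 'high' allows TF32 tensor cores for float32 matmuls (faster, minimal accuracy loss)
torch.set_float32_matmul_precision('high')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT(GPTConfig()).to(device)
train_loader = DataLoaderLite(B=4, T=32)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(50):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    # run the forward pass in bfloat16 mixed precision; parameters stay in float32
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
```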
The training script expects an input.txt file in the same directory containing the training text. The text will be tokenized using the GPT-2 tokenizer (tiktoken).
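A rough sketch of what the data loading might look like; only the `B`/`T` constructor arguments and `next_batch()` appear in the usage example above, so the rest of this `DataLoaderLite` is an assumption rather than the repo's implementation:

```python
import tiktoken
import torch

class DataLoaderLite:
    """Streams fixed-size (B, T) batches of GPT-2 tokens from input.txt."""
    def __init__(self, B, T):
        self.B, self.T = B, T
        with open('input.txt', 'r') as f:
            text = f.read()
        enc = tiktoken.get_encoding('gpt2')           # GPT-2 BPE tokenizer
        self.tokens = torch.tensor(enc.encode(text))
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)  # inputs
        y = buf[1:].view(B, T)   # targets, shifted by one token
        self.pos += B * T
        # wrap around when the next batch would run past the end of the data
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0
        return x, y
```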
MIT License
- Implementation based on the original GPT-2 paper and architecture
- Flash Attention implementation from PyTorch
- Tokenization using tiktoken