Compared with a seq2seq (encoder-decoder) model, a decoder-only architecture is usually a less natural fit for these tasks.
To keep the setup as clean as possible (avoiding padding tricks and special attention masks), generation in this implementation is not batched; sequences are generated one at a time.
pip install nltk rouge-score fire datasets transformers sacrebleu
pip install -e .
bash compile_cuda.sh
We're still missing a few important arithmetic operations for Transformers, namely element-wise (e-wise) power and element-wise tanh.
1. Implement the forward and backward functions for the Tanh and PowerScalar tensor functions in minitorch/tensor_functions.py
Recall the structure of miniTorch from lecture: calling .tanh() on a tensor, for example, invokes a Tensor Function defined in tensor_functions.py. These functions are implemented on the CudaKernelBackend, which executes the actual operations on the tensors.
You should use tanh_map and pow_scalar_zip, which have already been added to the TensorBackend class and which your CudaKernelOps should then implement.
Don't forget to save the necessary values to the context in the forward pass; you'll need them in the backward pass when computing the derivatives.
Since tanh and power are applied element-wise, the gradient calculations should be very simple.
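Concretely, the element-wise derivatives are d/dx tanh(x) = 1 - tanh(x)^2 and d/dx x^n = n * x^(n-1). As a rough sketch of the pattern (assuming the Function/Context API from the earlier miniTorch modules; exact names may differ in your copy), Tanh could look like:

```python
class Tanh(Function):
    @staticmethod
    def forward(ctx: Context, t1: Tensor) -> Tensor:
        out = t1.f.tanh_map(t1)     # the backend map op you exposed
        ctx.save_for_backward(out)  # the derivative reuses the forward output
        return out

    @staticmethod
    def backward(ctx: Context, grad_output: Tensor) -> Tensor:
        (out,) = ctx.saved_values
        # d/dx tanh(x) = 1 - tanh(x)^2
        return grad_output * (-out * out + 1.0)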
Add the following snippet to your __device__ float fn function in minitorch/combine.cu:

```cuda
case POW: {
    // BEGIN YOUR SOLUTION
    return;
    // END YOUR SOLUTION
}
case TANH: {
    // BEGIN YOUR SOLUTION
    return;
    // END YOUR SOLUTION
}
```
Complete the CUDA code to support element-wise power and tanh.
You can look up the relevant mathematical functions here: CUDA Math API
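Once the kernels compile, a quick comparison against NumPy can catch mistakes before running the full test suite. This is a hypothetical sanity check: the import path, and the assumption that ** routes to your PowerScalar function, may not match your repo, and the student tests below remain authoritative.

```python
import numpy as np
import minitorch
from minitorch.cuda_kernel_ops import CudaKernelOps  # assumed import path

backend = minitorch.TensorBackend(CudaKernelOps)
x_np = np.array([0.5, -1.0, 2.0])
x = minitorch.tensor(x_np.tolist(), backend=backend)

assert np.allclose(x.tanh().to_numpy(), np.tanh(x_np), atol=1e-5)
assert np.allclose((x ** 2.0).to_numpy(), x_np ** 2.0, atol=1e-5)
```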
The accompanying tests are in tests/test_tensor_general_student.py
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_pow_1_student"
Run the following to test all parts of Problem 1:
python -m pytest -l -v -m a2_1
You will be implementing all the necessary functions and modules for a decoder-only transformer model. PLEASE PLEASE PRETTY PLEASE READ THE IMPLEMENTATION DETAILS SECTION BEFORE STARTING; it contains advice for working with miniTorch.
Implement the GELU activation, logsumexp, one_hot, and softmax_loss functions in minitorch/nn.py
The accompanying tests are in tests/test_nn_student.py
Hints:
- one_hot: Since MiniTorch doesn't support slicing/indexing with tensors, you'll want to use NumPy's eye function. You can use the .to_numpy() function on MiniTorch tensors here (try to avoid it in other functions, because it's expensive). See the sketch after these hints.
- softmax_loss: You'll want to make use of your previously implemented one_hot function.
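For logsumexp, recall the max-subtraction trick for numerical stability: logsumexp(x) = m + log(sum(exp(x - m))) with m = max(x). For one_hot, here is a minimal sketch along the lines of the hint above (assuming tensor_from_numpy accepts a backend argument; adjust to the actual signature in your repo):

```python
import numpy as np

def one_hot(input: Tensor, num_classes: int) -> Tensor:
    # Indexing the identity matrix with integer token ids yields one-hot rows:
    # (batch, seq) ids -> (batch, seq, num_classes) one-hot floats.
    ids = input.to_numpy().astype(int)
    return tensor_from_numpy(np.eye(num_classes)[ids], backend=input.backend)
```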
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_gelu_student"
Run the following to test all parts of Problem 2:
python -m pytest -l -v -m a2_2
Implement the Embedding, Dropout, Linear, and LayerNorm1d modules in minitorch/modules_basic.py
The accompanying tests are in tests/test_modules_basic_student.py
Hints:
- Embedding: You'll want to use your one_hot function to easily get embeddings for all your tokens; see the sketch after this hint.
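A hedged sketch of Embedding's forward pass (attribute names such as self.weights are assumptions): multiplying the one-hot matrix by the weight matrix selects the corresponding embedding rows.

```python
def forward(self, x: Tensor) -> Tensor:
    # x: (batch, seq) integer token ids.
    bs, seq = x.shape
    oh = one_hot(x, self.num_embeddings)           # (bs, seq, num_embeddings)
    flat = oh.view(bs * seq, self.num_embeddings)  # flatten to 2D for matmul
    emb = flat @ self.weights.value                # (bs * seq, embedding_dim)
    return emb.view(bs, seq, self.embedding_dim)
```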
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_embedding_student"
Run the following to test all parts of Problem 3:
python -m pytest -l -v -m a2_3
Implement the MultiHeadAttention, FeedForward, TransformerLayer, and DecoderLM modules in minitorch/modules_transfomer.py.
The accompanying tests are in tests/test_modules_transformer_student.py
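Most shape bugs in MultiHeadAttention come from splitting and recombining heads. One common pattern, sketched with assumed names (not necessarily how your module is organized):

```python
def split_heads(x: Tensor, n_head: int) -> Tensor:
    # (batch, seq, n_embd) -> (batch, n_head, seq, n_embd // n_head)
    bs, seq, n_embd = x.shape
    x = x.view(bs, seq, n_head, n_embd // n_head)
    # permute yields a non-contiguous tensor; a later view needs .contiguous()
    return x.permute(0, 2, 1, 3).contiguous()
```

With q, k, and v in this layout, the batched matrix multiplication described in the implementation details below gives attention scores of shape (batch, n_head, seq, seq).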
Run the following to test an individual function, e.g.:
python -m pytest -l -v -k "test_multihead_attention_student"
Run the following to test all parts of Problem 4:
python -m pytest -l -v -m a2_4
Implement a machine translation pipeline in project/run_machine_translation.py
- Initializing parameters
When initializing weights in a Module, always wrap them with Parameter(...); otherwise miniTorch will not update them. See the Linear sketch after this list.
- Using _from_numpy functions
We've provided a new set of tensor initialization functions, e.g. tensor_from_numpy. Feel free to use them in functions like one_hot (since miniTorch doesn't support slicing) or whenever you need NumPy functionality that miniTorch lacks. Only initialize tensors from a fresh NumPy ndarray (slicing is OK), i.e. don't call functions like .T on the ndarray before passing it in. If you're getting errors, create a copy or deep copy of your NumPy array, though you shouldn't need to.
- Initializing weights
You'll need to initialize weights from certain distributions. You may want to do so with NumPy's random functions and use tensor_from_numpy to create the corresponding tensor.
- Requiring gradients
When you initialize parameters, e.g. in LayerNorm, make sure you set requires_grad_ on any parameters or tensors you'll need to update.
- Implicit broadcasting
Unlike NumPy or Torch, we don't have broadcast_to available, but we do have implicit broadcasting: given tensors of shape (2, 2) and (1, 2), for example, you can add them, and the second tensor will be broadcast to the shape of the first using standard broadcasting rules. You'll encounter this when building your modules, so keep it in mind whenever you feel like you need broadcast_to; see the short example after this list.
- Contiguous arrays
Some operations, like view, require the underlying array to be contiguous. Sometimes adding a .contiguous() call will fix your error.
- No Sequential
There is no easy way to add sequential modules. Do not put transformer layers in a list or other iterable and loop over it in your forward function: miniTorch only registers submodules assigned as attributes, so layers hidden in a list will not be recognized or updated.
- Always add the backend
Always ensure your parameters are initialized with the correct backend (your CudaKernelOps) so that they run correctly.
- Batch matrix multiplication
We support batched matrix multiplication: given tensors A and B of shapes (a, b, m, n) and (a, b, n, p), A @ B has shape (a, b, m, p), where matrices are multiplied pairwise across dimensions 0 and 1.
- MiniTorch behavior when preserving dimensions
Reductions such as sum(dim) keep the reduced dimension with size 1 rather than dropping it, so you may need a view afterwards to get the shape you expect.
- Linear/LayerNorm
These modules expect 2D inputs: reshape a (batch, seq, dim) tensor to (batch * seq, dim) with view before applying them, then reshape back.
- Possible errors
  - You may be using the wrong backend.
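Pulling several of the notes above together (Parameter wrapping, NumPy-based initialization, requires_grad_, and the backend), here is a hedged sketch of how a Linear module might be initialized. The uniform bound and the tensor_from_numpy signature are assumptions, not a prescription:

```python
import numpy as np

class Linear(Module):
    def __init__(self, in_size: int, out_size: int, bias: bool, backend: TensorBackend):
        super().__init__()
        bound = 1.0 / np.sqrt(in_size)  # one common initialization choice
        w = np.random.uniform(-bound, bound, (in_size, out_size))
        # Wrap in Parameter so miniTorch registers and updates it, and pass
        # the backend so the tensor runs on your CudaKernelOps.
        self.weights = Parameter(
            tensor_from_numpy(w, backend=backend, requires_grad=True)
        )
        if bias:
            b = np.random.uniform(-bound, bound, (out_size,))
            self.bias = Parameter(
                tensor_from_numpy(b, backend=backend, requires_grad=True)
            )
```

And a tiny illustration of implicit broadcasting, which covers most cases where you might reach for broadcast_to:

```python
a = tensor_from_numpy(np.ones((2, 2)), backend=backend)  # shape (2, 2)
b = tensor_from_numpy(np.ones((1, 2)), backend=backend)  # shape (1, 2)
c = a + b  # b is implicitly broadcast to (2, 2); c has shape (2, 2)
```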
python project/run_torch_machine_translation.py
| | BLEU | Running time (RTX-3090) | Speed (RTX-3090) |
|---|---|---|---|
| Transformer (Seq2seq) | 34 | | |
| Ours (GPT2 with PyTorch) | 27 | 34 mins / epoch * 10 epochs | 59K tokens / sec. |
| Ours (GPT2 with MiniTorch) | | | |
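If you want to sanity-check BLEU outside the training script, the sacrebleu package installed earlier can score hypotheses against references directly (the strings here are hypothetical):

```python
from sacrebleu.metrics import BLEU

hyps = ["the cat sat on the mat"]
refs = [["the cat sat on a mat"]]  # one reference stream, aligned with hyps
print(BLEU().corpus_score(hyps, refs).score)
```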
python project/run_torch_conditional_generation.py --dataset_name gigaword --samples_per_epoch 200000
--samples_per_epoch 200000: gigaword has a very large training set (~3M examples), so each epoch we sample only a subset of it to save training time.
| | Rouge-1 | Rouge-2 | Rouge-L | Running time (RTX-3090) | Speed (RTX-3090) |
|---|---|---|---|---|---|
| Transformer (Seq2seq) | 37.57 | 18.90 | 34.69 | | |
| Ours (GPT2 with PyTorch) | 33.84 | 15.06 | 31.31 | 19 mins / epoch * 10 epochs | 50K tokens / sec. |
| Ours (GPT2 with MiniTorch) | | | | | |
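Likewise, the rouge-score package from the setup step can compute ROUGE for individual pairs (hypothetical strings; the training script's own evaluation remains the reference):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat sat on the mat",        # target / reference
                      "a cat was sitting on the mat")  # prediction
print(scores["rougeL"].fmeasure)
```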