The course on Large Language Models (LLMs) aims to equip students with a comprehensive understanding of the principles, architectures, and applications of state-of-the-art LLMs like GPT-4. This course is designed for graduate students in computer science, data science, and related fields who have a foundational knowledge of machine learning and artificial intelligence. By taking this course, students will gain valuable skills in developing, fine-tuning, and deploying LLMs, which are increasingly integral to advancements in natural language processing, automated content creation, and AI-driven decision-making. This expertise will not only enhance their academic and research capabilities but also significantly boost their employability in tech industries, research institutions, and innovative startups focused on AI and machine learning technologies.
Optional Textbooks
- Deep Learning by Goodfellow, Bengio, and Courville (free online)
- Machine Learning: A Probabilistic Perspective by Kevin Murphy (online)
- Natural Language Processing by Jacob Eisenstein (free online)
- Speech and Language Processing by Dan Jurafsky and James H. Martin (3rd ed. draft)
Optional Papers
- On the Opportunities and Risks of Foundation Models
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants
- Large Multimodal Models: Notes on CVPR 2023 Tutorial
- A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
- Interactive Natural Language Processing
- Towards Reasoning in Large Language Models: A Survey
By the end of this course, you should be able to:
- Analyze the underlying architectures and mechanisms of large language models.
- Implement and fine-tune large language models for specific applications.
- Evaluate the performance of large language models in various contexts.
- Design novel applications leveraging large language models to solve real-world problems.
- Critically assess the limitations and potential improvements of current large language models.
Assignments (individually graded)
- There will be two (2) assignments contributing to 2 * 25% = 50% of the total assessment.
- Students will be graded individually on the assignments. They may discuss the homework with each other, but each student must submit their own write-up and coding exercises.
Final Project (Group work but individually graded)
- There will be a final project contributing to the remaining 50% of the total coursework assessment.
- 3–6 people per group
- Presentation: 20%, report: 30%
- The project is group work, but students will be graded individually. The final project presentation will be used to verify each student's understanding of the project.
Prerequisites
- Proficiency in Deep Learning models
- Proficiency in Python (using NumPy and PyTorch)
- Deep Learning and NLP basics
Instructor
Teaching Assistants
Nguyen Tran Cong Duy
- Logistics of the course
- Introduction to deep learning
- Types of deep learning
- Introduction to large language models
- Programming in Python
- Jupyter Notebook and Google Colab
- Introduction to Python
- Deep Learning Frameworks
- Why PyTorch?
- Deep learning with PyTorch
- [Supplementary]
- Numerical programming with NumPy/SciPy - NumPy intro
- Numerical programming with PyTorch - PyTorch intro
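To complement the NumPy and PyTorch intros above, here is a minimal sketch (not part of the official lab materials) contrasting a plain NumPy computation with the equivalent PyTorch version, where autograd provides the gradient for free; the numbers are illustrative.

```python
# Minimal NumPy vs. PyTorch comparison: same computation, with autograd in PyTorch.
import numpy as np
import torch

# NumPy: plain array math, no gradients.
x_np = np.array([1.0, 2.0, 3.0])
w_np = np.array([0.5, -0.5, 2.0])
loss_np = ((x_np * w_np).sum() - 1.0) ** 2
print("NumPy loss:", loss_np)

# PyTorch: identical math, but requires_grad=True lets autograd compute d(loss)/d(w).
x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -0.5, 2.0], requires_grad=True)
loss = ((x * w).sum() - 1.0) ** 2
loss.backward()
print("PyTorch loss:", loss.item())
print("Gradient w.r.t. w:", w.grad)
```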
- From Logistic Regression to Feed-forward NN
- Activation functions
- SGD with Backpropagation
- Adaptive SGD (adagrad, adam, RMSProp)
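As a concrete companion to the feed-forward network and optimizer topics above, a minimal sketch of a small FFN trained with SGD with backpropagation on synthetic data; the data and hyperparameters are illustrative, and switching to an adaptive optimizer (Adam, RMSProp) is a one-line change.

```python
# Minimal feed-forward network trained with SGD on synthetic data (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)                      # 256 examples, 10 features
y = (X.sum(dim=1) > 0).float().unsqueeze(1)   # toy binary labels

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),            # activation function; try nn.Tanh() or nn.GELU()
    nn.Linear(32, 1),
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Adaptive variants are drop-in replacements:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()       # backpropagation
    optimizer.step()      # SGD update
```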
- Word Embeddings
- CNN
- RNN
- RNN variants
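For the word-embedding and RNN topics above, a minimal sketch (toy vocabulary and dimensions chosen purely for illustration) of an embedding layer feeding an LSTM encoder that produces one vector per sentence.

```python
# Minimal embedding + LSTM encoder over a batch of token-id sequences (illustrative).
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

token_ids = torch.randint(1, vocab_size, (4, 12))   # batch of 4 sequences, length 12
embedded = embedding(token_ids)                     # (4, 12, 64)
outputs, (h_n, c_n) = encoder(embedded)             # outputs: (4, 12, 128)
sentence_repr = h_n[-1]                             # (4, 128) final hidden state per sequence
```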
- Information bottleneck issue with vanilla Seq2Seq
- Attention to the rescue
- Details of attention mechanism
- Transformer architecture
- Self-attention
- Positional encoding
- Multi-head attention
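The attention items above reduce to a few lines of code: a minimal sketch of scaled dot-product attention, which is the core of both the Seq2Seq attention fix and the Transformer's self-attention (shapes are illustrative; multi-head attention runs several such maps in parallel on projected subspaces).

```python
# Minimal scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # attention distribution
    return weights @ v, weights

# Toy self-attention: queries, keys, and values all come from the same sequence.
x = torch.randn(2, 5, 16)                 # (batch, seq_len, d_model)
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)              # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```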
- Deep learning with PyTorch
- Linear Regression
- Logistic Regression
- NumPy notebook / PyTorch notebook
- Backpropagation
- Dropout
- Batch normalization
- Initialization
- Gradient clipping
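The lab items above (dropout, batch normalization, initialization, gradient clipping) fit into one short sketch of a training step; the values shown are illustrative, not recommended settings.

```python
# Illustrative training step combining dropout, batch norm, init, and gradient clipping.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),     # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.5),      # dropout
    nn.Linear(64, 2),
)

# Explicit (Xavier) initialization of the linear layers.
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

X = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))

optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
```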
- Word2Vec Tutorial - The Skip-Gram Model, blog
- Convolutional Neural Networks for Sentence Classification
- Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings
- Sequence to Sequence Learning with Neural Networks (original seq2seq NMT paper)
- Effective Approaches to Attention-based Neural Machine Translation
- Neural Machine Translation by Jointly Learning to Align and Translate (original seq2seq+attention paper)
- Attention Is All You Need
- The Illustrated Transformer
- Language model
- N-gram based LM
- Window-based Language Model
- Neural Language Models
- Encoder-decoder
- Seq2Seq
- Sampling algorithms
- Beam search
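To ground the language-modeling and decoding topics above, a small self-contained sketch: a count-based bigram LM over a toy corpus, with greedy decoding and temperature sampling. Beam search generalizes greedy decoding by keeping the k best partial sequences instead of a single one.

```python
# Toy count-based bigram language model with greedy decoding and temperature sampling.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams and normalize to conditional probabilities P(next | prev).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_distribution(prev):
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate(start="the", steps=6, temperature=None):
    word, out = start, [start]
    for _ in range(steps):
        dist = next_distribution(word)
        if not dist:
            break
        if temperature is None:                      # greedy decoding
            word = max(dist, key=dist.get)
        else:                                        # temperature sampling
            words = list(dist)
            weights = [dist[w] ** (1.0 / temperature) for w in words]
            word = random.choices(words, weights=weights, k=1)[0]
        out.append(word)
    return " ".join(out)

print(generate())                     # greedy
print(generate(temperature=1.5))      # more diverse samples
```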
- Sequence to Sequence Learning with Neural Networks (original seq2seq NMT paper)
- N-gram Language Models
- Karpathy’s nice blog on Recurrent Neural Networks
- Building an Efficient Neural Language Model
Instructions for choosing the final project topic
- FFN
- Mixture of Experts
- Attention
- Layer Norm
- Positional Encoding
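Of the components above, Mixture of Experts is the least standard; here is a minimal top-2 token-routing sketch in PyTorch. It uses dense dispatch for clarity only; production systems such as Mixtral use sparse, capacity-limited routing.

```python
# Minimal top-2 Mixture-of-Experts feed-forward layer (dense dispatch, for clarity only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.gate(x)                  # (batch, seq, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1) # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[..., slot]          # which expert each token uses in this slot
            w = weights[..., slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)
                out = out + mask * w * expert(x)
        return out

moe = TopKMoE()
print(moe(torch.randn(2, 10, 64)).shape)       # torch.Size([2, 10, 64])
```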
- Attention Is All You Need
- Hendrycks and Gimpel. 2016. Gaussian Error Linear Units.
- Ramachandran et al. 2017. Searching for Activation Functions.
- Shazeer. 2020. GLU Variants Improve Transformer.
- Ainslie et al. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need.
- DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.
- About pre-training
- Why we need pre-training
- Does pre-training indeed help?
- Pre-trained Language models
- Large Language Models
- Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C., Wang, Y., Ye, W., Zhang, Y., Chang, Y., Yu, P.S., Yang, Q., & Xie, X. (2023). A Survey on Evaluation of Large Language Models. ArXiv, abs/2307.03109
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485-5551.
- A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., ... & Fiedel, N. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1-113.
- Chen, Mark, et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, arXiv, 14 July 2021. arXiv.org, https://doi.org/10.48550/arXiv.2107.03374.
- Touvron, Hugo, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288, arXiv, 19 July 2023. arXiv.org, https://doi.org/10.48550/arXiv.2307.09288.
- Jiang, Albert Q., et al. Mixtral of Experts. arXiv:2401.04088, arXiv, 8 Jan. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2401.04088.
- Using a pretrained language model for classification: https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/sequence_classification.ipynb
- LLM prompting: https://huggingface.co/docs/transformers/main/en/tasks/prompting
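The two resources linked above boil down to a few lines with the Hugging Face transformers pipeline API; a minimal sketch (the model names are illustrative defaults, and downloading weights requires internet access).

```python
# Minimal Hugging Face pipeline usage: classification with a pretrained LM, then prompting.
from transformers import pipeline

# 1) Sequence classification with a pretrained (already fine-tuned) model.
classifier = pipeline("text-classification")          # downloads a default sentiment model
print(classifier("This course on large language models is excellent."))

# 2) Prompting a causal LM for generation (gpt2 is a small illustrative choice).
generator = pipeline("text-generation", model="gpt2")
prompt = "Q: What is a language model?\nA:"
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])
```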
- LLM full finetuning
- In-context learning
- Parameter-efficient finetuning
- Instruction finetuning
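As a concrete view of the parameter-efficient finetuning topic above, a minimal hand-rolled LoRA-style adapter: the pretrained weight is frozen and only a low-rank update is trained. This is a sketch of the idea, not the peft library's implementation, and the rank/scaling values are illustrative.

```python
# Minimal LoRA-style adapter: freeze the pretrained weight, train a low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = frozen base output + scaled low-rank update (x A^T B^T)
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")                 # only the low-rank factors
```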
Assignment 1 is out here. Deadline: 13 Oct 2025.
- Instruction tuning
- Multitask Prompted Training Enables Zero-shot Task Generalization (T0)
- LIMA: Less Is More for Alignment
- InstructGPT
- Reinforcement learning from human feedback (RLHF)
- Direct preference optimization (DPO)
- Frontiers, pitfalls, and open problems of RLHF
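For the DPO item above, the core loss fits in a few lines; a minimal sketch assuming the per-sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model have already been computed.

```python
# Minimal DPO loss: push the policy to prefer chosen over rejected responses,
# measured relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): minimized when chosen outscores rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs (the log-probabilities are illustrative numbers).
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1, -7.3]),
                torch.tensor([-14.2, -9.0, -25.4, -8.8]),
                torch.tensor([-13.0, -10.0, -21.0, -7.9]),
                torch.tensor([-13.5, -9.2, -24.0, -8.5]))
print(loss)
```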
- Chain-of-Thought Prompting
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
- Tree of Thoughts Prompting
- Program of Thoughts Prompting
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- Measuring and Narrowing the Compositionality Gap in Language Models
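To connect the prompting papers above, a small sketch of few-shot chain-of-thought prompting combined with self-consistency (majority voting over sampled answers); `generate` is a hypothetical stand-in for any LLM sampling call.

```python
# Chain-of-thought prompt + self-consistency by majority vote over sampled answers.
# `generate` is a hypothetical stand-in for an LLM API call that returns sampled text.
from collections import Counter

FEW_SHOT_COT = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?\n"
    "A: He buys 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def self_consistent_answer(question, generate, n_samples=5, temperature=0.7):
    prompt = FEW_SHOT_COT + f"Q: {question}\nA:"
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=temperature)   # sampled reasoning chain
        if "The answer is" in completion:
            answers.append(completion.split("The answer is")[-1].strip(" ."))
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]                 # majority vote
```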
Instructions for the final project report are here
- Limitations of parametric LLMs
- What are retrieval-augmented LMs?
- Benefits of retrieval-augmented LMs
- Past: Architecture and training of retrieval-augmented LMs for downstream tasks
- Present: Retrieval-augmented generation with LLMs
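The retrieval-augmented generation topics above follow a retrieve-then-read pattern; a minimal sketch using TF-IDF retrieval from scikit-learn, with a hypothetical `generate` call standing in for the reader LLM and a toy document collection.

```python
# Minimal retrieval-augmented generation: TF-IDF retrieval, then prompt an LLM with the hits.
# `generate` is a hypothetical stand-in for any LLM call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The course has two assignments worth 25% each.",
    "The final project counts for 50%, split into presentation and report.",
    "Office hours are held weekly by the teaching assistants.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def rag_answer(query, generate):
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only the context.\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```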
Assignment 2 is out here. Deadline: 17 Nov 2025.
- General concepts of efficient inference methods for LLM serving
- Speculative decoding systems
- Model-based efficiency
- Paged attention
- Flash attention
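For the model-based efficiency items above, recent PyTorch releases expose fused, FlashAttention-style kernels behind a single call; a minimal sketch, assuming PyTorch 2.x (on unsupported hardware it silently falls back to the standard math implementation).

```python
# Fused attention in PyTorch 2.x: scaled_dot_product_attention can dispatch to
# FlashAttention-style kernels on supported hardware (memory-efficient attention).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal LM-style attention
print(out.shape)   # torch.Size([2, 8, 1024, 64])
```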
(Moved to Week 13)