Transformers & LLM Basics
LARGE LANGUAGE MODELS (LLMs)
▪ A large language model is a type of machine learning model that is trained on a large corpus of text
data to generate outputs for various natural language processing (NLP) tasks, such as text generation,
question answering, and machine translation.
▪ Large language models are typically based on deep learning neural networks such as the Transformer
architecture and are trained on massive amounts of text data, often involving billions of words. Larger
models, such as Google's BERT, are trained on large datasets drawn from many different sources, which
allows them to generate output for a wide range of tasks.
[Diagram: Text Input → Language Model → Text Output, or a numeric representation of the text useful for other systems]
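A minimal usage sketch (not from the original deck) of the text-in / text-out flow pictured above, assuming the Hugging Face transformers library and the public gpt2 checkpoint as illustrative choices:

```python
# Hedged sketch: feed text in, get generated text out.
# "gpt2" is an illustrative public checkpoint, not one prescribed by these slides.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=30)
print(result[0]["generated_text"])
```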
USE CASES OF LLMs
Large language models can be applied to a variety of use cases and industries,
including healthcare, retail, tech, and more. The following use cases apply across
industries (two of them are sketched in code after this list):
Text generation
Sentiment analysis
Chatbots
Textual Entailment Recognition
Question Answering
Code generation
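As a hedged illustration (not part of the original slides), sentiment analysis and question answering can be exercised with off-the-shelf pipelines; the default checkpoints the library downloads are an assumption here, not something the deck specifies:

```python
# Illustrative only: two of the use cases above via ready-made pipelines.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("The new release is fantastic!"))  # e.g. [{'label': 'POSITIVE', 'score': ...}]

qa = pipeline("question-answering")
print(qa(question="Who published 'Attention is All You Need'?",
         context="The Transformer architecture was introduced by Google researchers in 2017."))
```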
Transformer Components
TRANSFORMER MODELS
Largely replaced RNN models following the publication of "Attention Is All You Need" by Google in 2017
CATEGORIES OF TRANSFORMER MODELS
Encoders: for understanding language
Suited for tasks requiring an understanding of the full sentence, such as sentence
classification, named entity recognition, and extractive question answering.
Models: BERT, ALBERT, DistilBERT

Decoders: for generative models
Suited for tasks involving text generation.
Models: GPT-2, GPT-3

Encoder-Decoders: sequence to sequence
Suited for tasks around generating new sentences depending on a given input, such as
summarization, translation, or generative question answering.
Models: T5, Multilingual mT5

[Diagram: original Transformer architecture, with Inputs feeding the Encoder, Outputs (shifted right) feeding the Decoder, and the Decoder producing Output probabilities]
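A brief sketch, assuming the Hugging Face transformers auto-classes and common public checkpoints, of loading one representative model from each category:

```python
# Illustrative checkpoints only; the slides do not prescribe these exact names.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder = AutoModel.from_pretrained("bert-base-uncased")      # encoder-only (BERT)
decoder = AutoModelForCausalLM.from_pretrained("gpt2")        # decoder-only (GPT-2)
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encoder-decoder (T5)
```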
BERT
▪ BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding (from Google in 2018)
▪ Encoder-only architecture that performs two main pre-training tasks:
▪ Predicts several blanks in input given entire
context around the blank
▪ When given sentences A and B, it determines if
B actually follows A
▪ Used for question answering, classification, etc.
▪ Takes a long time to train since each iteration only gets
signal from a handful of tokens in each sequence
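A minimal sketch of the first objective (predicting masked blanks), assuming the transformers fill-mask pipeline and the public bert-base-uncased checkpoint:

```python
# Illustrative only: BERT fills in a masked token using the surrounding context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```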
GPT
Generative Pre-Training
▪ Originally published by OpenAI in 2018, followed by GPT-2 in 2019, and GPT-3 in 2020.
▪ Architecture is also a single stack like BERT, but is a traditional left-to-right language model
▪ Can be used for generating larger blocks of text (e.g. chatbots), but can also be used for question answering
▪ Has been the model family we have focused on most with Megatron
▪ Faster to train than BERT since each iteration gets signal from every token in the sequence
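A small sketch of the left-to-right objective, assuming PyTorch, the transformers library, and the public gpt2 checkpoint: the model scores the next token given everything to its left.

```python
# Illustrative only: pick the most likely next token after a prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (batch, sequence_length, vocab_size)
next_token_id = logits[0, -1].argmax().item()  # highest-probability next token
print(tokenizer.decode(next_token_id))         # e.g. " Paris"
```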
WHEN DO LARGE LANGUAGE MODELS MAKE SENSE?
                           | Traditional NLP approach               | Large language models
Requires labelled data     | Yes                                    | No
Parameters                 | 100s of millions                       | Billions to trillions
Desired model capability   | Specific (one model per task)          | General (model can do many tasks)
Training frequency         | Retrain frequently with task-specific  | Never retrain, or retrain minimally
                           | training data                          |

Why large language models make sense:
▪ Zero-shot (or few-shot) learning
▪ When it is painful and impractical to get a large corpus of labelled data
▪ Models can learn new tasks
▪ If you want models with "common sense" that can generalize well to new tasks
▪ A single model can serve all use cases
▪ At scale you avoid the costs and complexity of many models, saving cost in data curation, training, and managing deployment
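A hedged sketch of zero-shot behaviour, assuming the transformers zero-shot-classification pipeline (its default checkpoint is chosen by the library, not by these slides): the model classifies text against labels it was never explicitly trained on.

```python
# Illustrative only: no task-specific labelled data or fine-tuning required.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "The patient reported chest pain and shortness of breath.",
    candidate_labels=["healthcare", "retail", "technology"],
)
print(result["labels"][0], result["scores"][0])
```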
DISTRIBUTED TRAINING
Data, Pipeline and Tensor Parallelism
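A conceptual sketch in plain PyTorch (not Megatron code) of the idea behind tensor parallelism: a linear layer's weight matrix is split column-wise across workers, each computes its shard of the output, and the shards are concatenated. Data parallelism instead replicates the whole model and splits the batch, while pipeline parallelism splits the model by layer across devices.

```python
# Conceptual sketch only: two "workers" simulated on one device.
import torch

batch, d_in, d_out = 4, 8, 6
x = torch.randn(batch, d_in)
W = torch.randn(d_in, d_out)

W_shard0, W_shard1 = W.chunk(2, dim=1)      # each worker holds half the columns
y_shard0 = x @ W_shard0                     # computed on "GPU 0"
y_shard1 = x @ W_shard1                     # computed on "GPU 1"
y_parallel = torch.cat([y_shard0, y_shard1], dim=1)

assert torch.allclose(y_parallel, x @ W)    # matches the single-device result
```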
CHALLENGES
▪ Compute-, cost-, and time-intensive workload: significant capital investment and large-scale compute infrastructure are necessary to develop and maintain LLMs.
▪ Scale of data required: as mentioned, training a large model requires a significant amount of data, and many companies struggle to get access to large enough datasets.
▪ Technical expertise: due to their scale, training and deploying large language models is very difficult.
THANK YOU!