NLP: Sub-domain of AI concerned with the task of developing programs that possess some
capability of 'understanding' natural language to achieve a specific goal
- A transformation from one representation (input text) to another (internal representation)
- No relation to Neuro-Linguistic Programming.
NLP applications:
NLP Pipeline:
- Tokenization
- Stop word removal
- Normalization: convert Arabic text into a standard form to improve the accuracy of NLP tasks
- Stemming: process of reducing a word to its stem, the basic form of the word. Used in NLP tasks
such as information retrieval & text classification. It is a challenging task due to the complex
morphology; Arabic words can have multiple stems.
- Lemmatization: process of grouping together the different forms of a word so they can be
analyzed as a single item. It is a more sophisticated process than stemming.
Accuracy: Lemma more accurate than stem → it considers grammatical context of a word.
Speed: Stem faster than lemma → it doesn't consider grammatical context of a word
Uses of POS - Part-of-speech
- It gives you information about its neighbors. Ex: possessive pronouns ("my", "her", "its") are
likely to be followed by nouns; personal pronouns ("I", "you", "he") are likely to be followed by verbs
- Tell us about how a word is pronounced ("content" as a noun or an adjective)
- Useful for App such as parsing, named entity recognition, and coreference resolution
- Corpora that have been marked with parts of speech are useful for linguistic research
- POS tagging is the automatic assignment of POS to words
- POS tagging is often an important first step before several other NLP apps can be applied
- hidden Markov models (HMMs) for POS tagging, and deep learning approaches for POS
tagging, which tend to perform a bit better (we'll learn about such methods later in the
course)
Named entity recognition (NER): NLP task that seeks to locate and classify named entities
mentioned in unstructured text into pre-defined categories such as person names,
organizations, locations, medical codes, time expressions, quantities, monetary values,
percentages
The syntactic structure helps in various NLP tasks by providing a clear framework for
understanding sentence composition. Here's how it aids specific tasks:
- Parsing:
Benefit: Helps in breaking down sentences into their grammatical components.
Ex: Identifying noun phrases & verb phrases allows parsers to understand sentence structure.
- Machine Translation:
Benefit: Ensures that grammatical relationships are preserved across languages.
Ex: Knowing the subject-verb-object order helps in translating sentences accurately.
- Information Extraction:
Benefit: Facilitates the identification of key entities and their relationships.
Ex: Extracting "bear" as the subject and "squirrel" as the object in the sentence.
- Sentiment Analysis:
Benefit: Helps in associating adjectives with the correct nouns.
Ex: Understanding that "angry" describes "bear" aids in sentiment classification.
- Question Answering:
Benefit: Enhances the system's ability to understand and generate responses.
Ex: Identifying the main action ("chased") and participants ("bear" and "squirrel") helps in
formulating answers.
By providing a detailed map of sentence structure, syntactic analysis enables more accurate
and nuanced language processing
Wordforms: Some tokenizers use whitespace to separate tokens, strip punctuation, or count certain
punctuation (periods, ?); it can get complicated (e.g., splitting the possessive 's); converting letters to lowercase gives case insensitivity
Sentence Segmentation(not trivial): tokenization aids sentence segmentation
- Complication is periods are used for acronyms ( "U.S.A." | "Mr." | "$50.25" )
The process of tokenization, optionally followed by lemmatization or stemming and/or
sentence segmentation, is often referred to as text normalization
The Chomsky Hierarchy: 4 types of formal grammar from the most powerful / least
restrictive (type 0) to the least powerful / most restrictive (type 3)
Unrestricted (type 0) | Context-sensitive | Context free | Regular (type 3)
- Often useful (simpler + efficient) to use the most restrictive type that suits your purpose
- Regular grammar is generally powerful enough for tokenization
- Regular grammars can be defined using rewrite rules or finite state automata
RE: Grammatical formalism useful for defining regular grammar | Case sensitive
Search requires a pattern that we want to search for and a corpus of texts to search through
Examples of RE ( 9-15)
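A minimal Python sketch of using a regular expression for tokenization-style matching (the pattern and sample string are my own illustration, not from the slides):

import re

text = "Mr. Smith paid $50.25 in the U.S.A. on 2024-01-15."
# Illustrative token pattern: acronyms, abbreviations, money, dates, words, punctuation
pattern = r"(?:[A-Z]\.)+|[A-Z][a-z]+\.|\$\d+(?:\.\d+)?|\d{4}-\d{2}-\d{2}|\w+|[^\w\s]"
tokens = re.findall(pattern, text)
print(tokens)  # ['Mr.', 'Smith', 'paid', '$50.25', 'in', 'the', 'U.S.A.', 'on', '2024-01-15', '.']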
Morphology: Words are built up from smaller meaning-bearing units called morphemes
- WordNet is an example of such a resource that was very popular in conventional NLP
- We can avoid lemmatization by applying stemming (simpler, but doesn't always work)
- Morphological rules deal with exceptions. "fish" is its own plural, "goose" → "geese"
- Parsing uses both types of rules to break down a word into its component morphemes
- The smallest part of the word that has a semantic meaning
- Ex: wordform "going", parsed form: "VERB go + GERUND-ing"
- Played an important role for POS tagging | web search
2 broad classes of Morphemes:
Stems: Main (Central, Important, Significant) morpheme of the word
Affixes: Add additional meanings of various kinds | Word can have more than one affix
- Ex: (has a stem ("believe"), a prefix ("un-"), and two suffixes ("-able" and "-ly"))
Divided to:
- Prefix: attaches before its base, like inter- in international.
- Suffix: follows its base, like -s in cats.
- Circumfix: attaches around its base (attaching both at the beginning and the end).
- Infix: attaches inside its base (e.g., run → ran, buy → bought)
N-grams and N-gram Models: Sequence of N tokens (words, characters, or embeddings)
- Computes prob of each possible final token of an N-gram given the previous N-1 tokens
- Used to estimate the probability of a word given the N-1 previous words
- Important foundational tool for understanding the fundamental concepts of language modeling
App: Speech recognition | Translation | Spelling correction | Grammatical error correction |
Augmentative communication | POS tagging | Natural language generation | Predictive text
Obtaining counts of things in a natural language can rely on a corpus (the plural is Corpora)
Words Revisited:
Decide what counts as a word before applying tokenization & other steps of text normalization
When counting N-grams & computing N-gram probabilities based on a corpus, we need to
use the same text normalization techniques that will be used for applications
Bigger N → More data you need to achieve reasonable N-gram estimates
Estimating N-gram Probabilities:
Chain Rule: Easily be applied to words → Compute probability of a seq of words by
multiplying together multiple conditional probabilities (and one prior probability)
- It doesn't help us if we can't obtain accurate estimates of the conditional probabilities
- Many word sequences never occurred, but they shouldn't be estimated to have 0 probability
Applying N-gram Models:
- A bigram model estimates the probability of a word given all previous words to be the probability
of the word given only the preceding word. This is an example of a Markov assumption
- When k=1, it is common to use the unigram probability for the first word
Machine Learning Terminology:
- Learning N-gram probabilities from a corpus is an example of (ML)
- There should be no overlap between the training set and test set
- Tuning | Validation (held-out) set used to set the hyperparameters or cross-validation
Maximum Likelihood Estimates (MLE) :
When computing probabilities for N-grams, the training set should reflect the nature of the data that
you are likely to see later (this is true for parameter estimation in ML generally); use training data
that mixes different genres (e.g., newspaper, fiction, telephone conversation, web pages)
- Start-of-sentence tokens ("<s>") | End-of-sentence tokens ("</s>")
- Symbols will ensure that the probability estimates for all possible words ending an N-
gram, starting with some sequence, add up to one
- Avoids the need to use a unigram probability at the start of the sequence
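A minimal sketch of estimating bigram MLE probabilities from a toy corpus padded with start/end tokens (the corpus is my own illustration):

from collections import Counter

corpus = [
    "<s> i like green eggs </s>",
    "<s> i like ham </s>",
    "<s> sam likes ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "like"))    # 2/2 = 1.0
print(bigram_prob("like", "ham"))  # 1/2 = 0.5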
Natural Language Generation: N-grams can be used for NLG
Start by randomly generating a unigram → then generate one or more bigrams → then one or more trigrams
If you use trigrams → repeatedly generating a new word based on the previous two words
When the end-of-sentence marker is generated, the process ends
Unknown Words: - Common method:
- Start with a fixed vocabulary list and treat every other word in the training set as a
pseudo-word, <UNK>
- Replace words that have a frequency below some specified cutoff in the training set with
<UNK>
- We can choose a vocabulary size, V, and use the top V words
Language Models: Assigns probability to a sequence of NL text – Ex: (N-gram)
Evaluating Language Model:
- Extrinsic: Best way is to embed it in another application (machine translation) and
evaluate its performance - it involves a lot of time and effort
- Intrinsic: Metric to evaluate the quality of a model (language model) independent of App
| Based on a test set | Calc probability assigned by the model to the test set
Improvements on an intrinsic metric often correlate with improvements on an extrinsic measure
Log Probabilities: Log of a product equals the sum of the logs | Using log probabilities avoids underflow to 0
- Makes computation quicker ( + instead of *)
- Precomputed probability estimates for N-grams can be stored as log probabilities directly
→ No need to compute the logarithms when evaluating
Perplexity (PP): A commonly reported metric
- Closely related to the information theoretic notion of cross-entropy
- It is a measure of how good a probability distribution predicts a sample.
- Can be understood as a measure of uncertainty. Perplexity can be calculated as 2 raised to the
cross-entropy.
- Perplexity is the inverse probability of a sequence of words normalized by the number of
words
- The higher the probability of the word sequence, the lower the perplexity
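A small sketch of computing perplexity from log probabilities (the per-word probabilities are made up), using perplexity = 2 raised to the cross-entropy:

import math

probs = [0.2, 0.1, 0.25, 0.05]  # hypothetical model probabilities for a 4-word test sequence
N = len(probs)

log_prob = sum(math.log2(p) for p in probs)  # sum of logs instead of product of probabilities
cross_entropy = -log_prob / N                # average negative log probability per word
perplexity = 2 ** cross_entropy
print(cross_entropy, perplexity)             # lower perplexity = higher probability assigned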
Smoothing:
MLE produces poor estimates when counts are small due to sparse data, especially when
counts are zero → if any N-gram occurs that never occurred in the training set, it would be
assigned 0 probability
Smoothing techniques are modifications that address poor estimates due to sparse data
Laplace smoothing (add-one smoothing) is a simple method that adds a constant (1) to
all counts
We can view smoothing as discounting (lowering) some non-zero counts in order to
obtain the probability mass that will be assigned to the zero counts
The changes due to add-one smoothing are often too big in practice
Example:
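A minimal sketch of add-one smoothing for bigram estimates; the toy counts and vocabulary size below are my own assumptions, not the slide's example:

# Toy bigram counts and context counts
counts = {("i", "like"): 2, ("i", "ham"): 0}
context_count = {"i": 2}
V = 1000  # hypothetical vocabulary size

def laplace_bigram_prob(prev, word):
    # Add-one smoothing: (count + 1) / (context count + V)
    return (counts.get((prev, word), 0) + 1) / (context_count[prev] + V)

print(laplace_bigram_prob("i", "like"))  # (2 + 1) / (2 + 1000)
print(laplace_bigram_prob("i", "ham"))   # (0 + 1) / (2 + 1000), no longer zero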
Information retrieval (IR): Task of returning, or retrieving, relevant documents in response to a
particular natural language query (based on context)
- Collection: set of documents being used to satisfy user requests
- Term: lexical item (word or token) that occurs in the collection
- Query: user's information need expressed as a set of terms
Conventional IR systems assume that the meaning of a doc resides solely in the set of terms
(unigrams) → ignore syntactic (arguably, semantic) info, use bag-of-words approaches
NLP methods involving word embeddings also make a bag-of-words assumption (if the
ordering of embeddings is not considered)
Vector space model of IR, documents and queries are represented as vectors of features
- Each feature corresponds to one term (word) that occurs in the collection. The value of each
feature is the term's weight, based on the term's frequency, possibly combined with other factors
- The number of dimensions of each vector is equal to the size of the vocabulary
- Represent a document, dj as vector: dj =(w1,j,w2,j,w3,j,…,wN,j)
- N, the number of dimensions of the vector, is equal to the number of distinct terms that
occur in the collection, and wi,j is the weight that term i is assigned in document j
Word Embedding: A matrix of words with values for each word that identify its meaning and its
distance from other words. Establishing mathematical relations between them lets us ask and answer
questions about each term. To create word embeddings, the model must be trained to reliably
estimate each word's meaning using millions of words.
Text Vectors:
Desired words can be calculated from a number of other words using the embedding values of each word.
Measuring Similarity Between Vectors
Cosine similarity metric is often used in IR to measure the similarity between two vectors
If 2 words are far apart → the angle will be near 90°, the cosine near 0, indicating that they are different
If 2 words are identical → the angle will be 0°, the cosine 1, indicating that they are similar
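A small sketch of the cosine similarity computation on toy term-weight vectors (my own example):

import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [1.0, 0.0, 2.0]
doc1 = [2.0, 0.0, 4.0]   # same direction as the query -> cosine 1.0
doc2 = [0.0, 3.0, 0.0]   # orthogonal to the query     -> cosine 0.0
print(cosine_similarity(query, doc1), cosine_similarity(query, doc2))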
Text Feature Extraction
1- Bag-of-words(BoW):
Statistical LM to analyze text & documents based on word counts.
- Doesn't account for word order within a document or for semantic meaning
2- TF-IDF ( Term Frequency (TF) | Inverse Document Frequency (IDF) )
- Used to weight the words in sentences and compensate for the shortcomings of BoW; good for
text classification and for helping a machine read words as numbers.
- Give way to associate each word in Doc with a number (represents how relevant each
word is in Doc. Such numbers can be then used as features of machine learning models
- Score of a term in a document measures its importance in the document, which can be
used for various App ( information retrieval, text classification, clustering)
• Use log to dampen the largest values and reduce growth: if a term appears twenty times in a
document, it is better, but not twenty times better.
• N = total number of documents; df = number of documents that mention the term.
• If a word is rare and appears in few documents, the log of a large ratio (for example, a million)
gives a large value, e.g. 6 (more important)
• If it appears in all documents, the ratio is 1 and log 1 = 0 (less important)
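A minimal sketch of one common TF-IDF variant (raw term frequency times log10(N/df)); the toy collection is my own illustration:

import math

docs = [
    ["arabic", "nlp", "tasks"],
    ["nlp", "models"],
    ["arabic", "morphology"],
    ["nlp", "nlp", "applications"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                     # term frequency in the document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log10(N / df) if df else 0.0  # rare terms get a larger IDF
    return tf * idf

print(tf_idf("nlp", docs[3]))         # appears in 3 of 4 docs -> small IDF
print(tf_idf("morphology", docs[2]))  # appears in 1 of 4 docs -> larger IDF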
3- One Hot Encoding: impractical for large datasets, and the resulting numbers carry no meaning
Before Text Feature Extraction we need to:
- Decide whether an IR system should apply stemming (common), apply lemmatization
(not common), convert all letters to the same case.
- Be consistent with the text normalization techniques applied to the collection
- Removing stop words doesn't change results much, since these words tend to have very low IDF
values; it could make the system more efficient, but makes it more difficult to search for phrases
- Another technique used by some conventional IR systems is to add synonyms of query terms to the
queries to help locate documents that are relevant but don't use overlapping terms
- Vector space Not the only way in conventional IR; another approach (more common for
earlier IR systems) was to use a Bayesian model
Ranking Web Pages: Some websites put invisible popular search terms in their pages to fool
search engines → Popular web search engines today treat the web as a directed graph
intuitively, good pages will tend to be more popular
- Hyperlinked-Induced Topic Search (HITS) algorithm involves authorities (pages that
have a lot of incoming links from other good pages) and hubs (pages that point to a lot of
authorities)
Text Categorization: Automatic labeling of documents based on text contained in, or
associated with, the documents into one or more predefined categories, a.k.a. classes
- Some TC tasks assume that the categories are independent ( binary or Boolean )
- Each doc can be assigned to 0 | 1 | multiple categories
- News article can be classified into multiple topics like "Politics," "Sports,"
- Other TC tasks assume mutually exclusive and exhaustive categories; each document is
then assigned to exactly one category (the best fit)
- Hierarchical taxonomy: structured classification system. categories organized in tree
(parent child relationships) → allow for intuitive navigation and retrieval of information
- Applications:
Conventional TC Approaches:
- Many conventional TC approaches rely on BOW approach to represent documents and
some use a vector space model
- Use single words (unigrams) as terms; larger N grams typically did not improve results
- 3 conventional: Rocchio/TF*IDF, k-nearest neighbors, and naïve Bayes
Text Normalization for TC:
- Decide what constitutes a term (tokenization & other text normalization strategies)
- Some specific issues that TC shares with IR include: Whether or not to make the system
case sensitive | use stemming or lemmatization | stop list | apply POS tagging
Machine Learning for TC:
- Almost all TC approaches use supervised learning; the training data is in the form of labeled documents
- If the categories are binary, you need positive and negative examples for each category
- If categories are mutually exclusive and exhaustive, you need examples of each category
- One advantage of machine learning (ML) for TC is that it is general
- To move to a new set of categories, we only need a new training set
- It's possible to create a train set without labeling any documents
- Creating a proper training set is often the most expensive part of creating a TC system
K-Nearest Neighbors: example of instance-based learning (a.k.a. example-based learning,
memory-based learning, or lazy learning)
- Training simply consists of recording which training instances belong to each category
when new document arrives, it is compared to those in training set
- Closeness of one document to another can be measured according to Euclidean distance
- k can be set manually or chosen based on cross-validation experiments
Choosing a Category with KNN
- If the categories are mutually exclusive & exhaustive, the system can choose the most common
category among the k closest documents
- If the categories are binary, the simplest approach is to choose all categories that are
assigned to over half of the k closest documents (if k is 5, a category needs at least 3 votes)
KNN Applied to TC
- For better results, we can use similarity (ex: cosine similarity) instead of distance
- Training for KNN is trivial and fast, but that is not too important
- To classify a doc, it must be compared with every document in the training set
- Efficiency is a major problem with KNN; it is generally considered to be
computationally intensive → find approximate nearest neighbors more efficiently
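A minimal sketch of KNN text categorization using cosine similarity instead of distance; the toy document vectors, labels, and k are my own assumptions:

import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical training documents as (vector, category) pairs
training = [
    ([1.0, 0.0, 2.0], "sports"),
    ([0.9, 0.1, 1.8], "sports"),
    ([0.0, 3.0, 0.2], "politics"),
    ([0.1, 2.5, 0.0], "politics"),
    ([1.2, 0.2, 1.5], "sports"),
]

def knn_classify(doc_vector, k=3):
    # Rank training documents by similarity (higher = closer) and vote
    neighbors = sorted(training, key=lambda item: cosine(doc_vector, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # most common category among the k neighbors

print(knn_classify([1.1, 0.1, 1.9]))  # expected: "sports"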
Naïve Bayes: A probabilistic approach to categorization based on Bayes' theorem
- "naïve" assumption that the probability distribution for each feature in a document given
its category is not affected by the other features ( independent)
Evaluating TC Systems: generally easier than for IR, because there is a labeled test set
- common to measure precision, recall, and a metric ( F1) (F-measure)
- confusion matrix (contingency table): where rows represent a system's predictions, and
the columns represent the actual categorizations of documents
TC Evaluation Metrics – per category
- Overall accuracy = (A + D) / (A + B + C + D);
- Precision = A / (A + B); - pred col (yes)
- Recall = A / (A + C); - actual col (yes)
- F1= (2 * Precision * Recall) / (Precision + Recall);
When there are multiple binary categories – per system
- micro-averaging: for the first 3 metrics, you combine all the confusion matrices (for all
categories) to obtain global counts for A, B, C, and D
- macro-averaging: for the first 3 metrics, you average together the values of overall
accuracy, precision, or recall for the individual categories
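A small sketch of micro- vs macro-averaged precision and recall over hypothetical per-category confusion-matrix counts (A = true positives, B = false positives, C = false negatives, D = true negatives):

categories = {
    "sports":   {"A": 40, "B": 10, "C": 5,  "D": 145},
    "politics": {"A": 5,  "B": 5,  "C": 15, "D": 175},
}

def precision(m): return m["A"] / (m["A"] + m["B"])
def recall(m):    return m["A"] / (m["A"] + m["C"])

# Macro-averaging: average the per-category metric values
macro_p = sum(precision(m) for m in categories.values()) / len(categories)
macro_r = sum(recall(m) for m in categories.values()) / len(categories)

# Micro-averaging: combine the counts first, then compute the metric once
totals = {k: sum(m[k] for m in categories.values()) for k in ("A", "B", "C", "D")}
micro_p, micro_r = precision(totals), recall(totals)

print(macro_p, macro_r)  # macro treats all categories equally
print(micro_p, micro_r)  # micro is dominated by the larger categories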
Properties of Categories
Feedforward Neural Networks, Word Embedding, Neural Language Models, and Word2vec
For text categorization we covered naïve Bayes and KNN approaches | For POS tagging, hidden Markov models
Artificial Neural Networks (ANNs): fundamental computational tool for language processing
- Composed of units. Each unit accepts inputs and computes a weighted sum of the inputs
- A bias weight is also added (or subtracted), or a threshold is used instead
- An activation function is applied to the weighted sum (including the bias weight); the result of the
activation function is the output, or activation, of the unit
Activation Functions: Necessary for NN to represent functions that are not linearly separable
➔ Threshold (not common in deep NNs) | Sigmoid (logistic function / regression)
➔ Tanh | Rectified linear units (ReLUs) - in deep neural networks
Monotonic Function: a function that is either entirely non-increasing or non-decreasing. As the input
increases, the output either always increases (or stays the same) or always decreases (or stays the same), never both
Derivative of the Function: Rate of function change. If derivative not monotonic, it means
that rate does not consistently increase or decrease
Activation functions with monotonic derivatives:
more predictable | stable training dynamics in NN | reduce oscillations in weight updates |
simplify math | fewer local minima | Better generalization.
S(15.16)
Perceptrons (single-layer perceptrons): NNs which don't contain hidden layers
- Contain multiple nodes that share the same input, but each node functions independently
- Individual nodes can represent linearly separable functions, e.g. AND, OR, NOT, but NOT XOR
- Positive and negative examples of linearly separable functions can be separated by hyperplanes
- A perceptron with a threshold activation function will fire (output 1) if w∙x > 0
Feedforward NN (multi-layer perceptrons): can represent functions that are NOT linearly separable
- The input layer is not counted when counting layers and has no activation function
- one or more hidden layers, consisting of hidden units
When the nodes in a layer connect with every node in the previous layer → fully connected
Computing Values for the Hidden Layer
- If it uses sigmoid, the hidden layer outputs h = σ(Wx + b), with x (input vector), W (weight matrix), b (bias vector)
- Computational adv of represent layers as vectors and weights between layers as matrices
Computing Values for the Output Layer
- include bias weights connected to fixed inputs;
- Represent output as z = Uh, where h = σ(Wx+b) and U is weight matrix
- Softmax: converts a vector of real numbers into a probability distribution of values between 0 and 1
- If a feedforward NN has more than 1 hidden layer, it is generally considered a deep NN
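A minimal numpy sketch of the forward pass described above, h = σ(Wx + b) followed by z = Uh and a softmax output (sizes and weights are arbitrary placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # input-to-hidden weights (3 hidden units, 4 inputs)
b = rng.normal(size=3)       # hidden bias vector
U = rng.normal(size=(2, 3))  # hidden-to-output weights (2 outputs)

x = np.array([1.0, 0.5, -0.2, 0.3])  # input vector
h = sigmoid(W @ x + b)               # hidden layer: h = sigma(Wx + b)
z = U @ h                            # output layer pre-activation: z = Uh
y = softmax(z)                       # probability distribution over outputs
print(y, y.sum())                    # the outputs sum to 1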
Creating a Neural Network
- Choosing an architecture (number, types, and sizes of layers)
- Setting hyperparameters
- Obtaining a training set (including examples with known inputs and outputs)
- Training NN to learn the other parameters of the network (the weights, including the bias)
- The architecture and hyperparameters are set manually before training; optionally, they can
be tuned based on a tuning set (validation set); then evaluate using a test set
Loss Functions: measure how far the predicted output of the NN is from the true (expected) output, e.g. cross-entropy
Gradient Descent: could be used to train the weights of NN
First, compute the partial derivative of the loss function with respect to each weight, then push each
weight in the direction that reduces the loss
Stochastic Gradient Descent (SGD) is used instead
- 1 mini batch is used to estimate the gradient. mini-batch size is a hyperparameter of NN
- Loop through the entire training set multiple times (1 mini-batch at a time); each pass is an epoch
- Modern implementations use vectors and matrix calculations to implement the procedure,
making SGD more efficient due to parallelization
Adjusting Weights
- It is relatively simple to update the weights going into the output nodes; updating the other weights involves backpropagation
- The gradients for all weights are calculated → multiplied by a learning rate that controls
the size of the adjustments - modern NNs often use adaptive learning rates
Regularization to minimize overfitting: penalize large weights | drop out random units/weights during training
NN can be described using computation graphs - mathematical expressions
- Representing forward | backward pass (compute partial derivatives of the output)
Example of representing a feedforward NN as a computation graph: 2 input nodes, one hidden layer
with 2 ReLU nodes, and one sigmoid output node
Revisiting the Term-document Matrix: a representation of words as vectors (word embeddings)
- value in row i, column j, represents weight of the ith word in the jth document of corpus
- Matrix is generally sparse, so we use an inverted index to store the information
Distributional hypothesis: words with similar semantic meaning, occur in similar contexts.
Latent semantic analysis (LSA): use of singular value decomposition (SVD) applied to the
term-document matrix
Word Embeddings
- Create d-dimensional vector, with a fixed d (50 to 500), for each word in a vocabulary
- Learned from a corpus using an unsupervised learning approach
Pre-word-embedding Neural Networks
Problem: Size of vocabulary is V → V input nodes ( lot of weights) → may have overfitting
- Based on BOW, order of the words in the input doc doesn't affect the input to the NN.
Even using bigrams would blow up the number of input nodes
- Similar words represented by entirely different node
- NNs did not achieve state-of-the-art results for most NLP tasks
With word embeddings, the number of input nodes is instead related to d, the dimension of the word embeddings
Advantages
- Diff tasks & arch → input will be one word embedding or a fixed number at a time
- CNN (sentence classification), word embeddings from one padded sentence may be input
- RNN one word embedding, and the words of a sentence are traversed in a sequence
- Transformers, word embeddings for sentence or short sequence of text
Neural Language Models: the example considers 3 sequential words at a time and predicts the next word
Trained as a feedforward network; modern neural language models use RNNs or transformers
- Projection layer (input) consists of 3*d nodes, d dimension of each word embed vector
- |V| output nodes, where V is the set of vocabulary words, and output is a softmax layer
- The output of the ith output node is interpreted as the probability that the ith vocabulary word is the next word
- In terms of dimensions, the layers are basically treated as column vectors (even if not drawn that way)
- Be careful about what is stored in rows vs. columns, and about the ordering of terms when applying
operations to vectors and matrices
- Assume a known mapping from each word in V to a word embedding vector, i.e. the
embeddings have already been learned using some other algorithm
- It is also possible to learn word embeddings for the current task | or use contextual word embeddings
Advantages of Neural Language Models
- No smoothing of probabilities (softmax layer never outputs a 0 exactly)
- Generalize based on words similar to the current words when predicting the next word
- NLM make better predictions than conventional N gram models: we can evaluate a
language model by multiplying together the predicted probabilities of actual words
according to a test set. Instead, we add log probabilities to avoid issues with finite
precision, or we use a related metric such as perplexity
Training Such a Network
- Mapping between words and embeddings → train by SGD& backpropagation
Loop through a large corpus (one pass would be one epoch of training), and for each N-gram → map
the first n−1 words to embeddings and concatenate these to form the input → treat the output
probability of the actual word as 1, all others as 0. The cross-entropy loss function becomes:
L = −log P(wt | wt−1, …, wt−n+1)
Using the NN to Learn the Embedding: adding 1 additional layer to our network
- Network can learn the word embeddings, how to predict probabilities of the next words
- Training a neural network is not the actual method used to learn embeddings
- Input consists of 3 one-hot vectors of dimension |V|, each with a single 1 (for a specific word) and 0s elsewhere
- From input to projection, a set of shared weights converts each one-hot vector to a word
embedding vector (each one-hot vector is multiplied by the same weight matrix, E, to produce a
word embedding vector)
- The difference between the 2 NNs is that E is now being learned along with the rest of the network's
weights, updated using SGD and backpropagation
- For each N-gram of a large corpus, we concatenate N-1 one-hot vectors to form the input
- Output: the probability of the actual word is treated as 1, all others as 0. The loss function is the same
Advantages of Word Embeddings in General
Dense vectors work better in NLP task than sparse vectors.
- Easier use dense vectors as features for ML systems (lead to fewer weights)
- Help avoid overfitting (this is related to fewer weights)
- Do better job at capturing synonymy. related words will have similar vectors
Word2vec - a family of models for producing word embeddings | Other examples: GloVe, BERT
- Train a classifier to predict whether a word will show up close to specific other words
- learned weights become word embeddings
- often claimed that embeddings seem to capture something about the semantics of words
Have 2 Method :
1- Skip-gram: predict CONTEXT words based on the current word
learns two embeddings for each word, w
- Target represents w when it's current word, or center, surrounded by other context words
Target matrix with |V| rows and d columns. contains all target embeddings (one per row)
- Context represents w when it appears as a context word around another target word
Matrix with d rows and |V| columns. contains all the context embeddings (one per column)
2- CBOW: predict the CURRENT word based on context words
Learning the Skip-Gram Model Matrices
In training, consider context words within some small window of size, L, of each target word
- The probability of seeing wj in the context of a target word wi, P(wj | wi), is based on the dot product of their embeddings
- Target embeddings of words are pushed close to the context embeddings of the words they appear near
(within the window)
Word2vec Skip-gram Model as a NN
- Input layer is a one-hot vector
- Hidden (row vector) contains one row of W (single target) no activation function
- Input to Output the dot product of the target embedding with every context embedding
- Training treats the probability (if softmax) of the actual context word as 1 and all other probabilities as 0
Skip-gram with Negative Sampling (SGNS)
The full NN would have to compute the dot product of the target with every context embedding for every update
→ It is more efficient to use SGNS: for each context word within the window size, we
choose k negative-sampled words (k typically ranges from 5 to 20)
Estimating Probabilities using SGNS:
Compute probability estimates using the sigmoid function: σ(x) = 1 / (1 + e^(−x))
This gives us: P(+|t, c) = 1 / (1 + e^(−t∙c)) | P(−|t, c) = 1 − P(+|t, c)
We want the probabilities of actual context words to be high and of negative-sampled words to be low
We assume independence among context words → multiply probabilities or add log probabilities
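A small numpy sketch of the SGNS probability estimates and loss for one target word, one true context word, and k = 2 negative samples (all vectors are random placeholders):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4
rng = np.random.default_rng(1)
t = rng.normal(size=d)            # target embedding
c_pos = rng.normal(size=d)        # embedding of an actual context word
c_negs = rng.normal(size=(2, d))  # embeddings of k = 2 negative-sampled words

p_pos = sigmoid(t @ c_pos)          # P(+ | t, c) = sigma(t . c)
p_negs = 1.0 - sigmoid(c_negs @ t)  # P(- | t, c) = 1 - P(+ | t, c)

# Minimize the negative log likelihood: real context words should get high P(+),
# negative samples should get high P(-)
loss = -(np.log(p_pos) + np.log(p_negs).sum())
print(p_pos, p_negs, loss)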
Training SGNS – S:64,65
Embeddings for Word Similarity: One thing that word embeddings can simply be used for
is to compute word-to-word similarity
Visualizing Word Embeddings: the d-dimensional vectors can be mapped to two dimensions
Approach: principal component analysis (PCA) | t-SNE
S: 68-70
Evaluating Word Embeddings
- Predicting nearby words | Word similarity scores
- Word analogy tasks (king − man + woman ≈ queen)
Other embedding models: GloVe, fastText, BERT, ELMo
Ex of tasks on embeddings: sentence classification, machine translation, question answering
Recurrent Neural Networks and LSTMs
RNN: network that contains a cycle within its network connections
A simple RNN has a single hidden layer, with outputs that lead back to its own inputs
Ex: ( Elman networks or vanilla RNNs )
Single Time Step: processing input Sequentially
- ht-1 and ht refer to the same layer, but at different times
- Not all RNNs include an output at every time step (the one shown does → used for sequence labeling)
Equations: what happens at each time step: ht = g(U·ht−1 + W·xt) | yt = f(V·ht)
- Initialize h0 to be a vector of 0s, and start the indexing of actual steps at 1
- Activation fun (at hidden layer) g; might be a sigmoid or tanh
- Output layer assumed to be a SoftMax layer → yt = softmax(Vht)
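A minimal numpy sketch of forward inference in a simple RNN following the equations above (random weights and inputs, tanh hidden activation, softmax output):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
input_dim, hidden_dim, output_dim = 3, 5, 4
W = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
U = rng.normal(size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
V = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output

xs = [rng.normal(size=input_dim) for _ in range(4)]  # a short input sequence
h = np.zeros(hidden_dim)                             # h0 initialized to zeros
for x_t in xs:
    h = np.tanh(U @ h + W @ x_t)  # ht = g(U ht-1 + W xt)
    y = softmax(V @ h)            # yt = softmax(V ht)
    print(y)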
Unrolling an RNN: each time step (for some fixed number of time steps) is drawn separately
- Each instance of hidden + output for each of the depicted time steps is drawn separately
Forward Inference (propagation): At each time step
- Hidden and output nodes → change
- Weights W V U (matrices) → Don't change
- Same weights are reused (shared) at each step | Change when we train an RNN
Training using SGD + backpropagation and we need a training set + loss function
We now have 3 sets of weights to update:
- W: weights between the input and hidden layer
- V: weights between the hidden and output layer
- U: from output of the hidden to the input of the hidden (at the next time step)
Backprop: Updating V
- z[i]: weighted sum of the inputs to layer i
- a[i]: activation value from a layer i, (the result of applying an activation fun, g, to z[i] )
- Update V (compute the gradient of the loss, L, with respect to V ) – Chain Rule
Error terms:
Two-pass Weight Training:
1- Perform Forward inference, computing all the h and y values at every time step
2- Process the sequence in Reverse, computing the required error terms and gradients
Accumulate gradients and update weights accordingly ( Backpropagation Through Time )
Language Model: Simple RNNs can be used as recurrent neural language models
- Previous state and Current word used to create the current hidden state
- Current hidden state (SoftMax creates a probability distribution to predict next word)
Unlike N-grams, feedforward NNs, RNNs are not limited to a fixed number of prior words
when predicting ( all the words in the sequence can affect the prediction of next word)
18-22
Autoregressive Generation
Once an RNN is trained as a language model, we can generate random text: the probabilities of each
possible next word are used to randomly choose a word
- Inputs are pre-trained word embeddings
- <S> beginning of sentence marker, which has its own embedding
- Hidden state Semantic representation of all content has been processed so far
- The output of the SoftMax is not used to predict the most likely word; it is a probability distribution
from which to sample the next word
- The processing ends after fixed number of tokens, or </S>
Sequence labelling: Involves categorizing every item in a Sequence | Every word or token
gets a label
POS tagging
- includes hidden Markov models (HMMs) and maximum entropy Markov models
- State-of-the-art POS tagging uses more complex variations of RNNs or transformers
Named entity recognition (NER)
- Involves detecting spans of text representing names of people, places, times and dates
- The first phase of an information extraction (IE) system
- Trained using supervised learning; words in the training set are labeled with IOB tags
RNNs used for POS tagging or NER (the architectures are basically identical) make
each prediction independently; this can be improved using Conditional Random Fields
Text Categorization (Sequence Classification) | a sequence of text receives one label
RNNs, successful for the categorization of short sequences, like tweets/ individual sentences
- Final hidden state becomes the input to a feedforward neural network (final hidden state
as representing the meaning of the text )
- Here the entire text gets one label (e.g. it talks about sports), unlike sequence labeling where every word gets a label
Stacked RNN
- Uses hidden states produced by one RNN as the inputs to the next. Each RNN as a layer
- Hidden states of the top layer can be used as outputs of the stack, perhaps sent as input to
another layer, such as a SoftMax layer
- Stacked RNNs have outperformed single-layer RNNs for many tasks
- Optimal number of RNN layers varies according to the task + Training set
- Adding additional layers can significantly increase the training time
- The entire stack is trained at once using end-to-end training
Bidirectional RNNs
- RNNs process inputs sequentially in one direction
- All the input is available at once, we can create another RNN processes the inputs in the
opposite, or backward, direction
- For text categorization, only the final states produced by the forward and backward
RNNs are concatenated and then fed to a feedforward neural network
32-34
The Vanishing Gradient Problem:
During backpropagation, for each layer or time step that error is backpropagated, there is a
multiplication taking place.
→ These multiplications reduce the gradients. That is, the further back we go, the less
significant layers or states seem to be, in terms of how they affect the measured loss during
training (older layers or earlier words end up having little effect on training)
- For (CNNs and feedforward NNs), rectified linear units (ReLUs) mitigate the problem
to some extent BUT not typically used for RNNs
- Sometimes, multiplications increase the gradients, leading to exploding gradient
problem → simply capping the gradient to some fixed maximum
Non-local Context: Only very local context winds up being significant ( hidden states only
significantly influenced by the previous two or three words )
➔ SOLUTION:
Long Short-Term Memory Units: LSTM networks provide a solution to the vanishing gradient problem
- LSTM unrolled; the cell is the component of the architecture that repeats
Input and Output of Cells: LSTM cells have 2 sets of values that are passed between cells
cell state ( ct−1 ) | cell's hidden state ( ht−1 )
The Forget Gate
- Uses previous hidden state and current input → Decide how much of (and which parts
of) the cell state to forget / remember
- The forget gate's output, ft, is multiplied by the cell state element-wise (⊙), determining which
information will be kept / forgotten: kt = ct−1 ⊙ ft
New Cell Content and the Input Gate
Determines how the input, combined with the previous hidden state, might be used to update
the cell state
The input gate is used to decide how much of (and which parts of) the new cell content is
added to the cell state; the formula for this is jt = gt ⊙ it
The cell state is then simply updated as ct = jt + kt; this is the value of the cell state that will be passed
out of the cell to the next time step (and we will soon see it is also used to compute the value
of the cell's hidden state)
The Output Gate
- Decide how much of (and which parts of) the updated cell state is passed on as the hidden
state from the cell
- Does not get applied to the cell state directly
- Hidden state, not the cell state, that also potentially serves as the output of the cell
- The cell’s hidden state might be used to make predictions, to serve as input to another
stacked LSTM
LSTM: Summary
- A single LSTM cell/unit accepts as input the previous cell's state (context), the previous
cell's hidden state, and the current input (vector)
- Cell generates an updated cell and hidden state, → passed to the next cell (Same unit at
Next time step)
- Hidden state ( cell's output and used for classification, as input to a stacked LSTM, etc.)
- The gates, and process of calculating the new candidate values, all ultimately involve
ordinary nodes and weights in a NN
- Learn the weights (train the LSTM) using SGD and backpropagation
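A minimal numpy sketch of a single LSTM cell step following the gate equations in these notes (kt = ct−1 ⊙ ft, jt = gt ⊙ it, ct = jt + kt); the weights are random placeholders that a real network would learn with SGD and backpropagation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, p):
    z = np.concatenate([h_prev, x_t])     # previous hidden state + current input
    f_t = sigmoid(p["Wf"] @ z + p["bf"])  # forget gate
    i_t = sigmoid(p["Wi"] @ z + p["bi"])  # input gate
    g_t = np.tanh(p["Wg"] @ z + p["bg"])  # new candidate cell content
    o_t = sigmoid(p["Wo"] @ z + p["bo"])  # output gate
    c_t = g_t * i_t + c_prev * f_t        # updated cell state: jt + kt
    h_t = o_t * np.tanh(c_t)              # new hidden state (the cell's output)
    return h_t, c_t

rng = np.random.default_rng(3)
d_in, d_hid = 3, 4
params = {n: rng.normal(size=(d_hid, d_hid + d_in)) for n in ("Wf", "Wi", "Wg", "Wo")}
params.update({n: np.zeros(d_hid) for n in ("bf", "bi", "bg", "bo")})

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in [rng.normal(size=d_in) for _ in range(3)]:
    h, c = lstm_cell(x, h, c, params)
print(h, c)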
Gated Recurrent Units ( Alternative to LSTMs )
- GRUs were developed much more recently than LSTMs
- Just a hidden state passed from cell to cell (no separate cell state)
- Conceptually simpler to understand
- Significantly faster to train
- They perform as well as LSTMs for some (BUT not all) NLP tasks
Building Complex Neural Networks
- Any of the 3 recurrent structures can be bi-directional and/or stacked
- Can be applied to any of the tasks in the context of simple RNNs
- Outputs from recurrent structures (at each time step or ends) fed as input to feedforward
networks for categorization
- Modern deep learning libraries (TensorFlow/PyTorch) make it easy to build complex
networks out of standard layers
- The libraries allow more advanced programmers to design their own units
- Regardless of the architecture, NN can be trained end-to-end
Encoder-Decoder Models, Attention, and Transformers
Encoder-decoder networks, also known as sequence-to-sequence (seq2seq) models
Seq to Seq Many NLP tasks involve mapping a sequence of text to another sequence of text
EX: Machine translation(MT) | Summarization | Q&A
1- Autoregressive Generation with Prefix: the generation is seeded with a more general prefix
- Given a prefix (can be supplied by a user), task of RNN is predicting the completion of a
sentence starting with the prefix
- Prefix processed first, without outputs or predictions ( just forward inference)
- Final hidden state after processing the prefix → used as the starting point for the
autoregressive generation
Training Data for MT – the example here is translation, but the same applies to summarization and question answering
- Statistical MT (neural or pre-neural) relies on a training set in the form of a parallel corpus
(multilingual parallel text, or bitext) whose sentences have been aligned
- Original language: referred to as the source language
- Desired language: (that the original is translated to) referred to as target language
2- Using an RNN for MT
- Instead of a prefix → source sentence of a pair is fed to the RNN first
- Instead of generating continuation of a prefix, → predicts the target sentence
- After training, we can apply the network as follows:
1- Feed a source sentence into the network
2- Use the final hidden state from the source sentence as the initial hidden state for the
decoder, which predicts the target sentence
3- Simple Encoder-Decoder Model
- Encoder: portion of the network that processes an input sequence (like the Source)
- Decoder: portion of the network that generates an output sequence (like the Target)
This simple encoder-decoder model has at least 3 flaws:
- Encoder + Decoder assumed to have the same internal structure
- Final state of the Encoder is the only context available to the Decoder
- Final state of the Encoder is only available to the Decoder as its initial state
Generalized Encoder-Decoder Network
- The encoder accepts an input sequence, x1 to xn → generates a corresponding sequence of
contextualized representations, h1 to hn | often implemented with stacked Bi-LSTMs, GRUs, or Transformers
- The context vector (c) is a function of h1 to hn, and conveys the essence of the input to the
decoder (if the encoder is stacked, these are the hidden states at the top layer)
- The decoder accepts c as input → generates an arbitrary-length sequence of hidden states h1
to hm, from which a corresponding sequence of output states, y1 to ym, can be obtained
Encoder Applied to MT
- The cells at the top layer produce visible hidden states, h1 to hn
- The encoder produces c, based on the sequence of hidden states produced by its top layer
Decoder Applied to MT
- Each current decoder cell considers c + the previous hidden state + the previous output
- h_t^d = g(ŷ_{t−1}, h_{t−1}^d, c), where ŷ_{t−1} is the word generated at the previous time step and
h_{t−1}^d is the previous hidden state produced by the decoder (the output layer computes the
outputs, y1 to ym, based on the hidden states of the decoder)
- Optionally, c and the previous output can be used by the output layer at each time step
- y_t = softmax(ŷ_{t−1}, z_t, c), where z_t = f(h_t^d)
The Context Vector Revisited – the problem is that we depend on the last unit of the input, giving it outsized importance, which is wrong
- Num of hidden states produced by Encoder varies based on length of the input seq
- We need a fixed-sized context vector, so we can't just concatenate all of them
➔ This leads to context that is dominated by the latter part of the input So →
1- bidirectional, concatenate the final hidden states from each direction BUT context
that is dominated by the start and end of the input
2- Sum or average all the hidden states produced by the encoder BUT all hidden states
are not equally important
(The first option has the same issue, with importance shifting to the start and the end; in the second, all states get the same importance; both are wrong! Hence a new concept:)
Attention
- Added it to encoder-decoder models → improved performance for several NLP tasks
- Instead of static c → different c generated at each time step of the decoder
- h_i^d = g(ŷ_{i−1}, h_{i−1}^d, c_i)
Computing Scores
To compute ci: score how much to focus on each encoder hidden state (how relevant each encoder state
is to the previous decoder state, h_{i−1}^d)
- Dot-product: pay most attention to hidden states from Encoder that are most like the
previous hidden state from the decoder
- Computing similarity: Learn (add Weights) how to compare encoder hidden states to
decoder hidden states
Computing the Context Vector
Attention allows the Decoder to focus on the portion of the output from the Encoder that
seems most relevant at each time step, improved results further
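A small numpy sketch of dot-product attention: score each encoder hidden state against the previous decoder state, softmax the scores, and take the weighted sum as the context vector (the vectors are random placeholders):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d = 6
encoder_states = rng.normal(size=(5, d))  # h1 .. hn from the encoder
decoder_state = rng.normal(size=d)        # previous decoder hidden state

scores = encoder_states @ decoder_state  # dot-product relevance scores
weights = softmax(scores)                # attention weights sum to 1
context = weights @ encoder_states       # context vector ci for this time step
print(weights, context)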
CNNs successful for some text categorization tasks (short text sequences) such as single
sentences or tweets | For MT, CNNs were sometimes used for encoders
RNNs(like LSTMs) can't be parallelized → Process input sequentially (slow training)
CNNs have problems detecting long-distance dependencies
Transformers ( Ex of an encoder decoder network)
- Encoder Decoder both rely on stacked layers - Each layer has sublayers
- One head → Attention(Q, K, V)
- Multi-head → each attention head learns to focus on one aspect of the input; there are various aspects
of the input that we need to pay attention to for different reasons
1- First, queries, keys, and values are linearly projected using learned mappings
2- Next, attention is applied in parallel to each result
3- Outputs of attention are then concatenated and again projected
Without Positional encodings, no way for a Transformer to make use of the order of the input
Use sine/cosine functions | added to the input embeddings | only added to the input directly
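A small numpy sketch of sine/cosine positional encodings added to the input embeddings (the sequence length, model dimension, and embeddings are arbitrary placeholders):

import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]       # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return pe

seq_len, d_model = 4, 8
embeddings = np.random.default_rng(5).normal(size=(seq_len, d_model))
inputs = embeddings + positional_encoding(seq_len, d_model)  # added element-wise to the input
print(inputs.shape)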
The Transformer’s Encoder: Accepts the input (sequence) → Positional encodings →
Go through 6 identical layers → each containing 2 sublayers:
- Multi-head self-attention layer
- Position-wise fully connected feedforward NN
Transformer's Decoder: Accepts the outputs of the seq2seq mapping → go through 6
identical layers → each containing 3 sublayers:
- Masked multi-head self-attention
- Multi-head attention over the encoder's output
- Position-wise fully connected feedforward NN
➔ Output fed through a linear transformation layer → SoftMax is applied to predict
Input Tokenization - for handling words never seen before:
English → WordPiece | Arabic → Farasa
Masked LM: Understands left + right context | Doesn't need labeled data, just raw text