Conditional Random Fields
(CRFs)
Md. Shahidul Salim
Lecturer, CSE, KUET
Conditional Random Fields
• Conditional Random Fields are discriminative models used for predicting sequences
• They use contextual information from previous labels, which increases the
amount of information the model has to make a good prediction.
Discriminative models
• Discriminative models predict
• Labels: Y = y, from
• Features: X = {x1, x2, …, xn}
• by modeling the conditional probability P(Y|X) directly
• Discriminative models are a class of models used in statistical classification, mainly for supervised machine learning. Examples include:
• Logistic regression
• Support Vector Machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
Conditional Random Fields (CRFs)
• Unknown words: proper names and acronyms are created very often
• We would like to add arbitrary features (words starting with capital letters are likely to be proper
nouns; words ending with -ed tend to be past tense, VBD or VBN)
• Knowing the previous or following words might be a useful feature (if the
previous word is the, the current tag is unlikely to be a verb)
• It is hard for generative models like HMMs to add arbitrary features
• We could combine arbitrary features using a logistic regression model
• But logistic regression isn’t a sequence model; it assigns a class to a single
observation
Conditional Random Fields (CRFs)
• There is a discriminative sequence model based on log-linear models:
the conditional random field (CRF); we focus on the linear-chain CRF
• Assume we have a sequence of input words X = x1...xn and want to
compute a sequence of output tags Y = y1...yn. In an HMM, to compute
the best tag sequence that maximizes P(Y|X) we rely on Bayes’ rule
and the likelihood P(X|Y) (see the equation sketch after this list).
• In a CRF, by contrast, we compute the posterior p(Y|X) directly,
training the CRF to discriminate among the possible tag sequences.
• However, the CRF does not compute a probability for each tag at each
time step. Instead, at each time step the CRF computes log-linear
functions over a set of relevant features, and these local features are
aggregated and normalized to produce a global probability for the
whole sequence
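A sketch of the two objectives referred to above, written in their standard textbook forms (the slide's own equations are not reproduced here, so treat this as the usual formulation rather than a copy):

HMM (Bayes' rule and the likelihood P(X|Y)):
\hat{Y} = \arg\max_Y P(Y \mid X) = \arg\max_Y P(X \mid Y)\,P(Y) = \arg\max_Y \prod_{i=1}^{n} P(x_i \mid y_i)\, P(y_i \mid y_{i-1})

CRF (posterior modeled directly):
\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X)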
• Let X and Y be the input and output sequences
• A CRF is a log-linear model that assigns a probability to an entire
output (tag) sequence Y, out of all possible sequences, given the
entire input (word) sequence X
• A CRF is a sequence analogue of multinomial logistic regression (a modified
version of logistic regression that predicts a multinomial probability, i.e. more
than two classes, for each input example)
• In a CRF, the function F maps an entire input sequence X and an entire
output sequence Y to a feature vector.
Let’s assume we have K features, with a weight wk for each feature Fk:
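In the standard notation, this gives the following log-linear form (a sketch of the usual formulation, since the slide's own equation is not shown):

p(Y \mid X) = \frac{\exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right)}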
• The K functions Fk(X,Y) are global features.
• Each one is a property of the entire input sequence X
and output sequence Y
• We compute them by decomposing into a sum of local features for each
position i in Y:
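A sketch of that decomposition in the usual notation, assuming local features f_k(y_{i-1}, y_i, X, i) as described in the next bullet:

F_k(X, Y) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i)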
• Each of these local features fk in a linear-chain CRF is allowed to make use
of the current output token yi, the previous output token yi−1, the entire input
string X (or any subpart of it), and the current position i. This constraint of
depending only on the current and previous output tokens yi and yi−1 is what
characterizes a linear-chain CRF. As we will see, this limitation makes it
possible to use efficient versions of the HMM’s Viterbi and
forward-backward algorithms for the linear-chain CRF. A general CRF, by contrast,
allows a feature to make use of any output token, and is thus necessary for
tasks in which the decision depends on distant output tokens, like yi−4.
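Because the local features only look at yi−1 and yi, decoding can reuse the HMM's Viterbi recursion. Below is a minimal Python sketch (not the lecture's code) that assumes a generic scoring function score(y_prev, y, X, i) = Σk wk·fk(y_prev, y, X, i); since the normalizer is constant for a fixed X, it can be ignored when searching for the best tag sequence. The toy scorer and its hand-written rules are invented purely for illustration.

def viterbi(X, tags, score, start="<s>"):
    """Return the highest-scoring tag sequence for the word sequence X."""
    n = len(X)
    # best[i][t] = best score of any tag sequence for X[0..i] ending in tag t
    best = [{t: float("-inf") for t in tags} for _ in range(n)]
    back = [{t: None for t in tags} for _ in range(n)]

    for t in tags:                                  # position 0: previous tag is the start symbol
        best[0][t] = score(start, t, X, 0)

    for i in range(1, n):                           # fill the trellis left to right
        for t in tags:
            for t_prev in tags:
                s = best[i - 1][t_prev] + score(t_prev, t, X, i)
                if s > best[i][t]:
                    best[i][t] = s
                    back[i][t] = t_prev

    last = max(tags, key=lambda t: best[n - 1][t])  # best final tag
    path = [last]
    for i in range(n - 1, 0, -1):                   # follow backpointers
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy usage with a hypothetical hand-written scorer (not trained weights):
def toy_score(y_prev, y, X, i):
    s = 0.0
    if X[i][0].isupper() and y == "NNP":            # capitalized word -> proper noun
        s += 2.0
    if X[i] == "will" and y == "MD":                # "will" is usually a modal
        s += 2.0
    if y_prev == "MD" and y == "VB":                # a modal is often followed by a base verb
        s += 2.0
    if X[i] == "the" and y == "DT":                 # "the" is a determiner
        s += 2.0
    if y_prev == "DT" and y == "NN":                # a determiner is usually followed by a noun
        s += 1.0
    return s

print(viterbi(["Janet", "will", "back", "the", "bill"],
              ["NNP", "MD", "VB", "DT", "NN"], toy_score))
# -> ['NNP', 'MD', 'VB', 'DT', 'NN']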
Features in a CRF POS Tagger
• In a linear-chain CRF, each local feature fk at position i can depend on any
information from (yi−1, yi, X, i). Some legal features representing
common situations are sketched after this paragraph.
We use the notation 1{x} to mean “1 if x is true, and 0
otherwise”. From now on, we’ll leave off the 1 when we define features, but
you can assume each feature has it there implicitly.
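To make the 1{x} notation concrete, here is a small Python sketch (illustrative only, not the lecture's code) of local feature functions fk(yi−1, yi, X, i) written as indicator functions; the specific conditions are hypothetical examples in the spirit of the slide.

# Each local feature returns 1 when its condition on (y_prev, y, X, i)
# holds and 0 otherwise; the conditions below are illustrative examples.

def f1(y_prev, y, X, i):
    # 1{x_i = "the", y_i = DT}
    return 1 if X[i].lower() == "the" and y == "DT" else 0

def f2(y_prev, y, X, i):
    # 1{y_{i-1} = MD, y_i = VB}: a modal is often followed by a base verb
    return 1 if y_prev == "MD" and y == "VB" else 0

def f3(y_prev, y, X, i):
    # 1{x_i ends in -ed, y_i = VBD}: -ed words tend to be past tense
    return 1 if X[i].endswith("ed") and y == "VBD" else 0

X = ["Janet", "will", "back", "the", "bill"]
print(f1("VB", "DT", X, 3))   # 1: word is "the" and tag is DT
print(f2("MD", "VB", X, 2))   # 1: previous tag MD, current tag VB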
Feature templates:
• These templates automatically populate the set of features from every
instance in the training and test set. Thus, for our example Janet/NNP
will/MD back/VB the/DT bill/NN, when xi is the word “back”, the
following features would be generated and have the value 1 (we’ve
assigned them arbitrary feature numbers; a sketch of the expansion follows):
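A minimal Python sketch (not the lecture's code) of expanding feature templates into concrete binary features. The two templates used here, ⟨yi, xi⟩ and ⟨yi, yi−1⟩, are assumed for illustration; the slide's own template list is not reproduced.

def expand_templates(X, Y, i):
    """Return the concrete feature strings generated at position i."""
    feats = []
    feats.append(f"y_i={Y[i]} & x_i={X[i]}")            # template <y_i, x_i>
    prev = Y[i - 1] if i > 0 else "<s>"
    feats.append(f"y_i={Y[i]} & y_i-1={prev}")          # template <y_i, y_{i-1}>
    return feats

X = ["Janet", "will", "back", "the", "bill"]
Y = ["NNP",   "MD",   "VB",   "DT",  "NN"]

# At x_i = "back" (i = 2) these instantiated features take the value 1:
print(expand_templates(X, Y, 2))
# ['y_i=VB & x_i=back', 'y_i=VB & y_i-1=MD']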
• Word shape features (for unknown words)
• map lower-case letters to ‘x’, upper-case to ‘X’, digits to ‘d’, and retain
punctuation
• For example, the word “well-dressed” would generate non-zero values for its
word-shape features (see the sketch after this list)
• The known-word templates are computed for every word seen in the
training set; the unknown word features can also be computed for all
words in training, or only on training words whose frequency is below
some threshold.
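A small Python sketch (illustrative, not the lecture's code) of the word-shape mapping described above; the "short" shape additionally collapses runs of identical shape characters.

def word_shape(word):
    """Map lower-case letters to 'x', upper-case to 'X', digits to 'd';
    keep punctuation and other characters as they are."""
    out = []
    for ch in word:
        if ch.islower():
            out.append("x")
        elif ch.isupper():
            out.append("X")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)          # punctuation such as '-' is retained
    return "".join(out)

def short_word_shape(word):
    """Like word_shape, but consecutive identical shape characters are
    collapsed into one."""
    shape = word_shape(word)
    out = []
    for ch in shape:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

print(word_shape("well-dressed"))        # xxxx-xxxxxxx
print(short_word_shape("well-dressed"))  # x-x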
Can HMMs incorporate features?
• Because in HMMs all computation is based on the two probabilities
P(tag|tag) and P(word|tag), if we want to include some source of
knowledge in the tagging process, we must find a way to encode the
knowledge into one of these two probabilities.
• Each time we add a feature, we have to do a lot of complicated
conditioning, which gets harder and harder as we have more and
more such features.