
Conditional Random Fields (CRFs)
Md. Shahidul Salim
Lecturer, CSE, KUET
Conditional Random Fields
• Conditional Random Fields are a discriminative model used for predicting sequences
• They use contextual information from previous labels, increasing the amount of information the model has to make a good prediction
Discriminative models
• Discriminative: given labels Y = y and features X = {x1, x2, …, xn}, the model predicts the labels directly from the features
• The discriminative model refers to a class of models used in statistical classification, mainly for supervised machine learning:
• Logistic regression
• Support Vector Machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forests
Conditional Random Fields (CRFs)
• Unknown words: proper names and acronyms are created very often
• Add arbitrary features (words starting with capital letters are likely to be proper nouns; words ending with -ed tend to be past tense (VBD or VBN))
• Knowing the previous or following words might be a useful feature (if the previous word is the, the current tag is unlikely to be a verb)
• It is hard for generative models like HMMs to add arbitrary features
• Arbitrary features can be combined using a logistic regression model
• But logistic regression isn't a sequence model; it assigns a class to a single observation
Conditional Random Fields (CRFs)
• There is a discriminative sequence model based on log-linear models: the conditional random field (CRF); here we focus on the linear-chain CRF
• Assume we have a sequence of input words X = x1...xn and want to compute a sequence of output tags Y = y1...yn. In an HMM, to compute the best tag sequence that maximizes P(Y|X), we rely on Bayes' rule and the likelihood P(X|Y):
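• A sketch of the HMM objective being referred to (standard formulation, reconstructed rather than copied from the slide):

\hat{Y} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} P(X \mid Y)\, P(Y) \approx \arg\max_{Y} \prod_{i=1}^{n} P(x_i \mid y_i) \prod_{i=1}^{n} P(y_i \mid y_{i-1})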
• In a CRF, by contrast, we compute the posterior p(Y|X) directly,
training the CRF to discriminate among the possible tag sequences.

• However, the CRF does not compute a probability for each tag at each time step. Instead, at each time step the CRF computes log-linear functions over a set of relevant features, and these local features are aggregated and normalized to produce a global probability for the whole sequence.
• Let X and Y be the input and output sequences
• A CRF is a log-linear model that assigns a probability to an entire output (tag) sequence Y, out of all possible sequences, given the entire input (word) sequence X
• A CRF can be seen as multinomial logistic regression (a modified version of logistic regression that predicts a multinomial probability, i.e. over more than two classes, for each input example) applied to an entire sequence
• In a CRF, the function F maps an entire input sequence X and an entire
output sequence Y to a feature vector.
Let’s assume we have K features, with a weight wk for each feature Fk:
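• The slide's equation is not included in this extraction; the standard linear-chain CRF form it describes (a reconstruction, not a verbatim copy) is:

p(Y \mid X) = \frac{\exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right)}

where \mathcal{Y} is the set of all possible tag sequences for X; the denominator is what normalizes the local log-linear scores into a global probability.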

• The K functions Fk(X,Y) are global features
• Each one is a property of the entire input sequence X and output sequence Y
• We compute them by decomposing each Fk into a sum of local features for each position i in Y:
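• The decomposition referred to above, in its standard form (reconstructed, since the slide's equation is not in this extraction):

F_k(X, Y) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i)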

• Each of these local features fk in a linear-chain CRF is allowed to make use of the current output token yi, the previous output token yi−1, the entire input string X (or any subpart of it), and the current position i. This constraint to depend only on the current and previous output tokens yi and yi−1 is what characterizes a linear-chain CRF. As we will see, this limitation makes it possible to use efficient versions of the Viterbi and forward-backward algorithms from the HMM for the linear-chain CRF. A general CRF, by contrast, allows a feature to make use of any output token, and is thus necessary for tasks in which the decision depends on distant output tokens, like yi−4.
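Because the linear-chain constraint is what enables Viterbi decoding, here is a minimal sketch of Viterbi for a linear-chain CRF (illustrative Python; the function name, the "<s>" start symbol, and the score callback are assumptions, not taken from the slides):

def viterbi_decode(X, tags, score):
    """Find the best tag sequence for input X under a linear-chain CRF.

    score(y_prev, y, X, i) should return sum_k w_k * f_k(y_prev, y, X, i),
    the total weighted local feature score for tagging position i with y
    when the previous tag is y_prev ("<s>" is used before position 0).
    """
    n = len(X)
    # best[i][y] = score of the best tag prefix of length i+1 ending in tag y
    best = [{} for _ in range(n)]
    back = [{} for _ in range(n)]

    for y in tags:
        best[0][y] = score("<s>", y, X, 0)
        back[0][y] = None

    for i in range(1, n):
        for y in tags:
            # choose the previous tag that maximizes the running score
            prev_scores = {yp: best[i - 1][yp] + score(yp, y, X, i) for yp in tags}
            yp_best = max(prev_scores, key=prev_scores.get)
            best[i][y] = prev_scores[yp_best]
            back[i][y] = yp_best

    # follow back-pointers from the best final tag
    y_last = max(best[n - 1], key=best[n - 1].get)
    path = [y_last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

Because each local score depends only on (yi−1, yi, X, i), the search over all tag sequences factors position by position, which is exactly why this dynamic program is possible.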
Features in a CRF POS Tagger
• In a linear-chain CRF, each local feature fk at position i can depend on any information from (yi−1, yi, X, i). So some legal features representing common situations might be the following:
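• Illustrative indicator features of this kind (assumed examples in the standard linear-chain CRF style; the slide's own list is not reproduced in this extraction):

1{x_i = "the", y_i = DET}
1{y_i = PROPN, x_{i+1} = "Street", y_{i-1} = NUM}
1{y_i = VERB, y_{i-1} = AUX}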

Above, we explicitly use the notation 1{x} to mean "1 if x is true, and 0 otherwise". From now on, we'll leave off the 1 when we define features, but you can assume each feature has it there implicitly.

Feature templates:
• These templates automatically populate the set of features from every instance in the training and test set. Thus for our example Janet/NNP will/MD back/VB the/DT bill/NN, when xi is the word "back", the following features would be generated and have the value 1 (we've assigned them arbitrary feature numbers):
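The specific templates and generated features from the slide are not reproduced here; a minimal sketch of how such templates could be expanded for this example, assuming illustrative templates such as <y_i, x_i>, <y_i, y_i-1>, and <y_i, x_i-1, x_i+2> (the templates and function names below are assumptions, not from the slide):

def expand_templates(words, tags, i):
    """Return the names of the features that fire at position i."""
    feats = []
    # template <y_i, x_i>: current word paired with current tag
    feats.append(f"x_i={words[i]} AND y_i={tags[i]}")
    # template <y_i, y_i-1>: tag bigram
    if i > 0:
        feats.append(f"y_i={tags[i]} AND y_i-1={tags[i-1]}")
    # template <y_i, x_i-1, x_i+2>: tag with one word to the left, two to the right
    if i > 0 and i + 2 < len(words):
        feats.append(f"x_i-1={words[i-1]} AND x_i+2={words[i+2]} AND y_i={tags[i]}")
    return feats

words = ["Janet", "will", "back", "the", "bill"]
tags = ["NNP", "MD", "VB", "DT", "NN"]
print(expand_templates(words, tags, 2))
# ['x_i=back AND y_i=VB', 'y_i=VB AND y_i-1=MD', 'x_i-1=will AND x_i+2=bill AND y_i=VB']

Each generated feature name is then mapped to an arbitrary feature number, and its value is 1 for this instance.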

• Word shape features (for unknown words)
• Map lower-case letters to 'x', upper-case letters to 'X', digits to 'd', and retain punctuation
• For example, the word "well-dressed" might generate the following non-zero valued feature values:
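A minimal sketch of the shape mapping just described (the helper names and the short-shape collapsing rule follow the common textbook convention and are assumptions, not taken from the slide):

def word_shape(word):
    """Map lower-case letters to 'x', upper-case to 'X', digits to 'd'; keep punctuation."""
    shape = []
    for ch in word:
        if ch.islower():
            shape.append("x")
        elif ch.isupper():
            shape.append("X")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # retain punctuation as-is
    return "".join(shape)

def short_word_shape(word):
    """Collapse consecutive identical shape characters, e.g. 'xxxx-xxxxxxx' -> 'x-x'."""
    shape = word_shape(word)
    collapsed = [shape[0]]
    for ch in shape[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    return "".join(collapsed)

print(word_shape("well-dressed"))        # xxxx-xxxxxxx
print(short_word_shape("well-dressed"))  # x-x

Prefix and suffix features (for example, the first and last one or two characters of the word) are typically generated alongside the shape features.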

• The known-word templates are computed for every word seen in the
training set; the unknown word features can also be computed for all
words in training, or only on training words whose frequency is below
some threshold.
Can HMMs incorporate features?
• Because in HMMs all computation is based on the two probabilities
P(tag|tag) and P(word|tag), if we want to include some source of
knowledge in the tagging process, we must find a way to encode the
knowledge into one of these two probabilities.
• Each time we add a feature, we have to do a lot of complicated
conditioning, which gets harder and harder as we have more and
more such features.
