|
4 | 4 | ====================================================== |
5 | 5 |
|
6 | 6 | Dynamic versus Static Deep Learning Toolkits |
7 | | -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 7 | +-------------------------------------------- |
8 | 8 |
|
9 | 9 | Pytorch is a *dynamic* neural network kit. Another example of a dynamic |
10 | 10 | kit is `Dynet <https://github.com/clab/dynet>`__ (I mention this because |
|
47 | 47 | the code more closely resembling the host language (by that I mean that |
48 | 48 | Pytorch and Dynet look more like actual Python code than Keras or |
49 | 49 | Theano). |
50 | | -""" |
51 | 50 |
|
| 51 | +Bi-LSTM Conditional Random Field Discussion |
| 52 | +------------------------------------------- |
52 | 53 |
|
53 | | -##################################################################### |
54 | | -# Bi-LSTM Conditional Random Field Discussion |
55 | | -# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
56 | | -# |
57 | | -# For this section, we will see a full, complicated example of a Bi-LSTM |
58 | | -# Conditional Random Field for named-entity recognition. The LSTM tagger |
59 | | -# above is typically sufficient for part-of-speech tagging, but a sequence |
60 | | -# model like the CRF is really essential for strong performance on NER. |
61 | | -# Familiarity with CRF's is assumed. Although this name sounds scary, all |
62 | | -# the model is is a CRF but where an LSTM provides the features. This is |
63 | | -# an advanced model though, far more complicated than any earlier model in |
64 | | -# this tutorial. If you want to skip it, that is fine. To see if you're |
65 | | -# ready, see if you can: |
66 | | -# |
67 | | -# - Write the recurrence for the viterbi variable at step i for tag k. |
68 | | -# - Modify the above recurrence to compute the forward variables instead. |
69 | | -# - Modify again the above recurrence to compute the forward variables in |
70 | | -# log-space (hint: log-sum-exp) |
71 | | -# |
72 | | -# If you can do those three things, you should be able to understand the |
73 | | -# code below. Recall that the CRF computes a conditional probability. Let |
74 | | -# :math:`y` be a tag sequence and :math:`x` an input sequence of words. |
75 | | -# Then we compute |
76 | | -# |
77 | | -# .. math:: P(y|x) = \frac{\exp{(\text{Score}(x, y)})}{\sum_{y'} \exp{(\text{Score}(x, y')})} |
78 | | -# |
79 | | -# Where the score is determined by defining some log potentials |
80 | | -# :math:`\log \psi_i(x,y)` such that |
81 | | -# |
82 | | -# .. math:: \text{Score}(x,y) = \sum_i \log \psi_i(x,y) |
83 | | -# |
84 | | -# To make the partition function tractable, the potentials must look only |
85 | | -# at local features. |
86 | | -# |
87 | | -# In the Bi-LSTM CRF, we define two kinds of potentials: emission and |
88 | | -# transition. The emission potential for the word at index :math:`i` comes |
89 | | -# from the hidden state of the Bi-LSTM at timestep :math:`i`. The |
90 | | -# transition scores are stored in a :math:`|T|x|T|` matrix |
91 | | -# :math:`\textbf{P}`, where :math:`T` is the tag set. In my |
92 | | -# implementation, :math:`\textbf{P}_{j,k}` is the score of transitioning |
93 | | -# to tag :math:`j` from tag :math:`k`. So: |
94 | | -# |
95 | | -# .. math:: \text{Score}(x,y) = \sum_i \log \psi_\text{EMIT}(y_i \rightarrow x_i) + \log \psi_\text{TRANS}(y_{i-1} \rightarrow y_i) |
96 | | -# |
97 | | -# .. math:: = \sum_i h_i[y_i] + \textbf{P}_{y_i, y_{i-1}} |
98 | | -# |
99 | | -# where in this second expression, we think of the tags as being assigned |
100 | | -# unique non-negative indices. |
101 | | -# |
102 | | -# If the above discussion was too brief, you can check out |
103 | | -# `this <http://www.cs.columbia.edu/%7Emcollins/crf.pdf>`__ write up from |
104 | | -# Michael Collins on CRFs. |
105 | | -# |
106 | | -# Implementation Notes |
107 | | -# ~~~~~~~~~~~~~~~~~~~~ |
108 | | -# |
109 | | -# The example below implements the forward algorithm in log space to |
110 | | -# compute the partition function, and the viterbi algorithm to decode. |
111 | | -# Backpropagation will compute the gradients automatically for us. We |
112 | | -# don't have to do anything by hand. |
113 | | -# |
114 | | -# The implementation is not optimized. If you understand what is going on, |
115 | | -# you'll probably quickly see that iterating over the next tag in the |
116 | | -# forward algorithm could probably be done in one big operation. I wanted |
117 | | -# to code to be more readable. If you want to make the relevant change, |
118 | | -# you could probably use this tagger for real tasks. |
119 | | -##################################################################### |
| 54 | +For this section, we will see a full, complicated example of a Bi-LSTM |
| 55 | +Conditional Random Field for named-entity recognition. The LSTM tagger |
| 56 | +above is typically sufficient for part-of-speech tagging, but a sequence |
| 57 | +model like the CRF is really essential for strong performance on NER. |
| 58 | +Familiarity with CRFs is assumed. Although the name sounds scary, the
| 59 | +model is simply a CRF in which an LSTM provides the features. This is
| 60 | +an advanced model though, far more complicated than any earlier model in |
| 61 | +this tutorial. If you want to skip it, that is fine. To see if you're |
| 62 | +ready, see if you can: |
| 63 | +
|
| 64 | +- Write the recurrence for the Viterbi variable at step i for tag k.
| 65 | +- Modify the above recurrence to compute the forward variables instead. |
| 66 | +- Modify the recurrence again to compute the forward variables in
| 67 | +  log-space (hint: log-sum-exp).
| 68 | +
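As a hint for the checklist above: the Viterbi recurrence and the log-space forward recurrence are identical except that `max` is replaced by log-sum-exp. A minimal sketch of one step of each, using toy tensors (the names `emissions` and `transitions` are illustrative, not the tutorial's):

```python
import torch

# Toy scores: 5 timesteps, 3 tags (illustrative values, not a trained model)
emissions = torch.randn(5, 3)    # emissions[i, k]: emission score of tag k at step i
transitions = torch.randn(3, 3)  # transitions[j, k]: score of moving to tag j from tag k

def viterbi_step(prev, emit_i):
    # prev[k]: best score of any path ending in tag k at the previous step.
    # scores[j, k] = prev[k] + transitions[j, k] + emit_i[j]
    scores = prev.unsqueeze(0) + transitions + emit_i.unsqueeze(1)
    return scores.max(dim=1).values  # best score ending in each tag j

def forward_step(prev, emit_i):
    # Same recurrence with max replaced by log-sum-exp: sums over paths
    # instead of taking the single best one.
    scores = prev.unsqueeze(0) + transitions + emit_i.unsqueeze(1)
    return torch.logsumexp(scores, dim=1)
```

Since log-sum-exp dominates max, the forward variables are always at least as large as the Viterbi variables for the same scores.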
|
| 69 | +If you can do those three things, you should be able to understand the |
| 70 | +code below. Recall that the CRF computes a conditional probability. Let |
| 71 | +:math:`y` be a tag sequence and :math:`x` an input sequence of words. |
| 72 | +Then we compute |
| 73 | +
|
| 74 | +.. math:: P(y|x) = \frac{\exp{(\text{Score}(x, y)})}{\sum_{y'} \exp{(\text{Score}(x, y')})} |
| 75 | +
|
| 76 | +where the score is determined by defining some log potentials
| 77 | +:math:`\log \psi_i(x,y)` such that |
| 78 | +
|
| 79 | +.. math:: \text{Score}(x,y) = \sum_i \log \psi_i(x,y) |
| 80 | +
|
| 81 | +To make the partition function tractable, the potentials must look only |
| 82 | +at local features. |
| 83 | +
|
| 84 | +In the Bi-LSTM CRF, we define two kinds of potentials: emission and |
| 85 | +transition. The emission potential for the word at index :math:`i` comes |
| 86 | +from the hidden state of the Bi-LSTM at timestep :math:`i`. The |
| 87 | +transition scores are stored in a :math:`|T| \times |T|` matrix
| 88 | +:math:`\textbf{P}`, where :math:`T` is the tag set. In my |
| 89 | +implementation, :math:`\textbf{P}_{j,k}` is the score of transitioning |
| 90 | +to tag :math:`j` from tag :math:`k`. So: |
| 91 | +
|
| 92 | +.. math:: \text{Score}(x,y) = \sum_i \log \psi_\text{EMIT}(y_i \rightarrow x_i) + \log \psi_\text{TRANS}(y_{i-1} \rightarrow y_i) |
| 93 | +
|
| 94 | +.. math:: = \sum_i h_i[y_i] + \textbf{P}_{y_i, y_{i-1}} |
| 95 | +
|
| 96 | +where in this second expression, we think of the tags as being assigned |
| 97 | +unique non-negative indices. |
| 98 | +
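Concretely, given the emission scores :math:`h` and the transition matrix :math:`\textbf{P}`, the score of a fixed tag sequence is a straight sum. A small sketch (the names are mine, and the tutorial's special START/STOP transition terms are omitted for brevity):

```python
import torch

def sequence_score(h, P, tags):
    # h[i, k]: emission score of tag k at timestep i (from the Bi-LSTM)
    # P[j, k]: score of transitioning to tag j from tag k
    score = h[0, tags[0]]  # first emission; no incoming transition
    for i in range(1, len(tags)):
        score = score + h[i, tags[i]] + P[tags[i], tags[i - 1]]
    return score

h = torch.tensor([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
P = torch.zeros(2, 2)
print(sequence_score(h, P, [0, 1, 0]))  # 1 + 2 + 3 = 6 with zero transitions
```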
|
| 99 | +If the above discussion was too brief, you can check out |
| 100 | +`this <http://www.cs.columbia.edu/%7Emcollins/crf.pdf>`__ write-up from
| 101 | +Michael Collins on CRFs. |
| 102 | +
|
| 103 | +Implementation Notes |
| 104 | +-------------------- |
| 105 | +
|
| 106 | +The example below implements the forward algorithm in log space to |
| 107 | +compute the partition function, and the Viterbi algorithm to decode.
| 108 | +Backpropagation will compute the gradients automatically for us. We |
| 109 | +don't have to do anything by hand. |
| 110 | +
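For intuition, the log partition function the forward algorithm computes can be checked against brute-force enumeration on a toy problem. This is a hedged miniature of the same idea, not the tutorial's `_forward_alg` (names are mine; START/STOP transitions are omitted):

```python
import itertools
import torch

def log_partition(h, P):
    # alpha[k]: log of the summed exp-scores of all paths ending in tag k
    alpha = h[0]
    for i in range(1, h.size(0)):
        # alpha'[j] = logsumexp_k( alpha[k] + P[j, k] ) + h[i, j]
        alpha = torch.logsumexp(alpha.unsqueeze(0) + P, dim=1) + h[i]
    return torch.logsumexp(alpha, dim=0)

def brute_force(h, P):
    # Enumerate every tag sequence and log-sum-exp their scores directly.
    n, T = h.shape
    scores = []
    for tags in itertools.product(range(T), repeat=n):
        s = h[0, tags[0]]
        for i in range(1, n):
            s = s + h[i, tags[i]] + P[tags[i], tags[i - 1]]
        scores.append(s)
    return torch.logsumexp(torch.stack(scores), dim=0)

h = torch.randn(4, 3)
P = torch.randn(3, 3)
assert torch.allclose(log_partition(h, P), brute_force(h, P), atol=1e-5)
```

The dynamic program touches O(n·|T|²) terms while the brute force touches |T|ⁿ sequences; they agree because the potentials only look at adjacent tags.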
|
| 111 | +The implementation is not optimized. If you understand what is going on,
| 112 | +you'll probably quickly see that iterating over the next tag in the
| 113 | +forward algorithm could instead be done in one big operation. I wanted
| 114 | +the code to be more readable. If you make that change, you could
| 115 | +probably use this tagger for real tasks.
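As a hint at that optimization: the per-tag loop in one forward step collapses into a single broadcasted log-sum-exp. An illustrative sketch with toy tensors (the names are mine, not the tutorial's):

```python
import torch

T = 4                      # number of tags (toy size)
alpha = torch.randn(T)     # forward variables from the previous timestep
emit = torch.randn(T)      # emission scores at the current timestep
trans = torch.randn(T, T)  # trans[j, k]: score of moving to tag j from tag k

# Loopy version: one log-sum-exp per next tag j
loopy = torch.stack([
    torch.logsumexp(alpha + trans[j] + emit[j], dim=0) for j in range(T)
])

# Vectorized version: broadcasting builds the full (j, k) score grid at once
vectorized = torch.logsumexp(alpha.unsqueeze(0) + trans + emit.unsqueeze(1), dim=1)

assert torch.allclose(loopy, vectorized)
```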
| 116 | +""" |
120 | 117 | # Author: Robert Guthrie |
121 | 118 |
|
122 | 119 | import torch |
@@ -358,7 +355,7 @@ def forward(self, sentence): # dont confuse this with _forward_alg above. |
358 | 355 |
|
359 | 356 | ###################################################################### |
360 | 357 | # Exercise: A new loss function for discriminative tagging |
361 | | -# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 358 | +# -------------------------------------------------------- |
362 | 359 | # |
363 | 360 | # It wasn't really necessary for us to create a computation graph when |
364 | 361 | # doing decoding, since we do not backpropagate from the viterbi path |
|