tures. Ritter et al. (2011) also constructed a labeled dataset, which contained 787 tweets, to empirically evaluate the performance of supervised methods on Twitter. Owoputi et al. (2013) incorporated word clusters into the feature sets and further improved the performance. From these works, we can observe that the size of the training data was much smaller than the newswire domain's.

Besides the challenge of lacking training data, the frequent use of out-of-vocabulary words also makes this problem difficult to address. Social media users often use informal ways of expressing their ideas and often spell words phonetically (e.g., "2mor" for "tomorrow"). In addition, they also make extensive use of emoticons and abbreviations (e.g., ":-)" for a smiling emotion and "LOL" for laughing out loud). Moreover, new symbols, abbreviations, and words are constantly being created. Figure 1 shows an example of a tagged tweet.

To tackle the challenges posed by the lack of training data and the out-of-vocabulary words, in this paper we propose a novel recurrent neural network, which we call the Target Preserved Adversarial Neural Network (TPANN), to perform the task. It can make use of a large quantity of annotated data from other resource-rich domains, unlabeled in-domain data, and a small amount of labeled in-domain data. All of these datasets can be easily obtained. To make use of unlabeled data, motivated by the work of Goodfellow et al. (2014) and Chen et al. (2016), the proposed method extends the bi-directional long short-term memory recurrent neural network (bi-LSTM) with an adversarial predictor. To overcome the defect that adversarial networks can merely learn the common features, we propose to use an autoencoder acting only on the target dataset to preserve its own specific features. To tackle the out-of-vocabulary problem, the proposed method also incorporates a character-level convolutional neural network to leverage subword information.

The contributions of this work are as follows:

• We propose to incorporate large scale unlabeled in-domain data, out-of-domain labeled data, and in-domain labeled data for the Twitter part-of-speech tagging task.

• We introduce a novel recurrent neural network, which can learn domain-invariant representations through in-domain and out-of-domain data and construct a cross-domain POS tagger through the learned representations. The proposed method also tries to preserve the specific features of the target domain.

• Experimental results demonstrate that the proposed method can lead to better performance in most cases on three different datasets.

2 Approach

In this work, we propose a novel recurrent neural network, the Target Preserved Adversarial Neural Network (TPANN), to learn common features between a resource-rich domain and the target domain while simultaneously preserving target domain-specific features. It extends the bi-directional LSTM with an adversarial network and an autoencoder. The architecture of TPANN is illustrated in Figure 2. The model consists of four components: Feature Extractor, POS Tagging Classifier, Domain Discriminator and Target Domain Autoencoder. In the following sections, we will detail each part of the proposed architecture and training methods.

[Figure 2: The TPANN architecture: a bi-LSTM feature extractor over PTB/Twitter input feeding a POS tagging classifier (L_task), a domain discriminator (L_type), and a target-domain decoder (L_target).]
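To make the four-component layout concrete, the following is a minimal PyTorch sketch of how such a model could be assembled. The module names and sizes are our own illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn


class TPANN(nn.Module):
    """Sketch of the four TPANN components; all names and sizes are
    illustrative assumptions rather than the authors' code."""

    def __init__(self, n_words, n_chars, n_tags,
                 word_dim=200, char_dim=25, char_filters=50, hidden=250):
        super().__init__()
        # Feature extractor F: char-CNN + word embeddings + bi-LSTM
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3)
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.encoder = nn.LSTM(word_dim + char_filters, hidden,
                               bidirectional=True, batch_first=True)
        # POS tagging classifier P (softmax over the tagset)
        self.pos_classifier = nn.Linear(2 * hidden, n_tags)
        # Domain discriminator Q (source vs. target), fed through a GRL
        self.domain_discriminator = nn.Linear(2 * hidden, 2)
        # Target-domain autoencoder R: LSTM decoder back onto the vocabulary
        self.decoder = nn.LSTM(4 * hidden, 2 * hidden, batch_first=True)
        self.vocab_proj = nn.Linear(2 * hidden, n_words)
```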
2.1 Feature Extractor

The feature extractor F adopts a CNN to extract character embedding features, which can tackle the out-of-vocabulary word problem effectively. To incorporate word embedding features, we concatenate the word embedding to the character embedding as the input of the bi-LSTM on the next layer. Utilizing a bi-LSTM to model sentences, F can extract sequential relations and context information.

We denote the input sentence as x and the i-th word as x_i. x_i ∈ S(x) and x_i ∈ T(x) represent input samples from the source domain and the target domain, respectively. We denote the parameters of F as θ_f. Let V be the vocabulary of words, and C be the vocabulary of characters. d is the dimensionality of the character embedding; then Q ∈ R^{d×|C|} is the representation matrix of the vocabulary. We assume that word x_i ∈ V is made up of a sequence of characters C_i = [c_1, c_2, . . . , c_l], where l is the max length of a word and every word is padded to this length. Then C_i ∈ R^{d×l} is the input of the CNN.

We apply a narrow convolution between C_i and a filter H ∈ R^{d×k}, where k is the width of the filter.
After that we add a bias and apply a nonlinearity to obtain a feature map m_i ∈ R^{l−k+1}. Specifically, the j-th element of m_i is given by:

m_i[j] = tanh( ⟨ C_i[∗, j : j+k−1], H ⟩ + b ),    (1)

where C_i[∗, j : j+k−1] is the j-th to (j+k−1)-th columns of C_i and ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. We then apply a max-over-time pooling operation (Collobert et al., 2011) over the feature map. The CNN uses multiple filters with varying widths to obtain the feature vector c_i for word x_i. Then, the character-level feature vector c_i is concatenated to the word embedding w_i to form the input of the bi-LSTM on the next layer. The word embeddings w are pretrained on 30 million tweets. The hidden states h of the bi-LSTM then become the features that are transferred to P, Q and R, i.e., F(x) = h.
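As a concrete illustration of the narrow convolution of Eq. (1) and the max-over-time pooling just described, here is a small PyTorch sketch for a single word. A single filter width is used for brevity (the model uses several widths), and the padded word length and number of filters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Character-level feature extraction for one word (Section 2.1 sketch).
d, l, k, n_filters = 25, 20, 3, 50          # char dim, padded length, filter width, filters
C_i = torch.randn(1, d, l)                  # character embeddings of one word, R^{d x l}
conv = nn.Conv1d(in_channels=d, out_channels=n_filters, kernel_size=k)  # narrow convolution
m_i = torch.tanh(conv(C_i))                 # feature maps, shape (1, n_filters, l - k + 1)
c_i = m_i.max(dim=2).values                 # max-over-time pooling -> (1, n_filters)

# c_i is then concatenated with the pretrained word embedding w_i and fed
# to the bi-LSTM, whose hidden states h serve as F(x).
w_i = torch.randn(1, 200)                   # assumed 200-dim word embedding
bilstm_input = torch.cat([c_i, w_i], dim=1)
```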
2.2 POS Tagging Classifier and Domain Discriminator

The POS tagging classifier P and the domain discriminator Q take F(x) as input. They are standard feed-forward networks with a softmax layer for classification. P predicts the POS tagging label to obtain classification capacity, and Q discriminates the domain label to make F(x) domain-invariant.

The POS tagging classifier P maps the feature vector F(x_i) to its label. We denote the parameters of this mapping as θ_y. The POS tagging classifier is trained on N_s samples from the source domain with the cross-entropy loss:

L_task = − Σ_{i=1}^{N_s} y_i · log ŷ_i ,    (2)

where y_i is the one-hot vector of the POS tagging label corresponding to x_i ∈ S(x), and ŷ_i is the output of the top softmax layer: ŷ_i = P(F(x_i)). During training, the parameters θ_f and θ_y are optimized to minimize the classification loss L_task. This ensures that P(F(x_i)) makes accurate predictions on the source domain.

Conversely, the domain discriminator maps the same hidden states h to the domain labels with parameters θ_d. The domain discriminator aims to discriminate the domain label with the following loss function:

L_type = − Σ_{i=1}^{N_s+N_t} { d_i · log d̂_i + (1 − d_i) · log(1 − d̂_i) },    (3)

where d_i is the ground-truth domain label for sample i, d̂_i is the output of the top layer: d̂_i = Q(F(x_i)), and N_t is the number of samples from the target domain. The domain discriminator is trained towards a saddle point of the loss function by minimizing the loss over θ_d while maximizing the loss over θ_f (Ganin et al., 2016). Optimizing θ_f ensures that the domain discriminator cannot discriminate
the domain, i.e., the feature extractor finds the common features between the two domains.
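A minimal sketch of these two heads and their losses (Eqs. 2 and 3) is given below. The feature dimension, tagset size, batch sizes, and the use of a two-way softmax in place of the sigmoid of Eq. (3) are simplifying assumptions of ours.

```python
import torch
import torch.nn as nn

feat_dim, n_tags = 500, 45
P = nn.Linear(feat_dim, n_tags)      # POS tagging classifier (source-domain labels)
Q = nn.Linear(feat_dim, 2)           # domain discriminator (source vs. target)

h_src = torch.randn(32, feat_dim)    # F(x) for N_s source tokens
h_tgt = torch.randn(32, feat_dim)    # F(x) for N_t target tokens
y_src = torch.randint(0, n_tags, (32,))   # gold POS tags, available on the source side only

# Eq. (2): cross-entropy tagging loss, computed on source samples only.
L_task = nn.functional.cross_entropy(P(h_src), y_src)

# Eq. (3): domain classification loss over source and target samples.
h_all = torch.cat([h_src, h_tgt], dim=0)
d_all = torch.cat([torch.zeros(32), torch.ones(32)]).long()  # 0 = source, 1 = target
L_type = nn.functional.cross_entropy(Q(h_all), d_all)
```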
2.3 Target Domain Autoencoder

By training adversarial networks we can obtain domain-invariant features h_common, but this weakens some domain-specific features which are useful for POS tagging classification. Merely obtaining domain-invariant features would therefore limit the classification ability.

Our model tries to tackle this defect by introducing a domain-specific autoencoder R, which attempts to reconstruct target domain data. Inspired by (Sutskever et al., 2014) but different from (Dai and Le, 2015), we treat the feature extractor F as the encoder. In addition, we combine the last hidden states of the forward LSTM and backward LSTM in F as the initial state h_0(dec) of the decoder LSTM. Hence, we do not need to reverse the order of words in the input sentences, and the model avoids the difficulty of "establishing communication" between the input and the output (Sutskever et al., 2014).

Similar to (Zhang et al., 2016), we use h_0(dec) and the embedding vector of the previous word as the inputs of the decoder, but in a computationally more efficient manner, by computing the previous word representation. We assume that (x̂_1, · · · , x̂_T) is the output sequence. z_t is the t-th word representation: z_t = MLP(h_t), where MLP is a multilayer perceptron. The hidden state is h_t = LSTM([h_0(dec) : z_{t−1}], h_{t−1}), where [· : ·] is the concatenation operation. We estimate the conditional probability p(x̂_1, · · · , x̂_T | h_0(dec)) as follows:

p(x̂_1, · · · , x̂_T | h_0(dec)) = Π_{t=1}^{T} p(x̂_t | h_0(dec), z_1, · · · , z_{t−1}),    (4)

where each distribution p(x̂_t | h_0(dec), z_1, · · · , z_{t−1}) is computed with a softmax over all the words in the vocabulary.

Our aim is to minimize the following loss function with respect to the parameters θ_r:

L_target = − Σ_{i=1}^{N_t} x_i · log x̂_i ,    (5)

where x_i is the one-hot vector of the i-th word. This makes h_0(dec) learn an undercomplete and most salient sentence representation of the target domain data. When the adversarial networks try to optimize the hidden representation towards the common representation h_common, the target domain autoencoder counteracts the tendency of the adversarial network to erase target domain features by optimizing the common representation to be informative on the target-domain data.
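The following is a minimal sketch of one decoding pass of this target-domain autoencoder (Eqs. 4 and 5), written with a single LSTM cell and feeding back z_{t−1} = MLP(h_{t−1}) as described above. All tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, T, V, H = 16, 12, 10000, 500            # batch, length, vocabulary, hidden size (assumed)
h0_dec = torch.randn(B, H)                 # combined last fwd/bwd encoder states
decoder = nn.LSTMCell(H + H, H)            # input at each step: [h0(dec) : z_{t-1}]
to_vocab = nn.Linear(H, V)                 # softmax over the vocabulary
mlp = nn.Linear(H, H)                      # z_t = MLP(h_t)

x_target = torch.randint(0, V, (B, T))     # target-domain word ids to reconstruct
h_t, c_t = h0_dec, torch.zeros(B, H)
z_prev = torch.zeros(B, H)
L_target = 0.0
for t in range(T):
    h_t, c_t = decoder(torch.cat([h0_dec, z_prev], dim=1), (h_t, c_t))
    logits = to_vocab(h_t)                                   # p(x_t | h0(dec), z_<t)
    L_target = L_target + nn.functional.cross_entropy(logits, x_target[:, t])
    z_prev = mlp(h_t)                                        # previous-word representation
```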
2.4 Training

Our model can be trained end-to-end with standard back-propagation, which we detail in this section.

Our ultimate training goal is to minimize the total loss function with parameters {θ_f, θ_y, θ_r, θ_d} as follows:

L_total = α L_task + β L_target + γ L_type ,    (6)

where α, β, γ are weights that balance the effects of P, R and Q.

To obtain the domain-invariant representation h_common, inspired by (Ganin and Lempitsky, 2015), we introduce a special gradient reversal layer (GRL), which does nothing during forward propagation but negates the gradients during backward propagation, i.e., g(F(x)) = F(x) but ∇g(F(x)) = −λ∇F(x). We insert the GRL between F and Q, so that we can run standard stochastic gradient descent with respect to θ_f and θ_d. The factor −λ drives the parameters θ_f not to amplify the dissimilarity of the features when minimizing L_type. So, by introducing a GRL, F can drive its parameters θ_f to extract hidden representations that help the POS tagging classification and hamper the domain discrimination.
In order to preserve target domain-specific features, we only optimize the autoencoder on target domain data for the reconstruction task. Through the above procedures, the model can learn the common features between domains while simultaneously preserving target domain-specific features. Finally, we can update the parameters as follows:

θ_f = θ_f − µ ( α ∂L^i_task/∂θ_f + β ∂L^i_target/∂θ_f − γ λ ∂L^i_type/∂θ_f )
θ_y = θ_y − µ · α ∂L^i_task/∂θ_y
θ_r = θ_r − µ · β ∂L^i_target/∂θ_r
θ_d = θ_d − µ · γ ∂L^i_type/∂θ_d    (7)

where µ is the learning rate. Because the size of the WSJ is more than 100 times that of the labeled Twitter dataset, directly training the model on the combined dataset gives much worse final results than using two training steps. So, we adopt adversarial training on the WSJ and the unlabeled Twitter dataset in the first step, and then use a small amount of in-domain labeled data to fine-tune the parameters with a low learning rate.
3 Experiments

In this section, we detail the datasets used for the experiments and the experimental setup.

3.1 Datasets

The methods proposed in this work incorporate out-of-domain labeled data from resource-rich domains, large scale unlabeled in-domain data, and a small amount of labeled in-domain data. The datasets used in this work are as follows:

Labeled out-of-domain data. We use a standard benchmark dataset for adversarial POS tagging, namely the Wall Street Journal (WSJ) data from the Penn TreeBank v3 (Marcus et al., 1993), sections 0-24, as the out-of-domain data.

Labeled in-domain data. For training and evaluating POS tagging approaches, we compare the proposed method with other approaches on three benchmarks: RIT-Twitter (Ritter et al., 2011), NPSCHAT (Forsyth, 2007), and ARK-Twitter (Gimpel et al., 2011).

Unlabeled in-domain data. For training the adversarial network, we need a dataset of large scale unlabeled tweets. Hence, in this work we construct large scale unlabeled data (UNL) from Twitter using its API.

The detailed statistics of the datasets used in this work are listed in Table 1.

Dataset                    # Tokens
WSJ                       1,173,766
UNL                       1,177,746
RIT-Twitter  RIT-Train       10,652
             RIT-Dev          2,242
             RIT-Test         2,291
NPSCHAT                      44,997
ARK-Twitter  OCT27           26,594
             DAILY547          7,707

Table 1: The statistics of the datasets used in our experiments.

3.2 Experimental Setup

We select both state-of-the-art and classic methods for comparison, as follows:

• Stanford POS Tagger: The Stanford POS Tagger is a widely used tool for newswire domains (Toutanova et al., 2003). In this work, we train it on two different sets, the WSJ (sections 0-18) and a mixed WSJ, IRC, and Twitter corpus. We use Stanford-WSJ and Stanford-MIX to denote them, respectively.

• T-POS: T-POS (Ritter et al., 2011) adopts Conditional Random Fields and a clustering algorithm to perform the task. It was trained on a mixture of hand-annotated tweets and existing POS-labeled data.

• GATE Tagger: The GATE tagger (Derczynski et al., 2013) is based on vote-constrained bootstrapping with unlabeled data. It combines cases where the available taggers use different tagsets.

• ARK Tagger: The ARK tagger (Owoputi et al., 2013) is the system that reports the best accuracy on the RIT dataset. It uses unsupervised word clustering and a variety of lexical features.

• bi-LSTM: Bidirectional Long Short-Term Memory (LSTM) networks have been widely used in a variety of sequence labeling tasks (Graves and Schmidhuber, 2005). In this work, we evaluate them at the character level, at the word level, and with both combined. bi-LSTM (word level) uses one layer of bi-LSTM to extract word-level features and adopts random initialization to transform words into vectors. bi-LSTM (character level) combines the bi-LSTM with CNN-based character embeddings, similar to the character-aware neural network described in (Kim et al., 2015), to handle out-of-vocabulary words. bi-LSTM (word level pretrain) has the same architecture as bi-LSTM (word level) but adopts the word2vec tool (Mikolov et al., 2013) to vectorize words. bi-LSTM (combine) concatenates the word features with the character features.
Methods RIT-Test RIT-Dev
Stanford-WSJ (Toutanova et al., 2003) 73.37% 83.29%
Stanford-MIX 83.14% 84.19%
T-POS (Ritter et al., 2011) 84.55% 84.83%
GATE Tagger (Derczynski et al., 2013) 88.69% 89.37%
ARK Tagger (Owoputi et al., 2013) 90.40% -
bi-LSTM (word level) 75.91% 76.94%
bi-LSTM (word level pretrain) 85.99% 86.93%
bi-LSTM (character level) 82.85% 84.30%
bi-LSTM (combine) 89.48% 89.30%
bi-LSTM (combine + WSJ) 83.54% 83.64%
bi-LSTM (combine + WSJ + adversarial) 83.76% 84.45%
bi-LSTM (combine + WSJ + fine-tune) 89.87% 90.23%
bi-LSTM (combine + WSJ + adversarial + fine-tune) 90.60% 90.73%
TPANN (combine + WSJ + adversarial + fine-tune + autoencoder) 90.92% 91.08%
Table 2: Token-level accuracies of different methods on RIT-Test and RIT-Dev. bi-LSTM (combine) refers to combining the word level with the character level. bi-LSTM (combine + WSJ) refers to the model trained on WSJ and tested on RIT. bi-LSTM (combine + WSJ + adversarial) refers to the adversarial model trained on 1.1 million tokens of labeled WSJ data and the same scale of unlabeled Twitter data, then tested on RIT. Fine-tune means additionally fine-tuning on the RIT-Train data.
The hyper-parameters used for our model are as follows. The AdaGrad optimizer with the cross-entropy loss is used with 0.1 as the default learning rate. The dimensionality of the word embeddings is set to 200. The dimensionality of the randomly initialized character embeddings is set to 25. We adopt a bi-LSTM for encoding, with each layer consisting of 250 hidden neurons. We use three layers of standard LSTM for decoding, each consisting of 500 hidden neurons. The Adam optimizer with the cross-entropy loss is used for fine-tuning, with 0.0001 as the default learning rate. Fine-tuning is run for up to 100 epochs with early stopping.
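For reference, the same settings collected as a single configuration dictionary; the grouping and key names are ours.

```python
HPARAMS = {
    "word_emb_dim": 200,
    "char_emb_dim": 25,
    "encoder": {"type": "bi-LSTM", "hidden_per_layer": 250},
    "decoder": {"type": "LSTM", "layers": 3, "hidden_per_layer": 500},
    "adversarial_stage": {"optimizer": "AdaGrad", "lr": 0.1, "loss": "cross-entropy"},
    "fine_tune_stage": {"optimizer": "Adam", "lr": 1e-4,
                        "max_epochs": 100, "early_stopping": True},
}
```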
4 Results and Discussion

In this section, we report the experimental results and a detailed analysis of the results on the three different datasets.

4.1 Evaluation on RIT-Twitter

RIT-Twitter is split into training, development and evaluation sets (RIT-Train, RIT-Dev, RIT-Test). The splitting method follows (Derczynski et al., 2013), and the dataset statistics are listed in Table 1. Table 2 shows the results of our method and of other approaches on the RIT-Twitter dataset. RIT-Twitter uses the PTB tagset with several Twitter-specific tags: retweets, @usernames, hashtags, and urls. Since words in these categories can be tagged almost perfectly using simple regular expressions, similar to (Owoputi et al., 2013), we use regular expressions to tag these words for all systems.

From the results of Stanford-WSJ, we can observe that the newswire domain is different from Twitter. Although the token-level accuracy of the Stanford POS Tagger is higher than 97.0% on the PTB dataset, its performance on Twitter drops sharply to 73.37%. By incorporating some in-domain labeled data for training, the accuracy of the Stanford POS Tagger can reach 83.14%. Taking a variety of linguistic features and many other resources into consideration, the T-POS, GATE tagger, and ARK tagger achieve better performance.

The second part of Table 2 shows the results of the bi-LSTM based methods, which are trained on the RIT-Train dataset. According to the word-level results, we can see that word2vec provides valuable information: the pre-trained word vectors in bi-LSTM (word level pretrain) give almost 10% higher accuracy than bi-LSTM (word level). Comparing the character-level bi-LSTM with the word-level bi-LSTM with random initialization, we observe that the character-level method achieves better performance than the word-level method. bi-LSTM (combine) combines word with character features, as described in Section 2.1,
Figure 3: Visualization of the features extracted by the bi-LSTM. The left figure shows the results when no adversarial training is performed. The right figure shows the results when the adversarial procedure is incorporated into training. Blue points correspond to source (PTB) domain examples, and red points correspond to the target (Twitter) domain.
Methods                            Accuracy
Gimpel et al. (2011) version 0.2      90.8%
ARK Tagger                            93.2%
TPANN                                 92.8%

Table 4: Tagging accuracies on DAILY547.

4.3 Evaluation on ARK-Twitter

The ARK-Twitter data contains an entire dataset consisting of tweets sampled from one particular day (October 27, 2010), described in (Gimpel et al., 2011). This part is used for training. They also created another dataset, which consists of 547 tweets, for evaluation (DAILY547). This dataset consists of one random English tweet from every day between January 1, 2011 and June 30, 2012. The distribution of the training data may be slightly different from the testing data; for example, a substantial fraction of the messages in the training data are about a basketball game. Since ARK-Twitter uses a different tagset from the PTB, we manually construct a table to link the tags of the two datasets.

Table 4 shows the results of different methods on this dataset. From the results, we can see that our method achieves a better result than (Gimpel et al., 2011). However, the performance of our method is worse than the ARK Tagger. Through analyzing the errors, we find that 16.7% of the errors occur between nouns and proper nouns. Since our method does not include any ontology or knowledge base, proper nouns cannot be easily detected. The ARK Tagger, however, adds a token-level name list feature, which is useful for proper noun recognition and fires on names from many sources, such as Freebase lists of celebrities, the Moby Words list of US Locations, proper names from Mark Kantrowitz's name corpus, and so on. So our model is still competitive when such manual feature knowledge is unavailable.

5 Related Work

Part-of-Speech tagging is an important pre-processing step and can provide valuable information for various natural language processing tasks. In recent years, deep learning algorithms have been successfully used for POS tagging. A number of approaches have been proposed and have achieved some progress. Santos and Guimaraes (2015) proposed using a character-based convolutional neural network to perform the POS tagging problem. Bi-LSTMs with word, character or unicode byte embeddings were also introduced for POS tagging and named entity recognition tasks (Plank et al., 2016; Chiu and Nichols, 2015; Ma and Hovy, 2016). In this work, we study the problem from a domain adaptation perspective. Inspired by these works, we also propose to use character-level methods to handle out-of-vocabulary words and bi-LSTMs to model the sequence relations.

Adversarial networks have been successfully used for image generation (Goodfellow et al., 2014; Dosovitskiy et al., 2015; Denton et al., 2015), domain adaptation (Tzeng et al., 2014; Ganin et al., 2016), and semi-supervised learning (Denton et al., 2016). The key idea of adversarial networks for domain adaptation is to construct invariant features by optimizing the feature extractor as an adversary against the domain classifier (Zhang et al., 2017).

A sequence autoencoder reads the input sequence into a vector and then tries to reconstruct it. Dai and Le (2015) used the model on a number of different tasks and verified its validity. Li et al. (2015) introduced the model to hierarchically build an embedding for a paragraph, showing that the model was able to encode texts so as to preserve syntactic, semantic, and discourse coherence.

In this work, we incorporate adversarial networks with an autoencoder to obtain domain-invariant features while keeping domain-specific features. Our model is thus more suitable for target domain tasks.

6 Conclusion

In this work, we propose a novel adversarial neural network to address the POS tagging problem. Besides learning common representations between the source domain and the target domain, it can simultaneously preserve specific features of the target domain. The proposed method leverages newswire resources and large scale in-domain unlabeled data to help POS tagging classification on Twitter, which has only a small amount of labeled data. We evaluate the proposed method and several state-of-the-art methods on three different corpora. In most of the cases, the proposed method achieves better performance than previous methods. Experimental results demonstrate that the proposed
method can make full use of these resources, which can be easily obtained.

References

Eric N Forsyth. 2007. Improving automated lexical and discourse analysis of online chat dialog.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

Michael J Paul and Mark Dredze. 2011. You are what you tweet: Analyzing twitter for public health. ICWSM, 20:265–272.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. arXiv preprint arXiv:1604.05529.

Kumanan Wilson and John S Brownstein. 2009. Early detection of disease outbreaks using the internet. Canadian Medical Association Journal, 180(8):829–831.

Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016. Generating text via adversarial training. NIPS 2016 Workshop on Adversarial Training.

Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188.