
Part-of-Speech Tagging for Twitter with Adversarial Neural Networks

Tao Gui, Qi Zhang∗, Haoran Huang, Minlong Peng, Xuanjing Huang


Shanghai Key Laboratory of Intelligent Information Processing
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{tgui16,qz,huanghr15,mlpeng16,xjhuang}@fudan.edu.cn

∗ Corresponding author.

Abstract

In this work, we study the problem of part-of-speech tagging for Tweets. In contrast to newswire articles, Tweets are usually informal and contain numerous out-of-vocabulary words. Moreover, there is a lack of large-scale labeled datasets for this domain. To tackle these challenges, we propose a novel neural network to make use of out-of-domain labeled data, unlabeled in-domain data, and labeled in-domain data. Inspired by adversarial neural networks, the proposed method tries to learn common features through an adversarial discriminator. In addition, we hypothesize that domain-specific features of the target domain should be preserved to some degree. Hence, the proposed method adopts a sequence-to-sequence autoencoder to perform this task. Experimental results on three different datasets show that our method achieves better performance than state-of-the-art methods.

Figure 1: An example of a tagged Tweet (@DORSEY33/USR lol/UH aw/UH i/PRP thought/VBD u/PRP was/VBD talkin/VBG bout/IN another/DT time/NN ./. nd/CC i/PRP dnt/VBP see/VB u/PRP either/RB !/.), which contains nonstandard orthography, emoticons, and abbreviations. The tagset is defined similarly to that of the PTB (Marcus et al., 1993).

1 Introduction

During the last decade, social media have become extremely popular, and billions of user-generated contents are posted on them every day. Many users write about their thoughts and lives on the go. The massive unstructured data from social media provides valuable information for a variety of applications such as stock prediction (Bollen et al., 2011), public health analysis (Wilson and Brownstein, 2009; Paul and Dredze, 2011), real-time event detection (Sakaki et al., 2010), and so on. The quality of these applications is highly impacted by the performance of natural language processing tasks.

Part-of-speech (POS) tagging is one of the most important natural language processing tasks. It has also been widely used in social media analysis systems (Ritter et al., 2012; Lamb et al., 2013; Kiritchenko et al., 2014). Most state-of-the-art POS tagging approaches are based on supervised methods. Hence, they usually require a large amount of annotated data to train models. Many datasets have been constructed for the POS tagging task. Because newswire articles are carefully edited, benchmarks usually use them for annotation (Marcus et al., 1993). However, user-generated contents on social media are usually informal and contain many nonstandard lexical items. Moreover, the difference in domains between training data and evaluation data may heavily impact the performance of approaches based on supervised methods (Caruana and Niculescu-Mizil, 2006). Hence, most POS tagging methods cannot achieve the same performance as reported on the newswire domain when applied to Twitter (Owoputi et al., 2013).

To perform the Twitter POS tagging task, several approaches have been proposed. Gimpel et al. (2011) manually annotated 1,827 tweets and carefully studied various features.
Ritter et al. (2011) also constructed a labeled dataset, which contained 787 tweets, to empirically evaluate the performance of supervised methods on Twitter. Owoputi et al. (2013) incorporated word clusters into the feature sets and further improved the performance. From these works, we can observe that the size of the training data was much smaller than the newswire domain's.

Besides the challenge of the lack of training data, the frequent use of out-of-vocabulary words also makes this problem difficult to address. Social media users often use informal ways of expressing their ideas and often spell words phonetically (e.g., "2mor" for "tomorrow"). In addition, they also make extensive use of emoticons and abbreviations (e.g., ":-)" for a smiling emotion and "LOL" for laughing out loud). Moreover, new symbols, abbreviations, and words are constantly being created. Figure 1 shows an example of a tagged Tweet.

To tackle the challenges posed by the lack of training data and the out-of-vocabulary words, in this paper we propose a novel recurrent neural network, which we call the Target Preserved Adversarial Neural Network (TPANN), to perform the task. It can make use of a large quantity of annotated data from other resource-rich domains, unlabeled in-domain data, and a small amount of labeled in-domain data. All of these datasets can be easily obtained. To make use of unlabeled data, motivated by the work of Goodfellow et al. (2014) and Chen et al. (2016), the proposed method extends the bi-directional long short-term memory recurrent neural network (bi-LSTM) with an adversarial predictor. To overcome the defect that adversarial networks can merely learn the common features, we propose to use an autoencoder acting only on the target dataset to preserve its own specific features. To tackle the out-of-vocabulary problem, the proposed method also incorporates a character-level convolutional neural network to leverage subword information.

The contributions of this work are as follows:

• We propose to incorporate large-scale unlabeled in-domain data, out-of-domain labeled data, and in-domain labeled data for the Twitter part-of-speech tagging task.

• We introduce a novel recurrent neural network, which can learn domain-invariant representations through in-domain and out-of-domain data and construct a cross-domain POS tagger through the learned representations. The proposed method also tries to preserve the specific features of the target domain.

• Experimental results demonstrate that the proposed method can lead to better performance in most cases on three different datasets.

2 Approach

In this work, we propose a novel recurrent neural network, the Target Preserved Adversarial Neural Network (TPANN), to learn common features between a resource-rich domain and the target domain while simultaneously preserving target domain-specific features. It extends the bi-directional LSTM with an adversarial network and an autoencoder. The architecture of TPANN is illustrated in Figure 2. The model consists of four components: a Feature Extractor, a POS Tagging Classifier, a Domain Discriminator, and a Target Domain Autoencoder. In the following sections, we will detail each part of the proposed architecture and the training methods; a minimal skeleton of the four components is sketched below.
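To make the four-component decomposition concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation; the layer sizes and the single-filter-width character CNN are assumptions made for brevity.

import torch.nn as nn

class TPANN(nn.Module):
    """Illustrative skeleton of the four components described above (not the authors' code)."""
    def __init__(self, n_words, n_chars, n_tags, word_dim=200, char_dim=25, hidden=250):
        super().__init__()
        # Feature extractor F: character CNN + word embeddings + bi-LSTM (Section 2.1)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, 50, kernel_size=3)   # a single filter width, for brevity
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.encoder = nn.LSTM(word_dim + 50, hidden, bidirectional=True, batch_first=True)
        # POS tagging classifier P and domain discriminator Q (Section 2.2)
        self.tagger = nn.Linear(2 * hidden, n_tags)
        self.discriminator = nn.Linear(2 * hidden, 2)
        # Target-domain autoencoder R: a decoder that reconstructs target sentences (Section 2.3)
        self.decoder = nn.LSTM(word_dim + 2 * hidden, 500, num_layers=3, batch_first=True)
        self.vocab_proj = nn.Linear(500, n_words)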
2.1 Feature Extractor

The feature extractor F adopts a CNN to extract character embedding features, which can tackle the out-of-vocabulary word problem effectively. To incorporate word embedding features, we concatenate the word embedding to the character embedding as the input of the bi-LSTM on the next layer. Utilizing a bi-LSTM to model sentences, F can extract sequential relations and context information.

We denote the input sentence as x and the i-th word as x_i. x_i ∈ S(x) and x_i ∈ T(x) indicate that the input samples are from the source domain and the target domain, respectively. We denote the parameters of F as θ_f. Let V be the vocabulary of words and C the vocabulary of characters. d is the dimensionality of the character embedding; then Q ∈ R^{d×|C|} is the representation matrix of the character vocabulary. We assume that word x_i ∈ V is made up of a sequence of characters C_i = [c_1, c_2, ..., c_l], where l is the maximum word length and every word is padded to this length. Then C_i ∈ R^{d×l} is the input of the CNN.

We apply a narrow convolution between C_i and a filter H ∈ R^{d×k}, where k is the width of the filter.
Figure 2: The general architecture of the proposed method. (The original figure shows WSJ/PTB and Twitter sentences passing through a shared CNN and stacked bi-LSTM feature extractor, whose outputs feed the POS tagging classifier (L_task), the domain discriminator (L_type), and the Twitter-side decoder (L_target).)

After that we add a bias and apply a nonlinearity to obtain a feature map m_i ∈ R^{l−k+1}. Specifically, the j-th element of m_i is given by:

m_i[j] = tanh(⟨C_i[∗, j : j+k−1], H⟩ + b),   (1)

where C_i[∗, j : j+k−1] is the j-th to (j+k−1)-th columns of C_i and ⟨A, B⟩ = Tr(AB^T) is the Frobenius inner product. We then apply a max-over-time pooling operation (Collobert et al., 2011) over the feature map. The CNN uses multiple filters with varying widths to obtain the feature vector c_i for word x_i. Then, the character-level feature vector c_i is concatenated to the word embedding w_i to form the input of the bi-LSTM on the next layer. The word embeddings w are pretrained on 30 million tweets. The hidden states h of the bi-LSTM then become the features that are transferred to P, Q and R, i.e. F(x) = h. A minimal sketch of this character-level CNN is given below.
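The following sketch shows one way to realize the narrow convolution of Eq. (1) with max-over-time pooling; the filter widths and the number of filters per width are assumptions, not values given in the paper.

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Illustrative character-level CNN of Section 2.1 (not the authors' code)."""
    def __init__(self, n_chars, char_dim=25, widths=(1, 2, 3, 4, 5), n_filters=30):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        # One narrow convolution per filter width k, as in Eq. (1)
        self.convs = nn.ModuleList([nn.Conv1d(char_dim, n_filters, kernel_size=k) for k in widths])

    def forward(self, char_ids):                  # char_ids: (batch, l), words padded to length l
        x = self.emb(char_ids).transpose(1, 2)    # (batch, d, l), i.e. the matrix C_i
        feats = []
        for conv in self.convs:
            m = torch.tanh(conv(x))               # feature map m_i of length l - k + 1
            feats.append(m.max(dim=2).values)     # max-over-time pooling
        return torch.cat(feats, dim=1)            # character-level feature vector c_i

# c_i is then concatenated with the pretrained word embedding w_i and fed to the bi-LSTM.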
2.2 POS Tagging Classifier and Domain Discriminator

The POS tagging classifier P and the domain discriminator Q take F(x) as input. They are standard feed-forward networks with a softmax layer for classification. P predicts the POS tagging label to obtain classification capacity, and Q discriminates the domain label to make F(x) domain-invariant.

The POS tagging classifier P maps the feature vector F(x_i) to its label. We denote the parameters of this mapping as θ_y. The POS tagging classifier is trained on N_s samples from the source domain with the cross-entropy loss:

L_task = − Σ_{i=1}^{N_s} y_i · log ŷ_i,   (2)

where y_i is the one-hot vector of the POS tagging label corresponding to x_i ∈ S(x), and ŷ_i is the output of the top softmax layer: ŷ_i = P(F(x_i)). During training, the parameters θ_f and θ_y are optimized to minimize the classification loss L_task. This ensures that P(F(x_i)) can make accurate predictions on the source domain.

Conversely, the domain discriminator maps the same hidden states h to the domain labels with parameters θ_d. The domain discriminator aims to discriminate the domain label with the following loss function:

L_type = − Σ_{i=1}^{N_s+N_t} [ d_i · log d̂_i + (1 − d_i) · log(1 − d̂_i) ],   (3)

where d_i is the ground-truth domain label for sample i, and d̂_i is the output of the top layer: d̂_i = Q(F(x_i)). N_t is the number of samples from the target domain. The domain discriminator is trained towards a saddle point of the loss function by minimizing the loss over θ_d while maximizing the loss over θ_f (Ganin et al., 2016). Optimizing θ_f ensures that the domain discriminator cannot discriminate the domain, i.e., the feature extractor finds the common features between the two domains. The two heads are sketched below.
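A minimal sketch of the two heads and their losses follows; the feature dimension and tagset size are placeholders, and Eq. (3) is written here as a two-class cross entropy, which is equivalent to the binary form above.

import torch.nn as nn

class Heads(nn.Module):
    """Illustrative sketch of P and Q (not the authors' code); sizes are placeholders."""
    def __init__(self, feat_dim=500, n_tags=48):
        super().__init__()
        self.tagger = nn.Linear(feat_dim, n_tags)   # P: POS tagging classifier
        self.domain = nn.Linear(feat_dim, 2)        # Q: domain discriminator
        self.xent = nn.CrossEntropyLoss()

    def losses(self, h_source, tags, h_all, domains):
        # Eq. (2): cross entropy over source-domain tokens only
        l_task = self.xent(self.tagger(h_source), tags)
        # Eq. (3): domain classification over source + target tokens
        l_type = self.xent(self.domain(h_all), domains)
        return l_task, l_type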
2.3 Target Domain Autoencoder

Through training adversarial networks, we can obtain domain-invariant features h_common, but this will weaken some domain-specific features which are useful for POS tagging classification. Merely obtaining domain-invariant features would therefore limit the classification ability.

Our model tries to tackle this defect by introducing a domain-specific autoencoder R, which attempts to reconstruct target-domain data. Inspired by (Sutskever et al., 2014) but different from (Dai and Le, 2015), we treat the feature extractor F as the encoder. In addition, we combine the last hidden states of the forward LSTM and backward LSTM in F as the initial state h_0(dec) of the decoder LSTM. Hence, we do not need to reverse the order of the words of the input sentences, and the model avoids the difficulty of "establishing communication" between the input and the output (Sutskever et al., 2014).

Similar to (Zhang et al., 2016), we use h_0(dec) and the embedding vector of the previous word as the inputs of the decoder, but in a computationally more efficient manner by computing the previous word representation. We assume that (x̂_1, ..., x̂_T) is the output sequence. z_t is the t-th word representation: z_t = MLP(h_t), where MLP is a multilayer perceptron. The hidden state is h_t = LSTM([h_0(dec) : z_{t−1}], h_{t−1}), where [· : ·] is the concatenation operation. We estimate the conditional probability p(x̂_1, ..., x̂_T | h_0(dec)) as follows:

p(x̂_1, ..., x̂_T | h_0(dec)) = Π_{t=1}^{T} p(x̂_t | h_0(dec), z_1, ..., z_{t−1}),   (4)

where each distribution p(x̂_t | h_0(dec), z_1, ..., z_{t−1}) is computed with a softmax over all the words in the vocabulary.

Our aim is to minimize the following loss function with respect to the parameters θ_r:

L_target = − Σ_{i=1}^{N_t} x_i · log x̂_i,   (5)

where x_i is the one-hot vector of the i-th word. This makes h_0(dec) learn an undercomplete and maximally salient sentence representation of the target-domain data. When the adversarial networks try to optimize the hidden representation towards the common representation h_common, the target-domain autoencoder counteracts the tendency of the adversarial network to erase target-domain features by optimizing the common representation to be informative on the target-domain data. A sketch of the decoder is given below.
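The sketch below illustrates this decoder. The initialization from h_0(dec) and the per-step reconstruction loss follow the description above, while the dimensionalities and the single-layer LSTMCell are simplifying assumptions.

import torch
import torch.nn as nn

class TargetDecoder(nn.Module):
    """Illustrative sketch of the target-domain decoder R (not the authors' code)."""
    def __init__(self, vocab_size, feat_dim=500, hidden=500):
        super().__init__()
        self.init = nn.Linear(feat_dim, hidden)        # map h_0(dec) to the initial decoder state
        self.mlp = nn.Linear(hidden, feat_dim)         # z_t = MLP(h_t)
        self.cell = nn.LSTMCell(2 * feat_dim, hidden)  # input at step t is [h_0(dec) : z_{t-1}]
        self.out = nn.Linear(hidden, vocab_size)
        self.xent = nn.CrossEntropyLoss()

    def forward(self, h0_dec, target_ids):
        # h0_dec: (batch, feat_dim); target_ids: (batch, T) word indices of a target-domain sentence
        batch, T = target_ids.shape
        h = torch.tanh(self.init(h0_dec))
        c = torch.zeros_like(h)
        z_prev = torch.zeros(batch, h0_dec.size(1), device=h0_dec.device)
        l_target = 0.0
        for t in range(T):
            h, c = self.cell(torch.cat([h0_dec, z_prev], dim=1), (h, c))
            l_target = l_target + self.xent(self.out(h), target_ids[:, t])  # Eq. (5), summed over steps
            z_prev = self.mlp(h)
        return l_target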
2.4 Training

Our model can be trained end-to-end with standard back-propagation, which we detail in this section.

Our ultimate training goal is to minimize the total loss function with parameters {θ_f, θ_y, θ_r, θ_d} as follows:

L_total = α·L_task + β·L_target + γ·L_type,   (6)

where α, β, γ are the weights that balance the effects of P, R and Q.

To obtain the domain-invariant representation h_common, inspired by (Ganin and Lempitsky, 2015), we introduce a special gradient reversal layer (GRL), which does nothing during forward propagation but negates the gradients during backward propagation, i.e. g(F(x)) = F(x) but ∇g(F(x)) = −λ∇F(x). We insert the GRL between F and Q, so that we can run standard Stochastic Gradient Descent with respect to θ_f and θ_d. The factor −λ drives the parameters θ_f not to amplify the dissimilarity of features when minimizing L_type. So, by introducing a GRL, F can drive its parameters θ_f to extract hidden representations that help the POS tagging classification and hamper the domain discrimination.

In order to preserve target domain-specific features, we only optimize the autoencoder on target-domain data for the reconstruction task. Through the above procedures, the model can learn the common features between domains and simultaneously preserve target domain-specific features. Finally, we can update the parameters as follows:

θ_f = θ_f − μ·(α·∂L^i_task/∂θ_f + β·∂L^i_target/∂θ_f − γ·λ·∂L^i_type/∂θ_f)
θ_y = θ_y − μ·α·∂L^i_task/∂θ_y
θ_r = θ_r − μ·β·∂L^i_target/∂θ_r
θ_d = θ_d − μ·γ·∂L^i_type/∂θ_d,   (7)

where μ is the learning rate. A common way to implement the GRL and the combined objective is sketched below.
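The gradient reversal layer can be written as a custom autograd function; the sketch below is a common formulation of this idea (following Ganin and Lempitsky, 2015), not the authors' code, and the trailing comments only indicate how the single objective of Eq. (6) realizes the updates of Eq. (7).

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                      # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None      # negate (and scale) gradients on the way back

def grl(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Training step (schematic): the discriminator sees the reversed features, so minimizing
#   total = alpha * l_task + beta * l_target + gamma * l_type_on(grl(h))
# with a single optimizer reproduces the opposing updates of Eq. (7):
#   total.backward(); optimizer.step()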
Because the size of the WSJ is more than 100 times that of the labeled Twitter dataset, if we directly train the model on the combined dataset, the final results are much worse than those obtained with two training steps. So, we adopt adversarial training on the WSJ and the unlabeled Twitter dataset in the first step, and then use a small amount of in-domain labeled data to fine-tune the parameters with a low learning rate, as sketched below.
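The two-step procedure can be organized as follows; this is only a sketch of the schedule, and the total_loss/tagging_loss methods and the data loaders are hypothetical names, not part of any released code.

import torch

def train_tpann(model, wsj_loader, unlabeled_tweet_loader, rit_train_loader):
    # Phase 1: adversarial training on labeled WSJ plus unlabeled tweets.
    opt = torch.optim.Adagrad(model.parameters(), lr=0.1)
    for src_batch, tgt_batch in zip(wsj_loader, unlabeled_tweet_loader):
        loss = model.total_loss(src_batch, tgt_batch)   # alpha*L_task + beta*L_target + gamma*L_type
        opt.zero_grad(); loss.backward(); opt.step()

    # Phase 2: fine-tune on the small labeled in-domain set with a low learning rate.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for batch in rit_train_loader:
        loss = model.tagging_loss(batch)                # supervised L_task only
        opt.zero_grad(); loss.backward(); opt.step()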
3 Experiments

In this section, we detail the datasets used for the experiments and the experimental setup.

3.1 Datasets

The methods proposed in this work incorporate out-of-domain labeled data from resource-rich domains, large-scale unlabeled in-domain data, and a small amount of labeled in-domain data. The datasets used in this work are as follows:

Labeled out-of-domain data. We use a standard benchmark dataset for adversarial POS tagging, namely the Wall Street Journal (WSJ) data from the Penn TreeBank v3 (Marcus et al., 1993), sections 0-24, for the out-of-domain data.

Labeled in-domain data. For training and evaluating POS tagging approaches, we compare the proposed method with other approaches on three benchmarks: RIT-Twitter (Ritter et al., 2011), NPSCHAT (Forsyth, 2007), and ARK-Twitter (Gimpel et al., 2011).

Unlabeled in-domain data. For training the adversarial network, we need a dataset of large-scale unlabeled tweets. Hence, in this work, we construct large-scale unlabeled data (UNL) from Twitter using its API.

The detailed statistics of the datasets used in this work are listed in Table 1.

Dataset        Split       # Tokens
WSJ            -           1,173,766
UNL            -           1,177,746
RIT-Twitter    RIT-Train   10,652
               RIT-Dev     2,242
               RIT-Test    2,291
NPSCHAT        -           44,997
ARK-Twitter    OCT27       26,594
               DAILY547    7,707

Table 1: The statistics of the datasets used in our experiments.

3.2 Experimental Setup

We select both state-of-the-art and classic methods for comparison, as follows:

• Stanford POS Tagger: The Stanford POS Tagger is a widely used tool for newswire domains (Toutanova et al., 2003). In this work, we train it using two different sets, the WSJ (sections 0-18) and a WSJ, IRC, and Twitter mixed corpus. We use Stanford-WSJ and Stanford-MIX to denote them, respectively.

• T-POS: T-POS (Ritter et al., 2011) adopts Conditional Random Fields and a clustering algorithm to perform the task. It was trained on a mixture of hand-annotated tweets and existing POS-labeled data.

• GATE Tagger: The GATE tagger (Derczynski et al., 2013) is based on vote-constrained bootstrapping with unlabeled data. It combines cases where the available taggers use different tagsets.

• ARK Tagger: The ARK tagger (Owoputi et al., 2013) is the system that reports the best accuracy on the RIT dataset. It uses unsupervised word clustering and a variety of lexical features.

• bi-LSTM: Bidirectional Long Short-Term Memory (LSTM) networks have been widely used in a variety of sequence labeling tasks (Graves and Schmidhuber, 2005). In this work, we evaluate them at the character level, at the word level, and with the two combined. bi-LSTM (word level) uses one layer of bi-LSTM to extract word-level features and adopts random initialization to transform words into vectors. bi-LSTM (character level) combines a bi-LSTM with CNN-based character embeddings, an approach similar to the character-aware neural network described in (Kim et al., 2015), to handle out-of-vocabulary words. The bi-LSTM (word level pretrain) architecture is the same as that of bi-LSTM (word level) but adopts the word2vec tool (Mikolov et al., 2013) to vectorize words. bi-LSTM (combine) concatenates word features to character features.
Methods RIT-Test RIT-Dev
Stanford-WSJ (Toutanova et al., 2003) 73.37% 83.29%
Stanford-MIX 83.14% 84.19%
T-POS (Ritter et al., 2011) 84.55% 84.83%
GATE Tagger (Derczynski et al., 2013) 88.69% 89.37%
ARK Tagger (Owoputi et al., 2013) 90.40% -
bi-LSTM (word level) 75.91% 76.94%
bi-LSTM (word level pretrain) 85.99% 86.93%
bi-LSTM (character level) 82.85% 84.30%
bi-LSTM (combine) 89.48% 89.30%
bi-LSTM (combine + WSJ) 83.54% 83.64%
bi-LSTM (combine + WSJ + adversarial) 83.76% 84.45%
bi-LSTM (combine + WSJ + fine-tune) 89.87% 90.23%
bi-LSTM (combine + WSJ + adversarial + fine-tune) 90.60% 90.73%
TPANN (combine + WSJ + adversarial + fine-tune + autoencoder) 90.92% 91.08%

Table 2: Token level accuracies of different methods on RIT-Test and RIT-Dev. bi-LSTM(combine)
refers to combining word level with character level. bi-LSTM(combine + WSJ) refers to the model
trained on WSJ and tested on RIT. bi-LSTM(combine + WSJ + adversarial) refers to adversarial model
trained on 1.1 million tokens of labeled WSJ data and the same scale of unlabeled Twitter data, then
tested on RIT. Fine-tune means adding RIT-train data to fine-tune.

The hyper-parameters used for our model are as follows. An AdaGrad optimizer trained with the cross-entropy loss is used with 0.1 as the default learning rate. The dimensionality of the word embeddings is set to 200. The dimensionality of the randomly initialized character embeddings is set to 25. We adopt a bi-LSTM for encoding, with each layer consisting of 250 hidden neurons. We use three layers of standard LSTM for decoding; each LSTM layer consists of 500 hidden neurons. An Adam optimizer trained with the cross-entropy loss is used for fine-tuning, with 0.0001 as the default learning rate. Fine-tuning is run for 100 epochs with early stopping. These settings are collected below for reference.
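The configuration below is merely a restatement of the listed hyper-parameters; the key names are our own, not the authors'.

TPANN_CONFIG = {
    "word_emb_dim": 200,            # pretrained on 30 million tweets
    "char_emb_dim": 25,             # randomly initialized
    "encoder": {"type": "bi-LSTM", "hidden_per_layer": 250},
    "decoder": {"type": "LSTM", "num_layers": 3, "hidden_per_layer": 500},
    "pretrain_optimizer": {"name": "AdaGrad", "lr": 0.1},
    "finetune_optimizer": {"name": "Adam", "lr": 0.0001},
    "finetune_epochs": 100,         # with early stopping
}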
4 Results and Discussion

In this section, we report the experimental results and a detailed analysis of the results for the three different datasets.

4.1 Evaluation on RIT-Twitter

RIT-Twitter is split into training, development, and evaluation sets (RIT-Train, RIT-Dev, RIT-Test). The splitting method is the one described in (Derczynski et al., 2013), and the dataset statistics are listed in Table 1. Table 2 shows the results of our method and other approaches on the RIT-Twitter dataset. RIT-Twitter uses the PTB tagset with several Twitter-specific tags: retweets, @usernames, hashtags, and urls. Since words in these categories can be tagged almost perfectly using simple regular expressions, similar to (Owoputi et al., 2013) we use regular expressions to tag these words appropriately for all systems; a sketch of such rules is given below.
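A minimal sketch of such rules follows. The paper does not list the exact patterns, so the regular expressions and the non-PTB tag names other than USR are assumptions.

import re

# Hypothetical rules for the four Twitter-specific categories; patterns and tag names are assumptions.
TWITTER_TAG_RULES = [
    (re.compile(r"^RT$", re.IGNORECASE), "RT"),    # retweet marker
    (re.compile(r"^@\w+$"), "USR"),                # @username
    (re.compile(r"^#\w+$"), "HT"),                 # hashtag
    (re.compile(r"^https?://\S+$"), "URL"),        # url
]

def rule_tag(token):
    """Return a Twitter-specific tag if a rule fires, otherwise None (defer to the neural tagger)."""
    for pattern, tag in TWITTER_TAG_RULES:
        if pattern.match(token):
            return tag
    return None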
From the results of Stanford-WSJ, we can observe that the newswire domain is different from Twitter. Although the token-level accuracy of the Stanford POS Tagger is higher than 97.0% on the PTB dataset, its performance on Twitter drops sharply to 73.37%. By incorporating some in-domain labeled data for training, the accuracy of the Stanford POS Tagger can reach 83.14%. Taking a variety of linguistic features and many other resources into consideration, T-POS, the GATE tagger, and the ARK tagger achieve better performance.

The second part of Table 2 shows the results of the bi-LSTM based methods, which are trained on the RIT-Train dataset. According to the word-level results, we can see that word2vec provides valuable information: the pre-trained word vectors in bi-LSTM (word level pretrain) give almost 10% higher accuracy than bi-LSTM (word level). Comparing the character-level bi-LSTM with the word-level bi-LSTM with random initialization, we can observe that the character-level method achieves better performance than the word-level method. bi-LSTM (combine) combines word with character features, as described in Section 2.1, which achieves the best result, 89.48%, among the bi-LSTM based baseline systems and shows that the morphological features and the pre-trained word vectors are both useful for POS tagging.
The third part of Table 2 shows the results of our methods incorporating out-of-domain labeled data, in-domain unlabeled data, and in-domain labeled data. Putting everything together, our model achieves 90.92% on this dataset. Compared with the architecture without an adversarial model, our method is almost 1% better. This demonstrates that adversarial networks can significantly help with tasks of this nature. By introducing the autoencoder on the target domain, we can preserve domain-specific features for better performance. Compared with the ARK tagger, which achieved the previous best result on this dataset, our model is also 0.52% better; the error reduction rate is more than 5.5%.

To better understand why adversarial networks can help transfer domains from newswire to Twitter, in this work we also followed the method Ganin and Lempitsky (2015) used to visualize the outputs of the LSTM with t-SNE (Van Der Maaten, 2013). Figure 3 shows the visualization results. From the figure, we can see that the adversary in our method makes the two distributions of features much more similar, which means that the outputs of the bi-LSTM are domain-invariant. Hence, the PTB training data can provide much more help than directly combining PTB and RIT-Train together.

Figure 3: The visualization of the features extracted by the bi-LSTM. The left figure shows the results when no adversary is performed. The right figure shows the results when the adversary procedure is incorporated into training. Blue points correspond to the source PTB domain examples, and red points correspond to the target Twitter domain.

Methods            Accuracy
Forsyth (2007)     90.8%
ARK Tagger         93.4% ± 0.3%
TPANN              94.1%

Table 3: Tagging accuracies on the NPSChat corpus.

4.2 Evaluation on NPSChat

IRC, which consists of Internet Relay Chat room messages from 2006, is a medium of online conversational text. Its content is very similar to tweets. We evaluate the proposed method on the NPSChat corpus (Forsyth, 2007), a PTB-part-of-speech annotated dataset of IRC.

We compared our method with a tagger in the same setup as the experiments of Forsyth (2007). The training part contains 90% of the data; the testing part contains the other 10%. Table 3 shows the results of the ARK Tagger and our method. We used the PTB, unlabeled Twitter data, and the training part of NPSChat to train our model. From the results, we can see that our model achieved 94.1% accuracy. This is significantly better than the result Forsyth (2007) reported, which was 90.8%. They trained their tagger with a mix of several POS-annotated corpora (12K from Twitter, 40K from IRC, and 50K from PTB). Our method also outperforms the state-of-the-art result of 93.4%±0.3%, which was achieved by the ARK Tagger with various external corpora and features, e.g., Brown clustering, PTB, and Freebase lists of celebrities and video games.
Methods                              Accuracy
Gimpel et al. (2011) version 0.2     90.8%
ARK Tagger                           93.2%
TPANN                                92.8%

Table 4: Tagging accuracies on DAILY547.

4.3 Evaluation on ARK-Twitter

The ARK-Twitter data contains an entire dataset consisting of a number of tweets sampled from one particular day (October 27, 2010), described in (Gimpel et al., 2011). This part is used for training. They also created another dataset, which consists of 547 tweets, for evaluation (DAILY547). This dataset consists of one random English tweet from every day between January 1, 2011 and June 30, 2012. The distribution of the training data may be slightly different from the testing data; for example, a substantial fraction of the messages in the training data are about a basketball game. Since ARK-Twitter uses a different tagset from the PTB, we manually construct a table to link the tags of the two datasets.

Table 4 shows the results of the different methods on this dataset. From the results, we can see that our method achieves a better result than (Gimpel et al., 2011). However, the performance of our method is worse than that of the ARK Tagger. Through analyzing the errors, we find that 16.7% of the errors occur between nouns and proper nouns. Since our method does not include any ontology or knowledge, proper nouns cannot be easily detected. The ARK Tagger, however, adds a token-level name-list feature. The name list is useful for proper noun recognition: it fires on names from many sources, such as Freebase lists of celebrities, the Moby Words list of US locations, proper names from Mark Kantrowitz's name corpus, and so on. So, our model is also competitive when lacking manual feature knowledge.

5 Related Work

Part-of-speech tagging is an important pre-processing step and can provide valuable information for various natural language processing tasks. In recent years, deep learning algorithms have been successfully used for POS tagging. A number of approaches have been proposed and have achieved some progress. Santos and Guimaraes (2015) proposed using a character-based convolutional neural network to perform the POS tagging problem. Bi-LSTMs with word, character, or unicode byte embeddings were also introduced for POS tagging and named entity recognition tasks (Plank et al., 2016; Chiu and Nichols, 2015; Ma and Hovy, 2016). In this work, we study the problem from a domain adaptation perspective. Inspired by these works, we also propose to use character-level methods to handle out-of-vocabulary words and bi-LSTMs to model the sequence relations.

Adversarial networks have been successfully used for image generation (Goodfellow et al., 2014; Dosovitskiy et al., 2015; Denton et al., 2015), domain adaptation (Tzeng et al., 2014; Ganin et al., 2016), and semi-supervised learning (Denton et al., 2016). The key idea of adversarial networks for domain adaptation is to construct invariant features by optimizing the feature extractor as an adversary against the domain classifier (Zhang et al., 2017).

A sequence autoencoder reads the input sequence into a vector and then tries to reconstruct it. Dai and Le (2015) used the model on a number of different tasks and verified its validity. Li et al. (2015) introduced the model to hierarchically build an embedding for a paragraph, showing that the model was able to encode texts so as to preserve syntactic, semantic, and discourse coherence.

In this work, we incorporate adversarial networks with an autoencoder to obtain domain-invariant features while keeping domain-specific features. Our model is more suitable for target-domain tasks.

6 Conclusion

In this work, we propose a novel adversarial neural network to address the POS tagging problem. Besides learning common representations between the source domain and the target domain, it can simultaneously preserve specific features of the target domain. The proposed method leverages newswire resources and large-scale in-domain unlabeled data to help POS tagging classification on Twitter, which has only a small amount of labeled data. We evaluate the proposed method and several state-of-the-art methods on three different corpora. In most of the cases, the proposed method achieves better performance than previous methods. Experimental results demonstrate that the proposed method can make full use of these resources, which can be easily obtained.
Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by the National Natural Science Foundation of China (No. 61532011, 61473092, and 61472088) and STCSM (No. 16JC1420401).

References

Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.

Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM.

Xilun Chen, Ben Athiwaratkun, Yu Sun, Kilian Weinberger, and Claire Cardie. 2016. Adversarial deep averaging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614.

Jason P.C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079–3087.

Emily Denton, Sam Gross, and Rob Fergus. 2016. Semi-supervised learning with context-conditional generative adversarial networks. arXiv preprint arXiv:1611.06430.

Emily L. Denton, Soumith Chintala, Rob Fergus, et al. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In RANLP, pages 198–206.

Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. 2015. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546.

Eric N. Forsyth. 2007. Improving automated lexical and discourse analysis of online chat dialog.

Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, pages 1180–1189.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 42–47. Association for Computational Linguistics.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. In AAAI 2016.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50:723–762.

Alex Lamb, Michael J. Paul, and Mark Dredze. 2013. Separating fact from fear: Tracking flu infections on Twitter. In HLT-NAACL, pages 789–795.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

Michael J. Paul and Mark Dredze. 2011. You are what you tweet: Analyzing Twitter for public health. ICWSM, 20:265–272.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. arXiv preprint arXiv:1604.05529.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics.

Alan Ritter, Oren Etzioni, Sam Clark, et al. 2012. Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104–1112. ACM.

Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM.

Cicero Nogueira dos Santos and Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.

Laurens Van Der Maaten. 2013. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342.

Kumanan Wilson and John S. Brownstein. 2009. Early detection of disease outbreaks using the internet. Canadian Medical Association Journal, 180(8):829–831.

Yizhe Zhang, Zhe Gan, and Lawrence Carin. 2016. Generating text via adversarial training. NIPS 2016 Workshop on Adversarial Training.

Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2017. Aspect-augmented adversarial networks for domain adaptation. arXiv preprint arXiv:1701.00188.