
Catching Up

CS 4705

Porter Stemmer (1980)

• Used for tasks in which you only care about the stem
  – IR, modeling the given/new distinction, topic detection,
    document similarity
• Lexicon-free morphological analysis
• Cascades rewrite rules (e.g. misunderstanding -->
  misunderstand --> understand --> …)
• Easily implemented as an FST with rules, e.g.
  – ATIONAL --> ATE
  – ING --> ε
• Not perfect…
  – Doing --> doe
  – Policy --> police
• Does stemming help?
  – For IR, a little
  – For topic detection, more
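
A toy sketch of the rule-cascade idea in Python, reduced to the two rules above (illustrative only, not the full 1980 algorithm):

```python
# A toy cascade of suffix-rewrite rules in the spirit of the Porter
# stemmer (illustrative sketch only; real Porter has many more rules
# and conditions on the stem).
RULES = [
    ("ational", "ate"),   # ATIONAL --> ATE  (relational -> relate)
    ("ing", ""),          # ING --> ε        (misunderstanding -> misunderstand)
]

def toy_stem(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            word = word[: -len(suffix)] + replacement
    return word

print(toy_stem("relational"))        # relate
print(toy_stem("misunderstanding"))  # misunderstand
```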

Statistical POS Tagging

• Goal: choose the best sequence of tags T for a
  sequence of words W in a sentence
  – T' = argmax_T P(T | W)
  – By Bayes' Rule:
      P(T | W) = P(T) P(W | T) / P(W)
  – Since P(W) is the same for every candidate tag sequence, we can
    ignore it:
      T' = argmax_T P(T) P(W | T)
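
A brute-force sketch of the argmax (all probabilities invented for illustration; P(T) is approximated by tag-bigram transitions and P(W|T) by per-word emissions, HMM-style; real taggers use Viterbi rather than enumeration):

```python
# Brute-force illustration of T' = argmax_T P(T) P(W|T).
# All probabilities below are invented for illustration.
from itertools import product

words = ["the", "race"]
tagset = ["DT", "NN", "VB"]

trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("DT", "VB"): 0.1}
emit = {("the", "DT"): 0.9, ("race", "NN"): 0.6, ("race", "VB"): 0.3}

def score(tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

best = max(product(tagset, repeat=len(words)), key=score)
print(best, score(best))  # -> ('DT', 'NN') with score ≈ 0.2268
```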

Brill Tagging: TBL

• Start with simple (less accurate) rules… learn
  better ones from a tagged corpus
  – Tag each word initially with its most likely POS
  – Examine a set of transformations to see which most improves
    tagging decisions compared to the tagged corpus
  – Re-tag the corpus
  – Repeat until, e.g., performance no longer improves
  – Result: a tagging procedure which can be applied to new,
    untagged text
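
A simplified sketch of the learning loop, assuming a single transformation template, "change tag a to b when the previous tag is z" (toy data; not Brill's full template set or scoring):

```python
# Simplified transformation-based learning loop (illustrative sketch).
from itertools import product

def apply_rule(tags, rule):
    a, b, z = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == z:
            out[i] = b
    return out

def n_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn(tags, gold, tagset):
    tags, rules = list(tags), []
    while True:
        best_err, best_rule = min(
            (n_errors(apply_rule(tags, r), gold), r)
            for r in product(tagset, repeat=3))
        if best_err >= n_errors(tags, gold):
            break  # stop: no transformation improves the score
        rules.append(best_rule)
        tags = apply_rule(tags, best_rule)  # re-tag the corpus
    return rules

# Toy corpus "to race the race": initial most-likely tags vs. gold.
initial = ["TO", "NN", "DT", "NN"]
gold    = ["TO", "VB", "DT", "NN"]
print(learn(initial, gold, ["TO", "NN", "VB", "DT"]))
# -> [('NN', 'VB', 'TO')], i.e. "change NN to VB after TO"
```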

An Example

The horse raced past the barn fell.

Gold standard:
The/DT horse/NN raced/VBN past/IN the/DT barn/NN fell/VBD ./.

1) Tag every word with its most likely tag and score:
   The/DT horse/NN raced/VBD past/NN the/DT barn/NN fell/VBD ./.
2) For each template, try every instantiation (e.g. change VBD to VBN
   when the preceding word is tagged NN), add the rule to the ruleset,
   re-tag the corpus, and score (see the sketch after this list)
3) Stop when no transformation improves the score
4) Result: a set of transformation rules which can be applied to new,
   untagged data (after initializing with the most common tag)
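
A quick check of the transformation from step 2 (a self-contained sketch; tags from the slide):

```python
# Applying "change VBD to VBN when the previous tag is NN" to the
# example sentence above.
def apply_rule(tags, a, b, z):
    return [b if t == a and i > 0 and tags[i - 1] == z else t
            for i, t in enumerate(tags)]

tags = ["DT", "NN", "VBD", "IN", "DT", "NN", "VBD", "."]
print(apply_rule(tags, "VBD", "VBN", "NN"))
# -> ['DT', 'NN', 'VBN', 'IN', 'DT', 'NN', 'VBN', '.']
# It fixes raced/VBD -> VBN, but also wrongly changes fell/VBD -> VBN,
# since "barn" is likewise tagged NN -- one answer to the question below.
```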
…What problems will this process run into?

Methodology: Evaluation

• For any NLP problem, we need to know how to
  evaluate our solutions
• Possible Gold Standards -- ceiling:
  – Annotated naturally occurring corpus
  – Human task performance (96-97%)
    • How well do humans agree?
    • Kappa statistic: average pairwise agreement
      corrected for chance agreement
  – Can be hard to obtain for some tasks:
    sometimes humans don’t agree
• Baseline: how well does a simple method do?
  – For tagging, the most common tag for each word (91%)
  – How much improvement do we get over the baseline?
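
A minimal sketch of the kappa computation for the two-annotator case (Cohen's κ = (P_o − P_e) / (1 − P_e); annotation data invented):

```python
# Cohen's kappa for two annotators over the same items (minimal sketch).
# P_o is observed agreement; P_e is the agreement expected by chance,
# estimated from each annotator's label distribution.
from collections import Counter

def kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[t] * cb[t] for t in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

ann1 = ["NN", "VB", "NN", "DT", "NN", "VB"]
ann2 = ["NN", "VB", "NN", "DT", "VB", "VB"]
print(round(kappa(ann1, ann2), 3))  # -> 0.739
```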

Methodology: Error Analysis

• Confusion matrix:
  – E.g., which tags did we most often confuse with
    which other tags?
  – How much of the overall error does each
    confusion account for?
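
A minimal sketch of tallying tag confusions from gold vs. predicted tags (toy data):

```python
# Building a tag confusion tally (minimal sketch): counts of
# (gold tag, predicted tag) pairs for the mis-tagged tokens.
from collections import Counter

gold = ["DT", "NN", "VBN", "IN", "DT", "NN", "VBD"]
pred = ["DT", "NN", "VBD", "NN", "DT", "NN", "VBD"]

confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)
total_errors = sum(confusions.values())
for (g, p), c in confusions.most_common():
    print(f"gold {g} tagged as {p}: {c} ({c / total_errors:.0%} of errors)")
```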

More Complex Issues

• Tag indeterminacy: when ‘truth’ isn’t clear
  – Caribbean cooking, child seat
• Tagging multipart words
  – wouldn’t --> would/MD n’t/RB
• Unknown words
  – Assume all tags are equally likely
  – Assume the same tag distribution as all other singletons in the
    corpus
  – Use morphology, word length, …
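
A hedged sketch of the morphology option: guessing an unknown word's tag from surface cues (the suffix list and priorities here are invented for illustration, not from a trained model):

```python
# Guessing tags for unknown words from surface cues (illustrative
# sketch only; real systems learn such features from data).
def guess_tag(word: str) -> str:
    if word[0].isupper():
        return "NNP"          # capitalized -> likely proper noun
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ed"):
        return "VBD"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("s"):
        return "NNS"
    return "NN"               # default: common noun

for w in ["blorficating", "glorped", "Zendavia", "frabjously"]:
    print(w, "->", guess_tag(w))
```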
