An Overview of Statistical
Machine Translation
Charles Schafer
David Smith
Johns Hopkins University
Overview of the Overview
• The Translation Problem and Translation Data
– “What do we have to work with?”
• Modeling
– “What makes a good translation?”
• Search
– “What’s the best translation?”
• Training
– “Which features of data predict good translations?”
• Translation Dictionaries From Minimal Resources
– “What if I don’t have (much) parallel text?”
• Practical Considerations
The Translation Problem
and
Translation Data
The Translation Problem
Whereas recognition of the inherent dignity and of the equal and
inalienable rights of all members of the human family is the foundation
of freedom, justice and peace in the world
Why Machine Translation?
* Cheap, universal access to world’s online
information regardless of original language.
(That’s the goal)
Why Statistical (or at least Empirical)
Machine Translation?
* We want to translate real-world documents.
Thus, we should model real-world documents.
* A nice property: design the system once, and
extend to new languages automatically by training
on existing data.
F(training data, model) -> parameterized MT system
Ideas that cut across empirical
language processing problems and methods
Real-world: don’t be (too) prescriptive. Be able to
process (translate/summarize/identify/paraphrase) relevant
bits of human language as they are, not as they “should
be”. For instance, genre is important: translating French
blogs into English is different from translating French
novels into English.
Model: a fully described procedure, generally having
variable parameters, that performs some interesting task
(for example, translation).
Training data: a set of observed data instances which
can be used to find good parameters for a model via a
training procedure.
Training procedure: a method that takes observed data
and refines the parameters of a model, such that the model
is improved according to some objective function.
Resource Availability
Most of this tutorial
Most statistical machine translation (SMT)
research has focused on a few “high-resource”
languages (European, Chinese, Japanese, Arabic).
Some other work: translation for the rest of
the world’s languages found on the web.
Most statistical machine translation research
has focused on a few high-resource languages
(European, Chinese, Japanese, Arabic).

[Figure: approximate parallel text available (with English), by language, in roughly
decreasing order: Chinese, French, Arabic, Italian, Danish, Finnish, Serbian, Uzbek,
Chechen, Khmer, Bengali.
- Parliamentary proceedings and government documents: ~200M words for the best-resourced
  languages, down to ~30M words for various Western European languages.
- Bible / Koran / Book of Mormon / Dianetics: ~1M words.
- Nothing, or only the Universal Declaration of Human Rights: ~1K words.]
Resource Availability
Most statistical machine translation (SMT)
research has focused on a few “high-resource”
languages (European, Chinese, Japanese, Arabic).
Some other work: translation for the rest of
the world’s languages found on the web.
Romanian Catalan Serbian Slovenian Macedonian Uzbek Turkmen Kyrgyz
Uighur Pashto Tajikh Dari Kurdish Azeri Bengali Punjabi Gujarati
Nepali Urdu Marathi Konkani Oriya Telugu Malayalam Kannada Cebuano
We’ll discuss this briefly
The Translation Problem
Document translation? Sentence translation? Word translation?
What to translate? The most common
use case is probably document translation.
Most MT work focuses on sentence translation.
What does sentence translation ignore?
- Discourse properties/structure.
- Inter-sentence coreference.
Document Translation:
Could Translation Exploit Discourse Structure?
<doc>
  <sentence>
  William Shakespeare was an English poet and playwright widely regarded as the greatest
  writer of the English language, as well as one of the greatest in Western literature,
  and the world's pre-eminent dramatist.
  <sentence>
  He wrote about thirty-eight plays and 154 sonnets, as well as a variety of other poems.
  <sentence>
  . . .
</doc>

Callouts: Documents usually don’t begin with “Therefore”.  What is the referent of “He”?
Sentence Translation
- SMT has generally ignored extra-sentence
structure (good future work direction
for the community).
- Instead, we’ve concentrated on translating
individual sentences as well as possible.
This is a very hard problem in itself.
- Word translation (knowing the possible
English translations of a French word)
is not, by itself, sufficient for building
readable/useful automatic document
translations – though it is an important
component in end-to-end SMT systems.
Sentence translation using only a word translation
dictionary is called “glossing” or “gisting”.
Word Translation (learning from minimal resources)
We’ll come back to this later…
and address learning the word
translation component (dictionary)
of MT systems without using
parallel text.
(For languages having little
parallel text, this is the best
we can do right now)
Sentence Translation
- Training resource: parallel text (bitext).
- Parallel text (with English) on the order
of 20M-200M words (roughly, 1M-10M sentences)
is available for a number of languages.
- Parallel text is expensive to generate:
human translators are expensive
($0.05-$0.25 per word). Millions of words of
training data are needed for high-quality SMT
results. So we take what is available.
This is often of less than optimal genre
(laws, parliamentary proceedings,
religious texts).
Sentence Translation: examples of more and
less literal translations in bitext
French (from bitext):           Le débat est clos .
English (from bitext):          The debate is closed .
Closely literal translation:    The debate is closed.

French (from bitext):           Accepteriez - vous ce principe ?
English (from bitext):          Would you accept that principle ?
Closely literal translation:    Accept-you that principle?

French (from bitext):           Merci , chère collègue .
English (from bitext):          Thank you , Mrs Marinucci .
Closely literal translation:    Thank you, dear colleague.

French (from bitext):           Avez - vous donc une autre proposition ?
English (from bitext):          Can you explain ?
Closely literal translation:    Have you therefore another proposal?

(from French-English European Parliament proceedings)
Sentence Translation: examples of more and
less literal translations in bitext
Word alignments illustrated. Well-defined for more literal translations.

Le débat est clos .
The debate is closed .

Accepteriez - vous ce principe ?
Would you accept that principle ?

Merci , chère collègue .
Thank you , Mrs Marinucci .

Avez - vous donc une autre proposition ?
Can you explain ?
Translation and Alignment
- As mentioned, translations are expensive to commission,
so SMT research generally relies on already existing
translations.
- These typically come in the form of aligned documents.
- A sentence alignment, using pre-existing document
boundaries, is performed automatically. Low-scoring
or non-one-to-one sentence alignments are discarded.
The resulting aligned sentences constitute the
training bitext.
- For many modern SMT systems, induction of word
alignments between aligned sentences, using algorithms
based on the IBM word-based translation models, is one
of the first stages of processing. Such induced word
alignments are generally treated as part of the observed
data and are used to extract aligned phrases or subtrees.
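As an illustration of the filtering idea (not the sentence-alignment algorithm itself), even a crude length-based check can weed out badly mismatched pairs before training; real pipelines use alignment scores from dedicated tools. The function name and thresholds below are ours, a minimal sketch only:

```python
def keep_pair(src, tgt, max_ratio=2.0, max_len=100):
    """Crude bitext filter: drop empty, overlong, or badly length-mismatched pairs."""
    src_toks, tgt_toks = src.split(), tgt.split()
    if not src_toks or not tgt_toks:
        return False
    if len(src_toks) > max_len or len(tgt_toks) > max_len:
        return False
    ratio = len(src_toks) / len(tgt_toks)
    return 1.0 / max_ratio <= ratio <= max_ratio

bitext = [("Le débat est clos .", "The debate is closed ."),
          ("Avez - vous donc une autre proposition ?", "Can you explain ?")]
filtered = [pair for pair in bitext if keep_pair(*pair)]
```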
Target Language Models
The translation problem can be described as modeling
the probability distribution P(E|F), where F is a
string in the source language and E is a string in the
target language.
Using Bayes’ Rule, this can be rewritten
    P(E|F) = P(F|E) P(E) / P(F)
           ∝ P(F|E) P(E)    [P(F) is fixed once F is observed as the sentence
                             to be translated, so it can be ignored when
                             choosing the best E]
P(F|E) is called the “translation model” (TM).
P(E) is called the “language model” (LM).
The LM should assign probability to sentences
which are “good English”.
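In code, the noisy-channel decision rule amounts to ranking candidate translations by the sum of the two log scores. A toy sketch, where tm_logprob and lm_logprob stand in for real translation and language models (the stand-ins below are ours):

```python
def channel_score(f, e, tm_logprob, lm_logprob):
    """Noisy-channel score of candidate translation e for source f: log P(F|E) + log P(E)."""
    return tm_logprob(f, e) + lm_logprob(e)

def best_translation(f, candidates, tm_logprob, lm_logprob):
    """Pick the candidate with the highest channel score."""
    return max(candidates, key=lambda e: channel_score(f, e, tm_logprob, lm_logprob))

# toy stand-ins for real models
tm = lambda f, e: -2.0 if "closed" in e else -5.0
lm = lambda e: -1.0 * len(e.split())
print(best_translation("Le débat est clos .",
                       ["The debate is closed .", "The debate is complete ."], tm, lm))
```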
Target Language Models
- Typically, N-Gram language models are employed
- These are finite state models which predict
the next word of a sentence given the previous
several words. The most common N-Gram model
is the trigram, wherein the next word is predicted
based on the previous 2 words.
- The job of the LM is to take the possible next
words that are proposed by the TM, and assign
a probability reflecting whether or not such words
constitute “good English”.
p(the|went to) p(the|took the)
p(happy|was feeling) p(sagacious|was feeling)
p(time|at the) p(time|on the)
Translating Words in a Sentence
- Models will automatically learn entries in
probabilistic translation dictionaries, for
instance p(elle|she), from co-occurrences in
aligned sentences of a parallel text.
- For some kinds of words/phrases, this
is less effective. For example:
numbers
dates
named entities (NE)
The reason: these constitute a large open
class of words that will not all occur even in
the largest bitext. Plus, there are
regularities in translation of
numbers/dates/NE.
Handling Named Entities
- For many language pairs, and particularly
those which do not share an alphabet,
transliteration of person and place names
is the desired method of translation.
- General Method:
1. Identify NE’s via classifier
2. Transliterate name
3. Translate/reorder honorifics
- Also useful for alignment. Consider the
case of Inuktitut-English alignment, where
Inuktitut renderings of European names are
highly nondeterministic.
Transliteration
Inuktitut rendering of
English names changes the
string significantly but not
deterministically
Transliteration
Inuktitut rendering of
English names changes the
string significantly but not
deterministically
Train a probabilistic finite-state
transducer to model this ambiguous
transformation
Transliteration
Inuktitut rendering of
English names changes the
string significantly but not
deterministically
… Mr. Williams … … mista uialims …
Useful Types of Word Analysis
- Number/Date Handling
- Named Entity Tagging/Transliteration
- Morphological Analysis
- Analyze a word to its root form
(at least for word alignment)
was -> is believing -> believe
ruminerai -> ruminer ruminiez -> ruminer
- As a dimensionality reduction technique
- To allow lookup in existing dictionary
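For the number/date handling above, a small rule-based normalizer is often the first step. A toy sketch; the placeholder tokens and regular expressions are illustrative only, not the patterns of any particular system:

```python
import re

def normalize_numbers_dates(text):
    """Map digit strings and simple D.M.Y dates to placeholder tokens so they can be
    translated by rule rather than learned from bitext."""
    text = re.sub(r"\b(\d{1,2})[./](\d{1,2})[./](\d{2,4})\b", r"@DATE_\1_\2_\3", text)
    text = re.sub(r"\b\d+(?:[.,]\d+)?\b", "@NUM", text)
    return text

print(normalize_numbers_dates("Le 14.07.1789 , 1200 personnes ..."))
# -> Le @DATE_14_07_1789 , @NUM personnes ...
```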
Modeling
What makes a good translation?
Modeling
• Translation models
– “Adequacy”
– Assign better scores to accurate (and
complete) translations
• Language models
– “Fluency”
– Assign better scores to natural target
language text
Word Translation Models
Auf diese Frage habe ich leider keine Antwort bekommen
Blue word links aren’t observed in data. NULL
I did not unfortunately receive an answer to this question
Features for word-word links: lexica, part-of-
speech, orthography, etc.
Word Translation Models
• Usually directed: each word in the target is generated by one word in the source
• Many-many and null-many links allowed
• Classic IBM models of Brown et al.
• Used now mostly for word alignment, not translation

    Im Anfang war das Wort
    In the beginning was the word
Phrase Translation Models
Not necessarily syntactic phrases
Division into phrases is hidden
Auf diese Frage habe ich leider keine Antwort bekommen
phrase= 0.212121, 0.0550809; lex= 0.0472973, 0.0260183; lcount=2.718
What are some other features?
I did not unfortunately receive an answer to this question
Score each phrase pair using several features
Phrase Translation Models
• Capture translations in context
– en Amerique: to America
– en anglais: in English
• State-of-the-art for several years
• Each source/target phrase pair is scored by
several weighted features.
• The weighted sum of model features is the
whole translation’s score: θ · f (see the sketch below)
• Phrases don’t overlap (cf. language models) but
have “reordering” features.
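A minimal sketch of that weighted sum; the feature names and weight values are made up for illustration (the two phrase probabilities reuse the numbers from the earlier figure):

```python
import math

def loglinear_score(features, weights):
    """Translation score = theta . f, the weighted sum of model features."""
    return sum(weights[name] * value for name, value in features.items())

# hypothetical feature values for one phrase-segmented hypothesis
weights  = {"log_p_phrase_fe": 1.0, "log_p_phrase_ef": 1.0, "log_p_lm": 0.8,
            "reordering": -0.3, "word_penalty": -0.5}
features = {"log_p_phrase_fe": math.log(0.212121), "log_p_phrase_ef": math.log(0.0550809),
            "log_p_lm": -14.2, "reordering": 2.0, "word_penalty": 11.0}
print(loglinear_score(features, weights))
```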
Single-Tree Translation Models
Minimal parse tree: word-word dependencies
Auf diese Frage habe ich leider keine Antwort bekommen
NULL
I did not unfortunately receive an answer to this question
Parse trees with deeper structure have also been used.
Single-Tree Translation Models
• Either source or target has a hidden tree/parse
structure
– Also known as “tree-to-string” or “tree-transducer”
models
• The side with the tree generates words/phrases
in tree, not string, order.
• Nodes in the tree also generate words/phrases
on the other side.
• English side is often parsed, whether it’s source
or target, since English parsing is more
advanced.
Tree-Tree Translation Models
Auf diese Frage habe ich leider keine Antwort bekommen
NULL
I did not unfortunately receive an answer to this question
Tree-Tree Translation Models
• Both sides have hidden tree structure
– Can be represented with a “synchronous” grammar
• Some models assume isomorphic trees, where
parent-child relations are preserved; others do
not.
• Trees can be fixed in advance by monolingual
parsers or induced from data (e.g. Hiero).
• Cheap trees: project from one side to the other
Projecting Hidden Structure
Projection
• Train with bitext
• Parse one side
• Align words
• Project dependencies
• Many-to-one links?
• Non-projective and circular dependencies?

    Im Anfang war das Wort
    In the beginning was the word
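A minimal sketch of the projection step under the simplest assumption (one-to-one links only); the many-to-one, unaligned, and non-projective cases raised above are simply skipped here. The toy dependencies and alignment indices are ours:

```python
def project_dependencies(src_deps, align):
    """src_deps: list of (head_index, child_index) arcs on the parsed side.
    align: dict mapping source index -> target index (one-to-one links only).
    Returns the projected target-side arcs."""
    tgt_deps = []
    for head, child in src_deps:
        if head in align and child in align:
            tgt_deps.append((align[head], align[child]))
    return tgt_deps

# "Im Anfang war das Wort" -> "In the beginning was the word" (toy indices)
src_deps = [(2, 0), (0, 1), (2, 4), (4, 3)]   # war->Im, Im->Anfang, war->Wort, Wort->das
align = {0: 0, 1: 2, 2: 3, 3: 4, 4: 5}
print(project_dependencies(src_deps, align))  # [(3, 0), (0, 2), (3, 5), (5, 4)]
```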
Divergent Projection
Auf diese Frage habe ich leider keine Antwort bekommen
NULL
I did not unfortunately receive an answer to this question
(Figure labels: head-swapping, null siblings, monotonic.)
Free Translation
Tschernobyl könnte dann etwas später an die Reihe kommen          NULL
Then we could deal with Chernobyl some time later

(Figure callouts: “Bad dependencies”; “Parent-ancestors?”)
Dependency Menagerie
A Tree-Tree Generative Story
Auf diese Frage habe ich leider keine Antwort bekommen          NULL      (observed)
I did not unfortunately receive an answer to this question

Model factors illustrated: P(parent-child), P(breakage), P(I | ich),
P(PRP | no left children of did)
Finite State Models
Kumar, Deng & Byrne, 2005
Finite State Models
First transducer in the pipeline: map distinct words to phrases.
Here, a unigram model of phrases.
(Kumar, Deng & Byrne, 2005)
Finite State Models
• Natural composition with other finite state
processes, e.g. Chinese word
segmentation
• Standard algorithms and widely available
tools (e.g. AT&T fsm toolkit)
• Limit reordering to finite offset
• Often impractical to compose all finite
state machines offline
Search
What’s the best translation
(under our model)?
Search
• Even if we know the right words in a
translation, there are n! permutations.
10! = 3,628,800     20! ≈ 2.43 × 10^18     30! ≈ 2.65 × 10^32
• We want the translation that gets the
highest score under our model
– Or the best k translations
– Or a random sample from the model’s
distribution
• But not in n! time!
Search in Phrase Models
One segmentation out of 4096
Deshalb haben wir allen Grund , die Umwelt in die Agrarpolitik zu integrieren
One phrase translation out of 581
That is why we have every reason to integrate the environment in the agricultural policy
One reordering out of 40,320
Translate in target language order to ease language modeling.
Search in Phrase Models
Deshalb haben wir allen Grund , die Umwelt in die Agrarpolitik zu integrieren
that is why we have every reason the environment in the agricultural policy to integrate
agricultural policy
therefore have we every reason the environment in the to integrate
,
that is why we have all reason , which environment in agricultural policy parliament
have therefore us all the reason of the environment into the agricultural policy successfully integrated
hence , we every reason to make environmental on the cap be woven together
the agricultural policy
we have therefore everyone grounds for taking the to the on parliament
environment is
so , we all of cause which environment , to the cap , for incorporated
hence our any why that outside at agricultural policy too woven together
therefore , it of all reason for , the completion into that agricultural policy be
And many, many more…even before reordering
“Stack Decoding”
Deshalb haben wir allen Grund , die Umwelt in die Agrarpolitik zu integrieren
[Figure: hypothesis stacks. Partial translations are grouped by how many source words they
cover: “hence”, “we”, “we have”, “we have therefore”, “in the environment”, “the”, …
and so on (etc., u.s.w.) until all source words are covered. Hypotheses that reach the
same state by different paths — e.g. “we” + “have therefore” and “we have” + “therefore” —
could be declared equivalent.]
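A much-simplified sketch of stack decoding: hypotheses are grouped by the number of source words covered and each stack is pruned to a beam. Unlike a real decoder, this version translates phrases strictly left to right (no reordering), and the phrase table and LM scorer are assumed inputs with the signatures described below:

```python
import heapq
from collections import defaultdict

def monotone_stack_decode(src, phrase_table, lm_score, max_phrase_len=3, beam=10):
    """src: list of source tokens.
    phrase_table: dict mapping a source-phrase tuple to a list of (target_string, log_prob).
    lm_score: function(previous_target_words, new_target_words) -> log-prob contribution.
    Returns the highest-scoring translation that covers all source words, left to right."""
    stacks = defaultdict(list)          # number of source words covered -> hypotheses
    stacks[0] = [(0.0, [])]             # (score, target words so far)
    for covered in range(len(src)):
        # beam pruning: keep only the best partial translations in this stack
        for score, target in heapq.nlargest(beam, stacks[covered], key=lambda h: h[0]):
            for plen in range(1, max_phrase_len + 1):
                if covered + plen > len(src):
                    break
                src_phrase = tuple(src[covered:covered + plen])
                for tgt, logp in phrase_table.get(src_phrase, []):
                    new_words = tgt.split()
                    new_score = score + logp + lm_score(target, new_words)
                    stacks[covered + plen].append((new_score, target + new_words))
    finished = stacks[len(src)]
    return " ".join(max(finished, key=lambda h: h[0])[1]) if finished else None
```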
Search in Phrase Models
• Many ways of segmenting source
• Many ways of translating each segment
• Restrict the search: disallow phrases longer than e.g. 7 words and long-distance
reordering
• Prune away unpromising partial translations or
we’ll run out of space and/or run too long
– How to compare partial translations?
– Some start with easy stuff: “in”, “das”, ...
– Some with hard stuff: “Agrarpolitik”,
“Entscheidungsproblem”, …
What Makes Search Hard?
• What we really want: the best (highest-scoring)
translation
• What we get: the best translation/phrase
segmentation/alignment
– Even summing over all ways of segmenting one
translation is hard.
• Most common approaches:
– Ignore problem
– Sum over top j translation/segmentation/alignment
triples to get top k<<j translations
Redundancy in n-best Lists
Source: Da ich wenig Zeit habe , gehe ich sofort in medias res .
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am in medias res immediately . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am in medias res immediately . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am in medias res immediately . | 0-1,0-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am in medias res immediately . | 0-0,0-0 1-1,1-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,10-10 10-10,11-11 11-11,8-8 12-12,12-12
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i would immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
because i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am immediately in medias res . | 0-0,0-0 1-1,1-1 2-2,4-4 3-3,2-2 4-4,3-3 5-5,5-5 6-6,7-7 7-7,6-6 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am in res medias immediately . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,11-11 10-10,10-10 11-11,8-8 12-12,12-12
because i have little time , i am immediately in medias res . | 0-1,0-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,8-8 9-9,9-9 10-10,10-10 11-11,11-11 12-12,12-12
as i have little time , i am in res medias immediately . | 0-0,0-0 1-1,1-1 2-2,4-4 3-4,2-3 5-5,5-5 6-7,6-7 8-8,9-9 9-9,11-11 10-10,10-10 11-11,8-8 12-12,12-12
Bilingual Parsing

[Figure, built up over four slides: a bilingual chart for the Greek sentence
“póll’ oîd’ alṓpēx” and the English “the fox knows many things”.
Lexical cells pair items across the two languages (NN/NN, VB/VB, JJ/JJ);
larger cells pair constituents (NP/NP, VP/VP) until a complete S/S analysis
covers both sentences, with the English side analyzed as NP V’ NP under VP and S.]

A variant of CKY chart parsing.
MT as Parsing
• If we only have the source, parse it while
recording all compatible target language
trees.
• Runtime is also multiplied by a grammar
constant: one string could be a noun and a
verb phrase
• Continuing problem of multiple hidden
configurations (trees, instead of phrases)
for one translation.
Training
Which features of data predict
good translations?
Training: Generative/Discriminative
• Generative
– Maximum likelihood training: max p(data)
– “Count and normalize”
– Maximum likelihood with hidden structure
• Expectation Maximization (EM)
• Discriminative training
– Maximum conditional likelihood
– Minimum error/risk training
– Other criteria: perceptron and max. margin
“Count and Normalize”
• Language modeling example: assume the probability of a word depends only on the
  previous 2 words.

      p(disease | into the) = count(into the disease) / count(into the)

• p(disease | into the) = 3/20 = 0.15

• “Smoothing” reflects a prior belief that p(breech | into the) > 0 despite these
  20 examples.

Corpus sample (the 20 occurrences of “into the”):
... into the programme ...   ... into the disease ...      ... into the disease ...   ... into the correct ...
... into the next ...        ... into the national ...     ... into the integration ...   ... into the Union ...
... into the Union ...       ... into the Union ...        ... into the sort ...      ... into the internal ...
... into the general ...     ... into the budget ...       ... into the disease ...   ... into the legal ...
... into the various ...     ... into the nuclear ...      ... into the bargain ...   ... into the situation ...
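A sketch of count-and-normalize for a trigram model, with add-one smoothing standing in for the much better smoothing schemes used in practice; the training sentences are ours:

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Estimate p(w3 | w1 w2) by counting and normalizing, with add-one smoothing."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(toks)):
            bi[(toks[i - 2], toks[i - 1])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    vocab = {w for key in tri for w in key}
    def prob(w1, w2, w3):
        return (tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + len(vocab))
    return prob

prob = train_trigram_lm(["he went into the disease", "it turned into the Union"])
print(prob("into", "the", "disease"))
```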
Phrase Models
[Figure: word-alignment grid between the German sentence
“Auf diese Frage habe ich leider keine Antwort bekommen” and the English
“I did not unfortunately receive an answer to this question”.]

Assume word alignments are given.
Phrase Models
[Figure: the same word-alignment grid, with boxes around some phrase pairs.]

Some good phrase pairs.
Phrase Models
[Figure: the same word-alignment grid, with boxes around some phrase pairs.]

Some bad phrase pairs.
“Count and Normalize”
• Usual approach: treat relative frequencies
of source phrase s and target phrase t as
probabilities
    p(s | t) = count(s, t) / count(t)          p(t | s) = count(s, t) / count(s)
• This leads to overcounting when not all
segmentations are legal due to unaligned
words.
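A sketch of the count-and-normalize step for phrases, assuming the phrase pairs have already been extracted from word-aligned sentences (the overcounting caveat above still applies); the example pairs are ours:

```python
from collections import Counter

def phrase_probs(phrase_pairs):
    """phrase_pairs: list of (source_phrase, target_phrase) extracted from the bitext.
    Returns relative-frequency estimates p(t|s) and p(s|t)."""
    pair_c, src_c, tgt_c = Counter(), Counter(), Counter()
    for s, t in phrase_pairs:
        pair_c[(s, t)] += 1
        src_c[s] += 1
        tgt_c[t] += 1
    p_t_given_s = {(s, t): c / src_c[s] for (s, t), c in pair_c.items()}
    p_s_given_t = {(s, t): c / tgt_c[t] for (s, t), c in pair_c.items()}
    return p_t_given_s, p_s_given_t

pairs = [("diese Frage", "this question"), ("diese Frage", "this issue"),
         ("diese Frage", "this question")]
p_ts, p_st = phrase_probs(pairs)
print(p_ts[("diese Frage", "this question")])   # 2/3
```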
Hidden Structure
• But really, we don’t observe word
alignments.
• How are word alignment model
parameters estimated?
• Find (all) structures consistent with
observed data.
– Some links are incompatible with others.
– We need to score complete sets of links.
Hidden Structure and EM
• Expectation Maximization
– Initialize model parameters (randomly, by some
simpler model, or otherwise)
– Calculate probabilities of hidden structures
– Adjust parameters to maximize likelihood of observed
data given hidden data
– Iterate
• Summing over all hidden structures can be
expensive
– Sum over 1-best, k-best, other sampling methods
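As a concrete instance, the EM loop for IBM Model 1 (word-translation probabilities with the word alignment hidden) is short enough to sketch. This toy version omits the NULL word and smoothing, and the tiny bitext is ours:

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=5):
    """bitext: list of (source_tokens, target_tokens) pairs.
    Returns t[(f, e)] ~ p(f | e), trained by EM over hidden word alignments."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))        # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in bitext:
            for f in fs:                               # E-step: expected alignment counts
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():                # M-step: count and normalize
            t[(f, e)] = c / total[e]
    return t

t = ibm_model1([("das Haus".split(), "the house".split()),
                ("das Buch".split(), "the book".split())])
print(t[("Haus", "house")])
```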
Discriminative Training
• Given a source sentence, give “good”
translations a higher score than “bad”
translations.
• We care about good translations, not a high
probability of the training data.
• Spend less “energy” modeling bad translations.
• Disadvantages
– We need to run the translation system at each training
step.
– System is tuned for one task (e.g. translation) and
can’t be directly used for others (e.g. alignment)
“Good” Compared to What?
• Compare current translation to
• Idea #1: a human translation. OK, but
– Good translations can be very dissimilar
– We’d need to find hidden features (e.g. alignments)
• Idea #2: other top n translations (the “n-best
list”). Better in practice, but
– Many entries in n-best list are the same apart from
hidden links
• Compare with a loss function L
– 0/1: wrong or right; equal to reference or not
– Task-specific metrics (word error rate, BLEU, …)
MT Evaluation
* Intrinsic
Human evaluation
Automatic (machine) evaluation
* Extrinsic
How useful is MT system output for…
Deciding whether a foreign language blog is about politics?
Cross-language information retrieval?
Flagging news stories about terrorist attacks?
…
Human Evaluation
Je suis fatigué.
                            Adequacy   Fluency
Tired is I.                     5          2
Cookies taste good!             1          5
I am exhausted.                 5          5
Human Evaluation
PRO
High quality
CON
Expensive!
Person (preferably bilingual) must make a
time-consuming judgment per system hypothesis.
Expense prohibits frequent evaluation of
incremental system modifications.
Automatic Evaluation
PRO
Cheap. Given available reference translations,
free thereafter.
CON
We can only measure some proxy for
translation quality.
(Such as N-Gram overlap or edit distance).
Automatic Evaluation: Bleu Score
N-gram precision p_n (each n-gram count is clipped, i.e. bounded above by the highest
count of that n-gram in any reference sentence):

    p_n = sum over n-grams in hyp of count_clip(n-gram)
          / sum over n-grams in hyp of count(n-gram)

Brevity penalty:

    B = exp(1 - |ref| / |hyp|)   if |ref| > |hyp|
    B = 1                        otherwise

Bleu score (brevity penalty times the geometric mean of the n-gram precisions):

    Bleu = B * exp( (1/N) * sum_{n=1..N} log p_n )
Automatic Evaluation: Bleu Score
hypothesis 1 I am exhausted
hypothesis 2 Tired is I
reference 1 I am tired
reference 2 I am ready to sleep now
Automatic Evaluation: Bleu Score
                                 1-gram   2-gram   3-gram
hypothesis 1   I am exhausted      3/3      1/2      0/1
hypothesis 2   Tired is I          1/3      0/2      0/1
hypothesis 3   I I I               1/3      0/2      0/1
reference 1    I am tired
reference 2    I am ready to sleep now and so exhausted
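A sketch of the computation for a single hypothesis. Real Bleu is computed over a whole test set with pooled counts; in this sentence-level toy a zero n-gram precision simply drives the score to (near) zero, and we take the shortest reference length as |ref|, one common choice:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, refs, N=3):
    hyp_toks = hyp.split()
    ref_toks = [r.split() for r in refs]
    log_precisions = []
    for n in range(1, N + 1):
        hyp_counts = ngrams(hyp_toks, n)
        max_ref = Counter()                      # clip: highest count in any reference
        for r in ref_toks:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    ref_len = min(len(r) for r in ref_toks)      # shortest reference as |ref|
    bp = math.exp(1 - ref_len / len(hyp_toks)) if ref_len > len(hyp_toks) else 1.0
    return bp * math.exp(sum(log_precisions) / N)

print(sentence_bleu("I am exhausted",
                    ["I am tired", "I am ready to sleep now and so exhausted"]))
```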
Minimizing Error/Maximizing Bleu
• Adjust parameters to
minimize error (L) when
translating a training set
• Error as a function of
parameters is
– nonconvex: not guaranteed
to find optimum
– piecewise constant: slight
changes in parameters might
not change the output.
• Usual method: optimize
one parameter at a time
with linear programming
Generative/Discriminative Reunion
• Generative models can be cheap to train: “count
and normalize” when nothing’s hidden.
• Discriminative models focus on problem: “get
better translations”.
• Popular combination
– Estimate several generative translation and language
models using relative frequencies.
– Find their optimal (log-linear) combination using
discriminative techniques.
Generative/Discriminative Reunion
Score each hypothesis with several generative models:
    θ1·log p_phrase(s|t) + θ2·log p_phrase(t|s) + θ3·log p_lexical(s|t) + …
    + θ7·log p_LM(t) + θ8·(# words)

If necessary, renormalize into a probability distribution:

    Z = Σ_k exp(θ · f_k)      (unnecessary if the θ’s sum to 1 and the p’s are all
                               probabilities)

where k ranges over all hypotheses. We then have, for any given hypothesis i,

    p(t_i | s) = (1/Z) exp(θ · f_i)      (exponentiation makes the score positive)
Minimizing Risk
Instead of the error of the 1-best translation, compute the expected error (risk) using
the k-best translations; this makes the objective differentiable.

Smooth the probability estimates with a scaling factor γ to even out local bumpiness,
and gradually increase γ to approach the 1-best error:

    E_{p,θ}[ L(s, t) ]    where    p_{γ,θ}(t_i | s_i) = exp(γ θ·f_i) / Σ_k exp(γ θ·f_k)

[Figure: the smoothed error surface for γ = 0.1, 1, and 10.]
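A sketch of that smoothed objective over a k-best list; loss() would be something like 1 − Bleu, and γ controls how sharply the distribution concentrates on the 1-best hypothesis. The data layout is an assumption of ours:

```python
import math

def expected_loss(kbest, theta, gamma, loss):
    """kbest: list of (feature_dict, hypothesis_string) pairs for one source sentence.
    theta: dict of feature weights.  loss: function(hypothesis) -> error in [0, 1].
    Returns E_p[L] where p(t_i) is proportional to exp(gamma * theta . f_i)."""
    scores = [gamma * sum(theta[name] * value for name, value in feats.items())
              for feats, _ in kbest]
    m = max(scores)                                   # stabilize the softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum((w / z) * loss(hyp) for w, (_, hyp) in zip(weights, kbest))
```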
Learning Word Translation Dictionaries
Using Minimal Resources
Learning Translation Lexicons for
Low-Resource Languages
{Serbian, Uzbek, Romanian, Bengali} → English
Problem: Scarce resources . . .
– Large parallel texts are very helpful, but often unavailable
– Often, no “seed” translation lexicon is available
– Neither are resources such as parsers, taggers, thesauri
Solution: Use only monolingual corpora in source, target
languages
– But use many information sources to propose and rank
translation candidates
Bridge Languages
[Figure: the bridge-language idea. A dictionary links ENGLISH to a high-resource bridge
language in each family (CZECH for Slavic, HINDI for Indic); intra-family string
transduction then links the bridge to its relatives: Serbian, Ukrainian, Russian, Polish,
Slovak, Bulgarian, Slovene on the Slavic side; Bengali, Gujarati, Nepali, Marathi,
Punjabi on the Indic side.]
Tasks: Cognate Selection
* Constructing translation candidate sets

[Figure: some cognates among Italian, Spanish, Catalan, Romanian, and Galician.]
Tasks: The Transliteration Problem

[Figure: transliteration examples in Arabic and Inuktitut.]
Example Models for Cognate and Transliteration Matching
Memoryless Transducer
(Ristad & Yianilos 1997)
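In the spirit of a memoryless transducer, the match score of a candidate pair can be computed by a dynamic program over substitution, insertion, and deletion probabilities. The sketch below scores only the single best edit path (a Viterbi approximation) with hand-supplied parameters; Ristad & Yianilos instead learn the parameters with EM and sum over all paths:

```python
import math

def edit_logprob(src, tgt, p_sub, p_ins, p_del):
    """Log-probability of the best edit path turning src into tgt under a
    memoryless edit model (one state, no memory of previous edits).
    p_sub[(a, b)], p_ins[b], p_del[a] hold the edit-operation probabilities."""
    NEG_INF = float("-inf")
    logp = lambda x: math.log(x) if x > 0 else NEG_INF
    n, m = len(src), len(tgt)
    d = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == NEG_INF:
                continue
            if i < n and j < m:      # substitute src[i] with tgt[j] (or match)
                d[i + 1][j + 1] = max(d[i + 1][j + 1],
                                      d[i][j] + logp(p_sub.get((src[i], tgt[j]), 0)))
            if j < m:                # insert tgt[j]
                d[i][j + 1] = max(d[i][j + 1], d[i][j] + logp(p_ins.get(tgt[j], 0)))
            if i < n:                # delete src[i]
                d[i + 1][j] = max(d[i + 1][j], d[i][j] + logp(p_del.get(src[i], 0)))
    return d[n][m]
```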
Example Models for Cognate and Transliteration Matching
Two-State Transducer (“Weak Memory”)
Example Models for Cognate and Transliteration Matching
Unigram Interlingua Transducer
Examples: Possible Cognates Ranked by
Various String Models
Romanian inghiti (ingest)      Uzbek avvalgi (previous/former)
[Figure: candidate translations ranked by each string model.]
* Effectiveness of cognate models
* Multi-family bridge languages

[Figure: multi-family bridge languages (Russian, Farsi, Turkish, Kazakh, Kyrgyz)
linking ENGLISH and Uzbek.]
Similarity Measures
for re-ranking cognate/transliteration hypotheses
1. Probabilistic string transducers
2. Context similarity
3. Date distribution similarity
4. Similarities based on monolingual
word properties
Similarity Measures
1. Probabilistic string transducers
2. Context similarity
3. Date distribution similarity
4. Similarities based on monolingual
word properties
Compare Vectors
Construction of context term vectors, with the Serbian vector projected into the
English term space:

    nezavisnost:     0     0     2     1.5   1.5   1.5   4     1.5
    independence:    3     1     10    0     479   836   191   0
    freedom:         681   184   104   0     21    4     141   0

Compute the cosine similarity between nezavisnost and “independence”,
and between nezavisnost and “freedom”.
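The comparison itself is ordinary cosine similarity between the context-count vectors shown above:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

nezavisnost  = [0, 0, 2, 1.5, 1.5, 1.5, 4, 1.5]
independence = [3, 1, 10, 0, 479, 836, 191, 0]
freedom      = [681, 184, 104, 0, 21, 4, 141, 0]
print(cosine(nezavisnost, independence), cosine(nezavisnost, freedom))
```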
Similarity Measures
1. Probabilistic string transducers
2. Context similarity
3. Date distribution similarity
4. Similarities based on monolingual
word properties
Date Distribution Similarity
• Topical words associated with real-world events appear
within news articles in bursts following the date of the
event
• Synonymous topical words in different languages, then,
display similar distributions across dates in news text: this
can be measured
• We use cosine similarity on date term vectors, with term
values p(word|date), to quantify this notion of similarity
Date Distribution Similarity - Example
[Figure: term frequency by date over a 200-day window.
nezavisnost vs. (correct) independence: similar burst patterns.
nezavisnost vs. (incorrect) freedom: dissimilar patterns.]
Similarity Measures
1. Probabilistic string transducers
2. Context similarity
3. Date distribution similarity
4. Similarities based on monolingual
word properties
Relative Frequency
Cross-Language Comparison:

    rf(wF) = fCF(wF) / |CF|          rf(wE) = fCE(wE) / |CE|

    similarity(wF, wE) = min( rf(wF) / rf(wE) ,  rf(wE) / rf(wF) )      [min-ratio method]

Precedent in Yarowsky & Wicentowski (2000), who used relative frequency similarity
for morphological analysis.
Combining Similarities: Uzbek
Combining Similarities:
Romanian, Serbian, & Bengali
Observations
* With no Uzbek-specific supervision,
we can produce an Uzbek-English
dictionary which is 14% exact-match correct
* Or, we can put a correct translation
in the top-10 list 34% of the time
(useful for end-to-end machine translation
or cross-language information retrieval)
* Adding more
bridge languages
helps
Practical Considerations
Empirical Translation in Practice: System Building
1. Data collection
- Bitext
- Monolingual text for language model (LM)
2. Bitext sentence alignment, if necessary
3. Tokenization
- Separation of punctuation
- Handling of contractions
4. Named entity, number, date normalization/translation
5. Additional filtering
- Sentence length
- Removal of free translations
6. Training…
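Step 3 often starts out as a handful of regular expressions; a toy sketch for punctuation separation and English contractions (real systems use language-specific tokenizers, and the patterns here are illustrative only):

```python
import re

def tokenize(text):
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)                   # separate punctuation
    text = re.sub(r"\b(\w+)'(s|re|ve|ll|d|t)\b", r"\1 '\2", text)    # split English contractions
    return text.split()

print(tokenize("Don't panic, it's fine (really)."))
# -> ['Don', "'t", 'panic', ',', 'it', "'s", 'fine', '(', 'really', ')', '.']
```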
Some Freely Available Tools
• Sentence alignment
– http://research.microsoft.com/~bobmoore/
• Word alignment
– http://www.fjoch.com/GIZA++.html
• Training phrase models
– http://www.iccs.inf.ed.ac.uk/~pkoehn/training.tgz
• Translating with phrase models
– http://www.isi.edu/licensed-sw/pharaoh/
• Language modeling
– http://www.speech.sri.com/projects/srilm/
• Evaluation
– http://www.nist.gov/speech/tests/mt/resources/scoring.htm
• See also http://www.statmt.org/