NLP Unit V
1. Introduction:
This unit surveys the current state of machine translation in the context of NLP, explores the challenges faced by researchers and practitioners, and identifies potential areas for future research and development.
2. Challenges in Machine Translation
Machine translation systems face a number of recurring challenges:
1. Ambiguity: Natural languages often contain ambiguous words, phrases, or sentences that can
have multiple interpretations. Machine translation systems may struggle to accurately
disambiguate such instances, leading to incorrect translations.
2. Idiomatic Expressions and Cultural Nuances: Languages contain idiomatic expressions and
cultural nuances that are challenging to translate accurately. Machine translation systems may
struggle to capture the intended meaning or may produce literal translations that lack the cultural
context, resulting in translations that sound awkward or are incorrect.
3. Syntax and Grammar: Translating sentences while preserving the correct syntax and grammar
is a complex task. Machine translation systems may produce translations that have grammatical
errors, incorrect word order, or lack fluency, making them harder to understand.
4. Out-of-vocabulary (OOV) Words: Machine translation systems often encounter words or
phrases that are not present in their training data. These out-of-vocabulary words pose a
challenge as the system may not have learned their translations, leading to inaccurate or
untranslated words in the output.
5. Domain-specific Terminology: Different domains have their own specific terminologies, such
as technical or medical terms, which might not have direct translations. Machine translation
systems may struggle to handle these specialized terms, resulting in inaccurate or inconsistent
translations.
6. Language Pair Discrepancies: The performance of machine translation systems can vary
significantly depending on the language pair being translated. Some language pairs have more
training data available, leading to better performance, while others may have limited resources,
resulting in poorer translations.
7. Lack of Context Understanding: Machine translation systems often lack the ability to
understand the broader context of a sentence or document. They may struggle with pronoun
resolution, coreference resolution, or understanding context-dependent meanings, leading to
translation errors.
8. Rare or Low-Resource Languages: Machine translation for rare or low-resource languages
poses additional challenges due to limited training data and resources. It is often difficult to
achieve high-quality translations for these languages due to the scarcity of linguistic resources
and models trained specifically for them.
Addressing these challenges requires ongoing research and development in the field of machine
translation in NLP. By tackling these problems, we can improve the accuracy, fluency, and
overall quality of machine translation systems.
3. Machine Translation within NLP
NLP encompasses a wide range of tasks beyond machine translation, such as language
understanding, sentiment analysis, question answering, text summarization, and many
others. However, machine translation remains one of the fundamental and widely
studied applications within the field of NLP.
4. Current Status
Machine translation has made significant progress in the field of NLP,
particularly with the advent of neural machine translation (NMT) models. NMT
has revolutionized the quality of machine translation outputs and has become
the dominant approach in recent years.
The most notable NMT architecture is the Transformer model, introduced in
2017. Transformers leverage attention mechanisms to focus on relevant parts
of the input sequence during translation, allowing for better modeling of
long-range dependencies. This architecture has achieved state-of-the-art
results on various machine translation benchmarks.
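To make the attention idea concrete, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer. The toy shapes and random inputs are illustrative only, not part of any real translation model.

```python
# Minimal sketch of scaled dot-product attention (illustration only).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v) -> (n_q, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention distribution over the keys
    return weights @ V                   # weighted sum of the values

# Toy example: 2 target positions attending over 3 source tokens, dim 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```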
It's important to note that the field of NLP is rapidly evolving, and new
techniques and models continue to appear. Readers should refer to recent
research papers, conferences, and advancements in the field to stay up to
date with the current state of machine translation in NLP.
5. Cutting the Gordian Knot
In the context of NLP, "cutting the Gordian Knot" can refer to finding a simple and
effective solution to a complex problem or challenge. NLP tasks often involve intricate
linguistic nuances, ambiguity, and challenges related to language understanding and
generation. Researchers and practitioners in NLP continuously seek innovative
approaches and techniques to tackle these challenges.
The expression "cutting the Gordian Knot" in NLP may signify the discovery of a
breakthrough technique, a novel algorithm, or an innovative model architecture that
simplifies or solves a previously difficult or unsolved problem in natural language
processing.
It's worth noting that specific techniques and approaches for "cutting the Gordian Knot"
in NLP may vary depending on the particular problem or task at hand. Researchers and
practitioners employ a range of methods, including deep learning, neural networks,
transfer learning, reinforcement learning, and more, to overcome challenges and
improve the performance of NLP systems.
6. The Anusaaraka System
The Anusaaraka architecture has been designed and developed based on issues revealed during
an evaluation of conventional machine translation. The architecture is shown in Fig. 1.
Architecture of the Anusaaraka system:
- Core engine
- User-cum-developer interface
The core engine is the main engine of Anusaaraka. It produces the output in different
layers, making the process of machine translation transparent to the user.
1. The order of operations is reversed: in the new architecture, word-level substitution is
performed first, followed by the use of less reliable language resources such as POS taggers
and parsers.
2. A graphical user interface has been developed to display the spectrum of outputs. The user
has the flexibility to adjust the output to his or her needs; users differ in the level of
sophistication they require and in their skill in handling the tool.
3. Special "interfaces", which act as 'glue', have been developed for different parsers. They
allow different parsers to be plugged in, thereby providing modularity.
The POS taggers can help in WSD when the ambiguity is across POSs. For example, consider
the two sentences "He chairs the session" and "The chairs in this room are comfortable". The
POS taggers mark the words with the appropriate POS tags. These taggers use certain heuristic
rules and hence may sometimes go wrong; their reported performances vary between 95% and
97%. However, they are still useful, since they reduce the search space for meanings
substantially.
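As a concrete illustration of cross-POS disambiguation, the following is a minimal sketch using NLTK's off-the-shelf tagger (an assumption of this example, not a component of Anusaaraka).

```python
# POS tags disambiguating "chairs" with NLTK. Assumes the 'punkt' and
# 'averaged_perceptron_tagger' resources have been fetched, e.g. via
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger').
import nltk

for sentence in ["He chairs the session.",
                 "The chairs in this room are comfortable."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# "chairs" is tagged VBZ (present-tense verb) in the first sentence and
# NNS (plural noun) in the second, resolving the cross-POS ambiguity.
```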
III. Preposition Placement
English has prepositions whereas Hindi has postpositions. Hence, it is necessary to move the
prepositions to their proper positions in Hindi before substituting their meanings. While moving
the prepositions from their English positions to the proper Hindi positions, a record of their
movements must be stored, so that, if the need arises, they can be reverted to their original
positions. The transformations performed by this module are therefore reversible.
The second major contribution of this architecture is the concept of 'interfaces'. Machine
translation requires language resources such as POS taggers, morphological analyzers, and
parsers. More than one kind of each of these tools exists, so it makes sense to be able to use
any of them. However, there are problems.
As a machine translation system developer interested in a usable product, one would like to
plug in different parsers and watch the performance. Perhaps one would like to use
combinations of them, or to vote among different parsers and choose the best parse out
of them.
A Java/Python-based user interface has been developed to display the outputs produced by
the different layers of the Anusaaraka engine. The user interface provides flexibility to control
the display.
7. Multilingual Information Retrieval (MLIR)
The main goal of Multilingual Information Retrieval (MLIR) is to overcome language barriers
and enable users to retrieve information from a diverse range of languages, even if they are not
proficient in those languages. MLIR systems typically involve the following key components:
1. Multilingual Indexing: MLIR systems index documents from multiple languages to create a
searchable collection. This process involves language-specific preprocessing techniques such as
tokenization, stemming, and stop-word removal (a minimal preprocessing sketch follows this list).
2. Cross-lingual Mapping: MLIR often involves creating mappings between different languages to
establish connections and similarities. This can be achieved through techniques such as bilingual
dictionaries, parallel corpora, or statistical models that learn cross-lingual word representations.
3. Query Translation: MLIR systems handle queries in one language and translate them into the
languages of the indexed documents. Query translation methods include statistical machine
translation, rule-based translation, or leveraging cross-lingual word embeddings.
4. Cross-lingual Retrieval Models: MLIR employs retrieval models that consider the multilingual
nature of the indexed documents and queries. These models often combine language-specific
relevance signals with cross-lingual information, such as document similarity or query expansion
techniques.
5. Evaluation Metrics: MLIR systems are evaluated using metrics that account for the effectiveness
of information retrieval across multiple languages. Common evaluation measures include mean
average precision, precision at K, or cross-lingual variants of these metrics.
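As a minimal sketch of the language-specific preprocessing in component 1, the following uses NLTK's stop-word lists and Snowball stemmers. The example sentences and the choice of languages are illustrative assumptions; the 'punkt' and 'stopwords' resources must be downloaded first.

```python
# Language-specific preprocessing for multilingual indexing:
# tokenization, stop-word removal, and stemming with NLTK.
# Requires: nltk.download('punkt'); nltk.download('stopwords').
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

def preprocess(text, language):
    # NLTK keys its Snowball stemmers and stop-word lists by
    # language name, e.g. "english", "spanish".
    stemmer = SnowballStemmer(language)
    stops = set(stopwords.words(language))
    tokens = nltk.word_tokenize(text, language=language)
    return [stemmer.stem(t.lower()) for t in tokens
            if t.isalpha() and t.lower() not in stops]

print(preprocess("Information retrieval across languages", "english"))
print(preprocess("Recuperación de información entre idiomas", "spanish"))
```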
MLIR finds applications in various domains such as cross-lingual search engines, multilingual
digital libraries, e-commerce platforms, and information retrieval in multilingual social media
content.
Researchers and practitioners in MLIR continue to explore advanced techniques, leveraging deep
learning, neural networks, and transformer-based models to improve the effectiveness and
efficiency of multilingual information retrieval systems.
8. Cross-Language Information Retrieval (CLIR)
CLIR stands for Cross-Language Information Retrieval, and it is a subfield of Natural Language
Processing (NLP) that focuses on retrieving information across different languages. It involves
the process of searching for and retrieving relevant documents or information in a target
language, given a query expressed in a different source language.
The goal of CLIR is to bridge the language barrier and enable users to access information from
different languages, even if they do not understand or speak those languages. It is particularly
useful in multilingual and cross-cultural contexts, where people may need to search for
information in languages they are not familiar with.
A typical CLIR pipeline involves the following steps:
1. Query Translation: The user query, expressed in the source language, needs to be translated
into the target language. This step can be challenging due to differences in grammar, vocabulary,
and linguistic structure between languages.
2. Document Indexing: The documents in the target language need to be indexed to enable
efficient retrieval. This typically involves extracting relevant features from the documents, such
as keywords, named entities, or language-specific patterns.
3. Retrieval: The translated query is used to search the indexed documents, and retrieval
algorithms rank the documents based on their relevance to the query. Various information
retrieval techniques, such as vector space models or probabilistic models, can be applied here
(see the sketch after this list).
4. Result Presentation: The retrieved documents are presented to the user, often with additional
processing to provide a summary, highlight relevant information, or support further exploration.
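To make the retrieval step (step 3) concrete, here is a minimal vector-space sketch using scikit-learn's TF-IDF vectorizer and cosine similarity. It assumes the query has already been translated in step 1; the documents and query are toy data, not a real collection.

```python
# Vector-space retrieval over a tiny indexed collection (illustration only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine translation converts text between languages.",
    "Information retrieval ranks documents by relevance.",
    "Neural networks learn representations from data.",
]
translated_query = "ranking documents for retrieval"  # output of step 1

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)         # index the collection
query_vector = vectorizer.transform([translated_query])  # vectorize the query
scores = cosine_similarity(query_vector, doc_vectors)[0]  # relevance scores

# Rank documents by descending similarity to the translated query.
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```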
CLIR faces several challenges due to the inherent complexities of language translation,
variations in language resources and structures, and the scarcity of parallel or bilingual data.
Researchers in the field employ various techniques to address these challenges, including
statistical machine translation, cross-lingual word embeddings, and leveraging multilingual
resources such as dictionaries or parallel corpora.
CLIR has applications in areas such as multilingual search engines, digital libraries, cross-
cultural communication, and global information access. It enables users to overcome language
barriers and access information from diverse linguistic sources, fostering knowledge
dissemination and collaboration across different language communities.
9. Evaluation in Information Retrieval
Evaluation in Information Retrieval (IR) in NLP refers to the process of assessing and measuring
the effectiveness and performance of IR systems or models in retrieving relevant information in
response to user queries. Evaluation plays a crucial role in assessing the quality of IR systems,
comparing different approaches, and driving improvements in the field. There are several
commonly used evaluation measures in IR:
1. Precision: Precision measures the proportion of retrieved documents that are relevant to a given
query. It is calculated as the number of relevant documents retrieved divided by the total number
of documents retrieved.
2. Recall: Recall measures the proportion of relevant documents that are retrieved out of all the
relevant documents available in the collection. It is calculated as the number of relevant
documents retrieved divided by the total number of relevant documents.
3. F1 Score: The F1 score combines precision and recall into a single metric, providing a balanced
measure of system performance. It is the harmonic mean of precision and recall, calculated as 2 *
(precision * recall) / (precision + recall).
4. Mean Average Precision (MAP): MAP is a widely used measure for ranked retrieval. It
calculates the average precision across different recall levels for a set of queries. It considers the
order in which documents are retrieved and rewards systems that retrieve relevant documents
earlier in the ranked list.
5. Normalized Discounted Cumulative Gain (NDCG): NDCG is a measure that accounts for the
relevance and rank of retrieved documents. It assigns higher scores to relevant documents that
are ranked higher in the list. NDCG takes into account both precision and the position of relevant
documents in the ranked list.
6. Precision at K: Precision at K measures the precision of the top-K retrieved documents. It
considers only the first K documents and calculates the proportion of relevant documents among
them.
7. Mean Reciprocal Rank (MRR): MRR measures the rank at which the first relevant document
is retrieved. It calculates the reciprocal of the rank and takes the average across multiple queries.
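The following minimal Python implementations may help make these definitions concrete. They are plain textbook formulations operating on a toy ranked list of document ids and a set of relevant ids, not an evaluation toolkit.

```python
# Textbook ranked-retrieval metrics (illustration only).

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall(ranked, relevant):
    # Fraction of all relevant documents that were retrieved.
    return sum(1 for d in ranked if d in relevant) / len(relevant)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision(ranked, relevant):
    # Average of precision@k taken at each rank k holding a relevant doc.
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document (0 if none retrieved).
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / k
    return 0.0

ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
p, r = precision_at_k(ranked, relevant, 5), recall(ranked, relevant)
print(p, r, f1(p, r))                       # 0.4 0.666... 0.5
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 3 = 0.333...
print(reciprocal_rank(ranked, relevant))    # first hit at rank 2 -> 0.5
```

Averaging `average_precision` and `reciprocal_rank` over a set of queries yields MAP and MRR, respectively.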
These evaluation measures are used in experimental settings where a set of queries and relevant
documents are predefined. The effectiveness of an IR system is evaluated by comparing its
performance against a ground truth set of relevant documents. Additionally, evaluation may also
involve user studies, where human assessors judge the relevance of retrieved documents based
on their expertise or preferences.
It's important to note that evaluation in IR is an ongoing area of research, and different
evaluation measures may be used depending on the specific task, dataset, or application.
Researchers and practitioners continually work to develop new evaluation techniques that better
reflect user needs and system performance in real-world scenarios.
10. Tools, Software and Resources
There are numerous tools, software libraries, and resources available for Natural Language
Processing (NLP) that can assist with various NLP tasks. Here are some commonly used ones:
1. NLTK (Natural Language Toolkit): NLTK is a popular open-source library for NLP written in
Python. It provides a wide range of functionalities and tools for tasks such as tokenization,
stemming, tagging, parsing, and classification. It also offers access to corpora, lexical resources,
and pre-trained models.
2. spaCy: spaCy is a Python library for advanced NLP tasks. It provides efficient tokenization,
named entity recognition, part-of-speech tagging, dependency parsing, and lemmatization. spaCy
focuses on performance and is known for its speed and ease of use.
3. Gensim: Gensim is a Python library for topic modeling, document similarity analysis, and other
unsupervised NLP tasks. It offers implementations of popular algorithms like Latent Semantic
Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec.
4. Stanford CoreNLP: Stanford CoreNLP is a suite of NLP tools developed by Stanford
University. It offers a wide range of capabilities, including tokenization, part-of-speech tagging,
named entity recognition, dependency parsing, sentiment analysis, and coreference resolution.
CoreNLP supports multiple languages and provides Java APIs along with wrappers for other
programming languages.
5. TensorFlow: TensorFlow is an open-source deep learning framework that includes tools and
libraries for NLP. It integrates with TensorFlow Hub, a repository of pre-trained
models for tasks like text classification, machine translation, and text generation. TensorFlow
also supports the development of custom NLP models using deep learning architectures.
6. PyTorch: PyTorch is another popular deep learning framework that offers support for NLP. It
provides tools for building and training neural networks, including modules for text
classification, sequence labeling, and language generation. PyTorch also offers pre-trained
models, such as BERT and GPT, for various NLP tasks.
7. WordNet: WordNet is a lexical database that organizes words into sets of synonyms called
synsets. It also provides semantic relationships between words, such as hypernyms, hyponyms,
and meronyms. WordNet is widely used for tasks like word sense disambiguation, lexical
similarity, and semantic analysis.
8. Word Embeddings: Word embeddings are distributed representations of words in a continuous
vector space. Pre-trained word embeddings, such as Word2Vec, GloVe, and FastText, are
available for download and can be used to capture semantic relationships between words in NLP
models.
9. Universal Dependencies: Universal Dependencies is a project that provides syntactic annotation
standards for a large number of languages. It offers pre-annotated treebanks that represent the
syntactic structure of sentences, which can be used for tasks like parsing and dependency
analysis.
10. BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained deep
learning model that has achieved state-of-the-art performance on various NLP tasks, including
question answering, named entity recognition, and sentiment analysis. The original BERT model
and its variations are available for fine-tuning and transfer learning.
11
These are just a few examples of the many tools, software libraries, and resources available for
NLP. The choice of tools depends on the specific task, programming language preference, and
the complexity of the project at hand. It's important to explore and experiment with different
tools to find the ones that best suit your needs.
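As a brief taste of the first two libraries above, the following sketch tags and parses one sentence. It assumes NLTK's tokenizer and tagger resources have been downloaded and spaCy's en_core_web_sm model has been installed; both are assumptions of this example.

```python
# Quick NLTK and spaCy usage (requires nltk resources and
# the spaCy model: python -m spacy download en_core_web_sm).
import nltk
import spacy

text = "Stanford University is located in California."

# NLTK: tokenization and POS tagging.
print(nltk.pos_tag(nltk.word_tokenize(text)))

# spaCy: named entity recognition and dependency parsing.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])         # e.g. ORG, GPE
print([(tok.text, tok.dep_, tok.head.text) for tok in doc])  # dependency arcs
```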
11. Multilingual Summarization
Multilingual summarization produces concise summaries of documents written in
multiple languages. Common techniques include:
1. Statistical Methods: These methods analyze the statistical properties of the text, such
as word frequency and distribution, to identify key sentences or phrases for
summarization. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency)
and graph-based algorithms are commonly used.
2. Machine Learning: Machine learning techniques, including supervised and
unsupervised algorithms, are employed to train models on multilingual datasets. These
models learn to recognize important content and generate summaries based on
patterns observed in the training data.
3. Cross-Lingual Transfer Learning: This approach involves leveraging pre-trained
language models that have been trained on large multilingual corpora. These models
can understand and generate text in multiple languages, enabling them to perform
summarization tasks across different languages without needing language-specific
training data.
4. Alignment-based Methods: These methods align sentences or phrases in different
languages to find corresponding content, allowing for the creation of summaries that
preserve the meaning and coherence across languages. Techniques like parallel text
alignment and cross-lingual sentence embeddings are used in this approach.
5. Hybrid Approaches: Some systems combine multiple techniques to achieve better
performance in multilingual summarization tasks. For example, a system might use
statistical methods for sentence extraction and machine learning models for content
scoring and summarization generation.
Multilingual summarization is particularly important given the rapidly growing volume of
multilingual content on the web. It helps users quickly grasp the main points of documents
written in different languages, thereby improving efficiency and accessibility in multilingual
communication and information processing.
More broadly, summarization approaches in NLP can be grouped as follows:
1. Extractive Summarization:
Extractive summarization involves selecting and combining important sentences
or phrases directly from the original text to form a summary.
Techniques such as TextRank, a graph-based algorithm similar to Google's
PageRank, and algorithms based on clustering and sentence scoring are
commonly used.
Extractive methods are relatively simple and preserve the wording of the original
text but may lack coherence and fluency (a minimal TextRank-style sketch follows this list).
2. Abstractive Summarization:
Abstractive summarization involves generating new sentences that capture the
essence of the original text in a more concise form.
This approach often utilizes deep learning techniques, such as Recurrent Neural
Networks (RNNs), Transformer-based models (like BERT and GPT), and sequence-
to-sequence models (such as LSTM or GRU with attention mechanisms).
Abstractive methods can produce more coherent and fluent summaries
compared to extractive methods but may struggle with maintaining factual
accuracy and generating grammatically correct sentences.
3. Query-Based Summarization:
Query-based summarization focuses on generating summaries based on specific
queries or questions provided by the user.
It typically involves techniques like Information Retrieval (IR) to identify relevant
passages or sentences from the text that address the query, followed by
summarization methods to generate a concise answer.
Query-based approaches are useful for tasks such as question answering and
providing contextually relevant summaries for user queries.
4. Domain-Specific Summarization:
Domain-specific summarization techniques are tailored to particular domains or
types of documents, such as scientific articles, legal documents, or news articles.
These methods often incorporate domain-specific knowledge and terminology to
improve the quality and relevance of the summaries.
Domain-specific summarization may involve specialized pre-processing steps,
feature engineering, or fine-tuning of models on domain-specific datasets.
5. Multi-Document Summarization:
Multi-document summarization aims to generate a summary from a collection of
multiple documents on a given topic or event.
Techniques for multi-document summarization include clustering similar
documents, identifying important information common across documents, and
fusion of summaries from individual documents.
Multi-document summarization is particularly useful for tasks such as literature
reviews, news aggregation, and summarizing discussions on social media.
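Referring back to the extractive approach in item 1, here is a minimal TextRank-style sketch: sentences are vectorized with TF-IDF, a similarity graph is built, and PageRank scores select the summary sentences. It uses scikit-learn and networkx on toy data and is an illustration, not a production summarizer.

```python
# TextRank-style extractive summarization (illustration only).
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, n=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)             # sentence-to-sentence similarity
    graph = nx.from_numpy_array(sim)           # weighted similarity graph
    scores = nx.pagerank(graph)                # centrality of each sentence
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # keep original order

sentences = [
    "Machine translation has improved rapidly with neural models.",
    "The Transformer relies on attention instead of recurrence.",
    "Attention lets the model focus on relevant source words.",
    "Lunch was served at noon.",
]
print(textrank_summary(sentences, n=2))
```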
12. Summarization Evaluation
Summarization evaluation is the process of assessing the quality and effectiveness of automatic
summarization systems. Evaluating summarization systems is crucial for determining their
performance, comparing different approaches, and guiding improvements in summarization
algorithms. Several evaluation metrics and methodologies are commonly used in the field of
automatic summarization:
1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
ROUGE is the most widely used family of summarization metrics. It measures the overlap of
n-grams, word sequences, and word pairs between the generated summary and one or more
reference summaries, with variants such as ROUGE-1 (unigrams), ROUGE-2 (bigrams), and
ROUGE-L (longest common subsequence).
2. BLEU (Bilingual Evaluation Understudy):
BLEU was originally designed for machine translation evaluation but has been adapted for
summarization evaluation.
It computes the precision of n-grams in the generated summary compared to the reference
summaries.
BLEU considers the presence of n-grams in both the generated and reference summaries,
focusing on fluency and grammaticality.
However, BLEU may penalize summaries that differ syntactically or structurally from the
reference summaries, which can be problematic for abstractive summarization systems.
3. METEOR (Metric for Evaluation of Translation with Explicit Ordering):
METEOR evaluates summarization systems by considering the precision, recall, and alignment
between the generated and reference summaries.
It incorporates stemming, synonymy, and paraphrasing to account for variations in word
choice and sentence structure.
METEOR is particularly useful for assessing summaries that may differ in wording or phrasing
from the reference summaries.
4. Human Evaluation:
Human evaluation involves having human judges assess the quality of summaries produced
by the summarization system.
Judges may rate summaries based on criteria such as informativeness, coherence, readability,
and relevance to the original text.
Human evaluation provides valuable insights into the overall quality and usability of
summaries but can be resource-intensive and subjective.
5. Task-Specific Evaluation:
Task-specific evaluation measures assess the performance of summarization systems based
on their effectiveness for specific downstream tasks.
For example, in question-answering tasks, the relevance and completeness of the generated
summaries in answering questions can be evaluated.
Task-specific evaluation provides a practical assessment of the utility of summaries for
specific applications.
Evaluating summarization systems often involves a combination of these metrics and methodologies
to provide a comprehensive understanding of their performance. The choice of evaluation metrics
depends on factors such as the nature of the summarization task, the availability of reference
summaries, and the desired characteristics of the generated summaries.
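To make ROUGE-1 concrete, the following is a minimal hand-rolled unigram-overlap computation. Real evaluations typically use an established package (for example, the rouge-score library), so treat this as an illustration of the formula only.

```python
# Hand-rolled ROUGE-1 (unigram overlap) for illustration.
from collections import Counter

def rouge_1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())            # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)    # reference words covered
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
# 5 of 6 unigrams match -> precision = recall = f1 = 0.833...
```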
13. Building a Summarizer
Building a summarizer in Natural Language Processing (NLP) involves several steps, from
preprocessing the text data to implementing the summarization algorithm. Here's a high-level
overview of the process:
1. Preprocessing:
Clean the text data: Remove irrelevant characters, punctuation, and special symbols.
Tokenization: Split the text into individual words or tokens.
Sentence segmentation: Split the text into sentences to process them independently.
2. Feature Extraction:
Calculate sentence importance scores: Assign importance scores to each sentence based on
criteria such as word frequency, position in the document, or semantic similarity to other
sentences.
Compute word embeddings: Represent words in the text as dense vectors in a high-
dimensional space to capture their semantic meanings.
3. Summarization Algorithm:
Extractive Summarization:
TextRank Algorithm: Apply the TextRank algorithm, a graph-based ranking algorithm
similar to Google's PageRank, to identify important sentences based on their
connectivity within the document.
Clustering: Cluster sentences based on similarity measures such as cosine similarity or
Jaccard similarity and select representative sentences from each cluster as the
summary.
Machine Learning Models: Train machine learning models (e.g., Support Vector
Machines, Random Forests) on labeled data to predict sentence importance scores
and select top-ranked sentences as the summary.
Abstractive Summarization:
Sequence-to-Sequence Models: Implement sequence-to-sequence models, such as
Recurrent Neural Networks (RNNs) with attention mechanisms or Transformer-based
architectures (e.g., BERT, GPT), to generate summaries by paraphrasing and
rephrasing the input text.
Reinforcement Learning: Train models using reinforcement learning techniques to
generate summaries by maximizing rewards based on predefined criteria (e.g.,
ROUGE scores, semantic similarity).
4. Post-processing:
Reconstruct the summary: Combine selected sentences or generated phrases to form a
coherent summary.
Ensure summary length: Limit the length of the summary to meet specific requirements or
constraints.
Remove redundant information: Filter out redundant sentences or phrases to improve the
quality and conciseness of the summary.
5. Evaluation:
Evaluate the performance of the summarization model using appropriate evaluation metrics
such as ROUGE, BLEU, METEOR, or human judgment.
Fine-tune the model parameters or adjust the summarization approach based on evaluation
results to improve performance.
6. Deployment and Integration:
Deploy the summarization model as a standalone application or integrate it into existing NLP
pipelines or platforms.
Provide an interface for users to input text documents and receive summarized outputs.
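Tying steps 1-4 together, here is a minimal end-to-end extractive sketch based on word-frequency sentence scoring. The scoring scheme and summary length are arbitrary illustrative choices, and NLTK's 'punkt' and 'stopwords' resources are assumed to be downloaded.

```python
# End-to-end frequency-based extractive summarizer (illustration only).
import nltk
from collections import Counter
from nltk.corpus import stopwords

def summarize(text, n_sentences=2):
    # 1. Preprocessing: sentence segmentation and tokenization.
    sentences = nltk.sent_tokenize(text)
    stops = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stops]
    # 2. Feature extraction: normalized word-frequency scores.
    freq = Counter(words)
    top = freq.most_common(1)[0][1]
    # 3. Score each sentence by the frequencies of the words it contains.
    scores = {}
    for i, sent in enumerate(sentences):
        for w in nltk.word_tokenize(sent.lower()):
            scores[i] = scores.get(i, 0) + freq.get(w, 0) / top
    # 4. Post-processing: keep the top sentences, restore document order.
    best = sorted(scores, key=scores.get, reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(best))

text = ("Neural machine translation has transformed NLP. "
        "Attention mechanisms let models focus on relevant words. "
        "Dinner plans were made separately. "
        "Translation quality improves with attention and more data.")
print(summarize(text))
```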
Throughout the process, it's essential to iterate on the design, experiment with different techniques,
and continuously evaluate and refine the summarization model to achieve optimal performance and
usability. Additionally, considering ethical implications such as fairness, bias, and privacy is crucial
when building NLP applications like summarizers.
14. Competitions and Datasets
Competitions and datasets play vital roles in advancing the field of Natural Language Processing
(NLP) by providing benchmarks for evaluating and comparing algorithms, fostering innovation, and
promoting collaboration among researchers and practitioners. Here's an overview of competitions
and datasets in NLP:
1. Competitions:
Shared Tasks: Many NLP competitions are organized as shared tasks, where participants are
given specific challenges or problems to solve using natural language processing techniques.
These tasks can range from text classification and sentiment analysis to machine translation
and summarization.
Evaluation Metrics: Competitions typically define evaluation metrics to assess the
performance of participants' solutions objectively. Common metrics include accuracy,
precision, recall, F1-score, and task-specific metrics like ROUGE for summarization or BLEU
for machine translation.
Platforms: Competitions are often hosted on platforms like Kaggle, SemEval (Semantic
Evaluation), and the Conference on Computational Natural Language Learning (CoNLL).
These platforms provide infrastructure for organizing competitions, submitting solutions, and
benchmarking results.
Community Engagement: Competitions encourage collaboration and knowledge sharing
within the NLP community. Participants often publish their approaches and findings, leading
to advancements in the field.
Example Competitions: Some notable NLP competitions include the SemEval tasks, the
General Language Understanding Evaluation (GLUE) benchmark, the Conversational
Intelligence Challenge (ConvAI), and the Text REtrieval Conference (TREC).
2. Datasets:
Annotated Datasets: Annotated datasets are essential for training and evaluating NLP
models. They contain text documents or corpora annotated with labels, annotations, or
ground truth information for specific tasks, such as sentiment analysis, named entity
recognition, or question answering.
Large-Scale Datasets: Large-scale datasets, such as the Common Crawl, Wikipedia dumps,
and the BooksCorpus, provide vast amounts of text data for training deep learning models.
These datasets enable researchers to develop models with better generalization and
scalability.
Task-Specific Datasets: Datasets are tailored to specific NLP tasks, such as sentiment
analysis (e.g., the IMDb dataset), machine translation (e.g., the WMT datasets), and question
answering (e.g., the SQuAD dataset). Task-specific datasets facilitate benchmarking and
comparison of models across different domains and applications.
Multilingual Datasets: Multilingual datasets contain text data in multiple languages,
allowing researchers to develop models that can understand and process text in different
languages. Examples include the Multi30K dataset for image captioning and the XNLI dataset
for cross-lingual natural language inference.
Ethical Considerations: It is crucial to attend to ethical issues when creating and
using datasets in NLP, such as ensuring privacy, avoiding bias and discrimination, and
obtaining informed consent from data subjects.
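As one concrete way of obtaining a task-specific dataset mentioned above, the following sketch loads SQuAD via the Hugging Face datasets library. It assumes that library is installed (pip install datasets) and that the dataset can be downloaded.

```python
# Loading SQuAD with the Hugging Face `datasets` library (assumed installed).
from datasets import load_dataset

squad = load_dataset("squad")   # downloads and caches the dataset
print(squad)                    # splits: train (~87k) and validation (~10k)
example = squad["train"][0]
print(example["question"])      # the question text
print(example["answers"]["text"])  # gold answer span(s)
```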
Both competitions and datasets are instrumental in driving progress and innovation in NLP, enabling
researchers and practitioners to develop more accurate, robust, and scalable natural language
processing solutions.
*****************************************END**********************************************