
TEXT AND SPEECH ANALYSIS

(AIML543PE04)
2024-25

Dr. DEEPA YOGISH


Associate Professor, Department of CSE, School of Engineering and Technology
CHRIST (Deemed to be University), Bangalore, India

Contact No.: +919632948736 | [email protected]

MISSION, VISION, CORE VALUES

VISION: Excellence and Service
MISSION: CHRIST is a nurturing ground for an individual’s holistic development to make effective contribution to the society in a dynamic environment.
CORE VALUES: Faith in God | Moral Uprightness | Love of Fellow Beings | Social Responsibility | Pursuit of Excellence

CHRIST (Deemed to be University), Bangalore


Vision and Mission

VISION : Excellence and Service

MISSION : CHRIST (Deemed to be University) is a nurturing ground for


an individual's holistic development to make effective contribution to
society in a dynamic environment.

Core Values:
• Faith in God
• Moral Uprightness
• Love of Fellow Beings
• Social Responsibility
• Pursuit of Excellence


SCHOOL OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Department’s Vision and Mission

VISION : “To fortify Ethical Computational Excellence”

MISSION :
● Imparts core and contemporary knowledge in the areas of
Computation and Information Technology
● Promotes the culture of research and facilitates higher studies
● Acquaints the students with the latest industrial practices, team
building and entrepreneurship
● Sensitizes the students to serve the environmental, social and ethical needs of society through lifelong learning.


Course Objectives
• Understand natural language processing basics and apply classification algorithms to text documents.

• Build Question Answering and dialogue systems.

• Develop a speech recognition system and a speech synthesizer.


Course Learning Outcomes

1. Apply natural language preprocessing techniques for text using NLTK. (L3)

2. Apply deep learning techniques for text classification and word embedding. (L3)

3. Develop language models for Question Answering and build chatbots and dialogue systems. (L3)

4. Develop deep learning models to convert text to speech. (L3)

5. Apply deep learning models for building speech recognition and text-to-speech systems. (L3)


Text Books
Daniel Jurafsky and James H. Martin, “Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition”, Third
Edition, 2022.


Reference Books

R1. Dipanjan Sarkar, “Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data”, Apress, 2018.

R2. Steven Bird, Ewan Klein, and Edward Loper, “Natural Language Processing with Python”, O’Reilly.

R3. Lawrence Rabiner, Biing-Hwang Juang, B. Yegnanarayana, “Fundamentals of Speech Recognition”, 1st Edition, Pearson, 2009.


Continuous Internal Assessments (CIAs)-Elective

● CIA 1 - 20 Marks

● CIA 2 (Mid Semester Exam (MSE)) - 50 Marks, weighted to 30 Marks

● CIA 3 - 20 Marks

● Attendance - 5 Marks
● Lab Exam - 50 Marks, weighted to 35 Marks
● End Semester Exam (ESE) - 50 Marks, weighted to 30 Marks


Continuous Internal Assessments (CIAs)


CIA Component | Details
CIA 1A | Programming Assignment (Unit 1)
CIA 1B | Problem Solving (Unit 2)
CIA 2 (MSE) | Closed Book Test (50 M): Q1 Unit 1, Q2 Unit 1, Q3 Unit 2, Q4 Unit 2, Q5 Unit 3, Q6 Unit 3 (answer any one of Q5 or Q6)
CIA 3A | Problem Solving - Real-Time Case Study (Units 3 and 4)
CIA 3B | Case Study Implementation (Units 3, 4 and 5)


Laboratory Components
● Observation – Algorithm/Pseudocode must be written.
○ Students must complete the observation before coming to the lab.
● Record – Program and Input & Output must be written (printout).
● Evaluation scheme for observation:
○ 10 marks – if completed on the same day
○ 5-8 marks – if completed before the next lab
○ Marked absent for the 1st hour – if not shown (if they complete it within that hour, they will be permitted to attend the 2nd hour and 5 marks will be given)
● Each lab experiment must be completed in the respective lab only.
● Record must be shown on a weekly basis (e.g., the Lab 1 record must be shown during Lab 2).

Unit – 1 -Natural Language Basics


● Foundations of natural language processing
● Language Syntax and Structure
● Text Preprocessing and Wrangling
● Text tokenization
● Stemming
● Lemmatization
● Removing stop words
● Feature Engineering for Text representation
● Bag of Words model
● Bag of N-Grams model
● TF-IDF model

INTRODUCTION

● Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of human behavior research.
● This integration marks a significant milestone in data collection and analysis, enabling users to unlock deeper insights from spoken language and empowering researchers and analysts with enhanced capabilities for understanding and interpreting human communication.
● Human interactions are a critical part of many organizations.
● Many organizations analyze speech or text via natural language processing (NLP) and link them to insights and automation such as text categorization, text classification, information extraction, etc.


● In business intelligence, speech and text analytics enable us to gain


insights into customer-agent conversations through sentiment
analysis, and topic trends.
● These insights highlight areas of improvement, recognition, and
concern, to better understand and serve customers and employees.
● Speech and text analytics features provide automated speech and text
analytics capabilities on 100% of interactions to provide deep insight
into customer-agent conversations.
● Speech and text analytics is a set of features that uses natural language
processing (NLP) to provide an automated analysis of an interaction’s
content and insight into customer-agent conversations.


NLP and Artificial Intelligence
• Branch of AI
• Interface with humans
• Deals with a complex artifact like language
• Deep and shallow NLP
• Super-applications of NLP

Difference from other AI tasks
• Higher-order cognitive skills
• Inherently discrete
• Diversity of languages


Monolingual Applications:
Document Classification, Sentiment Analysis, Entity Extraction, Relation Extraction, Information Retrieval, Question Answering, Conversational Systems

Cross-lingual Applications:
Translation, Transliteration, Information Retrieval, Question Answering, Conversation Systems

Mixed Language Applications:
Code-Mixing, Creole/Pidgin languages, Language Evolution, Comparative Linguistics


Analysis vs. Synthesis

Analysis: Document Classification, Sentiment Analysis, Entity Extraction, Relation Extraction, Information Retrieval, Parsing

Synthesis: Question Answering, Conversational Systems, Machine Translation, Grammar Correction, Text Summarization


Using Language Before ~1950


● Verbal communication between people
■ Day-to-day conversations

■ Oral history

● Written communication for people


■ Stone tablets, scrolls, books, etc.

■ Permanent record of written language

Source: Wiki Commons (CC BY-SA 4.0): Rosetta Stone



Since 1950: Communication with Machines


~50s-70s: Basic symbolic languages (e.g., punch cards)
~80s: Formal languages (e.g., programming languages)
Today: Natural language (e.g., conversational agents / chatbots)

Source: Wiki Commons (CC BY-SA 4.0): punch cards, programming



Communication with Machines


Humans ↔ Machines, via Natural Language:
Analysis (human → machine) and Generation (machine → human)

Source: Wiki Commons (CC BY-SA 4.0): gpu



Foundations of natural language processing


● Natural Language Processing (NLP) is the process of understanding and producing meaningful phrases and sentences in the form of natural language.
● Natural Language Processing comprises Natural Language Understanding (NLU) and Natural Language Generation (NLG).
● NLU takes natural language input and maps it into a representation the machine can work with, supporting tasks such as information extraction and retrieval, sentiment analysis, and more.
● NLG takes data as input and maps it into natural language output.
● NLP can be thought of as an intersection of Linguistics, Computer Science and Artificial Intelligence that helps computers understand, interpret and manipulate human language.


Fig. NLP Overview


● Ever since then, there has been an immense amount of study and
development in the field of Natural Language Processing.
● Today NLP is one of the most in-demand and promising fields of
Artificial Intelligence!
● There are two main parts to Natural Language Processing:

● 1. Data Preprocessing
● 2. Algorithm Development


Applications: Machine Translation, Information Retrieval, Question Answering, Dialogue Systems, Information Extraction, Summarization, Sentiment Analysis, ...

Core technologies: Language modeling, Part-of-speech tagging, Syntactic parsing, Named-entity recognition, Coreference resolution, Word sense disambiguation, Semantic Role Labelling, ...


● In Natural Language Processing, machine learning training algorithms


study millions of examples of text — words, sentences, and
paragraphs — written by humans.
● By studying the samples, the training algorithms gain an
understanding of the “context” of human speech, writing, and other
modes of communication.
● This training helps NLP software to differentiate between the
meanings of various texts.
● The five phases of NLP involve lexical (structure) analysis, parsing,
semantic analysis, discourse integration, and pragmatic analysis.
● Some well-known application areas of NLP are Optical Character
Recognition (OCR), Speech Recognition, Machine Translation, and
Chatbots.


NLP APPLICATIONS

● NLP application in almost daily use


■ Machine translation
■ Conversational agents (e.g., chat bots)
■ Text summarization
■ Text generation (e.g., autocomplete)

● Applications powered by NLP


■ Social media
■ Search engines
■ Writing assistants (e.g., grammar checking)


Machine Translation

Conversational Agents
● Conversational agents
— core components
■ Speech recognition

■ Language analysis

■ Dialogue processing

■ Information retrieval

■ Text-to-Speech


Conversational Agents — Question Answering


Text Summarization

Google's cloud unit looked into using artificial intelligence


to help a financial firm decide whom to lend money to. It
turned down the client's idea after weeks of internal
discussions, deeming the project too ethically dicey.
Google has also blocked new AI features analysing
emotions, fearing cultural insensitivity. Microsoft restricted
software mimicking voices and IBM rejected a client
request for an advanced facial-recognition system.

Text Generation
● Example: Autocomplete
■ Given the first words of a sentence,
predict the next most likely word


Text Generation
● Example: Image Captioning

➜ "A man riding a red


bicycle."


Other Applications
● Spelling correction

● Document clustering

● Document classification, e.g.:


■ Spam detection

■ Sentiment analysis

■ Authorship attribution


Question Answering: IBM’s Watson

• Won Jeopardy on February 16, 2011!

Clue: WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL
Answer: Bram Stoker


Dan Jurafsky

Information Extraction
Email:
  To: Dan Jurafsky
  Subject: curriculum meeting
  Date: January 15, 2012
  Hi Dan, we’ve now scheduled the curriculum meeting.
  It will be in Gates 159 tomorrow from 10:00-11:30.
  -Chris

→ Create new Calendar entry
  Event: Curriculum mtg
  Date: Jan-16-2012
  Start: 10:00am
  End: 11:30am
  Where: Gates 159

Dan Jurafsky

Information Extraction & Sentiment Analysis


Attributes: zoom, affordability, size and weight, flash, ease of use

Size and weight:
✓ nice and compact to carry!
✓ since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
✗ the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera


Dan Jurafsky

Machine Translation
• Helping human translators
• Fully automatic
Enter Source Text:

这 不过 是 一 个 时间 的 问题 .
Translation from Stanford’s Phrasal:

This is only a matter of time.


Language Technology (Dan Jurafsky)

Mostly solved:
● Spam detection (e.g., “Buy V1AGRA …” ✗ vs. “Let’s go to Agra!” ✓)
● Part-of-speech (POS) tagging (e.g., “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV)
● Named entity recognition (NER) (e.g., “Einstein met with UN officials in Princeton” → PERSON, ORG, LOC)

Making good progress:
● Sentiment analysis (e.g., “Best roast chicken in San Francisco!”, “The waiter ignored us for 20 minutes.”)
● Coreference resolution (e.g., “Carter told Mubarak he shouldn’t run again.”)
● Word sense disambiguation (WSD) (e.g., “I need new batteries for my mouse.”)
● Parsing (e.g., “I can see Alcatraz from the window!”)
● Machine translation (MT) (e.g., “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”)
● Information extraction (IE) (e.g., “You’re invited to our dinner party, Friday May 27 at 8:30” → add Party, May 27 to calendar)

Still really hard:
● Question answering (QA) (e.g., “How effective is ibuprofen in reducing fever in patients with acute febrile illness?”)
● Paraphrase (e.g., “XYZ acquired ABC yesterday” ≈ “ABC has been taken over by XYZ”)
● Summarization (e.g., “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”)
● Dialog (e.g., “Where is Citizen Kane playing in SF?” → “Castro Theatre at 7:30. Do you want a ticket?”)


What is Natural Language?


● Natural Language
■ Means of communicating thoughts, feelings, opinions, ideas, etc.

■ Not formal, yet systematic: rules can emerge which were not previously defined
(this includes that many rules can be bent until reaching a breaking point)

■ Characteristics: ambiguous, redundant, changing, unbounded, imprecise, etc.

● Text / Writing
■ Visual representation of verbal communication (i.e., Natural Language)

■ Writing system: agreed meaning behind the sets of characters that make up a text
(most importantly: letters, digits, punctuation, white space characters: spaces, tabs, new lines, etc.)


Core Building Blocks of (Written) Language

● Character: basic symbol of written language (letter, numeral, punctuation mark, etc.), e.g., r, e, a, c, t, i, o, n
● Morpheme (1..n characters): smallest meaning-bearing unit in a language, e.g., re-act-ion
● Word (1..n morphemes): single independent unit of language that can be represented, e.g., reaction
● Phrase (1..n words): group of words expressing a particular idea or meaning, e.g., his quick reaction
● Clause (1..n phrases): phrase with a subject and verb, e.g., his quick reaction saved him


Core Building Blocks of (Written) Language


● Sentence (1..n clauses): expresses an independent statement, question, request, exclamation, etc., e.g., His quick reaction saved him from the oncoming traffic.
● Paragraph (1..n sentences): self-contained unit of discourse in writing dealing with a particular point or idea, e.g., Bob lost control of his car. His quick reaction saved him from the oncoming traffic. Luckily nobody was hurt and the damage to the car was minimal.
● (Text) Document (1..n paragraphs): written representation of thought
● Corpus (1..n documents): collection of writings (i.e., written texts)


Morphemes
● Morpheme
■ Smallest meaning-bearing unit in a language ➜ word = 1..n
morphemes

● Example: Prefixes & Suffixes


■ Change the semantic meaning or the part of speech of the affected word

un-happy de-frost-er hope-less

■ Assign a particular grammatical property to that word (e.g., tense, number, possession, comparison)

walk-ed elephant-s Bob-'s fast-er


Examples

Word                          | Prefixes   | Stem      | Suffixes
dogs                          |            | dog       | -s
walked                        |            | walk      | -ed
imperfection                  | im-        | perfect   | -ion
hopelessness                  |            | hope      | -less -ness
undesirability                | un-        | desire    | -able -ity
unpremeditated                | un- pre-   | meditate  | -ed
antidisestablishmentarianism  | anti- dis- | establish | -ment -arian -ism

Examples with multiple stems: daydream-ing, paycheck-s, skydive-er
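To see how such affixes are handled in code, here is a minimal sketch (assuming NLTK is installed and its WordNet data can be downloaded) that compares a rule-based stemmer with a dictionary-based lemmatizer on a few of the example words above; the exact outputs depend on the stemmer's heuristics.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # the lemmatizer needs the WordNet lexicon

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['dogs', 'walked', 'hopelessness', 'undesirability']:
    print(word,
          '| stem:', stemmer.stem(word),                    # crude suffix stripping
          '| lemma:', lemmatizer.lemmatize(word, pos='n'))  # WordNet lookup, treating the word as a noun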


Natural language is the object of study of NLP

Linguistics is the study of natural language. Just as you need to know the laws of physics to build mechanical devices, you need to know the nature of language to build tools to understand/generate language.

Some interesting reading material:
1) Linguistics: Adrian Akmajian et al.
2) The Language Instinct: Steven Pinker (for a general audience; highly recommended)
3) Other popular linguistics books by Steven Pinker

Source: Wikipedia


Phonetics & Phonology

[Figures: the human vocal tract; the International Phonetic Alphabet (IPA) chart]

• Phonemes are the basic distinguishable sounds of a language
• Every language has a sound inventory


Language Diversity
Phonology/Phonetics:
- Retroflex sounds are mostly found in Indian languages
- Tonal languages (Chinese, Thai)

Morphology:
- Chinese: isolating language
- Malayalam: agglutinative language

Syntax:
- SOV language (Hindi): मैं बाज़ार जा रहा हूँ (Subject (S), Object (O), Verb (V))
- SVO language (English): I am going to the market (Subject (S), Verb (V), Object (O))
- Free-order vs. fixed-order languages


Five Phases of NLP


The five phases, from "shallower" to "deeper" analysis:

● Lexical Analysis (characters, morphemes, words; understanding the structure & meaning of words): Tokenization, Normalization, Stemming, Lemmatization
● Syntactic Analysis (organization of words into sentences): Part-of-Speech Tagging, Syntactic parsing (constituents, dependencies)
● Semantic Analysis (phrases, clauses, sentences; meaning of words and sentences): Word Sense Disambiguation, Named Entity Recognition, Semantic Role Labeling
● Discourse Analysis (paragraphs, documents; meaning of sentences in documents): Coreference / anaphora resolution, Ellipsis resolution
● Pragmatic Analysis (world knowledge, common sense; understanding & interpreting language in context): Textual Entailment, Intent recognition

Phase I: Lexical or morphological analysis

● The first phase of NLP is word structure analysis, which is referred to


as lexical or morphological analysis.
● A lexicon is defined as a collection of words and phrases in a given
language, with the analysis of this collection being the process of
splitting the lexicon into components, based on what the user sets as
parameters – paragraphs, phrases, words, or characters.
● Similarly, morphological analysis is the process of identifying the
morphemes of a word.
● A morpheme is a basic unit of English language construction, which
is a small element of a word, that carries meaning.


● These can be either a free morpheme (e.g. walk) or a bound


morpheme (e.g. -ing, -ed), with the difference between the two being
that the latter cannot stand on its own to produce a word with
meaning, and should be assigned to a free morpheme to attach
meaning.
● In search engine optimization (SEO), lexical or morphological
analysis helps guide web searching.
● For instance, when doing on-page analysis, you can perform lexical
and morphological analysis to understand how often the target
keywords are used in their core form (as free morphemes, or when in
composition with bound morphemes).


● This type of analysis can ensure that you have an accurate


understanding of the different variations of the morphemes that are
used.
● Morphological analysis can also be applied in transcription and
translation projects, so can be very useful in content repurposing
projects, and international SEO and linguistic analysis.


Lexical Analysis — Tokenization


● Tokenization
■ Splitting a sentence or text into meaningful / useful units

■ Different levels of granularity applied in practice

character-based: S h e ' s   d r i v i n g   f a s t e r   t h a n   a l l o w e d .

subword-based: She 's driv ing fast er than allow ed .

word-based: She's driving faster than allowed .
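A minimal sketch of these granularities in Python, assuming NLTK and its 'punkt' tokenizer data are available (subword tokenization normally requires a trained model such as BPE, so only the character and word levels are shown here):

import nltk
nltk.download('punkt', quiet=True)  # data for NLTK's default word tokenizer

text = "She's driving faster than allowed."

char_tokens = list(text)                # character-level tokens
word_tokens = nltk.word_tokenize(text)  # word-level tokens, e.g. ["She", "'s", "driving", ...]

print(char_tokens)
print(word_tokens)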


Phase II: Syntax analysis (parsing)

● Syntax Analysis is the second phase of natural language processing.


● Syntax analysis or parsing is the process of checking grammar,
word arrangement, and overall – the identification of relationships
between words and whether those make sense.
● The process involved examination of all words and phrases in a
sentence, and the structures between them.
● As part of the process, there’s a visualization built of semantic
relationships referred to as a syntax tree (similar to a knowledge
graph).
● This process ensures that the structure and order and grammar of
sentences makes sense, when considering the words and phrases that
make up those sentences.


● Syntax analysis also involves tagging words and phrases with POS tags. There are two common approaches to constructing the syntax tree – top-down and bottom-up – and both check for valid sentence formation, or else they reject the input.
● Syntax analysis can be beneficial for SEO in several ways:
● Programmatic SEO: Checking whether the produced content makes
sense, especially when producing content at scale using an automated
or semi-automated approach.
● Semantic analysis: Once you have a syntax analysis conducted,
semantic analysis is easy, as well as uncovering the relationship
between the different entities recognized in the content.


Syntactic Analysis — Part-of-Speech Tagging


● Part-of-Speech (POS) tagging
■ Labeling each word in a text corresponding to a part of speech

■ Basic POS tags: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection

Example: “Bob walked slowly because of his swollen ankle.”

Bob/NNP (proper noun, singular), walked/VBD (verb, past tense), slowly/RB (adverb), because/IN of/IN (preposition or subordinating conjunction), his/PRP$ (possessive pronoun), swollen/JJ (adjective), ankle/NN (noun, singular or mass), ./. (punctuation)


Syntactic Analysis — Syntactic Parsing


● Dependency parsing
■ Analyze the grammatical structure in a sentence

■ Find related words & the type of the relationship between them

Example: Dependency Graph


Phase III: Semantic analysis


● Semantic analysis is the third stage in NLP, when an analysis is
performed to understand the meaning in a statement.
● This type of analysis is focused on uncovering the definitions of
words, phrases, and sentences and identifying whether the way words
are organized in a sentence makes sense semantically.
● This task is performed by mapping the syntactic structure, and
checking for logic in the presented relationships between entities,
words, phrases, and sentences in the text. There are a couple of
important functions of semantic analysis, which allow for natural
language understanding:
● To ensure that the data types are used in a way that’s consistent with
their definition.
● To ensure that the flow of the text is consistent.


● Identification of synonyms, antonyms, homonyms, and other lexical


items.
● Overall word sense disambiguation.
● Relationship extraction from the different entities identified from the
text.
● There are several things you can utilise semantic analysis for in SEO.
Here are some examples:
● Topic modeling and classification – sort your page content into
topics (predefined or modelled by an algorithm).
● You can then use this for ML-enabled internal linking, where you link
pages together on your website using the identified topics.
● Topic modeling can also be used for classifying first-party collected
data such as customer service tickets, or feedback users left on your
articles or videos in free form (i.e. comments).


● Entity analysis, sentiment analysis, and intent classification –


● You can use this type of analysis to perform sentiment analysis and
identify intent expressed in the content analysed.
● Entity identification and sentiment analysis are separate tasks, and
both can be done on things like keywords, titles, meta descriptions,
page content, but works best when analysing data like comments,
feedback forms, or customer service or social media interactions.
● Intent classification can be done on user queries (in keyword research
or traffic analysis), but can also be done in analysis of customer
service interactions.


Semantic Analysis — Word Sense Disambiguation


● Word Sense Disambiguation (WSD)
■ Identification of the right sense of a word among all possible senses

■ Semantic ambiguity: many words have multiples meanings (i.e., senses)

Example: “She heard a loud shot from the bank during the time of the robbery.”

Possible senses of “bank”: sloping land; depository financial institution; arrangement of similar objects; …
Possible senses of “shot”: the act of firing a projectile; an attempt to score in a game; a consecutive series of pictures (film); …
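A quick way to experiment with WSD is NLTK's implementation of the classic Lesk algorithm, sketched below (assumes NLTK with its WordNet and punkt data; Lesk is a simple dictionary-overlap baseline, so the sense it picks is not always the intuitive one):

import nltk
from nltk.wsd import lesk

nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)

sentence = "She heard a loud shot from the bank during the time of the robbery."
tokens = nltk.word_tokenize(sentence)

for ambiguous in ['shot', 'bank']:
    sense = lesk(tokens, ambiguous)  # picks the WordNet synset whose gloss overlaps the context most
    print(ambiguous, '->', sense, '|', sense.definition() if sense else 'no sense found')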


Semantic Analysis — Named Entity Recognition


● Named Entity Recognition (NER)
■ Identification of named entities: terms that represent real-world objects

■ Examples: persons, locations, organizations, time, money, etc.

Example: “Chris booked a Singapore Airlines flight to Germany for S$1,200.”
Chris → PERSON, Singapore Airlines → ORGANIZATION, Germany → LOCATION, S$1,200 → MONEY
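A hedged sketch of NER with spaCy's small English model (assumes the en_core_web_sm model has been downloaded; the label names and exact spans depend on the model, so its output may differ slightly from the slide above):

# assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Chris booked a Singapore Airlines flight to Germany for $1,200.")

for ent in doc.ents:
    print(ent.text, '->', ent.label_)  # e.g., PERSON, ORG, GPE, MONEY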


Semantic Analysis — Semantic Role Labeling


● Semantic Role Labeling (SRL)
■ Identification of the semantic roles of these words or phrases in sentences

■ Express semantic roles as predicate-argument structures

Example: “The teacher sent the class the assignment last week.”
Who (the teacher) did What (sent) to Whom (the class), What exactly (the assignment), When (last week)


Phase IV: Discourse integration


● Discourse integration is the fourth phase in NLP, and simply means
contextualization.
● Discourse integration is the analysis and identification of the larger
context for any smaller part of natural language structure (e.g. a
phrase, word or sentence).
● During this phase, it’s important to ensure that each phrase, word, and
entity mentioned are mentioned within the appropriate context.
● This analysis involves considering not only sentence structure and
semantics, but also sentence combination and meaning of the text
as a whole.
● Otherwise, when analyzing the structure of text, sentences are broken
up and analyzed and also considered in the context of the sentences
that precede and follow them, and the impact that they have on the
structure of text.


● Some common tasks in this phase include: information extraction,


conversation analysis, text summarization, discourse analysis.
● Here are some complexities of natural language understanding
introduced during this phase:
● Understanding of the expressed motivations within the text, and its
underlying meaning.
● Understanding of the relationships between entities and topics
mentioned, thematic understanding, and interactions analysis.
● Understanding the social and historical context of entities mentioned.
● Discourse integration and analysis can be used in SEO to ensure that
appropriate tense is used, that the relationships expressed in the text
make logical sense, and that there is overall coherency in the text
analysed.


● This can be especially useful for programmatic SEO initiatives or text


generation at scale. The analysis can also be used as part of
international SEO localization, translation, or transcription tasks on
big corpuses of data.
● There are some research efforts to incorporate discourse analysis into
systems that detect hate speech (or in the SEO space for things like
content and comment moderation), with this technology being aimed
at uncovering intention behind text by aligning the expression with
meaning, derived from other texts.
● This means that, theoretically, discourse analysis can also be used for
modeling of user intent (e.g search intent or purchase intent) and
detection of such notions in texts.


Discourse Analysis — Coreference Resolution


● Coreference Resolution
■ Identification of expressions that refer to the same entity in a text

■ Entities can be referred to by named entities, noun phrases, pronouns, etc.

Mr Smith didn't see the car. Then it hit him.

Mr Smith didn't see the car. Then the car hit Mr Smith.


● Coreference resolution (CR) is the task of finding all linguistic expressions (called
mentions) in a given text that refer to the same real-world entity. After finding and
grouping these mentions we can resolve them by replacing, as stated above, pronouns
with noun phrases.
● Coreference resolution is an exceptionally versatile tool and can be applied to a variety
of NLP tasks such as text understanding, information extraction, machine translation,
sentiment analysis, or document summarization. It is a great way to obtain
unambiguous sentences which can be much more easily understood by computers.
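Off-the-shelf coreference models exist (for example as spaCy extensions), but their APIs change frequently, so the toy sketch below only illustrates the resolution step itself: given mention clusters that some model has already found, pronouns are replaced by the cluster's main noun phrase. The clusters here are supplied by hand, purely for illustration.

# Toy illustration of the "replace pronouns with their antecedent" step of
# coreference resolution. The clusters are hand-written, not model output.
def resolve(tokens, clusters):
    """clusters: {representative_mention: [token indices of the other mentions]}"""
    resolved = list(tokens)
    for head, mention_positions in clusters.items():
        for i in mention_positions:
            resolved[i] = head
    return ' '.join(resolved)

tokens = "Mr Smith did n't see the car . Then it hit him .".split()
clusters = {
    'the car':  [9],    # "it"  refers to "the car"
    'Mr Smith': [11],   # "him" refers to "Mr Smith"
}
print(resolve(tokens, clusters))
# -> "Mr Smith did n't see the car . Then the car hit Mr Smith ."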


Discourse Analysis — Ellipsis Resolution


● Ellipsis Resolution
■ Inference of ellipses using the surrounding context

■ Ellipsis: omission of a word or phrases in sentence

He studied at NUS, his brother at NTU.

He studied at NUS, his brother studied at NTU.

She's very funny. Her sister is not.

She's very funny. Her sister is not very funny.


Phase V: Pragmatic analysis

● Pragmatic analysis is the fifth and final phase of natural language


processing.
● As the final stage, pragmatic analysis extrapolates and incorporates
the learnings from all other, preceding phases of NLP.
● Pragmatic analysis involves the process of abstracting or extracting
meaning from the use of language, and translating a text, using
the gathered knowledge from all other NLP steps performed
beforehand.
● Here are some complexities that are introduced during this phase
● Information extraction, enabling an advanced text understanding
functions such as question-answering.
● Meaning extraction, which allows for programs to break down
definitions or documentation into a more accessible language.


● Understanding of the meaning of the words, and context, in which


they are used, which enables conversational functions between
machine and human (e.g. chatbots).
● Pragmatic analysis has multiple applications in SEO.
● One of the most straightforward ones is programmatic SEO and
automated content generation.
● This type of analysis can also be used for generating FAQ sections
on your product, using textual analysis of product documentation,
or even capitalizing on the ‘People Also Ask’ featured snippets by
adding an automatically-generated FAQ section for each page
you produce on your site.


Pragmatic Analysis — Textual Entailment


● Textual Entailment
■ Determining the inference relation between two short, ordered texts

■ Given a text t and hypothesis h, "t entails h" (t ⇒ h)


➜ someone reading t would infer that h is most likely true

t: A mixed choir is performing at the National Day parade.


t⇒h
h: The anthem is sung by a group of men and women.

Required world knowledge:


● Mixed choir: male and female members
● Singing a song is a performance
● "anthem" typically refers to "national anthem"

Pragmatic Analysis — Intent Recognition


● Intent Recognition
■ Classification of an utterance based on what the speaker/writer is trying to achieve

■ Core component of sophisticated chatbots

"I'm hungry!"
Intent: Action:
Additional context: Writer is looking Search for vegetarian restaurants in
➜ ➜
● The writer is vegetarian
for a place to eat and around VivoCity that are open.
● The writer is near VivoCity
● It's 1pm: lunch time
● ...
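Real systems use trained classifiers, but a toy keyword-based intent recognizer (all names and rules below are made up for illustration) shows the basic idea of mapping an utterance plus context to an intent and an action:

# Toy rule-based intent recognition; a real chatbot would use a trained classifier.
def recognize_intent(utterance):
    text = utterance.lower()
    if any(word in text for word in ('hungry', 'eat', 'lunch', 'dinner')):
        return 'find_food'
    if any(word in text for word in ('ticket', 'movie', 'playing')):
        return 'find_movie'
    return 'unknown'

def act(intent, context):
    # Context (dietary preference, location, time) narrows the final action.
    if intent == 'find_food':
        return f"search {context['diet']} restaurants open near {context['location']}"
    return 'ask a clarifying question'

context = {'diet': 'vegetarian', 'location': 'VivoCity', 'time': '1pm'}
intent = recognize_intent("I'm hungry!")
print(intent, '->', act(intent, context))  # find_food -> search vegetarian restaurants ...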


LANGUAGE SYNTAX AND STRUCTURE

● For any language, syntax and structure usually go hand in hand, where a set of specific rules, conventions, and principles govern the way words are combined into phrases, phrases get combined into clauses, and clauses get combined into sentences.
● In English, words usually combine together to form other
constituent units.
● These constituents include words, phrases, clauses, and
sentences.


● Consider the sentence “The brown fox is quick and he is jumping over the lazy dog”. It is made up of a bunch of words, and just looking at the words by themselves doesn't tell us much.

Fig. A bunch of unordered words don’t convey much information


● Knowledge about the structure and syntax of the language


is helpful in many areas like text processing, annotation,
and parsing for further operations such as text classification
or summarization.
● Typical parsing techniques for understanding text syntax
are mentioned below.
○ Parts of Speech (POS) Tagging
○ Shallow Parsing or Chunking
○ Constituency Parsing
○ Dependency Parsing


Considering the previous example sentence “The brown fox is quick and he
is jumping over the lazy dog”, if we were to annotate it using basic POS tags,
it would look like the following figure.

Thus, a sentence typically follows a hierarchical structure consisting of the following components:

sentence → clauses → phrases → words


Tagging Parts of Speech


Parts of speech (POS) are specific lexical categories to which words are
assigned, based on their syntactic context and role. Usually, words can fall into
one of the following major categories.
● N(oun): This usually denotes words that depict some object or entity,
which may be living or nonliving. Some examples would be fox , dog ,
book , and so on. The POS tag symbol for nouns is N.
● V(erb): Verbs are words that are used to describe certain actions, states,
or occurrences. There are a wide variety of further subcategories, such as
auxiliary, reflexive, and transitive verbs (and many more). Some typical
examples of verbs would be running , jumping , read , and write . The
POS tag symbol for verbs is V.
● Adj(ective): Adjectives are words used to describe or qualify other
words, typically nouns and noun phrases. The phrase beautiful flower
has the noun (N) flower which is described or qualified using the
adjective (ADJ) beautiful . The POS tag symbol for adjectives is ADJ .

● Adv(erb): Adverbs usually act as modifiers for other words


including nouns, adjectives, verbs, or other adverbs. The phrase
very beautiful flower has the adverb (ADV) very , which
modifies the adjective (ADJ) beautiful , indicating the degree to
which the flower is beautiful. The POS tag symbol for adverbs
is ADV.
● Besides these four major categories of parts of speech, there are other categories that occur frequently in the English language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others.
● Each POS tag like the noun (N) can be further subdivided into categories like singular nouns (NN), singular proper nouns (NNP), and plural nouns (NNS).


● The process of classifying and labeling POS tags for words is called parts-of-speech tagging or POS tagging.
● POS tags are used to annotate words and depict their POS,
which is really helpful to perform specific analysis, such as
narrowing down upon nouns and seeing which ones are the
most prominent, word sense disambiguation, and grammar
analysis.


1. Noun (NN): A word that represents a person, place, thing, or idea.


Examples: “cat,” “house,” “love.”
2. Verb (VB): A word that expresses an action or state of being.
Examples: “run,” “eat,” “is.”
3. Adjective (JJ): A word that describes or modifies a noun.
Examples: “red,” “happy,” “tall.”
4. Adverb (RB): A word that modifies a verb, adjective, or other adverb, often indicating manner,
time, place, degree, etc.
Examples: “quickly,” “very,” “here.”
5. Pronoun (PRP): A word that substitutes for a noun or noun phrase.
Examples: “he,” “she,” “they.”
6. Preposition (IN): A word that shows the relationship between a noun (or pronoun) and other
words in a sentence.
Examples: “in,” “on,” “at.”
7. Conjunction (CC): A word that connects words, phrases, or clauses.
Examples: “and,” “but,” “or.”
8. Interjection (UH): A word or phrase that expresses emotion or exclamation.
Examples: “wow,” “ouch,” “hey.”


Types of POS Tagging in NLP

● Rule-Based POS Tagging


● This method uses a set of predefined rules to assign POS tags
to words based on their context and surrounding words.
● Rule-based POS Tagging is like having a set of instructions
to decide which category each word in a sentence belongs to.
Imagine you have a rule that says, “If a word ends in ‘ing,’
it’s probably a verb.” So, when you see a word like
“running,” you automatically know it’s a verb because it ends
in “ing.”
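NLTK ships a simple rule-based tagger, RegexpTagger, that works exactly like the "ends in -ing" rule described above; the patterns below are a small illustrative subset, not a complete grammar:

# Rule-based POS tagging with NLTK's RegexpTagger (patterns are illustrative only).
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),       # gerunds: "running", "jumping"
    (r'.*ed$',  'VBD'),       # simple past: "walked"
    (r'.*ly$',  'RB'),        # adverbs: "slowly"
    (r'^(the|a|an)$', 'DT'),  # determiners
    (r'.*',     'NN'),        # default: tag everything else as a noun
]

tagger = RegexpTagger(patterns)
print(tagger.tag("the cat is running quickly".split()))
# e.g., [('the', 'DT'), ('cat', 'NN'), ('is', 'NN'), ('running', 'VBG'), ('quickly', 'RB')]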


● Transformation-Based Part-of-Speech
● Transformation-Based Part-of-Speech (POS) Tagging
is a method where you change the labels (tags) of
words in a sentence based on specific rules about their
context.
● These rules help adjust the tags to better fit the
grammar of the sentence, improving accuracy in
identifying whether a word is a noun, verb, adjective,
etc.


Sentence: “The cat chased the mouse.”

First, let’s assign initial tags to each word based on what we


know:

● “The” — Determiner (DET)


● “cat” — Noun (N)
● “chased” — Verb (V)
● “the” — Determiner (DET)
● “mouse” — Noun (N)


● Now, let’s apply the transformation rule


● Change the tag of a verb to a noun if it follows a determiner
like “the.”
● We see that “chased” follows “the,” so we change its tag
from Verb (V) to Noun (N).
● Updated tags:
○ “The” — Determiner (DET)
○ “cat” — Noun (N)
○ “chased” — Noun (N)
○ “the” — Determiner (DET)
○ “mouse” — Noun (N)


● Statistical POS Tagging


● Statistical POS tagging is another approach to
automatically assigning parts-of-speech (POS) tags
to words in a sentence.
● Unlike transformation-based tagging which relies
on rules, statistical tagging uses the power of
statistics and machine learning.


● Here’s how statistical POS tagging works:


● 1. Training the Model:
● First, we train a statistical model on a large corpus of labeled text
data. This corpus contains sentences where each word is tagged with
its correct POS tag. The model learns patterns and relationships
between words and their corresponding POS tags from this training
data.
● 2. Tagging Words:
● Once the model is trained, we use it to predict the POS tags for each
word in new, unseen sentences. The model analyzes the context of
each word in the sentence and predicts the most likely POS tag based
on the patterns it learned during training.


● 3. Probability-based Approach:
● Statistical POS tagging works on a probability-based
approach. For each word in the sentence, the model assigns a
probability distribution over all possible POS tags. The tag
with the highest probability is chosen as the predicted POS
tag for that word.
● 4. Evaluation and Refinement:
● After tagging the words in a sentence, we evaluate the
accuracy of the model’s predictions by comparing them to
the actual POS tags. If there are errors, we can refine the
model by retraining it on more labeled data or adjusting its
parameters.
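As a small concrete example of the train-then-tag workflow described above, the sketch below trains NLTK's frequency-based UnigramTagger on the tagged Penn Treebank sample that ships with NLTK; it is a much simpler model than modern statistical taggers, but it follows the same steps (train on labeled data, tag unseen text, evaluate):

# Statistical POS tagging sketch: train a unigram (frequency-based) tagger on
# labeled data, then tag new text and evaluate on held-out sentences.
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

nltk.download('treebank', quiet=True)

tagged_sents = treebank.tagged_sents()                 # gold-standard training data
train, test = tagged_sents[:3000], tagged_sents[3000:]

tagger = UnigramTagger(train)                          # learns the most frequent tag per word
print(tagger.tag('The cat chased the mouse'.split()))  # unseen words get tagged None
print('held-out accuracy:', tagger.accuracy(test))     # older NLTK versions call this .evaluate()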

● Statistical POS tagging uses statistical models to predict


the most likely POS tags for words in a sentence based on
patterns learned from training data.
● It’s a powerful and widely used technique in NLP for
various tasks like text analysis, machine translation, and
information retrieval.


● Let us consider both nltk and spacy which usually use


the Penn Treebank notation for POS tagging.
● NLTK and spaCy are two of the most popular Natural
Language Processing (NLP) tools available in Python.
● You can build chatbots, automatic summarizers, and
entity extraction engines with either of these libraries.
● While both can theoretically accomplish any NLP task,
each one excels in certain scenarios.
● The Penn Treebank, or PTB for short, is a dataset
maintained by the University of Pennsylvania.


import nltk
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')

# create a basic pre-processed corpus; don't lowercase, to keep POS context
# (normalize_corpus and news_df are assumed to be defined earlier)
corpus = normalize_corpus(news_df['full_text'], text_lower_case=False,
                          text_lemmatization=False, special_char_removal=False)

# demo of POS tagging for a sample news headline
sentence = str(news_df.iloc[1].news_headline)
sentence_nlp = nlp(sentence)

# POS tagging with spaCy
spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in sentence_nlp]
print(pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag', 'Tag type']))

# POS tagging with NLTK
nltk_pos_tagged = nltk.pos_tag(sentence.split())
print(pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag']))


● We can see that each of these libraries treats tokens in its own way and assigns specific tags to them. Based on what we see, spaCy seems to be doing slightly better than NLTK.


PARSING

● The word ‘parsing’, whose origin is the Latin word ‘pars’ (which means ‘part’), is used to draw the exact or dictionary meaning from the text.
● It is also called syntactic analysis or syntax analysis. Comparing the text against the rules of formal grammar, syntax analysis checks it for meaningfulness.
● A sentence like “Give me hot ice-cream”, for example, would be rejected by the parser or syntactic analyzer.
● Parsing may be defined as the process of analyzing strings of symbols in natural language conforming to the rules of formal grammar.


● In a typical flow, input text goes into a lexical analyzer that


produces individual tokens.
● These tokens are the input to a parser, which produces the
syntactic structure at the output.
● When this structure is graphically represented as a tree, it's
called a Parse Tree.
● A parse tree can be simplified into an intermediate representation
called Abstract Syntax Tree (AST).


● The parser is used to report any syntax errors.
● It helps to recover from commonly occurring errors so that the processing of the remainder of the program can be continued.
● A parse tree is created with the help of a parser.
● The parser is used to create the symbol table, which plays an important role in NLP.
● The parser is also used to produce intermediate representations (IR).


Shallow Parsing or Chunking

Based on the hierarchy we depicted earlier, groups of words make up


phrases. There are five major categories of phrases:
● Noun phrase (NP): These are phrases where a noun acts as the head
word. Noun phrases act as a subject or object to a verb.
● Verb phrase (VP): These phrases are lexical units that have a verb
acting as the head word. Usually, there are two forms of verb
phrases. One form has the verb components as well as other entities
such as nouns, adjectives, or adverbs as parts of the object.
● Adjective phrase (ADJP): These are phrases with an adjective as
the head word. Their main role is to describe or qualify nouns and
pronouns in a sentence, and they will be either placed before or after
the noun or pronoun.


● Adverb phrase (ADVP): These phrases act like adverbs since the
adverb acts as the head word in the phrase. Adverb phrases are used
as modifiers for nouns, verbs, or adverbs themselves by providing
further details that describe or qualify them.
● Prepositional phrase (PP): These phrases usually contain a
preposition as the head word and other lexical components like
nouns, pronouns, and so on. These act like an adjective or adverb
describing other words or phrases.
Shallow parsing, also known as light parsing or chunking, is a popular
natural language processing technique of analyzing the structure of a
sentence to break it down into its smallest constituents (which are tokens
such as words) and group them together into higher-level phrases. This
includes POS tags and phrases from a sentence.

● Unlike full parsing, which involves analyzing the


grammatical structure of a sentence, shallow parsing
focuses on identifying individual phrases or
constituents, such as noun phrases, verb phrases, and
prepositional phrases.
● Shallow parsing is an essential component of many
NLP tasks, including information extraction, text
classification, and sentiment analysis.
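A common lightweight way to do this in practice is NLTK's regular-expression chunker, sketched below: POS-tag the sentence, then group tag patterns into noun-phrase chunks (the grammar here is deliberately tiny, and the tagger data names are those used by current NLTK releases):

# Shallow parsing (chunking) sketch with NLTK's RegexpParser.
import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP chunk = optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(tagged)  # a shallow tree: NP chunks plus the remaining tagged words
print(tree)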


● Parsing is the process of finding a parse tree that is


consistent with the grammar rules – in other words,
we want to find the set of grammar rules and their
sequence that generated the sentence.
● A parse tree not only gives us the POS tags, but also
which set of words are related to form phrases and
also the relationship between these phrases.


Fig. An example of shallow parsing depicting higher level phrase annotations


● To effectively employ NLP in practical settings, one needs to


possess extensive knowledge of numerous ideas and
terminologies.
● Linguistic analytic approaches like dependency parsing and
syntactic parsing are used in natural language processing.
● Dependency parsing aims to highlight the interdependencies
between words in a phrase in order to show the grammatical
links between them.
● By creating a tree structure that illustrates these dependencies,
it aids in the comprehension of sentence structure.


● In a broader sense, syntactic parsing aims to disclose the basic


syntactic structure of a sentence, including components, phrase
borders, and grammatical rules.
● In order to support a variety of language processing tasks,
including statistical language modelling, part-of-speech (POS)
tagging, syntactic and semantic analysis, sentiment evaluation,
normalization, tokenization, and more, it is imperative that
dependency parsing and constituency parsing be used to extract
meaning and insights from text.
● When combined, these techniques help to process textual data
comprehensively, opening up a variety of language processing
applications


Constituency Parsing

● Constituent-based grammars are used to analyze and


determine the constituents of a sentence.
● These grammars can be used to model or represent the
internal structure of sentences in terms of a hierarchically
ordered structure of their constituents.
● Each and every word usually belongs to a specific lexical category in this case and forms the head word of different phrases.
● These phrases are formed based on rules called phrase
structure rules.


● Phrase structure rules form the core of constituency


grammars, because they talk about syntax and rules that
govern the hierarchy and ordering of the various
constituents in the sentences.
● These rules cater to two things primarily.
● They determine what words are used to construct the
phrases or constituents.
● They determine how we need to order these constituents
together.
● The generic representation of a phrase structure rule is S →
AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B
.

● While there are several rules the most important rule describes
how to divide a sentence or a clause.
● The phrase structure rule denotes a binary division for a
sentence or a clause as S → NP VP where S is the sentence or
clause, and it is divided into the subject, denoted by the noun
phrase (NP) and the predicate, denoted by the verb phrase (VP)
● A constituency parser can be built based on such
grammars/rules, which are usually collectively available as
context-free grammar (CFG) or phrase-structure grammar. The
parser will process input sentences according to these rules, and
help in building a parse tree.


Fig. An example of constituency parsing showing a nested hierarchical structure


● Constituency parsing is an important concept in Natural Language


Processing that involves analysing the structure of a sentence
grammatically by identifying the constituents or phrases in the
sentence and their hierarchical relationships.
● As we know that understanding natural language is a very complex
task as we have to deal with the ambiguity of the natural language
in order to properly understand the natural language.


Working of Constituency Parsing:

For understanding natural language the key is to understand


the grammatical pattern of the sentences involved.
The first step in understanding grammar is to segregate a
sentence into groups of words or tokens called
constituents based on their grammatical role in the
sentence.
Let’s understand this process with an example sentence:
“The lion ate the deer.”
Here, “The lion” represents a noun phrase, “ate” represents
a verb phrase, and “the deer” is another noun phrase.


Context-Free Grammar (CFG):


The most common technique used in constituency
parsing is Context-Free Grammar or CFG.
CFG works by organizing sentences into constituencies
based on a set of grammar rules (or productions).
These rules specify how individual words in a sentence
can be grouped to form constituents such as noun
phrases, verb phrases, preposition phrases, etc.
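The sketch below spells out a tiny hand-written CFG for the example sentence "The lion ate the deer" and parses it with NLTK's chart parser; the grammar covers only this one sentence and is purely illustrative:

# Constituency parsing sketch: a toy context-free grammar plus NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> DT N
    VP -> V NP
    DT -> 'the'
    N  -> 'lion' | 'deer'
    V  -> 'ate'
""")

parser = nltk.ChartParser(grammar)
sentence = 'the lion ate the deer'.split()

for tree in parser.parse(sentence):  # yields every parse tree licensed by the grammar
    tree.pretty_print()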


Dependency Parsing

● In dependency parsing, we try to use dependency-based grammars to


analyze and infer both structure and semantic dependencies and
relationships between tokens in a sentence.
● The basic principle behind a dependency grammar is that in any
sentence in the language, all words except one, have some
relationship or dependency on other words in the sentence.
● The word that has no dependency is called the root of the sentence.
● The verb is taken as the root of the sentence in most cases.
● All the other words are directly or indirectly linked to the root verb
using links, which are the dependencies.
● Considering the sentence “The brown fox is quick and he is jumping
over the lazy dog”, if we wanted to draw the dependency syntax tree
for this, we would have the structure


Fig. A dependency parse tree for a sentence


These dependency relationships each have their own meaning and are a part of a list of
universal dependency types.


● Some of the dependencies are as follows:


● The dependency tag det is pretty intuitive— it denotes the
determiner relationship between a nominal head and the
determiner. Usually, the word with POS tag DET will also have
the det dependency tag relation. Examples include fox → the
and dog → the.
● The dependency tag amod stands for adjectival modifier and
stands for any adjective that modifies the meaning of a noun.
Examples include fox → brown and dog → lazy.
● The dependency tag nsubj stands for an entity that acts as a
subject or agent in a clause. Examples include is → fox and
jumping → he.

Excellence and Service


116
CHRIST
Deemed to be University

● The dependencies cc and conj have more to do with linkages related to


words connected by coordinating conjunctions. Examples include is →
and and is → jumping.
● The dependency tag aux indicates the auxiliary or secondary verb in the
clause. Example: jumping → is.
● The dependency tag acomp stands for adjective complement and acts as
the complement or object to a verb in the sentence. Example: is → quick
● The dependency tag prep denotes a prepositional modifier, which
usually modifies the meaning of a noun, verb, adjective, or preposition.
Usually, this representation is used for prepositions having a noun or
noun phrase complement. Example: jumping → over.
● The dependency tag pobj is used to denote the object of a preposition.
This is usually the head of a noun phrase following a preposition in the
sentence. Example: over → dog.

Excellence and Service


117
CHRIST
Deemed to be University

● Dependency parsing, or DP, is the process of examining the relationships
between a sentence’s components to determine its grammatical structure.
It divides a sentence into parts based on those relationships.
The foundation of the approach is the notion that every linguistic unit in a
sentence is connected to the others by direct links. We refer to these
links as dependencies.

● Consider the phrase, “I prefer the morning flight through Denver.”

● The dependency structure of the statement is explained in the graphic

below:

Excellence and Service


118
CHRIST
Deemed to be University

Excellence and Service


119
CHRIST
Deemed to be University

Dependency Parsing (DP) is a modern parsing mechanism whose main concept is that each
linguistic unit (i.e., word) is related to the others by direct links. These direct links are called
‘dependencies’ in linguistics. For example, the following diagram shows the dependency grammar
for the sentence “John can hit the ball”.
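
A minimal sketch of dependency parsing with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sentence is the one from the example above.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John can hit the ball")

# Each token is linked to its head through a dependency label
for token in doc:
    print(token.text, "--", token.dep_, "-->", token.head.text)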

Excellence and Service


120
CHRIST
Deemed to be University

TEXT PREPROCESSING OR WRANGLING

● Text preprocessing or wrangling is a method to clean the text


data and make it ready to feed data to the model.
● Text data contains noise in various forms, such as emoticons,
punctuation, and text in different cases.
● In human language there are many different ways to say the same
thing, and this is the main problem we have to deal with: machines
do not understand words, they need numbers, so we need to
convert text to numbers in an efficient manner.
● Techniques to perform text preprocessing or wrangling are as
follows:

Excellence and Service


121
CHRIST
Deemed to be University

● Tokenization: Tokenization is the process of separating a piece of


text into smaller units called tokens. Given a document, tokens can
be sentences, words, subwords, or even characters depending on
the application.
● Noise cleaning: Special characters and symbols contribute to extra
noise in unstructured text. Using regular expressions to remove
them or using tokenizers, which do the pre-processing step of
removing punctuation marks and other special characters, is
recommended.
● Spell-checking: Documents in a corpus are prone to spelling
errors; In order to make the text clean for the subsequent
processing, it is a good practice to run a spell checker and fix the
spelling errors before moving on to the next steps.

Excellence and Service


122
CHRIST
Deemed to be University

● Stopwords Removal: Stop words are those words which are very
common and often less significant. Hence, removing these is a pre-
processing step as well.
● This can be done explicitly by retaining only those words in the
document which are not in the list of stop words or by specifying
the stop word list as an argument in CountVectorizer or
TfidfVectorizer methods when getting Bag-of-Words (BoW)/TF-IDF
scores for the corpus of text documents.
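
A minimal sketch of explicit stop-word removal with NLTK; the sample sentence is an assumption. The same effect can be obtained by passing stop_words='english' to CountVectorizer or TfidfVectorizer.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example sentence showing off stop word removal."
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stop word list
tokens = word_tokenize(text.lower())
filtered = [w for w in tokens if w not in stop_words]
print(filtered)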

Excellence and Service


123
CHRIST
Deemed to be University

● Stemming/Lemmatization: Both stemming and lemmatization are


methods to reduce words to their base form. While stemming
follows certain rules to truncate the words to their base form, often
resulting in words that are not lexicographically correct,
lemmatization always results in base forms that are
lexicographically correct.
● However, stemming is a lot faster than lemmatization. Hence, the
choice between stemming and lemmatization depends on whether the
application needs quick pre-processing or more accurate base forms.

Excellence and Service


124
CHRIST
Deemed to be University

TOKENIZATION
● Tokenization is a common task in Natural Language
Processing (NLP).
● It’s a fundamental step in both traditional NLP methods like
Count Vectorizer and Advanced Deep Learning-based
architectures like Transformers.
● As tokens are the building blocks of Natural Language, the
most common way of processing the raw text happens at the
token level.
● Tokens are the building blocks of Natural Language.
● Tokenization is a way of separating a piece of text into smaller
units called tokens. Here, tokens can be either words,
characters, or subwords.

Excellence and Service


125
CHRIST
Deemed to be University

● Hence, tokenization can be broadly classified into 3


types – word, character, and subword (n-gram
characters) tokenization.
● For example, consider the sentence: “Never give up”.
● The most common way of forming tokens is based on
space.
● Assuming space as a delimiter, the tokenization of the
sentence results in 3 tokens – Never-give-up.
● As each token is a word, it becomes an example of
Word tokenization
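
A minimal sketch of word and sentence tokenization with NLTK; the sample text is an assumption.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')   # tokenizer models (one-time download)

text = "Never give up. Tokens are the building blocks of Natural Language."

print(sent_tokenize(text))   # sentence-level tokens
print(word_tokenize(text))   # word-level tokens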

Excellence and Service


126
CHRIST
Deemed to be University

● Similarly, tokens can be either characters or sub-words.
● For example, let us consider “smarter”:
● 1. Character tokens: s-m-a-r-t-e-r
● 2. Sub-word tokens: smart-er
● Here, Tokenization is performed on the corpus to
obtain tokens. The following tokens are then used to
prepare a vocabulary.

Excellence and Service


127
CHRIST
Deemed to be University

● Vocabulary refers to the set of unique tokens in the corpus.


● Remember that vocabulary can be constructed by considering
each unique token in the corpus or by considering the top K
Frequently Occurring Words.
● Creating Vocabulary is the ultimate goal of Tokenization.

Excellence and Service


128
CHRIST
Deemed to be University

● One of the simplest hacks to boost the performance of the


NLP model is to create a vocabulary out of top K
frequently occurring words.
● Now, let’s understand the usage of the vocabulary in
Traditional and Advanced Deep Learning-based NLP
methods.
● Traditional NLP approaches such as Count Vectorizer and
TF-IDF use vocabulary as features. Each word in the
vocabulary is treated as a unique feature:

Excellence and Service


129
CHRIST
Deemed to be University

Traditional NLP: Count Vectorizer

Excellence and Service


130
CHRIST
Deemed to be University

● In Advanced Deep Learning-based NLP architectures, vocabulary is


used to create the tokenized input sentences. Finally, the tokens of
these sentences are passed as inputs to the model
● As discussed earlier, tokenization can be performed on word,
character, or subword level. It’s a common question – which
Tokenization should we use while solving an NLP task? Let’s
address this question here.
● Word Tokenization
● Word Tokenization is the most commonly used tokenization
algorithm. It splits a piece of text into individual words based on a
certain delimiter. Depending upon delimiters, different word-level
tokens are formed. Pretrained word embeddings such as Word2Vec
and GloVe come under word tokenization.

Excellence and Service


131
CHRIST
Deemed to be University

REMOVING STOP-WORDS

● The words which are generally filtered out before processing a natural
language are called stop words. These are actually the most common
words in any language (like articles, prepositions, pronouns, conjunctions,
etc.) and do not add much information to the text. Examples of a few stop
words in English are “the”, “a”, “an”, “so”, “what”. Stop words are
available in abundance in any human language. By removing these words,
we remove the low-level information from our text in order to give more
focus to the important information. In other words, we can say that the
removal of such words does not have any negative consequences on the
model we train for our task.
● Removal of stop words definitely reduces the dataset size and thus reduces
the training time due to the fewer number of tokens involved in the
training.

Excellence and Service


132
CHRIST
Deemed to be University

● We do not always remove the stop words. The removal of stop words is
highly dependent on the task we are performing and the goal we want to
achieve. For example, if we are training a model that can perform the
sentiment analysis task, we might not remove the stop words.
● Movie review: “The movie was not good at all.”
● Text after removal of stop words: “movie good”
● We can clearly see that the review for the movie was negative. However,
after the removal of stop words, the review became positive, which is not
the reality. Thus, the removal of stop words can be problematic here.
● Tasks like text classification do not generally need stop words as the other
words present in the dataset are more important and give the general idea
of the text. So, we generally remove stop words in such tasks.

Excellence and Service


133
CHRIST
Deemed to be University

● In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the
removal of stop words. So, think before performing this step. The catch here is that
no rule is universal and no stop words list is universal. A list not conveying any
important information to one task can convey a lot of information to the other task.
● Word of caution: Before removing stop words, research a bit about your task and the
problem you are trying to solve, and then make your decision

Excellence and Service


134
CHRIST
Deemed to be University

Next comes a very important question: why should we remove
stop words from the text? There are two main reasons:
1. They provide no meaningful information, especially if
we are building a text classification model. Therefore, we have
to remove stop words from our dataset.
2. As the frequency of stop words is very high, removing
them from the corpus results in much smaller data in terms
of size. The reduced size results in faster computations on text
data, and the text classification model has to deal with a
smaller number of features, resulting in a more robust model.

Excellence and Service


135
CHRIST
Deemed to be University

STEMMING
● Stemming is the process of reducing the morphological variants
of a word to a common root/base form.
● Stemming programs are commonly referred to as stemming
algorithms or stemmers.
● A stemming algorithm reduces the words “chocolates”,
“chocolatey”, “choco” to the root word, “chocolate” and
“retrieval”, “retrieved”, “retrieves” reduce to the stem
“retrieve”.
● Stemming is an important part of the pipelining process in
Natural language processing.
● The input to the stemmer is tokenized words. How do we get
these tokenized words? Well, tokenization involves breaking
down the document into different words.

Excellence and Service


136
CHRIST
Deemed to be University

● Stemming is a natural language processing technique that is


used to reduce words to their base form, also known as the
root form.
● The process of stemming is used to normalize text and make it
easier to process. It is an important step in text pre-
processing, and it is commonly used in information retrieval
and text mining applications.
● There are several different algorithms for stemming as
follows:
● · Porter stemmer
● · Snowball stemmer
● · Lancaster stemmer.
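
A small sketch comparing the three stemmers as implemented in NLTK; the word list is an assumption.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare how aggressively each algorithm truncates the same words
for word in ["chocolates", "retrieval", "running", "arguing"]:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))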

Excellence and Service


137
CHRIST
Deemed to be University

● The Porter stemmer is the most widely used algorithm, and it is based on a
set of heuristics that are used to remove common suffixes from words.
● The Snowball stemmer is a more advanced algorithm that is based on the
Porter stemmer, but it also supports several other languages in addition to
English. The Lancaster stemmer is a more aggressive stemmer and it is
less accurate than the Porter stemmer and Snowball stemmer.
● Stemming can be useful for several natural language processing tasks such
as text classification, information retrieval, and text summarization.
● However, stemming can also have some negative effects such as reducing
the readability of the text, and it may not always produce the correct root
form of a word. It is important to note that stemming is different from
Lemmatization.

Excellence and Service


138
CHRIST
Deemed to be University

● Lemmatization is the process of reducing a word to its base form, but unlike
stemming, it takes into account the context of the word, and it produces a valid
word, unlike stemming which can produce a non-word as the root form.
● Errors in Stemming:
● There are mainly two errors in stemming –
● · over-stemming
● · under-stemming
● Over-stemming occurs when two words that should have different stems are
reduced to the same root. Over-stemming can also be regarded as a false positive. Over-
stemming is a problem that can occur when using stemming algorithms in natural
language processing. It refers to the situation where a stemmer produces a root
form that is not a valid word or is not the correct root form of a word. This can
happen when the stemmer is too aggressive in removing suffixes or when it does
not consider the context of the word.

Excellence and Service


139
CHRIST
Deemed to be University

Over-stemming can lead to a loss of meaning and make the text less
readable. For example, the word “arguing” may be stemmed to “argu,” which
is not a valid word and does not convey the same meaning as the original
word. Similarly, the word “running” may be stemmed to “run,” which is the
base form of the word but it does not convey the meaning of the original
word.
To avoid over-stemming, it is important to use a stemmer that is appropriate
for the task and language. It is also important to test the stemmer on a sample
of text to ensure that it is producing valid root forms. In some cases, using a
lemmatizer instead of a stemmer may be a better solution as it takes into
account the context of the word, making it less prone to errors. Another
approach to this problem is to use techniques like semantic role labeling,
sentiment analysis, context-based information, etc. that help to understand
the context of the text and make the stemming process more precise.
Excellence and Service
140
CHRIST
Deemed to be University

Under-stemming occurs when two words that should be reduced to the same
stem are not. Under-stemming can be interpreted as a false
negative. Under-stemming is a problem that can occur when using stemming
algorithms in natural language processing. It refers to the situation where a
stemmer does not produce the correct root form of a word or does not reduce
a word to its base form. This can happen when the stemmer is not aggressive
enough in removing suffixes or when it is not designed for the specific task
or language.
Under-stemming can lead to a loss of information and make it more difficult
to analyze text. For example, the word “arguing” and “argument” may be
stemmed to “argu,” which does not convey the meaning of the original
words. Similarly, the word “running” and “runner” may be stemmed to
“run,” which is the base form of the word but it does not convey the meaning
of the original words.
Excellence and Service
141
CHRIST
Deemed to be University

To avoid under-stemming, it is important to use a stemmer that is appropriate


for the task and language.
It is also important to test the stemmer on a sample of text to ensure that it is
producing the correct root forms. In some cases, using a lemmatizer instead of a stemmer may be a better
solution as it takes into account the context of the word, making it less prone
to errors.
Another approach to this problem is to use techniques like semantic role
labeling, sentiment analysis, context-based information, etc. that help to
understand the context of the text and make the stemming process more
precise.

Excellence and Service


142
CHRIST
Deemed to be University

Applications of stemming:

● Stemming is used in information retrieval systems like search engines.


● It is used to determine domain vocabularies in domain analysis.
● It is used when indexing documents for search and to map documents to common
subjects by stemming their terms. Sentiment analysis, which examines reviews and
comments made by different users about a product or service, is frequently used for
product analysis, for example by online retail stores; stemming is applied as a
text-preparation step before the text is interpreted.
● A method of group analysis used on textual materials is called document
clustering (also known as text clustering). Important uses of it include subject
extraction, automatic document structuring, and quick information retrieval.
● Fun Fact: Google search adopted a word stemming in 2003. Previously a search
for “fish” would not have returned “fishing” or “fishes”.

Excellence and Service


143
CHRIST
Deemed to be University

Porter’s Stemmer algorithm


It is one of the most popular stemming methods proposed in 1980. It is based on the idea
that the suffixes in the English language are made up of a combination of smaller and
simpler suffixes. This stemmer is known for its speed and simplicity. The main
applications of Porter Stemmer include data mining and Information retrieval. However, its
applications are only limited to English words. Also, the group of stems is mapped on to
the same stem and the output stem is not necessarily a meaningful word. The algorithms
are fairly lengthy in nature and are known to be the oldest stemmer.
Example: EED -> EE means “if the word has at least one vowel and consonant plus
EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
Advantage: It produces the best output as compared to other stemmers and it has less error
rate.
Limitation: Morphological variants produced are not always real words.

Excellence and Service


144
CHRIST
Deemed to be University

Excellence and Service


145
CHRIST
Deemed to be University

Example: Step 1

Excellence and Service


146
CHRIST
Deemed to be University

Example: Steps 2a and 2b

Excellence and Service


147
CHRIST
Deemed to be University

Example: Step 5

Excellence and Service


148
CHRIST
Deemed to be University

Example Outputs

Excellence and Service


149
CHRIST
Deemed to be University

Lovins Stemmer

It was proposed by Lovins in 1968. It removes the longest suffix from a word, and the word is then recoded to convert the stem into
a valid word.

Example: sitting -> sitt -> sit


Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth' etc. Limitation: It is time consuming and frequently fails to form words
from stem.

Dawson Stemmer
It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last letter.

Advantage: It is fast in execution and covers more suffixes.

Limitation: It is very complex to implement.

Krovetz Stemmer

It was proposed in 1993 by Robert Krovetz. Following are the steps:

1) Convert the plural form of a word to its singular form.

2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.

Example: ‘children’ -> ‘child’

Advantage: It is light in nature and can be used as pre-stemmer for other stemmers.

Limitation: It is inefficient in case of large documents.


Excellence and Service
150
CHRIST
Deemed to be University

Xerox Stemmer

Example:

‘children’ -> ‘child’ ‘understood’ ->


‘understand’ ‘whom’ -> ‘who’

‘best’ -> ‘good’

N-Gram Stemmer

An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-
grams in common.

Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on string comparisons and it is language dependent. Limitation: It requires space to create and index the n-grams and it is
not time efficient.

Excellence and Service


151
CHRIST
Deemed to be University

Snowball Stemmer:
When compared to the Porter Stemmer, the Snowball Stemmer can map non-English
words too. Since it supports other languages the Snowball Stemmers can be called a
multi-lingual stemmer. The Snowball stemmers are also imported from the nltk
package. This stemmer is based on a programming language called ‘Snowball’ that
processes small strings and is the most widely used stemmer. The Snowball stemmer
is way more aggressive than Porter Stemmer and is also referred to as Porter2
Stemmer. Because of the improvements added when compared to the Porter Stemmer,
the Snowball stemmer has greater computational speed.
Lancaster Stemmer:
The Lancaster stemmers are more aggressive and dynamic compared to the other two
stemmers. The stemmer is really fast, but the algorithm is really confusing when
dealing with small words. But they are not as efficient as Snowball Stemmers. The
Lancaster stemmers save the rules externally and basically uses an iterative algorithm.
Lancaster Stemmer is straightforward, although it often produces results with
excessive stemming. Over-stemming renders stems non-linguistic or meaningless.
Excellence and Service
152
CHRIST
Deemed to be University

LEMMATIZATION
● Lemmatization is a text pre-processing technique used in natural
language processing (NLP) models to break a word down to its
root meaning to identify similarities.
● For example, a lemmatization algorithm would reduce the word
better to its root word, or lemma: good.
● In stemming, a part of the word is just chopped off at the tail
end to arrive at the stem of the word.
● There are different algorithms used to find out how many
characters have to be chopped off, but the algorithms don’t
actually know the meaning of the word in the language it
belongs to.

Excellence and Service


153
CHRIST
Deemed to be University

● In lemmatization, the algorithms have this knowledge.


● These algorithms refer to a dictionary to understand the
meaning of the word before reducing it to its root word, or
lemma.
● So, a lemmatization algorithm would know that the word better
is derived from the word good, and hence, the lemma is good.
● But a stemming algorithm wouldn’t be able to do the same.
There could be over-stemming or under-stemming, and the
word better could be reduced to either bet, or bett, or just
retained as better.

Excellence and Service


154
CHRIST
Deemed to be University

Fig. Stemming vs Lemmatization

Excellence and Service


155
CHRIST
Deemed to be University

● Lemmatization gives more context to chatbot conversations as it


recognizes words based on their exact and contextual
meaning.
● On the other hand, lemmatization is a time-consuming and slow
process.
● The obvious advantage of lemmatization is that it is more
accurate than stemming. So, if you’re dealing with an NLP
application such as a chat bot or a virtual assistant, where
understanding the meaning of the dialogue is crucial,
lemmatization would be useful.

Excellence and Service


156
CHRIST
Deemed to be University

● The WordNetLemmatizer is a class that uses the WordNet


database to perform lemmatization.
● Lemmatization is the process of grouping together the
different inflected forms of a word so they can be analyzed as
a single item.
● The WordNetLemmatizer has a method called lemmatize that
takes a word and an optional part-of-speech (POS) tag as
arguments and returns the lemma of the word.
● The POS tag can be "n" for noun, "v" for verb, "a" for
adjective, "r" for adverb, or "s" for satellite adjective. If no POS
tag is given, the default is "n" (noun).

Excellence and Service


157
CHRIST
Deemed to be University

# Import the WordNetLemmatizer
import nltk
from nltk.stem import WordNetLemmatizer

# The WordNet corpus must be available (one-time download)
nltk.download('wordnet')

# Create an instance of the WordNetLemmatizer
wnl = WordNetLemmatizer()

# Lemmatize some words with different POS tags
print(wnl.lemmatize("dogs"))              # noun
print(wnl.lemmatize("running", pos="v"))  # verb
print(wnl.lemmatize("better", pos="a"))   # adjective
print(wnl.lemmatize("slowly", pos="r"))   # adverb

Excellence and Service


158
CHRIST
Deemed to be University

9 different approaches to perform Lemmatization :


1. WordNet
2. WordNet (with POS tag)
3. TextBlob
4. TextBlob (with POS tag)
5. spaCy
6. TreeTagger
7. Pattern
8. Gensim
9. Stanford CoreNLP
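
As a small sketch of approach 5 (spaCy), assuming the en_core_web_sm model has been installed; the sample sentence is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running better than before")

# spaCy assigns a lemma to every token using its POS context
print([token.lemma_ for token in doc])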

Excellence and Service


159
CHRIST
Deemed to be University

Definition: Regular Expression


● Regular expressions or RegEx is defined as a sequence of
characters that are mainly used to find or replace patterns
present in the text.
● A regular expression is a set of characters or a pattern that is
used to find substrings in a given string.
● A regular expression (RE) is a language for specifying text
search strings.
● It helps us to match or extract other strings or sets of strings,
with the help of a specialized syntax present in a pattern.
● For Example, extracting all hashtags from a tweet, getting
email iD or phone numbers, etc from large unstructured text
content.

Excellence and Service


160
CHRIST
Deemed to be University

● Regex can be in the form of either meta characters or literal


characters.
○ Literal characters, or literals for short, represent basic
characters that can be used to represent different
meanings, such as: 'a’, 'b’, 'c’, '1’, '2’, '3’, and so forth.
○ Meta characters, on the other hand, represent characters
with a special and constant meaning, such as: '$’, '*’,
'+’, and so forth.

Excellence and Service


161
CHRIST
Deemed to be University

Sometimes, we want to identify the different components of an email address. Simply put,
a regular expression is an “instruction” that is given to a function on what and
how to match, search, or replace in a set of strings.

Excellence and Service


162
CHRIST
Deemed to be University

How can Regular Expressions be used in NLP?

1. To Validate data fields.

For Example, dates, email address, URLs, abbreviations, etc.

2. To Filter a particular text from the whole corpus.

For Example, spam, disallowed websites, etc.

3. To Identify particular strings in a text.

For Example, token boundaries

4. To convert the output of one processing component into the format


required for a second component.

Excellence and Service


163
CHRIST
Deemed to be University

Regular Expressions use cases

Regular Expressions are used in various tasks such as:

● Data pre-processing;
● Rule-based information Mining systems;
● Pattern Matching;
● Text feature Engineering;
● Web scraping;
● Data validation;

● Data Extraction.

Excellence and Service


164
CHRIST
Deemed to be University

Regular expressions are commonly used for:

● Search and match patterns in a string.


● Split strings with a particular pattern into substrings with a
particular pattern.
● Findall strings with a particular pattern.
● Sub out particular patterns in a string.

You must import the re module before using regular expressions.


Three commonly-used regex functions are re.search, re.findall, and
re.match.

Excellence and Service


165
CHRIST
Deemed to be University

The concept of Raw String in Regular Expressions

In the following example, we have a couple of backslashes present
in the string. But in a normal string, Python treats \n as “move to a new line”.

After seeing the output, you can observe that \n has moved the text
after it to a new line. Here “nayan” has become “ayan” and the \n
disappeared from the path. This is not what we want.

Excellence and Service


166
CHRIST
Deemed to be University

So, to resolve this issue, we use the “r” prefix to create a raw string:

Again, after seeing the output, we observe that the entire path is now printed out
correctly, simply by putting “r” in front of the path.

Therefore, it is always recommended to use raw strings instead of normal strings while
dealing with regular expressions.
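
A minimal sketch of the difference; the path used here is a made-up example.

# Normal string: '\n' is interpreted as a newline, so part of the path is lost
path = "C:\names\nayan"
print(path)

# Raw string: the backslashes are kept literally
raw_path = r"C:\names\nayan"
print(raw_path)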

Excellence and Service


167
CHRIST
Deemed to be University

Common Regex Functions used in NLP

To work with Regular Expressions, Python has a


built-in module known as “re”. Some common
functions from this module are as follows:
● re.search()
● re.match()
● re.sub()
● re.compile()
● re.findall()

Excellence and Service


168
CHRIST
Deemed to be University

re. search( )

● This function helps us to detect whether the given


regular expression pattern is present in the given
input string. It matches the first occurrence of a
pattern in the entire string and not just at the
beginning.
● It returns a Regex Object if the pattern is found in
the string, else it returns a None object.
● Syntax: re.search(patterns, string)
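
A small usage sketch (the sample text is an assumption):

import re

text = "Learning regular expressions needs regular practice."
match = re.search(r"regular", text)   # first occurrence anywhere in the string

if match:
    print(match.group(), match.start(), match.end())
else:
    print("No match")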

Excellence and Service


169
CHRIST
Deemed to be University

Excellence and Service


170
CHRIST
Deemed to be University

re.match( )

● This function will only match the string if the


pattern is present at the very start of the string.
● Syntax: re.match(patterns, string)
● we know that the output of the re.match is an
object, so to get the matched expression, we
will use the group() function of the match
object.
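
A small usage sketch (the sample strings are assumptions):

import re

text = "Analytics is fun"

result = re.match(r"Analytics", text)   # pattern is at the start, so it matches
print(result.group() if result else "No match")

print(re.match(r"fun", text))           # None: 'fun' is not at the start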

Excellence and Service


171
CHRIST
Deemed to be University

Excellence and Service


172
CHRIST
Deemed to be University

re.sub( )

● This function is used to substitute a substring with another


substring.
● Syntax: re.sub(patterns, Substitute, Input text)
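
A small usage sketch (the sample text is an assumption):

import re

text = "The brown fox jumps over the lazy dog"

# Replace every occurrence of 'The'/'the' with 'a'
print(re.sub(r"[Tt]he", "a", text))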

Excellence and Service


173
CHRIST
Deemed to be University

re.findall( )

● This function will return all the occurrences of the pattern from the string.
● Always recommended to use re.findall().
● It can work like both re.search() and re.match(). Therefore, the result of the
findall() function is a list of all the matches
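
A small usage sketch (the sample text is an assumption):

import re

text = "Order 66 was placed on 21 March for 3 items."

numbers = re.findall(r"\d+", text)   # every run of digits in the string
print(numbers)                        # ['66', '21', '3']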

Excellence and Service


174
CHRIST
Deemed to be University

re.search() vs. re.findall() vs. re.match()

● The re.search function stops searching for the pattern once it finds a
match in a string. It will find the first occurrence of the pattern(s) in
each string.
● The re.findall module searches for all occurrences of the pattern in a
string.
● The re.match function searches for the first occurrence of the pattern,
just as the re.search function does. However, if a string is broken up
by line and there is a first occurrence of the match on any line other
than the first, the re.match function will not find it. This is because
re.match only searches for the pattern(s) on the first line of each
string.

Excellence and Service


175
CHRIST
Deemed to be University

Using re.finditer()

Similar to re.findall(), we can also use re.finditer() function to find all occurrences of a
pattern in a given string and iterate over the match objects. It will print each matched
content (in our case digits or numbers) if found.

import re

pattern = r'\d+'
text = 'There are 123 apples and 456 oranges.'
matches = re.finditer(pattern, text)

# Each match object exposes the matched text via .group()
for match in matches:
    print(f'Match found: {match.group()}')

Excellence and Service


176
CHRIST
Deemed to be University

Using re.split()
If you are working on NLP projects like sentiment analysis , word embedding, document
similarity matching, etc. we often need to split a document based on some logic.

re.split() – to split a string at occurrences of a pattern and obtain a list of substrings.

import re
pattern = r'\s+'
text = 'Split this string.'
parts = re.split(pattern, text)
print(f'Split parts: {parts}')

Excellence and Service


177
CHRIST
Deemed to be University

re.compile()
Regular expressions are compiled into pattern objects, which have
methods for various operations such as searching for pattern matches
or performing string substitutions.
The code uses a regular expression pattern [a-e] to find and list all
lowercase letters from ‘a’ to ‘e’ in the input string “Aye, said Mr.
Gibenson Stark”. The output will be ['e', 'a', 'd', 'b', 'e', 'a'], which are
the matching characters.
Keep in mind that the compile() method is useful for defining and
creating regular expressions object initially and then using that object
we can look for occurrences of the same pattern inside various target
strings without rewriting it which saves time and improves
performance.
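
A sketch of the example described above:

import re

# Compile once, then reuse the pattern object
p = re.compile('[a-e]')

print(p.findall("Aye, said Mr. Gibenson Stark"))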
Excellence and Service
178
CHRIST
Deemed to be University

Python’s re.compile() method is used to compile a regular expression


pattern provided as a string into a regex pattern object (re.Pattern).
Later we can use this pattern object to search for a match inside
different target strings using regex methods such as a re.match() or
re.search().
We can compile a regular expression into a regex object to look for
occurrences of the same pattern inside various target strings
without rewriting it.
Avoid using the compile() method when you want to search for
various patterns inside the single target string. You do not need to use
the compile method beforehand because the compiling is done
automatically with the execution of other regex methods.

Excellence and Service


179
CHRIST
Deemed to be University

Excellence and Service


180
CHRIST
Deemed to be University

Special Sequences in Regular Expressions

1. \b
\b returns a match where the specified pattern is at the beginning or at
the end of a word.
2. \d
\d returns a match where the string contains digits (numbers from 0-9).
Adding '+' after '\d' will continue to extract digits until a non-digit character is encountered.
We can infer that \d+ repeats one or more occurrences of \d till the
non-matching character is found whereas \d does a character-wise
comparison.

Excellence and Service


181
CHRIST
Deemed to be University

3. \D

\D returns a match where the string does not contain any digit. It is
basically the opposite of \d.
4. \w

\w helps in extraction of alphanumeric characters only (characters from a


to Z, digits from 0-9, and the underscore _ character)
5. \W

\W returns match at every non-alphanumeric character. Basically opposite


of \w.

Excellence and Service


182
CHRIST
Deemed to be University

Brackets ([ ])
They are used to specify a disjunction of characters.
For Examples,
/[cC]hirag/ → Chirag or chirag
/[xyz]/ → ‘x’, ‘y’, or ‘z’
/[1234567890]/ → any digit

Here slashes represent the start and end of a particular


expression.
Excellence and Service
183
CHRIST
Deemed to be University

Dash (-)
They are used to specify a range.
For Examples,
/[A-Z]/ → matches an uppercase letter
/[a-z]/ → matches a lowercase letter
/[0–9]/ → matches a single digit

Excellence and Service


184
CHRIST
Deemed to be University

Caret (^)
They can be used for negation or just to mean ^.
For Examples,
/[ˆa-z]/ → not a lowercase letter
/[ˆCc]/ → neither ‘C’ nor ‘c’
/[ˆ.]/ → not a period
/[cˆ]/ → either ‘c’ or ‘ˆ’
/xˆy/ → the pattern ‘xˆy’
Excellence and Service
185
CHRIST
Deemed to be University

Question mark (?)


It marks the optionality of the previous
expression.
For Examples,
/maths?/ → math or maths
/colou?r/ → color or colour

Excellence and Service


186
CHRIST
Deemed to be University

What are Anchors?


These are special characters that help us to perform string operations either at the
beginning or at the end of text input. They are used to assert something about the
string or the matching process. Generally, they are not used in a specific word or
character but used while we are dealing with more general queries.
Caret character ‘^’

It specifies the start of the string. For a string to match the pattern, the character
followed by the ‘^’ in the pattern should be the first character of the string.
Dollar character ‘$’

It specifies the end of the string. For a string to match the pattern, the character
that precedes the ‘$’ in the pattern should be the last character in the string.

Excellence and Service


187
CHRIST
Deemed to be University

What are Quantifiers?

Some common Quantifiers are: ( *, +, ? and { } )

They allow us to mention and control over how many times

a specific character(s) pattern should occur in the given text.

Excellence and Service


188
CHRIST
Deemed to be University

Excellence and Service


189
CHRIST
Deemed to be University

Each of the earlier mentioned quantifiers can be


written in the form of {m,n} quantifier in the
following way:
● ‘?’ is equivalent to zero or once, or {0, 1}
● ‘*’ is equivalent to zero or more times, or {0, }
● ‘+’ is equivalent to one or more times, or {1, }

Excellence and Service


190
CHRIST
Deemed to be University

For Examples,
abc*: matches a string that has 'ab' followed by zero or more 'c'.
abc+: matches 'ab' followed by one or more 'c'
abc?: matches 'ab' followed by zero or one 'c'
abc{2}: matches 'ab' followed by 2 'c'
abc{2, }: matches 'ab' followed by 2 or more 'c'
abc{2, 5}: matches 'ab' followed by 2 upto 5 'c'
a(bc)*: matches 'a' followed by zero or more copies of the
sequence 'bc'

Excellence and Service


191
CHRIST
Deemed to be University

Pipe Operator
It’s denoted by ‘|’. It is used as an OR operator. We have to use it
inside the parentheses.
For Example, Consider the pattern ‘(play|walk)ed’ –
The above pattern will match both the strings — ‘played’ and
‘walked’.
Therefore, the pipe operator tells us about the place inside the
parentheses that can be either of the strings or the characters.

Excellence and Service


192
CHRIST
Deemed to be University

Escaping Special Characters


The characters which we discussed above in quantifiers
such as ‘?’, ‘*’, ‘+’, ‘(‘, ‘)’, ‘{‘, etc. can also appear in the
input text. So, In such cases to extract these specific
characters, we have to use the escape sequences. The
escape sequence, represented by a backslash ‘\’, is used to
escape the special meaning of the special characters.

Excellence and Service


193
CHRIST
Deemed to be University

The curly braces { … }


It tells the computer to repeat the preceding character (or set of
characters) for as many times as the value inside this bracket.
Example: {2} means that the preceding character is to be repeated 2
times, {min,} means the preceding character is matched min or more
times, and {min,max} means that the preceding character is repeated
at least min and at most max times.

Excellence and Service


194
CHRIST
Deemed to be University

Wildcard ( . )
The dot symbol can take the place of any other symbol, that is why it is called the
wildcard character.

Example :
The Regular expression .* will tell the computer that any character
can be used any number of times.

Excellence and Service


195
CHRIST
Deemed to be University

Meta Sequences

Excellence and Service


196
CHRIST
Deemed to be University

Excellence and Service


197
CHRIST
Deemed to be University

Excellence and Service


198
CHRIST
Deemed to be University

Metacharacters in Regular Expression

(.) matches any character (except newline character)


(^) starts with
It checks whether the string starts with the given pattern or not.
($) ends with
It checks whether the string ends with the given pattern or not.
(*) matches for zero or more occurrences of the pattern to the left of
it
(+) matches one or more occurrences of the pattern to the left of it

Excellence and Service


199
CHRIST
Deemed to be University

(?) matches zero or one occurrence of the pattern left to it.


(|) either or
The pipe(|) operator checks whether any of the two patterns, to its
left and right, is present in the String or not.

Excellence and Service


200
CHRIST
Deemed to be University

Example : Regular expression for an email address :


^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
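
A quick check of this pattern with re.match; the sample addresses below are made up for illustration.

import re

pattern = r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"

for email in ["[email protected]", "invalid@@example..com"]:
    print(email, "->", "valid" if re.match(pattern, email) else "invalid")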

PRACTICE:
https://www.w3resource.com/python-exercises/re/

Excellence and Service


201
CHRIST
Deemed to be University

LAB EXERCISE-1:

1.Implement a python code for preprocessing text


document using NLTK
A. TOKENIZATION
B. REMOVE PUNCTUATION
C. CONVERT UPPER CASE TO LOWER CASE
D. STOP WORD REMOVAL
E. FREQUENCY DISTRIBUTION
F. STEMMING AND LEMMATIZATION
G. POS TAGGING
H. PARSING-GENERATE TREE
I. CHUNKING
J. NER

Excellence and Service


202
CHRIST
Deemed to be University

LAB EXERCISE-2:To Create Regular expressions in Python


for detecting word patterns
1. Develop a Python program to validate and extract Social Security
Numbers (SSN) from a given text.
2. Develop a Python program that extracts and validates email addresses
from a given text.
3. Develop a Python code to replace a word from a given text.
4. Develop a Python code to find and identify a correct phone number
from a given list of numbers (solutions with answer code below). The
10-digit phone numbers should only contain numbers.
5. Develop a Python code that uses the regular expression \d{2}/\d{2}/\d{4}
to extract dates in the format “MM/DD/YYYY” from the given text.
6. Develop a python code to extract only URLs using regular
expressions from a given string.

Excellence and Service


203
CHRIST
Deemed to be University

INPUT

1. text = 'Employee ID: 123-45-6789 and Employee ID: 247-36-6788'
2. text = 'Contact us at [email protected] or
[email protected]'
3. text = 'I have an apple, and I love apples.'
4. numbers = ['1234567890', '9876543210', '123-456-7890',
'987654321']
5. text = 'Meeting on 03/15/2024 and 04/20/2024'
6. text = 'Visit our website at http://www.example.com or
check https://blog.example.com'

Excellence and Service


204
CHRIST
Deemed to be University

What is PunktSentenceTokenizer

● In NLTK, PUNKT is an unsupervised trainable model,


which means it can be trained on unlabeled data .
● This tokenizer divides a text into a list of sentences by
using an unsupervised algorithm to build a model for
abbreviation words, collocations, and words that start
sentences. It must be trained on a large collection of
plaintext in the target language before it can be used.
● For example, splitting a long text into sentences, the
following is the provided input text and the task to separate
the input into different sentences.

Excellence and Service


205
CHRIST
Deemed to be University

Excellence and Service


206
CHRIST
Deemed to be University

import nltk
nltk.download('punkt')
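
Once the Punkt model is downloaded, sentence splitting is available through sent_tokenize, which uses the pre-trained PunktSentenceTokenizer under the hood; the sample text is an assumption.

from nltk.tokenize import sent_tokenize

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars. "
        "Did he mind? He paid a lot for it.")

# Punkt knows that the periods in 'Mr.' and '1.5' do not end sentences
print(sent_tokenize(text))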

Excellence and Service


207
CHRIST
Deemed to be University

Chunking in Natural Language processing

● Chunking is defined as the process of natural language processing used to


identify parts of speech and short phrases present in a given sentence.
● Chunking is used to get the required phrases from a given sentence.
However, POS tagging can be used only to spot the parts of speech that
every word of the sentence belongs to.
● Chunking is extracting phrases from an unstructured text by evaluating a
sentence and determining its elements (Noun Groups, Verbs, verb groups,
etc.)
● When we have loads of descriptions or modifications around a particular
word or the phrase of our interest, we use chunking to grab the required
phrase alone, ignoring the rest around it. Hence, chunking paves a way to
group the required phrases and exclude all the modifiers around them which
are not necessary for our analysis. Summing up, chunking helps us extract
the important words alone from lengthy descriptions. Thus, chunking is a
step in information extraction.

Excellence and Service


208
CHRIST
Deemed to be University

This process of chunking in NLP is extended to various other applications; for instance, to
group fruits of a specific category, say, fruits rich in proteins as a group, fruits rich in
vitamins as another group, and so on. Besides, chunking can also be used to group similar
cars, say, cars supporting auto-gear into one group and the others which support manual
gear into another chunk and so on.

● Types of Chunking

● There are, broadly, two types of chunking:

● Chunking up

● Chunking down

Excellence and Service


209
CHRIST
Deemed to be University

Chunking in Python

● The high-level idea is that first, we tokenize our text. Now there is a
utility in NLTK which tags the words; pos_tag, which attaches a tag to
the words, for example, Verb conjunction etc.
● Then with the help of these tags, we can perform Chunking. If we want
to select verbs, we can write a grammar that selects the words with a
grammar tag.
● example of chunking :
● Sentence: The cat sat on the mat.
● POS tags: The/DT cat/NN sat/VBD on/IN the/DT mat/NN.
● Chunks:
● Noun Phrase: The cat
● Verb Phrase: sat
● Prepositional Phrase: on the mat
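
A minimal sketch of noun-phrase chunking with NLTK's RegexpParser; the chunk grammar here is an illustrative assumption (it extracts NPs only), and the tokenizer/tagger resources must be downloaded first.

import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
sentence = "The cat sat on the mat"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

grammar = "NP: {<DT>?<JJ>*<NN>}"   # optional determiner + adjectives + noun
chunk_parser = nltk.RegexpParser(grammar)

print(chunk_parser.parse(tagged))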

Excellence and Service


210
CHRIST
Deemed to be University

Information Extraction

Excellence and Service


211
CHRIST
Deemed to be University

Information extraction has many applications including −


Business intelligence
Resume harvesting
Media analysis
Sentiment detection
Patent search
Email scanning

Excellence and Service


212
CHRIST
Deemed to be University

Named-entity recognition (NER)

● Named-entity recognition (NER) is actually a way of extracting


some of the most common entities like names, organizations,
locations, etc.
● Let us see an example that took all the preprocessing steps such
as sentence tokenization, POS tagging, chunking, NER, and
follows the pipeline provided in the figure above.
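
A minimal sketch of such a pipeline with NLTK; the sample sentence and the download calls are assumptions.

import nltk

# One-time downloads: 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', 'words'
text = "Sundar Pichai is the CEO of Google, headquartered in California."

tokens = nltk.word_tokenize(text)      # tokenization
tagged = nltk.pos_tag(tokens)          # POS tagging
entities = nltk.ne_chunk(tagged)       # chunking + named-entity recognition
print(entities)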

Excellence and Service


213
CHRIST
Deemed to be University

Named Entity Recognition

● Named entity recognition (NER) is probably the first step towards


information extraction that seeks to locate and classify named entities
in text into pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary
values, percentages, etc. NER is used in many fields in Natural
Language Processing (NLP), and it can help answer many real-world
questions, such as:
○ Which companies were mentioned in the news article?
○ Were specified products mentioned in complaints or reviews?
○ Does the tweet contain the name of a person? Does the tweet
contain this person’s location?

Excellence and Service


214
CHRIST
Deemed to be University

Excellence and Service


215
CHRIST
Deemed to be University

● Named entity recognition is a natural language processing technique that can


automatically scan entire articles and pull out some fundamental entities in a
text and classify them into predefined categories. Entities may be,
● Organizations,
● Quantities,
● Monetary values,
● Percentages, and more.
● People’s names
● Company names
● Geographic locations (Both physical and political)
● Product names
● Dates and times
● Amounts of money
● Names of events

Excellence and Service


216
CHRIST
Deemed to be University

Excellence and Service


217
CHRIST
Deemed to be University

A typical NER model consists of the following three blocks:


Noun Phrase Identification

This step deals with extracting all the noun phrases from a text with the help
of dependency parsing and part of speech tagging.
Phrase Classification

In this classification step, we classified all the extracted noun phrases from
the above step into their respective categories. To disambiguate locations,
Google Maps API can provide a very good path. and to identify person names
or company names, the open databases from DBpedia, Wikipedia can be
used. Apart from this, we can also make the lookup tables and dictionaries by
combining information with the help of different sources.

Excellence and Service


218
CHRIST
Deemed to be University

Entity Disambiguation
Sometimes what happens is that entities are misclassified, hence
creating a validation layer on top of the results becomes useful. The
use of knowledge graphs can be exploited for this purpose. Some of
the popular knowledge graphs are:
● Google Knowledge Graph,
● IBM Watson,
● Wikipedia, etc.

Excellence and Service


219
CHRIST
Deemed to be University

Any NER model is a two-step process:


● Detect a named entity
● Categorize the entity

So first, we need to create entity categories, like Name, Location,


Event, Organization, etc., and feed a NER model relevant training
data.
Then, by tagging some samples of words and phrases with their
corresponding entities, we’ll eventually teach our NER model to
detect the entities and categorize them.

Excellence and Service


220
CHRIST
Deemed to be University

FEATURE ENGINEERING FOR TEXT REPRESENTATION

● Feature engineering is one of the most important steps in


machine learning. It is the process of using domain
knowledge of the data to create features that make machine
learning algorithms work.
● Think of a machine learning algorithm as a learning child: the
more accurate the information you provide, the better it will
be able to interpret that information.
● Focusing first on our data will give us better results than
focusing only on models. Feature engineering helps us to
create better data which helps the model understand it well
and provide reasonable results.
Excellence and Service
221
CHRIST
Deemed to be University

Excellence and Service


222
CHRIST
Deemed to be University

● NLP is a subfield of artificial intelligence where we understand human


interaction with machines using natural languages. To understand a
natural language, you need to understand how we write a sentence,
how we express our thoughts using different words, signs, special
characters, etc basically we should understand the context of the
sentence to interpret its meaning.
● Extracting Features from Text
● Now we will learn about common feature extraction techniques and
methods.
● Feature extraction methods can be divided into 3 major categories,
basic, statistical, and advanced/vectorized.

Excellence and Service


223
CHRIST
Deemed to be University

Basic Methods
These feature extraction methods are based on various
concepts from NLP and linguistics. These are some of the
oldest methods, but they can still be very reliable and are used
frequently in many areas.
● Parsing
● PoS Tagging
● Name Entity Recognition (NER)
● Bag of Words (BoW)

Excellence and Service


224
CHRIST
Deemed to be University

Statistical Methods
This is a bit more advanced feature extraction method and uses the concepts from
statistics and probability to extract features from text data.
● Term Frequency-Inverse Document Frequency (TF-IDF)
Advanced Methods
These methods can also be called vectorized methods as they aim to map a word,
sentence, document to a fixed-length vector of real numbers. The goal of this
method is to extract semantics from a piece of text, both lexical and
distributional. Lexical semantics is just the meaning reflected by the words
whereas distributional semantics refers to finding meaning based on various
distributions in a corpus.
● Word2Vec
● GloVe: Global Vector for word representation

Excellence and Service


225
CHRIST
Deemed to be University

Excellence and Service


226
CHRIST
Deemed to be University

Bag-of-Words Model
● It is called a “bag” of words, because any information about the
order or structure of words in the document is discarded.
● The model is only concerned with whether known words occur
in the document, not where in the document.
● The bag of words model is one particularly simple way to
represent a document in numerical form before we can feed it
into a machine learning algorithm.
● For any natural language processing task, we need a way to
accomplish this before any further processing.
● It doesn’t take into account the order and the structure of the
words, but it only checks if the words appear in the document.

Excellence and Service


227
CHRIST
Deemed to be University

● Machine learning algorithms can’t operate on raw text;


we need to convert the text to some sort of numerical
representation. This process is also known as embedding
the text.
● There are two basic approaches to embedding a text:
word vectors and document vectors. With word vectors,
we represent each individual word in the text as a vector
(i.e., a sequence of numbers). We then convert the whole
document into a sequence of these word vectors.
Document vectors, on the other hand, embed the entire
document as a single vector.
Excellence and Service
228
CHRIST
Deemed to be University

● The bag of words model is a simple way to convert words


to numerical representation in natural language processing.
● This model is a simple document embedding technique
based on word frequency.
● Conceptually, we think of the whole document as a “bag”
of words, rather than a sequence.
● We represent the document simply by the frequency of
each word.
● Using this technique, we can embed a whole set of
documents and feed them into a variety of different
machine learning algorithms.
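
A minimal sketch of a bag-of-words representation with scikit-learn's CountVectorizer; the toy corpus is an assumption.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The lion ate the deer",
    "The brown fox is quick",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the vocabulary (one feature per word)
print(bow.toarray())                        # word counts per document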

Excellence and Service


229
CHRIST
Deemed to be University

BAG OF N-GRAMS MODEL

A bag-of-n-grams model is a way to represent a document, similar to a bag-of-words
model. A bag-of-n-grams model represents a text document
as an unordered collection of its n-grams.

For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”

becomes

● <start>James
● James is
● is the
● the best
● best person
● person ever.
● ever.<end>

Excellence and Service


230
CHRIST
Deemed to be University

In a typical bag-of-n-grams model, these 6 bigrams would be a


sample from a large number of bigrams observed in a corpus.
And then James is the best person ever. would be encoded in a
representation showing which of the corpus’s bigrams were
observed in the sentence. A bag-of-n-grams model has the
simplicity of the bag-of-words model but allows the
preservation of more word locality information.
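
A small sketch of a bag-of-bigrams with CountVectorizer and ngram_range; note that, unlike the illustration above, CountVectorizer lowercases the text and does not add <start>/<end> markers.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["James is the best person ever."]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())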

Excellence and Service


231
CHRIST
Deemed to be University

N-grams Model:
A more sophisticated approach is to create a vocabulary of grouped words. This changes both the scope of
the vocabulary and allows the bag-of-words to capture a little bit more meaning from the document.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is, in turn,
called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible
bigrams.

An N-gram is an N-token sequence of words: a 2-gram (more commonly


called a bigram) is a two-word sequence of words like “please turn”, “turn
your”, or “your homework”, and a 3-gram (more commonly called a
trigram) is a three-word sequence of words like “please turn your”, or
“turn your homework”.

Excellence and Service


232
CHRIST
Deemed to be University

NgramTagger
NgramTagger has 3 subclasses
● UnigramTagger
● BigramTagger
● TrigramTagger

BigramTagger subclass uses previous tag as part of its context


TrigramTagger subclass uses the previous two tags as part of its context.
ngram – It is a subsequence of n items.
Idea of NgramTagger subclasses :
● By looking at the previous words and P-O-S tags, part-of-speech tag for the current
word can be guessed.
● Each tagger maintains a context dictionary (ContextTagger parent class is used to
implement it).
● This dictionary is used to guess that tag based on the context.
● The context is some number of previous tagged words in the case of NgramTagger
subclasses.
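
A small sketch of a BigramTagger backing off to a UnigramTagger, trained on the NLTK Treebank sample; the train split size and the tagged sentence are assumptions.

import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger

nltk.download('treebank')

train_sents = treebank.tagged_sents()[:3000]

# The bigram tagger falls back to the unigram tagger for unseen contexts
unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)

print(bigram.tag("John can hit the ball".split()))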

Excellence and Service


233
CHRIST
Deemed to be University

Limitations of Bag-of-Words

● If we deploy bag-of-words to generate vectors for large


documents, the vectors would be of large sizes and would also
have too many null values leading to the creation of sparse
vectors.
● Bag-of-words does not bring in any information on the meaning
of the text. For example, if we consider these two sentences –
“Text processing is easy but tedious.” and “Text processing is
tedious but easy.” – a bag-of-words model would create the
same vectors for both of them, even though they have different
meanings.

Excellence and Service


234
CHRIST
Deemed to be University

Excellence and Service


235
CHRIST
Deemed to be University

TF-IDF MODEL
● tf–idf or TFIDF, short for term frequency-inverse document
frequency, is a numerical statistic that is intended to reflect
how important a word is to a document in a collection or
corpus.
● TF-IDF is a natural language processing (NLP) technique
that’s used to evaluate the importance of different words in a
sentence. It’s useful in text classification and for helping a
machine learning model read words.
● tf–idf is one of the most popular term-weighting schemes
today; 83% of text-based recommender systems in digital
libraries use tf–idf.

Excellence and Service


236
CHRIST
Deemed to be University

● This concept includes:


● Counts. Count the number of times each word appears in a document.
● Frequencies. Calculate the frequency with which each word appears in a document, out of all the words in the document.

Excellence and Service


237
CHRIST
Deemed to be University

Term Frequency:
Term frequency is defined as the number of times a word (i)
appears in a document (j) divided by the total number of words
in the document.
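A minimal sketch of this definition in plain Python (illustrative only):

# tf(t, d) = (number of times t appears in d) / (total number of words in d)
def term_frequency(term, document_tokens):
    return document_tokens.count(term) / len(document_tokens)

doc = "the car is driven on the road".split()
print(term_frequency("the", doc))   # 2/7 ≈ 0.286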

Excellence and Service


238
CHRIST
Deemed to be University

● Suppose we have a set of English text documents and wish to rank


which document is most relevant to the query, “Data Science is
awesome!”
● One way to start out is to eliminate documents that don’t contain all four words (“Data,” “Science,” “is,” and “awesome”), but this still leaves too many documents.
● To further distinguish them, we might count the number of times
each term occurs in each document.
● The number of times a term occurs in a document is called its term
frequency.
● The weight of a term that occurs in a document is simply
proportional to the term frequency.

Excellence and Service


239
CHRIST
Deemed to be University

What Is Document Frequency in TF-IDF?


● Document frequency (DF) measures how common a term is across the whole document set (corpus).
● This is very similar to TF. The difference is that TF counts how often a term t occurs within a single document d, whereas DF counts in how many of the N documents the term t occurs. In other words, DF is the number of documents in which the word is present.

Excellence and Service


240
CHRIST
Deemed to be University

Inverse Document Frequency:


● Inverse document frequency refers to the log of the total number of
documents divided by the number of documents that contain the
word. The logarithm is added to dampen the importance of a very
high value of IDF.

Excellence and Service


241
CHRIST
Deemed to be University

● Certain terms, such as “is,” “of,” and “that,” may appear


a lot of times but have little importance.
● We need to weigh down the frequent terms while scaling
up the rare ones.
● When we compute IDF, an inverse document frequency
factor is incorporated, which diminishes the weight of
terms that occur very frequently in the document set and
increases the weight of terms that rarely occur.

Excellence and Service


242
CHRIST
Deemed to be University

● IDF is used to calculate the weight of rare words across all documents in the corpus.
● Words that occur rarely in the corpus have a high IDF score.
● It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient).
● There are a few practical issues. If you have a large corpus, say 100,000,000 documents, the raw ratio N/df explodes; taking the logarithm dampens it.
● At query time, when a word that is not in the vocabulary occurs, its DF will be 0. Since we can’t divide by 0, we smooth the value by adding 1 to the denominator:
● idf(t) = log(N/(df + 1))
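A small sketch of this smoothed IDF, following the formula on the slide with a natural logarithm (illustrative only):

import math

# idf(t) = log(N / (df + 1)), N = number of documents,
# df = number of documents containing the term t.
def inverse_document_frequency(term, documents):
    N = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(N / (df + 1))

docs = [{"the", "car", "is", "driven", "on", "road"},
        {"the", "truck", "is", "driven", "on", "highway"}]
print(inverse_document_frequency("car", docs))       # log(2/2) = 0.0
print(inverse_document_frequency("unknown", docs))   # log(2/1) ≈ 0.693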

Excellence and Service


243
CHRIST
Deemed to be University

TF-IDF is computed by multiplying the term frequency with the inverse document frequency.

● TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a corpus. TF-IDF is a weighting scheme that assigns a weight to each word in a document based on its term frequency (TF) and its inverse document frequency (IDF). Words with higher weights are deemed more significant.

Excellence and Service


244
CHRIST
Deemed to be University

● TF-IDF gives larger values to less frequent words in the document corpus.
● The TF-IDF value is high when both the TF and IDF values are high, i.e., the word is rare across the whole corpus but frequent within a particular document.
● TF-IDF does not capture the semantic meaning of words.

Excellence and Service


245
CHRIST
Deemed to be University

Example

● Sentence 1: The car is driven on the road.


● Sentence 2: The truck is driven on the
highway.
● In this example, each sentence is a separate
document.
● We will now calculate the TF-IDF for the
above two documents, which represent our
corpus.

Excellence and Service


246
CHRIST
Deemed to be University

● We can see that the TF-IDF of the common words is zero, which shows they are not significant. On the other hand, the TF-IDF of “car”, “truck”, “road”, and “highway” is non-zero; these words carry more significance.
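The computation behind this observation can be reproduced with a short script that applies the TF and IDF definitions above (a sketch; IDF is taken here as log10(N/df) without smoothing, so words appearing in both documents get an IDF of exactly 0):

import math

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log10(N / df)

for doc in tokenized:
    print({t: round(tf(t, doc) * idf(t), 3) for t in sorted(set(doc))})
# 'the', 'is', 'driven', 'on' -> 0.0; 'car', 'road', 'truck', 'highway' -> ~0.043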
Excellence and Service
247
CHRIST
Deemed to be University

Reviewing: TF-IDF is the product of the TF and IDF scores of the term.
● TF = (number of times the term appears in the document) / (total number of words in the document)
● IDF = ln(number of documents / number of documents the term appears in)
● The higher the TF-IDF score, the rarer (and more distinctive) the term, and vice versa.
● TF-IDF is successfully used by search engines, such as Google, as a ranking factor for content.
● The whole idea is to weigh down the frequent terms while scaling up the rare ones.

Excellence and Service


248
CHRIST
Deemed to be University

Numerical Example

Imagine the term t appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of t can be calculated as follows:
tf(t, d) = 20/100 = 0.2
Assume the collection contains 10,000 documents. If 100 of those 10,000 documents contain the term t, the Inverse Document Frequency (IDF) of t can be calculated as follows (using log base 10):
idf(t) = log(10000/100) = 2

Using these two quantities, we can calculate the TF-IDF score of the term t for the document:
tf-idf(t, d) = 0.2 × 2 = 0.4

Excellence and Service


249
CHRIST
Deemed to be University

Example:Calculate tf-idf

Doc-1 : “Data Science is the sexiest job of the 21st


century”.
Doc-2: “machine learning is the key for data science”.

Excellence and Service


250
CHRIST
Deemed to be University

TF-IDF with example

Consider the three sentences given below:

1. Inflation has increased unemployment


2. The company has increased its sales
3. Fear increased his pulse

Excellence and Service


251
CHRIST
Deemed to be University

Step 1: Data Pre-processing

After lowercasing and removing stop words the sentences are transformed as below:

Excellence and Service


252
CHRIST
Deemed to be University

Step 2: Calculating Term Frequency

In this step, we have to calculate TF i.e., the Term Frequency of our given sentences.

Excellence and Service


253
CHRIST
Deemed to be University

Step 3: Calculating Inverse Document Frequency

Now, next, we have to calculate the Inverse Document Frequency (IDF) of all the words in
the sentences.

Excellence and Service


254
CHRIST
Deemed to be University

Step 4: Calculating Product of Term Frequency & Inverse Document


Frequency

Excellence and Service


255
CHRIST
Deemed to be University

Excellence and Service


256
CHRIST
Deemed to be University

So, to build a TF-IDF model, we give all the input features (i.e., the vocabulary) along with the documents to the TF-IDF model. The result is a TF-IDF matrix, which can then be used to train a machine learning model for document classification.

Excellence and Service


257
CHRIST
Deemed to be University

IDF Formula in sklearn
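For reference, scikit-learn's TfidfVectorizer uses a smoothed variant of the formula: with the default smooth_idf=True it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then L2-normalises each row, so its values differ slightly from the hand-computed ones above. A short sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
print(vec.idf_.round(3))      # per-term IDF values learned from the corpus
print(X.toarray().round(3))   # the TF-IDF matrix (row-normalised)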

Excellence and Service


258
CHRIST
Deemed to be University

Advantages of TF-IDF

● Measures relevance: TF-IDF measures the importance of a term in


a document, based on the frequency of the term in the document
and the inverse document frequency (IDF) of the term across the
entire corpus. This helps to identify which terms are most relevant
to a particular document.
● Handles large text corpora: TF-IDF is scalable and can be used
with large text corpora, making it suitable for processing and
analyzing large amounts of text data.
● Handles stop words: TF-IDF automatically down-weights common
words that occur frequently in the text corpus (stop words) that do
not carry much meaning or importance, making it a more accurate
measure of term importance.

Excellence and Service


259
CHRIST
Deemed to be University

○ Can be used for various applications: TF-IDF can be used


for various natural language processing tasks, such as text
classification, information retrieval, and document
clustering.
○ Interpretable: The scores generated by TF-IDF are easy to
interpret and understand, as they represent the importance of
a term in a document relative to its importance across the
entire corpus.
○ Works well with different languages: TF-IDF can be used
with different languages and character encodings, making it
a versatile technique for processing multilingual text data.

Excellence and Service


260
CHRIST
Deemed to be University

Limitations of TF-IDF

● Ignores the context: TF-IDF only considers the frequency of


each term in a document, and does not take into account the
context in which the term appears. This can lead to incorrect
interpretations of the meaning of the document.
● Assumes independence: TF-IDF assumes that the terms in a
document are independent of each other. However, this is often
not the case in natural language, where words are often related
to each other in complex ways.
● Vocabulary size: The vocabulary size can become very large
when working with large datasets, which can lead to high-
dimensional feature spaces and difficulty in interpreting the
results.
Excellence and Service
261
CHRIST
Deemed to be University

● No concept of word order: TF-IDF treats all words as equally important,


regardless of their order or position in the document. This can be
problematic for certain applications, such as sentiment analysis, where
word order can be crucial for determining the sentiment of a document.
● Limited to term frequency: TF-IDF only considers the frequency of each
term in a document and does not take into account other important
features, such as the length of the document or the position of the term
within the document.
● Sensitivity to stopwords: TF-IDF can be sensitive to stop words, which
are common words that do not carry much meaning, but appear
frequently in documents. Removing stop words from the document can
help to address this issue.

Excellence and Service


262
CHRIST
Deemed to be University

Applications of TF-IDF

● Search engines: TF-IDF is used in search engines to rank


documents based on their relevance to a query. The TF-IDF
score of a document is used to measure how well the document
matches the search query.
● Text classification: TF-IDF is used in text classification to
identify the most important features in a document. The TF-IDF
score of each term in the document is used to measure its
relevance to the class.
● Information extraction: TF-IDF is used in information extraction
to identify the most important entities and concepts in a
document. The TF-IDF score of each term is used to measure its
importance in the document.
Excellence and Service
263
CHRIST
Deemed to be University

● Keyword extraction: TF-IDF is used in keyword extraction to


identify the most important keywords in a document. The TF-
IDF score of each term is used to measure its importance in the
document.
● Recommender systems: TF-IDF is used in recommender
systems to recommend items to users based on their preferences.
The TF-IDF score of each item is used to measure its relevance
to the user’s preferences.
● Sentiment analysis: TF-IDF is used in sentiment analysis to
identify the most important words in a document that contribute
to the sentiment. The TF-IDF score of each word is used to
measure its importance in the document.

Excellence and Service


264
CHRIST
Deemed to be University

Top 8 Python Libraries For Natural Language Processing


(NLP) in 2024

Excellence and Service


265
CHRIST
Deemed to be University

Excellence and Service


266
UNIT-2
Text Classification

MISSION VISION CORE VALUES


CHRIST is a nurturing ground for an individual’s Excellence and Service Faith in God | Moral Uprightness
holistic development to make effective contribution to Love of Fellow Beings
the society in a dynamic environment Social Responsibility | Pursuit of Excellence
CHRIST
Deemed to be University

Text Classification

Vector Semantics and Embeddings, Word Embeddings, Word2Vec model, Glove


model, Fast Text model, Overview of Deep Learning models, RNN, Transformers,
Overview of Text Summarization and Topic Models.

Excellence and Service


268
CHRIST
Deemed to be University

Classification
What is classification?
✔ Classification is a process of categorizing a given set of data into
classes.
✔ It can be performed on both structured and unstructured data.
✔ A supervised learning approach.
✔ Categorizing some unknown items into a discrete set of
categories or classes.
✔ The target attribute is a categorical variable.
✔ The goal of data classification is to organize and categorize
data in distinct classes.

Excellence and Service


CHRIST
Deemed to be University

Basic Concepts (Contd...)


● Classification is an approach in machine learning that involves assigning a data point to a predefined label (class).
● It takes an input and runs it through a classification technique, or classifier, to map the input to a discrete class or category.

Excellence and Service


CHRIST
Deemed to be University

General Approach to Classification

● Data classification is a three-step process:


1. Learning step (where a classification model is constructed)
2. Model evaluation (Estimate accuracy rate of the model based on a test set)
3. Classification step (where the model is used to predict class labels for given
data).

Excellence and Service


CHRIST
Deemed to be University

Contd...
1. Model construction (Learning):

● Each tuple is assumed to belong to a predefined class, as determined by one of


the attributes, called the class label.
● The set of all tuples used for construction of the model is called training set.
● The model is represented in the following forms:
○ Classification rules, (IF-THEN statements),
○ Decision tree
○ Mathematical formulae

Excellence and Service


CHRIST
Deemed to be University

Contd...

Excellence and Service


CHRIST
Deemed to be University

Contd...
2. Model Evaluation (Accuracy):

● Estimate accuracy rate of the model based on a test set.

– The known label of test sample is compared with the classified result from the
model.

– Accuracy rate is the percentage of test set samples that are correctly classified by
the model.
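A minimal sketch of this evaluation step (the labels below are made-up illustrative values):

# Accuracy = fraction of test samples whose predicted label matches the known label.
from sklearn.metrics import accuracy_score

y_true = ["spam", "ham", "spam", "ham", "ham"]   # known labels of the test set
y_pred = ["spam", "ham", "ham",  "ham", "ham"]   # labels predicted by the model
print(accuracy_score(y_true, y_pred))            # 0.8 -> 80% accuracy rate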

Excellence and Service


CHRIST
Deemed to be University

Contd...

Excellence and Service


CHRIST
Deemed to be University

Contd...
3. Model Use (Classification)

The model is used to classify unseen objects.


• Give a class label to a new tuple
• Predict the value of an actual attribute

Excellence and Service


CHRIST
Deemed to be University

Contd...

Excellence and Service


CHRIST
Deemed to be University

Text classification

● Text classification is a machine learning technique that assigns a set


of predefined categories to open-ended text.
● Text classifiers can be used to organize, structure, and categorize
pretty much any kind of text – from documents, medical studies, and
files, and all over the web.
● For example, news articles can be organized by topic; support tickets by urgency; chat conversations by language; brand mentions by sentiment; and so on.
● Text classification is one of the fundamental tasks in natural language processing, with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
Consider the phrase: “The user interface is quite straightforward and easy to use.”

Excellence and Service


278
CHRIST
Deemed to be University

A text classifier can take this phrase as input, analyze its content, and then automatically
assign relevant tags, such as UI and Easy To Use.

Some Examples of Text Classification:


Sentiment Analysis.
Language Detection.
Fraud, Profanity & Online Abuse Detection.
Detecting Trends in Customer Feedback.
Urgency Detection in Customer Support.

Excellence and Service


279
CHRIST
Deemed to be University

Why is Text Classification Important?

● It’s estimated that around 80% of all information is unstructured, with


text being one of the most common types of unstructured data.
● Because of the messy nature of text, analyzing, understanding,
organizing, and sorting through text data is hard and time-consuming,
so most companies fail to use it to its full potential.
● This is where text classification with machine learning comes in.
● Using text classifiers, companies can automatically structure all
manner of relevant text, from emails, legal documents, social media,
chatbots, surveys, and more in a fast and cost-effective way.
● This allows companies to save time analyzing text data, automate
business processes, and make data-driven business decisions.

Excellence and Service


280
CHRIST
Deemed to be University

Excellence and Service


281
CHRIST
Deemed to be University

Excellence and Service


282
CHRIST
Deemed to be University

Excellence and Service


283
CHRIST
Deemed to be University

Excellence and Service


284
CHRIST
Deemed to be University

Excellence and Service


285
CHRIST
Deemed to be University

Excellence and Service


286
Christ University

Classification Methods

• Decision Tree Induction
• Neural Networks
• Bayesian Classification
• Association-Based Classification
• K-Nearest Neighbour
• Case-Based Reasoning
• Genetic Algorithms
• Rough Set Theory
• Fuzzy Sets

Excellence and Service


Christ University

Types of classification tasks

1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
4. Imbalanced Classification

Excellence and Service


Christ University

Binary Classification
• Binary classification refers to those classification tasks that
have two class labels or two outcomes.

• Examples include:
1. Email spam detection (spam or not).
2. Churn prediction (churn or not).
3. Stock trend prediction (buy or not).

• Typically, binary classification tasks involve one class that is


the normal state and another class that is the abnormal state.

Excellence and Service


Christ University

Binary Classification

• For example “not spam” is the normal state and “spam” is the
abnormal state.

• Another example is “cancer not detected” is the normal state


of a task that involves a medical test and “cancer detected” is
the abnormal state.

• The class for the normal state is assigned the class label 0 and
the class with the abnormal state is assigned the class label 1.

Excellence and Service


Christ University
Binary Classification

Excellence and Service


Christ University
Binary Classification

Excellence and Service


Christ University

Multi-Class Classification

• Multi-class classification refers to those classification tasks


that have more than two class labels.
• Examples include:
1. Face classification.
2. Plant species classification.
3. Optical character recognition.

• Unlike binary classification, multi-class classification does not


have the notion of normal and abnormal outcomes.
• Instead, examples are classified as belonging to one among a
range of known classes.

Excellence and Service


Christ University
Multi-Class Classification

• The number of class labels may be very large on some


problems. For example, a model may predict a photo as
belonging to one among thousands or tens of thousands of
faces in a face recognition system.

Excellence and Service


Christ University
Multi-Class Classification

Excellence and Service


Christ University
Multi-Class Classification

Excellence and Service


Christ University

Multi-Label Classification
• Multi-label classification refers to those classification tasks
that have two or more class labels, where one or more class
labels may be predicted for each example.

• i.e., multi-label classification is applied when one input can belong to more than one class.

• Consider the example of photo classification, where a given


photo may have multiple objects in the scene and a model may
predict the presence of multiple known objects in the photo,
such as “bicycle,” “apple,” “person,” etc.

Excellence and Service


Christ University

Multi-Label Classification

• Another example is a person who is a citizen of two countries.

• To work with this type of classification, you need to build a


model that can predict multiple outputs.

• This is unlike binary classification and multi-class


classification, where a single class label is predicted for each
example.

Excellence and Service


Christ University
Multi-Label Classification

Excellence and Service


CHRIST
Deemed to be University

Classification of Machine Learning Algorithms

● Supervised learning
● Unsupervised learning
● Reinforcement learning

Excellence and Service


CHRIST
Deemed to be University

Supervised vs. Unsupervised Learning

● Supervised learning (classification)


○ Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
○ New data is classified based on the training
set
● Unsupervised learning (clustering)
○ The class labels of the training data are unknown

○ Given a set of measurements, observations,


etc. with the aim of establishing the existence
of classes or clusters in the data

Excellence and Service


CHRIST
Deemed to be University
Supervised learning algorithms

Various algorithms and computation techniques are used in supervised machine


learning processes.
● Neural networks
● Naive Bayes
● Linear regression
● Logistic regression
● Support vector machine (SVM)
● K-nearest neighbor
● Random forest

Excellence and Service


CHRIST
Deemed to be University

Supervised learning Projects

● Image- and object-recognition


● Predictive analytics
● Customer sentiment analysis
● Spam detection

Excellence and Service


CHRIST
Deemed to be University

Challenges of supervised learning

Although supervised learning can offer businesses advantages, such as deep


data insights and improved automation, there are some challenges when
building sustainable supervised learning models. The following are some of
these challenges:
● Supervised learning models can require certain levels of expertise to
structure accurately.
● Training supervised learning models can be very time intensive.
● Datasets can have a higher likelihood of human error, resulting in algorithms
learning incorrectly.
● Unlike unsupervised learning models, supervised learning cannot cluster or
classify data on its own.

Excellence and Service


CHRIST
Deemed to be University

Unsupervised Learning

● Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.

Popular unsupervised learning algorithms:


• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
• KNN (k-nearest neighbors)
• Neural Networks

Excellence and Service


CHRIST
Deemed to be University

Advantages of Unsupervised Learning

Advantages:
● Unsupervised learning is used for more complex tasks as compared to
supervised learning because, in unsupervised learning, we don't have
labeled input data.
● Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages:
● Unsupervised learning is intrinsically more difficult than supervised learning
as it does not have corresponding output.
● The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

Excellence and Service


CHRIST
Deemed to be University

Reinforcement Learning
● Reinforcement learning is a feedback-based learning method, in which a
learning agent gets a reward for each right action and gets a penalty for each
wrong action.
● The agent learns automatically with these feedbacks and improves its
performance. In reinforcement learning, the agent interacts with the
environment and explores it. The goal of an agent is to get the most reward
points, and hence, it improves its performance.
● The robotic dog, which automatically learns the movement of his arms, is an
example of Reinforcement learning.

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


308
CHRIST
Deemed to be University

Excellence and Service


309
CHRIST
Deemed to be University

Excellence and Service


310
CHRIST
Deemed to be University

Excellence and Service


311
CHRIST
Deemed to be University

Excellence and Service


312
CHRIST
Deemed to be University

Why use machine learning text classification?

● Scalability: Manually analyzing and organizing text is slow and much less accurate. Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. Text classification tools are scalable to any business need, large or small.
● Real-time analysis: There are critical situations that companies need
to identify as soon as possible and take immediate action (e.g., PR
crises on social media). Machine learning text classification can
follow your brand mentions constantly and in real-time, so you'll
identify critical information and be able to take action right away.

Excellence and Service


313
CHRIST
Deemed to be University

● Consistent criteria: Human annotators make mistakes when classifying text


data due to distractions, fatigue, and boredom, and human subjectivity
creates inconsistent criteria. Machine learning, on the other hand, applies the
same lens and criteria to all data and results. Once a text classification model
is properly trained it performs with unsurpassed accuracy.

Excellence and Service


314
CHRIST
Deemed to be University

Text Classification

● Text classification can be done in two ways: manual or automatic.


● Manual text classification involves a human annotator, who interprets
the content of text and categorizes it accordingly. This method can
deliver good results but it’s time-consuming and expensive.
● Automatic text classification applies machine learning, natural
language processing (NLP), and other AI-guided techniques to
automatically classify text in a faster, more cost-effective, and more
accurate manner.
● There are many approaches to automatic text classification, but they
all fall under three types of systems:
● Rule-based systems
● Machine learning-based systems
● Hybrid systems

Excellence and Service


315
CHRIST
Deemed to be University

Rule-based systems
● Rule-based approaches classify text into organized groups by using a
set of handcrafted linguistic rules.
● These rules instruct the system to use semantically relevant elements
of a text to identify relevant categories based on its content.
● Each rule consists of an antecedent or pattern and a predicted
category.
● Example: Say that you want to classify news articles into two groups:
Sports and Politics.
● First, you’ll need to define two lists of words that characterize each
group (e.g., words related to sports such as football, basketball,
LeBron James, etc., and words related to politics, such as Donald
Trump, Hillary Clinton, Putin, etc.).

Excellence and Service


316
CHRIST
Deemed to be University

● Next, when you want to classify a new incoming text, you’ll need to count the number of sports-related words that appear in the text and do the same for politics-related words.
● If the number of sports-related word appearances is greater than the politics-related word count, the text is classified as Sports, and vice versa.
● For example, this rule-based system will classify the headline “When is LeBron James' first game with the Lakers?” as Sports, because it counted one sports-related term (LeBron James) and no politics-related terms.
● Rule-based systems are human-comprehensible and can be improved over time, but this approach has some disadvantages.
● For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing.
● Rule-based systems are also difficult to maintain and don’t scale well, given that adding new rules can affect the results of pre-existing rules.
Excellence and Service


317
CHRIST
Deemed to be University

Example of a simple rule-based system for detecting passive voice in sentences using Python

import re

def detect_passive_voice(sentence):
    # A crude pattern: a form of "to be" followed later in the sentence by "by"
    passive_voice_pattern = r'\b(am|is|are|was|were|be|been|being)\b.*\b(by)\b'
    if re.search(passive_voice_pattern, sentence):
        return True
    return False

# Example sentences
sentences = [
    "The cake was eaten by the child.",
    "The child ate the cake."
]

# Detect passive voice
for sentence in sentences:
    if detect_passive_voice(sentence):
        print(f"Passive voice detected: {sentence}")
    else:
        print(f"Active voice: {sentence}")

Excellence and Service


318
CHRIST
Deemed to be University

Machine learning-based systems


● Instead of relying on manually crafted rules, machine learning text
classification learns to make classifications based on past
observations.
● By using pre-labeled examples as training data, machine learning
algorithms can learn the different associations between pieces of text,
and that a particular output (i.e., tags) is expected for a particular
input (i.e.,text).
● A “tag” is the predetermined classification or category that any given
text could fall into.
● The first step towards training a machine learning NLP classifier is
feature extraction: a method used to transform each text into a
numerical representation in the form of a vector.
● One of the most frequently used approaches is the bag of words,
where a vector represents the frequency of a word in a predefined
dictionary of words.
Excellence and Service
319
CHRIST
Deemed to be University

● For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the text “This is awesome,”
● we would have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0). Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors for each text example) and tags (e.g., sports, politics) to produce a classification model:
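A tiny sketch of this feature-extraction step, using the dictionary from the example:

# Map a text to a binary bag-of-words vector over a fixed dictionary.
dictionary = ["This", "is", "the", "not", "awesome", "bad", "basketball"]

def to_vector(text):
    tokens = text.split()
    return [1 if word in tokens else 0 for word in dictionary]

print(to_vector("This is awesome"))   # [1, 1, 0, 0, 1, 0, 0]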

Excellence and Service


320
CHRIST
Deemed to be University

Excellence and Service


321
CHRIST
Deemed to be University

● Once it’s trained with enough training samples, the machine learning
model can begin to make accurate predictions.
● The same feature extractor is used to transform unseen text to feature
sets, which can be fed into the classification model to get predictions
on tags (e.g., sports, politics):

Excellence and Service


322
CHRIST
Deemed to be University

Excellence and Service


323
CHRIST
Deemed to be University

● Text classification with machine learning is usually much more


accurate than human-crafted rule systems, especially on complex NLP
classification tasks.
● Also, classifiers with machine learning are easier to maintain and you
can always tag new examples to learn new tasks.
● Machine Learning Text Classification Algorithms: some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support vector machines (SVM), and deep learning.

Excellence and Service


324
CHRIST
Deemed to be University

Naive Bayes
● The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis overall.
● One member of that family is Multinomial Naive Bayes (MNB), with a huge advantage: you can get really good results even when your dataset isn’t very large (~a couple of thousand tagged samples) and computational resources are scarce.
● Naive Bayes is based on Bayes’ Theorem, which helps us compute the conditional probability of one event given another, based on the probabilities of the individual events. So we calculate the probability of each tag for a given text, and then output the tag with the highest probability.

Naive Bayes formula.

Excellence and Service


325
CHRIST
Deemed to be University

● The probability of A, given that B is true, is equal to the probability of B given A, times the probability of A, divided by the probability of B: P(A|B) = P(B|A) × P(A) / P(B).
● This means that any vector that represents a text will have to contain information about the probabilities of the appearance of certain words within the texts of a given category, so that the algorithm can compute the likelihood of that text belonging to the category.

Excellence and Service


326
CHRIST
Deemed to be University

Why is it called Naïve Bayes?


● The Naïve Bayes algorithm is comprised of two words Naïve
and Bayes, Which can be described as:

● Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.

● Bayes: It is called Bayes because it depends on the principle


of Bayes' Theorem.
Excellence and Service
CHRIST
Deemed to be University

Bayes' Theorem
● Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It is based on the conditional probability.

● Thomas Bayes was an English statistician and philosopher who


is known for formulating a specific case of the theorem that
bears his name: Bayes' theorem.

● Bayes Theorem is the base for the Naive Bayes Algorithm

Excellence and Service


CHRIST
Deemed to be University

Contd...

● Bayes' Theorem is a way of finding a probability when we know


certain other probabilities.

● It states that the probability of event A given B is equal to the probability of event B given A, multiplied by the probability of A, divided by the probability of B: P(A|B) = P(B|A) × P(A) / P(B)

Excellence and Service


CHRIST
Deemed to be University

Contd....

Where

● P(A|B) is Posterior probability: Probability of hypothesis A on the


observed event B.
The posterior probability combines the prior probability with the new evidence (the likelihood).

● P(B|A) is Likelihood probability: Probability of the evidence given that the


probability of a hypothesis is true.

● P(A) is Prior Probability: Probability of hypothesis before observing the


evidence.

● P(B) is Marginal Probability: Probability of Evidence.
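A small worked example of the theorem with made-up numbers (the probabilities are assumed purely for illustration):

# Suppose 30% of emails are spam, the word "offer" appears in 60% of spam
# emails, and "offer" appears in 25% of all emails.
p_spam = 0.30                 # P(A): prior probability
p_offer_given_spam = 0.60     # P(B|A): likelihood
p_offer = 0.25                # P(B): marginal probability of the evidence

p_spam_given_offer = p_offer_given_spam * p_spam / p_offer   # P(A|B): posterior
print(p_spam_given_offer)     # 0.72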

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

PROBLEM-1

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Predicting a class label using naïve Bayesian classification

● We wish to predict the class label of a tuple using naïve Bayesian classification, given the training data shown in Table 8.1.

● The data tuples are described by the attributes age, income,


student, and credit rating.

● The class label attribute, buys computer, has two distinct values
(namely, yes, no).
● Classify following tuple
● X = (age =youth, income =medium, student = yes, credit rating
= fair)

Excellence and Service


CHRIST
Deemed to be University

Naïve Bayes Classifier: Comments


● Advantages
○ Easy to implement
○ Good results obtained in most of the cases
● Disadvantages
○ Assumption: class conditional independence, therefore loss of
accuracy
○ Practically, dependencies exist among variables
■ E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
■ Dependencies among these cannot be modeled by Naïve Bayes Classifier
● How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)

Excellence and Service


CHRIST
Deemed to be University

APPLICATIONS

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

How to Build a Basic Model Using Naive Bayes in Python

● Again, scikit-learn (a Python library) will help here to build a Naive Bayes model in Python. There are five types of NB models under the scikit-learn library:
● Gaussian Naive Bayes: GaussianNB is used in classification tasks and assumes that feature values follow a Gaussian (normal) distribution. When plotted, this gives a bell-shaped curve that is symmetric about the mean of the feature values.
● Multinomial Naive Bayes: MultinomialNB is used for discrete counts. For example, in a text classification problem, instead of just recording “word occurs in the document”, we record “how often the word occurs in the document”; you can think of it as “the number of times outcome x_i is observed over n trials”.

Excellence and Service


CHRIST
Deemed to be University

● Bernoulli Naive Bayes: BernoulliNB is useful if your feature vectors are boolean (i.e., zeros and ones). One application is text classification with a ‘bag of words’ model where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document”, respectively.
● Categorical Naive Bayes: CategoricalNB is useful if the features are categorically distributed. We have to encode the categorical variables in numeric format (for example, with an ordinal encoder) before using this algorithm.
● (The fifth variant, Complement Naive Bayes, is an adaptation of Multinomial Naive Bayes suited to imbalanced datasets.)
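A short sketch comparing two of these variants on a toy spam/ham dataset (the texts and labels are invented for illustration):

# MultinomialNB on word counts vs. BernoulliNB on binary word indicators.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

texts  = ["win money now", "meeting at noon", "win a free prize", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

counts = CountVectorizer().fit(texts)
X = counts.transform(texts)
new = counts.transform(["free money"])

print(MultinomialNB().fit(X, labels).predict(new))   # ['spam'] on this toy data
print(BernoulliNB().fit(X, labels).predict(new))     # ['spam'] on this toy data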

Excellence and Service


CHRIST
Deemed to be University

Support Vector Machines

● Support Vector Machines (SVM) is another powerful text


classification machine learning algorithm because like Naive Bayes,
SVM doesn’t need much training data to start providing accurate
results. SVM does, however, require more computational resources
than Naive Bayes, but the results are even faster and more accurate. In
short, SVM draws a line or “hyperplane” that divides a space into two
subspaces. One subspace contains vectors (tags) that belong to a
group, and another subspace contains vectors that do not belong to
that group.
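A brief sketch of a linear SVM text classifier in scikit-learn (the tiny Sports/Politics dataset is invented for illustration):

# Linear SVM over TF-IDF features for a two-class text problem.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts  = ["the team won the football match",
          "the election results were announced",
          "a great basketball game last night",
          "the senate passed the new bill"]
labels = ["Sports", "Politics", "Sports", "Politics"]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(["who won the basketball game"])))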

Excellence and Service


357
CHRIST
Deemed to be University

● The optimal hyperplane is the one with the largest distance (margin) between each tag.
● In two dimensions, those vectors are representations of your training texts, and a group is a tag you have tagged your texts with.
● As data gets more complex, it may not be possible to separate the vectors/tags with a simple straight line, so the boundary looks like this:

Excellence and Service


358
CHRIST
Deemed to be University

Fig. Optimal SVM Hyperplane

Excellence and Service


359
CHRIST
Deemed to be University

SVM—Support Vector Machines

● A relatively new classification method for both linear and nonlinear


data
● It uses a nonlinear mapping to transform the original training data into
a higher dimension
● With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
● With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
● SVM finds this hyperplane using support vectors (“essential” training
tuples) and margins (defined by the support vectors)

Excellence and Service


CHRIST
Deemed to be University

SVM—History and Applications

● Vapnik and colleagues (1992)—groundwork from Vapnik &


Chervonenkis’ statistical learning theory in 1960s
● Features: training can be slow but accuracy is high owing to their
ability to model complex nonlinear decision boundaries (margin
maximization)
● Used for: classification and numeric prediction
● Applications:
○ handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests

Excellence and Service


CHRIST
Deemed to be University

SVM—General Philosophy

(Figure: two separating hyperplanes, one with a small margin and one with a large margin; the training tuples lying on the margins are the support vectors.)

Excellence and Service


CHRIST
Deemed to be University

SVM—Margins and Support Vectors

Excellence and Service


CHRIST
Deemed to be University

SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class
labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the
one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane
(MMH)

Excellence and Service


CHRIST
Deemed to be University

Deep Learning

● Deep learning is a set of algorithms and techniques inspired by how the human brain works, called neural networks.
● Deep learning architectures offer huge benefits for text classification because they perform at super high accuracy with lower-level engineering and computation.
● The two main deep learning architectures for text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
● Deep learning is hierarchical machine learning, using multiple algorithms in a progressive chain of events. It’s similar to how the human brain works when making decisions, using different techniques simultaneously to process huge amounts of data.

Excellence and Service


367
CHRIST
Deemed to be University

● Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least millions of tagged examples).
● However, they don’t hit a learning threshold the way traditional machine learning algorithms such as SVM and Naive Bayes do: deep learning classifiers continue to get better the more data you feed them.
● Embedding algorithms such as Word2Vec or GloVe are also used to obtain better vector representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.

Excellence and Service


368
CHRIST
Deemed to be University

Hybrid Systems
● Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the results.
● These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been correctly modeled by the base classifier.

Excellence and Service


369
CHRIST
Deemed to be University

Text Classification Using TF-IDF


Develop a text classification model for any use case using NLTK and the TF-IDF model. A sketch of this workflow is given below.
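One possible sketch of this exercise combines NLTK stop words with scikit-learn's TfidfVectorizer and a logistic regression classifier. The tiny review dataset and the choice of classifier are assumptions made for illustration, and nltk.download('stopwords') is expected to have been run.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["the user interface is straightforward and easy to use",
          "the app keeps crashing and support never replies",
          "setup was quick and the design looks great",
          "billing errors and very slow customer service"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features (with English stop words removed) feeding a classifier.
model = make_pipeline(TfidfVectorizer(stop_words=stopwords.words("english")),
                      LogisticRegression())
model.fit(texts, labels)
print(model.predict(["easy to use and a great design"]))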

Excellence and Service


370
CHRIST
Deemed to be University

VECTOR SEMANTICS AND EMBEDDINGS

● The idea of vector semantics is to represent a word as a point in a multidimensional semantic space that is derived from the distributions of its word neighbors.
● Vectors for representing words are called embeddings.
● Vector semantics defines and interprets word meaning to explain features such as word similarity.
● Its central idea is: two words are similar if they appear in similar contexts.

Excellence and Service


371
CHRIST
Deemed to be University

● In its current form, the vector model draws its inspiration from the linguistic and philosophical work of the 1950s.
● Vector semantics represents a word in a multi-dimensional vector space. The vector model is also called embeddings, because a word is embedded in a particular vector space.
● The vector model offers many advantages in NLP. For example, in sentiment analysis, a classifier can set up a decision boundary and predict whether the sentiment is positive or negative (a binomial classification).
● Another key practical advantage of vector semantics is that it can be learned automatically from text without complex labeling or supervision.
● As a result of these advantages, vector semantics has become a de-facto standard for NLP applications such as sentiment analysis, Named Entity Recognition (NER), topic modeling, and so on.

Excellence and Service


372
CHRIST
Deemed to be University

WORD EMBEDDINGS

● Word Embeddings are the texts converted into numbers


and there may be different numerical representations of
the same text.
● Word Embedding is an approach for representing words
and documents.
● Word Embedding or Word Vector is a numeric vector
input that represents a word in a lower-dimensional
space.
● It allows words with similar meanings to have a similar
representation.

Excellence and Service


373
CHRIST
Deemed to be University

What is Word Embedding in NLP?


● Word Embeddings are a method of extracting features out of text so
that we can input those features into a machine learning model to
work with text data.
● They try to preserve syntactical and semantic information.
● The methods such as Bag of Words (BOW), CountVectorizer and
TFIDF rely on the word count in a sentence but do not save any
syntactical or semantic information.
● In these algorithms, the size of the vector is the number of elements in
the vocabulary.
● We can get a sparse matrix if most of the elements are zero. Large
input vectors will mean a huge number of weights which will result in
high computation required for training. Word Embeddings give a
solution to these problems.

Excellence and Service


374
CHRIST
Deemed to be University

● As we know that many Machine Learning algorithms and


almost all Deep Learning Architectures are not capable of
processing strings or plain text in their raw form.
● In a broad sense, they require numerical numbers as inputs to
perform any sort of task, such as classification, regression,
clustering, etc. Also, from the huge amount of data that is
present in the text format, it is imperative to extract some
knowledge out of it and build any useful applications.
● In short, we can say that to build any model in machine
learning or deep learning, the final level data has to be in
numerical form because models don’t understand text or
image data directly as humans do.

Excellence and Service


375
CHRIST
Deemed to be University

Excellence and Service


376
CHRIST
Deemed to be University

How did NLP models learn patterns from text data?

● To convert the text data into numerical data, we need some


smart ways which are known as vectorization, or in the NLP
world, it is known as Word embeddings.
● Therefore, Vectorization or word embedding is the process of
converting text data to numerical vectors.
● Later those vectors are used to build various machine
learning models.
● In this manner, we say this as extracting features with the
help of text with an aim to build multiple natural languages,
processing models

Excellence and Service


377
CHRIST
Deemed to be University

Need for Word Embedding?


● To reduce dimensionality
● To use a word to predict the words around it.
● Inter-word semantics must be captured.
How are Word Embeddings used?
● They are used as input to machine learning models.
Take the words —-> Give their numeric representation —->
Use in training or inference.
● To represent or visualize any underlying patterns of usage in
the corpus that was used to train them.

Excellence and Service


378
CHRIST
Deemed to be University

Different types of Word Embeddings

Broadly, we can classify word embeddings into the following two categories:
● Frequency-based or statistics-based word embeddings
● Prediction-based word embeddings

Frequency-based Embedding
1. Count Vector
2. One-Hot Encoding
3. Bag of Words
4. TF-IDF Vector

Excellence and Service


379
CHRIST
Deemed to be University

One-Hot Encoding (OHE)


● In this technique, we represent each unique word in vocabulary by setting a
unique token with value 1 and rest 0 at other positions in the vector.
● In simple words, a vector representation of a one-hot encoded vector
represents in the form of 1, and 0 where 1 stands for the position where the
word exists and 0 everywhere else.
● Example:
● Sentence: I am teaching NLP in Python
● A word in this sentence may be “NLP”, “Python”, “teaching”, etc.
● A dictionary is defined as the list of all unique words present in the sentence, so the dictionary may look like:
● Dictionary: [‘I’, ‘am’, ‘teaching’, ‘NLP’, ‘in’, ‘Python’]
● Therefore, the vector representations according to the above dictionary are:
● Vector for NLP: [0, 0, 0, 1, 0, 0]
● Vector for Python: [0, 0, 0, 0, 0, 1]
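A minimal sketch of the encoding above (illustrative only):

# Build one-hot vectors over the dictionary from the example.
dictionary = ["I", "am", "teaching", "NLP", "in", "Python"]

def one_hot(word):
    return [1 if w == word else 0 for w in dictionary]

print(one_hot("NLP"))      # [0, 0, 0, 1, 0, 0]
print(one_hot("Python"))   # [0, 0, 0, 0, 0, 1]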

Excellence and Service


380
CHRIST
Deemed to be University

Disadvantages of One-hot Encoding


1. One of the disadvantages of One-hot encoding is that the Size
of the vector is equal to the count of unique words in the
vocabulary.
2. One-hot encoding does not capture the relationships between
different words. Therefore, it does not convey information about
the context.

Excellence and Service


381
CHRIST
Deemed to be University

Count Vectorizer

● It is one of the simplest ways of doing text vectorization.


● It creates a document term matrix, which is a set of dummy
variables that indicates if a particular word appears in the
document.
● Count vectorizer will fit and learn the word vocabulary and try to
create a document term matrix in which the individual cells denote
the frequency of that word in a particular document, which is also
known as term frequency, and the columns are dedicated to each
word in the corpus.
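A short sketch of the count vectorizer with scikit-learn (the two sentences are illustrative):

# Document-term matrix of raw term frequencies.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am teaching NLP in Python",
          "NLP is easy to learn in Python"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

print(vec.get_feature_names_out())   # the learned vocabulary (one column per word)
print(X.toarray())                   # each cell = term frequency in that document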

Excellence and Service


382
CHRIST
Deemed to be University

Excellence and Service


383
CHRIST
Deemed to be University

Bag-of-Words(BoW)
This vectorization technique converts the text content to numerical feature vectors. Bag of
Words takes a document from a corpus and converts it into a numeric vector by mapping
each document word to a feature vector for the machine learning model.

Disadvantages of Bag of Words

1. This method doesn’t preserve the word order.

2. It does not allow to draw of useful inferences for downstream NLP tasks.

Excellence and Service


384
CHRIST
Deemed to be University

N-grams Vectorization

● Similar to the count vectorization technique, in the N-Gram


method, a document term matrix is generated, and each cell
represents the count.
● The columns represent all columns of adjacent words of length
n.
● Count vectorization is a special case of N-Gram where n=1.
● N-grams consider sequences of n words in the text, where n is 1, 2, 3, …: a 1-gram is a single token, a 2-gram is a token pair, and so on.
● Unlike BoW, it maintains local word order.

Excellence and Service


385
CHRIST
Deemed to be University

For example,
● “I am studying NLP” has four words, so n can be up to 4.
● If n=2 (bigram), the columns would be: [“I am”, “am studying”, “studying NLP”]
● If n=3 (trigram), the columns would be: [“I am studying”, “am studying NLP”]
● If n=4 (four-gram), the column would be: [“I am studying NLP”]
● The value of n is chosen based on performance.
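The same idea with scikit-learn's CountVectorizer, restricted to bigrams (a sketch; the token_pattern is only needed to keep the one-letter word "I", which the default pattern drops):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vec.fit_transform(["I am studying NLP"])
print(vec.get_feature_names_out())   # ['am studying', 'i am', 'studying nlp']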

Excellence and Service


386
CHRIST
Deemed to be University

● How does the value of N in the N-grams method create a


tradeoff?
● The trade-off is created while choosing the N values. If we
choose N as a smaller number, then it may not be sufficient
enough to provide the most useful information.
● But on the contrary, if we choose N as a high value, then it will
yield a huge matrix with lots of features.
● Therefore, N-gram may be a powerful technique, but it needs a
little more care.

Excellence and Service


387
CHRIST
Deemed to be University

Disadvantages of N-Grams
1. It has too many features.
2. Due to too many features, the feature set becomes too sparse and
is computationally expensive.
3. Choosing the optimal value of N is not an easy task.

Excellence and Service


388
CHRIST
Deemed to be University

TF-IDF Vectorization
● Term frequency-inverse document frequency ( TF-IDF) gives a
measure that takes the importance of a word into consideration
depending on how frequently it occurs in a document and a
corpus.
● This algorithm works on a statistical measure of finding word
relevance in the text that can be in the form of a single document or
various documents that are referred to as corpus.
● TF-IDF algorithm finds application in solving simpler natural
language processing and machine learning problems for tasks like
information retrieval, stop words removal, keyword extraction, and
basic text analysis. However, it does not capture the semantic
meaning of words efficiently in a sequence.

Excellence and Service


389
CHRIST
Deemed to be University

Important Points about TF-IDF Vectorizer


1. Similar to the count vectorization method, in the TF-IDF method, a document term
matrix is generated and each column represents an individual unique word.
2. The difference in the TF-IDF method is that each cell doesn’t indicate the term
frequency, but contains a weight value that signifies how important a word is for an
individual text message or document
3. This method is based on the frequency method but it is different from the count
vectorization in the sense that it takes into considerations not just the occurrence of a
word in a single document but in the entire corpus.
4. TF-IDF gives more weight to less frequently occurring events and less weight to
expected events. So, it penalizes frequently occurring words that appear frequently in
a document such as “the”, “is” but assigns greater weight to less frequent or rare
words.
5. The product of TF x IDF of a word indicates how often the token is found in the
document and how unique the token is to the whole entire corpus of documents.

Excellence and Service


390
CHRIST
Deemed to be University

● Important Note
● All of the techniques discussed here suffer from the issue of vector sparsity; as a result, they don’t handle complex word relations well and are not able to model long sequences of text.
● So, in the next part we will look at newer text vectorization techniques that try to overcome this problem.

Excellence and Service


391
CHRIST
Deemed to be University

Prediction-based Embedding

● So far, we have seen deterministic methods to determine word vectors. But these methods proved to be limited in their word representations until Mikolov et al. introduced word2vec to the NLP community.
● The new methods were prediction-based, in the sense that they assigned probabilities to words, and they proved to be state of the art for tasks like word analogies and word similarities.
● They were also able to solve analogies like King - man + woman ≈ Queen, which was considered an almost magical result.

Excellence and Service


392
CHRIST
Deemed to be University

Word2Vec
● Word2vec is not a single algorithm but a combination of two
techniques – CBOW(Continuous bag of words) and Skip-gram
model. Both of these are shallow neural networks which map
word(s) to the target variable which is also a word(s). Both of
these learn weights which act as word vector representation
techniques.
“A man is known by the company he keeps”

This is a very well know saying. And word2vec also works primarily on this idea. A word
is known by the company it keeps. This sounds so strange and funny but it gives amazing
results.

Excellence and Service


393
CHRIST
Deemed to be University

Word2Vec
● The Word2Vec method was developed at Google in 2013 and is widely used in advanced natural language processing (NLP) problems. It was invented for training word embeddings and is based on the distributional hypothesis.
● Under this hypothesis, it is trained with either skip-grams or a continuous bag of words (CBOW).
● These are basically shallow neural networks with an input layer, a projection layer, and an output layer. The model reconstructs the linguistic context of words by considering the surrounding words, both before and after the target word.
● The method iterates over a corpus of text to learn the associations between words. It relies on the hypothesis that neighboring words in a text are semantically similar to each other, and it maps semantically similar words to geometrically close embedding vectors.
It uses the cosine similarity metric to measure semantic similarity.

Cosine similarity is equal to cos(angle), where the angle is measured between the vector representations of two words/documents.

● If the cosine similarity is 1 (an angle of 0°), the two vectors point in the same direction, i.e. the words occur in essentially the same contexts.
● If the angle is a right angle (90°), the cosine is 0, which means the words hold no contextual similarity and are independent of each other.
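A small NumPy sketch of this computation (the vectors here are arbitrary toy values, not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_king  = np.array([0.9, 0.1, 0.4])   # toy word vectors
v_queen = np.array([0.8, 0.2, 0.5])

print(cosine_similarity(v_king, v_queen))                              # close to 1: similar contexts
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0: orthogonal, no similarity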
Some sample sentences:


● The rewards of all your hard work in the garden are easy to
see.
● It is hard work to keep my room in proper order.
● The hard work began to tell on him.
In other words, if the computer learns that the words “hard” and “work” occur in close proximity to each other, it will learn the vectors accordingly.
If our “target” word is “hard”, for which we need to learn a good vector, we provide the computer with its nearby or “context” word, which here is “work”, chosen from among “the”, “began”, “is”, etc.
Intuition:
Text : ['Best way to success is through hard work and persistence']
To model our input will be like :
Target word : Best. Context word : (way).
Target word : way. Now here we have two words, one before and after. So in this scenario our
context words will be : (Best,to).
Target word : to. Now here also we have two words, one before and after i,e (way,success). But
if we think again, then “to” and “is” can be found in sentences nearby to each other. Like “He is
going to the market”. So it is a good idea if we include the word “is” in the list of context words.
But now we could argue about “Best” or “through” as well. Here comes the concept of “window size”: the window size is the number of nearby words on each side that we decide to consider. So if the window size is 1, the list of context words becomes (way, success), and if the window size is 2, it becomes (Best, way, success, is).

Similarly, we can formulate the context list for the rest of the words.
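A rough sketch of how such (target, context) pairs could be generated for a chosen window size (the function and variable names are ours, purely illustrative):

def make_pairs(tokens, window=2):
    # Return (target, [context words]) pairs for each position in the sentence
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

tokens = "best way to success is through hardwork and persistence".split()
for target, context in make_pairs(tokens, window=2):
    print(target, "->", context)

For the word "to" with window size 2, this prints the context (best, way, success, is), matching the example above.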
Step 1: Read data from a file


Step 2: Generate variables
text = ['Best way to success is through hardwork and persistence']
Number of unique words: 9
word_to_index : {'best': 0, 'way': 1, 'to': 2, 'success': 3, 'is': 4,
'through': 5, 'hardwork': 6, 'and': 7, 'persistence': 8}
index_to_word : {0: 'best', 1: 'way', 2: 'to', 3: 'success', 4: 'is', 5:
'through', 6: 'hardwork', 7: 'and', 8: 'persistence'}
corpus: ['best', 'way', 'to', 'success', 'is', 'through', 'hardwork', 'and',
'persistence']
Length of corpus : 9
Step 3: Generate training data

text = ['Best way to success is through hardwork and persistence']


Window size = 2, Vocab size = 9
We set the position given by the word_to_index dict to 1, i.e. best : 0, so the 0th index of the one-hot vector for “best” is set to 1.
Target word = best
Context words = (way,to)
Target_word_one_hot_vector = [1, 0, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector = [0, 1, 1, 0, 0, 0, 0, 0, 0]
Target word = way
Context words = (best,to,success)
Target_word_one_hot_vector = [0, 1, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector= [1, 0, 1, 1, 0, 0, 0, 0, 0]
Instead of having the indices in a single list, we have two different lists. The problem with this approach is that it will take a lot of space as the size of our data increases. To train a model, we need the data in the form of (X, Y), i.e. (target_words, context_words).
Target word:best . Target vector: [1. 0. 0. 0. 0. 0. 0. 0. 0.]
Context word:['way', 'to'] .
Context vector: [0. 1. 1. 0. 0. 0. 0. 0. 0.]
**************************************************
Target word:way . Target vector: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Context word:['best', 'to', 'success'] .
Context vector: [1. 0. 1. 1. 0. 0. 0. 0. 0.]
**************************************************
Target word:hardwork . Target vector: [0. 0. 0. 0. 0. 0. 1. 0. 0.]
Context word:['through', 'is', 'and', 'persistence'] .
Context vector: [0. 0. 0. 0. 1. 1. 0. 1. 1.]
**************************************************
Target word:and . Target vector: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Context word:['hardwork', 'through', 'persistence'] .
Context vector: [0. 0. 0. 0. 0. 1. 1. 0. 1.]
**************************************************
Target word:persistence . Target vector: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
Context word:['and', 'hardwork'] .
Context vector: [0. 0. 0. 0. 0. 0. 1. 1. 0.]
CBOW (Continuous Bag of words)


The continuous bag of words variant includes various inputs that are taken by the neural network model. Out of this,
it predicts the targeted word that closely relates to the context of different words fed as input. It is fast and a great
way to find better numerical representation for frequently occurring words. Let us understand the concept of context
and the current word for CBOW.

In CBOW, we define a window size. The middle word is the current word and the surrounding words (past and
future words) are the context. CBOW utilizes the context to predict the current words.
Each word is encoded using One Hot Encoding in the defined vocabulary and sent to the CBOW neural network.
The hidden layer is a standard fully-connected dense layer. The output layer generates probabilities for the target
word from the vocabulary.
Consider, “Deep Learning is very hard and fun”. We need to set something known as window
size. Let’s say 2 in this case. What we do is iterate over all the words in the given data, which in this
case is just one sentence, and then consider a window of words which surrounds it. Here
since our window size is 2 we will consider 2 words behind the word and 2 words after the word,
hence each word will get 4 words associated with it. We will do this for each and every word in the
data and collect the word pairs.
Diagrammatic representation of the CBOW model
The matrix representation of the above image for a single data point is below.
● The flow is as follows:

● The input layer and the target are both one-hot encoded vectors of size [1 X V]. Here V = 10 in the above example.
● There are two sets of weights: one between the input and the hidden layer, and a second between the hidden and output layer.
Input-hidden layer matrix size = [V X N]; hidden-output layer matrix size = [N X V], where N is the number of dimensions we choose to represent our word in. It is arbitrary and a hyper-parameter for the neural network. N is also the number of neurons in the hidden layer. Here, N = 4.
● The input is multiplied by the input-hidden weights to give the hidden activation. It is simply the corresponding row of the input-hidden matrix copied.
● The hidden activation gets multiplied by the hidden-output weights and the output is calculated.
● The error between output and target is calculated and propagated back to re-adjust the weights.
● The weight matrix between the hidden layer and the output layer is taken as the word vector representation of the word.
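The flow above can be sketched in NumPy for a single context word; the weights here are random placeholders, whereas in practice they are learned by backpropagation.

import numpy as np

V, N = 10, 4                          # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))        # input-to-hidden weights  [V x N]
W_out = rng.normal(size=(N, V))       # hidden-to-output weights [N x V]

x = np.zeros(V); x[3] = 1.0           # one-hot encoded context word (index 3)

h = x @ W_in                          # hidden activation = row 3 of W_in copied
scores = h @ W_out                    # raw scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over candidate target words
print(probs.round(3))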
The image below describes the architecture for multiple context words.
EXAMPLE
Advantages of CBOW:
1. Being probabilistic in nature, it generally performs better than deterministic methods.
2. It is low on memory. It does not need to have huge RAM requirements
like that of co-occurrence matrix where it needs to store three huge
matrices.

Disadvantages of CBOW:
1. CBOW takes the average of the context of a word (as seen above in
calculation of hidden activation). For example, Apple can be both a fruit
and a company but CBOW takes an average of both the contexts and
places it in between a cluster for fruits and companies.
2. Training a CBOW from scratch can take forever if not properly
optimized.
Skip – Gram model


● Skip-gram is a slightly different word embedding technique in
comparison to CBOW as it does not predict the current word based
on the context. Instead, each current word is used as an input to a
log-linear classifier along with a continuous projection layer. This
way, it predicts words in a certain range before and after the
current word.
● This variant takes only one word as an input and then predicts the
closely related context words. That is the reason it can efficiently
represent rare words.
Skip – gram follows the same topology as of CBOW. It just flips CBOW’s architecture on
its head. The aim of skip-gram is to predict the context given a word. Let us take the same
corpus that we built our CBOW model on. C=”Hey, this is sample corpus using only one
context word.” Let us construct the training data.
● The input vector for skip-gram is going to be similar to that of a 1-context CBOW model. The calculations up to the hidden layer activations are also the same. The difference is in the target variable. Since we have defined a context window of 1 on both sides, there will be two one-hot encoded target variables and two corresponding outputs.
● Two separate errors are calculated with respect to the two target variables, and the two error vectors are added element-wise to obtain a final error vector, which is propagated back to update the weights.
● The weights between the input and the hidden layer are taken as the word vector representation after training. The loss function (objective) is of the same type as in the CBOW model.
Advantages of Skip-Gram Model


1. The skip-gram model can capture two semantics for a single word, i.e. it can reflect two usages of Apple: one for the company and the other for the fruit.
2. Skip-gram with negative sampling generally outperforms the other methods.
Word2Vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector. Word vectors are positioned in the space such that words that share common contexts in the corpus are located in close proximity to one another.

● CBOW is faster and has better representations for more frequent words.

● Skip-gram works well with small amounts of data and is found to represent rare words well.
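As a sketch, both variants can be trained with the gensim library; the tiny corpus and parameter values below are only illustrative.

from gensim.models import Word2Vec

sentences = [
    ["best", "way", "to", "success", "is", "through", "hardwork", "and", "persistence"],
    ["hard", "work", "and", "persistence", "lead", "to", "success"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-gram

print(cbow.wv["success"][:5])                      # first few dimensions of one word vector
print(skipgram.wv.most_similar("success", topn=3)) # nearest neighbours by cosine similarity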
Summary of Word2Vec

Word2Vec is a popular word embedding technique that aims to represent words as


continuous vectors in a high-dimensional space. It introduces two models: Continuous
Bag of Words (CBOW) and Skip-gram, each contributing to the learning of vector
representations.

1. Model Architecture:
Continuous Bag of Words (CBOW): In CBOW, the model predicts a target
word based on its context. The context words are used as input, and the target
word is the output. The model is trained to minimize the difference between the
predicted and actual target words.

Skip-gram: Conversely, the Skip-gram model predicts context words given a


target word. The target word serves as input, and the model aims to predict the
words that are likely to appear in its context. Like CBOW, the goal is to
minimize the difference between the predicted and actual context words.
2. Neural Network Training:


Both CBOW and Skip-gram models leverage neural networks to learn vector
representations. The neural network is trained on a large text corpus,
adjusting the weights of connections to minimize the prediction error. This
process places similar words closer together in the resulting vector space.

3. Vector Representations:
Once trained, Word2Vec assigns each word a unique vector in the high-
dimensional space. These vectors capture semantic relationships between
words. Words with similar meanings or those that often appear in similar
contexts have vectors that are close to each other, indicating their semantic
similarity.
Advantages and Disadvantages:


Advantages:

● Captures semantic relationships effectively.


● Efficient for large datasets.
● Provides meaningful word representations.

Disadvantages:

● May struggle with rare words.


● Ignores word order.
FastText
● FastText is a vector representation technique developed by facebook AI
research. As its name suggests its fast and efficient method to perform
same task and because of the nature of its training method, it ends up
learning morphological details as well.
● FastText is unique because it can derive word vectors for unknown
words or out of vocabulary words — this is because by taking
morphological characteristics of words into account, it can create the
word vector for an unknown word.
● Since morphology refers to the structure or syntax of the words,
FastText tends to perform better for such task, word2vec perform better
for semantic task.
FastText works well with rare words. So even if a word


wasn’t seen during training, it can be broken down into n-
grams to get its embeddings.

Applications
● Analysing survey responses .
● Analysing verbatim comments.
● Music/Video recommendation system.
● FastText is an advanced word embedding technique


developed by Facebook AI Research (FAIR) that
extends the Word2Vec model.
● Unlike Word2Vec, FastText not only considers whole
words but also incorporates subword information —
parts of words like n-grams.
● This approach enables the handling of morphologically
rich languages and captures information about word
structure more effectively.
1. Subword Information:
● FastText represents each word as a bag of character
n-grams in addition to the whole word itself.
● This means that the word “apple” is represented by
the word itself and its constituent n-grams like “ap”,
“pp”, “pl”, “le”, etc.
● This approach helps capture the meanings of shorter
words and affords a better understanding of suffixes
and prefixes.
2. Model Training:
● Similar to Word2Vec, FastText can use either the CBOW
or Skip-gram architecture.
● However, it incorporates the subword information during
training.
● The neural network in FastText is trained to predict words
(in CBOW) or context (in Skip-gram) not just based on the
target words but also based on these n-grams.
3. Handling Rare and Unknown Words:


● A significant advantage of FastText is its ability to
generate better word representations for rare words or
even words not seen during training.
● By breaking down words into n-grams, FastText can
construct meaningful representations for these words
based on their subword units.
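A minimal gensim FastText sketch of this out-of-vocabulary behaviour (toy corpus; the parameter values are arbitrary):

from gensim.models import FastText

sentences = [["natural", "language", "processing"],
             ["language", "models", "use", "word", "embeddings"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=2, max_n=4)

# "languages" never appears in the corpus, but its character n-grams overlap
# with "language", so FastText can still construct a vector for it.
print(model.wv["languages"][:5])
print(model.wv.similarity("language", "languages"))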
Advantages and Disadvantages:


Advantages:

● Better representation of rare words.


● Capable of handling out-of-vocabulary words.
● Richer word representations due to subword information.

Disadvantages:

● Increased model size due to n-gram information.


● Longer training times compared to Word2Vec.
Glove
● Global Vectors for Word Representation (GloVe) is a powerful word
embedding technique that captures the semantic relationships between
words by considering their co-occurrence probabilities within a
corpus.
● The key to GloVe’s effectiveness lies in the construction of a word-
context matrix and the subsequent factorization process.

1. Word-Context Matrix Formation:


The first step in GloVe’s mechanics involves creating a word-context
matrix. This matrix is designed to represent the likelihood of a given word
appearing near another across the entire corpus. Each cell in the matrix
holds the co-occurrence count of how often words appear together in a
certain context window.
Let’s consider a simplified example. Assume we have


the following sentences in our corpus:

● “Word embeddings capture semantic meanings.”


● “GloVe is an impactful word embedding model.”
Here, each row and column corresponds to a unique word in the corpus, and the
values in the cells represent how often these words appear together within a
certain context window.
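A rough sketch of how such a word-context co-occurrence matrix could be counted (this is only the counting step, not the full GloVe training; names and window size are illustrative):

from collections import defaultdict

docs = ["word embeddings capture semantic meanings",
        "glove is an impactful word embedding model"]
window = 2

cooc = defaultdict(int)
for doc in docs:
    tokens = doc.lower().split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(w, tokens[j])] += 1   # count the pair within the context window

print(cooc[("word", "embeddings")])         # how often the pair co-occurs in the window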
2. Factorization for Word Vectors:


With the word-context matrix in place, GloVe turns to matrix
factorization. The objective here is to decompose this high-
dimensional matrix into two smaller matrices — one representing
words and the other contexts. Let’s denote these as W for words and
C for contexts. The ideal scenario is when the dot product of W and
CT (transpose of C) approximates the original matrix:

X≈W⋅CT
Through iterative optimization, GloVe adjusts W and C to minimize


the difference between X and W⋅CT. This process yields refined vector
representations for each word, capturing the nuances of their co-
occurrence patterns.

3. Vector Representations:
Once trained, GloVe provides each word with a dense vector that
captures not just local context but global word usage patterns. These
vectors encode semantic and syntactic information, revealing
similarities and differences between words based on their overall usage
in the corpus.
Advantages and Disadvantages:


Advantages:

● Efficiently captures global statistics of the corpus.


● Good at representing both semantic and syntactic
relationships.
● Effective in capturing word analogies.

Disadvantages:

● Requires more memory for storing co-occurrence matrices.


● Less effective with very small corpora.
Choosing the Right Embedding Model

● Word2Vec: Use when semantic relationships are


crucial, and you have a large dataset.
● GloVe: Suitable for diverse datasets and when
capturing global context is important.
● FastText: Opt for morphologically rich languages
or when handling out-of-vocabulary words is
vital.
OVERVIEW OF DEEP LEARNING MODELS – RNN

● Deep learning is a machine learning technique that teaches


computers to do what comes naturally to humans: learn by
example.
● Deep learning is a key technology behind driverless cars,

enabling them to recognize a stop sign, or to distinguish a


pedestrian from a lamppost. It is the key to voice control in
consumer devices like phones, tablets, TVs, and hands-free
speakers.
● In deep learning, a computer model learns to perform
classification tasks directly from images, text, or sound.
● Deep learning models can achieve state-of-the-art accuracy,

sometimes exceeding human-level performance.


● Models are trained by using a large set of labeled data and

neural network architectures that contain many layers.

While deep learning was first theorized in the 1980s, there are two main reasons
it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example,
driverless car development requires millions of images and thousands
of hours of video.
2. Deep learning requires substantial computing power. High
performance GPUs have a parallel architecture that is efficient for
deep learning. When combined with clusters or cloud computing, this
enables development teams to reduce training time for a deep learning
network from weeks to hours or less.
Deep learning applications are used in industries from automated driving to medical
devices.
Automated Driving: Automotive researchers are using deep learning to
automatically detect objects such as stop signs and traffic lights. In addition,
deep learning is used to detect pedestrians, which helps decrease accidents.
Aerospace and Defense: Deep learning is used to identify objects from satellites
that locate areas of interest, and identify safe or unsafe zones for troops.
Medical Research: Cancer researchers are using deep learning to automatically
detect cancer cells. Teams at UCLA built an advanced microscope that yields a
high-dimensional data set used to train a deep learning application to accurately
identify cancer cells.
Industrial Automation: Deep learning is helping to improve worker safety
around heavy machinery by automatically detecting when people or objects are
within an unsafe distance of machines.
Electronics: Deep learning is being used in automated hearing and speech
translation. For example, home assistance devices that respond to your voice and
know your preferences are powered by deep learning applications.
RNN

● A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
● In traditional neural networks, all the inputs and outputs are independent of each other, but when we need to predict the next word of a sentence, the previous words are required, so there is a need to remember them.
● Thus RNNs came into existence, solving this issue with the help of a hidden layer.
● The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.
● This state is also referred to as the memory state, since it remembers the previous input to the network. The RNN uses the same parameters for each input, performing the same task at every step to produce the output. This reduces the number of parameters, unlike other neural networks.
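The hidden (memory) state update of a simple RNN can be sketched in NumPy as the recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b); the weights below are random and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                      # initial hidden (memory) state
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 input vectors
    h = np.tanh(W_x @ x_t + W_h @ h + b)      # the same weights are reused at every step
print(h)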
● An RNN treats each word of a sentence as a separate input occurring at time ‘t’ and also uses the activation value at ‘t-1’ as an input, in addition to the input at time ‘t’.
● The diagram below shows a detailed structure of an RNN architecture.

The architecture described above is also called a many-to-many architecture with Tx = Ty, i.e. number of inputs = number of outputs. Such a structure is quite useful in sequence modelling.
● Apart from the architecture mentioned above, there are three other types of RNN architectures which are commonly used.
1. Many-to-One RNN: a many-to-one architecture is one where many inputs (Tx) are used to give one output (Ty). A suitable example of such an architecture is a classification task.
RNN are a very important variant of neural networks heavily


used in Natural Language Processing.
● Conceptually they differ from a standard neural network as the

standard input in a RNN is a word instead of the entire sample


as in the case of a standard neural network.
● This gives the flexibility for the network to work with varying

lengths of sentences, something which cannot be achieved in a


standard neural network due to it’s fixed structure.
● It also provides an additional advantage of sharing features

learned across different positions of text which cannot be


obtained in a standard neural network.
In the image, H represents the output of the activation function.
One-to-Many RNN: a one-to-many architecture refers to a situation where an RNN generates a series of output values based on a single input value. A prime example of such an architecture is a music generation task, where the input is a genre or the first note.
Many to Many Architecture (Tx not equals Ty): This architecture refers to where many inputs
are read to produce many outputs, where the length of inputs is not equal to the length of outputs.
A prime example for using such an architecture is machine translation tasks.
Encoder refers to the part of the network which reads the sentence to be
translated, and, Decoder is the part of the network which translates the sentence
into desired language.

Limitations of RNN
Apart from all of its usefulness RNN does have certain limitations major of which are:
Examples of RNN architecture stated above can capture the dependencies in only one
direction of the language.
1. Basically, in the case of Natural Language Processing it assumes that the word
coming after has no effect on the meaning of the word coming before. With our
experience of languages, we know that it is certainly not true.
2. RNN are also not very good in capturing long term dependencies and the
problem of vanishing gradients resurface in RNN.
RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
● An RNN is essentially a fully connected neural network that contains a refactoring of some
of its layers into a loop. That loop is typically an iteration over the addition or concatenation
of two inputs, a matrix multiplication and a non-linear function.
Among the text usages, the following tasks are among those RNNs perform well at:
● Sequence labelling
● Natural Language Processing (NLP) text classification
● Natural Language Processing (NLP) text generation
● Other tasks that RNNs are effective at solving are time series predictions or other sequence
predictions that aren’t image or tabular-based.
● There have been several highlighted and controversial reports in the media over the
advances in text generation, OpenAI’s GPT-2 algorithm. In many cases the generated text
is often indistinguishable from text written by humans.
● RNNs effectively have an internal memory that allows the previous inputs to affect the
subsequent predictions. It’s much easier to predict the next word in a sentence with more
accuracy, if you know what the previous words were. Often with tasks well suited to RNNs,
the sequence of the items is as or more important than the previous item in the sequence.
● Sequence-to-Sequence Models: TRANSFORMERS (Translate one language


into another language)
● Sequence-to-sequence (seq2seq) models in NLP are used to convert sequences
of Type A to sequences of Type B.
● For example, translation of English sentences to German sentences is a
sequence-to-sequence task.
● Recurrent Neural Network (RNN) based sequence-to-sequence models have
garnered a lot of traction ever since they were introduced in 2014. Most of the
data in the current world are in the form of sequences – it can be a number
sequence, text sequence, a video frame sequence or an audio sequence.
● The performance of these seq2seq models was further enhanced with the
addition of the Attention Mechanism in 2015. How quickly advancements in
NLP have been happening in the last 5 years – incredible!
These sequence-to-sequence models are pretty versatile and they are used
in a variety of NLP tasks, such as:
Machine Translation
Text Summarization
Speech Recognition
Question-Answering System, and so on
RNN based Sequence-to-Sequence Model


Let’s take a simple example of a sequence-to-sequence model.
● The above seq2seq model is converting a German phrase to its English


counterpart. Let’s break it down:
● Both Encoder and Decoder are RNNs
● At every time step in the Encoder, the RNN takes a word vector (xi)
from the input sequence and a hidden state (Hi) from the previous time
step
● The hidden state is updated at each time step
● The hidden state from the last unit is known as the context vector. This
contains information about the input sequence
● This context vector is then passed to the decoder and it is then used to
generate the target sequence (English phrase)
● If we use the Attention mechanism, then the weighted sum of the
hidden states are passed as the context vector to the decoder
Challenges
Despite being so good at what it does, there are certain limitations of seq-2-seq models
with attention:
● Dealing with long-range dependencies is still challenging
● The sequential nature of the model architecture prevents parallelization.
These challenges are addressed by Google Brain’s Transformer concept
● RNNs can remember important things about the input they have received, which allows them to be very precise in predicting the next outcome. This is why they are preferred for sequential data. Examples of sequence data include time series, speech, text, financial data, audio, video, weather, and many more. Although RNNs were the state-of-the-art algorithm for dealing with sequential data, they come with their own drawbacks: because of the complexity of the algorithm, the network is quite slow to train, and with a huge number of dimensions the training becomes long and difficult.
TRANSFORMERS

Attention models/Transformers are the most exciting models


being studied in NLP research today, but they can be a bit challenging to
grasp – the pedagogy is all over the place. This is both a bad thing (it can
be confusing to hear different versions) and in some ways a good thing
(the field is rapidly evolving, there is a lot of space to improve).
Internally, the Transformer has a similar kind of architecture as the previous models
above. But the Transformer consists of six encoders and six decoders.
Each encoder is very similar to each other. All encoders have the same architecture. Decoders
share the same property, i.e. they are also very similar to each other. Each encoder consists of
two layers: Self-attention and a feed Forward Neural Network.
The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other
words in the input sentence as it encodes a specific word. The decoder has both those layers, but
between them is an attention layer that helps the decoder focus on relevant parts of the input
sentence.
Self-Attention
Let’s start to look at the various vectors/tensors and how they
flow between these components to turn the input of a trained model into
an output. As is the case in NLP applications in general, we begin by
turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We will represent those
vectors with these simple boxes. The embedding only happens in the
bottom-most encoder. The abstraction that is common to all the encoders
is that they receive a list of vectors each of the size 512.

In the bottom encoder that would be the word embeddings, but in


other encoders, it would be the output of the encoder that’s directly below.
After embedding the words in our input sequence, each of them flows
through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each
position flows through its own path in the encoder. There are dependencies between these
paths in the self-attention layer. The feed-forward layer does not have those dependencies,
however, and thus the various paths can be executed in parallel while flowing through the
feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what
happens in each sub-layer of the encoder.
Self-Attention
Let’s first look at how to calculate self-attention using vectors, then proceed to
look at how it’s actually implemented — using matrices.

Figuring out the relation of words within a sentence and giving the right attention to it.
The first step in calculating self-attention is to create three vectors from


each of the encoder’s input vectors (in this case, the embedding of each word). So
for each word, we create a Query vector, a Key vector, and a Value vector. These
vectors are created by multiplying the embedding by three matrices that we
trained during the training process.

Notice that these new vectors are smaller in dimension than the
embedding vector. Their dimensionality is 64, while the embedding and encoder
input/output vectors have dimensionality of 512. They don’t HAVE to be smaller,
this is an architecture choice to make the computation of multiheaded attention
(mostly) constant.
Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that
word. We end up creating a “query”, a “key”, and a “value” projection of each word in the input
sentence.

What are the “query”, “key”, and “value” vectors?


They’re abstractions that are useful for calculating and thinking about attention. Once
you proceed with reading how attention is calculated below, you’ll know pretty much all you
need to know about the role each of these vectors plays.
The second step in calculating self-attention is to calculate a score. Say


we’re calculating the self-attention for the first word in this example,
“Thinking”. We need to score each word of the input sentence against this
word. The score determines how much focus to place on other parts of the
input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query


vector with the key vector of the respective word we’re scoring. So if
we’re processing the self-attention for the word in position #1, the first
score would be the dot product of q1 and k1. The second score would be
the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in this example, which is 64; this leads to more stable gradients, and other values are possible, but this is the default), and then pass the result through a softmax operation. Softmax normalizes the scores so they are all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it is useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax


score (in preparation to sum them up). The intuition here is to keep intact
the values of the word(s) we want to focus on, and drown-out irrelevant
words (by multiplying them by tiny numbers like 0.001, for example).
The sixth step is to sum up the weighted value vectors. This produces the
output of the self-attention layer at this position (for the first word).
That concludes the self-attention calculation. The resulting vector is one we can send
along to the feed-forward neural network. In the actual implementation, however, this
calculation is done in matrix form for faster processing.
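The six steps above can be written compactly in matrix form. A NumPy sketch of this scaled dot-product self-attention follows; Q, K and V are random here purely for illustration, with d_k = 64 as in the text.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 3, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))   # query vectors (one row per word)
K = rng.normal(size=(seq_len, d_k))   # key vectors
V = rng.normal(size=(seq_len, d_k))   # value vectors

scores = Q @ K.T / np.sqrt(d_k)       # steps 2-3: dot products, divided by sqrt(64) = 8
weights = softmax(scores, axis=-1)    # step 4: softmax over each row
Z = weights @ V                       # steps 5-6: weighted sum of the value vectors
print(Z.shape)                        # (3, 64): one output vector per word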
Multihead attention
There are a few other details that make them work better. For
example, instead of only paying attention to each other in one dimension,
Transformers use the concept of Multihead attention. The idea behind it is
that whenever you are translating a word, you may pay different attention
to each word based on the type of question that you are asking. The
images below show what that means. For example, whenever you are
translating “kicked” in the sentence “I kicked the ball”, you may ask
“Who kicked”. Depending on the answer, the translation of the word to
another language can change. Or ask other questions, like “Did what?”,
etc…
Positional Encoding
Another important step in the Transformer is to add a positional encoding when encoding each word. This matters because the position of each word in the sentence is relevant to the translation.
Overview of Text Summarization and Topic Models
Text summarization

● Text summarization is the process of generating short, fluent, and


most importantly accurate summary of a respectively longer text
document.
● The main idea behind automatic text summarization is to be able
to find a short subset of the most essential information from the
entire set and present it in a human-readable format.
● As online textual data grows, automatic text summarization
methods have the potential to be very helpful because more useful
information can be read in a short time.
● Text summarization can be a useful case study in domains like
financial research, question-answer bots, media monitoring, social
media marketing, and so on.
Type of summarization
Based on input type:

1. Single Document, where the input length is short. Many of the early

summarization systems dealt with single-document summarization.

2. Multi-Document, where the input can be arbitrarily long.
Based on the purpose:

1. Generic, where the model makes no assumptions about the


domain or content of the text to be summarized and treats all
inputs as homogeneous. The majority of the work that has been
done revolves around generic summarization.
2. Domain-specific, where the model uses domain-specific
knowledge to form a more accurate summary. For example,
summarizing research papers of a specific domain, biomedical
documents, etc.
3. Query-based, where the summary only contains information that
answers natural language questions about the input text.
Based on output type:

1. Extractive, where important sentences are selected from the


input text to form a summary. Most summarization
approaches today are extractive in nature.
2. Abstractive, where the model forms its own phrases and
sentences to offer a more coherent summary, like what a
human would generate. This approach is definitely more
appealing, but much more difficult than extractive
summarization.
How to do text summarization

● Text cleaning

● Sentence tokenization

● Word tokenization

● Word-frequency table

● Summarization
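A minimal sketch of this pipeline using NLTK (the word-frequency scoring and the choice of 3 sentences are illustrative, not a prescribed implementation):

import heapq
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt"); nltk.download("stopwords")

def summarize(text, n_sentences=3):
    stop = set(stopwords.words("english"))
    # Text cleaning + word tokenization
    words = [w.lower() for w in word_tokenize(text) if w.isalpha() and w.lower() not in stop]
    freq = nltk.FreqDist(words)                          # word-frequency table
    # Score each sentence by the frequencies of the words it contains
    scores = {}
    for sent in sent_tokenize(text):
        for w in word_tokenize(sent.lower()):
            scores[sent] = scores.get(sent, 0) + freq.get(w, 0)
    # Keep the top-scoring sentences as the summary
    return " ".join(heapq.nlargest(n_sentences, scores, key=scores.get))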
Automatic text summarization

In this approach we build algorithms or programs which will reduce the


text size and create a summary of our text data. This is called automatic
text summarization in machine learning.
Text summarization is the process of creating shorter text without
removing the semantic structure of text.

There are two approaches to text summarization.


1. Extractive approaches
2. Abstractive approaches
EXTRACTIVE APPROACHES
● Using an extractive approach we summarize our text on the basis of
simple and traditional algorithms.
● For example, when we want to summarize our text on the basis of the
frequency method, we store all the important words and frequency of all
those words in the dictionary.
● On the basis of high frequency words, we store the sentences containing
that word in our final summary.
● This means the words which are in our summary confirm that they are
part of the given text.
● It is the traditional method developed first. The main objective is to
identify the significant sentences of the text and add them to the
summary. You need to note that the summary obtained contains
exact sentences from the original text.
● As the name suggests, extractive text summarization ‘extracts’


notable information from the large dumps of text provided and
groups them into clear and concise summaries.
● The method is very straightforward as it extracts texts based on
parameters such as the text to be summarized, the most important
sentences (Top K), and the value of each of these sentences to the
overall subject.
Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the
city, Mary gave birth to a child named Jesus.

Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.
ABSTRACTIVE APPROACHES
● An abstractive approach is more advanced.
● The approach is to identify the important sections, interpret the context and
reproduce in a new way.
● This ensures that the core information is conveyed through shortest text possible.
Note that here, the sentences in summary are generated, not just extracted from
original text.
● Abstractive text summarization generates legible sentences from the entirety of
the text provided. It rewrites large amounts of text by creating acceptable
representations, which is further processed and summarized by natural language
processing.
● What makes this method unique is its almost AI-like ability to use a machine’s
semantic capability to process text and iron out the kinks using NLP.
● Example:
● Abstractive summary: Joseph and Mary came to Jerusalem where Jesus
was born.
Techniques for text summarization in Python


Approaches to text summarization using both abstractive and extractive methods.

1. Gensim
● Gensim is an open-source topic and vector space modeling toolkit within
the Python programming language.
● First, the user needs to utilize the summarization.summarizer from
Gensim as it is based on a variation of the TextRank algorithm.
● Since TextRank is a graph-based ranking algorithm, it helps narrow
down the importance of vertices in graphs based on global information
drawn from said graphs.
● gensim is a very handy python library for performing NLP tasks. The
text summarization process using gensim library is based on TextRank
Algorithm
● TextRank is an extractive summarization technique.


● It is based on the concept that words which occur more
frequently are significant.
● Hence , the sentences containing highly frequent words are
important .
● Based on this , the algorithm assigns scores to each sentence
in the text .
● The top-ranked sentences make it to the summary.
Text Summarization with Sumy


Along with TextRank , there are various other algorithms to
summarize text.
implementation of the below algorithms for summarization using sumy
1. LexRank
2. Luhn
3. Latent Semantic Analysis, LSA
4. KL-Sum
LexRank.
How does LexRank work?
● A sentence which is similar to many other sentences of the text
has a high probability of being important.
● The approach of LexRank is that a particular sentence is
recommended by other similar sentences and hence is ranked
higher.
● Higher the rank, higher is the priority of being included in the
summarized text.
● This is an unsupervised machine learning based approach in
which we use the textrank approach to find the summary of our
sentences. Using cosine similarity and vector based algorithms,
we find minimum cosine distance among various words and
store the more similar words together.
working

!pip install sumy


import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
As the text source here is a string, you need to use the PlaintextParser.from_string() function to initialize the parser. You can specify the language used as input to the Tokenizer.
syntax: PlaintextParser.from_string(cls, string, tokenizer)
import nltk
nltk.download('punkt')
# Initializing the parser
my_parser = PlaintextParser.from_string(original_text,Tokenizer('english'))
Next create a summarizer model lex_rank_summarizer to fit your text.


The syntax is: lex_rank_summarizer(document, sentences_count)

lex_rank_summarizer = LexRankSummarizer()
lexrank_summary =
lex_rank_summarizer(my_parser.document,sentences_count=3)

# Printing the summary


for sentence in lexrank_summary: print(sentence)
Sumy:

# Load Packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Creating text parser using tokenization
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Summarize using sumy TextRank
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)

text_summary = ""
for sentence in summary:
    text_summary += str(sentence)
print(text_summary)
Lex Rank:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def sumy_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    # Summarize the document with 2 sentences
    summary = summarizer(parser.document, 2)
    dp = []
    for i in summary:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
LSA (Latent semantic analysis)

● Latent Semantic Analysis is a unsupervised learning algorithm


that can be used for extractive text summarization.
● It extracts semantically significant sentences by applying
singular value decomposition(SVD) to the matrix of term-
document frequency.
LSA

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def lsa_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_lsa = LsaSummarizer()
    summary_2 = summarizer_lsa(parser.document, 2)
    dp = []
    for i in summary_2:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence

Latent Semantic Analysis (LSA) is based on decomposing the data into a low-dimensional space. LSA is able to preserve the semantics of the given text while summarizing.
Luhn

● Luhn Summarization algorithm’s approach is based on TF-


IDF (Term Frequency-Inverse Document Frequency).
● It is useful when very low frequent words as well as highly
frequent words(stopwords) are both not significant.
● Based on this, sentence scoring is carried out and the high
ranking sentences make it to the summary.
Using Luhn:

from sumy.summarizers.luhn import LuhnSummarizer

def luhn_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_luhn = LuhnSummarizer()
    summary_1 = summarizer_luhn(parser.document, 2)
    dp = []
    for i in summary_1:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence

This approach is based on the frequency method; here we use TF-IDF (term frequency-inverse document frequency) to score sentences.
KL-Sum
● Another extractive method is the KL-Sum algorithm.
● It selects sentences based on similarity of word distribution as the original
text.
● It aims to lower the KL-divergence criteria.
● It uses greedy optimization approach and keeps adding sentences till the
KL-divergence decreases.
What is Abstractive Text Summarization?

● Abstractive summarization is the new state of art method, which


generates new sentences that could best represent the whole text.
● This is better than extractive methods where sentences are just
selected from original text for the summary.
● A simple and effective way is through the Huggingface’s transformers
library.
● HuggingFace supports state of the art models to implement tasks such
as summarization, classification, etc.. Some common models are
GPT-2, GPT-3, BERT , OpenAI, GPT, T5.
● Another awesome feature with transformers is that it provides
PreTrained models with weights that can be easily instantiated
through from_pretrained() method.
Summarization with T5 Transformers

● T5 is an encoder-decoder model. It converts all language problems


into a text-to-text format.
● First, need to import the tokenizer and corresponding model .
● It is preferred to use T5 For Conditional Generation model when
the input and output are both sequences.
● T5 is an encoder-decoder model, and hence the input sequence should be in the form of a sequence of ids, or input-ids.
● How to convert the input text into input-ids ?
● This process is called encoding the text and can be achieved
through encode() method
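A brief sketch of that flow with the HuggingFace transformers library (the "t5-small" checkpoint is chosen only for the example, and the text variable is a placeholder for the document to summarize; transformers and sentencepiece are assumed to be installed):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "..."  # placeholder: the long document to be summarized
input_ids = tokenizer.encode("summarize: " + text, return_tensors="pt",
                             max_length=512, truncation=True)

summary_ids = model.generate(input_ids, max_length=60, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))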
Summarization with BART Transformers

● transformers library of HuggingFace supports summarization with BART models.


● Import the model and tokenizer. For problems where there is need to generate
sequences , it is preferred to use BartForConditionalGeneration model.
● ” bart-large-cnn” is a pretrained model, fine tuned especially for summarization
task. You can load the model using from_pretrained() method
● You need to pass the input text in the form of a sequence of ids.
● For this, use the batch_encode_plus() function with the tokenizer. This function
returns a dictionary containing the encoded sequence or sequence pair and other
additional information.
● Now, How to limit the maximum length of the returned sequence?
● Set the max_length parameter in batch_encode_plus()
● Next, pass the input_ids to the model.generate() function to generate the ids of the summarized output. model.generate() returns a sequence of ids corresponding to the summary of the original text. You can convert the sequence of ids to text through the decode() method.
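A hedged sketch of those steps, using the facebook/bart-large-cnn checkpoint mentioned above (the article variable is a placeholder; parameter values are illustrative):

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "..."  # placeholder: the original long text
inputs = tokenizer.batch_encode_plus([article], return_tensors="pt",
                                     max_length=1024, truncation=True)

summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             max_length=80, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))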
Summarization with GPT-2 Transformers

● GPT-2 transformer is another major player in text


summarization, introduced by OpenAI. Thanks to transformers,
the process followed is same just like with BART Transformers.
● First, you have to import the tokenizer and model. Make sure
that you import a LM Head type model, as it is necessary to
generate sequences. Next, load the pretrained gpt-2 model and
tokenizer .
● After loading the model, you have to encode the input text and
pass it as an input to model.generate().
● The summary_ids contains the sequence of ids corresponding to
the text summary . You can decode it and print the summary
Summarization with XLM Transformers

● You can import the XLMWithLMHeadModel as it supports


generation of sequences.You can load the pretrained xlm-mlm-
en-2048 model and tokenizer with weights using
from_pretrained() method.
● The next steps are the same as in the last three cases. The encoded input text is passed to the generate() function, which returns an id sequence for the summary. You can decode and print the summary.
● You may notice that the XLM summary isn't very good. That is because, even though the model supports summarization, it was not fine-tuned for this task.
Using a pre-trained summarizer and evaluating its output

● What do we mean by pre-trained models:-


● These models have already been trained on large datasets.
● If a model is trained on huge amounts of data it will naturally
predict better, however, the inability to collect large amounts of
data and subsequently higher training time are some of the reasons
why instead of training a model from scratch we could benefit by
using a pre-trained model.
What Are Transformers?

● Abstractive Summarization:
● Abstractive summarization techniques emulate human writing by
generating entirely new sentences to convey key concepts from
the source text, rather than merely rephrasing portions of it.
● These fresh sentences distill the vital information while eliminating irrelevant details, often incorporating novel vocabulary absent from the original text.
● The term "Transformers" has recently come to dominate the natural language processing field; earlier abstractive summarization models relied on designs based on recurrent neural networks (RNNs).


● Transformers are a family of models that employ an encoder-decoder architecture to transform an input sequence into an output sequence. Transformers feature a distinctive "self-attention" mechanism, along with several other enhancements such as positional encoding, which set them apart. NOTE: not all Transformers are intended for use in text summarization. Let us delve into the recently released model called PEGASUS, which appears to excel in terms of output quality for text summarization.

● PEGASUS shares similarities with other transformer models; its primary distinction lies in a unique approach used during the model's pre-training. Specifically, the most crucial sentences in the training text corpora are "masked" (hidden from the model) during PEGASUS pre-training, and the model is then tasked with generating these concealed sentences as a single output sequence. A usage sketch follows.
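A minimal usage sketch with the transformers library; the google/pegasus-xsum checkpoint and the generation settings are illustrative choices:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# One of several PEGASUS checkpoints fine-tuned for summarization
model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "Your long article text goes here ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

summary_ids = model.generate(**inputs, num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))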


What is Topic Modelling in NLP?


● The average person generates in excess of 1.7 MB of digital data per second. This amounts to more than 2.5 quintillion bytes of data per day, of which 80-90% is unstructured.
● Consider a scenario where a business employs a single individual to review each piece of unstructured data and segment it based on the underlying topic. It would be an impossible task.
● It would take a significant amount of time and be extremely tedious; there is also more risk involved, since humans are naturally biased and more error-prone than machines.
● The solution is topic modeling.
Topic modeling aids businesses in:
● Performing real-time analysis on unstructured textual data
● Learning from unstructured data at scale
● Building a consistent understanding of data, regardless of its format.



● Topic modelling is the technique of identifying the related words that make up the topics present in a document or a corpus of data.
● In topic modeling, the algorithms are a collection of statistical and deep learning methods for identifying latent semantic structures in collections of documents.
● Topic modelling is an unsupervised approach that recognizes or extracts topics by detecting patterns, much as clustering algorithms divide data into different groups.
● In the same way, topic modelling tells us the different topics present in the documents.
● This is done by extracting patterns of word clusters and word frequencies in the documents.


Applications of Topic Modeling in NLP

● Information Retrieval: Topic modeling supports information retrieval in computer science, particularly in the context of search engines. It is incorporated into various text-processing, rule-based systems to extract topics from text input and retrieve relevant information.
● Document Clustering: Topic modeling can be used to group similar documents together based on the topics they contain. It is useful in a range of applications such as news aggregation, online discussion forums, and social media analysis.
● Content Recommendation: Topic modeling can be used to identify the topics that a user is interested in and recommend content that matches those topics. This is useful in various applications, such as content personalization on websites, e-commerce product recommendations, and news article recommendations.


● Sentiment Analysis: Topic modeling can be used to identify the sentiment of a document or a section of text. By identifying the topics discussed in the text and the sentiment associated with each topic, we can better understand the overall sentiment of the document.
● Trend Analysis: Topic modeling can be used to identify the topics that are currently trending in a given domain or industry. This can be useful in various applications such as market research and news analysis.
● Keyword Extraction: Topic modeling can be used to identify the most important keywords in a document or a section of text. This is useful for tasks such as search engine optimization (SEO), information retrieval, and content analysis.


● This type of modelling is very useful when there are many documents and we want to know what kind of information is present in them. Doing this manually takes a lot of time, whereas topic modelling can do it in very little time.
● Topic modeling is a frequently used approach to discover the hidden semantic patterns portrayed by a text corpus and to automatically identify the topics that exist inside it.
● Namely, it is a type of statistical modeling that leverages unsupervised machine learning to analyze and identify clusters or groups of similar words within a body of text.
● For example, a topic modeling algorithm may be deployed to determine whether the contents of a document imply it is an invoice, a complaint, or a contract.


[Figure: a visualization of how topic modeling works]


Different topic modeling approaches:

1. Latent Semantic Analysis (LSA)
2. Probabilistic Latent Semantic Analysis (PLSA)
3. Latent Dirichlet Allocation (LDA)
4. Non-negative Matrix Factorization (NMF)


Latent Semantic Analysis or Latent Semantic Indexing (LSA)

● Latent Semantic Analysis (LSA) is one of the basic topic modeling techniques.
● LSA is a natural language processing technique used to analyze relationships between documents and the terms they contain.
● The first step is to generate a Document-Term Matrix (DTM). If you have m documents and n words in your vocabulary, you can create an m × n matrix A.
● Here each row represents a document and each column represents a word. In the simplest version of LSA, each entry is a raw count of the number of times the jth word occurs in the ith document.


● LSA assumes that words with similar meanings will appear in similar documents.
● It does so by constructing a matrix of word counts per document (here viewed as a term-document matrix, the transpose of the DTM above, with each row a unique word and each column a document), and then using Singular Value Decomposition (SVD) to reduce the number of rows while preserving the similarity structure among columns.
● SVD is a mathematical method that simplifies data while keeping its important features. It is used here to maintain the relationships between words and documents.


● To determine the similarity between documents, cosine similarity is used. This is a measure that calculates the cosine of the angle between two vectors, in this case representing documents (the formula is given after this list).
● A value close to 1 means the documents are very similar based on the words in them, whereas a value close to 0 means they are quite different.
● Raw counts don't work well because they don't take into account the significance of every word in the document.
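For reference, for two document vectors d1 and d2 the cosine similarity mentioned above is the standard quantity

\cos(\theta) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\lVert \mathbf{d}_1 \rVert \, \lVert \mathbf{d}_2 \rVert}

i.e., the dot product of the two vectors divided by the product of their lengths.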


● Therefore, LSA models typically replace the raw counts of the DTM with
TF-IDF scores. TF-IDF or term frequency-inverse document frequency
assigns a weight to term j in document i as follows:
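In the standard TF-IDF weighting this refers to (one common form; the notation is assumed here), the weight of term j in document i is

w_{i,j} = \mathrm{tf}_{i,j} \times \log\left( \frac{N}{\mathrm{df}_j} \right)

where tf_{i,j} is the number of times term j occurs in document i, df_j is the number of documents containing term j, and N is the total number of documents in the corpus.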


● The more frequently a word appears in a document, the greater its weight; and the less frequently it occurs across the corpus, the greater its weight.
● Although LSA is fast and efficient to use, it has some significant drawbacks, such as a lack of interpretable embeddings (we don't know what each topic represents, and the components can be arbitrarily positive or negative) and the need for a large collection of documents and a large vocabulary to get accurate results.
● LSA finds low-dimensional representations of documents and words.
● The dot product of row vectors gives document similarity, and the dot product of column vectors gives word similarity.
● Truncated singular value decomposition (SVD) is applied to reduce the dimensionality of the document-term matrix.
● In both U and V, a column corresponds to one of the t topics. In U, rows represent document vectors expressed in terms of topics. In V, rows represent term vectors expressed in terms of topics. A small end-to-end sketch follows.
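A minimal end-to-end LSA sketch with scikit-learn; the toy documents and the number of topics are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "the economy is facing a slowdown and the stock market fell",
    "profits and losses are reported on the stock market",
    "the team won the match and celebrated the victory",
    "the coach praised the players after the match",
]

# Build the TF-IDF weighted document-term matrix
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Truncated SVD reduces the DTM to a small number of latent topics
svd = TruncatedSVD(n_components=2, random_state=42)
doc_topic = svd.fit_transform(X)   # documents expressed in terms of topics

# Read the top words of each topic off the SVD components
terms = vectorizer.get_feature_names_out()
for k, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]
    print(f"Topic {k}:", [terms[i] for i in top])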


Latent Dirichlet Allocation


● In LDA, "latent" indicates the hidden topics present in the data, and "Dirichlet" is the form of distribution used.
● LDA is a Bayesian network, meaning it is a generative statistical model that assumes documents are made up of words that help determine the topics.
● Thus, documents are mapped to a list of topics by assigning each word in the document to different topics.
● This model ignores the order in which words occur in a document and treats them as a bag of words.
● The Dirichlet distribution is different from the normal distribution.


● The normal distribution tells us how the data deviates around the mean, and this differs according to the variance present in the data.
● When the variance is high, the values in the data will be both much smaller and much larger than the mean and can form skewed distributions.
● If the variance is small, samples will be close to the mean, and if the variance is zero, they will lie exactly at the mean.


● Many ML algorithms assume that the data is normally distributed, i.e., follows a Gaussian distribution.
● The normal distribution represents data over the real numbers, whereas the Dirichlet distribution represents data such that the sampled values sum up to 1.
● In other words, the Dirichlet distribution is a probability distribution over a probability simplex (vectors of non-negative values that sum to 1), rather than over the space of real numbers as in the normal distribution.


LDA is based upon two general assumptions:

● Documents that have similar words usually have the same topic.
● Documents that have groups of words frequently occurring together usually have the same topic.

These assumptions make sense because documents on the same topic, for instance business, will share words like "economy", "profit", "stock market", "loss", etc. The second assumption states that if these words frequently occur together in multiple documents, those documents may belong to the same category.
Mathematically, the two assumptions can be represented as:
● Documents are probability distributions over latent topics.
● Topics are probability distributions over words.
A minimal code sketch follows.
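A minimal LDA sketch with scikit-learn; the toy documents and parameters are illustrative (gensim's LdaModel is another common choice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the economy is facing a slowdown and the stock market fell",
    "profits and losses are reported on the stock market",
    "the team won the match and celebrated the victory",
    "the coach praised the players after the match",
]

# LDA works on raw word counts (bag of words), not TF-IDF
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(X)   # per-document topic distributions (rows sum to 1)

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top = component.argsort()[::-1][:4]
    print(f"Topic {k}:", [terms[i] for i in top])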


Summary

● Topic modeling is a popular natural language processing technique used to create structured data from a collection of unstructured data.
● In other words, the technique enables businesses to learn the hidden semantic patterns portrayed by a text corpus and automatically identify the topics that exist inside it.
● Two popular topic modeling approaches are LSA and LDA.
● They both seek to discover the hidden patterns in text data, but they make different assumptions to achieve their objective.
● Whereas LSA assumes that words with similar meanings will appear in similar documents, LDA assumes that documents are made up of words that aid in determining the topics.


Evaluation Metrics in NLP: Confusion Matrix, Accuracy, Precision, Recall, F1 Score

How do we evaluate the performance of a machine learning model?

Let us consider the task of classifying whether a person is pregnant or not pregnant. If the test for pregnancy is positive (+ve), then the person is pregnant. On the other hand, if the test for pregnancy is negative (-ve), then the person is not pregnant.

Now consider the above classification (pregnant or not pregnant) carried out by a machine learning algorithm. The output of the machine learning algorithm can be mapped to one of the following categories.


1. A person who is actually pregnant (positive) and classified as pregnant (positive). This is called a TRUE POSITIVE (TP).


2. A person who is actually not pregnant (negative) and classified as not pregnant (negative). This is called a TRUE NEGATIVE (TN).


3. A person who is actually not pregnant (negative) and classified as pregnant (positive). This is called a FALSE POSITIVE (FP).


4. A person who is actually pregnant (positive) and classified as not pregnant (negative). This is called a FALSE NEGATIVE (FN).


● What we desire are TRUE POSITIVES and TRUE NEGATIVES, but due to misclassifications we may also end up with FALSE POSITIVES and FALSE NEGATIVES.
● So there is confusion in classifying whether a person is pregnant or not.
● This is because no machine learning algorithm is perfect. We represent this confusion in classifying the data in a matrix called the confusion matrix.


● Now, we select 100 people, which include pregnant women, women who are not pregnant, and men with a fat belly.
● Let us assume that out of these 100 people, 40 are pregnant and the remaining 60 include the women who are not pregnant and the men with a fat belly.
● We now use a machine learning algorithm to predict the outcome.
● The predicted outcome (pregnancy +ve or -ve) produced by the machine learning algorithm is termed the predicted label, and the true outcome (which in this case we know from the doctor's/expert's records) is termed the true label.


Out of 40 pregnant women, 30 are classified correctly and the remaining 10 are classified as not pregnant by the machine learning algorithm. On the other hand, out of the 60 people in the not-pregnant category, 55 are classified as not pregnant and the remaining 5 are classified as pregnant.

In this case, TN = 55, FP = 5, FN = 10, TP = 30. The confusion matrix is as follows.
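(Rows are the actual labels, columns the predicted labels.)

                       Predicted: Not pregnant    Predicted: Pregnant
Actual: Not pregnant          TN = 55                   FP = 5
Actual: Pregnant              FN = 10                   TP = 30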


What is the accuracy of the machine learning model for this classification task?

● Accuracy represents the number of correctly classified data instances over the total number of data instances; in symbols, Accuracy = (TP + TN) / (TP + TN + FP + FN).
● In this example, Accuracy = (55 + 30) / (55 + 5 + 30 + 10) = 0.85, i.e., 85%.


● Is accuracy the best measure?
● Accuracy may not be a good measure if the dataset is not balanced (i.e., the negative and positive classes have different numbers of data instances).


We also measure the precision (positive predictive value) of the classifier in labelling the data instances. Precision is defined as follows:

Precision = TP / (TP + FP)


What does precision mean?

Precision should ideally be 1 (high) for a good classifier. Precision becomes 1 only when the numerator and denominator are equal, i.e., TP = TP + FP, which also means FP is zero. As FP increases, the denominator becomes greater than the numerator and the precision value decreases (which we don't want).

So in the pregnancy example, Precision = 30 / (30 + 5) = 0.857.


Recall, also known as sensitivity or the true positive rate, is defined as follows:

Recall = TP / (TP + FN)

Recall should ideally be 1 (high) for a good classifier. Recall becomes 1 only when the numerator and denominator are equal, i.e., TP = TP + FN, which also means FN is zero. As FN increases, the denominator becomes greater than the numerator and the recall value decreases (which we don't want).


So in the pregnancy example, Recall = 30 / (30 + 10) = 0.75.


● So, ideally, in a good classifier we want both precision and recall to be one, which also means that FP and FN are zero.
● Therefore we need a metric that takes both precision and recall into account. The F1 score is such a metric and is defined as follows:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)


The F1 score becomes 1 only when precision and recall are both 1. It becomes high only when both precision and recall are high, since it is the harmonic mean of precision and recall, and it is a better measure than accuracy when the classes are imbalanced.

In the pregnancy example,

F1 Score = 2 × (0.857 × 0.75) / (0.857 + 0.75) = 0.799.
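These numbers can be reproduced with scikit-learn; the label vectors below are constructed to match the counts in the example (their ordering is illustrative):

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# 1 = pregnant (positive class), 0 = not pregnant
# Labels chosen so that TP = 30, FN = 10, TN = 55, FP = 5
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 30 + [0] * 10 + [0] * 55 + [1] * 5

print(confusion_matrix(y_true, y_pred))               # [[55  5], [10 30]]
print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.85
print("Precision:", precision_score(y_true, y_pred))  # ~0.857
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1 score :", f1_score(y_true, y_pred))         # ~0.80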


FOCUS ON LEARNING

THANK YOU ☺
