
TEXT AND SPEECH ANALYSIS

(AIML543PE04)
2024-25

Dr. DEEPA YOGISH


Associate Professor, Department of CSE, School of Engineering and Technology
CHRIST (Deemed to be University), Bangalore, India

Contact No.: +919632948736 | [email protected]

MISSION, VISION, CORE VALUES

VISION: Excellence and Service
MISSION: CHRIST is a nurturing ground for an individual’s holistic development to make effective contribution to the society in a dynamic environment.
CORE VALUES: Faith in God | Moral Uprightness | Love of Fellow Beings | Social Responsibility | Pursuit of Excellence

CHRIST (Deemed to be University), Bangalore


Vision and Mission

VISION : Excellence and Service

MISSION : CHRIST (Deemed to be University) is a nurturing ground for


an individual's holistic development to make effective contribution to
society in a dynamic environment.

Core Values:
• Faith in God
• Moral Uprightness
• Love of Fellow Beings
• Social Responsibility
• Pursuit of Excellence


SCHOOL OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Department’s Vision and Mission

VISION : “To fortify Ethical Computational Excellence”

MISSION :
● Imparts core and contemporary knowledge in the areas of
Computation and Information Technology
● Promotes the culture of research and facilitates higher studies
● Acquaints the students with the latest industrial practices, team
building and entrepreneurship
● Sensitizes the students to serve the environmental, social and ethical needs of society through lifelong learning.


Course Objectives
• Understand natural language processing basics and apply classification algorithms to text documents.

• Build Question Answering and dialogue systems.

• Develop a speech recognition system and a speech synthesizer.


Course Learning Outcomes

1. Apply natural language preprocessing techniques for text using NLTK. (L3)

2. Apply deep learning techniques for text classification and word embedding. (L3)

3. Develop language models for Question Answering and build chatbots and dialogue systems. (L3)

4. Develop deep learning models to convert text to speech. (L3)

5. Apply deep learning models for building speech recognition and text-to-speech systems. (L3)


Text Books
Daniel Jurafsky and James H. Martin, “Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition”, Third
Edition, 2022.


Reference Books

R1. Dipanjan Sarkar, “Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data”, Apress, 2018.

R2. Steven Bird, Ewan Klein, and Edward Loper, “Natural Language Processing with Python”, O’Reilly.

R3. Lawrence Rabiner, Biing-Hwang Juang, B. Yegnanarayana, “Fundamentals of Speech Recognition”, 1st Edition, Pearson, 2009.


Continuous Internal Assessments (CIAs)-Elective

● CIA 1 - 20 Marks

● CIA 2 (Mid Semester Exam (MSE)) - 50 Marks, weighted to 30 Marks

● CIA 3 - 20 Marks

● Attendance - 5 Marks
● Lab Exam - 50 Marks, weighted to 35 Marks
● End Semester Exam (ESE) - 50 Marks, weighted to 30 Marks


Continuous Internal Assessments (CIAs)


CIA Component | Details
CIA 1A | Programming Assignment (Unit 1)
CIA 1B | Problem Solving (Unit 2)
CIA 2 (MSE) | Closed Book Test (50 M): Q1 Unit 1, Q2 Unit 1, Q3 Unit 2, Q4 Unit 2, Q5 Unit 3, Q6 Unit 3 (answer any one of Q5 or Q6)
CIA 3A | Problem Solving - Real-Time Case Study (Units 3 and 4)
CIA 3B | Case Study Implementation (Units 3, 4 and 5)


Laboratory Components
● Observation – Algorithm/Pseudocode must be written.
○ Students must complete the observation before coming to the lab.
● Record – Program and Input & Output must be written (printout).
● Evaluation scheme for observation:
○ 10 marks – if completed on the same day
○ 5-8 marks – if completed before the next lab
○ Marked absent for the 1st hour – if not shown (if they complete it within that hour, they will be permitted to attend the 2nd hour and 5 marks will be given)
● Each lab experiment must be completed in the respective lab only.
● Record must be shown on a weekly basis (e.g., the Lab 1 record must be shown during Lab 2).

Unit – 1 -Natural Language Basics


● Foundations of natural language processing
● Language Syntax and Structure
● Text Preprocessing and Wrangling
● Text tokenization
● Stemming
● Lemmatization
● Removing stop words
● Feature Engineering for Text representation
● Bag of Words model
● Bag of N-Grams model
● TF-IDF model

INTRODUCTION

● Artificial intelligence (AI) integration has revolutionized various industries, and now it is transforming the realm of human behavior research.
● This integration marks a significant milestone in data collection and analysis, enabling users to unlock deeper insights from spoken language and empowering researchers and analysts with enhanced capabilities for understanding and interpreting human communication.
● Human interactions are a critical part of many organizations.
● Many organizations analyze speech or text via natural language processing (NLP) and link them to insights and automation such as text categorization, text classification, information extraction, etc.


● In business intelligence, speech and text analytics enable us to gain


insights into customer-agent conversations through sentiment
analysis, and topic trends.
● These insights highlight areas of improvement, recognition, and
concern, to better understand and serve customers and employees.
● Speech and text analytics features provide automated speech and text
analytics capabilities on 100% of interactions to provide deep insight
into customer-agent conversations.
● Speech and text analytics is a set of features that uses natural language
processing (NLP) to provide an automated analysis of an interaction’s
content and insight into customer-agent conversations.


NLP and Artificial Intelligence
• Branch of AI
• Interface with humans
• Deals with a complex artifact like language
• Deep and shallow NLP
• Super-applications of NLP

Difference from other AI tasks
• Higher-order cognitive skills
• Inherently discrete
• Diversity of languages


Monolingual Applications:
Document Classification, Sentiment Analysis, Entity Extraction, Relation Extraction, Information Retrieval, Question Answering, Conversational Systems

Cross-lingual Applications:
Translation, Transliteration, Information Retrieval, Question Answering, Conversation Systems

Mixed Language Applications:
Code-Mixing, Creole/Pidgin languages, Language Evolution, Comparative Linguistics


Analysis vs. Synthesis

Analysis: Document Classification, Sentiment Analysis, Entity Extraction, Relation Extraction, Information Retrieval, Parsing

Synthesis: Question Answering, Conversational Systems, Machine Translation, Grammar Correction, Text Summarization


Using Language Before ~1950


● Verbal communication between people
■ Day-to-day conversations

■ Oral history

● Written communication for people


■ Stone tablets, scrolls, books, etc.

■ Permanent record of written language

Source: Wiki Commons (CC BY-SA 4.0): Rosetta Stone



Since 1950: Communication with Machines


~50s-70s: Basic symbolic languages (e.g., punch cards)
~80s: Formal languages (e.g., programming languages)
Today: Natural language (e.g., conversational agents / chatbots)

Source: Wiki Commons (CC BY-SA 4.0): punch cards, programming



Communication with Machines


Humans ↔ Machines, via Natural Language:
Analysis (human → machine) and Generation (machine → human)

Source: Wiki Commons (CC BY-SA 4.0): gpu



Foundations of natural language processing


● Natural Language Processing (NLP) is the process of understanding and producing meaningful phrases and sentences in the form of natural language.
● Natural Language Processing comprises Natural Language Understanding (NLU) and Natural Language Generation (NLG).
● NLU takes natural language input and maps it into a representation the machine can work with, supporting tasks such as information extraction and retrieval, sentiment analysis, and more.
● NLG takes data as input and maps it into natural language output.
● NLP can be thought of as an intersection of Linguistics, Computer Science and Artificial Intelligence that helps computers understand, interpret and manipulate human language.


Fig. NLP Overview


● Ever since then, there has been an immense amount of study and
development in the field of Natural Language Processing.
● Today NLP is one of the most in-demand and promising fields of
Artificial Intelligence!
● There are two main parts to Natural Language Processing:

● 1. Data Preprocessing
● 2. Algorithm Development


Applications: Machine Translation, Information Retrieval, Question Answering, Dialogue Systems, Information Extraction, Summarization, Sentiment Analysis, ...

Core technologies: Language modeling, Part-of-speech tagging, Syntactic parsing, Named-entity recognition, Coreference resolution, Word sense disambiguation, Semantic Role Labelling, ...


● In Natural Language Processing, machine learning training algorithms


study millions of examples of text — words, sentences, and
paragraphs — written by humans.
● By studying the samples, the training algorithms gain an
understanding of the “context” of human speech, writing, and other
modes of communication.
● This training helps NLP software to differentiate between the
meanings of various texts.
● The five phases of NLP involve lexical (structure) analysis, parsing,
semantic analysis, discourse integration, and pragmatic analysis.
● Some well-known application areas of NLP are Optical Character
Recognition (OCR), Speech Recognition, Machine Translation, and
Chatbots.


NLP APPLICATIONS

● NLP application in almost daily use


■ Machine translation
■ Conversational agents (e.g., chat bots)
■ Text summarization
■ Text generation (e.g., autocomplete)

● Applications powered by NLP


■ Social media
■ Search engines
■ Writing assistants (e.g., grammar checking)


Machine Translation

Conversational Agents
● Conversational agents
— core components
■ Speech recognition

■ Language analysis

■ Dialogue processing

■ Information retrieval

■ Text-to-Speech


Conversational Agents — Question Answering


Text Summarization

Google's cloud unit looked into using artificial intelligence


to help a financial firm decide whom to lend money to. It
turned down the client's idea after weeks of internal
discussions, deeming the project too ethically dicey.
Google has also blocked new AI features analysing
emotions, fearing cultural insensitivity. Microsoft restricted
software mimicking voices and IBM rejected a client
request for an advanced facial-recognition system.

Text Generation
● Example: Autocomplete
■ Given the first words of a sentence,
predict the next most likely word


Text Generation
● Example: Image Captioning

➜ "A man riding a red


bicycle."


Other Applications
● Spelling correction

● Document clustering

● Document classification, e.g.:


■ Spam detection

■ Sentiment analysis

■ Authorship attribution


Question Answering: IBM’s Watson

• Won Jeopardy on February 16, 2011!

Clue: WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDOVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL
Answer: Bram Stoker


Dan Jurafsky

Information Extraction
Email:
  To: Dan Jurafsky
  Subject: curriculum meeting
  Date: January 15, 2012
  Hi Dan, we’ve now scheduled the curriculum meeting.
  It will be in Gates 159 tomorrow from 10:00-11:30.
  -Chris

→ Create new Calendar entry
  Event: Curriculum mtg
  Date: Jan-16-2012
  Start: 10:00am
  End: 11:30am
  Where: Gates 159

Dan Jurafsky

Information Extraction & Sentiment Analysis


Attributes: zoom, affordability, size and weight, flash, ease of use

Size and weight:
✓ nice and compact to carry!
✓ since the camera is small and light, I won't need to carry around those heavy, bulky professional cameras either!
✗ the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera


Dan Jurafsky

Machine Translation
• Helping human translators
• Fully automatic
Enter Source Text:

这 不过 是 一 个 时间 的 问题 .
Translation from Stanford’s Phrasal:

This is only a matter of time.


Language Technology (Dan Jurafsky)

Mostly solved:
● Spam detection (e.g., “Buy V1AGRA …” ✗ vs. “Let’s go to Agra!” ✓)
● Part-of-speech (POS) tagging (e.g., “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV)
● Named entity recognition (NER) (e.g., “Einstein met with UN officials in Princeton” → PERSON, ORG, LOC)

Making good progress:
● Sentiment analysis (e.g., “Best roast chicken in San Francisco!”, “The waiter ignored us for 20 minutes.”)
● Coreference resolution (e.g., “Carter told Mubarak he shouldn’t run again.”)
● Word sense disambiguation (WSD) (e.g., “I need new batteries for my mouse.”)
● Parsing (e.g., “I can see Alcatraz from the window!”)
● Machine translation (MT) (e.g., “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”)
● Information extraction (IE) (e.g., “You’re invited to our dinner party, Friday May 27 at 8:30” → add Party, May 27 to calendar)

Still really hard:
● Question answering (QA) (e.g., “How effective is ibuprofen in reducing fever in patients with acute febrile illness?”)
● Paraphrase (e.g., “XYZ acquired ABC yesterday” ≈ “ABC has been taken over by XYZ”)
● Summarization (e.g., “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”)
● Dialog (e.g., “Where is Citizen Kane playing in SF?” → “Castro Theatre at 7:30. Do you want a ticket?”)


What is Natural Language?


● Natural Language
■ Means of communicating thoughts, feelings, opinions, ideas, etc.

■ Not formal, yet systematic: rules can emerge which were not previously defined
(this includes that many rules can be bent until reaching a breaking point)

■ Characteristics: ambiguous, redundant, changing, unbounded, imprecise, etc.

● Text / Writing
■ Visual representation of verbal communication (i.e., Natural Language)

■ Writing system: agreed meaning behind the sets of characters that make up a text
(most importantly: letters, digits, punctuation, white space characters: spaces, tabs, new lines, etc.)


Core Building Blocks of (Written) Language

● Character: basic symbol of written language (letter, numeral, punctuation mark, etc.), e.g., r, e, a, c, t, i, o, n
● Morpheme (1..n characters): smallest meaning-bearing unit in a language, e.g., re-act-ion
● Word (1..n morphemes): single independent unit of language that can be represented, e.g., reaction
● Phrase (1..n words): group of words expressing a particular idea or meaning, e.g., his quick reaction
● Clause (1..n phrases): phrase with a subject and verb, e.g., his quick reaction saved him


Core Building Blocks of (Written) Language


● Sentence (1..n clauses): expresses an independent statement, question, request, exclamation, etc., e.g., His quick reaction saved him from the oncoming traffic.
● Paragraph (1..n sentences): self-contained unit of discourse in writing dealing with a particular point or idea, e.g., Bob lost control of his car. His quick reaction saved him from the oncoming traffic. Luckily nobody was hurt and the damage to the car was minimal.
● (Text) Document (1..n paragraphs): written representation of thought
● Corpus (1..n documents): collection of writings (i.e., written texts)


Morphemes
● Morpheme
■ Smallest meaning-bearing unit in a language ➜ word = 1..n
morphemes

● Example: Prefixes & Suffixes


■ Change the semantic meaning or the part of speech of the affected word

un-happy de-frost-er hope-less

■ Assign a particular grammatical property to that word (e.g., tense, number, possession, comparison)

walk-ed elephant-s Bob-'s fast-er


Examples

Word                          | Prefixes   | Stem      | Suffixes
dogs                          |            | dog       | -s
walked                        |            | walk      | -ed
imperfection                  | im-        | perfect   | -ion
hopelessness                  |            | hope      | -less -ness
undesirability                | un-        | desire    | -able -ity
unpremeditated                | un- pre-   | meditate  | -ed
antidisestablishmentarianism  | anti- dis- | establish | -ment -arian -ism

Examples with multiple stems: daydream-ing, paycheck-s, skydive-er
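To see how such affixes are handled in code, here is a minimal sketch (assuming NLTK is installed and its WordNet data can be downloaded) that compares a rule-based stemmer with a dictionary-based lemmatizer on a few of the example words above; the exact outputs depend on the stemmer's heuristics.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # the lemmatizer needs the WordNet lexicon

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['dogs', 'walked', 'hopelessness', 'undesirability']:
    print(word,
          '| stem:', stemmer.stem(word),                    # crude suffix stripping
          '| lemma:', lemmatizer.lemmatize(word, pos='n'))  # WordNet lookup, treating the word as a noun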


Natural language is the object of study of NLP

Linguistics is the study of natural language. Just as you need to know the laws of physics to build mechanical devices, you need to know the nature of language to build tools to understand/generate language.

Some interesting reading material:
1) Linguistics: Adrian Akmajian et al.
2) The Language Instinct: Steven Pinker (for a general audience; highly recommended)
3) Other popular linguistics books by Steven Pinker

Source: Wikipedia


Phonetics & Phonology

[Figures: the human vocal tract; the International Phonetic Alphabet (IPA) chart]

• Phonemes are the basic distinguishable sounds of a language
• Every language has a sound inventory


Language Diversity
Phonology/Phonetics:
- Retroflex sounds are mostly found in Indian languages
- Tonal languages (Chinese, Thai)

Morphology:
- Chinese: isolating language
- Malayalam: agglutinative language

Syntax:
- SOV language (Hindi): मैं बाज़ार जा रहा हूँ (Subject (S), Object (O), Verb (V))
- SVO language (English): I am going to the market (Subject (S), Verb (V), Object (O))
- Free-order vs. fixed-order languages


Five Phases of NLP


The five phases, from "shallower" to "deeper" analysis:

● Lexical Analysis (characters, morphemes, words; understanding the structure & meaning of words): Tokenization, Normalization, Stemming, Lemmatization
● Syntactic Analysis (organization of words into sentences): Part-of-Speech Tagging, Syntactic parsing (constituents, dependencies)
● Semantic Analysis (phrases, clauses, sentences; meaning of words and sentences): Word Sense Disambiguation, Named Entity Recognition, Semantic Role Labeling
● Discourse Analysis (paragraphs, documents; meaning of sentences in documents): Coreference / anaphora resolution, Ellipsis resolution
● Pragmatic Analysis (world knowledge, common sense; understanding & interpreting language in context): Textual Entailment, Intent recognition

Phase I: Lexical or morphological analysis

● The first phase of NLP is word structure analysis, which is referred to


as lexical or morphological analysis.
● A lexicon is defined as a collection of words and phrases in a given
language, with the analysis of this collection being the process of
splitting the lexicon into components, based on what the user sets as
parameters – paragraphs, phrases, words, or characters.
● Similarly, morphological analysis is the process of identifying the
morphemes of a word.
● A morpheme is a basic unit of English language construction, which
is a small element of a word, that carries meaning.


● These can be either a free morpheme (e.g. walk) or a bound


morpheme (e.g. -ing, -ed), with the difference between the two being
that the latter cannot stand on its own to produce a word with
meaning, and should be assigned to a free morpheme to attach
meaning.
● In search engine optimization (SEO), lexical or morphological
analysis helps guide web searching.
● For instance, when doing on-page analysis, you can perform lexical
and morphological analysis to understand how often the target
keywords are used in their core form (as free morphemes, or when in
composition with bound morphemes).


● This type of analysis can ensure that you have an accurate


understanding of the different variations of the morphemes that are
used.
● Morphological analysis can also be applied in transcription and
translation projects, so can be very useful in content repurposing
projects, and international SEO and linguistic analysis.


Lexical Analysis — Tokenization


● Tokenization
■ Splitting a sentence or text into meaningful / useful units

■ Different levels of granularity applied in practice

character-based: S h e ' s   d r i v i n g   f a s t e r   t h a n   a l l o w e d .

subword-based: She 's driv ing fast er than allow ed .

word-based: She's driving faster than allowed .
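A minimal sketch of these granularities in Python, assuming NLTK and its 'punkt' tokenizer data are available (subword tokenization normally requires a trained model such as BPE, so only the character and word levels are shown here):

import nltk
nltk.download('punkt', quiet=True)  # data for NLTK's default word tokenizer

text = "She's driving faster than allowed."

char_tokens = list(text)                # character-level tokens
word_tokens = nltk.word_tokenize(text)  # word-level tokens, e.g. ["She", "'s", "driving", ...]

print(char_tokens)
print(word_tokens)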


Phase II: Syntax analysis (parsing)

● Syntax Analysis is the second phase of natural language processing.


● Syntax analysis or parsing is the process of checking grammar,
word arrangement, and overall – the identification of relationships
between words and whether those make sense.
● The process involved examination of all words and phrases in a
sentence, and the structures between them.
● As part of the process, there’s a visualization built of semantic
relationships referred to as a syntax tree (similar to a knowledge
graph).
● This process ensures that the structure and order and grammar of
sentences makes sense, when considering the words and phrases that
make up those sentences.


● Syntax analysis also involves tagging words and phrases with POS tags. There are two common approaches to constructing the syntax tree – top-down and bottom-up – and both check for valid sentence formation, or else they reject the input.
● Syntax analysis can be beneficial for SEO in several ways:
● Programmatic SEO: Checking whether the produced content makes
sense, especially when producing content at scale using an automated
or semi-automated approach.
● Semantic analysis: Once you have a syntax analysis conducted,
semantic analysis is easy, as well as uncovering the relationship
between the different entities recognized in the content.


Syntactic Analysis — Part-of-Speech Tagging


● Part-of-Speech (POS) tagging
■ Labeling each word in a text corresponding to a part of speech

■ Basic POS tags: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection

Example: “Bob walked slowly because of his swollen ankle.”

Bob/NNP (proper noun, singular), walked/VBD (verb, past tense), slowly/RB (adverb), because/IN of/IN (preposition or subordinating conjunction), his/PRP$ (possessive pronoun), swollen/JJ (adjective), ankle/NN (noun, singular or mass), ./. (punctuation)


Syntactic Analysis — Syntactic Parsing


● Dependency parsing
■ Analyze the grammatical structure in a sentence

■ Find related words & the type of the relationship between them

Example: Dependency Graph


Phase III: Semantic analysis


● Semantic analysis is the third stage in NLP, when an analysis is
performed to understand the meaning in a statement.
● This type of analysis is focused on uncovering the definitions of
words, phrases, and sentences and identifying whether the way words
are organized in a sentence makes sense semantically.
● This task is performed by mapping the syntactic structure, and
checking for logic in the presented relationships between entities,
words, phrases, and sentences in the text. There are a couple of
important functions of semantic analysis, which allow for natural
language understanding:
● To ensure that the data types are used in a way that’s consistent with
their definition.
● To ensure that the flow of the text is consistent.


● Identification of synonyms, antonyms, homonyms, and other lexical


items.
● Overall word sense disambiguation.
● Relationship extraction from the different entities identified from the
text.
● There are several things you can utilise semantic analysis for in SEO.
Here are some examples:
● Topic modeling and classification – sort your page content into
topics (predefined or modelled by an algorithm).
● You can then use this for ML-enabled internal linking, where you link
pages together on your website using the identified topics.
● Topic modeling can also be used for classifying first-party collected
data such as customer service tickets, or feedback users left on your
articles or videos in free form (i.e. comments).


● Entity analysis, sentiment analysis, and intent classification –


● You can use this type of analysis to perform sentiment analysis and
identify intent expressed in the content analysed.
● Entity identification and sentiment analysis are separate tasks, and
both can be done on things like keywords, titles, meta descriptions,
page content, but works best when analysing data like comments,
feedback forms, or customer service or social media interactions.
● Intent classification can be done on user queries (in keyword research
or traffic analysis), but can also be done in analysis of customer
service interactions.


Semantic Analysis — Word Sense Disambiguation


● Word Sense Disambiguation (WSD)
■ Identification of the right sense of a word among all possible senses

■ Semantic ambiguity: many words have multiples meanings (i.e., senses)

Example: “She heard a loud shot from the bank during the time of the robbery.”

Possible senses of “bank”: sloping land; depository financial institution; arrangement of similar objects; …
Possible senses of “shot”: the act of firing a projectile; an attempt to score in a game; a consecutive series of pictures (film); …
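A quick way to experiment with WSD is NLTK's implementation of the classic Lesk algorithm, sketched below (assumes NLTK with its WordNet and punkt data; Lesk is a simple dictionary-overlap baseline, so the sense it picks is not always the intuitive one):

import nltk
from nltk.wsd import lesk

nltk.download('wordnet', quiet=True)
nltk.download('punkt', quiet=True)

sentence = "She heard a loud shot from the bank during the time of the robbery."
tokens = nltk.word_tokenize(sentence)

for ambiguous in ['shot', 'bank']:
    sense = lesk(tokens, ambiguous)  # picks the WordNet synset whose gloss overlaps the context most
    print(ambiguous, '->', sense, '|', sense.definition() if sense else 'no sense found')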


Semantic Analysis — Named Entity Recognition


● Named Entity Recognition (NER)
■ Identification of named entities: terms that represent real-world objects

■ Examples: persons, locations, organizations, time, money, etc.

Example: “Chris booked a Singapore Airlines flight to Germany for S$1,200.”
Chris → PERSON, Singapore Airlines → ORGANIZATION, Germany → LOCATION, S$1,200 → MONEY
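A hedged sketch of NER with spaCy's small English model (assumes the en_core_web_sm model has been downloaded; the label names and exact spans depend on the model, so its output may differ slightly from the slide above):

# assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Chris booked a Singapore Airlines flight to Germany for $1,200.")

for ent in doc.ents:
    print(ent.text, '->', ent.label_)  # e.g., PERSON, ORG, GPE, MONEY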


Semantic Analysis — Semantic Role Labeling


● Semantic Role Labeling (SRL)
■ Identification of the semantic roles of these words or phrases in sentences

■ Express semantic roles as predicate-argument structures

Example: “The teacher sent the class the assignment last week.”
Who (the teacher) did What (sent) to Whom (the class), What exactly (the assignment), When (last week)


Phase IV: Discourse integration


● Discourse integration is the fourth phase in NLP, and simply means
contextualization.
● Discourse integration is the analysis and identification of the larger
context for any smaller part of natural language structure (e.g. a
phrase, word or sentence).
● During this phase, it’s important to ensure that each phrase, word, and
entity mentioned are mentioned within the appropriate context.
● This analysis involves considering not only sentence structure and
semantics, but also sentence combination and meaning of the text
as a whole.
● Otherwise, when analyzing the structure of text, sentences are broken
up and analyzed and also considered in the context of the sentences
that precede and follow them, and the impact that they have on the
structure of text.


● Some common tasks in this phase include: information extraction,


conversation analysis, text summarization, discourse analysis.
● Here are some complexities of natural language understanding
introduced during this phase:
● Understanding of the expressed motivations within the text, and its
underlying meaning.
● Understanding of the relationships between entities and topics
mentioned, thematic understanding, and interactions analysis.
● Understanding the social and historical context of entities mentioned.
● Discourse integration and analysis can be used in SEO to ensure that
appropriate tense is used, that the relationships expressed in the text
make logical sense, and that there is overall coherency in the text
analysed.


● This can be especially useful for programmatic SEO initiatives or text


generation at scale. The analysis can also be used as part of
international SEO localization, translation, or transcription tasks on
big corpuses of data.
● There are some research efforts to incorporate discourse analysis into
systems that detect hate speech (or in the SEO space for things like
content and comment moderation), with this technology being aimed
at uncovering intention behind text by aligning the expression with
meaning, derived from other texts.
● This means that, theoretically, discourse analysis can also be used for
modeling of user intent (e.g search intent or purchase intent) and
detection of such notions in texts.


Discourse Analysis — Coreference Resolution


● Coreference Resolution
■ Identification of expressions that refer to the same entity in a text

■ Entities can be referred to by named entities, noun phrases, pronouns, etc.

Mr Smith didn't see the car. Then it hit him.

Mr Smith didn't see the car. Then the car hit Mr Smith.


● Coreference resolution (CR) is the task of finding all linguistic expressions (called
mentions) in a given text that refer to the same real-world entity. After finding and
grouping these mentions we can resolve them by replacing, as stated above, pronouns
with noun phrases.
● Coreference resolution is an exceptionally versatile tool and can be applied to a variety
of NLP tasks such as text understanding, information extraction, machine translation,
sentiment analysis, or document summarization. It is a great way to obtain
unambiguous sentences which can be much more easily understood by computers.
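Off-the-shelf coreference models exist (for example as spaCy extensions), but their APIs change frequently, so the toy sketch below only illustrates the resolution step itself: given mention clusters that some model has already found, pronouns are replaced by the cluster's main noun phrase. The clusters here are supplied by hand, purely for illustration.

# Toy illustration of the "replace pronouns with their antecedent" step of
# coreference resolution. The clusters are hand-written, not model output.
def resolve(tokens, clusters):
    """clusters: {representative_mention: [token indices of the other mentions]}"""
    resolved = list(tokens)
    for head, mention_positions in clusters.items():
        for i in mention_positions:
            resolved[i] = head
    return ' '.join(resolved)

tokens = "Mr Smith did n't see the car . Then it hit him .".split()
clusters = {
    'the car':  [9],    # "it"  refers to "the car"
    'Mr Smith': [11],   # "him" refers to "Mr Smith"
}
print(resolve(tokens, clusters))
# -> "Mr Smith did n't see the car . Then the car hit Mr Smith ."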


Discourse Analysis — Ellipsis Resolution


● Ellipsis Resolution
■ Inference of ellipses using the surrounding context

■ Ellipsis: omission of a word or phrases in sentence

He studied at NUS, his brother at NTU.

He studied at NUS, his brother studied at NTU.

She's very funny. Her sister is not.

She's very funny. Her sister is not very funny.


Phase V: Pragmatic analysis

● Pragmatic analysis is the fifth and final phase of natural language


processing.
● As the final stage, pragmatic analysis extrapolates and incorporates
the learnings from all other, preceding phases of NLP.
● Pragmatic analysis involves the process of abstracting or extracting
meaning from the use of language, and translating a text, using
the gathered knowledge from all other NLP steps performed
beforehand.
● Here are some complexities that are introduced during this phase
● Information extraction, enabling an advanced text understanding
functions such as question-answering.
● Meaning extraction, which allows for programs to break down
definitions or documentation into a more accessible language.


● Understanding of the meaning of the words, and context, in which


they are used, which enables conversational functions between
machine and human (e.g. chatbots).
● Pragmatic analysis has multiple applications in SEO.
● One of the most straightforward ones is programmatic SEO and
automated content generation.
● This type of analysis can also be used for generating FAQ sections
on your product, using textual analysis of product documentation,
or even capitalizing on the ‘People Also Ask’ featured snippets by
adding an automatically-generated FAQ section for each page
you produce on your site.


Pragmatic Analysis — Textual Entailment


● Textual Entailment
■ Determining the inference relation between two short, ordered texts

■ Given a text t and hypothesis h, "t entails h" (t ⇒ h)


➜ someone reading t would infer that h is most likely true

t: A mixed choir is performing at the National Day parade.


t⇒h
h: The anthem is sung by a group of men and women.

Required world knowledge:


● Mixed choir: male and female members
● Singing a song is a performance
● "anthem" typically refers to "national anthem"

Pragmatic Analysis — Intent Recognition


● Intent Recognition
■ Classification of an utterance based on what the speaker/writer is trying to achieve

■ Core component of sophisticated chatbots

"I'm hungry!"
Intent: Action:
Additional context: Writer is looking Search for vegetarian restaurants in
➜ ➜
● The writer is vegetarian
for a place to eat and around VivoCity that are open.
● The writer is near VivoCity
● It's 1pm: lunch time
● ...
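Real systems use trained classifiers, but a toy keyword-based intent recognizer (all names and rules below are made up for illustration) shows the basic idea of mapping an utterance plus context to an intent and an action:

# Toy rule-based intent recognition; a real chatbot would use a trained classifier.
def recognize_intent(utterance):
    text = utterance.lower()
    if any(word in text for word in ('hungry', 'eat', 'lunch', 'dinner')):
        return 'find_food'
    if any(word in text for word in ('ticket', 'movie', 'playing')):
        return 'find_movie'
    return 'unknown'

def act(intent, context):
    # Context (dietary preference, location, time) narrows the final action.
    if intent == 'find_food':
        return f"search {context['diet']} restaurants open near {context['location']}"
    return 'ask a clarifying question'

context = {'diet': 'vegetarian', 'location': 'VivoCity', 'time': '1pm'}
intent = recognize_intent("I'm hungry!")
print(intent, '->', act(intent, context))  # find_food -> search vegetarian restaurants ...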


LANGUAGE SYNTAX AND STRUCTURE

● For any language, syntax and structure usually go hand in hand, where a set of specific rules, conventions, and principles govern the way words are combined into phrases, phrases get combined into clauses, and clauses get combined into sentences.
● In English, words usually combine together to form other
constituent units.
● These constituents include words, phrases, clauses, and
sentences.


● Consider the sentence “The brown fox is quick and he is jumping over the lazy dog”. It is made up of a bunch of words, and just looking at the words by themselves doesn't tell us much.

Fig. A bunch of unordered words don’t convey much information


● Knowledge about the structure and syntax of the language


is helpful in many areas like text processing, annotation,
and parsing for further operations such as text classification
or summarization.
● Typical parsing techniques for understanding text syntax
are mentioned below.
○ Parts of Speech (POS) Tagging
○ Shallow Parsing or Chunking
○ Constituency Parsing
○ Dependency Parsing


Considering the previous example sentence “The brown fox is quick and he
is jumping over the lazy dog”, if we were to annotate it using basic POS tags,
it would look like the following figure.

Thus, a sentence typically follows a hierarchical structure consisting of the following components:

sentence → clauses → phrases → words


Tagging Parts of Speech


Parts of speech (POS) are specific lexical categories to which words are
assigned, based on their syntactic context and role. Usually, words can fall into
one of the following major categories.
● N(oun): This usually denotes words that depict some object or entity,
which may be living or nonliving. Some examples would be fox , dog ,
book , and so on. The POS tag symbol for nouns is N.
● V(erb): Verbs are words that are used to describe certain actions, states,
or occurrences. There are a wide variety of further subcategories, such as
auxiliary, reflexive, and transitive verbs (and many more). Some typical
examples of verbs would be running , jumping , read , and write . The
POS tag symbol for verbs is V.
● Adj(ective): Adjectives are words used to describe or qualify other
words, typically nouns and noun phrases. The phrase beautiful flower
has the noun (N) flower which is described or qualified using the
adjective (ADJ) beautiful . The POS tag symbol for adjectives is ADJ .

● Adv(erb): Adverbs usually act as modifiers for other words


including nouns, adjectives, verbs, or other adverbs. The phrase
very beautiful flower has the adverb (ADV) very , which
modifies the adjective (ADJ) beautiful , indicating the degree to
which the flower is beautiful. The POS tag symbol for adverbs
is ADV.
● Besides these four major categories of parts of speech, there are other categories that occur frequently in the English language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others.
● Each POS tag like the noun (N) can be further subdivided into categories like singular nouns (NN), singular proper nouns (NNP), and plural nouns (NNS).


● The process of classifying and labeling POS tags for words is called parts-of-speech tagging or POS tagging.
● POS tags are used to annotate words and depict their POS,
which is really helpful to perform specific analysis, such as
narrowing down upon nouns and seeing which ones are the
most prominent, word sense disambiguation, and grammar
analysis.


1. Noun (NN): A word that represents a person, place, thing, or idea.


Examples: “cat,” “house,” “love.”
2. Verb (VB): A word that expresses an action or state of being.
Examples: “run,” “eat,” “is.”
3. Adjective (JJ): A word that describes or modifies a noun.
Examples: “red,” “happy,” “tall.”
4. Adverb (RB): A word that modifies a verb, adjective, or other adverb, often indicating manner,
time, place, degree, etc.
Examples: “quickly,” “very,” “here.”
5. Pronoun (PRP): A word that substitutes for a noun or noun phrase.
Examples: “he,” “she,” “they.”
6. Preposition (IN): A word that shows the relationship between a noun (or pronoun) and other
words in a sentence.
Examples: “in,” “on,” “at.”
7. Conjunction (CC): A word that connects words, phrases, or clauses.
Examples: “and,” “but,” “or.”
8. Interjection (UH): A word or phrase that expresses emotion or exclamation.
Examples: “wow,” “ouch,” “hey.”


Types of POS Tagging in NLP

● Rule-Based POS Tagging


● This method uses a set of predefined rules to assign POS tags
to words based on their context and surrounding words.
● Rule-based POS Tagging is like having a set of instructions
to decide which category each word in a sentence belongs to.
Imagine you have a rule that says, “If a word ends in ‘ing,’
it’s probably a verb.” So, when you see a word like
“running,” you automatically know it’s a verb because it ends
in “ing.”
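NLTK ships a simple rule-based tagger, RegexpTagger, that works exactly like the "ends in -ing" rule described above; the patterns below are a small illustrative subset, not a complete grammar:

# Rule-based POS tagging with NLTK's RegexpTagger (patterns are illustrative only).
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),       # gerunds: "running", "jumping"
    (r'.*ed$',  'VBD'),       # simple past: "walked"
    (r'.*ly$',  'RB'),        # adverbs: "slowly"
    (r'^(the|a|an)$', 'DT'),  # determiners
    (r'.*',     'NN'),        # default: tag everything else as a noun
]

tagger = RegexpTagger(patterns)
print(tagger.tag("the cat is running quickly".split()))
# e.g., [('the', 'DT'), ('cat', 'NN'), ('is', 'NN'), ('running', 'VBG'), ('quickly', 'RB')]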


● Transformation-Based Part-of-Speech
● Transformation-Based Part-of-Speech (POS) Tagging
is a method where you change the labels (tags) of
words in a sentence based on specific rules about their
context.
● These rules help adjust the tags to better fit the
grammar of the sentence, improving accuracy in
identifying whether a word is a noun, verb, adjective,
etc.


Sentence: “The cat chased the mouse.”

First, let’s assign initial tags to each word based on what we


know:

● “The” — Determiner (DET)


● “cat” — Noun (N)
● “chased” — Verb (V)
● “the” — Determiner (DET)
● “mouse” — Noun (N)


● Now, let’s apply the transformation rule


● Change the tag of a verb to a noun if it follows a determiner
like “the.”
● We see that “chased” follows “the,” so we change its tag
from Verb (V) to Noun (N).
● Updated tags:
○ “The” — Determiner (DET)
○ “cat” — Noun (N)
○ “chased” — Noun (N)
○ “the” — Determiner (DET)
○ “mouse” — Noun (N)


● Statistical POS Tagging


● Statistical POS tagging is another approach to
automatically assigning parts-of-speech (POS) tags
to words in a sentence.
● Unlike transformation-based tagging which relies
on rules, statistical tagging uses the power of
statistics and machine learning.


● Here’s how statistical POS tagging works:


● 1. Training the Model:
● First, we train a statistical model on a large corpus of labeled text
data. This corpus contains sentences where each word is tagged with
its correct POS tag. The model learns patterns and relationships
between words and their corresponding POS tags from this training
data.
● 2. Tagging Words:
● Once the model is trained, we use it to predict the POS tags for each
word in new, unseen sentences. The model analyzes the context of
each word in the sentence and predicts the most likely POS tag based
on the patterns it learned during training.


● 3. Probability-based Approach:
● Statistical POS tagging works on a probability-based
approach. For each word in the sentence, the model assigns a
probability distribution over all possible POS tags. The tag
with the highest probability is chosen as the predicted POS
tag for that word.
● 4. Evaluation and Refinement:
● After tagging the words in a sentence, we evaluate the
accuracy of the model’s predictions by comparing them to
the actual POS tags. If there are errors, we can refine the
model by retraining it on more labeled data or adjusting its
parameters.
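As a small concrete example of the train-then-tag workflow described above, the sketch below trains NLTK's frequency-based UnigramTagger on the tagged Penn Treebank sample that ships with NLTK; it is a much simpler model than modern statistical taggers, but it follows the same steps (train on labeled data, tag unseen text, evaluate):

# Statistical POS tagging sketch: train a unigram (frequency-based) tagger on
# labeled data, then tag new text and evaluate on held-out sentences.
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

nltk.download('treebank', quiet=True)

tagged_sents = treebank.tagged_sents()                 # gold-standard training data
train, test = tagged_sents[:3000], tagged_sents[3000:]

tagger = UnigramTagger(train)                          # learns the most frequent tag per word
print(tagger.tag('The cat chased the mouse'.split()))  # unseen words get tagged None
print('held-out accuracy:', tagger.accuracy(test))     # older NLTK versions call this .evaluate()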

● Statistical POS tagging uses statistical models to predict


the most likely POS tags for words in a sentence based on
patterns learned from training data.
● It’s a powerful and widely used technique in NLP for
various tasks like text analysis, machine translation, and
information retrieval.


● Let us consider both nltk and spacy which usually use


the Penn Treebank notation for POS tagging.
● NLTK and spaCy are two of the most popular Natural
Language Processing (NLP) tools available in Python.
● You can build chatbots, automatic summarizers, and
entity extraction engines with either of these libraries.
● While both can theoretically accomplish any NLP task,
each one excels in certain scenarios.
● The Penn Treebank, or PTB for short, is a dataset
maintained by the University of Pennsylvania.


import nltk
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')

# create a basic pre-processed corpus; don't lowercase, to keep POS context
# (normalize_corpus and news_df are assumed to be defined earlier)
corpus = normalize_corpus(news_df['full_text'], text_lower_case=False,
                          text_lemmatization=False, special_char_removal=False)

# demo of POS tagging for a sample news headline
sentence = str(news_df.iloc[1].news_headline)
sentence_nlp = nlp(sentence)

# POS tagging with spaCy
spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in sentence_nlp]
print(pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag', 'Tag type']))

# POS tagging with NLTK
nltk_pos_tagged = nltk.pos_tag(sentence.split())
print(pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag']))


● We can see that each of these libraries treats tokens in its own way and assigns specific tags to them. Based on what we see, spaCy seems to be doing slightly better than NLTK.


PARSING

● The word ‘parsing’, whose origin is the Latin word ‘pars’ (which means ‘part’), is used to draw the exact or dictionary meaning from the text.
● It is also called syntactic analysis or syntax analysis. Comparing the text against the rules of formal grammar, syntax analysis checks it for meaningfulness.
● A sentence like “Give me hot ice-cream”, for example, would be rejected by the parser or syntactic analyzer.
● Parsing may be defined as the process of analyzing strings of symbols in natural language conforming to the rules of formal grammar.


● In a typical flow, input text goes into a lexical analyzer that


produces individual tokens.
● These tokens are the input to a parser, which produces the
syntactic structure at the output.
● When this structure is graphically represented as a tree, it's
called a Parse Tree.
● A parse tree can be simplified into an intermediate representation
called Abstract Syntax Tree (AST).


● The parser is used to report any syntax errors.
● It helps to recover from commonly occurring errors so that the processing of the remainder of the program can be continued.
● A parse tree is created with the help of a parser.
● The parser is used to create the symbol table, which plays an important role in NLP.
● The parser is also used to produce intermediate representations (IR).


Shallow Parsing or Chunking

Based on the hierarchy we depicted earlier, groups of words make up


phrases. There are five major categories of phrases:
● Noun phrase (NP): These are phrases where a noun acts as the head
word. Noun phrases act as a subject or object to a verb.
● Verb phrase (VP): These phrases are lexical units that have a verb
acting as the head word. Usually, there are two forms of verb
phrases. One form has the verb components as well as other entities
such as nouns, adjectives, or adverbs as parts of the object.
● Adjective phrase (ADJP): These are phrases with an adjective as
the head word. Their main role is to describe or qualify nouns and
pronouns in a sentence, and they will be either placed before or after
the noun or pronoun.


● Adverb phrase (ADVP): These phrases act like adverbs since the
adverb acts as the head word in the phrase. Adverb phrases are used
as modifiers for nouns, verbs, or adverbs themselves by providing
further details that describe or qualify them.
● Prepositional phrase (PP): These phrases usually contain a
preposition as the head word and other lexical components like
nouns, pronouns, and so on. These act like an adjective or adverb
describing other words or phrases.
Shallow parsing, also known as light parsing or chunking, is a popular
natural language processing technique of analyzing the structure of a
sentence to break it down into its smallest constituents (which are tokens
such as words) and group them together into higher-level phrases. This
includes POS tags and phrases from a sentence.

● Unlike full parsing, which involves analyzing the


grammatical structure of a sentence, shallow parsing
focuses on identifying individual phrases or
constituents, such as noun phrases, verb phrases, and
prepositional phrases.
● Shallow parsing is an essential component of many
NLP tasks, including information extraction, text
classification, and sentiment analysis.
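A common lightweight way to do this in practice is NLTK's regular-expression chunker, sketched below: POS-tag the sentence, then group tag patterns into noun-phrase chunks (the grammar here is deliberately tiny, and the tagger data names are those used by current NLTK releases):

# Shallow parsing (chunking) sketch with NLTK's RegexpParser.
import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP chunk = optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN.*>}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(tagged)  # a shallow tree: NP chunks plus the remaining tagged words
print(tree)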


● Parsing is the process of finding a parse tree that is


consistent with the grammar rules – in other words,
we want to find the set of grammar rules and their
sequence that generated the sentence.
● A parse tree not only gives us the POS tags, but also
which set of words are related to form phrases and
also the relationship between these phrases.


Fig. An example of shallow parsing depicting higher level phrase annotations


● To effectively employ NLP in practical settings, one needs to


possess extensive knowledge of numerous ideas and
terminologies.
● Linguistic analytic approaches like dependency parsing and
syntactic parsing are used in natural language processing.
● Dependency parsing aims to highlight the interdependencies
between words in a phrase in order to show the grammatical
links between them.
● By creating a tree structure that illustrates these dependencies,
it aids in the comprehension of sentence structure.


● In a broader sense, syntactic parsing aims to disclose the basic


syntactic structure of a sentence, including components, phrase
borders, and grammatical rules.
● In order to support a variety of language processing tasks,
including statistical language modelling, part-of-speech (POS)
tagging, syntactic and semantic analysis, sentiment evaluation,
normalization, tokenization, and more, it is imperative that
dependency parsing and constituency parsing be used to extract
meaning and insights from text.
● When combined, these techniques help to process textual data
comprehensively, opening up a variety of language processing
applications


Constituency Parsing

● Constituent-based grammars are used to analyze and


determine the constituents of a sentence.
● These grammars can be used to model or represent the
internal structure of sentences in terms of a hierarchically
ordered structure of their constituents.
● Each and every word usually belongs to a specific lexical category in this case and forms the head word of different phrases.
● These phrases are formed based on rules called phrase
structure rules.


● Phrase structure rules form the core of constituency


grammars, because they talk about syntax and rules that
govern the hierarchy and ordering of the various
constituents in the sentences.
● These rules cater to two things primarily.
● They determine what words are used to construct the
phrases or constituents.
● They determine how we need to order these constituents
together.
● The generic representation of a phrase structure rule is S →
AB , which depicts that the structure S consists of
constituents A and B , and the ordering is A followed by B
.

● While there are several rules the most important rule describes
how to divide a sentence or a clause.
● The phrase structure rule denotes a binary division for a
sentence or a clause as S → NP VP where S is the sentence or
clause, and it is divided into the subject, denoted by the noun
phrase (NP) and the predicate, denoted by the verb phrase (VP)
● A constituency parser can be built based on such
grammars/rules, which are usually collectively available as
context-free grammar (CFG) or phrase-structure grammar. The
parser will process input sentences according to these rules, and
help in building a parse tree.


Fig. An example of constituency parsing showing a nested hierarchical structure


● Constituency parsing is an important concept in Natural Language


Processing that involves analysing the structure of a sentence
grammatically by identifying the constituents or phrases in the
sentence and their hierarchical relationships.
● As we know that understanding natural language is a very complex
task as we have to deal with the ambiguity of the natural language
in order to properly understand the natural language.


Working of Constituency Parsing:

For understanding natural language the key is to understand


the grammatical pattern of the sentences involved.
The first step in understanding grammar is to segregate a
sentence into groups of words or tokens called
constituents based on their grammatical role in the
sentence.
Let’s understand this process with an example sentence:
“The lion ate the deer.”
Here, “The lion” represents a noun phrase, “ate” represents
a verb phrase, and “the deer” is another noun phrase.


Context-Free Grammar (CFG):


The most common technique used in constituency
parsing is Context-Free Grammar or CFG.
CFG works by organizing sentences into constituencies
based on a set of grammar rules (or productions).
These rules specify how individual words in a sentence
can be grouped to form constituents such as noun
phrases, verb phrases, preposition phrases, etc.
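The sketch below spells out a tiny hand-written CFG for the example sentence "The lion ate the deer" and parses it with NLTK's chart parser; the grammar covers only this one sentence and is purely illustrative:

# Constituency parsing sketch: a toy context-free grammar plus NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> DT N
    VP -> V NP
    DT -> 'the'
    N  -> 'lion' | 'deer'
    V  -> 'ate'
""")

parser = nltk.ChartParser(grammar)
sentence = 'the lion ate the deer'.split()

for tree in parser.parse(sentence):  # yields every parse tree licensed by the grammar
    tree.pretty_print()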


Dependency Parsing

● In dependency parsing, we try to use dependency-based grammars to


analyze and infer both structure and semantic dependencies and
relationships between tokens in a sentence.
● The basic principle behind a dependency grammar is that in any
sentence in the language, all words except one, have some
relationship or dependency on other words in the sentence.
● The word that has no dependency is called the root of the sentence.
● The verb is taken as the root of the sentence in most cases.
● All the other words are directly or indirectly linked to the root verb
using links, which are the dependencies.
● Considering the sentence “The brown fox is quick and he is jumping
over the lazy dog”, if we wanted to draw the dependency syntax tree
for this, we would have the structure


Fig. A dependency parse tree for a sentence


These dependency relationships each have their own meaning and are a part of a list of
universal dependency types.


● Some of the dependencies are as follows:


● The dependency tag det is pretty intuitive— it denotes the
determiner relationship between a nominal head and the
determiner. Usually, the word with POS tag DET will also have
the det dependency tag relation. Examples include fox → the
and dog → the.
● The dependency tag amod stands for adjectival modifier and
stands for any adjective that modifies the meaning of a noun.
Examples include fox → brown and dog → lazy.
● The dependency tag nsubj stands for an entity that acts as a
subject or agent in a clause. Examples include is → fox and
jumping → he.

Excellence and Service


116
CHRIST
Deemed to be University

● The dependencies cc and conj have more to do with linkages related to


words connected by coordinating conjunctions. Examples include is →
and and is → jumping.
● The dependency tag aux indicates the auxiliary or secondary verb in the
clause. Example: jumping → is.
● The dependency tag acomp stands for adjective complement and acts as
the complement or object to a verb in the sentence. Example: is → quick
● The dependency tag prep denotes a prepositional modifier, which
usually modifies the meaning of a noun, verb, adjective, or preposition.
Usually, this representation is used for prepositions having a noun or
noun phrase complement. Example: jumping → over.
● The dependency tag pobj is used to denote the object of a preposition.
This is usually the head of a noun phrase following a preposition in the
sentence. Example: over → dog.

Excellence and Service


117
CHRIST
Deemed to be University

● Dependency parsing, or DP, is the process of examining the relationships
between a sentence’s components to determine its grammatical structure.
It divides a sentence into parts based on those relationships.
The foundation of the approach is the notion that every linguistic unit in a
sentence is connected to the others by direct links. We refer to these
links as dependencies.

● Consider the phrase, “I prefer the morning flight through Denver.”

● The dependency structure of the statement is explained in the graphic

below:

Excellence and Service


118
CHRIST
Deemed to be University

Excellence and Service


119
CHRIST
Deemed to be University

Dependency Parsing (DP) is a modern parsing mechanism whose main concept is that each
linguistic unit (i.e., word) is related to the others by direct links. These direct links are called
‘dependencies’ in linguistics. For example, the following diagram shows the dependency grammar
for the sentence “John can hit the ball”.
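
A minimal sketch of dependency parsing with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sentence is the one from the example above.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John can hit the ball")

# Each token is linked to its head through a dependency label
for token in doc:
    print(token.text, "--", token.dep_, "-->", token.head.text)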

Excellence and Service


120
CHRIST
Deemed to be University

TEXT PREPROCESSING OR WRANGLING

● Text preprocessing or wrangling is a method to clean the text


data and make it ready to feed data to the model.
● Text data contains noise in various forms, such as emoticons,
punctuation, and text in different cases.
● In human language there are many different ways to say the same
thing, and this is the main problem we have to deal with: machines
do not understand words, they need numbers, so we need to
convert text to numbers in an efficient manner.
● Techniques to perform text preprocessing or wrangling are as
follows:

Excellence and Service


121
CHRIST
Deemed to be University

● Tokenization: Tokenization is the process of separating a piece of


text into smaller units called tokens. Given a document, tokens can
be sentences, words, subwords, or even characters depending on
the application.
● Noise cleaning: Special characters and symbols contribute to extra
noise in unstructured text. Using regular expressions to remove
them or using tokenizers, which do the pre-processing step of
removing punctuation marks and other special characters, is
recommended.
● Spell-checking: Documents in a corpus are prone to spelling
errors; In order to make the text clean for the subsequent
processing, it is a good practice to run a spell checker and fix the
spelling errors before moving on to the next steps.

Excellence and Service


122
CHRIST
Deemed to be University

● Stopwords Removal: Stop words are those words which are very
common and often less significant. Hence, removing these is a pre-
processing step as well.
● This can be done explicitly by retaining only those words in the
document which are not in the list of stop words or by specifying
the stop word list as an argument in CountVectorizer or
TfidfVectorizer methods when getting Bag-of-Words (BoW)/TF-IDF
scores for the corpus of text documents.
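
A minimal sketch of explicit stop-word removal with NLTK; the sample sentence is an assumption. The same effect can be obtained by passing stop_words='english' to CountVectorizer or TfidfVectorizer.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example sentence showing off stop word removal."
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the stop word list
tokens = word_tokenize(text.lower())
filtered = [w for w in tokens if w not in stop_words]
print(filtered)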

Excellence and Service


123
CHRIST
Deemed to be University

● Stemming/Lemmatization: Both stemming and lemmatization are


methods to reduce words to their base form. While stemming
follows certain rules to truncate the words to their base form, often
resulting in words that are not lexicographically correct,
lemmatization always results in base forms that are
lexicographically correct.
● However, stemming is a lot faster than lemmatization. Hence, the
choice between stemming and lemmatization depends on whether the
application needs quick pre-processing or more accurate base forms.

Excellence and Service


124
CHRIST
Deemed to be University

TOKENIZATION
● Tokenization is a common task in Natural Language
Processing (NLP).
● It’s a fundamental step in both traditional NLP methods like
Count Vectorizer and Advanced Deep Learning-based
architectures like Transformers.
● As tokens are the building blocks of Natural Language, the
most common way of processing the raw text happens at the
token level.
● Tokens are the building blocks of Natural Language.
● Tokenization is a way of separating a piece of text into smaller
units called tokens. Here, tokens can be either words,
characters, or subwords.

Excellence and Service


125
CHRIST
Deemed to be University

● Hence, tokenization can be broadly classified into 3


types – word, character, and subword (n-gram
characters) tokenization.
● For example, consider the sentence: “Never give up”.
● The most common way of forming tokens is based on
space.
● Assuming space as a delimiter, the tokenization of the
sentence results in 3 tokens – Never-give-up.
● As each token is a word, it becomes an example of
Word tokenization
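
A minimal sketch of word and sentence tokenization with NLTK; the sample text is an assumption.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')   # tokenizer models (one-time download)

text = "Never give up. Tokens are the building blocks of Natural Language."

print(sent_tokenize(text))   # sentence-level tokens
print(word_tokenize(text))   # word-level tokens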

Excellence and Service


126
CHRIST
Deemed to be University

● Similarly, tokens can be either characters or sub-words.
● For example, let us consider “smarter”:
● 1. Character tokens: s-m-a-r-t-e-r
● 2. Sub-word tokens: smart-er
● Here, Tokenization is performed on the corpus to
obtain tokens. The following tokens are then used to
prepare a vocabulary.

Excellence and Service


127
CHRIST
Deemed to be University

● Vocabulary refers to the set of unique tokens in the corpus.


● Remember that vocabulary can be constructed by considering
each unique token in the corpus or by considering the top K
Frequently Occurring Words.
● Creating Vocabulary is the ultimate goal of Tokenization.

Excellence and Service


128
CHRIST
Deemed to be University

● One of the simplest hacks to boost the performance of the


NLP model is to create a vocabulary out of top K
frequently occurring words.
● Now, let’s understand the usage of the vocabulary in
Traditional and Advanced Deep Learning-based NLP
methods.
● Traditional NLP approaches such as Count Vectorizer and
TF-IDF use vocabulary as features. Each word in the
vocabulary is treated as a unique feature:

Excellence and Service


129
CHRIST
Deemed to be University

Traditional NLP: Count Vectorizer

Excellence and Service


130
CHRIST
Deemed to be University

● In Advanced Deep Learning-based NLP architectures, vocabulary is


used to create the tokenized input sentences. Finally, the tokens of
these sentences are passed as inputs to the model
● As discussed earlier, tokenization can be performed on word,
character, or subword level. It’s a common question – which
Tokenization should we use while solving an NLP task? Let’s
address this question here.
● Word Tokenization
● Word Tokenization is the most commonly used tokenization
algorithm. It splits a piece of text into individual words based on a
certain delimiter. Depending upon delimiters, different word-level
tokens are formed. Pretrained word embeddings such as Word2Vec
and GloVe come under word tokenization.

Excellence and Service


131
CHRIST
Deemed to be University

REMOVING STOP-WORDS

● The words which are generally filtered out before processing a natural
language are called stop words. These are actually the most common
words in any language (like articles, prepositions, pronouns, conjunctions,
etc.) and do not add much information to the text. Examples of a few stop
words in English are “the”, “a”, “an”, “so”, “what”. Stop words are
available in abundance in any human language. By removing these words,
we remove the low-level information from our text in order to give more
focus to the important information. In other words, we can say that the
removal of such words does not have any negative consequences on the
model we train for our task.
● Removal of stop words definitely reduces the dataset size and thus reduces
the training time due to the fewer number of tokens involved in the
training.

Excellence and Service


132
CHRIST
Deemed to be University

● We do not always remove the stop words. The removal of stop words is
highly dependent on the task we are performing and the goal we want to
achieve. For example, if we are training a model that can perform the
sentiment analysis task, we might not remove the stop words.
● Movie review: “The movie was not good at all.”
● Text after removal of stop words: “movie good”
● We can clearly see that the review for the movie was negative. However,
after the removal of stop words, the review became positive, which is not
the reality. Thus, the removal of stop words can be problematic here.
● Tasks like text classification do not generally need stop words as the other
words present in the dataset are more important and give the general idea
of the text. So, we generally remove stop words in such tasks.

Excellence and Service


133
CHRIST
Deemed to be University

● In a nutshell, NLP has a lot of tasks that cannot be accomplished properly after the
removal of stop words. So, think before performing this step. The catch here is that
no rule is universal and no stop words list is universal. A list not conveying any
important information to one task can convey a lot of information to the other task.
● Word of caution: Before removing stop words, research a bit about your task and the
problem you are trying to solve, and then make your decision

Excellence and Service


134
CHRIST
Deemed to be University

Next comes a very important question: why should we remove
stop words from the text? There are two main reasons:
1. They provide no meaningful information, especially if
we are building a text classification model. Therefore, we have
to remove stop words from our dataset.
2. As the frequency of stop words is very high, removing
them from the corpus results in much smaller data in terms
of size. The reduced size results in faster computations on text
data, and the text classification model has to deal with a
smaller number of features, resulting in a more robust model.

Excellence and Service


135
CHRIST
Deemed to be University

STEMMING
● Stemming is the process of reducing the morphological variants
of a word to a common root/base form.
● Stemming programs are commonly referred to as stemming
algorithms or stemmers.
● A stemming algorithm reduces the words “chocolates”,
“chocolatey”, “choco” to the root word, “chocolate” and
“retrieval”, “retrieved”, “retrieves” reduce to the stem
“retrieve”.
● Stemming is an important part of the pipelining process in
Natural language processing.
● The input to the stemmer is tokenized words. How do we get
these tokenized words? Well, tokenization involves breaking
down the document into different words.

Excellence and Service


136
CHRIST
Deemed to be University

● Stemming is a natural language processing technique that is


used to reduce words to their base form, also known as the
root form.
● The process of stemming is used to normalize text and make it
easier to process. It is an important step in text pre-
processing, and it is commonly used in information retrieval
and text mining applications.
● There are several different algorithms for stemming as
follows:
● · Porter stemmer
● · Snowball stemmer
● · Lancaster stemmer.
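
A small sketch comparing the three stemmers as implemented in NLTK; the word list is an assumption.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare how aggressively each algorithm truncates the same words
for word in ["chocolates", "retrieval", "running", "arguing"]:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))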

Excellence and Service


137
CHRIST
Deemed to be University

● The Porter stemmer is the most widely used algorithm, and it is based on a
set of heuristics that are used to remove common suffixes from words.
● The Snowball stemmer is a more advanced algorithm that is based on the
Porter stemmer, but it also supports several other languages in addition to
English. The Lancaster stemmer is a more aggressive stemmer and it is
less accurate than the Porter stemmer and Snowball stemmer.
● Stemming can be useful for several natural language processing tasks such
as text classification, information retrieval, and text summarization.
● However, stemming can also have some negative effects such as reducing
the readability of the text, and it may not always produce the correct root
form of a word. It is important to note that stemming is different from
Lemmatization.

Excellence and Service


138
CHRIST
Deemed to be University

● Lemmatization is the process of reducing a word to its base form, but unlike
stemming, it takes into account the context of the word, and it produces a valid
word, unlike stemming which can produce a non-word as the root form.
● Errors in Stemming:
● There are mainly two errors in stemming –
● · over-stemming
● · under-stemming
● Over-stemming occurs when two words that should have different stems are
reduced to the same root. Over-stemming can also be regarded as a false positive. Over-
stemming is a problem that can occur when using stemming algorithms in natural
language processing. It refers to the situation where a stemmer produces a root
form that is not a valid word or is not the correct root form of a word. This can
happen when the stemmer is too aggressive in removing suffixes or when it does
not consider the context of the word.

Excellence and Service


139
CHRIST
Deemed to be University

Over-stemming can lead to a loss of meaning and make the text less
readable. For example, the word “arguing” may be stemmed to “argu,” which
is not a valid word and does not convey the same meaning as the original
word. Similarly, the word “running” may be stemmed to “run,” which is the
base form of the word but it does not convey the meaning of the original
word.
To avoid over-stemming, it is important to use a stemmer that is appropriate
for the task and language. It is also important to test the stemmer on a sample
of text to ensure that it is producing valid root forms. In some cases, using a
lemmatizer instead of a stemmer may be a better solution as it takes into
account the context of the word, making it less prone to errors. Another
approach to this problem is to use techniques like semantic role labeling,
sentiment analysis, context-based information, etc. that help to understand
the context of the text and make the stemming process more precise.
Excellence and Service
140
CHRIST
Deemed to be University

Under-stemming occurs when two words that should be reduced to the same
stem are not. Under-stemming can be interpreted as a false
negative. Under-stemming is a problem that can occur when using stemming
algorithms in natural language processing. It refers to the situation where a
stemmer does not produce the correct root form of a word or does not reduce
a word to its base form. This can happen when the stemmer is not aggressive
enough in removing suffixes or when it is not designed for the specific task
or language.
Under-stemming can lead to a loss of information and make it more difficult
to analyze text. For example, the word “arguing” and “argument” may be
stemmed to “argu,” which does not convey the meaning of the original
words. Similarly, the word “running” and “runner” may be stemmed to
“run,” which is the base form of the word but it does not convey the meaning
of the original words.
Excellence and Service
141
CHRIST
Deemed to be University

To avoid under-stemming, it is important to use a stemmer that is appropriate


for the task and language.
It is also important to test the stemmer on a sample of text to ensure that it is
producing the correct root forms. In some cases, using a lemmatizer instead of a stemmer may be a better
solution as it takes into account the context of the word, making it less prone
to errors.
Another approach to this problem is to use techniques like semantic role
labeling, sentiment analysis, context-based information, etc. that help to
understand the context of the text and make the stemming process more
precise.

Excellence and Service


142
CHRIST
Deemed to be University

Applications of stemming:

● Stemming is used in information retrieval systems like search engines.


● It is used to determine domain vocabularies in domain analysis.
● It is used when indexing documents for search and to map documents to common
subjects by stemming their terms. Sentiment analysis, which examines reviews and
comments made by different users about a product or service, is frequently used for
product analysis, for example by online retail stores; stemming is applied as a
text-preparation step before the text is interpreted.
● A method of group analysis used on textual materials is called document
clustering (also known as text clustering). Important uses of it include subject
extraction, automatic document structuring, and quick information retrieval.
● Fun Fact: Google search adopted a word stemming in 2003. Previously a search
for “fish” would not have returned “fishing” or “fishes”.

Excellence and Service


143
CHRIST
Deemed to be University

Porter’s Stemmer algorithm


It is one of the most popular stemming methods proposed in 1980. It is based on the idea
that the suffixes in the English language are made up of a combination of smaller and
simpler suffixes. This stemmer is known for its speed and simplicity. The main
applications of Porter Stemmer include data mining and Information retrieval. However, its
applications are only limited to English words. Also, the group of stems is mapped on to
the same stem and the output stem is not necessarily a meaningful word. The algorithms
are fairly lengthy in nature and are known to be the oldest stemmer.
Example: EED -> EE means “if the word has at least one vowel and consonant plus
EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
Advantage: It produces the best output as compared to other stemmers and it has less error
rate.
Limitation: Morphological variants produced are not always real words.

Excellence and Service


144
CHRIST
Deemed to be University

Excellence and Service


145
CHRIST
Deemed to be University

Example: Step 1

Excellence and Service


146
CHRIST
Deemed to be University

Example: Steps 2a and 2b

Excellence and Service


147
CHRIST
Deemed to be University

Example: Step 5

Excellence and Service


148
CHRIST
Deemed to be University

Example Outputs

Excellence and Service


149
CHRIST
Deemed to be University

Lovins Stemmer

It was proposed by Lovins in 1968. It removes the longest suffix from a word, and the word is then recoded to convert the stem into
a valid word.

Example: sitting -> sitt -> sit


Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth' etc. Limitation: It is time consuming and frequently fails to form words
from stem.

Dawson Stemmer
It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last letter.

Advantage: It is fast in execution and covers more suffixes.

Limitation: It is very complex to implement.

Krovetz Stemmer

It was proposed in 1993 by Robert Krovetz. Following are the steps:

1) Convert the plural form of a word to its singular form.

2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.

Example: ‘children’ -> ‘child’

Advantage: It is light in nature and can be used as pre-stemmer for other stemmers.

Limitation: It is inefficient in case of large documents.


Excellence and Service
150
CHRIST
Deemed to be University

Xerox Stemmer

Example:

‘children’ -> ‘child’ ‘understood’ ->


‘understand’ ‘whom’ -> ‘who’

‘best’ -> ‘good’

N-Gram Stemmer

An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-
grams in common.

Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on string comparisons and it is language dependent. Limitation: It requires space to create and index the n-grams and it is
not time efficient.

Excellence and Service


151
CHRIST
Deemed to be University

Snowball Stemmer:
When compared to the Porter Stemmer, the Snowball Stemmer can map non-English
words too. Since it supports other languages the Snowball Stemmers can be called a
multi-lingual stemmer. The Snowball stemmers are also imported from the nltk
package. This stemmer is based on a programming language called ‘Snowball’ that
processes small strings and is the most widely used stemmer. The Snowball stemmer
is way more aggressive than Porter Stemmer and is also referred to as Porter2
Stemmer. Because of the improvements added when compared to the Porter Stemmer,
the Snowball stemmer has greater computational speed.
Lancaster Stemmer:
The Lancaster stemmers are more aggressive and dynamic compared to the other two
stemmers. The stemmer is really fast, but the algorithm is really confusing when
dealing with small words. But they are not as efficient as Snowball Stemmers. The
Lancaster stemmers save the rules externally and basically uses an iterative algorithm.
Lancaster Stemmer is straightforward, although it often produces results with
excessive stemming. Over-stemming renders stems non-linguistic or meaningless.
Excellence and Service
152
CHRIST
Deemed to be University

LEMMATIZATION
● Lemmatization is a text pre-processing technique used in natural
language processing (NLP) models to break a word down to its
root meaning to identify similarities.
● For example, a lemmatization algorithm would reduce the word
better to its root word, or lemma: good.
● In stemming, a part of the word is just chopped off at the tail
end to arrive at the stem of the word.
● There are different algorithms used to find out how many
characters have to be chopped off, but the algorithms don’t
actually know the meaning of the word in the language it
belongs to.

Excellence and Service


153
CHRIST
Deemed to be University

● In lemmatization, the algorithms have this knowledge.


● These algorithms refer to a dictionary to understand the
meaning of the word before reducing it to its root word, or
lemma.
● So, a lemmatization algorithm would know that the word better
is derived from the word good, and hence, the lemma is good.
● But a stemming algorithm wouldn’t be able to do the same.
There could be over-stemming or under-stemming, and the
word better could be reduced to either bet, or bett, or just
retained as better.

Excellence and Service


154
CHRIST
Deemed to be University

Fig. Stemming vs Lemmatization

Excellence and Service


155
CHRIST
Deemed to be University

● Lemmatization gives more context to chatbot conversations as it


recognizes words based on their exact and contextual
meaning.
● On the other hand, lemmatization is a time-consuming and slow
process.
● The obvious advantage of lemmatization is that it is more
accurate than stemming. So, if you’re dealing with an NLP
application such as a chat bot or a virtual assistant, where
understanding the meaning of the dialogue is crucial,
lemmatization would be useful.

Excellence and Service


156
CHRIST
Deemed to be University

● The WordNetLemmatizer is a class that uses the WordNet


database to perform lemmatization.
● Lemmatization is the process of grouping together the
different inflected forms of a word so they can be analyzed as
a single item.
● The WordNetLemmatizer has a method called lemmatize that
takes a word and an optional part-of-speech (POS) tag as
arguments and returns the lemma of the word.
● The POS tag can be "n" for noun, "v" for verb, "a" for
adjective, "r" for adverb, or "s" for satellite adjective. If no POS
tag is given, the default is "n" (noun).

Excellence and Service


157
CHRIST
Deemed to be University

# Import the WordNetLemmatizer
import nltk
from nltk.stem import WordNetLemmatizer

# The WordNet corpus must be available (one-time download)
nltk.download('wordnet')

# Create an instance of the WordNetLemmatizer
wnl = WordNetLemmatizer()

# Lemmatize some words with different POS tags
print(wnl.lemmatize("dogs"))              # noun
print(wnl.lemmatize("running", pos="v"))  # verb
print(wnl.lemmatize("better", pos="a"))   # adjective
print(wnl.lemmatize("slowly", pos="r"))   # adverb

Excellence and Service


158
CHRIST
Deemed to be University

9 different approaches to perform Lemmatization :


1. WordNet
2. WordNet (with POS tag)
3. TextBlob
4. TextBlob (with POS tag)
5. spaCy
6. TreeTagger
7. Pattern
8. Gensim
9. Stanford CoreNLP
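
As a small sketch of approach 5 (spaCy), assuming the en_core_web_sm model has been installed; the sample sentence is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were running better than before")

# spaCy assigns a lemma to every token using its POS context
print([token.lemma_ for token in doc])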

Excellence and Service


159
CHRIST
Deemed to be University

Definition: Regular Expression


● Regular expressions or RegEx is defined as a sequence of
characters that are mainly used to find or replace patterns
present in the text.
● A regular expression is a set of characters or a pattern that is
used to find substrings in a given string.
● A regular expression (RE) is a language for specifying text
search strings.
● It helps us to match or extract other strings or sets of strings,
with the help of a specialized syntax present in a pattern.
● For Example, extracting all hashtags from a tweet, getting
email iD or phone numbers, etc from large unstructured text
content.

Excellence and Service


160
CHRIST
Deemed to be University

● Regex can be in the form of either meta characters or literal


characters.
○ Literal characters, or literals for short, represent basic
characters that can be used to represent different
meanings, such as: 'a’, 'b’, 'c’, '1’, '2’, '3’, and so forth.
○ Meta characters, on the other hand, represent characters
with a special and constant meaning, such as: '$’, '*’,
'+’, and so forth.

Excellence and Service


161
CHRIST
Deemed to be University

Sometimes, we want to identify the different components of an email address. Simply put,
a regular expression is an “instruction” that is given to a function on what and
how to match, search, or replace in a set of strings.

Excellence and Service


162
CHRIST
Deemed to be University

How can Regular Expressions be used in NLP?

1. To Validate data fields.

For Example, dates, email address, URLs, abbreviations, etc.

2. To Filter a particular text from the whole corpus.

For Example, spam, disallowed websites, etc.

3. To Identify particular strings in a text.

For Example, token boundaries

4. To convert the output of one processing component into the format


required for a second component.

Excellence and Service


163
CHRIST
Deemed to be University

Regular Expressions use cases

Regular Expressions are used in various tasks such as:

● Data pre-processing;
● Rule-based information Mining systems;
● Pattern Matching;
● Text feature Engineering;
● Web scraping;
● Data validation;

● Data Extraction.

Excellence and Service


164
CHRIST
Deemed to be University

Regular expressions are commonly used for:

● Search and match patterns in a string.


● Split strings with a particular pattern into substrings with a
particular pattern.
● Findall strings with a particular pattern.
● Sub out particular patterns in a string.

You must import the re module before using regular expressions.


Three commonly-used regex functions are re.search, re.findall, and
re.match.

Excellence and Service


165
CHRIST
Deemed to be University

The concept of Raw String in Regular Expressions

In the following example, we have a couple of backslashes present
in the string. But in a normal string, Python treats \n as “move to a new line”.

After seeing the output, you can observe that \n has moved the text
after it to a new line. Here “nayan” has become “ayan” and the \n
disappeared from the path. This is not what we want.

Excellence and Service


166
CHRIST
Deemed to be University

So, to resolve this issue, we use the “r” prefix to create a raw string:

Again, after seeing the output, we observe that the entire path is now printed out
correctly, simply by putting “r” in front of the path.

Therefore, it is always recommended to use raw strings instead of normal strings while
dealing with regular expressions.
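
A minimal sketch of the difference; the path used here is a made-up example.

# Normal string: '\n' is interpreted as a newline, so part of the path is lost
path = "C:\names\nayan"
print(path)

# Raw string: the backslashes are kept literally
raw_path = r"C:\names\nayan"
print(raw_path)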

Excellence and Service


167
CHRIST
Deemed to be University

Common Regex Functions used in NLP

To work with Regular Expressions, Python has a


built-in module known as “re”. Some common
functions from this module are as follows:
● re.search()
● re.match()
● re.sub()
● re.compile()
● re.findall()

Excellence and Service


168
CHRIST
Deemed to be University

re. search( )

● This function helps us to detect whether the given


regular expression pattern is present in the given
input string. It matches the first occurrence of a
pattern in the entire string and not just at the
beginning.
● It returns a Regex Object if the pattern is found in
the string, else it returns a None object.
● Syntax: re.search(patterns, string)
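
A small usage sketch (the sample text is an assumption):

import re

text = "Learning regular expressions needs regular practice."
match = re.search(r"regular", text)   # first occurrence anywhere in the string

if match:
    print(match.group(), match.start(), match.end())
else:
    print("No match")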

Excellence and Service


169
CHRIST
Deemed to be University

Excellence and Service


170
CHRIST
Deemed to be University

re.match( )

● This function will only match the string if the


pattern is present at the very start of the string.
● Syntax: re.match(patterns, string)
● we know that the output of the re.match is an
object, so to get the matched expression, we
will use the group() function of the match
object.
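
A small usage sketch (the sample strings are assumptions):

import re

text = "Analytics is fun"

result = re.match(r"Analytics", text)   # pattern is at the start, so it matches
print(result.group() if result else "No match")

print(re.match(r"fun", text))           # None: 'fun' is not at the start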

Excellence and Service


171
CHRIST
Deemed to be University

Excellence and Service


172
CHRIST
Deemed to be University

re.sub( )

● This function is used to substitute a substring with another


substring.
● Syntax: re.sub(patterns, Substitute, Input text)
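
A small usage sketch (the sample text is an assumption):

import re

text = "The brown fox jumps over the lazy dog"

# Replace every occurrence of 'The'/'the' with 'a'
print(re.sub(r"[Tt]he", "a", text))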

Excellence and Service


173
CHRIST
Deemed to be University

re.findall( )

● This function will return all the occurrences of the pattern from the string.
● Always recommended to use re.findall().
● It can work like both re.search() and re.match(). Therefore, the result of the
findall() function is a list of all the matches
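
A small usage sketch (the sample text is an assumption):

import re

text = "Order 66 was placed on 21 March for 3 items."

numbers = re.findall(r"\d+", text)   # every run of digits in the string
print(numbers)                        # ['66', '21', '3']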

Excellence and Service


174
CHRIST
Deemed to be University

re.search() vs. re.findall() vs. re.match()

● The re.search function stops searching for the pattern once it finds a
match in a string. It will find the first occurrence of the pattern(s) in
each string.
● The re.findall module searches for all occurrences of the pattern in a
string.
● The re.match function searches for the first occurrence of the pattern,
just as the re.search function does. However, if a string is broken up
by line and there is a first occurrence of the match on any line other
than the first, the re.match function will not find it. This is because
re.match only searches for the pattern(s) on the first line of each
string.

Excellence and Service


175
CHRIST
Deemed to be University

Using re.finditer()

Similar to re.findall(), we can also use re.finditer() function to find all occurrences of a
pattern in a given string and iterate over the match objects. It will print each matched
content (in our case digits or numbers) if found.

import re

pattern = r'\d+'
text = 'There are 123 apples and 456 oranges.'
matches = re.finditer(pattern, text)

# Each match object exposes the matched text via .group()
for match in matches:
    print(f'Match found: {match.group()}')

Excellence and Service


176
CHRIST
Deemed to be University

Using re.split()
If you are working on NLP projects like sentiment analysis , word embedding, document
similarity matching, etc. we often need to split a document based on some logic.

re.split() – to split a string at occurrences of a pattern and obtain a list of substrings.

import re
pattern = r'\s+'
text = 'Split this string.'
parts = re.split(pattern, text)
print(f'Split parts: {parts}')

Excellence and Service


177
CHRIST
Deemed to be University

re.compile()
Regular expressions are compiled into pattern objects, which have
methods for various operations such as searching for pattern matches
or performing string substitutions.
The code uses a regular expression pattern [a-e] to find and list all
lowercase letters from ‘a’ to ‘e’ in the input string “Aye, said Mr.
Gibenson Stark”. The output will be ['e', 'a', 'd', 'b', 'e', 'a'], which are
the matching characters.
Keep in mind that the compile() method is useful for defining and
creating regular expressions object initially and then using that object
we can look for occurrences of the same pattern inside various target
strings without rewriting it which saves time and improves
performance.
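
A sketch of the example described above:

import re

# Compile once, then reuse the pattern object
p = re.compile('[a-e]')

print(p.findall("Aye, said Mr. Gibenson Stark"))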
Excellence and Service
178
CHRIST
Deemed to be University

Python’s re.compile() method is used to compile a regular expression


pattern provided as a string into a regex pattern object (re.Pattern).
Later we can use this pattern object to search for a match inside
different target strings using regex methods such as a re.match() or
re.search().
We can compile a regular expression into a regex object to look for
occurrences of the same pattern inside various target strings
without rewriting it.
Avoid using the compile() method when you want to search for
various patterns inside the single target string. You do not need to use
the compile method beforehand because the compiling is done
automatically with the execution of other regex methods.

Excellence and Service


179
CHRIST
Deemed to be University

Excellence and Service


180
CHRIST
Deemed to be University

Special Sequences in Regular Expressions

1. \b
\b returns a match where the specified pattern is at the beginning or at
the end of a word.
2. \d
\d returns a match where the string contains digits (numbers from 0-9).
Adding '+' after '\d' will continue to extract digits until a non-digit character is encountered.
We can infer that \d+ repeats one or more occurrences of \d till the
non-matching character is found whereas \d does a character-wise
comparison.

Excellence and Service


181
CHRIST
Deemed to be University

3. \D

\D returns a match where the string does not contain any digit. It is
basically the opposite of \d.
4. \w

\w helps in extraction of alphanumeric characters only (characters from a


to Z, digits from 0-9, and the underscore _ character)
5. \W

\W returns match at every non-alphanumeric character. Basically opposite


of \w.

Excellence and Service


182
CHRIST
Deemed to be University

Brackets ([ ])
They are used to specify a disjunction of characters.
For Examples,
/[cC]hirag/ → Chirag or chirag
/[xyz]/ → ‘x’, ‘y’, or ‘z’
/[1234567890]/ → any digit

Here slashes represent the start and end of a particular


expression.
Excellence and Service
183
CHRIST
Deemed to be University

Dash (-)
They are used to specify a range.
For Examples,
/[A-Z]/ → matches an uppercase letter
/[a-z]/ → matches a lowercase letter
/[0–9]/ → matches a single digit

Excellence and Service


184
CHRIST
Deemed to be University

Caret (^)
They can be used for negation or just to mean ^.
For Examples,
/[ˆa-z]/ → not a lowercase letter
/[ˆCc]/ → neither ‘C’ nor ‘c’
/[ˆ.]/ → not a period
/[cˆ]/ → either ‘c’ or ‘ˆ’
/xˆy/ → the pattern ‘xˆy’
Excellence and Service
185
CHRIST
Deemed to be University

Question mark (?)


It marks the optionality of the previous
expression.
For Examples,
/maths?/ → math or maths
/colou?r/ → color or colour

Excellence and Service


186
CHRIST
Deemed to be University

What are Anchors?


These are special characters that help us to perform string operations either at the
beginning or at the end of text input. They are used to assert something about the
string or the matching process. Generally, they are not used in a specific word or
character but used while we are dealing with more general queries.
Caret character ‘^’

It specifies the start of the string. For a string to match the pattern, the character
followed by the ‘^’ in the pattern should be the first character of the string.
Dollar character ‘$’

It specifies the end of the string. For a string to match the pattern, the character
that precedes the ‘$’ in the pattern should be the last character in the string.

Excellence and Service


187
CHRIST
Deemed to be University

What are Quantifiers?

Some common Quantifiers are: ( *, +, ? and { } )

They allow us to mention and control over how many times

a specific character(s) pattern should occur in the given text.

Excellence and Service


188
CHRIST
Deemed to be University

Excellence and Service


189
CHRIST
Deemed to be University

Each of the earlier mentioned quantifiers can be


written in the form of {m,n} quantifier in the
following way:
● ‘?’ is equivalent to zero or once, or {0, 1}
● ‘*’ is equivalent to zero or more times, or {0, }
● ‘+’ is equivalent to one or more times, or {1, }

Excellence and Service


190
CHRIST
Deemed to be University

For Examples,
abc*: matches a string that has 'ab' followed by zero or more 'c'.
abc+: matches 'ab' followed by one or more 'c'
abc?: matches 'ab' followed by zero or one 'c'
abc{2}: matches 'ab' followed by 2 'c'
abc{2, }: matches 'ab' followed by 2 or more 'c'
abc{2, 5}: matches 'ab' followed by 2 upto 5 'c'
a(bc)*: matches 'a' followed by zero or more copies of the
sequence 'bc'

Excellence and Service


191
CHRIST
Deemed to be University

Pipe Operator
It’s denoted by ‘|’. It is used as an OR operator. We have to use it
inside the parentheses.
For Example, Consider the pattern ‘(play|walk)ed’ –
The above pattern will match both the strings — ‘played’ and
‘walked’.
Therefore, the pipe operator tells us about the place inside the
parentheses that can be either of the strings or the characters.

Excellence and Service


192
CHRIST
Deemed to be University

Escaping Special Characters


The characters which we discussed above in quantifiers
such as ‘?’, ‘*’, ‘+’, ‘(‘, ‘)’, ‘{‘, etc. can also appear in the
input text. So, In such cases to extract these specific
characters, we have to use the escape sequences. The
escape sequence, represented by a backslash ‘\’, is used to
escape the special meaning of the special characters.

Excellence and Service


193
CHRIST
Deemed to be University

The curly braces { … }


It tells the computer to repeat the preceding character (or set of
characters) for as many times as the value inside this bracket.
Example: {2} means that the preceding character is to be repeated 2
times, {min,} means the preceding character is matched min or more
times, and {min,max} means that the preceding character is repeated
at least min and at most max times.

Excellence and Service


194
CHRIST
Deemed to be University

Wildcard ( . )
The dot symbol can take the place of any other symbol, that is why it is called the
wildcard character.

Example :
The Regular expression .* will tell the computer that any character
can be used any number of times.

Excellence and Service


195
CHRIST
Deemed to be University

Meta Sequences

Excellence and Service


196
CHRIST
Deemed to be University

Excellence and Service


197
CHRIST
Deemed to be University

Excellence and Service


198
CHRIST
Deemed to be University

Metacharacters in Regular Expression

(.) matches any character (except newline character)


(^) starts with
It checks whether the string starts with the given pattern or not.
($) ends with
It checks whether the string ends with the given pattern or not.
(*) matches for zero or more occurrences of the pattern to the left of
it
(+) matches one or more occurrences of the pattern to the left of it

Excellence and Service


199
CHRIST
Deemed to be University

(?) matches zero or one occurrence of the pattern left to it.


(|) either or
The pipe(|) operator checks whether any of the two patterns, to its
left and right, is present in the String or not.

Excellence and Service


200
CHRIST
Deemed to be University

Example : Regular expression for an email address :


^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
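
A quick check of this pattern with re.match; the sample addresses below are made up for illustration.

import re

pattern = r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"

for email in ["[email protected]", "invalid@@example..com"]:
    print(email, "->", "valid" if re.match(pattern, email) else "invalid")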

PRACTICE:
https://www.w3resource.com/python-exercises/re/

Excellence and Service


201
CHRIST
Deemed to be University

LAB EXERCISE-1:

1.Implement a python code for preprocessing text


document using NLTK
A. TOKENIZATION
B. REMOVE PUNCTUATION
C. CONVERT UPPER CASE TO LOWER CASE
D. STOP WORD REMOVAL
E. FREQUENCY DISTRIBUTION
F. STEMMING AND LEMMATIZATION
G. POS TAGGING
H. PARSING-GENERATE TREE
I. CHUNKING
J. NER

Excellence and Service


202
CHRIST
Deemed to be University

LAB EXERCISE-2:To Create Regular expressions in Python


for detecting word patterns
1. Develop a Python program to validate and extract Social Security
Numbers (SSN) from a given text.
2. Develop a Python program that extracts and validates email addresses
from a given text.
3. Develop a Python code to replace a word from a given text.
4. Develop a Python code to find and identify a correct phone number
from a given list of numbers (solutions with answer code below). The
10-digit phone numbers should only contain numbers.
5. Develop a Python code that uses the regular expression \d{2}/\d{2}/\d{4}
to extract dates in the format “MM/DD/YYYY” from the given text.
6. Develop a python code to extract only URLs using regular
expressions from a given string.

Excellence and Service


203
CHRIST
Deemed to be University

INPUT

1. text = 'Employee ID: 123-45-6789 and Employee ID: 247-36-6788'
2. text = 'Contact us at [email protected] or
[email protected]'
3. text = 'I have an apple, and I love apples.'
4. numbers = ['1234567890', '9876543210', '123-456-7890',
'987654321']
5. text = 'Meeting on 03/15/2024 and 04/20/2024'
6. text = 'Visit our website at http://www.example.com or
check https://blog.example.com'

Excellence and Service


204
CHRIST
Deemed to be University

What is PunktSentenceTokenizer

● In NLTK, PUNKT is an unsupervised trainable model,


which means it can be trained on unlabeled data .
● This tokenizer divides a text into a list of sentences by
using an unsupervised algorithm to build a model for
abbreviation words, collocations, and words that start
sentences. It must be trained on a large collection of
plaintext in the target language before it can be used.
● For example, splitting a long text into sentences, the
following is the provided input text and the task to separate
the input into different sentences.

Excellence and Service


205
CHRIST
Deemed to be University

Excellence and Service


206
CHRIST
Deemed to be University

import nltk
nltk.download('punkt')
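
Once the Punkt model is downloaded, sentence splitting is available through sent_tokenize, which uses the pre-trained PunktSentenceTokenizer under the hood; the sample text is an assumption.

from nltk.tokenize import sent_tokenize

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars. "
        "Did he mind? He paid a lot for it.")

# Punkt knows that the periods in 'Mr.' and '1.5' do not end sentences
print(sent_tokenize(text))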

Excellence and Service


207
CHRIST
Deemed to be University

Chunking in Natural Language processing

● Chunking is defined as the process of natural language processing used to


identify parts of speech and short phrases present in a given sentence.
● Chunking is used to get the required phrases from a given sentence.
However, POS tagging can be used only to spot the parts of speech that
every word of the sentence belongs to.
● Chunking is extracting phrases from an unstructured text by evaluating a
sentence and determining its elements (Noun Groups, Verbs, verb groups,
etc.)
● When we have loads of descriptions or modifications around a particular
word or the phrase of our interest, we use chunking to grab the required
phrase alone, ignoring the rest around it. Hence, chunking paves a way to
group the required phrases and exclude all the modifiers around them which
are not necessary for our analysis. Summing up, chunking helps us extract
the important words alone from lengthy descriptions. Thus, chunking is a
step in information extraction.

Excellence and Service


208
CHRIST
Deemed to be University

This process of chunking in NLP is extended to various other applications; for instance, to
group fruits of a specific category, say, fruits rich in proteins as a group, fruits rich in
vitamins as another group, and so on. Besides, chunking can also be used to group similar
cars, say, cars supporting auto-gear into one group and the others which support manual
gear into another chunk and so on.

● Types of Chunking

● There are, broadly, two types of chunking:

● Chunking up

● Chunking down

Excellence and Service


209
CHRIST
Deemed to be University

Chunking in Python

● The high-level idea is that first, we tokenize our text. Now there is a
utility in NLTK which tags the words; pos_tag, which attaches a tag to
the words, for example, Verb conjunction etc.
● Then with the help of these tags, we can perform Chunking. If we want
to select verbs, we can write a grammar that selects the words with a
grammar tag.
● example of chunking :
● Sentence: The cat sat on the mat.
● POS tags: The/DT cat/NN sat/VBD on/IN the/DT mat/NN.
● Chunks:
● Noun Phrase: The cat
● Verb Phrase: sat
● Prepositional Phrase: on the mat
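
A minimal sketch of noun-phrase chunking with NLTK's RegexpParser; the chunk grammar here is an illustrative assumption (it extracts NPs only), and the tokenizer/tagger resources must be downloaded first.

import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
sentence = "The cat sat on the mat"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

grammar = "NP: {<DT>?<JJ>*<NN>}"   # optional determiner + adjectives + noun
chunk_parser = nltk.RegexpParser(grammar)

print(chunk_parser.parse(tagged))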

Excellence and Service


210
CHRIST
Deemed to be University

Information Extraction

Excellence and Service


211
CHRIST
Deemed to be University

Information extraction has many applications including −


Business intelligence
Resume harvesting
Media analysis
Sentiment detection
Patent search
Email scanning

Excellence and Service


212
CHRIST
Deemed to be University

Named-entity recognition (NER)

● Named-entity recognition (NER) is actually a way of extracting


some of the most common entities like names, organizations,
locations, etc.
● Let us see an example that took all the preprocessing steps such
as sentence tokenization, POS tagging, chunking, NER, and
follows the pipeline provided in the figure above.
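
A minimal sketch of such a pipeline with NLTK; the sample sentence and the download calls are assumptions.

import nltk

# One-time downloads: 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', 'words'
text = "Sundar Pichai is the CEO of Google, headquartered in California."

tokens = nltk.word_tokenize(text)      # tokenization
tagged = nltk.pos_tag(tokens)          # POS tagging
entities = nltk.ne_chunk(tagged)       # chunking + named-entity recognition
print(entities)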

Excellence and Service


213
CHRIST
Deemed to be University

Named Entity Recognition

● Named entity recognition (NER) is probably the first step towards


information extraction that seeks to locate and classify named entities
in text into pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary
values, percentages, etc. NER is used in many fields in Natural
Language Processing (NLP), and it can help answer many real-world
questions, such as:
○ Which companies were mentioned in the news article?
○ Were specified products mentioned in complaints or reviews?
○ Does the tweet contain the name of a person? Does the tweet
contain this person’s location?

Excellence and Service


214
CHRIST
Deemed to be University

Excellence and Service


215
CHRIST
Deemed to be University

● Named entity recognition is a natural language processing technique that can


automatically scan entire articles and pull out some fundamental entities in a
text and classify them into predefined categories. Entities may be,
● Organizations,
● Quantities,
● Monetary values,
● Percentages, and more.
● People’s names
● Company names
● Geographic locations (Both physical and political)
● Product names
● Dates and times
● Amounts of money
● Names of events

Excellence and Service


216
CHRIST
Deemed to be University

Excellence and Service


217
CHRIST
Deemed to be University

A typical NER model consists of the following three blocks:


Noun Phrase Identification

This step deals with extracting all the noun phrases from a text with the help
of dependency parsing and part of speech tagging.
Phrase Classification

In this classification step, we classified all the extracted noun phrases from
the above step into their respective categories. To disambiguate locations,
Google Maps API can provide a very good path. and to identify person names
or company names, the open databases from DBpedia, Wikipedia can be
used. Apart from this, we can also make the lookup tables and dictionaries by
combining information with the help of different sources.

Excellence and Service


218
CHRIST
Deemed to be University

Entity Disambiguation
Sometimes what happens is that entities are misclassified, hence
creating a validation layer on top of the results becomes useful. The
use of knowledge graphs can be exploited for this purpose. Some of
the popular knowledge graphs are:
● Google Knowledge Graph,
● IBM Watson,
● Wikipedia, etc.

Excellence and Service


219
CHRIST
Deemed to be University

Any NER model is a two-step process:


● Detect a named entity
● Categorize the entity

So first, we need to create entity categories, like Name, Location,


Event, Organization, etc., and feed a NER model relevant training
data.
Then, by tagging some samples of words and phrases with their
corresponding entities, we’ll eventually teach our NER model to
detect the entities and categorize them.

Excellence and Service


220
CHRIST
Deemed to be University

FEATURE ENGINEERING FOR TEXT REPRESENTATION

● Feature engineering is one of the most important steps in


machine learning. It is the process of using domain
knowledge of the data to create features that make machine
learning algorithms work.
● Think of a machine learning algorithm as a learning child: the
more accurate the information you provide, the better it will
be able to interpret that information.
● Focusing first on our data will give us better results than
focusing only on models. Feature engineering helps us to
create better data which helps the model understand it well
and provide reasonable results.
Excellence and Service
221
CHRIST
Deemed to be University

Excellence and Service


222
CHRIST
Deemed to be University

● NLP is a subfield of artificial intelligence where we understand human


interaction with machines using natural languages. To understand a
natural language, you need to understand how we write a sentence,
how we express our thoughts using different words, signs, special
characters, etc basically we should understand the context of the
sentence to interpret its meaning.
● Extracting Features from Text
● Now we will learn about common feature extraction techniques and
methods.
● Feature extraction methods can be divided into 3 major categories,
basic, statistical, and advanced/vectorized.

Excellence and Service


223
CHRIST
Deemed to be University

Basic Methods
These feature extraction methods are based on various
concepts from NLP and linguistics. These are some of the
oldest methods, but they can still be very reliable and are used
frequently in many areas.
● Parsing
● PoS Tagging
● Name Entity Recognition (NER)
● Bag of Words (BoW)

Excellence and Service


224
CHRIST
Deemed to be University

Statistical Methods
This is a bit more advanced feature extraction method and uses the concepts from
statistics and probability to extract features from text data.
● Term Frequency-Inverse Document Frequency (TF-IDF)
Advanced Methods
These methods can also be called vectorized methods as they aim to map a word,
sentence, document to a fixed-length vector of real numbers. The goal of this
method is to extract semantics from a piece of text, both lexical and
distributional. Lexical semantics is just the meaning reflected by the words
whereas distributional semantics refers to finding meaning based on various
distributions in a corpus.
● Word2Vec
● GloVe: Global Vector for word representation

Excellence and Service


225
CHRIST
Deemed to be University

Excellence and Service


226
CHRIST
Deemed to be University

Bag-of-Words Model
● It is called a “bag” of words, because any information about the
order or structure of words in the document is discarded.
● The model is only concerned with whether known words occur
in the document, not where in the document.
● The bag of words model is one particularly simple way to
represent a document in numerical form before we can feed it
into a machine learning algorithm.
● For any natural language processing task, we need a way to
accomplish this before any further processing.
● It doesn’t take into account the order and the structure of the
words, but it only checks if the words appear in the document.

Excellence and Service


227
CHRIST
Deemed to be University

● Machine learning algorithms can’t operate on raw text;


we need to convert the text to some sort of numerical
representation. This process is also known as embedding
the text.
● There are two basic approaches to embedding a text:
word vectors and document vectors. With word vectors,
we represent each individual word in the text as a vector
(i.e., a sequence of numbers). We then convert the whole
document into a sequence of these word vectors.
Document vectors, on the other hand, embed the entire
document as a single vector.
Excellence and Service
228
CHRIST
Deemed to be University

● The bag of words model is a simple way to convert words


to numerical representation in natural language processing.
● This model is a simple document embedding technique
based on word frequency.
● Conceptually, we think of the whole document as a “bag”
of words, rather than a sequence.
● We represent the document simply by the frequency of
each word.
● Using this technique, we can embed a whole set of
documents and feed them into a variety of different
machine learning algorithms.
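
A minimal sketch of a bag-of-words representation with scikit-learn's CountVectorizer; the toy corpus is an assumption.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The lion ate the deer",
    "The brown fox is quick",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the vocabulary (one feature per word)
print(bow.toarray())                        # word counts per document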

Excellence and Service


229
CHRIST
Deemed to be University

BAG OF N-GRAMS MODEL

A bag-of-n-grams model is a way to represent a document, similar to a bag-of-words
model. A bag-of-n-grams model represents a text document
as an unordered collection of its n-grams.

For example, let’s use the following phrase and divide it into bi-grams (n=2).
“James is the best person ever.”

becomes

● <start>James
● James is
● is the
● the best
● best person
● person ever.
● ever.<end>

Excellence and Service


230
CHRIST
Deemed to be University

In a typical bag-of-n-grams model, these 6 bigrams would be a


sample from a large number of bigrams observed in a corpus.
And then James is the best person ever. would be encoded in a
representation showing which of the corpus’s bigrams were
observed in the sentence. A bag-of-n-grams model has the
simplicity of the bag-of-words model but allows the
preservation of more word locality information.
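
A small sketch of a bag-of-bigrams with CountVectorizer and ngram_range; note that, unlike the illustration above, CountVectorizer lowercases the text and does not add <start>/<end> markers.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["James is the best person ever."]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit_transform(corpus)

print(bigram_vectorizer.get_feature_names_out())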

Excellence and Service


231
CHRIST
Deemed to be University

N-grams Model:
A more sophisticated approach is to create a vocabulary of grouped words. This changes both the scope of
the vocabulary and allows the bag-of-words to capture a little bit more meaning from the document.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is, in turn,
called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible
bigrams.

An N-gram is an N-token sequence of words: a 2-gram (more commonly


called a bigram) is a two-word sequence of words like “please turn”, “turn
your”, or “your homework”, and a 3-gram (more commonly called a
trigram) is a three-word sequence of words like “please turn your”, or
“turn your homework”.

Excellence and Service


232
CHRIST
Deemed to be University

NgramTagger
NgramTagger has 3 subclasses
● UnigramTagger
● BigramTagger
● TrigramTagger

BigramTagger subclass uses previous tag as part of its context


TrigramTagger subclass uses the previous two tags as part of its context.
ngram – It is a subsequence of n items.
Idea of NgramTagger subclasses :
● By looking at the previous words and P-O-S tags, part-of-speech tag for the current
word can be guessed.
● Each tagger maintains a context dictionary (ContextTagger parent class is used to
implement it).
● This dictionary is used to guess that tag based on the context.
● The context is some number of previous tagged words in the case of NgramTagger
subclasses.
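
A small sketch of a BigramTagger backing off to a UnigramTagger, trained on the NLTK Treebank sample; the train split size and the tagged sentence are assumptions.

import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger

nltk.download('treebank')

train_sents = treebank.tagged_sents()[:3000]

# The bigram tagger falls back to the unigram tagger for unseen contexts
unigram = UnigramTagger(train_sents)
bigram = BigramTagger(train_sents, backoff=unigram)

print(bigram.tag("John can hit the ball".split()))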

Excellence and Service


233
CHRIST
Deemed to be University

Limitations of Bag-of-Words

● If we deploy bag-of-words to generate vectors for large


documents, the vectors would be of large sizes and would also
have too many null values leading to the creation of sparse
vectors.
● Bag-of-words does not bring in any information on the meaning
of the text. For example, if we consider these two sentences –
“Text processing is easy but tedious.” and “Text processing is
tedious but easy.” – a bag-of-words model would create the
same vectors for both of them, even though they have different
meanings.

Excellence and Service


234
CHRIST
Deemed to be University

Excellence and Service


235
CHRIST
Deemed to be University

TF-IDF MODEL
● tf–idf or TFIDF, short for term frequency-inverse document
frequency, is a numerical statistic that is intended to reflect
how important a word is to a document in a collection or
corpus.
● TF-IDF is a natural language processing (NLP) technique
that’s used to evaluate the importance of different words in a
sentence. It’s useful in text classification and for helping a
machine learning model read words.
● tf–idf is one of the most popular term-weighting schemes
today; 83% of text-based recommender systems in digital
libraries use tf–idf.

Excellence and Service


236
CHRIST
Deemed to be University

● This concept includes:


● Counts. Count the number of times each word appears in a document.
● Frequencies. Calculate the frequency with which each word appears in a document, out of all the words in the document.

Excellence and Service


237
CHRIST
Deemed to be University

Term Frequency:
Term frequency is defined as the number of times a word (i)
appears in a document (j) divided by the total number of words
in the document.
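A minimal sketch of this definition in plain Python (illustrative only):

# tf(t, d) = (number of times t appears in d) / (total number of words in d)
def term_frequency(term, document_tokens):
    return document_tokens.count(term) / len(document_tokens)

doc = "the car is driven on the road".split()
print(term_frequency("the", doc))   # 2/7 ≈ 0.286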

Excellence and Service


238
CHRIST
Deemed to be University

● Suppose we have a set of English text documents and wish to rank


which document is most relevant to the query, “Data Science is
awesome!”
● One way to start out is to eliminate documents that don’t contain all four words (“Data,” “Science,” “is,” and “awesome”), but this still leaves too many documents.
● To further distinguish them, we might count the number of times
each term occurs in each document.
● The number of times a term occurs in a document is called its term
frequency.
● The weight of a term that occurs in a document is simply
proportional to the term frequency.

Excellence and Service


239
CHRIST
Deemed to be University

What Is Document Frequency in TF-IDF?


● Document frequency (DF) measures how common a term is across the whole document set (corpus).
● This is very similar to TF. The difference is that TF counts how often a term t occurs within a single document d, whereas DF counts in how many of the N documents the term t occurs. In other words, DF is the number of documents in which the word is present.

Excellence and Service


240
CHRIST
Deemed to be University

Inverse Document Frequency:


● Inverse document frequency refers to the log of the total number of
documents divided by the number of documents that contain the
word. The logarithm is added to dampen the importance of a very
high value of IDF.

Excellence and Service


241
CHRIST
Deemed to be University

● Certain terms, such as “is,” “of,” and “that,” may appear


a lot of times but have little importance.
● We need to weigh down the frequent terms while scaling
up the rare ones.
● When we compute IDF, an inverse document frequency
factor is incorporated, which diminishes the weight of
terms that occur very frequently in the document set and
increases the weight of terms that rarely occur.

Excellence and Service


242
CHRIST
Deemed to be University

● IDF is used to calculate the weight of rare words across all documents in the corpus.
● Words that occur rarely in the corpus have a high IDF score.
● It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient).
● There are a few practical issues. If you have a large corpus, say 100,000,000 documents, the raw ratio N/df explodes; taking the logarithm dampens it.
● At query time, when a word that is not in the vocabulary occurs, its DF will be 0. Since we can’t divide by 0, we smooth the value by adding 1 to the denominator:
● idf(t) = log(N/(df + 1))
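A small sketch of this smoothed IDF, following the formula on the slide with a natural logarithm (illustrative only):

import math

# idf(t) = log(N / (df + 1)), N = number of documents,
# df = number of documents containing the term t.
def inverse_document_frequency(term, documents):
    N = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(N / (df + 1))

docs = [{"the", "car", "is", "driven", "on", "road"},
        {"the", "truck", "is", "driven", "on", "highway"}]
print(inverse_document_frequency("car", docs))       # log(2/2) = 0.0
print(inverse_document_frequency("unknown", docs))   # log(2/1) ≈ 0.693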

Excellence and Service


243
CHRIST
Deemed to be University

TF-IDF is computed by multiplying the term frequency with the inverse document frequency.

● TF-IDF is one of the best metrics to determine how significant a term is to a text in a series or a corpus. TF-IDF is a weighting scheme that assigns a weight to each word in a document based on its term frequency (TF) and its inverse document frequency (IDF). Words with higher weights are deemed more significant.

Excellence and Service


244
CHRIST
Deemed to be University

● TF-IDF gives larger values to less frequent words in the document corpus.
● The TF-IDF value is high when both the TF and IDF values are high, i.e., the word is rare across the whole corpus but frequent within a particular document.
● TF-IDF does not capture the semantic meaning of words.

Excellence and Service


245
CHRIST
Deemed to be University

Example

● Sentence 1: The car is driven on the road.


● Sentence 2: The truck is driven on the
highway.
● In this example, each sentence is a separate
document.
● We will now calculate the TF-IDF for the
above two documents, which represent our
corpus.

Excellence and Service


246
CHRIST
Deemed to be University

● We can see that the TF-IDF of the common words is zero, which shows they are not significant. On the other hand, the TF-IDF of “car”, “truck”, “road”, and “highway” is non-zero; these words carry more significance.
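The computation behind this observation can be reproduced with a short script that applies the TF and IDF definitions above (a sketch; IDF is taken here as log10(N/df) without smoothing, so words appearing in both documents get an IDF of exactly 0):

import math

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log10(N / df)

for doc in tokenized:
    print({t: round(tf(t, doc) * idf(t), 3) for t in sorted(set(doc))})
# 'the', 'is', 'driven', 'on' -> 0.0; 'car', 'road', 'truck', 'highway' -> ~0.043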
Excellence and Service
247
CHRIST
Deemed to be University

Reviewing: TF-IDF is the product of the TF and IDF scores of the term.
● TF = (number of times the term appears in the document) / (total number of words in the document)
● IDF = ln(number of documents / number of documents the term appears in)
● The higher the TF-IDF score, the rarer (and more distinctive) the term, and vice versa.
● TF-IDF is successfully used by search engines, such as Google, as a ranking factor for content.
● The whole idea is to weigh down the frequent terms while scaling up the rare ones.

Excellence and Service


248
CHRIST
Deemed to be University

Numerical Example

Imagine the term t appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of t can be calculated as follows:
tf(t, d) = 20/100 = 0.2
Assume the collection contains 10,000 documents. If 100 of those 10,000 documents contain the term t, the Inverse Document Frequency (IDF) of t can be calculated as follows (using log base 10):
idf(t) = log(10000/100) = 2

Using these two quantities, we can calculate the TF-IDF score of the term t for the document:
tf-idf(t, d) = 0.2 × 2 = 0.4

Excellence and Service


249
CHRIST
Deemed to be University

Example:Calculate tf-idf

Doc-1 : “Data Science is the sexiest job of the 21st


century”.
Doc-2: “machine learning is the key for data science”.

Excellence and Service


250
CHRIST
Deemed to be University

TF-IDF with example

Consider the three sentences given below:

1. Inflation has increased unemployment


2. The company has increased its sales
3. Fear increased his pulse

Excellence and Service


251
CHRIST
Deemed to be University

Step 1: Data Pre-processing

After lowercasing and removing stop words the sentences are transformed as below:

Excellence and Service


252
CHRIST
Deemed to be University

Step 2: Calculating Term Frequency

In this step, we have to calculate TF i.e., the Term Frequency of our given sentences.

Excellence and Service


253
CHRIST
Deemed to be University

Step 3: Calculating Inverse Document Frequency

Now, next, we have to calculate the Inverse Document Frequency (IDF) of all the words in
the sentences.

Excellence and Service


254
CHRIST
Deemed to be University

Step 4: Calculating Product of Term Frequency & Inverse Document


Frequency

Excellence and Service


255
CHRIST
Deemed to be University

Excellence and Service


256
CHRIST
Deemed to be University

So, to build a TF-IDF model, we give all the input features (i.e., the vocabulary) along with the documents to the TF-IDF model. The result is a TF-IDF matrix, which can then be used to train a machine learning model for document classification.

Excellence and Service


257
CHRIST
Deemed to be University

IDF Formula in sklearn
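For reference, scikit-learn's TfidfVectorizer uses a smoothed variant of the formula: with the default smooth_idf=True it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then L2-normalises each row, so its values differ slightly from the hand-computed ones above. A short sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
print(vec.idf_.round(3))      # per-term IDF values learned from the corpus
print(X.toarray().round(3))   # the TF-IDF matrix (row-normalised)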

Excellence and Service


258
CHRIST
Deemed to be University

Advantages of TF-IDF

● Measures relevance: TF-IDF measures the importance of a term in


a document, based on the frequency of the term in the document
and the inverse document frequency (IDF) of the term across the
entire corpus. This helps to identify which terms are most relevant
to a particular document.
● Handles large text corpora: TF-IDF is scalable and can be used
with large text corpora, making it suitable for processing and
analyzing large amounts of text data.
● Handles stop words: TF-IDF automatically down-weights common
words that occur frequently in the text corpus (stop words) that do
not carry much meaning or importance, making it a more accurate
measure of term importance.

Excellence and Service


259
CHRIST
Deemed to be University

○ Can be used for various applications: TF-IDF can be used


for various natural language processing tasks, such as text
classification, information retrieval, and document
clustering.
○ Interpretable: The scores generated by TF-IDF are easy to
interpret and understand, as they represent the importance of
a term in a document relative to its importance across the
entire corpus.
○ Works well with different languages: TF-IDF can be used
with different languages and character encodings, making it
a versatile technique for processing multilingual text data.

Excellence and Service


260
CHRIST
Deemed to be University

Limitations of TF-IDF

● Ignores the context: TF-IDF only considers the frequency of


each term in a document, and does not take into account the
context in which the term appears. This can lead to incorrect
interpretations of the meaning of the document.
● Assumes independence: TF-IDF assumes that the terms in a
document are independent of each other. However, this is often
not the case in natural language, where words are often related
to each other in complex ways.
● Vocabulary size: The vocabulary size can become very large
when working with large datasets, which can lead to high-
dimensional feature spaces and difficulty in interpreting the
results.
Excellence and Service
261
CHRIST
Deemed to be University

● No concept of word order: TF-IDF treats all words as equally important,


regardless of their order or position in the document. This can be
problematic for certain applications, such as sentiment analysis, where
word order can be crucial for determining the sentiment of a document.
● Limited to term frequency: TF-IDF only considers the frequency of each
term in a document and does not take into account other important
features, such as the length of the document or the position of the term
within the document.
● Sensitivity to stopwords: TF-IDF can be sensitive to stop words, which
are common words that do not carry much meaning, but appear
frequently in documents. Removing stop words from the document can
help to address this issue.

Excellence and Service


262
CHRIST
Deemed to be University

Applications of TF-IDF

● Search engines: TF-IDF is used in search engines to rank


documents based on their relevance to a query. The TF-IDF
score of a document is used to measure how well the document
matches the search query.
● Text classification: TF-IDF is used in text classification to
identify the most important features in a document. The TF-IDF
score of each term in the document is used to measure its
relevance to the class.
● Information extraction: TF-IDF is used in information extraction
to identify the most important entities and concepts in a
document. The TF-IDF score of each term is used to measure its
importance in the document.
Excellence and Service
263
CHRIST
Deemed to be University

● Keyword extraction: TF-IDF is used in keyword extraction to


identify the most important keywords in a document. The TF-
IDF score of each term is used to measure its importance in the
document.
● Recommender systems: TF-IDF is used in recommender
systems to recommend items to users based on their preferences.
The TF-IDF score of each item is used to measure its relevance
to the user’s preferences.
● Sentiment analysis: TF-IDF is used in sentiment analysis to
identify the most important words in a document that contribute
to the sentiment. The TF-IDF score of each word is used to
measure its importance in the document.

Excellence and Service


264
CHRIST
Deemed to be University

Top 8 Python Libraries For Natural Language Processing


(NLP) in 2024

Excellence and Service


265
CHRIST
Deemed to be University

Excellence and Service


266
UNIT-2
Text Classification

MISSION VISION CORE VALUES


CHRIST is a nurturing ground for an individual’s Excellence and Service Faith in God | Moral Uprightness
holistic development to make effective contribution to Love of Fellow Beings
the society in a dynamic environment Social Responsibility | Pursuit of Excellence
CHRIST
Deemed to be University

Text Classification

Vector Semantics and Embeddings, Word Embeddings, Word2Vec model, Glove


model, Fast Text model, Overview of Deep Learning models, RNN, Transformers,
Overview of Text Summarization and Topic Models.

Excellence and Service


268
CHRIST
Deemed to be University

Classification
What is classification?
✔ Classification is a process of categorizing a given set of data into
classes.
✔ It can be performed on both structured and unstructured data.
✔ A supervised learning approach.
✔ Categorizing some unknown items into a discrete set of
categories or classes.
✔ The target attribute is a categorical variable.
✔ The goal of data classification is to organize and categorize
data in distinct classes.

Excellence and Service


CHRIST
Deemed to be University

Basic Concepts (Contd...)


● Classification is an approach in machine learning that involves assigning a data point to a predefined label (class).
● It takes an input and runs it through a classification technique, or classifier, to map the input to a discrete class or category.

Excellence and Service


CHRIST
Deemed to be University

General Approach to Classification

● Data classification is a three-step process:


1. Learning step (where a classification model is constructed)
2. Model evaluation (Estimate accuracy rate of the model based on a test set)
3. Classification step (where the model is used to predict class labels for given
data).

Excellence and Service


CHRIST
Deemed to be University

Contd...
1. Model construction (Learning):

● Each tuple is assumed to belong to a predefined class, as determined by one of


the attributes, called the class label.
● The set of all tuples used for construction of the model is called training set.
● The model is represented in the following forms:
○ Classification rules, (IF-THEN statements),
○ Decision tree
○ Mathematical formulae

Excellence and Service


CHRIST
Deemed to be University

Contd...

Excellence and Service


CHRIST
Deemed to be University

Contd...
2. Model Evaluation (Accuracy):

● Estimate accuracy rate of the model based on a test set.

– The known label of test sample is compared with the classified result from the
model.

– Accuracy rate is the percentage of test set samples that are correctly classified by
the model.
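A minimal sketch of this evaluation step (the labels below are made-up illustrative values):

# Accuracy = fraction of test samples whose predicted label matches the known label.
from sklearn.metrics import accuracy_score

y_true = ["spam", "ham", "spam", "ham", "ham"]   # known labels of the test set
y_pred = ["spam", "ham", "ham",  "ham", "ham"]   # labels predicted by the model
print(accuracy_score(y_true, y_pred))            # 0.8 -> 80% accuracy rate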

Excellence and Service


CHRIST
Deemed to be University

Contd...

Excellence and Service


CHRIST
Deemed to be University

Contd...
3. Model Use (Classification)

The model is used to classify unseen objects.


• Give a class label to a new tuple
• Predict the value of an actual attribute

Excellence and Service


CHRIST
Deemed to be University

Contd...

Excellence and Service


CHRIST
Deemed to be University

Text classification

● Text classification is a machine learning technique that assigns a set


of predefined categories to open-ended text.
● Text classifiers can be used to organize, structure, and categorize
pretty much any kind of text – from documents, medical studies, and
files, and all over the web.
● For example, news articles can be organized by topic; support tickets by urgency; chat conversations by language; brand mentions by sentiment; and so on.
● Text classification is one of the fundamental tasks in natural language processing, with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
Consider the phrase: “The user interface is quite straightforward and easy to use.”

Excellence and Service


278
CHRIST
Deemed to be University

A text classifier can take this phrase as input, analyze its content, and then automatically
assign relevant tags, such as UI and Easy To Use.

Some Examples of Text Classification:


Sentiment Analysis.
Language Detection.
Fraud, Profanity & Online Abuse Detection.
Detecting Trends in Customer Feedback.
Urgency Detection in Customer Support.

Excellence and Service


279
CHRIST
Deemed to be University

Why is Text Classification Important?

● It’s estimated that around 80% of all information is unstructured, with


text being one of the most common types of unstructured data.
● Because of the messy nature of text, analyzing, understanding,
organizing, and sorting through text data is hard and time-consuming,
so most companies fail to use it to its full potential.
● This is where text classification with machine learning comes in.
● Using text classifiers, companies can automatically structure all
manner of relevant text, from emails, legal documents, social media,
chatbots, surveys, and more in a fast and cost-effective way.
● This allows companies to save time analyzing text data, automate
business processes, and make data-driven business decisions.

Excellence and Service


280
CHRIST
Deemed to be University

Excellence and Service


281
CHRIST
Deemed to be University

Excellence and Service


282
CHRIST
Deemed to be University

Excellence and Service


283
CHRIST
Deemed to be University

Excellence and Service


284
CHRIST
Deemed to be University

Excellence and Service


285
CHRIST
Deemed to be University

Excellence and Service


286
Christ University

Classification Methods

• Decision Tree Induction
• Neural Networks
• Bayesian Classification
• Association-Based Classification
• K-Nearest Neighbour
• Case-Based Reasoning
• Genetic Algorithms
• Rough Set Theory
• Fuzzy Sets

Excellence and Service


Christ University

Types of classification tasks

1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
4. Imbalanced Classification

Excellence and Service


Christ University

Binary Classification
• Binary classification refers to those classification tasks that
have two class labels or two outcomes.

• Examples include:
1. Email spam detection (spam or not).
2. Churn prediction (churn or not).
3. Stock trend prediction (buy or not).

• Typically, binary classification tasks involve one class that is


the normal state and another class that is the abnormal state.

Excellence and Service


Christ University

Binary Classification

• For example “not spam” is the normal state and “spam” is the
abnormal state.

• Another example is “cancer not detected” is the normal state


of a task that involves a medical test and “cancer detected” is
the abnormal state.

• The class for the normal state is assigned the class label 0 and
the class with the abnormal state is assigned the class label 1.

Excellence and Service


Christ University
Binary Classification

Excellence and Service


Christ University
Binary Classification

Excellence and Service


Christ University

Multi-Class Classification

• Multi-class classification refers to those classification tasks


that have more than two class labels.
• Examples include:
1. Face classification.
2. Plant species classification.
3. Optical character recognition.

• Unlike binary classification, multi-class classification does not


have the notion of normal and abnormal outcomes.
• Instead, examples are classified as belonging to one among a
range of known classes.

Excellence and Service


Christ University
Multi-Class Classification

• The number of class labels may be very large on some


problems. For example, a model may predict a photo as
belonging to one among thousands or tens of thousands of
faces in a face recognition system.

Excellence and Service


Christ University
Multi-Class Classification

Excellence and Service


Christ University
Multi-Class Classification

Excellence and Service


Christ University

Multi-Label Classification
• Multi-label classification refers to those classification tasks
that have two or more class labels, where one or more class
labels may be predicted for each example.

• i.e., multi-label classification is applied when one input can belong to more than one class.

• Consider the example of photo classification, where a given


photo may have multiple objects in the scene and a model may
predict the presence of multiple known objects in the photo,
such as “bicycle,” “apple,” “person,” etc.

Excellence and Service


Christ University

Multi-Label Classification

• Another example is a person who is a citizen of two countries.

• To work with this type of classification, you need to build a


model that can predict multiple outputs.

• This is unlike binary classification and multi-class


classification, where a single class label is predicted for each
example.

Excellence and Service


Christ University
Multi-Label Classification

Excellence and Service


CHRIST
Deemed to be University

Classification of Machine Learning Algorithms

● Supervised learning
● Unsupervised learning
● Reinforcement learning

Excellence and Service


CHRIST
Deemed to be University

Supervised vs. Unsupervised Learning

● Supervised learning (classification)


○ Supervision: The training data (observations,
measurements, etc.) are accompanied by
labels indicating the class of the observations
○ New data is classified based on the training
set
● Unsupervised learning (clustering)
○ The class labels of the training data are unknown

○ Given a set of measurements, observations,


etc. with the aim of establishing the existence
of classes or clusters in the data

Excellence and Service


CHRIST
Deemed to be University
Supervised learning algorithms

Various algorithms and computation techniques are used in supervised machine


learning processes.
● Neural networks
● Naive Bayes
● Linear regression
● Logistic regression
● Support vector machine (SVM)
● K-nearest neighbor
● Random forest

Excellence and Service


CHRIST
Deemed to be University

Supervised learning Projects

● Image- and object-recognition


● Predictive analytics
● Customer sentiment analysis
● Spam detection

Excellence and Service


CHRIST
Deemed to be University

Challenges of supervised learning

Although supervised learning can offer businesses advantages, such as deep


data insights and improved automation, there are some challenges when
building sustainable supervised learning models. The following are some of
these challenges:
● Supervised learning models can require certain levels of expertise to
structure accurately.
● Training supervised learning models can be very time intensive.
● Datasets can have a higher likelihood of human error, resulting in algorithms
learning incorrectly.
● Unlike unsupervised learning models, supervised learning cannot cluster or
classify data on its own.

Excellence and Service


CHRIST
Deemed to be University

Unsupervised Learning

● Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.

Popular unsupervised learning algorithms:


• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
• KNN (k-nearest neighbors)
• Neural Networks

Excellence and Service


CHRIST
Deemed to be University

Advantages of Unsupervised Learning

Advantages:
● Unsupervised learning is used for more complex tasks as compared to
supervised learning because, in unsupervised learning, we don't have
labeled input data.
● Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages:
● Unsupervised learning is intrinsically more difficult than supervised learning
as it does not have corresponding output.
● The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

Excellence and Service


CHRIST
Deemed to be University

Reinforcement Learning
● Reinforcement learning is a feedback-based learning method, in which a
learning agent gets a reward for each right action and gets a penalty for each
wrong action.
● The agent learns automatically with these feedbacks and improves its
performance. In reinforcement learning, the agent interacts with the
environment and explores it. The goal of an agent is to get the most reward
points, and hence, it improves its performance.
● The robotic dog, which automatically learns the movement of his arms, is an
example of Reinforcement learning.

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


308
CHRIST
Deemed to be University

Excellence and Service


309
CHRIST
Deemed to be University

Excellence and Service


310
CHRIST
Deemed to be University

Excellence and Service


311
CHRIST
Deemed to be University

Excellence and Service


312
CHRIST
Deemed to be University

Why use machine learning text classification?

● Scalability: Manually analyzing and organizing text is slow and much less accurate. Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. Text classification tools are scalable to any business need, large or small.
● Real-time analysis: There are critical situations that companies need
to identify as soon as possible and take immediate action (e.g., PR
crises on social media). Machine learning text classification can
follow your brand mentions constantly and in real-time, so you'll
identify critical information and be able to take action right away.

Excellence and Service


313
CHRIST
Deemed to be University

● Consistent criteria: Human annotators make mistakes when classifying text


data due to distractions, fatigue, and boredom, and human subjectivity
creates inconsistent criteria. Machine learning, on the other hand, applies the
same lens and criteria to all data and results. Once a text classification model
is properly trained it performs with unsurpassed accuracy.

Excellence and Service


314
CHRIST
Deemed to be University

Text Classification

● Text classification can be done in two ways: manual or automatic.


● Manual text classification involves a human annotator, who interprets
the content of text and categorizes it accordingly. This method can
deliver good results but it’s time-consuming and expensive.
● Automatic text classification applies machine learning, natural
language processing (NLP), and other AI-guided techniques to
automatically classify text in a faster, more cost-effective, and more
accurate manner.
● There are many approaches to automatic text classification, but they
all fall under three types of systems:
● Rule-based systems
● Machine learning-based systems
● Hybrid systems

Excellence and Service


315
CHRIST
Deemed to be University

Rule-based systems
● Rule-based approaches classify text into organized groups by using a
set of handcrafted linguistic rules.
● These rules instruct the system to use semantically relevant elements
of a text to identify relevant categories based on its content.
● Each rule consists of an antecedent or pattern and a predicted
category.
● Example: Say that you want to classify news articles into two groups:
Sports and Politics.
● First, you’ll need to define two lists of words that characterize each
group (e.g., words related to sports such as football, basketball,
LeBron James, etc., and words related to politics, such as Donald
Trump, Hillary Clinton, Putin, etc.).

Excellence and Service


316
CHRIST
Deemed to be University

● Next, when you want to classify a new incoming text, you’ll need to count the number of sports-related words that appear in the text and do the same for politics-related words.
● If the number of sports-related word appearances is greater than the politics-related word count, the text is classified as Sports, and vice versa.
● For example, this rule-based system will classify the headline “When is LeBron James' first game with the Lakers?” as Sports, because it counted one sports-related term (LeBron James) and no politics-related terms.
● Rule-based systems are human-comprehensible and can be improved over time, but this approach has some disadvantages.
● For starters, these systems require deep knowledge of the domain. They are also time-consuming, since generating rules for a complex system can be quite challenging and usually requires a lot of analysis and testing.
● Rule-based systems are also difficult to maintain and don’t scale well, given that adding new rules can affect the results of pre-existing rules.
Excellence and Service


317
CHRIST
Deemed to be University

Example of a simple rule-based system for detecting passive voice in sentences using Python

import re

def detect_passive_voice(sentence):
    # A crude pattern: a form of "to be" followed later in the sentence by "by"
    passive_voice_pattern = r'\b(am|is|are|was|were|be|been|being)\b.*\b(by)\b'
    if re.search(passive_voice_pattern, sentence):
        return True
    return False

# Example sentences
sentences = [
    "The cake was eaten by the child.",
    "The child ate the cake."
]

# Detect passive voice
for sentence in sentences:
    if detect_passive_voice(sentence):
        print(f"Passive voice detected: {sentence}")
    else:
        print(f"Active voice: {sentence}")

Excellence and Service


318
CHRIST
Deemed to be University

Machine learning-based systems


● Instead of relying on manually crafted rules, machine learning text
classification learns to make classifications based on past
observations.
● By using pre-labeled examples as training data, machine learning
algorithms can learn the different associations between pieces of text,
and that a particular output (i.e., tags) is expected for a particular
input (i.e.,text).
● A “tag” is the predetermined classification or category that any given
text could fall into.
● The first step towards training a machine learning NLP classifier is
feature extraction: a method used to transform each text into a
numerical representation in the form of a vector.
● One of the most frequently used approaches is the bag of words,
where a vector represents the frequency of a word in a predefined
dictionary of words.
Excellence and Service
319
CHRIST
Deemed to be University

● For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the text “This is awesome,”
● we would have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0). Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors for each text example) and tags (e.g., sports, politics) to produce a classification model:
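A tiny sketch of this feature-extraction step, using the dictionary from the example:

# Map a text to a binary bag-of-words vector over a fixed dictionary.
dictionary = ["This", "is", "the", "not", "awesome", "bad", "basketball"]

def to_vector(text):
    tokens = text.split()
    return [1 if word in tokens else 0 for word in dictionary]

print(to_vector("This is awesome"))   # [1, 1, 0, 0, 1, 0, 0]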

Excellence and Service


320
CHRIST
Deemed to be University

Excellence and Service


321
CHRIST
Deemed to be University

● Once it’s trained with enough training samples, the machine learning
model can begin to make accurate predictions.
● The same feature extractor is used to transform unseen text to feature
sets, which can be fed into the classification model to get predictions
on tags (e.g., sports, politics):

Excellence and Service


322
CHRIST
Deemed to be University

Excellence and Service


323
CHRIST
Deemed to be University

● Text classification with machine learning is usually much more


accurate than human-crafted rule systems, especially on complex NLP
classification tasks.
● Also, classifiers with machine learning are easier to maintain and you
can always tag new examples to learn new tasks.
● Machine Learning Text Classification Algorithms: some of the most popular text classification algorithms include the Naive Bayes family of algorithms, support vector machines (SVM), and deep learning.

Excellence and Service


324
CHRIST
Deemed to be University

Naive Bayes
● The Naive Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis overall.
● One member of that family is Multinomial Naive Bayes (MNB), with a huge advantage: you can get really good results even when your dataset isn’t very large (~a couple of thousand tagged samples) and computational resources are scarce.
● Naive Bayes is based on Bayes’ Theorem, which helps us compute the conditional probability of one event given another, based on the probabilities of the individual events. So we calculate the probability of each tag for a given text, and then output the tag with the highest probability.

Naive Bayes formula.

Excellence and Service


325
CHRIST
Deemed to be University

● The probability of A, given that B is true, is equal to the probability of B given A, times the probability of A, divided by the probability of B: P(A|B) = P(B|A) × P(A) / P(B).
● This means that any vector that represents a text will have to contain information about the probabilities of the appearance of certain words within the texts of a given category, so that the algorithm can compute the likelihood of that text belonging to the category.

Excellence and Service


326
CHRIST
Deemed to be University

Why is it called Naïve Bayes?


● The Naïve Bayes algorithm is comprised of two words Naïve
and Bayes, Which can be described as:

● Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.

● Bayes: It is called Bayes because it depends on the principle


of Bayes' Theorem.
Excellence and Service
CHRIST
Deemed to be University

Bayes' Theorem
● Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It is based on the conditional probability.

● Thomas Bayes was an English statistician and philosopher who


is known for formulating a specific case of the theorem that
bears his name: Bayes' theorem.

● Bayes Theorem is the base for the Naive Bayes Algorithm

Excellence and Service


CHRIST
Deemed to be University

Contd...

● Bayes' Theorem is a way of finding a probability when we know


certain other probabilities.

● It states that the probability of event A given B is equal to the probability of event B given A, multiplied by the probability of A, divided by the probability of B: P(A|B) = P(B|A) × P(A) / P(B)

Excellence and Service


CHRIST
Deemed to be University

Contd....

Where

● P(A|B) is Posterior probability: Probability of hypothesis A on the


observed event B.
The posterior probability combines the prior probability with the new evidence (the likelihood).

● P(B|A) is Likelihood probability: Probability of the evidence given that the


probability of a hypothesis is true.

● P(A) is Prior Probability: Probability of hypothesis before observing the


evidence.

● P(B) is Marginal Probability: Probability of Evidence.
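A small worked example of the theorem with made-up numbers (the probabilities are assumed purely for illustration):

# Suppose 30% of emails are spam, the word "offer" appears in 60% of spam
# emails, and "offer" appears in 25% of all emails.
p_spam = 0.30                 # P(A): prior probability
p_offer_given_spam = 0.60     # P(B|A): likelihood
p_offer = 0.25                # P(B): marginal probability of the evidence

p_spam_given_offer = p_offer_given_spam * p_spam / p_offer   # P(A|B): posterior
print(p_spam_given_offer)     # 0.72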

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

PROBLEM-1

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Predicting a class label using naïve Bayesian classification

● We wish to predict the class label of a tuple using naïve Bayesian classification, given the training data shown in Table 8.1.

● The data tuples are described by the attributes age, income,


student, and credit rating.

● The class label attribute, buys computer, has two distinct values
(namely, yes, no).
● Classify following tuple
● X = (age =youth, income =medium, student = yes, credit rating
= fair)

Excellence and Service


CHRIST
Deemed to be University

Naïve Bayes Classifier: Comments


● Advantages
○ Easy to implement
○ Good results obtained in most of the cases
● Disadvantages
○ Assumption: class conditional independence, therefore loss of
accuracy
○ Practically, dependencies exist among variables
■ E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
■ Dependencies among these cannot be modeled by Naïve Bayes Classifier
● How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)

Excellence and Service


CHRIST
Deemed to be University

APPLICATIONS

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

Excellence and Service


CHRIST
Deemed to be University

How to Build a Basic Model Using Naive Bayes in Python

● Again, scikit-learn (a Python library) will help here to build a Naive Bayes model in Python. There are five types of NB models under the scikit-learn library:
● Gaussian Naive Bayes: GaussianNB is used in classification tasks and assumes that feature values follow a Gaussian (normal) distribution. When plotted, this gives a bell-shaped curve that is symmetric about the mean of the feature values.
● Multinomial Naive Bayes: MultinomialNB is used for discrete counts. For example, in a text classification problem, instead of just recording “word occurs in the document”, we record “how often the word occurs in the document”; you can think of it as “the number of times outcome x_i is observed over n trials”.

Excellence and Service


CHRIST
Deemed to be University

● Bernoulli Naive Bayes: BernoulliNB is useful if your feature vectors are boolean (i.e., zeros and ones). One application is text classification with a ‘bag of words’ model where the 1s and 0s mean “word occurs in the document” and “word does not occur in the document”, respectively.
● Categorical Naive Bayes: CategoricalNB is useful if the features are categorically distributed. We have to encode the categorical variables in numeric format (for example, with an ordinal encoder) before using this algorithm.
● (The fifth variant, Complement Naive Bayes, is an adaptation of Multinomial Naive Bayes suited to imbalanced datasets.)
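A short sketch comparing two of these variants on a toy spam/ham dataset (the texts and labels are invented for illustration):

# MultinomialNB on word counts vs. BernoulliNB on binary word indicators.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

texts  = ["win money now", "meeting at noon", "win a free prize", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

counts = CountVectorizer().fit(texts)
X = counts.transform(texts)
new = counts.transform(["free money"])

print(MultinomialNB().fit(X, labels).predict(new))   # ['spam'] on this toy data
print(BernoulliNB().fit(X, labels).predict(new))     # ['spam'] on this toy data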

Excellence and Service


CHRIST
Deemed to be University

Support Vector Machines

● Support Vector Machines (SVM) is another powerful text


classification machine learning algorithm because like Naive Bayes,
SVM doesn’t need much training data to start providing accurate
results. SVM does, however, require more computational resources
than Naive Bayes, but the results are even faster and more accurate. In
short, SVM draws a line or “hyperplane” that divides a space into two
subspaces. One subspace contains vectors (tags) that belong to a
group, and another subspace contains vectors that do not belong to
that group.
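A brief sketch of a linear SVM text classifier in scikit-learn (the tiny Sports/Politics dataset is invented for illustration):

# Linear SVM over TF-IDF features for a two-class text problem.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts  = ["the team won the football match",
          "the election results were announced",
          "a great basketball game last night",
          "the senate passed the new bill"]
labels = ["Sports", "Politics", "Sports", "Politics"]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(["who won the basketball game"])))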

Excellence and Service


357
CHRIST
Deemed to be University

● The optimal hyperplane is the one with the largest distance (margin) between each tag.
● In two dimensions, those vectors are representations of your training texts, and a group is a tag you have tagged your texts with.
● As data gets more complex, it may not be possible to separate the vectors/tags with a simple straight line, so the boundary looks like this:

Excellence and Service


358
CHRIST
Deemed to be University

Fig. Optimal SVM Hyperplane

Excellence and Service


359
CHRIST
Deemed to be University

SVM—Support Vector Machines

● A relatively new classification method for both linear and nonlinear


data
● It uses a nonlinear mapping to transform the original training data into
a higher dimension
● With the new dimension, it searches for the linear optimal separating
hyperplane (i.e., “decision boundary”)
● With an appropriate nonlinear mapping to a sufficiently high
dimension, data from two classes can always be separated by a
hyperplane
● SVM finds this hyperplane using support vectors (“essential” training
tuples) and margins (defined by the support vectors)

Excellence and Service


CHRIST
Deemed to be University

SVM—History and Applications

● Vapnik and colleagues (1992)—groundwork from Vapnik &


Chervonenkis’ statistical learning theory in 1960s
● Features: training can be slow but accuracy is high owing to their
ability to model complex nonlinear decision boundaries (margin
maximization)
● Used for: classification and numeric prediction
● Applications:
○ handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests

Excellence and Service


CHRIST
Deemed to be University

SVM—General Philosophy

(Figure: two separating hyperplanes, one with a small margin and one with a large margin; the training tuples lying on the margins are the support vectors.)

Excellence and Service


CHRIST
Deemed to be University

SVM—Margins and Support Vectors

Excellence and Service


CHRIST
Deemed to be University

SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class
labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to find the best one (the
one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane
(MMH)

Excellence and Service


CHRIST
Deemed to be University

Deep Learning

● Deep learning is a set of algorithms and techniques inspired by how the human brain works, called neural networks.
● Deep learning architectures offer huge benefits for text classification because they perform at super high accuracy with lower-level engineering and computation.
● The two main deep learning architectures for text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
● Deep learning is hierarchical machine learning, using multiple algorithms in a progressive chain of events. It’s similar to how the human brain works when making decisions, using different techniques simultaneously to process huge amounts of data.

Excellence and Service


367
CHRIST
Deemed to be University

● Deep learning algorithms do require much more training data than traditional machine learning algorithms (at least millions of tagged examples).
● However, they don’t hit a learning threshold the way traditional machine learning algorithms such as SVM and Naive Bayes do: deep learning classifiers continue to get better the more data you feed them.
● Embedding algorithms such as Word2Vec or GloVe are also used to obtain better vector representations for words and improve the accuracy of classifiers trained with traditional machine learning algorithms.

Excellence and Service


368
CHRIST
Deemed to be University

Hybrid Systems
● Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the results.
● These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been correctly modeled by the base classifier.

Excellence and Service


369
CHRIST
Deemed to be University

Text Classification Using TF-IDF


Develop a text classification model for any use case using NLTK and the TF-IDF model. A sketch of this workflow is given below.
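One possible sketch of this exercise combines NLTK stop words with scikit-learn's TfidfVectorizer and a logistic regression classifier. The tiny review dataset and the choice of classifier are assumptions made for illustration, and nltk.download('stopwords') is expected to have been run.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["the user interface is straightforward and easy to use",
          "the app keeps crashing and support never replies",
          "setup was quick and the design looks great",
          "billing errors and very slow customer service"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features (with English stop words removed) feeding a classifier.
model = make_pipeline(TfidfVectorizer(stop_words=stopwords.words("english")),
                      LogisticRegression())
model.fit(texts, labels)
print(model.predict(["easy to use and a great design"]))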

Excellence and Service


370
CHRIST
Deemed to be University

VECTOR SEMANTICS AND EMBEDDINGS

● The idea of vector semantics is to represent a word as a point in a multidimensional semantic space that is derived from the distributions of its word neighbors.
● Vectors for representing words are called embeddings.
● Vector semantics defines and interprets word meaning to explain features such as word similarity.
● Its central idea is: two words are similar if they appear in similar contexts.

Excellence and Service


371
CHRIST
Deemed to be University

● In its current form, the vector model draws its inspiration from the linguistic and philosophical work of the 1950s.
● Vector semantics represents a word in a multi-dimensional vector space. The vector model is also called embeddings, because a word is embedded in a particular vector space.
● The vector model offers many advantages in NLP. For example, in sentiment analysis, a classifier can set up a decision boundary and predict whether the sentiment is positive or negative (a binomial classification).
● Another key practical advantage of vector semantics is that it can be learned automatically from text without complex labeling or supervision.
● As a result of these advantages, vector semantics has become a de-facto standard for NLP applications such as sentiment analysis, Named Entity Recognition (NER), topic modeling, and so on.

Excellence and Service


372
CHRIST
Deemed to be University

WORD EMBEDDINGS

● Word Embeddings are the texts converted into numbers


and there may be different numerical representations of
the same text.
● Word Embedding is an approach for representing words
and documents.
● Word Embedding or Word Vector is a numeric vector
input that represents a word in a lower-dimensional
space.
● It allows words with similar meanings to have a similar
representation.

Excellence and Service


373
CHRIST
Deemed to be University

What is Word Embedding in NLP?


● Word Embeddings are a method of extracting features out of text so
that we can input those features into a machine learning model to
work with text data.
● They try to preserve syntactical and semantic information.
● The methods such as Bag of Words (BOW), CountVectorizer and
TFIDF rely on the word count in a sentence but do not save any
syntactical or semantic information.
● In these algorithms, the size of the vector is the number of elements in
the vocabulary.
● We can get a sparse matrix if most of the elements are zero. Large
input vectors will mean a huge number of weights which will result in
high computation required for training. Word Embeddings give a
solution to these problems.

Excellence and Service


374
CHRIST
Deemed to be University

● As we know that many Machine Learning algorithms and


almost all Deep Learning Architectures are not capable of
processing strings or plain text in their raw form.
● In a broad sense, they require numerical numbers as inputs to
perform any sort of task, such as classification, regression,
clustering, etc. Also, from the huge amount of data that is
present in the text format, it is imperative to extract some
knowledge out of it and build any useful applications.
● In short, we can say that to build any model in machine
learning or deep learning, the final level data has to be in
numerical form because models don’t understand text or
image data directly as humans do.

Excellence and Service


375
CHRIST
Deemed to be University

Excellence and Service


376
CHRIST
Deemed to be University

How did NLP models learn patterns from text data?

● To convert the text data into numerical data, we need some


smart ways which are known as vectorization, or in the NLP
world, it is known as Word embeddings.
● Therefore, Vectorization or word embedding is the process of
converting text data to numerical vectors.
● Later those vectors are used to build various machine
learning models.
● In this manner, we say this as extracting features with the
help of text with an aim to build multiple natural languages,
processing models

Excellence and Service


377
CHRIST
Deemed to be University

Need for Word Embedding?


● To reduce dimensionality
● To use a word to predict the words around it.
● Inter-word semantics must be captured.
How are Word Embeddings used?
● They are used as input to machine learning models.
Take the words —-> Give their numeric representation —->
Use in training or inference.
● To represent or visualize any underlying patterns of usage in
the corpus that was used to train them.

Excellence and Service


378
CHRIST
Deemed to be University

Different types of Word Embeddings

Broadly, we can classify word embeddings into the following two categories:
● Frequency-based or statistics-based word embeddings
● Prediction-based word embeddings

Frequency-based Embedding
1. Count Vector
2. One-Hot Encoding
3. Bag of Words
4. TF-IDF Vector

Excellence and Service


379
CHRIST
Deemed to be University

One-Hot Encoding (OHE)


● In this technique, we represent each unique word in vocabulary by setting a
unique token with value 1 and rest 0 at other positions in the vector.
● In simple words, a vector representation of a one-hot encoded vector
represents in the form of 1, and 0 where 1 stands for the position where the
word exists and 0 everywhere else.
● Example:
● Sentence: I am teaching NLP in Python
● A word in this sentence may be “NLP”, “Python”, “teaching”, etc.
● A dictionary is defined as the list of all unique words present in the sentence, so the dictionary may look like:
● Dictionary: [‘I’, ‘am’, ‘teaching’, ‘NLP’, ‘in’, ‘Python’]
● Therefore, the vector representations according to the above dictionary are:
● Vector for NLP: [0, 0, 0, 1, 0, 0]
● Vector for Python: [0, 0, 0, 0, 0, 1]
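A minimal sketch of the encoding above (illustrative only):

# Build one-hot vectors over the dictionary from the example.
dictionary = ["I", "am", "teaching", "NLP", "in", "Python"]

def one_hot(word):
    return [1 if w == word else 0 for w in dictionary]

print(one_hot("NLP"))      # [0, 0, 0, 1, 0, 0]
print(one_hot("Python"))   # [0, 0, 0, 0, 0, 1]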

Excellence and Service


380
CHRIST
Deemed to be University

Disadvantages of One-hot Encoding


1. One of the disadvantages of One-hot encoding is that the Size
of the vector is equal to the count of unique words in the
vocabulary.
2. One-hot encoding does not capture the relationships between
different words. Therefore, it does not convey information about
the context.

Excellence and Service


381
CHRIST
Deemed to be University

Count Vectorizer

● It is one of the simplest ways of doing text vectorization.


● It creates a document term matrix, which is a set of dummy
variables that indicates if a particular word appears in the
document.
● Count vectorizer will fit and learn the word vocabulary and try to
create a document term matrix in which the individual cells denote
the frequency of that word in a particular document, which is also
known as term frequency, and the columns are dedicated to each
word in the corpus.
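A short sketch of the count vectorizer with scikit-learn (the two sentences are illustrative):

# Document-term matrix of raw term frequencies.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I am teaching NLP in Python",
          "NLP is easy to learn in Python"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

print(vec.get_feature_names_out())   # the learned vocabulary (one column per word)
print(X.toarray())                   # each cell = term frequency in that document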

Excellence and Service


382
CHRIST
Deemed to be University

Excellence and Service


383
CHRIST
Deemed to be University

Bag-of-Words(BoW)
This vectorization technique converts the text content to numerical feature vectors. Bag of
Words takes a document from a corpus and converts it into a numeric vector by mapping
each document word to a feature vector for the machine learning model.

Disadvantages of Bag of Words

1. This method doesn’t preserve the word order.

2. It does not allow to draw of useful inferences for downstream NLP tasks.

Excellence and Service


384
CHRIST
Deemed to be University

N-grams Vectorization

● Similar to the count vectorization technique, in the N-Gram


method, a document term matrix is generated, and each cell
represents the count.
● The columns represent all columns of adjacent words of length
n.
● Count vectorization is a special case of N-Gram where n=1.
● N-grams consider sequences of n words in the text, where n is 1, 2, 3, …: a 1-gram is a single token, a 2-gram is a token pair, and so on.
● Unlike BoW, it maintains local word order.

Excellence and Service


385
CHRIST
Deemed to be University

For example,
● “I am studying NLP” has four words, so n can be up to 4.
● If n=2 (bigram), the columns would be: [“I am”, “am studying”, “studying NLP”]
● If n=3 (trigram), the columns would be: [“I am studying”, “am studying NLP”]
● If n=4 (four-gram), the column would be: [“I am studying NLP”]
● The value of n is chosen based on performance.
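The same idea with scikit-learn's CountVectorizer, restricted to bigrams (a sketch; the token_pattern is only needed to keep the one-letter word "I", which the default pattern drops):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
X = vec.fit_transform(["I am studying NLP"])
print(vec.get_feature_names_out())   # ['am studying', 'i am', 'studying nlp']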

Excellence and Service


386
CHRIST
Deemed to be University

● How does the value of N in the N-grams method create a


tradeoff?
● The trade-off is created while choosing the N values. If we
choose N as a smaller number, then it may not be sufficient
enough to provide the most useful information.
● But on the contrary, if we choose N as a high value, then it will
yield a huge matrix with lots of features.
● Therefore, N-gram may be a powerful technique, but it needs a
little more care.

Excellence and Service


387
CHRIST
Deemed to be University

Disadvantages of N-Grams
1. It has too many features.
2. Due to too many features, the feature set becomes too sparse and
is computationally expensive.
3. Choosing the optimal value of N is not an easy task.

Excellence and Service


388
CHRIST
Deemed to be University

TF-IDF Vectorization
● Term frequency-inverse document frequency ( TF-IDF) gives a
measure that takes the importance of a word into consideration
depending on how frequently it occurs in a document and a
corpus.
● This algorithm works on a statistical measure of finding word
relevance in the text that can be in the form of a single document or
various documents that are referred to as corpus.
● TF-IDF algorithm finds application in solving simpler natural
language processing and machine learning problems for tasks like
information retrieval, stop words removal, keyword extraction, and
basic text analysis. However, it does not capture the semantic
meaning of words efficiently in a sequence.

Excellence and Service


389
CHRIST
Deemed to be University

Important Points about TF-IDF Vectorizer


1. Similar to the count vectorization method, in the TF-IDF method, a document term
matrix is generated and each column represents an individual unique word.
2. The difference in the TF-IDF method is that each cell doesn’t indicate the term
frequency, but contains a weight value that signifies how important a word is for an
individual text message or document
3. This method is based on the frequency method but it is different from the count
vectorization in the sense that it takes into considerations not just the occurrence of a
word in a single document but in the entire corpus.
4. TF-IDF gives more weight to less frequently occurring events and less weight to
expected events. So, it penalizes frequently occurring words that appear frequently in
a document such as “the”, “is” but assigns greater weight to less frequent or rare
words.
5. The product of TF x IDF of a word indicates how often the token is found in the
document and how unique the token is to the whole entire corpus of documents.

Excellence and Service


390
CHRIST
Deemed to be University

● Important Note
● All of the techniques discussed here suffer from the issue of vector sparsity; as a result, they don’t handle complex word relations well and are not able to model long sequences of text.
● So, in the next part we will look at newer text vectorization techniques that try to overcome this problem.

Excellence and Service


391
CHRIST
Deemed to be University

Prediction-based Embedding

● So far, we have seen deterministic methods to determine word vectors. But these methods proved to be limited in their word representations until Mikolov et al. introduced word2vec to the NLP community.
● The new methods were prediction-based, in the sense that they assigned probabilities to words, and they proved to be state of the art for tasks like word analogies and word similarities.
● They were also able to solve analogies like King - man + woman ≈ Queen, which was considered an almost magical result.

Excellence and Service


392
CHRIST
Deemed to be University

Word2Vec
● Word2vec is not a single algorithm but a combination of two
techniques – CBOW(Continuous bag of words) and Skip-gram
model. Both of these are shallow neural networks which map
word(s) to the target variable which is also a word(s). Both of
these learn weights which act as word vector representation
techniques.
“A man is known by the company he keeps”

This is a very well know saying. And word2vec also works primarily on this idea. A word
is known by the company it keeps. This sounds so strange and funny but it gives amazing
results.

Excellence and Service


393
CHRIST
Deemed to be University

Word2Vec
● The Word2Vec method was developed at Google in 2013 and is widely used in advanced natural language processing (NLP) problems. It was invented for training word embeddings and is based on the distributional hypothesis.
● Under this hypothesis, it is trained with either skip-grams or a continuous bag of words (CBOW).
● These are basically shallow neural networks with an input layer, a projection layer, and an output layer. The model reconstructs the linguistic context of words by considering the surrounding words, both before and after the target word.
● The method iterates over a corpus of text to learn the associations between words. It relies on the hypothesis that neighboring words in a text are semantically similar to each other, and it maps semantically similar words to geometrically close embedding vectors.
It uses the cosine similarity metric to measure semantic similarity.

Cosine similarity is equal to cos(angle), where the angle is measured between the vector representations of two words/documents.

● If the cosine similarity is 1 (an angle of 0°), the two vectors point in the same direction, i.e. the words occur in essentially the same contexts.
● If the angle is a right angle (90°), the cosine is 0, which means the words hold no contextual similarity and are independent of each other.
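A small NumPy sketch of this computation (the vectors here are arbitrary toy values, not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_king  = np.array([0.9, 0.1, 0.4])   # toy word vectors
v_queen = np.array([0.8, 0.2, 0.5])

print(cosine_similarity(v_king, v_queen))                              # close to 1: similar contexts
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0: orthogonal, no similarity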
Some sample sentences:


● The rewards of all your hard work in the garden are easy to
see.
● It is hard work to keep my room in proper order.
● The hard work began to tell on him.
In other words, if the computer learns that the words “hard” and “work” occur in close proximity to each other, it will learn the vectors accordingly.
If our “target” word is “hard”, for which we need to learn a good vector, we provide the computer with its nearby or “context” word, which here is “work”, chosen from among “the”, “began”, “is”, etc.
Intuition:
Text : ['Best way to success is through hard work and persistence']
To model our input will be like :
Target word : Best. Context word : (way).
Target word : way. Now here we have two words, one before and after. So in this scenario our
context words will be : (Best,to).
Target word : to. Now here also we have two words, one before and after i,e (way,success). But
if we think again, then “to” and “is” can be found in sentences nearby to each other. Like “He is
going to the market”. So it is a good idea if we include the word “is” in the list of context words.
But now we could argue about “Best” or “through” as well. Here comes the concept of “window size”: the window size is the number of nearby words on each side that we decide to consider. So if the window size is 1, the list of context words becomes (way, success), and if the window size is 2, it becomes (Best, way, success, is).

Similarly, we can formulate the context list for the rest of the words.
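A rough sketch of how such (target, context) pairs could be generated for a chosen window size (the function and variable names are ours, purely illustrative):

def make_pairs(tokens, window=2):
    # Return (target, [context words]) pairs for each position in the sentence
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

tokens = "best way to success is through hardwork and persistence".split()
for target, context in make_pairs(tokens, window=2):
    print(target, "->", context)

For the word "to" with window size 2, this prints the context (best, way, success, is), matching the example above.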
Step 1: Read data from a file


Step 2: Generate variables
text = ['Best way to success is through hardwork and persistence']
Number of unique words: 9
word_to_index : {'best': 0, 'way': 1, 'to': 2, 'success': 3, 'is': 4,
'through': 5, 'hardwork': 6, 'and': 7, 'persistence': 8}
index_to_word : {0: 'best', 1: 'way', 2: 'to', 3: 'success', 4: 'is', 5:
'through', 6: 'hardwork', 7: 'and', 8: 'persistence'}
corpus: ['best', 'way', 'to', 'success', 'is', 'through', 'hardwork', 'and',
'persistence']
Length of corpus : 9
Step 3: Generate training data

text = ['Best way to success is through hardwork and persistence']


Window size = 2, Vocab size = 9
We set the position given by the word_to_index dict to 1, i.e. best : 0, so the 0th index of the one-hot vector for “best” is set to 1.
Target word = best
Context words = (way,to)
Target_word_one_hot_vector = [1, 0, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector = [0, 1, 1, 0, 0, 0, 0, 0, 0]
Target word = way
Context words = (best,to,success)
Target_word_one_hot_vector = [0, 1, 0, 0, 0, 0, 0, 0, 0]
Context_word_one_hot_vector= [1, 0, 1, 1, 0, 0, 0, 0, 0]
Instead of having the indices in a single list, we have two different lists. The problem with this approach is that it will take a lot of space as the size of our data increases. To train a model, we need the data in the form of (X, Y), i.e. (target_words, context_words).
Target word:best . Target vector: [1. 0. 0. 0. 0. 0. 0. 0. 0.]
Context word:['way', 'to'] .
Context vector: [0. 1. 1. 0. 0. 0. 0. 0. 0.]
**************************************************
Target word:way . Target vector: [0. 1. 0. 0. 0. 0. 0. 0. 0.]
Context word:['best', 'to', 'success'] .
Context vector: [1. 0. 1. 1. 0. 0. 0. 0. 0.]
**************************************************
Target word:hardwork . Target vector: [0. 0. 0. 0. 0. 0. 1. 0. 0.]
Context word:['through', 'is', 'and', 'persistence'] .
Context vector: [0. 0. 0. 0. 1. 1. 0. 1. 1.]
**************************************************
Target word:and . Target vector: [0. 0. 0. 0. 0. 0. 0. 1. 0.]
Context word:['hardwork', 'through', 'persistence'] .
Context vector: [0. 0. 0. 0. 0. 1. 1. 0. 1.]
**************************************************
Target word:persistence . Target vector: [0. 0. 0. 0. 0. 0. 0. 0. 1.]
Context word:['and', 'hardwork'] .
Context vector: [0. 0. 0. 0. 0. 0. 1. 1. 0.]
CBOW (Continuous Bag of words)


The continuous bag of words variant includes various inputs that are taken by the neural network model. Out of this,
it predicts the targeted word that closely relates to the context of different words fed as input. It is fast and a great
way to find better numerical representation for frequently occurring words. Let us understand the concept of context
and the current word for CBOW.

In CBOW, we define a window size. The middle word is the current word and the surrounding words (past and
future words) are the context. CBOW utilizes the context to predict the current words.
Each word is encoded using One Hot Encoding in the defined vocabulary and sent to the CBOW neural network.
The hidden layer is a standard fully-connected dense layer. The output layer generates probabilities for the target
word from the vocabulary.
Consider, “Deep Learning is very hard and fun”. We need to set something known as window
size. Let’s say 2 in this case. What we do is iterate over all the words in the given data, which in this
case is just one sentence, and then consider a window of words which surrounds it. Here
since our window size is 2 we will consider 2 words behind the word and 2 words after the word,
hence each word will get 4 words associated with it. We will do this for each and every word in the
data and collect the word pairs.
Diagrammatic representation of the CBOW model
The matrix representation of the above image for a single data point is below.
● The flow is as follows:

● The input layer and the target are both one-hot encoded vectors of size [1 X V]. Here V = 10 in the above example.
● There are two sets of weights: one between the input and the hidden layer, and a second between the hidden and output layer.
Input-hidden layer matrix size = [V X N]; hidden-output layer matrix size = [N X V], where N is the number of dimensions we choose to represent our word in. It is arbitrary and a hyper-parameter for the neural network. N is also the number of neurons in the hidden layer. Here, N = 4.
● The input is multiplied by the input-hidden weights to give the hidden activation. It is simply the corresponding row of the input-hidden matrix copied.
● The hidden activation gets multiplied by the hidden-output weights and the output is calculated.
● The error between output and target is calculated and propagated back to re-adjust the weights.
● The weight matrix between the hidden layer and the output layer is taken as the word vector representation of the word.
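The flow above can be sketched in NumPy for a single context word; the weights here are random placeholders, whereas in practice they are learned by backpropagation.

import numpy as np

V, N = 10, 4                          # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))        # input-to-hidden weights  [V x N]
W_out = rng.normal(size=(N, V))       # hidden-to-output weights [N x V]

x = np.zeros(V); x[3] = 1.0           # one-hot encoded context word (index 3)

h = x @ W_in                          # hidden activation = row 3 of W_in copied
scores = h @ W_out                    # raw scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over candidate target words
print(probs.round(3))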
The image below describes the architecture for multiple context words.
EXAMPLE
Advantages of CBOW:
1. Being probabilistic in nature, it generally performs better than deterministic methods.
2. It is low on memory. It does not need to have huge RAM requirements
like that of co-occurrence matrix where it needs to store three huge
matrices.

Disadvantages of CBOW:
1. CBOW takes the average of the context of a word (as seen above in
calculation of hidden activation). For example, Apple can be both a fruit
and a company but CBOW takes an average of both the contexts and
places it in between a cluster for fruits and companies.
2. Training a CBOW from scratch can take forever if not properly
optimized.
Skip – Gram model


● Skip-gram is a slightly different word embedding technique in
comparison to CBOW as it does not predict the current word based
on the context. Instead, each current word is used as an input to a
log-linear classifier along with a continuous projection layer. This
way, it predicts words in a certain range before and after the
current word.
● This variant takes only one word as an input and then predicts the
closely related context words. That is the reason it can efficiently
represent rare words.
Skip – gram follows the same topology as of CBOW. It just flips CBOW’s architecture on
its head. The aim of skip-gram is to predict the context given a word. Let us take the same
corpus that we built our CBOW model on. C=”Hey, this is sample corpus using only one
context word.” Let us construct the training data.
● The input vector for skip-gram is going to be similar to that of a 1-context CBOW model. The calculations up to the hidden layer activations are also the same. The difference is in the target variable. Since we have defined a context window of 1 on both sides, there will be two one-hot encoded target variables and two corresponding outputs.
● Two separate errors are calculated with respect to the two target variables, and the two error vectors are added element-wise to obtain a final error vector, which is propagated back to update the weights.
● The weights between the input and the hidden layer are taken as the word vector representation after training. The loss function (objective) is of the same type as in the CBOW model.
Advantages of Skip-Gram Model


1. The skip-gram model can capture two semantics for a single word, i.e. it can reflect two usages of Apple: one for the company and the other for the fruit.
2. Skip-gram with negative sampling generally outperforms the other methods.
Word2Vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector. Word vectors are positioned in the space such that words that share common contexts in the corpus are located in close proximity to one another.

● CBOW is faster and has better representations for more frequent words.

● Skip-gram works well with small amounts of data and is found to represent rare words well.
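As a sketch, both variants can be trained with the gensim library; the tiny corpus and parameter values below are only illustrative.

from gensim.models import Word2Vec

sentences = [
    ["best", "way", "to", "success", "is", "through", "hardwork", "and", "persistence"],
    ["hard", "work", "and", "persistence", "lead", "to", "success"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-gram

print(cbow.wv["success"][:5])                      # first few dimensions of one word vector
print(skipgram.wv.most_similar("success", topn=3)) # nearest neighbours by cosine similarity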
Summary of Word2Vec

Word2Vec is a popular word embedding technique that aims to represent words as


continuous vectors in a high-dimensional space. It introduces two models: Continuous
Bag of Words (CBOW) and Skip-gram, each contributing to the learning of vector
representations.

1. Model Architecture:
Continuous Bag of Words (CBOW): In CBOW, the model predicts a target
word based on its context. The context words are used as input, and the target
word is the output. The model is trained to minimize the difference between the
predicted and actual target words.

Skip-gram: Conversely, the Skip-gram model predicts context words given a


target word. The target word serves as input, and the model aims to predict the
words that are likely to appear in its context. Like CBOW, the goal is to
minimize the difference between the predicted and actual context words.
2. Neural Network Training:


Both CBOW and Skip-gram models leverage neural networks to learn vector
representations. The neural network is trained on a large text corpus,
adjusting the weights of connections to minimize the prediction error. This
process places similar words closer together in the resulting vector space.

3. Vector Representations:
Once trained, Word2Vec assigns each word a unique vector in the high-
dimensional space. These vectors capture semantic relationships between
words. Words with similar meanings or those that often appear in similar
contexts have vectors that are close to each other, indicating their semantic
similarity.
Advantages and Disadvantages:


Advantages:

● Captures semantic relationships effectively.


● Efficient for large datasets.
● Provides meaningful word representations.

Disadvantages:

● May struggle with rare words.


● Ignores word order.
FastText
● FastText is a vector representation technique developed by facebook AI
research. As its name suggests its fast and efficient method to perform
same task and because of the nature of its training method, it ends up
learning morphological details as well.
● FastText is unique because it can derive word vectors for unknown
words or out of vocabulary words — this is because by taking
morphological characteristics of words into account, it can create the
word vector for an unknown word.
● Since morphology refers to the structure or syntax of the words,
FastText tends to perform better for such task, word2vec perform better
for semantic task.
FastText works well with rare words. So even if a word


wasn’t seen during training, it can be broken down into n-
grams to get its embeddings.

Applications
● Analysing survey responses .
● Analysing verbatim comments.
● Music/Video recommendation system.
● FastText is an advanced word embedding technique


developed by Facebook AI Research (FAIR) that
extends the Word2Vec model.
● Unlike Word2Vec, FastText not only considers whole
words but also incorporates subword information —
parts of words like n-grams.
● This approach enables the handling of morphologically
rich languages and captures information about word
structure more effectively.
1. Subword Information:
● FastText represents each word as a bag of character
n-grams in addition to the whole word itself.
● This means that the word “apple” is represented by
the word itself and its constituent n-grams like “ap”,
“pp”, “pl”, “le”, etc.
● This approach helps capture the meanings of shorter
words and affords a better understanding of suffixes
and prefixes.
2. Model Training:
● Similar to Word2Vec, FastText can use either the CBOW
or Skip-gram architecture.
● However, it incorporates the subword information during
training.
● The neural network in FastText is trained to predict words
(in CBOW) or context (in Skip-gram) not just based on the
target words but also based on these n-grams.
3. Handling Rare and Unknown Words:


● A significant advantage of FastText is its ability to
generate better word representations for rare words or
even words not seen during training.
● By breaking down words into n-grams, FastText can
construct meaningful representations for these words
based on their subword units.
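A minimal gensim FastText sketch of this out-of-vocabulary behaviour (toy corpus; the parameter values are arbitrary):

from gensim.models import FastText

sentences = [["natural", "language", "processing"],
             ["language", "models", "use", "word", "embeddings"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=2, max_n=4)

# "languages" never appears in the corpus, but its character n-grams overlap
# with "language", so FastText can still construct a vector for it.
print(model.wv["languages"][:5])
print(model.wv.similarity("language", "languages"))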
Advantages and Disadvantages:


Advantages:

● Better representation of rare words.


● Capable of handling out-of-vocabulary words.
● Richer word representations due to subword information.

Disadvantages:

● Increased model size due to n-gram information.


● Longer training times compared to Word2Vec.
Glove
● Global Vectors for Word Representation (GloVe) is a powerful word
embedding technique that captures the semantic relationships between
words by considering their co-occurrence probabilities within a
corpus.
● The key to GloVe’s effectiveness lies in the construction of a word-
context matrix and the subsequent factorization process.

1. Word-Context Matrix Formation:


The first step in GloVe’s mechanics involves creating a word-context
matrix. This matrix is designed to represent the likelihood of a given word
appearing near another across the entire corpus. Each cell in the matrix
holds the co-occurrence count of how often words appear together in a
certain context window.
Let’s consider a simplified example. Assume we have


the following sentences in our corpus:

● “Word embeddings capture semantic meanings.”


● “GloVe is an impactful word embedding model.”
Here, each row and column corresponds to a unique word in the corpus, and the
values in the cells represent how often these words appear together within a
certain context window.
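A rough sketch of how such a word-context co-occurrence matrix could be counted (this is only the counting step, not the full GloVe training; names and window size are illustrative):

from collections import defaultdict

docs = ["word embeddings capture semantic meanings",
        "glove is an impactful word embedding model"]
window = 2

cooc = defaultdict(int)
for doc in docs:
    tokens = doc.lower().split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(w, tokens[j])] += 1   # count the pair within the context window

print(cooc[("word", "embeddings")])         # how often the pair co-occurs in the window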
2. Factorization for Word Vectors:


With the word-context matrix in place, GloVe turns to matrix
factorization. The objective here is to decompose this high-
dimensional matrix into two smaller matrices — one representing
words and the other contexts. Let’s denote these as W for words and
C for contexts. The ideal scenario is when the dot product of W and
CT (transpose of C) approximates the original matrix:

X≈W⋅CT
Through iterative optimization, GloVe adjusts W and C to minimize


the difference between X and W⋅CT. This process yields refined vector
representations for each word, capturing the nuances of their co-
occurrence patterns.

3. Vector Representations:
Once trained, GloVe provides each word with a dense vector that
captures not just local context but global word usage patterns. These
vectors encode semantic and syntactic information, revealing
similarities and differences between words based on their overall usage
in the corpus.
Advantages and Disadvantages:


Advantages:

● Efficiently captures global statistics of the corpus.


● Good at representing both semantic and syntactic
relationships.
● Effective in capturing word analogies.

Disadvantages:

● Requires more memory for storing co-occurrence matrices.


● Less effective with very small corpora.
Choosing the Right Embedding Model

● Word2Vec: Use when semantic relationships are


crucial, and you have a large dataset.
● GloVe: Suitable for diverse datasets and when
capturing global context is important.
● FastText: Opt for morphologically rich languages
or when handling out-of-vocabulary words is
vital.
OVERVIEW OF DEEP LEARNING MODELS – RNN

● Deep learning is a machine learning technique that teaches


computers to do what comes naturally to humans: learn by
example.
● Deep learning is a key technology behind driverless cars,

enabling them to recognize a stop sign, or to distinguish a


pedestrian from a lamppost. It is the key to voice control in
consumer devices like phones, tablets, TVs, and hands-free
speakers.
● In deep learning, a computer model learns to perform
classification tasks directly from images, text, or sound.
● Deep learning models can achieve state-of-the-art accuracy,

sometimes exceeding human-level performance.


● Models are trained by using a large set of labeled data and

neural network architectures that contain many layers.

While deep learning was first theorized in the 1980s, there are two main reasons
it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example,
driverless car development requires millions of images and thousands
of hours of video.
2. Deep learning requires substantial computing power. High
performance GPUs have a parallel architecture that is efficient for
deep learning. When combined with clusters or cloud computing, this
enables development teams to reduce training time for a deep learning
network from weeks to hours or less.
Deep learning applications are used in industries from automated driving to medical
devices.
Automated Driving: Automotive researchers are using deep learning to
automatically detect objects such as stop signs and traffic lights. In addition,
deep learning is used to detect pedestrians, which helps decrease accidents.
Aerospace and Defense: Deep learning is used to identify objects from satellites
that locate areas of interest, and identify safe or unsafe zones for troops.
Medical Research: Cancer researchers are using deep learning to automatically
detect cancer cells. Teams at UCLA built an advanced microscope that yields a
high-dimensional data set used to train a deep learning application to accurately
identify cancer cells.
Industrial Automation: Deep learning is helping to improve worker safety
around heavy machinery by automatically detecting when people or objects are
within an unsafe distance of machines.
Electronics: Deep learning is being used in automated hearing and speech
translation. For example, home assistance devices that respond to your voice and
know your preferences are powered by deep learning applications.
RNN

● A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
● In traditional neural networks, all the inputs and outputs are independent of each other, but when we need to predict the next word of a sentence, the previous words are required, so there is a need to remember them.
● Thus RNNs came into existence, solving this issue with the help of a hidden layer.
● The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.
● This state is also referred to as the memory state, since it remembers the previous input to the network. The RNN uses the same parameters for each input, performing the same task at every step to produce the output. This reduces the number of parameters, unlike other neural networks.
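The hidden (memory) state update of a simple RNN can be sketched in NumPy as the recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b); the weights below are random and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                      # initial hidden (memory) state
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 input vectors
    h = np.tanh(W_x @ x_t + W_h @ h + b)      # the same weights are reused at every step
print(h)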
● An RNN treats each word of a sentence as a separate input occurring at time ‘t’ and also uses the activation value at ‘t-1’ as an input, in addition to the input at time ‘t’.
● The diagram below shows a detailed structure of an RNN architecture.

The architecture described above is also called a many-to-many architecture with Tx = Ty, i.e. number of inputs = number of outputs. Such a structure is quite useful in sequence modelling.
● Apart from the architecture mentioned above, there are three other types of RNN architectures which are commonly used.
1. Many-to-One RNN: a many-to-one architecture is one where many inputs (Tx) are used to give one output (Ty). A suitable example of such an architecture is a classification task.
RNN are a very important variant of neural networks heavily


used in Natural Language Processing.
● Conceptually they differ from a standard neural network as the

standard input in a RNN is a word instead of the entire sample


as in the case of a standard neural network.
● This gives the flexibility for the network to work with varying

lengths of sentences, something which cannot be achieved in a


standard neural network due to it’s fixed structure.
● It also provides an additional advantage of sharing features

learned across different positions of text which cannot be


obtained in a standard neural network.
In the image, H represents the output of the activation function.
One-to-Many RNN: a one-to-many architecture refers to a situation where an RNN generates a series of output values based on a single input value. A prime example of such an architecture is a music generation task, where the input is a genre or the first note.
Many to Many Architecture (Tx not equals Ty): This architecture refers to where many inputs
are read to produce many outputs, where the length of inputs is not equal to the length of outputs.
A prime example for using such an architecture is machine translation tasks.
Encoder refers to the part of the network which reads the sentence to be
translated, and, Decoder is the part of the network which translates the sentence
into desired language.

Limitations of RNN
Apart from all of its usefulness RNN does have certain limitations major of which are:
Examples of RNN architecture stated above can capture the dependencies in only one
direction of the language.
1. Basically, in the case of Natural Language Processing it assumes that the word
coming after has no effect on the meaning of the word coming before. With our
experience of languages, we know that it is certainly not true.
2. RNN are also not very good in capturing long term dependencies and the
problem of vanishing gradients resurface in RNN.
RNNs are ideal for solving problems where the sequence is more important than the individual items themselves.
● An RNN is essentially a fully connected neural network that contains a refactoring of some
of its layers into a loop. That loop is typically an iteration over the addition or concatenation
of two inputs, a matrix multiplication and a non-linear function.
Among the text usages, the following tasks are among those RNNs perform well at:
● Sequence labelling
● Natural Language Processing (NLP) text classification
● Natural Language Processing (NLP) text generation
● Other tasks that RNNs are effective at solving are time series predictions or other sequence
predictions that aren’t image or tabular-based.
● There have been several highlighted and controversial reports in the media over the
advances in text generation, OpenAI’s GPT-2 algorithm. In many cases the generated text
is often indistinguishable from text written by humans.
● RNNs effectively have an internal memory that allows the previous inputs to affect the
subsequent predictions. It’s much easier to predict the next word in a sentence with more
accuracy, if you know what the previous words were. Often with tasks well suited to RNNs,
the sequence of the items is as or more important than the previous item in the sequence.
● Sequence-to-Sequence Models: TRANSFORMERS (Translate one language


into another language)
● Sequence-to-sequence (seq2seq) models in NLP are used to convert sequences
of Type A to sequences of Type B.
● For example, translation of English sentences to German sentences is a
sequence-to-sequence task.
● Recurrent Neural Network (RNN) based sequence-to-sequence models have
garnered a lot of traction ever since they were introduced in 2014. Most of the
data in the current world are in the form of sequences – it can be a number
sequence, text sequence, a video frame sequence or an audio sequence.
● The performance of these seq2seq models was further enhanced with the
addition of the Attention Mechanism in 2015. How quickly advancements in
NLP have been happening in the last 5 years – incredible!
These sequence-to-sequence models are pretty versatile and they are used
in a variety of NLP tasks, such as:
Machine Translation
Text Summarization
Speech Recognition
Question-Answering System, and so on
RNN based Sequence-to-Sequence Model


Let’s take a simple example of a sequence-to-sequence model.
● The above seq2seq model is converting a German phrase to its English


counterpart. Let’s break it down:
● Both Encoder and Decoder are RNNs
● At every time step in the Encoder, the RNN takes a word vector (xi)
from the input sequence and a hidden state (Hi) from the previous time
step
● The hidden state is updated at each time step
● The hidden state from the last unit is known as the context vector. This
contains information about the input sequence
● This context vector is then passed to the decoder and it is then used to
generate the target sequence (English phrase)
● If we use the Attention mechanism, then the weighted sum of the
hidden states are passed as the context vector to the decoder
Challenges
Despite being so good at what it does, there are certain limitations of seq-2-seq models
with attention:
● Dealing with long-range dependencies is still challenging
● The sequential nature of the model architecture prevents parallelization.
These challenges are addressed by Google Brain’s Transformer concept
● RNNs can remember important things about the input they have received, which allows them to be very precise in predicting the next outcome. This is why they are preferred for sequential data. Examples of sequence data include time series, speech, text, financial data, audio, video, weather, and many more. Although RNNs were the state-of-the-art algorithm for dealing with sequential data, they come with their own drawbacks: because of the complexity of the algorithm, the network is quite slow to train, and with a huge number of dimensions the training becomes long and difficult.
TRANSFORMERS

Attention models/Transformers are the most exciting models


being studied in NLP research today, but they can be a bit challenging to
grasp – the pedagogy is all over the place. This is both a bad thing (it can
be confusing to hear different versions) and in some ways a good thing
(the field is rapidly evolving, there is a lot of space to improve).
Internally, the Transformer has a similar kind of architecture as the previous models
above. But the Transformer consists of six encoders and six decoders.
Each encoder is very similar to each other. All encoders have the same architecture. Decoders
share the same property, i.e. they are also very similar to each other. Each encoder consists of
two layers: Self-attention and a feed Forward Neural Network.
The encoder’s inputs first flow through a self-attention layer. It helps the encoder look at other
words in the input sentence as it encodes a specific word. The decoder has both those layers, but
between them is an attention layer that helps the decoder focus on relevant parts of the input
sentence.
Self-Attention
Let’s start to look at the various vectors/tensors and how they
flow between these components to turn the input of a trained model into
an output. As is the case in NLP applications in general, we begin by
turning each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We will represent those
vectors with these simple boxes. The embedding only happens in the
bottom-most encoder. The abstraction that is common to all the encoders
is that they receive a list of vectors each of the size 512.

In the bottom encoder that would be the word embeddings, but in


other encoders, it would be the output of the encoder that’s directly below.
After embedding the words in our input sequence, each of them flows
through each of the two layers of the encoder.
Here we begin to see one key property of the Transformer, which is that the word in each
position flows through its own path in the encoder. There are dependencies between these
paths in the self-attention layer. The feed-forward layer does not have those dependencies,
however, and thus the various paths can be executed in parallel while flowing through the
feed-forward layer.

Next, we’ll switch up the example to a shorter sentence and we’ll look at what
happens in each sub-layer of the encoder.
Self-Attention
Let’s first look at how to calculate self-attention using vectors, then proceed to
look at how it’s actually implemented — using matrices.

Figuring out the relation of words within a sentence and giving the right attention to it.
The first step in calculating self-attention is to create three vectors from


each of the encoder’s input vectors (in this case, the embedding of each word). So
for each word, we create a Query vector, a Key vector, and a Value vector. These
vectors are created by multiplying the embedding by three matrices that we
trained during the training process.

Notice that these new vectors are smaller in dimension than the
embedding vector. Their dimensionality is 64, while the embedding and encoder
input/output vectors have dimensionality of 512. They don’t HAVE to be smaller,
this is an architecture choice to make the computation of multiheaded attention
(mostly) constant.
Multiplying x1 by the WQ weight matrix produces q1, the “query” vector associated with that
word. We end up creating a “query”, a “key”, and a “value” projection of each word in the input
sentence.

What are the “query”, “key”, and “value” vectors?


They’re abstractions that are useful for calculating and thinking about attention. Once
you proceed with reading how attention is calculated below, you’ll know pretty much all you
need to know about the role each of these vectors plays.
The second step in calculating self-attention is to calculate a score. Say


we’re calculating the self-attention for the first word in this example,
“Thinking”. We need to score each word of the input sentence against this
word. The score determines how much focus to place on other parts of the
input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query


vector with the key vector of the respective word we’re scoring. So if
we’re processing the self-attention for the word in position #1, the first
score would be the dot product of q1 and k1. The second score would be
the dot product of q1 and k2.
The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in this example, which is 64; this leads to more stable gradients, and other values are possible, but this is the default), and then pass the result through a softmax operation. Softmax normalizes the scores so they are all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it is useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax


score (in preparation to sum them up). The intuition here is to keep intact
the values of the word(s) we want to focus on, and drown-out irrelevant
words (by multiplying them by tiny numbers like 0.001, for example).
The sixth step is to sum up the weighted value vectors. This produces the
output of the self-attention layer at this position (for the first word).
That concludes the self-attention calculation. The resulting vector is one we can send
along to the feed-forward neural network. In the actual implementation, however, this
calculation is done in matrix form for faster processing.
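The six steps above can be written compactly in matrix form. A NumPy sketch of this scaled dot-product self-attention follows; Q, K and V are random here purely for illustration, with d_k = 64 as in the text.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 3, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))   # query vectors (one row per word)
K = rng.normal(size=(seq_len, d_k))   # key vectors
V = rng.normal(size=(seq_len, d_k))   # value vectors

scores = Q @ K.T / np.sqrt(d_k)       # steps 2-3: dot products, divided by sqrt(64) = 8
weights = softmax(scores, axis=-1)    # step 4: softmax over each row
Z = weights @ V                       # steps 5-6: weighted sum of the value vectors
print(Z.shape)                        # (3, 64): one output vector per word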
Multihead attention
There are a few other details that make them work better. For
example, instead of only paying attention to each other in one dimension,
Transformers use the concept of Multihead attention. The idea behind it is
that whenever you are translating a word, you may pay different attention
to each word based on the type of question that you are asking. The
images below show what that means. For example, whenever you are
translating “kicked” in the sentence “I kicked the ball”, you may ask
“Who kicked”. Depending on the answer, the translation of the word to
another language can change. Or ask other questions, like “Did what?”,
etc…
Positional Encoding
Another important step in the Transformer is to add a positional encoding when encoding each word. This matters because the position of each word in the sentence is relevant to the translation.
Overview of Text Summarization and Topic Models
Text summarization

● Text summarization is the process of generating short, fluent, and


most importantly accurate summary of a respectively longer text
document.
● The main idea behind automatic text summarization is to be able
to find a short subset of the most essential information from the
entire set and present it in a human-readable format.
● As online textual data grows, automatic text summarization
methods have the potential to be very helpful because more useful
information can be read in a short time.
● Text summarization can be a useful case study in domains like
financial research, question-answer bots, media monitoring, social
media marketing, and so on.
Type of summarization
Based on input type:

1. Single Document, where the input length is short. Many of the early

summarization systems dealt with single-document summarization.

2. Multi-Document, where the input can be arbitrarily long.
Based on the purpose:

1. Generic, where the model makes no assumptions about the


domain or content of the text to be summarized and treats all
inputs as homogeneous. The majority of the work that has been
done revolves around generic summarization.
2. Domain-specific, where the model uses domain-specific
knowledge to form a more accurate summary. For example,
summarizing research papers of a specific domain, biomedical
documents, etc.
3. Query-based, where the summary only contains information that
answers natural language questions about the input text.
Based on output type:

1. Extractive, where important sentences are selected from the


input text to form a summary. Most summarization
approaches today are extractive in nature.
2. Abstractive, where the model forms its own phrases and
sentences to offer a more coherent summary, like what a
human would generate. This approach is definitely more
appealing, but much more difficult than extractive
summarization.
How to do text summarization

● Text cleaning

● Sentence tokenization

● Word tokenization

● Word-frequency table

● Summarization
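A minimal sketch of this pipeline using NLTK (the word-frequency scoring and the choice of 3 sentences are illustrative, not a prescribed implementation):

import heapq
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt"); nltk.download("stopwords")

def summarize(text, n_sentences=3):
    stop = set(stopwords.words("english"))
    # Text cleaning + word tokenization
    words = [w.lower() for w in word_tokenize(text) if w.isalpha() and w.lower() not in stop]
    freq = nltk.FreqDist(words)                          # word-frequency table
    # Score each sentence by the frequencies of the words it contains
    scores = {}
    for sent in sent_tokenize(text):
        for w in word_tokenize(sent.lower()):
            scores[sent] = scores.get(sent, 0) + freq.get(w, 0)
    # Keep the top-scoring sentences as the summary
    return " ".join(heapq.nlargest(n_sentences, scores, key=scores.get))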
Automatic text summarization

In this approach we build algorithms or programs which will reduce the


text size and create a summary of our text data. This is called automatic
text summarization in machine learning.
Text summarization is the process of creating shorter text without
removing the semantic structure of text.

There are two approaches to text summarization.


1. Extractive approaches
2. Abstractive approaches
EXTRACTIVE APPROACHES
● Using an extractive approach we summarize our text on the basis of
simple and traditional algorithms.
● For example, when we want to summarize our text on the basis of the
frequency method, we store all the important words and frequency of all
those words in the dictionary.
● On the basis of high frequency words, we store the sentences containing
that word in our final summary.
● This means the words which are in our summary confirm that they are
part of the given text.
● It is the traditional method developed first. The main objective is to
identify the significant sentences of the text and add them to the
summary. You need to note that the summary obtained contains
exact sentences from the original text.
● As the name suggests, extractive text summarization ‘extracts’


notable information from the large dumps of text provided and
groups them into clear and concise summaries.
● The method is very straightforward as it extracts texts based on
parameters such as the text to be summarized, the most important
sentences (Top K), and the value of each of these sentences to the
overall subject.
Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the
city, Mary gave birth to a child named Jesus.

Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.
ABSTRACTIVE APPROACHES
● An abstractive approach is more advanced.
● The approach is to identify the important sections, interpret the context and
reproduce in a new way.
● This ensures that the core information is conveyed through shortest text possible.
Note that here, the sentences in summary are generated, not just extracted from
original text.
● Abstractive text summarization generates legible sentences from the entirety of
the text provided. It rewrites large amounts of text by creating acceptable
representations, which is further processed and summarized by natural language
processing.
● What makes this method unique is its almost AI-like ability to use a machine’s
semantic capability to process text and iron out the kinks using NLP.
● Example:
● Abstractive summary: Joseph and Mary came to Jerusalem where Jesus
was born.
Techniques for text summarization in Python


Approaches to text summarization using both abstractive and extractive methods.

1. Gensim
● Gensim is an open-source topic and vector space modeling toolkit within
the Python programming language.
● First, the user needs to utilize the summarization.summarizer from
Gensim as it is based on a variation of the TextRank algorithm.
● Since TextRank is a graph-based ranking algorithm, it helps narrow
down the importance of vertices in graphs based on global information
drawn from said graphs.
● gensim is a very handy python library for performing NLP tasks. The
text summarization process using gensim library is based on TextRank
Algorithm
● TextRank is an extractive summarization technique.


● It is based on the concept that words which occur more
frequently are significant.
● Hence , the sentences containing highly frequent words are
important .
● Based on this , the algorithm assigns scores to each sentence
in the text .
● The top-ranked sentences make it to the summary.
Text Summarization with Sumy


Along with TextRank , there are various other algorithms to
summarize text.
implementation of the below algorithms for summarization using sumy
1. LexRank
2. Luhn
3. Latent Semantic Analysis, LSA
4. KL-Sum
LexRank.
How does LexRank work?
● A sentence which is similar to many other sentences of the text
has a high probability of being important.
● The approach of LexRank is that a particular sentence is
recommended by other similar sentences and hence is ranked
higher.
● Higher the rank, higher is the priority of being included in the
summarized text.
● This is an unsupervised machine learning based approach in
which we use the textrank approach to find the summary of our
sentences. Using cosine similarity and vector based algorithms,
we find minimum cosine distance among various words and
store the more similar words together.
working

!pip install sumy


import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
As the text source here is a string, you need to use the PlaintextParser.from_string() function to initialize the parser. You can specify the language used as input to the Tokenizer.
syntax: PlaintextParser.from_string(cls, string, tokenizer)
import nltk
nltk.download('punkt')
# Initializing the parser
my_parser = PlaintextParser.from_string(original_text,Tokenizer('english'))
Next create a summarizer model lex_rank_summarizer to fit your text.


The syntax is: lex_rank_summarizer(document, sentences_count)

lex_rank_summarizer = LexRankSummarizer()
lexrank_summary =
lex_rank_summarizer(my_parser.document,sentences_count=3)

# Printing the summary


for sentence in lexrank_summary: print(sentence)
Sumy:

# Load Packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Creating text parser using tokenization
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Summarize using sumy TextRank
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)

text_summary = ""
for sentence in summary:
    text_summary += str(sentence)
print(text_summary)
Lex Rank:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def sumy_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    # Summarize the document with 2 sentences
    summary = summarizer(parser.document, 2)
    dp = []
    for i in summary:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
LSA (Latent semantic analysis)

● Latent Semantic Analysis is a unsupervised learning algorithm


that can be used for extractive text summarization.
● It extracts semantically significant sentences by applying
singular value decomposition(SVD) to the matrix of term-
document frequency.
LSA

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def lsa_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_lsa = LsaSummarizer()
    summary_2 = summarizer_lsa(parser.document, 2)
    dp = []
    for i in summary_2:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence

Latent Semantic Analysis (LSA) is based on decomposing the data into a low-dimensional space. LSA is able to preserve the semantics of the given text while summarizing.
Luhn

● Luhn Summarization algorithm’s approach is based on TF-


IDF (Term Frequency-Inverse Document Frequency).
● It is useful when very low frequent words as well as highly
frequent words(stopwords) are both not significant.
● Based on this, sentence scoring is carried out and the high
ranking sentences make it to the summary.
Using Luhn:

from sumy.summarizers.luhn import LuhnSummarizer

def luhn_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_luhn = LuhnSummarizer()
    summary_1 = summarizer_luhn(parser.document, 2)
    dp = []
    for i in summary_1:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence

This approach is based on the frequency method; here we use TF-IDF (term frequency-inverse document frequency) to score sentences.
KL-Sum
● Another extractive method is the KL-Sum algorithm.
● It selects sentences based on similarity of word distribution as the original
text.
● It aims to lower the KL-divergence criteria.
● It uses greedy optimization approach and keeps adding sentences till the
KL-divergence decreases.
What is Abstractive Text Summarization?

● Abstractive summarization is the new state of art method, which


generates new sentences that could best represent the whole text.
● This is better than extractive methods where sentences are just
selected from original text for the summary.
● A simple and effective way is through the Huggingface’s transformers
library.
● HuggingFace supports state of the art models to implement tasks such
as summarization, classification, etc.. Some common models are
GPT-2, GPT-3, BERT , OpenAI, GPT, T5.
● Another awesome feature with transformers is that it provides
PreTrained models with weights that can be easily instantiated
through from_pretrained() method.
Summarization with T5 Transformers

● T5 is an encoder-decoder model. It converts all language problems


into a text-to-text format.
● First, need to import the tokenizer and corresponding model .
● It is preferred to use T5 For Conditional Generation model when
the input and output are both sequences.
● T5 is an encoder-decoder model, and hence the input sequence should be in the form of a sequence of ids, or input-ids.
● How to convert the input text into input-ids ?
● This process is called encoding the text and can be achieved
through encode() method
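A brief sketch of that flow with the HuggingFace transformers library (the "t5-small" checkpoint is chosen only for the example, and the text variable is a placeholder for the document to summarize; transformers and sentencepiece are assumed to be installed):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "..."  # placeholder: the long document to be summarized
input_ids = tokenizer.encode("summarize: " + text, return_tensors="pt",
                             max_length=512, truncation=True)

summary_ids = model.generate(input_ids, max_length=60, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))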
Summarization with BART Transformers

● transformers library of HuggingFace supports summarization with BART models.


● Import the model and tokenizer. For problems where there is need to generate
sequences , it is preferred to use BartForConditionalGeneration model.
● ” bart-large-cnn” is a pretrained model, fine tuned especially for summarization
task. You can load the model using from_pretrained() method
● You need to pass the input text in the form of a sequence of ids.
● For this, use the batch_encode_plus() function with the tokenizer. This function
returns a dictionary containing the encoded sequence or sequence pair and other
additional information.
● Now, How to limit the maximum length of the returned sequence?
● Set the max_length parameter in batch_encode_plus()
● Next, pass the input_ids to the model.generate() function to generate the ids of the summarized output. model.generate() returns a sequence of ids corresponding to the summary of the original text. You can convert the sequence of ids to text through the decode() method.
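A hedged sketch of those steps, using the facebook/bart-large-cnn checkpoint mentioned above (the article variable is a placeholder; parameter values are illustrative):

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "..."  # placeholder: the original long text
inputs = tokenizer.batch_encode_plus([article], return_tensors="pt",
                                     max_length=1024, truncation=True)

summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             max_length=80, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))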
Summarization with GPT-2 Transformers

● GPT-2 transformer is another major player in text


summarization, introduced by OpenAI. Thanks to transformers,
the process followed is same just like with BART Transformers.
● First, you have to import the tokenizer and model. Make sure
that you import a LM Head type model, as it is necessary to
generate sequences. Next, load the pretrained gpt-2 model and
tokenizer .
● After loading the model, you have to encode the input text and
pass it as an input to model.generate().
● The summary_ids contains the sequence of ids corresponding to
the text summary . You can decode it and print the summary
Summarization with XLM Transformers

● You can import the XLMWithLMHeadModel as it supports


generation of sequences.You can load the pretrained xlm-mlm-
en-2048 model and tokenizer with weights using
from_pretrained() method.
● The next steps are the same as in the last three cases. The encoded input text is passed to the generate() function, which returns an id sequence for the summary. You can decode and print the summary.
● You may notice that the XLM summary isn't very good. That is because, even though the model supports summarization, it was not fine-tuned for this task.
Using a pre-trained summarizer and evaluating its output

● What do we mean by pre-trained models:-


● These models have already been trained on large datasets.
● If a model is trained on huge amounts of data it will naturally
predict better, however, the inability to collect large amounts of
data and subsequently higher training time are some of the reasons
why instead of training a model from scratch we could benefit by
using a pre-trained model.
What Are Transformers?

● Abstractive Summarization:
● Abstractive summarization techniques emulate human writing by
generating entirely new sentences to convey key concepts from
the source text, rather than merely rephrasing portions of it.
● These fresh sentences distill the vital information while eliminating irrelevant details, often incorporating novel vocabulary absent from the original text.
● The term "Transformers" has recently come to dominate the natural language processing field; earlier abstractive summarization models relied on designs based on recurrent neural networks (RNNs).


● Transformers are a family of models that employ an encoder-decoder architecture to transform an input sequence into an output sequence. Transformers feature a distinctive "self-attention" mechanism, along with several other enhancements such as positional encoding, which set them apart. NOTE: not all Transformers are intended for use in text summarization. Let us delve into the recently released model called PEGASUS, which appears to excel in terms of output quality for text summarization.

● PEGASUS shares similarities with other transformer models; its primary distinction lies in a unique approach used during the model's pre-training. Specifically, the most crucial sentences in the training text corpora are "masked" (hidden from the model) during PEGASUS pre-training, and the model is then tasked with generating these concealed sentences as a single output sequence. A usage sketch follows.
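A minimal usage sketch with the transformers library; the google/pegasus-xsum checkpoint and the generation settings are illustrative choices:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# One of several PEGASUS checkpoints fine-tuned for summarization
model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "Your long article text goes here ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

summary_ids = model.generate(**inputs, num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))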


What is Topic Modelling in NLP?


● The average person generates in excess of 1.7 MB of digital data per second. This amounts to more than 2.5 quintillion bytes of data per day, of which 80-90% is unstructured.
● Consider a scenario where a business employs a single individual to review each piece of unstructured data and segment it based on the underlying topic. It would be an impossible task.
● It would take a significant amount of time and be extremely tedious; there is also more risk involved, since humans are naturally biased and more error-prone than machines.
● The solution is topic modeling.
Topic modeling aids businesses in:
● Performing real-time analysis on unstructured textual data
● Learning from unstructured data at scale
● Building a consistent understanding of data, regardless of its format.



● Topic modelling is the technique of identifying the related words that make up the topics present in a document or a corpus of data.
● In topic modeling, the algorithms are a collection of statistical and deep learning methods for identifying latent semantic structures in collections of documents.
● Topic modelling is an unsupervised approach that recognizes or extracts topics by detecting patterns, much as clustering algorithms divide data into different groups.
● In the same way, topic modelling tells us the different topics present in the documents.
● This is done by extracting patterns of word clusters and word frequencies in the documents.


Applications of Topic Modeling in NLP

● Information Retrieval: Topic modeling supports information retrieval in computer science, particularly in the context of search engines. It is incorporated into various text-processing, rule-based systems to extract topics from text input and retrieve relevant information.
● Document Clustering: Topic modeling can be used to group similar documents together based on the topics they contain. It is useful in a range of applications such as news aggregation, online discussion forums, and social media analysis.
● Content Recommendation: Topic modeling can be used to identify the topics that a user is interested in and recommend content that matches those topics. This is useful in various applications, such as content personalization on websites, e-commerce product recommendations, and news article recommendations.


● Sentiment Analysis: Topic modeling can be used to identify the sentiment of a document or a section of text. By identifying the topics discussed in the text and the sentiment associated with each topic, we can better understand the overall sentiment of the document.
● Trend Analysis: Topic modeling can be used to identify the topics that are currently trending in a given domain or industry. This can be useful in various applications such as market research and news analysis.
● Keyword Extraction: Topic modeling can be used to identify the most important keywords in a document or a section of text. This is useful for tasks such as search engine optimization (SEO), information retrieval, and content analysis.


● This type of modelling is very useful when there are many documents and we want to know what kind of information is present in them. Doing this manually takes a lot of time, whereas topic modelling can do it in very little time.
● Topic modeling is a frequently used approach to discover the hidden semantic patterns portrayed by a text corpus and to automatically identify the topics that exist inside it.
● Namely, it is a type of statistical modeling that leverages unsupervised machine learning to analyze and identify clusters or groups of similar words within a body of text.
● For example, a topic modeling algorithm may be deployed to determine whether the contents of a document imply it is an invoice, a complaint, or a contract.


[Figure: a visualization of how topic modeling works]


Different topic modeling approaches:

1. Latent Semantic Analysis (LSA)
2. Probabilistic Latent Semantic Analysis (PLSA)
3. Latent Dirichlet Allocation (LDA)
4. Non-negative Matrix Factorization (NMF)


Latent Semantic Analysis or Latent Semantic Indexing (LSA)

● Latent Semantic Analysis (LSA) is one of the basic topic modeling techniques.
● LSA is a natural language processing technique used to analyze relationships between documents and the terms they contain.
● The first step is to generate a Document-Term Matrix (DTM). If you have m documents and n words in your vocabulary, you can create an m × n matrix A.
● Here each row represents a document and each column represents a word. In the simplest version of LSA, each entry is a raw count of the number of times the jth word occurs in the ith document.


● LSA assumes that words with similar meanings will appear in similar documents.
● It does so by constructing a matrix of word counts per document (here viewed as a term-document matrix, the transpose of the DTM above, with each row a unique word and each column a document), and then using Singular Value Decomposition (SVD) to reduce the number of rows while preserving the similarity structure among columns.
● SVD is a mathematical method that simplifies data while keeping its important features. It is used here to maintain the relationships between words and documents.


● To determine the similarity between documents, cosine similarity is used. This is a measure that calculates the cosine of the angle between two vectors, in this case representing documents (the formula is given after this list).
● A value close to 1 means the documents are very similar based on the words in them, whereas a value close to 0 means they are quite different.
● Raw counts don't work well because they don't take into account the significance of every word in the document.
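For reference, for two document vectors d1 and d2 the cosine similarity mentioned above is the standard quantity

\cos(\theta) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\lVert \mathbf{d}_1 \rVert \, \lVert \mathbf{d}_2 \rVert}

i.e., the dot product of the two vectors divided by the product of their lengths.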


● Therefore, LSA models typically replace the raw counts of the DTM with
TF-IDF scores. TF-IDF or term frequency-inverse document frequency
assigns a weight to term j in document i as follows:
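In the standard TF-IDF weighting this refers to (one common form; the notation is assumed here), the weight of term j in document i is

w_{i,j} = \mathrm{tf}_{i,j} \times \log\left( \frac{N}{\mathrm{df}_j} \right)

where tf_{i,j} is the number of times term j occurs in document i, df_j is the number of documents containing term j, and N is the total number of documents in the corpus.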


● The more frequently a word appears in a document, the greater its weight; and the less frequently it occurs across the corpus, the greater its weight.
● Although LSA is fast and efficient to use, it has some significant drawbacks, such as a lack of interpretable embeddings (we don't know what each topic represents, and the components can be arbitrarily positive or negative) and the need for a large collection of documents and a large vocabulary to get accurate results.
● LSA finds low-dimensional representations of documents and words.
● The dot product of row vectors gives document similarity, and the dot product of column vectors gives word similarity.
● Truncated singular value decomposition (SVD) is applied to reduce the dimensionality of the document-term matrix.
● In both U and V, a column corresponds to one of the t topics. In U, rows represent document vectors expressed in terms of topics. In V, rows represent term vectors expressed in terms of topics. A small end-to-end sketch follows.
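A minimal end-to-end LSA sketch with scikit-learn; the toy documents and the number of topics are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "the economy is facing a slowdown and the stock market fell",
    "profits and losses are reported on the stock market",
    "the team won the match and celebrated the victory",
    "the coach praised the players after the match",
]

# Build the TF-IDF weighted document-term matrix
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Truncated SVD reduces the DTM to a small number of latent topics
svd = TruncatedSVD(n_components=2, random_state=42)
doc_topic = svd.fit_transform(X)   # documents expressed in terms of topics

# Read the top words of each topic off the SVD components
terms = vectorizer.get_feature_names_out()
for k, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]
    print(f"Topic {k}:", [terms[i] for i in top])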


Latent Dirichlet Allocation


● In LDA, "latent" indicates the hidden topics present in the data, and "Dirichlet" is the form of distribution used.
● LDA is a Bayesian network, meaning it is a generative statistical model that assumes documents are made up of words that help determine the topics.
● Thus, documents are mapped to a list of topics by assigning each word in the document to different topics.
● This model ignores the order in which words occur in a document and treats them as a bag of words.
● The Dirichlet distribution is different from the normal distribution.


● The normal distribution tells us how the data deviates around the mean, and this differs according to the variance present in the data.
● When the variance is high, the values in the data will be both much smaller and much larger than the mean and can form skewed distributions.
● If the variance is small, samples will be close to the mean, and if the variance is zero, they will lie exactly at the mean.


● Many ML algorithms assume that the data is normally distributed, i.e., follows a Gaussian distribution.
● The normal distribution represents data over the real numbers, whereas the Dirichlet distribution represents data such that the sampled values sum up to 1.
● In other words, the Dirichlet distribution is a probability distribution over a probability simplex (vectors of non-negative values that sum to 1), rather than over the space of real numbers as in the normal distribution.


LDA is based upon two general assumptions:

● Documents that have similar words usually have the same topic.
● Documents that have groups of words frequently occurring together usually have the same topic.

These assumptions make sense because documents on the same topic, for instance business, will share words like "economy", "profit", "stock market", "loss", etc. The second assumption states that if these words frequently occur together in multiple documents, those documents may belong to the same category.
Mathematically, the two assumptions can be represented as:
● Documents are probability distributions over latent topics.
● Topics are probability distributions over words.
A minimal code sketch follows.
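A minimal LDA sketch with scikit-learn; the toy documents and parameters are illustrative (gensim's LdaModel is another common choice):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the economy is facing a slowdown and the stock market fell",
    "profits and losses are reported on the stock market",
    "the team won the match and celebrated the victory",
    "the coach praised the players after the match",
]

# LDA works on raw word counts (bag of words), not TF-IDF
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(X)   # per-document topic distributions (rows sum to 1)

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top = component.argsort()[::-1][:4]
    print(f"Topic {k}:", [terms[i] for i in top])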


Summary

● Topic modeling is a popular natural language processing technique used to create structured data from a collection of unstructured data.
● In other words, the technique enables businesses to learn the hidden semantic patterns portrayed by a text corpus and automatically identify the topics that exist inside it.
● Two popular topic modeling approaches are LSA and LDA.
● They both seek to discover the hidden patterns in text data, but they make different assumptions to achieve their objective.
● Whereas LSA assumes that words with similar meanings will appear in similar documents, LDA assumes that documents are made up of words that aid in determining the topics.


Evaluation Metrics in NLP: Confusion Matrix, Accuracy, Precision, Recall, F1 Score

How do we evaluate the performance of a machine learning model?

Let us consider the task of classifying whether a person is pregnant or not pregnant. If the test for pregnancy is positive (+ve), then the person is pregnant. On the other hand, if the test for pregnancy is negative (-ve), then the person is not pregnant.

Now consider the above classification (pregnant or not pregnant) carried out by a machine learning algorithm. The output of the machine learning algorithm can be mapped to one of the following categories.


1. A person who is actually pregnant (positive) and classified as pregnant (positive). This is called a TRUE POSITIVE (TP).


2. A person who is actually not pregnant (negative) and classified as not pregnant (negative). This is called a TRUE NEGATIVE (TN).


3. A person who is actually not pregnant (negative) and classified as pregnant (positive). This is called a FALSE POSITIVE (FP).


4. A person who is actually pregnant (positive) and classified as not pregnant (negative). This is called a FALSE NEGATIVE (FN).


● What we desire are TRUE POSITIVES and TRUE NEGATIVES, but due to misclassifications we may also end up with FALSE POSITIVES and FALSE NEGATIVES.
● So there is confusion in classifying whether a person is pregnant or not.
● This is because no machine learning algorithm is perfect. We represent this confusion in classifying the data in a matrix called the confusion matrix.


● Now, we select 100 people, which include pregnant women, women who are not pregnant, and men with a fat belly.
● Let us assume that out of these 100 people, 40 are pregnant and the remaining 60 include the women who are not pregnant and the men with a fat belly.
● We now use a machine learning algorithm to predict the outcome.
● The predicted outcome (pregnancy +ve or -ve) produced by the machine learning algorithm is termed the predicted label, and the true outcome (which in this case we know from the doctor's/expert's records) is termed the true label.


Out of 40 pregnant women, 30 are classified correctly and the remaining 10 are classified as not pregnant by the machine learning algorithm. On the other hand, out of the 60 people in the not-pregnant category, 55 are classified as not pregnant and the remaining 5 are classified as pregnant.

In this case, TN = 55, FP = 5, FN = 10, TP = 30. The confusion matrix is as follows.
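(Rows are the actual labels, columns the predicted labels.)

                       Predicted: Not pregnant    Predicted: Pregnant
Actual: Not pregnant          TN = 55                   FP = 5
Actual: Pregnant              FN = 10                   TP = 30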


What is the accuracy of the machine learning model for this classification task?

● Accuracy represents the number of correctly classified data instances over the total number of data instances; in symbols, Accuracy = (TP + TN) / (TP + TN + FP + FN).
● In this example, Accuracy = (55 + 30) / (55 + 5 + 30 + 10) = 0.85, i.e., 85%.


● Is accuracy the best measure?
● Accuracy may not be a good measure if the dataset is not balanced (i.e., the negative and positive classes have different numbers of data instances).


We also measure the precision (positive predictive value) of the classifier in labelling the data instances. Precision is defined as follows:

Precision = TP / (TP + FP)


What does precision mean?

Precision should ideally be 1 (high) for a good classifier. Precision becomes 1 only when the numerator and denominator are equal, i.e., TP = TP + FP, which also means FP is zero. As FP increases, the denominator becomes greater than the numerator and the precision value decreases (which we don't want).

So in the pregnancy example, Precision = 30 / (30 + 5) = 0.857.


Recall, also known as sensitivity or the true positive rate, is defined as follows:

Recall = TP / (TP + FN)

Recall should ideally be 1 (high) for a good classifier. Recall becomes 1 only when the numerator and denominator are equal, i.e., TP = TP + FN, which also means FN is zero. As FN increases, the denominator becomes greater than the numerator and the recall value decreases (which we don't want).


So in the pregnancy example, Recall = 30 / (30 + 10) = 0.75.


● So, ideally, in a good classifier we want both precision and recall to be one, which also means that FP and FN are zero.
● Therefore we need a metric that takes both precision and recall into account. The F1 score is such a metric and is defined as follows:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)


The F1 score becomes 1 only when precision and recall are both 1. It becomes high only when both precision and recall are high, since it is the harmonic mean of precision and recall, and it is a better measure than accuracy when the classes are imbalanced.

In the pregnancy example,

F1 Score = 2 × (0.857 × 0.75) / (0.857 + 0.75) = 0.799.
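These numbers can be reproduced with scikit-learn; the label vectors below are constructed to match the counts in the example (their ordering is illustrative):

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# 1 = pregnant (positive class), 0 = not pregnant
# Labels chosen so that TP = 30, FN = 10, TN = 55, FP = 5
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 30 + [0] * 10 + [0] * 55 + [1] * 5

print(confusion_matrix(y_true, y_pred))               # [[55  5], [10 30]]
print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.85
print("Precision:", precision_score(y_true, y_pred))  # ~0.857
print("Recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1 score :", f1_score(y_true, y_pred))         # ~0.80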


FOCUS ON LEARNING

THANK YOU ☺
