Biomedical IR
Search Engine Architecture [part 3]
Lecture 4
Dr. Ebtsam AbdelHakam
Minia University
Indexing Process
[Figure: the indexing process, showing text acquisition, text transformation, and index creation around a document data store (© Addison Wesley, 2008; Walid Magdy, TTDS 2017/2018)]
• Text acquisition: documents arrive via web crawling, provider feeds, RSS "feeds", or a desktop/email system, and are kept in a document data store (each document gets a unique ID; what can you store? disk space? rights? compression?)
• Text transformation: format conversion, international text, which part contains "meaning"? word units? stopping? stemming?
• Index creation: what data do we want? a lookup table for quickly finding all documents containing a word
Text Transformation (Pre-processing)
• Standard text pre-processing steps:
1. Tokenisation
2. Stop word removal
3. Normalization
4. Stemming
Getting ready for indexing?
• Pre-processing steps before indexing:
• Tokenisation
• Stopping
• Stemming
• Objective: identify the optimal form of the term to be indexed, to achieve the best retrieval performance
Tokenisation
• Tokenizer: A document is converted to a stream of tokens,
e.g. individual words.
• Sentence → tokenization (splitting) → tokens
• A token is an instance of a sequence of characters
• Typical technique: split at non-letter characters (space); see the sketch after this list
• Each such token is now a candidate for an index entry
(term), after further processing
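A minimal sketch of the "split at non-letter characters" technique, using Python's re module (the function name tokenise is illustrative):

```python
import re

def tokenise(text):
    """Split a document into tokens at every run of non-letter characters."""
    return [tok for tok in re.split(r"[^a-zA-Z]+", text) if tok]

print(tokenise("This is a very exciting lecture on the technologies of text!"))
# ['This', 'is', 'a', 'very', 'exciting', 'lecture', 'on', 'the',
#  'technologies', 'of', 'text']
```

Each resulting token becomes a candidate index term after the further processing steps below.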
Stopping: stop words
• Example: "This is a very exciting lecture on the technologies of text"
• Stop words: the most common words in collection
the, a, is, he, she, I, him, for, on, to, very, …
• There are a lot of them ≈ 30-40% of text
• New stop words appear in specific domains
• Tweets: RT “RT @realDonalTrump Mexico will …”
• Patents: said, claim “a said method that extracts ….”
• Stop words influence sentence structure, but have less influence on topic (aboutness)
Stopping: stop words
• Common practice in many applications: remove them (a removal sketch follows below)
• You need them for:
• Phrase queries:
“King of Denmark”, “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
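A minimal sketch of stop-word removal over a token stream; the stop list here is illustrative, real systems use lists of a few hundred words:

```python
# Illustrative stop list (a real one would be much longer).
STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on",
              "to", "very", "of", "this"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = ["This", "is", "a", "very", "exciting", "lecture", "on", "the",
          "technologies", "of", "text"]
print(remove_stop_words(tokens))
# ['exciting', 'lecture', 'technologies', 'text']
```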
Normalisation
• Objective: make words with different surface forms look the same
• Document: “this is my CAR!!”
Query: “car”
should “car” match “CAR”?
• Sentence → tokenisation → tokens → normalisation → terms to be indexed
• Same tokenisation/normalisation steps should be
applied to documents & queries
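A minimal sketch of one common normalisation step, case folding, applied identically to document and query tokens:

```python
def normalise(tokens):
    """Case-fold tokens so that different surface forms look the same."""
    return [tok.lower() for tok in tokens]

doc_terms = normalise(["this", "is", "my", "CAR"])
query_terms = normalise(["car"])
print(query_terms[0] in doc_terms)   # True: "car" now matches "CAR"
```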
Stemming
• Search for: "play"
should it match: "played", "playing", "player"?
• Stemmers attempt to reduce morphological variations
of words to a common stem
• usually involves removing suffixes (in English)
• Many morphological variations of words
• inflectional (plurals, tenses)
• derivational (making verbs nouns etc.)
• In most cases, aboutness does not change
• Can be done at indexing time or as part of query
processing (like stopwords)
• Two basic types
• Dictionary-based: uses lists of related words
• Algorithmic: uses program to determine related words
• Algorithmic stemmers
• suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, windows → window
• Many false negatives: supplies → supplie
• Some false positives: James → Jame
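A minimal sketch of the suffix-s algorithmic stemmer just described, illustrating both the false negatives and false positives mentioned above:

```python
def suffix_s_stem(token):
    """Naive algorithmic stemmer: strip a trailing 's', assuming it marks a plural."""
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token

print(suffix_s_stem("cats"))      # cat
print(suffix_s_stem("lakes"))     # lake
print(suffix_s_stem("supplies"))  # supplie (false negative: does not match "supply")
print(suffix_s_stem("James"))     # Jame    (false positive: not a plural)
```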
Porter Stemmer
• Most common algorithm for stemming English
• Conventions + 5 phases of reductions
• phases applied sequentially
• each phase consists of a set of commands
• sample convention:
of the rules in a compound command, select the one that
applies to the longest suffix.
• Example rules in Porter stemmer (see the sketch after this list)
• sses → ss (processes → process)
• y → i (reply → repli)
• ies → i (replies → repli)
• ement → null (replacement → replac)
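An easy way to try these rules is an off-the-shelf Porter implementation; this sketch assumes the NLTK package is installed (pip install nltk):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["processes", "reply", "replies", "replacement", "playing", "played"]:
    print(word, "->", stemmer.stem(word))
# processes -> process
# reply -> repli
# replies -> repli
# replacement -> replac
# playing -> play
# played -> play
```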
• Irregular verbs (not handled by stemming):
• saw → see
• went → go
• Different spellings
• colour vs. color
• tokenisation vs. tokenization
• Television vs. TV
• Synonyms
• car vs. vehicle
• UK vs. Britain
• Solution: query expansion …
Text pre-processing before IR:
Tokenisation → Stopping → Stemming
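Combining the three steps, a self-contained end-to-end pre-processing sketch (Porter stemming via NLTK, assumed to be installed; the stop list is illustrative):

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "is", "he", "she", "i", "him", "for", "on",
              "to", "very", "of", "this"}
stemmer = PorterStemmer()

def preprocess(text):
    """Tokenise -> stop -> stem, producing the terms to be indexed."""
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]  # tokenisation + case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]               # stopping
    return [stemmer.stem(t) for t in tokens]                          # stemming

print(preprocess("He played a very exciting game"))
# ['play', 'excit', 'game']
```

The same preprocessing must be applied to both documents (at indexing time) and queries (at query time) so that their terms match.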
Index Creation
Storing Document Statistics
‣ The counts and positions of document terms are stored.
‣ Common index types:
1. Forward Index: Key is the document, value is a list of terms
and term positions. Easiest for the crawler to build.
2. Inverted Index: Key is a term, value is a list of documents
and term positions. Provides faster processing at query time.
Forward index
- The rationale behind the forward index is that, as documents are parsed, it is convenient to first store the words per document.
- The forward index is then sorted to transform it into an inverted index (see the sketch after this list).
- The forward index is essentially a list of pairs consisting of a document
and a word, collated by the document.
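A minimal sketch of building a forward index during parsing and collating it into an inverted one; the data layout (plain dicts keyed by document ID and by term) is illustrative, not a production format:

```python
from collections import defaultdict

docs = {
    "d1": ["exciting", "lecture", "text"],
    "d2": ["text", "index", "lecture"],
}

# Forward index: document -> list of (term, position) pairs, built as each document is parsed.
forward_index = {
    doc_id: [(term, pos) for pos, term in enumerate(terms)]
    for doc_id, terms in docs.items()
}

# Collate by term to obtain the inverted index: term -> list of (document, position) pairs.
inverted_index = defaultdict(list)
for doc_id, postings in forward_index.items():
    for term, pos in postings:
        inverted_index[term].append((doc_id, pos))

print(forward_index["d1"])        # [('exciting', 0), ('lecture', 1), ('text', 2)]
print(inverted_index["lecture"])  # [('d1', 1), ('d2', 2)]
```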
Inverted index
• In its simplest form, this index only records whether a word occurs in a particular document; it stores no information about the frequency or position of the word and is therefore considered a boolean index.
• Such an index determines which documents match a query but does
not rank matched documents.
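A minimal sketch of boolean retrieval over such a document-level inverted index; an AND query simply intersects the document sets of the query terms (the set-based layout is an assumption for illustration):

```python
# Document-level (boolean) inverted index: term -> set of documents containing it.
boolean_index = {
    "exciting": {"d1"},
    "lecture":  {"d1", "d2"},
    "text":     {"d1", "d2"},
    "index":    {"d2"},
}

def boolean_and(query_terms, index):
    """Return the (unranked) set of documents containing ALL query terms."""
    doc_sets = [index.get(term, set()) for term in query_terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(boolean_and(["lecture", "text"], boolean_index))    # both d1 and d2 match
print(boolean_and(["exciting", "index"], boolean_index))  # set(): no document has both
```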