Introduction to Information Retrieval
Jian-Yun Nie
University of Montreal
Canada
Outline
What is the IR problem?
How to organize an IR system? (Or: the main processes in IR)
Indexing
Retrieval
System evaluation
Some current research topics
The problem of IR
Goal = find documents relevant to an information need from a large document set
(Diagram: an information need is expressed as a query to the IR system, which searches the document collection and returns an answer list through retrieval.)
Example
(Example: Google searching the Web.)
IR problem
First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
External attributes and internal attribute (content)
Search by external attributes = search in a DB
IR: search by content
Possible approaches
1. String matching (linear search in documents)
- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible; open to further improvements
Indexing-based IR
(Diagram: the document goes through document indexing to a representation (keywords); the query goes through query indexing (query analysis) to its own representation (keywords); the two representations are matched during query evaluation.)
Main problems in IR
Document and query indexing
How to best represent their contents?
Query evaluation (or retrieval process)
To what extent does a document correspond to a query?
System evaluation
How good is a system?
Are the retrieved documents relevant? (precision)
Are all the relevant documents retrieved? (recall)
Document indexing
Goal = find the important meanings and create an internal representation
Factors to consider:
Accuracy to represent meanings (semantics)
Exhaustiveness (cover all the contents)
Ease of manipulation by computer
What is the best representation of contents?
Char. string (char trigrams): not precise enough
Word: good coverage, not precise
Phrase: poor coverage, more precise
Concept: poor coverage, precise
(Spectrum: String – Word – Phrase – Concept, going from high coverage/recall to high accuracy/precision.)
Keyword selection and weighting
How to select important keywords?
Simple method: using middle-frequency words
(Figure: word frequency and informativity plotted against rank; frequency decreases with rank, while informativity is highest between an upper (Max.) and a lower (Min.) cut-off, i.e. for middle-frequency words.)
tf*idf weighting scheme
tf = term frequency
frequency of a term/keyword in a document
The higher the tf, the higher the importance (weight) for the doc.
df = document frequency
no. of documents containing the term
distribution of the term
idf = inverse document frequency
the unevenness of term distribution in the corpus
the specificity of term to a document
The more the term is distributed evenly, the less it is specific to a document
weight(t,D) = tf(t,D) * idf(t)
Some common tf*idf schemes
tf(t, D) = freq(t, D)                          idf(t) = log(N / n)
tf(t, D) = log[freq(t, D)]                     n = #docs containing t
tf(t, D) = log[freq(t, D)] + 1                 N = #docs in corpus
tf(t, D) = freq(t, D) / Maxt′[freq(t′, D)]
weight(t,D) = tf(t,D) * idf(t)
Normalization: cosine normalization, /max, …
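To make the weighting concrete, here is a minimal Python sketch of tf*idf using the log-tf variant listed above; the toy corpus and the function name are illustrative, not part of the slides.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute weight(t, D) = tf(t, D) * idf(t), with tf = log(freq) + 1
    and idf = log(N / n), following the schemes listed above."""
    N = len(docs)
    # n = number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        weights.append({
            t: (math.log(f) + 1) * math.log(N / df[t])
            for t, f in freq.items()
        })
    return weights

# Toy example (already tokenized and stemmed)
docs = [["comput", "architect", "comput"],
        ["comput", "network"],
        ["network", "protocol"]]
print(tf_idf_weights(docs))
```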
Document Length Normalization
Sometimes, additional normalizations are applied, e.g. for document length:
pivoted(t, D) = weight(t, D) / [(1 − slope) × pivot + slope × normalized_weight(t, D)]
(Figure: probability of relevance and probability of retrieval plotted against document length; the two curves cross at the pivot, and the slope controls how strongly the normalization is corrected.)
Stopwords / Stoplist
Function words do not bear useful information for IR
of, in, about, with, I, although, …
Stoplist: contains stopwords that are not to be used as index terms
Prepositions
Articles
Pronouns
Some adverbs and adjectives
Some frequent words (e.g. document)
The removal of stopwords usually improves IR effectiveness
A few “standard” stoplists are commonly used.
Stemming
Reason:
Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
Stemming:
Removing some endings of words
computer, compute, computes, computing, computed, computation → comput
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3): 130–137)
Step 1: plurals and past participles
SSES -> SS: caresses -> caress
(*v*) ING -> (null): motoring -> motor
Step 2: adj->n, n->v, n->adj, …
(m>0) OUSNESS -> OUS: callousness -> callous
(m>0) ATIONAL -> ATE: relational -> relate
Step 3:
(m>0) ICATE -> IC: triplicate -> triplic
Step 4:
(m>1) AL -> (null): revival -> reviv
(m>1) ANCE -> (null): allowance -> allow
Step 5:
(m>1) E -> (null): probate -> probat
(m>1 and *d and *L) -> single letter: controll -> control
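For illustration only, the rules can be tried with an off-the-shelf Porter stemmer; this sketch assumes the NLTK library is installed (its PorterStemmer follows Porter’s 1980 algorithm, with minor revisions).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "motoring", "callousness", "relational",
             "triplicate", "revival", "allowance", "probate", "controlling"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, motoring -> motor; the full algorithm may
# strip further than the single rules shown above.
```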
Lemmatization
Transform to the standard form according to syntactic category.
E.g. verb + ing → verb
     noun + s → noun
Need POS tagging
More accurate than stemming, but needs more resources
It is crucial to choose the stemming/lemmatization rules carefully:
noise vs. recognition rate
a compromise between precision and recall
light/no stemming: −recall, +precision; severe stemming: +recall, −precision
Result of indexing
Each document is represented by a set of weighted keywords (terms):
D1 → {(t1, w1), (t2, w2), …}
e.g. D1 → {(comput, 0.2), (architect, 0.3), …}
     D2 → {(comput, 0.1), (network, 0.5), …}
Inverted file:
comput → {(D1, 0.2), (D2, 0.1), …}
The inverted file is used during retrieval for higher efficiency.
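A minimal sketch (with illustrative names) of building such an inverted file from per-document term weights:

```python
from collections import defaultdict

def build_inverted_file(doc_weights):
    """doc_weights: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    inverted = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, w in weights.items():
            inverted[term].append((doc_id, w))
    return inverted

docs = {"D1": {"comput": 0.2, "architect": 0.3},
        "D2": {"comput": 0.1, "network": 0.5}}
print(build_inverted_file(docs)["comput"])   # [('D1', 0.2), ('D2', 0.1)]
```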
Retrieval
The problems underlying retrieval
Retrieval model
How is a document represented with the
selected keywords?
How are document and query
representations compared to calculate a
score?
Implementation
Cases
1-word query:
The documents to be retrieved are those that include the word
- Retrieve the inverted list for the word
- Sort in decreasing order of the weight of the word
Multi-word query?
- Combine several lists
- How to interpret the weights? (IR model)
IR models
Matching score model
Document D = a set of weighted keywords
Query Q = a set of non-weighted keywords
R(D, Q) = Σi w(ti, D), where ti is in Q.
Boolean model
Document = Logical conjunction of keywords
Query = Boolean expression of keywords
R(D, Q) = D → Q
e.g. D = t1 ∧ t2 ∧ … ∧ tn
Q = (t1 ∧ t2) ∨ (t3 ∧ t4)
D → Q, thus R(D, Q) = 1.
Problems:
R is either 1 or 0 (unordered set of documents)
too many or too few documents are retrieved
End-users cannot manipulate Boolean operators correctly
E.g. documents about kangaroos and koalas
Extensions to Boolean model
(for document ordering)
D = {…, (ti, wi), …}: weighted keywords
Interpretation:
D is a member of class ti to degree wi.
In terms of fuzzy sets: μti(D) = wi
A possible evaluation:
R(D, ti) = μti(D);
R(D, Q1 ∧ Q2) = min(R(D, Q1), R(D, Q2));
R(D, Q1 ∨ Q2) = max(R(D, Q1), R(D, Q2));
R(D, ¬Q1) = 1 − R(D, Q1).
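A small sketch of this fuzzy evaluation, assuming queries are encoded as nested tuples (an illustrative encoding, not from the slides):

```python
def fuzzy_eval(query, doc):
    """doc: {term: weight}; query: a term string, ('and'|'or', q1, q2), or ('not', q)."""
    if isinstance(query, str):                      # R(D, t) = mu_t(D)
        return doc.get(query, 0.0)
    op = query[0]
    if op == "and":
        return min(fuzzy_eval(query[1], doc), fuzzy_eval(query[2], doc))
    if op == "or":
        return max(fuzzy_eval(query[1], doc), fuzzy_eval(query[2], doc))
    if op == "not":
        return 1.0 - fuzzy_eval(query[1], doc)
    raise ValueError(op)

D = {"t1": 0.7, "t2": 0.4, "t3": 0.9}
Q = ("or", ("and", "t1", "t2"), ("not", "t3"))
print(fuzzy_eval(Q, D))   # max(min(0.7, 0.4), 1 - 0.9) = 0.4
```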
Vector space model
Vector space = all the keywords encountered
<t1, t2, t3, …, tn>
Document
D = < a1, a2, a3, …, an>
ai = weight of ti in D
Query
Q = < b1, b2, b3, …, bn>
bi = weight of ti in Q
R(D,Q) = Sim(D,Q)
Matrix representation
            t1    t2    t3    …   tn      (term vector space)
D1          a11   a12   a13   …   a1n     (each row = a document vector in the document space)
D2          a21   a22   a23   …   a2n
D3          a31   a32   a33   …   a3n
…
Dm          am1   am2   am3   …   amn
Q           b1    b2    b3    …   bn
Some formulas for Sim
Dot product:  Sim(D, Q) = Σi (ai * bi)
Cosine:       Sim(D, Q) = Σi (ai * bi) / [√(Σi ai²) * √(Σi bi²)]
Dice:         Sim(D, Q) = 2 Σi (ai * bi) / [Σi ai² + Σi bi²]
Jaccard:      Sim(D, Q) = Σi (ai * bi) / [Σi ai² + Σi bi² − Σi (ai * bi)]
(Figure: D and Q shown as vectors in the plane of t1 and t2; the cosine measures the angle between them.)
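A sketch of these four similarity measures on sparse vectors stored as term→weight dictionaries (helper names are illustrative):

```python
import math

def dot(d, q):
    return sum(w * q.get(t, 0.0) for t, w in d.items())

def cosine(d, q):
    denom = math.sqrt(sum(w * w for w in d.values())) * \
            math.sqrt(sum(w * w for w in q.values()))
    return dot(d, q) / denom if denom else 0.0

def dice(d, q):
    denom = sum(w * w for w in d.values()) + sum(w * w for w in q.values())
    return 2 * dot(d, q) / denom if denom else 0.0

def jaccard(d, q):
    dq = dot(d, q)
    denom = sum(w * w for w in d.values()) + sum(w * w for w in q.values()) - dq
    return dq / denom if denom else 0.0

D = {"t1": 0.2, "t2": 0.5}
Q = {"t1": 1.0, "t3": 1.0}
print(cosine(D, Q), dice(D, Q), jaccard(D, Q))
```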
Implementation (space)
The matrix is very sparse: a few hundred terms for a document, and a few terms for a query, while the term space is large (~100k)
Stored as:
D1 → {(t1, a1), (t2, a2), …}
t1 → {(D1, a1), …}
Implementation (time)
The implementation of VSM with dot product:
Naïve implementation: O(m*n)
Implementation using inverted file:
Given a query = {(t1, b1), (t2, b2)}:
1. find the sets of related documents through the inverted file for t1 and t2
2. calculate the score of the documents for each weighted term
   (t1, b1) → {(D1, a1*b1), …}
3. combine the sets and sum the weights (Σ)
Complexity: O(|Q|*n)
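A sketch of this inverted-file evaluation: only documents that share at least one term with the query are touched. Names and data are illustrative.

```python
from collections import defaultdict

def retrieve(query, inverted):
    """query: {term: weight}; inverted: {term: [(doc_id, weight), ...]}"""
    scores = defaultdict(float)
    for term, b in query.items():                 # for each weighted query term
        for doc_id, a in inverted.get(term, []):  # walk its inverted list
            scores[doc_id] += a * b               # accumulate partial dot products
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

inverted = {"comput": [("D1", 0.2), ("D2", 0.1)],
            "network": [("D2", 0.5)]}
print(retrieve({"comput": 1.0, "network": 0.5}, inverted))
# D2 scores 0.1*1.0 + 0.5*0.5 = 0.35, D1 scores 0.2
```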
Other similarities
Cosine:
Sim(D, Q) = Σi (ai * bi) / [√(Σj aj²) * √(Σj bj²)] = Σi [ai / √(Σj aj²)] * [bi / √(Σj bj²)]
- Use √(Σj aj²) and √(Σj bj²) to normalize the ai and bi weights after indexing
- Then a simple dot product gives the cosine
(Similar operations do not apply to Dice and Jaccard)
Probabilistic model
Given D, estimate P(R|D) and P(NR|D)
P(R|D) = P(D|R) * P(R) / P(D)   (P(D), P(R) constant) → rank documents by P(D|R)
D = {t1 = x1, t2 = x2, …}, with xi = 1 if ti is present, 0 if absent
P(D|R) = Π(ti=xi)∈D P(ti = xi | R)
       = Πti P(ti = 1 | R)^xi * P(ti = 0 | R)^(1−xi) = Πti pi^xi * (1 − pi)^(1−xi)
P(D|NR) = Πti P(ti = 1 | NR)^xi * P(ti = 0 | NR)^(1−xi) = Πti qi^xi * (1 − qi)^(1−xi)
Prob. model (cont’d)
For document ranking
Odd(D) = log [P(D|R) / P(D|NR)] = log Πti [pi^xi (1 − pi)^(1−xi)] / [qi^xi (1 − qi)^(1−xi)]
       = Σti xi log [pi (1 − qi) / (qi (1 − pi))] + Σti log [(1 − pi) / (1 − qi)]
       ∝ Σti xi log [pi (1 − qi) / (qi (1 − pi))]
Prob. model (cont’d)
How to estimate pi and qi? From a set of N relevant and irrelevant sample documents:

              Rel. doc.    Irrel. doc.          Total
with ti       ri           ni − ri              ni
without ti    Ri − ri      N − Ri − ni + ri     N − ni
Total         Ri           N − Ri               N

pi = ri / Ri        qi = (ni − ri) / (N − Ri)
Prob. model (cont’d)
Odd(D) ∝ Σti xi log [pi (1 − qi) / (qi (1 − pi))] = Σti xi log [ri (N − Ri − ni + ri) / ((Ri − ri)(ni − ri))]
Smoothing (Robertson–Sparck-Jones formula):
Odd(D) ∝ Σti xi log [(ri + 0.5)(N − Ri − ni + ri + 0.5) / ((Ri − ri + 0.5)(ni − ri + 0.5))] = Σ(ti ∈ D) wi
When no sample is available:
pi = 0.5
qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
May be implemented as a VSM
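A sketch of the smoothed Robertson–Sparck-Jones weight above; with no relevance samples (Ri = ri = 0) it behaves like an idf, as noted.

```python
import math

def rsj_weight(r, R, n, N):
    """r: rel. docs containing t, R: rel. docs, n: docs containing t, N: sample size."""
    return math.log(((r + 0.5) * (N - R - n + r + 0.5)) /
                    ((R - r + 0.5) * (n - r + 0.5)))

# No relevance information available:
print(rsj_weight(r=0, R=0, n=10, N=1000))   # close to log(N/n) = log(100) ≈ 4.6
```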
BM25
Score(D, Q) = Σ(t ∈ Q) w * [(k1 + 1) tf / (K + tf)] * [(k3 + 1) qtf / (k3 + qtf)] + k2 * |Q| * (avdl − dl) / (avdl + dl)
K = k1 * ((1 − b) + b * dl / avdl)
k1, k2, k3, b: parameters
qtf: query term frequency
dl: document length
avdl: average document length
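A sketch of a BM25 scorer following the formula above, with the k2·|Q| document-length correction term omitted; w is computed as an idf-like RSJ weight, and the parameter values below are common defaults, not prescribed by the slide.

```python
import math

def bm25_score(query_tf, doc_tf, dl, avdl, N, df, k1=1.2, b=0.75, k3=8.0):
    """query_tf/doc_tf: {term: frequency}; df: {term: document frequency}; N: #docs."""
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        w = math.log((N - df[t] + 0.5) / (df[t] + 0.5))   # idf-like RSJ weight
        K = k1 * ((1 - b) + b * dl / avdl)
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

doc_tf = {"information": 3, "retrieval": 2}
print(bm25_score({"information": 1, "retrieval": 1}, doc_tf,
                 dl=100, avdl=120, N=10000,
                 df={"information": 800, "retrieval": 150}))
```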
(Classic) Presentation of results
The query evaluation result is a list of documents, sorted by their similarity to the query.
E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…
System evaluation
Efficiency: time, space
Effectiveness:
To what extent is a system capable of retrieving relevant documents?
Is a system better than another one?
Metrics often used (together):
Precision = retrieved relevant docs / retrieved docs
Recall = retrieved relevant docs / relevant docs
(Venn diagram: the retrieved set and the relevant set overlap in the relevant retrieved documents.)
General form of precision/recall
(Figure: a typical precision/recall curve; precision decreases from near 1.0 as recall increases toward 1.0.)
- Precision changes w.r.t. recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (over 11 points of recall: 0.0, 0.1, …, 1.0)
An illustration of P/R calculation
Ranked list (assume 5 relevant docs in total):
Doc1: relevant   → (recall, precision) = (0.2, 1.0)
Doc2: not rel.   → (0.2, 0.5)
Doc3: relevant   → (0.4, 0.67)
Doc4: relevant   → (0.6, 0.75)
Doc5: not rel.   → (0.6, 0.6)
(Figure: these (recall, precision) points plotted on a graph.)
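A sketch that recomputes the (recall, precision) points of this example and the 11-point average precision mentioned on the previous slide; the relevance flags below encode the ranked list above.

```python
def pr_points(ranked_rel_flags, total_relevant):
    """Return (recall, precision) after each retrieved document."""
    points, rel_so_far = [], 0
    for i, is_rel in enumerate(ranked_rel_flags, start=1):
        rel_so_far += is_rel
        points.append((rel_so_far / total_relevant, rel_so_far / i))
    return points

def eleven_point_avg(points):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    interp = [max((p for r, p in points if r >= lvl), default=0.0) for lvl in levels]
    return sum(interp) / len(interp)

# Doc1..Doc5 from the illustration: 1 = relevant, 0 = not relevant
points = pr_points([1, 0, 1, 1, 0], total_relevant=5)
print([(r, round(p, 2)) for r, p in points])
print(eleven_point_avg(points))
```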
MAP (Mean Average Precision)
MAP = (1/n) ΣQi (1/|Ri|) Σ(Dj ∈ Ri) j / rij
rij = rank of the j-th relevant document for Qi
|Ri| = #rel. doc. for Qi
n = #test queries
E.g. Query 1: relevant docs at ranks 1, 5, 10; Query 2: relevant docs at ranks 4, 8
MAP = 1/2 [ 1/3 (1/1 + 2/5 + 3/10) + 1/2 (1/4 + 2/8) ]
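A sketch of the MAP formula, reproducing the two-query example above (it assumes all relevant documents appear in the ranking):

```python
def mean_average_precision(rel_ranks_per_query):
    """rel_ranks_per_query: for each query, the ranks of its relevant documents."""
    ap_values = []
    for ranks in rel_ranks_per_query:
        # j-th relevant document found at rank r contributes j / r
        ap = sum(j / r for j, r in enumerate(sorted(ranks), start=1)) / len(ranks)
        ap_values.append(ap)
    return sum(ap_values) / len(ap_values)

# Query 1: relevant docs at ranks 1, 5, 10; Query 2: at ranks 4, 8
print(mean_average_precision([[1, 5, 10], [4, 8]]))
# = 1/2 * [1/3*(1/1 + 2/5 + 3/10) + 1/2*(1/4 + 2/8)] ≈ 0.41
```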
Some other measures
Noise = retrieved irrelevant docs / retrieved docs
Silence = non-retrieved relevant docs / relevant docs
Noise = 1 – Precision; Silence = 1 – Recall
Fallout = retrieved irrel. docs / irrel. docs
Single value measures:
F-measure = 2 P * R / (P + R)
Average precision = average at 11 points of recall
Precision at n documents (often used for Web IR)
Expected search length (number of irrelevant documents to read before obtaining n relevant docs)
Test corpus
Compare different IR systems on the same test corpus
A test corpus contains:
A set of documents
A set of queries
Relevance judgments for every document-query pair (desired answers for each query)
The results of a system are compared with the desired answers.
An evaluation example (SMART)
Run number:                          1         2
Num_queries:                         52        52
Total number of documents over all queries:
  Retrieved:                         780       780
  Relevant:                          796       796
  Rel_ret:                           246       229
Recall – Precision Averages:
  at 0.00                            0.7695    0.7894
  at 0.10                            0.6618    0.6449
  at 0.20                            0.5019    0.5090
  at 0.30                            0.3745    0.3702
  at 0.40                            0.2249    0.3070
  at 0.50                            0.1797    0.2104
  at 0.60                            0.1143    0.1654
  at 0.70                            0.0891    0.1144
  at 0.80                            0.0891    0.1096
  at 0.90                            0.0699    0.0904
  at 1.00                            0.0699    0.0904
Average precision for all points:
  11-pt Avg:                         0.2859    0.3092
  % Change:                                    8.2
Recall:
  Exact:                             0.4139    0.4166
  at 5 docs:                         0.2373    0.2726
  at 10 docs:                        0.3254    0.3572
  at 15 docs:                        0.4139    0.4166
  at 30 docs:                        0.4139    0.4166
Precision:
  Exact:                             0.3154    0.2936
  At 5 docs:                         0.4308    0.4192
  At 10 docs:                        0.3538    0.3327
  At 15 docs:                        0.3154    0.2936
  At 30 docs:                        0.1577    0.1468
The TREC experiments
Once per year
A set of documents and queries are distributed to the participants (the standard answers are unknown) (April)
Participants work (very hard) to construct, fine-tune their systems, and submit the answers (1000/query) at the deadline (July)
NIST people manually evaluate the answers and provide correct answers (and classification of IR systems) (July – August)
TREC conference (November)
TREC evaluation methodology
Known document collection (>100K) and query set (50)
Submission of 1000 documents for each query by each participant
Merge the 100 first documents of each participant -> global pool
Human relevance judgment of the global pool
The other documents are assumed to be irrelevant
Evaluation of each system (with 1000 answers)
Partial relevance judgments
But stable for system ranking
Tracks (tasks)
Ad Hoc track: given document collection, different topics
Routing (filtering): stable interests (user profile), incoming document flow
CLIR: Ad Hoc, but with queries in a different language
Web: a large set of Web pages
Question-Answering: When did Nixon visit China?
Interactive: put users into action with system
Spoken document retrieval
Image and video retrieval
Information tracking: new topic / follow up
CLEF and NTCIR
CLEF = Cross-Language Evaluation Forum
for European languages
organized by Europeans
Once per year (March – Oct.)
NTCIR:
Organized by NII (Japan)
For Asian languages
Cycle of 1.5 years
Impact of TREC
Provide large collections for further experiments
Compare different systems/techniques on realistic data
Develop new methodology for system evaluation
Similar experiments are organized in other areas (NLP, machine translation, summarization, …)
Some techniques to improve IR effectiveness
Interaction with the user (relevance feedback)
- Keywords only cover part of the contents
- The user can help by indicating relevant/irrelevant documents
The use of relevance feedback
To improve the query expression:
Qnew = α*Qold + β*Rel_d − γ*NRel_d
where Rel_d = centroid of relevant documents
      NRel_d = centroid of non-relevant documents
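A sketch of this Rocchio-style query modification on sparse term-weight vectors; the α, β, γ values and the toy data are illustrative.

```python
from collections import defaultdict

def centroid(docs):
    """Mean vector of a list of {term: weight} documents."""
    c = defaultdict(float)
    if not docs:
        return c
    for d in docs:
        for t, w in d.items():
            c[t] += w / len(docs)
    return c

def rocchio(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Qnew = alpha*Qold + beta*centroid(rel) - gamma*centroid(nonrel)."""
    q_new = defaultdict(float)
    for t, w in q_old.items():
        q_new[t] += alpha * w
    for t, w in centroid(rel_docs).items():
        q_new[t] += beta * w
    for t, w in centroid(nonrel_docs).items():
        q_new[t] -= gamma * w
    return {t: w for t, w in q_new.items() if w > 0}  # a common choice: drop negative weights

q = {"information": 1.0, "retrieval": 1.0}
rel = [{"information": 0.5, "index": 0.4}]
nonrel = [{"car": 0.8}]
print(rocchio(q, rel, nonrel))
# "information" is boosted, "index" is added, "car" is dropped
```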
Effect of RF
(Figure: relevant documents (R, shown as *) and non-relevant documents (NR, shown as x) scattered around the initial query Q; after feedback, the modified query Qnew is shifted toward the relevant cluster, so the 2nd retrieval returns more relevant documents than the 1st retrieval.)
Modified relevance feedback
Users usually do not cooperate (e.g. AltaVista in early years)
Pseudo-relevance feedback (blind RF)
Using the top-ranked documents as if they are relevant:
Select m terms from the n top-ranked documents
One can usually obtain about 10% improvement
Query expansion
A query contains only part of the important words
Add new (related) terms into the query
Manually constructed knowledge base/thesaurus (e.g. WordNet)
Q = information retrieval
Q’ = (information + data + knowledge + …) (retrieval + search + seeking + …)
Corpus analysis:
Two terms that often co-occur are related (mutual information)
Two terms that co-occur with the same words are also related
Global vs. local context analysis
Global analysis: use the whole document collection to calculate term relationships
Local analysis: use the query to retrieve a subset of documents, then calculate term relationships
Combine pseudo-relevance feedback and term co-occurrences
More effective than global analysis
Some current research topics: Go beyond keywords
Keywords are not perfect representatives of concepts
Ambiguity:
table = data structure, furniture?
Lack of precision:
“operating”, “system” less precise than “operating_system”
Suggested solution
Sense disambiguation (difficult due to the lack of contextual information)
Using compound terms (no complete dictionary of compound terms, variation in form)
Using noun phrases (syntactic patterns + statistics)
Still a long way to go
Theory …
Bayesian networks
P(Q|D)
(Diagram: an inference network with document nodes D1 D2 D3 … Dm linked to term nodes t1 t2 t3 t4 … tn, which link to concept nodes c1 c2 c3 c4 … cl, leading to the query Q; evaluation proceeds by inference and query revision.)
Language models
Logical models
How to describe the relevance relation as a logical relation?
D => Q
What are the properties of this relation?
How to combine uncertainty with a logical framework?
The problem: What is relevance?
Related applications: Information filtering
IR: changing queries on a stable document collection
IF: incoming document flow with stable interests (queries)
yes/no decision (instead of ordering documents)
Advantage: the description of the user’s interest may be improved using relevance feedback (the user is more willing to cooperate)
Difficulty: adjust the threshold to keep/ignore documents
The basic techniques used for IF are the same as those for IR – “two sides of the same coin”
(Diagram: an incoming document stream … doc3, doc2, doc1 enters the IF system, which matches each document against the user profile and decides to keep or ignore it.)
IR for (semi-)structured documents
Using structural information to assign weights to keywords (Introduction, Conclusion, …)
Hierarchical indexing
Querying within some structure (search in title, etc.)
INEX experiments
Using hyperlinks in indexing and retrieval (e.g. Google)
…
PageRank in Google
(Diagram: pages I1 and I2 point to page A, which points to page B.)
PR(A) = (1 − d) + d Σi PR(Ii) / C(Ii)
Assign a numeric value to each page
The more a page is referred to by important pages, the more this page is important
d: damping factor (0.85)
Many other criteria: e.g. proximity of query words
“… information retrieval …” better than “… information … retrieval …”
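A sketch of the PageRank iteration defined by the formula above; the tiny link graph is made up for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it points to]}; returns {page: PR score}."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # pages Ii pointing to p contribute PR(Ii) / C(Ii)
            incoming = [q for q in pages if p in links[q]]
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
        pr = new_pr
    return pr

# Tiny illustrative link graph
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(links))
```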
IR on the Web
No stable document collection (spider, crawler)
Invalid documents, duplication, etc.
Huge number of documents (partial collection)
Multimedia documents
Great variation of document quality
Multilingual problem
…
Final remarks on IR
IR is related to many areas:
NLP, AI, database, machine learning, user modeling, …
library, Web, multimedia search, …
Relatively weak theories
Very strong tradition of experiments
Many remaining (and exciting) problems
Difficult area: intuitive methods do not necessarily improve effectiveness in practice
Why is IR difficult
Vocabulary mismatch
Synonymy: e.g. car vs. automobile
Polysemy: table
Queries are ambiguous; they are partial specifications of the user’s need
Content representation may be inadequate and incomplete
The user is the ultimate judge, but we don’t know how the judge judges…
The notion of relevance is imprecise, context- and user-dependent
But how rewarding it is to gain a 10% improvement!