Introduction to Information Retrieval

Syllabus : Basic concepts of Data Retrieval and Information Retrieval, IR system block diagram, Automatic Text Analysis : Luhn's ideas, Text mining and IR relation, Conflation algorithm, Indexing and Index Term Weighting, Probabilistic Indexing, Automatic Classification, Measures of Association, Different Matching Coefficients, Cluster Hypothesis, Clustering Techniques : Rocchio's Algorithm, Single pass algorithm, Single Link algorithm.
1.1 Basic Concepts of IR

* This is the information era and we handle a vast amount of information. The purpose of maintaining such a huge amount of information is that when we need any piece of information, we should get it as early as possible. We require speedy and accurate access whenever we need it.
* One method to get relevant information is to read all the documents and then decide which are the relevant and which are the non-relevant documents. This is the manual method.
* The second method is the automatic method, in which we store all the information in a computer and ask it to find the relevant information. Information retrieval handles the representation, storage, organization of and access to information items. The representation of information should require less space. The organization of information should be such that the system requires less time to access the items of information which satisfy the user's needs.
1.2 Data Retrieval and Information Retrieval

Q. Differentiate between data retrieval and information retrieval.

* Data retrieval is mainly concerned with determining which documents contain the specific words given in the query. In information retrieval, the user is interested in the information relevant to the query.
* Table 1.2.2 gives the comparison of data retrieval and information retrieval.
Table 1.2.2 : Comparison of data retrieval and information retrieval

Sr. | Parameter           | Data Retrieval (DR)  | Information Retrieval (IR)
1.  | Matching            | Exact match          | Partial match, best match
2.  | Inference           | Deduction            | Induction
3.  | Model               | Deterministic        | Probabilistic
4.  | Classification      | Monothetic           | Polythetic
5.  | Query language      | Artificial           | Natural
6.  | Query specification | Complete             | Incomplete
7.  | Items wanted        | Matching             | Relevant
8.  | Error response      | Sensitive            | Insensitive
1. Matching

* In data retrieval, we normally search for an exact match, e.g. whether a file contains a particular word or not.
* In information retrieval, we normally find documents which partially match the request and then select the best out of them.

2. Inference

* The inference used in data retrieval is deductive, e.g. a → b and b → c, therefore a → c.
* In information retrieval we follow inductive inference; relations are specified with a degree of certainty or uncertainty.
3. Model

* In data retrieval a deterministic model is used: a query either matches a document or it does not.
* As information retrieval uses inductive inference, a probabilistic model gets used for the documents.

4. Classification

* In data retrieval, monothetic classification is used, i.e. one with classes defined by objects possessing attributes that are both necessary and sufficient to belong to a class.
* In information retrieval such monothetic classification is not required; polythetic classification gets used. In such a classification each individual in a class will possess only a proportion of all the attributes possessed by members of that class. Hence no attribute is necessary nor sufficient for membership of a class.
5. Query language

* The query language used in data retrieval is mostly artificial, with restricted syntax and vocabulary, e.g. we can write a query in SQL in its fixed format with fixed words.
* In information retrieval, the query has no restriction related to syntax or vocabulary. The user can provide a query in natural language format. The information retrieval system should be able to handle such queries.
6. Query specification

* As data retrieval finds an exact match and the query follows a restricted format, the query must be complete. The user should provide the exact query for his or her interest.
* In information retrieval, the user can use natural language to specify the query and hence it is possible that a query may be incomplete, e.g. the user may not follow the standard grammar of the language. The information retrieval system can handle such queries.
7. Items wanted

* In data retrieval the user specifies the exact query and hence the list of items will contain those items which exactly match the query.
* In information retrieval, the query gets specified in natural language and the models used for finding the items are probabilistic; the system will find items which are relevant to the query.
* The user then decides on the best one from the listed output.
8. Error response

* In data retrieval, the query must be complete with proper syntax. Hence if there is any error while specifying the query, the meaning will be totally different and we can get wrong items.
* In information retrieval, the query gets specified in natural language, hence some sort of relaxation can be handled by the system.
1.2.1 Text Mining and IR Relation

* Information retrieval is related to text, images, audio, video or object oriented types of information. IR deals with the efficient way of storing the information and with various methods of searching the information based on the user's interest. Handling textual information is a subdomain of IR.
* IR is more to do with search engines, where we have a large amount of information and, based on the user's requirement, specific information is extracted from the collection. IR is a hybrid topic which combines machine learning techniques and natural language processing techniques. Nowadays, the focus of IR is more on search engines.
1.3 Information Retrieval System : Block Diagram

Q. Draw and explain the IR system block diagram.
Q. Draw the IR system block diagram. What is a document representative? Explain with a suitable example.

* An information retrieval system deals with the representation, storage, organization and access of information. Fig. 1.3.1 shows the block diagram of a typical information retrieval system.
* The input for this system is the set of documents which contain the information and the query given by the user. The information retrieval system will find the list of documents which are relevant to the query.

Fig. 1.3.1 : Information Retrieval system block diagram
* The main problem here is to obtain a representation of each document and query suitable for a computer to use. In the first step the documents are converted into their representations. The documents in their natural language form are one of the possible representations. But if we store the documents in natural language format, the space required gets increased, and the time to retrieve the items which are relevant to a query is also large.
* Hence most computer based retrieval systems store only a representation of each document. A document representative could be a list of extracted words considered to be significant. These words are called keywords. The text of a document is lost once it has been converted into its document representation.
* Fig. 1.3.2 shows the logical view of the document. A full text, a set of index terms obtained by automatic or manual indexing, or any intermediate status of the document can be the document representative.
* As the documents get converted into their internal representations, the queries given by the user are also converted in the same fashion.
* The document representations are compared with the query and the output, i.e. the list of relevant documents, is provided to the user.
* The user can stop here or, if the user wants to refine the query, he can provide feedback based on which the query gets modified and the refined results are provided to the user.

1.4 Automatic Text Analysis
* Information retrieval systems are of two types: one is the manual retrieval system and the other is the automatic retrieval system. Here we are discussing the automatic retrieval system.
* In an automatic retrieval system the computer searches for the relevant documents related to the given query.
* Before a computerized information retrieval system can actually operate on the documents to retrieve information, the documents must be stored inside the computer. One way to store the documents is in their natural format, i.e. text format.
* But the disadvantage of this method is that it requires more space in memory to store the documents. And secondly, while searching for the query-relevant documents, the system requires more time.
* The solution for this problem is to find a document representative for each document. It can be a title, an abstract or a list of words from that document. Mostly the lists of words from the documents are used as the document representative.
* The words chosen are those words from the documents which carry the semantics of the document. These words are called keywords.
1.5 Luhn's Ideas

Q. Explain Luhn's idea for understanding the context of the document.
Q. Explain Luhn's idea in detail.

* A document can be represented as a list of words. But here the question arises: which words can be picked from the documents to create the document representative? Luhn has given the basic idea for this selection.
* Luhn states that the frequency of word occurrence in an article furnishes a useful measurement of word significance. The relative position within a sentence of words having given values of significance furnishes a useful measurement for determining the significance of sentences.
* The significance factor of a sentence will therefore be based on a combination of these two measurements.
* In short, he states that the frequency of the words can be used to extract words and sentences to represent a document. Luhn used Zipf's law for stating the idea. Zipf's law states that the product of the frequency of use of words and the rank order is approximately constant.
* Luhn has specified the following idea. Let,
  f : frequency of occurrence of various word types in a given position of text,
  r : the rank order of these words, i.e. the order of their frequency of occurrence.
* Then plot a graph relating f and r, which is like a hyperbolic curve.
* Here we are interested in finding the significant words from the document.
* Fig. 1.5.1 shows a plot of the hyperbolic curve relating f, the frequency of occurrence, and r, the rank order. Luhn has stated two cut-offs: an upper cut-off and a lower cut-off.
* The upper cut-off is used for excluding common words. The words whose frequency is greater than the upper cut-off are the common words. These words do not carry the semantics of the documents; hence these words are not considered in the list.
* The words having less frequency as compared to the lower cut-off are the rare words; hence they get discarded.
* Thus the words whose frequency values lie in the range between the upper and lower cut-off are considered as significant words. These words become part of the document representative.

Fig. 1.5.1 : Luhn's idea (frequency of words plotted against rank order, with the upper and lower cut-offs bounding the significant words)

* There is no thumb rule for deciding the upper and lower cut-off. They have to be established by trial and error.
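The idea can be sketched in a few lines of Python; the cut-off values and sample text below are purely illustrative, since, as noted above, the cut-offs have to be established by trial and error.

```python
from collections import Counter

def significant_words(text, lower_cutoff=2, upper_cutoff=50):
    """Keep only words whose frequency lies between the two cut-offs (Luhn's idea)."""
    freq = Counter(text.lower().split())
    # Words above the upper cut-off are common (function) words; words below
    # the lower cut-off are rare words. Both are discarded.
    return {w for w, f in freq.items() if lower_cutoff <= f <= upper_cutoff}
```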
1.6 Conflation Algorithm

Q. How to generate the document representatives using the conflation algorithm?
Q. Explain steps in the conflation algorithm using a suitable example.
Q. List and explain steps of the conflation algorithm.
Q. You are developing a text processing system for use in an automatic retrieval system. Explain the following parts :
   (i) Removal of high frequency words
   (ii) Suffix stripping
   (iii) Detecting equivalent stems.

* The conflation algorithm is the method to find the document representative. It is based on Luhn's idea.
* The algorithm consists of three parts, as shown in Fig. 1.6.1 :
  1. Removal of high frequency words
  2. Suffix stripping
  3. Detecting equivalent stems

Fig. 1.6.1 : Parts of the Conflation algorithm
1. Removal of high frequency words

* The removal of high frequency words, i.e. 'stop' words or 'fluff' words, is one way of implementing Luhn's upper cut-off. This can be done by comparing each word from the document with a list of high frequency words. High frequency words are those words which come a large number of times in the text.
* These words do not carry the meaning or semantics of the text, e.g. is, am, are, the, etc.
* The advantage of this process is that not only are the non-significant words removed, but the size of the document can also be reduced by 30 to 50%.
2. Suffix stripping

* The second step is suffix stripping. In this step, each word from the output of the first step is handled. If a word has a suffix, the suffix gets removed and the word is converted into its original form.
* For e.g. if the word is 'killed' it will be converted into 'kill'. Other examples are :

  Original word | Word in original form
  1. Processes  | Process
  2. Repeated   | Repeat
  3. Kidding    | Kid
  4. National   | Nation

* Unfortunately, context free removal leads to a significant error rate. For e.g. we may want UAL removed from FACTUAL but not from EQUAL.
* To avoid erroneously removing suffixes some rules can be followed, for e.g. :
  1. The length of the remaining stem must exceed a given number; the default is usually 2.
  2. The stem-ending must satisfy a certain condition, e.g. it does not end with Q.
* For removing the suffixes the rules of grammar of the language can be used. For the English language, Porter's algorithm is one of the algorithms which helps in the removal of suffixes.
* This process is called stemming. An advantage of stemming is that it reduces the size of the text. However, too much stemming is not practical and annoys users.
3. Detecting equivalent stems

* After suffix stripping we will have a list of words. Only one occurrence of each word is kept in the list; for e.g. if the two words 'processing' and 'processed' both get converted into 'process', only one occurrence of 'process' will be part of the list. Each such word is called a 'stem'.
* If two words have the same underlying stem then they refer to the same concept and should be indexed as such. This is obviously an over-simplification, since words with the same stem, such as NEUTRON and NEUTRALISE, sometimes need to be distinguished.
* The final output from a conflation algorithm is a set of classes, one for each stem detected. A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
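A minimal sketch of the three conflation steps in Python; the stop list and the crude suffix rules are hypothetical stand-ins (a real system would use a full stop list and Porter's algorithm):

```python
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "of", "to", "while"}  # hypothetical stop list
SUFFIXES = ("ing", "ed", "es", "s")                                      # crude suffix rules

def strip_suffix(word):
    """Very rough suffix stripping; Porter's algorithm would be used in practice."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) > 2:  # keep stems longer than 2 chars
            return word[: -len(suffix)]
    return word

def conflate(text):
    """Return the set of stem classes that acts as the document representative."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]  # step 1: remove stop words
    stems = [strip_suffix(w) for w in words]                          # step 2: suffix stripping
    return set(stems)                                                 # step 3: one entry per stem

print(conflate("The processes are repeated while processing the data"))
# {'process', 'repeat', 'data'}
```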
* A document representative then becomes a list of class names. These are often referred to as the document's index terms or keywords.
* Queries are also treated in the same way. Thus each query is converted into its query representative.
1.7 Indexing and Index Term Weighting

1.7.1 Indexing

* During 'indexing', documents are prepared for use by the information retrieval system. This means preparing the raw document collection into an easily accessible representation of documents. The transformation from a document text into a representation of the text is known as indexing the document.
* Transforming a document into an indexed form involves the use of :
  o a library or set of regular expressions,
  o parsers,
  o a library of stop words (a stop list),
  o other miscellaneous filters.
* The conflation algorithm is used for converting a document into its representation, i.e. for indexing. Each element, i.e. word, of the index language is referred to as an index term.
* An index language may be pre-coordinate or post-coordinate. In pre-coordinate indexing, the terms are coordinated at the time of indexing: a logical combination of any index terms may be used as a label to identify a class of documents. In post-coordinate indexing, the terms are coordinated at the time of searching: the same class would be identified at search time by combining the classes of documents labelled with the individual index terms.
* The vocabulary of an index language may be controlled or uncontrolled. The former refers to a list of approved index terms that an indexer may use.
* The controls on the language may also include hierarchic relationships between the index terms. Or, one may insist that certain terms can only be used as adjectives. There is really no limit to the kind of syntactic controls one may put on a language.
* The index language which comes out of the conflation algorithm is uncontrolled, post-coordinate and derived. The vocabulary of index terms at any stage in the evolution of the document collection is just the set of all conflation class names.
1.7.2 Index Term Weighting

Q. Describe index term weighting. (May 15, 8 Marks)

* Weighting is the final stage in most information retrieval indexing applications. For each index term a weight value gets assigned which indicates the significance of that index term with respect to the document.
* The two important factors governing the effectiveness of an index language are the exhaustivity of indexing and the specificity of the index language.

Fig. 1.7.2 : Factors governing effectiveness of an index language (1. Indexing exhaustivity, 2. Language specificity)
1. Indexing exhaustivity

* This is defined as the number of different topics indexed. A high level of exhaustivity of indexing leads to high recall and low precision.
* A low level of exhaustivity leads to low recall and high precision. In short, exhaustivity is the number of index terms assigned to the document.

2. Language specificity

* It is the ability of the index language to describe topics precisely. It is the level of precision with which a document is actually indexed.
* High specificity leads to high precision and low recall. Specificity relates to the number of documents to which a given term is assigned in a given collection.

* Refer to Luhn's idea: he described the discrimination power of index terms as a function of the rank order of their frequency of occurrence.
* The highest discrimination power is associated with the middle frequencies.
* Considering this idea, each index term is assigned a weight value which indicates the significance of the index term in the document.
* A frequency count, i.e. how many times the index term comes in the document, can be considered as the weight value.
* Different methods are present to find the weight value of index terms. The first is to assign the frequency count as the weight value.
* Another way is based on the index term distribution in the entire collection.
  Let, N = total number of documents,
       n = the number of documents in which an index term occurs.
  Terms which occur in fewer documents are then weighted more highly, e.g. by a function such as log(N/n).
* If we compare the two methods, the document frequency weighting places emphasis on content description, whereas the second method, i.e. weighting by specificity, attempts to stress the ability of terms to discriminate one document from another.
* Salton and Yang combined both methods of weighting, considering inter-document frequencies and intra-document frequencies.
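As a rough illustration, the sketch below combines the within-document frequency with an inverse collection-frequency factor (tf · log(N/n)); this particular formula is one common choice and is only assumed here, not prescribed by the text above.

```python
import math
from collections import Counter

def term_weights(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    N = len(documents)
    # n_t = number of documents in which term t occurs (inter-document frequency)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # intra-document (within-document) frequency
        weights.append({t: f * math.log(N / doc_freq[t]) for t, f in tf.items()})
    return weights
```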
SetJn to Information Retriev,
Invoduc
X¥ Information Storage and Retrieval (SPPU) 10 7
4 is dh van over the dOCUMENS, thy,
+ By considering both, the total frequency of accurrence @ fs
wot, they concluded many M¥Nds
is how many times it occurs in each document, they seme
ul in eetnieval
ets not very
L.A term with high total frequency of occurre'
distnbution
is skewed.
Yiddle terms are most useful particularly if the distribution = #
5 Nea secon nan the middle frequency ones
ely to be useful but less 50 1h
3. Rare terms with a skewed distribution are like
rom of the list except (oF
the ones with a high tory
4. Very rare terms are also quite useful but come bott
frequency.
* A "good" index term is one which, when assigned as an index term to a collection of documents, renders the documents as dissimilar as possible, whereas a "bad" index term is one which renders the documents more similar.
* This is quantified through a term discrimination value which, for a particular term, measures the increase or decrease in the average dissimilarity between documents on the removal of that term. Therefore, a good term is one whose removal from the collection of documents leads to a decrease in the average dissimilarity, whereas a bad term is one whose removal leads to an increase.
* The idea is that a greater separation between documents will enhance retrieval effectiveness, whereas less separation will depress retrieval effectiveness.
1.8 Probabilistic Indexing

* Probabilistic indexing is based on the probability model for information retrieval.
* This model considers the difference in the distributional behaviour of words as a guide to whether a word should be assigned as an index term.
* The statistical behaviour of 'speciality' words is different from that of 'function' words.
* Function words are closely modelled by a Poisson distribution over all documents, whereas speciality words do not follow a Poisson distribution.
  Let, ω = a function word over a set of texts,
       n = the number of occurrences of the word ω in a text.
* Then f(n), the probability that a text will have n occurrences of the function word ω, is given by

      f(n) = e^(−x) · x^n / n!

* Here x will vary from word to word and, for a given word, should be proportional to the length of the text. We can interpret x as the mean number of occurrences of ω in the set of texts.
* 'Speciality words' are content-bearing, whereas function words are not. A word randomly distributed according to a Poisson distribution is not informative about the document in which it occurs.
* A word which does not follow a Poisson distribution is assumed to indicate that it conveys information as to what a document is about.
* For e.g. 'WAR' is a speciality word: it occurs in documents related to that topic, whereas 'FOR' is a function word, which can be randomly distributed.
* This model also assumes that a document can be about a word to some degree. A document collection can be broken up into subsets, each subset being made up of documents that are about a given word to the same degree.
* A content-bearing word is a word that distinguishes more than one class of documents with respect to the extent to which the topic referred to by the word is treated in the documents in each class.
* These are candidates for index terms. Such content-bearing words can be mechanically detected by measuring the extent to which their distributions deviate from that expected under a Poisson process.
* In this model, the status of one of these content words within a subset of documents of the same 'aboutness' is that of a non-content-bearing word; that is, within the given subset it does not discriminate between further subsets.
* The assumptions based on which a word can be considered as an index term for a document are :
  o The probability that a document will be found relevant to a request for information on a subject is a function of the relative extent to which the topic is treated in the document.
  o The number of tokens (occurrences of the word) in a document is a function of the extent to which the subject referred to by the word is treated in the document.
* The indexing rule based on these assumptions indexes a document with word ω if the probability of relevance exceeds some cost function.
* If there are only two subsets, differing in the extent to which they are about a word ω, then the distribution of ω can be described by a mixture of two Poisson distributions.
Thus,  f(k) = p1 · e^(−x1) · x1^k / k!  +  (1 − p1) · e^(−x2) · x2^k / k!

Here, p1 is the probability of a random document belonging to the first subset, and x1 and x2 are the mean numbers of occurrences of ω in the two classes.

* This model is called the 2-Poisson model. It describes the statistical behaviour of a content-bearing word over two classes which are 'about' that word to different extents. These classes are not necessarily the relevant and non-relevant documents, although by assumption (1) we can calculate the probability of relevance for any document from one of these classes.
* It is the ratio

      p1 · e^(−x1) · x1^k
      ----------------------------------------------
      p1 · e^(−x1) · x1^k + (1 − p1) · e^(−x2) · x2^k

  that is used to make the decision whether to assign an index term ω that occurs k times in a document.
* This ratio is in fact the probability that the particular document belongs to the class which treats ω to an average extent of x1, given that it contains exactly k occurrences of ω.
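A small Python sketch of this decision ratio; the parameter values chosen below (p1, x1, x2) are hypothetical and would in practice be estimated from the collection.

```python
import math

def poisson_pmf(k, x):
    """Probability of k occurrences under a Poisson distribution with mean x."""
    return math.exp(-x) * x ** k / math.factorial(k)

def prob_content_class(k, p1, x1, x2):
    """P(document belongs to the 'about' class | the word occurs k times), 2-Poisson model."""
    numerator = p1 * poisson_pmf(k, x1)
    return numerator / (numerator + (1 - p1) * poisson_pmf(k, x2))

# Hypothetical parameters: 30% of documents treat the word heavily (mean 4
# occurrences); the rest mention it incidentally (mean 0.5 occurrences).
print(prob_content_class(k=3, p1=0.3, x1=4.0, x2=0.5))
```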
1.9 Automatic Classification

* Classification is the process of categorizing the given documents into different groups. Here, we make use of the attributes of the given objects.
* There are two main areas of application of classification methods in information retrieval, as shown in Fig. 1.9.1 :
  (1) Keyword clustering
  (2) Document clustering

Fig. 1.9.1 : Areas of application of classification methods in IR
1. Keyword clustering

* Many automatic retrieval systems rely on thesauri to modify queries and document representatives to improve the chance of retrieving relevant documents. In practice many thesauri are constructed manually.
* They have mainly been constructed in two ways :
  1. Words which are deemed to be about the same topic are linked.
  2. Words which are deemed to be about related things are linked.
2. Document clustering

* Document clustering groups the documents in the collection. The purpose may be to group the documents in such a way that retrieval will be faster, or alternatively it may be to construct a thesaurus automatically.
* Whatever the purpose, the 'goodness' of the classification can finally only be measured by its performance during retrieval.
* Considering the collection, the given documents will be divided into different groups (or subsets). Each group will be considered as a single cluster.
* A document can become part of a particular cluster if it is closely related to the other members of the cluster. Thus a single cluster will contain all those documents which are semantically related to each other.
* The purpose of clustering is to increase the speed of searching. In practice it is not possible to match each analysed document with each analysed search request, because the time consumed by such an operation would be excessive.
* The clustering process has the following steps :
  o For the given collection, using some algorithm, the documents are divided into different groups. Each group or cluster contains the semantically related documents.
  o For each cluster, one cluster representative is decided. The cluster representative may be a document which is semantically near to all other documents of that cluster.
  o When the user fires a query, it is first matched with the cluster representatives. If the query has a relation with a cluster representative, it indicates that the documents which are members of that particular cluster may be part of the answer set with respect to the query.
  o If there is no match, the cluster is not handled further.
  o Once there is a match between the query and a cluster representative, each document from the cluster is checked for a match with the query. The documents which are logically near to the query become part of the answer set.
raeW information Storage and R
eval (SPPU) fn oduction to Information Ri
Thus, using clustenng, the searching is done at two ditt
1
ent levels
Level 1: Comparing query and cluster representative.
2. Level 2: Comparing query and actual document
clustering, docun stated to each other. but more
In clustenng, documents are grouped because they are in some sense related 19 64h ’
he
basically, they are grouped because they are Ikely to be wanted together, and lagueal relanonshp
Means of measuring this likelihood.
The classification of documents can be done manually oF via the intermediate calculation of # me
closeness between documents. The first approach has proved theoretically to be intractable
1.10 Measures of Association

Q. List, with definitions, different measures of association.

* To distribute the objects into different groups, the relationship between each pair of documents is considered. The relationship indicates whether a particular document is semantically near to the second document as compared with the other documents or not.
* The relationship between the documents can be described using three different notions, as shown in Fig. 1.10.1 :
  1. Similarity
  2. Association
  3. Dissimilarity

Fig. 1.10.1 : Ways of describing the relationship between documents
1. Similarity
* The similarity value indicates how near two documents or objects are to each other.

2. Association
* Association is the same as similarity, but the difference is that the objects considered for comparison are characterized by discrete-state attributes.

3. Dissimilarity
* Dissimilarity values show how far apart the objects are. Thus, the similarity value indicates the likeness of two objects. If someone wants to find the group of documents which are similar to each other, the similarity value can be considered.

* For the information retrieval system, we are interested in finding subsets of the given documents. Documents in the collection are described using lists of index terms.
* Here, for information retrieval systems, each pair of documents is considered. Two documents will be similar to each other if they have a larger number of common index terms.
TFIntroduction to Inform,
Yomanon sige ane
wo decent si hag rr a onrin
: ments in the same group
ms then obviously 1
res of association are defined The yy
To define the relation between the objects various measures of assoc
nila attnbutes
‘Measure will be more if two objects have more number of similar attnbut
fering of the association values wo,
1 follows that a cluster method depending only on the rank-ordering of t!
sentical clustering for all these measures,
There are fe measures of association methods. As we are using these measures for information ye,
System, we should consider the representation of the document inside the system.
Here the assumption is each document is represented by a list of keywords. A query is also represented »
list of keywords.
* Each list of keyword is considered as one set.
* Thus the terms which are assumed here are:
2. _X: The list of index terms related to a document eg. Document 1
2, Y: The list of index terms related to a document e.g, Document 2.
3. IX |: The number of index terms present in the Document 1
4 LV [: The number of index terms present in the Document 2
5. 9: Intersection of two sets,
For example :

X : the list of index terms related to Document 1.
  1. Bat      2. Ball
  3. Stump    4. Pen
  5. Pencil   6. Night
  7. Dog      8. Cat
  9. Coat     10. Fur

Y : the list of index terms related to Document 2.
  1. Pencil   2. Paper
  3. Rubber   4. Cat
  5. Mouse    6. Book
  7. fe       8. Nose
  9. Heart    10. Dark

* Here only nouns are considered. In a real scenario the actual index terms may include the original representations of nouns, verbs, etc.

Thus :
  |X| = 10
  |Y| = 10

* Now we will define the different measures of association.
1.11 Different Matching Coefficients

Q. Describe different matching coefficients.
Q. Write a short note on matching coefficients.

Different matching coefficients :
  1. Simple matching coefficient
  2. Dissimilarity coefficients

Fig. 1.11.1 : Different matching coefficients
1.11.1 Simple Matching Coefficient

* It is the number of shared index terms. Thus, we can calculate the simple matching coefficient as

      |X ∩ Y|

* This method does not consider the sizes of X and Y.
* In our example, the common terms are :
  1. Pencil
  2. Cat
* Thus, the value of the simple matching coefficient is 2.
1. Dice's coefficient

      2 |X ∩ Y| / (|X| + |Y|)

   It is twice the number of shared index terms divided by the sum of the sizes of both sets X and Y.

2. Jaccard's coefficient

      |X ∩ Y| / |X ∪ Y|

   It is calculated as the number of shared index terms divided by the size of the union of set X and set Y.

3. Cosine coefficient

      |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))

   The cosine coefficient can be calculated as the number of common index terms divided by the product of the square roots of the sizes of set X and set Y.

4. Overlap coefficient

      |X ∩ Y| / min(|X|, |Y|)

   The overlap coefficient can be calculated as the number of common index terms divided by the size of whichever set, X or Y, has comparatively fewer entries.
* The Dice's coefficient, Jaccard's coefficient, cosine coefficient and overlap coefficient are normalized versions of the simple matching coefficient. The values of these coefficients range from 0 to 1.
* It is necessary that the values of the coefficients be normalized. The following example presents the importance of normalized values.
Let,
  1. S1(X, Y) = |X ∩ Y| : the simple matching coefficient, which is not normalized.
  2. S2(X, Y) = 2 |X ∩ Y| / (|X| + |Y|) : a normalized coefficient.

Case 1
Let,
  1. |X| = 1
  2. |Y| = 1, and
  3. |X ∩ Y| = 1
Then, S1(X, Y) = 1
      S2(X, Y) = 2 · 1 / (1 + 1) = 1

Case 2
Let,
  1. |X| = 10
  2. |Y| = 10
  3. |X ∩ Y| = 1
Then, S1(X, Y) = 1
      S2(X, Y) = 2 |X ∩ Y| / (|X| + |Y|) = 2 · 1 / (10 + 10) = 2/20 = 1/10

* In the first case, both coefficients have the same value, i.e. 1, which indicates that there is an exact match. But in the second case, even though there is only a single common term present in both sets X and Y, coefficient S1 still has value 1, which doesn't reflect any difference between case 1 and case 2, whereas the value of the S2 coefficient is 1/10, which gives a comparatively realistic picture.
Ex. :
Document 1 = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer}
Document 2 = {CPU, VGA, Simulator, OS, Video, USB, Printer, Scanner, Compiler}
Find the similarity between the two documents using the different matching coefficients.
rcW Information Storage and Retrieval (SPPU) Introduction to Intormation Better
soln.
X = Document 1
= CPU, keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer] and
Y = Document 2
= (CPU, VGA, Simulator, OS, Video, USB, Printer, Scanner, Compiler)
Set XY) = (CPU, VGA, USB, Printer) and
(union ¥) = {CPU, keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer, Simulator,
05, video, Scanner, Compiler)
Hence,
IX] = Band}y| =9
kay} = 4
IXuniony] = 13
Following are the similarity coefficients:
(Simple matching coefficient = [XY] =4
¥ .
Dice's coefficient = TTS TH] =4/ +9) =4/27 = 023529812
Xoy
Gi) Jaccard’s coeffi = xv] = 4/13 = 0.30769
(™ Cosine coefficient = —bou. 4/ (282% 3) = 0.472
IxPety i?
, xay .
™ Overlap coefficient = mind Xb 1D =4/8=05
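A quick Python check of the coefficients computed above (Dice's coefficient written with the conventional factor of 2):

```python
import math

def coefficients(x, y):
    """Similarity coefficients between two sets of index terms."""
    common = len(x & y)
    return {
        "simple":  common,
        "dice":    2 * common / (len(x) + len(y)),
        "jaccard": common / len(x | y),
        "cosine":  common / math.sqrt(len(x) * len(y)),
        "overlap": common / min(len(x), len(y)),
    }

doc1 = {"CPU", "Keyboard", "RAM", "VGA", "SMPS", "USB", "CD-ROM", "Printer"}
doc2 = {"CPU", "VGA", "Simulator", "OS", "Video", "USB", "Printer", "Scanner", "Compiler"}
print(coefficients(doc1, doc2))   # dice ≈ 0.47, jaccard ≈ 0.31, cosine ≈ 0.47, overlap = 0.5
```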
1.11.2 Dissimilarity Coefficients

Q. Explain the properties of dissimilarity coefficients used in information retrieval.

* The coefficients explained in the previous section are based on similarity values.
* There are also coefficients defined which are based on dissimilarity values between the documents.

Properties of dissimilarity coefficients

* Any dissimilarity function can be transformed into a similarity function by a simple transformation of the form

      s = (1 + d)^(−1)

  but the reverse is not always true.
* If P is the set of objects to be clustered, a pairwise dissimilarity coefficient D is a function from P × P to the non-negative real numbers.
* D satisfies the following conditions (Fig. 1.11.2) :

1. D(X, Y) ≥ 0 for all X, Y ∈ P
   The dissimilarity coefficient should be non-negative.
2. D(X, X) = 0 for all X ∈ P
   If we find the dissimilarity value by comparing the same document with itself, the dissimilarity coefficient should have value 0, because there is an exact match.
3. D(X, Y) = D(Y, X) for all X, Y ∈ P
   Dissimilarity should not depend on the order in which we compare the documents. Thus the dissimilarity coefficient must be the same between two documents, irrespective of the order of handling.
4. D(X, Y) ≤ D(X, Z) + D(Z, Y) for all X, Y, Z ∈ P
   This is based on the theorem from Euclidean geometry which states that the sum of the lengths of two sides of a triangle is always greater than (or equal to) the length of the third side.

Fig. 1.11.2 : Conditions for a dissimilarity coefficient
Dissimilarity coefficient examples

* Examples of dissimilarity coefficients which satisfy the above conditions are :

1. |X Δ Y| / (|X| + |Y|)

   where |X Δ Y| = |X ∪ Y| − |X ∩ Y| is the symmetric difference of sets X and Y.

   This coefficient is simply related to Dice's coefficient by

       |X Δ Y| / (|X| + |Y|) = 1 − 2 |X ∩ Y| / (|X| + |Y|)

   The same coefficient can be represented in another form. If a document X is represented as a binary string where each entry represents the absence or presence of a keyword, indicated by zero or one in the i-th position, then the above dissimilarity coefficient can be written as

       Σ |x_i − y_i| / (Σ x_i + Σ y_i)

   where the summation is over the total number of different keywords in the document collection.
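The sketch below computes this dissimilarity coefficient for the two example documents used earlier and confirms the stated relation to Dice's coefficient:

```python
def symmetric_difference_dissimilarity(x, y):
    """|X Δ Y| / (|X| + |Y|), which equals 1 - 2|X ∩ Y| / (|X| + |Y|)."""
    return len(x ^ y) / (len(x) + len(y))

doc1 = {"CPU", "Keyboard", "RAM", "VGA", "SMPS", "USB", "CD-ROM", "Printer"}
doc2 = {"CPU", "VGA", "Simulator", "OS", "Video", "USB", "Printer", "Scanner", "Compiler"}
d = symmetric_difference_dissimilarity(doc1, doc2)
print(d, 1 - 2 * len(doc1 & doc2) / (len(doc1) + len(doc2)))   # both = 9/17 ≈ 0.529
```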
eat keywords in the document Collection. —=Wnformation Storage and Retrieval (SPPU) 1» Introd
mbedded in an n-dimensional Euclidian space.
2. Salton considered document representatives as the binary rector
Where. n = Total number of index terms
Thus kay
Maye
Can then be interpreted as the cosine of the angular separation of the two binary rectors % and ¥.
: iY)
+ c0x0
ix %
Where, OY) = inner product
We
I the space is Euclidian then for X= Gy oun)and
Y= Ye Yo)
Aa
Dey,
=1
= ETET
3. Expected mutual information measure

   A measure of association can also be defined based on a probabilistic model. It measures the association between two objects by the extent to which their distributions deviate from stochastic independence. For two discrete probability distributions P(x_i) and P(x_j), the expected mutual information measure is defined as follows :

       I(x_i, x_j) = Σ over the values of x_i and x_j of  P(x_i, x_j) · log [ P(x_i, x_j) / (P(x_i) · P(x_j)) ]
Properties of the function

i.   When x_i and x_j are independent, P(x_i) · P(x_j) = P(x_i, x_j), so I(x_i, x_j) = 0.
ii.  I(x_i, x_j) = I(x_j, x_i), which shows it is symmetric.
iii. It is invariant under one-to-one transformations of the coordinates.

* I(x_i, x_j) is often interpreted as a measure of the statistical information contained in x_i about x_j.
* When we apply this function to measure the association between two index terms, say i and j, then x_i and x_j are binary variables. Thus P(x_i = 1) is the probability of occurrence of term i and similarly P(x_i = 0) is the probability of its non-occurrence. The extent to which two index terms i and j are associated is then measured by I(x_i, x_j). It measures the extent to which their distributions deviate from stochastic independence.
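A small sketch of the expected mutual information measure for two binary variables; the joint probabilities used in the example call are hypothetical.

```python
import math

def emim(p_joint):
    """Expected mutual information measure for two binary variables.

    p_joint[(a, b)] = P(x_i = a, x_j = b) for a, b in {0, 1}.
    """
    p_i = {a: p_joint[(a, 0)] + p_joint[(a, 1)] for a in (0, 1)}   # marginal P(x_i)
    p_j = {b: p_joint[(0, b)] + p_joint[(1, b)] for b in (0, 1)}   # marginal P(x_j)
    total = 0.0
    for (a, b), p in p_joint.items():
        if p > 0:
            total += p * math.log(p / (p_i[a] * p_j[b]))
    return total

# Hypothetical co-occurrence probabilities of two index terms i and j
print(emim({(1, 1): 0.2, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.6}))
```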
4. Information radius

   The dissimilarity between two classes of objects can be defined on the basis of their probability distributions, using a function such as the information radius. For a simple two-point space {1, 0}, let
       P1(1), P1(0) : the probability distribution associated with class I,
       P2(1), P2(0) : the probability distribution associated with class II.
   On the basis of the difference between them, we measure the dissimilarity between class I and class II by the information radius :
   Information radius = u·P1(1)·log [ P1(1) / (u·P1(1) + v·P2(1)) ] + u·P1(0)·log [ P1(0) / (u·P1(0) + v·P2(0)) ]
                      + v·P2(1)·log [ P2(1) / (u·P1(1) + v·P2(1)) ] + v·P2(0)·log [ P2(0) / (u·P1(0) + v·P2(0)) ]

   Here, u and v are positive weights adding to unity.
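A sketch of the information radius for a simple two-point space, assuming hypothetical class distributions and equal weights u = v = 0.5:

```python
import math

def information_radius(p1, p2, u=0.5, v=0.5):
    """Information radius between two distributions over the same finite space.

    p1, p2 : dicts mapping outcomes (here 1 and 0) to probabilities; u + v = 1.
    """
    total = 0.0
    for x in p1:
        mix = u * p1[x] + v * p2[x]          # weighted mixture of the two classes
        if p1[x] > 0:
            total += u * p1[x] * math.log(p1[x] / mix)
        if p2[x] > 0:
            total += v * p2[x] * math.log(p2[x] / mix)
    return total

# Hypothetical two-point distributions for classes I and II
print(information_radius({1: 0.8, 0: 0.2}, {1: 0.3, 0: 0.7}))
```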
Properties

* Under a suitable interpretation, the expected mutual information measure is a special case of the information radius. For e.g., let P1(·) and P2(·) be the two conditional distributions P(x_i / x_j) and P(x_i / x̄_j), and let
      u = P(x_j)
      v = P(x̄_j)
  so that u·P(x_i / x_j) + v·P(x_i / x̄_j) = P(x_i).
  Then the information radius reduces to the expected mutual information measure I(x_i, x_j).

1.12 Cluster Hypothesis
* A basic assumption in retrieval systems is that documents relevant to a request are separated from those which are non-relevant. Closely related documents tend to be relevant to the same requests: the relevant documents are more like one another than they are like the non-relevant documents.
* This can be tested as follows. Compute the association between all pairs of documents :
  (a) both of which are relevant to a request, and
  (b) one of which is relevant and the other non-relevant.
* Based on a set of requests, the relative distributions of relevant–relevant (R-R) and relevant–non-relevant (R-N-R) associations for a collection can be obtained.
* Plotting the relative frequency against the strength of association for two hypothetical collections X and Y, we may get distributions as shown in Fig. 1.12.1.
* In Fig. 1.12.1, R-R is the distribution of relevant–relevant associations and R-N-R is the distribution of relevant–non-relevant associations.
* From the graph we can conclude that :
  (1) The separation for collection X is good, while for Y it is poor.
  (2) The strength of association between relevant documents is greater for X than for Y.
* A linear search ignores the relationships that exist between documents. Hence, structuring a collection in such a way that relevant documents become part of one class will speed up the process of retrieval of the documents.

Fig. 1.12.1 : Relative frequency of R-R and R-N-R associations plotted against strength of association for collections X and Y

* The searching will be more effective, since classes will contain only relevant documents and no non-relevant documents.
* The cluster hypothesis is based on the document descriptions. Hence the objects should be described in such a way that we can increase the distance between the two distributions R-R and R-N-R.
* We want to make it more likely that we will retrieve relevant documents and less likely that we will retrieve non-relevant ones.
* Thus, the cluster hypothesis is a convenient way of expressing the aim of such operations as document clustering. It does not say anything about how the separation is to be exploited.
1.12.1 Clustering in Information Retrieval

* Cluster analysis is a statistical technique used to generate a category structure which fits a set of observations. The groups which are formed should have a high degree of association between members of the same group and a low degree between members of different groups.
* Cluster analysis can be performed on documents in several ways :
  (i)   Documents may be clustered on the basis of the terms that they contain. The aim of this approach is to provide more efficient and more effective retrieval.
  (ii)  Documents may be clustered based on co-occurring citations in order to provide insights into the nature of the literature of a field.
  (iii) Terms may be clustered on the basis of the documents in which they co-occur. This is useful in the construction of a thesaurus or in the enhancement of queries.
2
BF Information storage and Retrieval (SPPU) us
Wy mplemented wath available software packages. it MAY Rave sory
Although cluster analysis can be ea
like
presentation
tributes on which items are to be clustered and their re
available
(© Selecting the at
larity measure from those
© Selecting an appropriate clustering method and sini
wn be expensive in terms of computational resources,
© Creating cluster or cluster hierarchies, which car
idity of the result obtained.
1¢ must be considered
© Assessing the
If the collection to be clustered is dynamic, the requirements for updat
1F the aim is to use the clustered collection as the basis of information retrieval, a method for searc.
clusters or cluster hierarchy must be selected.
Criteria for choosing a clustering method

* While choosing the clustering method, two criteria have been used (Fig. 1.12.2) :
  1. Theoretical soundness
  2. Efficiency

Fig. 1.12.2 : Criteria for choosing a clustering method

1) Theoretical soundness

   The clustering method should satisfy some constraints like :
   (a) The method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i.e. it is stable under growth.
   (b) The method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering.
   (c) The method is independent of the initial ordering of the objects.

2) Efficiency

   The method should be efficient in terms of speed requirement and storage requirement.
1.13 Clustering Algorithm

* Clustering methods are usually categorized according to the type of cluster they produce. Thus, the clustering methods can be categorized as :
  i.  Hierarchical methods
  ii. Non-hierarchical methods
* Hierarchical methods produce the output as an ordered list of clusters, whereas non-hierarchical methods produce unordered lists.
* Another categorization of the methods is :
  i.  methods producing exclusive clusters,
  ii. methods producing overlapping clusters.
* Here are some definitions related to clustering methods. While discussing the different algorithms we will use these terms.

1.13.1 Definitions
1. Cluster : A cluster is an ordered list of objects which have some common characteristics.
2. Distance between two clusters : The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed.
3. Similarity : A similarity measure SIMILAR(d_i, d_j) can be used to represent the similarity between two documents. Typically, similarity is a normalized value which ranges from 0 to 1. Similarity generates a value of 0 for documents exhibiting no agreement among the assigned index terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.
4. Threshold : The lowest input value of similarity required to join two objects in one cluster.
5. Similarity matrix : The similarity between objects calculated by the function SIMILAR(d_i, d_j), represented in the form of a matrix, is called a similarity matrix.
6. Dissimilarity coefficient : The dissimilarity of two clusters is defined to be the distance between them. The smaller the value of the dissimilarity coefficient, the more similar the two clusters are.
7. Cluster representative (seed) : The representative of a cluster. Every incoming object's similarity is compared with the cluster representative. A clustering method can have predetermined parameters like :
   1. the number of clusters desired,
   2. a minimum and maximum size for each cluster,
   3. a threshold value on the matching function, below which an object will not be included in a cluster,
   4. the control of overlap between clusters,
   5. an arbitrarily chosen objective function which is optimized.

Now, let us discuss the different clustering algorithms.
1.14 Rocchio's Algorithm

* Rocchio developed a clustering algorithm in 1966. It was developed on the SMART project.
* Several parameters which are defined as input for this algorithm are as follows :
  o Minimum and maximum number of documents per cluster.
  o A lower bound on the correlation between an item and a cluster, below which an item will not be placed in the cluster. This is a threshold that would be used in the final clean-up phase of unclustered items.
  o A similarity coefficient.
* The algorithm operates in three stages.
* Stage 1 : The algorithm selects (by some criterion) a number of objects as cluster centres. The remaining objects are assigned to the centres or to a rag-bag cluster (a temporary cluster used to accommodate left-over objects). On the basis of the initial assignments, the cluster representatives are computed and all objects are once more assigned to the clusters. The assignment rules are explicitly defined in terms of thresholds on a matching function. The final clusters may overlap (i.e. an object may be assigned to more than one cluster).
* Stage 2 : This is an iterative step. Here, the input parameters can be adjusted so that the resulting classification meets the prior specification of such things as cluster size, etc., more nearly.
* Stage 3 : This is the 'tidying up' stage. After stage 2, the objects which are still unassigned (i.e. not part of any cluster) are forcibly assigned and the overlap between clusters is reduced.
1.15 Single-Pass Algorithm

Process

The single-pass algorithm proceeds as follows :
* The object descriptions are processed serially.
* The first object becomes the cluster representative of the first cluster.
* Each subsequent object is matched against all cluster representatives existing at its processing time.
* A given object is assigned to one cluster (or more, if overlap is allowed) according to some condition on the matching function.
* When an object is assigned to a cluster, the representative for that cluster is recomputed.
* If an object fails a certain test, it becomes the cluster representative of a new cluster.
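A minimal sketch of the single-pass procedure in Python, assuming a precomputed pairwise similarity function; here the first member of each cluster is kept as its representative, matching the worked example below where the representative does not change.

```python
def single_pass(objects, similarity, threshold):
    """Exclusive single-pass clustering.

    objects    : iterable of object identifiers, processed serially
    similarity : function (a, b) -> similarity value in [0, 1]
    threshold  : similarity needed to join an existing cluster
    """
    clusters = []          # each cluster is {"rep": representative, "members": [...]}
    for obj in objects:
        for cluster in clusters:
            if similarity(obj, cluster["rep"]) > threshold:
                cluster["members"].append(obj)
                # The cluster representative could be recomputed here; in the
                # worked example below it stays unchanged.
                break
        else:
            # Object failed the test against every representative: start a new cluster.
            clusters.append({"rep": obj, "members": [obj]})
    return [c["members"] for c in clusters]
```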
Example :

Objects = {1, 2, 3, 4, 5, 6}

Similarity matrix (entry (i, j) is the similarity between objects i and j) :

        1      2      3      4      5
  2 |  0.6
  3 |  0.6 |  0.8
  4 |  0.9 |  0.9 |  0.7
  5 |  0.9 |  0.6 |  0.6 |   –
  6 |  0.5 |  0.5 |  0.9 |  0.5 |  0.5

Threshold : 0.89
Case 1 : Clustering method is exclusive

Process objects from 1 to 6.

1. Object 1
   The first object, i.e. object 1, becomes part of a cluster as well as its cluster representative.
   ∴ C1 = {1}

2. Object 2
   Object 2 is compared with object 1 (as it is the cluster representative of C1) to check whether it can become part of the first cluster.
   Compare similarity (1, 2) with the threshold value :
       0.6 < 0.89
   Object 2 can't become part of cluster 1. Hence a new cluster is created whose cluster representative will be object 2.
   ∴ C1 = {1}
      C2 = {2}
3. Object 3
   a. Similarity (1, 3) < threshold : 0.6 < 0.89
      Hence object 3 can't become part of cluster 1.
   b. Similarity (2, 3) < threshold : 0.8 < 0.89
      Hence object 3 can't become part of cluster 2.
   Create a new cluster whose cluster representative is object 3.
   ∴ C1 = {1}
      C2 = {2}
      C3 = {3}

4. Object 4
   Similarity (1, 4) > threshold : 0.9 > 0.89
   Object 4 becomes part of cluster 1.
   ∴ C1 = {1, 4}
      C2 = {2}
      C3 = {3}
   As a new element is added to the cluster, the cluster representative is calculated again. In this example there is no change; object 1 remains the cluster representative.
change. itIntroduction 19 inform,
Information Storage and Retvioval (SPPU) 26
Object
Check sinarity (1. 5) » threshold
09+ 089
eG = LAS)
G = @
G * a
AS new element is added in cluster 1, cluster representative is calculated again. Based on the similarity yy,
* S objects. All objects are equidistant hence no change in cluster representative
6 Object 6
% Check similarity (2, 6) « threshold
os
B. Check similarity (2, 6) < threshold
os
© Simitarity (3, 6) > threshold
09
Hence, object 6 becomes part of cluster 3
2
G
G
Here again for C,
Thus,
When the clustering method is exclusive,
cover,
0.89
0.89
089
24,5)
{2}
8,6)
y object 3 and 6 are equidistant hence no change in cluster representative.
the output of single pass algorithm is as follows :
a, 4,5)
= @
= 36
Fig. 1.15.1
Case 2 : Clustering method is overlapping

* When the clustering method is overlapping, each object is compared with the cluster representatives of all the clusters.

Process objects from 1 to 6.
1. Object 1
   As there is no cluster present, create a new cluster whose cluster representative is object 1 itself.
   ∴ C1 = {1}

2. Object 2
   a. Compare similarity (1, 2) < threshold : 0.6 < 0.89
   b. Create a new cluster.
   ∴ C1 = {1}
      C2 = {2}

3. Object 3
   a. Compare similarity (1, 3) < threshold : 0.6 < 0.89
   b. Compare similarity (2, 3) < threshold : 0.8 < 0.89
   Create a new cluster.
   ∴ C1 = {1}
      C2 = {2}
      C3 = {3}
4. Object 4
   a. Compare similarity (1, 4) > threshold : 0.9 > 0.89
      Hence object 4 can become part of cluster 1. As the method is overlapping, go on checking object 4's similarity value with all cluster representatives.
   b. Compare similarity (2, 4) > threshold : 0.9 > 0.89
      Object 4 can become part of cluster C2.
   c. Compare similarity (3, 4) < threshold : 0.7 < 0.89
   ∴ C1 = {1, 4}
      C2 = {2, 4}
      C3 = {3}
5. Object 5
   a. Compare similarity (1, 5) > threshold : 0.9 > 0.89
   b. Compare similarity (2, 5) < threshold : 0.6 < 0.89
   c. Compare similarity (3, 5) < threshold : 0.6 < 0.89
   ∴ C1 = {1, 4, 5}
      C2 = {2, 4}
      C3 = {3}

6. Object 6
   a. Compare similarity (1, 6) < threshold : 0.5 < 0.89
   b. Compare similarity (2, 6) < threshold : 0.5 < 0.89
   c. Compare similarity (3, 6) > threshold : 0.9 > 0.89
   ∴ C1 = {1, 4, 5}
      C2 = {2, 4}
      C3 = {3, 6}

Fig. 1.15.2 : Output of the single-pass algorithm (overlapping clustering)
Advantage of the single-pass algorithm :
* Simple to implement.

Disadvantage of the single-pass algorithm :
* The output depends on the sequence in which the objects are handled.
Ex. : Apply the single-pass algorithm, using the dot product as the matching function and a threshold of 10, to the following documents :
   Doc 1 : <1, 2, 0, 0, 1>    Doc 2 : <3, 1, 2, 3, 0>    Doc 3 : <3, 0, 0, 0, 1>
   Doc 4 : <2, 1, 0, 3, 0>    Doc 5 : <2, 2, 1, 5, 1>

Soln. :

Step 1 : Start with Doc 1. As initially no cluster is present, Document 1 introduces cluster 1, i.e. C1. Hence,
   C1 = {Doc 1}
   The centroid of this cluster is <1, 2, 0, 0, 1>.

   C1 = {Doc 1}; Centroid of C1 : <1, 2, 0, 0, 1>
Step 2 : Now we need to make a decision for Doc 2. Either it can become part of the first cluster or it can introduce a new cluster.
   For making the decision, we need to find the similarity between the centroid of the first cluster and Doc 2. Here we will use the dot product for simplicity.
   Centroid of C1 : <1, 2, 0, 0, 1>
   Doc 2 : <3, 1, 2, 3, 0>
   SIM(Doc 2, C1) = 1·3 + 2·1 + 0·2 + 0·3 + 1·0 = 5
   Now compare the threshold value and SIM(Doc 2, C1) : 10 > 5.
   Hence Doc 2 can't become part of the first cluster, and a new cluster will be introduced.
   C1 = {Doc 1}; Centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc 2}; Centroid of C2 : <3, 1, 2, 3, 0>
Step 3 : Make a decision for Doc 3.
   Doc 3 : <3, 0, 0, 0, 1>
   Centroid of C1 : <1, 2, 0, 0, 1>
   SIM(Doc 3, C1) = 3·1 + 0·2 + 0·0 + 0·0 + 1·1 = 4
   Threshold 10 > 4, hence Doc 3 can't be part of C1.
   Now check whether Doc 3 can become part of C2.
   Doc 3 : <3, 0, 0, 0, 1>
   Centroid of C2 : <3, 1, 2, 3, 0>
   SIM(Doc 3, C2) = 3·3 + 0·1 + 0·2 + 0·3 + 1·0 = 9
   Threshold 10 > 9, hence Doc 3 can't become part of cluster 2 either. Hence, introduce a new cluster,
   C3 = {Doc 3}
   C1 = {Doc 1}; Centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc 2}; Centroid of C2 : <3, 1, 2, 3, 0>
   C3 = {Doc 3}; Centroid of C3 : <3, 0, 0, 0, 1>
Step 4 : Now make a decision for Doc 4 : <2, 1, 0, 3, 0>.
   SIM(Doc 4, C1) = 2·1 + 1·2 + 0·0 + 3·0 + 0·1 = 4
   Threshold 10 > 4; Doc 4 cannot become part of C1.
   SIM(Doc 4, C2) = 2·3 + 1·1 + 0·2 + 3·3 + 0·0 = 16
   Threshold 10 < 16, hence Doc 4 becomes part of C2.
   C2 = {Doc 2, Doc 4}
   As a new document is included in the cluster, the centroid of the cluster is recalculated as the average of Doc 2 and Doc 4 :
   Doc 2 : <3, 1, 2, 3, 0>
   Doc 4 : <2, 1, 0, 3, 0>
   Centroid of C2 = <5/2, 2/2, 2/2, 6/2, 0/2> = <2.5, 1, 1, 3, 0>
   Thus the clusters available are :
   C1 = {Doc 1}; Centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc 2, Doc 4}; Centroid of C2 : <2.5, 1, 1, 3, 0>
   C3 = {Doc 3}; Centroid of C3 : <3, 0, 0, 0, 1>
Step 5 : Finally we need to find where Doc 5 will fit.
   Doc 5 : <2, 2, 1, 5, 1>
   SIM(Doc 5, C1) = 1·2 + 2·2 + 0·1 + 0·5 + 1·1 = 7
   Threshold 10 > 7, hence Doc 5 cannot be part of C1.
   SIM(Doc 5, C2) = 2.5·2 + 1·2 + 1·1 + 3·5 + 0·1 = 5 + 2 + 1 + 15 + 0 = 23
   Threshold 10 < 23, hence Doc 5 becomes part of C2.
   C2 = {Doc 2, Doc 4, Doc 5}
   The centroid of C2 is recalculated as the average of Doc 2, Doc 4 and Doc 5 :
   Doc 2 : <3, 1, 2, 3, 0>
   Doc 4 : <2, 1, 0, 3, 0>
   Doc 5 : <2, 2, 1, 5, 1>
   Centroid of C2 = <7/3, 4/3, 3/3, 11/3, 1/3> = <2.33, 1.33, 1, 3.66, 0.33>

Thus finally, we have 3 clusters :
   C1 = {Doc 1}; Centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc 2, Doc 4, Doc 5}; Centroid of C2 : <2.33, 1.33, 1, 3.66, 0.33>
   C3 = {Doc 3}; Centroid of C3 : <3, 0, 0, 0, 1>
1.16 Single Link Algorithm

Q. Show how single link clusters may be derived from the dissimilarity coefficient by thresholding it.

* The single link method is the best known of the hierarchical methods. It operates by joining, at each step, the two most similar objects which are not yet in the same cluster. The name 'single link' refers to the joining of pairs of clusters by the single shortest link between them.
* The dissimilarity coefficient is the basic input to a single-link clustering algorithm. Single link produces as output a hierarchy with associated numerical levels, called a dendrogram.
* The hierarchy is represented by a tree structure. A dendrogram and its respective tree is shown in Fig. 1.16.1.

Fig. 1.16.1 : Dendrogram

Here, {A, B, C, D, E} are the objects and the clusters are :
   At level 1 : {A, B}, {C}, {D}, {E}
   At level 2 : {A, B}, {C, D, E}
   At level 3 : {A, B, C, D, E}

* At each level of the hierarchy a set of classes can be identified. As we move up in the hierarchy, the classes at the lower level are nested in the classes at the higher levels.
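A naive sketch of deriving single-link clusters by thresholding a dissimilarity coefficient; the dissimilarity values below are hypothetical, chosen only to reproduce the three levels of the dendrogram in Fig. 1.16.1.

```python
def single_link_levels(objects, dissimilarity, levels):
    """For each threshold level, merge clusters connected by any link whose
    dissimilarity does not exceed the level (naive single-link by thresholding)."""
    result = {}
    for level in levels:
        clusters = [{o} for o in objects]
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    # Single shortest link between the two clusters decides the merge
                    if any(dissimilarity[a][b] <= level
                           for a in clusters[i] for b in clusters[j]):
                        clusters[i] |= clusters.pop(j)
                        merged = True
                        break
                if merged:
                    break
        result[level] = clusters
    return result

# Hypothetical dissimilarity values for the objects of Fig. 1.16.1
D = {
    "A": {"A": 0, "B": 1, "C": 3, "D": 3, "E": 3},
    "B": {"A": 1, "B": 0, "C": 3, "D": 3, "E": 3},
    "C": {"A": 3, "B": 3, "C": 0, "D": 2, "E": 2},
    "D": {"A": 3, "B": 3, "C": 2, "D": 0, "E": 2},
    "E": {"A": 3, "B": 3, "C": 2, "D": 2, "E": 0},
}
print(single_link_levels(["A", "B", "C", "D", "E"], D, levels=[1, 2, 3]))
# level 1: {A,B},{C},{D},{E}   level 2: {A,B},{C,D,E}   level 3: {A,B,C,D,E}
```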