1. Introduction to Information Retrieval

Syllabus : Basic concepts of Data Retrieval and Information Retrieval, Automatic Text Analysis : Luhn's ideas, Conflation Algorithm, Text Mining and IR relation, IR system block diagram, Indexing and Index Term Weighting, Probabilistic Indexing, Automatic Classification, Measures of Association, Different Matching Coefficients, Cluster Hypothesis, Clustering Techniques : Rocchio's Algorithm, Single Pass Algorithm, Single Link Algorithm.

1.1 Basic Concepts of IR

- This is the information era; we handle a vast amount of information. The purpose of maintaining such a huge amount of information is that when we need any information, we should get it as early as possible. We require speedy and accurate access whenever we need it.
- One method to get relevant information is to read all the documents and then decide which are the relevant and which are the non-relevant documents. This is the manual method. The second method is the automatic method, in which we store all the information in a computer and ask it to find the relevant information.
- Information retrieval handles the representation, storage, organization of and access to information items. The representation of information should require less space. The organization of information should be such that the system requires less time to access the items of information which satisfy the user's needs.

1.2 Data Retrieval and Information Retrieval

Q. Differentiate between data retrieval and information retrieval.

- Data retrieval is mainly concerned with determining which documents contain the specific words in the query. In information retrieval, the user is interested in the information relevant to the query.
- Table 1.2.1 gives the comparison of data retrieval and information retrieval.

Table 1.2.1
Sr. No. | Parameter | Data Retrieval (DR) | Information Retrieval (IR)
1 | Matching | Exact match | Partial match, best match
2 | Inference | Deduction | Induction
3 | Model | Deterministic | Probabilistic
4 | Classification | Monothetic | Polythetic
5 | Query language | Artificial | Natural
6 | Query specification | Complete | Incomplete
7 | Items wanted | Matching | Relevant
8 | Error response | Sensitive | Insensitive

1. Matching
- In data retrieval, we normally search for an exact match, e.g. whether a file contains a particular word or not.
- In information retrieval, we normally find documents which partially match the request and then select the best ones out of them.

2. Inference
- The inference used in data retrieval is deductive, e.g. if a -> b and b -> c then a -> c.
- In information retrieval we follow inductive inference; relations are specified with a degree of certainty or uncertainty.

3. Model
- Data retrieval uses a deterministic model for finding the documents. As information retrieval uses inductive inference, a probabilistic model is used.

4. Classification
- In data retrieval, monothetic classification is used, i.e. one with classes defined by objects possessing attributes which are both necessary and sufficient to belong to a class.
- Such a classification is not suitable in information retrieval; polythetic classification gets used. In such a classification, each individual in a class will possess only a proportion of all the attributes possessed by members of that class. Hence no attribute is necessary nor sufficient for membership of a class.

5. Query language
- The query language which is used in data retrieval is mostly artificial, with restricted syntax and vocabulary. For example,
we can write a query in SQL in its fixed format with fixed keywords.
- In information retrieval, the query has no restriction related to syntax or vocabulary. The user can provide a query in natural language format; the information retrieval system should be able to handle such queries.

6. Query specification
- As data retrieval finds an exact match and the query follows a restricted format, the query must be complete. The user should provide the exact query for the information of interest.
- In information retrieval, the user can use natural language to specify the query, and hence the query may be incomplete, e.g. the user may not follow the standard grammar of the language. The information retrieval system can handle such queries.

7. Items wanted
- In data retrieval, the user specifies the exact query and hence the list of items will contain those items which exactly match the query.
- In information retrieval, the query gets specified in natural language, and as the models used for finding the items are probabilistic, the system will find items which are relevant to the query. The user then decides on the best ones from the listed output.

8. Error response
- In data retrieval, the query must be complete with proper syntax. Hence if there is any error while specifying the query, the meaning will be totally different and we can get wrong items.
- In information retrieval, the query gets specified in natural language, hence some amount of relaxation can be handled by the system.

1.2.1 Text Mining and IR Relation
- Information retrieval is related to text, images, audio, video or object-oriented types of information. IR deals with the efficient storage of the information and various methods of searching the information based on the user's interest.
- Handling textual information is a subdomain of IR. IR has more to do with search engines, where we have a large amount of information and, based on the user's requirement, specific information is extracted from the collection.
- IR is a hybrid topic which combines machine learning techniques and natural language processing techniques. Nowadays, the main focus of IR is on search engines.

1.3 Information Retrieval System : Block Diagram

Q. Draw and explain the IR system block diagram.
Q. Draw the IR system block diagram. What is a document representative? Explain with a suitable example.

- An information retrieval system deals with the representation, storage, organization of and access to information. Fig. 1.3.1 shows the block diagram of a typical information retrieval system.
- The input for this system is the set of documents which contain the information, and the query given by the user. The information retrieval system finds the list of documents which are relevant to the query.

Fig. 1.3.1 : Information Retrieval system block diagram (input : documents and queries; processor; output : relevant documents; feedback from the user)

- The main problem here is to obtain a representation of each document and query suitable for a computer to use. The documents are in natural language form; in the first step the documents are converted into their representations. If we store these documents in natural language format, the space required increases, and the time to retrieve the items which are relevant to a query is also large.
- Hence most computer-based retrieval systems store only a representation of the document.
- A document representative could be, for instance, a list of extracted words considered to be significant. These words are called keywords. Some information about a document is lost once it has been converted into its document representative.
- Fig. 1.3.2 shows the logical view of the document. Through automatic or manual indexing, the full text is reduced to a set of index terms; the full text, the set of index terms, or any intermediate status of the document can be the document representative.

Fig. 1.3.2 : Logical view of a document

- As the document gets converted into its internal representation, the queries given by the user are also converted in the same fashion.
- The query representative is compared with the document representatives, and the output, i.e. the list of relevant documents, is provided to the user.
- The user can stop here, or, if the user wants to refine the query, he can provide feedback, based on which the query gets modified. The modified query is processed again and the output is provided to the user.

1.4 Automatic Text Analysis

- Information retrieval systems are of two types : manual retrieval systems and automatic retrieval systems. Here we are discussing automatic retrieval systems. In an automatic retrieval system, the computer searches for the relevant documents related to the given query.
- Before a computerized information retrieval system can actually operate on the documents to retrieve information, the documents must be stored inside the computer. One way to store the documents is in their natural format, i.e. text format.
- The disadvantage of this method is that it requires more space in memory to store the documents, and, while searching for the query-relevant documents, the system requires more time.
- The solution for this problem is to find the document representative for each document. It can be a title, an abstract or a list of words from that document. Mostly a list of words from the document is used as the document representative.
- The words chosen are those words from the document which carry the semantics of the document. These words are called keywords.

1.5 Luhn's Ideas

Q. Explain Luhn's idea for understanding the context of the document.
Q. Explain Luhn's idea in detail.

- A document can be represented as a list of words. But here the question arises : which words can be picked from the document to create the document representative? Luhn has given the basic idea for this selection.
- Luhn states that the frequency of word occurrence in an article furnishes a useful measurement of word significance. The relative position within a sentence of words having given values of significance furnishes a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements. In short, he states that the frequency of the words can be used to extract words and sentences to represent a document.
- Luhn used Zipf's law for stating the idea. Zipf's law states that the product of the frequency of use of words and their rank order is approximately constant.
- Luhn specified the following idea. Let
  f : the frequency of occurrence of the various word types in a given position of text, and
  r : the rank order of these words, i.e. the order of their frequency of occurrence.
  Then a plot of a graph relating f and r looks like a hyperbolic curve.
- Here we are interested in finding the significant words from the document. Fig. 1.5.1 shows the plot of the hyperbolic curve relating f, the frequency of occurrence, and r, the rank order.
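The frequency-rank behaviour described above is easy to observe on real text. The following is a minimal Python sketch (the sample text is an illustrative assumption, not part of Luhn's original formulation) that counts word frequencies, ranks them, and prints the product f x r, which by Zipf's law stays roughly constant:

```python
from collections import Counter

def frequency_rank_table(text):
    """Count word frequencies and return (word, frequency, rank, f*r) tuples,
    most frequent word first (rank 1)."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()            # sorted by descending frequency
    return [(w, f, r, f * r) for r, (w, f) in enumerate(ranked, start=1)]

if __name__ == "__main__":
    sample = "the cat sat on the mat the dog sat on the rug"   # illustrative text
    for word, f, r, product in frequency_rank_table(sample):
        print(f"{word:>5}  f={f}  r={r}  f*r={product}")
```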
- Luhn stated two cut-offs : an upper cut-off and a lower cut-off.
- The upper cut-off is used for excluding common words. The words whose frequency is greater than the upper cut-off are the common words. These words do not carry the semantics of the documents; hence these words are not considered in the list.
- The words having less frequency as compared to the lower cut-off are the rare words; hence they also get discarded.
- Thus the words whose frequency values lie in the range between the upper and lower cut-off are considered as significant words. These words become part of the document representative.

Fig. 1.5.1 : Luhn's idea — frequency of words plotted against words by rank order, with the upper cut-off, the lower cut-off and the band of significant words

- There is no thumb rule for deciding the upper and lower cut-off. They have to be established by trial and error.

1.6 Conflation Algorithm

Q. How to generate the document representatives using the conflation algorithm?
Q. Explain the steps in the conflation algorithm using a suitable example.
Q. List and explain the steps of the conflation algorithm.
Q. You are developing a text processing system for use in an automatic retrieval system. Explain the following parts : i) Removal of high frequency words ii) Suffix stripping iii) Detecting equivalent stems.

- The conflation algorithm is a method to find the document representative. It is based on Luhn's idea.
- The algorithm consists of three parts, as shown in Fig. 1.6.1 :
  1. Removal of high frequency words
  2. Suffix stripping
  3. Detecting equivalent stems

Fig. 1.6.1 : Parts of the conflation algorithm

1. Removal of high frequency words
- The removal of high frequency words, i.e. 'stop' words or 'fluff' words, is one way of implementing Luhn's upper cut-off. It can be done by comparing each word from the document against a list of high frequency words.
- High frequency words are those words which occur many times in the text. These words do not carry the meaning or semantics of the text; e.g. 'is', 'am', 'are', 'the', etc. are some such words.
- The advantage of this process is that not only are the non-significant words removed, but the size of the document can also be reduced by 30 to 50 %.

2. Suffix stripping
- The second step is suffix stripping. In this step, each word from the output of the first step is handled. If a word has a suffix, the suffix gets removed and the word is converted into its root form.
- For e.g. the word 'killed' will be converted into 'kill'. Other examples are :

  Original word | Word in root form
  Processes | Process
  Repeated | Repeat
  Kidding | Kid
  National | Nation

- Unfortunately, context-free removal of suffixes leads to a significant error rate. For e.g. we may want UAL removed from FACTUAL but not from EQUAL.
- To avoid erroneously removing suffixes, some rules can be followed, for e.g. :
  1. The length of the remaining stem must exceed a given number; the default is usually 2.
  2. The stem-ending must satisfy a certain condition, e.g. it does not end with Q.
- For removing the suffixes, the rules of grammar of the language can be used. For the English language, Porter's algorithm is one of the algorithms which helps in the removal of suffixes.
- This process is called stemming. An advantage of stemming is that it reduces the size of the text. However, too much stemming is not practical and annoys users.

3. Detecting equivalent stems
- After suffix stripping we will have a list of words. Only one occurrence of each word is kept in the list. For e.g., if after suffix stripping the two words 'processing' and 'processed' both get converted into 'process', only one occurrence of 'process' will be part of the list. Each such word is called a 'stem'.
- If two words have the same underlying stem then they refer to the same concept and should be indexed as such. This is obviously an over-simplification, since words with the same stem, such as NEUTRON and NEUTRALISE, sometimes need to be distinguished.
- The final output from a conflation algorithm is a set of classes, one for each stem detected. A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
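The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not Porter's algorithm: the stop list and the suffix list are small illustrative assumptions, and the two rules mentioned above (minimum stem length, stem-ending condition) are applied only in a crude form.

```python
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "and", "of", "to", "in"}  # illustrative stop list
SUFFIXES = ("ing", "ed", "es", "s", "al", "ness")                            # illustrative suffix list

def strip_suffix(word, min_stem_len=2):
    """Remove the first matching suffix, keeping at least min_stem_len characters
    and refusing stems that end with 'q' (a crude form of the rules above)."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len and not stem.endswith("q"):
            return stem
    return word

def conflate(text):
    """Very small conflation sketch: stop-word removal, suffix stripping,
    and keeping a single occurrence of each resulting stem."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]   # step 1
    stems = [strip_suffix(w) for w in words]                           # step 2
    return sorted(set(stems))                                          # step 3: one entry per stem

print(conflate("The system is processing repeated queries and processed documents"))
# ['document', 'process', 'queri', 'repeat', 'system']
```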
- A document representative then becomes a list of class names. These are often referred to as the document's index terms or keywords.
- Queries are also treated in the same way. Thus each query is converted into its query representative.

1.7 Indexing and Index Term Weighting

1.7.1 Indexing
- During 'indexing', documents are prepared for use by the information retrieval system. This means preparing the raw document collection into an easily accessible representation of the documents. The transformation from a document text into a representation of that text is known as indexing.
- Transforming a document into an indexed form involves the use of :
  - a library or set of regular expressions,
  - parsers,
  - a library of stop words (a stop list), and
  - other miscellaneous filters.
- The conflation algorithm is used for converting a document into its representation, i.e. for indexing. Each element, i.e. word, of the index language is referred to as an index term.
- An index language may be pre-coordinate or post-coordinate. In pre-coordinate indexing, the terms are coordinated at the time of indexing : a logical combination of any index terms may be used as a label to identify a class of documents. In post-coordinate indexing, the terms are coordinated at the time of searching : the same class would be identified at search time by combining the classes of documents labelled with the individual index terms.
- The vocabulary of an index language may be controlled or uncontrolled. The former refers to a list of approved index terms that an indexer may use. The controls on the language may also include hierarchic relationships between the index terms, or one may insist that certain terms can only be used as adjectives. There is really no limit to the kind of syntactic controls one may put on a language.
- The index language which comes out of the conflation algorithm is uncontrolled, post-coordinate and derived. The vocabulary of index terms at any stage in the evolution of the document collection is just the set of all conflation class names.

1.7.2 Index Term Weighting

Q. Describe index term weighting. (SPPU : May 15, 8 Marks)

- Weighting is the final stage in most information retrieval indexing applications. For each index term a weight value gets assigned which indicates the significance of that index term with respect to the document.
- The two important factors governing the effectiveness of an index language are the exhaustivity of indexing and the specificity of the index language.
Fig. 1.7.1 : Factors governing the effectiveness of an index language — 1. Indexing exhaustivity 2. Language specificity

Indexing exhaustivity
- This is defined as the number of different topics indexed. A high level of exhaustivity of indexing leads to high recall and low precision. A low level of exhaustivity leads to low recall and high precision. In short, exhaustivity is the number of index terms assigned to the document.

Language specificity
- It is the ability of the index language to describe topics precisely. It is the level of precision with which a document is actually indexed. High specificity leads to high precision and low recall. Specificity relates to the number of documents to which a given term is assigned in a given collection.

- Refer to Luhn's idea : he described the discrimination power of index terms as a function of the rank order of their frequency of occurrence, the highest discrimination power being associated with the middle frequencies.
- Considering this idea, each index term is assigned a weight value which indicates the significance of the index term in the document.
- A frequency count, i.e. how many times the index term occurs in the document, can be considered as the weight value.
- Different methods are available to find the weight value of index terms. The first is to assign the frequency count as the weight value. Another way is based on the distribution of the index term in the entire collection. Let
  N : the total number of documents, and
  n : the number of documents in which an index term occurs.
- If we compare the two methods, document frequency weighting places emphasis on content description, whereas the second method, i.e. weighting by specificity, attempts to stress the ability of terms to discriminate one document from another.
- Salton and Yang combined both methods of weighting, considering inter-document frequencies and intra-document frequencies.
- By considering both, i.e. the total frequency of occurrence of a term over the documents and how many times it occurs in each document, they concluded :
  1. A term with a high total frequency of occurrence is not very useful in retrieval, irrespective of its distribution.
  2. Middle frequency terms are most useful, particularly if the distribution is skewed.
  3. Rare terms with a skewed distribution are likely to be useful, but less so than the middle frequency ones.
  4. Very rare terms are also quite useful but come bottom of the list, except for the ones with a high total frequency.
- A 'good' index term is one which, when assigned as an index term to a collection of documents, renders the documents as dissimilar as possible, whereas a 'bad' index term is one which renders the documents more similar.
- This is quantified through a term discrimination value which, for a particular term, measures the increase or decrease in the average dissimilarity between documents on the removal of that term. Therefore, a good term is one which, on removal from the collection of documents, leads to a decrease in the average dissimilarity, whereas a bad term is one which on removal leads to an increase.
- The idea is that a greater separation between documents will enhance retrieval effectiveness, whereas less separation will depress retrieval effectiveness.
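As one concrete (assumed) way of combining the two ideas above — within-document frequency for content description and the collection-wide ratio N/n for specificity — the following sketch weights each index term of a document by its frequency in that document multiplied by log(N/n). The exact combination used by Salton and Yang is not reproduced here; this only illustrates the principle.

```python
import math
from collections import Counter

def term_weights(doc_terms, collection):
    """Weight each index term of one document by
    (frequency in the document) * log(N / n),
    where N is the number of documents in the collection and
    n is the number of documents containing the term."""
    N = len(collection)
    tf = Counter(doc_terms)
    weights = {}
    for term, f in tf.items():
        n = sum(1 for d in collection if term in d)   # document frequency of the term
        weights[term] = f * math.log(N / n)
    return weights

# Illustrative toy collection of already-conflated documents.
collection = [
    ["retrieval", "index", "term", "weight", "index"],
    ["retrieval", "cluster", "document"],
    ["poisson", "model", "index"],
]
print(term_weights(collection[0], collection))
```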
1.8 Probabilistic Indexing

- Probabilistic indexing is based on the probability model for information retrieval.
- This model considers the difference in the distributional behaviour of words as a guide to whether a word should be assigned as an index term.
- The statistical behaviour of 'speciality' words is different from that of 'function' words. Function words are closely modelled by a Poisson distribution over all documents, whereas speciality words do not follow a Poisson distribution.
- Let w be a function word over a set of texts and n the number of occurrences of the word w. Then f(n), the probability that a text will have n occurrences of the function word w, is given by
    f(n) = e^(-x) * x^n / n!
- x will vary from word to word, and for a given word it should be proportional to the length of the text. We can interpret x as the mean number of occurrences of w in the set of texts.
- 'Speciality words' are content-bearing, whereas function words are not. A word randomly distributed according to a Poisson distribution is not informative about the document in which it occurs. A word which does not follow a Poisson distribution is assumed to indicate that it conveys information as to what the document is about.
- For e.g. 'WAR' is a speciality word : it occurs in the documents that are about it. Whereas 'FOR' is a function word, which can be randomly distributed.
- This model also assumes that a document can be about a word to some degree. A document collection can be broken up into subsets, each subset being made up of documents that are about a given word to the same degree.
- A content-bearing word is a word that distinguishes more than one class of documents with respect to the extent to which the topic referred to by the word is treated in the documents of each class.
- These are the candidates for index terms. These content-bearing words can be mechanically detected by measuring the extent to which their distributions deviate from that expected under a Poisson process.
- In this model, one of these content words may have the status of a non-content-bearing word within a subset of documents; that is, within the given subset it does not discriminate between further subsets.
- The assumptions based on which a word can be considered as an index term for a document are :
  1. The probability that a document will be found relevant to a request for information on a subject is a function of the relative extent to which the topic is treated in the document.
  2. The number of tokens (occurrences of the word) in a document is a function of the extent to which the subject referred to by the word is treated in the document.
- The indexing rule based on these assumptions indexes a document with the word w if the probability exceeds some cost function.
- If there are only two subsets differing in the extent to which they are about a word w, then the distribution of w can be described by a mixture of two Poisson distributions :
    f(k) = p1 * e^(-x1) * x1^k / k! + (1 - p1) * e^(-x2) * x2^k / k!
  Here p1 is the probability of a random document belonging to one of the subsets, and x1 and x2 are the mean occurrences of w in the two classes.
- This model is called the 2-Poisson model. It describes the statistical behaviour of a content-bearing word over two classes which are 'about' that word to different extents. These classes are not necessarily the relevant and non-relevant documents, although by assumption (1) we can calculate the probability of relevance for any document from one of these classes.
- It is the ratio
    p1 * e^(-x1) * x1^k / ( p1 * e^(-x1) * x1^k + (1 - p1) * e^(-x2) * x2^k )
  that is used to make the decision whether to assign an index term w that occurs k times in a document. This ratio is in fact the probability that the particular document belongs to the class which treats w to an average extent of x1, given that it contains exactly k occurrences of w.
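A small sketch of this decision rule follows. The parameters p1, x1 and x2 would in practice be estimated from the collection; the values used here are purely illustrative assumptions. The factor k! cancels between numerator and denominator, so it is omitted.

```python
import math

def prob_elite(k, p1, x1, x2):
    """Probability that a document containing the word k times belongs to the
    class that treats the word to an average extent x1 (2-Poisson model)."""
    a = p1 * math.exp(-x1) * x1 ** k          # contribution of the 'about' class
    b = (1 - p1) * math.exp(-x2) * x2 ** k    # contribution of the other class
    return a / (a + b)

# Illustrative parameters: 30% of documents treat the word heavily (mean 4
# occurrences per document), the rest mention it incidentally (mean 0.5).
p1, x1, x2 = 0.3, 4.0, 0.5
for k in range(6):
    print(k, round(prob_elite(k, p1, x1, x2), 3))
# The document would be indexed with the word when this probability
# exceeds the chosen cost threshold.
```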
1.9 Automatic Classification

- Classification is the process of categorising the given documents into different groups. Here, we make groups of the given objects.
- There are two main areas of application of classification methods in information retrieval, as shown in Fig. 1.9.1 : (1) keyword clustering and (2) document clustering.

Fig. 1.9.1 : Areas of application of classification methods in IR

1. Keyword clustering
- Many automatic retrieval systems rely on thesauri to modify queries and document representatives to improve the chance of retrieving relevant documents. In practice many thesauri are constructed manually.
- They have mainly been constructed in two ways :
  1. Words which are deemed to be about the same topic are linked.
  2. Words which are deemed to be about related things are linked.

2. Document clustering
- Document clustering is to group the documents in the collection. The purpose may be to group the documents in such a way that retrieval will be faster, or alternatively it may be to construct a thesaurus automatically.
- Whatever the purpose, the 'goodness' of the classification can finally only be measured by its performance during retrieval.
- Considering the collection, the given documents are divided into different groups (or subsets). Each group is considered as a single cluster. A document can become part of a particular cluster if it is closely related to the other members of the cluster. Thus a single cluster will contain all those documents which are semantically related to each other.
- The purpose of clustering is to increase the speed of searching. In practice it is not possible to match each analysed document with each analysed search request, because the time consumed by such an operation would be excessive.
- Retrieval using clustering has the following steps (a small sketch of this two-level search is given at the end of this section) :
  - For the given collection, using some algorithm, the documents are divided into different groups. Each group or cluster contains the semantically related documents.
  - For each cluster, one cluster representative is decided. The cluster representative may be a document which is semantically near to all the other documents of that cluster.
  - When a user fires a query, it is first matched with the cluster representatives. If the query has a relation with a cluster representative, it indicates that the documents which are members of that particular cluster may be part of the answer set with respect to the query. If there is no match, the cluster is not handled further.
  - Once there is a match between the query and a cluster representative, each document from that cluster is checked for a match with the query. The documents which are logically near to the query become part of the answer set.
- Thus, using clustering, the searching is done at two different levels :
  Level 1 : Comparing the query with the cluster representatives.
  Level 2 : Comparing the query with the actual documents of the selected clusters.
- In clustering, documents are grouped because they are in some sense related to each other; but, more basically, they are grouped because they are likely to be wanted together, and a logical relationship is the means of measuring this likelihood.
- The classification of documents can be done manually or via the intermediate calculation of a measure of closeness between documents. The first approach has proved theoretically to be intractable.
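The two-level search just described can be sketched as follows. The similarity function, the representation of documents and cluster representatives as keyword sets, and the two thresholds are all illustrative assumptions; any of the matching coefficients defined in the next section could be plugged in instead.

```python
def dice(x, y):
    """Dice's coefficient between two keyword sets (0 when both are empty)."""
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

def cluster_search(query, clusters, cluster_threshold=0.2, doc_threshold=0.2):
    """Level 1: match the query against the cluster representatives.
    Level 2: match the query against the documents of the selected clusters."""
    answer = []
    for representative, documents in clusters:
        if dice(query, representative) >= cluster_threshold:        # level 1
            for doc in documents:
                if dice(query, doc) >= doc_threshold:               # level 2
                    answer.append(doc)
    return answer

# Illustrative clusters: (representative keyword set, member documents).
clusters = [
    ({"cricket", "bat", "ball"}, [{"bat", "ball", "stump"}, {"cricket", "pitch"}]),
    ({"computer", "cpu", "printer"}, [{"cpu", "ram", "printer"}, {"keyboard", "mouse"}]),
]
print(cluster_search({"cpu", "printer", "scanner"}, clusters))
```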
1.10 Measures of Association

Q. List, with definitions, the different measures of association.

- To distribute the objects into different groups, the relationship between each pair of documents is considered. The relationship indicates whether a particular document is semantically nearer to a second document, as compared with the other documents, or not.
- The relationship between documents can be stated using three different methods, as shown in Fig. 1.10.1 : 1. Similarity 2. Association 3. Dissimilarity.

Fig. 1.10.1 : Methods of relating documents

Similarity
- A similarity value indicates how near two documents or objects are to each other.

Association
- Association is the same as similarity, but the difference is that the objects considered for comparison are objects characterized by discrete-state attributes.

Dissimilarity
- Dissimilarity values show how far apart the objects are.

- Thus, the similarity value indicates the likeness of two objects. If someone wants to find a group of documents which are similar to each other, the similarity value can be considered.
- For an information retrieval system, we are interested in finding subsets of the given documents. Documents in the collection are described using lists of index terms. Here, each pair of documents is considered. Two documents will be similar to each other if they have more common index terms; it follows that documents in the same group share many index terms.
- To define the relation between the objects, various measures of association are defined. The measure will be greater if two objects have more similar attributes. It follows that a cluster method depending only on the rank-ordering of the association values would give identical clusterings for all these measures.
- There are five measures of association. As we are using these measures for an information retrieval system, we should consider the representation of the document inside the system. Here the assumption is that each document is represented by a list of keywords; a query is also represented by a list of keywords. Each list of keywords is considered as one set.
- Thus the terms which are assumed here are :
  1. X : the list of index terms related to a document, e.g. Document 1.
  2. Y : the list of index terms related to a document, e.g. Document 2.
  3. |X| : the number of index terms present in Document 1.
  4. |Y| : the number of index terms present in Document 2.
  5. ∩ : intersection of two sets.
- For example :
  X : the list of index terms related to Document 1 — 1. Bat 2. Ball 3. Stump 4. Pen 5. Pencil 6. Night 7. Dog 8. Cat 9. Coat 10. Fur
  Y : the list of index terms related to Document 2 — 1. Pencil 2. Paper 3. Rubber 4. Cat 5. Mouse 6. Book 7. … 8. Nose 9. Heart 10. Dark
- Here only nouns are considered; in a real scenario the actual index terms may be derived from nouns, verbs, etc. Thus |X| = 10 and |Y| = 10.
- Now we will define the different measures of association.

1.11 Different Matching Coefficients

Q. Describe different matching coefficients.
Q. Write a short note on matching coefficients.

Fig. 1.11.2 : Different matching coefficients

1.11.1 Simple Matching Coefficient
- It is the number of shared index terms. Thus, we can calculate the simple matching coefficient as |X ∩ Y|. This method does not consider the sizes of X and Y.
- In our example, the common terms are :
  1. Pencil 2. Cat
  Thus, the value of the simple matching coefficient is 2.

1. Dice's coefficient
    2 |X ∩ Y| / (|X| + |Y|)
  It is twice the number of shared index terms divided by the sum of the sizes of both sets X and Y.

2. Jaccard's coefficient
    |X ∩ Y| / |X ∪ Y|
  It is calculated as the number of shared index terms divided by the size of the union of set X and set Y.

3. Cosine coefficient
    |X ∩ Y| / (|X|^(1/2) * |Y|^(1/2))
  The cosine coefficient is calculated as the number of common index terms divided by the product of the square roots of the sizes of the X set and the Y set.

4. Overlap coefficient
    |X ∩ Y| / min(|X|, |Y|)
  The overlap coefficient is calculated as the number of common index terms divided by the size of whichever of X and Y has comparatively fewer entries.

- The Dice's, Jaccard's, cosine and overlap coefficients are normalized versions of the simple matching coefficient. The values of these coefficients range from 0 to 1.
- It is necessary that the values of the coefficients are normalized. The following example presents the importance of normalized values. Let
    S1(X, Y) = |X ∩ Y| — the simple matching coefficient, which is not normalized
    S2(X, Y) = 2 |X ∩ Y| / (|X| + |Y|) — a normalized coefficient
  Case 1 : Let |X| = 1, |Y| = 1 and |X ∩ Y| = 1. Then S1(X, Y) = 1 and S2(X, Y) = 2 × 1 / (1 + 1) = 1.
  Case 2 : Let |X| = 10, |Y| = 10 and |X ∩ Y| = 1. Then S1(X, Y) = 1 and S2(X, Y) = 2 × 1 / (10 + 10) = 2/20 = 1/10.
- In the first case, both the coefficients have the same value, i.e. 1, which indicates that there is an exact match. In the second case, even though there is only a single common term present in both sets X and Y, coefficient S1 still has value 1, which does not reflect any difference between case 1 and case 2, whereas the value of the S2 coefficient is 1/10, which reflects the real situation comparatively better.

Ex. : Document 1 = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer}; Document 2 = {CPU, VGA, Simulator, OS, Video, USB, Printer, Scanner, Compiler}. Find the similarity between the two documents using different matching coefficients. (SPPU)

Soln. :
  X = Document 1 = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer} and
  Y = Document 2 = {CPU, VGA, Simulator, OS, Video, USB, Printer, Scanner, Compiler}
  X ∩ Y = {CPU, VGA, USB, Printer} and
  X ∪ Y = {CPU, Keyboard, RAM, VGA, SMPS, USB, CD-ROM, Printer, Simulator, OS, Video, Scanner, Compiler}
  Hence, |X| = 8, |Y| = 9, |X ∩ Y| = 4 and |X ∪ Y| = 13.
  The similarity coefficients are :
  (i) Simple matching coefficient = |X ∩ Y| = 4
  (ii) Dice's coefficient = 2 |X ∩ Y| / (|X| + |Y|) = 2 × 4 / (8 + 9) = 8/17 = 0.4706
  (iii) Jaccard's coefficient = |X ∩ Y| / |X ∪ Y| = 4/13 = 0.3077
  (iv) Cosine coefficient = |X ∩ Y| / (|X|^(1/2) * |Y|^(1/2)) = 4 / (√8 × √9) = 0.4714
  (v) Overlap coefficient = |X ∩ Y| / min(|X|, |Y|) = 4/8 = 0.5
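The coefficients above, and the values of the solved example, can be reproduced with a short sketch; the set representation of the two documents is taken directly from the example.

```python
import math

def simple_matching(x, y): return len(x & y)
def dice(x, y):            return 2 * len(x & y) / (len(x) + len(y))
def jaccard(x, y):         return len(x & y) / len(x | y)
def cosine(x, y):          return len(x & y) / math.sqrt(len(x) * len(y))
def overlap(x, y):         return len(x & y) / min(len(x), len(y))

X = {"CPU", "Keyboard", "RAM", "VGA", "SMPS", "USB", "CD-ROM", "Printer"}
Y = {"CPU", "VGA", "Simulator", "OS", "Video", "USB", "Printer", "Scanner", "Compiler"}

for name, coeff in [("simple matching", simple_matching), ("Dice", dice),
                    ("Jaccard", jaccard), ("cosine", cosine), ("overlap", overlap)]:
    print(f"{name:15s} {coeff(X, Y):.4f}")
# simple matching 4.0000, Dice 0.4706, Jaccard 0.3077, cosine 0.4714, overlap 0.5000
```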
1.11.2 Dissimilarity Coefficients

Q. Explain the properties of dissimilarity coefficients used in information retrieval.

- The coefficients explained in the previous section are based on similarity values. There are some coefficients defined which are based on dissimilarity values between the documents.

Properties of dissimilarity coefficients
- Any dissimilarity function can be transformed into a similarity function by a simple transformation of the form
    S = 1 / (1 + D)
  but the reverse is not always true.
- Let P be the set of objects to be clustered. A pairwise dissimilarity coefficient D is a function from P × P to the non-negative real numbers which satisfies the following conditions :
  1. D(X, Y) ≥ 0 for all X, Y ∈ P : the dissimilarity coefficient is non-negative.
  2. D(X, X) = 0 for all X ∈ P : if we find the dissimilarity value by comparing a document with itself, the dissimilarity coefficient should have the value 0, because there is an exact match.
  3. D(X, Y) = D(Y, X) for all X, Y ∈ P : the dissimilarity does not depend on the order in which we handle the documents; it must be the same between two documents regardless of the order of comparison.
  4. D(X, Y) ≤ D(X, Z) + D(Z, Y) for all X, Y, Z ∈ P : this is based on the theorem from Euclidean geometry which states that the sum of the lengths of two sides of a triangle is never less than the length of the third side (the triangle inequality).

Examples of dissimilarity coefficients
- Examples of dissimilarity coefficients which satisfy the above conditions are :

1. D(X, Y) = |X Δ Y| / (|X| + |Y|)
   where X Δ Y = (X ∪ Y) − (X ∩ Y) is the symmetric difference of the sets X and Y.
   This coefficient is simply related to Dice's coefficient by
     D(X, Y) = 1 − 2 |X ∩ Y| / (|X| + |Y|).
   The same coefficient can be represented in another form. If a document is represented as a binary string where each entry indicates the absence or presence of the i-th keyword (indicated by zero or one in the i-th position), then the above dissimilarity coefficient can be written as
     D(X, Y) = Σ |x_i − y_i| / (Σ x_i + Σ y_i)
   where the summation is over the total number of different keywords in the document collection.

2. Salton considered document representatives as binary vectors embedded in an n-dimensional Euclidean space, where n is the total number of index terms. Then
     |X ∩ Y| / (|X|^(1/2) * |Y|^(1/2))
   can be interpreted as the cosine of the angular separation of the two binary vectors X and Y :
     cos θ = (X, Y) / (‖X‖ * ‖Y‖)
   where (X, Y) is the inner product and ‖X‖, ‖Y‖ are the lengths of the vectors. If the space is Euclidean, then for X = (x_1, ..., x_n) and Y = (y_1, ..., y_n),
     (X, Y) = Σ x_i * y_i  and  ‖X‖ = (Σ x_i^2)^(1/2).

Fig. 1.11.3 : Angular separation of two document vectors

3. Expected mutual information measure
- A measure of association can also be defined based on a probabilistic model. It measures the association between two objects by the extent to which their distributions deviate from stochastic independence.
- For two discrete probability distributions P(x_i) and P(x_j), the expected mutual information measure is defined as follows :
    I(x_i, x_j) = Σ over x_i Σ over x_j of P(x_i, x_j) * log [ P(x_i, x_j) / ( P(x_i) * P(x_j) ) ]
- Properties of the function :
  i. When x_i and x_j are independent, P(x_i, x_j) = P(x_i) * P(x_j), so I(x_i, x_j) = 0.
  ii. I(x_i, x_j) = I(x_j, x_i), which shows it is symmetric.
  iii. It is invariant under one-to-one transformations of the coordinates.
- I(x_i, x_j) is often interpreted as a measure of the statistical information contained in x_i about x_j.
- When we apply this function to measure the association between two index terms, say i and j, then x_i and x_j are binary variables. Thus P(x_i = 1) is the probability of occurrence of the term i, and similarly P(x_i = 0) is the probability of its non-occurrence. The extent to which two index terms i and j are associated is then measured by I(x_i, x_j); it measures the extent to which their distributions deviate from stochastic independence.
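For two index terms, the probabilities in the expected mutual information measure can be estimated from a term-document incidence by simple counting. The following sketch does this for a small made-up collection (the collection itself is an illustrative assumption); natural logarithms are used, which only changes the unit of the measure.

```python
import math

def emim(term_i, term_j, docs):
    """Expected mutual information measure between two index terms.
    docs is a list of sets of index terms; probabilities are estimated
    as relative frequencies over the collection."""
    N = len(docs)
    total = 0.0
    for a in (0, 1):                      # presence/absence of term_i
        for b in (0, 1):                  # presence/absence of term_j
            p_ab = sum(1 for d in docs
                       if (term_i in d) == bool(a) and (term_j in d) == bool(b)) / N
            p_a = sum(1 for d in docs if (term_i in d) == bool(a)) / N
            p_b = sum(1 for d in docs if (term_j in d) == bool(b)) / N
            if p_ab > 0:                  # empty cells contribute nothing
                total += p_ab * math.log(p_ab / (p_a * p_b))
    return total

# Illustrative collection: 'bat' and 'ball' always co-occur,
# while 'pen' is distributed independently of 'bat'.
docs = [{"bat", "ball", "pen"}, {"bat", "ball"}, {"mouse", "keyboard", "pen"}, {"mouse", "cpu"}]
print(round(emim("bat", "ball", docs), 3))   # ~0.693: strongly associated terms
print(round(emim("bat", "pen", docs), 3))    # 0.0: independent terms
```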
Information radius
- The dissimilarity between two classes of objects can be defined with a function of their probability distributions. For a simple two-point space {1, 0}, let
    P_1(1), P_1(0) : the probability distribution associated with class I, and
    P_2(1), P_2(0) : the probability distribution associated with class II.
  On the basis of the difference between them, we measure the dissimilarity between class I and class II by the information radius :
    R = u P_1(1) log[ P_1(1) / (u P_1(1) + v P_2(1)) ] + u P_1(0) log[ P_1(0) / (u P_1(0) + v P_2(0)) ]
      + v P_2(1) log[ P_2(1) / (u P_1(1) + v P_2(1)) ] + v P_2(0) log[ P_2(0) / (u P_1(0) + v P_2(0)) ]
  where u and v are positive weights adding to unity.

Properties
- Under some interpretation, the expected mutual information measure is a special case of the information radius. For e.g., let P_1(·) and P_2(·) be the two conditional distributions P(x_i / w_1) and P(x_i / w_2), and let
    u = P(w_1), v = P(w_2),  so that  P(x_i) = P(x_i / w_1) P(w_1) + P(x_i / w_2) P(w_2).
  Then for the information radius we arrive at the expected mutual information measure I(x_i, w).

1.12 Cluster Hypothesis

- Closely related documents tend to be relevant to the same requests.
- A basic assumption in retrieval systems is that documents relevant to a request are separated from those which are non-relevant. The relevant documents are more like one another than like the non-relevant documents.
- This can be tested as follows. Compute the association between all pairs of documents :
  (a) both of which are relevant to a request, and
  (b) one of which is relevant and the other non-relevant.
- Based on a set of requests, the relative distributions of relevant–relevant (R–R) and relevant–non-relevant (R–N-R) associations of a collection can be defined. Plotting the relative frequency against the strength of association for two hypothetical collections X and Y, we may get distributions as shown in Fig. 1.12.1.

Fig. 1.12.1 : Relative frequency vs. strength of association for collections X and Y (R–R and R–N-R distributions)

- In Fig. 1.12.1, R–R is the distribution of relevant–relevant associations and R–N-R is the distribution of relevant–non-relevant associations. From the graph we can conclude that :
  (1) The separation for collection X is good, while for Y it is poor.
  (2) The strength of association between relevant documents is greater for X than for Y.
- A linear search ignores the relationships that exist between documents. Hence, structuring a collection in such a way that relevant documents become part of one class will speed up the process of retrieval of the documents. The searching will be more effective, since classes will contain only relevant documents and no non-relevant documents.
- The cluster hypothesis is based on the document descriptions. Hence the objects should be described in such a way that we can increase the distance between the two distributions R–R and R–N-R. We want to make it more likely that we will retrieve relevant documents and less likely that we will retrieve non-relevant ones.
- Thus, the cluster hypothesis is a convenient way of expressing the aim of such operations as document clustering. It does not say anything about how the separation is to be exploited.

1.12.1 Clustering in Information Retrieval

- Cluster analysis is a statistical technique used to generate a category structure which fits a set of observations. The groups which are formed should have a high degree of association between members of the same group and a low degree between members of different groups.
- Cluster analysis can be performed on documents in several ways :
  (i) Documents may be clustered on the basis of the terms that they contain. The aim of this approach is to provide more efficient and more effective retrieval.
  (ii) Documents may be clustered based on co-occurring citations, in order to provide insights into the nature of the literature of a field.
  (iii) Terms may be clustered on the basis of the documents in which they co-occur. This is useful in the construction of a thesaurus or in the enhancement of queries.
- Although cluster analysis can be easily implemented with available software packages, it may have some problems, like :
  - selecting the attributes on which items are to be clustered and their representation,
  - selecting an appropriate clustering method and similarity measure from those available,
  - creating clusters or cluster hierarchies, which can be expensive in terms of computational resources, and
  - assessing the validity of the result obtained.
- If the collection to be clustered is dynamic, the requirements for updating must be considered.
- If the aim is to use the clustered collection as the basis of information retrieval, a method for searching the clusters or the cluster hierarchy must be selected.

Criteria for choosing a clustering method
While choosing the clustering method, two criteria have been used, as shown in Fig. 1.12.2 : 1. Theoretical soundness 2. Efficiency.

Fig. 1.12.2 : Criteria for choosing a clustering method

1) Theoretical soundness
The clustering method should satisfy constraints like :
  (a) The method produces a clustering which is unlikely to be altered drastically when further objects are incorporated, i.e. it is stable under growth.
  (b) The method is stable in the sense that small errors in the description of the objects lead to small changes in the clustering.
  (c) The method is independent of the initial ordering of the objects.

2) Efficiency
The method should be efficient in terms of speed requirement and storage requirement.

1.13 Clustering Algorithms

- Clustering methods are usually categorized according to the type of cluster they produce. Thus, the clustering methods can be categorized as hierarchical methods and non-hierarchical methods. Hierarchical methods produce the output as an ordered list of clusters, whereas non-hierarchical methods produce unordered lists.
- Another categorization of the methods is :
  i. methods producing exclusive clusters, and
  ii. methods producing overlapping clusters.

1.13.1 Definitions

Here are some definitions related to clustering methods; while discussing the different algorithms we will use these terms.
- Cluster : A cluster is an ordered list of objects which have some common characteristics.
- Distance between two clusters : The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed.
- Similarity : A similarity measure SIMILAR(d_i, d_j) can be used to represent the similarity between two documents. Typically, similarity is a normalized value which ranges from 0 to 1. Similarity generates a value of 0 for documents exhibiting no agreement among the assigned index terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.
- Threshold : The lowest possible value of similarity required to join two objects in one cluster.
- Similarity matrix : The similarities between objects calculated by the function SIMILAR(d_i, d_j), represented in the form of a matrix, form a similarity matrix.
- Dissimilarity coefficient : The dissimilarity of two clusters is defined to be the distance between them. The smaller the value of the dissimilarity coefficient, the more similar the two clusters are.
- Cluster representative (seed) : The representative of the cluster. Every incoming object's similarity is compared with the cluster representative.
- A clustering method can have predetermined parameters like :
  1. The number of clusters desired.
  2. A minimum and maximum size for each cluster.
  3. A threshold value on the matching function, below which an object will not be included in a cluster.
  4. The control of overlap between clusters.
  5. An arbitrarily chosen objective function which is optimized.
Now, let us discuss different clustering algorithms.

1.14 Rocchio's Algorithm

- Rocchio developed a clustering algorithm in 1966; it was developed on the SMART project.
- Several parameters which are defined as the input for this algorithm are as follows :
  - Minimum and maximum documents per cluster.
  - Lower bound on the correlation between an item and a cluster, below which an item will not be placed in the cluster. This is a threshold used in the final clean-up phase of unclustered items.
  - Similarity coefficient.
- The algorithm operates in three stages.
- Stage 1 : The algorithm selects (by some criterion) a number of objects as cluster centres. The remaining objects are assigned to the centres or to a rag-bag cluster (a temporary cluster used to hold the left-over objects). On the basis of the initial assignment, the cluster representatives are computed and all objects are once more assigned to the clusters. The assignment rules are explicitly defined in terms of thresholds on a matching function. The final clusters may overlap (i.e. an object may be assigned to more than one cluster).
- Stage 2 : This is an iterative step. Here, the input parameters can be adjusted so that the resulting classification meets the prior specification of such things as cluster size, etc., more nearly.
- Stage 3 : This is the 'tidying up' stage. After stage 2, the objects which are unassigned (i.e. not part of any cluster) are forcibly assigned, and overlap between clusters is reduced.

1.15 Single-Pass Algorithm

Process
The single-pass algorithm proceeds as follows :
- The object descriptions are processed serially.
- The first object becomes the cluster representative of the first cluster.
- Each subsequent object is matched against all cluster representatives existing at its processing time.
- A given object is assigned to one cluster (or more, if overlap is allowed) according to some condition on the matching function.
- When an object is assigned to a cluster, the representative for that cluster is recomputed.
- If an object fails a certain test, it becomes the cluster representative of a new cluster.

Example
Objects = {1, 2, 3, 4, 5, 6}; threshold = 0.89.
Similarity matrix (only the similarities to objects 1, 2 and 3 are needed in the walkthrough below, since only these objects become cluster representatives) :

        1     2     3
  2    0.6
  3    0.6   0.8
  4    0.9   0.9   0.7
  5    0.9   0.6   0.6
  6    0.5   0.5   0.9

Case 1 : Clustering method is exclusive.
Process the objects from 1 to 6.
1. Object 1 : The first object becomes part of a cluster as well as its cluster representative.
   C1 = {1}
2. Object 2 : Object 2 is compared with object 1 (as it is the cluster representative of C1) to check whether it can become part of the first cluster. Compare similarity(1, 2) with the threshold value :
   0.6 < 0.89
   Object 2 cannot become part of cluster 1. Hence a new cluster is created whose cluster representative is object 2.
   C1 = {1}, C2 = {2}
3. Object 3 :
   a. similarity(1, 3) < threshold : 0.6 < 0.89, hence object 3 cannot become part of cluster 1.
   b. similarity(2, 3) < threshold : 0.8 < 0.89, hence object 3 cannot become part of cluster 2.
   Create a new cluster whose cluster representative is object 3.
   C1 = {1}, C2 = {2}, C3 = {3}
4. Object 4 :
   similarity(1, 4) > threshold : 0.9 > 0.89
   Object 4 becomes part of cluster 1.
   C1 = {1, 4}, C2 = {2}, C3 = {3}
   As a new element is added to the cluster, the cluster representative is calculated again. In this example there is no change in the cluster representative.
5. Object 5 :
   similarity(1, 5) > threshold : 0.9 > 0.89
   C1 = {1, 4, 5}, C2 = {2}, C3 = {3}
   As a new element is added to cluster 1, the cluster representative is calculated again. Based on the similarity values, the objects are equidistant, hence there is no change in the cluster representative.
6. Object 6 :
   a. similarity(1, 6) < threshold : 0.5 < 0.89
   b. similarity(2, 6) < threshold : 0.5 < 0.89
   c. similarity(3, 6) > threshold : 0.9 > 0.89
   Hence object 6 becomes part of cluster 3.
   C1 = {1, 4, 5}, C2 = {2}, C3 = {3, 6}
   Here again, for C3, objects 3 and 6 are equidistant, hence there is no change in the cluster representative.
Thus, when the clustering method is exclusive, the output of the single-pass algorithm is as follows :
   C1 = {1, 4, 5}, C2 = {2}, C3 = {3, 6}

Fig. 1.15.1 : Output of the single-pass algorithm (exclusive clustering)

Case 2 : Clustering method is overlapping.
When the clustering method is exclusive, once an object becomes part of a single cluster, handling of that object stops. When the clustering method is overlapping, each object is compared with the cluster representatives of all clusters.
Process the objects from 1 to 6.
1. Object 1 : As no cluster is present, create a new cluster whose cluster representative is object 1.
   C1 = {1}
2. Object 2 :
   a. similarity(1, 2) < threshold : 0.6 < 0.89
   b. Create a new cluster.
   C1 = {1}, C2 = {2}
3. Object 3 :
   a. similarity(1, 3) < threshold : 0.6 < 0.89
   b. similarity(2, 3) < threshold : 0.8 < 0.89
   Create a new cluster.
   C1 = {1}, C2 = {2}, C3 = {3}
4. Object 4 :
   a. similarity(1, 4) > threshold : 0.9 > 0.89. Hence object 4 can become part of cluster 1. As the method is overlapping, go on checking object 4's similarity with all cluster representatives.
   b. similarity(2, 4) > threshold : 0.9 > 0.89. Object 4 can become part of cluster 2 also.
   c. similarity(3, 4) < threshold : 0.7 < 0.89.
   C1 = {1, 4}, C2 = {2, 4}, C3 = {3}
5. Object 5 :
   a. similarity(1, 5) > threshold : 0.9 > 0.89
   b. similarity(2, 5) < threshold : 0.6 < 0.89
   c. similarity(3, 5) < threshold : 0.6 < 0.89
   C1 = {1, 4, 5}, C2 = {2, 4}, C3 = {3}
6. Object 6 :
   a. similarity(1, 6) < threshold : 0.5 < 0.89
   b. similarity(2, 6) < threshold : 0.5 < 0.89
   c. similarity(3, 6) > threshold : 0.9 > 0.89
   C1 = {1, 4, 5}, C2 = {2, 4}, C3 = {3, 6}

Fig. 1.15.2 : Output of the single-pass algorithm (overlapping clustering)

Advantage of the single-pass algorithm : It is simple to implement.
Disadvantage of the single-pass algorithm : The output depends on the sequence in which the objects are handled.
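A minimal sketch of the single-pass procedure, for the exclusive case, is given below. It takes a precomputed similarity function and a threshold; the cluster representative is kept as the first member of each cluster, which matches the walkthrough above, where the representatives never changed (a real implementation might recompute a centroid instead, as in the worked example that follows).

```python
def single_pass(objects, similar, threshold):
    """Exclusive single-pass clustering.
    objects   : iterable of object identifiers, processed serially
    similar   : function similar(a, b) -> similarity value
    threshold : minimum similarity required to join an existing cluster
    Each cluster is a list whose first element acts as the cluster representative."""
    clusters = []
    for obj in objects:
        for cluster in clusters:
            if similar(cluster[0], obj) >= threshold:   # compare with the representative
                cluster.append(obj)
                break                                    # exclusive: stop at the first match
        else:
            clusters.append([obj])                       # start a new cluster
    return clusters

# Similarities from the worked example above (only the pairs that are needed).
sim = {(1, 2): 0.6, (1, 3): 0.6, (2, 3): 0.8, (1, 4): 0.9, (2, 4): 0.9, (3, 4): 0.7,
       (1, 5): 0.9, (2, 5): 0.6, (3, 5): 0.6, (1, 6): 0.5, (2, 6): 0.5, (3, 6): 0.9}
similar = lambda a, b: sim.get((min(a, b), max(a, b)), 0.0)
print(single_pass([1, 2, 3, 4, 5, 6], similar, threshold=0.89))
# [[1, 4, 5], [2], [3, 6]]
```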
A further worked example : five documents are represented as term vectors, the dot product is used as the matching function, and the threshold is 10.
   Doc. 1 : <1, 2, 0, 0, 1>, Doc. 2 : <3, 1, 2, 3, 0>, Doc. 3 : <3, 0, 0, 0, 1>, Doc. 4 : <2, 1, 0, 3, 0>, Doc. 5 : <2, 2, 1, 5, 1>

Step 1 : Start with Doc. 1. As initially no cluster is present, Document 1 introduces cluster 1, i.e. C1. Hence
   C1 = {Doc. 1} ; centroid of C1 : <1, 2, 0, 0, 1>
Step 2 : Now we need to make a decision for Doc. 2 : either it can become part of the first cluster or it can introduce a new cluster. For making the decision, we need to find the similarity between the centroid of the first cluster and Doc. 2; here we use the dot product for simplicity.
   Centroid of C1 : <1, 2, 0, 0, 1> ; Doc. 2 : <3, 1, 2, 3, 0>
   SIM(Doc. 2, C1) = 1×3 + 2×1 + 0×2 + 0×3 + 1×0 = 5
   Now compare the threshold value and SIM(Doc. 2, C1) : 10 > 5.
   Hence Doc. 2 cannot become part of the first cluster, and a new cluster is introduced.
   C1 = {Doc. 1} ; centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc. 2} ; centroid of C2 : <3, 1, 2, 3, 0>
Step 3 : Make the decision for Doc. 3 : <3, 0, 0, 0, 1>.
   SIM(Doc. 3, C1) = 3×1 + 0×2 + 0×0 + 0×0 + 1×1 = 4
   10 > 4, hence Doc. 3 cannot be part of C1.
   Now check whether Doc. 3 can become part of C2. Centroid of C2 : <3, 1, 2, 3, 0>.
   SIM(Doc. 3, C2) = 3×3 + 0×1 + 0×2 + 0×3 + 1×0 = 9
   Threshold 10 > 9, hence Doc. 3 cannot become part of cluster 2 either. Hence, introduce a new cluster C3 = {Doc. 3}.
   C1 = {Doc. 1} ; centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc. 2} ; centroid of C2 : <3, 1, 2, 3, 0>
   C3 = {Doc. 3} ; centroid of C3 : <3, 0, 0, 0, 1>
Step 4 : Now make the decision for Doc. 4 : <2, 1, 0, 3, 0>.
   SIM(Doc. 4, C1) = 1×2 + 2×1 + 0×0 + 0×3 + 1×0 = 4
   Threshold 10 > 4, so Doc. 4 cannot become part of C1.
   SIM(Doc. 4, C2) = 3×2 + 1×1 + 2×0 + 3×3 + 0×0 = 16
   Threshold 10 < 16, hence Doc. 4 becomes part of C2 : C2 = {Doc. 2, Doc. 4}.
   As a new document is included in the cluster, the centroid is recalculated as the average of Doc. 2 : <3, 1, 2, 3, 0> and Doc. 4 : <2, 1, 0, 3, 0> :
   Centroid of C2 = <5/2, 2/2, 2/2, 6/2, 0/2> = <2.5, 1, 1, 3, 0>
   Thus the clusters available are :
   C1 = {Doc. 1} ; centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc. 2, Doc. 4} ; centroid of C2 : <2.5, 1, 1, 3, 0>
   C3 = {Doc. 3} ; centroid of C3 : <3, 0, 0, 0, 1>
Step 5 : Finally we need to find where Doc. 5 fits. Doc. 5 : <2, 2, 1, 5, 1>.
   SIM(Doc. 5, C1) = 1×2 + 2×2 + 0×1 + 0×5 + 1×1 = 7
   Threshold 10 > 7, hence Doc. 5 cannot be part of C1.
   SIM(Doc. 5, C2) = 2.5×2 + 1×2 + 1×1 + 3×5 + 0×1 = 5 + 2 + 1 + 15 + 0 = 23
   Threshold 10 < 23, hence Doc. 5 becomes part of C2 : C2 = {Doc. 2, Doc. 4, Doc. 5}.
   The centroid of C2 is recalculated as the average of Doc. 2, Doc. 4 and Doc. 5 :
   Centroid of C2 = <7/3, 4/3, 3/3, 11/3, 1/3> = <2.33, 1.33, 1, 3.67, 0.33>
Thus finally, we have 3 clusters :
   C1 = {Doc. 1} ; centroid of C1 : <1, 2, 0, 0, 1>
   C2 = {Doc. 2, Doc. 4, Doc. 5} ; centroid of C2 : <2.33, 1.33, 1, 3.67, 0.33>
   C3 = {Doc. 3} ; centroid of C3 : <3, 0, 0, 0, 1>

1.16 Single Link Algorithm

Q. Show how single link clusters may be derived from the dissimilarity coefficient by thresholding it.

- The single link method is the best known of the hierarchical methods. It operates by joining, at each step, the two most similar objects which are not yet in the same cluster. The name 'single link' refers to the joining of pairs of clusters by the single shortest link between them.
- The dissimilarity coefficient is the basic input to a single-link clustering algorithm. Single-link produces as output a hierarchy with associated numerical levels, called a dendrogram.
- The hierarchy is represented by a tree structure. The dendrogram and its respective tree are as shown in Fig. 1.16.1.

Fig. 1.16.1 : Dendrogram

- Here, {A, B, C, D, E} are the objects. The clusters are :
  At level 1 : {A, B}, {C}, {D}, {E}
  At level 2 : {A, B}, {C, D, E}
  At level 3 : {A, B, C, D, E}
- At each level of the hierarchy a set of classes can be identified. As we move up in the hierarchy, the classes at lower levels are nested in the classes at higher levels.
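A sketch of single-link clustering is given below. The pairwise dissimilarities are an illustrative assumption, chosen so that the merges reproduce the dendrogram of Fig. 1.16.1 ({A, B} first, then {C, D, E}, then everything); the algorithm itself is the standard agglomerative procedure in which the distance between two clusters is the smallest distance between any pair of their members.

```python
def single_link(objects, dist):
    """Agglomerative single-link clustering.
    Repeatedly merges the two clusters whose closest members are nearest,
    and records the dissimilarity level of each merge (the dendrogram levels)."""
    clusters = [{o} for o in objects]
    merges = []
    while len(clusters) > 1:
        # distance between two clusters = single shortest link between them
        pairs = [(min(dist[a, b] for a in ci for b in cj), i, j)
                 for i, ci in enumerate(clusters)
                 for j, cj in enumerate(clusters) if i < j]
        level, i, j = min(pairs)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((level, sorted(merged)))
    return merges

# Illustrative symmetric dissimilarities, chosen to reproduce Fig. 1.16.1.
objects = ["A", "B", "C", "D", "E"]
d = {("A", "B"): 1, ("C", "D"): 2, ("D", "E"): 2, ("C", "E"): 2,
     ("A", "C"): 3, ("A", "D"): 3, ("A", "E"): 3, ("B", "C"): 3, ("B", "D"): 3, ("B", "E"): 3}
dist = {**d, **{(b, a): v for (a, b), v in d.items()}}
print(single_link(objects, dist))
# [(1, ['A', 'B']), (2, ['C', 'D']), (2, ['C', 'D', 'E']), (3, ['A', 'B', 'C', 'D', 'E'])]
```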
