
AUTOMATIC INDEXING,

DOCUMENT AND TERM CLUSTERING

UNIT-III
CATALOGING AND INDEXING

Automatic Indexing: Classes of Automatic Indexing, Statistical Indexing, Natural Language,
Concept Indexing, Hypertext Linkages
Document and Term Clustering: Introduction to Clustering, Thesaurus Generation, Item
Clustering, Hierarchy of Clusters

Automatic Indexing
 The indexing process is a transformation of an item that extracts the semantics of the topics
discussed in the item.
 The extracted information is used to create the processing tokens and the searchable data
structure.
 The semantics of the item refers not only to the subjects discussed in the item but, in
weighted systems, also to the depth to which each subject is discussed.
 The index can be based on the full text of the item, automatic or manual generation of a subset
of terms/phrases to represent the item, natural language representation of the item or
abstraction to concepts in the item.
 The results of this process are stored in one of the data structures.

Classes of Automatic Indexing


 Automatic indexing is the process of
analyzing an item to extract the
information to be permanently kept in
an index.
 This process is associated with the
generation of the searchable data
structures associated with an item.
 The left side of the figure including
Identify Processing Tokens, Apply Stop
Lists, Characterize tokens, Apply
Stemming and Create Searchable Data
Structure is all part of the indexing
process.
 Filters, such as stop lists and stemming
algorithms, are frequently applied to
reduce the number of tokens to be
processed.
 The next step depends upon the
search strategy of a particular system.
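The left-hand steps above can be sketched as a minimal pipeline; the stop list and the one-character suffix "stemmer" are toy stand-ins for real filters:

```python
import re

STOP_WORDS = {"the", "of", "to", "is", "in", "a", "and"}  # toy stop list

def index_item(text):
    """Produce processing tokens for one item: tokenize, apply the stop
    list, then apply a crude suffix-stripping stand-in for stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())           # identify processing tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]    # apply stop list
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # "stemming"

def build_inverted_index(items):
    """Create the searchable data structure: token -> set of item ids."""
    index = {}
    for item_id, text in enumerate(items):
        for token in index_item(text):
            index.setdefault(token, set()).add(item_id)
    return index

index = build_inverted_index([
    "Standard taxation of the shipment of oil to refineries",
    "Oil refineries in Mexico",
])
```

Each later search strategy (statistical, natural language, concept) would attach additional information, such as weights, to the entries of this structure.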

K. JAYASRI | UNIT – III | CSM 1 Information Retrieval System (IRS)


 Search strategies can be classified as
1. Statistical,
2. Natural Language,
3. Concept.
 An index is the data structure created to support the search strategy.
 Statistical Indexing
o covers the broadest range of indexing techniques and is common in commercial systems.
o The basis for statistical indexing is the frequency of occurrence of processing tokens
(words/phrases) within documents and within the database.
o The words/phrases are the domain of searchable values.
o Statistics that are applied to the event data are
1. Probabilistic,
2. Bayesian,
3. Vector spaces,
4. Neural net.
o The statistical approach stores a single statistic, such as how often each word occurs in
an item, which is used for generating relevance scores after a Boolean search.

The indexing techniques can be organized as follows:

1. Statistical Indexing
   - Probabilistic Weighting
   - Vector Weighting
     a. Simple Term Frequency Algorithm
     b. Inverse Document Frequency
     c. Signal Weighting
     d. Discrimination Value
     e. Problems with Weighting Schemes
     f. Problems with Vector Model
   - Bayesian Model
2. Natural Language
   - Index Phrase Generation
   - Natural Language Processing
3. Concept Indexing
4. Hypertext Linkages

o Probabilistic indexing
 stores the information used in calculating a probability that a particular item
satisfies (i.e., is relevant to) a particular query.

o Bayesian and vector space
 store the information used in generating a relative confidence level of an item's
relevancy to a query.
o Neural networks
 dynamic learning structures that fall under concept indexing and determine
concept classes.
 Natural language approach
o performs processing token identification similar to statistical techniques; an additional
level of parsing of the item (present, past, future actions) enhances search precision.
 Concept indexing
o uses the words within an item to correlate to concepts discussed in the item.
o When generating the concept classes automatically, there may not be a name applicable
to the concept, just a statistical significance.
o Finally, a special class of indexing can be defined by the creation of hypertext linkages.
o These linkages provide virtual threads of concepts between items versus directly
defining the concept within an item.
o Each technique has its own strengths and weaknesses.

(Fig 5.1: Data flow in an Information Processing System)

 Statistical indexing uses the frequency of occurrence of events to calculate a number that
indicates the potential relevance of an item.
 The documents are found by a normal Boolean search, and then statistical calculations are
performed on the hit file, ranking the output (e.g., with the term-frequency algorithm).
1. Probability weighting
2. Vector Weighting
a. Simple Term Frequency algorithm
b. Inverse Document Frequency algorithm
c. Signal Weighting
d. Discrimination Value
e. Problems with the weighting schemes and vector model
3. Bayesian Model
1. Probabilistic weighting:
 Probabilistic systems attempt to calculate a probability value that should be invariant to both
calculation method and text corpora (large collections of written/spoken texts).
 The probabilistic approach is based on the direct application of the theory of probability to
information retrieval systems.
 Advantage: uses the probability theory to develop the algorithm.
o This allows easy integration of the final results when searches are performed across
multiple databases and use different search algorithms.
o The use of probability theory is a natural choice – the basis of evidential reasoning
(drawing conclusions from evidence).
o This is summarized by PRP (probability ranking principle) and Plausible corollary
(reasonable result)

 HYPOTHESIS: if a reference retrieval system's response to each request is a ranking of the
documents in order of decreasing probability of usefulness to the user, the overall effectiveness
of the system to its users is the best obtainable on the basis of the data available.
 PLAUSIBLE COROLLARY: the techniques for estimating the probabilities of usefulness for output
ranking in IR are standard probability theory and statistics.
 Probabilities are based on a binary condition: the item is relevant or not.
 In an IRS, the relevance of an item is a continuous function from non-relevant to absolutely useful.
 Source of problems: problems with probability theory come from the lack of accurate data and
the simplified assumptions applied to the mathematical modeling.
 These cause the results of probabilistic approaches in ranking items to be less accurate than
other approaches. An advantage of the probabilistic approach is that it can identify its weak
assumptions and work to strengthen them.
 Ex: logistic regression
 The approach starts by defining a Model 0 system.
 In a retrieval system there exists a query qi and a document term di which has a set of attributes
(Vi…Vn) from the query (e.g., counts of term frequency in the query), from the document (e.g.,
counts of term frequency in the document) and from the database (e.g., total number of
documents in the database divided by the number of documents indexed by the term).
 The logistic reference model uses a random sample of query-document terms for which a binary
relevance judgment has been made.
 Estimate Log-Odds: the log-odds O is the logarithm of the odds of relevance for term Tk, which
is present in document Dj and query Qi.

 Aggregate Log-Odds Across All Query Terms:

 Convert Log-Odds to Probability: P(R) = O(R) / (1 + O(R)) = 1 / (1 + e^(-log O(R)))

o The coefficients of the log-odds equation are derived for a particular database using a
random sample of query-document-term-relevance quadruples and used to predict the
odds of relevance for other query-document pairs.
o Additional attributes of relative frequency in the query (QRF), relative frequency in the
document (DRF) and relative frequency of the term in all the documents (RFAD) were
included, producing the log-odds formula:
o QRF = QAF / (total number of terms in the query), DRF = DAF / (total number of words in
the document) and RFAD = (total number of term occurrences in the database) / (total
number of all words in the database), where QAF and DAF are the absolute frequencies
of the term in the query and in the document.
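A hedged sketch of how per-term log-odds combine into a relevance probability; the coefficients below are hypothetical placeholders, since the real values are fit per database by logistic regression:

```python
import math

# Hypothetical coefficients c0..c3; in practice they are fit by logistic
# regression on a sample of query-document-term relevance judgments.
C = [-3.7, 1.2, 0.8, 0.6]

def term_log_odds(qrf, drf, rfad):
    """Log-odds of relevance contributed by one matching term, from its
    relative frequencies in the query (QRF), document (DRF) and
    database (RFAD); logs smooth the skewed frequency distributions."""
    return C[0] + C[1] * math.log(qrf) + C[2] * math.log(drf) + C[3] * math.log(rfad)

def relevance_probability(term_attrs):
    """Sum the per-term log-odds over the query terms, then convert the
    total log-odds to a probability: P = 1 / (1 + e^(-log_odds))."""
    log_odds = sum(term_log_odds(*attrs) for attrs in term_attrs)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Because log-odds are additive, results from searches across multiple databases with different coefficient sets can be integrated into a single ranked list.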

 Feature-Based Model (Logistic Regression)

 Final Ranking Formula
o Logs are used to reduce the impact of frequency information and to smooth out skewed
distributions.
o A higher max likelihood is attained for logged attributes.
o The coefficients and log (O(R)) were calculated creating the final formula for ranking for
query vector Q’, which contains q terms:

o The logistic inference method was applied to the test database along with the Cornell
SMART vector system, inverse document frequency and cosine relevance weighting
formulas.
o The logistic inference method outperformed the vector method.
o Attempts have been made to combine different probabilistic techniques to get a more
accurate value.
o The objective is to have the strong points of different techniques compensate for each
other's weaknesses.
o To date, this combination of probabilities using averages of log-odds has not produced
better results, and in many cases has produced worse results.
 Example:



2. Vector Weighting
 The earliest system that investigated the statistical approach is the SMART system of Cornell
University, which is based on the vector model.
 A vector is a one-dimensional set of values, where the order/position of each value is fixed and
represents a domain.
 Each position in the vector represents a processing token.
 There are two approaches to the domain values in the vector: binary or weighted.
 Under the binary approach, the domain contains a value of 1 or 0.
 Under the weighted approach, the domain is a set of positive values, where the value of each
processing token represents its relative importance in the item.
 A binary vector requires a decision process to determine whether the degree to which a
particular token represents the semantics of an item is sufficient to include it in the vector.
 Ex., a five-page item may have had only one sentence like "Standard taxation of the shipment of
the oil to refineries is enforced."
 For the binary vector, the concepts of "Tax" and "Shipment" are below the threshold of
importance (e.g., assume the threshold is 1.0) and they are not included in the vector.


 A weighted vector acts the same as a binary vector but provides a range of values that
accommodates variance in the relative importance of processing tokens in representing the
item.
 The use of weights also provides a basis for determining the rank of an item.
 The vector approach allows for a mathematical and physical representation using a vector space
model.
 Each processing token can be considered another dimension in an item representation space.
 A 3D vector representation assumes there are only three processing tokens: Petroleum, Mexico
and Oil.
• This figure shows a 3D coordinate
space with just 3 dimensions:
• X-axis: Petroleum
• Y-axis: Mexico
• Z-axis: Oil
• The document vector is a point in
this space:
(2.8, 1.6, 3.5)
• Each document is represented as
such a point or arrow in an n-
dimensional space.
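The two domain-value approaches can be illustrated with the examples above; the vocabulary and weights below are illustrative:

```python
VOCAB = ["petroleum", "mexico", "oil", "tax", "shipment"]  # fixed vector positions

def weighted_vector(weights):
    """Weighted domain: each position holds the token's relative importance."""
    return [weights.get(term, 0.0) for term in VOCAB]

def binary_vector(weights, threshold=1.0):
    """Binary domain: 1 if the token's importance reaches the threshold, else 0."""
    return [1 if weights.get(term, 0.0) >= threshold else 0 for term in VOCAB]

doc = {"petroleum": 2.8, "mexico": 1.6, "oil": 3.5, "tax": 0.3, "shipment": 0.4}
# "tax" and "shipment" fall below the 1.0 threshold and drop out of the binary vector
```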

2. Vector Weighting
a) Simple Term Frequency Algorithm (STFA)
 In both weighted and unweighted approaches, automatic indexing implements an
algorithm to determine the weight to be assigned to a processing token.
 In statistical system: the data that are potentially available for calculating a weight are
the frequency of occurrence of the processing token in an existing item (i.e., term
frequency - TF), the frequency of occurrence of the processing token in the existing
database (i.e., total frequency -TOTF) and the number of unique items in the database
that contain the processing token (i.e., item frequency - IF, frequently labeled in other
document frequency - DF).
 Simplest approach is to have the weight equal to the term frequency.
 If the word “computer” occurs 15 times within an item it has a weight of 15.
 The term frequency weighting formula with pivoted normalization is:
weight = [(1 + log(TF)) / (1 + log(average TF))] / [(1 - slope) × pivot + slope × (number of unique terms)]
 where slope was set at 0.2 and the pivot was set to the average number of unique terms
occurring in the collection.
 Slope and pivot are constants for any document/query set.
 This leads to the final algorithm that weights each term by the log-dampened term
frequency divided by the pivoted normalization.
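A minimal sketch of the pivoted-normalization weight, assuming the formula as stated above:

```python
import math

SLOPE = 0.2  # constant for the document/query set

def pivoted_tf_weight(tf, avg_tf, unique_terms, pivot):
    """Log-dampened TF divided by the pivoted unique-term normalization;
    the denominator corrects the bias toward long documents."""
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    normalization = (1 - SLOPE) * pivot + SLOPE * unique_terms
    return numerator / normalization

# When a document's unique-term count equals the pivot, the denominator
# reduces to the pivot itself and the correction is neutral.
w = pivoted_tf_weight(tf=15, avg_tf=5, unique_terms=100, pivot=100)
```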


STFA Example

2. Vector Weighting
b) Inverse Document Frequency Algorithm (IDF)
 IDF measures how unique or rare a term is across all documents in a corpus. It is part of the TF-
IDF (Term Frequency-Inverse Document Frequency) weighting scheme, which is widely used to
rank documents based on their relevance to a query.
 Enhancing the weighting algorithm: the weights assigned to a term should be inversely
proportional to the frequency of the term occurring in the database.
 The term "computer" represents a concept used in an item, but it does not help a user find the
specific information being sought since it returns the complete DB.
 This leads to the general statement, enhancing weighting algorithms, that the weight assigned
to a term should be inversely proportional to the frequency of occurrence of that term in the
database.
 This algorithm is called inverse document frequency (IDF).
 The un-normalized weighting formula is: WEIGHTij = TFij × (log2(n / IFj) + 1)
Where:
TFij: Term frequency of term j in document i
n: Total number of items/documents
IFj: Number of items/documents containing term j
 IDF Example

 Total items (n) = 2048


 Terms & Frequencies:

o "oil": appears 4 times in the item, found in 128 documents
o "Mexico": appears 8 times, found in 16 documents
o "refinery": appears 10 times, found in 1024 documents

2. Vector Weighting
c) Signal weighting
 Signal weighting can be used to improve term weighting in information retrieval systems (IRS)
beyond standard inverse document frequency (IDF).
 The signal weighting method enhances precision in search results by considering how a term is
distributed across documents, not just how many documents contain it.
 This adds statistical depth to standard retrieval models using information theory, specifically
Shannon's entropy.

 IDF adjusts the weight of a processing token for an item based upon the number of items that
contain the term in the existing database.
 What it does not account for is the term frequency distribution of the processing token in the
items that contain the term, which can affect the ability to rank items.
 For example, assume the terms “SAW” and “DRILL” are found in 5 items with the following
frequencies:

 In Information Theory, the information content value of an object is inversely proportional to


the probability of occurrence of the item.
 An instance of an event that occurs all the time has less information value than an instance of a
seldom occurring event.

Weight Calculation:

Weight Calculation for SAW and DRILL:
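A sketch of the signal calculation under Shannon's entropy; the SAW/DRILL item frequencies below are illustrative values, since the original table is not reproduced here:

```python
import math

def signal_weight_factor(tf_per_item):
    """Shannon-style signal for a term: log2 of its total frequency minus
    the entropy of its distribution across the items that contain it."""
    totf = sum(tf_per_item)
    entropy = -sum((tf / totf) * math.log2(tf / totf) for tf in tf_per_item)
    return math.log2(totf) - entropy

# Illustrative distributions over 5 items, both totaling 50 occurrences:
saw = [10, 10, 10, 10, 10]   # spread evenly  -> low signal
drill = [2, 2, 18, 10, 18]   # concentrated   -> higher signal, better ranker
```

The term's final weight in an item would then multiply its term frequency by this signal factor, so the unevenly distributed term contributes more to ranking.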

2. Vector Weighting
d) Discrimination Value Weighting

 The goal of this weighting approach is to assign importance to terms that help distinguish
(discriminate) between documents/items in a collection.
 The Discrimination Value (DISCRIM) measures how much a term contributes to differentiating
items.
 A weighting algorithm can be created based on the discrimination value of a term: the more all
items appear the same, the harder it is to identify those that are needed.

 Salton and Yang proposed a weighting algorithm that takes into consideration the ability for a
search term to discriminate among items.

Steps to calculate weight using discrimination value:

 Step 1: Compute DISCRIMᵢ = AVESIMᵢ − AVESIM

 AVESIM = average similarity between all items in the database.

 AVESIMᵢ = average similarity when term i is removed from all items.
o The DISCRIMᵢ value may be positive, close to zero, or negative.
o A positive value indicates that removal of term "i" has increased the similarity
between items.
o In this case, leaving the term in the database assists in discriminating between items
and is of value.
o A value close to zero implies that the term’s removal or inclusion does not change
the similarity between items.
o If the value is negative, the term’s effect on the database is to make the items
appear more similar since their average similarity decreased with its removal.

 Step 2: Compute Final Term Weight: WEIGHTᵢₖ = TFᵢₖ × DISCRIMₖ
 TFᵢₖ = Term frequency of term k in document i.


 DISCRIMₖ = Discrimination value for term k.

Example

Calculating Cosine Similarity with Example
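A minimal sketch of cosine similarity and the DISCRIM computation built on it; the sign convention follows the description above (positive when removing the term increases average similarity), and the example vectors are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def average_similarity(vectors):
    """AVESIM: average pairwise cosine similarity over all item pairs."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def discrim(vectors, term_index):
    """DISCRIM_i = AVESIM_i (term removed) - AVESIM (term present)."""
    without = [[w for k, w in enumerate(v) if k != term_index] for v in vectors]
    return average_similarity(without) - average_similarity(vectors)

# A term shared heavily by every item (position 0) makes the items look
# alike; removing it lowers average similarity, so its DISCRIM is negative.
docs = [[5.0, 1.0, 0.0, 0.0], [5.0, 0.0, 1.0, 0.0], [5.0, 0.0, 0.0, 1.0]]
```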

2. Vector Weighting
e) Problems With Weighting Schemes
 Often weighting schemes use information that is based upon processing token distributions
across the database.
 Information databases tend to be dynamic with new items always being added and to a lesser
degree old items being changed or deleted.
 Thus, these factors are changing dynamically.
 This presents a challenge in maintaining consistent and accurate weightings.

Three Main Approaches to Handle Changing Weight Factors:

Approach 1: Rebuild Periodically
 Method: use current values; recalculate the entire database periodically.
 Overhead: low during runtime; high during rebuild.
 Accuracy: medium; fluctuates over time.
 Best for: simpler systems with infrequent updates.
 Downside: costly rebuilds for large DBs.

Approach 2: Threshold-Based Update
 Method: use fixed values; update only when a change threshold is exceeded.
 Overhead: moderate; distributed over time.
 Accuracy: high; stable until major changes.
 Best for: balanced efficiency and accuracy.
 Downside: complexity in tracking changes.

Approach 3: Dynamic Calculation
 Method: store invariant values; compute weights during search.
 Overhead: high computation per query (minimal if using inverted files).
 Accuracy: very high; always current.
 Best for: high-accuracy systems or small databases.
 Downside: slower search times without optimization.

 SOLUTIONS:
o If the system is using an inverted file search structure, this overhead is very minor.
o The best environment would allow a user to run a query against multiple different time
periods and different databases that potentially use different weighting algorithms, and
have the system integrate the results into a single ranked Hit file.

Problems With the Vector Model

 Dynamic databases: weighting factors can become inaccurate as the database content changes
dynamically; term importance may shift over time, but the model does not adapt automatically.
 Multiple topics in a document: the model cannot differentiate between distinct topics discussed
in the same document. A document discussing "oil in Mexico" and "coal in Pennsylvania" may
match "coal in Mexico" incorrectly.
 Lack of term association: terms are treated independently, with no correlation or linkage
between related terms (no precoordination). "Coal" and "Mexico" are scored independently,
even if unrelated in context.
 No positional information: positional relationships between terms are not stored, so proximity
searches (e.g., term A within 10 words of term B) cannot be performed.
 Scalar value limitation: each term is assigned only one scalar value per document, with no detail
about term location or section, so it is impossible to distinguish where the term appears in the
document.
 Subset searching as a fix: searching within parts (subsets) of a document can improve precision
by focusing on specific topics, helping isolate sections that match a search query more
accurately.

3. Bayesian Model
 One way of overcoming the restrictions inherent in a vector model is to use a Bayesian approach
to maintaining information on processing tokens.
 The Bayesian model provides a conceptually simple yet complete model for information
systems.
 The Bayesian approach is based upon conditional probabilities (e.g., Probability of Event 1 given
Event 2 occurred).
 This general concept can be applied to the search function as well as to creating the index to the
database.
 The objective of information systems is to return relevant items.
 The formula used is the conditional probability P(REL | doci, queryj):
 the probability that a document (doci) is relevant (REL) given a query (queryj).
 In addition to search, Bayesian formulas can be used in determining the weights associated with
a particular processing token in an item.
 The objective of creating the index to an item is to represent the semantic information in the
item.
 A Bayesian network can be used to determine the final set of processing tokens (called topics)
and their weights.

 A simple view of the process where Ti


represents the relevance of topic “i” in a
particular item and Pj represents a
statistic associated with the event of
processing token “j” being present in the
item.
 “m” topics would be stored as the final
index to the item.
 The statistics associated with the processing token are typically frequency of occurrence.
 But they can also incorporate proximity factors that are useful in items that discuss multiple
topics.
 There is one major assumption made in this model:

Assumption of Binary Independence:

 The topics and the processing token statistics are independent of each other.
 The existence of one topic is not related to the existence of the other topics.
 The existence of one processing token is not related to the existence of other processing tokens.
 In most cases this assumption is not true. Some topics are related to other topics and some
processing tokens are related to other processing tokens.
 For example, the topics of “Politics” and “Economics” are in some instances related to each
other and in many other instances totally unrelated.
 The same type of example would apply to processing tokens.
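A sketch of scoring under the binary independence assumption: each processing token independently contributes its own likelihood ratio to the log-odds of relevance. The per-token probabilities and prior below are hypothetical:

```python
import math

def log_relevance(doc_tokens, token_probs, prior_rel=0.3):
    """Relevance score under the binary independence assumption: each
    processing token independently contributes the likelihood ratio
    P(token | relevant) / P(token | non-relevant) to the log-odds."""
    score = math.log(prior_rel / (1 - prior_rel))  # prior log-odds of relevance
    for token in doc_tokens:
        p_rel, p_nonrel = token_probs.get(token, (0.5, 0.5))  # unseen: neutral
        score += math.log(p_rel / p_nonrel)
    return score

# Hypothetical per-token probabilities estimated from relevance judgments.
probs = {"oil": (0.8, 0.2), "mexico": (0.6, 0.3)}
```

The multiplication of independent per-token ratios is exactly where the model breaks when topics or tokens are in fact correlated.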

There are two approaches to handling this Binary Independence problem.

 The first is to assume that there are dependencies, but that the errors introduced by assuming
mutual independence do not noticeably affect the determination of relevance of an item
nor its relative rank with respect to other retrieved items. This is the most common approach
used in system implementations.
 A second approach can extend the network to additional layers to handle interdependencies.
Thus, an additional layer of Independent Topics (ITs) can be placed above the Topic layer and a
layer of Independent Processing Tokens (IPs) can be placed above the processing token layer.

 Top level (ITs) could represent the


item or system being indexed.
 The next level (T1 to TM) might be
topics or thematic categories.
 Below that, IP1 to IPR might represent
index phrases derived from
processing.
 At the bottom, P1 to PN could be
individual phrases or words.
 It shows how complex semantic
processing builds up from base
phrases to structured thematic
indexes.

 The new set of Independent Processing Tokens can then be used to define the attributes
associated with the set of topics selected to represent the semantics of an item.
 To compensate for dependencies between topics the final layer of Independent Topics is
created.
 The degree to which each layer is created depends upon the error that could be introduced by
allowing for dependencies between Topics or Processing Tokens.
 Although this approach is the most mathematically correct, it suffers from losing a level of
precision by reducing the number of concepts available to define the semantics of an item.

Example

Natural language:
 The goal of natural language processing is to use the semantic information in addition to the
statistical information to enhance the indexing of the item.
 This improves the precision of searches, reducing the number of false hits a user reviews.
 The semantic information is extracted as a result of processing the language rather than treating
each word as an independent entity.
 The simplest output of this process results in generation of phrases that become indexes to an
item.
 More complex analysis generates thematic representation of events rather than phrases.
 Statistical approaches use proximity as the basis behind determining the strength of word
relationships in generating phrases.
 For example, with a proximity constraint of adjacency, the phrases “venetian blind” and “blind
Venetian” may appear related and map to the same phrase.
 But syntactically and semantically those phrases are very different concepts.
 Word phrases generated by natural language processing algorithms enhance indexing
specification and provide another level of disambiguation.
 Natural language processing can also combine the concepts into higher level concepts
sometimes referred to as thematic representations.

Steps in natural language indexing:

1. Lexical Analysis – preprocessing: word forms, tense, plurality.
2. Generation of Term Phrases – identify basic important terms.
3. Mapping to Subject Codes (LDOCE) – assign semantic categories to terms.
4. Text Structuring – organize text into logical sections (Evaluation, Main Event, Expectation, etc.).
5. Topic Statement Identification – find main ideas and their semantic attributes (time frame: past, present, future).
6. Assign News Schema Components – assign sentences to categories like Circumstance, Consequence, Lead.
7. Concept Relationship Detection – identify how different topics/concepts relate.
8. Relationship Weighting – assign importance to relationships (e.g., active verbs weighted higher).
9. Storage for Retrieval – store all information to help in natural language search queries.

1. Index Phrase Generation

 The goal of indexing is to represent the semantic concepts of an item in the information system
to support finding relevant information.
 Single words have conceptual context, but frequently they are too general to help the user find
the desired information.
 Term phrases allow additional specification and focusing of the concept to provide better
precision and reduce the user’s overhead of retrieving non-relevant items.
 Having the modifier “grass” or “magnetic” associated with the term “field” clearly
disambiguates between very different concepts.
 One of the earliest statistical approaches to determining term phrases uses a COHESION
factor between terms:
COHESIONk,h = SIZE-FACTOR × (PAIR-FREQk,h / (TOTFk × TOTFh))
 where SIZE-FACTOR is a normalization factor based upon the size of the vocabulary and
PAIR-FREQk,h is the total frequency of co-occurrence of the pair Termk, Termh in the item
collection.
 Co-occurrence may be defined in terms of adjacency, word proximity, sentence proximity, etc.
 This initial algorithm has been modified in the SMART system to be based on the following
guidelines
o Any pair of adjacent non-stop words is a potential phrase
o Any pair must exist in 25 or more items
o Phrase weighting uses a modified version of the smart system single term algorithm
o Normalization is achieved by dividing by the length of the single-term sub-vector.
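The cohesion idea can be sketched directly from the formula; SIZE-FACTOR and the frequencies below are illustrative:

```python
def cohesion(pair_freq, totf_k, totf_h, size_factor):
    """COHESION(k, h) = SIZE_FACTOR * PAIR_FREQ / (TOTF_k * TOTF_h):
    co-occurrence count normalized by how common each term is overall."""
    return size_factor * pair_freq / (totf_k * totf_h)

# Same co-occurrence count, but the pair built from rarer terms is far
# more cohesive and hence a better phrase candidate.
common = cohesion(pair_freq=30, totf_k=1000, totf_h=1200, size_factor=10000)
rare = cohesion(pair_freq=30, totf_k=80, totf_h=90, size_factor=10000)
```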
 Statistical approaches tend to focus on two-term phrases.
 The advantage of natural language approaches is their ability to produce multiple-term phrases
to denote a single concept.
 If a phrase such as “industrious intelligent students” was used often, a statistical approach
would create phrases such as “industrious intelligent” and “intelligent student.”
 A natural language approach would create phrases such as “industrious student,” “intelligent
student” and “industrious intelligent student.”
 The first step in a natural language determination of phrases is a lexical analysis of the input.
 In its simplest form this is a part of speech tagger that, for example, identifies noun phrases by
recognizing adjectives and nouns.
 Precise part of speech taggers exist that are accurate to the 99 per cent range.
 Additionally, proper noun identification tools exist that allow for accurate identification of
names, locations and organizations since these values should be indexed as phrases and not
undergo stemming.
 The Tagged Text Parser (TTP), based upon the Linguistic String Grammar, produces a regularized
parse tree representation of each sentence reflecting the predicate-argument structure.
 The tagged text parser contains over 400 grammar production rules.

 The TTP parse trees are header-modifier pairs where the header is the main concept and the
modifiers are the additional descriptors that form the concept and eliminate ambiguities.
 Example: The former Soviet President
has been a local hero ever since a
Russian tank invaded Wisconsin

 To determine if a header-modifier pair warrants indexing, Strzalkowski calculates a value for


Informational Contribution (IC) for each element in the pair.
o The basis behind the IC formula is a conditional probability between the terms. The
formula for IC between two terms (x,y) is:
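A sketch of the IC computation; the exact formula form, IC(x, [x, y]) = f(x, y) / (nx + dx − 1), is an assumption based on the usual presentation of this measure:

```python
def informational_contribution(pair_freq, n_x, d_x):
    """IC(x, [x, y]) = f(x, y) / (n_x + d_x - 1), where f(x, y) counts the
    pair's occurrences, n_x is the number of pairs word x occurs in, and
    d_x is the dispersion (distinct words that pair with x).
    Formula form is assumed, not taken from this text."""
    return pair_freq / (n_x + d_x - 1)

# A head word that pairs promiscuously (high n_x and d_x) contributes
# less information to any one pair than a selective one.
ic = informational_contribution(pair_freq=12, n_x=40, d_x=9)
```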

Example:

Calculating Weight of Phrase

2. Natural Language Processing:


 Natural language processing not only produces more accurate term phrases, but can provide
higher level semantic information identifying relationships between concepts.
 The system adds the functional processes Relationship Concept Detectors, Conceptual Graph
Generators and Conceptual Graph Matchers that generate higher-level linguistic relationships,
including semantic and discourse-level relationships.
 During the first phase of this approach, the processing tokens in the document are mapped to
Subject Codes.
 These codes equate to index term assignment and have some similarities to the concept-based
systems.
 The next phase is called the Text Structure, which attempts to identify general discourse level
areas within an item.
 The next level of semantic processing is the assignment of terms to components, classifying the
intent of the terms in the text and identifying the topical statements.
 The next level of natural language processing identifies interrelationships between the concepts.

 The final step is to assign final weights to the established relationships.
 The weights are based upon a combination of statistical information and values assigned to the
actual words used in establishing the linkages.

Concept indexing
 Natural language processing starts with a basis of the terms within an item and extends the
information kept on an item to phrases and higher-level concepts such as the relationships
between concepts.
 Concept indexing takes the abstraction a level further.
 Its goal is using concepts instead of terms as the basis for the index, producing a reduced
dimension vector space.
 Concept indexing can start with a number of unlabeled concept classes and let the information
in the items define the concepts classes created.
 A term such as “automobile” could be associated with concepts such as “vehicle,”
“transportation,” “mechanical device,” “fuel,” and “environment.”
 The term “automobile” is strongly related to “vehicle,” lesser to “transportation” and much
lesser the other terms.
 Thus, a term in an item needs to be represented by many concept codes with different weights
for a particular item.
 The basis behind the generation of the concept approach is a neural network model.
 Special rules must be applied to create a new concept class.
 Example demonstrates how the process would work for the term “automobile.”

 Another example of this process is Latent Semantic Indexing (LSI).


 Its assumption is that there is an underlying or “latent” structure represented by
interrelationships between words.
 The index contains representations of the “latent semantics” of the item. Like Convectis, the
large term-document matrix is decomposed into a small set (e.g., 100-300) of orthogonal factors
which use linear combinations of the factors (concepts) to approximate the original matrix.
 Any rectangular matrix can be decomposed into the product of three matrices. Let X be an m × n
matrix such that:

X = T0 S0 D0

 where T0 and D0 have orthogonal columns and are m × r and r × n matrices, S0 is an r × r diagonal
matrix, and r is the rank of matrix X.
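The decomposition above can be sketched with NumPy's SVD. The matrix values here are hypothetical term weights, not data from the text; LSI keeps only the k largest singular values to form a reduced "concept" space that approximates the original term-document matrix.

```python
import numpy as np

# Toy term-document matrix X (m terms x n documents); hypothetical weights.
X = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

# Full decomposition X = T0 @ S0 @ D0, where T0 is m x r, S0 is r x r
# diagonal, and D0 is r x n (numpy returns the row-orthonormal factor D0).
T0, s, D0 = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s)

# LSI keeps only the k largest singular values: a rank-k approximation
# of X built from k orthogonal factors ("concepts").
k = 2
X_k = T0[:, :k] @ S0[:k, :k] @ D0[:k, :]
print(np.round(X_k, 2))
```

Queries and documents are then compared in this k-dimensional factor space rather than over the full vocabulary.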

Hypertext linkages
 A new class of information representation, hypertext, is evolving on the Internet.
 Hypertext links currently need to be generated manually, creating an additional information
retrieval dimension.
 Traditionally the document was viewed as two-dimensional: the text of the item as one dimension
and its references as the second dimension.
 Hypertext, with its linkages to additional electronic items, can be viewed as networking between
items that extends their contents; an embedded linkage allows the user to go immediately to
the linked item.
 Issue: how to use this additional dimension to locate relevant information.
 On the Internet there are three classes of mechanisms to help find information:
1. Manually generated indexes, e.g., www.yahoo.com, where information sources on the home page
are indexed manually into a hyperlink hierarchy.
o The user navigates through the hierarchy by expanding the hyperlinks.
o At some point the user starts to see the end items.
2. Automatically generated indexes: sites like lycos.com, altavista.com and google.com
automatically visit other Internet sites and index the text they return.
3. Web crawlers: a web crawler (also known as a Web spider or Web robot) is a program or
automated script which browses the World Wide Web in a methodical, automated manner.
o Web crawlers (e.g., WebCrawler, OpenText, Pathfinder) and intelligent agents (Coriolis
Group's NetSeeker™) are tools that allow a user to define items of interest; they
automatically go to various sites on the Internet searching for the desired information.
 What is needed is an index algorithm for items that looks at the hypertext linkages as an
extension of the concepts being presented in the item where the link exists.
 Some links that are for references to multi-media imbedded objects would not be part of the
indexing process.
 The Uniform Resource Locator (URL) hypertext links can map to another item or to a specific
location within an item.
 The index values of the hyperlinked item have a reduced weighted value compared to contiguous
text, biased by the type of linkage.


Introduction to Clustering
 Clustering is the process of grouping similar items or terms together to improve how we search
and retrieve information.
 The goal of clustering is to assist in the location of information.
 Clustering of words originated with the generation of thesauri. Thesaurus, coming from the Latin
word meaning “treasure,” is similar to a dictionary in that it stores words.
 Instead of definitions, it provides the synonyms and antonyms for the words.
 Its primary purpose is to assist authors in selection of vocabulary.
 The goal of clustering is to provide a grouping of similar objects (e.g., terms or items) into a
“class” under a more general title. Clustering also allows linkages between clusters to be
specified.
 The term class is frequently used as a synonym for the term cluster.

Clustering Process Steps:

 Define the domain for the clustering effort. Defining the domain for the clustering identifies
those objects to be used in the clustering process. Ex: Medicine, Education, Finance etc.
 Once the domain is determined, determine the attributes of the objects to be clustered. (Ex:
Title, Place, job etc zones).
 Determine the strength of the relationships between the attributes whose co-occurrence in
objects suggest those objects should be in the same class.
 Apply some algorithm to determine the class(s) to which each item will be assigned.

Guidelines

 Clear Meaning: Each cluster should have a clear semantic meaning.


 Balanced Size: Clusters should be roughly equal in size. If one cluster contains 90% of the
items, it is not helpful.
 For example, assume a thesaurus class called “computer” exists and it contains the objects
(words/word phrases) “microprocessor,” “286-processor,” “386- processor” and “pentium.”
If the term “microprocessor” is found 85 per cent of the time and the other terms are used 5
per cent each, there is a strong possibility that using “microprocessor” as a synonym for
“286- processor” will introduce too many errors. It may be better to place “microprocessor”
into its own class.
 Whether an object can be assigned to multiple classes or just one must be decided at
creation time.
 There are additional important decisions associated with the generation of thesauri that are
not part of item clustering. They are

1) Word coordination approach: specifies if phrases as well as individual terms are


to be clustered

2) Word relationships: Aitchison and Gilchrist specified three types of relationships:
equivalence, hierarchical and nonhierarchical.
o Equivalence relationships are the most common and represent synonyms.
o Hierarchical relationships are those where the class name is a general term and the
entries are specific examples of the general term, as in the previous example of the
"computer" class name and "microprocessor," "pentium," etc.
o Non-hierarchical relationships cover other types of relationships such as
"object"-"attribute" that would contain "employee" and "job title."

3) Homograph resolution: a homograph is a word that has multiple, completely


different meanings. For example, the term “field” could mean a electronic field, a field
of grass, etc.

4) Vocabulary constraints: this includes guidelines on the normalization and


specificity of the vocabulary. Normalization may constrain the thesaurus to stems versus
complete words.

Thesaurus Generation
 There are three basic methods for generating a thesaurus: hand-crafted, co-occurrence-based,
and header-modifier (linguistic) based.

1. Hand-Crafted Thesauri
o Created manually by domain experts (example: a thesaurus for medical terms).
o Work best in specific domains (e.g., legal, medical).
o General thesauri (like WordNet) are less helpful for search and query expansion due to
multiple meanings (polysemy).

2. Co-Occurrence-Based Thesauri
o Automatically generated based on how often words appear together in text.
o Reflect real usage in documents and rely on statistical methods.
o Effective in capturing term similarity.

3. Header-Modifier (Linguistic) Based Thesauri
o Based on linguistic parsing of text; words are grouped by the grammatical roles they play.
o Assumption: words in similar grammatical structures are similar in meaning.
o Key syntactic structures parsed: Subject-Verb (e.g., "Dogs bark"), Verb-Object (e.g., "eat
food"), Adjective-Noun (e.g., "blue sky"), Noun-Noun (e.g., "data analysis").
o Co-occurrence + Mutual Information: each noun is associated with the verbs, adjectives,
and nouns it appears with; mutual information (typically with a log function) measures how
strongly two words are associated, and this score is used to compute similarity between
words, helping in term classification.

 There are two types generation of a thesaurus
o Manual Clustering
o Automatic Term Clustering
 Complete Term Relation Method
 Clustering Using Existing Clusters
 One Pass Assignments
Manual Clustering
 Steps
1. Define the Domain:
 Select the domain (e.g., data processing, medical, legal) the thesaurus is meant
for.
 Helps avoid homographs (same word, different meanings) and keeps terms
domain-relevant
2. Generate Word List Using Concordances:
 Use tools like:
 Existing thesauri
 Concordances (alphabetical list of words with frequency and document
references)
 Domain dictionaries
3. Select Useful Terms
 Avoid unrelated or overly common words (like “computer” in a computer
thesaurus).
 Focus on domain-relevant, meaningful terms.
4. Cluster the Selected Terms
 Based on semantic relationships and usage patterns.
 Done using editor's judgment—not automated.
 Example: Group "memory", "chips", and "RAM" together if they co-occur in
similar contexts
5. Quality Assurance
 Other editors review and refine the thesaurus.
 Ensures accuracy, consistency, and relevance.

 KWOC (Key Word Out of Context)


1. KWOC does not provide context—we can’t tell what kind of "chips" it is
 KWIC (Key Word In Context)
1. Displays the term inside a sentence, showing the term with its surrounding words.
2. Helps resolve ambiguities (e.g., memory chips vs wood chips).
 KWAC (Key Word And Context)
1. Displays keyword followed by full sentence context.
2. Easier for manual scanning of how a term is used in actual sentences.

The Figure shows the various displays for "computer design contains memory chips".
 The phrase is assumed to be from doc4; the other frequencies and document ids for KWOC were
created for this example.
 In Figure 6.1 the character "/" is used in KWIC to indicate the end of the phrase.
 The KWIC and KWAC displays are useful in determining the meaning of homographs.
 The term "chips" could be wood chips or memory chips.
 In both the KWIC and KWAC displays, the editor of the thesaurus can read the sentence fragment
associated with the term and determine its meaning.

 Once the terms are selected, they are clustered based upon the word relationship guidelines
and the interpretation of the strength of the relationship.
 The resultant thesaurus undergoes many quality assurance reviews by additional editors using
some of the guidelines already suggested before it is finalized

Automatic Term Clustering


 The more frequently two terms co-occur in the same items, the more likely they are about the
same concept.
 Automatic term clustering techniques differ by the completeness with which terms are correlated.
 The more complete the correlation, the higher the time and computational overhead to create
the clusters.
 The most complete process computes the strength of the relationships between all
combinations of the "n" unique words, with an overhead of O(n²); other techniques start with an
arbitrary set of clusters and iterate on the assignment of terms to these clusters.
 The basis for automatic generation of a thesaurus
 Is a set of items that represents the vocabulary to be included in the thesaurus.
 Selection of this set of items is the first step of determining the domain for the
thesaurus.
 The processing tokens (words) in the set of items are the attributes to be used to create
the clusters.
 The basis for automated method of clustering documents
 Is based upon the polythetic clustering where each cluster is defined by a set of words
and phrases.

 Inclusion of an item in a cluster is based upon the similarity of the item's words and
phrases to those of other items in the cluster.

Two Techniques in Automatic Term Clustering


1. Full Correlation Matrix Method

– This method computes all pairwise correlations between terms.

– If there are n unique words, this results in a time and computational overhead of O(n²)
— meaning:

– Time increases quadratically with the number of words.

– It is very expensive and slow when n is large.

2. Iterative Clustering Method

– Begins with an initial set of clusters (could be random).

– Then, it iteratively reassigns terms to clusters based on similarity.

– If a large number of clusters is created initially, it can be used to build higher-level


clusters (a hierarchy or layered grouping).

Automatic Term Clustering


 Complete Term Relation Method
 Similarity between every term pair is calculated for determining the clusters.
 The vector model is represented by a matrix where the rows are individual items and
the columns are the unique words (processing tokens) in the items.
 The values in the matrix represent how strongly that particular word represents
concepts in the item.
 Figure 6.2 provides an example of a database with 5 items and 8 terms. To determine
the relationship between terms, a similarity measure is required. The measure
calculates the similarity between two terms.
The similarity between two terms is computed as:

SIM(Term_i, Term_j) = Σ_k (Term_{k,i} × Term_{k,j})

where "k" is summed across the set of
all items. In effect the formula takes the
two columns of the two terms being
analyzed, multiplying and accumulating
the values in each row.
 The results can be placed in a resultant
"m" by "m" matrix, called a Term-Term
Matrix (Salton-83), where "m" is the
number of columns (terms) in the
original matrix.
 This simple formula is reflexive so that
the matrix that is generated is
symmetric. Other similarity formulas
could produce a non- symmetric matrix.
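The Term-Term and Term Relationship matrices can be sketched as below. The item-term weights are hypothetical, not the actual Figure 6.2 data, and the threshold of 10 follows the example in the text.

```python
import numpy as np

# Item-term matrix: rows are items, columns are unique terms; values are
# the strength with which each term represents concepts in the item.
# (Hypothetical weights -- not the actual Figure 6.2 data.)
items = np.array([
    [0, 4, 0, 1],
    [3, 1, 4, 3],
    [3, 0, 0, 0],
    [0, 1, 0, 3],
    [2, 2, 2, 3],
])

# Term-Term matrix: SIM(i, j) = sum over items k of Term[k,i] * Term[k,j].
term_term = items.T @ items
np.fill_diagonal(term_term, 0)  # no autocorrelation of a term with itself

# Threshold the similarities to get the binary Term Relationship matrix.
THRESHOLD = 10
term_relationship = (term_term >= THRESHOLD).astype(int)
print(term_relationship)
```

Because this similarity formula is reflexive, the resulting matrix is symmetric, exactly as the text notes.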

 Using the data in Figure 6.2, the Term-
Term matrix produced is shown in Figure
6.3.
 There are no values on the diagonal
since that represents the auto
correlation of a word to itself.
 The next step is to select a threshold
that determines if two terms are
considered similar enough to each other
to be in the same class.
 In this example, the threshold value of
10 is used.
 Thus, two terms are considered similar if
the similarity value between them is 10
or greater.
 A new binary matrix called the Term
Relationship matrix (Figure 6.4) that
defines which terms are similar.
 A one in the matrix indicates that the
terms specified by the column and the
row are similar enough to be in the same
class.
 Term 7 demonstrates that a term may
exist on its own with no other similar
terms identified.
 In any of the clustering processes
described below this term will always
migrate to a class by itself.
 The final step in creating clusters is to
determine when two objects (words) are
in the same cluster.

The following algorithms are the most common: cliques, single link, stars and strings.

Cliques:
Cliques require all terms in a cluster to be
within the threshold of all other terms.


Single Link:
 Any term that is similar to any term in
the cluster can be added to the cluster.
 It is impossible for a term to be in two
different clusters.
 Applying the algorithm for creating
clusters using single link to the Term
Relationship Matrix, Figure 6.4, the
following classes are created:

Star:
 Select a term and then places in the class
all terms that are related to that term.
 Terms not yet in classes are selected as
new seeds until all terms are assigned to
a class.
 There are many different classes that can
be created using the Star technique.
 If we always choose as the starting point
for a class the lowest numbered term not
already in a class.

String:
 Starts with a term and includes in the class one additional term that is similar to the term
selected and not already in a class. (Figure: network diagram of term similarities.)
 The new term is then used as the new node and the process is repeated until no new terms can
be added, either because the term being analyzed does not have another term related to it or
because the terms related to it are already in the class.
 A new class is started with any terms not currently in any existing class.
 Using the guidelines to select the lowest-numbered term similar to the current term, and not to
select any term already in an existing class, produces the following classes.
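The single link algorithm can be sketched as a connected-components walk over the Term Relationship matrix: any term similar to any member joins the class. The relationship data below is hypothetical, not the Figure 6.4 values.

```python
# Adjacency view of a binary term-relationship matrix (hypothetical data):
# term -> set of terms whose similarity clears the threshold.
rel = {
    1: {3, 4, 6}, 2: {4, 5}, 3: {1, 4}, 4: {1, 2, 3, 6},
    5: {2, 6}, 6: {1, 4, 5}, 7: set(), 8: set(),
}

def single_link(rel):
    """Group terms into classes: any term similar to any member joins."""
    classes, seen = [], set()
    for start in rel:
        if start in seen:
            continue
        stack, cls = [start], set()
        while stack:                      # walk the connected component
            t = stack.pop()
            if t in cls:
                continue
            cls.add(t)
            stack.extend(rel[t] - cls)    # follow every similarity link
        classes.append(sorted(cls))
        seen |= cls
    return classes

print(single_link(rel))  # → [[1, 2, 3, 4, 5, 6], [7], [8]]
```

Note how terms 7 and 8, with no relationships, each migrate to a class by themselves, while everything connected by even one link collapses into a single broad class. A clique algorithm would instead require every pair inside a class to be related.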

Comparison:

| Aspect | Clique | Single Link |
|---|---|---|
| Class relationship strength | Strongest: relationship between all terms in a class | Weakest: a term need only connect to at least one term in the class |
| Concept representation | More likely to describe a specific concept | Covers diverse/unrelated concepts |
| Number of classes produced | Produces more classes | Produces the fewest classes |
| Similarity requirement | All terms must be mutually similar | Terms can be grouped even with zero similarity between some pairs |
| Use case (matrix density) | Best for dense term relationship matrices | Best for sparse term relationship matrices |
| Precision vs recall | High precision – ideal for query term expansion | High recall – may retrieve more non-relevant items |
| Computational overhead | Higher – more complex due to the full linkage requirement | Lower – only O(n²) comparisons |
| Use in thesaurus construction | Suitable when concept clarity is critical | Suitable when broad coverage is desired |

Automatic Term Clustering
 Clustering Using Existing Clusters
 Start with a set of existing clusters.
 The initial assignment of terms to the clusters is arbitrary and revised by revalidating
every term assignment to a cluster.
 To minimize calculations, centroids are calculated for each cluster.
 Centroid: the average of all of the vectors in a cluster.
 The similarity between all existing terms and the centroids of the clusters can be
calculated.
 The term is reallocated to the cluster that has the highest similarity.
 The process stops when minimal movement between clusters is detected.
Step 1: Initial Cluster Centroids
 Black squares = initial centroids; triangles = terms; circles = initial groupings of terms into
3 clusters (Classes 1, 2, 3).
 At this stage, the centroids are arbitrarily placed and do not accurately represent the final
clusters yet.

Step 2: After Reassignment
 After the first iteration of recalculating centroids and reassigning terms, the black squares
have moved closer to the actual clusters and terms are more appropriately grouped.
 Still not perfect, but closer to ideal.

Step 3: Centroid Calculations
 Class 1 = (Term 1 and Term 2), Class 2 = (Term 3 and Term 4) and Class 3 = (Term 5 and Term 6).

Step 4: First Iteration of Term Assignment
 Apply the simple similarity measure between each of the 8 terms and the 3 centroids.
 One technique for breaking ties is to look at the similarity weights of the other terms in the
class and assign the term to the class that has the most similar weights.

Step 5: New Centroids
 After the new assignments, centroids are recalculated again, based on the new groups of terms.
 Only Term 7 moved from Class 1 to Class 3, which makes sense because its similarity to Class 1
was weak and it is now closer to Class 3's centroid.

Clustering Using Existing Clusters:


 Computation overhead: O(n).
 The number of classes is defined at the start of the process and cannot grow.
 It is possible to have fewer classes at the end of the process than initially defined.
 Since all terms must be assigned to a class, it forces terms to be allocated to classes, even if their
similarity to the class is very weak compared to other terms assigned.
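The reallocation loop described above can be sketched as follows. The term vectors and starting centroids are hypothetical; similarity is the dot product used throughout this section, and the loop stops when assignments no longer change.

```python
import numpy as np

def cluster_with_centroids(terms, centroids, max_iters=10):
    """Reassign each term vector to its most similar centroid, recompute
    centroids, and stop when the assignments no longer change."""
    assign = np.zeros(len(terms), dtype=int)
    for _ in range(max_iters):
        # Dot-product similarity of every term to every centroid.
        sims = terms @ centroids.T
        new_assign = sims.argmax(axis=1)
        if np.array_equal(new_assign, assign):
            break                      # minimal movement detected
        assign = new_assign
        # Recompute each centroid as the average of its assigned vectors.
        for c in range(len(centroids)):
            members = terms[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return assign, centroids

# Hypothetical term vectors (rows) and two arbitrary starting centroids.
terms = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]])
centroids = np.array([[3.0, 0.0], [0.0, 3.0]])
assign, _ = cluster_with_centroids(terms, centroids)
print(assign)  # → [0 0 1 1]
```

Because every term must land in some class, a term with only weak similarity to all centroids is still forced into whichever class scores highest, which is the limitation noted above.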
Automatic Term Clustering
 One Pass Assignments
 Minimum overhead: only one pass of all of the terms is used to assign terms to classes.
 Algorithm
 The first term is assigned to the first class.
 Each additional term is compared to the centroids of the existing classes.
 A threshold is chosen. If the highest similarity is greater than the threshold, the
term is assigned to the class with the highest similarity.
 A new centroid has to be calculated for the modified class.
 If the similarity to all of the existing centroids is less than the threshold, the
term is the first item in a new class.
 This process continues until all items are assigned to classes.


 Example with threshold of 10:

 Minimum computation on the order of O(n).


 Does not produce optimum clustered classes.
 Different classes can be produced if the order in which the items are analyzed changes.
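The one-pass algorithm above can be sketched directly. The term vectors and threshold are hypothetical; note that processing the rows in a different order could yield different classes, as the text warns.

```python
import numpy as np

def one_pass_cluster(terms, threshold):
    """Assign each term to the most similar existing class centroid if the
    similarity clears the threshold; otherwise start a new class."""
    classes, centroids = [], []
    for i, vec in enumerate(terms):
        if centroids:
            sims = [vec @ c for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                classes[best].append(i)
                # Recompute the centroid of the modified class.
                centroids[best] = terms[classes[best]].mean(axis=0)
                continue
        classes.append([i])              # first item of a new class
        centroids.append(vec.astype(float))
    return classes

terms = np.array([[4, 0], [3, 1], [0, 4], [1, 3]])
print(one_pass_cluster(terms, threshold=10))  # → [[0, 1], [2, 3]]
```

Only one pass over the terms is made, giving the O(n) behavior claimed above, at the cost of non-optimal, order-dependent classes.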

Item Clustering
 Think about manual item clustering, which is inherent in any library or filing system – one item
one category.
 Automatic clustering – one primary category and several “secondary” categories
 Similarity between documents is based on two items that have terms in common.
 The similarity function is performed between rows of the item matrix
Step 1: Calculating Similarity
 The similarity between two items is computed as the dot product of their term vectors:

SIM(Item_i, Item_j) = Σ_k (Term_{i,k} × Term_{j,k})

 where Term_{i,k} is the term weight or count of term k in Item i.
 If two items share many common terms (or the same terms with high weight), their similarity
is high.
Step 2: Item/Item Similarity Matrix
 A higher number means greater similarity.
 For example: Item 2 and Item 5 have similarity 36 (a strong connection).

Step 3: Item Relationship Matrix
 We apply a threshold (e.g., 10): if similarity ≥ 10, assign 1; else assign 0.
 This results in a binary matrix (Figure 6.10) representing item relationships.

Step 4: Clustering Methods
1. Clique algorithm: finds fully connected subgroups; items can belong to multiple cliques.
2. Single link algorithm: builds a single cluster by linking items that are indirectly connected.
Because Item 1 → Item 2 → Items 3 & 5 → Item 4, all are linked through at least one neighbor.
3. Star algorithm: starts with the lowest unassigned item as the center; each star has a center
with its directly connected items.
4. String algorithm: creates chains of items until all are assigned, growing clusters by
connecting the next available unassigned item.
Step 5: Clustering with Centroids
 Instead of using relationships, now use term vectors to create centroids (average vectors of
each class).
 Initial assignment: Class 1 = Items 1 & 3; Class 2 = Items 2 & 4.
 Compute the centroids as averages of the item vectors, e.g., Class 1 centroid =
(Item 1 + Item 3) / 2 and Class 2 centroid = (Item 2 + Item 4) / 2.
 Then compare each item's vector to both centroids, and reassign it based on higher similarity.


Hierarchy of Clusters
 Hierarchical Clustering builds a tree-like structure (called a dendrogram) of clusters based on
the similarity or distance between data points (documents or terms).
 Hierarchical agglomerative clustering (HAC) – start with un-clustered items and perform pair-
wise similarity measures to determine the clusters.
 Hierarchical divisive clustering – start with a cluster and breaking it down into smaller clusters
 There are two main approaches:
 Agglomerative (Bottom-up): Start with each item in its own cluster and successively merge the
most similar pairs.
 Divisive (Top-down): Start with one big cluster and divide it into smaller ones.

Objectives of Creating a Hierarchy of Clusters :


 Reduce the overhead of search
o Perform top-down searches of the centroids of the clusters in the hierarchy and trim
those branches that are not relevant.
 Provide for visual representation of the information space
o Visual cues on the size of clusters (size of ellipse) and strengths of the linkage between
clusters (dashed line, sold line…).
 Expand the retrieval of relevant items
o A user, once having identified an item of interest, can request to see other items in the
cluster.
o The user can increase the specificity of items by going to children clusters or by
increasing the generality of items being reviewed by going to a parent clusters.

Dendrogram for Visualizing Hierarchical Clusters

 Hierarchical Agglomerative Clustering Method (HACM) :


o First the N * N item relationship matrix is formed.
o Each item is placed into its own clusters.
o The following two steps are repeated until only one cluster exists
 The two clusters that have the highest similarity are found
 These two clusters are combined, and the similarity between the newly formed
cluster and the remaining clusters recomputed
o As the larger cluster is found, the clusters that merged together are tracked and form a
hierarchy
 HACM Example :
o Assume document A, B, C, D, and E exist and a document-document similarity matrix
exists .
o {{A} {B} {C} {D} {E}} ➔ {{A, B} {C} {D} {E}} ➔ …➔ {A, B, C, D, E}
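The HACM loop above can be sketched with single link as the inter-cluster similarity. The document-document similarities are hypothetical values, not data from the text; each iteration merges the most similar pair and records the merge, which is exactly the sequence a dendrogram displays.

```python
import itertools

# Hypothetical document-document similarity matrix for documents A..E.
sim = {
    frozenset("AB"): 9, frozenset("AC"): 2, frozenset("AD"): 3,
    frozenset("AE"): 1, frozenset("BC"): 4, frozenset("BD"): 2,
    frozenset("BE"): 1, frozenset("CD"): 8, frozenset("CE"): 5,
    frozenset("DE"): 7,
}

def single_link_sim(c1, c2):
    # Inter-cluster similarity = maximum doc-doc similarity across clusters.
    return max(sim[frozenset((a, b))] for a in c1 for b in c2)

clusters = [frozenset(d) for d in "ABCDE"]   # each item starts alone
merges = []
while len(clusters) > 1:
    # Find and merge the most similar pair of clusters.
    c1, c2 = max(itertools.combinations(clusters, 2),
                 key=lambda p: single_link_sim(*p))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    merges.append(sorted(c1 | c2))

print(merges)
```

Swapping `max` for `min` inside `single_link_sim` would give complete linkage; averaging would give group average.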

 Similarity Measure between Clusters:
o Single link clustering
 The similarity between two clusters (inter-cluster similarity) is computed as the
maximum similarity between any two documents in the two clusters, each
initially from a separate.
o Complete linkage
 Inter-cluster similarity is computed as the minimum of the similarity between
any documents in the two clusters such that one document is from each cluster.
o Group average
 As a node is considered for a cluster, its average similarity to all nodes in the
cluster is computed. It is placed in the cluster as long as its average similarity is
higher than its average similarity for any other cluster
 Cluster Similarity: Lance-Williams Formula
o When clusters A and B are merged, the distance from the new cluster AB to any other
cluster C can be written in the general form:

d(AB, C) = α_A d(A, C) + α_B d(B, C) + β d(A, B) + γ |d(A, C) − d(B, C)|

o Different choices of the coefficients α_A, α_B, β and γ yield single link, complete link,
group average, Ward's and the other standard methods.
 Ward’s method
o Clusters are joined so that their merger minimizes the increase in the sum of the
distances from each individual document to the centroid of the cluster containing it.
o If cluster A merged with either cluster B or cluster C, the centroids for the potential
cluster AB and AC are computed as well as the maximum distance of any document to
the centroid. The cluster with the lowest maximum is used.

 Analysis of HACM:
o Ward’s method typically took the longest to implement.
o Single link and complete linkage are somewhat similar in run time.
o Clusters found in single link clustering tend to be fairly broad in nature and provide
lower effectiveness.
o Choosing the best cluster as the source of relevant documents results in very close
effectiveness results for complete link, Ward’s and group average clustering.
o A consistent drop in effectiveness for single link clustering is noted.


 Monolithic vs Polythetic Clusters


o Monolithic: Clusters are formed based on a single attribute; easier to interpret (e.g.,
Yahoo directories).
o Polythetic: Based on multiple features like terms and concepts; harder to interpret but
more flexible.
 Concept Hierarchy with Term Subsumption (Sanderson & Croft)
o If 80% of the documents containing term Y also contain term X → X is a more general
concept than Y.
o This builds a concept hierarchy where:
o Parent = more general term
o Child = more specific term
o Represented as a directed acyclic graph (DAG)
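The subsumption test above reduces to a set-overlap check. The document sets here are hypothetical; the 80% threshold follows Sanderson & Croft as described in the text.

```python
# Term-subsumption test: term X subsumes term Y when most documents
# containing Y also contain X (80% here). Document sets are hypothetical.
docs_with = {
    "animal": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "dog": {1, 2, 3, 4, 5},
}

def subsumes(x, y, docs_with, threshold=0.8):
    """True if at least `threshold` of Y's documents also contain X."""
    dy = docs_with[y]
    overlap = len(dy & docs_with[x]) / len(dy)
    return overlap >= threshold

print(subsumes("animal", "dog", docs_with))  # → True  (animal is more general)
print(subsumes("dog", "animal", docs_with))  # → False
```

Running the test over all term pairs, and keeping only the pairs where it holds in one direction, yields the parent-child edges of the concept DAG.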

 Semantic Hierarchies and Thesauri


o Manual thesauri use semantic relationships (e.g., is-a, part-of).
o Automatic clustering can build item hierarchies effectively.
o However, for term hierarchies, automation often introduces errors, so human-guided
construction is more reliable.

| Component | Explanation |
|---|---|
| Dendrogram | Tree diagram showing how documents/items are hierarchically clustered |
| HACM | Hierarchical agglomerative clustering methods used to group similar documents |
| Lance-Williams formula | General formula to update cluster distances |
| Ward's method | Minimizes variance within clusters |
| Subsumption rule | Defines parent-child relationships in term hierarchies |
| Centroids | Mean vector of items in a cluster; used to cluster at higher levels |
| Scatter/Gather | IR technique using repeated clustering |
| Monolithic vs polythetic | Monolithic: simple, one-topic clusters; polythetic: multi-attribute, harder to interpret |
