Department of Information Technology
Sub : Information Storage and Retrieval
Topic: Clustering
Mr. Kare S. S.
Classification and retrieval search strategies
Contents:
Retrieval strategies: Vector Space model, Probabilistic retrieval strategies, Language models,
Inference networks, Extended Boolean retrieval, Latent semantic indexing, neural networks, Fuzzy set
retrieval.
Retrieval utilities: Relevance feedback, Cluster Hypothesis, Clustering Algorithms: Single Pass
Algorithm, Single Link Algorithm.
Unit Objectives
1. To understand the concepts of clustering and how they relate to information retrieval.
Unit outcomes: On completion, the students will be able to:
1. Deal with the storage and retrieval process for text data.
Outcome Mapping:
PEO: I, PO: a, b, c CO: 1,1 , PSO: 1,2
Books :
T1. Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, Pearson Education, ISBN: 81-297-0274-6
T2. C. J. van Rijsbergen, Information Retrieval, 2nd ed. (www.dcs.gla.ac.uk), ISBN: 978- 408709293
Retrieval utilities
Relevance Feedback
• To separate relevant from non-relevant documents, we rely on
matching coefficients and a threshold.
• In practice, choosing the threshold is very difficult, so we take
feedback from the user and use it to update the matching technique.
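The slides do not name a specific feedback method. One common choice in the vector space model is Rocchio-style query modification, sketched below as an assumption: the function name `rocchio_update` and the weights `alpha`, `beta`, `gamma` are illustrative and not taken from the course material. The query is moved toward the documents the user marked relevant and away from those marked non-relevant.

```python
def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector from user relevance judgements.

    query: list of term weights.
    relevant / non_relevant: lists of document vectors (same length as
    query) that the user marked. alpha, beta, gamma are illustrative
    weights, not values from the slides.
    """
    def centroid(docs):
        # Mean vector of the marked documents (zero vector if none).
        if not docs:
            return [0.0] * len(query)
        return [sum(col) / len(docs) for col in zip(*docs)]

    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, rel_c, non_c)]
    # Negative term weights are clipped to zero in this sketch.
    return [max(0.0, w) for w in updated]


new_query = rocchio_update(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
```

After the update, the second term (present in both relevant documents) gains weight, while the third term (present only in the non-relevant document) stays at zero.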
Cluster Hypothesis
• Closely associated documents tend to be relevant to the same
requests.
Clustering Algorithms
• Criteria for choosing clustering method
(1) Theoretical soundness
The clustering method should satisfy constraints such as:
• The method produces a clustering which is unlikely to be altered drastically
when further objects are incorporated i.e. it is stable under growth.
• The method is stable in the sense that small errors in the description of the
objects lead to small changes in clustering.
• The method is independent of the initial ordering of the objects.
(2) Efficiency
The method should be efficient in terms of both speed and storage
requirements.
Single Pass Algorithm
• The single-pass algorithm proceeds as follows:
1. The object descriptions are processed serially.
2. The first object becomes the cluster representative of the first cluster.
3. Each subsequent object is matched against all cluster representatives
existing at its processing time.
4. A given object is assigned to one cluster (or more if overlap is allowed)
according to some condition on the matching function.
5. When an object is assigned to a cluster the representative for that cluster is
recomputed.
6. If an object fails a certain test it becomes the cluster representative of a
new cluster.
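The six steps above can be sketched as follows. This is a minimal non-overlapping variant under stated assumptions: objects are numeric vectors, the matching function is cosine similarity, the cluster representative is the centroid, and the "test" is a similarity threshold. The names `single_pass`, `cosine` and `threshold` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as the (assumed) matching function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def single_pass(objects, threshold):
    """Cluster vectors in one serial pass; returns lists of object indices."""
    clusters = []  # each cluster: {"members": [indices], "rep": centroid}
    for i, obj in enumerate(objects):          # step 1: serial processing
        # Step 3: match against every representative that exists so far.
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(obj, c["rep"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            # Steps 4 and 5: assign, then recompute the centroid.
            best["members"].append(i)
            n = len(best["members"])
            best["rep"] = [(r * (n - 1) + x) / n
                           for r, x in zip(best["rep"], obj)]
        else:
            # Steps 2 and 6: the object starts a new cluster.
            clusters.append({"members": [i], "rep": list(obj)})
    return [c["members"] for c in clusters]


docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(single_pass(docs, threshold=0.8))  # [[0, 1], [2, 3]]
```

Because each object is only compared with the clusters that exist when it is processed, the result can change if the objects are presented in a different order.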
Example
Pairwise coefficient matrix for objects 1 to 6 (lower triangle):
      1    2    3    4    5    6
1
2    0.6
3    0.6  0.8
4    0.9  0.9  0.7
5    0.9  0.6  0.6  0.9
6    0.5  0.5  0.9  0.5  0.5
Example
[Figure: non-overlapping and overlapping clusterings]
Single Link Algorithm
• The single link method is the best known of the hierarchical
methods. At each step it joins the two most similar objects
that are not yet in the same cluster. The name "single link"
refers to joining pairs of clusters by the single shortest
link between them.
• The dissimilarity coefficient is the basic input to a single-link
clustering algorithm. The output is a hierarchy with
associated numerical levels, called a dendrogram.
• The hierarchy is represented by a tree structure. The
dendrogram and its corresponding tree are shown in the figure.
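The merging process described above can be sketched by scanning object pairs in order of increasing dissimilarity and joining the clusters of any pair not yet in the same cluster. This is one simple way to obtain the merge levels, not necessarily the formulation used in the course texts; the function name `single_link` and the matrix `D` are hypothetical.

```python
def single_link(dissim):
    """Return the single-link merge sequence for a dissimilarity matrix.

    dissim: symmetric list-of-lists of dissimilarity coefficients.
    Each returned tuple is (level, cluster_a, cluster_b), where the
    clusters are sorted lists of object indices.
    """
    n = len(dissim)
    parent = list(range(n))  # union-find forest over the objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    members = {i: [i] for i in range(n)}
    # All pairs, shortest link first.
    pairs = sorted((dissim[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    merges = []
    for level, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already in the same cluster
        merges.append((level, sorted(members[ri]), sorted(members[rj])))
        parent[rj] = ri
        members[ri] = members[ri] + members.pop(rj)
    return merges


# Hypothetical dissimilarities between four objects.
D = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
merges = single_link(D)
for level, a, b in merges:
    print(level, a, b)
```

Each printed line corresponds to one join in the dendrogram, with the level at which the two clusters merge.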
Single Link Algorithm
• Here, {A, B, C, D, E} are the objects. The clusters are:
• At level 1: {A, B}, {C}, {D}, {E}
• At level 2: {A, B}, {C, D, E}
• At level 3: {A, B, C, D, E}
• At each level of the hierarchy a set of classes can be
identified. As we move up the hierarchy, the classes at
lower levels are nested in the classes at higher levels.
[Figure: Dendrogram]
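To make the nesting concrete, the sketch below reads off the single-link clusters at a given level as the connected components of the graph that links every pair whose dissimilarity does not exceed that level. The dissimilarity values are hypothetical, chosen only so that the levels reproduce the A to E example; the slides do not give the underlying coefficients.

```python
def clusters_at(dissim, labels, level):
    """Single-link clusters at one level of the hierarchy.

    Two objects share a cluster if a chain of links, each with
    dissimilarity <= level, connects them.
    """
    n = len(labels)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # depth-first search over links within the level
            i = stack.pop()
            comp.append(labels[i])
            for j in range(n):
                if j not in seen and dissim[i][j] <= level:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(comp))
    return clusters


labels = ["A", "B", "C", "D", "E"]
# Hypothetical dissimilarity matrix (rows and columns in label order).
D = [[0, 1, 5, 5, 5],
     [1, 0, 3, 5, 5],
     [5, 3, 0, 2, 4],
     [5, 5, 2, 0, 2],
     [5, 5, 4, 2, 0]]

hierarchy = {h: clusters_at(D, labels, h) for h in (1, 2, 3)}
for h, classes in hierarchy.items():
    print("level", h, classes)


def nested(lower, higher):
    """True if every class in `lower` lies inside some class in `higher`."""
    return all(any(set(c) <= set(big) for big in higher) for c in lower)
```

Calling `nested(hierarchy[1], hierarchy[2])` and `nested(hierarchy[2], hierarchy[3])` checks the property stated in the last bullet.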
Thank You