
Information Retrieval

(CS-440)

Dr. Robin Singh Bhadoria,
Assistant Professor, Dept. of CSE

UNIT-5
Multimedia Information Retrieval (MIR)

Definition and Importance:
• MIR focuses on retrieving relevant multimedia content (images, videos, audio) based on user queries.

• Unlike traditional text retrieval, MIR deals with unstructured data requiring complex processing.

• Importance in digital libraries, medical imaging, video surveillance, and social media analysis.

Challenges in MIR:
• High-dimensional data processing.

• Feature extraction and representation.

• Semantic gap between low-level features and human perception.

• Scalability and efficiency.


Operations in Multimedia Information Retrieval (MIR)
1. Data Acquisition & Pre-processing
• Data Collection: Gathering multimedia content from various sources (web, databases, user uploads, sensors).
• Feature Extraction: Extracting meaningful representations (e.g., color histograms for images, MFCC for audio).
• Normalization & Cleaning: Removing noise, standardizing formats, and ensuring consistency in multimedia data.
• Compression & Encoding: Reducing storage requirements while preserving quality (e.g., JPEG for images, MP3 for audio).

2. Feature Representation & Indexing
• Metadata-Based Indexing: Using textual annotations, tags, and timestamps for organization.
• Content-Based Indexing: Utilizing low-level features (color, texture, shape) for images and high-level features (semantic concepts) for videos.
• Spatial & Temporal Indexing: Organizing multimedia based on spatial layout (for images) or sequence in time (for videos).
• Inverted Indexing & Hashing: Efficient retrieval methods like Bag-of-Visual-Words (BoVW) or Locality-Sensitive Hashing (LSH).
Operations in Multimedia Information Retrieval (MIR)
3. Query Formulation
• Text-Based Queries: Using keywords, metadata, and descriptions.
• Content-Based Queries: Searching by example (CBIR - Content-Based Image Retrieval, CBVR - Content-Based Video Retrieval).
• Multimodal Queries: Combining different modalities (e.g., text and image together).
• Relevance Feedback: User-driven refinement to improve search accuracy.

4. Retrieval & Ranking
• Similarity Matching: Using distance metrics like Euclidean distance or cosine similarity.
• Machine Learning & Deep Learning: Using CNNs (for images), RNNs (for audio), and transformers (for multimodal retrieval).
• Ranking & Filtering: Sorting results based on relevance, popularity, or personalization.
• Personalization & Recommendation: Adaptive retrieval using user preferences and browsing history.
Operations in Multimedia Information Retrieval (MIR)

5. Post-Retrieval Processing
• Visualization & Presentation: Displaying results in an intuitive format (e.g., image grids, timelines for videos).
• Relevance Feedback Mechanisms: Adjusting retrieval based on user interactions.
• Refinement & Iteration: Allowing query refinement through feedback loops.

6. Performance Optimization & Scalability
• Indexing Optimization: Improving storage and retrieval efficiency (e.g., tree-based structures like KD-Trees).
• Parallel & Distributed Processing: Using cloud and edge computing for large-scale retrieval.
• Compression & Storage Efficiency: Reducing bandwidth and space while maintaining accuracy.
Similarity Queries
• Concept of Similarity Queries in MIR: Searches for multimedia objects
that are similar to a given query object based on predefined similarity
measures.
• Types of Similarity Queries:
1. Exact Similarity Queries: Retrieve objects identical to the query.
2. Approximate Similarity Queries: Find objects with some level of similarity.
3. Range-Based Queries: Retrieve results within a predefined similarity threshold.
Role of Similarity Measurement in MIR
Metrics for Similarity Measurement:
1. Euclidean Distance: Measures direct spatial distance between two points.
• Formula: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

2. Cosine Similarity: Computes similarity based on the angle between feature vectors.
• Formula: cos(A, B) = (A · B) / (‖A‖ ‖B‖)

3. Jaccard Index: Measures similarity between two sets.
• Formula: J(A, B) = |A ∩ B| / |A ∪ B|

4. Earth Mover's Distance (EMD): Used for image retrieval based on color histograms; measures the minimum "work" needed to transform one histogram into the other.
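As an illustration, the first three metrics can be sketched in a few lines of pure Python (function names are illustrative, not from the lecture):

```python
import math

def euclidean(p, q):
    # Straight-line (L2) distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(s, t):
    # Overlap between two sets: |intersection| / |union|.
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

print(euclidean([0, 0], [3, 4]))          # 5.0
print(cosine_similarity([1, 0], [1, 1]))  # ~0.7071
print(jaccard({1, 2, 3}, {2, 3, 4}))      # 0.5
```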
Feature-Based Indexing and Searching
• Documents are represented by sets of features or index terms, which may be defined manually or automatically and enable efficient retrieval based on these characteristics.
• Feature-based indexing: Extracting, processing, and storing characteristic features of data to facilitate fast and accurate retrieval.
• Feature-based searching: Querying a dataset by comparing the extracted features of a query object (text, image, video, etc.) against an indexed database.
Indexing Structures for Feature-Based IR
1. Traditional Indexing Methods:
• Inverted Index (for text retrieval): Maps terms to document IDs.
• Suffix Trees & Tries: Useful in substring and pattern matching.
2. High-Dimensional Indexing Methods:
Feature-based indexing often results in high-dimensional data representations. Specialized indexing
structures are required:
• KD-Trees (K-Dimensional Trees): Efficient for low-dimensional feature spaces.
• R-Trees: Used for spatial indexing in GIS applications.
• VP-Trees (Vantage Point Trees): Suitable for metric spaces.

3. Approximate Nearest Neighbor (ANN) Indexing:
For large-scale feature-based retrieval, exact nearest neighbor search is computationally expensive. ANN methods approximate the search while maintaining high accuracy:
• Locality-Sensitive Hashing (LSH): Hash-based indexing for similarity search.
• Hierarchical Navigable Small World (HNSW) Graphs: Efficient ANN search used in FAISS.
• Product Quantization (PQ): Compresses feature vectors for faster retrieval.
• Facebook AI Similarity Search (FAISS): State-of-the-art vector search library.
Feature-Based Searching Methods
1. Similarity Metrics: To compare feature vectors, different similarity measures are used:
• Euclidean Distance (L2 Norm): Measures straight-line distance in feature space.
• Cosine Similarity: Measures angle between two vectors (useful for word embeddings).
• Manhattan Distance (L1 Norm): Measures distance along coordinate axes.
• Jaccard Similarity: Measures overlap between sets (useful in NLP).
• Hamming Distance: Used for binary feature representations.

2. Types of Feature-Based Searches:
a) Exact Nearest Neighbor Search
• Linear search (brute-force) on feature vectors.
• Not scalable for large datasets.
b) Approximate Nearest Neighbor Search
• Used for fast retrieval in high-dimensional spaces.
• Libraries like FAISS, Annoy (Spotify), HNSWlib.
c) Hybrid Search: Combines traditional keyword-based search with deep-learning-based feature
retrieval:
• BM25 + Dense Vectors (BERT embeddings).
• Elasticsearch with neural re-ranking models.
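The exact (brute-force) nearest neighbor search in (a) can be sketched as follows; the toy index and document IDs are hypothetical:

```python
import math

def top_k_neighbors(query, index, k=3):
    # Brute-force exact nearest neighbor search: score every indexed
    # vector against the query by Euclidean (L2) distance, then sort.
    # This is not scalable for large datasets, which is why ANN
    # libraries (FAISS, Annoy, HNSWlib) are used in practice.
    def l2(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    scored = sorted(index.items(), key=lambda item: l2(query, item[1]))
    return [doc_id for doc_id, _ in scored[:k]]

# Hypothetical toy index mapping document IDs to 2-D feature vectors.
index = {
    "img_a": [0.9, 0.1],
    "img_b": [0.2, 0.8],
    "img_c": [0.85, 0.15],
}
print(top_k_neighbors([1.0, 0.0], index, k=2))  # ['img_a', 'img_c']
```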
Spatial Access Methods
• Spatial Access Methods (SAMs) are indexing and searching techniques used to efficiently store,
retrieve, and query spatial data, such as geographic information, images, and multidimensional feature
vectors. These methods help in optimizing spatial queries like range queries, nearest neighbor searches,
and spatial joins.
• SAMs are important because:
1. Efficient Query Processing – Reduces the complexity of spatial queries like “Find all restaurants
within 5 km.”
2. Scalability – Supports large-scale spatial databases used in Geographic Information Systems
(GIS), image retrieval, and machine learning.
3. Optimized Storage – Structures spatial data for efficient retrieval in multidimensional spaces.
• Types of Spatial Access Methods: Spatial Access Methods can be categorized based on the way they
organize spatial data:
1. Point Access Methods (PAMs) – Designed for indexing points in space (e.g., cities on a map).
2. Spatial Data Structures for Extended Objects – Designed for indexing complex spatial objects
(e.g., polygons, lines, and 3D models).
3. Multidimensional Indexing Methods – Designed for high-dimensional feature spaces.
(1) Point Access Methods (PAMs)
1. Grid Files
• Divides the space into a grid structure with fixed-size cells.
• Each cell contains pointers to the data points within it.
• Pros: Fast access for uniformly distributed data.
• Cons: Inefficient for non-uniform data, requires dynamic resizing.
2. k-d Trees (k-dimensional Trees)
• A binary tree where each node splits the space along one of the k dimensions.
• Used for efficient nearest neighbor searches.
• Pros: Works well for low-dimensional spaces.
• Cons: Performance degrades in high-dimensional spaces (curse of dimensionality).
3. Quadtrees
• A tree-based spatial indexing structure that recursively subdivides 2D space into four quadrants.
• Common in image compression, GIS applications.
• Pros: Efficient for hierarchical spatial partitioning.
• Cons: Performance degrades with irregular data distributions.
4. Octrees
• The 3D extension of a quadtree, used for indexing 3D spatial data (e.g., 3D maps, medical imaging).
• Pros: Efficient for 3D object representation.
• Cons: High memory usage.
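The grid file in (1) can be sketched in a few lines; the cell size and point payloads below are illustrative assumptions:

```python
from collections import defaultdict

class GridFile:
    # Minimal sketch of a grid file: space is divided into fixed-size
    # square cells of width `cell`, and each cell stores the points
    # that fall inside it.
    def __init__(self, cell=10.0):
        self.cell = cell
        self.cells = defaultdict(list)

    def _key(self, x, y):
        # Map a coordinate to the (row, column) of its grid cell.
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, payload):
        self.cells[self._key(x, y)].append((x, y, payload))

    def query_cell(self, x, y):
        # Fast access: only the single cell containing (x, y) is read.
        return [p for (_, _, p) in self.cells[self._key(x, y)]]

g = GridFile(cell=10)
g.insert(3, 4, "cafe")
g.insert(5, 6, "museum")
g.insert(42, 7, "airport")
print(g.query_cell(4, 5))  # ['cafe', 'museum']
```

Non-uniform data is the weak point noted above: all points may cluster into a few cells, leaving the rest empty and forcing dynamic resizing.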
(2)Spatial Data Structures for Extended Objects
These methods support more complex spatial objects such as lines, polygons, and regions.
1. R-Trees (Rectangle Trees)
• A hierarchical tree structure where each node stores a Minimum Bounding Rectangle (MBR) that encloses
spatial objects.
• Used in GIS, geospatial databases, and CAD applications.
• Pros: Supports dynamic insertions and deletions, good for range queries.
• Cons: High overlap in bounding boxes can reduce efficiency.

❑ Variants of R-Trees
• R*-Tree – Optimized R-Tree that reduces overlap and improves query efficiency.
• R+ Tree – Avoids overlapping bounding boxes, leading to faster search but requires more memory.
• Hilbert R-Tree – Uses space-filling curves to improve spatial locality.
2. X-Trees
• Designed for high-dimensional data where R-Trees fail due to excessive overlap.
• Used in multimedia retrieval and machine learning feature indexing.
• Pros: Handles high-dimensional spatial queries effectively.
• Cons: More complex to implement.
(3)Multidimensional Indexing Methods
1. Ball Trees
• Organizes data points into a tree structure where each node represents a ball containing
points within a certain radius.
• Used in nearest neighbor search (NNS) applications.
• Pros: Works well for high-dimensional data.
• Cons: Slower than other methods for low-dimensional data.

2. VP-Trees (Vantage Point Trees)
• A metric-tree based approach that partitions data using distance-based clustering.
• Efficient for nearest neighbor search in metric spaces.
• Pros: Good for non-Euclidean spaces.
• Cons: Slower for uniform data distributions.

3. Locality-Sensitive Hashing (LSH)
• Uses hashing techniques to map similar points into the same hash buckets.
• Used in large-scale approximate nearest neighbor search (e.g., image search, document
retrieval).
• Pros: Extremely fast for high-dimensional feature spaces.
• Cons: Not exact; retrieval accuracy depends on the number of hash functions used.
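A common LSH family for cosine similarity uses random hyperplanes; a minimal sketch (the vectors and plane count are illustrative):

```python
import random

def lsh_signature(vec, planes):
    # Random-hyperplane LSH for cosine similarity: each hyperplane
    # contributes one bit of the hash, 1 if the vector lies on its
    # positive side. Vectors with a small angle between them tend to
    # get the same signature and so collide in the same bucket;
    # accuracy depends on the number of hash functions (planes).
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return tuple(1 if dot(vec, p) >= 0 else 0 for p in planes)

random.seed(0)  # deterministic planes for the demo
dim, n_planes = 4, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

a = [1.0, 0.9, 0.0, 0.1]
print(lsh_signature(a, planes))
# Scaling a vector does not change which side of each plane it is on,
# so the signature (and hence the bucket) is identical:
print(lsh_signature([2 * x for x in a], planes))
```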

4. Hierarchical Navigable Small World (HNSW):
• A graph-based indexing method that builds a hierarchy of small-world networks.
• Used in vector search engines like FAISS and Annoy.
• Pros: One of the fastest ANN search techniques.
• Cons: Higher memory consumption.
Meta Ranking
• Meta-ranking refers to the process of combining multiple ranking algorithms or
search engines to produce an improved ranking of search results. It is widely used
in meta-search engines, recommendation systems, and information retrieval to
enhance search accuracy and user satisfaction.

• Importance of Meta Ranking:
1. Aggregates multiple ranking sources to improve result quality.
2. Mitigates individual algorithm biases, providing a more balanced ranking.
3. Enhances relevance by leveraging multiple ranking perspectives.

• Meta-ranking methods can be classified into two major categories:
1. Integrated Methods – Combine rankings at the feature level before ranking.
2. Isolated Methods – Rank individual sources separately and then merge their rankings.
Meta Ranking: Integrated vs Isolated Methods

Integrated Method:
• Directly combines features from multiple ranking systems.
• Uses machine learning models (e.g., RankNet, LambdaMART, neural networks) to predict the final ranking.
• Advantages: Leverages multiple ranking signals effectively; provides higher accuracy with sufficient training data.
• Disadvantages: Requires a large training dataset; computationally expensive.

Isolated Method:
• Each ranking source produces a ranking independently.
• The final ranked list is obtained by merging the results using combination techniques (e.g., voting, score normalization).
• Advantages: Simpler to implement; works well when sources have different scoring models.
• Disadvantages: Can lead to conflicts when rankings differ significantly.
Ranking Combination Techniques
1. Interleaving:
• Used for evaluating ranking algorithms rather than merging them (as in A/B Testing).
• Combines results from two ranking systems in an alternating manner.
• User interactions (clicks) determine the better ranking system.
• Example:
• If Algorithm A produces (R1, R3, R5) and Algorithm B produces (R2, R4, R6), an
interleaved ranking would be:
• (R1, R2, R3, R4, R5, R6)
• Advantages:
• Helps compare ranking methods in real-world settings.
• Disadvantages:
• Not an actual merging technique; it’s mainly for evaluation.
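The alternating merge from the example above can be sketched as:

```python
def interleave(ranking_a, ranking_b):
    # Alternate results from two equal-length rankings, skipping
    # duplicates. Used to compare ranking systems via user clicks,
    # not as a true merging technique.
    merged, seen = [], set()
    for a, b in zip(ranking_a, ranking_b):
        for doc in (a, b):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

print(interleave(["R1", "R3", "R5"], ["R2", "R4", "R6"]))
# ['R1', 'R2', 'R3', 'R4', 'R5', 'R6']
```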
2. Voting-Based Method: Voting-based techniques treat ranking as a voting problem, where
documents receive scores based on their position in different rankings.

I. Borda Count
• Each result receives a score based on its rank position.
• Scores are summed across different rankings to produce the final order.
• Advantages:
• Simple and effective.
• Disadvantages:
• Doesn’t consider confidence levels of different ranking methods.

II. Condorcet Voting
• Compares every pair of documents across different rankings.
• A document wins if it is ranked higher than another in a majority of ranking lists.
• Advantages:
• More robust than Borda count.
• Disadvantages:
• Computationally expensive.
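The Borda count from (I) can be sketched as follows; the three rankings are hypothetical:

```python
from collections import defaultdict

def borda_merge(rankings):
    # Borda count: in a list of n results, rank position i earns n - i
    # points. Points are summed across all rankings and the final
    # order is by total score, highest first.
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, doc in enumerate(ranking):
            scores[doc] += n - i
    return sorted(scores, key=lambda d: -scores[d])

ranking_1 = ["R1", "R2", "R3"]
ranking_2 = ["R2", "R1", "R3"]
ranking_3 = ["R2", "R3", "R1"]
# R2 scores 2 + 3 + 3 = 8, R1 scores 3 + 2 + 1 = 6, R3 scores 4.
print(borda_merge([ranking_1, ranking_2, ranking_3]))  # ['R2', 'R1', 'R3']
```

Note that every ranking contributes equally: the method has no way to express that one ranker is more trustworthy than another, which is the confidence-level weakness listed above.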
Web Search
• Web search is the process of retrieving relevant information from the internet using
search engines. The effectiveness of a web search depends on factors like indexing,
crawling, ranking algorithms, and user interaction.
• Early Web (1990s)
• The World Wide Web (WWW) was introduced by Tim Berners-Lee in 1989.
• Early search engines were directory-based (e.g., Yahoo! Directory).
• Web pages were indexed manually, making searches inefficient.
• Evolution of Search Engines
• 1993: Aliweb – The first web search engine, relying on manually submitted index files.
• 1994: WebCrawler – First full-text search engine.
• 1996: AltaVista – Introduced advanced indexing techniques.
• 1998: Google – Revolutionized search with PageRank algorithm.
• 2000s-Present – Modern search engines (Google, Bing) use AI, NLP, and deep learning for
ranking.
Search Engine

Search Engine Issues


(1) Dynamic Data (Incorporating new data)
• The “collection” for most real applications is constantly changing in
terms of updates, additions, deletions
• e.g., web pages

• Acquiring or "crawling" the documents is a major task
• Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)
• Updating the indexes while processing queries is also a design issue

(2) Scalability
• Making everything work with millions of users every day, and many terabytes
of documents
• Distributed processing is essential

(3) Adaptability
• Changing and tuning search engine components such as ranking algorithm,
indexing strategy, interface for different applications


(4) Spam
• For Web search, spam in all its forms is one of the major issues

• Affects the efficiency of search engines and, more seriously, the effectiveness of the results

• Many types of spam
• e.g., spamdexing or term spam, link spam, "optimization"

• New subfield called adversarial IR, since spammers are "adversaries" with different goals
Architecture of SE

How do search engines like Google work?
(Results page: paid search ads are displayed alongside the algorithmic results.)
Architecture (diagram): the user issues a search query; a web spider crawls the Web, and the indexer builds the indexes (including ad indexes) used to answer queries.
Indexing Process
• Text acquisition
• identifies and stores documents for indexing

• Text transformation
• transforms documents into index terms or features

• Index creation
• takes index terms and creates data structures (indexes) to support fast
searching

Query Process
• User interaction
• supports creation and refinement of query, display of results

• Ranking
• uses query and indexes to generate ranked list of documents

• Evaluation
• monitors and measures effectiveness and efficiency (primarily offline)

Text Acquisition


• Crawler
• Identifies and acquires documents for search engine

• Many types – web, enterprise, desktop

• Web crawlers follow links to find documents


• Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
• Single site crawlers for site search
• Topical or focused crawlers for vertical search
• Document crawlers for enterprise and desktop search
• Follow links and scan directories
Web Crawler
• Starts with a set of seeds, which are a set of URLs given to it as parameters

• Seeds are added to a URL request queue

• Crawler starts fetching pages from the request queue

• Downloaded pages are parsed to find link tags that might contain other useful
URLs to fetch

• New URLs added to the crawler’s request queue, or frontier

• Continue until no more new URLs or disk full


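The crawl loop above can be sketched against an in-memory link graph; the URLs are hypothetical, and a real crawler would fetch and parse each page over HTTP:

```python
from collections import deque

def crawl(seeds, link_graph, max_pages=100):
    # Breadth-first crawl loop: seeds go into the URL request queue
    # (the frontier); each fetched page is "parsed" for new links,
    # which are appended to the frontier if not already seen.
    frontier = deque(seeds)          # URL request queue
    crawled = []
    seen = set(seeds)
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()     # fetch next page
        crawled.append(url)
        for link in link_graph.get(url, []):  # links found on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
}
print(crawl(["a.com"], links))  # ['a.com', 'b.com', 'c.com', 'd.com']
```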
Crawling picture (diagram): seed URLs initialize the frontier; pages are fetched from the frontier, crawled and parsed, while the rest of the Web remains unseen.
Crawling the Web (diagram)
Text Acquisition
• Feeds
• Real-time streams of documents
• e.g., web feeds for news, blogs, video, radio, tv
• RSS is common standard
• RSS “reader” can provide new XML documents to search engine

• Conversion
• Convert variety of documents into a consistent text plus metadata format
• e.g. HTML, XML, Word, PDF, etc. → XML
• Convert text encoding for different languages
• Using a Unicode standard like UTF-8

Text Acquisition

• Document data store

• Stores text, metadata, and other related content for documents


• Metadata is information about document such as type and creation date
• Other content includes links, anchor text

• Provides fast access to document contents for search engine components


• e.g. result list generation

• Could use relational database system


• More typically, a simpler, more efficient storage system is used due to huge
numbers of documents
Text Transformation
• Parser
• Processing the sequence of text tokens in the document to recognize structural
elements
• e.g., titles, links, headings, etc.

• Tokenizer recognizes “words” in the text


• must consider issues like capitalization, hyphens, apostrophes, non-alpha
characters, separators

• Markup languages such as HTML, XML often used to specify structure


• Tags used to specify document elements
• E.g., <h2> Overview </h2>
• Document parser uses syntax of markup language (or other formatting) to
identify structure
Text Transformation

• Stopping
• Remove common words
• e.g., “and”, “or”, “the”, “in”
• Some impact on efficiency and effectiveness
• Can be a problem for some queries

• Stemming
• Group words derived from a common stem
• e.g., “computer”, “computers”, “computing”, “compute”
• Usually effective, but not for all queries
• Benefits vary for different languages
Text Transformation

• Link Analysis
• Makes use of links and anchor text in web pages

• Link analysis identifies popularity and community information


• e.g., PageRank

• Anchor text can significantly enhance the representation of pages pointed to by links

• Significant impact on web search
Text Transformation
• Information Extraction
• Identify classes of index terms that are important for some applications
• e.g., named entity recognizers identify classes such as people, locations,
companies, dates, etc.

• Classifier
• Identifies class-related metadata for documents
• i.e., assigns labels to documents
• e.g., topics, reading levels, sentiment, genre
• Use depends on application

Index Creation
• Document Statistics
• Gathers counts and positions of words and other features
• Used in ranking algorithm

• Weighting
• Computes weights for index terms
• Used in ranking algorithm
• e.g., tf.idf weight
• Combination of term frequency in document and inverse document frequency in
the collection

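The tf.idf combination described above can be sketched as follows; the toy documents are illustrative:

```python
import math

def tf_idf(term, doc, collection):
    # tf.idf weight: term frequency in the document, multiplied by the
    # inverse document frequency of the term across the collection.
    tf = doc.count(term)
    df = sum(1 for d in collection if term in d)  # document frequency
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

docs = [
    ["web", "search", "engine"],
    ["web", "crawler"],
    ["ranking", "search"],
]
# "crawler" is rare (1 of 3 docs), so it is weighted higher than
# "web", which appears in 2 of 3 docs.
print(tf_idf("crawler", docs[1], docs))  # log(3/1) ~ 1.0986
print(tf_idf("web", docs[1], docs))      # log(3/2) ~ 0.4055
```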
Index Creation
• Inversion
• Core of indexing process

• Converts document-term information to term-document for indexing


• Difficult for very large numbers of documents

• Format of inverted file is designed for fast query processing


• Must also handle updates
• Compression used for efficiency
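Inversion can be sketched as building a term-to-postings mapping from document text (toy documents; a real engine adds compression and handles updates):

```python
from collections import defaultdict

def invert(docs):
    # Inversion: convert document -> terms information into
    # term -> (document, position) postings for fast query processing.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].append((doc_id, pos))
    return dict(index)

docs = {
    "d1": "web search engines index the web",
    "d2": "search ranking uses an index",
}
index = invert(docs)
print(index["index"])   # [('d1', 3), ('d2', 4)]
print(index["search"])  # [('d1', 1), ('d2', 0)]
```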

• Index Distribution
• Distributes indexes across multiple computers and/or multiple sites

• Essential for fast query processing with large numbers of documents

• Many variations
• Document distribution, term distribution, replication

• P2P and distributed IR involve search across multiple sites


User Interaction
• Query input
• Provides interface and parser for query language

• Most web queries are very simple, other applications may use forms

• Query language used to describe more complex queries and results of query transformation
• e.g., Boolean queries
• similar to SQL language used in database applications
• IR query languages also allow content and structure specifications, but focus on content
• Query transformation
• Improves initial query, both before and after initial search

• Includes text transformation techniques used for documents

• Spell checking and query suggestion provide alternatives to original query

• Query expansion and relevance feedback modify the original query with additional terms
User Interaction
• Results output
• Constructs the display of ranked documents for a query

• Generates snippets to show how queries match documents

• Highlights important words and passages

• Retrieves appropriate advertising in many applications

• May provide clustering and other visualization tools

Link Analysis: Ranking Algorithms
1. PageRank (Google):

• Developed by Larry Page & Sergey Brin (1998).

• Concept: A page is important if many important pages link to it.

• Formula: PR(A) = (1 − d) + d · Σ_{B→A} PR(B) / L(B)

where:
• PR(A) = PageRank of page A.
• d = Damping factor (~0.85).
• L(B) = Number of outgoing links from page B.
• B→A ranges over the pages B that link to A.

• https://youtu.be/P8Kt6Abq_rM?si=xzpl8vhikcVqceMm
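A minimal iterative implementation of the PageRank recurrence, run on a hypothetical three-page link graph:

```python
def pagerank(links, d=0.85, iterations=50):
    # Iterative PageRank: repeatedly apply
    #   PR(A) = (1 - d) + d * sum over pages B linking to A of PR(B) / L(B)
    # until the scores settle.
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for page in pages:
            rank_sum = sum(pr[b] / len(links[b])
                           for b in pages if page in links[b])
            new[page] = (1 - d) + d * rank_sum
        pr = new
    return pr

# Toy link graph: C is linked to by both A and B, so it ends up with
# the highest PageRank; "a page is important if many important pages
# link to it".
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print(max(pr, key=pr.get))  # C
```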
Link Analysis: Ranking Algorithms

2. HITS Algorithm (Hyperlink-Induced Topic Search):
• Developed by Jon Kleinberg (1999).
• Concept: Pages are classified as hubs or authorities.
• Hubs: Link to many authorities.
• Authorities: Are linked by many hubs.
• Used in topic-specific searches.

• https://youtu.be/K9WwBkmwDx4?si=rFaE8i7Clw8NahLy
Applications in Web Information Management

Domain: Application
• E-Commerce: Product ranking (Amazon, Flipkart).
• Social Media: Personalized recommendations (YouTube, Instagram).
• Healthcare: Medical information retrieval (PubMed, Google Health).
• Academia: Research paper search (Google Scholar, Semantic Scholar).
• News & Media: Fact-checking and trending news analysis.
