Unit 2

The indexing process determines which terms (concepts) can represent a
particular item.
The transformation from the received item to the searchable data
structure is called indexing; it may be manual or automatic.
Search – Once the searchable data structure has been created, techniques
must be defined that correlate the user-entered query statement to the set
of items in the database to determine which items are returned to the user.
Information extraction – extracts specific information to be normalized
and entered into a structured database (DBMS). It focuses on very specific
concepts and contains a transformation process that modifies the extracted
information into a form compatible with the end structured database.
Information extraction is also referred to as Automatic File Build.
Indexing (originally called cataloging) is the oldest technique for
identifying the contents of items to assist in their retrieval.
• Subject indexing -> hierarchical subject indexing
• Indexing creates a bibliographic citation in a structured file that
references the original text: citation information about the item,
keywording of the subjects of the item, and a constrained-length
free-text field used for an abstract/summary
• Usually performed by professional indexers
Objectives:
• Represent the concepts within an item to facilitate the user's finding
of relevant information
• The full-text searchable data structures for items in the Document File
provide a new class of indexing called total document indexing:
all of the words within the item are potential index descriptors of the
subjects of the item
• Item normalization takes all possible words in an item and transforms
them into the processing tokens used in defining the searchable
representation of the item
• Current systems have the ability to automatically weight the
processing tokens based upon their potential importance in defining the
concepts in the item
• Other objectives of indexing – ranking, item clustering
Indexing Process: When an organization with multiple indexers decides
to create a public or private index, some procedural decisions on how to
create the index terms assist the indexers and end users in knowing what
to expect in the index file.
Scope of the indexing – defines what level of detail the subject index
will contain
• Based on usage scenarios of the end users
Linking of index terms – the need to link index terms together in a
single index entry for a particular concept
• Needed when there are multiple independent concepts found within
an item
Two factors are involved in deciding at what level to index
the concepts in an item:
Exhaustivity – the extent to which the different concepts in an item are indexed
Specificity – the preciseness (level of detail) of the index terms used
Automatic Indexing:
• Automatic indexing is the capability to automatically determine
the index terms to be assigned to an item
Simplest case: total document indexing
Complex case: emulate a human indexer and determine a limited
number of index terms for the major concepts in the item
Indexing by Term

The terms of the original item are used as the basis of the index process
• Two major techniques: statistical and natural language
• Statistical techniques
Calculation of weights uses statistical information such as the frequency
of occurrence of words and their distributions in the searchable database
Vector models and probabilistic models (a weighting sketch follows this list)
• Natural language processing
Processes items at the morphological, lexical, semantic, syntactic, and
discourse levels
Each level uses information from the previous level to perform its
additional analysis
Identifies events and event relationships
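As an illustration of the statistical technique, the sketch below computes term frequency–inverse document frequency (TF-IDF) weights. The function name and the particular normalization are assumptions for illustration; many weighting variants exist.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute illustrative TF-IDF weights for each processing token.

    docs: list of token lists (items already normalized into processing tokens).
    Returns one {token: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Weight grows with in-document frequency, shrinks with
            # how common the token is across the database.
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return weights

docs = [
    "the market increased in value".split(),
    "the market did not change".split(),
    "indexing assigns weights to tokens".split(),
]
print(tf_idf_weights(docs)[0])
```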
Indexing by Concept:
• The basis for concept indexing is that there are many ways to
express the same idea, and increased retrieval performance comes
from using a single representation
• Indexing by term treats each of these occurrences as a different
index term and then uses thesauri or other query expansion techniques
to expand a query to find the different ways the same thing has been
represented
• Concept indexing determines a canonical set of concepts based on a
test set of terms and uses them as a basis for indexing all items
Overgeneration measures the amount of irrelevant information that
is extracted. This could be caused by templates filled on topics that
are not intended to be extracted or by slots that get filled with
non-relevant data.
Fallout measures how much a system assigns incorrect slot fillers as
the number of potential incorrect slot fillers increases.
There are two processes associated with information extraction:
determination of facts to go into structured fields in a database, and
extraction of text that can be used to summarize an item. In the first
case, only a subset of the important facts in an item may be identified
and extracted. In summarization, all of the major concepts in the item
should be represented in the summary.
Knowledge of the data structures used in Information Retrieval Systems
provides insight into the capabilities available to the systems that
implement them. Each data structure has a set of associated capabilities,
and its selection provides insight into the implementers' objectives.
From an Information Retrieval System perspective, the two important
aspects of a data structure are its ability to represent concepts and
their relationships and how well it supports locating those concepts.
There are usually two major data structures in any information
system.
One structure stores and manages the received items in their
normalized form.
The process supporting this structure is called the
“document manager.”

The other major data structure contains the processing tokens and
associated data to support search.

The results of a search are references to the items that satisfy the
search statement, which are passed to the document manager for
retrieval.
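A minimal sketch of these two structures and the flow between them, assuming illustrative names (DocumentManager, build_search_structure, and search are not from the source):

```python
class DocumentManager:
    """First structure: stores received items in normalized form."""
    def __init__(self):
        self.items = {}               # item id -> normalized text

    def add(self, item_id, text):
        self.items[item_id] = text

    def get(self, item_id):
        return self.items[item_id]

def build_search_structure(manager):
    """Second structure: processing tokens mapped to the items containing them."""
    index = {}
    for item_id, text in manager.items.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(item_id)
    return index

def search(index, manager, query):
    """A search returns item references, which the document manager resolves."""
    ids = set.intersection(*(index.get(t, set()) for t in query.lower().split()))
    return [manager.get(i) for i in ids]

mgr = DocumentManager()
mgr.add(1, "stemming reduces words to stems")
mgr.add(2, "inverted files store processing tokens")
idx = build_search_structure(mgr)
print(search(idx, mgr, "processing tokens"))
```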
The concept of stemming has been applied to information systems since
their initial automation in the 1960s. The original goal of stemming
was to improve performance and require fewer system resources by
reducing the number of unique words that a system has to contain. With
the continued significant increase in storage and computing power, use
of stemming for performance reasons is no longer as important.
Stemming is now being reviewed for the potential improvements it can
make in recall versus its associated decline in precision. A system
designer can trade off the increased overhead of stemming in creating
processing tokens versus the reduced search-time overhead of processing
query terms with trailing wildcards ("don't cares").
Stemming reduces the diversity of representations of a concept (word) to
a canonical morphological representation. The risk with stemming is that
concept discrimination information may be lost in the process, causing a
decrease in precision and in the ability for ranking to be performed. On
the positive side, stemming has the potential to improve recall.

The most common stemming algorithm removes suffixes and prefixes,
sometimes recursively, to derive the final stem. Other techniques, such
as table lookup and successor stemming, provide alternatives that require
additional overheads. Successor stemmers determine prefix overlap as the
length of a stem is increased.
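A minimal sketch of suffix-removal stemming; the suffix list and rule order are illustrative assumptions, not the full Porter algorithm:

```python
# Illustrative suffix list, longest first so "connections" -> "connect".
SUFFIXES = ["ions", "ing", "ion", "ness", "ed", "ly", "s"]

def stem(word):
    """Strip the first matching suffix; a real stemmer (e.g. Porter)
    applies ordered rule phases, sometimes recursively."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))   # all five reduce to the stem "connect"
```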
Terms with Common Stem

Connect (root word)
Connected
Connecting
Connection
Connections

Stemming is the process of reducing terms to their roots before indexing.

Memorial, Memory, Memorize

These words share a common stem but are not synonyms; they have different
meanings, so stemming can conflate distinct concepts.

Some endings are also removed even if the root form is not found in the
dictionary:

-ness, -ly

Factorial -> factory
Matrices -> matrix
The successor stemming process determines the successor varieties for a
word, uses this information to divide the word into segments, and
selects one of the segments as the stem (a sketch follows).
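A hedged sketch of computing successor varieties; the corpus is invented for illustration, and a real stemmer would then apply a segmentation heuristic (e.g. peaks in the variety counts) to pick the stem:

```python
def successor_varieties(word, corpus):
    """For each prefix of `word`, count the distinct letters that follow
    that prefix across a corpus of known words."""
    varieties = {}
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties[prefix] = len(successors)
    return varieties

corpus = ["able", "axle", "accident", "ape", "about", "apply"]
# Sharp drops after a peak suggest segment (stem) boundaries.
print(successor_varieties("apple", corpus))
```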
Inversion list file structures are well suited to storing concepts and their
relationships. Each inversion list can be thought of as representing a
particular concept; the inversion list is then a concordance of all the items
that contain that concept.

Inversion lists are used because they provide optimum performance in
searching large databases. The optimality comes from the minimization of
data flow in resolving a query. A sketch follows.
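A minimal sketch of inversion lists acting as a concordance, recording for each processing token the items and word positions where it occurs; the layout shown is one common choice, not the only one:

```python
from collections import defaultdict

def build_inversion_lists(items):
    """Map each processing token to a concordance: {item_id: [positions]}."""
    inv = defaultdict(lambda: defaultdict(list))
    for item_id, text in items.items():
        for pos, token in enumerate(text.lower().split()):
            inv[token][item_id].append(pos)
    return inv

items = {
    1: "the market increased in value",
    2: "the market did not change",
}
inv = build_inversion_lists(items)
print(dict(inv["market"]))   # {1: [1], 2: [1]}
```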
N-grams can be viewed as a special technique for stemming and as a unique
data structure in information systems. N-grams are fixed-length consecutive
series of “n” characters.

Whereas stemming generally determines a stem of a word that represents the
semantic meaning of the word, n-grams do not care about semantics.

The symbol # is used to represent the interword symbol, which is any one of
the symbols such as blank, period, semicolon, colon, etc.

It is possible for the same n-gram to be created multiple times from a
single word.

N-grams were applied as early as World War II, when they were used by
cryptographers. A major use of n-grams is in spelling error detection
and correction.

Using n-grams with interword symbols included between valid processing
tokens equates to a continuous text input data structure that is indexed
in contiguous “n”-character tokens (see the sketch below).
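A minimal sketch of generating trigrams (n = 3) over continuous text, with # standing in for the interword symbols; the helper name is illustrative:

```python
def ngrams(text, n=3, interword="#"):
    """Generate contiguous n-character tokens from continuous text,
    with '#' replacing interword symbols (blank, period, ...)."""
    continuous = interword.join(text.lower().split())
    return [continuous[i:i + n] for i in range(len(continuous) - n + 1)]

print(ngrams("sea colony"))
# ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']
```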
A different way of addressing a continuous text input data structure comes
from PAT trees and PAT arrays. The input stream is transformed into a
searchable data structure consisting of substrings.
The name PAT comes from PATRICIA trees (PATRICIA stands for Practical
Algorithm To Retrieve Information Coded In Alphanumeric).
It is possible to have substrings go beyond the length of the input stream
by adding additional null characters.
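A hedged sketch of the substrings (sistrings) underlying a PAT structure; the function name is illustrative, and a real PAT tree stores these in a Patricia trie rather than a sorted list:

```python
def sistrings(text, length=None, pad="\0"):
    """Generate one substring starting at every position of the input
    stream, padded with null characters so a substring may extend
    beyond the end of the input."""
    length = length or len(text)
    padded = text + pad * length
    return [padded[i:i + length] for i in range(len(text))]

# A PAT array is essentially the lexicographically sorted order of
# these substring start positions.
for s in sorted(sistrings("home")):
    print(repr(s))
```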
Hypertext and markup data structures embed display directives in the text,
e.g. <font color="red" size=12>, and represent items as nodes connected
by links.
Hidden Markov models have been applied for the last 20 years to solving
problems in speech recognition and, to a lesser extent, in the areas of
locating named entities, optical character recognition, and topic
identification.

An HMM can best be understood by first defining a discrete Markov process.

The state will be one of the following, observed at the closing of the
market:

1. State 1: the market decreased
2. State 2: the market did not change
3. State 3: the market increased in value
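A minimal sketch of the discrete Markov process behind an HMM, using the three market states above; the transition probabilities are illustrative assumptions, not from the source:

```python
import random

STATES = ["decreased", "unchanged", "increased"]
# Illustrative row-stochastic transition probabilities: row = current
# state, columns = probability of each next state.
TRANSITIONS = {
    "decreased": [0.4, 0.3, 0.3],
    "unchanged": [0.2, 0.5, 0.3],
    "increased": [0.3, 0.2, 0.5],
}

def simulate(start, days, seed=0):
    """Simulate a discrete Markov process: the next state depends
    only on the current state (the Markov property)."""
    random.seed(seed)
    state, path = start, [start]
    for _ in range(days):
        state = random.choices(STATES, weights=TRANSITIONS[state])[0]
        path.append(state)
    return path

print(simulate("unchanged", 5))
```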
