Unit 2

The indexing process determines which terms (concepts) can represent a
particular item.
The transformation from the received item to the searchable data
structure is called indexing; it may be manual or automatic.
Search – Once the searchable data structure has been created, techniques
must be defined that correlate the user-entered query statement to the set
of items in the database to determine which items are returned to the user.
Information extraction – extracts specific information to be normalized
and entered into a structured database (DBMS). It focuses on very specific
concepts and contains a transformation process that modifies the extracted
information into a form compatible with the end structured database.
Information extraction is also referred to as Automatic File Build.
Indexing (originally called cataloging) is the oldest technique for
identifying the contents of items to assist in their retrieval.
• Subject indexing -> hierarchical subject indexing
• Indexing creates a bibliographic citation in a structured file that
references the original text: citation information about the item,
keywording of the subjects of the item, and a constrained-length
free-text field used for an abstract/summary
• Usually performed by professional indexers
Objectives:
• Represent the concepts within an item to facilitate the user's finding
of relevant information
• The full-text searchable data structures for items in the Document File
provide a new class of indexing called total document indexing:
all of the words within the item are potential index descriptors of the
subjects of the item
• Item normalization takes all possible words in an item and transforms
them into the processing tokens used in defining the searchable
representation of the item
• Current systems have the ability to automatically weight the
processing tokens based upon their potential importance in defining the
concepts in the item
• Other objectives of indexing – ranking, item clustering
Indexing Process: When an organization with multiple indexers decides
to create a public or private index, some procedural decisions on how to
create the index terms assist the indexers and end users in knowing what
to expect in the index file.
Scope of the indexing – defines what level of detail the subject index
will contain
• Based on usage scenarios of the end users
Linking of index terms – the need to link index terms together in a
single index entry for a particular concept
• Needed when there are multiple independent concepts found within
an item
Two factors are involved in deciding at what level to index
the concepts in an item:
Exhaustivity – the extent to which the different concepts in an item are indexed
Specificity – the preciseness (level of detail) of the index terms used
Automatic Indexing:
• Automatic indexing is the capability to automatically determine
the index terms to be assigned to an item
Simplest case: total document indexing
Complex case: emulate a human indexer and determine a limited
number of index terms for the major concepts in the item
Indexing by Term

The terms of the original item are used as the basis of the index process
• Two major techniques: statistical and natural language
• Statistical techniques
Calculation of weights uses statistical information such as the frequency
of occurrence of words and their distributions in the searchable database
Vector models and probabilistic models (a weighting sketch follows this list)
• Natural language processing
Processes items at the morphological, lexical, semantic, syntactic, and
discourse levels
Each level uses information from the previous level to perform its
additional analysis
Identifies events and event relationships
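As an illustration of the statistical technique, the sketch below computes term frequency–inverse document frequency (TF-IDF) weights. The function name and the particular normalization are assumptions for illustration; many weighting variants exist.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute illustrative TF-IDF weights for each processing token.

    docs: list of token lists (items already normalized into processing tokens).
    Returns one {token: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Weight grows with in-document frequency, shrinks with
            # how common the token is across the database.
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return weights

docs = [
    "the market increased in value".split(),
    "the market did not change".split(),
    "indexing assigns weights to tokens".split(),
]
print(tf_idf_weights(docs)[0])
```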
Indexing by Concept:
• The basis for concept indexing is that there are many ways to
express the same idea, and increased retrieval performance comes
from using a single representation
• Indexing by term treats each of these occurrences as a different
index term and then uses thesauri or other query expansion techniques
to expand a query to find the different ways the same thing has been
represented
• Concept indexing determines a canonical set of concepts based on a
test set of terms and uses them as a basis for indexing all items
Overgeneration measures the amount of irrelevant information that
is extracted. This could be caused by templates filled on topics that
are not intended to be extracted or by slots that get filled with
non-relevant data.
Fallout measures how much a system assigns incorrect slot fillers as
the number of potential incorrect slot fillers increases.
There are two processes associated with information extraction:
determination of facts to go into structured fields in a database, and
extraction of text that can be used to summarize an item. In the first
case, only a subset of the important facts in an item may be identified
and extracted. In summarization, all of the major concepts in the item
should be represented in the summary.
Knowledge of the data structures used in Information Retrieval Systems
provides insight into the capabilities available to the systems that
implement them. Each data structure has a set of associated capabilities,
and its selection provides insight into the implementers' objectives.
From an Information Retrieval System perspective, the two important
aspects of a data structure are its ability to represent concepts and
their relationships and how well it supports locating those concepts.
There are usually two major data structures in any information
system.
One structure stores and manages the received items in their
normalized form.
The process supporting this structure is called the
“document manager.”

The other major data structure contains the processing tokens and
associated data to support search.

The results of a search are references to the items that satisfy the
search statement, which are passed to the document manager for
retrieval.
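A minimal sketch of these two structures and the flow between them, assuming illustrative names (DocumentManager, build_search_structure, and search are not from the source):

```python
class DocumentManager:
    """First structure: stores received items in normalized form."""
    def __init__(self):
        self.items = {}               # item id -> normalized text

    def add(self, item_id, text):
        self.items[item_id] = text

    def get(self, item_id):
        return self.items[item_id]

def build_search_structure(manager):
    """Second structure: processing tokens mapped to the items containing them."""
    index = {}
    for item_id, text in manager.items.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(item_id)
    return index

def search(index, manager, query):
    """A search returns item references, which the document manager resolves."""
    ids = set.intersection(*(index.get(t, set()) for t in query.lower().split()))
    return [manager.get(i) for i in ids]

mgr = DocumentManager()
mgr.add(1, "stemming reduces words to stems")
mgr.add(2, "inverted files store processing tokens")
idx = build_search_structure(mgr)
print(search(idx, mgr, "processing tokens"))
```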
The concept of stemming has been applied to information systems since
their initial automation in the 1960s. The original goal of stemming
was to improve performance and require fewer system resources by
reducing the number of unique words that a system has to contain. With
the continued significant increase in storage and computing power, use
of stemming for performance reasons is no longer as important.
Stemming is now being reviewed for the potential improvements it can
make in recall versus its associated decline in precision. A system
designer can trade off the increased overhead of stemming in creating
processing tokens versus the reduced search-time overhead of processing
query terms with trailing wildcards ("don't cares").
Stemming reduces the diversity of representations of a concept (word) to
a canonical morphological representation. The risk with stemming is that
concept discrimination information may be lost in the process, causing a
decrease in precision and in the ability for ranking to be performed. On
the positive side, stemming has the potential to improve recall.

The most common stemming algorithm removes suffixes and prefixes,
sometimes recursively, to derive the final stem. Other techniques, such
as table lookup and successor stemming, provide alternatives that require
additional overheads. Successor stemmers determine prefix overlap as the
length of a stem is increased.
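A minimal sketch of suffix-removal stemming; the suffix list and rule order are illustrative assumptions, not the full Porter algorithm:

```python
# Illustrative suffix list, longest first so "connections" -> "connect".
SUFFIXES = ["ions", "ing", "ion", "ness", "ed", "ly", "s"]

def stem(word):
    """Strip the first matching suffix; a real stemmer (e.g. Porter)
    applies ordered rule phases, sometimes recursively."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))   # all five reduce to the stem "connect"
```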
Terms with Common Stem

Connect (root word)
Connected
Connecting
Connection
Connections

Stemming is the process of reducing terms to their roots before indexing.

Memorial, Memory, Memorize

These words share a common stem but are not synonyms; they have different
meanings, so stemming can conflate distinct concepts.

Some endings are also removed even if the root form is not found in the
dictionary:

-ness, -ly

Factorial -> factory
Matrices -> matrix
The successor stemming process determines the successor varieties for a
word, uses this information to divide the word into segments, and
selects one of the segments as the stem (a sketch follows).
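A hedged sketch of computing successor varieties; the corpus is invented for illustration, and a real stemmer would then apply a segmentation heuristic (e.g. peaks in the variety counts) to pick the stem:

```python
def successor_varieties(word, corpus):
    """For each prefix of `word`, count the distinct letters that follow
    that prefix across a corpus of known words."""
    varieties = {}
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        varieties[prefix] = len(successors)
    return varieties

corpus = ["able", "axle", "accident", "ape", "about", "apply"]
# Sharp drops after a peak suggest segment (stem) boundaries.
print(successor_varieties("apple", corpus))
```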
Inversion list file structures are well suited to storing concepts and their
relationships. Each inversion list can be thought of as representing a
particular concept; the inversion list is then a concordance of all the items
that contain that concept.

Inversion lists are used because they provide optimum performance in
searching large databases. The optimality comes from the minimization of
data flow in resolving a query. A sketch follows.
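A minimal sketch of inversion lists acting as a concordance, recording for each processing token the items and word positions where it occurs; the layout shown is one common choice, not the only one:

```python
from collections import defaultdict

def build_inversion_lists(items):
    """Map each processing token to a concordance: {item_id: [positions]}."""
    inv = defaultdict(lambda: defaultdict(list))
    for item_id, text in items.items():
        for pos, token in enumerate(text.lower().split()):
            inv[token][item_id].append(pos)
    return inv

items = {
    1: "the market increased in value",
    2: "the market did not change",
}
inv = build_inversion_lists(items)
print(dict(inv["market"]))   # {1: [1], 2: [1]}
```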
N-grams can be viewed as a special technique for stemming and as a unique
data structure in information systems. N-grams are fixed-length consecutive
series of “n” characters.

Whereas stemming generally determines a stem of a word that represents the
semantic meaning of the word, n-grams do not care about semantics.

The symbol # is used to represent the interword symbol, which is any one of
the symbols such as blank, period, semicolon, colon, etc.

It is possible for the same n-gram to be created multiple times from a
single word.

N-grams were applied as early as World War II, when they were used by
cryptographers. A major use of n-grams is in spelling error detection
and correction.

Using n-grams with interword symbols included between valid processing
tokens equates to a continuous text input data structure that is indexed
in contiguous “n”-character tokens (see the sketch below).
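A minimal sketch of generating trigrams (n = 3) over continuous text, with # standing in for the interword symbols; the helper name is illustrative:

```python
def ngrams(text, n=3, interword="#"):
    """Generate contiguous n-character tokens from continuous text,
    with '#' replacing interword symbols (blank, period, ...)."""
    continuous = interword.join(text.lower().split())
    return [continuous[i:i + n] for i in range(len(continuous) - n + 1)]

print(ngrams("sea colony"))
# ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']
```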
A different way of addressing a continuous text input data structure comes
from PAT trees and PAT arrays. The input stream is transformed into a
searchable data structure consisting of substrings.
The name PAT comes from PATRICIA trees (PATRICIA stands for Practical
Algorithm To Retrieve Information Coded In Alphanumeric).
It is possible to have substrings go beyond the length of the input stream
by adding additional null characters.
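A hedged sketch of the substrings (sistrings) underlying a PAT structure; the function name is illustrative, and a real PAT tree stores these in a Patricia trie rather than a sorted list:

```python
def sistrings(text, length=None, pad="\0"):
    """Generate one substring starting at every position of the input
    stream, padded with null characters so a substring may extend
    beyond the end of the input."""
    length = length or len(text)
    padded = text + pad * length
    return [padded[i:i + length] for i in range(len(text))]

# A PAT array is essentially the lexicographically sorted order of
# these substring start positions.
for s in sorted(sistrings("home")):
    print(repr(s))
```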
Hypertext and markup data structures embed display directives in the text,
e.g. <font color="red" size=12>, and represent items as nodes connected
by links.
Hidden Markov models have been applied for the last 20 years to solving
problems in speech recognition and, to a lesser extent, in the areas of
locating named entities, optical character recognition, and topic
identification.

An HMM can best be understood by first defining a discrete Markov process.

The state will be one of the following, observed at the closing of the
market:

1. State 1: the market decreased
2. State 2: the market did not change
3. State 3: the market increased in value
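A minimal sketch of the discrete Markov process behind an HMM, using the three market states above; the transition probabilities are illustrative assumptions, not from the source:

```python
import random

STATES = ["decreased", "unchanged", "increased"]
# Illustrative row-stochastic transition probabilities: row = current
# state, columns = probability of each next state.
TRANSITIONS = {
    "decreased": [0.4, 0.3, 0.3],
    "unchanged": [0.2, 0.5, 0.3],
    "increased": [0.3, 0.2, 0.5],
}

def simulate(start, days, seed=0):
    """Simulate a discrete Markov process: the next state depends
    only on the current state (the Markov property)."""
    random.seed(seed)
    state, path = start, [start]
    for _ in range(days):
        state = random.choices(STATES, weights=TRANSITIONS[state])[0]
        path.append(state)
    return path

print(simulate("unchanged", 5))
```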
