2M IRT CIT 2
1. Define text classification and its characteristics.
Text classification involves automatically assigning predefined labels to text based on its
content. The task can be supervised, using algorithms like Decision Trees, Naive Bayes, or
Support Vector Machines (SVM), where the model is trained on labelled examples. It can also
be unsupervised, such as clustering, where the data is grouped into categories based on
similarities without prior labelling.
Features like word frequencies, TF-IDF scores, and n-grams are typically used to represent
the text, allowing the classification model to discern patterns in the data.
2. State the Evaluation metrics with example.
Evaluation metrics in Information Retrieval are used to assess the performance of a
retrieval system. Key metrics include:
1. Precision: the fraction of retrieved documents that are relevant.
2. Recall: the fraction of relevant documents that are retrieved.
3. F-measure: the harmonic mean of precision and recall.
4. MAP (Mean Average Precision): averages precision over all queries.
These metrics are widely used, for example in search engines. They help in understanding both
the accuracy (precision) and the completeness (recall) of the system.
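The set-based metrics above can be computed directly from the retrieved and relevant document sets; a small worked example (document IDs invented for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Compute set-based precision, recall, and F-measure.

    retrieved, relevant: sets of document IDs.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 3 of the 4 retrieved documents are relevant; 3 of 6 relevant found.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
# p = 0.75, r = 0.5, f = 0.6
```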
3. Define Dimensionality reduction.
Dimensionality reduction is a process used to reduce the number of input variables or
features in a dataset while preserving its essential information. In text classification, it helps
reduce computational complexity by transforming the high-dimensional space (e.g.,
thousands of words) into a lower-dimensional space using techniques like Principal
Component Analysis (PCA) or Singular Value Decomposition (SVD).
This helps in improving model performance, reducing noise, and speeding up the training
process without losing significant information.
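The SVD route can be sketched with NumPy on a toy term-document matrix (the matrix below is invented for illustration): keeping only the top-k singular vectors gives each document a compact k-dimensional representation, as in Latent Semantic Indexing.

```python
import numpy as np

# A toy term-document matrix: 4 documents x 6 terms.
X = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 1, 2],
], dtype=float)

# Truncated SVD: keep the top-k singular vectors as the
# reduced document representation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_reduced = U[:, :k] * s[:k]   # each document is now a 2-d vector

# Documents 0 and 1 share terms, so they stay close after reduction,
# while documents 0 and 2 (no shared terms) end up far apart.
```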
4. What is Hash-based Dictionary in Indexing?
A hash-based dictionary in indexing is a structure used to map terms to their respective
document locations efficiently. It utilizes a hash function to assign a unique value to each
term, which points to a specific "bucket" where the term’s data is stored (such as document
frequencies or positions). This method ensures fast lookups, as the time to find a term is
reduced significantly, making it particularly useful in large-scale text search operations like
web search engines.
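A Python dict is itself a hash table, so it can model the structure described above: each term hashes to a bucket holding its postings (here, document IDs and in-document frequencies; the sample texts are invented for the example).

```python
from collections import defaultdict

# term -> {doc_id: frequency}; dict lookup is O(1) expected time.
dictionary = defaultdict(dict)

def index_term(term, doc_id):
    postings = dictionary[term]          # hash lookup of the bucket
    postings[doc_id] = postings.get(doc_id, 0) + 1

for doc_id, text in [(1, "web search web"), (2, "web index")]:
    for term in text.split():
        index_term(term, doc_id)

# dictionary["web"] -> {1: 2, 2: 1}: "web" occurs twice in doc 1.
```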
5. State the advantages of Naive Bayes.
Simple and Fast: It is computationally efficient and easy to implement, even with
large datasets.
Works Well with Small Data: Performs well with limited training data.
Handles High-Dimensional Data: Effective when the number of features is high.
Robust to Irrelevant Features: Irrelevant features have minimal impact on
predictions.
Performs Well with Categorical Data: Particularly suitable for tasks like spam
detection and text classification.
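As an illustration of the "simple and fast" point, a minimal multinomial Naive Bayes classifier for spam detection fits in a few lines (a sketch with Laplace smoothing; the toy training texts are invented for the example):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_counts = Counter(labels)      # class frequencies
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_score = None, float("-inf")
        n = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + smoothed log likelihood of each word
            score = math.log(self.class_counts[label] / n)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.split():
                score += math.log((self.word_counts[label][word] + 1) / total)
            if score > best_score:
                best, best_score = label, score
        return best

clf = NaiveBayes().fit(
    ["free offer win money", "win free prize",
     "meeting at noon", "project notes"],
    ["spam", "spam", "ham", "ham"],
)
```

Training is a single counting pass over the data, which is why the method scales well and copes with limited training examples.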
6. What is Ranking Function?
A ranking function is used in information retrieval to determine the order in which
documents are presented to the user based on their relevance to a query. It assigns a score
to each document, usually considering factors such as term frequency (TF), inverse
document frequency (IDF), and similarity measures like cosine similarity.
Documents are ranked in descending order of their scores, with the most relevant
documents appearing first in the search results. This is critical in search engines.
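A simple ranking function along these lines can score each document by the summed TF-IDF weight of the query terms and sort in descending order (a sketch; the toy documents are invented for the example):

```python
import math
from collections import Counter

def rank(query, docs):
    """Score documents by summed TF-IDF weight of the query terms,
    then sort in descending order of score."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = sum(
            (tf[t] / len(doc)) * math.log(n / df[t])
            for t in query if df.get(t)
        )
        scores.append((i, score))
    return sorted(scores, key=lambda x: x[1], reverse=True)

docs = [
    ["ranking", "search", "engine"],
    ["cooking", "recipes"],
    ["search", "ranking", "ranking"],
]
ranked = rank(["ranking"], docs)
# doc 2 mentions "ranking" twice, so it is ranked first
```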
7. Define focused crawler.
A focused crawler is a specialized web crawler designed to fetch web pages that are highly
relevant to a specific topic or domain. Unlike general crawlers, which aim to index the entire
web, a focused crawler prioritizes pages that match predefined criteria or keywords. By doing
so, it reduces bandwidth consumption and processing time while ensuring the collected data
is useful for a particular purpose, such as building topic-specific search engines or knowledge
bases.
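The prioritization idea can be sketched with a priority-queue frontier over a hypothetical in-memory link graph (no real network access; page names, words, and links below are all invented): pages whose content overlaps the topic keywords are fetched first.

```python
import heapq

PAGES = {  # url -> (page words, outgoing links); purely illustrative
    "a": ({"python", "retrieval"}, ["b", "c"]),
    "b": ({"sports", "news"}, ["d"]),
    "c": ({"python", "indexing"}, ["d"]),
    "d": ({"python", "search"}, []),
}

def focused_crawl(seed, topic, limit=3):
    frontier = [(0, seed)]            # (-relevance, url); min-heap
    seen, visited = {seed}, []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        words, links = PAGES[url]
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                relevance = len(PAGES[link][0] & topic)
                heapq.heappush(frontier, (-relevance, link))
    return visited

crawl = focused_crawl("a", {"python", "indexing"})
# "c" (two topic words) is fetched before "b" (zero topic words),
# so the off-topic branch is never visited within the budget.
```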
8. Describe the term Browsing
Browsing is a type of information-seeking behavior where the user explores a collection of
data or documents without a specific goal or precise query in mind. It is often used when the
user has a general interest in a topic and wants to gather information by casually navigating
through related documents. This contrasts with direct searching, where the user has a clear
objective. Browsing is common in digital libraries, websites, or online stores, where users
might explore categories and topics of interest.
9. List the use of inversion in indexing process.
Inversion refers to the creation of an inverted index, a fundamental data structure used in
search engines and information retrieval systems. The inverted index works by mapping each
term (or keyword) to a list of documents in which the term appears, along with its position
or frequency.
Example: In a search engine, if a user queries a word, the inverted index quickly identifies all
the documents containing that word, making the search process fast and efficient.
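The inversion step described above can be sketched in a few lines (the sample documents are invented for the example): one pass over the collection maps each term to the set of document IDs containing it, after which answering a query is a single lookup.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "information retrieval systems",
    2: "web search engines",
    3: "retrieval of web documents",
}
index = build_inverted_index(docs)
# index["retrieval"] -> {1, 3}: one lookup finds every matching document
```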
10. Define HITS.
HITS (Hyperlink-Induced Topic Search) is an algorithm designed to rank web pages based on
their authority and hub scores.
Authority pages are those that provide valuable content, and hub pages are those that link to
many authority pages. HITS operates by iteratively assigning these scores based on the link
structure of the web, emphasizing the importance of mutual reinforcement between hubs
and authorities.
It is particularly useful in domains like academic citations and identifying influential web
pages within a specific topic.
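The mutual reinforcement can be sketched directly (a minimal formulation of the iteration; the link graph below is invented for illustration): authority scores are summed from the hubs pointing in, hub scores from the authorities pointed to, with normalization each round.

```python
def hits(links, iterations=20):
    """Iteratively update hub and authority scores from link structure."""
    pages = set(links) | {p for out in links.values() for p in out}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority = sum of hub scores of pages linking to it
        auth = {p: sum(hub[q] for q in links if p in links[q])
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # hub = sum of authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# "d" is linked to by every hub, so it ends with the top authority
# score; "b" links to both authorities, so it becomes the top hub.
links = {"a": ["d"], "b": ["d", "e"], "c": ["d"]}
hub, auth = hits(links)
```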