IR Assignment 1 Answers

INFORMATION RETRIEVAL

✅ Final Answer

🔍 Boolean Query:
cybersecurity AND (encryption OR firewall) AND NOT malware

📘 What This Query Means:

This Boolean query is constructed to precisely filter academic articles that meet the following
criteria:

1. Must contain the word "cybersecurity".
2. Must also contain either the word "encryption" or "firewall" (or both).
3. Must not contain the word "malware".

This logical structure allows a researcher to narrow down results to highly relevant
documents that focus on secure technologies (like encryption and firewalls), strictly within
the cybersecurity domain, and exclude documents that focus on threats like malware.

🗂️ Sample Document Set:

Document ID   Content
D1            "Cybersecurity strategies often include encryption and firewall setup."
D2            "Cybersecurity depends on advanced encryption techniques."
D3            "Cybersecurity and firewall protection are essential against malware."
D4            "Encryption is important, but malware is still a threat."
D5            "Cybersecurity, encryption, and firewall systems work together."
D6            "Malware is harmful to all systems, even with firewalls."

🔄 Step-by-Step Evaluation Using the Boolean Retrieval Model:

Step 1: Filter documents containing “cybersecurity”

• Match: D1, D2, D3, D5

(These documents are candidates for further filtering.)

Step 2: From those, keep documents containing “encryption” OR “firewall”

• D1 → has both "encryption" and "firewall" ✅
• D2 → has "encryption" ✅
• D3 → has "firewall" ✅
• D5 → has both "encryption" and "firewall" ✅

(All still qualify.)

Step 3: Exclude documents containing “malware”

• D3 → contains "malware" ❌
• D1, D2, D5 → do not contain "malware" ✅

✅ Final Matching Documents:

Document ID   Why It Matches
D1            Contains "cybersecurity", "encryption", "firewall"; no "malware" ✅
D2            Contains "cybersecurity", "encryption"; no "malware" ✅
D5            Contains "cybersecurity", "encryption", "firewall"; no "malware" ✅
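The three evaluation steps can be sketched in Python as set operations over an inverted index. This is a minimal illustration, not a production search engine: the helper names `build_index` and `postings` are hypothetical, and the tokenizer assumes case-insensitive, exact-word matching with no stemming (so "firewalls" in D6 does not count as "firewall").

```python
import re

# Sample documents copied from the document set above.
DOCS = {
    "D1": "Cybersecurity strategies often include encryption and firewall setup.",
    "D2": "Cybersecurity depends on advanced encryption techniques.",
    "D3": "Cybersecurity and firewall protection are essential against malware.",
    "D4": "Encryption is important, but malware is still a threat.",
    "D5": "Cybersecurity, encryption, and firewall systems work together.",
    "D6": "Malware is harmful to all systems, even with firewalls.",
}


def build_index(docs):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        # Assumption: words are runs of letters; no stemming is applied.
        for term in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(term, set()).add(doc_id)
    return index


def postings(index, term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


index = build_index(DOCS)

# Step 1: cybersecurity
step1 = postings(index, "cybersecurity")
# Step 2: AND (encryption OR firewall)
step2 = step1 & (postings(index, "encryption") | postings(index, "firewall"))
# Step 3: AND NOT malware
step3 = step2 - postings(index, "malware")

print(sorted(step1))  # ['D1', 'D2', 'D3', 'D5']
print(sorted(step2))  # ['D1', 'D2', 'D3', 'D5']
print(sorted(step3))  # ['D1', 'D2', 'D5']
```

AND maps to set intersection, OR to union, and AND NOT to set difference, which is why the result is an unranked set: a document either satisfies the query or it does not.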

🧠 Conclusion:

Using this Boolean Retrieval Model, the system retrieves exactly those documents that:

• Contain the term "cybersecurity",
• Mention at least one of the key defense techniques, "encryption" or "firewall",
• And do not mention "malware", which the researcher may deem irrelevant or outside the scope.

This method helps in zeroing in on highly specific and relevant documents, making
literature searches in digital libraries both efficient and accurate.

✅ Answer: Probabilistic Retrieval Model Evaluation

📊 Given Data:

• Total documents = 10,000
• Term "data" appears in 500 documents
• Term "breach" appears in 100 documents
• Document D1 contains both "data" and "breach"
• Document D2 contains only "data"

🔍 1. Which document is more likely to be ranked higher, and why?
✅ Document D1 is more likely to be ranked higher than D2.

Reason:

The probabilistic retrieval model (like BM25) estimates how likely a document is relevant
based on the presence and weight of query terms. Two main factors here are:

• D1 contains both query terms ("data" and "breach").
• D2 contains only one query term ("data").

Because the score sums the contributions of the matched query terms, matching more query
terms yields a higher estimated relevance, so D1 outranks D2 under this model.

📈 2. How does term weighting (like IDF) affect this decision?
IDF (Inverse Document Frequency):

The formula for IDF is often:

$$\text{IDF}(t) = \log\left(\frac{N}{df_t}\right)$$

Where:

• $N$ = total number of documents
• $df_t$ = number of documents containing term $t$

Let’s calculate (using base-10 logarithms):

• $\text{IDF}(\text{data}) = \log_{10}\left(\frac{10{,}000}{500}\right) = \log_{10}(20) \approx 1.30$
• $\text{IDF}(\text{breach}) = \log_{10}\left(\frac{10{,}000}{100}\right) = \log_{10}(100) = 2.00$

Effect:

• "breach" has a higher IDF than "data", meaning it's more discriminative (rarer and thus more meaningful for ranking).
• Since D1 contains both "data" and especially the high-IDF term "breach", it will receive a higher cumulative relevance score.
• D2 only contains "data", the more common (less informative) term → lower score.

Thus, IDF emphasizes rare, informative terms, boosting the rank of D1.
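The IDF arithmetic and its effect on the two documents can be checked with a short Python sketch. It assumes the base-10 IDF formula given above and a simplified presence-only score (each matched query term contributes its IDF once); the function names `idf` and `presence_score` are illustrative, and this is not the full BM25 formula.

```python
import math

N = 10_000                           # total documents (given)
DF = {"data": 500, "breach": 100}    # document frequencies (given)


def idf(term):
    """Base-10 inverse document frequency, as in the formula above."""
    return math.log10(N / DF[term])


def presence_score(doc_terms, query=("data", "breach")):
    """Simplified score: sum the IDF of every query term the document contains."""
    return sum(idf(t) for t in query if t in doc_terms)


d1 = presence_score({"data", "breach"})   # D1 contains both terms
d2 = presence_score({"data"})             # D2 contains only "data"

print(round(idf("data"), 2), round(idf("breach"), 2))  # 1.3 2.0
print(round(d1, 2), round(d2, 2))                      # 3.3 1.3
```

Under this simplification D1 scores about 3.30 against 1.30 for D2, and most of D1's advantage comes from the rarer term "breach".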

📉 3. How would using term frequency (TF) further refine the ranking if both documents contained both terms but with different frequencies?
TF (Term Frequency):

TF measures how many times a term appears in a document. A higher frequency suggests
the document is more relevant to that term.

If both D1 and D2 contain both "data" and "breach", but:

• D1: "data" (3 times), "breach" (1 time)
• D2: "data" (1 time), "breach" (5 times)

Then the ranking would be influenced as follows:

• D2 has more occurrences of "breach", the higher-weighted term (higher IDF).
• The probabilistic model (e.g., BM25) balances TF and IDF; each matched query term contributes roughly the following, and the contributions are summed over the query:

$$\text{Score} \propto \text{TF} \times \text{IDF}$$

So:

• D2's high frequency of "breach" may give it a higher score than D1, despite its lower "data" frequency, depending on the scoring formula and term saturation effects (e.g., BM25 dampens very high term frequencies).
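To make the hypothetical frequencies concrete, the sketch below scores both documents two ways: plain TF × IDF summed over the query terms, and a BM25-style saturated TF. Several things here are assumptions for illustration only: the base-10 IDF from the earlier calculation is reused instead of BM25's own IDF variant, document-length normalization is ignored (equivalent to b = 0), and k1 = 1.2 is simply a commonly used default.

```python
import math

N = 10_000
DF = {"data": 500, "breach": 100}

# Hypothetical term frequencies from the scenario above.
TF = {
    "D1": {"data": 3, "breach": 1},
    "D2": {"data": 1, "breach": 5},
}


def idf(term):
    """Base-10 IDF, matching the earlier calculation."""
    return math.log10(N / DF[term])


def tfidf_score(tf):
    """Plain TF x IDF, summed over the query terms."""
    return sum(count * idf(term) for term, count in tf.items())


def saturated_score(tf, k1=1.2):
    """BM25-style TF saturation: tf * (k1 + 1) / (tf + k1).

    Assumption: length normalization is omitted (b = 0).
    """
    return sum(idf(t) * c * (k1 + 1) / (c + k1) for t, c in tf.items())


for doc_id, tf in TF.items():
    print(doc_id, round(tfidf_score(tf), 2), round(saturated_score(tf), 2))
# D1 5.9 4.04
# D2 11.3 4.85
```

With these numbers D2 ranks first under both scorings, but saturation shrinks its lead considerably, since the fifth occurrence of "breach" adds far less than the first.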

🧠 Summary:

Factor          Effect
Term presence   More query terms matched → higher relevance (D1 initially wins)
IDF             Rare terms (like "breach") weigh more → favors D1 (since D2 lacks it)
TF              More occurrences of key terms → can shift ranking, especially for high-IDF terms

➡️ So, D1 ranks higher initially, but TF differences (if added) could tip the scale in favor
of D2 if it has a much higher frequency of the rare term ("breach").
