
National University of Computer & Emerging Sciences, Karachi

Spring-2020 CS-Department
Final Examination (Sol)
June 22, 2020 (Time: 9AM – 12:30PM)

Course Code: CS317 Course Name: Information Retrieval


Instructor Name: Dr. Muhammad Rafi
Student Roll No: Section No:
• This is an offline exam. You need to produce your solutions as a single PDF and upload it on Slate.
• Read each question completely before answering it. There are 6 questions and 4 pages.
• In case of any ambiguity, you may make an assumption, but your assumption should not contradict any statement in the question paper.
• All questions must be answered in the sequence given in the question paper.
• Be specific, to the point, and illustrate with a diagram/code where necessary.

Time: 210 minutes + 30 min. for submission            Max Marks: 100 points

Basic IR Concepts / IR Retrieval Models

Question No. 1 [Time: 30 Min] [Marks: 20]

a. Answer the following questions to the point. Not more than 5 lines of text. [2x5]

1. What are some of the limitations of the Boolean Retrieval Model in Information Retrieval (IR)?

The main limitation of the Boolean Model (BM) is that retrieval is based on exact matching of query terms. Another problem with the BM is query formulation: it is very hard to come up with a Boolean expression for a complex information need. Moreover, the results of the BM are flat, i.e., there is no ranking.

2. In a practical implementation of the Vector Space Model (VSM), what is the major problem you
have observed? Illustrate.

In the VSM, large documents can give rise to a very large number of features. Representing these features with a weighting such as tf*idf produces very small values, so the cosine similarities computed from them are also very small (underflow of values may occur). This is one of the most basic implementation challenges for the VSM.
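A minimal sketch of the weighting and similarity computation (using, for illustration, the tf and idf values of terms w1–w3 of document d1 from part (b) below) shows how small the tf*idf products become, which is why implementations keep vectors length-normalized and use double precision:

    import math

    def tfidf_vector(tfs, idfs):
        # Elementwise tf*idf weighting; with fractional tf and idf values
        # the products quickly become very small numbers.
        return [tf * idf for tf, idf in zip(tfs, idfs)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    d = tfidf_vector([0.34, 0.12, 0.23], [0.14, 0.38, 0.51])  # tiny weights
    q = [0.0, 1.0, 1.0]                                       # binary query vector
    print(d)             # ≈ [0.0476, 0.0456, 0.1173] -- already small
    print(cosine(d, q))  # ≈ 0.856; length normalization keeps the score in [0, 1]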

3. What is the important factor that can result in a false negative match in a VSM?

Semantic sensitivity: documents with a similar context but a different term vocabulary may result in a "false negative" match.

4. From a human-language standpoint, what is the major drawback of the VSM for IR?

From a human-language standpoint, word order is very important, and the VSM loses this order in its bag-of-words representation.

5. In Probabilistic Information Retrieval, what do we mean by the assumption "If the word is
not in the query, it is equally likely to occur in relevant and non-relevant documents"? Explain.

In PIR, the assumption is that a word not in the query is equally likely to appear in relevant and non-relevant documents. This simplifies the derivation of the retrieval formula, since the equal-likelihood odds for such words cancel between the numerator and denominator of the odds ratio.

b. Consider the partial document collection D = {d1: w1 w2 w4 w1; d2: w3 w2 w6; d3: w1 w2 w7}
and the query q: w4 w3 w7. If the following table gives the tf and idf score of each term, compute the score
of each document against the given query q (assume the query uses simple tf_q scores), using the cosine of the
angle between the query vector and each document vector. Give the vector representations of the document
vectors and the query. Also produce the ranking of the documents against this query. [10]

Word tf-d1 tf-d2 tf-d3 idf


W1 0.34 0.17 0.12 0.14
W2 0.12 0.29 0.19 0.38
W3 0.23 0.33 0.14 0.51
W4 0.26 0.28 0.22 0.24
W5 0.15 0.66 0.15 0.60
W6 0.31 0.22 0.16 0.32
W7 0.23 0.45 0.21 0.15

First we obtain the document vectors with tf*idf weighting, as below:

d1 = <0.04, 0.04, 0.11, 0.06, 0.09, 0.09, 0.03>
d2 = <0.02, 0.11, 0.16, 0.06, 0.39, 0.07, 0.06>
d3 = <0.01, 0.07, 0.07, 0.05, 0.09, 0.05, 0.03>
q  = <0, 0, 1, 1, 0, 0, 1> (no tf*idf weighting for the query)

Cosine(d1,q) = 0.213
Cosine(d2,q) = 0.303
Cosine(d3,q) = 0.155

Ranking: d2, d1, d3

Evaluation in IR / Relevance Feedback

Question No. 2 [Time: 30 Min] [Marks: 20]

a. Why are Precision and Recall together not a very good evaluation scheme for IR? Justify from an
IR system-development perspective. [5]

An information retrieval system can trivially be tuned for either a high precision or a high recall value: if the system always returns only one relevant document, its precision will be 1; on the other hand, if the system returns all documents for every query, its recall will be 1. Hence neither precision nor recall on its own is a good measure to report for an information retrieval system.

b. What is a Break-Even Point in IR evaluation? Must there always be a break-even point between
precision and recall? Either show there must be or give a counter-example. How is the break-even point
related to the value of F1? [5]

In a system, Precision (P) = Recall (R) if and only if False Positives (FP) = False Negatives (FN) or
True Positives (TP) = 0. In a ranked retrieval system, if the highest-ranked document is not relevant,
then TP = 0 and that is a trivial break-even point. If that is not the case, then the number of FP
increases as you go down the list while the number of FN decreases; at the start of the list FP < FN
and at the end of the list FP > FN. Thus there has to be a break-even point in the ranked list. At the
break-even point, F1 = P = R.
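To illustrate on a small hypothetical ranked list (the relevance flags and the collection-wide relevant count below are made up), a linear scan finds the break-even rank; note that precision equals recall exactly when the rank equals the total number of relevant documents:

    # 1 = relevant, 0 = non-relevant, in rank order (hypothetical run)
    ranking = [1, 0, 1, 1, 0, 0, 1, 0]
    total_relevant = 5  # assumed collection-wide count

    hits = 0
    for k, rel in enumerate(ranking, start=1):
        hits += rel
        precision, recall = hits / k, hits / total_relevant
        if hits and precision == recall:
            print(f"break-even at rank {k}: P = R = {precision}")  # rank 5, 0.6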

c. The following list of Rs and Ns represents relevant (R) and non-relevant (N) returned documents
in a ranked list of 12 documents retrieved in response to a query from a collection of 1000
documents. The top of the ranked list (the document the system thinks is most likely to be
relevant) is on the left of the list. This list shows 5 relevant documents. Assume that there are 8
relevant documents in total in the collection. [10]

RRNNRNNNRNNR

1) What is the precision of the system on the top 12?

From the given information, we can see that tp = 5, fp = 7 and fn = 8 - 5 = 3 (the collection contains
8 relevant documents), so for precision we have Precision = tp / (tp + fp) = 5/12 ≈ 0.417

2) What is the F1 on the top 12?

Let's find recall for F1. We know Recall = tp / (tp + fn) = 5/8 = 0.625, hence
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 = (2 × 0.417 × 0.625) / (0.417 + 0.625) = 0.521 / 1.042 = 0.5
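A minimal sketch of this computation, with the relevance flags taken from the list above and 8 relevant documents collection-wide:

    ranking = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]  # RRNNRNNNRNNR
    total_relevant = 8

    tp = sum(ranking)                    # 5
    fp = len(ranking) - tp               # 7
    fn = total_relevant - tp             # 3
    precision = tp / (tp + fp)           # 5/12 ≈ 0.417
    recall = tp / (tp + fn)              # 5/8 = 0.625
    f1 = 2 * precision * recall / (precision + recall)  # 0.5
    print(precision, recall, f1)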

3) Assume that these 12 documents are the complete result set of the system. What is the MAP
for the query?

The MAP for the current ranking produced by the system (averaging over the 5 relevant documents retrieved) is:
MAP = 1/5 × (1/1 + 2/2 + 3/5 + 4/9 + 5/12) = 0.6922

4) What is the largest possible MAP that this system could have?

The maximum MAP is possible when the remaining three relevant documents are retrieved
immediately after these 12 documents.
MAP = 1/8 × (1/1 + 2/2 + 3/5 + 4/9 + 5/12 + 6/13 + 7/14 + 8/15) = 0.619

5) What is the smallest possible MAP that this system could have?

The minimum MAP is possible when the remaining three relevant documents are found as the very last
documents in the collection (ranks 998, 999 and 1000); the precisions at the five relevant documents
already ranked stay fixed.

MAP = 1/8 × (1/1 + 2/2 + 3/5 + 4/9 + 5/12 + 6/998 + 7/999 + 8/1000) = 0.435
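A small sketch of the average-precision computation used in parts 3–5; the num_relevant parameter switches between normalizing by the relevant documents actually retrieved (part 3) and by the collection-wide count of 8 (parts 4 and 5):

    def average_precision(ranking, num_relevant=None):
        # ranking: 0/1 relevance flags in rank order.
        # num_relevant: collection-wide relevant count; if None,
        # normalize by the relevant documents actually retrieved.
        hits, precisions = 0, []
        for rank, rel in enumerate(ranking, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / (num_relevant or hits)

    run = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
    print(round(average_precision(run), 4))  # 0.6922, as in part 3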

Text Classification

Question No. 3 [Time: 30 Min] [Marks: 15]

Consider the following examples for the task of text classification (the training and test documents below are reconstructed from the term vectors used in the solution): [5+5+5]

docID   terms in document         class
1       taipei taiwan             c
2       macao shanghai taiwan     c
3       japan sapporo             ~c
4       osaka sapporo taiwan      ~c
5       taiwan taiwan sapporo     ?

i. Using k-Nearest Neighbors (KNN) with k = 3, identify the class of the test instance docID = 5.

Dictionary order = <japan, macao, osaka, sapporo, shanghai, taipei, taiwan>

d1 = <0,0,0,0,0,1,1>   d2 = <0,1,0,0,1,0,1>   d3 = <1,0,0,1,0,0,0>   d4 = <0,0,1,1,0,0,1>
d5 = <0,0,0,1,0,0,2>

Using squared Euclidean distance:
distance |d1,d5| = 3   distance |d2,d5| = 4   distance |d3,d5| = 5   distance |d4,d5| = 2

d5's closest neighbours are d4, d1 and d2; since two of the three (d1 and d2) belong to class c, d5 belongs to class c.
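A sketch of this 3-NN decision, using squared Euclidean distance over the count vectors above:

    train = {
        "d1": ([0, 0, 0, 0, 0, 1, 1], "c"),
        "d2": ([0, 1, 0, 0, 1, 0, 1], "c"),
        "d3": ([1, 0, 0, 1, 0, 0, 0], "~c"),
        "d4": ([0, 0, 1, 1, 0, 0, 1], "~c"),
    }
    d5 = [0, 0, 0, 1, 0, 0, 2]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # The 3 nearest neighbours of d5, closest first.
    neighbours = sorted(train.items(), key=lambda kv: sq_dist(kv[1][0], d5))[:3]
    votes = [label for _, (_, label) in neighbours]
    print([name for name, _ in neighbours])    # ['d4', 'd1', 'd2']
    print(max(set(votes), key=votes.count))    # 'c' by a 2-1 majority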

ii. Using Rocchio's algorithm, classify the test instance docID = 5.

Consider again

d1 = <0,0,0,0,0,1,1>   d2 = <0,1,0,0,1,0,1>   d3 = <1,0,0,1,0,0,0>   d4 = <0,0,1,1,0,0,1>
d5 = <0,0,0,1,0,0,2>

Centroid of class c  = ½ (d1 + d2) = <0, 1/2, 0, 0, 1/2, 1/2, 1>
Centroid of class ~c = ½ (d3 + d4) = <1/2, 0, 1/2, 1, 0, 0, 1/2>

distance |d5, c| = 2
distance |d5, ~c| = 2.75, hence d5 belongs to class c.

iii. Using Multinomial Naïve Bayes, estimate the probabilities of each term (feature) that
you use to classify the test instance docID = 5.
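The hand-worked solution for this part is not present in the extracted text. As a sketch, multinomial Naïve Bayes with add-one (Laplace) smoothing over the reconstructed documents above works out as follows; the priors are P(c) = P(~c) = 1/2, each class contains 5 training tokens, and the vocabulary has 7 terms:

    from collections import Counter

    # Training documents reconstructed from the term vectors above.
    train = [
        (["taipei", "taiwan"], "c"),             # d1
        (["macao", "shanghai", "taiwan"], "c"),  # d2
        (["japan", "sapporo"], "~c"),            # d3
        (["osaka", "sapporo", "taiwan"], "~c"),  # d4
    ]
    test = ["taiwan", "taiwan", "sapporo"]       # d5

    vocab = {t for doc, _ in train for t in doc}   # 7 terms
    scores = {}
    for cls in {label for _, label in train}:
        docs = [doc for doc, label in train if label == cls]
        prior = len(docs) / len(train)             # 1/2 for each class
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())               # 5 tokens per class
        score = prior
        for t in test:
            # Add-one smoothing: P(t|cls) = (count + 1) / (total + |V|)
            score *= (counts[t] + 1) / (total + len(vocab))
        scores[cls] = score
    print(scores)
    print(max(scores, key=scores.get))

The smoothed estimates are P(taiwan|c) = 3/12, P(sapporo|c) = 1/12, P(taiwan|~c) = 2/12 and P(sapporo|~c) = 3/12, giving 1/2 · (3/12)² · (1/12) ≈ 0.0026 for c and 1/2 · (2/12)² · (3/12) ≈ 0.0035 for ~c; under this reconstruction, Naïve Bayes narrowly favours class ~c.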

Text Clustering

Question No. 4 [Time: 30 Min] [Marks: 15]

a. Consider a collection of overly simplified documents d1(1,4); d2(2,4); d3(4,4); d4(1,1); d5(2,1)
and d6(4,1). Apply the k-means algorithm using seeds d2 and d5. What are the resultant clusters?
What is the time complexity? How do we know whether this result is optimal? [5]

Let C1=d2 and C2 = d5


Starting with C1 and C2 as the initial centroids, we decide the cluster membership of each document:
For d1: Dist(C1, d1) < Dist(C2, d1) => d1 belongs to C1.
For d3: Dist(C1, d3) < Dist(C2, d3) => d3 belongs to C1.
For d4: Dist(C1, d4) > Dist(C2, d4) => d4 belongs to C2.
For d6: Dist(C1, d6) > Dist(C2, d6) => d6 belongs to C2.
Hence d1, d2 and d3 are in cluster C1, and d4, d5 and d6 are in cluster C2. Recomputing the
centroids leaves these memberships unchanged, so the algorithm has converged.

K-means converges to a local minimum; to know whether the clustering is optimal, one would
need to enumerate all possible clustering arrangements. The running time of k-means is
O(n·k·d) per iteration, where n is the number of documents, k is the number of clusters and
d is the number of features.
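A minimal k-means sketch over these six 2-D points, using the given seeds d2 = (2, 4) and d5 = (2, 1); it reproduces the two clusters derived above:

    def kmeans(points, seeds, iters=10):
        centroids = [list(s) for s in seeds]
        for _ in range(iters):
            # Assign each point to its nearest centroid (squared Euclidean).
            clusters = [[] for _ in centroids]
            for p in points:
                dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
                clusters[dists.index(min(dists))].append(p)
            # Recompute each centroid as the mean of its cluster.
            centroids = [
                [sum(xs) / len(cl) for xs in zip(*cl)] if cl else c
                for cl, c in zip(clusters, centroids)
            ]
        return clusters

    docs = [(1, 4), (2, 4), (4, 4), (1, 1), (2, 1), (4, 1)]
    print(kmeans(docs, seeds=[(2, 4), (2, 1)]))
    # [[(1, 4), (2, 4), (4, 4)], [(1, 1), (2, 1), (4, 1)]]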
b. Consider a collection of overly simplified documents d1(1,4); d2(2,4); d3(4,4); d4(1,1); d5(2,1)
and d6(4,1). Apply HAC using single link. What are the resultant clusters? What is the time
complexity? How do we know whether this result is optimal? [5]

Single-link HAC produces a hierarchy of nested clusters: d1 and d2 merge at distance 1, as do
d4 and d5; d3 then joins {d1, d2} and d6 joins {d4, d5} at distance 2; finally, the two groups
{d1, d2, d3} and {d4, d5, d6} merge at distance 3. The time complexity is O(n³) for a naïve
implementation; if carefully implemented, it can be kept quadratic. HAC cannot guarantee the
optimal solution.
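A minimal sketch, assuming SciPy is available, that reproduces this single-link dendrogram:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # The six documents as 2-D points d1..d6.
    pts = np.array([[1, 4], [2, 4], [4, 4], [1, 1], [2, 1], [4, 1]], dtype=float)

    # Single-link HAC; each row of Z records one merge:
    # (cluster_i, cluster_j, merge distance, size of the new cluster).
    Z = linkage(pts, method="single")
    print(Z)  # merges at distances 1, 1, 2, 2, 3, as described above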
c. Would you expect the same results in part (a) and part (b) of this question? Why are these results
different (if they are)? What do they represent among the possible clustering arrangements? [5]

No. The results may well differ, since one method produces a flat partition and the other a
hierarchical clustering. Each result represents only a fraction of the enormous number of
possible clustering arrangements (for hierarchies, counted by the Catalan numbers).

Web Search & Crawler

Question No. 5 [Time: 30 Min] [Marks: 15]

a. What are the different types of user queries on the web? Give an example of each type of query
(Note: other than the textbook examples). [5]

Informational queries seek general information on a broad topic, such as leukemia or
Provence. There is typically not a single web page that contains all the information sought;
indeed, users with informational queries typically try to assimilate information from multiple
web pages. Query: what is an open wound?

Navigational queries seek the website or home page of a single entity that the user has in mind,
say Lufthansa airlines. In such cases, the user’s expectation is that the very first search result
should be the home page of Lufthansa. Query: Home page setting

A transactional query is one that is a prelude to the user performing a transaction on the Web,
such as purchasing a product, downloading a file or making a reservation. In such cases, the
search engine should return results listing services that provide form interfaces for such
transactions. Query: purchase air ticket for Air Blue

b. Identify why these properties are essential (must / should) for a web crawler. Give one problem
and one solution for each: [5]
i. Politeness

Politeness is a must-have property for modern crawlers. Web servers have both implicit and explicit
policies regulating the rate at which a crawler can visit them, and these politeness policies must
be respected. The problem is that a crawler that does not obey these policies causes many stalls
and further delays the crawl process; a solution is to maintain a per-host queue and enforce a
minimum time gap between successive requests to the same host, honouring robots.txt.
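For instance (the host and user-agent strings below are hypothetical), a crawler can honour a host's robots.txt rules and requested crawl delay with the Python standard library:

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # hypothetical host
    rp.read()

    url = "https://example.com/some/page.html"
    if rp.can_fetch("MyCrawler", url):
        # Respect the host's requested delay (default to 1 s if none given).
        time.sleep(rp.crawl_delay("MyCrawler") or 1.0)
        # ... fetch the page here ...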

ii. Freshness

Freshness is a desirable property of modern crawlers. In many applications, the crawler should
operate in continuous mode: it should obtain fresh copies of previously fetched pages. A
search engine crawler, for instance, can thus ensure that the search engine's index contains a
fairly current representation of each indexed web page. For such continuous crawling, a
crawler should be able to crawl a page with a frequency that approximates the rate of change
of that page; otherwise, it would not be able to get the updated page for indexing.

iii. Extensible

Modern crawlers should be extensible: they should be designed to cope with new data formats,
new fetch protocols, and so on. This demands that the crawler architecture be modular and
extensible.

c. Why is it better to partition hosts (rather than individual URLs) between the nodes of a distributed
crawl system? Suggest an architecture for handling both URLs and hosts in a crawler (draw the
diagram and explain its working). [5]

The crawler must establish a connection to a host in order to fetch the required web pages. It is
efficient to fetch all of a host's pages once the connection to that host has been established; hence
sorting URLs by host, and maintaining a per-host queue of the pages that still need to be crawled,
saves a good amount of time. Partitioning by host also lets each node enforce politeness for its own
hosts locally. The URL frontier maintains such an ordering.
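A minimal sketch of host-based partitioning (the function name and node count are illustrative): hashing the host rather than the full URL sends every URL of a host to the same crawler node, so that node can reuse connections and enforce per-host politeness locally.

    import hashlib
    from urllib.parse import urlparse

    def assign_node(url: str, num_nodes: int) -> int:
        # Partition by host, not by full URL: all URLs from the same
        # host map to the same node.
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_nodes

    print(assign_node("https://example.com/a.html", 4))  # same node as ...
    print(assign_node("https://example.com/b.html", 4))  # ... this one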

Link Analysis

Question No. 6 [Time: 30 Min] [Marks: 15]

a. Consider a subgraph of the web represented by a collection of 6 pages (namely A, B, C, D, E, and F);
first draw a pictorial representation of this graph along with its adjacency matrix. Is it always possible
to follow directed edges (hyperlinks) in the given web graph from any node (web page) to any
other, as in the Bow-Tie model? Justify. [5]

Page A is connected to B, E and F; Page B is connected to A, C, D and E; Page D is connected to F;
Page E is connected to A and C; Pages C and F have no outgoing links.

No, it is not always possible. Pages C and F have no outgoing links, so no other page can be
reached from them; the graph is not strongly connected. This is consistent with the Bow-Tie model,
in which mutual reachability holds only within the strongly connected core (SCC), while pages such
as C and F behave like the OUT component: they can be reached but lead nowhere.

b. Using the adjacency matrix from part (a), and using a and h as column vectors for the authority
and hub scores, apply the HITS algorithm for two iterations to identify at least one hub and one
authority page from the collection. [5]

Let A be the connectivity matrix for the given graph from part (a). We know that

h1 = A·a0 and a1 = AT·h0; similarly, h2 = A·a1 and a2 = AT·h1

A =
0 1 0 0 1 1
1 0 1 1 1 0
0 0 0 0 0 0
0 0 0 0 0 1
1 0 1 0 0 0
0 0 0 0 0 0

a0 = <1, 1, 1, 1, 1, 1>

AT =
0 1 0 0 1 0
1 0 0 0 0 0
0 1 0 0 1 0
0 1 0 0 0 0
1 1 0 0 0 0
1 0 0 1 0 0

h0 = <1, 1, 1, 1, 1, 1>

a2 = < 5/70,6/70,0,0,3/70,0> h2 = < 3/41,2/41, 3/41, 1/41,3/41, 3/41>


Best Hub is B; Best Authority is A
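A quick cross-check of these two HITS iterations (the usual per-step normalization is omitted, since it does not change which pages rank highest):

    import numpy as np

    # Adjacency matrix from part (a): rows = outgoing links of A..F.
    A = np.array([
        [0, 1, 0, 0, 1, 1],
        [1, 0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ], dtype=float)

    h = np.ones(6)  # h0
    a = np.ones(6)  # a0
    for _ in range(2):
        h, a = A @ a, A.T @ h  # hk = A·a(k-1), ak = AT·h(k-1)

    pages = "ABCDEF"
    print(pages[int(np.argmax(h))])  # best hub: B
    print(pages[int(np.argmax(a))])  # best authority: E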
c. Using the adjacency matrix from part (a), assume that the PageRank value of every page pi at
iteration 0 is PR(pi) = 1 and that the damping factor is d = 0.85. Perform the
PageRank algorithm and determine the rank of every page after 2 iterations. [5]

Adjacency matrix:

A =
0 1 0 0 1 1
1 0 1 1 1 0
0 0 0 0 0 0
0 0 0 0 0 1
1 0 1 0 0 0
0 0 0 0 0 0

Row-normalized transition matrix (damping factor d = 0.85):

0    1/3  0    0    1/3  1/3
1/4  0    1/4  1/4  1/4  0
0    0    0    0    0    0
0    0    0    0    0    1
1/2  0    1/2  0    0    0
0    0    0    0    0    0

      Iteration 1:   Iteration 2:
A     0.84           0.51
B     0.85           0.63
C     0              0
D     0.85           0
E     0.85           0.42
F     0              0

Hence A = 0.51, B = 0.63, C = 0, D = 0, E = 0.42, F = 0
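For comparison, a minimal sketch of the standard damped PageRank update PR(p) = (1 - d)/N + d · Σ_{q→p} PR(q)/L(q); dangling pages (C and F) are simply allowed to leak rank mass here, so the resulting numbers will differ somewhat from the hand-worked table above:

    import numpy as np

    # Out-links from part (a): pages A..F are indices 0..5.
    out_links = {0: [1, 4, 5], 1: [0, 2, 3, 4], 2: [], 3: [5], 4: [0, 2], 5: []}
    N, d = 6, 0.85

    pr = np.ones(N)  # PR(pi) = 1 at iteration 0, as stated in the question
    for _ in range(2):
        nxt = np.full(N, (1 - d) / N)  # teleportation term
        for page, outs in out_links.items():
            for target in outs:
                nxt[target] += d * pr[page] / len(outs)
        pr = nxt

    for page, score in zip("ABCDEF", pr):
        print(page, round(float(score), 3))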

Closing Remarks:

You need to prepare a PDF file of all the questions, in the order given. The orientation should
be portrait for each page. All text written on each page should be clearly visible. You are
supposed to upload it on Slate as an assignment submission. You have a good 30 minutes for this.
Wish you all the best. <The End>

