<date>
7:30 to 9:00 p.m.
1) Answer all questions.
2) Write your answers in an examination book. Write your NetID on the front of each book.
3) This is an open book examination. You may bring any notes, books, etc. to the examination.
4) Laptop computers may be used to store course material or as calculators, but for no other purposes. In particular, it is strictly forbidden to use them for any form of communication. No electronic devices other than a laptop or a calculator may be used during the examination.
(a) Define the terms inverted file, inverted list, posting.
(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?
(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.
(i) What file structure(s) would you use?
(ii) How well does your design satisfy the criteria listed in Part (b)?
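As an illustration of the vocabulary in Part (a), the following is a minimal in-memory sketch, not a model answer to Part (c). It assumes a posting records one occurrence of a term as a (document id, word position) pair, that the inverted list for a term is the list of all its postings, and that the inverted file is the mapping from every term to its inverted list. The function names and the two toy documents are hypothetical, and a design for a very large, growing collection would need disk-based structures that this sketch does not address.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to its inverted list of (doc_id, position) postings."""
    inverted = defaultdict(list)
    # Visit documents in sorted order so every inverted list is ordered by doc_id.
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].split()):
            inverted[term].append((doc_id, position))  # one posting per occurrence
    return dict(inverted)


def boolean_and(inverted, term_a, term_b):
    """Return the ids of documents that contain both terms."""
    docs_a = {doc_id for doc_id, _ in inverted.get(term_a, [])}
    docs_b = {doc_id for doc_id, _ in inverted.get(term_b, [])}
    return sorted(docs_a & docs_b)


# Hypothetical toy collection, used only to exercise the structure.
docs = {"D1": "alpha bravo charlie", "D2": "bravo delta bravo"}
inverted_file = build_inverted_file(docs)
```

Here `inverted_file["bravo"]` is the inverted list for the term bravo, and `boolean_and(inverted_file, "alpha", "bravo")` evaluates the Boolean query alpha AND bravo.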
(a) Explain how vector space concepts can be used to calculate the similarity between two documents.
(b) You have a collection of documents that contain the following index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.
(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
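The two calculations in Part (b) can be sketched in code. This is one possible reading, not the only acceptable one: it assumes similarity is the cosine of the angle between document vectors, and it takes the weight in (ii) to be the term frequency divided by the document frequency (a logarithmic inverse document frequency would be an equally reasonable choice). All function and variable names are illustrative.

```python
import math
from collections import Counter

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return {(doc_a, doc_b): cosine similarity} for every ordered pair."""
    names = sorted(vectors)
    return {(a, b): cosine(vectors[a], vectors[b]) for a in names for b in names}


# Term frequency: number of times each term occurs in each document.
tf = {name: Counter(text.split()) for name, text in DOCS.items()}
# Document frequency: number of documents in which each term occurs.
df = Counter(term for counts in tf.values() for term in counts)

# (i) Incidence vectors: 1 if the term occurs in the document, no weighting.
incidence = {name: {term: 1.0 for term in counts} for name, counts in tf.items()}
# (ii) Assumed weighting: term frequency divided by document frequency.
weighted = {
    name: {term: count / df[term] for term, count in counts.items()}
    for name, counts in tf.items()
}

incidence_sim = similarity_matrix(incidence)
weighted_sim = similarity_matrix(weighted)
```

Both matrices are symmetric with ones on the diagonal, and D2 and D3 have similarity zero under either scheme because they share no terms.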
(a) Define the terms recall and precision.
(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.
(i) How in a practical experiment would you calculate the precision?
(ii) How in a practical experiment would you calculate the recall?
(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.
(i) What is the precision?
(ii) What is the recall?
(d) Explain in general terms the method used by TREC to estimate the recall.
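The arithmetic in Part (c) can be checked with a short sketch. It assumes the usual definitions: precision is the fraction of returned documents that are relevant, and recall is the fraction of all relevant documents in the collection that are returned. The function and parameter names are illustrative.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved, relevant_in_collection):
    """Fraction of the relevant documents in the collection that were retrieved."""
    return relevant_retrieved / relevant_in_collection


# Figures from Part (c): 200 documents returned, 50 of them relevant,
# and 100 relevant documents in the whole collection D.
p = precision(relevant_retrieved=50, retrieved=200)
r = recall(relevant_retrieved=50, relevant_in_collection=100)
print(f"precision = {p}, recall = {r}")  # precision = 0.25, recall = 0.5
```

The size of the collection (1,000,000 documents) does not enter either calculation; only the number of relevant documents in it does, which is why recall is the harder quantity to measure in practice.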