Wednesday March 7, 2001
7:30 to 9:00 p.m.
1) Answer all questions.
2) Write your answers in an examination book. WRITE YOUR NETID ON THE FRONT OF EACH BOOK.
3) This is an open book examination.
(a) Define the terms inverted file, inverted list, posting.
(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?
(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.
     (i) What file structure(s) would you use?
     (ii) How well does your design satisfy the criteria listed in Part (b)?
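As an illustration of the terms in Part (a), here is a minimal in-memory sketch, not a model answer and not a design for a very large collection. The three sample documents, the function names `build_inverted_file` and `boolean_and`, and the choice to store only a document number in each posting are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample collection: document number -> text.
documents = {
    1: "information retrieval systems",
    2: "retrieval of textual documents",
    3: "information about documents",
}

def build_inverted_file(documents):
    """Map each term to its inverted list (sorted document numbers).

    Each entry in a list is a posting; here a posting is just a
    document number, although real systems often store positions
    or frequencies as well.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so that a term repeated in a document yields one posting
        for term in set(documents[doc_id].split()):
            inverted[term].append(doc_id)
    return dict(inverted)

def boolean_and(inverted, term_a, term_b):
    """Answer 'term_a AND term_b' by merging two sorted inverted lists."""
    a = inverted.get(term_a, [])
    b = inverted.get(term_b, [])
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

inverted_file = build_inverted_file(documents)
print(inverted_file["retrieval"])
print(boolean_and(inverted_file, "information", "documents"))
```

Keeping each inverted list sorted by document number is what allows the linear-time merge in `boolean_and`; a design for Part (c) would also have to consider disk layout and how lists grow as documents are added.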
(a) Explain how vector space concepts can be used to calculate the similarity between two documents.
(b) You have a collection of documents that contain the following index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.
(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
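The calculation in Part (b) can be sketched as follows. This is one possible reading of the question, not a model answer: it assumes cosine similarity as the vector space measure, and it assumes the weight in (ii) is simply term frequency divided by document frequency (other weightings, such as one using a logarithm of the inverse document frequency, would also fit the wording). The names `cosine` and `similarity_matrix` are illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

names = list(docs)
counts = [Counter(docs[name].split()) for name in names]
terms = sorted(set().union(*counts))

# Document frequency: number of documents in which each term occurs.
df = {t: sum(1 for c in counts if t in c) for t in terms}

# (i) Incidence matrix: 1 if the term occurs in the document, else 0.
incidence = [[1.0 if t in c else 0.0 for t in terms] for c in counts]

# (ii) Frequency matrix with the assumed weight tf / df.
weighted = [[c[t] / df[t] for t in terms] for c in counts]

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(rows):
    """Pairwise cosine similarities between all document vectors."""
    return [[cosine(u, v) for v in rows] for u in rows]

sim_incidence = similarity_matrix(incidence)
sim_weighted = similarity_matrix(weighted)

for label, matrix in (("incidence", sim_incidence), ("tf/df", sim_weighted)):
    print(label)
    for name, row in zip(names, matrix):
        print(name, " ".join(f"{value:.3f}" for value in row))
```

Under either weighting, D2 and D3 share no terms, so their similarity is zero; the weighting changes only the similarities between documents that overlap.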
(a) Define the terms recall and precision.
(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.
     (i) How in a practical experiment would you calculate the precision?
     (ii) How in a practical experiment would you calculate the recall?
(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.
     (i) What is the precision?
     (ii) What is the recall?
(d) Explain in general terms the method used by TREC to estimate the recall.
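The arithmetic in Part (c) can be checked with a short sketch. It assumes the usual definitions, precision as the fraction of returned documents that are relevant and recall as the fraction of relevant documents that are returned; the function and variable names are illustrative.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved

def recall(relevant_retrieved, relevant_in_collection):
    """Fraction of the relevant documents that were retrieved."""
    return relevant_retrieved / relevant_in_collection

# Figures taken from Part (c).
retrieved = 200               # documents returned for Q
relevant_in_collection = 100  # documents in D relevant to Q
relevant_retrieved = 50       # returned documents that are relevant

p = precision(relevant_retrieved, retrieved)
r = recall(relevant_retrieved, relevant_in_collection)
print(f"precision = {p}, recall = {r}")  # precision = 0.25, recall = 0.5
```

Note that the size of the collection (1,000,000 documents) does not enter either formula; it matters only because it makes the number of relevant documents impractical to determine by exhaustive inspection.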
Here is a Dublin Core metadata record:
Title                         Gore/Lieberman 2000
Title.alternative             Welcome to the Gore-Lieberman 2000 official campaign Web site
Title.alternative             Gore 200
Title.alternative             Viva Gore Lieberman 2000
Identifier.LCCN               00530047
Identifier.URI                http://www.algore2000.com/
Type.OCLCg                    Computer file
Type.AACR2g-gmd               [computer file]
Contributor.nameCorporate     Gore/Lieberman, Inc.
Coverage.spatial.MARC21-gac   n-us---
Date.issued.MARC21-Date       2000-9999
Description.note              Title from home page as viewed on Nov. 1, 2000.
Description.summary           Presents information on U.S. Vice President Albert Arnold Gore, Jr. (b. 1948) and his presidential campaign, provided by Gore 2000, Inc.
Language.ISO639-2             eng
Language.ISO639-2             engspa
Language                      In English and Spanish
Publisher                     Gore/Lieberman,
Publisher.place               Nashville, Tenn. :
Relation.requires             Mode of access: World Wide Web
Subject.class.LCC             E840.8.G65
Subject.class.DDC             324.973
Subject.namePersonal.LCSH     Gore, Albert, 1948-
Subject.topical.LCSH          Vice-Presidents -- United States -- Biography.
Subject.topical.LCSH          Presidential candidates -- United States -- Biography.
Subject.topical.LCSH          Presidents -- United States -- Election -- 2000.
Subject.topical.LCSH          Political campaigns -- United States.
(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?
(b) The metadata in the fields Publisher and Publisher.place end in punctuation marks. Can you suggest any reasons for doing so?
(c) This record has no Creator field. It has a Contributor.nameCorporate field with value "Gore/Lieberman, Inc." Do you consider this to be a correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?