Wednesday March 7, 2001
7:30 to 9:00 p.m.
1) Answer all questions.
2) Write your answers in an examination book. WRITE YOUR NETID ON THE
FRONT OF EACH BOOK.
3) This is an open book examination.
(a) Define the terms inverted file, inverted list, posting.
(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?
(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.
(i) What file structure(s) would you use?
(ii) How well does your design satisfy the criteria listed in Part (b)?
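As a study aid for the terms in Part (a), the following is a minimal sketch of an in-memory inverted file, assuming whitespace tokenization and a tiny made-up collection. The names `build_inverted_file`, `boolean_and`, and `sample_docs` are illustrative and not part of the examination. Each dictionary key is a term, its value is that term's inverted list, and each `(doc_id, term_frequency)` pair is one posting. It is not a design for Part (c), which also has to address disk storage and continual updates.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to its inverted list of (doc_id, term_frequency) postings."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.lower().split():
            counts[term][doc_id] += 1
    # Postings are kept sorted by document identifier.
    return {term: sorted(by_doc.items()) for term, by_doc in counts.items()}


def boolean_and(index, term_a, term_b):
    """Return the sorted ids of documents whose postings include both terms."""
    docs_a = {doc_id for doc_id, _ in index.get(term_a, [])}
    docs_b = {doc_id for doc_id, _ in index.get(term_b, [])}
    return sorted(docs_a & docs_b)


# Hypothetical sample collection, used only for illustration.
sample_docs = {
    "D1": "alpha bravo charlie",
    "D2": "alpha alpha delta",
    "D3": "bravo delta",
}
index = build_inverted_file(sample_docs)
print(index["alpha"])                         # [('D1', 1), ('D2', 2)]
print(boolean_and(index, "alpha", "delta"))   # ['D2']
```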
(a) Explain how vector space concepts can be used to calculate the similarity between two documents.
(b) You have a collection of four documents that contain the following index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.
(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
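One way to check hand calculations for Part (b) is sketched below. It rests on two assumptions that the question does not fix: similarity is taken to be the cosine of the angle between document vectors, and the weight in (ii) is taken literally as term frequency divided by document frequency. Other readings, such as a logarithmic inverse document frequency, would give different numbers. The function and variable names are illustrative.

```python
import math

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def term_frequencies(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    terms = sorted({t for text in docs.values() for t in text.split()})
    rows = [[text.split().count(t) for t in terms] for text in docs.values()]
    return terms, rows


def cosine_matrix(rows):
    """Cosine similarity between every pair of row vectors."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    return [
        [sum(a * b for a, b in zip(r1, r2)) / (n1 * n2)
         for r2, n2 in zip(rows, norms)]
        for r1, n1 in zip(rows, norms)
    ]


terms, freq = term_frequencies(DOCS)

# (i) Incidence matrix: 1 if the term occurs in the document, else 0.
incidence = [[1 if f else 0 for f in row] for row in freq]

# (ii) Assumed weighting: term frequency divided by document frequency.
doc_freq = [sum(row[j] for row in incidence) for j in range(len(terms))]
weighted = [[f / df for f, df in zip(row, doc_freq)] for row in freq]

unweighted_sim = cosine_matrix(incidence)
weighted_sim = cosine_matrix(weighted)

for name, sim in (("incidence", unweighted_sim), ("tf/df", weighted_sim)):
    print(name)
    for row in sim:
        print("  " + "  ".join(f"{v:.3f}" for v in row))
```

Both matrices are symmetric with ones on the diagonal, and D2 and D3 share no terms, so their similarity is zero under either weighting.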
(a) Define the terms recall and precision.
(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.
(i) How, in a practical experiment, would you calculate the precision?
(ii) How, in a practical experiment, would you calculate the recall?
(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.
(i) What is the precision?
(ii) What is the recall?
(d) Explain in general terms the method used by TREC to estimate the recall.
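The arithmetic in Part (c) follows from the standard definitions: precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that are returned. A minimal sketch, using the figures given in Part (c) and an illustrative function name `precision_recall`:

```python
def precision_recall(num_returned, num_relevant_returned, num_relevant_total):
    """Return (precision, recall) for a single query."""
    precision = num_relevant_returned / num_returned
    recall = num_relevant_returned / num_relevant_total
    return precision, recall


# Figures from Part (c): 200 returned, 50 of them relevant, 100 relevant in D.
precision, recall = precision_recall(200, 50, 100)
print(precision, recall)  # 0.25 0.5
```

The size of the collection, 1,000,000 documents, does not enter either formula.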
Here is a Dublin Core metadata record:
Title Gore/Lieberman 2000
Title.alternative Welcome to the Gore-Lieberman 2000 official campaign Web site
Title.alternative Gore 200
Title.alternative Viva Gore Lieberman 2000
Identifier.LCCN 00530047
Identifier.URI http://www.algore2000.com/
Type.OCLCg Computer file
Type.AACR2g-gmd [computer file]
Contributor.nameCorporate Gore/Lieberman, Inc.
Coverage.spatial.MARC21-gac n-us---
Date.issued.MARC21-Date 2000-9999
Description.note Title from home page as viewed on Nov. 1, 2000.
Description.summary Presents information on U.S. Vice President Albert Arnold Gore, Jr. (b. 1948) and his presidential campaign, provided by Gore 2000, Inc.
Language.ISO639-2 eng
Language.ISO639-2 engspa
Language In English and Spanish
Publisher Gore/Lieberman,
Publisher.place Nashville, Tenn. :
Relation.requires Mode of access: World Wide Web
Subject.class.LCC E840.8.G65
Subject.class.DDC 324.973
Subject.namePersonal.LCSH Gore, Albert, • 1948-
Subject.topical.LCSH Vice-Presidents • United States • Biography.
Subject.topical.LCSH Presidential candidates • United States • Biography.
Subject.topical.LCSH Presidents • United States • Election • 2000.
Subject.topical.LCSH Political campaigns • United States.
(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?
(b) The values of the fields Publisher and Publisher.place end in punctuation marks. Can you suggest any reasons for this?
(c) This record has no Creator field. It has a Contributor.nameCorporate field with the value "Gore/Lieberman, Inc." Do you consider this a correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?
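For Part (a), the mechanical side of dumbing-down can be sketched as follows: every qualifier is discarded and each value is kept under its unqualified element name. This assumes the record's layout, in which qualifiers follow the element name after a period and the value follows the first run of whitespace. The name `dumb_down` is illustrative, and only a handful of lines from the record are included.

```python
# A few lines copied from the record above.
RECORD_LINES = [
    "Title Gore/Lieberman 2000",
    "Title.alternative Viva Gore Lieberman 2000",
    "Identifier.URI http://www.algore2000.com/",
    "Language.ISO639-2 eng",
    "Publisher.place Nashville, Tenn. :",
    "Subject.class.DDC 324.973",
]


def dumb_down(lines):
    """Strip qualifiers, returning (unqualified_element, value) pairs."""
    simple = []
    for line in lines:
        name, value = line.split(None, 1)   # field name, then the value
        element = name.split(".", 1)[0]     # drop everything after the first '.'
        simple.append((element, value))
    return simple


for element, value in dumb_down(RECORD_LINES):
    print(f"{element}: {value}")
```

Reading the output line by line is one way to judge whether each value still makes sense for its element once the qualifier is gone.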