CS 430 Information Discovery

Midterm Examination

Wednesday March 7, 2001

7:30 to 9:00 p.m.

Instructions

1) Answer all questions.

2) Write your answers in an examination book. WRITE YOUR NETID ON THE FRONT OF EACH BOOK.

3) This is an open book examination.

Question 1

(a) Define the terms inverted file, inverted list, posting.

(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?

(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.

    (i) What file structure(s) would you use?

    (ii) How well does your design satisfy the criteria listed in Part (b)?
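For orientation only, the terms in Part (a) can be illustrated with a minimal in-memory sketch. It assumes whitespace tokenization and a posting that holds just a document identifier and a term frequency. The names `build_inverted_file`, `boolean_and`, and `sample_docs` are hypothetical, and the sketch does not address the large-scale or update criteria asked about in Parts (b) and (c).

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Return a dict mapping each term to its inverted list.

    Each inverted list is a list of postings sorted by document id;
    a posting here is a (doc_id, term_frequency) pair.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def boolean_and(index, term_a, term_b):
    """Return the sorted ids of documents containing both terms."""
    ids_a = {doc_id for doc_id, _ in index.get(term_a, [])}
    ids_b = {doc_id for doc_id, _ in index.get(term_b, [])}
    return sorted(ids_a & ids_b)


# Hypothetical three-document collection, for illustration only.
sample_docs = {
    1: "alpha bravo alpha",
    2: "bravo charlie",
    3: "alpha charlie delta",
}
inverted = build_inverted_file(sample_docs)
print(inverted["alpha"])                          # [(1, 2), (3, 1)]
print(boolean_and(inverted, "alpha", "charlie"))  # [3]
```

A production design would keep the inverted lists on disk behind an index of terms; this dictionary stands in for both.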

Question 2

(a) Explain how vector space concepts can be used to calculate the similarity between two documents.

(b) You have a collection of documents that contain the following index terms:

D₁: alpha bravo charlie delta echo foxtrot golf

D₂: golf golf golf delta alpha

D₃: bravo charlie bravo echo foxtrot bravo

D₄: foxtrot alpha alpha golf golf delta

(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
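The two calculations in Part (b) could be checked with a short sketch like the one below. It makes two assumptions that the question does not state: similarity is taken to be the cosine of the angle between document vectors, and the weight in (ii) is taken to be term frequency divided by document frequency. Other readings (for example, a raw dot product or a logarithmic inverse document frequency) would give different numbers. The helper names are hypothetical.

```python
import math

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf".split(),
    "D2": "golf golf golf delta alpha".split(),
    "D3": "bravo charlie bravo echo foxtrot bravo".split(),
    "D4": "foxtrot alpha alpha golf golf delta".split(),
}
terms = sorted({t for words in docs.values() for t in words})


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Return {(doc_a, doc_b): cosine similarity} for every pair of documents."""
    return {(a, b): cosine(vectors[a], vectors[b]) for a in vectors for b in vectors}


# (i) Incidence matrix: 1 if the term occurs in the document, else 0.
incidence = {d: [1 if t in words else 0 for t in terms] for d, words in docs.items()}

# (ii) Assumed weighting: term frequency / document frequency.
doc_freq = {t: sum(1 for words in docs.values() if t in words) for t in terms}
weighted = {d: [words.count(t) / doc_freq[t] for t in terms] for d, words in docs.items()}

sim_incidence = similarity_matrix(incidence)
sim_weighted = similarity_matrix(weighted)

for name, sim in (("incidence", sim_incidence), ("weighted", sim_weighted)):
    print(name)
    for a in docs:
        print(a, " ".join(f"{sim[(a, b)]:.3f}" for b in docs))
```

Under these assumptions D₂ and D₃ share no terms, so their similarity is 0 in both matrices, while the unweighted similarity of D₃ and D₄ is 1/4 because they share only foxtrot.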

Question 3

(a) Define the terms recall and precision.

(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.

(i) How in a practical experiment would you calculate the precision?

(ii) How in a practical experiment would you calculate the recall?

(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.

    (i) What is the precision?

    (ii) What is the recall?
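The arithmetic in Part (c) follows directly from the standard definitions: precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that are returned. A minimal sketch, using hypothetical function names and the figures given in Part (c):

```python
def precision(relevant_returned, total_returned):
    """Fraction of the returned documents that are relevant."""
    return relevant_returned / total_returned


def recall(relevant_returned, total_relevant):
    """Fraction of all relevant documents that were returned."""
    return relevant_returned / total_relevant


# Figures from Part (c): 200 returned, 50 of them relevant, 100 relevant in D.
p = precision(relevant_returned=50, total_returned=200)
r = recall(relevant_returned=50, total_relevant=100)
print(p)  # 0.25
print(r)  # 0.5
```

Note that the size of the collection (1,000,000 documents) does not enter either calculation.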

(d) Explain in general terms the method used by TREC to estimate the recall.

Question 4

Here is a Dublin Core metadata record:

Title                        Gore/Lieberman 2000
Title.alternative            Welcome to the Gore-Lieberman 2000 official campaign Web site
Title.alternative            Gore 200
Title.alternative            Viva Gore Lieberman 2000
Identifier.LCCN              00530047
Identifier.URI               http://www.algore2000.com/
Type.OCLCg                   Computer file
Type.AACR2g-gmd              [computer file]
Contributor.nameCorporate    Gore/Lieberman, Inc.
Coverage.spatial.MARC21-gac  n-us---
Date.issued.MARC21-Date      2000-9999
Description.note             Title from home page as viewed on Nov. 1, 2000.
Description.summary          Presents information on U.S. Vice President Albert Arnold Gore, Jr. (b. 1948) and his presidential campaign, provided by Gore 2000, Inc.
Language.ISO639-2            eng
Language.ISO639-2            engspa
Language                     In English and Spanish
Publisher                    Gore/Lieberman,
Publisher.place              Nashville, Tenn. :
Relation.requires            Mode of access: World Wide Web
Subject.class.LCC            E840.8.G65
Subject.class.DDC            324.973
Subject.namePersonal.LCSH    Gore, Albert, 1948-
Subject.topical.LCSH         Vice-Presidents -- United States -- Biography.
Subject.topical.LCSH         Presidential candidates -- United States -- Biography.
Subject.topical.LCSH         Presidents -- United States -- Election -- 2000.
Subject.topical.LCSH         Political campaigns -- United States.

(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?

(b) The metadata in the fields Publisher and Publisher.place end in punctuation marks. Can you suggest any reasons for doing so?

(c) This record has no Creator field. It has a Contributor.nameCorporate field with value "Gore/Lieberman, Inc." Do you consider this a correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?