CS430/INFO430: Information Retrieval

CS 430 / INFO 430
Information Retrieval
Fall 2006

Assignments

Assignment 4: Late Submission Permitted until Sunday

Many people have asked for an extension on this assignment because of the large number of other courses that have assignments due at the same time. No penalty will be given for lateness of this assignment if it is submitted by 11:00 p.m. on Sunday, December 3.

11/29/06

General Instructions

During this course there will be several assignments which require programming. Java is strongly preferred as the programming language. If, however, you have limited Java experience C++ or Python is permitted.

Running your Programs

We will run your programs on a Windows computer, using the software environment in the CSUG lab. If you have any questions about this, please email cs430-l@cs.cornell.edu or make arrangements to meet with the Teaching Assistant.

The assignments give specific instructions that are intended to make your programs easy for us to run and grade consistently. Points will be subtracted if you do not follow these instructions.

When we run your Java programs, we run them from the command prompt using the "java ..." command. It is fine to develop your program and compile and test it in a development environment, such as Codewarrior or JCreator. But realize that it will be run from the command prompt. So, before submitting your program, test it from the command prompt.
If you are working in Java, you should submit your source Java files and your compiled .class files. Do not send .mcp files (i.e., Codewarrior project files) or any other project type files.
If you work in C or C++, send .c or .cpp and .exe files, with instructions on how to run from the command prompt. Your program should compile and run on Visual Studio .NET.

Include instructions for how to run your program in your report. Repeat all instructions separately for each assignment.

Submission Instructions

The course uses the Computer Science Course Management System (CMS) to manage assignments.

Go to http://cms2.csuglab.cornell.edu/. Login using your NetID and password.
Go to CS430 and follow the instructions for each assignment. If you do not see CS430, contact the course team immediately by email.

Assignment 1
Due: Sunday, September 17, at 11 p.m.

[Updated September 7. Changes from the preliminary version of this assignment are underlined.]

The object of this assignment is to write a suite of programs that takes raw text and creates an inverted file system.

Test collection

The test collection that you will use to test your programs has a set of files that are described on the Web page, testData.html. This Web page also describes several Excel spreadsheets that will be useful if you wish to check test results manually.

Programs

1. Inverted file setup

Write a program called invert that reads the input data and create two files, as defined in Lecture 4.

(1a) Index file (word list). A sorted list of the terms in the test collection, excluding stop words from the file stoplist.txt. The index data structure should be some form of tree that is organized for both random look-up and sequential processing (lexicographic order). The structure should permit new terms to be added without rebuilding it completely. The index should contain the document frequency for each term (i.e., the number of documents that contain the term) and a link to the list of postings for that term.
(1b) Postings file. A list of postings for each term in the word list, sorted in order of documents and location within document. For each posting, the file should include the term frequency (i.e., the number of times that the term occurs in the document).

This program need not create the documents file, but the graders need the documents file in order to test your programs.

The program should finish by writing out these files to disk, so that they can be read by the test program. A binary dump of the data structures is adequate.

2. Test program

Write a program called test. The program begins by reading the files that were created by the Index program. It then reads as input a sequence of single terms provided by the user.

If the term is in the word file, the program should display a list of the postings for that term. For each posting it should write three lines of output:
1. The term weightings, tf, idf and tf.idf (as defined in Lecture 3), for that posting, separated by commas.
2. The document identifier and location within the document for that posting, separated by commas.
3. About ten words from the document that include the term in its context within the document.
If the term is not in the inverted file, it should display a suitable message.
The program should loop until the term ZZZ is input.

Report

Write a brief report (one to two pages) that describes the algorithms and data structures that you used for each file and Program 1. The report should include instructions on how to run your programs. The report will be about one third of the grade.

Submission

You should submit a zip archive, named a1.zip, that contains the following:

Report
Programs invert and test (source and binary). [This instruction was corrected on 9/12/06.]
A listing of the first few records of files 1a and 1b. [For each file, list out the data in the first few records, with the values in the various fields. The definitions of the fields and the data structures used to store the records should be described in the report.]

As part of the grading, we will run your programs against the test data.

Hints

Remember that you are not building a production system and the volume of test data is quite small. This suggests the following:

You can choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., you do not need to implement B trees).
Consider batch loading of data (e.g., no need to provide for incremental update).
User interface can be minimal (e.g., single letter commands).
Use standard Java (or C++ or Python) classes where available, even if their performance is slow.

Assignment 2
Due: Sunday, October 8, at 11 p.m.

[Final version.]

Frequently Asked Questions and Answers

The page Ass2FAQ.html is a list of questions that have been sent to the Teaching Assistants and their replies. If you have a question about the assignment, check this page first. If your question has not been addressed, send email to cs430-l@lists.cs.cornell.edu.

Assignment

The purpose of this assignment is to demonstrate your understanding of latent semantic indexing. For this assignment it is recommended that you write your program(s) in Java, but C++ or Python are permitted.

For the matrix calculations in Java, you can use the JAMA package http://math.nist.gov/javanumerics/jama/, which provides singular value decomposition. If you choose to write in C++ or Python you will have to find a matrix package that will do singular value decomposition.

For the assignment, you will use the set of files that were used for Assignment 1. They are described on the Web page, testData.html.

A search engine that indexes full text using latent semantic indexing

Write a program LSI that does the following:

Building the index(es)

Create the term document vector space and apply singular value decomposition to create an LSI representation.
Store the representation of the pages in the concept space.

Searching the indexes

After the indexes have been built, the user interface accepts queries from users. A query is a simple list of terms.

The program should do the following:

Transform the query vector to the LSI space.
Use a suitable similarity measure to compute and rank relevant documents.
For each hit, the program should print out a portion of the document from testData.

Run four tests of your program that demonstrate its features. For each test, state which features it is demonstrating.

Report

Provide a brief report that describes: the mathematical basis for your programs, how your program takes advantage of the capabilities of JAMA, and which features you have implemented yourself (about one or two pages). You should explicitly describe how you selected k, number of singular values chosen to represent the concepts in the set of documents.

The graders will run your programs, but they will not go into the source code and modify it. All files and any other options must be specified as input. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data file.

Submission

You should submit a zip archive, named a2.zip, that contains the following:

Program LSI
Output from your test runs
Report

You do not need to submit JAMA, but your report should state very clearly what the graders should do to build and run your programs.

Assignment 3
Due: Sunday, November 5, at 11 p.m.

Frequently Asked Questions and Answers

The page AssFAQ.html is a list of questions that have been sent to the Teaching Assistants and their replies. If you have a question about the assignment, check this page first. If your question has not been addressed, send email to cs430-l@lists.cs.cornell.edu.

Assignment

The aim of this assignment is to implement a program, explore, in Java or C++ that provides a simple information discovery system with fielded searching and basic relevance feedback. This assignment builds on Assignment 1.

The information discovery process

The information discovery process is as follows.

The user types a query to the search system.
The search system searches a catalog and provides a ranked list of all records that match the query.
The user looks at the list and selects one or more records that appear relevant.
The system creates a new query by combining terms from the selected records.
The new query is used to search the catalog and provide a new ranked list of matching records.

Each of these steps is described in more detail below.

Catalog data

The records to be indexed are contained in a file of tagged ASCII text of the following format:

The input data is delimited by a pair of tags: <metadata>, </metadata>.
Each record within the file is delimited by a pair of tags: <record>, </record>.
Each record consists of author and title fields delimited by pairs of tags: <author>, </author> or <title>,</title>

Each record begins on a new line. There may be blank lines. See the test data file for examples of this format. Your search engine does not need to handle records in any other format.

A file of test data test3.txt has been provided. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.)

Inverted file set up

When the program starts up, it reads a file in the format of the test data, and builds an inverted index, which includes a word list and postings files of the search terms. Each posting should be labeled to show whether the term is in an <author> or <title> field. It should use the same stop list as for Assignment 1, stoplist.txt.

User interface

The program has a simple user interface that allows the user to input a query, display the results, select relevant results as relevance feedback, and run the revised query. It should be possible to run several queries in a single run. This can be a very simple command line interface.

Input a query

To specify a search, the user must specify a query and a search option.

A query is a string of printable ASCII text containing one or more search terms.

Three search options are provided:

Author: Search for terms in the query that are in the <author> field
Title: Search for terms in the query that are in the <title> field
General: Search for terms in query that are in any field.

Search algorithm

The search algorithm should rank the similarity of each document to a query. Depending on the search option chosen, the similarity should be calculated using terms in the <author>, <title>, or all fields.

Relevance feedback

After a search has been carried out, the user selects one or more catalog records from the results as relevant. The user also specifies a search option for the new search (author, title or general). A new query is created by the system which combines all the terms in the appropriate field of the selected records.

Another search is carried out using this revised query and search option.

{Hint: The standard tf.idf scheme is optimized for full text documents. In class we discussed fielded searches in general but made no specific recommendation for weights. You will have to devise a reasonable scheme and explain briefly why you chose it.}

Report

Provide a brief report that describes the algorithms and data structures that you use (about one or two pages). Be sure to specify the similarity and ranking methods, and how the new query is created using relevance feedback.

The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data files.

Submission checklist

You should submit a zip archive, named a3.zip, that contains the following:

The program explore
Report

Before submitting your assignment check the following:

Does the program compile and run program run "out of the box", or does the grader need to ensure certain things are in order (e.g., put files in the right directory). If so, did you note clearly what those things are?
Did your instructions include the arguments the program requires (if any)?
Did you include default arguments/datafiles for the program (if applicable)?
Did you provide detailed instructions for running the program in your report?
Does the program use absolute paths? (The answer should be no.)

Assignment 4
Due: Friday, December 1, at 11 p.m., late submission permitted until Sunday

The purpose of this assignment is to demonstrate familiarity with concepts used in searching the web, such as the PageRank algorithm and the use of anchor text.

The assignment is divided into four separate parts. You will probably find that it is most convenient to write a single program that does all four parts, but you may submit several programs if you prefer.

1. Extracting hyperlinks from the input data

The file test/test4.txt contains a list of URLs, one per line. Each URL identifies an html page on the Information Science web site. The pages have been numbered for convenience.

You will need to write code that reads the html pages in the test data. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set.

Use the links to pages that are in the test data set to build the link matrix as defined in Lecture 16.

Extract the anchor text associated with each hyperlink and associate it with the page to which it points.

[See the file test/URLhints.html for hints on extracting hyperlinks from these web pages.]

2. PageRank

Use the PageRank algorithm to calculate a ranking of the pages. Use a damping factor of 0.8.

3. Index

For each page build a short index record that contains only:

The title of the page. (This is tagged by a pair of tags <title> and </title>.)
Anchor text from pages that link to the page being indexed.

4. Metadata

Write a file, metadata, that contains the following metadata for each page: page ID, PageRank, list of terms in anchor text from pages that link to the page.

5. Searching

Build a simple search program that receives as input a single term, searches the short index records and returns a snippet for each page with that term in its short index record. The pages should be returned in the order of their PageRank.

Hint: It is not necessary that the snippet contains the search term. If fact, it may happen that the a page is returned that does not contain the search term. A possible snippet might consist of the title followed by the first few words of the page.

Report

Provide a brief report that describes the algorithms and data structures that you use (about one or two pages).

The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the test data file.

Test

Run your program with five queries. Your report should list the queries and indicate what function each is testing.

Submission

You should submit a zip archive, named a4.zip, that contains the following:

Your program(s)
The file metadata
Test output and results file
Report

Before submitting your assignment check the following:

Does the program compile and run program run "out of the box", or does the grader need to ensure certain things are in order (e.g., put files in the right directory, edit a script file). If so, did you note clearly what those things are?
Did your instructions include the arguments the program requires (if any)?
Did you include default arguments/datafiles for the program (if applicable)?

Did you provide detailed instructions for running the program in your report?
Does the program use absolute paths in file names? (The answer should be no.)

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 29, 2006