CS 430 / INFO 430
Information Retrieval
Fall 2005
Assignments
|
General Instructions During this course there will be several assignments which require programming. Java is strongly preferred as the programming language. If, however, you have limited Java experience C++ is permitted. Running your Programs We will run your programs on a Windows computer, using the software environment in the CSUG lab. If you have any questions about this, please email cs430-l@cs.cornell.edu or make arrangements to meet with the Teaching Assistant. The assignments give specific instructions that are intended to make your programs easy for us to run and grade consistently. Points will be subtracted if you do not follow these instructions.
Repeat all instructions separately for each assignment. Submission Instructions Each assignment will require you to submit a list of files.
Assignment 1 This assignment was slightly modified for clarity on September 8 and 15. The paragraphs that have been changed are identified by notes. The object of this assignment is to write a suite of programs that take raw text and creates an inverted file. Test collection The test collection that you will use to test your programs has a set of files that are described on the Web page, testData.html. This Web page also describes several Excel spreadsheets that will be useful if you wish to check test results manually.Programs 1. Inverted file setup Write a program that reads the input data and create three files, as defined in Lecture 4.
The program should finish by writing out these files to disk, so that they can be read by the Test program. A binary dump of the data structures is not adequate. [Note: This paragraph was added on September 15 for clarity.] 2. Test program Write a test program. The program begins by reading the files that were created by the Index program. It then reads as input one term at a time. [Note: This paragraph was modified on September 15 for clarity.]
Report Write a brief report (one to two pages) that describes the algorithms and data structures that you used for each file and Program 1. The report should include instructions on how to run your programs. The report will be about one third of the grade.Submission You should submit the following:
As part of the grading, we will run your programs against the test data. Hints Remember that you are not building a production system and the volume of test data is quite small. This suggests the following:
Assignment 2 Because of disruptions to the computing facilities in Upson Hall scheduled for Saturday, the due date for Assignment 2 has been extended to midnight on Sunday, October 9. [Minor revisions to the wording were made on, Wednesday, September 28.] The purpose of this assignment is to demonstrate your understanding of latent semantic indexing. For this assignment it is recommended that you write your programs in Java. For the matrix calculations in Java, you can use the JAMA package http://math.nist.gov/javanumerics/jama/, which provides singular value decomposition. If you choose to write in C++ you will have to find a matrix package that will do singular value decomposition. For the assignment, you will use the set of files that were used for Assignment 1. They are described on the Web page, testData.html. A search engine that indexes full text using latent semantic indexing Write a single program LSI that does the following: Building the index(es)
Searching the indexes After the indexes have been built, the user interface accepts queries from users. A query is a simple list of terms. The program should do the following:
Run four tests of your program that demonstrate its features. For each test, state which features it is demonstrating. Report Provide a brief report that describes: the mathematical basis for your programs, how your program takes advantage of the capabilities of JAMA, and which features you have implemented yourself (about one page). The graders will run your programs, but they will not go into the source code and modify it. All files and any other options must be specified as input. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data file. Submission You should submit the following:
You do not need to submit JAMA, but your report should state very clearly what the graders should do to build and run your programs. Assignment 3 The aim of this assignment is to implement a program, explore, in Java or C++ that provides a simple information discovery system with fielded searching and basic relevance feedback. This assignment builds on Assignment 1. The information discovery process The information discovery process is as follows.
Each of these steps is described in more detail below. Catalog data The records to be indexed are contained in a file of tagged ASCII text of the following format:
Each record begins on a new line. There may be blank lines. See the test data file for examples of this format. Your search engine does not need to handle records in any other format. A file of test data test3.txt has been provided. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.) Inverted file set up When the program starts up, it reads a file in the format of the test data, and builds an inverted index, which includes a word list and postings files of the search terms. Each posting should be labeled to show whether the term is in an <author> or <title> field. It should use the same stop list as for Assignment 1, stoplist.txt. User interface The program has a simple user interface that allows the user to input a query, display the results, identify relevant results as relevance feedback, and run the revised query. It should be possible to run several queries in a single run. This can be a very simple command line interface. Input a query To specify a search, the user must specify a query and a search option. A query is a string of printable ASCII text containing one or more search terms. Three search options are provided:
Search algorithm The search algorithm should rank the similarity of each document to a query. Depending on the search option chosen, the similarity should be calculated using terms in the <author>, <title> or all fields. Relevance feedback After a search has been carried out, the user selects one or more catalog records from the results as relevant. The user also specifies a search option for the new search (author, title or general). A new query is created by the system which combines all the terms in the appropriate field of the selected records. Another search is carried out using this revised query and search option. {Hint: The standard tf.idf scheme is optimized for full text documents. In class we discussed fielded searches in general but made no specific recommendation for weights. You will have to devise a reasonable scheme and explain briefly why you chose it.} Report Provide a brief report that describes the algorithms and data structures that you use (about one page). Be sure to specify the similarity and ranking methods, and how the new query is created using relevance feedback. The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data files. Submission checklist You should submit the following:
Before submitting your assignment check the following:
While not required, the graders would be grateful if you also kept the following things in mind
Assignment 4 The purpose of this assignment is to demonstrate familiarity with concepts used in searching the web, such as the PageRank algorithm and the use of anchor text. The assignment is divided into four separate parts. You will probably find that it is most convenient to write a single program that does all four parts, but you may submit several programs if you prefer. 1. Extracting hyperlinks from the input data The file test4.txt in the test directory contains a list of URLs, one per line. Each URL identifies an html page on the Information Science web site. The pages have been numbered for convenience. You will need to write code that reads the html pages in the test data. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set. [See the file URLhints.html in the test directory for hints on extracting hyperlinks from these web pages.] 2. PageRank Use the PageRank algorithm to calculate a ranking of the pages. Use a damping factor of 0.2. 3. Index For each page build a short index record that contains only:
4. Metadata Write a file, metadata, that contains the following metadata for each page: page ID, PageRank, list of terms in anchor text from pages that link to the page. 5. Searching Build a simple search program that receives as input a single term, searches the short index records and returns a snippet for each page that has that term in its short index record. The pages should be returned in the order of their PageRank. Hint: It is not necessary that the snippet contains the search term. If fact, it may happen that the a page is returned that does not contain the search term. A possible snippet might consist of the title followed by the first few words of the page. Report Provide a brief report that describes the algorithms and data structures that you use (about one page). The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the test data file. Test Run your program with five queries. Your report should list the queries and indicate what function each is testing. Submission You should submit the following:
Before submitting your assignment check the following:
While not required, the graders would be grateful if you also kept the following things in mind
|
[ Home | Syllabus | Readings | Assignments | Examinations | Academic Integrity ]
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 10, 2005