CS 430 / INFO 430
Information Retrieval
Fall 2006
Assignments
|
Assignment 4: Late Submission Permitted until Sunday Many people have asked for an extension on this assignment because of the large number of other courses that have assignments due at the same time. No penalty will be given for lateness of this assignment if it is submitted by 11:00 p.m. on Sunday, December 3. 11/29/06 General Instructions During this course there will be several assignments which require programming. Java is strongly preferred as the programming language. If, however, you have limited Java experience C++ or Python is permitted. Running your Programs We will run your programs on a Windows computer, using the software environment in the CSUG lab. If you have any questions about this, please email cs430-l@cs.cornell.edu or make arrangements to meet with the Teaching Assistant. The assignments give specific instructions that are intended to make your programs easy for us to run and grade consistently. Points will be subtracted if you do not follow these instructions.
Include instructions for how to run your program in your report. Repeat all instructions separately for each assignment. Submission Instructions The course uses the Computer Science Course Management System (CMS) to manage assignments.
Assignment 1 [Updated September 7. Changes from the preliminary version of this assignment are underlined.] The object of this assignment is to write a suite of programs that takes raw text and creates an inverted file system. Test collection The test collection that you will use to test your programs has a set of files that are described on the Web page, testData.html. This Web page also describes several Excel spreadsheets that will be useful if you wish to check test results manually. Programs 1. Inverted file setup Write a program called invert that reads the input data and create two files, as defined in Lecture 4.
This program need not create the documents file, but the graders need the documents file in order to test your programs. The program should finish by writing out these files to disk, so that they can be read by the test program. A binary dump of the data structures is adequate. 2. Test program Write a program called test. The program begins by reading the files that were created by the Index program. It then reads as input a sequence of single terms provided by the user.
Report Write a brief report (one to two pages) that describes the algorithms and data structures that you used for each file and Program 1. The report should include instructions on how to run your programs. The report will be about one third of the grade.Submission You should submit a zip archive, named a1.zip, that contains the following:
As part of the grading, we will run your programs against the test data. Hints Remember that you are not building a production system and the volume of test data is quite small. This suggests the following:
Assignment 2 [Final version.] Frequently Asked Questions and Answers The page Ass2FAQ.html is a list of questions that have been sent to the Teaching Assistants and their replies. If you have a question about the assignment, check this page first. If your question has not been addressed, send email to cs430-l@lists.cs.cornell.edu. Assignment The purpose of this assignment is to demonstrate your understanding of latent semantic indexing. For this assignment it is recommended that you write your program(s) in Java, but C++ or Python are permitted. For the matrix calculations in Java, you can use the JAMA package http://math.nist.gov/javanumerics/jama/, which provides singular value decomposition. If you choose to write in C++ or Python you will have to find a matrix package that will do singular value decomposition. For the assignment, you will use the set of files that were used for Assignment 1. They are described on the Web page, testData.html. A search engine that indexes full text using latent semantic indexing Write a program LSI that does the following: Building the index(es)
Searching the indexes After the indexes have been built, the user interface accepts queries from users. A query is a simple list of terms. The program should do the following:
Run four tests of your program that demonstrate its features. For each test, state which features it is demonstrating. Report Provide a brief report that describes: the mathematical basis for your programs, how your program takes advantage of the capabilities of JAMA, and which features you have implemented yourself (about one or two pages). You should explicitly describe how you selected k, number of singular values chosen to represent the concepts in the set of documents. The graders will run your programs, but they will not go into the source code and modify it. All files and any other options must be specified as input. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data file. Submission You should submit a zip archive, named a2.zip, that contains the following:
You do not need to submit JAMA, but your report should state very clearly what the graders should do to build and run your programs. Assignment 3 Frequently Asked Questions and Answers The page AssFAQ.html is a list of questions that have been sent to the Teaching Assistants and their replies. If you have a question about the assignment, check this page first. If your question has not been addressed, send email to cs430-l@lists.cs.cornell.edu. Assignment The aim of this assignment is to implement a program, explore, in Java or C++ that provides a simple information discovery system with fielded searching and basic relevance feedback. This assignment builds on Assignment 1. The information discovery process The information discovery process is as follows.
Each of these steps is described in more detail below. Catalog data The records to be indexed are contained in a file of tagged ASCII text of the following format:
Each record begins on a new line. There may be blank lines. See the test data file for examples of this format. Your search engine does not need to handle records in any other format. A file of test data test3.txt has been provided. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.) Inverted file set up When the program starts up, it reads a file in the format of the test data, and builds an inverted index, which includes a word list and postings files of the search terms. Each posting should be labeled to show whether the term is in an <author> or <title> field. It should use the same stop list as for Assignment 1, stoplist.txt. User interface The program has a simple user interface that allows the user to input a query, display the results, select relevant results as relevance feedback, and run the revised query. It should be possible to run several queries in a single run. This can be a very simple command line interface. Input a query To specify a search, the user must specify a query and a search option. A query is a string of printable ASCII text containing one or more search terms. Three search options are provided:
Search algorithm The search algorithm should rank the similarity of each document to a query. Depending on the search option chosen, the similarity should be calculated using terms in the <author>, <title>, or all fields. Relevance feedback After a search has been carried out, the user selects one or more catalog records from the results as relevant. The user also specifies a search option for the new search (author, title or general). A new query is created by the system which combines all the terms in the appropriate field of the selected records. Another search is carried out using this revised query and search option. {Hint: The standard tf.idf scheme is optimized for full text documents. In class we discussed fielded searches in general but made no specific recommendation for weights. You will have to devise a reasonable scheme and explain briefly why you chose it.} Report Provide a brief report that describes the algorithms and data structures that you use (about one or two pages). Be sure to specify the similarity and ranking methods, and how the new query is created using relevance feedback. The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data files. Submission checklist You should submit a zip archive, named a3.zip, that contains the following:
Before submitting your assignment check the following:
Assignment 4 Many people have asked for an extension on this assignment because of the large number of other courses that have assignments due at the same time. No penalty will be given for lateness of this assignment if it is submitted by 11:00 p.m. on Sunday, December 3. The purpose of this assignment is to demonstrate familiarity with concepts used in searching the web, such as the PageRank algorithm and the use of anchor text. The assignment is divided into four separate parts. You will probably find that it is most convenient to write a single program that does all four parts, but you may submit several programs if you prefer. 1. Extracting hyperlinks from the input data The file test/test4.txt contains a list of URLs, one per line. Each URL identifies an html page on the Information Science web site. The pages have been numbered for convenience. You will need to write code that reads the html pages in the test data. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set. [See the file test/URLhints.html for hints on extracting hyperlinks from these web pages.] 2. PageRank Use the PageRank algorithm to calculate a ranking of the pages. Use a damping factor of 0.8. 3. Index For each page build a short index record that contains only:
4. Metadata Write a file, metadata, that contains the following metadata for each page: page ID, PageRank, list of terms in anchor text from pages that link to the page. 5. Searching Build a simple search program that receives as input a single term, searches the short index records and returns a snippet for each page with that term in its short index record. The pages should be returned in the order of their PageRank. Hint: It is not necessary that the snippet contains the search term. If fact, it may happen that the a page is returned that does not contain the search term. A possible snippet might consist of the title followed by the first few words of the page. Report Provide a brief report that describes the algorithms and data structures that you use (about one or two pages). The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the test data file. Test Run your program with five queries. Your report should list the queries and indicate what function each is testing. Submission You should submit a zip archive, named a4.zip, that contains the following:
Before submitting your assignment check the following:
|
[ Home | Syllabus | Readings | Assignments | Examinations | Academic Integrity ]
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 29, 2006