CS 430 / INFO 430
Information Retrieval
Fall 2005
Test Data
Stoplist
Use the following stop list for all assignments: stoplist.txt.
Test data for Assignment 1
The test collection that you will use to test your programs consists of 20 files, which are stored in the test directory. The files are news articles taken from the NASA web site. The average length is 600 terms.
The files are:
file01.txt
file02.txt
file03.txt
file04.txt
file05.txt
file06.txt
file07.txt
file08.txt
file09.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file20.txt
Test data for Assignment 3
The test collection that you will use to test your programs is stored in test/test3.txt. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.)
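One way to skip the data before the first <metadata> tag is sketched below. This is a minimal illustration, not a required approach. The function name strip_before_metadata and the sample string are hypothetical, and the sketch assumes the whole file fits in memory as a single string.

```python
def strip_before_metadata(text, tag="<metadata>"):
    """Return the text starting at the first occurrence of tag.

    If the tag never appears, the text is returned unchanged.
    """
    start = text.find(tag)
    return text if start == -1 else text[start:]


# Hypothetical sample, not taken from test/test3.txt.
sample = "<header>ignore me</header>\n<metadata><title>A</title></metadata>"
print(strip_before_metadata(sample))
```

The same function could be applied to the contents of test/test3.txt after reading the file into a string.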
Test data for Assignment 4
The test collection that you will use to test your programs is stored in test/test4.txt. This data is a list of URLs of html and htm pages from the www.infosci.cornell.edu Web site.
The file test/URLhints.html provides hints on extracting hyperlinks from these pages.
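As a rough illustration of hyperlink extraction, the sketch below uses the HTML parser in the Python standard library. It is not taken from test/URLhints.html and may differ from the approach suggested there. The class name LinkExtractor, the function extract_links, and the sample page (including the path people/index.html) are hypothetical. The sketch assumes that only href attributes of a tags are wanted and that relative links should be resolved against the URL of the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all hyperlinks found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
page = ('<p><a href="people/index.html">People</a> '
        '<a href="http://www.cs.cornell.edu/">CS</a></p>')
print(extract_links(page, "http://www.infosci.cornell.edu/"))
```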
Calculating tf.idf manually
To understand tf.idf and to be able to test the output of programs, such as those written for Assignment 1, it is useful to calculate a few sample values manually. The files test/AllFiles1.xls and test/DocumentFreq1.xls are Excel spreadsheets of the terms in the 20 test documents.
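A hand calculation can be checked against a few lines of code. The sketch below assumes one common formulation, the raw term frequency multiplied by log2(n/df), where n is the number of documents and df is the number of documents that contain the term. This page does not state which weighting the assignment requires, so the definition given in the assignment takes precedence. The function name tf_idf and the sample counts are hypothetical.

```python
import math


def tf_idf(term_freq, doc_freq, n_docs):
    """Weight of a term in a document: tf * log2(n / df).

    Assumed formulation; other variants scale tf or add a constant to idf.
    """
    return term_freq * math.log2(n_docs / doc_freq)


# Hypothetical counts: a term occurring 3 times in one file and
# appearing in 5 of the 20 test files.
print(tf_idf(3, 5, 20))   # 3 * log2(4) = 6.0

# A term that appears in every file gets weight 0 under this variant.
print(tf_idf(7, 20, 20))  # 7 * log2(1) = 0.0
```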
test/AllFiles1.xls
The file test/AllFiles1.xls has a column for the terms in each of the 20 files. Terms from the stop list and terms that do not begin with a letter have been removed. Otherwise no editing has been done. The terms are sorted in lexicographical order and each term is repeated as many times as it occurs in the file.
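The steps used to build each column can be approximated as follows. This is a sketch under stated assumptions, not the procedure that produced the spreadsheet: tokens are split on white space, stop words are matched without regard to case, and the sort uses Python's default string ordering. The function name term_column and the sample text and stop words are hypothetical.

```python
def term_column(text, stop_words):
    """Return the sorted list of terms in text, with repeats kept.

    Terms on the stop list and terms that do not begin with a letter
    are dropped. No other editing is done.
    """
    terms = []
    for token in text.split():
        if token.lower() in stop_words:
            continue  # assumed: stop list comparison ignores case
        if not token[0].isalpha():
            continue  # e.g. numbers or tokens starting with punctuation
        terms.append(token)
    return sorted(terms)


# Hypothetical text and stop words, for illustration only.
stop_words = {"the", "is", "an"}
text = "the active galaxy 3c273 is an active source"
print(term_column(text, stop_words))
```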
For each file, a second column gives a running total of the number of occurrences of each term. For example, if the term active appears twice, the first occurrence is labeled 1 and the second is labeled 2.
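The running total can be reproduced from a sorted term list with a single pass. The sketch below is one possible way to do it; the function name running_counts and the sample terms are hypothetical. The last label for each term is its term frequency in that file.

```python
def running_counts(sorted_terms):
    """Pair each term with its occurrence number within a sorted list."""
    rows = []
    previous = None
    count = 0
    for term in sorted_terms:
        # Restart the count whenever a new term begins.
        count = count + 1 if term == previous else 1
        rows.append((term, count))
        previous = term
    return rows


# Hypothetical sorted terms, for illustration only.
for term, n in running_counts(["active", "active", "galaxy", "source"]):
    print(term, n)
```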
For each file, the following statistics are calculated:
test/DocumentFreq1.xls
The file test/DocumentFreq1.xls has all terms from the 20 files merged into a single list. In this table, each row represents the occurrence of one term in one document. The columns are as follows:
The second row of the spreadsheet calculates the following statistic:
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 10, 2005