There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Spärck Jones's probabilistic model are based on strong or complex assumptions. We show that a more intuitively plausible assumption suffices. Moreover, the new assumption, while conceptually very simple, provides a solution to an estimation problem that had been deemed intractable by Robertson and Walker (1997).
@inproceedings{Lee:07a,
  author = {Lillian Lee},
  title = {{IDF} revisited: A simple new derivation within the {Robertson-Sp\"arck Jones} probabilistic model},
  year = {2007},
  pages = {751--752},
  booktitle = {Proceedings of SIGIR}
}
This paper is based upon work supported in part by the National Science Foundation under grant no. IIS-0329064, a Yahoo! Research Alliance gift, and an Alfred P. Sloan Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed are those of the author and do not necessarily reflect the views or official policies, either expressed or implied, of any sponsoring institutions, the U.S. government, or any other entity.