CS 430 / INFO 430
Information Retrieval
Fall 2004
Hints on Extracting URLs
Forms of URLs
Various forms of URL may point to the same file. The pages referenced from test4.txt are fairly straightforward, but you must be prepared for the following:
Relative URLs
The full form of a URL begins with a protocol name, e.g., http, followed by
a colon and two slashes, ://. For example:
http://www.cs.cornell.edu/wya/index.html
However, within a web site it is usual to refer to pages relative to the current
directory. Thus, if a page is stored in the directory www.cs.cornell.edu/wya/,
the relative URL:
index.html
refers to the file:
www.cs.cornell.edu/wya/index.html
The notation ../ at the beginning of a relative URL refers to the parent directory.
Thus, if a page is stored in the directory www.cs.cornell.edu/wya/, the relative
URL:
../index.html
refers to the file:
www.cs.cornell.edu/index.html
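One way to turn relative URLs into their full form is sketched below in Python, using urljoin from the standard urllib.parse module. The helper name resolve is hypothetical and chosen only for illustration, and the base URL is assumed to be the directory of the page on which the link was found (note the trailing slash).

```python
from urllib.parse import urljoin

def resolve(base_url, link):
    """Return the full URL that `link` refers to when found on `base_url`.

    `resolve` is a hypothetical helper name; urljoin does the real work.
    """
    return urljoin(base_url, link)

# Assumed base: the directory used in the examples above.
base = "http://www.cs.cornell.edu/wya/"

print(resolve(base, "index.html"))     # http://www.cs.cornell.edu/wya/index.html
print(resolve(base, "../index.html"))  # http://www.cs.cornell.edu/index.html
```

A link that is already in full form is returned unchanged, so the same call can be applied to every link extracted from a page.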
Default files
Sometimes a URL specifies a directory, but does not specify a file within that
directory, as in the following:
http://www.cs.cornell.edu/wya/
In this situation, the URL refers to a default file within the specified directory.
The default is chosen by the web server; the most common defaults are files named
index.html or index.htm. Thus this example typically refers to:
http://www.cs.cornell.edu/wya/index.html
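If you want directory URLs and their default files to compare as equal, one possible approach is to append an assumed default file name to any URL whose path ends in a slash. The sketch below is only a heuristic: the function name add_default_file is hypothetical, and the choice of index.html is an assumption, since a program cannot tell which default a given server actually uses without fetching the page.

```python
from urllib.parse import urlsplit, urlunsplit

def add_default_file(url, default="index.html"):
    """Append an assumed default file name to a URL that names a directory.

    Hypothetical helper: `default` is a guess, not something the URL tells us.
    """
    parts = urlsplit(url)
    path = parts.path or "/"      # a bare host name is treated as the root directory
    if path.endswith("/"):        # the URL names a directory, not a file
        path += default
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(add_default_file("http://www.cs.cornell.edu/wya/"))
# http://www.cs.cornell.edu/wya/index.html
```

URLs that already name a file are returned unchanged.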
Anchors within a file
Usually URLs refer to the beginning of a page. However, it is possible to refer
to an anchor within a page, by appending the # sign followed by the name of
the anchor. Thus the following two URLs refer to the same page, though they
reference different locations within the page:
http://www.cs.cornell.edu/wya/papers.html
http://www.cs.cornell.edu/wya/papers.html#year2000
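When two such URLs should count as the same page, the anchor can be removed before comparing them. A minimal Python sketch, assuming the standard urllib.parse module is available; the helper name strip_anchor is hypothetical:

```python
from urllib.parse import urldefrag

def strip_anchor(url):
    """Return `url` without its #anchor part (hypothetical helper name)."""
    # urldefrag splits a URL into (url without fragment, fragment).
    return urldefrag(url).url

a = strip_anchor("http://www.cs.cornell.edu/wya/papers.html")
b = strip_anchor("http://www.cs.cornell.edu/wya/papers.html#year2000")
print(a == b)  # True
```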
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 12, 2004