Programming
Assignment 1 (due
Friday, 7 March, 5 PM)
In
this assignment, you will implement a caching web
proxy.
What
is a Web Proxy?
A web proxy is an intermediary between
the web browser and the servers on the Internet. All requests from the browser go to the
proxy, which may act upon the request before resending it to the web
servers. The response from the
servers is sent to the proxy, which might process the response before sending it
to the browser (see Figure below).
There are a number of uses of a web proxy. Proxy caches allow a single copy of a page to be cached locally and accessed by many users within the same LAN rather than requiring the page to be fetched from a remote server for each user. Proxies can also translate/modify data before presenting it to its clients. For example, proxies could produce smaller lower resolution images for client's with low bandwidth connections. A proxy is also commonly used for interaction between web browsers inside a firewall and servers over the Internet. Proxies also help in providing access control for certain documents or protocols based on the IP address. For example, you want to allow a client to use HTTP, but disallow the use of FTP. Proxies are also useful in providing web access on devices with low power, or a low-resolution display.
This Assignment
In this assignment you will implement a HTTP proxy cache. Your proxy will behave like an HTTP server to web browsers and like an HTTP client to web servers. It will retain copies of pages accessed and attempt to first satisfy requests from its cache rather than by contacting a remote web server.
In addition to HTTP, your web proxy will be required to respond to ADMIN requests defined especially for this assignment (i.e. not an Internet standard) such as FLUSH ( to flush all cached pages from the cache), DEL (to delete a specified page from the cache), INFO (to dump statistics about the cache operation to a local log file) and CHANGE (to change the size of the cache dynamically).
We expect your proxy to take the
following command line parameters:
–c <cache_size_in_KB> -p <port number proxy will listen on> -l <name of local log file> -a <caching algorithm>
What
to do?
We recommend developing your web proxy in four phases.
Phase
1: HTTP Processing
In the first phase you will implement a simple web proxy. The proxy will take requests from the browser, parse the request, and send the request to the web server. The response gathered by the proxy will be sent back to the browser. For this functionality, the proxy should open a socket connection on startup, and listen for incoming requests. On getting a request from the browser, the proxy should parse the HTTP request to determine the destination server, and open a connection to it. It should then send the request, process the reply, and send it back to the browser. The port number for the proxy should be a command line argument.
At this point, your proxy need only implement the HTTP GET request. If the client issues other HTTP requests like HEAD or POST you can simply return a 501 Not Implemented error message! That said processing HEAD and POST should not be particularly difficult and may make your proxy much more useable (and amenable to data collection as described in the next section). If you do choose to handle these request, you should mention it in your README.
For simplicity you are allowed to use HTTP/1.0 between the proxy and the web server. HTTP/1.1 uses persistent connections, which could give you occasional problems using sockets. For this modification, the proxy will have to take the HTTP/1.1 request and replace HTTP/1.1 with HTTP/1.0 before sending it to the web server. HTTP/1.0 requests will just have to be echoed to the web server. Note: This is just a suggestion and not mandatory to implement.
You should however support HTTP/1.1 or 1.0 from the browsers. There is no need however to implement the complex features of HTTP/1.1. If the browser sends a 1.1 request, you just need to respond to the request, i.e. fetch the page corresponding to the GET request.
At this phase we also
recommend that you begin setting up your logging system. The name of the log
file to be kept will be passed with the -l option. You should write status
messages to the log such as "Request for <url> from
<client>" and "Requesting <url> from <server>".
Phase
2: Caching
Now, you will add caching to your proxy. Requested web pages will be temporarily stored at the proxy to satisfy further requests for the same web page.
One way to implement the cache would be with a hash table (compute a hash based on URL). However, you are free to use any scheme of your choice. You should describe the data structures used to implement the cache in your design document. Note: The cache size, in KB, should be specified as a parameter to the proxy program.
Once placed in the cache, pages will leave for two reasons: first the cache is full and something must be evicted and second, an expired header was specified which limited the time the page should be cached.
If a page is cached and the browser specifies the If-modified-since header with a request for a cached page, you should send an if-modified-since request to the server. If the server returns new data, you should update your cache and send the new data to the client. If the server responds that it has not been modified, you should return your cached copy to the client.
You should write status messages to the log such as "Added <the url> to Cache, should be evicted after <duration>", "Added <url> to cache with no expires header", "Evicted <the url> from Cache because of the Expires header", "Evicted from Cache because it is full", "Reading Response from Cache (Cache Hit)", "Response not in Cache. Getting it from the Server (cache miss)".
Everyone should implement LRU cache replacement strategy (designated by -a 1). In addition, everyone should implement at least one alternate caching policy (designated by -a 2 and above). In your design document, you should argue under what circumstances your alternate caching policy will perform better. Feel free to consider other factors in your caching decisions like number of hits or size of the object.
You should also keep statistics like the following total number of requests, total hits, total misses, requests in the last second, hits in the last second, misses in the last second, average age of pages in the cache, average size of pages in the cache, percentage of pages with an expires header, average duration of expires header, percentage of pages that are evicted due to the expires headers, number of clients seen in the last second, number of servers contacted in the last second, etc.
You should give some careful thought to the types of statistics to report and you should use the statistics to sanity check your implementation, to deepen your understanding of the nature of web traffic and to compare caching algorithms. Certainly many statistics will depend on the workload you present to the cache. Statistics based on a few simple test queries will be relatively uninteresting. Therefore, it is important that you complete the programming of your proxy in enough time to allow you to use the proxy and collect statistics.
You will dump a variety of statistics in response to an ADMIN INFO command. You may also want to write some statistics regularly to the log. In addition, we will specify the statitics that should be reported as a result of an "INFO 0" command but you should feel free to define additional INFO numbers and describe their use.
In your design document, you should discuss in detail the types of statistics you choose to keep and what you learned from them.
You may want to consider writing scripts to process your log files. For example, if you prefix statistics with a specified string you could use the prefix to filter information from the log. Similarly, if you choose the format of your prints carefully (tab delimited for example). You may be able to import subsets of your log directly into Excel or even a database where you could produce reports or graphs of the data. If you build an especially efficient statistics processing system, you may receive extra credit.
You will be graded both on the statistics you choose to report and your analysis/reporting of them.
Phase 3: ADMIN interface
In addition to responding the HTTP GET request, your proxy should handle administrative messages listed above.
FLUSH: erases the contents of the cache
DELETE <name>: deletes the object with the identifier <name> you should use the filename of the object when you are caching. Just store the name that you get with the GET request.
INFO <number> : this dumps some statistics about the cache into a log file. The statistics that we expect to see in response to an "INFO 0" are: current size of the cache and the list of objects in the cache.
CHANGE <new_cache_size>: This command allows the admin to dynamically change the cache size. You should flush the objects in the cache as necessary according to your caching algorithm..
Note that for each of these commands, you should just send a 200 HTTP OK message back. For details on the format of this message please read the RFC's. If anything goes wrong, just send a 400 Error msg with the appropriate description of the problem, e.g. if the <name> to delete is not there. For the exact format of the log messages that should be sent to file, please read FAQ. Here is some example requests that you might get:
In your design document, you should critique this ADMIN interface and sketch out at least one alternative. Could we provide an administrative interface in other ways than through a special protocol enhancement? Would it make more sense for it to be a local feature of the proxy itself rather than a feature provided to clients over the network? What are the advantages/ disadvantages of the way we specified this feature? Is it secure?
Detailed Description of the ADMIN interface
All admin commands will start with command "ADMIN" followed by the
actual command name. All command names are case insensitive.
There are command names that take a single parameter.. That parameter will come in an HTTP header line with the "Object:"
descriptor.
Specifically, here are some examples of commands that you should be able to handle:
ADMIN flush HTTP/1.1
ADMIN FLUSH HTTP/1.1
ADMIN INFO HTTP/1.1
Object: 0
ADMIN info HTTP/1.1
Object: 1
ADMIN CHANGE HTTP/1.1
Object: 64
ADMIN DELETE HTTP/1.1
Object: <*exact*_name_of_object_which appeared_in_some_get_request_before>
Some notes:
The size parameter given by CHANGE command is in Kbytes..
If anything goes wrong send 400
Error Msg with description.. This includes the scenario where the arguments are in
an unexpected format.. e.g.
ADMIN INFO HTTP/1.1
Object: gimmeinfo
Extra credit
Any other features you would like to add your web proxy is likely to earn you some extra credit, but you should do it only if you are having fun and would like to learn more. In particular, we will *not* say how much extra credit each feature or sub feature may be worth.
One suggestion is adding concurrent request processing to the web proxy. Concurrency can be added by creating a new thread or process for each request. An alternate approach which avoids the thread/process creation overhead is to use a fixed sized pool of threads/processes which pull work from a work queue. In Java you can use Java threads and for C on UNIX you might want to consider pthreads. You could also manage concurrency with the select system call in UNIX, but threads are strongly recommended in this assignment. Note that since you will be using multiple threads, you will need synchronization mechanisms for writing to the log file, evicting data from the cache and inserting data into the cache. Regardless of the approach you take to increasing concurrency, consider analyzing the impact of your approach by placing additional load on the system and comparing the performance to a single threaded implementation. You may be surprised to find that performance doesn't exactly improve when you expect it to. As in phase 3, think carefully about the statistics you should keep to better understand the performance of your system. Report your results in the design document.
Some other suggestions are supporting SSL, doing content filtering or translation (blocking traffic from certain servers or if certain keywords are present, etc.), making the cache contents persistent by saving cache contents to a file and restoring them on start up.
LOGISTICS
This assignment has to be done in groups of two. Requests to do the assignment with a different group size should be submitted in writing. We recommend doing the assignment in Java and will provide more "official" support for Java. However, you may also do the assignment in C. If you do the assignment in Java, you should use jdk 1.4. Regardless of whether you use Java or C, your code must compile and run with software/libraries available on the CSUG lab machines.
In you are using Java, your proxy should start by giving the
following command on the command line:
java web_proxy –c <cache_size> -p <port number> -l <log file name> -a <caching algorithm>
If you are coding in C, then you should submit your solution with a Makefile. In that case your proxy will start with:
web_proxy -c <cache_size> -p <port> -l <log_file>-a <caching algorithm>
Here
is a file describing the log format we expect and two example log files, info.log
and info2.log.
What
to submit?
You should submit the following for this
assignment- a single gzipped tar file containing the following.
1.
Complete
implementation of the proxy.
2.
A file
called README.txt where you give a tutorial on how to compile and run a
program. This file should also
contain the names, netids and cornellids of all the individuals in the
group. If
you did extra credit, you should clearly list the additional features. If
required portions of your assignment do not work, you should make this clear as
well.
3.
A design
document (PDF, HTML or text) of your proxy. The design of all the phases
should be explained in this file.
Specifically there should be a section presenting the statistics you collected
and your analysis. There should also be a section describing your alternate
caching policy and comparing it to LRU. Describe the data structures you used to
implement your caching policies. You should also critique the ADMIN interface
and propose alternatives. Be sure to
include a bibliography listing any external references or code used.
How
will you be graded?
The following will play a crucial role in
your grades for this assignment.
1.
Correct
implementation of the basic web proxy . [15 points]
2
3. Handling the ADMIN commands [5 points] and your critique of the ADMIN interface [5 points]
4.
Statistic reporting and analysis [20 points]
5.
If you are unable to
complete this assignment, you should specifically mention why you think the
program did not work, and what your approach for the remaining phases would
be. This could earn you some extra
points.
1. Mehmet and Xiangyang have put together some template files to get you started: Here. You are welcome to start from scratch instead if you'd like.
2.
A tutorial
on how to use threads in Java.
3.
Help
on using sockets in Java.
4.
You will
need some basic HTTP knowledge for this assignment. We expect you to be able to handle only
GET requests. RFC 2616 gives the HTTP 1.1
specification in detail.
5.
You should
test your proxy with an available browser.
For Netscape, go to
Edit->Preferences->Advanced->Proxies->Manual Proxy
Configuration. For Internet
Explorer, choose Tools->Options->Connections->LAN Settings->Use a
Proxy Server. In either case
specify the address and port number.
1.
Having trouble loading images in Java? BufferedReader is designed for text from a
character-input stream) it will most likely convert "\n" to "\r\n" or vice versa depending on your platform; so if you use it for binary data, your
binary data is going to be screwed up. You may want to try to use something like the read method on the InputStream directly or a write on the
OutputStream. Also there is a difference between char[] (i.e. String) and byte[]
in Java, unlike C. So you want to be careful to use byte arrays to store your binary data.
2. ...