CS 316 Programming Assignment 6: FAQ

Wednesday, December 12, 9:55 p.m.

If printf is not outputting to the screen when you expect it to, try adding fflush(stdout); after your printf.

(For efficiency, all I/O is buffered. This includes printing to the screen. If your program crashes or is killed before the stdout buffer is emptied, the contents of that buffer will be lost. fflush flushes the buffer for you.)
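
As a minimal sketch of the pattern (log_progress is a made-up helper name, not part of the framework), a debugging print that flushes right away might look like this:

```c
#include <stdio.h>

/* Hypothetical helper: prints a progress message and pushes it out of the
 * stdout buffer immediately, so it is visible even if the program crashes
 * on the very next line.
 * Returns 0 on success, or -1 if either call reports an error. */
int log_progress(const char* msg) {
    if (printf("%s\n", msg) < 0) {
        return -1;
    }
    /* Without this call the text may sit in the buffer for a while;
     * a crash at that point discards it. */
    return fflush(stdout) == 0 ? 0 : -1;
}
```

Calling log_progress("about to connect") just before a suspect line should make the message appear even if that line segfaults.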
Wednesday, December 12, 9:18 p.m.

Why do enqueue_request() and dequeue_response() only deal with message headers? What about the rest of the message?

This is intentional. If you look at the struct skeletons for the actual messages in messages.h, you'll see they all start with a request_header_t field. What would these messages look like when laid out in memory?
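
One way to picture the layout, as a sketch only: the field names and the example_request_t struct below are made up for illustration, and the real definitions are the ones in messages.h. The point is that when the header is the first member, a pointer to the whole message and a pointer to its header hold the same address.

```c
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL fields: the real request_header_t lives in messages.h. */
typedef struct {
    uint32_t type;
    uint32_t request_id;
} request_header_t;

/* HYPOTHETICAL message: the header is the FIRST member, followed by the
 * message-specific payload. */
typedef struct {
    request_header_t header;
    uint32_t payload[4];
} example_request_t;

/* A queue that only knows about headers can still carry whole messages,
 * because the header sits at offset 0 of the message. */
request_header_t* as_header(example_request_t* msg) {
    return &msg->header;
}

/* Recover the full message once the header's type field tells you which
 * struct you are really looking at. */
example_request_t* as_example_request(request_header_t* hdr) {
    return (example_request_t*)hdr;
}
```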
Wednesday, December 12, 8:30 p.m.

Why do the compiler warnings claim that my functions are returning integers?

This is a common cause of segfaults, in particular when functions that should return pointers are involved.

If, when a function is called, the compiler cannot see the declaration (i.e. header) of that function, it assumes the function returns an integer, and proceeds serenely. Until you run it. Then, most likely, you explode.

To fix this, check your #includes. You are probably missing one. Whenever you call a function, you should be able to trace back, through includes, to the declaration of that function. If you can't, it won't work correctly.
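
A small illustration using only standard library functions (copy_string is a made-up name). With both includes present, the compiler sees that malloc returns a pointer; drop the first include and you get exactly the warning described above.

```c
#include <stdlib.h>  /* declares malloc and free */
#include <string.h>  /* declares memcpy and strlen */

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if the
 * allocation fails. The caller frees the result.
 * If <stdlib.h> were missing, an older compiler would assume malloc
 * returns int; on a 64-bit machine that can truncate the pointer, and the
 * memcpy below would then write through a bogus address. */
char* copy_string(const char* s) {
    size_t n = strlen(s) + 1;      /* include the terminating '\0' */
    char* copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, s, n);
    }
    return copy;
}
```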
Wednesday, December 12, 5:44 p.m.
Unless you are ambitious about handling errors, you should only be connecting to the servers once at the beginning of the program and reusing your connection to a given server for all messages sent to that server.
Wednesday, December 12, 5:40 p.m.

To compile a pthreaded program by hand, you need to pass -lpthread to gcc. Put it after the source files so the linker can resolve the pthread symbols they use:
gcc -o <prog-name> <source-files> ... -lpthread
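
If you want something small to try the flag on, here is a minimal sketch that is unrelated to the framework (square_in_thread is a made-up name). It starts one thread and waits for it to finish.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body: squares the int that arg points to, in place. */
static void* square(void* arg) {
    int* value = arg;
    *value = *value * *value;
    return NULL;
}

/* Hypothetical helper: squares x on a separate thread and returns the
 * result, or -1 if the thread could not be created or joined. */
int square_in_thread(int x) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, square, &x) != 0) {
        return -1;
    }
    /* x lives on this stack frame, so we must join before returning. */
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    return x;
}
```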
Wednesday, December 12, 1:59 p.m.

Note that pselect changes the contents of its readfds argument. You should be re-initializing that set every time through your communications loop before calling pselect.
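
A sketch of the pattern, under some assumptions: wait_readable is a made-up helper, and it calls plain POSIX pselect because the signature of the framework's pselect_316 is not shown in this FAQ. The set is rebuilt from your own list of descriptors on every call, so nothing pselect cleared last time is forgotten.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>   /* read(), for callers once a descriptor is ready */

/* Hypothetical helper: blocks until at least one descriptor in
 * fds[0..n-1] is readable. On return, *ready holds only the readable
 * ones. Returns the count reported by pselect, or -1 on error. */
int wait_readable(const int* fds, size_t n, fd_set* ready) {
    int maxfd = -1;

    /* Rebuild the set from scratch: the previous call to pselect
     * overwrote it, leaving only the descriptors that were ready then. */
    FD_ZERO(ready);
    for (size_t i = 0; i < n; i++) {
        FD_SET(fds[i], ready);
        if (fds[i] > maxfd) {
            maxfd = fds[i];
        }
    }

    /* NULL timeout: block until something is ready.
     * NULL sigmask: leave the signal mask alone. */
    return pselect(maxfd + 1, ready, NULL, NULL, NULL, NULL);
}
```

In a communications loop you would call this at the top of every iteration and then test each descriptor with FD_ISSET before reading from it.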
Wednesday, December 12, 3:48 a.m.

A lot of groups seem to have trouble using read_from_worker correctly. read_from_worker blocks until the given socket is ready to be read and then reads as much as it can from the socket without blocking. If it is able to read a complete message, that message is returned in a buffer; otherwise, read_from_worker saves the partial message in an internal buffer and returns NULL.

If you get NULL from read_from_worker, you should not be calling it again with the same socket without first checking with pselect; otherwise, your communications thread will block while attempting to read on that socket and your client will perform poorly.
Tuesday, December 11, 12:43 a.m.
How to use netcat to debug your client:
1. Start netcat in tunnelling mode:
  
  netcat -x -L csugXX:7777 -p YYYY
  where csugXX:7777 corresponds to the server to which you wish to tunnel, and YYYY is any port number (in the range 1024-65535). This starts netcat listening on port YYYY on the machine where netcat is running. Any traffic received at that socket will be forwarded to the server at csugXX:7777, and any response received from the server will be forwarded onto the client.
  
  The port number you pick must be unused amongst all other programs already running. Pick a different number if you get the error:
```
     Error: Could not setup listening socket (err=-3)
```
  For example, to have netcat listen on port 2222 and tunnel to a server at csug42:7777, type:
  netcat -x -L csug42:7777 -p 2222
2. Configure your client to talk to netcat. In your workers.conf, change the line that says:
  csugXX 7777 Z
  (where Z is some number of threads) to:
  csugWW YYYY Z
  where csugWW:YYYY is the socket on which netcat is listening.
  
  Continuing with the above example, suppose we started netcat on csug01 and our workers.conf had the line:
  csug42 7777 9
  then we'd change that line to:
  csug01 2222 9
  If we happen to be running netcat and the client on the same machine, we can instead have the line:
  localhost 2222 9
Monday, December 10, 11:46 p.m.
A signup sheet has been posted on CMS for the final project presentations. Please select a time slot for your group before Wednesday night at midnight. At the presentation we will review the performance of your program on a test rendering made earlier in the morning and we will ask each group several questions about their implementation. For the rendering test, we will have your program compute an image that should take about 1 minute on the full cluster. This corresponds to about 7 minutes 30 seconds on 16 unloaded cores. We stress that the performance of your program is considerably less important than correctness, and you should direct your effort accordingly.

Both partners should be prepared to answer any question about any part of the assignment during the assignment presentation. If you find that you are unable to create a working final implementation, please come prepared with a 1 to 2 page write-up explaining which parts of the assignment you did get working, the problems you faced that prevented you from creating a working solution, and an outline of what work would be required to correctly complete the assignment. You do not need to submit the write-up into CMS, but instead bring a paper copy to your scheduled presentation time. Those groups that have completely working assignments need not prepare a write-up, but can if they wish.

When you submit to CMS, make sure you submit all the files necessary to compile your project. We expect that for most groups this will simply be the same directory of files as the framework, but with completed source files. If your code requires any steps other than running make to compile, include a README with specific compilation instructions. Submitted binary code will not be considered when grading and will be deleted from all submissions before we compile and run your project.
Monday, December 10, 12:06 p.m.
Your client should pay attention to all command-line arguments passed to it as well as the contents of the workers.conf file. We will be testing your client with varying options, client IDs, and worker configurations, so it is important that you support these and not hard-code them.
Monday, December 10, 11:55 a.m.
We will not be providing samples of the image being rendered. You will know when you've done things correctly when you have a coherent image of a gold-coloured object with a red background.
Sunday, December 9, 1:46 a.m.
In the queue, the signature for poll() reads:
```
  int poll(queue_t* queue, void** item)
```
The void** is not a typo. To see why item has to be a void**, think about what goes wrong if it is just a void*. (Hint: item is a local variable and you can't dereference a void* without casting it to a different type.)
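
To make the hint concrete, here is a sketch with a made-up one-slot container (one_slot_t, take_broken and take are not part of the framework). The two functions differ only in how item is passed.

```c
#include <stddef.h>

/* HYPOTHETICAL one-slot container, just enough to show the parameter
 * passing. */
typedef struct {
    void* slot;   /* NULL means empty */
} one_slot_t;

/* Broken variant: item is a COPY of the caller's pointer, so assigning to
 * it changes only the local copy and the caller never sees the result. */
int take_broken(one_slot_t* q, void* item) {
    if (q->slot == NULL) {
        return -1;
    }
    item = q->slot;          /* lost when the function returns */
    (void)item;
    return 0;
}

/* Working variant: item is the ADDRESS of the caller's pointer, so
 * writing through *item updates the caller's variable. */
int take(one_slot_t* q, void** item) {
    if (q->slot == NULL) {
        return -1;           /* empty: *item is left unchanged */
    }
    *item = q->slot;
    q->slot = NULL;
    return 0;
}
```

A caller declares void* out = NULL; and passes &out to take. Passing out itself to take_broken leaves out untouched.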
Sunday, December 9, 1:35 a.m.

Many groups have been confused by the apparent inconsistency between the header structs defined in messages.h and the header format specification in the assignment write-up.

There is no disparity. The header structs provide you with a place for some information the servers don't care about. If the communications thread multiplexes between two servers, you will need to tell the communications thread the server for which a given request is meant. Thus you have a place in the request header to say "send this message to server 1." However, you don't actually need to send the server number because the server doesn't really care about it. It's only for your bookkeeping. The PA6.html describes the data you should send to a server and the header structs are meant to be useful for your bookkeeping.

Likewise, the header struct omits some items (namely message length) that can be calculated during marshalling.
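
The split between "what you keep" and "what you send" might look roughly like the sketch below. Everything in it is an assumption made for illustration: the struct, its field names, the [type][length][payload] layout and the use of network byte order are all invented here. The real wire format is the one in PA6.html.

```c
#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL in-memory header; field names are illustrative only. */
typedef struct {
    uint32_t type;       /* goes on the wire */
    int server_index;    /* bookkeeping only: tells the communications
                            thread where to send the message; never sent */
} sketch_header_t;

/* Writes a HYPOTHETICAL wire format [type][length][payload] into out.
 * The length is not stored in the header; it is computed here.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t marshal_sketch(const sketch_header_t* hdr, const void* payload,
                      uint32_t payload_len, uint8_t* out, size_t out_cap) {
    size_t total = 2 * sizeof(uint32_t) + payload_len;
    if (out_cap < total) {
        return 0;
    }
    uint32_t type_be = htonl(hdr->type);     /* assumed byte order */
    uint32_t len_be = htonl(payload_len);
    memcpy(out, &type_be, sizeof type_be);
    memcpy(out + sizeof type_be, &len_be, sizeof len_be);
    memcpy(out + 2 * sizeof(uint32_t), payload, payload_len);
    /* hdr->server_index is deliberately not copied anywhere. */
    return total;
}
```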
Sunday, December 9, 1:04 a.m.
Contrary to what was implied by the comments in comm.h, you should have only a single communications thread. The comments in /usr/local/cs316/pa6/comm.h have been updated to reflect this.

Sunday, December 9, 12:54 a.m.

The comment for poll() in queue.c and queue.h has been updated with a clarification. (New bits bolded.)

/**
 * Dequeues the item at the front of the given queue and returns 0.  Upon
 * return, item will point to the dequeued item and the caller is responsible
 * for deallocating any memory used by that item.  If the queue is empty, item
 * will remain unchanged and -1 is returned.
 */

Saturday, December 8, 3:38 a.m.
There was a typo in the WORK_RESULT message specification. The table caption indicated "type=11" but the table showed type=1. The actual type ID is 11. The table has been fixed.
Saturday, December 8, 3:23 a.m.

A word on the 10-minute "limit".

First, note that the cluster has 112 cores and you are only given 1/7 of that. Under ideal conditions, 10 minutes on the whole cluster would translate to about 70 minutes on what you've been allocated. (Amdahl's Law says it really translates to less than that.) This is assuming, of course, that you are the only ones using your 4 machines. You are actually sharing those machines with about 9 other groups, so things can get much worse.
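
Spelled out, the arithmetic goes as follows. The second line is a simplification that assumes a fixed serial fraction s of the work (a symbol introduced here, not taken from the write-up) and ignores sharing with other groups.

```latex
\[
\frac{112\ \text{cores}}{7} = 16\ \text{cores},
\qquad
10\ \text{min} \times 7 = 70\ \text{min (ideal scaling)}
\]
\[
T(n) = T(1)\left(s + \frac{1-s}{n}\right),
\qquad
\frac{T(16)}{T(112)}
  = \frac{s + \frac{1-s}{16}}{s + \frac{1-s}{112}}
  \le 7
\]
```

The ratio equals 7 only when s = 0. Any serial portion takes the same time on 16 cores as on 112, which is why the real figure comes out below 70 minutes.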

Second, the "10 minutes" thing isn't meant to be a hard limit. In fact, with larger image sizes, a "quality 2" render will likely take much longer than 10 minutes on the whole cluster. With the default sizes, you can expect a "quality 2" render to take about 6 minutes on the whole cluster; 30 minutes with what you're given.

The point is that you should try to balance the workload as effectively as possible. One way to determine how well you're doing this is to look at the render reports, as detailed below in this FAQ.
Thursday, December 6, 5:45 p.m.
As part of the update below, use of csug01 has been disabled for rendering. Please use csug01 for the development of your client.
Thursday, December 6, 5:27 p.m.
The servers have been updated so that each group has access to 4 servers and can run 4 threads per machine. We expect this configuration to be permanent for the rest of the assignment. You can see the new configuration in the server configuration file. The bigger change is that the servers have been updated to print reports of each group's usage for a particular rendering. The usage parameters are written to the directory /usr/local/cs316/pa6_render_reports. Note that the statistics are not saved between renderings, so new files will overwrite the old. If you want to save the results, please copy the rendering reports into your own directory. In the subdirectory called examples, you can see the reports generated by our client for a loaded and a free set of servers. The README file in that directory contains an annotated report and more information.

The goal of these reports is to help you fine-tune your work manager. Our main requirement is that you implement a load-balanced work manager. Your goal should be to make the total time each rendering thread is alive about equal (wall time in the reports). You will never be able to do this perfectly, but within a few percent is more than good enough. If you achieve this across a variety of machine and thread counts, your work manager is well designed. You should also test that your work manager is rapidly feeding each render thread, but this will likely be a side effect of a load-balanced server. Your goal here is to make your CPU utilization about the same as your fraction of the rendering threads. However, the server cannot detect load besides that of the rendering threads, so load external to the servers will skew these numbers somewhat.
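
One possible way to turn "within a few percent" into a number, as a sketch: the function below is made up, and it assumes you have already copied the per-thread wall times out of a report into an array yourself. It does not parse the report files, whose format is described in the README.

```c
#include <stddef.h>

/* Hypothetical helper: returns the largest relative deviation of any
 * thread's wall time from the mean. For example, 0.03 means the worst
 * thread is 3% away from the mean.
 * Returns -1.0 if n is 0 or the mean is not positive. */
double max_relative_deviation(const double* wall_seconds, size_t n) {
    if (n == 0) {
        return -1.0;
    }
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += wall_seconds[i];
    }
    double mean = sum / (double)n;
    if (mean <= 0.0) {
        return -1.0;
    }
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dev = wall_seconds[i] - mean;
        if (dev < 0.0) {
            dev = -dev;
        }
        if (dev / mean > worst) {
            worst = dev / mean;
        }
    }
    return worst;
}
```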
Monday, December 3, 5:15 p.m.
A clarification on the queue specification. The enqueue function has the comment:
```
    /**
     * Adds the given item to the front of the queue.
     */
  
```
Clearly, since this is supposed to be a queue, you should be adding to the back of the queue, not the front.
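
For reference, first-in-first-out behaviour can be sketched with a singly linked list that keeps both a head and a tail pointer. The types and function names below are invented for this example, the assignment's queue_t may be laid out differently, and no locking is shown, so this version is not safe to share between threads as written.

```c
#include <stdlib.h>

/* HYPOTHETICAL singly linked queue; the assignment's queue_t may differ. */
typedef struct node {
    void* item;
    struct node* next;
} node_t;

typedef struct {
    node_t* head;   /* front: next item to be dequeued */
    node_t* tail;   /* back: most recently enqueued item */
} sketch_queue_t;

/* Adds item to the BACK of the queue. Returns 0, or -1 if out of memory. */
int sketch_enqueue(sketch_queue_t* q, void* item) {
    node_t* n = malloc(sizeof *n);
    if (n == NULL) {
        return -1;
    }
    n->item = item;
    n->next = NULL;
    if (q->tail != NULL) {
        q->tail->next = n;
    } else {
        q->head = n;           /* queue was empty */
    }
    q->tail = n;
    return 0;
}

/* Removes the item at the FRONT. Returns 0 and stores it in *item, or
 * returns -1 and leaves *item unchanged if the queue is empty. */
int sketch_poll(sketch_queue_t* q, void** item) {
    node_t* n = q->head;
    if (n == NULL) {
        return -1;
    }
    *item = n->item;
    q->head = n->next;
    if (q->head == NULL) {
        q->tail = NULL;        /* removed the last node */
    }
    free(n);
    return 0;
}
```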
Monday, December 3, 1:45 a.m.
It turns out that the CSUG machines are running an old version of the Linux kernel that has a race condition in its pselect implementation.

comm_helper.c and comm_helper.h have been updated with a workaround to this bug. To avoid the pselect race condition bug, you MUST apply this update and use pselect_316 instead of pselect.

To apply this update:
```
  cp /usr/local/cs316/pa6/comm_helper.* ~/pa6/
```
Sunday, December 2, 4:20 p.m.
A new version of netcat has been copied to:
```
  /usr/local/cs316/pa6/netcat
```
This version has all the features referred to in the section slides and in the assignment write-up.

If you do:
```
  cp /usr/local/cs316/pa6/netcat ~/pa6/
```
you will be able to conveniently run netcat out of your project directory.
Saturday, December 1, 12:50 p.m.

A sample workers.conf has been added to /usr/local/cs316/pa6/. Since the file format doesn't support comments, it's not very illustrative. You're better off reading the PA6 write-up for the file format.
Saturday, December 1, 12:30 a.m.

If you've looked at messages.h already, you'll find structs named request_steal_work and response_work_stolen. Please remove these. They are left over from a draft version of the protocol and no longer exist. /usr/local/cs316/pa6/messages.h has been updated to reflect this.
Friday, November 30, 2:42 p.m.

With regard to using external libraries in PA6, system libraries (e.g., stdio, stdlib, pthread, errno, netdb, arpa/*, sys/*) are okay to use. If you want to use anything else, or if there is any doubt, please check with us first.
Friday, November 30, 11:36 a.m.
To make your debugging lives easier, I've tweaked the image assembler a bit. If you terminate the client with CTRL+C, the client will dump the current state of the image before exiting. Also, if you send the client a SIGHUP (type "killall -HUP drt" in another shell on the same machine), the client will save the current state of the image and continue running.

You don't have to apply this update if you don't want to. If you think this will be useful, though, simply copy over assem.c from the cs316 directory into your project directory:
```
    cp /usr/local/cs316/pa6/assem.c ~/pa6/
```