CS 3410 Programming Assignment 4

Programming Assignment 4: Distributed Image Renderer

Implementation

Analysis

Client Skeleton

How to Get Started

What to Submit

Help and Hints

Overview For this assignment, you will write the scheduling algorithm for a distributed rendering system. This assignment has two components: an implementation component and an analysis component. For your implementation, you will write a thread safe stack and a basic work stealing scheduler. For the analysis, you will run your scheduler on several different simulated workloads (images to be rendered) and several different simulated backends (rendering clusters). Your analysis will test several different scheduling decisions and empirically determine their effects on performance when using different hardware backends. As a final part of your analysis, we will start specialized rendering servers on each of the CSUG lab machines and provide a backend to your scheduler that can connect to them. Using the results from your previous simulated analysis, you will determine optimum scheduling algorithm for a set of real images on the real CSUG machines.

Implementation

System Architecture We have provided a basic skeleton that will perform the simulation of different cluster backends including the real rendering servers on the CSUG machines. You will write additional code that is compiled into the skeleton to produce a final scheduler. You will be required to write three new C files:

worker.c contains the code that performs the scheduling operations
stack.h contains the definition of the thread safe stack used by your workers
stack.c contains the implementation of your thread safe stack

Feel free to browse the remainder of the skeleton but, at the end of the assignment, only code placed in these three files will be graded.

Your first task should be to implement the thread safe stack. It can be tested independently from the rest of your implementation. The details of your stack implementation are entirely up to you as long your code implements the basic last-in, first-out (LIFO) stack operations and your stack is robust to simultaneous access from any number of threads.

Once you have a correctly working stack implementation, you can move on to the scheduler. At a high level, your scheduler will be implemented by the actions of a pool of worker threads. Each worker will execute a single method work, described in a C-like pseudocode as

            void work(stack_t stack)
            {
          while(!image.done())
              {
                if(!stack.empty())
                {
                work_t to_do = stack.pop();
                  if(!render(to_do))
                    subdivide(to_do, stack);
}
                else if(!image.done())
                {
                cancel_worker();
                }
              }
          }

In plain language, each worker has acccess to a shared stack of all the work that remains to be completed. If the image is not yet finished, a worker tries to find new work from two sources. First, it checks to see if there is unallocated work in the stack. If there is, it takes the first piece off the stack and begins to render it. Second, if there is no unallocated work, it checks to see if there are other workers currently working on other parts of the image. If there are, it chooses one and cancels its current work. When a thread is cancelled, its call to render returns false. It then subdivides its work into smaller units and then pushes them onto the stack for other threads to help complete. The actions of your scheduler can be boiled down to two decisions: a) which worker to cancel if no free work is available and b) how to subdivide a work unit if a worker is cancelled. You will try several different options during your analysis.

For this assignment, you are free to design your workers as you see fit; but in order to connect with the rest of the skeleton with your worker.c file must contain one method:

            void do_work();

This method is called by the main method and has three main purposes. First, it needs create any data structures used by your workers to perform their scheduling (like, say, a stack). Second, it must start all the worker threads. Third, it must make an initial distribution of work to the workers. The parameters required for initialization are provided in a global options variable described below.

Analysis

The full analysis will take roughly about 2 hours assuming you have a perfectly correct solution. To account for bugs revealed during analysis that will need fixing, you should begin your analysis early (as early as Saturday might not be a bad idea).

Your analysis will have 5 steps. Each step will add complexity to the overall scheduler. However, you may find it more efficient to implement and test all the functionality beforehand. You may then conditionally enable and disable functionality using #ifdef and #define as needed. For the first 3 analyses, your scheduler should not use cancel.

Compute the block overhead
There is a fixed cost to issue a rendering request for a given block of any size. Using the fixed parameters to drt -i uniform -s 512 -w 64 vary the size of the initial blocks of work you place in your stack. Try 1x1, 2x2, 4x4, and 8x8. For each block size, compute a block overhead cost and report the average cost for all block sizes.
Compute the best initial block size for a uniform image

This analysis finds the best initial block size for different numbers of threads. We assume that the cost of rendering pixels is uniform across the image; i.e., each pixel takes approximately the same time to render. Do the following analysis for 16 and 64 threads using 128x128 images: render the uniform image with block sizes 1x1, 2x2, 4x4, ... 64x64. Record the total image time for each block size. What is the best number of block size for a given number of threads? Why do you think this number of blocks is best?

Compute the best initial block size for a non-uniform image

Next repeat the previous analysis for two non-uniform images; these images have some pixels that are much more expensive to render than others. Use the two images called, expensive_blocks-1000.0-16-4 and expensive_blocks-1000.0-16-8. In these images, there are either four or eight 16x16 blocks which are 1000 times more expensive than the rest of the blocks in the image. For each image compute the best block size for a given number of threads. Why do you think these blocks sizes are best?

Workload tests with adaptive refinement and random cancel.

Change your scheduler so that when no work is available, a worker cancels a random other worker if it still is working. Have the cancelled worker divide its block into four equal sized child blocks and place them on the global work stack. Run this new scheduler on the following images 512x512 images: uniform, expensive_blocks-1000.0-16-4, expensive_blocks-1000.0-16-8, box-l, kitchen-q, roulette-e and temple-l. Use an initial blocksize of 16x16 and 64 worker threads. Report the total time to render each image. Does your scheduler perform better or worse for these images? Why do you think that this is so?

Workload tests with adapative refinement by random cancel and with endgame compensation.

Now change your scheduler so it does not cancel blocks that are within some degree of tolerence. To do this you will first have to track a little more information in your code: for each thread and each block size it renders, track the average time taken to compute that block size. You should cancel another thread only if it has taken more than two times the expected amount of time. If you do not have a time expectation for a given size, you should approximately derive it from the next larger block size. Repeat the analysis from part 4 with your new scheduler. Report the total time to render each image. Your scheduler should perform better on the two expensive block images. Why do you think that this is so?

After all your analysis is complete, compile all results into a short 1 or 2 page summary and submit your report with your final code.

Client Skeleton

As a starting point, we have provided you with a code skeleton for your scheduler. In it, you will find several C source files, a makemake script, a Makefile.am, and six other text files. You can ignore these extra text files, but do not delete them. They are needed for the GNU build system mentioned below.

While you should examine the skeleton to determine the functions of various components, remember that you will only submit code in the worker.c, stack.h and stack.c files.

Obtaining the code The code is available on the CSUG machines under /courses/cs3410/pa4. To copy it into your home directory on the CSUG machines, type:

  cp -R /courses/cs3410/pa4 ~

If you prefer to use your own machine, you can scp it from the CSUG machines into your home directory:

  scp -r netid@csug01.csuglab.cornell.edu:/courses/cs3410/pa4 ~

We will not provide support for developing on non-CSUG machines. You are responsible for ensuring that your project will compile and run correctly on CSUG.

Compiling the project This project uses the GNU configure and build system to automatically detect dependencies between the source files and generate an appropriate Makefile. Simply run the makemake script to create a Makefile. Once you have a Makefile, you can use make as usual to compile your project.

You should re-run makemake if you change the dependencies between your source files (i.e., add or remove any #includes that reference other files in your project). If you fail to do any of this, your project will still compile, but you will have to "make clean" more often, as make's incremental build functionality will not know about your new files/dependencies.

The code will compile only on csug01—csug10. During the analysis phase of the project, each group will be assigned one of these boxes to spread out the load. The code will not compile on csug11—csug15.

Running the client Your client will be compiled into a binary called drt. Running drt with no arguments will start the client with the default configuration. The client supports several configuration options for adjusting the workload and backend parameters. To see what options are accepted by the client, run drt -h.

Important Methods. The skeleton provides three methods essential for writing your worker. They are available by importing the main.h header file and provide high level communication between rendering servers.

int render(pixel_t* out_pixels, int x1, int y1, int x2, int y2, int* out_rid);

The first method render connects with a rendering server and produces output pixels. The call will block until the rendering is either complete or cancelled by another thread. The method returns either the number of pixels rendered or -1 if the rendering was cancelled (see below). The four parameters x1, y1, x2, and y2 specify a rectangle of pixels to be rendered. The first coordinate specifies the top left corner of the rectangle and the second the bottom right. The first argument, out_pixels, is a pointer to a previously allocated array of pixel_t structs. At the end of the render call elements in this array will be set with the output pixels. The final argument out_rid is the address of an integer. Before render begins this integer will be set to a unique identifier for this specific rendering call. This id can be used to cancel this rendering operation.

int cancel(int rid);

The second method cancels an ongoing rendering operation. To cancel a render operation you need to pass to cancel the integer id (out_rid) set for an ongoing rendering operation. The method returns a 0 if the cancelling operation was successful and -1 otherwise. A cancelling operation can fail because a thread that is cancelled may asynchronously complete before the cancel call finishes. Always check that cancel was successful.

void image_set_pixel(int x, int y, int c);

The final method sets a pixel color in the final image. The three arguments specify the pixel position and its color (encoded as an integer).

Code layout The code should be fairly well-commented. You should examine base skeleton to determine how the overall system works; however you should only edit the required files. The other files in the project are used for:

main.h

This contains the method declarations for render, image_set_pixel and cancel.. It also contains the global options variable that stores the client's configuration parameters, parsed from both the command line and workers.conf.

main.c, opts.c
This contains the main() function as well as code for parsing the command-line options.
config.h
Produced by the GNU Make system this file contains parameters the define the overal compilation environment.
comm.h, comm.c, comm_user.c, comm_emulate.c, comm_renderer.c
These methods provide the infrastructure for communication to the simulator and to the CSUG rendering machines.
assem.h
This contains the interface to the image assembler.

How to Get Started

This project will be larger than the others and deal with lower level systems programming than you are likely to be already familiar with. To get started, it's best to have a plan of attack.

Write the thread safe stack. To start, you should implement the thread-safe stack. The stack is the basis of the client's inter-thread communication and debugging the stack once the rest of the client is written will be difficult. Design your stack and test it as a separate unit first. This will also familiarize you with blocking, locking and threads.

Write and test the workers and their schedulering on a uniform workload. Using a uniform workload you can run simple tests and check the sanity of your worker implementation. Try varying the number of thread or the basic block size and monitor performance. Make sure that your scheduler works as expected in several trial configurations.

Command Line

To run your program, the command line is: ./drt [options] [initial_blocksize]. For example, ./drt -w 16 -i kitchen-q 16. The options are as follows.

-w num
The number of worker threads that must be started. Once parsed, the value is stored in a global variable. Your code that starts the worker threads must use this value.
-i scene
The name of the scene to render. An error is printed if the data for the scene is missing. Check the data directory for possible scene names.
-s width
The width of the image to render in pixels. Check the data directory for possible sizes.
initial_blocksize
Provided optionally after all other arguments. When omitted, the default value set in stack.h is used. Your do_work function should use this value.
-d
Increasing the level of debugging output. Can be present multiple times. The more debugging output that needs to be computed or printed, the slow the program will run.
- -d
  Prints out various timing and overhead statistics. See the STOPWATCH_* macros to debug overheads in your code, and the DEBUG macro to print output at this debugging level.
- -d -d
  Instead of coloring the image by the computed pixel, the image will be colored based on the thread that set the pixel. This allows you to visualize how well your work distribution algorithm is striping your threads through the image. Also, to further help debug striping issues, final image will look noisy irrespective of debug level if multiple threads render the same pixel.
- -d -d -d
  Print extremely finegrained information about what is happening. The TRACE macro is used to print output at this debugging level.
You can force the program to dump the image computed at any time by running killall -SIGUSR1 drt from another terminal. The -t option can be used to visualize the render cost of each pixel.
Additional options can be found in the code.

What to Submit

Submit worker.c, stack.h and stack.c to CMS by the due date. Submit the 1 to 2 page analysis report. In the final demo, we will run your code to produce an image. For this test, we will use the entire rendering cluster. During this process you will discuss your code's design and explain the details of your implementation.

Help and Hints

Ask the TAs and consultants for help. You can contact us through the course staff mailing list or the class newsgroup. We expect to see most students in office hours during the course of the project. Extra hours will be scheduled as needed.

If you suspect a bug in the servers or in the client framework, ask the course staff for help.