Programming Assignment 4: Distributed Image Renderer
Overview
For this assignment, you will write the scheduling algorithm for a
distributed
rendering system. This assignment has two components: an implementation
component and an analysis component. For your implementation, you
will
write a thread safe stack and a basic work stealing scheduler.
For the
analysis, you will run your scheduler on several different
simulated workloads (images to be rendered) and several different
simulated backends (rendering clusters). Your analysis will test
several different scheduling decisions and empirically determine their
effects on performance when using different hardware backends. As
a final
part of your analysis, we will start specialized
rendering servers on each of the CSUG
lab machines and provide a backend to your scheduler that can connect
to them. Using the results from your previous simulated analysis,
you will determine optimum scheduling algorithm for a set of real
images on the real CSUG machines.
Implementation
System Architecture We have provided a basic skeleton that will perform the simulation of different cluster backends including the real rendering servers on the CSUG machines. You will write additional code that is compiled into the skeleton to produce a final scheduler. You will be required to write three new C files:- worker.c contains the code that performs the scheduling operations
- stack.h contains the definition of the thread safe stack used by your workers
- stack.c contains the
implementation of your thread safe stack
Your first task should be to implement the thread safe stack. It can be tested independently from the rest of your implementation. The details of your stack implementation are entirely up to you as long your code implements the basic last-in, first-out (LIFO) stack operations and your stack is robust to simultaneous access from any number of threads.
Once you have a correctly working stack implementation, you can move on to the scheduler. At a high level, your scheduler will be implemented by the actions of a pool of worker threads. Each worker will execute a single method work, described in a C-like pseudocode as
void work(stack_t stack)
{
while(!image.done())
{
if(!stack.empty())
{
work_t to_do = stack.pop();
if(!render(to_do))
subdivide(to_do, stack);
}
else if(!image.done())
{
cancel_worker();
}
}
}
In plain language, each worker has acccess to a shared stack of all the work that remains to be completed. If the image is not yet finished, a worker tries to find new work from two sources. First, it checks to see if there is unallocated work in the stack. If there is, it takes the first piece off the stack and begins to render it. Second, if there is no unallocated work, it checks to see if there are other workers currently working on other parts of the image. If there are, it chooses one and cancels its current work. When a thread is cancelled, its call to render returns false. It then subdivides its work into smaller units and then pushes them onto the stack for other threads to help complete. The actions of your scheduler can be boiled down to two decisions: a) which worker to cancel if no free work is available and b) how to subdivide a work unit if a worker is cancelled. You will try several different options during your analysis.
For this assignment, you are free to design your workers as you see fit; but in order to connect with the rest of the skeleton with your worker.c file must contain one method:
void do_work();
This method is called by the main method and has three main purposes. First, it needs create any data structures used by your workers to perform their scheduling (like, say, a stack). Second, it must start all the worker threads. Third, it must make an initial distribution of work to the workers. The parameters required for initialization are provided in a global options variable described below.
Analysis
Your analysis will have 5 steps. Each step will add complexity to the
overall scheduler. However, you may find it more efficient to implement
and test all the functionality beforehand. You may then conditionally
enable and disable functionality using #ifdef and #define as needed.
For the first 3 analyses, your scheduler should not use cancel.
- Compute the block overhead
There is a fixed cost to issue a rendering request for a given block of any size. Using the fixed parameters to drt -i uniform -s 512 -w 64 vary the size of the initial blocks of work you place in your stack. Try 1x1, 2x2, 4x4, and 8x8. For each block size, compute a block overhead cost and report the average cost for all block sizes.
- Compute the best initial block size for a uniform image
- Compute the best initial block size for a non-uniform image
- Workload tests with adaptive refinement and random cancel.
- Workload tests with adapative refinement by random cancel and with endgame compensation.
This analysis finds the best initial block size for different numbers of threads. We assume that the cost of rendering pixels is uniform across the image; i.e., each pixel takes approximately the same time to render. Do the following analysis for 16 and 64 threads using 128x128 images: render the uniform image with block sizes 1x1, 2x2, 4x4, ... 64x64. Record the total image time for each block size. What is the best number of block size for a given number of threads? Why do you think this number of blocks is best?
Next repeat the previous analysis for two non-uniform images; these images have some pixels that are much more expensive to render than others. Use the two images called, expensive_blocks-1000.0-16-4 and expensive_blocks-1000.0-16-8. In these images, there are either four or eight 16x16 blocks which are 1000 times more expensive than the rest of the blocks in the image. For each image compute the best block size for a given number of threads. Why do you think these blocks sizes are best?
Change your scheduler so that when no work is available, a worker cancels a random other worker if it still is working. Have the cancelled worker divide its block into four equal sized child blocks and place them on the global work stack. Run this new scheduler on the following images 512x512 images: uniform, expensive_blocks-1000.0-16-4, expensive_blocks-1000.0-16-8, box-l, kitchen-q, roulette-e and temple-l. Use an initial blocksize of 16x16 and 64 worker threads. Report the total time to render each image. Does your scheduler perform better or worse for these images? Why do you think that this is so?
Now change your scheduler so it does not cancel blocks that are within some degree of tolerence. To do this you will first have to track a little more information in your code: for each thread and each block size it renders, track the average time taken to compute that block size. You should cancel another thread only if it has taken more than two times the expected amount of time. If you do not have a time expectation for a given size, you should approximately derive it from the next larger block size. Repeat the analysis from part 4 with your new scheduler. Report the total time to render each image. Your scheduler should perform better on the two expensive block images. Why do you think that this is so?
Client Skeleton
Obtaining the code The code is available on the CSUG machines under /courses/cs3410/pa4. To copy it into your home directory on the CSUG machines, type:
cp -R /courses/cs3410/pa4 ~If you prefer to use your own machine, you can scp it from the CSUG machines into your home directory:
scp -r netid@csug01.csuglab.cornell.edu:/courses/cs3410/pa4 ~
Compiling the project This project uses the GNU configure and build system to automatically detect dependencies between the source files and generate an appropriate Makefile. Simply run the makemake script to create a Makefile. Once you have a Makefile, you can use make as usual to compile your project.
You should re-run makemake if you change the dependencies between your source files (i.e., add or remove any #includes that reference other files in your project). If you fail to do any of this, your project will still compile, but you will have to "make clean" more often, as make's incremental build functionality will not know about your new files/dependencies.
Running the client
Your client will be compiled into a binary called drt.
Running
drt with no arguments will start the client with the default
configuration. The client supports several configuration options for
adjusting the workload and backend parameters. To see what options are
accepted by the client, run drt
-h.
Important Methods. The
skeleton provides three methods essential for writing your
worker. They are available by importing the main.h header file and provide
high level communication between rendering servers.
int render(pixel_t* out_pixels, int
x1, int y1, int x2, int y2, int* out_rid);
The first method render
connects with a rendering server and produces output pixels. The
call will block until the rendering is either complete or cancelled by
another thread. The method
returns either the number of pixels rendered or -1 if the rendering was
cancelled (see below). The four parameters x1, y1, x2, and y2 specify a rectangle of
pixels to be rendered. The first coordinate specifies the top
left corner of the rectangle and the second the bottom right. The
first argument, out_pixels,
is a pointer to a previously allocated array of pixel_t structs. At the
end of the render call
elements in this array will be set with the output pixels. The
final
argument out_rid is the
address of an integer. Before render begins this integer will be
set to a unique identifier for this specific rendering call. This
id can be used to cancel this rendering operation.
int
cancel(int rid);
The second method cancels an ongoing rendering operation. To
cancel a render operation you need to pass to cancel the integer id (out_rid) set for an ongoing
rendering operation. The method returns a 0 if the
cancelling operation was successful and -1 otherwise. A
cancelling
operation can fail because a thread that is cancelled may
asynchronously complete before the cancel call finishes. Always
check that cancel was successful.
void
image_set_pixel(int x, int y, int c);
The final method sets a pixel color in the final image. The
three arguments specify the pixel position and its color (encoded as an
integer).
- main.h
This contains the method declarations for render, image_set_pixel and cancel.. It also contains the global options variable that stores the client's configuration parameters, parsed from both the command line and workers.conf.
- main.c, opts.c
This contains the main() function as well as code for parsing the command-line options.
- config.h
Produced by the GNU Make system this file contains parameters the define the overal compilation environment.
- comm.h, comm.c, comm_user.c, comm_emulate.c, comm_renderer.c
These methods provide the infrastructure for communication to the simulator and to the CSUG rendering machines.
- assem.h
This contains the interface to the image assembler.
How to Get Started
Command Line
To run your program, the command line is: ./drt [options] [initial_blocksize]. For example, ./drt -w 16 -i kitchen-q 16. The options are as follows.
-
-w num
The number of worker threads that must be started. Once parsed, the value is stored in a global variable. Your code that starts the worker threads must use this value. -
-i scene
The name of the scene to render. An error is printed if the data for the scene is missing. Check the data directory for possible scene names. -
-s width
The width of the image to render in pixels. Check the data directory for possible sizes. -
initial_blocksize
Provided optionally after all other arguments. When omitted, the default value set in stack.h is used. Your do_work function should use this value. -
-d
Increasing the level of debugging output. Can be present multiple times. The more debugging output that needs to be computed or printed, the slow the program will run.-
-d
Prints out various timing and overhead statistics. See the STOPWATCH_* macros to debug overheads in your code, and the DEBUG macro to print output at this debugging level. -
-d -d
Instead of coloring the image by the computed pixel, the image will be colored based on the thread that set the pixel. This allows you to visualize how well your work distribution algorithm is striping your threads through the image. Also, to further help debug striping issues, final image will look noisy irrespective of debug level if multiple threads render the same pixel. -
-d -d -d
Print extremely finegrained information about what is happening. The TRACE macro is used to print output at this debugging level.
-
-
You can force the program to dump the image computed at any time by running killall -SIGUSR1 drt from another terminal. The -t option can be used to visualize the render cost of each pixel.
-
Additional options can be found in the code.
What to Submit
Help and Hints
Ask the TAs and consultants for help. You can contact us through the course staff mailing list or the class newsgroup. We expect to see most students in office hours during the course of the project. Extra hours will be scheduled as needed.
If you suspect a bug in the servers or in the client framework, ask the course staff for help.