CS432 Assignment 4: Joins

Deadline: November 10, 11:59pm. No late submissions will be accepted. The course management system will not accept any submissions after the deadline, and you will receive 0% of the grade for this assignment if you do not turn it in by the deadline.
This is a group assignment. You will have to create groups using the course management system.
You can download any necessary files for the assignment using the course management system.
Here is a list of frequently asked questions.
Read the assignment description and the FAQ carefully before you start.
This assignment is worth 10% of your overall grade.

Goal

For this assignment you are expected to implement three join algorithms and evaluate their relative performance.

Background

Let us quickly review some of the modules you will need for this assignment. You should already be familiar with them from the previous assignments.

In Minibase, a relation is implemented as a heap file, which is a collection of records. Records can be inserted or deleted from a heap file, and each record is uniquely identified by a record id. A scan is the interface used to access records in a heap file, one by one.

An index provides fast access to records in the heap file. Currently Minibase supports only B+-tree indices. Each entry in the index is a (key, record id) pair. Entries can be inserted into or deleted from an index. An index search scan provides an interface for accessing the entries in the index.

Four classes are provided for you (i.e. you only need to call methods in these classes):

HeapFile - implements a heap file
Scan - scan interface to HeapFile
BTreeFile - implements B+-tree index
BTreeFileScan - implements scan interface to B+-tree

Assignment Requirements

You need to implement the following three join operations.

Tuple-at-a-time nested loop join

Start with this one; it is the simplest to implement.
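As a rough illustration of the control flow only, here is a minimal sketch of a tuple-at-a-time nested loop join. It assumes both relations fit in std::vector containers rather than Minibase heap files, and the Record type and the function name TupleNestedLoopJoin are hypothetical; your implementation should use HeapFile and Scan instead.

```cpp
#include <utility>
#include <vector>

// Hypothetical fixed-length record: one integer join key plus a payload.
struct Record {
    int key;
    int payload;
};

using JoinResult = std::vector<std::pair<Record, Record>>;

// TupleNestedLoopJoin
// PreCond:  none.
// PostCond: returns every pair (r, s) with r.key == s.key.
JoinResult TupleNestedLoopJoin(const std::vector<Record>& outer,
                               const std::vector<Record>& inner) {
    JoinResult result;
    for (const Record& r : outer) {      // one pass over the outer relation
        for (const Record& s : inner) {  // full rescan of the inner per outer tuple
            if (r.key == s.key) {
                result.emplace_back(r, s);
            }
        }
    }
    return result;
}
```

In the real assignment each loop would be driven by a scan on a heap file, so the inner scan has to be reopened for every outer tuple.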

Block nested loop join

The algorithm for this join method can be found in your textbook. Since Minibase does not provide page-by-page access into a relation stored in a heap file, you will simulate a "block" with an array of records. The join function will take an integer parameter B, the block size (equal to the array size) used for the outer relation. You are not required to implement double buffering or to use hash tables.

The pseudocode for this join is:

For each block B in R
    For each tuple s in S
        For each tuple r in B
            Match r with s
            if Match then
                Insert (r,s) into the result relation

Compare the performance of "Block nested loop" join for various block sizes.
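The pseudocode above can be sketched in C++ as follows. This is only an in-memory illustration: it assumes the relations are std::vector containers, and Record, blockSize and BlockNestedLoopJoin are hypothetical names. In your implementation the block would be filled from a Scan on the outer heap file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical fixed-length record: one integer join key plus a payload.
struct Record {
    int key;
    int payload;
};

using JoinResult = std::vector<std::pair<Record, Record>>;

// BlockNestedLoopJoin
// PreCond:  blockSize > 0.
// PostCond: returns every pair (r, s) with r.key == s.key.
JoinResult BlockNestedLoopJoin(const std::vector<Record>& outer,
                               const std::vector<Record>& inner,
                               std::size_t blockSize) {
    assert(blockSize > 0);
    JoinResult result;
    std::vector<Record> block;  // the array that simulates one block
    block.reserve(blockSize);
    for (std::size_t start = 0; start < outer.size(); start += blockSize) {
        // Fill the block with the next (up to) blockSize outer records.
        std::size_t end = std::min(start + blockSize, outer.size());
        block.assign(outer.begin() + static_cast<std::ptrdiff_t>(start),
                     outer.begin() + static_cast<std::ptrdiff_t>(end));
        // Scan the inner relation once per block, not once per outer tuple.
        for (const Record& s : inner) {
            for (const Record& r : block) {
                if (r.key == s.key) {
                    result.emplace_back(r, s);
                }
            }
        }
    }
    return result;
}
```

The number of inner scans is the number of blocks, so a larger blockSize means fewer passes over the inner relation. That is the effect the block-size comparison should expose.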

Index nested loop join

Implement this algorithm based on the pseudo-code in the textbook.

First create an unclustered B+-tree index on the inner relation. Determine whether it is beneficial to build a B+-tree index for the purpose of performing a single join operation by comparing the cost of this join with that of other join methods.
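To show the shape of the algorithm, here is a minimal sketch that uses std::multimap as a stand-in for the B+-tree index. This is an assumption made purely for illustration: the real implementation should use BTreeFile and BTreeFileScan, and the names Record, KeyIndex, BuildIndex and IndexNestedLoopJoin are hypothetical. The position of a record in the vector plays the role of a record id.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical fixed-length record: one integer join key plus a payload.
struct Record {
    int key;
    int payload;
};

using JoinResult = std::vector<std::pair<Record, Record>>;

// Stand-in for an unclustered index: each entry is a (key, record id) pair.
using KeyIndex = std::multimap<int, std::size_t>;

// BuildIndex
// PreCond:  none.
// PostCond: returns an index with one entry per record of inner.
KeyIndex BuildIndex(const std::vector<Record>& inner) {
    KeyIndex index;
    for (std::size_t rid = 0; rid < inner.size(); ++rid) {
        index.emplace(inner[rid].key, rid);
    }
    return index;
}

// IndexNestedLoopJoin
// PreCond:  index was built on inner.
// PostCond: returns every pair (r, s) with r.key == s.key.
JoinResult IndexNestedLoopJoin(const std::vector<Record>& outer,
                               const std::vector<Record>& inner,
                               const KeyIndex& index) {
    JoinResult result;
    for (const Record& r : outer) {
        // Probe the index instead of scanning the whole inner relation.
        auto range = index.equal_range(r.key);
        for (auto it = range.first; it != range.second; ++it) {
            assert(it->second < inner.size());
            result.emplace_back(r, inner[it->second]);  // fetch by record id
        }
    }
    return result;
}
```

When you measure this join, decide whether the time spent in the index-building step counts toward the join cost, since that is what the comparison with the other methods hinges on.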

Statistics Collection

Compare the relative performance of the three join algorithms (tuple nested loop join, block nested loop join, index nested loop join). Record the time each algorithm takes to run and the number of page misses it incurs.
Study the effect of the buffer pool size on the three algorithms by changing the buffer pool size (NUM_BUF_PAGES in main.cpp).
Submit documentation, tables and graphs of these statistics, together with analysis of the results.

Augment the buffer manager class given to you to collect statistics. Specifically, you should be able to tell how many PinPage requests have been made and how many of those miss. It is suggested that you write two functions to reset and report the statistics.
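One possible shape for these counters is sketched below. It is a standalone class with hypothetical names (BufStats, RecordPin, ResetStats, ReportStats); in the assignment the counters and the two functions would be added to the buffer manager class you were given, and the counting call would go inside its PinPage method.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical statistics holder for the buffer manager.
class BufStats {
public:
    // Call once per PinPage request; missed is true when the requested
    // page was not already in the buffer pool.
    void RecordPin(bool missed) {
        ++numPinRequests;
        if (missed) {
            ++numPinMisses;
        }
    }

    // Clears both counters; call before each join run.
    void ResetStats() {
        numPinRequests = 0;
        numPinMisses = 0;
    }

    // Prints both counters; call after each join run.
    void ReportStats() const {
        assert(numPinMisses <= numPinRequests);
        std::printf("PinPage requests: %ld, misses: %ld\n",
                    numPinRequests, numPinMisses);
    }

    long GetNumPinRequests() const { return numPinRequests; }
    long GetNumPinMisses() const { return numPinMisses; }

private:
    long numPinRequests = 0;
    long numPinMisses = 0;
};
```

Resetting before each run keeps the counts of one join from leaking into the next.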

You can use the time() function (which returns the current time) to determine the running time of your join algorithm. You will need to include time.h. Refer to the C++ help for more details on time().

You should run your algorithm several times (the more the better) and report the average running time. Avoid running other CPU- or I/O-intensive processes (Word, Internet Explorer, etc.) while collecting statistics, since time() reports actual elapsed time rather than CPU time. Print the average running time and the number of misses for every join algorithm you implement.
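A minimal sketch of averaging with time() follows. The helper name AverageRunSeconds and its parameters are hypothetical. Because time() has a resolution of one second, the sketch times the whole batch of runs and divides by the run count, rather than timing each run separately.

```cpp
#include <cassert>
#include <ctime>

// AverageRunSeconds
// PreCond:  numRuns > 0; joinFunc can be called with no arguments.
// PostCond: joinFunc has been called numRuns times; returns the average
//           elapsed (wall-clock) seconds per call.
template <typename JoinFunc>
double AverageRunSeconds(JoinFunc joinFunc, int numRuns) {
    assert(numRuns > 0);
    std::time_t start = std::time(nullptr);
    for (int i = 0; i < numRuns; ++i) {
        joinFunc();
    }
    std::time_t end = std::time(nullptr);
    // difftime returns the difference end - start in seconds as a double.
    return std::difftime(end, start) / numRuns;
}
```

For example, AverageRunSeconds([&]() { RunMyJoin(); }, 20) would time twenty runs of a hypothetical RunMyJoin function. If the total elapsed time is only a second or two, increase the number of runs or the relation sizes so that the one-second granularity does not dominate the result.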

Documentation

Your documentation should contain the results of your experiments in table and graph format. You should include a brief analysis explaining the results that you have submitted. (You should not just reproduce the formulae in the textbook.)

Assignment Details

Simplifications

You can assume that all records are fixed length.
You can assume that we only perform joins on integer fields.

Source code

The source code for the project can be downloaded from the course management system. The directory contains the following files:

join.cpp/.h - utility functions useful for writing join algorithms.
relation.cpp/.h - functions to create test relations.
<join-method>.cpp - each file contains a skeleton for implementing a particular join method.
main.cpp - main program.
spacemgr - subdirectory for space manager code (HeapFile class, DB class etc.)
btree - subdirectory for B+-Tree index code
bufmgr - subdirectory for buffer manager code
include - subdirectory of all .h files needed.

You should write your code in <join-method>.cpp and main.cpp. Functions in join.cpp and relation.cpp will be useful for writing your join methods and for debugging. The function SortFile() is particularly useful as an example on how to use HeapFile, Scan, BTreeFile and BTreeFileScan.

Coding conventions

You need to follow these coding conventions for this assignment:

Names of all classes, methods and types should start with a capital letter. Separate multiple words with capitals. For example, AddFileEntry.
Names of all members and variables should start with a lowercase letter. Separate multiple words with capitals. For example, numOfPages.
Names of all enums and macros should be all capitals. Separate multiple words with underscores. For example, MINIBASE_DB.
If you create a new function, make sure the header contains the necessary comments. (PreCond, PostCond, etc).
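The short fragment below illustrates these conventions together. All names in it (JoinMethod, CountMatches, numOfKeys, joinKey) are made up for the example and are not part of the provided source code.

```cpp
#include <cassert>

// Enum constants: all capitals, words separated by underscores.
enum JoinMethod { TUPLE_JOIN, BLOCK_JOIN, INDEX_JOIN };

// CountMatches
// PreCond:  keys points to at least numOfKeys integers.
// PostCond: returns how many of the first numOfKeys entries equal joinKey.
int CountMatches(const int* keys, int numOfKeys, int joinKey) {
    assert(keys != nullptr || numOfKeys == 0);
    int numOfMatches = 0;  // variables: lowercase first letter, then capitals
    for (int i = 0; i < numOfKeys; ++i) {
        if (keys[i] == joinKey) {
            ++numOfMatches;
        }
    }
    return numOfMatches;
}
```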

Submission procedure

Create a zip file that contains the following files, and upload the zip file into the course management system by the deadline.

blockjoin.cpp
indexjoin.cpp
tuplejoin.cpp
main.cpp
documentation.doc

Keep a copy of the project in your own account just in case.

Grading criteria

Your assignments will be graded according to the following criteria:

Correctness (70%): You will get full marks for a correct implementation. Partial credit will be given for partially correct answers.
Coding style (10%): You are expected to write neat code. Code should be properly indented and commented. You should follow the coding conventions.
Documentation (20%): Your documentation should contain the results of your experiments in table and graph format. You should include a brief analysis explaining the results that you have submitted. (You should not just reproduce the formulas in the textbook.) Call your document documentation.doc.

Hints

Study the Scan class. Learn how to access records in a heapfile. Study the SortFile function in join.cpp.
Use small numbers of records (by changing the constants in join.h) for testing.
As a rough estimate, each join algorithm should be around 100 lines of code.
You can change the interfaces to the join functions provided (to specify the output file/relation etc.).
The user interface will not affect your grade. However, you should provide a menu interface in main.cpp that:

allows the TAs to select the join algorithm to run
allows the TAs to specify the name of the output file (where you should store your join results)
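A minimal sketch of such a menu is shown below. The names JoinMethod, MenuChoice and ReadMenuChoice are hypothetical, and the function reads from a generic input stream so that it can be exercised without a keyboard; in main.cpp you would pass std::cin and std::cout.

```cpp
#include <istream>
#include <ostream>
#include <sstream>  // std::istringstream is handy for testing the menu
#include <string>

enum JoinMethod { TUPLE_JOIN = 1, BLOCK_JOIN = 2, INDEX_JOIN = 3 };

struct MenuChoice {
    JoinMethod method;
    std::string outputFileName;
};

// ReadMenuChoice
// PreCond:  in and out are usable streams.
// PostCond: returns true and fills choice if a valid join number and an
//           output file name were read; returns false otherwise.
bool ReadMenuChoice(std::istream& in, std::ostream& out, MenuChoice& choice) {
    out << "1) Tuple nested loop join\n"
        << "2) Block nested loop join\n"
        << "3) Index nested loop join\n"
        << "Select join algorithm: ";
    int selection = 0;
    if (!(in >> selection) || selection < TUPLE_JOIN || selection > INDEX_JOIN) {
        return false;
    }
    out << "Output file name: ";
    std::string fileName;
    if (!(in >> fileName)) {
        return false;
    }
    choice.method = static_cast<JoinMethod>(selection);
    choice.outputFileName = fileName;
    return true;
}
```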

Reference

The following classes are particularly useful for this assignment. You don't need to know the rest to complete the assignment, but if you wish to study the code provided, they may be of help.

struct KeyDataType