For this assignment you are expected to implement three join algorithms and evaluate their relative performance.
Let us quickly review some of the modules you will need for this assignment. You should already be familiar with them from the the previous assignments.
In Minibase, a relation is implemented as a heap file, which is a collection of records. Records can be inserted or deleted from a heap file, and each record is uniquely identified by a record id. A scan is the interface used to access records in a heap file, one by one.
An index provides fast access to records in the heap file, and currently Minibase only supports B+-indices. Each entry in the index is a (key, record id) pair. Entries can be inserted or deleted from an index. An index search scan provides interface for accessing the records in the index.
Four classes are provided for you (i.e. you only need to call methods in these classes):
You need to implement the following three join operations.
Start with this one, it is the simplest to implement.
The algorithm for this join method can be found in your textbook. Since Minibase does not provide page-by-page access into a relation stored in a heap file, you will simulate a "block" with an array storing the records. The join function will take as a parameter, an integer B (block size = array size) of the outer relation. You are not required to implement the double buffering techniques or use hash tables.
The pseudocode for this join is:
For each block B in R
For each tuple s in S
For each tuple r in B
Match r with s
if Match then
Insert (r,s) into the result relation
Compare the performance of "Block nested loop" join for various block sizes.
Implement this algorithm based on the pseudo-code in the textbook.
First create an unclustered B+-tree index on the inner relation. Determine whether it is beneficial to build a B+-tree index for the purpose of performing a single join operation by comparing the cost of this join with that of other join methods.
Augment the buffer manager class given to you to collect statistics. Specifically, you should be able to tell how many PinPage requests have been made and how many of those miss. It is suggested that you write two functions to reset and report the statistics.
You can use the time() function (which returns the current time) to determine out the running time of your join algorithm. You will need to include time.h. Refer to the C++ help for more details on time().
You should let your algorithm run for a few times (the longer the better) and report the average running time. Avoid running other CPU or I/O intensive processes (Word, Internet Explorer, etc.) while collecting statistics since clock() reports actual time rather than CPU time. Print out average running times and the number of misses for every join algorithm you implement.
Your documentation should contains results of your experiments in table and graph format. You should include a brief analysis, explaining the results that you have submitted. (You should not just reproduce the formulae in the textbook.)
The source code for the project can be downloaded from the course management system. The directory contains the following files:
You should write your code in <join-method>.cpp and main.cpp. Functions in join.cpp and relation.cpp will be useful for writing your join methods and for debugging. The function SortFile() is particularly useful as an example on how to use HeapFile, Scan, BTreeFile and BTreeFileScan.
You need to follow certain coding convention for this assignment:
Create a zip file that contains the following files, and upload the zip file into the course management system by the deadline.
Keep a copy of the project in your own account just in case.
Your assignments will be graded according to the following criteria:
The following classes are particularly useful for this assignment. You don't need to know the rest to complete the assignment, but if you wish to study the code provided, they may be of help.