Huffman Compression

Instructions: Remember, all assignments in CS 3410 are individual. You must submit work that is 100% your own. Remember to ask for help from the CS 3410 staff in office hours or on Ed! If you discuss the assignment with anyone else, be careful not to share your actual work, and include an acknowledgment of the discussion in a collaboration.txt file along with your submission.

The assignment is due via Gradescope at 11:59pm on the due date indicated on the schedule.

Submission Requirements

You will submit your completed solution to this assignment on Gradescope. You must submit:

  • huffman.c, which will contain part of your work for Task 0 and all of your work for Tasks 1 and 2.
  • priority_queue.c, which will contain part of your work for Task 0.

Restrictions

  • You may not modify any files other than huffman.c and priority_queue.c. (You won’t be submitting the other files anyway.)

Provided Files

  • priority_queue.h, which is a header file that defines the specification for the priority queue.
  • priority_queue.c, which will contain your implementation of a priority queue and stack. You will modify this file.
  • huffman.h, which is a header file that defines the types and functions you will need to implement Huffman compression.
  • huffman.c, which will contain your implementation for the Huffman compression system. You will also modify this one.
  • bit_tools.h, which is a header file that defines the BitWriter and BitReader structs and their respective functions for reading and writing binary values from files.
  • bit_tools.c, which contains the implementation of the functions for BitWriter and BitReader.
  • utils.h, which contains utility functions for printing lists and tree nodes.
  • utils.c, which contains the implementation for the utility functions.
  • Makefile, which contains the build tools for this assignment.
  • test_priority_queue.c, which contains functions to test your implementation for Task 0. You may add tests here, but you will not turn this file in.
  • test_huffman.c, which contains functions to test your implementation for Task 1. You may also modify this file, as above.
  • cu_unit.h, which contains the macro definitions that you’ll use for unit testing.

Remember, do not modify other source files except the ones containing your implementation. We will grade your submission with “stock” versions of the supporting files.

Overview

In this assignment you will implement a data compression system using Huffman coding. Huffman compression is an encoding scheme that uses fewer bits to encode more frequently appearing characters and more bits to encode less frequently appearing characters. It is used by ZIP files, among many other things. The high-level overview of the algorithm is:

  1. Calculate the frequency of each character in the data. (Task 0)
  2. Build a Huffman tree using the frequencies. (Task 1)
  3. Build an encoding table using the Huffman tree. (Task 2)
  4. Encode each character in the data using your encoding table. (Task 2)

In the lab, you will implement a priority queue in C. You’ll use this to build your Huffman tree. The bulk of the work for this assignment will come from understanding the Huffman coding algorithm and manipulating data structures in C.

Huffman Compression Algorithm

Your implementation will read a single text file as input and produce two output files: a compressed data file and a coding table file that encodes enough information to allow decompression. (This assignment does not include decompression; we have given you a decompressor implementation.) Task 2 describes the format for these files.

Before moving on to the tasks, let’s break down the Huffman compression algorithm. You may recall that ASCII is a straightforward way to represent characters: every character is encoded with 8 bits (1 byte), which gives 256 possible values. This means that if we use standard ASCII encodings to represent a text file, each character in the file requires exactly 1 byte. This is inefficient, as most text streams don’t actually use all 256 possible characters, and the characters they do use don’t appear equally often. The basic idea behind Huffman encoding is as follows: use fewer bits to represent characters that occur more frequently.

Consider the string go go gophers. Because g and o appear so much more often, can we construct an encoding that uses a small number of bits for these characters, and possibly a larger number of bits for characters like h? That’s the goal with Huffman coding.

The core data structure we need is a binary tree with characters at the leaves. Each edge in the tree will correspond to a bit: a left edge corresponds to 0 and a right edge corresponds to 1. To get the encoding for a character, traverse the tree from the root to the leaf and concatenate all the corresponding bits.

Here’s a Huffman tree that contains all the characters in our string, go go gophers:

Huffman Tree

We have labeled each leaf with the frequency of that character. Internal nodes also have a frequency number that is the sum of all the frequencies of the children.

Here’s a table that shows the binary code for each character, according to this tree:

Character    Binary code
(space)      101
e            1100
g            00
h            1101
o            01
p            1110
r            1111
s            100

Remember, you get the encoding by traversing the path from the root to the character, using a 0 for every left edge and a 1 for every right edge.

The Huffman tree ensures that characters that are more frequent in the input receive shorter encodings, and characters that are less frequent receive longer encodings. Our goal is to construct the Huffman tree, write the coding table, and write the compressed file using these shorter encodings.
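To make the savings concrete, multiply each character’s frequency in go go gophers by the length of its code in the table above:

```
g: 3 × 2 = 6 bits    o: 3 × 2 = 6 bits    (space): 2 × 3 = 6 bits    s: 1 × 3 = 3 bits
e: 1 × 4 = 4 bits    h: 1 × 4 = 4 bits    p: 1 × 4 = 4 bits          r: 1 × 4 = 4 bits
```

That totals 37 bits, which fit in 5 bytes after padding, versus 13 × 8 = 104 bits (13 bytes) for the one-byte-per-character ASCII encoding.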

Getting Started

To get started, obtain the release code by cloning your assignment repository from GitHub:

$ git clone git@github.coecis.cornell.edu:cs3410-2024fa/<NETID>_huffman.git

Replace <NETID> with your NetID. All characters in your NetID should be lowercase.

Assignment Outline

  • Task 0: You will complete Task 0 in lab. You will implement a priority queue in C as well as the calc_frequencies function in huffman.c.
  • Task 1: You will implement the algorithm to create a Huffman tree.
  • Task 2: You will implement the functions write_coding_table and write_compressed to write the coding table and compressed bytes to distinct files.

Implementation

(Lab) Task 0: Implementing a priority queue and frequency counter

Before starting, make sure you’ve cloned the release code by following the instructions in Getting Started.

Step 1: Implement a priority queue

The code for this portion is located in priority_queue.c, which is provided to you in the release code. In this step, you’ll build a priority queue that accepts a “generic” data type. This is accomplished by storing a pointer to an arbitrary piece of memory that can store anything by using void*. We’ve provided a header file that defines the PQNode type as well as the function declarations for the functions you are required to implement.

Your implementation will go in priority_queue.c. We’ve provided a basic test suite in test_priority_queue.c. You will implement the following functions:

  • PQNode *pq_enqueue(PQNode **a_head, void *a_value, int (*cmp_fn)(const void *, const void *)): Add a new node with value a_value to a priority queue, using function cmp_fn(...) to determine the ordering of the priority queue.
  • PQNode *pq_dequeue(PQNode **a_head): Detach and return the head. Note, the caller is responsible for freeing the detached node, and any memory it refers to. Do not call free.
  • void destroy_list(PQNode **a_head, void (*destroy_fn)(void *)): Deallocates the priority queue. This should call the destroy_fn function on every data element, and it should free the list nodes.
  • PQNode *stack_push(PQNode **stack, void *a_value): Add a new node with value a_value to the front of the list.
  • PQNode *stack_pop(PQNode **stack): Detach and return the head of the list. Note, this function is extremely similar to pq_dequeue.

The last two functions enable us to use the same data structure as a stack when needed. You probably will not use the stack in your Huffman compression system, but the decompression system needs one to work properly. If you can implement pq_enqueue and pq_dequeue, implementing stack_push and stack_pop should be very easy.

We’ve provided a test file called test_priority_queue.c. Running rv make pqtest from the command line will build an executable called test_priority_queue, which you can then run by typing rv qemu test_priority_queue.

The tests use the header file cu_unit.h, which defines various macros that help you write unit tests. In general, tests should be structured like so:

static int _test_name() {
    cu_start();
    //-------------------
    // Setup code - build a list, declare a variable, call a function, etc. 
    cu_check(/*condition you want to check*/);
    // ... add as many checks as you want
    //-------------------
    cu_end();
}

int main(int argc, char* argv[]) {
    cu_start_tests(); // Indicate start of test suite
    cu_run(_test_name); // Don't forget to run the test in `main`
    cu_end_tests(); // Indicate end of the test suite
} 

Upon running the test, you’ll see one of the two following messages:

Test passed: _test_name

which will be displayed in green, or:

Test failed: _test_name at line x

which will be printed in red, and give the line that failed. We’ve provided two simple tests in the release code that check the behavior of your priority queue and stack. You are encouraged to add more tests to verify the functionality of your implementation. You will not be turning in test_priority_queue.c, however, so this will not be graded.

Generic data types

You might notice some strange looking syntax in these function declarations. This is to enable generic data types. The PQNode struct contains a void*, which you can think of as a memory address to any type. This allows you to use the same code for linked lists of any type.

You can assign a void* to an address of any type. This is why you can write code like:

char* s = malloc(...);

even though malloc(...) returns a void*, not a char*. This is also similar to the way functions such as qsort(...) allow you to sort arrays of any type.

Function addresses

Code that deals with generic data types often needs to take functions as parameters. To do this, you pass the address of a function as an argument: you declare the parameter (in this case cmp_fn) as the address of a function that takes parameters of specified types and returns a value of a specified type. The compare function always returns an integer, and its arguments can point to anything, depending on the underlying data in the nodes of the priority queue.

Let’s look at an example:

void _print_square(int n) {
    printf("%d squared is %d\n", n, n * n);
}

void _print_cube(int n) {
    printf("%d cubed is %d\n", n, n * n * n);
}

void _call_print_fn(int n, void(*print_fn)(int)) {
    print_fn(n);
}

int main(int argc, char* argv[]) {
    _call_print_fn(4, _print_square); // Prints 16
    _call_print_fn(4, _print_cube); // Prints 64
}

In the above code, the type of parameter print_fn is void(*)(int). In other words, print_fn is the address to a function taking an int and returning void. Generalizing this to our priority queue, notice that the type of parameter cmp_fn is int(*)(const void*, const void*). This is the address to a function taking two addresses to memory locations of any type and returning an int.

Similarly, destroy_list also takes a function address. This is because beyond freeing the node itself, you also need to potentially free whatever the node stores (e.g., if you have a priority queue of dynamically allocated strings).
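For example, if the queue held strings allocated with malloc, a suitable destroy function might look like this (a hypothetical helper, not part of the release code):

```c
#include <stdlib.h>

// Hypothetical destroy_fn for nodes whose values were allocated with malloc.
// destroy_list(...) would call this once per node, passing the node's value.
void _destroy_string(void *a_value) {
    free(a_value); // free the value itself; destroy_list frees the node
}
```

When the values need no cleanup (e.g., addresses of stack variables), passing NULL for destroy_fn is a common convention; check the specification in priority_queue.h for the exact expected behavior.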

Implementing pq_enqueue

You might recall from CS 2110 that priority queues can be implemented with binary heaps. In our implementation, however, we will be implementing our priority queue as a linked list that we will keep sorted by priority. This means that inserting a node will be an \(O(n)\) time operation, and removing from the priority queue will be a constant time operation. This is fine for our purposes.

In pq_enqueue, *a_head refers to the head of the linked list. If *a_head is NULL, then the list is empty. a_value is the address of whatever value is associated with this node. Allocate a new PQNode and insert it into the list in sorted order, according to the cmp_fn function. That is, every node before the new PQNode should compare less than it, and every node after it should compare greater than or equal to it.

*a_head should be updated if the new node becomes the first item in the list. The function should return the address of the new node.

This function should call malloc exactly once. You should not call free in this function.

We recommend you test your implementation for your priority queue as you go in test_priority_queue.c. You should also test your implementation for types other than integers, including dynamically allocated types such as strings. You will need to write your own comparison function to do this, and potentially your own print function if you want to be able to print your list.
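As a sketch of what such a comparison function could look like for C strings (a hypothetical helper, not in the release code):

```c
#include <string.h>

// Compares two priority-queue values that are known to be C strings.
// Returns <0, 0, or >0, following the same convention as strcmp.
int _cmp_str(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}
```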

Implementing pq_dequeue

Like the previous function, *a_head refers to the head (first node) of a valid linked list. If the list is empty, return NULL (there is nothing to dequeue). Upon return, *a_head must again be a valid linked list, possibly empty; for our purposes, NULL is a valid linked list of size 0. In particular, after removing the last node you should set *a_head to NULL.

You must also set the next field of the removed node to NULL. The caller is responsible for freeing the detached node, and any memory it refers to. For this reason, this function should not call free, directly or indirectly.

Again, you should test this by adding more statements to test_priority_queue.c and printing the list to observe the behavior of your function.

Implementing destroy_list

This function should completely destroy the linked list referred to by *a_head, freeing any memory that was allocated for it. destroy_fn(...) is a function that deallocates *a_value as needed (if for example, the nodes of the priority queue had values that were themselves dynamically allocated). This function should set the head to NULL in the caller’s stack frame (i.e. *a_head = NULL).

This is a good point to check to make sure that your code does not leak memory. Suppose you have the following code in test_priority_queue.c:

#include "priority_queue.h"
#include "cu_unit.h"

int _cmp_int(const void *a, const void *b) {...}

void _print_int(void *a_n) {...}

int _test_destroy() {
    cu_start(); 
    // ------------------
    PQNode* head = NULL;
    int n1 = 5, n2 = 7, n3 = 6;
    pq_enqueue(&head, &n1, _cmp_int);
    pq_enqueue(&head, &n2, _cmp_int);
    pq_enqueue(&head, &n3, _cmp_int);
    destroy_list(&head, NULL);
    cu_check(head == NULL);
    //--------------------
    cu_end();
}

int main(int argc, char* argv[]) {
    cu_start_tests();
    cu_run(_test_destroy);
    cu_end_tests();
    return 0;
}

This code should contain no memory leaks, i.e., it should eventually free everything that it mallocs.

You will likely want to use the sanitizers to check for memory bugs. Running rv make pqtest also enables the sanitizers so you don’t have to write out the command-line flags yourself.

Implementing stack_push and stack_pop

In stack_push, *stack stores the address of the first node in the linked list. a_value stores the address of the generic type. The newly allocated node should become the first node of the list, and *stack should be updated. The function returns the address of the new node.

In this function, you will call malloc exactly once, and you will not call free. This function is extremely similar to pq_enqueue, except you don’t need to think about where in the list the node should go. It always goes in the front of the list.

For stack_pop, you should simply detach and return the node from the head of the linked list. Note that this is incredibly similar to the specification for pq_dequeue.

Again, make sure you thoroughly test this code, as it will be used extensively in Task 1 and Task 2. If you are confident your code is correct, now would be a good time to commit and push your work to GitHub.

Step 2: Implementing calc_frequencies

The code for this task is located in huffman.c. You will be implementing the following function:

  • calc_frequencies(Frequencies freqs, const char* path, const char** a_error): Open a file at path and either store the character frequencies in freqs or set *a_error to strerror(errno).

Before getting started, we recommend you take a look at the type definitions and function specification located in huffman.h. In particular, pay careful attention to these two lines:

typedef unsigned char uchar; 
typedef uint64_t Frequencies[256];

The first line tells us that uchar is simply an alias for unsigned char. Similarly, the second line tells us that Frequencies is an alias for an array of 256 64-bit unsigned integers.
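In other words, a Frequencies value is indexed like an ordinary array, using the character itself as the index. A small sketch (the typedefs are copied from huffman.h; the helper name is made up):

```c
#include <stdint.h>

typedef unsigned char uchar;
typedef uint64_t Frequencies[256]; // same aliases as in huffman.h

// Hypothetical helper: records one occurrence of ch in the frequency table.
void _count_char(Frequencies freqs, uchar ch) {
    freqs[ch]++; // the character's value is the array index
}
```

Because Frequencies is an array type, passing one to a function passes the address of its first element, so the callee’s increments are visible to the caller.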

For the function calc_frequencies, the caller is responsible for initializing freqs[ch] to 0 for all ch from 0 through 255. The function should behave as follows:

  • If the file is opened correctly, then set freqs[ch] to \(n\), where \(n\) is the number of occurrences of the character ch in the file at path. Then, return true. Do not modify a_error.
  • If the file could not be opened (i.e., fopen returned NULL), set *a_error to strerror(errno) and return false. Do not modify freqs.

You only need to check for errors related to failure to open the file. This function should not print anything, nor should you call malloc or free. You do not need them.

This function will need to use file input/output functions from the stdio.h header. In particular, use the documentation for fopen, fgetc, and fclose. Working with files in C can be confusing at first. Let’s look at some of the basic syntax:

#include <stdio.h>
#include <stdlib.h>

void print_first_character(char const* path) {
    FILE* stream = fopen(path, "r"); // open the file in reading mode
    int ch = fgetc(stream); // read one character; fgetc returns int so it can represent EOF
    fputc(ch, stdout); // write that character to stdout
    fclose(stream); // always call fclose() if you call fopen()
}

int main(int argc, char* argv[]) {
    print_first_character("animal.txt");
    return 0;
}

In the fopen function, the second argument indicates the mode the file should be opened in. "r" is for reading, "w" is for writing, and "a" is for appending. If you wanted to write a function to print out every character in a file (and not just the first), you’d write something like this:

void cat(char const* path) {
    FILE* stream = fopen(path, "r");

    int ch; // int, not char, so the EOF sentinel is not truncated
    while ((ch = fgetc(stream)) != EOF) {
        fputc(ch, stdout);
    }

    fclose(stream);
}

Be sure to use the stdio.h documentation to find the I/O functions you need.
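One detail the examples above leave out is error handling. The usual idiom for reporting a failed fopen, which is the behavior calc_frequencies asks for, looks roughly like this (the function name is invented for illustration):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical sketch: returns false and records an error message
// if the file at path cannot be opened for reading.
bool _open_or_report(const char *path, const char **a_error) {
    FILE *stream = fopen(path, "r");
    if (stream == NULL) {
        *a_error = strerror(errno); // human-readable description of the failure
        return false;
    }
    fclose(stream);
    return true;
}
```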

Again, we recommend testing your code for calc_frequencies before moving on. Create a file called test_frequencies.c, and an example file such as animals.txt. Try calling your function and seeing if it correctly obtains the frequencies of each character in the text file using cu_unit.

That’s all for Task 0 and the lab! Don’t forget to commit and push your code to GitHub.

Task 1: Building a Huffman Tree

In lab we created a priority queue that accepts a “generic” data type. We will use the priority queue in this task to build our Huffman tree.

If you missed lab or you don’t have a working priority queue or calc_frequencies function, go back and finish that first. Your code for this task will rely on the previous task.

The implementation for the Huffman tree will be contained in huffman.c. Look carefully first at huffman.h to ensure you understand the functions you are required to implement. In this task you will be implementing two functions:

  • TreeNode* make_huffman_tree(Frequencies freq): Given an array freq which contains the frequency of each character, create a Huffman tree and return the root.
  • void destroy_huffman_tree(TreeNode** a_root): Given the address of the root of a Huffman tree created by make_huffman_tree(...), deallocate and destroy the tree.

Recall that freq is an array with 256 values. Each index of the array corresponds to a character’s byte value (the code uses the uchar alias, since plain char may be signed in C). The value of freq[c] is the frequency of character c in the input file.

Also important in the header file is the definition of the TreeNode struct. A Huffman tree node contains the character, the frequency of the character in the input, and two child nodes. Huffman’s algorithm assumes that we’re building a single tree from a set (or forest) of trees. Initially, all the trees have a single node containing a character and the character’s weight. Iteratively, a new tree is formed by picking two trees and making a new tree whose child nodes are the roots of the two trees. The weight of the new tree is the sum of the weights of the two sub-trees. This decreases the number of trees by one in each iteration. The process iterates until there is only one tree left. The algorithm is as follows:

  1. Begin with a forest of trees. All trees have just one node, with the weight of the tree equal to the weight of the character in the node. Characters that occur most frequently have the highest weights. Characters that occur least frequently have the smallest weights. These nodes will be the leaves of the Huffman tree that you will be building.
  2. Repeat this step until there is only one tree: Choose two trees with the smallest weights; call these trees T1 and T2. Create a new tree whose root has a weight equal to the sum of the weights T1 + T2 and whose left sub-tree is T1 and whose right sub-tree is T2.
  3. The single tree left after the previous step is an optimal encoding tree.

To implement this strategy, use your priority queue to store your tree nodes. You want all the nodes to be ordered by their weights, so you can easily find the two trees with the smallest weights (at the front of the queue). You will need to write your own comparison function to implement this policy. To break ties when two tree-nodes have the same frequency, you can order them lexicographically by the ASCII value of the character.
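One possible shape for that comparison function, sketched with a stand-in struct (the real TreeNode is defined in huffman.h and its field names may differ):

```c
#include <stdint.h>

// Stand-in for TreeNode; check huffman.h for the actual field names.
typedef struct {
    uint64_t frequency;
    unsigned char character;
} FakeNode;

// Orders nodes by frequency, breaking ties by ASCII value of the character.
int _cmp_node(const void *a, const void *b) {
    const FakeNode *na = a;
    const FakeNode *nb = b;
    if (na->frequency != nb->frequency) {
        return (na->frequency < nb->frequency) ? -1 : 1;
    }
    return (int)na->character - (int)nb->character;
}
```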

We will not pay particular attention to tie-breaking involving internal (non-leaf) nodes, since those nodes do not hold a character; adding a tie-breaking rule there would make your implementation unnecessarily complex. As a result, the tree you build can take on multiple forms. That’s fine; we will not grade based on the exact structure of your Huffman tree, but on the properties delineated below.

When you test your code, you should make sure that calling destroy_huffman_tree(TreeNode** a_root) ensures that your code has no memory leaks.

For testing, there are a few properties of Huffman trees we would like to verify:

  1. The weight of an internal node is equal to the sum of the weights of its children.
  2. The sum of the weights of the leaf nodes is equal to the number of characters in the uncompressed text.
  3. If the number of distinct leaf nodes is \(n\), then the number of total nodes in the Huffman tree is \(2n - 1\).

The last property follows from the fact that if you start with \(n\) leaf nodes, you need \(n - 1\) internal nodes to connect them.
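When writing your own tests, a small recursive helper can check this last property. A sketch using a stand-in struct (the real TreeNode in huffman.h may name its fields differently):

```c
#include <stddef.h>

// Stand-in for TreeNode with just the child pointers; see huffman.h.
typedef struct _Node {
    struct _Node *left;
    struct _Node *right;
} Node;

// Counts every node in the tree; for a Huffman tree with n distinct
// leaves, this should return 2n - 1.
size_t _count_nodes(const Node *root) {
    if (root == NULL) return 0;
    return 1 + _count_nodes(root->left) + _count_nodes(root->right);
}
```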

We’ve provided you with a file test_huffman.c, which defines functions that verify the aforementioned properties using cu_unit.h. We’ve provided three test functions: one for each file given to you in the tests directory. You are encouraged to add more thorough tests yourself; however, you do not need to turn in test_huffman.c. Once you are confident your implementation is correct, move on to the next task.

To compile and run this program, you’ll run:

$ rv make hufftest
$ rv qemu test_huffman

Task 2: Writing the compressed file and coding table

Now we have all of the pieces we need to write the compressed file and the coding table. For this task, you must implement two functions, found in huffman.c:

  • void write_coding_table(TreeNode* root, BitWriter* a_writer): Write the coding table to a_writer->file. The driver opens this file as coding_table.bits.
  • void write_compressed(TreeNode* root, BitWriter* a_writer): Write the encoded data to a_writer->file. The driver opens this file as compressed.bits.

The above functions make use of the BitWriter struct, which is defined in bit_tools.h. The BitWriter allows us to write data to a file in increments of bits instead of bytes. (Normal file writing APIs, including C’s standard stdio.h, only support writing entire bytes at a time.) You are not responsible for fully understanding the inner workings of BitWriter, but you do need to know how to use it to write data to the file.

The BitWriter struct contains a file that is already opened in "w" mode. To write bits to the file, you must call the function write_bits(BitWriter* a_writer, uint8_t bits, uint8_t num_bits_to_write). It takes three parameters:

  • a_writer: The address of a BitWriter that contains a file which is open for writing
  • bits: The bits you want to write, stored in a uint8_t
  • num_bits_to_write: The number of bits you want to write, which must be between 0 and 8 inclusive

For both the compressed file and the coding table, you should only need to write bits to the file in 1-bit and 8-bit increments. The following program may help in understanding the behavior of the BitWriter more clearly:

int main(int argc, char* argv[]) {
    BitWriter writer = open_bit_writer("new_file.bits");
    write_bits(&writer, 0x05, 3);  // 0x05 ↔ 00000101₂ ⋯ writes just 101₂
    write_bits(&writer, 0xf3, 3);  // 0xf3 ↔ 11110011₂ ⋯ writes just 011₂
    write_bits(&writer, 0x01, 2);  // 0x01 ↔ 00000001₂ ⋯ writes just 01₂
    write_bits(&writer, 0x20, 6);  // 0x20 ↔ 00100000₂ ⋯ writes just 100000₂
    write_bits(&writer, 0x13, 5);  // 0x13 ↔ 00010011₂ ⋯ writes just 10011₂
    write_bits(&writer, 0x05, 5); // 0x05 ↔ 00000101₂ ⋯ writes just 00101₂ 
    close_bit_writer(&writer);
    return 0;
}

After running this code, you can inspect the new_file.bits file using the following command:

$ xxd -b -g 1 new_file.bits

The xxd tool prints out files in binary, hex, and ASCII formats so you can see exactly what you have written.

Implementing write_coding_table

The coding table is a file that encodes the structure of your Huffman tree in a text file. It is an important utility for the decompression algorithm, as it allows you to recover the structure of the Huffman tree without needing the original uncompressed text. In this step, we will write the encoded Huffman tree to a file called coding_table.bits.

To write the coding table, you do a post-order traversal of your Huffman tree.

  1. Traverse the left subtree of the root (i.e., encode it to the file).
  2. Traverse the right subtree of the root (i.e., encode it to the file).
  3. Visit the root.

Every time you “visit” a node (including the root of a subtree):

  • If it is a leaf (i.e., character), you write one bit: 1. Then, you write the entire character (8 bits). Example: If the character is A, you will write 0b101000001. The 1 is to signify that it is a leaf. The 0b01000001 is to specify the character itself.
  • If it is a non-leaf (i.e., an internal node), you write one bit: 0.

To write out the bits for a character, you can pass a char value directly to write_bits. For example, use write_bits(my_writer, 'A', 8) to write out the binary encoding of the character A.

Your code will write the bits for the coding table using BitWriter. To make the coding table more explicit, consider the following Huffman tree for go go gophers again:

huffman tree

If we provide this tree as an input to write_coding_table, the coding table representation should look like 1g1o01s1 01e1h01p1r00000, and in complete binary (as formatted by xxd), it would be represented as:

00000000: 10110011 11011011 11010111 00111001 00000010 11001011
00000006: 01101000 01011100 00101110 01000000

Notice that the first bit is a 1, indicating a leaf, followed by the byte 01100111, which represents the character g in ASCII. Write the bits of the coding table to the file only. Do not write anything before or after the encoding of the Huffman tree.

Before we move on, here’s another reminder that the Huffman tree you build in make_huffman_tree can take on various forms depending on how you tiebreak the non-leaf nodes; there is no single “correct” Huffman tree for the purpose of this assignment. This means your binary representation generated by the compression driver below for go go gophers might not match the example above; in fact, in our implementation we got:

00000000: 10111001 11011001 01101101 00000101 10011101 01101111  ..m..o
00000006: 10111000 01011100 10010010 00000000                    .\..

So even if your coding table for the gophers example does not match the examples in these instructions, there is no need to fret. Just verify that your coding table matches your own Huffman tree and run some tests.

You can verify the functionality of your write_coding_table by running the compression driver:

$ rv make 
$ rv qemu compress tests/ex.txt
$ xxd -b -g 1 coding_table.bits

Running the compress binary will produce two files: coding_table.bits and compressed.bits. You can inspect each of these files to verify the correctness of the write_coding_table and write_compressed functions, respectively.

Implementing write_compressed

In this step, we will write the compressed data to compressed.bits. The argument a_writer to the function points to a BitWriter that has compressed.bits open for writing. To write the compressed data, you will need to traverse your Huffman tree to recover the encodings, and then use the encodings to write the compressed data. How you accomplish this is largely up to you—there are many valid approaches here. Just make sure that there are no memory leaks and that your compressed data file actually represents the Huffman encodings. Again, write the bits of the compressed data only—do not write any bits before or after the compressed bits.

When you go to inspect the file, you may notice that there are an additional four bytes written to compressed.bits before the compressed data itself. These bytes represent the size of the original uncompressed text in bytes. Integers are typically four bytes, so we use four bytes to write this information to the file. This is written for you by the compression driver (do not write this yourself). The reason it’s there is for decompression—the decompression program needs to know how big the original text file was to recover the uncompressed text.

Using the go go gophers example, the compressed data should look something like (where there are four additional bytes at the beginning):

00000000: 00001101 00000000 00000000 00000000 01101110 11011101  ....n.
00000006: 10110000 11001011 01000000                             ..@

Notice that if you use the command ls -l, you can see the sizes of your files in the directory in bytes. The original file was 13 bytes but the compressed file is 9 bytes—our compression was successful!

Running and Testing

To make it easier to compile and run your code, we’ve provided a Makefile. To build your program, simply type rv make. rv make will build two executables: a compression program and a decompression program. To run the compression program, type:

rv qemu compress <filename>

This will produce two output files: compressed.bits and coding_table.bits. If you run the compression program on another input file, the two output files will be overwritten with the new results.

To run the decompression program, type:

rv qemu decompress compressed.bits coding_table.bits <uncompressed_filename>

This produces a file called <uncompressed_filename>. To see if your compression was successful, you can try comparing the result of the decompression to the original unencoded file by running:

diff <original_file> <uncompressed_file> 

For example, if you were trying this on the cornell.txt file in the tests directory, you’d run:

$ rv qemu compress tests/cornell.txt
$ rv qemu decompress compressed.bits coding_table.bits uncompressed_cornell.txt
$ diff tests/cornell.txt uncompressed_cornell.txt

If you see nothing when running this, that means the files are identical and decompressing your compressed file was successful. Good work!

Note that the decompression tool is based on your implementation of the coding table and the Huffman tree. In other words, you might be able to decompress your file correctly, but that does not mean your Huffman tree is correct.

A successful round trip of compression and decompression is necessary, but not sufficient, to guarantee that all of the functions from Tasks 1 and 2 are correct. You are strongly encouraged to use cu_unit.h (described in Task 0) to more thoroughly test your code for each task. You can add tests directly to test_priority_queue.c and test_huffman.c. You are not required to submit these files, but we strongly encourage you to test each task separately, as that is how your code will be graded.

To build the test executables, you can run:

$ rv make pqtest
$ rv make hufftest

which will generate test_priority_queue and test_huffman, respectively.

Submission

Submit huffman.c and priority_queue.c to Gradescope. Upon submission, we will provide a sanity check to ensure your code compiles and passes the public test cases. The public test cases will only test for round-trip compression and decompression, and not intermediate functions.

Rubric

  • Task 0: 30 points
  • Task 1: 30 points
  • Task 2: 40 points

Code that contains memory leaks as reported by the sanitizers will be subject to a deduction of 50% of the total points for the assignment.