Huffman Compression
Instructions:
Remember, all assignments in CS 3410 are individual.
You must submit work that is 100% your own.
Remember to ask for help from the CS 3410 staff in office hours or on Ed!
If you discuss the assignment with anyone else, be careful not to share your actual work, and include an acknowledgment of the discussion in a collaboration.txt
file along with your submission.
The assignment is due via Gradescope at 11:59pm on the due date indicated on the schedule.
Submission Requirements
You will submit your completed solution to this assignment on Gradescope. You must submit:
huffman.c
, which will contain part of your work for Task 0 and all of your work for Tasks 1 and Task 2.priority_queue.c
, which will contain part of your work for Task 0.
Restrictions
- You may not modify any files other than
huffman.c
andpriority_queue.c
. You won’t be submitting them anyway
Provided Files
priority_queue.h
, which is a header file that defines the specification for the priority queue.priority_queue.c
, which will contain your implementation of a priority queue and stack. You will modify this file.huffman.h
, which is a header file that defines the types and functions you will need to implement Huffman compression.huffman.c
, which will contain your implementation for the Huffman compression system. You will also modify this one.bit_tools.h
, which is a header file that defines theBitWriter
andBitReader
structs and their respective functions for reading and writing binary values from files.bit_tools.c
, which contains the implementation of the functions forBitWriter
andBitReader
.utils.h
, which contains utility functions for printing lists and tree nodes.utils.c
, which contains the implementation for the utility functions.Makefile
, which contains the build tools for this assignment.test_priority_queue.c
, which contains functions to test your implementation for Task 0. You may add tests here, but you will not turn this file in.test_huffman.c
, which contains functions to test your implementation for Task 1. You may also modify this file, as above.cu_unit.h
, which contains the macro definitions that you’ll use for unit testing.
Remember, do not modify other source files except the ones containing your implementation. We will grade your submission with “stock” versions of the supporting files.
Overview
In this assignment you will implement a data compression system using Huffman coding. Huffman compression is an encoding scheme that uses fewer bits to encode more frequently appearing characters and more bits to encode less frequently appearing characters. It is used by ZIP files, among many other things. The high-level overview of the algorithm is:
- Calculate the frequency of each character in the data. (Task 0)
- Build a Huffman tree using the frequencies. (Task 1)
- Build an encoding table using the Huffman tree. (Task 2)
- Encode each character in the data using your encoding table. (Task 2)
In the lab, you will implement a priority queue in C. You’ll use this to build your Huffman tree. The bulk of the work for this assignment will come from understanding the Huffman coding algorithm and manipulating data structures in C.
Huffman Compression Algorithm
Your implementation will read a single text file as input and produce two output files: a compressed data file and a coding table file that encodes enough information to allow decompression. (This assignment does not include decompression; we have given you a decompressor implementation.) Task 2 describes the format for these files.
Before moving onto the tasks, let’s break down the Huffman compression algorithm. You may recall that ASCII is a straightforward way to represent characters. In ASCII, every character is encoded with 8 bits (1 byte). There are 256 possible ASCII values that can be represented. This means that if we use standard ASCII encodings to represent a text file, each character in the file requires exactly 1 byte. This is inefficient, as most text streams don’t actually use all 256 possible characters. The basic idea behind Huffman encoding is as follows: use fewer bits to represent characters that occur more frequently.
Consider the string go go gophers
.
Because g
and o
appear so much more often, can we construct an encoding that uses a small number of bits for these characters, and possibly a larger number of bits for characters like h
?
That’s the goal with Huffman coding.
The core data structure we need is a binary tree with characters at the leaves. Each edge in the tree will correspond to a bit: a left edge corresponds to 0 and a right edge corresponds to 1. To get the encoding for a character, traverse the tree from the root to the leaf and concatenate all the corresponding bits.
Here’s a Huffman tree that contains all the characters in our string, go go gophers
:
We have labeled each leaf with the frequency of that character. Internal nodes also have a frequency number that is the sum of all the frequencies of the children.
Here’s a table that shows the binary code for each character, according to this tree:
Character | Binary code |
---|---|
| 101 |
e | 1100 |
g | 00 |
h | 1101 |
o | 01 |
p | 1110 |
r | 1111 |
s | 100 |
Remember, you get the encoding by traversing the path from the root to the character, using a 0 for every left edge and a 1 for every right edge.
The Huffman tree ensures that characters that are more frequent in the input receive shorter encodings, and characters that are less frequent receive longer encodings. Our goal is to construct the Huffman tree, write the coding table, and write the compressed file using these shorter encodings.
Getting Started
To get started, obtain the release code by cloning your assignment repository from GitHub:
$ git clone git@github.coecis.cornell.edu:cs3410-2024fa/<NETID>_huffman.git
Replace <NETID>
with your NetID. The characters in your NetID should all be in lower cases.
Assignment Outline
- Task 0: You will complete Task 0 in lab. You will implement a priority queue in
C as well as the
calc_frequencies
function inhuffman.c
. - Task 1: You will implement the algorithm to create a Huffman tree.
- Task 2: You will implement the functions
write_coding_table
andwrite_compressed
to write the coding table and compressed bytes to distinct files.
Implementation
(Lab) Task 0: Implementing a priority queue and frequency counter
Before starting, make sure you’ve cloned the release code by following the instructions in Getting Started.
Step 1: Implement a priority queue
The code for this portion is located in priority_queue.c
, which is provided to
you in the release code. In this step, you’ll build a priority queue that accepts a “generic” data type.
This is accomplished by storing a pointer to an arbitrary piece of memory that
can store anything by using void*
. We’ve provided a header
file that defines the PQNode
type as well as the function declarations
for the functions you are required to implement.
Your implementation will go in priority_queue.c
. We’ve provided a basic
test suite in test_priority_queue.c
. You will implement the following
functions:
PQNode *pq_enqueue(PQNode **a_head, void *a_value, int (*cmp_fn)(const void *, const void *))
: Add a new node with valuea_value
to a priority queue, using functioncmp_fn(...)
to determine the ordering of the priority queue.PQNode *pq_dequeue(PQNode **a_head)
: Detach and return the head. Note, the caller is responsible for freeing the detached node, and any memory it refers to. Do not callfree
.void destroy_list(PQNode **a_head, void (*destroy_fn)(void *))
: Deallocates the priority queue. This should call thedetroy_fn
function on every data element, and it shouldfree
the list nodes.PQNode *stack_push(PQNode **stack, void *a_value)
: Add a new node with valuea_value
to the front of the list.PQNode *stack_pop(PQNode **stack)
: Detach and return the head of the list. Note, this function is extremely similar topq_dequeue
.
The last two functions are to enable us to use the same data structure as a stack,
when needed. You probably will not make use of this for your Huffman compression
system, but the decompression system needs a stack to work properly. If you can
implement pq_enqueue
, and pq_dequeue
, implementing stack_push
and stack_pop
should be very easy.
We’ve provided a test file called test_priority_queue.c
. Running rv make pqtest
from the command line will build an executable called test_priority_queue
, which
you can then run by typing rv qemu test_priority_queue
.
The tests use the header file cu_unit.h
, which defines various macros that help
you write unit tests. In general, tests should be structured like so:
static int _test_name() {
cu_start();
//-------------------
// Setup code - build a list, declare a variable, call a function, etc.
cu_check(/*condition you want to check*/);
// ... add as many checks as you want
//-------------------
cu_end();
}
int main(int argc, char* argv[]) {
cu_start_tests(); // Indicate start of test suite
cu_run(_test_name); // Don't forget to run the test in `main`
cu_end_tests(); // Indicate end of the test suite
}
Upon running the test, you’ll see one of the two following messages:
Test passed: _test_name
which will be displayed in green, or:
Test failed: _test_name at line x
which will be printed in red, and give the line that failed. We’ve provided two
simple tests in the release code that check the behavior of your priority queue
and stack. You are encouraged to add more tests to verify the functionality
of your implementation. You will not be turning in test_priority_queue.c
, however,
so this will not be graded.
Generic data types
You might notice some strange looking syntax in these function declarations.
This is to enable generic data types. The PQNode
struct contains a void*
, which
you can think of as a memory address to any type. This allows you to use the
same code for linked lists of any type.
You can assign a void*
to an address of any type. This is why you can write
code like:
char* s = malloc(...);
even though malloc(...)
returns a void*
, not a char*
. This is also similar
to the way functions such as qsort(...)
allow you to sort arrays of any type.
Function addresses
Code that deals with generic data types often needs to pass functions as parameters.
To do this, you need to specify the address to a function as an argument. In other
words, you are declaring the parameter of the function (in this case cmp_fn
) as
the address to a function that takes in some parameter(s) of specified types and
returns a value of a specified type. For the compare function, you’ll always
return an integer, and the arguments to the compare function can be anything,
depending on the underlying data in the nodes of the priority queue.
Let’s look at an example:
void _print_square(int n) {
printf("%d squared is %d\n", n, n * n);
}
void _print_cube(int n) {
printf("%d cubed is %d\n", n, n * n * n);
}
void _call_print_fn(int n, void(*print_fn)(int)) {
print_fn(n);
}
int main(int argc, char* argv[]) {
_call_print_fn(4, _print_square); // Prints 16
_call_print_fn(4, _print_cube); // Prints 64
}
In the above code, the type of parameter print_fn
is void(*)(int)
. In other
words, print_fn
is the address to a function taking an int and returning void.
Generalizing this to our priority queue, notice that the type of parameter cmp_fn
is int(*)(const void*, const void*)
. This is the address to a function taking
two addresses to memory locations of any type and returning an int
.
Similarly, destroy_list
also takes a function address. This is because beyond
freeing the node itself, you also need to potentially free whatever the node stores
(e.g., if you have a priority queue of dynamically allocated strings).
Implementing pq_enqueue
You might recall from CS 2110 that priority queues can be implemented with binary heaps. In our implementation, however, we will be implementing our priority queue as a linked list that we will keep sorted by priority. This means that inserting a node will be an \(O(n)\) time operation, and removing from the priority queue will be a constant time operation. This is fine for our purposes.
In pq_enqueue
, *a_head
refers to the head of the linked list. If
*a_head
is NULL
, then the list is empty. a_value
is the address of whatever
value is associated with this node.
Allocate a new PQNode
and insert it into the list in sorted order, according to the cmp_fn
function.
That is, everything before the new PQNode
should be less than the new one, and everything to the right should be bigger than (or equal to) the new one.
*a_head
should be updated if the new node becomes the first item in the list.
The function should return the address of the new node.
This function should call malloc
exactly once. You should not call
free
in this function.
We recommend you test your implementation for your priority queue as you go in
test_priority_queue.c
. You should also test your implementation for types other
than integers, including dynamically allocated types such as strings. You will need to
write your own comparison function to do this, and potentially your own print function if you
want to be able to print your list.
Implementing pq_dequeue
Like the previous function, *a_head
refers to the head (first node) of a valid
linked list. If the list is empty, return NULL
(since there is nothing to dequeue).
Upon return, *a_head
must be a valid linked list (although possibly empty). For
our purposes, NULL
is a valid linked list of size 0
. Thus, *a_head
will be
set to NULL
if the list is empty, and upon removing the last node, you should set
*a_head
to NULL
.
You must also set the next
field of the removed node to NULL
. The caller
is responsible for freeing the detached node, and any memory it refers to.
For this reason, this function should not call free
, directly or indirectly.
Again, you should test this by adding more statements to test_priority_queue.c
and printing the list to observe the behavior of your function.
Implementing destroy_list
This function should completely destroy the linked list referred to by *a_head
,
freeing any memory that was allocated for it. destroy_fn(...)
is a function that
deallocates *a_value
as needed (if for example, the nodes of the priority
queue had values that were themselves dynamically allocated). This function should
set the head to NULL
in the caller’s stack frame (i.e. *a_head = NULL
).
This is a good point to check to make sure that your code does not leak memory.
Suppose you have the following code in test_priority_queue.c
:
#include "priority_queue.h"
#include "cu_unit.h"
int _cmp_int(const void *a, const void *b) {...}
void _print_int(void *a_n) {...}
int _test_destroy() {
cu_start();
// ------------------
PQNode* head = NULL;
int n1 = 5, n2 = 7, n3 = 6;
pq_enqueue(&head, &n1, _cmp_int);
pq_enqueue(&head, &n2, _cmp_int);
pq_enqueue(&head, &n3, _cmp_int);
destroy_list(&head, NULL);
cu_check(head == NULL);
//--------------------
cu_end();
}
int main(int argc, char* argv[]) {
cu_start_tests();
cu_run(_test_destroy);
cu_end_tests();
return 0;
}
This code should contain no memory leaks,
i.e., it should eventually free
everything that it malloc
s.
You will likely want to use the sanitizers to check for memory bugs.
Running rv make pqtest
also enables the sanitizers so you don’t have to write out
the command-line flags yourself.
Implementing stack_push
and stack_pop
In stack_push
, *stack
stores the address of the first node in the linked list. a_value
stores
the address of the generic type. The newly allocated node should become the first node
of the list, and *stack
should be updated. The function returns the address
of the new node.
In this function, you will call malloc
exactly once, and you will
not call free
. This function is extremely similar to pq_enqueue
, except you
don’t need to think about where in the list the node should go.
It always goes in the front of the list.
For stack_pop
, you should simply detach and return the node from the head
of the linked list. Note that this is incredibly similar to the specification for
pq_dequeue
.
Again, make sure you thoroughly test this code, as it will be used extensively in Task 1 and Task 2. If you are confident your code is correct, now would be a good time to commit and push your work to GitHub.
Step 2: Implementing calc_frequencies
The code for this task is located in huffman.c
. You will be implementing the
following function:
calc_frequencies(Frequencies freqs, const char* path, const char** a_error)
: Open a file atpath
and either store the character frequencies infreq
or set*a_error
tostrerror(errno)
.
Before getting started, we recommend you take a look at the type definitions and function
specification located in huffman.h
. In particular, pay careful attention
to these two lines:
typedef unsigned char uchar;
typedef uint64_t Frequencies[256];
The first line tell us that uchar
is simply an alias for an unsigned char
. Similarly,
the second line tells us that Frequencies
is an alias for an array of 256 unsigned
integer values.
For the function calc_frequencies
, the caller is responsible for initializing
freqs[ch]
to 0 for all ch
from 0 through 255. The function should behave as follows:
- If the file is opened correctly, then set
freqs[ch]
to \(n\), where \(n\) is the number of occurrences of the characterch
in the file atpath
. Then, returntrue
. Do not modifya_error
. - If the file could not be opened (i.e.,
fopen
returnedNULL
), set*a_error
tostrerror(errno)
and returnfalse
. Do not modifyfreqs
.
You only need to check for errors related to failure to open the file. This
function should not print anything, nor should you call malloc
or free
.
You do not need them.
This function will need to use file input/output functions from the stdio.h
header.
In particular, use the documentation for fopen
, fgetc
, and fclose
.
Working with files in C can be confusing at first. Let’s look at some of the
basic syntax:
#include <stdio.h>
#include <stdlib.h>
void print_first_character(char const* path) {
FILE* stream = fopen(path, "r"); // this opens the file in reading mode
char ch = fgetc(stream); // read one character from the file, starting from the beginning
fputc(ch, stdout); // write that character to stdout
fclose(stream); // always call fclose() if you call fopen()
}
int main(int arc, char* argv[]) {
print_first_character("animal.txt");
return 0;
}
In the fopen
function, the second argument indicates the mode the file should
be opened in. "r"
is for reading, "w"
is for writing, and "a"
is for appending.
If you wanted to write a function to print out every character in a file (and
not just the first), you’d write something like this:
void cat(char const* path) {
FILE* stream = fopen(path, "r");
for (char ch = fgetc(stream); !feof(stream); ch = fgetc(stream)) {
fputc(ch, stdout);
}
fclose(stream);
}
Be sure to use the stdio.h
documentation to find the I/O functions you need.
Again, we recommend testing your code for calc_frequencies
before moving on.
Create a file called test_frequencies.c
, and an example file such as animals.txt
.
Try calling your function and seeing if it correctly obtains the frequencies of
each character in the text file using cu_unit
.
That’s all for Task 0 and the lab! Don’t forget to commit and push your code to GitHub.
Task 1: Building a Huffman Tree
In lab we created a priority queue that accepts a “generic” data type. We will use the priority queue in this task to build our Huffman tree.
If you missed lab or you don’t have a working priority queue or calc_frequencies
function, go back and finish that first. Your code for this task will rely on the previous task.
The implementation for the Huffman tree will be contained in huffman.c
. Look
carefully first at huffman.h
to ensure you understand the functions you are
required to implement. In this task you will be implementing two functions:
TreeNode* make_huffman_tree(Frequencies freq)
: Given an arrayfreq
which contains the frequency of each character, create a Huffman tree and return the root.void destroy_huffman_tree(TreeNode** a_root)
: Given the address of the root of a Huffman tree created bymake_huffman_tree(...)
, deallocate and destroy the tree.
Recall that freq
is an array with 256 values. Each
index of the array is an ASCII character (recall that chars are just unsigned
bytes in C). The value of freq[c]
is the frequency of character c
in the
input file.
Also important in the header file is the definition of the TreeNode
struct.
A Huffman tree node contains the character, the frequency of the character in the input, and
two child nodes. Huffman’s algorithm assumes that we’re building a single tree
from a set (or forest) of trees. Initially, all the trees have a single node
containing a character and the character’s weight. Iteratively, a new tree is
formed by picking two trees and making a new tree whose child nodes are the
roots of the two trees. The weight of the new tree is the sum of the weights of
the two sub-trees. This decreases the number of trees by one in each iteration.
The process iterates until there is only one tree left. The algorithm is as
follows:
- Begin with a forest of trees. All trees have just one node, with the weight of the tree equal to the weight of the character in the node. Characters that occur most frequently have the highest weights. Characters that occur least frequently have the smallest weights. These nodes will be the leaves of the Huffman tree that you will be building.
- Repeat this step until there is only one tree: Choose two trees with the
smallest weights; call these trees
T1
andT2
. Create a new tree whose root has a weight equal to the sum of the weightsT1 + T2
and whose left sub-tree isT1
and whose right sub-tree isT2
. - The single tree left after the previous step is an optimal encoding tree.
To implement this strategy, use your priority queue to store your tree nodes. You want all the nodes to be ordered by their weights, so you can easily find the two trees with the smallest weights (at the front of the queue). You will need to write your own comparison function to implement this policy. To break ties when two tree-nodes have the same frequency, you can order them lexicographically by the ASCII value of the character.
We will not pay particular attention to the tie-breaking between a node and a non-leaf node, since those nodes are supposed to not hold a value in theory. Adding a tie-breaking here would make your implementation unnecessarily more complex. While there is only a single theoretically correct Huffman tree, this implies that the tree we build here can take on multiple forms. That’s fine; we will not grade based on the exact structure of your Huffman tree, but the properties delineated below.
When you test your code, you should make sure that calling destroy_huffman_tree(TreeNode** a_root)
ensures that your code has no memory leaks.
For testing, there are a few properties of Huffman trees we would like to verify:
- The weight of an internal node is equal to the sum of the weights of its children.
- The sum of the weights of the leaf nodes is equal to the number of characters in the uncompressed text.
- If the number of distinct leaf nodes is \(n\), then the number of total nodes in the Huffman tree is \(2n - 1\).
The last property follows from the fact that if you start with \(n\) leaf nodes, you need \(n - 1\) internal nodes to connect them.
We’ve provided you with a file test_huffman.c
, which defines functions that
verify the aforementioned properties using cu_unit.h
. We’ve provided three
test functions: one for each file given to you in the tests
directory. You
are encouraged to add more thorough tests yourself; however, you do not need to turn
in test_huffman.c
. Once you are confident your implementation is correct, move on to the next task.
To compile and run this program, you’ll run:
$ rv make hufftest
$ rv qemu test_huffman
Task 2: Writing the compressed file and coding table
Now we have all of the pieces we need to write the compressed file and
the coding table. For this task, you must implement two functions, found in huffman.c
:
void write_coding_table(TreeNode* root, BitWriter* a_writer)
: Write the code table toa_writer->file
. This function writes to a file calledcoding_table.bits
.void write_compressed(TreeNode* root, BitWriter* a_writer)
: Write the encoded data toa_writer->file
. This function writes to a file calledcompressed.bits
The above functions make use of the BitWriter
struct, which is defined in
bit_tools.h
. The BitWriter
allows us to write data to a file in increments
of bits instead of bytes.
(Normal file writing APIs, including C’s standard stdio.h
, only support writing entire bytes at a time.)
You are not responsible for fully
understanding the inner workings of BitWriter
, but you do need to know how
to use it to write data to the file.
The BitWriter
struct contains a file that
is already opened in "w"
mode. To write bits to the file, you must call the
function write_bits(BitWriter* a_writer, uint8_t bits, uint8_t num_bits_to_write)
.
It takes three parameters:
a_writer
: The address of aBitWriter
that contains a file which is open for writingbits
: The bits you want to write, stored in auint8_t
num_bits_to_write
: The number of bits you want to write, which must be between 0 and 8 inclusive
For both the compressed file and the coding table, you should only need to write
bits to the file in 1-bit and 8-bit increments. The following program may help in
understanding the behavior of the BitWriter
more clearly:
int main(int argc, char* argv[]) {
BitWriter writer = open_bit_writer("new_file.bits");
write_bits(&writer, 0x05, 3); // 0x05 ↔ 00000101₂ ⋯ writes just 101₂
write_bits(&writer, 0xf3, 3); // 0xf3 ↔ 11110011₂ ⋯ writes just 011₂
write_bits(&writer, 0x01, 2); // 0x01 ↔ 00000001₂ ⋯ writes just 01₂
write_bits(&writer, 0x20, 6); // 0x20 ↔ 00100000₂ ⋯ writes just 100000₂
write_bits(&writer, 0x13, 5); // 0x13 ↔ 00010011₂ ⋯ writes just 10011₂
write_bits(&writer, 0x05, 5); // 0x05 ↔ 00000101₂ ⋯ writes just 00101₂
close_bit_writer(&writer);
return 0;
}
After running this code, you can inspect the new_file.bits
file using the
following command:
$ xxd -b -g 1 new_file.bits
The xxd tool prints out files in binary, hex, and ASCII formats so you can see exactly what you have written.
Implementing write_coding_table
The coding table is a file that encodes the structure of your Huffman tree in a
text file. It is an important utility for the decompression algorithm, as it
allows you to recover the structure of the Huffman tree without needing the original
uncompressed text. In this step, we will write the encoded Huffman tree to a file
called coding_table.bits
.
To write the coding table, you do a post-order traversal of your Huffman tree.
- Traverse the left subtree of the root (i.e., encode it to the file).
- Traverse the right subtree of the root (i.e., encode it to the file).
- Visit the root.
Every time you “visit” a node (including the root of a subtree):
- If it is a leaf (i.e., character), you write one bit:
1
. Then, you write the entire character (8 bits). Example: If the character isA
, you will write0b101000001
. The1
is to signify that it is a leaf. The0b01000001
is to specify the character itself. - If it is a non-leaf (i.e., an internal node), you write one bit:
0
.
To write out the bits for a character, you can pass a char
value directly to write_bits
.
For example, use write_bits(my_writer, 'A', 8)
to write out the binary encoding of the character A
.
Your code will write the bits for the coding table using BitWriter
. To make the coding table more explicit, consider the following Huffman tree for go go gophers
again:
If we provide this tree as an input to write_coding_table
, the coding table representation should look like 1g1o01s1 01e1h01p1r00000
, and in complete binary (as formatted by xxd
), it would be represented as:
00000000: 10110011 11011011 11010111 00111001 00000010 11001011
00000006: 01101000 01011100 00101110 01000000
Notice that the first bit is a 1
, indicating a leaf, followed by the byte
01100111
, which represents the character g
in ASCII. Write the bits of the
coding table to the file only. Do not write anything before or after the
encoding of the Huffman tree.
Before we move on, here’s another reminder that the Huffman tree you build in make_huffman_tree
can take on various forms depending on how you tiebreak the non-leaf nodes; there is no single “correct” Huffman tree for the purpose of this assignment. This means your binary representation generated by the compression driver below for go go gophers
might not match the example above; in fact, in our implementation we got:
00000000: 10111001 11011001 01101101 00000101 10011101 01101111 ..m..o
00000006: 10111000 01011100 10010010 00000000 .\..
So even if your coding table for the gophers example might not match our examples in this instruction, there is no need to fret. Just make sure to verify that your coding table matches your Huffman tree and run some tests.
You can verify the functionality of your write_coding_table
by running the compression driver:
$ rv make
$ rv qemu compress tests/ex.txt
$ xxd -b -g 1 coding_table.bits
Running the compress
binary will produce two files: coding_table.bits
and
compressed.bits
. You can inspect each of these files to verify the correctness
of the write_coding_table
and write_compressed
functions, respectively.
Implementing write_compressed
In this step, we will write the compressed data to compressed.bits
. The argument
a_writer
to the function points to a BitWriter
that has compressed.bits
open
for writing. To write the compressed data, you will need to traverse your Huffman tree to
recover the encodings, and then use the encodings to write the compressed data.
How you accomplish this is largely up to you—there are many valid approaches
here. Just make sure that there are no memory leaks and that your compressed
data file actually represents the Huffman encodings. Again, write the bits of
the compressed data only—do not write any bits before or after the compressed
bits.
When you go to inspect the file, you may notice that there are an additional
four bytes written to compressed.bits
before the compressed data itself. These
bytes represent the size of the original uncompressed text in bytes. Integers
are typically four bytes, so we use four bytes to write this information to the
file. This is written for you by the compression driver (do not write this
yourself). The reason it’s there is for decompression—the decompression
program needs to know how big the original text file was to recover the
uncompressed text.
Using the go go gophers
example, the compressed data
should look something like (where there are four additional bytes at the beginning):
00000000: 00001101 00000000 00000000 00000000 01101110 11011101 ....n.
00000006: 10110000 11001011 01000000 ..@
Notice that if you use the command ls -l
, you can see the sizes of your files
in the directory in bytes. The original file was 13 bytes but the compressed
file is 9 bytes—our compression was successful!
Running and Testing
To make it easier to compile and run your code, we’ve provided a Makefile
. To
build your program, simply type rv make
. rv make
will build two executables: a
compression program and a decompression program. To run the compression
program, type:
rv qemu compress <filename>
This will produce two output files: compressed.bits
and coding_table.bits
. If
you run the compression program on another input file, the two output files will
be overwritten with the new results.
To run the decompression program, type:
rv qemu decompress compressed.bits coding_table.bits <uncompressed_filename>
This produces a file called <uncompressed_filename>
. To see if your compression was
successful, you can try comparing the result of the decompression to the
original unencoded file by running:
diff <original_file> <uncompressed_file>
For example, if you were trying this on the cornell.txt
file in the tests
directory, you’d run:
$ rv qemu compress tests/cornell.txt
$ rv qemu decompress compressed.bits coding_table.bits uncompressed_cornell.txt
$ diff tests/cornell.txt uncompressed_cornell.txt
If you see nothing when running this, that means the files are identical and decompressing your compressed file was successful. Good work!
Note that the decompression tool is based on your implementation of the coding table and the Huffman tree. In other words, you might be able to decompress your file correctly, but that does not mean your Huffman tree is correct.
Round-trip compression and decompression is necessary for the correctness
of the entire system, but not sufficient, to
guarantee that all of the functions from Task 1 and Task 2 are correct. You are
strongly encouraged to use cu_unit.h
(described in Task 0) to more
thoroughly test your code for Task 0 and Task 1. You can add tests directly to test_priority_queue.c
and test_huffman.c
. You are not required to submit these files, but we strongly
encourage you to test each task separately as that is how your code will be graded.
To build the test executables, you can run:
$ rv make pqtest
$ rv make hufftest
which will generate test_priority_queue
and test_huffman
, respectively.
Submission
Submit huffman.c
and priority_queue.c
to Gradescope. Upon
submission, we will provide a sanity check to ensure your code compiles and
passes the public test cases. The public test cases will only test for
round-trip compression and decompression, and not intermediate functions.
Rubric
- Task 0: 30 points
- Task 1: 30 points
- Task 2: 40 points
Code that contains memory leaks as reported by the sanitizers will be subject to a deduction of 50% of the total points for the assignment.