Minifloat

Instructions: Remember, all assignments in CS 3410 are individual. You must submit work that is 100% your own. Remember to ask for help from the CS 3410 staff in office hours or on Ed! If you discuss the assignment with anyone else, be careful not to share your actual work, and include an acknowledgment of the discussion in a collaboration.txt file along with your submission.

The assignment is due via Gradescope at 11:59pm on the due date indicated on the schedule.

Submission Requirements

For this assignment, you will need to submit the following five files:

minifloat.c, with your written implementation for the missing functions.
minifloat_test_part1.expected, to match additional tests added in minifloat_test_part1.c
Some additional tests, in:
- minifloat_test_part1.c
- minifloat_test_part2.c
- minifloat_test_part3.c

Restrictions

For this assignment, you will be working with building your own floating-point representation.

You may not use built-in C operations for floating-point arithmetic.
You may not cast data to float or double, or create variables with these types.

Provided Files

The provided release code contains seven files:

minifloat.c, which includes some completed functions and some functions you are expected to implement
minifloat.h, which provides declarations and comments for the functions in minifloat.c, including those you are to implement
minifloat_test_part1.c, minifloat_test_part2.c, minifloat_test_part3.c, which provide some tests for you to get started. You are expected to add more tests of your own to each of these test suites
minifloat_test_part1.expected, which provides a baseline file to help with testing part 1. You are expected to add more lines to this file as part of testing part 1.
Makefile, which provides structure to compile your code (see our brief tutorial on Makefiles)

Getting Started

To get started, obtain the release code by cloning your assignment repository from GitHub:

$ git clone git@github.coecis.cornell.edu:cs3410-2024fa/<NETID>_minifloat.git

Replace <NETID> with your NetID. For example, if your NetID is zw669, then this clone statement would be git clone git@github.coecis.cornell.edu:cs3410-2024fa/zw669_minifloat.git

Overview

In this assignment, you will be developing a custom minifloat data format in C. You will be expected to reason about floating-point details and to implement C functions.

Background

In class, we learned about floating-point numbers, which represent decimals with some number of bits. C has built-in float and double types, which use (on modern hardware) 32 bits and 64 bits, respectively. Increasing the number of bits in a floating-point representation gives it more precision and more dynamic range, at the expense of less efficient arithmetic. It can also be useful, however, to perform operations with smaller floating-point representations—trading off precision for potentially faster calculations.

In this assignment, you will implement functions for a specialized 8-bit floating-point type. We’ll call these 8-bit numbers minifloats. Minifloats have severely limited precision, but such tiny floating point values are useful for situations where errors matter less and data sizes are enormous: most prominently, in machine learning. See, for example, this paper and this other paper that both show serious efficiency advantages from using 8-bit minifloats. While most floating point formats enjoy built-in hardware support, we can also implement minifloats in software with bit packing tricks.

Minifloats follow a similar representation strategy to the standard IEEE floating-point types that we learned about in lecture. However, they differ in a few important ways to make the implementation simpler, which we will summarize as well.

Minifloat specification

Minifloats use 8 bits in total: 1 sign bit, 3 exponent bits, and 4 significand bits. The layout of a minifloat looks like this, with s for sign, e for exponent, and g for significand:

As in standard formats, a sign bit of 0 indicates a positive number, and a sign bit of 1 indicates a negative number.
Minifloats have a bias of 3. In other words, we subtract 3 from the bit-representation of a minifloat exponent.
Unlike standard floating-point formats, wherein we usually append a leading 1 to the significand bits with the \(1.g\) notation, minifloats use the significand directly, with the binary point after the first digit. So if the four significand bits are \(g_3 g_2 g_1 g_0\), then the “base” part of the represented value is the binary number \(g_3 . g_2 g_1 g_0\). Or, in other words, the value is \(g \times 2^{-3}\), where \(g\) is the unsigned integer value of those 4 bits.
Also unlike standard floating-point formats, our minifloats do not use special values: not a number (NaN) and infinity (+∞ and -∞).

All together, the value represented by a minifloat with sign \(s\), exponent \(e\), and significand \(g\) is:

\[ (-1)^s \times (g \times 2^{-3}) \times 2^{e - 3} \]

Or, equivalently, if you prefer to think of the significand’s representation in terms of bits:

\[ (-1)^s \times (g_3.g_2g_1g_0) \times 2^{e - 3} \]

where \(g_3\) is the significand’s most significant bit, \(g_0\) is the least significant bit, and so on.

Minifloat Examples

To help with reading a minifloat, here are a pair of example calculations of reading a minifloat as a base-10 decimal:

Example 1: 10111100

We have a sign of 1, an exponent of 011, and a significand of 1100.

Our sign 1 corresponds to \(-1\).
Our exponent 011 corresponds to a decimal exponent of \(3-3 = 0\). (We’re applying our \(-3\) bias here.)
Our significand 1100 corresponds to the decimal \(12 \times 2^{-3}=\frac{12}{8}=1.5\). (Or, equivalently, the significand corresponds to the binary number \(1.100_2\), which is \(1.5\) in decimal.)

In total, our number is then \(-1 \times 1.5 \times 2^0 = -1 \times 1.5 \times 1 = -1.5\).

Example 2: 00010010

We have a sign of 0, an exponent of 001, and a significand of 0010.

Our sign 0 corresponds to \(+1\).
Our exponent 001 corresponds to a decimal exponent of \(1-3 = -2\).
Our significand 0010 corresponds indicates the binary value \(0.010_2\), which equals \(0.25_{10}\).

In total, our number is then \(1 \times 0.25 \times 2^{-2} = \frac{1}{16} = 0.0625\).

Bit size in C

We want to ensure that the type we are using to represent a minifloat is exactly 8 bits. We will use the uint8_t type from C’s stdint.h header. (We will avoid char, even though char is 8 bits on most platforms, because C unhelpfully does not guarantee that is is exactly 8 bits everywhere.) To break down this type’s, the uint means that bit-level operations are as on an unsigned integer, the 8 means that we expect operations to be on 8 bits, and _t is a common naming convention that indicates that this is a type. The stdint.h header defines many similar types, like these:

Type	Description
uint8_t	unsigned integer with 8 bits
uint16_t	unsigned integer with 16 bits
int8_t	signed integer with 8 bits

Your Task

This assignment is divided into three parts: displaying minifloats as decimals, implementing operations on minifloats, and using minifloats. Each part will have you implementing 1–3 functions, and adding test cases to help convince yourself these functions are correct. You must add at least 4 new test cases per function to what we have provided, though you may add more.

For all of your C implementations, you may not include any constants or variables of type float, double, or long double. You may not use C’s built-in floating-point operations, such as + on floating-point values.

This is not an arbitrary restriction. Using a larger float representation in your implementation will defeat the purpose of the smaller representation, which is that they are smaller and faster than “normal” floating-point types. Because of floating-point error, it is also very likely to introduce incorrect results.

We have provided a mini_to_double utility function to help you with debugging and testing. You may not use this function in any of your submitted implementations, but you may use this function for writing test cases for any of your functions.

Part 0: Lab

View the lab slides here.

In this lab, you will gain practice using floating point numbers. There are 2 designated checkoff points where you should show your work to a TA.

Review

If you need to, look over the lecture notes on standard floating-point types to remind yourself of the basic principles. And try out float.exposed to get hands-on practice!

Read over the background above and especially the specification for minifloats. To briefly summarize the minifloat format:

Bit 7 is the sign bit
Bits 6–4 are the exponent bits
Bits 3–0 are the fraction bits

(Bits are numbered from the right, so 0 is the least significant bit.)

Part 0.1: Practice Conversions

Decimal to float. To convert from a decimal to a floating point number:

Convert the integer and fractional parts into binary.
Combine the results and shift accordingly.
Convert our exponent into biased form.
Convert the fractional part into a mantissa/significand.
Recombine for a final floating-point representation.

Here’s a 32-bit example. Consider the decimal 12.375.

Converting the integer portion into binary yields 1100.

Converting the fractional part is a little more labor intensive. Our fractional part is 0.375. To convert, multiply the fractional part by 2, record the integer part of the result (should be 0 or 1), and repeat with the new fractional part until the fractional part becomes 0 or the precision limit is reached (is 23 digits for the IEEE 754 32-bit format). The recorded integer parts of this process becomes our binary representation for the original fractional part.

\(0.375 \times 2 = 0.750\). Record 0.
\(0.750 \times 2 = 1.500\). Record 1.
\(0.500 \times 2 = 1.000\). Record 1.

Thus our binary representation of 0.375 is 011, and thus our binary representation of 12.375 is 1100.011.

We now normalize our result so that it fits the format \((1.xyz \ldots ) \times 2^e\). In this case, we shift to the right to obtain \(1.100011 \times 2^3\).

Next, we need to apply our format’s exponent bias, which for this format is 127. To bias the exponent, we add our original exponent with the bias. In our example, this gives us 3 + 127 = 130.

From here, we can see that:

The biased exponent form of 130 yields 10000010 in binary.
The fractional part is \(100011_2\), and thus our mantissa is 10001100000000000000000. (We fill in the remaining digits with 0s.)
We have a positive number, so our sign bit is 0.

(The reason the exponent’s “bias” of 127 exists is so that we can represent both positive and negative exponents using an unsigned integer. It may help to think of the bias as the reference point of no shifting (an exponent of 0), and any additions and subtractions to this bias represent how much we shift by. To find the to find the bias of any floating-point number, we can use the formula \(2^{(k-1)} - 1\) where k is the number of bits in the exponent field.)

Thus, our final floating point representation is 0 10000010 10001100000000000000000. You can try checking the answer by entering these bits into float.exposed if you like.

Float to decimal. To convert from a floating-point number to a decimal number:

Separate out the sign, exponent, and mantissa.
We restore the 1. back to the mantissa and drop the trailing 0s.
We de-normalize the number so the exponent is 0.
Convert the integer and fractional parts to decimals.
Add a negative sign if necessary.

Exercises:

Convert -4.75 to our 8-bit floating point format.
Convert 1.7 to our 8-bit floating point format.
Convert the binary float 11010011 to a decimal.
Convert the binary float 00100110 to a decimal.

Checkoff #1: Show your converted numbers to a TA.

Part 0.2: Floating Point Addition and Inaccuracies

You can perform many arithmetic operations with floating-point numbers. However, in this lab, you will be practicing with addition as an introduction to floating-point arithmetic.

To perform addition with a floating point number:

Rewrite the smaller number so that the exponents are equal, and adjust the mantissa of the number with the smaller exponent by shifting it to the right accordingly.
Add the mantissas together.
Recombine and renormalize the result if necessary.

Exercises:

Add 0.25 and 8.0 with our 8-bit minifloat format.
Add 1.25 and 2.5 with our 8-bit minifloat format.
Can you find two numbers that share the same nearest minifloat representation?
What is the largest number our minifloat format can represent?

Take note of where accuracy was lost in these calculations. Why are our results so far off from what we expect?

Checkoff #2: Show your addition results to a TA.

Part 1: Displaying Minifloats

Your first task is to implement a function for displaying minifloats in C, named print_mini. This function takes in a minifloat and must print the sign, whole number, and fractional part associated with this minifloat as a base-10 value. The exact specification, with examples, is given in minifloat.h. Your implementation should be filled into minifloat.c.

To make your task somewhat easier, we have written a concrete call to printf at the end of the each function that you may use as a guide for what to implement. Note that print_mini requires that we write 6 decimal digits—the provided printf specifier %06d will fill any integer to have preceding zeros such that the printed integer has 6 digits. To provide two concrete examples:

printf("%06d", 123) will print 000123
printf("%06d", 100000) will print 100000

Remember, you may not include any constants or variables of type float, double, or long double, and you may not use any floating-point operations. You may, however, use any integer arithmetic operation (including integer division and modulus). In C, dividing two integers with i / j produces an integer. But be sure not to include a double constant (such as 1.0) by accident.

Hint: You may find it useful to observe that \(1/64=0.015625\), and that, with integer division, \(1000000 / 64 = 15625\).

Testing Part 1

A test script to help guide your development can be found in minifloat_test_part1.c. You can build this test with the following command:

rv make part1

To test this code, you must execute the resulting .out file and pipe your print results to a file, such as with the following command:

rv qemu minifloat_test_part1.out > minifloat_test_part1.txt

Reminder: use the rv aliases for each command if you have it set up!

Finally, you must compare the resulting prints to our expected results using diff:

diff minifloat_test_part1.txt minifloat_test_part1.expected

If you observe any differences between the two, a printing test failed.

You can also combine these operations into a single bash command:

rv make part1 && rv qemu minifloat_test_part1.out > minifloat_test_part1.txt && diff minifloat_test_part1.txt minifloat_test_part1.expected

Reminder: You must add 4 new printing tests (which means modifying both minifloat_test_part1.c and minifloat_test_part1.expected).

Part 2: Minifloat Operations

Your second task is to implement an equality check, addition, and multiplication between minifloats. Specifically, you will be implementing mini_eq, mini_add, and mini_mul, which both take in two minifloats and produce a new minifloat. As before, the specifications for each function can be found in minifloat.h, and your implementation should be written in minifloat.c.

The results of the arithmetic operations mini_add and mini_mul must produce the minifloat value closest to adding together the corresponding real numbers. If there are two possible closest real numbers, your implementation must correspond to the closest real number further from zero than the result of addition. For example, we would round 2.125 to 2.25, and similarly -1.0625 to -1.125.

If there are multiple possible minifloat representations of the resulting real number, you must return the minifloat with the smallest exponent. For example, the minifloat value 0 011 0010 could be equivalently represented as 0 001 1000, and only the latter is considered correct for these arithmetic operations. Additionally, if an arithmetic operation would return 0, you must return exactly 00000000.

If applying addition or multiplication would result in a real number larger or smaller than can be represented by a minifloat, the result of these operations is undefined, and need not be tested.

Hint: If you become stuck on any of these functions, consider attempting another—each requires detail that can become more obvious while working on another.

Testing Part 2

Testing minifloat operations is more straightforward than testing the printing implemented earlier. We can simply run each test file and compare the resulting minifloats to expected values. To test part 2, you can directly build and execute part2:

rv make part2 && rv qemu minifloat_test_part2.out

Reminder: You must add 4 new tests per function.

Hint: Write as many edge-case tests as you can think of, there are many potential tricks with negative numbers and very small or very large minifloats.

Part 3: Using Minifloats

Your third task is a straightforward example use of the minifloats you have implemented. Specifically, you’ll be implementing functions to calculate the volume and surface area of a cylinder in the functions titled cylinder_volume and cylinder_area.

The volume and surface area of a cylinder depends on two variables, the radius r and height h of the cylinder, by the following equations:

\( \text{volume} = \pi \times r \times r \times h \)
\( \text{surface area} = 2 \times \pi \times r \times (h + r) \)

For reference and comparison, we have also written an implementation of these functions double_cylinder_volume and double_cylinder_area. These may be useful to refer to while implementing your own function, but are also used for the written task below.

For these implementations, you are expected to use the constant minifloat representation of PI to be 01001101 (representing 3.25), which is the closest minifloat to the decimal \(\pi \approx 3.14159\). We have included this constant definition in minifloat.c for your convenience.

Testing Part 3

To test part 3, you can directly build and execute part3:

rv make part3 && rv qemu minifloat_test_part3.out

We have only provided you with a single simple test for each, and you should write at least 4 new tests. We test these particular functions by comparing our minifloat calculation to the result produced by calculating the same value with a double. We expect that the minifloat result (being less accurate) will have some error compared to the double representation, which in the test is represented by the threshold parameter.

We recommend trying out a few operations and seeing how difference there is between minifloat and double calculations, and adjusting your threshold accordingly. To help with comparing these operations, we use the provided mini_to_double utility function to calculate calculate a double value before and after computing the minifloat equivalent. (We do not define a double_to_mini conversion.)

The mini_to_double utility is only for testing. Do not use it in your main implementation.

Remember that your goal is to implement minifloat operations “from scratch,” using only integer arithmetic. This is what makes minifloats more efficient than float or double.

Your tests should not include cases where the minifloat arithmetic would overflow (produce a result larger than the maximum minifloat or smaller than the largest negative minifloat). We do not define the results of these overflowing operations.

Submission

Submit minifloat.c, minifloat_test_part1.expected, minifloat_test_part1.c, minifloat_test_part2.c, and minifloat_test_part3.c to Gradescope. Upon submission, we will provide a sanity check to ensure your code compiles and passes the public test cases.

Rubric

16 points: print_mini correctness
18 points: mini_eq correctness
16 points: mini_add correctness
19 points: mini_mul correctness
8 points: cylinder_area correctness
8 points: cylinder_volume correctness
15 points: test quality