Many programmers, especially in introductory classes, might think of programming as a task largely involving debugging. So it's worthwhile to take a step back and think about everything that comes before debugging.
The first defense against bugs is to make them impossible. Entire classes of bugs can be eradicated by choosing to program in languages that guarantee memory safety (no part of memory can be accessed except through a pointer or reference that is valid for that region of memory) and type safety (no value can be used in a way inconsistent with its type). The OCaml type system, for example, prevents buffer overflows and meaningless operations (like adding a boolean to a float), whereas the C type system does not.
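For example, if we try such a meaningless operation in the OCaml toplevel, the compiler rejects it before any code runs (the exact wording of the message varies by compiler version):

# true +. 1.0;;
Error: This expression has type bool but an expression was expected of type float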
The second defense is to use tools that find bugs. There are automated source-code analysis tools, like FindBugs, which can find many common kinds of bugs in Java programs, and SLAM, which is used to find bugs in device drivers. The subfield of CS known as formal methods studies how to use mathematics to specify and verify programs, that is, how to prove that programs have no bugs. We'll study verification later in this course. Social methods such as code reviews and pair programming are also useful for finding bugs. Studies at IBM in the 1970s-1990s suggested that code reviews were the most effective tool for finding bugs; see the Appendix below.
The third defense is to make bugs fail fast. The earlier a bug appears, the easier it is to diagnose and fix. If computation instead proceeds past the point of the bug, then that further computation might obscure where the failure really occurred. Assertions in the source code make programs "fail fast" and "fail loudly", so that bugs appear immediately, and the programmer knows exactly where in the source code to look.
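For example, OCaml's built-in assert (a simpler relative of the Assertions module introduced below) lets a function reject a bad input at the point of the call; a minimal sketch:

let divide (x:int) (y:int) : int =
  assert (y <> 0);  (* fail fast and loudly: the exception points right here *)
  x / y

If the assertion is violated, an Assert_failure exception is raised immediately, carrying the file name and location of the failed check, instead of letting a division by zero surface somewhere downstream.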
The fourth defense is testing. How can you know whether a piece of code has a particular bug? Write tests that would expose the bug, then confirm that your code doesn't fail those tests. Unit tests for a relatively small piece of code, such as an individual function or module, are especially important to write at the same time as you develop that code. Running those tests should be automated, so that if you ever break the code, you find out as soon as possible. (That's really Defense 3 again.)
The fifth defense is debugging, which is really the last resort of a programmer.
In the rest of this recitation, we'll look at defenses 3 through 5.
The Assertions module contains a function assert_true : bool -> unit, which returns unit if passed true as an argument, and raises an exception if passed false. For example:
open Assertions

let t = assert_true true

let inc x = x+1
let u = assert_true ((inc 0) = 0)  (* raises an exception *)

(The open Assertions line makes it possible to refer to any function f declared inside Assertions simply as f, rather than Assertions.f.)
If we save this code in a file test.ml and then run

cs3110 compile test.ml
cs3110 run test.ml

we see the exception being raised:
Fatal error: exception Assertions.Assert_true("false is not true")
Raised at file "assertions/assertions.ml", line 4, characters 18-49
Called from file "test.ml", line 4, characters 8-33
Error (exit code 2)

We'll see later in the course how to declare, raise, and handle exceptions ourselves. For now, it's enough to know that Assertions uses them.
The Assertions module contains many other functions we can use to write assertions. You can read the documentation at the GitHub repo for the module.
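Among them is assert_false, which we'll use in the test file below; as its name suggests, it is the dual of assert_true, returning unit if passed false and raising an exception if passed true. A quick sanity check, reusing the inc function from above:

open Assertions

let inc x = x+1
let v = assert_false ((inc 0) = 0)  (* passes: inc 0 evaluates to 1, not 0 *)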
This course has developed its own unit-testing framework, which is part of the cs3110 tool distributed on the VM. Our framework is quite similar to (and in fact is built on) the open-source library Pa_ounit.
Suppose you're writing a function to determine whether a number is prime:

let is_prime : int -> bool =
  fun x -> x=3 || x=5 || x=7

(Obviously, you're not yet done writing the function...) And suppose you've saved this function in a file named prime.ml.
To write unit tests, create a new file prime_test.ml. (The choice of filename is a convention, not a strict requirement.) Inside prime_test.ml, you can write unit tests such as the following:
open Prime
open Assertions

TEST_UNIT "test_prime_one" = assert_true (is_prime 3)
TEST_UNIT "test_prime_two" = assert_true (is_prime 7)
TEST_UNIT "test_not_prime_eight" = assert_false (is_prime 8)

let () = Pa_ounit_lib.Runtime.summarize()

The first two lines open the modules for the code being tested and for the assertions module. Note that the first letter of the module name is capitalized, regardless of whether or not it was capitalized in the actual filename. The strings associated with each unit test are simply descriptive names. The last line (the call to summarize) should remain the final line of the file if you add any new tests. If any tests fail, this function call exits with an error and a summary of the failed tests; if all tests pass, it exits normally.
To run your unit tests, do the following:

cs3110 compile prime_test.ml
cs3110 test prime_test.ml

As output, you will see:

3 tests ran, 0 test_modules ran
The course page embeds a file containing some buggy code, along with a unit test file for it; the two programs are not reproduced here.
Do the above test cases pass? If not, which tests fail? If so, does this guarantee correctness of the above functions? Can you think of some more cases that we did not test? Add some cases of your own to the test file, and fix the implementation of the functions so that your tests pass.
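Although those two files aren't shown here, the same exercise applies to the is_prime stub from earlier: its three tests all pass, yet the function is far from correct. Hypothetical additional cases like these expose the gaps (the first fails against the stub):

TEST_UNIT "test_prime_smallest"  = assert_true (is_prime 2)    (* fails: the stub forgot 2 *)
TEST_UNIT "test_not_prime_one"   = assert_false (is_prime 1)   (* 1 is not prime *)
TEST_UNIT "test_not_prime_nine"  = assert_false (is_prime 9)   (* 9 = 3 * 3 is composite *)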
Testing doesn't have to happen strictly after you write code. In test-driven development, testing comes first!
Suppose we are working with a datatype for days:

type day = Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday

And we want to write a function next_weekday : day -> day that returns the next weekday after a given day. We start by writing the most basic, broken version of that function we can:

let next_weekday (d:day) = Sunday

Then we begin writing tests. For example, we know that the next weekday after Monday is Tuesday. So we add a test:
TEST_UNIT "next_weekday_test1" = assert_true (next_weekday(Monday) = Tuesday)We compile and run the test; it fails. That's good! Now we know that we have some functionality to add. Let's go back and make this test pass:
let next_weekday (d:day) =
  match d with
    Monday -> Tuesday
  | _ -> Sunday

We compile and run the test; it passes. Time to add some more tests:
TEST_UNIT "next_weekday_test2" = assert_true (next_weekday(Tuesday) = Wednesday) TEST_UNIT "next_weekday_test3" = assert_true (next_weekday(Wednesday) = Thursday) TEST_UNIT "next_weekday_test4" = assert_true (next_weekday(Thursday) = Friday)We compile and run the tests; many fail. That's good! We add new functionality:
let next_weekday (d:day) =
  match d with
    Monday -> Tuesday
  | Tuesday -> Wednesday
  | Wednesday -> Thursday
  | Thursday -> Friday
  | _ -> Sunday

We compile and run the tests; they pass. Perhaps overly emboldened by our success in adding functionality, we notice the pattern that has evolved, and we decide to finish writing the function:
let next_weekday (d:day) =
  match d with
    Monday -> Tuesday
  | Tuesday -> Wednesday
  | Wednesday -> Thursday
  | Thursday -> Friday
  | Friday -> Saturday
  | Saturday -> Sunday
  | Sunday -> Monday

(Of course we got it wrong.) We add some new tests to finish testing all the possible cases:
TEST_UNIT "next_weekday_test5" = assert_true (next_weekday(Friday) = Monday) TEST_UNIT "next_weekday_test6" = assert_true (next_weekday(Saturday) = Monday) TEST_UNIT "next_weekday_test7" = assert_true (next_weekday(Sunday) = Monday)We compile and run the tests; tests 5 and 6 fail! Surprised, we look at test 5, notice that the code for
next_weekday(Friday)
must be wrong, and fix the bug. After fixing the same bug for test 6,
we end up with the following complete piece of code and unit tests:
let next_weekday (d:day) =
  match d with
    Monday -> Tuesday
  | Tuesday -> Wednesday
  | Wednesday -> Thursday
  | Thursday -> Friday
  | Friday -> Monday
  | Saturday -> Monday
  | Sunday -> Monday

TEST_UNIT "next_weekday_test1" = assert_true (next_weekday(Monday) = Tuesday)
TEST_UNIT "next_weekday_test2" = assert_true (next_weekday(Tuesday) = Wednesday)
TEST_UNIT "next_weekday_test3" = assert_true (next_weekday(Wednesday) = Thursday)
TEST_UNIT "next_weekday_test4" = assert_true (next_weekday(Thursday) = Friday)
TEST_UNIT "next_weekday_test5" = assert_true (next_weekday(Friday) = Monday)
TEST_UNIT "next_weekday_test6" = assert_true (next_weekday(Saturday) = Monday)
TEST_UNIT "next_weekday_test7" = assert_true (next_weekday(Sunday) = Monday)
Test-driven development is related to extreme programming (XP), a collection of ideas about how to structure team programming projects. It emphasizes incremental development of code: there is always something that can be tested. Testing is not something that happens after implementation; instead, continuous testing is used to catch errors early. Thus, it is important to develop unit tests immediately when the code is written. Automating test suites is crucial so that continuous testing requires essentially no effort.
Extreme programming emphasizes bottom-up development of code modules for which unit tests can be developed. Extreme programming also emphasizes social process on programming teams. Programmers work closely, communicating more verbally than with formal design documents. Specifications are short, and programmers rely on common knowledge among the development team. Extreme programming is a good way to develop software quickly, but it has been criticized for not producing high-level design documents. In the long run a project developed this way might be hard to manage, because knowledge of the code is wedded to the engineers who wrote it.
When writing tests, keep in mind the acronym FIRST: tests should be Fast, Independent, Repeatable, Self-checking, and Timely. In addition, give each test a short, descriptive name; a convention like funcname_testi works fairly well.

In the next_weekday example, it's relatively obvious which unit tests to write. But for more complicated functions, it can be difficult to decide what tests and how many to write. A good strategy for developing unit tests is to write them by looking only at the specification for the function being tested. The function's specification actually gives us a lot of information about what to test: the precondition tells us which inputs are legal, and the postcondition tells us what to check about the output on those inputs.
Testing based solely on the function's specification, as opposed to its implementation, is called black-box testing. The idea is that the function is a box that we cannot see into.
As an example of black-box testing, consider writing a function to compute the amount of income tax owed under the US's progressive income tax brackets. The postcondition for this function might stipulate that the tax rate is determined by the following table:

Income          | Tax rate
$0–$8,925       | 10%
$8,926–$35,250  | 15%
$35,251–$87,850 | 25%
...             | ...
There is structure in this postcondition that suggests values to test. Rather than test every dollar amount from $0 to $87,850, we could test the partitions suggested by the postcondition: one test for the partition $0–$8,925, another test for $8,926–$35,250, etc. This testing strategy is called equivalence-class partitioning. The values that fall at the extreme end of each partition are often particularly fruitful test cases. For example, we might write a test for $0, for $8,925, for $8,926, for $35,250, etc. This testing strategy is called boundary-value analysis.
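A sketch of what such boundary-value tests might look like, assuming a hypothetical function tax_rate : int -> float that maps an income in whole dollars to the rate from the table:

(* Hypothetical helper: tax_rate returns the rate for an income
 * in whole dollars, per the table above. *)
TEST_UNIT "tax_rate_bottom_bracket1" = assert_true (tax_rate 0     = 0.10)
TEST_UNIT "tax_rate_top_bracket1"    = assert_true (tax_rate 8925  = 0.10)
TEST_UNIT "tax_rate_bottom_bracket2" = assert_true (tax_rate 8926  = 0.15)
TEST_UNIT "tax_rate_top_bracket2"    = assert_true (tax_rate 35250 = 0.15)
TEST_UNIT "tax_rate_bottom_bracket3" = assert_true (tax_rate 35251 = 0.25)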
The specification for a function does not necessarily give us all the information we need to write a good test suite. By looking at the function's implementation, we might be inspired to write additional tests. Testing based on the function's implementation is called white-box testing (or perhaps better, glass-box testing). The idea is that the function is a box that we can see into.
As an example of white-box testing, consider writing a function to determine the real roots (if any) of a quadratic equation ax^2 + bx + c = 0. The specification for the function might read as follows:
(* requires: a <> 0.0
 * returns: Some (f1,f2) if there exist f1 and f2 such that
 *   each fi is a root of ax^2 + bx + c. Or None if no real
 *   roots exist. *)
let qroots ((a:float),(b:float),(c:float)) : (float*float) option = ...

In which case, we might be inspired by the precondition and equivalence-class partitioning to generate 27 test cases for when each of the coefficients is negative, zero, or positive. (And that's not a bad idea!)
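For illustration, two of those sign-partition cases might look like this; the coefficients are hypothetical values chosen by hand so that the expected roots are exact floats:

TEST_UNIT "qroots_pos_pos_pos"  = assert_true (qroots(1.0,5.0,6.0) = Some (-2.0,-3.0))
TEST_UNIT "qroots_pos_zero_neg" = assert_true (qroots(1.0,0.0,-4.0) = Some (2.0,-2.0))

But if we go on to read the function's implementation, we get new inspiration: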
let qroots ((a:float),(b:float),(c:float)) : (float*float) option =
  let d = (b *. b) -. (4.0 *. a *. c) in
  if d < 0.0 then None
  else Some ((~-. b +. (sqrt d)) /. (2.0 *. a),
             (~-. b -. (sqrt d)) /. (2.0 *. a))

In the body, we see that the value of d determines whether the then branch or the else branch of the if expression is evaluated. So it makes sense to write test cases that explore both branches:
TEST_UNIT "qroots_test1" = assert_true ((qroots(1.0,3.0,-4.0) = Some (1.0,-4.0))) TEST_UNIT "qroots_test2" = assert_true (qroots(3.0,4.0,2.0) = None)
Code coverage is a metric for how much of the code is being tested by a test suite. There are various ways to measure coverage: by the number of functions being tested, by whether every branch of an if or match expression is tested, by whether enough tests exist to cause every Boolean subexpression to take on both true and false, etc. By writing both tests above, we are increasing code coverage as measured by branches.
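Boundary-value analysis suggests one more test at the edge between those two branches: the case where d is exactly 0.0, which takes the else branch and returns a repeated root. With these coefficients the arithmetic is exact, so the float equality check is safe:

TEST_UNIT "qroots_test3" = assert_true (qroots(1.0,2.0,1.0) = Some (-1.0,-1.0))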
So you've discovered a bug. What next?
Here are a couple of tips on how to debug in OCaml, if you are forced into it.
Suppose you want to see what the value of x is in the following function:

let inc x = x+1

Just add the line below to print that value:

let inc x =
  let () = print_int x in  (* added *)
  x+1

The Pervasives module contains many other printing functions you can use.
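For instance, print_endline together with string_of_int (both in Pervasives) gives a labeled trace message; a small variation on the sketch above:

let inc x =
  let () = print_endline ("inc called with x = " ^ string_of_int x) in
  x+1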
In the toplevel, you can trace all the calls to and returns from a function with the #trace directive:

let rec fib x = if x<=1 then 1 else fib(x-1) + fib(x-2)
#trace fib;;

If you evaluate fib 2, you will now see the following output:

fib <-- 2
fib <-- 0
fib --> 1
fib <-- 1
fib --> 1
fib --> 2

To stop tracing, use the #untrace directive.
Some of the material in these notes is based on slides on debugging by Rob Miller from MIT 6.005.
An extremely effective way to find faults in software systems is code review, in which the programmer goes through the code and documentation to explain how it works and why it is correct. It is the programmer's job to convince an expert review team (usually 1–3 people) who act as skeptical devil's advocates.
There are different ways to use code review effectively, ranging from informal walkthroughs, in which the programmer simply talks the reviewers through the code, to formal inspections, in which reviewers study the code beforehand and follow a structured process.
Code review can be remarkably effective. In one study conducted at IBM (Jones, 1991), code inspection found 65% of the known coding errors and 25% of the known documentation errors, whereas testing found only 20% of the coding errors and none of the documentation errors. The more formal inspection process may be more effective than walkthroughs. One study (Fagan, 1976) found that code inspections resulted in code with 38% fewer failures, compared to code walkthroughs.
Thorough code review can be expensive, however. Jones found that preparing for code inspection took one hour per 150 lines of code, and the actual inspection covered 75 lines of code per hour. Having up to three people on the inspection team improves the quality of inspection; beyond that, additional inspectors don't seem to help. Spending a lot of time preparing for inspection did not seem to be useful, either. Perhaps this is because much of the value of inspection lies in the interaction with the coders.