
Testing and Debugging


Topics:

- validation
- social methods: code walkthroughs, inspections, and pair programming
- formal verification
- testing and test coverage
- black-box and glass-box testing
- testing data abstractions
- randomized (fuzz) testing
- debugging
- defensive programming

Validation

Many programmers might think of programming as a task largely involving debugging. So it's worthwhile to take a step back and think about everything that comes before debugging.

The goal we're after is that programs behave as we intend them to behave. Validation is the process of building our confidence in correct program behavior. There are many ways to increase that confidence. Social methods, formal methods, and testing are three, and we discuss them next.

Social methods involve developing programs with other people, relying on their assistance to improve correctness. Some good techniques include the following:

- Code walkthrough: the author presents the code to a review team, walking through it line by line while the team looks for faults.
- Code inspection: the review team drives the process, studying the code in advance and questioning the author, who answers and records what needs fixing.
- Pair programming: two programmers share one workstation; one writes code while the other reviews it as it is written.

These social techniques for code review can be remarkably effective. In one study conducted at IBM (Jones, 1991), code inspection found 65% of the known coding errors and 25% of the known documentation errors, whereas testing found only 20% of the coding errors and none of the documentation errors. The code inspection process may be more effective than walkthroughs. One study (Fagan, 1976) found that code inspections resulted in code with 38% fewer failures, compared to code walkthroughs.

Thorough code review can be expensive, however. Jones found that preparing for code inspection took one hour per 150 lines of code, and the actual inspection covered 75 lines of code per hour. Having up to three people on the inspection team improves the quality of inspection; beyond that, more inspectors doesn't seem to help. Spending a lot of time preparing for inspection did not seem to be useful, either. Perhaps this is because much of the value of inspection lies in the interaction with the coders.

Formal methods use the power of mathematics and logic to validate program behavior. Verification uses the program code and its specifications to construct a proof that the program behaves correctly on all possible inputs. There are research tools available to help with program verification, often based on automated theorem provers, as well as research languages that are designed for program verification. Verification tends to be expensive and to require thinking carefully about and deeply understanding the code to be verified. So in practice, it tends to be applied to code that is important and relatively short. Verification is particularly valuable for critical systems where testing is less effective. Because their execution is not deterministic, concurrent programs are hard to test, and sometimes subtle bugs can only be found by attempting to verify the code formally. In fact, tools to help prove programs correct have been getting increasingly effective and some large systems have been fully verified, including compilers, processors and processor emulators, and key pieces of operating systems.

Testing involves actually executing the program on sample inputs to see whether the behavior is as expected. By comparing the actual results of the program with the expected results, we find out whether the program really works on the particular inputs we try it on. Testing can never provide the absolute guarantees that formal methods do, but it is significantly easier and cheaper to do. It is also the validation methodology with which you are probably most familiar. Testing is a good, cost-effective way of building confidence in correct program behavior.

Test coverage

We would like to know that a program works on all possible inputs. The problem with testing is that it is usually infeasible to try all the possible inputs. For example, suppose that we are implementing a module that provides an abstract data type for rational numbers. One of its operations might be an addition function plus, e.g.:

(* AF: [(p,q)] represents the rational number p/q
 * RI: [q] is not 0 *)
type rational = int*int

(* [create p q] is the rational number p/q.
 * raises: [Invalid_argument "0"] if [q] is 0 *)
val create : int -> int -> rational

(* [plus r1 r2] is r1 + r2 *)
val plus : rational -> rational -> rational

What would it take to exhaustively test just this one function? We'd want to try all possible rationals as both the r1 and r2 arguments. A rational is formed from two ints, and there are 2^63 ints on a modern OCaml implementation. Therefore there are approximately (2^63)^4 = 2^252 possible inputs to the plus function. Even if we test one addition every nanosecond, it would take about 10^59 years to finish testing this one function (2^252 ns is about 7 × 10^75 ns, and a year is only about 3 × 10^16 ns).

Clearly we can't test software exhaustively. But that doesn't mean we should give up on testing. It just means that we need to think carefully about what our test cases should be so that they are as effective as possible at convincing us that the code works.

Consider our create function, above. It takes in two integers p and q as arguments. How should we go about selecting a relatively small number of test cases that will convince us that the function works correctly on all possible inputs? We can visualize the space of all possible inputs as a large square, with p along one axis and q along the other, each ranging from min_int to max_int.

There are about 2^126 points in this square, so we can't afford to test them all. Testing them all would also be mostly a waste of time, since most of the possible inputs provide nothing new. We need a way to find a set of points in this space that are interesting and will give a good sense of the behavior of the program across the whole space.

Input spaces generally comprise a number of subsets in which the behavior of the code is similar in some essential fashion across the entire subset. We don't get any additional information by testing more than one input from each such subset.

If we test all the interesting regions of the input space, we have achieved good coverage. We want tests that in some useful sense cover the space of possible program inputs.

Two good ways of achieving coverage are black-box testing and glass-box testing.

Black-box testing

In selecting our test cases for good coverage, we might want to consider both the specification and the implementation of the program or module being tested. It turns out that we can often do a pretty good job of picking test cases by just looking at the specification and ignoring the implementation. This is known as black-box testing. The idea is that we think of the code as a black box about which all we can see is its surface: its specification. We pick test cases by looking at how the specification implicitly introduces boundaries that divide the space of possible inputs into different regions.

When writing black-box test cases, we ask ourselves what set of test cases will produce distinctive behavior as predicted by the specification. It is important to try out both typical inputs and boundary cases, also known as corner cases or edge cases. A common error is to test only typical inputs, with the result that the program usually works but fails in less frequent situations. It's also important to identify ways in which the specification creates classes of inputs that should elicit similar behavior from the function, and to test on those paths through the specification. Here are some examples.

Example 1. Here are some ideas for how to test the create function:

- Try typical (p, q) pairs in each of the four quadrants of the square, i.e., with each combination of positive and negative p and q.
- Try pairs on the boundaries of the square, i.e., with p or q equal to min_int or max_int.
- Try pairs on the line p = 0, where the represented rational is zero regardless of q.

The specification also says that the code will check that q is not zero. We should construct some test cases to ensure this checking is done as advertised. Trying (1, 0), (max_int, 0), (min_int, 0), (-1, 0), and (0, 0) to see that they all raise the specified exception would probably be an adequate set of black-box tests.
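For instance, here is a sketch of those tests, assuming the interface above is implemented by a module named Rational (a hypothetical name) and using the OUnit2 testing library:

open OUnit2

(* Black-box tests for [create]: a denominator of 0 must raise the
 * exception promised by the specification, whatever the numerator. *)
let zero_denominator_tests = "zero denominator" >::: List.map
    (fun p ->
       string_of_int p >:: (fun _ ->
         assert_raises (Invalid_argument "0") (fun () -> Rational.create p 0)))
    [1; max_int; min_int; -1; 0]

let () = run_test_tt_main zero_denominator_tests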

Example 2. Consider the function list_max:

(* Return the maximum element in the list. *)
val list_max: int list -> int

What is a good set of black-box test cases? Here the input space is the set of all possible lists of ints. We need to try some typical inputs and also consider boundary cases. Based on this spec, boundary cases include the following (a sketch of the resulting tests follows the list):

- the empty list, for which the specification doesn't say what should happen: discovering that omission is itself a valuable result of designing test cases;
- a list with just one element;
- lists in which the maximum is the first element, the last element, and somewhere in the middle;
- a list containing duplicate copies of the maximum;
- a list of all-negative elements;
- a list containing min_int or max_int.
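Here is a sketch of those tests in OUnit2, assuming an implementation of list_max is in scope:

open OUnit2

(* Black-box tests for [list_max]; expected values come directly from
 * the specification. The empty list is omitted because the spec does
 * not say what should happen; that omission should be reported as a
 * bug in the specification. *)
let list_max_tests = "list_max" >::: [
  "singleton"  >:: (fun _ -> assert_equal 7 (list_max [7]));
  "max first"  >:: (fun _ -> assert_equal 3 (list_max [3; 1; 2]));
  "max middle" >:: (fun _ -> assert_equal 3 (list_max [1; 3; 2]));
  "max last"   >:: (fun _ -> assert_equal 3 (list_max [1; 2; 3]));
  "duplicates" >:: (fun _ -> assert_equal 3 (list_max [3; 1; 3]));
  "negatives"  >:: (fun _ -> assert_equal (-1) (list_max [-3; -1; -2]));
  "extremes"   >:: (fun _ -> assert_equal max_int (list_max [min_int; max_int]));
]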

Example 3. Consider the function sqrt:

(* [sqrt x n] is the square root of [x] computed to an accuracy of [n]
 * significant digits.
 * requires: [x >= 0] and [n >= 1] *)
val sqrt : float -> int -> float

The precondition identifies two possibilities for x (either it is zero or greater) and two possibilities for n (either it is one or greater). That leads to four "paths through the specification", i.e., representative and boundary cases for satisfying the precondition, which we should test:

- x = 0 and n = 1 (both at their boundaries);
- x = 0 and n > 1, e.g., sqrt 0. 5;
- x > 0 and n = 1, e.g., sqrt 2. 1;
- x > 0 and n > 1, e.g., sqrt 2. 5 (a typical case).

Summary. Black-box testing has some important advantages:

- The tests are independent of the implementation, so they remain valid even if the implementation is completely rewritten.
- The tests can be written before, or in parallel with, the implementation.
- Writing the tests can expose ambiguities and omissions in the specification itself, as the empty list did for list_max above.

The disadvantage of black box testing is that its coverage may not be as high as we'd like, because it has to work without the implementation.

Glass-box testing

Black-box testing is a good place to start when writing test cases, but ultimately it is not enough. In particular, it's not possible to determine how much coverage of the implementation a black-box test suite actually achieves without knowing the implementation source code. Testing based on that code is known as glass-box (or white-box) testing. Glass-box testing can improve on black-box testing by exercising execution paths through the implementation code: the series of expressions that is conditionally evaluated based on if-expressions, match-expressions, and function applications. Test cases that collectively exercise all paths are said to be path-complete. At a minimum, path-completeness requires that for every line of code, and even for every expression in the program, there should be a test case that causes it to be executed. Any unexecuted code could contain a bug if it has never been tested.

For true path completeness we must consider all possible execution paths from start to finish of each function, and try to exercise every distinct path. In general this is infeasible, because there are too many paths. A good approach is to think of the set of paths as the space that we are trying to explore, and to identify boundary cases within this space that are worth testing.

For example, consider the following implementation of a function that finds the maximum of its three arguments:

let max3 x y z = 
  if x>y then 
    if x>z then x else z 
  else 
    if y>z then y else z

Black-box testing might lead us to invent many tests, but looking at the implementation reveals that there are only four paths through the code: the paths that return x, z, y, or z (again). We could test each of those paths with representative inputs such as (3, 2, 1), (3, 2, 4), (1, 2, 1), and (1, 2, 3).
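Expressed as code, with one assertion per execution path:

(* Glass-box tests for [max3]: one input per path through the code. *)
let () =
  assert (max3 3 2 1 = 3);  (* x > y and x > z: returns x *)
  assert (max3 3 2 4 = 4);  (* x > y but not x > z: returns z *)
  assert (max3 1 2 1 = 2);  (* not x > y, and y > z: returns y *)
  assert (max3 1 2 3 = 3)   (* not x > y, not y > z: returns z *)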

When doing glass-box testing, we should include test cases for each branch of each (nested) if expression and each branch of each (nested) pattern match. If there are recursive functions, we should include test cases for the base cases as well as each recursive call. We should also include test cases that trigger each place where an exception might be raised.

Of course, path-complete testing does not guarantee the absence of errors. We still need to test against the specification, i.e., do black-box testing. For example, here is a broken implementation of max3:

let max3 x y z =
  x

The test max3 2 1 1 is path-complete, but doesn't reveal the error.

Glass-box testing can be aided by code-coverage tools that assess how much of the code has been exercised by a test suite. The bisect tool for OCaml can tell you which expressions in your program have been tested, and which have not.

Testing data abstractions

When testing a data abstraction, a simple first step is to look at the abstraction function and representation invariant for hints about what boundaries may exist in the space of values the abstraction manipulates. The rep invariant is a particularly effective tool for constructing useful test cases. Looking at the rep invariant of the rational data abstraction above, we see that it requires that q is non-zero. Therefore we should construct test cases that make q as close to 0 as possible, i.e., q = 1 or q = -1.
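For instance, here is a sketch, again assuming the interface is implemented by a hypothetical module Rational. Because the interface exposes no observers, these tests only check that the operations succeed at the boundary; an equality or to_string operation would let us check the results as well:

(* Tests suggested by the rep invariant: drive [q] as close to the
 * forbidden value 0 as possible. *)
let () =
  let half = Rational.create 1 2 in
  ignore (Rational.plus (Rational.create 1 1) half);    (* q = 1  *)
  ignore (Rational.plus (Rational.create 1 (-1)) half)  (* q = -1 *)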

We should also test how each consumer of the data abstraction handles every path through each producer of it. A consumer is an operation that takes a value of the data abstraction as input, and a producer is an operation that returns such a value.

For example, consider this set abstraction:

module type Set = sig

  (* ['a t] is the type of a set whose elements have type ['a]. *)
  type 'a t

  (* [empty] is the empty set. *)
  val empty : 'a t

  (* [size s] is the number of elements in [s].
   * [size empty] is [0]. *)
  val size : 'a t -> int

  (* [add x s] is a set containing all the elements of
   * [s] as well as element [x]. *)
  val add : 'a -> 'a t -> 'a t

  (* [mem x s] is [true] iff [x] is an element of [s]. *)
  val mem : 'a -> 'a t -> bool

end

The empty and add functions are producers, and the size, add, and mem functions are consumers (note that add is both). So we should test how (see the sketch after this list):

- size handles the empty set and sets produced by add;
- add handles the empty set and sets produced by add, observing the results with size or mem;
- mem handles the empty set and sets produced by add.
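Here is a sketch of such tests, assuming the signature is implemented by a hypothetical module ListSet and again using OUnit2:

open OUnit2

(* Each consumer ([size] and [mem]) applied to values built by each
 * producer ([empty] and [add]). *)
let set_tests = "set" >::: [
  "size empty"   >:: (fun _ -> assert_equal 0 ListSet.(size empty));
  "size add"     >:: (fun _ -> assert_equal 1 ListSet.(size (add 1 empty)));
  "size add dup" >:: (fun _ -> assert_equal 1 ListSet.(size (add 1 (add 1 empty))));
  "mem empty"    >:: (fun _ -> assert_equal false ListSet.(mem 1 empty));
  "mem add hit"  >:: (fun _ -> assert_equal true  ListSet.(mem 1 (add 1 empty)));
  "mem add miss" >:: (fun _ -> assert_equal false ListSet.(mem 2 (add 1 empty)));
]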

Randomized testing

Randomized testing, also known as fuzz testing, is the process of generating random inputs and feeding them to a program or a function to see whether the program behaves correctly. The immediate issue is how to determine what the correct output is for a given input. If a reference implementation is available (that is, an implementation that is believed to be correct but is unsuitable in some other way, e.g., too slow or written in a different language), then the outputs of the two implementations can be compared. Otherwise, perhaps some property of the output can be checked. For example, if the function sorts a list, we can check that its output is in sorted order; if it computes a square root, we can square the result and check that it is within the specified accuracy of the input.

Randomized testing is an incredibly powerful technique. It is often used in testing programs for security vulnerabilities. The qcheck package for OCaml supports randomized testing.
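For instance, here is a sketch of a qcheck test for the list_max function from earlier, assuming an implementation of it is in scope. It checks the property that the result is an element of the list that no other element exceeds:

(* A randomized test: on 1000 random non-empty lists, [list_max] must
 * return an element of the list that is >= every element. *)
let max_is_upper_bound =
  QCheck.Test.make ~count:1000 ~name:"list_max upper bound"
    QCheck.(list_of_size Gen.(1 -- 20) small_int)
    (fun l ->
       let m = list_max l in
       List.mem m l && List.for_all (fun x -> x <= m) l)

let () = ignore (QCheck_runner.run_tests [max_is_upper_bound])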

Debugging

The word "bug" suggests something that wandered into a program. Better terminology would be that there are:

- faults, which are defects introduced into a system by human error, and
- failures, which are violations of requirements observable when the system runs.

Some faults might never appear to an end user of a system, but failures are those faults that do. A fault might result because an implementation doesn't match design, or a design doesn't match the requirements.

Debugging is the process of discovering and fixing faults. Testing clearly is the "discovery" part, but fixing can be more complicated. Debugging can take even more time than the original implementation itself! So you would do well to make your programs easy to debug from the start. Write good specifications for each function. Document the AF and RI for each data abstraction. Keep modules small, and test them independently. Use both black-box and glass-box testing.

Inevitably, though, you will discover faults in your programs. When you do, approach them as a scientist by employing the scientific method:

- Formulate a falsifiable hypothesis about why the fault occurs.
- Design an experiment, usually a test case, whose outcome will confirm or refute the hypothesis.
- Run the experiment and record the result.
- Refine or reject the hypothesis based on that result, and repeat until the fault is located.

Often the crux of this process is finding the simplest, smallest input that triggers a fault. That's usually not the input that first revealed the fault, so some initial experimentation might be needed to find a minimal test case.

Never be afraid to write additional code, even a lot of additional code, to help you find faults. Functions like to_string or format can be invaluable in understanding computations, so writing them up front before any faults are detected is completely worthwhile.
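For example, a minimal to_string for the rational representation above:

(* Renders a [rational] for debugging output, e.g., "1/2". *)
let to_string ((p, q) : rational) : string =
  Printf.sprintf "%d/%d" p q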

When you do discover the source of a fault, be extra careful in fixing it. It is tempting to slap a quick fix into the code and move on. This is quite dangerous. Far too often, fixing a fault just introduces a new (and unknown) fault! If a bug is difficult to find, it is often because the program logic is complex and hard to reason about. You should think carefully about why the bug happened in the first place and what the right solution to the problem is. Regression testing (i.e., re-running the full test suite after every fix, including a new test case capturing the fault that was just fixed) is important whenever a bug fix is introduced, but nothing can replace careful thinking about the code.

Defensive programming

Defensive programming is a form of proactive debugging: implementing code that will later be easy to debug. Some excellent techniques include the following (see the rep_ok sketch after this list):

- Make functions check their preconditions, raising meaningful exceptions when the preconditions are violated.
- Write a function that checks the rep invariant of each data abstraction, and apply it to every value that crosses the abstraction boundary.
- Use assertions liberally to document and check assumptions in the middle of computations.
- Fail fast: detect errors as close as possible to where they originate, rather than letting bad values propagate through the program.
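Here is a sketch of the rep-invariant technique for the rational abstraction from earlier (invalid_arg "0" raises the Invalid_argument "0" exception its spec promises):

(* Checks the rep invariant, returning the value unchanged if it holds.
 * Producers pass every value they return through [rep_ok], so a
 * violated invariant is caught at its source rather than far away. *)
let rep_ok ((_, q) as r : rational) : rational =
  assert (q <> 0);
  r

let create p q =
  if q = 0 then invalid_arg "0" else rep_ok (p, q)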

Sometimes programmers worry that defensive programming will be too expensive, either in the time it costs them to implement the checks initially or in the run-time cost of checking assertions. These concerns are usually misplaced. The time and money it costs society to repair faults in software suggest that we could all afford to have programs that run a little more slowly.

Terms and concepts

validation, social methods, code walkthrough, code inspection, pair programming, formal methods, verification, testing, test coverage, boundary case, representative case, black-box testing, glass-box (white-box) testing, path-complete, consumer, producer, randomized (fuzz) testing, fault, failure, debugging, regression testing, defensive programming, rep invariant

Further reading

- Michael E. Fagan. "Design and code inspections to reduce errors in program development." IBM Systems Journal 15(3), 1976.
- Capers Jones. Applied Software Measurement. McGraw-Hill, 1991.