CS 3110 Lecture 9
Representation and Module Invariants

We need to write comments in module implementations that help implementers and maintainers reason about the code within the module. We'd like them to be able to reason about individual functions within the module in a local way, so they can judge whether each function is correctly implemented without thinking about every other function in the module. Last time we saw the abstraction function, which is essential to this process because it explains how information within the module is viewed abstractly by module clients.

Commutation diagrams

Using the abstraction function, we can now talk about what it means for an implementation of an abstraction to be correct. It is correct exactly when every operation that takes place in the concrete space makes sense when mapped by the abstraction function into the abstract space. This can be visualized as a commutation diagram:

A commutation diagram means that if we take the two paths around the diagram, we have to get to the same place. Suppose that we start from a concrete value and apply the actual implementation of some operation to it to obtain a new concrete value or values. When viewed abstractly, a concrete result should be an abstract value that is a possible result of applying the function, as described in its specification, to the abstract view of the actual inputs. For example, consider the union function from the implementation of sets as lists with repeated elements covered last time. Applying this function to the concrete pair ([1,3],[2,2]) corresponds to the lower-left corner of the diagram. The result of this operation is the list [2,2,1,3], whose corresponding abstract value is the set {1,2,3}. Note that if we apply the abstraction function AF to the lists [1,3] and [2,2], we get the sets {1,3} and {2}. The commutation diagram requires that in this instance the union of {1,3} and {2} is {1,2,3}, which is of course true.
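We can check this instance of the diagram in code. The following is a sketch under two assumptions: that union on the duplicates representation simply appends its arguments (which matches the [2,2,1,3] result above), and that the abstraction function can be made concrete as a sorted, duplicate-free list so that equal abstract values compare equal.

```ocaml
(* Sets as lists possibly containing duplicates.  Assumption: union
   appends, which yields [2; 2; 1; 3] on the example above. *)
let union s1 s2 = s2 @ s1

(* The abstraction function, made concrete as a sorted list without
   duplicates so that equal abstract values compare equal. *)
let af l = List.sort_uniq compare l

(* Both paths around the commutation diagram reach the same place. *)
let () =
  let c1 = [1; 3] and c2 = [2; 2] in
  assert (union c1 c2 = [2; 2; 1; 3]);
  assert (af (union c1 c2) = af (af c1 @ af c2))
```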

Some specifications, as we have seen, are nondeterministic: they do not fully specify the abstract behavior of the functions. For nondeterministic specifications there are several possible arrows leading from the abstract state at the upper left corner of the commutation diagram to other states. The commutation diagram is satisfied as long as the implemented function completes the diagram for any one of those arrows.

Some missing information

Recall the SETSIG signature from last time:

module type SETSIG = sig
  type 'a set
  val empty : 'a set
  val add : 'a -> 'a set -> 'a set
  val mem : 'a -> 'a set -> bool
  val rem : 'a -> 'a set -> 'a set
  val size: 'a set -> int
  val union: 'a set -> 'a set -> 'a set
  val inter: 'a set -> 'a set -> 'a set
end

Last time we talked about two implementations: a list of integers with no duplicates and a list of integers possibly with duplicates, each with its own abstraction function.

Consider the size function in each of the two implementations. For the list of integers with no duplicates the size is just the length of the list:

let size = List.length

But for the representation of a list of integers with possible duplicates we need to make sure we don't double-count any duplicate elements:

let rec size l =
  match l with
  | [] -> 0
  | h :: t -> size t + (if List.mem h t then 0 else 1)
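As a quick check, this version counts each distinct element exactly once even when duplicates are present. A self-contained, runnable sketch:

```ocaml
(* size for the duplicates representation: count an element only at
   its last occurrence, so duplicates are not double-counted. *)
let rec size l =
  match l with
  | [] -> 0
  | h :: t -> size t + (if List.mem h t then 0 else 1)

(* [2; 2; 1; 3] represents the abstract set {1, 2, 3}, so size is 3. *)
let () = assert (size [2; 2; 1; 3] = 3)
```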

How do we know that we don't need to do this check in the representation with no duplicates? Since the code does not explicitly say that there are no duplicates, implementers will not be able to reason locally about whether functions like size are implemented correctly.

If we think about this in terms of the commutation diagram, we see that the abstraction function is not enough.

Representation Invariant

We can fix this by adding a second piece of information to the implementation: the representation invariant (or rep invariant, or RI). The rep invariant defines what concrete data values are valid representations (reps) of abstract values. For sets represented as lists without duplicates we write this as part of the comment together with the abstraction function:

module Set : SETSIG = struct
  (* Simple implementation of sets as lists without duplicates 
   * Abstraction function:  the list [a1;...;an] represents the 
   * set {a1;...;an}. [] represents the empty set, {}.
   *
   * Representation invariant: the list contains no duplicate
   * elements.
   *)
  type 'a set = 'a list
  ...

The rep invariant holds for any valid representation; it is a constraint on the implementation's data structures that must be maintained at all times. A value of the representation type that does not satisfy the rep invariant is invalid: it does not correspond (via the abstraction function) to any abstract value. The correct functioning of the implemented operations depends on this constraint. If it is broken, functions such as size will not return the correct answer.

The rep invariant is a condition that is intended to hold for all values of the abstract type (e.g., set). Therefore, in implementing one of the operations of the abstract data type, it can be assumed that any arguments of the abstract type satisfy the rep invariant.

Rep invariant vs. abstraction function

We observed earlier that the abstraction function may be a partial function. In order to make sure that an implementation works—that it completes the commutation diagram—it had better be the case that the implementation never produces any concrete values that do not map to abstract values.

The role of the representation invariant is to restrict the domain of the abstraction function to those values on which the implementation is going to work properly. The relationship between the representation invariant and the abstraction function is depicted in this figure:

Because the implementer of an operation may assume that all arguments of the abstract type satisfy the rep invariant, local reasoning about correctness is restored: we can use the rep invariant and abstraction function to judge whether the implementation of a single operation is correct in isolation from the rest of the module. It is correct if, assuming that:

  1. the function's requires and checks clauses hold and
  2. the concrete representations of all values of the abstract type satisfy the rep invariant

we can show that

  1. the returns clause of the function is satisfied (that is, the commutation diagram holds) and
  2. all new values of the abstract type that are created have concrete representations satisfying the rep invariant

The rep invariant makes it easier to write code that is provably correct, because it means that we don't have to write code that works for all possible incoming concrete representations--only those that satisfy the rep invariant. This is why the set code doesn't have to work on lists that contain duplicate elements. On return there is a corresponding responsibility to produce only values that satisfy the rep invariant. As suggested in the figure above, the rep invariant holds for all reps both before and after the functions, which is why we call it an invariant at all.

repOK

When implementing a complex abstract data type, it is often helpful to write a function internal to the module that checks that the rep invariant holds. This function provides an additional level of assurance about your reasoning about the correctness of the code. By convention we will call this function repOK; given an abstract type (say, set) implemented as a concrete type (say, int list), it always has the same specification:

(* Returns whether x satisfies the representation invariant *)
let repOK(x: int list): bool = ...
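For the no-duplicates representation, one plausible way to fill this in (a sketch, not the only correct implementation) is to compare the list's length before and after deduplication:

```ocaml
(* Sketch: the list satisfies the rep invariant iff removing
   duplicates (List.sort_uniq) does not shrink it. *)
let repOK (x : int list) : bool =
  List.length (List.sort_uniq compare x) = List.length x

let () =
  assert (repOK [1; 3]);
  assert (not (repOK [2; 2; 1; 3]))
```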

The repOK function can be used to help us implement a module and be sure that each function is independently correct. The trick is to bulletproof each function in the module against all other functions by having it apply repOK to any values of the abstract type that come from outside. In addition, if it creates any new values of the abstract type, it applies repOK to them to ensure that it isn't breaking the rep invariant itself. With this approach, a bug in one function is less likely to create the appearance of a bug in another.

repOK as an identity function

A more convenient way to write repOK is to make it an identity function that raises an exception if the rep invariant doesn't hold. Making it an identity function lets us conveniently test the rep invariant in various ways, as shown below.

(* The identity function.
 * Checks whether x satisfies the representation invariant. *)
let repOK(x: int list): int list = ...
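As a sketch, for the no-duplicates representation this version might look like the following (the choice of exception and message is ours, not fixed by the specification):

```ocaml
(* Identity-function variant: returns its argument unchanged if the
   rep invariant holds, and raises Failure otherwise. *)
let repOK (x : int list) : int list =
  let rec no_dups = function
    | [] -> true
    | h :: t -> not (List.mem h t) && no_dups t
  in
  if no_dups x then x else failwith "rep invariant violated"

let () = assert (repOK [1; 3] = [1; 3])
```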

Here is an example of how we might use repOK for the implementation of sets given previously. Notice that repOK is applied to all sets that are created. This ensures that if a bad set representation is created, it will be detected immediately. In case we somehow missed a check on creation, we also apply repOK to incoming set arguments. If there is a bug, these checks will help us quickly figure out where the rep invariant is being broken.
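As a sketch of the idea (the function bodies here are illustrative, not the lecture's exact code), a fragment of the no-duplicates implementation might apply repOK like this:

```ocaml
module Set = struct
  (* Sets as lists without duplicates. *)
  type 'a set = 'a list

  (* Identity-style repOK: raises if the rep invariant is broken. *)
  let repOK (s : 'a set) : 'a set =
    let rec no_dups = function
      | [] -> true
      | h :: t -> not (List.mem h t) && no_dups t
    in
    if no_dups s then s else failwith "rep invariant violated"

  let empty = repOK []
  (* Check the incoming set, and check the set we create. *)
  let add x s = repOK (if List.mem x (repOK s) then s else x :: s)
  let mem x s = List.mem x (repOK s)
  let size s = List.length (repOK s)
end

let () =
  let s = Set.add 1 (Set.add 2 Set.empty) in
  assert (Set.size s = 2);
  assert (Set.mem 1 s)
```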



Production vs. development code

Calling repOK on every argument can be too expensive for the production version of a program. The repOK above is quite expensive (though it could be implemented more cheaply). For production code it may be more appropriate to use a version of repOK that only checks the parts of the rep invariant that are cheap to check. When there is a requirement that there be no run-time cost, repOK can be changed to an identity function (or macro) so the compiler optimizes away the calls to it. However, it is a good idea to keep around the full code of repOK (perhaps in a comment) so it can be easily reinstated during future debugging.
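One common pattern (a sketch; the flag name is our assumption) is to gate the expensive check on a debug flag, keeping the full check in the source so it is easy to reinstate:

```ocaml
(* In production, set debug to false: repOK degenerates to the
   identity function, which the compiler can often optimize away. *)
let debug = true

let repOK (x : int list) : int list =
  if not debug then x
  else
    (* Full O(n^2) check, kept here for future debugging. *)
    let rec no_dups = function
      | [] -> true
      | h :: t -> not (List.mem h t) && no_dups t
    in
    if no_dups x then x else failwith "rep invariant violated"

let () = assert (repOK [1; 3] = [1; 3])
```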

Module invariants

Invariants on data are useful even when writing modules that are not easily considered to be providing abstract data types. Sometimes it is difficult to identify an abstract view of the data provided by the module, and there may not be any abstract type at all. Invariants are important even without an abstraction function, because they document the legal states and representations that the code is expected to handle correctly. In general, we use the term module invariant for any invariant enforced by a module. In the case of an ADT, the rep invariant is a module invariant. Module invariants are useful for understanding how the code works, and also for maintenance, because the maintainer can avoid changes to the code that violate the module invariant.

Module invariants and code evolution

A strong module invariant is not always the best choice, because it restricts future changes to the module. We described interface specifications as a contract between the implementer of a module and the user. A module invariant is a contract between the implementer and herself, or among the various implementers of the module, present and future. According to assumption 2, above, ADT operations may be implemented assuming that the rep invariant holds. If the rep invariant is ever weakened (made more permissive), some parts of the implementation may break. Thus, one of the most important purposes of the rep invariant is to document exactly what may and what may not be safely changed about a module implementation. A weak invariant forces the implementer to work harder to produce a correct implementation, because less can be assumed about concrete representation values, but conversely it gives maximum flexibility for future changes to the code.

Modularity and module invariants

A sign of good code design is that invariants on program data are enforced in a localized way, within modules, so that programmers can reason about whether the invariant is enforced without thinking about the rest of the program. To do this it is necessary to figure out just the right operations to be exposed by the various modules, so that useful functionality can be provided while also ensuring that invariants are maintained.

Conversely, a common design error is to break up a program into a set of modules that simply encapsulate data and provide low-level accessor operations, while putting all the interesting logic of the program in one main module. The problem with this design is that all the interesting (and hard!) code still lives in one place, and the main module is responsible for enforcing many complex invariants among the data. This kind of design does not break the program into simpler parts that can be reasoned about independently. It shows the big danger sign that the abstractions aren't right: all the code is either boring code, or overly complex code that is hard to reason about. It is a kind of fake modularity.

For example, suppose we are implementing a graphical chess game. The game state includes a board and a bunch of pieces. We might want to keep track of where each piece is, as well as what is on each board square. And there may be a good deal of state in the graphical display too. A good design would ensure that the board, the pieces, and the graphical display stay in sync with each other in code that is separate from that which handles the detailed rules of the game of chess.

Principles for modular design

Given a large programming task, we want to divide it into modules in an effective way. There are several goals. To make the user of the software happy, we want a program that is correct, secure, and has acceptable performance. But to keep the cost of development and maintenance low, and to increase the chances that the program is correct, we want a modular design that has loose coupling and permits local reasoning. These goals are in tension. We can roughly characterize design tradeoffs along an axis between loose and tight coupling:

Issue                 Loose coupling              Tight coupling
Size of interface     narrow: few operations      wide: many operations
Complexity            simple specifications       complex specifications
Invariants            local                       global
Pre/post-conditions   weak, nondeterministic      strong, deterministic
Correctness           easier to get right         harder to get right
Performance           may sacrifice performance   may expose optimizations

Thus, if we want software that is very modular and relatively easy to build correctly, we should design modules that have simple, narrow interfaces with relatively few operations with simple specifications. In some cases we may need more performance than an aggressively modular design offers, and may need to make our specification more complex or add new operations.

A good rule of thumb is to start with as modular and simple a design as possible. Interfaces should be narrow, exposing only as many operations as are necessary for clients to carry out their tasks. Invariants should be simple and enforced locally. Avoid premature optimization that results in complex, tightly coupled programs, because very often the performance bottlenecks are not what is expected. You don't want to pay the price for complexity that turns out to be unnecessary. If performance becomes a problem, a simple, clean design is usually a good starting point for refinements that improve performance.

In general, the right choice along this axis depends on the system being built, and engineering judgment is required. Software designers must balance issues of cost, performance, correctness, usability, and security. They are expected all the time to make judgment calls trading off among these issues. One key is to realize that you are making these tradeoffs and to know when to make them. Probably the most common errors of judgment involve choosing design points too far toward the tight-coupling end of this axis, resulting in systems that are unreliable and hard to maintain.