Modules

Topics:

modular programming

language features for modularity

structures and modules

scoping and modules

signatures and module types

abstract types

functional data structures

sharing constraints

compilation units

We've been building very small programs. When a program is small enough, we can keep all of the details of the program in our heads at once. Real application programs are 100 to 10000 times larger than any program you have likely written (or maybe even worked on); they are simply too large and complex to hold all their details in our heads. They are also written by multiple authors. To build large software systems requires techniques we haven't talked about so far.

One key solution to managing complexity of large software is modular programming: the code is composed of many different code modules that are developed separately. This allows different developers to take on discrete pieces of the system and design and implement them without having to understand all the rest. But to build large programs out of modules effectively, we need to be able to write modules that we can convince ourselves are correct in isolation from the rest of the program. Rather than have to think about every other part of the program when developing a code module, we need to be able to use local reasoning: that is, reasoning about just the module and the contract it needs to satisfy with respect to the rest of the program. If everyone has done their job, separately developed code modules can be plugged together to form a working program without every developer needing to understand everything done by every other developer in the team. This is the key idea of modular programming.

Therefore, to build large programs that work, we must use abstraction to make it manageable to think about the program. Abstraction is simply the removal of detail. A well-written program has the property that we can think about its components (such as functions) abstractly, without concerning ourselves with all the details of how those components are implemented.

Modules are abstracted by giving specifications of what they are supposed to do. A good module specification is clear, understandable, and give just enough information about what the module does for clients to successfully use it. This abstraction makes the programmer's job much easier; it is helpful even when there is only one programmer working on a moderately large program, and it is crucial when there is more than one programmer.

Languages often contain mechanisms that support modules directly. OCaml is one, as we will see shortly. In general (i.e. across programming languags), a module specification is known as an interface, which provides information to clients about the module's functionality while hiding the implementation. Object-oriented languages support modular programming with classes. The Java interface construct is one example of a mechanism for specifying the interface to a class (but by no means the only one). A Java interface informs clients of the available functionality in any class that implements it without revealing the details of the implementation. But even just the public methods of a class constitute an interface in the more general sense—an abstract description of what the module can do.

Once we have defined a module and its interface, developers working with the module take on distinct roles. It is likely that most developers are clients of the module who understand the interface but do not need to understand the implementation of the module. A developer who works on the module implementation is an implementer. The module interface is a contract between the client and the implementer, defining the responsibilities of both. Contracts are very important because they help us isolate the source of the problem when something goes wrong—and know who to blame!

It is good practice to involve both clients and implementers in the design of a module's interface. Interfaces designed solely by one or the other can be seriously deficient, because each side may have its own view of what the final product should look like, and these may not align. So mutual agreement on the contract is essential. It is also important to think hard about global module structure and interfaces early, even before any coding is done, because changing an interface becomes more and more difficult as the development proceeds and more of the code comes to depend on it. Finally, it is important to be completely unambiguous in the specification. In OCaml, the signature is part of writing an unambiguous specification, but is by no means the whole story. While beyond the scope of this course, Interface Description (or Definition) Languages (IDL's) are used to specify interfaces in a language-independent way so that different modules do not even necessarily need to be implemented in the same language.

In modular programming, modules are used only through their declared interfaces, which the language may help enforce. This is true even when the client and the implementer are the same person. Modules decouple the system design and implementation problem into separate tasks that can be carried out largely independently. When a module is used only through its interface, the implementer has the flexibility to change the module as long as the module still satisfies its interface.

Features that support modularity

Any language's module system will usually provide support for these concerns:

Namespaces. A namespace provides a set of names that are grouped together, are usually logically related, and are distinct from other namespaces. That enables the name foo in one namespace to have a distinct meaning from foo in another namespace. A namespace is a scoping mechanism. Namespaces are essential for modularity, because without them, the names one programmer in a large team chooses could collide with the names another programmer chooses. In Java, packages and classes provide namespaces. In OCaml, there is a language feature called structures that is used to group names.

Abstraction. An abstraction hides some information while revealing other information. Abstraction thus enables encapsulation, aka information hiding. Usually, abstraction mechanisms for modules allow revealing some names that exist inside the module, but hiding some others. Abstractions therefore describe relationships among modules: there might be many modules that could considered to satisfy a given abstraction. Abstraction is essential for modularity, because it enables implementers of a module to hide the details of the implementation from clients, thus preventing the clients from abusing those details. In a large team, the modules one programmer designs are thereby protected from abuse by another programmer. It also enables clients to be blissfully unaware of those details. In a large team, no programmer has to be aware of all the details of all the modules. In Java, interfaces and abstract classes provide abstraction. In OCaml, there is a language feature called a signature that is used to abstract structures by hiding some of the structure's names.

Code reuse. A module system enables code reuse by providing features that enable code from one module to be used as part of another module without having to copy that code. Code reuse thereby enables programmers to build on the work of others in a way that is maintainable: when the implementer of one module makes an improvement in that module, all the programmers who are reusing that code automatically get the benefit of that improvement. Code reuse is essential for modularity, because it enables "building blocks" that can be assembled and reassembled to form complex pieces of software. In Java, subtyping and inheritance provide code reuse. In OCaml, there are language features called functors and includes that are used to reuse code by producing new code out of old code. We will cover those features in the next lecture.

Structures and modules

Modules in OCaml are implemented by module definitions that have the following syntax:

module ModuleName = struct 
  (* definitions *)
end

Here, for example, is a module for stacks implemented as lists:

module ListStack = struct
  let empty = []
  let is_empty s = (s = [])

  let push x s = x :: s

  let peek = function
    | [] -> failwith "Empty"
    | x::_ -> x

  let pop = function
    | [] -> failwith "Empty"
    | _::xs -> xs
end

Module names must begin with an uppercase letter. The part of the module definition that is written

struct (* definitions *) end

is called a structure. A structure is simply a sequence of definitions. The structure itself is anonymous—it has no name—until it is bound to a name by a module definition.

Modules partition the namespace, so that any symbol x that is bound in the implementation of a module named Module must be referenced by the qualifed name Module.x outside the implementation of the module (unless the namespace has been exposed using open).

The implementation of a module can contain type definitions, exception definitions, let definitions, open statements, as well as some other things we haven't seen so far. All the definitions inside a module are permitted to end with double semicolon ;; for compatibility with the toplevel, but 3110 considers it unidiomatic to do so.

Modules are not as first-class in OCaml as functions. There are some language extensions that make it possible to bundle up modules as values, but we won't be looking at them. If you're curious you can have a look at the manual.

Opening a module

After a module M has been defined, you can access the names within it using the . operator. For example:

# module M = struct let x = 42 end;;
module M : sig val x : int end 

# M.x;;
- : int = 42

You can also bring all of the definitions of a module into the current scope using open. Continuing our example above:

# x;;
Error: Unbound value x

# open M;;

# x;;
- : int = 42

Opening a module is like writing a local definition for each name defined in the module. open String, for example, brings all the definitions from the String module into scope, and has an effect similar to the following on the local namespace:

let length = String.length
let get = String.get
let lowercase_ascii = String.lowercase_ascii
...

If there types, exceptions, or modules defined in a module, those also are brought into scope with open. For example, if we're given this module:

module M = struct
  let x = 42
  type t = bool
  exception E
  module N = struct
    let y = 0
  end
end

then open M would have an effect similar to the following:

let x = M.x
type t = M.t
type exn += E = M.E
module N = M.N

(If the line with exn is mysterious, don't worry about it; it makes use of extensible variants, which we aren't covering. It might help to know that exception E is syntactic sugar for type exn += E, which is to say that it extends the type exn, which is an extensible variant, with a new constructor E.)

Pervasives. There is a special module called Pervasives that is automatically opened in every OCaml program. It contains the "built-in" functions and operators, as we've seen before. You therefore never need to prefix any of the names it defines with Pervasives., though you could do so if you ever needed to unambiguously identify a name from it.

Opening a module in a limited scope

If two modules both define the same name, and you open both of them, what does that name mean? For example:

module M = struct let x = 42 end
module N = struct let x = "bigred" end
open M
open N
(* what is [x]?  an [int] or a [string]? *)

The answer is that any names defined later shadow names defined earlier. So in the local namespace above, x is a string.

If you're using many third-party modules inside your code, chances are you'll have at least one collision like this. Often it will be with a standard higher-order function like map that is defined in many library modules. So it's generally good practice not to open all the modules you're going to use at the top of a .ml file. (This is perhaps different than how you're used to working with (e.g.) Java, where you might import many packages with *.)

Instead, it's good to restrict the scope in which you open modules. There are a couple ways of doing that.

Inside any expression you can locally open a module, such that the module's names are in scope only in the rest of that expression. The syntax for this is let open M in e; inside e all the names from M are in scope. This is useful for (e.g.) opening a module in the body of a function:

(* without [open] *)
let f x = 
  let y = List.filter ((>) 0) x in  
  ...  (* many more lines of code that use [List.] a lot *)

(* with [open] *)
let f x = 
  let open List in (* [filter] is now bound to [List.filter] *)
  let y = filter ((>) 0) x in  
  ...  (* many more lines of code that now can omit [List.] *)

There is a syntactic sugar for the above: M.(e). Again, inside e all the names from M are in scope. This is useful for briefly using M in a short expression:

(* remove surrounding whitespace from [s] and convert it to lower case *)
let s = "BigRed " 
let s' = s |> String.trim |> String.lowercase_ascii (*long way*)
let s' = String.(s |> trim |> lowercase_ascii)      (*shorter way*)

Signatures and module types

Module types let us describe groups of related modules. The syntax for defining a module type is:

module type ModuleTypeName = sig 
  (* declarations *)
end

Here is a module type for stacks:

module type Stack = sig
  type 'a stack
  val empty    : 'a stack
  val is_empty : 'a stack -> bool
  val push     : 'a -> 'a stack -> 'a stack
  val peek     : 'a stack -> 'a
  val pop      : 'a stack -> 'a stack
end

By convention, the module type name is capitalized, but it does not have to be. There is an older convention from the SML language that signature names are in ALLCAPS, and you might occasionally see that still, but we don't typically follow it in OCaml.

The part of the module type that is written

sig (* declarations *) end

is called a signature. A signature is simply a sequence of declarations. The signature itself is anonymous—it has no name—until it is bound to a name by a module type definition. The syntax val id : t means that there is a value named id whose type is t.

A structure matches a signature if the structure provides definitions for all the names specified in the signature (and possibly more), and these definitions meet the type requirements given in the signature. Usually, a definition meets a type requirement by providing a value of exactly that type. But the definition could instead provide a value that has a more general type. For example:

module type Sig = sig
  val f : int -> int
end

module M1 : Sig = struct
  let f x = x+1
end

module M2 : Sig = struct
  let f x = x
end

Module M1 provides a function f of exactly the type specified by Sig, namely, int->int. Module M2 provides a function that is instead of type 'a -> 'a. Both M1 and M2 match Sig. Note that anywhere a value v1 of type int->int is needed, it's safe to instead use a value v2 of type 'a -> 'a. That's because if we apply v2 to an int, its type guarantees us that we will get an int back.

Returning to our example, the structure given above for ListStack doesn't yet match the signature given above for Stack, because that structure doesn't define the type 'a stack. So we could amend the definition of ListStack to:

module ListStack = struct
  type 'a stack = 'a list
  (* the rest is the same as before *)
end

Now that structure matches the signature of Stack. We can ask the compiler to check that by providing a module type annotation for the module:

module ListStack : Stack = struct
  type 'a stack = 'a list
  (* the rest is the same as before *)
end

The type 'a stack is an example of a representation type: a type that is used to represent a version of a data structure. Here, we're implementing stacks using lists, so the representation type is a list.

Abstract types

The type 'a stack above is abstract: the Stack module type says that there is a type name 'a stack in any module that implements the module type, but it does not say what that type is defined to be. Once we add the : Stack module type annotation to ListStack, its 'a stack type also becomes abstract. Outside of the module, no one is allowed to know that 'a stack and 'a list are synonyms.

A module that implements a module type must specify concrete types for the abstract types in the signature and define all the names declared in the signature. Only declarations in the signature are accessible outside of the module. For example, functions defined in the module's structure but not in the module type's signature are not accessible. We say that the structure is sealed by the signature: nothing except what is revealed in the signature may be accessed.

Here is another implementation of the Stack module type:

module MyStack : Stack = struct
  type 'a stack = 
  | Empty 
  | Entry of 'a * 'a stack

  let empty = Empty
  let is_empty s = s = Empty
  let push x s = Entry (x, s)
  let peek = function
    | Empty -> failwith "Empty"
    | Entry(x,_) -> x
  let pop = function
    | Empty -> failwith "Empty"
    | Entry(_,s) -> s
end

In that implementation, we provide our own custom variant for the representation type. Of course, that custom variant is more or less the same as the built-in list type: it has two constructors, one the carries no data, and the other that carries a pair of an element and (recursively) the same variant type.

Because 'a stack is abstract in the Stack module type, no client of this data structure will be able to discern whether stacks are being implemented with the built-in list type or the custom one we just used. Clients may only access the stack in the ways that are defined by the Stack interface, which nowhere mentions list or Empty or Entry.

You can even observe that abstraction in utop. Observe what happens when utop displays the value that results from this expression:

# MyStack.push 1 MyStack.empty;;
- : int MyStack.stack = <abstr>

The value has type int MyStack.stack, which is to say, it is the MyStack.stack type constructor applied to int. And the value is...well, utop won't tell us! It simply prints <abstr> to indicate that the value has been abstracted.

Notice how verbose the type int MyStack.stack is. The module name already tells us that the value is related to MyStack; the word stack following that isn't particularly helpful. For that reason, it is idiomatic OCaml to name the primary representation type of a data structure simply t. Here's the Stack module type rewritten that way:

module type Stack = sig
  type 'a t
  val empty    : 'a t
  val is_empty : 'a t -> bool
  val push     : 'a -> 'a t -> 'a t
  val peek     : 'a t -> 'a
  val pop      : 'a t -> 'a t
end

Given that renaming, here's what the toplevel would display as the type:

# MyStack.push 1 MyStack.empty;;
- : int MyStack.t = <abstr>

And now by convention we would usually pronounce that type as "int MyStack", simply ignoring the t, though it does technically have to be there to be legal OCaml code.

Advanced digression on printing values of an abstract type. It is possible to install custom printers so that the toplevel will convert a value of an abstract type to a string and print it instead of <abstr>. This doesn't violate abstraction, because programmers still can't access the value. It just allows the toplevel to provide better pretty printing. Here's an example utop session, based on code that appears below:

# #install_printer ListStack.format;;

# open ListStack;;

# empty |> push 1 |> push 2;;
- : int stack = [2; 1; ]

Notice how the value of the stack is helpfully printed. The code that makes this happen is in ListStack.format:

module type Stack = sig
  type 'a stack
  (* ... all the usual operations ... *)
  val format : (Format.formatter -> 'a -> unit) -> Format.formatter -> 'a stack -> unit
end

module ListStack : Stack = struct
  type 'a stack = 'a list
  (* ... all the usual operations ... *)
  let format fmt_elt fmt s =
    Format.fprintf fmt "[";
    List.iter (fun elt -> Format.fprintf fmt "%a; " fmt_elt elt) s; 
    Format.fprintf fmt "]"
end

For more information, see the toplevel manual (search for #install_printer), and the Format module, as well as this patch in the OCaml Bug Tracker.

Functional data structures

A functional data structure is one that does not make use of any imperative features. That is, no operations of the data structure have any side effects. It's possible to build functional data structures both in functional languages and in imperative languages.

Functional data structures have the property of being persistent: updating the data structure with one of its operations does not change the existing version of the data structure but instead produces a new version. Both exist and both can still be accessed. A good language implementation will ensure that any parts of the data structure that are not changed by an operation will be shared between the old version and the new version. Any parts that do change will be copied so that the old version may persist. The opposite of a persistent data structure is an ephemeral data structure: changes are destructive, so that only one version exists at any time. Both persistent and ephemeral data structures can be built in both functional and imperative languages.

The ListStack module above is functional: the push and pop operations do not mutate the underlying list, but instead return a new list. We can see that in the following utop session (in which we assume the ListStack module has been defined as module ListStack = struct ..., without the module type annotation : Stack we added above):

# open ListStack;;

# let s = push 1 (push 2 empty);;
val s : int list = [1; 2] 

# let s' = pop s;;
val s' : int list = [2]  

# s;;
- : int list = [1; 2]

The value s is unchanged by the pop operation; both versions of the stack coexist.

The Stack module type gives us a strong hint that the data structure is functional in the types is provides for push and pop:

val push : 'a -> 'a stack -> 'a stack
val pop : 'a stack -> 'a stack

Both of those take a stack as an argument and return a new stack as a result. An ephemeral data structure usually would not bother to return a stack. In Java, for example, similar methods might return void; the equivalent in OCaml would be returning unit, which we'll see in the lab associated with this lecture.

Example: Arithmetic

Here is a module type that represents values that support the usual operations from arithmetic, or more precisely, a field:

module type Arith = sig
  type t
  val zero  : t
  val one   : t
  val (+)   : t -> t -> t
  val ( * ) : t -> t -> t
  val (~-)  : t -> t
end

There are a couple syntactic curiosities here. We have to write ( * ) instead of (*) because the latter would be parsed as beginning a comment. And we write the ~ in (~-) to indicate a unary negation operator.

Here is a module that implements that module type:

module Ints : Arith = struct
  type t    = int
  let zero  = 0
  let one   = 1
  let (+)   = Pervasives.(+)
  let ( * ) = Pervasives.( * )
  let (~-)  = Pervasives.(~-)
end

Outside of the module Ints, the expression Ints.(one + one) is perfectly fine, but Ints.(1 + 1) is not, because t is abstract: outside the module no one is permitted to know that t = int. In fact, the toplevel can't even give us good output about what the sum of one and one is!

# Ints.(one + one);;
- : Ints.t = <abstr>

The reason why is that the type Ints.t is abstract: the module type doesn't tell use that Ints.t is int. This is actually a good thing in many cases: code outside of Ints can't rely on the internal implementation details of Ints, and so we are free to change it. Since the Arith interface only has functions that return t, so once you have a value of type t, all you can do is create other values of type t.

When designing an interface with an abstract type, you will almost certainly want at least one function that returns something other than that type. For example, it's often useful to provide a to_string function. We could add that to the Arith module type:

module type Arith = sig
  (* everything else as before, and... *)
  val to_string : t -> string
end

And now we would need to implement it as part of Ints:

module Ints : Arith = struct
  (* everything else as before, and... *)
  let to_string = string_of_int
end

Now we can write:

# Ints.(to_string (one + one));;
- : string = "2"

Sharing constraints

Sometimes you actually want to expose the type in an implementation of a module. You might like to say "the module Ints implements Arith and the type t is int," and allow external users of the Ints module to use the fact that Ints.t is int.

OCaml lets you write sharing constraints that refine a signature by specifying equations that must hold on the abstract types in that signature. If T is a module type containing an abstract type t, then T with type t = int is a new module type that is the same as T, except that t is known to be int. For example, we could write:

module Ints : (Arith with type t = int) = struct
  (* all of Ints as before *)
end

Now both Ints.(one + one) and Ints.(1 + 1) are legal.

We don't have to specify the sharing constraint in the original definition of the module. We can create a structure, bind it to a module name, then bind it to another module name with its types being either abstract or exposed:

module Ints = struct
  type t    = int
  let zero  = 0
  let one   = 1
  let (+)   = Pervasives.(+)
  let ( * ) = Pervasives.( * )
  let (~-)  = Pervasives.(~-)
end

module IntsAbstracted : Arith = Ints
(* IntsAbstracted.(1 + 1) is illegal *)

module IntsExposed : (Arith with type t = int) = Ints
(* IntsExposed.(1 + 1) is legal *)

This can be a useful technique for testing purposes: provide one name for a module that clients use in which the types are abstract, and provide another name that implementers use for testing in which the types are exposed.

Semantics

The semantics of the OCaml module system is sufficiently complex that it's better left to a course like CS 6110. Here we'll just sketch a couple of the relevant facts.

Dynamic semantics. To evaluate a structure struct D1; ...; Dn end where each of the Di is a definition, evaluate each definition in order.

Static semantics. If a module is given a module type, as in module M : T = struct ... end, then there are two checks the compiler must perform:

Signature matching: every name declared in T must be defined in M.
Encapsulation: any name defined in M that does not appear in T is not visible to code outside of M.

Compilation units

A compilation unit is a pair of OCaml source files in the same directory. They share the same base name, call it x, but their extensions differ: one file is x.ml, the other is x.mli. The file x.ml is called the implementation, and x.mli is called the interface. When the compiler encounters these, it treats them as defining a module and a signature like this:

module X : sig (* insert contents of x.mli here *) end = struct
  (* insert contents of x.ml here *)
end

The unit name X is derived from the base name x by just capitalizing the first letter. Notice that there is no named module type being defined; the signature of X is actually an anonymous sig.

The standard library uses compilation units to implement most of the modules you have been using so far, like List and String. You can see that in the standard library source code.

Comments. The comments that go in an interface file vs. an implementation file are different. Interface files will be read by clients of an abstraction, so the comments that go there are for them. These will generally be specification comments describing how to use the abstraction, the preconditions for calling its functions, what exceptions they might raise, and perhaps some notes on what algorithms are used to implement operations. The standard library's List module contains many examples of these kinds of comments.

Implementation files will be read by programmers and maintainers of an abstraction, so the comments that go there are for them. These will be comments about how the representation type is used, how the code works, important internal invariants it maintains, and so forth.

An example. Tying together many of the things we have seen in this lecture and its lab, you could put this code in mystack.mli (notice that there is no sig..end around it or any module type):

type 'a t
val empty : 'a t
val is_empty : 'a t -> bool
val push : 'a -> 'a t -> 'a t
val peek : 'a t -> 'a
val pop : 'a t -> 'a t

and this code in mystack.ml (notice that there is no struct..end around it or any module):

type 'a t = 'a list

let empty = []
let is_empty s = (s = [])

let push x s = x :: s

let peek = function
  | [] -> failwith "Empty"
  | x::_ -> x

let pop = function
  | [] -> failwith "Empty"
  | _::xs -> xs

then from the command-line compile that source code (note that all we need is the .cmo file so we request it to be built instead of the .byte file):

$ ocamlbuild mystack.cmo

and launch utop and load your compilation unit for use:

# #directory "_build";;
# #load "mystack.cmo";;
# Mystack.empty;;
- : 'a Mystack.t = <abstr>

What about main()? OCaml programs do not need to have a special function named main that is invoked to start the program. Since a compilation unit is essentially just a structure, and since the semantics of structures say to evaluate each definition in order, the usual idiom is just to have the very last definition in some structure serve as the main function that kicks off whatever computation is to be done.

Summary

The OCaml module system provides mechanisms for modularity that provide the similar capabilities as mechanisms you will have seen in other languages. But seeing those mechanisms appear in different ways is hopefully helping you understand them better. OCaml abstract types and signatures, for example, provide a mechanism for abstraction that resembles Java visibility modifiers and interfaces. Seeing the same idea embodied in two different languages, but expressed in rather different ways, will hopefully help you recognize that idea when you encounter it in other languages in the future.

Moreover, the idea that a type could be abstract is a foundational notion in programming language design. The OCaml module system makes that idea brutally apparent. Other languages like Java obscure it a bit by coupling it together with many other features all at once. There's a sense in which every Java class implicitly defines an abstract type (actually, four abstract types that are related by subtyping, one for each visibility modifier [public, protected, private, and default]), and all the methods of the class are functions on that abstract type.

Using the OCaml module system can feel a bit backwards at first, though, because you pass values of an abstract type into functions, rather than invoking methods on objects. Don't worry if that's true for you; you'll get used to it quickly.

The main aspect of modularity that we have not yet seen how to accomplish is code reuse. We'll address that in the next lecture.

Terms and concepts

abstract type
abstraction
client
code reuse
compilation unit
declaration
definition
encapsulation
ephemeral data structure
functional data structure
implementation
implementer
information hiding
interface
local reasoning
modular programming
module
module type
namespace
open
persistant data structure
representation type
scope
sealed
sharing constraints
signature
signature matching
specification
structure