Mutable Data Types

Topics:

refs

mutable fields

arrays

mutable data structures


OCaml is not a pure language: it does admit side effects. We have seen that already with I/O, especially printing. But until now we have limited ourselves to the subset of the language that is immutable: values could not change. Today, we look at data types that are mutable.

Mutability is neither good nor bad. It enables new functionality that we couldn't implement (at least not easily) before, and it enables us to create certain data structures that are asymptotically more efficient than their purely functional analogues. But mutability does make code more difficult to reason about, hence it is a source of many faults in code. One reason for that might be that humans are not good at thinking about change. With immutable values, we're guaranteed that any fact we might establish about them can never change. But with mutable values, that's no longer true. "Change is hard," as they say.

Refs

A ref is like a pointer or reference in an imperative language. It is a location in memory whose contents may change. Refs are also called ref cells, the idea being that there's a cell in memory that can change.

A first example. Here's an example utop transcript to introduce refs:

# let x = ref 0;;
val x : int ref = {contents = 0}

# !x;;
- : int = 0

# x := 1;;
- : unit = ()

# !x;;
- : int = 1

At a high level, what that shows is creating a ref, getting the value from inside it, changing its contents, and observing the changed contents. Let's dig a little deeper.

The first phrase, let x = ref 0, creates a reference using ref, which is a standard-library function rather than a true keyword. That's a location in memory whose contents are initialized to 0. Think of the location itself as being an address (for example, 0x3110bae0), even though there's no way to write down such an address in an OCaml program. Applying ref is what causes the memory location to be allocated and initialized.

The first part of the response from utop, val x : int ref, indicates that x is a variable whose type is int ref. We have a new type constructor here. Much like list and option are type constructors, so is ref. A t ref, for any type t, is a reference to a memory location that is guaranteed to contain a value of type t. As usual, we should read a type from right to left: t ref means a reference to a t. The second part of the response shows us the contents of the memory location. Indeed, the contents have been initialized to 0.

The second phrase, !x, dereferences x and returns the contents of the memory location. Note that ! is the dereference operator in OCaml, not Boolean negation.

The third phrase, x := 1, is an assignment. It mutates the contents of x to be 1. Note that x itself still points to the same location (i.e., address) in memory. Variables really are immutable in that way. What changes is the contents of that memory location. Memory is mutable; variable bindings are not. The response from utop is simply (), meaning that the assignment took place, much like printing functions return () to indicate that the printing did happen.

The fourth phrase, !x, again dereferences x to demonstrate that the contents of the memory location did indeed change.

A more sophisticated example. Here is code that implements a counter. Every time next_val is called, it returns one more than the previous time.

# let counter = ref 0;;
val counter : int ref = {contents = 0}

# let next_val = 
    fun () ->
      counter := (!counter) + 1;
      !counter;;
val next_val : unit -> int = <fun> 

# next_val();;
- : int = 1

# next_val();;
- : int = 2

# next_val();;
- : int = 3

In the implementation of next_val, there are two expressions separated by a semicolon. The first expression, counter := (!counter) + 1, is an assignment that increments counter by 1. The second expression, !counter, returns the newly incremented contents of counter.

This function is unusual in that every time we call it, it returns a different value. That's quite different from any of the functions we've implemented ourselves so far, which have always been deterministic: for a given input, they always produced the same output. On the other hand, we've seen some library functions that are nondeterministic, for example, functions in the Random module, and Stdlib.read_line (Pervasives.read_line in older OCaml releases). It's no coincidence that those happen to be implemented using mutable features.

We could improve our counter in a couple of ways. First, there is a library function incr : int ref -> unit that increments an int ref by 1. Thus it is like the ++ operator in many languages in the C family. Using it, we could write incr counter instead of counter := (!counter) + 1.

Second, the way we coded the counter currently exposes the counter variable to the outside world. Maybe we'd prefer to hide it so that clients of next_val can't directly change it. We could do so by nesting counter inside the scope of next_val:

let next_val = 
  let counter = ref 0 
  in fun () ->
    incr counter;
    !counter

Now counter is in scope inside of next_val, but not accessible outside that scope.

When we gave the dynamic semantics of let expressions before, we talked about substitution. One way to think about the definition of next_val is as follows.

First, the expression ref 0 is evaluated. That returns a location loc, which is an address in memory. The contents of that address are initialized to 0.
Second, everywhere in the body of the let expression that counter occurs, we substitute for it that location. So we get:
```
fun () -> incr loc; !loc
```
Third, that anonymous function is bound to next_val.

So any time next_val is called, it increments and returns the contents of that one memory location loc.

Now imagine that we instead had written the following (broken) code:

let next_val_broken = fun () ->
  let counter = ref 0
  in incr counter;
     !counter

It's only a little different: the binding of counter occurs after the fun () -> instead of before. But it makes a huge difference:

# next_val_broken ();;
- : int = 1

# next_val_broken ();;
- : int = 1

# next_val_broken ();;
- : int = 1

Every time we call next_val_broken, it returns 1: we no longer have a counter. What's going wrong here?

The problem is that every time next_val_broken is called, the first thing it does is to evaluate ref 0 to a new location that is initialized to 0. That location is then incremented to 1, and 1 is returned. Every call to next_val_broken is thus allocating a new ref cell, whereas next_val allocates just one new ref cell.
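
The difference between the two versions comes down to when ref 0 is evaluated. As a minimal sketch (the name make_counter is ours, introduced only for illustration, and is not part of the standard library), a function that builds counters makes the point explicit: each call to make_counter allocates one fresh cell, and the closure it returns keeps reusing that one cell.

```ocaml
(* [make_counter ()] allocates one fresh ref cell and returns a closure
   that increments and then reads that cell. [make_counter] is a
   hypothetical helper, not a library function. *)
let make_counter () =
  let counter = ref 0 in
  fun () ->
    incr counter;
    !counter

(* Two independent counters: each closure owns its own cell. *)
let next_a = make_counter ()
let next_b = make_counter ()

let a1 = next_a ()  (* first call on [next_a] *)
let a2 = next_a ()  (* second call on [next_a] reuses the same cell *)
let b1 = next_b ()  (* [next_b] has a separate cell, so it starts over *)
```

Here a1 is 1, a2 is 2, and b1 is 1: allocation happens once per call to make_counter, not once per call to the returned function.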

Syntax. The first three of the following are new syntactic forms involving refs, and the last is a syntactic form that we haven't yet fully explored.

Ref creation: ref e
Ref assignment: e1 := e2
Dereference: !e
Sequencing of effects: e1; e2

Dynamic semantics.

To evaluate ref e,
- Evaluate e to a value v
- Allocate a new location loc in memory to hold v
- Store v in loc
- Return loc
To evaluate e1 := e2,
- Evaluate e2 to a value v, and e1 to a location loc.
- Store v in loc.
- Return (), i.e., unit.
To evaluate !e,
- Evaluate e to a location loc.
- Return the contents of loc.
To evaluate e1; e2,
- First evaluate e1 to a value v1.
- Then evaluate e2 to a value v2.
- Return v2. (v1 is not used at all.)
- If there are multiple expressions in a sequence, e.g., e1; e2; ...; en, then evaluate each one in order from left to right, returning only vn. Another way to think about this is that semicolon is right associative: for example, e1; e2; e3 is the same as e1; (e2; e3).
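
To see those rules working together, here is a small sketch of a function that swaps the contents of two refs. The name swap is our own choice for illustration; the standard library does not provide it under that name.

```ocaml
(* [swap a b] exchanges the contents of [a] and [b].
   The body dereferences [a], then runs two assignments in sequence;
   the value of the whole sequence is that of the last assignment, (). *)
let swap (a : 'a ref) (b : 'a ref) : unit =
  let tmp = !a in
  a := !b;
  b := tmp

let x = ref 1
let y = ref 2
let () = swap x y
(* Afterwards [!x] is 2 and [!y] is 1; [x] and [y] still name the
   same two locations as before. *)
```

Only the contents of the two locations change. The temporary tmp is needed because the first assignment overwrites the old contents of a.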

Note that locations are values that can be passed to and returned from functions. But unlike other values (e.g., integers, variants), there is no way to directly write a location in an OCaml program. That's different from languages like C, where programmers can directly write memory addresses and do arithmetic on pointers. C programmers want that kind of low-level access to do things like interface with hardware and build operating systems. Higher-level programmers are willing to forgo it to get memory safety. That's a hard term to define, but according to Hicks 2014 it intuitively means that

pointers are only created in a safe way that defines their legal memory region,
pointers can only be dereferenced if they point to their allotted memory region,
that region is (still) defined.

Static semantics. We have a new type constructor, ref, such that t ref is a type for any type t. Note that the name ref is used in two ways: as a type constructor, and as a function that constructs refs.

ref e : t ref if e : t.
e1 := e2 : unit if e1 : t ref and e2 : t.
!e : t if e : t ref.
e1; e2 : t if e1 : unit and e2 : t. Similarly, e1; e2; ...; en : t if e1 : unit, e2 : unit, ... (i.e., all expressions except en have type unit), and en : t.

The typing rule for semi-colon is designed to prevent programmer mistakes. For example, a programmer who writes 2+3; 7 probably didn't mean to: there's no reason to evaluate 2+3 then throw away the result and instead return 7. The compiler will give you a warning if you violate this particular typing rule. To get rid of the warning (if you're sure that's what you need to do), there's a function ignore : 'a -> unit in the standard library. Using it, ignore(2+3); 7 will compile without a warning. Of course, you could code up ignore yourself: let ignore _ = ().
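
A more realistic use of ignore is discarding the result of a function that is called only for its side effect. The following is a sketch under that assumption; bump is a hypothetical helper, not a library function.

```ocaml
(* [bump r] increments [r] and returns the new contents.
   [bump] is a hypothetical helper used only for this example. *)
let bump (r : int ref) : int =
  incr r;
  !r

let r = ref 0

let result =
  (* [bump r] has type [int], so it is discarded explicitly with [ignore]
     so that the left side of [;] has type [unit]. *)
  ignore (bump r);
  bump r
```

After this code runs, result is 2, because both calls to bump took effect even though the first return value was thrown away.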

Aliasing. Now that we have refs, we have aliasing: two refs could point to the same memory location, hence updating through one causes the other to also be updated. For example,

let x = ref 42 
let y = ref 42 
let z = x
let () = x := 43
let w = (!y) + (!z)

The result of executing that code is that w is bound to 85, because let z = x causes z and x to become aliases, hence updating x to be 43 also causes z to be 43.
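
Aliasing can also arise through function parameters, where it is harder to spot. As a sketch (reset_then_read is a made-up name for illustration), consider a function whose result depends on whether its two arguments are the same cell:

```ocaml
(* [reset_then_read a b] sets the contents of [a] to 0 and then
   returns the contents of [b]. Hypothetical helper for illustration. *)
let reset_then_read (a : int ref) (b : int ref) : int =
  a := 0;
  !b

let p = ref 5
let q = ref 5

(* Distinct cells: resetting [p] does not affect [q]. *)
let distinct = reset_then_read p q

(* Aliased arguments: both parameters name the one cell [q],
   so the reset is visible through [b] as well. *)
let aliased = reset_then_read q q
```

Here distinct is 5 but aliased is 0, even though the function body is the same in both calls.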

Equality. OCaml has two equality operators, physical equality and structural equality. The documentation of Stdlib.(==) (Pervasives.(==) in older OCaml releases) explains physical equality:

e1 == e2 tests for physical equality of e1 and e2. On mutable types such as references, arrays, byte sequences, records with mutable fields and objects with mutable instance variables, e1 == e2 is true if and only if physical modification of e1 also affects e2. On non-mutable types, the behavior of ( == ) is implementation-dependent; however, it is guaranteed that e1 == e2 implies compare e1 e2 = 0.

One interpretation could be that == should be used only when comparing refs (and other mutable data types) to see whether they point to the same location in memory. Otherwise, don't use ==.

Structural equality is also explained in the documentation, under Stdlib.(=) (Pervasives.(=) in older OCaml releases):

e1 = e2 tests for structural equality of e1 and e2. Mutable structures (e.g. references and arrays) are equal if and only if their current contents are structurally equal, even if the two mutable objects are not the same physical object. Equality between functional values raises Invalid_argument. Equality between cyclic data structures may not terminate.

Structural equality is usually what you want to test. For refs, it checks whether the contents of the memory location are equal, regardless of whether they are the same location.

The negation of physical equality is !=, and the negation of structural equality is <>. This can be hard to remember.

Here are some examples involving equality and refs to illustrate the difference between structural equality (=) and physical equality (==):

# let r1 = ref 3110;;
val r1 : int ref = {contents = 3110}

# let r2 = ref 3110;;
val r2 : int ref = {contents = 3110}

# r1 == r1;;
- : bool = true

# r1 == r2;;
- : bool = false

# r1 != r2;;
- : bool = true

# r1 = r1;;
- : bool = true

# r1 = r2;;
- : bool = true

# r1 <> r2;;
- : bool = false

# ref 3110 <> ref 2110;;
- : bool = true

Mutable fields

The fields of a record can be declared as mutable, meaning their contents can be updated without constructing a new record. For example, here is a record type for two-dimensional colored points whose color field c is mutable:

# type point = {x:int; y:int; mutable c:string};;
type point = {x:int; y:int; mutable c:string; }

Note that mutable is a property of the field, rather than the type of the field. In particular, we write mutable field : type, not field : mutable type.

The operator to update a mutable field is <-:

# let p = {x=0; y=0; c="red"};;
val p : point = {x=0; y=0; c="red"}

# p.c <- "white";;
- : unit = ()

# p;;
- : point = {x=0; y=0; c="white"}

# p.x <- 3;;
Error: The record field x is not mutable

The syntax and semantics of <- are similar to those of := but complicated by fields:

Syntax: e1.f <- e2
Dynamic semantics: To evaluate e1.f <- e2, evaluate e2 to a value v2, and e1 to a value v1, which must have a field named f. Update v1.f to v2. Return ().
Static semantics: e1.f <- e2 : unit if e1 : t1 and t1 = {...; mutable f : t2; ...}, and e2 : t2.
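
The following sketch contrasts an in-place update with <- against a functional update with the with keyword, reusing the point type from above. The function name recolor is our own and purely illustrative.

```ocaml
type point = {x : int; y : int; mutable c : string}

(* [recolor p color] updates the mutable field [c] of [p] in place.
   Hypothetical helper for illustration. *)
let recolor (p : point) (color : string) : unit =
  p.c <- color

let p = {x = 0; y = 0; c = "red"}
let q = p  (* [q] is an alias of the same record, not a copy *)

(* The in-place update is visible through both [p] and [q]. *)
let () = recolor p "white"

(* A functional update builds a new record and leaves [p] untouched. *)
let p' = {p with c = "blue"}
```

After this code runs, p.c and q.c are both "white", while p'.c is "blue".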

Refs and mutable fields

It turns out that refs are actually implemented as mutable fields. In the standard library (Stdlib, named Pervasives in older OCaml releases) we find the following declaration:

type 'a ref = { mutable contents : 'a; }

And that's why when we create a ref it does in fact look like a record: it is a record!

# let r = ref 3110;;
val r : int ref = {contents = 3110}

The other syntax we've seen for refs is in fact equivalent to simple OCaml functions:

(* Equivalent to [fun v -> {contents=v}]. *)
val ref : 'a -> 'a ref

(* Equivalent to [fun r -> r.contents]. *)
val (!) : 'a ref -> 'a

(* Equivalent to [fun r v -> r.contents <- v]. *)
val (:=) : 'a ref -> 'a -> unit

The reason we say "equivalent" is that those functions are actually implemented not in OCaml but in the OCaml run-time, which is implemented mostly in C. But the functions do behave the same as the OCaml source given above in comments.
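
To make the equivalence concrete, here is a sketch of a home-grown ref type built from a record with one mutable field. The names my_ref, make, get, and set are hypothetical, chosen so as not to shadow the standard ref, !, and :=.

```ocaml
(* A home-grown copy of the ref type. All names here are hypothetical. *)
type 'a my_ref = {mutable contents : 'a}

(* Plays the role of [ref]: allocate a record holding [v]. *)
let make (v : 'a) : 'a my_ref = {contents = v}

(* Plays the role of [!]: read the field. *)
let get (r : 'a my_ref) : 'a = r.contents

(* Plays the role of [:=]: update the field in place. *)
let set (r : 'a my_ref) (v : 'a) : unit = r.contents <- v

let r = make 3110
let before = get r
let () = set r 2110
let after = get r
```

Here before is 3110 and after is 2110, mirroring what ref, !, and := would do on a built-in ref.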

Arrays

Arrays are fixed-length mutable sequences with constant-time access and update. So they are similar in various ways to refs, lists, and tuples. Like refs, they are mutable. Like lists, they are (finite) sequences. Like tuples, their length is fixed in advance and cannot be resized.

The syntax for arrays is similar to lists:

# let v = [|0.; 1.|];;
val v : float array = [|0.; 1.|]

That code creates an array whose length is fixed to be 2 and whose contents are initialized to 0. and 1. (the floats zero and one). The name array is a type constructor, much like list.

Later those contents can be changed using the <- operator:

# v.(0) <- 5.;;
- : unit = ()

# v;;
- : float array = [|5.; 1.|]

As you can see in that example, indexing into an array uses the syntax array.(index), where the parentheses are mandatory.

The Array module has many useful functions on arrays.

Syntax.

Array creation: [|e0; e1; ...; en|]
Array indexing: e1.(e2)
Array assignment: e1.(e2) <- e3

Dynamic semantics.

To evaluate [|e0; e1; ...; en|], evaluate each ei to a value vi, create a new array of length n+1, and store each value in the array at its index.
To evaluate e1.(e2), evaluate e1 to an array value v1, and e2 to an integer v2. If v2 is not within the bounds of the array (i.e., 0 to n-1, where n is the length of the array), raise Invalid_argument. Otherwise, index into v1 to get the value v at index v2, and return v.
To evaluate e1.(e2) <- e3, evaluate each expression ei to a value vi. Check that v2 is within bounds, as in the semantics of indexing. Mutate the element of v1 at index v2 to be v3. Return ().

Static semantics.

[|e0; e1; ...; en|] : t array if ei : t for all the ei.
e1.(e2) : t if e1 : t array and e2 : int.
e1.(e2) <- e3 : unit if e1 : t array and e2 : int and e3 : t.
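
A short sketch of these rules in action, using Array.make from the standard Array module to create the array. The variable names are ours.

```ocaml
(* [Array.make n v] creates an array of length [n] filled with [v]. *)
let a = Array.make 3 0

(* Two assignments in sequence; each has type [unit]. *)
let () =
  a.(0) <- 10;
  a.(2) <- 30

(* Index 1 was never assigned, so it still holds the initial value. *)
let middle = a.(1)

(* Indexing past the end raises [Invalid_argument] instead of
   reading arbitrary memory. *)
let out_of_bounds =
  try
    ignore (a.(3));
    false
  with Invalid_argument _ -> true
```

Afterwards a is [|10; 0; 30|], middle is 0, and out_of_bounds is true.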

Loops. OCaml has while loops and for loops. Their syntax is as follows:

while e1 do e2 done
for x=e1 to e2 do e3 done
for x=e1 downto e2 do e3 done

The second form of for loop counts down from e1 to e2—that is, it decrements its index variable at each iteration.

Though not mutable features themselves, loops can be useful with mutable data types like arrays. We can also use functions like Array.iter, Array.map, and Array.fold_left instead of loops.
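
As an illustration, here is a sketch that sums an array with a for loop and with Array.fold_left, and then uses a while loop to search it. The array scores and the other names are made up for the example.

```ocaml
let scores = [|3; 1; 4; 1; 5|]

(* Sum with a [for] loop, accumulating into a ref. *)
let sum_for =
  let total = ref 0 in
  for i = 0 to Array.length scores - 1 do
    total := !total + scores.(i)
  done;
  !total

(* The same sum with [Array.fold_left]; no ref is needed. *)
let sum_fold = Array.fold_left (+) 0 scores

(* A [while] loop that finds the index of the first element >= 4.
   The bounds test comes first so that [scores.(!i)] is never
   evaluated out of range. *)
let first_big =
  let i = ref 0 in
  while !i < Array.length scores && scores.(!i) < 4 do
    incr i
  done;
  !i
```

Both sums are 14, and first_big is 2. The loop versions need a ref because a loop body has type unit and so can communicate only through side effects.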

Mutable data structures

As an example of a mutable data structure, let's look at stacks. We're already familiar with functional stacks:

exception Empty

module type Stack = sig
  (* ['a t] is the type of stacks whose elements have type ['a]. *)
  type 'a t

  (* [empty] is the empty stack *)
  val empty : 'a t

  (* [push x s] is the stack whose top is [x] and the rest is [s]. *)
  val push : 'a -> 'a t -> 'a t

  (* [peek s] is the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val peek : 'a t -> 'a

  (* [pop s] is all but the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val pop : 'a t -> 'a t
end

An interface for a mutable or non-persistent stack would look a little different:

module type MutableStack = sig
  (* ['a t] is the type of mutable stacks whose elements have type ['a].
   * The stack is mutable not in the sense that its elements can
   * be changed, but in the sense that it is not persistent:
   * the operations [push] and [pop] destructively modify the stack. *)
  type 'a t

  (* [empty ()] is the empty stack *)
  val empty : unit -> 'a t

  (* [push x s] modifies [s] to make [x] its top element.
   * The rest of the elements are unchanged. *)
  val push : 'a -> 'a t -> unit

  (* [peek s] is the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val peek : 'a t -> 'a

  (* [pop s] removes the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val pop : 'a t -> unit
end

Notice especially how the type of empty changes: instead of being a value, it is now a function. This is typical of functions that create mutable data structures. Also notice how the types of push and pop change: instead of returning an 'a t, they return unit. This again is typical of functions that modify mutable data structures. In all these cases, the use of unit makes the functions more like their equivalents in an imperative language. The constructor for an empty stack in Java, for example, might not take any arguments (which is equivalent to taking unit). And the push and pop functions for a Java stack might return void, which is equivalent to returning unit.

Now let's implement the mutable stack with a mutable linked list. We'll have to code that up ourselves, since OCaml linked lists are persistent.

module MutableRecordStack = struct
  (* An ['a node] is a node of a mutable linked list.  It has
   * a field [value] that contains the node's value, and
   * a mutable field [next] that is [None] if the node has
   * no successor, or [Some n] if the successor is [n]. *)
  type 'a node = {value : 'a; mutable next : 'a node option}

  (* AF: An ['a t] is a stack represented by a mutable linked list.
   * The mutable field [top] is the first node of the list,
   * which is the top of the stack. The empty stack is represented
   * by {top = None}.  The record {top = Some n} represents the
   * stack whose top is [n], and whose remaining elements are
   * the successors of [n]. *)
  type 'a t = {mutable top : 'a node option}

  let empty () = 
    {top = None}

  (* To push [x] onto [s], we allocate a new node with [Some {...}].
   * Its successor is the old top of the stack, [s.top].
   * The top of the stack is mutated to be the new node. *)
  let push x s =
    s.top <- Some {value = x; next = s.top}

  let peek s =
    match s.top with
    | None -> raise Empty
    | Some {value; _} -> value

  (* To pop [s], we mutate the top of the stack to become its successor. *)
  let pop s =
    match s.top with
    | None -> raise Empty
    | Some {next; _} -> s.top <- next
end

Here is some example usage of the mutable stack:

# let s = empty ();;
val s : '_a t = {top = None}

# push 1 s;;
- : unit = ()

# s;;
- : int t = {top = Some {value = 1; next = None}}

# push 2 s;;
- : unit = ()

# s;;
- : int t = {top = Some {value = 2; next = Some {value = 1; next = None}}} 

# pop s;;
- : unit = ()

# s;;
- : int t = {top = Some {value = 1; next = None}}

The '_a in the first utop response in that transcript is a weakly polymorphic type variable (recent versions of OCaml print it as '_weak1). It indicates that the type of elements of s is not yet fixed, but that as soon as one element is added, the type (for that particular stack) will forever be fixed. Weak type variables tend to appear once mutability is involved, and they are important for the type system to prevent certain kinds of errors, but we won't discuss them further.
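
The hand-rolled linked list is not the only possible representation. As an alternative sketch, not the implementation given above, the same operations can be built on a ref that holds an ordinary persistent list. The module name ListRefStack is hypothetical.

```ocaml
exception Empty

(* A mutable stack backed by a ref to a persistent list.
   [ListRefStack] is a hypothetical name used for this sketch. *)
module ListRefStack = struct
  (* AF: the list [!s] holds the elements of the stack, top first. *)
  type 'a t = 'a list ref

  let empty () : 'a t = ref []

  (* Replace the contents with a list that has [x] at the front. *)
  let push (x : 'a) (s : 'a t) : unit =
    s := x :: !s

  let peek (s : 'a t) : 'a =
    match !s with
    | [] -> raise Empty
    | x :: _ -> x

  (* Replace the contents with the tail of the current list. *)
  let pop (s : 'a t) : unit =
    match !s with
    | [] -> raise Empty
    | _ :: rest -> s := rest
end

let s = ListRefStack.empty ()
let () = ListRefStack.push 1 s
let () = ListRefStack.push 2 s
let top_before = ListRefStack.peek s
let () = ListRefStack.pop s
let top_after = ListRefStack.peek s
```

Here top_before is 2 and top_after is 1. The only mutation is to the single ref cell, and the lists stored in it remain persistent, so this version is shorter than the linked-list one while offering the same operations.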

Summary

We cover mutable data types in the "Advanced Data Structures" section of this course because they are, in fact, harder to reason about. For example, before refs, we didn't have to worry about aliasing in OCaml.
But mutability does have its uses. I/O is fundamentally about mutation. And some data structures (like arrays, which we saw here, and hash tables) cannot be implemented as efficiently without mutability.

Mutability thus offers great power, but with great power comes great responsibility. Try not to abuse your new-found power!

Terms and concepts

address
alias
array
assignment
dereference
deterministic
immutable
index
loop
memory safety
mutable
mutable field
nondeterministic
persistent
physical equality
pointer
pure
ref
ref cell
reference
sequencing
structural equality
