Mutable Data Types
Topics:
- refs
- mutable fields
- arrays
- mutable data structures
OCaml is not a pure language: it does admit side effects. We have seen that already with I/O, especially printing. But up till now we have limited ourselves to the subset of the language that is immutable: values could not change. Today, we look at data types that are mutable.
Mutability is neither good nor bad. It enables new functionality that we couldn't implement (at least not easily) before, and it enables us to create certain data structures that are asymptotically more efficient than their purely functional analogues. But mutability does make code more difficult to reason about, hence it is a source of many faults in code. One reason for that might be that humans are not good at thinking about change. With immutable values, we're guaranteed that any fact we might establish about them can never change. But with mutable values, that's no longer true. "Change is hard," as they say.
Refs
A ref is like a pointer or reference in an imperative language. It is a location in memory whose contents may change. Refs are also called ref cells, the idea being that there's a cell in memory that can change.
A first example. Here's an example utop transcript to introduce refs:
# let x = ref 0;;
val x : int ref = {contents = 0}
# !x;;
- : int = 0
# x := 1;;
- : unit = ()
# !x;;
- : int = 1
At a high level, what that shows is creating a ref, getting the value from inside it, changing its contents, and observing the changed contents. Let's dig a little deeper.
The first phrase, let x = ref 0, creates a reference using ref. That's a location in memory whose contents are initialized to 0. Think of the location itself as being an address (for example, 0x3110bae0), even though there's no way to write down such an address in an OCaml program. Applying ref is what causes the memory location to be allocated and initialized.
The first part of the response from utop, val x : int ref, indicates that x is a variable whose type is int ref. We have a new type constructor here. Much like list and option are type constructors, so is ref. A t ref, for any type t, is a reference to a memory location that is guaranteed to contain a value of type t. As usual, we should read a type from right to left: t ref means a reference to a t.
The second part of the response shows us the contents of the memory location. Indeed, the contents have been initialized to 0.
The second phrase, !x, dereferences x and returns the contents of the memory location. Note that ! is the dereference operator in OCaml, not Boolean negation.
The third phrase, x := 1, is an assignment. It mutates the contents of x to be 1. Note that x itself still points to the same location (i.e., address) in memory. Variables really are immutable in that way. What changes is the contents of that memory location. Memory is mutable; variable bindings are not. The response from utop is simply (), meaning that the assignment took place, much like printing functions return () to indicate that the printing did happen.
The fourth phrase, !x, again dereferences x to demonstrate that the contents of the memory location did indeed change.
A more sophisticated example. Here is code that implements a counter. Every time next_val is called, it returns one more than the previous time.
# let counter = ref 0;;
val counter : int ref = {contents = 0}
# let next_val =
    fun () ->
      counter := (!counter) + 1;
      !counter;;
val next_val : unit -> int = <fun>
# next_val();;
- : int = 1
# next_val();;
- : int = 2
# next_val();;
- : int = 3
In the implementation of next_val, there are two expressions separated by a semicolon. The first expression, counter := (!counter) + 1, is an assignment that increments counter by 1. The second expression, !counter, returns the newly incremented contents of counter.
This function is unusual in that every time we call it, it returns a different value. That's quite different from any of the functions we've implemented ourselves so far, which have always been deterministic: for a given input, they always produced the same output. On the other hand, we've seen some library functions that are nondeterministic, for example, functions in the Random module, and Pervasives.read_line. It's no coincidence that those happen to be implemented using mutable features.
We could improve our counter in a couple of ways. First, there is a library function incr : int ref -> unit that increments an int ref by 1. Thus it is like the ++ operator in many languages in the C family. Using it, we could write incr counter instead of counter := (!counter) + 1.
Second, the way we coded the counter currently exposes the counter variable to the outside world. Maybe we'd prefer to hide it so that clients of next_val can't directly change it. We could do so by nesting counter inside the scope of next_val:
let next_val =
  let counter = ref 0 in
  fun () ->
    incr counter;
    !counter
Now counter is in scope inside of next_val, but not accessible outside that scope.
When we gave the dynamic semantics of let expressions before, we talked about substitution. One way to think about the definition of next_val is as follows.
- First, the expression ref 0 is evaluated. That returns a location loc, which is an address in memory. The contents of that address are initialized to 0.
- Second, everywhere in the body of the let expression that counter occurs, we substitute for it that location. So we get: fun () -> incr loc; !loc
- Third, that anonymous function is bound to next_val.
So any time next_val is called, it increments and returns the contents of that one memory location loc.
Now imagine that we instead had written the following (broken) code:
let next_val_broken = fun () ->
  let counter = ref 0 in
  incr counter;
  !counter
It's only a little different: the binding of counter occurs after the fun () -> instead of before. But it makes a huge difference:
# next_val_broken ();;
- : int = 1
# next_val_broken ();;
- : int = 1
# next_val_broken ();;
- : int = 1
Every time we call next_val_broken, it returns 1: we no longer have a counter. What's going wrong here?
The problem is that every time next_val_broken is called, the first thing it does is to evaluate ref 0 to a new location that is initialized to 0. That location is then incremented to 1, and 1 is returned. Every call to next_val_broken is thus allocating a new ref cell, whereas next_val allocates just one new ref cell.
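The "one cell per evaluation of ref 0" idea can be turned to our advantage. The following is a minimal sketch, not part of the notes above: make_counter is a hypothetical helper that allocates a fresh cell each time it is called and returns a closure over that cell, so each counter it produces keeps its own state.

```ocaml
(* [make_counter ()] allocates one fresh ref cell and returns a closure
   that increments and then reads that cell. The name [make_counter] is
   hypothetical; it is not a library function. *)
let make_counter () =
  let counter = ref 0 in
  fun () ->
    incr counter;
    !counter

(* Each call to [make_counter] evaluates [ref 0] once, so [a] and [b]
   are backed by two different memory locations. *)
let a = make_counter ()
let b = make_counter ()

let a1 = a ()  (* first call on [a] *)
let a2 = a ()  (* second call on [a] *)
let b1 = b ()  (* first call on [b], unaffected by the calls on [a] *)
```

The placement of ref 0 relative to the innermost fun () -> is what decides how often a cell is allocated: once per counter here, once per call in next_val_broken.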
Syntax. The first three of the following are new syntactic forms involving refs, and the last is a syntactic form that we haven't yet fully explored.
- Ref creation: ref e
- Ref assignment: e1 := e2
- Dereference: !e
- Sequencing of effects: e1; e2
Dynamic semantics.
- To evaluate ref e,
  - Evaluate e to a value v.
  - Allocate a new location loc in memory to hold v.
  - Store v in loc.
  - Return loc.
- To evaluate e1 := e2,
  - Evaluate e2 to a value v, and e1 to a location loc.
  - Store v in loc.
  - Return (), i.e., unit.
- To evaluate !e,
  - Evaluate e to a location loc.
  - Return the contents of loc.
- To evaluate e1; e2,
  - First evaluate e1 to a value v1.
  - Then evaluate e2 to a value v2.
  - Return v2. (v1 is not used at all.)
  - If there are multiple expressions in a sequence, e.g., e1; e2; ...; en, then evaluate each one in order from left to right, returning only vn. Another way to think about this is that semicolon is right associative: for example, e1; e2; e3 is the same as e1; (e2; e3).
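As an illustration of those rules working together, here is a small sketch of a function that exchanges the contents of two refs. The name swap and the test values are assumptions made for this example only.

```ocaml
(* [swap r1 r2] exchanges the contents of [r1] and [r2]. *)
let swap (r1 : 'a ref) (r2 : 'a ref) : unit =
  let tmp = !r1 in  (* dereference: read the contents of [r1] *)
  r1 := !r2;        (* assignment: evaluates to (), which is discarded *)
  r2 := tmp         (* the value of the whole sequence is this last () *)

let p = ref 1
let q = ref 2
let () = swap p q
```

The body is a sequence e1; e2 in which both expressions are assignments, so the function as a whole returns unit.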
Note that locations are values that can be passed to and returned from functions. But unlike other values (e.g., integers, variants), there is no way to directly write a location in an OCaml program. That's different from languages like C, where programmers can directly write memory addresses and do arithmetic on pointers. C programmers want that kind of low-level access to do things like interface with hardware and build operating systems. Higher-level programmers are willing to forgo it to get memory safety. That's a hard term to define, but according to Hicks 2014 it intuitively means that
- pointers are only created in a safe way that defines their legal memory region,
- pointers can only be dereferenced if they point to their allotted memory region,
- that region is (still) defined.
Static semantics. We have a new type constructor, ref, such that t ref is a type for any type t. Note that the name ref is used in two ways: as a type constructor, and as an expression that constructs refs.
- ref e : t ref if e : t.
- e1 := e2 : unit if e1 : t ref and e2 : t.
- !e : t if e : t ref.
- e1; e2 : t if e1 : unit and e2 : t. Similarly, e1; e2; ...; en : t if e1 : unit, e2 : unit, ... (i.e., all expressions except en have type unit), and en : t.
The typing rule for semicolon is designed to prevent programmer mistakes. For example, a programmer who writes 2+3; 7 probably didn't mean to: there's no reason to evaluate 2+3 then throw away the result and instead return 7. The compiler will give you a warning if you violate this particular typing rule. To get rid of the warning (if you're sure that's what you need to do), there's a function ignore : 'a -> unit in the standard library. Using it, ignore(2+3); 7 will compile without a warning. Of course, you could code up ignore yourself: let ignore _ = ().
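A more realistic use of ignore arises when a function is called only for its side effect. The sketch below assumes a hypothetical function bump, modeled on next_val, whose integer result we deliberately discard.

```ocaml
let counter = ref 0

(* [bump ()] increments [counter] and returns its new contents.
   The name [bump] is hypothetical. *)
let bump () =
  incr counter;
  !counter

(* We want only the side effect here, so the int result of each call is
   discarded explicitly, which keeps the left side of each semicolon
   at type unit. *)
let () =
  ignore (bump ());
  ignore (bump ())

let final = !counter
```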
Aliasing. Now that we have refs, we have aliasing: two refs could point to the same memory location, hence updating through one causes the other to also be updated. For example,
let x = ref 42
let y = ref 42
let z = x
let () = x := 43
let w = (!y) + (!z)
The result of executing that code is that w is bound to 85, because let z = x causes z and x to become aliases, hence updating x to be 43 also causes z to be 43.
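Aliasing can also arise through function parameters, which is where it tends to surprise people. The following sketch uses a hypothetical function add_into; the same function behaves differently depending on whether its two arguments are the same cell.

```ocaml
(* [add_into dst src] adds the contents of [src] to the contents of [dst]. *)
let add_into (dst : int ref) (src : int ref) : unit =
  dst := !dst + !src

let a = ref 10
let b = ref 10

(* Distinct cells: [a] becomes 20 and [b] is untouched. *)
let () = add_into a b

(* Aliased: both parameters name the same cell, so the cell is read
   twice and then overwritten, and [a] becomes 40. *)
let () = add_into a a
```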
Equality. OCaml has two equality operators, physical equality and structural equality. The documentation of Pervasives.(==) explains physical equality:
"e1 == e2 tests for physical equality of e1 and e2. On mutable types such as references, arrays, byte sequences, records with mutable fields and objects with mutable instance variables, e1 == e2 is true if and only if physical modification of e1 also affects e2. On non-mutable types, the behavior of ( == ) is implementation-dependent; however, it is guaranteed that e1 == e2 implies compare e1 e2 = 0."
One interpretation could be that == should be used only when comparing refs (and other mutable data types) to see whether they point to the same location in memory. Otherwise, don't use ==.
Structural equality is also explained in the documentation of Pervasives.(=):
"e1 = e2 tests for structural equality of e1 and e2. Mutable structures (e.g. references and arrays) are equal if and only if their current contents are structurally equal, even if the two mutable objects are not the same physical object. Equality between functional values raises Invalid_argument. Equality between cyclic data structures may not terminate."
Structural equality is usually what you want to test. For refs, it checks whether the contents of the memory location are equal, regardless of whether they are the same location.
The negation of physical equality is !=, and the negation of structural equality is <>. This can be hard to remember.
Here are some examples involving equality and refs to illustrate the difference between structural equality (=) and physical equality (==):
# let r1 = ref 3110;;
val r1 : int ref = {contents = 3110}
# let r2 = ref 3110;;
val r2 : int ref = {contents = 3110}
# r1 == r1;;
- : bool = true
# r1 == r2;;
- : bool = false
# r1 != r2;;
- : bool = true
# r1 = r1;;
- : bool = true
# r1 = r2;;
- : bool = true
# r1 <> r2;;
- : bool = false
# ref 3110 <> ref 2110;;
- : bool = true
Mutable fields
The fields of a record can be declared as mutable, meaning their contents can be updated without constructing a new record. For example, here is a record type for two-dimensional colored points whose color field c is mutable:
# type point = {x:int; y:int; mutable c:string};;
type point = {x:int; y:int; mutable c:string; }
Note that mutable is a property of the field, rather than the type of the field. In particular, we write mutable field : type, not field : mutable type.
The operator to update a mutable field is <-:
# let p = {x=0; y=0; c="red"};;
val p : point = {x=0; y=0; c="red"}
# p.c <- "white";;
- : unit = ()
# p;;
- : point = {x=0; y=0; c="white"}
# p.x <- 3;;
Error: The record field x is not mutable
The syntax and semantics of <- are similar to := but complicated by fields:
- Syntax: e1.f <- e2
- Dynamic semantics: To evaluate e1.f <- e2, evaluate e2 to a value v2, and e1 to a value v1, which must have a field named f. Update v1.f to v2. Return ().
- Static semantics: e1.f <- e2 : unit if e1 : t1 and t1 = {...; mutable f : t2; ...}, and e2 : t2.
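To see field update in the context of a small program, here is a sketch that reuses the point type from above. The function recolor and the bindings q and moved are hypothetical names introduced for this example.

```ocaml
type point = { x : int; y : int; mutable c : string }

(* [recolor p color] updates the mutable field [c] of [p] in place. *)
let recolor (p : point) (color : string) : unit =
  p.c <- color

let p = { x = 0; y = 0; c = "red" }
let q = p                     (* [q] is an alias for the same record *)
let () = recolor p "white"    (* visible through [q] as well *)

(* The field [x] is immutable, so "changing" it means building a new
   record. [moved] is a separate record; [p] keeps [x = 0]. *)
let moved = { p with x = 3 }
```

Because the update happens in place, it is visible through every alias of the record, just as with refs.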
Refs and mutable fields
It turns out that refs are actually implemented as mutable fields. In Pervasives we find the following declaration:
type 'a ref = { mutable contents : 'a; }
And that's why when we create a ref it does in fact look like a record: it is a record!
# let r = ref 3110;;
val r : int ref = {contents = 3110}
The other syntax we've seen for refs is in fact equivalent to simple OCaml functions:
(* Equivalent to [fun v -> {contents=v}]. *)
val ref : 'a -> 'a ref
(* Equivalent to [fun r -> r.contents]. *)
val (!) : 'a ref -> 'a
(* Equivalent to [fun r v -> r.contents <- v]. *)
val (:=) : 'a ref -> 'a -> unit
The reason we say "equivalent" is that those functions are actually implemented not in OCaml but in the OCaml run-time, which is implemented mostly in C. But the functions do behave the same as the OCaml source given above in comments.
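To make that equivalence concrete, here is a sketch of a hand-rolled ref cell written in plain OCaml. The names my_ref, get, and set are hypothetical and are chosen so as not to shadow the built-in ref, !, and :=.

```ocaml
(* A record with a single mutable field, mirroring the declaration of
   the built-in ref type under a different name. *)
type 'a my_ref = { mutable contents : 'a }

(* Counterpart of [ref]: allocate a cell holding [v]. *)
let my_ref (v : 'a) : 'a my_ref = { contents = v }

(* Counterpart of [!]: read the cell. *)
let get (r : 'a my_ref) : 'a = r.contents

(* Counterpart of [:=]: overwrite the cell. *)
let set (r : 'a my_ref) (v : 'a) : unit = r.contents <- v

let r = my_ref 3110
let before = get r
let () = set r 2110
let after = get r
```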
Arrays
Arrays are fixed-length mutable sequences with constant-time access and update. So they are similar in various ways to refs, lists, and tuples. Like refs, they are mutable. Like lists, they are (finite) sequences. Like tuples, their length is fixed in advance and cannot be resized.
The syntax for arrays is similar to lists:
# let v = [|0.; 1.|];;
val v : float array = [|0.; 1.|]
That code creates an array whose length is fixed to be 2 and whose contents are initialized to 0. and 1. respectively. Here array is a type constructor, much like list.
Later those contents can be changed using the <- operator:
# v.(0) <- 5.;;
- : unit = ()
# v;;
- : float array = [|5.; 1.|]
As you can see in that example, indexing into an array uses the syntax array.(index), where the parentheses are mandatory.
The Array module has many useful functions on arrays.
Syntax.
- Array creation: [|e0; e1; ...; en|]
- Array indexing: e1.(e2)
- Array assignment: e1.(e2) <- e3
Dynamic semantics.
- To evaluate [|e0; e1; ...; en|], evaluate each ei to a value vi, create a new array of length n+1, and store each value in the array at its index.
- To evaluate e1.(e2), evaluate e1 to an array value v1, and e2 to an integer v2. If v2 is not within the bounds of the array (i.e., 0 to n-1, where n is the length of the array), raise Invalid_argument. Otherwise, index into v1 to get the value v at index v2, and return v.
- To evaluate e1.(e2) <- e3, evaluate each expression ei to a value vi. Check that v2 is within bounds, as in the semantics of indexing. Mutate the element of v1 at index v2 to be v3.
Static semantics.
- [|e0; e1; ...; en|] : t array if ei : t for all the ei.
- e1.(e2) : t if e1 : t array and e2 : int.
- e1.(e2) <- e3 : unit if e1 : t array and e2 : int and e3 : t.
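The bounds check in the dynamic semantics can be observed from ordinary code. The sketch below assumes a hypothetical helper safe_get that turns the exception into an option; it relies on the default behavior in which an out-of-bounds index raises Invalid_argument.

```ocaml
let v = [| 0.; 1. |]

(* [safe_get a i] is [Some a.(i)] if [i] is within bounds, and [None]
   otherwise. The name [safe_get] is hypothetical. *)
let safe_get (a : 'a array) (i : int) : 'a option =
  try Some a.(i) with Invalid_argument _ -> None

let () = v.(0) <- 5.          (* in-place update; the length stays 2 *)
let first = safe_get v 0      (* index 0 is in bounds *)
let missing = safe_get v 2    (* valid indices are only 0 and 1 *)
```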
Loops. OCaml has while loops and for loops. Their syntax is as follows:
while e1 do e2 done
for x=e1 to e2 do e3 done
for x=e1 downto e2 do e3 done
The second form of for loop counts down from e1 to e2; that is, it decrements its index variable at each iteration.
Though not mutable features themselves, loops can be useful with mutable data types like arrays. We can also use functions like Array.iter, Array.map, and Array.fold_left instead of loops.
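As a sketch of how those options compare, here are three ways to sum an int array: a for loop with a ref accumulator, a while loop with an explicit index ref, and Array.fold_left. The function names are hypothetical.

```ocaml
(* [sum_for a] sums [a] with a [for] loop and a ref accumulator. *)
let sum_for (a : int array) : int =
  let total = ref 0 in
  for i = 0 to Array.length a - 1 do
    total := !total + a.(i)
  done;
  !total

(* [sum_while a] does the same with a [while] loop, so the index must
   itself be a ref that the loop body increments. *)
let sum_while (a : int array) : int =
  let total = ref 0 in
  let i = ref 0 in
  while !i < Array.length a do
    total := !total + a.(!i);
    incr i
  done;
  !total

(* [sum_fold a] uses a library function and no explicit mutation. *)
let sum_fold (a : int array) : int =
  Array.fold_left ( + ) 0 a
```

All three compute the same result; the loop versions need refs because the loop body itself has type unit and so can communicate only through side effects.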
Mutable data structures
As an example of a mutable data structure, let's look at stacks. We're already familiar with functional stacks:
exception Empty

module type Stack = sig
  (* ['a t] is the type of stacks whose elements have type ['a]. *)
  type 'a t

  (* [empty] is the empty stack *)
  val empty : 'a t

  (* [push x s] is the stack whose top is [x] and the rest is [s]. *)
  val push : 'a -> 'a t -> 'a t

  (* [peek s] is the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val peek : 'a t -> 'a

  (* [pop s] is all but the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val pop : 'a t -> 'a t
end
An interface for a mutable or non-persistent stack would look a little different:
module type MutableStack = sig
  (* ['a t] is the type of mutable stacks whose elements have type ['a].
   * The stack is mutable not in the sense that its elements can
   * be changed, but in the sense that it is not persistent:
   * the operations [push] and [pop] destructively modify the stack. *)
  type 'a t

  (* [empty ()] is the empty stack *)
  val empty : unit -> 'a t

  (* [push x s] modifies [s] to make [x] its top element.
   * The rest of the elements are unchanged. *)
  val push : 'a -> 'a t -> unit

  (* [peek s] is the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val peek : 'a t -> 'a

  (* [pop s] removes the top element of [s].
   * raises: [Empty] if [s] is empty. *)
  val pop : 'a t -> unit
end
Notice especially how the type of empty changes: instead of being a value, it is now a function. This is typical of functions that create mutable data structures. Also notice how the types of push and pop change: instead of returning an 'a t, they return unit. This again is typical of functions that modify mutable data structures. In all these cases, the use of unit makes the functions more like their equivalents in an imperative language. The constructor for an empty stack in Java, for example, might not take any arguments (which is equivalent to taking unit). And the push and pop functions for a Java stack might return void, which is equivalent to returning unit.
Now let's implement the mutable stack with a mutable linked list. We'll have to code that up ourselves, since OCaml linked lists are persistent.
module MutableRecordStack = struct
  (* An ['a node] is a node of a mutable linked list. It has
   * a field [value] that contains the node's value, and
   * a mutable field [next] that is [None] if the node has
   * no successor, or [Some n] if the successor is [n]. *)
  type 'a node = {value : 'a; mutable next : 'a node option}

  (* AF: An ['a t] is a stack represented by a mutable linked list.
   * The mutable field [top] is the first node of the list,
   * which is the top of the stack. The empty stack is represented
   * by {top = None}. The record {top = Some n} represents the
   * stack whose top is [n], and whose remaining elements are
   * the successors of [n]. *)
  type 'a t = {mutable top : 'a node option}

  let empty () =
    {top = None}

  (* To push [x] onto [s], we allocate a new node with [Some {...}].
   * Its successor is the old top of the stack, [s.top].
   * The top of the stack is mutated to be the new node. *)
  let push x s =
    s.top <- Some {value = x; next = s.top}

  let peek s =
    match s.top with
    | None -> raise Empty
    | Some {value; _} -> value

  (* To pop [s], we mutate the top of the stack to become its successor. *)
  let pop s =
    match s.top with
    | None -> raise Empty
    | Some {next; _} -> s.top <- next
end
Here is some example usage of the mutable stack:
# let s = empty ();;
val s : '_a t = {top = None}
# push 1 s;;
- : unit = ()
# s;;
- : int t = {top = Some {value = 1; next = None}}
# push 2 s;;
- : unit = ()
# s;;
- : int t = {top = Some {value = 2; next = Some {value = 1; next = None}}}
# pop s;;
- : unit = ()
# s;;
- : int t = {top = Some {value = 1; next = None}}
The '_a in the first utop response in that transcript is a weakly polymorphic type variable. It indicates that the type of elements of s is not yet fixed, but that as soon as one element is added, the type (for that particular stack) will forever be fixed. Weak type variables tend to appear once mutability is involved, and they are important for the type system to prevent certain kinds of errors, but we won't discuss them further.
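Returning to the stack itself: a hand-built linked list is not the only possible representation. As a sketch of an alternative (not the implementation used in these notes), the same operations can be provided by a single ref holding a persistent list. The module name ListRefStack is hypothetical, and Empty is redeclared only so that the block stands alone.

```ocaml
exception Empty

module ListRefStack = struct
  (* AF: the ref holds a persistent list whose head is the top of the
   * stack. The empty stack is a ref to []. *)
  type 'a t = 'a list ref

  (* Each call allocates a fresh cell, so stacks do not share state. *)
  let empty () = ref []

  (* Replace the contents with a new list that has [x] in front. *)
  let push x s = s := x :: !s

  let peek s =
    match !s with
    | [] -> raise Empty
    | x :: _ -> x

  (* Replace the contents with the tail of the current list. *)
  let pop s =
    match !s with
    | [] -> raise Empty
    | _ :: rest -> s := rest
end
```

Only the single cell is mutated here; the lists stored in it remain persistent, which is a different trade-off from mutating the next fields of individual nodes.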
Summary
We cover mutable data types in the "Advanced Data Structures" section of this course because they are, in fact, harder to reason about. For example, before refs, we didn't have to worry about aliasing in OCaml. But mutability does have its uses. I/O is fundamentally about mutation. And some data structures (like arrays, which we saw here, and hash tables) cannot be implemented as efficiently without mutability.
Mutability thus offers great power, but with great power comes great responsibility. Try not to abuse your new-found power!
Terms and concepts
- address
- alias
- array
- assignment
- dereference
- deterministic
- immutable
- index
- loop
- memory safety
- mutable
- mutable field
- nondeterministic
- persistent
- physical equality
- pointer
- pure
- ref
- ref cell
- reference
- sequencing
- structural equality
Further reading
- Introduction to Objective Caml, chapters 7 and 8
- OCaml from the Very Beginning, chapter 13
- Real World OCaml, chapter 8
- Relaxing the value restriction, by Jacques Garrigue, explains more about weak type variables. Section 2 is a succinct explanation of why they are needed.