We now start looking at types and type systems, which specify the most important kind of semantic analysis performed by compilers.
The C programming language has an expression \texttt{*p} whose meaning is to dereference the pointer \texttt{p}. However, C gives a lot of flexibility for computing the pointer \texttt{p}. In fact, it's possible to construct an invalid pointer, and the C compiler will not necessarily stop us or even give a warning. What does the ANSI C specification say happens in this case? In the specification (the C89 spec), we read the following:
If an invalid value has been assigned to the pointer [p], the behavior of the unary operator "*" is undefined.
This slightly alarming sentence is actually only one of more than 80 occurrences of the phrase “the behavior of ... is undefined” in the specification. What does it mean? It means that the computer is allowed to do anything it feels like, including bad things like erasing data you care about or sending illegal commands to attached hardware. On most computers, it means in practice that your program crashes, but it also might simply cause the program to run incorrectly in an unpredictable way.
A program is safe only if it never gets into a state where its behavior is undefined. Languages use types to help avoid undefined behavior, and the effectiveness of types at this job has been a real success story of computer science. Types help guarantee that various bad things don't happen, such as:
- Operators (such as \texttt{*}) receive the right number and kind of operands.
- User-defined functions likewise receive the expected arguments.
- Variables are assigned the expected kinds of values.
- Variables are initialized with values before they are used.
- Private module components are accessed only from within the module (Java security depends crucially on this!).
A strongly typed or type-safe language is one where undefined behavior can never happen. (This does not rule out nondeterminism in which more than one behavior is possible from a specified set.) Clearly, C is not an example of a strongly typed language.
A statically typed language is one in which the compiler associates types with program expressions (specifically, terms). C is an example of a statically typed language. A statically typed language has a sound type system if it is also strongly typed: the compile-time type checks ensure safe behavior at run time.
In a dynamically typed language, values carry around a representation of their type at run time. Languages like Perl and Python are dynamically typed but not statically typed. The Java language is statically typed but also has some support for dynamic typing, allowing it to safely handle type casts and “instanceof” tests.
We can classify languages according to whether they are strongly typed or statically typed:
| | Strongly typed | Not strongly typed |
|---|---|---|
| Statically typed | OCaml, Java | Pascal, C, C++, Xi |
| Not statically typed | Scheme, Python | FORTH, assembly |
Given that we can have type safety without static typing, why do we want static typing? There are two main reasons:
- It helps prevent programmer errors.
- It allows the compiler to avoid run-time overhead in generated code.
Static typing does have some downsides: it adds an annotation burden for programmers, and it may make some things programmers want to do impossible. Type systems that don't get in programmers' way tend to be complex and to require a complex, possibly slow type-checking algorithm. And in general, a static type system cannot guarantee the absence of every possible kind of run-time error. Errors such as division by zero are hard to check statically, and some run-time checks are unavoidable.
Most program expressions denote types or terms. Terms are ordinary expressions that represent a computation to be carried out at run time, whereas computation associated with types is carried out at compile time, except when dynamic typing features are used.
Types are a compile-time abstraction of run-time program behavior. In a sound type system, this abstraction is conservative, so good types rule out bad program behavior. For example, the type \texttt{int} usually indicates that a term with that type computes some integer (or possibly fails to terminate).
Type systems have two components: a set of possible type expressions, and a set of rules for associating terms with types. The set of possible type expressions is generally defined by a grammar, though in some type systems it is an involved process even to define the well-formed types. For a language like Xi, types are easy to define. We use the metavariable \(t\) here to denote a type, but the Greek letter \(τ\) (tau) is another popular choice.
\[ t ::= \texttt{int} \mid \texttt{bool} \mid t[\,] \]
Here you can see two base types (also known as primitive or ground types), \texttt{int} and \texttt{bool}, and array types, which result from the application of a type constructor.
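To make this concrete, here is a minimal OCaml sketch (not part of the Xi specification; the type and constructor names are invented) of how a compiler might represent this grammar of types internally:

(* A hypothetical AST for Xi types: int, bool, and array types t[]. *)
type ty =
  | TInt              (* the base type int *)
  | TBool             (* the base type bool *)
  | TArray of ty      (* t[]: the array type constructor applied to t *)

(* For example, the Xi type int[][] would be represented as: *)
let int_array_array : ty = TArray (TArray TInt)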
\section{Names and type equality}
In many languages, programmers can use names of their choosing to
refer to types, so the grammar of types is extended accordingly.
We use the metavariable \(X\) to represent a type name.
\[ t ::= \dots \mid X \]
For example, in OCaml or C we can define another name for an existing type, such as making the name \texttt{num} an alias for the type \texttt{int}:

type num = int       (* OCaml *)
typedef int num;     /* C */
Here the type expressions \texttt{num} and \texttt{int} are really the same type, and the compiler needs to be able to do the compile-time computation to decide this equality. The computation involves finding the true definitions of \texttt{num} and \texttt{int} and determining that they have the same structure. If we had an OCaml function that expected a \texttt{num*num} and we provided it an \texttt{int*int}, it would type-check because the types \texttt{num*num} and \texttt{int*int} are equal. This is an example of a structural equality test.
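To give a flavor of this compile-time computation, here is a small OCaml sketch (hypothetical, not taken from any particular compiler) of a structural equality test that expands type names using an environment of definitions:

(* A hypothetical type representation with named types and pair types. *)
type ty =
  | TInt
  | TBool
  | TPair of ty * ty          (* a two-element tuple type t1 * t2 *)
  | TName of string           (* a type name such as num *)

(* The environment maps each type name to its definition. *)
let lookup (env : (string * ty) list) (x : string) : ty = List.assoc x env

(* Structural equality: expand names to their definitions, then compare. *)
let rec equal_ty env t1 t2 =
  match (t1, t2) with
  | (TName x, _) -> equal_ty env (lookup env x) t2
  | (_, TName x) -> equal_ty env t1 (lookup env x)
  | (TInt, TInt) | (TBool, TBool) -> true
  | (TPair (a1, b1), TPair (a2, b2)) -> equal_ty env a1 a2 && equal_ty env b1 b2
  | _ -> false

(* With num defined as an alias for int, num*num equals int*int. *)
let () =
  let env = [ ("num", TInt) ] in
  assert (equal_ty env (TPair (TName "num", TName "num")) (TPair (TInt, TInt)))

This sketch assumes type aliases are non-recursive; a real compiler would also have to guard against cyclic definitions.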
Another example of a named type is a class type, as in Java. Here we declare two classes \texttt{C} and \texttt{D}, so \texttt{C} and \texttt{D} can be used as types of objects:
class C { int x; int m() { return x; } }
class D { int x; int m() { return x; } }
In this example, the names \texttt{C} and \texttt{D} are bound to types that are structurally identical, but the types are considered unequal, simply because they have different names. Therefore type equality in Java is nominal, i.e., based on names rather than on the structure of types.
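OCaml's declared record types show a similar name-based behavior: in the sketch below (invented names), \texttt{c} and \texttt{d} have identical structure but are distinct types, so a function expecting one rejects the other.

(* Two structurally identical record types with different names. *)
type c = { x : int }
let get_c (v : c) : int = v.x

type d = { x : int }        (* same structure as c, but a distinct type *)
let (w : d) = { x = 1 }

(* get_c w would be rejected: w has type d, while get_c expects a c,
   even though c and d have exactly the same structure. *)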
Different languages have many different kinds of type constructors. However, there are some common underlying patterns, and complex types such as class types can be understood to some extent as combinations of simpler type constructors. In the following we use fairly standard notations for the different kinds of types, though different languages may choose other notations.
Another handy kind of primitive type is a type with only a single value, called a unit type. Its one value is often written as \texttt{()}. You could think of it as being like a boolean, but a defective one! This might sound useless, but the unit type is useful for giving a type to expressions that do not produce an interesting value, such as statements, or procedures that do not return a value. In OCaml this type is written as \texttt{unit}. In type theory, this type is often written as \(1\). In C and Java the type \texttt{void} is essentially the unit type, and is used only to describe the return type of methods that don't return a value.
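For instance, a small OCaml sketch of a procedure whose only effect is printing, so its result type is \texttt{unit}:

(* greet performs output and returns (), so its type is string -> unit. *)
let greet (name : string) : unit =
  print_endline ("Hello, " ^ name)

let () = greet "world"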
A primitive kind of data structure is a tuple, which groups together several values of possibly different types. A common notation for a tuple containing values of types \(t_1, t_2, \dots, t_n\) is \(t_1 * t_2 * \dots * t_n\). Values within a tuple are distinguished by their position in the tuple. OCaml is an example of a language with tuples.
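A brief OCaml sketch of a tuple type and positional access:

(* p is a pair of type int * bool; its components are identified by position. *)
let p : int * bool = (3, true)

(* Pattern matching extracts the components positionally. *)
let (n, b) = p
let () = Printf.printf "%d %b\n" n b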
Records are closely related to tuples. The type \(\{x_1:t_1,\dots,x_n:t_n\}\) is similar to the tuple type \(t_1*\dots*t_n\), except that the values in the record are accessed by field names \(x_i\) rather than by position. For example, accessing the field \(x_i\) of a value \(v\) might be written as \(v.x_i\). In some languages the order of the fields in the record is significant; in others, two record types are equal even if their field names appear in a different order. Making field order insignificant has implications if we want to support subtyping, which we will talk about in the next lecture.
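As a small OCaml sketch (type and field names invented), a record type and access by field name:

(* A record type with two named fields, accessed by name rather than by position. *)
type settings = { count : int; enabled : bool }

let s : settings = { count = 42; enabled = false }
let () = Printf.printf "%d %b\n" s.count s.enabled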
Records store several named values. Sometimes we want a variable to be able to contain one of several different kinds of values. Variant types are one way to accomplish this. For example, in OCaml, if we want a value that can be either an \texttt{int} or a \texttt{bool}, we can define a named type:
type int_or_bool = Integer of int | Boolean of bool
Given a value of type \texttt{int\_or\_bool}, we find out which we have by pattern matching:
let x : int_or_bool = ... in
match x with
| Integer (i : int) -> ...   (* use i *)
| Boolean (b : bool) -> ...  (* use b *)
The different cases of a variant are distinguished by their names. A more primitive variant-like type is a sum type, in which the different cases are distinguished by the order in which they occur. Thus, sums are to variants as tuples are to records. For example, the type \texttt{int+bool} represents a value that can be either an \texttt{int} or a \texttt{bool}. Sum types are useful for studying the theory of types, but real languages usually don't use them because it's too error-prone to distinguish cases by their position.
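OCaml's standard library does offer a generic two-case sum type, \texttt{Either.t} (available since OCaml 4.08), which is close in spirit to \texttt{int+bool}; the sketch below uses it, and even here the two cases carry the names \texttt{Left} and \texttt{Right} rather than being purely positional.

(* (int, bool) Either.t plays the role of the sum type int + bool. *)
let describe (v : (int, bool) Either.t) : string =
  match v with
  | Either.Left i -> "an int: " ^ string_of_int i
  | Either.Right b -> "a bool: " ^ string_of_bool b

let () =
  print_endline (describe (Either.Left 3));
  print_endline (describe (Either.Right true))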
The type \(t_1→t_2\) describes a function that takes in a value of type \(t_1\) and returns a value of type \(t_2\). To describe functions that expect multiple arguments, we can make \(t_1\) a tuple type.
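For example, an OCaml sketch contrasting a function that takes a tuple argument with its curried equivalent:

(* add_pair has type int * int -> int: a single argument that is a pair. *)
let add_pair ((a, b) : int * int) : int = a + b

(* add has type int -> int -> int, that is, int -> (int -> int). *)
let add (a : int) (b : int) : int = a + b

let () = Printf.printf "%d %d\n" (add_pair (1, 2)) (add 1 2)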
In languages like OCaml and C, function types are part of the language, because functions are first-class values. In Java, there are no values with function type, because functions occur only as methods inside objects. However, function types are still useful as a way of thinking about the types of methods.
In some languages, the exceptions that can be raised by a function are also part of the function type. Code that uses the function may be required by semantic analysis to handle all exceptions that might be raised.
Most languages offer some kind of type constructor for mutable arrays. In Java and C we have the type \(t[\,]\); in OCaml we have \texttt{t array}.

In some languages the array type includes information about the size of the array or the bounds on indices. For example, in the language Pascal, the type \texttt{array[1..10] of integer} represents an array of ten integers with 1 as the lowest legal index. In the Cyclone language, a type-safe language based on C, array types may have lengths specified in terms of variables. For example, the declaration \texttt{int f(int n, int[n] a)} declares a function \texttt{f} with two parameters, where the first (\texttt{n}) specifies the length of the second, array argument (\texttt{a}). Any attempt within the function to access the elements of \texttt{a} must be in a context where the compiler can prove that the index is in the range \texttt{0..n-1}.
The Cyclone array type is an example of a dependent type: a type that mentions terms. Dependent types are a powerful language feature, but raise a lot of tricky issues. In general they can lead to the compiler needing to prove complex theorems about possible computations. Most languages with some form of dependent types place restrictions on how they are used to avoid this.
Many languages have some form of reference types. A reference type refers to or points to another value, and can be imperatively updated. The OCaml type \texttt{t ref} is an example of a reference type. In C we have the pointer type \texttt{t *}, which serves a similar function. Pointer types in C also support pointer arithmetic, which is unsafe.
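A tiny OCaml sketch of a reference being read and imperatively updated:

(* r has type int ref: a mutable reference to an int. *)
let r : int ref = ref 0

let () =
  r := !r + 1;               (* update the referenced cell *)
  Printf.printf "%d\n" !r    (* !r dereferences r *)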
Modern languages support parameterized types, which are user-defined type constructors. In Java we can declare parameterized classes, e.g.:
class Cell<T> { T contents; }
Here, \texttt{Cell} is a type-level entity, but no value has the type \texttt{Cell}. Instead, \texttt{Cell} is really a function at the type level. If applied to any type \texttt{T}, it produces a class which looks like

class Cell<T> { T contents; }
We can therefore think of \texttt{Cell} as having a signature \(\texttt{type} \to \texttt{type}\) (technically, we say that the kind of \texttt{Cell} is \(\texttt{type} \to \texttt{type}\)), which is the same signature that we would ascribe to the type constructor for arrays. In OCaml, we can similarly define our own type constructors, e.g.:

type 't cell = {contents: 't}
In either case (\texttt{Cell} or \texttt{cell}), the type constructor describes a computation that can be carried out entirely at compile time. Languages have even been defined in which this type-level computation is Turing-complete, but usually we like to be confident that our compilers will eventually terminate, so this feature is quite unusual.
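For instance, applying the OCaml type constructor \texttt{cell} above to different argument types yields distinct types, all computed at compile time; a short sketch (repeating the definition so it stands alone):

(* int cell and bool cell are two applications of the cell type constructor. *)
type 't cell = { contents : 't }

let ic : int cell = { contents = 3 }
let bc : bool cell = { contents = true }

let () = Printf.printf "%d %b\n" ic.contents bc.contents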