Compile-time Type Inference ============================================ Haldean Brown First draft: Jun 2016 Last updated: Jul 2016 Status: Draft ------------------------------------------------------------------------ A key part of the Ubik compilation process is type checking and type inference. Both of these two steps happen in the same AST pass, called "inference"; despite it's name, it's also responsible for type checking. Inference happens module-by-module; since all top-level definitions have types specified on them, no cross-module inference must occur. Ubik only infers types "forwards", meaning that, at the point of declaration, the tightest-possible type constraint on a name must be known. In general, this shouldn't be a hindrance to a Ubik programmer; the biggest "downside" of this is that top-level declarations must be given a type by the programmer. The upside of this is a simplified type system that can give better errors than a full Hindley-Milner-style type inferencer while still providing most of the convenience that such a type engine provides. As the inferencer runs over an AST, it builds a set of "deductions" and a set of "assertions". "Deductions" are things we know about the type of an object; these are set at the time of the object's instantiation. On the other hand, "assertions" are things that we hope to be true but aren't sure of yet. At the end of AST processing, we ensure that every assertion is satisfied by the deductions we made. For type checking, all that is required is that the assertions generated are sufficient to disallow any nonconformant Ubik program. For type inference, we need a set of deduction rules such that we can propogate types forward adequately. Before we go any further, I'm going to quickly define some sets so that I can be terser throughout the rest of this: - E is the set of all valid Ubik expressions - T is the set of all valid Ubik types - D is the set of all possible derivations. D is equal to E x T: a pair (e, t) means that expression e has type t. - A is the set of all possible assertions. A is also equal to E x T, but its elements have a different semantic meaning; a pair (e, t) means that the type of "e" must be assignable to a type "t". AST inference is expression-based, and each expression has its own set of deduction and assertion rules. Essentially what we're doing is making a pair of functions over source expressions; if D is the set of all derivations, E is the set of all expressions, and A is the set of all assertions, we're defining: d : E -> 2^D a : E -> 2^A These functions map an expression onto a set of derivations and assertions, respectively. The set of derivations for an entire program is the union of the sets of all derivations for each expression in the program, and likewise with the assertions. Once these sets are built, we can determine whether the program is valid through constraint satisfaction (which I'll get to later). First let's start by defining d and a. Since everything is expression-based, and each expression works a little differently, let's break it down by expression type. ------------------------------------------------------------------------ Atomic expressions Atoms can be pretty easy; source literals have the type of whatever sort of atom they are. Names are a little trickier. Because of that, let's break this down a little further into literals and names. Source literals As mentioned before, these are easy. Depending on the type of atom, a source literal has what we'll call a "literal type" L. Literal integers have a literal type of Word, literal strings have type String, literal decimal numbers have type Number. The derivation rules for literals are therefore: d(e) : { (e, L) } The assertion rules are empty; we know these types precisely: a(e) : { } Names This is trickier, but not by much. The name that we're dealing with must have been defined somewhere else (if that weren't the case, then we would have been met with an error in an earlier pass). We go look up where that name was defined and find out what type it was, and then say the type of this expression is the same as that one. Thus, if e has the value of the name n: d(e) : { (e, resolve_type(n)) } a(e) : { } ------------------------------------------------------------------------ Function application Alright, let's do something cooler than just copying stuff. This time, we have an apply expression; an apply expression has a head ("the function", "h") and a tail ("the argument", "t"), and applies the tail to the head. The derivation rules for this are pretty simple: we know that the derived value will have the type of whatever the function returns, and we know that the argument should have the type of the first argument to the head's type. Let's define two partial functions over types before we give the formal definitions for application. Both functions are only defined on application expressions; for any other expression they are undefined (and in the actual type checker, would cause a type error). apply (a -> b) = b head (a -> b) = a With those functions defined, we can easily define d and a on application expressions. As we said earlier, the return type of the function is what we derive our expression's type to be: d(e) : { (e, apply(type(h))) } And we assert the tail has the type of the head's argument: a(e) : { (t, head(type(h))) } ------------------------------------------------------------------------ EXPR_LAMBDA THIS DOESN'T WORK. we need some backwards inference if we're going to make lambdas work without explicit types on all of them, because you have to know how the arguments are going to be used. For \a -> b: d(e) : { (e, type(a) -> type(b)) } a(e) : { (e, type(a) -> type(b)) } ------------------------------------------------------------------------ EXPR_BLOCK For { : a = b ! x }: d(e) : { (e, type(x)) (a, type(b)) } a(e) : { } ------------------------------------------------------------------------ EXPR_COND_BLOCK For ? x { . p1 => t1 . p2 => t2 }: d(e) : { (e, root(t1, t2)) } a(e) : { (x, type(p1)) (x, type(p2)) }