Thursday, September 1, 2011

On Dynamically Typed Languages

In programming languages, types are a very fundamental thing.  All expressions have a type, or at least can be represented as types.  For example, in the expression "1 + 2", where 1 and 2 are integers, the type of the expression is most likely that of an integer.  I say "most likely", because usually languages are defined so that "+" with two integers results in an integer.  (In more formal terms, the type signature of "+" is integer -> integer -> integer, meaning it takes two integers as parameters and it returns an integer.)

Typing information is important regarding a language's implementation.  In general, information regarding types must be known in order to execute a program.  For example, the expression "'moo' + 1" is incorrect in regards to the previous definition of the "+" operator, as 'moo' is not an integer.  This is an example of a type error: the expected types do not match the given types.

To illustrate that type errors are a significant problem, consider assembly language.  In general, assembly languages have no concept of types.  Data is passed around, and the type of the data depends only on the context it is used in.  The same data can represent a variety of things: an integer, a floating point number, a string of characters, a memory address, etc.  Everything is considered "valid", though it is very easy to make a nonsensical command.  Such errors show themselves only by the program behaving incorrectly, which are some of the most difficult bugs to work out. 

Clearly, typing is important.  What is also important, and getting back to the title of this post, is when typing information is determined.  In a statically typed language, typing information is assigned at compile time, before a program is executed.  In these languages, type errors reveal themselves as compile time errors.  In a dynamically typed language, typing information is assigned on the fly as the program runs.  In these languages, type errors reveal themselves as runtime errors of varying severity.  (The severity is determined by other language properties not relevant to this discussion.)

Both types of languages exist in the real world, and there are a number in both camps.  For example, C, C++, and Java are all statically typed languages.  Perl, Python, and JavaScript are all examples of dynamically typed languages.

There are pros and cons for both methods.  For statically typed languages, determining types at compile time means that more efficient code can be generated.  There is no overhead at runtime in verifying types, as such verification is performed at compile time.  Additionally, many common programming errors reveal themselves as type errors, resulting in theoretically more reliable code.  The downside is that certain problems are difficult to express in statically typed languages.  One often focuses more on getting the types just right so code will compile as opposed to actually solving the problem at hand.  Additionally, for many languages, specifying type information requires bulky annotations which take up a significant chunk of the code.  Not to mention the specification of these annotations becomes tedious.

The second listed drawback of statically typed languages, that of bulky type annotations, is frequently heralded as the key advantage of dynamically typed languages.  I will be the first to admit that such annotations are an utter pain in languages like Java, accounting for probably between 10-20% of the characters typed.  Imagine code being 10-20% shorter, just because the language is dynamically typed.

There is, however, a middle ground: type inferencers.  These are built into the compilers of certain statically typed languages, such as Scala, Haskell, and OCaml.  Type inferencers allow the programmer to skip specifying most, if not all, type annotations in code.  When code is compiled, the type inferencer goes to work, filling in the gaps. 

That's the theory, anyway.  Different inferencers have different capabilities.  Personally, I'm very familiar with Scala's inferencer.  In Scala, type annotations must be specified for input parameters to functions.  Additionally, annotations must be provided for the return values of overloaded functions and for recursive functions.  There are also a few edge cases where annotations must be specified, though these are quite rare.  Even though all of these cases require annotations, it cuts down the number significantly.  Comparing Scala code to Java code, I would safely say in half.  This pales in comparison to Haskell's type inferencer, which is so powerful that frequently all type annotations can be omitted.  Haskell code often looks like types are determined dynamically, precisely because explicit type information is notably absent.

One might have an aversion to such inferencers, at least in theory.  What if the inference is wrong, for example?  In my experience with Scala, I can say that this is rare, but it does happen.  That said, when it happens, it usually means I'm doing something REALLY wrong.  In this case, it means I'm consistently doing the wrong thing.  Any inconsistency results in a failure to infer type, as the inferred type is different in different contexts.  As such, the code is still valid, but it suffers from a logical error.  One can do this in a traditional statically typed language as well, but it is even more rare given how frequently typing information must be specified.  The repetition makes it so one tends to notice more.

The way I view it, the typing argument is a continuum between statically typed languages and dynamically typed languages, which various type inferencing mechanisms in between. 

I will admit that true dynamic typing has advantages.  For example, one of Lisp's greatest strengths is its macros, which allow for the creation of entirely new language-level abstractions.  Such macros expand to arbitrary code, and can usually accept arbitrary code.  With a perfect type inferencer that never requires type annotations, this isn't a problem.  The issue is that there is no such thing; inevitably typing information will need to be specified.  With a macro, this type information may need to be inserted in the final expanded code, instead of as a parameter to the macro.  If this is true, then macros can't be implemented properly: one would need a way to manipulate the expanded code, which largely defeats the purpose of macros. 

Personally, I prefer statically typed languages with type inferencers.  I tend to hit a wall with true dynamically typed languages after about a thousand or so lines.  By this point, I have defined several simple data structures, along with operations that act on them.  In a dynamically typed language, I can pass around the wrong data structure into the wrong operation quite easily.  Worse yet, depending on how the data structure is implemented, I may never know except for incorrect program behavior.  For example, if all the data structures are lists of length >= 1, and I have an operation that looks at only the first element and performs some very general operation on the element, then probably every data structure can be passed to this operation without a resulting type error.  By the thousand line mark, I can't keep it all in my head at the same time, and I make these simple errors.  In a statically typed language, this is revealed immediately at compile time, but not so with a dynamically typed language.  At this point I spend more time testing code than writing code, tests that I get "for free" from a statically typed language.

To put it shortly, most of my programmatic errors reveal themselves as type errors, so static typing is important to me.  This issue is so important that I feel that people with a strong preference for dynamically typed languages must think in a way fundamentally different than I.  I know people who can write correct code just as fast in dynamically typed languages as I can in statically typed languages, so there must be something.  Maybe it's practice, or maybe something else is going on.  That would make an interesting case study...