Thursday, September 1, 2011

On Dynamically Typed Languages

In programming languages, types are fundamental.  Every expression has a type, or at least can be assigned one.  For example, in the expression "1 + 2", where 1 and 2 are integers, the type of the expression is most likely integer.  I say "most likely" because languages are usually defined so that "+" applied to two integers results in an integer.  (In more formal terms, the type signature of "+" is integer -> integer -> integer, meaning it takes two integers as parameters and returns an integer.)

Typing information also matters for a language's implementation.  In general, types must be known in order to execute a program.  For example, the expression "'moo' + 1" is incorrect with respect to the previous definition of the "+" operator, as 'moo' is not an integer.  This is an example of a type error: the expected types do not match the given types.
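As a concrete sketch of that signature and the resulting type error, here is a minimal example in Scala (a statically typed language discussed later in the post); the curried `plus` function is my own illustration, not from the text:

```scala
object TypedPlus {
  // A function with the signature described above: integer -> integer -> integer
  def plus(a: Int)(b: Int): Int = a + b

  def main(args: Array[String]): Unit = {
    println(plus(1)(2))   // prints 3
    // plus("moo")(1)     // rejected by the compiler: found String, required Int
  }
}
```

Uncommenting the last line turns the mismatch into a compile-time error rather than misbehavior at runtime.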

To illustrate that type errors are a significant problem, consider assembly language.  In general, assembly languages have no concept of types.  Data is passed around, and what that data means depends only on the context in which it is used.  The same bits can represent a variety of things: an integer, a floating point number, a string of characters, a memory address, etc.  Everything is considered "valid", though it is very easy to write a nonsensical instruction.  Such errors show themselves only through incorrect program behavior, which produces some of the most difficult bugs to track down.

Clearly, typing is important.  What is also important, and what gets back to the title of this post, is when typing information is determined.  In a statically typed language, typing information is assigned at compile time, before a program is executed.  In these languages, type errors reveal themselves as compile-time errors.  In a dynamically typed language, typing information is assigned on the fly as the program runs.  In these languages, type errors reveal themselves as runtime errors of varying severity.  (The severity is determined by other language properties not relevant to this discussion.)

Both kinds of languages are well represented in the real world.  For example, C, C++, and Java are all statically typed languages, while Perl, Python, and JavaScript are all dynamically typed.

There are pros and cons to both approaches.  For statically typed languages, determining types at compile time means that more efficient code can be generated.  There is no runtime overhead in verifying types, as such verification is performed at compile time.  Additionally, many common programming errors reveal themselves as type errors, resulting in theoretically more reliable code.  The downside is that certain problems are difficult to express in statically typed languages.  One often focuses more on getting the types just right so the code will compile than on actually solving the problem at hand.  Additionally, in many languages, specifying type information requires bulky annotations which take up a significant chunk of the code, and writing them quickly becomes tedious.

The second listed drawback of statically typed languages, that of bulky type annotations, is frequently heralded as the key advantage of dynamically typed languages.  I will be the first to admit that such annotations are an utter pain in languages like Java, accounting for probably 10-20% of the characters typed.  Imagine code being 10-20% shorter, just because the language is dynamically typed.

There is, however, a middle ground: type inferencers.  These are built into the compilers of certain statically typed languages, such as Scala, Haskell, and OCaml.  Type inferencers allow the programmer to skip specifying most, if not all, type annotations in code.  When code is compiled, the type inferencer goes to work, filling in the gaps. 

That's the theory, anyway.  Different inferencers have different capabilities.  Personally, I'm most familiar with Scala's inferencer.  In Scala, type annotations must be specified for the input parameters of functions.  Additionally, annotations must be provided for the return values of overloaded functions and of recursive functions.  There are also a few edge cases where annotations must be specified, though these are quite rare.  Even with all of these cases, the number of annotations is cut down significantly; comparing Scala code to Java code, I would safely say in half.  Even this pales in comparison to Haskell's type inferencer, which is so powerful that all type annotations can frequently be omitted.  Haskell code often looks like its types are determined dynamically, precisely because explicit type information is notably absent.
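A small Scala sketch of these rules (the function names are my own):

```scala
object InferenceDemo {
  // Parameter types must be annotated in Scala...
  def double(x: Int) = x * 2        // ...but the return type Int is inferred

  // Recursive functions need an explicit return type annotation:
  def fact(n: Int): Int = if (n <= 1) 1 else n * fact(n - 1)

  val xs = List(1, 2, 3)            // inferred as List[Int]
  val total = xs.map(double).sum    // inferred as Int

  def main(args: Array[String]): Unit =
    println(total + fact(4))        // 12 + 24 = 36
}
```

Dropping the `: Int` on `fact` is exactly the kind of omission the compiler rejects, while everything else compiles annotation-free.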

One might have an aversion to such inferencers, at least in theory.  What if the inference is wrong, for example?  In my experience with Scala, this is rare, but it does happen.  That said, when it happens, it usually means I'm doing something REALLY wrong: it means I'm consistently doing the wrong thing.  Any inconsistency results in a failure to infer a type, as the inferred type differs between contexts.  As such, the code is still valid, but it suffers from a logical error.  One can make the same mistake in a traditional statically typed language as well, but it is even rarer there, given how frequently typing information must be specified; the repetition makes such mistakes easier to notice.

The way I view it, typing is a continuum between statically typed languages and dynamically typed languages, with various type inferencing mechanisms in between.

I will admit that true dynamic typing has advantages.  For example, one of Lisp's greatest strengths is its macros, which allow for the creation of entirely new language-level abstractions.  Such macros expand to arbitrary code, and can usually accept arbitrary code.  With a perfect type inferencer that never requires type annotations, this isn't a problem.  The issue is that there is no such thing; inevitably typing information will need to be specified.  With a macro, this type information may need to be inserted in the final expanded code, instead of as a parameter to the macro.  If this is true, then macros can't be implemented properly: one would need a way to manipulate the expanded code, which largely defeats the purpose of macros. 

Personally, I prefer statically typed languages with type inferencers.  I tend to hit a wall with true dynamically typed languages after about a thousand or so lines.  By this point, I have defined several simple data structures, along with operations that act on them.  In a dynamically typed language, I can pass around the wrong data structure into the wrong operation quite easily.  Worse yet, depending on how the data structure is implemented, I may never know except for incorrect program behavior.  For example, if all the data structures are lists of length >= 1, and I have an operation that looks at only the first element and performs some very general operation on the element, then probably every data structure can be passed to this operation without a resulting type error.  By the thousand line mark, I can't keep it all in my head at the same time, and I make these simple errors.  In a statically typed language, this is revealed immediately at compile time, but not so with a dynamically typed language.  At this point I spend more time testing code than writing code, tests that I get "for free" from a statically typed language.
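The scenario above can be sketched in Scala; `Queue`, `Stack`, and `peekQueue` are hypothetical names of mine, standing in for structures that a dynamic language might both represent as bare lists:

```scala
object WrongStructure {
  // Two structures with the same underlying representation,
  // kept apart here only by their types.
  case class Queue(items: List[Int])
  case class Stack(items: List[Int])

  // Looks only at the first element, as in the scenario above.
  def peekQueue(q: Queue): Int = q.items.head

  def main(args: Array[String]): Unit = {
    val q = Queue(List(1, 2, 3))
    val s = Stack(List(9, 8, 7))
    println(peekQueue(q))   // prints 1
    // peekQueue(s)         // compile-time type mismatch: Stack is not a Queue
  }
}
```

In a dynamically typed language both calls would run, and the second would silently peek at the wrong structure.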

To put it shortly, most of my programming errors reveal themselves as type errors, so static typing is important to me.  This issue is so important that I feel that people with a strong preference for dynamically typed languages must think in a way fundamentally different from the way I do.  I know people who can write correct code just as fast in dynamically typed languages as I can in statically typed languages, so there must be something to it.  Maybe it's practice, or maybe something else is going on.  That would make an interesting case study...

3 comments:

  1. Nice blog post :) I had a few comments:

    (1) Type inference is usually a static algorithm that infers types at compile time and can thus report static errors.  So it does not really come off as something in between static and dynamic typing.  It's more like static typing, but without annotations, because the compiler is sophisticated enough.

    (2) Could you give an example of when Scala type inference is wrong?  (That would mean the Scala type system is unsound, but I thought it wasn't.)

    (3) Although I like Scala's type inference a lot, after some experience writing Scala code, I prefer that manual annotations also be included, to help me or someone else understand the code better.  One point you could include among the pros of static types is that they provide extra documentation, which I personally find very necessary for reading or hacking on someone else's code.  Of course, I find Java's type annotations overkill!

    (4) I am completely on your side about dynamic languages making me a bit worried about the code I write, especially when it crosses, say, 1000 lines.  Unless it is backed up with good testing, I am rarely confident in code written in dynamic languages.

  2. Hahaha, and here I thought this was just a monologue.

    1.) I'm bad at clarifying certain things. By "in-between", I mean in terms of the number of type annotations needed.

    2.) "Wrong" might be too strong a word.  "Not intended" might be better.  The first example involves the overloaded "+" operator.  Say I have two variables x and y.  If I assume that these are integers, then "x + y" yields an integer.  But say I've completely messed something up, and they are actually strings.  "x + y" is still valid, but it types to string.  The type inferencer isn't actually wrong, but the result isn't what I intend.  (The inferencer can't save us from PEBKAC.)
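That scenario can be run directly; a minimal sketch of my own, with variable names mirroring the comment:

```scala
object OverloadedPlus {
  def main(args: Array[String]): Unit = {
    val a = 1
    val b = 2
    println(a + b)    // "+" on Ints: prints 3

    val x = "1"       // intended to be Ints, but actually Strings
    val y = "2"
    println(x + y)    // still type-correct, but "+" is concatenation: prints 12
  }
}
```

Both expressions type-check; only the programmer's intent distinguishes them.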

    Hmm...when I originally wrote this, I didn't know the difference between "sound" and "incorrect".  I've hit a number of cases where the inferencer chooses a less specific type, which is still sound, but where a more specific type is needed.  (Correct me if I'm wrong on this.)
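One concrete case where the inferencer soundly picks a less specific type than intended (my own illustrative example, not from the thread):

```scala
object LubDemo {
  val flag = true
  // The branches have types Int and String; in Scala 2 the inferencer
  // takes their least upper bound, so v is typed as Any, not Int.
  val v = if (flag) 1 else "one"

  def main(args: Array[String]): Unit =
    println(v)   // prints 1, but v cannot be used as an Int without a cast
}
```

The inference is sound, but if an `Int` was needed downstream, an explicit annotation (and a fixed else-branch) is the only way to get it.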

    I've also seen cases where the inferencer seems to not use certain typing information, resulting in failures to type when enough information is provided. Consider the following:

    trait Item {}

    class ItemProducer[T <: Item](val produce: T) {}

    class UseItem(val producer: ItemProducer[_]) {
      def use(): Item =
        producer.produce
    }

    This fails to compile with:
    error: type mismatch;
    found : Any
    required: Item
    producer.produce
    ^
    one error found

    Granted, by writing the unbounded wildcard I effectively specified Any as the generic type for ItemProducer in UseItem, but anything less specific than Item should be impossible because of the constraint on ItemProducer.  The inferencer seems to discard this previously-declared constraint.

    3.) I agree with that.  I don't like having to use scaladoc just to get at the types of everything.  I've found that in dynamically typed languages, nearly half my comments are actually type annotations for humans.

  3. (2) Hmmm, with the "+" example, one point about type inferencing engines could be that they make programmers do type calculations in their heads; with overloaded operators and implicit conversions (albeit specified by the programmer), you can get something you don't expect.  At least you will get static type errors :) But they are "correct".

    Yes, it is true that type inference systems in general pick the most general type possible, and it can annoy the hell out of you when you are trying to get things to work.  Well, that is why most systems are a hybrid of type inference and type annotations (and type annotations usually get preference).  But if you think about it, that's the way it should be, right?

    As far as the example goes, Scala does only local type inference.  So while compiling UseItem, all the compiler knows is that producer is an ItemProducer parameterized over the Any type, but in the function use, you are expecting its original constraints to hold for the type checks to work (which would be non-local).  Also, the programmer only needs to remember things locally: the types and their constraints are all specified then and there :)
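For what it's worth, one way to keep the information local is to repeat the bound on the wildcard itself; a sketch using the same names as the example above (the Sword class and Demo object are mine, added to exercise it):

```scala
trait Item {}

class ItemProducer[T <: Item](val produce: T) {}

// Annotating the wildcard with its bound restores the constraint locally:
class UseItem(val producer: ItemProducer[_ <: Item]) {
  def use(): Item = producer.produce
}

object Demo {
  class Sword extends Item

  def main(args: Array[String]): Unit = {
    val u = new UseItem(new ItemProducer(new Sword))
    println(u.use().isInstanceOf[Item])   // prints true
  }
}
```

With `_ <: Item`, the member selection `producer.produce` types as Item, and the original error goes away.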
