Sunday, August 18, 2013

Parser Combinators in Common Lisp

Within the field of programming languages, parsing is often considered an afterthought.  Yes, we need to be able to parse in a program in order to execute it, but for most people this is not the "interesting" part.  We tend to ignore this step entirely, treating parsing as a solved problem when it really isn't; ignoring it does not make it any less difficult or important.  For many languages, parsing realistically requires knowledge of parser generators like ANTLR or Bison, which automatically generate parsers from some input grammar.  While such generators are effective and can produce well-performing parsers, they come with a definite learning curve.  Additionally, depending on the parser generator chosen, an unwanted degree of separation can be introduced between the parser and the rest of the code: the parser is largely a black box, which may or may not be desirable.

An alternative approach to parser generators is to use what are known as parser combinators.  With parser combinators, the parser can be represented directly in the metalanguage (i.e. the language your interpreter/compiler is written in), and it can be easily manipulated.  Moreover, these tend to be easier to use than parser generators; only a handful of APIs are needed for a working parser, as opposed to knowledge of a whole new tool.  The resulting code also tends to mirror the BNF grammar being parsed, allowing for easier maintainability.  The downside is that parser combinators are not the fastest things around by any means, but they can get a working parser off the ground quickly.  But instead of blabbing on about parser combinators, let's let some code do the talking.

Moving forward, I assume that the input to be parsed has already been tokenized.  Given that tokenization is typically a series of one-to-one translations between source strings and tokens, this should be a fair assumption.  If not, one could always encode the whole underlying language in the parser, complete with (usually) unimportant pieces like whitespace.

Parser combinators are based on a simple, usually true, assumption: any parser can be viewed as a function that takes a stream of tokens and returns some result.  More formally, this can be seen as:

parser: Stream[Token] -> ParserResult

This seems obvious enough, given that a parser by its very nature reads tokens and yields some result.  More interesting is exactly what ParserResult looks like.

For simplicity, let's say that we are only interested in recognizing whether or not a particular string is a member of the language specified by a context-free grammar (CFG); we'll relax this restriction later.  With this restriction, we are interested only in whether or not a parser was able to parse in some tokens.  If a parser fails to parse, then the tokens were not recognized, and the parser thus returns some failure indicator.  For the remainder of this discussion, let's refer to this as ParserFailure, which is a subtype of ParserResult.

What of the case where the parser succeeds?  Intuitively, in this case we return the opposite of failure, which will be called ParserSuccess in the remainder of this discussion.  However, we must also return one extra piece of information.  In parsing some form, the parser consumed some number of tokens from the input stream.  When the parser returns, we must know where it left off in this stream, i.e. what was not consumed.  As such, ParserSuccess also returns a stream consisting of all the tokens following what was successfully parsed in.

To better illustrate this, consider the code below that generates a parser that can recognize a given symbol lit.  In this code, if a string is returned, then it is assumed to be indicative of parser failure.  This allows for error messages to be automatically generated during parsing.  If a list is returned instead of a string, then it's assumed to be indicative of parser success, where the returned list consists of the tokens remaining in the stream (i.e. everything after the literal that was parsed in).
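
One way such a parser might look is sketched below; the function name parse-literal and the use of a plain list of symbols as the token stream are illustrative assumptions, not necessarily the choices made in the full code linked at the end:

;; Build a parser (a closure) that recognizes the single symbol LIT.
;; A returned string signals failure and doubles as an error message;
;; a returned list signals success and holds the remaining tokens.
(defun parse-literal (lit)
  (lambda (tokens)
    (cond ((null tokens)
           (format nil "Expected ~a but reached the end of the input" lit))
          ((eq (first tokens) lit)
           (rest tokens))
          (t (format nil "Expected ~a but found ~a" lit (first tokens))))))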

This parser is clearly very simple, and may not appear to be very useful at first glance.  After all, it is easy enough to define a parser like this using only regular expressions.  However, the difference is that parser combinators allow for, well, combination of simpler parsers to form more complex parsers.  For example, we may want two parsers to run in sequence (i.e. AND), or we may want to indicate that multiple potential forms are possible (i.e. OR).  It is possible to derive such AND and OR parsers using these same techniques, all in a manner which is independent of the underlying parsers being combined.

First, consider the AND case.  AND first runs the first parser it was given.  If this parser returns success, then AND runs the second parser on the remaining tokens in the stream, returning the result of the second parser.  If either parser fails, then AND fails.  This behavior can be implemented like so:
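
A sketch of this combinator, under the same string-means-failure convention (the name parse-and is again just for illustration):

;; Run PARSER1; if it succeeds, run PARSER2 on whatever tokens
;; PARSER1 left over.  A failure (a string) from either parser is
;; returned unchanged.
(defun parse-and (parser1 parser2)
  (lambda (tokens)
    (let ((result (funcall parser1 tokens)))
      (if (stringp result)
          result
          (funcall parser2 result)))))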

Oddly enough, the AND parser's definition is shorter than that of the basic literal parser.

Now let us consider the OR case.  With the OR parser, it first tries to run the first parser.  If that succeeds, then it simply forwards the result.  If not, then it runs the second parser.  This is implemented below:
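
A corresponding sketch (again with an illustrative name, parse-or):

;; Try PARSER1 on the tokens; only if it fails (returns a string) do
;; we fall back to PARSER2, starting from the same untouched tokens.
(defun parse-or (parser1 parser2)
  (lambda (tokens)
    (let ((result (funcall parser1 tokens)))
      (if (stringp result)
          (funcall parser2 tokens)
          result))))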

Once again, the definition of OR is quite short.
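
For instance, with the sketched definitions above, a toy rule such as greeting ::= ("hello" | "hi") "world" can be expressed and run directly; the token names here are arbitrary:

;; greeting ::= ("hello" | "hi") "world"
(defparameter *greeting*
  (parse-and (parse-or (parse-literal 'hello)
                       (parse-literal 'hi))
             (parse-literal 'world)))

(funcall *greeting* '(hi world))     ; => NIL (success; no tokens left)
(funcall *greeting* '(hello there))  ; => "Expected WORLD but found THERE"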

At this point, we have enough infrastructure to start parsing in some real things.  After all, conjunction (AND) and disjunction (OR) are the bread-and-butter of BNF grammars.  However, such a parser is still not very useful: it can only recognize languages, while ideally we would like to generate ASTs (abstract syntax trees).  With a bit of modification, we can add this functionality in an extensible way.

Looking again at ParserSuccess, it seems clear that whatever result is being built up should be passed along as parsing proceeds.  For example, while generating an AST for an expression of the form -e, the fact that - was encountered should be retained while e is parsed.  We can do this by modifying ParserSuccess to return a pair holding our custom result along with the tokens that remain in the stream.  Whenever parsing succeeds, we also need to apply some user-defined function to produce our processed result.  While we could modify all of the parsers defined so far to have this extra functionality, a more modular approach is to introduce a new kind of parser that handles this behavior.  This parser will first attempt to apply some other parser to a stream.  If the passed parser succeeds, then it will apply a user-defined function to the result of the other parser.  If the passed parser fails, then the failure is simply propagated.  This parser is detailed below.  Note that lists of two elements are used to represent a pair:
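
A sketch of this wrapping parser follows, with success now represented as a two-element list of the custom result and the remaining tokens; the name parse-result and the argument order are my own choices:

;; Wrap PARSER so that, on success, FN is applied to the result half
;; of the (result remaining-tokens) pair; the remaining tokens are
;; passed through as-is.  Failures (strings) propagate unchanged.
(defun parse-result (parser fn)
  (lambda (tokens)
    (let ((res (funcall parser tokens)))
      (if (stringp res)
          res
          (list (funcall fn (first res))
                (second res))))))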

Of course, the AND parser still needs to be modified in order to use the new definition of ParserSuccess.  Not only does it need to grab the correct element of the tuple in ParserSuccess, but it must also hold on to the results of the parsers it runs.  These are simply held in tuples, until the all-important result parser is triggered again.  Our new definition of the AND parser becomes the following:
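
Under the pair representation, and assuming the basic literal parser has likewise been updated to return (list lit remaining-tokens) on success, the new AND parser might be sketched as:

;; Like the old AND, but: pull the remaining tokens out of the pair
;; returned by PARSER1 before running PARSER2, and pair up the two
;; sub-results so a later result parser can process them together.
(defun parse-and (parser1 parser2)
  (lambda (tokens)
    (let ((res1 (funcall parser1 tokens)))
      (if (stringp res1)
          res1
          (let ((res2 (funcall parser2 (second res1))))
            (if (stringp res2)
                res2
                (list (list (first res1) (first res2))
                      (second res2))))))))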

With this new definition of the AND parser handy, we can run silly little examples like so:
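
For instance, continuing with the sketched definitions above (and an updated literal parser whose result is the matched symbol itself, which is an assumption on my part), a tiny AST for input of the form - x could be built like so:

;; Literal parser updated to the pair representation: the result of a
;; successful parse is the matched symbol (an assumed convention).
(defun parse-literal (lit)
  (lambda (tokens)
    (cond ((null tokens)
           (format nil "Expected ~a but reached the end of the input" lit))
          ((eq (first tokens) lit)
           (list lit (rest tokens)))
          (t (format nil "Expected ~a but found ~a" lit (first tokens))))))

;; negation ::= "-" "x", producing a (NEGATE X) node.
(defparameter *negation*
  (parse-result (parse-and (parse-literal '-)
                           (parse-literal 'x))
                (lambda (pair)
                  (list 'negate (second pair)))))

(funcall *negation* '(- x))  ; => ((NEGATE X) NIL)
(funcall *negation* '(+ x))  ; => "Expected - but found +"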

A complete definition of the parser combinators shown here, along with an example parsing in a simple grammar for sentences, is on GitHub.