An alternative approach to parser generators is to use what are known as parser combinators. With parser combinators, the parser can be represented directly in the metalanguage (i.e. the language your interpreter/compiler is written in), and it can be easily manipulated. Moreover, these tend to be easier to use than parser generators; only a handful of APIs are needed for a working parser, as opposed to knowledge of a whole new tool. The resulting code also tends to mirror the BNF grammar being parsed, allowing for easier maintainability. The downside is that they are not the fastest things around by any means, but they can get a working parser off of the ground quickly. But instead of blabbing on about parser combinators, let's let some code do the talking.
Moving forward, I assume that the target language has already been tokenized. Given that this process is typically a series of one-to-one translations between source strings and tokens, this should be a fair assumption. If not, one could always encode the whole underlying language in the parser, complete with (usually) unimportant pieces like whitespace.
Parser combinators are based off of a simple, usually true, assumption: any parser can be viewed as a function that takes a stream of tokens and returns some result. More formally, this can be seen as:
parser: Stream[Token] -> ParserResult
This seems obvious enough, given that a parser by its very nature reads tokens and yields some result. More interesting is exactly what ParserResult looks like.
For simplicity, let's say that we are only interested in recognizing whether or not a particular string is a member of the language specified by the CFG (we'll relax this restriction later). With this restriction, we are interested only in whether or not a parser was able to parse in some tokens. If a parser fails to parse, then the tokens were not recognized, and the parser thus returns some failure indicator. For the remainder of this discussion, let's refer to this as ParserFailure, which is a subtype of ParserResult.
What of the case where the parser succeeded to parse in? Intuitively, in this case we return the opposite of failure, which will be called ParserSuccess in the remainder of this discussion. However, we must also return one extra bit. In parsing in some form, the parser consumed some number of tokens from the input stream. When the parser returns, we must know where it left off in this stream, i.e. what was not consumed. As such, ParserSuccess also returns a stream consisting of all the tokens following what was successfully parsed in.
To better illustrate this, consider the code below that generates a parser that can recognize a given symbol lit. In this code, if a string is returned, then it is assumed to be indicative of parser failure. This allows for error messages to be automatically generated during parsing. If is returned instead of a string, then it's assumed to be indicative of parser success, where the returned list consists of the characters remaining in the stream (i.e. everything after the literal that was parsed in).
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(defun literal-parser (lit) | |
(lambda (stream) | |
(cond | |
((null stream) "End of stream") | |
((not (eq (car stream) lit)) | |
(concatenate 'string "Expected " (symbol-name lit) " but saw " (symbol-name (car stream)))) | |
(T (cdr stream))))) |
First, consider the AND case. AND first runs the first parser it was given. If this parser returns success, then AND runs the second parser on the remaining tokens in the stream, returning the result of the second parser. If either parser fails, then AND fails. This behavior can be implemented like so:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(defun and-parser (p1 p2) | |
(lambda (stream) | |
(let ((res1 (funcall p1 stream))) | |
(if (stringp res1) res1 ; failure | |
(funcall p2 res1))))) |
Now let us consider the OR case. With the OR parser, it first tries to run the first parser. If that succeeds, then it simply forwards the result. If not, then it runs the second parser. This is implemented below:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(defun or-parser (p1 p2) | |
(lambda (stream) | |
(let ((res1 (funcall p1 stream))) | |
(if (stringp res1) ; failure | |
(funcall p2 stream) | |
res1)))) | |
At this point, we have enough infrastructure to start parsing in some real things. After all, conjunction (AND) and disjunction (OR) are the bread-and-butter of BNF grammars. However, this parser is still not very useful - it can only recognize languages, while ideally we would like to generate ASTs. With a bit of modification, we can add this functionality in an extensible way.
Looking again at ParserSuccess, it seems clear that whatever result is generated should be passed around within while it is generated. For example, while generating an AST for an expression of the form -e, the fact that - was encountered should be retained while e is parsed. We can do this by modifying ParserSuccess to return a pair holding our custom result along with the tokens that remain in the stream. Whenever parsing succeeds, we also need to apply some user-defined function to return our processed result. While we could modify all of the parsers defined so far to have this extra functionality, a more modular approach is to introduce a new kind of parser that handles this behavior. This parser will first attempt to apply some other parser to a stream. If the passed parser succeeds, then it will apply a user-defined function to the result of the other parser. If the passed parser fails, then the failure is simply propagated. This parser is detailed below. Note that lists of two elements are used to represent a pair:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(defun res-parser (p make-result) | |
(lambda (stream) | |
(let ((res (funcall p stream))) | |
(if (stringp res) ; failure | |
res | |
(list (funcall make-result (first res)) | |
(second res)))))) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(defun and-parser2 (p1 p2) | |
(lambda (stream) | |
(let ((res1 (funcall p1 stream))) | |
(if (stringp res1) res1 ; failure | |
(let ((res2 (funcall p2 (second res1)))) | |
(if (stringp res2) res2 ; failure | |
(list | |
(list (first res1) (first res2)) | |
(second res2)))))))) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(funcall | |
(res-parser | |
(and-parser2 | |
(literal-parser 'foo) | |
(literal-parser 'bar)) | |
#'(lambda (and-res) | |
(list 'hi (first and-res) (second and-res)))) | |
(list 'foo 'bar)) |