I just played around a bit with Parsec, which is a parser which comes with all major Haskell98 implementations. Usually if you’re doing parsing in Haskell you’re using Happy, which is a fairly typical LALR(1) functional parser generator.

Parsec is a top-down parser, much like a recursive descent parser. Parsec provides you a monad in which to write pretty grammars; it then returns to you either a parse value—probably a tree—or an error message. I have to admit, the error messages it gives you by default, if you can’t parse, are stunning. Typically error messages are the hardest, and least often done, tasks in parsing, but Parsec’s defaults are fairly clever.

Anyway, combinator parsers like Parsec, like LL(k) parser generators, are predictive. I.e., you should know what rule you’re using to parse by some finite, constant look-ahead. In LL(k) parsing, this “should” is a strict requirement, and is what gives LR(k) its supremacy over LL(k) in terms of languages recognized; in Parsec, it’s a performance issue. Parsec handles back-tracking for you, but for performance reasons, the manual suggests you factor out your grammar yourself.

Example parenthesis grammar in an LALR(1) or LR(k) system:
a := '(' a ')'
  | ε

Example parenthesis grammar in Parsec:
parens =
  do
    char '('
    parens
    char ')'
  <|>

I’ve left out attached semantics, but they don’t ugly things up much more. The very cool thing about combinator parsers is the integration with the main language. Terminals, nonterminals and rules are nothing but Haskell functions and data. Thus, if you have some grammar design pattern, such as infix operator associativity, or lists of nonterminals with some separators, all you have to do is make a new combinator.

I should say that another cool thing about Parsec is that it’s not limited to a subset of context-free grammars, or even limited to context-free grammars at all. It’s not too much trouble to write Parsec grammars which accept non-context-free, but context-sensitive, languages.

The downside, which fits into a previous rant of mine, is that you’re shifting parsing from compile-time to run-time. Parser generator manuals like to give toy examples and show the expressiveness and how wonderful things are, but the fact is that writing correct grammars is hard. Maybe I’m just especially dumb with grammars, but I like to have all the help I can get. I like it when yacc tells me I have reduce/reduce conflicts or what have you: it tells me that I haven’t written my grammar correctly yet.

Also, its expressiveness and integration with Haskell, allowing you to do write non-context-free languages and crazy things like that, doesn’t strike me as a big win. Any language that you’d like to express in a computer science context is certainly going to be context-free—probably much more restrictive than that—and I like the sharp division we currently have between syntax and semantics. The simple, bitchy, LALR(1) parser generator deals with the syntax tree. Once it hands you the tree, you can forget all about syntax and start looking at semantics. I can’t come up with a parsing situation in which I’d want anything else.

Update: I did actually come up with a situation in which I’d want something else. There’s a rather famous language that I can’t recall (ALGOL?) which had no keywords. It allowed you to write ridiculously stupid things like IF IF = THEN THEN THEN = ELSE—e.g., IF, THEN and ELSE are regular identifiers, as well as constructs, depending on context, hence the “no keywords” idea. Does anyone recall which language this was, and why such a brilliant decision was made to not have keywords? In any case it’s a pain to parse: the usual trick is for the grammar to “know” when it’s looking for a keyword, and then flag the lexer; the lexer will then stop parsing identifiers and start parsing keywords. It’s an even bigger pain if you want to make your parser re-entrant, and a crazy combinator parser like Parsec might just make your life a lot easier then.

About these ads