Combinator parsing

I just played around a bit with Parsec, which is a parser which comes with all major Haskell98 implementations. Usually if you’re doing parsing in Haskell you’re using Happy, which is a fairly typical LALR(1) functional parser generator.

Parsec is a top-down parser, much like a recursive descent parser. Parsec provides you a monad in which to write pretty grammars; it then returns to you either a parse value—probably a tree—or an error message. I have to admit, the error messages it gives you by default, if you can’t parse, are stunning. Typically error messages are the hardest, and least often done, tasks in parsing, but Parsec’s defaults are fairly clever.

Anyway, combinator parsers like Parsec, like LL(k) parser generators, are predictive. I.e., you should know what rule you’re using to parse by some finite, constant look-ahead. In LL(k) parsing, this “should” is a strict requirement, and is what gives LR(k) its supremacy over LL(k) in terms of languages recognized; in Parsec, it’s a performance issue. Parsec handles back-tracking for you, but for performance reasons, the manual suggests you factor out your grammar yourself.

Example parenthesis grammar in an LALR(1) or LR(k) system:
a := '(' a ')' | ε

Example parenthesis grammar in Parsec:
parens = do char '(' parens char ')' <|>

I’ve left out attached semantics, but they don’t ugly things up much more. The very cool thing about combinator parsers is the integration with the main language. Terminals, nonterminals and rules are nothing but Haskell functions and data. Thus, if you have some grammar design pattern, such as infix operator associativity, or lists of nonterminals with some separators, all you have to do is make a new combinator.

I should say that another cool thing about Parsec is that it’s not limited to a subset of context-free grammars, or even limited to context-free grammars at all. It’s not too much trouble to write Parsec grammars which accept non-context-free, but context-sensitive, languages.

The downside, which fits into a previous rant of mine, is that you’re shifting parsing from compile-time to run-time. Parser generator manuals like to give toy examples and show the expressiveness and how wonderful things are, but the fact is that writing correct grammars is hard. Maybe I’m just especially dumb with grammars, but I like to have all the help I can get. I like it when yacc tells me I have reduce/reduce conflicts or what have you: it tells me that I haven’t written my grammar correctly yet.

Also, its expressiveness and integration with Haskell, allowing you to do write non-context-free languages and crazy things like that, doesn’t strike me as a big win. Any language that you’d like to express in a computer science context is certainly going to be context-free—probably much more restrictive than that—and I like the sharp division we currently have between syntax and semantics. The simple, bitchy, LALR(1) parser generator deals with the syntax tree. Once it hands you the tree, you can forget all about syntax and start looking at semantics. I can’t come up with a parsing situation in which I’d want anything else.

Update: I did actually come up with a situation in which I’d want something else. There’s a rather famous language that I can’t recall (ALGOL?) which had no keywords. It allowed you to write ridiculously stupid things like IF IF = THEN THEN THEN = ELSE—e.g., IF, THEN and ELSE are regular identifiers, as well as constructs, depending on context, hence the “no keywords” idea. Does anyone recall which language this was, and why such a brilliant decision was made to not have keywords? In any case it’s a pain to parse: the usual trick is for the grammar to “know” when it’s looking for a keyword, and then flag the lexer; the lexer will then stop parsing identifiers and start parsing keywords. It’s an even bigger pain if you want to make your parser re-entrant, and a crazy combinator parser like Parsec might just make your life a lot easier then.

augustss Says:

February 26, 2007 at 4:42 am

First, many languages, e.g. C, are context sensitive (think typedef), but you have to hack around it in some way. Isn’t it nice not to have to hack around it?

Second, the maybe language without reserved words you are thinking of is PL/I? The excuse for this was that PL/I has so many keywords the you couldn’t expect people to learn them all. :)

Mike Says:

February 26, 2007 at 9:26 am

Ahh thank you! Yes, it is indeed PL/I I was thinking of. Personally I’d rather have proper keywords, haha.

As for typedefs, the grammar seems the wrong place to deal with this. My understanding of the C standard is that typedef is just a storage classifier, in the same category as static and register (e.g., typedef int foo; is declaring the “variable” foo with storage qualifier typedef). Consequently typedefs are subject to all the usual scoping, etc., rules for variables. It seems a better idea to deal with typing after parsing is complete.

Indeed if you read section A.2 in the C99 standard (Phase structure grammar), they’re setting this up just deal with typedefs blindly. During parsing, you find identifier identifier;, and it’s only later that you would ensure that the first identifier token is a typedef.

Not that compiler writers should follow the grammar verbatim, but I think it’s a good way to structure your compiler code.

Ricardo Herrmann Says:

February 26, 2007 at 1:32 pm

As far as I know, MUMPS doesn’t have keyword either … you could even go further and abbreviate WHEN with W and IF with I and use I and W vars in their expressions ;-)

Kevin Scaldeferri Says:

February 26, 2007 at 2:32 pm

Perl is notoriously context-sensitive. As are natural languages. This is not a coincidence.

Aaron Denney Says:

March 9, 2007 at 2:37 am

the problem with typedef is not parsing it, but parsing things later on — it introduces new types, so when you’re looking for a type, you need to check to see if it’s one of the types that have been defined.

March 9, 2007 at 12:58 pm

Yes, but I would consider checking types to be a semantic check. This could be done after parsing.

Typedefs are in many ways analogous to variables. For example, when parsing an expression which uses a variable, you don’t check at parse time if that variable has been declared. Similarly, if you use a type name in some declaration, you don’t check at parse time if that type name has been declared. Rather, you build the parse tree and do semantics checks on that tree.

I should say there were efforts to do these sorts of things at parse time. The idea is that every time there’s a variable/type declaration, you add rules to your grammar, and when you leave the current scope, you remove the rules from your grammar. It’s cool, but doesn’t gain you a lot over the current separation we have now of syntax verification vs. semantic verification.

Jason Says:

March 24, 2008 at 9:48 pm

no. you can’t. you have to know at parse time. perhaps this example will help.

typedef int z;

int main(int argc, char * argv)
{
z(b);
return 1;
}

z(b) is *not* a function. it’s a type declaration. you can’t get that right unless the parser has access to the type table, or your parser doesn’t do semantics.

Copii Says:

January 5, 2011 at 3:05 am

Very interesting and usefull post. Thank you Mike

buying term papers Says:

May 6, 2013 at 12:39 pm

Hey! I understand this is somewhat off-topic but I had to ask.

Does building a well-established blog such as yours take a massive amount
work? I am brand new to operating a blog however
I do write in my journal every day. I’d like to start a blog so I can share my experience and views online. Please let me know if you have any ideas or tips for new aspiring bloggers. Appreciate it!

live fuck cam Says:

May 15, 2013 at 10:45 pm

Greetings from Los angeles! I’m bored to tears at work so I decided to browse your blog on my iphone during lunch break. I love the information you provide here and can’t wait to take a look when I get home.
I’m surprised at how fast your blog loaded on my mobile .. I’m not
even using WIFI, just 3G .. Anyways, fantastic
site!

www.tech.de Says:

July 2, 2013 at 2:25 am

Wow, wonderful weblog format! How lengthy have you been blogging
for? you make blogging glance easy. The total glance of your website is excellent,
as smartly as the content!

fistIng Says:

August 14, 2013 at 2:09 am

Great post. I was checking constantly this blog and I
am impressed! Extremely helpful information specifically the last part :) I care for
such information much. I was looking for this certain info for a very long time.
Thank you and good luck.

Closure Sale