(Plan for) MixFix Processing

BitC accepts mixfix expressions. This note describes some of the implementation issues.

What is MixFix?

A ``classic'' mixfix declaration takes one of the forms:

mixfix _+_   5     // infix   (starts and ends with an expression)
mixfix _!    9     // postfix (starts with a expression)
mixfix not_  6     // prefix  (ends with an expression)
mixfix (_)   6     // closed  (starts and ends with a quasi-keyword)
mixfix _[_]  6     // multi-part postfix

A given mixfix specification can have an arbitrary number of expression ``holes.'' Holes may not be adjacent. The mixfix syntax adopted in Agda is typical of the breed.

Mixfix declarations have the effect of augmenting the expression grammar. Expressions in the primary grammar have some "primary" cases that are higher precedence than any mixfix production (e.g. "a.b"). When the primary parser processes the mixfix_expr production, it gathers the elements into a linear list consisting of expressions. Some of these really are expression nodes that resulted from primary expression parsing. Most are identifiers. Some of the identifiers turn out to be mixfix quasi-keywords introduced by mixfix declarations. Those will be processed/rearranged by the mixfix processor.

BitC augments the Agda-style mixfix syntax in two ways:

The BitC mixfix design borrows two elements from Agda: (1) the ability to specify mixfix rather than just infix operators, and (2) the notion that "interior" (fully bracketed) expressions can be of any precedence. That is: precedence relations are only checked at edge holes. In contrast to Agda, we retain the tradition that precedence relations are transitive. While we treat quasi-keywords as keywords while they are in scope (meaning mainly that they cannot be rebound), we do not require that they be unique. In particular, it is perfectly okay in BitC to introduce the mixfix rules:

mixfix while_do#_  5
mixfix until_do#_  5

even though they both use "do" as a quasi-keyword. The fact that quasi-keywords are not unique somewhat increases the challenge of the BitC parsing problem in comparison to other mixfix implementations.

Note:
Actually, this example isn't quite legal because while, until, and do are keywords, but with the introduction of the mixfix mechanism and the (forthcoming) support for automatic thunking these may move out of the core and get specified as mixfix entries in the prelude.

Handling of Precedence

One of the back-and-forth discussions in the literature is whether precedence relations in mixfix should be transitive. For example, if we know _AND_ < _==_ and _==_ < _+_, does it follow that _AND_ < _+_ (which is arguably nonsense)? Danielsson argues that transitive precedence relations are undesirable, because they can lead to this sort of meaningless outcome. My view is that uses of mixfix tend to fall into two cases:

My (unsubstantiated) belief is that the perceived problems with transitive precedence relations arise mainly in the second situation, and are a result of the fact that existing implementations don't provide a first-class notion of syntax rules. That's easily remedied without too much complexity, so we're going to give it a try in BitC.

Theory of Operation

MixFix itself is conceptually straightforward, but not documented in a way that is particularly accessible to first-time implementors (like myself). The ``main'' parser (e.g. one built in bison) has expression productions that construct a linear list of expressions, some of which are quasi-keywords. For example, the mixfix rule:

mixfix _[_]  6     // postfix with multiple expressions

introduces the quasi-keywords '[' and ']'. While this rule is in scope, these identifiers will be treated as keywords. The rule also introduces what amounts to a grammar production

mixfix_expr: mixfix_expr '[' mixfix_expr ']'  %prec 6

The job of the mixfix layer is to implement these productions. The implementation challenge derives from the fact that the set of productions is not fixed: mixfix declarations can (eventually) occur in local scopes. In the classic table-driven parsing techniques, adding a production is (barely) tractable, but deleting one is pretty darned expensive. So, for example, the AGDA implementation uses memoizing parser combinators, and (I assume - haven't checked) flushes the memoization cache when a production is removed.

A mixfix grammar is qualitatively simpler than a CFG:

The implementation chosen here here is a variant on a back-tracking left-corner parser. We first seek a highest-precedence reduction in the left corner, use recursive descent in internal expressions, and then do a precedence-constrained recursive descent parse on the right corner. Back-tracking can occur in internal positions, and in the right-most position when the right-side construction proves to be right-associative. The parser itself is interpreted in order to avoid the challenges of inserting and removing productions from a table-driven parser.

Further optimizations are possible to control the cost of back-tracking if this becomes necessary. In particular, observe that if a given input position corresponds to the beginning of an expression, the position may or may not ultimately be the start of an expression, but if it is the start of an expression, the input from that point will always parse the same way. This is true because a mixfix parser only has one syntactic category (i.e. non-terminal). It means that parse sub-trees constructed in previous passes can be retained in pack-rat style.

In the present parser, I haven't implemented the pack-rat optimization for several reasons: # right-associative expressions are very rare # some amount of syntactic category structure is imposed in the bison parser, with the effect that most mixfix parses occur over very simple trees.

Supporting Data Structures

The implementation maintains some supporting data structures:

The mixfix parser implementation itself is a recursive, predictive, shunting-yard variant. It uses a distinct parse stack and copy of the input for each production it attempts so that it is easier to back-track.

There are several aspects of the BitC mixfix specification that simplify or help to optimizer the parsing problem:

Complications

After getting an initial mixfix implementation working, I stumbled into an unpleasant surprise. The expression:

a + (b-1)

got completely mangled, because function application has higher precedence than anything else. In a mixfix parser, the ``+'' is just an idenfitier, so the +(b-1) got processed as a procedure application in the primary parser before processing ever made it into the mixfix layer. The resulting input at the mixfix layer was:

a {apply + {apply - b 1}}

Needless to say, this was not quite what I had in mind. The question is: since ``+'' is just an identifier, why are the following two things handled differently?

a + (b-1)
a f (b-1)   // invalid

The answer is that they are different because, at the mixfix layer, the ``+'' is not an identifier; it is a quasi-keyword. In consequence, the input pattern +(b-1) does not have an expression on the left, and therefore will not match the rule for application. "Hang on," you say, "the rule for application?"

Yes. That's what this whole ``complications'' discussion is leading up to. We can't make decisions about what is an application until we get into the mixfix layer, which leads us to the following chain of conclusions:

Here is how I ended up sorting this all out:

First, remember that the mixfix layer is building ASTs, and that it is purely context free. Because of this, my description above of the apply pattern doesn't work, and here's what we need to do to sort out pairs and application:

The first important point in all of this is to notice that an {apply _,_ e} is always ``wrapped'' by something else in well-formed input, and conversely, that an {apply (_) e} node which is merely a parenthesized expression can be identified by virtue of not being wrapped.

The second important point is that the operator precedence rules for _,_, (_), and ``_ _'' (and friends) work out correctly, primarily because _,_ always appears inside a closed form.

The end result is that we can let the mixfix layer assemble an expression tree by running the precedence-based parser, and then make a post-pass to rewrite that tree into the proper form for literal constructions, vector constructions, and applications.

Which, after seeing the initial mess created by:

a + (b-1)

is a feasibility conclusion that I was very relieved to reach!

Issues on Implementation

When I went to actually implement, I found that the preferred parse rules for handling _,_ are not possible. Ideally, the primary parser would ensure that _,_ only arises within the legal bracketing forms (_) or [_]. In practice, this causes a long list of parse conflicts. The workaround is to simply accept ``,'' as if it were an identifier, and leave the syntactic issue to be diagnosed in the mixfix handler through the absence of any ,_ or _, productions.

The Parser

The parser a performs a mix of bottom-up and top-down parsing. It is bottom-up on the left and top-down on the right. While attempting to match a rule, a conventional recursive-descent parser can end up testing both left-side and right-side rules repeatedly. The intuition here is that on the left-hand side of the input we are necessarily going to end up reducing the highest-precedence production that matches the leading input, so we might as well reduce that eagerly without redundant effort. That is the bottom-up part of the strategy.

But in order to do that, we need to make sure that any hole appearing on the right of the candidate production is matched. This means that we need to do a recursive-descent style parse when we encounter a right-side hole. The interesting case there is when we have a "left rule" of the form ...left_ and a "right rule" of the form _right.... Because of precedence and associativity, these might combine in three ways:

left_
    \         right rule has higher precedence,
    _right    or right-associative

    _right    left rule has higher precedence,
    /         or left-associative
left_

left_ <=> _right  non-associative; no parse

Only the first case allows us to combine the two productions into a single case. What we do is a precedence-constrained right recursion. In the other two cases, we simply shift the intervening hole, reduce, push the resulting parse sub-tree back onto the input, and proceed again to attempt a left-corner reduction.


Generated on 19 May 2013 for BitC Compiler by  doxygen 1.4.7