Day 9 – Longest Token Matching

Perl 6 regular expressions prefer to match the longest alternative when possible.

say "food and drink" ~~ / foo | food /;   # food

This is in contrast to Perl 5, which would prefer the first alternative above, and produce the match “foo”.

You can still get the first-alternative behavior if you want; it’s tucked away in the slightly longer alternation operator ||:

say "food and drink" ~~ / foo || food /;  # foo

…And that’s it! That’s Longest Token Matching. ☺ Short post.

“Huh, wait!” I hear you exclaim, in a desperate attempt to make the daily Perl 6 Advent goodness last a bit longer. “Why is Longest Token Matching such a big deal? Who would ever be so obsessed with long tokens?”

I’m glad you asked. As it turns out, Longest Token Matching (or LTM for short) plays very well with our intuition about how things should be parsed. If you’re creating a language, you want people to be able to declare a variable forest_density without the mention of this variable clashing with the syntax of for loops. LTM will see to that.

I like “strange consistencies” — when distal parts of a language design turn out to have commonalities that make the language feel more uniform. There is that kind of consistency here, between classes and grammars. Perl 6 basically exploits that consistency to the max. Let me briefly map out what I mean.

We’re all used to writing classes at this point. From a birds-eye view, they look like this:

class {
    method
    method
    method
}

Grammars have a suspiciously similar structure:

grammar {
    rule
    rule
    rule
}

(The keywords are actually regex, token and rule, but when we talk about them as a group, we just call them “rules”.)

We’re also used to being able to derive classes into subclasses (class B is A), and add or override methods in a way which produces a nice mix of old and new behavior. Perl 6 provides multi methods which even allow you to add new methods of the same name, and the old ones won’t be overridden, they’ll just all try to match alongside the new methods. The dispatch is handled by a (usually autogenerated) proto method that dispatches to all eligible candidates.

What does all this have to do with grammars and rules? Well, it turns out that first off, you can derive new grammars from old ones. It works the same as deriving classes. (In fact, under the hood it’s exactly the same mechanism. Grammars are classes with a different metaclass object.) New rules will override old rules just like you’d expect with methods.

S05 has a cute example with parsing of letters, and deriving the grammar to parse formal letters:

    grammar Letter {
         rule text     { <greet> $<body>=<line>+? <close> }
         rule greet    { [Hi|Hey|Yo] $<to>=\S+? ',' }
         rule close    { Later dude ',' $<from>=.+ }
         token line    { \N* \n}
     }

     grammar FormalLetter is Letter {
         rule greet { Dear $<to>=\S+? ',' }
         rule close { Yours sincerely ',' $<from>=.+ }
     }

The derived FormalLetter overrides greet and close, but not line.

But what about all the goodness with multi methods? Could we define some kind of “proto rule” that would allow us to have several rules in a grammar with the same name but different bodies? For example, we might want to parse a language with a rule term, but there are many different terms: strings, numbers… and maybe the numbers can be decimal or binary or octal or hexadecimal…

Perl 6 grammars can contain a proto rule, and then you can define and redefine a rule with the same name as many times as you want. And now we’re back full circle with the / foo | food / alternation from the start of the article. All those rules you write with the same name compile down to one big alternation. Not only that — rules which call other rules, some of them possibly proto rules, all of that will be “flattened” out into one big LTM alternation. In practice that means that all the possible things a term can be are tried out all at once, on equal footing. Neither alternative wins because you happened to define it before the others. An alternative wins because it is the longest.

The strange consistency resides in the fact that in the call-a-method side of things, the most specific method wins, and “most specific” has to with signature narrowness. The better the types in the signature describe the arguments coming in, the more specific the method.

In the parse-with-a-rule side of things, the most specific rule wins, but here “most specific” has to do with parse success. The better the rule can describe what comes next in the text, the more specific the rule.

And that’s strangely consistent, because on the surface methods and rules look like quite different beasts.

We really believe we have something going with this whole principle of deriving a grammar and getting a new language. LTM is right at the center of that because it allows new rules and old to intermix in a fair and predictable way. It’s a kind of meritocracy: rules win not based on whether they’re young or old, but based on whether they are able to parse the text well.

In fact, the Perl 6 compiler itself works this way. It parses your program using a Perl 6 grammar, and that grammar is derivable… whenever you declare a new operator in your program, a new grammar is derived for you. The parsing of your operator is added as a new rule in the new grammar, and the new grammar is given the task of parsing the rest of your program. Your new operator will win against similar but shorter ones, and lose against similar but longer ones.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.