Day 14 – The Little Match Girl: Building and Testing Big Grammars in Perl 6

Perl 6 Grammars are great, but what is it like working with them in a project? Here is a bittersweet story of my experience before Christmas, and after Christmas. You can find the repository here. I do not come from a computer science background, so perhaps it will seem humble, but here are my pitfalls and triumphs as I learned Perl 6 Grammars.

The First Match

Like the Little Match Girl, our story takes place before Christmas. The Little Match Girl was tasked with selling a bundle of matchsticks on Christmas Eve (New Year's Eve, actually. I did go back and read the story. Christmas just fits better with Perl 6), while I was tasked with extracting annotations from Modelica models to render as vector graphics. Now, Modelica is a wonderful object-oriented modeling language, and I am going to completely gloss over it, except to mention that it has a very nice specification document (pdf) that contains a Concrete Syntax section in the Appendix. Perusing this section, I realized that the “syntactic meta symbols” and “lexical units” looked suspiciously like the Perl 6 Grammars that I had recently read a blog post about and had been anxious to try out.

Example from Modelica Concrete Syntax:

class-definition :
[ encapsulated ] class-prefixes
class-specifier

Example of Perl 6 rule:

rule class_definition {
  [<|w>'encapsulated'<|w>]? <class_prefixes>
  <class_specifier>
}

It was like the Little Match Girl striking the first match, and seeing for the first time a wonderful world beyond her stark reality. A warm little stove. And then it went out.

It was so close that I plopped it into a text editor and replaced the non-Perl 6 bits with some Perl 6 bits to see if it would run. It didn’t run. I hacked away at it; I pointed TOP at different bits to tackle smaller chunks. There were whitespace symbols everywhere, regexes, tokens, rules. I was able to parse some parts; others mysteriously didn’t work. Looking back, it must have been awful. In the meantime, we hacked together a traditional regular expression to extract the annotations, and I placed my Grammar on the shelf.

The Second Match

Not long after, the Grammar::Profiler and Grammar::Debugger were published, and I was inspired to give it another go. I was granted great insights into where my rules were behaving unexpectedly. I was able to drill down through the grammar deeper than before. The second match had been lit, and I was presented with a feast. And then it went out.

In the debugger, I dove into an abyss of backtracking. The profiler ran forever as it dove down into the morass, again and again. I was able to get much farther, but eventually ran into a wall. Success seemed so close, but there were too many missing pieces, both in my own experience and in the documentation, for me to get past the wall.

The Third Match

Time passed, and Christmas came. I had a new position, with time for personal projects. I had the ever improving Grammar documentation to guide me. I had read the book Working Effectively With Legacy Code. It was enough to warrant charging the hill once more.

Object orientation

This was the biggest breakthrough for me. When I understood from the documentation that tokens, rules and regexes are funny-looking methods, I suddenly had all of the pieces. When I got home, I immediately checked if I could override TOP, and I checked if I could put the Grammar methods into a role. Both worked delightfully, and I was in business. Rather than having one monolithic, all-or-nothing Grammar, I could break it up into chunks. This greatly improved the organization and testability of the code.

One particularly bodacious thing was that I was able to neatly split the grammar up into roles corresponding to those found in the Modelica specification.
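
As a rough illustration of the idea, here is a minimal sketch of a grammar assembled from roles. The role names, the rule bodies and the MiniModelica grammar are invented for this example and are far simpler than anything in the repository; only the IDENT token name is borrowed from the real grammar.

```raku
# Hypothetical roles, loosely echoing section titles in the specification.
role LexicalConventions {
    token IDENT { <[A..Za-z_]> <[A..Za-z0-9_]>* }
}

role ClassDefinition {
    # <IDENT> is looked up when the rule runs, so it can live in another role.
    rule class_definition { <|w>'class'<|w> <IDENT> <|w>'end'<|w> <IDENT> ';' }
}

# The grammar itself only composes the roles and supplies TOP.
grammar MiniModelica does LexicalConventions does ClassDefinition {
    rule TOP { ^ <class_definition> $ }
}

say so MiniModelica.parse('class Foo end Foo;');   # True
```

Each role can then be composed into a small test grammar of its own, which is what makes the layered testing described next practical.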


Unit testing: one layer at a time

Object orientation opened up a sensible scheme of unit testing, and saved me from the nonsense of ad hoc testing by passing bits of Modelica into the Grammar. You can inherit and override grammars as you would any other class. This allows you to test each rule or token separately, splitting your grammar up into bite-sized layers. You just override TOP with the rule or token to test, and override any dependencies with placeholder methods.

Definition of expression from Expressions.pm6:

rule expression {
  [
  <|w>'if'<|w> <expression> <|w>'then'<|w> <expression> [
  <|w>'elseif'<|w> <expression> <|w>'then'<|w> <expression>
  ]*
  <|w>'else'<|w> <expression>
  ]
  ||
  <simple_expression>
}

Here we see that expression depends on itself and simple_expression. In order to test, we replace the usual simple_expression rule with a placeholder. In this case it just matches the string 'simple_expression'.

Overridden test Grammar from Expressions.t:

grammar TestExpression is Grammar::Modelica {
  rule TOP {^ <expression> $}
  rule simple_expression { 'simple_expression' }
}
ok TestExpression.parse('simple_expression');

Regression testing is also much more pleasant when you can isolate the problematic portion of code, and create an overridden Grammar that targets it specifically.
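
For readers without the repository at hand, the same technique can be tried on a self-contained toy. Everything in this sketch (the Tiny grammar, its rules and the test strings) is made up for illustration; it only shows the shape of the approach: point TOP at the rule under test and stub out its dependency.

```raku
use Test;

plan 3;

# A stand-in for the real grammar; these rule bodies are invented.
grammar Tiny {
    rule  TOP        { ^ <statement>+ $ }
    rule  statement  { <assignment> ';' }
    rule  assignment { <varname> '=' <number> }
    token varname    { <[a..z]>+ }
    token number     { \d+ }
}

# Test `statement` in isolation: TOP targets it directly and
# `assignment` is replaced by a placeholder that matches a fixed string.
grammar TestStatement is Tiny {
    rule  TOP        { ^ <statement> $ }
    token assignment { 'assignment' }
}

ok  Tiny.parse('x = 1; y = 2;'),         'full grammar still works';
ok  TestStatement.parse('assignment ;'), 'placeholder accepted';
nok TestStatement.parse('x = 1 ;'),      'real assignment is stubbed out';
```

Because the placeholder is a fixed string, a failure in this test can only come from statement itself, not from anything beneath it.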

<|w> is your friend

In my first efforts, trying to get things like Modelica reserved words working properly was one of the “banes of my existence”. That changed after I found the word boundary matching token <|w>. When I slap one on each side, it works, whether next to white space or a punctuation character.

From ComponentClause.pm6:

rule type_prefix {
  [<|w>[ 'flow' || 'stream' ]<|w>]?
  [<|w>[ 'discrete' || 'parameter' || 'constant' ]<|w>]?
  [<|w>[ 'input' || 'output' ]<|w>]?
}
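
To see what the boundaries buy you, here is a small sketch with two toy grammars. The grammar and token names are hypothetical and exist only for this comparison. Without <|w>, a reserved word happily matches the front of a longer identifier.

```raku
# Hypothetical grammars: the only difference is the <|w> guards.
grammar WithoutBoundary {
    token TOP        { ^ <keyword> \s* <identifier> $ }
    token keyword    { 'input' }
    token identifier { \w+ }
}

grammar WithBoundary is WithoutBoundary {
    token keyword { <|w> 'input' <|w> }
}

say so WithoutBoundary.parse('inputSignal');   # True: 'input' is carved out of the identifier
say so WithBoundary.parse('inputSignal');      # False: no word boundary after 'input'
say so WithBoundary.parse('input Signal');     # True
```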

Token, rule and regex

There is good documentation for these now, but I, also, will briefly contribute a description of my experience. I found that rule and its :sigspace magic was the best choice most of the time. token was helpful where tight control of format was needed.

regex is for backtracking. For Modelica, I have found it to be unhelpful, likely because it was designed to be a single-pass language. token and rule work in the places where I thought I needed regex. All of my unit tests still passed after I replaced every regex, and the Grammar succeeded on four more Modelica Standard Library files. Only use regex when you need it.
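
The difference is easy to see on a toy input. In this sketch (the grammar and its two method names are invented for the comparison), the same pattern is declared once as a regex and once as a token:

```raku
grammar Demo {
    # regex backtracks: \w+ first grabs everything, then gives back 'end'
    regex with_regex { ^ \w+ 'end' $ }
    # token ratchets: \w+ grabs everything and never gives anything back
    token with_token { ^ \w+ 'end' $ }
}

say so Demo.parse('blockend', :rule<with_regex>);   # True
say so Demo.parse('blockend', :rule<with_token>);   # False
```

The token version fails fast instead of retrying shorter matches, which is the behaviour that keeps a large grammar from wandering off into the backtracking abyss.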

End With the Beginning

Another bit that was frustrating to me was class definition syntax. Modelica uses the form some_identifier ... end some_identifier for its classes. How to ensure that the same identifier was used at the beginning and end was troublesome for me. Fortunately, Perl 6 allows you to use a capture inside the Grammar methods. The (<IDENT>) capture below populates $0, which can then be used to ensure that our long_class_specifier ends with the proper identifier.

rule long_class_specifier {
  [(<IDENT>) <string_comment> <composition> <|w>'end'<|w> $0 ]
  ||
  [<|w>'extends'<|w> (<IDENT>) <class_modification>? <string_comment> <composition> <|w>'end'<|w> $0 ]
}
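
Stripped of everything Modelica-specific, the trick looks like this. The Block grammar, its body token and the test strings are placeholders made up for the sketch; only the (<IDENT>) ... $0 pattern mirrors the rule above.

```raku
grammar Block {
    # (<IDENT>) fills $0; the later $0 must match the very same text.
    token TOP   { ^ (<IDENT>) \s+ <body> \s+ 'end' \s+ $0 $ }
    token IDENT { <[A..Za-z_]> \w* }
    token body  { 'body' }   # placeholder for the real contents
}

say so Block.parse('Tank body end Tank');   # True
say so Block.parse('Tank body end Pump');   # False
```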

Integration Testing: lighting all the matches at once

After my unit tests were all passing, I felt a little trepidation. Sure it can parse my contrived test cases, but how will it do with real Modelica? With trembling hand, I fed it some of Michael Tiller’s example code from his Modelica e-book. It worked! No fiddling around with subtle things that I overlooked, no funny parsing bugs or eternal backtracking. Just success.

Now, stars do occasionally align. Miracles do happen. Sufficiently clever unit tests can be remarkably good at preventing bugs. Still, I have been around the block enough times to want to verify. Recalling a presentation by Damian Conway, I decided to run it against the entire Modelica Standard Library. Not exactly all of CPAN, but 305 files is better than the mere two example models I had tried so far.

I wrote the script, pointed it at the Modelica directory, and fired it up. It churned through the library and wheezed to a stop. 150 failures. Now that is familiar territory. After several iterations, I am down to 66 failures when I run it on my parse_modelica_library branch. I just go through a file that is failing, isolate the code that is having issues, and write a regression test for it.

So, in the end the Little Match Girl lit all the rest of her bundle. Then, she died. Don’t die, but you can light all 305 matches with me, in parallel, with examples/parseThemAll.p6:


use v6;
use Test;
use lib '../lib';
use Grammar::Modelica;

plan 305;

# Parse one Modelica file and record the result as a test.
sub light($file) {
  # slurp reads the whole file and closes the handle afterwards
  my $contents = $file.IO.slurp;

  my $match = Grammar::Modelica.parse($contents);
  say $file;
  ok $match;
}

sub MAIN($modelica-dir) {
    say "directory: $modelica-dir";
    die "Can't find directory" if ! $modelica-dir.IO.d;

    # modified from the lovely docs
    my @stack = $modelica-dir.IO;
    my @files;
    while @stack {
      for @stack.pop.dir -> $path {
        # collect the .mo files and descend into subdirectories
        @files.push: $path if $path.f && $path.extension eq 'mo';
        @stack.push: $path if $path.d;
      }
    }
    # faster to do in parallel
    @files.race.map({light($_)});
}

I will see how many more I can persuade to pass before Christmas. Then perhaps I will figure out how to write some rules to build a QAST.

Merry Christmas!