Day 15 – Parsing Globs using Grammars

UPDATE 12/21/16: After some discussions here and at reddit, I realized I’d made a couple important mistakes. Sorry about that. I have corrected GlobAction and a bug in the glob helper subroutine. I also replaced the use of “AST” with “parse tree” below as well as that is more technically correct.

If you are at all familiar with Perl, you are likely to be familiar with regexes. In Perl 6, regexes have been extended to provide features and syntax allowing complex parsing tasks to be simplified and grouped together into grammars. To give a taste of this, let’s consider how to parse file globs. It’s simple, it’s well-defined, and I have some practice writing parsers for the task.

Introducing File Globs

In case you are not familiar with them, a file glob it a sort of pattern match for file names. For example, consider this glob:

*.txt

This will match any file ending in “.txt”. The asterisk (“*”) means “match as many characters as needed to make the match”. This matches “foo.txt” and “blah.blah.txt”, but not “z.html”.

Here’s another glob to consider:

.??*

This will match any file that starts with a period (“.”) followed by 2 or more characters. That means it will match a file named “.zshrc”, but not one named “..”, making it a popular heuristic for finding “hidden” files on a Unix file system.

Basic Grammar

What does a basic glob parsing grammar look like? Let me show a simple one and then I will explain the interesting bits:

grammar Glob {
    token TOP { + }
    token term {
        || 
        || 
    }
    token match {  |  }
    token match-any { '*' }
    token match-one { '?' }
    token char { . }
}

This is just about the simplest possible grammar for a glob that covers the two most popular forms. Let’s consider the parts.

A grammar is just a sort of class. Grammars may contain rules and tokens. Rules and tokens are like methods, but which are defined as a regex and are used to match the input to the grammar. Here we define a grammar named “Glob” to parse our glob.

Every grammar may define a special rule or token named “TOP”. This becomes the default regex to use when matching input with that grammar. Without it, any attempt to parse with the grammar must name the rule or token to begin with.

Aside from that each rule and token defines a regex to use when matching the input. For example, the match-any token here will match only an asterisk (“*”). Similarly, the char token will match any single character.

To be useful as a grammar, rules and tokens refer to one another using angle brackets (“<>”). Here, TOP contains one or more term matches, term matches either a match or a char (using || to try each match in the order given).

Whitespace Handling

Why do I only use token keywords and never make use of the rule? Because of whitespace. A rule inserts some implicit whitespace handling, which is helpful in keeping parsers readable. However, this handling is not helpful when parsing file globs.

Consider the following rule and token:

token { 'a' 'b'+ }
rule  { 'a' 'b'+ }

The token will match “abbbbb” and “ab” but not “a b” or “a bbbb”. The rule, however, will not match “abbbbb” or “ab” but will match “a b” and “a bbbb”. This is the whitespace handling at work. The rule above is implicitly similar to this token:

token { 'a'  'b'+  }

That ws is a predefined token available in all grammars. The built-in definition implies there is a word-break and zero or more spaces, newlines, or other whitespace.

Depending on what works better for you, you may redefine ws if you prefer something different:

token ws { "\n" } # whitespace is always newlines

Or you may just use tokens to avoid the implicit whitespace handling. This latter solution is what I believe makes the most sense for this grammar.

Parsing Input

Now that we have our basic grammar, how do we parse the input? Here’s a simple example:

my $ast = Glob.parse("*.?");
dd $ast;
# OUTPUT is an ugly, long AST
# Match $ast = Match.new(ast => Any, list => (), hash =>
# Map.new((:term([Match.new(ast => Any, list => (), hash =>
# Map.new((:match(Match.new(ast => Any, list => (), hash => Map.new(()), 
# orig => "*.?", to => 1, from => 0)))), orig => "*.?", to => 1, from => 
# 0), Match.new(ast => Any, list => (), hash => Map.new((:char(Match.new(
# ast => Any, list => (), hash => Map.new(()), orig => "*.?", to => 2, 
# from => 1)))), orig => "*.?", to => 2, from => 1), Match.new(ast => Any, 
# list => (), hash => Map.new((:match(Match.new(ast => Any, list => (), 
# hash => Map.new(()), orig => "*.?", to => 3, from => 2)))), orig => 
# "*.?", to => 3, from => 2)]))), orig => "*.?", to => 3, from => 0) 

As we get output, the parse succeeded. Given how accepting our grammar is, everything but an empty string ought to succeed. If a parser fails, the result of parse is an undefined Any.

If you’ve used regexes in Perl 6 at all, this ASTparse tree should look very familiar. Grammars are really nothing more than a little bit of extra structure around regexes. The ASTsparse trees they generate are just regex Match objects that may have additional matches nested within.

To do something useful with this ASTparse tree, you have some options. You could write a tree walker that will go through the ASTparse tree and do something interesting. However, Perl grammars provide a built-in mechanism to do that while in the middle of the parse called an action class. Let’s use an action class instead.

Action Classes

An action class is a regular class that can be paired to a grammar to perform actions based upon each rule or token matched. The methods in the action class will mirror the names of the rules and tokens in the grammar and be called for each.

Here is an action class that is able to translate the parsed glob into a matcher class:

class GlobAction {
    method TOP($/) { make GlobMatcher.new(terms => $».made) }
    method term($/) { make $/.values».made.first }
    method match($/) { make $/.values».made.first }
    method match-any($/) { make '*' }
    method match-one($/) { make '?' }
    method char($/) { make ~$/ }
}

Each method will receive a single argument which is the match object. If the argument list is not given, the implicit match variable $/ is used instead (here we make that explicit). The action class can perform any action in response to the match it wants. Often, this involves rewriting the return value of the match by using the make command. This sets the value returned by the .made method on the match (as you can see is used in term and match).

To put this action class into use, you pass the action class during parsing like so:

my $matcher = Glob.parse("*.txt", :actions(GlobAction.new)).made;
for  -> $f {
    say "$f matches" if $f ~~ $matcher;
}

Now all we need is the magic defined within GlobMatcher, which does the real work of testing whether a given file name matches the parsed glob.

Putting it all together

Here is an implementation of matching. It uses the ACCEPTS method to define a smartmatch operation that can be applied to any defined string.

class GlobMatcher {
    has @.terms;

    method ACCEPTS(Str:D $s) {
        my @backtracks = [ \([ $s.comb ], [ @!terms ]) ];
        BACKTRACK: while @backtracks.pop -> (@letters, @terms) {
            LETTER: while @letters.shift -> $c {
                last LETTER unless @terms;

                my $this-term = @terms.shift;
                my $next-term = @terms[0];

                given $this-term {
                    when '*' {
                        # Continue to the next term if we can
                        if @terms 
                        and $next-term ne '*' | '?' 
                        and $c eq $next-term {
                            push @backtracks, 
                                ([ @letters ], [ '*', |@terms ]);
                            redo;
                        }

                        # Match anything, and try again next round
                        unshift @terms, $this-term;
                    }

                    # We have a letter, so we match!
                    when '?' { }

                    # Only match exactly
                    default {
                        # If not an exact match, we fail; 
                        # try again if we can
                        next BACKTRACK if $c ne $this-term;
                    }
                }
            }

            # If we matched everything, we succeed
            return True unless @terms;

            # Otherwise, try the next backtrack, if any
        }

        # We ran out of back tracks, so we fail
        False;
    }
}

This algorithm is kind of ugly and might be prettier if I’d implemented it via a DFA or translated to regular expressions or something, but it works.

Typing out the code to parse is a bit tedious, so we’ll build ourselves a little helper to make it cleaner looking:

sub glob($glob, :$globlang = Glob) { 
    $globlang.parse($glob, :actions(GlobAction.new)).made; 
}

for  -> $f {
    next unless $f ~~ glob('*.txt');
    say "$f matches";
}

We could even use it in a nice grep:

.grep(glob('*.txt')).say;

Extending the Grammar

Now that we have a basic working grammar, let’s say we want to expand it to support a variant based on ANSI/SO SQL’s LIKE operator. It is very much like a glob, but uses percent (“%”) in place of asterisk (“*”) and underscore (“_”) in place of question mark (“?”). We could implement this as a brand new grammar, but let’s make our new grammar a sub-class instead, so we can avoid duplicating work.

grammar SQLGlob is Glob {
    token match-any { '%' }
    token match-one { '_' }
}

.grep(
    glob('%.txt', :globlang(SQLGlob))
).say;

That was too easy.

Expanding the Grammar

From here, we might also try to expand the grammar to handle character classes or curly-expansions:

[abc]*.{txt,html}

In BSD-like globs, the above will match any file starting with “a” or “b” or “c” that also ends with “.txt” or “.html”. In which case, we might consider using a proto to make it easy to expand the kinds of matches we allow. A full demonstration of this is beyond the scope of this post, but it could look something like this:

class Glob {
    ...

    token term {
        || 
        || 
        || 
    }
    proto token expansion { * }
    proto token match { * }

    ...
}

class BasicGlob is Glob {
    token match:any { '*' }
    token match:one { '?' }
}

class BSDGlob is BasicGlob {
    token expansion:list { '{' + % ',' '}' }
    token match:char-class { '['  ']' }
}

class SQLGlob is Glob {
    token match:any { '%' }
    token match:one { '_' }
}

By using a proto token or proto rule we gain a nice way of handling alternative matches within grammars. This allows us to create variety of grammar variants that don’t repeat themselves. It also gives us the flexibility to allow developers using your grammars to subclass and modify them to suit their custom needs without monkey patching.

A working example of something like this can be found in the source for IO::Glob. (Though, much of that having been written long enough ago that I’m not sure it is the best possible example of what to do, but it is an example.)

Grammars are a really nifty way of grouping a set of regexes together to build a complete parser as a unit. They are easy to read, provide tools that allow you to avoid repeating yourself, and I find them to be an enjoyable tool to play with and use in Perl 6.

Cheers.

8 thoughts on “Day 15 – Parsing Globs using Grammars

  1. Hi,

    You have a typo in a method name in the definition of class GlobActions:

    | method match-on($/) { make ‘?’ }

    Should be ‘match-one’.

    Also, I’m having trouble with parsing using the grammar SQLGlob. I’ve little experience in debugging Perl6, but I’ve come to the conclusion that the method ‘match-any’ and the method ‘TOP’ somehow don’t play nice with each other in the class GlobAction. Here

    | method match-any($/) { make ‘*’ }

    We make an ast object which is made of ‘*’, but here:

    | method TOP($/) {
    | #
    | }

    TOP receives a tree which contains the term ‘%’ for ‘match-any’.

    Anyway, nice series of posts! They really help me to better understand Perl6 and also get some confindence in my Perl6 skills by doing the same code exercises and looking at best practices for designing programs. :)

      1. I see there’s a problem with your example. The glob function uses the hardcoded pattern “*.txt” instead of the argument “$glob”. It always works because it never uses the pattern “%.txt”.

        If the argument “$glob” is used then this line will create trouble :

        > method TOP($/) { make GlobMatcher.new(terms => $.map(~*)) }

        When $/ or $ are stringified the result is the original text. But if you look deeper in the structure (for example $.made) you see that “%” was made into “*”. The problem is that “~*” doesn’t use the made value, but the original.

        I couldn’t think of an elegant solution to the problem, though.

      2. Thank you! I have issued a correction. I have also updated the glot.io snippet:

        https://glot.io/snippets/elgb9tbafm

        Everything should work now. I was definitely calling .made on the wrong thing and I knew better, but that bug in glob() prevented me from finding it. So, thanks for finding that for me.

        Next time, I will be sure to get someone to proofread for me. I was in such a rush this time, that I didn’t. Clearly, that was a mistake.

  2. I’m not seeing any items with angle-brackets in the initial code examples – looks like the markup is ignoring them as html or such :-P Apparently, it was readable for others but recent edits have lost the content?

    1. Sorry about that. You are correct. Something went dreadfully wrong. I don’t know if it happened with the original or the edit I just did, but it is fixed now. Thanks!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s