Exploratory parsing with Perl 6

by

There have already been some delectable little grammar goodies this Advent Calendar. We hope to add to that today, with a discussion of the concept of “Exploratory Parsing” and Perl 6.

There’s no question that many modern programming languages having embraced regular expressions as a central part of the language or via a standard library. Some languages have introduced full-blown-parsing facilities as core features of their design. For example, many functional languages give us powerful parser combinators. Perl 6, as the astute advent reader knows, gives us “Regexes and Rules”. It turns out that Perl 6 Regexes are an implementation of parsing expression grammars, or PEGs for short, originally formulated by Bryan Ford.

I was inspired by a recent Ward Cunnigham post where he uses PEGs to explore seemingly unstructured text. Ward used an implementation of a PEG parser generator written in C by Ian Piumarta.

We however have a powerful PEG-like parser built right into the core of the Perl 6 language, so what better way to play with exploratory parsing than getting cozy with your favourite Perl 6 compiler, pouring yourself another glass of eggnog, and divining meaning from random text out on the interwebs!

Our first example is taken directly from Ward’s “Exploratory Parsing” post. If you haven’t at least perused it yet, may I encourage you to do so.

The first thing we do is retrieve some data to explore:

wget http://introcs.cs.princeton.edu/java/data/world192.txt

We then translate Ward's first example into Perl 6:

use v6;

# Inspired by
# http://dev.aboutus.org/2011/07/03/getting-started-exploratory-parsing.html
# but using Perl 6 regexes and tokens

grammar ExploratoryParsing {
    token TOP {
        <fact>+
    }

    token fact { <key> <value> | <other_char> }

    token key { <whitespace> <word>+ ':' }

    token value { [<!key> .]+ }

    token word { <alpha>+ ' '* }

    token whitespace { '\n' ' '* }

    token other_char { . }

}

I don't know about you, but I love how declarative this is. I encourage you to compare the two implementations. The translation is almost trivial, no?

We introduce a MAIN method which slurps up a data file, uses our grammar definition to parse the text, and tells us how many "facts" we've found.

sub MAIN() {
    my $text = slurp('world192.txt');
    say "Read world factbook. Parsing...";

    my $match = ExploratoryParsing.parse($text);
    say "Found ", +$match, " facts.";
}

Running this with the Rakudo Perl 6 compiler we get:

$ perl6 exp-parsing.pl
Read world factbook. Parsing...
Found 16814 facts.

If we use the awesomely awesome Grammar::Tracer or Grammar::Debugger already unwrapped for us earlier this month, we can even step further into this and explore matches.

The remaining embellishments from Ward's original post are left as an exercise for the reader.

You can see how powerful this idea is. We start with some semi-structured text and use the power of Perl 6 Regexes and Rules to start pulling things apart, stirring the precipitate meaning and exploring pattern and trends we see in the data. This kind of work is trivial in a language like Perl 6 with powerful parsing support. You can even imagine jumping into a Perl 6 REPL and doing this interactively.

Hopefully this has whet your appetite for playing with Perl 6 regexes. Happy parsing.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 37 other followers

%d bloggers like this: