Posts Tagged ‘grammar’

Day 15 – Bioinformatics and the joy of Perl 6

December 15, 2014

On the 15th day of Christmas Perl6 said let there be life-sciences, and so there was.

A lot of coverage and testament is made to Perl’s role in developing dynamic content for the early web. As well as its slow decline. However, there is an equally important story that goes alongside the tale of Perl as the language of CGI. That is the story of Perl as the scientific glue that saved the human genome project, there is even a published journal article in Perl’s honour.

It’s not an accident that Perl did well and continues to do well in the field of Bioinformatics. Few languages at the time provided first class semantics that matched the problem domain such as: regular expression literals, good string handling, trivial data structure creation, sensible automatic type conversion and XS for native code wrapping when speed was important. That Perl replaced a lot of sed+awk use cases for one liners is also a welcome feature for a busy bioinformatician. Perl persists today as a de facto scripting language due to the large community effort known as BioPerl. For those not in the know the API of BioPerl inspired a whole host of work in most other languages such as Biopython, BioJava, BioRuby etc. To date none are quite as feature complete as BioPerl though each has its unique additional features. Biopython is getting there from a lot of work done through GSoC action in recent years. I also use ete2 from Python on a regular basis and recommend people check that out for phylogenetics work. However, this post is about the future and I think Perl 6 has a lot to offer everyone working in bioinformatics.

XS the Perl5 system of integrating native code into a Perl module has done fantastically well in providing scientists with the ability to integrate high performance C code with the simplicity of Perl’s syntax. But XS in the Perl6 world has been replaced with NativeCall. This library makes it incredibly simple to integrate C code with a Perl6 class or subroutine. I’m not going to cover this here though, the repo has plenty of docs and examples and two previous advent articles demonstrate use cases.

Instead everything on show here are core features available from writing ‘perl6′ at a command prompt. You don’t need any external dependencies so any system with vanilla Rakudo Perl6 can do everything you’re about to see.

The Grammar of life

Hopefully most people have at least heard about DNA, RNA and perhaps even proteins. All of these are what are referred to in the biz as biopolymers, or translated into programming parlance “strings of life”. Essentially each molecule is a big chain of smaller building blocks you can denote in a textual form with a letter. In DNA we have the alphabet GCTA , in RNA we never see a T instead we see a U in its place. Proteins have their own alphabet that is unrelated to DNA and RNA between 21 and 23 characters long, the last two are very quirky and relatively rare.

So in this section I’m going to outline the hip way to parse sequence files in Perl6. This is how sequencing data usually comes into a bioinformatician. A giant text file! I’ll be using our own class to hold the sequence and a well defined grammar for the well loved (not) FASTA format. So to start here is the Grammar as if it fell from heaven as a shooting star this wintery night:

grammar FASTA::Grammar {
    token TOP { <record>+ }

    token record { ">"<id><comment>?"\n"<sequence> }

    token id { <-[\ \n]>+ }

    token comment { " "<-[\n]>+ }

    proto rule sequence {*}
          rule sequence:sym<dna> { <[ACGTRYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<rna> { <[ACGURYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<aa> { <[A..Z\*\-\n]>+ }
}

Cool we have a super formal grammar defined for FASTA and it even looks like we are doing something special for different types of sequence! I guess I had better explain a little, keep in mind that Grammar is just a huge extension to regular expressions which you are hopefully familiar with.

First up we have token and rule. The difference is that token is like a hungry hippo, it consumes the input sequence and is quite greedy. So if a token matches at all it consumes the string and then lets parsing continue. If there is a choice of tokens the longest wins, which can be thought of as the winner in a game of hungry hippos!
A rule is a bit more tricksome, it has backtracking and can regurgitate and give up. Specifically we have a set of rules called sequence with the more abstract one defined with the keyword proto. The reason we have three rules is for each type of biopolymer, they are ordered by the most to least specific. Unfortunately DNA and RNA can both look like protein. So our Grammar is limited in its ability to detect the difference, but so would a human that’s how rubbish this major data format is!
With rule unlike token if the start of a sequence matches DNA but the rest of it looks like protein we back out of matching DNA because the grammar is smart enough to realise it can get a longer match from the other rule, but has to spew up the letters already consumed first. If we had used token we might have had a really short DNA token then a protein one etc. This might all sound a bit complicated but I’ve made a trace using Grammar::Debugger of what happens matching a test.fa file in this Gist.
If you do look at the trace you will see the final important fact about Grammars. We need a TOP token or rule, this is where we start the thought process and parsing of data. So a FASTA file is made of more than one record, that’s our TOP. A record is made of an id, a comment and a sequence and so the logic flows. The Perl6 regular expression syntax is way out of scope but Masak has a nice intro.

Now lets get this Grammar rolling along doing something. If we want to use a sequence in our code we need a basic type to represent it. Below is probably the simplest thing you would want:

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";
}

It’s not the most inspired or interesting class in the world but it will do!

Now how do we hook up our Seq class to our FASTA Grammar to produce some nice objects from our FASTA file? The answer is we need to take action! But not the political kind, just create an Actions class for the Grammar to use when parsing. By default we could call .parse on our grammar above and we would get all the text chunked up into nicely named labels as a giant tree of Match variables. Some of the time that’s what you want. Action classes have methods that match the names of tokens and rules in a grammar. Each method is called at the corresponding point in the Grammar’s parse tree and given the chance to make something more interesting, replacing all the crummy Match variables in that part of the output. So what do we want:

class FASTA::Actions {
    method TOP($/) {
        make $<record>>>.ast;
    }
    method record($/) {
        make Seq.new(
            id => ~$<id>,
            comment => ($<comment>) ?? (~$<comment>).trim !! '',
            sequence => $<sequence>.subst("\n","", :g)
        );
    }
}

So the logic above is whenever you parse out the whole of a record we want to make a new Seq object and shove it in the resulting match tree for later. You can see that we do some trimming and cleaning up of white space in the action before creating a clean Seq object. Finally the TOP action basically just says that the final product of .parse should be all the record matches, which we just transformed into Seq objects, so it’s literally a list of Seq! Nice! Lets give it a whirl on that test file I mentioned:

Input

>hello
GCTATATAAGC
>world prot
TATAKEKEKELKL

Code

my $grammar = FASTA::Grammar.new;
for $grammar.parsefile("test.fa", actions => FASTA::Actions).ast -> $seq {
    say $seq.id;
}

Output

hello
world

Ultimately we didn’t need the added complexity of matching a sequence as DNA/RNA/protein but it was kind of cool, and at a later date we might want to do something different and have fancier Seq objects. I’ll leave exploration of this as a task to the interested reader :P

A Bag full of sequence

So I’m concerned that our Seq class is kind of crap still. It might as well just be a hash of values at this point. Sure we get a type which is nice, but it just doesn’t have anything cool about it. I think a nice addition is if we made our class work a little better with the rest of the Perl6 family of operations, letting more of the languages magic get at our data. The results of this simple change might actually impress you!

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";

    multi method Numeric (Seq:D: --> Numeric) {
        $.sequence.codes;
    }

    multi method Bag (Seq:D: --> Bag) {
        bag $.sequence.comb;
    }

    multi method Str (Seq:D: --> Str) {
        $.sequence;
    }

    multi method gist (Seq:D: --> Str) {
        ">$.id $.comment\n$.sequence";
    }
}

Above I’ve introduced some snazzy new methods specific to our class. Anyone familiar with toString from Java should get what both the Str and gist methods are upto. They are well observed methods in Perl6 that deal with interpreting an object as a string. But it’s probably not clear at all why you would want two of them? Well simply put the Str method is there for the pure string interpretation of our object. If I use operators that expect Str this would be called to automagic our Seq into a string. The gist is kind of a neat one, if we use the higher level say function this calls the .gist on our object and prints that since it’s already assumed you want something a bit more human readable. Even more neat is if you are using the command line REPL the .gist is immediately called on an object that resulted from an expression. So when we create a new Seq in the Perl6 REPL or just give the variable name on its own we get FASTA ready to copy and paste! Take a look at the REPL dump below and see if you can work out where we are implicitly using $seq as a Str rather than Seq:

> my $seq = Seq.new(id=>'hello',comment=>'world',sequence=>'GCTACCTATAAAGC');
>hello world
GCTACCTATAAAGC
> say $seq
>hello world
GCTACCTATAAAGC
> "junk" ~ $seq
junkGCTACCTATAAAGC
> my $big-seq = Seq.new(id=>'whole',comment=>'new world',sequence=> $seq x 3);
>whole new world
GCTACCTATAAAGCGCTACCTATAAAGCGCTACCTATAAAGC
> print $seq
GCTACCTATAAAGC>

But wait! There’s more. For the low low price of just two more methods we get numeric wrangling and statistical modelling! The Numeric method does what it says on the tin, and we defined this as the length of the sequence which feels reasonable as a numerical representation.

> $seq + $big-seq
56
> ($seq + $big-seq) / $seq
4

Bag creates a Bag type object from our sequence. Basically the Bag we create is an immutable counting dictionary of all the letters in the sequence. So we now have the first step in some frequency analysis, letters and a count of how often they occurred. But what is even more awesome about Bags in Perl6 are the pick and roll methods. These allow you to sample the keys of the Bag based on their frequency! So if you wanted to create some dummy sequences to do statistical tests with it’s easy as pie! Roll keeps the letters in the bag as you sample whereas pick removes them and then alters the distribution of the next pick. Below I give an example where we .pick strings that are as long the input sequence but will be randomly different but with the same distribution of characters. This sort of dummy sequence is useful for calculating the statistical significance of our sequence.

> $seq.Bag
bag(G(2), C(4), T(3), A(5))
> $seq.Bag.pick($seq).join
CAGATCGTAAATCC
> $seq.Bag.pick($seq).join
GAATCCCGTACATA

Look at that we even use $seq as the length to pick to! Perhaps not the most obvious or readable thing in the world, but for playing on the REPL it’s a nice feature and not utterly unintuitive to people used to context in Perl.

A Range of possibilities

A common task in bioinformatics is dealing with what is commonly referred to as “annotation” type data. This can be anything from specifying the location of a gene in a whole genome, to specifying a patterned site within a protein sequence. In BioPerl there is a specific and special Bio::Range class for dealing with the sort of data that represents a segment on a number line with a start and an end position. In Perl6 this is built right into the language with most of the features any bioinformatician would be happy with. Plus they can be wrangled to Sets and use those operators. So creating a Range couldn’t be easier and more obvious, you’ve all done it before!

> my $range = 1..10;
> $range.WHAT.say;
(Range)

This is the exact same syntax and object used if you wanted to create a for loop over a given range both in Perl5 and Perl6.

To show what we can do with this basic type in Perl6 I’ve taken some data about the gene p53 which is a well known one for the development of many cancers. It’s all real data and you can see a coloured block diagram of the protien annotation data here. You think of each block as being single Range object you get some idea of what I’m harping on about. So lets start out with some DNA and some protein features in code but remember we could just makes these from a file just as easily:

#Genomic chromosome coordinates of the p53 gene in human
my $p53_gene = 7_565_097 .. 7_590_863;

#A single nucleotide polymorphism AKA a point mutation
my $snp = 7_575_996;

#Some annotated protein domains
my %domains = p53-TF => 97..287, p53-Tetramer => 319..357;

#Predicted interaction site
my $binding_site = 322..355;

#A known protein Phosphorylation site
my $phospho_site = 99;

So good so far, we have some one off integer values some ranges and a cool hash of ranges! The bonus Perl6 features come from operators that accept a Range as their operands. For example most importantly set operations work as you might expect:

#Work out if the SNP is within the gene
say "MUTATION!!!" if $p53_gene (cont) $snp;

#Find out which domains contain the phosphorylation site
my %phosphorylated = %domains.grep({ .value (cont) $phospho_site })

#Find out which domains are in the binding site and what percentage is binding
my %binding = %domains.map({
                            $_.key => 100 / $_.value.elems * ($_.value (&) $binding_site)
                      })
                      .grep(*.value > 0.0);

Hopefully a lot of the above is almost self explanatory once I introduce the set operators for contains (cont) or ∋ and intersection (&) or ∩. I chose the ASCII variants, mostly because I don’t have a ∩ key on my keyboard and I wouldn’t want my code to be trashed if someone used a non Unicode editor! The contains returns a Bool result of if the right operand contains the left and the intersection operator returns the set of numbers that intersect both of the Ranges. The final hash wrangling is quite nice, it keeps the keys of the %domains hash but replaces the values with the percentage coverage with our binding site and only gives a result if there was any coverage between the two Ranges. Hopefully almost any bioinformatician can see how nice these sort of primitive types and operations are to have in the core of a language.

A Curio

If any of the above has you at least a little curious if not excited to see what Perl6 can do within Bioinformatics then you can check out more at either Chris Fields official BioPerl6 repo or my own playpen. I’ll leave you with one final code nugget to chew on. I found this by accident just writing a method to wrangle DNA sequence to RNA sequence in Perl6. Something you might like to add to our Seq class. This is approximately the session I had at the REPL:

> my $dna = ‘GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA’;
> my $rna = $dna ~~ tr/GCAT/GCAU/;
> $rna.WHAT.say
(StrDistance)

WAT?! So I was originally expecting to just have the translated string. As an aside you should use the method $dna.trans(‘T’=>’U’) if you do want that. Instead I got back this odd object StrDistance… What could it be I thought, so I did some basic inspection to take a look:

> $rna.perl
StrDistance.new(before => "GATG…TAAAA", after => "GAUG…UAAAA")
> $rna.^methods
Bool Numeric Int <anon> <anon>
> $rna.^attributes
Str $!before Str $!after Int $!distance

So in the .perl we see that .before and .after are the two strings the DNA and the newly translated RNA. But there is also this other guff like .Numeric and .Int. Turns out its something awesome!

> say +$rna;
13

Lucky number 13… or the Hamming distance between the two strings! Anyone from the world of Bioinformatics will instantly see this as quite a handy feature. Perl6 wasn’t created or invented for Bioinformatics, but it has so many accidentally great features laying around relatively uncommented on.

Sorry for the blowhard post but I thought it might be nice to put these ideas all in one place. Have a happy Christmas and New Year!

Day 18 – A Grammar with duplicate checking

December 18, 2013

Today’s example constructs a grammar for tracking playing cards in a single deal. We’ll say it’s poker with one or more players and that each player is being dealt a hand that contains exactly five cards.

There is, however, the need to detect duplicate cards. We’ll need some way of tracking cards both within each card-hand and between hands.

A simple Card Game Grammar

To start with, here’s the basic grammar (no duplicate checks yet):

grammar CardGame {

    rule TOP { ^ <deal> $ }

    rule deal {
        <hand>+ % ';'
    }

    rule hand { [ <card> ]**5 }
    token card {<face><suit>}

    proto token suit {*}
    token suit:sym<♥>  {<sym>}
    token suit:sym<♦>  {<sym>}
    token suit:sym<♣>  {<sym>}
    token suit:sym<♠>  {<sym>}

    token face {:i <[2..9]> | 10 | j | q | k | a }
}

say CardGame.parse("2♥ 5♥ 7♦ 8♣ 9♠");
say CardGame.parse("2♥ a♥ 7♦ 8♣ j♥");

The  top-level rule consists of a deal. The deal consists of one or more hands separated by ';'. Each hand consists of 5 playing cards.

Each card is represented by a face, one of: a (ace), j (jack), q (queen) or k (king), or 2 – 10. This is followed by a suite: ♥ (hearts) ♦ (diamonds) ♣ (clubs) or ♠ (spades).

[We could have used the playing cards characters, newly introduced in Unicode 6.0, but these aren’t widely supported yet].

As expected, the first cut of the grammar cheerily parses any hand:

say CardGame.parse("a♥ a♥ 7♦ 8♣ j♥");
# one hand, duplicate a♥
say CardGame.parse("a♥ 7♥ 7♦ 8♣ j♥; 10♥ j♥ q♥ k♥ a♥");
# two hands, duplicate j♥

Detecting Duplicates

We start by adding a Perl 6 variable declaration to the grammar. This will be used to track cards:

rule deal {
    :my %*PLAYED = ();
    <hand>+ % ';'
}

This declares %*PLAYED [1]. The '%*' twigil  indicates that it’s a hash '%' and that’s dynamically scoped '*'.

Dynamic scoping is not only for subroutine and method calls [1]. It also works seamlessly with grammar rules, tokens and actions.

Being dynamically scoped, %*PLAYED is available to callees of the deal rule; the hand token, and its callee, the card token.

It’s also available to any actions, that then get called. So we can track and report on duplicates by creating an action class with a method for the card token:

class CardGame::Actions {
    method card($/) {
       my $card = $/.lc;
       say "Hey, there's an extra $card"
           if %*PLAYED{$card}++;
   }
}

my $a = CardGame::Actions.new;
say CardGame.parse("a♥ a♥ 7♦ 8♣ j♥", :actions($a));
# "Hey there's an extra a♥"
say CardGame.parse("a♥ 7♥ 7♦ 8♣ j♥; 10♥ j♥ q♥ k♥ a♦",
                   :actions($a));
# "Hey there's an extra j♥"

And that might be all that’s needed  for tracking and reporting on duplicates. There’s a pretty good separation between the declarative grammar and procedural actions, with just one dynamically scoped hash variable.

Disallowing Duplicates

But I had a situation where I wanted duplicate checking to be a parsing constraint. Parsing needed to fail when duplicates were encountered.

I achieved this by moving the duplicate check grammar side:

token card {<face><suit>
    <?{
        # only allow each card to appear once
        my $card = $/.lc;
        say "Hey, there's an extra $card"
            if %*PLAYED{$card};

        ! %*PLAYED{$card}++;
     }>
}

This has introduced a code assertion between the <?{ and }>  [2]. The rule succeeds when the code evaluates to a True value. The card token thus fails when the same card is detected more than once in a single deal.

say CardGame.parse("2♥ 7♥ 2♦ 3♣ 3♦");
# legitimate, parses

say CardGame.parse("a♥ a♥ 7♦ 8♣ j♥");
# fails with message: Hey, there's an extra a♥

say CardGame.parse("a♥ 7♥ 7♦ 8♣ j♥; 10♥ j♥ q♥ k♥ a♦");
# fails with message: Hey, there's an extra j♥

Discussion/Conclusion

One thing to be careful of with this type of technique is back-tracking (trying of alternatives). If, for instance, the grammar was modified in such a way that the card token could be called more than once for single input card, then we might erroneously report a duplicate. It’s still possible to track, but becomes a bit more involved. The simplest answer is to keep the grammars as simple as possible and minimize back-tracking.

If in any doubt,  please consider using one or more of the Grammar::Debugger, Grammar::Tracer [3] or the debugger [4] modules [5] to track what’s going on. You can also insert debugging code into tokens or rules as closures: { say "here" } [6].

That the exercise for today; a simple Perl 6 Grammar to parse playing-cards in a card-game, but with duplicate checks using either actions or code assertions.

Day 06 – Parsing and generating recurring dates

December 6, 2013

There are a lot of events that are scheduled on particular days of the week each month, for example the regular Windows Patch Day on the second Tuesday of each month, or in Perl 6 land that Rakudo Perl 6 compiler release, which is scheduled for two days after the Parrot release day, which again is scheduled for the third Tuesday of the month.

So let's write something that calculates those dates.

The specification format I have chosen looks like 3rd tue + 2 for the Rakudo release date, that is, two days after the 3rd Tuesday of each month (note that this isn't always the same as the 3rd Thursday).

Parsing it isn't hard with a simple grammar:

grammar DateSpec::Grammar {
    rule TOP {
        [<count><.quant>?]?
        <day-of-week>
        [<sign>? <offset=count>]?
    }
    token count { \d+ }
    token quant { st | nd | rd | th }
    token day-of-week { :i
        [ mon | tue | wed | thu | fri | sat | sun ]
    }
    token sign { '+' | '-' }
}

As you can see, everything except the day of the week is optional, so sun would simply be the first Sunday of the month, and 2 sun - 1 the Saturday before the second Sunday of the month.

Now it's time to actually turn this specification into a data structure that does something useful. And for that, a class wouldn't be a bad choice:

my %dow = (mon => 1, tue => 2, wed => 3, thu => 4,
        fri => 5, sat => 6, sun => 7);

class DateSpec {
    has $.day-of-week;
    has $.count;
    has $.offset;

    multi method new(Str $s) {
        my $m = DateSpec::Grammar.parse($s);
        die "Invalid date specification '$s'\n" unless $m;
        self.bless(
            :day-of-week(%dow{lc $m<day-of-week>}),
            :count($m<count> ?? +$m<count>[0] !! 1),
            :offset( ($m<sign> eq '-' ?? -1 !! 1)
                    * ($m<offset> ?? +$m<offset> !! 0)),
        );
    }

We only need three pieces of data from those date specification strings: the day of the week, whether the 1st, 2nd, 3rd. etc is wanted (here named $.count), and the offset. Extracting them is a wee bit fiddly, mostly because so many pieces of the grammar are optional, and because the grammar allows a space between the sign and the offset, which means we can't use the Perl 6 string-to-number conversion directly.

There is a cleaner but longer method of extracting the relevant data using an actions class.

The closing } is missing, because the class doesn't do anything useful yet, and that should be added. The most basic operation is to find the specified date in a given month. Since Perl 6 has no built-in type for months, we use a Date object where the .day is one, that is, a Date object for the first day of the month.

    method based-on(Date $d is copy where { .day == 1}) {
        ++$d until $d.day-of-week == $.day-of-week;
        $d += 7 * ($.count - 1) + $.offset;
        return $d;
    }

The algorithm is quite simple: Proceed to the next date (++$d) until the day of week matches, then advance as many weeks as needed, plus as many days as needed for the offset. Date objects support addition and subtraction of integers, and the integers are interpreted as number of days to add or subtract. Handy, and exactly what we need here. (The API is blatantly copied from the Date::Simple Perl 5 module).

Another handy convenience method to implement is next, which returns the next date matching the specification, on or after a reference date.

    method next(Date $d = Date.today) {
        my $month-start = $d.truncated-to(month);
        my $candidate   = $.based-on($month-start);
        if $candidate ge $d {
            return $candidate;
        }
        else {
            return $.based-on($month-start + $month-start.days-in-month);
        }
    }
}

Again there's no rocket science involved: try the date based on the month of $d, and if that's before $d, try again, but with the next month as base.

Time to close the class :-).

So, when is the next Rakudo release? And the next Rakudo release after Christmas?

my $spec = DateSpec.new('3rd Tue + 2');
say $spec.next;
say $spec.next(Date.new(2013, 12, 25));

Output:

2013-12-19
2014-01-23

The code works fine on Rakudo with both the Parrot and the JVM backend.

Happy recurring hollidates!

Day 10: A Regex Story

December 10, 2009

On this tenth day of advent, we have the gift of a story …

Once upon a time in a land closer than you might think, an apprentice Perl 6 programmer named Tim was working on a simple parsing problem. His boss (let’s just call him Mr. C) had asked him to parse log files containing inventory information to make sure that there were only valid lines within the file. Each valid line within the file looked like this:

    <part number> <quantity> <color> <description>

So the Perl 6 apprentice, having some familiarity with regular expressions wrote a nice little regex that could be used to identify valid lines. The code that validated each line looked like this:

    next unless $line ~~ / ^^ \d+ \s+ \d+ \s+ \S+ \s+ \N* $$ /

The ~~ operator causes the regex on the right hand side to be matched against the scalar on the left hand side. In the regex itself, ^^ matches the beginning of a line, \d+ matches one or more digits (as the part number and quantity were so composed), \S+ matches one or more non- whitespace characters, \N* matches zero or more non-newline characters, \s+ matches whitespace in between each of these and $$ matches the end of a line. This being Perl 6, these individual components of the regex could be spread out a bit with spaces in between so that it could be more readable. All was good.

But then Mr. C decided that it would be nicer if the individual pieces of information could also be extracted from each in addition to validating it. Tim thought, “No problem, I’ll just use capturing parentheses”. And that’s just what he did:

    next unless $line ~~ / ^^ (\d+) \s+ (\d+) \s+ (\S+) \s+ (\N*) $$ /

After a successful pattern match, each parenthesized portion is available either as part of the match object itself ($/) via $/[0], $/[1], $/[2], or $/[3]. Or it could be accessed via the special variables $0, $1, $2, or $3. Tim was happy. Mr. C was happy.

But then it was discovered that some of the lines didn’t separate the color from the description and that these lines should be considered valid too. Lines where the color was integrated into the description were written a special way. They were always of the form:

    <part number> <quantity> <description> (<color>)

Where description, as before, could be any number of characters including spaces. “Blah!” thought Tim, “now this simple parser suddenly seems more complicated.” Luckily, Tim knew of a place to ask for help. He quickly logged on to irc.freenode.org, joined the #perl6 channel, and asked for help. Someone suggested that he should name the individual parts of his regex to make things easier and then use an alternation to match one or the other alternatives for the last part of the regex.

First, Tim tried naming the parts of his regex. Looking at the synopsis for Perl 6 regex, Tim found he could assign into the match object, so that’s what he did:

    next unless $line ~~ / ^^ $<product>=(\d+) \s+ $<quantity>=(\d+) \s+ $<color>=(\S+) \s+ $<description>=(\N*) $$ /

Now, after a successful match, the individual pieces are available via the match object or via special variables $<product>, $<quantity>, $<color>, and $<description>. This was turning out easier than expected and Tim was feeling quite confident. Next he added the alternation to distinguish between the two different valid lines:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' $<color>=(\S+) ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

In order to isolate the alternation from the rest of the regex, Tim surrounded it with grouping brackets ([ and ]). These group a portion of a regex much like parentheses only without capturing into $0 and friends. Since he needed to match literal parentheses, Tim took advantage of another useful Perl 6 regex feature: quoted strings are matched literally. And because of the assignments within the regex, $<color> and $<description> always contain the appropriate part of the string.

Tim was elated! He showed his code to Mr. C and Mr. C was elated too! “Well done Tim!” said Mr. C.

Everybody was happy. Tim beamed with pride.

However, after the initial glow of success faded, Tim started looking at his work with a more critical eye. For some of the lines where the color was at the end of the description, it was described as “( color)” or “( color )” or “( color )”. His current regex worked, but it would include the color as part of the description and wouldn’t set $<color> at all. That hardly seemed appropriate. Tim initially fixed this by adding more \s*:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' \s* $<color>=(\S+) \s* ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

This worked well, but the regex was starting to look a little cluttered. Again, Tim turned to #perl6 for help.

This time someone named PerlJam piped up, “Why don’t you put your regex in a grammar? That’s what you’re effectively doing by assigning each piece to a variable within the match object.” Wha??? Tim had no idea what PerlJam was talking about. After a brief exchange, Tim thought he understood and knew where to look if he needed more information. After thanking PerlJam, he went back to coding. This time the regex virtually disappeared as it turned into a grammar. Here’s what that grammar and matching code looked like:

grammar Inventory {
    regex product { \d+ }
    regex quantity { \d+ }
    regex color { \S+ }
    regex description { \N* }
    regex TOP { ^^ <product> \s+ <quantity>  \s+
                [
                | <description> \s+ '(' \s* <color> \s*  ')'
                | <color> \s+ <description>
                ]
                $$
    }
}

# ... and then later where the match happens
next unless Inventory.parse($line);

“Well,” thought Tim, “it is certainly more organized.”

Each of his variables in the previous incarnation of the regex simply became named regex within the grammar. Within Perl 6 regex, named regex are matched by simply enclosing the name within angle brackets (< and >). The specially named regex TOP is used when Grammar.parse is called with the scalar to match against. And the behavior is exactly the same as before because when a named regex matches as part of another regex, the text that was matched is saved in the match object and referenced by that name.

And though there was still room for improvement, both Tim and Mr. C were very happy with the result.

The End

* At the time of posting, Rakudo cannot correctly use this grammar to parse lines in the format

    <part number> <quantity> <description> (<color>)

Follow

Get every new post delivered to your Inbox.

Join 44 other followers