Posts Tagged ‘regex’

Day 18 – A Grammar with duplicate checking

December 18, 2013

Today’s example constructs a grammar for tracking playing cards in a single deal. We’ll say it’s poker with one or more players and that each player is being dealt a hand that contains exactly five cards.

There is, however, the need to detect duplicate cards. We’ll need some way of tracking cards both within each card-hand and between hands.

A simple Card Game Grammar

To start with, here’s the basic grammar (no duplicate checks yet):

grammar CardGame {

    rule TOP { ^ <deal> $ }

    rule deal {
        <hand>+ % ';'
    }

    rule hand { [ <card> ]**5 }
    token card {<face><suit>}

    proto token suit {*}
    token suit:sym<♥>  {<sym>}
    token suit:sym<♦>  {<sym>}
    token suit:sym<♣>  {<sym>}
    token suit:sym<♠>  {<sym>}

    token face {:i <[2..9]> | 10 | j | q | k | a }
}

say CardGame.parse("2♥ 5♥ 7♦ 8♣ 9♠");
say CardGame.parse("2♥ a♥ 7♦ 8♣ j♥");

The  top-level rule consists of a deal. The deal consists of one or more hands separated by ';'. Each hand consists of 5 playing cards.

Each card is represented by a face, one of: a (ace), j (jack), q (queen) or k (king), or 2 – 10. This is followed by a suite: ♥ (hearts) ♦ (diamonds) ♣ (clubs) or ♠ (spades).

[We could have used the playing cards characters, newly introduced in Unicode 6.0, but these aren’t widely supported yet].

As expected, the first cut of the grammar cheerily parses any hand:

say CardGame.parse("a♥ a♥ 7♦ 8♣ j♥");
# one hand, duplicate a♥
say CardGame.parse("a♥ 7♥ 7♦ 8♣ j♥; 10♥ j♥ q♥ k♥ a♥");
# two hands, duplicate j♥

Detecting Duplicates

We start by adding a Perl 6 variable declaration to the grammar. This will be used to track cards:

rule deal {
    :my %*PLAYED = ();
    <hand>+ % ';'
}

This declares %*PLAYED [1]. The '%*' twigil  indicates that it’s a hash '%' and that’s dynamically scoped '*'.

Dynamic scoping is not only for subroutine and method calls [1]. It also works seamlessly with grammar rules, tokens and actions.

Being dynamically scoped, %*PLAYED is available to callees of the deal rule; the hand token, and its callee, the card token.

It’s also available to any actions, that then get called. So we can track and report on duplicates by creating an action class with a method for the card token:

class CardGame::Actions {
    method card($/) {
       my $card = $/.lc;
       say "Hey, there's an extra $card"
           if %*PLAYED{$card}++;
   }
}

my $a = CardGame::Actions.new;
say CardGame.parse("a♥ a♥ 7♦ 8♣ j♥", :actions($a));
# "Hey there's an extra a♥"
say CardGame.parse("a♥ 7♥ 7♦ 8♣ j♥; 10♥ j♥ q♥ k♥ a♦",
                   :actions($a));
# "Hey there's an extra j♥"

And that might be all that’s needed  for tracking and reporting on duplicates. There’s a pretty good separation between the declarative grammar and procedural actions, with just one dynamically scoped hash variable.

Disallowing Duplicates

But I had a situation where I wanted duplicate checking to be a parsing constraint. Parsing needed to fail when duplicates were encountered.

I achieved this by moving the duplicate check grammar side:

token card {<face><suit>
    <?{
        # only allow each card to appear once
        my $card = $/.lc;
        say "Hey, there's an extra $card"
            if %*PLAYED{$card};

        ! %*PLAYED{$card}++;
     }>
}

This has introduced a code assertion between the <?{ and }>  [2]. The rule succeeds when the code evaluates to a True value. The card token thus fails when the same card is detected more than once in a single deal.

say CardGame.parse("2♥ 7♥ 2♦ 3♣ 3♦");
# legitimate, parses

say CardGame.parse("a♥ a♥ 7♦ 8♣ j♥");
# fails with message: Hey, there's an extra a♥

say CardGame.parse("a♥ 7♥ 7♦ 8♣ j♥; 10♥ j♥ q♥ k♥ a♦");
# fails with message: Hey, there's an extra j♥

Discussion/Conclusion

One thing to be careful of with this type of technique is back-tracking (trying of alternatives). If, for instance, the grammar was modified in such a way that the card token could be called more than once for single input card, then we might erroneously report a duplicate. It’s still possible to track, but becomes a bit more involved. The simplest answer is to keep the grammars as simple as possible and minimize back-tracking.

If in any doubt,  please consider using one or more of the Grammar::Debugger, Grammar::Tracer [3] or the debugger [4] modules [5] to track what’s going on. You can also insert debugging code into tokens or rules as closures: { say "here" } [6].

That the exercise for today; a simple Perl 6 Grammar to parse playing-cards in a card-game, but with duplicate checks using either actions or code assertions.

Day 10: A Regex Story

December 10, 2009

On this tenth day of advent, we have the gift of a story …

Once upon a time in a land closer than you might think, an apprentice Perl 6 programmer named Tim was working on a simple parsing problem. His boss (let’s just call him Mr. C) had asked him to parse log files containing inventory information to make sure that there were only valid lines within the file. Each valid line within the file looked like this:

    <part number> <quantity> <color> <description>

So the Perl 6 apprentice, having some familiarity with regular expressions wrote a nice little regex that could be used to identify valid lines. The code that validated each line looked like this:

    next unless $line ~~ / ^^ \d+ \s+ \d+ \s+ \S+ \s+ \N* $$ /

The ~~ operator causes the regex on the right hand side to be matched against the scalar on the left hand side. In the regex itself, ^^ matches the beginning of a line, \d+ matches one or more digits (as the part number and quantity were so composed), \S+ matches one or more non- whitespace characters, \N* matches zero or more non-newline characters, \s+ matches whitespace in between each of these and $$ matches the end of a line. This being Perl 6, these individual components of the regex could be spread out a bit with spaces in between so that it could be more readable. All was good.

But then Mr. C decided that it would be nicer if the individual pieces of information could also be extracted from each in addition to validating it. Tim thought, “No problem, I’ll just use capturing parentheses”. And that’s just what he did:

    next unless $line ~~ / ^^ (\d+) \s+ (\d+) \s+ (\S+) \s+ (\N*) $$ /

After a successful pattern match, each parenthesized portion is available either as part of the match object itself ($/) via $/[0], $/[1], $/[2], or $/[3]. Or it could be accessed via the special variables $0, $1, $2, or $3. Tim was happy. Mr. C was happy.

But then it was discovered that some of the lines didn’t separate the color from the description and that these lines should be considered valid too. Lines where the color was integrated into the description were written a special way. They were always of the form:

    <part number> <quantity> <description> (<color>)

Where description, as before, could be any number of characters including spaces. “Blah!” thought Tim, “now this simple parser suddenly seems more complicated.” Luckily, Tim knew of a place to ask for help. He quickly logged on to irc.freenode.org, joined the #perl6 channel, and asked for help. Someone suggested that he should name the individual parts of his regex to make things easier and then use an alternation to match one or the other alternatives for the last part of the regex.

First, Tim tried naming the parts of his regex. Looking at the synopsis for Perl 6 regex, Tim found he could assign into the match object, so that’s what he did:

    next unless $line ~~ / ^^ $<product>=(\d+) \s+ $<quantity>=(\d+) \s+ $<color>=(\S+) \s+ $<description>=(\N*) $$ /

Now, after a successful match, the individual pieces are available via the match object or via special variables $<product>, $<quantity>, $<color>, and $<description>. This was turning out easier than expected and Tim was feeling quite confident. Next he added the alternation to distinguish between the two different valid lines:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' $<color>=(\S+) ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

In order to isolate the alternation from the rest of the regex, Tim surrounded it with grouping brackets ([ and ]). These group a portion of a regex much like parentheses only without capturing into $0 and friends. Since he needed to match literal parentheses, Tim took advantage of another useful Perl 6 regex feature: quoted strings are matched literally. And because of the assignments within the regex, $<color> and $<description> always contain the appropriate part of the string.

Tim was elated! He showed his code to Mr. C and Mr. C was elated too! “Well done Tim!” said Mr. C.

Everybody was happy. Tim beamed with pride.

However, after the initial glow of success faded, Tim started looking at his work with a more critical eye. For some of the lines where the color was at the end of the description, it was described as “( color)” or “( color )” or “( color )”. His current regex worked, but it would include the color as part of the description and wouldn’t set $<color> at all. That hardly seemed appropriate. Tim initially fixed this by adding more \s*:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' \s* $<color>=(\S+) \s* ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

This worked well, but the regex was starting to look a little cluttered. Again, Tim turned to #perl6 for help.

This time someone named PerlJam piped up, “Why don’t you put your regex in a grammar? That’s what you’re effectively doing by assigning each piece to a variable within the match object.” Wha??? Tim had no idea what PerlJam was talking about. After a brief exchange, Tim thought he understood and knew where to look if he needed more information. After thanking PerlJam, he went back to coding. This time the regex virtually disappeared as it turned into a grammar. Here’s what that grammar and matching code looked like:

grammar Inventory {
    regex product { \d+ }
    regex quantity { \d+ }
    regex color { \S+ }
    regex description { \N* }
    regex TOP { ^^ <product> \s+ <quantity>  \s+
                [
                | <description> \s+ '(' \s* <color> \s*  ')'
                | <color> \s+ <description>
                ]
                $$
    }
}

# ... and then later where the match happens
next unless Inventory.parse($line);

“Well,” thought Tim, “it is certainly more organized.”

Each of his variables in the previous incarnation of the regex simply became named regex within the grammar. Within Perl 6 regex, named regex are matched by simply enclosing the name within angle brackets (< and >). The specially named regex TOP is used when Grammar.parse is called with the scalar to match against. And the behavior is exactly the same as before because when a named regex matches as part of another regex, the text that was matched is saved in the match object and referenced by that name.

And though there was still room for improvement, both Tim and Mr. C were very happy with the result.

The End

* At the time of posting, Rakudo cannot correctly use this grammar to parse lines in the format

    <part number> <quantity> <description> (<color>)

Follow

Get every new post delivered to your Inbox.

Join 37 other followers