Posts Tagged ‘pattern matching’

Day 10: A Regex Story

December 10, 2009

On this tenth day of advent, we have the gift of a story …

Once upon a time in a land closer than you might think, an apprentice Perl 6 programmer named Tim was working on a simple parsing problem. His boss (let’s just call him Mr. C) had asked him to parse log files containing inventory information to make sure that there were only valid lines within the file. Each valid line within the file looked like this:

    <part number> <quantity> <color> <description>

So the Perl 6 apprentice, having some familiarity with regular expressions wrote a nice little regex that could be used to identify valid lines. The code that validated each line looked like this:

    next unless $line ~~ / ^^ \d+ \s+ \d+ \s+ \S+ \s+ \N* $$ /

The ~~ operator causes the regex on the right hand side to be matched against the scalar on the left hand side. In the regex itself, ^^ matches the beginning of a line, \d+ matches one or more digits (as the part number and quantity were so composed), \S+ matches one or more non- whitespace characters, \N* matches zero or more non-newline characters, \s+ matches whitespace in between each of these and $$ matches the end of a line. This being Perl 6, these individual components of the regex could be spread out a bit with spaces in between so that it could be more readable. All was good.

But then Mr. C decided that it would be nicer if the individual pieces of information could also be extracted from each in addition to validating it. Tim thought, “No problem, I’ll just use capturing parentheses”. And that’s just what he did:

    next unless $line ~~ / ^^ (\d+) \s+ (\d+) \s+ (\S+) \s+ (\N*) $$ /

After a successful pattern match, each parenthesized portion is available either as part of the match object itself ($/) via $/[0], $/[1], $/[2], or $/[3]. Or it could be accessed via the special variables $0, $1, $2, or $3. Tim was happy. Mr. C was happy.

But then it was discovered that some of the lines didn’t separate the color from the description and that these lines should be considered valid too. Lines where the color was integrated into the description were written a special way. They were always of the form:

    <part number> <quantity> <description> (<color>)

Where description, as before, could be any number of characters including spaces. “Blah!” thought Tim, “now this simple parser suddenly seems more complicated.” Luckily, Tim knew of a place to ask for help. He quickly logged on to irc.freenode.org, joined the #perl6 channel, and asked for help. Someone suggested that he should name the individual parts of his regex to make things easier and then use an alternation to match one or the other alternatives for the last part of the regex.

First, Tim tried naming the parts of his regex. Looking at the synopsis for Perl 6 regex, Tim found he could assign into the match object, so that’s what he did:

    next unless $line ~~ / ^^ $<product>=(\d+) \s+ $<quantity>=(\d+) \s+ $<color>=(\S+) \s+ $<description>=(\N*) $$ /

Now, after a successful match, the individual pieces are available via the match object or via special variables $<product>, $<quantity>, $<color>, and $<description>. This was turning out easier than expected and Tim was feeling quite confident. Next he added the alternation to distinguish between the two different valid lines:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' $<color>=(\S+) ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

In order to isolate the alternation from the rest of the regex, Tim surrounded it with grouping brackets ([ and ]). These group a portion of a regex much like parentheses only without capturing into $0 and friends. Since he needed to match literal parentheses, Tim took advantage of another useful Perl 6 regex feature: quoted strings are matched literally. And because of the assignments within the regex, $<color> and $<description> always contain the appropriate part of the string.

Tim was elated! He showed his code to Mr. C and Mr. C was elated too! “Well done Tim!” said Mr. C.

Everybody was happy. Tim beamed with pride.

However, after the initial glow of success faded, Tim started looking at his work with a more critical eye. For some of the lines where the color was at the end of the description, it was described as “( color)” or “( color )” or “( color )”. His current regex worked, but it would include the color as part of the description and wouldn’t set $<color> at all. That hardly seemed appropriate. Tim initially fixed this by adding more \s*:

    next unless $line ~~ / ^^
        $<product>=(\d+) \s+ $<quantity>=(\d+) \s+
        [
        | $<description>=(\N*) \s+ '(' \s* $<color>=(\S+) \s* ')'
        | $<color>=(\S+) \s+ $<description>=(\N*)
        ]
      $$
    /

This worked well, but the regex was starting to look a little cluttered. Again, Tim turned to #perl6 for help.

This time someone named PerlJam piped up, “Why don’t you put your regex in a grammar? That’s what you’re effectively doing by assigning each piece to a variable within the match object.” Wha??? Tim had no idea what PerlJam was talking about. After a brief exchange, Tim thought he understood and knew where to look if he needed more information. After thanking PerlJam, he went back to coding. This time the regex virtually disappeared as it turned into a grammar. Here’s what that grammar and matching code looked like:

grammar Inventory {
    regex product { \d+ }
    regex quantity { \d+ }
    regex color { \S+ }
    regex description { \N* }
    regex TOP { ^^ <product> \s+ <quantity>  \s+
                [
                | <description> \s+ '(' \s* <color> \s*  ')'
                | <color> \s+ <description>
                ]
                $$
    }
}

# ... and then later where the match happens
next unless Inventory.parse($line);

“Well,” thought Tim, “it is certainly more organized.”

Each of his variables in the previous incarnation of the regex simply became named regex within the grammar. Within Perl 6 regex, named regex are matched by simply enclosing the name within angle brackets (< and >). The specially named regex TOP is used when Grammar.parse is called with the scalar to match against. And the behavior is exactly the same as before because when a named regex matches as part of another regex, the text that was matched is saved in the match object and referenced by that name.

And though there was still room for improvement, both Tim and Mr. C were very happy with the result.

The End

* At the time of posting, Rakudo cannot correctly use this grammar to parse lines in the format

    <part number> <quantity> <description> (<color>)

Follow

Get every new post delivered to your Inbox.

Join 36 other followers