Day 21 – transliteration and beyond

Transliteration sounds like it has Latin roots and means a changing of letters. And that’s what the Str.trans method does.

say "GATTACA".trans( "TCAG" => "0123" );  # prints "3200212\n"

Perl 5 people (and Unix shell folk) immediately recognize this as tr/TCAG/0123/, but here’s a quick explanation for the rest of you out there: for every instance of T we find in the string, we replace it by 0, we replace every instance of C by 1, and so on. The two strings TCAG and 0123 supply the alphabets to be translated from and to, respectively.

This can be used for any number of time-saving ends. Here, for example, is a simple subroutine that “encrypts” a text with ROT-13:

sub rot13($text) { $text.trans( "A..Za..z" => "N..ZA..Mn..za..m" ) }

When .trans sees those .. ranges, it expands them internally (so "n..z" really means "nopqrstuvwxyz"). Thus, the ultimate effect of the rot13 sub is to map certain parts of the ASCII alphabet to certain other parts.
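Here is a small sketch of that range expansion at work. It assumes the usual ROT-13 target alphabet (each half of the alphabet swapped with the other), and the sample string is just an illustration:

```perl6
# Both sides expand to 52 characters: A-Z then a-z on the left,
# and the same letters rotated by 13 places on the right.
sub rot13($text) { $text.trans( "A..Za..z" => "N..ZA..Mn..za..m" ) }

say rot13("Hello, World!");          # Uryyb, Jbeyq!
say rot13(rot13("Hello, World!"));   # Hello, World!
```

Characters outside the listed ranges (the comma, the space, the exclamation mark) pass through untouched, and applying the sub twice gets you back where you started.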

Perl 5 writes these ranges with a dash (-) instead, but we’ve tried in Perl 6 to have the two dots (..) stand for the concept “range” everywhere: in the main language, in regexes, and here in transliterations.

Note also that the .trans method is non-mutating; it doesn’t change $text, but just returns a new value. This is also a general theme in Perl 6; in the core language we prefer to offer the side-effect-free variants of methods. You can easily get the mutating behavior by doing .=trans:

$kabbala.=trans("A..Ia..i" => "1..91..9");

(And that goes not only for .trans, but for all methods. It’s a silent encouragement to you as a programmer to write your libraries with non-mutating methods, making the world a happier, more composable place.)
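A quick sketch of the difference, using a made-up variable and an upper-casing transliteration purely for illustration:

```perl6
my $text = "hello";

# Plain .trans returns a new string and leaves $text alone.
my $shouted = $text.trans( "a..z" => "A..Z" );
say $text;      # hello
say $shouted;   # HELLO

# .= calls the method and assigns the result back to $text.
$text .= trans( "a..z" => "A..Z" );
say $text;      # HELLO
```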

But Perl 6 wouldn’t be Perl 6 if .trans didn’t also contain a hidden weapon which takes the Perl 5 tr/// and just completely blows it out of the water. Here’s what it also does:

Let’s say we want to escape some HTML, that is, replace things according to this table:

    & => &amp;
    < => &lt;
    > => &gt;

(By the way, I hope that if you ever need to escape HTML, there will be a ready-made library routine that does it for you. But the general principle is important; and in the few instances when you do need to do something like this, it’s good to know the tools are there, built into the language.)

This is nothing that a few well-placed regexes can’t handle. So what’s the big deal? Well, a naive in-place per-match replacement of the above three characters might be unlucky enough to get stuck in an infinite loop. (& => &amp; => &amp;amp; => ...) So you need to do various sordid trickery to avoid that.
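To make the trouble concrete, here is a sketch using chained .subst calls (not an infinite loop, but the same re-substitution problem in miniature). The variable names are made up for the example:

```perl6
my $html = "<b>";

# Escaping '&' first works, because later steps never see a raw '&'
# that an earlier step introduced.
my $good = $html.subst('&', '&amp;', :g)
                .subst('<', '&lt;',  :g)
                .subst('>', '&gt;',  :g);
say $good;   # &lt;b&gt;

# Escaping '&' last re-escapes the ampersands we just added.
my $bad = $html.subst('<', '&lt;',  :g)
               .subst('>', '&gt;',  :g)
               .subst('&', '&amp;', :g);
say $bad;    # &amp;lt;b&amp;gt;
```

The two versions differ only in the order of the calls, which is exactly the kind of fragility we would like to avoid.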

But that’s not even the fundamental problem, which is that we end up stitching together pieces of strings rather than thinking about the problem at a higher level. Generally, we wouldn’t want a solution that depends on the order of the substitutions. That would also affect something like this:

    foo         => bar
    foolishness => folly

If the former substitution is attempted first each time, there won’t ever be an occasion to perform the latter one, which is probably not what was intended. Generally, we want to try to match the longer substrings before the shorter ones.
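Again as a sketch with chained .subst calls (one possible naive approach, with made-up variable names), the shorter-first order loses the longer word entirely:

```perl6
my $text = "foolishness";

# Shorter key first: 'foo' is eaten before 'foolishness' gets a chance.
my $short-first = $text.subst('foo', 'bar', :g)
                       .subst('foolishness', 'folly', :g);
say $short-first;   # barlishness

# Longer key first gives the result we probably meant.
my $long-first = $text.subst('foolishness', 'folly', :g)
                      .subst('foo', 'bar', :g);
say $long-first;    # folly
```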

So, it seems we want a longest-token substitution matcher that avoids infinite cycles due to accidental re-substitution.

That’s what .trans in Perl 6 provides. That’s its hidden weapon: sending in a pair of arrays rather than strings. For the HTML escaping, all we need to do is this:

my $escaped = $html.trans(
    [ '&',     '<',    '>'    ] =>
    [ '&amp;', '&lt;', '&gt;' ]
);

…and the non-trivial problems of replacing things in the right order and avoiding cyclical replacement are taken care of for us.
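Here is a small sketch that wraps the call in a sub and tries it on a couple of inputs. The name escape-html and the sample strings are made up for illustration. The last line feeds the foo/foolishness table from earlier to .trans; if longer keys win as described above, it should print "bar and folly".

```perl6
# Hypothetical helper around the array form of .trans.
sub escape-html($html) {
    $html.trans(
        [ '&',     '<',    '>'    ] =>
        [ '&amp;', '&lt;', '&gt;' ]
    )
}

say escape-html('a < b && b > c');   # a &lt; b &amp;&amp; b &gt; c

# Replacement text is not scanned again, so each '&' is escaped exactly once.
say escape-html('&lt;');             # &amp;lt;

# The shorter key is listed first on purpose.
say "foo and foolishness".trans( [ 'foo', 'foolishness' ] => [ 'bar', 'folly' ] );
```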