Author Archive

Day 23 – It’s some .sort of wonderful.

December 23, 2010

Continuing on our Perl 6 adventure.

Sorting various lists of things is an extremely common programming task, and Perl 6 beefs up its sort function to help you get it done.

It has your everyday, garden variety sort:

    # default type sensitive sort
    my @sorted = @unsorted.sort; # or sort @unsorted;

And, like Perl 5, allows you to use custom comparator functions:

    # numeric comparison
    my @sorted = @unsorted.sort: { $^a <=> $^b };

    # or if you prefer to use function calling semantics
    my @sorted = sort { $^a <=> $^b }, @unsorted;

    # string comparison ( similar to Perl 5 cmp )
    my @sorted = @unsorted.sort: { $^a leg $^b };

    # type dependent comparison
    my @sorted = @unsorted.sort: { $^a cmp $^b };

These can also be written with parenthesis around the comparator, in which case you don’t need the colon. Handy for when you want to chain other methods after the sort:

    my @topten = @scores.sort( { $^b <=> $^a } ).list.munch(10);
A Small Aside… In Perl 6 the variables $a and $b don’t have any special global significance like they do in Perl 5. Use normal named variables ($var), positional variables ($^var) or whatever variables (*) in the sort comparison block just like any other code block.

You can directly apply a transform function as you sort:

    my @sorted = @unsorted.sort: { foo($^a) cmp foo($^b) };

but the foo() transform is being calculated anew for each iteration of the sort comparison. Probably not too bad for small lists, but can be a real drag as the lists get larger, especially if foo() is computation intensive.

A common idiom in Perl 5 is to use a Schwartzian Transform when you want to sort on some transformation of terms rather than the terms themselves.

When you use a Schwartzian Transform, each transform is calculated once, the items are sorted by the transformed terms, then the original terms mapped back and returned in the new order.

    # Perl 5 code
    @sorted =
        map  { $_->[0] }
        sort { $a->[1] cmp $b->[1] }
        map  { [$_, foo($_)] }
        @unsorted;

In Perl 6 you can still do a Schwartzian Transform, but there is some special intelligence built in to sort. If you have a transform function with arity 0 or 1, ( arity is the number of arguments that the function takes ), Perl 6 notices and will automatically apply a Schwartzian Transform for you.

Lets take a look at a few examples.

Sort Case Insensitively

Lowercase each term, sort on the lowercased terms then return the original terms in the lowercased sort order.

   my @sorted = @unsorted.sort: { .lc };

Pretty easy.

Sort By Word Length

Sort a list of strings by the number of characters in each string shortest to longest:

    my @sorted = @unsorted.sort: { .chars };

Or longest to shortest:

    my @sorted = @unsorted.sort: { -.chars };

Multiple Sort Comparators

You can pass a list of comparator functions in the sort block and it will perform as many comparator transforms in the list as necessary until it reaches a tie breaker:

Sort by word length with a secondary ASCII sort so that each group of words with the same length are then sorted in ASCII order:

     .say for @a.sort: { $^a.chars, $^a } ;

Except… that doesn’t quite work yet in Rakudo. It compares the word lengths string-wise rather than numerically, so 10 and 11 are sorted before 2. :-( BUT!! As the saying goes: TIMTOWDI! ( There Is More Than One Way to Do It!)

Sort is stable in Perl 6, so you can sort the list to get it into ASCII order, then re-sort it with the length comparator:

     .say for @a.sort.sort: { $^a.chars };

That works, but now it is sorting twice. Not too efficient. Or you could rewrite it like:

     .say for @a.sort: { $^a.chars <=> $^b.chars || $^a leg $^b };

That also works, but now you’ve lost the automatic Schwartzian Transform. ( arity 2 comparator )

Or, you could apply a natural sorting transform to the numeric terms to get the correct answer.

     .say for @a.sort: { $^a.&naturally, $^a };

“Natural Sorting!?”, I hear you cry, “Where’d that come from?”.

I’m glad you asked.

Let’s take a ride on that segue.


Natural Sorting

Standard lexical sorting returns items in “ASCIIbetical” order. Digits sort before upper case characters and both sort before lower case characters. People are often surprised / dismayed when they sort a list of strings and get something like:

    0
    1
    100
    11
    144th
    2
    21
    210
    3rd
    33rd
    AND
    ARE
    An
    Bit
    Can
    and
    by
    car
    d1
    d10
    d2

Which is perfectly correct, but not very intuitive for humans, especially for non-programmers.

“Natural” sorting would order numbers (strings of digits) sorted by order of magnitude then by magnitude before alphabetical position (upper or lower case).

Here is the same group of strings shown above sorted naturally:

    0
    1
    2
    3rd
    11
    21
    33rd
    100
    144th
    210
    An
    AND
    and
    ARE
    Bit
    by
    Can
    car
    d1
    d2
    d10

To do that, we want a simple transform we can apply to each term.

I’ll use a subst method. It an analogue of the familiar s/// operator in method form.

    .subst(/(\d+)/, -> $/ { 0 ~ $0.chars.chr ~ $0 }, :g)

In the first part we are capturing a group of one or more consecutive digits to operate on. That ‘ -> $/ { … } ‘ construct is a “pointy block”. It means: “Expose the contents of the match array ( $/ ) to the inner scope of the following block of code ( {…} )”. The block builds the replacement string: ‘0’ concatenated with the order of magnitude of (number of digits in) the group, expressed as an ASCII character, concatenated with the original string of digits. The :g adverb means “do it globally”.

We also want to sort case insensitively so we’ll chain a .lc method in and get:

    .lc.subst(/(\d+)/, -> $/ { 0 ~ $0.chars.chr ~ $0 }, :g)

Turn it into a subroutine:

    sub naturally ($a) {
        $a.lc.subst(/(\d+)/, -> $/ { 0 ~ $0.chars.chr ~ $0 }, :g)
    }

This works fairly well but has a subtle bug. Terms that map to the same transform will be returned in the order they were seen. So, for instance, the words; ‘THE’, ‘The’ and ‘the’ will return in the order seen in the list rather than some predictable sorted order. A simple solution is to just concatenate the original term to the end of the transformed term to act as a tie breaker.

So the final naturally() transform routine looks like:

    sub naturally ($a) {
        $a.lc.subst(/(\d+)/, -> $/ { 0 ~ $0.chars.chr ~ $0 }, :g) ~ "\x0" ~ $a
    }

Since it operates on a single term at a time, we get the Schwartzian Transform for free. Now we can use it as a sort modifier to naturally sort a list of words:

    .say for <0 1 100 11 144th 2 21 210 3rd 33rd AND ARE An Bit Can and by car d1 d10 d2>.sort: { .&naturally };

Or to sort a list of dotted quad notation IP addresses:

     # generate a list of random IPs
     my @ips = ((0..255).roll(4).join('.')for 0..99);

    .say for @ips.sort: { .&naturally };

    4.108.172.65
    5.149.121.70
    10.24.201.53
    11.10.90.219
    12.83.84.206
    12.124.106.41
    12.162.149.98
    14.203.88.93
    16.18.0.178
    17.68.226.104
    21.201.181.225
    23.61.166.202
    23.205.73.104
    24.250.90.75
    35.56.124.120
    36.158.70.141
    40.149.118.209
    40.238.169.146
    52.107.62.129
    55.119.95.120
    56.39.105.245
    ... and so on

Or to sort directory listings, or anything else that contains a mixture of alphanumeric characters… or even to work around bugs. ;-)

Merry Christmas, Happy sorting, and

May the Schwartz be with you!


Follow

Get every new post delivered to your Inbox.

Join 43 other followers