Day 19 – Snow white and the seven conditionals

December 19, 2014

Perl 6 is a big language. It’s rather unapologetic about it. The vast, frightening, beautiful (and slightly dated) periodic table of operators hammers this point home. The argument for having a plethora of operators is to cover a lot of semantic ground, to let you freely express yourself with More Than One Way To Do It. The argument against is fear: “aaargh, so many!”

But ’tis the season, and love conquers fear. Today I went searching for conditionals in the language; ways to say “if p then q” in Perl 6. All this in order to take you on a guided tour of conditionals.

Here’s what we are looking for: a language construct with an antecedent (“the ‘if’ part”) and a consequent (“the ‘then’ part”). (They may also have an optional ‘else’ part, but under my arbitrary search criteria, I don’t care.)

I went looking, and I found seven of them. Now for the promised tour.

if

An oldie but a goodie.

if prompt("Should I exult? ") eq "yes" {
    say "yay!";
}

Unlike in Perl 5, you don't have to put parentheses around the antecedent expression. (Cue Python programmers rolling their eyes, unimpressed.) There are elsif and else clauses that you can add if you need them.

I won't count unless separately, as all it does is reverse the logical value of the antecedent.
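
For the record, a hedged sketch of unless in action (nothing here beyond core Perl 6):

```perl6
my @queue;

# unless is just if with the condition's truth value flipped
unless @queue.elems {
    say "nothing to do";   # runs for an empty @queue
}
```

Note that unless deliberately takes no elsif or else clauses in Perl 6; if you find yourself wanting one, an if reads better anyway.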

when

The when statement also qualifies. It's a conditional statement, but with a preference for matching against the topic variable, $_.

given prompt("What is your age? ") {
    when * < 15 { do-not-let-into-bar() }
    when * < 25 { ask-for-ID() }
    when * > 80 { suggest-bingo-around-the-corner() }
}

The semantic "bonus" with this statement is that the end of a when block implicitly triggers a succeed (known as break to traditionalists), which will leave the surrounding contextualizer ($_-binding) block, if any. Or, in plain language, you don't have to put in explicit breaks in your Perl 6 switch statements.
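
A short sketch of the implicit succeed, plus its escape hatch proceed, which resumes testing the remaining clauses (assuming only core given/when semantics):

```perl6
given 42 {
    when Int    { say "an integer"; proceed }  # proceed: keep testing clauses
    when * %% 2 { say "even" }                 # implicit succeed leaves the block
    default     { say "not reached for 42" }
}
# says "an integer", then "even"
```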

The when statement doesn't have an opposite version. I have sometimes campaigned for whenn't, but with surprisingly (or rather, predictably) little uptake. Ah well; there will be modules.

&&

"Now hold your horses!", I hear you say. "That ain't no conditional!" And you would be right. But, according to the criteria of the search, it is, you see. It has an antecedent and a consequent. The consequent only gets evaluated if the antecedent is truthy.

if @stuff && @stuff[0] eq "interesting" {
    # ...
}

This behavior is called "short-circuiting", and Perl 6 is only one of a long list of languages that work this way.

I won't count ||, because it's the dual of &&. It also short-circuits, but it bails early on truthy values instead of falsy ones.
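
For completeness, a quick sketch of || bailing early on the first truthy value (the sub name is made up for illustration):

```perl6
sub expensive-default { say "computing default"; 3000 }

my %config;
my $port = %config<port> || expensive-default();  # left side undefined: right side runs
my $host = "localhost"   || expensive-default();  # truthy: sub never called
say "$host:$port";  # localhost:3000
```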

and

If we count &&, surely we should count its low-precedence cousin, and.

open("file.txt")
    and say "Yup, it opened";

Some people use && and and interchangeably, or prefer only one of them. I'm a little bit more restrictive, and tend to use && inside of expressions, and and only between statements.

I won't count or, again because it's the dual of and.

not ?&

A non-entrant in our tour which nonetheless deserves an honorable mention is the ?& operator. It's a "boolifying &&".

But besides always returning a boolean value (rather than the rightmost truthy value, as && and and do), it also doesn't short-circuit. Which makes it useless as a conditional.

sub f1 { say "first"; False }
sub f2 { say "second" }

f1() && f2();   # says "first"
f1() ?& f2();   # says "first", "second"

Therefore, there are two reasons I'm not counting ?&.

not & either

We have another honorable mention. In Perl 5, the & operator deals in bitmasks (just as in a lot of C-derived languages), but in Perl 6 this operator has been stolen for a higher purpose: junctions.

f1() & f2();    # says "first", "second" in some order

The old bitmask operator (now spelled +&, by the way) doesn't short-circuit, and never did in any language I'm aware of. This new junctional operator also doesn't short-circuit. In fact, the semantics it conveys is one of quantum computer "simultaneous evaluation". Actual threading semantics is not guaranteed, currently does not happen in Rakudo — and may not be worth the overhead for most small calculations anyway — but the message is clear: hie thee hence, short-circuiting.

//

The most questionable conditional on my list, this one nevertheless made the cut. It's a version of ||, but instead of testing for truthiness, it tests for undefinedness. Other than that, it's very conditional.

my $name = %names{$id} // "<no name given>";

This operator is notable for sharing a first place (with say) as "most liked feature side-ported to Perl 5". (Just don't mention smartmatch to a Perl 5 person.)

The dual of //, for evaluating a consequent only if the antecedent is a defined value, is spelled ⅋⅋, and of course, it does not exist and never did. *bright neuralizer flash*

?? !!

Loved and known by many names — ternary operator, trinary operator, conditional operator — this operator defies grammatical categories but is clearly a conditional, too. It even has an else part.

my $opponent = $player == White ?? Black !! White;

Almost every language out there spells this operator as ? : and it takes a while to get used to the new spelling in Perl 6. (The reason was that both ? and : were far too short and valuable to waste on this operator, so it got "demoted" to a longer spelling.)

If you ask me, ?? !! is an improvement once you see past the historical baggage: all of the other pure boolean operators already identify themselves by doubling a symbolic character like that. It blends in better.

You didn't hear it from me, but if anyone ever feels like creating an unless/else version of this operator, you could do a lot worse than choosing ¿¿ ¡¡ for the job. Ah, Unicode.

andthen

Finally, a late entry, andthen also works as a conditional. Just when you thought Perl 6 had you semantically covered with && and and for ordinary boolean values, and // for undefined values...

...it opens up a third door. The andthen operator only evaluates its right-hand statement if the left-hand statement succeeded.

shut-down-reactor()
    andthen don-hazmat-suit()
    andthen enter-chamber()
;

Succeeding means returning something defined and not failing along the way. Structurally, the andthen operator has been compared to Haskell's monadic bind: success of earlier statements guards execution of later ones. Though it's not implemented yet in Rakudo, you're even supposed to be able to pass the successful value from the left-hand statement to the right-hand one. (It ends up in $_.)

I won't count orelse, even though it also short-circuits. Though it's worth noting that the right-hand side of orelse ends up with the $! value from the left-hand side.
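
A sketch of that, assuming a file that does not exist (open returns a Failure, which counts as not succeeding):

```perl6
my $fh = open("no-such-file.txt")
    orelse do {
        note "could not open: $!";  # $! carries the left-hand side's failure
        Nil;
    };
```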

Signoff

Thank you for taking the tour. As you can see, there really is More Than One Way to write a condition in Perl 6. Some of them will turn up more often in everyday code, and some less often. I use and only rarely, and I've yet to use an andthen in real code. The others all crop up fairly regularly, and all in their own favorite contexts.

Some languages give you a sparse diet of the bare necessities, claiming that it's good for you to have fewer choices. Perl is not one of those languages. Instead, you get lots of choice, lots of freedom, and it's up to you to do great things with it. Of course, we also try to make sure that the constructs that are there make sense, are in good taste, and feel consistent with the rest of the language. The end result is a pleasant language with plenty of rope to shoot yourself in the foot with.

Merry conditional Christmas!

Day 18 – MoarVM Internals for the Brave (and Curious)

December 18, 2014

Hi there! I’m the guy who developed the MoarVM x64 JIT compiler during
this year’s GSoC. Jonathan has already given a detailed (and clear)
overview of MoarVMs optimization subsystems, of which the JIT compiler
is but a small step. Today, I’m going to try and show you how you can
see this in action. But first, a sample program:

    use v6;
    
    sub fibonacci (int $n --> int) {
        my int $x = 1;
        my int $y = 1;
        while $n > 0 {
            $x = $x + $y;
            $y = $x - $y;
            $n = $n - 1;
        }
        return $y;
    }
    
    for ^40 -> int $x {
        say fibonacci $x;
    }


Careful readers will spot the native integer definitions sprinkled
throughout the code. You can write perl6 code as if it is C,
too. (After all, that’s just one more way to do it). I’m not entirely
sure about other backends, but on MoarVM these really work – no object
is ever allocated for them.

Rakudo Perl 6 can do more than just run this code; it can also show
you how it understands it. I’ve saved the above program as ‘fib.p6’,
and you can get a (rather verbose) textual representation of the
syntax tree by using the special target command line option:

perl6-m --target=mast fib.p6

However, this is just a static analysis. What we’re really interested
in is what happens during runtime. MoarVM will show you all the gory
details if you specify a filename with the MVM_SPESH_LOG environment
variable, like so.

export MVM_SPESH_LOG=spesh-log.txt; perl6-m fib.p6

From the outside it looks as if nothing special has
happened. As they say, looks are deceiving, because MoarVM has
faithfully produced a text file that details the way your code has
been changed. It is larger even than the first analysis, but it also
contains more information. Let’s look at our function:

    Inserting logging for specialization of 'fibonacci' 
(cuid: cuid_2_1418763286.32937)
    
    Before:
    Spesh of 'fibonacci' (cuid: cuid_2_1418763286.32937, file: fib.p6:3)

    BB 0 (0x4f77118):
      Instructions:
        no_op 
      Successors: 1, 6, 5, 7, 9
      Predeccessors: 
      Dominance children: 1, 5, 6, 7, 9, 10

    BB 1 (0x4f77188):
      Instructions:
        checkarity liti16(1), liti16(1)
        param_rp_i r2(1), liti16(0)
        bindlex lex(idx=5,outers=0,$n), r2(1)
        paramnamesused 
        const_i64_16 r0(1), liti16(0)
        bindlex lex(idx=1,outers=0,$x), r0(1)

    ....


This little snippet is not without jargon, so I’ll try to explain. (I
did warn you at the top). The ‘cuid’ is the compilation unit
identifier, and it serves to identify the source of any
function. Sometimes compilation units correspond to files, and
sometimes they don’t (like in a REPL loop). The indented blocks denote
Basic Blocks. A Basic Block is a sequence of instructions that is
uninterrupted by a (conditional) jump. They are important because
within a basic block it is always obvious where all the values are
coming from.

Further along in our spesh log, we can see how spesh has transformed
the first block:

    ...
      BB 1 (0x4f77188):
        Instructions:
          sp_getarg_i r2(1), liti16(0)
          bindlex lex(idx=5,outers=0,$n), r2(1)
          const_i64_16 r0(1), liti16(0)
     ...


This is the same start of the function as before. What has happened is
that the polymorphic instruction param_rp_i has been replaced by the
much simpler instruction sp_getarg_i. As Jonathan explained, we can get away with
this because we know exactly how this function is called, which is
just at line 15 of our little script.

While the spesh log is certainly interesting, it is no good for light
reading. Which is why I wanted to show off a really cool tool that
Timo Paulssen (timotimo) made a while ago – a way to transform a piece
of the spesh log (the ‘spesh graph’ that represents the code of a
function) into a visual form (with a little help of
graphviz). Unfortunately, I couldn’t really get it to work. This
demonstrates something important about all the tools that I’m showing
today – they’ve all been designed not for language users but VM
developers, and they may all be broken this time next year.

[I was able to run the tool and get this piece of output. – timotimo]

Let’s wrap it up. If you’re interested in what kind of assembly code
the JIT will produce for you, there is also a way to get to that. Run
your script again as follows:

export MVM_JIT_BYTECODE_DIR=. # or /tmp, or some other directory
perl6-m fib.p6

If you’ve followed these instructions directly, what you’ll now see is
your working directory littered with binary files representing different
frames (functions). Every such file will contain the raw binary data
generated by the JIT compiler. Inspecting these files is easy enough
(as long as you do know x64 assembly):

objdump -D -b binary -m i386:x86-64 -M intel `ls *.fibonacci.bin`

If you’re on a mac, that’s gobjdump rather than objdump. And if you
prefer AT&T syntax over intel syntax, just drop the ‘-M intel’ part.
Looking at the output of this bytecode, you might also see the rather
wasteful way your code is compiled. After all, I’ve written this
function specifically for simplicity. Yet I count no fewer than 215
‘mov’ instructions, 18 conditional move instructions, and 16 function
calls. As much as MoarVM and perl6 have achieved this year, there is
still a lot left to do. And with that, I wish you a hacky Christmas
:-)

Day 17 – Three strange loops

December 17, 2014

For a long time, I’ve been fascinated by strange loops.

[image: hands drawing each other]

The concept “strange loop” revels in the circular definition. Something is the way it is because it’s defined in terms of itself. I think fascination for this concept is related to fascination with the infinite. Chicken-egg conundrums. Grandfather paradoxes. Catch-22s. (“Catches-22”?) Newcomb’s paradox. Gödel’s incompleteness theorems. Something about causal feedback loops tickles our brains.

The concept has taken root not just in art or in brainteasers. When we talk of “bootstrapping compilers” or “metacircular evaluators”, we mean exactly this type of activity. To make a compiler bootstrapping means to make it able to compile itself by (re-)writing it in the target language. An evaluator is metacircular if it implements features not in terms of a helper subsystem, but in terms of themselves. And people do this not just because it’s a cool trick, but because it opens up possibilities. Arbitrary limits in programming stop us from doing what we want, and the limit between “host system” and “guest system” is arbitrary.

In other words, if you’re seriously interested in extensibility, then you’re also likely interested in strange loops. That’s why you see these ideas floated around in the Lisp communities a lot. And that’s why compilers sometimes take the bootstrapping route. (Another happy example is JavaScript, where some language extension proposals happens through their metacircular interpreter, aptly named Narcissus.)

Perl 6 has three strange loops. Grammar, OO, and program elements. It’s in various stages of closing those.

Grammar

Status: not quite closed yet

Perl 6 grammars are famously powerful and allow the programmer to specify entire languages. It stands to reason that a Perl 6 implementation would want to eat its own dogfood and use a Perl 6 grammar to specify the syntax of the language itself. And behold, Rakudo and Niecza do exactly that. (Pugs doesn’t, but among its future versions is the milestone of being self-hosting.)

Having this bootstrapping in place gives us a high confidence in our grammar engine. We know that it’s powerful enough to parse all the Perl 6 we pass it every day. This is also what allows us to write most of Rakudo in Perl 6, making it easier for others to contribute.

I say “not quite closed yet” because the access from user code into the Perl 6 grammar has some ways to go still.

Benefits: tooling, slangs, macros

Something like PPI should be trivial once we close the loop completely. The compiler already has access to all that information, and we just need to get it out into user space. Slangs can either extend the existing Perl 6 grammar, or start from scratch and provide a completely different syntax. In either case, good communication between old and new grammars is required for e.g. variables to be visible across languages. Macros need to be able to effortlessly go from the textual form of something to AST form.

OO

Status: closed

The system underlying all objects in the Perl 6 runtime is called 6model. An object was created from a class, a class was created from a “class metaobject”. After just a few more steps you end up with something called a “knowhow” — this is the end of the line, because by all appearances the knowhow created itself.

6model is our object-oriented implementation of object orientation. It’s very nice and it allows us to extend OO when we need it.

Benefits: make your own OO rules

You will find most examples of the kind of OO extension that’s possible through jnthn’s modules. Grammar::Debugger extends what it means to call a grammar rule. Test::Mock similarly keeps track of how and when methods are called. The recent modules OO::Monitors and OO::Actors both provide new class metaobjects, basically giving us new rules for OO. They also go one step further and provide definitional keywords for monitors and actors, respectively.

Program elements

Status: just discovered

Macros can generate code, but in some cases they also analyze or extend code. That’s the idea anyway. What’s been stopping us so far in realizing this in Perl 6 is that there hasn’t been a standard way to talk about Perl 6 program elements. What do I mean by “program element” anyway? Let’s answer that by example.

Say you have a class definition in your program. Maybe you have a fancy editor, with refactoring capability. The editor is certainly aware of the class definition and can traverse/manipulate it according to the rules of the language. In order for it to do that, it needs to be able to represent the class definition as an object. That object is a program element. It’s different from the class metaobject; the former talks about its relation to the program text, and the latter talks about its relation to the OO system.

<PerlJam> masak: "program elements" reads like
          "DOM for Perl" to me.
<masak> yep.

Macros are headed in the same way as such a refactoring editor. By handling program elements, you can analyze and extend user code at compile time inside macros. The API of the program elements give the user the ability to extend the Perl 6 compiler itself from library code and user code.

The work on this has only just started. My progress report so far is this: Perl 6 has a lot of node types. :) Having a rich language that doesn't take the Lisp route of almost-no-syntax also means that the API of the program elements becomes quite extensive. But it will also open up some exciting doors.

Benefits: macros, slangs

Parting words

Part of the reason why I like Perl 6 is that it has these strange loops. Little by little, year by year, Perl 6 lifts itself up by the bootstraps. There's still some work left to close some of these loops. I've been hopefully waiting for a long time to be able to do Perl 6 parsing from user code. And I'm eager to provide more of the program element loop so that we can write Perl 6 that writes Perl 6 from inside of our macros.

Mostly I'm excited to see people come up with things that none of us have thought of yet, but that are made possible, even easy, by embedding these strange loops into the language.

Have a loopy Christmas.

Day 16 – Quoting on Steroids

December 16, 2014

A few days ago, there was a blog post about String Interpolation. Most of the examples there used a simple double quoted string.

my $a = 42;
say "a = $a"; # a = 42

The possibilities of the double quoted string are quite powerful already. But in the end, they’re just a special case of a much more generic and malleable quoting construct, named Q.

Q’s Basic Features

In its most generic form, Q just copies the string without any changes or interpretation:

my $a = 42;
say Q/foo $a \n/; # foo $a \n

You can add adverbs to Q[…], to change the way the resulting string will be formatted. For instance, if you want to have interpolation of scalars, you can add :s. If you want interpretation of backslashes like \n, you can add :b. And you can combine them:

my $a = 42;
say Q:s/foo $a\n/;   # foo 42\n
say Q:b/foo $a\n/;   # foo $a␤
say Q:s:b/foo $a\n/; # foo 42␤

(If you wonder what ␤ is: it is U+2424 SYMBOL FOR NEWLINE [So], and it should show up in your browser as a character containing an N and an L as a visible representation of a newline character.)

In fact, the list of adverbs of basic features is:

    short       long            what does it do
    =====       ====            ===============
    :q          :single         Interpolate \\, \q and \' (or whatever)
    :s          :scalar         Interpolate $ vars
    :a          :array          Interpolate @ vars
    :h          :hash           Interpolate % vars
    :f          :function       Interpolate & calls
    :c          :closure        Interpolate {...} expressions
    :b          :backslash      Interpolate \n, \t, etc. (implies :q at least)
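
The closure and function adverbs from the table are easy to miss; here is a small sketch of :c, :f and :a on their own (behavior as described in the table, nothing more assumed):

```perl6
my @words = <three little words>;
sub greet { "Hello" }

say Q:c/1 + 1 = {1 + 1}/;   # 1 + 1 = 2
say Q:f/&greet() world/;    # Hello world
say Q:a/@words[] here/;     # three little words here
```

Note the zen slice @words[]: array interpolation needs the postcircumfix, just as in double quoted strings.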

The :q (:single) basically gives you single quoted string semantics. The other adverbs, give you typically the functionality that you would expect from double quoted strings. If you really want to be verbose on your double quoted strings, you can write them like this:

my $a = 42;
say Q :scalar :array :hash :function :closure :backslash /foo $a\n/; # foo 42␤

Of course, you can also specify the short versions of the adverbs, and not separate them by whitespace. So, if you want to be less verbose:

my $a = 42;
say Q:s:a:h:f:c:b/foo $a\n/; # foo 42␤

As with any adverbs (which are just named parameters, really), the order does not matter:

my $a = 42;
say Q:f:s:b:a:c:h/foo $a\n/; # foo 42␤

Actually, the story about the order of the named parameters is a little bit more complicated than this. But for this set of adverbs, it does not matter in which order they are specified.

.oO( is that a brother of Johann Sebastian? )

But seriously, that is still a mouthful. So an even shorter shortcut is provided: :qq

    short       long            what does it do
    =====       ====            ===============
    :qq         :double         Interpolate with :s, :a, :h, :f, :c, :b

So, you can:

my $a = 42;
say Q:double/foo $a\n/; # foo 42␤
say Q:qq/foo $a\n/; # foo 42␤

All that for simply doing a double quoted string with interpolation? Well, because people are using double quoted strings a lot, the simple double quote remains the quickest way of interpolating values into a string. However, underneath it all, it’s really Q:qq, which in turn is really Q:s:a:h:f:c:b.

What about a double quoted string that has double quotes in it? For those cases, the Q:qq form is available, but that is still quite verbose. Synopsis 2 therefore specifies:

In fact, all quote-like forms derive from Q with adverbs:
    q//         Q:q//
    qq//        Q:qq//

Which means we can shorten the Q:qq in that last example to qq (and have double quotes in the double quoted string without any problems):

my $a = 42;
say qq/foo "$a"\n/; # foo "42"␤

Both q// and qq// also support (the same) adverbs. This initially seems the most useful with q//, for instance in combination with :s, which would (also) interpolate scalars:

my $a = 42;
say q:s/foo "$a"\n/; # foo "42"\n

However, adverbs (just as named parameters) are just a shortcut for a Pair: :s is really s => True. And :!s is really just s => False. Can we also apply this to quoting constructs? The answer is: yes, you can!

my $x = 42; my $y = 0;
say qq:!s:!c/foo "$x{$y}"\n/; # foo "$x{$y}"␤

Even though we specified qq//, the scalar is not interpolated, because of the :!s adverb. And the scope is not interpolated, because of the :!c. This can for instance be handy when building strings to be EVALled. So, if you want all quoting features except one or more, you can easily de-activate that feature by negating those adverbs.
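
A sketch of that EVAL use case (hypothetical code; note that current Rakudo additionally wants the MONKEY-SEE-NO-EVAL pragma, which wasn't needed in 2014):

```perl6
use MONKEY-SEE-NO-EVAL;

my $var = '$answer';
# :!c keeps any {...} blocks literal, while $var itself still interpolates
my $code = qq:!c/my $var = 42; say $var;/;
EVAL $code;   # says 42
```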

Some of Q’s Advanced Features

Quoting features do not stop here. This is a list of some of the other features that already work in Rakudo Perl 6:

    short       long            what does it do
    =====       ====            ===============
    :x          :exec           Execute as command and return results
    :w          :words          Split result on words (no quote protection)
    :ww         :quotewords     Split result on words (with quote protection)
    :to         :heredoc        Parse result as heredoc terminator

Here are some examples. Note that we don’t bother specifying these features with the long adverb names, because they’re basically the first additional adverb, so there is a shortcut version for them. For example, qq:x// can be shortened to qqx//. Whether you consider that more readable or not, I leave up to the reader. There is, after all, more than one way to do it.

Interpolate and execute as an external program:

my $w = 'World';
say qqx/echo Hello $w/; # Hello World

Interpolate as single quoted words (please look at what happens to the single quotes):

.say for qw/ foo bar 'first second' /;
==
foo
bar
'first
second'

Interpolate as single quoted words with quote protection. This will make sure that balanced quotes will be treated as one entity (and note again what happened to the single quotes).

.say for qww/ foo bar 'first second' /;
==
foo
bar
first second

Interpolate variables into a heredoc:

my $a = 'world';
say qqto/FOO/;
  Hello $a
  FOO
==
Hello world␤

The text is automatically outdented by the same number of indents as the heredoc target has.

Conclusion

Perl 6 has a very powerful basic quoting construct in Q[…], from which all other quoting constructs are derived. They allow mix and matching of features in various short and more verbose ways. There are still some adverbs unimplemented, but the ones that are mentioned here, should Just Work™.

Finally, the Synopsis also indicates that there will be a way out of this alphabet soup:

If you want to abbreviate further, just define a macro:

    macro qx { 'qq:x ' }          # equivalent to P5's qx//
    macro qTO { 'qq:x:w:to ' }    # qq:x:w:to//
    macro quote:<❰ ❱> ($text) { quasi { {{{$text}}}.quoteharder } }

We can only hope that someone will find enough quality tuits soon to implement this.

Day 15 – Bioinformatics and the joy of Perl 6

December 15, 2014

On the 15th day of Christmas Perl6 said let there be life-sciences, and so there was.

A lot of coverage and testament is made to Perl’s role in developing dynamic content for the early web, as well as to its slow decline. However, there is an equally important story that goes alongside the tale of Perl as the language of CGI. That is the story of Perl as the scientific glue that saved the human genome project; there is even a published journal article in Perl’s honour.

It’s not an accident that Perl did well and continues to do well in the field of Bioinformatics. Few languages at the time provided first class semantics that matched the problem domain such as: regular expression literals, good string handling, trivial data structure creation, sensible automatic type conversion and XS for native code wrapping when speed was important. That Perl replaced a lot of sed+awk use cases for one liners is also a welcome feature for a busy bioinformatician. Perl persists today as a de facto scripting language due to the large community effort known as BioPerl. For those not in the know the API of BioPerl inspired a whole host of work in most other languages such as Biopython, BioJava, BioRuby etc. To date none are quite as feature complete as BioPerl though each has its unique additional features. Biopython is getting there from a lot of work done through GSoC action in recent years. I also use ete2 from Python on a regular basis and recommend people check that out for phylogenetics work. However, this post is about the future and I think Perl 6 has a lot to offer everyone working in bioinformatics.

XS, the Perl 5 system for integrating native code into a Perl module, has done fantastically well in providing scientists with the ability to integrate high-performance C code with the simplicity of Perl’s syntax. But XS in the Perl 6 world has been replaced with NativeCall. This library makes it incredibly simple to integrate C code with a Perl 6 class or subroutine. I’m not going to cover this here though; the repo has plenty of docs and examples, and two previous advent articles demonstrate use cases.

Instead, everything on show here uses core features available from typing ‘perl6’ at a command prompt. You don’t need any external dependencies, so any system with vanilla Rakudo Perl 6 can do everything you’re about to see.

The Grammar of life

Hopefully most people have at least heard about DNA, RNA and perhaps even proteins. All of these are what are referred to in the biz as biopolymers, or translated into programming parlance “strings of life”. Essentially each molecule is a big chain of smaller building blocks you can denote in a textual form with a letter. In DNA we have the alphabet GCTA , in RNA we never see a T instead we see a U in its place. Proteins have their own alphabet that is unrelated to DNA and RNA between 21 and 23 characters long, the last two are very quirky and relatively rare.

So in this section I’m going to outline the hip way to parse sequence files in Perl6. This is how sequencing data usually comes into a bioinformatician. A giant text file! I’ll be using our own class to hold the sequence and a well defined grammar for the well loved (not) FASTA format. So to start here is the Grammar as if it fell from heaven as a shooting star this wintery night:

grammar FASTA::Grammar {
    token TOP { <record>+ }

    token record { ">"<id><comment>?"\n"<sequence> }

    token id { <-[\ \n]>+ }

    token comment { " "<-[\n]>+ }

    proto rule sequence {*}
          rule sequence:sym<dna> { <[ACGTRYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<rna> { <[ACGURYKMSWBDHVNX\-\n]>+ }
          rule sequence:sym<aa> { <[A..Z\*\-\n]>+ }
}

Cool we have a super formal grammar defined for FASTA and it even looks like we are doing something special for different types of sequence! I guess I had better explain a little, keep in mind that Grammar is just a huge extension to regular expressions which you are hopefully familiar with.

First up we have token and rule. The difference is that token is like a hungry hippo, it consumes the input sequence and is quite greedy. So if a token matches at all it consumes the string and then lets parsing continue. If there is a choice of tokens the longest wins, which can be thought of as the winner in a game of hungry hippos!
A rule is a bit more tricksome, it has backtracking and can regurgitate and give up. Specifically we have a set of rules called sequence with the more abstract one defined with the keyword proto. The reason we have three rules is for each type of biopolymer, they are ordered by the most to least specific. Unfortunately DNA and RNA can both look like protein. So our Grammar is limited in its ability to detect the difference, but so would a human that’s how rubbish this major data format is!
With rule unlike token if the start of a sequence matches DNA but the rest of it looks like protein we back out of matching DNA because the grammar is smart enough to realise it can get a longer match from the other rule, but has to spew up the letters already consumed first. If we had used token we might have had a really short DNA token then a protein one etc. This might all sound a bit complicated but I’ve made a trace using Grammar::Debugger of what happens matching a test.fa file in this Gist.
If you do look at the trace you will see the final important fact about Grammars. We need a TOP token or rule, this is where we start the thought process and parsing of data. So a FASTA file is made of more than one record, that’s our TOP. A record is made of an id, a comment and a sequence and so the logic flows. The Perl6 regular expression syntax is way out of scope but Masak has a nice intro.

Now let's get this grammar rolling along and doing something. If we want to use a sequence in our code we need a basic type to represent it. Below is probably the simplest thing you would want:

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";
}

It’s not the most inspired or interesting class in the world but it will do!

Now how do we hook up our Seq class to our FASTA grammar to produce some nice objects from our FASTA file? The answer is we need to take action! Not the political kind; we just create an Actions class for the grammar to use when parsing. By default we could call .parse on our grammar above and we would get all the text chunked up under nicely named labels in a giant tree of Match objects. Some of the time that's what you want. Action classes have methods that match the names of tokens and rules in a grammar. Each method is called at the corresponding point in the grammar's parse tree and given the chance to make something more interesting, replacing all the crummy Match objects in that part of the output. So what do we want?

class FASTA::Actions {
    method TOP($/) {
        make $<record>>>.ast;
    }
    method record($/) {
        make Seq.new(
            id => ~$<id>,
            comment => ($<comment>) ?? (~$<comment>).trim !! '',
            sequence => $<sequence>.subst("\n","", :g)
        );
    }
}

So the logic above is: whenever we parse out the whole of a record, we make a new Seq object and shove it into the resulting match tree for later. You can see that we do some trimming and cleaning up of whitespace in the action before creating a clean Seq object. Finally, the TOP action basically says that the final product of .parse should be all the record matches, which we just transformed into Seq objects, so it's literally a list of Seq! Nice! Let's give it a whirl on that test file I mentioned:

Input

>hello
GCTATATAAGC
>world prot
TATAKEKEKELKL

Code

my $grammar = FASTA::Grammar.new;
for $grammar.parsefile("test.fa", actions => FASTA::Actions).ast -> $seq {
    say $seq.id;
}

Output

hello
world

Ultimately we didn't need the added complexity of matching a sequence as DNA/RNA/protein, but it was kind of cool, and at a later date we might want to do something different and have fancier Seq objects. I'll leave exploration of this as a task for the interested reader :P

A Bag full of sequence

So I'm concerned that our Seq class is still kind of crap. It might as well just be a hash of values at this point. Sure, we get a type, which is nice, but it just doesn't have anything cool about it. A nice addition would be to make our class work a little better with the rest of the Perl 6 family of operations, letting more of the language's magic get at our data. The results of this simple change might actually impress you!

class Seq {
    has Str $.id = "";
    has Str $.comment = "";
    has Str $.sequence = "";

    multi method Numeric (Seq:D: --> Numeric) {
        $.sequence.codes;
    }

    multi method Bag (Seq:D: --> Bag) {
        bag $.sequence.comb;
    }

    multi method Str (Seq:D: --> Str) {
        $.sequence;
    }

    multi method gist (Seq:D: --> Str) {
        ">$.id $.comment\n$.sequence";
    }
}

Above I've introduced some snazzy new methods specific to our class. Anyone familiar with toString from Java should get what both the Str and gist methods are up to: they are well-known methods in Perl 6 that deal with interpreting an object as a string. But it's probably not clear why you would want two of them. Simply put, the Str method is there for the pure string interpretation of our object; if I use operators that expect a Str, it is called to automagically turn our Seq into a string. The gist method is a neat one: the higher-level say function calls .gist on our object and prints that, since it's assumed you want something a bit more human-readable. Even neater, the command-line REPL immediately calls .gist on the result of an expression. So when we create a new Seq in the Perl 6 REPL, or just give the variable name on its own, we get FASTA ready to copy and paste! Take a look at the REPL dump below and see if you can work out where we are implicitly using $seq as a Str rather than a Seq:

> my $seq = Seq.new(id=>'hello',comment=>'world',sequence=>'GCTACCTATAAAGC');
>hello world
GCTACCTATAAAGC
> say $seq
>hello world
GCTACCTATAAAGC
> "junk" ~ $seq
junkGCTACCTATAAAGC
> my $big-seq = Seq.new(id=>'whole',comment=>'new world',sequence=> $seq x 3);
>whole new world
GCTACCTATAAAGCGCTACCTATAAAGCGCTACCTATAAAGC
> print $seq
GCTACCTATAAAGC>

But wait! There's more. For the low, low price of just two more methods we get numeric wrangling and statistical modelling! The Numeric method does what it says on the tin; we defined it as the length of the sequence, which feels reasonable as a numerical representation.

> $seq + $big-seq
56
> ($seq + $big-seq) / $seq
4

The Bag method creates a Bag object from our sequence. The Bag we create is an immutable counting dictionary of all the letters in the sequence, so we now have the first step in some frequency analysis: letters and a count of how often they occurred. What is even more awesome about Bags in Perl 6 are the pick and roll methods, which let you sample the keys of the Bag based on their frequency! So if you wanted to create some dummy sequences to do statistical tests with, it's easy as pie. Roll keeps the letters in the bag as you sample, whereas pick removes them and so alters the distribution of the next pick. Below I give an example where we .pick strings that are as long as the input sequence and randomly different, but with the same distribution of characters. This sort of dummy sequence is useful for calculating the statistical significance of our sequence.

> $seq.Bag
bag(G(2), C(4), T(3), A(5))
> $seq.Bag.pick($seq).join
CAGATCGTAAATCC
> $seq.Bag.pick($seq).join
GAATCCCGTACATA

Look at that: we even use $seq as the number of characters to pick! Perhaps not the most obvious or readable thing in the world, but for playing in the REPL it's a nice feature, and not utterly unintuitive to people used to context in Perl.
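For contrast, here is a sketch of .roll, which samples with replacement; I use a bare bag rather than our Seq class so the snippet stands alone, and since the sampling is random the output differs between runs:

```perl6
# .roll leaves the bag untouched, so every draw uses the same
# letter frequencies; we can ask for any number of letters.
my $bag = bag 'GCTACCTATAAAGC'.comb;
say $bag.roll(20).join;   # 20 letters, weighted towards A
say $bag.roll(20).join;   # a different 20 letters, same distribution
```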

A Range of possibilities

A common task in bioinformatics is dealing with what is commonly referred to as "annotation" data. This can be anything from specifying the location of a gene in a whole genome to specifying a patterned site within a protein sequence. In BioPerl there is a specific Bio::Range class for dealing with this sort of data, which represents a segment of a number line with a start and an end position. In Perl 6 this is built right into the language, with most of the features any bioinformatician would be happy with; plus Ranges can be wrangled into Sets and used with the set operators. Creating a Range couldn't be easier or more obvious; you've all done it before!

> my $range = 1..10;
> $range.WHAT.say;
(Range)

This is the exact same syntax you would use to write a for loop over a given range in both Perl 5 and Perl 6, except that in Perl 6 you get a first-class object.
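Because a Range is a real object, it also carries some handy methods; here's a quick sketch on the same range (the loop body is just a toy):

```perl6
my $range = 1..10;
say $range.min;     # 1
say $range.max;     # 10
say $range.elems;   # 10
# And it drives a for loop directly:
for $range -> $i {
    print $i %% 2 ?? '+' !! '-';
}
print "\n";         # -+-+-+-+-+
```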

To show what we can do with this basic type in Perl 6 I've taken some data about the gene p53, which is well known for its role in the development of many cancers. It's all real data, and you can see a coloured block diagram of the protein annotation data here. If you think of each block as a single Range object you get some idea of what I'm harping on about. So let's start out with some DNA coordinates and some protein features in code, remembering that we could just as easily read these from a file:

#Genomic chromosome coordinates of the p53 gene in human
my $p53_gene = 7_565_097 .. 7_590_863;

#A single nucleotide polymorphism AKA a point mutation
my $snp = 7_575_996;

#Some annotated protein domains
my %domains = p53-TF => 97..287, p53-Tetramer => 319..357;

#Predicted interaction site
my $binding_site = 322..355;

#A known protein Phosphorylation site
my $phospho_site = 99;

So far so good: we have some one-off integer values, some Ranges, and a cool hash of Ranges! The bonus Perl 6 features come from operators that accept a Range as an operand. Most importantly, set operations work as you might expect:

#Work out if the SNP is within the gene
say "MUTATION!!!" if $p53_gene (cont) $snp;

#Find out which domains contain the phosphorylation site
my %phosphorylated = %domains.grep({ .value (cont) $phospho_site });

#Find out which domains are in the binding site and what percentage is binding
my %binding = %domains.map({
                            $_.key => 100 / $_.value.elems * ($_.value (&) $binding_site)
                      })
                      .grep(*.value > 0.0);

Hopefully a lot of the above is almost self-explanatory once I introduce the set operators for contains, (cont) or ∋, and intersection, (&) or ∩. I chose the ASCII variants, mostly because I don't have a ∩ key on my keyboard and I wouldn't want my code to be trashed if someone used a non-Unicode editor! The contains operator returns a Bool indicating whether the left operand contains the right, and the intersection operator returns the set of numbers common to both Ranges. The final hash wrangling is quite nice: it keeps the keys of the %domains hash but replaces each value with the percentage of the domain covered by our binding site, and only gives a result if there was any coverage between the two Ranges. Hopefully almost any bioinformatician can see how nice it is to have these sorts of primitive types and operations in the core of a language.

A Curio

If any of the above has you at least a little curious, if not excited, to see what Perl 6 can do within bioinformatics, then you can check out more at either Chris Fields' official BioPerl6 repo or my own playpen. I'll leave you with one final code nugget to chew on. I found this by accident while writing a method to wrangle DNA sequence into RNA sequence in Perl 6, something you might like to add to our Seq class. This is approximately the session I had at the REPL:

> my $dna = 'GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA';
> my $rna = $dna ~~ tr/GCAT/GCAU/;
> $rna.WHAT.say
(StrDistance)

WAT?! I was originally expecting to just get the translated string. (As an aside, you should use the method $dna.trans('T' => 'U') if that's what you want.) Instead I got back this odd StrDistance object… What could it be? I did some basic inspection to take a look:
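For the record, a quick sketch of the .trans route, which really does hand back the translated string:

```perl6
my $dna = 'GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAA';
# .trans returns a new Str with the substitution applied,
# leaving $dna itself untouched.
my $rna = $dna.trans('T' => 'U');
say $rna.WHAT;            # (Str)
say $rna.substr(0, 12);   # GAUGGGAUUGGG
```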

> $rna.perl
StrDistance.new(before => "GATG…TAAAA", after => "GAUG…UAAAA")
> $rna.^methods
Bool Numeric Int <anon> <anon>
> $rna.^attributes
Str $!before Str $!after Int $!distance

So in the .perl output we see that .before and .after are the two strings: the DNA and the newly translated RNA. But there is also this other guff like .Numeric and .Int. Turns out it's something awesome!

> say +$rna;
13

Lucky number 13… or rather, the Hamming distance between the two strings! Anyone from the world of bioinformatics will instantly see this as quite a handy feature. Perl 6 wasn't created or invented for bioinformatics, but it has so many accidentally great features lying around, relatively uncommented on.

Sorry for the blowhard post but I thought it might be nice to put these ideas all in one place. Have a happy Christmas and New Year!

Day 14 – A Perl 6 Timer: Ready, Set, Go!

December 13, 2014 by

Christmas is coming!

With Perl 6 promising to be ready for production next year, now is the time to stop lurking and start learning. It's exciting! But what to try first?

OK, so I've downloaded Rakudo and tinkered with the REPL. Time to try translating a simple module that I often use in Perl 5 projects. It's a simple timer that records events and renders a timed report:

my $timer = Timer->new;
$timer->record('first event');
$timer->record('second event');
say $timer->report;

Here’s the output:

[0.000] first event
[0.007] second event

It helps trace code and identify bottlenecks. Here’s a Perl 5.20 version:


package Timer;

use Time::HiRes qw(gettimeofday tv_interval);
use Mojo::Base -base;
use experimental 'signatures', 'postderef';

has '_event_log'  => sub { [] };
has '_start_time' => sub { [ gettimeofday ] };

sub record ($self, $event_description) {
    # record the event and timestamp
    my $event = {
        timestamp   => sprintf("%.3f", tv_interval($self->_start_time)),
        description => $event_description,
    };
    push($self->_event_log->@*, $event);
}

sub report ($self) {
    # render a full report
    my $report = '';
    foreach my $event ($self->_event_log->@*) {
         $report .= '[' . $event->{timestamp} . '] ' . $event->{description} . "\n";
    }
    return $report;
}
1;

Now how would this look in Perl 6?

Perl 6 does full-flavour, duck-quacking OO, so we can replace Perl 5's package and declare a Timer class:


class Timer {
    ...
}

No need for that spurious 1; at the end of the file any more, either. We need to store an initial start time and a list of event records. In Perl 6, object attributes are defined with the help of 'twigils', that is, secondary sigils:


class Timer {
    has $!start-time;  # private scalar storing the starting time
    has @!event-log;   # private list of events 
                       # dashes-in-identifiers save keystrokes too
}

The exclamation mark here in $!start-time is a secondary sigil and declares the attribute as private. The @!event-log list is also marked as private.

You can imagine the ! character as a portcullis dropping down at the entrance of a castle, blocking external access to the attribute behind it. The . twigil is used for public attributes (e.g., $.this-attribute-is-public-no-portcullis-here).

The twigil tells you at first glance what an attribute’s contract is with the class. No more context-switching and scrolling back to the top of the file to find out. Thanks Larry for saving some brain-space!

So the attributes are defined, what about the methods, record() and report()?

The record() method starts the internal timer the first time it's called, so it needs an equivalent of Perl 5's Time::HiRes. A quick Google search brings up an example on RosettaCode.org. Like the ancient Rosetta Stone, this site is useful for translating between languages and is a hangout for early-adopting polyglot programmers. Perl 5 and Perl 6 are well represented there, so it's a great place for converting Perl 5-isms into Perl 6. It seems now is the new gettimeofday!

I had a little chat with the Perl 6 REPL about this:


> now # returns an Instant object
Instant:1418191166.678494
> now.HOW # access the Higher Order Workings
Perl6::Metamodel::ClassHOW.new()
> now.^methods # use ^ to access the HOW
new from-posix to-posix Bridge Num Int narrow Rat abs sign conj sqrt rand sin asin cos acos tan atan atan2 sec asec cosec acosec cotan acotan sinh asinh cosh acosh tanh atanh sech asech cosech acosech cotanh acotanh floor ceiling round unpolar cis Complex log exp truncate isNaN base Real log10 roots succ pred sleep Str perl Numeric ACCEPTS Bool gist DUMP <anon>
> now.perl # Data::Dumper in Perl 6
Instant.new(x => <1127461994944/795>)
> now.gist # Human readable explanation
Instant:1418191194.695649
> my $time = now
Instant:1418191201.250955
> now - $time # can we get the difference in time?
7.6888786

So this will do the trick! The record() method works like a stopwatch snapshotting the difference since the start:


method record ($event_description) {
    # initialise the start-time if this is the first time through
    $!start-time //= now;

    @!event-log.push({
        timestamp   => now - $!start-time, # take the difference
        description => $event_description
    });
}

The report() method pretty prints the @!event-log:


method report {
    my @report_lines;
    for @!event-log -> %event {
         @report_lines.push("[%event<timestamp>.fmt("%.3f")] %event<description>");
    }
    return @report_lines.join("\n");
}

Metaphorically speaking the for loop unrolls the @!event-log carpet (@) shooting values (->) into individual hash entries (%event)!

Sigils are invariant in Perl 6, so a %hash means %hash, and the sigil doesn't change while accessing a hash value: %hash<key>. Interpolation into a string just works. The full Perl 6 version looks like this:


class Timer {
    has $!start-time;
    has @!event-log;

    method record ($event_description) {
        # initialise the start-time if this is the first time through
        $!start-time //= now;
        @!event-log.push({
            timestamp   => now - $!start-time,
            description => $event_description
        });
    }
    method report {
        my @report_lines;
        for @!event-log -> %event {
             @report_lines.push("[%event<timestamp>.fmt("%.3f")] %event<description>");
        }
        return @report_lines.join("\n");
    }
}

I’m happy with that!

Now … what can I learn next?

How about converting the class into a Perl 6 role instead? There's a learning challenge for you.

Let me start the timer … Ready, Set, Go!

Day 13 – String Interpolation and the Zen Slice

December 13, 2014 by

So you think you know all about string interpolation in Perl 6?

Well, especially coming from Perl 5, you may find some things that do not work exactly as you’re used to. The simple examples all work, as it seems:

my $a = 42;
say 'value = $a'; # value = $a
say "value = $a"; # value = 42

my @a = ^10;
say 'value = @a'; # value = @a
say "value = @a"; # value = @a HUH??

In earlier versions of Perl 5 (or was it Perl 4?), this gave the same result. At one point, it was decided that arrays would be interpolated in double quoted strings as well. However, this caused quite a few problems with double quoted texts with email addresses in them: you would need to escape each @ (if you were using strict, which you of course should have). And if you forgot, and there was an array that just happened to have the same name as the user part of an email address, you would be in for a surprise. Or if you didn’t use strict, you would suddenly have the text without the email address. But then again, you got what you asked for.

So how can we make this work in Perl 6?

Introducing the Zen slice

The Zen slice on an object returns the object. It's as if you specify nothing, and get everything. So what does that look like?

my @a = ^10;
say "value = @a[]"; # value = 0 1 2 3 4 5 6 7 8 9

You will have to make sure that you use the right indexers for the type of variable that you’re interpolating.

my %h = a => 42, b => 666;
say "value = %h{}"; # value = a 42 b 666

Note that the Zen slice on a hash returns both keys and values, whereas the Zen slice on an array only returns the values. This seems inconsistent, until you realize that you can think of a hash as a list of Pairs.

The Zen slice only really exists at compile time. So you will not get everything if your slice specification is an empty list at runtime:

my @a;
my %h = a => 42, b => 666;
# a slice, but not a Zen slice:
say "value = %h{@a}"; # value =

So the only way you can specify a Zen slice is if there is nothing (but whitespace) between the slice delimiters.

The Whatever slice

The * (Whatever) slice is different: the Whatever will fill in all the keys that exist in the object, and thus return only the values of a hash.

my %h = a => 42, b => 666;
say "value = %h{*}"; # value = 42 666

For arrays, there isn’t really any difference at the moment (although that may change in the future when multi-dimensional arrays are fleshed out more).

Interpolating results from subs and methods

In double quoted strings, you can also interpolate subroutine calls, as long as they start with an '&' and have a set of parentheses (even if you don't want to pass any arguments):

sub a { 42 }
say "value = &a()"; # value = 42

But it doesn't stop at calling subs: you can also call a method on a variable, as long as it has parentheses at the end:

my %h = a => 42, b => 666;
say "value = %h.keys()"; # value = b a

And it doesn’t stay limited to a single method call: you can have as many as you want, provided the last one has parentheses:

my %h = a => 42, b => 666;
say "value = %h.perl.EVAL.perl.EVAL.perl()"; # value = ("b" => 666, "a" => 42).hash

Interpolating expressions

If you want to interpolate an expression in a double quoted string, you can also do that by providing an executable block inside the string:

say "6 * 7 = { 6 * 7 }"; # 6 * 7 = 42

The result of the execution of the block is what will be interpolated in the string. Well, what is really interpolated in a string is the result of calling the .Str method on the value. This is different from just saying a value, in which case the .gist method is called. Suppose we have our own class with its own .gist and .Str methods:

class A {
    method Str { "foo" }
    method gist { "bar" }
}
say "value = { A }"; # value = foo
say "value = ", A;   # value = bar

Conclusion

String interpolation in Perl 6 is very powerful. As you can see, the Zen slice makes it easy to interpolate whole arrays and hashes in a string.

In this post I have only scratched the surface of string interpolation. Please check out Quoting on Steroids in a few days for more about quoting constructs.

Day 12 – Towards cleaner JVM-Interoperability

December 12, 2014 by

As some of our readers might remember, interoperability on Rakudo JVM has been described as “basic” by jnthn++ last year. Which is to say, calling methods on Java objects inside Rakudo on the JVM was possible, but it wasn’t always pretty or convenient. As a reminder:

    use java::util::zip::CRC32:from<java>;
    my $crc = CRC32.new();
    for 'Hello, Java'.encode('UTF-8').list {
        $crc.'method/update/(I)V'($_);
    }

In this post I want to describe my journey towards figuring out why the method call looks like that and what we can do to improve it.

What’s wrong with that method call?

The reason behind the unusual method call is that there’s no multi-method dispatch (MMD) for overloaded methods on Java objects from the Perl 6 side. Sure enough, checking the javadoc for java.util.zip.CRC32 reveals that update() is overloaded with three candidates. Readers familiar with the JVM might notice that the method we call on $crc is a method descriptor, defined in the JVM specification as

A method descriptor represents the parameters 
that the method takes and the value that it returns.
-- https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3.3

In our case this means "a method called update which takes a primitive int and returns void" [1]. Clearly Rakudo should figure out on its own which candidate we want, considering that Rakudo has MMD on Perl 6 objects and Java has compile-time dispatch on its own objects as well. Let's see what it takes!

The nitty-gritty, if only some of it

Let’s start at the top of the code example. We’re starting with a use statement which by and large does the same in Perl 6 as it does in Perl 5, except we are supplying a longname consisting of the module name and the colonpair :from<java>. The colonpair with key from tells the ModuleLoader to not use the default Perl 6 module loader, but a different one, in our case the JavaModuleLoader.

In JavaModuleLoader::load_module we’re starting to tread towards vm-level code. After making sure we have an interop loader, we call out to Java code with typeForNameFromJar() or typeForName() respectively. This is where we’re leaving NQP code and entering Java code on our trail. Next stop: org.perl6.nqp.runtime.BootJavaInterop

typeForName() and typeForNameFromJar() both do some amount of path-munging to find the .class or .jar file, build a ClassLoader with the path where they found those files and pass the loaded class to getSTableForClass. A STable, or shared table, represents a pairing of a HOW and a REPR, that is, a pairing between a metaobject and the vm-level representation of an object of that type. Creation of a STable happens lazily, via a ClassValue, where the InteropInfo remembers the Java class it represents as well as the computed interop and STable. The important thing we’re looking for is set inside computeInterop, where the documentation rightfully states the gory details begin. The details in question concern themselves with bytecode generation via the framework org.objectweb.asm, although most of the aforementioned details are not particularly important at this stage. What is important though is the following bit:

    HashMap<String, SixModelObject> names = new HashMap< >();
    // ...
    for (int i = 0; i < adaptor.descriptors.size(); i++) {
        String desc = adaptor.descriptors.get(i);
        SixModelObject cr = adaptorUnit.lookupCodeRef(i);
        int s1 = desc.indexOf('/');
        int s2 = desc.indexOf('/', s1+1);
        // create shortname
        String shorten = desc.substring(s1+1, s2);
        // XXX throw shortname => coderef away if shortname is overloaded
        names.put(shorten, names.containsKey(shorten) ? null : cr);
        // but keep the descriptor
        names.put(desc, cr);
    }
    // ...
    freshType.st.MethodCache = names;

The adaptor object is a dataholder for a multitude of things we need while generating the adaptor, for example the name of the adaptor class or a list of the constants it has to contain, or even the CompilationUnit we generated in bytecode. The adaptorUnit is a shorthand for the mentioned CompilationUnit. What happens here is that we construct the MethodCache (which is a HashMap) by extracting the shortname of the method out of the descriptor and adding the shortname as key with the CodeRef as value, as long as we only have one candidate. To be sure we aren’t forgetting anything, we also add the descriptor as key to the MethodCache with the same CodeRef. Thus we have figured out why the method call in the original example has to look the way it does: we don’t even know the method by its shortname.

Great. How do we fix it?

Dynamically. Which is a bit complicated on the JVM, because Java is statically typed. Luckily JSR 292 [2] was approved some time ago, which means we have an instruction available that "supports efficient and flexible execution of method invocations in the absence of static type information". The basic mechanics of this are as follows:

    • While generating bytecode for the runtime of our language we insert an invokedynamic instruction in a place where we want to be able to decide at runtime which method we want to invoke.
    • This instruction references a slightly special method (usually called a bootstrap method) with a specific signature. Most importantly this method has to return a java.lang.invoke.CallSite.
    • When the invokedynamic instruction is encountered while executing the bytecode, the bootstrap method is called.
    • Afterwards the resulting CallSite is installed in place of the invokedynamic instruction, and on repeated calls the method installed with the CallSite will be invoked instead of the bootstrap method.
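The moving parts can be sketched in plain Java with the java.lang.invoke API; the class and method names below (Indy, greet, demo) are invented for illustration, and the code simulates the linkage by calling the bootstrap method directly rather than through a real invokedynamic instruction emitted in bytecode:

```java
import java.lang.invoke.*;

public class Indy {
    // The target method the CallSite will eventually dispatch to.
    static String greet(String who) { return "Hello, " + who; }

    // Bootstrap method: called once per invokedynamic site; it must
    // return the CallSite that gets installed in place of the instruction.
    static CallSite bootstrap(MethodHandles.Lookup lookup, String name,
                              MethodType type) throws ReflectiveOperationException {
        MethodHandle target = lookup.findStatic(Indy.class, name, type);
        return new ConstantCallSite(target);
    }

    static String demo() throws Throwable {
        MethodType mt = MethodType.methodType(String.class, String.class);
        // What the JVM does on first hitting the invokedynamic instruction:
        CallSite cs = bootstrap(MethodHandles.lookup(), "greet", mt);
        // Every later invocation goes straight through the installed handle:
        return (String) cs.dynamicInvoker().invokeExact("Java");
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(demo());   // prints "Hello, Java"
    }
}
```

A ConstantCallSite is fixed forever; a MutableCallSite whose initial target is a slower fallback method is what allows the dispatch decision to happen lazily and be replaced later.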

To be able to catch methods that we want to treat as multi on the Perl 6 side, we have to change how the adaptors are generated. Recall that we currently only know which methods are overloaded after we generated the adaptor, thus we’re too late to insert an invokedynamic as adaptor method. So we override BootJavaInterop.createAdaptor, and instead of walking through all methods and simply creating an adaptor method directly, we additionally memorize which methods would end up having the same shortname and generate an invokedynamic instruction for those as well.

There’s one more problem though. The fact that we have a shortname that should dispatch to different methods depending on the arguments means that we can’t take full advantage of installing a CallSite. This is because any given CallSite always dispatches to exactly one method, and method signatures in Java are statically typed. Luckily we can instead resolve to a CallSite which installs a fallback method, which does the actual resolution of methods. [3]

To summarize briefly: via invokedynamic we install a CallSite that dispatches to a fallback method, which converts the Perl 6 arguments to Java objects and then looks among all methods with the same shortname for one that fits the argument types. I won't paste the org.objectweb.asm instructions for generating bytecode here, but the fallback method looks approximately as follows:

Object fallback(Object intc, Object incf, Object incsd, Object... args) throws Throwable {
    // some bookkeeping omitted
    Object[] argVals = new Object[args.length];
    for(int i = 0; i < args.length; ++i) {
        // unboxing of Perl 6 types to Java objects happens here
    }

    int handlePos = -1;
    OUTER: for( int i = 0; i < hlist.length; ++i ) {
        Type[] argTypes = Type.getArgumentTypes(((MethodHandle)hlist[i]).type().toMethodDescriptorString());
        if(argTypes.length != argVals.length) {
            // Different amount of parameters never matches
            continue OUTER;
        }
        INNER: for( int j = 1; j < argTypes.length; ++j ) {
            if( !argTypes[j].equals(Type.getType(argVals[j].getClass())) ) {
                switch (argTypes[j].getSort()) {
                    // comparison of types of the unboxed objects with 
                    // the types of the method parameters happens here
                } 
                // if we didn't match the current signature type we can skip this method
                continue OUTER;
            }
        }
        handlePos = i;
        break;
    }
    if( handlePos == -1 ) {
        // die here, we didn't find a matching method
    }

    // create a MethodHandle with a boxed return type
    MethodType objRetType = ((MethodHandle)hlist[handlePos]).type().changeReturnType(Object.class);
    // and convince our adaptor method to return that type instead
    MethodHandle resHandle = ((MethodHandle) hlist[handlePos]).asType(objRetType);

    MethodHandle rfh;

    try {
        // here's where we look for the method to box the return values
    } catch (NoSuchMethodException|IllegalAccessException nsme) {
        throw ExceptionHandling.dieInternal(tc,
            "Couldn't find the method for filtering return values from Java.");
    }

    MethodHandle rethandle = MethodHandles.filterReturnValue(resHandle, (MethodHandle) rfh);
    return ((MethodHandle)rethandle).invokeWithArguments(argVals);
}

The curious may check the whole file here to see the omitted parts (which includes heaps of debug output), although you’d also have to build NQP from this branch for the ability to load .class files as well as a few fixes needed for overriding some of the methods of classes contained in BootJavaInterop if you wanted to compile it.

The result of these changes can be demonstrated with the following classfile Bar.java (which has to be compiled with javac Bar.java)

public class Bar {

    public String whatsMyDesc(int in, String str) {
        String out = "called 'method/answer/(ILjava/lang/String;)Ljava/lang/String;";
        out += "\nparameters are " + in + " and " + str;
        return out;
    }

    public String whatsMyDesc(String str, int in) {
        String out = "called 'method/answer/(Ljava/lang/String;I)Ljava/lang/String;";
        out += "\nparameters are " + str + " and " + in;
        return out;
    }
}

and this oneliner:

$ perl6-j -e'use Bar:from<java>;
my $b = Bar.new; 
say $b.whatsMyDesc(1, "hi"); 
say $b.whatsMyDesc("bye", 2)'
called 'method/answer/(ILjava/lang/String;)Ljava/lang/String;
parameters are 1 and hi
called 'method/answer/(Ljava/lang/String;I)Ljava/lang/String;
parameters are bye and 2

Closing thoughts

While working on this I discovered a few old-looking bugs with marshalling reference-type return values. This means that the current state can successfully dispatch among shortname methods, but only value types and java.lang.String can be returned without causing problems, either while marshalling to Perl 6 objects or when calling Perl 6 methods on them. Additionally, there are a few cases where we can't sensibly decide which candidate to dispatch to, e.g. when two methods differ only in the byte size of a primitive type. Suppose one of

    public String foo(int in) {
        // ...
    }
    public String foo(short in) {
        // ...
    }

is called with a Perl 6 type that does not supply a byte size as argument, let's say Int. This is currently resolved by silently choosing the first (that is, declared first) candidate without mentioning anything about it, but it should eventually warn about the ambiguity and give the method descriptors for all matching candidates, to facilitate an informed choice by the user. Another problem, probably the biggest that's not quite resolved, is invocation of overloaded constructors. Constructors in Java are a bit more than just static methods, and handling them doesn't quite work properly yet, although it's next on my list.
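
The "silently pick the first declared candidate" fallback just described can be sketched in Python (everything here – the descriptor strings, the helper names, the type-collapsing rule – is illustrative, not NQP's actual code):

```python
import warnings

def pick_candidate(candidates, arg_types):
    """Pick the first declared candidate whose parameter kinds match the
    argument kinds; warn when several remain (e.g. int vs. short, which
    a Perl 6 Int cannot distinguish, since it carries no byte size)."""
    def kind(t):
        # Collapse the sized integer types into one kind.
        return "int" if t in ("byte", "short", "int", "long") else t
    matches = [c for c in candidates
               if [kind(p) for p in c["params"]] == [kind(a) for a in arg_types]]
    if len(matches) > 1:
        warnings.warn("ambiguous dispatch; candidates: "
                      + ", ".join(c["desc"] for c in matches))
    # Declaration order decides; None means no candidate fits at all.
    return matches[0] if matches else None

candidates = [
    {"desc": "(I)Ljava/lang/String;", "params": ["int"]},
    {"desc": "(S)Ljava/lang/String;", "params": ["short"]},
]
# Both candidates match an Int argument; the first declared one wins.
print(pick_candidate(candidates, ["int"])["desc"])   # (I)Ljava/lang/String;
```

The warning is the part the current implementation lacks; the selection itself matches the declared-first behaviour described above.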

These shortcomings obviously need to be fixed, which means there's still work to be done; but the thing I set out to do – improving how method calls to Java objects look on the surface in Perl 6 code – has definitely improved.

Day 11 – So, what does MoarVM do with your Perl 6 code?

December 11, 2014 by

The first day of the advent calendar remarked that during 2014, MoarVM became the de facto standard virtual machine to run Perl 6 on. And, with Perl 6 on course to hit 6.0.0 in 2015, MoarVM is what most people adopting Perl 6 at this point will be working with. In this post, we’ll take a look at what MoarVM brings to the table, and where it’s heading from here.

But first – why a VM?

Many modern languages depend on a good deal of runtime support. Perl 6 is very much in this camp, and the same goes for Perl 5. A VM is in the business of providing that support. With Perl 5, the word “interpreter” tends to be used over VM, but the goal is the same: provide the things needed to execute programs in the language across disparate hardware and operating system environments. The difference is mostly that we tend to use VM when there’s a clean separation of the language compiler and the runtime support – meaning that the VM might also be used to execute other languages. This has happened with the JVM, for example. MoarVM is squarely aimed at being a good host for the Rakudo Perl 6 compiler, and the NQP subset of Perl 6 that Rakudo is mostly implemented in. If any other language finds it useful, that’s fine, but it’s not a primary design consideration.

So why care so much about a clean interface between compiler and runtime? Because, in a project the size of “implement Perl 6”, managing implementation complexity is crucial. Establishing a clean boundary between the compiler and the runtime, and knowing what each side is responsible for, is one strategy for doing that. (Rakudo’s compilation process is also broken up into a pipeline of stages that only communicate by passing well-defined data structures from one to the next, which is another form of architectural boundary.)

But why interpreters and VMs at all? Why not go straight down to the metal? For any language where we can reasonably figure out a lot about the program statically, that is a fine strategy. The information needed to eliminate many forms of late binding, checks, and duplicate work is to hand at compile time, and a decent optimizer will be able to use that information. Of course, this approach restricts us to what can be seen at compile time. Some compilers as a result support profile-guided optimization, where we run the program on typical workloads, collect information about where time is spent and what code-paths are common and uncommon, and then compile the program again taking that data into account.

Profile-guided optimization hints at a strategy for efficiently executing languages where we can know rather less at compile time: put off most optimization until we observe the program running, and dynamically replace parts of the program with optimized versions. This isn’t a new idea, and there has been a massive amount of work done in this area. At its heart is the general principle that most programs are, if I may pun on a term from the database field, “eventually static”. Even though a Perl 6 program may have a huge number of points of potential dynamism and genericity, in most programs just a handful of them actually ever exploit it. Of course, different runs of the program on different inputs may tickle different bits of that dynamism, not to mention that as a program grows and evolves, these flexible points will be exploited here and there to (hopefully cleanly) add new functionality.

A modern VM aiming to support a language like Perl 6 is not, therefore, just there to run the code and provide services like garbage collection, OS abstraction, and so on. Its job is to take a program in all its dynamic glory, watch what code paths it actually takes, see what types really show up – and then produce optimized versions of carefully selected parts of the program, with the unused dynamism ruthlessly torn out, and much of the remaining dynamism massaged to more restricted and faster forms.

Bytecode specialization

So what exactly does MoarVM do with a program? First, it needs to find code worth considering optimizing. This is done by seeking the hot parts of a program: routines that are called again and again, and loops that are doing a lot of iterations. Since time to optimize a routine is proportional to the size of the routine, the threshold for “it’s worth considering” goes with code size. A small routine – such as a simple operator or an accessor method – has rather little code and will become hot quickly. A larger routine heats up more slowly. It’s just like our kettle: a little water boils quickly, but fill it up and we’re going to be waiting a bit for that cuppa. This is in part about risk management: we would prefer to avoid investing time optimizing code that’s never going to give a positive return on investment. We can’t predict the future, but some cheap, simple heuristics are at least a good start.
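
The size-scaled hotness heuristic might look something like this sketch (the thresholds and the scaling rule are invented for illustration; MoarVM's real numbers and mechanism differ):

```python
def make_hotness_tracker(base_threshold=100):
    """Track per-routine call counts; a routine counts as hot once its
    call count crosses a threshold that grows with its bytecode size,
    so tiny accessors heat up fast and big routines more slowly."""
    counts = {}
    def record_call(name, bytecode_size):
        counts[name] = counts.get(name, 0) + 1
        # Larger routines cost more to optimize, so demand proportionally
        # more calls before considering them worth the investment.
        threshold = base_threshold * max(1, bytecode_size // 256)
        return counts[name] >= threshold
    return record_call

record = make_hotness_tracker(base_threshold=3)
# A tiny accessor (64 bytes of bytecode) goes hot on its third call...
print([record("accessor", 64) for _ in range(3)])   # [False, False, True]
# ...while a 1 KiB routine needs four times as many calls.
print(record("big-routine", 1024))                  # False
```

The point of the sketch is only the shape of the heuristic: a cheap counter per routine, with the bar raised in proportion to the cost of optimizing it.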

So, we’ve got some hot code. What now? Well, if it’s a call, we start by looking at the callsites that are invoking it. The callsite tells us how many arguments are being passed, whether they are native values or objects, and – for named arguments – what their names are. We can then start to produce bytecode specialized by callsite. Know that the object parameter is passed an object argument? Then replace the op that checks if we need to box the incoming argument with a simpler op that just copies the pointer. Know that an optional parameter actually doesn’t get passed? Then strip out the code to look for it, toss the branch, and just always run the code that sets the default value. Know that the named parameter “config” is always the first passed named argument? Then retrieve it by indexed access (so a cheap pointer copy), rather than messing about with strings.
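
In Python terms, the callsite-driven stripping described above might be sketched like this (MoarVM of course rewrites bytecode, not closures; the cache layout and the example routine are invented):

```python
# Cache of specializations keyed by callsite shape. A real VM rewrites
# the routine's bytecode; here we just swap in a leaner closure.
spec_cache = {}

def join_generic(items, sep=None):
    """The naive version: checks for the optional argument on every call."""
    if sep is None:
        sep = ", "
    return sep.join(items)

def specialize(callsite_passes_sep):
    key = ("join", callsite_passes_sep)
    if key not in spec_cache:
        if callsite_passes_sep:
            # sep is always passed at this callsite: no presence check.
            spec_cache[key] = lambda items, sep: sep.join(items)
        else:
            # sep is never passed: strip the branch, bake in the default.
            spec_cache[key] = lambda items: ", ".join(items)
    return spec_cache[key]

print(join_generic(["a", "b"]))   # same result, but pays the check each call
fast = specialize(callsite_passes_sep=False)
print(fast(["a", "b"]))           # a, b
```

Each distinct callsite shape gets its own stripped-down version, which is exactly the "multiple specializations per piece of bytecode" idea.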

Next up: types. If we are passed arguments, what types are they? If we look up constants, what types do those have? Knowing these is relatively easy: we just peek at the args buffer to see what’s being passed, or look up the constant and see what type it is. However, knowing them also presents us with a huge range of opportunities. See that method call? It doesn’t need to look up the method in a table each call any more, since we know where it’s going. And that attribute access? Since we know the object layout, it can just become a pointer offset from the object’s memory address. That type check? We often know the answer statically now – and can eliminate entire branches that won’t be taken. Oh, and that multiple dispatch? Let’s just resolve it once now we know the types we’re dealing with, not every single call. The list goes on.

Specialization works at the bytecode level, because that’s what the VM gets. And a specialization of a given piece of bytecode comes with a set of guards: conditions that must be met for it to be applicable. Those can constrain it by callsite and argument types. And we can produce multiple specializations for a given piece of original bytecode.

Of course, choosing the appropriate specialization is a cost too. It would eat into our winnings if we had to do that on every call. Thankfully, there are ways to avoid that. If we are optimizing a call to something and know the argument types being passed, we can pick the specialization of the callee and optimize the caller to always call that specialization, rather than having to hunt for the right one each time. And, if the callee is rather small, we can often do away with the call entirely and simply inline the callee into the caller – eliminating the need to create a callframe and keeping the code tighter, which CPU caches like.

Optimization speculation

So, all this is nice, but we want MOAR MOAR MOAR. Because there are lots of other things that we can’t statically figure out the types of, but that may end up having very stable types come runtime. Some examples are lexical variables that we close over, object attributes, and return values of our callees. So how do we get hold of these types?

When code becomes hot, we don’t actually optimize it right away. Well, we do for the simple callsite-related transformations. But before going and doing the type-driven stuff, we take a closer look at what’s actually going on in the program. The specializer quickly produces an instrumented version of the program that records what types show up where. This runs for the next bunch of invocations, and logs what it sees. After a threshold is hit, we have a look at the types that showed up and do a little analysis. If a given logging site consistently logs the same type, we’re onto something. If we see more than one type show up there, we consider that part of the program as being typically dynamic and leave it be.
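
The log-then-analyze step could be sketched along these lines (the site names, the threshold, and the log layout are invented; only the monomorphic-or-leave-it decision rule comes from the text above):

```python
from collections import Counter

def analyze_type_log(log, threshold=8):
    """After enough logged invocations, decide per logging site whether
    the observed types are stable enough to speculate on. Returns the
    single type to guard on, or None for sites that proved polymorphic."""
    decisions = {}
    for site, types in log.items():
        if len(types) < threshold:
            continue                     # not enough data yet
        counts = Counter(types)
        if len(counts) == 1:
            decisions[site] = types[0]   # monomorphic: worth a guard
        else:
            decisions[site] = None       # typically dynamic: leave it be
    return decisions

log = {
    "return-of-callee": ["Int"] * 8,
    "closed-over-var":  ["Int", "Str"] * 4,
}
print(analyze_type_log(log))
# {'return-of-callee': 'Int', 'closed-over-var': None}
```

Note that this is profiling, not proof – which is why the next step needs guards.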

The thing is, we didn’t actually prove a property of the program here, just profiled it and figured out what tends to happen. So how can we use this information? The trick is to insert a guard into the specialized program that cheaply asserts that we got the expected type. Beyond that guard, we can optimize assuming we really did. (We also make sure that we actually do an optimization based upon what the guard checks, and don’t bother to insert it otherwise.)

Which is all well and good when the guard is met, but what about when the assertion fails? In fact, this is a more general problem. Even the far simpler type-based optimizations on incoming parameters assume that an object’s type will not change – and of course, thanks to mixins, they can. All of these situations trigger deoptimization: replacing the optimized code we’re in the midst of running with the unoptimized, naive, but safe version that is dynamic enough to handle all the cases.
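
The guard-plus-fallback arrangement can be sketched like so (a deliberately tiny model: real deoptimization rewrites running frames rather than catching an exception, and the two paths here are invented):

```python
class Deoptimize(Exception):
    """Raised when a speculative guard fails at runtime."""

def make_speculative(fast_path, slow_path, expected_type):
    """Wrap a speculatively-optimized path in a cheap type guard; on
    guard failure, 'deoptimize' to the safe, fully dynamic path."""
    def run(x):
        try:
            if type(x) is not expected_type:
                raise Deoptimize
            return fast_path(x)
        except Deoptimize:
            return slow_path(x)
    return run

def slow_add_one(x):
    # Fully dynamic: handles any type we might meet, however slowly.
    return x + type(x)("1") if isinstance(x, str) else x + 1

# Speculate that x is always an int; the fast path skips all dispatch.
f = make_speculative(fast_path=lambda x: x + 1,
                     slow_path=slow_add_one,
                     expected_type=int)
print(f(41), f("4"))   # 42 41
```

The guard is cheap when it holds; the whole cost of the gamble is concentrated in the rare case where it doesn't.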

Deoptimization is quite a pain to get right. For one, if we were inside of inlined code, we have to go back and create all the callframes we skipped. Since inlining can be many levels deep, that can be a bunch of callframes to wire up. Then we have to make sure that our optimizations don’t throw away code that produces a value the deoptimized code relies on being there, but the optimized code would otherwise not. Of course, we still need to do dead code elimination in general; it’s just that it has to be aware of deoptimization.

Therefore, it’s not only important that MoarVM can optimize code. It’s also critical to its ability to do so that it can also deoptimize, falling back to slow paths when needed.

Down to machine code

Interpreting specialized bytecode can yield, in some programs, a significant improvement. The simple program:

my $i = 0; while ++$i <= 100000000 { }

got a factor of 2.5 improvement with what we had several months ago, and there are still a lot of opportunities left (some of which we’ve already explored, with others yet to go). However, interpreting – that is, looping over an instruction stream and choosing what to do for each instruction – has its overhead. Before the specialization process eliminates a lot of the dynamism, interpretation is only so bad; some ops have some checks to do, and so they cost a bit. But specialized code tends to have a lot of very cheap operations that just play with pointers – and then the overhead of interpreting can quickly come to account for the majority of the execution time.
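
The overhead in question is the dispatch loop itself, which a toy stack-machine interpreter makes visible (the instruction set is invented; MoarVM's bytecode looks nothing like this):

```python
def interpret(code, stack=None):
    """A minimal stack-machine loop. Every instruction, however cheap
    its actual work, pays for the fetch, the dispatch branching, and
    the loop jump - exactly the overhead machine code does away with."""
    stack = stack or []
    pc = 0
    while pc < len(code):
        op, arg = code[pc]          # fetch
        if op == "push":            # dispatch...
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "lt":
            b, a = stack.pop(), stack.pop()
            stack.append(a < b)
        pc += 1                     # ...and loop
    return stack[-1]

print(interpret([("push", 40), ("push", 2), ("add", None)]))   # 42
```

When the `add` itself is a single machine instruction, the fetch/dispatch/loop scaffolding around it dominates – hence the win from compiling the specialized bytecode down to machine code.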

Therefore, on x64, we can often take the further step of producing machine code to run, instead of specialized bytecode to interpret. Of course, we have to be ready to deoptimize from that too. I’ll save talking about the machine code part too much, since the guy who worked on it is, I believe, plotting an advent post about it in the near future. Needless to say, it makes a dramatic difference; for the benchmark above it more than halved the time again relative to the specialized bytecode.

We often referred to the “produce machine code” part of the work as “the JIT”. But really, the entire set of optimizations described in this article are part of the typical work a JIT compiler does. Producing machine code is just one small part, and it’s interesting to note that the win from eliminating dynamism was a bigger one than eliminating interpretation overhead.

What next?

The work so far has had a dramatic impact on performance. However, there are still plenty of areas for improvement.

For one, the number of specializations produced and how we choose them can certainly use some tweaking. It’s possible for a routine to saturate its number of allowed specializations with ones that, in the long run, are not entirely strategic. Or we may be able to reasonably argue that it should be allowed more. At the same time, specialized bytecode is a notable memory overhead at this point. So, we need to produce both more and fewer specializations. Of course, what that really means is we need a better strategy for picking what is worth doing.

Another clear issue is that some speculative optimizations turn out to be bad bets, and we have no way of recovering from that. We can deoptimize pretty fast (in fact, some trade-offs have been made to have it be quite affordable). But it’s still a cost, and it’s possible a gamble that looked good based on available data may lead to code that has to be deoptimized a majority of the time – and thus run slower than if we’d never bothered. Again, it’s about being smarter and keeping on learning as we run the program.

Then there’s a bunch of things that we don’t specialize or compile to machine code in a good way yet. Array access is an obvious one, and big integer operations – which are at the heart of Perl 6’s Int type – are another. The specializer is also not as aware as it really should be of closures. These are all areas where we could do a whole lot better – and they are nice, incremental advances on what we already have.

A further interesting area for exploration is doing escape analysis, and being able to eliminate a bunch of GC allocations in favor of allocating the memory as part of the call frame, because we know it can never end up being used beyond it. This is a fair amount of work, but also highly relevant to Perl 6: many scalar containers for lexical variables would be great candidates.

Then there’s the whole idea of trace JIT. MoarVM so far has a routine-level JIT, often called a method JIT. There, we optimize at routine level first, and may do some inlining to flatten away the call tree. This is a pretty decent strategy for Perl 6, in so far as we’re often very interested in inlining primitive operators that, naively interpreted in the original dynamic program, would be multiple dispatch calls. Trace JITs, by contrast, just follow the flow of instructions, over however many callframes, and every single conditional branch becomes a guard. Their strength is loops, which today we cope with using On Stack Replacement (detect we’re in a hot loop, compile an optimized version of it, and hot-swap it). While it’s easy to talk about trace JIT vs. method JIT as a “pick one”, I’m much more interested in the “pick both” track. They both, after all, share a lot of infrastructural needs – and they have their strengths and weaknesses. There’s more than one way to dynamically optimize it, and, this being Perl, we might as well just do all the ways – spotting all the nice unifications and re-use opportunities along the way.

Day 10 – Introspecting the Symbol Tables

December 10, 2014 by

Perl 6 is designed with extensibility in mind. And when you want to extend something, you often need to know as much as possible about the environment.

Today we’ll look at an aspect of finding out about the environment: introspecting symbol tables.

A symbol is something you refer to by name, so it could be a package, class, variable, routine, constant etc.

A symbol table is a data structure in which symbols can be looked up. Perl 6 has three main types of symbol tables, differentiated by scope: lexical symbols live in the lexical pad, package symbols live in a Stash, and methods in a method table.

Enough theory, time for action:

    $ perl6 -e 'my $x = 42; say MY::.keys'
    $x $! $/ $_ GLOBALish EXPORT $?PACKAGE ::?PACKAGE $=pod !UNIT_MARKER

MY is a pseudo package representing the current scope, and appending two colons gives us its symbol table. Which in turn roughly behaves like a Hash, so we can use a method like keys to find all symbols in that table. Or look up a string there:

    $ perl6 -e 'my $x = 42; say MY::<$x>'
    42

A complicated way to say $x, but you can use that to look up symbols by name:

    $ perl6 -e 'my $x = 42; my $var = q[$x]; say MY::{$var}'
    42

Or if you don’t care if it comes from the current lexical scope, just from somewhere:

    $ perl6 -e 'my $x = 42; my $var = q[$x]; say ::($var)'
    42

Needless to say, in general this is a bad idea; save it for exceptional circumstances.

But wait, what were all those other symbols from the first line? $!, $/ and $_ are the three “special” variables for errors, matches and the current topic; the rest is stuff that doesn’t need to be in lexical scope, but the compiler puts it there because it makes things easier.

Outer lexicals are accessible through the OUTER:: pseudo package:

    $ perl6 -e 'my $x = 1; { my $x = 2; say OUTER::<$x>; say $x }'
    1
    2

Instead of OUTER::<$x>, you can also say $OUTER::x. In general, you can move the sigil to the front and omit the angle brackets if the symbol name is a literal.

Package declarations and explicitly our-scoped symbols generally go into a stash, with the top-most being GLOBAL::

    class A {
        class B { }
    }
    say GLOBAL::<A>;    # (A)
    say A;              # (A)
    say A::.keys;       # B

The double colon is not the only way to get at the symbol table; you can also use .WHO (which also works for non-literals). So a long way to write A::B is

    say A::B === GLOBAL::<A>.WHO<B>;    # True

Finally, methods live in classes or roles, and you can use the meta object to introspect them; those generally respect inheritance and role composition, so they are a bit more complicated than stashes.

    my $m = Str.^can('split')[0];
    say $m.^name;               # Method
    say $m('a1B', /\d/).perl;    # ("a", "B").list

Here .^can(name) returns a list of methods on that object with a given name. It’s a list, because it includes methods from superclasses, so there can be more than one. To get a list of all available methods, you can use .^methods and .^methods(:all):

    $ perl6-m -e 'say  Str.^methods(:all).grep: {.name ~~ /^m/ }'
    match match map min max minmax

.^methods(:all) includes methods from all superclasses; without the :all, methods from the classes Cool, Any and Mu are excluded.

This concludes our short tour through the land of the symbol table. There is more to see (for example the EXPORT package, and various pseudo packages), and the daring can go to the design documents or play with Rakudo’s implementation, which is fairly robust.

