Day 9 – A preview of the ‘hackable’ JIT compiler

Among programming languages, Perl 6 has an unfortunate but historically well-deserved reputation for being slow, even compared to languages such as Perl 5, Ruby or Python. This is not really an accident, however. From the outset, perl6 was designed to be extensible by the programmer in many different ways, including defining custom operators, dynamic containers for variables, and a powerful meta-object protocol that allows you to modify the behaviour of objects and classes from regular Perl code. In the future, we might well add macros to that list. To this are added a rich set of control structures and phasers. Such features are useful, powerful and fun.

But such features also make it relatively difficult to execute perl6 efficiently, because they introduce overhead and polymorphism. Even before the language was released last year, work had started to improve the efficiency of Rakudo Perl 6. Because Rakudo is complex and layered, this work involves many aspects and many individuals, from core routine efficiency improvements to precompilation of modules. MoarVM also tries to specialize ‘hot’ code for specific object types at runtime, which reduces the polymorphism of that code and allows simpler operations to be substituted for complex ones. For example, object attribute access can in many cases be reduced to a simple memory read instruction.
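
To illustrate what that means, here is a minimal sketch in C (the object layout and names here are hypothetical, not MoarVM’s actual representation):

#include <stdint.h>
#include <string.h>

typedef struct { const char *name; int64_t value; } Attr;
typedef struct { int n; Attr attrs[4]; } Obj;   /* hypothetical layout */

/* Polymorphic access: the attribute must be found by name, every time. */
int64_t get_attr_generic(Obj *o, const char *name) {
    for (int i = 0; i < o->n; i++)
        if (strcmp(o->attrs[i].name, name) == 0)
            return o->attrs[i].value;
    return 0;
}

/* Specialized access: with the type known, the slot is a fixed offset,
 * and the whole operation collapses to a single memory read. */
static int64_t get_attr_specialized(Obj *o) {
    return o->attrs[1].value;   /* slot fixed at specialization time */
}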

Since late summer 2014, MoarVM has contained a JIT compiler that I developed. At its core, JIT compilation reduces the overhead of interpretation. That is to say, rather than fetching instructions from a bytecode stream and dispatching to execute them, a JIT compiler generates the code to execute those instructions in sequence. Ideally, this allows the executed code to ‘flow’ directly through the program without interruption. In practice, there is still the machinery of the VM, for example for garbage collection, to take into account. Also, note that in many VM implementations the JIT compiler and the type specialization code are combined, whereas in MoarVM they are distinct components.
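
To see where that interpretation overhead comes from, consider a toy bytecode interpreter in C. This is a deliberately minimal sketch, not MoarVM’s actual dispatch loop:

#include <stdint.h>

enum { OP_ADD, OP_SUB, OP_HALT };

int64_t interp(const uint8_t *ops, int64_t a, int64_t b) {
    for (;;) {
        switch (*ops++) {   /* fetch and dispatch: paid for every
                               instruction, before any real work */
            case OP_ADD:  a = a + b; break;
            case OP_SUB:  a = a - b; break;
            case OP_HALT: return a;
        }
    }
}

A JIT compiler would instead emit the add and sub machine instructions directly, one after another, so the fetch-and-dispatch step disappears.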

Since summer 2015 (yes, that long) I’ve been working on an improved backend for that compiler, which was funded by The Perl Foundation. One of the goals for the development of that backend (which I misleadingly call the ‘expression JIT compiler’) was to provide extension points for specialized object types and instructions. These extension points are designed to enable a greater range of type specialization and to improve code generation gradually. The best part, I think, is that they should require relatively little knowledge of compiler internals or assembly language. For this advent calendar post, I’d like to demonstrate those extension points.

First, consider the following innocent-looking perl6 code.

sub foo(@a) {
    @a[0] = @a.elems;
}

If this (obviously critical) `foo` subroutine is invoked hundreds of times, MoarVM will start to optimize it, first by recording what @a really is. In most cases, that will be an instance of the Array class. If it is always the same, and supposing the elems method is sufficiently small (which it is), that method call will be inlined into the foo routine. The assignment to @a[0] is somewhat more involved than it looks, and than it would be in a language like Java. In Perl 6, an array holds a sequence of containers, and it is the container to which the value is assigned. So the primitive sequence of operations in ‘foo’, after MoarVM has optimized it, is similar to:

my $i := nqp::elems(@a);
my $c := nqp::atpos_o(@a, 0);
nqp::assign($c, $i);

The ‘nqp’ namespace signifies that these are primitive operations in the interpreter. However, they are still polymorphic, because arrays can be backed by multiple different stores. For example, arrays used by the NativeCall foreign function interface use a different representation (for compatibility reasons) than the arrays natively used by the VM (which are called MVMArray internally). Because of this polymorphism, MoarVM needs a dispatch table to find the correct implementation of each operation.
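
Sketched in C, that dispatch looks roughly as follows. The names here are illustrative; MoarVM’s actual representation API differs in detail:

#include <stdint.h>

typedef struct MVMObject MVMObject;

/* each representation supplies its own implementations */
typedef struct {
    uint64_t   (*elems)(MVMObject *obj);
    MVMObject *(*at_pos)(MVMObject *obj, int64_t idx);
} ReprOps;

struct MVMObject { const ReprOps *repr; /* ... */ };

/* the polymorphic ‘elems’ operation: an indirect call through the
 * dispatch table, on every execution */
static uint64_t op_elems(MVMObject *obj) {
    return obj->repr->elems(obj);
}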

If the exact implementation type of @a is known, this polymorphism is unnecessary. In principle, the type specializer could insert a specialized, non-polymorphic instruction for each of those operations, for the types found during the recording phase. That is not very practical, though, as there are many different types and each type would have to support many different operations.

However, with the expression compiler it will be relatively easy to add specialized code for a specific implementation. See for instance the following (somewhat simplified) implementations of the ‘elems’ and ‘atpos_o’ MoarVM instructions, which would work on MVMArray objects:

(template: elems (^getf $1 MVMArray body.elems))

(template: atpos_o 
  (load (index (^getf $1 MVMArray body.slots.o) $2 8)
        (&SIZEOF MVMObject*)))
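
Before unpacking the syntax, it may help to see roughly what these two templates compute, written as plain C. This is a sketch based on the field names above; MoarVM’s real structures differ in detail:

#include <stdint.h>

typedef struct MVMObject MVMObject;
typedef struct {
    struct {
        uint64_t elems;                    /* number of elements */
        union { MVMObject **o; } slots;    /* object slot storage */
    } body;
} MVMArray;                                /* minimal stand-in */

/* (template: elems ...): read a field from the array body */
static uint64_t array_elems(MVMArray *arr) {
    return arr->body.elems;
}

/* (template: atpos_o ...): index the slot array with a stride of 8,
 * i.e. sizeof(MVMObject *), and load the pointer stored there */
static MVMObject *array_atpos_o(MVMArray *arr, int64_t idx) {
    return arr->body.slots.o[idx];
}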

Such ‘expression templates’ are textual representations of the templates used in the compiler to convert relatively high-level MoarVM bytecode to low-level code which is easier to compile and optimize. Although the code above might look complex, it is actually pretty straightforward:

  • Expressions are grouped by parentheses and can contain subexpressions, exactly as in LISP.
  • The ‘template:’ keyword declares a template, and ‘elems’ indicates that this template represents the ‘elems’ MoarVM instruction.
  • Numbers with a ‘$’ sigil represent the (input) operands of the MoarVM instruction. In the case of elems, the first operand is the MVMArray object.
  • ‘load’ and ‘index’ are simple expression nodes, which represent a particular low-level operation.
  • ‘^getf’ is a template macro, which expands to a more complex expression of ‘load’ and ‘addr’ nodes which compute the address of the ‘body.elems’ field in the MVMArray structure and load a value from it.
  • &SIZEOF is an expression that is translated – prior to inclusion by MoarVM – into a C sizeof expression. With the ‘&’ syntax, any C macro can be referenced, and with that any compile-time expression may be evaluated.

Such templates can be added without any knowledge of assembler whatsoever, although some knowledge of C structures and VM internals is probably desirable. I hope that by making this simple enough I can convince others to share the task of JIT compiler development with me :-).

For those who are feeling more adventurous yet, it is also simple to hook into the data-driven code generator known as the tiler. The tiler picks the CPU instructions to implement the low-level code generated from expression templates. Each CPU instruction is represented by a ‘tile’, which ‘covers’ part of the expression graph. The tiler tries to pick the optimal instruction so that the entire code graph is covered as cheaply as possible. As the x86 instruction set is very large, it is nearly impossible for me to write tiles for every possible instruction. Adding a tile is not very hard, though. Suppose for instance that we’d be writing a program to compute the XOR of all values in the array:

my $x = 0;
for @a -> $i {
    $x +^= $i;
}

In reality, there are still multiple barriers to implementing this as a tight loop, but suppose that we have taken them down. (We can; it is a matter of time and energy.) Then you might find that the resulting sequence of INDEX, LOAD and XOR operations is inefficient, and you could optimize it into a single instruction. You can declare a tile for that instruction as follows:

(tile: xor_load_idx 
   (xor reg (load (index reg reg $stride) $size)) reg)

  • This uses the same syntax as the expression templates, which is convenient for parsing, but also makes the correspondence between templates and tiles clear.
  • The ‘tile:’ keyword now declares a tile, and ‘xor_load_idx’ is the name of the tile implementation.
  • The words with a ‘$’ sigil now specify constants. The ‘xor’, ‘load’, and ‘index’ words specify expression nodes (operations).
  • ‘reg’ specifies that this tile consumes values in registers (in the first expression) and yields a value in a register (as the last word in the expression).

The tile implementation would look as follows:

MVM_JIT_TILE_DECL(xor_load_idx) {
    MVMint8 out  = tile->values[0];
    MVMint8 base = tile->values[1];
    MVMint8 idx  = tile->values[2];
    /* this only works on 8-byte sized indexes (for now) */
    ASSERT(tile->args[0] == 8 && tile->args[1] == 8);
    /* xor the output register with the qword at base + idx*8 */
    | xor Rq(out), qword [Rq(base)+Rq(idx)*8];
}

The `MVM_JIT_TILE_DECL` macro declares the tile function with all required parameters. The ‘tile’ parameter, which is declared and passed automatically, contains operational details such as the registers used (in ‘tile->values’) and constant parameters (in ‘tile->args’). The assembly-code fragment that follows the ‘|’ symbol declares a piece of machine code to be assembled at runtime. For this we use DynASM, with a few patches to support picking ‘dynamic’ registers on x86-64. The result of this addition would be that the aforementioned sequence of instructions is compiled to a single one, and your (hypothetical) program would run a few percent faster. Although small, such improvements can add up.

Unfortunately, not all of this is done yet. Although in a way the main body of the compiler exists, some essential pieces (especially the register allocator) need further development, and there are quite a few bugs and oversights that need to be fixed, too. So you can’t play with this – yet. However, I do hope that I’ve given you a taste of what you can do when it is finished, which I hope will be soon. And with that, I wish you a happy holiday.

Day 18 – MoarVM Internals for the Brave (and Curious)

Hi there! I’m the guy who developed the MoarVM x64 JIT compiler during
this year’s GSoC. Jonathan has already given a detailed (and clear)
overview of MoarVM’s optimization subsystems, of which the JIT compiler
is but a small part. Today, I’m going to try and show you how you can
see this in action. But first, a sample program:

    use v6;
    
    sub fibonacci (int $n --> int) {
        my int $x = 1;
        my int $y = 1;
        while $n > 0 {
            $x = $x + $y;
            $y = $x - $y;
            $n = $n - 1;
        }
        return $y;
    }
    
    for ^40 -> int $x {
        say fibonacci $x;
    }


Careful readers will spot the native integer definitions sprinkled
throughout the code. You can write perl6 code as if it were C,
too. (After all, that’s just one more way to do it.) I’m not entirely
sure about other backends, but on MoarVM these really work – no object
is ever allocated for them.
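
Roughly speaking, the difference looks like this in C terms. (An
illustrative sketch; this is not MoarVM’s actual object layout:)

    #include <stdint.h>

    typedef struct {
        void   *type;   /* meta-object pointer, GC header, ... */
        int64_t value;
    } BoxedInt;         /* roughly, an Int: a heap-allocated object */

    typedef int64_t NativeInt;  /* an ‘int’: a bare machine word */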

Rakudo Perl 6 can do more than just run this code; it can also show
you how it understands it. I’ve saved the above program as ‘fib.p6’,
and you can get a (rather verbose) textual representation of the
syntax tree by using the special target command line option:

perl6-m --target=mast fib.p6

However, this is just a static analysis. What we’re really interested
in is what happens during runtime. MoarVM will show you all the gory
details if you specify a filename with the MVM_SPESH_LOG environment
variable, like so.

export MVM_SPESH_LOG=spesh-log.txt; perl6-m fib.p6

From the outside it looks as if nothing special has
happened. As they say, looks are deceiving, because MoarVM has
faithfully produced a text file that details the way your code has
been changed. It is even larger than the first analysis, but it also
contains more information. Let’s look at our function:

    Inserting logging for specialization of 'fibonacci' 
(cuid: cuid_2_1418763286.32937)
    
    Before:
    Spesh of 'fibonacci' (cuid: cuid_2_1418763286.32937, file: fib.p6:3)

    BB 0 (0x4f77118):
      Instructions:
        no_op 
      Successors: 1, 6, 5, 7, 9
      Predeccessors: 
      Dominance children: 1, 5, 6, 7, 9, 10

    BB 1 (0x4f77188):
      Instructions:
        checkarity liti16(1), liti16(1)
        param_rp_i r2(1), liti16(0)
        bindlex lex(idx=5,outers=0,$n), r2(1)
        paramnamesused 
        const_i64_16 r0(1), liti16(0)
        bindlex lex(idx=1,outers=0,$x), r0(1)

    ....


This little snippet is not without jargon, so I’ll try to explain. (I
did warn you at the top.) The ‘cuid’ is the compilation unit
identifier, and it serves to identify the source of any
function. Sometimes compilation units correspond to files, and
sometimes they don’t (like in a REPL loop). The indented blocks denote
Basic Blocks. A Basic Block is a sequence of instructions that is
uninterrupted by a (conditional) jump. They are important because
within a basic block it is always obvious where all the values are
coming from.
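
To make that concrete, here is the loop of our fibonacci function
written out in C, with the basic block boundaries marked. The
partitioning is illustrative; spesh numbers its blocks differently:

    long fibonacci(long n) {
        long x = 1, y = 1;
    loop_top:                   /* block boundary: jump target */
        if (n <= 0) goto done;  /* conditional jump ends the block */
        /* the loop body is one basic block: no jumps inside it,
         * so the origin of every value is obvious */
        x = x + y;
        y = x - y;
        n = n - 1;
        goto loop_top;
    done:                       /* block boundary: jump target */
        return y;
    }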

Further along in our spesh log, we can see how spesh has transformed
the first block:

    ...
      BB 1 (0x4f77188):
        Instructions:
          sp_getarg_i r2(1), liti16(0)
          bindlex lex(idx=5,outers=0,$n), r2(1)
          const_i64_16 r0(1), liti16(0)
     ...


This is the same start of the function as before. What has happened is
that the polymorphic instruction param_rp_i has been replaced by the
much simpler instruction sp_getarg_i. As Jonathan explained, we can get
away with this because we know exactly how this function is called,
which is just at line 15 of our little script.
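
The difference is easy to picture in C. The generic instruction must
cope with every way the function might be called; the specialized one
just reads the argument buffer. (This is an illustrative sketch, not
MoarVM’s actual code:)

    #include <stdint.h>

    typedef union { int64_t i64; void *obj; } Arg;
    typedef struct { Arg *args; uint16_t num_args; } Frame;

    /* param_rp_i, roughly: handle every possible calling pattern */
    int64_t param_rp_i(Frame *f, uint16_t idx) {
        if (idx >= f->num_args) {
            /* error handling, defaults, flag and boxing checks ... */
        }
        return f->args[idx].i64;
    }

    /* sp_getarg_i: the call site is known, so just read the slot */
    int64_t sp_getarg_i(Frame *f, uint16_t idx) {
        return f->args[idx].i64;
    }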

While the spesh log is certainly interesting, it is no good for light
reading. Which is why I wanted to show off a really cool tool that
Timo Paulssen (timotimo) made a while ago – a way to transform a piece
of the spesh log (the ‘spesh graph’ that represents the code of a
function) into a visual form (with a little help from
graphviz). Unfortunately, I couldn’t really get it to work. This
demonstrates something important about all the tools that I’m showing
today – they’ve all been designed not for language users but for VM
developers, and they may all be broken this time next year.

[I was able to run the tool and get this piece of output. – timotimo]

Let’s wrap it up. If you’re interested in what kind of assembly code
the JIT will produce for you, there is also a way to get to that. Run
your script again as follows:

export MVM_JIT_BYTECODE_DIR=. # or /tmp, or some other directory
perl6-m fib.p6

If you’ve followed these instructions directly, what you’ll now see is
your working directory littered with binary files representing different
frames (functions). Every such file will contain the raw binary data
generated by the JIT compiler. Inspecting these files is easy enough
(as long as you do know x64 assembly):

objdump -D -b binary -m i386:x86-64 -M intel `ls *.fibonacci.bin`

If you’re on a Mac, that’s gobjdump rather than objdump. And if you
prefer AT&T syntax over Intel syntax, just drop the ‘-M intel’ part.
Looking at this disassembly, you might also see the rather
wasteful way your code is compiled. After all, I’ve written this
function specifically for simplicity. Yet I count no fewer than 215
‘mov’ instructions, 18 conditional move instructions, and 16 function
calls. As much as MoarVM and perl6 have achieved this year, there is
still a lot left to do. And with that, I wish you a hacky Christmas
:-)