Day 9 – A preview of the ‘hackable’ JIT compiler

Among programming languages, Perl 6 has an unfortunate but historically well-deserved reputation for being slow, even compared to languages such as Perl 5, Ruby or Python. This is not really an accident, however. From the outset, Perl 6 was designed to be extensible by the programmer in many different ways, including defining custom operators, dynamic containers for variables, and a powerful meta-object protocol that allows you to modify the behaviour of objects and classes from regular Perl code. In the future, we might well add macros to that list. To this are added a rich set of control structures and phasers. Such features are useful, powerful and fun.

But such features also make it relatively difficult to execute Perl 6 efficiently, because they introduce overhead and polymorphism. Even before the language was released last year, work had started to improve the efficiency of Rakudo Perl 6. Because Rakudo is complex and layered, this work involves many aspects and many individuals, from core routine efficiency improvements to precompilation of modules. MoarVM also tries to specialize ‘hot’ code for specific object types at runtime, which reduces the polymorphism of that code and allows simpler operations to be substituted for complex ones. For example, object attribute access can in many cases be reduced to a simple memory read instruction.

Since late summer 2014, MoarVM has contained a JIT compiler that I developed. At its core, JIT compilation reduces the overhead from interpretation. That is to say, rather than fetch new instructions from a bytecode stream and dispatch to execute them, a JIT compiler generates the code to execute these instructions in a series. Ideally, this allows the executed code to ‘flow’ directly through the program without interruption. In practice, there is still the machinery of the VM, for example for garbage collection, to take into account. Also, note that in many VM implementations, the JIT compiler and the type specialization code are combined, whereas in MoarVM they are distinct components.

Since summer 2015 (yes, that long) I’ve been working on an improved backend for that compiler, which was funded by the Perl Foundation. One of the goals for the development of that backend (which I misleadingly call the ‘expression JIT compiler’) was to enable extension points for specialized object types and instructions. These extension points are designed to enable a greater range of type specialization and to improve code generation gradually. The best part, I think, is that they should require relatively little knowledge of compiler internals or assembly language. For this advent calendar post, I’d like to demonstrate those extension points.

First, consider the following innocent-looking Perl 6 code.

sub foo(@a) {
    @a[0] = @a.elems;
}

In case this (obviously critical) `foo` subroutine is invoked hundreds of times, MoarVM will start to optimize it, first by recording what @a really is. In most cases, that will be an instance of the Array class. If it is always the same, and supposing the elems method is sufficiently small (which it is), that method call will be inlined into the foo routine. The assignment to @a[0] is somewhat more involved than it looks, and more involved than it would be in a language like Java. In Perl 6, an array holds a sequence of containers, and it is the container to which the value is assigned. So the primitive sequence of operations in ‘foo’, after MoarVM has optimized it, is similar to:

my $i := nqp::elems(@a);
my $c := nqp::atpos_o(@a, 0);
nqp::assign($c, $i);

The ‘nqp’ namespace signifies that these are primitive operations in the interpreter. However, they are still polymorphic, because arrays can be backed by multiple different stores. For example, arrays used by the NativeCall foreign function interface use a different representation (for compatibility reasons) than the arrays natively used by the VM (which are called MVMArray internally). Because these operations are polymorphic, MoarVM needs a dispatch table to find the correct implementation for each of them.

If the exact implementation type of @a is known, then this polymorphism is unnecessary. In principle, the type specializer could insert a specialized non-polymorphic instruction for each of those operations, for the types found during the recording phase. That is not very practical, though, as there are many different types, and each type would have to support many different operations.

However, with the expression compiler it will be relatively easy to add specialized code for a specific implementation. See for instance the following (somewhat simplified) implementations of the ‘elems’ and ‘atpos_o’ MoarVM instructions, which would work on MVMArray objects:

(template: elems (^getf $1 MVMArray body.elems))

(template: atpos_o 
  (load (index (^getf $1 MVMArray body.slots.o) $2 8)
        (&SIZEOF MVMObject*)))

Such ‘expression templates’ are textual representations of the templates used in the compiler to convert relatively high-level MoarVM bytecode to low-level code which is easier to compile and optimize. Although the code above might look complex, it is actually pretty straightforward:

  • Expressions are grouped by parentheses and can contain subexpressions, exactly as in LISP.
  • The ‘template:’ keyword declares a template, and ‘elems’ declares that this template represents the ‘elems’ MoarVM instruction.
  • Numbers with a ‘$’ sigil represent the (input) operands to the MoarVM instruction. In the case of ‘elems’, the first operand is the MVMArray object.
  • ‘load’ and ‘index’ are simple expression nodes, which represent a particular low-level operation.
  • ‘^getf’ is a template macro, which expands to a more complex expression of ‘load’ and ‘addr’ nodes which compute the address of the ‘body.elems’ field in the MVMArray structure and load a value from it.
  • &SIZEOF is an expression that is translated – prior to inclusion by MoarVM – into a C sizeof expression. With the ‘&’ syntax, any C macro can be referenced, and with that any compile-time expression may be evaluated.

Such templates can be added without any knowledge of assembler whatsoever, although some knowledge of C structures and VM internals is probably preferable. I hope that by making this simple enough I can convince others to share the task of JIT compiler development with me :-).

For those who are feeling more adventurous yet, it is also simple to hook into the data-driven code generator known as the tiler. The tiler picks the CPU instructions to implement the low-level code generated from expression templates. Each CPU instruction is represented by a ‘tile’, which ‘covers’ part of the expression graph. The tiler tries to pick the optimal instructions so that the entire code graph is covered as cheaply as possible. As the x86 instruction set is very large, it is nearly impossible for me to write tiles for every possible instruction. Adding a tile is not very hard, though. Suppose for instance that we were writing a program to compute the XOR of all values in an array:

my $x = 0;
for @a -> $i {
    $x +^= $i;
}

In reality, there are still multiple barriers to implementing this as a tight loop, but suppose that we have taken them down. (We can; it is a matter of time and energy.) Then you might find that the sequence of INDEX, LOAD and XOR operations would be inefficient, and that you could optimize it into a single instruction. You can declare a tile for that instruction as follows:

(tile: xor_load_idx 
   (xor reg (load (index reg reg $stride) $size)) reg)

  • This uses the same syntax as the expression templates, which is convenient for parsing, but also makes the correspondence between templates and tiles clear.
  • The ’tile:’ keyword now declares a tile, and ‘xor_load_idx’ the name of the tile implementation.
  • The words with a ‘$’ sigil now specify constants. The ‘xor’, ‘load’, and ‘index’ words specify expression nodes (operations).
  • ‘reg’ specifies that this tile consumes values in registers (in the first expression) and yields a value in a register (as the last word in the expression).

The tile implementation would look as follows:

MVM_JIT_TILE_DECL(xor_load_idx) {
    MVMint8 out  = tile->values[0];
    MVMint8 base = tile->values[1];
    MVMint8 idx  = tile->values[2];
    /* this only works on 8-byte sized indexes (for now) */
    ASSERT(tile->args[0] == 8 && tile->args[1] == 8);
    | xor Rq(out), qword [Rq(base)+Rq(idx)*8];
}

The `MVM_JIT_TILE_DECL` macro declares the tile function with all required parameters. The ‘tile’ parameter, which is declared and passed automatically, contains operational details such as the registers used (in ‘tile->values’) and constant parameters (in ‘tile->args’). The assembly-code fragment that follows the ‘|’ symbol declares a piece of machine code that is to be assembled at runtime. For this we use DynASM, with a few patches to support picking ‘dynamic’ registers on x86-64. The result of this addition would be that the aforementioned sequence of instructions is compiled to a single one, and your (hypothetical) program would run a few percent faster. Although small, such improvements can add up.

Unfortunately, all of this is not done yet. Although in a way the main body of the compiler exists, some essential pieces (especially the register allocator) need further development, and there are quite a few bugs and oversights that need to be fixed, too. So unfortunately you can’t play with this – yet. However, I do hope that I’ve given you a taste of what you can do when it is finished, which I hope to be soon. And with that, I wish you a happy holiday.