Archive for the ‘2012’ Category

Day 14 – Primal Needs

December 14, 2012

Our brains are hard-wired to look for patterns, even where none exist. So, it’s no surprise that as soon as mankind started counting things, he would look for patterns in numbers. One group of numbers that have resisted the pattern matching capabilities of the human brain are the so-called “prime numbers”. These are numbers that can only be evenly divided by 1 or themselves–they have no other factors.

But you knew that already, so why am I talking about prime numbers instead of Perl 6? Because, just like our ancestors, the people that created Perl 6 and continue to shape it to be around for the next 100 years or more find prime numbers interesting. So interesting, in fact, that the language specification was modified to include a routine for determining whether or not a number is prime.

Alpha

At first, implmementations of this prime-number-finder were pure Perl 6 and took advantage of other features of the language such as ranges and junctions. An example implementation is shown below:

    sub is-prime($n) { $n %% none 2..sqrt $n }

This implementation checks to see that none of numbers from 2 to the square root of $n will evenly divide $n. If this is the case, then the number is prime.

While the above implementation works fine, it is a little slow and it does suffer a little redundancy in the numbers it checks. For instance, if you know a number isn’t evenly divisible by 2, there’s no need to check if it’s evenly divisible by 4, yet the above algorithm does so anyway.

Beta

An improvement on the algorithm is to only check if the I between 2 and the square root of the number evenly divide the number. But … but … that’s like like defining a word in terms of itself. Thanks to ubiquitous lazy evaluation in Perl 6, that’s entirely possible. Here’s an implementation:

    my @primes := 2, 3, 5, -> $p { ($p+2, $p+4 ... &is-prime)[*-1] } ... *;
    sub is-prime($n) { $n %% none @primes ...^  * > sqrt $n }

The array @primes is an infinite, lazily evaluated sequence of numbers starting with 2, 3, and 5. The next number in the sequence is generated by creating a new sequence of odd numbers that start from the last odd number and continue until we reach a prime. That prime is the next number in the sequence. But how do we know if it’s a prime? We check with our handy C subroutine that actually uses the lazy list of primes up to the square root of the number we’re testing to see if any of them are factors.

There’s a kind of mutual recursion going on here where the @primes array effectively memoizes the primes we’ve seen so far. But … then there’s the problem that @primes will continue to grow as you check bigger and bigger numbers for prime-ness. Can we do better?

Indeed we can.

Gamma: Rabin-Miller test

Well … maybe we can. It depends on your idea of “better”. The Rabin-Miller primality test is probabalistic in nature. It doesn’t require storing an ever increasing cache of prime numbers to test if they are factors of the potential prime, but there is a chance that it will tell you that a number is prime when it actually isn’t. The good news is that we can adjust the odds so that we are reasonably confident that the number is prime. Here’s an implementation (taken from http://rosettacode.org/wiki/Miller-Rabin_primality_test#Perl_6):

sub expmod(Int $a is copy, Int $b is copy, $n) {
	my $c = 1;
	repeat while $b div= 2 {
		($c *= $a) %= $n if $b % 2;
		($a *= $a) %= $n;
	}
	$c;
}

subset PrimeCandidate of Int where { $_ > 2 and $_ % 2 };

my Bool multi sub is-prime(Int $n, Int $k)            { return False; }
my Bool multi sub is-prime(2, Int $k)                 { return True; }
my Bool multi sub is-prime(PrimeCandidate $n, Int $k) {
	my Int $d = $n - 1;
	my Int $s = 0;

	while $d %% 2 {
		$d div= 2;
		$s++;
	}

	for (2 ..^ $n).pick($k) -> $a {
		my $x = expmod($a, $d, $n);

		next if $x == 1 or $x == $n - 1;

		for 1 ..^ $s {
			$x = $x ** 2 mod $n;
			return False if $x == 1;
			last if $x == $n - 1;
		}
		return False if $x !== $n - 1;
	}

	return True;
}

The third multi variant of is-prime with the signature (PrimeCandidate $n, Int $k) is where all of the magic happens. This multi is only triggered when the prime candidate ($n) is an odd number because of the definition of the PrimeCandidate type.

First, we factor out the powers of 2 from $n - 1. Since $n is an odd number, $n - 1 is even and so has at least one factor of 2. What we end up with is an odd number and some power-of-2 factors of $n - 1. We then use those factors to see if a random sample of $k numbers less than $n are congruent to the square roots of unity modulo $n (expmod handles the modular exponentiation). We repeat this for all of the powers of 2 we factored out of the original number. Fermat’s little theorem says that if we find any number where the congruence does not hold, then the number can not be prime.

The probability that this method will select a composite number as prime is based on how many numbers less than $n we choose to sample. If we select $k numbers to try, the probability is 4 ** -$k. By choosing to sample more numbers, we can quickly decrease the odds of a false positive to a negligible amount.

Wrap up

But … most people don’t really have to worry about the implementation details of is-prime. Not only have is-prime and expmod been added to the Perl 6 specification, but actual implementations (ala Rabin-Miller) have been added to the Rakudo and Niecza Perl 6 compilers. So, if you want to test your new cryptographic algorithm and need some large prime numbers, or if you’re developing a new random number generator and need some candidates for the modulus, or maybe you’re developing a new hashing algorithm, Perl 6 has a built-in is-prime that can help.

Day 13 – Bags and Sets

December 13, 2012

Over the years, I’ve written many variations on this code:

my %words;
for slurp.comb(/\w+/).map(*.lc) -> $word {
    %words{$word}++;
}

(Aside: slurp.comb(/\w+/).map(*.lc) does the standard Perl trick of reading files specified on the command line or standard in, goes through the data for words, and makes them lowercase.)

Perl 6 introduces two new Associative types for dealing with this sort of functionality. KeyBag is drop-in replacement for Hash in this sort of case:

my %words := KeyBag.new;
for slurp.comb(/\w+/).map(*.lc) -> $word {
    %words{$word}++;
}

Why would you prefer KeyBag over Hash in this case, considering that it’s a bit more code? Well, it does a better job of saying what you mean, if what you want is a positive Int-valued Hash. It actually enforces this as well:

> %words{"the"} = "green";
Unhandled exception: Cannot parse number: green

That’s Niecza’s error; Rakudo’s is less clear, but the important point is you get an error; Perl 6 detects that you’ve violated your contract and complains.

And KeyBag has a couple more tricks up its sleeve. First, four lines to initialize your KeyBag isn’t terribly verbose, but Perl 6 has no trouble getting it down to one line:

my %words := KeyBag.new(slurp.comb(/\w+/).map(*.lc));

KeyBag.new does its best to turn whatever it is given into the contents of a KeyBag. Given a List, each of the elements is added to the KeyBag, with the exact same result of our earlier block of code.

If you don’t need to modify the bag after its creation, then you can use Bag instead of KeyBag. The difference is Bag is immutable; if %words is a Bag, then %words{$word}++ is illegal. If immutability is okay for your application, then you can make the code even more compact:

my %words := bag slurp.comb(/\w+/).map(*.lc);

bag is a helper sub that just calls Bag.new on whatever you give it. (I’m not sure why there is no equivalent keybag sub.)

Bag and KeyBag have a couple more tricks up their sleeve. They have their own versions of .roll and .pick which weigh their results according to the given values:

> my $bag = bag "red" => 2, "blue" => 10;
> say $bag.roll(10);
> say $bag.pick(*).join(" ");
blue blue blue blue blue blue red blue red blue
blue red blue blue red blue blue blue blue blue blue blue

This wouldn’t be too hard to emulate using a normal Array, but this version would be:

> $bag = bag "red" => 20000000000000000001, "blue" => 100000000000000000000;
> say $bag.roll(10);
> say $bag.pick(10).join(" ");
blue blue blue blue red blue red blue blue blue
blue blue blue red blue blue blue red blue blue

They also work with all the standard Set operators, and have a few of their own as well. Here’s a simple demonstration:

sub MAIN($file1, $file2) {
    my $words1 = bag slurp($file1).comb(/\w+/).map(*.lc);
    my $words2 = set slurp($file2).comb(/\w+/).map(*.lc);
    my $unique = ($words1 (-) $words2);
    for $unique.list.sort({ -$words1{$_} })[^10] -> $word {
        say "$word: { $words1{$word} }";
    }
}

Passed two filenames, this makes a Bag from the words in the first file, a Set from the words in the second file, uses the set difference operator (-) to compute the set of words which are only in the first file, sorts those words by their frequency of appearance, and then prints out the top ten.

This is the perfect point to introduce Set. As you might guess from the above, it works much like Bag. Where Bag is a Hash from Any to positive Int, Set is a Hash from Any to Bool::True. Set is immutable, and there is also a mutable KeySet.

Between Set and Bag we have a very rich collection of operators:

Operation Unicode “Texas” Result Type
is an element of (elem) Bool
is not an element of !(elem) Bool
contains (cont) Bool
does not contain !(cont) Bool
union (|) Set or Bag
intersection (&) Set or Bag
set difference (-) Set
set symmetric difference (^) Set
subset (<=) Bool
not a subset !(<=) Bool
proper subset (<) Bool
not a proper subset !(<) Bool
superset (>=) Bool
not a superset !(>=) Bool
proper superset (>) Bool
not a proper superset !(>) Bool
bag multiplication (.) Bag
bag addition (+) Bag

Most of these are self-explanatory. Operators that return Set promote their arguments to Set before doing the operation. Operators that return Bag promote their arguments to Bag before doing the operation. Operators that return Set or Bag promote their arguments to Bag if at least one of them is a Bag or KeyBag, and to Set otherwise; in either case they return the type promoted to.

Please note that while the set operators have been in Niecza for some time, they were only added to Rakudo yesterday, and only in the Texas variations.

A bit of a word may be needed for the different varieties of unions and intersections of Bag. The normal union operator takes the max of the quantities in either bag. The intersection operator takes the min of the quantities in either bag. Bag addition adds the quantities from either bag. Bag multiplication multiplies the quantities from either bag. (There is some question if the last operation is actually useful for anything — if you know of a use for it, please let us know!)

> my $a = bag <a a a b b c>;
> my $b = bag <a b b b>;

> $a (|) $b;
bag("a" => 3, "b" => 3, "c" => 1)

> $a (&) $b;
bag("a" => 1, "b" => 2)

> $a (+) $b;
bag("a" => 4, "b" => 5, "c" => 1)

> $a (.) $b;
bag("a" => 3, "b" => 6)

I’ve placed my full set of examples for this article and several data files to play with on Github. All the sample files should work on the latest very latest Rakudo from Github; I think all but most-common-unique.pl and bag-union-demo.pl should work with the latest proper Rakudo releases. Meanwhile those two scripts will work on Niecza, and with any luck I’ll have the bug stopping the rest of the scripts from working there fixed in the next few hours.

A quick example of getting the 10 most common words in Hamlet which are not found in Much Ado About Nothing:

> perl6 bin/most-common-unique.pl data/Hamlet.txt data/Much_Ado_About_Nothing.txt
ham: 358
queen: 119
hamlet: 118
hor: 111
pol: 86
laer: 62
oph: 58
ros: 53
horatio: 48
clown: 47

Day 12 – Exceptions

December 12, 2012

Sometimes things go horribly wrong, and the only thing you can do is not to go on. Then you throw an exception.

But of course the story doesn’t end there. The caller (or the caller’s caller) must somehow deal with the exception. To do that in a sensible manner, the caller needs to have as much information as possible.

In Perl 6, exceptions should inherit from the type Exception, and by convention they go into the X:: namespace.

So for example if you write a HTTP client library, and you decide that an exception should be thrown when the server returns a status code starting with 4 or 5, you could declare your exception class as

 class X::HTTP is Exception {
     has $.request-method;
     has $.url;
     has $.status;
     has $.error-string;

     method message() {
         "Error during $.request-method request"
         ~ " to $.url: $.status $.error-string";
     }
 }

And throw an exception as

 die X::HTTP.new(
     request-method  => 'GET',
     url             => 'http://example.com/no-such-file',
     status          => 404,
     error-string    => 'Not found',
 );

The error message then looks like this:

 Error during GET request to
    http://example.com/no-such-file: 404 Not found

(line wrapped for the benefit of small browser windows).

If the exception is not caught, the program aborts and prints the error message, as well as a backtrace.

There are two ways to catch exceptions. The simple Pokemon style “gotta catch ‘em all” method catches exception of any type with try:

 my $result = try do-operation-that-might-die();
 if ($!) {
     note "There was an error: $!";
     note "But I'm going to go on anyway";
 }

Or you can selectively catch some exception types and handle only them, and rethrow all other exceptions to the caller:

 my $result =  do-operation-that-might-die();
 CATCH {
     when X::HTTP {
         note "Got an HTTP error for URL $_.url()";
         # do some proper error handling
     }
     # exceptions not of type X::HTTP are rethrown
 }

Note that the CATCH block is inside the same scope as the one where the error might occur, so that by default you have access to all the interesting varibles from that scope, which makes it easy to generate better error messages.

Inside the CATCH block, the exception is available as $_, and is matched against all when blocks.

Even if you don’t need to selectively catch your exceptions, it still makes sense to declare specific classes, because that makes it very easy to write tests that checks for proper error reporting. You can check the type and the payload of the exceptions, without having to resort to checking the exact error message (which is always brittle).

But Perl 6 being Perl, it doesn’t force you to write your own exception types. If you pass a non-Exception objects to die(), it simply wraps them in an object of type X::AdHoc (which in turn inherits from Exception), and makes the argument available with the payload method:

    sub I-am-fatal() {
        die "Neat error message";
    }
    try I-am-fatal();
    say $!;             # Neat error message;
    say $!.perl;        # X::AdHoc.new(payload => "Neat error message")

To find out more about exception handling, you can read the documentation of class Exception and Backtrace.

Day 11 – Parrot threads

December 11, 2012

Editors note: I, rurban, does know almost nothing about threads. Any errors are probably mine. I just tested them, fixed some deadlocks, added the numcpu code and merged the threads branch to master.

Parrot now supports fast and lightweight OS threads, based on Nat Tucks’s initial GSoC work together with Andrew “whiteknight” Whitworth on green threads and finally Stefan Seifert’s extension to true parallel OS threads as hybrid threads.
See http://wknight8111.blogspot.co.at/2010/08/gsoc-threads-chandons-results.html and http://niner.name/Hybrid_Threads_for_the_Parrot_VM.pdf

History

Parrot always supported “threads”, i.e. concurrency models over the last years, but we identified various problems with the particular designs and were continuously improving them. In our case without changing the API too much, as the pdd25 concurrency spec is pretty high level, describing the various models parrot should support, and also pretty low-level in describing the two PMC’s which export the threads API, the Task and the Scheduler classes.

Being born at a time when Perl 6 still looked much more similar to Perl 5 than it does nowadays, Parrot’s threading support initially was very close to Perl’s ithreads model. Previous attempts to change this into the more conventional model of data shared by default or implementing new technologies like STM “Software Transactional Memory” failed. For example Parrot has never supported running multiple threads and having garbage collection at the same time.

In the year 2005 development of faster Central Processing Units (CPUs) shifted from increased speed of a single core to adding more cores. Modern processors contain up to 12 cores with even mobile phones having up to four. To utilize a modern CPU’s power, code needs to be run in parallel. In UNIX (and thus Perl) tradition, this is accomplished using multiple processes being a good solution for many use cases. For many others like auto threading of hyper operators in Perl 6, the cost of process setup and communication would be prohibitively high except for very large data sets.

During the years of back and forth and failed attempts of adding threading support to Parrot, the Perl 6 specification evolved to a point where the largest parts of the language were covered and its features were implemented in the compilers. The lack of concurrency primitives in Parrot however prevents any progress in the area of concurrency support.

Summary

Green threads were used to simplify the implementation of a nearly lock free multithreading implementation. 

Parrot supports now native Win32 threads and POSIX threads. Win32 alarm, sleep and premption is unified with POSIX, it is handled on a common timer thread.

Parrot creates at startup a thread pool of --numthreads threads, which defaults to the number of available CPU cores. Activating a new thread at runtime causes no run-time penalties, until the number of cores is utilized. When a user starts a new task, the scheduler first looks for an idle thread. If one can be found, the task is scheduled on the thread’s interpreter. If more tasks are started than the maximum number of threads, the tasks are distributed evenly among the running interpreters. This is effectively an implementation of the N:M threading model.

Green threads

Our GSOC student Nan “Chandor” Tuck worked in summer 2010 on green threads.

What I have working now is a pre-emptively scheduled green threads system for Parrot that allows programs to be written in a concurrent style. Individual green threads can do basic blocking file input without stopping other threads from running. These logical threads are accessed using the Task API that I described a couple weeks ago. This functionality makes Parrot similarly powerful at threading as the standard version of Ruby or Python: Threads can do pretty much everything except run at the same time. http://parrot.org/content/hybrid-threads-gsoc-project-results

What was missing from this green threads branch was true parallel execution in OS threads, one global_interpreter structure that is shared and protected by locks or other concurrent access rules and many local_interpreters that run simultaneously in separate OS threads.

OS threads

From Fall 2011 to Summer 2012 Stefan “nine” Seifert implemented true OS threads on top of green threads to finally allow true parallel execution of Tasks, to implement blocking IO, and to give perl6 some more advantages over perl5.

The lightweight “green” threads are used as messages in a system where reading shared variables is allowed but only the one owner thread may write to it. That’s why we call it hybrid threads.

Why is multithreading support so difficult to implement?

Low level programming languages like C provide only the bare necessities, leaving the responsibility for preventing data corruption and synchronization entirely to the user. A high-level language like Perl 6 on the other hand provides complex and compound data types, handles garbage collection and a very dynamic object system. Even seemingly simple things like a method call can become very complex. In a statically typed programming language the definition of a class is immutable. Thus, calling a method on an object contains just the steps of determining the object’s class, fetching the required method from this class and calling it. Calling the same method again can then even omit the first two steps since their results cannot change.

In a dynamic language, the object may change its class at runtime. The inheritance hierarchy of the class may be changed by adding or removing parent classes. Methods may be added to or removed from classes (or objects) at runtime and even the way how to find a method of a class may change. So a simple method call results in the following steps:

    ·  determining the class of the object,
    ·  determining the method resolution method of the class,
    ·  finding the actual method to call,
    ·  calling the method.

These steps have to be repeated for every method call, because the results may change any time. In a threaded environment, a thread running in parallel may change the underlying data and meta data in between those sequences and even between those steps. As a consequence, this meta data has to be protected from corruption introducing the need for locks in a performance critical area.

Many interpreters for dynamic languages like Python or Ruby handle this problem by using a global interpreter lock (GIL) to effectively serialize all operations. This is a proven and reliable way but leaves much of the hardware’s potential unused.

Java

In Java, the user is responsible for preventing concurrency issues. The language provides synchronization primitives like mutexes, but the interpreter (the Java Virtual Machine, JVM) does not protect the consistency of the provided data structures. The class library provides the user with high-level data structures explicitly designed for multithreaded scenarios.

Java version 1.1 used green threads to support multithreaded execution of Java programs. Green threads are threads simulated by the virtual machine (VM) but unable to use more than one CPU core for processing. Version 1.2 introduced native Operating System (OS) threading support which since has become the standard way to do multithreading in Java.

Python

The CPython implementation of the Python runtime uses a Global Interpreter Lock (GIL) to protect its internal consistency. This is a single lock taken whenever the interpreter executes Python bytecode. Because of this lock, only one thread can execute bytecode at any time so all built-in types and the object model are implicitly type safe. The drawback is that Python code cannot benefit from having multiple CPU cores available. However I/O operations and calls to external libraries are executed without holding the GIL, so in applications with multiple I/O bounded threads, there may still be a performance benefit from using multithreading.

To run Python code in parallel, multiple processes have to be used. The multiprocessing module provides support for spawning processes exposed through an API similar to the threading module. Since processes may not directly access other processes’ memory, the multiprocessing module provides several means of communication between processes: Queues, Pipes and shared memory support.

Parrot

Much of Parrot’s previous threading related code has been removed to clean up the code and improve performance. Since the existing threading support was known to be unreliable and seriously flawed, this was no trade off. The final parts were removed by the merging of the kill_threads branch on September, 21st 2011.

In 2010, Nat Tuck began working on a green_threads branch during his Google Summer of Code internship. The feature got prototyped using pure PIR and then implemented in Parrot’s core. He got it to work in simple cases and started to work on OS thread support but the internship ended before the code was ready to be merged into the master branch. The code lay dormant until the work on hybrid threads in the threads branch started in 2011.

In Parrot, green threads are called Tasks. Each task is assigned a fixed amount of execution time. After this time is up a timer callback sets a flag which is checked at execution of every branch operation. Since the interpreter’s state is well defined at this point, its internal consistency is guaranteed. The same holds for the GC. Since task preemption is only done while executing user-level code, the GC can do its work undisturbed and without the need for measures like locking. Since user-level code is allowed to disable the scheduler, it can be guaranteed to run undisturbed through critical sections.

The scheduler is implemented as a PMC type. This allows the user to subclass this PMC thus allowing fine-grained control over the scheduling policy. Features, a user could add this way would be for example giving different priorities to tasks or implementing the possibility to suspend and resume a task.

Shared data

Cross-thread writes to shared variables may endanger the internal consistency of the interpreter. Traditionally, the solution to this problem is the use of locks of varying granularity. Fine-grained locking allows code to run in parallel but taking and releasing locks costs performance. It not only increases the instruction count and memory accesses but it also forces the CPU cores to coordinate and thus communicate. Even a seemingly simple operation like an atomic increment can take two orders of magnitude longer than a normal increment. While the gain through being able to utilize multiple CPU cores may offset this cost, it is still impacting the common case of having only a single thread running.

Too coarse locking on the other hand would reduce scalability and the performance gains through parallel execution by having threads wait for extended periods for locks to become available. In the extreme case of having a global interpreter lock it would effectively serialize all computations costing much of the benefits of using threads in the first place.

The other problem with locking is the possibility of introducing deadlocks. For example, two functions F1 and F2 both use two resources A and B protected by locks. If F1 first locks A and then tries to lock B while F2 has already locked B and is now trying to lock A, the program would come to a halt. Both functions would be left waiting for the other to unlock the resource which will never happen. With fine-grained locking, the possibilities for such bugs grow quickly. At the same time, it is easy to miss a case where a lock would be appropriate leading to difficult to diagnose corruption bugs.

The solution for these problems implemented in hybrid threads is to sidestep them altogether by disallowing write access to shared variables. The programmer (or in most cases the compiler) is obliged to declare a list of all shared variables before a newly created task is started. The interpreter would then create proxy objects for these variables which the task can use to access the data. These proxies contain references to the original objects. They use these references to forward all reading vtable functions to the originals. Write access on the other hand would lead to a runtime error.

In other words, all data is owned by the thread creating it and only the owner may write to it. Other threads have only read access.

For threads to be able to communicate with their creators and other threads, they still need to write to shared variables. This is where green threads come into play. Since green threads are light weight, it is feasible for a thread to create a task just for updating a variable. This task is scheduled on the interpreter owning this variable. To reduce latency, the task is flagged to run immediately. The data-owning interpreter will preempt the currently running task and process the new write task. Put another way, the data-owning interpreter is told what to write to its variables, so other threads don’t have to.

Proxies

Proxies are the arbiters between threads. They are the only means for a thread to access another thread’s data and are implemented by the Proxy PMC type.

Proxy has default implementations for all functions, writing functions raise a cant_do_write_method runtime exception.  If a method returns a PMC from the target’s interp, another proxy object has to be created and wrapped around it so it can be safely returned to the caller.

Sub

The Sub PMC represents executable subroutines. A Sub does not only contain the code to execute but also the context in which to execute the code such as visible globals and namespaces.  If a proxy to such a Sub were created and invoke called on it, the code would access this context directly since it belongs to the same interp as the proxied Sub itself.  Thus, an operation like get_global fetches a global from an unproxied namespace and an unproxied global is be put into the target register.  Since this is happening while running invoke on the original Sub, Proxy cannot intercept the call and create a Proxy for the result.

This is the reason why Parrot_thread_create_proxy does not create a Proxy for a Sub but uses Parrot_thread_create_local_sub to create a copy on the thread’s interp with proxies for all PMC attributes.

Writing to shared variables

To write to shared variables, a thread creates a task and schedules it on the data owning interpreter. An example task looks like this:

    .sub write_to_variable
         .param pmc variable
         variable = 1
    .end

This is a subroutine with just one parameter. The variable passed as this parameter is the one the task should write to. In this case the constant value 1 would be written to the variable. In PIR, an assignment to a PMC gets translated to a method call. In this case, the set_integer_native is called changing the variable’s value. Since PMCs are passed by reference, it is the original variable which gets written to.

Code to create the task looks like:

    1    write_task = new ['Task']
    2    setattribute write_task, 'code', write_to_variable
    3    setattribute write_task, 'data', shared_variable
    4    interp.'schedule_proxied'(write_task, shared_variable)

Line 1 creates a new task object. The example subroutine is used for the task’s code attribute. shared_variable is used for data. At this point, shared_variable is actually the proxy object created for the shared integer PMC. The interpreter object contains a schedule_proxied method which is used to schedule the write_task on the thread owning the original variable.

schedule_proxied uses Parrot_thread_create_local_task which in this case detects that the data given as parameter for the task’s code is actually a proxy already and unwraps the proxied object. Parrot_cx_schedule_immediate is then used to make the data owning interpreter execute the task as soon as possible.

To protect a critical section, preemption can be disabled so the critical section runs uninterrupted:

    1 .sub swap_variables
    2     .param pmc a, b
    3     .local temp
    4     disable_preemption
    5     temp = a
    6     a = b
    7     b = temp
    8     enable_preemption
    9 .end

wait

Using tasks to write to shared variables makes such actions inherently asynchronous. This is not always what is needed by the implemented algorithm. For example, when the shared variable is a lock, processing should continue as soon as it’s acquired. The wait operation is used to wait for a task’s completion. The waiting task is added to the waited for task’s waiters list and preempted immediately. When a task finishes, all the tasks in the waiters list are scheduled again for execution. Since for each task a local copy is created on the target thread, the running task not only checks its own waiters list but also its partner’s.

If a task on the main thread was waiting for a task on another thread to finish and no other tasks are in the scheduler’s queue on the main thread, the main thread exits if no alarms are pending. To prevent this unintended exit, all tasks are added to the scheduler’s foreign_tasks list when they are scheduled on other threads. To end the program with other threads still running, an explicit exit operation has to be used.

Benchmarks

Preliminary benchmarks have shown Parrot’s performance to be within an order of magnitude of that of an optimized implementation in Perl 5.

Since Parrot does not yet offer the user any synchronization primitives, locks had to be implemented using a shared variable which is written to only by the main thread. Replacing this primitive method with a native semaphore implementation would probably reduce runtime to a small fraction.

Runtime comparison for matrix multiplication

                singlethreaded  computation      multithreaded   computation
    1. run          28.522 s       19.530 s        17.543 s          8.478 s
    2. run          28.427 s       19.463 s        17.320 s          8.283 s
    3. run          28.200 s       19.235 s        17.489 s          8.473 s
    average         28.383 s       19.409 s        17.451 s          8.411 s

This test implements matrix multiplication using four threads. For simplicity the second matrix only has one column. The program is written in the Winxed programming language. Winxed is a low-level language with Javascript like syntax and the possibility to include sections of PIR code verbatim making it possible to try experimental opcodes while writing more readable and concise code than with PIR alone. The complete source code is available in examples/threads/matrix_part.winxed

The program consists of the parts initialization, computation and verification. Computation is parallelized using four tasks each calculating one fourth of the result vector. Runtime is compared to a simple singlethreaded implementation. Run times were measured using the time command and are recorded in the above table.

As can be seen, the multithreaded implementation gives an average speedup of 2.31 for the computation and 1.61 in total.

Runtime comparison for Mandelbrot set calculation

                 singlethreaded  1 thread    2 threads   4 threads    8 threads
    1. run           89.931 s    89.978 s    45.813 s     24.028 s     17.445 s
    2. run           89.707 s    89.871 s    45.906 s     24.048 s     17.695 s
    3. run           90.318 s    89.839 s    45.951 s     24.049 s     17.573 s
    average          89.985 s    89.896 s    45.890 s     24.042 s     17.571 s
    speedup           1.000        1.001       1.959       3.739        5.116

The complete source code is available in examples/pir/mandel.pir

Calculating an image of the Mandelbrot set is a common benchmark for multithreading implementations since calculations of points are independent of each other and are thus easily parallelizable. A simple implementation of the escape time algorithm written in Winxed has been used to determine scalability properties of the threading implementation. The image is split into lines which are calculated alternatedly by a configured number of tasks. Run times were measured using the time command on an Intel Core i7 3770K processor with 16 GiB RAM running openSUSE 12.1 and are recorded in the Mandelbrot table. As can be seen, the implementation scales nearly linearly up to four threads reflecting the CPU’s four physical cores. Using eight threads, the speedup is only 1.368 compared to four threads but this seems to be more a limitation of the hardware than the implementation.

Questions

On IRC and on the mailing list some detailed questions have been asked.

See here:
http://lists.parrot.org/pipermail/parrot-dev/2012-December/007295.html

Day 10 – Don’t quote me on it…

December 10, 2012

In many areas, Perl 6 provides you with a range of sane defaults for the common cases along with the power to do something a little more interesting when you need it. Quoting is no exception.

The Basics

The two most common quoting constructs are the single and double quotes. Single quotes are simplest: they let you quote a string and just about the only “magic” they provide is being able to stick a backslash before a single quote, which escapes it. Since backslash has this special meaning, you can write an explicit backslash with \\. However, you don’t even need to do that, since any other backslashes just pass on straight through. Here’s some examples.

> say 'Everybody loves Magical Trevor'
Everybody loves Magical Trevor
> say 'Oh wow, it\'s backslashed!'
Oh wow, it's backslashed!
> say 'You can include a \\ like this'
You can include a \ like this
> say 'Nothing like \n is available'
Nothing like \n is available
> say 'And a \ on its own is no problem'
And a \ on its own is no problem

Double quotes are, naturally, twice as powerful. :-) They support a range of backslash escapes, but more importantly they allow for interpolation. This means that variables and closures can be placed within them, saving you from having to use the concatenation operator or other string formatting constructs so often. Here are some simple examples.

> say "Ooh look!\nLine breaks!"
Ooh look!
Line breaks!
> my $who = 'Ninochka'; say "Hello, dear $who"
Hello, dear Ninochka
> say "Hello, { prompt 'Enter your name: ' }!"
Enter your name: Jonathan
Hello, Jonathan!

The second example shows the interpolation of a scalar, and the third shows how closures can be placed inside double quoted strings also. The value the closure produces will be stringified and interpolated into the string. But what about all the other sigils besides “$”? The rule is that you can interpolate all of them, but only if they are followed by some kind of postcircumfix (that is, an array or hash subscript, parentheses to make an invocation, or a method call). In fact, you can put all of these on a scalar too.

> my @beer = <Chimay Hobgoblin Yeti>;
Chimay Hobgoblin Yeti
> say "First up, a @beer[0]"
First up, a Chimay
> say "Then @beer[1,2].join(' and ')!"
Then Hobgoblin and Yeti!
> say "Tu je &prompt('Ktore pivo chces? ')"
Ktore pivo chces? Starobrno
Tu je Starobrno

Here you can see interpolation of an array element, a slice that we then call a method on and even a function call. The postcircumfix rule happily means that we don’t go screwing up your email address any more.

> say "Please spam me at blackhole@jnthn.net"
Please spam me at blackhole@jnthn.net

Choose Your Own Delimiters

The single and double quotes are suitable for a bunch of cases, but what if you want to use a bunch of single or double quotes inside the string? Escaping them would rather suck. Thing is, you could probably make that argument about any choice of quoting characters. So instead of making the choice for you, Perl 6 lets you pick. The q and qq quote constructs expect to be followed by a delimiter. If it’s something with a matching closer, it will look for that (for example, if you use an opening curly then your string is terminated by a closing curly; note that there’s only a finite set of these, and no, it doesn’t include having a comet be terminated by a snowman). Otherwise it looks for the same thing to terminate the string. Note you can also use multi-character openers and closers too (but only by repeating the same character). Otherwise, the q gives you the same semantics as single quotes, and qq gives you the same semantics as double quotes.

> say q{C'est la vie}
C'est la vie
> say q{{Unmatched } and { are { OK } in { here}}
Unmatched } and { are { OK } in { here
> say qq!Lottery results: {(1..49).roll(6).sort}!
Lottery results: 12 13 26 34 36 46

Heredocs

All of the quoting constructs demonstrated so far allow you to include multiple lines of content. However, for that there’s usually a better way: here documents. There can be started with either q or qq, and then with the :to adverb being used to specify the string we expect to find, on a line of its own, at the end of the quoted text. Let’s see how this works, illustrated by a touching story.

print q:to/THE END/
    Once upon a time, there was a pub. The pub had
    lots of awesome beer. One day, a Perl workshop
    was held near to the pub. The hackers drank
    the pub dry. The pub owner could finally afford
    a vacation.
    THE END

The output of this script is as follows:

Once upon a time, there was a pub. The pub had
lots of awesome beer. One day, a Perl workshop
was held near to the pub. The hackers drank
the pub dry. The pub owner could finally afford
a vacation.

Notice how the text is not indented like in the program source. Heredocs remove indentation automatically, up to the indentation level of the terminator. If we’d used qq, we could have interpolated things into the heredoc. Note that this is all implemented by using the indent method on strings, but if your string doesn’t do any interpolation we do the call to indent at compile time as an optimization.

You can also have multiple heredocs, and even call methods on the data that will be located in the heredoc (note the call to lines in the following program).

my ($input, @searches) = q:to/INPUT/, q:to/SEARCHES/.lines;
    Once upon a time, there was a pub. The pub had
    lots of awesome beer. One day, a Perl workshop
    was held near to the pub. The hackers drank
    the pub dry. The pub owner could finally afford
    a vacation.
    INPUT
    beer
    masak
    vacation
    whisky
    SEARCHES

for @searches -> $s {
    say $input ~~ /$s/
        ?? "Found $s"
        !! "Didn't find $s";
}

The output of this program is:

Found beer
Didn't find masak
Found vacation
Didn't find whisky

Quote Adverbs for Custom Quoting Constructs

The single and double quote semantics, also available through q and qq, cover most cases. But what if you have a situation where you want to, say, interpolate closures but not scalars? This is where quote adverbs come in. They allow you to turn certain quoting features on and off. Here’s an example.

> say qq:!s"It costs $10 to {<eat nom>.pick} here."
It costs $10 to eat here.

Here, we use the semantics of qq, but then turn off scalar interpolation. This means we can write the price without worrying about it trying to interpolate the 11th capture of the last regex. Note that this is just using the standard colonpair syntax. If you want to start from a quote construct that supports basically nothing, and then just turn on some options, you can use the Q construct.

> say Q{$*OS\n&sin(3)}
$*OS\n&sin(3)
> say Q:s{$*OS\n&sin(3)}
MSWin32\n&sin(3)
> say Q:s:b{$*OS\n&sin(3)}
MSWin32
&sin(3)
> say Q:s:b:f{$*OS\n&sin(3)}
MSWin32
0.141120008059867

Here we start with a featureless quoting construct, then turn on extra features: first scalar interpolation, then backslash escapes, then function interpolation. Note that we could have chosen any delimiter we wished too.

Quote Constructs are Languages

Finally, it’s worth mentioning that when the parser enters a quoting construct, really it is switching to parsing a different language. When we build up quoting constructs from adverbs, really this is just mixing extra roles into the base quoting language to turn on extra features. For the curious, here’s how Rakudo does it. Whenever we hit a closure or some other interpolation, the language is temporarily switched back to the main language. This is why you can do things like:

> say "Hello, { prompt "Enter your name: " }!"
Enter your name: Jonathan
Hello, Jonathan!

And the parser doesn’t get terribly confused about the fact that the closure being interpolated contains another double quoted string. That is, we’re parsing the main language, then slip into a quoting language, then recurse into the main language again, and finally recurse into the quoting language again to parse the string in the closure in the string in the program. It’s like the Perl 6 parser wants to give us all matryoshka dolls for Christmas. :-)

Day 9 – Longest Token Matching

December 9, 2012

Perl 6 regular expressions prefer to match the longest alternative when possible.

say "food and drink" ~~ / foo | food /;   # food

This is in contrast to Perl 5, which would prefer the first alternative above, and produce the match “foo”.

You can still get the first-alternative behavior if you want; it’s tucked away in the slightly longer alternation operator ||:

say "food and drink" ~~ / foo || food /;  # foo

…And that’s it! That’s Longest Token Matching. ☺ Short post.

“Huh, wait!” I hear you exclaim, in a desperate attempt to make the daily Perl 6 Advent goodness last a bit longer. “Why is Longest Token Matching such a big deal? Who would ever be so obsessed with long tokens?”

I’m glad you asked. As it turns out, Longest Token Matching (or LTM for short) plays very well with our intuition about how things should be parsed. If you’re creating a language, you want people to be able to declare a variable forest_density without the mention of this variable clashing with the syntax of for loops. LTM will see to that.

I like “strange consistencies” — when distal parts of a language design turn out to have commonalities that make the language feel more uniform. There is that kind of consistency here, between classes and grammars. Perl 6 basically exploits that consistency to the max. Let me briefly map out what I mean.

We’re all used to writing classes at this point. From a birds-eye view, they look like this:

class {
    method
    method
    method
}

Grammars have a suspiciously similar structure:

grammar {
    rule
    rule
    rule
}

(The keywords are actually regex, token and rule, but when we talk about them as a group, we just call them “rules”.)

We’re also used to being able to derive classes into subclasses (class B is A), and add or override methods in a way which produces a nice mix of old and new behavior. Perl 6 provides multi methods which even allow you to add new methods of the same name, and the old ones won’t be overridden, they’ll just all try to match alongside the new methods. The dispatch is handled by a (usually autogenerated) proto method that dispatches to all eligible candidates.

What does all this have to do with grammars and rules? Well, it turns out that first off, you can derive new grammars from old ones. It works the same as deriving classes. (In fact, under the hood it’s exactly the same mechanism. Grammars are classes with a different metaclass object.) New rules will override old rules just like you’d expect with methods.

S05 has a cute example with parsing of letters, and deriving the grammar to parse formal letters:

    grammar Letter {
         rule text     { <greet> $<body>=<line>+? <close> }
         rule greet    { [Hi|Hey|Yo] $<to>=\S+? ',' }
         rule close    { Later dude ',' $<from>=.+ }
         token line    { \N* \n}
     }

     grammar FormalLetter is Letter {
         rule greet { Dear $<to>=\S+? ',' }
         rule close { Yours sincerely ',' $<from>=.+ }
     }

The derived FormalLetter overrides greet and close, but not line.

But what about all the goodness with multi methods? Could we define some kind of “proto rule” that would allow us to have several rules in a grammar with the same name but different bodies? For example, we might want to parse a language with a rule term, but there are many different terms: strings, numbers… and maybe the numbers can be decimal or binary or octal or hexadecimal…

Perl 6 grammars can contain a proto rule, and then you can define and redefine a rule with the same name as many times as you want. And now we’re back full circle with the / foo | food / alternation from the start of the article. All those rules you write with the same name compile down to one big alternation. Not only that — rules which call other rules, some of them possibly proto rules, all of that will be “flattened” out into one big LTM alternation. In practice that means that all the possible things a term can be are tried out all at once, on equal footing. Neither alternative wins because you happened to define it before the others. An alternative wins because it is the longest.

The strange consistency resides in the fact that in the call-a-method side of things, the most specific method wins, and “most specific” has to with signature narrowness. The better the types in the signature describe the arguments coming in, the more specific the method.

In the parse-with-a-rule side of things, the most specific rule wins, but here “most specific” has to do with parse success. The better the rule can describe what comes next in the text, the more specific the rule.

And that’s strangely consistent, because on the surface methods and rules look like quite different beasts.

We really believe we have something going with this whole principle of deriving a grammar and getting a new language. LTM is right at the center of that because it allows new rules and old to intermix in a fair and predictable way. It’s a kind of meritocracy: rules win not based on whether they’re young or old, but based on whether they are able to parse the text well.

In fact, the Perl 6 compiler itself works this way. It parses your program using a Perl 6 grammar, and that grammar is derivable… whenever you declare a new operator in your program, a new grammar is derived for you. The parsing of your operator is added as a new rule in the new grammar, and the new grammar is given the task of parsing the rest of your program. Your new operator will win against similar but shorter ones, and lose against similar but longer ones.

Day 8 – Panda package manager

December 8, 2012

Perl 6 is not just the language. While without modules it can do more than Perl 5, modules can make life easier. About two years ago neutro was discussed on this blog. I’m not going to talk about it, as it’s deprecated today.

Today, the standard way of installing modules is the panda utility. If you’re using Rakudo Star, you should have it already installed (try the panda command in console to check it). After running it and waiting a few seconds, you should see help for the panda utility.

$ panda
Usage:
  panda [--notests] [--nodeps] install [ ...] -- Install the specified modules
  panda [--installed] [--verbose] list -- List all available modules
  panda update -- Update the module database
  panda info [ ...] -- Display information about specified modules
  panda search  -- Search the name/description

As you can see, it doesn’t have many options (it’s actually similar to RubyGems or cpanminus in its simplicity). You can see the current list of modules at Perl 6 Modules page. Let’s say you would want to parse an INI file. First, you can find module for it using the search command.

$ panda search INI
JSON::Tiny               *          A minimal JSON (de)serializer
Config::INI              *          .ini file parser and writer module for
                                    Perl 6
MiniDBI                  *          a subset of Perl 5 DBI ported to Perl 6
                                    to use while experts build the Real Deal
Class::Utils             0.1.0      Small utilities to help with defining
                                    classes

Config::INI is module you want. Other modules were found because my query wasn’t specific enough and found “ini” in other words (minimal, MiniDBI and defining). Config::INI isn’t part of Rakudo Star, so you have to install it.

Panda installs modules globally when you writing access to the installation directory, locally otherwise. Because of that you can use panda even when Perl 6 is installed globally without installing modules like local::lib, like you have to in Perl 5.

$ panda install Config::INI
==> Fetching Config::INI
==> Building Config::INI
Compiling lib/Config/INI.pm
Compiling lib/Config/INI/Writer.pm
==> Testing Config::INI
t/00-load.t .... ok
t/01-parser.t .. ok
t/02-writer.t .. ok
All tests successful.
Files=3, Tests=55, 3 wallclock secs ( 0.04 usr 0.00 sys + 2.38 cusr 0.14 csys = 2.56 CPU)
Result: PASS
==> Installing Config::INI
==> Succesfully installed Config::INI

After the module has been installed, you can update it as easily – by installing it. Currently panda cannot automatically upgrade modules, but after a module has been updated (you can watch repositories on GitHub to know when it happens – every module is available on GitHub), you can easily upgrade it by reinstalling the module.

When a module was installed, you can check if it works by trying to use it. This is a sample script that can be used to convert INI file into a Perl 6 data structure.

#!/usr/bin/env perl6
use Config::INI;
multi sub MAIN($file) {
    say '# your INI file as seen by Perl 6';
    say Config::INI::parse_file($file).perl;
}

Day 7 – MIME::Base64 – On encoded strings

December 7, 2012

parrot MIME::Base64 FixedIntegerArray: index out of bounds!

Ronaldxs created the following parrot ticket #813 4 months ago:

“Was playing with p6 MIME::Base64 and utf8 sampler page when I came across this. It seems that the parrot MIME Base64 library can’t handle some UTF-8 characters as demonstrated below.”

.sub go :main
    load_bytecode 'MIME/Base64.pbc'

    .local pmc enc_sub
    enc_sub = get_global [ "MIME"; "Base64" ], 'encode_base64'

    .local string result_encode
    result_encode = enc_sub(utf8:"\x{203e}")

    say result_encode
.end

FixedIntegerArray: index out of bounds!
current instr.: 'parrot;MIME;Base64;encode_base64'
pc 163 (runtime/parrot/library/MIME/Base64.pir:147)

called from Sub 'go' pc 11 (die_utf8_base64.pir:8)

This was interesting, because parrot strings store the encoding information in the string. The user does not need to store the string encoding information somewhere else as in perl5, nor have to do educated guesses about the encoding. parrot supports ascii, latin1, binary, utf-8, ucs-2, utf-16 and ucs-4 string encodings natively.
So we thought we the hell cannot parrot handle simple utf-8 encoded strings?

As it turned out, the parrot implementation of MIME::Base64, which can be shared to all languages which use parrot as VM, stored the character codepoints for each character as array of integers. On multibyte encodings such as UTF-8 this leads to different data held in memory than a normal multibyte string which is encoded as the byte buffer and the additional encoding information.

Internal string representations

For example an overview of different internal string representations for the utf-8 string “\x{203e}”:

perl5 strings:

len=3, utf-8 flag, "\342\200\276" buf=[e2 80 be]

parrot strings:

len=1, bufused=3, encoding=utf-8, buf=[e2 80 be]

The Unicode tables:

U+203E	‾	e2 80 be	OVERLINE

gdb perl5

Let’s check it out:

$ gdb --args perl -e'print "\x{203e}"'
(gdb) start
(gdb) b Perl_pp_print
(gdb) c
(gdb) n 

.. until if (!do_print(*MARK, fp))

(gdb) p **MARK
$1 = {sv_any = 0x404280, sv_refcnt = 1, sv_flags = 671106052, sv_u = {
      svu_pv = 0x426dd0 "‾", svu_iv = 4353488, svu_uv = 4353488, 
      svu_rv = 0x426dd0, svu_array = 0x426dd0, svu_hash = 0x426dd0, 
      svu_gp = 0x426dd0, svu_fp = 0x426dd0}, ...}

(gdb) p Perl_sv_dump(*MARK)
ALLOCATED at -e:1 for stringify (parent 0x0); serial 301
SV = PV(0x404280) at 0x4239a8
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK,UTF8)
  PV = 0x426dd0 "\342\200\276" [UTF8 "\x{203e}"]
  CUR = 3
  LEN = 16
$2 = void

(gdb) x/3x 0x426dd0
0x426dd0:	0xe2	0x80	0xbe

We see that perl5 does store the utf-8 flag, but not the length of the string, the utf8 length (=1), only the length of the buffer (=3).
Any other multi-byte encoded string, such as UCS-2 is stored differently. We suppose as utf-8.

We are already in the debugger, so let’s try the different cmdline argument.

(gdb) run -e'use Encode; print encode("UCS-2", "\x{203e}")'
  The program being debugged has been started already.
  Start it from the beginning? (y or n) y
Breakpoint 2, Perl_pp_print () at pp_hot.c:712
712	    dVAR; dSP; dMARK; dORIGMARK;

(gdb) p **MARK

$3 = {sv_any = 0x404b30, sv_refcnt = 1, sv_flags = 541700, sv_u = {
    svu_pv = 0x563a50 " >", svu_iv = 5651024, svu_uv = 5651024, 
    svu_rv = 0x563a50, svu_array = 0x563a50, svu_hash = 0x563a50, svu_gp = 0x563a50, 
    svu_fp = 0x563a50}, ...}

(gdb) p Perl_sv_dump(*MARK)
ALLOCATED at -e:1 by return (parent 0x0); serial 9579
SV = PV(0x404b30) at 0x556fb8
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x563a50 " >"
  CUR = 2
  LEN = 16
$4 = void

(gdb) x/2x 0x563a50
0x563a50:	0x20	0x3e

But we don’t see the UTF8 flag in encode(“UCS-2″, “\x{203e}”), just the simple ascii string ” >”, which is the UCS-2 representation of [20 3e].
Because ” >” is perfectly representable as non-utf8 ASCII string.
UCS-2 is much much nicer than UTF-8, is has a fixed size, it is readable, Windows uses it, but it cannot represent all Unicode characters.

Encode::Unicode contains this nice cheatsheet:

Quick Reference
                Decodes from ord(N)           Encodes chr(N) to...
       octet/char BOM S.P d800-dfff  ord > 0xffff     \x{1abcd} ==
  ---------------+-----------------+------------------------------
  UCS-2BE       2   N   N  is bogus                  Not Available
  UCS-2LE       2   N   N     bogus                  Not Available
  UTF-16      2/4   Y   Y  is   S.P           S.P            BE/LE
  UTF-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
  UTF-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
  UTF-32        4   Y   -  is bogus         As is            BE/LE
  UTF-32BE      4   N   -     bogus         As is       0x0001abcd
  UTF-32LE      4   N   -     bogus         As is       0xcdab0100
  UTF-8       1-4   -   -     bogus   >= 4 octets   \xf0\x9a\af\8d
  ---------------+-----------------+------------------------------

gdb parrot

Back to parrot:

If you debug parrot with gdb you get a gdb pretty-printer thanks to Nolan Lum, which displays the string and encoding information automatically.
In perl5 you have to call Perl_sv_dump with or without the my_perl as first argument, if threaded or not. With a threaded perl, e.g. on Windows you’d need to call p Perl_sv_dump(my_perl, *MARK).
In parrot you just ask for the value and the formatting is done with a gdb pretty-printer plugin.
The string length is called strlen (of the encoded string), the buffer size is called bufused.

Even in a backtrace the string arguments are displayed abbrevated like this:

#3  0x00007ffff7c29fc4 in utf8_iter_get_and_advance (interp=0x412050, str="utf8:� [1/2]", 
    i=0x7fffffffdd00) at src/string/encoding/utf8.c:551
#4  0x00007ffff7a440f6 in Parrot_str_escape_truncate (interp=0x412050, src="utf8:� [1/2]",
    limit=20) at src/string/api.c:2492
#5  0x00007ffff7b02fb3 in trace_op_dump (interp=0x412050, code_start=0x63a1c0, pc=0x63b688)
    at src/runcore/trace.c:450

[1/2] means strlen=1 bufused=2
Each non-ascii or non latin-1 encoded string is printed with the encoding prefix.
Internally the encoding is of course a index or pointer in the table of supported encodings.

You can set a breakpoint to utf8_iter_get_and_advance and watch the strings.

(gdb) r t/library/mime_base64u.t
Breakpoint 1, utf8_iter_get_and_advance (interp=0x412050, str="utf8:\\x{00c7} [8/8]", 
                i=0x7fffffffcd40) at src/string/encoding/utf8.c:544
(gdb) p str
$1 = "utf8:\\x{00c7} [8/8]"
(gdb) p str->bufused 
$3 = 8
(gdb) p str->strlen
$4 = 8
(gdb) p str->strstart
$5 = 0x5102d7 "\\x{00c7}"

This is escaped. Let’s advance to a more interesting utf8 string in this test, i.e. until str=”utf8:Ā [1/2]“
You get the members of a struct with tab-completion, i.e. press <TAB> after p str->

(gdb) p str->
_buflen    _bufstart  bufused    encoding   flags      hashval    strlen     strstart
(gdb) p str->strlen
$9 = 8

(gdb) dis 1
(gdb) b utf8_iter_get_and_advance if str->strlen == 1
(gdb) c
Breakpoint 2, utf8_iter_get_and_advance (interp=0x412050, str="utf8:Ā [1/2]", 
                i=0x7fffffffcd10) at src/string/encoding/utf8.c:544
544	    ASSERT_ARGS(utf8_iter_get_and_advance)

(gdb) p str->strlen
$10 = 1
(gdb) p str->strstart
$11 = 0x7ffff7faeb58 "Ā"
(gdb) x/2x str->strstart
0x7ffff7faeb58:	0xc4	0x80
(gdb) p str->encoding
$12 = (const struct _str_vtable *) 0x7ffff7d882e0
(gdb) p *str->encoding

$13 = {num = 3, name = 0x7ffff7ce333f "utf8", name_str = "utf8", bytes_per_unit = 1,
  max_bytes_per_codepoint = 4, to_encoding = 0x7ffff7c292b0 <utf8_to_encoding>, chr =
  0x7ffff7c275c0 <unicode_chr>, equal = 0x7ffff7c252e0 <encoding_equal>, compare =
  0x7ffff7c254e0 <encoding_compare>, index = 0x7ffff7c25690 <encoding_index>, rindex
  = 0x7ffff7c257a0 <encoding_rindex>, hash = 0x7ffff7c25a20 <encoding_hash>, scan =
  0x7ffff7c29380 <utf8_scan>, partial_scan = 0x7ffff7c29460 <utf8_partial_scan>, ord
  = 0x7ffff7c297e0 <utf8_ord>, substr = 0x7ffff7c25de0 <encoding_substr>, is_cclass =
  0x7ffff7c26000 <encoding_is_cclass>, find_cclass =
  0x7ffff7c260e0 <encoding_find_cclass>, find_not_cclass =
  0x7ffff7c26220 <encoding_find_not_cclass>, get_graphemes =
  0x7ffff7c263d0 <encoding_get_graphemes>, compose =
  0x7ffff7c27680 <unicode_compose>, decompose = 0x7ffff7c26450 <encoding_decompose>,
  upcase = 0x7ffff7c27b20 <unicode_upcase>, downcase =
  0x7ffff7c27be0 <unicode_downcase>, titlecase = 0x7ffff7c27ca0 <unicode_titlecase>,
  upcase_first = 0x7ffff7c27d60 <unicode_upcase_first>, downcase_first =
  0x7ffff7c27dc0 <unicode_downcase_first>, titlecase_first =
  0x7ffff7c27e20 <unicode_titlecase_first>, iter_get =
  0x7ffff7c29c40 <utf8_iter_get>, iter_skip = 0x7ffff7c29d60 <utf8_iter_skip>,
  iter_get_and_advance = 0x7ffff7c29eb0 <utf8_iter_get_and_advance>,
  iter_set_and_advance = 0x7ffff7c29fd0 <utf8_iter_set_and_advance>}

encode_base64(str)

$ perl -MMIME::Base64 -lE'$x="20e3";$s="\x{20e3}";
  printf "0x%s\t%s=> %s",$x,$s,encode_base64($s)'
Wide character in subroutine entry at -e line 1.

Oops, I’m clearly a unicode perl5 newbie. Does my term not understand utf-8?

$ echo $TERM
xterm

No, it should. encode_base64 does not understand unicode.
perldoc MIME::Base64
“The base64 encoding is only defined for single-byte characters. Use the Encode module to select the byte encoding you want.”

Oh my! But it is just perl5. It just works on byte buffers, not on strings.
perl5 strings can be utf8 and non-utf8. Why on earth an utf8 encoded string is disallowed and only byte buffers of unknown encodings are allowed goes beyond my understanding, but what can you do. Nothing. base64 is a binary only protocol, based on byte buffers. So we decode it manually to byte buffers. The Encode API for decoding is called encode.

$ perl -MMIME::Base64 -MEncode -lE'$x="20e3";$s="\x{20e3}";
  printf "0x%s\t%s=> %s",$x,$s,encode_base64(encode('utf8',$s))'
Wide character in printf at -e line 1.
0x20e3	=> 4oOj

This is now the term warning I know. We need -C

$ perldoc perluniintro

$ perl -C -MMIME::Base64 -MEncode -lE'$x="20e3";$s="\x{20e3}";
  printf "0x%s\t%s=> %s",$x,$s,encode_base64(encode('utf8',$s))'
0x20e3	=> 4oOj

Over to rakudo/perl6 and parrot:

$ cat >m.pir << EOP
.sub main :main
    load_bytecode 'MIME/Base64.pbc'
    $P1 = get_global [ "MIME"; "Base64" ], 'encode_base64'
    $S1 = utf8:"\x{203e}"
    $S2 = $P1(s1)
    say $S1
    say $S2
.end
EOP

$ parrot m.pir
FixedIntegerArray: index out of bounds!
current instr.: 'parrot;MIME;Base64;encode_base64'
                pc 163 (runtime/parrot/library/MIME/Base64.pir:147)

The perl6 test, using the parrot library, from https://github.com/ronaldxs/perl6-Enc-MIME-Base64/

$ git clone git://github.com/ronaldxs/perl6-Enc-MIME-Base64.git
Cloning into 'perl6-Enc-MIME-Base64'...

$ PERL6LIB=perl6-Enc-MIME-Base64/lib perl6 <<EOP
use Enc::MIME::Base64;
say encode_base64_str("\x203e");
EOP

> use Enc::MIME::Base64;
Nil
> say encode_base64_str("\x203e");
FixedIntegerArray: index out of bounds!
...

The pure perl6 workaround:

$ PERL6LIB=perl6-Enc-MIME-Base64/lib perl6 <<EOP
use PP::Enc::MIME::Base64;
say encode_base64_str("\x203e");
EOP

> use PP::Enc::MIME::Base64;
Nil
> say encode_base64_str("\x203e");
4oC+

Wait. perl6 creates a different enoding than perl5?
What about coreutils base64 command.

$ echo -n "‾" > m.raw
$ od -x m.raw
0000000 80e2 00be
0000003
$ ls -al m.raw
-rw-r--r-- 1 rurban rurban 3 Dec  6 10:23 m.raw
$ base64 m.raw
4oC+

[80e2 00be] is the little-endian version of [e2 80 be], 3 bytes, flipped.
Ok, at least base64 agrees with perl6, and I must have made some encoding mistake with perl5.

Back to debugging our parrot problem:

parrot unlike perl6 has no debugger yet. So we have to use gdb, and we need to know in which function the error occured. We use the parrot -t trace flag, which is like the perl5 debugging -Dt flag, but it is always enabled, even in optimized builds.

$ parrot --help
...
    -t --trace [flags] 
    --help-debug
...
$ parrot --help-debug
...
--trace -t [Flags] ...
    0001    opcodes
    0002    find_method
    0004    function calls

$ parrot -t7 m.pir
...
009f band I9, I2, 63         I9=0 I2=0 
00a3 set I10, P0[I5]         I10=0 P0=FixedIntegerArray=PMC(0xff7638) I5=[2063]
016c get_results PC2 (1), P2 PC2=FixedIntegerArray=PMC(0xedd178) P2=PMCNULL
016f finalize P2             P2=Exception=PMC(0x16ed498)
0171 pop_eh
lots of error handling
...
0248 callmethodcc P0, "print" P0=FileHandle=PMC(0xedcca0) 
FixedIntegerArray: index out of bounds!

We finally see the problem, which matches the run-time error.

00a3 set I10, P0[I5]         I10=0 P0=FixedIntegerArray=PMC(0xff7638) I5=[2063]

We want to set I10 to the I5=2063′th element in the FixedIntegerArray P0, and the array is not big enough.

After several hours of analyzing I came to the conclusion that the parrot library MIME::Base64 was wrong by using ord of every character in the string. It should use a bytebuffer instead.
Which was fixed with commit 3a48e6. ord can return integers > 255, but base64 can only handle chars < 255.

The fixed parrot library was now correct:

$ parrot m.pir
‾
4oC+

But then the tests started failing. I spent several weeks trying to understand why the parrot testsuite was wrong with the mime_base64 tests, the testdata came from perl5. I came up with different implementation hacks which would match the testsuite, but finally had to bite the bullet, changing the tests to match the implementation.

And I had to special case the tests for big-endian, as base64 is endian agnostic. You cannot decode a base64 encoded powerpc file on an intel machine, when you use multi-byte characters. And utf-8 is even more multi-byte than ucs-2. I had to accept the fact the big-endian will return a different encoding. Before the results were the same. The tests were written to return the same encoding on little and big-endian.

Summary

The first reason why I wrote this blog post was to show how to debug into crazy problems like this, when you are not sure if the core implementation, the library, the spec or the tests are wrong. It turned out, that the library and the tests were wrong.
You saw how easily you could use gdb to debug into such problems, as soon as you find out a proper breakpoint.

The internal string representations looked like this:

MIME::Base64 internally:

len=1, encoding=utf-8, buf=[3e20]

and inside the parrot imcc compiler the SREG

len=8, buf="utf-8:\"\x{203e}\""

parrot is a register based runtime, and a SREG is the string representation of the register value. Unfortunately a SREG cannot hold the encoding info yet, so we prefix the encoding in the string, and unquote it back. This is not the reason why parrot is still slower than the perl5 VM. I benchmarked it. parrot still uses too much sprintf’s internally and the encoding quote/unquoting counts only for a 4th of the time of the sprintf gyrations.
And parrot function calls are awfully slow and de-optimized.

The second reason is to explain the new decode_base64() API, which only parrot – and therefore all parrot based languages like rakudo – now have got.

decode_base64(str, ?:encoding)

“Decode a base64 string by calling the decode_base64() function.
This function takes as first argument the string to decode, as optional second argument the encoding string for the decoded data.
It returns the decoded data.

Any character not part of the 65-character base64 subset is silently ignored.
Characters occurring after a ‘=’ padding character are never decoded.”

So decode_base64 got now a second optional encoding argument. The src string for encode_base64 can be any encoding and is automatically decoded to a bytebuffer. You can easily encode an image or unicode string without any trouble, and for the decoder you can define the wanted encoding beforehand. The result can be the encoding binary or utf-8 or any encoding you prefer, no need for additional decoding of the result. The default encoding of the decoded string is either ascii, latin-1 or utf-8. parrot will upgrade the encoding automatically.

You can compare the new examples of pir against the perl5 version:

parrot:

.sub main :main
    load_bytecode 'MIME/Base64.pbc'

    .local pmc enc_sub
    enc_sub = get_global [ "MIME"; "Base64" ], 'encode_base64'

    .local string result_encode
    # GH 814
    result_encode = enc_sub(utf8:"\x{a2}")
    say   "encode:   utf8:\"\\x{a2}\""
    say   "expected: wqI="
    print "result:   "
    say result_encode

    # GH 813
    result_encode = enc_sub(utf8:"\x{203e}")
    say   "encode:   utf8:\"\\x{203e}\""
    say   "expected: 4oC+"
    print "result:   "
    say result_encode

.end

perl5:

use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(encode);

my $encoded = encode_base64(encode("UTF-8", "\x{a2}"));
print  "encode:   utf-8:\"\\x{a2}\"  - ", encode("UTF-8", "\x{a2}"), "\n";
print  "expected: wqI=\n";
print  "result:   $encoded\n";
print  "decode:   ",decode_base64("wqI="),"\n\n"; # 302 242

my $encoded = encode_base64(encode("UTF-8", "\x{203e}"));
print  "encode:   utf-8:\"\\x{203e}\"  -> ",encode("UTF-8", "\x{203e}"),"\n";
print  "expected: 4oC+\n";
print  "result:   $encoded\n"; # 342 200 276
print  "decode:   ",decode_base64("4oC+"),"\n";

for ([qq(a2)],[qq(c2a2)],[qw(203e)],[qw(3e 20)],[qw(1000)],[qw(00c7)],[qw(00ff 0000)]){
    $s = pack "H*",@{$_};
    printf "0x%s\t=> %s", join("",@{$_}), encode_base64($s);
}

perl6:

use Enc::MIME::Base64;
say encode_base64_str("\xa2");
say encode_base64_str("\x203e");

Day 6 – Lexical Imports

December 6, 2012

Perl 6 is built on lexical scopes. Variables, subroutines, constants and even types are looked up lexically first, and subroutines are only looked up in lexical scopes.

So it is only fitting that importing symbols from modules is also done into lexical scopes. I often write code such as

    use v6;

    # the main functionality of the script
    sub deduplicate(Str $s) {
        my %seen;
        $s.comb.grep({!%seen{ .lc }++}).join;
    }

    # normal call
    multi MAIN($phrase) {
        say deduplicate($phrase)
    }

    # if you call the script with --test, it runs its unit tests
    multi MAIN(Bool :$test!) {
        # imports &plan, &is etc. only into the lexical scope
        use Test;
        plan 2;
        is deduplicate('just some words'),
            'just omewrd', 'basic deduplication';
        is deduplicate('Abcabd'),
            'Abcd', 'case insensitivity';
    }

This script removes all but the first occurrence of each character given on the command line:

    $ perl6 deduplicate 'Duplicate character removal'
    Duplicate hrmov

But if you call it with the --test option, it runs its own unit tests:

    $ perl6 deduplicate --test
    1..2
    ok 1 - basic deduplication
    ok 2 - case insensitivity

Since the testing functions are only necessary in a part of the program — in a lexical scope, to be more precise –, the use statement is inside that scope, and limits the visibility of the imported symbols to this scope. So if you try to use the is function outside the routine in which Test is used, you get a compile-time error.

Why, you might ask? From the programmer's perspective, it reduces risk of (possibly unintended and unnoticed) name clashes the same way that lexical variables are safer than global variables.

From the point of view of language design, the combination of lexical importing, runtime-immutable lexical scopes and lexical-only lookup of subroutines allows resolving subroutine names at compile time, which again allows neat stuff like detecting calls to undeclared functions, compile-time type checking of arguments, and other nice optimizations.

But subroutines are only the tip of the iceberg. Perl 6 has a very flexible syntax, which you can modify with custom operators and macros. Those too can be exported, and imported into lexical scopes. Which means that language modifications are also lexically by default. So you can safely load any language-modifying extension, without running into danger that a library you use can't cope with it — the library doesn't even see the language modification.

So ultimately, lexical importing is another facet of encapsulation.

Day 5 – A Perl 6 Debugger

December 5, 2012

There’s much more to the developer experience of a language than its design, features and implementations. While the language and its implementations are perhaps the thing developers will spend most time with, the overall experience will also involve interaction with the community, reading documentation, using modules and employing various development tools. Thus, it’s important that Perl 6 make progress on these fronts too. Over the past year, we’ve taken some good steps forward in these areas; there’s now doc.perl6.org, the module ecosystem has grown, and the module installation tooling has improved. Another big step forward with regards to tooling – and the topic of this post – is that an interactive Perl 6 debugger is now available.

Running With The Debugger

The debugger has been included with the last few Rakudo * releases. If you have one of those, you’re all set. Just run perl6-debug instead of perl6. It takes the same set of options, so if your normal invocation involves, for example, using the -I flag to set the include path for modules, it’ll Just Work Like Usual. Of course, what happens next is entirely different. The debugger will show you each module it is loading, followed by placing you at the first interesting statement of the program, highlighted in yellow.

dbg0

Note how it takes care to put you on the first line that actually does something, skipping the my statement above it (it’s getting increasingly smart about this).

The Basics

Hitting enter allows you to single-step through the program. At any point, you can look at variables, call methods on variables, or even evaluate expressions.

dbg1

If you want to move statement by statement, but never descend into a function call or method call, type an s, followed by enter. To step out of the current sub (that is, run until it returns then break in its caller), use so to step out. To run the program until it hits an exception, just use r. Even at the point you get an exception, you can still access variables to try and dissect what went wrong.

dbg2

One final variant, rt, will run until an exception is thrown, but handled. You’ll break at the point of the throw. This means you’re not disadvantaged in the debugger if you took care to handle exceptions well in your program; you can still break when they are thrown and use the debugger to help understand why. :-)

Breakpoints

Sometimes, you know exactly where the juicy stuff happens in your program that you wish to debug. If only you could just run until you got there. Turns out you can – that’s what breakpoints are for. We can add one, use r to run, and it will stop where we placed the breakpoint.

dbg3

Note that you don’t have to type out the full name of the file you want to put the breakpoint in; any unambiguous substring of the name of a file that is loaded will be sufficient.

I won’t cover them here, but there are also tracepoints, which instead of breaking will log the value of an expression each time a certain place in the program is hit. Later, you can display the log. It’s like adding print statements, but without the print statement going in your code, removing the risk of them accidentally making it into a commit (‘cus we’ve all done that one, right? :-))

Regex and Grammar Debugging

When the debugger detects you are in a regex or grammar, it offers a little extra help. As well as allowing you to single-step your way through the regex, atom by atom, it also displays the match text, indicating what has been matched so far.

dbg4

Here, you can see that the pattern already successfully matched SELECT, and is now looking for a literal * or will try to call the field list rule. If in a regex, which may backtrack, the match position  jumps backwards when backtracking happens, so you can understand the backtracking behavior of the pattern.

Yes, Perl 5 Regexes Too!

Rakudo has some support for the :P5 adverb on regexes, which allows use of the Perl 5 regex syntax. Here the debugger is used in REPL mode (where you enter an expression, then can immediately debug it) to explore the difference between alternations in Perl 5 and Perl 6 (in Perl 5 they go left to right, in Perl 6 they have longest token matching semantics, such that it tries the thing that will match most characters first).

dbg5

The debugger in REPL mode is great for exploring and understanding how things will execute (and as such can serve as a learning or teaching aid). Another use is for debugging modules without having to write a test script; just write a use statement in the debugger or you can even supply the module using the -M command line flag and the debugger will load it!

What About Funky Stuff, Like Macros?

That is, macros, BEGIN time, eval, and those other things where your Perl 6 program does the time warp again, doing a bit of runtime at compile time or compiling some more stuff at runtime. The debugger is built for it. If a macro is applied, the debugger will place you in it. Notice below how we’re still in the process of loading the second file, and did not get to the third yet – we really are debugging at BEGIN time!

dbg6

Any lines of code that are stripped out by the macro are simply never hit at runtime. And what about statements in quasi blocks? The debugger will take you there, so you not only know what macro was applied, but can dig into exactly what it does too.

dbg7

Just as bits of runtime happening at compile time work out fine, any code that gets eval‘d at runtime is also compiled with debug hooks, meaning that you can step straight into it and debug the evaluated code.

Written in Perl 6 and NQP!

You might think that writing a debugger must involve all kinds of low-level hackery. In fact, that’s not the case. The debug hooks mechanism is written in NQP, and the command line user interface is written in Perl 6. This is significant from a couple of angles. The first is the fact that we can write something like this without breaking the encapsulation of the compiler, but instead just by subclassing the Grammar, Actions and Compiler objects and twiddling with the AST. In fact, the debugger was built without any changes being required to Rakudo as it already existed. This provides important feedback on our compiler architecture – this time, very positive feedback. Things are extensible in the ways they were designed to be. The second is that writing so much of it in Perl 6 is a healthy bit of dogfooding – using the product in order to build further products. My hope is that, since most of what people would want to change is actually written in the Perl 6 part, it will feel quite hackable by the community at large.

And What Of Future Plans?

Various features are still to come: conditional breakpoints, dumping tracepoint output to a file, showing the path taken through a grammar to get to the current point, and various bits of configurability. The command line interface is nice, but of course having some extra options would be even nicer. I’m interested in a web-based interface, but also in integration with tools like Padre. There’s some work afoot on a common protocol for these things, which could make such integration possible without having to re-invent too many wheels. In the meantime, having an interactive debugger which is aware of and works well with a wide range of Perl 6 language features is a solid step forward. Happy debugging, and feature ideas (or patches ;-)) are welcome; here’s the GitHub repo!


Follow

Get every new post delivered to your Inbox.

Join 36 other followers