Day 11 – Markov Sequence

On Day 4, I teased with mention of non-numeric sequences. Today I’d like to explore one such sequence, based on the common idea of using Markov chains on text. We will determine the next letter in the sequence randomly based on the previous two letters. The distribution follows the patterns contained in a model text, so that the result approximates the language of the model.

use v6;
use List::Utils;

my $model-text = $*;
$model-text .=subst(/<[_']>/, "", :global);
$model-text .=subst(/<-alpha>+/, " ", :global);

my %next-step;
for sliding-window($model-text.comb, 3) -> $a, $b, $c {
    %next-step{$a ~ $b}{$c}++;

my $first = $model-text.substr(0, 1);
my $second = $model-text.substr(1, 1);
my @chain := $first, $second, -> $a, $b { %next-step{$a ~ $b}.roll.key } ... *;
say @chain.munch(80);

After the initial use statements, the code divides neatly into three sections. The first section inputs the model text and gets rid of the non-alphabetic characters. Line 4 uses slurp to read standard input ($*IN) into a single string, and lc makes it all lowercase. The first subst (line 5) removes all underscores and apostrophes from the text. The second (line 6) replaces each string of non-alphabetic characters with a single space.

The second section combines the sliding-window sub from List::Utils with a bit of the good old Perl hash magic. (You can get List::Utils using neutro.)

$model-text.comb splits the text into individual characters.

sliding-windows iterates through a list, giving you the next N (3, in this case) elements in the list starting with each element in the list. (That is, you get the 1st, 2nd, and 3rd elements, then the 2nd, 3rd, and 4th elements, then the 3rd, 4th, and 5th elements, etc.) In this case we use it to get every set of three consecutive characters in the text.

In that loop, we construct a hash table of hash tables. The keys to the outer are the first two of those three consecutive characters. The keys to the inner are the third consecutive character, and its value is how many times that character follows the first two. So, for instance, if I feed the lyrics to the Aqualung album into this program, then %next-step{"qu"} looks like this:

{"a" => 5, "e" => 2}

That is to say, if you have “q” and “u”, they are followed by “a” five times (presumably all in the name Aqualung) and “e” twice (“requests” and “question”).

The third section of the code, then uses the knowledge we’ve just accumulated to build the sequence. First we get the first and second characters of the model space, just so we can start with a pair of letters we know for sure will have a letter coming after them. Then we construct a sequence which starts with those two, and uses -> $a, $b { %next-step{$a ~ $b}.roll } as a generator. The generator uses the previous two characters in the sequence to look up the appropriate frequency hash for the next character. The roll method randomly returns one of the keys of that hash, weighted by the value. (In the “qu” example above, you could think of this as rolling a seven-sided die, five of whose faces say “a” and two “e”.) If there is no known character which follows the previous two (for instance, if the last two characters in the model text are a pair unique in the text and you reach them both in order), then an undefined value is returned, which stops the sequence. We get the first 80 characters from the sequence using the munch method, chosen because it is well-behaved if the sequence terminates early.

Running the script on the lyrics to Aqualung produces sequences like
“t carealven thead you he sing i withe and upon a put saves pinsest to laboonfeet” and “t steall gets sill a creat ren ther he crokin whymn the gook sh an arlieves grac”. (Why does it start with “t “? My Aqualung lyrics file starts with some ASCII-art apparently attempting to imitate the original liner notes, and when we removed the non-alphabetic characters it boils down to “t “.)

Note that nowhere here does this script make assumptions about the character set it is working with. Anything that Perl 6 recognizes as alphabetic can be processed this way. If you feed it the standard “Land der Berge” file that p6eval uses as stdin, you’ll get strings like “laß in ber bist brüften las schören zeites öst froher land der äckerzeichöne lan”. (Fingers crossed that WordPress and your browser handle the non-ASCII characters correctly!)

One word of warning: As I was finishing this, the #perl6 channel raised the question of what Hash.roll should actually do. Right now in Rakudo it has the functionality of the (not yet implemented) KeyBag.roll method. Once it is implemented, KeyBag could be substituted if Hash.roll ends up spec’d differently.

There is a simple alternative which works today, however. If you change to %next-step{$a ~ $b}{$c}++ to %next-step.push($a ~ $b, $c), %next-step will be constructed as a Hash of Arrays. Each array will list all the characters $c which appear after $a and $b, with each distinct character repeated the number of times it appears in the file. This naturally acts as a weighting for .roll to use in the sequence generator.

I need to give a big thank you to Moritz Lenz, who was a huge help cleaning up and simplifying this script.