Quick (rhetorical) question: how many of you either try your best to ignore Unicode, or groan at the thought of having to deal with it again?
It’s fair, after all, considering Unicode is big. Really big. (You may think it’s a long walk down the ASCII table, but that’s peanuts compared to space Unicode.) It certainly doesn’t help that many languages, particularly older ones, don’t help you, the average programmer, work with it all that well. Either they don’t deal with encoding standards at all, meaning some familiarity is mandatory, or certain other languages claim to support it but really just balk once you get past the BMP (the codepoints that can fit in a 16-bit number).
Perl 6, as you might guess, does handle Unicode well. Telling the story properly takes a twofold approach: half of it is how to process Unicode text, and half is how to use Unicode syntax. Let’s start with the one more likely to be of concern when actually programming, that of…
How do I Handle Unicode Text?
No matter your level of experience in handling Unicode (or anything involving different encodings), you’ll be pleased to learn that in Perl 6, it goes just about the way you’d expect.
Perl 6’s strings are interesting in that they by default work on the notion of graphemes — a collection of codepoints that look like a distinct thing; what you’d call a “character” if you didn’t know better. Not every distinct “character” you could come up with has its own codepoint in the standard, so usually handling visual elements naturally can be quite painful.
However, Perl 6 does this work for you, keeping track of these collections of codepoints internally, so that you just have to think in terms of what you would see the characters as. If you’ve ever had to dance around with substring operations to make sure you didn’t split between a letter and a diacritic, this will be your happiest day in programming.
As an example, here’s a devanagari syllable in a string. The .codes method returns the number of codepoints in the string, while .chars returns the number of characters (aka graphemes):

say "नि".codes; # returns 2
say "नि".chars; # returns 1
Even though there isn’t a singular assigned codepoint for this syllable, Perl 6 still treats it as one character, suiting any purpose that doesn’t involve messing with the text at a lower level.
That’s cool, but does it matter much to me, a simple English-speaking programmer who’s never had to deal with other languages or scripts?, I can imagine some of you thinking. And the answer is yes, because regardless of your background, there is most definitely one grapheme you’ve encountered before:
say "\r\n".chars; # returns 1
Yep, the Windows end-of-line sequence is explicitly counted by Unicode’s “extended grapheme cluster” definition as one grapheme.
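To see that play out in practice, here’s a quick sketch (outputs checked on a recent Rakudo) showing that string operations count the sequence as one unit:

```raku
say "a\r\nb".chars;  # 3 — "a", "\r\n", and "b" each count as one grapheme
say "a\r\nb".lines;  # (a b)
```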
And of course it’s not just looks, that’s how operations on strings work:
say "नि\r\n".substr(1,1).perl # returns "\r\n"
Of course, that’s all just for the default Str type. If you don’t want to work at the grapheme level, you have several other string types to choose from: if you’re interested in working within a particular normalization, there are the self-explanatory types NFC, NFD, NFKC, and NFKD. If you just want to work with codepoints and not bother with normalization, there’s the Uni string type (which may be most appropriate in cases where you don’t want the NFC normalization that comes with the normal Str, and would rather keep text as-is). And if you want to work at the binary level, well, there’s always the Blob family of types :) .
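To make the distinction between those levels concrete, here’s a small sketch (method names as found in Rakudo; .NFD returns a Uni-style codepoint sequence, and .encode drops down to a Blob of bytes):

```raku
my $s = "é";                   # one grapheme
say $s.chars;                  # 1 character
say $s.NFC.elems;              # 1 codepoint in composed form
say $s.NFD.elems;              # 2 codepoints: "e" plus combining acute
say $s.encode('UTF-8').elems;  # 2 bytes at the binary (Blob) level
```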
We also have several methods that let you examine the various bits of Unicode info associated with characters:
say "a".uniname;                 # get name of first Unicode character in string
say "\r\nhello!".ord;            # get number of first codepoint (*not* grapheme) in string
say "\r\nhello!".ords;           # get numbers of all codepoints
say "0".uniprop("Numeric_Type"); # get associated property
And so on :) . Note that the ord/ords part shows you that you’ll really never get the internal numbers used to keep track of graphemes. When ord sees a grapheme cluster, it just returns the codepoint number for the first codepoint of that cluster.
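For instance, on the devanagari syllable from earlier:

```raku
say "नि".ord;   # 2344 — the codepoint of न, the cluster's first codepoint
say "नि".ords;  # (2344 2367) — both codepoints in the cluster
```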
Not Just Strings
Of course, our Unicode support wouldn’t be complete without regex support! Of particular note is the ability to match based on properties, so for example

/ <:Alpha>+ /

will match multiple alphabetic characters (<alpha> will do almost the same thing, just with the addition of matching underscore), and
/ '0x' <:Nv(0..9) + :Hex_Digit>+ | '0b' <:Nv(0..1)>+ /
is a regex that lets you match against either hexadecimal numbers or binary ones, in a Unicode-friendly way. And if you wanted to write the Unicode standard’s “extended grapheme cluster” pattern in regexes (the same pattern we use to determine grapheme handling mentioned earlier):
grammar EGC {
    token Hangul-Syllable {
        || <:GCB<L>>* <:GCB<V>>+ <:GCB<T>>*
        || <:GCB<L>>* <:GCB<LV>> <:GCB<V>>* <:GCB<T>>*
        || <:GCB<L>>* <:GCB<LVT>> <:GCB<T>>*
        || <:GCB<L>>+
        || <:GCB<T>>+
    }

    token TOP {
        || <:GCB<CR>> <:GCB<LF>>
        || <:GCB<PP>>*
           [
           || <:GCB<RI>>
           || <.Hangul-Syllable>
           || <!:GCB<Control>>
           ]
           [
           || <:Grapheme_Extend>
           || <:GCB<Spacing_Mark>>
           ]*
        || .
    }
}
A bit wordy, but just imagine how much more painful that would be without built-in Unicode support in your regexes!
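To try out the number-matching regex from above, here’s a quick sketch (the string literals are my own examples):

```raku
say so '0xC0FFEE' ~~ / '0x' <:Nv(0..9) + :Hex_Digit>+ /;  # True
say so '0b1011'   ~~ / '0b' <:Nv(0..1)>+ /;               # True
say so '0b1021'   ~~ /^ '0b' <:Nv(0..1)>+ $/;             # False — "2" isn't a binary digit
```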
And aside from all the programming-related stuff, there’s also…
Using Unicode to Write Perl 6
As part of our tireless support of Unicode, we also parse your source code with the same regex engine you just saw demonstrated above (though the Perl 6 parser doesn’t need to bother with Unicode properties nearly that often). This means we’re able to support Unicode in Perl 6’s syntax, and have been taking advantage of it for a long time now. Observe:
say 0 ∈ «42 -5 1».map(&log ∘ &abs);
say 0.1e0 + 0.2e0 ≅ 0.3e0;
say 「There is no \escape in here!」
Just a small sampling of the Unicode built into Perl 6 by default, featuring interpolating quote-words lists, setops, function composition, and approximate equality. Oh, and the delimiters for the most basic level of string quoting.
Don’t worry though, standard Perl 6 does not demand that you be able to type Unicode. If you can’t, there are so-called “Texas” variants:
say 0 (elem) <<42 -5 1>>.map(&log o &abs);
say 0.1e0 + 0.2e0 =~= 0.3e0;
say Q[[[There is no \escape in here!]]]
This is fine of course, but if it’s feasible for you to set up Unicode support, I heartily recommend it. Here’s a short list of various ways to do it:
- Get an awesome text editor — The more featureful text editors (such as emacs or vim, to name a couple) will have functionality in place to insert arbitrary characters. Go look it up in your editor’s documentation, and consider petitioning if it doesn’t support Unicode entry :) .
- Use your OS’s hex input — Some systems, such as Windows or applications using GTK, support key shortcuts to let you type the hexadecimal codepoint numbers for characters. You’ll have to memorize codepoints, but chances are you’d get used to it eventually.
- Set up your keyboard’s third/fourth/etc. levels — If your system supports it, you can enable third/fourth level modifiers and so on for your keyboard to access those levels (if you don’t know what those are, your ‘Shift’ key counts as a second-level modifier, and the characters it lets you type are considered on the second level, as an example). Depending on the amount of time and/or patience you have you could even customize those extra levels.
- (X11) Set up your Compose key — This is the method I myself use. It involves setting up a key to use as the “Compose key” or “Multi key”, and use of a file in ~/.XCompose (or some other place, as long as you configure it) to set up key combos. The Compose key works by letting you type any configured sequence of keys after pressing the Compose key, which will insert the character(s) of your choice.
  - Which key you sacrifice of course depends on which keys you don’t make use of; it could be the caps lock, or one of those extra Shift/Alt/Ctrl keys. It can even be that useless Menu key, which you probably just remembered was on your keyboard :P .
  - An absolutely wonderful starting .XCompose can be found in this github repository. You’ll still want to add combinations to this for some Perl 6, and perhaps do other tinkering with it¹, but it’s still quite a lot better than having to start from scratch :) .
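As an illustration, entries in that file look like this — the first two sequences here are made-up examples of my own, while the last is the one suggested in the footnote:

```
# hypothetical ~/.XCompose entries for a few Perl 6 operators
<Multi_key> <parenleft> <minus>          : "∈" U2208   # set membership
<Multi_key> <o> <o>                      : "∘" U2218   # function composition
<Multi_key> <equal> <asciitilde> <equal> : "≅" U2245   # approximate equality
```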
In Conclusion
This of course isn’t an exhaustive coverage of all that Perl 6 has to offer Unicode, but the underlying takeaway is that Perl 6 makes handling Unicode much nicer than other languages do (at least out of the box).
Bonus! Partly in the spirit of Christmastime, and partly in the spirit of “I love this, and what better time to share it?”, allow me to present for your historical interest Perl 6’s legendary “snowman comet” bug:
say "abc" ~~ m☃.(.).☄ # this used to work. Really.
Basically this old old old old bug that (sadly) doesn’t exist anymore was about the regex part of the parser messing up a bit and interpreting ☃☄ as just as valid a pair of brackets as () or ⦃⦄.
Is there a relevant lesson in this bug? Nope. Is it only vaguely connected to a winter blog post on Unicode? You bet. It’s just that it’s thanks to Unicode support that we were able to get that kind of bug way back in 2009, and it’s thanks to Unicode support (among other things) that someone could re-implement this as a slang or something ☺ .
So go forth confident in your newfound ability to handle international text with much greater ease than you’re perhaps used to, and spend more time building ☃☃☃☃ the rest of this month.
Have the appropriate amount of fun! ❄
¹Psst! Use the texas variants for your compose combos if you’re stuck on coming up with them, e.g. <Multi_key> <equal> <asciitilde> <equal> for ≅
Thanks for this post and for all of your work on Perl 6!
Perhaps it’s just my browser, but it seems that something is missing just inside the parentheses in your post after the “Alpha” regex:
This post overstates what’s achieved: it makes it a bit harder to slice things awkwardly, but it’s instructive to ask why we’d pick extended grapheme clusters as the slicing unit instead of, say, words. The problems with picking “word” as a unit also apply to grapheme clusters: both take varying amounts of space on screen, and neither is a concept that cleanly matches what humans might think of as a single word or single “character”.
In my experience, the most common length-like concepts that software wants to deal in are not codepoints or extended grapheme clusters as Python and Perl 6 expose in the goal of “making Unicode work like ASCII”, but rather either length on a terminal (make things line up, or be likely to fit in one line); or good old fashioned bytes (ensuring that something fits within an externally imposed limit, or for a reference to a location).
For many other things, you just want numbers to match each other, without even needing to know what the number is: e.g. make sure that queries “where and how long does this regexp match” can be passed to substr.
I think it’s a mistake to choose one concept as “what a character means”. As the Unicode FAQ says, there is no single concept corresponding to “character”, and fixing on one just leads to bugs.
(And fixing on “extended grapheme cluster” as the one final notion seems about as likely to work as Java’s choice of utf16 code unit, or Python’s choice of codepoint as the “notion of character that makes Unicode work just like ASCII”. The fact that “extended grapheme cluster” is the second official attempt to nail down “what a grapheme cluster” means, and that “legacy grapheme clusters” are still preferred for some things (I hear it works better for drop caps), should give some clue as to whether this second attempt is the Ultimate Character Notion.)
While acknowledging that the task isn’t really possible, my own attempt to “make Unicode work for ASCII programmers” would go further along the idea of “replace a single length() function with the various different length-like notions”. Maybe that means splitting substr into many functions, or maybe that means changing it from accepting integers to accepting more abstract offsets.

For regular expressions (how much does ‘.’ or ‘[…]’ match): first of all, regular expressions are really useful (perl’s killer feature), and it would be great if they could be used on different types of sequence, in which case it becomes natural how much ‘.’ matches. Failing that, ‘[…]’ has some potential for the decision to be based on what’s inside the ‘…’, but I have no other suggestion for ‘.’ beyond the approach of Perl 5 (flags such as /u “use feature ‘unicode_strings'”), and providing escapes (like \d) for explicit choice.
Hi Peter,
> why we’d pick extended grapheme clusters
My understanding of this is perhaps best expressed by quoting [the current Unicode annex on text segmentation](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). This states that EGCs are “recommended for general processing”. I take “general processing” to include “collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text”.
> grapheme clusters … is [not] a concept that cleanly matches what humans might think of as a … single “character”.
Are you referring here to the complexity that has led to several variants of grapheme clustering being defined in the Unicode standard and to tailoring?
While it’s not clean, it’s what Unicode defines for dealing with “what users think of as a character”, so I don’t understand what else you imagine can be done if one is to support Unicode as it’s designed.
> length on a terminal (make things line up, or be likely to fit in one line);
I don’t see how one can ignore graphemes in doing any such processing in a generally correct manner.
> good old fashioned bytes (ensuring that something fits within an externally imposed limit, or for a reference to a location).
Perl 6 has nice support for that.
> For many other things, you just want numbers to match each other, without even needing to know what the number is: e.g. make sure that queries “where and how long does this regexp match” can be passed to substr.
If I’m understanding you correctly, Perl 6 does that especially well.
> I think it’s a mistake to choose one concept as “what a character means”. As the Unicode FAQ says, there is no single concept corresponding to “character”, and fixing on one just leads to bugs.
Perl 6 doesn’t fix on one concept. It provides a default string type in which the character unit is “what a user thinks of as a character”. If a programmer needs to work at some other level like the byte or codepoint level instead there are nice features for that too.
> (And fixing on “extended grapheme cluster” as the one final notion seems about as likely to work as Java’s choice of utf16 code unit, or Python’s choice of codepoint as the “notion of character that makes Unicode work just like ASCII”.
I don’t see Perl 6 as fixing on EGCs as the one final notion of anything. There is a commitment to deeply supporting Unicode. Right now that includes implementing grapheme clustering, starting with EGCs. If/when someone wants to implement LGCs that’ll presumably happen too. Likewise if the Unicode consortium introduces another grapheme cluster specification pertinent to Perl 6 usage.
> While acknowledging that the task isn’t really possible, my own attempt to “make Unicode work for ASCII programmers” would go further along the idea of “replace a single length() function with the various different length-like notions”.
There is no “length” function or method in Perl 6. Instead there’s “.bytes”, “.codes”, “.chars”, “.words”, etc.
> For regular expressions (how much does ‘.’ or ‘[…]’ match): first of all, regular expressions are really useful (perl’s killer feature), and it would be great if they could be used on different types of sequence
Here’s [a recent exchange among TimToady, jnthn, FROGGS, moritz](http://irclog.perlgeek.de/perl6/2015-05-12#i_10589597) about using regexes with :bytes, :codes, and the Cat type (a streaming sequence).
This advent post was shared at LWN and an exchange ensued between “butlerm” and me that covers the same sort of territory seen in Peter Moulder’s comment and my first reply here. I wrap it up at http://lwn.net/Articles/668615/
“When ord sees a grapheme cluster, it just returns the codepoint number for the first codepoint of that cluster.” Is there any normalization involved in this? If I have the string “á” (precomposed, NFC) and the string “á” (non-composed, NFD) and take the ord() of them, will they return different numbers? It looks like it gets normalized to NFC, but that’s just a quick look in the REPL, which might do other things to my strings.
Regardless of whether “\r\n” counts as a single character, I believe the current state of newline handling in Perl6 is, at the very least, confusing and non-intuitive.
Looks like the Rakudo devs agree with you; jnthn has added your bug report to the meta ticket blocking release. You’ll notice that’s not the only “\r\n” grapheme related bug on there :( The feature is relatively recently implemented after all. So safe to say we need to thank you for unearthing the behaviour. I’ve personally yet to use Rakudo on Windows, so I have never run into this sort of hassle. I don’t think as much conflict of opinion exists as your blog post might suggest though; this looks like a bug rather than an earth-shattering feature choice from the P6 community.
Take a look at this:
my %esc = (
    '$'  => '\$',  '@'    => '\@', '%'  => '\%', '&'  => '\&', '{' => '\{',
    "\b" => '\b',  "\x0A" => '\n', "\r" => '\r', "\t" => '\t', '"' => '\"',
    '\\' => '\\\\' );
That was written in 2011. That says something. As in, the behavior, whose source I have not yet been able to identify, has existed as long as that.
In any case, I am glad my bug report is going to be looked at, but I am mainly disappointed that the absurdity of “\n” becoming “\r\n” inside a program did not occur to anyone.
Also, looking through #perl6 shows many instances of the problem being pointed out, and dismissed with 'use newline :lf;' etc. I am coming out strong on this because it looks like people do believe that what they did makes a lot of sense. It does not. It never has. Newline translation happens at input-output boundaries. The notion that inside of a source file, there is a difference between:
and
is outlandish, but that seems to have been designed and defended.
There is supposed to be a line break between the “1” and “2” in the second example, but, WordPress!
It’s worth pointing out that the “\x0A” in that code snippet was modified this November, not in 2011 (the “2011” claim probably comes from the hash’s other lines’ blame output). The problem has not been around for years; rather, only a month or so :) .
It gets more confusing. What one Christmas elf has been up to suggests this might be related to more recent changes: