Using Unicode Property Escapes

While working on my other post about writing a banner printer, I ended up going down the rabbit hole of Unicode characters and combining marks. In the process, I found a wonderful blog post by Dmitri Pavlutin that explains the ins and outs of dealing with “complicated” text in JavaScript very well. Unfortunately, it still wasn’t quite enough for me to correctly group Thai characters so that each one printed correctly on a single page.

In Thai, the vowels and tone marks can appear above, below, and on either side of their associated consonant. So to group the individual Unicode code points into meaningful graphemes, I needed to be able to tell the difference between a base character and a combining mark.

At first I looked for a solution in JavaScript's built-in string and character methods, but found nothing helpful. After some creative duck-duck-go-ing, I finally found my answer: Unicode property escapes. These are a feature of Unicode-aware regular expressions (the ones with the u flag), and they work like the \d and \w character class escapes on steroids. Using a JavaScript Unicode regex with \p{Mn}, which matches the "nonspacing mark" general category, I was able to differentiate between base characters and combining marks:

var ch = '้';

// \p{Mn} matches nonspacing marks; the u flag is required for \p{...} to work.
if (/\p{Mn}/u.test(ch)) {
    console.log(ch + ' is a combining mark!');
}

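Building on that test, here is a minimal sketch of the grouping step itself. It is not the exact code from my banner printer: the helper name groupGraphemes is made up for illustration, and it assumes that "a base character followed by any number of marks" is a good enough definition of a grapheme. It uses \p{M}, which covers all mark categories rather than only Mn.

```javascript
// Hypothetical helper (name chosen for illustration): split a string into
// clusters made of one base character plus any combining marks after it.
function groupGraphemes(text) {
    const clusters = [];
    // for...of walks the string by code point, not by UTF-16 code unit.
    for (const ch of text) {
        // \p{M} matches any mark (Mn, Mc, or Me). A mark is attached to
        // the previous cluster, if there is one.
        if (/\p{M}/u.test(ch) && clusters.length > 0) {
            clusters[clusters.length - 1] += ch;
        } else {
            clusters.push(ch);
        }
    }
    return clusters;
}

// U+0E01 (consonant), U+0E34 (vowel sign written above it), U+0E19 (consonant)
const clusters = groupGraphemes('\u0E01\u0E34\u0E19');
console.log(clusters.length); // 2
```

One caveat with this approach: Thai vowels that are written before or after the consonant are classified as letters (Lo), not marks, so this sketch leaves them as separate clusters. Newer JavaScript engines also ship Intl.Segmenter, which segments by grapheme according to the Unicode rules and may be a better fit where it is available.
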
To better understand the full range of character property information available using Unicode property escapes, I wrote an applet to take a string and dump a table showing each character and the Unicode categories it belongs to. Here are some strings to try out:
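
For the console, the same idea can be sketched in a few lines. This is a rough stand-in rather than the applet's actual code: the describeCharacters helper, the short CATEGORIES list, and the sample strings are illustrative names and values, and only a handful of general categories are checked.

```javascript
// A small, illustrative subset of the general categories \p{...} understands.
const CATEGORIES = ['Lu', 'Ll', 'Lo', 'Mn', 'Mc', 'Nd', 'Po', 'Zs'];

// Hypothetical helper: build one row per code point in the string.
function describeCharacters(text) {
    return Array.from(text, (ch) => ({
        char: ch,
        codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
        // First category from the list that matches, or 'other' if none does.
        category: CATEGORIES.find((c) => new RegExp('\\p{' + c + '}', 'u').test(ch)) || 'other',
        // Property escapes can also test the script a character belongs to.
        thai: /\p{Script=Thai}/u.test(ch),
    }));
}

// Sample strings: ASCII, Latin with a combining accent, and a Thai word.
const samples = ['Ab1', 'e\u0301', '\u0E01\u0E34\u0E19'];
for (const s of samples) {
    console.table(describeCharacters(s));
}
```
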