Confession: My title is clickbait-y; this is really about building on the Unicode Character Database to support character-property regexp features in Quamina. Just halfway there, I'd already got to 775K lines of generated code, so I abandoned that particular approach. Thus, this is about (among other things) avoiding those 1½M lines. And it's really only of interest to people whose pedantry includes some combination of Unicode, Go programming, and automaton wrangling. Oh, and GenAI, which (*gasp*) I think I should maybe have used.

Character property matching · I'm talking about regexp incantations like [\p{L}\p{Zs}\p{Nd}], which matches anything that Unicode classifies as a letter, a space, or a decimal number. (Of course, in Quamina "\" is "~" for excellent reasons, so that reads [~p{L}~p{Zs}~p{Nd}].)

(I'm writing about this now because I just launched a PR to enable this feature. Just one more to go before I can release a new version of Quamina with full regexp support, yay.)

Finding the properties · To build an automaton that matches something like that, you have to find out what the character properties are. This information comes from the Unicode Character Database, helpfully provided online by the Unicode Consortium. Of course, most programming languages have libraries that will help you out, and that includes Go, but I didn't use Go's.

Unfortunately, Go's library doesn't get updated every time Unicode does. As of now, January 2026, it's still stuck at Unicode 15.0.0, which dates to September 2022; the latest version is 17.0.0, released last September. Which means there are plenty of Unicode characters Go doesn't know about, and I didn't want Quamina to settle for that.

So, I fetched and parsed the famous master file from www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt. Not exactly rocket science; it's a flat file with ;-delimited fields, of which I only cared about the first (the code point, in hex) and the third (the general category, such as Lu or Nd).
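
For readers who haven't met the file format, here is a minimal sketch of that parsing step. It is not Quamina's code: the parseLine name and the three hard-coded sample records are assumptions for illustration, and it makes no attempt to handle any nonstandard lines in the real file.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseLine is a hypothetical helper (not taken from Quamina). It pulls the
// code point (field 1, hex) and the general category (field 3) out of one
// ;-delimited UnicodeData.txt record and ignores every other field.
func parseLine(line string) (rune, string, error) {
	fields := strings.Split(line, ";")
	if len(fields) < 3 {
		return 0, "", fmt.Errorf("expected at least 3 fields, got %d", len(fields))
	}
	cp, err := strconv.ParseUint(fields[0], 16, 32)
	if err != nil {
		return 0, "", fmt.Errorf("bad code point %q: %w", fields[0], err)
	}
	return rune(cp), fields[2], nil
}

func main() {
	// Three sample records in UnicodeData.txt format; a real run would read
	// the downloaded file instead of this string.
	sample := `0020;SPACE;Zs;0;WS;;;;;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;`

	scanner := bufio.NewScanner(strings.NewReader(sample))
	for scanner.Scan() {
		cp, category, err := parseLine(scanner.Text())
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Printf("U+%04X %s\n", cp, category)
	}
}
```

Collecting the code points per category from output like this is the raw material for a matcher on something like ~p{Zs}; a one-letter property such as ~p{L} would presumably cover every category whose name starts with that letter (Lu, Ll, and so on).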
There are some funky bits, such as the pair of nonstandard lines indicating that the ...