Date: Wed, 11 Feb 2015 13:14:41 -0500
From: Phil Shafer <phil@juniper.net>
To: <freebsd-i18n@freebsd.org>
Subject: libxo and i18n
Message-ID: <201502111814.t1BIEfcI000650@idle.juniper.net>
[background: libxo is a new library in FreeBSD that provides the
ability for a single source code path to emit XML, JSON, HTML and
traditional text. Full docs are at:

    http://juniper.github.io/libxo/libxo-manual.html
]

In libxo, I'm having issues dealing with i18n, which mostly stem from
my lack of depth on the subject.

Specifically, when someone makes a call like:

    xo_emit("[{:numbers/%-4..4s/%s}]\n", "123456");

they are asking for "numbers" to be truncated at 4 columns, rather
than the printf-style four bytes. The output should be:

    [1234]

My issue arises when ligatures are used, where multiple Unicode code
points occupy the same column. An example would be the "Sri" in
Sinhalese:

    http://en.wikipedia.org/wiki/Sinhala_alphabet#Consonant_conjuncts

When I look at src/mklocale/UTF-8.src, I see:

    /*
     * U+0D80 - U+0DFF : Sinhala
     */
    GRAPH   0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
    GRAPH   0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
    GRAPH   0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
    PUNCT   0x0df4
    PRINT   0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
    PRINT   0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
    PRINT   0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
    SWIDTH1 0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
    SWIDTH1 0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
    SWIDTH1 0x0dd8 - 0x0ddf 0x0df2 - 0x0df4

Consider the UTF-8 sequence for the glyph in the Sinhalese table
above, at the ninth row from the bottom, fifth character in.

    UTF-8:   [e0b6bb][e0b78a][e2808d][e0b69d]
    Unicode: u+0dbb  u+0dca  u+200d  u+0d9d

wcwidth reports the third character (ZWJ) as -1, but all the others
as width 1:

    (gdb) p (int) wcwidth(0xdbb)
    $1 = 1
    (gdb) p (int) wcwidth(0xdca)
    $2 = 1
    (gdb) p (int) wcwidth(0x200d)
    $3 = -1
    (gdb) p (int) wcwidth(0xd9d)
    $4 = 1

So my question is (at long last): How does one know when multiple
Unicode characters will result in a single column of output?

Thanks,
 Phil
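
To make the problem concrete, here is a minimal sketch of what naive per-code-point counting does with the widths from the gdb session above. It is not libxo code: demo_wcwidth(), naive_columns(), naive_truncate() and sri_conjunct are hypothetical names made up for this illustration, and demo_wcwidth() hard-codes the reported values (with every other code point assumed to be one column) so the result does not depend on the active locale.

```c
#include <stddef.h>

/* The four code points from the message: u+0dbb u+0dca u+200d u+0d9d. */
const unsigned long sri_conjunct[] = { 0x0dbb, 0x0dca, 0x200d, 0x0d9d };

/*
 * Hypothetical stand-in for wcwidth(). It returns the values seen in
 * the gdb session: -1 for ZWJ, 1 for the other three code points.
 * Assumption: any other code point is treated as one column.
 */
int demo_wcwidth(unsigned long cp)
{
    if (cp == 0x200d)
        return -1;
    return 1;
}

/* Sum per-code-point widths, counting negative widths as zero columns. */
int naive_columns(const unsigned long *cps, size_t n)
{
    int total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int w = demo_wcwidth(cps[i]);
        if (w > 0)
            total += w;
    }
    return total;
}

/* Return how many leading code points fit within max_cols columns. */
size_t naive_truncate(const unsigned long *cps, size_t n, int max_cols)
{
    int used = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int w = demo_wcwidth(cps[i]);
        if (w < 0)
            w = 0;              /* non-printing: takes no column */
        if (used + w > max_cols)
            break;
        used += w;
    }
    return i;
}
```

With these widths, naive_columns(sri_conjunct, 4) yields 3 columns for what renders as a single glyph, and naive_truncate(sri_conjunct, 4, 2) keeps the first three code points, cutting the conjunct off after the ZWJ. Per-code-point widths alone cannot tell that the four values belong together, which is the gap the question is about.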