Date:      Wed, 11 Feb 2015 13:14:41 -0500
From:      Phil Shafer <phil@juniper.net>
To:        <freebsd-i18n@freebsd.org>
Subject:   libxo and i18n
Message-ID:  <201502111814.t1BIEfcI000650@idle.juniper.net>

[background: libxo is a new library in FreeBSD that lets a
single source code path emit XML, JSON, HTML, and traditional
text.  Full docs are at:
    http://juniper.github.io/libxo/libxo-manual.html
]

In libxo, I'm having trouble with i18n, mostly because of my
lack of depth on the subject.  Specifically, when someone makes
a call like:

    xo_emit("[{:numbers/%-4..4s/%s}]\n", "123456");

they are asking for numbers to be truncated to 4 columns, rather
than the printf-style four bytes.  The output should be:

    [1234]
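
For contrast, here is a minimal sketch of the printf-style byte
behaviour mentioned above, using only standard snprintf().  The
helper name byte_trunc4 is hypothetical and is not part of libxo.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of libxo: format `s` the way plain
 * printf would, left-justified and padded or truncated to four BYTES
 * (not columns).  Returns whatever snprintf() returns.
 */
int
byte_trunc4(char *out, size_t outsz, const char *s)
{
    return snprintf(out, outsz, "[%-4.4s]", s);
}
```

For "123456" this gives the same "[1234]" as column truncation,
because each ASCII character is one byte and one column.  For a
multibyte UTF-8 string the precision still counts bytes, so the
output can end in the middle of a character.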

My issue arises when ligatures are used, where multiple Unicode
code points occupy the same column.  An example would be the "Sri"
in Sinhalese:

http://en.wikipedia.org/wiki/Sinhala_alphabet#Consonant_conjuncts

When I look at src/mklocale/UTF-8.src, I see:

/*
 * U+0D80 - U+0DFF : Sinhala
 */

GRAPH     0x0d82  0x0d83  0x0d85 - 0x0d96  0x0d9a - 0x0db1  0x0db3 - 0x0dbb
GRAPH     0x0dbd  0x0dc0 - 0x0dc6  0x0dca  0x0dcf - 0x0dd4  0x0dd6
GRAPH     0x0dd8 - 0x0ddf  0x0df2 - 0x0df4
PUNCT     0x0df4
PRINT     0x0d82  0x0d83  0x0d85 - 0x0d96  0x0d9a - 0x0db1  0x0db3 - 0x0dbb
PRINT     0x0dbd  0x0dc0 - 0x0dc6  0x0dca  0x0dcf - 0x0dd4  0x0dd6
PRINT     0x0dd8 - 0x0ddf  0x0df2 - 0x0df4
SWIDTH1   0x0d82  0x0d83  0x0d85 - 0x0d96  0x0d9a - 0x0db1  0x0db3 - 0x0dbb
SWIDTH1   0x0dbd  0x0dc0 - 0x0dc6  0x0dca  0x0dcf - 0x0dd4  0x0dd6
SWIDTH1   0x0dd8 - 0x0ddf  0x0df2 - 0x0df4

Consider the UTF-8 sequence for the glyph in the Sinhalese table above,
at the ninth row from the bottom, fifth character in.

UTF-8: [e0b6bb][e0b78a][e2808d][e0b69d]
Unicode: u+0dbb  u+0dca  u+200d  u+0d9d
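
As a cross-check of the byte-to-code-point mapping above, here is a
small sketch of a UTF-8 decoder.  The function name utf8_decode is
hypothetical, and the sketch assumes well-formed input: it checks
continuation bytes but does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence starting at `p`
 * (at most `len` bytes available) into `*cp`.  Returns the number
 * of bytes consumed, or 0 if the sequence is truncated or malformed.
 */
size_t
utf8_decode(const unsigned char *p, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t v;

    if (len == 0)
        return 0;

    /* The lead byte determines the sequence length and payload bits. */
    if (p[0] < 0x80) {
        v = p[0];
        need = 1;
    } else if ((p[0] & 0xe0) == 0xc0) {
        v = p[0] & 0x1f;
        need = 2;
    } else if ((p[0] & 0xf0) == 0xe0) {
        v = p[0] & 0x0f;
        need = 3;
    } else if ((p[0] & 0xf8) == 0xf0) {
        v = p[0] & 0x07;
        need = 4;
    } else {
        return 0;
    }

    if (need > len)
        return 0;

    /* Each continuation byte (10xxxxxx) contributes six bits. */
    for (i = 1; i < need; i++) {
        if ((p[i] & 0xc0) != 0x80)
            return 0;
        v = (v << 6) | (p[i] & 0x3f);
    }

    *cp = v;
    return need;
}
```

Calling it repeatedly over the twelve bytes listed above yields four
code points of three bytes each, matching the Unicode line.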

wcwidth reports the third character (ZWJ) as -1, but all the others as
width 1:

(gdb) p (int) wcwidth(0xdbb)
$1 = 1
(gdb) p (int) wcwidth(0xdca)
$2 = 1
(gdb) p (int) wcwidth(0x200d)
$3 = -1
(gdb) p (int) wcwidth(0xd9d)
$4 = 1

So my question is (at long last): How does one know when multiple
Unicode characters will result in a single column of output?

Thanks,
 Phil


