From: Phil Shafer
Date: Wed, 11 Feb 2015 13:14:41 -0500
Subject: libxo and i18n
Message-ID: <201502111814.t1BIEfcI000650@idle.juniper.net>
List-Id: FreeBSD Internationalization Effort

[background: libxo is a new library in FreeBSD that lets a single
source code path emit XML, JSON, HTML, and traditional text.
Full docs are at:

    http://juniper.github.io/libxo/libxo-manual.html
]

In libxo, I'm having issues dealing with i18n, which stem mostly from
my lack of depth on the subject. Specifically, when someone makes a
call like:

    xo_emit("[{:numbers/%-4..4s/%s}]\n", "123456");

they are asking for numbers to be truncated at four columns, rather
than the printf-style four bytes. The output should be:

    [1234]

My issue arises when ligatures are used, where multiple Unicode code
points occupy the same column. An example would be the "Sri" in
Sinhalese:

    http://en.wikipedia.org/wiki/Sinhala_alphabet#Consonant_conjuncts

When I look at src/mklocale/UTF-8.src, I see:

    /*
     * U+0D80 - U+0DFF : Sinhala
     */

    GRAPH    0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
    GRAPH    0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
    GRAPH    0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
    PUNCT    0x0df4
    PRINT    0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
    PRINT    0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
    PRINT    0x0dd8 - 0x0ddf 0x0df2 - 0x0df4
    SWIDTH1  0x0d82 0x0d83 0x0d85 - 0x0d96 0x0d9a - 0x0db1 0x0db3 - 0x0dbb
    SWIDTH1  0x0dbd 0x0dc0 - 0x0dc6 0x0dca 0x0dcf - 0x0dd4 0x0dd6
    SWIDTH1  0x0dd8 - 0x0ddf 0x0df2 - 0x0df4

Consider the UTF-8 sequence for the glyph in the Sinhalese table at
the URL above, at the ninth row from the bottom, fifth character in.

    UTF-8:   [e0b6bb] [e0b78a] [e2808d] [e0b69d]
    Unicode: U+0DBB   U+0DCA   U+200D   U+0D9D

wcwidth reports the third character (ZWJ) as -1, but all the others
as width 1:

    (gdb) p (int) wcwidth(0xdbb)
    $1 = 1
    (gdb) p (int) wcwidth(0xdca)
    $2 = 1
    (gdb) p (int) wcwidth(0x200d)
    $3 = -1
    (gdb) p (int) wcwidth(0xd9d)
    $4 = 1

So my question is (at long last): how does one know when multiple
Unicode characters will result in a single column of output?

Thanks,
 Phil
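
For anyone who wants to reproduce the numbers, here is a minimal sketch of the measurement described above. It is an illustration only, not libxo code: utf8_decode, reported_width, naive_columns, and sri_sample are hypothetical names, and reported_width hard-codes the four values from the gdb session instead of calling wcwidth(), so the result does not depend on the local locale tables. The decoder is simplified and does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The byte sequence quoted above: U+0DBB U+0DCA U+200D U+0D9D. */
const unsigned char sri_sample[] = {
    0xe0, 0xb6, 0xbb,   /* U+0DBB */
    0xe0, 0xb7, 0x8a,   /* U+0DCA */
    0xe2, 0x80, 0x8d,   /* U+200D (ZWJ) */
    0xe0, 0xb6, 0x9d    /* U+0D9D */
};

/*
 * Decode one UTF-8 sequence from s (len bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 on truncated/malformed input.
 */
static size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                  /* 1-byte (ASCII) */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 2-byte lead */
        *cp = s[0] & 0x1F;
        need = 1;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 3-byte lead */
        *cp = s[0] & 0x0F;
        need = 2;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 4-byte lead */
        *cp = s[0] & 0x07;
        need = 3;
    } else {
        return 0;                       /* stray continuation or bad lead */
    }

    if (len < need + 1)
        return 0;
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need + 1;
}

/*
 * Stand-in for wcwidth(): returns the values seen in the gdb session.
 * Assumption: every other code point is treated as width 1.
 */
static int
reported_width(uint32_t cp)
{
    return (cp == 0x200D) ? -1 : 1;
}

/*
 * Naive column count: sum the per-code-point widths, counting a
 * negative (non-printable) result as zero columns.
 * Returns -1 if the input is not well-formed UTF-8.
 */
int
naive_columns(const unsigned char *s, size_t len)
{
    int cols = 0;

    while (len > 0) {
        uint32_t cp;
        size_t n = utf8_decode(s, len, &cp);
        int w;

        if (n == 0)
            return -1;
        w = reported_width(cp);
        if (w > 0)
            cols += w;
        s += n;
        len -= n;
    }
    return cols;
}
```

Calling naive_columns(sri_sample, sizeof(sri_sample)) gives 3 under these assumptions, which is the mismatch in question: summing per-code-point widths counts three columns for a sequence that is expected to occupy one.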