From: "John Levine"
To: freebsd-questions@freebsd.org
Cc: naddy@mips.inka.de
Date: 18 Oct 2020 11:48:37 -0400
Subject: Re: printf(1) and UTF-8 multi-byte chars
Message-Id: <20201018154838.49CBC239CEDF@ary.qy>
Organization: Taughannock Networks

In article you write:
>On 2020-10-17, Matthias Apitz wrote:
>
>> This means the output of printf(1) is byte oriented and not
>> character oriented.
>
>This conforms to POSIX.

I don't think there is any useful middle ground between counting bytes and full Unicode typesetting.
Some Unicode characters are half- or double-width, particularly in East Asian languages, and many combine with adjacent characters depending on context. For example, the character ö can be the single code point U+00F6, which is two UTF-8 bytes, or a lower case o (U+006F) followed by a combining diaeresis (U+0308), which is three UTF-8 bytes, but it is one column wide either way.
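
To make the three different counts concrete, here is a rough Python sketch. The measure() helper is made up for illustration, and its width rule (combining marks count as zero columns, East Asian wide and fullwidth characters as two, everything else as one) is a simplifying assumption, nowhere near full Unicode typesetting:

```python
import unicodedata


def measure(s):
    """Return (UTF-8 byte count, code point count, approximate column width).

    Hypothetical helper for illustration only; the width rule is a crude
    approximation, not a real terminal width algorithm.
    """
    n_bytes = len(s.encode("utf-8"))
    n_codepoints = len(s)
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            # Combining marks attach to the preceding base character.
            continue
        # "W" (wide) and "F" (fullwidth) characters occupy two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return n_bytes, n_codepoints, width


precomposed = "\u00f6"   # o with diaeresis as a single code point
decomposed = "o\u0308"   # lower case o followed by combining diaeresis

print(measure(precomposed))  # (2, 1, 1)
print(measure(decomposed))   # (3, 2, 1)
print(measure("\u65e5"))     # (3, 1, 2), a double-width CJK character
```

The two spellings of the same visible character give different byte counts and different code point counts, yet the same width, so padding by bytes or by code points misaligns a column in at least one of the cases.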