From: Stefan Bethke
Date: Sun, 6 Nov 2016 22:49:51 +0100
Subject: Re: Uppercase RE matching problems in FreeBSD 11
To: Baptiste Daroussin
Cc: Greg Rivers, freebsd-stable@freebsd.org

On 06.11.2016 at 22:27, Baptiste Daroussin wrote:
>
>> But under what circumstances would [A-Z] mean anything other than a
>> character whose Unicode codepoint is between U+0041 and U+005A,
>> inclusive? Especially given the locale in the example is en_US.UTF-8.
>> Or, put another way, why would an implementation interpret [A-Z] as
>> anything other than [ABCDE…XYZ]?
>
> The collation rules for unicode comes from: http://cldr.unicode.org/ and they do
> match the one on linux for example and the one on illumos.
>
> On some gnu tool they explicitly decide to be non locale aware to avoid that
> kind of "surprises"
>>
>> From reading your reference, I can see in 9.3.5.7:
>>> In the POSIX locale, a range expression represents the set of
>>> collating elements that fall between two elements in the collation
>>> sequence, inclusive. In other locales, a range expression has
>>> unspecified behavior[…]
>>
>> So even if the observed behaviour is conforming, I’d think it’s still
>> highly undesirable.
>>
> That works for POSIX locale aka C aka ASCII only world

So what do I set my LANG and LC variables to? I do want UTF-8, but I do
also want my scripts to continue to work. Clearly, en_US.UTF-8 is not
what I want. Is it C.UTF-8? Or do I set LANG=en_US.UTF-8 and
LC_COLLATE=C?


Stefan

-- 
Stefan Bethke Fon +49 151 14070811
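
To make the second option asked about above concrete, here is a minimal sketch of overriding only the collation category for a single command. It is an illustration of the question, not an answer confirmed anywhere in this message. The function name starts_upper and the sample input are hypothetical; LANG is left as whatever the environment provides (en_US.UTF-8 in this thread), and LC_ALL is unset because it would otherwise take precedence over LC_COLLATE.

```shell
#!/bin/sh
# Hypothetical helper: print the input lines whose first character
# matches the range [A-Z] when collation is forced to the C locale.
starts_upper() {
    (
        # LC_ALL, if set, overrides every LC_* category, so drop it
        # inside this subshell only.
        unset LC_ALL
        # Keep the inherited LANG (character encoding, messages) and
        # change just the collation order seen by grep.
        LC_COLLATE=C grep '^[A-Z]'
    )
}

# Usage: filter a few sample words.
printf 'apple\nBanana\ncherry\nDate\n' | starts_upper
```

With C collation the range is expected to cover only A through Z, which is the behaviour the scripts in question rely on. The same effect for a whole session would come from exporting LC_COLLATE=C next to LANG, again assuming LC_ALL is not set.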