Date:      Sun, 6 Nov 2016 22:14:50 +0100
From:      Stefan Ehmann <shoesoft@gmx.net>
To:        Stefan Bethke <stb@lassitu.de>, Baptiste Daroussin <bapt@FreeBSD.org>
Cc:        Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <a3f401a7-9dc9-d567-bf21-139364702599@gmx.net>
In-Reply-To: <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de>

On 06.11.2016 21:57, Stefan Bethke wrote:
> 
>> On 06.11.2016 at 12:07, Baptiste Daroussin
>> <bapt@FreeBSD.org> wrote:
>> 
>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>>> I happened to run an old script today that uses sed(1) to extract
>>> the system boot time from the kern.boottime sysctl MIB. On 11.0
>>> this no longer works as expected:
..
>>> Here sed thinks every lowercase character except for 'a' is
>>> uppercase! This differs from the first test where sed did not
>>> think 'o' is uppercase. Again, the above behaves as expected with
>>> LANG=C.
>>> 
>>> Does anyone have any insight into this? This is likely to break a
>>> lot of existing code.
>>> 
>> 
>> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode
>> world it means AaBb...Z, because there are far more characters than
>> simply A-Z. In FreeBSD 11 we have a Unicode collation instead of
>> falling back on LC_COLLATE=C, which means ASCII only.
>> 
>> For regexps, for example, one should use the character classes
>> [:upper:] or [:lower:].
> 
> That is rather surprising.  Is there a normative reference for the
> treatment of bracket expressions and character classes when using
> locales other than C and/or encodings like UTF-8?

I found an interesting article about this issue in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

Apparently the meaning of ranges is unspecified outside the "C" locale.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
says:

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched."
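
To illustrate what that means in practice, here is a minimal sketch of two
portable ways to match uppercase letters with sed. The function names
upper_only and upper_only_c are made up for this example; they are not taken
from Greg's script, and the input string is just a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only (not from the original script).

# Variant 1: use a character class. [:upper:] is defined by LC_CTYPE,
# not by the collation order, so it does not have the problem that the
# range A-Z has in non-POSIX locales.
upper_only() {
    sed 's/[^[:upper:]]//g'
}

# Variant 2: keep the range, but force the POSIX ("C") locale for this
# one command, which is the only locale where POSIX defines what a
# range expression matches.
upper_only_c() {
    LC_ALL=C sed 's/[^A-Z]//g'
}

# Both calls delete every character that is not an uppercase letter.
printf '%s\n' 'Hello World' | upper_only
printf '%s\n' 'Hello World' | upper_only_c
```

The first variant is probably the better fit for scripts that may see
non-ASCII input; the second is the smaller change for an existing script
that only ever deals with ASCII.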


