Date:      Sun, 6 Nov 2016 22:27:29 +0100
From:      Baptiste Daroussin <bapt@FreeBSD.org>
To:        Stefan Bethke <stb@lassitu.de>
Cc:        Greg Rivers <gcr+freebsd-stable@tharned.org>, freebsd-stable@freebsd.org
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <20161106212729.z2edg44kg7hc4r2z@ivaldir.etoilebsd.net>
In-Reply-To: <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org> <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net> <29451103-E8DB-4656-A5BB-AEB924A728D6@lassitu.de> <20161106210628.hg3dcpozfjtuo3nt@ivaldir.etoilebsd.net> <C4BC6673-2E07-45E6-81D6-EB4FF99605A8@lassitu.de>


On Sun, Nov 06, 2016 at 10:20:54PM +0100, Stefan Bethke wrote:
>
> > On 06.11.2016 at 22:06, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
> >
> > On Sun, Nov 06, 2016 at 09:57:00PM +0100, Stefan Bethke wrote:
> >>
> >>> On 06.11.2016 at 12:07, Baptiste Daroussin <bapt@FreeBSD.org> wrote:
> >>>
> >>> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> >>>> I happened to run an old script today that uses sed(1) to extract the system
> >>>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> >>>> expected:
> >>>>
> >>>> $ sysctl kern.boottime
> >>>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> >>>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> >>>> v  5 16:18:34 2016
> >>>>
> >>>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> >>>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> >>>> expected:
> >>>>
> >>>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> >>>> Nov  5 16:18:34 2016
> >>>>
> >>>> Testing every lowercase character separately gives even more inconsistent
> >>>> results:
> >>>>
> >>>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/‚p
> >>
> >>>> Here sed thinks every lowercase character except for 'a' is uppercase! This
> >>>> differs from the first test where sed did not think 'o' is uppercase. Again,
> >>>> the above behaves as expected with LANG=C.
> >>>>
> >>>> Does anyone have any insight into this? This is likely to break a lot of
> >>>> existing code.
> >>>>
> >>>
> >>> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world it
> >>> means AaBb... Z, because there are way more characters than simple A-Z. In
> >>> FreeBSD 11 we have a Unicode collation instead of falling back on
> >>> LC_COLLATE=C, which means ASCII only.
> >>>
> >>> For regexps, for example, one should use the classes :upper: or :lower:.
> >>
> >> That is rather surprising.  Is there a normative reference for the treatment of bracket expressions and character classes when using locales other than C and/or encodings like UTF-8?
> >
> > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html
> >
> > For example:
> >
> > "Regular expressions are a context-independent syntax that can represent a wide
> > variety of character sets and character set orderings, where these character
> > sets are interpreted according to the current locale. While many regular
> > expressions can be interpreted differently depending on the current locale, many
> > features, such as character class expressions, provide for contextual invariance
> > across locales."
>
> Sorry, maybe I wasn’t clear enough with my question.  When a character class fits the problem, it is clearly advantageous.
>
> But under what circumstances would [A-Z] mean anything other than a character whose Unicode codepoint is between U+0041 and U+005A, inclusive?  Especially given the locale in the example is en_US.UTF-8.  Or, put another way, why would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?

The collation rules for Unicode come from CLDR (http://cldr.unicode.org/), and they
match the ones used on Linux, for example, and on illumos.

Some GNU tools explicitly choose not to be locale-aware, to avoid this kind of
"surprise".
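
To see why a range depends on collation, it helps to look at how the same
letters sort. The following is only a rough sketch: the sample letters are
made up for illustration, and the second sort assumes an en_US.UTF-8 locale
is installed (if it is not, sort will most likely fall back to C order).

```shell
# Four sample letters (hypothetical input, one per line).
letters='b
A
a
B'

# In the C locale, sort compares by byte value, so all uppercase ASCII
# letters come before all lowercase ones.
c_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | tr -d '\n')
printf 'C locale order:      %s\n' "$c_order"

# With a Unicode collation, upper and lower case are interleaved, which is
# why a range such as A-Z can end up containing lowercase letters.
# The exact order depends on the system's locale data, so it is only printed.
utf8_order=$(printf '%s\n' "$letters" | LC_ALL=en_US.UTF-8 sort 2>/dev/null | tr -d '\n')
printf 'en_US.UTF-8 order:   %s\n' "$utf8_order"
```
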
>
> From reading your reference, I can see in 9.3.5.7:
> > In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior[…]
>
> So even if the observed behaviour is conforming, I’d think it’s still highly undesirable.
>
That definition only applies to the POSIX locale, aka C, aka the ASCII-only world.
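
For a script like the one that started this thread, two portable ways out
are a character class or forcing the C locale for that one command. A
minimal sketch, assuming the sysctl output quoted earlier in the thread is
stored in a variable (the variable names are made up, and LC_ALL is used
instead of LANG because it also overrides LC_COLLATE if that is set):

```shell
# Sample line copied from the kern.boottime output quoted in this thread.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# Option 1: [[:upper:]] is a character class, so it does not depend on the
# collation order of the current locale.
by_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

# Option 2: keep the A-Z range but run sed in the C locale, where the range
# is defined as the ASCII letters A through Z.
by_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

printf '%s\n' "$by_class" "$by_c_locale"
```

Both commands should print the same "Nov  5 16:18:34 2016" result that the
LANG=C run in the original report produced.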

Best regards,
Bapt




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20161106212729.z2edg44kg7hc4r2z>