Date:      Sun, 6 Nov 2016 12:07:29 +0100
From:      Baptiste Daroussin <bapt@FreeBSD.org>
To:        Greg Rivers <gcr+freebsd-stable@tharned.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: Uppercase RE matching problems in FreeBSD 11
Message-ID:  <20161106110729.z2px7mzlhcwxvrvu@ivaldir.etoilebsd.net>
In-Reply-To: <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org>
References:  <alpine.BSF.2.20.1611051912260.2462@flake.tharned.org>

On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
> I happened to run an old script today that uses sed(1) to extract the system
> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
> expected:
>
> $ sysctl kern.boottime
> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016
> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
> v  5 16:18:34 2016
>
> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
> expected:
>
> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
> Nov  5 16:18:34 2016
>
> Testing every lowercase character separately gives even more inconsistent
> results:
>
> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
> > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > i
> > j
> > k
> > l
> > m
> > n
> > o
> > p
> > q
> > r
> > s
> > t
> > u
> > v
> > w
> > x
> > y
> > z
> > !
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> m
> n
> o
> p
> q
> r
> s
> t
> u
> v
> w
> x
> y
> z
>
> Here sed thinks every lowercase character except for 'a' is uppercase! This
> differs from the first test where sed did not think 'o' is uppercase. Again,
> the above behaves as expected with LANG=C.
>
> Does anyone have any insight into this? This is likely to break a lot of
> existing code.
>

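Yes. A-Z only means uppercase in an ASCII-only world. In a Unicode world it
means roughly AaBb...Z, because there are far more characters than simply A-Z
and the collation order interleaves upper- and lowercase letters. In FreeBSD 11
we have a Unicode collation instead of falling back on LC_COLLATE=C, which
means ASCII only.

In regular expressions, for example, one should use the character classes
[:upper:] or [:lower:] instead (written as [[:upper:]] or [[:lower:]] inside a
bracket expression).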
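
A minimal sketch of the two workarounds, using the kern.boottime line from
your report as fixed input so that it does not depend on sysctl. The variable
names are only for illustration:

```shell
#!/bin/sh
# Sample line copied from the sysctl output quoted in the report.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov  5 16:18:34 2016'

# Option 1: use the POSIX character class, which matches uppercase
# letters only, whatever the collation order of the current locale is.
with_class=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')

# Option 2: keep the A-Z range but force the C locale for this one command.
with_c_locale=$(printf '%s\n' "$line" | LC_ALL=C sed -e 's/.*\([A-Z].*\)$/\1/')

# Sanity check: no lowercase letter should match the class, so this
# variable is expected to stay empty.
lower_hits=$(printf '%s\n' a b v z | sed -n -e '/^[[:upper:]]$/p')

printf '%s\n' "$with_class" "$with_c_locale"
```

The character class is probably the better choice for new scripts, since it
keeps working with non-ASCII uppercase letters; forcing the C locale is the
smaller change for existing ones.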

Best regards,
Bapt
