Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 23 Jan 2018 13:22:04 -0600
From:      Kyle Evans <kevans@freebsd.org>
To:        Yuri Pankov <yuripv@icloud.com>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: libc/regex: r302824 added invalid check breaking collating ranges
Message-ID:  <CACNAnaGioTGUFpNEakT-b88f_pa7ZaAXBPp09ruTzD8HFWWQfQ@mail.gmail.com>
In-Reply-To: <f17e0e80-dc27-568a-2896-5cb38ebc470f@icloud.com>
References:  <a0d9abd8-19b8-cdf6-5451-e184fa182b38@icloud.com> <e192f9f7-9d9c-d1e3-8db4-02226ffa23d3@icloud.com> <CACNAnaF2aJ5EqLSCLTRkGH%2Bq5SYMmxD1dygGd8NFrkA9STJX8A@mail.gmail.com> <f17e0e80-dc27-568a-2896-5cb38ebc470f@icloud.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Jan 23, 2018 at 1:10 PM, Yuri Pankov <yuripv@icloud.com> wrote:
> On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote:
>>
>> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov <yuripv@icloud.com> wrote:
>>>
>>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote:
>>>>
>>>>
>>>> (CCing Kyle as he's working on regex at the moment and not because he
>>>> broke something)
>>>>
>>>> Hi,
>>>>
>>>> r302284 added an invalid check which breaks collating ranges:
>>>>
>>>> -if (table->__collate_load_error) {
>>>> -    (void)REQUIRE((uch)start <=3D (uch)finish, REG_ERANGE);
>>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
>>>> +    (void)REQUIRE(start <=3D finish, REG_ERANGE);
>>>>
>>>> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
>>>> according to current locale's collation and not simply comparing the
>>>> wchar_t values.
>>>
>>>
>>>
>>> After re-reading the specification I now see that what looked like a bu=
g
>>> is
>>> actually an implementation choice, though the one that needs to be
>>> documentated.  I'll update the man page if anyone is willing to review
>>> (and
>>> commit) the changes.
>>
>>
>> Can you point to the section of specification that indicates this is
>> OK behavior? It doesn't seem desirable, but I see that GNU systems
>> will operate in the same manner that we do now.
>
>
> Here --
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:
> ------------------------------------------------------------------------
> In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified behavior:
> strictly conforming applications shall not rely on whether the range
> expression is valid, or on the set of collating elements matched.
> ------------------------------------------------------------------------
>

Thanks- our current behavior seems reasonable in that context.

> I've tried to "fix" what I was seeing as well, and yes, everything outsid=
e
> of ASCII is ugly, e.g. Cyrillic '=D0=B0-=D1=8F' would match much more tha=
n you could
> expect if you are doing lookups based on collation order (capital chars a=
nd
> a lot of other symbols).
>
> So what we have currently looks the least evil to me:
>
> - non-collating ASCII lookups for any locale -- looking at the log for
>   regcomp.c there was an attempt to "fix" this, but it was reverted as
>   a lot of existing code relies on this;
> - non-collating multi-byte locale lookups -- they will work for almost
>   all   cases, and where they don't, well POSIX says it's undefined :D
> - collating single-byte locale lookups for outside of ASCII range --
>   they make sense as collation order there doesn't seem to mix
>   small/caps/other characters together.
>
> What I think we need to do is document this as implementation choice in t=
he
> code and regex(3) "IMPLEMENTATION NOTES" so that another poor soul doesn'=
t
> come trying to fix it as I did :-)

I agree with your assessment- such a patch would be welcomed,
especially before I go and revise a bunch of this for clarification in
a future libregex world.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CACNAnaGioTGUFpNEakT-b88f_pa7ZaAXBPp09ruTzD8HFWWQfQ>