Date: Mon, 6 Jun 2016 08:43:12 -0500
From: Pedro Giffuni <pfg@FreeBSD.org>
To: Andrey Chernov <ache@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r301461 - in head/lib/libc: gen locale regex
Message-ID: <cc6f1905-5cb6-0076-7da4-e1cfdbde857e@FreeBSD.org>
In-Reply-To: <40c481fe-5585-45d2-d4e3-b9988a8198f3@freebsd.org>
References: <201606051912.u55JCqdR036458@repo.freebsd.org> <40c481fe-5585-45d2-d4e3-b9988a8198f3@freebsd.org>

On 06/05/16 14:49, Andrey Chernov wrote:
> On 05.06.2016 22:12, Pedro F. Giffuni wrote:
>> --- head/lib/libc/regex/regcomp.c    Sun Jun  5 18:16:33 2016    (r301460)
>> +++ head/lib/libc/regex/regcomp.c    Sun Jun  5 19:12:52 2016    (r301461)
>> @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs)
>>                  (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>                  CHaddrange(p, cs, start, finish);
>>              } else {
>> -                (void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>> +                (void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE);
>>                  for (i = 0; i <= UCHAR_MAX; i++) {
>> -                    if (   __collate_range_cmp(table, start, i) <= 0
>> -                        && __collate_range_cmp(table, i, finish) <= 0
>> +                    if (   __wcollate_range_cmp(table, start, i) <= 0
>> +                        && __wcollate_range_cmp(table, i, finish) <= 0
>>                         )
>>                          CHadd(p, cs, i);
>>                  }
>>
>
> As I already mention in PR, we have broken regcomp after someone adds
> wchar_t support there. Now regcomp ranges works only for the first 256
> wchars of the current locale, notice that loop upper limit:
> for (i = 0; i <= UCHAR_MAX; i++) {
> In general, ranges are either broken in regcomp now or are memory
> eating. We have bitmask only for the first 256 wchars, all other added
> to the range literally. Imagine what happens if someone specify full
> Unicode range in regexp.
>
> Proper fix will be adding bitmask for the whole Unicode range, and even
> in that case regcomp attempting to use collation in ranges will be
> _very_slow_ since needs to check all Unicode chars in its
> for (i = 0; i <= Max_Unicode_wchar; i++) {
> loop.
>
> Better stop pretending that we are able to do collation support in the
> ranges, since POSIX cares about its own locale only here:
> "In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified
> behavior: strictly conforming applications shall not rely on whether the
> range expression is valid, or on the set of collating elements matched."
>
> Until whole Unicode range bitmask will be implemented (if ever), better
> stop pretending to honor collation order, we just can't do it with
> wchars now and do what NetBSD/OpenBSD does (using wchar_t) instead. It
> does not prevent memory eating on big ranges (bitmask is needed, see
> above), but at least fix the thing that only first 256 wchars are
> considered.
>

Sadly regex is one part of the system that could use a maintainer :(,
I have been forced to look at it more than I'd like to but I don't
really use the collation support at all.

Pedro.
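
For readers following the thread, here is a minimal sketch of the limitation described in the quoted message. It is not taken from regcomp.c: the helper names (capped_enumeration, range_contains, full_bitmap_bytes) and the MAX_UNICODE constant are hypothetical, uint32_t stands in for wchar_t, and wide character values are assumed to be Unicode code points. It contrasts a membership loop capped at UCHAR_MAX with a plain code point comparison, and estimates the size of a bitmask covering the whole Unicode range.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: highest Unicode code point; not a name used in regcomp.c. */
#define MAX_UNICODE 0x10FFFFu

/*
 * Hypothetical helper: count how many members of [start, finish] an
 * enumeration capped at UCHAR_MAX would record. Anything above 255 is
 * never visited, so a range lying entirely above 255 yields nothing.
 */
static unsigned
capped_enumeration(uint32_t start, uint32_t finish)
{
	unsigned added = 0;

	for (uint32_t i = 0; i <= UCHAR_MAX; i++) {
		if (start <= i && i <= finish)
			added++;
	}
	return (added);
}

/*
 * Hypothetical helper: range membership by direct code point comparison,
 * with no collation lookup and no enumeration of candidates.
 */
static bool
range_contains(uint32_t start, uint32_t finish, uint32_t ch)
{
	return (start <= ch && ch <= finish);
}

/* Bytes needed for a one-bit-per-code-point mask over all of Unicode. */
static size_t
full_bitmap_bytes(void)
{
	return ((MAX_UNICODE + 1u) / CHAR_BIT);
}
```

Under these assumptions, capped_enumeration(0x0430, 0x044F) (Cyrillic small letters) records no members at all, while range_contains() accepts 0x0431 as expected. With 8-bit bytes, full_bitmap_bytes() comes to 139264 bytes, roughly 136 KiB per full-range bitmask.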