Date: Mon, 26 Apr 2021 20:14:32 -0700
From: Mark Millard <marklmi@yahoo.com>
To: Fernando Apesteguía <fernape@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: Regular expression compilation fail in current
Message-ID: <CC70147B-9A24-433A-8678-31BD183DEE7F@yahoo.com>
In-Reply-To: <CAGwOe2bwyLihdOzyxVYgdaSUTwGzELANARSh=HGQoou=5FgG%2Bg@mail.gmail.com>
References: <CAGwOe2bwyLihdOzyxVYgdaSUTwGzELANARSh=HGQoou=5FgG%2Bg@mail.gmail.com>

On 2021-Apr-26, at 06:31, Fernando Apesteguía <fernape at freebsd.org> wrote:

> Hi there,
>
> I'm working with this port PR
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255182
>
> and the problem seems to boil down to a regular expression that does
> not compile on current but it does in 12.2.
>
> The minimum repro is this one:
>
> #include <regex.h>
> #include <stdio.h>
>
> int
> main()
> {
>     regex_t regexp;
>     int ret = regcomp(&regexp, "\\s*", REG_EXTENDED | REG_ICASE |
> REG_NOSUB);

Here is my stab at notes for this . . .

It is not all that uncommon for an error case to be mishandled
(silently accepted) at first and then rejected by later toolchains.
I suspect that is what is going on here. The details seem to be as
follows.

Using C++11's raw string literal notation to specify string content,
"\\s*" is:

    R"%(\s*)%"

In other words, the content of the string is just:

    \s*

(3 characters, plus a terminating '\0'). It is this latter string
content that the regcomp 2nd parameter points to and that leads to
the error report.

The "s" is not valid after the backslash for Basic Regular
Expressions or for Extended Regular Expressions.
( https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html )

REG_EESCAPE is described at:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html

as:

QUOTE
REG_EESCAPE
    Trailing <backslash> character in pattern.
END QUOTE

In other words: a backslash not paired with anything valid just
after it, so it is trailing whatever came before it.

If you meant the parameter to point in memory to:

    \\s*

( 4 characters, plus a terminating '\0' after it, a.k.a.
R"%(\\s*)%" ) you likely want the C-string "\\\\s*" as the
argument:

    regcomp(&regexp, "\\\\s*", REG_EXTENDED | REG_ICASE | REG_NOSUB)

If you meant some other character sequence in memory, I'd have to
know what it was to try to back-translate it to C source that would
produce the correct content in the memory pointed to.

>     if ( ret != 0) {
>         printf("regexp compilation failed: %d\n", ret);
>     }
>
>     return 0;
> }
>
> This one works in 12.2

It might not be rejected, but what does it do? And is that
conformant with:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

?

> but fails to compile the regexp in FreeBSD
> 14.0-CURRENT #11 main-n245984-15221c552b3c with error 5 REG_EESCAPE
> `\' applied to unescapable character.
>
> Any help is appreciated.

Note: While I used C++11's notation as one way of indicating string
content, no C standard has the notation to my knowledge.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)