Date: Tue, 11 Apr 2017 15:20:58 -0500
From: Kyle Evans <kevans91@ksu.edu>
To: <freebsd-hackers@freebsd.org>, <freebsd-standards@freebsd.org>
Subject: Replacing libgnuregex
Message-ID: <CACNAnaEmBjWudEJwvRTSqyciOp7-oRbCEQ_e6qtGsap0oHQ4yw@mail.gmail.com>
Hello!

To start, I'm cross-posting to freebsd-hackers@ and freebsd-standards@, since this pertains to both: it is a question of how strictly we follow the standards, as well as one of approach. This e-mail outlines my questions first, then my personal opinion.

== Almost objective, obviously biased stuff ==

The first question we must answer: is it strictly necessary that we maintain a separate library for gnuregex, or would it be feasible/desirable to extend libc/regex to include GNU extensions?

There are obvious benefits to both, but the former (a drop-in replacement for libgnuregex) seems like it's going to be more difficult to find. We only have two base consumers of libgnuregex (at the moment), but one must consider potential other consumers, since this doesn't seem to be a private library.

On the other hand, I think I could fairly easily implement most of the GNU extensions in libc/regex. Here's a summary of what this option entails adding to libc/regex, from what I've found:

* Empty subexpressions(*)
* Add missing quantifiers to BREs: \?, \+
* Add branching to BREs: \|
* Add backreferences (\1 through \9) to EREs
* Add \w, \W, \s, and \S corresponding to [[:alnum:]], [^[:alnum:]], [[:space:]], and [^[:space:]] respectively
* Add word boundaries and anchors:
** \b: Word boundary
** \B: Not word boundary
** \<: Start of word
** \>: End of word
** \`: Start of subject string
** \': End of subject string

(*) I didn't actually find anything explicitly stating this as a GNU extension, but it's certainly not conformant to POSIX specifications to use. It gets used a tiny bit in some ports, and we implement a workaround in bsdgrep(1) for the simplest case of empty expressions ("") so that they match everything and produce zero-length matches.

The main benefit of this is not having to maintain a completely separate regex parser, along with the potential for inconsistencies that comes with it.
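As a rough illustration of the shorthand-class item in the list above, the mapping can be sketched as a rewrite of an ERE into plain POSIX bracket expressions. This is only a sketch under stated assumptions, not how libc/regex would implement it: `expand_gnu_classes` and `matches_ere` are hypothetical helper names, the mapping is the one given in the message (GNU's own \w also matches underscore, which this mapping leaves out), and bracket expressions in the input are not treated specially.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Mapping taken from the message; returns NULL for any other escape. */
static const char *gnu_class_expansion(char c)
{
    switch (c) {
    case 'w': return "[[:alnum:]]";
    case 'W': return "[^[:alnum:]]";
    case 's': return "[[:space:]]";
    case 'S': return "[^[:space:]]";
    default:  return NULL;
    }
}

/*
 * Hypothetical helper: copy the ERE `in` to `out`, expanding \w, \W, \s
 * and \S. Other escapes (such as \. or \\) are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small.
 * Limitation: escapes inside [...] are not recognised as literal text.
 */
int expand_gnu_classes(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    for (size_t i = 0; in[i] != '\0'; i++) {
        if (in[i] == '\\' && in[i + 1] != '\0') {
            const char *rep = gnu_class_expansion(in[i + 1]);
            if (rep == NULL) {
                /* Unrelated escape: keep the backslash and its operand. */
                if (o + 2 >= outsz)
                    return -1;
                out[o++] = in[i++];
                out[o++] = in[i];
                continue;
            }
            size_t n = strlen(rep);
            if (o + n >= outsz)
                return -1;
            memcpy(out + o, rep, n);
            o += n;
            i++; /* skip the class letter */
            continue;
        }
        if (o + 1 >= outsz)
            return -1;
        out[o++] = in[i];
    }
    out[o] = '\0';
    return 0;
}

/* Returns 1 on match, 0 on no match, -1 if the pattern fails to compile. */
int matches_ere(const char *pattern, const char *text)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Calling `expand_gnu_classes("\\w+\\s\\w+", buf, sizeof buf)` and passing `buf` to `matches_ere` exercises only standard POSIX ERE syntax, so the result does not depend on whether the platform's regcomp(3) already accepts the GNU escapes. The other items in the list (BRE \|, ERE backreferences, word boundaries) cannot be expressed by this kind of textual rewrite and would need parser support.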
The downside is that this would seem to promote expressions that are not strictly POSIX conformant. Is this a problem? Is it a problem worth worrying about?

== Opinion ==

My personal opinion is that we should go the latter route and implement these features into libc/regex as a default behavior. Perhaps we could add a flag so that an application *could* opt out of GNU extensions (a "strict POSIX" type of flag) if it finds them undesirable, though that may not be deemed necessary.

Ultimately, the GNU extensions are just that: extensions. There's no direct harm that I can think of in accepting them in our libc, and they do indeed provide some sensible features with little cost added to our current implementation.

I'd personally like to have one parser that does it all so that when a regex-parsing bug does come in, there's no initial triage *at all* of whether it's a gnuregex bug or a libc/regex bug.

Thoughts? What all have I missed?

Thanks,

Kyle Evans