FreeBSD Mail Archives

Date:      Tue, 11 Apr 2017 15:20:58 -0500
From:      Kyle Evans <kevans91@ksu.edu>
To:        <freebsd-hackers@freebsd.org>, <freebsd-standards@freebsd.org>
Subject:   Replacing libgnuregex
Message-ID:  <CACNAnaEmBjWudEJwvRTSqyciOp7-oRbCEQ_e6qtGsap0oHQ4yw@mail.gmail.com>


Hello!

To start, I'm cross-posting to freebsd-hackers@ and freebsd-standards@,
since this pertains to both: it is a question of how strictly we follow
the standards as well as one of implementation approach. This e-mail
outlines my questions first, then my personal opinion.

== Almost objective, obviously biased stuff ==

The first question we must answer: is it strictly necessary that we
maintain a separate library for gnuregex, or would it be
feasible/desirable to extend libc/regex to include GNU extensions?

There are obvious benefits to both, but the former (a drop-in replacement
for libgnuregex) seems like it's going to be more difficult to find. We
only have two consumers of libgnuregex in base (at the moment), but one
must consider potential other consumers, since this doesn't seem to be a
private library.

On the other hand, I think I could fairly easily implement most of the
GNU extensions in libc/regex. Here's a summary of what this option
entails adding to libc/regex, from what I've found:

* Empty subexpressions(*)
* Add missing quantifiers to BREs: \?, \+
* Add branching to BREs: \|
* Add backreferences (\1 through \9) to EREs
* Add \w, \W, \s, and \S corresponding to [_[:alnum:]], [^_[:alnum:]],
[[:space:]], and [^[:space:]] respectively
* Add word boundaries and anchors:
** \b: word boundary
** \B: not word boundary
** \<: Start of word
** \>: End of word
** \`: Start of subject string
** \': End of subject string

(*) I didn't actually find anything explicitly stating that this is a GNU
extension, but using it is certainly not conformant to POSIX. It gets used
a tiny bit in some ports, and bsdgrep(1) implements a workaround for the
simplest case, the empty expression (""), so that it matches everything
and produces zero-length matches.
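
To make the list above concrete, here is a rough sketch of a run-time
probe that reports which of these extensions a given regcomp(3) accepts.
The helper names (probe, gnu_cases, report_extensions) and the sample
patterns are my own and purely illustrative; only regcomp(), regexec()
and regfree() are the standard POSIX interface. A result of "match"
suggests the extension is honored, while "no match" or "rejected"
suggests it is not; results will differ between regex implementations.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: compile `pattern` with `cflags` and run it against
 * `subject`.  Returns 1 on a match, 0 on no match, and -1 if regcomp()
 * rejects the pattern.
 */
static int
probe(const char *pattern, int cflags, const char *subject)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && subject != NULL);
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, subject, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}

struct probe_case {
	const char *name;	/* label printed in the report */
	const char *pattern;	/* pattern using one extension */
	int cflags;		/* 0 for BRE, REG_EXTENDED for ERE */
	const char *subject;	/* string a GNU-style matcher should match */
};

/* Illustrative patterns only; one per extension from the list above. */
static const struct probe_case gnu_cases[] = {
	{ "empty subexpression (ERE)",	"a()b",		REG_EXTENDED,	"ab" },
	{ "BRE \\?",			"ab\\?c",	0,		"ac" },
	{ "BRE \\+",			"ab\\+c",	0,		"abbc" },
	{ "BRE \\|",			"a\\|b",	0,		"b" },
	{ "ERE backreference",		"(a)\\1",	REG_EXTENDED,	"aa" },
	{ "\\w and \\s",		"\\w\\s\\w",	REG_EXTENDED,	"a b" },
	{ "\\b",			"\\bfoo\\b",	REG_EXTENDED,	"a foo b" },
	{ "\\< and \\>",		"\\<foo\\>",	0,		"a foo b" },
	{ "\\` and \\'",		"\\`foo\\'",	0,		"foo" },
};

/* Print one line per probed extension. */
static void
report_extensions(FILE *out)
{
	size_t i, n = sizeof(gnu_cases) / sizeof(gnu_cases[0]);

	for (i = 0; i < n; i++) {
		int r = probe(gnu_cases[i].pattern, gnu_cases[i].cflags,
		    gnu_cases[i].subject);

		fprintf(out, "%-28s %s\n", gnu_cases[i].name,
		    r == 1 ? "match" : r == 0 ? "no match" : "rejected");
	}
}
```

Calling report_extensions(stdout) from a small main() on two systems
should give a quick picture of where their regex implementations diverge.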

The main benefit of this is not having to maintain a completely separate
regex parser and the potential for inconsistencies that come along with it.
The downside is that it would seem to promote expressions that are not
strictly POSIX conformant. Is this a problem? Is this a problem worth
worrying about?

== Opinion ==

My personal opinion is that we should go the latter route and implement
these features in libc/regex as the default behavior, perhaps with a flag
(a "strict POSIX" type of flag) so that an application *could* opt out of
GNU extensions if it so chooses or finds them undesirable, though that may
not be deemed necessary.
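
As a rough illustration of what such an opt-out could look like from the
application's side, here is a sketch of a wrapper that refuses patterns
using the GNU escapes discussed earlier. Everything here is hypothetical:
regcomp_strict(), uses_gnu_escape() and the `strict` argument are names
I made up, not an existing or proposed libc interface, and a real
implementation would live inside the parser rather than pre-scanning the
pattern. It also does not try to detect empty subexpressions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical scanner: returns 1 if `pat` uses one of the GNU escapes
 * outside of a bracket expression, 0 otherwise.  `extended` is nonzero
 * for EREs, where \?, \+ and \| are ordinary escaped literals but
 * backreferences are an extension.
 */
static int
uses_gnu_escape(const char *pat, int extended)
{
	const char *p;
	int in_bracket = 0;

	assert(pat != NULL);
	for (p = pat; *p != '\0'; p++) {
		if (in_bracket) {
			/* Skip [:class:], [.coll.] and [=equiv=] as a unit. */
			if (p[0] == '[' &&
			    (p[1] == ':' || p[1] == '.' || p[1] == '=')) {
				char kind = p[1];
				const char *q = p + 2;

				while (*q != '\0' &&
				    !(q[0] == kind && q[1] == ']'))
					q++;
				if (*q == '\0')
					break;	/* unterminated; regcomp reports it */
				p = q + 1;
				continue;
			}
			if (*p == ']')
				in_bracket = 0;
			continue;
		}
		if (*p == '[') {
			in_bracket = 1;
			if (p[1] == '^')
				p++;
			if (p[1] == ']')
				p++;	/* a leading ']' is a literal */
			continue;
		}
		if (*p != '\\')
			continue;
		p++;
		if (*p == '\0')
			break;		/* trailing backslash */
		if (strchr("wWsSbB<>`'", *p) != NULL)
			return (1);
		if (!extended && strchr("?+|", *p) != NULL)
			return (1);
		if (extended && *p >= '1' && *p <= '9')
			return (1);
	}
	return (0);
}

/*
 * Hypothetical wrapper: behaves like regcomp(), except that with
 * `strict` set it returns REG_BADPAT for patterns using GNU escapes.
 */
static int
regcomp_strict(regex_t *preg, const char *pattern, int cflags, int strict)
{
	if (strict && uses_gnu_escape(pattern, (cflags & REG_EXTENDED) != 0))
		return (REG_BADPAT);
	return (regcomp(preg, pattern, cflags));
}
```

An application that wants plain POSIX behavior would pass strict = 1;
everyone else would get the extensions by default, as proposed above.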

Ultimately, the GNU extensions are just that: extensions. There's no direct
harm that I can think of in accepting them in our libc, and they do indeed
provide some sensible features with little cost added to our current
implementation. I'd personally like to have one parser that does it all so
that when a regex-parsing bug does come in, there's no initial triage *at
all* of whether it's a gnuregex bug or a libc/regex bug.

Thoughts? What all have I missed?

Thanks,

Kyle Evans
