Date:      Mon, 2 Sep 2013 20:52:18 +0300
From:      Kimmo Paasiala <kpaasial@gmail.com>
To:        Andriy Gapon <avg@freebsd.org>
Cc:        FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Re: bug with special bracket expressions in regular expressions
Message-ID:  <CA+7WWSd0=m_4fBxTEoVzj15++7az7WviENY6ah=39wM_R9FWPw@mail.gmail.com>
In-Reply-To: <5224C08E.1070404@FreeBSD.org>
References:  <5224A693.3000904@FreeBSD.org> <5224C08E.1070404@FreeBSD.org>

On Mon, Sep 2, 2013 at 7:45 PM, Andriy Gapon <avg@freebsd.org> wrote:
> on 02/09/2013 17:54 Andriy Gapon said the following:
>>
>> re_format(7) says:
>>      There are two special cases‡ of bracket expressions: the bracket
>>      expressions ‘[[:<:]]’ and ‘[[:>:]]’ match the null string at the
>>      beginning and end of a word respectively.  A word is defined as a
>>      sequence of word characters which is neither preceded nor followed by
>>      word characters.  A word character is an alnum character (as defined
>>      by ctype(3)) or an underscore.  This is an extension, compatible with
>>      but not specified by IEEE Std 1003.2 (“POSIX.2”), and should be used
>>      with caution in software intended to be portable to other systems.
>>
>> However I observe the following:
>> $ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
>> xx
>> $ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
>> cd1 xx
>>
>> In my opinion '[[:<:]]' should not affect how the pattern is matched in
>> this case.
>
> It seems that the code works like this:
> - first it matches "cd0 " and "removes" it
> - then it passes "cd1 xx" for matching with a flag that tells that this is
>   not a real start of the string
> - thus the matching code
>  o knows that this is not a real line start, so it can't match [[:<:]]
>    just for that reason
>  o it does _not_ know what was the character before the start of the given
>    substring, so it can not know if it could match [[:<:]]
>
> So matching fails.
> Not sure if this is an internal problem of regex(3) or a problem of how
> sed(1) uses regex(3).
>
> --
> Andriy Gapon

In my opinion this is a bug. The [[:<:]] operator is documented to match the
empty string at the beginning of a word, with no requirement that the word
be at the beginning of the whole string being matched. The OS X version of
sed(1) behaves differently and removes both words:

$ echo "cd0 cd1 xx" | sed 's/cd[0-9][^ ]* *//g'
xx
$ echo "cd0 cd1 xx" | sed 's/[[:<:]]cd[0-9][^ ]* *//g'
xx
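
To make the difference concrete, here is a minimal Python sketch of the two
strategies. It is only a model built on the description quoted above, not the
actual regex(3) or sed(1) code. Assumptions: Python's \b stands in for
[[:<:]] (reasonable here because the pattern starts with a word character),
the function names are made up for this example, and the "sliced" model is
only meaningful for patterns that begin with the word-start assertion.

```python
import re

# \b stands in for [[:<:]]; the pattern begins with a word character,
# so \b can only succeed at the beginning of a word.
WORD_START_CD = re.compile(r"\bcd[0-9][^ ]* *")


def delete_all_sliced(pattern, text):
    """Model of the failing case: rematch on the remaining substring.

    Hypothetical helper. Only valid for patterns that start with a
    word-start assertion.
    """
    out = []
    offset = 0
    while offset < len(text):
        rest = text[offset:]
        # On a continuation the matcher is told that offset 0 is not a real
        # start, and it cannot see the previous character, so the assertion
        # is refused there. Model that by starting the search at index 1.
        first_allowed = 0 if offset == 0 else 1
        m = pattern.search(rest, first_allowed)
        if m is None or m.end() == m.start():
            break
        out.append(rest[:m.start()])
        offset += m.end()
    out.append(text[offset:])
    return "".join(out)


def delete_all_in_context(pattern, text):
    """Model of the expected case: always match against the full string.

    Passing a start position keeps the preceding character visible, so
    the word-start assertion can be evaluated after an earlier match.
    """
    out = []
    pos = 0
    while pos < len(text):
        m = pattern.search(text, pos)
        if m is None or m.end() == m.start():
            break
        out.append(text[pos:m.start()])
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    line = "cd0 cd1 xx"
    print(delete_all_sliced(WORD_START_CD, line))      # cd1 xx
    print(delete_all_in_context(WORD_START_CD, line))  # xx
```

The first function reproduces the "cd1 xx" result and the second the "xx"
result. If the model is right, the fix would be for the caller to hand the
matcher the whole line plus a start offset instead of a bare substring.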

-Kimmo


