Date:      Tue, 27 Sep 2011 07:38:28 +0100
From:      Matthew Seaman <m.seaman@infracaninophile.co.uk>
To:        grarpamp <grarpamp@gmail.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: Regex Wizards
Message-ID:  <4E816F64.1030404@infracaninophile.co.uk>
In-Reply-To: <CAD2Ti29Uvz6tBp60SYnD-5bJ8Jf=ThbVG5UUU21NWmmqOrO5SA@mail.gmail.com>
References:  <CAD2Ti29Uvz6tBp60SYnD-5bJ8Jf=ThbVG5UUU21NWmmqOrO5SA@mail.gmail.com>

On 27/09/2011 03:02, grarpamp wrote:
> Under the ERE implementation in RELENG_8, I'm having
> trouble figuring out how to group and backreference this.
>
> Given a line, where:
>  If AAA is present, CCC will be too, and B may appear in between.
>  If AAA is not present, neither CCC nor B will be present.
>  DDDD is always present.
>  Junk may be present.
>  Match good lines and output in chunks.
>
> echo junkAAAABCCCDDDDjunk | \
>
> This works as expected:
> sed -E -n 's,^.*(AAAB?CCC)(DDDD).*$,1 \1 2 \2,p'
> 1 AAABCCC 2 DDDD
>
> But making the leading bits optional per spec does not work:
> sed -E -n 's,^.*(AAAB?CCC)?(DDDD).*$,1 \1 2 \2,p'
> 1  2 DDDD
>
> Nor does adding the usual grouping parens:
> sed -E -n 's,^.*((AAAB?CCC)?)(DDDD).*$,1 \1 2 \2,p'
> 1 2
>
> How do I group off the leading bits?
> Or is this a limitation of ERE's?
> Or a bug?

Hmmmm.... works fine with perl REs, or with sed if you trim the 'match
any sequence of characters' bits at the beginning and end of the line:

% echo junkAAAABCCCDDDDjunk | perl -nle 'm/(AAAB?CCC)?(DDDD)/ && print "1 $1 2 $2";'
1 AAABCCC 2 DDDD

% echo junkAAAABCCCDDDDjunk | sed -E -n 's/(AAAAB?CCC)?(DDDD)/1 \1 2 \2/p'
junk1 AAAABCCC 2 DDDDjunk

Of course, the problem with sed is that you're using a *substitution*
command rather than just printing out what the RE matched.  Suppressing
the leading and trailing junk from the output is what is screwing you up.
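
As a rough illustration of "just printing out what the RE matched" (a
sketch, not something posted in this thread), grep's -o option prints only
the matched text instead of rewriting the whole line. The function name
match_only is hypothetical, and -o is an extension found in GNU and
FreeBSD grep rather than a POSIX requirement:

```sh
# Hypothetical helper, not from the original thread:
# print only the part of each input line that the ERE matched.
match_only() {
    grep -oE '(AAAB?CCC)?DDDD'
}

printf '%s\n' junkAAAABCCCDDDDjunk junkDDDDjunk | match_only
# AAABCCCDDDD
# DDDD
```

With no leading '.*' to compete with it, the optional group is picked up
whenever it is present, and the junk never reaches the output. Splitting
the result into separate chunks would still need a second step.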


Trouble is, that '^.*' term in your RE is greedy, so it will match to the
end of AAABCCC, then the RE engine will say to itself 'I've found DDDD,
so I'm not going to backtrack and look for all the optional AAAB?CCC
stuff.'  In fact, adding the bits to match the leading and trailing junk
makes the RE ambiguous -- there are two ways it could match your test
string (either '^.*' swallows AAABCCC and the optional group matches
nothing, or '^.*' stops short and the group captures it), and the law of
natural cussedness being what it is, it chooses the wrong one.
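
One possible way around the ambiguity, offered here as an assumption
rather than anything proposed in this thread, is to split the job in two:
first try a substitution in which the AAAB?CCC group is mandatory, and
only fall back to a DDDD-only substitution when that fails. The function
name extract_chunks is hypothetical. The 't' command with no label
branches to the end of the script once a substitution has succeeded, so
the fallback never runs on a line the first pattern already handled:

```sh
# Hypothetical helper, not from the original thread.
# Pass 1: the AAA...CCC group is mandatory, so '^.*' cannot swallow it.
# 't':    skip the fallback if pass 1 substituted (and printed).
# Pass 2: lines with DDDD but no AAA...CCC get an empty first chunk.
extract_chunks() {
    sed -E -n \
        -e 's,^.*(AAAB?CCC)(DDDD).*$,1 \1 2 \2,p' \
        -e 't' \
        -e 's,^.*(DDDD).*$,1  2 \1,p'
}

printf '%s\n' junkAAAABCCCDDDDjunk junkDDDDjunk | extract_chunks
# 1 AAABCCC 2 DDDD
# 1  2 DDDD
```

Because neither substitution contains an optional group next to a greedy
'.*', there is only one sensible way for each of them to match.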

	Cheers,

	Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matthew@infracaninophile.co.uk               Kent, CT11 9PW





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4E816F64.1030404>