Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 22 Aug 2010 11:02:37 -0400 (EDT)
From:      Garrett Wollman <wollman@hergotha.csail.mit.edu>
To:        des@des.no
Cc:        current@freebsd.org
Subject:   Re: why GNU grep is fast
Message-ID:  <201008221502.o7MF2bfZ086738@hergotha.csail.mit.edu>
In-Reply-To: <86k4nikglg.fsf@ds4.des.no>
References:  <201008210231.o7L2VRvI031700@ducky.net> <86k4nikglg.fsf@ds4.des.no>

next in thread | previous in thread | raw e-mail | index | archive | help
In article <86k4nikglg.fsf@ds4.des.no> you write:
>Mike Haertel <mike@ducky.net> writes:
>> GNU grep uses the well-known Boyer-Moore algorithm, which looks
>> first for the final letter of the target string, and uses a lookup
>> table to tell it how far ahead it can skip in the input whenever
>> it finds a non-matching character.
>
>Boyer-Moore is for fixed search strings.  I don't see how that
>optimization can work with a regexp search unless the regexp is so
>simple that you break it down into a small number of cases with known
>length and final character.

The common case of regexps used in the "grep" utility (and, for
obvious reasons, universal in the "fgrep" utility) is fixed-length
search strings.  Even non-fixed-length regexps typically consist of
one one or two variable-length parts.  Matching a completely
variable-length regexp is just hard, computationally, so it's OK for
it to be slower.  There are other tricks you can do, such as
turning the anchors ^ and $ into explicit newlines in your search --
"^foo" is a very common regexp to search for, and it's really a
fixed-string search for "\nfoo" which is entirely amenable to the B-M
treatment.  You just have to remember that a matched newline isn't
part of the result.

The GNU regexp library also uses the Boyer-Moore (or is it
Boyer-Moore-Gosper?) strategy.

-GAWollman




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201008221502.o7MF2bfZ086738>