From owner-freebsd-current@FreeBSD.ORG  Sun Aug 22 17:04:08 2010
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9FCFB1065670
	for <current@freebsd.org>; Sun, 22 Aug 2010 17:04:08 +0000 (UTC)
	(envelope-from wollman@hergotha.csail.mit.edu)
Received: from hergotha.csail.mit.edu (hergotha.csail.mit.edu [66.92.79.170])
	by mx1.freebsd.org (Postfix) with ESMTP id 55D678FC19
	for <current@freebsd.org>; Sun, 22 Aug 2010 17:04:08 +0000 (UTC)
Received: from hergotha.csail.mit.edu (localhost [127.0.0.1])
	by hergotha.csail.mit.edu (8.14.4/8.14.4) with ESMTP id o7MH40BG088796; 
	Sun, 22 Aug 2010 13:04:00 -0400 (EDT)
	(envelope-from wollman@hergotha.csail.mit.edu)
Received: (from wollman@localhost)
	by hergotha.csail.mit.edu (8.14.4/8.14.4/Submit) id o7MH40pV088795;
	Sun, 22 Aug 2010 13:04:00 -0400 (EDT) (envelope-from wollman)
Date: Sun, 22 Aug 2010 13:04:00 -0400 (EDT)
From: Garrett Wollman <wollman@hergotha.csail.mit.edu>
Message-Id: <201008221704.o7MH40pV088795@hergotha.csail.mit.edu>
To: ed@80386.nl
X-Newsgroups: mit.lcs.mail.freebsd-current
In-Reply-To: <20100822163644.GU2978@hoeg.nl>
References: <20100822163644.GU2978@hoeg.nl>
	<201008210231.o7L2VRvI031700@ducky.net>
Organization: 
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.5
	(hergotha.csail.mit.edu [127.0.0.1]);
	Sun, 22 Aug 2010 13:04:01 -0400 (EDT)
X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED
	autolearn=disabled version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	hergotha.csail.mit.edu
X-Mailman-Approved-At: Sun, 22 Aug 2010 17:34:17 +0000
Cc: current@freebsd.org
Subject: Re: why GNU grep is fast
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Aug 2010 17:04:08 -0000

In article <20100822163644.GU2978@hoeg.nl> you write:

>I think that implementing a simple fgrep boils down to mmap()ing a file
>and calling memmem() on the mapping to search for the input string. Of
>course this relies on having an efficient memmem() implementation, for
>example using one of the algorithms mentioned in this thread.

It's actually more complicated than that, because you have to ensure
that you are not matching the middle of a multibyte character, when
the current locale specifies a character set with a multibyte
encoding.  Likewise when searching for the newlines that delimit the
matched line.  (I'm not sure whether FreeBSD supports any character
encodings that would be ambiguous in that way.)  I don't think this
was considered an issue when Mike Haertel was developing GNU grep.

It seems reasonable to implement BMG or some other fast search in
memmem().  Note that if you can't (or don't want to) mmap the whole
file at once, you'll need special handling for the boundary conditions
-- both at the string search level and at the search for line
delimiters.  This is much easier in the fgrep case, obviously, since
the length of the query puts a finite upper bound on the amount of the
old buffer you need to keep -- with regexps you really need your
regexp engine to be able to report its matching state, or else limit
your input to strictly conforming POSIX text files (i.e., line lengths
limited to {LINE_MAX}).

-GAWollman