From owner-freebsd-current@FreeBSD.ORG Sun Aug 22 17:04:08 2010 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9FCFB1065670 for ; Sun, 22 Aug 2010 17:04:08 +0000 (UTC) (envelope-from wollman@hergotha.csail.mit.edu) Received: from hergotha.csail.mit.edu (hergotha.csail.mit.edu [66.92.79.170]) by mx1.freebsd.org (Postfix) with ESMTP id 55D678FC19 for ; Sun, 22 Aug 2010 17:04:08 +0000 (UTC) Received: from hergotha.csail.mit.edu (localhost [127.0.0.1]) by hergotha.csail.mit.edu (8.14.4/8.14.4) with ESMTP id o7MH40BG088796; Sun, 22 Aug 2010 13:04:00 -0400 (EDT) (envelope-from wollman@hergotha.csail.mit.edu) Received: (from wollman@localhost) by hergotha.csail.mit.edu (8.14.4/8.14.4/Submit) id o7MH40pV088795; Sun, 22 Aug 2010 13:04:00 -0400 (EDT) (envelope-from wollman) Date: Sun, 22 Aug 2010 13:04:00 -0400 (EDT) From: Garrett Wollman Message-Id: <201008221704.o7MH40pV088795@hergotha.csail.mit.edu> To: ed@80386.nl X-Newsgroups: mit.lcs.mail.freebsd-current In-Reply-To: <20100822163644.GU2978@hoeg.nl> References: <20100822163644.GU2978@hoeg.nl> <201008210231.o7L2VRvI031700@ducky.net> Organization: X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.5 (hergotha.csail.mit.edu [127.0.0.1]); Sun, 22 Aug 2010 13:04:01 -0400 (EDT) X-Spam-Status: No, score=-1.0 required=5.0 tests=ALL_TRUSTED autolearn=disabled version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on hergotha.csail.mit.edu X-Mailman-Approved-At: Sun, 22 Aug 2010 17:34:17 +0000 Cc: current@freebsd.org Subject: Re: why GNU grep is fast X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 22 Aug 2010 17:04:08 -0000 In article <20100822163644.GU2978@hoeg.nl> you write: >I think that implementing a simple fgrep boils down to mmap()ing a file >and calling memmem() on the mapping to search for the input string. Of >course this relies on having an efficient memmem() implementation, for >example using one of the algorithms mentioned in this thread. It's actually more complicated than that, because you have to ensure that you are not matching the middle of a multibyte character, when the current locale specifies a character set with a multibyte encoding. Likewise when searching for the newlines that delimit the matched line. (I'm not sure whether FreeBSD supports any character encodings that would be ambiguous in that way.) I don't think this was considered an issue when Mike Haertel was developing GNU grep. It seems reasonable to implement BMG or some other fast search in memmem(). Note that if you can't (or don't want to) mmap the whole file at once, you'll need special handling for the boundary conditions -- both at the string search level and at the search for line delimiters. This is much easier in the fgrep case, obviously, since the length of the query puts a finite upper bound on the amount of the old buffer you need to keep -- with regexps you really need your regexp engine to be able to report its matching state, or else limit your input to strictly conforming POSIX text files (i.e., line lengths limited to {LINE_MAX}). -GAWollman