Date:      Fri, 13 Aug 2010 08:47:38 -0500
From:      "Jack L. Stone" <jacks@sage-american.com>
To:        Chip Camden <sterling@camdensoftware.com>, freebsd-questions@freebsd.org
Subject:   Re: Grepping a list of words
Message-ID:  <3.0.1.32.20100813084738.00ee5c48@sage-american.com>
In-Reply-To: <20100812175614.GJ20504@libertas.local.camdensoftware.com>
References:  <867hjv92r2.fsf@gmail.com> <20100812153535.61549.qmail@joyce.lan> <201008121644.o7CGiflh099466@lurza.secnetix.de> <867hjv92r2.fsf@gmail.com>

At 10:56 AM 8.12.2010 -0700, Chip Camden wrote:
>Quoth Anonymous on Thursday, 12 August 2010:
>> Oliver Fromme <olli@lurza.secnetix.de> writes:
>> 
>> > John Levine <johnl@iecc.com> wrote:
>> >  > > > % egrep 'word1|word2|word3|...|wordn' filename.txt
>> >  > 
>> >  > > Thanks for the replies. This suggestion won't do the job as the
>> >  > > list of words is very long, maybe 50-60. This is why I asked how
>> >  > > to place them all in a file. One reply dealt with using a file
>> >  > > with egrep. I'll try that.
>> >  > 
>> >  > Gee, 50 words, that's about a 300 character pattern, that's not a
>> >  > problem for any shell or version of grep I know.
>> >  > 
>> >  > But reading the words from a file is equivalent and as you note most
>> >  > likely easier to do.
>> >
>> > The question is what is more efficient.  This might be
>> > important if that kind of grep command is run very often
>> > by a script, or if it's run on very large files.
>> >
>> > My guess is that one large regular expression is more
>> > efficient than many small ones.  But I haven't done real
>> > benchmarks to prove this.
>> 
>> BTW, not using regular expressions is even more efficient, e.g.
>> 
>>   $ fgrep -f /usr/share/dict/words /etc/group
>> 
>> When using egrep(1) it takes considerably more time and memory.
>
>Having written a regex engine myself, I can see why.  Though I'm sure
>egrep is highly optimized, even the most optimized DFA table is going to take
>more cycles to navigate than a simple string comparison.  Not to mention the
>initial overhead of parsing the regex and building that table.
>
>-- 
>Sterling (Chip) Camden    | sterling@camdensoftware.com | 2048D/3A978E4F

Many thanks for all of the suggestions. I found that this worked very well,
ignoring the concerns about resource use:

egrep -i -o -w -f word.file main.file
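
(For what it's worth, if the resource use ever starts to matter, a quick,
untested way to compare the regex and fixed-string runs on real data would
be something like:

  time egrep -i -o -w -f word.file main.file > /dev/null
  time fgrep -i -o -w -f word.file main.file > /dev/null

since fgrep treats each line of word.file as a fixed string rather than a
regular expression, as noted above.)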

The only thing it didn't do was the next step. My real objective is to
determine which words in the "word.file" are NOT in the "main.file." I
figured finding the matches would be the easy part, and that I could then
run a sort|uniq comparison to determine the "new words" not yet in the
main.file.
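
Here's a rough, untested sketch of that comparison, using comm(1) instead
of sort|uniq for the set difference (this assumes one plain word per line
in word.file; the tr and "sort -u" steps fold case and duplicates so the
two lists line up):

  egrep -i -o -w -f word.file main.file | \
      tr '[:upper:]' '[:lower:]' | sort -u > found.words
  tr '[:upper:]' '[:lower:]' < word.file | sort -u | comm -23 - found.words

The "comm -23" prints only the lines unique to its first input, i.e. the
words from word.file that never matched anything in main.file.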

Since I will need to run this check frequently, any suggestions for a
better approach are welcome.

Thanks again...

Jack

(^_^)
Happy trails,
Jack L. Stone

System Admin
Sage-american


