From owner-freebsd-questions@FreeBSD.ORG Thu Jan 23 16:30:37 2014
Date: Thu, 23 Jan 2014 09:30:35 -0700 (MST)
From: Warren Block
To: Paul Schmehl
Cc: Freebsd Questions
Subject: Re: awk programming question

On Thu, 23 Jan 2014, Paul Schmehl wrote:

> I'm kind of stubborn. There's lots of different ways to skin a cat, but I
> like to force myself to use the built-in utilities to do things so I can
> learn more about them and better understand how they work.
>
> So, I'm trying to parse a file of snort rules, extract two string values and
> insert a double pipe between them to create a sig-msg.map file
>
> Here's a typical rule:
>
> alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY Outbound
> TFTP Read Request"; content:"|00 01|"; depth:2; classtype:bad-unknown;
> sid:2008120; rev:1;)
>
> Here's a typical sig-msg.map file entry:
>
> 9624 || RPC UNIX authentication machinename string overflow attempt UDP
>
> So, from the above rule I would want to create a single line like this:
>
> 2008120 || E3[rb] ET POLICY Outbound TFTP Read Request
>
> There are several ways I can extract one or the other value, and I've figured
> out how to extract the sid and add the double pipe, but for the life of me I
> can't figure out how to extract and print out sid || msg.
>
> This prints out the sid and the double pipe:
>
> echo `awk 'match($0,/sid:[0-9]*;/) {print substr($0,RSTART,RLENGTH)" || "}'
> /tmp/mtc.rules | tr -d ";sid"
>
> It seems I could put the results into a variable rather than printing them
> out, and then print var1 || var2, but my google foo hasn't found a useful
> example.
>
> Surely there's a way to do this using awk? I can use tr for cleanup. I just
> need to get close to the right result.
>
> How about it awk experts? What's the cleanest way to get this done?

Not an awk expert, but you can do math on the start and length variables
to get just the numeric part:

  echo "sid:2008120;" \
    | awk '{ match($0, /sid:[0-9]*;/) ; \
      sid=substr($0, RSTART+4, RLENGTH-5) ; print sid }'

Closer to what you want:

  echo 'msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; sid:2008120;' \
    | awk '{ match($0, /sid:[0-9]*;/) ; \
      sid=substr($0, RSTART+4, RLENGTH-5) ; \
      match($0, /msg:.*;/) ; \
      msg = substr($0, RSTART+4, RLENGTH-5) ; \
      print sid, "||", msg }'

Note the error that the too-greedy regex creates: /msg:.*;/ matches
through the last semicolon on the line, so msg also picks up the quotes
and the trailing sid field. Note also the inability of awk to capture
regex sub-expressions.
awk does not have a way to reduce the greediness, at least that I'm aware
of. You may be able to work around that, like if the message is always
the same length.

sed, despite its many weaknesses, can capture subexpressions:

  echo "sid:2008120;" | sed -e 's/^.*sid:\([0-9]*\);.*$/\1/'

I don't think sed has a non-greedy modifier either. Basically, sed and
awk are frozen in the 1970s, back before it became popular to do useful
things. That was one reason Perl came along, and later, Python and Ruby.

  echo 'msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; sid:2008120;' \
    | perl -ne 'if ( /msg:"(.*?)";.*sid:(\d*?);/ ) { print "$2 || $1\n" };'

The regex uses the ? after a quantifier to reduce greediness, Perl's "\d"
instead of the longer [0-9], and the pattern capturing parens, which fill
in $1 and $2. The "if" statement is not required, but it's bad practice
to print the contents of pattern capture variables unless the match
actually succeeded.
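
For what it's worth, awk can get close without a non-greedy modifier by using a negated character class, so the match cannot run past the first closing quote. What follows is only a sketch under some assumptions: the function name rule_to_map is made up for illustration, each rule is assumed to sit on a single line, and the msg text is assumed not to contain an escaped double quote.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): read Snort rules
# on stdin, one rule per line, and print "sid || msg" for each rule.
rule_to_map() {
    awk '
    {
        sid = ""; msg = ""
        # [^"]* cannot cross a double quote, so the match ends at the
        # first closing quote instead of the last ";" on the line.
        if (match($0, /msg:"[^"]*"/))
            msg = substr($0, RSTART + 5, RLENGTH - 6)   # drop msg:" and the closing quote
        if (match($0, /sid:[0-9]+;/))
            sid = substr($0, RSTART + 4, RLENGTH - 5)   # drop sid: and the ;
        # Print only when both fields were found.
        if (sid != "" && msg != "")
            print sid " || " msg
    }'
}

# Usage with the rule from the original question:
printf '%s\n' 'alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] ET POLICY Outbound TFTP Read Request"; content:"|00 01|"; depth:2; classtype:bad-unknown; sid:2008120; rev:1;)' \
    | rule_to_map
```

For a whole rules file, something like "rule_to_map < /tmp/mtc.rules > sig-msg.map" should do, with no tr cleanup needed. Lines without both a msg and a sid are silently skipped, which may or may not be what you want.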