From owner-freebsd-questions@FreeBSD.ORG Wed Jan 3 18:48:06 2007
Date: Wed, 3 Jan 2007 10:48:05 -0800
From: "Kurt Buff" <kurt.buff@gmail.com>
To: "Ian Smith"
Cc: James Long, freebsd-questions@freebsd.org
Subject: Re: Batch file question - average size of file in directory

On 1/3/07, Ian Smith wrote:
> > Message: 17
> > Date: Tue, 2 Jan 2007 19:50:01 -0800
> > From: James Long
> >
> > > Message: 28
> > > Date: Tue, 2 Jan 2007 10:20:08 -0800
> > > From: "Kurt Buff"
> > >
> > > I don't even have a clue how to start this one, so am looking for a
> > > little help.
> > >
> > > I've got a directory with a large number of gzipped files in it (over
> > > 110k) along with a few thousand uncompressed files.
>
> If it were me I'd mv those into a bunch of subdirectories; things get
> really slow with more than 500 or so files per directory .. anyway ..

I just store them for a while - I delete them after two weeks if they're
not needed again. The overhead isn't enough to worry about at this point.

> > > I'd like to find the average uncompressed size of the gzipped files,
> > > and ignore the uncompressed files.
> > >
> > > How on earth would I go about doing that with the default shell (no
> > > bash or other shells installed), or in perl, or something like that?
> > > I'm no scripter of any great expertise, and am just stumbling over
> > > this trying to find an approach.
> > >
> > > Many thanks for any help,
> > >
> > > Kurt
> >
> > Hi, Kurt.
>
> And hi, James,
>
> > Can I make some assumptions that simplify things? No kinky filenames,
> > just [a-zA-Z0-9.]. My approach specifically doesn't like colons or
> > spaces, I bet. Also, you say gzipped, so I'm assuming it's ONLY gzip,
> > no bzip2, etc.
> >
> > Here's a first draft that might give you some ideas. It will output:
> >
> > foo.gz : 3456
> > bar.gz : 1048576
> > (etc.)
> >
> > find . -type f | while read fname; do
> >   file $fname | grep -q "compressed" && echo "$fname : $(zcat $fname | wc -c)"
> > done
>
> % file cat7/tuning.7.gz
> cat7/tuning.7.gz: gzip compressed data, from Unix
>
> Good check, though grep "gzip compressed" excludes bzip2 etc.
>
> But you REALLY don't want to zcat 110 thousand files just to wc 'em,
> unless it's a benchmark :) .. may I suggest a slight speedup, template:
>
> % gunzip -l cat7/tuning.7.gz
>   compressed  uncompr.  ratio  uncompressed_name
>        13642     38421  64.5%  cat7/tuning.7
>
> > If you really need a script that will do the math for you, then
> > pipe the output of this into bc:
> >
> > #!/bin/sh
> >
> > find . -type f | {
> >
> > n=0
> > echo scale=2
> > echo -n "("
> > while read fname; do
> -   if file $fname | grep -q "compressed"
> +   if file $fname | grep -q "gzip compressed"
> >   then
> -     echo -n "$(zcat $fname | wc -c)+"
> +     echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
> >     n=$(($n+1))
> >   fi
> > done
> > echo "0) / $n"
> >
> > }
> >
> > That should give you the average decompressed size of the gzip'ped
> > files in the current directory.
>
> HTH, Ian

Ah - yes, I think that's much better. I should have thought of awk.

At some point I'd like to do a bit more processing of file sizes, such
as trying to find out the number of IP packets each file would take
during an SMTP transaction, so that I could categorize overhead a bit,
but for now the average uncompressed file size is good enough.

Thanks again for your help!

Kurt
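
P.S. For anyone finding this in the archives: the whole thing can also be
collapsed into one pipeline, letting awk keep the running total instead of
piping into bc. This is only a rough sketch with the same assumptions as
above (no spaces or colons in filenames), and it uses "gzip -t" rather than
file(1) as the is-it-really-gzip check, since not every system has file(1)
installed:

```shell
#!/bin/sh
# Print the average uncompressed size of the gzip files under a directory.
# gzip records the uncompressed length in the file trailer, so "gunzip -l"
# reads it almost for free, compared with running zcat | wc -c on 110k files.
avg_gz_size() {
    find "${1:-.}" -type f | while read fname; do
        # Skip anything that is not actually gzip data (replaces the
        # "file | grep -q 'gzip compressed'" test from the script above).
        gzip -t "$fname" 2>/dev/null || continue
        # Line 2, field 2 of "gunzip -l" output is the uncompressed size.
        gunzip -l "$fname" | awk 'NR == 2 { print $2 }'
    done | awk '{ t += $1; n++ } END { if (n) printf "%.0f\n", t / n }'
}
```

Same caveat as James's version: "while read fname" will mangle filenames
containing whitespace, which is fine for the [a-zA-Z0-9.] names assumed here.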