From owner-freebsd-questions@FreeBSD.ORG Wed Jan 3 18:48:06 2007
Date: Wed, 3 Jan 2007 10:48:05 -0800
From: "Kurt Buff" <kurt.buff@gmail.com>
To: "Ian Smith"
Cc: James Long, freebsd-questions@freebsd.org
Subject: Re: Batch file question - average size of file in directory

On 1/3/07, Ian Smith wrote:
> > Message: 17
> > Date: Tue, 2 Jan 2007 19:50:01 -0800
> > From: James Long
> >
> > > Message: 28
> > > Date: Tue, 2 Jan 2007 10:20:08 -0800
> > > From: "Kurt Buff"
> > >
> > > I don't even have a clue how to start this one, so am looking for a
> > > little help.
> > >
> > > I've got a directory with a large number of gzipped files in it (over
> > > 110k) along with a few thousand uncompressed files.
>
> If it were me I'd mv those into a bunch of subdirectories; things get
> really slow with more than 500 or so files per directory .. anyway ..

I just store them for a while - I delete them after two weeks if they're
not needed again. The overhead isn't enough to worry about at this point.

> > > I'd like to find the average uncompressed size of the gzipped files,
> > > and ignore the uncompressed files.
> > >
> > > How on earth would I go about doing that with the default shell (no
> > > bash or other shells installed), or in perl, or something like that?
> > > I'm no scripter of any great expertise, and am just stumbling over
> > > this trying to find an approach.
> > >
> > > Many thanks for any help,
> > >
> > > Kurt
> >
> > Hi, Kurt.
>
> And hi, James,
>
> > Can I make some assumptions that simplify things? No kinky filenames,
> > just [a-zA-Z0-9.]. My approach specifically doesn't like colons or
> > spaces, I bet. Also, you say gzipped, so I'm assuming it's ONLY gzip,
> > no bzip2, etc.
> >
> > Here's a first draft that might give you some ideas. It will output:
> >
> > foo.gz : 3456
> > bar.gz : 1048576
> > (etc.)
> >
> > find . -type f | while read fname; do
> >   file $fname | grep -q "compressed" && echo "$fname : $(zcat $fname | wc -c)"
> > done
>
> % file cat7/tuning.7.gz
> cat7/tuning.7.gz: gzip compressed data, from Unix
>
> Good check, though grep "gzip compressed" excludes bzip2 etc.
>
> But you REALLY don't want to zcat 110 thousand files just to wc 'em,
> unless it's a benchmark :) .. may I suggest a slight speedup, template:
>
> % gunzip -l cat7/tuning.7.gz
>   compressed  uncompr.  ratio  uncompressed_name
>        13642     38421  64.5%  cat7/tuning.7
>
> > If you really need a script that will do the math for you, then
> > pipe the output of this into bc:
> >
> > #!/bin/sh
> >
> > find . -type f | {
> >
> > n=0
> > echo scale=2
> > echo -n "("
> > while read fname; do
> -   if file $fname | grep -q "compressed"
> +   if file $fname | grep -q "gzip compressed"
> >   then
> -     echo -n "$(zcat $fname | wc -c)+"
> +     echo -n "$(gunzip -l $fname | grep -v comp | awk '{print $2}')+"
> >     n=$(($n+1))
> >   fi
> > done
> > echo "0) / $n"
> >
> > }
> >
> > That should give you the average decompressed size of the gzip'ped
> > files in the current directory.
>
> HTH, Ian

Ah - yes, I think that's much better. I should have thought of awk.

At some point I'd like to do a bit more processing of file sizes, such
as trying to find out the number of IP packets each file would take
during an SMTP transaction, so that I could categorize overhead a bit,
but for now the average uncompressed file size is good enough.

Thanks again for your help!

Kurt
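
P.S. For anyone finding this in the archives: the whole thing can also be
collapsed into one pipeline, letting awk keep the running total instead of
piping into bc. This is only a rough sketch with the same assumptions as
above (no spaces or colons in filenames), and it uses "gzip -t" rather than
file(1) as the is-it-really-gzip check, since not every system has file(1)
installed:

```shell
#!/bin/sh
# Print the average uncompressed size of the gzip files under a directory.
# gzip records the uncompressed length in the file trailer, so "gunzip -l"
# reads it almost for free, compared with running zcat | wc -c on 110k files.
avg_gz_size() {
    find "${1:-.}" -type f | while read fname; do
        # Skip anything that is not actually gzip data (replaces the
        # "file | grep -q 'gzip compressed'" test from the script above).
        gzip -t "$fname" 2>/dev/null || continue
        # Line 2, field 2 of "gunzip -l" output is the uncompressed size.
        gunzip -l "$fname" | awk 'NR == 2 { print $2 }'
    done | awk '{ t += $1; n++ } END { if (n) printf "%.0f\n", t / n }'
}
```

Same caveat as James's version: "while read fname" will mangle filenames
containing whitespace, which is fine for the [a-zA-Z0-9.] names assumed here.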