From owner-freebsd-hackers  Tue Feb 15 18:26:02 2000
From: Joe Greco
Message-Id: <200002160219.UAA97518@aurora.sol.net>
Subject: Re: Filesystem size limit?
In-Reply-To: <20000216115914.H12517@freebie.lemis.com> from Greg Lehey at "Feb 16, 2000 11:59:14 am"
To: grog@lemis.com (Greg Lehey)
Date: Tue, 15 Feb 2000 20:19:12 -0600 (CST)
Cc: hackers@freebsd.org
Sender: owner-freebsd-hackers@FreeBSD.ORG

> On Tuesday, 15 February 2000 at  3:40:58 -0600, Joe Greco wrote:
> > So I wanted to vinum my new 1.9TB of disks together just for chuckles, and
> > it went OK up to the newfs..
> >
> > S play.p0.s0          State: up       PO:        0  B Size:       46 GB
> > S play.p0.s1          State: up       PO:       32 MB Size:       46 GB
> >
> > S play.p0.s37         State: up       PO:     1184 MB Size:       46 GB
>
> Well, it's a pity you weren't able to newfs it, but I'm glad to see
> that Vinum could do it.  I'm not sure that striping buys you anything
> here, though, and a 32 MB stripe is going to be worse than
> concatenation: you'll have *all* your superblocks on the same disk!

For a "play" filesystem, I didn't care, and for an un-newfs-able filesystem,
it's irrelevant anyway.  For production servers, I take the cylinder group
size in sectors and use that as the stripe size, hoping (of course) that
related metadata and files will end up on the same drive.  This is the
traditional optimization I've preached here for years.  With vinum it is
pretty easy, although I usually go through half a dozen "resetconfig"s
before I reach something I'm completely happy with.

> > Just thought I'd mention it.  I'm putting the machine into
> > production, with the smaller filesystems that I originally intended,
> > but it seemed noteworthy to pass this along.
>
> JOOI, how big are the file systems?  Why did you choose this size?

It all has to do with a unified design strategy.  In a news system, you
cannot afford to lose the history.  I've a hundred million articles on
spool, and to reconstruct the history I'd have to read them all.  Even
assuming I can do a hundred articles per second (possibly a bit more),
that means I'd need 11.5 days to reload the history from the spool.  I'd
rather not.  The history is also the most active filesystem: lots of seek
activity and lots of small reads and writes.  The actual spools do not
need nearly as much speed.

So, since I'm using 9-bay Kingston rack-mount drive arrays, what I did for
the smaller text spool servers was to set up two shelves of 18GB drives
(18 x 18GB ~= 324GB).  The history does not need to be large: maybe 15GB
total for the partition.  So I grab 1.5GB from each drive, make a plex out
of the top 9 drives and another out of the bottom 9 drives, and mirror
them.  Redundancy.  Hard to lose history.  For the data, which I'm less
concerned about losing due to higher-level redundancy in the network, I
simply stripe both drive 0's together for my "n0" partition, drive 1's for
"n1", ... drive 8's for "n8".
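
In vinum terms, that two-shelf layout comes out looking roughly like the
sketch below.  It's an illustration, not the actual config: the device
names and drive labels are made up, the stripe sizes are placeholders (the
spool stripe really gets set to the cg size in sectors, per the above),
and only a couple of bays per shelf are spelled out.

	# Two shelves of nine 18GB drives.  Device names are made up;
	# only bays 0 and 1 are shown, t2-t8 and b2-b8 look the same.
	drive t0 device /dev/da0e
	drive t1 device /dev/da1e
	drive b0 device /dev/da9e
	drive b1 device /dev/da10e

	# History: 1.5GB from every drive, one plex per shelf, mirrored.
	# Concat shown for brevity; striping within each plex works too.
	volume hist
	  plex org concat
	    sd length 1536m drive t0
	    sd length 1536m drive t1
	    # ...and so on for t2 through t8...
	  plex org concat
	    sd length 1536m drive b0
	    sd length 1536m drive b1
	    # ...and so on for b2 through b8...

	# Spool n0: bay 0 of each shelf striped together; n1-n8 repeat
	# the pattern on bays 1 through 8.  The 256m stripe is only a
	# placeholder - the real figure is the cg size in sectors.
	# "length 0" means "use the rest of the drive".
	volume n0
	  plex org striped 256m
	    sd length 0 drive t0
	    sd length 0 drive b0

Feed something along those lines to "vinum create", newfs the resulting
/dev/vinum/<volume> devices, and botched attempts go away with "vinum
resetconfig".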
This gives me 9 spool fs's and a history fs, each optimized for its task,
while keeping the number of drives to a minimum - since space can be very
expensive!

However, working with arbitrarily large numbers of spool filesystems is a
pain, so I don't know that I'd have a compelling reason to set up a server
with 18 spool fs's.  Yet when I built my binaries spool with 4 shelves,
that would have been the model.  Instead, I chose to take 750MB from each
of the top eighteen 50GB drives and stripe them into one half of the
history mirror, with 750MB from each of the bottom eighteen making up the
other half.  I then striped all _four_ drive 0's, 1's, etc. for my spools,
yielding nine 190GB spools (a rough sketch of this layout is appended at
the end of the message).  Lo and behold, it looks very similar at the
application level.

This all works out very nicely because accesses within a single spool
filesystem will tend to be striped not only between drives but also
between _controllers_, at least if the access is big enough to involve
more than a single stripe.  But, more importantly, it's easy to extend
the model, and some sort of logical consistency is important in this
business, where someone else may take over next year.

> > Dunno how many terabyte filesystem folks are out there.
>
> None, by the looks of it.  :-(

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator                             jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI                         414/342-4847
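
For completeness, here is a similarly hand-waved sketch of the four-shelf
binaries-spool layout described above - again with made-up device names
and placeholder stripe sizes, not the real config:

	# Four shelves of nine 50GB drives; shelf/bay in the drive name.
	# Device names are made up; only bay 0 is spelled out.
	drive s1d0 device /dev/da0e
	drive s2d0 device /dev/da9e
	drive s3d0 device /dev/da18e
	drive s4d0 device /dev/da27e

	# History: 750MB from all 36 drives.  The top two shelves form
	# one striped plex, the bottom two shelves the other; the two
	# plexes mirror each other.  The 512k stripe is a placeholder.
	volume hist
	  plex org striped 512k
	    sd length 750m drive s1d0
	    sd length 750m drive s2d0
	    # ...750m sd's on the other 16 top-half drives...
	  plex org striped 512k
	    sd length 750m drive s3d0
	    sd length 750m drive s4d0
	    # ...750m sd's on the other 16 bottom-half drives...

	# Spool n0: bay 0 of all four shelves striped together (~190GB);
	# n1-n8 repeat the pattern on bays 1 through 8.  The 256m stripe
	# is a placeholder, as above.
	volume n0
	  plex org striped 256m
	    sd length 0 drive s1d0
	    sd length 0 drive s2d0
	    sd length 0 drive s3d0
	    sd length 0 drive s4d0

Either way the effect is the same: each spool volume ends up striped
across every shelf (and controller), while the history sits on a small
mirrored slice of every spindle.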