From owner-freebsd-hackers  Tue Feb 15 18:26:02 2000
From: Joe Greco
Message-Id: <200002160219.UAA97518@aurora.sol.net>
Subject: Re: Filesystem size limit?
In-Reply-To: <20000216115914.H12517@freebie.lemis.com> from Greg Lehey at "Feb 16, 2000 11:59:14 am"
To: grog@lemis.com (Greg Lehey)
Date: Tue, 15 Feb 2000 20:19:12 -0600 (CST)
Cc: hackers@freebsd.org
Sender: owner-freebsd-hackers@FreeBSD.ORG

> On Tuesday, 15 February 2000 at  3:40:58 -0600, Joe Greco wrote:
> > So I wanted to vinum my new 1.9TB of disks together just for chuckles, and
> > it went OK up to the newfs..
> >
> > S play.p0.s0          State: up       PO:        0  B Size:       46 GB
> > S play.p0.s1          State: up       PO:       32 MB Size:       46 GB
> >
> > S play.p0.s37         State: up       PO:     1184 MB Size:       46 GB
>
> Well, it's a pity you weren't able to newfs it, but I'm glad to see
> that Vinum could do it.  I'm not sure that striping buys you anything
> here, though, and a 32 MB stripe is going to be worse than
> concatenation: you'll have *all* your superblocks on the same disk!

For a "play" filesystem, I didn't care, and for an un-newfs-able filesystem,
it's irrelevant anyway.  For production servers, I take the cylinder group
size in sectors and use that as the stripe size, hoping (of course) that
related metadata and files will end up on the same drive.  This is the
traditional optimization I've preached here for years.  With vinum it is
pretty easy, although I usually go through half a dozen "resetconfig"s
before I reach something I'm completely happy with.

> > Just thought I'd mention it.  I'm putting the machine into
> > production, with the smaller filesystems that I originally intended,
> > but it seemed noteworthy to pass this along.
>
> JOOI, how big are the file systems?  Why did you choose this size?

It all has to do with a unified design strategy.  In a news system, you
cannot afford to lose the history.  I've a hundred million articles on
spool, and to reconstruct the history I'd have to read them all.  Even
assuming I can do a hundred articles per second (possibly a bit more),
that means I'd need 11.5 days to reload the history from the spool.  I'd
rather not.  The history is also the most active filesystem: lots of seek
activity and lots of small reads and writes.  The actual spools do not
need nearly as much speed.

So, since I'm using 9-bay Kingston rack-mount drive arrays, what I did for
the smaller text spool servers was to set up two shelves of 18GB drives
(18 x 18GB ~= 324GB).  The history does not need to be large: maybe 15GB
total for the partition.  So I grab 1.5GB from each drive, make a plex out
of the top 9 drives and another out of the bottom 9 drives, and mirror
them.  Redundancy.  Hard to lose history.  For the data, which I'm less
concerned about losing due to higher-level redundancy in the network, I
simply stripe both drive 0's together for my "n0" partition, drive 1's for
"n1", ... drive 8's for "n8".
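
In vinum terms, that two-shelf layout comes out looking roughly like the
sketch below.  It's an illustration, not the actual config: the device
names and drive labels are made up, the stripe sizes are placeholders (the
spool stripe really gets set to the cg size in sectors, per the above),
and only a couple of bays per shelf are spelled out.

	# Two shelves of nine 18GB drives.  Device names are made up;
	# only bays 0 and 1 are shown, t2-t8 and b2-b8 look the same.
	drive t0 device /dev/da0e
	drive t1 device /dev/da1e
	drive b0 device /dev/da9e
	drive b1 device /dev/da10e

	# History: 1.5GB from every drive, one plex per shelf, mirrored.
	# Concat shown for brevity; striping within each plex works too.
	volume hist
	  plex org concat
	    sd length 1536m drive t0
	    sd length 1536m drive t1
	    # ...and so on for t2 through t8...
	  plex org concat
	    sd length 1536m drive b0
	    sd length 1536m drive b1
	    # ...and so on for b2 through b8...

	# Spool n0: bay 0 of each shelf striped together; n1-n8 repeat
	# the pattern on bays 1 through 8.  The 256m stripe is only a
	# placeholder - the real figure is the cg size in sectors.
	# "length 0" means "use the rest of the drive".
	volume n0
	  plex org striped 256m
	    sd length 0 drive t0
	    sd length 0 drive b0

Feed something along those lines to "vinum create", newfs the resulting
/dev/vinum/<volume> devices, and botched attempts go away with "vinum
resetconfig".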
This gives me 9 spool fs's and a history fs, each optimized for its task,
while keeping the number of drives to a minimum - since space can be very
expensive!

However, working with arbitrarily large numbers of spool filesystems is a
pain, so I don't know that I'd have a compelling reason to set up a server
with 18 spool fs's.  Yet when I built my binaries spool with 4 shelves,
that would have been the model.  Instead, I chose to take 750MB from each
of the top eighteen 50GB drives and stripe them into one half of the
history mirror, with 750MB from each of the bottom eighteen making up the
other half.  I then striped all _four_ drive 0's, 1's, etc. for my spools,
yielding nine 190GB spools (a rough sketch of this layout is appended at
the end of the message).  Lo and behold, it looks very similar at the
application level.

This all works out very nicely because accesses within a single spool
filesystem will tend to be striped not only between drives but also
between _controllers_, at least if the access is big enough to involve
more than a single stripe.  But, more importantly, it's easy to extend
the model, and some sort of logical consistency is important in this
business, where someone else may take over next year.

> > Dunno how many terabyte filesystem folks are out there.
>
> None, by the looks of it.  :-(

... Joe

-------------------------------------------------------------------------------
Joe Greco - Systems Administrator                             jgreco@ns.sol.net
Solaria Public Access UNIX - Milwaukee, WI                         414/342-4847
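
For completeness, here is a similarly hand-waved sketch of the four-shelf
binaries-spool layout described above - again with made-up device names
and placeholder stripe sizes, not the real config:

	# Four shelves of nine 50GB drives; shelf/bay in the drive name.
	# Device names are made up; only bay 0 is spelled out.
	drive s1d0 device /dev/da0e
	drive s2d0 device /dev/da9e
	drive s3d0 device /dev/da18e
	drive s4d0 device /dev/da27e

	# History: 750MB from all 36 drives.  The top two shelves form
	# one striped plex, the bottom two shelves the other; the two
	# plexes mirror each other.  The 512k stripe is a placeholder.
	volume hist
	  plex org striped 512k
	    sd length 750m drive s1d0
	    sd length 750m drive s2d0
	    # ...750m sd's on the other 16 top-half drives...
	  plex org striped 512k
	    sd length 750m drive s3d0
	    sd length 750m drive s4d0
	    # ...750m sd's on the other 16 bottom-half drives...

	# Spool n0: bay 0 of all four shelves striped together (~190GB);
	# n1-n8 repeat the pattern on bays 1 through 8.  The 256m stripe
	# is a placeholder, as above.
	volume n0
	  plex org striped 256m
	    sd length 0 drive s1d0
	    sd length 0 drive s2d0
	    sd length 0 drive s3d0
	    sd length 0 drive s4d0

Either way the effect is the same: each spool volume ends up striped
across every shelf (and controller), while the history sits on a small
mirrored slice of every spindle.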