From: Jeremy Chadwick
To: George Sanders
Cc: freebsd-fs@freebsd.org
Date: Tue, 28 Jun 2011 16:47:23 -0700
Subject: Re: Improving old-fashioned UFS2 performance with lots of inodes...

On Tue, Jun 28, 2011 at 04:14:00PM -0700, George Sanders wrote:
> > > With over 100 million inodes on the filesystem, things go slow. Overall
> > > throughput is fine, and I have no complaints there, but doing any kind of
> > > operation with the files is quite slow. Building a file list with rsync,
> > > or doing a cp, or an ln -s of a big dir tree, etc.
> > >
> > > Let's assume that the architecture is not changing ... it's going to be
> > > FreeBSD 8.x, using UFS2, and raid6 on actual spinning (7200rpm) disks.
> > >
> > > What can I do to speed things up ?
> > >
> > > Right now I have these in my loader.conf:
> > >
> > > kern.maxdsiz="4096000000"       # for fsck
> > > vm.kmem_size="1610612736"       # for big rsyncs
> > > vm.kmem_size_max="1610612736"   # for big rsyncs
> >
> > On what exact OS version? Please don't say "8.2"; I need to know
> > 8.2-RELEASE, -STABLE, or what. You said "8.x" above, which is too
> > vague. If 8.2-STABLE, you should not be tuning vm.kmem_size_max at all,
> > and you probably don't need to tune vm.kmem_size either.
>
> Ok, right now we are on 6.4-RELEASE, but it is our intention to move to
> 8.2-RELEASE.

Oh dear. I would recommend you focus solely on the complexity and pains
of that upgrade, and not on the "filesystem situation" here. The last
thing you need is to try to "work in" optimisations or tweaks while
moving ahead by two major version releases. Take baby steps in this
situation; otherwise there's going to be a mail along the lines of
"problems with the upgrade, but is it related to this tuning stuff we
did, or the filesystem problem, or what happened and who changed what?"
and you'll quickly lose track of everything. Re-visit the issue with
UFS2 *after* you have done the upgrade.
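For what it's worth, when you do get to 8.2, the carried-forward
/boot/loader.conf could end up as small as the sketch below. This is
only a sketch: the value is the one you quoted above, and whether you
still need even that is something to verify on the new system rather
than assume.

    # allow very large fsck/rsync process sizes (value quoted from your post)
    kern.maxdsiz="4096000000"
    # per the above: on 8.2-STABLE, vm.kmem_size_max should not be set,
    # and vm.kmem_size probably isn't needed either, so leave them unset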
> If the kmem loader.conf options are no longer relevant in 8.2-STABLE, should
> I assume that will also be the case when 8.3-RELEASE comes along ?

Correct.

> > I also do not understand how vm.kmem_size would affect rsync, since
> > rsync is a userland application. I imagine you'd want to adjust
> > kern.maxdsiz and kern.dfldsiz (default dsiz).
>
> Well, a huge rsync with 20+ million files dies with memory-related errors,
> and continued to do so until we upped the kmem values that high. We don't
> know why, but we know it "fixed it".

Again: I don't understand how adjusting vm.kmem_size or vm.kmem_size_max
would fix anything with regard to this. However, I could see adjusting
kern.maxdsiz affecting it. That would indicate your rsync process grows
extremely large and exceeds maxdsiz, resulting in a segfault or some
other anomalous sig-N error.

> > > and I also set:
> > >
> > > vfs.ufs.dirhash_maxmem=64000000
> >
> > This tunable uses memory for a single directory that has a huge number
> > of files in it; AFAIK it does not apply to "large directory structures"
> > (as in directories within directories within directories). It's obvious
> > you're just tinkering with random sysctls hoping to gain performance
> > without really understanding what the sysctls do. :-) To see if you
> > even need to increase that, try "sysctl -a | grep vfs.ufs.dirhash" and
> > look at dirhash_mem vs. dirhash_maxmem, as well as dirhash_lowmemcount.
>
> No, we actually ALSO have huge directories, and we do indeed need this value.
>
> This is the one setting that we actually understand and have empirically
> measured.

Understood.

> > The only thing I can think of on short notice is to have multiple
> > filesystems (volumes) instead of one large 12TB one. This is pretty
> > common in the commercial filer world.
>
> Ok, that is interesting - are you saying create multiple, smaller UFS
> filesystems on the single large 12TB raid6 array ?

Correct. Instead of one large 12TB filesystem, try four 3TB filesystems,
or six 2TB ones.

> Or are you saying create a handful of smaller arrays ? We have to burn two
> disks for every raid6 array we make, as I am sure you know, so we really
> can't split it up into multiple arrays.

Nah, not multiple arrays, just multiple filesystems on a single array.

> We could, however, split the single raid6 array into multiple, formatted
> UFS2 filesystems, but I don't understand how that would help with our
> performance ?
>
> Certainly fsck time would be much shorter, and we could bring up each
> filesystem after it fsck'd, and then move on to the next one ... but in
> terms of live performance, how does splitting the array into multiple
> filesystems help ? The nature of a raid array (as I understand it) would
> have us beating all 12 disks regardless of which UFS filesystems were being
> used.
>
> Can you elaborate ?

Please read everything I've written below before responding (i.e. please
do not respond in-line to this information).

Actually, I think elaboration is needed on your part. :-) I say that
with as much sincerity as possible. All you've stated in this thread so
far is:

1. "With over 100 million inodes on the filesystem, things go slow"
2. "Building a list of files with rsync/using cp/ln -s in a very large
   directory tree" is slow (does this mean a directory with a large
   number of files in it?)
3. Some sort of concern over the speed of fsck
4. You want to use more system memory/RAM for filesystem-level caching:
   http://lists.freebsd.org/pipermail/freebsd-fs/2011-June/011867.html

There's really nothing concrete provided here. Developers are going to
need hard data, and I imagine you're going to get a lot of push-back
given how you're using the filesystem. "Hard data" means you need to
start showing some actual output from your filesystems, explain your
directory structures, and so on.
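As a sketch of the kind of output that would count as "hard data" (the
mount point and device name below are placeholders, not your actual
ones, since I obviously don't know them):

    uname -a                           # exact OS version / patch level
    df -hi /your/big/filesystem        # size and inode usage per filesystem
    mount -v | grep /your/big          # mount options in effect (soft updates, noatime, ...)
    dumpfs /dev/da0s1d | head          # superblock parameters (block/frag size, etc.)
    sysctl -a | grep vfs.ufs.dirhash   # dirhash_mem vs. dirhash_maxmem, lowmem counts

Plus a rough description of the directory layout: how many directories
there are, and roughly how many files sit in the largest ones.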
"is slow" - Some sort of concern over the speed of fsck - You want to use more system memory/RAM for filesystem-level caching http://lists.freebsd.org/pipermail/freebsd-fs/2011-June/011867.html There's really nothing concrete provided here. Developers are going to need hard data, and I imagine you're going to get a lot of push-back given how you're using the filesystem. "Hard data" means you need to actually start showing some actual output of your filesystems, explain your directory structures, etc... Generally speaking, the below are No-Nos on most UNIX filesystems. At least these are things that I was taught very early on (early 90s), and I imagine others were as well: - Stick tons of files in a single directory - Cram hundreds of millions of files on a single filesystem I would recommend looking into tunefs(8) as well; the -e, -f, and -s arguments will probably interest you. Splitting things up into multiple filesystems would help with both the 1st and 3rd items on the 4-item list. Solving the 2nd item is as simple as: "then don't do that" (are you in biometrics per chance? Biometrics people have a tendency to abuse filesystems horribly :-) ), and the 4th item I can't really comment on (WRT UFS). Items 1, 3, and 4 are things that use of ZFS would help with. I'm not sure about the 2nd item. If I was in your situation, I would strongly recommend considering moving to it *after* you finish your OS upgrades. Furthermore, if you're going to consider using ZFS on FreeBSD, *please* use RELENG_8 (8.2-STABLE) and not RELENG_8_2 (8.2-RELEASE). There have been *major* improvements between those two tags. You can wait for 8.3-RELEASE if you want (which will obviously encapsulate those changes), but it's your choice. > > Regarding system RAM and UFS2: I have no idea, Kirk might have to > > comment on that. > > > > You could "make use" of system RAM for cache (ZFS ARC) if you were using > > ZFS instead of native UFS2. However, if the system has 64GB of RAM, you > > need to ask yourself why the system has that amount of RAM in the first > > place. For example, if the machine runs mysqld and is tuned to use a > > large amount of memory, you really don't ""have"" 64GB of RAM to play > > with, and thus wouldn't want mysqld and some filesystem caching model > > fighting over memory (e.g. paging/swapping). > > Actually, the system RAM is there for the purpose of someday using ZFS - and > for no other reason. However, it is realistically a few years away on our > timeline, > unfortunately, so for now we will use UFS2, and as I said ... it seems a shame > that UFS2 cannot use system RAM for any good purpose... > > Or can it ? Anyone ? Like I said: the only person (I know of) who could answer this would be Kirk McKusick. I'm not well-versed in the inner workings and design of filesystems; Kirk would be. I'm not sure who else "knows" UFS around here. I think you need to figure out which of your concerns have priority. Upgrading to ZFS (8.2-STABLE or later please) may solve all of your performance issues; I wish I could say "it will" but I can't. If upgrading to that isn't a priority (re: "a few years from now"), then you may have to live with your current situation, albeit painfully. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |