From owner-freebsd-stable@FreeBSD.ORG Mon Feb 15 12:27:45 2010
Date: Mon, 15 Feb 2010 04:27:44 -0800
From: Jeremy Chadwick
To: freebsd-stable@freebsd.org
Message-ID: <20100215122744.GA57382@icarus.home.lan>
References: <20100215090756.GA54764@icarus.home.lan>
 <20100215105000.101326yj01j0f64g@webmail.leidinger.net>
In-Reply-To: <20100215105000.101326yj01j0f64g@webmail.leidinger.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Subject: Re: hardware for home use large storage

On Mon, Feb 15, 2010 at 10:50:00AM +0100, Alexander Leidinger wrote:
> Quoting Jeremy Chadwick (from Mon, 15 Feb 2010 01:07:56 -0800):
>
> >On Mon, Feb 15, 2010 at 10:49:47AM +0200, Dan Naumov wrote:
> >>> I had a feeling someone would bring up L2ARC/cache devices.  This
> >>> gives me the opportunity to ask something that's been on my mind
> >>> for quite some time now:
> >>>
> >>> Aside from the capacity difference (e.g. 40GB vs. 1GB), is there a
> >>> benefit to adding a dedicated RAM disk (e.g. md(4)) to a pool as
> >>> L2ARC/cache?  The ZFS documentation explicitly states that cache
> >>> device content is considered volatile.
> >>
> >>Using a ramdisk as an L2ARC vdev doesn't make any sense at all.  If
> >>you have RAM to spare, it should be used by the regular ARC.
> >
> >...except that it's already been proven on FreeBSD that the ARC
> >getting out of control can cause kernel panics[1], horrible
> >performance until

First and foremost, sorry for the long post.  I tried to keep it short,
but sometimes there's just a lot to be said.

> There are other ways (not related to ZFS) to shoot yourself in the
> foot too.  I'm tempted to say that this is
>   a) a documentation bug
> and
>   b) a lack of sanity checking of the values... anyone out there with
>      a good algorithm for something like this?
>
> Normally you do some testing with the values you use, so once you've
> resolved the issues, the system should be stable.

What documentation?  :-)  The Wiki?  If so, that's been outdated for
some time; I know Ivan Voras was doing his best to put good information
there, but it's hard given the chaos described below.

The following tunables are recurrently mentioned as focal points, but
no one has explained in full how to tune them "properly", or which one
does what (a perfect example: vm.kmem_size_max vs. vm.kmem_size --
_max used to be what you'd adjust to solve kmem exhaustion issues, but
now people are saying otherwise?).  I realise the values may differ per
system (given how much RAM the system has), so different system
configurations/examples would need to be provided.  I also realise that
the behaviour of some of these has changed (e.g. -RELEASE differs from
-STABLE, and 7.x differs from 8.x).  I've marked commonly-referred-to
tunables with an asterisk:

  kern.maxvnodes
* vm.kmem_size
* vm.kmem_size_max
* vfs.zfs.arc_min
* vfs.zfs.arc_max
  vfs.zfs.prefetch_disable  (auto-tuned based on available RAM on 8-STABLE)
  vfs.zfs.txg.timeout
  vfs.zfs.vdev.cache.size
  vfs.zfs.vdev.cache.bshift
  vfs.zfs.vdev.max_pending
  vfs.zfs.zil_disable
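
To make the above concrete, here is the sort of /boot/loader.conf
stanza people end up writing.  The values below are invented for a
hypothetical 4GB amd64 box -- they are illustrative only, not
recommendations, and picking them is exactly the guesswork I'm
complaining about:

  # Illustrative ZFS-related loader tunables -- example values only,
  # not recommendations; adjust (or omit) per system.
  vm.kmem_size="1536M"
  vm.kmem_size_max="1536M"
  vfs.zfs.arc_min="64M"
  vfs.zfs.arc_max="512M"
  # prefetch is auto-tuned on 8-STABLE; older setups often disable it
  vfs.zfs.prefetch_disable="1"

After a reboot, sysctl(8) will show what the kernel actually ended up
with, e.g.:

  sysctl vm.kmem_size vfs.zfs.arc_max kstat.zfs.misc.arcstats.c_max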
Then, when it comes to debugging problems caused by improper tuning (or
the complete lack of it), the following counters (not tunables) are
thrown into the mix as "things people should look at":

  kstat.zfs.misc.arcstats.c
  kstat.zfs.misc.arcstats.c_min
  kstat.zfs.misc.arcstats.c_max
  kstat.zfs.misc.arcstats.evict_skip
  kstat.zfs.misc.arcstats.memory_throttle_count
  kstat.zfs.misc.arcstats.size

None of these have sysctl descriptions (sysctl -d) either.

I can provide posts to freebsd-stable, freebsd-current, freebsd-fs,
freebsd-questions, or freebsd-users referencing these variables or
counters if you need context.

All that said: I would be more than happy to write some coherent
documentation that folks could refer to "officially", but rather than
spend my entire lifetime reverse-engineering the ZFS code, I think it
would make more sense to get some official parties involved to explain
things.  I'd like to add some kind of monitoring section as well --
how administrators could keep an eye on things and detect, semi-early,
whether additional tuning is required, or something along those lines.

> >ZFS has had its active/inactive lists flushed[2], and brings into
>
> Someone needs to sit down and play a little bit with ways to tell the
> ARC that there is free memory.  The mail you reference already
> suggests that the inactive/cached lists should maybe be taken into
> account too (I haven't had a look at that part of the ZFS code).
>
> >question how proper tuning is to be established and what the effects
> >are on the rest of the system[3].  There are still reports of people
>
> That's what I'm talking about regarding b) above.  If you specify an
> arc_max which is too big (arc_max > kmem_size - SOME_SAFE_VALUE),
> there should be a message from the kernel and the value should be
> adjusted to a safe amount.
>
> Until the problems are fixed, an MD for L2ARC may be a viable
> alternative (if you have enough memory to spare for it).  Feel free
> to provide benchmark numbers, but in general I see this just as a
> workaround for the current issues.

I've played with this a bit (a 2-disk mirror plus one 256MB md), but
I'm not completely sure how to read the bonnie++ results, nor am I sure
I'm using the right arguments (bonnie++ -s8192 -n64 -d/pool on a
machine that has 4GB).  L2ARC (a "cache" vdev) is supposed to improve
random reads, while a "log" vdev (presumably something that ties in
with the ZIL) improves random writes.  I'm not sure where bonnie++
tests random reads, but I do see it testing random seeks.
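
For anyone who wants to repeat the experiment, the setup itself is only
a couple of commands.  Note that "tank" below is a placeholder pool
name, and the md is swap-backed, so its contents disappear on reboot --
acceptable for a cache vdev, since ZFS treats L2ARC contents as
volatile anyway:

  # Create a 256MB swap-backed memory disk; mdconfig prints the device
  # name it allocated (md0 on an otherwise idle box).
  mdconfig -a -t swap -s 256m

  # Attach it to the pool as an L2ARC ("cache") device.
  zpool add tank cache md0

  # Watch per-vdev activity, including the cache device, while the
  # benchmark (e.g. the bonnie++ invocation above) runs.
  zpool iostat -v tank 5

  # Detach and destroy the memory disk when done.
  zpool remove tank md0
  mdconfig -d -u 0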
> >disabling ZIL "for stability reasons" as well.
>
> For the ZIL you definitely do not want to have an MD.  If you do not
> specify a log vdev for the pool, the ZIL will be written somewhere on
> the disks of the pool.  When the data hits the ZIL, it has to really
> be on non-volatile storage.  If you lose the ZIL, you lose data.

Thanks for the clarification here.  In my case, I never disable the
ZIL; I never have and I never will, given the above risk.  However,
there are lots of folks who advocate doing this because they have
systems which crash if they don't.  I've never understood how/why that
is (I've never seen the ZIL responsible for any crash I've witnessed,
either).

> >The "Internals" section of Brendan Gregg's blog[4] outlines where
> >the L2ARC sits in the scheme of things, or whether the ARC could
> >essentially be disabled by setting the minimum size to something
> >very small (a few megabytes) and instead using L2ARC, which is
> >manageable.
>
> At least in 7-stable, 8-stable and 9-current, arc_max now really
> corresponds to a maximum value, so it is more a matter of providing a
> safe arc_max than a minimal arc_max.

Ahh, that might explain a semi-old post where a user was stating that
arc_max didn't appear to really be a hard limit, but just some kind of
high-water mark.

> No matter how you construct the L2ARC, ARC access will be faster than
> L2ARC access.

Yes, based on Brendan's blog, I can see how that would be the case;
there'd be some added overhead given the design/placement of the L2ARC.

The options as I see them are (a) figure out some *reliable* way to
describe to folks how to tune their systems so they don't run into ARC
or memory exhaustion issues, or (b) utilise the L2ARC exclusively and
set the ARC (arc_max) to something fairly small.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |