From owner-freebsd-stable@FreeBSD.ORG Mon Feb 15 12:27:45 2010
Date: Mon, 15 Feb 2010 04:27:44 -0800
From: Jeremy Chadwick
To: freebsd-stable@freebsd.org
Message-ID: <20100215122744.GA57382@icarus.home.lan>
References: <20100215090756.GA54764@icarus.home.lan>
 <20100215105000.101326yj01j0f64g@webmail.leidinger.net>
In-Reply-To: <20100215105000.101326yj01j0f64g@webmail.leidinger.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Subject: Re: hardware for home use large storage

On Mon, Feb 15, 2010 at 10:50:00AM +0100, Alexander Leidinger wrote:
> Quoting Jeremy Chadwick (from Mon, 15 Feb 2010 01:07:56 -0800):
>
> >On Mon, Feb 15, 2010 at 10:49:47AM +0200, Dan Naumov wrote:
> >>> I had a feeling someone would bring up L2ARC/cache devices.  This
> >>> gives me the opportunity to ask something that's been on my mind
> >>> for quite some time now:
> >>>
> >>> Aside from the capacity difference (e.g. 40GB vs. 1GB), is there a
> >>> benefit to adding a dedicated RAM disk (e.g. md(4)) to a pool as
> >>> L2ARC/cache?  The ZFS documentation explicitly states that cache
> >>> device content is considered volatile.
> >>
> >>Using a ramdisk as an L2ARC vdev doesn't make any sense at all.  If
> >>you have RAM to spare, it should be used by the regular ARC.
> >
> >...except that it's already been proven on FreeBSD that the ARC
> >getting out of control can cause kernel panics[1], horrible
> >performance until

First and foremost, sorry for the long post.  I tried to keep it short,
but sometimes there's just a lot to be said.

> There are other ways (not related to ZFS) to shoot yourself in the
> foot too.  I'm tempted to say that this is
>   a) a documentation bug
> and
>   b) a lack of sanity checking of the values... anyone out there with
>      a good algorithm for something like this?
>
> Normally you do some testing with the values you use, so once you've
> resolved the issues, the system should be stable.

What documentation?  :-)  The Wiki?  If so, that's been outdated for
some time; I know Ivan Voras was doing his best to put good information
there, but it's hard given the chaos described below.

The following tunables are recurrently mentioned as focal points, but
no one has explained in full how to tune them "properly", or which one
does what (a perfect example: vm.kmem_size_max vs. vm.kmem_size --
_max used to be what you'd adjust to solve kmem exhaustion issues, but
now people are saying otherwise?).  I realise the values may differ per
system (given how much RAM the system has), so different system
configurations/examples would need to be provided.  I also realise that
the behaviour of some of these has changed (e.g. -RELEASE differs from
-STABLE, and 7.x differs from 8.x).  I've marked commonly-referred-to
tunables with an asterisk:

  kern.maxvnodes
* vm.kmem_size
* vm.kmem_size_max
* vfs.zfs.arc_min
* vfs.zfs.arc_max
  vfs.zfs.prefetch_disable  (auto-tuned based on available RAM on 8-STABLE)
  vfs.zfs.txg.timeout
  vfs.zfs.vdev.cache.size
  vfs.zfs.vdev.cache.bshift
  vfs.zfs.vdev.max_pending
  vfs.zfs.zil_disable
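
To make the above concrete, here is the sort of /boot/loader.conf
stanza people end up writing.  The values below are invented for a
hypothetical 4GB amd64 box -- they are illustrative only, not
recommendations, and picking them is exactly the guesswork I'm
complaining about:

  # Illustrative ZFS-related loader tunables -- example values only,
  # not recommendations; adjust (or omit) per system.
  vm.kmem_size="1536M"
  vm.kmem_size_max="1536M"
  vfs.zfs.arc_min="64M"
  vfs.zfs.arc_max="512M"
  # prefetch is auto-tuned on 8-STABLE; older setups often disable it
  vfs.zfs.prefetch_disable="1"

After a reboot, sysctl(8) will show what the kernel actually ended up
with, e.g.:

  sysctl vm.kmem_size vfs.zfs.arc_max kstat.zfs.misc.arcstats.c_max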
Then, when it comes to debugging problems caused by improper tuning (or
the complete lack of it), the following counters (not tunables) are
thrown into the mix as "things people should look at":

  kstat.zfs.misc.arcstats.c
  kstat.zfs.misc.arcstats.c_min
  kstat.zfs.misc.arcstats.c_max
  kstat.zfs.misc.arcstats.evict_skip
  kstat.zfs.misc.arcstats.memory_throttle_count
  kstat.zfs.misc.arcstats.size

None of these have sysctl descriptions (sysctl -d) either.

I can provide posts to freebsd-stable, freebsd-current, freebsd-fs,
freebsd-questions, or freebsd-users referencing these variables or
counters if you need context.

All that said: I would be more than happy to write some coherent
documentation that folks could refer to "officially", but rather than
spend my entire lifetime reverse-engineering the ZFS code, I think it
would make more sense to get some official parties involved to explain
things.  I'd like to add some kind of monitoring section as well --
how administrators could keep an eye on things and detect, semi-early,
whether additional tuning is required, or something along those lines.

> >ZFS has had its active/inactive lists flushed[2], and brings into
>
> Someone needs to sit down and play a little bit with ways to tell the
> ARC that there is free memory.  The mail you reference already
> suggests that the inactive/cached lists should maybe be taken into
> account too (I haven't had a look at that part of the ZFS code).
>
> >question how proper tuning is to be established and what the effects
> >are on the rest of the system[3].  There are still reports of people
>
> That's what I'm talking about regarding b) above.  If you specify an
> arc_max which is too big (arc_max > kmem_size - SOME_SAFE_VALUE),
> there should be a message from the kernel and the value should be
> adjusted to a safe amount.
>
> Until the problems are fixed, an MD for L2ARC may be a viable
> alternative (if you have enough memory to spare for it).  Feel free
> to provide benchmark numbers, but in general I see this just as a
> workaround for the current issues.

I've played with this a bit (a 2-disk mirror plus one 256MB md), but
I'm not completely sure how to read the bonnie++ results, nor am I sure
I'm using the right arguments (bonnie++ -s8192 -n64 -d/pool on a
machine that has 4GB).  L2ARC (a "cache" vdev) is supposed to improve
random reads, while a "log" vdev (presumably something that ties in
with the ZIL) improves random writes.  I'm not sure where bonnie++
tests random reads, but I do see it testing random seeks.
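
For anyone who wants to repeat the experiment, the setup itself is only
a couple of commands.  Note that "tank" below is a placeholder pool
name, and the md is swap-backed, so its contents disappear on reboot --
acceptable for a cache vdev, since ZFS treats L2ARC contents as
volatile anyway:

  # Create a 256MB swap-backed memory disk; mdconfig prints the device
  # name it allocated (md0 on an otherwise idle box).
  mdconfig -a -t swap -s 256m

  # Attach it to the pool as an L2ARC ("cache") device.
  zpool add tank cache md0

  # Watch per-vdev activity, including the cache device, while the
  # benchmark (e.g. the bonnie++ invocation above) runs.
  zpool iostat -v tank 5

  # Detach and destroy the memory disk when done.
  zpool remove tank md0
  mdconfig -d -u 0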
> >disabling ZIL "for stability reasons" as well.
>
> For the ZIL you definitely do not want to have an MD.  If you do not
> specify a log vdev for the pool, the ZIL will be written somewhere on
> the disks of the pool.  When the data hits the ZIL, it has to really
> be on non-volatile storage.  If you lose the ZIL, you lose data.

Thanks for the clarification here.  In my case, I never disable the
ZIL; I never have and I never will, given the above risk.  However,
there are lots of folks who advocate doing this because they have
systems which crash if they don't.  I've never understood how/why that
is (I've never seen the ZIL responsible for any crash I've witnessed,
either).

> >The "Internals" section of Brendan Gregg's blog[4] outlines where
> >the L2ARC sits in the scheme of things, or whether the ARC could
> >essentially be disabled by setting the minimum size to something
> >very small (a few megabytes) and instead using L2ARC, which is
> >manageable.
>
> At least in 7-stable, 8-stable and 9-current, arc_max now really
> corresponds to a maximum value, so it is more a matter of providing a
> safe arc_max than a minimal arc_max.

Ahh, that might explain a semi-old post where a user was stating that
arc_max didn't appear to really be a hard limit, but just some kind of
high-water mark.

> No matter how you construct the L2ARC, ARC access will be faster than
> L2ARC access.

Yes, based on Brendan's blog, I can see how that would be the case;
there'd be some added overhead given the design/placement of the L2ARC.

The options as I see them are (a) figure out some *reliable* way to
describe to folks how to tune their systems so they don't run into ARC
or memory exhaustion issues, or (b) utilise the L2ARC exclusively and
set the ARC (arc_max) to something fairly small.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |