From: Karl Denninger <karl@denninger.net>
Date: Thu, 07 Mar 2013 08:31:29 -0600
To: freebsd-stable@freebsd.org
Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?

On 3/7/2013 1:21 AM, Peter Jeremy wrote:
> On 2013-Mar-04 16:48:18 -0600, Karl Denninger wrote:
>> The subject machine in question has 12GB of RAM and dual Xeon
>> 5500-series processors. It also has an ARECA 1680ix in it with 2GB
>> of local cache and the BBU for it. The ZFS spindles are all exported
>> as JBOD drives. I set up four disks under GPT with a single
>> freebsd-zfs partition added to each; they are labeled, and the
>> providers are then geli-encrypted and added to the pool.
> What sort of disks? SAS or SATA?

SATA. They're clean; they report no errors, no retries, no corrected
data (ECC), etc. They have also been running for a couple of years
under UFS+SU without problems. This isn't new hardware; it's an
in-service system.

>> also known good. I began to get EXTENDED stalls with zero I/O going
>> on, some lasting for 30 seconds or so. The system was not frozen,
>> but anything that touched I/O would lock until it cleared. Dedup is
>> off, incidentally.
> When the system has stalled:
> - Do you see very low free memory?

Yes. Effectively zero.

> - What happens to all the different CPU utilisation figures? Do they
>   all go to zero? Do you get high system or interrupt CPU (including
>   going to 1 core's worth)?

No, they start to fall. This is a bad piece of data to trust, though,
because I am geli-encrypting the spindles, so falling CPU doesn't mean
the CPU is actually idle (since with no I/O there is nothing going
through geli).
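For anyone who wants to watch the same thing, this is roughly what I'm
sampling while a stall is in progress (all base-system tools; the
arcstats sysctl names are the ones on this box and may differ between
ZFS versions, so treat it as a sketch rather than a recipe):

  # free pages and ARC size, once a second
  while :; do
    sysctl -n vm.stats.vm.v_free_count kstat.zfs.misc.arcstats.size
    sleep 1
  done

  # per-provider GEOM I/O -- shows the geli layer separately from the disks
  gstat -a

  # per-thread CPU, so the geli worker threads are visible
  top -SH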
I'm working on instrumenting things sufficiently to try to peel that
off -- I suspect the kernel is spinning on something, but the trick is
finding out what it is.

> - What happens to interrupt load? Do you see any disk controller
>   interrupts?

None.

> Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
> and see if you get any errors when stalls happen?

If I have to. That's easy to do on the test box -- on the production
one, not so much.

> On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick wrote:
>> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>>> Completely unrelated to the main thread:
>>>
>>> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>>>> That said, I still do not recommend ZFS for a root filesystem
>>> Why?
>> Too long a history of problems with it and weird edge cases (keep
>> reading); the last thing an administrator wants to deal with is a
>> system where the root filesystem won't mount/can't be used. It makes
>> recovery or problem-solving (i.e. when the server is not physically
>> accessible given geographic distances) very difficult.
> I've had lots of problems with a gmirrored UFS root as well. The
> biggest issue is that gmirror has no audit functionality, so you
> can't verify that both sides of a mirror really do have the same
> data.

I have root on a 2-drive RAID mirror (done in the controller) and that
has been fine. The controller does scrubs on a regular basis
internally. The problem is that if it gets a clean read that differs
between the two drives (e.g. no ECC indications, etc.) it doesn't know
which is the correct copy. The good news is that hasn't happened
yet. :-)

The risk of this happening as my data store continues to expand is one
of the reasons I want to move toward ZFS, but not necessarily for the
boot drives. For the data store, however....

>> My point/opinion: UFS for a root filesystem is guaranteed to work
>> without any fiddling about and, barring drive failures or controller
>> issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.
> AFAIK, you can't boot from anything other than a single disk (ie no
> graid).

Where I am right now is this:

1. I *CANNOT* reproduce the stalls on the test machine with Postgres
stopped, no matter what I throw at it. Even with multiple ZFS
send/recv copies going on and the load average north of 20 (due to all
the geli threads), the system doesn't stall or produce any notable
pauses in throughput. Nor does the system RAM allocation get driven
hard enough to force paging. This is with NO tuning hacks in
/boot/loader.conf. I/O performance is both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the
production machine), allocating ~1.5G of shared, wired memory, and
running the same synthetic workload as in (1) above, I am getting
SMALL versions of the misbehavior. While system RAM allocation gets
driven pretty hard, reaching down toward 100MB free in some instances,
it doesn't get driven hard enough to allocate swap. The "burstiness"
is very evident in the iostat figures, with spates dropping into the
single-digit MB/sec range from time to time, but it's not enough to
drive the system to a full-on stall.

There's pretty clearly a bad interaction here between Postgres wiring
memory and the ARC when the latter is left alone and allowed to do
what it wants. I'm continuing to work on replicating this on the test
machine... just not completely there yet.
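(For reference: the usual band-aid people suggest for this is capping
the ARC in /boot/loader.conf so it can't contend with Postgres' wired
pages. Illustratively -- the 4G figure below is a made-up number for
this 12GB box, not a recommendation:

  # /boot/loader.conf -- illustrative only; size the cap to leave room
  # for Postgres' shared_buffers plus normal system demand
  vfs.zfs.arc_max="4G"

I've deliberately left that out of these tests, since the point is to
characterize what the defaults do first.)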
--
-- Karl Denninger
/The Market Ticker ®/
Cuda Systems LLC