From: Karl Denninger <karl@denninger.net>
Date: Thu, 07 Mar 2013 08:31:29 -0600
To: freebsd-stable@freebsd.org
Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?

On 3/7/2013 1:21 AM, Peter Jeremy wrote:
> On 2013-Mar-04 16:48:18 -0600, Karl Denninger wrote:
>> The subject machine in question has 12GB of RAM and dual Xeon
>> 5500-series processors. It also has an ARECA 1680ix in it with 2GB
>> of local cache and the BBU for it. The ZFS spindles are all exported
>> as JBOD drives. I set up four disks under GPT with a single
>> freebsd-zfs partition added to each; they are labeled, and the
>> providers are then geli-encrypted and added to the pool.
> What sort of disks? SAS or SATA?

SATA. They're clean; they report no errors, no retries, no corrected
data (ECC), etc. They have also been running for a couple of years
under UFS+SU without problems. This isn't new hardware; it's an
in-service system.

>> also known good. I began to get EXTENDED stalls with zero I/O going
>> on, some lasting for 30 seconds or so. The system was not frozen,
>> but anything that touched I/O would lock until it cleared. Dedup is
>> off, incidentally.
> When the system has stalled:
> - Do you see very low free memory?

Yes. Effectively zero.

> - What happens to all the different CPU utilisation figures? Do they
>   all go to zero? Do you get high system or interrupt CPU (including
>   going to 1 core's worth)?

No, they start to fall. This is a bad piece of data to trust, though,
because I am geli-encrypting the spindles, so falling CPU doesn't mean
the CPU is actually idle (since with no I/O there is nothing going
through geli).
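For anyone who wants to watch the same thing, this is roughly what I'm
sampling while a stall is in progress (all base-system tools; the
arcstats sysctl names are the ones on this box and may differ between
ZFS versions, so treat it as a sketch rather than a recipe):

  # free pages and ARC size, once a second
  while :; do
    sysctl -n vm.stats.vm.v_free_count kstat.zfs.misc.arcstats.size
    sleep 1
  done

  # per-provider GEOM I/O -- shows the geli layer separately from the disks
  gstat -a

  # per-thread CPU, so the geli worker threads are visible
  top -SH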
I'm working on instrumenting things sufficiently to try to peel that
off -- I suspect the kernel is spinning on something, but the trick is
finding out what it is.

> - What happens to interrupt load? Do you see any disk controller
>   interrupts?

None.

> Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
> and see if you get any errors when stalls happen?

If I have to. That's easy to do on the test box -- on the production
one, not so much.

> On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick wrote:
>> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>>> Completely unrelated to the main thread:
>>>
>>> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>>>> That said, I still do not recommend ZFS for a root filesystem
>>> Why?
>> Too long a history of problems with it and weird edge cases (keep
>> reading); the last thing an administrator wants to deal with is a
>> system where the root filesystem won't mount/can't be used. It makes
>> recovery or problem-solving (i.e. when the server is not physically
>> accessible given geographic distances) very difficult.
> I've had lots of problems with a gmirrored UFS root as well. The
> biggest issue is that gmirror has no audit functionality, so you
> can't verify that both sides of a mirror really do have the same
> data.

I have root on a 2-drive RAID mirror (done in the controller) and that
has been fine. The controller does scrubs on a regular basis
internally. The problem is that if it gets a clean read that differs
between the two drives (e.g. no ECC indications, etc.) it doesn't know
which is the correct copy. The good news is that hasn't happened
yet. :-)

The risk of this happening as my data store continues to expand is one
of the reasons I want to move toward ZFS, but not necessarily for the
boot drives. For the data store, however....

>> My point/opinion: UFS for a root filesystem is guaranteed to work
>> without any fiddling about and, barring drive failures or controller
>> issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.
> AFAIK, you can't boot from anything other than a single disk (ie no
> graid).

Where I am right now is this:

1. I *CANNOT* reproduce the stalls on the test machine with Postgres
stopped, no matter what I throw at it. Even with multiple ZFS
send/recv copies going on and the load average north of 20 (due to all
the geli threads), the system doesn't stall or produce any notable
pauses in throughput. Nor does the system RAM allocation get driven
hard enough to force paging. This is with NO tuning hacks in
/boot/loader.conf. I/O performance is both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the
production machine), allocating ~1.5G of shared, wired memory, and
running the same synthetic workload as in (1) above, I am getting
SMALL versions of the misbehavior. While system RAM allocation gets
driven pretty hard, reaching down toward 100MB free in some instances,
it doesn't get driven hard enough to allocate swap. The "burstiness"
is very evident in the iostat figures, with spates dropping into the
single-digit MB/sec range from time to time, but it's not enough to
drive the system to a full-on stall.

There's pretty clearly a bad interaction here between Postgres wiring
memory and the ARC when the latter is left alone and allowed to do
what it wants. I'm continuing to work on replicating this on the test
machine... just not completely there yet.
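(For reference: the usual band-aid people suggest for this is capping
the ARC in /boot/loader.conf so it can't contend with Postgres' wired
pages. Illustratively -- the 4G figure below is a made-up number for
this 12GB box, not a recommendation:

  # /boot/loader.conf -- illustrative only; size the cap to leave room
  # for Postgres' shared_buffers plus normal system demand
  vfs.zfs.arc_max="4G"

I've deliberately left that out of these tests, since the point is to
characterize what the defaults do first.)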
--
-- Karl Denninger
/The Market Ticker ®/
Cuda Systems LLC