Date:      Thu, 07 Mar 2013 13:07:11 -0600
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: ZFS "stalls" -- and maybe we should be talking about defaults?
Message-ID:  <5138E55F.7080107@denninger.net>
In-Reply-To: <F99CDA75FB2C454680C1E8AA9008E9DA@multiplay.co.uk>
References:  <513524B2.6020600@denninger.net> <20130307072145.GA2923@server.rulingia.com> <5138A4C1.5090503@denninger.net> <F99CDA75FB2C454680C1E8AA9008E9DA@multiplay.co.uk>


On 3/7/2013 12:57 PM, Steven Hartland wrote:
>
> ----- Original Message ----- From: "Karl Denninger" <karl@denninger.net>
>> Where I am right now is this:
>>
>> 1. I *CANNOT* reproduce the spins on the test machine with Postgres
>> stopped in any way.  Even with multiple ZFS send/recv copies going on
>> and the load average north of 20 (due to all the geli threads), the
>> system doesn't stall or produce any notable pauses in throughput.  Nor
>> does the system RAM allocation get driven hard enough to force paging.
>> This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
>> both stable and solid.
>>
>> 2. WITH Postgres running as a connected hot spare (identical to the
>> production machine), allocating ~1.5G of shared, wired memory,  running
>> the same synthetic workload in (1) above I am getting SMALL versions of
>> the misbehavior.  However, while system RAM allocation gets driven
>> pretty hard and reaches down toward 100MB in some instances it doesn't
>> get driven hard enough to allocate swap.  The "burstiness" is very
>> evident in the iostat figures with spates getting into the single digit
>> MB/sec range from time to time but it's not enough to drive the system
>> to a full-on stall.
>>
>> There's pretty-clearly a bad interaction here between Postgres wiring
>> memory and the ARC, when the latter is left alone and allowed to do what
>> it wants.   I'm continuing to work on replicating this on the test
>> machine... just not completely there yet.
>
> Another possibility to consider is how Postgres uses the FS. For
> example, does it request sync I/O in ways not present in the system
> without it, causing the FS -- and possibly the underlying disk
> system -- to behave differently?
>
That's possible, but not terribly likely in this particular instance.
The reason is that I first ran into this with the Postgres data store
on a UFS volume, BEFORE I converted it.  It's on the ZFS pool now (with
recordsize=8k, as recommended for that filesystem), but when I first ran
into the problem it was on a separate UFS filesystem, where it had
resided for 2+ years without incident.  So unless Postgres's use of a
UFS volume could somehow give ZFS fits, it's unlikely to be involved.
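For anyone trying to replicate the setup: recordsize only affects blocks
written after the property is set, so it has to be in place before the data
is loaded.  Roughly like this -- the pool and dataset names here are
hypothetical, not my actual layout:

```shell
# Hypothetical pool/dataset names; 8k matches Postgres's page size.
# recordsize must be set BEFORE loading data -- it only applies to
# newly written blocks, not existing ones.
zfs create tank/pgdata
zfs set recordsize=8k tank/pgdata

# Confirm the property took effect:
zfs get recordsize tank/pgdata
```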

> One other option to test, just to rule it out: what happens if you
> use the 4BSD scheduler instead of ULE?
>
>    Regards
>    Steve
>

I will test that but first I have to get the test machine to reliably
stall so I know I'm not chasing my tail.
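For the record, the obvious mitigation to try once I can reproduce it is
capping the ARC in /boot/loader.conf so it can't collide with Postgres's
~1.5G of wired shared memory.  Something like the following -- the sizes
are purely illustrative, not a recommendation:

```shell
# /boot/loader.conf -- illustrative values only; size the cap to leave
# headroom for Postgres's wired shared memory and the rest of the system.
vfs.zfs.arc_max="4G"       # upper bound on ARC growth
vfs.zfs.arc_min="512M"     # floor so the ARC isn't starved entirely
```

These are loader tunables, so they take effect at the next boot.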


-- 
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC


