Date:      Tue, 27 Apr 2010 21:22:06 +0200
From:      Anselm Strauss <amsibamsi@gmail.com>
To:        Dan Naumov <dan.naumov@gmail.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: ZFS scheduling
Message-ID:  <4BD7395E.2030300@gmail.com>
In-Reply-To: <y2qcf9b1ee01004251503jb4791869i9a812fade17a0558@mail.gmail.com>
References:  <y2qcf9b1ee01004251503jb4791869i9a812fade17a0558@mail.gmail.com>

On 04/26/10 00:03, Dan Naumov wrote:
>> Hi,
>>
>> I noticed that my system gets very slow when I'm doing some simple but
>> intense ZFS operations. For example, I move about 20 gigabytes of data
>> from one dataset to another on the same pool, which is a RAIDZ of 3 x
>> 500 GB SATA disks. The operation itself runs fast, but meanwhile other
>> things get really slow. E.g. opening an application takes 5 times as
>> long as before. Also, simple operations like 'ls' stall for several
>> seconds, which they never did before. It already improved a lot when I
>> switched from RAIDZ to a mirror with only 2 disks. Memory and CPU don't
>> seem to be the issue; I have a quad-core CPU and 8 GB RAM.
>>
>> I can't get rid of the idea that this has something to do with
>> scheduling. The system is otherwise absolutely stable and fast.
>> Somehow, small I/O operations on ZFS seem to have a very hard time
>> getting through while bigger ones are running. Maybe this has
>> something to do with tuning?
>>
>> I know my system information is very incomplete, and there could be a
>> lot of causes. But does anybody know whether this could be an issue
>> with ZFS itself?
> 
> Hello
> 
> As you do mention, your system information is indeed very incomplete,
> making your problem rather hard to diagnose :)
> 
> Scheduling, in the traditional sense, is unlikely to be the cause of
> your problems, but here are a few things you could look into:
> 
> The first is obviously the pool layout: heavy-duty writing to a pool
> consisting of a single raidz vdev is slow (slower than writing to a
> mirror, as you already discovered), period. Such is the nature of
> raidz. Additionally, your problem is magnified by the fact that you
> have reads competing with writes, since (I assume) you are reading
> from the same pool. One approach to alleviating the problem would be
> to utilize a pool consisting of 2 or more raidz vdevs in a stripe,
> like this:
> 
> pool
>   raidz
>     disc1
>     disc2
>     disc3
>   raidz
>     disc4
>     disc5
>     disc6
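> 
> For what it's worth, such a layout is created by simply listing
> multiple raidz groups in one zpool create. A minimal sketch (the pool
> name "tank" and the device names ada1 through ada6 are placeholders
> for your actual disks):
> 
>   zpool create tank \
>     raidz ada1 ada2 ada3 \
>     raidz ada4 ada5 ada6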
> 
> The second potential cause of your issues is the system wrongly
> guesstimating your optimal TXG commit size. ZFS commits data to disk
> in chunks, and it tries to optimize how big a chunk it writes at a
> time by evaluating your pool's IO bandwidth over time and the
> available RAM. The TXG commits happen at an interval of 5-30 seconds.
> The worst-case scenario is that the system misguesses the optimal TXG
> size: under heavy write load it keeps deferring the commit for up to
> the 30-second timeout, and when it hits that cap it frantically
> commits it ALL at once. This can, and most likely will, completely
> starve your read IO on the pool for as long as the drives choke while
> committing the TXG.
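> 
> One way to see whether this is happening (assuming your pool is named
> "tank") is to watch the per-second write numbers while the copy runs:
> 
>   zpool iostat tank 1
> 
> If you see several seconds of near-idle output followed by one big
> burst of writes, that is exactly this deferred TXG commit pattern.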
> 
> If you are on 8.0-RELEASE, you could try playing with the
> vfs.zfs.txg.timeout= variable in /boot/loader.conf, generally sane
> values are 5-30, with 30 being the default. You could also try
> adjusting vfs.zfs.vdev.max_pending= down from the default of 35 to a
> lower value and see if that helps. AFAIK, 8-STABLE and -HEAD have a
> systctl variable which directly allow you to manually set the
> preferred TXG size and I've pretty sure I've seen some patches on the
> mailing lists to add this functionality to 8.0.
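> 
> For reference, a sketch of what the corresponding /boot/loader.conf
> entries could look like (the exact values are just starting points to
> experiment with, not recommendations):
> 
>   vfs.zfs.txg.timeout="5"
>   vfs.zfs.vdev.max_pending="10"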
> 
> Hope this helps.
> 
> 
> - Sincerely,
> Dan Naumov

Thanks for the explanation and hints. As I said, things are already a
lot better with the mirror instead of raidz; maybe I will try adjusting
some sysctl parameters as you suggested.

But I'm still a bit puzzled that one simple operation can stall the
system so much at all. In my naive view I compare it to CPU scheduling:
even when one process consumes 100% of the CPU, starting another small
process in parallel that needs very little CPU time causes virtually no
slowdown. A normal fair scheduler would assign 50% of the CPU to each
process, so the small one still has plenty of resources, and doubling
the execution time of an already very short-running process is barely
noticeable. Of course that changes when there are lots of processes, so
that even a small process only gets a fraction of the CPU.
But I guess this is not how I/O scheduling or ZFS works. Maybe this
goes more into the topic of per-process I/O scheduling priority.

Anselm


