From: Anselm Strauss <amsibamsi@gmail.com>
Date: Tue, 27 Apr 2010 21:22:06 +0200
To: Dan Naumov
Cc: freebsd-questions@freebsd.org
Subject: Re: ZFS scheduling

On 04/26/10 00:03, Dan Naumov wrote:
>> Hi,
>>
>> I noticed that my system gets very slow when I'm doing some simple
>> but intense ZFS operations. For example, I move about 20 gigabytes
>> of data from one dataset to another on the same pool, which is a
>> RAIDZ of three 500 GB SATA disks. The operation itself runs fast,
>> but meanwhile other things get really slow. E.g. opening an
>> application takes 5 times as long as before. Also, simple operations
>> like 'ls' stall for some seconds, which they never did before. It
>> already improved a lot when I switched from RAIDZ to a mirror with
>> only 2 disks. Memory and CPU don't seem to be the issue: I have a
>> quad-core CPU and 8 GB RAM.
>>
>> I can't get rid of the idea that this has something to do with
>> scheduling. The system is otherwise absolutely stable and fast.
>> Somehow small I/O operations on ZFS seem to have a hard time making
>> it through when other, bigger ones are running. Maybe this has
>> something to do with tuning?
>>
>> I know my system information is very incomplete, and there could be
>> a lot of causes.
>> But does anybody know if this could be an issue with ZFS itself?
>
> Hello,
>
> As you mention, your system information is indeed very incomplete,
> making your problem rather hard to diagnose :)
>
> Scheduling, in the traditional sense, is unlikely to be the cause of
> your problems, but here are a few things you could look into:
>
> The first is obviously the pool layout: heavy-duty writing on a pool
> consisting of a single raidz vdev is slow (slower than writing to a
> mirror, as you already discovered), period. Such is the nature of
> raidz. Additionally, your problem is magnified by the fact that you
> have reads competing with writes, since you are reading (I assume)
> from the same pool. One approach to alleviating the problem would be
> to utilize a pool consisting of 2 or more raidz vdevs in a stripe,
> like this:
>
>   pool
>     raidz
>       disc1
>       disc2
>       disc3
>     raidz
>       disc4
>       disc5
>       disc6
>
> The second potential cause of your issues is the system wrongly
> guesstimating your optimal TXG commit size. ZFS works in such a
> fashion that it commits data to disk in chunks. It tries to optimize
> how big a chunk it writes at a time by evaluating your pool's I/O
> bandwidth over time and the available RAM. The TXG commits happen at
> an interval of 5-30 seconds. The worst-case scenario is that if the
> system misguesses the optimal TXG size, then under heavy write load
> it keeps deferring the commit for up to the 30-second timeout, and
> when it hits the cap, it frantically commits it ALL at once. This
> can, and most likely will, completely starve your read I/O on the
> pool for as long as the drives choke while committing the TXG.
>
> If you are on 8.0-RELEASE, you could try playing with the
> vfs.zfs.txg.timeout variable in /boot/loader.conf; generally sane
> values are 5-30, with 30 being the default. You could also try
> adjusting vfs.zfs.vdev.max_pending down from the default of 35 to a
> lower value and see if that helps. AFAIK, 8-STABLE and -HEAD have a
> sysctl variable which directly allows you to manually set the
> preferred TXG size, and I'm pretty sure I've seen some patches on the
> mailing lists to add this functionality to 8.0.
>
> Hope this helps.
>
> - Sincerely,
> Dan Naumov

Thanks for the explanation and hints. As I said, it's already a lot
better with a mirror instead of raidz; maybe I will try to adjust some
sysctl parameters as you suggested.

But I'm still a bit puzzled why it is possible at all that one simple
operation can stall the system so much. In my naive view I just compare
it to CPU scheduling: even when I have a process that consumes 100% of
the CPU, starting another small process in parallel that needs very
little CPU time causes virtually no slowdown for it. A normal fair
scheduler would assign 50% of the CPU to each process, so the small one
still has plenty of resources, and doubling the execution time of an
already very short-running process is barely noticeable. Of course this
changes when there are lots of processes, so that even a small process
only gets a fraction of the CPU. But I guess this is not how I/O
scheduling or ZFS works. Maybe this goes more into the topic of I/O
scheduling priority of processes.

Anselm
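
[For reference, a sketch of the striped two-vdev layout Dan describes
above. The pool name "tank" and the device names ada0 through ada5 are
hypothetical placeholders, not from the thread; substitute your own
disks:]

  # Create a pool striped across two 3-disk raidz vdevs; writes are
  # spread over both vdevs instead of funneling into a single one.
  zpool create tank raidz ada0 ada1 ada2 raidz ada3 ada4 ada5

  # Verify the layout: both raidz vdevs should appear under the pool.
  zpool status tank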
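
[And a sketch of the 8.0-RELEASE tuning Dan mentions. The tunable
names come from his mail; the specific values below are illustrative
assumptions to experiment with, not tested recommendations:]

  # /boot/loader.conf -- takes effect at next boot
  # Commit TXGs more often so each commit is smaller (default is 30 s).
  vfs.zfs.txg.timeout="5"
  # Queue fewer concurrent I/Os per vdev (default is 35) so small reads
  # spend less time stuck behind large writes.
  vfs.zfs.vdev.max_pending="10"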