From: Anselm Strauss <amsibamsi@gmail.com>
Date: Tue, 27 Apr 2010 21:22:06 +0200
To: Dan Naumov
Cc: freebsd-questions@freebsd.org
Subject: Re: ZFS scheduling

On 04/26/10 00:03, Dan Naumov wrote:
>> Hi,
>>
>> I noticed that my system gets very slow when I'm doing some simple
>> but intense ZFS operations. For example, I move about 20 gigabytes
>> of data from one dataset to another on the same pool, which is a
>> RAIDZ of three 500 GB SATA disks. The operation itself runs fast,
>> but meanwhile other things get really slow. E.g. opening an
>> application takes 5 times as long as before. Also, simple operations
>> like 'ls' stall for some seconds, which they never did before. It
>> already improved a lot when I switched from RAIDZ to a mirror with
>> only 2 disks. Memory and CPU don't seem to be the issue: I have a
>> quad-core CPU and 8 GB RAM.
>>
>> I can't get rid of the idea that this has something to do with
>> scheduling. The system is otherwise absolutely stable and fast.
>> Somehow small I/O operations on ZFS seem to have a hard time making
>> it through when other, bigger ones are running. Maybe this has
>> something to do with tuning?
>>
>> I know my system information is very incomplete, and there could be
>> a lot of causes.
>> But does anybody know if this could be an issue with ZFS itself?
>
> Hello,
>
> As you mention, your system information is indeed very incomplete,
> making your problem rather hard to diagnose :)
>
> Scheduling, in the traditional sense, is unlikely to be the cause of
> your problems, but here are a few things you could look into:
>
> The first is obviously the pool layout: heavy-duty writing on a pool
> consisting of a single raidz vdev is slow (slower than writing to a
> mirror, as you already discovered), period. Such is the nature of
> raidz. Additionally, your problem is magnified by the fact that you
> have reads competing with writes, since you are reading (I assume)
> from the same pool. One approach to alleviating the problem would be
> to utilize a pool consisting of 2 or more raidz vdevs in a stripe,
> like this:
>
>   pool
>     raidz
>       disc1
>       disc2
>       disc3
>     raidz
>       disc4
>       disc5
>       disc6
>
> The second potential cause of your issues is the system wrongly
> guesstimating your optimal TXG commit size. ZFS works in such a
> fashion that it commits data to disk in chunks. It tries to optimize
> how big a chunk it writes at a time by evaluating your pool's I/O
> bandwidth over time and the available RAM. The TXG commits happen at
> an interval of 5-30 seconds. The worst-case scenario is that if the
> system misguesses the optimal TXG size, then under heavy write load
> it keeps deferring the commit for up to the 30-second timeout, and
> when it hits the cap, it frantically commits it ALL at once. This
> can, and most likely will, completely starve your read I/O on the
> pool for as long as the drives choke while committing the TXG.
>
> If you are on 8.0-RELEASE, you could try playing with the
> vfs.zfs.txg.timeout variable in /boot/loader.conf; generally sane
> values are 5-30, with 30 being the default. You could also try
> adjusting vfs.zfs.vdev.max_pending down from the default of 35 to a
> lower value and see if that helps. AFAIK, 8-STABLE and -HEAD have a
> sysctl variable which directly allows you to manually set the
> preferred TXG size, and I'm pretty sure I've seen some patches on the
> mailing lists to add this functionality to 8.0.
>
> Hope this helps.
>
> - Sincerely,
> Dan Naumov

Thanks for the explanation and hints. As I said, it's already a lot
better with a mirror instead of raidz; maybe I will try to adjust some
sysctl parameters as you suggested.

But I'm still a bit puzzled why it is possible at all that one simple
operation can stall the system so much. In my naive view I just compare
it to CPU scheduling: even when I have a process that consumes 100% of
the CPU, starting another small process in parallel that needs very
little CPU time causes virtually no slowdown for it. A normal fair
scheduler would assign 50% of the CPU to each process, so the small one
still has plenty of resources, and doubling the execution time of an
already very short-running process is barely noticeable. Of course this
changes when there are lots of processes, so that even a small process
only gets a fraction of the CPU. But I guess this is not how I/O
scheduling or ZFS works. Maybe this goes more into the topic of I/O
scheduling priority of processes.

Anselm
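
[For reference, a sketch of the striped two-vdev layout Dan describes
above. The pool name "tank" and the device names ada0 through ada5 are
hypothetical placeholders, not from the thread; substitute your own
disks:]

  # Create a pool striped across two 3-disk raidz vdevs; writes are
  # spread over both vdevs instead of funneling into a single one.
  zpool create tank raidz ada0 ada1 ada2 raidz ada3 ada4 ada5

  # Verify the layout: both raidz vdevs should appear under the pool.
  zpool status tank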
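
[And a sketch of the 8.0-RELEASE tuning Dan mentions. The tunable
names come from his mail; the specific values below are illustrative
assumptions to experiment with, not tested recommendations:]

  # /boot/loader.conf -- takes effect at next boot
  # Commit TXGs more often so each commit is smaller (default is 30 s).
  vfs.zfs.txg.timeout="5"
  # Queue fewer concurrent I/Os per vdev (default is 35) so small reads
  # spend less time stuck behind large writes.
  vfs.zfs.vdev.max_pending="10"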