From: Dan Naumov
To: Jason Edwards, FreeBSD-STABLE Mailing List, freebsd-fs@freebsd.org, freebsd-questions@freebsd.org
Date: Sun, 24 Jan 2010 19:42:22 +0200
Subject: Re: 8.0-RELEASE/amd64 - full ZFS install - low read and write disk performance

On Sun, Jan 24, 2010 at 7:05 PM, Jason Edwards wrote:
> Hi Dan,
>
> I read on the FreeBSD mailing list that you had some performance issues
> with ZFS. Perhaps I can help you with that.
>
> You seem to be running a single mirror, which means you won't get any
> speed benefit on writes, and RAID1 implementations usually offer little
> to no acceleration of reads either; some even read only from the master
> disk and don't touch the 'slave' mirrored disk except when writing. ZFS
> is a lot more modern, although I have not tested the performance of its
> mirror implementation.
>
> But benchmarking I/O can be tricky:
>
> 1) You use bonnie, but bonnie's tests are performed without a 'cooldown'
> period between them, meaning that when test 2 starts, data from test 1
> is still being processed. For single disks and simple I/O this is not so
> bad, but with large write-back buffers and more complex I/O buffering it
> may be inappropriate. I patched bonnie some time in the past, but if you
> just want an MB/s number you can use dd for that.
>
> 2) The diskinfo tiny benchmark is single-queue only, I assume, meaning
> that it would not scale well, or at all, on RAID arrays. Actual
> filesystems on RAID arrays use multiple queues; they do not read one
> sector at a time, but read 8 blocks (of 16KiB) "ahead". This is called
> read-ahead, and for traditional UFS filesystems it is controlled by the
> sysctl variable vfs.read_max. ZFS works differently, but you still need
> a "real" benchmark.
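> Just to illustrate that UFS knob (it does not apply to ZFS, and the
> default and a sensible value depend on your FreeBSD version, so treat
> the number below as an example only), you can inspect and raise the
> read-ahead like this:
>
> sysctl vfs.read_max        # show the current read-ahead block count
> sysctl vfs.read_max=32     # raise it (as root) for testing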
> 3) You need low-latency hardware; in particular, no PCI controller
> should be used. Only PCI Express based controllers or chipset-integrated
> Serial ATA controllers have proper performance. PCI can hurt performance
> very badly and causes high interrupt CPU usage; generally you should
> avoid it. PCI Express is fine though; it is a completely different
> interface that is in many ways the opposite of what PCI was.
>
> 4) Testing actual realistic I/O performance (in IOps) is very difficult,
> but testing sequential performance is a lot easier. You can use dd for
> this.
>
> For example, you can use dd on raw devices:
>
> dd if=/dev/ad4 of=/dev/null bs=1M count=1000
>
> I will explain each parameter:
>
> if=/dev/ad4 is the input file, the "read source".
>
> of=/dev/null is the output file, the "write destination". /dev/null
> means the data goes nowhere, so this is a read-only benchmark.
>
> bs=1M is the block size, how much data to transfer at a time. The
> default is 512 bytes or the sector size, which is very slow. A value
> between 64KiB and 1024KiB is appropriate; bs=1M selects 1MiB (1024KiB).
>
> count=1000 means transfer 1000 pieces, and with bs=1M that means
> 1000 * 1MiB = 1000MiB.
>
> This example reads raw data sequentially from the start of the device
> /dev/ad4. If you want to test RAIDs, you need to work at the filesystem
> level. You can use dd for that too:
>
> dd if=/dev/zero of=/path/to/ZFS/mount/zerofile.000 bs=1M count=2000
>
> This command reads from /dev/zero (all zeroes) and writes to a file on
> the ZFS-mounted filesystem; it creates the file "zerofile.000" and
> writes 2000MiB of zeroes to it. So this command tests the write
> performance of the ZFS-mounted filesystem. To test read performance, you
> need to clear the caches first by unmounting that filesystem and
> mounting it again. This frees the memory holding cached parts of the
> filesystem (reported in top as "Inact(ive)" instead of "Free").
>
> Please do double-check a dd command before running it, and run it as a
> normal user instead of root. A wrong dd command may write to the wrong
> destination and do things you don't want. The main thing you need to
> check is the write destination (of=...). That is where dd is going to
> write to, so make sure it is the target you intended. A common mistake
> of mine was to write dd of=... if=... (starting with of instead of if)
> and thus doing the opposite of what I meant to do. This can be
> disastrous if you work with live data, so be careful! ;-)
>
> Hope any of this was helpful. During the dd benchmark you can of course
> open a second SSH terminal and start "gstat" to see the current I/O
> statistics of your devices.
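> Putting the write and read tests together, it would look roughly like
> the following (the mount path and the pool/filesystem name are only
> placeholders, adjust them to your own setup):
>
> dd if=/dev/zero of=/path/to/ZFS/mount/zerofile.000 bs=1M count=2000
> zfs unmount pool/filesystem
> zfs mount pool/filesystem
> dd if=/path/to/ZFS/mount/zerofile.000 of=/dev/null bs=1M
>
> The second dd should then show the sequential read speed of the mirror
> itself rather than the speed of data cached in RAM.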
>
> Kind regards,
> Jason

Hi, and thanks for your tips, I appreciate it :)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 36.206372 secs (29656156 bytes/sec)

[jago@atombsd ~]$ dd if=/dev/zero of=/home/jago/test2 bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 143.878615 secs (29851325 bytes/sec)

This works out to 1GB in 36.2 seconds (28.2 MB/s) in the first test and
4GB in 143.8 seconds (28.4 MB/s) in the second, roughly consistent with
the bonnie results. Sadly, it also seems to confirm the very slow speed :(

The disks are attached to a 4-port Sil3124 controller, and again, my
Windows benchmarks showing 65 MB/s+ were done on the exact same machine,
with the same disks attached to the same controller. The only difference
was that in Windows the disks weren't in a mirror configuration but were
tested individually. I understand that a mirror setup offers roughly the
same write speed as an individual disk, while the read speed usually
varies from "equal to an individual disk" to "nearly the throughput of
both disks combined" depending on the implementation, but I see no
obvious reason why my setup offers both read and write speeds of roughly
1/3 to 1/2 of what the individual disks are capable of. Dmesg shows:

atapci0: port 0x1000-0x100f mem 0x90108000-0x9010807f,0x90100000-0x90107fff irq 21 at device 0.0 on pci4
ad8: 1907729MB at ata4-master SATA300
ad10: 1907729MB at ata5-master SATA300

I also recall testing an alternative configuration in the past, where I
booted off a UFS disk and had the ZFS mirror consist of the 2 disks
directly. The bonnie numbers in that case were in line with my
expectations: I was seeing 65-70 MB/s. Note: again, the exact same
hardware and the exact same disks attached to the exact same controller.
To my knowledge, Solaris/OpenSolaris has an issue where it has to
automatically disable the disk cache if ZFS is used on top of partitions
instead of raw disks, but to my knowledge (I recall reading this from
multiple reputable sources) this issue does not affect FreeBSD.

- Sincerely,
Dan Naumov
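P.S. To try to rule out the controller, I suppose I could also repeat
your raw-device test on each disk of the mirror individually, along the
lines of:

dd if=/dev/ad8 of=/dev/null bs=1M count=1000
dd if=/dev/ad10 of=/dev/null bs=1M count=1000

If those come back at roughly the 65 MB/s I saw under Windows, the
bottleneck would seem to be in the mirror/ZFS layer rather than in the
Sil3124 or the disks themselves.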