From: "Dominic Bishop" <dom@bishnet.net>
To: <freebsd-questions@freebsd.org>
Date: Sat, 28 Jul 2007 14:26:19 +0100
Subject: Increasing GELI performance

I've just been testing out GELI performance on an underlying RAID array using a 3ware 9550SXU-12, running RELENG_6 as of yesterday, and seem to be hitting a performance bottleneck, but I can't see where it is coming from.

Testing with an unencrypted 100GB GPT partition (/dev/da0p1) gives me around 200-250MB/s read and write speeds, to give an idea of the capability of the disk device itself. Using GELI with a default 128-bit AES key seems to limit at ~50MB/s, and changing the sector size all the way up to 128KB makes no difference whatsoever to the performance.

If I use the threads sysctl in loader.conf and drop the geli threads to 1 (instead of the usual 3 it spawns on this system) the performance still does not change at all. Monitoring during writes with systat confirms that it really is spawning 1 or 3 threads correctly in these cases.
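(For reference, the loader.conf setting I mean is, assuming the standard tunable name from geli(8), a line along these lines:

kern.geom.eli.threads=1    # number of geli worker threads; 1 here for the single-thread test

with the number adjusted, or the line removed again, to go back to the default behaviour described above.)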
Here is a uname -a from the machine:

FreeBSD 004 6.2-STABLE FreeBSD 6.2-STABLE #2: Fri Jul 27 20:10:05 CEST 2007 dom@004:/u1/obj/u1/src/sys/004 amd64

The kernel is a copy of GENERIC with the GELI option added.

The encrypted partition was created using:

geli init -s 65536 /dev/da0p1

Simple write test done with:

dd if=/dev/zero of=/dev/da0p1.eli bs=1m count=10000

(same as I did on the unencrypted device; a full test with bonnie++ shows similar speeds)

systat output whilst writing, showing 3 threads:

                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   ||||

                    /0  /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
root     idle: cpu3 XXXXXXXXX
root     idle: cpu1 XXXXXXXX XXXXXXXX
root     idle: cpu0 XXXXXXX
root     idle: cpu2 XXXXXX
root     g_eli[2] d XXX
root     g_eli[0] d XXX
root     g_eli[1] d X
root     g_up
root     dd

Output from vmstat -w 5:

 procs      memory      page                    disks     faults      cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr ad4 da0   in   sy   cs us sy id
 0 1 0   38124 3924428  208   0   1   0  9052   0   0   0 1758  451 6354  1 15 84
 0 1 0   38124 3924428    0   0   0   0 13642   0   0 411 2613  128 9483  0 22 78
 0 1 0   38124 3924428    0   0   0   0 13649   0   0 411 2614  130 9483  0 22 78
 0 1 0   38124 3924428    0   0   0   0 13642   0   0 411 2612  128 9477  0 22 78
 0 1 0   38124 3924428    0   0   0   0 13642   0   0 411 2611  128 9474  0 23 77

Output from iostat -x 5:

                        extended device statistics
device     r/s    w/s    kr/s     kw/s wait svc_t  %b
ad4        2.2    0.7    31.6      8.1    0   3.4   1
da0        0.2  287.8     2.3  36841.5    0   0.4  10
pass0      0.0    0.0     0.0      0.0    0   0.0   0
                        extended device statistics
device     r/s    w/s    kr/s     kw/s wait svc_t  %b
ad4        0.0    0.0     0.0      0.0    0   0.0   0
da0        0.0  411.1     0.0  52622.1    0   0.4  15
pass0      0.0    0.0     0.0      0.0    0   0.0   0
                        extended device statistics
device     r/s    w/s    kr/s     kw/s wait svc_t  %b
ad4        0.0    0.0     0.0      0.0    0   0.0   0
da0        0.0  411.1     0.0  52616.2    0   0.4  15
pass0      0.0    0.0     0.0      0.0    0   0.0   0

Looking at these results myself, I cannot see where the bottleneck is. Since neither the sector size nor the number of geli threads affects performance, I would assume there is some other single-threaded part limiting it, but I don't know enough about how it works to say what.

CPU in the machine is a pair of these:

CPU: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz (1603.92-MHz K8-class CPU)

I've also come across some other strange issues with machines which have identical arrays but only a pair of 32-bit 3.0GHz Xeons in them (also running RELENG_6 as of yesterday, just i386 rather than amd64). On those, geli launches a single thread by default (cores minus 1 seems to be the default), but I cannot force it to launch 2 by using the sysctl, although on the 4-core machine I can successfully use it to launch 4. It would be nice to be able to use both cores on the 32-bit machines for geli, but given the results I've shown here I'm not sure it would gain me much at the moment.

Another problem I've found is that if I use a GELI sector size > 8192 bytes then I'm unable to newfs the encrypted partition afterwards; it fails immediately with this error:

newfs /dev/da0p1.eli
increasing block size from 16384 to fragment size (65536)
/dev/da0p1.eli: 62499.9MB (127999872 sectors) block size 65536, fragment size 65536
        using 5 cylinder groups of 14514.56MB, 232233 blks, 58112 inodes.
newfs: can't read old UFS1 superblock: read error from block device: Invalid argument

The underlying device is readable and writeable, however, as dd can read/write to it without any errors.

If anyone has any suggestions or thoughts on any of these points it would be much appreciated; these machines will be performing backups over a 1Gbit LAN, so more speed than I can currently get would be preferable.
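(The obvious workaround for the newfs problem would presumably be to stay at 8192-byte sectors or below, i.e. something along these lines, untested as written:

geli detach da0p1.eli            # tear down the 64KB-sector provider
geli init -s 8192 /dev/da0p1     # re-init at 8192-byte sectors, the largest size newfs accepts here
geli attach /dev/da0p1
newfs /dev/da0p1.eli

though since the sector size makes no measurable difference to throughput here anyway, that doesn't help with the speed issue.)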
I sent this to geom@ and meant to CC it here, since geom@ seems to be a pretty quiet list and it might not get seen there, but I forgot the CC, so apologies for sending it separately here. I'll also add a few extra bits that I sent to geom@ in response to a reply:

Trying newfs with the -S option to specify a sector size matching the -s option given to geli init:

newfs -S 65536 /dev/da0p1.eli
increasing block size from 16384 to fragment size (65536)
/dev/da0p1.eli: 62499.9MB (127999872 sectors) block size 65536, fragment size 65536
        using 5 cylinder groups of 14514.56MB, 232233 blks, 58112 inodes.
newfs: can't read old UFS1 superblock: read error from block device: Invalid argument

diskinfo reports the correct sector size for the geli layer and 512 bytes for the underlying GPT partition:

diskinfo -v /dev/da0p1
/dev/da0p1
        512             # sectorsize
        65536000000     # mediasize in bytes (61G)
        128000000       # mediasize in sectors
        7967            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

diskinfo -v /dev/da0p1.eli
/dev/da0p1.eli
        65536           # sectorsize
        65535934464     # mediasize in bytes (61G)
        999999          # mediasize in sectors
        62              # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.

Testing on a onetime geli encryption of the underlying raw device, to bypass the GPT, shows very similar poor results:

dd if=/dev/da0.eli of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 29.739186 secs (35259069 bytes/sec)

dd if=/dev/zero of=/dev/da0.eli bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 23.501061 secs (44618241 bytes/sec)

For comparison, the same test done on the unencrypted raw device:

dd if=/dev/da0 of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 5.802704 secs (180704717 bytes/sec)

dd if=/dev/zero of=/dev/da0 bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 4.026869 secs (260394859 bytes/sec)

Looking at 'top -S -s1' whilst doing a long read/write through geli shows a geli thread for each core, but there only ever seems to be one in a running state at any given time; the others sit in a state of 'geli:w'. That would explain why performance is identical with 1 geli thread and with 4.

Regards,

Dominic Bishop
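P.S. For anyone wanting to repeat the raw-device comparison: the onetime provider used above is created with geli onetime, i.e. something along the lines of

geli onetime /dev/da0    # attaches a throwaway /dev/da0.eli; the key is random and discarded on detach

with the cipher and sector size left at their defaults; given the results so far, the exact options shouldn't matter much for the comparison.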