From owner-freebsd-fs@FreeBSD.ORG  Thu Jul  4 19:12:19 2013
Date: Thu, 4 Jul 2013 12:12:03 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: mxb
Cc: freebsd-fs@freebsd.org
Subject: Re: Slow resilvering with mirrored ZIL
Message-ID: <20130704191203.GA95642@icarus.home.lan>
References: <51D42107.1050107@digsys.bg> <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se>
In-Reply-To: <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Thu, Jul 04, 2013 at 07:36:27PM +0200, mxb wrote:
> zdb -C:
>
> NAS:
>     version: 28
>     name: 'NAS'
>     state: 0
>     txg: 19039918
>     pool_guid: 3808946822857359331
>     hostid: 516334119
>     hostname: 'nas.home.unixconn.com'
>     vdev_children: 2
>     vdev_tree:
>         type: 'root'
>         id: 0
>         guid: 3808946822857359331
>         children[0]:
>             type: 'raidz'
>             id: 0
>             guid: 15126043265564363201
>             nparity: 1
>             metaslab_array: 14
>             metaslab_shift: 32
>             ashift: 9

Wow, okay, where do I begin... should I just send you a bill for all
this? :P  I kid, but seriously...

Your pool is not 4K-aligned (ashift 12); it is clearly ashift 9, which
means 512-byte alignment.  That tells me you did not do the 4K gnop
procedure, which is needed for proper alignment:

http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html

The performance hit on 4K-sector drives (particularly for writes) is
tremendous.  The performance hit on SSDs is borderline catastrophic.
The 4K procedure does not hurt/harm/hinder older 512-byte-sector drives
in any way, so using 4K alignment for all drives, regardless of sector
size, is perfectly safe.
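For reference, the gnop procedure from that article boils down to
something like the following.  This is only a sketch -- it assumes the
raidz stays on the four spinning disks (ada0 through ada3) and that
you're creating the pool from scratch, which, keep reading, is where
you're headed anyway:

gnop create -S 4096 ada0        # temporary 4K-sector shim on one disk
zpool create NAS raidz ada0.nop ada1 ada2 ada3
zpool export NAS
gnop destroy ada0.nop           # remove the shim; the ashift sticks
zpool import NAS
zdb -C NAS | grep ashift        # should now say ashift: 12

One .nop device per vdev is enough: ZFS uses the largest sector size it
sees in the vdev when it picks the ashift, and that value stays with
the vdev for its lifetime.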
I believe -- but I need someone else to chime in here with
confirmation, particularly someone who is familiar with ZFS's
internals -- that once your pool is ashift 12, you can do a disk
replacement ***without*** having to do the gnop procedure (because the
pool itself is already using ashift 12).  But again, I need someone to
confirm that.

Next topic.  Your disks are as follows:

ada0: ST32000645NS        -- Seagate Constellation ES.2, 2TB, 7200rpm, 512-byte sector
ada1: WD10EARS            -- Western Digital Green, 1TB, 5400/7200rpm, 4KB sector
ada2: WD10EARS            -- Western Digital Green, 1TB, 5400/7200rpm, 4KB sector
ada3: ST32000645NS        -- Seagate Constellation ES.2, 2TB, 7200rpm, 512-byte sector
ada4: INTEL SSDSA2VP020G2 -- Intel 311 Series, 20GB, SSD, 4KB sector
ada5: OCZ-AGILITY3        -- OCZ Agility 3, 60GB, SSD, 4KB sector

The WD10EARS are known for excessively parking their heads, which
causes massive performance problems with both reads and writes.  This
is known by PC enthusiasts as the "LCC issue" (LCC = Load Cycle Count,
referring to SMART attribute 193).  On these drives there is a way to
work around the issue -- it specifically involves disabling drive-level
APM.  To do so, you have to issue a specific ATA command to the drive
using "camcontrol cmd", and this has to be done every time the system
reboots (see the sketch further down).  There is one drawback to
disabling APM: the drives run hotter.

In general, stay away from any "Green" or "GreenPower" or "EcoGreen"
drives from any vendor.  Likewise, many of Seagate's drives these days
park their heads excessively, **without** any way to disable the
behaviour (and in later firmwares they don't even increment LCC -- they
did that solely because customers were noticing the behaviour and
complaining about it, so they decided to hide it.  Cute).

I tend to recommend WD Red drives these days -- not because they're
"NAS friendly" (marketing drivel), but because they don't have retarded
firmware settings, don't act stupid/get in the way, use single 1TB
platters (thus fewer heads, less heat, fewer parts to fail, less
vibration), and perform a little bit better than the Greens.

Next topic, sort of circling back up...

Your ada4 and ada5 drives -- the slices on them, I mean -- are ALSO not
properly aligned to a 4K boundary.  This is destroying their
performance and probably wearing the drives out every time they write:
because of the misaligned boundaries, they have to do two erase-write
cycles where one should do.  Combine that with the fact that
9.1-RELEASE does not support TRIM on ZFS, and you now have SSDs which
are probably beat to hell and back.

You really need to be running stable/9 if you want to use SSDs with
ZFS.  I cannot stress this enough.  I will not bend on this fact.  I do
not care whether what people have is SLC rather than MLC or TLC -- it
doesn't matter.  TRIM on ZFS is a downright necessity for the long-term
reliability of an SSD.

Anyway... these SSDs need a full Secure Erase done to them.  In
stable/9 you can do this through camcontrol; otherwise you need to use
Linux (there are live CD/DVD distros that can do this for you) or the
vendor's native utilities (usually Windows-only).  UNDERSTAND: THIS IS
NOT THE SAME AS A "DISK FORMAT" OR "ZEROING THE DISK".  In fact, using
dd if=/dev/zero to zero an SSD would be the worst possible thing you
could do to it.
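Going back to the WD Greens for a moment: the APM workaround I
mentioned is a raw SET FEATURES command.  Roughly like this -- a sketch
only, so double-check the ACB byte layout against camcontrol(8) before
pointing it at a real drive (ada1 and ada2 are your two Greens):

camcontrol cmd ada1 -a "EF 85 00 00 00 00 00 00 00 00 00 00"
camcontrol cmd ada2 -a "EF 85 00 00 00 00 00 00 00 00 00 00"

That is ATA SET FEATURES (0xEF) with subcommand 0x85 (disable APM).
The setting does not survive a power cycle, so it has to be reissued on
every boot, e.g. from /etc/rc.local.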
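As for the Secure Erase: on stable/9 the camcontrol route looks roughly
like this -- again only a sketch, read camcontrol(8) first, and note
the drive must not be in the "frozen" security state (a suspend/resume
cycle or hot-plugging the drive usually clears that):

camcontrol security ada4 -U user -s erasepw     # set a throwaway user password
camcontrol security ada4 -U user -e erasepw     # issue SECURITY ERASE UNIT

"erasepw" is just a temporary password; a successful erase should leave
drive security disabled again afterwards.  Same procedure for ada5.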
Secure Erase clears the entire FTL and resets the wear-levelling matrix
(that's just what I call it) back to factory defaults, so you end up
with out-of-the-box performance: there are no more LBA-to-NAND-cell map
entries in the FTL (which are usually what's responsible for the
slowdown).  But even so, there is a very good possibility you have
induced massive wear and tear on the NAND cells in this situation
(especially on the Intel drive) and they may be in very bad shape in
general.  smartctl will tell me (keep reading).

You should do the Secure Erase before doing any of the gpart stuff I
talk about below.

Back to the 4K alignment problem with your SSD slices: you should have
partitioned these using "gpart" and the GPT scheme, with 1MByte
alignment for both partitions -- it's the easiest way to get proper
alignment (with MBR it's possible, but it's more of a mess and really
not worth it).  The procedure is described here:

http://www.wonkity.com/~wblock/docs/html/disksetup.html#_the_new_standard_gpt

I'll explain what I'd do, but please do not do this yet either (there
is another aspect that needs to be covered as well).  Ignore the stuff
about labels, and ignore the "boot" partition stuff -- it doesn't apply
to your setup.

To mimic what you currently have, you just need two partitions on each
drive, properly aligned via GPT.  That should be as easy as this --
except you are going to have to destroy your pool and start over,
because you're using log devices:

gpart destroy -F ada4
gpart destroy -F ada5
gpart create -s gpt ada4
gpart add -t freebsd-zfs -b 1M -s 10G ada4
gpart add -t freebsd-zfs -a 1M ada4
gpart create -s gpt ada5
gpart add -t freebsd-zfs -b 1M -s 10G ada5
gpart add -t freebsd-zfs -a 1M ada5

You should be left with 4 devices at this point -- note the name change
("p" not "s"):

/dev/ada4p1 -- use for your log device (mirrored)
/dev/ada4p2 -- unused
/dev/ada5p1 -- use for your log device (mirrored)
/dev/ada5p2 -- use for your cache device/L2ARC

These should be usable by ZFS without any issue, because they're
aligned to a 1MByte boundary (which is also 4096-byte aligned; the
reason we pick 1MByte is that it often correlates with the NAND erase
block size on many SSDs, and it's sort of a "de-facto social standard"
used by other OSes too).

But before doing ANY OF THIS...

You should probably be made aware of the fact that SSDs need to be kept
roughly 30-40% unused to get the most benefit out of wear levelling.
Once you hit the 20%-remaining mark, performance takes a hit and the
drive begins hurting more and more.  Low-capacity SSDs are therefore
generally worthless given this capacity requirement.  Your Intel drive
is very, very small, and in fact I wouldn't even bother to use it -- it
means you'd only be able to use roughly 14GB of it (at most) for data,
leaving the remaining 6GB unallocated/unused solely for wear levelling.

So given this information, when using the above gpart commands, you
might actually want to adjust the sizes of the partitions so that there
is always a guaranteed amount of free/untouched space on each drive.
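A quick aside before the sizing example: once the SSDs have been Secure
Erased and repartitioned, and the pool has been recreated (using the
gnop procedure so it comes out ashift 12), wiring the new partitions up
would look roughly like this.  A sketch only, assuming the pool keeps
the name "NAS" and the partition layout above:

zpool add NAS log mirror ada4p1 ada5p1
zpool add NAS cache ada5p2

"zpool status" should then show the mirrored log and the cache device
as separate sections below the raidz vdev.  Anyway, back to the
partition sizes.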
For example:

gpart destroy -F ada4
gpart destroy -F ada5
gpart create -s gpt ada4
gpart add -t freebsd-zfs -b 1M -s 7G ada4
gpart add -t freebsd-zfs -a 1M -s 7G ada4
gpart create -s gpt ada5
gpart add -t freebsd-zfs -b 1M -s 10G ada5
gpart add -t freebsd-zfs -a 1M -s 32G ada5

This would result in the following:

/dev/ada4p1 -- 7GB  -- use for your log device (mirrored)
/dev/ada4p2 -- 7GB  -- unused
               6GB  -- remaining ~30% of the SSD, left for wear levelling
/dev/ada5p1 -- 10GB -- use for your log device (mirrored)
/dev/ada5p2 -- 32GB -- use for your cache device/L2ARC
               18GB -- remaining ~30% of the SSD, left for wear levelling

I feel like I'm missing something, but no, I think that just about does
it.  Oh, I remember now.  Next topic...

I would strongly recommend you not use one SSD for both log and cache.
I understand your thought process here: "if the SSD dies, the log
devices are mirrored so I'm okay, and the cache is throw-away anyway".
What you're not taking into consideration is how log and cache devices
bottleneck ZFS, in addition to the fact that SATA is not like SAS when
it comes to simultaneous R/W.  That poor OCZ drive...

I will let someone else talk about this part, because I feel I've
already written up enough as is.  I'm not sure how much of it will sink
in, and you'll probably be angry at being told "your setup is a pretty
gigantic mess, and fixing it involves recreating your entire pool and
the partitions on your SSDs too", but I'm also exhausted -- this is the
4th or 5th Email I've had to write in just the past few days to cover
multiple bases.  I don't understand how this knowledge has not become
common practice in the community by now.  I feel like yelling
"VAFAAAAAAN!!!" :-)

> nas# smartctl -a ada3
> ada3: Unable to detect device type

My fault -- the syntax here is wrong, I should have been more clear:
you need to give smartctl the full device path, e.g.

smartctl -a /dev/ada0

and so on for each drive through ada5.  Also, please update your ports
tree and install smartmontools 6.1.  There are improvements there
pertaining to SSDs that are relevant.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |