From owner-freebsd-fs@FreeBSD.ORG  Thu Jul  4 19:12:19 2013
Date: Thu, 4 Jul 2013 12:12:03 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: mxb
Cc: freebsd-fs@freebsd.org
Subject: Re: Slow resilvering with mirrored ZIL
Message-ID: <20130704191203.GA95642@icarus.home.lan>
References: <51D42107.1050107@digsys.bg> <2EF46A8C-6908-4160-BF99-EC610B3EA771@alumni.chalmers.se> <51D437E2.4060101@digsys.bg> <20130704000405.GA75529@icarus.home.lan> <20130704171637.GA94539@icarus.home.lan> <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se>
In-Reply-To: <2A261BEA-4452-4F6A-8EFB-90A54D79CBB9@alumni.chalmers.se>
User-Agent: Mutt/1.5.21 (2010-09-15)

On Thu, Jul 04, 2013 at 07:36:27PM +0200, mxb wrote:
> zdb -C:
>
> NAS:
>     version: 28
>     name: 'NAS'
>     state: 0
>     txg: 19039918
>     pool_guid: 3808946822857359331
>     hostid: 516334119
>     hostname: 'nas.home.unixconn.com'
>     vdev_children: 2
>     vdev_tree:
>         type: 'root'
>         id: 0
>         guid: 3808946822857359331
>         children[0]:
>             type: 'raidz'
>             id: 0
>             guid: 15126043265564363201
>             nparity: 1
>             metaslab_array: 14
>             metaslab_shift: 32
>             ashift: 9

Wow, okay, where do I begin... should I just send you a bill for all
this? :P  I kid, but seriously...

Your pool is not 4K-aligned (ashift 12); it is clearly ashift 9, which
means 512-byte alignment.  That tells me you did not do the 4K gnop
procedure, which is needed for proper alignment:

http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html

The performance hit on 4K-sector drives (particularly for writes) is
tremendous.  The performance hit on SSDs is borderline catastrophic.
The 4K procedure does not hurt/harm/hinder older 512-byte-sector drives
in any way, so using 4K alignment for all drives, regardless of sector
size, is perfectly safe.
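For reference, the gnop procedure from that article boils down to
something like the following.  This is only a sketch -- it assumes the
raidz stays on the four spinning disks (ada0 through ada3) and that
you're creating the pool from scratch, which, keep reading, is where
you're headed anyway:

gnop create -S 4096 ada0        # temporary 4K-sector shim on one disk
zpool create NAS raidz ada0.nop ada1 ada2 ada3
zpool export NAS
gnop destroy ada0.nop           # remove the shim; the ashift sticks
zpool import NAS
zdb -C NAS | grep ashift        # should now say ashift: 12

One .nop device per vdev is enough: ZFS uses the largest sector size it
sees in the vdev when it picks the ashift, and that value stays with
the vdev for its lifetime.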
I believe -- but I need someone else to chime in here with
confirmation, particularly someone who is familiar with ZFS's
internals -- that once your pool is ashift 12, you can do a disk
replacement ***without*** having to do the gnop procedure (because the
pool itself is already using ashift 12).  But again, I need someone to
confirm that.

Next topic.  Your disks are as follows:

ada0: ST32000645NS        -- Seagate Constellation ES.2, 2TB, 7200rpm, 512-byte sector
ada1: WD10EARS            -- Western Digital Green, 1TB, 5400/7200rpm, 4KB sector
ada2: WD10EARS            -- Western Digital Green, 1TB, 5400/7200rpm, 4KB sector
ada3: ST32000645NS        -- Seagate Constellation ES.2, 2TB, 7200rpm, 512-byte sector
ada4: INTEL SSDSA2VP020G2 -- Intel 311 Series, 20GB, SSD, 4KB sector
ada5: OCZ-AGILITY3        -- OCZ Agility 3, 60GB, SSD, 4KB sector

The WD10EARS are known for excessively parking their heads, which
causes massive performance problems with both reads and writes.  This
is known by PC enthusiasts as the "LCC issue" (LCC = Load Cycle Count,
referring to SMART attribute 193).  On these drives there is a way to
work around the issue -- it specifically involves disabling drive-level
APM.  To do so, you have to issue a specific ATA command to the drive
using "camcontrol cmd", and this has to be done every time the system
reboots (see the sketch further down).  There is one drawback to
disabling APM: the drives run hotter.

In general, stay away from any "Green" or "GreenPower" or "EcoGreen"
drives from any vendor.  Likewise, many of Seagate's drives these days
park their heads excessively, **without** any way to disable the
behaviour (and in later firmwares they don't even increment LCC -- they
did that solely because customers were noticing the behaviour and
complaining about it, so they decided to hide it.  Cute).

I tend to recommend WD Red drives these days -- not because they're
"NAS friendly" (marketing drivel), but because they don't have retarded
firmware settings, don't act stupid/get in the way, use single 1TB
platters (thus fewer heads, less heat, fewer parts to fail, less
vibration), and perform a little bit better than the Greens.

Next topic, sort of circling back up...

Your ada4 and ada5 drives -- the slices on them, I mean -- are ALSO not
properly aligned to a 4K boundary.  This is destroying their
performance and probably wearing the drives out every time they write:
because of the misaligned boundaries, they have to do two erase-write
cycles where one should do.  Combine that with the fact that
9.1-RELEASE does not support TRIM on ZFS, and you now have SSDs which
are probably beat to hell and back.

You really need to be running stable/9 if you want to use SSDs with
ZFS.  I cannot stress this enough.  I will not bend on this fact.  I do
not care whether what people have is SLC rather than MLC or TLC -- it
doesn't matter.  TRIM on ZFS is a downright necessity for the long-term
reliability of an SSD.

Anyway... these SSDs need a full Secure Erase done to them.  In
stable/9 you can do this through camcontrol; otherwise you need to use
Linux (there are live CD/DVD distros that can do this for you) or the
vendor's native utilities (usually Windows-only).  UNDERSTAND: THIS IS
NOT THE SAME AS A "DISK FORMAT" OR "ZEROING THE DISK".  In fact, using
dd if=/dev/zero to zero an SSD would be the worst possible thing you
could do to it.
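Going back to the WD Greens for a moment: the APM workaround I
mentioned is a raw SET FEATURES command.  Roughly like this -- a sketch
only, so double-check the ACB byte layout against camcontrol(8) before
pointing it at a real drive (ada1 and ada2 are your two Greens):

camcontrol cmd ada1 -a "EF 85 00 00 00 00 00 00 00 00 00 00"
camcontrol cmd ada2 -a "EF 85 00 00 00 00 00 00 00 00 00 00"

That is ATA SET FEATURES (0xEF) with subcommand 0x85 (disable APM).
The setting does not survive a power cycle, so it has to be reissued on
every boot, e.g. from /etc/rc.local.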
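As for the Secure Erase: on stable/9 the camcontrol route looks roughly
like this -- again only a sketch, read camcontrol(8) first, and note
the drive must not be in the "frozen" security state (a suspend/resume
cycle or hot-plugging the drive usually clears that):

camcontrol security ada4 -U user -s erasepw     # set a throwaway user password
camcontrol security ada4 -U user -e erasepw     # issue SECURITY ERASE UNIT

"erasepw" is just a temporary password; a successful erase should leave
drive security disabled again afterwards.  Same procedure for ada5.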
Secure Erase clears the entire FTL and resets the wear-levelling matrix
(that's just what I call it) back to factory defaults, so you end up
with out-of-the-box performance: there are no more LBA-to-NAND-cell map
entries in the FTL (which are usually what's responsible for the
slowdown).  But even so, there is a very good possibility you have
induced massive wear and tear on the NAND cells in this situation
(especially on the Intel drive) and they may be in very bad shape in
general.  smartctl will tell me (keep reading).

You should do the Secure Erase before doing any of the gpart stuff I
talk about below.

Back to the 4K alignment problem with your SSD slices: you should have
partitioned these using "gpart" and the GPT scheme, with 1MByte
alignment for both partitions -- it's the easiest way to get proper
alignment (with MBR it's possible, but it's more of a mess and really
not worth it).  The procedure is described here:

http://www.wonkity.com/~wblock/docs/html/disksetup.html#_the_new_standard_gpt

I'll explain what I'd do, but please do not do this yet either (there
is another aspect that needs to be covered as well).  Ignore the stuff
about labels, and ignore the "boot" partition stuff -- it doesn't apply
to your setup.

To mimic what you currently have, you just need two partitions on each
drive, properly aligned via GPT.  That should be as easy as this --
except you are going to have to destroy your pool and start over,
because you're using log devices:

gpart destroy -F ada4
gpart destroy -F ada5
gpart create -s gpt ada4
gpart add -t freebsd-zfs -b 1M -s 10G ada4
gpart add -t freebsd-zfs -a 1M ada4
gpart create -s gpt ada5
gpart add -t freebsd-zfs -b 1M -s 10G ada5
gpart add -t freebsd-zfs -a 1M ada5

You should be left with 4 devices at this point -- note the name change
("p" not "s"):

/dev/ada4p1 -- use for your log device (mirrored)
/dev/ada4p2 -- unused
/dev/ada5p1 -- use for your log device (mirrored)
/dev/ada5p2 -- use for your cache device/L2ARC

These should be usable by ZFS without any issue, because they're
aligned to a 1MByte boundary (which is also 4096-byte aligned; the
reason we pick 1MByte is that it often correlates with the NAND erase
block size on many SSDs, and it's sort of a "de-facto social standard"
used by other OSes too).

But before doing ANY OF THIS...

You should probably be made aware of the fact that SSDs need to be kept
roughly 30-40% unused to get the most benefit out of wear levelling.
Once you hit the 20%-remaining mark, performance takes a hit and the
drive begins hurting more and more.  Low-capacity SSDs are therefore
generally worthless given this capacity requirement.  Your Intel drive
is very, very small, and in fact I wouldn't even bother to use it -- it
means you'd only be able to use roughly 14GB of it (at most) for data,
leaving the remaining 6GB unallocated/unused solely for wear levelling.

So given this information, when using the above gpart commands, you
might actually want to adjust the sizes of the partitions so that there
is always a guaranteed amount of free/untouched space on each drive.
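A quick aside before the sizing example: once the SSDs have been Secure
Erased and repartitioned, and the pool has been recreated (using the
gnop procedure so it comes out ashift 12), wiring the new partitions up
would look roughly like this.  A sketch only, assuming the pool keeps
the name "NAS" and the partition layout above:

zpool add NAS log mirror ada4p1 ada5p1
zpool add NAS cache ada5p2

"zpool status" should then show the mirrored log and the cache device
as separate sections below the raidz vdev.  Anyway, back to the
partition sizes.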
For example:

gpart destroy -F ada4
gpart destroy -F ada5
gpart create -s gpt ada4
gpart add -t freebsd-zfs -b 1M -s 7G ada4
gpart add -t freebsd-zfs -a 1M -s 7G ada4
gpart create -s gpt ada5
gpart add -t freebsd-zfs -b 1M -s 10G ada5
gpart add -t freebsd-zfs -a 1M -s 32G ada5

This would result in the following:

/dev/ada4p1 -- 7GB  -- use for your log device (mirrored)
/dev/ada4p2 -- 7GB  -- unused
               6GB  -- remaining ~30% of the SSD, left for wear levelling
/dev/ada5p1 -- 10GB -- use for your log device (mirrored)
/dev/ada5p2 -- 32GB -- use for your cache device/L2ARC
               18GB -- remaining ~30% of the SSD, left for wear levelling

I feel like I'm missing something, but no, I think that just about does
it.  Oh, I remember now.  Next topic...

I would strongly recommend you not use one SSD for both log and cache.
I understand your thought process here: "if the SSD dies, the log
devices are mirrored so I'm okay, and the cache is throw-away anyway".
What you're not taking into consideration is how log and cache devices
bottleneck ZFS, in addition to the fact that SATA is not like SAS when
it comes to simultaneous R/W.  That poor OCZ drive...

I will let someone else talk about this part, because I feel I've
already written up enough as is.  I'm not sure how much of it will sink
in, and you'll probably be angry at being told "your setup is a pretty
gigantic mess, and fixing it involves recreating your entire pool and
the partitions on your SSDs too", but I'm also exhausted -- this is the
4th or 5th Email I've had to write in just the past few days to cover
multiple bases.  I don't understand how this knowledge has not become
common practice in the community by now.  I feel like yelling
"VAFAAAAAAN!!!" :-)

> nas# smartctl -a ada3
> ada3: Unable to detect device type

My fault -- the syntax here is wrong, I should have been more clear:
you need to give smartctl the full device path, e.g.

smartctl -a /dev/ada0

and so on for each drive through ada5.  Also, please update your ports
tree and install smartmontools 6.1.  There are improvements there
pertaining to SSDs that are relevant.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |