Date: Wed, 6 Jul 2011 08:41:52 +1000 (EST)
From: Stephen McKay <smckay@internode.on.net>
To: freebsd-fs@freebsd.org
Subject: Constant minor ZFS corruption, probably solved

Perhaps you remember me struggling with a small but continuous amount of
corruption on ZFS volumes on a new server we had built at work. Well,
it's been 4 months now, and although the machine sat in a corner for
most of that time, I've now done enough tests that I'm 90% certain what
the problem is: Seagate's caching firmware.

The specific drive model we are using is the 2TB Green
ST2000DL003-9VT166 (CC32 firmware), their first 4KB sector drive model
as far as I know. I would not be surprised to see bugs in Seagate's
first stab at their ambitious "smart align" technology.

Why do I think it's the firmware? I have a testing regime that involves
copying about 1.2TB of data to the server via ssh while simultaneously
keeping the server disks busy reading and seeking with repeated runs of
find. After each copy, I run a scrub. Any copy plus scrub sequence that
completes with zero checksum errors is a success. Any checksum errors
at all mean it's a fail.

With write caching enabled, every run has failed, and I've done this 20
or so times (with various other settings tweaked to no avail). But with
write caching disabled (using kern.cam.ada.write_cache=0 for the
onboard ports and "sdparm --clear=WCE" for the mps attached drives), I
have run my test successfully 6 times in a row. To be sure that this
was not a fluke, I turned write caching back on, producing checksum
errors, then back off, and have completed a further 4 successful runs.
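In case anyone wants to try to reproduce this, one cycle of my test
looks roughly like the sketch below. The pool name, source host and
paths are placeholders, not our real ones:

    #!/bin/sh
    # Rough sketch of one test cycle; pool/host/paths are invented.
    POOL=tank

    # Write caching off: the onboard ahci/ada drives honour the loader
    # tunable kern.cam.ada.write_cache=0 (set in /boot/loader.conf,
    # then reboot); the mps-attached da drives are done per drive:
    for d in da0 da1 da2 da3 da4 da5; do
        sdparm --clear=WCE /dev/$d
    done

    # Keep the disks busy reading and seeking during the copy.
    ( while :; do find /$POOL -xdev >/dev/null 2>&1; done ) &
    finder=$!

    # Pull ~1.2TB in over ssh.
    ssh sourcehost 'tar cf - /export/data' | tar xf - -C /$POOL/testcopy

    kill $finder

    # Scrub and check. zpool scrub returns immediately, so poll until
    # it finishes, then inspect the error counts.
    zpool scrub $POOL
    while zpool status $POOL | grep -q 'scrub in progress'; do
        sleep 60
    done
    zpool status -v $POOL  # zero CKSUM errors = pass; anything else = fail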
I can't see this as anything except bugs in the drive firmware, but I'm
open to other reasonable interpretations. Have I missed anything?

The only reason I'm 90% sure and not 100% is that I'm chasing very low
error counts per run: around 20KB to 100KB wrong out of 1.2TB, which is
between 1 and 8 millionths of 1 percent bad data. When trying to close
the gap between 1 millionth of a percent and zero, any experiment could
be just noise. :-( I really should dig up some of my old statistics
textbooks and work out how many runs I must do before I can claim
"certainty". Is there a statistician in the house?

I should also note that the speed of the machine seems undiminished by
disabling write caching. I left NCQ enabled, and this seems to allow
everything to run at full speed, at least under ZFS. If the machine had
been crippled by disabling write caching, I would have believed that
the general slowdown was lowering the stress on some dodgy part, or
that some race condition was being avoided. But with no noticeable
slowdown? I have to conclude that the fault is in the disks themselves,
since that's the part directly affected by the setting.

Except... except for that last niggling doubt that when chasing
1-millionth-of-1-percent bugs, even a tiny perturbation in operation
may be nudging the system past a software or (other) hardware bug.

Or, to put it another way: I'm certain that disabling write caching has
given us a stable machine, and I'm 90% certain that it's because of
bugs in Seagate's cache firmware. I hope someone else can replicate
this and settle the issue.

I am running -current from 17 June (though earlier versions made no
difference, at least for failing tests, which was all I knew how to do
until recently). The same failures occurred when running 8.2-RELEASE.
Changing from ZFS v15 to ZFS v28 made no difference.

A recap of our hardware, which has been slightly rearranged for recent
testing, though none of it was replaced:

  Asus P7F-E motherboard (includes 6 3Gb/s SATA ports)
  PIKE2008 (8-port SAS card based on the LSI2008 chip, supports 6Gb/s)
  Xeon X3440 (2.53GHz 4-core with hyperthreading)
  Chenbro CSPC-41416AB rackmount case
  2x 2GB 1333MHz ECC DDR3 RAM (Corsair)
  2x Seagate ST3500418AS 500GB normal disks, for OS booting (on the PIKE)
  12x Seagate ST2000DL003 2TB "green" disks (yes, with 4kB sectors)
      (6 disks on the onboard Intel SATA controller using the ahci
      driver, 6 disks on the PIKE using the mps driver)

The pool is arranged as 2 x 6-disk raidz2 vdevs, one vdev entirely on
the onboard controller and one on the PIKE. ZFS has been set to use
ashift=12 to align for lying 4kB sector drives.
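For anyone who hasn't done the ashift trick: the usual recipe at pool
creation time is to build the pool on 4KB gnop(8) wrappers, something
like the sketch below (device names are examples; adjust to taste).
The wrappers only need to exist while the pool is created:

    # Make a .nop device with a 4KB sector size over each real disk.
    for d in ada0 ada1 ada2 ada3 ada4 ada5; do
        gnop create -S 4096 /dev/$d
    done

    # ZFS picks up the 4KB sector size and stamps ashift=12 into the
    # vdev labels at creation time.
    zpool create tank raidz2 ada0.nop ada1.nop ada2.nop \
                             ada3.nop ada4.nop ada5.nop

    # The wrappers are only needed at creation; remove them and
    # reimport, and the ashift sticks.
    zpool export tank
    for d in ada0 ada1 ada2 ada3 ada4 ada5; do
        gnop destroy $d.nop
    done
    zpool import tank

    zdb -C tank | grep ashift    # expect "ashift: 12"

The second six-disk raidz2 vdev (on the mps controller) goes in the
same way, via zpool add over its own .nop wrappers.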
Jeremy has already said his piece about aiming too low on hardware, but
I've made many reliable small servers and workstations from ASUS
motherboards, ECC RAM and consumer-level disks. The occasional failure
to achieve a stable cheap system occurs roughly as often as the failure
rate for expensive gear, though my sample size for expensive gear is
small. We weren't expecting to build a machine to eclipse all others,
just a reliable disk store of a decent size. For our next machine we
might be more conservative with our disk purchases, though, since it is
becoming clear that all the disk manufacturers are cutting even more
corners on consumer-level drives, more or less forcing RAID and ZFS
users to buy full-speed (full heat, full noise, full power draw, full
price) disks.

BTW, if anyone has a direct line into Seagate or Western Digital, I'd
really love to know why nobody is following the standard written
specifically for 4K sector drives. Why report the wrong physical sector
size, when no OS that uses that data would get it wrong if you told the
truth? And where is the jumper for "I want native 4KB sectors with no
translation"? Surely the engineers made one of those *before* they made
the mind-bending sector-shuffling version?

Cheers,

Stephen.

PS Jeremy, I swapped in various 2- and 4-port disk controllers during
testing (none remain in there now). A cheap JMicron JMB363 card and a
modestly priced Adaptec 1430SA worked well, but another cheap card
using a Silicon Image 3132 chip produced silent corruption on reads.
I've seen a small number of posts claiming the 3132 is known for silent
corruption when both channels are busy, but you recommend this chip. Is
there a way to tell the good cards from the bad ones? Is FreeBSD
missing a driver (errata) fix? Or are there actually no good SiI 3132
cards? I'm keeping my 3132 card in the "bad hardware - driver testing
only" bucket.
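If anyone wants to check their own 3132 for this, a crude way to spot
read-path corruption independently of ZFS is to read the same raw
region repeatedly and compare digests (the device name below is just an
example; the region must be quiescent):

    # Read the same 4GB raw region five times; a healthy path returns
    # the identical digest every time. Any variation means the
    # controller or driver is corrupting reads.
    for i in 1 2 3 4 5; do
        dd if=/dev/ada6 bs=1m count=4096 2>/dev/null | sha256
    done

To match the "both channels busy" reports, run one such loop on each
channel of the card simultaneously.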