Date: Wed, 6 Jul 2011 08:41:52 +1000 (EST)
From: Stephen McKay <smckay@internode.on.net>
To: freebsd-fs@freebsd.org
Subject: Constant minor ZFS corruption, probably solved

Perhaps you remember me struggling with a small but continuous amount of
corruption on ZFS volumes on a new server we had built at work. Well,
it's been 4 months now, and although the machine sat in a corner for
most of that time, I've now done enough tests that I'm 90% certain what
the problem is: Seagate's caching firmware.

The specific drive model we are using is the 2TB Green
ST2000DL003-9VT166 (CC32 firmware), their first 4KB sector drive model
as far as I know. I would not be surprised to see bugs in Seagate's
first stab at their ambitious "smart align" technology.

Why do I think it's the firmware? I have a testing regime that involves
copying about 1.2TB of data to the server via ssh while simultaneously
keeping the server disks busy reading and seeking with repeated runs of
find. After each copy, I run a scrub. Any copy plus scrub sequence that
completes with zero checksum errors is a success. Any checksum errors
at all mean it's a fail.

With write caching enabled, every run has failed, and I've done this 20
or so times (with various other settings tweaked to no avail). But with
write caching disabled (using kern.cam.ada.write_cache=0 for the
onboard ports and "sdparm --clear=WCE" for the mps attached drives), I
have run my test successfully 6 times in a row. To be sure that this
was not a fluke, I turned write caching back on, producing checksum
errors, then back off, and have completed a further 4 successful runs.
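In case anyone wants to try to reproduce this, one cycle of my test
looks roughly like the sketch below. The pool name, source host and
paths are placeholders, not our real ones:

    #!/bin/sh
    # Rough sketch of one test cycle; pool/host/paths are invented.
    POOL=tank

    # Write caching off: the onboard ahci/ada drives honour the loader
    # tunable kern.cam.ada.write_cache=0 (set in /boot/loader.conf,
    # then reboot); the mps-attached da drives are done per drive:
    for d in da0 da1 da2 da3 da4 da5; do
        sdparm --clear=WCE /dev/$d
    done

    # Keep the disks busy reading and seeking during the copy.
    ( while :; do find /$POOL -xdev >/dev/null 2>&1; done ) &
    finder=$!

    # Pull ~1.2TB in over ssh.
    ssh sourcehost 'tar cf - /export/data' | tar xf - -C /$POOL/testcopy

    kill $finder

    # Scrub and check. zpool scrub returns immediately, so poll until
    # it finishes, then inspect the error counts.
    zpool scrub $POOL
    while zpool status $POOL | grep -q 'scrub in progress'; do
        sleep 60
    done
    zpool status -v $POOL  # zero CKSUM errors = pass; anything else = fail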
I can't see this as anything except bugs in the drive firmware, but I'm
open to other reasonable interpretations. Have I missed anything?

The only reason I'm 90% sure and not 100% is that I'm chasing very low
error counts per run: around 20KB to 100KB wrong out of 1.2TB, which is
between 1 and 8 millionths of 1 percent bad data. When trying to close
the gap between 1 millionth of a percent and zero, any experiment could
be just noise. :-( I really should dig up some of my old statistics
textbooks and work out how many runs I must do before I can claim
"certainty". Is there a statistician in the house?

I should also note that the speed of the machine seems undiminished by
disabling write caching. I left NCQ enabled, and this seems to allow
everything to run at full speed, at least under ZFS. If the machine had
been crippled by disabling write caching, I would have believed that
the general slowdown was lowering the stress on some dodgy part, or
that some race condition was being avoided. But with no noticeable
slowdown? I have to conclude that the fault is in the disks themselves,
since that's the part directly affected by the setting.

Except... except for that last niggling doubt that when chasing
1-millionth-of-1-percent bugs, even a tiny perturbation in operation
may be nudging the system past a software or (other) hardware bug.

Or, to put it another way: I'm certain that disabling write caching has
given us a stable machine, and I'm 90% certain that it's because of
bugs in Seagate's cache firmware. I hope someone else can replicate
this and settle the issue.

I am running -current from 17 June (though earlier versions made no
difference, at least for failing tests, which was all I knew how to do
until recently). The same failures occurred when running 8.2-RELEASE.
Changing from ZFS v15 to ZFS v28 made no difference.

A recap of our hardware, which has been slightly rearranged for recent
testing, though none of it was replaced:

  Asus P7F-E motherboard (includes 6 3Gb/s SATA ports)
  PIKE2008 (8-port SAS card based on the LSI2008 chip, supports 6Gb/s)
  Xeon X3440 (2.53GHz 4-core with hyperthreading)
  Chenbro CSPC-41416AB rackmount case
  2x 2GB 1333MHz ECC DDR3 RAM (Corsair)
  2x Seagate ST3500418AS 500GB normal disks, for OS booting (on the PIKE)
  12x Seagate ST2000DL003 2TB "green" disks (yes, with 4kB sectors)
      (6 disks on the onboard Intel SATA controller using the ahci
      driver, 6 disks on the PIKE using the mps driver)

The pool is arranged as 2 x 6-disk raidz2 vdevs, one vdev entirely on
the onboard controller and one on the PIKE. ZFS has been set to use
ashift=12 to align for lying 4kB sector drives.
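For anyone who hasn't done the ashift trick: the usual recipe at pool
creation time is to build the pool on 4KB gnop(8) wrappers, something
like the sketch below (device names are examples; adjust to taste).
The wrappers only need to exist while the pool is created:

    # Make a .nop device with a 4KB sector size over each real disk.
    for d in ada0 ada1 ada2 ada3 ada4 ada5; do
        gnop create -S 4096 /dev/$d
    done

    # ZFS picks up the 4KB sector size and stamps ashift=12 into the
    # vdev labels at creation time.
    zpool create tank raidz2 ada0.nop ada1.nop ada2.nop \
                             ada3.nop ada4.nop ada5.nop

    # The wrappers are only needed at creation; remove them and
    # reimport, and the ashift sticks.
    zpool export tank
    for d in ada0 ada1 ada2 ada3 ada4 ada5; do
        gnop destroy $d.nop
    done
    zpool import tank

    zdb -C tank | grep ashift    # expect "ashift: 12"

The second six-disk raidz2 vdev (on the mps controller) goes in the
same way, via zpool add over its own .nop wrappers.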
Jeremy has already said his piece about aiming too low on hardware, but
I've made many reliable small servers and workstations from ASUS
motherboards, ECC RAM and consumer-level disks. The occasional failure
to achieve a stable cheap system occurs roughly as often as the failure
rate for expensive gear, though my sample size for expensive gear is
small. We weren't expecting to build a machine to eclipse all others,
just a reliable disk store of a decent size. For our next machine we
might be more conservative with our disk purchases, though, since it is
becoming clear that all the disk manufacturers are cutting even more
corners on consumer-level drives, more or less forcing RAID and ZFS
users to buy full-speed (full heat, full noise, full power draw, full
price) disks.

BTW, if anyone has a direct line into Seagate or Western Digital, I'd
really love to know why nobody is following the standard written
specifically for 4K sector drives. Why report the wrong physical sector
size, when no OS that uses that data would get it wrong if you told the
truth? And where is the jumper for "I want native 4KB sectors with no
translation"? Surely the engineers made one of those *before* they made
the mind-bending sector-shuffling version?

Cheers,

Stephen.

PS Jeremy, I swapped in various 2- and 4-port disk controllers during
testing (none remain in there now). A cheap JMicron JMB363 card and a
modestly priced Adaptec 1430SA worked well, but another cheap card
using a Silicon Image 3132 chip produced silent corruption on reads.
I've seen a small number of posts claiming the 3132 is known for silent
corruption when both channels are busy, but you recommend this chip. Is
there a way to tell the good cards from the bad ones? Is FreeBSD
missing a driver (errata) fix? Or are there actually no good SiI 3132
cards? I'm keeping my 3132 card in the "bad hardware - driver testing
only" bucket.
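If anyone wants to check their own 3132 for this, a crude way to spot
read-path corruption independently of ZFS is to read the same raw
region repeatedly and compare digests (the device name below is just an
example; the region must be quiescent):

    # Read the same 4GB raw region five times; a healthy path returns
    # the identical digest every time. Any variation means the
    # controller or driver is corrupting reads.
    for i in 1 2 3 4 5; do
        dd if=/dev/ada6 bs=1m count=4096 2>/dev/null | sha256
    done

To match the "both channels busy" reports, run one such loop on each
channel of the card simultaneously.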