Date:      Fri, 6 Jan 2012 14:43:30 -0800
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        freebsd-stable@freebsd.org
Subject:   Re: gmirror not synced
Message-ID:  <20120106224330.GA26856@icarus.home.lan>
In-Reply-To: <4F0573B2.9070301@infracaninophile.co.uk>
References:  <20120104194313.GA2558@lordcow.org> <4F0573B2.9070301@infracaninophile.co.uk>

On Thu, Jan 05, 2012 at 09:56:02AM +0000, Matthew Seaman wrote:
> On 04/01/2012 19:43, Gareth de Vaux wrote:
> > Hi all,	I've noticed that the md5 hashes of a couple of files on
> > a gmirror change when I recalculate the hashes. The output usually
> > cycles between 2 hashes per file.
> > 
> > I'm guessing this is because each calculation reads the file
> > randomly from 1 of 2 component drives, and the files in question
> > had a few bit flips during their original sync. I also assume
> > this's something you have to live with for gmirror? Is removing
> > and completely rebuilding the secondary drive the only thing you
> > can do (which might fix these bit flips but incur others elsewhere)?
> 
> No, that's not something acceptable at all.  Randomly flipping bits in
> files is a really nasty failure mode.
> 
> What does 'gmirror list' tell you about the state of the gmirror?  Is
> there any possibility that your hardware is failing?  Check the SMART
> attributes of the disk in the first instance (it isn't brilliant for
> picking up impending failure, but it should be pretty accurate once the
> drive is actually generating errors.)  Also try a few passes of
> memtest86 to try and spot problems with RAM.  Cleaning dust out of air
> vents and heatsinks and generally making sure the machine is not
> overheating is a good idea too.

Another possibility is a disk with an intermittently faulty cache, or a
drive that has basically given up honouring ECC[1][2] when
reading/writing sectors (firmware bug, design flaw, etc.).

For the former point (faulty cache), SMART statistics from the drives
could help determine if this is the case, but I stress the word "could".
Such errors are usually reported in Attribute 184 ("End-to-End_Error"),
but many drives do not implement that attribute.

Gareth, please install ports/sysutils/smartmontools (make sure it's
version 5.42 or newer) and provide output from "smartctl -x /dev/disk"
and I'll review it for you.
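
A minimal sketch for documenting the symptom Gareth describes (the md5
of one file changing between reads), assuming Python is available on the
machine. The names md5_of and digest_counts are made up for this
illustration and are not part of any FreeBSD tool. It hashes the same
file several times and counts how often each digest appears. Note that
repeated reads of a small file may be served from the filesystem cache
rather than from either mirror component, so a single digest here does
not prove the components agree.

```python
import hashlib
import sys
from collections import Counter


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_counts(path, passes=10):
    """Hash the same file `passes` times; map each digest to its count."""
    return Counter(md5_of(path) for _ in range(passes))


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: rehash.py FILE [PASSES]")
    else:
        passes = int(sys.argv[2]) if len(sys.argv) > 2 else 10
        counts = digest_counts(sys.argv[1], passes)
        for digest, n in counts.most_common():
            print(f"{n:4d}  {digest}")
        if len(counts) > 1:
            # More than one digest for unchanged data means the reads
            # returned different bytes on different passes.
            print("WARNING: digest changed between reads")
```

Saved under a hypothetical name such as rehash.py, it would be run as
"python rehash.py /path/to/file 20"; more than one digest line in the
output matches the behaviour reported at the top of the thread.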

[1]: http://www.storagereview.com/guide/error.html
     (read all subsections too)
[2]: http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



