Date:      Mon, 11 Feb 2008 12:39:08 -0700
From:      Joe Peterson <joe@skyrush.com>
To:        Gavin Atkinson <gavin.atkinson@ury.york.ac.uk>
Cc:        freebsd-fs@freebsd.org, freebsd-stable@freebsd.org
Subject:   Re: Analysis of disk file block with ZFS checksum error
Message-ID:  <47B0A45C.4090909@skyrush.com>
In-Reply-To: <1202747953.27277.7.camel@buffy.york.ac.uk>
References:  <47ACD7D4.5050905@skyrush.com>	 <D6B0BBFB-D6DB-4DE1-9094-8EA69710A10C@apple.com>	 <47ACDE82.1050100@skyrush.com>	 <20080208173517.rdtobnxqg4g004c4@www.wolves.k12.mo.us>	 <47ACF0AE.3040802@skyrush.com> <1202747953.27277.7.camel@buffy.york.ac.uk>

Gavin Atkinson wrote:
> Are the datestamps (Thu Jan 24 23:20:58 2008) found within the corrupt
> block before or after the datestamp of the file it was found within?
> i.e. was the corrupt block on the disk before or after the mp3 was
> written there?

Hi Gavin, those dates are later than the original copy (I do not have
the file timestamps to prove this, but according to my email record, I
am pretty sure of this).  So the corrupt block is later than the
original write.

If this is the case, I assume that the block got written, by mistake,
into the middle of the mp3 file.  Someone else suggested that it could
be caused by a bad transfer block number or bad drive command (corrupted
on the way to the drive, since these are not checksummed in the
hardware).  If the block went to the wrong place, AND if it was a HW
glitch, I suppose the best ZFS could then do is retry the write (if its
failure was even detected - still not sure if ZFS does a re-check of the
disk data checksum after the disk write), not knowing until the later
scrub that the block had corrupted a file.
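
A minimal Python sketch of the misdirected-write scenario described above.
It is a toy model, not ZFS code: the ToyPool class, the block numbers, and
the misdirect_to parameter are all made up for illustration, and SHA-256
simply stands in for whatever checksum the pool uses. The toy write path
never reads back what it wrote; that is an assumption of the sketch, not a
statement about what ZFS actually does.

```python
import hashlib


class ToyPool:
    """Toy model: expected checksums live in block pointers, not beside the data."""

    def __init__(self):
        self.disk = {}      # block number -> bytes actually stored there
        self.pointers = {}  # (file name, index) -> (block number, expected checksum)

    def write(self, name, index, blkno, data, misdirect_to=None):
        # misdirect_to simulates a block number corrupted on the way to the drive:
        # the data lands somewhere else, but the pointer records the intended block.
        target = blkno if misdirect_to is None else misdirect_to
        self.disk[target] = data
        self.pointers[(name, index)] = (blkno, hashlib.sha256(data).hexdigest())

    def scrub(self):
        # Re-read every referenced block and compare it with the stored checksum.
        bad = []
        for key, (blkno, expected) in sorted(self.pointers.items()):
            actual = hashlib.sha256(self.disk.get(blkno, b"")).hexdigest()
            if actual != expected:
                bad.append(key)
        return bad


pool = ToyPool()
for i, blkno in enumerate((10, 11, 12)):
    pool.write("song.mp3", i, blkno, b"mp3 data %d" % i)
before = pool.scrub()  # nothing wrong yet

# A later write meant for block 20 lands on block 11, inside the mp3.
pool.write("log", 0, 20, b"Thu Jan 24 23:20:58 2008 ...", misdirect_to=11)
after = pool.scrub()   # only now does the damage show up
```

In this toy, the write itself reports nothing; the damaged mp3 block is
found only when the scrub re-reads it. The scrub also flags the block the
later write was meant for, since in this model it never received its data.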

I think that anything is possible, but I know I was getting periodic DMA
timeouts, etc. around that time.  I hesitate, although it is tempting,
to use this evidence to focus blame purely on bad HW, given that others
seem to be seeing DMA problems too, and there is reasonable doubt
whether their problems are HW related or not.  In my case, I have been
free of DMA errors (cross your fingers) after re-installing FreeBSD
completely (giving it a larger boot partition and redoing the ZFS slice
too), and before this, I changed the IDE cable just to eliminate one
more variable.  Therefore, there are too many variables to reach a firm
conclusion, since even if the cable was "bad", I never saw one DMA error
or other indication of anything wrong with HW from the Linux side (and
I've been using that HW with both Linux and FreeBSD 6.2 for months now -
no apparent flakiness of any kind on either system).  So either it *was*
bad and FreeBSD 7.0 was being more "honest", FreeBSD's drivers and/or
ZFS were stressing the HW and revealing weaknesses in the cable, or it
was a SW issue that got cleared somehow when I re-installed.

Is it possible that the problem lies in the ATA drivers in FreeBSD or
even in ZFS and just looks like HW issues?  I do not have enough
info/expertise to know.  If not, then it may very well be true that HW
problems are pretty widespread (and that disk HW cannot, in fact, be
trusted), and there really *is* a strong need for ZFS *now* to protect
our data.  If there is a possibility that SW could be involved, any
hints on how to further debug this would be of great help to those still
experiencing recent DMA errors.  I just want to be more sure one way or
the other, but I know this issue is not an easy one (however, it's the
kind of problem that should receive the highest priority, IMHO).

						-Joe


