Date:      Tue, 16 Sep 2008 13:15:30 -0700
From:      Jeremy Chadwick <koitsu@FreeBSD.org>
To:        Mike Tancsa <mike@sentex.net>
Cc:        stable@freebsd.org, Clint Olsen <clint.olsen@gmail.com>
Subject:   Re: Help debugging DMA_READ errors
Message-ID:  <20080916201530.GA72912@icarus.home.lan>
In-Reply-To: <200809161934.m8GJY9oe039218@lava.sentex.ca>
References:  <20080916170452.GB4861@0lsen.net> <20080916175858.GA70396@icarus.home.lan> <20080916181903.GC7540@0lsen.net> <20080916185401.GA71275@icarus.home.lan> <200809161934.m8GJY9oe039218@lava.sentex.ca>

On Tue, Sep 16, 2008 at 03:34:07PM -0400, Mike Tancsa wrote:
> At 02:54 PM 9/16/2008, Jeremy Chadwick wrote:
>
>> However, there's no sign of DMA errors in the SMART log.  I'm not sure
>> what to make of that; I really would expect there to be some.
>
> Would not bad cables (or trays) be consistent with symptoms like that ? 
> i.e. the OS sees errors, but when we ask the drive, it says, "what  
> errors".  I am sure there are other things that could cause this, but in 
> the past I would start with the cables and or trays.

My official answer is: "I'm not sure".  :-)  Anything is possible.

I'd expect carrier/tray problems to manifest themselves as constant data
corruption, or disks falling off the bus (loose signal cable or losing
power).  I'd expect "detach" messages for the SATA channels.  But
remember, ICH5 lacks AHCI, and I don't know if the FreeBSD ata(4) driver
would report detach/attach in that case.  I suppose a disk falling off
the bus or disappearing could in fact lock up the controller in this
scenario.

I'd expect cable problems to show constant data errors or loss, and
regular DMA errors.  FreeBSD would be quite chatty about this, I assume.
He just started getting these, and they're only "every couple days".
I'd also expect the attribute counters to be much higher -- a bad cable
would eventually get noticed by both the controller and the disk, maybe
just not consistently.  ZFS could help with detecting this (checksum
errors), but that's a different beast.

I have doubts about the cables being bad because he's seeing issues on a
SATA disk and a PATA disk.  It seems very unlikely that separate SATA
and PATA cables would go bad within a day or two of one another.

Another possibility is that the firmware on his drives lacks UDMA error
logging in SMART.  I've seen some drives do this (increase the attribute
but not stick anything in the SMART log), but they were old Maxtors.
UDMA CRCs were sky-high (to the point where the general drive health
was FAIL, REPLACE NOW), but nothing in the SMART log.
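
A rough way to check for that mismatch (attribute counter climbing while
the SMART error log stays empty) is sketched below.  This is only an
illustration, not something taken from his system: it assumes the text
comes from the attribute table printed by "smartctl -A", where each row
starts with the attribute ID and ends with the raw value, and that
attribute 199 is the UDMA CRC error count on the drive in question.  The
function names are made up for this example.

```python
import re


def udma_crc_raw(smart_attr_text):
    """Return the raw value of attribute 199 (UDMA CRC errors), or None.

    Assumption: rows follow the usual smartctl -A column layout:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smart_attr_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            # The raw value may carry trailing text; keep leading digits.
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def counter_without_log(crc_count, logged_errors):
    """True when the CRC counter is non-zero but the error log is empty.

    logged_errors is the number of entries the drive reports in its
    SMART error log; how that number is obtained is left to the caller.
    """
    return crc_count is not None and crc_count > 0 and logged_errors == 0
```

If counter_without_log() comes back true, the drive is in the same
state as those old Maxtors: CRC errors counted, nothing logged.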

The acd0 thing bothers me the most, I think -- not because of the
oddity, but because it tried to read the TOC of a disc that wasn't even
there.  A specific ATAPI command induces that, if I remember right.

All that said: there is absolutely no harm in replacing the cables!
By doing so you can rule those out.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



