Date:      Tue, 26 May 1998 05:43:42 -0700
From:      Mike Smith <mike@smith.net.au>
To:        Michael Robinson <robinson@public.bta.net.cn>
Cc:        mike@smith.net.au, freebsd-stable@FreeBSD.ORG
Subject:   Re: Bug in wd driver 
Message-ID:  <199805261243.FAA00386@antipodes.cdrom.com>
In-Reply-To: Your message of "Tue, 26 May 1998 18:23:58 +0800." <199805261023.SAA11951@public.bta.net.cn> 

> Mike Smith writes:
> >> I wrote a message related to this problem to freebsd-questions
> >> yesterday, but upon further investigation, I have decided this is
> >> a bug, not a feature.
> >
> >Actually, it's almost certainly a hardware fault.
> 
> Actually, the bug is that the driver does not recover gracefully from a 
> recoverable hardware fault.  It instead goes into an infinite loop, taking
> significant pieces of the kernel with it.

Actually, an interrupt timeout is not a "recoverable hardware fault".
This is a basic failure in the driver:controller protocol on the part 
of the drive.

> >>  1. Any I/O access to the affected sectors will cause the following
> >>     message:
> >> 
> >>     wd0: interrupt timeout
> >>     wd0: status 58<rdy,seekdone,drq> error 0
> >
> >The disk has failed to respond to the access request.   You may be able 
> >to recover by dd'ing zeroes over the whole partition (forcing a block 
> >reallocation), however the disk may be damaged beyond repair.
> 
> I repeat, any attempted access to the affected sectors locks up that 
> process.  Unless dd has the ability to circumvent the wd driver, I don't
> see how I would be able to dd zeroes over the whole partition.

With the level of detail you provided, it was not possible to determine 
whether "any access" referred to read or write operations.  If the disk 
can recover the sector(s) involved, and is not required to read from 
them first, a dd operation will put it in a position to do so.
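To illustrate what "not required to read from them first" means, here is a minimal sketch of a write-only overwrite of a byte range, roughly what dd does when fed zeroes. The function name `zero_range`, its parameters, and the 512-byte block size are assumptions made for illustration; try it on a scratch file, not a live disk, and note that a raw disk device would normally need sector-aligned offsets and lengths.

```python
import os

def zero_range(path, offset, length, block_size=512):
    """Overwrite `length` bytes at `offset` with zeroes without reading them.

    Hypothetical helper: the file is opened write-only, so nothing here
    asks for the old contents of the range being overwritten.
    """
    zeros = bytes(block_size)
    fd = os.open(path, os.O_WRONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        remaining = length
        while remaining > 0:
            chunk = min(block_size, remaining)
            # os.write may write fewer bytes than requested; count what landed.
            remaining -= os.write(fd, zeros[:chunk])
        os.fsync(fd)  # push the data out of the buffer cache
    finally:
        os.close(fd)
    return length
```

Whether the drive then reallocates the affected blocks is up to its firmware; the sketch only shows that the overwrite itself never has to issue a read.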

The fault is fairly likely related to scribble which occurred when you
powered down during a write operation.  You may have damaged
non-recoverable metadata in the process, and the drive may not handle
this case well.

It is also possible that the drive is taking an inordinate amount of 
time before returning an error, and the interrupt timeout is preempting 
this return (that isn't actually likely, given the drive status above).
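For reference, the status byte in the message above can be decoded by hand. The sketch below assumes the conventional ATA status register bit layout; the short labels are chosen to match the console message, and the table and function names are illustrative only, not taken from the driver source.

```python
# Conventional ATA status register bits, highest bit first.
# The label spellings are an assumption modelled on the console output.
STATUS_BITS = [
    (0x80, "busy"),
    (0x40, "rdy"),
    (0x20, "wrtflt"),
    (0x10, "seekdone"),
    (0x08, "drq"),
    (0x04, "ecc_cor"),
    (0x02, "index"),
    (0x01, "err"),
]

def decode_status(status):
    """Return the names of the bits set in an 8-bit status value."""
    return [name for mask, name in STATUS_BITS if status & mask]

# 0x58 = 0x40 + 0x10 + 0x08
print("status %x<%s>" % (0x58, ",".join(decode_status(0x58))))
# prints: status 58<rdy,seekdone,drq>
```

Neither busy nor err is set in 0x58, so the drive is not reporting that it is still working on the command, nor that it has failed it; that is presumably why a slow error return looks unlikely here.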

The fact that an unrecoverable disk error locks other parts of the
kernel is understandable, if not desirable.  There isn't a lot that can
trivially be done about this though.

> What I will probably end up having to do is repartition around that track.
> However, this seems like an unnecessarily crude solution to me, considering
> how minor the damage is.

Disk metadata damage that causes the drive firmware to fail doesn't 
strike me as "minor" in any common usage of the term.

-- 
\\  Sometimes you're ahead,       \\  Mike Smith
\\  sometimes you're behind.      \\  mike@smith.net.au
\\  The race is long, and in the  \\  msmith@freebsd.org
\\  end it's only with yourself.  \\  msmith@cdrom.com



