Date:      Sun, 25 Feb 1996 07:00:19 -0500 (EST)
From:      Peter Dufault <dufault@hda.com>
To:        msmith@atrad.adelaide.edu.au (Michael Smith)
Cc:        hsu@freefall.freebsd.org, hackers@freefall.freebsd.org
Subject:   Re: wierd scsi error message
Message-ID:  <199602251200.HAA28140@hda.com>
In-Reply-To: <199602250046.LAA28298@genesis.atrad.adelaide.edu.au> from "Michael Smith" at Feb 25, 96 11:16:49 am

> 
> Jeffrey Hsu stands accused of saying:
> > 
> > Okay, now that I know what the message means, the $24,000 question is how
> > do I identify the file for which this write failed?
> > 
> > Feb 24 02:24:15 armour /kernel: sd0(uha0:0:0): Deferred Error: HARDWARE FAILURE 
> > info:327900 asc:3,0 Peripheral device write fault field replaceable unit: 11 sks:80,19
> 
> The general procedure looks like this :
> 
>  - obtain the SCSI-2 spec. (gatekeeper.dec.com has a copy somewhere)
>  - Find out what the 'deferred error' message looks like
>  - look at, and maybe modify the code to emit the block number (if one is
>    part of the message)

Hi - I'm back.  Sorry I missed the start of this thread.

The appropriate section of the SCSI-II spec is "7.2.14.2 Deferred
Errors".

This deferred error is particularly nasty in that it is an error
returned for "a previous command for which GOOD status has already
been returned", that is, the disk lied and told us it had successfully
finished a write and now we find out it did not.  This could happen
when the disk transferred the block to its cache, said it was done,
and then failed the transfer from the cache to the disk, maybe
because it has a bad cache area (or whatever FRU 11 is).  In this
case I suggest turning off the write cache to see whether the
error goes away.  This is the WCE field in mode page 8.
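
As a rough illustration of the WCE suggestion, here is a minimal Python
sketch that inspects and clears the bit in a raw caching page.  It assumes
the SCSI-2 caching page layout as I understand it (page code 08h in the low
six bits of byte 0, WCE as bit 2 of byte 2) and that the buffer holds only
the page itself, with the mode parameter header and block descriptors
already stripped.  The function names are hypothetical, and how the page is
fetched and written back (MODE SENSE / MODE SELECT) is left out entirely.

```python
CACHING_PAGE_CODE = 0x08  # mode page 8, the caching page
WCE_BIT = 0x04            # assumed position: bit 2 of byte 2


def _check_caching_page(page: bytes) -> None:
    """Reject buffers that do not look like a caching page."""
    if len(page) < 3:
        raise ValueError("page too short to contain the WCE byte")
    if (page[0] & 0x3F) != CACHING_PAGE_CODE:
        raise ValueError("not mode page 8")


def write_cache_enabled(page: bytes) -> bool:
    """Return True if the WCE bit is set in the given caching page."""
    _check_caching_page(page)
    return bool(page[2] & WCE_BIT)


def with_write_cache_disabled(page: bytes) -> bytes:
    """Return a copy of the page with WCE cleared, other bits untouched."""
    _check_caching_page(page)
    out = bytearray(page)
    out[2] &= ~WCE_BIT & 0xFF
    return bytes(out)


# Made-up example page: PS set, page code 8, length 0Ah, WCE and RCD set.
SAMPLE_PAGE = bytes([0x88, 0x0A, 0x05] + [0] * 9)
```

Only the one bit is changed so that whatever else the drive reports in the
page goes back unmodified.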

There is an interesting implementor's note indicating that you want
to use synchronizing commands (I assume they mean "synchronize
cache") at certain places in the driver to ensure the data is
actually transferred to disk.  I suppose this has to be coordinated
with the FS code.
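
For reference, a sketch of what a "synchronize cache" command block might
look like, assuming the 10-byte SCSI-2 form (opcode 35h, 32-bit logical
block address, 16-bit block count, where a count of zero means "through the
end of the medium").  The helper name is hypothetical, and where the driver
would actually issue this is exactly the open question above.

```python
import struct

SYNCHRONIZE_CACHE = 0x35  # SCSI-2 opcode, 10-byte CDB
IMMED_BIT = 0x02          # return status before the sync completes


def synchronize_cache_cdb(lba: int = 0, num_blocks: int = 0,
                          immed: bool = False) -> bytes:
    """Build a 10-byte SYNCHRONIZE CACHE command descriptor block."""
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("lba must fit in 32 bits")
    if not 0 <= num_blocks <= 0xFFFF:
        raise ValueError("num_blocks must fit in 16 bits")
    flags = IMMED_BIT if immed else 0
    # opcode, flags, LBA (big-endian), reserved, block count, control
    return struct.pack(">BBIBHB", SYNCHRONIZE_CACHE, flags,
                       lba, 0, num_blocks, 0)
```

Leaving IMMED clear is the point here: the command should not report GOOD
until the data is really on the medium.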

For a disk, the "info" field should be the number of the block in
error.  Does that value make sense?  Can you tickle the raw device
to reproduce the error?  Does the error follow the block number?
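
To make the "info" field concrete, here is a small sketch that pulls it out
of fixed-format sense data, assuming the SCSI-2 layout as I recall it:
error code 70h (current) or 71h (deferred) in byte 0 with the valid bit on
top, sense key in byte 2, a big-endian information field in bytes 3-6,
ASC/ASCQ in bytes 12-13 and the FRU code in byte 14.  The sample buffer is
made up to resemble the logged message; in particular, reading "327900" and
"11" as hex is an assumption, as is the 512-byte block size used for the
raw-device offset.

```python
import struct


def decode_fixed_sense(sense: bytes) -> dict:
    """Decode the fields of interest from fixed-format sense data."""
    if len(sense) < 18:
        raise ValueError("need at least 18 bytes of sense data")
    error_code = sense[0] & 0x7F
    if error_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    (info,) = struct.unpack_from(">I", sense, 3)
    return {
        "deferred": error_code == 0x71,
        "info_valid": bool(sense[0] & 0x80),
        "sense_key": sense[2] & 0x0F,
        "info": info,
        "asc": sense[12],
        "ascq": sense[13],
        "fru": sense[14],
    }


def block_to_byte_offset(block: int, block_size: int = 512) -> int:
    """Byte offset of a block on the raw device (block size assumed)."""
    return block * block_size


# Synthetic sense buffer, not captured from the failing drive.
SAMPLE_SENSE = bytearray(18)
SAMPLE_SENSE[0] = 0x80 | 0x71                     # valid, deferred error
SAMPLE_SENSE[2] = 0x04                            # sense key: hardware error
SAMPLE_SENSE[3:7] = struct.pack(">I", 0x327900)   # information field
SAMPLE_SENSE[7] = 10                              # additional sense length
SAMPLE_SENSE[12] = 0x03                           # ASC: write fault
SAMPLE_SENSE[13] = 0x00                           # ASCQ
SAMPLE_SENSE[14] = 0x11                           # FRU code
```

If the information field is only meaningful when the valid bit is set, the
caller should check "info_valid" before trusting "info" as a block number.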

-- 
Peter Dufault               Real-Time Machine Control and Simulation
HD Associates, Inc.         Voice: 508 433 6936
dufault@hda.com             Fax:   508 433 5267
