Date:      Wed, 25 Feb 1998 22:50:13 -0700 (MST)
From:      "Justin T. Gibbs" <gibbs@narnia.plutotech.com>
To:        Karl Denninger  <karl@mcs.net>
Cc:        hackers@FreeBSD.ORG
Subject:   Re: SCSI Sense ASC 11, ASCQ 0x0c - Unrecovered read errors
Message-ID:  <199802260550.WAA19882@narnia.plutotech.com>
In-Reply-To: <19980224105842.07731@mcs.net>

In article <19980224105842.07731@mcs.net> you wrote:
> Hi folks,
> 
> I have a question...
> 
> Right now, as the driver stands, if you get a sense return on a disk of
> 0x11,0x0c ("Unrecovered read error - recommend rewrite the block"), the
> driver does not attempt to do anything about it.
> 
> Why?
> 
> You're screwed in this case - the data is gone.  But, some RAID controllers
> (notably the CMD adapters) will *FIX* such an error if you write back to the
> block.

Is the data really gone?  Isn't that for the user to decide?  I've known
disks to report temporary media errors that "disappear" after they are
moved, the temperature changes, or the moon goes full.

> Here's the scenario:
> 
> 1)	You have a failure on a data drive.  It gets reported back with
> 	sense ASC 0x11, 0x0c.
> 
> 2)	The driver does not attempt to do anything other than report the
> 	error.  

I don't believe that it is the driver's responsibility to take action
in this case.

> This sounds like bogus behavior to me.  Here's why:
> 
> 	You've ALREADY lost the data.

This is arguable.

>	There is no harm in trying to 
> 	"fix it".  Thus, why not do the following:

How does the driver know what it means to fix it?  If the bad block
is in the MBR, the system may well have this information somewhere
in core to restore the data.  If the data is in the filesystem,
writing one pattern might cause the FS to crash the kernel or
confuse fsck, while another may minimize damage.  If the "client"
of the driver is not going to get the data it expects, an error
should be returned, period.
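
The position above can be sketched in C. This is only an illustration,
not the FreeBSD driver's code: `struct sense_info`, the `SK_*` macros and
`read_sense_to_errno()` are hypothetical names, and the sense data is
assumed to be already decoded into key/ASC/ASCQ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of decoded SCSI sense data. */
struct sense_info {
    uint8_t sense_key;  /* e.g. 0x03 = MEDIUM ERROR */
    uint8_t asc;        /* additional sense code */
    uint8_t ascq;       /* additional sense code qualifier */
};

#define SK_NO_SENSE        0x00
#define SK_RECOVERED_ERROR 0x01
#define SK_MEDIUM_ERROR    0x03

/*
 * Translate the sense data of a completed READ into an errno.
 * Deliberately no reassign and no rewrite here: the driver cannot know
 * what the block ought to contain, so it only reports.
 */
int read_sense_to_errno(const struct sense_info *s)
{
    assert(s != NULL);
    if (s->sense_key == SK_NO_SENSE || s->sense_key == SK_RECOVERED_ERROR)
        return 0;   /* the client received the data it asked for */
    return EIO;     /* includes ASC 0x11 / ASCQ 0x0c */
}
```

With these assumptions, a MEDIUM ERROR carrying ASC 0x11 / ASCQ 0x0c
yields EIO, and whatever issued the read decides what happens next.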

> 	a)	Attempt a forced reassign of the block.
> 	b)	If that FAILS, write zeros into the block.
> 
> Why do these things you ask?  Simple:
> 
> 1)	The error, if repeated (or even singly) may cause a panic.  If it's
> 	in a swap area, for example, you're screwed - you're probably
> 	reading back a page of an executable from the paging space, and if
> 	it's corrupted you're going down.

The swap pager should terminate the program(s) needing that block
if it receives an I/O error.  This should not panic the system.

If, on the other hand, you remap the block and silently return garbage
data, you may well cause behavior that is unrecoverable.

> 2)	If it's a data file you MIGHT die.  There's no way to know.

And the FS may be able to clean up its data structures to minimize
the effect of a missing/corrupted block of data if you tell it that
the read operation failed.  If you remap it and return garbage, who
knows what will happen.

> 3)	IF YOU DON'T "FIX" IT, YOU WILL GET KILLED EVENTUALLY.

This need not be the case.

> With a regular disk, (a) above will succeed.  You may still crash, but at
> least you should come back up.  If the data was a file, it's gone anyway -
> likewise for a directory.  There is no harm in trying to prevent FUTURE
> errors at that point.

I have no problem with the client of the data taking some action to clear
an I/O error.  There may even need to be an additional API to do this,
but the disk driver does not have sufficient information to make the
decision on how to perform that recovery.  The only safe thing is to
report the error until some external action is taken.
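
A client-side decision of this kind might look roughly like the sketch
below. It assumes the client sees a failed read as EIO and may or may not
hold a known-good copy of the block. `enum recovery_action` and
`choose_recovery()` are hypothetical names for illustration, not an
existing API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcomes a client of the disk driver might choose from. */
enum recovery_action {
    RECOVERY_NONE,          /* read succeeded, nothing to do */
    RECOVERY_REWRITE_COPY,  /* client holds a good copy: write it back */
    RECOVERY_REPORT_ONLY    /* no good copy: keep reporting the error */
};

/*
 * Decide what to do after a read completes with errno-style status
 * read_errno. Only the client knows whether it still has the correct
 * contents (for example, an in-core copy), so the choice is made here
 * and not in the driver.
 */
enum recovery_action choose_recovery(int read_errno, const void *known_good_copy)
{
    assert(read_errno >= 0);
    if (read_errno == 0)
        return RECOVERY_NONE;
    if (read_errno == EIO && known_good_copy != NULL)
        return RECOVERY_REWRITE_COPY;
    return RECOVERY_REPORT_ONLY;
}
```

The point of the split is that rewriting happens only when the caller can
supply the real data; otherwise the error simply stays visible.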

If the system is not properly dealing with EIO conditions, that is
certainly a bug, but your suggested fix is not a correct solution.

> -- 
> Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
> http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
> 			     | NEW! K56Flex support on ALL modems
> Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
> Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

--
Justin
