Date:      Sat, 1 Feb 1997 06:24:59 -0800
From:      Don Lewis <Don.Lewis@tsc.tdk.com>
To:        joerg_wunsch@uriah.heep.sax.de (Joerg Wunsch), Don.Lewis@tsc.tdk.com (Don Lewis)
Cc:        freebsd-fs@freebsd.org, freebsd-scsi@freebsd.org
Subject:   Re: SCSI disk MEDIUM ERROR with a few twists
Message-ID:  <199702011424.GAA28908@salsa.gv.tsc.tdk.com>
In-Reply-To: j@uriah.heep.sax.de (J Wunsch) "Re: SCSI disk MEDIUM ERROR with a few twists" (Feb  1,  2:29pm)

On Feb 1,  2:29pm, J Wunsch wrote:
} Subject: Re: SCSI disk MEDIUM ERROR with a few twists
} As Don Lewis wrote:
} 
} > 	/etc/daily doesn't report this
} 
} (and others don't report this)
} 
} Of course.  That's because buffered writes cannot report media errors
} to their caller.  The caller has already got an OK indication about
} the write operation, when the device driver finally notices the write
} error.  All the driver can do at this point is syslogging the problem.

Yes, but this is the "unrecovered read error" so often mentioned in the
freebsd-scsi mail archive.  Also, tar and dump were definitely reading
it.  INN was probably doing both.

} You ought to check your syslog regularly.  The easiest way is to drop
} it onto all your logged in terminals :) (seriously, i do).

A syslog scanner is on my list of things to do.

} > It could be the filesystem, the SCSI driver, or the drive firmware.
} 
} It could be the drive itself.

The MEDIUM ERROR itself and the falling offline a week or so later
are definitely the fault of the drive.  That the error wasn't reported
to userland lies somewhere between the driver and userland, inclusive.

} What MEDIUM ERRORs are these?  You forgot to quote the most important
} thing, the driver message.

Ok, here it is:

Jan 18 04:30:33 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:34 news /kernel: , retries:4
Jan 18 04:30:35 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11 
Jan 18 04:30:35 news /kernel: , retries:3
Jan 18 04:30:36 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:38 news /kernel: , retries:2
Jan 18 04:30:42 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:42 news /kernel: , retries:1
Jan 18 04:30:43 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11
Jan 18 04:30:44 news /kernel: , FAILURE

Always the same info:#.
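
A rough sketch of how a syslog scanner could check this automatically, by grouping MEDIUM ERROR reports by device and info value. The regular expression is an assumption based only on the lines quoted above, not on the driver source, and reading the info field as a hexadecimal block number is likewise an assumption. The function name scan and the SAMPLE list are made up for illustration.

```python
import re
from collections import Counter

# Message layout inferred from the quoted log lines only (an assumption).
MEDIUM_ERROR = re.compile(
    r"(?P<dev>sd\d+)\(\w+:\d+:\d+\): MEDIUM ERROR info:(?P<info>[0-9a-f]+)"
)

def scan(lines):
    """Count MEDIUM ERROR reports per (device, info) pair."""
    counts = Counter()
    for line in lines:
        m = MEDIUM_ERROR.search(line)
        if m:
            # Assumption: info is printed in hex, so parse it base 16.
            counts[(m.group("dev"), int(m.group("info"), 16))] += 1
    return counts

# Three of the lines quoted above; the "retries" line must be ignored.
SAMPLE = [
    "Jan 18 04:30:33 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a "
    "asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11",
    "Jan 18 04:30:34 news /kernel: , retries:4",
    "Jan 18 04:30:35 news /kernel: sd0(ahc0:0:0): MEDIUM ERROR info:14683a "
    "asc:11,0 Unrecovered read error field replaceable unit: ea sks:80,11",
]

result = scan(SAMPLE)
for (dev, info), n in sorted(result.items()):
    print(f"{dev}: info {info:#x} reported {n} time(s)")
```

A single key in the result means every report points at the same info value; more than one key would suggest the damage is spreading.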

} > I don't know whether the SCSI code isn't reporting this to the filesystem,
} > or the filesystem isn't reporting this to userland code, but dump didn't
} > seem to see a problem, tar didn't seem to see a problem.
} 
} It's interesting to know that dump didn't see the problem, since dump
} operates on the raw device, where error reporting is possible.  Are
} you sure they were _unrecovered_ medium errors, i.e. the kernel didn't
} successfully retry them?  Again, please *quote* the error messages,
} instead of assuming we know them.

Actually I'm not sure if it was recovered or not when I ran dump.  I
was running in single-user mode at the time, so it was not logged.  It was the
same basic message, but I don't remember if it got all the way to FAILURE.
I didn't decide that I should report this until I had seen how badly the
filesystem *appeared* to have been munched by what turned out to be one
bad sector.  By that time, the sector had been remapped and I could no
longer reproduce the problem.

I also can't quote messages from its death throes before it wedged,
because this disk also contains /var and nothing was syslogged until
after I got the machine running multi-user again.  I *think* the message
was: "Logical unit is in process of becoming ready", but if so it was
lying.

} > Before replacing the drive, I decided to run the Adaptec disk verification.
} > It found a grand total of one bad sector and remapped it.  The only
} > remaining damage was that fsck had deleted my newsgroups file and
} > history.pag had one formerly bad sector.  Since the disk didn't appear
} > to be hopeless, I replaced the newsgroups file and rebuilt history.pag,
} > and things have been working flawlessly ever since.
} 
} I wouldn't use that disk for serious work again.  It's certainly good
} for storing news articles, but no longer reliable enough for storing
} your history database there.

If it were more than one sector it would already be gone, but in this
case I'm going to leave it running and keep a very close eye on it.
It gave me at least two weeks warning last time.  If it gets sick again,
then I can at least file a more complete report ;-)  Are there any
experiments you want me to try?

} Also, go through SCSI reformatting it.  This will cause the drive to
} recreate the bad sector table as necessary.  You can even do this
} without using the adapter BIOS, there's always /sbin/scsiformat for
} this.

The painful part is that this is the root disk, and I'm pretty sure the
2.1.x fixit disk doesn't contain scsiformat.  Doesn't remapping the sector
add the original to the drive's grown defect list?

			---  Truck


