Date:      Fri, 15 Sep 2000 00:21:35 -0600
From:      "Kenneth D. Merry" <ken@kdm.org>
To:        Jean-Francois Dockes <jean-francois.dockes@wanadoo.fr>
Cc:        freebsd-stable@FreeBSD.ORG
Subject:   Re: SCSI retries without errors in /var/log/messages?
Message-ID:  <20000915002135.A83469@panzer.kdm.org>
In-Reply-To: <14784.33648.251152.511680@localhost.dockes.com>; from jean-francois.dockes@wanadoo.fr on Thu, Sep 14, 2000 at 09:51:12AM +0200
References:  <20000911162530.34FC599C8C@waltz.rahul.net> <20000911130644.A50024@panzer.kdm.org> <14784.33648.251152.511680@localhost.dockes.com>

On Thu, Sep 14, 2000 at 09:51:12 +0200, Jean-Francois Dockes wrote:
> Kenneth D. Merry writes:
>  > SCSI errors that we recover from aren't logged.  
> 
> I think that most SCSI retries should be logged.
> 
> Except in some specific cases of retries after a unit attention
> condition, retries are usually indicative of hardware trouble.

There are other reasons for retries, like medium not present, and tape
drives use error conditions to report a lot of routine status.  (See
saerror() in sys/cam/scsi/scsi_sa.c.)

We already have something similar turned on for bootverbose (it logs error 
messages that we wouldn't otherwise log), but people sometimes get confused
and concerned when they see the additional error messages.

They (not unreasonably) think something is wrong with their hardware, when 
in fact they're just seeing normal error messages.  (Like devices that     
don't support the serial number inquiry, CDROM drives without media 
present, etc.)    

In any case, if you want to see error messages, even for retried commands,
boot with -v and comment out the print_sense line in the following code
from scsi_interpret_sense() in sys/cam/scsi/scsi_all.c:

	default:
		/* decrement the number of retries */
		retry = ccb->ccb_h.retry_count > 0;
		if (retry) {
			ccb->ccb_h.retry_count--;
			error = ERESTART;
			print_sense = FALSE;
		} else 
			error = EIO;
		break;
	}
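
To make the effect of that one line concrete, here is a toy userland model
of the default case above.  It is only a sketch, not the kernel code: the
struct, the decide_retry() name and the log_retries flag are made up for
illustration, and it assumes print_sense starts out true before this case
runs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* ERESTART is not always visible to userland; use a stand-in if absent. */
#ifndef ERESTART
#define ERESTART (-1)
#endif

/* Hypothetical result type, for illustration only. */
struct retry_decision {
	int	error;		/* ERESTART to retry, EIO to give up */
	bool	print_sense;	/* would sense data be printed? */
};

/*
 * Model of the default case.  log_retries == true corresponds to
 * commenting out the "print_sense = FALSE" line.
 */
static struct retry_decision
decide_retry(int *retry_count, bool log_retries)
{
	/* Assumption: sense printing is on unless this case turns it off. */
	struct retry_decision d = { EIO, true };

	assert(retry_count != NULL);
	if (*retry_count > 0) {
		(*retry_count)--;		/* one fewer retry left */
		d.error = ERESTART;
		d.print_sense = log_retries;	/* quiet unless asked */
	}
	return (d);
}
```

With log_retries false, a command that still has retries left is restarted
silently; only the final failure (EIO) prints sense data.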

> Don't most devices already use a number of internal retries which is
> appropriate when they are healthy ? When external retries become
> necessary and frequent, the situation is bad, and subsequent failures
> are quite probable.

Yeah, most devices have internal retry mechanisms, and often error
correction mechanisms.

> Better to be warned earlier. 
> 
>  (There are also the retries caused by scsi protocol problems - bad
>    bus - but these are usually followed by a bus reset which is logged ?)
> 
> And, by the way, 'recovered errors' sense keys (problems solved
> internally by the device) should also be logged for the same reason,
> only more benign, (but I'm not too sure that many devices actually
> generate these).

I think there are better ways than printing out sense information to figure
out if a device is going bad.  Disk drives and tape drives keep statistics
in log pages, many of which could be monitored to see if they exceed a
certain threshold, or change at a certain rate.  Then the administrator
could be notified of the problem.
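
As a rough sketch of the threshold-or-rate idea, the check itself can be
very small.  Everything here is hypothetical (the struct names, the
counter_needs_alert() function, and the units); it assumes you already
have two samples of some cumulative error counter taken at known times.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One reading of a cumulative error counter (hypothetical layout). */
struct counter_sample {
	uint64_t	value;		/* counter value from a log page */
	uint64_t	seconds;	/* when the sample was taken */
};

/* Limits chosen by the administrator (hypothetical). */
struct alert_limits {
	uint64_t	max_value;	/* alert when counter exceeds this */
	double		max_per_hour;	/* alert when growth exceeds this */
};

/*
 * Return true if the counter is over the absolute threshold, or is
 * growing faster than the allowed rate between the two samples.
 */
static bool
counter_needs_alert(struct counter_sample prev, struct counter_sample cur,
    struct alert_limits lim)
{
	double hours, rate;

	if (cur.value > lim.max_value)
		return (true);
	/* No usable interval, or the counter was reset: no rate check. */
	if (cur.seconds <= prev.seconds || cur.value < prev.value)
		return (false);
	hours = (double)(cur.seconds - prev.seconds) / 3600.0;
	rate = (double)(cur.value - prev.value) / hours;
	return (rate > lim.max_per_hour);
}
```

Sensible limits depend entirely on the device and the counter being
watched, so treat the numbers as something to tune rather than a
recommendation.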

You could write a script using camcontrol to dump the log pages, or a small
C program to do the same.
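
For the C route, the part that doesn't depend on how you fetch the data is
decoding the page.  Below is a minimal sketch that walks a LOG SENSE
response already sitting in a buffer, assuming the usual layout (4-byte
page header with a big-endian page length, then parameters that each have
a 2-byte code, a control byte and a 1-byte length).  The log_param struct
and parse_log_page() are made-up names, and it treats every value of 8
bytes or less as a big-endian integer, which won't be right for every
page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded log parameter (hypothetical type). */
struct log_param {
	uint16_t	code;	/* parameter code */
	uint64_t	value;	/* value, read as a big-endian integer */
};

/*
 * Walk a log page in buf[0..len).  Stores up to max_out parameters in
 * out[] and returns how many were stored, or -1 if the page is
 * truncated or malformed.  Parameters longer than 8 bytes are skipped.
 */
static int
parse_log_page(const uint8_t *buf, size_t len, struct log_param *out,
    size_t max_out)
{
	size_t end, i, off, plen, n = 0;
	uint64_t v;

	assert(buf != NULL && out != NULL);
	if (len < 4)
		return (-1);
	/* Bytes 2-3: number of bytes following the 4-byte header. */
	end = 4 + (((size_t)buf[2] << 8) | buf[3]);
	if (end > len)
		return (-1);
	for (off = 4; off < end && n < max_out; off += 4 + plen) {
		if (end - off < 4)
			return (-1);
		plen = buf[off + 3];
		if (plen > end - off - 4)
			return (-1);
		if (plen > 8)
			continue;	/* too wide for a 64-bit value */
		v = 0;
		for (i = 0; i < plen; i++)
			v = (v << 8) | buf[off + 4 + i];
		out[n].code = (uint16_t)((buf[off] << 8) | buf[off + 1]);
		out[n].value = v;
		n++;
	}
	return ((int)n);
}
```

Feed it the bytes however you like (a camcontrol dump, or a pass-through
CCB from your own program) and compare the values between runs.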

Another thing you can do with disks is monitor the number of grown defects.
(Assuming you've got read and write reallocation turned on.)

As far as recovered errors go, I've seen disks return them when they
automatically reallocate a block.

Ken
-- 
Kenneth Merry
ken@kdm.org

