From owner-freebsd-bugs  Wed Aug 13 05:40:07 1997
Return-Path: <owner-freebsd-bugs>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id FAA14918
          for bugs-outgoing; Wed, 13 Aug 1997 05:40:07 -0700 (PDT)
Received: (from gnats@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id FAA14894;
          Wed, 13 Aug 1997 05:40:02 -0700 (PDT)
Date: Wed, 13 Aug 1997 05:40:02 -0700 (PDT)
Message-Id: <199708131240.FAA14894@hub.freebsd.org>
To: freebsd-bugs
Cc: 
From: Stefan Esser <se@FreeBSD.ORG>
Subject: Re: misc/4293: strang disk error messages
Reply-To: Stefan Esser <se@FreeBSD.ORG>
Sender: owner-freebsd-bugs@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

The following reply was made to PR misc/4293; it has been noted by GNATS.

From: Stefan Esser <se@FreeBSD.ORG>
To: daniels@media.mit.edu
Cc: FreeBSD-gnats-submit@freebsd.org, Stefan Esser <se@freebsd.org>
Subject: Re: misc/4293: strang disk error messages
Date: Wed, 13 Aug 1997 14:19:41 +0200

 On Aug 13, daniels@media.mit.edu wrote:
 > The disk is a 2GB Quantum (SCSI) running from a PCI SCSI controller.
 
 Which Quantum drive model is that?
 Their quality varies quite a bit between models ...
 
 > Every few hours or days, a series of error messages about the disk
 > (and maybe the controller) appear on the console. These messages last
 > about 2 minutes, and then stop. During that time, user activity may
 > freeze, but the Web server (the primary purpose of the system) seems
 > to be running well. My preliminary deciphering of the error messages
 > suggest something wrong with swap space (pager errors) but I can't
 > really tell.
 
 No, what you see is an error returned for
 a disk request issued by the VM system.
 
 > Late last week, the computer lost power (as did most of Cambridage,
 > Mass.) which may have contributed to the problem, which only surfaced
 > over the weekend.
 
 So the problem did not exist before that power loss?
 
 > Here is a complete cycle of the /var/log/messages accounting of the
 > problem:
 > 
 > Aug 13 06:40:26 borg login: login on ttyv1 as daniels
 > Aug 13 06:41:30 borg /kernel: ncr0: restart (ncr dead ?).
 > Aug 13 06:44:13 borg /kernel: sd0(ncr0:0:0): UNIT ATTENTION asc:29,2 
 
 The drive returns a UNIT ATTENTION condition with
 ASC=29 and ASCQ=2. This is a little odd; ASC=29
 with ASCQ=0 would have been expected ...
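 
 As a rough illustration of the comparison above, here is a minimal
 sketch that pulls the sense key, ASC and ASCQ out of such a log line
 and checks them against the expected 29/0 pair. The function names
 are hypothetical, and it assumes the kernel prints both values in hex.

```python
import re

# Matches the sense part of a log line such as
# "sd0(ncr0:0:0): UNIT ATTENTION asc:29,2".
# Assumption: both numbers after "asc:" are hexadecimal.
SENSE_RE = re.compile(r":\s*([A-Z][A-Z ]*?)\s+asc:([0-9a-fA-F]+),([0-9a-fA-F]+)")


def parse_sense(line):
    """Return (sense_key, asc, ascq) or None if the line has no sense data."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def is_expected_reset(asc, ascq):
    """True for ASC 29h / ASCQ 00h, which SCSI-2 defines as
    "power on, reset, or bus device reset occurred"."""
    return asc == 0x29 and ascq == 0x00


line = "Aug 13 06:44:13 borg /kernel: sd0(ncr0:0:0): UNIT ATTENTION asc:29,2 "
key, asc, ascq = parse_sense(line)
print(key, hex(asc), hex(ascq), is_expected_reset(asc, ascq))
```

 For the line quoted in the report this flags the ASCQ as not matching
 the expected value.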
 
 > Aug 13 06:44:13 borg /kernel: , retries:3
 > Aug 13 06:44:14 borg /kernel: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
 > Aug 13 06:44:15 borg /kernel: ncr0: restart (ncr dead ?).
 > Aug 13 06:44:15 borg /kernel: ncr0: restart (ncr dead ?).
 
 > Aug 13 06:44:19 borg /kernel: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
 > Aug 13 06:44:19 borg /kernel: ncr0: restart (ncr dead ?).
 > Aug 13 06:44:19 borg /kernel: sd0(ncr0:0:0): UNIT ATTENTION asc:29,2 
 > Aug 13 06:44:19 borg /kernel: , retries:1
 > Aug 13 06:44:19 borg /kernel: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
 > Aug 13 06:44:19 borg /kernel: pid 3577 (httpd), uid 65534: exited on signal 6
 
 Hmmm, and the system recovers after some time?
 
 > >How-To-Repeat:
 > 
 > Just wait a few hours.
 
 Well, sorry, but that does not reproduce it for me.
 It may work if *you* wait a few hours, but my system
 runs fine for however long I leave it running ...
 
 So there must be some other problem. The first
 obvious question is, of course, whether the drive
 worked fine up to some external event (as opposed
 to a kernel rebuild :)
 
 If you did not install a new kernel, then there
 is a high probability that your drive is going
 bad. Did you check whether it stops spinning
 while those errors are being reported?
 
 There is a limited number of retries after a
 SCSI transfer fails, but if a failure persists
 for more than a few seconds, then read errors
 are returned to the caller (which may be the
 VM code in the kernel, as you observed).
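 
 The retry behaviour described above can be sketched roughly as
 follows. This is only an illustration, not the actual driver code:
 the names, the retry count of 4 and the simulated device are all
 assumptions made for the example.

```python
import errno


class TransferError(Exception):
    """Raised by the simulated device when a transfer fails."""


def make_flaky_device(failures):
    """Return a transfer function that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def do_transfer():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransferError("unit attention")
        return b"data"

    return do_transfer


def read_with_retries(do_transfer, max_retries=4):
    """Try the transfer once plus up to max_retries retries.

    If every attempt fails, the error is handed back to the caller
    as an I/O error instead of being retried forever.
    """
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return do_transfer()
        except TransferError as err:
            last_error = err
    raise OSError(errno.EIO, "I/O error after retries") from last_error
```

 A short outage (fewer failures than retries) stays invisible to the
 caller, while a longer one surfaces as an I/O error, which matches
 the pager errors seen in the report.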
 
 For now, I assume a hardware problem. Please let
 me know if you are sure that your hardware
 is not causing the failure ...
 
 Regards, STefan