From owner-freebsd-bugs Wed Aug 13 05:40:07 1997
Return-Path:
Received: (from root@localhost) by hub.freebsd.org (8.8.5/8.8.5) id FAA14918 for bugs-outgoing; Wed, 13 Aug 1997 05:40:07 -0700 (PDT)
Received: (from gnats@localhost) by hub.freebsd.org (8.8.5/8.8.5) id FAA14894; Wed, 13 Aug 1997 05:40:02 -0700 (PDT)
Date: Wed, 13 Aug 1997 05:40:02 -0700 (PDT)
Message-Id: <199708131240.FAA14894@hub.freebsd.org>
To: freebsd-bugs
Cc:
From: Stefan Esser
Subject: Re: misc/4293: strang disk error messages
Reply-To: Stefan Esser
Sender: owner-freebsd-bugs@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

The following reply was made to PR misc/4293; it has been noted by GNATS.

From: Stefan Esser
To: daniels@media.mit.edu
Cc: FreeBSD-gnats-submit@freebsd.org, Stefan Esser
Subject: Re: misc/4293: strang disk error messages
Date: Wed, 13 Aug 1997 14:19:41 +0200

On Aug 13, daniels@media.mit.edu wrote:
> The disk is a 2GB Quantum (SCSI) running from a PCI SCSI controller.

What Quantum drive is that ?
They are of quite different quality ...

> Every few hours or days, a series of error messages about the disk
> (and maybe the controller) appear on the console. These messages last
> about 2 minutes, and then stop. During that time, user activity may
> freeze, but the Web server (the primary purpose of the system) seems
> to be running well. My preliminary deciphering of the error messages
> suggest something wrong with swap space (pager errors) but I can't
> really tell.

No, there is an error returned as a result of a disk request
from the VM system.

> Late last week, the computer lost power (as did most of Cambridage,
> Mass.) which may have contributed to the problem, which only surfaced
> over the weekend.

The problem did not exist before that power loss ?

> Here is a complete cycle of the /var/log/messages accounting of the
> problem:
>
> Aug 13 06:40:26 borg login: login on ttyv1 as daniels
> Aug 13 06:41:30 borg /kernel: ncr0: restart (ncr dead ?).
> Aug 13 06:44:13 borg /kernel: sd0(ncr0:0:0): UNIT ATTENTION asc:29,2

The drive returns a UNIT ATTENTION message with ASC=29 and ASCQ=2.
This is a little odd; ASC=29 and ASCQ=0 would have been expected ...

> Aug 13 06:44:13 borg /kernel: , retries:3
> Aug 13 06:44:14 borg /kernel: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
> Aug 13 06:44:15 borg /kernel: ncr0: restart (ncr dead ?).
> Aug 13 06:44:15 borg /kernel: ncr0: restart (ncr dead ?).
> Aug 13 06:44:19 borg /kernel: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
> Aug 13 06:44:19 borg /kernel: ncr0: restart (ncr dead ?).
> Aug 13 06:44:19 borg /kernel: sd0(ncr0:0:0): UNIT ATTENTION asc:29,2
> Aug 13 06:44:19 borg /kernel: , retries:1
> Aug 13 06:44:19 borg /kernel: sd0(ncr0:0:0): FAST SCSI-2 100ns (10 Mb/sec) offset 8.
> Aug 13 06:44:19 borg /kernel: pid 3577 (httpd), uid 65534: exited on signal 6

Hmmm, and the system recovers after some time ?

> >How-To-Repeat:
>
> Just wait a few hours.

Well, sorry, but this is not true. It may work if *you* wait
a few hours, but my system runs fine for however long I let
it ...

So, there must be some other problem. The first obvious
question is, of course, whether the drive worked fine up to
some external event (as opposed to a kernel rebuild :)

If you did not install a new kernel, then there is a high
probability that your drive is going bad. Did you check
whether it stops spinning during the time when those errors
are reported ?

There is a limited number of retries after a SCSI transfer
fails, but if a failure lasts for more than a few seconds,
then read errors will be returned to the application (which
may be the VM code in the kernel, as observed by you).

For now, I assume a hardware problem. Please let me know if
you know for sure that your hardware does not cause the
failure ...

Regards, STefan
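
As a side note on the ASC/ASCQ values discussed above, here is a minimal sketch of how the UNIT ATTENTION lines in such a log could be picked out and decoded. It assumes that the two numbers after "asc:" are printed in hexadecimal, and it only knows the ASC 29h entries of the standard SCSI additional sense code table. The function name decode_unit_attention and the table name ASC_29 are hypothetical, made up for this illustration; nothing here is part of the ncr or sd drivers.

```python
import re

# Subset of the standard SCSI additional sense code table for ASC 0x29.
# Only these entries are covered; everything else is reported as unknown.
ASC_29 = {
    0x00: "power on, reset, or bus device reset occurred",
    0x01: "power on occurred",
    0x02: "SCSI bus reset occurred",
    0x03: "bus device reset function occurred",
    0x04: "device internal reset",
}

# Matches e.g. "sd0(ncr0:0:0): UNIT ATTENTION asc:29,2".
# Assumption: both numbers are hexadecimal without a 0x prefix.
SENSE_RE = re.compile(
    r"(\w+)\(\w+:\d+:\d+\): UNIT ATTENTION asc:([0-9a-fA-F]+),([0-9a-fA-F]+)"
)


def decode_unit_attention(line):
    """Return (device, asc, ascq, meaning) or None if the line does not match."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    asc = int(m.group(2), 16)
    ascq = int(m.group(3), 16)
    if asc == 0x29:
        meaning = ASC_29.get(ascq, "unknown ASCQ for ASC 0x29")
    else:
        meaning = "ASC not covered by this sketch"
    return m.group(1), asc, ascq, meaning


if __name__ == "__main__":
    sample = "Aug 13 06:44:13 borg /kernel: sd0(ncr0:0:0): UNIT ATTENTION asc:29,2"
    print(decode_unit_attention(sample))
```

Under that table, ASCQ=2 reads as a bus reset reported by the drive, which may simply reflect the controller restarts logged just before it; that interpretation is an assumption and does not by itself say whether the drive or something else is at fault.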