From owner-freebsd-stable  Sat Aug  7  5:53:44 1999
Delivered-To: freebsd-stable@freebsd.org
Received: from corwin.nall.com (corwin.nall.com [216.30.44.163])
	by hub.freebsd.org (Postfix) with ESMTP id 4724214DC1
	for <freebsd-stable@FreeBSD.ORG>; Sat,  7 Aug 1999 05:53:41 -0700 (PDT)
	(envelope-from joe@nall.com)
Received: from nall.com (localhost [127.0.0.1]) by corwin.nall.com with ESMTP (8.7.1/8.7.1) id HAA08252; Sat, 7 Aug 1999 07:52:03 -0500 (CDT)
Message-ID: <37AC2BF2.C4C60F1E@nall.com>
Date: Sat, 07 Aug 1999 07:52:02 -0500
From: Joe Nall <joe@nall.com>
Organization: Nall Design Works
X-Mailer: Mozilla 4.6 [en] (X11; I; HP-UX B.10.26 9000/770)
X-Accept-Language: en
MIME-Version: 1.0
To: lweb Lightningweb <lightningweb@hotmail.com>
Cc: freebsd-stable@FreeBSD.ORG
Subject: Re: continued crashes with 3.1-Stable
References: <19990807033241.17071.qmail@hotmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

lweb Lightningweb wrote:
> 
> One suggestion was to fix the "pthreads library," which we did.  The other
> was: "You may have hardware problems."
> ...
> We have replaced drives in the RAID array, we are now replacing drive
> caddies.  Next step I think will be the RAID controller.  I have a strong
> gut feeling that it is software however.  There's nothing to substantiate
> this, except that more often than not, the crash happens during an
> MySQL query.
> ...

> (da0:dpt0:0:0:0): Invalidating pack
> biodone: buffer already done
> spec_getpages: I/O read failure: (error code=6)
>                size: 32768, resid: 32768, a_count: 32768, valid: 0x0
>                nread: 0, reqpage: 0, pindex: 0, pcount: 8
> 
> Everyone please take a second look at this and help us brainstorm the
> problem?  I am including a list of the hardware, the original message we
> sent to the list, and a recent dmesg:
> 
> FreeBSD 3.1-STABLE #1
> Dual-Proc PII 450
> 512MB RAM
> DPT PM334UW RAID controller
> - 16MB RAM
> - dual bus Ultra Wide
> - Six 9.1GB Quantum VikingII SCSI3 U2W drives
> - Three drives per bus, RAID5, one drive is hot-spare
> Intel EtherExpress Pro 10/100B Ethernet
> TOSHIBA CD-ROM XM-6201TA

 Don't discount the hardware problem response.  We use big (200GB+)
Winchester Systems RAID arrays on production HP-UX servers at work.
These boxes run a custom, modified OS that we were blaming for random,
very painful crashes with occasional data corruption on a JFS
filesystem. One of the symptoms of these crashes was the lack of crash
dumps.  Two weeks ago we found out that the firmware installed in the
dual redundant controllers (nothing but the best :) had known problems
with symptoms similar to ours, and that we should upgrade.  The lack of
crash dumps should have been an earlier clue that there were disk problems.

 The "Invalidating pack" error comes from the SCSI CAM disk driver in
"src/sys/cam/scsi/scsi_da.c" and occurs when there has been a
catastrophic error (quoting from the code). The error returned from the
driver is ENXIO, which is errno 6 and matches the "error code=6" in your
spec_getpages message. It appears that your DPT is dropping a SCSI LUN
offline.
So far my FreeBSD servers have been exactly as reliable as my hardware.

Good Luck,
Joe


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message