From owner-freebsd-scsi Thu Aug 24 19:07:06 2000
Delivered-To: freebsd-scsi@freebsd.org
Received: from wantadilla.lemis.com (wantadilla.lemis.com [192.109.197.80])
	by hub.freebsd.org (Postfix) with ESMTP id 6796337B422
	for ; Thu, 24 Aug 2000 19:07:01 -0700 (PDT)
Received: (from grog@localhost)
	by wantadilla.lemis.com (8.9.3/8.9.3) id LAA47231;
	Fri, 25 Aug 2000 11:36:38 +0930 (CST)
	(envelope-from grog)
Date: Fri, 25 Aug 2000 11:36:38 +0930
From: Greg Lehey
To: David Gilbert
Cc: freebsd-scsi@FreeBSD.ORG
Subject: Re: Vinum 29160 detaches drives, invalidates RAID.
Message-ID: <20000825113638.D39208@wantadilla.lemis.com>
References: <14757.14569.732766.367692@trooper.velocet.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
In-Reply-To: <14757.14569.732766.367692@trooper.velocet.net>; from dgilbert@velocet.ca on Thu, Aug 24, 2000 at 11:02:01AM -0400
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.lemis.com/~grog
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

[dropping -stable]

On Thursday, 24 August 2000 at 11:02:01 -0400, David Gilbert wrote:
> I have a vinum test box with 8x 18G drives driven by 2x 29160
> controllers, running 4.1-RELEASE.  The machine also has a 9G root
> SCSI drive that is on a non-LVD connector of the 29160 card.  I'll
> include all the probe messages 'n such at the end, but I want to get
> to the description first.
>
> The 18G drives are all VINUM.  It results in
>
> Filesystem       1K-blocks     Used    Avail Capacity  Mounted on
> /dev/vinum/raid  121691956 14986173 96970427    13%    /raid
>
> I have loaded the drive with a copy of the FreeBSD cvs tree (that I
> cvsup) and a checked-out copy of the tree.  I believe there is also
> a copy of the results of a make-release and other misc garbage.
>
> The drives are cabled 4 per controller with a certified LVD cable
> that ends with a certified LVD terminator.  The root drive has a
> regular SCSI-2 cable and also an SE/LVD terminator.  The 8 drives
> are in a separate case with redundant (and definitely adequate)
> power.  The root drive is on the main case's power.
>
> The whole thing is running on a PIII/450 with either 128M or 256M of
> RAM (depending on what I'm testing).
>
> ... so we know what the hardware is.  I've been careful to cable and
> do this all properly.
>
> First of all, I'm very pleased with the speed.  The system easily
> beats the AMI MegaRAID 1500 (same drives) with a whopping 35Mbyte/s
> in RAID-5 (vs. the 1500's 14Mbyte/s) for read.  (They both score a
> dead heat of 4Mbyte/s write.)

Nice to hear :-)

> I have had no troubles with cvsup'ing and no troubles with multiple
> concurrent cvs checkouts.  I have a script that tells me that the 8
> drives are dispatching from 500 to 1000 transactions a second
> (summing up all the iostat figures), and "cvs update" on the FreeBSD
> src directory can get a combined r/w performance of about 5Mbyte/s
> on the array.
>
> However, the nightly find scripts almost always disable the array.
> Many, many messages of the form
>
> Aug 24 02:04:37 news /kernel: (da6:ahc1:0:6:0): SCB 0x4a - timed out in Data-out phase, SEQADDR == 0x5c
> Aug 24 02:04:37 news /kernel: (da6:ahc1:0:6:0): Other SCB Timeout

This is almost certainly a SCSI problem.  I'll leave the SCSI experts
to decipher it, though I'd suspect a bus problem, despite your
attempts to avoid one.
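I'm not an ahc expert, but camcontrol(8) will at least show what the
bus has actually negotiated, and backing off the tag count or the sync
rate is a cheap way to test the marginal-bus theory.  Something like
the following might be worth a try (untested here; da6 is taken from
your logs, and the tag count and rate are just plausible values to
experiment with):

    camcontrol negotiate da6 -v             # show negotiated rate/offset/width
    camcontrol tags da6 -v                  # show the current tag queue depth
    camcontrol tags da6 -N 32               # try fewer outstanding commands
    camcontrol negotiate da6 -R 40.000 -a   # try renegotiating a slower rate

If the timeouts stop at a lower rate or tag count, that points at the
bus rather than at vinum or the da driver.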
> resulting in
>
> Aug 24 02:04:44 news /kernel: raid.p0.s4: fatal write I/O error
> Aug 24 02:04:44 news /kernel: vinum: raid.p0.s4 is stale by force
> Aug 24 02:04:44 news /kernel: vinum: raid.p0 is degraded
>
> and then (as other disks do the same)

Yes, there's little else I can do there.

> Now... if I reboot and "vinum setstate up" all these drives,

They all go down, do they?

> fsck completes without any complaint.  I then generally have to
> "vinum rebuildparity" ... but I suppose that I'd expect that.

Hmm.  rebuildparity is a dangerous command.  Basically, a parity error
means that *one* (or more) of the blocks in a stripe is incorrect.
rebuildparity simply assumes that the error is in the parity block and
rewrites it to match the data; if a data block is really the one at
fault, the bad data is silently "corrected" into looking legitimate.
For example, with data blocks 0110 and 1010 the parity block is 1100.
If the first block decays to 0111, checkparity will complain, and
rebuildparity will then write a new parity block of 1101, after which
the corruption is no longer even detectable.  It's a serious problem,
one that is very difficult to solve.

> The problem I'm having here (and I've had it before) is that the
> FreeBSD SCSI system seems to "give up" under conditions where others
> would keep retrying, or would reset and retry.
>
> It seems really, really, really important to me that we try harder
> to get a drive back online.  This seems as if it could affect the
> long-term viability of a vinum-based raid server... not because
> vinum is bad, but because the SCSI subsystem is too fragile.

Hmm.  I can't really comment on that, but it would be nice if the SCSI
system could recover from these problems.

Greg

--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message