From owner-freebsd-scsi Thu Aug 24 19:07:06 2000
Delivered-To: freebsd-scsi@freebsd.org
Received: from wantadilla.lemis.com (wantadilla.lemis.com [192.109.197.80])
	by hub.freebsd.org (Postfix) with ESMTP id 6796337B422
	for ; Thu, 24 Aug 2000 19:07:01 -0700 (PDT)
Received: (from grog@localhost)
	by wantadilla.lemis.com (8.9.3/8.9.3) id LAA47231;
	Fri, 25 Aug 2000 11:36:38 +0930 (CST)
	(envelope-from grog)
Date: Fri, 25 Aug 2000 11:36:38 +0930
From: Greg Lehey
To: David Gilbert
Cc: freebsd-scsi@FreeBSD.ORG
Subject: Re: Vinum 29160 detaches drives, invalidates RAID.
Message-ID: <20000825113638.D39208@wantadilla.lemis.com>
References: <14757.14569.732766.367692@trooper.velocet.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.4i
In-Reply-To: <14757.14569.732766.367692@trooper.velocet.net>; from dgilbert@velocet.ca on Thu, Aug 24, 2000 at 11:02:01AM -0400
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.lemis.com/~grog
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

[dropping -stable]

On Thursday, 24 August 2000 at 11:02:01 -0400, David Gilbert wrote:
> I have a vinum test box with 8x 18G drives driven by 2x 29160
> controllers, running 4.1-RELEASE.  The machine also has a 9G root
> SCSI drive that is on a non-LVD connector of the 29160 card.  I'll
> include all the probe messages 'n such at the end, but I want to get
> to the description first.
>
> The 18G drives are all VINUM.  It results in
>
> Filesystem       1K-blocks     Used    Avail Capacity  Mounted on
> /dev/vinum/raid  121691956 14986173 96970427    13%    /raid
>
> I have loaded the drive with a copy of the FreeBSD cvs tree (that I
> cvsup) and a checked-out copy of the tree.  I believe there is also
> a copy of the results of a make-release and other misc garbage.
>
> The drives are cabled 4 per controller with a certified LVD cable
> that ends with a certified LVD terminator.  The root drive has a
> regular SCSI-2 cable and also an SE/LVD terminator.  The 8 drives
> are in a separate case with redundant (and definitely adequate)
> power.  The root drive is on the main case's power.
>
> The whole thing is running on a PIII/450 with either 128M or 256M of
> RAM (depending on what I'm testing).
>
> ... so we know what the hardware is.  I've been careful to cable and
> do this all properly.
>
> First of all, I'm very pleased with the speed.  The system easily
> beats the AMI MegaRAID 1500 (same drives) with a whopping 35Mbyte/s
> in RAID-5 (vs. the 1500's 14Mbyte/s) for read.  (They both score a
> dead heat of 4Mbyte/s write.)

Nice to hear :-)

> I have had no troubles with cvsup'ing and no troubles with multiple
> concurrent cvs checkouts.  I have a script that tells me that the 8
> drives are dispatching from 500 to 1000 transactions a second
> (summing up all the iostat figures), and "cvs update" on the FreeBSD
> src directory can get a combined r/w performance of about 5Mbyte/s
> on the array.
>
> However, the nightly find scripts almost always disable the array.
> Many, many messages of the form
>
> Aug 24 02:04:37 news /kernel: (da6:ahc1:0:6:0): SCB 0x4a - timed out in Data-out phase, SEQADDR == 0x5c
> Aug 24 02:04:37 news /kernel: (da6:ahc1:0:6:0): Other SCB Timeout

This is almost certainly a SCSI problem.  I'll leave the SCSI experts
to decipher it, though I'd suspect a bus problem, despite your
attempts to avoid one.
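I'm not an ahc expert, but camcontrol(8) will at least show what the
bus has actually negotiated, and backing off the tag count or the sync
rate is a cheap way to test the marginal-bus theory.  Something like
the following might be worth a try (untested here; da6 is taken from
your logs, and the tag count and rate are just plausible values to
experiment with):

    camcontrol negotiate da6 -v             # show negotiated rate/offset/width
    camcontrol tags da6 -v                  # show the current tag queue depth
    camcontrol tags da6 -N 32               # try fewer outstanding commands
    camcontrol negotiate da6 -R 40.000 -a   # try renegotiating a slower rate

If the timeouts stop at a lower rate or tag count, that points at the
bus rather than at vinum or the da driver.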
> resulting in
>
> Aug 24 02:04:44 news /kernel: raid.p0.s4: fatal write I/O error
> Aug 24 02:04:44 news /kernel: vinum: raid.p0.s4 is stale by force
> Aug 24 02:04:44 news /kernel: vinum: raid.p0 is degraded
>
> and then (as other disks do the same)

Yes, there's little else I can do there.

> Now... if I reboot and "vinum setstate up" all these drives,

They all go down, do they?

> fsck completes without any complaint.  I then generally have to
> "vinum rebuildparity" ... but I suppose that I'd expect that.

Hmm.  rebuildparity is a dangerous command.  Basically, a parity error
means that *one* (or more) of the blocks in a stripe is incorrect.
rebuildparity simply assumes that the error is in the parity block and
rewrites it to match the data; if a data block is really the one at
fault, the bad data is silently "corrected" into looking legitimate.
For example, with data blocks 0110 and 1010 the parity block is 1100.
If the first block decays to 0111, checkparity will complain, and
rebuildparity will then write a new parity block of 1101, after which
the corruption is no longer even detectable.  It's a serious problem,
one that is very difficult to solve.

> The problem I'm having here (and I've had it before) is that the
> FreeBSD SCSI system seems to "give up" under conditions where others
> would keep retrying, or would reset and retry.
>
> It seems really, really, really important to me that we try harder
> to get a drive back online.  This seems as if it could affect the
> long-term viability of a vinum-based raid server... not because
> vinum is bad, but because the SCSI subsystem is too fragile.

Hmm.  I can't really comment on that, but it would be nice if the SCSI
system could recover from these problems.

Greg

--
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message