From owner-freebsd-questions  Fri Jan 21  5:29:54 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from cc942873-a.ewndsr1.nj.home.com (cc942873-a.ewndsr1.nj.home.com [24.2.89.207])
	by hub.freebsd.org (Postfix) with ESMTP
	id 716A314D7B; Fri, 21 Jan 2000 05:29:51 -0800 (PST)
	(envelope-from cjc@cc942873-a.ewndsr1.nj.home.com)
Received: (from cjc@localhost)
	by cc942873-a.ewndsr1.nj.home.com (8.9.3/8.9.3) id IAA76100;
	Fri, 21 Jan 2000 08:34:02 -0500 (EST)
	(envelope-from cjc)
Date: Fri, 21 Jan 2000 08:34:02 -0500
From: "Crist J. Clark" <cjc@cc942873-a.ewndsr1.nj.home.com>
To: Greg Lehey <grog@lemis.com>
Cc: John Baldwin <jhb@FreeBSD.org>, freebsd-questions@FreeBSD.org,
	cjclark@home.com
Subject: Re: Recoverving/reviving a 'stale' subdisk under vinum
Message-ID: <20000121083402.A76063@cc942873-a.ewndsr1.nj.home.com>
Reply-To: cjclark@home.com
References: <20000121105518.N481@mojave.worldwide.lemis.com> <200001210635.BAA73206@server.baldwin.cx> <20000121133435.U1123@mojave.worldwide.lemis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <20000121133435.U1123@mojave.worldwide.lemis.com>; from grog@lemis.com on Fri, Jan 21, 2000 at 01:34:35PM +0530
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Fri, Jan 21, 2000 at 01:34:35PM +0530, Greg Lehey wrote:
> On Friday, 21 January 2000 at  1:35:33 -0500, John Baldwin wrote:
> >
> > On 21-Jan-00 Greg Lehey wrote:
> >> On Thursday, 20 January 2000 at 19:15:43 -0500, Crist J. Clark wrote:
> >>> On Thu, Jan 20, 2000 at 01:56:07PM -0500, John H. Baldwin wrote:
> >>>> I've read the vinum(4) and vinum(8) manpages as well as the webpages at
> >>>> www.lemis.com/~grog/vinum.html, and while they are very good as far as
> >>>> setup and configuration info, I haven't been able to find a lot of info
> >>>> about recovering.  I have a stale subdisk that I can't get to recover no
> >>>> matter how many different start commands I try.  I've tried starting the
> >>>> volume, the plex, and the subdisk itself with no success.
> >>>>
> >>>> # vinum list
> >>>> Configuration summary
> >>>>
> >>>> Drives:         3 (4 configured)
> >>>> Volumes:        1 (4 configured)
> >>>> Plexes:         1 (8 configured)
> >>>> Subdisks:       3 (16 configured)
> >>>>
> >>>> D vinumdrive0           State: up       Device /dev/da1s1e      Avail: 0/8683 MB (0%)
> >>>> D vinumdrive1           State: up       Device /dev/da2s1e      Avail: 0/8683 MB (0%)
> >>>> D vinumdrive2           State: up       Device /dev/da3s1e      Avail: 0/8683 MB (0%)
> >>>>
> >>>> V ftp_mirror            State: up       Plexes:       1 Size:         25 GB
> >>>>
> >>>> P ftp_mirror.p0       S State: corrupt  Subdisks:     3 Size:         25 GB
> >>>>
> >>>> S ftp_mirror.p0.s0      State: up       PO:        0  B Size:       8683 MB
> >>>> S ftp_mirror.p0.s1      State: up       PO:      256 kB Size:       8683 MB
> >>>> S ftp_mirror.p0.s2      State: stale    PO:      512 kB Size:       8683 MB
> >>>>
> >>>> # vinum start ftp_mirror.p0.s2
> >>>> Can't start ftp_mirror.p0.s2: Device busy (16)
> >>
> >> Hmm.  That shouldn't happen.
> >
> > Well, that's comforting. :)
> 
> Hmm.  Looking at this more carefully, yes, you can't do anything
> there.  You just don't have the information to recover the subdisk.
> I'm still debating what to do in this case; there's no way to bring it
> back to a guaranteed consistent state here, but you *can* use the
> 'setupstate' command to fake it.

When I was having trouble with an iffy SCSI HDD a week or two ago,
this is _exactly_ what happened to me too: the "Device busy (16)"
message. The only thing I found to fix it was a forced stop, which
seemed to always work. Sorry if it is not the ideal way to go, but it
worked fine for me.
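
For the archives, a rough sketch of what a forced stop followed by a
restart might look like, using the object names from John's listing
above. The -f flag to 'stop' is the one documented in vinum(8); which
objects need stopping, and in what order, is an assumption here, not a
tested procedure. As Greg notes, stopping things takes the volume down
and does not guarantee consistent data.

```shell
# Sketch only: object names come from the quoted 'vinum list' output.
# Assumes nothing on the volume is mounted or otherwise in use.

# Force the stale subdisk down, even though vinum reports it busy.
vinum stop -f ftp_mirror.p0.s2

# Try to bring it back up, then check the resulting states.
vinum start ftp_mirror.p0.s2
vinum list
```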

> >>> You have to 'stop' everything first. (I might be overkilling here,
> >>> but better safe...)
> >>
> >> No, that's not safe.  That would mean taking down the volume.

In my case it was a striped setup, so once one subdisk was down, the
whole plex was useless. There was no reason not to stop everything.

[snip]
> >> I haven't seen this before.  How about the information I ask for in
> >> the web page?

I have abundant /var/log/messages info from my problems. Need more
data?

[snip]
> > However, the drive seems to have fallen over again (*sigh*) with the
> > following kernel messages:
> >
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): SCB 0x96 - timed out while idle, LASTPHASE == 0x1, SEQADDR == 0xa
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): Queuing a BDR SCB
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): Bus Device Reset Message Sent
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): no longer in timeout, status = 34b
> > Jan 20 23:28:38 raven /kernel: ahc1: Bus Device Reset on A:1. 1 SCBs aborted
> 
> Yup, that looks like a hardware problem; possibly bus termination or
> some such.  Vinum is good at finding suboptimal SCSI chains, since it
> issues multiple requests in parallel.
> 
> > Note that I didn't get this message until after the drive had been
> > booted for a while,
> 
> Right, that's relatively typical.

Yup, that's the general type of error I was getting. I finally
narrowed it down to one of the drives after swapping SCSI cards,
changing all of the external cabling, swapping terminators, and
disassembling and reassembling the two shoeboxes the drives live
in. SCSI can be a real pain sometimes.
-- 
Crist J. Clark                           cjclark@home.com

