From owner-freebsd-questions Fri Jan 21 5:29:54 2000
Delivered-To: freebsd-questions@freebsd.org
Received: from cc942873-a.ewndsr1.nj.home.com (cc942873-a.ewndsr1.nj.home.com [24.2.89.207])
    by hub.freebsd.org (Postfix) with ESMTP
    id 716A314D7B; Fri, 21 Jan 2000 05:29:51 -0800 (PST)
    (envelope-from cjc@cc942873-a.ewndsr1.nj.home.com)
Received: (from cjc@localhost)
    by cc942873-a.ewndsr1.nj.home.com (8.9.3/8.9.3) id IAA76100;
    Fri, 21 Jan 2000 08:34:02 -0500 (EST)
    (envelope-from cjc)
Date: Fri, 21 Jan 2000 08:34:02 -0500
From: "Crist J. Clark"
To: Greg Lehey
Cc: John Baldwin, freebsd-questions@FreeBSD.org, cjclark@home.com
Subject: Re: Recoverving/reviving a 'stale' subdisk under vinum
Message-ID: <20000121083402.A76063@cc942873-a.ewndsr1.nj.home.com>
Reply-To: cjclark@home.com
References: <20000121105518.N481@mojave.worldwide.lemis.com>
    <200001210635.BAA73206@server.baldwin.cx>
    <20000121133435.U1123@mojave.worldwide.lemis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <20000121133435.U1123@mojave.worldwide.lemis.com>;
    from grog@lemis.com on Fri, Jan 21, 2000 at 01:34:35PM +0530
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Fri, Jan 21, 2000 at 01:34:35PM +0530, Greg Lehey wrote:
> On Friday, 21 January 2000 at 1:35:33 -0500, John Baldwin wrote:
> >
> > On 21-Jan-00 Greg Lehey wrote:
> >> On Thursday, 20 January 2000 at 19:15:43 -0500, Crist J. Clark wrote:
> >>> On Thu, Jan 20, 2000 at 01:56:07PM -0500, John H. Baldwin wrote:
> >>>> I've read the vinum(4) and vinum(8) manpages as well as the webpages at
> >>>> www.lemis.com/~grog/vinum.html, and while they are very good as far as
> >>>> setup and configuration info, I haven't been able to find a lot of info
> >>>> about recovering. I have a stale subdisk that I can't get to recover no
> >>>> matter how many different start commands I try. I've tried starting the
> >>>> volume, the plex, and the subdisk itself with no success.
> >>>>
> >>>> # vinum list
> >>>> Configuration summary
> >>>>
> >>>> Drives: 3 (4 configured)
> >>>> Volumes: 1 (4 configured)
> >>>> Plexes: 1 (8 configured)
> >>>> Subdisks: 3 (16 configured)
> >>>>
> >>>> D vinumdrive0 State: up Device /dev/da1s1e Avail: 0/8683 MB (0%)
> >>>> D vinumdrive1 State: up Device /dev/da2s1e Avail: 0/8683 MB (0%)
> >>>> D vinumdrive2 State: up Device /dev/da3s1e Avail: 0/8683 MB (0%)
> >>>>
> >>>> V ftp_mirror State: up Plexes: 1 Size: 25 GB
> >>>>
> >>>> P ftp_mirror.p0 S State: corrupt Subdisks: 3 Size: 25 GB
> >>>>
> >>>> S ftp_mirror.p0.s0 State: up PO: 0 B Size: 8683 MB
> >>>> S ftp_mirror.p0.s1 State: up PO: 256 kB Size: 8683 MB
> >>>> S ftp_mirror.p0.s2 State: stale PO: 512 kB Size: 8683 MB
> >>>>
> >>>> # vinum start ftp_mirror.p0.s2
> >>>> Can't start ftp_mirror.p0.s2: Device busy (16)
> >>
> >> Hmm. That shouldn't happen.
> >
> > Well, that's comforting. :)
>
> Hmm. Looking at this more carefully, yes, you can't do anything
> there. You just don't have the information to recover the subdisk.
> I'm still debating what to do in this case; there's no way to bring it
> back to a guaranteed consistent state here, but you *can* use the
> 'setupstate' command to fake it.

When I was having trouble with an iffy SCSI HDD a week or two ago,
this is _exactly_ what would happen to me too: the "Device busy (16)"
message. The only thing I found to fix it was a forced stop, and it
always seemed to work. Sorry if it is not the ideal way to go, but it
is what worked fine for me.

> >>> You have to 'stop' everything first. (I might be overkilling here,
> >>> but better safe...)
> >>
> >> No, that's not safe. That would mean taking down the volume.

In my case it was a striped setup, so once one subdisk was down, the
whole plex was useless. There was no reason not to stop everything.

[snip]

> >> I haven't seen this before. How about the information I ask for in
> >> the web page?

I have abundant /var/log/messages info from my problems.
Need more data?

[snip]

> > However, the drive seems to have fallen over again (*sigh*) with the
> > following kernel messages:
> >
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): SCB 0x96 - timed out while idle, LASTPHASE == 0x1, SEQADDR == 0xa
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): Queuing a BDR SCB
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): Bus Device Reset Message Sent
> > Jan 20 23:28:38 raven /kernel: (da2:ahc1:0:1:0): no longer in timeout, status = 34b
> > Jan 20 23:28:38 raven /kernel: ahc1: Bus Device Reset on A:1. 1 SCBs aborted
>
> Yup, that looks like a hardware problem; possibly bus termination or
> some such. Vinum is good at finding suboptimal SCSI chains, since it
> issues multiple requests in parallel.
>
> > Note that I didn't get this message until after the drive had been
> > booted for a while,
>
> Right, that's relatively typical.

Yup, that's the general type of error I was getting. I finally
narrowed it down to one of the drives after swapping SCSI cards,
changing all of the external cabling, swapping terminators, and
disassembling and reassembling the two shoeboxes the drives live in.
SCSI can be a real pain sometimes.
--
Crist J. Clark                           cjclark@home.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message