Date: Fri, 21 Jan 2000 22:17:18 +0530
From: Greg Lehey <grog@mojave.worldwide.lemis.com>
To: cjclark@home.com
Cc: John Baldwin <jhb@FreeBSD.org>, freebsd-questions@FreeBSD.org
Subject: Re: Recoverving/reviving a 'stale' subdisk under vinum
Message-ID: <20000121221718.C918@mojave.worldwide.lemis.com>
In-Reply-To: <20000121083402.A76063@cc942873-a.ewndsr1.nj.home.com>;
    from cjc@cc942873-a.ewndsr1.nj.home.com on Fri, Jan 21, 2000 at 08:34:02AM -0500
References: <20000121105518.N481@mojave.worldwide.lemis.com>
    <200001210635.BAA73206@server.baldwin.cx>
    <20000121133435.U1123@mojave.worldwide.lemis.com>
    <20000121083402.A76063@cc942873-a.ewndsr1.nj.home.com>
On Friday, 21 January 2000 at 8:34:02 -0500, Crist J. Clark wrote:
> On Fri, Jan 21, 2000 at 01:34:35PM +0530, Greg Lehey wrote:
>> On Friday, 21 January 2000 at 1:35:33 -0500, John Baldwin wrote:
>>>
>>> On 21-Jan-00 Greg Lehey wrote:
>>>> On Thursday, 20 January 2000 at 19:15:43 -0500, Crist J. Clark wrote:
>>>>> On Thu, Jan 20, 2000 at 01:56:07PM -0500, John H. Baldwin wrote:
>>>>>> I've read the vinum(4) and vinum(8) manpages as well as the webpages at
>>>>>> www.lemis.com/~grog/vinum.html, and while they are very good as far as
>>>>>> setup and configuration info, I haven't been able to find a lot of info
>>>>>> about recovering. I have a stale subdisk that I can't get to recover no
>>>>>> matter how many different start commands I try. I've tried starting the
>>>>>> volume, the plex, and the subdisk itself with no success.
>>>>>>
>>>>>> # vinum list
>>>>>> Configuration summary
>>>>>>
>>>>>> Drives: 3 (4 configured)
>>>>>> Volumes: 1 (4 configured)
>>>>>> Plexes: 1 (8 configured)
>>>>>> Subdisks: 3 (16 configured)
>>>>>>
>>>>>> D vinumdrive0 State: up Device /dev/da1s1e Avail: 0/8683 MB (0%)
>>>>>> D vinumdrive1 State: up Device /dev/da2s1e Avail: 0/8683 MB (0%)
>>>>>> D vinumdrive2 State: up Device /dev/da3s1e Avail: 0/8683 MB (0%)
>>>>>>
>>>>>> V ftp_mirror State: up Plexes: 1 Size: 25 GB
>>>>>>
>>>>>> P ftp_mirror.p0 S State: corrupt Subdisks: 3 Size: 25 GB
>>>>>>
>>>>>> S ftp_mirror.p0.s0 State: up PO: 0 B Size: 8683 MB
>>>>>> S ftp_mirror.p0.s1 State: up PO: 256 kB Size: 8683 MB
>>>>>> S ftp_mirror.p0.s2 State: stale PO: 512 kB Size: 8683 MB
>>>>>>
>>>>>> # vinum start ftp_mirror.p0.s2
>>>>>> Can't start ftp_mirror.p0.s2: Device busy (16)
>>>>
>>>> Hmm. That shouldn't happen.
>>>
>>> Well, that's comforting. :)
>>
>> Hmm. Looking at this more carefully, yes, you can't do anything
>> there. You just don't have the information to recover the subdisk.
>> I'm still debating what to do in this case; there's no way to bring it
>> back to a guaranteed consistent state here, but you *can* use the
>> 'setupstate' command to fake it.
>
> When I was having troubles with an iffy SCSI HDD a week or two ago,
> this is _exactly_ what would happen to me too, the "Device busy (16)"
> message. The only thing I found to fix it was a forced stop, and it
> seemed to always work. Sorry if it is not the ideal way to go, but it
> is what worked fine for me.

Hmm. I suppose this is worth investigating. It's quite possible that
the message is incorrect and should say something like "device not
accessible".

True story: About 17 years ago, I was working for Tandem, and we had
sporadic reports of customers unable to revive disk mirrors. The
error reported was 12 (FEINUSE, file in use), which looks pretty much
like the thing we have here. The first report was from Helsinki, the
second was from Taranto in the South of Italy, and in each case the
customer engineer was able to hide the symptoms before I could find
the problem. The third time it happened in Bern, the capital of
Switzerland. I told the CE to do nothing, and I would be there
immediately. I jumped in my car, was in Basel by 7 pm, and we spent
an hour or so debugging the disk driver.

The reason? It checked a flag at the beginning of the disk, which
specified what kind of format it had, and found nothing it recognized,
so it decided it must belong to an ancient, no longer used disk
controller, and refused to touch it ("it belongs to somebody else").
In fact, the check was incorrect: if the very first sector of the disk
had been spared, it had a different flag, but it didn't check for this
eventuality. A hard format got rid of the spare, and people were able
to revive again.

>>>>> You have to 'stop' everything first. (I might be overkilling here,
>>>>> but better safe...)
>>>>
>>>> No, that's not safe. That would mean taking down the volume.
>
> In my case it was a striped setup so once one subdisk was down, the
> whole plex was useless. There was no reason not to stop everything.

Yes, in fact this was the case here as well.

> [snip]
>>>> I haven't seen this before. How about the information I ask for in
>>>> the web page?
>
> I have abundant /var/log/message info from my problems. Need more
> data?

Hold on to it, but don't send it to me yet. I'm way away from home,
and I won't be able to look at it for at least a week.

>>> Note that I didn't get this message until after the drive had been
>>> booted for a while,
>>
>> Right, that's relatively typical.
>
> Yup, that's the general type of error I was getting. I finally
> narrowed it down to one of the drives after swapping SCSI cards,
> changing all of the external cabling, swapping terminators, and
> disassembling and reassembling the two shoeboxes the drives live
> in. SCSI can be a real pain sometimes.

SCSI is not a mystery. There are serious technical reasons why it is
occasionally necessary to sacrifice a live goat to a SCSI chain.

Greg
--
When replying to this message, please copy the original recipients.
For more information, see http://www.lemis.com/questions.html
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message