Date: Mon, 12 Feb 2001 19:54:22 +1030
From: Greg Lehey
To: David Schooley
Cc: freebsd-questions@FreeBSD.ORG
Subject: Re: Vinum behavior (long)
Message-ID: <20010212195422.S47700@wantadilla.lemis.com>
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8286
Fax: +61-8-8388-8725
Mobile: +61-418-838-708
WWW-Home-Page: http://www.lemis.com/~grog

On Monday, 12 February 2001 at 0:28:04 -0600, David Schooley wrote:
> I have been doing some experimenting with vinum, primarily to
> understand it before putting it to regular use. I have a few
> questions, mostly about oddities I can't explain.
>
> The setup consists of 4 identical 30 GB ATA drives, each on its own
> channel. One pair of channels comes off the motherboard controller;
> the other pair hangs off a PCI card. I am running 4.2-STABLE,
> cvsup'ed some time within the past week.
>
> The configuration file I am using is as follows and is fairly close
> to the examples in the man page and elsewhere, although it raises
> some questions by itself. What I attempted to do was make sure each
> drive was mirrored to the corresponding drive on the other
> controller, i.e. 1<->3 and 2<->4:
>
> ***
> drive drive1 device /dev/ad0s1d
> drive drive2 device /dev/ad2s1d
> drive drive3 device /dev/ad4s1d
> drive drive4 device /dev/ad6s1d
>
> volume raid setupstate
> plex org striped 300k
> sd length 14655m drive drive1
> sd length 14655m drive drive2
> sd length 14655m drive drive3
> sd length 14655m drive drive4
> plex org striped 300k
> sd length 14655m drive drive3
> sd length 14655m drive drive4
> sd length 14655m drive drive1
> sd length 14655m drive drive2
> ***
>
> I wanted to see what would happen if I lost an entire IDE controller,
> so I set everything up, mounted the new volume and copied over
> everything from /usr/local. I shut the machine down, cut the power to
> drives 3 and 4, and restarted. Upon restart, vinum reported that
> drives 3 and 4 had failed. If my understanding is correct, I should
> have been OK, since any data on drives 3 and 4 would have been a copy
> of what was on drives 1 and 2 respectively.

Correct.

> For the next part of the test, I attempted to duplicate a directory
> in the raid version of /usr/local. It partially worked, but there
> were errors

What errors?

> during the copy, and only about two thirds of the data was
> successfully copied.
>
> Question #1: Shouldn't this have worked?

Answer: Yes, it should have.  What went wrong?
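Incidentally, your layout reasoning is right: the second plex rotates
the drive order by two, so each stripe lives on one drive from each
controller pair. A back-of-envelope check (a plain sh sketch, not a
Vinum command; the 300 kB stripe size and drive order are taken from
your config above):

***
# For a few volume offsets, work out which drive holds the data in
# each plex.  p0's subdisk order is drive1 drive2 drive3 drive4;
# p1's is rotated by two: drive3 drive4 drive1 drive2.
for off_kb in 0 300 600 900 1200; do
    stripe=$(( $off_kb / 300 ))
    p0=$(( $stripe % 4 + 1 ))          # drive number in plex p0
    p1=$(( ($stripe + 2) % 4 + 1 ))    # drive number in plex p1
    echo "offset ${off_kb}k: p0 -> drive$p0, p1 -> drive$p1"
done
***

Every offset ends up with one copy on drives 1 or 2 and one on drives
3 or 4, which is why losing the second controller should still have
left a complete copy online.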
> After I "fixed" the "broken" controller and restarted the machine, > vinum's list looked like this: > > *** > 4 drives: > D drive1 State: up Device /dev/ad0s1d Avail: 1/29311 MB (0%) > D drive2 State: up Device /dev/ad2s1d Avail: 1/29311 MB (0%) > D drive3 State: up Device /dev/ad4s1d Avail: 1/29311 MB (0%) > D drive4 State: up Device /dev/ad6s1d Avail: 1/29311 MB (0%) > > 1 volumes: > V raid State: up Plexes: 2 Size: 57 GB > > 2 plexes: > P raid.p0 S State: corrupt Subdisks: 4 Size: 57 GB > P raid.p1 S State: corrupt Subdisks: 4 Size: 57 GB > > 8 subdisks: > S raid.p0.s0 State: up PO: 0 B Size: 14 GB > S raid.p0.s1 State: up PO: 300 kB Size: 14 GB > S raid.p0.s2 State: stale PO: 600 kB Size: 14 GB > S raid.p0.s3 State: stale PO: 900 kB Size: 14 GB > S raid.p1.s0 State: stale PO: 0 B Size: 14 GB > S raid.p1.s1 State: stale PO: 300 kB Size: 14 GB > S raid.p1.s2 State: up PO: 600 kB Size: 14 GB > S raid.p1.s3 State: up PO: 900 kB Size: 14 GB > *** > > This makes sense. Now after restarting raid.p0 and waiting for > everything to resync, I got this: > > *** > 2 plexes: > P raid.p0 S State: up Subdisks: 4 Size: 57 GB > P raid.p1 S State: corrupt Subdisks: 4 Size: 57 GB > > 8 subdisks: > S raid.p0.s0 State: up PO: 0 B Size: 14 GB > S raid.p0.s1 State: up PO: 300 kB Size: 14 GB > S raid.p0.s2 State: up PO: 600 kB Size: 14 GB > S raid.p0.s3 State: up PO: 900 kB Size: 14 GB > S raid.p1.s0 State: stale PO: 0 B Size: 14 GB <--- still stale Please don't wrap output. > S raid.p1.s1 State: stale PO: 300 kB Size: 14 GB <--- still stale > S raid.p1.s2 State: up PO: 600 kB Size: 14 GB > S raid.p1.s3 State: up PO: 900 kB Size: 14 GB > *** > > Now the only place that raid.p0.s2 and raid.p0.s3 could have gotten > their data is from raid.p1.s0 and raid.p1.s1, neither of which were > involved in the "event". Correct. > Question #2: Since the data on raid.p0 now matches raid.p1, > shouldn't raid.p1 have come up automatically and without having to > copy data from raid.p0? No. According to the output above, raid.p1 hasn't been started yet. There's also no indication in your message or in the output that you tried to start it. If the start had died in the middle, the list command would have shown that. > The configuration file below makes sense, but suffers a slight > performance penalty over the first one. > > Question #3: Is there a reason why "mirror -s" does it this way > instead of striping to all 4 disks? Yes. mirror -s is a pretty bare bones config utility. You have so many different options with Vinum, and mirror just does one of them. > I kind of prefer it this way, but I'm still curious. You're better off with the first config. Your performance will be more even. > drive drive1 device /dev/ad0s1d > drive drive2 device /dev/ad2s1d > drive drive3 device /dev/ad4s1d > drive drive4 device /dev/ad6s1d > > volume raid setupstate > plex org striped 300k > sd length 29310 m drive drive1 > sd length 29310 m drive drive2 > plex org striped 300k > sd length 29310 m drive drive3 > sd length 29310 m drive drive4 > *** > > While reading through the archives, I noticed several occasions where > it was stated that a power-of-two stripe size was potentially bad > because all of the superblocks could end up on the same disk, thereby > impacting performance, but the documentation and "mirror -s" all use > a stripe size of 256k. > > Question 4: Is the power-of-two concern still valid, and if so, > shouldn't the documentation and "mirror -s" function be changed? Yes. 
Getting back to the first problem: my first guess is that you tried
only 'start raid.p0' and didn't do a 'start raid.p1'. If you did, I'd
like to see the output I ask for in the man page and at
http://www.vinumvm.org/vinum/how-to-debug.html. It's too detailed to
repeat here.

Greg
--
When replying to this message, please copy the original recipients.
If you don't, I may ignore the reply.  For more information, see
http://www.lemis.com/questions.html
Finger grog@lemis.com for PGP public key
See complete headers for address and phone numbers
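P.S.: For completeness, the recovery sequence I have in mind looks
something like this (a sketch only, using the object names from your
listing; the revive messages you see will differ):

***
vinum start raid.p0    # revives raid.p0.s2 and raid.p0.s3 from plex p1
vinum start raid.p1    # the step I suspect was missed; revives p1.s0 and p1.s1
vinum list             # every plex and subdisk should now show State: up
***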