From owner-freebsd-stable@FreeBSD.ORG  Tue Mar 30 19:14:21 2004
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 36C9116A4CE; Tue, 30 Mar 2004 19:14:21 -0800 (PST)
Received: from mail.globo.com (smtp1.globo.com [200.208.9.168])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id DD26D43D1D; Tue, 30 Mar 2004 19:14:20 -0800 (PST)
	(envelope-from jonny@jonny.eng.br)
Received: from jonny.eng.br (200.217.22.173) by mail.globo.com (6.0.053)
	(authenticated as jcml21@globo.com)
	id 40628E5100052255; Wed, 31 Mar 2004 00:14:18 -0300
Message-ID: <406A3785.1040007@jonny.eng.br>
Date: Wed, 31 Mar 2004 00:14:13 -0300
From: =?ISO-8859-1?Q?Jo=E3o_Carlos_Mendes_Lu=EDs?= <jonny@jonny.eng.br>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
	rv:1.6) Gecko/20040113
X-Accept-Language: pt-br, en-us, en, pt
MIME-Version: 1.0
To: Greg 'groggy' Lehey <grog@FreeBSD.org>
References: <4068EA56.3060600@jonny.eng.br>
	<20040330053143.GN15929@wantadilla.lemis.com> <40697F3B.2020202@jonny.eng.br>
	<20040326222853.GA93269@zeus.faperj.br>
	<20040330143257.C72259@pcle2.cc.univie.ac.at>
	<20040331004630.GA15929@wantadilla.lemis.com>
In-Reply-To: <20040331004630.GA15929@wantadilla.lemis.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
cc: stable@freebsd.org
cc: robert <robert@fledge.watson.org>
cc: Lukas Ertl <le@FreeBSD.org>
cc: hackers@freebsd.org
cc: bugs@FreeBSD.org
cc: Joao Carlos Mendes Luis <jonny@faperj.br>
Subject: Re: Serious bug in vinum?
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Production branch of FreeBSD source code
	<freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 31 Mar 2004 03:14:21 -0000


Greg 'groggy' Lehey wrote:

> On Tuesday, 30 March 2004 at 14:37:00 +0200, Lukas Ertl wrote:
> 
>>On Fri, 26 Mar 2004, Joao Carlos Mendes Luis wrote:
>>
>>
>>>    I think this should be like:
>>>
>>>        if (plex->state > plex_corrupt) {                  /* something accessible, */
>>>
>>>    Or, in other words, volume state is up only if plex state is degraded
>>>or better.
>>
>>You are right, this is a bug,
> 
> No, see my reply.

     I think "maybe" is the best answer here.
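
     To make the difference concrete, here is a minimal sketch of the two 
rules side by side.  The enum is a simplified, hypothetical subset of the plex 
states (only the ordering corrupt < degraded < up is taken from the discussion), 
the function names are mine, and I am assuming the existing test accepts a 
corrupt plex; this is not the actual vinum code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical subset of plex states, ordered worst to best.
 * Only the ordering corrupt < degraded < up matters for this sketch. */
enum plex_state { PLEX_DOWN, PLEX_CORRUPT, PLEX_DEGRADED, PLEX_UP };

/* Assumed existing rule: a plex that is corrupt or better brings the
 * volume up, because part of its data is still accessible. */
bool volume_up_lenient(const enum plex_state *plexes, size_t n)
{
    assert(plexes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (plexes[i] >= PLEX_CORRUPT)
            return true;
    return false;
}

/* Proposed rule: the volume is up only if some plex is degraded or better. */
bool volume_up_strict(const enum plex_state *plexes, size_t n)
{
    assert(plexes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (plexes[i] > PLEX_CORRUPT)
            return true;
    return false;
}
```

     With a single corrupt plex the lenient rule reports the volume as up and 
the strict rule reports it as down.  With two corrupt plexes that together 
still cover everything, the strict rule also takes the volume down, which is 
Greg's objection.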

>>The correct solution, of course, is to check if the data is valid
>>before changing the volume state, but that might turn out to be a
>>very complex check.
> 
> 
> Well, the minimum correct solution is to return an error if somebody
> tries to access the inaccessible part of the volume.  That should
> happen, and I'm confused that it doesn't appear to be doing so in this
> case.
> 
> On Tuesday, 30 March 2004 at 11:07:55 -0300, João Carlos Mendes Luís wrote:
> 
>>Greg 'groggy' Lehey wrote:
>>
>>>On Tuesday, 30 March 2004 at  0:32:38 -0300, João Carlos Mendes Luís wrote:
>>>
>>>Basically, this is a feature and not a bug.  A plex that is corrupt is
>>>still partially accessible, so we should allow access to it.  If you
>>>have two striped plexes both striped between two disks, with the same
>>>stripe size, and one plex starts on the first drive, and the other on
>>>the second, and one drive dies, then each plex will lose half of its
>>>data, every second stripe.  But the volume will be completely
>>>accessible.
>>
>>    A good idea if you have both stripe and mirror, to avoid discarding the
>>whole disk.  But, IMHO, if some part of the disk is inaccessible, the volume
>>should go down, and IFF the operator wants to try recovery, should use the
>>setstate command.  This is the safe state.
> 
> setstate is not safe.  It bypasses a lot of consistency checking.

     That's why it should be done only by a human operator, and only after 
checking the physical disk.  I use setstate frequently, when I have my wizard 
hat on, but I know the consequences of doing that.  If I have someone watching, I 
carefully explain to them *not* to repeat that.   ;-)

> 
> One possibility would be: 
> 
> 1.  Based on the plex states, check if all of the volume is still
>     accessible.
> 2.  If not, take the volume into a "flaky" state.  

     This is easy if the volume is composed of a single plex (my case, and the 
case of most people who need only a big and "unsafe" disk).  Here unsafe means 
a disk that is either available or not available, and not half a disk.  At 
least for me.

     If the volume has more than one plex, then you could think of an algorithm 
that exploits this redundancy.
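
     A rough sketch of such a check, for Greg's example of striped plexes that 
start on different drives.  The model is hypothetical and much simpler than 
the real thing: I assume stripe i of a plex lives on drive 
(first_drive + i) % ndrives, and the struct and function names are made up for 
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a striped plex: stripe i is stored on drive
 * (first_drive + i) % ndrives.  Not vinum's real data structure. */
struct striped_plex {
    size_t first_drive;
};

/* True if the given stripe of plex p sits on a working drive. */
bool stripe_readable(const struct striped_plex *p, size_t stripe,
                     const bool *drive_up, size_t ndrives)
{
    assert(ndrives > 0);
    return drive_up[(p->first_drive + stripe) % ndrives];
}

/* True if every stripe can be read from at least one plex. */
bool volume_fully_accessible(const struct striped_plex *plexes, size_t nplexes,
                             size_t nstripes,
                             const bool *drive_up, size_t ndrives)
{
    for (size_t s = 0; s < nstripes; s++) {
        bool found = false;
        for (size_t p = 0; p < nplexes && !found; p++)
            found = stripe_readable(&plexes[p], s, drive_up, ndrives);
        if (!found)
            return false;       /* no plex can supply this stripe */
    }
    return true;
}
```

     With two plexes starting on drives 0 and 1 and one of the two drives dead, 
each plex loses every second stripe, yet the check still says the volume is 
complete.  With only one of those plexes it says the volume is not.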

     But, IMO, a disk with half of it unavailable is hardly an "up and ok" one.

     Also note that, instead of turning the whole subdisk stale when a single 
I/O fails, the error could be passed up to the caller.  But this, too, only 
works with single-plex striped or concat configurations.
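
     Something along these lines, as a sketch only.  It assumes a single 
concatenated plex described by a hypothetical table of subdisks; the struct, 
the function name and the choice of EIO and EINVAL are my own illustration and 
not what vinum does today.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one subdisk in a concatenated plex. */
struct concat_subdisk {
    size_t start;      /* first block of the plex held by this subdisk */
    size_t length;     /* number of blocks it holds */
    bool   drive_up;   /* is the underlying drive working? */
};

/* Returns 0 if the block can be read, EIO if it maps to a dead drive,
 * and EINVAL if it lies outside the plex.  No state is changed: a failed
 * read is reported to the caller and the other subdisks stay usable. */
int plex_read_block(const struct concat_subdisk *sd, size_t nsd, size_t block)
{
    assert(sd != NULL || nsd == 0);
    for (size_t i = 0; i < nsd; i++) {
        if (block >= sd[i].start && block - sd[i].start < sd[i].length)
            return sd[i].drive_up ? 0 : EIO;
    }
    return EINVAL;
}
```

     A read that lands on the dead drive fails with an error, and a read of any 
other block still succeeds, much like a real disk with bad sectors.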


> 3.  *Somehow* ensure that the volume can't be accessed again as a file
>     system until it has been remounted.
> 4.  Refuse to remount the file system without the -f option.
> 
> The last two are outside the scope of Vinum, of course.

     And that again violates the layering approach.  I thought newfs -v had 
been enough...

     The first time I used vinum I was happily thinking that I would combine 4 
whole disks (except for boot and swap partitions, of course) and create a new 
pseudo disk, which I would then disklabel and repartition for the expected 
use.  Say, for example, that I want to have /var and /usr on different 
partitions, but I want both mirrored.  With real-world vinum I need to 
create 2 vinum partitions on the real disks, and have 2 vinum volumes.

     AFAIK, -current and GEOM fix this, right?  My last experience with 
RaidFrame ended in a panic, right from disk creation.  But I must confess I did 
not try that hard, since vinum and -stable were working for me.  I have not 
been a -current hacker for a long time now.

     Greg, I like vinum, and I have used it since its release in FreeBSD.  
Before that I used ccd(4).  When 5.x is stable, I will use GEOM, vinum or 
raidframe.  But I really think *ix is great for its reusability, recursivity 
and modularity, and vinum breaks this.  If vinum creates a virtual disk, it 
should behave like a real disk.

                                         Jonny

-- 
João Carlos Mendes Luís - Networking Engineer - jonny@jonny.eng.br