From: João Carlos Mendes Luís
Date: Wed, 31 Mar 2004 00:14:13 -0300
To: Greg 'groggy' Lehey
Cc: stable@freebsd.org, robert, Lukas Ertl, hackers@freebsd.org,
    bugs@FreeBSD.org, Joao Carlos Mendes Luis
Message-ID: <406A3785.1040007@jonny.eng.br>
In-Reply-To: <20040331004630.GA15929@wantadilla.lemis.com>
References: <4068EA56.3060600@jonny.eng.br>
    <20040330053143.GN15929@wantadilla.lemis.com>
    <40697F3B.2020202@jonny.eng.br>
    <20040326222853.GA93269@zeus.faperj.br>
    <20040330143257.C72259@pcle2.cc.univie.ac.at>
    <20040331004630.GA15929@wantadilla.lemis.com>
Subject: Re: Serious bug in vinum?
List-Id: Production branch of FreeBSD source code

Greg 'groggy' Lehey wrote:
> On Tuesday, 30 March 2004 at 14:37:00 +0200, Lukas Ertl wrote:
>
>> On Fri, 26 Mar 2004, Joao Carlos Mendes Luis wrote:
>>
>>>   I think this should be like:
>>>
>>>    if (plex->state > plex_corrupt) {  /* something accessible, */
>>>
>>>   Or, in other words, volume state is up only if plex state is
>>> degraded or better.
>>
>> You are right, this is a bug,
>
> No, see my reply.

I think "maybe" is the best answer here.

>> The correct solution, of course, is to check if the data is valid
>> before changing the volume state, but this might turn out to be a
>> very complex check.
>
> Well, the minimum correct solution is to return an error if somebody
> tries to access the inaccessible part of the volume.  That should
> happen, and I'm confused that it doesn't appear to be doing so in this
> case.
>
> On Tuesday, 30 March 2004 at 11:07:55 -0300, João Carlos Mendes Luís wrote:
>
>> Greg 'groggy' Lehey wrote:
>>
>>> On Tuesday, 30 March 2004 at 0:32:38 -0300, João Carlos Mendes Luís wrote:
>>>
>>> Basically, this is a feature and not a bug.  A plex that is corrupt is
>>> still partially accessible, so we should allow access to it.  If you
>>> have two striped plexes both striped between two disks, with the same
>>> stripe size, and one plex starts on the first drive, and the other on
>>> the second, and one drive dies, then each plex will lose half of its
>>> data, every second stripe.  But the volume will be completely
>>> accessible.
>>
>>   A good idea if you have both stripe and mirror, to avoid discarding the
>> whole disk.  But, IMHO, if some part of the disk is inaccessible, the
>> volume should go down, and IFF the operator wants to try recovery, should
>> use the setstate command.  This is the safe state.
>
> setstate is not safe.  It bypasses a lot of consistency checking.

That's why it should be done only by a human operator, and only after
checking the physical disk.  I use setstate frequently when I have my
wizard hat on, but I know the consequences of doing that.  If someone is
watching, I carefully tell them *not* to repeat it.  ;-)

> One possibility would be:
>
> 1.  Based on the plex states, check if all of the volume is still
>     accessible.
> 2.  If not, take the volume into a "flaky" state.

This is easy if the volume is composed of a single plex (my case, and the
case of most people who need only a big and "unsafe" disk), where "unsafe"
means a disk that is either available or not available, never half a disk.
At least for me.

If the volume has more than one plex, then you could think of an algorithm
that exploits this redundancy.  But, IMO, a disk with half of it
unavailable is hardly an "up and ok" one.

Also note that, instead of turning the whole subdisk stale when a single
I/O fails, the error could be passed up to the layer above.  But, again,
this only works with single-plex stripe or concat configurations.

> 3.  *Somehow* ensure that the volume can't be accessed again as a file
>     system until it has been remounted.
> 4.  Refuse to remount the file system without the -f option.
>
> The last two are outside the scope of Vinum, of course.

And they again violate the layering approach.  I thought newfs -v had been
enough...

The first time I used vinum, I happily thought that I would combine 4 whole
disks (except for boot and swap partitions, of course) into a new pseudo
disk, which I would then disklabel and repartition for the expected use.
Say, for example, that I want /var and /usr on different partitions, but I
want both mirrored.  With real-world vinum I need to create 2 vinum
partitions on the real disks, and have 2 vinum volumes.  AFAIK, -current
and GEOM fix this, right?

My last experience with RaidFrame was a panic one, right from disk creation.
But I must confess I did not try that hard, since vinum and -stable were
working for me.  I have not been a -current hacker for a long time now.

Greg, I like vinum, and I have used it since its release in FreeBSD.
Before that I used ccd(4).  When 5.x is stable, I will use GEOM, vinum or
raidframe.  But I really think *ix is great for its reusability,
recursivity and modularity, and vinum breaks this.  If vinum creates a
virtual disk, it should behave like a real disk.

					Jonny

-- 
João Carlos Mendes Luís - Networking Engineer - jonny@jonny.eng.br