From owner-freebsd-geom@FreeBSD.ORG  Tue Feb  6 19:43:32 2007
Return-Path: <owner-freebsd-geom@FreeBSD.ORG>
X-Original-To: geom@FreeBSD.org
Delivered-To: freebsd-geom@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 154B316A401
	for <geom@FreeBSD.org>; Tue,  6 Feb 2007 19:43:32 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 8303613C4AA
	for <geom@FreeBSD.org>; Tue,  6 Feb 2007 19:43:31 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2])
	by phk.freebsd.dk (Postfix) with ESMTP id DF7CB1747B;
	Tue,  6 Feb 2007 19:43:29 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.13.8/8.13.8) with ESMTP id l16JhT05094383;
	Tue, 6 Feb 2007 19:43:29 GMT (envelope-from phk@critter.freebsd.dk)
To: Marcel Moolenaar <xcllnt@mac.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Tue, 06 Feb 2007 11:15:06 PST."
	<F2D457F7-46C3-4A81-9147-7F6FA6E01374@mac.com> 
Date: Tue, 06 Feb 2007 19:43:29 +0000
Message-ID: <94382.1170791009@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: geom@FreeBSD.org
Subject: Re: New g_part class 
X-BeenThere: freebsd-geom@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: GEOM-specific discussions and implementations
	<freebsd-geom.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-geom>,
	<mailto:freebsd-geom-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-geom>
List-Post: <mailto:freebsd-geom@freebsd.org>
List-Help: <mailto:freebsd-geom-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-geom>,
	<mailto:freebsd-geom-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 06 Feb 2007 19:43:32 -0000

In message <F2D457F7-46C3-4A81-9147-7F6FA6E01374@mac.com>, Marcel Moolenaar writes:

>> Right, but the current scheme handles this by asking the kernel to
>> write the finished metadata image, which the kernel's taste
>> functionality can be used to validate and parse.
>
>This is in effect a replacement oriented approach that is
>based on retasting. One cannot change the partition table
>by adding a new partition when an existing partition is
>already mounted without circumventing permissions and other
>checks, right?

Sure you can, works fine for at least MBR and BSD.

>What if I want to replace a MBR with a GPT without actually
>changing the meta-data? This doesn't work when each partition
>scheme has its own image-oriented verbs, but it is supported
>by g_part.

The MBR to GPT migration is a nasty corner case, in particular if
you want to do it while partitions are open.  To the MBR method,
the GPT would look like bootcode, so one way to do it without
catering to open partitions is to write the GPT via the MBR geom
and reboot (or retaste).

>With g_part the least important and most discriminating aspect
>of partitioning is abstracted: the on-disk format for storing
>the meta-data. This, I believe, is the right approach.

But couldn't this be equally well abstracted in userland ?

I'm trying to find out what the benefit of stuffing it into
the kernel is, and the only thing I can think of would be
"it doesn't have to live in userland then".  But if it
still has to live in userland, then what's the point ?

>The ctlreq functions will indeed be rarely used. However,
>the ctlreq functions don't actually have to be present in
>the kernel to make g_part functional for dealing with
>partitions. If space is a concern, then it should be possible
>to put the ctlreq functions in a separate module. I don't
>see this as a problem so I don't give it any attention.

Ok, that addresses at least that part.  This is a concern for
the embedded space.

>It's worth noting that the introduction of GEOM brought along
>some additional, and unsexy, work that had to be finished in
>time for the next release. 

Oh, the damage to libdisk happened much earlier.  Do "On Track
Disk Manager" and "Dangerously Dedicated Mode" ring a bell with
anybody ?

>That said: libdisk is now at the root of various forms of evil,
>including sysinstall(8) and its deadbeat offspring sade(8).
>Something needs to be done, and done right, if we want to stop
>this madness...

I fully agree.  I'm just trying to make sure I understand where
you're headed; having been there myself, I know how many nasty
corners there are.

>>> It's the application that should exhibit artificial intelligence
>>> (if at all), not the kernel.
>>
>> So what is the advantage of editing the metadata in the kernel
>> instead of userland ?
>
>Abstraction. Userland does not know or care what the on-disk
>meta-data looks like. It performs elementary operations that
>every partitioning scheme supports (modulo "extensions") and
>since the kernel needs to be involved anyway, it makes sense
>to have it involved at the basic level to simplify checks
>and to increase flexibility.

Well, this is where I don't fully agree.  Userland will
need to know about the on-disk metadata differences, because
partition IDs, UUIDs and the like have to come from somewhere,
and somebody needs to know to align MBR:s1 on the second
track, etc.
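
To make the alignment point concrete, here is a minimal sketch of the
kind of scheme-specific arithmetic somebody has to own.  The function
names are made up for illustration, and the 63 sectors per track used in
the note below is the traditional BIOS-reported value, passed in as an
assumption rather than read from a disk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round an LBA up to the next track boundary.
 * sectors_per_track is supplied by the caller; nothing here probes
 * real hardware.
 */
static uint32_t
track_align(uint32_t lba, uint32_t sectors_per_track)
{
	assert(sectors_per_track > 0);
	return ((lba + sectors_per_track - 1) / sectors_per_track) *
	    sectors_per_track;
}

/*
 * Hypothetical helper: first LBA for slice 1 on an MBR disk.
 * Sector 0 holds the MBR itself, so the earliest candidate is LBA 1,
 * which rounds up to the start of the second track.
 */
static uint32_t
mbr_first_slice_start(uint32_t sectors_per_track)
{
	return track_align(1, sectors_per_track);
}
```

With 63 sectors per track, mbr_first_slice_start(63) gives LBA 63.
Whoever computes this, kernel or userland, has to know it is an MBR
rule and not a GPT or BSD one.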

>Giving the kernel an image of what it needs to write to disk
>leaves a big gap between the current state and the new state
>and checking whether the new state is at all possible becomes
>very hard if not impossible in cases.

Works for the classes I've written:  The taste function
gets called before the write to validate the new metadata,
and afterwards to instantiate it.

Where exactly does this plan become impossible ?
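
As a rough userland model of that taste-before-write flow (not the
actual GEOM code), the sketch below validates a proposed MBR image
against the disk size before committing it, then parses it again
afterwards.  The function names and the in-memory "disk" buffer are
assumptions for illustration; only the MBR layout (signature at bytes
510-511, four 16-byte entries from byte 446) is standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MBR_SIZE	512
#define MBR_PT_OFF	446	/* four 16-byte partition entries */
#define MBR_NENTRIES	4

static uint32_t
le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Return 0 if img is a plausible MBR for a disk of disk_sectors. */
static int
mbr_taste(const uint8_t *img, size_t len, uint32_t disk_sectors)
{
	int i;

	assert(img != NULL);
	if (len < MBR_SIZE || img[510] != 0x55 || img[511] != 0xAA)
		return -1;
	for (i = 0; i < MBR_NENTRIES; i++) {
		const uint8_t *e = img + MBR_PT_OFF + i * 16;
		uint32_t start = le32(e + 8), size = le32(e + 12);

		if (e[4] == 0)		/* type 0: unused entry */
			continue;
		if (start == 0 || size == 0)
			return -1;
		if (start > disk_sectors || size > disk_sectors - start)
			return -1;	/* runs past end of disk */
	}
	return 0;
}

/* Validate first, commit second, then taste the committed sector. */
static int
mbr_write_metadata(uint8_t *sector0, const uint8_t *img, size_t len,
    uint32_t disk_sectors)
{
	if (mbr_taste(img, len, disk_sectors) != 0)
		return -1;	/* rejected; on-disk state untouched */
	memcpy(sector0, img, MBR_SIZE);
	return mbr_taste(sector0, MBR_SIZE, disk_sectors);
}
```

A rejected image never reaches the disk, so the gap between current
state and new state is only crossed once the new state has been shown
to parse.  This sketch does not check for overlapping entries or for
partitions that are currently open.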

>Error reporting to the user will also be improved. With an
>image approach the kernel has very few error conditions to
>report to the user with a single errno.

g_ctl allows for a string error, so this is no longer
a limitation except in the legacy ioctls (which we should
concentrate on killing).
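
To illustrate the difference, here is a small self-contained sketch of
a request that carries both an errno-style code and a message string.
The struct and function are hypothetical and only loosely modelled on
the idea; they are not the g_ctl or libgeom interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical request: numeric code plus human-readable text. */
struct ctl_req {
	int	error;
	char	errmsg[128];
};

static int
check_new_slice(struct ctl_req *req, uint32_t start, uint32_t size,
    uint32_t disk_sectors)
{
	assert(req != NULL);
	req->error = 0;
	req->errmsg[0] = '\0';

	if (size == 0) {
		req->error = EINVAL;
		snprintf(req->errmsg, sizeof(req->errmsg),
		    "slice size must be non-zero");
		return req->error;
	}
	if (start >= disk_sectors || size > disk_sectors - start) {
		req->error = EINVAL;
		snprintf(req->errmsg, sizeof(req->errmsg),
		    "slice at %lu with %lu sectors extends past end of "
		    "disk (%lu sectors)", (unsigned long)start,
		    (unsigned long)size, (unsigned long)disk_sectors);
		return req->error;
	}
	return 0;
}
```

Both failures return EINVAL, but the caller can tell them apart from
the string without guessing.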

>The kernel can return error strings, but that's fundamentally
>the wrong thing to do, because that would mean that you need
>to add i18n or l10n to the kernel.

This is probably a more political question than anything.  All the
people I asked agreed that English error messages were way better
than an EINVAL that could mean 12 different things.

>When the kernel is involved for each step and checks each
>step, the user will have direct feedback to its actions
>and as such will be able to understand better what went
>wrong and will therefore be able to take appropriate
>action.

Couldn't these same checks be carried out in userland, where i18n
is much easier ?  Why bother the kernel with a check userland
can do ?

>> If you could have written a generic partitioning tool that didn't
>> know about the different formats, then I could see the point,
>> but having to have the code both in userland and in the kernel
>> makes little to no sense to me, in particular given how seldom
>> it is used.
>
>That's the point. A single partitioning tool will be written
>and it will not know about the on-disk format of the meta-data.

But it will know about meta-data idiosyncrasies like partition ID,
alignment, size restrictions, UUIDs etc., so now, instead of
having it all collected in one place, we get it spread with half
in the kernel and half in userland ?

>> The problem with BSD labels is that you need to intercept writes
>> to the metadata part if one of the partitions allows this to
>> happen.
>
>I personally don't worry about that. If the meta-data is within
>a partition, then the user (e.g. file system) of that partition
>needs to be aware of that anyway.

Yeah, well, that's why I have changed our defaults to not create
partitions covering the metadata (part table + boot code).

I guess this bugwards compat code could die now, but I'd rather
that all of BSD dies.

>The responsibility of keeping
>the BSD label intact is automatically delegated to the user of
>that partition and cannot in general be enforced anywhere else.

Not quite.

Today the BSD geom will fail the write if it does not contain valid
BSD label metadata.  This has saved quite a lot of people from
losing their partitioning when they put non-UFS filesystems on a
...a partition.

It's a major bit of complication on the BSD class, but it was
necessary.
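
A simplified model of that guard, for illustration only: before a write
that touches the label area is let through, the would-be contents of
the label sector are checked.  The label position (sector 1 with
512-byte sectors) is an assumption of this sketch, and it only checks
the leading magic number; a real check would look at more of the label
than that.  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LABEL_OFFSET	512u		/* assumption: label in sector 1 */
#define LABEL_SIZE	512u
#define LABEL_MAGIC	0x82564557u	/* BSD disklabel magic */

static int
label_valid(const uint8_t *l)
{
	uint32_t m = (uint32_t)l[0] | (uint32_t)l[1] << 8 |
	    (uint32_t)l[2] << 16 | (uint32_t)l[3] << 24;

	return m == LABEL_MAGIC;
}

/* Write buf to disk at off, unless that would destroy the label. */
static int
guarded_write(uint8_t *disk, size_t disk_len, size_t off,
    const uint8_t *buf, size_t len)
{
	assert(disk != NULL && buf != NULL);
	if (disk_len < LABEL_OFFSET + LABEL_SIZE)
		return -1;
	if (off > disk_len || len > disk_len - off)
		return -1;

	if (off < LABEL_OFFSET + LABEL_SIZE && off + len > LABEL_OFFSET) {
		uint8_t label[LABEL_SIZE];
		size_t from = off > LABEL_OFFSET ? off : LABEL_OFFSET;
		size_t to = off + len < LABEL_OFFSET + LABEL_SIZE ?
		    off + len : LABEL_OFFSET + LABEL_SIZE;

		/* Build the label sector as it would look afterwards. */
		memcpy(label, disk + LABEL_OFFSET, LABEL_SIZE);
		memcpy(label + (from - LABEL_OFFSET), buf + (from - off),
		    to - from);
		if (!label_valid(label))
			return -1;	/* fail the write, keep the label */
	}
	memcpy(disk + off, buf, len);
	return 0;
}
```

A newfs-style tool that scribbles over the start of the partition gets
an error back instead of silently wiping the partitioning, while a
write of a new, valid label goes through.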

Summary:

I'm not against what you are proposing, but I doubt that it will turn
out as clean as you expect.

I predict that you will find, as you add the legacy partition types
like MBR, that you will not avoid knowing about their warts in
userland.

In particular, you will need facilities for writing boot code and
other non-partition metadata, and those are, by definition,
type-specific.

The alternative to what you are proposing is only marginally saner: a
per-class object in libgeom, loaded by a disk-edit library or
application.  But it would at least keep all the magic source for
editing each class in one place, and require less code in the kernel.
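
One way such a per-class object could look, as a sketch only: a table
of function pointers per scheme that a generic editor looks up by name.
None of these names exist in libgeom; struct scheme_ops, its members
and scheme_lookup() are invented for this example.  The GPT value of 34
assumes 512-byte sectors and the usual 128 entries of 128 bytes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-class object: scheme-specific rules in one place. */
struct scheme_ops {
	const char *name;
	uint32_t (*first_usable)(uint32_t sectors_per_track);
	int (*type_valid)(const char *type);
};

static uint32_t
mbr_first_usable(uint32_t spt)
{
	return spt;			/* start of the second track */
}

static int
mbr_type_valid(const char *type)
{
	char *end;
	unsigned long v;

	if (*type == '\0')
		return 0;
	v = strtoul(type, &end, 0);	/* accepts "165" or "0xa5" */
	return *end == '\0' && v >= 1 && v <= 255;
}

static uint32_t
gpt_first_usable(uint32_t spt)
{
	(void)spt;			/* GPT ignores track geometry */
	return 34;			/* PMBR + header + 32 entry sectors */
}

static int
gpt_type_valid(const char *type)
{
	size_t i;

	if (strlen(type) != 36)
		return 0;
	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (type[i] != '-')
				return 0;
		} else if (!isxdigit((unsigned char)type[i]))
			return 0;
	}
	return 1;			/* format check only */
}

static const struct scheme_ops schemes[] = {
	{ "MBR", mbr_first_usable, mbr_type_valid },
	{ "GPT", gpt_first_usable, gpt_type_valid },
};

/* Return the ops for a scheme name, or NULL if it is unknown. */
static const struct scheme_ops *
scheme_lookup(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(schemes) / sizeof(schemes[0]); i++)
		if (strcmp(schemes[i].name, name) == 0)
			return &schemes[i];
	return NULL;
}
```

The generic editor only ever calls through the table, so the warts of
each scheme stay next to the code that reads and writes that scheme's
on-disk format.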

You're doing the work, you decide, you get the blame :-)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.