From owner-freebsd-geom@FreeBSD.ORG  Tue Feb  6 23:12:44 2007
Return-Path: <owner-freebsd-geom@FreeBSD.ORG>
X-Original-To: geom@FreeBSD.org
Delivered-To: freebsd-geom@FreeBSD.ORG
In-Reply-To: <94382.1170791009@critter.freebsd.dk>
References: <94382.1170791009@critter.freebsd.dk>
Mime-Version: 1.0 (Apple Message framework v752.3)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-Id: <5CDE358B-864C-41D9-A3A0-952A49A93218@mac.com>
Content-Transfer-Encoding: 7bit
From: Marcel Moolenaar <xcllnt@mac.com>
Date: Tue, 6 Feb 2007 15:10:55 -0800
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
X-Mailer: Apple Mail (2.752.3)
X-Brightmail-Tracker: AAAAAA==
X-Brightmail-scanned: yes
Cc: geom@FreeBSD.org
Subject: Re: New g_part class


On Feb 6, 2007, at 11:43 AM, Poul-Henning Kamp wrote:

>> With g_part the least important and most discriminating aspect
>> of partitioning is abstracted: the on-disk format for storing
>> the meta-data. This, I believe, is the right approach.
>
> But couldn't this be equally well abstracted in userland ?

So far we have been unsuccessful. Every partitioning scheme
has its own tool, so we don't even abstract there. The user
is faced with the differences.

Sysinstall is supposed to abstract the details from the user
but has turned into a non-portable, obstinate piece of code
that is so hard to maintain that new partitioning schemes
are simply not added anymore: it's too difficult. Not to
mention that it's impossible to label a disk with a scheme
other than the single one supported for that particular
architecture. As an example, people seem to like GPT to
partition big disks on i386/amd64 and only use MBR for
booting purposes. This cannot be done in sysinstall at all.

On the other hand, Linux has parted. This seems to be an
example of where userland abstraction is successful. At
least, to a certain extent. Unfortunately, parted likes to
write the data to disk itself and as such has knowledge
about each and every partitioning scheme. This then is an
example of having the abstraction in userland. Since GEOM
has the ctlreq interface for this and direct disk access is,
generally speaking, not allowed, it's not an example that
fits our use case entirely.

Also, we share files between kernel and userland for encoding
and decoding MBRs and BSD labels. This is the immediate result
of having detailed knowledge in two places: the kernel and
the tool that works with the partitioning scheme. Since we
do need the knowledge in the kernel to begin with, the logical
step to avoid duplication is to eliminate the need for this
knowledge from userland. This is the g_part approach.

With g_part, the on-disk layout is kept in the kernel so that
detailed knowledge resides in one place and the elementary
operations are the handshake between kernel and userland.
The handshake already provides the abstraction.
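
As a rough illustration of what such an elementary operation
could look like, here is a minimal sketch of a generic "add"
verb. It is not the g_part code: the names part_scheme,
part_table and part_add, the fixed-size entry array and the
choice of errno values are all assumptions made for this
example. The point is only that the checks need no knowledge
of the on-disk format.

```c
#include <errno.h>

#define PART_MAX_ENTRIES 128

/* Hypothetical scheme descriptor; only scheme code knows the format. */
struct part_scheme {
	const char *name;
	int max_entries;	/* must not exceed PART_MAX_ENTRIES */
};

struct part_entry {
	unsigned long start, end;	/* inclusive sector range */
	int used;
};

struct part_table {
	const struct part_scheme *scheme;
	struct part_entry entry[PART_MAX_ENTRIES];
};

/*
 * Generic "add" verb: validates the request against the current
 * table without knowing how the table is stored on disk.
 */
int
part_add(struct part_table *t, unsigned long start, unsigned long end)
{
	int i, slot = -1;

	if (start > end)
		return (EINVAL);
	for (i = 0; i < t->scheme->max_entries; i++) {
		const struct part_entry *e = &t->entry[i];

		if (!e->used) {
			if (slot < 0)
				slot = i;	/* remember first free slot */
			continue;
		}
		if (start <= e->end && end >= e->start)
			return (ENOSPC);	/* overlaps an existing entry */
	}
	if (slot < 0)
		return (ENOSPC);		/* scheme has no free slot */
	t->entry[slot].start = start;
	t->entry[slot].end = end;
	t->entry[slot].used = 1;
	return (0);
}
```

A tool would issue the same request regardless of scheme; a
table whose scheme allows only four entries would simply
refuse a fifth.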

> I'm trying to find out what the benefit of stuffing it into
> the kernel is, and the only thing I can think of would be
> "it doesn't have to live in userland then".  But if it
> still has to live in userland, then what's the point ?

It doesn't have to live in userland anymore.

>> That said: libdisk is now at the root of various forms of evil,
>> including sysinstall(8) and its deadbeat offspring sade(8).
>> Something needs to be done, and done right, if we want to stop
>> this madness...
>
> I fully agree, I'm just trying to make sure I understand where
> you're headed, having been there myself, I know how many nasty
> corners there are.

Understood. If I cannot defend myself in this discussion, we
know that I'm headed for disaster.

>>>> It's the application that should exhibit artificial intelligence
>>>> (if at all), not the kernel.
>>>
>>> So what is the advantage of editing the metadata in the kernel
>>> instead of userland ?
>>
>> Abstraction. Userland does not know or care what the on-disk
>> meta-data looks like. It performs elementary operations that
>> every partitioning scheme supports (modulo "extensions") and
>> since the kernel needs to be involved anyway, it makes sense
>> to have it involved at the basic level to simplify checks
>> and to increase flexibility.
>
> Well, this is where I don't fully agree.  Userland will
> need to know about the on-disk metadata differences, because
> partition-id's UUID's and similar has to come from somewhere,
> and somebody needs to know to align MBR:s1 on the second
> track etc etc.

The g_part class works with aliases. A partition type of
@freebsd-ufs translates to different scheme-dependent codes
that represent the type. For GPT this is the corresponding
UUID, for APM it's the corresponding string and for the BSD
disklabel this translates to FS_FFS.

However, the g_part class does not itself deal with partition
types. It's handled by the scheme-specific code. This means
that tools can pass scheme-specific codes. Typically these
scheme-specific codes are used only when the user wants to
create partitions foreign to FreeBSD.
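
The alias translation described above can be sketched as a
small lookup table. This is an illustration, not the g_part
implementation: part_type_resolve, the enum and the table
layout are hypothetical, and the literal codes (the GPT UUID,
the APM string and the disklabel type number) are given from
memory and should be checked against the real headers.
Anything not starting with '@' is treated as a scheme-specific
code and passed through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scheme identifiers, used as indices below. */
enum part_scheme_id { SCHEME_GPT, SCHEME_APM, SCHEME_BSD, SCHEME_COUNT };

struct part_alias {
	const char *alias;
	const char *code[SCHEME_COUNT];	/* one code per scheme */
};

/* Illustrative values only; verify against the scheme headers. */
static const struct part_alias part_aliases[] = {
	{ "@freebsd-ufs", {
	    "516e7cb6-6ecf-11d6-8ff8-00022d09712b",	/* GPT: type UUID */
	    "FreeBSD-UFS",				/* APM: type string */
	    "7"						/* BSD: FS_FFS number */
	} },
};

/*
 * Translate a partition type for the given scheme. Returns the
 * scheme-dependent code, the input itself if it is not an alias,
 * or NULL if the alias is unknown.
 */
const char *
part_type_resolve(enum part_scheme_id scheme, const char *type)
{
	size_t i;

	if (type[0] != '@')
		return (type);	/* scheme-specific code, passed through */
	for (i = 0; i < sizeof(part_aliases) / sizeof(part_aliases[0]); i++)
		if (strcmp(part_aliases[i].alias, type) == 0)
			return (part_aliases[i].code[scheme]);
	return (NULL);
}
```

In this sketch the table lives in one place for brevity; in
the design described above each scheme would own its own
column of it.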

>> Giving the kernel an image of what it needs to write to disk
>> leaves a big gap between the current state and the new state
>> and checking whether the new state is at all possible becomes
>> very hard if not impossible in cases.
>
> Works for the classes I've written:  The taste function
> gets called before the write to validate the new metadata,
> and afterwards to instantiate it.
>
> What or where is this an impossible plan ?

It's not impossible. Just not reusable and/or scalable. Every
partitioning scheme has to support the act of adding one or
more partitions, but only one partitioning scheme will need
to support the writing of an MBR.

Adding support for a new partitioning scheme therefore means
copying both a GEOM class and the tool that defines and/or
modifies partitions, and then modifying the copies for the
new scheme. This has already resulted in naming conflicts
and forced us to make disklabel a link to bsdlabel or
sunlabel (or something like that).

>> The kernel can return error strings, but that's fundamentally
>> the wrong thing to do, because that would mean that you need
>> to add i18n or l10n to the kernel.
>
> This is probably a more political question than anything.  All the
> people I asked agreed that english errormessages were way bettern
> than an EINVAL that could mean 12 different things.

I'm sure they all understand English and don't localize
their systems (if at all possible).

>> When the kernel is involved for each step and checks each
>> step, the user will have direct feedback to its actions
>> and as such will be able to understand better what went
>> wrong and will therefore be able to take appropriate
>> action.
>
> These same checks could be carried out in userland, where i18n
> is much easier ?  Why bother the kernel with a check userland
> can do ?

Not all checks can be performed in userland. Things like whether
devices are opened (and thus whether a partition can be removed)
need kernel involvement. Also consider concurrency: keeping the
work local to the process creates big race conditions when some
other sysadmin is doing the same thing. What about devices that
went away or different devices that replaced a disk you're
working on?
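
To make the open-device case concrete, here is a minimal
sketch of a "delete" verb that refuses to remove a partition
that is still in use. The structure and names (part_slot,
open_count, part_delete) are made up for this example; only
the kernel has a trustworthy open count, and a real
implementation would have to perform the check and the
removal under whatever lock serializes such changes.

```c
#include <errno.h>

/* Hypothetical in-kernel view of one partition. */
struct part_slot {
	int used;		/* slot holds a partition */
	int open_count;		/* current number of openers */
};

/*
 * "Delete" verb: only the kernel knows whether the partition is
 * open, so only the kernel can make this decision safely.
 */
int
part_delete(struct part_slot *s)
{
	if (!s->used)
		return (ENOENT);	/* nothing to delete */
	if (s->open_count > 0)
		return (EBUSY);		/* still open, refuse */
	s->used = 0;
	return (0);
}
```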

>>> If you could have writte a generic partitioning tool that didn't
>>> know about the different formats, then I could see the point,
>>> but having to have the code both in userland and in the kernel
>>> makes little no sense to me, in particular given how seldom
>>> it is used.
>>
>> That's the point. A single partitioning tool will be written
>> and it will not know about the on-disk format of the meta-data.
>
> But it will know about meta-data idosyncracies like parition ID,
> alignment, size restrictions, uuids etc etc, so now, instead of
> having it all collected on place, we get it spread with half
> in the kernel and half in userland ?

Knowledge is either in the kernel or in userland. Ideally no piece
of knowledge is present in both. See also below.

> In particular, you will need facilites for writing boot code and
> other non-partition metadata, and those are, by definition type
> specific.

Yes. Boot code handling has not been implemented. I considered
the implications and came to the conclusion that the kernel
cannot validate the correctness of boot code; only whether the
size fits the space. This means that the responsibility lies
with the tool to help out the user. As such, it will be the
tool that has the knowledge and not the kernel. The only thing
the kernel provides is a verb to allow a blob to be passed to
the partitioning scheme, and it's the partitioning scheme that
deposits the blob in the right space.
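
Since this is not implemented, the following is only a sketch
of how such a verb might behave, under assumptions of my own:
the names part_bootarea and part_bootcode, the EFBIG error and
the zero-padding of unused space are all hypothetical. It
shows the one check the kernel can make, namely that the blob
fits the space the scheme reserves for it.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a scheme's boot code space. */
struct part_bootarea {
	unsigned char *buf;	/* where the scheme keeps boot code */
	size_t size;		/* bytes available, scheme-specific */
};

/*
 * "Bootcode" verb: the content is opaque; the only validation is
 * that it fits. Unused space is zeroed (an assumption).
 */
int
part_bootcode(struct part_bootarea *area, const void *blob, size_t len)
{
	if (len > area->size)
		return (EFBIG);		/* blob too large, area untouched */
	memcpy(area->buf, blob, len);
	memset(area->buf + len, 0, area->size - len);
	return (0);
}
```

With an MBR, for instance, the area would be the 446 bytes in
front of the partition table, so a full 512-byte image would
be rejected.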

-- 
Marcel Moolenaar
xcllnt@mac.com