From: Marcel Moolenaar
Date: Tue, 6 Feb 2007 15:10:55 -0800
To: Poul-Henning Kamp
Cc: geom@FreeBSD.org
Subject: Re: New g_part class

On Feb 6, 2007, at 11:43 AM, Poul-Henning Kamp wrote:

>> With g_part the least important and most discriminating aspect
>> of partitioning is abstracted: the on-disk format for storing
>> the meta-data. This, I believe, is the right approach.
>
> But couldn't this be equally well abstracted in userland ?

So far we have been unsuccessful. Every partitioning scheme has its
own tool, so we don't even abstract there. The user is faced with the
differences. Sysinstall is supposed to abstract the details from the
user, but it has turned into a non-portable, obstinate piece of code
that is so hard to maintain that new partitioning schemes are simply
not added anymore: it's too difficult. Not to mention that it's
impossible to label a disk with any scheme other than the single one
supported for that particular architecture. As an example, people seem
to like GPT to partition big disks on i386/amd64 and only use MBR for
booting purposes. This cannot be done in sysinstall at all.

On the other hand, Linux has parted. This seems to be an example of
where the abstraction is successful. At least, to a certain extent.
Unfortunately, parted likes to write the data to disk itself and as
such has knowledge about each and every partitioning scheme itself.
This then is an example of having the abstraction in userland. Since
GEOM has the ctlreq interface for this and direct disk access is,
generally speaking, not allowed, it's not an example that fits our use
case entirely.

Also, we share files between kernel and userland for encoding and
decoding MBRs and BSD labels. This is the immediate result of having
detailed knowledge in two places: the kernel and the tool that works
with the partitioning scheme. Since we do need the knowledge in the
kernel to begin with, the logical step to avoid duplication is to
eliminate the need for this knowledge from userland. This is the
g_part approach.

With g_part, the on-disk layout is kept in the kernel so that detailed
knowledge resides in one place, and the elementary operations are the
handshake between kernel and userland. The handshake already provides
the abstraction.

> I'm trying to find out what the benefit of stuffing it into
> the kernel is, and the only thing I can think of would be
> "it doesn't have to live in userland then".
> But if it still has to live in userland, then what's the point ?

It doesn't have to live in userland anymore.

>> That said: libdisk is now at the root of various forms of evil,
>> including sysinstall(8) and its deadbeat offspring sade(8).
>> Something needs to be done, and done right, if we want to stop
>> this madness...
>
> I fully agree, I'm just trying to make sure I understand where
> you're headed, having been there myself, I know how many nasty
> corners there are.

Understood. If I cannot defend myself in this discussion, we know
that I'm headed for disaster.

>>>> It's the application that should exhibit artificial intelligence
>>>> (if at all), not the kernel.
>>>
>>> So what is the advantage of editing the metadata in the kernel
>>> instead of userland ?
>>
>> Abstraction. Userland does not know or care what the on-disk
>> meta-data looks like. It performs elementary operations that
>> every partitioning scheme supports (modulo "extensions") and
>> since the kernel needs to be involved anyway, it makes sense
>> to have it involved at the basic level to simplify checks
>> and to increase flexibility.
>
> Well, this is where I don't fully agree. Userland will
> need to know about the on-disk metadata differences, because
> partition-id's UUID's and similar has to come from somewhere,
> and somebody needs to know to align MBR:s1 on the second
> track etc etc.

The g_part class works with aliases. A partition type of
@freebsd-ufs translates to different scheme-dependent codes that
represent the type. For GPT this is the corresponding UUID, for APM
it's the corresponding string and for the BSD disklabel this
translates to FS_FFS.

However, the g_part class does not itself deal with partition types.
They are handled by the scheme-specific code. This means that tools
can also pass scheme-specific codes directly. Typically these
scheme-specific codes are used only when the user wants to create
partitions foreign to FreeBSD.
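To make the alias idea concrete, here is a minimal sketch in Python of
how scheme-specific code could resolve a type. This is not the g_part
code itself: the table layout and the name resolve_type are assumptions
made up for illustration, and the concrete codes in the table should be
checked against the respective scheme headers before relying on them.

```python
# Hypothetical sketch of alias resolution in scheme-specific code.
# ALIASES and resolve_type are illustrative names, not part of g_part.
# The codes below are illustrative; verify them against the scheme headers.
ALIASES = {
    "gpt": {"@freebsd-ufs": "516e7cb6-6ecf-11d6-8ff8-00022d09712b"},
    "apm": {"@freebsd-ufs": "FreeBSD-UFS"},
    "bsd": {"@freebsd-ufs": "FS_FFS"},
}


def resolve_type(scheme, ptype):
    """Translate a partition type for the given scheme.

    Names starting with '@' are aliases and must be known to the scheme.
    Anything else is taken to be a scheme-specific code and is passed
    through unchanged, which is how foreign partition types get created.
    """
    table = ALIASES.get(scheme)
    if table is None:
        raise ValueError("unknown scheme: " + scheme)
    if ptype.startswith("@"):
        if ptype not in table:
            raise ValueError("alias %s not supported by %s" % (ptype, scheme))
        return table[ptype]
    return ptype
```

The tool only ever needs to know the alias. A scheme that has no
equivalent for an alias rejects it, and a raw code such as an arbitrary
UUID goes through untouched.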
>> Giving the kernel an image of what it needs to write to disk
>> leaves a big gap between the current state and the new state
>> and checking whether the new state is at all possible becomes
>> very hard if not impossible in cases.
>
> Works for the classes I've written: The taste function
> gets called before the write to validate the new metadata,
> and afterwards to instantiate it.
>
> What or where is this an impossible plan ?

It's not impossible. Just not reusable and/or scalable. Every
partitioning scheme has to support the act of adding one or more
partitions, but only one partitioning scheme will need to support the
writing of an MBR. Adding support for a new partitioning scheme will
therefore result in copying both some GEOM class and the tool used to
define and/or modify partitions, and then modifying both for the new
scheme. This has already resulted in naming conflicts and forced us
to make disklabel a link to bsdlabel or sunlabel (or something like
that).

>> The kernel can return error strings, but that's fundamentally
>> the wrong thing to do, because that would mean that you need
>> to add i18n or l10n to the kernel.
>
> This is probably a more political question than anything. All the
> people I asked agreed that english error messages were way better
> than an EINVAL that could mean 12 different things.

I'm sure they all understand English and don't localize their
systems (if at all possible).

>> When the kernel is involved for each step and checks each
>> step, the user will have direct feedback to its actions
>> and as such will be able to understand better what went
>> wrong and will therefore be able to take appropriate
>> action.
>
> These same checks could be carried out in userland, where i18n
> is much easier ? Why bother the kernel with a check userland
> can do ?

Not all checks can be performed in userland. Things like whether
devices are opened (and thus whether a partition can be removed)
need kernel involvement.
Also consider concurrency: keeping the work local to the process
creates big race conditions when some other sysadmin is doing the
same thing. What about devices that went away, or different devices
that replaced a disk you're working on?

>>> If you could have written a generic partitioning tool that didn't
>>> know about the different formats, then I could see the point,
>>> but having to have the code both in userland and in the kernel
>>> makes little to no sense to me, in particular given how seldom
>>> it is used.
>>
>> That's the point. A single partitioning tool will be written
>> and it will not know about the on-disk format of the meta-data.
>
> But it will know about meta-data idiosyncrasies like partition ID,
> alignment, size restrictions, uuids etc etc, so now, instead of
> having it all collected in one place, we get it spread with half
> in the kernel and half in userland ?

Knowledge is either in the kernel or in userland. Ideally no piece of
knowledge is present in both. See also below.

> In particular, you will need facilities for writing boot code and
> other non-partition metadata, and those are, by definition type
> specific.

Yes. Boot code handling has not been implemented. I considered the
implications and came to the conclusion that the kernel cannot
validate the correctness of boot code; it can only check whether the
size fits the space. This means that the responsibility lies with the
tool to help out the user. As such, it will be the tool that has the
knowledge and not the kernel. The only thing the kernel provides is a
verb to allow a BLOB to be passed to the partitioning scheme, and it's
the partitioning scheme that deposits the blob in the right place.

-- 
Marcel Moolenaar
xcllnt@mac.com