From owner-freebsd-arch@FreeBSD.ORG Sun Aug 22 19:05:50 2010
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2518010656A5;
	Sun, 22 Aug 2010 19:05:50 +0000 (UTC)
	(envelope-from max@love2party.net)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.187])
	by mx1.freebsd.org (Postfix) with ESMTP id C12AC8FC1F;
	Sun, 22 Aug 2010 19:05:49 +0000 (UTC)
Received: from f8x64.laiers.local (dslb-088-066-049-083.pools.arcor-ip.net [88.66.49.83])
	by mrelayeu.kundenserver.de (node=mreu2) with ESMTP (Nemesis)
	id 0Mao5W-1OYLdm02ws-00JuQt; Sun, 22 Aug 2010 21:05:48 +0200
From: Max Laier
Organization: FreeBSD
To: Stephan Uphoff
Date: Sun, 22 Aug 2010 21:05:47 +0200
User-Agent: KMail/1.13.5 (FreeBSD/8.1-RELEASE; KDE/4.4.5; amd64; ; )
References: <201008160515.21412.max@love2party.net> <4C7042BA.8000402@freebsd.org>
In-Reply-To: <4C7042BA.8000402@freebsd.org>
MIME-Version: 1.0
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201008222105.47276.max@love2party.net>
X-Provags-ID: V02:K0:YNor0iNpkdfjelCFoSMWyZtpXJhZ7ohG2c5/aghqCgx
	Y0kFjQW7ZEyXAxH3BAGR4CG/+A+ExZsSiYd7FNRUFc2gSxOhnj
	9Vte5ZrKUUlT5A+foYJ0cZDmo3AbZpHypBmITXvyePacesCCag
	vJMxDHLlrIfSRo1pPmWYt4Q0KgcKqgeDxRzSKKX/G+r7VQDzFj
	n9xX7r3jji5syyUh2awzQ==
Cc: freebsd-arch@freebsd.org
Subject: Re: rmlock(9) two additions
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 22 Aug 2010 19:05:50 -0000

On Saturday 21 August 2010 23:18:50 Stephan Uphoff wrote:
> Max Laier wrote:
> > Hi,
> >
> > I'd like to run two additions to rmlock(9) by you:
> >
> > 1) See the attached patch. It adds rm_try_rlock() which avoids taking
> > the mutex if the lock is currently in a write episode.
> > The only overhead to the hot path is an additional argument and a
> > return value for _rm_rlock*. If you are worried about that, it can
> > obviously be done in a separate code path, but I reckon it not worth
> > the code crunch. Finally, there is one additional branch to check the
> > "trylock" argument, but that's well past the hot path.
>
> The rm_try_rlock() will never succeed when the lock was last locked as
> writer.
> Maybe add:
>
> void
> _rm_wunlock(struct rmlock *rm)
> {
> +       rm->rm_noreadtoken = 0;
>         mtx_unlock(&rm->rm_lock);
> }
>
> But then
>
> _rm_wlock(struct rmlock *rm)
>
> always needs to use IPIs - even when the lock was used last as a write
> lock.

I don't think this is a big problem - I can't see many use cases for rmlocks
where you'd routinely see repeated wlocks without rlocks between them.
However, I think there should be a release memory barrier before/while
clearing rm_noreadtoken, otherwise readers may not see the data writes that
are supposed to be protected by the lock?!?

> Alternatively something like:
>
> if (trylock) {
>         if (mtx_trylock(&rm->rm_lock) == 0)
>                 return (0);
> } else {
>         mtx_lock(&rm->rm_lock);
> }
>
> would work - but has a race. Two readers colliding just after a writer
> (with the second not succeeding in trylocking the mutex) leads to not
> granting the read lock (although it would be possible to do so).

Also not too much of a problem, in my opinion. There is no time order between
read locks and thus it is okay to grant one and fail another - even though
they arrive "at the same time". A caller to trylock must always accept
failure (unless it's a recursive use - and this is handled).

> Let me think about it a bit.

I believe either solution will work. #1 is a bit more in the spirit of the
rmlock - i.e. make the read case cheap and the write case expensive. I'm just
not sure about the lock semantics.

I guess an

        atomic_store_rel_int(&rm->rm_noreadtoken, 0);

should work.
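To illustrate the ordering concern, here is a minimal userspace sketch in C11 atomics. It is an analogy only, not the rmlock code: `noreadtoken` and `shared_data` are hypothetical stand-ins for rm_noreadtoken and the protected data, and the `memory_order_release` store is assumed to play the role that atomic_store_rel_int() would play in the kernel. The point is that the data write is ordered before the store that re-opens the fast path, and the reader pairs it with an acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for rm->rm_noreadtoken and the protected data. */
static atomic_int noreadtoken = 1;	/* 1: readers must take the slow path */
static int shared_data;			/* data "protected" by the lock */

/*
 * Writer side: modify the data first, then re-open the fast path with a
 * release store so the data write cannot be reordered after it.
 */
static void
writer_publish(int value)
{
	shared_data = value;
	atomic_store_explicit(&noreadtoken, 0, memory_order_release);
}

/*
 * Reader side: the acquire load pairs with the release store above, so a
 * reader that observes noreadtoken == 0 also observes the data write.
 * Returns false when the fast path is closed (a trylock-style failure).
 */
static bool
reader_try_fast(int *out)
{
	assert(out != NULL);
	if (atomic_load_explicit(&noreadtoken, memory_order_acquire) != 0)
		return (false);
	*out = shared_data;
	return (true);
}
```

A reader that calls reader_try_fast() before writer_publish() has run simply fails, which matches the rule that a trylock caller must always accept failure.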
> > 2) No code for this yet - testing the waters first. I'd like to add the
> > ability to replace the mutex for writer synchronization with a general
> > lock - especially to be able to use a sx(9) lock here.
> >
> > The reason for #2 is the following use case in a debugging facility:
> >
> > "reader":
> >         if (rm_try_rlock()) {
> >                 grab_per_cpu_buffer();
> >                 fill_per_cpu_buffer();
> >                 rm_runlock();
> >         }
> >
> > "writer" - better exclusive access thread:
> >         rm_wlock();
> >         collect_buffers_and_copy_out_to_userspace();
> >         rm_wunlock();
> >
> > This is much cleaner and possibly cheaper than the various hand rolled
> > versions I've come across, that try to get the same synchronization with
> > atomic operations. If we could sleep with the wlock held, we can also
> > avoid copying the buffer contents, or swapping buffers.
> >
> > Is there any concern about either of this? Any objection? Input?
>
> Will take a look at your second patch soonish.
>
> Just ask per IPI for a copy of per cpu buffers (but not a copy to user
> space) - and delay the copy when an update is in progress?

Think huge circular per-CPU buffers that are filled at high rates. Of course
we could allocate new buffers and swap out while locked, but since this is a
debugging facility it is better to miss a few events while copying out,
rather than spending twice the memory.
Thanks,
  Max

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 21:41:05 2010
From: Marcel Moolenaar
To: FreeBSD Arch
Date: Mon, 23 Aug 2010 14:30:16 -0700
Subject: RFC: enhancing the root mount logic

All,

In embedded products, software is possibly installed as an image onto
an actual storage device. This means that mounting the storage device
as root is not enough to have a usable root file system. The rough
draft below is an idea to enhance the root mount from having ad-hoc
quirks to a well-defined and recursive mechanism to allow a wide
range of use cases.

The root mount logic is recursive as follows:
1.  The kernel mounts devfs as root (as it is now).
2.  The kernel will re-mount root by virtue of reading a file, called
    /.mount.conf, in the current root file system and following the
    directives in it. devfs synthesizes the contents of this file.

At each iteration, the kernel will:
1.  move the devfs mount from /dev in the old file system to /dev in
    the new file system.
2.  As per the directives or unconditionally, the kernel will re-mount
    the old root file system under /.mount (or some other name) within
    the new file system.

devfs will synthesize the contents of /.mount.conf as per the kernel
configuration and tunables. The administrator (or install process)
will create and populate /.mount.conf for all other cases.

Directives in /.mount.conf are envisioned to be something like:

    {FS}:{MOUNTPOINT}           e.g. ufs:/dev/da0
        a root mount alternative. The order of the alternatives in
        the file determines the priority.

    .ask
        a root mount alternative that asks the operator to specify
        what the root mount should be.

    .wait N                     e.g. .wait 5
        wait at most N seconds for a root mount alternative to
        succeed. If an alternative does not succeed within that
        time, move on to the next alternative.

    .onfail {panic|reboot|retry|continue}
        Tells the kernel what to do in case it can't successfully
        complete the root mount as directed to.

The .wait directive works better (probably) if we have events that
signify the arrival of a file system or device special file, so that
we can wait for at most N seconds after the last event. This also
allows us to wait for a separate interval between events.

As an example, consider:

    [devfs] /.mount.conf:
        ufs:/dev/da0
        .ask
        .wait 5
        .onfail panic

    [ufs:/dev/da0] /.mount.conf
        md0:/images/OS-image-1.0.iso
        unionfs:/jail/freebsd-8-stable
        .wait 0
        .onfail continue

In the example, the kernel will mount devfs, read /.mount.conf and
wait at most 5 seconds to mount the UFS on /dev/da0. If that fails,
the kernel will ask (once) and panic in case of failure.

If the UFS root mount succeeded, the kernel will re-mount devfs
underneath /dev. Since this is the first non-devfs root file system,
the kernel will not re-mount the old root under /.mount.

Since there's a /.mount.conf on the UFS, the kernel will read it
and repeat the process. First it'll try and mount the OS image
in /images/OS-image-1.0.iso and if it's not present will try to
mount some -stable 8 chroot using unionfs (not necessarily a
real-world example here :-) If either fails, the kernel will
continue booting using the current root file system. Assuming that
the image is present, the kernel will re-mount root, move devfs
underneath /dev in the MD root and remount ufs:/dev/da0 under
/.mount in the MD root. This gives the following picture:

    /           md0:[ufs:/dev/da0]/images/OS-image-1.0.iso
    /.mount     ufs:/dev/da0
    /dev        devfs

Things not explicitly touched upon:
o  root mount options
o  directives to instruct the kernel what to run as the initial
   process to eliminate the rather ad-hoc hardcoding. E.g.:
        .init /sbin/init
        .init /sbin/init.old

Is this something that people feel is worth fleshing out and
prototyping?
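As a rough illustration of how small the directive grammar above is, here is a userspace sketch of a per-line parser in C. Everything in it is hypothetical: the names `md_kind`, `md_directive` and `md_parse_line`, the buffer sizes, and the choice to reject unknown dot-directives are assumptions made for the example, not part of the proposal. It handles only the four directive forms listed above and expects the caller to have stripped the trailing newline.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical directive kinds; MD_INVALID is 0 so memset() yields it. */
enum md_kind { MD_INVALID, MD_MOUNT, MD_ASK, MD_WAIT, MD_ONFAIL };

struct md_directive {
	enum md_kind kind;
	char fs[32];	/* MD_MOUNT: file system type, e.g. "ufs" */
	char arg[128];	/* MD_MOUNT: what to mount; MD_ONFAIL: action */
	long seconds;	/* MD_WAIT: timeout in seconds */
};

/* Parse one line without its newline. Returns 0 on success, -1 on error. */
static int
md_parse_line(const char *line, struct md_directive *d)
{
	static const char *actions[] = { "panic", "reboot", "retry", "continue" };
	const char *colon;
	char *end;
	long secs;
	size_t i, n;

	assert(line != NULL && d != NULL);
	memset(d, 0, sizeof(*d));

	if (strcmp(line, ".ask") == 0) {
		d->kind = MD_ASK;
		return (0);
	}
	if (strncmp(line, ".wait ", 6) == 0) {
		secs = strtol(line + 6, &end, 10);
		if (end == line + 6 || *end != '\0' || secs < 0)
			return (-1);
		d->seconds = secs;
		d->kind = MD_WAIT;
		return (0);
	}
	if (strncmp(line, ".onfail ", 8) == 0) {
		for (i = 0; i < sizeof(actions) / sizeof(actions[0]); i++) {
			if (strcmp(line + 8, actions[i]) == 0) {
				strcpy(d->arg, actions[i]);
				d->kind = MD_ONFAIL;
				return (0);
			}
		}
		return (-1);
	}
	if (line[0] == '.')		/* unknown dot-directive */
		return (-1);

	/* Otherwise expect {FS}:{MOUNTPOINT}, split at the first colon. */
	colon = strchr(line, ':');
	if (colon == NULL || colon == line || colon[1] == '\0')
		return (-1);
	n = (size_t)(colon - line);
	if (n >= sizeof(d->fs) || strlen(colon + 1) >= sizeof(d->arg))
		return (-1);
	memcpy(d->fs, line, n);		/* fs[] was zeroed, so it stays terminated */
	strcpy(d->arg, colon + 1);
	d->kind = MD_MOUNT;
	return (0);
}
```

With this sketch, "ufs:/dev/da0" yields an MD_MOUNT entry with fs "ufs" and arg "/dev/da0", while ".wait 5" yields MD_WAIT with seconds set to 5. An in-kernel version would of course use kernel string helpers rather than libc.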
--
Marcel Moolenaar
marcelm@juniper.net

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 21:49:47 2010
From: Ed Schouten
To: Marcel Moolenaar
Cc: FreeBSD Arch
Date: Mon, 23 Aug 2010 23:49:46 +0200
Subject: Re: RFC: enhancing the root mount logic
Message-ID: <20100823214946.GF64651@hoeg.nl>

* Marcel Moolenaar wrote:
> Is this something that people feel is worth fleshing out and
> prototyping?

Sounds awesome! This would make my writable boot CD a lot more elegant
than it is right now. Have you thought about things like possible
endless loops? Say, you mount a unionfs on the root of the fs itself.
This may cause the original .mount.conf to be reinterpreted, right?
--
Ed Schouten
WWW: http://80386.nl/

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 22:07:32 2010
From: mdf@FreeBSD.org
To: Marcel Moolenaar
Cc: FreeBSD Arch
Date: Mon, 23 Aug 2010 15:07:30 -0700
Subject: Re: RFC: enhancing the root mount logic

On Mon, Aug 23, 2010 at 2:30 PM, Marcel Moolenaar wrote:
> In embedded products, software is possibly installed as an image onto
> an actual storage device. This means that mounting the storage device
> as root is not enough to have a usable root file system. The rough
> draft below is an idea to enhance the root mount from having ad-hoc
> quirks to a well-defined and recursive mechanism to allow a wide-
> range of use cases.

I am not making any claims to the overall desirability of this, but as
a suggestion for the file format:

>   [devfs]      /.mount.conf:
>        ufs:/dev/da0
>        .ask
>        .wait 5
>        .onfail panic

To me, this should wait 0 seconds (or whatever the default is) until
after the .ask mount point has been tried. I'd suggest something like:

  [devfs]      /.mount.conf:
       .wait 10
       ufs:/dev/da0 # wait up to 10 seconds for ufs
       .wait 5
       .ask # wait up to 5 seconds for the prompt-returned filesystem
       .onfail panic

The two reasons for such a usage:
1) simplifies parsing, since the file only needs to be read to a mount
directive, not read in its entirety.
2) allows different timeouts for each root mount location

I could also imagine, instead, a .mount directive, so that all
"commands" start with a '.'
which would be like:

.mount ufs:/dev/da0 8

which would be an 8 second timeout on the specified mount point.

Anyways, as a flexible mechanism this sounds reasonable. I have no
idea how it compares to what other operating systems do, which is only
relevant insofar as making migration from another platform easier.

Thanks,
matthew

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 22:57:53 2010
From: Max Laier
Organization: FreeBSD
To: freebsd-arch@freebsd.org
Cc: Stephan Uphoff
Date: Tue, 24 Aug 2010 00:45:15 +0200
Subject: Re: rmlock(9) two additions
Message-Id: <201008240045.15998.max@laiers.net>
In-Reply-To: <201008222105.47276.max@love2party.net>
References: <201008160515.21412.max@love2party.net> <4C7042BA.8000402@freebsd.org> <201008222105.47276.max@love2party.net>

On Sunday 22 August 2010 21:05:47 Max Laier wrote:
> On Saturday 21 August 2010 23:18:50 Stephan Uphoff wrote:
> > Max Laier wrote:
> > ...
> > The rm_try_rlock() will never succeed when the lock was last locked as
> > writer.
> > Maybe add:
> >
> > void
> > _rm_wunlock(struct rmlock *rm)
> > {
> > +       rm->rm_noreadtoken = 0;
> >
> >         mtx_unlock(&rm->rm_lock);
> >
> > }
> >
> > But then
> >
> > _rm_wlock(struct rmlock *rm)
> >
> > always needs to use IPIs - even when the lock was used last as a write
> > lock.
>
> I don't think this is a big problem - I can't see many use cases for
> rmlocks where you'd routinely see repeated wlocks without rlocks between
> them. However, I think there should be a release memory barrier
> before/while clearing rm_noreadtoken, otherwise readers may not see the
> data writes that are supposed to be protected by the lock?!?
>
> ...
> I believe either solution will work. #1 is a bit more in the spirit of the
> rmlock - i.e. make the read case cheap and the write case expensive. I'm
> just not sure about the lock semantics.
>
> I guess a
>
> atomic_store_rel_int(&rm->rm_noreadtoken, 0);
>
> should work.

Thinking about this for a while makes me wonder: Are readers really
guaranteed to see all the updates of a writer - even in the current version?

Example:

writer thread:
        rm_wlock();     // lock mtx, IPI, wait for reader drain
        modify_data();
        rm_wunlock();   // unlock mtx (this does an atomic_*_rel)

reader thread #1:
        // failed to get the lock, spinning/waiting on mtx
        mtx_lock();     // this does an atomic_*_acq -> this CPU sees the new data
        rm->rm_noreadtoken = 0; // now new readers can enter quickly
        ...

reader thread #2 (on a different CPU than reader #1):
        // enters rm_rlock() "after" rm_noreadtoken was reset -> no memory barrier
        // does this thread see the modifications?

I realize this is a somewhat pathological case, but would it be possible in
theory?
Or is the compiler_memory_barrier() actually enough? Otherwise, I think
we need an IPI on rm_wunlock() that does an atomic_*_acq on every CPU.

Thoughts?
  Max

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 23:13:16 2010
From: "M. Warner Losh"
To: marcelm@juniper.net
Cc: freebsd-arch@FreeBSD.org
Date: Mon, 23 Aug 2010 17:12:01 -0600 (MDT)
Subject: Re: RFC: enhancing the root mount logic
Message-Id: <20100823.171201.107001114053031707.imp@bsdimp.com>

In message: Marcel Moolenaar writes:
: All,
:
: In embedded products, software is possibly installed as an image onto
: an actual storage device. This means that mounting the storage device
: as root is not enough to have a usable root file system. The rough
: draft below is an idea to enhance the root mount from having ad-hoc
: quirks to a well-defined and recursive mechanism to allow a wide
: range of use cases.
:
: The root mount logic is recursive as follows:
: 1.  The kernel mounts devfs as root (as it is now).
: 2.  The kernel will re-mount root by virtue of reading a file, called
:     /.mount.conf, in the current root file system and following the
:     directives in it. devfs synthesizes the contents of this file.
:
: At each iteration, the kernel will:
: 1.  move the devfs mount from /dev in the old file system to /dev in
:     the new file system.
: 2.  As per the directives or unconditionally, the kernel will re-mount
:     the old root file system under /.mount (or some other name) within
:     the new file system.
:
: devfs will synthesize the contents of /.mount.conf as per the kernel
: configuration and tunables. The administrator (or install process)
: will create and populate /.mount.conf for all other cases.
:
: Directives in /.mount.conf are envisioned to be something like:
:
:     {FS}:{MOUNTPOINT}           e.g. ufs:/dev/da0
:         a root mount alternative. The order of the alternatives in
:         the file determines the priority.
:
:     .ask
:         a root mount alternative that asks the operator to specify
:         what the root mount should be.
:
:     .wait N                     e.g. .wait 5
:         wait at most N seconds for a root mount alternative to
:         succeed. If an alternative does not succeed within that
:         time, move on to the next alternative.
:
:     .onfail {panic|reboot|retry|continue}
:         Tells the kernel what to do in case it can't successfully
:         complete the root mount as directed to.
:
: The .wait directive works better (probably) if we have events that
: signify the arrival of a file system or device special file, so that
: we can wait for at most N seconds after the last event. This also
: allows us to wait for a separate interval between events.
:
: As an example, consider:
:
:     [devfs] /.mount.conf:
:         ufs:/dev/da0
:         .ask
:         .wait 5
:         .onfail panic
:
:     [ufs:/dev/da0] /.mount.conf
:         md0:/images/OS-image-1.0.iso
:         unionfs:/jail/freebsd-8-stable
:         .wait 0
:         .onfail continue
:
: In the example, the kernel will mount devfs, read /.mount.conf and
: wait at most 5 seconds to mount the UFS on /dev/da0. If that fails,
: the kernel will ask (once) and panic in case of failure.
:
: If the UFS root mount succeeded, the kernel will re-mount devfs
: underneath /dev. Since this is the first non-devfs root file system,
: the kernel will not re-mount the old root under /.mount.
:
: Since there's a /.mount.conf on the UFS, the kernel will read it
: and repeat the process. First it'll try and mount the OS image
: in /images/OS-image-1.0.iso and if it's not present will try to
: mount some -stable 8 chroot using unionfs (not necessarily a
: real-world example here :-) If either fails, the kernel will
: continue booting using the current root file system. Assuming that
: the image is present, the kernel will re-mount root, move devfs
: underneath /dev in the MD root and remount ufs:/dev/da0 under
: /.mount in the MD root. This gives the following picture:
:
:     /           md0:[ufs:/dev/da0]/images/OS-image-1.0.iso
:     /.mount     ufs:/dev/da0
:     /dev        devfs
:
: Things not explicitly touched upon:
: o  root mount options
: o  directives to instruct the kernel what to run as the initial
:    process to eliminate the rather ad-hoc hardcoding. E.g.:
:         .init /sbin/init
:         .init /sbin/init.old
:
: Is this something that people feel is worth fleshing out and
: prototyping?

This sounds very interesting. If kept simple, I could see how this
would make my life a lot easier.

However, all this scripting sounds a bit like a very simple shell in
the kernel. What advantages are there to this approach vs having the
ability to run a simple shell script or executable and "pivot" the root
to a new location?
And how do you emulate the mount_foo programs for foo filesystems? Some
of them do weird things that might not translate well into the kernel...

As you can see, I'm torn about how I feel about the idea. For simple
cases, I think it is great, but as complexity builds, I become less
sure. What if that iso image was compressed? What if I had a
software RAID of disks or flash devices? What about crypto? I know I
can handle those cases in /bin/sh, but will each new one require more
code in the kernel? What would df and/or mount tell you about the
now-hidden file systems?

Warner

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 23:43:28 2010
From: Marcel Moolenaar
To: "M. Warner Losh"
Cc: "freebsd-arch@FreeBSD.org"
Date: Mon, 23 Aug 2010 16:43:15 -0700
Subject: Re: RFC: enhancing the root mount logic
Message-id: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
In-reply-to: <20100823.171201.107001114053031707.imp@bsdimp.com>
References: <20100823.171201.107001114053031707.imp@bsdimp.com>

On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote:

*snip*

> However, all this scripting sounds a bit like a very simple shell in
> the kernel. What advantages are there to this approach vs having the
> ability to run a simple shell script or executable and "pivot" the root
> to a new location?

The 2 reasons for doing this in the kernel are:
1.  resiliency against ABI changes.
2.  allowing /sbin/init to come from the actual root file system.

Both points are impossible to handle efficiently or correctly if
you need user space support in getting to your actual root file
system. You basically have a catch-22 or bootstrap problem, which
a pure in-kernel solution doesn't have.

> And how do you emulate the mount_foo programs for
> foo filesystems? Some of them do weird things that might not
> translate well into the kernel...

True. I haven't fleshed that out, but I was hoping that nmount(2)
would have normalized most of this so that it's a non-issue, provided
we support mount options in this scheme.

If you have a concrete example of something that's not so trivial,
but critical to support, let me know and I'll take it into account.

> As you can see, I'm torn about how I feel about the idea. For simple
> cases, I think it is great, but as complexity builds, I become less
> sure. What if that iso image was compressed?

Can you elaborate how this is potentially a problem in this scheme,
but not for "manual" mounting?

> What if I had a
> software RAID of disks or flash devices?
I see no problem. In fact, the idea is triggered by switching to a
flash file system on a NAND flash.

> What about crypto?

See above. Can you elaborate?

> I know I
> can handle those cases in /bin/sh, but will each new one require more
> code in the kernel?

The way I see it is that the approach enhances how we now mount the
root file system. We have very limited flexibility. I do not claim
that my idea allows every possible variation, and I think it unfair
to expect that of the approach. If one has real complex requirements,
one can always just mount some file system on some storage device and
deal with the root mount in user space. I don't see how this prevents
that.

> What would df and/or mount tell you about the
> now-hidden file systems?

Can you explain what you mean by now-hidden file systems?

Thanks,

--
Marcel Moolenaar
xcllnt@mac.com

From owner-freebsd-arch@FreeBSD.ORG Mon Aug 23 23:44:24 2010
From: Marcel Moolenaar
To: Ed Schouten
Cc: FreeBSD Arch
Date: Mon, 23 Aug 2010 15:44:03 -0700
Subject: Re: RFC: enhancing the root mount logic
Message-id: <7318E60D-F00F-4519-A3E3-9CE8B752AE88@mac.com>
In-reply-to: <20100823214946.GF64651@hoeg.nl>
References: <20100823214946.GF64651@hoeg.nl>

On Aug 23, 2010, at 2:49 PM, Ed Schouten wrote:

> * Marcel Moolenaar wrote:
>> Is this something that people feel is worth fleshing out and
>> prototyping?
>
> Sounds awesome! This would make my writable boot cd a lot more elegant
> than it is right now. Have you thought about things like possible
> endless loops? Say, you mount a unionfs on the root of the fs itself.
> This may cause the original .mount.conf to be reinterpreted, right?

Right. I haven't thought about it. My off the cuff response is that
we should disallow it if the amount of effort required to detect it
is within reason. Alternatively, we could simply impose a global
limit on the depth of the recursion. Either appears reasonable to
me, but I may be overlooking something here...

Thoughts?
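For what it's worth, the "global limit on the depth" variant could be as small as the following sketch. The names are made up for illustration: `ROOT_MOUNT_MAXDEPTH`, `next_root_fn` and `root_mount_iterate` are hypothetical, roots are modelled as plain integers, and the callback stands in for "read /.mount.conf on the current root and pick the next one". The two example callbacks model a normal three-level chain and a root that keeps selecting itself.

```c
#include <assert.h>
#include <stddef.h>

#define	ROOT_MOUNT_MAXDEPTH	8	/* hypothetical global limit */

/*
 * Hypothetical callback: given the current root (an opaque index here),
 * return the root its /.mount.conf selects, or -1 if there is no
 * /.mount.conf and the iteration should stop.
 */
typedef int (*next_root_fn)(int current);

/* Returns the final root, or -1 if the depth limit was exceeded. */
static int
root_mount_iterate(int start, next_root_fn next)
{
	int cur, depth, n;

	assert(next != NULL);
	cur = start;
	for (depth = 0; (n = next(cur)) >= 0; depth++) {
		if (depth >= ROOT_MOUNT_MAXDEPTH)
			return (-1);	/* too many re-mounts: assume a loop */
		cur = n;
	}
	return (cur);
}

/* Example: devfs (0) -> ufs (1) -> md image (2), which has no conf. */
static int
chain_next(int current)
{
	return (current < 2 ? current + 1 : -1);
}

/* Example: a root whose /.mount.conf keeps selecting itself. */
static int
loop_next(int current)
{
	return (current);
}
```

Here root_mount_iterate(0, chain_next) settles on root 2, while root_mount_iterate(0, loop_next) gives up after ROOT_MOUNT_MAXDEPTH re-mounts. What to do on giving up (panic, continue with the current root) would presumably follow the .onfail directive.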
-- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 00:22:28 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8A80E1065693 for ; Tue, 24 Aug 2010 00:22:28 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 1F6EB8FC1E for ; Tue, 24 Aug 2010 00:22:27 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7O0EFrq057510; Mon, 23 Aug 2010 18:14:15 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Mon, 23 Aug 2010 18:14:24 -0600 (MDT) Message-Id: <20100823.181424.646155203640260173.imp@bsdimp.com> To: xcllnt@mac.com From: "M. Warner Losh" In-Reply-To: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 00:22:28 -0000 In message: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> Marcel Moolenaar writes: : : On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote: : : *snip* : : > However, all this scripting sounds a bit like a very simple shell in : > the kernel. What advantages are there to this approach vs having the : > ability to run a simple shell script or executable and "pivot" the root : > to a new location? : : The 2 reasons for doing this in the kernel are: : 1. resiliency against ABI changes. 
: 2. allowing /sbin/init to come from the actual root file system.
:
: Both points are impossible to handle efficiently or correctly if
: you need user space support in getting to your actual root file
: system. You basically have a catch-22 or bootstrap problem, which
: a pure in-kernel solution doesn't have.

OK. That makes sense. Without execing the new init, which may be a
problem with the current world view of init(8) and the kernel, you'd
have to have your final init on the first level root file system.

: > And how do you emulate the mount_foo programs for
: > foo filesystems? Some of them do weird things that might not
: > translate well into the kernel...
:
: True. I haven't flushed that out, but I was hoping that nmount(2)
: would have normalized most of this that it's a non-issue, provided
: we support mount options in this scheme.
:
: If you have a concrete example of something that's not so trivial,
: but critical to support, let me know and I'll take it into account.

mount_smbfs makes a connection to the remote system to do
authentication presently in mount_smbfs and initializes the smb
context before mounting the file system in the kernel. I don't know
if I'd call this a critical to support feature, but it was the first
"exception" to the rule that jumped into my head so I was curious if
you'd thought about it.

: > As you can see, I'm torn about how I feel about the idea. For simple
: > cases, I think it is great, but as complexity builds, I become less
: > sure. What if that iso image was compressed?
:
: Can you elaborate how this is potentially a problem in this scheme,
: but not for "manual" mounting?

You'd need a way to stack up different modules, since you'd need
geom_uzip over md0 to make it useful to the cd9660 code.

: > What if I had a
: > software RAID of disks or flash devices?
:
: I see no problem. In fact, the idea is triggered by switching to a
: flash file system on a NAND flash.

RAID of Flashes. Something that would need configuration.
But you may be correct: this level of flexibility may not be needed
and other concerns may trump it...

: > What about crypto?
:
: See above. Can you elaborate?

Same thing, but with a crypto key :)

: > I know I
: > can handle those cases in /bin/sh, but will each new one require more
: > code in the kernel?
:
: The way I see it is that the approach enhances how we now mount the
: root file system. We have very limited flexibility. I do not claim
: that my idea allows every possible variation, and I think it unfair
: to expect that of the approach. If one has real complex requirements,
: one can always just mount some file system on some storage device
: and deal with the root mount in user space. I don't see how this
: prevents that.

init(8) is the show stopper to a pivot root approach, unless you could
tell the init that's on the first level to simply exec /sbin/init to
pick up the new copy, but I don't know how happy that would make the
kernel.

: > What would df and/or mount tell you about the
: > now-hidden file systems?
:
: Can you explain what you mean by now-hidden file systems?

OK. Let's say we have a three-level scheme: /dev/nor0, which has the
initial root on it. Next up is foo.iso.gz, which is mounted read-only
on md0. Next up is geom_uzip, which presents the device as md0.uzip,
which finally gets mounted as root.
So would df show:

Filesystem     1024-blocks   Used  Avail Capacity  Mounted on
/dev/nor0             4096   4096      0   110%    /
/dev/md0.uzip        16000  16000      0   110%    /

or

Filesystem     1024-blocks   Used  Avail Capacity  Mounted on
/dev/nor0             4096   4096      0   110%    /.old_root
/dev/md0.uzip        16000  16000      0   110%    /

and if we had one more layer on nand:

Filesystem     1024-blocks   Used  Avail Capacity  Mounted on
/dev/nor0             4096   4096      0   110%    /
/dev/md0.uzip        16000  16000      0   110%    /
/dev/nand0          320000 300000  20000    82%    /

or

Filesystem     1024-blocks   Used  Avail Capacity  Mounted on
/dev/nor0             4096   4096      0   110%    /.old_root/.old_root
/dev/md0.uzip        16000  16000      0   110%    /.old_root
/dev/nand0          320000 300000  20000    82%    /

is the question I'm asking...

Right now you can mostly do a pivot-root-like thing by having init do
a chroot very early, possibly after executing a simple rc script to
put the second level root system online. init_script gets run very
early, followed by a chroot to init_chroot, followed by a mount of
devfs on /dev if necessary. However, when you do this, you often end
up with weird-looking df output since / isn't really / to df.

Anyway, the fact that we have a decoupled fork/exec really is what
led me to ask the question. It is useful to run arbitrary code
between the two, even if you usually run the same code... sometimes
you want to be different. I was thinking that this might be the same
way here. But, as you rightly point out, maybe there's too much
complexity in doing that and simpler is better.
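The fork/exec comparison above can be made concrete with a small POSIX sketch. It is not taken from init(8) or the kernel; `spawn_with_hook()`, `set_stage()` and the `ROOT_STAGE` variable are made-up names used only to show where "arbitrary code between the two" would run.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <sys/wait.h>

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Fork, run 'hook' in the child, then exec argv[0].  Returns the
 * child's exit status, or -1 if the child could not be started or
 * did not exit normally.
 */
static int
spawn_with_hook(void (*hook)(void), char *const argv[])
{
	pid_t pid;
	int status;

	assert(argv != NULL && argv[0] != NULL);
	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		/* Arbitrary code between fork and exec goes here. */
		if (hook != NULL)
			hook();
		execv(argv[0], argv);
		_exit(127);	/* only reached if exec failed */
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status));
}

/* Example hook: change what the new program sees before it starts. */
static void
set_stage(void)
{
	setenv("ROOT_STAGE", "second", 1);
}
```

Passing a different hook (or none) changes what happens before the new program starts without touching the code that starts it, which is the kind of flexibility being weighed against a fixed in-kernel sequence.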
Warner From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 00:23:51 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 44FBF10657CA for ; Tue, 24 Aug 2010 00:23:51 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id 28CE68FC0A for ; Tue, 24 Aug 2010 00:23:50 +0000 (UTC) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id 042A45B3B; Mon, 23 Aug 2010 17:23:49 -0700 (PDT) To: Marcel Moolenaar In-reply-to: Your message of "Mon, 23 Aug 2010 16:43:15 PDT." <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> Comments: In-reply-to Marcel Moolenaar message dated "Mon, 23 Aug 2010 16:43:15 -0700." Date: Mon, 23 Aug 2010 17:23:49 -0700 From: Bakul Shah Message-Id: <20100824002350.042A45B3B@mail.bitblocks.com> Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 00:23:51 -0000 On Mon, 23 Aug 2010 16:43:15 PDT Marcel Moolenaar wrote: > > On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote: > > *snip* > > > However, all this scripting sounds a bit like a very simple shell in > > the kernel. What advantages are there to this approach vs having the > > ability to run a simple shell script or executable and "pivot" the root > > to a new location? > > The 2 reasons for doing this in the kernel are: > 1. resiliency against ABI changes. > 2. allowing /sbin/init to come from the actual root file system. 
> > Both points are impossible to handle efficiently or correctly if > you need user space support in getting to your actual root file > system. You basically have a catch-22 or bootstrap problem, which > a pure in-kernel solution doesn't have. How about just bundling a small compressed ramfs with the kernel. The kernel unpacks it, uses it as the initial rootfs and runs init from it. A forth/scheme/lua based program wouldn't add more than a % or so (given that the GENERIC kernel is over 10MB now!). From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 01:24:37 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A8FBA1065693 for ; Tue, 24 Aug 2010 01:24:37 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout024.mac.com (asmtpout024.mac.com [17.148.16.99]) by mx1.freebsd.org (Postfix) with ESMTP id 8D48C8FC0A for ; Tue, 24 Aug 2010 01:24:37 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from sa-nc-cs-125.static.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp024.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7M0055ZUK7VJ10@asmtp024.mac.com> for freebsd-arch@FreeBSD.org; Mon, 23 Aug 2010 18:24:09 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008230219 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-08-23_09:2010-08-24, 2010-08-23, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100824002350.042A45B3B@mail.bitblocks.com> Date: Mon, 23 Aug 2010 18:24:07 -0700 Message-id: <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> 
<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> To: Bakul Shah X-Mailer: Apple Mail (2.1081) Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 01:24:37 -0000 On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote: >> The 2 reasons for doing this in the kernel are: >> 1. resiliency against ABI changes. >> 2. allowing /sbin/init to come from the actual root file system. >> >> Both points are impossible to handle efficiently or correctly if >> you need user space support in getting to your actual root file >> system. You basically have a catch-22 or bootstrap problem, which >> a pure in-kernel solution doesn't have. > > How about just bundling a small compressed ramfs with the > kernel. The kernel unpacks it, uses it as the initial rootfs > and runs init from it. A forth/scheme/lua based program > wouldn't add more than a % or so (given that the GENERIC > kernel is over 10MB now!). Not impossible, but it isn't exactly simpler from what I'm looking for: 1. The /sbin/init being run is not the one on the actual (final) root file system. Getting that one to run requires a special init on the ramdisk. 2. The R/O image needs the underlying file system mounted some- where so that there's persistent storage to write. Setting all of this up in user space is impossible if the underlying file system(s) needs to be unmounted/unmountable. 3. Upgrades and downgrades are tricky to handle when the root F/S is the ramdisk, after which some user space environment has to find the storage media and then mount it using mount options it has no easy way to obtain. 
It appears that this solution, while in user space, requires more code and special handling than a "simple" recursive algorithm for something the kernel has to do anyway. I may be mistaken though... -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 02:27:26 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B6501106564A for ; Tue, 24 Aug 2010 02:27:26 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout026.mac.com (asmtpout026.mac.com [17.148.16.101]) by mx1.freebsd.org (Postfix) with ESMTP id 9B28B8FC08 for ; Tue, 24 Aug 2010 02:27:26 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from sa-nc-cs-125.static.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp026.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7M00J8DXH3AD00@asmtp026.mac.com> for freebsd-arch@freebsd.org; Mon, 23 Aug 2010 19:27:05 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008230235 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-08-24_01:2010-08-24, 2010-08-23, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100823.181424.646155203640260173.imp@bsdimp.com> Date: Mon, 23 Aug 2010 19:27:03 -0700 Message-id: <9EED1D80-7E2E-4C9E-8608-7CFD5B25214B@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100823.181424.646155203640260173.imp@bsdimp.com> To: "M. 
Warner Losh" X-Mailer: Apple Mail (2.1081) Cc: freebsd-arch@freebsd.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 02:27:26 -0000 On Aug 23, 2010, at 5:14 PM, M. Warner Losh wrote: > : > And how do you emulate the mount_foo programs for > : > foo filesystems? Some of them do weird things that might not > : > translate well into the kernel... > : > : True. I haven't flushed that out, but I was hoping that nmount(2) > : would have normalized most of this that it's a non-issue, provided > : we support mount options in this scheme. > : > : If you have a concrete example of something that's not so trivial, > : but critical to support, let me know and I'll take it into account. > > mount_smbfs makes a connection to the remote system to do > authentication presently in mount_smbfs and initializes the smb > context before mounting the file system in the kernel. I don't know > if I'd call this a critical to support feature, but it was the first > "exception" to the rule that jumped into my head so I was curious if > you'd thought about it. smbfs is definitely out of scope :-) > : > As you can see, I'm torn about how I feel about the idea. For simple > : > cases, I think it is great, but as complexity builds, I become less > : > sure. What if that iso image was compressed? > : > : Can you elaborate how this is potentially a problem in this scheme, > : but not for "manual" mounting? > > You'd need a way to stack up different modules, since you'd need > geom_uzip over md0 to make it useful to the cd9660 code. This is a perfect example, actually. I'll think about this in the context of my idea... 
> init(8) is the show stopper to a pivot root approach, unless you could
> tell init that's on the first level and simple to exec /sbin/init to
> pickup the new copy, but I don't know how happy that would make the
> kernel..

I think a handshake is doable. If all else fails, you simply tell the
kernel to always re-exec init when it exits (rather than panicking,
which isn't exactly a product-friendly response to init exiting).

> and if we had one more layer on nand:
>
> Filesystem     1024-blocks   Used  Avail Capacity  Mounted on
> /dev/nor0             4096   4096      0   110%    /
> /dev/md0.uzip        16000  16000      0   110%    /
> /dev/nand0          320000 300000  20000    82%    /
>
> or
>
> Filesystem     1024-blocks   Used  Avail Capacity  Mounted on
> /dev/nor0             4096   4096      0   110%    /.old_root/.old_root
> /dev/md0.uzip        16000  16000      0   110%    /.old_root
> /dev/nand0          320000 300000  20000    82%    /
>
> is the question I'm asking...

I think it would be:

/dev/nor0        /.old_root
/dev/md0.uzip    /.old_root
/dev/nand0       /

> Anyway, the fact that we have a decoupled fork/exec really is what
> lead me to ask the question. It is useful to run arbitrary code
> between the two, even if you usually run the same code... sometimes
> you want to be different. I was thinking that this might be the same
> way here. But, as you rightly point out, maybe there's too much
> complexity in doing that and simpler is better.

I'll chew on the geom_uzip example you gave. There's value in
allowing the full power of GEOM when doing a root mount.
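The two layouts for superseded roots discussed above (nested, as in the quoted second df listing, and the flatter one given in the reply, as I read it) differ only in how many times the mount point is repeated. A rough sketch, assuming the `/.old_root` name from the example; `enum old_root_scheme` and `layer_mountpoint()` are hypothetical and nothing like them exists in the tree:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum old_root_scheme { OLD_ROOT_NESTED, OLD_ROOT_FLAT };

/*
 * Write into 'buf' the place where layer 'idx' (0 = mounted first) of
 * 'nlayers' would show up once the last layer has become "/".
 * Returns 0 on success, -1 if 'buf' is too small.
 */
static int
layer_mountpoint(enum old_root_scheme scheme, int idx, int nlayers,
    char *buf, size_t len)
{
	size_t used = 0;
	int hidden, i, n, reps;

	assert(buf != NULL && len > 0 && idx >= 0 && idx < nlayers);
	/* Number of roots that were mounted after this one. */
	hidden = nlayers - 1 - idx;
	if (hidden == 0)
		return ((size_t)snprintf(buf, len, "/") < len ? 0 : -1);
	/* Flat: every old root sits at /.old_root; nested: one level each. */
	reps = (scheme == OLD_ROOT_FLAT) ? 1 : hidden;
	for (i = 0; i < reps; i++) {
		n = snprintf(buf + used, len - used, "/.old_root");
		if (n < 0 || (size_t)n >= len - used)
			return (-1);
		used += (size_t)n;
	}
	return (0);
}
```

With three layers, the nested scheme puts /dev/nor0 at /.old_root/.old_root, while the flat scheme reports both older layers at /.old_root.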
Thanks, -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 04:33:45 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E2F0B1065670 for ; Tue, 24 Aug 2010 04:33:45 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id C03378FC0A for ; Tue, 24 Aug 2010 04:33:45 +0000 (UTC) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id CA4DE5B56; Mon, 23 Aug 2010 21:33:44 -0700 (PDT) To: Marcel Moolenaar In-reply-to: Your message of "Mon, 23 Aug 2010 18:24:07 PDT." <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> Comments: In-reply-to Marcel Moolenaar message dated "Mon, 23 Aug 2010 18:24:07 -0700." Date: Mon, 23 Aug 2010 21:33:44 -0700 From: Bakul Shah Message-Id: <20100824043344.CA4DE5B56@mail.bitblocks.com> Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 04:33:46 -0000 On Mon, 23 Aug 2010 18:24:07 PDT Marcel Moolenaar wrote: > > On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote: > > >> The 2 reasons for doing this in the kernel are: > >> 1. resiliency against ABI changes. > >> 2. allowing /sbin/init to come from the actual root file system. 
> >>
> >> Both points are impossible to handle efficiently or correctly if
> >> you need user space support in getting to your actual root file
> >> system. You basically have a catch-22 or bootstrap problem, which
> >> a pure in-kernel solution doesn't have.
> >
> > How about just bundling a small compressed ramfs with the
> > kernel. The kernel unpacks it, uses it as the initial rootfs
> > and runs init from it. A forth/scheme/lua based program
> > wouldn't add more than a % or so (given that the GENERIC
> > kernel is over 10MB now!).

BTW, a friend tells me this is what Linux does (or more
likely, what they used in their server startup). Basically a
ramdisk with init + loadable drivers + tools needed to get
going. Once the actual root fs device is found (even if
disks got switched around etc.) they switched to the actual
root.

> Not impossible, but it isn't exactly simpler from what I'm looking
> for:
> 1. The /sbin/init being run is not the one on the actual (final)
>    root file system. Getting that one to run requires a special
>    init on the ramdisk.

Yes. But then you just exec() the real init once you have
"pivoted" to the final root fs. You run with ramfs only as
long as you have to.

> 2. The R/O image needs the underlying file system mounted
>    somewhere so that there's persistent storage to write. Setting
>    all of this up in user space is impossible if the underlying
>    file system(s) needs to be unmounted/unmountable.
>
> 3. Upgrades and downgrades are tricky to handle when the root
>    F/S is the ramdisk, after which some user space environment
>    has to find the storage media and then mount it using mount
>    options it has no easy way to obtain.

Would that still be a problem once you switch to the final root?

> It appears that this solution, while in user space, requires more
> code and special handling than a "simple" recursive algorithm for
> something the kernel has to do anyway. I may be mistaken though...

It may start out "simple"....
From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 04:59:58 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3DCF0106566C for ; Tue, 24 Aug 2010 04:59:58 +0000 (UTC) (envelope-from mj@feral.com) Received: from ns1.feral.com (ns1.feral.com [192.67.166.1]) by mx1.freebsd.org (Postfix) with ESMTP id 05DDC8FC08 for ; Tue, 24 Aug 2010 04:59:57 +0000 (UTC) Received: from [192.168.1.2] (m206-63.dsl.tsoft.com [198.144.206.63]) by ns1.feral.com (8.14.3/8.14.3) with ESMTP id o7O4baYA072705 for ; Mon, 23 Aug 2010 21:37:37 -0700 (PDT) (envelope-from mj@feral.com) Message-ID: <4C734C92.4010105@feral.com> Date: Mon, 23 Aug 2010 21:37:38 -0700 From: Matthew Jacob Organization: Feral Software User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100711 Thunderbird/3.0.6 MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> <20100824043344.CA4DE5B56@mail.bitblocks.com> In-Reply-To: <20100824043344.CA4DE5B56@mail.bitblocks.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Default is to whitelist mail, not delayed by milter-greylist-4.2.6 (ns1.feral.com [192.67.166.1]); Mon, 23 Aug 2010 21:37:37 -0700 (PDT) Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 04:59:58 -0000 Yes, this is the RedHat root pivot goop that's been around for ages. 
It turns out to be a massive PITA, because the initrd image can get out of sync with the kernel and hardware, and since some of the modules can be loaded from there, but not from the root filesystem there is a definite possibility (which has happened with more times than I care to remember) that you'll get hosed and not be able to mount your root filesystem. This actually can happen so easily that when I install CentOS or Fedora, I override the defaults and put the root filesystem on a plain partition/filesystem rather than as part of an LVM2 volume. > BTW, a friend tells me this is what Linux does (or more > likely, what they used in their server startup). Basically a > ramdisk with init + loadable drivers + tools needed to get > going. Once the actual root fs device is found (even if > disks got switched around etc.) they switched to the actual > root. > > From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 05:52:18 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 738AF1065679 for ; Tue, 24 Aug 2010 05:52:18 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id 563CA8FC15 for ; Tue, 24 Aug 2010 05:52:18 +0000 (UTC) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id D0CF25B56; Mon, 23 Aug 2010 22:52:17 -0700 (PDT) To: Matthew Jacob In-reply-to: Your message of "Mon, 23 Aug 2010 21:37:38 PDT." <4C734C92.4010105@feral.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> <20100824043344.CA4DE5B56@mail.bitblocks.com> <4C734C92.4010105@feral.com> Comments: In-reply-to Matthew Jacob message dated "Mon, 23 Aug 2010 21:37:38 -0700." 
Date: Mon, 23 Aug 2010 22:52:17 -0700 From: Bakul Shah Message-Id: <20100824055217.D0CF25B56@mail.bitblocks.com> Cc: freebsd-arch@freebsd.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 05:52:18 -0000 On Mon, 23 Aug 2010 21:37:38 PDT Matthew Jacob wrote: > Yes, this is the RedHat root pivot goop that's been around for ages. > > It turns out to be a massive PITA, because the initrd image can get out > of sync with the kernel and hardware, and since some of the modules can > be loaded from there, but not from the root filesystem there is a > definite possibility (which has happened with more times than I care to > remember) that you'll get hosed and not be able to mount your root > filesystem. To avoid getting out of sync is why I was advocating bundling the ramfs root with the kernel. That too can have problems -- it is all matter of which compromise you can live with. > This actually can happen so easily that when I install CentOS or Fedora, > I override the defaults and put the root filesystem on a plain > partition/filesystem rather than as part of an LVM2 volume. > > > BTW, a friend tells me this is what Linux does (or more > > likely, what they used in their server startup). Basically a > > ramdisk with init + loadable drivers + tools needed to get > > going. Once the actual root fs device is found (even if > > disks got switched around etc.) they switched to the actual > > root. 
From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 08:01:30 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3F554106566C for ; Tue, 24 Aug 2010 08:01:30 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from mx0.hoeg.nl (unknown [IPv6:2a01:4f8:101:5343::aa]) by mx1.freebsd.org (Postfix) with ESMTP id B83DA8FC13 for ; Tue, 24 Aug 2010 08:01:29 +0000 (UTC) Received: by mx0.hoeg.nl (Postfix, from userid 1000) id E6C062A28D2E; Tue, 24 Aug 2010 10:01:28 +0200 (CEST) Date: Tue, 24 Aug 2010 10:01:28 +0200 From: Ed Schouten To: Marcel Moolenaar Message-ID: <20100824080128.GJ64651@hoeg.nl> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="C7PTD44AewjTsiSV" Content-Disposition: inline In-Reply-To: <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> User-Agent: Mutt/1.5.20 (2009-06-14) X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 08:01:30 -0000 --C7PTD44AewjTsiSV Content-Type: multipart/mixed; boundary="HkMjoL2LAeBLhbFV" Content-Disposition: inline --HkMjoL2LAeBLhbFV Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Marcel Moolenaar wrote: > 1. The /sbin/init being run is not the one on the actual (final) > root file system. Getting that one to run requires a special > init on the ramdisk. 
Well, the FreeBSD live CD I posted on the lists the other day has such a special /sbin/init. See the attachment. Maybe it could be rewritten in such a way that it parses a text file (fstab-like?), which can also be placed on the mdroot? --=20 Ed Schouten WWW: http://80386.nl/ --HkMjoL2LAeBLhbFV-- --C7PTD44AewjTsiSV Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.16 (FreeBSD) iEUEARECAAYFAkxzfFgACgkQ52SDGA2eCwXeMgCfYW0LcsRSaw9GbW+gV77NMchQ N64AmPF2fWwFiTVM3GIvGnF6pY9MlKw= =mBlY -----END PGP SIGNATURE----- --C7PTD44AewjTsiSV-- From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 08:03:10 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 506B01065697 for ; Tue, 24 Aug 2010 08:03:10 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from mx0.hoeg.nl (unknown [IPv6:2a01:4f8:101:5343::aa]) by mx1.freebsd.org (Postfix) with ESMTP id 9B9FE8FC15 for ; Tue, 24 Aug 2010 08:03:09 +0000 (UTC) Received: by mx0.hoeg.nl (Postfix, from userid 1000) id 0EA7B2A28CF5; Tue, 24 Aug 2010 10:03:09 +0200 (CEST) Date: Tue, 24 Aug 2010 10:03:09 +0200 From: Ed Schouten To: Marcel Moolenaar Message-ID: <20100824080309.GK64651@hoeg.nl> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> <20100824080128.GJ64651@hoeg.nl> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="RVlUGXxwBj5SDcM9" Content-Disposition: inline In-Reply-To: <20100824080128.GJ64651@hoeg.nl> User-Agent: Mutt/1.5.20 (2009-06-14) Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD 
architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 08:03:10 -0000 --RVlUGXxwBj5SDcM9 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Ed Schouten wrote: > See the attachment. It seems like Mailman ate the attachment. %%% /*- * Copyright (c) 2010 Ed Schouten * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPO= SE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTI= AL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRI= CT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
 */

/*
 * NOTE: the header names were stripped by the list archive. The list
 * below is reconstructed from the calls this file makes and may not
 * match the original ten headers exactly.
 */
#include <sys/param.h>
#include <sys/mount.h>
#include <sys/uio.h>

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
die(const char *msg)
{
	int fd, serrno;

	serrno = errno;
	fd = open("/dev/console", O_RDWR);
	if (fd != -1 && fd != STDERR_FILENO)
		dup2(fd, STDERR_FILENO);
	errno = serrno;
	perror(msg);
	sleep(10);
	exit(1);
}

static void
domount(const char * const list[], unsigned int elems)
{
	struct iovec iov[elems];
	unsigned int i;

	for (i = 0; i < elems; i++) {
		iov[i].iov_base = (char *)list[i];
		iov[i].iov_len = strlen(list[i]) + 1;
	}

	if (nmount(iov, elems, 0) != 0)
		die(list[1]);
}

static char const * const cdfs[] = {
	"fstype", "cd9660",
	"from", "/dev/iso9660/freebsd",
	"fspath", "/ro"
};

static char const * const tmpfs[] = {
	"fstype", "tmpfs",
	"fspath", "/rw"
};

static char const * const unionfs[] = {
	"fstype", "unionfs",
	"from", "/ro",
	"fspath", "/rw",
	"below", "",
	"whiteout", "whenneeded"
};

static char const * const devfs[] = {
	"fstype", "devfs",
	"fspath", "/rw/dev"
};

int
main(int argc, char *argv[])
{

	/* Prevent foot shooting. */
	if (getpid() != 1)
		return (1);

	/* Perform mounts. */
	domount(cdfs, sizeof cdfs / sizeof(char *));
	domount(tmpfs, sizeof tmpfs / sizeof(char *));
	domount(unionfs, sizeof unionfs / sizeof(char *));
	domount(devfs, sizeof devfs / sizeof(char *));

	/* chroot() into system and continue boot process. */
	if (chroot("/rw") != 0)
		die("chroot");
	chdir("/");

	/* Execute the real /sbin/init.
	 */
	execv(argv[0], argv);
	die("execv");
	return (1);
}
%%%

-- Ed Schouten WWW: http://80386.nl/ --RVlUGXxwBj5SDcM9 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.16 (FreeBSD) iEYEARECAAYFAkxzfL0ACgkQ52SDGA2eCwVX4gCfXGwT+BrR2p/fcSDwzlgtDk4r LREAn1cIzCh1vFzUWnlRdCLCc48wwe8e =vd94 -----END PGP SIGNATURE----- --RVlUGXxwBj5SDcM9--
From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 14:29:16 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AB5C71065679 for ; Tue, 24 Aug 2010 14:29:16 +0000 (UTC) (envelope-from ups@freebsd.org) Received: from smtpauth16.prod.mesa1.secureserver.net (smtpauth16.prod.mesa1.secureserver.net [64.202.165.22]) by mx1.freebsd.org (Postfix) with SMTP id 75B3C8FC25 for ; Tue, 24 Aug 2010 14:29:16 +0000 (UTC) Received: (qmail 13014 invoked from network); 24 Aug 2010 14:01:43 -0000 Received: from unknown (75.139.142.171) by smtpauth16.prod.mesa1.secureserver.net (64.202.165.22) with ESMTP; 24 Aug 2010 14:01:43 -0000 Message-ID: <4C73D0FA.5030102@freebsd.org> Date: Tue, 24 Aug 2010 10:02:34 -0400 From: Stephan Uphoff User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228) MIME-Version: 1.0 To: Max Laier References: <201008160515.21412.max@love2party.net> <4C7042BA.8000402@freebsd.org> <201008222105.47276.max@love2party.net> <201008240045.15998.max@laiers.net> In-Reply-To: <201008240045.15998.max@laiers.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: rmlock(9) two additions X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 14:29:16 -0000

Max Laier wrote:
> On Sunday 22 August 2010 21:05:47 Max Laier wrote:
>
>> On Saturday 21 August 2010 23:18:50 Stephan Uphoff wrote:
>>
>>> Max Laier wrote:
>>> ...
>>> The rm_try_rlock() will never succeed when the lock was last locked as
>>> writer.
>>> Maybe add:
>>>
>>> void
>>> _rm_wunlock(struct rmlock *rm)
>>> {
>>> +	rm->rm_noreadtoken = 0;
>>>
>>> 	mtx_unlock(&rm->rm_lock);
>>>
>>> }
>>>
>>> But then
>>>
>>> _rm_wlock(struct rmlock *rm)
>>>
>>> always needs to use IPIs - even when the lock was used last as a write
>>> lock.
>>>
>> I don't think this is a big problem - I can't see many use cases for
>> rmlocks where you'd routinely see repeated wlocks without rlocks between
>> them. However, I think there should be a release memory barrier
>> before/while clearing rm_noreadtoken, otherwise readers may not see the
>> data writes that are supposed to be protected by the lock?!?
>>
>>> ...
>>>
>> I believe either solution will work. #1 is a bit more in the spirit of the
>> rmlock - i.e. make the read case cheap and the write case expensive. I'm
>> just not sure about the lock semantics.
>>
>> I guess a
>>
>> 	atomic_store_rel_int(&rm->rm_noreadtoken, 0);
>>
>> should work.
>>
>
> thinking about this for a while makes me wonder: Are readers really guaranteed
> to see all the updates of a writer - even in the current version?
>
> Example:
>
> writer thread:
> 	rm_wlock(); // lock mtx, IPI, wait for reader drain
> 	modify_data();
> 	rm_wunlock(); // unlock mtx (this does an atomic_*_rel)
>
> reader thread #1:
> 	// failed to get the lock, spinning/waiting on mtx
> 	mtx_lock(); // this does an atomic_*_acq -> this CPU sees the new data
> 	rm->rm_noreadtoken = 0; // now new readers can enter quickly
> 	...
>
> reader thread #2 (on a different CPU than reader #1):
> 	// enters rm_rlock() "after" rm_noreadtoken was reset -> no memory barrier
> 	// does this thread see the modifications?
>
> I realize this is a somewhat pathological case, but would it be possible in
> theory? Or is the compiler_memory_barrier() actually enough?
>
> Otherwise, I think we need an IPI on rm_wunlock() that does an atomic_*_acq on
> every CPU.
>
> Thoughts?
>   Max
>

Yes - this is a problem that needs to be addressed.
Fortunately most platforms won't need to be as strict and I suggest per
platform parameters.

An alternative that was in my original design was to use a bitmap for
the rm_noreadtoken.
Each CPU would then have an associated bit that will only be cleared by
that cpu.
This would also allow targeted IPIs to only the token holders.

Stephan

From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 14:56:09 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AE5C810656A4 for ; Tue, 24 Aug 2010 14:56:09 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout026.mac.com (asmtpout026.mac.com [17.148.16.101]) by mx1.freebsd.org (Postfix) with ESMTP id 9170D8FC0C for ; Tue, 24 Aug 2010 14:56:09 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp026.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7N00AKJW5KWX10@asmtp026.mac.com> for freebsd-arch@FreeBSD.org; Tue, 24 Aug 2010 07:56:09 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008240083 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
definitions=2010-08-24_06:2010-08-24, 2010-08-24, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100824043344.CA4DE5B56@mail.bitblocks.com> Date: Tue, 24 Aug 2010 07:56:08 -0700 Message-id: <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> <20100824043344.CA4DE5B56@mail.bitblocks.com> To: Bakul Shah X-Mailer: Apple Mail (2.1081) Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 14:56:09 -0000 On Aug 23, 2010, at 9:33 PM, Bakul Shah wrote: > On Mon, 23 Aug 2010 18:24:07 PDT Marcel Moolenaar wrote: >> >> On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote: >> >>>> The 2 reasons for doing this in the kernel are: >>>> 1. resiliency against ABI changes. >>>> 2. allowing /sbin/init to come from the actual root file system. >>>> >>>> Both points are impossible to handle efficiently or correctly if >>>> you need user space support in getting to your actual root file >>>> system. You basically have a catch-22 or bootstrap problem, which >>>> a pure in-kernel solution doesn't have. >>> >>> How about just bundling a small compressed ramfs with the >>> kernel. The kernel unpacks it, uses it as the initial rootfs >>> and runs init from it. A forth/scheme/lua based program >>> wouldn't add more than a % or so (given that the GENERIC >>> kernel is over 10MB now!). > > BTW, a friend tells me this is what Linux does (or more > likely, what they used in their server startup). I see your point and buy into the argument, but not entirely. 
I explicitly mentioned "embedding" and so far your arguments include things like GENERIC being 10MB or Linux server startup. We're not exactly discussing the same thing are we? I'm perfectly happy to say that the ramdisk approach is the most generic and solution for desktop and server machines but I'm not at all ready to have it include embedded systems just yet. It's just too heavy weight... -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 15:48:59 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 773AD1065679 for ; Tue, 24 Aug 2010 15:48:59 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 3CC018FC18 for ; Tue, 24 Aug 2010 15:48:58 +0000 (UTC) Received: from critter.freebsd.dk (unknown [192.168.51.2]) by phk.freebsd.dk (Postfix) with ESMTP id 204C23F5B7; Tue, 24 Aug 2010 15:48:56 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.4/8.14.4) with ESMTP id o7OFmq7J012991; Tue, 24 Aug 2010 15:48:53 GMT (envelope-from phk@critter.freebsd.dk) To: Marcel Moolenaar From: "Poul-Henning Kamp" In-Reply-To: Your message of "Tue, 24 Aug 2010 07:56:08 MST." 
<760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> Date: Tue, 24 Aug 2010 15:48:52 +0000 Message-ID: <12990.1282664932@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 15:48:59 -0000

In message <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>, Marcel Moolenaar writes:

>I'm perfectly happy to say that the ramdisk approach
>is the most generic and solution for desktop and
>server machines but I'm not at all ready to have it
>include embedded systems just yet. It's just too
>heavy weight...

I'm with Marcel here.

Except for one detail: In deeply embedded applications the ramdisk
is actually preferable, because that saves you from providing a
root filesystem any other way.

Our solution for that is MD_PRELOADED which is quite a hack.

The bit missing for the ramdisk approach is the root-fs-swizzle
code.

There are two ways to do that, either a very magic mount-like
system call, or by pid==1 setting the name of the real rootfs
with a sysctl and exiting, which calls into the existing root-mount
code again.

The latter is almost trivial to implement, just remember to start
the new /sbin/init with pid==1.

Poul-Henning

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
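
[Archive note] The second option described in the message above (pid 1 names the real root file system through a sysctl and then exits, so that the kernel runs its root-mount code again) might look roughly like this from the ramdisk side. This is only a sketch under stated assumptions, not anything the thread agreed on. The sysctl name vfs.root.swizzle_to is invented here for illustration and does not exist. The "fstype:device" value format is borrowed from the existing vfs.root.mountfrom loader tunable. The helper names format_root_spec() and announce_real_root() are hypothetical too.

```c
#include <sys/types.h>

#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

/* HYPOTHETICAL sysctl name, made up for this sketch; it does not exist. */
#define	ROOT_SWIZZLE_SYSCTL	"vfs.root.swizzle_to"

/*
 * Build a "fstype:device" string, the same form vfs.root.mountfrom uses.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
int
format_root_spec(char *buf, size_t len, const char *fstype, const char *dev)
{
	int n;

	n = snprintf(buf, len, "%s:%s", fstype, dev);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/*
 * Tell the kernel which root to switch to.  On FreeBSD this writes the
 * (hypothetical) sysctl and will fail on a stock kernel; elsewhere it
 * only prints what it would do, so the sketch still builds.
 */
int
announce_real_root(const char *spec)
{

	assert(spec != NULL);
#ifdef __FreeBSD__
	return (sysctlbyname(ROOT_SWIZZLE_SYSCTL, NULL, NULL, spec,
	    strlen(spec) + 1));
#else
	printf("would set %s=%s\n", ROOT_SWIZZLE_SYSCTL, spec);
	return (0);
#endif
}
```

A ramdisk init would call format_root_spec(), hand the result to announce_real_root(), and then exit. Under the proposal, the kernel side would take it from there and start the real /sbin/init as pid 1; that kernel part is not shown and does not exist today.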
From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 15:52:07 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 06BE610656A8 for ; Tue, 24 Aug 2010 15:52:07 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60]) by mx1.freebsd.org (Postfix) with ESMTP id D3DDB8FC1B for ; Tue, 24 Aug 2010 15:52:06 +0000 (UTC) Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id C2A535B23; Tue, 24 Aug 2010 08:52:05 -0700 (PDT) To: Marcel Moolenaar In-reply-to: Your message of "Tue, 24 Aug 2010 07:56:08 PDT." <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> <20100824043344.CA4DE5B56@mail.bitblocks.com> <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> Comments: In-reply-to Marcel Moolenaar message dated "Tue, 24 Aug 2010 07:56:08 -0700." Date: Tue, 24 Aug 2010 08:52:05 -0700 From: Bakul Shah Message-Id: <20100824155205.C2A535B23@mail.bitblocks.com> Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 15:52:07 -0000 On Tue, 24 Aug 2010 07:56:08 PDT Marcel Moolenaar wrote: > > On Aug 23, 2010, at 9:33 PM, Bakul Shah wrote: > > > On Mon, 23 Aug 2010 18:24:07 PDT Marcel Moolenaar wrote: > >> > >> On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote: > >> > >>>> The 2 reasons for doing this in the kernel are: > >>>> 1. resiliency against ABI changes. > >>>> 2. 
allowing /sbin/init to come from the actual root file system. > >>>> > >>>> Both points are impossible to handle efficiently or correctly if > >>>> you need user space support in getting to your actual root file > >>>> system. You basically have a catch-22 or bootstrap problem, which > >>>> a pure in-kernel solution doesn't have. > >>> > >>> How about just bundling a small compressed ramfs with the > >>> kernel. The kernel unpacks it, uses it as the initial rootfs > >>> and runs init from it. A forth/scheme/lua based program > >>> wouldn't add more than a % or so (given that the GENERIC > >>> kernel is over 10MB now!). > > > > BTW, a friend tells me this is what Linux does (or more > > likely, what they used in their server startup). > > I see your point and buy into the argument, but not > entirely. I explicitly mentioned "embedding" and so > far your arguments include things like GENERIC being > 10MB or Linux server startup. > > We're not exactly discussing the same thing are we? This friend's company used linux in an embedded system [it was a fileserver product. Presumably the OS had to run in a restricted environment since the FS space would be for their customers' use + you don't want to have to reload the OS when a disk dies! And yet you want the ability to upgrade your OS s/w etc.] In my job[-2] we used FreeBSD as an embedded OS. IIRC we just ran from a readonly flash FS as root. An upgrade was just a new FS image, including kernel + utilities. Didn't Juniper do something similar? > I'm perfectly happy to say that the ramdisk approach > is the most generic and solution for desktop and > server machines but I'm not at all ready to have it > include embedded systems just yet. It's just too > heavy weight... I would argue that while each individual embedded system typically runs in a simpler environment than GENERIC, the sum total of such embedded environments presents a large set of alternatives. 
Now if you can distill all that down to a small set of kernel changes, that is great! But I am not doing the work, you are. So feel free to use/ignore my input however you wish! From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 16:16:45 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A1F5A106566B for ; Tue, 24 Aug 2010 16:16:45 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout026.mac.com (asmtpout026.mac.com [17.148.16.101]) by mx1.freebsd.org (Postfix) with ESMTP id 83E7C8FC14 for ; Tue, 24 Aug 2010 16:16:45 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp026.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7N00A17ZV1WK70@asmtp026.mac.com> for freebsd-arch@FreeBSD.org; Tue, 24 Aug 2010 09:16:14 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008240096 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-08-24_09:2010-08-24, 2010-08-24, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100824155205.C2A535B23@mail.bitblocks.com> Date: Tue, 24 Aug 2010 09:16:13 -0700 Message-id: References: <20100823.171201.107001114053031707.imp@bsdimp.com> <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> <20100824002350.042A45B3B@mail.bitblocks.com> <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> <20100824043344.CA4DE5B56@mail.bitblocks.com> <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> <20100824155205.C2A535B23@mail.bitblocks.com> To: Bakul Shah X-Mailer: Apple Mail (2.1081) Cc: "freebsd-arch@FreeBSD.org" Subject: Re: RFC: 
enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 16:16:45 -0000 On Aug 24, 2010, at 8:52 AM, Bakul Shah wrote: >> >> I see your point and buy into the argument, but not >> entirely. I explicitly mentioned "embedding" and so >> far your arguments include things like GENERIC being >> 10MB or Linux server startup. >> >> We're not exactly discussing the same thing are we? > > This friend's company used linux in an embedded system [it > was a fileserver product. Presumably the OS had to run in a > restricted environment since the FS space would be for their > customers' use + you don't want to have to reload the OS when > a disk dies! And yet you want the ability to upgrade your OS > s/w etc.] > > In my job[-2] we used FreeBSD as an embedded OS. IIRC we just > ran from a readonly flash FS as root. An upgrade was just a > new FS image, including kernel + utilities. Didn't Juniper > do something similar? Juniper's approach is still heavily rooted in PC-class H/W. With Book-E, ARM and MIPS products for the low(er)-end and in particular without these products having a real harddisk, the existing way has shown it's problems and limitations. Also: Juniper has hacked a few tools, including the kernel at large and md(4) in particular to implement features they needed/wanted, which I'd like to get away from. 
FYI, -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Tue Aug 24 17:01:35 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F20C4106566C for ; Tue, 24 Aug 2010 17:01:35 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id AF7BB8FC08 for ; Tue, 24 Aug 2010 17:01:35 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7OGtasY067234; Tue, 24 Aug 2010 10:55:36 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Tue, 24 Aug 2010 10:55:46 -0600 (MDT) Message-Id: <20100824.105546.1002438156525560711.imp@bsdimp.com> To: xcllnt@mac.com From: "M. Warner Losh" In-Reply-To: References: <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> <20100824155205.C2A535B23@mail.bitblocks.com> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 24 Aug 2010 17:01:36 -0000 In message: Marcel Moolenaar writes: : : On Aug 24, 2010, at 8:52 AM, Bakul Shah wrote: : >> : >> I see your point and buy into the argument, but not : >> entirely. I explicitly mentioned "embedding" and so : >> far your arguments include things like GENERIC being : >> 10MB or Linux server startup. : >> : >> We're not exactly discussing the same thing are we? : > : > This friend's company used linux in an embedded system [it : > was a fileserver product. 
Presumably the OS had to run in a : > restricted environment since the FS space would be for their : > customers' use + you don't want to have to reload the OS when : > a disk dies! And yet you want the ability to upgrade your OS : > s/w etc.] : > : > In my job[-2] we used FreeBSD as an embedded OS. IIRC we just : > ran from a readonly flash FS as root. An upgrade was just a : > new FS image, including kernel + utilities. Didn't Juniper : > do something similar? : : Juniper's approach is still heavily rooted in PC-class H/W. : With Book-E, ARM and MIPS products for the low(er)-end and : in particular without these products having a real harddisk, : the existing way has shown it's problems and limitations. : : Also: Juniper has hacked a few tools, including the kernel : at large and md(4) in particular to implement features they : needed/wanted, which I'd like to get away from. You can get away from a large MD by having a small MD and pivoting to large storage. Linux does this, as Bakul said, and it scales from the ultra-small 4MB Mips router up to the highest multicore server. 
Warner From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 01:51:57 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 98D6D1065672 for ; Wed, 25 Aug 2010 01:51:57 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 5AD468FC0C for ; Wed, 25 Aug 2010 01:51:57 +0000 (UTC) Received: by iwn36 with SMTP id 36so129257iwn.13 for ; Tue, 24 Aug 2010 18:51:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=C+n4G+tYFzPgwdRdiKULxztljynPHHLtwqtkABoTJE0=; b=CNzYt78WBd2wh71Yd+ZyhGucl296kYm+tTmf8xi2Px7Bv+BgX//YFpT7TKxu5wqrLn l7NdOP42aBEIzFIGjZ7uUPLFtPejJjmOuWf7KREXVJlkOIS8tr0fIpDYJ6SIpaCkyjT5 SxvfWE8D83SUfz8d4n6j3DZ3KYKLDZtZkVV7g= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=CBgD1ihveY6Gl1UDq1ZEPeUr2xLNcrZGEScQY1AwuQyhZDdLOwv+sbgl+EQL7CmN8T D1XZepZryO61eL/KXlgvBvulZ14pezY2bj+aW/ezZAOTCo8wr4MxA4vCEw9YIUmMJZpa JVkruvE7hZPFwCStsxJ4Hy0M7H91uEdJZ67XM= MIME-Version: 1.0 Received: by 10.231.148.195 with SMTP id q3mr9239909ibv.199.1282699323033; Tue, 24 Aug 2010 18:22:03 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.231.168.14 with HTTP; Tue, 24 Aug 2010 18:22:02 -0700 (PDT) In-Reply-To: <20100824.105546.1002438156525560711.imp@bsdimp.com> References: <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> <20100824155205.C2A535B23@mail.bitblocks.com> <20100824.105546.1002438156525560711.imp@bsdimp.com> Date: Wed, 25 Aug 2010 09:22:02 +0800 X-Google-Sender-Auth: 
Xmr9owVOhwyj2exg1SO6HNzb-Rg Message-ID: From: Adrian Chadd To: "M. Warner Losh" Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: xcllnt@mac.com, freebsd-arch@freebsd.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 01:51:57 -0000

On 25 August 2010 00:55, M. Warner Losh wrote:
>
> You can get away from a large MD by having a small MD and pivoting to
> large storage. Linux does this, as Bakul said, and it scales from the
> ultra-small 4MB Mips router up to the highest multicore server.
>

But as someone's said before - and as I've been a Linux sysadmin here
and there, I've been bitten more than once by the linux mdroot setup
where only the -bare minimum- modules needed to bring the system up
are in the mdroot. Woe be if you have to swap hardware in a hurry -
double woe if your distribution provides lots of nice "autodetect"
methods for figuring out which modules should be in the mdroot and
does this for you automatically. You can manually build modules into
mdroot but that isn't any good when you're trying to boot a
post-failed system on alternative hardware.

The FreeBSD method has been nice - I can compile a lean GENERIC but
use /boot/loader.conf to load modules at boot time to use alternative
storage/network mechanisms.

I'm not saying the whole Linux initrd approach is -bad-; I'm just
saying it needs to be thought through a little more first.

Adrian

From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 01:57:33 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F005C106566B; Wed, 25 Aug 2010 01:57:33 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id AEF278FC0A; Wed, 25 Aug 2010 01:57:33 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7P1pZ9V001461; Tue, 24 Aug 2010 19:51:35 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Tue, 24 Aug 2010 19:51:45 -0600 (MDT) Message-Id: <20100824.195145.29593248078694701.imp@bsdimp.com> To: adrian@FreeBSD.org From: "M. Warner Losh" In-Reply-To: References: <20100824.105546.1002438156525560711.imp@bsdimp.com> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable Cc: xcllnt@mac.com, freebsd-arch@FreeBSD.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 01:57:34 -0000

In message: Adrian Chadd writes:
: On 25 August 2010 00:55, M. Warner Losh wrote:
: >
: > You can get away from a large MD by having a small MD and pivoting to
: > large storage. Linux does this, as Bakul said, and it scales from the
: > ultra-small 4MB Mips router up to the highest multicore server.
: >
:
: But as someone's said before - and as I've been a Linux sysadmin here
: and there, I've been bitten more than once by the linux mdroot setup
: where only the -bare minimum- modules needed to bring the system up
: are in the mdroot. Woe be if you have to swap hardware in a hurry -
: double woe if your distribution provides lots of nice "autodetect"
: methods for figuring out which modules should be in the mdroot and
: does this for you automatically. You can manually build modules into
: mdroot but that isn't any good when you're trying to boot a
: post-failed system on alternative hardware.
:
: The FreeBSD method has been nice - I can compile a lean GENERIC but
: use /boot/loader.conf to load modules at boot time to use alternative
: storage/network mechanisms.
:
: I'm not saying the whole Linux initrd approach is -bad-; I'm just
: saying it needs to be thought through a little more first.

Nobody is saying that the only way to do things (or even the default
way) is via the Linux mdroot thing. We're saying that it is *A* way
to bootstrap a kernel that uses the ramfs to find the proper location
of root to mount (maybe after initializing the device where root is),
pivot to that new location.

Marcel's current proposal seems simpler (and less flexible) than
this. The proof in the pudding will be his ability to handle the
'layered' cases of encryption or compression I brought up earlier.
Warner From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 03:49:14 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6025A10656A6 for ; Wed, 25 Aug 2010 03:49:14 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id 1EB818FC0A for ; Wed, 25 Aug 2010 03:49:13 +0000 (UTC) Received: by iwn36 with SMTP id 36so231501iwn.13 for ; Tue, 24 Aug 2010 20:49:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=Ap3p8bED1ToxjF/kbq+79+8iirI9Z3xfR1eGhnCvGRQ=; b=kLrux8SX+R6JQ5vKOPpby06ymRyNBnmG/gX7EdHKhsMai6ndAYitic0w5cWydLjtK2 zGAnajKmVuyPOX574yzBwzU4l4zlE0vMQnFPOsX/LDBfOGjMCLM8ahVUZtt0XqtlKUCK y8iPhmWs2JcDTK4IFS6e4NAK+ISVf2m9hkvdg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=IsdRlioMLUxfrL0SI0Jd1ldXZjDeyIOfUBNSatZFxp3IXmH5TuUI4c6RcdYBGSu9CQ S71oAOlK3139A2lYx8DOQ/RZ7LJCVmCAynZzeH4SVonTMzmHZtVyBX55cn0pIl2T2ft0 pNhW4R7ti3L/eCbxuWG8EMmUch/VVKLn+QW3w= MIME-Version: 1.0 Received: by 10.231.170.21 with SMTP id b21mr9462358ibz.122.1282708150286; Tue, 24 Aug 2010 20:49:10 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.231.168.14 with HTTP; Tue, 24 Aug 2010 20:49:10 -0700 (PDT) In-Reply-To: <20100824.195145.29593248078694701.imp@bsdimp.com> References: <20100824.105546.1002438156525560711.imp@bsdimp.com> <20100824.195145.29593248078694701.imp@bsdimp.com> Date: Wed, 25 Aug 2010 11:49:10 +0800 X-Google-Sender-Auth: VU8UpYE0M7TBxG-ezKJdsmlGcSM Message-ID: From: 
Adrian Chadd To: "M. Warner Losh" Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: xcllnt@mac.com, freebsd-arch@freebsd.org Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 03:49:14 -0000 On 25 August 2010 09:51, M. Warner Losh wrote: > : I'm not saying the whole Linux initrd approach is -bad-; i'm just > : saying it needs to be thought through a little more first. > > No body is saying that the only way to do things (or even the default > way) is via the Linux mdroot thing. We're saying that it is *A* way > to bootstrap a kernel that uses the ramfs to find the proper location > of root to mount (maybe after initializing the device where root is), > pivot to that new location. > > Marcel's current proposal seems simpler (and less flexible) than > this. The proof in the pudding will be his ability to handle the > 'layered' cases of encryption or compression I brought up earlier. I do like the idea of a formalish description of a bootstrap process rather than "hi, run these scripts." That said, I do like the idea of also being able to run some scripts, prep the system and then re-try the root mount. Having both options would be rather nice. In any case, +1 from me.
Adrian From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 15:58:31 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B3EDC10656AE for ; Wed, 25 Aug 2010 15:58:31 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout023.mac.com (asmtpout023.mac.com [17.148.16.98]) by mx1.freebsd.org (Postfix) with ESMTP id 9D8768FC0A for ; Wed, 25 Aug 2010 15:58:31 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp023.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7P00515TP4UP70@asmtp023.mac.com> for freebsd-arch@freebsd.org; Wed, 25 Aug 2010 08:58:18 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008250105 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-08-25_08:2010-08-25, 2010-08-25, 1970-01-01 signatures=0 From: Marcel Moolenaar Date: Wed, 25 Aug 2010 08:58:16 -0700 Message-id: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> To: "freebsd-arch@FreeBSD.org Arch" X-Mailer: Apple Mail (2.1081) Subject: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 15:58:31 -0000 Summary of round 1: 1. A ramdisk root file system (whether pre-loaded by the loader or compiled into the kernel) allows any and all file systems to be mounted as root (in theory). 
One can populate the ramdisk with whatever tools one needs to set up the storage solution and mount file systems. 2. Negative experiences with the ramdisk root file system as a general approach for mounting a root file system have been expressed. 3. A well-defined and simple recursive algorithm that the kernel uses for finding (nested) root file systems has not been shot down, but needs to handle the power of GEOM better. See also: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=5942+0+current/freebsd-arch Round 2 preamble: Let me mention a problem with the currently implemented root mount logic as a reminder that something needs to be fixed, even if we don't want to enhance: A USB disk cannot always be used as a root file system by virtue of the USB stack releasing the root mount lock after creating the umass device, but before CAM has created the corresponding da device. The kernel will try mounting from /dev/da0 before the device exists, fail, and then drop into the root mount prompt. Often the story ends here -- with failure. The root mount enhancement intends to solve this scenario by specifically waiting for the mentioned device/path before moving on to the next alternative. Round 2: The logic remains mostly the same as described in round 1, but gains a directive and limited variable substitution. These are added to decouple the mount directive (${FS}:${DEV}) from the creation of the memory disk so that GEOM can do its thing. As such, the creation of a memory disk is now a separate directive: .md To mount the memory disk (UFS in the example), use: ufs:/dev/md# Here md# refers to the md unit created by the last .md directive. Since the logic is for mounting the root file system only, a .md directive implicitly detaches and releases the previously created md device before creating a new one. In other words: the enhancement is not for creating a bunch of md devices. Should this be relaxed so that any number of md devices can be created before we try a root mount?
When the md device appears, GEOM gets to taste the provider and all kinds of interesting things can happen. By decoupling the creating of the md device and the mount directive, it's trivial to handle arbitrarily complex GEOM graphs. For example: ufs:/dev/md#s1a ufs:/dev/md#.uzip ... For completeness, the syntax of the configuration file (in some weird hybrid regex-based specification that is sloppy about spaces) to make sure things get fleshed out enough for review: <.mount.conf> : (^$)* : | | : '#'.* : : | | | | | : ':' : | : | ',' : | '=' | ".md" : | : | ',' : "nocompress" # compress is default | "nocluster" # cluster is default | "async" | "readonly" : ".ask" : "wait" : "onfail" : "panic" # default | "reboot" | "retry" | "continue" : ".init" : | ':' To re-iterate: the logic is recursive. After mounting some file system as root, the kernel will follow the directives in /.mount.conf (if the file exists) for remounting the root file system. At each iteration the kernel will remount devfs under /dev and remount the current root file system under /.mount within the new root file system. Thoughts? 
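A minimal sketch of the "wait for the named device before moving on to the next alternative" behavior from the round 2 preamble, written as a userland simulation in C rather than kernel code. Everything here is an assumption made for illustration: the names `struct candidate` and `pick_root`, the tick-based clock, and the `arrives_at` field that stands in for a real device-arrival event. It is not the proposed implementation.

```c
#include <stddef.h>

/*
 * One root mount alternative. Hypothetical type: 'spec' is an
 * fs:device string as used in the proposal, 'arrives_at' simulates
 * the tick at which the device node shows up (< 0 means never).
 */
struct candidate {
	const char *spec;
	int arrives_at;
};

/*
 * Walk the alternatives in order. For each one, wait up to
 * 'wait_ticks' ticks for its device to appear before giving up on it
 * and trying the next. Returns the index of the first alternative
 * whose device appeared in time, or -1 if none did.
 */
int
pick_root(const struct candidate *c, size_t n, int wait_ticks)
{
	int now = 0;	/* simulated clock, shared by all alternatives */

	for (size_t i = 0; i < n; i++) {
		int deadline = now + wait_ticks;

		for (;;) {
			if (c[i].arrives_at >= 0 && c[i].arrives_at <= now)
				return ((int)i);
			if (now >= deadline)
				break;	/* timed out: next alternative */
			now++;
		}
	}
	return (-1);
}
```

With a list of `{ "ufs:/dev/da0", 3 }` followed by a device that is present from tick 0, a wait of 5 ticks selects the first entry, while a wait of 0 reproduces the current behavior and falls through to the second.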
-- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 16:01:16 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6C06D1065695 for ; Wed, 25 Aug 2010 16:01:16 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 313888FC1A for ; Wed, 25 Aug 2010 16:01:15 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id DC42D3F627; Wed, 25 Aug 2010 16:01:14 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.4/8.14.4) with ESMTP id o7PG1DbB054211; Wed, 25 Aug 2010 16:01:14 GMT (envelope-from phk@critter.freebsd.dk) To: Marcel Moolenaar From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 25 Aug 2010 08:58:16 MST." <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> Date: Wed, 25 Aug 2010 16:01:13 +0000 Message-ID: <54210.1282752073@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: "freebsd-arch@FreeBSD.org Arch" Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 16:01:16 -0000 In message <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>, Marcel Moolenaar wri tes: >don't want to enhance: A USB disk cannot always be used as a root >file system by virtue of the USB stack releasing the root mount >lock after creating the umass device, but before CAM has created >the corresponding da device. 
This is a bug which is entirely unrelated to how we find the root filesystem: It should simply be fixed by CAM grabbing a root mount lock when activated from USB and releasing it only when all its stuff is done. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 18:18:11 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AE52710656A5 for ; Wed, 25 Aug 2010 18:18:11 +0000 (UTC) (envelope-from yanegomi@gmail.com) Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com [209.85.215.182]) by mx1.freebsd.org (Postfix) with ESMTP id 3EE1E8FC1B for ; Wed, 25 Aug 2010 18:18:10 +0000 (UTC) Received: by eyx24 with SMTP id 24so601118eyx.13 for ; Wed, 25 Aug 2010 11:18:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type; bh=ZfMWrdhokGw8ITWt1RhkVvVd/o/KGJUx2AG0HR4dH50=; b=A12Y8Jyd7PJRM33Hv5xlH2TAhniuitAdm7ieF08d/m+7bhp/XnqFacfwBm8b4aGCGK GXLCvE4LMoqV3c8CpUSqvlQlN2IYYmKLJDplyNWWUvljpEeWRly1gfdUQrXR/UqetwN9 4E3hbCdwtIPBifBHevQC08IVP3LMtnSnJOpmI= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; b=MKDqPEi6bWsqYC7OFlyR+O5t3qczqLuRWsuLURMWyYI3W6LnugwdOHRnEMbWmAbuKp A2HRCgNsHdHL7htC+/G2SkwNmpDRyirpEAdyf5b59x51qXDJOwlyJZfHEXu3KUP+F6JY t5tUZVGlolUKxKv7+FHxWNpf/ynntUoGaHfC4= MIME-Version: 1.0 Received: by 10.213.62.206 with SMTP id y14mr5937881ebh.34.1282760290089; Wed, 25 Aug 2010 11:18:10 -0700 (PDT) Sender: yanegomi@gmail.com Received: by
10.14.47.197 with HTTP; Wed, 25 Aug 2010 11:18:09 -0700 (PDT) In-Reply-To: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> Date: Wed, 25 Aug 2010 11:18:09 -0700 X-Google-Sender-Auth: ZWnYFfjdY8oQ91QZPpsVE8ZYdIA Message-ID: From: Garrett Cooper To: Marcel Moolenaar Content-Type: text/plain; charset=ISO-8859-1 Cc: "freebsd-arch@FreeBSD.org Arch" Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 18:18:11 -0000 On Wed, Aug 25, 2010 at 8:58 AM, Marcel Moolenaar wrote: > Summary of round 1: ... > To re-iterate: the logic is recursive. After mounting some file system > as root, the kernel will follow the directives in /.mount.conf (if the > file exists) for remounting the root file system. At each iteration the > kernel will remount devfs under /dev and remount the current root file > system under /.mount within the new root file system. I like the proposal, but like Ed, I do have a concern with infinite recursion. Should a breadcrumb be added to prevent infinite recursion with the mounts, or is it game over, egg on your face, if you create an infinite recursion situation? 
Thanks, -Garrett From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 19:06:58 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5DA061065695 for ; Wed, 25 Aug 2010 19:06:58 +0000 (UTC) (envelope-from fb-arch@psconsult.nl) Received: from mx1.psconsult.nl (unknown [IPv6:2001:7b8:30f:e0::5059:ee8a]) by mx1.freebsd.org (Postfix) with ESMTP id 13AA48FC0C for ; Wed, 25 Aug 2010 19:06:57 +0000 (UTC) Received: from mx1.psconsult.nl (psc11.adsl.iaf.nl [80.89.238.138]) by mx1.psconsult.nl (8.14.4/8.14.4) with ESMTP id o7PJ6px7065853 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Wed, 25 Aug 2010 21:06:56 +0200 (CEST) (envelope-from fb-arch@psconsult.nl) Received: (from paul@localhost) by mx1.psconsult.nl (8.14.4/8.14.4/Submit) id o7PHmpmw064435 for freebsd-arch@freebsd.org; Wed, 25 Aug 2010 19:48:51 +0200 (CEST) (envelope-from fb-arch@psconsult.nl) X-Authentication-Warning: mx1.psconsult.nl: paul set sender to fb-arch@psconsult.nl using -f Date: Wed, 25 Aug 2010 19:48:51 +0200 From: Paul Schenkeveld To: freebsd-arch@freebsd.org Message-ID: <20100825174851.GA64117@psconsult.nl> References: <20100823214946.GF64651@hoeg.nl> <7318E60D-F00F-4519-A3E3-9CE8B752AE88@mac.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <7318E60D-F00F-4519-A3E3-9CE8B752AE88@mac.com> User-Agent: Mutt/1.5.19 (2009-01-05) Subject: Re: RFC: enhancing the root mount logic X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 19:06:58 -0000 On Mon, Aug 23, 2010 at 03:44:03PM -0700, Marcel Moolenaar wrote: > > On Aug 23, 2010, at 2:49 PM, Ed Schouten wrote: > > > * Marcel Moolenaar wrote: > >> Is this something that 
people feel is worth fleshing out and > >> prototyping? > > > > Sounds awesome! This would make my writable boot cd a lot more elegant > > than it is right now. Have you thought about things like possible > > endless loops? Say, you mount a unionfs on the root of the fs itself. So far I've not yet seen any endless loop in computing ... :-) > > This may cause the original .mount.conf to be reinterpreted, right? > > Right. I haven't thought about it. My off the cuff response is that we > should disallow it if the amount of effort required to detect it is > within reason. Alternatively, we could simply impose a global limit on > the depth of the recursion. Either appears reasonable to me, but I may > be overlooking something here... What about a directive in .mount.conf, e.g. ".exit" to end the recursion? > Thoughts? > > -- > Marcel Moolenaar > xcllnt@mac.com Regards, Paul Schenkeveld From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 19:09:08 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 26CD91065693 for ; Wed, 25 Aug 2010 19:09:08 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout024.mac.com (asmtpout024.mac.com [17.148.16.99]) by mx1.freebsd.org (Postfix) with ESMTP id 0C3DC8FC0A for ; Wed, 25 Aug 2010 19:09:07 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp024.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7Q00C3U2J4B930@asmtp024.mac.com>; Wed, 25 Aug 2010 12:09:04 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008250146 X-Proofpoint-Virus-Version: vendor=fsecure 
engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-08-25_09:2010-08-25, 2010-08-25, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: Date: Wed, 25 Aug 2010 12:09:04 -0700 Message-id: <9EA74D18-1CA4-4F3D-9CE5-0BD1B4D6B7BB@mac.com> References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> To: Garrett Cooper X-Mailer: Apple Mail (2.1081) Cc: "freebsd-arch@FreeBSD.org Arch" Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 19:09:08 -0000 On Aug 25, 2010, at 11:18 AM, Garrett Cooper wrote: > On Wed, Aug 25, 2010 at 8:58 AM, Marcel Moolenaar wrote: >> Summary of round 1: > > ... > >> To re-iterate: the logic is recursive. After mounting some file system >> as root, the kernel will follow the directives in /.mount.conf (if the >> file exists) for remounting the root file system. At each iteration the >> kernel will remount devfs under /dev and remount the current root file >> system under /.mount within the new root file system. > > I like the proposal, but like Ed, I do have a concern with > infinite recursion. Should a breadcrumb be added to prevent infinite > recursion with the mounts, or is it game over, egg on your face, if > you create an infinite recursion situation? Since we have a trail of file systems (by virtue of mounting the previous root under the new root at /.mount), we should be able to detect when we're about to mount from a device previously used to mount from. Alternatively or on top of that, we can have a global limit on the recursion depth. Unless this is something we want to control through /.mount.conf, I don't think it's an item that needs to be closed or nailed down before we can move ahead. Put differently: I can implement both to start with... 
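A minimal sketch of the two safeguards mentioned above (refusing a device that already appears in the trail of previous roots, plus a global depth limit), as a userland model in C. The names `struct root_trail`, `root_trail_push`, the return codes, and the limit of 8 are hypothetical and chosen only for illustration; the thread does not specify any of them.

```c
#include <stddef.h>
#include <string.h>

#define ROOT_MAX_DEPTH 8	/* hypothetical global recursion limit */

enum { ROOT_OK = 0, ROOT_ELOOP = -1, ROOT_EDEPTH = -2 };

/* Devices used as root so far, oldest first (the "/.mount" trail). */
struct root_trail {
	const char *dev[ROOT_MAX_DEPTH];
	size_t depth;
};

/*
 * Record 'dev' as the next root. Fails with ROOT_ELOOP if the device
 * was already used earlier in the trail, and with ROOT_EDEPTH if the
 * depth limit has been reached. The trail is left unchanged on failure.
 */
int
root_trail_push(struct root_trail *t, const char *dev)
{
	for (size_t i = 0; i < t->depth; i++) {
		if (strcmp(t->dev[i], dev) == 0)
			return (ROOT_ELOOP);
	}
	if (t->depth >= ROOT_MAX_DEPTH)
		return (ROOT_EDEPTH);
	t->dev[t->depth++] = dev;
	return (ROOT_OK);
}
```

Checking for a repeated device first means a loop is reported as a loop even when the trail is also full; the depth limit then only catches long chains of distinct devices.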
-- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 20:49:36 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2E2B7106566B for ; Wed, 25 Aug 2010 20:49:36 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id E4F2C8FC0A for ; Wed, 25 Aug 2010 20:49:35 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7PKijUs013734; Wed, 25 Aug 2010 14:44:46 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Wed, 25 Aug 2010 14:44:47 -0600 (MDT) Message-Id: <20100825.144447.195066307629816163.imp@bsdimp.com> To: phk@phk.freebsd.dk From: "M. Warner Losh" In-Reply-To: <54210.1282752073@critter.freebsd.dk> References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> <54210.1282752073@critter.freebsd.dk> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: xcllnt@mac.com, freebsd-arch@FreeBSD.org Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 20:49:36 -0000 In message: <54210.1282752073@critter.freebsd.dk> "Poul-Henning Kamp" writes: : In message <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>, Marcel Moolenaar wri : tes: : : >don't want to enhance: A USB disk cannot always be used as a root : >file system by virtue of the USB stack releasing the root mount : >lock after creating the umass device, but before CAM has created : >the corresponding da device. 
: : This is a bug which is entirely unrelated to how we find the : root filesystem: It should simply be fixed by CAM grabing a : root mount lock when activated from USB and releasing it : only when all it's stuff is done. We already do this... But it is insufficient since usb discovery is done asynchronously... Scott has a similar fix in the pipeline, but I don't know the state of it. Warner From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 21:11:18 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9593910656A8 for ; Wed, 25 Aug 2010 21:11:18 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 4068C8FC13 for ; Wed, 25 Aug 2010 21:11:18 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7PL2eWY013887; Wed, 25 Aug 2010 15:02:40 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Wed, 25 Aug 2010 15:02:42 -0600 (MDT) Message-Id: <20100825.150242.450985660301753093.imp@bsdimp.com> To: xcllnt@mac.com From: "M. Warner Losh" In-Reply-To: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.org Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 21:11:18 -0000 In message: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> Marcel Moolenaar writes: : 2. 
Negative experiences with the ramdisk root file system as a : general approach for mounting a root file system have been : expressed. To be fair, it was both positive and negative experiences. The negative experiences were from the server folks who hated it when they upgraded and the ram disk compiled into the kernel was out of date or incomplete. The positive experiences were from the embedded folks who used the RAM disk given to the kernel by the boot loader so there was quite a bit more flexibility. This ram disk comes from a dedicated flash partition and is well supported by the different embedded boot loaders that are common in the embedded space (mostly because Linux requires it). There's even support for compression of the kernel and ram disk in the boot loader: it expands the kernel, the ram disk and then tells the kernel where to find the ram disk. : Let me mention a problem with the currently implemented root mount : logic as a reminder that something needs to be fixed, even if we : don't want to enhance: A USB disk cannot always be used as a root : file system by virtue of the USB stack releasing the root mount : lock after creating the umass device, but before CAM has created : the corresponding da device. The kernel will try mounting from : /dev/da0 before the device exists, fails and then drops into the : root mount prompt. Often the story ends here -- with failure. Actually, the problem isn't the locking at all. The problem is that the umass SIMs arrive 'late' in the game. By the time they arrive, CAM has already released the root lock. But as phk points out, this is a bug in the usb/cam interaction that should be fixed there, and it is completely irrelevant for your root mounting system. : Round 2: : : The logic remains mostly the same as described in round 1, but : gains a directive and limited variable substitution. These are : added to decouple the mount directive (${FS}:${DEV}) from the : creation of the memory disk so that GEOM can do it's thing.
As : such, the creation of a memory disk is now a separate directive: : : .md : : To mount the memory disk (UFS in the example), use: : : ufs:/dev/md# : : Here md# refers to the md unit created by the last .md directive. : Since the logic is for mounting the root file system only, a .md : directive implicitly detaches and releases the previously created : md device before creating a new one. In other words: the : enhancement is not for creating a bunch of md devices. : : Should this be relaxed so that any number of md device can be : created before we try a root mount? I guess I'm having trouble understanding why you'd need this given that ram disk information is already passed from the boot loader (/boot/loader or in the board's init code (although the latter I don't think is done by any in-tree code)) to the kernel... : When the md device appears, GEOM gets to taste the provider : and all kinds of interesting things can happen. By decoupling : the creating of the md device and the mount directive, it's : trivial to handle arbitrarily complex GEOM graphs. For example: : : ufs:/dev/md#s1a : ufs:/dev/md#.uzip : ... Shouldn't the MD device already be created by virtue of the MD_ROOT junk in the kernel config file? Why do you need a special directive to create it? : For completeness, the syntax of the configuration file (in : some weird hybrid regex-based specification that is sloppy : about spaces) to make sure things get fleshed out enough : for review: : : <.mount.conf> : (^$)* : : : | : | : : '#'.* : : : : : | : | : | : | : | : : ':' : : : | : : : | ',' : : : | '=' : | ".md" : : : | : : : | ',' : : "nocompress" # compress is default : | "nocluster" # cluster is default : | "async" : | "readonly" Does read-write compressed work? Also, is compression a property of the md device, or the GEOM that tastes it to see that it is compressed... What does cluster do anyway? I see that as an option for mdconfig, but there's no explanation of it there or in the md man page.
How do you differentiate between these two roots: mdconfig -a -t file -f /gerbil.ram and mdconfig -a -t swap -s 4m dd if=/gerbil.rom of=/dev/md0 bs=1m with this scheme? I'm guessing only the former makes sense, although for upgrades, maybe you want the latter so you can replace /gerbil.rom at any time. But in that case, you're better off going through /boot/loader for this stuff, which leads me to my next question: Would any md device passed by the boot loader (or compiled into the kernel) effectively be the second one, so that you'd not need any .md directives at all? : : ".ask" : : "wait" : : "onfail" : : "panic" # default : | "reboot" : | "retry" : | "continue" : : ".init" : : : | ':' : : : To re-iterate: the logic is recursive. After mounting some file system : as root, the kernel will follow the directives in /.mount.conf (if the : file exists) for remounting the root file system. At each iteration the : kernel will remount devfs under /dev and remount the current root file : system under /.mount within the new root file system. : : Thoughts? How is init handled at each stage? Forked after the last one, I assume?
Warner From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 21:29:51 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0BE84106564A for ; Wed, 25 Aug 2010 21:29:51 +0000 (UTC) (envelope-from andy@fud.org.nz) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id D202C8FC1E for ; Wed, 25 Aug 2010 21:29:50 +0000 (UTC) Received: by iwn36 with SMTP id 36so1022097iwn.13 for ; Wed, 25 Aug 2010 14:29:50 -0700 (PDT) MIME-Version: 1.0 Received: by 10.231.182.204 with SMTP id cd12mr10818749ibb.101.1282769999275; Wed, 25 Aug 2010 13:59:59 -0700 (PDT) Sender: andy@fud.org.nz Received: by 10.231.187.6 with HTTP; Wed, 25 Aug 2010 13:59:59 -0700 (PDT) In-Reply-To: <20100825.144447.195066307629816163.imp@bsdimp.com> References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> <54210.1282752073@critter.freebsd.dk> <20100825.144447.195066307629816163.imp@bsdimp.com> Date: Thu, 26 Aug 2010 08:59:59 +1200 X-Google-Sender-Auth: dJAeRM0U-F2nutUQzTEk2aqwGqs Message-ID: From: Andrew Thompson To: "M. Warner Losh" Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: phk@phk.freebsd.dk, xcllnt@mac.com, freebsd-arch@freebsd.org Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 21:29:51 -0000 On 26 August 2010 08:44, M. 
Warner Losh wrote: > In message: <54210.1282752073@critter.freebsd.dk> > "Poul-Henning Kamp" writes: > : In message <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>, Marcel Moolenaar wri > : tes: > : > : >don't want to enhance: A USB disk cannot always be used as a root > : >file system by virtue of the USB stack releasing the root mount > : >lock after creating the umass device, but before CAM has created > : >the corresponding da device. > : > : This is a bug which is entirely unrelated to how we find the > : root filesystem: It should simply be fixed by CAM grabing a > : root mount lock when activated from USB and releasing it > : only when all it's stuff is done. > > We already do this... But it is insufficient since usb discovery is > done asynchronously... It's more that the usb disk appears and the root mount lock is dropped without geom tasting being taken into account. This was fixed with r190677 but then I was asked to back it out (r190878). > Scott has a similar fix in the pipeline, but I don't know the state of > it. It would be great to get this finished; I believe the solution Scott wanted was to properly use intr_config_hooks to kick off usb enumeration.
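A minimal userland model of the hold/release idea being debated here, in C: the root mount may only proceed once every subsystem that took a hold has released it, so a later stage (CAM, GEOM tasting) has to take its own hold before the earlier stage (USB) drops its one. This is only a sketch under assumptions; `struct hold_set` and the `model_*` functions are hypothetical names and are not the kernel's root_mount_hold()/root_mount_rel() interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_HOLDS 16	/* hypothetical capacity of the model */

/* Set of subsystems currently holding off the root mount. */
struct hold_set {
	const char *who[MAX_HOLDS];
	size_t n;
};

/* Take a hold on behalf of 'who'. Returns false if the set is full. */
bool
model_hold(struct hold_set *s, const char *who)
{
	if (s->n >= MAX_HOLDS)
		return (false);
	s->who[s->n++] = who;
	return (true);
}

/* Release one hold taken by 'who'. Returns false if none was found. */
bool
model_release(struct hold_set *s, const char *who)
{
	for (size_t i = 0; i < s->n; i++) {
		if (strcmp(s->who[i], who) == 0) {
			/* Fill the gap with the last entry. */
			s->who[i] = s->who[--s->n];
			return (true);
		}
	}
	return (false);
}

/* The root mount may proceed only when nobody holds it off. */
bool
model_may_mount_root(const struct hold_set *s)
{
	return (s->n == 0);
}
```

In this model the failure described in the thread corresponds to "usb" releasing before "cam" or "geom" has taken a hold, which leaves the set empty for a moment and lets the mount attempt run too early.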
Andrew From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 21:51:59 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 180A51065694 for ; Wed, 25 Aug 2010 21:51:59 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout025.mac.com (asmtpout025.mac.com [17.148.16.100]) by mx1.freebsd.org (Postfix) with ESMTP id F216E8FC1B for ; Wed, 25 Aug 2010 21:51:58 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp025.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7Q00813A2LKX50@asmtp025.mac.com> for freebsd-arch@FreeBSD.org; Wed, 25 Aug 2010 14:51:58 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008250181 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-08-25_10:2010-08-25, 2010-08-25, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100825.150242.450985660301753093.imp@bsdimp.com> Date: Wed, 25 Aug 2010 14:51:57 -0700 Message-id: References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> <20100825.150242.450985660301753093.imp@bsdimp.com> To: "M. Warner Losh" X-Mailer: Apple Mail (2.1081) Cc: freebsd-arch@FreeBSD.org Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 21:51:59 -0000 On Aug 25, 2010, at 2:02 PM, M. 
Warner Losh wrote: > : Let me mention a problem with the currently implemented root mount > : logic as a reminder that something needs to be fixed, even if we > : don't want to enhance: A USB disk cannot always be used as a root > : file system by virtue of the USB stack releasing the root mount > : lock after creating the umass device, but before CAM has created > : the corresponding da device. The kernel will try mounting from > : /dev/da0 before the device exists, fails and then drops into the > : root mount prompt. Often the story ends here -- with failure. > > Actually, the problem isn't the locking at all. The problem is that > the umass SIMs arrive 'late' in the game. by the time they arrive, > CAM has already released the root lock. But as phk points out, this > is a bug in the usb/cam interaction and should be fixed there and > completely irrelevant for your root mounting system. I perceive the problem differently, because I see no value in waiting for *all* devices to appear when the root device is already there. That just slows down the boot. I prefer mounting the root file system as soon as the device appears and enhance the fstab mounting to deal with the device not being there yet. Consequently: the bug is with root_mount_hold() and root_mount_rel() as a means to do the right thing... > : Here md# refers to the md unit created by the last .md directive. > : Since the logic is for mounting the root file system only, a .md > : directive implicitly detaches and releases the previously created > : md device before creating a new one. In other words: the > : enhancement is not for creating a bunch of md devices. > : > : Should this be relaxed so that any number of md device can be > : created before we try a root mount? 
> > I guess I'm having trouble understanding why you'd need this given > that ram disk information is already passed from the boot loader > (/boot/loader or in the board's init code (although the latter I don't > think is done by any in-tree code)) to the kernel... You're fixating on the preloaded or compiled-in ramdisk. The .md directive is there for vnode-backed images -- the root file system image is stored on a file system and memory is only used for buffering and caching. > read-write compressed works? Also, is compression a property of the > md device, or the GEOM that tastes it to see that it is compressed... > What does cluster do anyway? I see that as an option for mdconfig, > but there's no explanation of it there or in the md man page. The options are as useful as the md implementation is. The options are listed because they appeared in mdconfig. Semantics is not to be argued when syntax is discussed :-) > How do you differentiate between these two roots: > > mdconfig -a -t file -f /gerbil.ram > and > mdconfig -a -t swap -s 4m > dd if=/gerbil.rom of=/dev/md0 bs=1m The first is supported, the second isn't. The .md directive only supports vnode-backed md devices. There's no point trying to mount a malloc- or swap-backed md device because they instantiate empty and are useless for root file systems, unless you construct them first (using dd is a way to construct them). Supporting the construction of a root file system is where things get complicated and where I personally don't want to go. > But in that case, you're better off going through > /boot/loader for this stuff, which leads me to my next question: Would > any md device passed by the boot loader (or compiled into the kernel) > would effectively be the second one and you'd not need any .md > directives at all? 
You can start off with a preloaded or compiled-in ramdisk, and then recursively mount root, including from vnode-backed md devices, so the .md directive is not rendered useless by preloading or compiling in. You can even end the root mount recursion with the preloaded ramdisk last -- this gives you premounted file systems under /.mount without having to run /etc/rc (if you want to)... > : > : To re-iterate: the logic is recursive. After mounting some file system > : as root, the kernel will follow the directives in /.mount.conf (if the > : file exists) for remounting the root file system. At each iteration the > : kernel will remount devfs under /dev and remount the current root file > : system under /.mount within the new root file system. > : > : Thoughts? > > How is init handled at each stage? forked after the last one, I assume? No, init is only spawned after the root mount recursion ends. The .init directive is there to override defaults. This is envisioned to be useful for rescue images where you want to spawn /rescue/init or installation images where you may want to spawn sysinstall. It eliminates having to hardcode the possibilities in the kernel. In a sense it gives you more freedom in how you want to call your initial process without the pitfalls when the root mount recursion ends early due to a problem. As a concrete example, consider having a single file system on a writable medium (say /dev/da0) with the software images stored on it as ISO images. You can install some recovery procedure on /dev/da0 that gets run when none of the ISO images can be mounted. The ISO images have /sbin/init as init as usual, but you can select to run /sbin/recovery from /dev/da0. This allows for a single init executable that performs the right functions based on the program name, for example...
-- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 22:39:38 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CFFDD1065674 for ; Wed, 25 Aug 2010 22:39:38 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 7E6958FC13 for ; Wed, 25 Aug 2010 22:39:38 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7PMaaTl014728; Wed, 25 Aug 2010 16:36:36 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Wed, 25 Aug 2010 16:36:37 -0600 (MDT) Message-Id: <20100825.163637.1151864885495248514.imp@bsdimp.com> To: xcllnt@mac.com From: "M. Warner Losh" In-Reply-To: References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> <20100825.150242.450985660301753093.imp@bsdimp.com> X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.ORG Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 22:39:39 -0000 Hey Marcel, The more I talk about this, the more that I think it might be useful in some ways. In message: Marcel Moolenaar writes: : : On Aug 25, 2010, at 2:02 PM, M. 
Warner Losh wrote: : > : Let me mention a problem with the currently implemented root mount : > : logic as a reminder that something needs to be fixed, even if we : > : don't want to enhance: A USB disk cannot always be used as a root : > : file system by virtue of the USB stack releasing the root mount : > : lock after creating the umass device, but before CAM has created : > : the corresponding da device. The kernel will try mounting from : > : /dev/da0 before the device exists, fails and then drops into the : > : root mount prompt. Often the story ends here -- with failure. : > : > Actually, the problem isn't the locking at all. The problem is that : > the umass SIMs arrive 'late' in the game. by the time they arrive, : > CAM has already released the root lock. But as phk points out, this : > is a bug in the usb/cam interaction and should be fixed there and : > completely irrelevant for your root mounting system. : : I perceive the problem differently, because I see no value in waiting : for *all* devices to appear when the root device is already there. : That just slows down the boot. : : I prefer mounting the root file system as soon as the device appears : and enhance the fstab mounting to deal with the device not being : there yet. : : Consequently: the bug is with root_mount_hold() and root_mount_rel() : as a means to do the right thing... We don't need to enhance fstab to cope with / not being there. We need / to be there, one way or another. We may disagree on how best to make it be there. In the past I've swung in the direction you talk about too. I've hacked mountroot() to wait up to a given amount of time for new devices that contain the root file system to appear before giving up. That way, if you know you've got the root file system, you can go right away, but otherwise you do something more intelligent than 'nothing' or 'prompt' when it isn't there. This meshes well with the .wait directive and your thinking too.
The part I didn't like about this was the arbitrary upper time limit on it. I'd like to wait until *ALL* devices are done to fail and accept a '.wait 5' as an ugly alternative to knowing that all boot devices are there. I've also thought about having it drop to a prompt, but noticing that new devices show up. You could automatically proceed, or at the very least be able to type the new device in once it is there. This would let the normal boot proceed, kick you to the prompt if, say, the usb drive fell out and still let you plug it back in and have the system pick back up again. So, if your approach could have some hook for these types of enhancements (or used to implement them), that would be a compelling reason to support it. Of course, it would still require knowing when you are done with your initial scans of the device tree, which is at present an unsolved problem.... : > : Here md# refers to the md unit created by the last .md directive. : > : Since the logic is for mounting the root file system only, a .md : > : directive implicitly detaches and releases the previously created : > : md device before creating a new one. In other words: the : > : enhancement is not for creating a bunch of md devices. : > : : > : Should this be relaxed so that any number of md device can be : > : created before we try a root mount? : > : > I guess I'm having trouble understanding why you'd need this given : > that ram disk information is already passed from the boot loader : > (/boot/loader or in the board's init code (although the latter I don't : > think is done by any in-tree code)) to the kernel... : : You're fixating on the preloaded or compiled-in ramdisk. The : .md directive is there for vnode-backed images -- the root : file system image is stored on a file system and memory is : only used for buffering and caching. That makes sense. Not so much fixating on them, but noting that they work really really well and are the basis for many livecd's and such. 
They are the basis for all the picobsd derivatives as well. : > read-write compressed works? Also, is compression a property of the : > md device, or the GEOM that tastes it to see that it is compressed... : > What does cluster do anyway? I see that as an option for mdconfig, : > but there's no explanation of it there or in the md man page. : : The options are as useful as the md implementation is. The options : are listed because they appeared in mdconfig. Semantics is not to : be argued when syntax is discussed :-) fair enough... The compression bit was confusing. : > How do you differentiate between these two roots: : > : > mdconfig -a -t file -f /gerbil.ram : > and : > mdconfig -a -t swap -s 4m : > dd if=/gerbil.rom of=/dev/md0 bs=1m : : The first is supported, the second isn't. The .md directive only : supports vnode-backed md devices. There's no point trying to mount : a malloc- or swap-backed md device because they instantiate empty : and are useless for root file systems, unless you construct them : first (using dd is a way to construct them). Supporting the : construction of a root file system is where things get complicated : and where I personally don't want to go. Fair enough. It was mostly just a question for clarification that wound up rambling far too long. : > But in that case, you're better off going through : > /boot/loader for this stuff, which leads me to my next question: Would : > any md device passed by the boot loader (or compiled into the kernel) : > would effectively be the second one and you'd not need any .md : > directives at all? : : You can start off with a preloaded or compiled-in ramdisk, and then : recursively mount root, including from vnode-backed md devices, so : the .md directive is not rendered useless by preloading or compiling : in. You can even end the root mount recursion with the preloaded : ramdisk last -- this gives you premounted file systems under /.mount : without having to run /etc/rc (if you want to)... 
Is the .md directive globally destructive, or just destructive to the local level of recursion? If it is just the local level, how do you specify the unit number? Maybe a better approach would be to encourage people to mount root based on how file systems are labelled, rather than what unit they happen to be taking up... Would that help any here? : > : To re-iterate: the logic is recursive. After mounting some file system : > : as root, the kernel will follow the directives in /.mount.conf (if the : > : file exists) for remounting the root file system. At each iteration the : > : kernel will remount devfs under /dev and remount the current root file : > : system under /.mount within the new root file system. : > : : > : Thoughts? : > : > How is init handled at each stage? forked after the last one, I assume? : : No, init is only spawned after the root mount recursion ends. The .init : directive is there to override defaults. This is envisioned to be useful : for rescue images where you want to swawn /rescue/init or installation : images where you may want to spawn sysinstall. It eliminates having to : hardcode the possibilities in the kernel. Right now through the boot loader you can set init_path, why would you need to add the ability to spawn a different one to the scripts? : In a sense it gives you more freedom in how you want to call your initial : process without the pitfalls when the root mount recursion ends early due : to a problem. : : As a concrete example, consider having a single file system on a writable : medium (say /dev/da0) and software images are ISO images stored in it. : You can install some recovery procedure on /dev/da0 that gets run when : none of the ISO images can be mounted. The ISO images have /sbin/init : as init as usual, but you can select to run /sbin/recovery from /dev/da0. : This allows for a single init executable that performs the right functions : based on the program name for example... 
I think this is a bit convoluted an example. The ISO images would fail to mount only if they were all damaged in a way that would make them unmountable, true? If the backup ISO is AFU, then what's to say that /sbin/recovery isn't also AFU? When would you need this? Without a 'branch' construct of some kind, there's no way to match machine/platform names here. Given the limited ability for us to run kernels on multiple different platforms, I'm not sure how big a deal this actually would be, but if you can do this, it would be a nice plus. I presume the default script would be something like (ignoring the hard coding of device names): ufs:/dev/da0s1a .wait 5 .onfail ask which would mount /dev/da0s1a when it became available, waiting up to 5 seconds and asking the user afterwards if that failed, right? Warner From owner-freebsd-arch@FreeBSD.ORG Wed Aug 25 23:47:57 2010 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AB98510656A7 for ; Wed, 25 Aug 2010 23:47:57 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout029.mac.com (asmtpout029.mac.com [17.148.16.104]) by mx1.freebsd.org (Postfix) with ESMTP id 8FA6F8FC15 for ; Wed, 25 Aug 2010 23:47:57 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp029.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L7Q006K7FFW9Z80@asmtp029.mac.com> for freebsd-arch@FreeBSD.ORG; Wed, 25 Aug 2010 16:47:57 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1008250206 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 
definitions=2010-08-25_11:2010-08-25, 2010-08-25, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100825.163637.1151864885495248514.imp@bsdimp.com> Date: Wed, 25 Aug 2010 16:47:55 -0700 Message-id: <9765A15A-6CCD-4341-A103-5501CC4FCFDD@mac.com> References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> <20100825.150242.450985660301753093.imp@bsdimp.com> <20100825.163637.1151864885495248514.imp@bsdimp.com> To: "M. Warner Losh" X-Mailer: Apple Mail (2.1081) Cc: freebsd-arch@FreeBSD.ORG Subject: Re: RFC: root mount enhancement (round 2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Aug 2010 23:47:57 -0000 On Aug 25, 2010, at 3:36 PM, M. Warner Losh wrote: > Hey Marcel, > > The more I talk about this, the more that I think it might be useful > in some ways. Ok. I'll start prototyping something so that we can see if it can live up to its promise or not. > : : I prefer mounting the root file system as soon as the device appears > : and enhance the fstab mounting to deal with the device not being > : there yet. > : > : Consequently: the bug is with root_mount_hold() and root_mount_rel() > : as a means to do the right thing... > > We don't need to enhance fstab to cope with / not being there. I'm sorry. I worded it too sloppily. The enhancement relates to all other file systems that we want to mount during boot, but which may not have been discovered yet. If we proceed with the boot as soon as we have the desired root file system, we create a serialization problem downstream that we don't have when we wait for all devices first. > I've hacked mountroot() wait up to a given amount of time > new devices to appear that contain the root file system before giving > up. 
That way, if you know you've got the root file system, you can go > right away, but otherwise you do something more intelligent than > 'nothing' or 'prompt' when it isn't there. This meshes well with the > .wait directive and your thinking too. The part I didn't like about > this was the arbitrary upper time limit on it. I'd like to wait until > *ALL* devices are done to fail and accept a '.wait 5' as an ugly > alternative to knowing that all boot devices are there. I may be able to implement this by changing the .wait directive to take a flag: .wait The flag can be "next", meaning that we wait up to X seconds until we give up on the next mount directive. The flag could also be "all", which then slightly changes the meaning into the max number of seconds to wait for "new device" events before trying the next (or subsequent) mount directives. Put differently: the number of seconds to idle and wait *after* the last device arrival before trying the first or next mount. Every time a new device is announced, you restart the clock. Would this do what you described? > I've also thought about having it drop to a prompt, but noticing that > new devices show up. You could automatically proceed, or at the very > least be able to type the new device in once it is there. This would > let the normal boot proceed, kick you to the prompt if, say, the usb > drive fell out and still let you plug it back in and have the system > pick back up again. Interesting. I like this. Let me see if this is doable without inviting complexity. > So, if your approach could have some hook for these types of > enhancements (or used to implement them), that would be a compelling > reason to support it. Of course, it would still require knowing when > you are done with your initial scans of the device tree, which is at > present an unsolved problem.... Technically speaking we're never done.
If I plug in a disk any time after booting up, then we didn't wait long enough before mounting root :-) Seriously: hot plug implies that you can never truly wait for all devices, because they can come and go during the entire up time of the machine. Proceeding with the boot based on some reasonable heuristics (i.e. nothing new was found in the last X seconds, so it's unlikely we'll get a new disk) is probably the best we can do.... > : > But in that case, you're better off going through > : > /boot/loader for this stuff, which leads me to my next question: Would > : > any md device passed by the boot loader (or compiled into the kernel) > : > would effectively be the second one and you'd not need any .md > : > directives at all? > : > : You can start off with a preloaded or compiled-in ramdisk, and then > : recursively mount root, including from vnode-backed md devices, so > : the .md directive is not rendered useless by preloading or compiling > : in. You can even end the root mount recursion with the preloaded > : ramdisk last -- this gives you premounted file systems under /.mount > : without having to run /etc/rc (if you want to)... > > Is the .md directive globally destructive, or just destructive to the > local level of recursion? If it is just the local level, how do you > specify the unit number? Maybe a better approach would be to > encourage people to mount root based on how file systems are labelled, > rather than what unit they happen to be taking up... Would that help > any here? The .md directive (as envisioned so far) uses dynamic unit numbers and is only locally destructive. This allows nested mounting of root file systems that are all vnode-backed (don't ask me for a real-life use case now :-) The proposal uses '#' as the placeholder for the unit number. To be precise: the '#' is literal and appears in the configuration file to denote the md unit number created by the last .md directive. As such, you don't actually need to know it. Too klugy? 
Too limited? > : > > : > How is init handled at each stage? forked after the last one, I assume? > : > : No, init is only spawned after the root mount recursion ends. The .init > : directive is there to override defaults. This is envisioned to be useful > : for rescue images where you want to swawn /rescue/init or installation > : images where you may want to spawn sysinstall. It eliminates having to > : hardcode the possibilities in the kernel. > > Right now through the boot loader you can set init_path, why would you > need to add the ability to spawn a different one to the scripts? No particular reason. I just tossed it in. If it's over the top, then I'll remove it. It was just one of those ideas... > : In a sense it gives you more freedom in how you want to call your initial > : process without the pitfalls when the root mount recursion ends early due > : to a problem. > : > : As a concrete example, consider having a single file system on a writable > : medium (say /dev/da0) and software images are ISO images stored in it. > : You can install some recovery procedure on /dev/da0 that gets run when > : none of the ISO images can be mounted. The ISO images have /sbin/init > : as init as usual, but you can select to run /sbin/recovery from /dev/da0. > : This allows for a single init executable that performs the right functions > : based on the program name for example... > > I think this is a bit convoluted an example. The ISO images would > fail to mount only if they were all damaged in a way that would make > them unmountable, true? If the backup ISO is AFU, then what's to say > that /sbin/recovery isn't also AFU? When would you need this? The images could also have been, euh .. misplaced :-) What to do when the ISO images aren't there? A panic may not be the most user friendly response... > I presume the default script would be something like (ignoring the > hard coding of device names): > > ufs:/dev/da0s1a > .wait 5 > .onfail ask Roughly. 
devfs will synthesize the .mount.conf contents based on tunables and kernel options: the same options we now have hardcoded. Without recursion this means that the root mount will not be any different from what it is now. > which would mount /dev/da0s1a when it became available, waiting up to > 5 seconds and asking the user afterwards if that failed, right? Yes, but I like the feedback I got from Matthew, who said that the .wait applies to the mount directive following it. So the .wait will precede the mount. Also, the proposal has an .ask directive, rather than asking on failure. I see asking as a mount directive of which the FS and device are provided by the user. -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Thu Aug 26 00:05:36 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DB6C6106566C; Thu, 26 Aug 2010 00:05:36 +0000 (UTC) (envelope-from max@love2party.net) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.126.186]) by mx1.freebsd.org (Postfix) with ESMTP id 2F38D8FC13; Thu, 26 Aug 2010 00:05:35 +0000 (UTC) Received: from f8x64.laiers.local (dslb-088-066-038-053.pools.arcor-ip.net [88.66.38.53]) by mrelayeu.kundenserver.de (node=mrbap1) with ESMTP (Nemesis) id 0LpPi1-1PHiNk1bfR-00em6r; Thu, 26 Aug 2010 02:05:34 +0200 From: Max Laier Organization: FreeBSD To: freebsd-arch@freebsd.org Date: Thu, 26 Aug 2010 02:05:32 +0200 User-Agent: KMail/1.13.5 (FreeBSD/8.1-RELEASE; KDE/4.4.5; amd64; ; ) References: <201008160515.21412.max@love2party.net> <201008240045.15998.max@laiers.net> <4C73D0FA.5030102@freebsd.org> In-Reply-To: <4C73D0FA.5030102@freebsd.org> MIME-Version: 1.0 Content-Type: Multipart/Mixed; boundary="Boundary-00=_M/adMKo7R9HbyBw" Message-Id: <201008260205.32416.max@love2party.net> X-Provags-ID: V02:K0:HYfpzmDMnXHIqZsA6wYzridVct8x6lnWVbObJ2/VUMA
3/x0V81DnLlqHp5DMj3HeoMNgmIDxxCBAUF0QkGNfhgmmEnt9d 3j5MAId6qgYf1TntHUd6rCYFgt9wuhXbFuOUgkCFM/3nXFO7f/ WZafki/mqk3OQeEwn++/NKEvePM4Jv1z6uIt+YIa3B55iDJDbr 1G6B6xa8xfvDJk2zIpdrg== X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Cc: Stephan Uphoff Subject: Re: rmlock(9) two additions X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Aug 2010 00:05:36 -0000 --Boundary-00=_M/adMKo7R9HbyBw Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit On Tuesday 24 August 2010 16:02:34 Stephan Uphoff wrote: > Yes - this is a problem that needs to be addressed. > Fortunately most platforms won't need to be as strict and I suggest per > platform parameters. > An alternative that was in my original design was to use a bitmap for > the rm_noreadtoken. > Each CPU would then have an associated bit that will only be cleared by > that cpu. > This would also allow targeted IPIs to only the token holders. Okay ... attached is a version with the bitmask idea. It does not use per-platform parameters yet, but I believe it fixes the issue in general. This comes at the cost of an additional memory reference to pc_cpumask on the exit conditional from the fast-path. I don't think this is too much of a problem currently, as this will be in the cache already ... but I might be wrong. For easier review, I've also attached my current version of kern_rmlock.c. BTW, this also fixes the trylock race by trylocking the base lock. Again, I don't think the race against other readers is a concern in the trylock API. I'd like to move forward with this in some way ... any objections and/or input on what to do/fix before committing this? Any ideas/pointers on how to best implement the per-platform selector?
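For anyone who wants the bitmask idea in isolation, a toy user-space model follows. This is a sketch under stated assumptions, not the attached patch: the model_* names, the fixed four-CPU mask and the plain integer type are all made up for illustration, and the base lock, the per-CPU tracker lists and the IPI itself are left out. A set bit means that CPU has no read token and must take the slow path; the writer reclaims all tokens and learns which CPUs it would have to interrupt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the per-CPU read-token bitmask. All names
 * are illustrative; locking and IPI delivery are deliberately
 * omitted.
 */
#define MODEL_NCPU	4

typedef uint32_t model_cpumask_t;

static const model_cpumask_t model_all_cpus = (1u << MODEL_NCPU) - 1;

struct model_rmlock {
	model_cpumask_t writecpus;	/* bit set: CPU holds no read token */
};

/* Initially no CPU holds a read token. */
static void
model_init(struct model_rmlock *rm)
{
	rm->writecpus = model_all_cpus;
}

/* Reader fast-path check: nonzero means the slow path is required. */
static int
model_needs_slow_path(const struct model_rmlock *rm, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NCPU);
	return ((rm->writecpus & (1u << cpu)) != 0);
}

/* Reader slow path (base lock assumed held): grant this CPU a token. */
static void
model_grant_token(struct model_rmlock *rm, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NCPU);
	rm->writecpus &= ~(1u << cpu);
}

/*
 * Writer: take back every token and return the set of CPUs that held
 * one, i.e. the only CPUs a targeted IPI would need to reach.
 */
static model_cpumask_t
model_wlock(struct model_rmlock *rm)
{
	model_cpumask_t readcpus;

	readcpus = model_all_cpus & ~rm->writecpus;
	rm->writecpus = model_all_cpus;
	return (readcpus);
}
```

In this model a second model_wlock() with no reader in between returns an empty set, which corresponds to back-to-back write locks not needing any IPI at all.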
Thanks, Max --Boundary-00=_M/adMKo7R9HbyBw Content-Type: text/x-patch; charset="ISO-8859-1"; name="rmlock.full.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="rmlock.full.diff" diff --git a/share/man/man9/Makefile b/share/man/man9/Makefile index b438b90..e6d8881 100644 --- a/share/man/man9/Makefile +++ b/share/man/man9/Makefile @@ -986,6 +986,7 @@ MLINKS+=rman.9 rman_activate_resource.9 \ MLINKS+=rmlock.9 rm_destroy.9 \ rmlock.9 rm_init.9 \ rmlock.9 rm_rlock.9 \ + rmlock.9 rm_try_rlock.9 \ rmlock.9 rm_runlock.9 \ rmlock.9 RM_SYSINIT.9 \ rmlock.9 rm_wlock.9 \ diff --git a/share/man/man9/locking.9 b/share/man/man9/locking.9 index 005f476..3728319 100644 --- a/share/man/man9/locking.9 +++ b/share/man/man9/locking.9 @@ -301,7 +301,7 @@ one of the synchronization primitives discussed: .It mutex Ta \&ok Ta \&ok-1 Ta \&no Ta \&ok Ta \&ok Ta \&no-3 .It sx Ta \&ok Ta \&ok Ta \&ok-2 Ta \&ok Ta \&ok Ta \&ok-4 .It rwlock Ta \&ok Ta \&ok Ta \&no Ta \&ok-2 Ta \&ok Ta \&no-3 -.It rmlock Ta \&ok Ta \&ok Ta \&no Ta \&ok Ta \&ok-2 Ta \&no +.It rmlock Ta \&ok Ta \&ok Ta \&ok-5 Ta \&ok Ta \&ok-2 Ta \&ok-5 .El .Pp .Em *1 @@ -326,6 +326,13 @@ Though one can sleep holding an sx lock, one can also use .Fn sx_sleep which will atomically release this primitive when going to sleep and reacquire it on wakeup. +.Pp +.Em *5 +.Em Read-mostly +locks can be initialized to support sleeping while holding a write lock. +See +.Xr rmlock 9 +for details. .Ss Context mode table The next table shows what can be used in different contexts. At this time this is a rather easy to remember table. 
diff --git a/share/man/man9/rmlock.9 b/share/man/man9/rmlock.9
index e99661d..28ac0a5 100644
--- a/share/man/man9/rmlock.9
+++ b/share/man/man9/rmlock.9
@@ -35,6 +35,7 @@
 .Nm rm_init_flags ,
 .Nm rm_destroy ,
 .Nm rm_rlock ,
+.Nm rm_try_rlock ,
 .Nm rm_wlock ,
 .Nm rm_runlock ,
 .Nm rm_wunlock ,
@@ -53,6 +54,8 @@
 .Fn rm_destroy "struct rmlock *rm"
 .Ft void
 .Fn rm_rlock "struct rmlock *rm" "struct rm_priotracker* tracker"
+.Ft int
+.Fn rm_try_rlock "struct rmlock *rm" "struct rm_priotracker* tracker"
 .Ft void
 .Fn rm_wlock "struct rmlock *rm"
 .Ft void
@@ -84,14 +87,16 @@ Although reader/writer locks look very similar to
 locks, their usage pattern is different.
 Reader/writer locks can be treated as mutexes (see
 .Xr mutex 9 )
-with shared/exclusive semantics.
+with shared/exclusive semantics unless initialized with
+.Dv RM_SLEEPABLE .
 Unlike
 .Xr sx 9 ,
 an
 .Nm
 can be locked while holding a non-spin mutex, and an
 .Nm
-cannot be held while sleeping.
+cannot be held while sleeping, again unless initialized with
+.Dv RM_SLEEPABLE .
 The
 .Nm
 locks have full priority propagation like mutexes.
@@ -135,6 +140,13 @@ to ignore this lock.
 .It Dv RM_RECURSE
 Allow threads to recursively acquire exclusive locks for
 .Fa rm .
+.It Dv RM_SLEEPABLE
+Allow writers to sleep while holding the lock.
+Readers must not sleep while holding the lock and can avoid to sleep on
+taking the lock by using
+.Fn rm_try_rlock
+instead of
+.Fn rm_rlock .
 .El
 .It Fn rm_rlock "struct rmlock *rm" "struct rm_priotracker* tracker"
 Lock
@@ -161,6 +173,13 @@ access on
 .Fa rm .
 This is called
 .Dq "recursing on a lock" .
+.It Fn rm_try_rlock "struct rmlock *rm" "struct rm_priotracker* tracker"
+Try to lock
+.Fa rm
+as a reader.
+.Fn rm_try_rlock
+will return 0 if the lock cannot be acquired immediately;
+otherwise the lock will be acquired and a non-zero value will be returned.
 .It Fn rm_wlock "struct rmlock *rm"
 Lock
 .Fa rm
diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
index a6a622e..0ab5d74 100644
--- a/sys/kern/kern_rmlock.c
+++ b/sys/kern/kern_rmlock.c
@@ -187,6 +187,8 @@ rm_cleanIPI(void *arg)
 	}
 }
 
+CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
+
 void
 rm_init_flags(struct rmlock *rm, const char *name, int opts)
 {
@@ -197,9 +199,13 @@ rm_init_flags(struct rmlock *rm, const char *name, int opts)
 		liflags |= LO_WITNESS;
 	if (opts & RM_RECURSE)
 		liflags |= LO_RECURSABLE;
-	rm->rm_noreadtoken = 1;
+	rm->rm_writecpus = all_cpus;
 	LIST_INIT(&rm->rm_activeReaders);
-	mtx_init(&rm->rm_lock, name, "rmlock_mtx", MTX_NOWITNESS);
+	if (opts & RM_SLEEPABLE) {
+		liflags |= RM_SLEEPABLE;
+		sx_init_flags(&rm->rm_lock_sx, "rmlock_sx", SX_RECURSE);
+	} else
+		mtx_init(&rm->rm_lock_mtx, name, "rmlock_mtx", MTX_NOWITNESS);
 	lock_init(&rm->lock_object, &lock_class_rm, name, NULL, liflags);
 }
 
@@ -214,7 +220,10 @@ void
 rm_destroy(struct rmlock *rm)
 {
 
-	mtx_destroy(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_destroy(&rm->rm_lock_sx);
+	else
+		mtx_destroy(&rm->rm_lock_mtx);
 	lock_destroy(&rm->lock_object);
 }
 
@@ -222,7 +231,10 @@ int
 rm_wowned(struct rmlock *rm)
 {
 
-	return (mtx_owned(&rm->rm_lock));
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		return (sx_xlocked(&rm->rm_lock_sx));
+	else
+		return (mtx_owned(&rm->rm_lock_mtx));
 }
 
 void
@@ -241,8 +253,8 @@ rm_sysinit_flags(void *arg)
 	rm_init_flags(args->ra_rm, args->ra_desc, args->ra_opts);
 }
 
-static void
-_rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
+static int
+_rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
 {
 	struct pcpu *pc;
 	struct rm_queue *queue;
@@ -252,9 +264,9 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 	pc = pcpu_find(curcpu);
 
 	/* Check if we just need to do a proper critical_exit. */
-	if (0 == rm->rm_noreadtoken) {
+	if (!(pc->pc_cpumask & rm->rm_writecpus)) {
 		critical_exit();
-		return;
+		return (1);
 	}
 
 	/* Remove our tracker from the per-cpu list. */
@@ -265,7 +277,7 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 		/* Just add back tracker - we hold the lock. */
 		rm_tracker_add(pc, tracker);
 		critical_exit();
-		return;
+		return (1);
 	}
 
 	/*
@@ -289,7 +301,7 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 				mtx_unlock_spin(&rm_spinlock);
 				rm_tracker_add(pc, tracker);
 				critical_exit();
-				return;
+				return (1);
 			}
 		}
 	}
@@ -297,20 +309,38 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 	sched_unpin();
 	critical_exit();
 
-	mtx_lock(&rm->rm_lock);
-	rm->rm_noreadtoken = 0;
-	critical_enter();
+	if (trylock) {
+		if (rm->lock_object.lo_flags & RM_SLEEPABLE) {
+			if (!sx_try_xlock(&rm->rm_lock_sx))
+				return (0);
+		} else {
+			if (!mtx_trylock(&rm->rm_lock_mtx))
+				return (0);
+		}
+	} else {
+		if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+			sx_xlock(&rm->rm_lock_sx);
+		else
+			mtx_lock(&rm->rm_lock_mtx);
+	}
 
+	critical_enter();
 	pc = pcpu_find(curcpu);
+	rm->rm_writecpus &= ~pc->pc_cpumask;
 	rm_tracker_add(pc, tracker);
 	sched_pin();
 	critical_exit();
 
-	mtx_unlock(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_xunlock(&rm->rm_lock_sx);
+	else
+		mtx_unlock(&rm->rm_lock_mtx);
+
+	return (1);
 }
 
-void
-_rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker)
+int
+_rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
 {
 	struct thread *td = curthread;
 	struct pcpu *pc;
@@ -337,11 +367,11 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker)
 	 * Fast path to combine two common conditions into a single
 	 * conditional jump.
 	 */
-	if (0 == (td->td_owepreempt | rm->rm_noreadtoken))
-		return;
+	if (0 == (td->td_owepreempt | (rm->rm_writecpus & pc->pc_cpumask)))
+		return (1);
 
 	/* We do not have a read token and need to acquire one. */
-	_rm_rlock_hard(rm, tracker);
+	return _rm_rlock_hard(rm, tracker, trylock);
 }
 
 static void
@@ -400,20 +430,26 @@ _rm_wlock(struct rmlock *rm)
 {
 	struct rm_priotracker *prio;
 	struct turnstile *ts;
+	cpumask_t readcpus;
 
-	mtx_lock(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_xlock(&rm->rm_lock_sx);
+	else
+		mtx_lock(&rm->rm_lock_mtx);
 
-	if (rm->rm_noreadtoken == 0) {
+	if (rm->rm_writecpus != all_cpus) {
 		/* Get all read tokens back */
 
-		rm->rm_noreadtoken = 1;
+		readcpus = all_cpus & (all_cpus & ~rm->rm_writecpus);
+		rm->rm_writecpus = all_cpus;
 
 		/*
-		 * Assumes rm->rm_noreadtoken update is visible on other CPUs
+		 * Assumes rm->rm_writecpus update is visible on other CPUs
 		 * before rm_cleanIPI is called.
 		 */
 #ifdef SMP
-		smp_rendezvous(smp_no_rendevous_barrier,
+		smp_rendezvous_cpus(readcpus,
+		    smp_no_rendevous_barrier,
 		    rm_cleanIPI,
 		    smp_no_rendevous_barrier,
 		    rm);
@@ -439,7 +475,10 @@ void
 _rm_wunlock(struct rmlock *rm)
 {
 
-	mtx_unlock(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_xunlock(&rm->rm_lock_sx);
+	else
+		mtx_unlock(&rm->rm_lock_mtx);
 }
 
 #ifdef LOCK_DEBUG
@@ -454,7 +493,11 @@ void _rm_wlock_debug(struct rmlock *rm, const char *file, int line)
 
 	LOCK_LOG_LOCK("RMWLOCK", &rm->lock_object, 0, 0, file, line);
 
-	WITNESS_LOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		WITNESS_LOCK(&rm->rm_lock_sx.lock_object, LOP_EXCLUSIVE,
+		    file, line);
+	else
+		WITNESS_LOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
 
 	curthread->td_locks++;
 
@@ -465,25 +508,35 @@ _rm_wunlock_debug(struct rmlock *rm, const char *file, int line)
 {
 
 	curthread->td_locks--;
-	WITNESS_UNLOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		WITNESS_UNLOCK(&rm->rm_lock_sx.lock_object, LOP_EXCLUSIVE,
+		    file, line);
+	else
+		WITNESS_UNLOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
 	LOCK_LOG_LOCK("RMWUNLOCK", &rm->lock_object, 0, 0, file, line);
 	_rm_wunlock(rm);
 }
 
-void
+int
 _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
-    const char *file, int line)
+    int trylock, const char *file, int line)
 {
-
+	if (!trylock && (rm->lock_object.lo_flags & RM_SLEEPABLE))
+		WITNESS_CHECKORDER(&rm->rm_lock_sx.lock_object, LOP_NEWORDER,
+		    file, line, NULL);
 	WITNESS_CHECKORDER(&rm->lock_object, LOP_NEWORDER, file, line, NULL);
 
-	_rm_rlock(rm, tracker);
+	if (_rm_rlock(rm, tracker, trylock)) {
+		LOCK_LOG_LOCK("RMRLOCK", &rm->lock_object, 0, 0, file, line);
 
-	LOCK_LOG_LOCK("RMRLOCK", &rm->lock_object, 0, 0, file, line);
+		WITNESS_LOCK(&rm->lock_object, 0, file, line);
 
-	WITNESS_LOCK(&rm->lock_object, 0, file, line);
+		curthread->td_locks++;
 
-	curthread->td_locks++;
+		return (1);
+	}
+
+	return (0);
 }
 
 void
@@ -517,12 +570,12 @@ _rm_wunlock_debug(struct rmlock *rm, const char *file, int line)
 	_rm_wunlock(rm);
 }
 
-void
+int
 _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
-    const char *file, int line)
+    int trylock, const char *file, int line)
 {
 
-	_rm_rlock(rm, tracker);
+	return _rm_rlock(rm, tracker, trylock);
 }
 
 void
diff --git a/sys/sys/_rmlock.h b/sys/sys/_rmlock.h
index e5c68d5..75a159c 100644
--- a/sys/sys/_rmlock.h
+++ b/sys/sys/_rmlock.h
@@ -45,11 +45,15 @@ LIST_HEAD(rmpriolist,rm_priotracker);
 
 struct rmlock {
 	struct lock_object lock_object;
-	volatile int rm_noreadtoken;
+	volatile cpumask_t rm_writecpus;
 	LIST_HEAD(,rm_priotracker) rm_activeReaders;
-	struct mtx rm_lock;
-
+	union {
+		struct mtx _rm_lock_mtx;
+		struct sx _rm_lock_sx;
+	} _rm_lock;
 };
+#define rm_lock_mtx _rm_lock._rm_lock_mtx
+#define rm_lock_sx _rm_lock._rm_lock_sx
 
 struct rm_priotracker {
 	struct rm_queue rmp_cpuQueue; /* Must be first */
diff --git a/sys/sys/rmlock.h b/sys/sys/rmlock.h
index 9766f67..ef5776b 100644
--- a/sys/sys/rmlock.h
+++ b/sys/sys/rmlock.h
@@ -33,6 +33,7 @@
 #define _SYS_RMLOCK_H_
 
 #include
+#include
 #include
 #include
 
@@ -43,6 +44,7 @@
  */
 #define RM_NOWITNESS 0x00000001
 #define RM_RECURSE 0x00000002
+#define RM_SLEEPABLE 0x00000004
 
 void rm_init(struct rmlock *rm, const char *name);
 void rm_init_flags(struct rmlock *rm, const char *name, int opts);
@@ -53,14 +55,15 @@ void rm_sysinit_flags(void *arg);
 
 void _rm_wlock_debug(struct rmlock *rm, const char *file, int line);
 void _rm_wunlock_debug(struct rmlock *rm, const char *file, int line);
-void _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
-    const char *file, int line);
+int _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
+    int trylock, const char *file, int line);
 void _rm_runlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
     const char *file, int line);
 
 void _rm_wlock(struct rmlock *rm);
 void _rm_wunlock(struct rmlock *rm);
-void _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker);
+int _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker,
+    int trylock);
 void _rm_runlock(struct rmlock *rm, struct rm_priotracker *tracker);
 
 /*
@@ -74,14 +77,17 @@ void _rm_runlock(struct rmlock *rm, struct rm_priotracker *tracker);
 #define rm_wlock(rm) _rm_wlock_debug((rm), LOCK_FILE, LOCK_LINE)
 #define rm_wunlock(rm) _rm_wunlock_debug((rm), LOCK_FILE, LOCK_LINE)
 #define rm_rlock(rm,tracker) \
-    _rm_rlock_debug((rm),(tracker), LOCK_FILE, LOCK_LINE )
+    ((void)_rm_rlock_debug((rm),(tracker), 0, LOCK_FILE, LOCK_LINE ))
+#define rm_try_rlock(rm,tracker) \
+    _rm_rlock_debug((rm),(tracker), 1, LOCK_FILE, LOCK_LINE )
 #define rm_runlock(rm,tracker) \
     _rm_runlock_debug((rm), (tracker), LOCK_FILE, LOCK_LINE )
 #else
-#define rm_wlock(rm) _rm_wlock((rm))
-#define rm_wunlock(rm) _rm_wunlock((rm))
-#define rm_rlock(rm,tracker) _rm_rlock((rm),(tracker))
-#define rm_runlock(rm,tracker) _rm_runlock((rm), (tracker))
+#define rm_wlock(rm) _rm_wlock((rm))
+#define rm_wunlock(rm) _rm_wunlock((rm))
+#define rm_rlock(rm,tracker) ((void)_rm_rlock((rm),(tracker), 0))
+#define rm_try_rlock(rm,tracker) _rm_rlock((rm),(tracker), 1)
+#define rm_runlock(rm,tracker) _rm_runlock((rm), (tracker))
 #endif
 
 struct rm_args {
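
For anyone who wants to follow the rm_writecpus bookkeeping without reading the whole diff, here is a minimal userland sketch of it. It is not kernel code and not part of the patch. Every name in it (struct rm_model, model_cpumask_t, the rm_model_* functions) is made up for illustration. It assumes a single-threaded model with at most 31 CPUs, replaces the backing mutex/sx with a plain flag, and leaves out the active-reader list, priority tracking and the IPI itself. The writer simply returns the set of CPUs that the patch would pass to smp_rendezvous_cpus().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's cpumask_t: one bit per CPU. */
typedef uint32_t model_cpumask_t;

struct rm_model {
	model_cpumask_t all_cpus;	/* every CPU in the model */
	model_cpumask_t writecpus;	/* bit set: that CPU holds no read token */
	bool write_locked;		/* models rm_lock_mtx/rm_lock_sx being held */
};

static void
rm_model_init(struct rm_model *rm, unsigned int ncpus)
{
	assert(ncpus >= 1 && ncpus <= 31);
	rm->all_cpus = (UINT32_C(1) << ncpus) - 1;
	/* As in rm_init_flags(): no CPU starts out with a read token. */
	rm->writecpus = rm->all_cpus;
	rm->write_locked = false;
}

/* Models rm_try_rlock() issued on CPU `cpu`; returns non-zero on success. */
static int
rm_model_try_rlock(struct rm_model *rm, unsigned int cpu)
{
	model_cpumask_t self = UINT32_C(1) << cpu;

	/* Fast path: this CPU already holds a read token. */
	if ((rm->writecpus & self) == 0)
		return (1);
	/* Slow path: the backing lock is busy, so a try must fail. */
	if (rm->write_locked)
		return (0);
	/* Backing lock was free: claim a token for this CPU. */
	rm->writecpus &= ~self;
	return (1);
}

/* Models _rm_wlock(); returns the CPUs whose tokens had to be revoked. */
static model_cpumask_t
rm_model_wlock(struct rm_model *rm)
{
	model_cpumask_t readcpus;

	assert(!rm->write_locked);	/* the model cannot block */
	rm->write_locked = true;
	readcpus = rm->all_cpus & ~rm->writecpus;
	rm->writecpus = rm->all_cpus;
	return (readcpus);
}

static void
rm_model_wunlock(struct rm_model *rm)
{
	rm->write_locked = false;
}
```

In this model a try-read that follows a completed write episode succeeds, because it only needs the backing lock to be free, and two write locks in a row revoke nothing the second time, since the returned mask is empty.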