From owner-freebsd-arch@FreeBSD.ORG  Sun Aug 22 19:05:50 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2518010656A5;
	Sun, 22 Aug 2010 19:05:50 +0000 (UTC)
	(envelope-from max@love2party.net)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
	[212.227.126.187])
	by mx1.freebsd.org (Postfix) with ESMTP id C12AC8FC1F;
	Sun, 22 Aug 2010 19:05:49 +0000 (UTC)
Received: from f8x64.laiers.local (dslb-088-066-049-083.pools.arcor-ip.net
	[88.66.49.83])
	by mrelayeu.kundenserver.de (node=mreu2) with ESMTP (Nemesis)
	id 0Mao5W-1OYLdm02ws-00JuQt; Sun, 22 Aug 2010 21:05:48 +0200
From: Max Laier <max@love2party.net>
Organization: FreeBSD
To: Stephan Uphoff <ups@freebsd.org>
Date: Sun, 22 Aug 2010 21:05:47 +0200
User-Agent: KMail/1.13.5 (FreeBSD/8.1-RELEASE; KDE/4.4.5; amd64; ; )
References: <201008160515.21412.max@love2party.net>
	<4C7042BA.8000402@freebsd.org>
In-Reply-To: <4C7042BA.8000402@freebsd.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201008222105.47276.max@love2party.net>
X-Provags-ID: V02:K0:YNor0iNpkdfjelCFoSMWyZtpXJhZ7ohG2c5/aghqCgx
	Y0kFjQW7ZEyXAxH3BAGR4CG/+A+ExZsSiYd7FNRUFc2gSxOhnj
	9Vte5ZrKUUlT5A+foYJ0cZDmo3AbZpHypBmITXvyePacesCCag
	vJMxDHLlrIfSRo1pPmWYt4Q0KgcKqgeDxRzSKKX/G+r7VQDzFj
	n9xX7r3jji5syyUh2awzQ==
Cc: freebsd-arch@freebsd.org
Subject: Re: rmlock(9) two additions
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Aug 2010 19:05:50 -0000

On Saturday 21 August 2010 23:18:50 Stephan Uphoff wrote:
> Max Laier wrote:
> > Hi,
> > 
> > I'd like to run two additions to rmlock(9) by you:
> > 
> > 1) See the attached patch.  It adds rm_try_rlock() which avoids taking
> > the mutex if the lock is currently in a write episode.  The only
> > overhead to the hot path is an additional argument and a return value
> > for _rm_rlock*.  If you are worried about that, it can obviously be done
> > in a separate code path, but I reckon it not worth the code crunch. 
> > Finally, there is one additional branch to check the "trylock" argument,
> > but that's well past the hot path.
> 
> The rm_try_rlock() will never succeed when the lock was last locked as
> writer.
> Maybe add:
> 
> void
> _rm_wunlock(struct rmlock *rm)
> {
> +   rm->rm_noreadtoken = 0;
>     mtx_unlock(&rm->rm_lock);
> }
> 
> But then
> 
> _rm_wlock(struct rmlock *rm)
> 
> always needs to use IPIs - even when the lock was used last as a write
> lock.

I don't think this is a big problem - I can't see many use cases for rmlocks 
where you'd routinely see repeated wlocks without rlocks between them.  
However, I think there should be a release memory barrier before/while 
clearing rm_noreadtoken, otherwise readers may not see the data writes that 
are supposed to be protected by the lock?!?

> Alternatively something like:
> 
>     if (trylock) {
>          if(mtx_trylock( &rm->rm_lock) == 0)
>               return (0);
>    }
>     else
>   {
>     mtx_lock(&rm->rm_lock);
>   }
> 
>   would work - but has a race. Two readers colliding just after a writer
> (with the second not succeeding in trylocking the mutex) leads to not
> granting the read lock (also it would be possible to do so).

Also not too much of a problem, in my opinion.  There is no time order between 
read locks and thus it is okay to grant one and fail another - even though they 
arrive "at the same time".  A caller to trylock must always accept failure 
(unless it's a recursive use - and this is handled).
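
A minimal user-space sketch of that caller contract, assuming a toy exclusive
try-lock built on a C11 atomic_flag in place of the proposed rm_try_rlock().
All toy_* names are hypothetical and none of this is rmlock code.  It only
shows a caller that treats a failed trylock as "drop the event" and never
blocks or retries:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical event log guarded by a toy try-lock (not an rmlock). */
struct toy_log {
	atomic_flag busy;	/* set while some thread holds the toy lock */
	unsigned recorded;	/* protected by the toy lock */
	atomic_uint dropped;	/* updated without the lock, hence atomic */
};

void
toy_log_init(struct toy_log *log)
{
	assert(log != NULL);
	atomic_flag_clear(&log->busy);
	log->recorded = 0;
	atomic_init(&log->dropped, 0);
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
int
toy_trylock(struct toy_log *log)
{
	return (!atomic_flag_test_and_set_explicit(&log->busy,
	    memory_order_acquire));
}

void
toy_unlock(struct toy_log *log)
{
	atomic_flag_clear_explicit(&log->busy, memory_order_release);
}

/* Record one event, or count it as dropped if the lock is unavailable. */
int
toy_record_event(struct toy_log *log)
{
	if (!toy_trylock(log)) {
		atomic_fetch_add_explicit(&log->dropped, 1,
		    memory_order_relaxed);
		return (0);
	}
	log->recorded++;
	toy_unlock(log);
	return (1);
}
```

Which of two colliding callers ends up in the "dropped" branch is arbitrary,
which is the point made above: a trylock gives no ordering promise.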

> Let me think about it a bit.

I believe either solution will work.  #1 is a bit more in the spirit of the 
rmlock - i.e. make the read case cheap and the write case expensive.  I'm just 
not sure about the lock semantics.

I guess a

  atomic_store_rel_int(&rm->rm_noreadtoken, 0);

should work.
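
To make the ordering concern concrete, here is a user-space sketch in C11
atomics, assuming a toy structure rather than the real struct rmlock (the
toy_* names and the protected_data field are hypothetical).  The release
store plays the role of atomic_store_rel_int(); the acquire load on the
reader side is an assumption of this sketch and says nothing about what the
real rm_rlock() fast path does:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for struct rmlock; not the FreeBSD definition. */
struct toy_rmlock {
	atomic_int noreadtoken;	/* nonzero: readers must not use the fast path */
	int protected_data;	/* stands in for the data the lock protects */
};

/* Start out as if a writer currently holds the lock. */
void
toy_init(struct toy_rmlock *rm)
{
	assert(rm != NULL);
	atomic_init(&rm->noreadtoken, 1);
	rm->protected_data = 0;
}

/*
 * End of a write episode: publish the data, then re-enable the fast path.
 * The release store keeps the data write from being reordered after it.
 */
void
toy_wunlock(struct toy_rmlock *rm, int new_value)
{
	rm->protected_data = new_value;
	atomic_store_explicit(&rm->noreadtoken, 0, memory_order_release);
}

/*
 * Reader fast path: succeed only if no write episode is in progress.
 * The acquire load pairs with the release store in toy_wunlock().
 */
int
toy_try_rlock(struct toy_rmlock *rm, int *seen)
{
	if (atomic_load_explicit(&rm->noreadtoken, memory_order_acquire) != 0)
		return (0);
	*seen = rm->protected_data;
	return (1);
}
```

A reader that observes noreadtoken == 0 through the acquire load is then
guaranteed to observe the writer's update as well; with a plain store and a
plain load there would be no such guarantee on weakly ordered CPUs.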

> > 2) No code for this yet - testing the waters first.  I'd like to add the
> > ability to replace the mutex for writer synchronization with a general
> > lock - especially to be able to use a sx(9) lock here.
> > 
> > The reason for #2 is the following use case in a debugging facility:
> > 	"reader":
> > 		if (rm_try_rlock()) {
> > 		
> > 			grab_per_cpu_buffer();
> > 			fill_per_cpu_buffer();
> > 			rm_runlock();
> > 		
> > 		}
> > 	
> > 	"writer" - better exclusive access thread:
> > 		rm_wlock();
> > 		collect_buffers_and_copy_out_to_userspace();
> > 		rm_wunlock();
> > 
> > This is much cleaner and possibly cheaper than the various hand rolled
> > versions I've come across, that try to get the same synchronization with
> > atomic operations.  If we could sleep with the wlock held, we can also
> > avoid copying the buffer contents, or swapping buffers.
> > 
> > Is there any concern about either of this?  Any objection?  Input?
> 
> Will take a look at your second patch soonish.
> 
> Just ask per IPI for a copy of per cpu buffers (but not a copy to user
> space) - and delay the copy when an update is in progress?

Think of huge circular per-CPU buffers that are filled at high rates.  Of course 
we could allocate new buffers and swap out while locked, but since this is a 
debugging facility it is better to miss a few events while copying out, rather 
than spending twice the memory.

Thanks,
  Max

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 21:41:05 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 23B291065698
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 21:41:05 +0000 (UTC)
	(envelope-from marcelm@juniper.net)
Received: from exprod7og105.obsmtp.com (exprod7og105.obsmtp.com [64.18.2.163])
	by mx1.freebsd.org (Postfix) with ESMTP id E07308FC0A
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 21:41:04 +0000 (UTC)
Received: from source ([66.129.224.36]) (using TLSv1) by
	exprod7ob105.postini.com ([64.18.6.12]) with SMTP
	ID DSNKTHLq8FdHsG2SuPSF3wt3v4S+sOIHUb4I@postini.com;
	Mon, 23 Aug 2010 14:41:04 PDT
Received: from EMBX01-HQ.jnpr.net ([fe80::c821:7c81:f21f:8bc7]) by
	P-EMHUB02-HQ.jnpr.net ([fe80::88f9:77fd:dfc:4d51%11]) with mapi;
	Mon, 23 Aug 2010 14:30:18 -0700
From: Marcel Moolenaar <marcelm@juniper.net>
To: FreeBSD Arch <freebsd-arch@freebsd.org>
Date: Mon, 23 Aug 2010 14:30:16 -0700
Thread-Topic: enhancing the root mount logic
Thread-Index: ActDCmGMyxbU3bCuTUCbYapKaXZqsQ==
Message-ID: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
acceptlanguage: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 21:41:05 -0000

All,

In embedded products, software is possibly installed as an image onto
an actual storage device. This means that mounting the storage device
as root is not enough to have a usable root file system. The rough
draft below is an idea to enhance the root mount from having ad-hoc
quirks to a well-defined and recursive mechanism that allows a wide
range of use cases.

The root mount logic is recursive as follows:
1.  The kernel mounts devfs as root (as it is now).
2.  The kernel will re-mount root by virtue of reading a file, called
    /.mount.conf, in the current root file system and following the
    directives in it. devfs synthesizes the contents of this file.

At each iteration, the kernel will:
1.  move the devfs mount from /dev in the old file system to /dev in
    the new file system.
2.  As per the directives or unconditionally, the kernel will re-mount
    the old root file system under /.mount (or some other name) within
    the new file system.

devfs will synthesize the contents of /.mount.conf as per the kernel
configuration and tunables. The administrator (or install process)
will create and populate /.mount.conf for all other cases.

Directives in /.mount.conf are envisioned to be something like:

   {FS}:{MOUNTPOINT}	e.g.	ufs:/dev/da0
	a root mount alternative. The order of the alternatives in
	the file determines the priority.

   .ask
	a root mount alternative that asks the operator to specify
	what the root mount should be.

   .wait N			e.g.	.wait 5
	wait at most N seconds for a root mount alternative to
	succeed. If an alternative does not succeed within that
	time, move on to the next alternative.

   .onfail	{panic|reboot|retry|continue}
	Tells the kernel what to do in case it can't successfully
	complete the root mount as directed to.
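
A rough sketch of how one line of such a file could be classified, assuming
the directive names listed above and nothing more.  The toy_* names are
hypothetical, the line is assumed to carry no trailing newline or blanks,
and this is user-space C, not proposed kernel code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum toy_kind { TOY_MOUNT, TOY_ASK, TOY_WAIT, TOY_ONFAIL, TOY_INVALID };

struct toy_directive {
	enum toy_kind kind;
	long wait_seconds;	/* only meaningful for TOY_WAIT */
	const char *arg;	/* mount spec or .onfail action; points into line */
};

static const char *
skip_blanks(const char *s)
{
	while (*s == ' ' || *s == '\t')
		s++;
	return (s);
}

/* Classify a single /.mount.conf line; returns the kind stored in *out. */
enum toy_kind
toy_parse_directive(const char *line, struct toy_directive *out)
{
	const char *p, *colon;
	char *end;
	long n;

	assert(line != NULL && out != NULL);
	line = skip_blanks(line);
	out->kind = TOY_INVALID;
	out->wait_seconds = 0;
	out->arg = NULL;

	if (strcmp(line, ".ask") == 0) {
		out->kind = TOY_ASK;
	} else if (strncmp(line, ".wait", 5) == 0 &&
	    (line[5] == ' ' || line[5] == '\t')) {
		p = skip_blanks(line + 5);
		n = strtol(p, &end, 10);
		if (end != p && *end == '\0' && n >= 0) {
			out->kind = TOY_WAIT;
			out->wait_seconds = n;
		}
	} else if (strncmp(line, ".onfail", 7) == 0 &&
	    (line[7] == ' ' || line[7] == '\t')) {
		p = skip_blanks(line + 7);
		if (strcmp(p, "panic") == 0 || strcmp(p, "reboot") == 0 ||
		    strcmp(p, "retry") == 0 || strcmp(p, "continue") == 0) {
			out->kind = TOY_ONFAIL;
			out->arg = p;
		}
	} else if (line[0] != '.') {
		/* A mount alternative needs text on both sides of ':'. */
		colon = strchr(line, ':');
		if (colon != NULL && colon != line && colon[1] != '\0') {
			out->kind = TOY_MOUNT;
			out->arg = line;
		}
	}
	return (out->kind);
}
```

Unknown dot-directives and malformed arguments come back as TOY_INVALID; what
the kernel should do with those is left open here.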

The .wait directive works better (probably) if we have events that
signify the arrival of a file system or device special file, so that
we can wait for at most N seconds after the last event. This also
allows us to wait for a separate interval between events.

As an example, consider:

   [devfs]	/.mount.conf:
	ufs:/dev/da0
	.ask
	.wait 5
	.onfail panic

   [ufs:/dev/da0]	/.mount.conf
	md0:/images/OS-image-1.0.iso
	unionfs:/jail/freebsd-8-stable
	.wait 0
	.onfail continue

In the example, the kernel will mount devfs, read /.mount.conf and
wait at most 5 seconds to mount the UFS on /dev/da0. If that fails,
the kernel will ask (once) and panic in case of failure.

If the UFS root mount succeeded, the kernel will re-mount devfs
underneath /dev. Since this is the first non-devfs root file system,
the kernel will not re-mount the old root under /.mount.

Since there's a /.mount.conf on the UFS, the kernel will read it
and repeat the process. First it'll try and mount the OS image
in /images/OS-image-1.0.iso and if it's not present will try to
mount some -stable 8 chroot using unionfs (not necessarily a
real-world example here :-). If both fail, the kernel will
continue booting using the current root file system. Assuming that
the image is present, the kernel will re-mount root, move devfs
underneath /dev in the MD root and remount ufs:/dev/da0 under
/.mount in the MD root. This gives the following picture:

/		md0:[ufs:/dev/da0]/images/OS-image-1.0.iso
/.mount		ufs:/dev/da0
/dev		devfs
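
The bookkeeping of the iteration can be modelled in a few lines of user-space
C.  This is only a sketch under assumptions: the toy_* names are hypothetical,
each file system is reduced to "the root its /.mount.conf ends up selecting",
and the depth cap is not part of the draft above; it is just one conceivable
guard against a configuration that keeps re-mounting forever:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DEPTH 8		/* hypothetical cap on re-mount iterations */

struct toy_fs {
	const char *spec;		/* e.g. "ufs:/dev/da0" */
	const struct toy_fs *next;	/* root chosen by its /.mount.conf, or NULL */
};

struct toy_view {
	const char *root;	/* what ends up mounted on / */
	const char *old_root;	/* what ends up on /.mount, NULL if nothing */
	const char *dev;	/* what ends up on /dev */
	int depth;		/* number of re-mounts performed */
};

/* Follow the chain of re-mounts starting from devfs; -1 if the cap is hit. */
int
toy_root_mount(const struct toy_fs *devfs, struct toy_view *view)
{
	const struct toy_fs *cur = devfs;

	assert(devfs != NULL && view != NULL);
	view->root = devfs->spec;
	view->old_root = NULL;
	view->dev = devfs->spec;	/* devfs is moved along to /dev each time */
	view->depth = 0;

	while (cur->next != NULL) {
		if (view->depth == TOY_MAX_DEPTH)
			return (-1);
		/* The initial devfs root is not re-mounted under /.mount. */
		view->old_root = (cur == devfs) ? NULL : cur->spec;
		cur = cur->next;
		view->root = cur->spec;
		view->depth++;
	}
	return (0);
}
```

Fed the chain devfs -> ufs:/dev/da0 -> md0 image, the model ends with the
image on /, ufs:/dev/da0 on /.mount and devfs on /dev, matching the picture
above.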


Things not explicitly touched upon:
o   root mount options
o   directives to instruct the kernel what to run as the initial
    process to eliminate the rather ad-hoc hardcoding. E.g:
	.init /sbin/init
	.init /sbin/init.old

Is this something that people feel is worth fleshing out and
prototyping?

--=20
Marcel Moolenaar
marcelm@juniper.net


From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 21:49:47 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 10C1F1065672
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 21:49:47 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: from mx0.hoeg.nl (unknown [IPv6:2a01:4f8:101:5343::aa])
	by mx1.freebsd.org (Postfix) with ESMTP id A253D8FC20
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 21:49:46 +0000 (UTC)
Received: by mx0.hoeg.nl (Postfix, from userid 1000)
	id 1370B2A28CB9; Mon, 23 Aug 2010 23:49:46 +0200 (CEST)
Date: Mon, 23 Aug 2010 23:49:46 +0200
From: Ed Schouten <ed@80386.nl>
To: Marcel Moolenaar <marcelm@juniper.net>
Message-ID: <20100823214946.GF64651@hoeg.nl>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="7mxbaLlpDEyR1+x6"
Content-Disposition: inline
In-Reply-To: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: FreeBSD Arch <freebsd-arch@freebsd.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 21:49:47 -0000


--7mxbaLlpDEyR1+x6
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

* Marcel Moolenaar <marcelm@juniper.net> wrote:
> Is this something that people feel is worth fleshing out and
> prototyping?

Sounds awesome! This would make my writable boot cd a lot more elegant
than it is right now. Have you thought about things like possible
endless loops? Say, you mount a unionfs on the root of the fs itself.
This may cause the original .mount.conf to be reinterpreted, right?

--=20
 Ed Schouten <ed@80386.nl>
 WWW: http://80386.nl/

--7mxbaLlpDEyR1+x6
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (FreeBSD)

iEYEARECAAYFAkxy7PoACgkQ52SDGA2eCwUONQCfVMdpcEvj7mh9+nfl+S89VfLp
d7gAn3Gxyi5GCWA+EKUQId69vr6tNcYZ
=PUNq
-----END PGP SIGNATURE-----

--7mxbaLlpDEyR1+x6--

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 22:07:32 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 50C8F10656A5
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 22:07:32 +0000 (UTC)
	(envelope-from mdf356@gmail.com)
Received: from mail-bw0-f54.google.com (mail-bw0-f54.google.com
	[209.85.214.54])
	by mx1.freebsd.org (Postfix) with ESMTP id D1CD58FC17
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 22:07:31 +0000 (UTC)
Received: by bwz20 with SMTP id 20so575813bwz.13
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 15:07:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:sender:received
	:in-reply-to:references:date:x-google-sender-auth:message-id:subject
	:from:to:cc:content-type:content-transfer-encoding;
	bh=th6aRvS8GyJMiz5FWpFntdCAjLvzpHTJZQPm5u5gLXg=;
	b=mxqh2521jJh71mvxOd92oZYDj5v0w3Nq4JXtsL9PvBwWDEq++qwH7mhMSKUqNsKRa2
	OzaVEXxwy+XKhwT+s4DqgwlzSO8Tl1FZ0Ja2hokelNMJaytGGye0t45RrM2XBL503+dt
	Iw6t7RcVzK+xAw7u+WKzSKvrTUceQToGkhM5I=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	b=TO2WJlIpO2OmoaXkF2L56QY233gVM9LapmJZY+LRiELQCqb9SuBDaLiBJNwi60BPo1
	0DWTbG1t9olp7lHL1m9FPqHg44h7T2MS9w+OMmZ4dp/zB8ELYrnrxQCVMcuIphInMAWH
	SvAuonLGpfSZTPVwPoteGVQ5QfR5AEmeEhqoQ=
MIME-Version: 1.0
Received: by 10.213.114.67 with SMTP id d3mr4667777ebq.73.1282601250485; Mon,
	23 Aug 2010 15:07:30 -0700 (PDT)
Sender: mdf356@gmail.com
Received: by 10.213.20.144 with HTTP; Mon, 23 Aug 2010 15:07:30 -0700 (PDT)
In-Reply-To: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
Date: Mon, 23 Aug 2010 15:07:30 -0700
X-Google-Sender-Auth: HZjnI6JaPW68ua7gcToaSnODCSs
Message-ID: <AANLkTikLd0-sLN=oxobC03yKfNcZ7mHguQNtooojOE=B@mail.gmail.com>
From: mdf@FreeBSD.org
To: Marcel Moolenaar <marcelm@juniper.net>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: FreeBSD Arch <freebsd-arch@freebsd.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 22:07:32 -0000

On Mon, Aug 23, 2010 at 2:30 PM, Marcel Moolenaar <marcelm@juniper.net> wrote:
> In embedded products, software is possibly installed as an image onto
> an actual storage device. This means that mounting the storage device
> as root is not enough to have a usable root file system. The rough
> draft below is an idea to enhance the root mount from having ad-hoc
> quirks to a well-defined and recursive mechanism to allow a wide-
> range of use cases.

I am not making any claims to the overall desirability of this, but as
a suggestion for the file format:

>   [devfs]      /.mount.conf:
>        ufs:/dev/da0
>        .ask
>        .wait 5
>        .onfail panic

To me, this should wait 0 seconds (or whatever the default is) until
after the .ask mount point has been tried.

I'd suggest something like:

   [devfs]      /.mount.conf:
        .wait 10
        ufs:/dev/da0
        # wait up to 10 seconds for ufs
        .wait 5
        .ask
        # wait up to 5 seconds for the prompt-returned filesystem
        .onfail panic

The two reasons for such a usage:

1) simplifies parsing, since the file only needs to be read to a mount
directive, not read in its entirety.
2) allows different timeouts for each root mount location
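
A small sketch of that reading of the file, assuming each mount alternative
simply takes the timeout of the closest preceding .wait line.  The toy_*
names and the default_wait parameter are hypothetical, and error handling is
left out:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_alternative {
	const char *spec;	/* "ufs:/dev/da0", or ".ask" */
	long timeout;		/* seconds, from the closest preceding .wait */
};

/*
 * Walk the lines once, top to bottom, and attach the current timeout to
 * every mount alternative.  Returns the number of alternatives stored.
 */
size_t
toy_collect_alternatives(const char **lines, size_t nlines,
    long default_wait, struct toy_alternative *out, size_t max_out)
{
	long current_wait = default_wait;
	size_t i, n = 0;

	assert(lines != NULL && out != NULL);
	for (i = 0; i < nlines && n < max_out; i++) {
		const char *s = lines[i];

		while (*s == ' ' || *s == '\t')
			s++;
		if (*s == '\0' || *s == '#')
			continue;	/* blank line or comment */
		if (strncmp(s, ".wait ", 6) == 0) {
			current_wait = strtol(s + 6, NULL, 10);
			continue;
		}
		if (strncmp(s, ".onfail", 7) == 0)
			continue;	/* not a mount alternative */
		out[n].spec = s;	/* fs:device line or .ask */
		out[n].timeout = current_wait;
		n++;
	}
	return (n);
}
```

Because a timeout is known as soon as its alternative is reached, a real
parser could stop reading at the first alternative that mounts successfully,
which is reason 1) above.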

I could also imagine, instead, a .mount directive, so that all
"commands" start with a '.' which would be like:

.mount ufs:/dev/da0 8

which would be an 8 second timeout on the specified mount point.

Anyways, as a flexible mechanism this sounds reasonable.  I have no
idea how it compares to what other operating systems do, which is only
relevant insofar as making migration from another platform easier.

Thanks,
matthew

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 22:57:53 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B6DF01065693;
	Mon, 23 Aug 2010 22:57:53 +0000 (UTC) (envelope-from max@laiers.net)
Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.17.8])
	by mx1.freebsd.org (Postfix) with ESMTP id 4E6668FC1D;
	Mon, 23 Aug 2010 22:57:53 +0000 (UTC)
Received: from f8x64.laiers.local (dslb-088-066-049-083.pools.arcor-ip.net
	[88.66.49.83])
	by mrelayeu.kundenserver.de (node=mrbap1) with ESMTP (Nemesis)
	id 0Mgpk0-1ORaVI3EsS-00MQXi; Tue, 24 Aug 2010 00:45:17 +0200
From: Max Laier <max@laiers.net>
Organization: FreeBSD
To: freebsd-arch@freebsd.org
Date: Tue, 24 Aug 2010 00:45:15 +0200
User-Agent: KMail/1.13.5 (FreeBSD/8.1-RELEASE; KDE/4.4.5; amd64; ; )
References: <201008160515.21412.max@love2party.net>
	<4C7042BA.8000402@freebsd.org>
	<201008222105.47276.max@love2party.net>
In-Reply-To: <201008222105.47276.max@love2party.net>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201008240045.15998.max@laiers.net>
X-Provags-ID: V02:K0:Zg08zW314ugbuSAVqnWD3R1zBlQ41F1tCbKuxtn9A5k
	yU2CHRFFnha/vLZVaJJe5rrwS8t1X6O+Fkj6/RxECEB0CR65kY
	EDuasCN2KSlOP8mxJR4mMUuooaZFrzZI7T0+tBGj0Ae8qxAAXx
	6CbBDPmQUmOAU/wluVIgC0D3ZUgvulLpD/O2tKtokOF8vDU/32
	ub4MKdTcOirYEnQgsZ9DQ==
Cc: Stephan Uphoff <ups@freebsd.org>
Subject: Re: rmlock(9) two additions
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 22:57:53 -0000

On Sunday 22 August 2010 21:05:47 Max Laier wrote:
> On Saturday 21 August 2010 23:18:50 Stephan Uphoff wrote:
> > Max Laier wrote:
> > ...
> > The rm_try_rlock() will never succeed when the lock was last locked as
> > writer.
> > Maybe add:
> > 
> > void
> > _rm_wunlock(struct rmlock *rm)
> > {
> > +   rm->rm_noreadtoken = 0;
> > 
> >     mtx_unlock(&rm->rm_lock);
> > 
> > }
> > 
> > But then
> > 
> > _rm_wlock(struct rmlock *rm)
> > 
> > always needs to use IPIs - even when the lock was used last as a write
> > lock.
> 
> I don't think this is a big problem - I can't see many use cases for
> rmlocks where you'd routinely see repeated wlocks without rlocks between
> them. However, I think there should be a release memory barrier
> before/while clearing rm_noreadtoken, otherwise readers may not see the
> data writes that are supposed to be protected by the lock?!?
> > ...
> I believe either solution will work.  #1 is a bit more in the spirit of the
> rmlock - i.e. make the read case cheap and the write case expensive.  I'm
> just not sure about the lock semantics.
> 
> I guess a
> 
>   atomic_store_rel_int(&rm->rm_noreadtoken, 0);
> 
> should work.

Thinking about this for a while makes me wonder: Are readers really guaranteed 
to see all the updates of a writer - even in the current version?

Example:

  writer thread:
  rm_wlock();		// lock mtx, IPI, wait for reader drain
  modify_data();
  rm_wunlock();	// unlock mtx (this does an atomic_*_rel)

  reader thread #1:
  // failed to get the lock, spinning/waiting on mtx
  mtx_lock();		// this does an atomic_*_acq -> this CPU sees the new data
  rm->rm_noreadtoken = 0;	// now new readers can enter quickly
  ...

  reader thread #2 (on a different CPU than reader #1):
  // enters rm_rlock() "after" rm_noreadtoken was reset -> no memory barrier
  // does this thread see the modifications?

I realize this is a somewhat pathological case, but would it be possible in 
theory?  Or is the compiler_memory_barrier() actually enough?

Otherwise, I think we need an IPI on rm_wunlock() that does an atomic_*_acq on 
every CPU.

Thoughts?
  Max

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 23:13:16 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0A2CE1065695
	for <freebsd-arch@FreeBSD.org>; Mon, 23 Aug 2010 23:13:16 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id AA9BF8FC0C
	for <freebsd-arch@FreeBSD.org>; Mon, 23 Aug 2010 23:13:15 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7NNBpNW057077;
	Mon, 23 Aug 2010 17:11:51 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Mon, 23 Aug 2010 17:12:01 -0600 (MDT)
Message-Id: <20100823.171201.107001114053031707.imp@bsdimp.com>
To: marcelm@juniper.net
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@FreeBSD.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 23:13:16 -0000

In message: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
            Marcel Moolenaar <marcelm@juniper.net> writes:
: All,
: 
: In embedded products, software is possibly installed as an image onto
: an actual storage device. This means that mounting the storage device
: as root is not enough to have a usable root file system. The rough
: draft below is an idea to enhance the root mount from having ad-hoc
: quirks to a well-defined and recursive mechanism that allows a wide
: range of use cases.
: 
: The root mount logic is recursive as follows:
: 1.  The kernel mounts devfs as root (as it is now).
: 2.  The kernel will re-mount root by virtue of reading a file, called
:     /.mount.conf, in the current root file system and following the
:     directives in it. devfs synthesizes the contents of this file.
: 
: At each iteration, the kernel will:
: 1.  move the devfs mount from /dev in the old file system to /dev in
:     the new file system.
: 2.  As per the directives or unconditionally, the kernel will re-mount
:     the old root file system under /.mount (or some other name) within
:     the new file system.
: 
: devfs will synthesize the contents of /.mount.conf as per the kernel
: configuration and tunables. The administrator (or install process)
: will create and populate /.mount.conf for all other cases.
: 
: Directives in /.mount.conf are envisioned to be something like:
: 
:    {FS}:{MOUNTPOINT}	e.g.	ufs:/dev/da0
: 	a root mount alternative. The order of the alternatives in
: 	the file determines the priority.
: 
:    .ask
: 	a root mount alternative that asks the operator to specify
: 	what the root mount should be.
: 
:    .wait N			e.g.	.wait 5
: 	wait at most N seconds for a root mount alternative to
: 	succeed. If an alternative does not succeed within that
: 	time, move on to the next alternative.
: 
:    .onfail	{panic|reboot|retry|continue}
: 	Tells the kernel what to do in case it can't successfully
: 	complete the root mount as directed to.
: 
: The .wait directive works better (probably) if we have events that
: signify the arrival of a file system or device special file, so that
: we can wait for at most N seconds after the last event. This also
: allows us to wait for a separate interval between events.
: 
: As an example, consider:
: 
:    [devfs]	/.mount.conf:
: 	ufs:/dev/da0
: 	.ask
: 	.wait 5
: 	.onfail panic
: 
:    [ufs:/dev/da0]	/.mount.conf
: 	md0:/images/OS-image-1.0.iso
: 	unionfs:/jail/freebsd-8-stable
: 	.wait 0
: 	.onfail continue
: 
: In the example, the kernel will mount devfs, read /.mount.conf and
: wait at most 5 seconds to mount the UFS on /dev/da0. If that fails,
: the kernel will ask (once) and panic in case of failure.
: 
: If the UFS root mount succeeded, the kernel will re-mount devfs
: underneath /dev. Since this is the first non-devfs root file system,
: the kernel will not re-mount the old root under /.mount.
: 
: Since there's a /.mount.conf on the UFS, the kernel will read it
: and repeat the process. First it'll try and mount the OS image
: in /images/OS-image-1.0.iso and if it's not present will try to
: mount some -stable 8 chroot using unionfs (not necessarily a
: real-world example here :-) If either fails, the kernel will
: continue booting using the current root file system. Assuming that
: the image is present, the kernel will re-mount root, move devfs
: underneath /dev in the MD root and remount ufs:/dev/da0 under
: /.mount in the MD root. This gives the following picture:
: 
: /		md0:[ufs:/dev/da0]/images/OS-image-1.0.iso
: /.mount		ufs:/dev/da0
: /dev		devfs
: 
: 
: Things not explicitly touched upon:
: o   root mount options
: o   directives to instruct the kernel what to run as the initial
:     process to eliminate the rather ad-hoc hardcoding. E.g:
: 	.init /sbin/init
: 	.init /sbin/init.old
: 
: Is this something that people feel is worth fleshing out and
: prototyping?

This sounds very interesting.  If kept simple, I could see how this
would make my life a lot easier.

However, all this scripting sounds a bit like a very simple shell in
the kernel.  What advantages are there to this approach vs having the
ability to run a simple shell script or executable and "pivot" the root
to a new location?  And how do you emulate the mount_foo programs for
foo filesystems?  Some of them do weird things that might not
translate well into the kernel...

As you can see, I'm torn about how I feel about the idea.  For simple
cases, I think it is great, but as complexity builds, I become less
sure.  What if that iso image was compressed?  What if I had a
software RAID of disks or flash devices?  What about crypto?  I know I
can handle those cases in /bin/sh, but will each new one require more
code in the kernel?  What would df and/or mount tell you about the
now-hidden file systems?

Warner

From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 23:43:28 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BD3271065672
	for <freebsd-arch@FreeBSD.org>; Mon, 23 Aug 2010 23:43:28 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout024.mac.com (asmtpout024.mac.com [17.148.16.99])
	by mx1.freebsd.org (Postfix) with ESMTP id A2F968FC1D
	for <freebsd-arch@FreeBSD.org>; Mon, 23 Aug 2010 23:43:28 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from sa-nc-cs-125.static.jnpr.net
	(natint3.juniper.net [66.129.224.36])
	by asmtp024.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01
	(built Dec
	16 2008; 32bit)) with ESMTPSA id <0L7M00M0SPW42K70@asmtp024.mac.com> for
	freebsd-arch@FreeBSD.org; Mon, 23 Aug 2010 16:43:17 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008230203
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-23_08:2010-08-24, 2010-08-23,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100823.171201.107001114053031707.imp@bsdimp.com>
Date: Mon, 23 Aug 2010 16:43:15 -0700
Message-id: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
To: "M. Warner Losh" <imp@bsdimp.com>
X-Mailer: Apple Mail (2.1081)
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 23:43:28 -0000


On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote:

*snip*

> However, all this scripting sounds a bit like a very simple shell in
> the kernel.  What advantages are there to this approach vs having the
> ability to run a simple shell script or executable and "pivot" the root
> to a new location?

The 2 reasons for doing this in the kernel are:
1.  resiliency against ABI changes.
2.  allowing /sbin/init to come from the actual root file system.

Both points are impossible to handle efficiently or correctly if
you need user space support in getting to your actual root file
system. You basically have a catch-22 or bootstrap problem, which
a pure in-kernel solution doesn't have.

> And how do you emulate the mount_foo programs for
> foo filesystems?  Some of them do weird things that might not
> translate well into the kernel...

True. I haven't fleshed that out, but I was hoping that nmount(2)
would have normalized most of this so that it's a non-issue, provided
we support mount options in this scheme.

If you have a concrete example of something that's not so trivial,
but critical to support, let me know and I'll take it into account.

> As you can see, I'm torn about how I feel about the idea.  For simple
> cases, I think it is great, but as complexity builds, I become less
> sure.  What if that iso image was compressed?

Can you elaborate how this is potentially a problem in this scheme,
but not for "manual" mounting?


> What if I had a
> software RAID of disks or flash devices?

I see no problem. In fact, the idea is triggered by switching to a
flash file system on a NAND flash.

> What about crypto?

See above. Can you elaborate?

> I know I
> can handle those cases in /bin/sh, but will each new one require more
> code in the kernel?

The way I see it is that the approach enhances how we now mount the
root file system. We have very limited flexibility. I do not claim
that my idea allows every possible variation, and I think it unfair
to expect that of the approach. If one has real complex requirements,
one can always just mount some file system on some storage device
and deal with the root mount in user space. I don't see how this
prevents that.

>  What would df and/or mount tell you about the
> now-hidden file systems?

Can you explain what you mean by now-hidden file systems?

Thanks,

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Mon Aug 23 23:44:24 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2E6181065693
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 23:44:24 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout024.mac.com (asmtpout024.mac.com [17.148.16.99])
	by mx1.freebsd.org (Postfix) with ESMTP id 159E08FC2C
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 23:44:23 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.lan.xcllnt.net (mail.xcllnt.net [70.36.220.4])
	by asmtp024.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7M00JOKN5F7A70@asmtp024.mac.com> for
	freebsd-arch@freebsd.org; Mon, 23 Aug 2010 15:44:05 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008230194
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-23_08:2010-08-24, 2010-08-23,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100823214946.GF64651@hoeg.nl>
Date: Mon, 23 Aug 2010 15:44:03 -0700
Message-id: <7318E60D-F00F-4519-A3E3-9CE8B752AE88@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823214946.GF64651@hoeg.nl>
To: Ed Schouten <ed@80386.nl>
X-Mailer: Apple Mail (2.1081)
Cc: FreeBSD Arch <freebsd-arch@freebsd.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 23 Aug 2010 23:44:24 -0000


On Aug 23, 2010, at 2:49 PM, Ed Schouten wrote:

> * Marcel Moolenaar <marcelm@juniper.net> wrote:
>> Is this something that people feel is worth fleshing out and
>> prototyping?
> 
> Sounds awesome! This would make my writable boot cd a lot more elegant
> than it is right now. Have you thought about things like possible
> endless loops? Say, you mount a unionfs on the root of the fs itself.
> This may cause the original .mount.conf to be reinterpreted, right?

Right. I haven't thought about it. My off the cuff response is that we
should disallow it if the amount of effort required to detect it is
within reason. Alternatively, we could simply impose a global limit on
the depth of the recursion. Either appears reasonable to me, but I may
be overlooking something here...
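
To make the two options concrete, here is a minimal sketch in Python
(not kernel code) that combines loop detection with a global depth
limit. The names resolve_root, next_root and MAX_DEPTH are
hypothetical, and the dictionary merely stands in for whatever each
file system's .mount.conf would request; nothing here reflects an
actual implementation.

```python
MAX_DEPTH = 8  # hypothetical global limit on nested root mounts


def resolve_root(start, next_root, max_depth=MAX_DEPTH):
    """Follow .mount.conf-style redirections beginning at `start`.

    `next_root` maps a file system to the one its configuration asks
    to mount as the new root; a file system without an entry is final.
    Returns the chain of roots visited, and raises ValueError on a
    loop or when the chain would exceed `max_depth` entries.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in next_root:
        current = next_root[current]
        if current in seen:
            # Option 1: refuse a configuration that revisits a root.
            raise ValueError("root mount loop detected at %s" % current)
        if len(chain) >= max_depth:
            # Option 2: bound the recursion regardless of the cause.
            raise ValueError("root mount recursion too deep")
        seen.add(current)
        chain.append(current)
    return chain
```

Tracking visited roots catches the unionfs-on-itself case directly,
while the depth limit is a cheap backstop for cases where identity is
hard to establish.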

Thoughts?

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 00:22:28 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8A80E1065693
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 00:22:28 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 1F6EB8FC1E
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 00:22:27 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7O0EFrq057510;
	Mon, 23 Aug 2010 18:14:15 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Mon, 23 Aug 2010 18:14:24 -0600 (MDT)
Message-Id: <20100823.181424.646155203640260173.imp@bsdimp.com>
To: xcllnt@mac.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@freebsd.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 00:22:28 -0000

In message: <8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
            Marcel Moolenaar <xcllnt@mac.com> writes:
: 
: On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote:
: 
: *snip*
: 
: > However, all this scripting sounds a bit like a very simple shell in
: > the kernel.  What advantages are there to this approach vs having the
: > ability to run a simple shell script or executable and "pivot" the root
: > to a new location?
: 
: The 2 reasons for doing this in the kernel are:
: 1.  resiliency against ABI changes.
: 2.  allowing /sbin/init to come from the actual root file system.
: 
: Both points are impossible to handle efficiently or correctly if
: you need user space support in getting to your actual root file
: system. You basically have a catch-22 or bootstrap problem, which
: a pure in-kernel solution doesn't have.

OK.  That makes sense.  Without execing the new init, which may be a
problem with the current world view of init(8) and the kernel, you'd
have to have your final init on the first level root file system.

: > And how do you emulate the mount_foo programs for
: > foo filesystems?  Some of them do weird things that might not
: > translate well into the kernel...
: 
: True. I haven't fleshed that out, but I was hoping that nmount(2)
: would have normalized most of this so that it's a non-issue, provided
: we support mount options in this scheme.
: 
: If you have a concrete example of something that's not so trivial,
: but critical to support, let me know and I'll take it into account.

mount_smbfs presently makes a connection to the remote system to do
authentication and initializes the smb context before mounting the
file system in the kernel.  I don't know if I'd call this a
critical-to-support feature, but it was the first "exception" to the
rule that jumped into my head, so I was curious if you'd thought about
it.

: > As you can see, I'm torn about how I feel about the idea.  For simple
: > cases, I think it is great, but as complexity builds, I become less
: > sure.  What if that iso image was compressed?
: 
: Can you elaborate how this is potentially a problem in this scheme,
: but not for "manual" mounting?

You'd need a way to stack up different modules, since you'd need
geom_uzip over md0 to make it useful to the cd9660 code.
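
As a rough illustration of what "stacking" means here, the following
Python sketch (a model only, not kernel code) derives the device names
and the order of operations for a layered root. The function names
stack_devices and plan_root_mount are hypothetical, and the returned
strings are human-readable descriptions rather than commands the
kernel would run. The only naming rule assumed is that a transform
exposes "<provider>.<suffix>", as geom_uzip does with md0.uzip.

```python
def stack_devices(base, layers):
    """Return the device name that exists after each layer is applied.

    `base` is the starting provider (for example "md0"); each entry in
    `layers` is the suffix a transform appends to the provider name.
    """
    names = [base]
    for suffix in layers:
        names.append(names[-1] + "." + suffix)
    return names


def plan_root_mount(image, unit, layers, fstype):
    """Describe, step by step, how a stacked root would be assembled."""
    base = "md%d" % unit
    steps = ["attach %s as /dev/%s" % (image, base)]
    devices = stack_devices(base, layers)
    # Each transform sits on top of the device produced by the previous one.
    for lower, upper in zip(devices, devices[1:]):
        steps.append("stack /dev/%s on /dev/%s" % (upper, lower))
    steps.append("mount -t %s /dev/%s /" % (fstype, devices[-1]))
    return steps
```

The point of the model is that the root mount description needs an
ordered list of transforms, not just a file system type and a device.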

: > What if I had a
: > software RAID of disks or flash devices?
: 
: I see no problem. In fact, the idea is triggered by switching to a
: flash file system on a NAND flash.

RAID of Flashes.  Something that would need configuration.  But you
may be correct: this level of flexibility may not be needed and other
concerns may trump it...

: > What about crypto?
: 
: See above. Can you elaborate?

Same thing, but with a crypto key :)

: > I know I
: > can handle those cases in /bin/sh, but will each new one require more
: > code in the kernel?
: 
: The way I see it is that the approach enhances how we now mount the
: root file system. We have very limited flexibility. I do not claim
: that my idea allows every possible variation, and I think it unfair
: to expect that of the approach. If one has real complex requirements,
: one can always just mount some file system on some storage device
: and deal with the root mount in user space. I don't see how this
: prevents that.

init(8) is the show stopper to a pivot root approach, unless you could
tell the init that's on the first level (and simple) to exec /sbin/init
to pick up the new copy, but I don't know how happy that would make the
kernel...

: >  What would df and/or mount tell you about the
: > now-hidden file systems?
: 
: Can you explain what you mean by now-hidden file systems?

OK.  Let's say we have a three level scheme:

/dev/nor0, which has the initial root on it.
Next up is foo.iso.gz, which is mounted read-only on md0.
Next up is geom_uzip, which presents the device as md0.uzip, which
finally gets mounted as root.

So would df show:

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /
/dev/md0.uzip        16000    16000        0     110%  /

or

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /.old_root
/dev/md0.uzip        16000    16000        0     110%  /

and if we had one more layer on nand:

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /
/dev/md0.uzip        16000    16000        0     110%  /
/dev/nand0          320000   300000    20000      82%  /

or

Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
/dev/nor0             4096     4096        0     110%  /.old_root/.old_root
/dev/md0.uzip        16000    16000        0     110%  /.old_root
/dev/nand0          320000   300000    20000      82%  /

is the question I'm asking...
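
The two conventions being compared can be written down as a small
Python sketch. It is only a model of the naming question: the function
mount_points and its parameters are hypothetical, and the /.old_root
name is taken from the listings above rather than from any existing
kernel behaviour.

```python
def mount_points(devices, nested=True, old_root="/.old_root"):
    """Map each root layer to the path df would report for it.

    `devices` is ordered from the first root mounted to the final one.
    With nested=True every earlier root moves one `old_root` level
    deeper each time a new root is mounted; with nested=False every
    layer is reported on "/".
    """
    result = []
    last = len(devices) - 1
    for index, device in enumerate(devices):
        # The final root has depth 0; each older root is one level deeper.
        depth = last - index if nested else 0
        path = old_root * depth or "/"
        result.append((device, path))
    return result
```

For example, mount_points(["/dev/nor0", "/dev/md0.uzip", "/dev/nand0"])
yields the nested layout, and passing nested=False yields the layout
in which every layer claims "/".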

Right now you can mostly do a pivot-root-like thing by having init do
a chroot very early, possibly after executing a simple rc script to
put the second level root system online.  init_script gets run very
early, followed by a chroot to init_chroot followed by a mount of
devfs on /dev if necessary.  However, when you do this, oftentimes
you end up with weird looking df output since / isn't really / to df.

Anyway, the fact that we have a decoupled fork/exec really is what
led me to ask the question.  It is useful to run arbitrary code
between the two, even if you usually run the same code...  sometimes
you want to be different.  I was thinking that this might be the same 
way here.  But, as you rightly point out, maybe there's too much
complexity in doing that and simpler is better.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 00:23:51 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 44FBF10657CA
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 00:23:51 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 28CE68FC0A
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 00:23:50 +0000 (UTC)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id 042A45B3B;
	Mon, 23 Aug 2010 17:23:49 -0700 (PDT)
To: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: Your message of "Mon, 23 Aug 2010 16:43:15 PDT."
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com> 
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
Comments: In-reply-to Marcel Moolenaar <xcllnt@mac.com>
	message dated "Mon, 23 Aug 2010 16:43:15 -0700."
Date: Mon, 23 Aug 2010 17:23:49 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20100824002350.042A45B3B@mail.bitblocks.com>
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 00:23:51 -0000

On Mon, 23 Aug 2010 16:43:15 PDT Marcel Moolenaar <xcllnt@mac.com>  wrote:
> 
> On Aug 23, 2010, at 4:12 PM, M. Warner Losh wrote:
> 
> *snip*
> 
> > However, all this scripting sounds a bit like a very simple shell in
> > the kernel.  What advantages are there to this approach vs having the
> > ability to run a simple shell script or executable and "pivot" the root
> > to a new location?
> 
> The 2 reasons for doing this in the kernel are:
> 1.  resiliency against ABI changes.
> 2.  allowing /sbin/init to come from the actual root file system.
> 
> Both points are impossible to handle efficiently or correctly if
> you need user space support in getting to your actual root file
> system. You basically have a catch-22 or bootstrap problem, which
> a pure in-kernel solution doesn't have.

How about just bundling a small compressed ramfs with the
kernel.  The kernel unpacks it, uses it as the initial rootfs
and runs init from it. A forth/scheme/lua based program
wouldn't add more than a % or so (given that the GENERIC
kernel is over 10MB now!).

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 01:24:37 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A8FBA1065693
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 01:24:37 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout024.mac.com (asmtpout024.mac.com [17.148.16.99])
	by mx1.freebsd.org (Postfix) with ESMTP id 8D48C8FC0A
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 01:24:37 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from sa-nc-cs-125.static.jnpr.net
	(natint3.juniper.net [66.129.224.36])
	by asmtp024.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01
	(built Dec
	16 2008; 32bit)) with ESMTPSA id <0L7M0055ZUK7VJ10@asmtp024.mac.com> for
	freebsd-arch@FreeBSD.org; Mon, 23 Aug 2010 18:24:09 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008230219
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-23_09:2010-08-24, 2010-08-23,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100824002350.042A45B3B@mail.bitblocks.com>
Date: Mon, 23 Aug 2010 18:24:07 -0700
Message-id: <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
To: Bakul Shah <bakul@bitblocks.com>
X-Mailer: Apple Mail (2.1081)
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 01:24:37 -0000


On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote:

>> The 2 reasons for doing this in the kernel are:
>> 1.  resiliency against ABI changes.
>> 2.  allowing /sbin/init to come from the actual root file system.
>> 
>> Both points are impossible to handle efficiently or correctly if
>> you need user space support in getting to your actual root file
>> system. You basically have a catch-22 or bootstrap problem, which
>> a pure in-kernel solution doesn't have.
> 
> How about just bundling a small compressed ramfs with the
> kernel.  The kernel unpacks it, uses it as the initial rootfs
> and runs init from it. A forth/scheme/lua based program
> wouldn't add more than a % or so (given that the GENERIC
> kernel is over 10MB now!).

Not impossible, but it isn't exactly simpler than what I'm looking
for:
1.  The /sbin/init being run is not the one on the actual (final)
    root file system. Getting that one to run requires a special
    init on the ramdisk.
2.  The R/O image needs the underlying file system mounted
    somewhere so that there's persistent storage to write to.
    Setting all of this up in user space is impossible if the
    underlying file system(s) needs to be unmounted/unmountable.
3.  Upgrades and downgrades are tricky to handle when the root
    F/S is the ramdisk, after which some user space environment
    has to find the storage media and then mount it using mount
    options it has no easy way to obtain.

It appears that this solution, while in user space, requires more
code and special handling than a "simple" recursive algorithm for
something the kernel has to do anyway. I may be mistaken though...

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 02:27:26 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B6501106564A
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 02:27:26 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout026.mac.com (asmtpout026.mac.com [17.148.16.101])
	by mx1.freebsd.org (Postfix) with ESMTP id 9B28B8FC08
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 02:27:26 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from sa-nc-cs-125.static.jnpr.net
	(natint3.juniper.net [66.129.224.36])
	by asmtp026.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01
	(built Dec
	16 2008; 32bit)) with ESMTPSA id <0L7M00J8DXH3AD00@asmtp026.mac.com> for
	freebsd-arch@freebsd.org; Mon, 23 Aug 2010 19:27:05 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008230235
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-24_01:2010-08-24, 2010-08-23,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100823.181424.646155203640260173.imp@bsdimp.com>
Date: Mon, 23 Aug 2010 19:27:03 -0700
Message-id: <9EED1D80-7E2E-4C9E-8608-7CFD5B25214B@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100823.181424.646155203640260173.imp@bsdimp.com>
To: "M. Warner Losh" <imp@bsdimp.com>
X-Mailer: Apple Mail (2.1081)
Cc: freebsd-arch@freebsd.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 02:27:26 -0000


On Aug 23, 2010, at 5:14 PM, M. Warner Losh wrote:

> : > And how do you emulate the mount_foo programs for
> : > foo filesystems?  Some of them do weird things that might not
> : > translate well into the kernel...
> : 
> : True. I haven't fleshed that out, but I was hoping that nmount(2)
> : would have normalized most of this so that it's a non-issue, provided
> : we support mount options in this scheme.
> : 
> : If you have a concrete example of something that's not so trivial,
> : but critical to support, let me know and I'll take it into account.
> 
> mount_smbfs presently makes a connection to the remote system to do
> authentication and initializes the smb context before mounting the
> file system in the kernel.  I don't know if I'd call this a
> critical-to-support feature, but it was the first "exception" to the
> rule that jumped into my head, so I was curious if you'd thought about
> it.

smbfs is definitely out of scope :-)

> : > As you can see, I'm torn about how I feel about the idea.  For simple
> : > cases, I think it is great, but as complexity builds, I become less
> : > sure.  What if that iso image was compressed?
> : 
> : Can you elaborate how this is potentially a problem in this scheme,
> : but not for "manual" mounting?
> 
> You'd need a way to stack up different modules, since you'd need
> geom_uzip over md0 to make it useful to the cd9660 code.

This is a perfect example, actually. I'll think about this in the
context of my idea...

> init(8) is the show stopper to a pivot root approach, unless you could
> tell the init that's on the first level (and simple) to exec /sbin/init
> to pick up the new copy, but I don't know how happy that would make the
> kernel...

I think a handshake is doable. If all else fails, you
simply tell the kernel to always re-exec init when
it exits (rather than panicking, which isn't exactly
a product-friendly response to init exiting).


> and if we had one more layer on nand:
> 
> Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
> /dev/nor0             4096     4096        0     110%  /
> /dev/md0.uzip        16000    16000        0     110%  /
> /dev/nand0          320000   300000    20000      82%  /
> 
> or
> 
> Filesystem     1024-blocks     Used    Avail Capacity  Mounted on
> /dev/nor0             4096     4096        0     110%  /.old_root/.old_root
> /dev/md0.uzip        16000    16000        0     110%  /.old_root
> /dev/nand0          320000   300000    20000      82%  /
> 
> is the question I'm asking...

I think it would be:

/dev/nor0	/.old_root
/dev/md0.uzip	/.old_root
/dev/nand0	/


> Anyway, the fact that we have a decoupled fork/exec really is what
> led me to ask the question.  It is useful to run arbitrary code
> between the two, even if you usually run the same code...  sometimes
> you want to be different.  I was thinking that this might be the same 
> way here.  But, as you rightly point out, maybe there's too much
> complexity in doing that and simpler is better.

I'll chew on the geom_uzip example you gave. There's value
in allowing the full power of GEOM when doing a root mount.

Thanks,

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 04:33:45 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E2F0B1065670
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 04:33:45 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id C03378FC0A
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 04:33:45 +0000 (UTC)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id CA4DE5B56;
	Mon, 23 Aug 2010 21:33:44 -0700 (PDT)
To: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: Your message of "Mon, 23 Aug 2010 18:24:07 PDT."
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com> 
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
Comments: In-reply-to Marcel Moolenaar <xcllnt@mac.com>
	message dated "Mon, 23 Aug 2010 18:24:07 -0700."
Date: Mon, 23 Aug 2010 21:33:44 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20100824043344.CA4DE5B56@mail.bitblocks.com>
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 04:33:46 -0000

On Mon, 23 Aug 2010 18:24:07 PDT Marcel Moolenaar <xcllnt@mac.com>  wrote:
> 
> On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote:
> 
> >> The 2 reasons for doing this in the kernel are:
> >> 1.  resiliency against ABI changes.
> >> 2.  allowing /sbin/init to come from the actual root file system.
> >> 
> >> Both points are impossible to handle efficiently or correctly if
> >> you need user space support in getting to your actual root file
> >> system. You basically have a catch-22 or bootstrap problem, which
> >> a pure in-kernel solution doesn't have.
> > 
> > How about just bundling a small compressed ramfs with the
> > kernel.  The kernel unpacks it, uses it as the initial rootfs
> > and runs init from it. A forth/scheme/lua based program
> > wouldn't add more than a % or so (given that the GENERIC
> > kernel is over 10MB now!).

BTW, a friend tells me this is what Linux does (or more
likely, what they used in their server startup). Basically a
ramdisk with init + loadable drivers + tools needed to get
going.  Once the actual root fs device is found (even if
disks got switched around etc.) they switched to the actual
root.

> Not impossible, but it isn't exactly simpler than what I'm looking
> for:
> 1.  The /sbin/init being run is not the one on the actual (final)
>     root file system. Getting that one to run requires a special
>     init on the ramdisk.

Yes. But then you just exec() the real init once you have
"pivoted" to the final root fs. You run with ramfs only as
long as you have to.

> 2.  The R/O image needs the underlying file system mounted
>     somewhere so that there's persistent storage to write to.
>     Setting all of this up in user space is impossible if the
>     underlying file system(s) needs to be unmounted/unmountable.
>
> 3.  Upgrades and downgrades are tricky to handle when the root
>     F/S is the ramdisk, after which some user space environment
>     has to find the storage media and then mount it using mount
>     options it has no easy way to obtain.

Would that still be a problem once you switch to the final root?

> It appears that this solution, while in user space, requires more
> code and special handling than a "simple" recursive algorithm for
> something the kernel has to do anyway. I may be mistaken though...

It may start out "simple"....

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 04:59:58 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3DCF0106566C
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 04:59:58 +0000 (UTC)
	(envelope-from mj@feral.com)
Received: from ns1.feral.com (ns1.feral.com [192.67.166.1])
	by mx1.freebsd.org (Postfix) with ESMTP id 05DDC8FC08
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 04:59:57 +0000 (UTC)
Received: from [192.168.1.2] (m206-63.dsl.tsoft.com [198.144.206.63])
	by ns1.feral.com (8.14.3/8.14.3) with ESMTP id o7O4baYA072705
	for <freebsd-arch@freebsd.org>; Mon, 23 Aug 2010 21:37:37 -0700 (PDT)
	(envelope-from mj@feral.com)
Message-ID: <4C734C92.4010105@feral.com>
Date: Mon, 23 Aug 2010 21:37:38 -0700
From: Matthew Jacob <mj@feral.com>
Organization: Feral Software
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
	rv:1.9.1.11) Gecko/20100711 Thunderbird/3.0.6
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>	<20100823.171201.107001114053031707.imp@bsdimp.com>	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>	<20100824002350.042A45B3B@mail.bitblocks.com>	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
	<20100824043344.CA4DE5B56@mail.bitblocks.com>
In-Reply-To: <20100824043344.CA4DE5B56@mail.bitblocks.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Greylist: Default is to whitelist mail, not delayed by milter-greylist-4.2.6
	(ns1.feral.com [192.67.166.1]);
	Mon, 23 Aug 2010 21:37:37 -0700 (PDT)
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 04:59:58 -0000

Yes, this is the RedHat root pivot goop that's been around for ages.

It turns out to be a massive PITA, because the initrd image can get out 
of sync with the kernel and hardware, and since some of the modules can 
be loaded from there but not from the root filesystem, there is a 
definite possibility (which has happened more times than I care to 
remember) that you'll get hosed and not be able to mount your root 
filesystem.

This actually can happen so easily that when I install CentOS or Fedora, 
I override the defaults and put the root filesystem on a plain 
partition/filesystem rather than as part of an LVM2 volume.

> BTW, a friend tells me this is what Linux does (or more
> likely, what they used in their server startup). Basically a
> ramdisk with init + loadable drivers + tools needed to get
> going.  Once the actual root fs device is found (even if
> disks got switched around etc.) they switched to the actual
> root.
>
>    


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 05:52:18 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 738AF1065679
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 05:52:18 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 563CA8FC15
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 05:52:18 +0000 (UTC)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id D0CF25B56;
	Mon, 23 Aug 2010 22:52:17 -0700 (PDT)
To: Matthew Jacob <mj@feral.com>
In-reply-to: Your message of "Mon, 23 Aug 2010 21:37:38 PDT."
	<4C734C92.4010105@feral.com> 
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
	<20100824043344.CA4DE5B56@mail.bitblocks.com>
	<4C734C92.4010105@feral.com>
Comments: In-reply-to Matthew Jacob <mj@feral.com>
	message dated "Mon, 23 Aug 2010 21:37:38 -0700."
Date: Mon, 23 Aug 2010 22:52:17 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20100824055217.D0CF25B56@mail.bitblocks.com>
Cc: freebsd-arch@freebsd.org
Subject: Re: RFC: enhancing the root mount logic 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 05:52:18 -0000

On Mon, 23 Aug 2010 21:37:38 PDT Matthew Jacob <mj@feral.com>  wrote:
> Yes, this is the RedHat root pivot goop that's been around for ages.
> 
> It turns out to be a massive PITA, because the initrd image can get out 
> of sync with the kernel and hardware, and since some of the modules can 
> be loaded from there but not from the root filesystem, there is a 
> definite possibility (which has happened more times than I care to 
> remember) that you'll get hosed and not be able to mount your root 
> filesystem.

Avoiding getting out of sync is why I was advocating bundling
the ramfs root with the kernel. That too can have problems --
it is all a matter of which compromise you can live with.

> This actually can happen so easily that when I install CentOS or Fedora, 
> I override the defaults and put the root filesystem on a plain 
> partition/filesystem rather than as part of an LVM2 volume.
> 
> > BTW, a friend tells me this is what Linux does (or more
> > likely, what they used in their server startup). Basically a
> > ramdisk with init + loadable drivers + tools needed to get
> > going.  Once the actual root fs device is found (even if
> > disks got switched around etc.) they switched to the actual
> > root.

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 08:01:30 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3F554106566C
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 08:01:30 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: from mx0.hoeg.nl (unknown [IPv6:2a01:4f8:101:5343::aa])
	by mx1.freebsd.org (Postfix) with ESMTP id B83DA8FC13
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 08:01:29 +0000 (UTC)
Received: by mx0.hoeg.nl (Postfix, from userid 1000)
	id E6C062A28D2E; Tue, 24 Aug 2010 10:01:28 +0200 (CEST)
Date: Tue, 24 Aug 2010 10:01:28 +0200
From: Ed Schouten <ed@80386.nl>
To: Marcel Moolenaar <xcllnt@mac.com>
Message-ID: <20100824080128.GJ64651@hoeg.nl>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="C7PTD44AewjTsiSV"
Content-Disposition: inline
In-Reply-To: <4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 08:01:30 -0000


--C7PTD44AewjTsiSV
Content-Type: multipart/mixed; boundary="HkMjoL2LAeBLhbFV"
Content-Disposition: inline


--HkMjoL2LAeBLhbFV
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

* Marcel Moolenaar <xcllnt@mac.com> wrote:
> 1.  The /sbin/init being run is not the one on the actual (final)
>     root file system. Getting that one to run requires a special
>     init on the ramdisk.

Well, the FreeBSD live CD I posted on the lists the other day has such a
special /sbin/init. See the attachment. Maybe it could be rewritten in
such a way that it parses a text file (fstab-like?), which can also be
placed on the mdroot?
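
A rough sketch of what that parsing could look like, assuming a
three-column line format of "<fstype> <from> <fspath>" with "-" standing
for "no backing device". The format, parse_mount_line(), next_field() and
MAXOPTS are all made up for illustration; nothing like this exists in the
posted init. Each line is turned into the flat name/value list that
nmount(2) takes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXOPTS 8	/* name/value slots per line; arbitrary for this sketch */

/* Return the next whitespace-separated field and NUL-terminate it in place. */
static char *
next_field(char **cursor)
{
	char *s = *cursor + strspn(*cursor, " \t\n");
	size_t len;

	if (*s == '\0')
		return (NULL);
	len = strcspn(s, " \t\n");
	*cursor = s + len;
	if (**cursor != '\0')
		*(*cursor)++ = '\0';
	return (s);
}

/*
 * Split one line of the hypothetical file into an nmount(2)-style list.
 * The line is modified in place. Returns the number of list elements,
 * or 0 for blank lines, comments and malformed input.
 */
static unsigned int
parse_mount_line(char *line, const char *list[MAXOPTS])
{
	char *fstype, *from, *fspath;
	unsigned int n = 0;

	assert(line != NULL && list != NULL);
	fstype = next_field(&line);
	if (fstype == NULL || fstype[0] == '#')
		return (0);
	from = next_field(&line);
	fspath = next_field(&line);
	if (from == NULL || fspath == NULL)
		return (0);

	list[n++] = "fstype";
	list[n++] = fstype;
	if (strcmp(from, "-") != 0) {	/* tmpfs, devfs: no "from" */
		list[n++] = "from";
		list[n++] = from;
	}
	list[n++] = "fspath";
	list[n++] = fspath;
	return (n);
}
```

The returned list and count could be handed to a helper that fills in
the iovec array and calls nmount(2). Extra options such as the unionfs
"below" and "whiteout" settings would need a fourth column, which this
sketch leaves out.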

-- 
 Ed Schouten <ed@80386.nl>
 WWW: http://80386.nl/

--HkMjoL2LAeBLhbFV--

--C7PTD44AewjTsiSV
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (FreeBSD)

iEUEARECAAYFAkxzfFgACgkQ52SDGA2eCwXeMgCfYW0LcsRSaw9GbW+gV77NMchQ
N64AmPF2fWwFiTVM3GIvGnF6pY9MlKw=
=mBlY
-----END PGP SIGNATURE-----

--C7PTD44AewjTsiSV--

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 08:03:10 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 506B01065697
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 08:03:10 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: from mx0.hoeg.nl (unknown [IPv6:2a01:4f8:101:5343::aa])
	by mx1.freebsd.org (Postfix) with ESMTP id 9B9FE8FC15
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 08:03:09 +0000 (UTC)
Received: by mx0.hoeg.nl (Postfix, from userid 1000)
	id 0EA7B2A28CF5; Tue, 24 Aug 2010 10:03:09 +0200 (CEST)
Date: Tue, 24 Aug 2010 10:03:09 +0200
From: Ed Schouten <ed@80386.nl>
To: Marcel Moolenaar <xcllnt@mac.com>
Message-ID: <20100824080309.GK64651@hoeg.nl>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
	<20100824080128.GJ64651@hoeg.nl>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="RVlUGXxwBj5SDcM9"
Content-Disposition: inline
In-Reply-To: <20100824080128.GJ64651@hoeg.nl>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 08:03:10 -0000


--RVlUGXxwBj5SDcM9
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

* Ed Schouten <ed@80386.nl> wrote:
> See the attachment.

It seems like Mailman ate the attachment.

%%%
/*-
 * Copyright (c) 2010 Ed Schouten <ed@FreeBSD.org>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#include <sys/param.h>
#include <sys/linker.h>
#include <sys/mount.h>
#include <sys/uio.h>

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
die(const char *msg)
{
	int fd, serrno;

	serrno = errno;
	fd = open("/dev/console", O_RDWR);
	if (fd != -1 && fd != STDERR_FILENO)
		dup2(fd, STDERR_FILENO);
	errno = serrno;
	perror(msg);
	sleep(10);
	exit(1);
}

static void
domount(const char * const list[], unsigned int elems)
{
	struct iovec iov[elems];
	unsigned int i;

	for (i = 0; i < elems; i++) {
		iov[i].iov_base = (char *)list[i];
		iov[i].iov_len = strlen(list[i]) + 1;
	}

	if (nmount(iov, elems, 0) != 0)
		die(list[1]);
}

static char const * const cdfs[] = {
    "fstype", "cd9660", "from", "/dev/iso9660/freebsd", "fspath", "/ro"
};
static char const * const tmpfs[] = {
    "fstype", "tmpfs", "fspath", "/rw"
};
static char const * const unionfs[] = {
    "fstype", "unionfs", "from", "/ro", "fspath", "/rw", "below", "",
    "whiteout", "whenneeded"
};
static char const * const devfs[] = {
    "fstype", "devfs", "fspath", "/rw/dev"
};

int
main(int argc, char *argv[])
{

	/* Prevent foot shooting. */
	if (getpid() != 1)
		return (1);

	/* Perform mounts. */
	domount(cdfs, sizeof cdfs / sizeof(char *));
	domount(tmpfs, sizeof tmpfs / sizeof(char *));
	domount(unionfs, sizeof unionfs / sizeof(char *));
	domount(devfs, sizeof devfs / sizeof(char *));

	/* chroot() into system and continue boot process. */
	if (chroot("/rw") != 0)
		die("chroot");
	if (chdir("/") != 0)
		die("chdir");

	/* Execute the real /sbin/init. */
	execv(argv[0], argv);
	die("execv");
	return (1);
}
%%%

-- 
 Ed Schouten <ed@80386.nl>
 WWW: http://80386.nl/

--RVlUGXxwBj5SDcM9
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.16 (FreeBSD)

iEYEARECAAYFAkxzfL0ACgkQ52SDGA2eCwVX4gCfXGwT+BrR2p/fcSDwzlgtDk4r
LREAn1cIzCh1vFzUWnlRdCLCc48wwe8e
=vd94
-----END PGP SIGNATURE-----

--RVlUGXxwBj5SDcM9--

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 14:29:16 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AB5C71065679
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 14:29:16 +0000 (UTC)
	(envelope-from ups@freebsd.org)
Received: from smtpauth16.prod.mesa1.secureserver.net
	(smtpauth16.prod.mesa1.secureserver.net [64.202.165.22])
	by mx1.freebsd.org (Postfix) with SMTP id 75B3C8FC25
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 14:29:16 +0000 (UTC)
Received: (qmail 13014 invoked from network); 24 Aug 2010 14:01:43 -0000
Received: from unknown (75.139.142.171)
	by smtpauth16.prod.mesa1.secureserver.net (64.202.165.22) with ESMTP;
	24 Aug 2010 14:01:43 -0000
Message-ID: <4C73D0FA.5030102@freebsd.org>
Date: Tue, 24 Aug 2010 10:02:34 -0400
From: Stephan Uphoff <ups@freebsd.org>
User-Agent: Thunderbird 2.0.0.24 (Macintosh/20100228)
MIME-Version: 1.0
To: Max Laier <max@laiers.net>
References: <201008160515.21412.max@love2party.net>
	<4C7042BA.8000402@freebsd.org>
	<201008222105.47276.max@love2party.net>
	<201008240045.15998.max@laiers.net>
In-Reply-To: <201008240045.15998.max@laiers.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@freebsd.org
Subject: Re: rmlock(9) two additions
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 14:29:16 -0000

Max Laier wrote:
> On Sunday 22 August 2010 21:05:47 Max Laier wrote:
>   
>> On Saturday 21 August 2010 23:18:50 Stephan Uphoff wrote:
>>     
>>> Max Laier wrote:
>>> ...
>>> The rm_try_rlock() will never succeed when the lock was last locked as
>>> writer.
>>> Maybe add:
>>>
>>> void
>>> _rm_wunlock(struct rmlock *rm)
>>> {
>>> +   rm->rm_noreadtoken = 0;
>>>
>>>     mtx_unlock(&rm->rm_lock);
>>>
>>> }
>>>
>>> But then
>>>
>>> _rm_wlock(struct rmlock *rm)
>>>
>>> always needs to use IPIs - even when the lock was used last as a write
>>> lock.
>>>       
>> I don't think this is a big problem - I can't see many use cases for
>> rmlocks where you'd routinely see repeated wlocks without rlocks between
>> them. However, I think there should be a release memory barrier
>> before/while clearing rm_noreadtoken, otherwise readers may not see the
>> data writes that are supposed to be protected by the lock?!?
>>     
>>> ...
>>>       
>> I believe either solution will work.  #1 is a bit more in the spirit of the
>> rmlock - i.e. make the read case cheap and the write case expensive.  I'm
>> just not sure about the lock semantics.
>>
>> I guess a
>>
>>   atomic_store_rel_int(&rm->rm_noreadtoken, 0);
>>
>> should work.
>>     
>
> thinking about this for a while makes me wonder: Are readers really guaranteed 
> to see all the updates of a writer - even in the current version?
>
> Example:
>
>   writer thread:
>   rm_wlock();		// lock mtx, IPI, wait for reader drain
>   modify_data();
>   rm_wunlock();	// unlock mtx (this does an atomic_*_rel)
>
>   reader thread #1:
>   // failed to get the lock, spinning/waiting on mtx
>   mtx_lock();		// this does a atomic_*_acq -> this CPU sees the new data
>   rm->rm_noreadtoken = 0;	// now new readers can enter quickly
>   ...
>
>   reader thread #2 (on a different CPU than reader #1):
>   // enters rm_rlock() "after" rm_noreadtoken was reset -> no memory barrier
>   // does this thread see the modifications?
>
> I realize this is a somewhat pathological case, but would it be possible in 
> theory?  Or is the compiler_memory_barrier() actually enough?
>
> Otherwise, I think we need an IPI on rm_wunlock() that does an atomic_*_acq on 
> every CPU.
>
> Thoughts?
>   Max
>
>   
Yes - this is a problem that needs to be addressed.
Fortunately most platforms won't need to be as strict, so I suggest 
per-platform parameters.
An alternative that was in my original design was to use a bitmap for 
the rm_noreadtoken.
Each CPU would then have an associated bit that will only be cleared by 
that CPU.
This would also allow targeted IPIs to only the token holders.
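
A minimal user-space sketch of the bitmap idea, using C11 atomics rather
than the kernel's atomic(9) primitives. The names (rm_tokens,
rm_has_token, rm_revoke_tokens, rm_regain_token) and the 32-CPU limit are
assumptions made for illustration; this is not the rmlock(9)
implementation. A set bit means that CPU has lost its read token, and the
old bitmap tells the writer which CPUs actually held one:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit n set: CPU n has no read token and must take the slow path. */
struct rm_tokens {
	_Atomic uint32_t noreadtoken;
};

static uint32_t
cpu_bit(unsigned int cpu)
{
	assert(cpu < 32);	/* sketch limit: one 32-bit word */
	return (UINT32_C(1) << cpu);
}

/* Reader fast-path check for the CPU the reader is running on. */
static bool
rm_has_token(struct rm_tokens *rt, unsigned int cpu)
{
	uint32_t map;

	map = atomic_load_explicit(&rt->noreadtoken, memory_order_acquire);
	return ((map & cpu_bit(cpu)) == 0);
}

/*
 * Writer side: revoke every token. The return value is the mask of CPUs
 * that still held a token, i.e. the only CPUs that would need an IPI.
 */
static uint32_t
rm_revoke_tokens(struct rm_tokens *rt, uint32_t all_cpus)
{
	uint32_t old;

	old = atomic_fetch_or_explicit(&rt->noreadtoken, all_cpus,
	    memory_order_acq_rel);
	return (all_cpus & ~old);
}

/*
 * Slow-path reader, called after it has acquired the lock's mutex:
 * hand the token back to this CPU only, never to any other CPU.
 */
static void
rm_regain_token(struct rm_tokens *rt, unsigned int cpu)
{
	atomic_fetch_and_explicit(&rt->noreadtoken, ~cpu_bit(cpu),
	    memory_order_release);
}
```

Because a CPU clears only its own bit, and only after going through the
mutex, a reader on the fast path has synchronized with the last writer
itself instead of relying on some other CPU having done so. Whether that
is sufficient on every platform is the open question raised above.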

Stephan


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 14:56:09 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AE5C810656A4
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 14:56:09 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout026.mac.com (asmtpout026.mac.com [17.148.16.101])
	by mx1.freebsd.org (Postfix) with ESMTP id 9170D8FC0C
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 14:56:09 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp026.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7N00AKJW5KWX10@asmtp026.mac.com> for
	freebsd-arch@FreeBSD.org; Tue, 24 Aug 2010 07:56:09 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008240083
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-24_06:2010-08-24, 2010-08-24,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100824043344.CA4DE5B56@mail.bitblocks.com>
Date: Tue, 24 Aug 2010 07:56:08 -0700
Message-id: <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
	<20100824043344.CA4DE5B56@mail.bitblocks.com>
To: Bakul Shah <bakul@bitblocks.com>
X-Mailer: Apple Mail (2.1081)
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 14:56:09 -0000


On Aug 23, 2010, at 9:33 PM, Bakul Shah wrote:

> On Mon, 23 Aug 2010 18:24:07 PDT Marcel Moolenaar <xcllnt@mac.com>  wrote:
>> 
>> On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote:
>> 
>>>> The 2 reasons for doing this in the kernel are:
>>>> 1.  resiliency against ABI changes.
>>>> 2.  allowing /sbin/init to come from the actual root file system.
>>>> 
>>>> Both points are impossible to handle efficiently or correctly if
>>>> you need user space support in getting to your actual root file
>>>> system. You basically have a catch-22 or bootstrap problem, which
>>>> a pure in-kernel solution doesn't have.
>>> 
>>> How about just bundling a small compressed ramfs with the
>>> kernel.  The kernel unpacks it, uses it as the initial rootfs
>>> and runs init from it. A forth/scheme/lua based program
>>> wouldn't add more than a % or so (given that the GENERIC
>>> kernel is over 10MB now!).
> 
> BTW, a friend tells me this is what Linux does (or more
> likely, what they used in their server startup).

I see your point and buy into the argument, but not
entirely. I explicitly mentioned "embedding" and so
far your arguments include things like GENERIC being
10MB or Linux server startup.

We're not exactly discussing the same thing, are we?

I'm perfectly happy to say that the ramdisk approach
is the most generic solution for desktop and
server machines, but I'm not at all ready to have it
include embedded systems just yet. It's just too
heavyweight...

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 15:48:59 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 773AD1065679
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 15:48:59 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 3CC018FC18
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 15:48:58 +0000 (UTC)
Received: from critter.freebsd.dk (unknown [192.168.51.2])
	by phk.freebsd.dk (Postfix) with ESMTP id 204C23F5B7;
	Tue, 24 Aug 2010 15:48:56 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.4/8.14.4) with ESMTP id o7OFmq7J012991;
	Tue, 24 Aug 2010 15:48:53 GMT (envelope-from phk@critter.freebsd.dk)
To: Marcel Moolenaar <xcllnt@mac.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Tue, 24 Aug 2010 07:56:08 MST."
	<760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> 
Date: Tue, 24 Aug 2010 15:48:52 +0000
Message-ID: <12990.1282664932@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 15:48:59 -0000

In message <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>,
Marcel Moolenaar writes:

>I'm perfectly happy to say that the ramdisk approach
>is the most generic solution for desktop and
>server machines, but I'm not at all ready to have it
>include embedded systems just yet. It's just too
>heavyweight...

I'm with Marcel here.

Except for one detail:

In deeply embedded applications the ramdisk is actually preferable,
because that saves you from providing a root filesystem any other
way.

Our solution for that is MD_PRELOADED, which is quite a hack.

The bit missing for the ramdisk approach is the root-fs-swizzle code.

There are two ways to do that: either with a very magic mount-like system
call, or by pid==1 setting the name of the real rootfs with a sysctl
and exiting, which calls into the existing root-mount code again.

The latter is almost trivial to implement; just remember to start
the new /sbin/init with pid==1.
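
A minimal sketch of the user-space half of that second option, assuming
a writable string sysctl that takes the same "fstype:device" form as
vfs.root.mountfrom. The sysctl name vfs.root.realroot and the helpers
rootspec_ok() and set_real_root() are hypothetical; no such knob exists
today. The ramdisk init (pid 1) would call set_real_root() and then
exit, leaving the remount and the start of the new /sbin/init to the
kernel:

```c
#include <sys/types.h>
#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical knob; no sysctl of this name exists today. */
#define REAL_ROOT_SYSCTL	"vfs.root.realroot"

/* Accept only "fstype:device" specs, e.g. "ufs:/dev/da0s1a". */
static int
rootspec_ok(const char *spec)
{
	const char *colon;

	assert(spec != NULL);
	colon = strchr(spec, ':');
	return (colon != NULL && colon != spec && colon[1] != '\0');
}

/*
 * Tell the kernel where the real root file system lives.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int
set_real_root(const char *spec)
{
	if (!rootspec_ok(spec)) {
		errno = EINVAL;
		return (-1);
	}
#ifdef __FreeBSD__
	return (sysctlbyname(REAL_ROOT_SYSCTL, NULL, NULL, spec,
	    strlen(spec) + 1));
#else
	errno = ENOSYS;		/* the sketch only does real work on FreeBSD */
	return (-1);
#endif
}
```

The kernel side (re-entering the root-mount code when pid 1 exits and
starting the new init as pid 1) is the part that would actually need
writing; it is not shown here.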

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 15:52:07 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 06BE610656A8
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 15:52:07 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (mail.bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id D3DDB8FC1B
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 15:52:06 +0000 (UTC)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id C2A535B23;
	Tue, 24 Aug 2010 08:52:05 -0700 (PDT)
To: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: Your message of "Tue, 24 Aug 2010 07:56:08 PDT."
	<760A97A4-62D2-4900-915D-CA5D889855E1@mac.com> 
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
	<20100824043344.CA4DE5B56@mail.bitblocks.com>
	<760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>
Comments: In-reply-to Marcel Moolenaar <xcllnt@mac.com>
	message dated "Tue, 24 Aug 2010 07:56:08 -0700."
Date: Tue, 24 Aug 2010 08:52:05 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20100824155205.C2A535B23@mail.bitblocks.com>
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 15:52:07 -0000

On Tue, 24 Aug 2010 07:56:08 PDT Marcel Moolenaar <xcllnt@mac.com>  wrote:
> 
> On Aug 23, 2010, at 9:33 PM, Bakul Shah wrote:
> 
> > On Mon, 23 Aug 2010 18:24:07 PDT Marcel Moolenaar <xcllnt@mac.com>  wrote:
> >> 
> >> On Aug 23, 2010, at 5:23 PM, Bakul Shah wrote:
> >> 
> >>>> The 2 reasons for doing this in the kernel are:
> >>>> 1.  resiliency against ABI changes.
> >>>> 2.  allowing /sbin/init to come from the actual root file system.
> >>>> 
> >>>> Both points are impossible to handle efficiently or correctly if
> >>>> you need user space support in getting to your actual root file
> >>>> system. You basically have a catch-22 or bootstrap problem, which
> >>>> a pure in-kernel solution doesn't have.
> >>> 
> >>> How about just bundling a small compressed ramfs with the
> >>> kernel.  The kernel unpacks it, uses it as the initial rootfs
> >>> and runs init from it. A forth/scheme/lua based program
> >>> wouldn't add more than a % or so (given that the GENERIC
> >>> kernel is over 10MB now!).
> > 
> > BTW, a friend tells me this is what Linux does (or more
> > likely, what they used in their server startup).
> 
> I see your point and buy into the argument, but not
> entirely. I explicitly mentioned "embedding" and so
> far your arguments include things like GENERIC being
> 10MB or Linux server startup.
> 
> We're not exactly discussing the same thing are we?

This friend's company used linux in an embedded system [it
was a fileserver product.  Presumably the OS had to run in a
restricted environment since the FS space would be for their
customers' use + you don't want to have to reload the OS when
a disk dies! And yet you want the ability to upgrade your OS
s/w etc.]

In my job[-2] we used FreeBSD as an embedded OS. IIRC we just
ran from a readonly flash FS as root.  An upgrade was just a
new FS image, including kernel + utilities.  Didn't Juniper
do something similar?

> I'm perfectly happy to say that the ramdisk approach
> is the most generic and solution for desktop and
> server machines but I'm not at all ready to have it
> include embedded systems just yet. It's just too
> heavy weight...

I would argue that while each individual embedded system
typically runs in a simpler environment than GENERIC, the sum
total of such embedded environments presents a large set of
alternatives. Now if you can distill all that down to a small
set of kernel changes, that is great!

But I am not doing the work, you are. So feel free to
use/ignore my input however you wish!

From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 16:16:45 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A1F5A106566B
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 16:16:45 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout026.mac.com (asmtpout026.mac.com [17.148.16.101])
	by mx1.freebsd.org (Postfix) with ESMTP id 83E7C8FC14
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 16:16:45 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp026.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7N00A17ZV1WK70@asmtp026.mac.com> for
	freebsd-arch@FreeBSD.org; Tue, 24 Aug 2010 09:16:14 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008240096
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-24_09:2010-08-24, 2010-08-24,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100824155205.C2A535B23@mail.bitblocks.com>
Date: Tue, 24 Aug 2010 09:16:13 -0700
Message-id: <C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823.171201.107001114053031707.imp@bsdimp.com>
	<8C76250B-E272-4807-BD0D-9F50D0BC5E10@mac.com>
	<20100824002350.042A45B3B@mail.bitblocks.com>
	<4CB9F7C8-39E8-4C3B-A3F8-A5A9EC178E7D@mac.com>
	<20100824043344.CA4DE5B56@mail.bitblocks.com>
	<760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>
	<20100824155205.C2A535B23@mail.bitblocks.com>
To: Bakul Shah <bakul@bitblocks.com>
X-Mailer: Apple Mail (2.1081)
Cc: "freebsd-arch@FreeBSD.org" <freebsd-arch@FreeBSD.org>
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 16:16:45 -0000


On Aug 24, 2010, at 8:52 AM, Bakul Shah wrote:
>> 
>> I see your point and buy into the argument, but not
>> entirely. I explicitly mentioned "embedding" and so
>> far your arguments include things like GENERIC being
>> 10MB or Linux server startup.
>> 
>> We're not exactly discussing the same thing are we?
> 
> This friend's company used linux in an embedded system [it
> was a fileserver product.  Presumably the OS had to run in a
> restricted environment since the FS space would be for their
> customers' use + you don't want to have to reload the OS when
> a disk dies! And yet you want the ability to upgrade your OS
> s/w etc.]
> 
> In my job[-2] we used FreeBSD as an embedded OS. IIRC we just
> ran from a readonly flash FS as root.  An upgrade was just a
> new FS image, including kernel + utilities.  Didn't Juniper
> do something similar?

Juniper's approach is still heavily rooted in PC-class H/W.
With Book-E, ARM and MIPS products for the low(er)-end and
in particular without these products having a real hard disk,
the existing way has shown its problems and limitations.

Also: Juniper has hacked a few tools, including the kernel
at large and md(4) in particular to implement features they
needed/wanted, which I'd like to get away from.

FYI,

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Tue Aug 24 17:01:35 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F20C4106566C
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 17:01:35 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id AF7BB8FC08
	for <freebsd-arch@FreeBSD.org>; Tue, 24 Aug 2010 17:01:35 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7OGtasY067234;
	Tue, 24 Aug 2010 10:55:36 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Tue, 24 Aug 2010 10:55:46 -0600 (MDT)
Message-Id: <20100824.105546.1002438156525560711.imp@bsdimp.com>
To: xcllnt@mac.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
References: <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>
	<20100824155205.C2A535B23@mail.bitblocks.com>
	<C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@FreeBSD.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Aug 2010 17:01:36 -0000

In message: <C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
            Marcel Moolenaar <xcllnt@mac.com> writes:
: 
: On Aug 24, 2010, at 8:52 AM, Bakul Shah wrote:
: >> 
: >> I see your point and buy into the argument, but not
: >> entirely. I explicitly mentioned "embedding" and so
: >> far your arguments include things like GENERIC being
: >> 10MB or Linux server startup.
: >> 
: >> We're not exactly discussing the same thing are we?
: > 
: > This friend's company used linux in an embedded system [it
: > was a fileserver product.  Presumably the OS had to run in a
: > restricted environment since the FS space would be for their
: > customers' use + you don't want to have to reload the OS when
: > a disk dies! And yet you want the ability to upgrade your OS
: > s/w etc.]
: > 
: > In my job[-2] we used FreeBSD as an embedded OS. IIRC we just
: > ran from a readonly flash FS as root.  An upgrade was just a
: > new FS image, including kernel + utilities.  Didn't Juniper
: > do something similar?
: 
: Juniper's approach is still heavily rooted in PC-class H/W.
: With Book-E, ARM and MIPS products for the low(er)-end and
: in particular without these products having a real harddisk,
: the existing way has shown it's problems and limitations.
: 
: Also: Juniper has hacked a few tools, including the kernel
: at large and md(4) in particular to implement features they
: needed/wanted, which I'd like to get away from.

You can get away from a large MD by having a small MD and pivoting to
large storage.  Linux does this, as Bakul said, and it scales from the
ultra-small 4MB MIPS router up to the largest multicore server.

Warner


From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 01:51:57 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 98D6D1065672
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 01:51:57 +0000 (UTC)
	(envelope-from adrian.chadd@gmail.com)
Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 5AD468FC0C
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 01:51:57 +0000 (UTC)
Received: by iwn36 with SMTP id 36so129257iwn.13
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 18:51:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:sender:received
	:in-reply-to:references:date:x-google-sender-auth:message-id:subject
	:from:to:cc:content-type:content-transfer-encoding;
	bh=C+n4G+tYFzPgwdRdiKULxztljynPHHLtwqtkABoTJE0=;
	b=CNzYt78WBd2wh71Yd+ZyhGucl296kYm+tTmf8xi2Px7Bv+BgX//YFpT7TKxu5wqrLn
	l7NdOP42aBEIzFIGjZ7uUPLFtPejJjmOuWf7KREXVJlkOIS8tr0fIpDYJ6SIpaCkyjT5
	SxvfWE8D83SUfz8d4n6j3DZ3KYKLDZtZkVV7g=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	b=CBgD1ihveY6Gl1UDq1ZEPeUr2xLNcrZGEScQY1AwuQyhZDdLOwv+sbgl+EQL7CmN8T
	D1XZepZryO61eL/KXlgvBvulZ14pezY2bj+aW/ezZAOTCo8wr4MxA4vCEw9YIUmMJZpa
	JVkruvE7hZPFwCStsxJ4Hy0M7H91uEdJZ67XM=
MIME-Version: 1.0
Received: by 10.231.148.195 with SMTP id q3mr9239909ibv.199.1282699323033;
	Tue, 24 Aug 2010 18:22:03 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.231.168.14 with HTTP; Tue, 24 Aug 2010 18:22:02 -0700 (PDT)
In-Reply-To: <20100824.105546.1002438156525560711.imp@bsdimp.com>
References: <760A97A4-62D2-4900-915D-CA5D889855E1@mac.com>
	<20100824155205.C2A535B23@mail.bitblocks.com>
	<C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
	<20100824.105546.1002438156525560711.imp@bsdimp.com>
Date: Wed, 25 Aug 2010 09:22:02 +0800
X-Google-Sender-Auth: Xmr9owVOhwyj2exg1SO6HNzb-Rg
Message-ID: <AANLkTimUgLAYfM7FJ32hMmF8SEtUYYTrOMKBZep0zDJs@mail.gmail.com>
From: Adrian Chadd <adrian@freebsd.org>
To: "M. Warner Losh" <imp@bsdimp.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: xcllnt@mac.com, freebsd-arch@freebsd.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Aug 2010 01:51:57 -0000

On 25 August 2010 00:55, M. Warner Losh <imp@bsdimp.com> wrote:
>
> You can get away from a large MD by having a small MD and pivoting to
> large storage.  Linux does this, as Bakul said, and it scales from the
> ultra-small 4MB Mips router up to the highest multicore server.
>

But as someone's said before - and as I've been a Linux sysadmin here
and there, I've been bitten more than once by the Linux mdroot setup
where only the -bare minimum- modules needed to bring the system up
are in the mdroot. Woe be if you have to swap hardware in a hurry -
double woe if your distribution provides lots of nice "autodetect"
methods for figuring out which modules should be in the mdroot and
does this for you automatically. You can manually build modules into
mdroot but that isn't any good when you're trying to boot a
post-failed system on alternative hardware.

The FreeBSD method has been nice - I can compile a lean GENERIC but
use /boot/loader.conf to load modules at boot time to use alternative
storage/network mechanisms.

I'm not saying the whole Linux initrd approach is -bad-; I'm just
saying it needs to be thought through a little more first.

Adrian

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 01:57:33 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F005C106566B;
	Wed, 25 Aug 2010 01:57:33 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id AEF278FC0A;
	Wed, 25 Aug 2010 01:57:33 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7P1pZ9V001461;
	Tue, 24 Aug 2010 19:51:35 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Tue, 24 Aug 2010 19:51:45 -0600 (MDT)
Message-Id: <20100824.195145.29593248078694701.imp@bsdimp.com>
To: adrian@FreeBSD.org
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <AANLkTimUgLAYfM7FJ32hMmF8SEtUYYTrOMKBZep0zDJs@mail.gmail.com>
References: <C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
	<20100824.105546.1002438156525560711.imp@bsdimp.com>
	<AANLkTimUgLAYfM7FJ32hMmF8SEtUYYTrOMKBZep0zDJs@mail.gmail.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: xcllnt@mac.com, freebsd-arch@FreeBSD.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Aug 2010 01:57:34 -0000

In message: <AANLkTimUgLAYfM7FJ32hMmF8SEtUYYTrOMKBZep0zDJs@mail.gmail.com>
            Adrian Chadd <adrian@FreeBSD.org> writes:
: On 25 August 2010 00:55, M. Warner Losh <imp@bsdimp.com> wrote:
: >
: > You can get away from a large MD by having a small MD and pivoting to
: > large storage.  Linux does this, as Bakul said, and it scales from the
: > ultra-small 4MB Mips router up to the highest multicore server.
: >
:
: But as someone's said before - and as I've been a Linux sysadmin here
: and there, I've been bitten more than once by the linux mdroot setup
: where only the -bare minimum- modules needed to bring the system up
: are in the mdroot. Woe be if you have to swap hardware in a hurry -
: double woe if your distribution provides lots of nice "autodetect"
: methods for figuring out which modules should be in the mdroot and
: does this for you automatically. You can manually build modules into
: mdroot but that isn't any good when you're trying to boot a
: post-failed system on alternative hardware.
:
: The FreeBSD method has been nice - I can compile a lean GENERIC but
: use /boot/loader.conf to load modules at boot time to use alternative
: storage/network mechanisms.
:
: I'm not saying the whole Linux initrd approach is -bad-; i'm just
: saying it needs to be thought through a little more first.

Nobody is saying that the only way to do things (or even the default
way) is via the Linux mdroot thing.  We're saying that it is *A* way
to bootstrap a kernel: use the ramfs to find the proper location
of root to mount (maybe after initializing the device where root is),
then pivot to that new location.

Marcel's current proposal seems simpler (and less flexible) than
this.  The proof of the pudding will be his ability to handle the
'layered' cases of encryption or compression I brought up earlier.

Warner


From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 03:49:14 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6025A10656A6
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 03:49:14 +0000 (UTC)
	(envelope-from adrian.chadd@gmail.com)
Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 1EB818FC0A
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 03:49:13 +0000 (UTC)
Received: by iwn36 with SMTP id 36so231501iwn.13
	for <freebsd-arch@freebsd.org>; Tue, 24 Aug 2010 20:49:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:sender:received
	:in-reply-to:references:date:x-google-sender-auth:message-id:subject
	:from:to:cc:content-type:content-transfer-encoding;
	bh=Ap3p8bED1ToxjF/kbq+79+8iirI9Z3xfR1eGhnCvGRQ=;
	b=kLrux8SX+R6JQ5vKOPpby06ymRyNBnmG/gX7EdHKhsMai6ndAYitic0w5cWydLjtK2
	zGAnajKmVuyPOX574yzBwzU4l4zlE0vMQnFPOsX/LDBfOGjMCLM8ahVUZtt0XqtlKUCK
	y8iPhmWs2JcDTK4IFS6e4NAK+ISVf2m9hkvdg=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	b=IsdRlioMLUxfrL0SI0Jd1ldXZjDeyIOfUBNSatZFxp3IXmH5TuUI4c6RcdYBGSu9CQ
	S71oAOlK3139A2lYx8DOQ/RZ7LJCVmCAynZzeH4SVonTMzmHZtVyBX55cn0pIl2T2ft0
	pNhW4R7ti3L/eCbxuWG8EMmUch/VVKLn+QW3w=
MIME-Version: 1.0
Received: by 10.231.170.21 with SMTP id b21mr9462358ibz.122.1282708150286;
	Tue, 24 Aug 2010 20:49:10 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.231.168.14 with HTTP; Tue, 24 Aug 2010 20:49:10 -0700 (PDT)
In-Reply-To: <20100824.195145.29593248078694701.imp@bsdimp.com>
References: <C6B677DB-5CC8-46C1-B551-7BEB7BF953E0@mac.com>
	<20100824.105546.1002438156525560711.imp@bsdimp.com>
	<AANLkTimUgLAYfM7FJ32hMmF8SEtUYYTrOMKBZep0zDJs@mail.gmail.com>
	<20100824.195145.29593248078694701.imp@bsdimp.com>
Date: Wed, 25 Aug 2010 11:49:10 +0800
X-Google-Sender-Auth: VU8UpYE0M7TBxG-ezKJdsmlGcSM
Message-ID: <AANLkTika4Njgrz=cmgyZJxofz9gvSiwuFhnDVjXk_rEB@mail.gmail.com>
From: Adrian Chadd <adrian@freebsd.org>
To: "M. Warner Losh" <imp@bsdimp.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: xcllnt@mac.com, freebsd-arch@freebsd.org
Subject: Re: RFC: enhancing the root mount logic
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Aug 2010 03:49:14 -0000

On 25 August 2010 09:51, M. Warner Losh <imp@bsdimp.com> wrote:

> : I'm not saying the whole Linux initrd approach is -bad-; i'm just
> : saying it needs to be thought through a little more first.
>
> No body is saying that the only way to do things (or even the default
> way) is via the Linux mdroot thing.  We're saying that it is *A* way
> to bootstrap a kernel that uses the ramfs to find the proper location
> of root to mount (maybe after initializing the device where root is),
> pivot to that new location.
>
> Marcel's current proposal seems simpler (and less flexible) than
> this.  The proof in the pudding will be his ability to handle the
> 'layered' cases of encryption or compression I brought up earlier.

I do like the idea of a formalish description of a bootstrap process
rather than "hi, run these scripts."

That said, I do like the idea of also being able to run some scripts,
prep the system and then re-try the root mount. Having both options
would be rather nice.

In any case, +1 from me.

Adrian

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 15:58:31 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B3EDC10656AE
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 15:58:31 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout023.mac.com (asmtpout023.mac.com [17.148.16.98])
	by mx1.freebsd.org (Postfix) with ESMTP id 9D8768FC0A
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 15:58:31 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp023.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7P00515TP4UP70@asmtp023.mac.com> for
	freebsd-arch@freebsd.org; Wed, 25 Aug 2010 08:58:18 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008250105
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-25_08:2010-08-25, 2010-08-25,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
Date: Wed, 25 Aug 2010 08:58:16 -0700
Message-id: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
To: "freebsd-arch@FreeBSD.org Arch" <freebsd-arch@freebsd.org>
X-Mailer: Apple Mail (2.1081)
Subject: RFC: root mount enhancement (round 2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Aug 2010 15:58:31 -0000

Summary of round 1:
1.  A ramdisk root file system (whether pre-loaded by the loader or
    compiled into the kernel) allows any and all file systems to
    be mounted as root (in theory). One can populate the ramdisk
    with whatever tools one needs to set up the storage solution and
    mount file systems.
2.  Negative experiences with the ramdisk root file system as a
    general approach for mounting a root file system have been
    expressed.
3.  A well-defined and simple recursive algorithm that the kernel
    uses for finding (nested) root file systems has not been shot
    down, but needs to handle the power of GEOM better.

See also:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=5942+0+current/freebsd-arch

Round 2 preamble:

Let me mention a problem with the currently implemented root mount
logic as a reminder that something needs to be fixed, even if we
don't want to enhance it: a USB disk cannot always be used as a root
file system, because the USB stack releases the root mount lock
after creating the umass device but before CAM has created the
corresponding da device. The kernel will try mounting from /dev/da0
before the device exists, fail, and then drop into the root mount
prompt. Often the story ends here -- with failure.

The root mount enhancement intends to solve this scenario by
specifically waiting for the mentioned device/path before
moving on to the next alternative.
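
A minimal user-space sketch of that waiting behaviour, assuming a simple
poll with a timeout. The function names, the default timeout and the
polling interval are illustrative choices, not part of the proposal; the
kernel would wait on device arrival rather than poll a path.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.1, exists=os.path.exists):
    """Poll until `path` exists or `timeout` seconds pass.

    Returns True if the path appeared, False if the wait timed out.
    `exists` is injectable so the logic can be exercised without devices.
    """
    deadline = time.monotonic() + timeout
    while True:
        if exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def first_available(candidates, timeout=10.0, exists=os.path.exists):
    """Wait on each candidate in turn; return the first one that shows up."""
    for path in candidates:
        if wait_for_path(path, timeout, exists=exists):
            return path
    return None
```

With this shape, a slow /dev/da0 costs at most `timeout` seconds before the
next alternative is tried, instead of failing immediately.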

Round 2:

The logic remains mostly the same as described in round 1, but
gains a directive and limited variable substitution. These are
added to decouple the mount directive (${FS}:${DEV}) from the
creation of the memory disk so that GEOM can do its thing. As
such, the creation of a memory disk is now a separate directive:

	.md <image-file> <md-options>

To mount the memory disk (UFS in the example), use:

	ufs:/dev/md# <mount-options>

Here md# refers to the md unit created by the last .md directive.
Since the logic is for mounting the root file system only, a .md
directive implicitly detaches and releases the previously created
md device before creating a new one. In other words: the
enhancement is not for creating a bunch of md devices.

Should this be relaxed so that any number of md devices can be
created before we try a root mount?

When the md device appears, GEOM gets to taste the provider
and all kinds of interesting things can happen. By decoupling
the creation of the md device from the mount directive, it's
trivial to handle arbitrarily complex GEOM graphs. For example:

	ufs:/dev/md#s1a
	ufs:/dev/md#.uzip
	...
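
The md# substitution and the implicit detach can be modelled as follows.
This is a sketch under assumptions: the MdState class is hypothetical, and
handing out increasing unit numbers is a guess, since the proposal does not
say whether a unit number is reused after a detach.

```python
class MdState:
    """Tracks the single md unit the proposal allows at any one time."""

    def __init__(self):
        self.unit = None      # unit created by the last .md directive
        self.next_unit = 0    # assumption: units are handed out in order
        self.detached = []    # units released by later .md directives

    def create(self, image_file):
        """Handle a .md directive: detach the previous unit, create a new one."""
        if self.unit is not None:
            self.detached.append(self.unit)
        self.unit = self.next_unit
        self.next_unit += 1
        return self.unit

    def expand(self, spec):
        """Replace "md#" in a mount spec with the current md unit."""
        if "md#" not in spec:
            return spec
        if self.unit is None:
            raise ValueError("md# used before any .md directive")
        return spec.replace("md#", f"md{self.unit}")
```

Because only the name is substituted, anything GEOM layers on top of the
provider (a slice, a .uzip consumer) is reachable by appending its suffix.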

For completeness, here is the syntax of the configuration file (in
some weird hybrid regex-based specification that is sloppy
about spaces), to make sure things get fleshed out enough
for review:

	<.mount.conf>	: (^<config-line>$)*
	<config-line>	: <comment>
			| <empty>
			| <directive>
	<comment>	: '#'.*
	<empty>		: 
	<directive>	: <mount>
			| <md>
			| <ask>
			| <wait>
			| <onfail>
			| <init>
	<mount>		: <fs>':'<path> <mount-options>
	<mount-options>	: <empty>
			| <mount-opt-list>
	<mount-opt-list>: <mount-option>
			| <mount-option>','<mount-opt-list>
	<mount-option>	: <var>
			| <var>'='<value>
	<md>		: ".md" <file> <md-options>
	<md-options>	: <empty>
			| <md-opt-list>
	<md-opt-list>	: <md-option>
			| <md-option>','<md-opt-list>
	<md-option>	: "nocompress"			# compress is default
			| "nocluster"			# cluster is default
			| "async"
			| "readonly"
	<ask>		: ".ask"
	<wait>		: "wait" <seconds>
	<onfail>	: "onfail" <onfail-action>
	<onfail-action>	: "panic"			# default
			| "reboot"
			| "retry"
			| "continue"
	<init>		: ".init" <init-list>
	<init-list>	: <program>
			| <program>':'<init-list>
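
To check that the grammar hangs together, here is a small parser sketch
for one line of /.mount.conf. It follows the grammar literally, including
"wait" and "onfail" written without a leading dot. Two things are
assumptions on my part: option lists contain no spaces around the commas,
and the tuple shapes returned are arbitrary.

```python
MD_OPTIONS = {"nocompress", "nocluster", "async", "readonly"}
ONFAIL_ACTIONS = {"panic", "reboot", "retry", "continue"}


def split_list(text, sep):
    """Split a separator-delimited list, rejecting empty items."""
    items = text.split(sep)
    if any(not item for item in items):
        raise ValueError(f"empty item in list: {text!r}")
    return items


def parse_line(line):
    """Parse one config line; return None for comments and blank lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    keyword, *args = text.split()
    if keyword == ".ask" and not args:
        return ("ask",)
    if keyword == ".md" and len(args) in (1, 2):
        options = split_list(args[1], ",") if len(args) == 2 else []
        unknown = sorted(set(options) - MD_OPTIONS)
        if unknown:
            raise ValueError(f"unknown md option(s): {unknown}")
        return ("md", args[0], options)
    if keyword == "wait" and len(args) == 1:
        return ("wait", int(args[0]))
    if keyword == "onfail" and len(args) == 1 and args[0] in ONFAIL_ACTIONS:
        return ("onfail", args[0])
    if keyword == ".init" and len(args) == 1:
        return ("init", split_list(args[0], ":"))
    # Anything else has to be a mount directive: <fs>:<path> [<options>]
    fs, sep, path = keyword.partition(":")
    if keyword.startswith(".") or not (fs and sep and path) or len(args) > 1:
        raise ValueError(f"cannot parse directive: {text!r}")
    options = {}
    option_items = split_list(args[0], ",") if args else []
    for item in option_items:
        name, eq, value = item.partition("=")
        options[name] = value if eq else True
    return ("mount", fs, path, options)


def parse_config(text):
    """Parse a whole file, dropping comments and blank lines."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [directive for directive in parsed if directive is not None]
```

Note that a mount spec such as ufs:/dev/md#s1a contains a '#', so comments
can only be recognised at the start of a line, as the grammar has it.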


To re-iterate: the logic is recursive. After mounting some file system
as root, the kernel will follow the directives in /.mount.conf (if the
file exists) for remounting the root file system. At each iteration the
kernel will remount devfs under /dev and remount the current root file
system under /.mount within the new root file system.
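
The recursion can be sketched as a loop over mount tables. The model below
is illustrative only: next_root_for stands in for "read /.mount.conf on the
current root and pick a new root", and the max_depth guard is my own
addition, since the proposal does not mention a nesting limit.

```python
def mount_root_recursively(first_root, next_root_for, max_depth=8):
    """Follow /.mount.conf from root to root until a root has no such file.

    next_root_for(root) returns the root selected by that root's
    /.mount.conf, or None when the file does not exist.
    Returns the list of mount tables, one per iteration.
    """
    mounts = {"/": first_root, "/dev": "devfs"}
    history = [dict(mounts)]
    remounts = 0
    while True:
        new_root = next_root_for(mounts["/"])
        if new_root is None:
            return history
        if remounts == max_depth:
            raise RuntimeError("too many nested root file systems")
        # devfs is remounted on /dev; the old root moves under /.mount.
        mounts = {"/": new_root, "/dev": "devfs", "/.mount": mounts["/"]}
        history.append(dict(mounts))
        remounts += 1


# Example: a flash root whose /.mount.conf points at a compressed image.
configs = {"ufs:/dev/da0s1a": "ufs:/dev/md0.uzip"}
history = mount_root_recursively("ufs:/dev/da0s1a", configs.get)
```

Without some such guard, two roots whose configuration files name each
other would loop forever.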

Thoughts?

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 16:01:16 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6C06D1065695
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 16:01:16 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id 313888FC1A
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 16:01:15 +0000 (UTC)
Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id DC42D3F627;
	Wed, 25 Aug 2010 16:01:14 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.4/8.14.4) with ESMTP id o7PG1DbB054211;
	Wed, 25 Aug 2010 16:01:14 GMT (envelope-from phk@critter.freebsd.dk)
To: Marcel Moolenaar <xcllnt@mac.com>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Wed, 25 Aug 2010 08:58:16 MST."
	<34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com> 
Date: Wed, 25 Aug 2010 16:01:13 +0000
Message-ID: <54210.1282752073@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: "freebsd-arch@FreeBSD.org Arch" <freebsd-arch@freebsd.org>
Subject: Re: RFC: root mount enhancement (round 2) 

In message <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>, Marcel Moolenaar writes:

>don't want to enhance: A USB disk cannot always be used as a root
>file system by virtue of the USB stack releasing the root mount
>lock after creating the umass device, but before CAM has created
>the corresponding da device. 

This is a bug which is entirely unrelated to how we find the
root filesystem:  It should simply be fixed by CAM grabbing a
root mount lock when activated from USB and releasing it
only when all its stuff is done.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 18:18:11 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AE52710656A5
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 18:18:11 +0000 (UTC)
	(envelope-from yanegomi@gmail.com)
Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com
	[209.85.215.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 3EE1E8FC1B
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 18:18:10 +0000 (UTC)
Received: by eyx24 with SMTP id 24so601118eyx.13
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 11:18:10 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.213.62.206 with SMTP id y14mr5937881ebh.34.1282760290089; Wed,
	25 Aug 2010 11:18:10 -0700 (PDT)
Sender: yanegomi@gmail.com
Received: by 10.14.47.197 with HTTP; Wed, 25 Aug 2010 11:18:09 -0700 (PDT)
In-Reply-To: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
Date: Wed, 25 Aug 2010 11:18:09 -0700
X-Google-Sender-Auth: ZWnYFfjdY8oQ91QZPpsVE8ZYdIA
Message-ID: <AANLkTimXJ7m8UbMn5+uZRbsk9cCwh3CVRFGC-SQY88q=@mail.gmail.com>
From: Garrett Cooper <gcooper@FreeBSD.org>
To: Marcel Moolenaar <xcllnt@mac.com>
Content-Type: text/plain; charset=ISO-8859-1
Cc: "freebsd-arch@FreeBSD.org Arch" <freebsd-arch@freebsd.org>
Subject: Re: RFC: root mount enhancement (round 2)

On Wed, Aug 25, 2010 at 8:58 AM, Marcel Moolenaar <xcllnt@mac.com> wrote:
> Summary of round 1:

...

> To re-iterate: the logic is recursive. After mounting some file system
> as root, the kernel will follow the directives in /.mount.conf (if the
> file exists) for remounting the root file system. At each iteration the
> kernel will remount devfs under /dev and remount the current root file
> system under /.mount within the new root file system.

    I like the proposal, but like Ed, I do have a concern with
infinite recursion. Should a breadcrumb be added to prevent infinite
recursion with the mounts, or is it game over, egg on your face, if
you create an infinite recursion situation?
Thanks,
-Garrett

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 19:06:58 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5DA061065695
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 19:06:58 +0000 (UTC)
	(envelope-from fb-arch@psconsult.nl)
Received: from mx1.psconsult.nl (unknown [IPv6:2001:7b8:30f:e0::5059:ee8a])
	by mx1.freebsd.org (Postfix) with ESMTP id 13AA48FC0C
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 19:06:57 +0000 (UTC)
Received: from mx1.psconsult.nl (psc11.adsl.iaf.nl [80.89.238.138])
	by mx1.psconsult.nl (8.14.4/8.14.4) with ESMTP id o7PJ6px7065853
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 21:06:56 +0200 (CEST)
	(envelope-from fb-arch@psconsult.nl)
Received: (from paul@localhost)
	by mx1.psconsult.nl (8.14.4/8.14.4/Submit) id o7PHmpmw064435
	for freebsd-arch@freebsd.org; Wed, 25 Aug 2010 19:48:51 +0200 (CEST)
	(envelope-from fb-arch@psconsult.nl)
X-Authentication-Warning: mx1.psconsult.nl: paul set sender to
	fb-arch@psconsult.nl using -f
Date: Wed, 25 Aug 2010 19:48:51 +0200
From: Paul Schenkeveld <fb-arch@psconsult.nl>
To: freebsd-arch@freebsd.org
Message-ID: <20100825174851.GA64117@psconsult.nl>
References: <AFBE2FCA-30A6-4E1D-A964-AC4DC4C843EB@juniper.net>
	<20100823214946.GF64651@hoeg.nl>
	<7318E60D-F00F-4519-A3E3-9CE8B752AE88@mac.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <7318E60D-F00F-4519-A3E3-9CE8B752AE88@mac.com>
User-Agent: Mutt/1.5.19 (2009-01-05)
Subject: Re: RFC: enhancing the root mount logic

On Mon, Aug 23, 2010 at 03:44:03PM -0700, Marcel Moolenaar wrote:
> 
> On Aug 23, 2010, at 2:49 PM, Ed Schouten wrote:
> 
> > * Marcel Moolenaar <marcelm@juniper.net> wrote:
> >> Is this something that people feel is worth fleshing out and
> >> prototyping?
> > 
> > Sounds awesome! This would make my writable boot cd a lot more elegant
> > than it is right now. Have you thought about things like possible
> > endless loops? Say, you mount a unionfs on the root of the fs itself.

So far I've not yet seen any endless loop in computing ... :-)

> > This may cause the original .mount.conf to be reinterpreted, right?
> 
> Right. I haven't thought about it. My off the cuff response is that we
> should disallow it if the amount of effort required to detect it is
> within reason. Alternatively, we could simply impose a global limit on
> the depth of the recursion. Either appears reasonable to me, but I may
> be overlooking something here...

What about a directive in .mount.conf, e.g. ".exit" to end the recursion?

> Thoughts?
> 
> -- 
> Marcel Moolenaar
> xcllnt@mac.com

Regards,

Paul Schenkeveld

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 19:09:08 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 26CD91065693
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 19:09:08 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout024.mac.com (asmtpout024.mac.com [17.148.16.99])
	by mx1.freebsd.org (Postfix) with ESMTP id 0C3DC8FC0A
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 19:09:07 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp024.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7Q00C3U2J4B930@asmtp024.mac.com>; Wed,
	25 Aug 2010 12:09:04 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008250146
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-25_09:2010-08-25, 2010-08-25,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <AANLkTimXJ7m8UbMn5+uZRbsk9cCwh3CVRFGC-SQY88q=@mail.gmail.com>
Date: Wed, 25 Aug 2010 12:09:04 -0700
Message-id: <9EA74D18-1CA4-4F3D-9CE5-0BD1B4D6B7BB@mac.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
	<AANLkTimXJ7m8UbMn5+uZRbsk9cCwh3CVRFGC-SQY88q=@mail.gmail.com>
To: Garrett Cooper <gcooper@freebsd.org>
X-Mailer: Apple Mail (2.1081)
Cc: "freebsd-arch@FreeBSD.org Arch" <freebsd-arch@freebsd.org>
Subject: Re: RFC: root mount enhancement (round 2)


On Aug 25, 2010, at 11:18 AM, Garrett Cooper wrote:

> On Wed, Aug 25, 2010 at 8:58 AM, Marcel Moolenaar <xcllnt@mac.com> wrote:
>> Summary of round 1:
> 
> ...
> 
>> To re-iterate: the logic is recursive. After mounting some file system
>> as root, the kernel will follow the directives in /.mount.conf (if the
>> file exists) for remounting the root file system. At each iteration the
>> kernel will remount devfs under /dev and remount the current root file
>> system under /.mount within the new root file system.
> 
>    I like the proposal, but like Ed, I do have a concern with
> infinite recursion. Should a breadcrumb be added to prevent infinite
> recursion with the mounts, or is it game over, egg on your face, if
> you create an infinite recursion situation?

Since we have a trail of file systems (by virtue of mounting the
previous root under the new root at /.mount), we should be able
to detect when we're about to mount from a device previously used
to mount from. Alternatively or on top of that, we can have a
global limit on the recursion depth. Unless this is something we
want to control through /.mount.conf, I don't think it's an item
that needs to be closed or nailed down before we can move ahead.

Put differently: I can implement both to start with...
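
To make "both" concrete, here is a rough user-space sketch in Python
(not kernel code) of the two guards: refusing a device that already
appears in the trail of roots, and capping the depth. MAX_DEPTH, the
function name and the dictionary that stands in for reading each
/.mount.conf are all assumptions made for the example.

```python
MAX_DEPTH = 8  # hypothetical global limit; no value has been agreed on


def follow_root_chain(first_dev, next_root, max_depth=MAX_DEPTH):
    """Return the trail of root devices, oldest first.

    next_root maps a mounted device to the device its /.mount.conf asks
    for; a device without an entry has no /.mount.conf and ends the chain.
    """
    trail = [first_dev]
    while True:
        nxt = next_root.get(trail[-1])
        if nxt is None:
            return trail
        if nxt in trail:
            # Guard 1: this device was already used earlier in the trail.
            raise RuntimeError("root mount loop: %s already used" % nxt)
        if len(trail) >= max_depth:
            # Guard 2: global cap on the number of successive roots.
            raise RuntimeError("root mount recursion deeper than %d" % max_depth)
        trail.append(nxt)
```

Whether a detected loop should panic, stop at the current root, or
follow the onfail action is left open here.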

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 20:49:36 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2E2B7106566B
	for <freebsd-arch@FreeBSD.org>; Wed, 25 Aug 2010 20:49:36 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id E4F2C8FC0A
	for <freebsd-arch@FreeBSD.org>; Wed, 25 Aug 2010 20:49:35 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7PKijUs013734;
	Wed, 25 Aug 2010 14:44:46 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Wed, 25 Aug 2010 14:44:47 -0600 (MDT)
Message-Id: <20100825.144447.195066307629816163.imp@bsdimp.com>
To: phk@phk.freebsd.dk
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <54210.1282752073@critter.freebsd.dk>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
	<54210.1282752073@critter.freebsd.dk>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: xcllnt@mac.com, freebsd-arch@FreeBSD.org
Subject: Re: RFC: root mount enhancement (round 2) 

In message: <54210.1282752073@critter.freebsd.dk>
            "Poul-Henning Kamp" <phk@phk.freebsd.dk> writes:
: In message <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>, Marcel Moolenaar writes:
: 
: >don't want to enhance: A USB disk cannot always be used as a root
: >file system by virtue of the USB stack releasing the root mount
: >lock after creating the umass device, but before CAM has created
: >the corresponding da device. 
: 
: This is a bug which is entirely unrelated to how we find the
: root filesystem:  It should simply be fixed by CAM grabbing a
: root mount lock when activated from USB and releasing it
: only when all its stuff is done.

We already do this...  But it is insufficient since usb discovery is
done asynchronously...

Scott has a similar fix in the pipeline, but I don't know the state of
it.

Warner

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 21:11:18 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9593910656A8
	for <freebsd-arch@FreeBSD.org>; Wed, 25 Aug 2010 21:11:18 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 4068C8FC13
	for <freebsd-arch@FreeBSD.org>; Wed, 25 Aug 2010 21:11:18 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7PL2eWY013887;
	Wed, 25 Aug 2010 15:02:40 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Wed, 25 Aug 2010 15:02:42 -0600 (MDT)
Message-Id: <20100825.150242.450985660301753093.imp@bsdimp.com>
To: xcllnt@mac.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@FreeBSD.org
Subject: Re: RFC: root mount enhancement (round 2)

In message: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
            Marcel Moolenaar <xcllnt@mac.com> writes:
: 2.  Negative experiences with the ramdisk root file system as a
:     general approach for mounting a root file system have been
:     expressed.

To be fair, there were both positive and negative experiences.  The
negative experiences came from the server folks, who hated it when,
after an upgrade, the ram disk compiled into the kernel was out of
date or incomplete.  The positive experiences came from the embedded
folks, who used a RAM disk handed to the kernel by the boot loader,
which gives quite a bit more flexibility.  This ram disk comes from a
dedicated flash partition and is well supported by the different boot
loaders that are common in the embedded space (mostly because Linux
requires it).  There's even support for compression of the kernel and
ram disk in the boot loader: it expands the kernel and the ram disk
and then tells the kernel where to find the ram disk.

: Let me mention a problem with the currently implemented root mount
: logic as a reminder that something needs to be fixed, even if we
: don't want to enhance: A USB disk cannot always be used as a root
: file system by virtue of the USB stack releasing the root mount
: lock after creating the umass device, but before CAM has created
: the corresponding da device. The kernel will try mounting from
: /dev/da0 before the device exists, fails and then drops into the
: root mount prompt. Often the story ends here -- with failure.

Actually, the problem isn't the locking at all.  The problem is that
the umass SIMs arrive 'late' in the game.  By the time they arrive,
CAM has already released the root lock.  But as phk points out, this
is a bug in the usb/cam interaction; it should be fixed there and is
completely irrelevant to your root mounting system.

: Round 2:
: 
: The logic remains mostly the same as described in round 1, but
: gains a directive and limited variable substitution. These are
: added to decouple the mount directive (${FS}:${DEV}) from the
: creation of the memory disk so that GEOM can do its thing. As
: such, the creation of a memory disk is now a separate directive:
: 
: 	.md <image-file> <md-options>
:
: To mount the memory disk (UFS in the example), use:
: 
: 	ufs:/dev/md# <mount-options>
: 
: Here md# refers to the md unit created by the last .md directive.
: Since the logic is for mounting the root file system only, a .md
: directive implicitly detaches and releases the previously created
: md device before creating a new one. In other words: the
: enhancement is not for creating a bunch of md devices.
:
: Should this be relaxed so that any number of md devices can be
: created before we try a root mount?

I guess I'm having trouble understanding why you'd need this given
that ram disk information is already passed from the boot loader
(/boot/loader or in the board's init code (although the latter I don't
think is done by any in-tree code)) to the kernel...

: When the md device appears, GEOM gets to taste the provider
: and all kinds of interesting things can happen. By decoupling
: the creating of the md device and the mount directive, it's
: trivial to handle arbitrarily complex GEOM graphs. For example:
: 
: 	ufs:/dev/md#s1a
: 	ufs:/dev/md#.uzip
: 	...

Shouldn't the MD device already be created by virtue of the MD_ROOT
junk in the kernel config file?  Why do you need a special directive
to create it?

: For completeness, the syntax of the configuration file (in
: some weird hybrid regex-based specification that is sloppy
: about spaces) to make sure things get fleshed out enough
: for review:
: 
: 	<.mount.conf>	: (^<config-line>$)*
: 	<config-line>	: <comment>
: 			| <empty>
: 			| <directive>
: 	<comment>	: '#'.*
: 	<empty>		: 
: 	<directive>	: <mount>
: 			| <md>
: 			| <ask>
: 			| <wait>
: 			| <onfail>
: 			| <init>
: 	<mount>		: <fs>':'<path> <mount-options>
: 	<mount-options>	: <empty>
: 			| <mount-opt-list>
: 	<mount-opt-list>: <mount-option>
: 			| <mount-option>','<mount-opt-list>
: 	<mount-option>	: <var>
: 			| <var>'='<value>
: 	<md>		: ".md" <file> <md-options>
: 	<md-options>	: <empty>
: 			| <md-opt-list>
: 	<md-opt-list>	: <md-option>
: 			| <md-option>','<md-opt-list>
: 	<md-option>	: "nocompress"			# compress is default
: 			| "nocluster"			# cluster is default
: 			| "async"
: 			| "readonly"

Does read-write compressed work?  Also, is compression a property of
the md device, or of the GEOM that tastes it to see that it is
compressed?  What does cluster do anyway?  I see that as an option for
mdconfig, but there's no explanation of it there or in the md man page.

How do you differentiate between these two roots:

	mdconfig -a -t file -f /gerbil.ram
and
	mdconfig -a -t swap -s 4m
	dd if=/gerbil.rom of=/dev/md0 bs=1m

with this scheme?  I'm guessing only the former makes sense, although
for upgrades, maybe you want the latter so you can replace /gerbil.rom
at any time.  But in that case, you're better off going through
/boot/loader for this stuff, which leads me to my next question: Would
any md device passed by the boot loader (or compiled into the kernel)
effectively be the second one, so that you'd not need any .md
directives at all?

: 	<ask>		: ".ask"
: 	<wait>		: "wait" <seconds>
: 	<onfail>	: "onfail" <onfail-action>
: 	<onfail-action>	: "panic"			# default
: 			| "reboot"
: 			| "retry"
: 			| "continue"
: 	<init>		: ".init" <init-list>
: 	<init-list>	: <program>
: 			| <program>':'<init-list>
: 
: 
: To re-iterate: the logic is recursive. After mounting some file system
: as root, the kernel will follow the directives in /.mount.conf (if the
: file exists) for remounting the root file system. At each iteration the
: kernel will remount devfs under /dev and remount the current root file
: system under /.mount within the new root file system.
: 
: Thoughts?

How is init handled at each stage?  Forked after the last one, I assume?

Warner

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 21:29:51 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0BE84106564A
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 21:29:51 +0000 (UTC)
	(envelope-from andy@fud.org.nz)
Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id D202C8FC1E
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 21:29:50 +0000 (UTC)
Received: by iwn36 with SMTP id 36so1022097iwn.13
	for <freebsd-arch@freebsd.org>; Wed, 25 Aug 2010 14:29:50 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.231.182.204 with SMTP id cd12mr10818749ibb.101.1282769999275; 
	Wed, 25 Aug 2010 13:59:59 -0700 (PDT)
Sender: andy@fud.org.nz
Received: by 10.231.187.6 with HTTP; Wed, 25 Aug 2010 13:59:59 -0700 (PDT)
In-Reply-To: <20100825.144447.195066307629816163.imp@bsdimp.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
	<54210.1282752073@critter.freebsd.dk>
	<20100825.144447.195066307629816163.imp@bsdimp.com>
Date: Thu, 26 Aug 2010 08:59:59 +1200
X-Google-Sender-Auth: dJAeRM0U-F2nutUQzTEk2aqwGqs
Message-ID: <AANLkTi=QLmZSUNQ5AWJ2TyLNOxJfqD_2rW3RWCbfEx7R@mail.gmail.com>
From: Andrew Thompson <thompsa@FreeBSD.org>
To: "M. Warner Losh" <imp@bsdimp.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: phk@phk.freebsd.dk, xcllnt@mac.com, freebsd-arch@freebsd.org
Subject: Re: RFC: root mount enhancement (round 2)

On 26 August 2010 08:44, M. Warner Losh <imp@bsdimp.com> wrote:
> In message: <54210.1282752073@critter.freebsd.dk>
>             "Poul-Henning Kamp" <phk@phk.freebsd.dk> writes:
> : In message <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>, Marcel Moolenaar writes:
> :
> : >don't want to enhance: A USB disk cannot always be used as a root
> : >file system by virtue of the USB stack releasing the root mount
> : >lock after creating the umass device, but before CAM has created
> : >the corresponding da device.
> :
> : This is a bug which is entirely unrelated to how we find the
> : root filesystem:  It should simply be fixed by CAM grabbing a
> : root mount lock when activated from USB and releasing it
> : only when all its stuff is done.
>
> We already do this...  But it is insufficient since usb discovery is
> done asynchronously...

It's more that the usb disk appears and the root mount lock is dropped
without GEOM tasting being taken into account. This was fixed with
r190677, but then I was asked to back it out (r190878).

> Scott has a similar fix in the pipeline, but I don't know the state of
> it.

It would be great to get this finished. I believe the solution Scott
wanted was to properly use intr_config_hooks to kick off usb
enumeration.


Andrew

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 21:51:59 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 180A51065694
	for <freebsd-arch@FreeBSD.org>; Wed, 25 Aug 2010 21:51:59 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout025.mac.com (asmtpout025.mac.com [17.148.16.100])
	by mx1.freebsd.org (Postfix) with ESMTP id F216E8FC1B
	for <freebsd-arch@FreeBSD.org>; Wed, 25 Aug 2010 21:51:58 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp025.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7Q00813A2LKX50@asmtp025.mac.com> for
	freebsd-arch@FreeBSD.org; Wed, 25 Aug 2010 14:51:58 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008250181
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-25_10:2010-08-25, 2010-08-25,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100825.150242.450985660301753093.imp@bsdimp.com>
Date: Wed, 25 Aug 2010 14:51:57 -0700
Message-id: <AE9A0FB9-E338-447A-A788-C53E94600116@mac.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
	<20100825.150242.450985660301753093.imp@bsdimp.com>
To: "M. Warner Losh" <imp@bsdimp.com>
X-Mailer: Apple Mail (2.1081)
Cc: freebsd-arch@FreeBSD.org
Subject: Re: RFC: root mount enhancement (round 2)


On Aug 25, 2010, at 2:02 PM, M. Warner Losh wrote:
> : Let me mention a problem with the currently implemented root mount
> : logic as a reminder that something needs to be fixed, even if we
> : don't want to enhance: A USB disk cannot always be used as a root
> : file system by virtue of the USB stack releasing the root mount
> : lock after creating the umass device, but before CAM has created
> : the corresponding da device. The kernel will try mounting from
> : /dev/da0 before the device exists, fails and then drops into the
> : root mount prompt. Often the story ends here -- with failure.
> 
> Actually, the problem isn't the locking at all.  The problem is that
> the umass SIMs arrive 'late' in the game.  By the time they arrive,
> CAM has already released the root lock.  But as phk points out, this
> is a bug in the usb/cam interaction; it should be fixed there and is
> completely irrelevant to your root mounting system.

I perceive the problem differently, because I see no value in waiting
for *all* devices to appear when the root device is already there.
That just slows down the boot. 

I prefer mounting the root file system as soon as the device appears
and enhancing the fstab mounting to deal with the device not being
there yet.

Consequently: the bug is with root_mount_hold() and root_mount_rel()
as a means to do the right thing...
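
A toy timeline may help show the difference I mean. This is a Python
model with made-up names and made-up timestamps, not a description of
what the kernel does today: one function tries the mount once when the
last hold is released, the other mounts when the root device itself
shows up.

```python
# Toy timeline model; every name and timestamp here is hypothetical.

def hold_based_mount(hold_releases, arrivals, root_dev):
    """Try the root mount once, when the last hold is released.

    Returns (time_of_attempt, succeeded).
    """
    attempt = max(hold_releases.values())
    return attempt, arrivals[root_dev] <= attempt


def arrival_based_mount(arrivals, root_dev):
    """Mount root at the moment the root device itself appears."""
    return arrivals[root_dev]


holds = {"usb": 2.0, "cam": 1.5}      # seconds at which each hold is released
devices = {"ada0": 0.8, "da0": 3.1}   # seconds at which each device appears
```

With these numbers the hold-based attempt happens at 2.0 and misses
da0, which only appears at 3.1. The arrival-based variant mounts ada0
at 0.8 without waiting for USB, and mounts da0 at 3.1.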

> : Here md# refers to the md unit created by the last .md directive.
> : Since the logic is for mounting the root file system only, a .md
> : directive implicitly detaches and releases the previously created
> : md device before creating a new one. In other words: the
> : enhancement is not for creating a bunch of md devices.
> :
> : Should this be relaxed so that any number of md devices can be
> : created before we try a root mount?
> 
> I guess I'm having trouble understanding why you'd need this given
> that ram disk information is already passed from the boot loader
> (/boot/loader or in the board's init code (although the latter I don't
> think is done by any in-tree code)) to the kernel...

You're fixating on the preloaded or compiled-in ramdisk. The
.md directive is there for vnode-backed images -- the root
file system image is stored on a file system and memory is
only used for buffering and caching.
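
As a sketch of the md# substitution quoted above (Python, user space,
purely illustrative): only one md unit is tracked at a time, a new .md
directive implicitly replaces the previous one, and "md#" in a mount
path expands to the current unit. The class and method names are
invented, and the unit number is passed in by the caller as a stand-in
for whatever the md driver would assign.

```python
class MdState:
    """Tracks the single md unit the proposal allows at any one time."""

    def __init__(self):
        self.unit = None
        self.image = None

    def md_directive(self, image, unit):
        """Record a new vnode-backed md device.

        Returns the unit that is implicitly detached, or None if there
        was no previous md device.
        """
        old = self.unit
        self.unit, self.image = unit, image
        return old

    def expand(self, path):
        """Replace 'md#' in a mount path with the current md unit."""
        if "md#" not in path:
            return path
        if self.unit is None:
            raise ValueError("md# used before any .md directive")
        return path.replace("md#", "md%d" % self.unit)
```

So after ".md" has produced unit 0, a mount path of /dev/md#s1a would
be tried as /dev/md0s1a, and paths without md# pass through untouched.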

> Does read-write compressed work?  Also, is compression a property of
> the md device, or of the GEOM that tastes it to see that it is
> compressed?  What does cluster do anyway?  I see that as an option for
> mdconfig, but there's no explanation of it there or in the md man page.

The options are as useful as the md implementation is. The options
are listed because they appeared in mdconfig. Semantics is not to
be argued when syntax is discussed :-)

> How do you differentiate between these two roots:
> 
> 	mdconfig -a -t file -f /gerbil.ram
> and
> 	mdconfig -a -t swap -s 4m
> 	dd if=/gerbil.rom of=/dev/md0 bs=1m

The first is supported, the second isn't. The .md directive only
supports vnode-backed md devices. There's no point trying to mount
a malloc- or swap-backed md device because they instantiate empty
and are useless for root file systems, unless you construct them
first (using dd is a way to construct them). Supporting the
construction of a root file system is where things get complicated
and where I personally don't want to go.

>  But in that case, you're better off going through
> /boot/loader for this stuff, which leads me to my next question: Would
> any md device passed by the boot loader (or compiled into the kernel)
> would effectively be the second one and you'd not need any .md
> directives at all?

You can start off with a preloaded or compiled-in ramdisk, and then
recursively mount root, including from vnode-backed md devices, so
the .md directive is not rendered useless by preloading or compiling
in. You can even end the root mount recursion with the preloaded
ramdisk last -- this gives you premounted file systems under /.mount
without having to run /etc/rc (if you want to)...

> : 
> : To re-iterate: the logic is recursive. After mounting some file system
> : as root, the kernel will follow the directives in /.mount.conf (if the
> : file exists) for remounting the root file system. At each iteration the
> : kernel will remount devfs under /dev and remount the current root file
> : system under /.mount within the new root file system.
> : 
> : Thoughts?
> 
> How is init handled at each stage?  forked after the last one, I assume?

No, init is only spawned after the root mount recursion ends. The .init
directive is there to override defaults. This is envisioned to be useful
for rescue images where you want to spawn /rescue/init or installation
images where you may want to spawn sysinstall. It eliminates having to
hardcode the possibilities in the kernel.

In a sense it gives you more freedom in how you want to call your initial
process without the pitfalls when the root mount recursion ends early due
to a problem.

As a concrete example, consider having a single file system on a writable
medium (say /dev/da0) and software images are ISO images stored in it.
You can install some recovery procedure on /dev/da0 that gets run when
none of the ISO images can be mounted. The ISO images have /sbin/init
as init as usual, but you can select to run /sbin/recovery from /dev/da0.
This allows for a single init executable that performs the right functions
based on the program name for example...

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 22:39:38 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CFFDD1065674
	for <freebsd-arch@FreeBSD.ORG>; Wed, 25 Aug 2010 22:39:38 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 7E6958FC13
	for <freebsd-arch@FreeBSD.ORG>; Wed, 25 Aug 2010 22:39:38 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o7PMaaTl014728;
	Wed, 25 Aug 2010 16:36:36 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Wed, 25 Aug 2010 16:36:37 -0600 (MDT)
Message-Id: <20100825.163637.1151864885495248514.imp@bsdimp.com>
To: xcllnt@mac.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <AE9A0FB9-E338-447A-A788-C53E94600116@mac.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
	<20100825.150242.450985660301753093.imp@bsdimp.com>
	<AE9A0FB9-E338-447A-A788-C53E94600116@mac.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: RFC: root mount enhancement (round 2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Aug 2010 22:39:39 -0000

Hey Marcel,

The more I talk about this, the more that I think it might be useful
in some ways.

In message: <AE9A0FB9-E338-447A-A788-C53E94600116@mac.com>
            Marcel Moolenaar <xcllnt@mac.com> writes:
: 
: On Aug 25, 2010, at 2:02 PM, M. Warner Losh wrote:
: > : Let me mention a problem with the currently implemented root mount
: > : logic as a reminder that something needs to be fixed, even if we
: > : don't want to enhance: A USB disk cannot always be used as a root
: > : file system by virtue of the USB stack releasing the root mount
: > : lock after creating the umass device, but before CAM has created
: > : the corresponding da device. The kernel will try mounting from
: > : /dev/da0 before the device exists, fails and then drops into the
: > : root mount prompt. Often the story ends here -- with failure.
: > 
: > Actually, the problem isn't the locking at all.  The problem is that
: > the umass SIMs arrive 'late' in the game.  by the time they arrive,
: > CAM has already released the root lock.  But as phk points out, this
: > is a bug in the usb/cam interaction and should be fixed there and
: > completely irrelevant for your root mounting system.
: 
: I perceive the problem differently, because I see no value in waiting
: for *all* devices to appear when the root device is already there.
: That just slows down the boot. 
: 
: I prefer mounting the root file system as soon as the device appears
: and enhance the fstab mounting to deal with the device not being
: there yet.
: 
: Consequently: the bug is with root_mount_hold() and root_mount_rel()
: as a means to do the right thing...

We don't need to enhance fstab to cope with / not being there.  We
need / to be there, one way or another.  We may disagree on how best
to make it be there.  In the past I've swung the direction you talk
about too.  I've hacked mountroot() to wait up to a given amount of time
for new devices to appear that contain the root file system before giving
up.  That way, if you know you've got the root file system, you can go
right away, but otherwise you do something more intelligent than
'nothing' or 'prompt' when it isn't there.  This meshes well with the
.wait directive and your thinking too.  The part I didn't like about
this was the arbitrary upper time limit on it.  I'd like to wait until
*ALL* devices are done to fail and accept a '.wait 5' as an ugly
alternative to knowing that all boot devices are there.

I've also thought about having it drop to a prompt, but noticing that
new devices show up.  You could automatically proceed, or at the very
least be able to type the new device in once it is there.  This would
let the normal boot proceed, kick you to the prompt if, say, the usb
drive fell out and still let you plug it back in and have the system
pick back up again.

So, if your approach could have some hook for these types of
enhancements (or used to implement them), that would be a compelling
reason to support it.  Of course, it would still require knowing when
you are done with your initial scans of the device tree, which is at
present an unsolved problem....

: > : Here md# refers to the md unit created by the last .md directive.
: > : Since the logic is for mounting the root file system only, a .md
: > : directive implicitly detaches and releases the previously created
: > : md device before creating a new one. In other words: the
: > : enhancement is not for creating a bunch of md devices.
: > :
: > : Should this be relaxed so that any number of md device can be
: > : created before we try a root mount?
: > 
: > I guess I'm having trouble understanding why you'd need this given
: > that ram disk information is already passed from the boot loader
: > (/boot/loader or in the board's init code (although the latter I don't
: > think is done by any in-tree code)) to the kernel...
: 
: You're fixating on the preloaded or compiled-in ramdisk. The
: .md directive is there for vnode-backed images -- the root
: file system image is stored on a file system and memory is
: only used for buffering and caching.

That makes sense.  Not so much fixating on them, but noting that they
work really really well and are the basis for many livecd's and such.
They are the basis for all the picobsd derivatives as well.

: > read-write compressed works?  Also, is compression a property of the
: > md device, or the GEOM that tastes it to see that it is compressed...
: > What does cluster do anyway?  I see that as an option for mdconfig,
: > but there's no explanation of it there or in the md man page.
: 
: The options are as useful as the md implementation is. The options
: are listed because they appeared in mdconfig. Semantics is not to
: be argued when syntax is discussed :-)

fair enough...  The compression bit was confusing.

: > How do you differentiate between these two roots:
: > 
: > 	mdconfig -a -t file -f /gerbil.ram
: > and
: > 	mdconfig -a -t swap -s 4m
: > 	dd if=/gerbil.rom of=/dev/md0 bs=1m
: 
: The first is supported, the second isn't. The .md directive only
: supports vnode-backed md devices. There's no point trying to mount
: a malloc- or swap-backed md device because they instantiate empty
: and are useless for root file systems, unless you construct them
: first (using dd is a way to construct them). Supporting the
: construction of a root file system is where things get complicated
: and where I personally don't want to go.

Fair enough.  It was mostly just a question for clarification that
wound up rambling far too long.

: >  But in that case, you're better off going through
: > /boot/loader for this stuff, which leads me to my next question: Would
: > any md device passed by the boot loader (or compiled into the kernel)
: > would effectively be the second one and you'd not need any .md
: > directives at all?
: 
: You can start off with a preloaded or compiled-in ramdisk, and then
: recursively mount root, including from vnode-backed md devices, so
: the .md directive is not rendered useless by preloading or compiling
: in. You can even end the root mount recursion with the preloaded
: ramdisk last -- this gives you premounted file systems under /.mount
: without having to run /etc/rc (if you want to)...

Is the .md directive globally destructive, or just destructive to the
local level of recursion?  If it is just the local level, how do you
specify the unit number?  Maybe a better approach would be to
encourage people to mount root based on how file systems are labelled,
rather than what unit they happen to be taking up...  Would that help
any here?

: > : To re-iterate: the logic is recursive. After mounting some file system
: > : as root, the kernel will follow the directives in /.mount.conf (if the
: > : file exists) for remounting the root file system. At each iteration the
: > : kernel will remount devfs under /dev and remount the current root file
: > : system under /.mount within the new root file system.
: > : 
: > : Thoughts?
: > 
: > How is init handled at each stage?  forked after the last one, I assume?
: 
: No, init is only spawned after the root mount recursion ends. The .init
: directive is there to override defaults. This is envisioned to be useful
: for rescue images where you want to spawn /rescue/init or installation
: images where you may want to spawn sysinstall. It eliminates having to
: hardcode the possibilities in the kernel.

Right now through the boot loader you can set init_path, why would you
need to add the ability to spawn a different one to the scripts?

: In a sense it gives you more freedom in how you want to call your initial
: process without the pitfalls when the root mount recursion ends early due
: to a problem.
: 
: As a concrete example, consider having a single file system on a writable
: medium (say /dev/da0) and software images are ISO images stored in it.
: You can install some recovery procedure on /dev/da0 that gets run when
: none of the ISO images can be mounted. The ISO images have /sbin/init
: as init as usual, but you can select to run /sbin/recovery from /dev/da0.
: This allows for a single init executable that performs the right functions
: based on the program name for example...

I think this is a bit convoluted an example.  The ISO images would
fail to mount only if they were all damaged in a way that would make
them unmountable, true?  If the backup ISO is AFU, then what's to say
that /sbin/recovery isn't also AFU?  When would you need this?

Without a 'branch' construct of some kind, there's no way to match
machine/platform names here.  Given the limited ability for us to
run kernels on multiple different platforms, I'm not sure how big a
deal this actually would be, but if you can do this, it would be a
nice plus.

I presume the default script would be something like (ignoring the
hard coding of device names):

ufs:/dev/da0s1a
.wait 5
.onfail ask

which would mount /dev/da0s1a when it became available, waiting up to
5 seconds and asking the user afterwards if that failed, right?

Warner

From owner-freebsd-arch@FreeBSD.ORG  Wed Aug 25 23:47:57 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AB98510656A7
	for <freebsd-arch@FreeBSD.ORG>; Wed, 25 Aug 2010 23:47:57 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout029.mac.com (asmtpout029.mac.com [17.148.16.104])
	by mx1.freebsd.org (Postfix) with ESMTP id 8FA6F8FC15
	for <freebsd-arch@FreeBSD.ORG>; Wed, 25 Aug 2010 23:47:57 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp029.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L7Q006K7FFW9Z80@asmtp029.mac.com> for
	freebsd-arch@FreeBSD.ORG; Wed, 25 Aug 2010 16:47:57 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1008250206
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-08-25_11:2010-08-25, 2010-08-25,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100825.163637.1151864885495248514.imp@bsdimp.com>
Date: Wed, 25 Aug 2010 16:47:55 -0700
Message-id: <9765A15A-6CCD-4341-A103-5501CC4FCFDD@mac.com>
References: <34EF2360-1B68-4E0C-8CCE-409CE141D0B8@mac.com>
	<20100825.150242.450985660301753093.imp@bsdimp.com>
	<AE9A0FB9-E338-447A-A788-C53E94600116@mac.com>
	<20100825.163637.1151864885495248514.imp@bsdimp.com>
To: "M. Warner Losh" <imp@bsdimp.com>
X-Mailer: Apple Mail (2.1081)
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: RFC: root mount enhancement (round 2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 25 Aug 2010 23:47:57 -0000


On Aug 25, 2010, at 3:36 PM, M. Warner Losh wrote:

> Hey Marcel,
> 
> The more I talk about this, the more that I think it might be useful
> in some ways.

Ok. I'll start prototyping something so that we can see if
it can live up to its promise or not.

> : I prefer mounting the root file system as soon as the device appears
> : and enhance the fstab mounting to deal with the device not being
> : there yet.
> : 
> : Consequently: the bug is with root_mount_hold() and root_mount_rel()
> : as a means to do the right thing...
> 
> We don't need to enhance fstab to cope with / not being there.

I'm sorry. I worded it too sloppily. The enhancement relates to all
other file systems that we want to mount during boot, but which may
not have been discovered yet. If we proceed with the boot as soon
as we have the desired root file system, we create a serialization
problem downstream that we don't have when we wait for all devices
first.

> I've hacked mountroot() to wait up to a given amount of time
> for new devices to appear that contain the root file system before giving
> up.  That way, if you know you've got the root file system, you can go
> right away, but otherwise you do something more intelligent than
> 'nothing' or 'prompt' when it isn't there.  This meshes well with the
> .wait directive and your thinking too.  The part I didn't like about
> this was the arbitrary upper time limit on it.  I'd like to wait until
> *ALL* devices are done to fail and accept a '.wait 5' as an ugly
> alternative to knowing that all boot devices are there.

I may be able to implement this by changing the .wait directive to
take a flag:

	.wait <seconds> <for-what>

The <for-what> can be "next", meaning that we wait up to X seconds
before we give up on the next mount directive. The <for-what> could
also be "all", which slightly changes the meaning to the maximum
number of seconds to wait for "new device" events before trying
the next (or subsequent) mount directives. Put differently: the
number of seconds to idle and wait *after* the last device
arrival before trying the first or next mount. Every time a new
device is announced, you restart the clock.

Would this do what you described?
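
As a rough illustration of the "all" variant described above (a
user-space model only, not kernel code and not taken from any
prototype), the restart-the-clock rule can be written as a small
function. The name wait_all_deadline() and its arguments are
hypothetical: it takes device arrival times in ascending order, in
seconds since the wait began, and returns the time at which the next
mount directive would be tried.

```c
#include <stddef.h>

/*
 * Hypothetical model of ".wait <seconds> all": every device arrival
 * that happens before the current deadline restarts the idle clock.
 * arrivals[] must be sorted in ascending order.
 */
static unsigned
wait_all_deadline(const unsigned *arrivals, size_t n, unsigned idle)
{
	unsigned deadline = idle;	/* clock starts when the wait begins */
	size_t i;

	for (i = 0; i < n; i++) {
		if (arrivals[i] >= deadline)
			break;		/* idle window already expired */
		deadline = arrivals[i] + idle;	/* new device: restart clock */
	}
	return (deadline);
}
```

With an idle window of 5 seconds and devices arriving at 1, 3 and 4
seconds, the mount is tried at 9 seconds. A device arriving only at
10 seconds, after an earlier one at 2 seconds, is too late: the mount
is tried at 7 seconds.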

> I've also thought about having it drop to a prompt, but noticing that
> new devices show up.  You could automatically proceed, or at the very
> least be able to type the new device in once it is there.  This would
> let the normal boot proceed, kick you to the prompt if, say, the usb
> drive fell out and still let you plug it back in and have the system
> pick back up again.

Interesting. I like this. Let me see if this is doable without
inviting complexity.


> So, if your approach could have some hook for these types of
> enhancements (or used to implement them), that would be a compelling
> reason to support it.  Of course, it would still require knowing when
> you are done with your initial scans of the device tree, which is at
> present an unsolved problem....

Technically speaking we're never done. If I plug in a disk any
time after booting up, then we didn't wait long enough before
mounting root :-)

Seriously: hot plug implies that you can never truly wait for
all devices, because they can come and go during the entire
up time of the machine. Proceeding with the boot based on some
reasonable heuristics (i.e. nothing new was found in the last
X seconds, so it's unlikely we'll get a new disk) is probably
the best we can do....

> : >  But in that case, you're better off going through
> : > /boot/loader for this stuff, which leads me to my next question: Would
> : > any md device passed by the boot loader (or compiled into the kernel)
> : > would effectively be the second one and you'd not need any .md
> : > directives at all?
> : 
> : You can start off with a preloaded or compiled-in ramdisk, and then
> : recursively mount root, including from vnode-backed md devices, so
> : the .md directive is not rendered useless by preloading or compiling
> : in. You can even end the root mount recursion with the preloaded
> : ramdisk last -- this gives you premounted file systems under /.mount
> : without having to run /etc/rc (if you want to)...
> 
> Is the .md directive globally destructive, or just destructive to the
> local level of recursion?  If it is just the local level, how do you
> specify the unit number?  Maybe a better approach would be to
> encourage people to mount root based on how file systems are labelled,
> rather than what unit they happen to be taking up...  Would that help
> any here?

The .md directive (as envisioned so far) uses dynamic unit numbers
and is only locally destructive. This allows nested mounting of
root file systems that are all vnode-backed (don't ask me for a
real-life use case now :-)
The proposal uses '#' as the placeholder for the unit number. To be
precise: the '#' is literal and appears in the configuration file
to denote the md unit number created by the last .md directive. As
such, you don't actually need to know it.
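
To make the placeholder idea concrete, here is a minimal user-space
sketch of the substitution, assuming the parser simply replaces the
first literal '#' in a device specification with the unit number of
the most recently created md device. The helper expand_md_unit() is
hypothetical and only illustrates the idea; it is not part of the
proposal or of any existing kernel interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy spec into dst, replacing the first literal
 * '#' with the given md unit number.  Returns 0 on success, or -1 if
 * the result does not fit in dst.
 */
static int
expand_md_unit(char *dst, size_t dstlen, const char *spec, int unit)
{
	const char *hash = strchr(spec, '#');
	int n;

	if (hash == NULL)
		n = snprintf(dst, dstlen, "%s", spec);
	else
		n = snprintf(dst, dstlen, "%.*s%d%s",
		    (int)(hash - spec), spec, unit, hash + 1);
	return ((n < 0 || (size_t)n >= dstlen) ? -1 : 0);
}
```

For example, "ufs:/dev/md#" with unit 3 expands to "ufs:/dev/md3",
so the configuration file never has to name the unit explicitly.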

Too klugy?
Too limited?

> : > 
> : > How is init handled at each stage?  forked after the last one, I assume?
> : 
> : No, init is only spawned after the root mount recursion ends. The .init
> : directive is there to override defaults. This is envisioned to be useful
> : for rescue images where you want to spawn /rescue/init or installation
> : images where you may want to spawn sysinstall. It eliminates having to
> : hardcode the possibilities in the kernel.
> 
> Right now through the boot loader you can set init_path, why would you
> need to add the ability to spawn a different one to the scripts?

No particular reason. I just tossed it in. If it's over the top, then
I'll remove it. It was just one of those ideas...

> : In a sense it gives you more freedom in how you want to call your initial
> : process without the pitfalls when the root mount recursion ends early due
> : to a problem.
> : 
> : As a concrete example, consider having a single file system on a writable
> : medium (say /dev/da0) and software images are ISO images stored in it.
> : You can install some recovery procedure on /dev/da0 that gets run when
> : none of the ISO images can be mounted. The ISO images have /sbin/init
> : as init as usual, but you can select to run /sbin/recovery from /dev/da0.
> : This allows for a single init executable that performs the right functions
> : based on the program name for example...
> 
> I think this is a bit convoluted an example.  The ISO images would
> fail to mount only if they were all damaged in a way that would make
> them unmountable, true?  If the backup ISO is AFU, then what's to say
> that /sbin/recovery isn't also AFU?  When would you need this?

The images could also have been, euh .. misplaced :-)
What to do when the ISO images aren't there? A panic may not be the
most user friendly response...

> I presume the default script would be something like (ignoring the
> hard coding of device names):
> 
> ufs:/dev/da0s1a
> .wait 5
> .onfail ask

Roughly. devfs will synthesize the .mount.conf contents based on tunables
and kernel options. The same options we now have hardcoded. Without
recursion this means that the root mount will not be any different from
what it is now.

> which would mount /dev/da0s1a when it became available, waiting up to
> 5 seconds and asking the user afterwards if that failed, right?

Yes, but I like the feedback I got from Matthew, who said that
the .wait applies to the mount directive following it. So the
.wait will precede the mount.

Also, the proposal has an .ask directive, rather than ask on
failure. I see asking as a mount directive of which the FS
and device are provided by the user.

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Thu Aug 26 00:05:36 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DB6C6106566C;
	Thu, 26 Aug 2010 00:05:36 +0000 (UTC)
	(envelope-from max@love2party.net)
Received: from moutng.kundenserver.de (moutng.kundenserver.de
	[212.227.126.186])
	by mx1.freebsd.org (Postfix) with ESMTP id 2F38D8FC13;
	Thu, 26 Aug 2010 00:05:35 +0000 (UTC)
Received: from f8x64.laiers.local (dslb-088-066-038-053.pools.arcor-ip.net
	[88.66.38.53])
	by mrelayeu.kundenserver.de (node=mrbap1) with ESMTP (Nemesis)
	id 0LpPi1-1PHiNk1bfR-00em6r; Thu, 26 Aug 2010 02:05:34 +0200
From: Max Laier <max@love2party.net>
Organization: FreeBSD
To: freebsd-arch@freebsd.org
Date: Thu, 26 Aug 2010 02:05:32 +0200
User-Agent: KMail/1.13.5 (FreeBSD/8.1-RELEASE; KDE/4.4.5; amd64; ; )
References: <201008160515.21412.max@love2party.net>
	<201008240045.15998.max@laiers.net> <4C73D0FA.5030102@freebsd.org>
In-Reply-To: <4C73D0FA.5030102@freebsd.org>
MIME-Version: 1.0
Content-Type: Multipart/Mixed;
  boundary="Boundary-00=_M/adMKo7R9HbyBw"
Message-Id: <201008260205.32416.max@love2party.net>
X-Provags-ID: V02:K0:HYfpzmDMnXHIqZsA6wYzridVct8x6lnWVbObJ2/VUMA
	3/x0V81DnLlqHp5DMj3HeoMNgmIDxxCBAUF0QkGNfhgmmEnt9d
	3j5MAId6qgYf1TntHUd6rCYFgt9wuhXbFuOUgkCFM/3nXFO7f/
	WZafki/mqk3OQeEwn++/NKEvePM4Jv1z6uIt+YIa3B55iDJDbr
	1G6B6xa8xfvDJk2zIpdrg==
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: Stephan Uphoff <ups@freebsd.org>
Subject: Re: rmlock(9) two additions
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 26 Aug 2010 00:05:36 -0000

--Boundary-00=_M/adMKo7R9HbyBw
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

On Tuesday 24 August 2010 16:02:34 Stephan Uphoff wrote:
> Yes - this is a problem that needs to be addressed.
> Fortunately most platforms won't need to be as strict and I suggest per
> platform parameters.
> An alternative that was in my original design was to use a bitmap for
> the rm_noreadtoken.
> Each CPU would then have an associated bit that will only be cleared by
> that cpu.
> This would also allow targeted IPIs to only the token holders.

Okay ... attached is a version with the bitmask idea.  It does not use 
per-platform parameters yet.  But I believe it fixes the issue in general.  This 
comes at the cost of an additional memory reference to pc_cpumask on the exit 
conditional from the fast-path.  I don't think this is too much of a problem 
currently, as this will be in the cache already ... but I might be wrong.
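
For readers who do not want to dig through the diff, the bitmask idea
can be modeled in user space roughly as follows. This is a
single-threaded sketch with hypothetical names (rm_model,
reader_fast_path_ok() and friends) and a 32-bit mask; the real code
works on pc_cpumask and rm_writecpus under the appropriate locks and
critical sections, which this model leaves out entirely.

```c
#include <stdint.h>

typedef uint32_t cpumask_t;	/* one bit per CPU; assumes <= 32 CPUs */

struct rm_model {
	cpumask_t writecpus;	/* bit set: that CPU holds no read token */
};

/* Reader fast path is usable only if this CPU's bit is clear. */
static int
reader_fast_path_ok(const struct rm_model *rm, int cpu)
{
	return ((rm->writecpus & ((cpumask_t)1 << cpu)) == 0);
}

/* After the slow path, a CPU clears its own bit and nobody else's. */
static void
reader_grant_token(struct rm_model *rm, int cpu)
{
	rm->writecpus &= ~((cpumask_t)1 << cpu);
}

/*
 * Writer: compute the CPUs that currently hold a token (the only ones
 * that would need an IPI), then revoke all tokens.
 */
static cpumask_t
writer_revoke_tokens(struct rm_model *rm, cpumask_t all_cpus)
{
	cpumask_t targets = all_cpus & ~rm->writecpus;

	rm->writecpus = all_cpus;
	return (targets);
}
```

The point of the model is the last function: with a single
rm_noreadtoken flag the writer has to interrupt every CPU, whereas
with one bit per CPU it can limit itself to the token holders.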

For easier review, I've also attached my current version of kern_rmlock.c.

BTW, this also fixes the trylock race by trylocking the base lock.  Again, I 
don't think the race against other readers is a concern in the trylock API.

I'd like to move forward with this in some way ... any objections and/or input 
on what to do/fix before committing this?  Any ideas/pointers on how to best 
implement the per platform selector?

Thanks,
  Max

--Boundary-00=_M/adMKo7R9HbyBw
Content-Type: text/x-patch;
  charset="ISO-8859-1";
  name="rmlock.full.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="rmlock.full.diff"

diff --git a/share/man/man9/Makefile b/share/man/man9/Makefile
index b438b90..e6d8881 100644
--- a/share/man/man9/Makefile
+++ b/share/man/man9/Makefile
@@ -986,6 +986,7 @@ MLINKS+=rman.9 rman_activate_resource.9 \
 MLINKS+=rmlock.9 rm_destroy.9 \
 	rmlock.9 rm_init.9 \
 	rmlock.9 rm_rlock.9 \
+	rmlock.9 rm_try_rlock.9 \
 	rmlock.9 rm_runlock.9 \
 	rmlock.9 RM_SYSINIT.9 \
 	rmlock.9 rm_wlock.9 \
diff --git a/share/man/man9/locking.9 b/share/man/man9/locking.9
index 005f476..3728319 100644
--- a/share/man/man9/locking.9
+++ b/share/man/man9/locking.9
@@ -301,7 +301,7 @@ one of the synchronization primitives discussed:
 .It mutex     Ta \&ok Ta \&ok-1 Ta \&no Ta \&ok Ta \&ok Ta \&no-3
 .It sx        Ta \&ok Ta \&ok Ta \&ok-2 Ta \&ok Ta \&ok Ta \&ok-4
 .It rwlock    Ta \&ok Ta \&ok Ta \&no Ta \&ok-2 Ta \&ok Ta \&no-3
-.It rmlock    Ta \&ok Ta \&ok Ta \&no Ta \&ok Ta \&ok-2 Ta \&no
+.It rmlock    Ta \&ok Ta \&ok Ta \&ok-5 Ta \&ok Ta \&ok-2 Ta \&ok-5
 .El
 .Pp
 .Em *1
@@ -326,6 +326,13 @@ Though one can sleep holding an sx lock, one can also use
 .Fn sx_sleep
 which will atomically release this primitive when going to sleep and
 reacquire it on wakeup.
+.Pp
+.Em *5
+.Em Read-mostly
+locks can be initialized to support sleeping while holding a write lock.
+See
+.Xr rmlock 9
+for details.
 .Ss Context mode table
 The next table shows what can be used in different contexts.
 At this time this is a rather easy to remember table.
diff --git a/share/man/man9/rmlock.9 b/share/man/man9/rmlock.9
index e99661d..28ac0a5 100644
--- a/share/man/man9/rmlock.9
+++ b/share/man/man9/rmlock.9
@@ -35,6 +35,7 @@
 .Nm rm_init_flags ,
 .Nm rm_destroy ,
 .Nm rm_rlock ,
+.Nm rm_try_rlock ,
 .Nm rm_wlock ,
 .Nm rm_runlock ,
 .Nm rm_wunlock ,
@@ -53,6 +54,8 @@
 .Fn rm_destroy "struct rmlock *rm"
 .Ft void
 .Fn rm_rlock "struct rmlock *rm"  "struct rm_priotracker* tracker"
+.Ft int
+.Fn rm_try_rlock "struct rmlock *rm"  "struct rm_priotracker* tracker"
 .Ft void
 .Fn rm_wlock "struct rmlock *rm"
 .Ft void
@@ -84,14 +87,16 @@ Although reader/writer locks look very similar to
 locks, their usage pattern is different.
 Reader/writer locks can be treated as mutexes (see
 .Xr mutex 9 )
-with shared/exclusive semantics.
+with shared/exclusive semantics unless initialized with
+.Dv RM_SLEEPABLE .
 Unlike
 .Xr sx 9 ,
 an
 .Nm
 can be locked while holding a non-spin mutex, and an
 .Nm
-cannot be held while sleeping.
+cannot be held while sleeping, again unless initialized with
+.Dv RM_SLEEPABLE .
 The
 .Nm
 locks have full priority propagation like mutexes.
@@ -135,6 +140,13 @@ to ignore this lock.
 .It Dv RM_RECURSE
 Allow threads to recursively acquire exclusive locks for
 .Fa rm .
+.It Dv RM_SLEEPABLE
+Allow writers to sleep while holding the lock.
+Readers must not sleep while holding the lock and can avoid sleeping when
+taking the lock by using
+.Fn rm_try_rlock
+instead of
+.Fn rm_rlock .
 .El
 .It Fn rm_rlock "struct rmlock *rm" "struct rm_priotracker* tracker"
 Lock
@@ -161,6 +173,13 @@ access on
 .Fa rm .
 This is called
 .Dq "recursing on a lock" .
+.It Fn rm_try_rlock "struct rmlock *rm" "struct rm_priotracker* tracker"
+Try to lock
+.Fa rm
+as a reader.
+.Fn rm_try_rlock
+will return 0 if the lock cannot be acquired immediately;
+otherwise the lock will be acquired and a non-zero value will be returned.
 .It Fn rm_wlock "struct rmlock *rm"
 Lock
 .Fa rm
diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
index a6a622e..0ab5d74 100644
--- a/sys/kern/kern_rmlock.c
+++ b/sys/kern/kern_rmlock.c
@@ -187,6 +187,8 @@ rm_cleanIPI(void *arg)
 	}
 }
 
+CTASSERT((RM_SLEEPABLE & LO_CLASSFLAGS) == RM_SLEEPABLE);
+
 void
 rm_init_flags(struct rmlock *rm, const char *name, int opts)
 {
@@ -197,9 +199,13 @@ rm_init_flags(struct rmlock *rm, const char *name, int opts)
 		liflags |= LO_WITNESS;
 	if (opts & RM_RECURSE)
 		liflags |= LO_RECURSABLE;
-	rm->rm_noreadtoken = 1;
+	rm->rm_writecpus = all_cpus;
 	LIST_INIT(&rm->rm_activeReaders);
-	mtx_init(&rm->rm_lock, name, "rmlock_mtx", MTX_NOWITNESS);
+	if (opts & RM_SLEEPABLE) {
+		liflags |= RM_SLEEPABLE;
+		sx_init_flags(&rm->rm_lock_sx, "rmlock_sx", SX_RECURSE);
+	} else
+		mtx_init(&rm->rm_lock_mtx, name, "rmlock_mtx", MTX_NOWITNESS);
 	lock_init(&rm->lock_object, &lock_class_rm, name, NULL, liflags);
 }
 
@@ -214,7 +220,10 @@ void
 rm_destroy(struct rmlock *rm)
 {
 
-	mtx_destroy(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_destroy(&rm->rm_lock_sx);
+	else
+		mtx_destroy(&rm->rm_lock_mtx);
 	lock_destroy(&rm->lock_object);
 }
 
@@ -222,7 +231,10 @@ int
 rm_wowned(struct rmlock *rm)
 {
 
-	return (mtx_owned(&rm->rm_lock));
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		return (sx_xlocked(&rm->rm_lock_sx));
+	else
+		return (mtx_owned(&rm->rm_lock_mtx));
 }
 
 void
@@ -241,8 +253,8 @@ rm_sysinit_flags(void *arg)
 	rm_init_flags(args->ra_rm, args->ra_desc, args->ra_opts);
 }
 
-static void
-_rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
+static int
+_rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
 {
 	struct pcpu *pc;
 	struct rm_queue *queue;
@@ -252,9 +264,9 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 	pc = pcpu_find(curcpu);
 
 	/* Check if we just need to do a proper critical_exit. */
-	if (0 == rm->rm_noreadtoken) {
+	if (!(pc->pc_cpumask & rm->rm_writecpus)) {
 		critical_exit();
-		return;
+		return (1);
 	}
 
 	/* Remove our tracker from the per-cpu list. */
@@ -265,7 +277,7 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 		/* Just add back tracker - we hold the lock. */
 		rm_tracker_add(pc, tracker);
 		critical_exit();
-		return;
+		return (1);
 	}
 
 	/*
@@ -289,7 +301,7 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 				mtx_unlock_spin(&rm_spinlock);
 				rm_tracker_add(pc, tracker);
 				critical_exit();
-				return;
+				return (1);
 			}
 		}
 	}
@@ -297,20 +309,38 @@ _rm_rlock_hard(struct rmlock *rm, struct rm_priotracker *tracker)
 	sched_unpin();
 	critical_exit();
 
-	mtx_lock(&rm->rm_lock);
-	rm->rm_noreadtoken = 0;
-	critical_enter();
+	if (trylock) {
+		if (rm->lock_object.lo_flags & RM_SLEEPABLE) {
+			if (!sx_try_xlock(&rm->rm_lock_sx))
+				return (0);
+		} else {
+			if (!mtx_trylock(&rm->rm_lock_mtx))
+				return (0);
+		}
+	} else {
+		if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+			sx_xlock(&rm->rm_lock_sx);
+		else
+			mtx_lock(&rm->rm_lock_mtx);
+	}
 
+	critical_enter();
 	pc = pcpu_find(curcpu);
+	rm->rm_writecpus &= ~pc->pc_cpumask;
 	rm_tracker_add(pc, tracker);
 	sched_pin();
 	critical_exit();
 
-	mtx_unlock(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_xunlock(&rm->rm_lock_sx);
+	else
+		mtx_unlock(&rm->rm_lock_mtx);
+
+	return (1);
 }
 
-void
-_rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker)
+int
+_rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
 {
 	struct thread *td = curthread;
 	struct pcpu *pc;
@@ -337,11 +367,11 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker)
 	 * Fast path to combine two common conditions into a single
 	 * conditional jump.
 	 */
-	if (0 == (td->td_owepreempt | rm->rm_noreadtoken))
-		return;
+	if (0 == (td->td_owepreempt | (rm->rm_writecpus & pc->pc_cpumask)))
+		return (1);
 
 	/* We do not have a read token and need to acquire one. */
-	_rm_rlock_hard(rm, tracker);
+	return (_rm_rlock_hard(rm, tracker, trylock));
 }
 
 static void
@@ -400,20 +430,26 @@ _rm_wlock(struct rmlock *rm)
 {
 	struct rm_priotracker *prio;
 	struct turnstile *ts;
+	cpumask_t readcpus;
 
-	mtx_lock(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_xlock(&rm->rm_lock_sx);
+	else
+		mtx_lock(&rm->rm_lock_mtx);
 
-	if (rm->rm_noreadtoken == 0) {
+	if (rm->rm_writecpus != all_cpus) {
 		/* Get all read tokens back */
 
-		rm->rm_noreadtoken = 1;
+		readcpus = all_cpus & ~rm->rm_writecpus;
+		rm->rm_writecpus = all_cpus;
 
 		/*
-		 * Assumes rm->rm_noreadtoken update is visible on other CPUs
+		 * Assumes rm->rm_writecpus update is visible on other CPUs
 		 * before rm_cleanIPI is called.
 		 */
 #ifdef SMP
-		smp_rendezvous(smp_no_rendevous_barrier,
+		smp_rendezvous_cpus(readcpus,
+		    smp_no_rendevous_barrier,
 		    rm_cleanIPI,
 		    smp_no_rendevous_barrier,
 		    rm);
@@ -439,7 +475,10 @@ void
 _rm_wunlock(struct rmlock *rm)
 {
 
-	mtx_unlock(&rm->rm_lock);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		sx_xunlock(&rm->rm_lock_sx);
+	else
+		mtx_unlock(&rm->rm_lock_mtx);
 }
 
 #ifdef LOCK_DEBUG
@@ -454,7 +493,11 @@ void _rm_wlock_debug(struct rmlock *rm, const char *file, int line)
 
 	LOCK_LOG_LOCK("RMWLOCK", &rm->lock_object, 0, 0, file, line);
 
-	WITNESS_LOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		WITNESS_LOCK(&rm->rm_lock_sx.lock_object, LOP_EXCLUSIVE,
+		    file, line);
+	else
+		WITNESS_LOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
 
 	curthread->td_locks++;
 
@@ -465,25 +508,35 @@ _rm_wunlock_debug(struct rmlock *rm, const char *file, int line)
 {
 
 	curthread->td_locks--;
-	WITNESS_UNLOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
+	if (rm->lock_object.lo_flags & RM_SLEEPABLE)
+		WITNESS_UNLOCK(&rm->rm_lock_sx.lock_object, LOP_EXCLUSIVE,
+		    file, line);
+	else
+		WITNESS_UNLOCK(&rm->lock_object, LOP_EXCLUSIVE, file, line);
 	LOCK_LOG_LOCK("RMWUNLOCK", &rm->lock_object, 0, 0, file, line);
 	_rm_wunlock(rm);
 }
 
-void
+int
 _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
-    const char *file, int line)
+    int trylock, const char *file, int line)
 {
-
+	if (!trylock && (rm->lock_object.lo_flags & RM_SLEEPABLE))
+		WITNESS_CHECKORDER(&rm->rm_lock_sx.lock_object, LOP_NEWORDER,
+		    file, line, NULL);
 	WITNESS_CHECKORDER(&rm->lock_object, LOP_NEWORDER, file, line, NULL);
 
-	_rm_rlock(rm, tracker);
+	if (_rm_rlock(rm, tracker, trylock)) {
+		LOCK_LOG_LOCK("RMRLOCK", &rm->lock_object, 0, 0, file, line);
 
-	LOCK_LOG_LOCK("RMRLOCK", &rm->lock_object, 0, 0, file, line);
+		WITNESS_LOCK(&rm->lock_object, 0, file, line);
 
-	WITNESS_LOCK(&rm->lock_object, 0, file, line);
+		curthread->td_locks++;
 
-	curthread->td_locks++;
+		return (1);
+	}
+
+	return (0);
 }
 
 void
@@ -517,12 +570,12 @@ _rm_wunlock_debug(struct rmlock *rm, const char *file, int line)
 	_rm_wunlock(rm);
 }
 
-void
+int
 _rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
-    const char *file, int line)
+    int trylock, const char *file, int line)
 {
 
-	_rm_rlock(rm, tracker);
+	return (_rm_rlock(rm, tracker, trylock));
 }
 
 void
diff --git a/sys/sys/_rmlock.h b/sys/sys/_rmlock.h
index e5c68d5..75a159c 100644
--- a/sys/sys/_rmlock.h
+++ b/sys/sys/_rmlock.h
@@ -45,11 +45,15 @@ LIST_HEAD(rmpriolist,rm_priotracker);
 
 struct rmlock {
 	struct lock_object lock_object; 
-	volatile int 	rm_noreadtoken;
+	volatile cpumask_t rm_writecpus;
 	LIST_HEAD(,rm_priotracker) rm_activeReaders;
-	struct mtx	rm_lock;
-
+	union {
+		struct mtx _rm_lock_mtx;
+		struct sx _rm_lock_sx;
+	} _rm_lock;
 };
+#define	rm_lock_mtx	_rm_lock._rm_lock_mtx
+#define	rm_lock_sx	_rm_lock._rm_lock_sx
 
 struct rm_priotracker {
 	struct rm_queue rmp_cpuQueue; /* Must be first */
diff --git a/sys/sys/rmlock.h b/sys/sys/rmlock.h
index 9766f67..ef5776b 100644
--- a/sys/sys/rmlock.h
+++ b/sys/sys/rmlock.h
@@ -33,6 +33,7 @@
 #define _SYS_RMLOCK_H_
 
 #include <sys/mutex.h>
+#include <sys/sx.h>
 #include <sys/_lock.h>
 #include <sys/_rmlock.h>
 
@@ -43,6 +44,7 @@
  */
 #define	RM_NOWITNESS	0x00000001
 #define	RM_RECURSE	0x00000002
+#define	RM_SLEEPABLE	0x00000004
 
 void	rm_init(struct rmlock *rm, const char *name);
 void	rm_init_flags(struct rmlock *rm, const char *name, int opts);
@@ -53,14 +55,15 @@ void	rm_sysinit_flags(void *arg);
 
 void	_rm_wlock_debug(struct rmlock *rm, const char *file, int line);
 void	_rm_wunlock_debug(struct rmlock *rm, const char *file, int line);
-void	_rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
-	    const char *file, int line);
+int	_rm_rlock_debug(struct rmlock *rm, struct rm_priotracker *tracker,
+	    int trylock, const char *file, int line);
 void	_rm_runlock_debug(struct rmlock *rm,  struct rm_priotracker *tracker,
 	    const char *file, int line);
 
 void	_rm_wlock(struct rmlock *rm);
 void	_rm_wunlock(struct rmlock *rm);
-void	_rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker);
+int	_rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker,
+	    int trylock);
 void	_rm_runlock(struct rmlock *rm,  struct rm_priotracker *tracker);
 
 /*
@@ -74,14 +77,17 @@ void	_rm_runlock(struct rmlock *rm,  struct rm_priotracker *tracker);
 #define	rm_wlock(rm)	_rm_wlock_debug((rm), LOCK_FILE, LOCK_LINE)
 #define	rm_wunlock(rm)	_rm_wunlock_debug((rm), LOCK_FILE, LOCK_LINE)
 #define	rm_rlock(rm,tracker)  \
-    _rm_rlock_debug((rm),(tracker), LOCK_FILE, LOCK_LINE )
+    ((void)_rm_rlock_debug((rm),(tracker), 0, LOCK_FILE, LOCK_LINE ))
+#define	rm_try_rlock(rm,tracker)  \
+    _rm_rlock_debug((rm),(tracker), 1, LOCK_FILE, LOCK_LINE )
 #define	rm_runlock(rm,tracker)	\
     _rm_runlock_debug((rm), (tracker), LOCK_FILE, LOCK_LINE )
 #else
-#define	rm_wlock(rm)		_rm_wlock((rm))
-#define	rm_wunlock(rm)		_rm_wunlock((rm))
-#define	rm_rlock(rm,tracker)   	_rm_rlock((rm),(tracker))
-#define	rm_runlock(rm,tracker)	_rm_runlock((rm), (tracker))
+#define	rm_wlock(rm)			_rm_wlock((rm))
+#define	rm_wunlock(rm)			_rm_wunlock((rm))
+#define	rm_rlock(rm,tracker)		((void)_rm_rlock((rm),(tracker), 0))
+#define	rm_try_rlock(rm,tracker)	_rm_rlock((rm),(tracker), 1)
+#define	rm_runlock(rm,tracker)		_rm_runlock((rm), (tracker))
 #endif
 
 struct rm_args {

--Boundary-00=_M/adMKo7R9HbyBw--