From owner-freebsd-arch@FreeBSD.ORG  Sun Sep 26 11:28:51 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BC7A1106566C
	for <freebsd-arch@freebsd.org>; Sun, 26 Sep 2010 11:28:51 +0000 (UTC)
	(envelope-from paketix@bluewin.ch)
Received: from mail31.bluewin.ch (mail31.bluewin.ch [195.186.18.72])
	by mx1.freebsd.org (Postfix) with ESMTP id 3E7388FC0A
	for <freebsd-arch@freebsd.org>; Sun, 26 Sep 2010 11:28:50 +0000 (UTC)
Received: from [195.186.18.84] ([195.186.18.84:56597] helo=tr17.bluewin.ch)
	by mail31.bluewin.ch (envelope-from <paketix@bluewin.ch>)
	(ecelerity 2.2.2.45 r()) with ESMTP
	id 2B/12-19667-AEA2F9C4; Sun, 26 Sep 2010 11:13:46 +0000
Received: from [192.168.1.62] (188.61.142.81) by tr17.bluewin.ch (The Blue
	Window 8.5.119.018.5.119.01) (authenticated as paketix@bluewin.ch)
	id 4C6921000180F4F0 for freebsd-arch@freebsd.org;
	Sun, 26 Sep 2010 11:13:46 +0000
From: Paketix <paketix@bluewin.ch>
Date: Sun, 26 Sep 2010 13:13:44 +0200
Message-Id: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
To: freebsd-arch@freebsd.org
Mime-Version: 1.0 (Apple Message framework v1081)
X-Mailer: Apple Mail (2.1081)
Content-Type: text/plain;
	charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Sep 2010 11:28:51 -0000

There is a rather new processor from TILERA (a 100-core chip) which is
most certainly already known here on the FreeBSD mailing lists.
[http://www.tilera.com/products/processors/TILE-Gx_Family]
The processor/platform is targeted at:
- high performance network security platforms
  - firewalling/vpn
  - utm
  - l7 deep packet inspection
  - network monitoring and forensics
- cloud computing
  - web application (lamp)
  - data caching (memcached)
  - database applications
  - high-performance computing

Chris Metcalf from TILERA did the current Linux port, and I was in
contact with him about two weeks ago.
QUANTA Computer is currently starting to offer a 512-core 2U box
with an impressive performance/watt ratio (only 400 watts for 512
cores).
[http://www.tilera.com/solutions/cloud_computing]

I guess these massive multicore chips would enable bleeding-edge,
high-performance solutions based on FreeBSD.

Well...
- is anyone interested in porting FreeBSD to TILERA?
  (the architecture seems to be similar to MIPS...)
- is there already an ongoing porting effort?
- has porting to this chip already been discussed on this mailing list?

many thx
/pat

Some links for those who want more details:
company homepage:
http://www.tilera.com/
64-core processor:
http://www.tilera.com/products/processors/TILEPRO64
100-core processor with hardware packet (pre)processing:
http://www.tilera.com/products/processors/TILE-Gx_Family
sample architecture for network appliances:
http://www.tilera.com/solutions/networking/network_security_appliances
512-core system from QUANTA Computer Inc. (available Q4-10/Q1-11):
http://www.tilera.com/solutions/cloud_computing
development system from TILERA:
http://www.tilera.com/products/platforms/TILEmpower_platform

From owner-freebsd-arch@FreeBSD.ORG  Sun Sep 26 19:28:53 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DA135106564A
	for <freebsd-arch@freebsd.org>; Sun, 26 Sep 2010 19:28:53 +0000 (UTC)
	(envelope-from paketix@bluewin.ch)
Received: from mail31.bluewin.ch (mail31.bluewin.ch [195.186.18.72])
	by mx1.freebsd.org (Postfix) with ESMTP id 710458FC14
	for <freebsd-arch@freebsd.org>; Sun, 26 Sep 2010 19:28:53 +0000 (UTC)
Received: from [195.186.18.84] ([195.186.18.84:40549] helo=tr17.bluewin.ch)
	by mail31.bluewin.ch (envelope-from <paketix@bluewin.ch>)
	(ecelerity 2.2.2.45 r()) with ESMTP
	id DB/E9-19667-4FE9F9C4; Sun, 26 Sep 2010 19:28:52 +0000
Received: from [192.168.1.62] (188.61.142.81) by tr17.bluewin.ch (The Blue
	Window 8.5.119.018.5.119.01) (authenticated as paketix@bluewin.ch)
	id 4C6921000184B737; Sun, 26 Sep 2010 19:28:52 +0000
Mime-Version: 1.0 (Apple Message framework v1081)
Content-Type: text/plain; charset=us-ascii
From: Paketix <paketix@bluewin.ch>
In-Reply-To: <AANLkTikmUrRWbXJZ-RGiyLHEgTWHP_epPw7+4XDJSrjk@mail.gmail.com>
Date: Sun, 26 Sep 2010 21:28:51 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <616133A7-3DF8-4192-8457-09BC27D2085E@bluewin.ch>
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
	<AANLkTikmUrRWbXJZ-RGiyLHEgTWHP_epPw7+4XDJSrjk@mail.gmail.com>
To: Garrett Cooper <gcooper@FreeBSD.org>
X-Mailer: Apple Mail (2.1081)
Cc: freebsd-arch@freebsd.org
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Sep 2010 19:28:54 -0000


On Sep 26, 2010, at 20:05, Garrett Cooper wrote:

> On Sun, Sep 26, 2010 at 4:13 AM, Paketix <paketix@bluewin.ch> wrote:
>> there is a rather new processor from TILERA (100 core chip) which is
>> most certainly already known here at FreeBSD mailing list.
>> [http://www.tilera.com/products/processors/TILE-Gx_Family]
>> the processor/platform is targeted towards:
>> - high performance network security platforms
>>  - firewalling/vpn
>>  - utm
>>  - l7 deep packet inspection
>>  - network monitoring and forensics
>> - cloud computing
>>  - web application (lamp)
>>  - data caching (memcached)
>>  - database applications
>>  - high-performance computing
>>
>> chris metcalf from TILERA did the current linux port and i was in
>> contact with him about two weeks ago.
>> at this time QUANTA computer is starting to offer a 512 core 2U box
>> with an impressive performance/watt ratio (400 watts only for 512
>> cores).
>> [http://www.tilera.com/solutions/cloud_computing]
>>
>> i guess those massive multicore chips would enable bleeding edge
>> high performance solutions based on FreeBSD.
>>
>> well...
>> - anyone interested in porting FreeBSD towards TILERA?
>>  (architecture seems to be similar to MIPS...)
>> - is there already some ongoing porting effort?
>> - porting for this chip already discussed in this mailing list?
>>
>> many thx
>> /pat
>>
>> some links for those who want some more details:
>> company homepage:
>> http://www.tilera.com/
>> 64core processor:
>> http://www.tilera.com/products/processors/TILEPRO64
>> 100core processor with hardware packet (pre)processing
>> http://www.tilera.com/products/processors/TILE-Gx_Family
>> sample architecture for network appliances:
>> http://www.tilera.com/solutions/networking/network_security_appliances
>> 512core system from QUANTA computer inc. (available Q4-10/Q1-11):
>> http://www.tilera.com/solutions/cloud_computing
>> development system from TILERA:
>> http://www.tilera.com/products/platforms/TILEmpower_platform
>
>    In short, this work requires changes to the scheduler and kernel
> structures that aren't 100% done yet. Look for some of Robert Watson's
> and John Baldwin's replies to the "Bumping MAXCPU on amd64" thread on
> freebsd-arch and freebsd-current in the past month.
> Cheers,
> -Garrett

Usually it would, yes,
but maybe not on TILERA if you use it for security applications (like
firewalling, proxy, URL filtering, ...).
Each tile of a TILERA chip can run its own full-featured OS.
Starting with TILE-Gx, the chip has a hardware load balancer that
distributes the packet streams to the cores...
This could serve as a first step, with full SMP for e.g. database
applications etc. later on.
BTW: the TILERA chip does not have a floating-point unit anyway, which
limits the range of applications (FP must be emulated in software).
BR
/pat

From owner-freebsd-arch@FreeBSD.ORG  Mon Sep 27 15:49:37 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2F60E1065670;
	Mon, 27 Sep 2010 15:49:37 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id F17888FC16;
	Mon, 27 Sep 2010 15:49:36 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 8178746B81;
	Mon, 27 Sep 2010 11:49:36 -0400 (EDT)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id BB8E38A04E;
	Mon, 27 Sep 2010 11:49:34 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Mon, 27 Sep 2010 09:28:47 -0400
User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; )
References: <201009211507.o8LF7iVv097676@svn.freebsd.org>
	<alpine.LNX.2.00.1009231841500.23791@ury.york.ac.uk>
	<20100924225352.GD49476@server.vk2pj.dyndns.org>
In-Reply-To: <20100924225352.GD49476@server.vk2pj.dyndns.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Message-Id: <201009270928.47232.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Mon, 27 Sep 2010 11:49:35 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
	src-committers@freebsd.org
Subject: Re: svn commit: r212964 - head/sys/kern
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 27 Sep 2010 15:49:37 -0000

On Friday, September 24, 2010 6:53:52 pm Peter Jeremy wrote:
> [Pruning CC list and re-adding freebsd-arch on the (forlorn) hope that
> this thread will move to where it belongs]
> 
> On 2010-Sep-23 07:31:13 -0700, Matthew Jacob <mj@feral.com> wrote:
> >It turns out that the big issue here was more the savecore time coming 
> >back up rather than the time of dumping.
> 
> In my experience, the problem isn't so much the savecore time as the
> time to run /usr/bin/crashinfo.  Whilst savecore needs to run early
> (before anything tramples on the crashdump in swap), the latter could
> run at any time.  It would seem reasonable to either run crashinfo in
> the background or as a batch job triggered by /etc/rc.d/savecore.

That is probably true and would be fine, yes.

> On 2010-Sep-23 18:59:53 +0100, Gavin Atkinson <gavin@FreeBSD.org> wrote:
> >I appreciate the issue about filling partitions is a valid one.  Would a 
> >possible compromise be that on release media, crashinfo(8) or similar will 
> >default to only keeping the most recent coredump or similar?  Given /var 
> >now defaults to 4GB, defaulting to keeping a single core is probably 
> >acceptable.
> 
> savecore already has support for a 'minfree' file to prevent
> crashdumps filling the crashdir.  Maybe the default install should
> include a minfree set to (say) 512MB.

The one problem with this approach is that it implements a FIFO instead
of a LIFO: I want the N most recent crashdumps to be saved, not the first N.
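
To make the FIFO/LIFO distinction concrete: a minfree threshold simply
stops saving new dumps once space runs low, so the oldest dumps are the
ones that survive. Keeping the N most recent instead means pruning the
oldest ones. A minimal sketch, assuming dumps are numbered by
savecore's increasing counter (vmcore.0, vmcore.1, ...); the helper
below is hypothetical and not part of savecore(8):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator: ascending order of dump numbers. */
static int
cmp_dumpno(const void *a, const void *b)
{
	int x = *(const int *)a;
	int y = *(const int *)b;

	return ((x > y) - (x < y));
}

/*
 * Hypothetical helper: sort the dump numbers found on disk and copy
 * all but the `keep` highest (most recent) ones into `victims`, i.e.
 * the dumps that should be removed. `victims` must have room for
 * `ndumps` entries. Returns the number of victims.
 */
size_t
select_old_dumps(int *dumps, size_t ndumps, size_t keep, int *victims)
{
	size_t i, nvictims;

	assert(dumps != NULL || ndumps == 0);
	if (ndumps <= keep)
		return (0);
	qsort(dumps, ndumps, sizeof(dumps[0]), cmp_dumpno);
	nvictims = ndumps - keep;
	for (i = 0; i < nvictims; i++)
		victims[i] = dumps[i];	/* oldest first */
	return (nvictims);
}
```

For example, with vmcore.0 through vmcore.4 on disk and keep = 2, the
helper selects 0, 1 and 2 for removal. To leave room for the dump about
to be saved, a caller would pass keep = N - 1.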

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 06:22:58 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9D348106566C
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 06:22:58 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout030.mac.com (asmtpout030.mac.com [17.148.16.105])
	by mx1.freebsd.org (Postfix) with ESMTP id 8098A8FC1A
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 06:22:58 +0000 (UTC)
MIME-version: 1.0
Content-type: multipart/mixed; boundary="Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)"
Received: from sa-nc-apg-144.static.jnpr.net
	(natint3.juniper.net [66.129.224.36])
	by asmtp030.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01
	(built Dec
	16 2008; 32bit)) with ESMTPSA id <0L9G003YU1PGFO70@asmtp030.mac.com> for
	freebsd-arch@freebsd.org; Mon, 27 Sep 2010 23:22:33 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1009270310
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-09-28_06:2010-09-28, 2010-09-27,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
Date: Mon, 27 Sep 2010 23:22:27 -0700
Message-id: <CD1BDE8F-29BE-4A82-B0D9-8849FF3C1A1F@mac.com>
To: "freebsd-arch@FreeBSD.org Arch" <freebsd-arch@freebsd.org>
X-Mailer: Apple Mail (2.1081)
Subject: [patch] functional prototype of root mount enhancement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2010 06:22:58 -0000


--Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)
Content-type: text/plain; charset=us-ascii
Content-transfer-encoding: 7BIT

All,

I have prototyped the root mount enhancement discussed previously.
I would appreciate feedback, suggestions and, of course, bug
reports.

See:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=5942+0+current/freebsd-arch
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=120899+0+archive/2010/freebsd-arch/20100829.freebsd-arch

The prototype supports all boot options that affect the root
mount: -a, -C and -r. When any of them is present, the initial
root mount directives are adjusted accordingly.
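
To illustrate what adjusting the initial directives could look like,
here is a hypothetical userland sketch (not the code in the patch,
whose relevant function is not part of this excerpt). It renders an
initial directive list from the loader variables and three stand-in
flags for -a, -C and -r. The flag names and values, the directive
order and the cd9660:/dev/acd0 device are assumptions; with no flags
set it reproduces the first directive block of the boot output below.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Placeholder flag values, NOT the RB_* constants from <sys/reboot.h>. */
#define	BOOT_ASK	0x1	/* -a: ask for the root file system */
#define	BOOT_CDROM	0x2	/* -C: boot from CD-ROM */
#define	BOOT_DFLTROOT	0x4	/* -r: use the compiled-in root device */

/* Append formatted text at *off; returns -1 if it does not fit. */
static int
append(char *buf, size_t len, size_t *off, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf + *off, len - *off, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= len - *off)
		return (-1);
	*off += (size_t)n;
	return (0);
}

/*
 * Hypothetical: render the initial directive list, one per line.
 * `compiled` stands in for ROOTDEVNAME; `options` may be NULL.
 * Returns 0 on success, -1 if `buf` is too small (contents are
 * then unspecified).
 */
int
build_initial_conf(char *buf, size_t len, int flags, int timeout,
    const char *mountfrom, const char *options, const char *compiled)
{
	size_t off = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	if ((flags & BOOT_ASK) && append(buf, len, &off, ".ask\n"))
		return (-1);
	if (append(buf, len, &off, ".onfail panic\n.timeout %d\n", timeout))
		return (-1);
	if ((flags & BOOT_CDROM) &&
	    append(buf, len, &off, "cd9660:/dev/acd0 ro\n"))
		return (-1);
	if ((flags & BOOT_DFLTROOT) && compiled != NULL) {
		if (append(buf, len, &off, "%s\n", compiled))
			return (-1);
	} else if (mountfrom != NULL) {
		if (append(buf, len, &off, "%s%s%s\n", mountfrom,
		    options != NULL ? " " : "", options != NULL ? options : ""))
			return (-1);
	}
	/* Fall back to prompting if every entry above fails. */
	if (append(buf, len, &off, ".ask\n"))
		return (-1);
	return (0);
}
```

With no flags set, vfs.root.mountfrom=ufs:/dev/ad0s1a and options rw,
the result is ".onfail panic", ".timeout 1", "ufs:/dev/ad0s1a rw" and
".ask", one per line.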

The prototype also adds better support for mount options. Both
the interactive and the compiled-in root mount specification
(i.e. ROOTDEVNAME) can contain mount options.
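
The specification format is the one printed at the mountroot> prompt:
<fstype>:<device> [options]. parse_mount() is declared in the patch but
its body is outside this excerpt, so the following is only a
hypothetical userland sketch of splitting such a specification in
place. The struct and function names are made up for illustration, and
it assumes a single whitespace-separated option list such as "ro" or
"rw,noatime".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pieces of a "<fstype>:<device> [options]" root mount specification. */
struct rootspec {
	char	*fstype;	/* e.g. "cd9660" */
	char	*device;	/* e.g. "/dev/acd0" */
	char	*options;	/* e.g. "ro", or NULL when absent */
};

/*
 * Hypothetical sketch: split `spec` in place (it is modified).
 * Returns 0 on success, -1 if the ':' separator or a field is missing.
 */
int
split_rootspec(char *spec, struct rootspec *rs)
{
	char *colon, *p;

	assert(spec != NULL && rs != NULL);
	spec += strspn(spec, " \t");		/* skip leading blanks */
	colon = strchr(spec, ':');
	if (colon == NULL || colon == spec)
		return (-1);			/* no fstype */
	*colon = '\0';
	rs->fstype = spec;
	rs->device = colon + 1;

	p = rs->device + strcspn(rs->device, " \t\n");
	if (p == rs->device)
		return (-1);			/* no device */
	rs->options = NULL;
	if (*p != '\0') {
		*p++ = '\0';			/* terminate the device */
		p += strspn(p, " \t");
		if (*p != '\0' && *p != '\n') {
			rs->options = p;
			p[strcspn(p, " \t\n")] = '\0';	/* trim the tail */
		}
	}
	return (0);
}
```

Splitting "cd9660:/dev/acd0 ro" yields fstype "cd9660", device
"/dev/acd0" and options "ro", which is what the prompt's help text
equates with mount -t cd9660 -o ro /dev/acd0 /.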

Not implemented yet are the .onfail handling and the .timeout
handling (previously called .wait). The .init directive is not
implemented either.

There is one bug under investigation: when a second (non-devfs)
file system is mounted as root, the first (non-devfs) one gets
moved to /.mount or /mnt on the new (second) file system.
However, accessing that file system results in a WITNESS panic,
caused by a syscall returning with the UFS lock held.

The code still has some debug output, which is helpful for
seeing what's going on internally. Here is the output of a boot
with a /.mount.conf present on ufs:/dev/ad0s1a:

	:
WARNING: WITNESS option enabled, expect reduced performance.
Root mount waiting for: usbus1
Root mount waiting for: usbus1
uhub1: 6 ports with 6 removable, self powered
Root mount waiting for: usbus1
Root mount waiting for: usbus1
ugen1.2: <Apple Inc.> at usbus1
========
.onfail panic
.timeout 1
ufs:/dev/ad0s1a rw
.ask
========
Trying to mount root from ufs:/dev/ad0s1a [rw]...
XXX: vfs_mountroot_parse: error = 0, mpdevfs=0xc3fa3000, mp=0xc3fa2c94
========
.onfail continue
#ufs:/dev/da0a
.ask
========

Loader variables:
  vfs.root.mountfrom=ufs:/dev/ad0s1a
  vfs.root.mountfrom.options=rw

Manual root filesystem specification:
  <fstype>:<device> [options]
      Mount <device> using filesystem <fstype>
      and with the specified (optional) option list.

    eg. ufs:/dev/da0s1a
        cd9660:/dev/acd0 ro
          (which is equivalent to: mount -t cd9660 -o ro /dev/acd0 /

  ?                  List valid disk boot devices
  <empty line>       Abort manual input

mountroot> 
XXX: vfs_mountroot_parse: error = -1, mpdevfs=0xc3fa3000, mp=0
	:

In case the attachment gets eaten:
	http://www.xcllnt.net/~marcel/rootmount.diff

-- 
Marcel Moolenaar
xcllnt@mac.com


--Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)
Content-type: application/octet-stream; name=rootmount.diff
Content-transfer-encoding: 7bit
Content-disposition: attachment; filename=rootmount.diff

Index: conf/files
===================================================================
--- conf/files	(revision 41)
+++ conf/files	(revision 49)
@@ -2216,6 +2216,7 @@
 kern/vfs_init.c			standard
 kern/vfs_lookup.c		standard
 kern/vfs_mount.c		standard
+kern/vfs_mountroot.c		standard
 kern/vfs_subr.c			standard
 kern/vfs_syscalls.c		standard
 kern/vfs_vnops.c		standard
Index: kern/vfs_mountroot.c
===================================================================
--- kern/vfs_mountroot.c	(revision 0)
+++ kern/vfs_mountroot.c	(revision 49)
@@ -0,0 +1,985 @@
+/*-
+ * Copyright (c) 1999-2004 Poul-Henning Kamp
+ * Copyright (c) 1999 Michael Smith
+ * Copyright (c) 1989, 1993
+ *      The Regents of the University of California.  All rights reserved.
+ * (c) UNIX System Laboratories, Inc.
+ * All or some portions of this file are derived from material licensed
+ * to the University of California by American Telephone and Telegraph
+ * Co. or Unix System Laboratories, Inc. and are reproduced herein with
+ * the permission of UNIX System Laboratories, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 4. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "opt_rootdevname.h"
+
+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD$");
+
+#include <sys/param.h>
+#include <sys/conf.h>
+#include <sys/fcntl.h>
+#include <sys/jail.h>
+#include <sys/kernel.h>
+#include <sys/libkern.h>
+#include <sys/malloc.h>
+#include <sys/mdioctl.h>
+#include <sys/mount.h>
+#include <sys/mutex.h>
+#include <sys/namei.h>
+#include <sys/priv.h>
+#include <sys/proc.h>
+#include <sys/filedesc.h>
+#include <sys/reboot.h>
+#include <sys/stat.h>
+#include <sys/syscallsubr.h>
+#include <sys/sysproto.h>
+#include <sys/sx.h>
+#include <sys/sysctl.h>
+#include <sys/sysent.h>
+#include <sys/systm.h>
+#include <sys/vnode.h>
+
+#include <geom/geom.h>
+
+/*
+ * The root filesystem is detailed in the kernel environment variable
+ * vfs.root.mountfrom, which is expected to be in the general format
+ *
+ * <vfsname>:[<path>][	<vfsname>:[<path>] ...]
+ * vfsname   := the name of a VFS known to the kernel and capable
+ *              of being mounted as root
+ * path      := disk device name or other data used by the filesystem
+ *              to locate its physical store
+ *
+ * If the environment variable vfs.root.mountfrom is a space-separated list,
+ * each list element is tried in turn and the root filesystem will be mounted
+ * from the first one that succeeds.
+ *
+ * The environment variable vfs.root.mountfrom.options is a comma delimited
+ * set of string mount options.  These mount options must be parseable
+ * by nmount() in the kernel.
+ */
+
+static int parse_mount(char **);
+static struct mntarg *parse_mountroot_options(struct mntarg *, const char *);
+
+/*
+ * The vnode of the system's root (/ in the filesystem, without chroot
+ * active.)
+ */
+struct vnode *rootvnode;
+
+char *rootdevnames[2] = {NULL, NULL};
+
+struct root_hold_token {
+	const char			*who;
+	LIST_ENTRY(root_hold_token)	list;
+};
+
+static LIST_HEAD(, root_hold_token)	root_holds =
+    LIST_HEAD_INITIALIZER(root_holds);
+
+enum action {
+	A_PANIC,
+	A_CONTINUE,
+	A_REBOOT,
+	A_RETRY
+};
+
+static enum action root_mount_action;
+
+static int root_mount_mddev;
+static int root_mount_complete;
+
+/* By default wait up to 1 second for devices to appear. */
+static int root_mount_timeout = 1;
+
+struct root_hold_token *
+root_mount_hold(const char *identifier)
+{
+	struct root_hold_token *h;
+
+	if (root_mounted())
+		return (NULL);
+
+	h = malloc(sizeof *h, M_DEVBUF, M_ZERO | M_WAITOK);
+	h->who = identifier;
+	mtx_lock(&mountlist_mtx);
+	LIST_INSERT_HEAD(&root_holds, h, list);
+	mtx_unlock(&mountlist_mtx);
+	return (h);
+}
+
+void
+root_mount_rel(struct root_hold_token *h)
+{
+
+	if (h == NULL)
+		return;
+	mtx_lock(&mountlist_mtx);
+	LIST_REMOVE(h, list);
+	wakeup(&root_holds);
+	mtx_unlock(&mountlist_mtx);
+	free(h, M_DEVBUF);
+}
+
+int
+root_mounted(void)
+{
+
+	/* No mutex is acquired here because int stores are atomic. */
+	return (root_mount_complete);
+}
+
+void
+root_mount_wait(void)
+{
+
+	/*
+	 * Panic on an obvious deadlock - the function can't be called from
+	 * a thread which is doing the whole SYSINIT stuff.
+	 */
+	KASSERT(curthread->td_proc->p_pid != 0,
+	    ("root_mount_wait: cannot be called from the swapper thread"));
+	mtx_lock(&mountlist_mtx);
+	while (!root_mount_complete) {
+		msleep(&root_mount_complete, &mountlist_mtx, PZERO, "rootwait",
+		    hz);
+	}
+	mtx_unlock(&mountlist_mtx);
+}
+
+static void
+set_rootvnode(void)
+{
+	struct proc *p;
+
+	if (VFS_ROOT(TAILQ_FIRST(&mountlist), LK_EXCLUSIVE, &rootvnode))
+		panic("Cannot find root vnode");
+
+	VOP_UNLOCK(rootvnode, 0);
+
+	p = curthread->td_proc;
+	FILEDESC_XLOCK(p->p_fd);
+
+	if (p->p_fd->fd_cdir != NULL)
+		vrele(p->p_fd->fd_cdir);
+	p->p_fd->fd_cdir = rootvnode;
+	VREF(rootvnode);
+
+	if (p->p_fd->fd_rdir != NULL)
+		vrele(p->p_fd->fd_rdir);
+	p->p_fd->fd_rdir = rootvnode;
+	VREF(rootvnode);
+
+	FILEDESC_XUNLOCK(p->p_fd);
+
+	EVENTHANDLER_INVOKE(mountroot);
+}
+
+static int
+vfs_mountroot_devfs(struct thread *td, struct mount **mpp)
+{
+	struct vfsoptlist *opts;
+	struct vfsconf *vfsp;
+	struct mount *mp;
+	int error;
+
+	*mpp = NULL;
+
+	vfsp = vfs_byname("devfs");
+	KASSERT(vfsp != NULL, ("Could not find devfs by name"));
+	if (vfsp == NULL)
+		return (ENOENT);
+
+	mp = vfs_mount_alloc(NULLVP, vfsp, "/dev", td->td_ucred);
+
+	error = VFS_MOUNT(mp);
+	KASSERT(error == 0, ("VFS_MOUNT(devfs) failed %d", error));
+	if (error)
+		return (error);
+
+	opts = malloc(sizeof(struct vfsoptlist), M_MOUNT, M_WAITOK);
+	TAILQ_INIT(opts);
+	mp->mnt_opt = opts;
+
+	mtx_lock(&mountlist_mtx);
+	TAILQ_INSERT_HEAD(&mountlist, mp, mnt_list);
+	mtx_unlock(&mountlist_mtx);
+
+	*mpp = mp;
+	set_rootvnode();
+
+	error = kern_symlink(td, "/", "dev", UIO_SYSSPACE);
+	if (error)
+		printf("kern_symlink /dev -> / returns %d\n", error);
+
+	return (error);
+}
+
+static int
+vfs_mountroot_shuffle(struct thread *td, struct mount *mpdevfs)
+{
+	struct nameidata nd;
+	struct mount *mporoot, *mpnroot;
+	struct vnode *vp, *vporoot, *vpdevfs;
+	char *fspath;
+	int error;
+
+	mpnroot = TAILQ_NEXT(mpdevfs, mnt_list);
+
+	/* Shuffle the mountlist. */
+	mtx_lock(&mountlist_mtx);
+	mporoot = TAILQ_FIRST(&mountlist);
+	TAILQ_REMOVE(&mountlist, mpdevfs, mnt_list);
+	if (mporoot != mpdevfs) {
+		TAILQ_REMOVE(&mountlist, mpnroot, mnt_list);
+		TAILQ_INSERT_HEAD(&mountlist, mpnroot, mnt_list);
+	}
+	TAILQ_INSERT_TAIL(&mountlist, mpdevfs, mnt_list);
+	mtx_unlock(&mountlist_mtx);
+
+	cache_purgevfs(mporoot);
+	if (mporoot != mpdevfs)
+		cache_purgevfs(mpdevfs);
+
+	VFS_ROOT(mporoot, LK_EXCLUSIVE, &vporoot);
+
+	VI_LOCK(vporoot);
+	vporoot->v_iflag &= ~VI_MOUNT;
+	VI_UNLOCK(vporoot);
+	vporoot->v_vflag &= ~VV_ROOT;
+	vporoot->v_mountedhere = NULL;
+	mporoot->mnt_vnodecovered = NULL;
+	vput(vporoot);
+
+	/* Set up the new rootvnode, and purge the cache */
+	mpnroot->mnt_vnodecovered = NULL;
+	set_rootvnode();
+	cache_purgevfs(rootvnode->v_mount);
+
+	if (mporoot != mpdevfs) {
+		/* Remount old root under /.mount or /mnt */
+		fspath = "/.mount";
+		NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE,
+		    fspath, td);
+		error = namei(&nd);
+		if (error) {
+			NDFREE(&nd, NDF_ONLY_PNBUF);
+			fspath = "/mnt";
+			NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE,
+			    fspath, td);
+			error = namei(&nd);
+		}
+		if (!error) {
+			vp = nd.ni_vp;
+			error = (vp->v_type == VDIR) ? 0 : ENOTDIR;
+			if (!error)
+				error = vinvalbuf(vp, V_SAVE, 0, 0);
+			if (!error) {
+				cache_purge(vp);
+				mporoot->mnt_vnodecovered = vp;
+				vp->v_mountedhere = mporoot;
+				strlcpy(mporoot->mnt_stat.f_mntonname,
+				    fspath, MNAMELEN);
+				VOP_UNLOCK(vp, 0);
+			} else
+				vput(vp);
+		}
+		NDFREE(&nd, NDF_ONLY_PNBUF);
+
+		if (mporoot->mnt_vnodecovered == NULL)
+			printf("mountroot: unable to remount previous root.\n");
+	}
+
+	/* Remount devfs under /dev */
+	NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, "/dev", td);
+
+	error = namei(&nd);
+	if (!error) {
+		vp = nd.ni_vp;
+		error = (vp->v_type == VDIR) ? 0 : ENOTDIR;
+		if (!error)
+			error = vinvalbuf(vp, V_SAVE, 0, 0);
+		if (!error) {
+			vpdevfs = mpdevfs->mnt_vnodecovered;
+			if (vpdevfs != NULL) {
+				cache_purge(vpdevfs);
+				vpdevfs->v_mountedhere = NULL;
+				vrele(vpdevfs);
+			}
+			mpdevfs->mnt_vnodecovered = vp;
+			vp->v_mountedhere = mpdevfs;
+			VOP_UNLOCK(vp, 0);
+		} else
+			vput(vp);
+	}
+	NDFREE(&nd, NDF_ONLY_PNBUF);
+
+	if (mporoot == mpdevfs) {
+		vfs_unbusy(mpdevfs);
+		/* Unlink the no longer needed /dev/dev -> / symlink */
+		kern_unlink(td, "/dev/dev", UIO_SYSSPACE);
+	}
+
+	return (0);
+}
+
+/*
+ * Configuration parser.
+ */
+
+/* Parser character classes. */
+#define	CC_WHITESPACE		-1
+#define	CC_NONWHITESPACE	-2
+
+/* Parse errors. */
+#define	PE_EOF			-1
+#define	PE_EOL			-2
+
+static __inline int
+parse_peek(char **conf)
+{
+
+	return (**conf);
+}
+
+static __inline void
+parse_poke(char **conf, int c)
+{
+
+	**conf = c;
+}
+
+static __inline void
+parse_advance(char **conf)
+{
+
+	(*conf)++;
+}
+
+static __inline int
+parse_isspace(int c)
+{
+
+	return ((c == ' ' || c == '\t' || c == '\n') ? 1 : 0);
+}
+
+static int
+parse_skipto(char **conf, int mc)
+{
+	int c, match;
+
+	while (1) {
+		c = parse_peek(conf);
+		if (c == 0)
+			return (PE_EOF);
+		switch (mc) {
+		case CC_WHITESPACE:
+			match = (c == ' ' || c == '\t' || c == '\n') ? 1 : 0;
+			break;
+		case CC_NONWHITESPACE:
+			if (c == '\n')
+				return (PE_EOL);
+			match = (c != ' ' && c != '\t') ? 1 : 0;
+			break;
+		default:
+			match = (c == mc) ? 1 : 0;
+			break;
+		}
+		if (match)
+			break;
+		parse_advance(conf);
+	}
+	return (0);
+}
+
+static int
+parse_token(char **conf, char **tok)
+{
+	char *p;
+	size_t len;
+	int error;
+
+	*tok = NULL;
+	error = parse_skipto(conf, CC_NONWHITESPACE);
+	if (error)
+		return (error);
+	p = *conf;
+	error = parse_skipto(conf, CC_WHITESPACE);
+	len = *conf - p;
+	*tok = malloc(len + 1, M_TEMP, M_WAITOK | M_ZERO);
+	bcopy(p, *tok, len);
+	return (0);
+}
+
+static void
+parse_dir_ask_printenv(const char *var)
+{
+	char *val;
+
+	val = getenv(var);
+	if (val != NULL) {
+		printf("  %s=%s\n", var, val);
+		freeenv(val);
+	}
+}
+
+static int
+parse_dir_ask(char **conf)
+{
+	char name[80];
+	char *mnt;
+	int error;
+
+	printf("\nLoader variables:\n");
+	parse_dir_ask_printenv("vfs.root.mountfrom");
+	parse_dir_ask_printenv("vfs.root.mountfrom.options");
+
+	printf("\nManual root filesystem specification:\n");
+	printf("  <fstype>:<device> [options]\n");
+	printf("      Mount <device> using filesystem <fstype>\n");
+	printf("      and with the specified (optional) option list.\n");
+	printf("\n");
+	printf("    eg. ufs:/dev/da0s1a\n");
+	printf("        cd9660:/dev/acd0 ro\n");
+	printf("          (which is equivalent to: ");
+	printf("mount -t cd9660 -o ro /dev/acd0 /)\n");
+	printf("\n");
+	printf("  ?                  List valid disk boot devices\n");
+	printf("  <empty line>       Abort manual input\n");
+
+ again:
+	printf("\nmountroot> ");
+	gets(name, sizeof(name), 1);
+	if (name[0] == '\0')
+		return (0);
+	if (name[0] == '?') {
+		printf("\nList of GEOM managed disk devices:\n  ");
+		g_dev_print();
+		goto again;
+	}
+	mnt = name;
+	error = parse_mount(&mnt);
+	if (error == -1) {
+		printf("Invalid specification.\n");
+		goto again;
+	}
+	return (error);
+}
+
+static int
+parse_dir_md(char **conf)
+{
+	struct stat sb;
+	struct thread *td;
+	struct md_ioctl *mdio;
+	char *path, *tok;
+	int error, fd, len;
+
+	td = curthread;
+
+	error = parse_token(conf, &tok);
+	if (error)
+		return (error);
+
+	len = strlen(tok);
+	mdio = malloc(sizeof(*mdio) + len + 1, M_TEMP, M_WAITOK | M_ZERO);
+	path = (void *)(mdio + 1);
+	bcopy(tok, path, len);
+	free(tok, M_TEMP);
+
+	/* Get file status. */
+	error = kern_stat(td, path, UIO_SYSSPACE, &sb);
+	if (error)
+		goto out;
+
+	/* Open /dev/mdctl so that we can attach/detach. */
+	error = kern_open(td, "/dev/" MDCTL_NAME, UIO_SYSSPACE, O_RDWR, 0);
+	if (error)
+		goto out;
+
+	fd = td->td_retval[0];
+	mdio->md_version = MDIOVERSION;
+	mdio->md_type = MD_VNODE;
+
+	if (root_mount_mddev != -1) {
+		mdio->md_unit = root_mount_mddev;
+		DROP_GIANT();
+		error = kern_ioctl(td, fd, MDIOCDETACH, (void *)mdio);
+		PICKUP_GIANT();
+		/* Ignore errors. We don't care. */
+		root_mount_mddev = -1;
+	}
+
+	mdio->md_file = (void *)(mdio + 1);
+	mdio->md_options = MD_AUTOUNIT | MD_READONLY;
+	mdio->md_mediasize = sb.st_size;
+	mdio->md_unit = 0;
+	DROP_GIANT();
+	error = kern_ioctl(td, fd, MDIOCATTACH, (void *)mdio);
+	PICKUP_GIANT();
+	if (error)
+		goto out;
+
+	if (mdio->md_unit > 9) {
+		printf("rootmount: too many md units\n");
+		mdio->md_file = NULL;
+		mdio->md_options = 0;
+		mdio->md_mediasize = 0;
+		DROP_GIANT();
+		error = kern_ioctl(td, fd, MDIOCDETACH, (void *)mdio);
+		PICKUP_GIANT();
+		/* Ignore errors. We don't care. */
+		error = ERANGE;
+		goto out;
+	}
+
+	root_mount_mddev = mdio->md_unit;
+	printf(MD_NAME "%u attached to %s\n", root_mount_mddev, mdio->md_file);
+
+	error = kern_close(td, fd);
+
+ out:
+	free(mdio, M_TEMP);
+	return (error);
+}
+
+static int
+parse_dir_onfail(char **conf)
+{
+	char *action;
+	int error;
+
+	error = parse_token(conf, &action);
+	if (error)
+		return (error);
+
+	if (!strcmp(action, "continue"))
+		root_mount_action = A_CONTINUE;
+	else if (!strcmp(action, "panic"))
+		root_mount_action = A_PANIC;
+	else if (!strcmp(action, "reboot"))
+		root_mount_action = A_REBOOT;
+	else if (!strcmp(action, "retry"))
+		root_mount_action = A_RETRY;
+	else {
+		printf("rootmount: %s: unknown action\n", action);
+		error = EINVAL;
+	}
+
+	free(action, M_TEMP);
+	return (error);
+}
+
+static int
+parse_dir_timeout(char **conf)
+{
+	char *tok, *endtok;
+	long secs;
+	int error;
+
+	error = parse_token(conf, &tok);
+	if (error)
+		return (error);
+
+	secs = strtol(tok, &endtok, 0);
+	error = (secs < 0 || *endtok != '\0') ? EINVAL : 0;
+	if (!error)
+		root_mount_timeout = secs;
+	free(tok, M_TEMP);
+	return (error);
+}
+
+static int
+parse_directive(char **conf)
+{
+	char *dir;
+	int error;
+
+	error = parse_token(conf, &dir);
+	if (error)
+		return (error);
+
+	if (strcmp(dir, ".ask") == 0)
+		error = parse_dir_ask(conf);
+	else if (strcmp(dir, ".md") == 0)
+		error = parse_dir_md(conf);
+	else if (strcmp(dir, ".onfail") == 0)
+		error = parse_dir_onfail(conf);
+	else if (strcmp(dir, ".timeout") == 0)
+		error = parse_dir_timeout(conf);
+	else {
+		printf("mountroot: invalid directive `%s'\n", dir);
+		/* Ignore the rest of the line. */
+		(void)parse_skipto(conf, '\n');
+		error = EINVAL;
+	}
+	free(dir, M_TEMP);
+	return (error);
+}
+
+static int
+parse_mount(char **conf)
+{
+	char errmsg[255];
+	struct mntarg *ma;
+	char *dev, *fs, *opts, *tok;
+	int error;
+
+	error = parse_token(conf, &tok);
+	if (error)
+		return (error);
+	fs = tok;
+	error = parse_skipto(&tok, ':');
+	if (error) {
+		free(fs, M_TEMP);
+		return (error);
+	}
+	parse_poke(&tok, '\0');
+	parse_advance(&tok);
+	dev = tok;
+
+	if (root_mount_mddev != -1) {
+		/* Handle substitution for the md unit number. */
+		tok = strstr(dev, "md#");
+		if (tok != NULL)
+			tok[2] = '0' + root_mount_mddev;
+	}
+
+	/* Parse options. */
+	error = parse_token(conf, &tok);
+	opts = (error == 0) ? tok : NULL;
+
+	printf("Trying to mount root from %s:%s [%s]...\n", fs, dev,
+	    (opts != NULL) ? opts : "");
+
+	bzero(errmsg, sizeof(errmsg));
+
+	if (vfs_byname(fs) == NULL) {
+		strlcpy(errmsg, "unknown file system", sizeof(errmsg));
+		error = ENOENT;
+		goto out;
+	}
+
+	if (dev[0] != '\0') {
+		/* XXX wait N seconds for the device to appear. */
+	}
+
+	ma = NULL;
+	ma = mount_arg(ma, "fstype", fs, -1);
+	ma = mount_arg(ma, "fspath", "/", -1);
+	ma = mount_arg(ma, "from", dev, -1);
+	ma = mount_arg(ma, "errmsg", errmsg, sizeof(errmsg));
+	ma = mount_arg(ma, "ro", NULL, 0);
+	ma = parse_mountroot_options(ma, opts);
+	error = kernel_mount(ma, MNT_ROOTFS);
+
+ out:
+	if (error) {
+		printf("Mounting from %s:%s failed with error %d",
+		    fs, dev, error);
+		if (errmsg[0] != '\0')
+			printf(": %s", errmsg);
+		printf(".\n");
+	}
+	free(fs, M_TEMP);
+	if (opts != NULL)
+		free(opts, M_TEMP);
+	/* kernel_mount can return -1 on error. */
+	return ((error < 0) ? EDOOFUS : error);
+}
+
+static int
+vfs_mountroot_parse(char **conf, struct mount *mpdevfs)
+{
+	struct mount *mp;
+	int error;
+
+	mp = TAILQ_NEXT(mpdevfs, mnt_list);
+	error = (mp == NULL) ? 0 : EDOOFUS;
+	root_mount_mddev = -1;
+	root_mount_action = A_CONTINUE;
+	while (mp == NULL) {
+		error = parse_skipto(conf, CC_NONWHITESPACE);
+		if (error == PE_EOL) {
+			parse_advance(conf);
+			continue;
+		}
+		if (error < 0)
+			break;
+		switch (parse_peek(conf)) {
+		case '#':
+			error = parse_skipto(conf, '\n');
+			break;
+		case '.':
+			error = parse_directive(conf);
+			break;
+		default:
+			error = parse_mount(conf);
+			break;
+		}
+		if (error < 0)
+			break;
+		/* Ignore any trailing garbage on the line. */
+		if (parse_peek(conf) != '\n') {
+			printf("mountroot: advancing to next directive...\n");
+			(void)parse_skipto(conf, '\n');
+		}
+		mp = TAILQ_NEXT(mpdevfs, mnt_list);
+	}
+
+	printf("XXX: %s: error = %d, mpdevfs=%p, mp=%p\n", __func__,
+	    error, mpdevfs, mp);
+
+	return (error);
+}
+
+static void
+vfs_mountroot_conf0(struct sbuf *sb)
+{
+	char *s, *tok, *mnt, *opt;
+	int error;
+
+	sbuf_printf(sb, ".onfail panic\n");
+	sbuf_printf(sb, ".timeout 1\n");
+	if (boothowto & RB_ASKNAME)
+		sbuf_printf(sb, ".ask\n");
+#ifdef ROOTDEVNAME
+	if (boothowto & RB_DFLTROOT)
+		sbuf_printf(sb, "%s\n", ROOTDEVNAME);
+#endif
+	if (boothowto & RB_CDROM) {
+		sbuf_printf(sb, "cd9660:cd0\n");
+		sbuf_printf(sb, ".timeout 0\n");
+		sbuf_printf(sb, "cd9660:acd0\n");
+		sbuf_printf(sb, ".timeout 1\n");
+	}
+	s = getenv("vfs.root.mountfrom");
+	if (s != NULL) {
+		opt = getenv("vfs.root.mountfrom.options");
+		tok = s;
+		error = parse_token(&tok, &mnt);
+		while (!error) {
+			sbuf_printf(sb, "%s %s\n", mnt,
+			    (opt != NULL) ? opt : "");
+			free(mnt, M_TEMP);
+			error = parse_token(&tok, &mnt);
+		}
+		if (opt != NULL)
+			freeenv(opt);
+		freeenv(s);
+	}
+	if (rootdevnames[0] != NULL)
+		sbuf_printf(sb, "%s\n", rootdevnames[0]);
+	if (rootdevnames[1] != NULL)
+		sbuf_printf(sb, "%s\n", rootdevnames[1]);
+#ifdef ROOTDEVNAME
+	if (!(boothowto & RB_DFLTROOT))
+		sbuf_printf(sb, "%s\n", ROOTDEVNAME);
+#endif
+	if (!(boothowto & RB_ASKNAME))
+		sbuf_printf(sb, ".ask\n");
+}
+
+static int
+vfs_mountroot_readconf(struct thread *td, struct sbuf *sb)
+{
+	static char buf[128];
+	struct nameidata nd;
+	off_t ofs;
+	int error, flags;
+	int len, resid;
+	int vfslocked;
+
+	NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE,
+	    "/.mount.conf", td);
+	flags = FREAD;
+	error = vn_open(&nd, &flags, 0, NULL);
+	if (error)
+		return (error);
+
+	vfslocked = NDHASGIANT(&nd);
+	NDFREE(&nd, NDF_ONLY_PNBUF);
+	ofs = 0;
+	len = sizeof(buf) - 1;
+	while (1) {
+		error = vn_rdwr(UIO_READ, nd.ni_vp, buf, len, ofs,
+		    UIO_SYSSPACE, IO_NODELOCKED, td->td_ucred,
+		    NOCRED, &resid, td);
+		if (error)
+			break;
+		if (resid == len)
+			break;
+		buf[len - resid] = 0;
+		sbuf_printf(sb, "%s", buf);
+		ofs += len - resid;
+	}
+
+	VOP_UNLOCK(nd.ni_vp, 0);
+	vn_close(nd.ni_vp, FREAD, td->td_ucred, td);
+	VFS_UNLOCK_GIANT(vfslocked);
+	return (error);
+}
+
+static void
+vfs_mountroot_wait(void)
+{
+	struct root_hold_token *h;
+	struct timeval lastfail;
+	int curfail;
+
+	curfail = 0;
+	while (1) {
+		DROP_GIANT();
+		g_waitidle();
+		PICKUP_GIANT();
+		mtx_lock(&mountlist_mtx);
+		if (LIST_EMPTY(&root_holds)) {
+			mtx_unlock(&mountlist_mtx);
+			break;
+		}
+		if (ppsratecheck(&lastfail, &curfail, 1)) {
+			printf("Root mount waiting for:");
+			LIST_FOREACH(h, &root_holds, list)
+				printf(" %s", h->who);
+			printf("\n");
+		}
+		msleep(&root_holds, &mountlist_mtx, PZERO | PDROP, "roothold",
+		    hz);
+	}
+}
+
+void
+vfs_mountroot(void)
+{
+	struct mount *mp;
+	struct sbuf *sb;
+	struct thread *td;
+	char *conf;
+	time_t timebase;
+	int error;
+
+	td = curthread;
+
+	vfs_mountroot_wait();
+
+	sb = sbuf_new_auto();
+	vfs_mountroot_conf0(sb);
+	sbuf_finish(sb);
+
+	error = vfs_mountroot_devfs(td, &mp);
+	while (!error) {
+		conf = sbuf_data(sb);
+		printf("========\n%s========\n", conf);
+		error = vfs_mountroot_parse(&conf, mp);
+		if (!error) {
+			error = vfs_mountroot_shuffle(td, mp);
+			if (!error) {
+				sbuf_clear(sb);
+				error = vfs_mountroot_readconf(td, sb);
+				sbuf_finish(sb);
+			}
+		}
+	}
+
+	sbuf_delete(sb);
+
+	/*
+	 * Iterate over all currently mounted file systems and use
+	 * the time stamp found to check and/or initialize the RTC.
+	 * Call inittodr() only once and pass it the largest of the
+	 * timestamps we encounter.
+	 */
+	timebase = 0;
+	mtx_lock(&mountlist_mtx);
+	mp = TAILQ_FIRST(&mountlist);
+	while (mp != NULL) {
+		if (mp->mnt_time > timebase)
+			timebase = mp->mnt_time;
+		mp = TAILQ_NEXT(mp, mnt_list);
+	}
+	mtx_unlock(&mountlist_mtx);
+	inittodr(timebase);
+
+	/* Keep prison0's root in sync with the global rootvnode. */
+	mtx_lock(&prison0.pr_mtx);
+	prison0.pr_root = rootvnode;
+	vref(prison0.pr_root);
+	mtx_unlock(&prison0.pr_mtx);
+
+	mtx_lock(&mountlist_mtx);
+	atomic_store_rel_int(&root_mount_complete, 1);
+	wakeup(&root_mount_complete);
+	mtx_unlock(&mountlist_mtx);
+}
+
+static struct mntarg *
+parse_mountroot_options(struct mntarg *ma, const char *options)
+{
+	char *p;
+	char *name, *name_arg;
+	char *val, *val_arg;
+	char *opts;
+
+	if (options == NULL || options[0] == '\0')
+		return (ma);
+
+	p = opts = strdup(options, M_MOUNT);
+	if (opts == NULL) {
+		return (ma);
+	}
+
+	while ((name = strsep(&p, ",")) != NULL) {
+		if (name[0] == '\0')
+			break;
+
+		val = strchr(name, '=');
+		if (val != NULL) {
+			*val = '\0';
+			++val;
+		}
+		if (strcmp(name, "rw") == 0 ||
+		    strcmp(name, "noro") == 0) {
+			/*
+			 * The first time we mount the root file system,
+			 * we need to mount 'ro', so we need to ignore
+			 * 'rw' and 'noro' mount options.
+			 */
+			continue;
+		}
+		name_arg = strdup(name, M_MOUNT);
+		val_arg = NULL;
+		if (val != NULL)
+			val_arg = strdup(val, M_MOUNT);
+
+		ma = mount_arg(ma, name_arg, val_arg,
+		    (val_arg != NULL ? -1 : 0));
+	}
+	free(opts, M_MOUNT);
+	return (ma);
+}
Index: kern/vfs_mount.c
===================================================================
--- kern/vfs_mount.c	(revision 41)
+++ kern/vfs_mount.c	(revision 49)
@@ -67,16 +67,10 @@
 #include <security/audit/audit.h>
 #include <security/mac/mac_framework.h>
 
-#include "opt_rootdevname.h"
-
-#define	ROOTNAME		"root_device"
 #define	VFS_MOUNTARG_SIZE_MAX	(1024 * 64)
 
-static void	set_rootvnode(void);
 static int	vfs_domount(struct thread *td, const char *fstype,
 		    char *fspath, int fsflags, void *fsdata);
-static int	vfs_mountroot_ask(void);
-static int	vfs_mountroot_try(const char *mountfrom, const char *options);
 static void	free_mntarg(struct mntarg *ma);
 
 static int	usermount = 0;
@@ -95,31 +89,6 @@
 MTX_SYSINIT(mountlist, &mountlist_mtx, "mountlist", MTX_DEF);
 
 /*
- * The vnode of the system's root (/ in the filesystem, without chroot
- * active.)
- */
-struct vnode	*rootvnode;
-
-/*
- * The root filesystem is detailed in the kernel environment variable
- * vfs.root.mountfrom, which is expected to be in the general format
- *
- * <vfsname>:[<path>][	<vfsname>:[<path>] ...]
- * vfsname   := the name of a VFS known to the kernel and capable
- *              of being mounted as root
- * path      := disk device name or other data used by the filesystem
- *              to locate its physical store
- *
- * If the environment variable vfs.root.mountfrom is a space separated list,
- * each list element is tried in turn and the root filesystem will be mounted
- * from the first one that suceeds.
- *
- * The environment variable vfs.root.mountfrom.options is a comma delimited
- * set of string mount options.  These mount options must be parseable
- * by nmount() in the kernel.
- */
-
-/*
  * Global opts, taken by all filesystems
  */
 static const char *global_opts[] = {
@@ -133,22 +102,36 @@
 	NULL
 };
 
-/*
- * The root specifiers we will try if RB_CDROM is specified.
- */
-static char *cdrom_rootdevnames[] = {
-	"cd9660:cd0",
-	"cd9660:acd0",
-	NULL
-};
+static int
+mount_init(void *mem, int size, int flags)
+{
+	struct mount *mp;
 
-/* legacy find-root code */
-char		*rootdevnames[2] = {NULL, NULL};
-#ifndef ROOTDEVNAME
-#  define ROOTDEVNAME NULL
-#endif
-static const char	*ctrootdevname = ROOTDEVNAME;
+	mp = (struct mount *)mem;
+	mtx_init(&mp->mnt_mtx, "struct mount mtx", NULL, MTX_DEF);
+	lockinit(&mp->mnt_explock, PVFS, "explock", 0, 0);
+	return (0);
+}
 
+static void
+mount_fini(void *mem, int size)
+{
+	struct mount *mp;
+
+	mp = (struct mount *)mem;
+	lockdestroy(&mp->mnt_explock);
+	mtx_destroy(&mp->mnt_mtx);
+}
+
+static void
+vfs_mount_init(void *dummy __unused)
+{
+
+	mount_zone = uma_zcreate("Mountpoints", sizeof(struct mount), NULL,
+	    NULL, mount_init, mount_fini, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
+}
+SYSINIT(vfs_mount, SI_SUB_VFS, SI_ORDER_ANY, vfs_mount_init, NULL);
+
 /*
  * ---------------------------------------------------------------------
  * Functions for building and sanitizing the mount options
@@ -452,27 +435,6 @@
 	MNT_IUNLOCK(mp);
 }
 
-static int
-mount_init(void *mem, int size, int flags)
-{
-	struct mount *mp;
-
-	mp = (struct mount *)mem;
-	mtx_init(&mp->mnt_mtx, "struct mount mtx", NULL, MTX_DEF);
-	lockinit(&mp->mnt_explock, PVFS, "explock", 0, 0);
-	return (0);
-}
-
-static void
-mount_fini(void *mem, int size)
-{
-	struct mount *mp;
-
-	mp = (struct mount *)mem;
-	lockdestroy(&mp->mnt_explock);
-	mtx_destroy(&mp->mnt_mtx);
-}
-
 /*
  * Allocate and initialize the mount point struct.
  */
@@ -1343,269 +1305,6 @@
 }
 
 /*
- * ---------------------------------------------------------------------
- * Mounting of root filesystem
- *
- */
-
-struct root_hold_token {
-	const char			*who;
-	LIST_ENTRY(root_hold_token)	list;
-};
-
-static LIST_HEAD(, root_hold_token)	root_holds =
-    LIST_HEAD_INITIALIZER(root_holds);
-
-static int root_mount_complete;
-
-/*
- * Hold root mount.
- */
-struct root_hold_token *
-root_mount_hold(const char *identifier)
-{
-	struct root_hold_token *h;
-
-	if (root_mounted())
-		return (NULL);
-
-	h = malloc(sizeof *h, M_DEVBUF, M_ZERO | M_WAITOK);
-	h->who = identifier;
-	mtx_lock(&mountlist_mtx);
-	LIST_INSERT_HEAD(&root_holds, h, list);
-	mtx_unlock(&mountlist_mtx);
-	return (h);
-}
-
-/*
- * Release root mount.
- */
-void
-root_mount_rel(struct root_hold_token *h)
-{
-
-	if (h == NULL)
-		return;
-	mtx_lock(&mountlist_mtx);
-	LIST_REMOVE(h, list);
-	wakeup(&root_holds);
-	mtx_unlock(&mountlist_mtx);
-	free(h, M_DEVBUF);
-}
-
-/*
- * Wait for all subsystems to release root mount.
- */
-static void
-root_mount_prepare(void)
-{
-	struct root_hold_token *h;
-	struct timeval lastfail;
-	int curfail = 0;
-
-	for (;;) {
-		DROP_GIANT();
-		g_waitidle();
-		PICKUP_GIANT();
-		mtx_lock(&mountlist_mtx);
-		if (LIST_EMPTY(&root_holds)) {
-			mtx_unlock(&mountlist_mtx);
-			break;
-		}
-		if (ppsratecheck(&lastfail, &curfail, 1)) {
-			printf("Root mount waiting for:");
-			LIST_FOREACH(h, &root_holds, list)
-				printf(" %s", h->who);
-			printf("\n");
-		}
-		msleep(&root_holds, &mountlist_mtx, PZERO | PDROP, "roothold",
-		    hz);
-	}
-}
-
-/*
- * Root was mounted, share the good news.
- */
-static void
-root_mount_done(void)
-{
-
-	/* Keep prison0's root in sync with the global rootvnode. */
-	mtx_lock(&prison0.pr_mtx);
-	prison0.pr_root = rootvnode;
-	vref(prison0.pr_root);
-	mtx_unlock(&prison0.pr_mtx);
-	/*
-	 * Use a mutex to prevent the wakeup being missed and waiting for
-	 * an extra 1 second sleep.
-	 */
-	mtx_lock(&mountlist_mtx);
-	root_mount_complete = 1;
-	wakeup(&root_mount_complete);
-	mtx_unlock(&mountlist_mtx);
-}
-
-/*
- * Return true if root is already mounted.
- */
-int
-root_mounted(void)
-{
-
-	/* No mutex is acquired here because int stores are atomic. */
-	return (root_mount_complete);
-}
-
-/*
- * Wait until root is mounted.
- */
-void
-root_mount_wait(void)
-{
-
-	/*
-	 * Panic on an obvious deadlock - the function can't be called from
-	 * a thread which is doing the whole SYSINIT stuff.
-	 */
-	KASSERT(curthread->td_proc->p_pid != 0,
-	    ("root_mount_wait: cannot be called from the swapper thread"));
-	mtx_lock(&mountlist_mtx);
-	while (!root_mount_complete) {
-		msleep(&root_mount_complete, &mountlist_mtx, PZERO, "rootwait",
-		    hz);
-	}
-	mtx_unlock(&mountlist_mtx);
-}
-
-static void
-set_rootvnode()
-{
-	struct proc *p;
-
-	if (VFS_ROOT(TAILQ_FIRST(&mountlist), LK_EXCLUSIVE, &rootvnode))
-		panic("Cannot find root vnode");
-
-	VOP_UNLOCK(rootvnode, 0);
-
-	p = curthread->td_proc;
-	FILEDESC_XLOCK(p->p_fd);
-
-	if (p->p_fd->fd_cdir != NULL)
-		vrele(p->p_fd->fd_cdir);
-	p->p_fd->fd_cdir = rootvnode;
-	VREF(rootvnode);
-
-	if (p->p_fd->fd_rdir != NULL)
-		vrele(p->p_fd->fd_rdir);
-	p->p_fd->fd_rdir = rootvnode;
-	VREF(rootvnode);
-
-	FILEDESC_XUNLOCK(p->p_fd);
-
-	EVENTHANDLER_INVOKE(mountroot);
-}
-
-/*
- * Mount /devfs as our root filesystem, but do not put it on the mountlist
- * yet.  Create a /dev -> / symlink so that absolute pathnames will lookup.
- */
-
-static void
-devfs_first(void)
-{
-	struct thread *td = curthread;
-	struct vfsoptlist *opts;
-	struct vfsconf *vfsp;
-	struct mount *mp = NULL;
-	int error;
-
-	vfsp = vfs_byname("devfs");
-	KASSERT(vfsp != NULL, ("Could not find devfs by name"));
-	if (vfsp == NULL)
-		return;
-
-	mp = vfs_mount_alloc(NULLVP, vfsp, "/dev", td->td_ucred);
-
-	error = VFS_MOUNT(mp);
-	KASSERT(error == 0, ("VFS_MOUNT(devfs) failed %d", error));
-	if (error)
-		return;
-
-	opts = malloc(sizeof(struct vfsoptlist), M_MOUNT, M_WAITOK);
-	TAILQ_INIT(opts);
-	mp->mnt_opt = opts;
-
-	mtx_lock(&mountlist_mtx);
-	TAILQ_INSERT_HEAD(&mountlist, mp, mnt_list);
-	mtx_unlock(&mountlist_mtx);
-
-	set_rootvnode();
-
-	error = kern_symlink(td, "/", "dev", UIO_SYSSPACE);
-	if (error)
-		printf("kern_symlink /dev -> / returns %d\n", error);
-}
-
-/*
- * Surgically move our devfs to be mounted on /dev.
- */
-
-static void
-devfs_fixup(struct thread *td)
-{
-	struct nameidata nd;
-	int error;
-	struct vnode *vp, *dvp;
-	struct mount *mp;
-
-	/* Remove our devfs mount from the mountlist and purge the cache */
-	mtx_lock(&mountlist_mtx);
-	mp = TAILQ_FIRST(&mountlist);
-	TAILQ_REMOVE(&mountlist, mp, mnt_list);
-	mtx_unlock(&mountlist_mtx);
-	cache_purgevfs(mp);
-
-	VFS_ROOT(mp, LK_EXCLUSIVE, &dvp);
-	VI_LOCK(dvp);
-	dvp->v_iflag &= ~VI_MOUNT;
-	VI_UNLOCK(dvp);
-	dvp->v_mountedhere = NULL;
-
-	/* Set up the real rootvnode, and purge the cache */
-	TAILQ_FIRST(&mountlist)->mnt_vnodecovered = NULL;
-	set_rootvnode();
-	cache_purgevfs(rootvnode->v_mount);
-
-	NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, "/dev", td);
-	error = namei(&nd);
-	if (error) {
-		printf("Lookup of /dev for devfs, error: %d\n", error);
-		return;
-	}
-	NDFREE(&nd, NDF_ONLY_PNBUF);
-	vp = nd.ni_vp;
-	if (vp->v_type != VDIR) {
-		vput(vp);
-	}
-	error = vinvalbuf(vp, V_SAVE, 0, 0);
-	if (error) {
-		vput(vp);
-	}
-	cache_purge(vp);
-	mp->mnt_vnodecovered = vp;
-	vp->v_mountedhere = mp;
-	mtx_lock(&mountlist_mtx);
-	TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list);
-	mtx_unlock(&mountlist_mtx);
-	VOP_UNLOCK(vp, 0);
-	vput(dvp);
-	vfs_unbusy(mp);
-
-	/* Unlink the no longer needed /dev/dev -> / symlink */
-	kern_unlink(td, "/dev/dev", UIO_SYSSPACE);
-}
-
-/*
  * Report errors during filesystem mounting.
  */
 void
@@ -1642,288 +1341,7 @@
 }
 
 /*
- * Find and mount the root filesystem
- */
-void
-vfs_mountroot(void)
-{
-	char *cp, *cpt, *options, *tmpdev;
-	int error, i, asked = 0;
-
-	options = NULL;
-
-	root_mount_prepare();
-
-	mount_zone = uma_zcreate("Mountpoints", sizeof(struct mount),
-	    NULL, NULL, mount_init, mount_fini,
-	    UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
-	devfs_first();
-
-	/*
-	 * We are booted with instructions to prompt for the root filesystem.
-	 */
-	if (boothowto & RB_ASKNAME) {
-		if (!vfs_mountroot_ask())
-			goto mounted;
-		asked = 1;
-	}
-
-	options = getenv("vfs.root.mountfrom.options");
-
-	/*
-	 * The root filesystem information is compiled in, and we are
-	 * booted with instructions to use it.
-	 */
-	if (ctrootdevname != NULL && (boothowto & RB_DFLTROOT)) {
-		if (!vfs_mountroot_try(ctrootdevname, options))
-			goto mounted;
-		ctrootdevname = NULL;
-	}
-
-	/*
-	 * We've been given the generic "use CDROM as root" flag.  This is
-	 * necessary because one media may be used in many different
-	 * devices, so we need to search for them.
-	 */
-	if (boothowto & RB_CDROM) {
-		for (i = 0; cdrom_rootdevnames[i] != NULL; i++) {
-			if (!vfs_mountroot_try(cdrom_rootdevnames[i], options))
-				goto mounted;
-		}
-	}
-
-	/*
-	 * Try to use the value read by the loader from /etc/fstab, or
-	 * supplied via some other means.  This is the preferred
-	 * mechanism.
-	 */
-	cp = getenv("vfs.root.mountfrom");
-	if (cp != NULL) {
-		cpt = cp;
-		while ((tmpdev = strsep(&cpt, " \t")) != NULL) {
-			error = vfs_mountroot_try(tmpdev, options);
-			if (error == 0) {
-				freeenv(cp);
-				goto mounted;
-			}
-		}
-		freeenv(cp);
-	}
-
-	/*
-	 * Try values that may have been computed by code during boot
-	 */
-	if (!vfs_mountroot_try(rootdevnames[0], options))
-		goto mounted;
-	if (!vfs_mountroot_try(rootdevnames[1], options))
-		goto mounted;
-
-	/*
-	 * If we (still) have a compiled-in default, try it.
-	 */
-	if (ctrootdevname != NULL)
-		if (!vfs_mountroot_try(ctrootdevname, options))
-			goto mounted;
-	/*
-	 * Everything so far has failed, prompt on the console if we haven't
-	 * already tried that.
-	 */
-	if (!asked)
-		if (!vfs_mountroot_ask())
-			goto mounted;
-
-	panic("Root mount failed, startup aborted.");
-
-mounted:
-	root_mount_done();
-	freeenv(options);
-}
-
-static struct mntarg *
-parse_mountroot_options(struct mntarg *ma, const char *options)
-{
-	char *p;
-	char *name, *name_arg;
-	char *val, *val_arg;
-	char *opts;
-
-	if (options == NULL || options[0] == '\0')
-		return (ma);
-
-	p = opts = strdup(options, M_MOUNT);
-	if (opts == NULL) {
-		return (ma);
-	} 
-
-	while((name = strsep(&p, ",")) != NULL) {
-		if (name[0] == '\0')
-			break;
-
-		val = strchr(name, '=');
-		if (val != NULL) {
-			*val = '\0';
-			++val;
-		}
-		if( strcmp(name, "rw") == 0 ||
-		    strcmp(name, "noro") == 0) {
-			/*
-			 * The first time we mount the root file system,
-			 * we need to mount 'ro', so We need to ignore
-			 * 'rw' and 'noro' mount options.
-			 */
-			continue;
-		}
-		name_arg = strdup(name, M_MOUNT);
-		val_arg = NULL;
-		if (val != NULL) 
-			val_arg = strdup(val, M_MOUNT);
-
-		ma = mount_arg(ma, name_arg, val_arg,
-		    (val_arg != NULL ? -1 : 0));
-	}
-	free(opts, M_MOUNT);
-	return (ma);
-}
-
-/*
- * Mount (mountfrom) as the root filesystem.
- */
-static int
-vfs_mountroot_try(const char *mountfrom, const char *options)
-{
-	struct mount	*mp;
-	struct mntarg	*ma;
-	char		*vfsname, *path;
-	time_t		timebase;
-	int		error;
-	char		patt[32];
-	char		errmsg[255];
-
-	vfsname = NULL;
-	path    = NULL;
-	mp      = NULL;
-	ma	= NULL;
-	error   = EINVAL;
-	bzero(errmsg, sizeof(errmsg));
-
-	if (mountfrom == NULL)
-		return (error);		/* don't complain */
-	printf("Trying to mount root from %s\n", mountfrom);
-
-	/* parse vfs name and path */
-	vfsname = malloc(MFSNAMELEN, M_MOUNT, M_WAITOK);
-	path = malloc(MNAMELEN, M_MOUNT, M_WAITOK);
-	vfsname[0] = path[0] = 0;
-	sprintf(patt, "%%%d[a-z0-9]:%%%ds", MFSNAMELEN, MNAMELEN);
-	if (sscanf(mountfrom, patt, vfsname, path) < 1)
-		goto out;
-
-	if (path[0] == '\0')
-		strcpy(path, ROOTNAME);
-
-	ma = mount_arg(ma, "fstype", vfsname, -1);
-	ma = mount_arg(ma, "fspath", "/", -1);
-	ma = mount_arg(ma, "from", path, -1);
-	ma = mount_arg(ma, "errmsg", errmsg, sizeof(errmsg));
-	ma = mount_arg(ma, "ro", NULL, 0);
-	ma = parse_mountroot_options(ma, options);
-	error = kernel_mount(ma, MNT_ROOTFS);
-
-	if (error == 0) {
-		/*
-		 * We mount devfs prior to mounting the / FS, so the first
-		 * entry will typically be devfs.
-		 */
-		mp = TAILQ_FIRST(&mountlist);
-		KASSERT(mp != NULL, ("%s: mountlist is empty", __func__));
-
-		/*
-		 * Iterate over all currently mounted file systems and use
-		 * the time stamp found to check and/or initialize the RTC.
-		 * Typically devfs has no time stamp and the only other FS
-		 * is the actual / FS.
-		 * Call inittodr() only once and pass it the largest of the
-		 * timestamps we encounter.
-		 */
-		timebase = 0;
-		do {
-			if (mp->mnt_time > timebase)
-				timebase = mp->mnt_time;
-			mp = TAILQ_NEXT(mp, mnt_list);
-		} while (mp != NULL);
-		inittodr(timebase);
-
-		devfs_fixup(curthread);
-	}
-
-	if (error != 0 ) {
-		printf("ROOT MOUNT ERROR: %s\n", errmsg);
-		printf("If you have invalid mount options, reboot, and ");
-		printf("first try the following from\n");
-		printf("the loader prompt:\n\n");
-		printf("     set vfs.root.mountfrom.options=rw\n\n");
-		printf("and then remove invalid mount options from ");
-		printf("/etc/fstab.\n\n");
-	}
-out:
-	free(path, M_MOUNT);
-	free(vfsname, M_MOUNT);
-	return (error);
-}
-
-/*
  * ---------------------------------------------------------------------
- * Interactive root filesystem selection code.
- */
-
-static int
-vfs_mountroot_ask(void)
-{
-	char name[128];
-	char *mountfrom;
-	char *options;
-
-	for(;;) {
-		printf("Loader variables:\n");
-		printf("vfs.root.mountfrom=");
-		mountfrom = getenv("vfs.root.mountfrom");
-		if (mountfrom != NULL) {
-			printf("%s", mountfrom);
-		}
-		printf("\n");
-		printf("vfs.root.mountfrom.options=");
-		options = getenv("vfs.root.mountfrom.options");
-		if (options != NULL) {
-			printf("%s", options);
-		}
-		printf("\n");
-		freeenv(mountfrom);
-		freeenv(options);
-		printf("\nManual root filesystem specification:\n");
-		printf("  <fstype>:<device>  Mount <device> using filesystem <fstype>\n");
-		printf("                       eg. ufs:/dev/da0s1a\n");
-		printf("                       eg. cd9660:/dev/acd0\n");
-		printf("                       This is equivalent to: ");
-		printf("mount -t cd9660 /dev/acd0 /\n"); 
-		printf("\n");
-		printf("  ?                  List valid disk boot devices\n");
-		printf("  <empty line>       Abort manual input\n");
-		printf("\nmountroot> ");
-		gets(name, sizeof(name), 1);
-		if (name[0] == '\0')
-			return (1);
-		if (name[0] == '?') {
-			printf("\nList of GEOM managed disk devices:\n  ");
-			g_dev_print();
-			continue;
-		}
-		if (!vfs_mountroot_try(name, NULL))
-			return (0);
-	}
-}
-
-/*
- * ---------------------------------------------------------------------
  * Functions for querying mount options/arguments from filesystems.
  */
 
@@ -1965,15 +1383,17 @@
 			continue;
 		snprintf(errmsg, sizeof(errmsg),
 		    "mount option <%s> is unknown", p);
-		printf("%s\n", errmsg);
 		ret = EINVAL;
 	}
 	if (ret != 0) {
 		TAILQ_FOREACH(opt, opts, link) {
 			if (strcmp(opt->name, "errmsg") == 0) {
 				strncpy((char *)opt->value, errmsg, opt->len);
+				break;
 			}
 		}
+		if (opt == NULL)
+			printf("%s\n", errmsg);
 	}
 	return (ret);
 }
Index: dev/md/md.c
===================================================================
--- dev/md/md.c	(revision 41)
+++ dev/md/md.c	(revision 49)
@@ -911,18 +911,26 @@
 {
 	struct vattr vattr;
 	struct nameidata nd;
+	char *fname;
 	int error, flags, vfslocked;
 
-	error = copyinstr(mdio->md_file, sc->file, sizeof(sc->file), NULL);
-	if (error != 0)
-		return (error);
-	flags = FREAD|FWRITE;
 	/*
-	 * If the user specified that this is a read only device, unset the
-	 * FWRITE mask before trying to open the backing store.
+	 * Kernel-originated requests must have the filename appended
+	 * to the mdio structure to protect against malicious software.
 	 */
-	if ((mdio->md_options & MD_READONLY) != 0)
-		flags &= ~FWRITE;
+	fname = mdio->md_file;
+	if ((void *)fname != (void *)(mdio + 1)) {
+		error = copyinstr(fname, sc->file, sizeof(sc->file), NULL);
+		if (error != 0)
+			return (error);
+	} else
+		strlcpy(sc->file, fname, sizeof(sc->file));
+
+	/*
+	 * If the user specified that this is a read only device, don't
+	 * set the FWRITE mask before trying to open the backing store.
+	 */
+	flags = FREAD | ((mdio->md_options & MD_READONLY) ? 0 : FWRITE);
 	NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, sc->file, td);
 	error = vn_open(&nd, &flags, 0, NULL);
 	if (error != 0)

--Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)--

From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 10:08:46 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C986C1065694;
	Tue, 28 Sep 2010 10:08:46 +0000 (UTC)
	(envelope-from alexander@leidinger.net)
Received: from mail.ebusiness-leidinger.de (mail.ebusiness-leidinger.de
	[217.11.53.44])
	by mx1.freebsd.org (Postfix) with ESMTP id 774308FC24;
	Tue, 28 Sep 2010 10:08:46 +0000 (UTC)
Received: from outgoing.leidinger.net (p57B3B90B.dip.t-dialin.net
	[87.179.185.11])
	by mail.ebusiness-leidinger.de (Postfix) with ESMTPSA id 8143984400A;
	Tue, 28 Sep 2010 11:49:18 +0200 (CEST)
Received: from webmail.leidinger.net (unknown [IPv6:fd73:10c7:2053:1::2:102])
	by outgoing.leidinger.net (Postfix) with ESMTP id 6BDAC193D;
	Tue, 28 Sep 2010 11:49:15 +0200 (CEST)
Received: (from www@localhost)
	by webmail.leidinger.net (8.14.4/8.13.8/Submit) id o8S9nCgW066608;
	Tue, 28 Sep 2010 11:49:12 +0200 (CEST)
	(envelope-from Alexander@Leidinger.net)
Received: from pslux.ec.europa.eu (pslux.ec.europa.eu [158.169.9.14]) by
	webmail.leidinger.net (Horde Framework) with HTTP; Tue, 28 Sep 2010
	11:49:12 +0200
Message-ID: <20100928114912.17443a2o7j71kpaw@webmail.leidinger.net>
Date: Tue, 28 Sep 2010 11:49:12 +0200
From: Alexander Leidinger <Alexander@Leidinger.net>
To: John Baldwin <jhb@freebsd.org>
References: <201009211507.o8LF7iVv097676@svn.freebsd.org>
	<alpine.LNX.2.00.1009231841500.23791@ury.york.ac.uk>
	<20100924225352.GD49476@server.vk2pj.dyndns.org>
	<201009270928.47232.jhb@freebsd.org>
In-Reply-To: <201009270928.47232.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain;
 charset=UTF-8;
 DelSp="Yes";
 format="flowed"
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
User-Agent: Dynamic Internet Messaging Program (DIMP) H3 (1.1.4)
X-EBL-MailScanner-Information: Please contact the ISP for more information
X-EBL-MailScanner-ID: 8143984400A.A6EFC
X-EBL-MailScanner: Found to be clean
X-EBL-MailScanner-SpamCheck: not spam, spamhaus-ZEN,
	SpamAssassin (not cached, score=1.351, required 6,
	autolearn=disabled, RDNS_NONE 1.27, TW_SV 0.08)
X-EBL-MailScanner-SpamScore: s
X-EBL-MailScanner-From: alexander@leidinger.net
X-EBL-MailScanner-Watermark: 1286272160.18906@ARPINEuCoVZS+Y+Z0zA5pA
X-EBL-Spam-Status: No
X-Mailman-Approved-At: Tue, 28 Sep 2010 11:19:10 +0000
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
	src-committers@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: svn commit: r212964 - head/sys/kern
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2010 10:08:46 -0000

Quoting John Baldwin <jhb@freebsd.org> (from Mon, 27 Sep 2010 09:28:47 -0400):

>> savecore already has support for a 'minfree' file to prevent
>> crashdumps filling the crashdir.  Maybe the default install should
>> include a minfree set to (say) 512MB.
>
> The one problem with this approach is that it implements a FIFO instead of a
> LIFO.  I want the N most recent crashdumps to be saved, not the first N.

Check the size in the shell script first and remove the older ones
("ls -1t | grep pattern | tail +<N+1>" lists the candidates).
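If the same LIFO retention were done inside savecore(8) itself instead of in the rc script, the selection could be sketched roughly as below. This is only an illustration under assumptions. `struct dump_entry` and `stale_dumps()` are hypothetical names, not existing savecore code, and the caller is assumed to have already collected each dump's name and mtime. The function sorts newest first, like "ls -t". Everything past the first `keep` entries is then a removal candidate, which is what "tail +<N+1>" selects.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical record for one saved crash dump (e.g. "vmcore.3"). */
struct dump_entry {
	const char *name;
	time_t mtime;
};

/* qsort(3) comparator: newest first, ties broken by name, like "ls -t". */
static int
cmp_newest_first(const void *a, const void *b)
{
	const struct dump_entry *da = a, *db = b;

	if (da->mtime > db->mtime)
		return (-1);
	if (da->mtime < db->mtime)
		return (1);
	return (strcmp(da->name, db->name));
}

/*
 * Sort the entries newest first and return how many are stale.
 * After the call, entries[keep .. n-1] are the removal candidates,
 * i.e. everything except the "keep" most recent dumps (LIFO retention).
 */
static size_t
stale_dumps(struct dump_entry *entries, size_t n, size_t keep)
{
	qsort(entries, n, sizeof(entries[0]), cmp_newest_first);
	return (n > keep ? n - keep : 0);
}
```

A caller would then remove the files named by the trailing entries. Unlike a minfree threshold alone, the newest dumps always survive.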

Bye,
Alexander.

-- 
Applause, n.:
	The echo of a platitude from the mouth of a fool.
		-- Ambrose Bierce, "The Devil's Dictionary"

http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org       netchild @ FreeBSD.org  : PGP ID = 72077137

From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 15:26:05 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0B05A10656D8;
	Tue, 28 Sep 2010 15:25:59 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 52B778FC1D;
	Tue, 28 Sep 2010 15:25:59 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 0431446B9B;
	Tue, 28 Sep 2010 11:25:59 -0400 (EDT)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id 2916F8A050;
	Tue, 28 Sep 2010 11:25:58 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Alexander Leidinger <Alexander@leidinger.net>
Date: Tue, 28 Sep 2010 09:37:25 -0400
User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; )
References: <201009211507.o8LF7iVv097676@svn.freebsd.org>
	<201009270928.47232.jhb@freebsd.org>
	<20100928114912.17443a2o7j71kpaw@webmail.leidinger.net>
In-Reply-To: <20100928114912.17443a2o7j71kpaw@webmail.leidinger.net>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <201009280937.25619.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Tue, 28 Sep 2010 11:25:58 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
	src-committers@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: svn commit: r212964 - head/sys/kern
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2010 15:26:05 -0000

On Tuesday, September 28, 2010 5:49:12 am Alexander Leidinger wrote:
> Quoting John Baldwin <jhb@freebsd.org> (from Mon, 27 Sep 2010 09:28:47 -0400):
> 
> >> savecore already has support for a 'minfree' file to prevent
> >> crashdumps filling the crashdir.  Maybe the default install should
> >> include a minfree set to (say) 512MB.
> >
> > The one problem with this approach is that it implements a FIFO instead of a LIFO.  I
> > want the N most recent crashdumps to be saved, not the first N.
> 
> Check the size in the shell script beforehand and remove older ones ("ls -1t  
> | grep pattern | tail +<N+1>" gives you possible candidates).

Yes, but the point is that you want that logic in savecore as an alternative to
the current minfree logic.
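The point above is that savecore itself, not a wrapper script, should keep the N most recent dumps (LIFO) instead of refusing new ones once minfree is hit (FIFO). A minimal sketch of just the selection step in C, assuming the saved dumps have already been enumerated together with their modification times; struct dump_entry and select_stale_dumps() are hypothetical names, not existing savecore code, and actually unlinking the files is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical record for one saved dump (vmcore.N and friends). */
struct dump_entry {
	int	unit;	/* N in vmcore.N */
	time_t	mtime;	/* last modification time */
	int	remove;	/* set to 1 if the dump should be deleted */
};

/* qsort(3) comparator: newest first (descending mtime). */
static int
cmp_newest_first(const void *a, const void *b)
{
	const struct dump_entry *da = a, *db = b;

	if (da->mtime > db->mtime)
		return (-1);
	if (da->mtime < db->mtime)
		return (1);
	return (0);
}

/*
 * LIFO retention: sort the entries newest first, keep the first 'keep'
 * and mark the rest for removal.  Returns how many were marked.
 */
static size_t
select_stale_dumps(struct dump_entry *d, size_t n, size_t keep)
{
	size_t i, marked = 0;

	if (n == 0)
		return (0);
	assert(d != NULL);
	qsort(d, n, sizeof(*d), cmp_newest_first);
	for (i = 0; i < n; i++) {
		d[i].remove = (i >= keep);
		marked += (size_t)d[i].remove;
	}
	return (marked);
}
```

Since qsort(3) is not stable, dumps with identical mtimes end up in unspecified order; a real implementation could break ties on the dump number.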

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 15:49:43 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3EF18106566B
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 15:49:43 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout030.mac.com (asmtpout030.mac.com [17.148.16.105])
	by mx1.freebsd.org (Postfix) with ESMTP id 262B68FC15
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 15:49:42 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp030.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L9G00C3PRYGXAA0@asmtp030.mac.com> for
	freebsd-arch@freebsd.org; Tue, 28 Sep 2010 08:49:29 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1009280098
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-09-28_10:2010-09-28, 2010-09-28,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <CD1BDE8F-29BE-4A82-B0D9-8849FF3C1A1F@mac.com>
Date: Tue, 28 Sep 2010 08:48:53 -0700
Message-id: <DE680426-01D0-4CB6-B1CB-3B3789C99068@mac.com>
References: <CD1BDE8F-29BE-4A82-B0D9-8849FF3C1A1F@mac.com>
To: "freebsd-arch@FreeBSD.org Arch" <freebsd-arch@freebsd.org>
X-Mailer: Apple Mail (2.1081)
Subject: Re: [patch] functional prototype of root mount enhancement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2010 15:49:43 -0000


On Sep 27, 2010, at 11:22 PM, Marcel Moolenaar wrote:
> 
> The code has some debug output still, which is helpful to
> see what's going on internally. From a boot (with a
> /.mount.conf present on ufs:/dev/ad0s1a):

A more interesting example is using an ISO image that lives on a UFS
file system as root (in this case the FreeBSD 8.1 livefs):

	:
Root mount waiting for: usbus1
ugen1.2: <Apple Inc.> at usbus1
========
.onfail panic
.timeout 1
ufs:/dev/ad0s1a rw
.ask
========
Trying to mount root from ufs:/dev/ad0s1a [rw]...
XXX: vfs_mountroot_parse: error = 0, mpdevfs=0xc3fa4000, mp=0xc3fa3c94
========
.onfail continue
.md /livefs.iso
#ufs:/dev/da0a
.ask
========
md0 attached to /livefs.iso

Loader variables:
  vfs.root.mountfrom=ufs:/dev/ad0s1a
  vfs.root.mountfrom.options=rw

Manual root filesystem specification:
  <fstype>:<device> [options]
      Mount <device> using filesystem <fstype>
      and with the specified (optional) option list.

    eg. ufs:/dev/da0s1a
        cd9660:/dev/acd0 ro
          (which is equivalent to: mount -t cd9660 -o ro /dev/acd0 /)

  ?                  List valid disk boot devices
  <empty line>       Abort manual input

mountroot> ?

List of GEOM managed disk devices:
  da0p2 da0p1 da0 acd0 ad0s1a ad0s1 ad0

mountroot> .

mountroot> ?

List of GEOM managed disk devices:
  md0 da0p2 da0p1 da0 acd0 ad0s1a ad0s1 ad0

mountroot> cd9660:/dev/md#
Trying to mount root from cd9660:/dev/md0 []...
XXX: vfs_mountroot_parse: error = 0, mpdevfs=0xc3fa4000, mp=0xc3fa3a10
lock order reversal:
 1st 0xc3e95270 isofs (isofs) @ /usr/src/sys/fs/cd9660/cd9660_vfsops.c:694
 2nd 0xc3e959c4 ufs (ufs) @ /usr/src/sys/kern/vfs_subr.c:2221
KDB: stack backtrace:
	:

# mount
/dev/md0 on / (cd9660, local, read-only)
/dev/ad0s1a on /mnt (ufs, local, read-only)
devfs on /dev (devfs, local)
/dev/md1 on /var (ufs, local)
/dev/md2 on /tmp (ufs, local)

(md1 & md2 are created by /etc/rc)
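The two .mount.conf listings above contain three kinds of lines: directives starting with '.', '#' comments, and root specifications of the form fstype:device [options] (the same syntax the mountroot> prompt describes). A minimal C sketch of classifying and splitting such lines, inferred only from those listings rather than taken from the patch; mc_classify() and mc_split_rootspec() are hypothetical names, and lines are assumed to arrive without their trailing newline:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Kinds of line seen in the .mount.conf listings (hypothetical names). */
enum mc_kind { MC_EMPTY, MC_COMMENT, MC_DIRECTIVE, MC_ROOTSPEC, MC_INVALID };

static const char *
mc_skip_space(const char *s)
{
	while (isspace((unsigned char)*s))
		s++;
	return (s);
}

/* Classify one line: ".directive ...", "# comment" or "fstype:device [opts]". */
static enum mc_kind
mc_classify(const char *line)
{
	const char *p, *colon;

	assert(line != NULL);
	line = mc_skip_space(line);
	if (*line == '\0')
		return (MC_EMPTY);
	if (*line == '#')
		return (MC_COMMENT);
	if (*line == '.')
		return (MC_DIRECTIVE);
	colon = strchr(line, ':');
	if (colon == NULL || colon == line)
		return (MC_INVALID);
	for (p = line; p < colon; p++)	/* no blanks inside the fstype */
		if (isspace((unsigned char)*p))
			return (MC_INVALID);
	return (MC_ROOTSPEC);
}

/*
 * Split "fstype:device [options]" into its parts.  Returns 0 on success;
 * *opts points into 'line' and is "" when no options are given.
 */
static int
mc_split_rootspec(const char *line, char *fstype, size_t fslen,
    char *dev, size_t devlen, const char **opts)
{
	const char *colon, *end;
	size_t n;

	if (mc_classify(line) != MC_ROOTSPEC)
		return (-1);
	line = mc_skip_space(line);
	colon = strchr(line, ':');
	n = (size_t)(colon - line);
	if (n >= fslen)
		return (-1);
	memcpy(fstype, line, n);
	fstype[n] = '\0';
	for (end = colon + 1; *end != '\0' && !isspace((unsigned char)*end); end++)
		;
	n = (size_t)(end - (colon + 1));
	if (n == 0 || n >= devlen)
		return (-1);
	memcpy(dev, colon + 1, n);
	dev[n] = '\0';
	*opts = mc_skip_space(end);
	return (0);
}
```

Note that '#' is treated as a comment only at the start of a line, so a device such as /dev/md# (as typed at the prompt above) still parses as a root specification.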

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 17:31:26 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 00EC21065672
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 17:31:26 +0000 (UTC)
	(envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 9C0208FC1C
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 17:31:25 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o8SHQmKu092682;
	Tue, 28 Sep 2010 11:26:49 -0600 (MDT) (envelope-from imp@bsdimp.com)
Date: Tue, 28 Sep 2010 11:27:01 -0600 (MDT)
Message-Id: <20100928.112701.539398516089932776.imp@bsdimp.com>
To: xcllnt@mac.com
From: "M. Warner Losh" <imp@bsdimp.com>
In-Reply-To: <CD1BDE8F-29BE-4A82-B0D9-8849FF3C1A1F@mac.com>
References: <CD1BDE8F-29BE-4A82-B0D9-8849FF3C1A1F@mac.com>
X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@freebsd.org
Subject: Re: [patch] functional prototype of root mount enhancement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2010 17:31:26 -0000

Hey Marcel,

haven't had a chance to look through this in detail yet.  One thing
that has always bugged me is why hitting the prompt has to be the end
of discovery...  Why can't we have a method to listen for new GEOM
providers being advertised and then 'short circuit' the ask prompt if
/dev/da0s1a or /dev/ufs/rootfs or whatever it originally wanted
appears?

Maybe this isn't .ask, but some other verb in your language?

Warner

From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 18:24:49 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9BD341065674
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 18:24:49 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout030.mac.com (asmtpout030.mac.com [17.148.16.105])
	by mx1.freebsd.org (Postfix) with ESMTP id 822788FC12
	for <freebsd-arch@freebsd.org>; Tue, 28 Sep 2010 18:24:49 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp030.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008;
	32bit)) with ESMTPSA id <0L9G00L3NZ55IW70@asmtp030.mac.com> for
	freebsd-arch@freebsd.org; Tue, 28 Sep 2010 11:24:43 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam
	adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1009280125
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-09-28_11:2010-09-28, 2010-09-28,
	1970-01-01 signatures=0
From: Marcel Moolenaar <xcllnt@mac.com>
In-reply-to: <20100928.112701.539398516089932776.imp@bsdimp.com>
Date: Tue, 28 Sep 2010 11:24:41 -0700
Message-id: <4E910770-812B-4F04-B026-E3DB5EDEE000@mac.com>
References: <CD1BDE8F-29BE-4A82-B0D9-8849FF3C1A1F@mac.com>
	<20100928.112701.539398516089932776.imp@bsdimp.com>
To: "M. Warner Losh" <imp@bsdimp.com>
X-Mailer: Apple Mail (2.1081)
Cc: freebsd-arch@freebsd.org
Subject: Re: [patch] functional prototype of root mount enhancement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2010 18:24:49 -0000


On Sep 28, 2010, at 10:27 AM, M. Warner Losh wrote:

> Hey Marcel,
> 
> haven't had a chance to look through this in detail yet.  One thing
> that has always bugged me is why hitting the prompt has to be the end
> of discovery...  Why can't we have a method to listen for new GEOM
> providers being advertised and then 'short circuit' the ask prompt if
> /dev/da0s1a or /dev/ufs/rootfs or whatever it originally wanted
> appears?
> 
> Maybe this isn't .ask, but some other verb in your language?

Hmmm... I think we should give .ask an option so that it can be
made conditional upon a key press then. I don't think it's nice
to print all that stuff, present a prompt, wait for input and
then shortly after continue booting anyway because some device
showed up.

Say we have ".ask on-key-press", which basically nullifies the
.ask directive (by implicitly failing to mount) unless a key was
pressed. At that time we actually print the help, show a prompt
and wait for input. This in combination with ".onfail retry"
allows us to cycle through the alternatives until 1) a key was
pressed and we'll drop at the interactive mount prompt or 2) a
device we've been waiting for appears and we can mount root.

Would that address your case?
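The ".ask on-key-press" plus ".onfail retry" combination described above amounts to a polling loop: try each alternative in turn, and only turn the .ask step into an interactive prompt if a key was pressed. A minimal C sketch of that control flow, assuming hypothetical try_mount/key_pressed hooks in place of the real kernel mount and console code; the max_passes bound and the small simulated environment (struct sim) exist only so the example terminates and can be exercised:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum root_result { ROOT_MOUNTED, ROOT_PROMPT, ROOT_GAVE_UP };

/* Hypothetical hooks standing in for the kernel's mount and console code. */
struct root_hooks {
	bool	(*try_mount)(const char *spec, void *arg);
	bool	(*key_pressed)(void *arg);
	void	*arg;
};

/*
 * Cycle through the alternatives until one mounts (ROOT_MOUNTED, with
 * *which set to its index) or a key press turns .ask into a real
 * prompt (ROOT_PROMPT).  max_passes bounds the loop.
 */
static enum root_result
mountroot_cycle(const char *const *specs, size_t nspecs,
    const struct root_hooks *h, unsigned max_passes, size_t *which)
{
	unsigned pass;
	size_t i;

	assert(h != NULL && which != NULL);
	for (pass = 0; pass < max_passes; pass++) {
		for (i = 0; i < nspecs; i++) {
			if (h->try_mount(specs[i], h->arg)) {
				*which = i;
				return (ROOT_MOUNTED);
			}
		}
		/* The .ask step: a silent failure unless a key was pressed. */
		if (h->key_pressed(h->arg))
			return (ROOT_PROMPT);
		/* A real implementation would wait here before retrying. */
	}
	return (ROOT_GAVE_UP);
}

/* Simulation: 'dev' shows up at pass 'appear_at'; a key is hit at 'key_at'. */
struct sim {
	const char	*dev;
	unsigned	appear_at;
	unsigned	key_at;
	unsigned	pass;	/* advanced by each key_pressed() poll */
};

static bool
sim_try_mount(const char *spec, void *arg)
{
	const struct sim *s = arg;

	return (strcmp(spec, s->dev) == 0 && s->pass >= s->appear_at);
}

static bool
sim_key_pressed(void *arg)
{
	struct sim *s = arg;

	return (s->pass++ >= s->key_at);
}
```

With this shape, the help text and prompt are only printed on the ROOT_PROMPT path, which matches the concern about not showing a prompt and then continuing anyway when a device appears.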

Another feature we may need is the alternative: if you boot
with -C, we'll try cd9660:/dev/cd0 and cd9660:/dev/acd0. What
we really want to do is:
	.select /dev/cd0 /dev/acd0
	cd9660:%selected%
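The proposed .select directive boils down to "take the first candidate that actually exists and substitute it into the template line". A minimal C sketch under stated assumptions: the set of present devices is passed in as a plain list (standing in for the GEOM provider list the kernel would consult), only the first %selected% in a line is replaced, and select_first_present()/expand_selected() are hypothetical names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the first candidate that appears in 'present', or NULL. */
static const char *
select_first_present(const char *const *cands, size_t ncands,
    const char *const *present, size_t npresent)
{
	size_t i, j;

	for (i = 0; i < ncands; i++)
		for (j = 0; j < npresent; j++)
			if (strcmp(cands[i], present[j]) == 0)
				return (cands[i]);
	return (NULL);
}

/* Replace the first "%selected%" in tmpl with sel; 0 on success, -1 if truncated. */
static int
expand_selected(const char *tmpl, const char *sel, char *out, size_t outlen)
{
	static const char key[] = "%selected%";
	const char *p;
	int n;

	assert(out != NULL && outlen > 0);
	p = strstr(tmpl, key);
	if (p == NULL)
		n = snprintf(out, outlen, "%s", tmpl);
	else
		n = snprintf(out, outlen, "%.*s%s%s", (int)(p - tmpl), tmpl,
		    sel, p + sizeof(key) - 1);
	return ((n < 0 || (size_t)n >= outlen) ? -1 : 0);
}
```

With /dev/acd0 present but not /dev/cd0, "cd9660:%selected%" expands to "cd9660:/dev/acd0", covering both of the -C cases with a single line.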

...

-- 
Marcel Moolenaar
xcllnt@mac.com


From owner-freebsd-arch@FreeBSD.ORG  Wed Sep 29 14:09:47 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9F7451065670
	for <freebsd-arch@freebsd.org>; Wed, 29 Sep 2010 14:09:47 +0000 (UTC)
	(envelope-from gonzo@launchpad.bluezbox.com)
Received: from launchpad.bluezbox.com (hq.bluezbox.com [70.38.37.145])
	by mx1.freebsd.org (Postfix) with ESMTP id 50EBF8FC15
	for <freebsd-arch@freebsd.org>; Wed, 29 Sep 2010 14:09:46 +0000 (UTC)
Received: from [24.87.53.93] (helo=[192.168.1.116])
	by launchpad.bluezbox.com with esmtpsa (TLSv1:AES128-SHA:128)
	(Exim 4.71 (FreeBSD)) (envelope-from <gonzo@launchpad.bluezbox.com>)
	id 1Ozxc2-000OKi-5P; Sun, 26 Sep 2010 13:14:18 -0700
Mime-Version: 1.0 (Apple Message framework v1081)
Content-Type: text/plain; charset=us-ascii
From: Oleksandr Tymoshenko <gonzo@bluezbox.com>
In-Reply-To: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
Date: Sun, 26 Sep 2010 13:14:17 -0700
Content-Transfer-Encoding: 7bit
Message-Id: <94219799-34FF-4210-B816-6A5B6F5DBC2C@bluezbox.com>
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
To: Paketix <paketix@bluewin.ch>
X-Mailer: Apple Mail (2.1081)
Sender: gonzo@launchpad.bluezbox.com
X-Spam-Level: ---
Cc: freebsd-arch@freebsd.org
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Sep 2010 14:09:47 -0000


On 2010-09-26, at 4:13 AM, Paketix wrote:

> there is a rather new processor from TILERA (100 core chip) which is
> most certainly already known here at FreeBSD mailing list.
> [http://www.tilera.com/products/processors/TILE-Gx_Family]
> the processor/platform is targeted towards:
> - high performance network security platforms
>  - firewalling/vpn
>  - utm
>  - l7 deep packet inspection
>  - network monitoring and forensics
> - cloud computing
>  - web application (lamp)
>  - data caching (memcached)
>  - database applications
>  - high-performance computing
> 
> chris metcalf from TILERA did the current linux port and i was in
> contact with him about two weeks ago.
> at this time QUANTA computer is starting to offer a 512 core 2U box
> with an impressive performance/watt ratio (400 watts only for 512
> cores).
> [http://www.tilera.com/solutions/cloud_computing]
> 
> i guess those massive multicore chips would enable bleeding edge
> high performance solutions based on FreeBSD.
> 
> well...
> - anyone interested in porting FreeBSD towards TILERA?
>  (architecture seems to be similar to MIPS...)
    Architecture/hardware looks really high end. I think there are
several people among FreeBSD developers who would like to get 
their hands on this kind of technology.

> - is there already some ongoing porting effort?
    Not that I know of.

> - porting for this chip already discussed in this mailing list? 
    AFAIR - nope 


From owner-freebsd-arch@FreeBSD.ORG  Thu Sep 30 10:05:41 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A1604106566B
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 10:05:41 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 7E2228FC13
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 10:05:41 +0000 (UTC)
Received: from fledge.watson.org (fledge.watson.org [65.122.17.41])
	by cyrus.watson.org (Postfix) with ESMTPS id 1B5C046B82;
	Thu, 30 Sep 2010 06:05:41 -0400 (EDT)
Date: Thu, 30 Sep 2010 11:05:40 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Paketix <paketix@bluewin.ch>
In-Reply-To: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
Message-ID: <alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@freebsd.org
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Sep 2010 10:05:41 -0000


On Sun, 26 Sep 2010, Paketix wrote:

> there is a rather new processor from TILERA (100 core chip) which is
> most certainly already known here at FreeBSD mailing list.

Theory has it I'll be getting access to Intel SCC 48/96-core hardware here at 
Cambridge in the moderately near future, and I've been pondering what would be 
involved.  Their model involves 48+ x86 cores without cache coherency, so you 
need separate OS instances for each.  However, the cores are linked by 
fifo-like memory that we'll need to figure out what to do with.  I assume 
Tilera has some similar sort of message-passing feature?

Robert

> [http://www.tilera.com/products/processors/TILE-Gx_Family]
> the processor/platform is targeted towards:
> - high performance network security platforms
>  - firewalling/vpn
>  - utm
>  - l7 deep packet inspection
>  - network monitoring and forensics
> - cloud computing
>  - web application (lamp)
>  - data caching (memcached)
>  - database applications
>  - high-performance computing
>
> chris metcalf from TILERA did the current linux port and i was in
> contact with him about two weeks ago.
> at this time QUANTA computer is starting to offer a 512 core 2U box
> with an impressive performance/watt ratio (400 watts only for 512
> cores).
> [http://www.tilera.com/solutions/cloud_computing]
>
> i guess those massive multicore chips would enable bleeding edge
> high performance solutions based on FreeBSD.
>
> well...
> - anyone interested in porting FreeBSD towards TILERA?
>  (architecture seems to be similar to MIPS...)
> - is there already some ongoing porting effort?
> - porting for this chip already discussed in this mailing list?
>
> many thx
> /pat
>
> some links for those who want some more details:
> company homepage:
> http://www.tilera.com/
> 64core processor:
> http://www.tilera.com/products/processors/TILEPRO64
> 100core processor with hardware packet (pre)processing
> http://www.tilera.com/products/processors/TILE-Gx_Family
> sample architecture for network appliances:
> http://www.tilera.com/solutions/networking/network_security_appliances
> 512core system from QUANTA computer inc. (available Q4-10/Q1-11):
> http://www.tilera.com/solutions/cloud_computing
> development system from TILERA:
> http://www.tilera.com/products/platforms/TILEmpower_platform
> _______________________________________________
> freebsd-arch@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
>

From owner-freebsd-arch@FreeBSD.ORG  Thu Sep 30 10:44:30 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 74A35106566C
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 10:44:30 +0000 (UTC)
	(envelope-from paketix@bluewin.ch)
Received: from mail31.bluewin.ch (mail31.bluewin.ch [195.186.18.72])
	by mx1.freebsd.org (Postfix) with ESMTP id 0CC4A8FC0A
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 10:44:29 +0000 (UTC)
Received: from [195.186.18.83] ([195.186.18.83:55628] helo=tr15.bluewin.ch)
	by mail31.bluewin.ch (envelope-from <paketix@bluewin.ch>)
	(ecelerity 2.2.2.45 r()) with ESMTP
	id D9/FE-19667-C0A64AC4; Thu, 30 Sep 2010 10:44:28 +0000
Received: from [10.21.20.106] (194.209.131.192) by tr15.bluewin.ch (The Blue
	Window 8.5.119.018.5.119.01) (authenticated as paketix@bluewin.ch)
	id 4C69210201AB6AB2; Thu, 30 Sep 2010 10:44:28 +0000
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
	<alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
In-Reply-To: <alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
Mime-Version: 1.0 (iPhone Mail 8B117)
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii
Message-Id: <6DD4F31E-93F7-4D80-AAB8-86E69FE5D9E5@bluewin.ch>
X-Mailer: iPhone Mail (8B117)
From: Paketix <paketix@bluewin.ch>
Date: Thu, 30 Sep 2010 12:44:18 +0200
To: Robert Watson <rwatson@FreeBSD.org>
Cc: "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Sep 2010 10:44:30 -0000

I do not know all the details yet,
but TILE-Gx features include (incomplete list):
- DCC fully coherent cache
- mPipe wire speed pkt processing engine
- on chip encryption/compression engines
- fast on chip mesh interconnect
- 2x40G interlaken or 8x10G
...

for more details see:
tilera.com/products/processors/TILE-Gx-Family

BR
/pat

Sent from Pat's iPhone

On 30.09.2010, at 12:05, Robert Watson <rwatson@FreeBSD.org> wrote:

>
> On Sun, 26 Sep 2010, Paketix wrote:
>
>> there is a rather new processor from TILERA (100 core chip) which is
>> most certainly already known here at FreeBSD mailing list.
>
> Theory has it I'll be getting access to Intel SCC 48/96-core hardware here at
> Cambridge in the moderately near future, and I've been pondering what would be
> involved.  Their model involves 48+ x86 cores without cache coherency, so you
> need separate OS instances for each.  However, the cores are linked by
> fifo-like memory that we'll need to figure out what to do with.  I assume
> Tilera has some similar sort of message-passing feature?
>
> Robert
>
>> [http://www.tilera.com/products/processors/TILE-Gx_Family]
>> the processor/platform is targeted towards:
>> - high performance network security platforms
>> - firewalling/vpn
>> - utm
>> - l7 deep packet inspection
>> - network monitoring and forensics
>> - cloud computing
>> - web application (lamp)
>> - data caching (memcached)
>> - database applications
>> - high-performance computing
>>
>> chris metcalf from TILERA did the current linux port and i was in
>> contact with him about two weeks ago.
>> at this time QUANTA computer is starting to offer a 512 core 2U box
>> with an impressive performance/watt ratio (400 watts only for 512
>> cores).
>> [http://www.tilera.com/solutions/cloud_computing]
>>
>> i guess those massive multicore chips would enable bleeding edge
>> high performance solutions based on FreeBSD.
>>
>> well...
>> - anyone interested in porting FreeBSD towards TILERA?
>> (architecture seems to be similar to MIPS...)
>> - is there already some ongoing porting effort?
>> - porting for this chip already discussed in this mailing list?
>>
>> many thx
>> /pat
>>
>> some links for those who want some more details:
>> company homepage:
>> http://www.tilera.com/
>> 64core processor:
>> http://www.tilera.com/products/processors/TILEPRO64
>> 100core processor with hardware packet (pre)processing
>> http://www.tilera.com/products/processors/TILE-Gx_Family
>> sample architecture for network appliances:
>> http://www.tilera.com/solutions/networking/network_security_appliances
>> 512core system from QUANTA computer inc. (available Q4-10/Q1-11):
>> http://www.tilera.com/solutions/cloud_computing
>> development system from TILERA:
>> http://www.tilera.com/products/platforms/TILEmpower_platform

From owner-freebsd-arch@FreeBSD.ORG  Thu Sep 30 16:15:05 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5E8DF106564A
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 16:15:05 +0000 (UTC)
	(envelope-from julian@freebsd.org)
Received: from out-0.mx.aerioconnect.net (out-0-24.mx.aerioconnect.net
	[216.240.47.84])
	by mx1.freebsd.org (Postfix) with ESMTP id 41F318FC13
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 16:15:05 +0000 (UTC)
Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160])
	by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id
	o8UFqQrS006736
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 08:52:26 -0700
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org
	(h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137])
	by idiom.com (Postfix) with ESMTP id D57032D6017
	for <freebsd-arch@freebsd.org>; Thu, 30 Sep 2010 08:52:25 -0700 (PDT)
Message-ID: <4CA4B264.4000601@freebsd.org>
Date: Thu, 30 Sep 2010 08:53:08 -0700
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US;
	rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
	<alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
In-Reply-To: <alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Sep 2010 16:15:05 -0000

  On 9/30/10 3:05 AM, Robert Watson wrote:
>
> On Sun, 26 Sep 2010, Paketix wrote:
>
>> there is a rather new processor from TILERA (100 core chip) which is
>> most certainly already known here at FreeBSD mailing list.
>
> Theory has it I'll be getting access to Intel SCC 48/96-core 
> hardware here at Cambridge in the moderately near future, and I've 
> been pondering what would be involved.  Their model involves 48+ x86 
> cores without cache coherency, so you need separate OS instances for 
> each.  However, the cores are linked by fifo-like memory that we'll 
> need to figure out what to do with.  I assume Tilera has some 
> similar sort of message-passing feature?
>
> Robert
>
hmm echoes of 'transputer'?    I believe there is an occam compiler 
that runs on FreeBSD.


From owner-freebsd-arch@FreeBSD.ORG  Thu Sep 30 16:16:40 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C6EB51065672;
	Thu, 30 Sep 2010 16:16:40 +0000 (UTC)
	(envelope-from julian@freebsd.org)
Received: from out-0.mx.aerioconnect.net (out-0-24.mx.aerioconnect.net
	[216.240.47.84])
	by mx1.freebsd.org (Postfix) with ESMTP id 7CBC18FC12;
	Thu, 30 Sep 2010 16:16:40 +0000 (UTC)
Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160])
	by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id
	o8UFscRi006790; Thu, 30 Sep 2010 08:54:38 -0700
X-Client-Authorized: MaGic Cook1e
X-Client-Authorized: MaGic Cook1e
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org
	(h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137])
	by idiom.com (Postfix) with ESMTP id 0AD322D6021;
	Thu, 30 Sep 2010 08:54:36 -0700 (PDT)
Message-ID: <4CA4B2E7.1@freebsd.org>
Date: Thu, 30 Sep 2010 08:55:19 -0700
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US;
	rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4
MIME-Version: 1.0
To: Paketix <paketix@bluewin.ch>
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>	<alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
	<6DD4F31E-93F7-4D80-AAB8-86E69FE5D9E5@bluewin.ch>
In-Reply-To: <6DD4F31E-93F7-4D80-AAB8-86E69FE5D9E5@bluewin.ch>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51
Cc: Robert Watson <rwatson@freebsd.org>,
	"freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Sep 2010 16:16:41 -0000

  On 9/30/10 3:44 AM, Paketix wrote:
> do not know all the details yet
> but tileGX features (incomplete list):
> - DCC fully coherent cache
> - mPipe wire speed pkt processing engine
> - on chip encryption/compression engines
> - fast on chip mesh interconnect
> - 2x40G interlaken or 8x10G
> ...
>
> for more details see:
> tilera.com/products/processors/TILE-Gx-Family
  http://www.tilera.com/products/processors/TILE-Gx_Family
> BR
> /pat
>
> Sent from Pat's iPhone
>
> On 30.09.2010, at 12:05, Robert Watson<rwatson@FreeBSD.org>  wrote:
>
>> On Sun, 26 Sep 2010, Paketix wrote:
>>
>>> there is a rather new processor from TILERA (100 core chip) which is
>>> most certainly already known here at FreeBSD mailing list.
>> Theory has it I'll be getting access to Intel SCC 48/96-core hardware
>> here at Cambridge in the moderately near future, and I've been
>> pondering what would be involved.  Their model involves 48+ x86 cores
>> without cache coherency, so you need separate OS instances for each.
>> However, the cores are linked by fifo-like memory that we'll need to
>> figure out what to do with.  I assume Tilera has some similar sort of
>> message-passing feature?
>>
>> Robert
>>
>>> [http://www.tilera.com/products/processors/TILE-Gx_Family]
>>> the processor/platform is targeted towards:
>>> - high performance network security platforms
>>> - firewalling/vpn
>>> - utm
>>> - l7 deep packet inspection
>>> - network monitoring and forensics
>>> - cloud computing
>>> - web application (lamp)
>>> - data caching (memcached)
>>> - database applications
>>> - high-performance computing
>>>
>>> chris metcalf from TILERA did the current linux port and i was in
>>> contact with him about two weeks ago.
>>> at this time QUANTA computer is starting to offer a 512 core 2U box
>>> with an impressive performance/watt ratio (400 watts only for 512
>>> cores).
>>> [http://www.tilera.com/solutions/cloud_computing]
>>>
>>> i guess those massive multicore chips would enable bleeding edge
>>> high performance solutions based on FreeBSD.
>>>
>>> well...
>>> - anyone interested in porting FreeBSD towards TILERA?
>>> (architecture seems to be similar to MIPS...)
>>> - is there already some ongoing porting effort?
>>> - porting for this chip already discussed in this mailing list?
>>>
>>> many thx
>>> /pat
>>>
>>> some links for those who want some more details:
>>> company homepage:
>>> http://www.tilera.com/
>>> 64core processor:
>>> http://www.tilera.com/products/processors/TILEPRO64
>>> 100core processor with hardware packet (pre)processing
>>> http://www.tilera.com/products/processors/TILE-Gx_Family
>>> sample architecture for network appliances:
>>> http://www.tilera.com/solutions/networking/network_security_appliances
>>> 512core system from QUANTA computer inc. (available Q4-10/Q1-11):
>>> http://www.tilera.com/solutions/cloud_computing
>>> development system from TILERA:
>>> http://www.tilera.com/products/platforms/TILEmpower_platform
>>> _______________________________________________
>>> freebsd-arch@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-arch
>>> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"
>>>
>


From owner-freebsd-arch@FreeBSD.ORG  Fri Oct  1 05:09:03 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0AD4810675F2;
	Fri,  1 Oct 2010 05:08:45 +0000 (UTC)
	(envelope-from adrian.chadd@gmail.com)
Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com
	[209.85.214.182])
	by mx1.freebsd.org (Postfix) with ESMTP id ABB3F8FC19;
	Fri,  1 Oct 2010 05:08:44 +0000 (UTC)
Received: by iwn34 with SMTP id 34so4137114iwn.13
	for <multiple recipients>; Thu, 30 Sep 2010 22:08:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:mime-version:received:sender:received
	:in-reply-to:references:date:x-google-sender-auth:message-id:subject
	:from:to:cc:content-type:content-transfer-encoding;
	bh=AyKap6HXYH1dkkeZPcUJ9wplcSjprWw2IfMr/k+QoiQ=;
	b=lY25nncFFlnHsmXdVhMXPEvmDDKLRAxuWrYpVY5ggjbeqDB6iUVaPIgDmyP26xe10x
	EMcfbDkoPCbzAJrwiGSgVsw02kg8rUzxrAgVwLDI39xCpmR8drPUnx59iijckmlzi2w0
	3Al8o3rivQIMhqw3jk2oG/I1TkhukBdjVV484=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	b=F7n/ACzlR65xaewwBgcUAlrxqgACQuzikLAEpUD/JKFVJWYuRpz+TV8toLNuh4RKU9
	mjMZKLVi/GwsMA53DTfvAjwKjo8wH1eV7zHfsaDPshPp0/TemQCwxjsODbjkDQdp/oH8
	qYVszMufJ8AZDd7ySVHf1wUfS7kiFMwvwxUdA=
MIME-Version: 1.0
Received: by 10.231.144.74 with SMTP id y10mr5037675ibu.65.1285908139888; Thu,
	30 Sep 2010 21:42:19 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.231.171.203 with HTTP; Thu, 30 Sep 2010 21:42:19 -0700 (PDT)
In-Reply-To: <4CA4B264.4000601@freebsd.org>
References: <DAF6D540-3311-4F75-8E24-A5BCBDBC7AE0@bluewin.ch>
	<alpine.BSF.2.00.1009301103540.12886@fledge.watson.org>
	<4CA4B264.4000601@freebsd.org>
Date: Fri, 1 Oct 2010 12:42:19 +0800
X-Google-Sender-Auth: W4E2TSphFsAFA0L15y8ZJZ1BX3M
Message-ID: <AANLkTimf6cQHv6cTE8H-BRyJNBW1-6T6Fp2uY6YvmzdG@mail.gmail.com>
From: Adrian Chadd <adrian@freebsd.org>
To: Julian Elischer <julian@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: freebsd-arch@freebsd.org
Subject: Re: Porting effort towards TILERA massive multicore CPUs...?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Oct 2010 05:09:03 -0000

On 30 September 2010 23:53, Julian Elischer <julian@freebsd.org> wrote:

> hmm echoes of 'transputer'?    I believe there is an occam compiler that
> runs on FreeBSD.

Google XMOS.

I've been trying very hard to not buy some of this until -after- i
finish my degree.

(but I do have an ISA Transputer board at home. :-)


Adrian

From owner-freebsd-arch@FreeBSD.ORG  Sat Oct  2 08:14:07 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 92BF81065672
	for <arch@FreeBSD.org>; Sat,  2 Oct 2010 08:14:07 +0000 (UTC)
	(envelope-from truckman@FreeBSD.org)
Received: from gw.catspoiler.org (adsl-75-1-14-242.dsl.scrm01.sbcglobal.net
	[75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 6566A8FC08
	for <arch@FreeBSD.org>; Sat,  2 Oct 2010 08:14:07 +0000 (UTC)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id o927f7FJ056708
	for <arch@FreeBSD.org>; Sat, 2 Oct 2010 00:41:11 -0700 (PDT)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <201010020741.o927f7FJ056708@gw.catspoiler.org>
Date: Sat, 2 Oct 2010 00:41:07 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: arch@FreeBSD.org
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
Cc: 
Subject: "process slock" vs. "scrlock" lock order
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Oct 2010 08:14:07 -0000

The hard coded lock order list in subr_witness.c has "scrlock" listed
before "process slock".  This causes a lock order reversal when
calcru1(), which requires "process slock" to be held, calls printf() to
report unexpected runtime problems.  The call to printf() eventually
gets into the console code which locks "scrlock".  This normally isn't
noticed because both of these are spin locks, and hardly anyone uses
WITNESS without disabling the checking of spinlocks with
WITNESS_SKIPSPIN.  If spin lock checking is not disabled, the result is
a silent reset because witness catches the LOR, which recurses into
printf(), which ends up causing a panic in cnputs().
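To make the ordering problem concrete, here is a minimal userspace sketch of what a hard-coded order list implies. It is a toy model, not witness(4): the table and the helpers order_index() and is_reversal() are hypothetical, and the two entries simply follow the order described above (scrlock listed before process slock). Acquiring a lock that appears earlier in the list while holding one that appears later is reported as a reversal, which is the situation on the calcru1() -> printf() -> console path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy lock order list (hypothetical): a lock listed earlier must be
 * acquired before any lock listed after it.
 */
static const char *const lock_order[] = {
	"scrlock",
	"process slock",
};

/* Return the position of 'name' in lock_order, or -1 if it is not listed. */
static int
order_index(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(lock_order) / sizeof(lock_order[0]); i++)
		if (strcmp(lock_order[i], name) == 0)
			return ((int)i);
	return (-1);
}

/*
 * True if acquiring 'want' while 'held' is already held violates the
 * list, i.e. 'want' is listed before 'held'.  Unlisted locks are ignored.
 */
static bool
is_reversal(const char *held, const char *want)
{
	int h = order_index(held);
	int w = order_index(want);

	return (h >= 0 && w >= 0 && w < h);
}
```

With this list, is_reversal("process slock", "scrlock") is true, which is the calcru1() case. Swapping the two entries would make it false, at the cost of flagging any path that takes scrlock first and then the process slock.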

One obvious fix would be to move "scrlock" to a later spot in the list,
but I suspect the same problem could occur with the "sio" or "uart"
locks if a serial console is being used.  It might not be possible to
fix them the same way because there might be cases where they are in the
input path and get locked before "process slock" or other spin locks
that can be held when calling printf().

Another fix for this particular case would be to rearrange the code in
calcru1() so that the calls to printf() occur after ruxp->rux_* are
updated and where I assume it would be safe to temporarily drop "process
slock" for the duration of the printf() calls.

Thoughts?


From owner-freebsd-arch@FreeBSD.ORG  Sat Oct  2 10:03:03 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1EB1C1065675;
	Sat,  2 Oct 2010 10:03:03 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 6C1E18FC0C;
	Sat,  2 Oct 2010 10:03:02 +0000 (UTC)
Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua
	[10.1.1.148])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id o929cDYR029797
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 2 Oct 2010 12:38:13 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.4/8.14.4) with ESMTP id
	o929cDbd018516; Sat, 2 Oct 2010 12:38:13 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.4/8.14.4/Submit) id o929cDU8018515; 
	Sat, 2 Oct 2010 12:38:13 +0300 (EEST)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Sat, 2 Oct 2010 12:38:13 +0300
From: Kostik Belousov <kostikbel@gmail.com>
To: Don Lewis <truckman@freebsd.org>
Message-ID: <20101002093813.GC2392@deviant.kiev.zoral.com.ua>
References: <201010020741.o927f7FJ056708@gw.catspoiler.org>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="HG+GLK89HZ1zG0kk"
Content-Disposition: inline
In-Reply-To: <201010020741.o927f7FJ056708@gw.catspoiler.org>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.1 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_50,
	DNS_FROM_OPENWHOIS autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: arch@freebsd.org
Subject: Re: "process slock" vs. "scrlock" lock order
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Oct 2010 10:03:03 -0000


--HG+GLK89HZ1zG0kk
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Oct 02, 2010 at 12:41:07AM -0700, Don Lewis wrote:
> The hard coded lock order list in subr_witness.c has "scrlock" listed
> before "process slock".  This causes a lock order reversal when
> calcru1(), which requires "process slock" to be held, calls printf() to
> report unexpected runtime problems.  The call to printf() eventually
> gets into the console code which locks "scrlock".  This normally isn't
> noticed because both of these are spin locks, and hardly anyone uses
> WITNESS without disabling the checking of spinlocks with
> WITNESS_SKIPSPIN.  If spin lock checking is not disabled, the result is
> a silent reset because witness catches the LOR, which recurses into
> printf(), which ends up causing a panic in cnputs().
>
> One obvious fix would be to move "scrlock" to a later spot in the list,
> but I suspect the same problem could occur with the "sio" or "uart"
> locks if a serial console is being used.  It might not be possible to
> fix them the same way because there might be cases where they are in the
> input path and get locked before "process slock" or other spin locks
> that can be held when calling printf().
>
> Another fix for this particular case would be to rearrange the code in
> calcru1() so that the calls to printf() occur after ruxp->rux_* are
> updated and where I assume it would be safe to temporarily drop "process
> slock" for the duration of the printf() calls.
>
> Thoughts?

Yes, printing from under a spinlock is somewhat epidemic. Moving the printf
out from under the process slock looks like the right solution. On the other
hand, all calcru() callers unlock the slock immediately after calcru(), and
calcru1() is sometimes called with only the thread lock held, not the process
slock.

I propose the following refinement, which does not need to relock the process
slock at all: drop the slock in calcru() and do the necessary printing after
that. There is no need to reacquire the slock.
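A minimal userspace sketch of the record-then-print pattern, assuming a C11 atomic_flag as a stand-in for the process spin lock. The names (struct runtime_warn, update_runtime(), sample_and_report()) are hypothetical and the update rule is simplified; the point is only that nothing is printed while the lock is held, and the recorded warnings are reported after it is dropped, with no need to reacquire it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a spin lock such as "process slock" (hypothetical). */
static atomic_flag slock = ATOMIC_FLAG_INIT;

static void
slock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&slock, memory_order_acquire))
		;	/* spin */
}

static void
slock_release(void)
{
	atomic_flag_clear_explicit(&slock, memory_order_release);
}

/* Problems noticed while the lock is held, reported after it is dropped. */
struct runtime_warn {
	bool negative;		/* sample was negative */
	int64_t neg_value;
	bool backwards;		/* runtime decreased */
	uint64_t old_value, new_value;
};

static uint64_t last_runtime;	/* protected by slock */

/* Called with slock held: update state and record warnings, never print. */
static uint64_t
update_runtime(int64_t sample, struct runtime_warn *w)
{
	uint64_t tu;

	if (sample < 0) {
		w->negative = true;
		w->neg_value = sample;
		tu = last_runtime;	/* fall back to the previous value */
	} else
		tu = (uint64_t)sample;
	if (tu < last_runtime) {
		w->backwards = true;
		w->old_value = last_runtime;
		w->new_value = tu;
	}
	last_runtime = tu;
	return (tu);
}

/* Returns the number of warnings printed; printing happens unlocked. */
static int
sample_and_report(int64_t sample, uint64_t *out)
{
	struct runtime_warn w = { 0 };
	int nwarn = 0;

	slock_acquire();
	*out = update_runtime(sample, &w);
	slock_release();

	/* No spin lock is held here, so printf() may take console locks. */
	if (w.negative) {
		printf("negative runtime of %jd usec\n", (intmax_t)w.neg_value);
		nwarn++;
	}
	if (w.backwards) {
		printf("runtime went backwards from %ju usec to %ju usec\n",
		    (uintmax_t)w.old_value, (uintmax_t)w.new_value);
		nwarn++;
	}
	return (nwarn);
}
```

Explicit flags are used instead of testing the recorded values against zero, so a negative or zero value cannot be mistaken for "no warning".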

diff --git a/sys/compat/linux/linux_misc.c b/sys/compat/linux/linux_misc.c
index d2cf6b6..6a599f6 100644
--- a/sys/compat/linux/linux_misc.c
+++ b/sys/compat/linux/linux_misc.c
@@ -691,7 +691,6 @@ linux_times(struct thread *td, struct linux_times_args *args)
 		PROC_LOCK(p);
 		PROC_SLOCK(p);
 		calcru(p, &utime, &stime);
-		PROC_SUNLOCK(p);
 		calccru(p, &cutime, &cstime);
 		PROC_UNLOCK(p);
 
diff --git a/sys/compat/svr4/svr4_misc.c b/sys/compat/svr4/svr4_misc.c
index 6f80fe6..554eb44 100644
--- a/sys/compat/svr4/svr4_misc.c
+++ b/sys/compat/svr4/svr4_misc.c
@@ -865,7 +865,6 @@ svr4_sys_times(td, uap)
 	PROC_LOCK(p);
 	PROC_SLOCK(p);
 	calcru(p, &utime, &stime);
-	PROC_SUNLOCK(p);
 	calccru(p, &cutime, &cstime);
 	PROC_UNLOCK(p);
 
@@ -1278,7 +1277,6 @@ loop:
 			ru = p->p_ru;
 			PROC_SLOCK(p);
 			calcru(p, &ru.ru_utime, &ru.ru_stime);
-			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 			sx_sunlock(&proctree_lock);
 
@@ -1305,7 +1303,6 @@ loop:
 			ru = p->p_ru;
 			PROC_SLOCK(p);
 			calcru(p, &ru.ru_utime, &ru.ru_stime);
-			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 
 		        if (((uap->options & SVR4_WNOWAIT)) == 0) {
@@ -1329,7 +1326,6 @@ loop:
 			status = SIGCONT;
 			PROC_SLOCK(p);
 			calcru(p, &ru.ru_utime, &ru.ru_stime);
-			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 
 		        if (((uap->options & SVR4_WNOWAIT)) == 0) {
diff --git a/sys/fs/procfs/procfs_status.c b/sys/fs/procfs/procfs_status.c
index 7850504..12f08f6 100644
--- a/sys/fs/procfs/procfs_status.c
+++ b/sys/fs/procfs/procfs_status.c
@@ -125,7 +125,6 @@ procfs_doprocstatus(PFS_FILL_ARGS)
 
 		PROC_SLOCK(p);
 		calcru(p, &ut, &st);
-		PROC_SUNLOCK(p);
 		start = p->p_stats->p_start;
 		timevaladd(&start, &boottime);
 		sbuf_printf(sb, " %jd,%ld %jd,%ld %jd,%ld",
diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
index 8358f75..7819d7b 100644
--- a/sys/kern/kern_exit.c
+++ b/sys/kern/kern_exit.c
@@ -703,8 +703,8 @@ proc_reap(struct thread *td, struct proc *p, int *status, int options,
 	if (rusage) {
 		*rusage = p->p_ru;
 		calcru(p, &rusage->ru_utime, &rusage->ru_stime);
-	}
-	PROC_SUNLOCK(p);
+	} else
+		PROC_SUNLOCK(p);
 	td->td_retval[0] = p->p_pid;
 	if (status)
 		*status = p->p_xstat;	/* convert to int */
diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c
index 4899946..fb0be15 100644
--- a/sys/kern/kern_proc.c
+++ b/sys/kern/kern_proc.c
@@ -783,7 +783,6 @@ fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp)
 		timevaladd(&kp->ki_start, &boottime);
 		PROC_SLOCK(p);
 		calcru(p, &kp->ki_rusage.ru_utime, &kp->ki_rusage.ru_stime);
-		PROC_SUNLOCK(p);
 		calccru(p, &kp->ki_childutime, &kp->ki_childstime);
 
 		/* Some callers want child-times in a single value */
diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c
index ec2d6b6..13cc50c 100644
--- a/sys/kern/kern_resource.c
+++ b/sys/kern/kern_resource.c
@@ -72,8 +72,15 @@ static struct rwlock uihashtbl_lock;
 static LIST_HEAD(uihashhead, uidinfo) *uihashtbl;
 static u_long uihash;		/* size of hash table - 1 */
 
+struct calcru1_warn {
+	int64_t neg_runtime;
+	int64_t new_runtime;
+	int64_t old_runtime;
+};
+
 static void	calcru1(struct proc *p, struct rusage_ext *ruxp,
-		    struct timeval *up, struct timeval *sp);
+		    struct timeval *up, struct timeval *sp,
+		    struct calcru1_warn *w);
 static int	donice(struct thread *td, struct proc *chgp, int n);
 static struct uidinfo *uilookup(uid_t uid);
 static void	ruxagg_locked(struct rusage_ext *rux, struct thread *td);
@@ -797,6 +804,20 @@ getrlimit(td, uap)
 	return (error);
 }
 
+static void
+print_calcru1_warn(struct proc *p, const struct calcru1_warn *w)
+{
+
+	if (w->neg_runtime < 0)
+		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
+		    (intmax_t)w->neg_runtime, p->p_pid, p->p_comm);
+	if (w->new_runtime > 0)
+		printf("calcru: runtime went backwards from %ju usec "
+		    "to %ju usec for pid %d (%s)\n",
+		    (uintmax_t)w->old_runtime, (uintmax_t)w->new_runtime,
+		    p->p_pid, p->p_comm);
+}
+
 /*
  * Transform the running time and tick information for children of proc p
  * into user and system time usage.
@@ -807,24 +828,33 @@ calccru(p, up, sp)
 	struct timeval *up;
 	struct timeval *sp;
 {
+	struct calcru1_warn w;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
-	calcru1(p, &p->p_crux, up, sp);
+	bzero(&w, sizeof(w));
+	calcru1(p, &p->p_crux, up, sp, &w);
+	print_calcru1_warn(p, &w);
 }
 
 /*
  * Transform the running time and tick information in proc p into user
  * and system time usage.  If appropriate, include the current time slice
  * on this CPU.
+ *
+ * The process slock must be held on entry; it is released before
+ * the function returns.
  */
 void
 calcru(struct proc *p, struct timeval *up, struct timeval *sp)
 {
 	struct thread *td;
 	uint64_t u;
+	struct calcru1_warn w;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
+
+	bzero(&w, sizeof(w));
 	/*
 	 * If we are getting stats for the current process, then add in the
 	 * stats that this thread has accumulated in its current time slice.
@@ -843,12 +873,14 @@ calcru(struct proc *p, struct timeval *up, struct timeval *sp)
 			continue;
 		ruxagg(p, td);
 	}
-	calcru1(p, &p->p_rux, up, sp);
+	calcru1(p, &p->p_rux, up, sp, &w);
+	PROC_SUNLOCK(p);
+	print_calcru1_warn(p, &w);
 }
 
 static void
 calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
-    struct timeval *sp)
+    struct timeval *sp, struct calcru1_warn *w)
 {
 	/* {user, system, interrupt, total} {ticks, usec}: */
 	uint64_t ut, uu, st, su, it, tt, tu;
@@ -865,8 +897,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
 	tu = cputick2usec(ruxp->rux_runtime);
 	if ((int64_t)tu < 0) {
 		/* XXX: this should be an assert /phk */
-		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
-		    (intmax_t)tu, p->p_pid, p->p_comm);
+		w->neg_runtime = tu;
 		tu = ruxp->rux_tu;
 	}
 
@@ -903,10 +934,8 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
 		 * serious, so lets keep it and hope laptops can be made
 		 * more truthful about their CPU speed via ACPI.
 		 */
-		printf("calcru: runtime went backwards from %ju usec "
-		    "to %ju usec for pid %d (%s)\n",
-		    (uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
-		    p->p_pid, p->p_comm);
+		w->new_runtime = tu;
+		w->old_runtime = ruxp->rux_tu;
 		uu = (tu * ut) / tt;
 		su = (tu * st) / tt;
 	}
@@ -946,6 +975,7 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
 {
 	struct proc *p;
 	int error;
+	struct calcru1_warn w;
 
 	error = 0;
 	p = td->td_proc;
@@ -962,13 +992,15 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
 		break;
 
 	case RUSAGE_THREAD:
+		bzero(&w, sizeof(w));
 		PROC_SLOCK(p);
 		ruxagg(p, td);
 		PROC_SUNLOCK(p);
 		thread_lock(td);
 		*rup = td->td_ru;
-		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime);
+		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime, &w);
 		thread_unlock(td);
+		print_calcru1_warn(p, &w);
 		break;
 
 	default:
@@ -1069,7 +1101,6 @@ rufetchcalc(struct proc *p, struct rusage *ru, struct timeval *up,
 	PROC_SLOCK(p);
 	rufetch(p, ru);
 	calcru(p, up, sp);
-	PROC_SUNLOCK(p);
 }
 
 /*
diff --git a/sys/kern/kern_time.c b/sys/kern/kern_time.c
index 3aea2bd..d603958 100644
--- a/sys/kern/kern_time.c
+++ b/sys/kern/kern_time.c
@@ -204,7 +204,6 @@ kern_clock_gettime(struct thread *td, clockid_t clock_id, struct timespec *ats)
 		PROC_LOCK(p);
 		PROC_SLOCK(p);
 		calcru(p, &user, &sys);
-		PROC_SUNLOCK(p);
 		PROC_UNLOCK(p);
 		TIMEVAL_TO_TIMESPEC(&user, ats);
 		break;
@@ -212,7 +211,6 @@ kern_clock_gettime(struct thread *td, clockid_t clock_id, struct timespec *ats)
 		PROC_LOCK(p);
 		PROC_SLOCK(p);
 		calcru(p, &user, &sys);
-		PROC_SUNLOCK(p);
 		PROC_UNLOCK(p);
 		timevaladd(&user, &sys);
 		TIMEVAL_TO_TIMESPEC(&user, ats);

--HG+GLK89HZ1zG0kk
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (FreeBSD)

iEYEARECAAYFAkym/YQACgkQC3+MBN1Mb4hvwwCfbLiXCeE8l1mxv+FiDxdA/3zu
NM4An1kwNyAMiDcgGbBPVuIetfjyhf0d
=R9Em
-----END PGP SIGNATURE-----

--HG+GLK89HZ1zG0kk--

From owner-freebsd-arch@FreeBSD.ORG  Sat Oct  2 13:08:45 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 95648106566B;
	Sat,  2 Oct 2010 13:08:45 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from fallbackmx06.syd.optusnet.com.au
	(fallbackmx06.syd.optusnet.com.au [211.29.132.8])
	by mx1.freebsd.org (Postfix) with ESMTP id 185408FC17;
	Sat,  2 Oct 2010 13:08:44 +0000 (UTC)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
	[211.29.132.183])
	by fallbackmx06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o92BSuI2020318; Sat, 2 Oct 2010 21:28:56 +1000
Received: from besplex.bde.org (c122-107-116-249.carlnfd1.nsw.optusnet.com.au
	[122.107.116.249])
	by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o92BSr05003394
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 2 Oct 2010 21:28:54 +1000
Date: Sat, 2 Oct 2010 21:28:52 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Don Lewis <truckman@freebsd.org>
In-Reply-To: <201010020741.o927f7FJ056708@gw.catspoiler.org>
Message-ID: <20101002190453.K11563@besplex.bde.org>
References: <201010020741.o927f7FJ056708@gw.catspoiler.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org
Subject: Re: "process slock" vs. "scrlock" lock order
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 02 Oct 2010 13:08:45 -0000

On Sat, 2 Oct 2010, Don Lewis wrote:

> The hard coded lock order list in subr_witness.c has "scrlock" listed
> before "process slock".  This causes a lock order reversal when
> calcru1(), which requires "process slock" to be held, calls printf() to
> report unexpected runtime problems.  The call to printf() eventually
> gets into the console code which locks "scrlock".

Console drivers are not permitted to use any normal locks, since they
are required to work when called from any instruction boundary via a
trace trap into ddb.  Syscons has lots of state, so it is difficult
for it to be reentrant enough to be a console driver.  It barely tries,
but mostly works anyway.  It used to use the axed cndbctl() call to
try harder.  This told it when ddb was entered and exited, so that it
could do things like save its state on ddb entry and restore it on ddb
exit.  In practice, it did little more than stop the screen saver and
switch to vty0 on ddb entry and set a private variable to indicate
that it was in ddb mode instead of peeking at db_active.  Then it used
this local variable in a few places to avoid a few dangerous things.
It still uses this variable to decide what to do, but this variable
is now never initialized (except statically to 0).  Replacing tests
of this variable by tests of kdb_active would unbreak a few things
and lose mainly the vty switch relative to the old version.

"scrlock" seems to be the only lock in syscons internals (except it
is giant locked), and it is already guarded by a kdb_active test (and
that is the only kdb_active test in syscons internals), so it mostly
doesn't cause problems for calls from ddb, just like the old cndbctl()
tests.  This part of it was cloned from sio where it is less incorrect
since the corresponding lock is made MTX_QUIET iff any sio device is
a console.   (This is still wrong, since sio's lock should be a normal
one and any console lock a separate non-normal one.  Among other
problems, it makes sio's lock too quiet.)  syscons's lock is missing
the MTX_QUIET, but this lock is not a normal one (it is only used for
console output) so it can become more correct.  OTOH, its limited use
makes it useless for locking syscons generally.  It is only used to
prevent corruption of data structures (and garbled output) by multiple
concurrent calls into the console driver.  It doesn't prevent corruption
from a console call concurrent with a (Giant-locked and maybe tty-locked)
user call.  sio needs the corresponding locking only to reduce
garbling of output, since its console calls are reentrant enough to
avoid corrupting any software state and most hardware state.
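The kdb_active guard mentioned above can be sketched in userspace as follows. This is a hypothetical model, not syscons code: kdb_active here is an ordinary flag mirroring the kernel variable of that name, the lock is a C11 atomic_flag, and console output just appends to a buffer. When the debugger is active the lock is skipped, because the code that was interrupted may itself hold it; the price is that output is no longer serialized in that mode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's kdb_active flag and a console lock. */
static volatile bool kdb_active;
static atomic_flag con_lock = ATOMIC_FLAG_INIT;
static char con_buf[256];	/* models the screen */
static size_t con_len;

static void
console_puts(const char *s)
{
	bool locked = false;
	size_t n;

	/*
	 * Outside the debugger, serialize concurrent output.  Inside it, the
	 * interrupted code may already hold the lock, so waiting for it could
	 * hang forever; skip it and accept possibly garbled output instead.
	 */
	if (!kdb_active) {
		while (atomic_flag_test_and_set_explicit(&con_lock,
		    memory_order_acquire))
			;
		locked = true;
	}
	n = strlen(s);
	if (n > sizeof(con_buf) - 1 - con_len)
		n = sizeof(con_buf) - 1 - con_len;	/* truncate when full */
	memcpy(con_buf + con_len, s, n);
	con_len += n;
	con_buf[con_len] = '\0';
	if (locked)
		atomic_flag_clear_explicit(&con_lock, memory_order_release);
}
```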

> This normally isn't
> noticed because both of these are spin locks, and hardly anyone uses
> WITNESS without disabling the checking of spinlocks with
> WITNESS_SKIPSPIN.  If spin lock checking is not disabled, the result is
> a silent reset because witness catches the LOR, which recurses into
> printf(), which ends up causing a panic in cnputs().
>
> One obvious fix would be to move "scrlock" to a later spot in the list,
> but I suspect the same problem could occur with the "sio" or "uart"
> locks if a serial console is being used.  It might not be possible to
> fix them the same way because there might be cases where they are in the
> input path and get locked before "process slock" or other spin locks
> that can be held when calling printf().

I think sio isn't affected, since it uses MTX_QUIET (though maybe it needs
MTX_NOWITNESS too -- one or both of those should "work" by breaking
witnessing in much the same way as WITNESS_SKIPSPIN).  uart is missing
the MTX_QUIET, and uses a too-normal lock for the console.

uart has locking for the whole of cngetc() too (except it drops the
lock to wait), while sio has only reentrancy for cngetc().  Both are
useless for serialization, since cngetc() hasn't actually been a getc
function since ~2001 (?) when the multiple console changes broke input.
It is now cncheckc() misnamed.  The multiple console code polls each
console for input in turn, even when there is only 1 active console,
and this involves dropping locks so interrupts tend to eat your input.
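To illustrate why cngetc() is really cncheckc() misnamed, here is a
hypothetical sketch of a getc built by repeatedly polling several
consoles with a non-blocking check function.  struct console,
fake_checkc(), poll_consoles() and poll_getc() are invented for
illustration and do not mirror the real cons.c interfaces; the interrupt
handler that can steal input between passes is only described in a
comment, not modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical console descriptor: checkc returns a pending character,
 * or -1 if there is none, and never blocks.
 */
struct console {
	const char *name;
	int (*checkc)(struct console *);
	const char *pending;	/* simulated input queue */
};

static int
fake_checkc(struct console *cn)
{
	if (cn->pending == NULL || *cn->pending == '\0')
		return (-1);
	return ((unsigned char)*cn->pending++);
}

/* One non-blocking pass over every console: effectively a cncheckc(). */
int
poll_consoles(struct console *cons, size_t ncons)
{
	for (size_t i = 0; i < ncons; i++) {
		int c = cons[i].checkc(&cons[i]);

		if (c != -1)
			return (c);
	}
	return (-1);
}

/*
 * A "getc" built by spinning on the poll.  Nothing is held between
 * passes, which is the window in which a device's interrupt handler
 * can consume the character before the next pass sees it.
 */
int
poll_getc(struct console *cons, size_t ncons)
{
	int c;

	while ((c = poll_consoles(cons, ncons)) == -1)
		;	/* busy-wait */
	return (c);
}
```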

> Another fix for this particular case would be to rearrange the code in
> calcru1() so that the calls to printf() occur after ruxp->rux_* are
> updated and where I assume it would be safe to temporarily drop "process
> slock" for the duration of the printf() calls.

printf() is supposed to be callable from almost anywhere (just not quite
at any instruction boundary unless in ddb mode).

There is related broken locking in cnputs().  This uses a non-normal
mutex for serialization.  The mutex is MTX_NOWITNESS and MTX_QUIET,
but there are no kdb_active tests before using it, and it is not bogusly
MTX_RECURSE, so it can deadlock in some cases (all cases with ddb output?)
when cnputs() is debugged.  I use better serialization of output involving
a similar (but less normal) lock over single printfs (callers wanting
to ensure non-garbled output must put it all together).  Deadlock is
avoided by ignoring the lock after trying for it for 1 second.  Console
drivers still need lower-level locking to protect their data structures.
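The deadlock-avoidance rule described above (hold a serialization lock
over a single printf, but stop waiting for it after about 1 second) can
be sketched in user space as follows.  The names printf_lock,
printf_lock_acquire() and printf_lock_release() are hypothetical,
clock_gettime(CLOCK_MONOTONIC) stands in for whatever timing source a
kernel would use, and the 1-second limit is passed in as a parameter.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical lock serializing one whole printf's worth of output. */
static atomic_flag printf_lock = ATOMIC_FLAG_INIT;

static double
now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (ts.tv_sec + ts.tv_nsec / 1e9);
}

/*
 * Try to take the lock for up to timeout_sec seconds.  Returns true if
 * the lock was acquired (the caller must release it) and false if the
 * caller gave up and should print unserialized rather than deadlock.
 */
bool
printf_lock_acquire(double timeout_sec)
{
	double deadline = now_seconds() + timeout_sec;

	while (atomic_flag_test_and_set_explicit(&printf_lock,
	    memory_order_acquire)) {
		if (now_seconds() >= deadline)
			return (false);
	}
	return (true);
}

/* Release only if acquire actually succeeded. */
void
printf_lock_release(bool held)
{
	if (held)
		atomic_flag_clear_explicit(&printf_lock, memory_order_release);
}
```

A caller would do held = printf_lock_acquire(1.0), emit its whole
message, then printf_lock_release(held).  Giving up trades possible
garbling for the guarantee that a printf from a debugger or panic path
never waits forever.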

The Giant locking in syscons seems bogus now that there is tty locking.
In the syscons directory, it is only done explicitly in sckbdevent(),
which calls tty_rint*() which needs tty locking but there is none
visible (maybe an upper layer of the interrupt handler does it, or
Giant locking of everything is enough).

"scrlock" causes problems with tty locking too.  syscons.c has only a
single explicit tty_lock() call, and that one is under "#if 0" together
with some scroll lock handling since "scrlock" causes a more detectable
LOR relative to tty_lock.  This is in sc_cngetc().  The LOR detection
has exposed the larger bug that a console driver is calling an upper
tty layer.  In old versions, the call was directly to scstart() except
for a check of the upper layer's open flag.  This was unsafe too (it
called up to the tty layer).  Add full tty locking to syscons and you
would probably find its console routines can't go anywhere without
hitting the tty lock, so when a console routine is called with the tty
lock held, it should deadlock or panic.  Giant locking was too feeble
to detect such problems, and before Giant I thought syscons was missing
lots of spl locking (which needed to be splhigh() to defend against
reentry for printf from an interrupt handler, leaving only the problem
of reentry for printf from a trap handler).

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Sat Oct  2 20:03:26 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1BC3A1065670;
	Sat,  2 Oct 2010 20:03:26 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au
	[211.29.132.188])
	by mx1.freebsd.org (Postfix) with ESMTP id 92AA38FC14;
	Sat,  2 Oct 2010 20:03:25 +0000 (UTC)
Received: from c122-107-116-249.carlnfd1.nsw.optusnet.com.au
	(c122-107-116-249.carlnfd1.nsw.optusnet.com.au [122.107.116.249])
	by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o92K3LSU032357
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 3 Oct 2010 07:03:22 +1100
Date: Sun, 3 Oct 2010 07:03:21 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Kostik Belousov <kostikbel@gmail.com>
In-Reply-To: <20101002093813.GC2392@deviant.kiev.zoral.com.ua>
Message-ID: <20101003062141.C1323@delplex.bde.org>
References: <201010020741.o927f7FJ056708@gw.catspoiler.org>
	<20101002093813.GC2392@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org, Don Lewis <truckman@FreeBSD.org>
Subject: Re: "process slock" vs. "scrlock" lock order
X-List-Received-Date: Sat, 02 Oct 2010 20:03:26 -0000

On Sat, 2 Oct 2010, Kostik Belousov wrote:

> On Sat, Oct 02, 2010 at 12:41:07AM -0700, Don Lewis wrote:
>> The hard coded lock order list in subr_witness.c has "scrlock" listed
>> before "process slock".  This causes a lock order reversal when
>> calcru1(), which requires "process slock" to be held, calls printf() to
>> report unexpected runtime problems.  The call to printf() eventually
>> gets into the console code which locks "scrlock".  This normally isn't
>> slock" for the duration of the printf() calls.
>> ...
>> Thoughts?
>
> Yes, printing from under a spinlock is somewhat epidemic. Moving the printf
> out of process slock looks as the right solution.

No, it is shooting the messenger.  printf() (and console functions)
may be called with any locks held (except ones related to printf and
console functions themselves, and even those must be blown open for
printfs from panics and possibly from debuggers (if you have a reentrant
debugger)).
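The reversal Don describes can be modelled with a toy version of a
hard-coded order list.  This is a simplified, hypothetical sketch, not
the WITNESS implementation (which tracks lock classes and a full order
graph): lock_order, order_index() and is_reversal() are invented names,
and the list holds only the two locks named in this thread, in the order
the thread says subr_witness.c lists them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical order list: a lock listed earlier must be acquired
 * before any lock listed later.
 */
static const char *const lock_order[] = {
	"scrlock",
	"process slock",
};
#define NORDER (sizeof(lock_order) / sizeof(lock_order[0]))

static int
order_index(const char *name)
{
	for (size_t i = 0; i < NORDER; i++)
		if (strcmp(lock_order[i], name) == 0)
			return ((int)i);
	return (-1);	/* not in the list: no order is enforced */
}

/* True if acquiring `want` while holding `held` reverses the order. */
bool
is_reversal(const char *held, const char *want)
{
	int h = order_index(held), w = order_index(want);

	return (h >= 0 && w >= 0 && w < h);
}
```

With this order, calcru1() holding "process slock" and then reaching
"scrlock" through printf() is flagged, which is the report under
discussion here.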

> On the other hand, all
> calcru() callers unlock slock immediately after calcru(), and calcru1()
> sometimes only called with thread lock held, not process slock.
>
> I propose the following refinement, it does not need relock of process slock
> at all. Lets drop slock in calcru(), and do necessary print after that.
> No need to reacquire the slock.

This might be cleaner for other reasons.

> diff --git a/sys/compat/linux/linux_misc.c b/sys/compat/linux/linux_misc.c
> index d2cf6b6..6a599f6 100644
> --- a/sys/compat/linux/linux_misc.c
> +++ b/sys/compat/linux/linux_misc.c
> @@ -691,7 +691,6 @@ linux_times(struct thread *td, struct linux_times_args *args)
> 		PROC_LOCK(p);
> 		PROC_SLOCK(p);
> 		calcru(p, &utime, &stime);
> -		PROC_SUNLOCK(p);
> 		calccru(p, &cutime, &cstime);
> 		PROC_UNLOCK(p);
>

Clean to remove lots of these.

> diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c
> index ec2d6b6..13cc50c 100644
> --- a/sys/kern/kern_resource.c
> +++ b/sys/kern/kern_resource.c
> @@ -72,8 +72,15 @@ static struct rwlock uihashtbl_lock;
> static LIST_HEAD(uihashhead, uidinfo) *uihashtbl;
> static u_long uihash;		/* size of hash table - 1 */
>
> +struct calcru1_warn {
> +	int64_t neg_runtime;
> +	int64_t new_runtime;
> +	int64_t old_runtime;
> +};
> +
> static void	calcru1(struct proc *p, struct rusage_ext *ruxp,
> -		    struct timeval *up, struct timeval *sp);
> +		    struct timeval *up, struct timeval *sp,
> +		    struct calcru1_warn *w);
> static int	donice(struct thread *td, struct proc *chgp, int n);
> static struct uidinfo *uilookup(uid_t uid);
> static void	ruxagg_locked(struct rusage_ext *rux, struct thread *td);
> @@ -797,6 +804,20 @@ getrlimit(td, uap)
> 	return (error);
> }
>
> +static void
> +print_calcru1_warn(struct proc *p, const struct calcru1_warn *w)
> +{
> +
> +	if (w->neg_runtime < 0)
> +		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
> +		    (intmax_t)w->neg_runtime, p->p_pid, p->p_comm);
> +	if (w->new_runtime > 0)
> +		printf("calcru: runtime went backwards from %ju usec "
> +		    "to %ju usec for pid %d (%s)\n",
> +		    (uintmax_t)w->old_runtime, (uintmax_t)w->new_runtime,
> +		    p->p_pid, p->p_comm);
> +}
> +
> /*
>  * Transform the running time and tick information for children of proc p
>  * into user and system time usage.
> @@ -807,24 +828,33 @@ calccru(p, up, sp)
> 	struct timeval *up;
> 	struct timeval *sp;
> {
> +	struct calcru1_warn w;
>
> 	PROC_LOCK_ASSERT(p, MA_OWNED);
> -	calcru1(p, &p->p_crux, up, sp);
> +	bzero(&w, sizeof(w));
> +	calcru1(p, &p->p_crux, up, sp, &w);
> +	print_calcru1_warn(p, &w);
> }
>
> /*
>  * Transform the running time and tick information in proc p into user
>  * and system time usage.  If appropriate, include the current time slice
>  * on this CPU.
> + *
> + * The process slock shall be locked on entry, and it is unlocked
> + * after function returned.
>  */
> void
> calcru(struct proc *p, struct timeval *up, struct timeval *sp)
> {
> 	struct thread *td;
> 	uint64_t u;
> +	struct calcru1_warn w;
>
> 	PROC_LOCK_ASSERT(p, MA_OWNED);
> 	PROC_SLOCK_ASSERT(p, MA_OWNED);
> +
> +	bzero(&w, sizeof(w));
> 	/*
> 	 * If we are getting stats for the current process, then add in the
> 	 * stats that this thread has accumulated in its current time slice.
> @@ -843,12 +873,14 @@ calcru(struct proc *p, struct timeval *up, struct timeval *sp)
> 			continue;
> 		ruxagg(p, td);
> 	}
> -	calcru1(p, &p->p_rux, up, sp);
> +	calcru1(p, &p->p_rux, up, sp, &w);
> +	PROC_SUNLOCK(p);
> +	print_calcru1_warn(p, &w);
> }
>
> static void
> calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
> -    struct timeval *sp)
> +    struct timeval *sp, struct calcru1_warn *w)
> {
> 	/* {user, system, interrupt, total} {ticks, usec}: */
> 	uint64_t ut, uu, st, su, it, tt, tu;
> @@ -865,8 +897,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
> 	tu = cputick2usec(ruxp->rux_runtime);
> 	if ((int64_t)tu < 0) {
> 		/* XXX: this should be an assert /phk */
> -		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
> -		    (intmax_t)tu, p->p_pid, p->p_comm);
> +		w->neg_runtime = tu;
> 		tu = ruxp->rux_tu;
> 	}
>
> @@ -903,10 +934,8 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
> 		 * serious, so lets keep it and hope laptops can be made
> 		 * more truthful about their CPU speed via ACPI.
> 		 */
> -		printf("calcru: runtime went backwards from %ju usec "
> -		    "to %ju usec for pid %d (%s)\n",
> -		    (uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
> -		    p->p_pid, p->p_comm);
> +		w->new_runtime = tu;
> +		w->old_runtime = ruxp->rux_tu;
> 		uu = (tu * ut) / tt;
> 		su = (tu * st) / tt;
> 	}
> @@ -946,6 +975,7 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
> {
> 	struct proc *p;
> 	int error;
> +	struct calcru1_warn w;
>
> 	error = 0;
> 	p = td->td_proc;
> @@ -962,13 +992,15 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
> 		break;
>
> 	case RUSAGE_THREAD:
> +		bzero(&w, sizeof(w));
> 		PROC_SLOCK(p);
> 		ruxagg(p, td);
> 		PROC_SUNLOCK(p);
> 		thread_lock(td);
> 		*rup = td->td_ru;
> -		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime);
> +		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime, &w);
> 		thread_unlock(td);
> +		print_calcru1_warn(p, &w);
> 		break;
>
> 	default:
> @@ -1069,7 +1101,6 @@ rufetchcalc(struct proc *p, struct rusage *ru, struct timeval *up,
> 	PROC_SLOCK(p);
> 	rufetch(p, ru);
> 	calcru(p, up, sp);
> -	PROC_SUNLOCK(p);
> }
>
> /*

Not clean to add mounds of code to defer a couple of normal printfs.  I
think the only relationship of calcru() to the problem is that it has
a printf that is actually executed quite often.  Just about any printf
within a locked region may become a messenger for the problem if the
printf is actually executed.

To see lots more bugs in console drivers, put printfs in lots of critical
places and arrange for them to be executed frequently.  Ones near malloc
might be good.  I once used the one in the following timeout handler to
demonstrate the missing locking in syscons:

% static void
% foo(void *arg)
% {
% #if 0
%     sccnputc(0, '*');
%     timeout_handle = timeout(foo, NULL, 1);
% #else
%     /*
%      * Fills up log if done every tick so only do it every 10 ticks and
%      * wait a bit longer for races.
%      */
%     printf("*");
%     timeout_handle = timeout(foo, NULL, 10);
% #endif
% }

This printf can contend with write(2) or perhaps another printf.  Panics
were easiest to demonstrate with write(2).  Timeouts can easily interrupt
write(2), so the above printf contended with write(2) whenever the
timeout fired while syscons was busy with write(2).  (Some
console drivers have locking to prevent the contention, but they must
be careful about deadlock.  The printf cannot be blocked.)  Giant
locking reduced this problem a bit.  It makes the above timeout handler
Giant-locked.  Syscons remains Giant-locked, so there is enough locking
to prevent contention from the above, but the above printf is broken
(it blocks).  The blocking doesn't matter here, but it would in a more
critical context.  More critical contexts wouldn't be Giant-locked
anyway, so they would contend.  I never got around to changing the
above to be an MPSAFE callout handler so that Giant locking doesn't
help.  It is in fact not MPSAFE, but only because the console driver
is not even UPSAFE.  You can see how old the above is from its non-KNF
style which I once preferred.

Bruce