From owner-freebsd-arch@FreeBSD.ORG Sun Sep 26 11:28:51 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BC7A1106566C for ; Sun, 26 Sep 2010 11:28:51 +0000 (UTC) (envelope-from paketix@bluewin.ch) Received: from mail31.bluewin.ch (mail31.bluewin.ch [195.186.18.72]) by mx1.freebsd.org (Postfix) with ESMTP id 3E7388FC0A for ; Sun, 26 Sep 2010 11:28:50 +0000 (UTC) Received: from [195.186.18.84] ([195.186.18.84:56597] helo=tr17.bluewin.ch) by mail31.bluewin.ch (envelope-from ) (ecelerity 2.2.2.45 r()) with ESMTP id 2B/12-19667-AEA2F9C4; Sun, 26 Sep 2010 11:13:46 +0000 Received: from [192.168.1.62] (188.61.142.81) by tr17.bluewin.ch (The Blue Window 8.5.119.018.5.119.01) (authenticated as paketix@bluewin.ch) id 4C6921000180F4F0 for freebsd-arch@freebsd.org; Sun, 26 Sep 2010 11:13:46 +0000 From: Paketix Date: Sun, 26 Sep 2010 13:13:44 +0200 Message-Id: To: freebsd-arch@freebsd.org Mime-Version: 1.0 (Apple Message framework v1081) X-Mailer: Apple Mail (2.1081) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: Porting effort towards TILERA massive multicore CPUs...? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Sep 2010 11:28:51 -0000 there is a rather new processor from TILERA (100 core chip) which is most certainly already known here at FreeBSD mailing list. 
[http://www.tilera.com/products/processors/TILE-Gx_Family]
the processor/platform is targeted towards:
- high performance network security platforms
- firewalling/vpn
- utm
- l7 deep packet inspection
- network monitoring and forensics
- cloud computing
- web application (lamp)
- data caching (memcached)
- database applications
- high-performance computing

chris metcalf from TILERA did the current linux port and i was in
contact with him about two weeks ago.
at this time QUANTA computer is starting to offer a 512 core 2U box
with an impressive performance/watt ratio (400 watts only for 512
cores).
[http://www.tilera.com/solutions/cloud_computing]

i guess those massive multicore chips would enable bleeding edge
high performance solutions based on FreeBSD.

well...
- anyone interested in porting FreeBSD towards TILERA?
(architecture seems to be similar to MIPS...)
- is there already some ongoing porting effort?
- porting for this chip already discussed in this mailing list?

many thx
/pat

some links for those who want some more details:
company homepage:
http://www.tilera.com/
64core processor:
http://www.tilera.com/products/processors/TILEPRO64
100core processor with hardware packet (pre)processing
http://www.tilera.com/products/processors/TILE-Gx_Family
sample architecture for network appliances:
http://www.tilera.com/solutions/networking/network_security_appliances
512core system from QUANTA computer inc.
(available Q4-10/Q1-11): http://www.tilera.com/solutions/cloud_computing development system from TILERA: http://www.tilera.com/products/platforms/TILEmpower_platform From owner-freebsd-arch@FreeBSD.ORG Sun Sep 26 19:28:53 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DA135106564A for ; Sun, 26 Sep 2010 19:28:53 +0000 (UTC) (envelope-from paketix@bluewin.ch) Received: from mail31.bluewin.ch (mail31.bluewin.ch [195.186.18.72]) by mx1.freebsd.org (Postfix) with ESMTP id 710458FC14 for ; Sun, 26 Sep 2010 19:28:53 +0000 (UTC) Received: from [195.186.18.84] ([195.186.18.84:40549] helo=tr17.bluewin.ch) by mail31.bluewin.ch (envelope-from ) (ecelerity 2.2.2.45 r()) with ESMTP id DB/E9-19667-4FE9F9C4; Sun, 26 Sep 2010 19:28:52 +0000 Received: from [192.168.1.62] (188.61.142.81) by tr17.bluewin.ch (The Blue Window 8.5.119.018.5.119.01) (authenticated as paketix@bluewin.ch) id 4C6921000184B737; Sun, 26 Sep 2010 19:28:52 +0000 Mime-Version: 1.0 (Apple Message framework v1081) Content-Type: text/plain; charset=us-ascii From: Paketix In-Reply-To: Date: Sun, 26 Sep 2010 21:28:51 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <616133A7-3DF8-4192-8457-09BC27D2085E@bluewin.ch> References: To: Garrett Cooper X-Mailer: Apple Mail (2.1081) Cc: freebsd-arch@freebsd.org Subject: Re: Porting effort towards TILERA massive multicore CPUs...? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Sep 2010 19:28:54 -0000 On Sep 26, 2010, at 20:05, Garrett Cooper wrote: > On Sun, Sep 26, 2010 at 4:13 AM, Paketix wrote: >> there is a rather new processor from TILERA (100 core chip) which is >> most certainly already known here at FreeBSD mailing list. 
>> [http://www.tilera.com/products/processors/TILE-Gx_Family]
>> the processor/platform is targeted towards:
>> - high performance network security platforms
>> - firewalling/vpn
>> - utm
>> - l7 deep packet inspection
>> - network monitoring and forensics
>> - cloud computing
>> - web application (lamp)
>> - data caching (memcached)
>> - database applications
>> - high-performance computing
>>
>> chris metcalf from TILERA did the current linux port and i was in
>> contact with him about two weeks ago.
>> at this time QUANTA computer is starting to offer a 512 core 2U box
>> with an impressive performance/watt ratio (400 watts only for 512
>> cores).
>> [http://www.tilera.com/solutions/cloud_computing]
>>
>> i guess those massive multicore chips would enable bleeding edge
>> high performance solutions based on FreeBSD.
>>
>> well...
>> - anyone interested in porting FreeBSD towards TILERA?
>> (architecture seems to be similar to MIPS...)
>> - is there already some ongoing porting effort?
>> - porting for this chip already discussed in this mailing list?
>>
>> many thx
>> /pat
>>
>> some links for those who want some more details:
>> company homepage:
>> http://www.tilera.com/
>> 64core processor:
>> http://www.tilera.com/products/processors/TILEPRO64
>> 100core processor with hardware packet (pre)processing
>> http://www.tilera.com/products/processors/TILE-Gx_Family
>> sample architecture for network appliances:
>> http://www.tilera.com/solutions/networking/network_security_appliances
>> 512core system from QUANTA computer inc. (available Q4-10/Q1-11):
>> http://www.tilera.com/solutions/cloud_computing
>> development system from TILERA:
>> http://www.tilera.com/products/platforms/TILEmpower_platform
>
> In short this work requires changes to the scheduler and kernel
> structures that aren't 100% done yet.
> Look for some of Robert Watson
> and John Baldwin's replies to "Bumping MAXCPU on amd64" thread in the
> past month to freebsd-arch and freebsd-current.
> Cheers,
> -Garrett

usually it would - yes
but maybe not on tilera if you use it for security applications (like firewalling, proxy, url filter, ...)
each tile of a tilera chip can run its own full featured OS
starting with TileGX the chip has a hardware loadbalancer serving the packet streams to the cores...
this could maybe serve as a first step
full SMP for e.g. database applications etc. later on

btw: the tilera chip does not have a floating point unit anyway which will limit the range of applications (FP must be emulated in software)

BR
/pat

From owner-freebsd-arch@FreeBSD.ORG Mon Sep 27 15:49:37 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2F60E1065670; Mon, 27 Sep 2010 15:49:37 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id F17888FC16; Mon, 27 Sep 2010 15:49:36 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 8178746B81; Mon, 27 Sep 2010 11:49:36 -0400 (EDT) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id BB8E38A04E; Mon, 27 Sep 2010 11:49:34 -0400 (EDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Mon, 27 Sep 2010 09:28:47 -0400 User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; ) References: <201009211507.o8LF7iVv097676@svn.freebsd.org> <20100924225352.GD49476@server.vk2pj.dyndns.org> In-Reply-To: <20100924225352.GD49476@server.vk2pj.dyndns.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: 
<201009270928.47232.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Mon, 27 Sep 2010 11:49:35 -0400 (EDT) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r212964 - head/sys/kern X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 27 Sep 2010 15:49:37 -0000

On Friday, September 24, 2010 6:53:52 pm Peter Jeremy wrote:
> [Pruning CC list and re-adding freebsd-arch on the (forlorn) hope that
> this thread will move to where it belongs]
>
> On 2010-Sep-23 07:31:13 -0700, Matthew Jacob wrote:
> >It turns out that the big issue here was more the savecore time coming
> >back up rather than the time of dumping.
>
> In my experience, the problem isn't so much the savecore time as the
> time to run /usr/bin/crashinfo. Whilst savecore needs to run early
> (before anything tramples on the crashdump in swap), the latter could
> run at any time. It would seem reasonable to either run crashinfo in
> the background or as a batchjob triggered by /etc/rc.d/savecore.

That is probably true and would be fine, yes.

> On 2010-Sep-23 18:59:53 +0100, Gavin Atkinson wrote:
> >I appreciate the issue about filling partitions is a valid one. Would a
> >possible compromise be that on release media, crashinfo(8) or similar will
> >default to only keeping the most recent coredump or similar? Given /var
> >now defaults to 4GB, Defaulting to keeping a single core is probably
> >acceptable.
>
> savecore already has support for a 'minfree' file to prevent
> crashdumps filling the crashdir. Maybe the default install should
> include a minfree set to (say) 512MB.

The one problem with this approach is that it implements a FIFO instead of a LIFO. I want
the N most recent crashdumps to be saved, not the first N.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 06:22:58 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9D348106566C for ; Tue, 28 Sep 2010 06:22:58 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout030.mac.com (asmtpout030.mac.com [17.148.16.105]) by mx1.freebsd.org (Postfix) with ESMTP id 8098A8FC1A for ; Tue, 28 Sep 2010 06:22:58 +0000 (UTC) MIME-version: 1.0 Content-type: multipart/mixed; boundary="Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)" Received: from sa-nc-apg-144.static.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp030.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L9G003YU1PGFO70@asmtp030.mac.com> for freebsd-arch@freebsd.org; Mon, 27 Sep 2010 23:22:33 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1009270310 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-09-28_06:2010-09-28, 2010-09-27, 1970-01-01 signatures=0 From: Marcel Moolenaar Date: Mon, 27 Sep 2010 23:22:27 -0700 Message-id: To: "freebsd-arch@FreeBSD.org Arch" X-Mailer: Apple Mail (2.1081) Subject: [patch] functional prototype of root mount enhancement X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Tue, 28 Sep 2010 06:22:58 -0000

--Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)
Content-type: text/plain; charset=us-ascii
Content-transfer-encoding: 7BIT

All,

I prototyped the root mount enhancement previously discussed. I would
appreciate feedback and suggestions and bug reports of course. See:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=5942+0+current/freebsd-arch
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=120899+0+archive/2010/freebsd-arch/20100829.freebsd-arch

The prototype supports all boot options that affect the root mount.
Those are: -a, -C, -r
When present, the initial root mount directives get adjusted accordingly.

The prototype adds better support for mount options. Both the interactive
and the compiled-in root mount option (i.e. ROOTDEVNAME) can contain
mount options.

Not implemented yet is the .onfail handling, as well as the .timeout
handling (previously called .wait). Also, the .init directive is not
implemented.

There's 1 bug under investigation: when a 2nd (non-devfs) file system
is mounted as root, the 1st (non-devfs) gets moved under /.mount or
/mnt under the new (=2nd) file system. However, trying to access the
file system results in a WITNESS panic caused by a syscall leaving
with the ufs lock held.

The code has some debug output still, which is helpful to see what's
going on internally. From a boot (with a /.mount.conf present on
ufs:/dev/ad0s1a):

:
WARNING: WITNESS option enabled, expect reduced performance.
Root mount waiting for: usbus1
Root mount waiting for: usbus1
uhub1: 6 ports with 6 removable, self powered
Root mount waiting for: usbus1
Root mount waiting for: usbus1
ugen1.2: at usbus1
========
.onfail panic
.timeout 1
ufs:/dev/ad0s1a rw
.ask
========
Trying to mount root from ufs:/dev/ad0s1a [rw]...
XXX: vfs_mountroot_parse: error = 0, mpdevfs=0xc3fa3000, mp=0xc3fa2c94
========
.onfail continue
#ufs:/dev/da0a
.ask
========

Loader variables:
 vfs.root.mountfrom=ufs:/dev/ad0s1a
 vfs.root.mountfrom.options=rw

Manual root filesystem specification:
 <fstype>:<device> [options]
 Mount <device> using filesystem <fstype>
 and with the specified (optional) option list.

 eg. ufs:/dev/da0s1a
 cd9660:/dev/acd0 ro
 (which is equivalent to: mount -t cd9660 -o ro /dev/acd0 /

 ? List valid disk boot devices
 <empty line> Abort manual input

mountroot>
XXX: vfs_mountroot_parse: error = -1, mpdevfs=0xc3fa3000, mp=0
:

In case the attachment gets eaten:
http://www.xcllnt.net/~marcel/rootmount.diff

-- 
Marcel Moolenaar
xcllnt@mac.com

--Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)
Content-type: application/octet-stream; name=rootmount.diff
Content-transfer-encoding: 7bit
Content-disposition: attachment; filename=rootmount.diff

Index: conf/files =================================================================== --- conf/files (revision 41) +++ conf/files (revision 49) @@ -2216,6 +2216,7 @@ kern/vfs_init.c standard kern/vfs_lookup.c standard kern/vfs_mount.c standard +kern/vfs_mountroot.c standard kern/vfs_subr.c standard kern/vfs_syscalls.c standard kern/vfs_vnops.c standard Index: kern/vfs_mountroot.c =================================================================== --- kern/vfs_mountroot.c (revision 0) +++ kern/vfs_mountroot.c (revision 49) @@ -0,0 +1,985 @@ +/*- + * Copyright (c) 1999-2004 Poul-Henning Kamp + * Copyright (c) 1999 Michael Smith + * Copyright (c) 1989, 1993 + * The Regents of the University of California. All rights reserved. + * (c) UNIX System Laboratories, Inc. + * All or some portions of this file are derived from material licensed + * to the University of California by American Telephone and Telegraph + * Co. or Unix System Laboratories, Inc. and are reproduced herein with + * the permission of UNIX System Laboratories, Inc. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 4. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include "opt_rootdevname.h" + +#include +__FBSDID("$FreeBSD$"); + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/* + * The root filesystem is detailed in the kernel environment variable + * vfs.root.mountfrom, which is expected to be in the general format + * + * <vfsname>:[<path>][ <vfsname>:[<path>] ...] 
+ * vfsname := the name of a VFS known to the kernel and capable + * of being mounted as root + * path := disk device name or other data used by the filesystem + * to locate its physical store + * + * If the environment variable vfs.root.mountfrom is a space separated list, + * each list element is tried in turn and the root filesystem will be mounted + * from the first one that succeeds. + * + * The environment variable vfs.root.mountfrom.options is a comma delimited + * set of string mount options. These mount options must be parseable + * by nmount() in the kernel. + */ + +static int parse_mount(char **); +static struct mntarg *parse_mountroot_options(struct mntarg *, const char *); + +/* + * The vnode of the system's root (/ in the filesystem, without chroot + * active.) + */ +struct vnode *rootvnode; + +char *rootdevnames[2] = {NULL, NULL}; + +struct root_hold_token { + const char *who; + LIST_ENTRY(root_hold_token) list; +}; + +static LIST_HEAD(, root_hold_token) root_holds = + LIST_HEAD_INITIALIZER(root_holds); + +enum action { + A_PANIC, + A_CONTINUE, + A_REBOOT, + A_RETRY +}; + +static enum action root_mount_action; + +static int root_mount_mddev; +static int root_mount_complete; + +/* By default wait up to 1 second for devices to appear. */ +static int root_mount_timeout = 1; + +struct root_hold_token * +root_mount_hold(const char *identifier) +{ + struct root_hold_token *h; + + if (root_mounted()) + return (NULL); + + h = malloc(sizeof *h, M_DEVBUF, M_ZERO | M_WAITOK); + h->who = identifier; + mtx_lock(&mountlist_mtx); + LIST_INSERT_HEAD(&root_holds, h, list); + mtx_unlock(&mountlist_mtx); + return (h); +} + +void +root_mount_rel(struct root_hold_token *h) +{ + + if (h == NULL) + return; + mtx_lock(&mountlist_mtx); + LIST_REMOVE(h, list); + wakeup(&root_holds); + mtx_unlock(&mountlist_mtx); + free(h, M_DEVBUF); +} + +int +root_mounted(void) +{ + + /* No mutex is acquired here because int stores are atomic. 
*/ + return (root_mount_complete); +} + +void +root_mount_wait(void) +{ + + /* + * Panic on an obvious deadlock - the function can't be called from + * a thread which is doing the whole SYSINIT stuff. + */ + KASSERT(curthread->td_proc->p_pid != 0, + ("root_mount_wait: cannot be called from the swapper thread")); + mtx_lock(&mountlist_mtx); + while (!root_mount_complete) { + msleep(&root_mount_complete, &mountlist_mtx, PZERO, "rootwait", + hz); + } + mtx_unlock(&mountlist_mtx); +} + +static void +set_rootvnode(void) +{ + struct proc *p; + + if (VFS_ROOT(TAILQ_FIRST(&mountlist), LK_EXCLUSIVE, &rootvnode)) + panic("Cannot find root vnode"); + + VOP_UNLOCK(rootvnode, 0); + + p = curthread->td_proc; + FILEDESC_XLOCK(p->p_fd); + + if (p->p_fd->fd_cdir != NULL) + vrele(p->p_fd->fd_cdir); + p->p_fd->fd_cdir = rootvnode; + VREF(rootvnode); + + if (p->p_fd->fd_rdir != NULL) + vrele(p->p_fd->fd_rdir); + p->p_fd->fd_rdir = rootvnode; + VREF(rootvnode); + + FILEDESC_XUNLOCK(p->p_fd); + + EVENTHANDLER_INVOKE(mountroot); +} + +static int +vfs_mountroot_devfs(struct thread *td, struct mount **mpp) +{ + struct vfsoptlist *opts; + struct vfsconf *vfsp; + struct mount *mp; + int error; + + *mpp = NULL; + + vfsp = vfs_byname("devfs"); + KASSERT(vfsp != NULL, ("Could not find devfs by name")); + if (vfsp == NULL) + return (ENOENT); + + mp = vfs_mount_alloc(NULLVP, vfsp, "/dev", td->td_ucred); + + error = VFS_MOUNT(mp); + KASSERT(error == 0, ("VFS_MOUNT(devfs) failed %d", error)); + if (error) + return (error); + + opts = malloc(sizeof(struct vfsoptlist), M_MOUNT, M_WAITOK); + TAILQ_INIT(opts); + mp->mnt_opt = opts; + + mtx_lock(&mountlist_mtx); + TAILQ_INSERT_HEAD(&mountlist, mp, mnt_list); + mtx_unlock(&mountlist_mtx); + + *mpp = mp; + set_rootvnode(); + + error = kern_symlink(td, "/", "dev", UIO_SYSSPACE); + if (error) + printf("kern_symlink /dev -> / returns %d\n", error); + + return (error); +} + +static int +vfs_mountroot_shuffle(struct thread *td, struct mount *mpdevfs) +{ + 
struct nameidata nd; + struct mount *mporoot, *mpnroot; + struct vnode *vp, *vporoot, *vpdevfs; + char *fspath; + int error; + + mpnroot = TAILQ_NEXT(mpdevfs, mnt_list); + + /* Shuffle the mountlist. */ + mtx_lock(&mountlist_mtx); + mporoot = TAILQ_FIRST(&mountlist); + TAILQ_REMOVE(&mountlist, mpdevfs, mnt_list); + if (mporoot != mpdevfs) { + TAILQ_REMOVE(&mountlist, mpnroot, mnt_list); + TAILQ_INSERT_HEAD(&mountlist, mpnroot, mnt_list); + } + TAILQ_INSERT_TAIL(&mountlist, mpdevfs, mnt_list); + mtx_unlock(&mountlist_mtx); + + cache_purgevfs(mporoot); + if (mporoot != mpdevfs) + cache_purgevfs(mpdevfs); + + VFS_ROOT(mporoot, LK_EXCLUSIVE, &vporoot); + + VI_LOCK(vporoot); + vporoot->v_iflag &= ~VI_MOUNT; + VI_UNLOCK(vporoot); + vporoot->v_vflag &= ~VV_ROOT; + vporoot->v_mountedhere = NULL; + mporoot->mnt_vnodecovered = NULL; + vput(vporoot); + + /* Set up the new rootvnode, and purge the cache */ + mpnroot->mnt_vnodecovered = NULL; + set_rootvnode(); + cache_purgevfs(rootvnode->v_mount); + + if (mporoot != mpdevfs) { + /* Remount old root under /.mount or /mnt */ + fspath = "/.mount"; + NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, + fspath, td); + error = namei(&nd); + if (error) { + NDFREE(&nd, NDF_ONLY_PNBUF); + fspath = "/mnt"; + NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, + fspath, td); + error = namei(&nd); + } + if (!error) { + vp = nd.ni_vp; + error = (vp->v_type == VDIR) ? 
0 : ENOTDIR; + if (!error) + error = vinvalbuf(vp, V_SAVE, 0, 0); + if (!error) { + cache_purge(vp); + mporoot->mnt_vnodecovered = vp; + vp->v_mountedhere = mporoot; + strlcpy(mporoot->mnt_stat.f_mntonname, + fspath, MNAMELEN); + VOP_UNLOCK(vp, 0); + } else + vput(vp); + } + NDFREE(&nd, NDF_ONLY_PNBUF); + + if (mporoot->mnt_vnodecovered == NULL) + printf("mountroot: unable to remount previous root.\n"); + } + + /* Remount devfs under /dev */ + NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, "/dev", td); + + error = namei(&nd); + if (!error) { + vp = nd.ni_vp; + error = (vp->v_type == VDIR) ? 0 : ENOTDIR; + if (!error) + error = vinvalbuf(vp, V_SAVE, 0, 0); + if (!error) { + vpdevfs = mpdevfs->mnt_vnodecovered; + if (vpdevfs != NULL) { + cache_purge(vpdevfs); + vpdevfs->v_mountedhere = NULL; + vrele(vpdevfs); + } + mpdevfs->mnt_vnodecovered = vp; + vp->v_mountedhere = mpdevfs; + VOP_UNLOCK(vp, 0); + } else + vput(vp); + } + NDFREE(&nd, NDF_ONLY_PNBUF); + + if (mporoot == mpdevfs) { + vfs_unbusy(mpdevfs); + /* Unlink the no longer needed /dev/dev -> / symlink */ + kern_unlink(td, "/dev/dev", UIO_SYSSPACE); + } + + return (0); +} + +/* + * Configuration parser. + */ + +/* Parser character classes. */ +#define CC_WHITESPACE -1 +#define CC_NONWHITESPACE -2 + +/* Parse errors. */ +#define PE_EOF -1 +#define PE_EOL -2 + +static __inline int +parse_peek(char **conf) +{ + + return (**conf); +} + +static __inline void +parse_poke(char **conf, int c) +{ + + **conf = c; +} + +static __inline void +parse_advance(char **conf) +{ + + (*conf)++; +} + +static __inline int +parse_isspace(int c) +{ + + return ((c == ' ' || c == '\t' || c == '\n') ? 1 : 0); +} + +static int +parse_skipto(char **conf, int mc) +{ + int c, match; + + while (1) { + c = parse_peek(conf); + if (c == 0) + return (PE_EOF); + switch (mc) { + case CC_WHITESPACE: + match = (c == ' ' || c == '\t' || c == '\n') ? 
1 : 0; + break; + case CC_NONWHITESPACE: + if (c == '\n') + return (PE_EOL); + match = (c != ' ' && c != '\t') ? 1 : 0; + break; + default: + match = (c == mc) ? 1 : 0; + break; + } + if (match) + break; + parse_advance(conf); + } + return (0); +} + +static int +parse_token(char **conf, char **tok) +{ + char *p; + size_t len; + int error; + + *tok = NULL; + error = parse_skipto(conf, CC_NONWHITESPACE); + if (error) + return (error); + p = *conf; + error = parse_skipto(conf, CC_WHITESPACE); + len = *conf - p; + *tok = malloc(len + 1, M_TEMP, M_WAITOK | M_ZERO); + bcopy(p, *tok, len); + return (0); +} + +static void +parse_dir_ask_printenv(const char *var) +{ + char *val; + + val = getenv(var); + if (val != NULL) { + printf(" %s=%s\n", var, val); + freeenv(val); + } +} + +static int +parse_dir_ask(char **conf) +{ + char name[80]; + char *mnt; + int error; + + printf("\nLoader variables:\n"); + parse_dir_ask_printenv("vfs.root.mountfrom"); + parse_dir_ask_printenv("vfs.root.mountfrom.options"); + + printf("\nManual root filesystem specification:\n"); + printf(" : [options]\n"); + printf(" Mount using filesystem \n"); + printf(" and with the specified (optional) option list.\n"); + printf("\n"); + printf(" eg. ufs:/dev/da0s1a\n"); + printf(" cd9660:/dev/acd0 ro\n"); + printf(" (which is equivalent to: "); + printf("mount -t cd9660 -o ro /dev/acd0 /\n"); + printf("\n"); + printf(" ? 
List valid disk boot devices\n"); + printf(" Abort manual input\n"); + + again: + printf("\nmountroot> "); + gets(name, sizeof(name), 1); + if (name[0] == '\0') + return (0); + if (name[0] == '?') { + printf("\nList of GEOM managed disk devices:\n "); + g_dev_print(); + goto again; + } + mnt = name; + error = parse_mount(&mnt); + if (error == -1) { + printf("Invalid specification.\n"); + goto again; + } + return (error); +} + +static int +parse_dir_md(char **conf) +{ + struct stat sb; + struct thread *td; + struct md_ioctl *mdio; + char *path, *tok; + int error, fd, len; + + td = curthread; + + error = parse_token(conf, &tok); + if (error) + return (error); + + len = strlen(tok); + mdio = malloc(sizeof(*mdio) + len + 1, M_TEMP, M_WAITOK | M_ZERO); + path = (void *)(mdio + 1); + bcopy(tok, path, len); + free(tok, M_TEMP); + + /* Get file status. */ + error = kern_stat(td, path, UIO_SYSSPACE, &sb); + if (error) + goto out; + + /* Open /dev/mdctl so that we can attach/detach. */ + error = kern_open(td, "/dev/" MDCTL_NAME, UIO_SYSSPACE, O_RDWR, 0); + if (error) + goto out; + + fd = td->td_retval[0]; + mdio->md_version = MDIOVERSION; + mdio->md_type = MD_VNODE; + + if (root_mount_mddev != -1) { + mdio->md_unit = root_mount_mddev; + DROP_GIANT(); + error = kern_ioctl(td, fd, MDIOCDETACH, (void *)mdio); + PICKUP_GIANT(); + /* Ignore errors. We don't care. */ + root_mount_mddev = -1; + } + + mdio->md_file = (void *)(mdio + 1); + mdio->md_options = MD_AUTOUNIT | MD_READONLY; + mdio->md_mediasize = sb.st_size; + mdio->md_unit = 0; + DROP_GIANT(); + error = kern_ioctl(td, fd, MDIOCATTACH, (void *)mdio); + PICKUP_GIANT(); + if (error) + goto out; + + if (mdio->md_unit > 9) { + printf("rootmount: too many md units\n"); + mdio->md_file = NULL; + mdio->md_options = 0; + mdio->md_mediasize = 0; + DROP_GIANT(); + error = kern_ioctl(td, fd, MDIOCDETACH, (void *)mdio); + PICKUP_GIANT(); + /* Ignore errors. We don't care. 
*/ + error = ERANGE; + goto out; + } + + root_mount_mddev = mdio->md_unit; + printf(MD_NAME "%u attached to %s\n", root_mount_mddev, mdio->md_file); + + error = kern_close(td, fd); + + out: + free(mdio, M_TEMP); + return (error); +} + +static int +parse_dir_onfail(char **conf) +{ + char *action; + int error; + + error = parse_token(conf, &action); + if (error) + return (error); + + if (!strcmp(action, "continue")) + root_mount_action = A_CONTINUE; + else if (!strcmp(action, "panic")) + root_mount_action = A_PANIC; + else if (!strcmp(action, "reboot")) + root_mount_action = A_REBOOT; + else if (!strcmp(action, "retry")) + root_mount_action = A_RETRY; + else { + printf("rootmount: %s: unknown action\n", action); + error = EINVAL; + } + + free(action, M_TEMP); + return (0); +} + +static int +parse_dir_timeout(char **conf) +{ + char *tok, *endtok; + long secs; + int error; + + error = parse_token(conf, &tok); + if (error) + return (error); + + secs = strtol(tok, &endtok, 0); + error = (secs < 0 || *endtok != '\0') ? EINVAL : 0; + if (!error) + root_mount_timeout = secs; + free(tok, M_TEMP); + return (error); +} + +static int +parse_directive(char **conf) +{ + char *dir; + int error; + + error = parse_token(conf, &dir); + if (error) + return (error); + + if (strcmp(dir, ".ask") == 0) + error = parse_dir_ask(conf); + else if (strcmp(dir, ".md") == 0) + error = parse_dir_md(conf); + else if (strcmp(dir, ".onfail") == 0) + error = parse_dir_onfail(conf); + else if (strcmp(dir, ".timeout") == 0) + error = parse_dir_timeout(conf); + else { + printf("mountroot: invalid directive `%s'\n", dir); + /* Ignore the rest of the line. 
*/ + (void)parse_skipto(conf, '\n'); + error = EINVAL; + } + free(dir, M_TEMP); + return (error); +} + +static int +parse_mount(char **conf) +{ + char errmsg[255]; + struct mntarg *ma; + char *dev, *fs, *opts, *tok; + int error; + + error = parse_token(conf, &tok); + if (error) + return (error); + fs = tok; + error = parse_skipto(&tok, ':'); + if (error) { + free(fs, M_TEMP); + return (error); + } + parse_poke(&tok, '\0'); + parse_advance(&tok); + dev = tok; + + if (root_mount_mddev != -1) { + /* Handle substitution for the md unit number. */ + tok = strstr(dev, "md#"); + if (tok != NULL) + tok[2] = '0' + root_mount_mddev; + } + + /* Parse options. */ + error = parse_token(conf, &tok); + opts = (error == 0) ? tok : NULL; + + printf("Trying to mount root from %s:%s [%s]...\n", fs, dev, + (opts != NULL) ? opts : ""); + + bzero(errmsg, sizeof(errmsg)); + + if (vfs_byname(fs) == NULL) { + strlcpy(errmsg, "unknown file system", sizeof(errmsg)); + error = ENOENT; + goto out; + } + + if (dev[0] != '\0') { + /* XXX wait N seconds for the device to appear. */ + } + + ma = NULL; + ma = mount_arg(ma, "fstype", fs, -1); + ma = mount_arg(ma, "fspath", "/", -1); + ma = mount_arg(ma, "from", dev, -1); + ma = mount_arg(ma, "errmsg", errmsg, sizeof(errmsg)); + ma = mount_arg(ma, "ro", NULL, 0); + ma = parse_mountroot_options(ma, opts); + error = kernel_mount(ma, MNT_ROOTFS); + + out: + if (error) { + printf("Mounting from %s:%s failed with error %d", + fs, dev, error); + if (errmsg[0] != '\0') + printf(": %s", errmsg); + printf(".\n"); + } + free(fs, M_TEMP); + if (opts != NULL) + free(opts, M_TEMP); + /* kernel_mount can return -1 on error. */ + return ((error < 0) ? EDOOFUS : error); +} + +static int +vfs_mountroot_parse(char **conf, struct mount *mpdevfs) +{ + struct mount *mp; + int error; + + mp = TAILQ_NEXT(mpdevfs, mnt_list); + error = (mp == NULL) ? 
0 : EDOOFUS; + root_mount_mddev = -1; + root_mount_action = A_CONTINUE; + while (mp == NULL) { + error = parse_skipto(conf, CC_NONWHITESPACE); + if (error == PE_EOL) { + parse_advance(conf); + continue; + } + if (error < 0) + break; + switch (parse_peek(conf)) { + case '#': + error = parse_skipto(conf, '\n'); + break; + case '.': + error = parse_directive(conf); + break; + default: + error = parse_mount(conf); + break; + } + if (error < 0) + break; + /* Ignore any trailing garbage on the line. */ + if (parse_peek(conf) != '\n') { + printf("mountroot: advancing to next directive...\n"); + (void)parse_skipto(conf, '\n'); + } + mp = TAILQ_NEXT(mpdevfs, mnt_list); + } + + printf("XXX: %s: error = %d, mpdevfs=%p, mp=%p\n", __func__, + error, mpdevfs, mp); + + return (error); +} + +static void +vfs_mountroot_conf0(struct sbuf *sb) +{ + char *s, *tok, *mnt, *opt; + int error; + + sbuf_printf(sb, ".onfail panic\n"); + sbuf_printf(sb, ".timeout 1\n"); + if (boothowto & RB_ASKNAME) + sbuf_printf(sb, ".ask\n"); +#ifdef ROOTDEVNAME + if (boothowto & RB_DFLTROOT) + sbuf_printf(sb, "%s\n", ROOTDEVNAME); +#endif + if (boothowto & RB_CDROM) { + sbuf_printf(sb, "cd9660:cd0\n"); + sbuf_printf(sb, ".timeout 0\n"); + sbuf_printf(sb, "cd9660:acd0\n"); + sbuf_printf(sb, ".timeout 1\n"); + } + s = getenv("vfs.root.mountfrom"); + if (s != NULL) { + opt = getenv("vfs.root.mountfrom.options"); + tok = s; + error = parse_token(&tok, &mnt); + while (!error) { + sbuf_printf(sb, "%s %s\n", mnt, + (opt != NULL) ? 
opt : ""); + free(mnt, M_TEMP); + error = parse_token(&tok, &mnt); + } + if (opt != NULL) + freeenv(opt); + freeenv(s); + } + if (rootdevnames[0] != NULL) + sbuf_printf(sb, "%s\n", rootdevnames[0]); + if (rootdevnames[1] != NULL) + sbuf_printf(sb, "%s\n", rootdevnames[1]); +#ifdef ROOTDEVNAME + if (!(boothowto & RB_DFLTROOT)) + sbuf_printf(sb, "%s\n", ROOTDEVNAME); +#endif + if (!(boothowto & RB_ASKNAME)) + sbuf_printf(sb, ".ask\n"); +} + +static int +vfs_mountroot_readconf(struct thread *td, struct sbuf *sb) +{ + static char buf[128]; + struct nameidata nd; + off_t ofs; + int error, flags; + int len, resid; + int vfslocked; + + NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, + "/.mount.conf", td); + flags = FREAD; + error = vn_open(&nd, &flags, 0, NULL); + if (error) + return (error); + + vfslocked = NDHASGIANT(&nd); + NDFREE(&nd, NDF_ONLY_PNBUF); + ofs = 0; + len = sizeof(buf) - 1; + while (1) { + error = vn_rdwr(UIO_READ, nd.ni_vp, buf, len, ofs, + UIO_SYSSPACE, IO_NODELOCKED, td->td_ucred, + NOCRED, &resid, td); + if (error) + break; + if (resid == len) + break; + buf[len - resid] = 0; + sbuf_printf(sb, "%s", buf); + ofs += len - resid; + } + + VOP_UNLOCK(nd.ni_vp, 0); + vn_close(nd.ni_vp, FREAD, td->td_ucred, td); + VFS_UNLOCK_GIANT(vfslocked); + return (error); +} + +static void +vfs_mountroot_wait(void) +{ + struct root_hold_token *h; + struct timeval lastfail; + int curfail; + + curfail = 0; + while (1) { + DROP_GIANT(); + g_waitidle(); + PICKUP_GIANT(); + mtx_lock(&mountlist_mtx); + if (LIST_EMPTY(&root_holds)) { + mtx_unlock(&mountlist_mtx); + break; + } + if (ppsratecheck(&lastfail, &curfail, 1)) { + printf("Root mount waiting for:"); + LIST_FOREACH(h, &root_holds, list) + printf(" %s", h->who); + printf("\n"); + } + msleep(&root_holds, &mountlist_mtx, PZERO | PDROP, "roothold", + hz); + } +} + +void +vfs_mountroot(void) +{ + struct mount *mp; + struct sbuf *sb; + struct thread *td; + char *conf; + time_t timebase; + int error; + + td = curthread; + + 
vfs_mountroot_wait(); + + sb = sbuf_new_auto(); + vfs_mountroot_conf0(sb); + sbuf_finish(sb); + + error = vfs_mountroot_devfs(td, &mp); + while (!error) { + conf = sbuf_data(sb); + printf("========\n%s========\n", conf); + error = vfs_mountroot_parse(&conf, mp); + if (!error) { + error = vfs_mountroot_shuffle(td, mp); + if (!error) { + sbuf_clear(sb); + error = vfs_mountroot_readconf(td, sb); + sbuf_finish(sb); + } + } + } + + sbuf_delete(sb); + + /* + * Iterate over all currently mounted file systems and use + * the time stamp found to check and/or initialize the RTC. + * Call inittodr() only once and pass it the largest of the + * timestamps we encounter. + */ + timebase = 0; + mtx_lock(&mountlist_mtx); + mp = TAILQ_FIRST(&mountlist); + while (mp != NULL) { + if (mp->mnt_time > timebase) + timebase = mp->mnt_time; + mp = TAILQ_NEXT(mp, mnt_list); + } + mtx_unlock(&mountlist_mtx); + inittodr(timebase); + + /* Keep prison0's root in sync with the global rootvnode. */ + mtx_lock(&prison0.pr_mtx); + prison0.pr_root = rootvnode; + vref(prison0.pr_root); + mtx_unlock(&prison0.pr_mtx); + + mtx_lock(&mountlist_mtx); + atomic_store_rel_int(&root_mount_complete, 1); + wakeup(&root_mount_complete); + mtx_unlock(&mountlist_mtx); +} + +static struct mntarg * +parse_mountroot_options(struct mntarg *ma, const char *options) +{ + char *p; + char *name, *name_arg; + char *val, *val_arg; + char *opts; + + if (options == NULL || options[0] == '\0') + return (ma); + + p = opts = strdup(options, M_MOUNT); + if (opts == NULL) { + return (ma); + } + + while((name = strsep(&p, ",")) != NULL) { + if (name[0] == '\0') + break; + + val = strchr(name, '='); + if (val != NULL) { + *val = '\0'; + ++val; + } + if( strcmp(name, "rw") == 0 || + strcmp(name, "noro") == 0) { + /* + * The first time we mount the root file system, + * we need to mount 'ro', so We need to ignore + * 'rw' and 'noro' mount options. 
+ */ + continue; + } + name_arg = strdup(name, M_MOUNT); + val_arg = NULL; + if (val != NULL) + val_arg = strdup(val, M_MOUNT); + + ma = mount_arg(ma, name_arg, val_arg, + (val_arg != NULL ? -1 : 0)); + } + free(opts, M_MOUNT); + return (ma); +} Index: kern/vfs_mount.c =================================================================== --- kern/vfs_mount.c (revision 41) +++ kern/vfs_mount.c (revision 49) @@ -67,16 +67,10 @@ #include #include -#include "opt_rootdevname.h" - -#define ROOTNAME "root_device" #define VFS_MOUNTARG_SIZE_MAX (1024 * 64) -static void set_rootvnode(void); static int vfs_domount(struct thread *td, const char *fstype, char *fspath, int fsflags, void *fsdata); -static int vfs_mountroot_ask(void); -static int vfs_mountroot_try(const char *mountfrom, const char *options); static void free_mntarg(struct mntarg *ma); static int usermount = 0; @@ -95,31 +89,6 @@ MTX_SYSINIT(mountlist, &mountlist_mtx, "mountlist", MTX_DEF); /* - * The vnode of the system's root (/ in the filesystem, without chroot - * active.) - */ -struct vnode *rootvnode; - -/* - * The root filesystem is detailed in the kernel environment variable - * vfs.root.mountfrom, which is expected to be in the general format - * - * :[][ :[] ...] - * vfsname := the name of a VFS known to the kernel and capable - * of being mounted as root - * path := disk device name or other data used by the filesystem - * to locate its physical store - * - * If the environment variable vfs.root.mountfrom is a space separated list, - * each list element is tried in turn and the root filesystem will be mounted - * from the first one that suceeds. - * - * The environment variable vfs.root.mountfrom.options is a comma delimited - * set of string mount options. These mount options must be parseable - * by nmount() in the kernel. 
- */ - -/* * Global opts, taken by all filesystems */ static const char *global_opts[] = { @@ -133,22 +102,36 @@ NULL }; -/* - * The root specifiers we will try if RB_CDROM is specified. - */ -static char *cdrom_rootdevnames[] = { - "cd9660:cd0", - "cd9660:acd0", - NULL -}; +static int +mount_init(void *mem, int size, int flags) +{ + struct mount *mp; -/* legacy find-root code */ -char *rootdevnames[2] = {NULL, NULL}; -#ifndef ROOTDEVNAME -# define ROOTDEVNAME NULL -#endif -static const char *ctrootdevname = ROOTDEVNAME; + mp = (struct mount *)mem; + mtx_init(&mp->mnt_mtx, "struct mount mtx", NULL, MTX_DEF); + lockinit(&mp->mnt_explock, PVFS, "explock", 0, 0); + return (0); +} +static void +mount_fini(void *mem, int size) +{ + struct mount *mp; + + mp = (struct mount *)mem; + lockdestroy(&mp->mnt_explock); + mtx_destroy(&mp->mnt_mtx); +} + +static void +vfs_mount_init(void *dummy __unused) +{ + + mount_zone = uma_zcreate("Mountpoints", sizeof(struct mount), NULL, + NULL, mount_init, mount_fini, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); +} +SYSINIT(vfs_mount, SI_SUB_VFS, SI_ORDER_ANY, vfs_mount_init, NULL); + /* * --------------------------------------------------------------------- * Functions for building and sanitizing the mount options @@ -452,27 +435,6 @@ MNT_IUNLOCK(mp); } -static int -mount_init(void *mem, int size, int flags) -{ - struct mount *mp; - - mp = (struct mount *)mem; - mtx_init(&mp->mnt_mtx, "struct mount mtx", NULL, MTX_DEF); - lockinit(&mp->mnt_explock, PVFS, "explock", 0, 0); - return (0); -} - -static void -mount_fini(void *mem, int size) -{ - struct mount *mp; - - mp = (struct mount *)mem; - lockdestroy(&mp->mnt_explock); - mtx_destroy(&mp->mnt_mtx); -} - /* * Allocate and initialize the mount point struct. 
*/ @@ -1343,269 +1305,6 @@ } /* - * --------------------------------------------------------------------- - * Mounting of root filesystem - * - */ - -struct root_hold_token { - const char *who; - LIST_ENTRY(root_hold_token) list; -}; - -static LIST_HEAD(, root_hold_token) root_holds = - LIST_HEAD_INITIALIZER(root_holds); - -static int root_mount_complete; - -/* - * Hold root mount. - */ -struct root_hold_token * -root_mount_hold(const char *identifier) -{ - struct root_hold_token *h; - - if (root_mounted()) - return (NULL); - - h = malloc(sizeof *h, M_DEVBUF, M_ZERO | M_WAITOK); - h->who = identifier; - mtx_lock(&mountlist_mtx); - LIST_INSERT_HEAD(&root_holds, h, list); - mtx_unlock(&mountlist_mtx); - return (h); -} - -/* - * Release root mount. - */ -void -root_mount_rel(struct root_hold_token *h) -{ - - if (h == NULL) - return; - mtx_lock(&mountlist_mtx); - LIST_REMOVE(h, list); - wakeup(&root_holds); - mtx_unlock(&mountlist_mtx); - free(h, M_DEVBUF); -} - -/* - * Wait for all subsystems to release root mount. - */ -static void -root_mount_prepare(void) -{ - struct root_hold_token *h; - struct timeval lastfail; - int curfail = 0; - - for (;;) { - DROP_GIANT(); - g_waitidle(); - PICKUP_GIANT(); - mtx_lock(&mountlist_mtx); - if (LIST_EMPTY(&root_holds)) { - mtx_unlock(&mountlist_mtx); - break; - } - if (ppsratecheck(&lastfail, &curfail, 1)) { - printf("Root mount waiting for:"); - LIST_FOREACH(h, &root_holds, list) - printf(" %s", h->who); - printf("\n"); - } - msleep(&root_holds, &mountlist_mtx, PZERO | PDROP, "roothold", - hz); - } -} - -/* - * Root was mounted, share the good news. - */ -static void -root_mount_done(void) -{ - - /* Keep prison0's root in sync with the global rootvnode. */ - mtx_lock(&prison0.pr_mtx); - prison0.pr_root = rootvnode; - vref(prison0.pr_root); - mtx_unlock(&prison0.pr_mtx); - /* - * Use a mutex to prevent the wakeup being missed and waiting for - * an extra 1 second sleep. 
- */ - mtx_lock(&mountlist_mtx); - root_mount_complete = 1; - wakeup(&root_mount_complete); - mtx_unlock(&mountlist_mtx); -} - -/* - * Return true if root is already mounted. - */ -int -root_mounted(void) -{ - - /* No mutex is acquired here because int stores are atomic. */ - return (root_mount_complete); -} - -/* - * Wait until root is mounted. - */ -void -root_mount_wait(void) -{ - - /* - * Panic on an obvious deadlock - the function can't be called from - * a thread which is doing the whole SYSINIT stuff. - */ - KASSERT(curthread->td_proc->p_pid != 0, - ("root_mount_wait: cannot be called from the swapper thread")); - mtx_lock(&mountlist_mtx); - while (!root_mount_complete) { - msleep(&root_mount_complete, &mountlist_mtx, PZERO, "rootwait", - hz); - } - mtx_unlock(&mountlist_mtx); -} - -static void -set_rootvnode() -{ - struct proc *p; - - if (VFS_ROOT(TAILQ_FIRST(&mountlist), LK_EXCLUSIVE, &rootvnode)) - panic("Cannot find root vnode"); - - VOP_UNLOCK(rootvnode, 0); - - p = curthread->td_proc; - FILEDESC_XLOCK(p->p_fd); - - if (p->p_fd->fd_cdir != NULL) - vrele(p->p_fd->fd_cdir); - p->p_fd->fd_cdir = rootvnode; - VREF(rootvnode); - - if (p->p_fd->fd_rdir != NULL) - vrele(p->p_fd->fd_rdir); - p->p_fd->fd_rdir = rootvnode; - VREF(rootvnode); - - FILEDESC_XUNLOCK(p->p_fd); - - EVENTHANDLER_INVOKE(mountroot); -} - -/* - * Mount /devfs as our root filesystem, but do not put it on the mountlist - * yet. Create a /dev -> / symlink so that absolute pathnames will lookup. 
- */ - -static void -devfs_first(void) -{ - struct thread *td = curthread; - struct vfsoptlist *opts; - struct vfsconf *vfsp; - struct mount *mp = NULL; - int error; - - vfsp = vfs_byname("devfs"); - KASSERT(vfsp != NULL, ("Could not find devfs by name")); - if (vfsp == NULL) - return; - - mp = vfs_mount_alloc(NULLVP, vfsp, "/dev", td->td_ucred); - - error = VFS_MOUNT(mp); - KASSERT(error == 0, ("VFS_MOUNT(devfs) failed %d", error)); - if (error) - return; - - opts = malloc(sizeof(struct vfsoptlist), M_MOUNT, M_WAITOK); - TAILQ_INIT(opts); - mp->mnt_opt = opts; - - mtx_lock(&mountlist_mtx); - TAILQ_INSERT_HEAD(&mountlist, mp, mnt_list); - mtx_unlock(&mountlist_mtx); - - set_rootvnode(); - - error = kern_symlink(td, "/", "dev", UIO_SYSSPACE); - if (error) - printf("kern_symlink /dev -> / returns %d\n", error); -} - -/* - * Surgically move our devfs to be mounted on /dev. - */ - -static void -devfs_fixup(struct thread *td) -{ - struct nameidata nd; - int error; - struct vnode *vp, *dvp; - struct mount *mp; - - /* Remove our devfs mount from the mountlist and purge the cache */ - mtx_lock(&mountlist_mtx); - mp = TAILQ_FIRST(&mountlist); - TAILQ_REMOVE(&mountlist, mp, mnt_list); - mtx_unlock(&mountlist_mtx); - cache_purgevfs(mp); - - VFS_ROOT(mp, LK_EXCLUSIVE, &dvp); - VI_LOCK(dvp); - dvp->v_iflag &= ~VI_MOUNT; - VI_UNLOCK(dvp); - dvp->v_mountedhere = NULL; - - /* Set up the real rootvnode, and purge the cache */ - TAILQ_FIRST(&mountlist)->mnt_vnodecovered = NULL; - set_rootvnode(); - cache_purgevfs(rootvnode->v_mount); - - NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, "/dev", td); - error = namei(&nd); - if (error) { - printf("Lookup of /dev for devfs, error: %d\n", error); - return; - } - NDFREE(&nd, NDF_ONLY_PNBUF); - vp = nd.ni_vp; - if (vp->v_type != VDIR) { - vput(vp); - } - error = vinvalbuf(vp, V_SAVE, 0, 0); - if (error) { - vput(vp); - } - cache_purge(vp); - mp->mnt_vnodecovered = vp; - vp->v_mountedhere = mp; - mtx_lock(&mountlist_mtx); - 
TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list); - mtx_unlock(&mountlist_mtx); - VOP_UNLOCK(vp, 0); - vput(dvp); - vfs_unbusy(mp); - - /* Unlink the no longer needed /dev/dev -> / symlink */ - kern_unlink(td, "/dev/dev", UIO_SYSSPACE); -} - -/* * Report errors during filesystem mounting. */ void @@ -1642,288 +1341,7 @@ } /* - * Find and mount the root filesystem - */ -void -vfs_mountroot(void) -{ - char *cp, *cpt, *options, *tmpdev; - int error, i, asked = 0; - - options = NULL; - - root_mount_prepare(); - - mount_zone = uma_zcreate("Mountpoints", sizeof(struct mount), - NULL, NULL, mount_init, mount_fini, - UMA_ALIGN_PTR, UMA_ZONE_NOFREE); - devfs_first(); - - /* - * We are booted with instructions to prompt for the root filesystem. - */ - if (boothowto & RB_ASKNAME) { - if (!vfs_mountroot_ask()) - goto mounted; - asked = 1; - } - - options = getenv("vfs.root.mountfrom.options"); - - /* - * The root filesystem information is compiled in, and we are - * booted with instructions to use it. - */ - if (ctrootdevname != NULL && (boothowto & RB_DFLTROOT)) { - if (!vfs_mountroot_try(ctrootdevname, options)) - goto mounted; - ctrootdevname = NULL; - } - - /* - * We've been given the generic "use CDROM as root" flag. This is - * necessary because one media may be used in many different - * devices, so we need to search for them. - */ - if (boothowto & RB_CDROM) { - for (i = 0; cdrom_rootdevnames[i] != NULL; i++) { - if (!vfs_mountroot_try(cdrom_rootdevnames[i], options)) - goto mounted; - } - } - - /* - * Try to use the value read by the loader from /etc/fstab, or - * supplied via some other means. This is the preferred - * mechanism. 
- */ - cp = getenv("vfs.root.mountfrom"); - if (cp != NULL) { - cpt = cp; - while ((tmpdev = strsep(&cpt, " \t")) != NULL) { - error = vfs_mountroot_try(tmpdev, options); - if (error == 0) { - freeenv(cp); - goto mounted; - } - } - freeenv(cp); - } - - /* - * Try values that may have been computed by code during boot - */ - if (!vfs_mountroot_try(rootdevnames[0], options)) - goto mounted; - if (!vfs_mountroot_try(rootdevnames[1], options)) - goto mounted; - - /* - * If we (still) have a compiled-in default, try it. - */ - if (ctrootdevname != NULL) - if (!vfs_mountroot_try(ctrootdevname, options)) - goto mounted; - /* - * Everything so far has failed, prompt on the console if we haven't - * already tried that. - */ - if (!asked) - if (!vfs_mountroot_ask()) - goto mounted; - - panic("Root mount failed, startup aborted."); - -mounted: - root_mount_done(); - freeenv(options); -} - -static struct mntarg * -parse_mountroot_options(struct mntarg *ma, const char *options) -{ - char *p; - char *name, *name_arg; - char *val, *val_arg; - char *opts; - - if (options == NULL || options[0] == '\0') - return (ma); - - p = opts = strdup(options, M_MOUNT); - if (opts == NULL) { - return (ma); - } - - while((name = strsep(&p, ",")) != NULL) { - if (name[0] == '\0') - break; - - val = strchr(name, '='); - if (val != NULL) { - *val = '\0'; - ++val; - } - if( strcmp(name, "rw") == 0 || - strcmp(name, "noro") == 0) { - /* - * The first time we mount the root file system, - * we need to mount 'ro', so We need to ignore - * 'rw' and 'noro' mount options. - */ - continue; - } - name_arg = strdup(name, M_MOUNT); - val_arg = NULL; - if (val != NULL) - val_arg = strdup(val, M_MOUNT); - - ma = mount_arg(ma, name_arg, val_arg, - (val_arg != NULL ? -1 : 0)); - } - free(opts, M_MOUNT); - return (ma); -} - -/* - * Mount (mountfrom) as the root filesystem. 
- */ -static int -vfs_mountroot_try(const char *mountfrom, const char *options) -{ - struct mount *mp; - struct mntarg *ma; - char *vfsname, *path; - time_t timebase; - int error; - char patt[32]; - char errmsg[255]; - - vfsname = NULL; - path = NULL; - mp = NULL; - ma = NULL; - error = EINVAL; - bzero(errmsg, sizeof(errmsg)); - - if (mountfrom == NULL) - return (error); /* don't complain */ - printf("Trying to mount root from %s\n", mountfrom); - - /* parse vfs name and path */ - vfsname = malloc(MFSNAMELEN, M_MOUNT, M_WAITOK); - path = malloc(MNAMELEN, M_MOUNT, M_WAITOK); - vfsname[0] = path[0] = 0; - sprintf(patt, "%%%d[a-z0-9]:%%%ds", MFSNAMELEN, MNAMELEN); - if (sscanf(mountfrom, patt, vfsname, path) < 1) - goto out; - - if (path[0] == '\0') - strcpy(path, ROOTNAME); - - ma = mount_arg(ma, "fstype", vfsname, -1); - ma = mount_arg(ma, "fspath", "/", -1); - ma = mount_arg(ma, "from", path, -1); - ma = mount_arg(ma, "errmsg", errmsg, sizeof(errmsg)); - ma = mount_arg(ma, "ro", NULL, 0); - ma = parse_mountroot_options(ma, options); - error = kernel_mount(ma, MNT_ROOTFS); - - if (error == 0) { - /* - * We mount devfs prior to mounting the / FS, so the first - * entry will typically be devfs. - */ - mp = TAILQ_FIRST(&mountlist); - KASSERT(mp != NULL, ("%s: mountlist is empty", __func__)); - - /* - * Iterate over all currently mounted file systems and use - * the time stamp found to check and/or initialize the RTC. - * Typically devfs has no time stamp and the only other FS - * is the actual / FS. - * Call inittodr() only once and pass it the largest of the - * timestamps we encounter. 
- */ - timebase = 0; - do { - if (mp->mnt_time > timebase) - timebase = mp->mnt_time; - mp = TAILQ_NEXT(mp, mnt_list); - } while (mp != NULL); - inittodr(timebase); - - devfs_fixup(curthread); - } - - if (error != 0 ) { - printf("ROOT MOUNT ERROR: %s\n", errmsg); - printf("If you have invalid mount options, reboot, and "); - printf("first try the following from\n"); - printf("the loader prompt:\n\n"); - printf(" set vfs.root.mountfrom.options=rw\n\n"); - printf("and then remove invalid mount options from "); - printf("/etc/fstab.\n\n"); - } -out: - free(path, M_MOUNT); - free(vfsname, M_MOUNT); - return (error); -} - -/* * --------------------------------------------------------------------- - * Interactive root filesystem selection code. - */ - -static int -vfs_mountroot_ask(void) -{ - char name[128]; - char *mountfrom; - char *options; - - for(;;) { - printf("Loader variables:\n"); - printf("vfs.root.mountfrom="); - mountfrom = getenv("vfs.root.mountfrom"); - if (mountfrom != NULL) { - printf("%s", mountfrom); - } - printf("\n"); - printf("vfs.root.mountfrom.options="); - options = getenv("vfs.root.mountfrom.options"); - if (options != NULL) { - printf("%s", options); - } - printf("\n"); - freeenv(mountfrom); - freeenv(options); - printf("\nManual root filesystem specification:\n"); - printf(" : Mount using filesystem \n"); - printf(" eg. ufs:/dev/da0s1a\n"); - printf(" eg. cd9660:/dev/acd0\n"); - printf(" This is equivalent to: "); - printf("mount -t cd9660 /dev/acd0 /\n"); - printf("\n"); - printf(" ? 
List valid disk boot devices\n"); - printf(" Abort manual input\n"); - printf("\nmountroot> "); - gets(name, sizeof(name), 1); - if (name[0] == '\0') - return (1); - if (name[0] == '?') { - printf("\nList of GEOM managed disk devices:\n "); - g_dev_print(); - continue; - } - if (!vfs_mountroot_try(name, NULL)) - return (0); - } -} - -/* - * --------------------------------------------------------------------- * Functions for querying mount options/arguments from filesystems. */ @@ -1965,15 +1383,17 @@ continue; snprintf(errmsg, sizeof(errmsg), "mount option <%s> is unknown", p); - printf("%s\n", errmsg); ret = EINVAL; } if (ret != 0) { TAILQ_FOREACH(opt, opts, link) { if (strcmp(opt->name, "errmsg") == 0) { strncpy((char *)opt->value, errmsg, opt->len); + break; } } + if (opt == NULL) + printf("%s\n", errmsg); } return (ret); } Index: dev/md/md.c =================================================================== --- dev/md/md.c (revision 41) +++ dev/md/md.c (revision 49) @@ -911,18 +911,26 @@ { struct vattr vattr; struct nameidata nd; + char *fname; int error, flags, vfslocked; - error = copyinstr(mdio->md_file, sc->file, sizeof(sc->file), NULL); - if (error != 0) - return (error); - flags = FREAD|FWRITE; /* - * If the user specified that this is a read only device, unset the - * FWRITE mask before trying to open the backing store. + * Kernel-originated requests must have the filename appended + * to the mdio structure to protect against malicious software. */ - if ((mdio->md_options & MD_READONLY) != 0) - flags &= ~FWRITE; + fname = mdio->md_file; + if ((void *)fname != (void *)(mdio + 1)) { + error = copyinstr(fname, sc->file, sizeof(sc->file), NULL); + if (error != 0) + return (error); + } else + strlcpy(sc->file, fname, sizeof(sc->file)); + + /* + * If the user specified that this is a read only device, don't + * set the FWRITE mask before trying to open the backing store. + */ + flags = FREAD | ((mdio->md_options & MD_READONLY) ? 
0 : FWRITE);
 	NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, sc->file, td);
 	error = vn_open(&nd, &flags, 0, NULL);
 	if (error != 0)

--Boundary_(ID_e2iysUHX7Ge1qa8HV4CNIg)--

From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 10:08:46 2010
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C986C1065694;
	Tue, 28 Sep 2010 10:08:46 +0000 (UTC)
	(envelope-from alexander@leidinger.net)
Received: from mail.ebusiness-leidinger.de (mail.ebusiness-leidinger.de [217.11.53.44])
	by mx1.freebsd.org (Postfix) with ESMTP id 774308FC24;
	Tue, 28 Sep 2010 10:08:46 +0000 (UTC)
Received: from outgoing.leidinger.net (p57B3B90B.dip.t-dialin.net [87.179.185.11])
	by mail.ebusiness-leidinger.de (Postfix) with ESMTPSA id 8143984400A;
	Tue, 28 Sep 2010 11:49:18 +0200 (CEST)
Received: from webmail.leidinger.net (unknown [IPv6:fd73:10c7:2053:1::2:102])
	by outgoing.leidinger.net (Postfix) with ESMTP id 6BDAC193D;
	Tue, 28 Sep 2010 11:49:15 +0200 (CEST)
Received: (from www@localhost)
	by webmail.leidinger.net (8.14.4/8.13.8/Submit) id o8S9nCgW066608;
	Tue, 28 Sep 2010 11:49:12 +0200 (CEST)
	(envelope-from Alexander@Leidinger.net)
Received: from pslux.ec.europa.eu (pslux.ec.europa.eu [158.169.9.14])
	by webmail.leidinger.net (Horde Framework) with HTTP;
	Tue, 28 Sep 2010 11:49:12 +0200
Message-ID: <20100928114912.17443a2o7j71kpaw@webmail.leidinger.net>
Date: Tue, 28 Sep 2010 11:49:12 +0200
From: Alexander Leidinger
To: John Baldwin
References: <201009211507.o8LF7iVv097676@svn.freebsd.org>
	<20100924225352.GD49476@server.vk2pj.dyndns.org>
	<201009270928.47232.jhb@freebsd.org>
In-Reply-To: <201009270928.47232.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; DelSp="Yes"; format="flowed"
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
User-Agent: Dynamic Internet Messaging Program (DIMP) H3 (1.1.4)
X-EBL-MailScanner-Information: Please contact the ISP for more
	information
X-EBL-MailScanner-ID: 8143984400A.A6EFC
X-EBL-MailScanner: Found to be clean
X-EBL-MailScanner-SpamCheck: not spam, spamhaus-ZEN, SpamAssassin (not cached,
	score=1.351, required 6, autolearn=disabled, RDNS_NONE 1.27,
	TW_SV 0.08)
X-EBL-MailScanner-SpamScore: s
X-EBL-MailScanner-From: alexander@leidinger.net
X-EBL-MailScanner-Watermark: 1286272160.18906@ARPINEuCoVZS+Y+Z0zA5pA
X-EBL-Spam-Status: No
X-Mailman-Approved-At: Tue, 28 Sep 2010 11:19:10 +0000
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
	src-committers@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: svn commit: r212964 - head/sys/kern
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 28 Sep 2010 10:08:46 -0000

Quoting John Baldwin (from Mon, 27 Sep 2010 09:28:47 -0400):

>> savecore already has support for a 'minfree' file to prevent
>> crashdumps filling the crashdir.  Maybe the default install should
>> include a minfree set to (say) 512MB.
>
> The one problem with this approach is that it implements a FIFO instead
> of a LIFO.  I want the N most recent crashdumps to be saved, not the
> first N.

Check the size in the shell script beforehand and remove older ones
("ls -1t | grep pattern | tail +" gives you possible candidates).

Bye,
Alexander.

--
Applause, n.: The echo of a platitude from the mouth of a fool.
		-- Ambrose Bierce, "The Devil's Dictionary"

http://www.Leidinger.net    Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org       netchild @ FreeBSD.org  : PGP ID = 72077137

From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 15:26:05 2010
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0B05A10656D8;
	Tue, 28 Sep 2010 15:25:59 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 52B778FC1D;
	Tue, 28 Sep 2010 15:25:59 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 0431446B9B;
	Tue, 28 Sep 2010 11:25:59 -0400 (EDT)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id 2916F8A050;
	Tue, 28 Sep 2010 11:25:58 -0400 (EDT)
From: John Baldwin
To: Alexander Leidinger
Date: Tue, 28 Sep 2010 09:37:25 -0400
User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20100819; KDE/4.4.5; amd64; ; )
References: <201009211507.o8LF7iVv097676@svn.freebsd.org>
	<201009270928.47232.jhb@freebsd.org>
	<20100928114912.17443a2o7j71kpaw@webmail.leidinger.net>
In-Reply-To: <20100928114912.17443a2o7j71kpaw@webmail.leidinger.net>
MIME-Version: 1.0
Content-Type: Text/Plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <201009280937.25619.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Tue, 28 Sep 2010 11:25:58 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org,
	freebsd-arch@freebsd.org
Subject: Re: svn commit: r212964 - head/sys/kern
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 28 Sep 2010 15:26:05 -0000

On Tuesday, September 28, 2010 5:49:12 am Alexander Leidinger wrote:
> Quoting John Baldwin (from Mon, 27 Sep 2010 09:28:47 -0400):
>
> >> savecore already has support for a 'minfree' file to prevent
> >> crashdumps filling the crashdir.  Maybe the default install should
> >> include a minfree set to (say) 512MB.
> >
> > The one problem with this approach is that it implements a FIFO instead
> > of a LIFO.  I want the N most recent crashdumps to be saved, not the
> > first N.
>
> Check the size in the shell script beforehand and remove older ones
> ("ls -1t | grep pattern | tail +" gives you possible candidates).

Yes, but the point is that you want that logic in savecore as an
alternative to the current minfree logic.
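
For reference, a minimal sketch of the shell-side pruning suggested in
the quoted text: list the dumps newest first and delete everything past
the first N. It assumes savecore-style vmcore.N file names without
whitespace; the function name prune_dumps and its two arguments are
made up for illustration, and this is not something savecore does
itself.

```sh
#!/bin/sh
# Sketch only: keep the N most recently modified crash dumps in a directory.
# Assumes names of the form vmcore.N with no whitespace (hypothetical helper).

prune_dumps() {
    dir=$1      # crash directory, e.g. /var/crash
    keep=$2     # number of newest dumps to retain

    # ls -1t lists newest first; tail -n +K prints from line K onward,
    # so K = keep + 1 selects every dump older than the newest $keep.
    ( cd "$dir" || exit 1
      ls -1t | grep '^vmcore\.' | tail -n +"$((keep + 1))" |
      while read -r f; do
          rm -f -- "$f"
      done )
}
```

Calling "prune_dumps /var/crash 5" before savecore runs would leave the
five newest dumps in place. A fuller version would presumably remove
the matching info.N files as well, and moving the same policy into
savecore is the point being argued in this thread.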
--
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 15:49:43 2010
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3EF18106566B
	for ; Tue, 28 Sep 2010 15:49:43 +0000 (UTC)
	(envelope-from xcllnt@mac.com)
Received: from asmtpout030.mac.com (asmtpout030.mac.com [17.148.16.105])
	by mx1.freebsd.org (Postfix) with ESMTP id 262B68FC15
	for ; Tue, 28 Sep 2010 15:49:42 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36])
	by asmtp030.mac.com
	(Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit))
	with ESMTPSA id <0L9G00C3PRYGXAA0@asmtp030.mac.com>
	for freebsd-arch@freebsd.org; Tue, 28 Sep 2010 08:49:29 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
	ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
	reason=mlx engine=6.0.2-1004200000 definitions=main-1009280098
X-Proofpoint-Virus-Version: vendor=fsecure
	engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
	definitions=2010-09-28_10:2010-09-28, 2010-09-28, 1970-01-01 signatures=0
From: Marcel Moolenaar
In-reply-to:
Date: Tue, 28 Sep 2010 08:48:53 -0700
Message-id:
References:
To: "freebsd-arch@FreeBSD.org Arch"
X-Mailer: Apple Mail (2.1081)
Subject: Re: [patch] functional prototype of root mount enhancement
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 28 Sep 2010 15:49:43 -0000

On Sep 27, 2010, at 11:22 PM, Marcel Moolenaar wrote:

>
> The code has some debug output still, which is helpful to
> see what's going on internally.
> From a boot (with a
> /.mount.conf present on ufs:/dev/ad0s1a):

A more interesting example is using an ISO image as root that lives
on a UFS file system (in this case FreeBSD 8.1 livefs):

	:
Root mount waiting for: usbus1
ugen1.2: at usbus1
========
.onfail panic
.timeout 1
ufs:/dev/ad0s1a rw
.ask
========
Trying to mount root from ufs:/dev/ad0s1a [rw]...
XXX: vfs_mountroot_parse: error = 0, mpdevfs=0xc3fa4000, mp=0xc3fa3c94
========
.onfail continue
.md /livefs.iso
#ufs:/dev/da0a
.ask
========
md0 attached to /livefs.iso

Loader variables:
  vfs.root.mountfrom=ufs:/dev/ad0s1a
  vfs.root.mountfrom.options=rw

Manual root filesystem specification:
  : [options]
      Mount using filesystem
      and with the specified (optional) option list.

    eg. ufs:/dev/da0s1a
        cd9660:/dev/acd0 ro
          (which is equivalent to: mount -t cd9660 -o ro /dev/acd0 /)

  ?               List valid disk boot devices
                  Abort manual input

mountroot> ?

List of GEOM managed disk devices:
  da0p2 da0p1 da0 acd0 ad0s1a ad0s1 ad0

mountroot> .
mountroot> ?

List of GEOM managed disk devices:
  md0 da0p2 da0p1 da0 acd0 ad0s1a ad0s1 ad0

mountroot> cd9660:/dev/md#
Trying to mount root from cd9660:/dev/md0 []...
XXX: vfs_mountroot_parse: error = 0, mpdevfs=0xc3fa4000, mp=0xc3fa3a10 lock order reversal: 1st 0xc3e95270 isofs (isofs) @ /usr/src/sys/fs/cd9660/cd9660_vfsops.c:694 2nd 0xc3e959c4 ufs (ufs) @ /usr/src/sys/kern/vfs_subr.c:2221 KDB: stack backtrace: : # mount /dev/md0 on / (cd9660, local, read-only) /dev/ad0s1a on /mnt (ufs, local, read-only) devfs on /dev (devfs, local) /dev/md1 on /var (ufs, local) /dev/md2 on /tmp (ufs, local) (md1 & md2 are created by /etc/rc) -- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 17:31:26 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 00EC21065672 for ; Tue, 28 Sep 2010 17:31:26 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 9C0208FC1C for ; Tue, 28 Sep 2010 17:31:25 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.3/8.14.1) with ESMTP id o8SHQmKu092682; Tue, 28 Sep 2010 11:26:49 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Tue, 28 Sep 2010 11:27:01 -0600 (MDT) Message-Id: <20100928.112701.539398516089932776.imp@bsdimp.com> To: xcllnt@mac.com From: "M. Warner Losh" In-Reply-To: References: X-Mailer: Mew version 6.3 on Emacs 22.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: [patch] functional prototype of root mount enhancement X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2010 17:31:26 -0000 Hey Marcel, haven't had a chance to look through this in detail yet. 
One item that has always bugged me is why when we hit the prompt that has to be the end of discovery... Why can't we have a method to listen to new geom providers being advertised and then 'short circuit' the ask prompt if /dev/da0s1a or /dev/ufs/rootfs or whatever it originally wanted appears. Maybe this isn't .ask, but some other verb in your language? Warner From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 18:24:49 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9BD341065674 for ; Tue, 28 Sep 2010 18:24:49 +0000 (UTC) (envelope-from xcllnt@mac.com) Received: from asmtpout030.mac.com (asmtpout030.mac.com [17.148.16.105]) by mx1.freebsd.org (Postfix) with ESMTP id 822788FC12 for ; Tue, 28 Sep 2010 18:24:49 +0000 (UTC) MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; charset=us-ascii Received: from macbook-pro.jnpr.net (natint3.juniper.net [66.129.224.36]) by asmtp030.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit)) with ESMTPSA id <0L9G00L3NZ55IW70@asmtp030.mac.com> for freebsd-arch@freebsd.org; Tue, 28 Sep 2010 11:24:43 -0700 (PDT) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx engine=6.0.2-1004200000 definitions=main-1009280125 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.0.10011,1.0.148,0.0.0000 definitions=2010-09-28_11:2010-09-28, 2010-09-28, 1970-01-01 signatures=0 From: Marcel Moolenaar In-reply-to: <20100928.112701.539398516089932776.imp@bsdimp.com> Date: Tue, 28 Sep 2010 11:24:41 -0700 Message-id: <4E910770-812B-4F04-B026-E3DB5EDEE000@mac.com> References: <20100928.112701.539398516089932776.imp@bsdimp.com> To: "M. 
Warner Losh" X-Mailer: Apple Mail (2.1081) Cc: freebsd-arch@freebsd.org Subject: Re: [patch] functional prototype of root mount enhancement X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2010 18:24:49 -0000 On Sep 28, 2010, at 10:27 AM, M. Warner Losh wrote: > Hey Marcel, > > haven't had a chance to look through this in detail yet. One item > that has always bugged me is why when we hit the prompt that has to be > the end of discovery... Why can't we have a method to listen to new > geom providers being advertised and then 'short circuit' the ask > prompt if /dev/da0s1a or /dev/ufs/rootfs or whatever it originally > wanted appears. > > Maybe this isn't .ask, but some other verb in your language? Hmmm... I think we should give .ask an option so that it can be made conditional upon a key press then. I don't think it's nice to print all that stuff, present a prompt, wait for input and then shortly after continue booting anyway because some device showed up. Say we have ".ask on-key-press", which basically nullifies the .ask directive (by implicitly failing to mount) unless a key was pressed. At that time we actually print the help, show a prompt and wait for input. This in combination with ".onfail retry" allows us to cycle through the alternatives until 1) a key was pressed and we'll drop at the interactive mount prompt or 2) a device we've been waiting for appears and we can mount root. Would that address your case? Another feature we may need is the alternative: if you boot with -C, we'll try cd9660:/dev/cd0 and cd9660:/dev/acd0. What we really want to do is: .select /dev/cd0 /dev/acd0 cd9660:%selected% ... 
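A rough userland sketch of how such a .select could behave, assuming (this is not settled anywhere in the thread) that it picks the first listed device that exists and substitutes it for %selected%. The names select_first() and expand_selected() are hypothetical, and access(2) merely stands in for whatever check the kernel would do against GEOM providers:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical: return the first candidate path that exists, or NULL. */
static const char *
select_first(const char *const *cands, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (access(cands[i], F_OK) == 0)
			return (cands[i]);
	return (NULL);
}

/*
 * Hypothetical: copy tmpl into out, replacing the first "%selected%"
 * with sel.  A template without the token is copied unchanged.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int
expand_selected(const char *tmpl, const char *sel, char *out, size_t outlen)
{
	static const char token[] = "%selected%";
	const char *p;
	int n;

	assert(tmpl != NULL && sel != NULL && out != NULL);
	p = strstr(tmpl, token);
	if (p == NULL)
		n = snprintf(out, outlen, "%s", tmpl);
	else
		n = snprintf(out, outlen, "%.*s%s%s", (int)(p - tmpl), tmpl,
		    sel, p + strlen(token));
	return ((n < 0 || (size_t)n >= outlen) ? -1 : 0);
}
```

With the candidates /dev/cd0 and /dev/acd0 and the template cd9660:%selected%, the result would be whichever of the two mount specifications names a device that is actually present.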
-- Marcel Moolenaar xcllnt@mac.com From owner-freebsd-arch@FreeBSD.ORG Wed Sep 29 14:09:47 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9F7451065670 for ; Wed, 29 Sep 2010 14:09:47 +0000 (UTC) (envelope-from gonzo@launchpad.bluezbox.com) Received: from launchpad.bluezbox.com (hq.bluezbox.com [70.38.37.145]) by mx1.freebsd.org (Postfix) with ESMTP id 50EBF8FC15 for ; Wed, 29 Sep 2010 14:09:46 +0000 (UTC) Received: from [24.87.53.93] (helo=[192.168.1.116]) by launchpad.bluezbox.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.71 (FreeBSD)) (envelope-from ) id 1Ozxc2-000OKi-5P; Sun, 26 Sep 2010 13:14:18 -0700 Mime-Version: 1.0 (Apple Message framework v1081) Content-Type: text/plain; charset=us-ascii From: Oleksandr Tymoshenko In-Reply-To: Date: Sun, 26 Sep 2010 13:14:17 -0700 Content-Transfer-Encoding: 7bit Message-Id: <94219799-34FF-4210-B816-6A5B6F5DBC2C@bluezbox.com> References: To: Paketix X-Mailer: Apple Mail (2.1081) Sender: gonzo@launchpad.bluezbox.com X-Spam-Level: --- X-Spam-Report: Spam detection software, running on the system "hq.bluezbox.com", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see The administrator of that system for details. Content preview: On 2010-09-26, at 4:13 AM, Paketix wrote: > there is a rather new processor from TILERA (100 core chip) which is > most certainly already known here at FreeBSD mailing list. 
> [http://www.tilera.com/products/processors/TILE-Gx_Family] > the processor/platform is targeted towards: > - high performance network security platforms > - firewalling/vpn > - utm > - l7 deep packet inspection > - network monitoring and forensics > - cloud computing > - web application (lamp) > - data caching (memcached) > - database applications > - high-performance computing > > chris metcalf from TILERA did the current linux port and i was in > contact with him about two weeks ago. > at this time QUANTA computer is starting to offer a 512 core 2U box > with an impressive performance/watt ratio (400 watts only for 512 > cores). > [http://www.tilera.com/solutions/cloud_computing] > > i guess those massive multicore chips would enable bleeding edge > high performance solutions based on FreeBSD. > > well... > - anyone interested in porting FreeBSD towards TILERA? > (architecture seems to be similar to MIPS...) Architecture/hardware looks really high end. I think there are several people among FreeBSD developers who would like to get their hands on this kind of technology. [...] Content analysis details: (-3.1 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -1.8 ALL_TRUSTED Passed through trusted hosts only via SMTP -2.6 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] 1.3 AWL AWL: From: address is in the auto white-list Cc: freebsd-arch@freebsd.org Subject: Re: Porting effort towards TILERA massive multicore CPUs...? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Sep 2010 14:09:47 -0000 On 2010-09-26, at 4:13 AM, Paketix wrote: > there is a rather new processor from TILERA (100 core chip) which is > most certainly already known here at FreeBSD mailing list. 
> [http://www.tilera.com/products/processors/TILE-Gx_Family] > the processor/platform is targeted towards: > - high performance network security platforms > - firewalling/vpn > - utm > - l7 deep packet inspection > - network monitoring and forensics > - cloud computing > - web application (lamp) > - data caching (memcached) > - database applications > - high-performance computing > > chris metcalf from TILERA did the current linux port and i was in > contact with him about two weeks ago. > at this time QUANTA computer is starting to offer a 512 core 2U box > with an impressive performance/watt ratio (400 watts only for 512 > cores). > [http://www.tilera.com/solutions/cloud_computing] > > i guess those massive multicore chips would enable bleeding edge > high performance solutions based on FreeBSD. > > well... > - anyone interested in porting FreeBSD towards TILERA? > (architecture seems to be similar to MIPS...) Architecture/hardware looks really high end. I think there are several people among FreeBSD developers who would like to get their hands on this kind of technology. > - is there already some ongoing porting effort? Not that I know of. > - porting for this chip already discussed in this mailing list? 
AFAIR - nope From owner-freebsd-arch@FreeBSD.ORG Thu Sep 30 10:05:41 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A1604106566B for ; Thu, 30 Sep 2010 10:05:41 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 7E2228FC13 for ; Thu, 30 Sep 2010 10:05:41 +0000 (UTC) Received: from fledge.watson.org (fledge.watson.org [65.122.17.41]) by cyrus.watson.org (Postfix) with ESMTPS id 1B5C046B82; Thu, 30 Sep 2010 06:05:41 -0400 (EDT) Date: Thu, 30 Sep 2010 11:05:40 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Paketix In-Reply-To: Message-ID: References: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Porting effort towards TILERA massive multicore CPUs...? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Sep 2010 10:05:41 -0000 On Sun, 26 Sep 2010, Paketix wrote: > there is a rather new processor from TILERA (100 core chip) which is > most certainly already known here at FreeBSD mailing list. Theory has it I'll be getting access to Intel SCC 48/96-core hardware here at Cambridge in the moderately near future, and I've been pondering what would be involved. Their model involves 48+ x86 cores without cache coherency, so you need separate OS instances for each. However, the cores are linked by fifo-like memory that we'll need to figure out what to do with. I assume Tilera has some similar sort of message-passing feature? 
Robert > [http://www.tilera.com/products/processors/TILE-Gx_Family] > the processor/platform is targeted towards: > - high performance network security platforms > - firewalling/vpn > - utm > - l7 deep packet inspection > - network monitoring and forensics > - cloud computing > - web application (lamp) > - data caching (memcached) > - database applications > - high-performance computing > > chris metcalf from TILERA did the current linux port and i was in > contact with him about two weeks ago. > at this time QUANTA computer is starting to offer a 512 core 2U box > with an impressive performance/watt ratio (400 watts only for 512 > cores). > [http://www.tilera.com/solutions/cloud_computing] > > i guess those massive multicore chips would enable bleeding edge > high performance solutions based on FreeBSD. > > well... > - anyone interested in porting FreeBSD towards TILERA? > (architecture seems to be similar to MIPS...) > - is there already some ongoing porting effort? > - porting for this chip already discussed in this mailing list? > > many thx > /pat > > some links for those who want some more details: > company homepage: > http://www.tilera.com/ > 64core processor: > http://www.tilera.com/products/processors/TILEPRO64 > 100core processor with hardware packet (pre)processing > http://www.tilera.com/products/processors/TILE-Gx_Family > sample architecture for network appliances: > http://www.tilera.com/solutions/networking/network_security_appliances > 512core system from QUANTA computer inc. 
(available Q4-10/Q1-11): > http://www.tilera.com/solutions/cloud_computing > development system from TILERA: > http://www.tilera.com/products/platforms/TILEmpower_platform > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Thu Sep 30 10:44:30 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 74A35106566C for ; Thu, 30 Sep 2010 10:44:30 +0000 (UTC) (envelope-from paketix@bluewin.ch) Received: from mail31.bluewin.ch (mail31.bluewin.ch [195.186.18.72]) by mx1.freebsd.org (Postfix) with ESMTP id 0CC4A8FC0A for ; Thu, 30 Sep 2010 10:44:29 +0000 (UTC) Received: from [195.186.18.83] ([195.186.18.83:55628] helo=tr15.bluewin.ch) by mail31.bluewin.ch (envelope-from ) (ecelerity 2.2.2.45 r()) with ESMTP id D9/FE-19667-C0A64AC4; Thu, 30 Sep 2010 10:44:28 +0000 Received: from [10.21.20.106] (194.209.131.192) by tr15.bluewin.ch (The Blue Window 8.5.119.018.5.119.01) (authenticated as paketix@bluewin.ch) id 4C69210201AB6AB2; Thu, 30 Sep 2010 10:44:28 +0000 References: In-Reply-To: Mime-Version: 1.0 (iPhone Mail 8B117) Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii Message-Id: <6DD4F31E-93F7-4D80-AAB8-86E69FE5D9E5@bluewin.ch> X-Mailer: iPhone Mail (8B117) From: Paketix Date: Thu, 30 Sep 2010 12:44:18 +0200 To: Robert Watson Cc: "freebsd-arch@freebsd.org" Subject: Re: Porting effort towards TILERA massive multicore CPUs...? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Sep 2010 10:44:30 -0000 do not know all the details yet but tileGX features (incomplete list): - DCC fully coherent cache - mPipe wire speed pkt processing engine - on chip encryption/compression engines - fast on chip mesh interconnect - 2x40G interlaken or 8x10G ... for more details see: tilera.com/products/processors/TILE-Gx-Family BR /pat Sent from Pat's iPhone On 30.09.2010, at 12:05, Robert Watson wrote: > > On Sun, 26 Sep 2010, Paketix wrote: > >> there is a rather new processor from TILERA (100 core chip) which is >> most certainly already known here at FreeBSD mailing list. > > Theory has it I'll be getting access to Intel SCC 48/96-core hardware here at Cambridge in the moderately near future, and I've been pondering what would be involved. Their model involves 48+ x86 cores without cache coherency, so you need separate OS instances for each. However, the cores are linked by fifo-like memory that we'll need to figure out what to do with. I assume Tilera has some similar sort of message-passing feature? > > Robert > >> [http://www.tilera.com/products/processors/TILE-Gx_Family] >> the processor/platform is targeted towards: >> - high performance network security platforms >> - firewalling/vpn >> - utm >> - l7 deep packet inspection >> - network monitoring and forensics >> - cloud computing >> - web application (lamp) >> - data caching (memcached) >> - database applications >> - high-performance computing >> >> chris metcalf from TILERA did the current linux port and i was in >> contact with him about two weeks ago. >> at this time QUANTA computer is starting to offer a 512 core 2U box >> with an impressive performance/watt ratio (400 watts only for 512 >> cores).
>> [http://www.tilera.com/solutions/cloud_computing] >> >> i guess those massive multicore chips would enable bleeding edge >> high performance solutions based on FreeBSD. >> >> well... >> - anyone interested in porting FreeBSD towards TILERA? >> (architecture seems to be similar to MIPS...) >> - is there already some ongoing porting effort? >> - porting for this chip already discussed in this mailing list? >> >> many thx >> /pat >> >> some links for those who want some more details: >> company homepage: >> http://www.tilera.com/ >> 64core processor: >> http://www.tilera.com/products/processors/TILEPRO64 >> 100core processor with hardware packet (pre)processing >> http://www.tilera.com/products/processors/TILE-Gx_Family >> sample architecture for network appliances: >> http://www.tilera.com/solutions/networking/network_security_appliances >> 512core system from QUANTA computer inc. (available Q4-10/Q1-11): >> http://www.tilera.com/solutions/cloud_computing >> development system from TILERA: >> http://www.tilera.com/products/platforms/TILEmpower_platform >> _______________________________________________ >> freebsd-arch@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-arch >> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" >> From owner-freebsd-arch@FreeBSD.ORG Thu Sep 30 16:15:05 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5E8DF106564A for ; Thu, 30 Sep 2010 16:15:05 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from out-0.mx.aerioconnect.net (out-0-24.mx.aerioconnect.net [216.240.47.84]) by mx1.freebsd.org (Postfix) with ESMTP id 41F318FC13 for ; Thu, 30 Sep 2010 16:15:05 +0000 (UTC) Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160]) by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id o8UFqQrS006736 for ; Thu, 30 Sep 2010 08:52:26 -0700
X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137]) by idiom.com (Postfix) with ESMTP id D57032D6017 for ; Thu, 30 Sep 2010 08:52:25 -0700 (PDT) Message-ID: <4CA4B264.4000601@freebsd.org> Date: Thu, 30 Sep 2010 08:53:08 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51 Subject: Re: Porting effort towards TILERA massive multicore CPUs...? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Sep 2010 16:15:05 -0000 On 9/30/10 3:05 AM, Robert Watson wrote: > > On Sun, 26 Sep 2010, Paketix wrote: > >> there is a rather new processor from TILERA (100 core chip) which is >> most certainly already known here at FreeBSD mailing list. > > Theory has it I'll be getting access to Intel SCC 48/96-core > hardware here at Cambridge in the moderately near future, and I've > been pondering what would be involved. Their model involves 48+ x86 > cores without cache coherency, so you need separate OS instances for > each. However, the cores are linked by fifo-like memory that we'll > need to figure out what to do with. I assume Tilera has some > similar sort of message-passing feature? > > Robert > hmm echoes of 'transputer'? I believe there is an occam compiler that runs on FreeBSD. 
From owner-freebsd-arch@FreeBSD.ORG Thu Sep 30 16:16:40 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C6EB51065672; Thu, 30 Sep 2010 16:16:40 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from out-0.mx.aerioconnect.net (out-0-24.mx.aerioconnect.net [216.240.47.84]) by mx1.freebsd.org (Postfix) with ESMTP id 7CBC18FC12; Thu, 30 Sep 2010 16:16:40 +0000 (UTC) Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160]) by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id o8UFscRi006790; Thu, 30 Sep 2010 08:54:38 -0700 X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137]) by idiom.com (Postfix) with ESMTP id 0AD322D6021; Thu, 30 Sep 2010 08:54:36 -0700 (PDT) Message-ID: <4CA4B2E7.1@freebsd.org> Date: Thu, 30 Sep 2010 08:55:19 -0700 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: Paketix References: <6DD4F31E-93F7-4D80-AAB8-86E69FE5D9E5@bluewin.ch> In-Reply-To: <6DD4F31E-93F7-4D80-AAB8-86E69FE5D9E5@bluewin.ch> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51 Cc: Robert Watson , "freebsd-arch@freebsd.org" Subject: Re: Porting effort towards TILERA massive multicore CPUs...? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Sep 2010 16:16:41 -0000 On 9/30/10 3:44 AM, Paketix wrote: > do not know all the details yet > but tileGX features (incomplete list): > - DCC fully coherent cache > - mPipe wire speed pkt processing engine > - on chip encryption/compression engines > - fast on chip mesh interconnect > - 2x40G interlaken or 8x10G > ... > > for more details see: > tilera.com/products/processors/TILE-Gx-Family http://www.tilera.com/products/processors/TILE-Gx_Family > BR > /pat > > Sent from Pat's iPhone > > On 30.09.2010, at 12:05, Robert Watson wrote: > >> On Sun, 26 Sep 2010, Paketix wrote: >> >>> there is a rather new processor from TILERA (100 core chip) which is >>> most certainly already known here at FreeBSD mailing list. >> Theory has it I'll be getting access to Intel SCC 48/96-core hardware here at Cambridge in the moderately near future, and I've been pondering what would be involved. Their model involves 48+ x86 cores without cache coherency, so you need separate OS instances for each. However, the cores are linked by fifo-like memory that we'll need to figure out what to do with. I assume Tilera has some similar sort of message-passing feature? >> >> Robert >> >>> [http://www.tilera.com/products/processors/TILE-Gx_Family] >>> the processor/platform is targeted towards: >>> - high performance network security platforms >>> - firewalling/vpn >>> - utm >>> - l7 deep packet inspection >>> - network monitoring and forensics >>> - cloud computing >>> - web application (lamp) >>> - data caching (memcached) >>> - database applications >>> - high-performance computing >>> >>> chris metcalf from TILERA did the current linux port and i was in >>> contact with him about two weeks ago. 
>>> at this time QUANTA computer is starting to offer a 512 core 2U box >>> with an impressive performance/watt ratio (400 watts only for 512 >>> cores). >>> [http://www.tilera.com/solutions/cloud_computing] >>> >>> i guess those massive multicore chips would enable bleeding edge >>> high performance solutions based on FreeBSD. >>> >>> well... >>> - anyone interested in porting FreeBSD towards TILERA? >>> (architecture seems to be similar to MIPS...) >>> - is there already some ongoing porting effort? >>> - porting for this chip already discussed in this mailing list? >>> >>> many thx >>> /pat >>> >>> some links for those who want some more details: >>> company homepage: >>> http://www.tilera.com/ >>> 64core processor: >>> http://www.tilera.com/products/processors/TILEPRO64 >>> 100core processor with hardware packet (pre)processing >>> http://www.tilera.com/products/processors/TILE-Gx_Family >>> sample architecture for network appliances: >>> http://www.tilera.com/solutions/networking/network_security_appliances >>> 512core system from QUANTA computer inc. 
(available Q4-10/Q1-11): >>> http://www.tilera.com/solutions/cloud_computing >>> development system from TILERA: >>> http://www.tilera.com/products/platforms/TILEmpower_platform >>> _______________________________________________ >>> freebsd-arch@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-arch >>> To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" >>> > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Fri Oct 1 05:09:03 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0AD4810675F2; Fri, 1 Oct 2010 05:08:45 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-iw0-f182.google.com (mail-iw0-f182.google.com [209.85.214.182]) by mx1.freebsd.org (Postfix) with ESMTP id ABB3F8FC19; Fri, 1 Oct 2010 05:08:44 +0000 (UTC) Received: by iwn34 with SMTP id 34so4137114iwn.13 for ; Thu, 30 Sep 2010 22:08:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:sender:received :in-reply-to:references:date:x-google-sender-auth:message-id:subject :from:to:cc:content-type:content-transfer-encoding; bh=AyKap6HXYH1dkkeZPcUJ9wplcSjprWw2IfMr/k+QoiQ=; b=lY25nncFFlnHsmXdVhMXPEvmDDKLRAxuWrYpVY5ggjbeqDB6iUVaPIgDmyP26xe10x EMcfbDkoPCbzAJrwiGSgVsw02kg8rUzxrAgVwLDI39xCpmR8drPUnx59iijckmlzi2w0 3Al8o3rivQIMhqw3jk2oG/I1TkhukBdjVV484= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; b=F7n/ACzlR65xaewwBgcUAlrxqgACQuzikLAEpUD/JKFVJWYuRpz+TV8toLNuh4RKU9 
mjMZKLVi/GwsMA53DTfvAjwKjo8wH1eV7zHfsaDPshPp0/TemQCwxjsODbjkDQdp/oH8 qYVszMufJ8AZDd7ySVHf1wUfS7kiFMwvwxUdA= MIME-Version: 1.0 Received: by 10.231.144.74 with SMTP id y10mr5037675ibu.65.1285908139888; Thu, 30 Sep 2010 21:42:19 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.231.171.203 with HTTP; Thu, 30 Sep 2010 21:42:19 -0700 (PDT) In-Reply-To: <4CA4B264.4000601@freebsd.org> References: <4CA4B264.4000601@freebsd.org> Date: Fri, 1 Oct 2010 12:42:19 +0800 X-Google-Sender-Auth: W4E2TSphFsAFA0L15y8ZJZ1BX3M Message-ID: From: Adrian Chadd To: Julian Elischer Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: freebsd-arch@freebsd.org Subject: Re: Porting effort towards TILERA massive multicore CPUs...? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Oct 2010 05:09:03 -0000 On 30 September 2010 23:53, Julian Elischer wrote: > hmm echoes of 'transputer'? I believe there is an occam compiler that > runs on FreeBSD. Google XMOS. I've been trying very hard to not buy some of this until -after- i finish my degree. (but I do have an ISA Transputer board at home.
:-) Adrian From owner-freebsd-arch@FreeBSD.ORG Sat Oct 2 08:14:07 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 92BF81065672 for ; Sat, 2 Oct 2010 08:14:07 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (adsl-75-1-14-242.dsl.scrm01.sbcglobal.net [75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 6566A8FC08 for ; Sat, 2 Oct 2010 08:14:07 +0000 (UTC) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id o927f7FJ056708 for ; Sat, 2 Oct 2010 00:41:11 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <201010020741.o927f7FJ056708@gw.catspoiler.org> Date: Sat, 2 Oct 2010 00:41:07 -0700 (PDT) From: Don Lewis To: arch@FreeBSD.org MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Cc: Subject: "process slock" vs. "scrlock" lock order X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 08:14:07 -0000 The hard coded lock order list in subr_witness.c has "scrlock" listed before "process slock". This causes a lock order reversal when calcru1(), which requires "process slock" to be held, calls printf() to report unexpected runtime problems. The call to printf() eventually gets into the console code which locks "scrlock". This normally isn't noticed because both of these are spin locks, and hardly anyone uses WITNESS without disabling the checking of spinlocks with WITNESS_SKIPSPIN. If spin lock checking is not disabled, the result is a silent reset because witness catches the LOR, which recurses into printf(), which ends up causing a panic in cnputs(). 
One obvious fix would be to move "scrlock" to a later spot in the list, but I suspect the same problem could occur with the "sio" or "uart" locks if a serial console is being used. It might not be possible to fix them the same way because there might be cases where they are in the input path and get locked before "process slock" or other spin locks that can be held when calling printf(). Another fix for this particular case would be to rearrange the code in calcru1() so that the calls to printf() occur after ruxp->rux_* are updated and where I assume it would be safe to temporarily drop "process slock" for the duration of the printf() calls. Thoughts? From owner-freebsd-arch@FreeBSD.ORG Sat Oct 2 10:03:03 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1EB1C1065675; Sat, 2 Oct 2010 10:03:03 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 6C1E18FC0C; Sat, 2 Oct 2010 10:03:02 +0000 (UTC) Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id o929cDYR029797 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 2 Oct 2010 12:38:13 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4) with ESMTP id o929cDbd018516; Sat, 2 Oct 2010 12:38:13 +0300 (EEST) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.4/8.14.4/Submit) id o929cDU8018515; Sat, 2 Oct 2010 12:38:13 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 2 Oct 2010 12:38:13 +0300 From: Kostik Belousov To: Don Lewis 
Message-ID: <20101002093813.GC2392@deviant.kiev.zoral.com.ua> References: <201010020741.o927f7FJ056708@gw.catspoiler.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="HG+GLK89HZ1zG0kk" Content-Disposition: inline In-Reply-To: <201010020741.o927f7FJ056708@gw.catspoiler.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-2.1 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_50, DNS_FROM_OPENWHOIS autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: arch@freebsd.org Subject: Re: "process slock" vs. "scrlock" lock order X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 10:03:03 -0000 --HG+GLK89HZ1zG0kk Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Oct 02, 2010 at 12:41:07AM -0700, Don Lewis wrote: > The hard coded lock order list in subr_witness.c has "scrlock" listed > before "process slock". This causes a lock order reversal when > calcru1(), which requires "process slock" to be held, calls printf() to > report unexpected runtime problems. The call to printf() eventually > gets into the console code which locks "scrlock". This normally isn't > noticed because both of these are spin locks, and hardly anyone uses > WITNESS without disabling the checking of spinlocks with > WITNESS_SKIPSPIN. If spin lock checking is not disabled, the result is > a silent reset because witness catches the LOR, which recurses into > printf(), which ends up causing a panic in cnputs(). 
>
> One obvious fix would be to move "scrlock" to a later spot in the list,
> but I suspect the same problem could occur with the "sio" or "uart"
> locks if a serial console is being used.  It might not be possible to
> fix them the same way because there might be cases where they are in the
> input path and get locked before "process slock" or other spin locks
> that can be held when calling printf().
>
> Another fix for this particular case would be to rearrange the code in
> calcru1() so that the calls to printf() occur after ruxp->rux_* are
> updated and where I assume it would be safe to temporarily drop "process
> slock" for the duration of the printf() calls.
>
> Thoughts?

Yes, printing from under a spinlock is somewhat epidemic. Moving the
printf out of process slock looks like the right solution. On the other
hand, all calcru() callers unlock slock immediately after calcru(), and
calcru1() is sometimes only called with thread lock held, not process
slock.

I propose the following refinement; it does not need to relock process
slock at all. Let's drop slock in calcru(), and do the necessary print
after that. No need to reacquire the slock.
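As a rough userland illustration of that idea (a sketch only, not part
of the patch that follows): the helper that runs under the lock records
what it would have printed, and the caller prints after dropping the
lock. Everything here is hypothetical: the acct structure, the function
names and the C11 atomic_flag spin lock merely stand in for the kernel's
proc structure and "process slock". An explicit boolean marks whether
the warning fired, so the check does not depend on the sign of the
stored value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct acct {
	atomic_flag	lock;		/* plays the role of "process slock" */
	int64_t		last_usec;	/* last runtime that was accepted */
};

/* Filled in under the lock, examined only after the lock is dropped. */
struct acct_warn {
	bool	went_backwards;
	int64_t	old_usec;
	int64_t	new_usec;
};

static void
acct_lock(struct acct *a)
{
	while (atomic_flag_test_and_set_explicit(&a->lock,
	    memory_order_acquire))
		;	/* spin */
}

static void
acct_unlock(struct acct *a)
{
	atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

/* Runs with the lock held: record the anomaly instead of printing it. */
static void
acct_update_locked(struct acct *a, int64_t usec, struct acct_warn *w)
{
	assert(w != NULL);
	if (usec < a->last_usec) {
		w->went_backwards = true;
		w->old_usec = a->last_usec;
		w->new_usec = usec;
		return;		/* this sketch simply keeps the old value */
	}
	a->last_usec = usec;
}

/* Returns the number of warnings printed (0 or 1). */
static int
acct_update(struct acct *a, int64_t usec)
{
	struct acct_warn w = { false, 0, 0 };

	acct_lock(a);
	acct_update_locked(a, usec, &w);
	acct_unlock(a);

	/* The lock is gone, so no console lock is ever taken under it. */
	if (w.went_backwards) {
		printf("runtime went backwards from %jd usec to %jd usec\n",
		    (intmax_t)w.old_usec, (intmax_t)w.new_usec);
		return (1);
	}
	return (0);
}
```

The cost of this shape is one small on-stack record per call, and the
message is printed slightly later than the event it reports.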
diff --git a/sys/compat/linux/linux_misc.c b/sys/compat/linux/linux_misc.c
index d2cf6b6..6a599f6 100644
--- a/sys/compat/linux/linux_misc.c
+++ b/sys/compat/linux/linux_misc.c
@@ -691,7 +691,6 @@ linux_times(struct thread *td, struct linux_times_args *args)
 		PROC_LOCK(p);
 		PROC_SLOCK(p);
 		calcru(p, &utime, &stime);
-		PROC_SUNLOCK(p);
 		calccru(p, &cutime, &cstime);
 		PROC_UNLOCK(p);
 
diff --git a/sys/compat/svr4/svr4_misc.c b/sys/compat/svr4/svr4_misc.c
index 6f80fe6..554eb44 100644
--- a/sys/compat/svr4/svr4_misc.c
+++ b/sys/compat/svr4/svr4_misc.c
@@ -865,7 +865,6 @@ svr4_sys_times(td, uap)
 	PROC_LOCK(p);
 	PROC_SLOCK(p);
 	calcru(p, &utime, &stime);
-	PROC_SUNLOCK(p);
 	calccru(p, &cutime, &cstime);
 	PROC_UNLOCK(p);
 
@@ -1278,7 +1277,6 @@ loop:
 			ru = p->p_ru;
 			PROC_SLOCK(p);
 			calcru(p, &ru.ru_utime, &ru.ru_stime);
-			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 			sx_sunlock(&proctree_lock);
 
@@ -1305,7 +1303,6 @@ loop:
 			ru = p->p_ru;
 			PROC_SLOCK(p);
 			calcru(p, &ru.ru_utime, &ru.ru_stime);
-			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 
 			if (((uap->options & SVR4_WNOWAIT)) == 0) {
@@ -1329,7 +1326,6 @@ loop:
 			status = SIGCONT;
 			PROC_SLOCK(p);
 			calcru(p, &ru.ru_utime, &ru.ru_stime);
-			PROC_SUNLOCK(p);
 			PROC_UNLOCK(p);
 
 			if (((uap->options & SVR4_WNOWAIT)) == 0) {
diff --git a/sys/fs/procfs/procfs_status.c b/sys/fs/procfs/procfs_status.c
index 7850504..12f08f6 100644
--- a/sys/fs/procfs/procfs_status.c
+++ b/sys/fs/procfs/procfs_status.c
@@ -125,7 +125,6 @@ procfs_doprocstatus(PFS_FILL_ARGS)
 
 		PROC_SLOCK(p);
 		calcru(p, &ut, &st);
-		PROC_SUNLOCK(p);
 		start = p->p_stats->p_start;
 		timevaladd(&start, &boottime);
 		sbuf_printf(sb, " %jd,%ld %jd,%ld %jd,%ld",
diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
index 8358f75..7819d7b 100644
--- a/sys/kern/kern_exit.c
+++ b/sys/kern/kern_exit.c
@@ -703,8 +703,8 @@ proc_reap(struct thread *td, struct proc *p, int *status, int options,
 	if (rusage) {
 		*rusage = p->p_ru;
 		calcru(p, &rusage->ru_utime, &rusage->ru_stime);
-	}
-	PROC_SUNLOCK(p);
+	} else
+		PROC_SUNLOCK(p);
 	td->td_retval[0] = p->p_pid;
 	if (status)
 		*status = p->p_xstat;	/* convert to int */
diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c
index 4899946..fb0be15 100644
--- a/sys/kern/kern_proc.c
+++ b/sys/kern/kern_proc.c
@@ -783,7 +783,6 @@ fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp)
 		timevaladd(&kp->ki_start, &boottime);
 		PROC_SLOCK(p);
 		calcru(p, &kp->ki_rusage.ru_utime, &kp->ki_rusage.ru_stime);
-		PROC_SUNLOCK(p);
 		calccru(p, &kp->ki_childutime, &kp->ki_childstime);
 
 		/* Some callers want child-times in a single value */
diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c
index ec2d6b6..13cc50c 100644
--- a/sys/kern/kern_resource.c
+++ b/sys/kern/kern_resource.c
@@ -72,8 +72,15 @@ static struct rwlock uihashtbl_lock;
 static LIST_HEAD(uihashhead, uidinfo) *uihashtbl;
 static u_long uihash;		/* size of hash table - 1 */
 
+struct calcru1_warn {
+	int64_t neg_runtime;
+	int64_t new_runtime;
+	int64_t old_runtime;
+};
+
 static void	calcru1(struct proc *p, struct rusage_ext *ruxp,
-		    struct timeval *up, struct timeval *sp);
+		    struct timeval *up, struct timeval *sp,
+		    struct calcru1_warn *w);
 static int	donice(struct thread *td, struct proc *chgp, int n);
 static struct uidinfo *uilookup(uid_t uid);
 static void	ruxagg_locked(struct rusage_ext *rux, struct thread *td);
@@ -797,6 +804,20 @@ getrlimit(td, uap)
 	return (error);
 }
 
+static void
+print_calcru1_warn(struct proc *p, const struct calcru1_warn *w)
+{
+
+	if (w->neg_runtime > 0)
+		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
+		    (intmax_t)w->neg_runtime, p->p_pid, p->p_comm);
+	if (w->new_runtime > 0)
+		printf("calcru: runtime went backwards from %ju usec "
+		    "to %ju usec for pid %d (%s)\n",
+		    (uintmax_t)w->old_runtime, (uintmax_t)w->new_runtime,
+		    p->p_pid, p->p_comm);
+}
+
 /*
  * Transform the running time and tick information for children of proc p
  * into user and system time usage.
@@ -807,24 +828,33 @@ calccru(p, up, sp)
 	struct timeval *up;
 	struct timeval *sp;
 {
+	struct calcru1_warn w;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
-	calcru1(p, &p->p_crux, up, sp);
+	bzero(&w, sizeof(w));
+	calcru1(p, &p->p_crux, up, sp, &w);
+	print_calcru1_warn(p, &w);
 }
 
 /*
  * Transform the running time and tick information in proc p into user
  * and system time usage.  If appropriate, include the current time slice
  * on this CPU.
+ *
+ * The process slock shall be locked on entry, and it is unlocked
+ * after function returned.
  */
 void
 calcru(struct proc *p, struct timeval *up, struct timeval *sp)
 {
 	struct thread *td;
 	uint64_t u;
+	struct calcru1_warn w;
 
 	PROC_LOCK_ASSERT(p, MA_OWNED);
 	PROC_SLOCK_ASSERT(p, MA_OWNED);
+
+	bzero(&w, sizeof(w));
 	/*
 	 * If we are getting stats for the current process, then add in the
 	 * stats that this thread has accumulated in its current time slice.
@@ -843,12 +873,14 @@ calcru(struct proc *p, struct timeval *up, struct timeval *sp)
 			continue;
 		ruxagg(p, td);
 	}
-	calcru1(p, &p->p_rux, up, sp);
+	calcru1(p, &p->p_rux, up, sp, &w);
+	PROC_SUNLOCK(p);
+	print_calcru1_warn(p, &w);
 }
 
 static void
 calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
-    struct timeval *sp)
+    struct timeval *sp, struct calcru1_warn *w)
 {
 	/* {user, system, interrupt, total} {ticks, usec}: */
 	uint64_t ut, uu, st, su, it, tt, tu;
@@ -865,8 +897,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
 	tu = cputick2usec(ruxp->rux_runtime);
 	if ((int64_t)tu < 0) {
 		/* XXX: this should be an assert /phk */
-		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
-		    (intmax_t)tu, p->p_pid, p->p_comm);
+		w->neg_runtime = tu;
 		tu = ruxp->rux_tu;
 	}
 
@@ -903,10 +934,8 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
 		 * serious, so lets keep it and hope laptops can be made
 		 * more truthful about their CPU speed via ACPI.
 		 */
-		printf("calcru: runtime went backwards from %ju usec "
-		    "to %ju usec for pid %d (%s)\n",
-		    (uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
-		    p->p_pid, p->p_comm);
+		w->new_runtime = tu;
+		w->old_runtime = ruxp->rux_tu;
 		uu = (tu * ut) / tt;
 		su = (tu * st) / tt;
 	}
@@ -946,6 +975,7 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
 {
 	struct proc *p;
 	int error;
+	struct calcru1_warn w;
 
 	error = 0;
 	p = td->td_proc;
@@ -962,13 +992,15 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
 		break;
 
 	case RUSAGE_THREAD:
+		bzero(&w, sizeof(w));
 		PROC_SLOCK(p);
 		ruxagg(p, td);
 		PROC_SUNLOCK(p);
 		thread_lock(td);
 		*rup = td->td_ru;
-		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime);
+		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime, &w);
 		thread_unlock(td);
+		print_calcru1_warn(p, &w);
 		break;
 
 	default:
@@ -1069,7 +1101,6 @@ rufetchcalc(struct proc *p, struct rusage *ru, struct timeval *up,
 	PROC_SLOCK(p);
 	rufetch(p, ru);
 	calcru(p, up, sp);
-	PROC_SUNLOCK(p);
 }
 
 /*
diff --git a/sys/kern/kern_time.c b/sys/kern/kern_time.c
index 3aea2bd..d603958 100644
--- a/sys/kern/kern_time.c
+++ b/sys/kern/kern_time.c
@@ -204,7 +204,6 @@ kern_clock_gettime(struct thread *td, clockid_t clock_id, struct timespec *ats)
 		PROC_LOCK(p);
 		PROC_SLOCK(p);
 		calcru(p, &user, &sys);
-		PROC_SUNLOCK(p);
 		PROC_UNLOCK(p);
 		TIMEVAL_TO_TIMESPEC(&user, ats);
 		break;
@@ -212,7 +211,6 @@ kern_clock_gettime(struct thread *td, clockid_t clock_id, struct timespec *ats)
 		PROC_LOCK(p);
 		PROC_SLOCK(p);
 		calcru(p, &user, &sys);
-		PROC_SUNLOCK(p);
 		PROC_UNLOCK(p);
 		timevaladd(&user, &sys);
 		TIMEVAL_TO_TIMESPEC(&user, ats);

--HG+GLK89HZ1zG0kk
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (FreeBSD)

iEYEARECAAYFAkym/YQACgkQC3+MBN1Mb4hvwwCfbLiXCeE8l1mxv+FiDxdA/3zu
NM4An1kwNyAMiDcgGbBPVuIetfjyhf0d
=R9Em
-----END PGP SIGNATURE-----

--HG+GLK89HZ1zG0kk--

From owner-freebsd-arch@FreeBSD.ORG Sat Oct 2
13:08:45 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 95648106566B; Sat, 2 Oct 2010 13:08:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx06.syd.optusnet.com.au (fallbackmx06.syd.optusnet.com.au [211.29.132.8]) by mx1.freebsd.org (Postfix) with ESMTP id 185408FC17; Sat, 2 Oct 2010 13:08:44 +0000 (UTC) Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au [211.29.132.183]) by fallbackmx06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o92BSuI2020318; Sat, 2 Oct 2010 21:28:56 +1000 Received: from besplex.bde.org (c122-107-116-249.carlnfd1.nsw.optusnet.com.au [122.107.116.249]) by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o92BSr05003394 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 2 Oct 2010 21:28:54 +1000 Date: Sat, 2 Oct 2010 21:28:52 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Don Lewis In-Reply-To: <201010020741.o927f7FJ056708@gw.catspoiler.org> Message-ID: <20101002190453.K11563@besplex.bde.org> References: <201010020741.o927f7FJ056708@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org Subject: Re: "process slock" vs. "scrlock" lock order X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 13:08:45 -0000 On Sat, 2 Oct 2010, Don Lewis wrote: > The hard coded lock order list in subr_witness.c has "scrlock" listed > before "process slock". This causes a lock order reversal when > calcru1(), which requires "process slock" to be held, calls printf() to > report unexpected runtime problems. The call to printf() eventually > gets into the console code which locks "scrlock". 

Console drivers are not permitted to use any normal locks, since they
are required to work when called from any instruction boundary via a
trace trap into ddb.  Syscons has lots of state, so it is difficult for
it to be reentrant enough to be a console driver.  It barely tries, but
mostly works anyway.  It used to use the axed cndbctl() call to try
harder.  This told it when ddb was entered and exited, so that it could
do things like save its state on ddb entry and restore it on ddb exit.
In practice, it did little more than stop the screen saver and switch to
vty0 on ddb entry and set a private variable to indicate that it was in
ddb mode instead of peeking at db_active.  Then it used this local
variable in a few places to avoid a few dangerous things.  It still uses
this variable to decide what to do, but this variable is now never
initialized (except statically to 0).  Replacing tests of this variable
by tests of kdb_active would unbreak a few things and lose mainly the
vty switch relative to the old version.

"scrlock" seems to be the only lock in syscons internals (except that it
is Giant-locked), and it is already guarded by a kdb_active test (and
that is the only kdb_active test in syscons internals), so it mostly
doesn't cause problems for calls from ddb, just like the old cndbctl()
tests.  This part of it was cloned from sio where it is less incorrect
since the corresponding lock is made MTX_QUIET iff any sio device is a
console.  (This is still wrong, since sio's lock should be a normal one
and any console lock a separate non-normal one.  Among other problems,
it makes sio's lock too quiet.)  syscons's lock is missing the
MTX_QUIET, but this lock is not a normal one (it is only used for
console output) so it can become more correct.  OTOH, its limited use
makes it useless for locking syscons generally.  It is only used to
prevent corruption of data structures (and garbled output) by multiple
concurrent calls into the console driver.
It doesn't prevent corruption from a console call concurrent with a
(Giant-locked and maybe tty-locked) user call.  sio needs the
corresponding locking only to reduce garbling of output, since its
console calls are reentrant enough to avoid corrupting any software
state and most hardware state.

> This normally isn't
> noticed because both of these are spin locks, and hardly anyone uses
> WITNESS without disabling the checking of spinlocks with
> WITNESS_SKIPSPIN.  If spin lock checking is not disabled, the result is
> a silent reset because witness catches the LOR, which recurses into
> printf(), which ends up causing a panic in cnputs().
>
> One obvious fix would be to move "scrlock" to a later spot in the list,
> but I suspect the same problem could occur with the "sio" or "uart"
> locks if a serial console is being used.  It might not be possible to
> fix them the same way because there might be cases where they are in the
> input path and get locked before "process slock" or other spin locks
> that can be held when calling printf().

I think sio isn't affected, since it uses MTX_QUIET (though maybe it
needs MTX_NOWITNESS too -- one or both of those should "work" by
breaking witnessing in much the same way as WITNESS_SKIPSPIN).  uart is
missing the MTX_QUIET, and uses a too-normal lock for the console.  uart
has locking for the whole of cngetc() too (except it drops the lock to
wait), while sio has only reentrancy for cngetc().  Both are useless for
serialization, since cngetc() hasn't actually been a getc function since
~2001 (?) when the multiple console changes broke input.  It is now
cncheckc() misnamed.  The multiple console code polls each console for
input in turn, even when there is only 1 active console, and this
involves dropping locks so interrupts tend to eat your input.

> Another fix for this particular case would be to rearrange the code in
> calcru1() so that the calls to printf() occur after ruxp->rux_* are
> updated and where I assume it would be safe to temporarily drop "process
> slock" for the duration of the printf() calls.

printf() is supposed to be callable from almost anywhere (just not
quite at any instruction boundary unless in ddb mode).

There is related broken locking in cnputs().  This uses a non-normal
mutex for serialization.  The mutex is MTX_NOWITNESS and MTX_QUIET, but
there are no kdb_active tests before using it, and it is not bogusly
MTX_RECURSE, so it can deadlock in some cases (all cases with ddb
output?) when cnputs() is debugged.  I use better serialization of
output involving a similar (but less normal) lock over single printfs
(callers wanting to ensure non-garbled output must put it all together).
Deadlock is avoided by ignoring the lock after trying for it for 1
second.  Console drivers still need lower-level locking to protect
their data structures.

The Giant locking in syscons seems bogus now that there is tty locking.
In the syscons directory, it is only done explicitly in sckbdevent(),
which calls tty_rint*() which needs tty locking but there is none
visible (maybe an upper layer of the interrupt handler does it, or Giant
locking of everything is enough).

"scrlock" causes problems with tty locking too.  syscons.c has only a
single explicit tty_lock() call, and that one is under "#if 0" together
with some scroll lock handling since "scrlock" causes a more detectable
LOR relative to tty_lock.  This is in sc_cngetc().  The LOR detection
has exposed the larger bug that a console driver is calling an upper tty
layer.  In old versions, the call was directly to scstart() except for
a check of the upper layer's open flag.  This was unsafe too (it called
up to the tty layer).
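
The "ignore the lock after trying for it for 1 second" scheme mentioned
above for serializing printf output might look roughly like this in
userland.  This is only a sketch under assumptions: out_lock,
out_lock_timed() and locked_puts() are made-up names, a C11 atomic_flag
stands in for the kernel mutex, and wall-clock time from timespec_get()
stands in for whatever time source the real code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical output lock; not a FreeBSD kernel structure. */
struct out_lock {
	atomic_flag	held;
};

#define	OUT_LOCK_INIT	{ ATOMIC_FLAG_INIT }

static int64_t
now_ns(void)
{
	struct timespec ts;

	/* Wall-clock time keeps the sketch portable (assumption). */
	timespec_get(&ts, TIME_UTC);
	return ((int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec);
}

/*
 * Spin for the lock for at most timeout_ns.  Returns true if the lock
 * was acquired, false if the caller gave up and must proceed unlocked.
 */
static bool
out_lock_timed(struct out_lock *l, int64_t timeout_ns)
{
	int64_t deadline;

	assert(l != NULL);
	deadline = now_ns() + timeout_ns;
	while (atomic_flag_test_and_set_explicit(&l->held,
	    memory_order_acquire)) {
		if (now_ns() >= deadline)
			return (false);
	}
	return (true);
}

/* Release the lock only if this caller really acquired it. */
static void
out_unlock(struct out_lock *l, bool owned)
{
	if (owned)
		atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Emit one message as a unit; returns whether the lock was held. */
static bool
locked_puts(struct out_lock *l, const char *msg, int64_t timeout_ns)
{
	bool owned;

	owned = out_lock_timed(l, timeout_ns);
	fputs(msg, stdout);	/* printed even if the lock was not obtained */
	out_unlock(l, owned);
	return (owned);
}
```

A caller would pass 1000000000 (1 second) as the timeout.  Output that
goes ahead without the lock may be garbled, which is the accepted price
for never blocking forever.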
Add full tty locking to syscons and you would probably find its console routines can't go anywhere without hitting the tty lock, so when a console routine is called with the tty lock held, it should deadlock or panic. Giant locking was too feeble to detect such problems, and before Giant I thought syscons was missing lots of spl locking (which needed to be splhigh() to defend against reentry for printf from an interrupt handler, leaving only the problem of reentry for printf from a trap handler). Bruce From owner-freebsd-arch@FreeBSD.ORG Sat Oct 2 20:03:26 2010 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1BC3A1065670; Sat, 2 Oct 2010 20:03:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail07.syd.optusnet.com.au (mail07.syd.optusnet.com.au [211.29.132.188]) by mx1.freebsd.org (Postfix) with ESMTP id 92AA38FC14; Sat, 2 Oct 2010 20:03:25 +0000 (UTC) Received: from c122-107-116-249.carlnfd1.nsw.optusnet.com.au (c122-107-116-249.carlnfd1.nsw.optusnet.com.au [122.107.116.249]) by mail07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o92K3LSU032357 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 3 Oct 2010 07:03:22 +1100 Date: Sun, 3 Oct 2010 07:03:21 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Kostik Belousov In-Reply-To: <20101002093813.GC2392@deviant.kiev.zoral.com.ua> Message-ID: <20101003062141.C1323@delplex.bde.org> References: <201010020741.o927f7FJ056708@gw.catspoiler.org> <20101002093813.GC2392@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org, Don Lewis Subject: Re: "process slock" vs. 
"scrlock" lock order X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 20:03:26 -0000 On Sat, 2 Oct 2010, Kostik Belousov wrote: > On Sat, Oct 02, 2010 at 12:41:07AM -0700, Don Lewis wrote: >> The hard coded lock order list in subr_witness.c has "scrlock" listed >> before "process slock". This causes a lock order reversal when >> calcru1(), which requires "process slock" to be held, calls printf() to >> report unexpected runtime problems. The call to printf() eventually >> gets into the console code which locks "scrlock". This normally isn't >> slock" for the duration of the printf() calls. >> ... >> Thoughts? > > Yes, printing from under a spinlock is somewhat epidemic. Moving the printf > out of process slock looks as the right solution. No, it is shooting the messenger. printf() (and console functions) may be called with any locks held (except ones related to printf and console functions themselves, and even those must be blown open for printfs from panics and possibly from debuggers (if you have a reeentrant debugger)). > On the other hand, all > calcru() callers unlock slock immediately after calcru(), and calcru1() > sometimes only called with thread lock held, not process slock. > > I propose the following refinement, it does not need relock of process slock > at all. Lets drop slock in calcru(), and do neccessary print after that. > No need to reacquire the slock. This might be cleaner for other reasons. 

> diff --git a/sys/compat/linux/linux_misc.c b/sys/compat/linux/linux_misc.c
> index d2cf6b6..6a599f6 100644
> --- a/sys/compat/linux/linux_misc.c
> +++ b/sys/compat/linux/linux_misc.c
> @@ -691,7 +691,6 @@ linux_times(struct thread *td, struct linux_times_args *args)
>  		PROC_LOCK(p);
>  		PROC_SLOCK(p);
>  		calcru(p, &utime, &stime);
> -		PROC_SUNLOCK(p);
>  		calccru(p, &cutime, &cstime);
>  		PROC_UNLOCK(p);
>

Clean to remove lots of these.

> diff --git a/sys/kern/kern_resource.c b/sys/kern/kern_resource.c
> index ec2d6b6..13cc50c 100644
> --- a/sys/kern/kern_resource.c
> +++ b/sys/kern/kern_resource.c
> @@ -72,8 +72,15 @@ static struct rwlock uihashtbl_lock;
>  static LIST_HEAD(uihashhead, uidinfo) *uihashtbl;
>  static u_long uihash;		/* size of hash table - 1 */
>
> +struct calcru1_warn {
> +	int64_t neg_runtime;
> +	int64_t new_runtime;
> +	int64_t old_runtime;
> +};
> +
>  static void	calcru1(struct proc *p, struct rusage_ext *ruxp,
> -		    struct timeval *up, struct timeval *sp);
> +		    struct timeval *up, struct timeval *sp,
> +		    struct calcru1_warn *w);
>  static int	donice(struct thread *td, struct proc *chgp, int n);
>  static struct uidinfo *uilookup(uid_t uid);
>  static void	ruxagg_locked(struct rusage_ext *rux, struct thread *td);
> @@ -797,6 +804,20 @@ getrlimit(td, uap)
>  	return (error);
>  }
>
> +static void
> +print_calcru1_warn(struct proc *p, const struct calcru1_warn *w)
> +{
> +
> +	if (w->neg_runtime > 0)
> +		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
> +		    (intmax_t)w->neg_runtime, p->p_pid, p->p_comm);
> +	if (w->new_runtime > 0)
> +		printf("calcru: runtime went backwards from %ju usec "
> +		    "to %ju usec for pid %d (%s)\n",
> +		    (uintmax_t)w->old_runtime, (uintmax_t)w->new_runtime,
> +		    p->p_pid, p->p_comm);
> +}
> +
>  /*
>   * Transform the running time and tick information for children of proc p
>   * into user and system time usage.
> @@ -807,24 +828,33 @@ calccru(p, up, sp)
>  	struct timeval *up;
>  	struct timeval *sp;
>  {
> +	struct calcru1_warn w;
>
>  	PROC_LOCK_ASSERT(p, MA_OWNED);
> -	calcru1(p, &p->p_crux, up, sp);
> +	bzero(&w, sizeof(w));
> +	calcru1(p, &p->p_crux, up, sp, &w);
> +	print_calcru1_warn(p, &w);
>  }
>
>  /*
>   * Transform the running time and tick information in proc p into user
>   * and system time usage.  If appropriate, include the current time slice
>   * on this CPU.
> + *
> + * The process slock shall be locked on entry, and it is unlocked
> + * after function returned.
>   */
>  void
>  calcru(struct proc *p, struct timeval *up, struct timeval *sp)
>  {
>  	struct thread *td;
>  	uint64_t u;
> +	struct calcru1_warn w;
>
>  	PROC_LOCK_ASSERT(p, MA_OWNED);
>  	PROC_SLOCK_ASSERT(p, MA_OWNED);
> +
> +	bzero(&w, sizeof(w));
>  	/*
>  	 * If we are getting stats for the current process, then add in the
>  	 * stats that this thread has accumulated in its current time slice.
> @@ -843,12 +873,14 @@ calcru(struct proc *p, struct timeval *up, struct timeval *sp)
>  			continue;
>  		ruxagg(p, td);
>  	}
> -	calcru1(p, &p->p_rux, up, sp);
> +	calcru1(p, &p->p_rux, up, sp, &w);
> +	PROC_SUNLOCK(p);
> +	print_calcru1_warn(p, &w);
>  }
>
>  static void
>  calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
> -    struct timeval *sp)
> +    struct timeval *sp, struct calcru1_warn *w)
>  {
>  	/* {user, system, interrupt, total} {ticks, usec}: */
>  	uint64_t ut, uu, st, su, it, tt, tu;
> @@ -865,8 +897,7 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
>  	tu = cputick2usec(ruxp->rux_runtime);
>  	if ((int64_t)tu < 0) {
>  		/* XXX: this should be an assert /phk */
> -		printf("calcru: negative runtime of %jd usec for pid %d (%s)\n",
> -		    (intmax_t)tu, p->p_pid, p->p_comm);
> +		w->neg_runtime = tu;
>  		tu = ruxp->rux_tu;
>  	}
>
> @@ -903,10 +934,8 @@ calcru1(struct proc *p, struct rusage_ext *ruxp, struct timeval *up,
>  		 * serious, so lets keep it and hope laptops can be made
>  		 * more truthful about their CPU speed via ACPI.
>  		 */
> -		printf("calcru: runtime went backwards from %ju usec "
> -		    "to %ju usec for pid %d (%s)\n",
> -		    (uintmax_t)ruxp->rux_tu, (uintmax_t)tu,
> -		    p->p_pid, p->p_comm);
> +		w->new_runtime = tu;
> +		w->old_runtime = ruxp->rux_tu;
>  		uu = (tu * ut) / tt;
>  		su = (tu * st) / tt;
>  	}
> @@ -946,6 +975,7 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
>  {
>  	struct proc *p;
>  	int error;
> +	struct calcru1_warn w;
>
>  	error = 0;
>  	p = td->td_proc;
> @@ -962,13 +992,15 @@ kern_getrusage(struct thread *td, int who, struct rusage *rup)
>  		break;
>
>  	case RUSAGE_THREAD:
> +		bzero(&w, sizeof(w));
>  		PROC_SLOCK(p);
>  		ruxagg(p, td);
>  		PROC_SUNLOCK(p);
>  		thread_lock(td);
>  		*rup = td->td_ru;
> -		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime);
> +		calcru1(p, &td->td_rux, &rup->ru_utime, &rup->ru_stime, &w);
>  		thread_unlock(td);
> +		print_calcru1_warn(p, &w);
>  		break;
>
>  	default:
> @@ -1069,7 +1101,6 @@ rufetchcalc(struct proc *p, struct rusage *ru, struct timeval *up,
>  	PROC_SLOCK(p);
>  	rufetch(p, ru);
>  	calcru(p, up, sp);
> -	PROC_SUNLOCK(p);
>  }
>
>  /*

Not clean to add mounds of code to defer a couple of normal printfs.

I think the only relationship of calcru() to the problem is that it has
a printf that is actually executed quite often.  Just about any printf
within a locked region may become a messenger for the problem if the
printf is actually executed.  To see lots more bugs in console drivers,
put printfs in lots of critical places and arrange for them to be
executed frequently.  Ones near malloc might be good.  I once used the
one in the following timeout handler to demonstrate the missing locking
in syscons:

% static void
% foo(void *arg)
% {
% #if 0
%     sccnputc(0, '*');
%     timeout_handle = timeout(foo, NULL, 1);
% #else
%     /*
%      * Fills up log if done every tick so only do it every 10 ticks and
%      * wait a bit longer for races.
%      */
%     printf("*");
%     timeout_handle = timeout(foo, NULL, 10);
% #endif
% }

This printf can contend with write(2) or perhaps another printf.  Panics
were easiest to demonstrate with write(2).  Timeouts can easily
interrupt write(2), so the above printf contended with write(2) any time
the timeout was scheduled while syscons was active with write(2).  (Some
console drivers have locking to prevent the contention, but they must be
careful about deadlock.  The printf cannot be blocked.)

Giant locking reduced this problem a bit.  It makes the above timeout
handler Giant-locked.  Syscons remains Giant-locked, so there is enough
locking to prevent contention from the above, but the above printf is
broken (it blocks).  The blocking doesn't matter here, but it would in a
more critical context.  More critical contexts wouldn't be Giant-locked
anyway, so they would contend.  I never got around to changing the above
to be an MPSAFE callout handler so that Giant locking doesn't help.  It
is in fact not MPSAFE, but only because the console driver is not even
UPSAFE.

You can see how old the above is from its non-KNF style which I once
preferred.

Bruce