From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 11:06:58 2008 Return-Path: Delivered-To: freebsd-arch@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C03CD106567E for ; Mon, 10 Mar 2008 11:06:58 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id AB8F78FC20 for ; Mon, 10 Mar 2008 11:06:58 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m2AB6wVC086485 for ; Mon, 10 Mar 2008 11:06:58 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.2/8.14.1/Submit) id m2AB6v5E086481 for freebsd-arch@FreeBSD.org; Mon, 10 Mar 2008 11:06:57 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 10 Mar 2008 11:06:57 GMT Message-Id: <200803101106.m2AB6v5E086481@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-arch@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 11:06:58 -0000

Current FreeBSD problem reports

Critical problems
Serious problems
Non-critical problems

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 11:36:31 2008 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 63AED1065677 for ; Mon, 10 Mar 2008 11:36:31 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 53E718FC17 for ; Mon, 10 Mar 2008 11:36:31 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id B6ECE46B1E; Mon, 10 Mar 2008 06:36:30 -0500 (EST) Date: Mon, 10 Mar 2008 12:36:30 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: arch@FreeBSD.org Message-ID: <20080310122338.T29929@fledge.watson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: net@FreeBSD.org Subject: netatm removal warning X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 11:36:31 -0000

Dear all,

This is another of those boring e-mails about kernel subsystems that still require Giant. Sorry about that!

As previously published, netatm is a non-MPSAFE protocol stack largely superseded by our two other ATM stacks, netnatm and netgraph/atm (both MPSAFE). netatm is currently non-functional and uncompileable because it depends on the Giant compatibility shims for the protocol stack, which were removed in FreeBSD 7.0. We left the code in place to make it easier for any interested third parties to distribute patches against it (in particular, patches to make it MPSAFE).

The current plan is that we will remove the netatm code from HEAD and RELENG_7 before FreeBSD 7.1.
A specific schedule for 7.1 hasn't been published yet, but in order to give plenty of warning, here's the proposed netatm removal schedule:

10 March 2008   E-mail warning to arch@/net@
10 April 2008   E-mail warning to arch@/net@
10 May 2008     Removal of netatm from HEAD
20 May 2008     Removal of netatm from RELENG_7

Obviously, netatm will remain in the revision control history should anyone wish to resurrect it after that date. However, I suspect that those interested in ATM on FreeBSD have long since been using Harti's netgraph ATM framework.

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 13:34:35 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4ED37106566B for ; Mon, 10 Mar 2008 13:34:35 +0000 (UTC) (envelope-from skalla.raabjorn@gmx.de) Received: from mail.gmx.net (mail.gmx.net [213.165.64.20]) by mx1.freebsd.org (Postfix) with SMTP id F33DF8FC12 for ; Mon, 10 Mar 2008 13:34:34 +0000 (UTC) (envelope-from skalla.raabjorn@gmx.de) Received: (qmail invoked by alias); 10 Mar 2008 13:07:53 -0000 Received: from g227178023.adsl.alicedsl.de (EHLO sol.hackerzberg.local) [92.227.178.23] by mail.gmx.net (mp055) with SMTP; 10 Mar 2008 14:07:53 +0100 X-Authenticated: #8038066 X-Provags-ID: V01U2FsdGVkX1965s4uXBV6PqIdRlIIJisGpBpO+rWaC1ALF1deH1 pjor9ICct/9qyr Date: Mon, 10 Mar 2008 14:07:53 +0100 From: Skalla Raabjorn To: freebsd-arch@FreeBSD.org Message-ID: <20080310140753.24630bda@sol.hackerzberg.local> X-Mailer: Claws Mail 3.3.1 (GTK+ 2.12.8; i386-portbld-freebsd7.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 Cc: Subject: If GIANT is locked can the MPSAFE parts run in parallel?
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 13:34:35 -0000 Hi all, if GIANT is locked can the MPSAFE parts run in parallel? Like networking for example, as they have their own locks. regards Skalla From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 13:45:37 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1F6BC1065677 for ; Mon, 10 Mar 2008 13:45:37 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 0D4D18FC16 for ; Mon, 10 Mar 2008 13:45:37 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 983FA46B0D; Mon, 10 Mar 2008 08:45:36 -0500 (EST) Date: Mon, 10 Mar 2008 14:45:36 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Skalla Raabjorn In-Reply-To: <20080310140753.24630bda@sol.hackerzberg.local> Message-ID: <20080310143919.V50827@fledge.watson.org> References: <20080310140753.24630bda@sol.hackerzberg.local> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@FreeBSD.org Subject: Re: If GIANT is locked can the MPSAFE parts run in parallel? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 13:45:37 -0000 On Mon, 10 Mar 2008, Skalla Raabjorn wrote: > if GIANT is locked can the MPSAFE parts run in parallel? Like networking for > example, as they have their own locks. Dear Skalla, Yes. 
Giant is [almost] a mutex like any other mutex, so as long as the MPSAFE subsystem isn't being invoked by something holding Giant, it generally won't run with it. Even if the network stack is sometimes executed with Giant held (for example, when receiving a packet from SLIP), that doesn't prevent the network stack from executing in parallel on other CPUs, it just serializes with respect to other Giant holders executing. Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 14:18:52 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 71E601065674 for ; Mon, 10 Mar 2008 14:18:52 +0000 (UTC) (envelope-from skalla.raabjorn@gmx.de) Received: from mail.gmx.net (mail.gmx.net [213.165.64.20]) by mx1.freebsd.org (Postfix) with SMTP id CF1AC8FC28 for ; Mon, 10 Mar 2008 14:18:51 +0000 (UTC) (envelope-from skalla.raabjorn@gmx.de) Received: (qmail invoked by alias); 10 Mar 2008 14:18:50 -0000 Received: from g227178023.adsl.alicedsl.de (EHLO sol.hackerzberg.local) [92.227.178.23] by mail.gmx.net (mp010) with SMTP; 10 Mar 2008 15:18:50 +0100 X-Authenticated: #8038066 X-Provags-ID: V01U2FsdGVkX19ZjaZYMRNZ6doAILG23VidGSARLxIGYpEpSTThAZ uoh0wDPrGHyUfe Date: Mon, 10 Mar 2008 15:18:50 +0100 From: Skalla Raabjorn To: freebsd-arch@freebsd.org Message-ID: <20080310151850.6d8451ff@sol.hackerzberg.local> In-Reply-To: <20080310143919.V50827@fledge.watson.org> References: <20080310140753.24630bda@sol.hackerzberg.local> <20080310143919.V50827@fledge.watson.org> X-Mailer: Claws Mail 3.3.1 (GTK+ 2.12.8; i386-portbld-freebsd7.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 Subject: Re: If GIANT is locked can the MPSAFE parts run in parallel? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 14:18:52 -0000 On Mon, 10 Mar 2008 14:45:36 +0100 (BST) Robert Watson wrote: > > On Mon, 10 Mar 2008, Skalla Raabjorn wrote: > > > if GIANT is locked can the MPSAFE parts run in parallel? Like networking for > > example, as they have their own locks. > > Dear Skalla, > > Yes. Giant is [almost] a mutex like any other mutex, so as long as the MPSAFE > subsystem isn't being invoked by something holding Giant, it generally won't > run with it. Even if the network stack is sometimes executed with Giant held > (for example, when receiving a packet from SLIP), that doesn't prevent the > network stack from executing in parallel on other CPUs, it just serializes > with respect to other Giant holders executing. Thanks, that's all I wanted to know :) From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 17:50:47 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4A4A21065671 for ; Mon, 10 Mar 2008 17:50:47 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id E1F9E8FC31 for ; Mon, 10 Mar 2008 17:50:46 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 234968788-1834499 for multiple; Mon, 10 Mar 2008 13:51:55 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2AHoCCn087969; Mon, 10 Mar 2008 13:50:12 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Jeff Roberson Date: Mon, 10 Mar 2008 13:13:03 -0400 
User-Agent: KMail/1.9.7 References: <20080307020626.G920@desktop> <20080307124038.I920@desktop> <20080307234452.U1091@desktop> In-Reply-To: <20080307234452.U1091@desktop> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803101313.03526.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Mon, 10 Mar 2008 13:50:13 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6192/Mon Mar 10 10:54:00 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: Getting rid of the static msleep priority boost X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 17:50:47 -0000 On Saturday 08 March 2008 04:46:32 am Jeff Roberson wrote: > On Fri, 7 Mar 2008, Jeff Roberson wrote: > > > On Fri, 7 Mar 2008, John Baldwin wrote: > > > >> On Friday 07 March 2008 08:42:37 am John Baldwin wrote: > >>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote: > >>>> Hello, > >>>> > >>>> I've been studying some problems with recent scheduler improvements that > >>>> help a lot on some workloads and hurt on others. I've tracked the > >>>> problem down to static priority boosts handed out by > >>>> msleep/cv_broadcastpri. The basic problem is that a user thread will be > >>>> woken with a kernel priority thus allowing it to preempt a thread running > >>>> on any processor with a lesser priority. The lesser priority thread may > >>>> in fact hold some resource that the higher priority thread requires. 
> >>>> Thus we context switch several times and perhaps go through priority > >>>> propagation as well. > >>>> > >>>> I have verified that disabling these static priority boosts entirely > >>>> fixes the performance problem I've run into on at least one workload. > >>>> There are probably others that it helps and hopefully we can discover > >>>> that. > >>>> > >>>> I'd like to know if anyone has a strong preference to keep this feature. > >>>> It is likely that it helps in some interactive situations. I'm not sure > >>>> how much however. I propose that we make a sysctl that disables it and > >>>> turn it off by default. If we see complaints on current@ we can suggest > >>>> that they toggle the sysctl to see if it alleviates problems. > >>>> > >>>> Based on feedback from that experiment and some testing we can then > >>>> choose a few options: > >>>> > >>>> 1) Disable the static boosts entirely. Leave kernel priorities for > >>>> kernel threads and priority propagation. Most other kernels do this. > >>>> Would make my life in ULE much easier as well. > >>>> > >>>> 2) Leave the support for static boosts but remove it from all but a few > >>>> key locations. Leaving it in the api would give some flexibility but > >>>> might confuse developers. > >>>> > >>>> 3) Leave things as they are. undesirable. > >>>> > >>>> I'm leaning towards #2 based on the information I have presently. This > >>>> is almost a significant change to historic BSD behavior so we might want > >>>> to tread lightly. > >>> > >>> One thing to note is that we actually depend on the priority boost > >>> (evilly) > >>> to pick processes to swap out. (I think we check for <= PSOCK and don't > >>> swap those out). One thing that I've wanted to happen for a while is that > >>> the sleep priority for msleep() just be a parameter available to the > >>> scheduler that the scheduler can use to calculate the real internal > >>> priority rather than just being a set. 
That is, I imagine having: > >>> > >>> void sched_set_sleep_prio(struct thread *td, u_char pri); > >>> u_char sched_get_sleep_prio(struct thread *td); > >>> > >>> (The swap check would use the get call). The 4BSD scheduler's > >>> implementation of sched_set_sleep_prio would look like this: > >>> > >>> void > >>> sched_set_sleep_prio(struct thread *td, u_char pri) > >>> { > >>> > >>> td->td_sched->sleep_pri = pri; > >>> sched_prio(td, pri); > >>> } > >>> > >>> void > >>> sched_userret(..) > >>> { > >>> > >>> ... > >>> td->td_sched->sleep_pri = 0; /* not in the kernel anymore */ > >>> } > >>> > >>> but other schedulers may just save it and recalculate the priority where > >>> the priority calculation just considers the sleep priority as one among > >>> many factors. If nothing else, this allows it to be a scheduler decision > >>> to ignore it (so 4BSD could continue to do what it does now, but ULE may > >>> ignore it, or ignore certain levels, etc.) > >> > >> One thing to clarify: I'm not opposed to replacing the PSOCK check with > >> something more suitable in the swap code, (in fact, that would be > >> desirable), > >> but it might take a good bit of work to do that and is probably easier to > >> work on that as a separate change. I also think there can be some merit in > >> having code paths hint to the scheduler the relative interactivity/priority > >> of a sleep. > > > > Couple of notes.. > > > > The priority argument to sleep is a reasonable way for the code to hint at > > the relative priority/interactivity. So that argues for leaving these > > arguments in place and making them more advisory. I don't think we have to > > change the api to take advantage of that. > > > > I'll look more closely for places like the swap that care about the absolute > > priority of a process and see what I can come up with. Thanks for raising > > that concern. 
> > > > I'd like to avoid apis that require the sched lock in seperate steps like > > msleep does now to elevate the priority. So far all sched* apis require the > > thread lock on enter and I'd hate to deviate from that norm. But another > > option may be just to make a globally visible td_sleep_pri that doesn't > > require the lock for write but does for read. The other option is to bubble > > the argument down through the sleepq code and into sched_sleep() and > > sched_wakeup(). I like that the best but it's the most api churn. > > http://people.freebsd.org/~jeff/sleeppri.diff > > What do you think of this? I added another parameter to sleepq_add() and > sched_sleep(). So the scheduler is responsible for adjusting the > priority. We could do the same thing for wakeup time adjustments like > sleepq_broadcastpri() but we'd have to pass it through setrunnable() as > well. The cv_broadcastpri() thing is a hack and I wish there was a better way to do it. I.e., I don't like having wakeup setting the priority at all. I think it's a good idea to pass this to sched_sleep(), but I'd rather leave sched_sleep() where it is and pass the prio arg to the sleepq_wait() routines instead so you don't get a bump unless you actually sleep. I think it's probably a bug that we bump the prio on threads that may not sleep now. > I'd like to normalize the other pri arguments in sleepq to use the same 0 > is not set vs -1 that msleep did. I realize that 0 is a valid priority > but for practical purposes this makes things consistent and does not > really restrict the api. Sounds fine to me. I think we should even formally make 0 an invalid priority (via a comment or something). 
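
A minimal sketch of the "sleep priority as a hint" idea discussed above, with 0 treated as "not set". Everything here is hypothetical and for illustration only: `struct toy_thread`, `TOY_PRI_UNSET` and the `toy_*` functions are invented names, not the kernel's `struct thread` or the real `sched_sleep()` interface. It only shows how one scheduler could honour the hint as a static boost while another records it and ignores it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical convention from the thread: 0 means "no sleep priority given". */
#define TOY_PRI_UNSET 0

/* Toy stand-in for a thread; not the kernel's layout. */
struct toy_thread {
	uint8_t base_pri;	/* priority the scheduler computed itself */
	uint8_t sleep_pri;	/* hint recorded at sleep time, 0 = none */
	uint8_t cur_pri;	/* priority currently in effect */
};

/* Static-boost policy: treat the hint as the new priority. */
void
toy_sleep_static(struct toy_thread *td, uint8_t pri)
{
	assert(td != NULL);
	td->sleep_pri = pri;
	if (pri != TOY_PRI_UNSET)
		td->cur_pri = pri;
}

/* Advisory policy: remember the hint but leave the priority alone. */
void
toy_sleep_advisory(struct toy_thread *td, uint8_t pri)
{
	assert(td != NULL);
	td->sleep_pri = pri;
}

/* Return to user mode: drop the hint and any boost. */
void
toy_userret(struct toy_thread *td)
{
	assert(td != NULL);
	td->sleep_pri = TOY_PRI_UNSET;
	td->cur_pri = td->base_pri;
}
```

Keeping the hint in its own field means a check such as the swap code's could read the recorded value without depending on whether the scheduler actually applied a boost.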
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Mar 10 22:22:09 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6771610656D0; Mon, 10 Mar 2008 22:22:09 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 0D1578FC26; Mon, 10 Mar 2008 22:22:08 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2AMM324089149; Mon, 10 Mar 2008 18:22:06 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Mon, 10 Mar 2008 12:22:54 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: John Baldwin In-Reply-To: <200803101313.03526.jhb@freebsd.org> Message-ID: <20080310121527.F1091@desktop> References: <20080307020626.G920@desktop> <20080307124038.I920@desktop> <20080307234452.U1091@desktop> <200803101313.03526.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Getting rid of the static msleep priority boost X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 10 Mar 2008 22:22:09 -0000 On Mon, 10 Mar 2008, John Baldwin wrote: > On Saturday 08 March 2008 04:46:32 am Jeff Roberson wrote: >> On Fri, 7 Mar 2008, Jeff Roberson wrote: >> >>> On Fri, 7 Mar 2008, John Baldwin wrote: >>> >>>> On Friday 07 March 2008 08:42:37 am John Baldwin wrote: >>>>> On Friday 07 March 2008 07:16:30 am Jeff Roberson wrote: >>>>>> Hello, >>>>>> >>>>>> I've been studying some problems with recent scheduler 
improvements > that >>>>>> help a lot on some workloads and hurt on others. I've tracked the >>>>>> problem down to static priority boosts handed out by >>>>>> msleep/cv_broadcastpri. The basic problem is that a user thread will > be >>>>>> woken with a kernel priority thus allowing it to preempt a thread > running >>>>>> on any processor with a lesser priority. The lesser priority thread > may >>>>>> in fact hold some resource that the higher priority thread requires. >>>>>> Thus we context switch several times and perhaps go through priority >>>>>> propagation as well. >>>>>> >>>>>> I have verified that disabling these static priority boosts entirely >>>>>> fixes the performance problem I've run into on at least one workload. >>>>>> There are probably others that it helps and hopefully we can discover >>>>>> that. >>>>>> >>>>>> I'd like to know if anyone has a strong preference to keep this > feature. >>>>>> It is likely that it helps in some interactive situations. I'm not > sure >>>>>> how much however. I propose that we make a sysctl that disables it and >>>>>> turn it off by default. If we see complaints on current@ we can > suggest >>>>>> that they toggle the sysctl to see if it alleviates problems. >>>>>> >>>>>> Based on feedback from that experiment and some testing we can then >>>>>> choose a few options: >>>>>> >>>>>> 1) Disable the static boosts entirely. Leave kernel priorities for >>>>>> kernel threads and priority propagation. Most other kernels do this. >>>>>> Would make my life in ULE much easier as well. >>>>>> >>>>>> 2) Leave the support for static boosts but remove it from all but a > few >>>>>> key locations. Leaving it in the api would give some flexibility but >>>>>> might confuse developers. >>>>>> >>>>>> 3) Leave things as they are. undesirable. >>>>>> >>>>>> I'm leaning towards #2 based on the information I have presently. This >>>>>> is almost a significant change to historic BSD behavior so we might > want >>>>>> to tread lightly. 
>>>>> >>>>> One thing to note is that we actually depend on the priority boost >>>>> (evilly) >>>>> to pick processes to swap out. (I think we check for <= PSOCK and don't >>>>> swap those out). One thing that I've wanted to happen for a while is > that >>>>> the sleep priority for msleep() just be a parameter available to the >>>>> scheduler that the scheduler can use to calculate the real internal >>>>> priority rather than just being a set. That is, I imagine having: >>>>> >>>>> void sched_set_sleep_prio(struct thread *td, u_char pri); >>>>> u_char sched_get_sleep_prio(struct thread *td); >>>>> >>>>> (The swap check would use the get call). The 4BSD scheduler's >>>>> implementation of sched_set_sleep_prio would look like this: >>>>> >>>>> void >>>>> sched_set_sleep_prio(struct thread *td, u_char pri) >>>>> { >>>>> >>>>> td->td_sched->sleep_pri = pri; >>>>> sched_prio(td, pri); >>>>> } >>>>> >>>>> void >>>>> sched_userret(..) >>>>> { >>>>> >>>>> ... >>>>> td->td_sched->sleep_pri = 0; /* not in the kernel anymore */ >>>>> } >>>>> >>>>> but other schedulers may just save it and recalculate the priority where >>>>> the priority calculation just considers the sleep priority as one among >>>>> many factors. If nothing else, this allows it to be a scheduler > decision >>>>> to ignore it (so 4BSD could continue to do what it does now, but ULE may >>>>> ignore it, or ignore certain levels, etc.) >>>> >>>> One thing to clarify: I'm not opposed to replacing the PSOCK check with >>>> something more suitable in the swap code, (in fact, that would be >>>> desirable), >>>> but it might take a good bit of work to do that and is probably easier to >>>> work on that as a separate change. I also think there can be some merit > in >>>> having code paths hint to the scheduler the relative > interactivity/priority >>>> of a sleep. >>> >>> Couple of notes.. >>> >>> The priority argument to sleep is a reasonable way for the code to hint at >>> the relative priority/interactivity. 
So that argues for leaving these >>> arguments in place and making them more advisory. I don't think we have > to >>> change the api to take advantage of that. >>> >>> I'll look more closely for places like the swap that care about the > absolute >>> priority of a process and see what I can come up with. Thanks for raising >>> that concern. >>> >>> I'd like to avoid apis that require the sched lock in seperate steps like >>> msleep does now to elevate the priority. So far all sched* apis require > the >>> thread lock on enter and I'd hate to deviate from that norm. But another >>> option may be just to make a globally visible td_sleep_pri that doesn't >>> require the lock for write but does for read. The other option is to > bubble >>> the argument down through the sleepq code and into sched_sleep() and >>> sched_wakeup(). I like that the best but it's the most api churn. >> >> http://people.freebsd.org/~jeff/sleeppri.diff >> >> What do you think of this? I added another parameter to sleepq_add() and >> sched_sleep(). So the scheduler is responsible for adjusting the >> priority. We could do the same thing for wakeup time adjustments like >> sleepq_broadcastpri() but we'd have to pass it through setrunnable() as >> well. > > The cv_broadcastpri() thing is a hack and I wish there was a better way to do > it. I.e., I don't like having wakeup setting the priority at all. I think > it's a good idea to pass this to sched_sleep(), but I'd rather leave > sched_sleep() where it is and pass the prio arg to the sleepq_wait() routines > instead so you don't get a bump unless you actually sleep. I think it's > probably a bug that we bump the prio on threads that may not sleep now. Ok, I preferred not to move sched_sleep() as well but I also didn't want to add those arguments to the stack everywhere. I'll do that however. > >> I'd like to normalize the other pri arguments in sleepq to use the same 0 >> is not set vs -1 that msleep did. 
I realize that 0 is a valid priority
>> but for practical purposes this makes things consistent and does not
>> really restrict the api.
>
> Sounds fine to me. I think we should even formally make 0 an invalid priority
> (via a comment or something).

Ok, I'll consider that. I'm just going to commit this when it's tested and working. It's simple enough that I don't think it warrants further review.

Thanks,
Jeff

>
> --
> John Baldwin
>

From owner-freebsd-arch@FreeBSD.ORG Tue Mar 11 02:25:27 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B494F1065675 for ; Tue, 11 Mar 2008 02:25:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 5F9368FC16 for ; Tue, 11 Mar 2008 02:25:27 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2B2PPaL045126 for ; Mon, 10 Mar 2008 22:25:26 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Mon, 10 Mar 2008 16:26:17 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: arch@freebsd.org Message-ID: <20080310161115.X1091@desktop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 11 Mar 2008 02:25:27 -0000

http://people.freebsd.org/~jeff/amd64.diff

At the above address there is an implementation of cpu_switch() and cpu_throw() for amd64 almost entirely in C. I'm posting this for discussion and eventual commit.
There are numerous reasons to do this; I will outline some of them.

Implementing the bulk of the code in C allows us to add/modify higher level features more easily. For example, we can change the pmap active bits to use a cpuset_t so we can support more than 64 cpus.

It makes the code faster because we can do more complicated checks to save time, such as avoiding writing the fs/gsbase MSRs if they have not changed.

It makes the code faster because infrequently used options can be moved out of the normal code paths.

In fact, the C version is ~10% faster than the assembly version at a two thread sched_yield() test on a single cpu Opteron:

x asm.yield
+ csw.yield
[ministat distribution plot; layout not recoverable]
    N           Min           Max        Median           Avg        Stddev
x  10          5.17          5.88           5.5         5.479    0.19272606
+  15          4.58          5.16          4.71     4.8126667    0.20738049
Difference at 95.0% confidence
        -0.666333 +/- 0.170431
        -12.1616% +/- 3.11062%
        (Student's t, pooled s = 0.201773)

This test measures the total time to call sched_yield() 10,000,000 times between two threads. Two threads are needed to be sure that the scheduler doesn't pick the same thread twice and skip cpu_switch(). The 10% speedup is notable because the cpu_switch() routine was consuming less than 40% of the cpu prior to the speedup. So it's almost 1/3rd faster.

Peter also suggested that we can delay portions of the switch until the user boundary. For workloads that involve heavy kernel activity on the user's part with multiple switches per syscall this would be a big savings.

We could also use this as a framework to implement custom switch routines if we want to switch directly to ithreads or taskqueue threads in the future.
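
The "almost 1/3rd faster" estimate above can be checked with a few lines of arithmetic. This is only a sketch under two assumptions that are not measurements from the post: that the whole saving comes out of cpu_switch(), and that cpu_switch() took exactly 40% of the old run (the post says "less than 40%", so the result is a lower bound). The function name `switch_time_reduction` is made up for illustration.

```c
#include <assert.h>

/*
 * Fraction of cpu_switch() time removed, given the total run time before
 * and after the change and the share of the old run spent in cpu_switch().
 * Assumes the entire saving is attributable to cpu_switch().
 */
double
switch_time_reduction(double old_total, double new_total, double switch_share)
{
	assert(old_total > 0.0 && switch_share > 0.0 && switch_share <= 1.0);
	return ((old_total - new_total) / (old_total * switch_share));
}
```

Called with the two averages from the ministat output, `switch_time_reduction(5.479, 4.8126667, 0.40)`, this gives roughly 0.30, i.e. about 30% of the time previously spent in cpu_switch().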
The C routine is supplemented by two assembly routines which are responsible for saving the core architecture state and manipulating the stack. These total approximately 50 assembly instructions and are similar to savecontext/swapcontext. The C code saves the old thread's context but still runs on its stack as it continues the switch. This is safe because the old thread is locked until we call "cpu_switchin()", which is similar to swapcontext.

The only appreciable downside is that it lowers the barrier of entry for modifying a very sensitive piece of code. Still, I think the flexibility it gives us outweighs those concerns.

Comments?

Thanks,
Jeff

From owner-freebsd-arch@FreeBSD.ORG Tue Mar 11 09:56:10 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7923E1065674 for ; Tue, 11 Mar 2008 09:56:10 +0000 (UTC) (envelope-from peterjeremy@optushome.com.au) Received: from mail15.syd.optusnet.com.au (mail15.syd.optusnet.com.au [211.29.132.196]) by mx1.freebsd.org (Postfix) with ESMTP id EFB6D8FC36 for ; Tue, 11 Mar 2008 09:56:09 +0000 (UTC) (envelope-from peterjeremy@optushome.com.au) Received: from server.vk2pj.dyndns.org (c220-239-20-82.belrs4.nsw.optusnet.com.au [220.239.20.82]) by mail15.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2B9twW8016345 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 11 Mar 2008 20:55:59 +1100 Received: from server.vk2pj.dyndns.org (localhost.vk2pj.dyndns.org [127.0.0.1]) by server.vk2pj.dyndns.org (8.14.2/8.14.1) with ESMTP id m2B9twSQ042761; Tue, 11 Mar 2008 20:55:58 +1100 (EST) (envelope-from peter@server.vk2pj.dyndns.org) Received: (from peter@localhost) by server.vk2pj.dyndns.org (8.14.2/8.14.2/Submit) id m2B9twDi042760; Tue, 11 Mar 2008 20:55:58 +1100 (EST) (envelope-from peter) Date: Tue, 11 Mar 2008 20:55:58 +1100 From: Peter Jeremy To: Jeff Roberson Message-ID:
<20080311095557.GX68971@server.vk2pj.dyndns.org> References: <20080310161115.X1091@desktop> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="213E7WwkW+nU62+Y" Content-Disposition: inline In-Reply-To: <20080310161115.X1091@desktop> X-PGP-Key: http://members.optusnet.com.au/peterjeremy/pubkey.asc User-Agent: Mutt/1.5.17 (2007-11-01) Cc: arch@freebsd.org Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 11 Mar 2008 09:56:10 -0000

On Mon, Mar 10, 2008 at 04:26:17PM -1000, Jeff Roberson wrote:
>In fact, the c version is ~10% faster than the assembly version at a two
>thread sched_yield() test on a single cpu opteron:

That sounds wonderful. How about comparing it on an SMP system? Are there any locking issues that might change that performance difference with lots of CPUs?

>The only appreciable downside is that it lowers the barrier of entry for
>modifying a very sensitive piece of code.

IMHO, this isn't a valid reason. Increasing both the legibility and performance of a very sensitive piece of code is a good thing. Having more people understand the code is also a good thing. FreeBSD already implements a substantial barrier of entry to code modification (commit bits) and I don't believe this should be further raised by unnecessarily hiding critical code in a language that the majority of committers are not expert in.

I've seen relatively few examples of drive-by commits breaking critical code in the past and doubt that converting cpu_switch()/cpu_throw() into C will suddenly make them the target of a "how can I break FreeBSD in an obscure manner" competition. In any case, there is nothing stopping anyone with a src commit bit from mangling the existing assembler implementation.

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement an MTA that is either RFC2821-compliant or matches their claimed behaviour.

From owner-freebsd-arch@FreeBSD.ORG Tue Mar 11 10:02:39 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 219141065672 for ; Tue, 11 Mar 2008 10:02:39 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id C73748FC2B for ; Tue, 11 Mar 2008 10:02:38 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 7B4F217104; Tue, 11 Mar 2008 10:02:37 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2BA2aWi005547; Tue, 11 Mar 2008 10:02:36 GMT (envelope-from phk@critter.freebsd.dk) To: Peter Jeremy From: "Poul-Henning Kamp" In-Reply-To: Your message of "Tue, 11 Mar 2008 20:55:58 +1100."
<20080311095557.GX68971@server.vk2pj.dyndns.org> Date: Tue, 11 Mar 2008 10:02:35 +0000 Message-ID: <5546.1205229755@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: arch@freebsd.org Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 11 Mar 2008 10:02:39 -0000

In message <20080311095557.GX68971@server.vk2pj.dyndns.org>, Peter Jeremy writes:

>>The only appreciable downside is that it lowers the barrier of entry for
>>modifying a very sensitive piece of code.
>
>IMHO, this isn't a valid reason. Increasing the both the legibility
>and performance of a very sensitive piece of code is a good thing.
>Having more people understand the code is also a good thing.

This is not a legal inference, and that's exactly the point Jeff made:

Just because it is written in C doesn't mean people will understand it, it merely means that they will _think_ they understand it.

Nonetheless, we have plenty of /* You ARE supposed to understand this */ C-code already, so I don't see it as an objection to Jeff's patch.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
From owner-freebsd-arch@FreeBSD.ORG Tue Mar 11 20:04:04 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EFFCE1065670 for ; Tue, 11 Mar 2008 20:04:04 +0000 (UTC) (envelope-from peter@wemm.org) Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id B74118FC1A for ; Tue, 11 Mar 2008 20:04:04 +0000 (UTC) (envelope-from peter@wemm.org) Received: by an-out-0708.google.com with SMTP id c14so774180anc.13 for ; Tue, 11 Mar 2008 13:04:04 -0700 (PDT) Received: by 10.100.6.13 with SMTP id 13mr13925043anf.16.1205264106316; Tue, 11 Mar 2008 12:35:06 -0700 (PDT) Received: by 10.100.8.6 with HTTP; Tue, 11 Mar 2008 12:35:06 -0700 (PDT) Message-ID: Date: Tue, 11 Mar 2008 12:35:06 -0700 From: "Peter Wemm" To: "Poul-Henning Kamp" In-Reply-To: <5546.1205229755@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20080311095557.GX68971@server.vk2pj.dyndns.org> <5546.1205229755@critter.freebsd.dk> Cc: arch@freebsd.org Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 11 Mar 2008 20:04:05 -0000 On Tue, Mar 11, 2008 at 3:02 AM, Poul-Henning Kamp wrote: > In message <20080311095557.GX68971@server.vk2pj.dyndns.org>, Peter Jeremy write > s: > > > >>The only appreciable downside is that it lowers the barrier of entry for > >>modifying a very sensitive piece of code. > > > >IMHO, this isn't a valid reason. Increasing the both the legibility > >and performance of a very sensitive piece of code is a good thing. > >Having more people understand the code is also a good thing. 
> > This is not a legal inference, and that's exactly the point Jeff made: > > Just because it is written in C doesn't mean people will understand > it, it merely means that they will _think_ they understand it. I'd like to point out that if I hadn't converted the run queue parts of cpu_switch into C, then KSE might never have happened. At least, not in the form that hit the tree. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 04:13:19 2008 Return-Path: Delivered-To: arch@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 880DF106567F for ; Wed, 12 Mar 2008 04:13:19 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 451BB8FC12; Wed, 12 Mar 2008 04:13:19 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from apple.my.domain (root@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m2C4DGK7003275; Wed, 12 Mar 2008 04:13:17 GMT (envelope-from davidxu@freebsd.org) Message-ID: <47D758AC.2020605@freebsd.org> Date: Wed, 12 Mar 2008 12:14:36 +0800 From: David Xu User-Agent: Thunderbird 2.0.0.9 (X11/20071211) MIME-Version: 1.0 To: Jeff Roberson References: <20080310161115.X1091@desktop> In-Reply-To: <20080310161115.X1091@desktop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 04:13:24 -0000

Jeff Roberson wrote:
> http://people.freebsd.org/~jeff/amd64.diff

This is a good idea. In fact, according to the calling convention, some registers do not need to be saved across a function call, e.g. on i386, eax, edx, and ecx. :-) But gdb may need them to dig out stack variables' values.

Regards,
David Xu

From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 08:25:18 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0B0AA1065672 for ; Wed, 12 Mar 2008 08:25:18 +0000 (UTC) (envelope-from peter@wemm.org) Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.251]) by mx1.freebsd.org (Postfix) with ESMTP id B91498FC22 for ; Wed, 12 Mar 2008 08:25:17 +0000 (UTC) (envelope-from peter@wemm.org) Received: by an-out-0708.google.com with SMTP id c14so852913anc.13 for ; Wed, 12 Mar 2008 01:25:17 -0700 (PDT) Received: by 10.100.94.14 with SMTP id r14mr15661286anb.23.1205310316735; Wed, 12 Mar 2008 01:25:16 -0700 (PDT) Received: by 10.100.8.6 with HTTP; Wed, 12 Mar 2008 01:25:16 -0700 (PDT) Message-ID: Date: Wed, 12 Mar 2008 01:25:16 -0700 From: "Peter Wemm" To: "David Xu" In-Reply-To: <47D758AC.2020605@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> Cc: arch@freebsd.org Subject: Re: amd64 cpu_switch in C.
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 08:25:18 -0000 On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: > Jeff Roberson wrote: > > http://people.freebsd.org/~jeff/amd64.diff > > This is a good idea. In fact, according to calling conversion, some > registers are not needed to be saved across function call, e.g on > i386, eax, edx, and ecx. :-) but gdb may need them to dig out > stack variable's value. Jeff and I have been having a friendly "competition" today. With a UP kernel and INVARIANTS, my initial counter-patch response had nearly double the gain on my machine. (Jeff 7%, mine: 13.5%). I changed to compile kernels the same as he did (no invariants, SMP kernel, but kern.smp.disabled=1). After that, our patch sets were the same again - both at about 10% gain over baseline. I've made a few more changes and am now at 23% improvement over baseline. I'm not confident of testing methodology. More tests are in progress. The good news is that this tuning is finally being done. It should have been done in 2003 though... -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." 
-- Robert Sewell From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 08:51:22 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 51EA51065676; Wed, 12 Mar 2008 08:51:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 07B008FC36; Wed, 12 Mar 2008 08:51:21 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2C8pH2W075419; Wed, 12 Mar 2008 04:51:20 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Tue, 11 Mar 2008 22:52:16 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Peter Wemm In-Reply-To: Message-ID: <20080311224903.V1091@desktop> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, David Xu Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 08:51:22 -0000 On Wed, 12 Mar 2008, Peter Wemm wrote: > On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: >> Jeff Roberson wrote: >> > http://people.freebsd.org/~jeff/amd64.diff >> >> This is a good idea. In fact, according to calling conversion, some >> registers are not needed to be saved across function call, e.g on >> i386, eax, edx, and ecx. :-) but gdb may need them to dig out >> stack variable's value. > > Jeff and I have been having a friendly "competition" today. 
> > With a UP kernel and INVARIANTS, my initial counter-patch response had > nearly double the gain on my machine. (Jeff 7%, mine: 13.5%). > I changed to compile kernels the same as he did (no invariants, SMP > kernel, but kern.smp.disabled=1). After that, our patch sets were the > same again - both at about 10% gain over baseline. > > I've made a few more changes and am now at 23% improvement over baseline. The question is whether we care to have it in C or not. Given a C and assembly version with similar optimizations the assembly version will always win. However, it's easier to write the optimizations in C. > > I'm not confident of testing methodology. More tests are in progress. To keep everyone else up to date; We're using: http://people.freebsd.org/~jeff/yield.c & yield.sh Given two processes and the scheduling methodology for sched_yield() every yield should trigger a context switch to a new process. > > The good news is that this tuning is finally being done. It should > have been done in 2003 though... Yes indeed, better late than never. > > -- > Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com > "All of this is for nothing if we don't go to the stars" - JMS/B5 > "If Java had true garbage collection, most programs would delete > themselves upon execution." 
-- Robert Sewell > From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 09:20:05 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5D6AD1065678 for ; Wed, 12 Mar 2008 09:20:05 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id 0D3758FC23 for ; Wed, 12 Mar 2008 09:20:05 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from root by ciao.gmane.org with local (Exim 4.43) id 1JZN7y-00046e-Io for freebsd-arch@freebsd.org; Wed, 12 Mar 2008 09:20:02 +0000 Received: from 195.208.174.178 ([195.208.174.178]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Mar 2008 09:20:02 +0000 Received: from vadim_nuclight by 195.208.174.178 with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Mar 2008 09:20:02 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Vadim Goncharov Date: Wed, 12 Mar 2008 09:13:22 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 32 Message-ID: References: <86odacc04t.fsf@ds4.des.no> Mime-Version: 1.0 Content-Type: text/plain; charset=koi8-r Content-Transfer-Encoding: 8bit X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 195.208.174.178 X-Comment-To: Dag-Erling =?koi8-r?Q?Sm=F8rgrav?= User-Agent: slrn/0.9.8.1 (FreeBSD) Sender: news Subject: Re: dev.* analogue for interfaces X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 09:20:05 -0000 Hi Dag-Erling Smørgrav! 
On Tue, 19 Feb 2008 18:43:46 +0100; Dag-Erling Smørgrav wrote about 'dev.* analogue for interfaces':

> What I propose is to add a similar sysctl tree for interfaces. It would
> look a little different. For instance, some interfaces (bridge, vlan)
> have parents or children, but most don't.
> Just as it is for devices, creation and destruction of the interface's
> sysctl node and context would be hidden inside if_{attach,detach}() and
> completely transparent to the driver, and there will be an API that
> drivers can use if they want to add their own nodes.
> Since interfaces don't all have parents, the API will include a function
> to specify one for those that do.
> This is *not* intended to replace ifconfig; it is intended for infor-
> mation which isn't available through ifconfig and which it wouldn't be
> natural to place there. For instance, every wlan interface already has
> a sysctl tree under net.wlan.

Will this make it easier to add new features for configuring the per-interface network stack? For example, to avoid bloating ifconfig, a per-interface output DSCP->CoS map could be implemented via a sysctl subtree. Also, I'm not sure, but I think it will help with virtualization, multiple routing tables, VRF and other things that can be bound to an interface.

So I agree with the general idea; only the actual information and its position in the tree should be discussed.

-- 
WBR, Vadim Goncharov.
ICQ#166852181 mailto:vadim_nuclight@mail.ru [Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight] From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 09:44:34 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9B8C4106566B for ; Wed, 12 Mar 2008 09:44:34 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id 235178FC24 for ; Wed, 12 Mar 2008 09:44:34 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1JZNVc-0005Ga-Dz for freebsd-arch@freebsd.org; Wed, 12 Mar 2008 09:44:28 +0000 Received: from 195.208.174.178 ([195.208.174.178]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Mar 2008 09:44:28 +0000 Received: from vadim_nuclight by 195.208.174.178 with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Mar 2008 09:44:28 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Vadim Goncharov Followup-To: gmane.os.freebsd.architechture Date: Wed, 12 Mar 2008 09:44:19 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 41 Message-ID: References: <3bbf2fe10802061700p253e68b8s704deb3e5e4ad086@mail.gmail.com> X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 195.208.174.178 X-Comment-To: Attilio Rao User-Agent: slrn/0.9.8.1 (FreeBSD) Sender: news Cc: freebsd-fs@freebsd.org Subject: Re: [RFC] Remove NTFS kernel support X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 09:44:34 -0000 Hi Attilio Rao! 
On Thu, 7 Feb 2008 02:00:41 +0100; Attilio Rao wrote about '[RFC] Remove NTFS kernel support':

> As exposed by several users, NTFS seems to be broken even before first
> VFS commits happeing around the end of December. Those commits exposed
> some problems about NTFS which are currently under investigation.
> Ultimately, This filesystem is also unmaintained at the moment.
> Speaking with jeff, we agreed on what can be a possible compromise:
> remove the kernel support for NTFS and maybe take care of the FUSE
> implementation.
> What I now propose is a small survey which can shade a light on us
> about what do you think about this idea and its implications:
> - Do you use NTFS?

Yes, occasionally. And I have had scenarios where I needed it without Internet access, FUSE, etc.

> - Are you interested in maintaining it?

Not in the 8.0 timeline :)

> - Do you know a good reason to not use FUSE ntfs implementation? What
> the kernel counter part adds?

Localization: ntfs-3g requires UTF-8 as the only locale. FreeBSD is not good at supporting UTF-8 everywhere (syscons, ufs2, etc.), while the kernel implementation supports recoding to the current locale's codepage. That is valuable for people whose character set is not Latin-1.

> - Do you think axing the kernel support a good idea?

No. FAT32 has been described as the most popular FS for file exchange, but look at the new USB flash devices with 4G+ sizes. People want to store 4G+ files on them (e.g. DVD images), which FAT32 cannot hold, so I have already seen some of them formatted with NTFS instead of FAT32. Having support for them out of the box is good.

-- 
WBR, Vadim Goncharov.
ICQ#166852181 mailto:vadim_nuclight@mail.ru [Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight] From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 10:02:32 2008 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EA1A6106566B for ; Wed, 12 Mar 2008 10:02:32 +0000 (UTC) (envelope-from gary.jennejohn@freenet.de) Received: from mout3.freenet.de (mout3.freenet.de [IPv6:2001:748:100:40::2:5]) by mx1.freebsd.org (Postfix) with ESMTP id 8118D8FC45 for ; Wed, 12 Mar 2008 10:02:32 +0000 (UTC) (envelope-from gary.jennejohn@freenet.de) Received: from [195.4.92.11] (helo=1.mx.freenet.de) by mout3.freenet.de with esmtpa (Exim 4.69) (envelope-from ) id 1JZNn4-00064J-OZ; Wed, 12 Mar 2008 11:02:30 +0100 Received: from x167d.x.pppool.de ([89.59.22.125]:56425 helo=peedub.jennejohn.org) by 1.mx.freenet.de with esmtpa (ID gary.jennejohn@freenet.de) (port 25) (Exim 4.69 #12) id 1JZNn4-0002oL-F1; Wed, 12 Mar 2008 11:02:30 +0100 Date: Wed, 12 Mar 2008 11:02:29 +0100 From: Gary Jennejohn Message-ID: <20080312110229.5aeefc1f@peedub.jennejohn.org> In-Reply-To: <47D758AC.2020605@freebsd.org> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> X-Mailer: Claws Mail 3.3.1 (GTK+ 2.10.14; amd64-portbld-freebsd8.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: gary.jennejohn@freenet.de List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 10:02:33 -0000 On Wed, 12 Mar 2008 12:14:36 +0800 David Xu wrote: > Jeff Roberson wrote: > > http://people.freebsd.org/~jeff/amd64.diff > > This is a good idea. 
In fact, according to calling conversion, some
> registers are not needed to be saved across function call, e.g on
> i386, eax, edx, and ecx. :-) but gdb may need them to dig out
> stack variable's value.
>

I applied this patch yesterday on an AMD64 X2 box and got this panic today after I started X:

Unread portion of the kernel message buffer:
panic: smp_tlb_shootdown: interrupts disabled
cpuid = 0
Uptime: 47s
Physical memory: 3062 MB
Dumping 169 MB: 154 138 122 106 90 74 58 42 26 10

That's all the useful information I have because the backtrace is corrupted. BTW, I'm using SCHED_ULE. Maybe I shouldn't have tried this patch yet since it doesn't seem to be SMP ready.

---
Gary Jennejohn

From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 10:04:05 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7BBEC1065676; Wed, 12 Mar 2008 10:04:05 +0000 (UTC) (envelope-from bsd.luigi@alshome.be) Received: from csmtp1.b-one.net (csmtp1.one.com [195.47.247.21]) by mx1.freebsd.org (Postfix) with ESMTP id 40AD98FC31; Wed, 12 Mar 2008 10:04:05 +0000 (UTC) (envelope-from bsd.luigi@alshome.be) Received: from [128.70.15.100] (85.248-78-194.adsl-static.isp.belgacom.be [194.78.248.85]) by csmtp1.b-one.net (Postfix) with ESMTP id A2D1BE00B9E7; Wed, 12 Mar 2008 10:36:03 +0100 (CET) Message-ID: <47D7A387.5020707@alshome.be> Date: Wed, 12 Mar 2008 10:33:59 +0100 From: Luigi User-Agent: Thunderbird 2.0.0.12 (Windows/20080213) MIME-Version: 1.0 To: freebsd-arch@freebsd.org, freebsd-doc Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Subject: study about Kernels X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: bsd.luigi@alshome.be List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12
Mar 2008 10:04:05 -0000

Hi all,

I am carrying out a study of kernels. The goal is to compare the architecture of open kernels. I would like to examine the BSD, Darwin and Linux kernels.

Who can help me, and where can I find documentation?

Thank you very much for your help.

Luigi

From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 10:06:44 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3979C1065670 for ; Wed, 12 Mar 2008 10:06:44 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id E70A48FC1D for ; Wed, 12 Mar 2008 10:06:43 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1JZNr4-0006B8-4T for freebsd-arch@freebsd.org; Wed, 12 Mar 2008 10:06:38 +0000 Received: from 195.208.174.178 ([195.208.174.178]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Mar 2008 10:06:38 +0000 Received: from vadim_nuclight by 195.208.174.178 with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Wed, 12 Mar 2008 10:06:38 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Vadim Goncharov Date: Wed, 12 Mar 2008 10:06:28 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 23 Message-ID: X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 195.208.174.178 X-Comment-To: All User-Agent: slrn/0.9.8.1 (FreeBSD) Sender: news Subject: sysctl vs procfs X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 10:06:44 -0000

Hi!
While it is a good idea to prefer the more consistent sysctl over procfs, the sysctl interface has some drawbacks. For example, procfs has a good file interface for big things, like the VM map of a process. Imagine 800 megs: the in-kernel sysctl interface locks the value, then copies it to in-kernel memory, and only then can it be copied to userspace. As it stands, that is not suitable as an alternative to procfs for reading big files, and so not a way to get rid of procfs.

So, what about adding sysctl interfaces that allow a userland application to read large buffers in parts, without the intermediate copy? The application, of course, should be aware that the underlying buffer can change while it is being copied, but many of our base utilities (like netstat) already work under these conditions.

Another proposal is about human-readable conversions. We already have parsing/unparsing code for C structs and arrays in netgraph (/sys/netgraph/ng_parse.c). What about porting it to userland (or leaving it in the kernel; this needs some thought) so that users can interpret blobs, including those that are hidden from the user unless sysctl -A is used? This and the previous proposal could improve our KVM interactions, I think.

-- 
WBR, Vadim Goncharov.
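The "read large buffers in parts" idea in this message can be sketched in plain user-space C. Everything below is hypothetical: no such sysctl interface exists in the message or, as far as this sketch assumes, in the tree. struct big_object, read_part() and read_all() are invented names, and the generation counter is one possible way (an assumption added here, not part of the proposal) for a reader to notice that the underlying buffer changed between parts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a large kernel-side object exported to userland. */
struct big_object {
	const char *data;	/* current contents */
	size_t len;		/* current length in bytes */
	unsigned gen;		/* bumped by the producer on every change */
};

/*
 * Copy up to 'want' bytes starting at 'off' into 'dst'.  Returns the
 * number of bytes copied (0 at or past the end) and reports the
 * generation the part was taken from.
 */
static size_t
read_part(const struct big_object *obj, size_t off, char *dst, size_t want,
    unsigned *gen)
{
	size_t n;

	*gen = obj->gen;
	if (off >= obj->len)
		return (0);
	n = obj->len - off;
	if (n > want)
		n = want;
	memcpy(dst, obj->data + off, n);
	return (n);
}

/*
 * Read the whole object in 'chunk'-sized parts.  Returns the total
 * length, or -1 if the object changed between parts, 'chunk' is zero,
 * or 'dst' (capacity 'cap') is too small.
 */
static long
read_all(const struct big_object *obj, char *dst, size_t cap, size_t chunk)
{
	size_t off = 0, n, room;
	unsigned first_gen = obj->gen, gen;
	char probe;

	if (chunk == 0)
		return (-1);
	for (;;) {
		room = cap - off;
		if (room == 0) {
			/* Buffer full: probe one byte to tell "done" from "too small". */
			n = read_part(obj, off, &probe, 1, &gen);
			if (gen != first_gen || n != 0)
				return (-1);
			break;
		}
		n = read_part(obj, off, dst + off, room < chunk ? room : chunk, &gen);
		if (gen != first_gen)
			return (-1);
		if (n == 0)
			break;
		off += n;
	}
	return ((long)off);
}
```

This is a single-threaded simulation only. A caller that gets -1 because the generation moved would simply start over, which matches the remark that utilities like netstat already cope with data changing underneath them.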
ICQ#166852181 mailto:vadim_nuclight@mail.ru [Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight] From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 10:16:30 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7055F1065673 for ; Wed, 12 Mar 2008 10:16:30 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 442EF8FC13 for ; Wed, 12 Mar 2008 10:16:30 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id F15F246B43; Wed, 12 Mar 2008 06:16:29 -0400 (EDT) Date: Wed, 12 Mar 2008 10:16:29 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Luigi In-Reply-To: <47D7A387.5020707@alshome.be> Message-ID: <20080312101301.X29518@fledge.watson.org> References: <47D7A387.5020707@alshome.be> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-doc , freebsd-arch@freebsd.org Subject: Re: study about Kernels X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 10:16:30 -0000 On Wed, 12 Mar 2008, Luigi wrote: > I realize a study about kernels. The goal is to compare architechture of > open kernels. I would like to examin the BSD, Darwin and linux Kernel. > > Who can help me and where can I find documentation ? > > Thank you very much for help. I don't think you'll find much in the way of serious documentation of the differences. If doing a comparative study, I'd encourage you also to take a look at the OpenSolaris kernel. 
However, you can find all the source trees here:

http://fxr.watson.org/

Robert N M Watson
Computer Laboratory
University of Cambridge

From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 10:22:38 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 881A0106566B; Wed, 12 Mar 2008 10:22:38 +0000 (UTC) (envelope-from bsd.luigi@alshome.be) Received: from csmtp3.b-one.net (csmtp3.one.com [195.47.247.213]) by mx1.freebsd.org (Postfix) with ESMTP id 47BF28FC1C; Wed, 12 Mar 2008 10:22:38 +0000 (UTC) (envelope-from bsd.luigi@alshome.be) Received: from [128.70.15.100] (85.248-78-194.adsl-static.isp.belgacom.be [194.78.248.85]) by csmtp3.b-one.net (Postfix) with ESMTP id 8B143100EC8D; Wed, 12 Mar 2008 11:22:36 +0100 (CET) Message-ID: <47D7AE70.3020305@alshome.be> Date: Wed, 12 Mar 2008 11:20:32 +0100 From: Luigi User-Agent: Thunderbird 2.0.0.12 (Windows/20080213) MIME-Version: 1.0 To: Robert Watson , freebsd-arch , freebsd-doc References: <47D7A387.5020707@alshome.be> <20080312101301.X29518@fledge.watson.org> In-Reply-To: <20080312101301.X29518@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Cc: Subject: Re: study about Kernels X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: bsd.luigi@alshome.be List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 10:22:38 -0000

OK, thank you very much. I'll examine the OpenSolaris kernel too.

Robert Watson wrote:
> On Wed, 12 Mar 2008, Luigi wrote:
>
>> I realize a study about kernels. The goal is to compare architechture
>> of open kernels. I would like to examin the BSD, Darwin and linux
>> Kernel.
>>
>> Who can help me and where can I find documentation ?
>>
>> Thank you very much for help.
> > I don't think you'll find much in the way of serious documentation of > the differences. If doing a comparative study, I'd encourage you also > to take a look at the OpenSolaris kernel. However, you can find all > the source trees here: > > http://fxr.watson.org/ > > Robert N M Watson > Computer Laboratory > University of Cambridge > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 11:30:48 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D98241065670 for ; Wed, 12 Mar 2008 11:30:48 +0000 (UTC) (envelope-from peter@wemm.org) Received: from an-out-0708.google.com (an-out-0708.google.com [209.85.132.250]) by mx1.freebsd.org (Postfix) with ESMTP id A0D788FC28 for ; Wed, 12 Mar 2008 11:30:48 +0000 (UTC) (envelope-from peter@wemm.org) Received: by an-out-0708.google.com with SMTP id c14so873755anc.13 for ; Wed, 12 Mar 2008 04:30:47 -0700 (PDT) Received: by 10.100.108.20 with SMTP id g20mr15988543anc.8.1205321447364; Wed, 12 Mar 2008 04:30:47 -0700 (PDT) Received: by 10.100.8.6 with HTTP; Wed, 12 Mar 2008 04:30:47 -0700 (PDT) Message-ID: Date: Wed, 12 Mar 2008 04:30:47 -0700 From: "Peter Wemm" To: "David Xu" In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> Cc: arch@freebsd.org Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 11:30:49 -0000 On Wed, Mar 12, 2008 at 1:25 AM, Peter Wemm wrote: > On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: > > Jeff Roberson wrote: > > > http://people.freebsd.org/~jeff/amd64.diff > > > > This is a good idea. In fact, according to the calling convention, some > > registers do not need to be saved across function calls, e.g. on > > i386, eax, edx, and ecx. :-) but gdb may need them to dig out > > stack variables' values. > > Jeff and I have been having a friendly "competition" today. > > With a UP kernel and INVARIANTS, my initial counter-patch response had > nearly double the gain on my machine. (Jeff 7%, mine: 13.5%). > I changed to compile kernels the same as he did (no invariants, SMP > kernel, but kern.smp.disabled=1). After that, our patch sets were the > same again - both at about 10% gain over baseline. > > I've made a few more changes and am now at 23% improvement over baseline. > > I'm not confident of the testing methodology. More tests are in progress. I've found a couple of pthreads test cases where Jeff's version is a couple of percent slower than the baseline, and mine is either the same or a couple of percent faster. 
His: Difference at 95.0% confidence 0.0921053 +/- 0.0648113 2.6455% +/- 1.86155% (2.6% longer to run the test) Mine: No difference proven at 95.0% confidence Same test, different kernel options: His: No difference proven at 95.0% confidence Mine: Difference at 95.0% confidence -0.2055 +/- 0.204382 -4.06086% +/- 4.03877% But my favourite one is Jeff's preferred test configuration: His: Difference at 95.0% confidence -0.668 +/- 0.047188 -10.9896% +/- 0.776309% Mine: Difference at 95.0% confidence -1.457 +/- 0.0290925 -23.9697% +/- 0.478613% (11% less time vs 24% less time for the test) This stuff directly affects latency with ithreads, kthreads, task queues, etc and should show up on networking benchmarks. I'm moving over to testing in an otherwise virgin cvs tree, since my p4 tree is somewhat polluted. More numbers tomorrow. -- Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." 
-- Robert Sewell From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 12:05:05 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D21491065680 for ; Wed, 12 Mar 2008 12:05:05 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from QMTA08.westchester.pa.mail.comcast.net (qmta08.westchester.pa.mail.comcast.net [76.96.62.80]) by mx1.freebsd.org (Postfix) with ESMTP id 6FC138FC1D for ; Wed, 12 Mar 2008 12:05:04 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from OMTA01.westchester.pa.mail.comcast.net ([76.96.62.11]) by QMTA08.westchester.pa.mail.comcast.net with comcast id 0B911Z0030EZKEL5802v00; Wed, 12 Mar 2008 11:48:24 +0000 Received: from discordia ([24.61.189.203]) by OMTA01.westchester.pa.mail.comcast.net with comcast id 0Bp41Z0054PktZC3M00000; Wed, 12 Mar 2008 11:49:04 +0000 X-Authority-Analysis: v=1.0 c=1 a=Pj4_Y536NHEA:10 a=0RHiSNdv7K1cofAIcD8A:9 a=c4nhsv1tePooTTT5wNglNuvk8YcA:4 a=50e4U0PicR4A:10 a=-rtcXVvtY7498SX-e9UA:9 a=34nm3S-1UQFYRJjl--IA:7 a=XnLui3lyQwI9Xcj7tu-HLo7BWC8A:4 a=NfA2RSpTaHsA:10 Received: by discordia (Postfix, from userid 103) id 1C5351636F9; Wed, 12 Mar 2008 07:49:04 -0400 (EDT) X-Spam-Checker-Version: SpamAssassin 3.1.8-gr1 (2007-02-13) on discordia X-Spam-Level: X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.8-gr1 Received: from [172.20.1.3] (erwin.int.cokane.org [172.20.1.3]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by discordia (Postfix) with ESMTP id 7440D1636F8 for ; Wed, 12 Mar 2008 07:48:53 -0400 (EDT) Message-ID: <47D7C25D.5070908@cokane.org> Date: Wed, 12 Mar 2008 07:45:33 -0400 From: Coleman Kane User-Agent: Thunderbird 2.0.0.12 (X11/20080304) MIME-Version: 1.0 To: arch@FreeBSD.org Content-Type: multipart/mixed; boundary="------------000300090804080100080401" Cc: Subject: SMPTODO: remove timeout(9) from 
ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 12:05:05 -0000 This is a multi-part message in MIME format. --------------000300090804080100080401 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Hi all, I was poking around SMPTODO for some work during an idle night, and I decided to fix the non-MPSAFE use of timeout(9) in ffs_softdep.c, and learn more about the callout_* API in the kernel. I'm attaching a patch of what I've done, which I am running in my current kernel at the moment (and I am using softupdates on a number of filesystems on this SMP machine). Can anyone else try it out / review it / give feedback? -- Coleman Kane --------------000300090804080100080401 Content-Type: text/x-patch; name="ffs_softdep.c-newcallout.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="ffs_softdep.c-newcallout.diff" diff --git a/sys/ufs/ffs/ffs_softdep.c b/sys/ufs/ffs/ffs_softdep.c index 3e8ba26..3e9122f 100644 --- a/sys/ufs/ffs/ffs_softdep.c +++ b/sys/ufs/ffs/ffs_softdep.c @@ -664,7 +664,7 @@ static int maxindirdeps = 50; /* max number of indirdeps before slowdown */ static int tickdelay = 2; /* number of ticks to pause during slowdown */ static int proc_waiting; /* tracks whether we have a timeout posted */ static int *stat_countp; /* statistic to count in proc_waiting timeout */ -static struct callout_handle handle; /* handle on posted proc_waiting timeout */ +static struct callout softdep_callout; static int req_pending; static int req_clear_inodedeps; /* syncer process flush some inodedeps */ #define FLUSH_INODES 1 @@ -1394,6 +1394,9 @@ softdep_initialize() bioops.io_complete = softdep_disk_write_complete; bioops.io_deallocate = softdep_deallocate_dependencies; bioops.io_countdeps = 
softdep_count_dependencies; + + /* Initialize the callout with an mtx. */ + callout_init_mtx(&softdep_callout, &lk, 0); } /* @@ -1403,7 +1406,9 @@ softdep_initialize() void softdep_uninitialize() { - + ACQUIRE_LOCK(&lk); + callout_drain(&softdep_callout); + FREE_LOCK(&lk); hashdestroy(pagedep_hashtbl, M_PAGEDEP, pagedep_hash); hashdestroy(inodedep_hashtbl, M_INODEDEP, inodedep_hash); hashdestroy(newblk_hashtbl, M_NEWBLK, newblk_hash); @@ -5858,8 +5863,16 @@ request_cleanup(mp, resource) * We wait at most tickdelay before proceeding in any case. */ proc_waiting += 1; - if (handle.callout == NULL) - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); + ACQUIRE_LOCK(&lk); + if(callout_active(&softdep_callout) == FALSE) { + /* + should always return zero due to callout_active being called to verify that no active + timeout already exists, which is the case where this would return non-zero (and + callout_active(&softdep_callout) would be TRUE. + */ + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); + } + FREE_LOCK(&lk); msleep((caddr_t)&proc_waiting, &lk, PPAUSE, "softupdate", 0); proc_waiting -= 1; return (1); @@ -5873,15 +5886,17 @@ static void pause_timer(arg) void *arg; { - - ACQUIRE_LOCK(&lk); + /* Implied by callout_* API */ + /* ACQUIRE_LOCK(&lk); */ *stat_countp += 1; wakeup_one(&proc_waiting); - if (proc_waiting > 0) - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); - else - handle.callout = NULL; - FREE_LOCK(&lk); + if (proc_waiting > 0) { + /* We don't care about the return value here. */ + callout_reset(&softdep_callout, tickdelay > 2 ? 
tickdelay : 2, pause_timer, 0); + } else { + callout_deactivate(&softdep_callout); + } + /* FREE_LOCK(&lk); */ } /* --------------000300090804080100080401-- From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 13:53:54 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D9DE7106567D for ; Wed, 12 Mar 2008 13:53:54 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id C38F68FC14 for ; Wed, 12 Mar 2008 13:53:54 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from zion.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by elvis.mu.org (Postfix) with ESMTP id C89881A4D8B; Wed, 12 Mar 2008 06:53:01 -0700 (PDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Wed, 12 Mar 2008 09:45:28 -0400 User-Agent: KMail/1.9.7 References: <47D7C25D.5070908@cokane.org> In-Reply-To: <47D7C25D.5070908@cokane.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803120945.29018.jhb@freebsd.org> Cc: Coleman Kane Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 13:53:55 -0000 On Wednesday 12 March 2008 07:45:33 am Coleman Kane wrote: > Hi all, > > I was poking around SMPTODO for some work during an idle night, and I > decided to fix the non-MPSAFE use of timeout(9) in ffs_softdep.c, and > learn more about the callout_* API in the kernel. I'm attaching a patch > of what I've done, which I am running in my current kernel at the moment > (and I am using softupdates on a number of filesystems on this SMP > machine). 
> > Can anyone else try it out / review it / give feedback? > > @@ -1403,7 +1406,9 @@ softdep_initialize() > void > softdep_uninitialize() > { > - > + ACQUIRE_LOCK(&lk); > + callout_drain(&softdep_callout); > + FREE_LOCK(&lk); > hashdestroy(pagedep_hashtbl, M_PAGEDEP, pagedep_hash); > hashdestroy(inodedep_hashtbl, M_INODEDEP, inodedep_hash); > hashdestroy(newblk_hashtbl, M_NEWBLK, newblk_hash); Don't hold the mutex over a drain and leave the blank line at the start of the function (style(9)). > @@ -5858,8 +5863,16 @@ request_cleanup(mp, resource) > * We wait at most tickdelay before proceeding in any case. > */ > proc_waiting += 1; > - if (handle.callout == NULL) > - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); > + ACQUIRE_LOCK(&lk); > + if(callout_active(&softdep_callout) == FALSE) { > + /* > + should always return zero due to callout_active being called to verify that no active > + timeout already exists, which is the case where this would return non-zero (and > + callout_active(&softdep_callout) would be TRUE. > + */ > + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); > + } > + FREE_LOCK(&lk); > msleep((caddr_t)&proc_waiting, &lk, PPAUSE, "softupdate", 0); > proc_waiting -= 1; > return (1); The lock is already held, so no need to lock it again. Also, space after 'if'. I'm not sure the new comment is needed as the reader can already infer that from the callout_active() test. Also, I think you really want callout_pending() rather than callout_active() if pause_timer() executes normally without rescheduling itself the callout will still be marked active and the next time this function is invoked it won't schedule the callout. > @@ -5873,15 +5886,17 @@ static void > pause_timer(arg) > void *arg; > { > - > - ACQUIRE_LOCK(&lk); > + /* Implied by callout_* API */ > + /* ACQUIRE_LOCK(&lk); */ > *stat_countp += 1; > wakeup_one(&proc_waiting); > - if (proc_waiting > 0) > - handle = timeout(pause_timer, 0, tickdelay > 2 ? 
tickdelay : 2); > - else > - handle.callout = NULL; > - FREE_LOCK(&lk); > + if (proc_waiting > 0) { > + /* We don't care about the return value here. */ > + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); > + } else { > + callout_deactivate(&softdep_callout); > + } > + /* FREE_LOCK(&lk); */ > } No need to use callout_deactivate() here, the callout is already deactivated when it is invoked. I think you can also leave out the comment about the return value as the vast majority of places in the kernel that call callout_reset() ignore the return value, so it is a common practice. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 14:30:06 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 67C42106567B for ; Wed, 12 Mar 2008 14:30:06 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from QMTA04.westchester.pa.mail.comcast.net (qmta04.westchester.pa.mail.comcast.net [76.96.62.40]) by mx1.freebsd.org (Postfix) with ESMTP id 0544D8FC19 for ; Wed, 12 Mar 2008 14:30:05 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from OMTA04.westchester.pa.mail.comcast.net ([76.96.62.35]) by QMTA04.westchester.pa.mail.comcast.net with comcast id 08yg1Z00D0ldTLk540US00; Wed, 12 Mar 2008 14:19:09 +0000 Received: from discordia ([24.61.189.203]) by OMTA04.westchester.pa.mail.comcast.net with comcast id 0EL31Z00R4PktZC3Q00000; Wed, 12 Mar 2008 14:20:04 +0000 X-Authority-Analysis: v=1.0 c=1 a=yWIViUiLWPYA:10 a=CUMa_SbteGx0BZgX1pgA:9 a=HrSz8c6paNKFlKRnIwkA:7 a=LGVNa8fhoMo1oVVE5VAFUJlUgBwA:4 a=zUBsD6tbDSsA:10 a=-rtcXVvtY7498SX-e9UA:9 a=lEhTTu5oXNJMJ4XxhzkA:7 a=DneVQdTm6Pk93g1E2aIrDb4HF0cA:4 a=NfA2RSpTaHsA:10 Received: by discordia (Postfix, from userid 103) id BEB961636F9; Wed, 12 Mar 2008 10:20:03 -0400 (EDT) X-Spam-Checker-Version: SpamAssassin 3.1.8-gr1 (2007-02-13) on discordia X-Spam-Level: X-Spam-Status: No, 
score=-4.4 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.8-gr1 Received: from [172.20.1.3] (erwin.int.cokane.org [172.20.1.3]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by discordia (Postfix) with ESMTP id F31B31636F8; Wed, 12 Mar 2008 10:19:51 -0400 (EDT) Message-ID: <47D7E5BF.2060102@cokane.org> Date: Wed, 12 Mar 2008 10:16:31 -0400 From: Coleman Kane User-Agent: Thunderbird 2.0.0.12 (X11/20080304) MIME-Version: 1.0 To: John Baldwin References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> In-Reply-To: <200803120945.29018.jhb@freebsd.org> Content-Type: multipart/mixed; boundary="------------080002040909020109040007" Cc: freebsd-arch@freebsd.org Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 14:30:06 -0000 This is a multi-part message in MIME format. --------------080002040909020109040007 Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit John Baldwin wrote: > On Wednesday 12 March 2008 07:45:33 am Coleman Kane wrote: > >> Hi all, >> >> I was poking around SMPTODO for some work during an idle night, and I >> decided to fix the non-MPSAFE use of timeout(9) in ffs_softdep.c, and >> learn more about the callout_* API in the kernel. I'm attaching a patch >> of what I've done, which I am running in my current kernel at the moment >> (and I am using softupdates on a number of filesystems on this SMP >> machine). >> >> Can anyone else try it out / review it / give feedback? 
>> >> @@ -1403,7 +1406,9 @@ softdep_initialize() >> void >> softdep_uninitialize() >> { >> - >> + ACQUIRE_LOCK(&lk); >> + callout_drain(&softdep_callout); >> + FREE_LOCK(&lk); >> hashdestroy(pagedep_hashtbl, M_PAGEDEP, pagedep_hash); >> hashdestroy(inodedep_hashtbl, M_INODEDEP, inodedep_hash); >> hashdestroy(newblk_hashtbl, M_NEWBLK, newblk_hash); >> > > Don't hold the mutex over a drain and leave the blank line at the start of the > function (style(9)). > Thanks. This point was not completely clear from the man page (whether to hold the lock around it or not). I went looking around for examples of this... Had I looked further, I would have found my answer in bge_detach of if_bge.c. > >> @@ -5858,8 +5863,16 @@ request_cleanup(mp, resource) >> * We wait at most tickdelay before proceeding in any case. >> */ >> proc_waiting += 1; >> - if (handle.callout == NULL) >> - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); >> + ACQUIRE_LOCK(&lk); >> + if(callout_active(&softdep_callout) == FALSE) { >> + /* >> + should always return zero due to callout_active being called to verify that no active >> + timeout already exists, which is the case where this would return non-zero (and >> + callout_active(&softdep_callout) would be TRUE. >> + */ >> + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); >> + } >> + FREE_LOCK(&lk); >> msleep((caddr_t)&proc_waiting, &lk, PPAUSE, "softupdate", 0); >> proc_waiting -= 1; >> return (1); >> > > The lock is already held, so no need to lock it again. Also, space after > 'if'. I'm not sure the new comment is needed as the reader can already > infer that from the callout_active() test. Also, I think you really want > callout_pending() rather than callout_active() if pause_timer() executes > normally without rescheduling itself the callout will still be marked > active and the next time this function is invoked it won't schedule the > callout. > Thanks, I see this now. 
Every call to request_cleanup seems to already acquire lk. This solves the use of callout_deactivate, below. > >> @@ -5873,15 +5886,17 @@ static void >> pause_timer(arg) >> void *arg; >> { >> - >> - ACQUIRE_LOCK(&lk); >> + /* Implied by callout_* API */ >> + /* ACQUIRE_LOCK(&lk); */ >> *stat_countp += 1; >> wakeup_one(&proc_waiting); >> - if (proc_waiting > 0) >> - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); >> - else >> - handle.callout = NULL; >> - FREE_LOCK(&lk); >> + if (proc_waiting > 0) { >> + /* We don't care about the return value here. */ >> + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); >> + } else { >> + callout_deactivate(&softdep_callout); >> + } >> + /* FREE_LOCK(&lk); */ >> } >> > > No need to use callout_deactivate() here, the callout is already deactivated > when it is invoked. I think you can also leave out the comment about the > return value as the vast majority of places in the kernel that call > callout_reset() ignore the return value, so it is a common practice. > Technically, the callout is no longer considered "pending". According to the man page, it isn't deactivated at the return of pause_timer. Nonetheless, the pointer above about s/callout_active/callout_pending/ makes this check here unnecessary, and I'm sure that's what you're meaning by this comment. I am attaching the revised patch. 
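For anyone following along, the callout_pending() vs. callout_active() point can be modelled in userspace. This is only a sketch under stated assumptions: struct model_callout, the MODEL_* flags and every model_* function are made up for illustration and are not the kernel's implementation; they just mimic the flag behaviour the callout man page describes (reset sets both bits, expiry clears only "pending").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the two callout state bits.  None of
 * these names exist in the kernel; they only mimic the documented
 * behaviour of the real flags.
 */
#define	MODEL_ACTIVE	0x1	/* set on reset; cleared only by stop/deactivate */
#define	MODEL_PENDING	0x2	/* set on reset; cleared just before the handler runs */

struct model_callout {
	int	flags;
	void	(*func)(void *);
	void	*arg;
};

/* Arm the callout: both bits become set, as with callout_reset(). */
void
model_reset(struct model_callout *c, void (*func)(void *), void *arg)
{
	c->func = func;
	c->arg = arg;
	c->flags |= MODEL_ACTIVE | MODEL_PENDING;
}

/* Expire the callout: PENDING is cleared, ACTIVE is left untouched. */
void
model_fire(struct model_callout *c)
{
	if ((c->flags & MODEL_PENDING) == 0)
		return;
	assert(c->func != NULL);
	c->flags &= ~MODEL_PENDING;
	c->func(c->arg);
}

bool
model_active(const struct model_callout *c)
{
	return ((c->flags & MODEL_ACTIVE) != 0);
}

bool
model_pending(const struct model_callout *c)
{
	return ((c->flags & MODEL_PENDING) != 0);
}

/* A handler that, like pause_timer() with no waiters, does not re-arm. */
void
model_handler(void *arg)
{
	int *count = arg;

	(*count)++;
}

/* Guard in the style of the first patch: arm only if not active. */
bool
model_arm_unless_active(struct model_callout *c, int *count)
{
	if (model_active(c))
		return (false);
	model_reset(c, model_handler, count);
	return (true);
}

/* Guard in the style of the revised patch: arm only if not pending. */
bool
model_arm_unless_pending(struct model_callout *c, int *count)
{
	if (model_pending(c))
		return (false);
	model_reset(c, model_handler, count);
	return (true);
}
```

In this model, once the callout has been armed and fired without re-arming itself, model_arm_unless_active() refuses to arm it again because the "active" bit is still set, while model_arm_unless_pending() arms it as intended. That is the failure mode John pointed out for the first version of the patch.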
-- Coleman Kane --------------080002040909020109040007 Content-Type: text/x-patch; name="ffs_softdep.c-newcallout2.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="ffs_softdep.c-newcallout2.diff" diff --git a/sys/ufs/ffs/ffs_softdep.c b/sys/ufs/ffs/ffs_softdep.c index 3e8ba26..d5c8536 100644 --- a/sys/ufs/ffs/ffs_softdep.c +++ b/sys/ufs/ffs/ffs_softdep.c @@ -664,7 +664,7 @@ static int maxindirdeps = 50; /* max number of indirdeps before slowdown */ static int tickdelay = 2; /* number of ticks to pause during slowdown */ static int proc_waiting; /* tracks whether we have a timeout posted */ static int *stat_countp; /* statistic to count in proc_waiting timeout */ -static struct callout_handle handle; /* handle on posted proc_waiting timeout */ +static struct callout softdep_callout; static int req_pending; static int req_clear_inodedeps; /* syncer process flush some inodedeps */ #define FLUSH_INODES 1 @@ -1394,6 +1394,9 @@ softdep_initialize() bioops.io_complete = softdep_disk_write_complete; bioops.io_deallocate = softdep_deallocate_dependencies; bioops.io_countdeps = softdep_count_dependencies; + + /* Initialize the callout with an mtx. */ + callout_init_mtx(&softdep_callout, &lk, 0); } /* @@ -1404,6 +1407,7 @@ void softdep_uninitialize() { + callout_drain(&softdep_callout); hashdestroy(pagedep_hashtbl, M_PAGEDEP, pagedep_hash); hashdestroy(inodedep_hashtbl, M_INODEDEP, inodedep_hash); hashdestroy(newblk_hashtbl, M_NEWBLK, newblk_hash); @@ -5858,8 +5862,9 @@ request_cleanup(mp, resource) * We wait at most tickdelay before proceeding in any case. */ proc_waiting += 1; - if (handle.callout == NULL) - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); + if (callout_pending(&softdep_callout) == FALSE) { + callout_reset(&softdep_callout, tickdelay > 2 ? 
tickdelay : 2, pause_timer, 0); + } msleep((caddr_t)&proc_waiting, &lk, PPAUSE, "softupdate", 0); proc_waiting -= 1; return (1); @@ -5874,14 +5879,12 @@ pause_timer(arg) void *arg; { - ACQUIRE_LOCK(&lk); + /* The callout_ API has acquired mtx and will hold it around this function call. */ *stat_countp += 1; wakeup_one(&proc_waiting); - if (proc_waiting > 0) - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); - else - handle.callout = NULL; - FREE_LOCK(&lk); + if (proc_waiting > 0) { + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); + } } /* --------------080002040909020109040007-- From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 15:10:33 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F03071065679 for ; Wed, 12 Mar 2008 15:10:33 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 7899E8FC1E for ; Wed, 12 Mar 2008 15:10:32 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235191287-1834499 for multiple; Wed, 12 Mar 2008 11:08:33 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2CFAMTJ015215; Wed, 12 Mar 2008 11:10:23 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: Coleman Kane Date: Wed, 12 Mar 2008 10:58:03 -0400 User-Agent: KMail/1.9.7 References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> In-Reply-To: <47D7E5BF.2060102@cokane.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803121058.04096.jhb@freebsd.org> X-Greylist: 
Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Wed, 12 Mar 2008 11:10:23 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6206/Wed Mar 12 07:16:10 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 15:10:34 -0000 On Wednesday 12 March 2008 10:16:31 am Coleman Kane wrote: > I am attaching the revised patch. Looks good. I would perhaps not add the extra {}'s around the single-line if clauses as it slightly obfuscates the diff (style(9) actually suggests no {}'s in that case, but I think in practice our sources have a mixture of both). 
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 15:28:06 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 863D4106566B for ; Wed, 12 Mar 2008 15:28:06 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from QMTA09.emeryville.ca.mail.comcast.net (qmta09.emeryville.ca.mail.comcast.net [76.96.30.96]) by mx1.freebsd.org (Postfix) with ESMTP id 65DA88FC20 for ; Wed, 12 Mar 2008 15:28:05 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from OMTA12.emeryville.ca.mail.comcast.net ([76.96.30.44]) by QMTA09.emeryville.ca.mail.comcast.net with comcast id 0EVA1Z0010x6nqcA902u00; Wed, 12 Mar 2008 15:11:13 +0000 Received: from discordia ([24.61.189.203]) by OMTA12.emeryville.ca.mail.comcast.net with comcast id 0FC21Z00H4PktZC8Y00000; Wed, 12 Mar 2008 15:12:03 +0000 X-Authority-Analysis: v=1.0 c=1 a=yWIViUiLWPYA:10 a=i5KzmmKIZ9Ex48Ii2mMA:9 a=QSIy1mGstWZsU_lFX9-BSgP05_sA:4 a=oltf0pfCdT4A:10 a=-rtcXVvtY7498SX-e9UA:9 a=lEhTTu5oXNJMJ4XxhzkA:7 a=n-E81eKMzrkRIg9F2d6QuEMDThgA:4 a=NfA2RSpTaHsA:10 Received: by discordia (Postfix, from userid 103) id 429AC1636FA; Wed, 12 Mar 2008 11:12:02 -0400 (EDT) X-Spam-Checker-Version: SpamAssassin 3.1.8-gr1 (2007-02-13) on discordia X-Spam-Level: X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.8-gr1 Received: from [172.20.1.3] (erwin.int.cokane.org [172.20.1.3]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by discordia (Postfix) with ESMTP id D1E001636F8; Wed, 12 Mar 2008 11:11:50 -0400 (EDT) Message-ID: <47D7F1EC.6040802@cokane.org> Date: Wed, 12 Mar 2008 11:08:28 -0400 From: Coleman Kane User-Agent: Thunderbird 2.0.0.12 (X11/20080304) MIME-Version: 1.0 To: obrien@FreeBSD.org References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> 
<20080312145734.GB26812@dragon.NUXI.org> In-Reply-To: <20080312145734.GB26812@dragon.NUXI.org> Content-Type: multipart/mixed; boundary="------------020807070709030700060709" Cc: arch@FreeBSD.org, "jh >> John Baldwin" Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 15:28:06 -0000 This is a multi-part message in MIME format. --------------020807070709030700060709 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit David O'Brien wrote: > On Wed, Mar 12, 2008 at 10:16:31AM -0400, Coleman Kane wrote: > >> I am attaching the revised patch. >> > .. > >> + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); >> > > Wrap long line. > > >> + /* The callout_ API has acquired mtx and will hold it around this function call. */ >> > > Ditto. > > >> + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, pause_timer, 0); >> > > Ditto. > Third try at the patch, properly adjusting my vim tabs to 8 spaces as they should be so that I can follow style(9). 
-- Coleman Kane --------------020807070709030700060709 Content-Type: text/x-patch; name="ffs_softdep.c-newcallout3.diff" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="ffs_softdep.c-newcallout3.diff" diff --git a/sys/ufs/ffs/ffs_softdep.c b/sys/ufs/ffs/ffs_softdep.c index 3e8ba26..457ed49 100644 --- a/sys/ufs/ffs/ffs_softdep.c +++ b/sys/ufs/ffs/ffs_softdep.c @@ -664,7 +664,7 @@ static int maxindirdeps = 50; /* max number of indirdeps before slowdown */ static int tickdelay = 2; /* number of ticks to pause during slowdown */ static int proc_waiting; /* tracks whether we have a timeout posted */ static int *stat_countp; /* statistic to count in proc_waiting timeout */ -static struct callout_handle handle; /* handle on posted proc_waiting timeout */ +static struct callout softdep_callout; static int req_pending; static int req_clear_inodedeps; /* syncer process flush some inodedeps */ #define FLUSH_INODES 1 @@ -1394,6 +1394,9 @@ softdep_initialize() bioops.io_complete = softdep_disk_write_complete; bioops.io_deallocate = softdep_deallocate_dependencies; bioops.io_countdeps = softdep_count_dependencies; + + /* Initialize the callout with an mtx. */ + callout_init_mtx(&softdep_callout, &lk, 0); } /* @@ -1404,6 +1407,7 @@ void softdep_uninitialize() { + callout_drain(&softdep_callout); hashdestroy(pagedep_hashtbl, M_PAGEDEP, pagedep_hash); hashdestroy(inodedep_hashtbl, M_INODEDEP, inodedep_hash); hashdestroy(newblk_hashtbl, M_NEWBLK, newblk_hash); @@ -5858,8 +5862,10 @@ request_cleanup(mp, resource) * We wait at most tickdelay before proceeding in any case. */ proc_waiting += 1; - if (handle.callout == NULL) - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); + if (callout_pending(&softdep_callout) == FALSE) { + callout_reset(&softdep_callout, tickdelay > 2 ? 
tickdelay : 2, + pause_timer, 0); + } msleep((caddr_t)&proc_waiting, &lk, PPAUSE, "softupdate", 0); proc_waiting -= 1; return (1); @@ -5874,14 +5880,16 @@ pause_timer(arg) void *arg; { - ACQUIRE_LOCK(&lk); + /* + * The callout_ API has acquired mtx and will hold it around this + * function call. + */ *stat_countp += 1; wakeup_one(&proc_waiting); - if (proc_waiting > 0) - handle = timeout(pause_timer, 0, tickdelay > 2 ? tickdelay : 2); - else - handle.callout = NULL; - FREE_LOCK(&lk); + if (proc_waiting > 0) { + callout_reset(&softdep_callout, tickdelay > 2 ? tickdelay : 2, + pause_timer, 0); + } } /* --------------020807070709030700060709-- From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 22:23:12 2008 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B143A1065673 for ; Wed, 12 Mar 2008 22:23:12 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 7EB7A8FC14 for ; Wed, 12 Mar 2008 22:23:12 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2CMN5VK077024; Wed, 12 Mar 2008 18:23:11 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 12 Mar 2008 12:24:07 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Gary Jennejohn In-Reply-To: <20080312110229.5aeefc1f@peedub.jennejohn.org> Message-ID: <20080312122300.Y1091@desktop> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080312110229.5aeefc1f@peedub.jennejohn.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 22:23:12 -0000 On Wed, 12 Mar 2008, Gary Jennejohn wrote: > On Wed, 12 Mar 2008 12:14:36 +0800 > David Xu wrote: > >> Jeff Roberson wrote: >>> http://people.freebsd.org/~jeff/amd64.diff >> >> This is a good idea. In fact, according to calling convention, some >> registers are not needed to be saved across function call, e.g. on >> i386, eax, edx, and ecx. :-) but gdb may need them to dig out >> stack variable's value. >> > > I applied this patch yesterday on an AMD64 X2 box and got this panic > today after I started X: > > Unread portion of the kernel message buffer: > panic: smp_tlb_shootdown: interrupts disabled > cpuid = 0 > Uptime: 47s > Physical memory: 3062 MB > Dumping 169 MB: 154 138 122 106 90 74 58 42 26 10 > > That's all the useful information which I have because the back trace > is corrupted. > > BTW I'm using SCHED_ULE. > > Maybe I shouldn't have tried this patch yet since it doesn't seem to be SMP > ready. Thanks for testing. I just ran into that panic myself. I don't think it's an SMP problem. In general things on arch@ are sometimes more experimental than things we mail to current@ asking for people to test.
Thanks, Jeff > > --- > Gary Jennejohn > From owner-freebsd-arch@FreeBSD.ORG Wed Mar 12 23:51:27 2008 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 07C341065675 for ; Wed, 12 Mar 2008 23:51:27 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from mail.farley.org (farley.org [67.64.95.201]) by mx1.freebsd.org (Postfix) with ESMTP id A0FE28FC24 for ; Wed, 12 Mar 2008 23:51:26 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from thor.farley.org (thor.farley.org [192.168.1.5]) by mail.farley.org (8.14.2/8.14.2) with ESMTP id m2CNOL5d035933; Wed, 12 Mar 2008 18:24:21 -0500 (CDT) (envelope-from scf@FreeBSD.org) Date: Wed, 12 Mar 2008 18:24:21 -0500 (CDT) From: "Sean C. Farley" To: Coleman Kane In-Reply-To: <47D7F1EC.6040802@cokane.org> Message-ID: References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <20080312145734.GB26812@dragon.NUXI.org> <47D7F1EC.6040802@cokane.org> User-Agent: Alpine 1.00 (BSF 882 2007-12-20) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on mail.farley.org Cc: arch@FreeBSD.org Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Mar 2008 23:51:27 -0000 On Wed, 12 Mar 2008, Coleman Kane wrote: > Third try at the patch, properly adjusting my vim tabs to 8 spaces as > they should be so that I can follow style(9). I wrote a function[1] last year to configure vim to follow style(9). Just run ':call FreeBSD_Style()' while editing a file. Sean 1. 
http://www.farley.org/freebsd/tmp/VIM/FreeBSD.vim -- scf@FreeBSD.org From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 01:41:24 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1F5ED106566C for ; Thu, 13 Mar 2008 01:41:24 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from QMTA02.emeryville.ca.mail.comcast.net (qmta02.emeryville.ca.mail.comcast.net [76.96.30.24]) by mx1.freebsd.org (Postfix) with ESMTP id 0C54E8FC24 for ; Thu, 13 Mar 2008 01:41:24 +0000 (UTC) (envelope-from cokane@cokane.org) Received: from OMTA10.emeryville.ca.mail.comcast.net ([76.96.30.28]) by QMTA02.emeryville.ca.mail.comcast.net with comcast id 0Ppk1Z0020cQ2SLA205y00; Thu, 13 Mar 2008 01:40:35 +0000 Received: from discordia ([24.61.189.203]) by OMTA10.emeryville.ca.mail.comcast.net with comcast id 0RhN1Z0064PktZC8W00000; Thu, 13 Mar 2008 01:41:23 +0000 X-Authority-Analysis: v=1.0 c=1 a=yWIViUiLWPYA:10 a=cgcLz6ojAAAA:8 a=WsdIhoTgK6fTm0dQQWAA:9 a=1IvbqGLK4eI6TWeiJ0fnpeGqzLUA:4 a=BDXKcin-EtgA:10 Received: by discordia (Postfix, from userid 103) id 2C0AF1636FA; Wed, 12 Mar 2008 21:41:22 -0400 (EDT) X-Spam-Checker-Version: SpamAssassin 3.1.8-gr1 (2007-02-13) on discordia X-Spam-Level: X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.8-gr1 Received: from [172.20.1.3] (erwin.int.cokane.org [172.20.1.3]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by discordia (Postfix) with ESMTP id 3747B1636F8; Wed, 12 Mar 2008 21:41:05 -0400 (EDT) Message-ID: <47D88568.7000105@cokane.org> Date: Wed, 12 Mar 2008 21:37:44 -0400 From: Coleman Kane User-Agent: Thunderbird 2.0.0.12 (X11/20080304) MIME-Version: 1.0 To: "Sean C. 
Farley" References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <20080312145734.GB26812@dragon.NUXI.org> <47D7F1EC.6040802@cokane.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 01:41:24 -0000 Sean C. Farley wrote: > On Wed, 12 Mar 2008, Coleman Kane wrote: > >> Third try at the patch, properly adjusting my vim tabs to 8 spaces as >> they should be so that I can follow style(9). > > I wrote a function[1] last year to configure vim to follow style(9). > Just run ':call FreeBSD_Style()' while editing a file. > > Sean > 1. http://www.farley.org/freebsd/tmp/VIM/FreeBSD.vim Rock on. This should be in the committers' guide or something. -- Coleman From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 03:39:01 2008 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 87AB41065670 for ; Thu, 13 Mar 2008 03:39:01 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from mail.farley.org (farley.org [67.64.95.201]) by mx1.freebsd.org (Postfix) with ESMTP id 642EA8FC1F for ; Thu, 13 Mar 2008 03:39:01 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from thor.farley.org (thor.farley.org [192.168.1.5]) by mail.farley.org (8.14.2/8.14.2) with ESMTP id m2D3cwwf040244 for ; Wed, 12 Mar 2008 22:38:58 -0500 (CDT) (envelope-from scf@FreeBSD.org) Date: Wed, 12 Mar 2008 22:38:58 -0500 (CDT) From: "Sean C. 
Farley" To: arch@FreeBSD.org Message-ID: User-Agent: Alpine 1.00 (BSF 882 2007-12-20) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on mail.farley.org Cc: Subject: [RFC] struct grp related additions to libutil X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 03:39:01 -0000 I have written four functions related to struct grp processing that I would like to add to libutil. They are modeled in part after similar calls in libutil/pw_util.c. The calls are: 1. int gr_equal(const struct group *gr1, const struct group *gr2) Compares the values of two group structures. It does a thorough, yet unoptimized comparison of all the members regardless of order. 2. char *gr_make(const struct group *gr) Creates a string (as would exist within /etc/group) from a group structure. 3. struct group *gr_dup(const struct group *gr) Duplicates a group structure. The returned value is a contiguous block of memory. 4. struct group *gr_scan(const char *line) Creates a group structure from a string (as produced by gr_make()). Questions: 1. What requirements are there for making additions/changes to libutil? 2. Will there be any issues with having gr_equal() return a bool? Currently, it is returning an int. I made patches with regression tests for both HEAD[1] and RELENG_7[2]. Sean 1. http://www.farley.org/freebsd/tmp/gr_util/libutil-grp-HEAD.patch 2.
http://www.farley.org/freebsd/tmp/gr_util/libutil-grp-RELENG_7.patch -- scf@FreeBSD.org From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 06:22:52 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 9BEC6106566C; Thu, 13 Mar 2008 06:22:52 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx07.syd.optusnet.com.au (fallbackmx07.syd.optusnet.com.au [211.29.132.9]) by mx1.freebsd.org (Postfix) with ESMTP id E4B8C8FC25; Thu, 13 Mar 2008 06:22:46 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail18.syd.optusnet.com.au (mail18.syd.optusnet.com.au [211.29.132.199]) by fallbackmx07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2D2m0Qi018588; Thu, 13 Mar 2008 13:48:00 +1100 Received: from c220-239-252-11.carlnfd3.nsw.optusnet.com.au (c220-239-252-11.carlnfd3.nsw.optusnet.com.au [220.239.252.11]) by mail18.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2D2lut1025522 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 13 Mar 2008 13:47:58 +1100 Date: Thu, 13 Mar 2008 13:47:56 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Peter Wemm In-Reply-To: Message-ID: <20080313124213.J31200@delplex.bde.org> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, David Xu Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 06:22:52 -0000 On Wed, 12 Mar 2008, Peter Wemm wrote: > On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: >> Jeff Roberson wrote: >> > http://people.freebsd.org/~jeff/amd64.diff >> >> This is a good idea. 
I wouldn't have expected it to make much difference. On i386 UP,
cpu_switch() normally executes only 48 instructions for in-kernel
context switches in my version of 5.2 and only 61 instructions in
-current. ~5.2 differs from 5.2 here only in not having to
switch %eflags. This saves 4 instructions but much more in cycles,
especially in P4 where accesses to %eflags are very slow. 5.2 would
take 52 instructions, and -current has bloated by 9 instructions
relative to 5.2.

In-kernel switches are not a very typical case since they don't load
%cr3. The 50-60 instructions might take as few as 20 cycles when
pipelined through 3 ALUs, but they are only moderately parallelizable
so would take more like 50-60 cycles on an Athlon. The only very slow
instructions in them for the usual in-kernel case are the loads of
%eflags and %gs. At least the latter is easy to optimize away, but
the former is associated with spin locking hard-disabling interrupts.
For userland context switches, there is also an ltr in the usual path
of execution. But 100 or so cycles for the simple instructions is
noise compared with the cost of the TLB flush and other cache misses
caused by loading %cr3 for userland context switches. Userland code
that does useful work will do more than sched_yield() so it will suffer
more from cache misses.

Layers above cpu_switch() have become very bloated and make a full
context switch take several hundred cycles for the simple instructions
on machines where the simple instructions in cpu_switch() take only
100. Their overhead may almost be significant relative to the cache
misses. However, this is another reason why the speed of the simple
instructions in cpu_switch() doesn't matter.

>> In fact, according to calling convention, some
>> registers are not needed to be saved across function call, e.g. on
>> i386, eax, edx, and ecx. :-) but gdb may need them to dig out
>> stack variable's value.

The asm code already saves only call-saved registers for both i386 and
amd64.
It saves call-saved registers even when it apparently doesn't use them (lots more of these on amd64, while on i386 it uses more call-saved registers than it needs to, apparently since this is free after saving all call-saved registers). I think saving more than is needed is the result of confusion about what needs to be saved and/or what is needed for debugging. > Jeff and I have been having a friendly "competition" today. > > With a UP kernel and INVARIANTS, my initial counter-patch response had > nearly double the gain on my machine. (Jeff 7%, mine: 13.5%). > I changed to compile kernels the same as he did (no invariants, SMP > kernel, but kern.smp.disabled=1). After that, our patch sets were the > same again - both at about 10% gain over baseline. > > I've made a few more changes and am now at 23% improvement over baseline. > > I'm not confident of testing methodology. More tests are in progress. > > The good news is that this tuning is finally being done. It should > have been done in 2003 though... How is this possible with (according to my theory) most of the context switch cost being for %cr3 and upper layers? Unchanged amd64 has only a few more costs than i386. Mainly 3 unconditional wrmsr's and 2 unconditional rdmsr's for managing gsbase and fsbase. I thought that these were hard to avoid and anyway not nearly as expensive as %cr3 loads. 
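A sketch of one way the unconditional MSR writes could be avoided, along the lines of what is described later in this thread: compare the outgoing and incoming threads' base values and skip the write when they match. It runs in userland, so the MSR and the wrmsr instruction are simulated. Every name below (sim_wrmsr_fsbase, struct sim_pcb, spcb_fsbase, switch_fsbase_always, switch_fsbase_lazy) is hypothetical and is not taken from the actual patch.

```c
#include <stdint.h>

/*
 * Userland stand-ins for the FS base MSR and the wrmsr instruction.
 * Both are hypothetical; a real kernel would execute wrmsr here.
 */
static uint64_t sim_fsbase_msr;		/* simulated MSR contents */
static unsigned long sim_wrmsr_calls;	/* number of simulated writes */

static void
sim_wrmsr_fsbase(uint64_t value)
{
	sim_fsbase_msr = value;
	sim_wrmsr_calls++;
}

/* Hypothetical per-thread save area holding only the FS base. */
struct sim_pcb {
	uint64_t spcb_fsbase;
};

/* Unconditional variant: one (expensive) write on every switch. */
static void
switch_fsbase_always(const struct sim_pcb *newpcb)
{
	sim_wrmsr_fsbase(newpcb->spcb_fsbase);
}

/*
 * Compare-first variant: the write is skipped when the outgoing and
 * incoming threads use the same base.  This assumes the MSR currently
 * holds the outgoing thread's value.
 */
static void
switch_fsbase_lazy(const struct sim_pcb *oldpcb, const struct sim_pcb *newpcb)
{
	if (oldpcb->spcb_fsbase != newpcb->spcb_fsbase)
		sim_wrmsr_fsbase(newpcb->spcb_fsbase);
}
```

Whether the comparison pays off depends on how often the two values really are equal, which is the question raised elsewhere in this thread.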
Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 07:28:20 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 229651065677; Thu, 13 Mar 2008 07:28:20 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id E4E798FC27; Thu, 13 Mar 2008 07:28:19 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2D7SDUk061861; Thu, 13 Mar 2008 03:28:14 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Wed, 12 Mar 2008 21:29:18 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Bruce Evans In-Reply-To: <20080313124213.J31200@delplex.bde.org> Message-ID: <20080312211834.T1091@desktop> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, David Xu , Peter Wemm Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 07:28:20 -0000 On Thu, 13 Mar 2008, Bruce Evans wrote: > On Wed, 12 Mar 2008, Peter Wemm wrote: > >> On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: >>> Jeff Roberson wrote: >>> > http://people.freebsd.org/~jeff/amd64.diff >>> >>> This is a good idea. > > I wouldn't have expected it to make much difference. On i386 UP, > cpu_switch() normally executes only 48 instructions for in-kernel > context switches in my version of 5.2 and only 61 instructions in > -current. 
~5.2 differs from 5.2 here only in not having to
> switch %eflags. This saves 4 instructions but much more in cycles,
> especially in P4 where accesses to %eflags are very slow. 5.2 would
> take 52 instructions, and -current has bloated by 9 instructions
> relative to 5.2.

More expensive than the raw instruction count is:

1) The mispredicted branches to deal with all of the optional state and
features that are not always saved.
2) The cost of extra icache for getting over all of those unused
instructions, unaligned jumps, etc.

I haven't looked at i386 very closely lately but on amd64 the wrmsrs for
fs/gsbase are very expensive. On my 2ghz dual core opteron the optimized
switch seems to take about 100ns. The total switch from userspace to
userspace is about 4x that.

>
> In-kernel switches are not a very typical case since they don't load
> %cr3. The 50-60 instructions might take as few as 20 cycles when
> pipelined through 3 ALUs, but they are only moderately parallelizable
> so would take more like 50-60 cycles on an Athlon. The only very slow
> instructions in them for the usual in-kernel case are the loads of
> %eflags and %gs. At least the latter is easy to optimize away, but
> the former is associated with spin locking hard-disabling interrupts.
> For userland context switches, there is also an ltr in the usual path
> of execution. But 100 or so cycles for the simple instructions is
> noise compared with the cost of the TLB flush and other cache misses
> caused by loading %cr3 for userland context switches. Userland code
> that does useful work will do more than sched_yield() so it will suffer
> more from cache misses.
>

We've been working on amd64 so I can't comment specifically about i386 costs.
However, I definitely agree that cpu_switch() is not the greatest overhead in
the path. Also, you have to load cr3 even for kernel threads because the
page directory page or page directory pointer table at %cr3 can go away once
you've switched out the old thread.
> Layers above cpu_switch() have become very bloated and make a full
> context switch take several hundred cycles for the simple instructions
> on machines where the simple instructions in cpu_switch() take only
> 100. Their overhead may almost be significant relative to the cache
> misses. However, this is another reason why the speed of the simple
> instructions in cpu_switch() doesn't matter.
>
>>> In fact, according to calling convention, some
>>> registers are not needed to be saved across function call, e.g. on
>>> i386, eax, edx, and ecx. :-) but gdb may need them to dig out
>>> stack variable's value.
>
> The asm code already saves only call-saved registers for both i386 and
> amd64. It saves call-saved registers even when it apparently doesn't
> use them (lots more of these on amd64, while on i386 it uses more
> call-saved registers than it needs to, apparently since this is free
> after saving all call-saved registers). I think saving more than is
> needed is the result of confusion about what needs to be saved and/or
> what is needed for debugging.

It has to save all of the callee-saved registers in the PCB because they will
likely differ from thread to thread. Failing to save and restore them could
leave you returning with the registers having different values and corrupt
the calling function.

>
>> Jeff and I have been having a friendly "competition" today.
>>
>> With a UP kernel and INVARIANTS, my initial counter-patch response had
>> nearly double the gain on my machine. (Jeff 7%, mine: 13.5%).
>> I changed to compile kernels the same as he did (no invariants, SMP
>> kernel, but kern.smp.disabled=1). After that, our patch sets were the
>> same again - both at about 10% gain over baseline.
>>
>> I've made a few more changes and am now at 23% improvement over baseline.
>>
>> I'm not confident of testing methodology. More tests are in progress.
>>
>> The good news is that this tuning is finally being done. It should
>> have been done in 2003 though...
> > How is this possible with (according to my theory) most of the context > switch cost being for %cr3 and upper layers? Unchanged amd64 has only > a few more costs than i386. Mainly 3 unconditional wrmsr's and 2 > unconditional rdmsr's for managing gsbase and fsbase. I thought that > these were hard to avoid and anyway not nearly as expensive as %cr3 loads. %cr3 is actually a lot less expensive these days with page table flush filters and the PG_G bit. We were able to optimize away setting the msrs in the case that the previous values match the new values. Apparently the hardware doesn't optimize this case so we have to do comparisons ourselves. That was a big chunk of the optimization. Static branch hints, reordering code, possibly reordering for better pipeline scheduling in peter's asm, etc. provide the rest. My primary motivation is to get ithread/kthread/taskqueue switch costs down for interrupt heavy applications. There is a lot of unnecessary fat there. Jeff > > Bruce > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 13:25:06 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 30D041065678 for ; Thu, 13 Mar 2008 13:25:06 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.freebsd.org (Postfix) with ESMTP id D5CA98FC29 for ; Thu, 13 Mar 2008 13:25:05 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from root by ciao.gmane.org with local (Exim 4.43) id 1JZnQc-0005wG-6x for freebsd-arch@freebsd.org; Thu, 13 Mar 2008 13:25:02 +0000 Received: from 195.208.174.178 ([195.208.174.178]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 
1AlnuQ-0007hv-00 for ; Thu, 13 Mar 2008 13:25:02 +0000 Received: from vadim_nuclight by 195.208.174.178 with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 13 Mar 2008 13:25:02 +0000 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Vadim Goncharov Date: Thu, 13 Mar 2008 13:24:49 +0000 (UTC) Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel Lines: 20 Message-ID: References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <20080312145734.GB26812@dragon.NUXI.org> <47D7F1EC.6040802@cokane.org> <47D88568.7000105@cokane.org> X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: 195.208.174.178 X-Comment-To: Coleman Kane User-Agent: slrn/0.9.8.1 (FreeBSD) Sender: news Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: vadim_nuclight@mail.ru List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 13:25:06 -0000 Hi Coleman Kane! On Wed, 12 Mar 2008 21:37:44 -0400; Coleman Kane wrote about 'Re: SMPTODO: remove timeout(9) from ffs_softdep.c': >>> Third try at the patch, properly adjusting my vim tabs to 8 spaces as >>> they should be so that I can follow style(9). >> >> I wrote a function[1] last year to configure vim to follow style(9). >> Just run ':call FreeBSD_Style()' while editing a file. >> >> Sean >> 1. http://www.farley.org/freebsd/tmp/VIM/FreeBSD.vim > Rock on. > This should be in the committers' guide or something. I vote for this too :) -- WBR, Vadim Goncharov. 
ICQ#166852181 mailto:vadim_nuclight@mail.ru [Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight] From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 13:34:54 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2A1BD106566C; Thu, 13 Mar 2008 13:34:54 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 933038FC19; Thu, 13 Mar 2008 13:34:53 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c220-239-252-11.carlnfd3.nsw.optusnet.com.au (c220-239-252-11.carlnfd3.nsw.optusnet.com.au [220.239.252.11]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2DDYRFk006412 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 14 Mar 2008 00:34:29 +1100 Date: Fri, 14 Mar 2008 00:34:27 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Jeff Roberson In-Reply-To: <20080312211834.T1091@desktop> Message-ID: <20080313230809.W32527@delplex.bde.org> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, Peter Wemm , David Xu Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 13:34:54 -0000 On Wed, 12 Mar 2008, Jeff Roberson wrote: > On Thu, 13 Mar 2008, Bruce Evans wrote: > >> On Wed, 12 Mar 2008, Peter Wemm wrote: >> >>> On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: >>>> Jeff Roberson wrote: >>>> > http://people.freebsd.org/~jeff/amd64.diff >>>> >>>> This is a good idea. >> >> I wouldn't have expected it to make much difference. On i386 UP, >> cpu_switch() normally executes only 48 instructions for in-kernel >> context switches in my version of 5.2 and only 61 instructions in >> -current. ~5.2 differs from 5.2 here in only in not having to >> switch %eflags. This saves 4 instructions but much more in cycles, >> especially in P4 where accesses to %eflags are very slow. 5.2 would >> take 52 instructions, and -current has bloated by 9 instructions >> relative to 5.2. > > More expensive than the raw instruction count is: > > 1) The mispredicted branches to deal with all of the optional state and > features that are not always saved. This is unlikely to matter, and apparently doesn't, at least in simple benchmarks, since the C version has even more branches. Features that are rarely used cause branches that are usually perfectly predicted. > 2) The cost of extra icache for getting over all of those unused > instructions, unaligned jumps, etc. Again, if this were the cause of slowness then it would affect the C version more, since the C version is larger. In fact, the benchmark is probably too simple to show the cost of branches. 
Just doing sched_yield() in a loop gives the following atypical behaviour
which may be atypical enough for the larger branch and cache costs for
the C version to not have much effect:
- it doesn't go near most of the special cases, so branches are
  predictable (always non-special) and are thus predicted provided (a)
  the CPU actually does reasonably good branch prediction, and (b) the
  branch predictions fit in the branch prediction cache (reasonably good
  branch prediction probably requires such a cache).
- it doesn't touch much icache or dcache or branch-cache, so everything
  probably stays cached. If just the branch-cache were thrashed, then
  reasonably good dynamic branch prediction would be impossible and
  things would be slow.

In the C version, you use predict_true() and predict_false() a lot.
This might improve static branch prediction but makes little difference
if the branch cache is working.

The C version uses lots of non-inline function calls. Just the
branches for this would have a significant overhead if the branches are
mispredicted. I think you are depending on gcc's auto-inlining of
static functions which are only called once to avoid the full cost of
the function calls.

> I haven't looked at i386 very closely lately but on amd64 the wrmsrs for
> fs/gsbase are very expensive. On my 2ghz dual core opteron the optimized
> switch seems to take about 100ns. The total switch from userspace to
> userspace is about 4x that.

Probably avoiding these is the only significant difference between all the
versions. You use predict_false() for executing them. Are fsbase and
gsbase really usually constant across processes?

400nS is about what I get for i386 on 2.2GHz A64 UP too (6.17 S for
./yield 1000000 10). getpid() on this machine takes 180nS so it is
unreasonable to expect sched_yield() to take much less than a few
hundred nS.
Some perfmon output for ./yield 100000 10:

% # s/kx-ls-microarchitectural-resync-by-self-mod-code
% 0
% # s/kx-ls-buffer2-full
% 909905
% # s/kx-ls-retired-cflush-instructions
% 0
% # s/kx-ls-retired-cpuid-instructions
% 0
% # s/kx-dc-accesses
% 496436422
% # s/kx-dc-misses
% 11102024

11 dcache misses per yield. Probably the main cause of slowness (main
memory latency on this machine is 42 nsec so 11 cache misses take 462 of
the 617 nS per call?).

% # s/kx-dc-refills-from-l2
% 0
% # s/kx-dc-refills-from-system
% 0
% # s/kx-dc-writebacks
% 0
% # s/kx-dc-l1-dtlb-miss-and-l2-dtlb-hits
% 3459100
% # s/kx-dc-l1-and-l2-dtlb-misses
% 2138231
% # s/kx-dc-misaligned-references
% 87
% # s/kx-dc-microarchitectural-late-cancel-of-an-access
% 73146415
% # s/kx-dc-microarchitectural-early-cancel-of-an-access
% 236927303
% # s/kx-bu-cpu-clk-unhalted
% 1303921314
% # s/kx-ic-fetches
% 236207869
% # s/kx-ic-misses
% 22988

Insignificant icache misses.

% # s/kx-ic-refill-from-l2
% 18979
% # s/kx-ic-refill-from-system
% 4191
% # s/kx-ic-l1-itlb-misses
% 0
% # s/kx-ic-l1-l2-itlb-misses
% 1619297
% # s/kx-ic-instruction-fetch-stall
% 1034570822
% # s/kx-ic-return-stack-hit
% 20822416
% # s/kx-ic-return-stack-overflow
% 5870
% # s/kx-fr-retired-instructions
% 701240247
% # s/kx-fr-retired-ops
% 1163464391
% # s/kx-fr-retired-branches
% 121636370
% # s/kx-fr-retired-branches-mispredicted
% 2761910
% # s/kx-fr-retired-taken-branches
% 93488548
% # s/kx-fr-retired-taken-branches-mispredicted
% 2848315

2.8 branches mispredicted per call.

% # s/kx-fr-retired-far-control-transfers
% 2000934

1 int 0x80 and 1 iret per sched_yield(), and apparently not much else.

% # s/kx-fr-retired-resync-branches
% 936968
% # s/kx-fr-retired-near-returns
% 19008374
% # s/kx-fr-retired-near-returns-mispredicted
% 784103

0.8 returns mispredicted per call.
% # s/kx-fr-retired-taken-branches-mispred-by-addr-miscompare
% 721241
% # s/kx-fr-interrupts-masked-cycles
% 658462615

Ugh, this is from spinlocks bogusly masking interrupts. More than half
the cycles have interrupts masked. This at least shows that lots of
time is being spent near cpu_switch() with a spinlock held.

% # s/kx-fr-interrupts-masked-while-pending-cycles
% 9365

Since the CPU is reasonably fast, interrupts aren't masked for very long
each time. This maximum is still 4.5 uS.

% # s/kx-fr-hardware-interrupts
% 63
% # s/kx-fr-decoder-empty
% 247898696
% # s/kx-fr-dispatch-stalls
% 589228741
% # s/kx-fr-dispatch-stall-from-branch-abort-to-retire
% 39894120
% # s/kx-fr-dispatch-stall-for-serialization
% 44037193
% # s/kx-fr-dispatch-stall-for-segment-load
% 134520281

134 cycles per call. This may be more for ones in syscall() generally.
I think each segreg load still costs ~20 cycles. Since this is on
i386, there are 6 per call (%ds, %es and %fs save and restore), plus
the %ss save and restore, which might not be counted here. 134 is a
lot -- about 60nS of the 180nS for getpid().

% # s/kx-fr-dispatch-stall-when-reorder-buffer-is-full
% 18648001
% # s/kx-fr-dispatch-stall-when-reservation-stations-are-full
% 121485247
% # s/kx-fr-dispatch-stall-when-fpu-is-full
% 19
% # s/kx-fr-dispatch-stall-when-ls-is-full
% 203578275
% # s/kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet
% 63136307
% # s/kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending
% 6994131

>> In-kernel switches are not a very typical case since they don't load
>> %cr3...
>
> We've been working on amd64 so I can't comment specifically about i386 costs.
> However, I definitely agree that cpu_switch() is not the greatest overhead in
> the path. Also, you have to load cr3 even for kernel threads because the
> page directory page or page directory pointer table at %cr3 can go away once
> you've switched out the old thread.

I don't see this.
The switch is avoided if %cr3 wouldn't change, which I think usually or always happens for switches between kernel threads. >> The asm code already saves only call-saved registers for both i386 and >> amd64. It saves call-saved registers even when it apparently doesn't >> use them (lots more of these on amd64, while on i386 it uses more >> call-saved registers than it needs to, apparently since this is free >> after saving all call-saved registers). I think saving more than is >> needed is the result of confusion about what needs to be saved and/or >> what is needed for debugging. > > It has to save all of the callee saved registers in the PCB because they will > likely differ from thread to thread. Failing to save and restore them could > leave you returning with the registers having different values and corrupt > the calling function. Yes, I had forgotten the detail of how the non-local flow of control can change the registers (the next call to the function in the context of the switched-to-process may have different values in the registers due to changes to the registers in callers). All that can be done differently here is saving all the registers on the stack (except %esp) in the usual way. This would probably be faster on old i386's using pushal or pushl, but on amd64 pushal is not available, and on Athlons generally (before Barcelona?) it is faster not to use pushl, so on amd64 the registers should be saved using movl and then it is just as easy to put them in the pcb as on the stack. >>> The good news is that this tuning is finally being done. It should >>> have been done in 2003 though... >> >> How is this possible with (according to my theory) most of the context >> switch cost being for %cr3 and upper layers? Unchanged amd64 has only >> a few more costs than i386. Mainly 3 unconditional wrmsr's and 2 >> unconditional rdmsr's for managing gsbase and fsbase. I thought that >> these were hard to avoid and anyway not nearly as expensive as %cr3 loads. 
> > %cr3 is actually a lot less expensive these days with page table flush > filters and the PG_G bit. We were able to optimize away setting the msrs in > the case that the previous values match the new values. Apparently the > hardware doesn't optimize this case so we have to do comparisons ourselves. > > That was a big chunk of the optimization. Static branch hints, reordering > code, possibly reordering for better pipeline scheduling in peter's asm, etc. > provide the rest. All the old i386 asm and probably clones of it on amd64 is certainly not optimized globally for anything newer than an i386 (barely even an i486). This rarely matters however. It lost more on Pentium-1's, but now out of order execution and better branch prediction hides most inefficiencies. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 13:50:38 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2C0FD1065705 for ; Thu, 13 Mar 2008 13:50:38 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from palm.hoeg.nl (mx0.hoeg.nl [IPv6:2001:610:652::211]) by mx1.freebsd.org (Postfix) with ESMTP id DBB818FC1B for ; Thu, 13 Mar 2008 13:50:36 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: by palm.hoeg.nl (Postfix, from userid 1000) id 567711CC44; Thu, 13 Mar 2008 14:50:35 +0100 (CET) Date: Thu, 13 Mar 2008 14:50:35 +0100 From: Ed Schouten To: FreeBSD Arch Message-ID: <20080313135035.GB80576@hoeg.nl> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="ht9V8wKec6a3w1Ef" Content-Disposition: inline User-Agent: Mutt/1.5.17 (2007-11-01) Cc: Subject: New TTY layer: condvar(9) and Giant X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 
13:50:38 -0000 --ht9V8wKec6a3w1Ef Content-Type: multipart/mixed; boundary="FexDM9E/OpjgUmaq" Content-Disposition: inline --FexDM9E/OpjgUmaq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hello everyone, Almost a month ago I started working on my assignment for my internship, to implement a new TTY layer that fixes a lot of architectural problems. So far, things are going quite fast: - I've already implemented a basic TTY layer, which has support for canonical and non-canonical mode. It still lacks important features including flow control, but it seems to work quite well. Unlike the old layer, it doesn't buffer data as much, which should hopefully mean it's a bit faster. - I'm using a new PTY driver called pts(4). It works quite well, but it lacks the compatibility bits, which we'll need to support older FreeBSD or Linux binaries. - Some of you may have read I'm working on syscons now. I've got syscons working with the new TTY layer; I'm typing this message through syscons. ;-) A lot of drivers that are used by the old TTY layer aren't mpsafe yet. Of course, I'm willing to fix this, but this cannot be done in the near future. This is why the new TTY layer should still allow TTYs to be run under Giant. In my initial implementation, each TTY device had its own mutex. In theory, this is great. The PTY driver already uses this and it works fine. There will be a lot of drivers, however, that want to use a per-class mutex to lock all related TTY devices down at once (e.g. syscons, which allocates 16 virtual TTYs). This is why I introduced a per-class lock. When set to Giant, all TTY instances will acquire Giant when entering the TTY layer. Unfortunately, I discovered condvar(9) can't properly unlock/lock Giant, which causes the system to panic. The condvar routines already call DROP_GIANT before unlocking the lock itself. 
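As a rough illustration of the per-class lock idea described above, here is a minimal user-space sketch. It is not taken from the actual TTY code: the names toy_lock, tty_sketch, t_lock and t_ownlock are hypothetical, and the toy lock only records whether it is held. Each TTY carries a pointer to the lock it uses. A driver such as the PTY driver would point it at the device's own lock, while a driver such as syscons would point all of its TTYs at one shared lock.

```c
#include <stddef.h>

/* Toy stand-in for a kernel mutex; it only tracks whether it is held. */
struct toy_lock {
	int	held;
};

/* Hypothetical TTY structure: t_lock is the lock the TTY layer takes. */
struct tty_sketch {
	struct toy_lock	*t_lock;	/* private lock or shared class lock */
	struct toy_lock	 t_ownlock;	/* storage used when the lock is private */
};

/* classlock == NULL selects a private per-device lock. */
static void
tty_sketch_init(struct tty_sketch *tp, struct toy_lock *classlock)
{
	tp->t_ownlock.held = 0;
	tp->t_lock = (classlock != NULL) ? classlock : &tp->t_ownlock;
}

static void
tty_sketch_lock(struct tty_sketch *tp)
{
	tp->t_lock->held = 1;
}

static void
tty_sketch_unlock(struct tty_sketch *tp)
{
	tp->t_lock->held = 0;
}

/* Returns 1 if locking one TTY also locks the other (shared class lock). */
static int
tty_sketch_locks_together(struct tty_sketch *a, struct tty_sketch *b)
{
	int together;

	tty_sketch_lock(a);
	together = b->t_lock->held;
	tty_sketch_unlock(a);
	return (together);
}
```

With this layout, passing a pointer to one shared lock for every virtual terminal gives the "lock all related devices at once" behaviour, and passing NULL gives each device its own lock. Setting the shared lock to Giant would be the special case under discussion.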
I've attached a patch that adds support for Giant to condvar(9). I had to patch sys/mutex.h a little, because we now only need to call DROP_GIANT() under certain conditions. The macro's didn't allow that, because DROP_GIANT starts a new code block. I'm sending this to arch@, because I want to know if I'm doing something silly. It seems to work properly on my machine, but I'm not an SMP expert. ;-) --=20 Ed Schouten WWW: http://g-rave.nl/ --FexDM9E/OpjgUmaq Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="condvar-giant.diff" Content-Transfer-Encoding: quoted-printable --- sys/kern/kern_condvar.c +++ sys/kern/kern_condvar.c @@ -95,6 +95,7 @@ _cv_wait(struct cv *cvp, struct lock_object *lock) { WITNESS_SAVE_DECL(lock_witness); + PARTIAL_DROP_GIANT_DECL(); struct lock_class *class; struct thread *td; int lock_state; @@ -123,7 +124,8 @@ sleepq_lock(cvp); =20 cvp->cv_waiters++; - DROP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_DROP_GIANT(); =20 sleepq_add(cvp, lock, cvp->cv_description, SLEEPQ_CONDVAR, 0); if (class->lc_flags & LC_SLEEPABLE) @@ -137,7 +139,8 @@ if (KTRPOINT(td, KTR_CSW)) ktrcsw(0, 0); #endif - PICKUP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_PICKUP_GIANT(); class->lc_lock(lock, lock_state); WITNESS_RESTORE(lock, lock_witness); } @@ -149,6 +152,7 @@ void _cv_wait_unlock(struct cv *cvp, struct lock_object *lock) { + PARTIAL_DROP_GIANT_DECL(); struct lock_class *class; struct thread *td; =20 @@ -176,7 +180,8 @@ sleepq_lock(cvp); =20 cvp->cv_waiters++; - DROP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_DROP_GIANT(); =20 sleepq_add(cvp, lock, cvp->cv_description, SLEEPQ_CONDVAR, 0); if (class->lc_flags & LC_SLEEPABLE) @@ -190,7 +195,8 @@ if (KTRPOINT(td, KTR_CSW)) ktrcsw(0, 0); #endif - PICKUP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_PICKUP_GIANT(); } =20 /* @@ -203,6 +209,7 @@ _cv_wait_sig(struct cv *cvp, struct lock_object *lock) { 
WITNESS_SAVE_DECL(lock_witness); + PARTIAL_DROP_GIANT_DECL(); struct lock_class *class; struct thread *td; struct proc *p; @@ -233,7 +240,8 @@ sleepq_lock(cvp); =20 cvp->cv_waiters++; - DROP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_DROP_GIANT(); =20 sleepq_add(cvp, lock, cvp->cv_description, SLEEPQ_CONDVAR | SLEEPQ_INTERRUPTIBLE, 0); @@ -248,7 +256,8 @@ if (KTRPOINT(td, KTR_CSW)) ktrcsw(0, 0); #endif - PICKUP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_PICKUP_GIANT(); class->lc_lock(lock, lock_state); WITNESS_RESTORE(lock, lock_witness); =20 @@ -264,6 +273,7 @@ _cv_timedwait(struct cv *cvp, struct lock_object *lock, int timo) { WITNESS_SAVE_DECL(lock_witness); + PARTIAL_DROP_GIANT_DECL(); struct lock_class *class; struct thread *td; int lock_state, rval; @@ -293,7 +303,8 @@ sleepq_lock(cvp); =20 cvp->cv_waiters++; - DROP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_DROP_GIANT(); =20 sleepq_add(cvp, lock, cvp->cv_description, SLEEPQ_CONDVAR, 0); sleepq_set_timeout(cvp, timo); @@ -308,7 +319,8 @@ if (KTRPOINT(td, KTR_CSW)) ktrcsw(0, 0); #endif - PICKUP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_PICKUP_GIANT(); class->lc_lock(lock, lock_state); WITNESS_RESTORE(lock, lock_witness); =20 @@ -325,6 +337,7 @@ _cv_timedwait_sig(struct cv *cvp, struct lock_object *lock, int timo) { WITNESS_SAVE_DECL(lock_witness); + PARTIAL_DROP_GIANT_DECL(); struct lock_class *class; struct thread *td; struct proc *p; @@ -356,7 +369,8 @@ sleepq_lock(cvp); =20 cvp->cv_waiters++; - DROP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_DROP_GIANT(); =20 sleepq_add(cvp, lock, cvp->cv_description, SLEEPQ_CONDVAR | SLEEPQ_INTERRUPTIBLE, 0); @@ -372,7 +386,8 @@ if (KTRPOINT(td, KTR_CSW)) ktrcsw(0, 0); #endif - PICKUP_GIANT(); + if (lock !=3D (struct lock_object *)&Giant) + PARTIAL_PICKUP_GIANT(); class->lc_lock(lock, lock_state); WITNESS_RESTORE(lock, lock_witness); =20 --- sys/sys/mutex.h +++ 
sys/sys/mutex.h @@ -368,26 +368,33 @@ #ifndef DROP_GIANT #define DROP_GIANT() \ do { \ + PARTIAL_DROP_GIANT_DECL(); \ + PARTIAL_DROP_GIANT(); + +#define PARTIAL_DROP_GIANT_DECL() \ int _giantcnt =3D 0; \ - WITNESS_SAVE_DECL(Giant); \ - \ + WITNESS_SAVE_DECL(Giant); + +#define PARTIAL_DROP_GIANT() do { \ if (mtx_owned(&Giant)) { \ WITNESS_SAVE(&Giant.lock_object, Giant); \ for (_giantcnt =3D 0; mtx_owned(&Giant); _giantcnt++) \ mtx_unlock(&Giant); \ - } + } \ +} while (0) =20 #define PICKUP_GIANT() \ PARTIAL_PICKUP_GIANT(); \ } while (0) =20 -#define PARTIAL_PICKUP_GIANT() \ +#define PARTIAL_PICKUP_GIANT() do { \ mtx_assert(&Giant, MA_NOTOWNED); \ if (_giantcnt > 0) { \ while (_giantcnt--) \ mtx_lock(&Giant); \ WITNESS_RESTORE(&Giant.lock_object, Giant); \ - } + } \ +} while(0) #endif =20 #define UGAR(rval) do { \ --FexDM9E/OpjgUmaq-- --ht9V8wKec6a3w1Ef Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (FreeBSD) iEYEARECAAYFAkfZMSsACgkQ52SDGA2eCwXMFQCfYwaKBVHU7xCBZv/D+yglHfmk 7dEAn1mXTouuc66FGFTiiVnM6ylfLri5 =5MNx -----END PGP SIGNATURE----- --ht9V8wKec6a3w1Ef-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 14:26:29 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 52B9C106566C for ; Thu, 13 Mar 2008 14:26:29 +0000 (UTC) (envelope-from cokane@freebsd.org) Received: from QMTA08.emeryville.ca.mail.comcast.net (qmta08.emeryville.ca.mail.comcast.net [76.96.30.80]) by mx1.freebsd.org (Postfix) with ESMTP id 24B058FC1E for ; Thu, 13 Mar 2008 14:26:29 +0000 (UTC) (envelope-from cokane@freebsd.org) Received: from OMTA03.emeryville.ca.mail.comcast.net ([76.96.30.27]) by QMTA08.emeryville.ca.mail.comcast.net with comcast id 0dTj1Z0040b6N64A804T00; Thu, 13 Mar 2008 14:15:46 +0000 Received: from discordia ([24.61.189.203]) by OMTA03.emeryville.ca.mail.comcast.net with comcast id 
0eGG1Z0074PktZC8P00000; Thu, 13 Mar 2008 14:16:17 +0000 X-Authority-Analysis: v=1.0 c=1 a=yWIViUiLWPYA:10 a=c5sTgUsrrxMA:10 a=pW49DAFhvdKtc-utfFoA:9 a=pMymjJw0Sgto9zC7YjIuad9EabMA:4 a=zUBsD6tbDSsA:10 Received: by discordia (Postfix, from userid 103) id 2870B1636F9; Thu, 13 Mar 2008 10:16:16 -0400 (EDT) X-Spam-Checker-Version: SpamAssassin 3.1.8-gr1 (2007-02-13) on discordia X-Spam-Level: X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.8-gr1 Received: from [172.20.1.3] (erwin.int.cokane.org [172.20.1.3]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by discordia (Postfix) with ESMTP id C8ABE1636F8; Thu, 13 Mar 2008 10:16:02 -0400 (EDT) Message-ID: <47D93656.2000203@FreeBSD.org> Date: Thu, 13 Mar 2008 10:12:38 -0400 From: Coleman Kane Organization: The FreeBSD Project User-Agent: Thunderbird 2.0.0.12 (X11/20080312) MIME-Version: 1.0 To: John Baldwin , freebsd-arch@freebsd.org References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <200803121058.04096.jhb@freebsd.org> In-Reply-To: <200803121058.04096.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: imp@FreeBSD.org, obrien@FreeBSD.org Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: cokane@FreeBSD.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 14:26:29 -0000 John Baldwin wrote: > On Wednesday 12 March 2008 10:16:31 am Coleman Kane wrote: > >> I am attaching the revised patch. >> > > Looks good. 
I would perhaps not add the extra {}'s around the single-line if > clauses as it slightly obfuscates the diff (style(9) actually suggests no > {}'s in that case, but I think in practice our sources have a mixture of > both). > I'm going to commit the attached patch to ffs_softdep.c this afternoon (EDT) if there aren't any objections. Thanks for the thrice-over guys, you've been very helpful. -- Coleman Kane From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 14:50:40 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B0AB41065671 for ; Thu, 13 Mar 2008 14:50:40 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 968B68FC16 for ; Thu, 13 Mar 2008 14:50:40 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from zion.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by elvis.mu.org (Postfix) with ESMTP id C99201A4D7C; Thu, 13 Mar 2008 07:49:43 -0700 (PDT) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 13 Mar 2008 10:49:57 -0400 User-Agent: KMail/1.9.7 References: <20080313135035.GB80576@hoeg.nl> In-Reply-To: <20080313135035.GB80576@hoeg.nl> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803131049.58051.jhb@freebsd.org> Cc: Ed Schouten Subject: Re: New TTY layer: condvar(9) and Giant X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 14:50:40 -0000 On Thursday 13 March 2008 09:50:35 am Ed Schouten wrote: > Hello everyone, > > Almost a month ago I started working on my assignment for my internship, > to reimplement a new TTY layer that fixes a 
lot of architectural > problems. So far, things are going quite fast: > > - I've already implemented a basic TTY layer, which has support for > canonical and non-canonical mode. It still misses important features > including flow control, but it seems to work quite good. Unlike the > old layer, it doesn't buffer data as much, which should hopefully mean > it's a bit faster. > - I'm using a new PTY driver called pts(4). It works quite good, but it > misses the compatibility bits, which we'll need to have to support > older FreeBSD or Linux binaries. > - Some of you may have read I'm working on syscons now. I've got syscons > working with the new TTY layer; I'm typing this message through > syscons. ;-) > > A lot of drivers that are used by the old TTY layer aren't mpsafe yet. > Of course, I'm willing to fix this, but this cannot be done in the > nearby future. This is why the new TTY layer should still allow TTY's to > be run under Giant. > > In my initial implementation, each TTY device had its own mutex. In > theory, this is great. The PTY driver already uses this and it works > fine. There will be a lot of drivers, however, that want to use a > per-class mutex to lock all related TTY devices down at once (i.e. > syscons, which allocates 16 virtual TTY's). This is why I introduced a > per-class lock. When set to Giant, all TTY instances will lock down the > Giant lock when entering the TTY layer. > > Unfortunately, I discovered condvar(9) can't properly unlock/lock the > Giant, which causes the system to panic. The condvar routines already > call DROP_GIANT before unlocking the lock itself. > > I've attached a patch that adds support for Giant to condvar(9). I had > to patch sys/mutex.h a little, because we now only need to call > DROP_GIANT() under certain conditions. The macro's didn't allow that, > because DROP_GIANT starts a new code block. > > I'm sending this to arch@, because I want to know if I'm doing something > silly. 
It seems to work properly on my machine, but I'm not an SMP > expert. ;-) In general this sort of thing is discouraged as explicit use of Giant is discouraged. Its magical properties (being implicitly dropped in places) can make it unsuitable for use as a regular mutex (though in practice any regular mutex would need to be dropped in the same places to avoid problems). In other driver locking cases the need for this has been avoided, although what I sort of forced CAM to do probably isn't quite right. Also, your patches won't work in the case of Giant being recursed (it will only drop Giant once and the sleeping thread will still own Giant). If you do want to make this work my suggestion would be to make the lc_unlock and lc_lock not do anything for Giant. You could either do this by 1) patching kern_condvar.c so it does something like this: if (lock != &Giant.lo_object) cookie = class->lc_unlock(lock); or instead patch the lc_lock/lc_unlock routines to just not do anything for Giant like so: Index: kern_mutex.c =================================================================== RCS file: /host/cvs/usr/cvs/src/sys/kern/kern_mutex.c,v retrieving revision 1.205 diff -u -r1.205 kern_mutex.c --- kern_mutex.c 13 Feb 2008 23:39:05 -0000 1.205 +++ kern_mutex.c 13 Mar 2008 14:49:04 -0000 @@ -134,6 +134,8 @@ lock_mtx(struct lock_object *lock, int how) { + if (lock == &Giant.lo_object) + return; mtx_lock((struct mtx *)lock); } @@ -149,6 +151,8 @@ { struct mtx *m; + if (lock == &Giant.lo_object) + return (0); m = (struct mtx *)lock; mtx_assert(m, MA_OWNED | MA_NOTRECURSED); mtx_unlock(m); I still don't like the idea of letting Giant work with msleep/cv_*wait*() because I think it will be abused. 
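To illustrate the recursion concern raised above, here is a minimal user-space sketch. It is a toy under stated assumptions, not kernel code: toy_mtx and its helpers are hypothetical stand-ins for a recursive mutex such as Giant, and there is no real thread ownership. Releasing a recursed lock once leaves it owned, so a thread going to sleep would still hold it. The recursion count has to be unwound completely and restored afterwards, which is the job of the loops in DROP_GIANT() and PICKUP_GIANT() shown in the sys/mutex.h part of the patch.

```c
#include <assert.h>

/* Toy recursive mutex: depth 0 means not owned by the current thread. */
struct toy_mtx {
	int	depth;
};

static void
toy_lock(struct toy_mtx *m)
{
	m->depth++;
}

static void
toy_unlock(struct toy_mtx *m)
{
	assert(m->depth > 0);
	m->depth--;
}

static int
toy_owned(const struct toy_mtx *m)
{
	return (m->depth > 0);
}

/* Release every recursive acquisition; return how many there were. */
static int
toy_drop_all(struct toy_mtx *m)
{
	int cnt;

	for (cnt = 0; toy_owned(m); cnt++)
		toy_unlock(m);
	return (cnt);
}

/* Reacquire the lock cnt times, restoring the original recursion depth. */
static void
toy_pickup(struct toy_mtx *m, int cnt)
{
	assert(!toy_owned(m));
	while (cnt-- > 0)
		toy_lock(m);
}
```

A caller that holds the lock at depth 2 and calls toy_unlock() once still owns it, whereas toy_drop_all() followed later by toy_pickup() with the returned count leaves the lock fully released while sleeping and puts the depth back afterwards.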
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 14:53:17 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 13F321065670 for ; Thu, 13 Mar 2008 14:53:17 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from fk-out-0910.google.com (fk-out-0910.google.com [209.85.128.187]) by mx1.freebsd.org (Postfix) with ESMTP id 8E2348FC15 for ; Thu, 13 Mar 2008 14:53:16 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: by fk-out-0910.google.com with SMTP id b27so4085545fka.11 for ; Thu, 13 Mar 2008 07:53:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; bh=9l5zCdBHDeBRsdJ3CEnWMvKiABKfwJVi/E0GqdE9SyI=; b=xBv4hqW68xOPQk7nKGtfNrZf7m1P18Zr0u4HX+/Z1AVA41lz3ZZn6TwY8pkt6xqWaVEhnQyUK0qx5Lm3p/HLwFwUXN5/51VH/s3SQuaaZ7Q+dk3xDQ6D/cWcTnWbh0Wn4MFHveAjC+xQ2tAf1lduZSXGtCx0x90xqdtQGhQBXTs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:sender:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references:x-google-sender-auth; b=o16UPuV/qNub4jl2/HtcsNJcphlMPtx2tt4r40bg/+ext9ecc04KkiVNr1Gc7ztYWqVJeWLJzpmJU6dfmbfaERqFkZYTARtr1MPYx09pit94Q/7vMVLkCWYlN5A/plqq9FTCd/xW7g4XmEn6lZp3InlGHpMOQ0ADjGGBr7SjYQg= Received: by 10.82.127.14 with SMTP id z14mr23371281buc.3.1205419994230; Thu, 13 Mar 2008 07:53:14 -0700 (PDT) Received: by 10.86.30.17 with HTTP; Thu, 13 Mar 2008 07:53:14 -0700 (PDT) Message-ID: <3bbf2fe10803130753p623867d8j3cbb65e0c78a2164@mail.gmail.com> Date: Thu, 13 Mar 2008 15:53:14 +0100 From: "Attilio Rao" Sender: asmrookie@gmail.com To: "Ed Schouten" In-Reply-To: <20080313135035.GB80576@hoeg.nl> MIME-Version: 1.0 Content-Type: 
text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20080313135035.GB80576@hoeg.nl> X-Google-Sender-Auth: 90bab4486d957c24 Cc: FreeBSD Arch Subject: Re: New TTY layer: condvar(9) and Giant X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 14:53:17 -0000 2008/3/13, Ed Schouten : > Hello everyone, > > Almost a month ago I started working on my assignment for my internship, > to reimplement a new TTY layer that fixes a lot of architectural > problems. So far, things are going quite fast: > > - I've already implemented a basic TTY layer, which has support for > canonical and non-canonical mode. It still misses important features > including flow control, but it seems to work quite good. Unlike the > old layer, it doesn't buffer data as much, which should hopefully mean > it's a bit faster. > - I'm using a new PTY driver called pts(4). It works quite good, but it > misses the compatibility bits, which we'll need to have to support > older FreeBSD or Linux binaries. > - Some of you may have read I'm working on syscons now. I've got syscons > working with the new TTY layer; I'm typing this message through > syscons. ;-) > > A lot of drivers that are used by the old TTY layer aren't mpsafe yet. > Of course, I'm willing to fix this, but this cannot be done in the > nearby future. This is why the new TTY layer should still allow TTY's to > be run under Giant. > > In my initial implementation, each TTY device had its own mutex. In > theory, this is great. The PTY driver already uses this and it works > fine. There will be a lot of drivers, however, that want to use a > per-class mutex to lock all related TTY devices down at once (i.e. > syscons, which allocates 16 virtual TTY's). This is why I introduced a > per-class lock. 
When set to Giant, all TTY instances will lock down the > Giant lock when entering the TTY layer. > > Unfortunately, I discovered condvar(9) can't properly unlock/lock the > Giant, which causes the system to panic. The condvar routines already > call DROP_GIANT before unlocking the lock itself. I don't think we should allow this. Giant is already too hidden inside other locking primitives, creating a lot of misunderstandings, misconceptions and wrong assumptions. Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 15:17:14 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7D5A7106566B; Thu, 13 Mar 2008 15:17:14 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from palm.hoeg.nl (mx0.hoeg.nl [IPv6:2001:610:652::211]) by mx1.freebsd.org (Postfix) with ESMTP id 3829C8FC29; Thu, 13 Mar 2008 15:17:14 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: by palm.hoeg.nl (Postfix, from userid 1000) id 872171CE0E; Thu, 13 Mar 2008 16:17:13 +0100 (CET) Date: Thu, 13 Mar 2008 16:17:13 +0100 From: Ed Schouten To: John Baldwin Message-ID: <20080313151713.GD80576@hoeg.nl> References: <20080313135035.GB80576@hoeg.nl> <200803131049.58051.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="hNweOTLwwbnii4NA" Content-Disposition: inline In-Reply-To: <200803131049.58051.jhb@freebsd.org> User-Agent: Mutt/1.5.17 (2007-11-01) Cc: freebsd-arch@freebsd.org Subject: Re: New TTY layer: condvar(9) and Giant X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 15:17:14 -0000 --hNweOTLwwbnii4NA Content-Type: text/plain; charset=us-ascii 
Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * John Baldwin wrote: > Also, your patches won't work in the case of Giant being recursed (it will > only drop Giant once and the sleeping thread will still own Giant). If you > do want to make this work my suggestion would be to make the lc_unlock and > lc_lock not do anything for Giant. You could either do this by 1) patching > kern_condvar.c so it does something like this: > > if (lock != &Giant.lo_object) > cookie = class->lc_unlock(lock); > > or instead patch the lc_lock/lc_unlock routines to just not do anything for > Giant like so: > > Index: kern_mutex.c > =================================================================== > RCS file: /host/cvs/usr/cvs/src/sys/kern/kern_mutex.c,v > retrieving revision 1.205 > diff -u -r1.205 kern_mutex.c > --- kern_mutex.c 13 Feb 2008 23:39:05 -0000 1.205 > +++ kern_mutex.c 13 Mar 2008 14:49:04 -0000 > @@ -134,6 +134,8 @@ > lock_mtx(struct lock_object *lock, int how) > { > > + if (lock == &Giant.lo_object) > + return; > mtx_lock((struct mtx *)lock); > } > > @@ -149,6 +151,8 @@ > { > struct mtx *m; > > + if (lock == &Giant.lo_object) > + return (0); > m = (struct mtx *)lock; > mtx_assert(m, MA_OWNED | MA_NOTRECURSED); > mtx_unlock(m); Indeed, those solutions look a lot better. The reason I just disabled DROP/PICKUP_GIANT was that I only wanted to allow those interfaces to work when Giant was picked up only once. > I still don't like the idea of letting Giant work with msleep/cv_*wait*() > because I think it will be abused. I don't like it either, but we'll need a mechanism like this to make the transition easier. 
I would rather have syscons mpsafe, but it just depends on too many other components that aren't mpsafe either (keyboard and mouse input, etc). I'm personally not afraid about it being abused, because people who are writing new drivers shouldn't be using Giant anyway. --=20 Ed Schouten WWW: http://g-rave.nl/ --hNweOTLwwbnii4NA Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (FreeBSD) iEYEARECAAYFAkfZRXkACgkQ52SDGA2eCwWNmQCfQFxfN0qyrvU2BF1xOXCfmF5V X3wAni0zPlW+3eO5i1vaCCDQg5YAOCzf =pJ4B -----END PGP SIGNATURE----- --hNweOTLwwbnii4NA-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 15:21:26 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EAF881065679 for ; Thu, 13 Mar 2008 15:21:26 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from mail.farley.org (farley.org [67.64.95.201]) by mx1.freebsd.org (Postfix) with ESMTP id A4BBB8FC1C for ; Thu, 13 Mar 2008 15:21:26 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from thor.farley.org (thor.farley.org [192.168.1.5]) by mail.farley.org (8.14.2/8.14.2) with ESMTP id m2DF0lgM056068 for ; Thu, 13 Mar 2008 10:00:47 -0500 (CDT) (envelope-from scf@FreeBSD.org) Date: Thu, 13 Mar 2008 10:00:47 -0500 (CDT) From: "Sean C. 
Farley" To: freebsd-arch@FreeBSD.org In-Reply-To: Message-ID: References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <20080312145734.GB26812@dragon.NUXI.org> <47D7F1EC.6040802@cokane.org> <47D88568.7000105@cokane.org> User-Agent: Alpine 1.00 (BSF 882 2007-12-20) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on mail.farley.org Cc: Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 15:21:27 -0000 On Thu, 13 Mar 2008, Vadim Goncharov wrote: > Hi Coleman Kane! > > On Wed, 12 Mar 2008 21:37:44 -0400; Coleman Kane wrote about 'Re: SMPTODO: remove timeout(9) from ffs_softdep.c': > >>>> Third try at the patch, properly adjusting my vim tabs to 8 spaces >>>> as they should be so that I can follow style(9). >>> >>> I wrote a function[1] last year to configure vim to follow style(9). >>> Just run ':call FreeBSD_Style()' while editing a file. >>> >>> Sean >>> 1. http://www.farley.org/freebsd/tmp/VIM/FreeBSD.vim >> Rock on. >> This should be in the committers' guide or something. > > I vote for this too :) I asked my mentor to allow me to commit the file to tools/tools/editing next to freebsd.el. There the Emacs and Vim scripts can battle it out for all eternity. :) I tweaked the file a bit by adding a few comments and a mapping to f for easy calling. BTW, I am thinking about commenting out the mapping before committing and letting people manually activate it. This is to avoid conflicts that may arise with existing mapping in a person's environment. 
Sean -- scf@FreeBSD.org From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 18:08:05 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id AF8501065671 for ; Thu, 13 Mar 2008 18:08:05 +0000 (UTC) (envelope-from obrien@NUXI.org) Received: from dragon.nuxi.org (trang.nuxi.org [74.95.12.85]) by mx1.freebsd.org (Postfix) with ESMTP id 8DE828FC12 for ; Thu, 13 Mar 2008 18:08:05 +0000 (UTC) (envelope-from obrien@NUXI.org) Received: from dragon.nuxi.org (obrien@localhost [127.0.0.1]) by dragon.nuxi.org (8.14.1/8.14.1) with ESMTP id m2DI85v5083430 for ; Thu, 13 Mar 2008 11:08:05 -0700 (PDT) (envelope-from obrien@dragon.nuxi.org) Received: (from obrien@localhost) by dragon.nuxi.org (8.14.2/8.14.1/Submit) id m2DI85ti083429 for freebsd-arch@freebsd.org; Thu, 13 Mar 2008 11:08:05 -0700 (PDT) (envelope-from obrien) Date: Thu, 13 Mar 2008 11:08:05 -0700 From: "David O'Brien" To: freebsd-arch@freebsd.org Message-ID: <20080313180805.GA83406@dragon.NUXI.org> Mail-Followup-To: obrien@freebsd.org, freebsd-arch@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Operating-System: FreeBSD 8.0-CURRENT User-Agent: Mutt/1.5.16 (2007-06-09) Subject: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: obrien@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 18:08:05 -0000 Hi folks, Some folks at Juniper have submitted these changes to hwpmc(4). I am sending them here for public review. Their thoughts are: mp_ncpus is the count of active CPUs, whereas mp_maxid is the highest CPU ID present on the SMP system. 
Using mp_ncpus in the cpu_id range check of the hwpmc module assumes that the active CPUs in the SMP system are not interleaved with inactive ones. On some platforms, however, active and inactive CPUs can be interleaved, which makes hwpmc fail for CPUs whose cpu_id is greater than or equal to the active-CPU count. -- -- David (obrien@FreeBSD.org) Index: sys/dev/hwpmc/hwpmc_amd.c =================================================================== RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_amd.c,v retrieving revision 1.1.1.1 retrieving revision 1.4 diff -u -p -r1.1.1.1 -r1.4 --- sys/dev/hwpmc/hwpmc_amd.c 21 Jun 2006 03:30:02 -0000 1.1.1.1 +++ sys/dev/hwpmc/hwpmc_amd.c 30 Oct 2007 18:00:43 -0000 1.4 @@ -265,7 +265,7 @@ amd_read_pmc(int cpu, int ri, pmc_value_ const struct pmc_hw *phw; pmc_value_t tmp; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -320,7 +320,7 @@ amd_write_pmc(int cpu, int ri, pmc_value const struct pmc_hw *phw; enum pmc_mode mode; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -367,7 +367,7 @@ amd_config_pmc(int cpu, int ri, struct p PMCDBG(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm); - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -449,7 +449,7 @@ amd_allocate_pmc(int cpu, int ri, struct (void) cpu; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row index %d", __LINE__, ri)); @@ -543,7 +543,7 @@ 
amd_release_pmc(int cpu, int ri, struct (void) pmc; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -575,7 +575,7 @@ amd_start_pmc(int cpu, int ri) struct pmc_hw *phw; const struct amd_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -624,7 +624,7 @@ amd_stop_pmc(int cpu, int ri) const struct amd_descr *pd; uint64_t config; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -676,7 +676,7 @@ amd_intr(int cpu, uintptr_t eip, int use struct pmc_hw *phw; pmc_value_t v; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] out of range CPU %d", __LINE__, cpu)); PMCDBG(MDP,INT,1, "cpu=%d eip=%p um=%d", cpu, (void *) eip, @@ -756,7 +756,7 @@ amd_describe(int cpu, int ri, struct pmc const struct amd_descr *pd; struct pmc_hw *phw; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < AMD_NPMCS, ("[amd,%d] row-index %d out of range", __LINE__, ri)); @@ -825,7 +825,7 @@ amd_init(int cpu) struct amd_cpu *pcs; struct pmc_hw *phw; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] insane cpu number %d", __LINE__, cpu)); PMCDBG(MDP,INI,1,"amd-init cpu=%d", cpu); @@ -868,7 +868,7 @@ amd_cleanup(int cpu) uint32_t evsel; struct pmc_cpu *pcs; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] insane cpu number (%d)", __LINE__, cpu)); PMCDBG(MDP,INI,1,"amd-cleanup cpu=%d", cpu); Index: 
sys/dev/hwpmc/hwpmc_mod.c =================================================================== RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_mod.c,v retrieving revision 1.1.1.1 retrieving revision 1.4 diff -u -p -r1.1.1.1 -r1.4 --- sys/dev/hwpmc/hwpmc_mod.c 21 Jun 2006 03:30:03 -0000 1.1.1.1 +++ sys/dev/hwpmc/hwpmc_mod.c 30 Oct 2007 18:00:43 -0000 1.4 @@ -615,7 +615,7 @@ pmc_restore_cpu_binding(struct pmc_bindi static void pmc_select_cpu(int cpu) { - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[pmc,%d] bad cpu number %d", __LINE__, cpu)); /* never move to a disabled CPU */ @@ -1167,7 +1167,7 @@ pmc_process_csw_in(struct thread *td) PMCDBG(CSW,SWI,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p, p->p_pid, p->p_comm, pp); - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[pmc,%d] wierd CPU id %d", __LINE__, cpu)); pc = pmc_pcpu[cpu]; @@ -1292,7 +1292,7 @@ pmc_process_csw_out(struct thread *td) PMCDBG(CSW,SWO,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p, p->p_pid, p->p_comm, pp); - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[pmc,%d wierd CPU id %d", __LINE__, cpu)); pc = pmc_pcpu[cpu]; @@ -2313,7 +2313,7 @@ pmc_stop(struct pmc *pm) cpu = PMC_TO_CPU(pm); - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[pmc,%d] illegal cpu=%d", __LINE__, cpu)); if (pmc_cpu_is_disabled(cpu)) @@ -2478,7 +2478,7 @@ pmc_syscall_handler(struct thread *td, v struct pmc_op_getcpuinfo gci; gci.pm_cputype = md->pmd_cputype; - gci.pm_ncpu = mp_ncpus; + gci.pm_ncpu = mp_maxid + 1; gci.pm_npmc = md->pmd_npmc; gci.pm_nclass = md->pmd_nclass; bcopy(md->pmd_classes, &gci.pm_classes, @@ -2546,7 +2546,7 @@ pmc_syscall_handler(struct thread *td, v if ((error = copyin(&gpi->pm_cpu, &cpu, sizeof(cpu))) != 0) break; - if (cpu >= (unsigned int) mp_ncpus) { + if (cpu > (unsigned int) mp_maxid) { error = EINVAL; break; } @@ -2641,7 +2641,7 @@ pmc_syscall_handler(struct thread *td, v cpu = 
pma.pm_cpu; - if (cpu < 0 || cpu >= mp_ncpus) { + if (cpu < 0 || cpu > mp_maxid) { error = EINVAL; break; } @@ -2734,7 +2734,7 @@ pmc_syscall_handler(struct thread *td, v if ((mode != PMC_MODE_SS && mode != PMC_MODE_SC && mode != PMC_MODE_TS && mode != PMC_MODE_TC) || - (cpu != (u_int) PMC_CPU_ANY && cpu >= (u_int) mp_ncpus)) { + (cpu != (u_int) PMC_CPU_ANY && cpu > (u_int) mp_maxid)) { error = EINVAL; break; } @@ -3973,16 +3973,16 @@ pmc_initialize(void) return ENOSYS; /* allocate space for the per-cpu array */ - MALLOC(pmc_pcpu, struct pmc_cpu **, mp_ncpus * sizeof(struct pmc_cpu *), - M_PMC, M_WAITOK|M_ZERO); + MALLOC(pmc_pcpu, struct pmc_cpu **, + (mp_maxid + 1) * sizeof(struct pmc_cpu *), M_PMC, M_WAITOK|M_ZERO); /* per-cpu 'saved values' for managing process-mode PMCs */ MALLOC(pmc_pcpu_saved, pmc_value_t *, - sizeof(pmc_value_t) * mp_ncpus * md->pmd_npmc, M_PMC, M_WAITOK); + sizeof(pmc_value_t) * (mp_maxid + 1) * md->pmd_npmc, M_PMC, M_WAITOK); /* perform cpu dependent initialization */ pmc_save_cpu_binding(&pb); - for (cpu = 0; cpu < mp_ncpus; cpu++) { + for (cpu = 0; cpu <= mp_maxid; cpu++) { if (pmc_cpu_is_disabled(cpu)) continue; pmc_select_cpu(cpu); @@ -3995,7 +3995,7 @@ pmc_initialize(void) return error; /* allocate space for the sample array */ - for (cpu = 0; cpu < mp_ncpus; cpu++) { + for (cpu = 0; cpu <= mp_maxid; cpu++) { if (pmc_cpu_is_disabled(cpu)) continue; MALLOC(sb, struct pmc_samplebuffer *, @@ -4156,7 +4156,7 @@ pmc_cleanup(void) ("[pmc,%d] Global SS count not empty", __LINE__)); /* free the per-cpu sample buffers */ - for (cpu = 0; cpu < mp_ncpus; cpu++) { + for (cpu = 0; cpu <= mp_maxid; cpu++) { if (pmc_cpu_is_disabled(cpu)) continue; KASSERT(pmc_pcpu[cpu]->pc_sb != NULL, @@ -4170,7 +4170,7 @@ pmc_cleanup(void) PMCDBG(MOD,INI,3, "%s", "md cleanup"); if (md) { pmc_save_cpu_binding(&pb); - for (cpu = 0; cpu < mp_ncpus; cpu++) { + for (cpu = 0; cpu <= mp_maxid; cpu++) { PMCDBG(MOD,INI,1,"pmc-cleanup cpu=%d pcs=%p", cpu, pmc_pcpu[cpu]); if 
(pmc_cpu_is_disabled(cpu)) Index: sys/dev/hwpmc/hwpmc_piv.c =================================================================== RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_piv.c,v retrieving revision 1.1.1.1 retrieving revision 1.4 diff -u -p -r1.1.1.1 -r1.4 --- sys/dev/hwpmc/hwpmc_piv.c 21 Jun 2006 03:30:03 -0000 1.1.1.1 +++ sys/dev/hwpmc/hwpmc_piv.c 30 Oct 2007 18:00:43 -0000 1.4 @@ -585,7 +585,7 @@ p4_init(int cpu) struct p4_logicalcpu *plcs; struct pmc_hw *phw; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] insane cpu number %d", __LINE__, cpu)); PMCDBG(MDP,INI,0, "p4-init cpu=%d logical=%d", cpu, @@ -737,7 +737,7 @@ p4_read_pmc(int cpu, int ri, pmc_value_t struct pmc_hw *phw; pmc_value_t tmp; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[p4,%d] illegal row-index %d", __LINE__, ri)); @@ -815,7 +815,7 @@ p4_write_pmc(int cpu, int ri, pmc_value_ const struct pmc_hw *phw; const struct p4pmc_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[amd,%d] illegal row-index %d", __LINE__, ri)); @@ -889,7 +889,7 @@ p4_config_pmc(int cpu, int ri, struct pm struct p4_cpu *pc; int cfgflags, cpuflag; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[p4,%d] illegal row-index %d", __LINE__, ri)); @@ -1026,7 +1026,7 @@ p4_allocate_pmc(int cpu, int ri, struct struct p4_event_descr *pevent; const struct p4pmc_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[p4,%d] illegal row-index value %d", __LINE__, ri)); @@ -1273,7 +1273,7 @@ p4_start_pmc(int cpu, int ri) struct pmc_hw 
*phw; struct p4pmc_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[p4,%d] illegal row-index %d", __LINE__, ri)); @@ -1425,7 +1425,7 @@ p4_stop_pmc(int cpu, int ri) struct p4pmc_descr *pd; pmc_value_t tmp; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[p4,%d] illegal row index %d", __LINE__, ri)); @@ -1694,7 +1694,7 @@ p4_describe(int cpu, int ri, struct pmc_ struct pmc_hw *phw; const struct p4pmc_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P4_NPMCS, ("[p4,%d] row-index %d out of range", __LINE__, ri)); Index: sys/dev/hwpmc/hwpmc_ppro.c =================================================================== RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_ppro.c,v retrieving revision 1.1.1.1 retrieving revision 1.4 diff -u -p -r1.1.1.1 -r1.4 --- sys/dev/hwpmc/hwpmc_ppro.c 21 Jun 2006 03:30:03 -0000 1.1.1.1 +++ sys/dev/hwpmc/hwpmc_ppro.c 30 Oct 2007 18:00:43 -0000 1.4 @@ -331,7 +331,7 @@ p6_init(int cpu) struct p6_cpu *pcs; struct pmc_hw *phw; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p6,%d] bad cpu %d", __LINE__, cpu)); PMCDBG(MDP,INI,0,"p6-init cpu=%d", cpu); @@ -361,7 +361,7 @@ p6_cleanup(int cpu) { struct pmc_cpu *pcs; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p6,%d] bad cpu %d", __LINE__, cpu)); PMCDBG(MDP,INI,0,"p6-cleanup cpu=%d", cpu); @@ -507,7 +507,7 @@ p6_allocate_pmc(int cpu, int ri, struct (void) cpu; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p4,%d] illegal CPU %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P6_NPMCS, ("[p4,%d] illegal row-index value %d", __LINE__, ri)); @@ -611,7 +611,7 @@ 
p6_release_pmc(int cpu, int ri, struct p PMCDBG(MDP,REL,1, "p6-release cpu=%d ri=%d pm=%p", cpu, ri, pm); - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p6,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P6_NPMCS, ("[p6,%d] illegal row-index %d", __LINE__, ri)); @@ -633,7 +633,7 @@ p6_start_pmc(int cpu, int ri) struct pmc_hw *phw; const struct p6pmc_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p6,%d] illegal CPU value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P6_NPMCS, ("[p6,%d] illegal row-index %d", __LINE__, ri)); @@ -677,7 +677,7 @@ p6_stop_pmc(int cpu, int ri) struct pmc_hw *phw; struct p6pmc_descr *pd; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p6,%d] illegal cpu value %d", __LINE__, cpu)); KASSERT(ri >= 0 && ri < P6_NPMCS, ("[p6,%d] illegal row index %d", __LINE__, ri)); @@ -719,7 +719,7 @@ p6_intr(int cpu, uintptr_t eip, int user struct pmc_hw *phw; pmc_value_t v; - KASSERT(cpu >= 0 && cpu < mp_ncpus, + KASSERT(cpu >= 0 && cpu <= mp_maxid, ("[p6,%d] CPU %d out of range", __LINE__, cpu)); retval = 0; Index: usr.sbin/pmccontrol/pmccontrol.c =================================================================== RCS file: /cvs/junos-2001/src/usr.sbin/pmccontrol/pmccontrol.c,v retrieving revision 1.1.1.1 retrieving revision 1.4 diff -u -p -r1.1.1.1 -r1.4 --- usr.sbin/pmccontrol/pmccontrol.c 3 Nov 2006 01:43:32 -0000 1.1.1.1 +++ usr.sbin/pmccontrol/pmccontrol.c 29 Nov 2007 22:47:14 -0000 1.4 @@ -207,10 +207,16 @@ pmcc_do_enable_disable(struct pmcc_op_li else if (b == PMCC_OP_DISABLE) error = pmc_disable(i, j); - if (error < 0) + if (error < 0) { + if (errno == ENXIO) { + /* This cpu wasn't configured. */ + error = 0; + continue; + } err(EX_OSERR, "%s of PMC %d on CPU %d failed", b == PMCC_OP_ENABLE ? 
"Enable" : "Disable", j, i); + } } return error; @@ -242,9 +248,14 @@ pmcc_do_list_state(void) (logical_cpus_mask & (1 << cpu))) continue; /* skip P4-style 'logical' cpus */ #endif - if (pmc_pmcinfo(cpu, &pi) < 0) + if (pmc_pmcinfo(cpu, &pi) < 0) { + if (errno == ENXIO) { + /* This cpu wasn't enabled. */ + continue; + } err(EX_OSERR, "Unable to get PMC status for CPU %d", cpu); + } printf("#CPU %d:\n", c++); npmc = pmc_npmc(cpu); Index: usr.sbin/pmcstat/pmcstat.c =================================================================== RCS file: /cvs/junos-2001/src/usr.sbin/pmcstat/pmcstat.c,v retrieving revision 1.1.1.1 retrieving revision 1.4 diff -u -p -r1.1.1.1 -r1.4 --- usr.sbin/pmcstat/pmcstat.c 3 Nov 2006 01:43:32 -0000 1.1.1.1 +++ usr.sbin/pmcstat/pmcstat.c 30 Aug 2007 15:03:02 -0000 1.4 @@ -692,6 +692,7 @@ main(int argc, char **argv) if ((args.pa_logparser = pmclog_open(args.pa_logfd)) == NULL) err(EX_OSERR, "ERROR: Cannot create parser"); pmcstat_process_log(&args); + pmcstat_shutdown_logging(); exit(EX_OK); } -- -- David (obrien@FreeBSD.org) Q: Because it reverses the logical flow of conversation. A: Why is top-posting (putting a reply at the top of the message) frowned upon? 
Let's not play "Jeopardy-style quoting" From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 18:54:32 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 93BF71065671 for ; Thu, 13 Mar 2008 18:54:32 +0000 (UTC) (envelope-from obrien@NUXI.org) Received: from dragon.nuxi.org (trang.nuxi.org [74.95.12.85]) by mx1.freebsd.org (Postfix) with ESMTP id 678B38FC19 for ; Thu, 13 Mar 2008 18:54:32 +0000 (UTC) (envelope-from obrien@NUXI.org) Received: from dragon.nuxi.org (obrien@localhost [127.0.0.1]) by dragon.nuxi.org (8.14.1/8.14.1) with ESMTP id m2DIsVQb085250; Thu, 13 Mar 2008 11:54:31 -0700 (PDT) (envelope-from obrien@dragon.nuxi.org) Received: (from obrien@localhost) by dragon.nuxi.org (8.14.2/8.14.1/Submit) id m2DIsVik085249; Thu, 13 Mar 2008 11:54:31 -0700 (PDT) (envelope-from obrien) Date: Thu, 13 Mar 2008 11:54:31 -0700 From: "David O'Brien" To: "Sean C. Farley" Message-ID: <20080313185431.GB85022@dragon.NUXI.org> Mail-Followup-To: obrien@freebsd.org, "Sean C. Farley" , freebsd-arch@FreeBSD.org References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <20080312145734.GB26812@dragon.NUXI.org> <47D7F1EC.6040802@cokane.org> <47D88568.7000105@cokane.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Operating-System: FreeBSD 8.0-CURRENT User-Agent: Mutt/1.5.16 (2007-06-09) Cc: freebsd-arch@FreeBSD.org Subject: Re: SMPTODO: remove timeout(9) from ffs_softdep.c X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: obrien@FreeBSD.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 18:54:32 -0000 On Thu, Mar 13, 2008 at 10:00:47AM -0500, Sean C. 
Farley wrote: > I asked my mentor to allow me to commit the file to tools/tools/editing > next to freebsd.el. There the Emacs and Vim scripts can battle it out > for all eternity. :) This is too valuable to not have ASAP. I've committed it as tools/tools/editing/freebsd.vim. (case matches emacs .el) Let the embellishment of freebsd.vim begin! > I tweaked the file a bit by adding a few comments and a mapping to > f for easy calling. BTW, I am thinking about commenting out the > mapping before committing and letting people manually activate it. This > is to avoid conflicts that may arise with existing mapping in a person's > environment. I don't see a need - no one should be blindly using dot files without reviewing them. -- -- David (obrien@FreeBSD.org) From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 19:16:54 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5DB8D1065673 for ; Thu, 13 Mar 2008 19:16:54 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id A04068FC12 for ; Thu, 13 Mar 2008 19:16:53 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235372110-1834499 for multiple; Thu, 13 Mar 2008 15:14:52 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2DJGS7t031052; Thu, 13 Mar 2008 15:16:40 -0400 (EDT) (envelope-from jhb@FreeBSD.org) From: John Baldwin To: freebsd-arch@FreeBSD.org, obrien@FreeBSD.org Date: Thu, 13 Mar 2008 15:16:12 -0400 User-Agent: KMail/1.9.7 References: <20080313180805.GA83406@dragon.NUXI.org> In-Reply-To: <20080313180805.GA83406@dragon.NUXI.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" 
Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803131516.12284.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Thu, 13 Mar 2008 15:16:41 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6225/Thu Mar 13 10:52:37 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 19:16:54 -0000 On Thursday 13 March 2008 02:08:05 pm David O'Brien wrote: > Hi folks, > Some folks at Juniper have submitted these changes to hwpmc(4). > I am sending them here for public review. > > Their thoughts are: > The mp_ncpus refers to the count of the active CPU's. Where as > mp_maxid refers to the count of all the cpus on the SMP. Using > mp_ncpus in the cpu_id range-check of hwpmc module would lead to the > assumption that all the active CPU's in the SMP are not interleaved. > But for running on some platforms, the active and inactive cpus could > be interleaved making hwpmc not work for the cpus whose cpu_id is > greater than the active-cpu count. This is correct, but you need to handle CPUs that are absent. It might be sufficient to update pmc_cpu_is_disabled() in kern_pmc.c to check CPU_ABSENT(cpu) and claim the CPU is disabled if it is absent, but I'm not sure that will catch everything as that seems aimed at handling having a non-absent CPU halted (such as disabling HTT on i386). 
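A minimal userland sketch of the check suggested above, under stated assumptions: all_cpus, mp_ncpus, mp_maxid and halted_cpus are stand-in variables with made-up values, CPU_ABSENT() is written to resemble the kernel macro of the same name, and the body of pmc_cpu_is_disabled() is hypothetical rather than the code in kern_pmc.c. It only illustrates why a range check against mp_maxid needs an accompanying absence test.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userland stand-ins for the kernel's SMP globals.  The values are made up
 * to model a machine with sparse CPU IDs: only IDs 0, 2 and 5 are populated.
 */
static const unsigned int all_cpus = (1u << 0) | (1u << 2) | (1u << 5);
static const int mp_ncpus = 3;	/* number of CPUs present */
static const int mp_maxid = 5;	/* highest CPU ID, not a count */

/* Modelled on the kernel macro: no bit in the mask means no such CPU. */
#define	CPU_ABSENT(cpu)	((all_cpus & (1u << (cpu))) == 0)

/* Hypothetical mask of CPUs that exist but are halted (e.g. an HTT sibling). */
static const unsigned int halted_cpus = (1u << 2);

/* The pre-patch range check, kept only to show which IDs it gets wrong. */
static bool
old_range_check(int cpu)
{
	return (cpu >= 0 && cpu < mp_ncpus);
}

/*
 * Sketch of the suggested change: report an absent CPU as disabled, so that
 * every "for (cpu = 0; cpu <= mp_maxid; cpu++)" loop skips the holes.
 */
static bool
pmc_cpu_is_disabled(int cpu)
{
	assert(cpu >= 0 && cpu <= mp_maxid);
	if (CPU_ABSENT(cpu))
		return (true);
	return ((halted_cpus & (1u << cpu)) != 0);
}

/* Walk the full ID range the way the patched loops do. */
static int
count_usable_cpus(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu <= mp_maxid; cpu++) {
		if (pmc_cpu_is_disabled(cpu))
			continue;
		n++;
	}
	return (n);
}
```

With this made-up mask, the old check accepts CPU 1 (absent) and rejects CPU 5 (present), while the loop over 0 through mp_maxid counts only CPUs 0 and 5. Whether the real module needs further guards beyond pmc_cpu_is_disabled() is the open question raised above.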
> -- > -- David (obrien@FreeBSD.org) > > Index: sys/dev/hwpmc/hwpmc_amd.c > =================================================================== > RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_amd.c,v > retrieving revision 1.1.1.1 > retrieving revision 1.4 > diff -u -p -r1.1.1.1 -r1.4 > --- sys/dev/hwpmc/hwpmc_amd.c 21 Jun 2006 03:30:02 -0000 1.1.1.1 > +++ sys/dev/hwpmc/hwpmc_amd.c 30 Oct 2007 18:00:43 -0000 1.4 > @@ -265,7 +265,7 @@ amd_read_pmc(int cpu, int ri, pmc_value_ > const struct pmc_hw *phw; > pmc_value_t tmp; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -320,7 +320,7 @@ amd_write_pmc(int cpu, int ri, pmc_value > const struct pmc_hw *phw; > enum pmc_mode mode; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -367,7 +367,7 @@ amd_config_pmc(int cpu, int ri, struct p > > PMCDBG(MDP,CFG,1, "cpu=%d ri=%d pm=%p", cpu, ri, pm); > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -449,7 +449,7 @@ amd_allocate_pmc(int cpu, int ri, struct > > (void) cpu; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] illegal row index %d", __LINE__, ri)); > @@ -543,7 +543,7 @@ amd_release_pmc(int cpu, int ri, struct > > (void) pmc; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > 
("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -575,7 +575,7 @@ amd_start_pmc(int cpu, int ri) > struct pmc_hw *phw; > const struct amd_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -624,7 +624,7 @@ amd_stop_pmc(int cpu, int ri) > const struct amd_descr *pd; > uint64_t config; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -676,7 +676,7 @@ amd_intr(int cpu, uintptr_t eip, int use > struct pmc_hw *phw; > pmc_value_t v; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] out of range CPU %d", __LINE__, cpu)); > > PMCDBG(MDP,INT,1, "cpu=%d eip=%p um=%d", cpu, (void *) eip, > @@ -756,7 +756,7 @@ amd_describe(int cpu, int ri, struct pmc > const struct amd_descr *pd; > struct pmc_hw *phw; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < AMD_NPMCS, > ("[amd,%d] row-index %d out of range", __LINE__, ri)); > @@ -825,7 +825,7 @@ amd_init(int cpu) > struct amd_cpu *pcs; > struct pmc_hw *phw; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] insane cpu number %d", __LINE__, cpu)); > > PMCDBG(MDP,INI,1,"amd-init cpu=%d", cpu); > @@ -868,7 +868,7 @@ amd_cleanup(int cpu) > uint32_t evsel; > struct pmc_cpu *pcs; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] insane cpu number (%d)", __LINE__, cpu)); > > PMCDBG(MDP,INI,1,"amd-cleanup cpu=%d", cpu); > Index: sys/dev/hwpmc/hwpmc_mod.c > =================================================================== > RCS 
file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_mod.c,v > retrieving revision 1.1.1.1 > retrieving revision 1.4 > diff -u -p -r1.1.1.1 -r1.4 > --- sys/dev/hwpmc/hwpmc_mod.c 21 Jun 2006 03:30:03 -0000 1.1.1.1 > +++ sys/dev/hwpmc/hwpmc_mod.c 30 Oct 2007 18:00:43 -0000 1.4 > @@ -615,7 +615,7 @@ pmc_restore_cpu_binding(struct pmc_bindi > static void > pmc_select_cpu(int cpu) > { > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[pmc,%d] bad cpu number %d", __LINE__, cpu)); > > /* never move to a disabled CPU */ > @@ -1167,7 +1167,7 @@ pmc_process_csw_in(struct thread *td) > PMCDBG(CSW,SWI,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p, > p->p_pid, p->p_comm, pp); > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[pmc,%d] wierd CPU id %d", __LINE__, cpu)); > > pc = pmc_pcpu[cpu]; > @@ -1292,7 +1292,7 @@ pmc_process_csw_out(struct thread *td) > PMCDBG(CSW,SWO,1, "cpu=%d proc=%p (%d, %s) pp=%p", cpu, p, > p->p_pid, p->p_comm, pp); > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[pmc,%d wierd CPU id %d", __LINE__, cpu)); > > pc = pmc_pcpu[cpu]; > @@ -2313,7 +2313,7 @@ pmc_stop(struct pmc *pm) > > cpu = PMC_TO_CPU(pm); > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[pmc,%d] illegal cpu=%d", __LINE__, cpu)); > > if (pmc_cpu_is_disabled(cpu)) > @@ -2478,7 +2478,7 @@ pmc_syscall_handler(struct thread *td, v > struct pmc_op_getcpuinfo gci; > > gci.pm_cputype = md->pmd_cputype; > - gci.pm_ncpu = mp_ncpus; > + gci.pm_ncpu = mp_maxid + 1; > gci.pm_npmc = md->pmd_npmc; > gci.pm_nclass = md->pmd_nclass; > bcopy(md->pmd_classes, &gci.pm_classes, > @@ -2546,7 +2546,7 @@ pmc_syscall_handler(struct thread *td, v > if ((error = copyin(&gpi->pm_cpu, &cpu, sizeof(cpu))) != 0) > break; > > - if (cpu >= (unsigned int) mp_ncpus) { > + if (cpu > (unsigned int) mp_maxid) { > error = EINVAL; > break; > } > @@ -2641,7 +2641,7 @@ 
pmc_syscall_handler(struct thread *td, v > > cpu = pma.pm_cpu; > > - if (cpu < 0 || cpu >= mp_ncpus) { > + if (cpu < 0 || cpu > mp_maxid) { > error = EINVAL; > break; > } > @@ -2734,7 +2734,7 @@ pmc_syscall_handler(struct thread *td, v > > if ((mode != PMC_MODE_SS && mode != PMC_MODE_SC && > mode != PMC_MODE_TS && mode != PMC_MODE_TC) || > - (cpu != (u_int) PMC_CPU_ANY && cpu >= (u_int) mp_ncpus)) { > + (cpu != (u_int) PMC_CPU_ANY && cpu > (u_int) mp_maxid)) { > error = EINVAL; > break; > } > @@ -3973,16 +3973,16 @@ pmc_initialize(void) > return ENOSYS; > > /* allocate space for the per-cpu array */ > - MALLOC(pmc_pcpu, struct pmc_cpu **, mp_ncpus * sizeof(struct pmc_cpu *), > - M_PMC, M_WAITOK|M_ZERO); > + MALLOC(pmc_pcpu, struct pmc_cpu **, > + (mp_maxid + 1) * sizeof(struct pmc_cpu *), M_PMC, M_WAITOK|M_ZERO); > > /* per-cpu 'saved values' for managing process-mode PMCs */ > MALLOC(pmc_pcpu_saved, pmc_value_t *, > - sizeof(pmc_value_t) * mp_ncpus * md->pmd_npmc, M_PMC, M_WAITOK); > + sizeof(pmc_value_t) * (mp_maxid + 1) * md->pmd_npmc, M_PMC, M_WAITOK); > > /* perform cpu dependent initialization */ > pmc_save_cpu_binding(&pb); > - for (cpu = 0; cpu < mp_ncpus; cpu++) { > + for (cpu = 0; cpu <= mp_maxid; cpu++) { > if (pmc_cpu_is_disabled(cpu)) > continue; > pmc_select_cpu(cpu); > @@ -3995,7 +3995,7 @@ pmc_initialize(void) > return error; > > /* allocate space for the sample array */ > - for (cpu = 0; cpu < mp_ncpus; cpu++) { > + for (cpu = 0; cpu <= mp_maxid; cpu++) { > if (pmc_cpu_is_disabled(cpu)) > continue; > MALLOC(sb, struct pmc_samplebuffer *, > @@ -4156,7 +4156,7 @@ pmc_cleanup(void) > ("[pmc,%d] Global SS count not empty", __LINE__)); > > /* free the per-cpu sample buffers */ > - for (cpu = 0; cpu < mp_ncpus; cpu++) { > + for (cpu = 0; cpu <= mp_maxid; cpu++) { > if (pmc_cpu_is_disabled(cpu)) > continue; > KASSERT(pmc_pcpu[cpu]->pc_sb != NULL, > @@ -4170,7 +4170,7 @@ pmc_cleanup(void) > PMCDBG(MOD,INI,3, "%s", "md cleanup"); > if (md) { > 
pmc_save_cpu_binding(&pb); > - for (cpu = 0; cpu < mp_ncpus; cpu++) { > + for (cpu = 0; cpu <= mp_maxid; cpu++) { > PMCDBG(MOD,INI,1,"pmc-cleanup cpu=%d pcs=%p", > cpu, pmc_pcpu[cpu]); > if (pmc_cpu_is_disabled(cpu)) > Index: sys/dev/hwpmc/hwpmc_piv.c > =================================================================== > RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_piv.c,v > retrieving revision 1.1.1.1 > retrieving revision 1.4 > diff -u -p -r1.1.1.1 -r1.4 > --- sys/dev/hwpmc/hwpmc_piv.c 21 Jun 2006 03:30:03 -0000 1.1.1.1 > +++ sys/dev/hwpmc/hwpmc_piv.c 30 Oct 2007 18:00:43 -0000 1.4 > @@ -585,7 +585,7 @@ p4_init(int cpu) > struct p4_logicalcpu *plcs; > struct pmc_hw *phw; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] insane cpu number %d", __LINE__, cpu)); > > PMCDBG(MDP,INI,0, "p4-init cpu=%d logical=%d", cpu, > @@ -737,7 +737,7 @@ p4_read_pmc(int cpu, int ri, pmc_value_t > struct pmc_hw *phw; > pmc_value_t tmp; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[p4,%d] illegal row-index %d", __LINE__, ri)); > @@ -815,7 +815,7 @@ p4_write_pmc(int cpu, int ri, pmc_value_ > const struct pmc_hw *phw; > const struct p4pmc_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[amd,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[amd,%d] illegal row-index %d", __LINE__, ri)); > @@ -889,7 +889,7 @@ p4_config_pmc(int cpu, int ri, struct pm > struct p4_cpu *pc; > int cfgflags, cpuflag; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[p4,%d] illegal row-index %d", __LINE__, ri)); > @@ -1026,7 +1026,7 @@ p4_allocate_pmc(int cpu, int ri, struct > struct p4_event_descr *pevent; > const struct 
p4pmc_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[p4,%d] illegal row-index value %d", __LINE__, ri)); > @@ -1273,7 +1273,7 @@ p4_start_pmc(int cpu, int ri) > struct pmc_hw *phw; > struct p4pmc_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[p4,%d] illegal row-index %d", __LINE__, ri)); > @@ -1425,7 +1425,7 @@ p4_stop_pmc(int cpu, int ri) > struct p4pmc_descr *pd; > pmc_value_t tmp; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[p4,%d] illegal row index %d", __LINE__, ri)); > @@ -1694,7 +1694,7 @@ p4_describe(int cpu, int ri, struct pmc_ > struct pmc_hw *phw; > const struct p4pmc_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P4_NPMCS, > ("[p4,%d] row-index %d out of range", __LINE__, ri)); > Index: sys/dev/hwpmc/hwpmc_ppro.c > =================================================================== > RCS file: /cvs/junos-2001/src/sys/dev/hwpmc/hwpmc_ppro.c,v > retrieving revision 1.1.1.1 > retrieving revision 1.4 > diff -u -p -r1.1.1.1 -r1.4 > --- sys/dev/hwpmc/hwpmc_ppro.c 21 Jun 2006 03:30:03 -0000 1.1.1.1 > +++ sys/dev/hwpmc/hwpmc_ppro.c 30 Oct 2007 18:00:43 -0000 1.4 > @@ -331,7 +331,7 @@ p6_init(int cpu) > struct p6_cpu *pcs; > struct pmc_hw *phw; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p6,%d] bad cpu %d", __LINE__, cpu)); > > PMCDBG(MDP,INI,0,"p6-init cpu=%d", cpu); > @@ -361,7 +361,7 @@ p6_cleanup(int cpu) > { > struct pmc_cpu *pcs; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu 
<= mp_maxid, > ("[p6,%d] bad cpu %d", __LINE__, cpu)); > > PMCDBG(MDP,INI,0,"p6-cleanup cpu=%d", cpu); > @@ -507,7 +507,7 @@ p6_allocate_pmc(int cpu, int ri, struct > > (void) cpu; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p4,%d] illegal CPU %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P6_NPMCS, > ("[p4,%d] illegal row-index value %d", __LINE__, ri)); > @@ -611,7 +611,7 @@ p6_release_pmc(int cpu, int ri, struct p > > PMCDBG(MDP,REL,1, "p6-release cpu=%d ri=%d pm=%p", cpu, ri, pm); > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p6,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P6_NPMCS, > ("[p6,%d] illegal row-index %d", __LINE__, ri)); > @@ -633,7 +633,7 @@ p6_start_pmc(int cpu, int ri) > struct pmc_hw *phw; > const struct p6pmc_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p6,%d] illegal CPU value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P6_NPMCS, > ("[p6,%d] illegal row-index %d", __LINE__, ri)); > @@ -677,7 +677,7 @@ p6_stop_pmc(int cpu, int ri) > struct pmc_hw *phw; > struct p6pmc_descr *pd; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p6,%d] illegal cpu value %d", __LINE__, cpu)); > KASSERT(ri >= 0 && ri < P6_NPMCS, > ("[p6,%d] illegal row index %d", __LINE__, ri)); > @@ -719,7 +719,7 @@ p6_intr(int cpu, uintptr_t eip, int user > struct pmc_hw *phw; > pmc_value_t v; > > - KASSERT(cpu >= 0 && cpu < mp_ncpus, > + KASSERT(cpu >= 0 && cpu <= mp_maxid, > ("[p6,%d] CPU %d out of range", __LINE__, cpu)); > > retval = 0; > Index: usr.sbin/pmccontrol/pmccontrol.c > =================================================================== > RCS file: /cvs/junos-2001/src/usr.sbin/pmccontrol/pmccontrol.c,v > retrieving revision 1.1.1.1 > retrieving revision 1.4 > diff -u -p -r1.1.1.1 -r1.4 > --- usr.sbin/pmccontrol/pmccontrol.c 3 Nov 2006 01:43:32 -0000 1.1.1.1 > 
+++ usr.sbin/pmccontrol/pmccontrol.c 29 Nov 2007 22:47:14 -0000 1.4 > @@ -207,10 +207,16 @@ pmcc_do_enable_disable(struct pmcc_op_li > else if (b == PMCC_OP_DISABLE) > error = pmc_disable(i, j); > > - if (error < 0) > + if (error < 0) { > + if (errno == ENXIO) { > + /* This cpu wasn't configured. */ > + error = 0; > + continue; > + } > err(EX_OSERR, "%s of PMC %d on CPU %d failed", > b == PMCC_OP_ENABLE ? "Enable" : > "Disable", j, i); > + } > } > > return error; > @@ -242,9 +248,14 @@ pmcc_do_list_state(void) > (logical_cpus_mask & (1 << cpu))) > continue; /* skip P4-style 'logical' cpus */ > #endif > - if (pmc_pmcinfo(cpu, &pi) < 0) > + if (pmc_pmcinfo(cpu, &pi) < 0) { > + if (errno == ENXIO) { > + /* This cpu wasn't enabled. */ > + continue; > + } > err(EX_OSERR, "Unable to get PMC status for CPU %d", > cpu); > + } > > printf("#CPU %d:\n", c++); > npmc = pmc_npmc(cpu); > Index: usr.sbin/pmcstat/pmcstat.c > =================================================================== > RCS file: /cvs/junos-2001/src/usr.sbin/pmcstat/pmcstat.c,v > retrieving revision 1.1.1.1 > retrieving revision 1.4 > diff -u -p -r1.1.1.1 -r1.4 > --- usr.sbin/pmcstat/pmcstat.c 3 Nov 2006 01:43:32 -0000 1.1.1.1 > +++ usr.sbin/pmcstat/pmcstat.c 30 Aug 2007 15:03:02 -0000 1.4 > @@ -692,6 +692,7 @@ main(int argc, char **argv) > if ((args.pa_logparser = pmclog_open(args.pa_logfd)) == NULL) > err(EX_OSERR, "ERROR: Cannot create parser"); > pmcstat_process_log(&args); > + pmcstat_shutdown_logging(); > exit(EX_OK); > } > > > -- > -- David (obrien@FreeBSD.org) > Q: Because it reverses the logical flow of conversation. > A: Why is top-posting (putting a reply at the top of the message) frowned upon? 
> Let's not play "Jeopardy-style quoting" > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 23:12:41 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5DC021065670 for ; Thu, 13 Mar 2008 23:12:41 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from mail.farley.org (farley.org [67.64.95.201]) by mx1.freebsd.org (Postfix) with ESMTP id 19D338FC19 for ; Thu, 13 Mar 2008 23:12:41 +0000 (UTC) (envelope-from scf@FreeBSD.org) Received: from thor.farley.org (thor.farley.org [192.168.1.5]) by mail.farley.org (8.14.2/8.14.2) with ESMTP id m2DNCdTl065151 for ; Thu, 13 Mar 2008 18:12:39 -0500 (CDT) (envelope-from scf@FreeBSD.org) Date: Thu, 13 Mar 2008 18:12:39 -0500 (CDT) From: "Sean C. 
Farley" To: freebsd-arch@FreeBSD.org In-Reply-To: <20080313185431.GB85022@dragon.NUXI.org> Message-ID: References: <47D7C25D.5070908@cokane.org> <200803120945.29018.jhb@freebsd.org> <47D7E5BF.2060102@cokane.org> <20080312145734.GB26812@dragon.NUXI.org> <47D7F1EC.6040802@cokane.org> <47D88568.7000105@cokane.org> <20080313185431.GB85022@dragon.NUXI.org> User-Agent: Alpine 1.00 (BSF 882 2007-12-20) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on mail.farley.org Cc: Subject: FreeBSD Vim style(9) plugin (was Re: SMPTODO: remove timeout(9) from ffs_softdep.c) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 23:12:41 -0000 On Thu, 13 Mar 2008, David O'Brien wrote: > On Thu, Mar 13, 2008 at 10:00:47AM -0500, Sean C. Farley wrote: >> I asked my mentor to allow me to commit the file to >> tools/tools/editing next to freebsd.el. There the Emacs and Vim >> scripts can battle it out for all eternity. :) > > This is too valuable to not have ASAP. > I've committed it as tools/tools/editing/freebsd.vim. (case matches > emacs .el) > > Let the embellishment of freebsd.vim begin! Thank you for the commit. Before I wrote it, I looked around for something similar assuming that some other FreeBSD developers must be using Vim. I assumed others have written similar plugins, or their default was FreeBSD style(9). >> I tweaked the file a bit by adding a few comments and a mapping to >> f for easy calling. BTW, I am thinking about commenting out >> the mapping before committing and letting people manually activate >> it. This is to avoid conflicts that may arise with existing mapping >> in a person's environment. 
> > I don't see a need - no one should be blindly using dot files without > reviewing them. OK. Just wondered. Sean -- scf@FreeBSD.org From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 23:35:47 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C3D4E106567E; Thu, 13 Mar 2008 23:35:47 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id 70B6D8FC1F; Thu, 13 Mar 2008 23:35:47 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2DNZdBd081656; Thu, 13 Mar 2008 19:35:40 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Thu, 13 Mar 2008 13:36:47 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Bruce Evans In-Reply-To: <20080313230809.W32527@delplex.bde.org> Message-ID: <20080313132152.Y1091@desktop> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, David Xu , Peter Wemm Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 23:35:47 -0000 On Fri, 14 Mar 2008, Bruce Evans wrote: > On Wed, 12 Mar 2008, Jeff Roberson wrote: > >> On Thu, 13 Mar 2008, Bruce Evans wrote: >> >>> On Wed, 12 Mar 2008, Peter Wemm wrote: >>> >>>> On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote: >>>>> Jeff Roberson wrote: >>>>> > http://people.freebsd.org/~jeff/amd64.diff >>>>> >>>>> This is a good idea. >>> >>> I wouldn't have expected it to make much difference. On i386 UP, >>> cpu_switch() normally executes only 48 instructions for in-kernel >>> context switches in my version of 5.2 and only 61 instructions in >>> -current. ~5.2 differs from 5.2 here only in not having to >>> switch %eflags. This saves 4 instructions but much more in cycles, >>> especially in P4 where accesses to %eflags are very slow. 5.2 would >>> take 52 instructions, and -current has bloated by 9 instructions >>> relative to 5.2. >> >> More expensive than the raw instruction count is: >> >> 1) The mispredicted branches to deal with all of the optional state and >> features that are not always saved. > > This is unlikely to matter, and apparently doesn't, at least in simple > benchmarks, since the C version has even more branches. Features that > are rarely used cause branches that are usually perfectly predicted. The C version has two fewer branches because it tests for two unlikely features together. It has a few more branches than the in-CVS asm version and the same number of extra branches as Peter's asm version to support conditional gs/fsbase setting. The other extra branches have to do with supporting cpu_switch() and cpu_throw() together. > >> 2) The cost of extra icache for getting over all of those unused >> instructions, unaligned jumps, etc.
> > Again, if this were the cause of slowness then it would affect the C > version more, since the C version is larger. The C version is not larger than the asm version at high optimization levels when you consider the total instruction count that is brought into the icache. It's worth noting that my C version is slower in some cases other than the microbenchmark due to extra instructions for optimizations that don't matter. Peter's asm version is tight enough that the extra compares don't cost more than the compacted code wins. The C version touches more distinct icache lines but makes up for it in other optimizations in the common case. > > In fact, the benchmark is probably too simple to show the cost of > branches. Just doing sched_yield() in a loop gives the following > atypical behaviour which may be atypical enough for the larger branch > and cache costs for the C version to not have much effect: > - it doesn't go near most of the special cases, so branches are > predictable (always non-special) and are thus predicted provided > (a) the CPU actually does reasonably good branch prediction, and > (b) the branch predictions fit in the branch prediction cache > (reasonably good branch prediction probably requires such a > cache). This cache is surely virtual as it happens in the first few stages of the pipeline. That means it's flushed on every switch. We're probably coming in cold every time. > - it doesn't touch much icache or dcache or branch-cache, so > everything probably stays cached. > > If just the branch-cache were thrashed, then reasonably good dynamic > branch prediction is impossible and things would be slow. In the C > version, you use predict_true() and predict_false() a lot. This > might improve static branch prediction but makes little difference > if the branch cache is working. I doubt there are any cases where the branch cache is effective here.
I don't know that for certain but it seems unlikely that it would be preserved across switches due to the complexity in validating addresses. > > The C version uses lots of non-inline function calls. Just the > branches for this would have a significant overhead if the branches > are mispredicted. I think you are depending on gcc's auto-inlining > of static functions which are only called once to avoid the full > cost of the function calls. I depend on it not inlining them to avoid polluting the icache with unused instructions. I broke that with my most recent patch by moving the calls back into C. > >> I haven't looked at i386 very closely lately but on amd64 the wrmsrs for >> fs/gsbase are very expensive. On my 2ghz dual core opteron the optimized >> switch seems to take about 100ns. The total switch from userspace to >> userspace is about 4x that. > > Probably avoiding these is the only significant large between all > the versions. You use predict_false() for executing them. Are fsbase > and gsbase really usually constant across processes? If they are non threaded, yes. > > 400nS is about what I get for i386 on 2.2GHz A64 UP too (6.17 S for > ./yield 1000000 10). getpid() on this machine takes 180nS so it is > unreasonable to expect sched_yield() to take much less than a few hundred > nS. > > Some perfmon output for ./yield 100000 10: > > % # s/kx-ls-microarchitectural-resync-by-self-mod-code % 0 > % # s/kx-ls-buffer2-full % 909905 > % # s/kx-ls-retired-cflush-instructions % 0 > % # s/kx-ls-retired-cpuid-instructions % 0 > % # s/kx-dc-accesses % 496436422 > % # s/kx-dc-misses % 11102024 > > 11 cache dmisses per yield. Probably the main cause of slowness (main > memory latency on this machine is 42 nsec so 11 cache misses takes > 462 of the 617 nS per call?). Yes I reduced that recently by reordering struct tdq and td_sched some. It would be even better if we could group the scheduling related fields of td_* near the bottom with td_sched. 
This would require more tedious initialization in fork and would be prone to being disturbed by people adding fields to struct thread wherever they please. Ultimately it doesn't matter that much except in this microbenchmark anyway. > > % # s/kx-dc-refills-from-l2 % 0 > % # s/kx-dc-refills-from-system % 0 > % # s/kx-dc-writebacks % 0 > % # s/kx-dc-l1-dtlb-miss-and-l2-dtlb-hits % 3459100 > % # s/kx-dc-l1-and-l2-dtlb-misses % 2138231 > % # s/kx-dc-misaligned-references % 87 > % # s/kx-dc-microarchitectural-late-cancel-of-an-access % 73146415 > % # s/kx-dc-microarchitectural-early-cancel-of-an-access % 236927303 > % # s/kx-bu-cpu-clk-unhalted % 1303921314 > % # s/kx-ic-fetches % 236207869 > % # s/kx-ic-misses % 22988 > > Insignificant icache misses. > > % # s/kx-ic-refill-from-l2 % 18979 > % # s/kx-ic-refill-from-system % 4191 > % # s/kx-ic-l1-itlb-misses % 0 > % # s/kx-ic-l1-l2-itlb-misses % 1619297 > % # s/kx-ic-instruction-fetch-stall % 1034570822 > % # s/kx-ic-return-stack-hit % 20822416 > % # s/kx-ic-return-stack-overflow % 5870 > % # s/kx-fr-retired-instructions % 701240247 > % # s/kx-fr-retired-ops % 1163464391 > % # s/kx-fr-retired-branches % 121636370 > % # s/kx-fr-retired-branches-mispredicted % 2761910 > % # s/kx-fr-retired-taken-branches % 93488548 > % # s/kx-fr-retired-taken-branches-mispredicted % 2848315 > > 2.8 branches mispredicted per call. > > # s/kx-fr-retired-far-control-transfers % 2000934 > > 1 int0x80 and 1 iret per sched_yield(), and apparently not much else. > > % # s/kx-fr-retired-resync-branches % 936968 > % # s/kx-fr-retired-near-returns % 19008374 > % # s/kx-fr-retired-near-returns-mispredicted % 784103 > > 0.8 returns mispredicted per call. > > % # s/kx-fr-retired-taken-branches-mispred-by-addr-miscompare % 721241 > % # s/kx-fr-interrupts-masked-cycles % 658462615 > > Ugh, this is from spinlocks bogusly masking interrupts. More than half > the cycles have interrupts masked.
This at least shows that lots of > time is being spent near cpu_switch() with a spinlock held. > I'm not sure why you feel masking interrupts in spinlocks is bogus. It's central to our SMP strategy. Unless you think we should do it lazily like we do with critical_*. I know jhb had that working at one point but it was abandoned. > % # s/kx-fr-interrupts-masked-while-pending-cycles % 9365 > > Since the CPU is reasonably fast, interrupts aren't masked for very long > each time. This maximum is still 4.5 uS. > > % # s/kx-fr-hardware-interrupts % 63 > % # s/kx-fr-decoder-empty % 247898696 > % # s/kx-fr-dispatch-stalls % 589228741 > % # s/kx-fr-dispatch-stall-from-branch-abort-to-retire % 39894120 > % # s/kx-fr-dispatch-stall-for-serialization % 44037193 > % # s/kx-fr-dispatch-stall-for-segment-load % 134520281 > > 134 cycles per call. This may be more for ones in syscall() generally. > I think each segreg load still costs ~20 cycles. Since this is on > i386, there are 6 per call (%ds, %es and %fs save and restore), plus > %ss save and which might not be counted here. 134 is a lot -- about > 60nS of the 180nS for getpid(). > > % # s/kx-fr-dispatch-stall-when-reorder-buffer-is-full % 18648001 > % # s/kx-fr-dispatch-stall-when-reservation-stations-are-full % 121485247 > % # s/kx-fr-dispatch-stall-when-fpu-is-full % 19 > % # s/kx-fr-dispatch-stall-when-ls-is-full % 203578275 > % # s/kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet % 63136307 > % # s/kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending % 6994131 > >>> In-kernel switches are not a very typical case since they don't load >>> %cr3... >> >> We've been working on amd64 so I can't comment specifically about i386 >> costs. However, I definitely agree that cpu_switch() is not the greatest >> overhead in the path. Also, you have to load cr3 even for kernel threads >> because the page directory page or page directory pointer table at %cr3 can >> go away once you've switched out the old thread.
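The per-call figures quoted alongside the perfmon output above (11 dcache misses, 2.8 mispredicted taken branches, 0.8 mispredicted returns, 134 segment-load stall cycles) come from dividing each raw counter by the number of sched_yield() calls. A minimal sketch of that arithmetic follows. It assumes that "./yield 100000 10" performs 100000 * 10 calls, which is consistent with the quoted per-call numbers but is not stated outright in the thread; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: "./yield 100000 10" makes 100000 * 10 sched_yield() calls.
 * This matches the per-call figures quoted in the thread.
 */
#define SK_CALLS (100000ULL * 10ULL)

/* Average events per sched_yield() call for one raw perfmon counter. */
static double
sk_per_call(uint64_t counter)
{
    assert(SK_CALLS > 0);
    return ((double)counter / (double)SK_CALLS);
}

/* Time attributed to cache misses: misses per call times memory latency. */
static double
sk_miss_cost_ns(double misses_per_call, double latency_ns)
{
    return (misses_per_call * latency_ns);
}
```

With the counters from the listing, sk_per_call(11102024) is about 11.1 dcache misses per call, and sk_miss_cost_ns(11.0, 42.0) reproduces the 462 nS estimate given for an 11-miss call at 42 nsec main memory latency.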
> > I don't see this. The switch is avoided if %cr3 wouldn't change, which > I think usually or always happens for switches between kernel threads. I see, you're saying 'between kernel threads'. There was some discussion of allowing kernel threads to use the page tables of whichever thread was last switched in to avoid cr3 in all cases for them. This requires other changes to be safe however. > >>> The asm code already saves only call-saved registers for both i386 and >>> amd64. It saves call-saved registers even when it apparently doesn't >>> use them (lots more of these on amd64, while on i386 it uses more >>> call-saved registers than it needs to, apparently since this is free >>> after saving all call-saved registers). I think saving more than is >>> needed is the result of confusion about what needs to be saved and/or >>> what is needed for debugging. >> >> It has to save all of the callee saved registers in the PCB because they >> will likely differ from thread to thread. Failing to save and restore them >> could leave you returning with the registers having different values and >> corrupt the calling function. > > Yes, I had forgotten the detail of how the non-local flow of control can > change the registers (the next call to the function in the context of > the switched-to-process may have different values in the registers due > to changes to the registers in callers). > > All that can be done differently here is saving all the registers on the > stack (except %esp) in the usual way. This would probably be faster on > old i386's using pushal or pushl, but on amd64 pushal is not available, > and on Athlons generally (before Barcelona?) it is faster not to use pushl, > so on amd64 the registers should be saved using movl and then it is just > as easy to put them in the pcb as on the stack. > >>>> The good news is that this tuning is finally being done. It should >>>> have been done in 2003 though... 
>>> >>> How is this possible with (according to my theory) most of the context >>> switch cost being for %cr3 and upper layers? Unchanged amd64 has only >>> a few more costs than i386. Mainly 3 unconditional wrmsr's and 2 >>> unconditional rdmsr's for managing gsbase and fsbase. I thought that >>> these were hard to avoid and anyway not nearly as expensive as %cr3 loads. >> >> %cr3 is actually a lot less expensive these days with page table flush >> filters and the PG_G bit. We were able to optimize away setting the msrs >> in the case that the previous values match the new values. Apparently the >> hardware doesn't optimize this case so we have to do comparisons ourselves. >> >> That was a big chunk of the optimization. Static branch hints, reordering >> code, possibly reordering for better pipeline scheduling in peter's asm, >> etc. provide the rest. > > All the old i386 asm and probably clones of it on amd64 is certainly not > optimized globally for anything newer than an i386 (barely even an i486). > This rarely matters however. It lost more on Pentium-1's, but now out of > order execution and better branch prediction hides most inefficiencies. 
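The MSR optimization described above (skip the write when the outgoing and incoming values match, since the hardware apparently does not do so itself) can be sketched in user space as follows. This is only an illustration under stated assumptions, not the code from amd64.diff: sk_wrmsr() is a counting stand-in for the privileged wrmsr instruction, and struct sk_pcb, its fields, and the SK_* indices are hypothetical names. It also assumes the MSRs currently hold the outgoing thread's values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical indices for this sketch only (not real MSR numbers). */
enum { SK_FSBASE = 0, SK_GSBASE = 1 };

static uint64_t sk_msr[2];      /* simulated MSR contents */
static unsigned sk_wrmsr_calls; /* number of "expensive" writes performed */

/* Stand-in for the privileged wrmsr instruction. */
static void
sk_wrmsr(int msr, uint64_t val)
{
    sk_msr[msr] = val;
    sk_wrmsr_calls++;
}

/* Hypothetical per-thread saved state; only the two bases matter here. */
struct sk_pcb {
    uint64_t fsbase;
    uint64_t gsbase;
};

/*
 * Write a base MSR only when the incoming thread's value differs from the
 * outgoing thread's. The branch is hinted as unlikely, on the assumption
 * that most switches are between threads sharing the same bases.
 */
static void
sk_switch_bases(const struct sk_pcb *oldpcb, const struct sk_pcb *newpcb)
{
    assert(oldpcb != 0 && newpcb != 0);
    if (__builtin_expect(oldpcb->fsbase != newpcb->fsbase, 0))
        sk_wrmsr(SK_FSBASE, newpcb->fsbase);
    if (__builtin_expect(oldpcb->gsbase != newpcb->gsbase, 0))
        sk_wrmsr(SK_GSBASE, newpcb->gsbase);
}
```

The trade-off is two compares and two normally untaken branches per switch in exchange for avoiding the writes in the common case.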
> > Bruce > Jeff From owner-freebsd-arch@FreeBSD.ORG Thu Mar 13 23:51:18 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1B4461065671 for ; Thu, 13 Mar 2008 23:51:18 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outY.internet-mail-service.net (outY.internet-mail-service.net [216.240.47.248]) by mx1.freebsd.org (Postfix) with ESMTP id AE19F8FC22 for ; Thu, 13 Mar 2008 23:51:17 +0000 (UTC) (envelope-from julian@elischer.org) Received: from mx0.idiom.com (HELO idiom.com) (216.240.32.160) by out.internet-mail-service.net (qpsmtpd/0.40) with ESMTP; Thu, 13 Mar 2008 16:51:17 -0700 Received: from julian-mac.elischer.org (localhost [127.0.0.1]) by idiom.com (Postfix) with ESMTP id 832402D600F; Thu, 13 Mar 2008 16:51:16 -0700 (PDT) Message-ID: <47D9BDF3.80409@elischer.org> Date: Thu, 13 Mar 2008 16:51:15 -0700 From: Julian Elischer User-Agent: Thunderbird 2.0.0.12 (Macintosh/20080213) MIME-Version: 1.0 To: Jeff Roberson References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org> <20080313132152.Y1091@desktop> In-Reply-To: <20080313132152.Y1091@desktop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@freebsd.org, David Xu , Peter Wemm Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Mar 2008 23:51:18 -0000 Jeff Roberson wrote: > > > I'm not sure why you feel masking interrupts in spinlocks is bogus. > It's central to our SMP strategy. Unless you think we should do it > lazily like we do with critical_*. 
I know jhb had that working at one > point but it was abandoned. > > My memory is that we used to mask interrupts lazily in 4.x From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 02:07:07 2008 Return-Path: Delivered-To: arch@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0FC731065672 for ; Fri, 14 Mar 2008 02:07:07 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id EBE958FC17; Fri, 14 Mar 2008 02:07:06 +0000 (UTC) (envelope-from davidxu@FreeBSD.org) Received: from apple.my.domain (root@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m2E272FR055536; Fri, 14 Mar 2008 02:07:03 GMT (envelope-from davidxu@freebsd.org) Message-ID: <47D9DE17.7030605@freebsd.org> Date: Fri, 14 Mar 2008 10:08:23 +0800 From: David Xu User-Agent: Thunderbird 2.0.0.9 (X11/20071211) MIME-Version: 1.0 To: Jeff Roberson References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org> <20080313132152.Y1091@desktop> In-Reply-To: <20080313132152.Y1091@desktop> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: arch@FreeBSD.org, Peter Wemm Subject: Re: amd64 cpu_switch in C. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 02:07:07 -0000 Jeff Roberson wrote: >> Ugh, this is from spinlocks bogusly masking interrupts. More than half >> the cycles have interrupts masked. This at least shows that lots of >> time is being spent near cpu_switch() with a spinlock held. 
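For readers unfamiliar with the lazy approach mentioned in this thread, a minimal user-space sketch of the general idea follows. It illustrates deferral only and is not FreeBSD's critical_*() implementation or the 4.x code; every name in it (struct sk_cpu, sk_crit_enter() and so on) is hypothetical. Instead of disabling interrupts on entry, the protected section only bumps a nesting counter. An interrupt that arrives in the meantime is recorded as pending and handled when the outermost section exits.

```c
#include <assert.h>

/* Hypothetical per-CPU state for this sketch. */
struct sk_cpu {
    int nest;     /* nesting depth of protected sections */
    int pending;  /* an interrupt arrived while nest > 0 */
    int handled;  /* number of interrupts actually handled */
};

/* Entering is cheap: no interrupt-flag manipulation, just a counter. */
static void
sk_crit_enter(struct sk_cpu *c)
{
    c->nest++;
}

/* Called when an interrupt arrives: handle it now or defer it. */
static void
sk_deliver_intr(struct sk_cpu *c)
{
    if (c->nest > 0)
        c->pending = 1;     /* defer until the section exits */
    else
        c->handled++;       /* handle immediately */
}

/* Leaving the outermost section runs any deferred interrupt. */
static void
sk_crit_exit(struct sk_cpu *c)
{
    assert(c->nest > 0);
    if (--c->nest == 0 && c->pending) {
        c->pending = 0;
        c->handled++;
    }
}
```

The cost of masking is paid only when an interrupt really does arrive inside a protected section, which is the case the eager scheme pays for on every entry and exit.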
>> > > I'm not sure why you feel masking interrupts in spinlocks is bogus. > It's central to our SMP strategy. Unless you think we should do it > lazily like we do with critical_*. I know jhb had that working at one > point but it was abandoned. It may be that general mutex already does spinning, so spinlock is used only when interrupt should be enabled and disabled which is expensive. I don't know how many spinlocks are abused in CURRENT source code. Regards, David Xu From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 02:19:09 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 42C121065671; Fri, 14 Mar 2008 02:19:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail14.syd.optusnet.com.au (mail14.syd.optusnet.com.au [211.29.132.195]) by mx1.freebsd.org (Postfix) with ESMTP id CA3268FC14; Fri, 14 Mar 2008 02:19:08 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c220-239-252-11.carlnfd3.nsw.optusnet.com.au (c220-239-252-11.carlnfd3.nsw.optusnet.com.au [220.239.252.11]) by mail14.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2E2Ii5F014565 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 14 Mar 2008 13:18:49 +1100 Date: Fri, 14 Mar 2008 13:18:44 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Julian Elischer In-Reply-To: <47D9BDF3.80409@elischer.org> Message-ID: <20080314115225.G34431@delplex.bde.org> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org> <20080313132152.Y1091@desktop> <47D9BDF3.80409@elischer.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Peter Wemm , David Xu , arch@freebsd.org Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 02:19:09 -0000 On Thu, 13 Mar 2008, Julian Elischer wrote: > Jeff Roberson wrote: >> I'm not sure why you feel masking interrupts in spinlocks is bogus. It's >> central to our SMP strategy. Unless you think we should do it lazily like >> we do with critical_*. I know jhb had that working at one point but it was >> abandoned. Masking interrupts in spinlocks breaks fast interrupts among other things. Yes, I think it should be done like in critical_*. My version has done this for 6 years or so, but I don't really care about SMP and never made it work right for SMP. Its main impact is on fast interrupt handlers. Interrupt handlers cannot access any data that is not locked, and for non-broken fast interrupt handlers, in practice this means not accessing any global data, since locking global data would be too hard and/or slow. Global data includes all per-CPU-data, and I enforce non-access to this by loading %fs with 0 in fast interrupt handlers. This makes fast interrupt handlers quite difficult to write. An interrupt handler like hardclock(), which stomps around in global data, in some places without even locking the data, is far too large and complicated to be a non-broken fast interrupt handler. I use normal interrupt handlers for hardclock() and statclock() so my lower interrupt latency costs performance. > My memory is that we used to mask interrupts lazily in 4.x Right. Only for i386. The masking is in the PIC so it only affects devices on non-fast interrupts, which should only be slow devices. Lazy masking for critical_*() has the same results (it only affects non-fast interrupts) although its mechanism is different. 
I implemented this in 386BSD and am unhappy that it was broken in SMPng, though with CPUs hundreds of times faster than they were when 386BSD was new, and with devices not so much faster and/or with larger buffers, the extra latency rarely matters in practice; also, with SMP there is only extra latency if all CPUs happen to hold a spinlock at the same time. Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 03:00:00 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 82B9D1065675; Fri, 14 Mar 2008 03:00:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail35.syd.optusnet.com.au (mail35.syd.optusnet.com.au [211.29.133.51]) by mx1.freebsd.org (Postfix) with ESMTP id 013138FC15; Fri, 14 Mar 2008 02:59:59 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c220-239-252-11.carlnfd3.nsw.optusnet.com.au (c220-239-252-11.carlnfd3.nsw.optusnet.com.au [220.239.252.11]) by mail35.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2E2xk4O002740 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 14 Mar 2008 13:59:47 +1100 Date: Fri, 14 Mar 2008 13:59:46 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Jeff Roberson In-Reply-To: <20080313132152.Y1091@desktop> Message-ID: <20080314132033.I34431@delplex.bde.org> References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org> <20080313132152.Y1091@desktop> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org, Peter Wemm , David Xu Subject: Re: amd64 cpu_switch in C. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 03:00:00 -0000 On Thu, 13 Mar 2008, Jeff Roberson wrote: Please trim quotes more. > On Fri, 14 Mar 2008, Bruce Evans wrote: > >> On Wed, 12 Mar 2008, Jeff Roberson wrote: >>> More expensive than the raw instruction count is: >>> >>> 1) The mispredicted branches to deal with all of the optional state and >>> features that are not always saved. >> >> This is unlikely to matter, and apparently doesn't, at least in simple >> benchmarks, since the C version has even more branches. Features that >> are rarely used cause branches that are usually perfectly predicted. > > The C version has two fewer branches because it tests for two unlikely > features together. It has a few more branches than the in-CVS asm version > and the same number of extra branches as Peter's asm version to support > conditional gs/fsbase setting. The other extra branches have to do with > supporting cpu_switch() and cpu_throw() together. Testing features together is probably best here, but it might not always be. Executing more branches might be faster because each individual branch is easier to predict. >>> 2) The cost of extra icache for getting over all of those unused >>> instructions, unaligned jumps, etc. >> >> Again, if this were the cause of slowness then it would affect the C >> version more, since the C version is larger. > > The C version is not larger than the asm version at high optimization levels > when you consider the total instruction count that is brought into the > icache. It's worth noting that my C version is slower in some cases other > than the microbenchmark due to extra instructions for optimizations that > don't matter. Peter's asm version is tight enough that the extra compares > don't cost more than the compacted code wins.
> The C version touches more
> distinct icache lines but makes up for it in other optimizations in the
> common case.

Are calls to rarely-called functions getting auto-inlined for your C version? The asm version doesn't worry about this. Even with auto-inlining of static functions that are only called once (a new bugfeature in gcc-4.1 which breaks profiling and debugging), at some optimization levels gcc will place code for the unusual case far away so as not to pollute the i-cache in the usual case although this may cost an extra branch in the unusual case. For rarely-called functions, it must be better to not inline too.

>> In fact, the benchmark is probably too simple to show the cost of
>> branches. Just doing sched_yield() in a loop gives the following
>> atypical behaviour which may be atypical enough for the larger branch
>> and cache costs for the C version to not have much effect:
>> - it doesn't go near most of the special cases, so branches are
>> predictable (always non-special) and are thus predicted provided
>> (a) the CPU actually does reasonably good branch prediction, and
>> (b) the branch predictions fit in the branch prediction cache
>> (reasonably good branch prediction probably requires such a
>> cache).
>
> This cache is surely virtual as it happens in the first few stages of the
> pipeline. That means it's flushed on every switch. We're probably coming in
> cold every time.

Which cache? My perfmon results show that the branch cache is far from cold.

>> The C version uses lots of non-inline function calls. Just the
>> branches for this would have a significant overhead if the branches
>> are mispredicted. I think you are depending on gcc's auto-inlining
>> of static functions which are only called once to avoid the full
>> cost of the function calls.
>
> I depend on it not inlining them to avoid polluting the icache with unused
> instructions. I broke that with my most recent patch by moving the calls
> back into C.
> :-)

Maybe I only looked at the most recent patch. It seemed to have lots of calls. To prevent inlining you probably need to use the noinline attribute for some functions.

I don't see how the C version can be both simpler and (as|more) optimal than the asm version. It already has magic somewhat self-documenting macros for branch prediction and magic undocumented layout for the function calls etc. to improve branch prediction and icache use.

For even-more-micro optimizations in libm, I try to do everything in C, but the only way I can get near the efficiency that I want is to look at the asm output and then figure out how to trick the compiler into not being so stupid. I could optimize it in asm with less work (starting with the asm output, especially at first to learn what works for SSE scheduling), but only for a single CPU type.

>> Some perfmon output for ./yield 100000 10:
>> ...
>> % # s/kx-fr-dispatch-stall-for-segment-load % 134520281
>>
>> 134 cycles per call. This may be more for ones in syscall() generally.
>> I think each segreg load still costs ~20 cycles. Since this is on
>> i386, there are 6 per call (%ds, %es and %fs save and restore), plus
>> %ss save and which might not be counted here. 134 is a lot -- about
>> 60nS of the 180nS for getpid().

I forgot about parallelism. With 3-way execution on an Athlon, there is at least a chance that all 3 segment registers are loaded in parallel, taking only ~20 cycles for all 3, but no chance of proceeding with other instructions if so. OTOH, if only 1 or 2 ALUs can do segreg loads, then the other ALUs may be able to proceed with independent instructions. We have some nearby instructions that depend on %ds (these might benefit from using %ss) but few or no nearby dependencies on %es and %fs. Kernel code mostly doesn't worry about dependencies at all. Dependencies don't matter as much in integer code as in SSE/FPU code.

>>> We've been working on amd64 so I can't comment specifically about i386
>>> costs.
However, I definitely agree that cpu_switch() is not the greatest >>> overhead in the path. Also, you have to load cr3 even for kernel threads >>> because the page directory page or page directory pointer table at %cr3 >>> can go away once you've switched out the old thread. >> >> I don't see this. The switch is avoided if %cr3 wouldn't change, which >> I think usually or always happens for switches between kernel threads. > > I see, you're saying 'between kernel threads'. There was some discussion of > allowing kernel threads to use the page tables of whichever thread was last > switched in to avoid cr3 in all cases for them. This requires other changes > to be safe however. Probably a good idea. Bruce From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 05:58:23 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 13B791065676 for ; Fri, 14 Mar 2008 05:58:23 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.156]) by mx1.freebsd.org (Postfix) with ESMTP id 8CA9B8FC25 for ; Fri, 14 Mar 2008 05:58:22 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: by fg-out-1718.google.com with SMTP id 16so3271767fgg.35 for ; Thu, 13 Mar 2008 22:58:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=ba2RW5BQ+x3ZpKBwn6UASgrnMPMgo60kJVs+Ojjo2Lg=; b=caT+5axDR36Znl8/hXU6ZpvYxlxikwT0gCmJ8VTDQZym9squ++4unFfDXTZ4zMhXsPWK9PrTDj6RhZ5QNYfdKYWhFph2F+vQa62R46GpEYCOAyZHp2heK0LSSOuPI6Cu5zIxClirFL/lmyiONIbY9GGbpFKPPAKB5gxXMMj2uPg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; 
h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=DpvGQU5I4tXIOvC2YQg38dLo9oK3zI9TIHkKpr3w0TgacKTPSAYH6vHcfptbUtwDhQ35QaSNSPE0hHzYaX8Jy0Nvco58LAz864RL37et7edRfdzmrLu/iuF86O7OTme13hq11ky3vZ7nTYzuuTHULUSFUsTSUnJo4gwet5yThrA= Received: by 10.86.68.20 with SMTP id q20mr2306814fga.59.1205472763829; Thu, 13 Mar 2008 22:32:43 -0700 (PDT) Received: by 10.86.99.18 with HTTP; Thu, 13 Mar 2008 22:32:43 -0700 (PDT) Message-ID: <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> Date: Fri, 14 Mar 2008 11:02:43 +0530 From: "Joseph Koshy" To: "John Baldwin" In-Reply-To: <200803131516.12284.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 05:58:23 -0000 On Fri, Mar 14, 2008 at 12:46 AM, John Baldwin wrote: > On Thursday 13 March 2008 02:08:05 pm David O'Brien wrote: > > Hi folks, > > Some folks at Juniper have submitted these changes to hwpmc(4). > > I am sending them here for public review. > > > > Their thoughts are: > > The mp_ncpus refers to the count of the active CPU's. Where as > > mp_maxid refers to the count of all the cpus on the SMP. Using > > mp_ncpus in the cpu_id range-check of hwpmc module would lead to the > > assumption that all the active CPU's in the SMP are not interleaved. 
> > But for running on some platforms, the active and inactive cpus could
> > be interleaved making hwpmc not work for the cpus whose cpu_id is
> > greater than the active-cpu count.

jhb> This is correct, but you need to handle CPUs that are absent. It might be
jhb> sufficient to update pmc_cpu_is_disabled() in kern_pmc.c to check
jhb> CPU_ABSENT(cpu) and claim the CPU is disabled if it is absent, but I'm not
jhb> sure that will catch everything as that seems aimed at handling having a
jhb> non-absent CPU halted (such as disabling HTT on i386).

That is in line with the feedback (and sample patch to kern_pmc.c) that I had sent in to O'Brien.

But there are other problems with the patch at various levels, probably not obvious to someone who is just looking at the kernel code.

First, the relevance. My understanding is that these changes are for a proprietary SMP platform that uses non-mainstream (Tier3 or Tier4) CPUs. It so happens that Juniper decided to number CPUs 'sparsely' in their kernel variant and that is the motivation for this patch.

IMO, as a policy, code changes for exotic hardware need to be maintained by vendors of said exotic hardware and not dumped on volunteers.

Second, when I designed the PMCTools API I didn't consider that CPU numbers could be 'sparse'. [They aren't sparsely allocated on the i386/amd64---the code I looked at when I was designing PmcTools.] So there are assumptions sprinkled throughout userland that the integers 0..hw.ncpus can select a valid CPU. While all that can be tracked down and changed, and documentation updated, it is still work that I would prefer to defer until there is a chance that someone in the general public can use it. I do need to prioritize how I spend my volunteer hours.

Third, IFF we as a project are going to support 'sparse' CPU numbering, I would like to see the form that takes before making changes to HWPMC and tools.
For example:

- How will userland and in-kernel modules find out which CPUs are physically present? Would there be a bitmask on the lines of today's machdep.hlt_cpus that we could query? Could we make the 'all_cpus' bitmask visible to userland? What happens when we start supporting systems with more than 32 processors?
- Will sysctl hw.ncpus represent the count of present CPUs or will it represent the maximum CPU id?
- How will userland distinguish between absent CPUs and those that could be temporarily administratively disabled?
- Are we going to support 'transient' CPUs [that come and go]? Why would we want sparse CPU numbering otherwise?

Nit: 'mp_maxid' appears to be an index, not a count as claimed above.

If support for sparse CPU numbering is something useful, I feel the correct sequence should be to discuss it here, add sparse CPU numbering to the base i386/amd64 kernels (say) first and then propagate the feature to auxiliary code like HWPMC and userland.

Changing HWPMC and its userland before the base kernel itself changes does not seem to be the right thing to do.
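
To make the index-versus-count nit and the range-check problem concrete, here is a minimal user-space sketch. It is not the hwpmc(4) code and not the submitted patch: the `mock_` variables and `MOCK_CPU_ABSENT()` are stand-ins I made up for the kernel's mp_ncpus, mp_maxid and CPU_ABSENT(), and the mask value is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * User-space mock, NOT kernel code.  In the kernel, mp_ncpus, mp_maxid
 * and CPU_ABSENT() come from <sys/smp.h>; the names below are stand-ins
 * so that the example compiles on its own.  The mask is hypothetical.
 */
static const uint32_t mock_all_cpus = 0x2d;	/* CPUs 0, 2, 3, 5 present */
static const int mock_mp_ncpus = 4;		/* count of present CPUs */
static const int mock_mp_maxid = 5;		/* highest valid id: an index */

#define	MOCK_CPU_ABSENT(id)	((mock_all_cpus & (UINT32_C(1) << (id))) == 0)

/* Dense assumption: every id below the CPU count names a usable CPU. */
static bool
cpu_ok_dense(int cpu)
{
	return (cpu >= 0 && cpu < mock_mp_ncpus);
}

/* Sparse-aware check: bound by the highest id, then reject the holes. */
static bool
cpu_ok_sparse(int cpu)
{
	if (cpu < 0 || cpu > mock_mp_maxid)
		return (false);
	return (!MOCK_CPU_ABSENT(cpu));
}

/* Walk 0..mp_maxid inclusive and skip absent ids. */
static int
count_usable_cpus(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu <= mock_mp_maxid; cpu++)
		if (!MOCK_CPU_ABSENT(cpu))
			n++;
	assert(n == mock_mp_ncpus);	/* holds for this mock mask only */
	return (n);
}
```

With this mask the dense check accepts id 1 (a hole) and rejects id 5 (a present CPU), while the sparse-aware check does the opposite. Note the loop bound is `<=` because the upper limit is an index, not a count.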
Regards, Koshy From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 06:13:23 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F086F106566B; Fri, 14 Mar 2008 06:13:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from webaccess-cl.virtdom.com (webaccess-cl.virtdom.com [216.240.101.25]) by mx1.freebsd.org (Postfix) with ESMTP id B4FFE8FC1D; Fri, 14 Mar 2008 06:13:22 +0000 (UTC) (envelope-from jroberson@chesapeake.net) Received: from [192.168.1.107] (cpe-24-94-75-93.hawaii.res.rr.com [24.94.75.93]) (authenticated bits=0) by webaccess-cl.virtdom.com (8.13.6/8.13.6) with ESMTP id m2E6DHdG045317; Fri, 14 Mar 2008 02:13:18 -0400 (EDT) (envelope-from jroberson@chesapeake.net) Date: Thu, 13 Mar 2008 20:14:27 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: Joseph Koshy In-Reply-To: <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> Message-ID: <20080313200839.S1091@desktop> References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 06:13:23 -0000 On Fri, 14 Mar 2008, Joseph Koshy wrote: > On Fri, Mar 14, 2008 at 12:46 AM, John Baldwin wrote: >> On Thursday 13 March 2008 02:08:05 pm David O'Brien wrote: >> > Hi folks, >> > Some folks at Juniper have submitted these changes to hwpmc(4). >> > I am sending them here for public review. 
>> >
>> > Their thoughts are:
>> > The mp_ncpus refers to the count of the active CPU's. Where as
>> > mp_maxid refers to the count of all the cpus on the SMP. Using
>> > mp_ncpus in the cpu_id range-check of hwpmc module would lead to the
>> > assumption that all the active CPU's in the SMP are not interleaved.
>> > But for running on some platforms, the active and inactive cpus could
>> > be interleaved making hwpmc not work for the cpus whose cpu_id is
>> > greater than the active-cpu count.
>
> jhb> This is correct, but you need to handle CPUs that are absent. It might be
> jhb> sufficient to update pmc_cpu_is_disabled() in kern_pmc.c to check
> jhb> CPU_ABSENT(cpu) and claim the CPU is disabled if it is absent, but I'm not
> jhb> sure that will catch everything as that seems aimed at handling having a
> jhb> non-absent CPU halted (such as disabling HTT on i386).
>
> That is in line with the feedback (and sample patch to kern_pmc.c) that I
> had sent in to O'Brien.
>
> But there are other problems with the patch at various levels,
> probably not obvious to someone who is just looking at the kernel
> code.
>
> First, the relevance. My understanding is that these changes are for
> a proprietary SMP platform that uses non-mainstream (Tier3 or
> Tier4) CPUs. It so happens that Juniper decided to number CPUs
> 'sparsely' in their kernel variant and that is the motivation for this
> patch.
>
> IMO, as a policy, code changes for exotic hardware need to be
> maintained by vendors of said exotic hardware and not dumped on
> volunteers.

In general we accept vendor patches that are not disruptive even in the case that the general community doesn't perceive the real value. It is important for us to work with and encourage vendors.

>
> Second, when I designed the PMCTools API I didn't consider that CPU
> numbers could be 'sparse'. [They aren't sparsely allocated
> on the i386/amd64---the code I looked at when I was designing
> PmcTools.]
> So there are assumptions sprinkled throughout userland
> that the integers 0..hw.ncpus can select a valid CPU. While
> all that can be tracked down and changed, and documentation updated,
> it is still work that I would prefer to defer until there is a chance
> that someone
> in the general public can use it. I do need to prioritize how I spend my
> volunteer hours.

We're not asking you to support the feature. It looks like juniper already has it tested and working. We just need someone to review the patches and commit them.

>
> Third, IFF we as a project are going to support 'sparse' CPU numbering,
> I would like to see the form that takes before making changes to
> HWPMC and tools. For example:

The majority of the kernel already deals with sparse cpu mappings. That's why we have CPU_ABSENT(). Please look at UMA and ULE for examples of code that I have written which use this macro correctly. I'm sure there are other places that do as well that I'm not familiar with.

> - How will userland and in-kernel modules find out which CPUs are
> physically present? Would there be a bitmask on the lines of today's
> machdep.hlt_cpus that we could query? Could we make the
> 'all_cpus' bitmask visible to userland? What happens when we
> start supporting systems with more than 32 processors?

The kernel has the various cpumasks available in sys/smp.h. Userland can now use cpusets to find out what processors are available to it. In the future we are going to replace simple cpumasks with the cpuset_t structure from cpusets so on machines that support more than sizeof(register) * 8 processors we will use arrays.

> - Will sysctl hw.ncpus represent the count of present CPUs or will it
> represent the maximum CPU id?

That is the number of cpus, not the maximum id.

> - How will userland distinguish between absent CPUs and those that
> could be temporarily administratively disabled?

We don't presently make the distinction to the user.
> - Are we going to support 'transient' CPUs [that come and go]? Why
> would we want sparse CPU numbering otherwise?

That is a much more difficult problem and one which we have discussed for virtualization purposes. This patch would further that eventual goal although obviously there is more work to do to get there.

>
> Nit: 'mp_maxid' appears to be an index, not a count as claimed above.

Yes, that is unfortunate for these purposes.

>
> If support for sparse CPU numbering is something useful, I feel the
> correct sequence should be to discuss it here, add sparse CPU
> numbering to the base i386/amd64 kernels (say) first and then
> propagate the feature to auxiliary code like HWPMC and userland.
>
> Changing HWPMC and its userland before the base kernel itself
> changes does not seem to be the right thing to do.

The rest of the generic code in the kernel already supports this. Juniper claims to have tested and is using this feature. Furthermore, it will get us a tiny step closer to being able to support pluggable cpus in a virtualized environment.
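
For the "more than one register's worth of CPUs" case mentioned earlier in this message, the idea of backing the CPU mask with an array of words can be sketched roughly as follows. This is an illustration only and makes no claim to match the real cpuset_t layout or its macros; every `sk_`/`SK_` name is hypothetical, and the 128-CPU limit is an arbitrary assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits for the sketch; not taken from any kernel header. */
#define	SK_MAXCPU	128
#define	SK_BITS		(sizeof(unsigned long) * CHAR_BIT)
#define	SK_WORDS	((SK_MAXCPU + SK_BITS - 1) / SK_BITS)

/* A CPU set that grows past one machine word by using an array. */
struct sk_cpuset {
	unsigned long bits[SK_WORDS];
};

/* Mark a CPU id as present in the set. */
static void
sk_set(struct sk_cpuset *s, size_t cpu)
{
	assert(cpu < SK_MAXCPU);
	s->bits[cpu / SK_BITS] |= 1UL << (cpu % SK_BITS);
}

/* Test membership; ids beyond the limit are simply reported as absent. */
static bool
sk_isset(const struct sk_cpuset *s, size_t cpu)
{
	if (cpu >= SK_MAXCPU)
		return (false);
	return ((s->bits[cpu / SK_BITS] >> (cpu % SK_BITS)) & 1UL);
}

/* Count members by probing every possible id, holes included. */
static int
sk_count(const struct sk_cpuset *s)
{
	size_t cpu;
	int n = 0;

	for (cpu = 0; cpu < SK_MAXCPU; cpu++)
		if (sk_isset(s, cpu))
			n++;
	return (n);
}
```

A zero-initialized `struct sk_cpuset` with ids 0 and 70 set reports two members, and id 70 lands in a later word on machines where `unsigned long` is 64 bits or narrower. The word size is derived from `sizeof`, so nothing in the sketch depends on it being 32 or 64.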
Thanks, Jeff > > Regards, > Koshy > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 06:40:36 2008 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 789501065671 for ; Fri, 14 Mar 2008 06:40:36 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 215C98FC1C for ; Fri, 14 Mar 2008 06:40:36 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m2E6b7u8084978; Fri, 14 Mar 2008 00:37:07 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Fri, 14 Mar 2008 00:37:49 -0600 (MDT) Message-Id: <20080314.003749.-432746071.imp@bsdimp.com> To: jroberson@chesapeake.net From: "M. Warner Losh" In-Reply-To: <20080313200839.S1091@desktop> References: <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.ORG Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 14 Mar 2008 06:40:36 -0000

In message: <20080313200839.S1091@desktop> Jeff Roberson writes:
: In general we accept vendor patches that are not disruptive even in the
: case that the general community doesn't perceive the real value. It is
: important for us to work with and encourage vendors.
...
: The rest of the generic code in the kernel already supports this. Juniper
: claims to have tested and is using this feature. Furthermore, it will get
: us a tiny step closer to being able to support pluggable cpus in a
: virtualized environment.

I'd like to echo these sentiments. We've generally been willing to accept code from vendors that makes their lives easier, even when that code doesn't directly benefit the project. We do this on the theory that if we make their life easy, they will contribute to the project. Juniper has certainly given a large chunk of code to the project (a fairly complete MIPS port that has been integrated with the so-called "mips2" port and will be headed into the tree soonish), which is certainly a lot more code than has been given from vendors whom we've made much bigger accommodations to.

In this case a vendor came forward with a patch that introduces no real additional burden to the volunteers who are maintaining the code. It seems like a no-brainer to me to commit it. There's certainly no compelling technical argument against it.

I work for Cisco. Cisco has no love for Juniper, and vice versa. However, I put that aside for the good of the project and work with people from Juniper all the time to make the project better by focusing on the technology. The project has similar expectations for all its developers: if there's a technical reason to not do something, then that's OK.
If there's a political reason, especially one that isn't shared honestly and openly, then the bar is much much higher to exclude the technology from the tree.

What would people think if I were to block the MIPS stuff from Juniper just because it came from Juniper and I work for a company that is in competition with Juniper? I don't think it would be too favorable.

Warner

From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 11:40:05 2008
Return-Path:
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3DBF91065672 for ; Fri, 14 Mar 2008 11:40:05 +0000 (UTC) (envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 285458FC20 for ; Fri, 14 Mar 2008 11:40:04 +0000 (UTC) (envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 1945D46B8F; Fri, 14 Mar 2008 07:40:04 -0400 (EDT)
Date: Fri, 14 Mar 2008 11:40:04 +0000 (GMT)
From: Robert Watson
X-X-Sender: robert@fledge.watson.org
To: "M. Warner Losh"
In-Reply-To: <20080314.003749.-432746071.imp@bsdimp.com>
Message-ID: <20080314112104.I60466@fledge.watson.org>
References: <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> <20080314.003749.-432746071.imp@bsdimp.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'.
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 14 Mar 2008 11:40:05 -0000

On Fri, 14 Mar 2008, M.
Warner Losh wrote: > I'd like to echo these sentiments. We've generally been willing to accept > code from vendors that makes their lives easier, even when that code doesn't > directly benefit the project. We do this on the theory that if we make > their life easy, they will contribute to the project. Juniper has certainly > given a large chunk of code to the project (a fairly complete MIPS port that > has been integrated with the so-called "mips2" port and will be headed into > the tree soonish), which is certainly a lot more code than has been given > from vendors whom we've made much bigger accommodations to. > > In this case a vendor came forward with a patch that introduces no real > additional burdon to the volunteers who are maintaining the code. It seems > like a no brainer to me to commit it. There's certainly no compelling > technical argument against it. I think (hope?) everyone here would generally agree on the point regarding vendors. However, I think there is a technical point being made as well, and we're at risk of losing track of it. Koshy has pointed out that changing just the kernel parts is *insufficient* to remove the assumption of non-sparse CPU identifiers, because the kernel parts are not all there is to hwpmc. The KASSERT()s document not just the assumptions of the kernel code, which are updated by the proposed patch, but also relate to the guarantees made by the user APIs for hwpmc libraries, tools, and documentation. They are directly affected by the proposed change because they both expose and rely on the non-sparse CPU identifier assumption, and also need to be updated to reflect the changed assumption. FWIW, we should reemphasize here that sparse CPU identifiers, although not all that well-supported by our kernel, do exist and function today on all the SMP architectures that we support. 
The hyperthreading disable frob introduced a few years ago leads to sparse identifiers for live CPUs on i386 and amd64, and triggered problems in several pieces of code (now believed to mostly be resolved?). We do need a better general infrastructure for handling CPU information, and the cpuset(2) API starts to address this. I understand that a man page for this will materialize soon :-). Still missing, and something to discuss in detail at the devsummit since it will require non-trivial architectural changes, is how to handle live CPU reconfiguration, which is increasingly relevant due to hypervisor-driven virtualization. It became rapidly clear when the HTT frob was a run-time changeable sysctl (no longer true, I hope) that changing the set of "absent" CPUs at run time caused our kernel to behave in relatively catastrophic ways, and should be avoided, and that's just a hint in the direction of the changes we'll need to make to fully support hotplug. Universal support for sparse CPU identifiers throughout the system is just one prerequisite for getting to hotplug. Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 16:22:42 2008 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B805D106566C for ; Fri, 14 Mar 2008 16:22:42 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 88E5B8FC15 for ; Fri, 14 Mar 2008 16:22:42 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.14.2/8.14.1) with ESMTP id m2EGKrjh098511 for ; Fri, 14 Mar 2008 10:20:54 -0600 (MDT) (envelope-from imp@bsdimp.com) Date: Fri, 14 Mar 2008 10:21:37 -0600 (MDT) Message-Id: <20080314.102137.-2034679600.imp@bsdimp.com> To: arch@FreeBSD.org From: "M. 
Warner Losh" X-Mailer: Mew version 5.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Multipart/Mixed; boundary="--Next_Part(Fri_Mar_14_10_21_37_2008_059)--" Content-Transfer-Encoding: 7bit Cc: Subject: BUS_DMA_ISA unused, planning on removing X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 16:22:42 -0000 ----Next_Part(Fri_Mar_14_10_21_37_2008_059)-- Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Greetings, It appears that BUS_DMA_ISA is unused: find . -name \*.c -o -name \*.h | xargs egrep BUS_DMA_ISA ./ia64/isa/isa_dma.c: /*flags*/BUS_DMA_ISA, ./sys/bus_dma.h:#define BUS_DMA_ISA 0x400 /* map memory for AXP ISA dma */ I talked to Marcel, and he's cool with removing it. Can anybody see a reason not to GC this? Warner ----Next_Part(Fri_Mar_14_10_21_37_2008_059)-- Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="bus-dma-isa.diff" Index: ia64/isa/isa_dma.c =================================================================== RCS file: /pe/ncvs/src/sys/ia64/isa/isa_dma.c,v retrieving revision 1.10 diff -u -r1.10 isa_dma.c --- ia64/isa/isa_dma.c 9 Jul 2007 04:58:16 -0000 1.10 +++ ia64/isa/isa_dma.c 14 Mar 2008 16:17:39 -0000 @@ -106,7 +106,7 @@ /*filter*/NULL, /*filterarg*/NULL, /*maxsize*/bouncebufsize, /*nsegments*/1, /*maxsegz*/0x3ffff, - /*flags*/BUS_DMA_ISA, + /*flags*/0, /*lockfunc*/busdma_lock_mutex, /*lockarg*/&Giant, &dma_tag[chan]) != 0) { Index: sys/bus_dma.h =================================================================== RCS file: /pe/ncvs/src/sys/sys/bus_dma.h,v retrieving revision 1.30 diff -u -r1.30 bus_dma.h --- sys/bus_dma.h 3 Sep 2006 00:26:17 -0000 1.30 +++ sys/bus_dma.h 14 Mar 2008 16:17:17 -0000 @@ -101,7 +101,6 @@ */ #define BUS_DMA_NOWRITE 
0x100 #define BUS_DMA_NOCACHE 0x200 -#define BUS_DMA_ISA 0x400 /* map memory for AXP ISA dma */ /* Forwards needed by prototypes below. */ struct mbuf; ----Next_Part(Fri_Mar_14_10_21_37_2008_059)---- From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 16:37:05 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CD5D2106567F; Fri, 14 Mar 2008 16:37:05 +0000 (UTC) (envelope-from sam@freebsd.org) Received: from ebb.errno.com (ebb.errno.com [69.12.149.25]) by mx1.freebsd.org (Postfix) with ESMTP id 9E1228FC1C; Fri, 14 Mar 2008 16:37:05 +0000 (UTC) (envelope-from sam@freebsd.org) Received: from trouble.errno.com (trouble.errno.com [10.0.0.248]) (authenticated bits=0) by ebb.errno.com (8.13.6/8.12.6) with ESMTP id m2EGClcF058175 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 14 Mar 2008 09:12:48 -0700 (PDT) (envelope-from sam@freebsd.org) Message-ID: <47DAA3FF.9040906@freebsd.org> Date: Fri, 14 Mar 2008 09:12:47 -0700 From: Sam Leffler Organization: FreeBSD Project User-Agent: Thunderbird 2.0.0.9 (X11/20071125) MIME-Version: 1.0 To: Robert Watson References: <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> <20080314.003749.-432746071.imp@bsdimp.com> <20080314112104.I60466@fledge.watson.org> In-Reply-To: <20080314112104.I60466@fledge.watson.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-DCC--Metrics: ebb.errno.com; whitelist Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 16:37:06 -0000 Robert Watson wrote: > On Fri, 14 Mar 2008, M. Warner Losh wrote: > >> I'd like to echo these sentiments. We've generally been willing to >> accept code from vendors that makes their lives easier, even when >> that code doesn't directly benefit the project. We do this on the >> theory that if we make their life easy, they will contribute to the >> project. Juniper has certainly given a large chunk of code to the >> project (a fairly complete MIPS port that has been integrated with >> the so-called "mips2" port and will be headed into the tree soonish), >> which is certainly a lot more code than has been given from vendors >> whom we've made much bigger accommodations to. >> >> In this case a vendor came forward with a patch that introduces no >> real additional burdon to the volunteers who are maintaining the >> code. It seems like a no brainer to me to commit it. There's >> certainly no compelling technical argument against it. > > I think (hope?) everyone here would generally agree on the point > regarding vendors. However, I think there is a technical point being > made as well, and we're at risk of losing track of it. > > Koshy has pointed out that changing just the kernel parts is > *insufficient* to remove the assumption of non-sparse CPU identifiers, > because the kernel parts are not all there is to hwpmc. The > KASSERT()s document not just the assumptions of the kernel code, which > are updated by the proposed patch, but also relate to the guarantees > made by the user APIs for hwpmc libraries, tools, and documentation. 
> They are directly affected by the proposed change because they both > expose and rely on the non-sparse CPU identifier assumption, and also > need to be updated to reflect the changed assumption. > > FWIW, we should reemphasize here that sparse CPU identifiers, although > not all that well-supported by our kernel, do exist and function today > on all the SMP architectures that we support. The hyperthreading > disable frob introduced a few years ago leads to sparse identifiers > for live CPUs on i386 and amd64, and triggered problems in several > pieces of code (now believed to mostly be resolved?). We do need a > better general infrastructure for handling CPU information, and the > cpuset(2) API starts to address this. I understand that a man page > for this will materialize soon :-). > > Still missing, and something to discuss in detail at the devsummit > since it will require non-trivial architectural changes, is how to > handle live CPU reconfiguration, which is increasingly relevant due to > hypervisor-driven virtualization. It became rapidly clear when the > HTT frob was a run-time changeable sysctl (no longer true, I hope) > that changing the set of "absent" CPUs at run time caused our kernel > to behave in relatively catastrophic ways, and should be avoided, and > that's just a hint in the direction of the changes we'll need to make > to fully support hotplug. Universal support for sparse CPU > identifiers throughout the system is just one prerequisite for getting > to hotplug. > hwpmc is a useful tool and needs to be improved. It appears there are multiple groups/people interested in doing that and we need to leverage that, not discourage it (given the rate of progress on the existing implementation I can only guess it's too much work for one individual). Getting the kernel changes in will allow other work to go on in parallel and doesn't appear to impact any existing usage. Please commit these changes and let's move on. 
Sam From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 16:45:54 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2E8381065673 for ; Fri, 14 Mar 2008 16:45:54 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from proxy.meer.net (proxy.meer.net [64.13.141.13]) by mx1.freebsd.org (Postfix) with ESMTP id 1D1958FC1A for ; Fri, 14 Mar 2008 16:45:53 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from outbound0.mx.meer.net (outbound0.mx.meer.net [209.157.153.23]) by proxy.meer.net (8.14.2/8.14.2) with ESMTP id m2EFt1wO041774 for ; Fri, 14 Mar 2008 08:55:06 -0700 (PDT) (envelope-from gnn@neville-neil.com) Received: from mail.meer.net (mail.meer.net [209.157.152.14]) by outbound0.mx.meer.net (8.12.10/8.12.6) with ESMTP id m2EFh0iJ000580; Fri, 14 Mar 2008 07:43:34 -0800 (PST) (envelope-from gnn@neville-neil.com) Received: from mail2.meer.net (mail2.meer.net [64.13.141.16]) by mail.meer.net (8.13.3/8.13.3/meer) with ESMTP id m2EFgnxd078983; Fri, 14 Mar 2008 08:42:49 -0700 (PDT) (envelope-from gnn@neville-neil.com) Received: from minion.myhome.westell.com.neville-neil.com (209-45-135-131.dia.static.qwest.net [209.45.135.131]) (authenticated bits=0) by mail2.meer.net (8.14.1/8.14.1) with ESMTP id m2EFgmI4029698; Fri, 14 Mar 2008 08:42:49 -0700 (PDT) (envelope-from gnn@neville-neil.com) Date: Fri, 14 Mar 2008 11:42:48 -0400 Message-ID: From: gnn@freebsd.org To: Robert Watson In-Reply-To: <20080314112104.I60466@fledge.watson.org> References: <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> <20080314.003749.-432746071.imp@bsdimp.com> <20080314112104.I60466@fledge.watson.org> User-Agent: Wanderlust/2.15.5 (Almost Unreal) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.7 Emacs/22.1.50 (i386-apple-darwin8.10.1) MULE/5.0 (SAKAKI) MIME-Version: 
1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII X-Bayes-Prob: 0.5 (Score 0) X-Spam-Score: 0.70 () [Tag at 5.00] COMBINED_FROM,NO_REAL_NAME X-CanItPRO-Stream: default X-Canit-Stats-ID: 44615 - 8278e499340a X-Scanned-By: CanIt (www . roaringpenguin . com) on 64.13.141.13 Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 16:45:54 -0000 At Fri, 14 Mar 2008 11:40:04 +0000 (GMT), rwatson wrote: > > On Fri, 14 Mar 2008, M. Warner Losh wrote: > > > I'd like to echo these sentiments. We've generally been willing to accept > > code from vendors that makes their lives easier, even when that code doesn't > > directly benefit the project. We do this on the theory that if we make > > their life easy, they will contribute to the project. Juniper has certainly > > given a large chunk of code to the project (a fairly complete MIPS port that > > has been integrated with the so-called "mips2" port and will be headed into > > the tree soonish), which is certainly a lot more code than has been given > > from vendors whom we've made much bigger accommodations to. > > > > In this case a vendor came forward with a patch that introduces no real > > additional burden to the volunteers who are maintaining the code. It seems > > like a no brainer to me to commit it. There's certainly no compelling > > technical argument against it. > > I think (hope?) everyone here would generally agree on the point > regarding vendors. However, I think there is a technical point > being made as well, and we're at risk of losing track of it. 
> > Koshy has pointed out that changing just the kernel parts is > *insufficient* to remove the assumption of non-sparse CPU > identifiers, because the kernel parts are not all there is to hwpmc. > The KASSERT()s document not just the assumptions of the kernel code, > which are updated by the proposed patch, but also relate to the > guarantees made by the user APIs for hwpmc libraries, tools, and > documentation. They are directly affected by the proposed change > because they both expose and rely on the non-sparse CPU identifier > assumption, and also need to be updated to reflect the changed > assumption. > > FWIW, we should reemphasize here that sparse CPU identifiers, > although not all that well-supported by our kernel, do exist and > function today on all the SMP architectures that we support. The > hyperthreading disable frob introduced a few years ago leads to > sparse identifiers for live CPUs on i386 and amd64, and triggered > problems in several pieces of code (now believed to mostly be > resolved?). We do need a better general infrastructure for handling > CPU information, and the cpuset(2) API starts to address this. I > understand that a man page for this will materialize soon :-). > > Still missing, and something to discuss in detail at the devsummit > since it will require non-trivial architectural changes, is how to > handle live CPU reconfiguration, which is increasingly relevant due > to hypervisor-driven virtualization. It became rapidly clear when > the HTT frob was a run-time changeable sysctl (no longer true, I > hope) that changing the set of "absent" CPUs at run time caused our > kernel to behave in relatively catastrophic ways, and should be > avoided, and that's just a hint in the direction of the changes > we'll need to make to fully support hotplug. Universal support for > sparse CPU identifiers throughout the system is just one > prerequisite for getting to hotplug. Just to jump in on this quickly. 
I'm looking at the patches and at hwpmc in general and I'll try to massage all of this stuff together so that we can get this up and running on the newer processors. So, if people have patches out there please post links and/or email them to me, and I'll review them and get them reviewed and try to get them into the tree. I think everyone agrees that we want hwpmc to keep advancing with newer chips as it's one of the tools we have to really understand and improve the performance of our systems. Best, George From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 17:03:09 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3193D106567E for ; Fri, 14 Mar 2008 17:03:09 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.157]) by mx1.freebsd.org (Postfix) with ESMTP id C18298FC2B for ; Fri, 14 Mar 2008 17:03:08 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: by fg-out-1718.google.com with SMTP id 16so3471540fgg.35 for ; Fri, 14 Mar 2008 10:03:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=x9v6+hcDWGzACBl922H/9lsq2ACxLLGM0IN+cVAX/IQ=; b=Di6CEHSd5RqxBF/Mf8ewrH9MDsZY2BCJixcE+QC7YB/7jf2mMUU23T9G1ITdxCwyJHBoe9yUVdb/gBOiUqimtF/Nq7GjUeNXyqmnOvaGXc95bqBb1iHQnYIxh+2+68Xon42+sLPL6sDNOJm4jpZuo8gHeivhiMrFh5gs4toP/AQ= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; 
b=NsVTw5nH7yXSvqLyYHCuIc8ENpl4LleuTjAwAXFNKRsaV74YZ4EyKEmkRg7Q/+YnpOiaKSDWtVO8rkgEhh2AzvwQPVNmZA8Bbmal1I1+BMHPPoMj9VmMr0BFn58doMTkU3s5AQ5Gy1kukWSx3vUSvaZaua8PnvRQujjAgpVMmUA= Received: by 10.82.107.15 with SMTP id f15mr26999518buc.39.1205514187158; Fri, 14 Mar 2008 10:03:07 -0700 (PDT) Received: by 10.86.99.18 with HTTP; Fri, 14 Mar 2008 10:03:07 -0700 (PDT) Message-ID: <84dead720803141003p386f10e3y9f0a8aeceada53c4@mail.gmail.com> Date: Fri, 14 Mar 2008 22:33:07 +0530 From: "Joseph Koshy" To: "Robert Watson" In-Reply-To: <20080314112104.I60466@fledge.watson.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> <20080314.003749.-432746071.imp@bsdimp.com> <20080314112104.I60466@fledge.watson.org> Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 17:03:09 -0000 rw> Koshy has pointed out that changing just the kernel parts is *insufficient* to rw> remove the assumption of non-sparse CPU identifiers, because the kernel parts rw> are not all there is to hwpmc. The KASSERT()s document not just the rw> assumptions of the kernel code, which are updated by the proposed patch, but rw> also relate to the guarantees made by the user APIs for hwpmc libraries, rw> tools, and documentation. They are directly affected by the proposed change rw> because they both expose and rely on the non-sparse CPU identifier assumption, rw> and also need to be updated to reflect the changed assumption. Thank you Robert, for keeping the focus on the technical issues. 
Regards, Koshy From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 17:19:55 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F3DAD1065671 for ; Fri, 14 Mar 2008 17:19:54 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.152]) by mx1.freebsd.org (Postfix) with ESMTP id 92E1C8FC29 for ; Fri, 14 Mar 2008 17:19:54 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: by fg-out-1718.google.com with SMTP id 16so3476846fgg.35 for ; Fri, 14 Mar 2008 10:19:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=1GQB49w7QBBldtGyg3khojJmKlhpnM00zt0duzLXKR0=; b=suDXvdqi75c33kTMr5S3MQdNBMlkDFKucy0rkgzbZ0IOEraUmMU5yLtLejgt3DPXx446az6G7GJhHWbJWqrpeAsSobFgRFpUJWaOQM8yfYNQe5oOaXD0Ei4wqSFaW4iVuoJceFw1cBjSLzdZcumahxxxpmiZQEf2PApzl1NW3hk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=D3VHdqMcdFxbdVErspSqNmmhz2Iy8WxxQ4fdnSU0qxlss5ldx0tjFvLS+KlQxgjca75sm9BPX5IUf6hldmeGXBSqIU8dtv4T48jpsWo0XHdmBCDyn//9BDwqDVIW9KeYyn+fQXK6EwZM/YTyrnTSul++415fpgpQ1OXRGd9RguE= Received: by 10.86.36.11 with SMTP id j11mr2740740fgj.5.1205515192945; Fri, 14 Mar 2008 10:19:52 -0700 (PDT) Received: by 10.86.99.18 with HTTP; Fri, 14 Mar 2008 10:19:52 -0700 (PDT) Message-ID: <84dead720803141019j5b3d6cbfyf23583596ba97f88@mail.gmail.com> Date: Fri, 14 Mar 2008 22:49:52 +0530 From: "Joseph Koshy" To: "Jeff Roberson" In-Reply-To: <20080313200839.S1091@desktop> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit 
Content-Disposition: inline References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 17:19:55 -0000 jr> In general we accept vendor patches that are not disruptive even in the jr> case that the general community doesn't perceive the real value. It is jr> important for us to work with and encourage vendors. Well, that's ok, but we need to keep the quality bar too and 'do the right thing'. jr> We're not asking you to support the feature. It looks like juniper jr> already has it tested and working. We just need someone to review the jr> patches and commit them. The patch offers userland a way to get the kernel to schedule threads on non-existent CPUs. So I'm curious to know how it was 'tested' in Juniper. As for support, I'm the one currently answering questions and fielding the bug reports about PmcTools. > The majority of the kernel already deals with sparse cpu mappings. That's > why we have CPU_ABSENT(). Please look at UMA and ULE for examples of code > that I have written which use this macro correctly. I'm sure there are > other places that do as well that I'm not familiar with. Yes, I suggested changes to kern_pmc.c that use CPU_ABSENT(). > The kernel has the various cpumasks available in sys/smp.h. Userland can > now use cpusets to find out what processors are available to it. In the > future we are going to replace simple cpumasks with the cpuset_t structure > from cpusets so on machines that support more than sizeof(register) * 8 > processors we will use arrays. 
Ok, will read up about cpusets. A manual page would help. > > - How will userland distinguish between absent CPUs those that > > could be temporarily administratively disabled? > > We don't presently make the distinction to the user. Ok, we can treat them both as 'missing'. HWPMC cannot deal with CPUs that come and go though. > The rest of the generic code in the kernel already supports this. The MD layers need to catch up then? > Juniper claims to have tested and is using this feature. Define 'tested'. > Furthermore, it will get us a tiny step closer to being able to support > pluggable cpus in a virtualized environment. Ok, but that isn't really relevant to HWPMC. Virtualized environments do not usually emulate PMCs. Thanks, Koshy From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 17:25:46 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C3FF41065676 for ; Fri, 14 Mar 2008 17:25:46 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: from fk-out-0910.google.com (fk-out-0910.google.com [209.85.128.190]) by mx1.freebsd.org (Postfix) with ESMTP id 639198FC1A for ; Fri, 14 Mar 2008 17:25:46 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: by fk-out-0910.google.com with SMTP id b27so4832723fka.11 for ; Fri, 14 Mar 2008 10:25:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=CscYvJ7SzSBVOF9WNLp9IFPwJvaMAefAGMEWo1LiWJA=; b=ksylA5HL0GDCwoeww210NW8tcI49rtUS0B7psMDdMFQcrY4daY1uYALKRr7LH4F+2VKwwHa+/jfOJPOMEgrDWMgPB9Ge3D1/Hr4FSVtoS/yk+PimgePR4NKAXTHP1aHBbd0F8f3cjUaQJuC//4MqOoqGGsewX3pFV+R1t8cu1Vo= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; 
h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=YzpTxW0qbvP+oVCeAX2flOxpuH9gV9OhQ/r58G8FLQUhRbWhPHCdf4fdSCsMA2axxiDH3KRWkEA7BWM311HEWzsjO6DCbCm5kZrbyf+1Yam9hXQ2mNVrio4dQndWUwQkOAI7Vd8eMAGIn48SuZ0Fo7HY/4dtxgQ2Tk6lK3v8QCw= Received: by 10.82.113.6 with SMTP id l6mr27066281buc.20.1205515544570; Fri, 14 Mar 2008 10:25:44 -0700 (PDT) Received: by 10.86.99.18 with HTTP; Fri, 14 Mar 2008 10:25:44 -0700 (PDT) Message-ID: <84dead720803141025y543da4d6r2f91a5db1bcf2e34@mail.gmail.com> Date: Fri, 14 Mar 2008 22:55:44 +0530 From: "Joseph Koshy" To: gnn@freebsd.org In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Content-Disposition: inline References: <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <20080313200839.S1091@desktop> <20080314.003749.-432746071.imp@bsdimp.com> <20080314112104.I60466@fledge.watson.org> Cc: Robert Watson , freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 17:25:46 -0000 > Just to jump in on this quickly. I'm looking at the patches and at > hwpmc in general and I'll try to massage all of this stuff together so > that we can get this up and running on the newer processors. 
FYI, here is documentation about how to go about adding new PMC support: http://wiki.freebsd.org/PmcTools/PmcHardwareHowTo Regards, Koshy From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 18:32:18 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id DD3741065670 for ; Fri, 14 Mar 2008 18:32:18 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id 677E38FC15 for ; Fri, 14 Mar 2008 18:32:18 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235507521-1834499 for multiple; Fri, 14 Mar 2008 14:30:29 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2EIW7e8043494; Fri, 14 Mar 2008 14:32:12 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Fri, 14 Mar 2008 14:13:44 -0400 User-Agent: KMail/1.9.7 References: <20080314.102137.-2034679600.imp@bsdimp.com> In-Reply-To: <20080314.102137.-2034679600.imp@bsdimp.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803141413.44367.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 14 Mar 2008 14:32:13 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6232/Fri Mar 14 12:43:44 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Subject: Re: BUS_DMA_ISA unused, planning on removing X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 18:32:19 -0000 On Friday 14 March 2008 12:21:37 pm M. Warner Losh wrote: > Greetings, > > It appears that BUS_DMA_ISA is unused: > > find . -name \*.c -o -name \*.h | xargs egrep BUS_DMA_ISA > ./ia64/isa/isa_dma.c: /*flags*/BUS_DMA_ISA, > ./sys/bus_dma.h:#define BUS_DMA_ISA 0x400 /* map memory for AXP ISA dma */ > > I talked to Marcel, and he's cool with removing it. Can anybody see a > reason not to GC this? It was for Alpha and ia64 probably cut and pasted it. (Alpha had a separate sort of IOMMU for ISA dma.) You can axe it. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Mar 14 18:32:42 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2A79A106566B for ; Fri, 14 Mar 2008 18:32:42 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from speedfactory.net (mail.speedfactory.net [66.23.216.219]) by mx1.freebsd.org (Postfix) with ESMTP id C63F38FC19 for ; Fri, 14 Mar 2008 18:32:41 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (unverified [66.23.211.162]) by speedfactory.net (SurgeMail 3.8s) with ESMTP id 235507532-1834499 for multiple; Fri, 14 Mar 2008 14:30:33 -0400 Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.14.2/8.14.2) with ESMTP id m2EIW7e9043494; Fri, 14 Mar 2008 14:32:16 -0400 (EDT) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Joseph Koshy" Date: Fri, 14 Mar 2008 14:31:53 -0400 User-Agent: KMail/1.9.7 References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> In-Reply-To: 
<84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803141431.53846.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Fri, 14 Mar 2008 14:32:16 -0400 (EDT) X-Virus-Scanned: ClamAV 0.91.2/6232/Fri Mar 14 12:43:44 2008 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 14 Mar 2008 18:32:42 -0000 On Friday 14 March 2008 01:32:43 am Joseph Koshy wrote: > On Fri, Mar 14, 2008 at 12:46 AM, John Baldwin wrote: > > On Thursday 13 March 2008 02:08:05 pm David O'Brien wrote: > > > Hi folks, > > > Some folks at Juniper have submitted these changes to hwpmc(4). > > > I am sending them here for public review. > > > > > > Their thoughts are: > > > The mp_ncpus refers to the count of the active CPU's. Where as > > > mp_maxid refers to the count of all the cpus on the SMP. Using > > > mp_ncpus in the cpu_id range-check of hwpmc module would lead to the > > > assumption that all the active CPU's in the SMP are not interleaved. > > > But for running on some platforms, the active and inactive cpus could > > > be interleaved making hwpmc not work for the cpus whose cpu_id is > > > greater than the active-cpu count. > > jhb> This is correct, but you need to handle CPUs that are absent. 
It might be > jhb> sufficient to update pmc_cpu_is_disabled() in kern_pmc.c to check > jhb> CPU_ABSENT(cpu) and claim the CPU is disabled if it is absent, but I'm not > jhb> sure that will catch everything as that seems aimed at handling having a > jhb> non-absent CPU halted (such as disabling HTT on i386). > > That is in line with the feedback (and sample patch to kern_pmc.c) that I > had sent in to O'Brien. > > But there are other problems with the patch at various levels, > probably not obvious to someone who is just looking at the kernel > code. > > First, the relevance. My understanding is that these changes are for > a proprietary SMP platform that uses non-mainstream (Tier3 or > Tier4) CPUs. It so happens that Juniper decided to number CPUs > 'sparsely' in their kernel variant and that is the motivation for this > patch. > > IMO, as a policy, code changes for exotic hardware need to be > maintained by vendors of said exotic hardware and not dumped on > volunteers. I would respond with two things: 1) I committed an overhaul of the x86 new-bus code to make it easier for "exotic" embedded x86 hardware platforms in use at companies such as NetApp to hook into new-bus more cleanly. By making it easier for companies to use FreeBSD we a) make it possible for them to even consider using FreeBSD, and b) for companies that use FreeBSD and devote resources (employees) to working on FreeBSD, those resources (e.g. grehan@ at NetApp) can spend more of their time working on stuff that might be able to be given back to FreeBSD than coming up with hacks to work around deficiencies in FreeBSD. 2) All the sparse CPU stuff actually dates back to 5.0 and was there to support Alpha which originally numbered the CPUs using the HWRPB CPU IDs which were not dense at all. (I think my DS20 has CPUs 6 and 7 or some such). So this was actually done to support a Tier-1 platform (at the time). 
Also, note that the comments in sys/smp.h for CPU_ABSENT() and cpu_setmaxid() specifically refer to mp_maxid's purpose and the fact that sparse CPU ID sets are expected and should be handled by code in the kernel. > Second, when I designed the PMCTools API I didn't consider that CPU > numbers could be 'sparse'. [They aren't sparsely allocated > on the i386/amd64---the code I looked at when I was designing > PmcTools.] So there are assumptions sprinkled throughout userland > that the integers 0..hw.ncpus can select a valid CPU. While > all that can be tracked down and changed, and documentation updated, > it is still work that I would prefer to defer until there is a chance > that someone in the general public can use it. I do need to prioritize how I spend my > volunteer hours. FreeBSD has been trying to not be quite as i386-centric as it used to be. If you look at other code in the kernel that handles per-cpu data such as UMA you will see that it uses mp_maxid and CPU_ABSENT(). There are other places in the kernel that are broken though (such as ndis(4)). > Third, IFF we as a project are going to support 'sparse' CPU numbering, > I would like to see the form that takes before making changes to > HWPMC and tools. For example: > - How will userland and in-kernel modules find out which CPUs are > physically present? Would there be a bitmask on the lines of today's > machdep.hlt_cpus that we could query? Could we make the > 'all_cpus' bitmask visible to userland? What happens when we > start supporting systems with more than 32 processors? Yes, we can certainly export more stuff to userland. The all_cpus mask would be good as would a MI online_cpus mask, though at this point they would be cpusets to handle > 32 rather than cpumask_t. Note that machdep.hlt_cpus is x86-only and would be superseded by a MI online_cpus mask. > - Will sysctl hw.ncpus represent the count of present CPUs or will it > represent the maximum CPU id? 
hw.ncpus is always mp_ncpus; kern.smp.cpus is also mp_ncpus; kern.smp.maxcpus is MAX_CPUS. Userland can just iterate from 0 to kern.smp.maxcpus while handling absent CPUs. (For example, the kern.cp_time[] sysctl just writes out all 0's for absent CPUs so that is how userland can determine an absent CPU in that case.) > - How will userland distinguish between absent CPUs and those that > could be temporarily administratively disabled? See above re: all_cpus and online_cpus cpu sets. > - Are we going to support 'transient' CPUs [that come and go]? Why > would we want sparse CPU numbering otherwise? Yes. > Nit: 'mp_maxid' appears to be an index, not a count as claimed above. Correct, and documented as such in sys/smp.h. > If support for sparse CPU numbering is something useful, I feel the > correct sequence should be to discuss it here, add sparse CPU > numbering to the base i386/amd64 kernels (say) first and then > propagate the feature to auxiliary code like HWPMC and userland. > > Changing HWPMC and its userland before the base kernel itself > changes does not seem to be the right thing to do. While the userland interface is somewhat lacking, all of the in-kernel infrastructure has been in place for at least the past 4 years, and there is no excuse for any in-kernel code not properly handling sparse CPU IDs. 
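For readers following the mp_ncpus versus mp_maxid point, here is a minimal userland sketch of the range check under discussion. It is only a model, not the hwpmc(4) code or the submitted patch: `struct cpu_topology`, `model_cpu_absent()`, `model_cpu_is_valid()` and `model_count_present()` are hypothetical names, and the 32-bit mask merely stands in for the kernel's all_cpus mask that CPU_ABSENT() consults.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userland stand-ins for the kernel's all_cpus mask and
 * mp_maxid. These are illustration-only names, not kernel symbols.
 */
struct cpu_topology {
	uint32_t present_mask;	/* bit N set => CPU N is present */
	int	 maxid;		/* highest valid CPU ID: an index, not a count */
};

/* Model of a CPU_ABSENT()-style test: nonzero when no CPU backs this ID. */
static int
model_cpu_absent(const struct cpu_topology *t, int cpu)
{
	assert(cpu >= 0 && cpu < 32);
	return ((t->present_mask & (UINT32_C(1) << cpu)) == 0);
}

/*
 * Range check in the style argued for in the thread: an ID is usable when
 * it lies in 0..maxid AND is not absent. Comparing against a CPU count
 * instead would reject high IDs on a sparsely numbered machine.
 */
static int
model_cpu_is_valid(const struct cpu_topology *t, int cpu)
{
	if (cpu < 0 || cpu > t->maxid)
		return (0);
	return (!model_cpu_absent(t, cpu));
}

/* Walk every ID up to and including maxid, skipping absent CPUs. */
static int
model_count_present(const struct cpu_topology *t)
{
	int cpu, n = 0;

	for (cpu = 0; cpu <= t->maxid; cpu++)
		if (!model_cpu_absent(t, cpu))
			n++;
	return (n);
}
```

With a topology like the DS20 example above (only CPUs 6 and 7 present, so a mask of 0xC0 and a maxid of 7), the count is 2, yet ID 6 must still be accepted. A check of the form `cpu < ncpus` would wrongly refuse it.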
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 05:43:02 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 190D31065673 for ; Sat, 15 Mar 2008 05:43:02 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: from fg-out-1718.google.com (fg-out-1718.google.com [72.14.220.158]) by mx1.freebsd.org (Postfix) with ESMTP id 814BC8FC23 for ; Sat, 15 Mar 2008 05:43:01 +0000 (UTC) (envelope-from joseph.koshy@gmail.com) Received: by fg-out-1718.google.com with SMTP id 16so3696780fgg.35 for ; Fri, 14 Mar 2008 22:43:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; bh=1LDe7embEZsf2PAgnps0sas/wEp08h11p+VgJvmqArs=; b=Nh2/gSdvSQ7UetRjd3jhNFO65Pe0X9ZHuGgOfOPWxy8ncDkLelbIMVcVpMcxvGcYg1BePkr7SxuUYysJW+tFEsvLFUbL/edAGsy2SIz5+kX6Gal6B2b0aKAYkYWRnKjhsc6ZRwcolo3t9G/6ax+RD0U8TCUlmix9jGMtdwuxnmU= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=esk+2crhO2k4Ut9bpuCn2sMeVD2x7ntzexv2PRAWLjxNd2U/HO3O8lsuzWv5glkWTcpSkVc0vjbxGQfYrLxFi6YI0M3w7UsR9RmEJ6Wns+eXCH5/d5gOHxNFlqSj0mtZvu76AReTLax1BNxJA/It+nQ3a+teoOh6NweFm8oQSFk= Received: by 10.86.26.11 with SMTP id 11mr11266492fgz.74.1205559780337; Fri, 14 Mar 2008 22:43:00 -0700 (PDT) Received: by 10.86.99.18 with HTTP; Fri, 14 Mar 2008 22:43:00 -0700 (PDT) Message-ID: <84dead720803142243r6c8cc68dm325e7fb925189fd@mail.gmail.com> Date: Sat, 15 Mar 2008 11:13:00 +0530 From: "Joseph Koshy" To: "John Baldwin" In-Reply-To: <200803141431.53846.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit 
Content-Disposition: inline References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <200803141431.53846.jhb@freebsd.org> Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 05:43:02 -0000 > FreeBSD has been trying to not be quite as i386-centric as it used to be. If > you look at other code in the kernel that handles per-cpu data such as UMA > you will see that it uses mp_maxid and CPU_ABSENT(). There are other places > in the kernel that are broken though (such as ndis(4)). HWPMC is very x86 centric, for obvious reasons. > Yes, we can certainly export more stuff to userland. The all_cpus mask would > be good as would a MI online_cpus mask, though at this point they would be > cpusets to handle > 32 rather than cpumask_t. Note that machdep.hlt_cpus is > x86-only and would be superseded by a MI online_cpus mask. Sure, an MI counter is a good idea. > > - Will sysctl hw.ncpus represent the count of present CPUs or will it > > represent the maximum CPU id? > > hw.ncpus is always mp_ncpus > kern.smp.cpus is also mp_ncpus > kern.smp.maxcpus is MAX_CPUS. > Userland can just iterate from 0 to kern.smp.maxcpus while handling absent > CPUs. (For example, the kern.cp_time[] sysctl just writes out all 0's for > absent CPUs so that is how userland can determine an absent CPU in that > case.) I thought of that. For PMCTools use, using the proposed 'online_cpus' mask would be a better option. MAX_CPUS is a compile time value and could be large, whereas most machines will have far fewer CPUs than that limit. Why waste cycles needlessly? 
Now it appears to me that in the scheme of things described above one of mp_maxid and mp_ncpus is superfluous. Here is the reasoning:

0) We need a compile time limit for the kernel; this is kern.smp.maxcpus.

1) A given machine has a maximum number of CPUs that can fit in it. This is usually <<= {MAXCPUS}. Let us call this {MACHINE-MAX}. We need to scale kernel data structures based on {MACHINE-MAX} since using {MAXCPUS} is probably wasteful. We cannot just count the current number of CPUs, as we do today, because more could be hotplugged in later.

2) At any given instant a subset of CPUs 0..{MACHINE-MAX} will be online. This would be tracked by the kern.smp.online_cpus/all_cpus bitmask.

Therefore we can use either a count (mp_ncpus) or a maximum id (mp_maxid) to represent {MACHINE-MAX}, but either one would do.

However, x86 MD code uses both, with newer code seeming to prefer mp_maxid. So I am puzzled. There are far more uses of mp_ncpus there though.

jk> Changing HWPMC and its userland before the base kernel itself
jk> changes does not seem to be the right thing to do.

jb> While the userland interface is somewhat lacking, all of the in-kernel
jb> infrastructure has been in place for at least the past 4 years, and there is
jb> no excuse for any in-kernel code not properly handling sparse CPU IDs.

I try to keep userland, kernel and documentation associated with PmcTools in sync. Looking around, there appear to be lots of nits that need correction. For one, the kern.smp sysctl hierarchy is undocumented.
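The difference between a count and a maximum ID in the reasoning above can be sketched in userland C. This is only an illustration under stated assumptions: cpu_mask_t and the mask_*() helpers are hypothetical names invented here, a single 64-bit word stands in for the proposed all_cpus/online_cpus set, and none of it is the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an all_cpus/online_cpus set. A real cpuset
 * would have to handle more than 64 CPUs.
 */
typedef uint64_t cpu_mask_t;

/* Number of CPUs in the mask: the kind of value a count such as mp_ncpus holds. */
int
mask_count(cpu_mask_t mask)
{
	int n = 0;

	/* Clear the lowest set bit on each pass. */
	for (; mask != 0; mask &= mask - 1)
		n++;
	return (n);
}

/* Highest CPU ID in the mask: an index such as mp_maxid. Returns -1 if empty. */
int
mask_maxid(cpu_mask_t mask)
{
	int id = -1;

	while (mask != 0) {
		id++;
		mask >>= 1;
	}
	return (id);
}

/* Non-zero if the given ID is not in the mask, in the spirit of CPU_ABSENT(). */
int
mask_absent(cpu_mask_t mask, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return ((mask & ((cpu_mask_t)1 << cpu)) == 0);
}
```

With only CPUs 0 and 2 present (mask 0x5), mask_count() and mask_maxid() both return 2, yet an array indexed by CPU ID needs mask_maxid() + 1 = 3 slots. The count and the index describe the same machine only when the IDs are dense.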
Thanks, Koshy From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 10:43:23 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4B1A21065674 for ; Sat, 15 Mar 2008 10:43:23 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe05.swip.net [212.247.154.129]) by mx1.freebsd.org (Postfix) with ESMTP id 5D97D8FC26 for ; Sat, 15 Mar 2008 10:43:22 +0000 (UTC) (envelope-from hselasky@c2i.net) X-Cloudmark-Score: 0.000000 [] Received: from [62.113.132.89] (account mc467741@c2i.net [62.113.132.89] verified) by mailfe05.swip.net (CommuniGate Pro SMTP 5.1.13) with ESMTPA id 751509790; Sat, 15 Mar 2008 10:43:18 +0100 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Sat, 15 Mar 2008 10:44:23 +0100 User-Agent: KMail/1.9.7 References: <86ve4s9357.fsf@ds4.des.no> <47B3EB4E.40508@elischer.org> <20080217.113340.390436320.imp@bsdimp.com> In-Reply-To: <20080217.113340.390436320.imp@bsdimp.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200803151044.25764.hselasky@c2i.net> Cc: des@des.no, julian@elischer.org, ed@fxq.nl Subject: Re: Proposal for redesigning the TTY layer X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 10:43:23 -0000 Hi Ed, Just some ideas: Maybe you can add some more functionality to the TTY layer so that it becomes symmetric with regard to Host and Device side. For example that you can send a "RING" to a modem, and not only receive a "RING." This can be very interesting for embedded products where you want to emulate a modem through an USB device side driver. 
The official FreeBSD USB stack does not support device side drivers, but the one in P4 does. --HPS From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 12:40:10 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 43AEA1065688 for ; Sat, 15 Mar 2008 12:40:10 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from palm.hoeg.nl (mx0.hoeg.nl [IPv6:2001:610:652::211]) by mx1.freebsd.org (Postfix) with ESMTP id 035928FC17 for ; Sat, 15 Mar 2008 12:40:09 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: by palm.hoeg.nl (Postfix, from userid 1000) id 46E6C1CC44; Sat, 15 Mar 2008 13:40:08 +0100 (CET) Date: Sat, 15 Mar 2008 13:40:08 +0100 From: Ed Schouten To: FreeBSD Arch Message-ID: <20080315124008.GF80576@hoeg.nl> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="3M7QbeJEF900HlmX" Content-Disposition: inline User-Agent: Mutt/1.5.17 (2007-11-01) Cc: Subject: vgone() calling VOP_CLOSE() -> blocked threads? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 12:40:10 -0000 --3M7QbeJEF900HlmX Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hello everyone, The last couple of days I'm seeing some strange things in my mpsafetty branch related to terminal revocation. In my current TTY design, I hold a count (t_ldisccnt) of the amount of threads that are sleeping in the line discipline. I need to store such a count, because it's not possible to change line disciplines while some threads are still blocked inside the discipline. This means that when d_close() is called on a TTY, t_ldisccnt should always be 0. 
There cannot be any threads stuck inside the line discipline when there aren't any descriptors referencing it.

Unfortunately, this isn't entirely true with the current VFS/devfs design. When vgone() is called, a VOP_CLOSE() is performed, which means there could be a dozen threads still stuck inside a device driver, but the close routine is already called to clean up stuff. There are a *real* lot of drivers that blindly clean up their stuff in the d_close() routine, expecting that the device is completely unused. This can easily be demonstrated by revoking a bpf device, while running tcpdump.

To be honest, I'm not completely sure how to solve this issue, though I know it should at least do something similar to this:

- The device driver should have a separate routine (d_revoke) to wake up any blocked threads, to make sure they leave the device driver properly.

- Maybe vgonel() shouldn't call VOP_CLOSE(). It should probably move the vnode into deadfs, with the exception of the close() routine. Maybe it's better to add a new function to do this, vrevoke().

This means that when a revoke() call is performed, all blocked threads are woken up, will leave the driver, to find out their terminal has been revoked. Further system calls will fail, because the vnode is in deadfs, but when the processes close the descriptor, the device driver can still clean up everything.

In theory these changes would also make it easier for other filesystems to support the revoke() call. A generic vop_revoke could just call vrevoke(), which means the current system calls aren't interrupted, but calls later on will fail. This will be sufficient for most filesystems.

I'm not a VFS guru, so it will probably take me some time and will probably dogfood some of my filesystems. I could probably use some help.
;-) --=20 Ed Schouten WWW: http://g-rave.nl/ --3M7QbeJEF900HlmX Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (FreeBSD) iEYEARECAAYFAkfbw6gACgkQ52SDGA2eCwWHvQCeP/wk8sTFNFsgKM2kdVhGN6PS 3zQAniPruoouxd1GnjDDq6al+rWk+pBb =zwA/ -----END PGP SIGNATURE----- --3M7QbeJEF900HlmX-- From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 16:55:38 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A27F91065675 for ; Sat, 15 Mar 2008 16:55:38 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail16.syd.optusnet.com.au (mail16.syd.optusnet.com.au [211.29.132.197]) by mx1.freebsd.org (Postfix) with ESMTP id 331C28FC1F for ; Sat, 15 Mar 2008 16:55:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c220-239-252-11.carlnfd3.nsw.optusnet.com.au (c220-239-252-11.carlnfd3.nsw.optusnet.com.au [220.239.252.11]) by mail16.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m2FGtIjX022004 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 16 Mar 2008 03:55:21 +1100 Date: Sun, 16 Mar 2008 03:55:18 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Ed Schouten In-Reply-To: <20080315124008.GF80576@hoeg.nl> Message-ID: <20080316015903.N39516@delplex.bde.org> References: <20080315124008.GF80576@hoeg.nl> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: FreeBSD Arch Subject: Re: vgone() calling VOP_CLOSE() -> blocked threads? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 16:55:38 -0000 On Sat, 15 Mar 2008, Ed Schouten wrote: > The last couple of days I'm seeing some strange things in my mpsafetty > branch related to terminal revocation. > > In my current TTY design, I hold a count (t_ldisccnt) of the amount of > threads that are sleeping in the line discipline. I need to store such a > count, because it's not possible to change line disciplines while some > threads are still blocked inside the discipline. This means that when > d_close() is called on a TTY, t_ldisccnt should always be 0. There > cannot be any threads stuck inside the line discipline when there aren't > any descriptors referencing it. > > Unfortunately, this isn't entirely true with the current VFS/devfs > design. When vgone() is called, a VOP_CLOSE() is performed , which means > there could be a dozen threads still stuck inside a device driver, but > the close routine is already called to clean up stuff. There are a > *real* lot of drivers that blindly clean up their stuff in the d_close() > routine, expecting that the device is completely unused. This can > easily be demonstrated by revoking a bpf device, while running tcpdump. Yes, most drivers are broken here, but the problem is rarely noticed because revoke() isn't normally applied to any devices except ttys. Even ordinary close() can cause problems when a thread is sleeping in device open, but this too is only common for ttys (for callin and callout devices). The tty driver is about the only driver that handles this problem almost correctly. It uses a generation count. All tty drivers are supposed to sleep using only ttysleep(). ttysleep() checks the generation count and returns ERESTART if the generation is new. 
All tty drivers should consider this error to be fatal and propagate it up to the syscall level where the syscall is restarted. This tends to happen naturally, but in some places (in device close IIRC), the driver ignores the error and does more i/o (to finish cleaning up in close -- close and open can easily pass each other and clobber each other's state when this happens). More I/O also tends to occur if a revoke() happens when a thread is blocked but not sleeping. Then ttysleep() isn't in sight, so the thread has no idea that the generation count changed. Giant locking limits this problem.

> To be honest, I'm not completely sure how to solve this issue, though I
> know it should at least do something similar to this:
>
> - The device driver should have a separate routine (d_revoke) to wake
>   up any blocked threads, to make sure they leave the device driver
>   properly.

Something is needed to signal blocked but non-sleeping threads. I think the wakeup for ttys now normally occurs as a side effect of flushing i/o. revoke() normally calls ttyclose() which calls ttyldclose() which normally calls ttylclose() which flushes i/o which wakes up threads waiting on the i/o. I don't see how ttylclose() can work right in the usual !FNONBLOCK case. Maybe revoke() sets FNONBLOCK. The generation count stuff doesn't help here because the flush is done before incrementing the generation count.

There is an obvious race here for threads doing i/o instead of waiting for it. These must be blocked (by Giant now for ttys, or by your MPSAFE locking). They will run when revoke() releases the lock and find the i/o flushed and maybe the generation count incremented, but they normally won't check these states and will just blunder on doing more i/o. It would be painful to check these states every time the lock is acquired, but this seems to be necessary. Magic Giant locking makes the places where the lock is acquired hard to see.

> - Maybe vgonel() shouldn't call VOP_CLOSE(). It should probably move the
>   vnode into deadfs, with the exception of the close() routine. Maybe
>   it's better to add a new function to do this, vrevoke().
>
> This means that when a revoke() call is performed, all blocked threads
> are woken up, will leave the driver, to find out their terminal has been
> revoked. Further system calls will fail, because the vnode is in deadfs,
> but when the processes close the descriptor, the device driver can still
> clean up everything.

I think vfs already moves the vnode to deadfs. It doesn't do anything to synchronize with threads running in device drivers. The forced last-close() should complete synchronously as part of revoke(). Then other threads leave the device driver asynchronously, hopefully not much later. Then if the generation count stuff is working right, the syscall is restarted, but now file descriptors point to deadfs so the syscall normally fails. I think the async completion is OK provided it is done right (don't delay it indefinitely, and don't do more i/o on completion). It doesn't seem to be useful to make revoke() wait for the completions.

I don't think it would work well to move everything except d_close to deadfs.

Other problems near here:
- neither vfs nor drivers currently know how many threads are in a driver. vfs uses vp->v_rdev->si_usecount, but this doesn't quite work since it doesn't count threads sleeping in open. Maybe ones executing last close too -- this would be more of a problem. revoke() just uses vcount(), which just acquires the device locks and returns si_usecount after releasing the device lock. (I don't understand this locking -- what stops the count changing after the lock is released, or if it cannot change then why acquire the lock?) This can result in revoke() not calling device close when it should. Drivers can obviously keep count of their activities using large code. I can't see any way for vfs to keep count short of asking drivers for their counts.
- there can be any number of threads in device open and close concurrently, even without the complications for revoke(). The most problematic cases happen when last-close blocks, as is common for ttys waiting for output to drain (since no one cares about their output actually working and ensures draining it using tcdrain() -- normal losing programs finish up with something like printf(); exit(); and depend on the close() in exit(); blocking to drain the output). Then new opens are allowed, and this is useful for doing ioctls() to unblock blocked closes. If the new open or fcntl sets non-blocking mode, then the last-close for the new open may pass the blocked last close. If the new mode is blocking, then the last-close for the new open may block too. The number of threads in last-close is thus unlimited. A thundering herd of them tends to stomp on each other when they are all unblocked at the same time.

The connections of this with revoke() are:
- it takes vfs's not counting of all threads in the device driver to allow the useful behaviour of opens while a close is blocked and the necessary behaviour of last-close while another last-close is executing (drivers should be aware of this possibility and merge the closes, but don't).
- I think revoke() sets FNONBLOCK somewhere. Thus it tends to unblock any thread waiting in last-close for output to drain.

Less problematic cases occur when opens block. ttyopen() understands this possibility and handles it almost right using its t_wopeners count. ttyopen() uses various sleeps where it should use ttysleep() or check the generation count itself; this results in it looping internally instead of restarting the syscall, which is only a small error since for open() alone, restarting the syscall would call back to the same non-dead device open except in unusual cases where there was a signal and syscalls are not restarted, or the device name went away.
There is still a problem with the vfs usage counting -- in one case involving callin and callout devices whose details I forget, last-close is not called when it needs to be called to wake up all the threads sleeping in open so that they can enter a new state. Bruce From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 17:16:47 2008 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D57101065675; Sat, 15 Mar 2008 17:16:47 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.freebsd.org (Postfix) with ESMTP id 94DC78FC18; Sat, 15 Mar 2008 17:16:47 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 1FC1046C23; Sat, 15 Mar 2008 13:16:46 -0400 (EDT) Date: Sat, 15 Mar 2008 17:16:46 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Joseph Koshy In-Reply-To: <84dead720803142243r6c8cc68dm325e7fb925189fd@mail.gmail.com> Message-ID: <20080315170411.A42065@fledge.watson.org> References: <20080313180805.GA83406@dragon.NUXI.org> <200803131516.12284.jhb@freebsd.org> <84dead720803132232k15c3aad7pe59875f0c84e0c27@mail.gmail.com> <200803141431.53846.jhb@freebsd.org> <84dead720803142243r6c8cc68dm325e7fb925189fd@mail.gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: [PATCH] hwpmc(4) changes to use 'mp_maxid' instead of 'mp_ncpus'. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 17:16:48 -0000

On Sat, 15 Mar 2008, Joseph Koshy wrote:

> Therefore we can use either a count (mp_ncpus) or a maximum id (mp_maxid) to
> represent {MACHINE-MAX}, but either one would do.
>
> However, x86 MD code uses both, with newer code seeming to prefer mp_maxid.
> So I am puzzled. There are far more uses of mp_ncpus there though.

I suspect that's because kernel code wants to index into a data structure using the CPU ID, i.e., curcpu, but doesn't want to size the array at MAXCPU, which will be an increasingly large compile-time constant over time. This relies on the relative non-sparseness of CPU IDs to be of benefit, and generally, this does hold. For example, on the HTT boxes, CPU IDs might be 0..3 with 0 and 2 being used, and that's still less than 16 or 32. However, in some cases we size kernel arrays to MAXCPU, and sometimes to mp_maxid. There's a reasonable argument that sizing arrays this way is a dubious practice as you more ideally want to store per-CPU data hung off the percpu block to avoid adjacent per-CPU data in the same cache line.

I ran into some similar concerns when trying to figure out how best to export memory allocator statistics from the kernel. In the end what I concluded was that I would export contiguous CPU data up to mp_maxid from the kernel, and that userspace would try to avoid any compile-time knowledge of CPU limits so that it doesn't matter if a kernel is compiled for UP (MAXCPU=1) or SMP (MAXCPU=(n), where n is often 16, I believe). I do end up exporting data for absent CPUs under mp_maxid.
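The export scheme described above, contiguous per-CPU records up to a maximum ID with no compile-time CPU limit in userspace, might look roughly like the following on the consuming side. This is a sketch under assumptions: struct cpu_stat and both functions are hypothetical names made up for illustration, and the maximum ID is assumed to arrive at run time from the kernel by some means not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical per-CPU record. A slot left all-zero stands for an absent
 * CPU, so callers need no separate presence table.
 */
struct cpu_stat {
	unsigned long allocs;
	unsigned long frees;
};

/*
 * Allocate maxid + 1 zeroed slots so that any CPU ID in 0..maxid is a
 * valid index. The size comes from a run-time value, not from MAXCPU.
 */
struct cpu_stat *
cpu_stats_alloc(int maxid)
{
	assert(maxid >= 0);
	return (calloc((size_t)maxid + 1, sizeof(struct cpu_stat)));
}

/* Sum one field over every slot; absent CPUs contribute zero. */
unsigned long
cpu_stats_total_allocs(const struct cpu_stat *stats, int maxid)
{
	unsigned long total = 0;
	int cpu;

	for (cpu = 0; cpu <= maxid; cpu++)
		total += stats[cpu].allocs;
	return (total);
}
```

For the HTT example of IDs 0..3 with only 0 and 2 in use, cpu_stats_alloc(3) yields four slots, two of which simply stay zero. The cost is a little wasted memory for absent CPUs in exchange for code that works unchanged on UP and SMP kernels.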
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 19:48:07 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A8105106564A for ; Sat, 15 Mar 2008 19:48:07 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 8812B8FC1A for ; Sat, 15 Mar 2008 19:48:07 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (unknown [192.168.64.3]) by phk.freebsd.dk (Postfix) with ESMTP id 7AFBF17104; Sat, 15 Mar 2008 19:48:05 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.2/8.14.2) with ESMTP id m2FJm4U5006719; Sat, 15 Mar 2008 19:48:04 GMT (envelope-from phk@critter.freebsd.dk) To: Ed Schouten From: "Poul-Henning Kamp" In-Reply-To: Your message of "Sat, 15 Mar 2008 13:40:08 +0100." <20080315124008.GF80576@hoeg.nl> Date: Sat, 15 Mar 2008 19:48:04 +0000 Message-ID: <6718.1205610484@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: FreeBSD Arch Subject: Re: vgone() calling VOP_CLOSE() -> blocked threads? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 19:48:07 -0000 >To be honest, I'm not completely sure how to solve this issue, though I >know it should at least do something similar to this: > >- The device driver should have a seperate routine (d_revoke) to wake > up any blocked threads, to make sure they leave the device driver > properly. It's already there, it's called d_purge(). 
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Sat Mar 15 21:06:00 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4320D106566C for ; Sat, 15 Mar 2008 21:06:00 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from relay02.kiev.sovam.com (relay02.kiev.sovam.com [62.64.120.197]) by mx1.freebsd.org (Postfix) with ESMTP id B845E8FC1A for ; Sat, 15 Mar 2008 21:05:59 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from [212.82.216.226] (helo=skuns.kiev.zoral.com.ua) by relay02.kiev.sovam.com with esmtps (TLSv1:AES256-SHA:256) (Exim 4.67) (envelope-from ) id 1JacxM-0004Q2-W0 for arch@freebsd.org; Sat, 15 Mar 2008 22:26:19 +0200 Received: from deviant.kiev.zoral.com.ua (root@deviant.kiev.zoral.com.ua [10.1.1.148]) by skuns.kiev.zoral.com.ua (8.14.2/8.14.2) with ESMTP id m2FJmMWB091804 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 15 Mar 2008 21:48:22 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.2/8.14.2) with ESMTP id m2FJm9oL062454; Sat, 15 Mar 2008 21:48:09 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.2/8.14.2/Submit) id m2FJm9C1062453; Sat, 15 Mar 2008 21:48:09 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sat, 15 Mar 2008 21:48:09 +0200 From: Kostik Belousov To: Ed Schouten Message-ID: <20080315194809.GN10374@deviant.kiev.zoral.com.ua> References: <20080315124008.GF80576@hoeg.nl> <20080316015903.N39516@delplex.bde.org> 
Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="h/ohfBjN02kAJu/T" Content-Disposition: inline In-Reply-To: <20080316015903.N39516@delplex.bde.org> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: ClamAV version 0.91.2, clamav-milter version 0.91.2 on skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.4 X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on skuns.kiev.zoral.com.ua X-Scanner-Signature: 46978630fe90eedb12f62e56ce2f8684 X-DrWeb-checked: yes X-SpamTest-Envelope-From: kostikbel@gmail.com X-SpamTest-Group-ID: 00000000 X-SpamTest-Info: Profiles 2421 [Mar 14 2008] X-SpamTest-Info: helo_type=3 X-SpamTest-Method: none X-SpamTest-Rate: 0 X-SpamTest-Status: Not detected X-SpamTest-Status-Extended: not_detected X-SpamTest-Version: SMTP-Filter Version 3.0.0 [0278], KAS30/Release Cc: FreeBSD Arch Subject: Re: vgone() calling VOP_CLOSE() -> blocked threads? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 15 Mar 2008 21:06:00 -0000 --h/ohfBjN02kAJu/T Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Mar 16, 2008 at 03:55:18AM +1100, Bruce Evans wrote: > On Sat, 15 Mar 2008, Ed Schouten wrote: >=20 > >The last couple of days I'm seeing some strange things in my mpsafetty > >branch related to terminal revocation. > > > >In my current TTY design, I hold a count (t_ldisccnt) of the amount of > >threads that are sleeping in the line discipline. I need to store such a > >count, because it's not possible to change line disciplines while some > >threads are still blocked inside the discipline. 
This means that when > >d_close() is called on a TTY, t_ldisccnt should always be 0. There > >cannot be any threads stuck inside the line discipline when there aren't > >any descriptors referencing it. > > > >Unfortunately, this isn't entirely true with the current VFS/devfs > >design. When vgone() is called, a VOP_CLOSE() is performed , which means > >there could be a dozen threads still stuck inside a device driver, but > >the close routine is already called to clean up stuff. There are a > >*real* lot of drivers that blindly clean up their stuff in the d_close() > >routine, expecting that the device is completely unused. This can > >easily be demonstrated by revoking a bpf device, while running tcpdump. >=20 > Yes, most drivers are broken here, but the problem is rarely noticed > because revoke() isn't normally applied to any devices except ttys. > Even ordinary close() can cause problems when a thread is sleeping > in device open, but this too is only common for ttys (for callin and > callout devices). >=20 > The tty driver is about the only driver that handles this problem > almost correctly. It uses a generation count. All tty drivers are > supposed to sleep using only ttysleep(). ttysleep() checks the > generation count and returns ERESTART if the generation is new. All > tty drivers should consider this error to be fatal and propagate it > up to the syscall level where the syscall is restarted. This tends > to happen naturally, but some places (in device close IIRC), the driver > ignores the error and does more i/o (to finish cleaning up in close > -- close and open can easily pass each other and clobber each others > state when this happens). More I/O also tends to occur if a revoke() > happens when a thread is blocked but not sleeping. Then ttysleep() > isn't in sight, so the thread has no idea that the generation count > changed. Giant locking limits this problem. 
>=20 > >To be honest, I'm not completely sure how to solve this issue, though I > >know it should at least do something similar to this: > > > >- The device driver should have a seperate routine (d_revoke) to wake > > up any blocked threads, to make sure they leave the device driver > > properly. >=20 > Something is needed to signal blocked but non-sleeping threads. I > think the wakeup for ttys now normally occurs as a side effect of > flushing i/o. revoke() normally calls ttyclose() which calls ttyldclose() > which normally calls ttylclose() which flushes i/o which wakes up > threads waiting on the i/o. I don't see how ttylclose() can work right > in the usual !FNONBLOCK case. Maybe revoke() sets FNONBLOCK. The > generation count stuff doesn't help here because the flush is done > before incrementing the generation count. >=20 > There is an obvious race here for threads doing i/o instead of waiting > for it. These muse be blocked (by Giant now for ttys, or by your > MPSAFE locking). They will run when revoke() releases the lock and > find the i/o flushed and maybe the generation count incrememented, but > they normally won't check these states and will just blunder on doing > more i/o. It would be painful to check these states every time the > lock is aquired, but this seems to be necessary. Magic Giant locking > makes the places where the lock is acquired hard see. >=20 > >- Maybe vgonel() shouldn't call VOP_CLOSE(). It should probably move the > > vnode into deadfs, with the exception of the close() routine. Maybe > > it's better to add a new function to do this, vrevoke(). > > > >This means that when a revoke() call is performed, all blocked threads > >are woken up, will leave the driver, to find out their terminal has been > >revoked. Further system calls will fail, because the vnode is in deadfs, > >but when the processes close the descriptor, the device driver can still > >clean up everything. >=20 > I think vfs already moves the vnode to deadfs. 
It doesn't do anything > to synchronize with threads running in device drivers. The forced > last-close() should complete synchronously as part of revoke(). Then > other threads leave the device driver asynchronously, hopefully not > much later. Then if the generation count stuff is working right, the > syscall is restarted, but now file descriptors point to deadfs so the > syscall normally fails. I think the async completion is OK provided > it is done right (don't delay it indefinitely, and don't do more > i/o on completion). It doesn't seem to be useful to make revoke() > wait for the completions. >=20 > I don't think it would work well to move everything except d_close to > deadfs. >=20 > Other problems near here: > - neither vfs nor drivers currently know how many threads are in a > driver. vfs uses vp->v_rdev->si_usecount, but this doesn't quite work This is provided by si_threadcount. See the dev(vn)_refthread and it usage in the devfs vnops and fops. > since it doesn't count threads sleeping in open. Maybe ones excuting > last close too -- this would be more of a problem. revoke() just > uses vcount(), which just acquires the device locks and returns > si_usecount after releasing the device lock. (I don't understand > this locking -- what stops the count changing after the lock is > released, or if it cannot changed then why acquire the lock?) This > can result in revoke() not calling device close when it should. > Drivers can obviously keep count of their activities using large > code. I can't see any way for vfs to keep count short of asking > drivers for their counts. > - there can be any number of threads in device open and close concurrentl= y, > even without the complications for revoke(). 
The most problematic > cases happen when last-close blocks, as is common for ttys waiting > for output to drain (since no one cares about their output actually > working and ensures draining it using tcdrain() -- normal losing > programs finish up with something like printf(); exit(); and depend > on the close() in exit(); blocking to drain the output). Then new > opens are allowed, and this is useful for doing ioctls() to unblock > blocked closes. If the new open or fcntl sets non-blocking mode, then > the last-close for the new open may pass the blocked last close. If > the new mode is blocking, then the last-close for the new open may block > too. The number of threads in last-close is thus unlimited. A thundering > herd of them tends to stomp on each other when they are all unblocked at > the same time. > > The connections of this with revoke() are: > - it takes vfs's not counting of all threads in the device driver to > allow the useful behaviour of opens while a close is blocked and > the necessary behaviour of last-close while another last-close is > executing (drivers should be aware of this possibility and merge > the closes, but don't). > - I think revoke() sets FNONBLOCK somewhere. Thus it tends to unblock > any thread waiting in last-close for output to drain. > > Less problematic cases occur when opens block. ttyopen() understands > this possibility and handles it almost right using its t_wopeners > count. ttyopen() uses various sleeps where it should use ttysleep() > or check the generation count itself; this results in it looping > internally instead of restarting the syscall, which is only a small > error since for open() alone, restarting the syscall would call back > to the same non-dead device open except in unusual cases where there > was a signal and syscalls are not restarted, or the device name went > away. 
There is still a problem with the vfs usage counting -- in > one case involving callin and callout devices whose details I forget, > last-close is not called when it needs to be called to wake up all > the threads sleeping in open so that they can enter a new state. The device driver can already provide a d_purge method, which is intended to safely drain all threads that are currently in the driver. See kern_conf.c:destroy_dev() for the usage. Also, please note that drivers cannot call destroy_dev() from the d_close method, because it would self-lock on si_threadcount. The livelock is caused by the fix for a problem identical to the one you described, with the substitution s/ldisc/cdev/. The destroy_dev_sched() function is provided to execute destroy_dev() from another context. As an alternative to what was proposed regarding vrevoke(), you could use similar lifecycle management for the ldisc, if suitable. --h/ohfBjN02kAJu/T Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (FreeBSD) iEYEARECAAYFAkfcJ/gACgkQC3+MBN1Mb4jN+ACfXT7H0LrUGepI7fnS51azFdte pSYAnR9PIGY9M/yezNxRpxph+od4d5Up =u9ra -----END PGP SIGNATURE----- --h/ohfBjN02kAJu/T--