From owner-freebsd-arch@FreeBSD.ORG Sun Nov 26 01:40:50 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0BB0316A412 for ; Sun, 26 Nov 2006 01:40:50 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outZ.internet-mail-service.net (outZ.internet-mail-service.net [216.240.47.249]) by mx1.FreeBSD.org (Postfix) with ESMTP id D141543D49 for ; Sun, 26 Nov 2006 01:39:58 +0000 (GMT) (envelope-from julian@elischer.org) Received: from shell.idiom.com (HELO idiom.com) (216.240.47.20) by out.internet-mail-service.net (qpsmtpd/0.32) with ESMTP; Sat, 25 Nov 2006 17:28:10 -0800 Received: from [192.168.2.5] (home.elischer.org [216.240.48.38]) by idiom.com (8.12.11/8.12.11) with ESMTP id kAQ1elC6042145; Sat, 25 Nov 2006 17:40:47 -0800 (PST) (envelope-from julian@elischer.org) Message-ID: <4568F09E.2030007@elischer.org> Date: Sat, 25 Nov 2006 17:40:46 -0800 From: Julian Elischer User-Agent: Thunderbird 1.5.0.8 (Macintosh/20061025) MIME-Version: 1.0 To: Gordon Tetlow References: <45649E42.70409@cs.rice.edu> <20061123020747.GZ2260@obelix.dsto.defence.gov.au> <17DE7E25-BCEE-46C7-9EB2-73A9D5C37CB1@tetlows.org> In-Reply-To: <17DE7E25-BCEE-46C7-9EB2-73A9D5C37CB1@tetlows.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Wilkinson, Alex" , freebsd-arch@freebsd.org Subject: Re: superpage plans X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Nov 2006 01:40:50 -0000 Gordon Tetlow wrote: > > On Nov 22, 2006, at 6:07 PM, Wilkinson, Alex wrote: > >> 0n Wed, Nov 22, 2006 at 01:00:18PM -0600, Alan Cox wrote: >> >>> Kip Macy wrote: >>> >>>> Do you have any thoughts on when superpage support might go 
into >>>> -CURRENT? >> >> erm, what is superpage ? > > http://www.cs.rice.edu/~jnavarro/superpages/ or.. http://people.freebsd.org/~julian/BAFUG/talks/superpages/ A .mov format video of (our) Alan Cox explaining this Superpages implementation in FreeBSD. > > -gordon > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Sun Nov 26 12:05:01 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7C75916A500 for ; Sun, 26 Nov 2006 12:05:01 +0000 (UTC) (envelope-from yar@comp.chem.msu.su) Received: from comp.chem.msu.su (comp.chem.msu.su [158.250.32.97]) by mx1.FreeBSD.org (Postfix) with ESMTP id CE9CF43DAA for ; Sun, 26 Nov 2006 12:03:54 +0000 (GMT) (envelope-from yar@comp.chem.msu.su) Received: from comp.chem.msu.su (localhost [127.0.0.1]) by comp.chem.msu.su (8.13.4/8.13.3) with ESMTP id kAQC4iRZ062118; Sun, 26 Nov 2006 15:04:44 +0300 (MSK) (envelope-from yar@comp.chem.msu.su) Received: (from yar@localhost) by comp.chem.msu.su (8.13.4/8.13.3/Submit) id kAQC4ciY062107; Sun, 26 Nov 2006 15:04:38 +0300 (MSK) (envelope-from yar) Date: Sun, 26 Nov 2006 15:04:38 +0300 From: Yar Tikhiy To: John Birrell Message-ID: <20061126120437.GA60959@comp.chem.msu.su> References: <20061123232035.GA56985@what-creek.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20061123232035.GA56985@what-creek.com> User-Agent: Mutt/1.5.9i Cc: arch@freebsd.org Subject: Re: Proposed change to make -j X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Sun, 26 Nov 2006 12:05:01 -0000 On Thu, Nov 23, 2006 at 11:20:35PM +0000, John Birrell wrote: > Currently 'make -j' reports an error if the number of jobs > isn't specified. > > I'd like to change make(1) to treat -j (without a number) as > meaning "set the number of jobs to the number of processors". > > On sun4v, each processor isn't too powerful and system performance > is only decent when you use all the processors - 32 in my case. > > I've been working on a parallel 'make release' process which > would benefit from having -j set by default. At the moment I > set MAKEFLAGS=j32 in my environment and this achieves the desired > result, but -j would be more general. > > Thoughts? Besides the portability issues already pointed at, making option's argument optional itself doesn't fit in the getopt(3) semantics and is confusing. As a rule, option's argument must be able to begin with a dash. If this extension to make(1) were good from the technical POV, I'd suggest "-j -1", "-j max", "-j ncpu", or whatever, but not a bare "-j". Even if the make(1) code can handle such optional arguments, other stock tools should not be spoiled by that. 
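A minimal sketch of the alternative suggested above, in which -j always takes an argument and a keyword such as "max" or "ncpu" selects the processor count. The helper name parse_jobs and its ncpu parameter are assumptions made for illustration; this is not code from make(1).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: interpret the mandatory argument of -j.
 * "max" and "ncpu" map to the processor count supplied by the caller;
 * anything else must be a positive decimal number.
 * Returns the job count, or -1 if the argument is not acceptable.
 */
static long
parse_jobs(const char *arg, long ncpu)
{
	char *end;
	long n;

	if (strcmp(arg, "max") == 0 || strcmp(arg, "ncpu") == 0)
		return (ncpu > 0 ? ncpu : 1);

	errno = 0;
	n = strtol(arg, &end, 10);
	/* Reject empty input, trailing garbage, overflow and values < 1. */
	if (errno != 0 || end == arg || *end != '\0' || n < 1)
		return (-1);
	return (n);
}
```

Because the argument is never optional, the option stays within ordinary getopt(3) semantics. A caller could obtain ncpu from sysconf(_SC_NPROCESSORS_ONLN) on systems that provide it.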
IMHO :-) -- Yar From owner-freebsd-arch@FreeBSD.ORG Sun Nov 26 17:44:22 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C8B2A16A47B for ; Sun, 26 Nov 2006 17:44:22 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id A067C43D53 for ; Sun, 26 Nov 2006 17:43:25 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 6BB4D46CA0; Sun, 26 Nov 2006 12:44:20 -0500 (EST) Date: Sun, 26 Nov 2006 17:44:20 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Ivan Voras In-Reply-To: Message-ID: <20061126174041.V83346@fledge.watson.org> References: <20061119041421.I16763@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: What is the PREEMPTION option good for? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Nov 2006 17:44:22 -0000 On Thu, 23 Nov 2006, Ivan Voras wrote: > Ivan Voras wrote: >> Bruce Evans wrote: >> >>> Most of the difference is caused by pgzero becoming too active with >>> PREEMPTION. >> >> Don't know about the other things but I've noticed pagezero is suspiciously >> active on heavy loaded SMP web servers (even complained on @stable a long >> time ago). I'll try disabling PREEMPTION and see how it goes. > > Ok, I couldn't run extensive tests because people were waiting to use the > machine, so this should be considered anecdotal evidence. 
On a simple > benchmark that repeatedly (for 1 minute) and concurrently (target=50 > concurrent requests) hits a dynamic web page on a development machine (2 > proc true SMP), the performance goes up from ~85 requests/sec. to ~105 > requests/s by disabling PREEMPTION. This improvement looks suspiciously high > to me, but I don't think I'll be going back :) pagezero is now not noticeable > in 'top' output. There's a known performance regression with PREEMPTION and loopback network traffic on UP or UP-like systems due to a poor series of context switches occurring in the network stack. If your benchmark involves the above web load over the loopback, that could be the source of what you're seeing. If it's not loopback traffic, then that's not the source of the problem. You might try fiddling with kern.sched.ipiwakeup.enabled and see what the effect is, btw -- this controls whether or not the scheduler wakes up another idle CPU to run a thread when waking up that thread, rather than queuing it to run, which may occur on the other CPU at the next clock tick. 
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Sun Nov 26 18:09:31 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 01CF516A511 for ; Sun, 26 Nov 2006 18:09:31 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 70199441DF for ; Sun, 26 Nov 2006 18:02:49 +0000 (GMT) (envelope-from freebsd-arch@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1GoOKA-00004q-M2 for freebsd-arch@freebsd.org; Sun, 26 Nov 2006 19:01:54 +0100 Received: from 89-172-50-96.adsl.net.t-com.hr ([89.172.50.96]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Sun, 26 Nov 2006 19:01:54 +0100 Received: from ivoras by 89-172-50-96.adsl.net.t-com.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Sun, 26 Nov 2006 19:01:54 +0100 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Ivan Voras Date: Sun, 26 Nov 2006 19:01:44 +0100 Lines: 19 Message-ID: References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: 89-172-50-96.adsl.net.t-com.hr User-Agent: Thunderbird 1.5.0.8 (Windows/20061025) In-Reply-To: <20061126174041.V83346@fledge.watson.org> Sender: news Subject: Re: What is the PREEMPTION option good for? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Nov 2006 18:09:31 -0000 Robert Watson wrote: > There's a known performance regression with PREEMPTION and loopback > network traffic on UP or UP-like systems due to a poor series of context > switches occuring in the network stack. If your benchmark involves the > above web load over the loopback, that could be the source of what > you're seeing. If it's not loopback traffic, then that's not the source > of the problem. The dynamic stuff is accessing the database (fairly intensively) over the loopback. > You might try fiddling with kern.sched.ipiwakeup.enabled and see what > the effect is, btw -- this controls whether or not the scheduler wakes > up another idle CPU to run a thread when waking up that thread, rather > than queuing it to run which may occur on the other CPU at the next > clock tick. Try this with or without PREEMPTION? 
From owner-freebsd-arch@FreeBSD.ORG Tue Nov 28 14:23:51 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8EA4716A407 for ; Tue, 28 Nov 2006 14:23:51 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 864CB43CA0 for ; Tue, 28 Nov 2006 14:23:48 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id D0A4A46E24; Tue, 28 Nov 2006 09:23:50 -0500 (EST) Date: Tue, 28 Nov 2006 14:23:50 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Ivan Voras In-Reply-To: Message-ID: <20061128142218.P44465@fledge.watson.org> References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: What is the PREEMPTION option good for? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Nov 2006 14:23:51 -0000 On Sun, 26 Nov 2006, Ivan Voras wrote: > Robert Watson wrote: > >> There's a known performance regression with PREEMPTION and loopback network >> traffic on UP or UP-like systems due to a poor series of context switches >> occuring in the network stack. If your benchmark involves the above web >> load over the loopback, that could be the source of what you're seeing. >> If it's not loopback traffic, then that's not the source of the problem. > > The dynamic stuff is accessing the database (fairly intensively) over the > loopback. 
This may be significantly affected by preemption then. >> You might try fiddling with kern.sched.ipiwakeup.enabled and see what the >> effect is, btw -- this controls whether or not the scheduler wakes up >> another idle CPU to run a thread when waking up that thread, rather than >> queuing it to run which may occur on the other CPU at the next clock tick. > > Try this with or without PREEMPTION? They're independent twiddles, and can be frobbed separately. If you can easily measure performance in the different configurations, seeing a table of permutations and results would be very nice to see what happens :-). Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Tue Nov 28 21:40:25 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7184216A508 for ; Tue, 28 Nov 2006 21:40:25 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.FreeBSD.org (Postfix) with ESMTP id BCFA443CB1 for ; Tue, 28 Nov 2006 21:39:25 +0000 (GMT) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id kASLdMSd015015; Tue, 28 Nov 2006 16:39:23 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Tue, 28 Nov 2006 16:31:18 -0500 User-Agent: KMail/1.9.1 References: <7105.1163451221@critter.freebsd.dk> In-Reply-To: <7105.1163451221@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200611281631.19224.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Tue, 28 Nov 
2006 16:39:23 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2255/Tue Nov 28 11:52:00 2006 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Poul-Henning Kamp Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Nov 2006 21:40:25 -0000 On Monday 13 November 2006 15:53, Poul-Henning Kamp wrote: > The proposed API > ---------------- > > tick_t XXX_ns_tick(unsigned nsec, unsigned *low, unsigned *high); > Calculate the tick value for a given timeout. > Optionally return theoretical lower and upper limits to > actual value, > > tick_t XXX_s_tick(unsigned seconds) > Calculate the tick value for a given timeout. > > The point behind these two functions is that we do not want to > incur a scaling operation at every arming of a callout. Very > few callouts use varying timeouts (and for those, no avoidance > is possible), but for the rest, precalculating the correct > (opaque) number is a good optimization. One note and one question. First, the note. I was planning on rototilling our sleep() APIs to 1) handle multiple locking primitives, and 2) use explicit timescales rather than hz. I had intended on using microseconds with a negative value indicating a relative timeout (so an 'uptime' timeout, i.e. trigger X us from now) and a positive value indicating an absolute timeout (time_t-ish, and subject to ntp changes). Partly because (IIRC) Windows does something similar (negative: relative, positive: absolute, and in microseconds too IIRC) and Darwin as well. Part of the idea was to fix places that abused tsleep(..., 1), etc. to figure out a "real" sleep interval. 
With your proposal, I would probably change the various sleep routines to take a tick_t instead. That leads me to my question: would you want to support the notion of absolute vs relative timeouts? Also, my other API change I was going to do was something like this: msleep() -> mtx_sleep() msleep_spin() -> sl_sleep() (or some such, was talking with ups@ at BSDCan about divorcing spin locks from mutexes altogether, including a separate API namespace, since it's practically already separate as it is) new functions such as: rw_sleep(), sx_sleep() (ZFS wants this I think), but this is rather secondary. I'd just rather get the pain and suffering over all at once. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Nov 28 22:54:25 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 56A5516A500; Tue, 28 Nov 2006 22:54:25 +0000 (UTC) (envelope-from prvs=julian=480af87b9@elischer.org) Received: from a50.ironport.com (a50.ironport.com [63.251.108.112]) by mx1.FreeBSD.org (Postfix) with ESMTP id EB23D43CA1; Tue, 28 Nov 2006 22:54:19 +0000 (GMT) (envelope-from prvs=julian=480af87b9@elischer.org) Received: from unknown (HELO [10.251.18.229]) ([10.251.18.229]) by a50.ironport.com with ESMTP; 28 Nov 2006 14:54:25 -0800 Message-ID: <456CBE20.4010902@elischer.org> Date: Tue, 28 Nov 2006 14:54:24 -0800 From: Julian Elischer User-Agent: Thunderbird 1.5.0.8 (Macintosh/20061025) MIME-Version: 1.0 To: John Baldwin References: <7105.1163451221@critter.freebsd.dk> <200611281631.19224.jhb@freebsd.org> In-Reply-To: <200611281631.19224.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related 
to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Nov 2006 22:54:25 -0000 John Baldwin wrote: > On Monday 13 November 2006 15:53, Poul-Henning Kamp wrote: >> The proposed API >> ---------------- >> >> tick_t XXX_ns_tick(unsigned nsec, unsigned *low, unsigned *high); >> Calculate the tick value for a given timeout. >> Optionally return theoretical lower and upper limits to >> actual value, >> >> tick_t XXX_s_tick(unsigned seconds) >> Calculate the tick value for a given timeout. >> >> The point behind these two functions is that we do not want to >> incur a scaling operation at every arming of a callout. Very >> few callouts use varying timeouts (and for those, no avoidance >> is possible), but for the rest, precalculating the correct >> (opaque) number is a good optimization. > > One note and one question. First, the note. I was planning on rototilling > our sleep() APIs to 1) handle multiple locking primitives, and 2) use > explicit timescales rather than hz. I had intended on using microseconds > with a negative value indicating a relative timeout (so an 'uptime' timeout, > i.e. trigger X us from now) and a positive value indicating an absolute > timeout (time_t-ish, and subject to ntp changes). Partly because (IIRC) > Windows does something similar (negative: relative, positive: absolute, and > in microseconds too IIRC) and Darwin as well. Part of the idea was to fix > places that abused tsleep(..., 1), etc. to figure out a "real" sleep > interval. With your proposal, I would probably change the various sleep > routines to take a tick_t instead. That leads me to my question: would you > want to support the notion of absolute vs relative timeouts? 
> > Also, my other API change I was going to do was something like this: > > msleep() -> mtx_sleep() > > msleep_spin() -> sl_sleep() (or some such, was talking with ups@ at BSDCan > about divorcing spin locks from mutexes altogether, including a separate API > namespace, since it's practically already separate as it is) I've mentionned several times that I think making both the spinlock and Mutex code use the same mutex structure is a problem because one cannot do any run-time checking to ensure that the correct call is being used.. one must instead do runtime checking which is slower and may not hit all unusual cases (except in unusual circumstances). > > new functions such as: rw_sleep(), sx_sleep() (ZFS wants this I think), but > this is rather secondary. I'd just rather get the pain and suffering over > all at once. > From owner-freebsd-arch@FreeBSD.ORG Tue Nov 28 23:04:10 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6F2CB16A403; Tue, 28 Nov 2006 23:04:10 +0000 (UTC) (envelope-from prvs=julian=480af87b9@elischer.org) Received: from a50.ironport.com (a50.ironport.com [63.251.108.112]) by mx1.FreeBSD.org (Postfix) with ESMTP id B4FF343CA0; Tue, 28 Nov 2006 23:04:04 +0000 (GMT) (envelope-from prvs=julian=480af87b9@elischer.org) Received: from unknown (HELO [10.251.18.229]) ([10.251.18.229]) by a50.ironport.com with ESMTP; 28 Nov 2006 15:04:10 -0800 Message-ID: <456CC069.3040107@elischer.org> Date: Tue, 28 Nov 2006 15:04:09 -0800 From: Julian Elischer User-Agent: Thunderbird 1.5.0.8 (Macintosh/20061025) MIME-Version: 1.0 To: Julian Elischer References: <7105.1163451221@critter.freebsd.dk> <200611281631.19224.jhb@freebsd.org> <456CBE20.4010902@elischer.org> In-Reply-To: <456CBE20.4010902@elischer.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Poul-Henning 
Kamp , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Nov 2006 23:04:10 -0000 Julian Elischer wrote: > John Baldwin wrote: > I've mentionned several times that I think making both the spinlock and > Mutex code use the same mutex structure is a problem because one cannot > do any run-time checking to ensure that the correct call is being used.. s/run-time/compile-time/ > one must instead do runtime checking which is slower and may not hit all > unusual cases (except in unusual circumstances). > From owner-freebsd-arch@FreeBSD.ORG Tue Nov 28 23:28:46 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 471D216A403; Tue, 28 Nov 2006 23:28:46 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 37FF343CAA; Tue, 28 Nov 2006 23:28:39 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id A260A170C5; Tue, 28 Nov 2006 23:28:43 +0000 (UTC) To: John Baldwin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Tue, 28 Nov 2006 16:31:18 EST." 
<200611281631.19224.jhb@freebsd.org> Date: Tue, 28 Nov 2006 23:28:41 +0000 Message-ID: <6194.1164756521@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Nov 2006 23:28:46 -0000 In message <200611281631.19224.jhb@freebsd.org>, John Baldwin writes: John, I would very much welcome your participation on this. On the absolute vs relative time thing, this gets far nastier once you start to think about it. As far as I know, nothing in the kernel asks for sleeps until a given wall-clock (UTC) time. Userland on the other hand often does, and almost never should, but let's leave that behind for a moment. [1] Suspend/resume is a tricky complication here. Some sleeps and callouts want to sleep on the "while the CPU is conscious" timescale, for instance for pushing dirty pages to disk or collecting usage statistics. Others want to sleep on the absolute (TAI) timescale, such as TCP retransmission and keepalive timeouts. (The indicative internal/external distinction is not safe btw.) Right now we don't distinguish between the two cases, and my intention was to leave this for a later stage where we could add flag-bits to signal these desires, once a survey of the kernel code had revealed which is the sensible default. We can of course add the flags as no-ops already now where this is immediately obvious to us. >Part of the idea was to fix >places that abused tsleep(..., 1), etc. to figure out a "real" sleep >interval. This is going to be the major pain in the transition, no matter what we do. Pretty much all short sleep and callout durations are bogus because of the traditional rounding(-up) and HZ granularity. 
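To make the rounding point concrete, here is a simplified model of converting a requested duration to whole ticks by rounding up. The function names are hypothetical, and this is not the kernel's actual conversion routine, which has its own additional adjustments; it only illustrates why a very short request turns into a full tick.

```c
#include <stdint.h>

/*
 * Simplified model (an assumption, not kernel code): round a duration in
 * microseconds up to a whole number of ticks at the given hz.
 * Overflow for very large durations is ignored in this sketch.
 */
static uint64_t
usec_to_ticks(uint64_t usec, unsigned hz)
{
	/* Integer form of ceil(usec * hz / 1000000). */
	return ((usec * hz + 999999) / 1000000);
}

/* Duration actually represented by a number of ticks, in microseconds. */
static uint64_t
ticks_to_usec(uint64_t ticks, unsigned hz)
{
	return (ticks * 1000000 / hz);
}
```

With hz = 100, usec_to_ticks(1, 100) is 1 tick, which ticks_to_usec() reports as 10000 microseconds, so a caller that asked for 1 us gets a sleep four orders of magnitude longer.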
>Also, my other API change I was going to do was something like this: > >msleep() -> mtx_sleep() >msleep_spin() -> sl_sleep() [...] >rw_sleep(), sx_sleep() [...] I think this sounds eminently sensible, even if we initially do just the crude thing, getting it expressed in the API allows us to improve the implementation later on. Poul-Henning [1] OK, couldn't resist: Much of this trouble comes about because it used to be that only the UTC clock was available, and programs haven't been rewritten to use CLOCK_MONOTONIC where they should. Examples of bogus behaviour: Named(8) wants to time zones out on the TAI scale not the UTC scale, so it should not be affected by NTPD stepping the clock but only the uptime of the system. Any amount of time the system is suspended should be tolled on the timer. Xlock suffers from the same and gets terribly upset when NTPD steps the clock. Various reminder tools want to sleep until a given UTC time, but end up sleeping the relative time we estimate until that time when they go to sleep. If NTPD steps the clock while they sleep, they do not find out and the reminder gets fired at the wrong time. (Hint: Don't entrust calendar(8) with remembering your marriage anniversary). NTPD on the other hand, needs to know about suspend/resume so it can DTRT to the clock and doesn't get told so it totally makes a mess of things. One conclusion I've reached is that the kernel should issue a SIGTIMEWARP to all processes whenever there is a UTC clock discontinuity. It's been suggested that devd(8) should do this but I think it is a kernel task. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
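For the userland side of footnote [1], a minimal sketch of measuring an interval with CLOCK_MONOTONIC, so that a step of the UTC clock does not distort it. The helper names are hypothetical; clock_gettime() and CLOCK_MONOTONIC are the standard POSIX interfaces. How the monotonic clock behaves across suspend/resume is system-dependent and is not addressed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read the monotonic clock; returns 0 on success, -1 on failure. */
static int
monotonic_now(struct timespec *ts)
{
	return (clock_gettime(CLOCK_MONOTONIC, ts));
}

/* Nanoseconds elapsed from *start to *end (negative if end is earlier). */
static int64_t
timespec_diff_ns(const struct timespec *start, const struct timespec *end)
{
	return ((int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL +
	    (int64_t)(end->tv_nsec - start->tv_nsec));
}
```

A program that wants "N seconds of elapsed time" can take one monotonic_now() reading at the start and compare later readings against it, rather than comparing wall-clock times.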
From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 01:10:25 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8F62816A524; Wed, 29 Nov 2006 01:10:25 +0000 (UTC) (envelope-from rnsanchez@wait4.org) Received: from spunkymail-a15.dreamhost.com (sd-green-bigip-207.dreamhost.com [208.97.132.207]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5335C43CAD; Wed, 29 Nov 2006 01:10:10 +0000 (GMT) (envelope-from rnsanchez@wait4.org) Received: from sauron.lan.box (unknown [200.203.28.73]) by spunkymail-a15.dreamhost.com (Postfix) with ESMTP id 288027F046; Tue, 28 Nov 2006 17:10:13 -0800 (PST) Date: Tue, 28 Nov 2006 23:10:10 -0200 From: Ricardo Nabinger Sanchez To: John Baldwin Message-Id: <20061128231010.cbdc4e1d.rnsanchez@wait4.org> In-Reply-To: <200611281631.19224.jhb@freebsd.org> References: <7105.1163451221@critter.freebsd.dk> <200611281631.19224.jhb@freebsd.org> Organization: SYS_WAIT4 X-Mailer: Sylpheed version 2.3.0beta2 (GTK+ 2.10.6; i386-portbld-freebsd6.1) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 01:10:25 -0000 On Tue, 28 Nov 2006 16:31:18 -0500 John Baldwin wrote: > I had intended on using microseconds > with a negative value indicating a relative timeout (so an 'uptime' > timeout, i.e. trigger X us from now) and a positive value indicating an > absolute timeout (time_t-ish, and subject to ntp changes). Just some devil's advocate thoughts... 
What are the advantages of encoding some semantic in one or two bits of the argument, instead of passing another word with flags? The obvious cost is one more word on the stack per call, but how many words will the code need to deal with those encoded semantic bits? I can see 2 feasible scenarios: 1- the critical path deals with a single word, but can possibly have more branches and/or operations due to additional manipulation needed (extract the semantic bits and act on them) 2- two words must be popped, so there's the chance of a complete cache-miss penalty, which turns out to be some hundred cycles in every architecture. Also, limiting the resolution to microsecond may leave the kernel in a short blanket situation, considering the future trends (we're almost in the picosecond processor era, although even nanoseconds are used very much, being microseconds the "right"[1] choice). Comments? :) [1] as of 2006. -- Ricardo Nabinger Sanchez Powered by FreeBSD "Left to themselves, things tend to go from bad to worse." From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 10:21:53 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9179816A403; Wed, 29 Nov 2006 10:21:53 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 14BEF43CAC; Wed, 29 Nov 2006 10:21:49 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id 782D1170C5; Wed, 29 Nov 2006 10:21:49 +0000 (UTC) To: Ricardo Nabinger Sanchez From: "Poul-Henning Kamp" In-Reply-To: Your message of "Tue, 28 Nov 2006 23:10:10 -0200." 
<20061128231010.cbdc4e1d.rnsanchez@wait4.org> Date: Wed, 29 Nov 2006 10:21:47 +0000 Message-ID: <8092.1164795707@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 10:21:53 -0000 In message <20061128231010.cbdc4e1d.rnsanchez@wait4.org>, Ricardo Nabinger Sanc hez writes: >On Tue, 28 Nov 2006 16:31:18 -0500 >John Baldwin wrote: > >> I had intended on using microseconds >> with a negative value indicating a relative timeout (so an 'uptime' >> timeout, i.e. trigger X us from now) and a positive value indicating an >> absolute timeout (time_t-ish, and subject to ntp changes). > >Just some devil's advocate thoughts... > >What are the advantages of encoding some semantic in one or two bits of the >argument, instead of passing another word with flags? The bits _will_ go in the flags argument I proposed. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 10:45:34 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3C35016A47B; Wed, 29 Nov 2006 10:45:34 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id 67B3F43CAD; Wed, 29 Nov 2006 10:45:30 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id D461D46DCA; Wed, 29 Nov 2006 05:45:30 -0500 (EST) Date: Wed, 29 Nov 2006 10:45:30 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: John Baldwin In-Reply-To: <200611281631.19224.jhb@freebsd.org> Message-ID: <20061129104205.C95096@fledge.watson.org> References: <7105.1163451221@critter.freebsd.dk> <200611281631.19224.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 10:45:34 -0000 On Tue, 28 Nov 2006, John Baldwin wrote: > One note and one question. First, the note. I was planning on rototilling > our sleep() APIs to 1) handle multiple locking primitives, and 2) use > explicit timescales rather than hz. I had intended on using microseconds > with a negative value indicating a relative timeout (so an 'uptime' timeout, > i.e. trigger X us from now) and a positive value indicating an absolute > timeout (time_t-ish, and subject to ntp changes). 
Partly because (IIRC) > Windows does something similar (negative: relative, positive: absolute, and > in microseconds too IIRC) and Darwin as well. Part of the idea was to fix > places that abused tsleep(..., 1), etc. to figure out a "real" sleep > interval. With your proposal, I would probably change the various sleep > routines to take a tick_t instead. That leads me to my question of whether you > would want to support the notion of absolute vs relative timeouts? I realize that Windows has established something of a convention here, but I would prefer it if we had different APIs for absolute and relative timescales, rather than overloading the signed value. I would instead like to either pass in an unsigned value (giving compile-time checking, especially with gcc4), or pass signed and assert it's > 0 in the relative case (to give runtime checking). We could also generate run-time warnings for absolute times in the past, and so on. Especially if we start to move towards rescheduling callouts in order to reduce the size of the outstanding callout queues (TCP uses 4+ per connection now, and 1 would be a better number), time offset arithmetic is likely to be error prone, and catching these problems sooner rather than later would be good. 
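The split between relative and absolute entry points described above might look roughly like the following. This is only a sketch under stated assumptions: the type usec_t and all three function names are hypothetical and are not taken from FreeBSD or from the thread, and the arithmetic ignores overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical microsecond count; not a FreeBSD type. */
typedef uint64_t usec_t;

/*
 * Relative timeout taking an unsigned delta. With sign-conversion
 * warnings enabled, the compiler can flag callers passing signed values.
 * Returns the expiry time on the same clock as 'now'.
 */
usec_t
timeout_relative(usec_t now, usec_t delta)
{
    return (now + delta);
}

/*
 * Relative timeout taking a signed delta, checked at run time:
 * a zero or negative offset trips the assertion.
 */
usec_t
timeout_relative_checked(usec_t now, int64_t delta)
{
    assert(delta > 0);
    return (now + (usec_t)delta);
}

/*
 * Absolute timeout check: returns 1 when the requested time is already
 * in the past, which is where a run-time warning could be emitted.
 */
int
timeout_absolute_is_past(usec_t now, usec_t when)
{
    return (when < now);
}
```

With separate functions, a negative offset can no longer be mistaken for an absolute time, and each kind of mistake is caught at its own entry point.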
Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 16:10:50 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 187A716A416 for ; Wed, 29 Nov 2006 16:10:50 +0000 (UTC) (envelope-from rnsanchez@wait4.org) Received: from spunkymail-a18.dreamhost.com (sd-green-bigip-211.dreamhost.com [208.97.132.211]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6035243CC9 for ; Wed, 29 Nov 2006 16:10:39 +0000 (GMT) (envelope-from rnsanchez@wait4.org) Received: from sauron.lan.box (unknown [200.180.183.74]) by spunkymail-a18.dreamhost.com (Postfix) with ESMTP id 224D95B522; Wed, 29 Nov 2006 08:10:34 -0800 (PST) Date: Wed, 29 Nov 2006 14:10:27 -0200 From: Ricardo Nabinger Sanchez To: "Poul-Henning Kamp" Message-Id: <20061129141027.5bd71945.rnsanchez@wait4.org> In-Reply-To: <8092.1164795707@critter.freebsd.dk> References: <20061128231010.cbdc4e1d.rnsanchez@wait4.org> <8092.1164795707@critter.freebsd.dk> Organization: SYS_WAIT4 X-Mailer: Sylpheed version 2.3.0beta2 (GTK+ 2.10.6; i386-portbld-freebsd6.1) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 16:10:50 -0000 On Wed, 29 Nov 2006 10:21:47 +0000 "Poul-Henning Kamp" wrote: > >> I had intended on using microseconds > >> with a negative value indicating a relative timeout (so an 'uptime' > >> timeout, i.e. trigger X us from now) and a positive value indicating an > >> absolute timeout (time_t-ish, and subject to ntp changes). > > > >Just some devil's advocate thoughts... 
> > > >What are the advantages of encoding some semantic in one or two bits of the > >argument, instead of passing another word with flags? > > The bits _will_ go in the flags argument I proposed. Yes, I recall that from your first message about the proposed API. What confused me was that John explained a +useconds/-useconds encoding, and this microsecond bounding is what concerned me. It seems to me that it's too tight, given that 10G-E may become popular soon (in 5 years or less). If I understood your (phk) proposal, it is tick-ready, which I think is the natural way to go in order to handle high-precision events. My concern is that microsecond resolution (if I understood John's proposal correctly) may seem right for now, but what about in 5 years? Promoting it to nanoseconds (also using a 64-bit word) seems likely to happen, with some breakage. Also, regarding your encoding proposal, I think my argument in the previous message isn't valid, as the bit is used to determine the scale of the timeout in the remaining bits. It looks fine to me, as both tick-resolution and second-resolution (or longer) events are representable, and it will probably remain applicable for a good while (I don't have any idea of how long, but it feels like a lot). -- Ricardo Nabinger Sanchez Powered by FreeBSD "Left to themselves, things tend to go from bad to worse." 
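For scale, the range side of the microsecond-versus-nanosecond question can be worked out directly. The sketch below assumes a signed 64-bit count (an assumption made for illustration; this part of the thread does not fix the width of the timeout argument) and a 365.25-day year. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: whole years representable by a signed 64-bit
 * counter whose unit occurs units_per_sec times per second.
 */
uint64_t
years_of_range(uint64_t units_per_sec)
{
    const uint64_t secs_per_year = 31557600ULL;    /* 365.25 days */

    assert(units_per_sec > 0);    /* avoid dividing by zero */
    return ((uint64_t)INT64_MAX / units_per_sec / secs_per_year);
}
```

Under these assumptions, years_of_range(1000000000ULL) gives 292 years for nanoseconds and years_of_range(1000000ULL) gives 292271 years for microseconds, so the choice between the two is about resolution rather than about running out of range.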
From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 19:24:30 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1A4C316A47B for ; Wed, 29 Nov 2006 19:24:30 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7874943CA8 for ; Wed, 29 Nov 2006 19:24:26 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kATJOTPp047307; Wed, 29 Nov 2006 11:24:29 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kATJOMmR047302; Wed, 29 Nov 2006 11:24:22 -0800 (PST) Date: Wed, 29 Nov 2006 11:24:22 -0800 (PST) From: Matthew Dillon Message-Id: <200611291924.kATJOMmR047302@apollo.backplane.com> To: Ricardo Nabinger Sanchez References: <20061128231010.cbdc4e1d.rnsanchez@wait4.org> <8092.1164795707@critter.freebsd.dk> <20061129141027.5bd71945.rnsanchez@wait4.org> Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 19:24:30 -0000 Since nearly all callout_reset() calls use the same relative timeout as previously, it seems rather polluting to expose the low level tick calculations in the API. I'll bet if you just *CACHE* the last translation it would be sufficient to optimize your callout paths: callout_reset(...) { if (to_ticks == c->last_to_ticks) ... use c->last_to_translated_ticks; else ... recalculate ... } Insofar as math overhead goes... 
well, if you REALLY want to make things optimal you need to get rid of all those mutex operations you are doing in the low level callwheel code. I would recommend doing what we did, which is to make the call wheels per-cpu and to issue the callout on the same cpu it was registered on. Now, granted, DragonFly uses a more cpu-localized design, particularly for network operations (which are the vast majority of callout operations in the system). But you should really consider it. A cpu-localized design replaces all mutexes and spinlocks in the implementation with a simple critical section. Cross-cpu operations use IPI messages (which, in DragonFly, very rarely occur since all the callout users are cpu-localized). But assuming you deal with that issue in your network stacks, OTHER uses of the callout API are well served by a cpu-localized model. Because re-arming usually occurs FROM the callout callback procedure, which itself is cpu-localized by the callout implementation, you again wind up being able to use just a critical section and no mutexes or spin locks. One mutex or spinlock is worth half a dozen math operations. Even if the locked bus cycle memory location is already owned by the calling cpu you still wind up flushing the cpu's read and write pipeline, and that is really nasty at the beginning of a procedure when the caller of the procedure has just pushed a bunch of arguments onto the stack. There is virtually no cache overhead in handling the callwheel due to the burstiness effect of the slots, in particular when handling TCP connections in bulk. There is so much locality of reference there that for all intents and purposes callout_reset() becomes FREE if you can just get rid of the mutexes. In any case, network operations are a bad place to use fine-grained timeouts. It just doesn't work well... 
for example, using a TCP retry timeout in the microsecond range almost guarantees a ton of false hits due to cpu latency in handling the timeout on a heavily loaded system. You need wiggle room and lost packets just aren't an issue on LANs. Similarly if you want to change tsleep to use a fine-grained value, the same rule applies... when tsleep is called with a timeout it is almost always called with the same timeout. But nearly all uses of tsleep are insensitive to the granularity of the timeout, and most remaining uses are not in critical code paths (e.g. a device driver that is resetting some low level hardware interface or something), so it is questionable whether changing the API would reap any visible reward. There are a few places where a fine-grained timer is really useful, in particular a periodic fine-grained timer. But don't try to do it with the callout API. I recommend taking a look at our SYSTIMER API. We use it to drive interface polling, the scheduler, the stat clock, the hardclock, and to rate-limit interrupts. 
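The caching idea from the callout_reset() pseudocode earlier in this message can be written out as a small, self-contained sketch. The field names last_to_ticks and last_to_translated_ticks come from that pseudocode; the struct name, the valid and recalcs fields, and the stand-in translate_ticks() conversion (a plain multiplication) are assumptions made for illustration and do not reflect the real FreeBSD or DragonFly code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-callout state holding the cached translation. */
struct callout_sketch {
    int      valid;                    /* nonzero once the cache is filled */
    uint64_t last_to_ticks;            /* last timeout the caller passed */
    uint64_t last_to_translated_ticks; /* cached translation of that value */
    unsigned recalcs;                  /* slow-path count, for illustration */
};

/* Stand-in for the costly conversion; the factor is an assumption. */
uint64_t
translate_ticks(uint64_t to_ticks)
{
    return (to_ticks * 1000);
}

/* Recalculate only when the requested timeout differs from the last one. */
uint64_t
callout_translate_cached(struct callout_sketch *c, uint64_t to_ticks)
{
    assert(c != NULL);
    if (!c->valid || to_ticks != c->last_to_ticks) {
        c->last_to_ticks = to_ticks;
        c->last_to_translated_ticks = translate_ticks(to_ticks);
        c->valid = 1;
        c->recalcs++;
    }
    return (c->last_to_translated_ticks);
}
```

A callout that is re-armed repeatedly with the same timeout takes the slow path once and the cached path afterwards, which is the case the message argues is by far the most common.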
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 19:29:36 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 65E3916A412; Wed, 29 Nov 2006 19:29:36 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6F70443C9D; Wed, 29 Nov 2006 19:29:32 +0000 (GMT) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id kATJTJuD093331; Wed, 29 Nov 2006 14:29:30 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: Robert Watson Date: Wed, 29 Nov 2006 13:46:00 -0500 User-Agent: KMail/1.9.1 References: <7105.1163451221@critter.freebsd.dk> <200611281631.19224.jhb@freebsd.org> <20061129104205.C95096@fledge.watson.org> In-Reply-To: <20061129104205.C95096@fledge.watson.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200611291346.01246.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Wed, 29 Nov 2006 14:29:31 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2258/Wed Nov 29 07:04:15 2006 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Poul-Henning Kamp , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Wed, 29 Nov 2006 19:29:36 -0000 On Wednesday 29 November 2006 05:45, Robert Watson wrote: > On Tue, 28 Nov 2006, John Baldwin wrote: > > > One note and one question. First, the note. I was planning on rototilling > > our sleep() APIs to 1) handle multiple locking primitives, and 2) use > > explicit timescales rather than hz. I had intended on using microseconds > > with a negative value indicating a relative timeout (so an 'uptime' timeout, > > i.e. trigger X us from now) and a positive value indicating an absolute > > timeout (time_t-ish, and subject to ntp changes). Partly because (IIRC) > > Windows does something similar (negative: relative, positive: absolute, and > > in microseconds too IIRC) and Darwin as well. Part of the idea was to fix > > places that abused tsleep(..., 1), etc. to figure out a "real" sleep > > interval. With your proposal, I would probably change the various sleep > > routines to take a tick_t instead. That leads me to my question of whether you > > would want to support the notion of absolute vs relative timeouts? > > I realize that Windows has established something of a convention here, but I > would prefer it if we had different APIs for absolute and relative timescales, > rather than overloading the signed value. I would instead like to either pass > in an unsigned value (giving compile-time checking, especially with gcc4), or > pass signed and assert it's > 0 in the relative case (to give runtime > checking). We could also generate run-time warnings for absolute times in the > past, and so on. Especially if we start to move towards rescheduling callouts > in order to reduce the size of the outstanding callout queues (TCP uses 4+ per > connection now, and 1 would be a better number), time offset arithmetic is > likely to be error prone, and catching these problems sooner rather than later > would be good. Different APIs would be fine. IIRC, that's how Darwin does it. 
With the tick_t idea, you could easily have: tick_t relative_wakeup(ulong nsec) tick_t absolute_wakeup(struct timeval *tv) (or something else, etc.) Doing it that way would let us stay as we are for now (just supporting the relative "uptime" timeouts) and investigate whether or not we want walltime timeouts (such as for TCP as Poul-Henning mentioned). I like tick_t, I just want to make sure we change foosleep() to use it as well, and wanted to raise the idea of relative vs absolute deadlines. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 19:32:21 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0333516A403 for ; Wed, 29 Nov 2006 19:32:21 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6006A43C9D for ; Wed, 29 Nov 2006 19:32:17 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id DB6FB170C7; Wed, 29 Nov 2006 19:32:18 +0000 (UTC) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 11:24:22 PST." 
<200611291924.kATJOMmR047302@apollo.backplane.com> Date: Wed, 29 Nov 2006 19:32:16 +0000 Message-ID: <10752.1164828736@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Ricardo Nabinger Sanchez , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 19:32:21 -0000 In message <200611291924.kATJOMmR047302@apollo.backplane.com>, Matthew Dillon w rites: Your input has been noted to the extent it is relevant. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 19:45:50 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6F6A616A407; Wed, 29 Nov 2006 19:45:50 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id C18B543CA3; Wed, 29 Nov 2006 19:45:46 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id 7F996170C5; Wed, 29 Nov 2006 19:45:48 +0000 (UTC) To: John Baldwin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 13:46:00 EST." 
<200611291346.01246.jhb@freebsd.org> Date: Wed, 29 Nov 2006 19:45:46 +0000 Message-ID: <10814.1164829546@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Robert Watson , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 19:45:50 -0000 In message <200611291346.01246.jhb@freebsd.org>, John Baldwin writes: >Different APIs would be fine. IIRC, that's how Darwin does it. With the >tick_t idea, you could easily have: > >tick_t relative_wakeup(ulong nsec) >tick_t absolute_wakeup(struct timeval *tv) (or something else, etc.) I really do not want to encode the rel/abs aspect in the tick_t. I want the flags that are passed in to state directly which kind of behaviour the code wants. >walltime timeouts (such as for TCP as Poul-Henning mentioned). I like tick_t, >I just want to make sure we change foosleep() to use it as well, and wanted to >raise the idea of relative vs absolute deadlines. Agreed, foosleep() should take tick_t as well. I propose you and I write up the new API in detail and then present that document here on arch@ at a later date. Is that OK with you? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
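As a rough illustration of keeping the relative/absolute choice in a flags argument rather than in the time value itself, consider the sketch below. Everything in it is an assumption: the representation of tick_t is not given in this part of the thread (a plain 64-bit integer is used here), and the flag name C_SKETCH_ABSOLUTE and the function callout_deadline() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed representation; the thread does not define tick_t here. */
typedef uint64_t tick_t;

/* Hypothetical flag: interpret the time value as an absolute deadline. */
#define C_SKETCH_ABSOLUTE 0x1

/*
 * Compute the deadline for a timeout. The time value carries no
 * rel/abs information of its own; the caller states its intent in flags.
 */
tick_t
callout_deadline(tick_t now, tick_t t, int flags)
{
    assert((flags & ~C_SKETCH_ABSOLUTE) == 0);    /* reject unknown flags */
    if (flags & C_SKETCH_ABSOLUTE)
        return (t);
    return (now + t);
}
```

Here callout_deadline(100, 30, 0) yields 130, while callout_deadline(100, 30, C_SKETCH_ABSOLUTE) yields 30, so the same tick_t value means different things only because the caller said so explicitly.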
From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:18:03 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1018A16A49E for ; Wed, 29 Nov 2006 21:18:03 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 0119B43DF3 for ; Wed, 29 Nov 2006 21:15:51 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kATLFtEm047972; Wed, 29 Nov 2006 13:15:55 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kATLFlxd047970; Wed, 29 Nov 2006 13:15:47 -0800 (PST) Date: Wed, 29 Nov 2006 13:15:47 -0800 (PST) From: Matthew Dillon Message-Id: <200611292115.kATLFlxd047970@apollo.backplane.com> To: "Poul-Henning Kamp" References: <10752.1164828736@critter.freebsd.dk> Cc: Ricardo Nabinger Sanchez , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:18:03 -0000 : :In message <200611291924.kATJOMmR047302@apollo.backplane.com>, Matthew Dillon w :rites: : :Your input has been noted to the extent it is relevant. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 Now now Poul, if you don't have anything nice to say.... try not to act like a stuck up pig. Oops! Did I say something bad? 
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:24:03 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A1FC516A412 for ; Wed, 29 Nov 2006 21:24:03 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6FB8D43CB4 for ; Wed, 29 Nov 2006 21:23:29 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id 852A8170C6; Wed, 29 Nov 2006 21:23:31 +0000 (UTC) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 13:15:47 PST." <200611292115.kATLFlxd047970@apollo.backplane.com> Date: Wed, 29 Nov 2006 21:23:29 +0000 Message-ID: <11392.1164835409@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Ricardo Nabinger Sanchez , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:24:03 -0000 In message <200611292115.kATLFlxd047970@apollo.backplane.com>, Matthew Dillon writes: >:Your input has been noted to the extent it is relevant. > > Now now Poul, if you don't have anything nice to say.... try not to act > like a stuck up pig. Oops! Did I say something bad? My qualification was only a reflection on the fact that you obviously had not read the first part of the thread and therefore did not seem to take into account the changes proposed initially. 
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:31:10 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4EF5816A416 for ; Wed, 29 Nov 2006 21:31:10 +0000 (UTC) (envelope-from apanqasem@gmail.com) Received: from wx-out-0506.google.com (wx-out-0506.google.com [66.249.82.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id 80C3B43CB6 for ; Wed, 29 Nov 2006 21:30:44 +0000 (GMT) (envelope-from apanqasem@gmail.com) Received: by wx-out-0506.google.com with SMTP id s18so2258341wxc for ; Wed, 29 Nov 2006 13:30:47 -0800 (PST) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:mime-version:to:message-id:content-type:from:subject:date:x-mailer; b=IBq8Is8o3bvuplHG3ph4moi53VGFweVtGa6x66YzO3LJjPM8iGanFcdbeVpsDsL3pX47nw8cdfLaNG/QF4ESvVVoSC4dy5dA3UUem0WA3qXGuMCz2zKx21Qn60XxWlUpEWd0DK2ICnlLbu8rygcFwSG4ZMO9dUZ1adAHSW4dabM= Received: by 10.70.44.4 with SMTP id r4mr4833958wxr.1164835847543; Wed, 29 Nov 2006 13:30:47 -0800 (PST) Received: from ?128.42.2.60? 
( [128.42.2.60]) by mx.google.com with ESMTP id i15sm28164735wxd.2006.11.29.13.30.47; Wed, 29 Nov 2006 13:30:47 -0800 (PST) Mime-Version: 1.0 (Apple Message framework v752.2) To: freebsd-arch@freebsd.org Message-Id: <49B82C5A-7A21-40EF-A717-3320B249FE01@gmail.com> From: Apan Qasem Date: Wed, 29 Nov 2006 15:31:21 -0600 X-Mailer: Apple Mail (2.752.2) Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: Page Coloring in FreeBSD X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:31:10 -0000 Does the FreeBSD kernel employ a page coloring algorithm? If so, where can I find information on the heuristics used to color pages? I have looked at Matthew Dillon's description at http://people.freebsd.org/~nik/article.html-text I was hoping to find something more detailed. Thanks. 
- Apan From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:49:04 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 51D6A16A40F for ; Wed, 29 Nov 2006 21:49:04 +0000 (UTC) (envelope-from ru@rambler-co.ru) Received: from relay0.rambler.ru (relay0.rambler.ru [81.19.66.187]) by mx1.FreeBSD.org (Postfix) with ESMTP id 767174401C for ; Wed, 29 Nov 2006 21:42:46 +0000 (GMT) (envelope-from ru@rambler-co.ru) Received: from relay0.rambler.ru (localhost [127.0.0.1]) by relay0.rambler.ru (Postfix) with ESMTP id 0B2765E46; Thu, 30 Nov 2006 00:42:36 +0300 (MSK) Received: from edoofus.park.rambler.ru (unknown [81.19.65.108]) by relay0.rambler.ru (Postfix) with ESMTP id DC56F5D76; Thu, 30 Nov 2006 00:42:35 +0300 (MSK) Received: (from ru@localhost) by edoofus.park.rambler.ru (8.13.8/8.13.8) id kATLgabs020052; Thu, 30 Nov 2006 00:42:36 +0300 (MSK) (envelope-from ru) Date: Thu, 30 Nov 2006 00:42:36 +0300 From: Ruslan Ermilov To: Apan Qasem Message-ID: <20061129214236.GB20009@rambler-co.ru> References: <49B82C5A-7A21-40EF-A717-3320B249FE01@gmail.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="RASg3xLB4tUQ4RcS" Content-Disposition: inline In-Reply-To: <49B82C5A-7A21-40EF-A717-3320B249FE01@gmail.com> User-Agent: Mutt/1.5.13 (2006-08-11) X-Virus-Scanned: No virus found Cc: freebsd-arch@freebsd.org Subject: Re: Page Coloring in FreeBSD X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:49:04 -0000 --RASg3xLB4tUQ4RcS Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Nov 29, 2006 at 03:31:21PM -0600, Apan 
Qasem wrote: > Does the FreeBSD kernel employ a page coloring algorithm? If so, > where can I find information on the heuristics used to color pages? > > I have looked at Matthew Dillon's description at > http://people.freebsd.org/~nik/article.html-text > > I was hoping to find something more detailed. > http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/page-coloring-optimizations.html -- Ruslan Ermilov ru@FreeBSD.org FreeBSD committer --RASg3xLB4tUQ4RcS Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (FreeBSD) iD8DBQFFbf7MqRfpzJluFF4RAlLzAKCH1zqdqWbjO5t6Ez4XrqTr39mQfQCeP/p4 ms+J9CuRp03b8WnmJ19Qmjg= =9MWz -----END PGP SIGNATURE----- --RASg3xLB4tUQ4RcS-- From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:51:19 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D7F0616A519 for ; Wed, 29 Nov 2006 21:51:19 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id A777E43DAC for ; Wed, 29 Nov 2006 21:47:48 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kATLlqVd048224; Wed, 29 Nov 2006 13:47:52 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kATLll4m048223; Wed, 29 Nov 2006 13:47:47 -0800 (PST) Date: Wed, 29 Nov 2006 13:47:47 -0800 (PST) From: Matthew Dillon Message-Id: <200611292147.kATLll4m048223@apollo.backplane.com> To: "Poul-Henning Kamp" References: <11392.1164835409@critter.freebsd.dk> Cc: Ricardo Nabinger Sanchez , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org 
X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:51:20 -0000 : :In message <200611292115.kATLFlxd047970@apollo.backplane.com>, Matthew Dillon writes: : :>:Your input has been noted to the extent it is relevant. :> :> Now now Poul, if you don't have anything nice to say.... try not to act :> like a stuck up pig. Oops! Did I say something bad? : :My qualification was only a reflection on the fact that you obviously :had not read the first part of the thread and therefore did not seem to :take into account the changes proposed initially. : :-- :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 : The difference between you and me, Poul, is that you always try to play cute tricks with words when you intend to insult someone. Me? I just go ahead and insult them explicitly. In any case, I think the relevance of my comments is clear to anyone who has followed the project. Are you guys so stuck up on performance that you are willing to seriously pollute your APIs just to get rid of a few multiplications and divisions? I mean, come on... the callout code is about as close to optimal as it is possible to *GET*. If performance is an issue, it isn't the callout algorithm that's the problem, it's all the pollution that has been added to it to make it cpu-agnostic. You don't have to agree with me, but I think the relevance of my remarks is pretty clear. The FreeBSD source already has very serious mutex visibility pollution all throughout the codebase, and now you want to expose your already crazy multi-variable timer ABI to higher levels as well? Hell, people are still reporting calcru warnings and panics and problems after years! Maybe you should consider fixing those once and for all first. If you insist, I'll address your original points one at a time: :1. We need better resolution than a periodic "hz" clock can give us. 
: Highspeed networking, gaming servers and other real-time apps want
: this.
:
:2. We "pollute" our call-wheel with tons of callouts that we know are
: unlikely to happen.

The callout algorithm was designed to make this 'pollution' optimal. And it is optimal both from the point of view of the callwheel design and from the point of view of cache locality of reference.

The problem isn't the callwheel, it's the fact that all this additional mutex junk has been wrapped around the code to make it cpu-agnostic and MP-safe, requiring the callout code to dip into its mutex protected portions multiple times to execute a single operation (aka callout callback, then callout_reset()). There are performance problems here, but it's with the wrappers around the callout code, not with the code itself.

:3. We have many operations on the callout wheel because certain
: callouts get rearmed for later in the future. (TCP keepalives).
:
:4. We execute all callouts on one CPU only.

Well, interesting... that's awful. Maybe, say, a PER-CPU callout design would solve that little problem? Sounds like it would kill two birds with one stone, especially if you are still deep-stacking your TCP protocol stacks from the interface interrupt.

If you are going to associate interrupts with cpu's, then all related protocol operations could also be associated with those same cpu's, in PARTICULAR the callout operations. That would automatically give you a critical-section interlock and you wouldn't have to use mutexes to interlock the callout and the TCP stack.

:5. Most of the specified timeouts are bogus, because of the imprecision
: inherent in the current 1/hz method of scheduling them.

If you are talking about TCP, this simply is not the case. In a LAN environment trying to apply timeouts less than a few milliseconds to a TCP protocol stack is just asking for it.
Nobody gives a rat's ass about packet loss in sub-millisecond TCP connections because it is NOT POSSIBLE to have optimal throughput EVEN IF you use fine-grained timers in any such environment where packet loss occurs. A LAN environment that loses packets in such a situation is broken and needs to be fixed.

In WAN environments, where transit times are greater than a few milliseconds, having a fairly coarse-grained timeout for the TCP protocol stack is just not an issue. It really isn't.

I'm wondering whether you are trying to fix issues in bogus contrived protocol tests or whether you are trying to fix issues in the real world here. There's a reason why GigE has hardware flow control.

-Matt
Matthew Dillon

From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:51:20 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CDDAC16A55B for ; Wed, 29 Nov 2006 21:51:20 +0000 (UTC) (envelope-from andre@freebsd.org) Received: from c00l3r.networx.ch (c00l3r.networx.ch [62.48.2.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 28F3A43DEF for ; Wed, 29 Nov 2006 21:48:00 +0000 (GMT) (envelope-from andre@freebsd.org) Received: (qmail 10137 invoked from network); 29 Nov 2006 21:37:54 -0000 Received: from c00l3r.networx.ch (HELO [127.0.0.1]) ([62.48.2.2]) (envelope-sender ) by c00l3r.networx.ch (qmail-ldap-1.03) with SMTP for ; 29 Nov 2006 21:37:54 -0000 Message-ID: <456E0013.7050405@freebsd.org> Date: Wed, 29 Nov 2006 22:48:03 +0100 From: Andre Oppermann User-Agent: Thunderbird 1.5.0.8 (Windows/20061025) MIME-Version: 1.0 To: Apan Qasem References: <49B82C5A-7A21-40EF-A717-3320B249FE01@gmail.com> In-Reply-To: <49B82C5A-7A21-40EF-A717-3320B249FE01@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: Page Coloring in FreeBSD X-BeenThere:
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:51:21 -0000 Apan Qasem wrote: > Does the FreeBSD kernel employ a page coloring algorithm? If so, where > can I find information on the heuristics used to color pages? Page coloring will be removed with the upcoming superpages commit in December. See the superpages discussion over the last two weeks on this mailing list. -- Andre > I have looked at Matthew Dillon's description at > http://people.freebsd.org/~nik/article.html-text > > I was hoping to find something more detailed. > > Thanks. > > - Apan > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > > From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 21:55:34 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F00D216A415; Wed, 29 Nov 2006 21:55:34 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.FreeBSD.org (Postfix) with ESMTP id 55EFC43CA2; Wed, 29 Nov 2006 21:55:30 +0000 (GMT) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id kATLtVHP094423; Wed, 29 Nov 2006 16:55:32 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: "Poul-Henning Kamp" Date: Wed, 29 Nov 2006 16:50:51 -0500 User-Agent: KMail/1.9.1 References: <10814.1164829546@critter.freebsd.dk> In-Reply-To: <10814.1164829546@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: 
text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200611291650.51782.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Wed, 29 Nov 2006 16:55:32 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2259/Wed Nov 29 14:28:42 2006 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.4 required=4.2 tests=ALL_TRUSTED,BAYES_00 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Robert Watson , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 21:55:35 -0000 On Wednesday 29 November 2006 14:45, Poul-Henning Kamp wrote: > In message <200611291346.01246.jhb@freebsd.org>, John Baldwin writes: > > >Different APIs would be fine. IIRC, that's how Darwin does it. With the > >tick_t idea, you could easily have: > > > >tick_t relative_wakeup(ulong nsec) > >tick_t absolute_wakeup(struct timeval *tv) (or something else, etc.) > > I really do not want to encode the rel/abs aspect in the tick_t. > > I want it marked up directly in the flags passed which kind of behaviour > the code wants. Hmm, I guess that depends on what you consider tick_t to be. I was thinking of it as an abstract type for a deadline, and that absolute and relative are sort of like subclasses of that. Doing it that way allows you to defer on absolute times rather than requiring whole new APIs. I assume you mean passing a different flag to msleep(), cv_*(), sema_wait(), lockmgr(), etc. that all take 'int timo'? 
If you allow it to be encoded into the tick_t, then adding support for it just requires a new function to generate a tick_t object and the consuming code has to learn how to handle it, but all the in-between stuff doesn't care and doesn't have to know. > >walltime timeouts (such as for TCP as Poul-Henning mentioned). I like tick_t, > >I just want to make sure we change foosleep() to use it as well, and wanted to > >raise the idea of relative vs absolute deadlines. > > Agreed, foosleep() should take tick_t as well. > > I propose you and I write up the new API in detail and then present > that document here on arch@ at a latter date. > > Is that OK with you ? Sure. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 22:00:28 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EF8ED16A4D4; Wed, 29 Nov 2006 22:00:28 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 29D7543CD2; Wed, 29 Nov 2006 22:00:05 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id 080B6170C5; Wed, 29 Nov 2006 22:00:06 +0000 (UTC) To: John Baldwin From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 16:50:51 EST." 
<200611291650.51782.jhb@freebsd.org> Date: Wed, 29 Nov 2006 22:00:04 +0000 Message-ID: <11587.1164837604@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Robert Watson , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 22:00:29 -0000 In message <200611291650.51782.jhb@freebsd.org>, John Baldwin writes: >> I want it marked up directly in the flags passed which kind of behaviour >> the code wants. > >Hmm, I guess that depends on what you consider tick_t to be. I was thinking >of it as an abstract type for a deadline, and that absolute and relative are >sort of like subclasses of that. I see tick_t only as an opaque measure of time and would prefer to not have modal bits stuck into it because I fear that will make it larger than 32 bits. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 22:01:55 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B209316A40F for ; Wed, 29 Nov 2006 22:01:55 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2A80543CA8 for ; Wed, 29 Nov 2006 22:01:51 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id AD3DC170C5; Wed, 29 Nov 2006 22:01:53 +0000 (UTC) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 13:47:47 PST." <200611292147.kATLll4m048223@apollo.backplane.com> Date: Wed, 29 Nov 2006 22:01:51 +0000 Message-ID: <11606.1164837711@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Ricardo Nabinger Sanchez , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 22:01:55 -0000 In message <200611292147.kATLll4m048223@apollo.backplane.com>, Matthew Dillon w rites: > The difference between you and me, Poul, is that you always try to play > cute tricks with words when you intend to insult someone. Me? I just > go ahead and insult them explicitly. I can do that too: You're a pompous asshole who doesn't know what you're talking about. Now, does that make you feel better ? > In anycase, I think the relevance of my comments is clear to anyone who > has followed the project. Are you guys so stuck up on performance > that you are willing to seriously pollute your APIs just to get rid of > a few multiplications and divisions? 
You have your project and we have ours. You make your choices, we make ours. You have your mailing lists, we have ours. CTRL-D for all I care. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 22:44:02 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id ACEC516A412; Wed, 29 Nov 2006 22:44:02 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id EC8EE43CB4; Wed, 29 Nov 2006 22:43:48 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kATMhmqO048754; Wed, 29 Nov 2006 14:43:52 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kATMhmaY048753; Wed, 29 Nov 2006 14:43:48 -0800 (PST) Date: Wed, 29 Nov 2006 14:43:48 -0800 (PST) From: Matthew Dillon Message-Id: <200611292243.kATMhmaY048753@apollo.backplane.com> To: John Baldwin References: <10814.1164829546@critter.freebsd.dk> <200611291650.51782.jhb@freebsd.org> Cc: Robert Watson , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 22:44:02 -0000 :Hmm, I guess that depends on what you consider tick_t to be. I was thinking :of it as an abstract type for a deadline, and that absolute and relative are :sort of like subclasses of that. 
Doing it that way allows you to defer on
:absolute times rather than requiring whole new APIs. I assume you mean
:passing a different flag to msleep(), cv_*(), sema_wait(), lockmgr(), etc.
:that all take 'int timo'? If you allow it to be encoded into the tick_t,
:then adding support for it just requires a new function to generate a tick_t
:object and the consuming code has to learn how to handle it, but all the
:in-between stuff doesn't care and doesn't have to know.
:...
:John Baldwin
:_______________________________________________

We have something called sysclock_t that is very similar. These are the issues I encountered while implementing it:

* I would recommend that the low level interfaces always operate as a deadline, because it simplifies the code enormously and makes it easy to detect negative-time events (which have to be turned into NOPs or minimally-timed events or looped events instead of sleeps).

* If you intend to support negative numbers to mean relative times, the translation to a deadline should be done at the highest level. Lower levels should always ONLY operate as deadlines.

  For example, if you change msleep() over to use your tick_t, then allowing a negative number to indicate a relative time should be handled by msleep() and NOT by APIs at a deeper level than msleep().

  This is particularly important, I think, because you do not want to get into the business of having to check what kind of timeout you are dealing with in every single API layer as you push down into the system timer, and you don't want to get into the business of having to re-read the current timestamp from hardware over and over again. (see more about this later).

* The maximum amount of time that can be represented by tick_t is an issue. I decided that rather than try to represent any amount of time, I would instead stick to a 10-second limitation for our sysclock_t and only those APIs that required fine-grained timing (SYSTIMER API) would ever actually *USE* sysclock_t.
  This created a very explicit, well defined, and extremely visible 'wall' between fine-grained timing APIs and coarse-grained timing APIs. I believe it made the related source code far more readable as well.

  This also allowed me to use a 32 bit integer to represent the fine-grained time, which saved a great deal of memory for data storage all over the system.

* I wanted an absolute fine-grained timestamp that was visible to upper layers but could be represented in 32 bits. The fact that the API was explicitly defined to cover only 10 seconds worth of (absolute or relative) time made manipulation of the timestamp extremely well defined.

  I also required that rollover work as expected (that you could subtract a rolled-over absolute timestamp from a prior timestamp and still get the correct delta time, within the 10-second limitation). Having the APIs explicitly assume a 2's complement rollover made all the coding easy.

  This worked extremely well in all respects.

* The frequency didn't matter, as long as the requirements of the timestamp were met (aka 10 seconds within its data type). This made manipulation of the timestamp very easy. In particular, certain common frequencies (like 1hz) could simply be cached in globals.

* All 'active' storage of the timestamp was encapsulated and handled by the API. In our case, one-shot and periodic SYSTIMERs. This theoretically allowed timebase changes but the more I consider the problem the more I believe that a *BETTER* way to handle timebase changes is to simply use a frequency fixed at boot time and translate at the hardware interface.

  It turns out that if you use a single abstracted timebase type with a boot-time-fixed frequency in your APIs, the only place you actually have to translate it to the hardware timer is when you are actually reading or writing the hardware timer. (more on this in the next point).

  In particular, take this example again with msleep(.... ticks).
  In any system there will be many threads sleeping with a timeout. So the timer queue would look something like this (abstracted, this is not a data structure):

  [NEXT TIMEOUT] -> [LATER TIMEOUT] -> [LATER TIMEOUT] -> ...

  You can basically store the timeouts as absolute tick_t's without having to translate them to the hardware timer resolution. Just do everything at tick_t's boot-time-fixed resolution.

* It turns out that timer events that required reloading are almost always synchronized with the callback related to the timer event. So instead of forcing the callback procedure to 'read' the exact timestamp (which requires translation from the hardware timer source), we simply add a sysclock_t (tick_t in your case) argument to the callout procedure so the current time is already available to it.

  We only have to calculate the translation of the hardware timer when the actual hardware interrupt occurs or if a newly installed (one-shot or periodic reload) timeout has a smaller count than timeouts already queued. Considering other overheads, even a significant number of mathematical operations to do the translation are a drop in the bucket if you only have to do them once per timer interrupt.

  This almost completely removed all extraneous multiplications and divisions from the critical code paths.

Consider the deadline vs relative timestamp formats and the APIs that allow either or both very carefully. Normalizing everything to a deadline in lower level APIs reaps *HUGE* benefits. But also consider very carefully any intent to remove coarse timestamp APIs. I personally believe that BOTH fine-grained and coarse-timestamp APIs are needed.
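
A minimal C sketch of the deadline and rollover points above. It is not code from DragonFly or FreeBSD: the type name tick32_t, the 200 MHz frequency, and all three helper functions are hypothetical, chosen only so that 10 seconds fits in 31 bits. It shows a relative timeout being converted to a deadline once, at the top layer, and a signed 32-bit subtraction detecting deadlines that have already passed even across counter rollover.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit fine-grained tick type (stand-in for sysclock_t/tick_t). */
typedef uint32_t tick32_t;

/* Assumed boot-time-fixed frequency: 200 MHz, so 10 s = 2e9 ticks < 2^31. */
#define TICK_FREQ 200000000u
#define TICK_MAX  (10u * TICK_FREQ)	/* the 10-second horizon, in ticks */

/* Top layer only: convert a relative timeout to an absolute deadline, once. */
static tick32_t
deadline_from_relative(tick32_t now, uint32_t rel_ticks)
{
	assert(rel_ticks <= TICK_MAX);
	return (now + rel_ticks);	/* unsigned add wraps modulo 2^32 */
}

/*
 * Signed distance to a deadline: positive means still in the future,
 * zero or negative means it has passed. The cast assumes a 2's complement
 * target, as the text above does.
 */
static int32_t
ticks_until(tick32_t deadline, tick32_t now)
{
	return ((int32_t)(deadline - now));
}

/* Lower layers only ever ask this question about a stored deadline. */
static int
deadline_passed(tick32_t deadline, tick32_t now)
{
	return (ticks_until(deadline, now) <= 0);
}
```

Because the horizon is capped below 2^31 ticks, any valid deadline is less than half the counter range away from "now", which is what makes the signed comparison unambiguous.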
You could extend your tick_t abstraction to support both coarse and fine-grained APIs as well as relative and absolute timestamps by eating more bits out of tick_t (maybe don't use negative numbers in that case), but I have to caution, again, that any such representation should be translated to a deadline and that low level fine-grained APIs should remain separate from coarse-grained APIs. This is the old 'how long do I want to msleep() for' problem... it can be fine-grained (microseconds) or coarse-grained (minutes, even).

It probably isn't useful to represent frequencies greater than a few megahertz, but the limitation is not so much the highest frequency you want to represent but instead the maximum amount of time you want to represent in the data type. A 32 bit integer with a 10-second limitation could represent intervals down to around 4ns (2^31 instead of 2^32 so you can detect deadlines which have passed). Since cpu's are improving speeds laterally now rather than in raw per-core processing power, and since a thread switch still takes at least 500ns even on the fastest cpu, I don't think it's an issue.
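
The resolution figure above can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names are hypothetical and nothing here comes from an existing kernel API. Given a horizon in seconds and the number of usable bits, it computes the finest tick period that still fits (in picoseconds, to stay in integer math) and the matching highest tick frequency.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest tick period, in picoseconds, such that horizon_sec seconds
 * still fit in 2^bits ticks (integer division, rounds down).
 */
static uint64_t
min_tick_period_ps(uint64_t horizon_sec, unsigned bits)
{
	assert(bits < 64 && horizon_sec > 0);
	return ((horizon_sec * UINT64_C(1000000000000)) >> bits);
}

/* Highest tick frequency, in Hz, for the same horizon and bit budget. */
static uint64_t
max_tick_freq_hz(uint64_t horizon_sec, unsigned bits)
{
	assert(bits < 64 && horizon_sec > 0);
	return ((UINT64_C(1) << bits) / horizon_sec);
}
```

With a 10-second horizon, 31 bits gives a period of 4656 ps (about 4.7 ns, roughly 214 MHz), and spending the sign bit as well (32 bits) would only halve that to 2328 ps.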
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Wed Nov 29 22:59:50 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7F67C16A50D; Wed, 29 Nov 2006 22:59:50 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id B8B6443CBF; Wed, 29 Nov 2006 22:58:23 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kATMwNCL048850; Wed, 29 Nov 2006 14:58:27 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kATMwNb6048849; Wed, 29 Nov 2006 14:58:23 -0800 (PST) Date: Wed, 29 Nov 2006 14:58:23 -0800 (PST) From: Matthew Dillon Message-Id: <200611292258.kATMwNb6048849@apollo.backplane.com> To: John Baldwin , Robert Watson , freebsd-arch@freebsd.org References: <10814.1164829546@critter.freebsd.dk> <200611291650.51782.jhb@freebsd.org> <200611292243.kATMhmaY048753@apollo.backplane.com> Cc: Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Nov 2006 22:59:50 -0000 Oh, one more note on why passing a tick_t to the callback procedure is the better way to go when operating with fine-grained timeouts. Consider aggregation. Lets say you are using a fine-grained periodic timer for your scheduler clock at, say, 1000hz, and a different fine-grained periodic timer for your network poll at, say, 1000hz. If both timers are installed at the same time and use the same initial deadline timestamp, then both timeouts will occur at the same time. 
If the deadline is passed as an argument to the callout (not the actual current timestamp, but the deadline that was recorded, even if the event is somewhat late), then a periodic reload based on that passed timestamp will STAY SYNCHRONIZED for both events, even though they are 'independent'.

The importance of this cannot be overstated because it removes all extra hardware interrupts that would otherwise occur if the separately timed periodic events were not synchronized with each other.

Similarly, if you have any set of periodic operations whose events share common factors, then some of those events will occur at the same time.

The result is *significantly* more optimal hardware timer interrupts and timer callback event processing. And I do mean significant.

--

To put it another way, if you have two independent subsystems each operating at 1000hz, would you rather take 1000 interrupts per second or 2000 interrupts per second? The answer should be clear. This really makes the case for passing deadlines instead of relative ticks.

In DragonFly I mentioned that we have per-cpu SYSTIMERs, but we are still using the 8254 (instead of the LAPIC) to drive them. Each cpu gets its own scheduler clock, stats clock, and so on and so forth. Each one. The events can multiply very quickly, but because all the periodic requests are synchronized with each other and run at either the same frequency or have common divisors or multiples, the actual hardware clock interrupt rate is bounded.

This has turned out to be so important that I am seriously considering creating an initial synchronization calculation when installing a periodic timer to try to sync it up with other periodic timers running on the system.
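
The 1000-versus-2000 interrupt argument above can be illustrated with a toy simulation in C. This is a sketch under stated assumptions, not the SYSTIMER implementation: the enum, the function name, and the model (two timers with the same period, where the second callback runs a fixed `lag` ticks late) are all hypothetical. It counts how many distinct hardware interrupts are needed when each timer reloads from its recorded deadline versus from the time its callback actually ran.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tick32_t;	/* hypothetical tick type */

enum reload_policy { RELOAD_FROM_DEADLINE, RELOAD_FROM_NOW };

/*
 * Two periodic timers with the same period, both first due at tick 0.
 * Timer A's callback runs exactly on time; timer B's runs `lag` ticks late.
 * Returns the number of hardware interrupts needed for `rounds` firings of
 * each timer. Coincidence is only checked within the same round, which is
 * exact as long as the accumulated drift (lag * rounds) stays below one period.
 */
static unsigned
count_interrupts(enum reload_policy policy, tick32_t period, tick32_t lag,
    unsigned rounds)
{
	tick32_t a = 0, b = 0;
	unsigned interrupts = 0;

	assert(lag < period);
	for (unsigned i = 0; i < rounds; i++) {
		/* One interrupt serves both timers only if the deadlines match. */
		interrupts += (a == b) ? 1 : 2;
		if (policy == RELOAD_FROM_DEADLINE) {
			a += period;		/* reload from the recorded deadline */
			b += period;
		} else {
			a = a + period;		/* A ran on time: now == a */
			b = (b + lag) + period;	/* B ran late: now == b + lag */
		}
	}
	return (interrupts);
}
```

In this model, 1000 rounds with a period of 1000 ticks and a lag of a single tick need 1000 interrupts when reloading from the deadline and 1999 when reloading from "now": one tick of lateness is enough to split the two timers apart for good.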
-Matt From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 00:25:25 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6D00F16A415 for ; Thu, 30 Nov 2006 00:25:25 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2E72543C9D for ; Thu, 30 Nov 2006 00:25:19 +0000 (GMT) (envelope-from freebsd-arch@m.gmane.org) Received: from root by ciao.gmane.org with local (Exim 4.43) id 1GpZjd-0001xX-17 for freebsd-arch@freebsd.org; Thu, 30 Nov 2006 01:25:05 +0100 Received: from r5h168.net.upc.cz ([86.49.7.168]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 30 Nov 2006 01:25:03 +0100 Received: from gamato by r5h168.net.upc.cz with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 30 Nov 2006 01:25:03 +0100 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: martinko Date: Thu, 30 Nov 2006 01:17:30 +0100 Lines: 26 Message-ID: References: <45649E42.70409@cs.rice.edu> <20061123020747.GZ2260@obelix.dsto.defence.gov.au> <17DE7E25-BCEE-46C7-9EB2-73A9D5C37CB1@tetlows.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: r5h168.net.upc.cz User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.8.0.8) Gecko/20061111 SeaMonkey/1.0.6 In-Reply-To: <17DE7E25-BCEE-46C7-9EB2-73A9D5C37CB1@tetlows.org> Sender: news Subject: Re: superpage plans X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 00:25:25 -0000 Gordon Tetlow wrote: > > On Nov 22, 2006, at 6:07 PM, Wilkinson, Alex 
wrote: > >> 0n Wed, Nov 22, 2006 at 01:00:18PM -0600, Alan Cox wrote: >> >>> Kip Macy wrote: >>> >>>> Do you have any thoughts on when superpage support might go into >>>> -CURRENT? >> >> erm, what is superpage ? > > http://www.cs.rice.edu/~jnavarro/superpages/ > > -gordon Hello, I followed the link and found some other interesting technologies like Anticipatory disk scheduling, Lazy Receiver Processing (LRP) and others. I wonder what the state of those is in FreeBSD.. ? Cheers, Martin From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 06:35:51 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 721E016A40F; Thu, 30 Nov 2006 06:35:51 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (vc4-2-0-87.dsl.netrack.net [199.45.160.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7552A43CA3; Thu, 30 Nov 2006 06:35:44 +0000 (GMT) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.13.4/8.13.4) with ESMTP id kAU6ZNFt026546; Wed, 29 Nov 2006 23:35:25 -0700 (MST) (envelope-from imp@bsdimp.com) Date: Wed, 29 Nov 2006 23:36:11 -0700 (MST) Message-Id: <20061129.233611.-1540390902.imp@bsdimp.com> To: phk@phk.freebsd.dk From: "M. 
Warner Losh" In-Reply-To: <6194.1164756521@critter.freebsd.dk> References: <200611281631.19224.jhb@freebsd.org> <6194.1164756521@critter.freebsd.dk> X-Mailer: Mew version 4.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Wed, 29 Nov 2006 23:35:25 -0700 (MST) Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 06:35:51 -0000 Here's a decoder ring... In message: <6194.1164756521@critter.freebsd.dk> "Poul-Henning Kamp" writes: : given wall-clock (UTC) time. UTC has leapseconds, and likely is adjusted by ntp. : Others want to sleep on the absolute (TAI) timescale, such as TCP TAI has no leapseconds and is a perfect monotonic timescale. Warner From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 06:47:45 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8330416A407; Thu, 30 Nov 2006 06:47:45 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (vc4-2-0-87.dsl.netrack.net [199.45.160.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id E3F2C43C9D; Thu, 30 Nov 2006 06:47:38 +0000 (GMT) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.13.4/8.13.4) with ESMTP id kAU6lN33026637; Wed, 29 Nov 2006 23:47:23 -0700 (MST) (envelope-from imp@bsdimp.com) Date: Wed, 29 Nov 2006 23:48:11 -0700 (MST) Message-Id: <20061129.234811.-1625879484.imp@bsdimp.com> To: phk@phk.freebsd.dk From: "M. 
Warner Losh" In-Reply-To: <11587.1164837604@critter.freebsd.dk> References: <200611291650.51782.jhb@freebsd.org> <11587.1164837604@critter.freebsd.dk> X-Mailer: Mew version 4.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Wed, 29 Nov 2006 23:47:24 -0700 (MST) Cc: rwatson@freebsd.org, freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-List-Received-Date: Thu, 30 Nov 2006 06:47:45 -0000

In message: <11587.1164837604@critter.freebsd.dk> "Poul-Henning Kamp" writes:
: In message <200611291650.51782.jhb@freebsd.org>, John Baldwin writes:
:
: >> I want it marked up directly in the flags passed which kind of behaviour
: >> the code wants.
: >
: >Hmm, I guess that depends on what you consider tick_t to be. I was thinking
: >of it as an abstract type for a deadline, and that absolute and relative are
: >sort of like subclasses of that.
:
: I see tick_t only as an opaque measure of time and would prefer to
: not have modal bits stuck into it because I fear that will make it
: larger than 32 bits.

There are many times and places that confuse a time interval with an elapsed time since an epoch. struct timeval started life as the latter and was press-ganged to also serve as the former. Having different types allows for an unambiguous conversion. I don't believe there's a prohibitive cost in doing this.

I'd also argue that UTC is just a printing convention anyway.
Keeping time in a TAI-like timescale and doing the conversion to UTC when UTC timestamps are necessary would be worth considering, but there are some costs with doing this that might prove to be too high since UTC is used a lot and any TAI-like thing is only used for the 'core' timing stuff. Warner From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 06:53:25 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F057516A509; Thu, 30 Nov 2006 06:53:25 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6802C43CA6; Thu, 30 Nov 2006 06:53:19 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id D8FBA170C5; Thu, 30 Nov 2006 06:53:23 +0000 (UTC) To: "M. Warner Losh" From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 23:48:11 MST." <20061129.234811.-1625879484.imp@bsdimp.com> Date: Thu, 30 Nov 2006 06:53:21 +0000 Message-ID: <13587.1164869601@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: rwatson@freebsd.org, freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 06:53:26 -0000 In message <20061129.234811.-1625879484.imp@bsdimp.com>, "M. Warner Losh" write s: >I'd also argue that UTC is just a printing convention anyway. 
Keeping >time in a TAI-like timescale and doing the conversion to UTC when UTC >timestamps are necessary would be worth considering, but there are >some costs with doing this that might prove to be too high since UTC >is used a lot and any TAI-like thing is only used for the 'core' >timing stuff. As far as I know we have no sleepers on UTC scale in the kernel and nobody has said otherwise throughout this discussion. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 06:59:45 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CBD7A16A4A0; Thu, 30 Nov 2006 06:59:45 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (vc4-2-0-87.dsl.netrack.net [199.45.160.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id 12A5C43CA5; Thu, 30 Nov 2006 06:59:38 +0000 (GMT) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.13.4/8.13.4) with ESMTP id kAU6uqAY026779; Wed, 29 Nov 2006 23:56:52 -0700 (MST) (envelope-from imp@bsdimp.com) Date: Wed, 29 Nov 2006 23:57:41 -0700 (MST) Message-Id: <20061129.235741.-278387249.imp@bsdimp.com> To: phk@phk.freebsd.dk From: "M. 
Warner Losh" In-Reply-To: <13587.1164869601@critter.freebsd.dk> References: <20061129.234811.-1625879484.imp@bsdimp.com> <13587.1164869601@critter.freebsd.dk> X-Mailer: Mew version 4.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Wed, 29 Nov 2006 23:56:53 -0700 (MST) Cc: rwatson@freebsd.org, freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 06:59:45 -0000 In message: <13587.1164869601@critter.freebsd.dk> "Poul-Henning Kamp" writes: : In message <20061129.234811.-1625879484.imp@bsdimp.com>, "M. Warner Losh" write : s: : : >I'd also argue that UTC is just a printing convention anyway. Keeping : >time in a TAI-like timescale and doing the conversion to UTC when UTC : >timestamps are necessary would be worth considering, but there are : >some costs with doing this that might prove to be too high since UTC : >is used a lot and any TAI-like thing is only used for the 'core' : >timing stuff. : : As far as I know we have no sleepers on UTC scale in the kernel and : nobody has said otherwise throughout this discussion. 
Then never mind, that solves that problem :-) Warner From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 07:06:10 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CFF5F16A412; Thu, 30 Nov 2006 07:06:10 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3403643CB9; Thu, 30 Nov 2006 07:05:56 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id B1719170C5; Thu, 30 Nov 2006 07:06:01 +0000 (UTC) To: "M. Warner Losh" From: "Poul-Henning Kamp" In-Reply-To: Your message of "Wed, 29 Nov 2006 23:57:41 MST." <20061129.235741.-278387249.imp@bsdimp.com> Date: Thu, 30 Nov 2006 07:05:59 +0000 Message-ID: <13664.1164870359@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: rwatson@freebsd.org, freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 07:06:10 -0000 In message <20061129.235741.-278387249.imp@bsdimp.com>, "M. Warner Losh" writes : >: As far as I know we have no sleepers on UTC scale in the kernel and >: nobody has said otherwise throughout this discussion. 
> >Then never mind, that solve that problem :-) Well, at least it isolates that aspect to the poorly defined userland sleep facilities in POSIX, which as we know, doesn't even recognize what a timescale is or why it is important to read the entire definition of one :-) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 09:42:50 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2059116A407; Thu, 30 Nov 2006 09:42:50 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe06.swip.net [212.247.154.161]) by mx1.FreeBSD.org (Postfix) with ESMTP id 5405943CA5; Thu, 30 Nov 2006 09:42:42 +0000 (GMT) (envelope-from hselasky@c2i.net) X-T2-Posting-ID: waeY3scLM7Rs5zHA9X5AxA== X-Cloudmark-Score: 0.000000 [] Received: from [193.217.137.195] (account mc467741@c2i.net HELO [10.0.0.249]) by mailfe06.swip.net (CommuniGate Pro SMTP 5.0.12) with ESMTPA id 342876043; Thu, 30 Nov 2006 10:42:46 +0100 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Thu, 30 Nov 2006 10:42:25 +0100 User-Agent: KMail/1.7 References: <7105.1163451221@critter.freebsd.dk> In-Reply-To: <7105.1163451221@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200611301042.27175.hselasky@c2i.net> Cc: arch@freebsd.org, Poul-Henning Kamp Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 09:42:50 -0000 On 
Monday 13 November 2006 21:53, Poul-Henning Kamp wrote: > A number of problems have been identified with our current callout > code and I have been thinking about and discussing various aspects > with people during the EuroBSDcon 2006 conference. > > A lot of people are interested in this, so here is a quick sketch > of what I'm thinking about: > > > The Problems > ------------ > > 1. We need better resolution than a periodic "hz" clock can give us. > Highspeed networking, gaming servers and other real-time apps want > this. > > 2. We "pollute" our call-wheel with tons of callouts that we know are > unlikely to happen. > > 3. We have many operations on the callout wheel because certain > callouts get rearmed for later in the future. (TCP keepalives). > > 4. We execute all callouts on one CPU only. > > 5. Most of the specified timeouts are bogus, because of the imprecision > inherent in the current 1/hz method of scheduling them. > > and a number of other issues. > > > The proposed API > ---------------- > > tick_t XXX_ns_tick(unsigned nsec, unsigned *low, unsigned *high); > Calculate the tick value for a given timeout. > Optionally return theoretical lower and upper limits to > actual value, > > tick_t XXX_s_tick(unsigned seconds) > Calculate the tick value for a given timeout. > > The point behind these two functions is that we do not want to > incur a scaling operation at every arming of a callout. Very > few callouts use varying timeouts (and for those, no avoidance > is possible), but for the rest, precalculating the correct > (opaque) number is a good optimization. > Hi, I have some comments on the matter. Is XXX_arm only an init routine? If not, it does not make sense that one is allowed to pass a variable mutex argument. This will lead to some locking problems I think. Also I am missing the CALLOUT_RETURNUNLOCKED flag. > XXX_arm(struct xxx*, tick_t, func *, arg *, int flag, struct mtx *); > Arm timer. > Struct xxx must be zeroed before first call. 
> > If mtx pointer is non-NULL, acq mutex before calling. This makes sense. Should also require this for rearm/disarm, but not drain. > > flags: > XXX_REPEAT > XXX_UNLIKELY > > Arm a callout with a number of optional behaviours specified. > > XXX_rearm(struct xxx*, tick_t) > Rearm timer. > > XXX_disarm(struct xxx*) > Unarm the timer. > > XXX_drain(struct xxx*) > Drain the timer. > > > The functions above will actually be wrappers for a more generic > set of the same family, which also takes a pointer to a callout-group. > > This is so that we can have different groups of callouts, for > instance one group for the/each netstack and one for the disk-I/O > stuff etc. Why can't you just rename and extend the existing callout_init_mtx()/callout_reset()/callout_stop()/callout_drain() functions ? Yours --HPS
From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 14:56:34 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D60AA16A403 for ; Thu, 30 Nov 2006 14:56:34 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from ciao.gmane.org (main.gmane.org [80.91.229.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id CFD3C43D80 for ; Thu, 30 Nov 2006 14:52:55 +0000 (GMT) (envelope-from freebsd-arch@m.gmane.org) Received: from list by ciao.gmane.org with local (Exim 4.43) id 1GpnHI-0001hk-CQ for freebsd-arch@freebsd.org; Thu, 30 Nov 2006 15:52:44 +0100 Received: from lara.cc.fer.hr ([161.53.72.113]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 30 Nov 2006 15:52:44 +0100 Received: from ivoras by lara.cc.fer.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 30 Nov 2006 15:52:44 +0100 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Ivan Voras Date: Thu, 30 Nov 2006 15:52:22 +0100 Lines: 31 Message-ID: References: <200611292147.kATLll4m048223@apollo.backplane.com> <11606.1164837711@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Complaints-To: usenet@sea.gmane.org X-Gmane-NNTP-Posting-Host: lara.cc.fer.hr User-Agent: Thunderbird 1.5.0.4 (X11/20060625) In-Reply-To: <11606.1164837711@critter.freebsd.dk> Sender: news Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 14:56:34 -0000 Poul-Henning Kamp wrote: > In message <200611292147.kATLll4m048223@apollo.backplane.com>, Matthew Dillon w > rites: > >> The difference between you and me, Poul, is that you 
always try to play >> cute tricks with words when you intend to insult someone. Me? I just >> go ahead and insult them explicitly. > > I can do that too: You're a pompous asshole who doesn't know what > you're talking about. > > Now, does that make you feel better ? > >> In any case, I think the relevance of my comments is clear to anyone who >> has followed the project. Are you guys so stuck up on performance >> that you are willing to seriously pollute your APIs just to get rid of >> a few multiplications and divisions? > > You have your project and we have ours. > > You make your choices, we make ours. > > You have your mailing lists, we have ours. > > CTRL-D for all I care. Not trying to take sides here, but for those of us willing to learn, what exactly are the problems in Matt Dillon's suggestions? From a novice's POV, having per-cpu queues looks (emphasis: looks) very scalable and performant. From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 19:26:19 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0759A16A4AB for ; Thu, 30 Nov 2006 19:26:19 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.FreeBSD.org (Postfix) with ESMTP id D809443CA2 for ; Thu, 30 Nov 2006 19:26:08 +0000 (GMT) (envelope-from jhb@freebsd.org) Received: from localhost.corp.yahoo.com (john@localhost [127.0.0.1]) (authenticated bits=0) by server.baldwin.cx (8.13.6/8.13.6) with ESMTP id kAUJQ1ji003967; Thu, 30 Nov 2006 14:26:02 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-arch@freebsd.org Date: Thu, 30 Nov 2006 11:05:51 -0500 User-Agent: KMail/1.9.1 References: <200611292147.kATLll4m048223@apollo.backplane.com> <11606.1164837711@critter.freebsd.dk> In-Reply-To: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" 
Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200611301105.51983.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (server.baldwin.cx [127.0.0.1]); Thu, 30 Nov 2006 14:26:02 -0500 (EST) X-Virus-Scanned: ClamAV 0.88.3/2263/Thu Nov 30 01:51:08 2006 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-4.2 required=4.2 tests=ALL_TRUSTED,AWL,BAYES_00, DATE_IN_PAST_03_06 autolearn=ham version=3.1.3 X-Spam-Checker-Version: SpamAssassin 3.1.3 (2006-06-01) on server.baldwin.cx Cc: Ivan Voras Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 19:26:19 -0000 On Thursday 30 November 2006 09:52, Ivan Voras wrote: > Poul-Henning Kamp wrote: > > In message <200611292147.kATLll4m048223@apollo.backplane.com>, Matthew Dillon w > > rites: > > > >> The difference between you and me, Poul, is that you always try to play > >> cute tricks with words when you intend to insult someone. Me? I just > >> go ahead and insult them explicitly. > > > > I can do that too: You're a pompous asshole who doesn't know what > > you're talking about. > > > > Now, does that make you feel better ? > > > >> In anycase, I think the relevance of my comments is clear to anyone who > >> has followed the project. Are you guys so stuck up on performance > >> that you are willing to seriously pollute your APIs just to get rid of > >> a few multiplications and divisions? > > > > You have your project and we have ours. > > > > You make your choices, we make ours. > > > > You have your mailing lists, we have ours. > > > > CTRL-D for all I care. > > No trying to take sides here, but for us willing to learn here, what > exactly are the problems in Matt Dillon's suggestions? 
From a novice's > POV, having per-cpu queues looks (emphasis: looks) very scalable and > performant. I don't think phk@ is ruling out per-cpu callout wheels. I know it is something I've thought about myself for a while now. One of the goals of the change is to make things a bit more abstract and less tick-centric (i.e. specify timeouts in real time units like nanoseconds or seconds rather than tick counts based on a hz periodic timer). Whether or not there are per-cpu callouts is really an implementation detail rather than an API one. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 21:57:05 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E563016A492 for ; Thu, 30 Nov 2006 21:57:05 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 56D7243CA7 for ; Thu, 30 Nov 2006 21:56:55 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id CC83C170C5; Thu, 30 Nov 2006 21:57:03 +0000 (UTC) To: Ivan Voras From: "Poul-Henning Kamp" In-Reply-To: Your message of "Thu, 30 Nov 2006 15:52:22 +0100." Date: Thu, 30 Nov 2006 21:57:01 +0000 Message-ID: <1299.1164923821@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 21:57:06 -0000 In message , Ivan Voras writes: >No trying to take sides here, but for us willing to learn here, what >exactly are the problems in Matt Dillon's suggestions? 
From a novice's >POV, having per-cpu queues looks (emphasis: looks) very scalable and >performant. I'm not going to dissect Matt's emails because that will just lead to a long and pointless flamewar. Most of Matt's emails focus on the specifics of implementation whereas I have repeatedly stressed that my focus is on defining a good API for programmers to use which will allow us to isolate the implementation so we can experiment with different strategies. I will work with John to write up our spec and we will publish that along with the reasoning soon. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Thu Nov 30 23:33:21 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EB2D016A494 for ; Thu, 30 Nov 2006 23:33:21 +0000 (UTC) (envelope-from jb@what-creek.com) Received: from what-creek.com (what-creek.com [66.111.37.70]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2CCB143CA3 for ; Thu, 30 Nov 2006 23:33:11 +0000 (GMT) (envelope-from jb@what-creek.com) Received: by what-creek.com (Postfix, from userid 102) id A9CF1140EC03; Thu, 30 Nov 2006 23:34:32 +0000 (GMT) Date: Thu, 30 Nov 2006 23:34:32 +0000 From: John Birrell To: Poul-Henning Kamp Message-ID: <20061130233432.GA9667@what-creek.com> References: <1299.1164923821@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1299.1164923821@critter.freebsd.dk> User-Agent: Mutt/1.4.2.1i Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Nov 2006 23:33:22 -0000 On Thu, Nov 30, 2006 at 09:57:01PM +0000, Poul-Henning Kamp wrote: > Most of Matts emails focus on the specifics of implementation whereas > I have repeatedly stressed that my focus is on defining a good API > for programmers to use which will allow us to isolate the implemetation > so we can experiment with different strategies. > > I will work with John to write up our spec and we will publish that > along with the reasoning soon. DTrace relies on the Solaris cyclic timer subsystem for core functionality. Where the HPET is available on Intel arch machines, it is the cyclic driver that uses it. Here is a link to a Sun blog describing some of the features they implement: For DTrace to be as effective on FreeBSD as it is on Solaris, we need similar functionality. -- John Birrell From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 01:39:31 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6CB6116A415 for ; Fri, 1 Dec 2006 01:39:31 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id E48BF43CC0 for ; Fri, 1 Dec 2006 01:39:14 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 6683346D2B; Thu, 30 Nov 2006 20:39:24 -0500 (EST) Date: Fri, 1 Dec 2006 01:39:24 +0000 (GMT) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: Ivan Voras In-Reply-To: Message-ID: <20061201012221.J79653@fledge.watson.org> References: <200611292147.kATLll4m048223@apollo.backplane.com> <11606.1164837711@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: a proposed 
callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 01:39:31 -0000 On Thu, 30 Nov 2006, Ivan Voras wrote: > No trying to take sides here, but for us willing to learn here, what exactly > are the problems in Matt Dillon's suggestions? From a novice's POV, having > per-cpu queues looks (emphasis: looks) very scalable and performant. The implications of adopting the model Matt proposes are quite far-reaching: callouts don't exist in isolation, but occur in the context of data structures and work occurring in many threads. If callouts are pinned to a particular CPU, and can only be scheduled, rescheduled, and cancelled from that CPU, that implies either that all work associated with that callout is also pinned to the CPU, or that migration or message-passing be involved if the requirement comes up in a thread on another CPU. Consider the case of TCP timers: a number of TCP timers get regularly rescheduled (delack, retransmit, etc). If they can only be manipulated from cpu0 (i.e., protected by a synchronization primitive that can't be acquired from another CPU -- i.e., critical sections instead of mutexes), how do you handle the case where a TCP packet for that connection is processed on cpu1 and needs to change the scheduling of the timer? In a strict work/data structure pinning model, you would pin the TCP connection to cpu0, and only process any data leading to timer changes on that CPU. Alternatively, you might pass a message from cpu1 to cpu0 to change the scheduling. The idea of processing timers in multiple threads and pinning them to multiple CPUs clearly isn't a bad idea: we could likely benefit from parallelism (and generally, concurrency) in timer processing. 
One of the things we discussed at the recent developer summit was subsystem callout threads (introducing the opportunity for parallelism without committing to a particular CPU scheduling model), as well as per-CPU callout threads but protected using mutexes so that reschedule/cancel/etc can be performed from other CPUs still. Changing the API so that scheduling/rescheduling/etc activities themselves must occur on a particular CPU has serious implications and commits us to an architectural approach for which there is little consensus. If the goal is simply parallelism, it's possible to accomplish that without embedding assumptions about the synchronization model at this point. Take a look at the USENIX paper by Paul Willmann (et al) at Rice for some rather interesting experimentation, measurement, and discussion precisely along these lines: http://www.ece.rice.edu/~willmann/pubs/paranet_tr06-872.pdf Robert N M Watson Computer Laboratory University of Cambridge From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 05:04:43 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 95FCE16ADC7 for ; Fri, 1 Dec 2006 05:04:33 +0000 (UTC) (envelope-from gnn@neville-neil.com) Received: from mrout1.yahoo.com (mrout1.yahoo.com [216.145.54.171]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6644643CA2 for ; Fri, 1 Dec 2006 05:04:21 +0000 (GMT) (envelope-from gnn@neville-neil.com) Received: from minion.local.neville-neil.com (proxy8.corp.yahoo.com [216.145.48.13]) by mrout1.yahoo.com (8.13.6/8.13.6/y.out) with ESMTP id kB1541NI052174; Thu, 30 Nov 2006 21:04:01 -0800 (PST) Date: Fri, 01 Dec 2006 12:38:08 +0900 Message-ID: From: "George V. 
Neville-Neil" To: "Poul-Henning Kamp" In-Reply-To: <1299.1164923821@critter.freebsd.dk> References: <1299.1164923821@critter.freebsd.dk> User-Agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (=?ISO-8859-4?Q?Shij=F2?=) APEL/10.6 Emacs/22.0.90 (i386-apple-darwin8.8.1) MULE/5.0 (SAKAKI) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Cc: Ivan Voras , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 05:04:43 -0000 At Thu, 30 Nov 2006 21:57:01 +0000, Poul-Henning Kamp wrote: > I'm not going to dissect Matts emails because that will just lead > to a long an pointless flamewar. > > Most of Matts emails focus on the specifics of implementation whereas > I have repeatedly stressed that my focus is on defining a good API > for programmers to use which will allow us to isolate the implemetation > so we can experiment with different strategies. > > I will work with John to write up our spec and we will publish that > along with the reasoning soon. > Excellent. That will give us something more substantial to chew on. 
Thanks, George From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 09:31:09 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DE7B616A403; Fri, 1 Dec 2006 09:31:08 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 877F943CA6; Fri, 1 Dec 2006 09:30:55 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kB19Uxuj064008; Fri, 1 Dec 2006 01:30:59 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kB19Ushn064003; Fri, 1 Dec 2006 01:30:54 -0800 (PST) Date: Fri, 1 Dec 2006 01:30:54 -0800 (PST) From: Matthew Dillon Message-Id: <200612010930.kB19Ushn064003@apollo.backplane.com> To: Robert Watson References: <200611292147.kATLll4m048223@apollo.backplane.com> <11606.1164837711@critter.freebsd.dk> <20061201012221.J79653@fledge.watson.org> Cc: Ivan Voras , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 09:31:09 -0000 :The implications of adopting the model Matt proposes are quite far-reaching: :callouts don't exist in isolation, but occur in the context of data structures :and work occuring in many threads. If callouts are pinned to a particular :... :Consider the case of TCP timers: a number of TCP timers get regularly :rescheduled (delack, retransmit, etc). 
If they can only be manipulated from :cpu0 (i.e., protected by a synchronization primitive that can't be acquired :from another CPU -- i.e., critical sections instead of mutexes), how do you :handle the case where the a TCP packet for that connection is processed on :cpu1 and needs to change the scheduling of the timer? In a strict work/data :structure pinning model, you would pin the TCP connection to cpu0, and only :process any data leading to timer changes on that CPU. Alternatively, you :might pass a message from cpu1 to cpu0 to change the scheduling. Yes, this is all very true. One could think of this in a more abstract way if that would make things more clear: All the work processing related to a particular TCP connection is accumulated into a single 'hopper'. The hopper is what is being serialized with a mutex, or by cpu-locality, or even simply by thread-locality (dedicating a single thread to process a single hopper). This means that all the work that has accumulated in the hopper can be processed while holding a single serializer instead of having to acquire and release a serializer for each work item within the hopper. That's the gist of it. If you have enough hoppers, statistics takes care of the rest. There is nothing that says the hoppers have to be pinned to particular cpu's, it just makes it easier for other system APIs if they are. For FreeBSD, I think the hopper abstraction might be the way to go. You could then have worker threads running on each cpu (one per cpu) which compete for hoppers with pending work. You can avoid wiring the hoppers to particular cpus (which FreeBSD people seem to dislike considerably) yet still reap the benefits of batch processing. TCP callout timers are a really good example here, because TCP callout timers are ONLY ever manipulated from within the TCP protocol stack, which means they are only manipulated in the context of a TCP work item (either a packet, or a timeout, or user requested work).
If you think about it, nearly *all* the manipulation of the TCP callout timers occurs during work item processing where you already hold the governing serializer. That is the manipulation that needs to become optimal here. So the question for callouts then becomes.... can the serializer used for the work item processing be the SAME serializer that the callout API uses to control access to the callout structures? In the DragonFly model the answer is: yes, easy, because the serializer is cpu-localized. In FreeBSD the same thing could be accomplished by implementing a callout wheel for each 'hopper', controlled by the same serializer. The only real performance issue is how to handle work item events caused by userland read() or write()'s.... do you have those operations send a message to the thread managing the hopper? Or do you have those operations obtain the hopper's serializer and enter the TCP stack directly? For FreeBSD I would guess the latter... obtain the hopper's serializer and enter the TCP stack directly. But if you were to implement it you could actually do it both ways and have a sysctl to select which method to use, then look at how that affects performance. The other main entry point for packets into the TCP stack is from the network interface. The network interface layer is typically interrupt driven, and just as typically it is not (in my opinion) the best idea to try to call the TCP protocol stack from the network interrupt as it seriously elongates the code path and enlarges the cache fingerprint required to run through a network interface's RX ring. The RX ring is likely to contain dozens or even a hundred or more packets bound for a fewer (but still significant) number of TCP connections. Breaking up that processing into two separate loops... getting the packets off the RX ring and placing them in the correct hopper, and processing the hopper's work queue, would yield a far better cache footprint. Again, my opinion.
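A rough illustration of the two-loop idea above, not code from DragonFly or FreeBSD. Every name here (struct hopper, struct serializer, rx_ring_sort(), hopper_drain(), NHOPPERS, HOPPER_CAP) is hypothetical, the "serializer" is only a counter standing in for a mutex, and packets are reduced to integer connection ids. It is single-threaded, so the enqueue in the first pass has no protection of its own, which a real implementation would need.

```c
#include <stddef.h>

#define NHOPPERS   4    /* hypothetical number of hoppers */
#define HOPPER_CAP 64   /* hypothetical per-hopper queue depth */

/* Stand-in for a mutex: counts acquisitions so batching is observable. */
struct serializer {
    unsigned long acquisitions;
};

struct hopper {
    struct serializer ser;
    int queue[HOPPER_CAP];      /* pending work items (connection ids) */
    size_t count;               /* number of queued items */
    unsigned long processed;    /* total items handled so far */
};

static void ser_enter(struct serializer *s) { s->acquisitions++; }
static void ser_exit(struct serializer *s) { (void)s; }

/*
 * Pass 1: walk the RX ring once and sort each packet into the hopper
 * chosen by its connection id.  Returns the number of packets queued;
 * packets that do not fit in a full hopper are dropped in this sketch.
 */
static size_t
rx_ring_sort(struct hopper *h, const int *ring, size_t n)
{
    size_t queued = 0;

    for (size_t i = 0; i < n; i++) {
        struct hopper *hp = &h[(unsigned)ring[i] % NHOPPERS];

        if (hp->count < HOPPER_CAP) {
            hp->queue[hp->count++] = ring[i];
            queued++;
        }
    }
    return (queued);
}

/*
 * Pass 2: drain one hopper while holding its serializer once for the
 * whole batch, instead of once per work item.
 */
static size_t
hopper_drain(struct hopper *hp)
{
    size_t done;

    ser_enter(&hp->ser);
    for (done = 0; done < hp->count; done++)
        hp->processed++;        /* protocol processing would go here */
    hp->count = 0;
    ser_exit(&hp->ser);
    return (done);
}
```

With seven packets spread over three connections' worth of hoppers, each hopper is entered once per drain no matter how many items it holds, which is the batching effect being argued for.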
-- In any case, these methodologies basically exist in order to remove the need to acquire a serializer that is so fine-grained that the overhead of the serializer becomes a serious component of the overhead of the work being serialized. That is *certainly* the case for the callout API. Sans serializer, the callout API is basically one or two TAILQ manipulations and that is it. You can't get much faster than that. I don't think it is appropriate to try to abstract-away the serializer when the serializer becomes such a large component. That's like hiding something you don't like under your bed. -- Something just came to my attention... are you guys actually using high 'hz' values to govern your current callout API? In particular, the c->c_time field? If that is the case the size of your callwheel array may be insufficient to hold even short timeouts without wrapping. That could create *serious* performance problems with the callwheel design. And I do mean serious. The entire purpose of having the callwheel is to support the notion that most timeouts will be removed or reset before they actually occur, meaning before the iterator (softclock_handler() in kern_timeout.c) gets to the index. If you wrap, the iterator may wind up having to skip literally thousands or hundreds of thousands of callout structures during its scan. So, e.g. a typical callwheel is sized to 16384 or 32768 entries ('print callwheelsize' from kgdb on a live kernel). At 100hz 32768 entries gives us 327 seconds of range before callout entries start to wrap. At 1000hz 32768 entries barely gives you 32 seconds of range. All TCP timers except the idle timer are fairly short lived. The idle timer could be an issue for you. In fact, it could be an issue for us too... that's something I will have a look at in DragonFly.
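The range arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, the slot calculation assumes a power-of-two wheel indexed by masking the expiry tick, and the two-hour figure in the usage note is just an example of a long timeout.

```c
/* Whole seconds of timeout range before callwheel slots start to be reused. */
static unsigned
callwheel_range_seconds(unsigned callwheelsize, unsigned hz)
{
    return (callwheelsize / hz);
}

/* Slot for an absolute expiry tick, assuming a power-of-two wheel size. */
static unsigned
callwheel_slot(unsigned c_time, unsigned callwheelsize)
{
    return (c_time & (callwheelsize - 1));
}

/*
 * Number of times the iterator sweeps past the slot before a timeout
 * `ticks` in the future is actually due.  Anything above zero means the
 * entry is skipped on each of those sweeps.
 */
static unsigned
callwheel_wraps(unsigned ticks, unsigned callwheelsize)
{
    return (ticks / callwheelsize);
}
```

For a 32768-entry wheel, callwheel_range_seconds() gives 327 at hz=100 and 32 at hz=1000, matching the figures above. A two-hour timeout at hz=1000 is 7200000 ticks, so callwheel_wraps() reports that the iterator passes its slot 219 times before it fires.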
You could also be hitting another big problem by using a too fine-grained timer/timeout resolution, and that is destroying the natural aggregation of work that occurs with coarse resolutions. It doesn't make much sense to have a handful of callouts at 10ms, 11ms and 12ms for example. It would be better to have them all in one slot (like at 12ms) so they can all be processed in batch. This is particularly true for anything that can be processed with a tight code loop, and the TCP protocol stack certainly applies there. I think Jeffrey Hsu actually counted instruction cycles for TCP processing through the short-cut tests (the optimal/critical path when incoming data packets are in-order and non-overlapping and such), and once he fixed some of the conditionals the number of instructions required to process a packet had been reduced dramatically and certainly fit in the L1 cache. Something to think about, anyhow. I'll read the paper you referenced. It looks interesting. -Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 10:09:36 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E845816A40F; Fri, 1 Dec 2006 10:09:36 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 559BE43CA3; Fri, 1 Dec 2006 10:09:23 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kB1A9WZL064232; Fri, 1 Dec 2006 02:09:32 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kB1A9VA8064231; Fri, 1 Dec 2006 02:09:31 -0800 (PST) Date: Fri, 1 Dec 2006 02:09:31 -0800 (PST) From: Matthew Dillon Message-Id: <200612011009.kB1A9VA8064231@apollo.backplane.com>
To: Robert Watson References: <200611292147.kATLll4m048223@apollo.backplane.com> <11606.1164837711@critter.freebsd.dk> <20061201012221.J79653@fledge.watson.org> Cc: Ivan Voras , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 10:09:37 -0000 : : http://www.ece.rice.edu/~willmann/pubs/paranet_tr06-872.pdf : :Robert N M Watson Oh, that paper. You know, I talked to Alan about that paper a while back, but it isn't really possible to compare DragonFly side by side with FreeBSD yet in an SMP environment because we still have a lot of BGL junk in the network path, and because our interrupts are still going to cpu #0. The code itself is mostly MP safe, and Jeff has actually turned off the BGL in some of his own testing, but I can't do it officially yet. In any case, that is why DragonFly wasn't used. I like the paper but I'm not sure it is possible to compare particulars of the network implementation in such different operating systems (FreeBSD vs Linux). The basic issue is, in a nutshell:

    per-packet                          per-connection
    ----------                          -------------
    long code paths                     short code paths
    long data paths                     shorter data paths
    (more cache contention)             (less cache contention, work
                                        aggregation has a better chance of
                                        fitting in the L1/L2, much less
                                        'shared' data between cpus)
    lots of mutexes                     fewer mutexes or no mutexes
    (eats time, potential contention)
    no thread switches or               more thread switches - protocol threads
    fewer thread switches               process packets, not the interface
                                        interrupt
    (remember, interrupts               (~500ns-~1us to switch threads)
    may be threaded)

I'm sure I can think of more differences.
The per-packet model almost certainly has better synchronous performance and almost certainly has better performance when operating on a limited number of connections. The basic premise is that you can make better use of any available cpu by using a fine-grained locking model. The per-connection model has the potential to process data in bulk with a far smaller cache footprint due to the separation of work, but requires a large number of connections to aggregate enough work to absorb the thread switching overhead. The basic premise is that you only care about performance if you are actually running your machine flat out. This is largely borne out by the paper, though I again caution that the operating systems are so different from each other I don't think the graphs can be taken as an indication of the relative benefits of the networking model. -Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 10:21:33 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 24DE416A407; Fri, 1 Dec 2006 10:21:33 +0000 (UTC) (envelope-from phk@critter.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.FreeBSD.org (Postfix) with ESMTP id 6CAED43CAA; Fri, 1 Dec 2006 10:21:19 +0000 (GMT) (envelope-from phk@critter.freebsd.dk) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.48.2]) by phk.freebsd.dk (Postfix) with ESMTP id F28B5170C5; Fri, 1 Dec 2006 10:21:30 +0000 (UTC) To: Matthew Dillon From: "Poul-Henning Kamp" In-Reply-To: Your message of "Fri, 01 Dec 2006 02:09:31 PST."
<200612011009.kB1A9VA8064231@apollo.backplane.com> Date: Fri, 01 Dec 2006 10:21:28 +0000 Message-ID: <3931.1164968488@critter.freebsd.dk> Sender: phk@critter.freebsd.dk Cc: Robert Watson , Ivan Voras , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 10:21:33 -0000 In message <200612011009.kB1A9VA8064231@apollo.backplane.com>, Matthew Dillon w rites: >: >: http://www.ece.rice.edu/~willmann/pubs/paranet_tr06-872.pdf >: >:Robert N M Watson > > Oh, that paper. You know, I talked to Alan about that paper a while > back, but it isn't really possible to compare DragonFly side by side > with FreeBSD yet in an SMP environment because we still have a lot > of BGL junk in the network path, and because our interrupts are > still going to cpu #0. The code itself is mostly MP safe, and Jeff > has actually turned off the BGL in some of his own testing, but I > can't do it officially yet. In anycase, that is why DragonFly wasn't > used. So, like, why don't you work on that, instead of annoying us with your long lectures about how "The World Shall Be Ordered According To Me" ? Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 12:04:34 2006 Return-Path: X-Original-To: freebsd-arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 541DE16A403; Fri, 1 Dec 2006 12:04:34 +0000 (UTC) (envelope-from ivoras@fer.hr) Received: from lara.cc.fer.hr (lara.cc.fer.hr [161.53.72.113]) by mx1.FreeBSD.org (Postfix) with ESMTP id E8B0C43CAB; Fri, 1 Dec 2006 12:04:19 +0000 (GMT) (envelope-from ivoras@fer.hr) Received: from [127.0.0.1] (localhost.cc.fer.hr [127.0.0.1]) by lara.cc.fer.hr (8.13.8/8.13.8) with ESMTP id kB1C4Pa5016943; Fri, 1 Dec 2006 13:04:27 +0100 (CET) (envelope-from ivoras@fer.hr) Message-ID: <45701A49.5020809@fer.hr> Date: Fri, 01 Dec 2006 13:04:25 +0100 From: Ivan Voras User-Agent: Thunderbird 1.5.0.4 (X11/20060625) MIME-Version: 1.0 To: Robert Watson References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> <20061128142218.P44465@fledge.watson.org> In-Reply-To: <20061128142218.P44465@fledge.watson.org> X-Enigmail-Version: 0.94.0.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Cc: freebsd-arch@FreeBSD.org Subject: Re: What is the PREEMPTION option good for? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 12:04:34 -0000 Robert Watson wrote: > > They're independent twiddles, and can be frobbed separately. If you can > easily measure performance in the different configurations, seeing a > table of permutations and results would be very nice to see what happens > :-). Ok, this is what I found: - ipiwakeup doesn't produce differences as calculated by ministat - turning off preemption produces visible differences, which are calculated by ministat to be upto 10%. 
x nopreempt+ipiwakeup
+ preempt+ipiwakeup
+--------------------------------------------------------------------------+
|+ + + + x x xx xx x|
| |___________A__M________| |______MA_______| |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7         99.92        104.19        101.48     101.78429     1.4606717
+   4          90.5         95.78         94.12         93.53     2.2081365
Difference at 95.0% confidence
	-8.25429 +/- 2.4751
	-8.10959% +/- 2.43172%
	(Student's t, pooled s = 1.74576)

Sorry about the small number of samples - these are collected from the system in the same state and product version (the machine was otherwise idle, etc.), but the difference is always present - I've run simpler benchmarks every few days since the discussion started and it's there.

This is on a low-end dual core Xeon (i.e. one socket, two cores, no HT), enough RAM not to swap, requests/second with high concurrency on a web application that does a lot of IPC to database & cache engines through both TCP/localhost and unix sockets.
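For readers who want to see where the "Difference at 95.0% confidence" figures come from, here is a small sketch of the pooled two-sample Student's t interval, fed with the summary numbers printed above. It is not ministat's source. The struct and function names are invented, sketch_sqrt() is a plain Newton iteration used only so the example needs no math library, and the critical value 2.262 (two-sided 95%, 9 degrees of freedom) is supplied by the caller rather than computed.

```c
/* Summary statistics for one data set, as printed in a ministat table. */
struct sample {
    int n;
    double avg;
    double stddev;
};

/* Newton's method square root; adequate for a sketch with x >= 0. */
static double
sketch_sqrt(double x)
{
    double r = x > 1.0 ? x : 1.0;

    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    return (r);
}

/* Pooled standard deviation of two samples with n1 + n2 - 2 degrees of freedom. */
static double
pooled_stddev(struct sample a, struct sample b)
{
    double ss = (a.n - 1) * a.stddev * a.stddev +
        (b.n - 1) * b.stddev * b.stddev;

    return (sketch_sqrt(ss / (a.n + b.n - 2)));
}

/* Half-width of the confidence interval on the difference of the means. */
static double
ci_halfwidth(struct sample a, struct sample b, double tcrit)
{
    return (tcrit * pooled_stddev(a, b) * sketch_sqrt(1.0 / a.n + 1.0 / b.n));
}
```

With x = {7, 101.78429, 1.4606717} and + = {4, 93.53, 2.2081365}, the pooled deviation comes out at about 1.7458 and the half-width at about 2.475, so the difference of -8.25429 is well outside the noise even with only eleven samples in total.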
From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 18:39:31 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B565216A407; Fri, 1 Dec 2006 18:39:31 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id 9989343CBF; Fri, 1 Dec 2006 18:39:12 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kB1IdNR4067824; Fri, 1 Dec 2006 10:39:23 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kB1IdFZO067817; Fri, 1 Dec 2006 10:39:15 -0800 (PST) Date: Fri, 1 Dec 2006 10:39:15 -0800 (PST) From: Matthew Dillon Message-Id: <200612011839.kB1IdFZO067817@apollo.backplane.com> To: "Poul-Henning Kamp" References: <3931.1164968488@critter.freebsd.dk> Cc: Robert Watson , Ivan Voras , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 18:39:31 -0000 :So, like, why don't you work on that, instead of annoying us with your :long lectures about how "The World Shall Be Ordered According To Me" ? : :Poul-Henning Because, fortunately, the lives of DragonFly developers aren't governed by big-dick contests from people who clearly have no clue what DragonFly is all about. I said from the outset that we wouldn't be taking shortcuts to 'compete' with other distributions, and we haven't. The BGL will be turned off in the networking threads when we are good and ready to do it. 
#1 on my priority list is achieving the major goals of the project and keeping the system stable while we do it. MP is only part of those goals. A big part, but still only a part. Frankly, I think we have done a better job on the stability front than you guys have. My personal agenda for the January release is to finish the userland kernel support - basically features which allow a kernel to be built as a userland application and to control independent VM spaces (its 'user' processes) with the help of the real kernel. Userland kernels are linked against libc and the kernel APIs have to be simple enough and bullet proof enough for it to work as a non-root userland application. The intent is to be able to do so with as little supporting overhead from the real kernel as possible. This is going to be a very cool feature, similar in scope to usermode linux, and it is also a necessary prereq to reduce engineering cycle times for development of the big ticket items next year. My goals for all of next year are to make good progress on the two biggest ticket items in the DragonFly goal list -- SYSLINK and CCMS. Yes, we've finally reached the point where it is possible to actually start work on these items, and I'm very excited about it! SYSLINK is the core protocol that will be used to communicate between hosts in a cluster, and CCMS is the core MESI+ cache coherency algorithm that will guarantee full coherency between hosts in a cluster. A staggering amount of infrastructure work on pre-existing subsystems, such as the buffer and namecaches, VOP API, etc, was required to get ready for the SYSLINK and CCMS work. Neither will work very well if we have to push into the VFS every time we want to do an I/O or stat() a file, after all.
There is still a ton more work to do, including and most especially giving higher kernel layers direct access to the buffer cache so as to be able to bypass the VFS in the cache case (this is also why the namecache topology was rewritten, and is already the case for lookup operations). In any case, I am very happy to say that we have made extremely good progress on all fronts. Of all the original subsystems inherited from FreeBSD we are basically down to just the VM system, namecache, vnode(1), and packet filter APIs with regards to the MP work, and work is progressing on LWP style (user) threading. Dozens of major subsystems were rewritten this year, I'll have to go through the commit logs to get a complete list. Things like the buffer cache, VM page cache, kernel user process scheduler, file descriptor handling, and so on and so forth. (note 1): The vnode ABI is a different ball of wax. Both MP and CCMS have to be tightly integrated because CCMS will be used to control coherent access on a byte-range basis for all I/O, the intent is to allow the execution of multiple modifying operations on the same vnode to occur in parallel. Achieving this will require a complete rewrite of the vnode's current lockmgr-centric API. That's what I'm doing. I am also happy to say that we have some wonderful developers doing a great deal of work on DragonFly, such as the PkgSrc integration, WiFi and other network infrastructure and drivers, networking, web site, documentation, bug reporting systems, and other big pieces that are needed to make a real platform out of DragonFly. I provide enabling infrastructure to support the other work and I work on the major project goals. I couldn't do it without all the help I'm getting.
-Matt Matthew Dillon From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 19:03:05 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9C48D16A403 for ; Fri, 1 Dec 2006 19:03:05 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outQ.internet-mail-service.net (outQ.internet-mail-service.net [216.240.47.240]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8A61D43CC0 for ; Fri, 1 Dec 2006 19:02:42 +0000 (GMT) (envelope-from julian@elischer.org) Received: from shell.idiom.com (HELO idiom.com) (216.240.47.20) by out.internet-mail-service.net (qpsmtpd/0.32) with ESMTP; Fri, 01 Dec 2006 10:38:57 -0800 Received: from [192.168.2.4] (home.elischer.org [216.240.48.38]) by idiom.com (8.12.11/8.12.11) with ESMTP id kB1IqJkn091755; Fri, 1 Dec 2006 10:52:20 -0800 (PST) (envelope-from julian@elischer.org) Message-ID: <457079E3.2050505@elischer.org> Date: Fri, 01 Dec 2006 10:52:19 -0800 From: Julian Elischer User-Agent: Thunderbird 1.5.0.8 (Macintosh/20061025) MIME-Version: 1.0 To: Poul-Henning Kamp References: <3931.1164968488@critter.freebsd.dk> In-Reply-To: <3931.1164968488@critter.freebsd.dk> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Robert Watson , Ivan Voras , freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 19:03:05 -0000 Poul-Henning Kamp wrote: > In message <200612011009.kB1A9VA8064231@apollo.backplane.com>, Matthew Dillon w > rites: >> : >> : http://www.ece.rice.edu/~willmann/pubs/paranet_tr06-872.pdf >> : >> :Robert N M Watson >> >> Oh, that paper. 
You know, I talked to Alan about that paper a while >> back, but it isn't really possible to compare DragonFly side by side >> with FreeBSD yet in an SMP environment because we still have a lot >> of BGL junk in the network path, and because our interrupts are >> still going to cpu #0. The code itself is mostly MP safe, and Jeff >> has actually turned off the BGL in some of his own testing, but I >> can't do it officially yet. In anycase, that is why DragonFly wasn't >> used. > > So, like, why don't you work on that, instead of annoying us with your > long lectures about how "The World Shall Be Ordered According To Me" ? Matt, Ignore Poul-Henning's email on this.. He certainly speaks for himself but his use of "Us" or "We" doesn't incluse everyone.... some of us ARE interested to hear intelligent comments. Julian > > Poul-Henning > From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 19:18:03 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7F9A216A40F for ; Fri, 1 Dec 2006 19:18:03 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (vc4-2-0-87.dsl.netrack.net [199.45.160.85]) by mx1.FreeBSD.org (Postfix) with ESMTP id BB84F43CAB for ; Fri, 1 Dec 2006 19:17:46 +0000 (GMT) (envelope-from imp@bsdimp.com) Received: from localhost (localhost [127.0.0.1]) by harmony.bsdimp.com (8.13.4/8.13.4) with ESMTP id kB1JGrMs059967; Fri, 1 Dec 2006 12:16:54 -0700 (MST) (envelope-from imp@bsdimp.com) Date: Fri, 01 Dec 2006 12:17:43 -0700 (MST) Message-Id: <20061201.121743.232930081.imp@bsdimp.com> To: Matthew Dillon , "Poul-Henning Kamp" From: "M. 
Warner Losh" In-Reply-To: <457079E3.2050505@elischer.org> References: <3931.1164968488@critter.freebsd.dk> <457079E3.2050505@elischer.org> X-Mailer: Mew version 4.2 on Emacs 21.3 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0 (harmony.bsdimp.com [127.0.0.1]); Fri, 01 Dec 2006 12:16:56 -0700 (MST) Cc: freebsd-arch@freebsd.org Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 19:18:03 -0000 Gentlemen: Please, if you don't have something nice to say, please STFU. At least in public. I know that insults are hard to ignore, and that wrongs must be righted, but not here. Not now. Not anymore. You've had multiple back and forths to get it out of your system. If it isn't out, please channel the energy elsewhere. If it is out, thank you for not polluting this list further. If the post is off topic, ignore it. If you are offended by it, take it up with the person making the post, not the whole list. If you are talking about off-topic stuff, please consider a different forum. In short, please return to civility and make liberal use of the 'delete' button (or key sequence) rather than the 'Reply All' button (or key sequence). If you can't resist, please confine yourself to the 'Reply to sender' functionality only. 
Warner From owner-freebsd-arch@FreeBSD.ORG Fri Dec 1 22:34:04 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B511116A403 for ; Fri, 1 Dec 2006 22:34:04 +0000 (UTC) (envelope-from gpalmer@freebsd.org) Received: from noop.in-addr.com (noop.in-addr.com [208.58.23.51]) by mx1.FreeBSD.org (Postfix) with ESMTP id 884D843CA2 for ; Fri, 1 Dec 2006 22:33:34 +0000 (GMT) (envelope-from gpalmer@freebsd.org) Received: from gjp by noop.in-addr.com with local (Exim 4.54 (FreeBSD)) id 1GqGx3-000ERs-Vb for freebsd-arch@freebsd.org; Fri, 01 Dec 2006 17:33:49 -0500 Date: Fri, 1 Dec 2006 17:33:49 -0500 From: Gary Palmer To: freebsd-arch@freebsd.org Message-ID: <20061201223349.GB53372@in-addr.com> Mail-Followup-To: freebsd-arch@freebsd.org References: <3931.1164968488@critter.freebsd.dk> <200612011839.kB1IdFZO067817@apollo.backplane.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200612011839.kB1IdFZO067817@apollo.backplane.com> Subject: Re: a proposed callout API X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Dec 2006 22:34:04 -0000 On Fri, Dec 01, 2006 at 10:39:15AM -0800, Matthew Dillon wrote: > :So, like, why don't you work on that, instead of annoying us with your > :long lectures about how "The World Shall Be Ordered According To Me" ? > : > :Poul-Henning > > Because, fortunately, the lives of DragonFly developers aren't governed > by big-dick contests from people who clearly have no clue what DragonFly > is all about. Can we please lose the attitudes and get on with the discussion? This is nothing to do with the original discussion and is no longer constructive. 
I'm not singling Matt out by following up to his e-mail directly. But this is now way off topic and needs to be put back ON topic. Thank you From owner-freebsd-arch@FreeBSD.ORG Sat Dec 2 00:07:37 2006 Return-Path: X-Original-To: freebsd-arch@FreeBSD.org Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 41BD616A407; Sat, 2 Dec 2006 00:07:37 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id B058643CA3; Sat, 2 Dec 2006 00:07:19 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout2.pacific.net.au (Postfix) with ESMTP id 5FBD86E3C9; Sat, 2 Dec 2006 11:07:34 +1100 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (Postfix) with ESMTP id B72C48C02; Sat, 2 Dec 2006 11:07:33 +1100 (EST) Date: Sat, 2 Dec 2006 11:07:27 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Ivan Voras In-Reply-To: <45701A49.5020809@fer.hr> Message-ID: <20061202094431.O16375@delplex.bde.org> References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> <20061128142218.P44465@fledge.watson.org> <45701A49.5020809@fer.hr> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Robert Watson , freebsd-arch@FreeBSD.org Subject: Re: What is the PREEMPTION option good for? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Dec 2006 00:07:37 -0000 On Fri, 1 Dec 2006, Ivan Voras wrote: > Robert Watson wrote: > >> They're independent twiddles, and can be frobbed separately. 
If you can >> easily measure performance in the different configurations, seeing a >> table of permutations and results would be very nice to see what happens >> :-). > > Ok, this is what I found: > > - ipiwakeup doesn't produce differences as calculated by ministat > - turning off preemption produces visible differences, which are > calculated by ministat to be upto 10%.

10% is surprisingly high. I found another setup where PREEMPTION (should) help -- nfs servers. For building kernels, PREEMPTION on the client is just a tiny pessimization, but network latency is a problem for nfs and not having PREEMPTION configured makes it worse. PREEMPTION is needed even to give correct scheduling of interrupt threads, and that seems to be all that it gives, at least in the !KSE case, though the main comment about it says otherwise. From kern_switch.c:

% int
% maybe_preempt(struct thread *td)
% {
%     ...
%      * [... conditions for preempting]
%      * - If the new thread's priority is not a realtime priority and
%                                          ^^^^^^^^^^^^^^^^^^^^^^^
%      *   the current thread's priority is not an idle priority and
%      *   FULL_PREEMPTION is disabled.
%     ...
% #ifndef FULL_PREEMPTION
%     if (pri > PRI_MAX_ITHD && cpri < PRI_MIN_IDLE)
%         ^^^^^^^^^^^^^^^^^^
%         return (0);
% #endif

The condition in the code is very far from being a realtime priority. "Realtime priority" is a technical term meaning "a user thread whose scheduling class is PRI_REALTIME" and there is a classification macro PRI_IS_REALTIME() for such priorities. Of course, "realtime priority" in the comment doesn't mean that -- it means something more informal, which I would expect to include all kernel threads and all realtime priority user threads. But the condition in the code is just "not an interrupt thread". I don't understand maybe_preempt_in_ksegrp() and have KSE unconfigured. FULL_PREEMPTION is apparently needed to get kernel threads preempted by anything other than interrupt threads.
It is not the default, apparently because it pessimizes more cases than PREEMPTION.

Anyway, with kernels already optimized by about 30% for nfs (mainly in the client), my ~5.2 UP kernel (with working preemption to interrupt threads, unlike 5.2) used as the server beats a -current UP kernel (without PREEMPTION) by about 3% in real time and 30% in dead time for building kernels with a -current SMP kernel (without PREEMPTION) as the client. The difference is entirely due to dead time somewhere in nfs. Unfortunately, turning on PREEMPTION and IPI_PREEMPTION didn't recover all the lost performance. This is despite the ~current kernel having slightly lower latency for flood pings and similar optimizations for nfs that reduce the RPC count by a factor of 4 and the ping latency by a factor of 2.

In previously clipped context, Robert Watson wrote:

> There's a known performance regression with PREEMPTION and loopback network
> traffic on UP or UP-like systems due to a poor series of context switches
> occurring in the network stack. If your benchmark involves the above web load
> over the loopback, that could be the source of what you're seeing. If it's
> not loopback traffic, then that's not the source of the problem.

I see only a slight additional loss of performance since ~5.2 for loopback. Approximate latencies for flood pings:

Celeron 366: RELENG_3: 14uS; RELENG_4: 19uS; current-2006/04/16: 48uS
AthlonXP 2223: RELENG_4: 2uS; 4-5uS ... ... -current 5-6uS

> You might try fiddling with kern.sched.ipiwakeup.enabled and see what the
> effect is, btw -- this controls whether or not the scheduler wakes up another
> idle CPU to run a thread when waking up that thread, rather than queuing it to
> run which may occur on the other CPU at the next clock tick.

kern.sched.ipiwakeup.enabled seems to be enabled by default. Does it work without IPI_PREEMPTION? Is the rescheduling of even interrupt threads really delayed until the next clock tick?
I guess it is -- scheduling delays are normally good for efficiency. I use HZ = 100 which might delay scheduling more than the default, but I think you mean scheduling clock ticks and stathz is normally only 128 Hz. Scheduling also occurs on other (non-fast) interrupts. Maybe the fast interrupt handlers in some network drivers work better mainly because they do more forceful scheduling (of the task queue thread) than now happens for normal interrupt handlers.

Bruce

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 2 04:54:27 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5DFA816A407; Sat, 2 Dec 2006 04:54:27 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.FreeBSD.org (Postfix) with ESMTP id EC7FB43CA2; Sat, 2 Dec 2006 04:54:08 +0000 (GMT) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.13.7/8.13.4) with ESMTP id kB24sOLt071259; Fri, 1 Dec 2006 20:54:24 -0800 (PST) Received: (from dillon@localhost) by apollo.backplane.com (8.13.7/8.13.4/Submit) id kB24sIpq071255; Fri, 1 Dec 2006 20:54:18 -0800 (PST) Date: Fri, 1 Dec 2006 20:54:18 -0800 (PST) From: Matthew Dillon Message-Id: <200612020454.kB24sIpq071255@apollo.backplane.com> To: Bruce Evans References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> <20061128142218.P44465@fledge.watson.org> <45701A49.5020809@fer.hr> <20061202094431.O16375@delplex.bde.org> Cc: Robert Watson , Ivan Voras , freebsd-arch@freebsd.org Subject: Re: What is the PREEMPTION option good for?
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Dec 2006 04:54:27 -0000

:...
:the client. The difference is entirely due to dead time somewhere in
:nfs. Unfortunately, turning on PREEMPTION and IPI_PREEMPTION didn't
:recover all the lost performance. This is despite the ~current kernel
:having slightly lower latency for flood pings and similar optimizations
:for nfs that reduce the RPC count by a factor of 4 and the ping latency
:by a factor of 2.

The single biggest NFS client performance issue I have encountered in an environment where most of the data can be cached from earlier runs is with negative name lookups. Due to the large number of -I options used in builds, the include search path is fairly long and this usually results in a large number of negative lookups, all of which introduce synchronous dead times while the stat() or open() waits for the over-the-wire transaction to complete.

The #1 solution is to cache negative namecache hits for NFS clients. You don't have to cache them for long... just 3 seconds is usually enough to remove most of the dead time. Also make sure your access cache timeout is something reasonable.

It is possible to reduce the number of over-the-wire transactions to zero but it requires seriously nerfing the access and negative cache timeouts. It isn't usually worth doing.

Here are some test results:

make buildkernel, /usr/src mounted via NFS, 10 second access cache timeout, multiple runs to pre-cache data and tcpdump used to verify that only access RPCs were being sent over the wire for all tests.
(on DragonFly):

No negative cache            - 440 seconds real
3 second neg cache timeout   - 411 seconds real
10 second neg cache timeout  - 410 seconds real (6% improvement)
30 second neg cache timeout  - 409 seconds real

-Matt
Matthew Dillon

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 2 05:46:30 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5649816A412; Sat, 2 Dec 2006 05:46:30 +0000 (UTC) (envelope-from frank@exit.com) Received: from tinker.exit.com (tinker.exit.com [206.223.0.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id 02E1943CA2; Sat, 2 Dec 2006 05:46:10 +0000 (GMT) (envelope-from frank@exit.com) Received: from jill.exit.com (jill.exit.com [206.223.0.4]) by tinker.exit.com (8.13.8/8.13.8) with ESMTP id kB25k6JO043058; Fri, 1 Dec 2006 21:46:06 -0800 (PST) (envelope-from frank@exit.com) Received: from jill.exit.com (localhost [127.0.0.1]) by jill.exit.com (8.13.6/8.13.4) with ESMTP id kB25k6u1060183; Fri, 1 Dec 2006 21:46:06 -0800 (PST) (envelope-from frank@exit.com) Received: (from frank@localhost) by jill.exit.com (8.13.6/8.13.6/Submit) id kB25k5D5060182; Fri, 1 Dec 2006 21:46:05 -0800 (PST) (envelope-from frank@exit.com) X-Authentication-Warning: jill.exit.com: frank set sender to frank@exit.com using -f From: Frank Mayhar To: Matthew Dillon In-Reply-To: <200612020454.kB24sIpq071255@apollo.backplane.com> References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> <20061128142218.P44465@fledge.watson.org> <45701A49.5020809@fer.hr> <20061202094431.O16375@delplex.bde.org> <200612020454.kB24sIpq071255@apollo.backplane.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Exit Consulting Date: Fri, 01 Dec 2006 21:46:04 -0800 Message-Id: <1165038364.8249.8.camel@jill.exit.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.2.1 FreeBSD GNOME Team Port X-Virus-Scanned:
ClamAV 0.88.4/2269/Fri Dec 1 10:17:05 2006 on tinker.exit.com X-Virus-Status: Clean Cc: Robert Watson , Ivan Voras , freebsd-arch@freebsd.org Subject: Re: What is the PREEMPTION option good for? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: frank@exit.com List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Dec 2006 05:46:30 -0000 On Fri, 2006-12-01 at 20:54 -0800, Matthew Dillon wrote: > The single biggest NFS client performance issue I have encountered > in an environment where most of the data can be cached from earlier > runs is with negative name lookups. Due the large number of -I > options used in builds, the include search path is fairly long and > this usually results in a large number of negative lookups, all of > which introduce synchronous dead times while the stat() or open() > waits for the over-the-wire transaction to complete. > > The #1 solution is to cache negative namecache hits for NFS clients. > You don't have to cache them for long... just 3 seconds is usually > enough to remove most of the dead time. Also make sure your access > cache timeout is something reasonable. Back in my early Locus days, 1994 or 95, we ran into the same phenomenon while benchmarking our distributed file system. The SVR4.2 name cache didn't cache negative lookups and that killed us on certain performance benchmarks, even locally (since we nearly doubled the code path) much less when the file system actually lived on a remote node. 
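The short-lived negative caching discussed above can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions: the names neg_cache, neg_cache_add() and neg_cache_hit(), the fixed-size table and the hash are all hypothetical and are not taken from any NFS client or name cache implementation. The caller passes the current time explicitly so the expiry logic is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

#define NEG_SLOTS    64   /* hypothetical table size */
#define NEG_NAME_MAX 64   /* hypothetical longest name we bother to cache */

struct neg_entry {
    char   name[NEG_NAME_MAX];
    time_t expires;           /* 0 means the slot is empty */
};

struct neg_cache {
    struct neg_entry slots[NEG_SLOTS];
    time_t ttl;               /* e.g. 3 seconds, as in the numbers above */
};

static unsigned
neg_hash(const char *name)
{
    unsigned h = 5381;

    while (*name != '\0')
        h = h * 33 + (unsigned char)*name++;
    return (h % NEG_SLOTS);
}

void
neg_cache_init(struct neg_cache *c, time_t ttl)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
    c->ttl = ttl;
}

/* Remember that looking up 'name' failed at time 'now'. */
void
neg_cache_add(struct neg_cache *c, const char *name, time_t now)
{
    struct neg_entry *e = &c->slots[neg_hash(name)];

    if (strlen(name) >= NEG_NAME_MAX)
        return;               /* too long: simply do not cache it */
    strcpy(e->name, name);
    e->expires = now + c->ttl;
}

/* True if 'name' is known not to exist and the entry is still fresh. */
bool
neg_cache_hit(const struct neg_cache *c, const char *name, time_t now)
{
    const struct neg_entry *e = &c->slots[neg_hash(name)];

    return (e->expires != 0 && now < e->expires &&
        strcmp(e->name, name) == 0);
}
```

A lookup path would consult neg_cache_hit() before going over the wire and call neg_cache_add() after the remote side reports that the name does not exist. A colliding name just overwrites the slot, which costs an extra remote lookup but never a wrong answer.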
-- Frank Mayhar frank@exit.com http://www.exit.com/ Exit Consulting http://www.gpsclock.com/ http://www.exit.com/blog/frank/ From owner-freebsd-arch@FreeBSD.ORG Sat Dec 2 10:25:51 2006 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EB55616A47C; Sat, 2 Dec 2006 10:25:50 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226]) by mx1.FreeBSD.org (Postfix) with ESMTP id 8667443CB1; Sat, 2 Dec 2006 10:25:28 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout2.pacific.net.au (Postfix) with ESMTP id 5228E6E32C; Sat, 2 Dec 2006 21:25:18 +1100 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (Postfix) with ESMTP id 451B48C02; Sat, 2 Dec 2006 21:25:17 +1100 (EST) Date: Sat, 2 Dec 2006 21:25:11 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Matthew Dillon In-Reply-To: <200612020454.kB24sIpq071255@apollo.backplane.com> Message-ID: <20061202174801.Y17746@delplex.bde.org> References: <20061119041421.I16763@delplex.bde.org> <20061126174041.V83346@fledge.watson.org> <20061128142218.P44465@fledge.watson.org> <45701A49.5020809@fer.hr> <20061202094431.O16375@delplex.bde.org> <200612020454.kB24sIpq071255@apollo.backplane.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Robert Watson , Ivan Voras , freebsd-arch@freebsd.org Subject: Re: What is the PREEMPTION option good for? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Dec 2006 10:25:51 -0000 On Fri, 1 Dec 2006, Matthew Dillon wrote: > :... 
> :the client. The difference is entirely due to dead time somewhere in > :nfs. Unfortunately, turning on PREEMPTION and IPI_PREEMPTION didn't > :recover all the lost performance. This is despite the ~current kernel > :having slightly lower latency for flood pings and similar optimizations > :for nfs that reduce the RPC count by a factor of 4 and the ping latency > :by a factor of 2. > > The single biggest NFS client performance issue I have encountered > in an environment where most of the data can be cached from earlier > runs is with negative name lookups. That is one of my previous optimizations. I obtained it from NetBSD, not from you, sorry :-). It is not quite ready to commit since I haven't figured the correct cache timeouts to use with it. I think the timeouts for negative cache hits need to be much smaller than for positive ones, especially for directories, since stale positive hits tend to cause RPCs which refresh the cache while stale negative hits tend to prevent RPCs until the cache times out. > Due the large number of -I > options used in builds, the include search path is fairly long and > this usually results in a large number of negative lookups, all of > which introduce synchronous dead times while the stat() or open() > waits for the over-the-wire transaction to complete. > > The #1 solution is to cache negative namecache hits for NFS clients. > You don't have to cache them for long... just 3 seconds is usually > enough to remove most of the dead time. Also make sure your access > cache timeout is something reasonable. > > It is possible to reduce the number of over-the-wire transactions to > zero but it requires seriously nerfing the access and negative cache Negative cache hits were only my 3rd or 4th largest optimization. Avoiding some foot-shooting gave #1 and #2 or 3. My normal, fairly safe configuration requires 36000 RPCs for building a RELENG_4 kernel (down from 120000 unoptimized). 
Turning off close-to-open consistency reduces this to 14000 but is in the serious nerfing class so I don't normally use it. Reducing network latency by turning off interrupt moderation and/or not using NICs that have it, and compiling with -j4 even on UP systems also helped, but since they use more CPU they are not as free as reducing RPCs. The dead time with all of these except turning off close-to-open consistency is about 2.5% for "make -j4" of a RELENG_4 kernel with warm caches under 2-way SMP. The dead time for "make -j4 depend" is much larger since "depend" is not parallelized so the latency for the RPCs can't be hidden. OTOH, for parallelized things, -jN works well for hiding the latency so the main cost of the extra RPCs is just the CPU time to do them.

Bruce

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 2 16:24:37 2006 Return-Path: X-Original-To: arch@freebsd.org Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8D07816A416 for ; Sat, 2 Dec 2006 16:24:37 +0000 (UTC) (envelope-from alexander@leidinger.net) Received: from redbull.bpaserver.net (redbullneu.bpaserver.net [213.198.78.217]) by mx1.FreeBSD.org (Postfix) with ESMTP id 7413B43CAF for ; Sat, 2 Dec 2006 16:24:15 +0000 (GMT) (envelope-from alexander@leidinger.net) Received: from outgoing.leidinger.net (p54A5DDF4.dip.t-dialin.net [84.165.221.244]) by redbull.bpaserver.net (Postfix) with ESMTP id D5CA52E04C for ; Sat, 2 Dec 2006 17:24:30 +0100 (CET) Received: from Magellan.Leidinger.net (Magellan.Leidinger.net [192.168.1.1]) by outgoing.leidinger.net (Postfix) with ESMTP id A868B5B4C6C for ; Sat, 2 Dec 2006 17:24:07 +0100 (CET) Date: Sat, 2 Dec 2006 17:25:00 +0100 From: Alexander Leidinger To: arch@freebsd.org Message-ID: <20061202172500.62b03d9c@Magellan.Leidinger.net> X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.10.6; i386-portbld-freebsd7.0) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit X-BPAnet-MailScanner-Information: Please contact the ISP for more information X-BPAnet-MailScanner: Found to be clean X-BPAnet-MailScanner-SpamCheck: not spam, SpamAssassin (not cached, score=-14.864, required 6, autolearn=not spam, BAYES_00 -15.00, DK_POLICY_SIGNSOME 0.00, FORGED_RCVD_HELO 0.14) X-BPAnet-MailScanner-From: alexander@leidinger.net X-Spam-Status: No Cc: Subject: Changing copyinstr(9) invariants X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Dec 2006 16:24:37 -0000

Hi,

the copyinstr(9) documentation does not guarantee that in the ENAMETOOLONG case the result is a completely filled up memory region which contains a truncated string without termination. The powerpc version works like this. And while I don't understand i386 ASM, it looks to me the i386 version does the same.

My questions:
 - Does the amd64/sparc64/ia64/... version do the same?
 - Are there reasons not to document this fact and then make use of it?

My motivation is in the linux_prctl() function in src/sys/compat/linux/linux_misc.c:

---snip---
	case LINUX_PR_SET_NAME:
		max_size = MIN(sizeof(comm), sizeof(p->p_comm));
		error = copyinstr((void *)(register_t) args->arg2, comm,
		    max_size, NULL);

		/* Linux silently truncates the name if it is too long. */
		if (error == ENAMETOOLONG) {
			/*
			 * XXX: copyinstr() isn't documented to populate the
			 * array completely, so do a copyin() to be on the
			 * safe side. This should be changed in case
			 * copyinstr() is changed to guarantee this.
			 */
			error = copyin((void *)(register_t)args->arg2, comm,
			    max_size - 1);
			comm[max_size - 1] = '\0';
		}
		if (error)
			return (error);
---snip---

By documenting this behavior I could get rid of the copyin() and just terminate the string.

Bye,
Alexander.

-- 'I'll tell you this!' shouted Rincewind.
'I'd rather trust me than history! Oh, shit, did I just say that?' (Interesting Times) http://www.Leidinger.net Alexander @ Leidinger.net: PGP ID = B0063FE7 http://www.FreeBSD.org netchild @ FreeBSD.org : PGP ID = 72077137
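The simplification proposed in the copyinstr(9) message above can be sketched in userland C as follows. This is a model under a loud assumption: model_copyinstr() is a hypothetical stand-in that fills the whole buffer and leaves it unterminated when it returns ENAMETOOLONG, which is exactly the behaviour the message asks about and which copyinstr(9) does not document. set_name_truncating() is likewise a made-up name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for copyinstr(9). ASSUMPTION: on ENAMETOOLONG
 * all 'len' bytes of 'dst' have been filled and there is no terminator.
 */
int
model_copyinstr(const char *src, char *dst, size_t len, size_t *done)
{
    size_t i;

    for (i = 0; i < len; i++) {
        dst[i] = src[i];
        if (src[i] == '\0') {
            if (done != NULL)
                *done = i + 1;  /* count includes the terminator */
            return (0);
        }
    }
    if (done != NULL)
        *done = len;
    return (ENAMETOOLONG);
}

/*
 * Copy a name and silently truncate it if it does not fit. Under the
 * assumption above no second copy is needed: terminating the already
 * filled buffer is enough.
 */
int
set_name_truncating(const char *src, char *comm, size_t max_size)
{
    int error;

    assert(max_size > 0);
    error = model_copyinstr(src, comm, max_size, NULL);
    if (error == ENAMETOOLONG) {
        comm[max_size - 1] = '\0';
        error = 0;
    }
    return (error);
}
```

If an implementation stopped copying early in the ENAMETOOLONG case, the tail of the buffer would hold stale bytes, which is why the existing code falls back to copyin().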