From owner-freebsd-arch@FreeBSD.ORG Sun Jan 17 07:40:08 2010
Date: Sun, 17 Jan 2010 08:40:01 +0100
From: Ulrich Spörlein
To: Peter Jeremy
Message-ID: <20100117074001.GJ96430@acme.spoerlein.net>
In-Reply-To: <20100115085856.GA2556@server.vk2pj.dyndns.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, "Robert N. M. Watson", freebsd-arch@freebsd.org
Subject: Re: INCLUDE_CONFIG_FILE in GENERIC

On Fri, 15.01.2010 at 19:58:57 +1100, Peter Jeremy wrote:
> On 2010-Jan-14 20:12:24 +0000, "Robert N. M. Watson" wrote:
> >- Desktop/server users who want their system to work without any
> >  special tuning or magic, and likely feel the comments they put in
> >  configuration files are important
>
> As far as I'm concerned, the most critical bit of my kernel config file
> is the $Header...$ comment - which lets me extract the remainder of the
> file from my CVS repository. I don't currently use includes (because
> most of my config files have roots pre-dating the include directive).
>
> I find it a PITA that INCLUDE_CONFIG_FILE _doesn't_ include comments
> (or at least my $Header$ line) by default.

Seriously, is that the only "comment" people care about? I really have a
hard time coming up with *important* stuff that people put in config's
comments and then somehow lose the connection between comment and
running kernel.

> IMO, it would be useful to have an "include this literal string in the
> kernel" config directive. This would allow config file version control
> information to be embedded without needing the comments. And that would
> resolve the issue of embedding fully expanded details of all included
> files without the hassle of keeping the comments around.

Ok, this I can understand. We could then call this directive something
... um like ident perhaps? :)

Seems like all that people want to do is simply:

cpu    i386
ident  SERVER
descr  "$Id: foo,v"

That shouldn't be too hard?

FWIW I think it is more important to have a way to recreate the current
running kernel than to get a verbatim/expanded copy of all config files
used to create it in the first place.

Just my two cents,
Uli

From owner-freebsd-arch@FreeBSD.ORG Mon Jan 18 10:39:07 2010
From: Dag-Erling Smørgrav
To: "M. Warner Losh"
Date: Mon, 18 Jan 2010 11:39:06 +0100
In-Reply-To: <20100114.135930.80200584442733547.imp@bsdimp.com> (M. Warner Losh's message of "Thu, 14 Jan 2010 13:59:30 -0700 (MST)")
Message-ID: <86k4vf66k5.fsf@ds4.des.no>
Cc: dougb@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, rwatson@freebsd.org, freebsd-arch@freebsd.org, svn-src-head@freebsd.org
Subject: Re: INCLUDE_CONFIG_FILE in GENERIC

"M. Warner Losh" writes:
> [...] I think this
> is the model that best fits most user's needs, since they EITHER take
> GENERIC and hack on it (in which case we preserve all that), OR they
> include GENERIC and opt in/out of things based on that default.

The latter is a far better option - I use the former and have been
bitten several times by new mandatory devices and options such as device
io, scheduler selection, etc.

However, the latter option is not very practical. Picking one of my
machines at random, its kernel config has 65 device / option lines,
while GENERIC has 219, and they only have 51 lines in common. Instead
of 65 device / option lines, I would need 168 nodevice / nooption lines
plus 14 device / option lines, for a total of 182 lines.

> Heck, we could save the whole src/sys tree as a tarball in a separate
> non-loadable ELF section if people that that was useful.

Ouch. OTOH, the sys tree is actually smaller than a full GENERIC
build... Is there any way we could limit this to those parts of the
tree that were actually used to build the kernel?
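
The line counts in the message above can be checked with a few lines of arithmetic. This is only an illustrative sketch in Python; the function name include_generic_cost is made up here, and the three counts are the ones quoted for that one machine.

```python
# Line counts quoted in the message above (one machine, picked at random).
GENERIC_LINES = 219   # device / option lines in GENERIC
CUSTOM_LINES = 65     # device / option lines in the custom config
COMMON_LINES = 51     # lines that appear in both files


def include_generic_cost(generic, custom, common):
    """Count the lines needed when a config includes GENERIC and opts out.

    Returns (removals, additions, total), where removals are the
    nodevice / nooption lines and additions are the device / option
    lines that GENERIC does not already provide.
    """
    removals = generic - common
    additions = custom - common
    return removals, additions, removals + additions


removals, additions, total = include_generic_cost(
    GENERIC_LINES, CUSTOM_LINES, COMMON_LINES)
print("nodevice/nooption lines:", removals)
print("device/option lines:", additions)
print("total lines:", total, "versus", CUSTOM_LINES, "in the standalone config")
```

The opt-out style only pays off when a custom config shares most of its lines with GENERIC; with 51 of 219 lines in common it is nearly three times longer.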

DES
--
Dag-Erling Smørgrav - des@des.no

From owner-freebsd-arch@FreeBSD.ORG Mon Jan 18 11:06:52 2010
Date: Mon, 18 Jan 2010 11:06:52 GMT
Message-Id: <201001181106.o0IB6qVe047469@freefall.freebsd.org>
From: FreeBSD bugmaster
To: freebsd-arch@FreeBSD.org
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 17:07:06 2010
From: John Baldwin
To: Attilio Rao
Date: Tue, 19 Jan 2010 11:44:23 -0500
In-Reply-To: <3bbf2fe11001160409w1dfdbb9j36458c52d596c92a@mail.gmail.com>
Message-Id: <201001191144.23299.jhb@freebsd.org>
Cc: FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC

On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
> 2010/1/16 Bruce Evans:
> > On Fri, 15 Jan 2010, Attilio Rao wrote:
> >
> >> I still see clock_lock in place (and no particular critical section
> >> code in that paths) or you meant to say that the clock_lock doesn't
> >> still provide enough protection alone?
> >> BTW, you were right about the lapic_timer_hz (I forgot to revert to
> >> hz). There is an updated patch:
> >>
> >> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing4.diff
> >
> > It seems to have the same fundamental bugs as the previous version.
> > The atrtc interrupt is too slow to use for anything, so it should never
> > be used if there is something better like the lapic timer available
> > (even the i8254 is better), and using it here doesn't even fix the
> > problem (malicious applications can very easily hide from statclock
> > by default since the default hz is much larger than the default stathz,
> > and malicious applications can not so easily hide from statclock irrespective
> > of the misconfiguration of hz, since statclock is not random). See my
> > previous reply and ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z for
> > more details.
>
> Well, the primary things I wanted to fix is not the hiding of
> malicious programs but the clock aliasing created when handling all
> the clocks by the same source.
> About the slowness -- I'm fine with whatever additional source to
> LAPIC we would eventually use thus would you feel better if i8254 is
> used replacing atrtc?
> Also note that atrtc is the default if LAPIC cannot be used. I don't
> understand why another source, even simpler (eg. i8254) would have
> been used in that specific case by the 'old' code.
>
> What I mean, then is: I see your points, I'm not arguing that at all,
> but the old code has other problems that gets fixed with this patch
> (having different sources make the whole system more flexible) while
> the new things it does introduce are secondarilly (but still: I'm fine
> with whatever second source is picked up for statclock, profclock) if
> you really see a concern wrt atrtc slowness.

You can't use the i8254 reliably with APIC enabled. Some motherboards don't
actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in
the atpic and enabling the ExtINT pin to use both sets of PICs in tandem.
However, this was very gross and had its own set of issues, so we removed the
support for "mixed mode" a while ago. Also, the ACPI specification
specifically forbids an OS from using "mixed mode".

My feeling, btw, is that the real solution is to not use a sampling clock for
per-process stats, but to just use the cycle counter and keep separate user,
system, and interrupt cycle counts (like the rux_runtime we have now). This
makes calcru() trivial and eliminates many of the weird "going backwards",
etc. problems. The only issue with this approach is that not all platforms
have a cheap cycle counter (many embedded platforms lack one I think), so you
would almost need to support both modes of operation and maybe have an #define
in to choose between the two modes.

Even in that mode you still need a sampling clock I think for cp_time[] and
cp_times[], but individual threads can no longer "hide" as we would be keeping
precise timing stats.
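
To illustrate the difference between sampled and exact accounting described above, here is a toy model in Python. It is a sketch under stated assumptions, not FreeBSD code: time is in arbitrary integer units, the statistics clock fires at a fixed period with no jitter, and the helper names (run_intervals, sampled_ticks, exact_runtime) are hypothetical. The thread starts just after each tick and gives up the CPU before the next one, so the sampler never catches it, while summing a counter read at switch-in and switch-out charges it in full.

```python
def run_intervals(period, start_offset, end_offset, periods):
    """Return [start, end) on-CPU intervals for a thread that hides from ticks.

    In every period the thread starts start_offset units after the tick
    and leaves the CPU end_offset units after it (assumed < period).
    """
    return [(k * period + start_offset, k * period + end_offset)
            for k in range(periods)]


def sampled_ticks(intervals, period, periods):
    """Sampling-clock model: charge one tick if the thread is on-CPU at the tick."""
    hits = 0
    for k in range(periods):
        tick = k * period
        if any(start <= tick < end for start, end in intervals):
            hits += 1
    return hits


def exact_runtime(intervals):
    """Cycle-counter model: sum (switch-out reading - switch-in reading)."""
    return sum(end - start for start, end in intervals)


PERIOD = 1000    # units between statistics clock ticks (assumed value)
PERIODS = 100    # number of tick periods simulated

hiding = run_intervals(PERIOD, start_offset=1, end_offset=900, periods=PERIODS)
ticks_charged = sampled_ticks(hiding, PERIOD, PERIODS)
units_charged = exact_runtime(hiding)

print("sampled accounting charged", ticks_charged, "of", PERIODS, "ticks")
print("exact accounting charged", units_charged, "of", PERIOD * PERIODS, "units")
```

The model leaves out everything that makes the real problem hard (per-CPU counters that drift, the cost of reading the counter on every switch, and the user/system/interrupt split), so it only shows why a fixed-phase sampler can be evaded.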

--
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 17:27:45 2010
In-Reply-To: <201001191144.23299.jhb@freebsd.org>
Date: Tue, 19 Jan 2010 18:27:43 +0100
Message-ID: <3bbf2fe11001190927m10f73775p7b68eb4d3ce0470a@mail.gmail.com>
From: Attilio Rao
To: John Baldwin
Cc: FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC

2010/1/19 John Baldwin:
> On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
>> 2010/1/16 Bruce Evans:
>> > On Fri, 15 Jan 2010, Attilio Rao wrote:
>> >
>> >> I still see clock_lock in place (and no particular critical section
>> >> code in that paths) or you meant to say that the clock_lock doesn't
>> >> still provide enough protection alone?
>> >> BTW, you were right about the lapic_timer_hz (I forgot to revert to
>> >> hz). There is an updated patch:
>> >>
>> >> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing4.diff
>> >
>> > It seems to have the same fundamental bugs as the previous version.
>> > The atrtc interrupt is too slow to use for anything, so it should never
>> > be used if there is something better like the lapic timer available
>> > (even the i8254 is better), and using it here doesn't even fix the
>> > problem (malicious applications can very easily hide from statclock
>> > by default since the default hz is much larger than the default stathz,
>> > and malicious applications can not so easily hide from statclock irrespective
>> > of the misconfiguration of hz, since statclock is not random). See my
>> > previous reply and ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z for
>> > more details.
>>
>> Well, the primary things I wanted to fix is not the hiding of
>> malicious programs but the clock aliasing created when handling all
>> the clocks by the same source.
>> About the slowness -- I'm fine with whatever additional source to
>> LAPIC we would eventually use thus would you feel better if i8254 is
>> used replacing atrtc?
>> Also note that atrtc is the default if LAPIC cannot be used. I don't
>> understand why another source, even simpler (eg. i8254) would have
>> been used in that specific case by the 'old' code.
>>
>> What I mean, then is: I see your points, I'm not arguing that at all,
>> but the old code has other problems that gets fixed with this patch
>> (having different sources make the whole system more flexible) while
>> the new things it does introduce are secondarilly (but still: I'm fine
>> with whatever second source is picked up for statclock, profclock) if
>> you really see a concern wrt atrtc slowness.
>
> You can't use the i8254 reliable with APIC enabled. Some motherboards don't
> actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in
> the atpic and enabling the ExtINT pin to use both sets of PICs in tandem.
> However, this was very gross and had its own set of issues, so we removed the
> support for "mixed mode" a while ago. Also, the ACPI specification
> specifically forbids an OS from using "mixed mode".
>
> My feeling, btw, is that the real solution is to not use a sampling clock for
> per-process stats, but to just use the cycle counter and keep separate user,
> system, and interrupt cycle counts (like the rux_runtime we have now). This
> makes calcru() trivial and eliminates many of the weird "going backwards",
> etc. problems. The only issue with this approach is that not all platforms
> have a cheap cycle counter (many embedded platforms lack one I think), so you
> would almost need to support both modes of operation and maybe have an #define
> in to choose between the two modes.

Generally that would be a good idea, but the problem is not only for
the architectures not supporting it, but also for architectures that
do (eg. TSC de-synchronization in some SMP environment).

Attilio

> Even in that mode you still need a sampling clock I think for cp_time[] and
> cp_times[], but individual threads can no longer "hide" as we would be keeping
> precise timing stats.

Yes, cp_times do require a sampling clock, I guess.

Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 17:33:56 2010
From: Scott Long
In-Reply-To: <3bbf2fe11001190927m10f73775p7b68eb4d3ce0470a@mail.gmail.com>
Date: Tue, 19 Jan 2010 10:33:50 -0700
Message-Id: <274B568B-81D9-4554-8C3A-888FF0CD7B08@samsco.org>
To: Attilio Rao
Cc: FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC

On Jan 19, 2010, at 10:27 AM, Attilio Rao wrote:

> 2010/1/19 John Baldwin:
>> On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
>>> 2010/1/16 Bruce Evans:
>>>> On Fri, 15 Jan 2010, Attilio Rao wrote:
>>>>
>>>>> I still see clock_lock in place (and no particular critical section
>>>>> code in that paths) or you meant to say that the clock_lock doesn't
>>>>> still provide enough protection alone?
>>>>> BTW, you were right about the lapic_timer_hz (I forgot to revert to
>>>>> hz). There is an updated patch:
>>>>>
>>>>> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing4.diff
>>>>
>>>> It seems to have the same fundamental bugs as the previous version.
>>>> The atrtc interrupt is too slow to use for anything, so it should never
>>>> be used if there is something better like the lapic timer available
>>>> (even the i8254 is better), and using it here doesn't even fix the
>>>> problem (malicious applications can very easily hide from statclock
>>>> by default since the default hz is much larger than the default stathz,
>>>> and malicious applications can not so easily hide from statclock irrespective
>>>> of the misconfiguration of hz, since statclock is not random). See my
>>>> previous reply and ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z for
>>>> more details.
>>>
>>> Well, the primary things I wanted to fix is not the hiding of
>>> malicious programs but the clock aliasing created when handling all
>>> the clocks by the same source.
>>> About the slowness -- I'm fine with whatever additional source to
>>> LAPIC we would eventually use thus would you feel better if i8254 is
>>> used replacing atrtc?
>>> Also note that atrtc is the default if LAPIC cannot be used. I don't
>>> understand why another source, even simpler (eg. i8254) would have
>>> been used in that specific case by the 'old' code.
>>>
>>> What I mean, then is: I see your points, I'm not arguing that at all,
>>> but the old code has other problems that gets fixed with this patch
>>> (having different sources make the whole system more flexible) while
>>> the new things it does introduce are secondarilly (but still: I'm fine
>>> with whatever second source is picked up for statclock, profclock) if
>>> you really see a concern wrt atrtc slowness.
>>
>> You can't use the i8254 reliable with APIC enabled. Some motherboards don't
>> actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in
>> the atpic and enabling the ExtINT pin to use both sets of PICs in tandem.
>> However, this was very gross and had its own set of issues, so we removed the
>> support for "mixed mode" a while ago. Also, the ACPI specification
>> specifically forbids an OS from using "mixed mode".
>>
>> My feeling, btw, is that the real solution is to not use a sampling clock for
>> per-process stats, but to just use the cycle counter and keep separate user,
>> system, and interrupt cycle counts (like the rux_runtime we have now). This
>> makes calcru() trivial and eliminates many of the weird "going backwards",
>> etc. problems. The only issue with this approach is that not all platforms
>> have a cheap cycle counter (many embedded platforms lack one I think), so you
>> would almost need to support both modes of operation and maybe have an #define
>> in to choose between the two modes.
>
> Generally that would be a good idea, but the problem is not only for
> the architectures not supporting it, but also for architectures that
> do (eg. TSC de-synchronization in some SMP environment).
>

For process stats, TSC desync isn't a big problem.
As a process migrates from one CPU to the other, its stats from the old
cpu will be recorded, then stats will be started on the new cpu. The
only problem here is with normalizing the different TSC's to a common
reference. Maybe that can be done when computing cp_times? This is
definitely a case where 'perfect' is the enemy of 'a hell of a lot
better than we have now'.

Scott

From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 17:41:25 2010
In-Reply-To: <274B568B-81D9-4554-8C3A-888FF0CD7B08@samsco.org>
Date: Tue, 19 Jan 2010 18:41:23 +0100
Message-ID: <3bbf2fe11001190941s37f62c48tb91be0061b658b2c@mail.gmail.com>
From: Attilio Rao
To: Scott Long
Cc: FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC

2010/1/19 Scott Long:
> On Jan 19, 2010, at 10:27 AM, Attilio Rao wrote:
>>
>> 2010/1/19 John Baldwin:
>>>
>>> On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
>>>>
>>>> 2010/1/16 Bruce Evans:
>>>>>
>>>>> On Fri, 15 Jan 2010, Attilio Rao wrote:
>>>>>
>>>>>> I still see clock_lock in place (and no particular critical section
>>>>>> code in that paths) or you meant to say that the clock_lock doesn't
>>>>>> still provide enough protection alone?
>>>>>> BTW, you were right about the lapic_timer_hz (I forgot to revert to
>>>>>> hz). There is an updated patch:
>>>>>>
>>>>>> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing4.diff
>>>>>
>>>>> It seems to have the same fundamental bugs as the previous version.
>>>>> The atrtc interrupt is too slow to use for anything, so it should never
>>>>> be used if there is something better like the lapic timer available
>>>>> (even the i8254 is better), and using it here doesn't even fix the
>>>>> problem (malicious applications can very easily hide from statclock
>>>>> by default since the default hz is much larger than the default stathz,
>>>>> and malicious applications can not so easily hide from statclock irrespective
>>>>> of the misconfiguration of hz, since statclock is not random). See my
>>>>> previous reply and ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z for
>>>>> more details.
>>>>
>>>> Well, the primary things I wanted to fix is not the hiding of
>>>> malicious programs but the clock aliasing created when handling all
>>>> the clocks by the same source.
>>>> About the slowness -- I'm fine with whatever additional source to
>>>> LAPIC we would eventually use thus would you feel better if i8254 is
>>>> used replacing atrtc?
>>>> Also note that atrtc is the default if LAPIC cannot be used. I don't
>>>> understand why another source, even simpler (eg. i8254) would have
>>>> been used in that specific case by the 'old' code.
>>>>
>>>> What I mean, then is: I see your points, I'm not arguing that at all,
>>>> but the old code has other problems that gets fixed with this patch
>>>> (having different sources make the whole system more flexible) while
>>>> the new things it does introduce are secondarilly (but still: I'm fine
>>>> with whatever second source is picked up for statclock, profclock) if
>>>> you really see a concern wrt atrtc slowness.
>>>
>>> You can't use the i8254 reliable with APIC enabled. Some motherboards don't
>>> actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in
>>> the atpic and enabling the ExtINT pin to use both sets of PICs in tandem.
>>> However, this was very gross and had its own set of issues, so we removed the
>>> support for "mixed mode" a while ago. Also, the ACPI specification
>>> specifically forbids an OS from using "mixed mode".
>>>
>>> My feeling, btw, is that the real solution is to not use a sampling clock for
>>> per-process stats, but to just use the cycle counter and keep separate user,
>>> system, and interrupt cycle counts (like the rux_runtime we have now). This
>>> makes calcru() trivial and eliminates many of the weird "going backwards",
>>> etc. problems. The only issue with this approach is that not all platforms
>>> have a cheap cycle counter (many embedded platforms lack one I think), so you
>>> would almost need to support both modes of operation and maybe have an #define
>>> in to choose between the two modes.
>>
>> Generally that would be a good idea, but the problem is not only for
>> the architectures not supporting it, but also for architectures that
>> do (eg. TSC de-synchronization in some SMP environment).
>>
>
> For process stats, TSC desync isn't a big problem. As a process migrates
> from one CPU to the other, its stats from the old cpu will be recorded, then
> stats will be started on the new cpu. The only problem here is with
> normalizing the different TSC's to a common reference. Maybe that can be
> done when computing cp_times? This is definitely a case where 'perfect' is
> the enemy of 'a hell of a lot better than we have now'.

I wouldn't like to be mistaken, but IIRC in some benchmarks kris@ did
in the past years we were seeing TSC timers literally going backwards
after the de-synchronization (even on absolute measurement).

Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 18:53:36 2010
Date: Wed, 20 Jan 2010 05:53:32 +1100 (EST)
From: Bruce Evans
To: John Baldwin
In-Reply-To: <201001191144.23299.jhb@freebsd.org>
Message-ID: <20100120042822.L4223@besplex.bde.org>
Cc: Attilio Rao, FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC

On Tue, 19 Jan 2010, John Baldwin wrote:

> On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote:
>>
>> Well, the primary things I wanted to fix is not the hiding of
>> malicious programs but the clock aliasing created when handling all
>> the clocks by the same source.
I probably misdiagnosed the aliasing in a previous reply -- after the one being replied to here -- please reply to the latest version --: the problem for malicious programs seems to be sort of the opposite of the one fixed by using a separate hardware clock for the statclock. It seems to be the near-aliasing of the separate statclock that gets short-lived timeout processes accounted for at all (but not enough if there are many such processes). A non-separate statclock won't see these processes excessively like I first thought, even when the statclock() call immediately follows the hardclock() call, since hardclock() doesn't start any new processes; thus a statclock() at the same time as a hardclock() is the same as a statclock() 1/hz- epsilon after the previous hardclock() arranged to start a few timeouts -- usually these timeouts will have finished. A separate statclock() is little better at seeing short-lived timeout processes, since it has to sweep nearly uniformly over the entire interval between hardclock() interrupts, so it cannot spend long nearly in sync. However, to fix the problem with malicious programs, except for short-lived (short-active) ones started by a timeout which hopefully don't matter because they are short-lived, statclock() just needs to sweep not so uniformly over the entire interval, and this doesn't need a separate statclock() -- interrupting at points randomly distributed at distances of a large fraction of 1/hz should do. This depends on other system activity not being in sync with hardclock(). >> What I mean, then is: I see your points, I'm not arguing that at all, >> but the old code has other problems that gets fixed with this patch >> (having different sources make the whole system more flexible) while >> the new things it does introduce are secondarilly (but still: I'm fine >> with whatever second source is picked up for statclock, profclock) if >> you really see a concern wrt atrtc slowness. 
>
> You can't use the i8254 reliable with APIC enabled. Some motherboards don't
> actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in
> the atpic and enabling the ExtINT pin to use both sets of PICs in tandem.
> However, this was very gross and had its own set of issues, so we removed the
> support for "mixed mode" a while ago. Also, the ACPI specification
> specifically forbids an OS from using "mixed mode".

I thought that recent changes reenabled some of this. And what's to stop
some motherboards breaking the RTC too?

> My feeling, btw, is that the real solution is to not use a sampling clock for
> per-process stats, but to just use the cycle counter and keep separate user,
> system, and interrupt cycle counts (like the rux_runtime we have now).

The total runtime info is already available (in rux_runtime). It is
the main thing that we use to see that scheduling is broken :-) -- we
see that the runtime is too large or small relative to %CPU. I think
using this and never using ticks for scheduling would work OK. Schedulers
shouldn't care about the difference between user and sys time. Something
like this is also needed for tickless kernels.

With schedulers still wanting ticks, perhaps the total runtime could be
distributed as fake ticks for schedulers only to see, so that if the tick
count is broken schedulers would still get feedback from the runtime.
And/or processes started by a timeout could be charged a fake tick so
that they can't run for free.

Interrupt cycle counts are mostly already kept too, since most interrupt
handlers are heavyweight and take a full context switch to get to.
However, counting cycles to separate user from sys time would probably
be too inefficient. A minimal syscall now should take about 200 cycles.
rdtsc on Athlon1 takes 12 cycles. rdtsc on Core2 and Phenom takes 40+
cycles. 2 of these would be needed for every syscall. These would be
acceptably efficient only if they ran mostly in parallel.
They are non-serializing, but if they actually ran mostly in parallel
then they might also be off 40+ cycles/call.

> This
> makes calcru() trivial and eliminates many of the weird "going backwards",
> etc. problems. The only issue with this approach is that not all platforms
> have a cheap cycle counter (many embedded platforms lack one I think), so you
> would almost need to support both modes of operation and maybe have an #define
> in to choose between the two modes.

Not the only problem. This also doesn't work for things like vm statistics
gathered in statclock(). You still need statclock() for these, and if you
want the statistics to be reasonably accurate then you need a sufficiently
non-aliased aliased and non-random random statclock().

> Even in that mode you still need a sampling clock I think for cp_time[] and
> cp_times[], but individual threads can no longer "hide" as we would be keeping
> precise timing stats.

Not so much a problem as the vm stats -- most time-related statistics
could be handled by adding up per-thread components, if we had them all.

If we had fine-grained programmability of a single timer, then accounting
for threads started by a timeout would probably be best implemented for
almost perfect correctness and slowness as follows:
- statclock() interrupt a few usec after starting a timeout
- then periodic statclock() interrupts every few tens or hundreds of usec
  a few times
- then back to normal periodic statclock() interrupts, hopefully not so
  often
All statistics including tick counts are a weighted sum depending on the
current stathz (an integral over time, like now for the non-tick count
stats, except with the time deltas varying). This would be slow, but it
seems to be the only way to correctly account for short-lived processes
started by a timeout -- in a limiting case, all system activity would be
run as timeouts and on fast machines finish in a few usec.
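The "weighted sum depending on the current stathz" above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not kernel code: the struct, the enum, and the ws_* names are hypothetical, times are plain microsecond values supplied by the caller, and each sample simply charges the state it observes with the time elapsed since the previous sample. It is still sampling, so it can mischarge an interval; the only point is that samples taken at varying intervals have to be weighted by their time delta rather than counted as equal ticks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical states a sample can observe; not the kernel's CP_* values. */
enum sample_state { ST_USER, ST_SYS, ST_IDLE, ST_NSTATES };

struct weighted_stats {
	uint64_t usec[ST_NSTATES];	/* time charged to each state, in usec */
	uint64_t last_sample;		/* time of the previous sample, in usec */
};

/* Reset all buckets and remember the starting time. */
static void
ws_init(struct weighted_stats *ws, uint64_t now)
{
	for (int i = 0; i < ST_NSTATES; i++)
		ws->usec[i] = 0;
	ws->last_sample = now;
}

/*
 * Record one sample: the state seen now is charged with the time since
 * the previous sample, so a sample taken after a short interval counts
 * for less than one taken after a long interval.
 */
static void
ws_sample(struct weighted_stats *ws, uint64_t now, enum sample_state seen)
{
	assert(now >= ws->last_sample);
	ws->usec[seen] += now - ws->last_sample;
	ws->last_sample = now;
}
```

With a schedule like the one in the list (one sample a few usec after a timeout starts, a few samples tens of usec apart, then back to the slow rate), the short samples add only their own few usec to a bucket, and the sum over all buckets always equals the elapsed time.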
Maintaining the total runtime, which should be enough for scheduling, doesn't need this, but other statistics do. Other system activity probably doesn't need this, because it is probably started by other interrupts that aren't in sync with hardclock() -- only hardclock() combined with time^callout semantics gives a huge bias towards starting processes at particular times. Probably nothing needs this, since we don't really care about other statistics. Probably completely tickless kernels can't support the other statistics. Bruce From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 19:16:43 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C3E2B106566B; Tue, 19 Jan 2010 19:16:43 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 7F13A8FC08; Tue, 19 Jan 2010 19:16:43 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id F2DF146B1A; Tue, 19 Jan 2010 14:16:42 -0500 (EST) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 2062A8A026; Tue, 19 Jan 2010 14:16:42 -0500 (EST) From: John Baldwin To: Attilio Rao Date: Tue, 19 Jan 2010 13:40:31 -0500 User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091231; KDE/4.3.1; amd64; ; ) References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <201001191144.23299.jhb@freebsd.org> <3bbf2fe11001190927m10f73775p7b68eb4d3ce0470a@mail.gmail.com> In-Reply-To: <3bbf2fe11001190927m10f73775p7b68eb4d3ce0470a@mail.gmail.com> MIME-Version: 1.0 Content-Type: Text/Plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <201001191340.31700.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Tue, 19 Jan 2010
14:16:42 -0500 (EST) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: FreeBSD Arch , Ed Maste Subject: Re: [PATCH] Statclock aliasing by LAPIC X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Jan 2010 19:16:43 -0000 On Tuesday 19 January 2010 12:27:43 pm Attilio Rao wrote: > 2010/1/19 John Baldwin : > > On Saturday 16 January 2010 7:09:38 am Attilio Rao wrote: > >> 2010/1/16 Bruce Evans : > >> > On Fri, 15 Jan 2010, Attilio Rao wrote: > >> > > >> >> I still see clock_lock in place (and no particular critical section > >> >> code in that paths) or you meant to say that the clock_lock doesn't > >> >> still provide enough protection alone? > >> >> BTW, you were right about the lapic_timer_hz (I forgot to revert to > >> >> hz). There is an updated patch: > >> >> > >> >> > > http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing4.diff > >> > > >> > It seems to have the same fundamental bugs as the previous version. > >> > The atrtc interrupt is too slow to use for anything, so it should never > >> > be used if there is something better like the lapic timer available > >> > (even the i8254 is better), and using it here doesn't even fix the > >> > problem (malicious applications can very easily hide from statclock > >> > by default since the default hz is much larger than the default stathz, > >> > and malicious applications can not so easily hide from statclock > >> > irrespective > >> > of the misconfiguration of hz, since statclock is not random). See my > >> > previous reply and ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z for > >> > more details. 
> >> > >> Well, the primary things I wanted to fix is not the hiding of > >> malicious programs but the clock aliasing created when handling all > >> the clocks by the same source. > >> About the slowness -- I'm fine with whatever additional source to > >> LAPIC we would eventually use thus would you feel better if i8254 is > >> used replacing atrtc? > >> Also note that atrtc is the default if LAPIC cannot be used. I don't > >> understand why another source, even simpler (eg. i8254) would have > >> been used in that specific case by the 'old' code. > >> > >> What I mean, then is: I see your points, I'm not arguing that at all, > >> but the old code has other problems that gets fixed with this patch > >> (having different sources make the whole system more flexible) while > >> the new things it does introduce are secondarilly (but still: I'm fine > >> with whatever second source is picked up for statclock, profclock) if > >> you really see a concern wrt atrtc slowness. > > > > You can't use the i8254 reliable with APIC enabled. Some motherboards don't > > actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in > > the atpic and enabling the ExtINT pin to use both sets of PICs in tandem. > > However, this was very gross and had its own set of issues, so we removed the > > support for "mixed mode" a while ago. Also, the ACPI specification > > specifically forbids an OS from using "mixed mode". > > > > My feeling, btw, is that the real solution is to not use a sampling clock for > > per-process stats, but to just use the cycle counter and keep separate user, > > system, and interrupt cycle counts (like the rux_runtime we have now). This > > makes calcru() trivial and eliminates many of the weird "going backwards", > > etc. problems. 
The only issue with this approach is that not all platforms > > have a cheap cycle counter (many embedded platforms lack one I think), so you > > would almost need to support both modes of operation and maybe have an #define > > in to choose between the two modes. > > Generally that would be a good idea, but the problem is not only for > the architectures not supporting it, but also for architectures that > do (eg. TSC de-synchronization in some SMP environment). No, that doesn't matter. You are merely accumulating TSC deltas just as we do now for rux_runtime. For that purpose the TSC drift never matters as you are always taking deltas relative to a single CPU. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 19:16:45 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7B47A106566B; Tue, 19 Jan 2010 19:16:45 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 378FA8FC19; Tue, 19 Jan 2010 19:16:45 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id DC76346B2D; Tue, 19 Jan 2010 14:16:44 -0500 (EST) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 2D7328A021; Tue, 19 Jan 2010 14:16:44 -0500 (EST) From: John Baldwin To: Bruce Evans Date: Tue, 19 Jan 2010 14:13:03 -0500 User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091231; KDE/4.3.1; amd64; ; ) References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <201001191144.23299.jhb@freebsd.org> <20100120042822.L4223@besplex.bde.org> In-Reply-To: <20100120042822.L4223@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201001191413.03682.jhb@freebsd.org> 
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Tue, 19 Jan 2010 14:16:44 -0500 (EST) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.6 required=4.2 tests=AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: Attilio Rao , FreeBSD Arch , Ed Maste Subject: Re: [PATCH] Statclock aliasing by LAPIC X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Jan 2010 19:16:45 -0000 On Tuesday 19 January 2010 1:53:32 pm Bruce Evans wrote: > On Tue, 19 Jan 2010, John Baldwin wrote: > >> What I mean, then is: I see your points, I'm not arguing that at all, > >> but the old code has other problems that gets fixed with this patch > >> (having different sources make the whole system more flexible) while > >> the new things it does introduce are secondarilly (but still: I'm fine > >> with whatever second source is picked up for statclock, profclock) if > >> you really see a concern wrt atrtc slowness. > > > > You can't use the i8254 reliable with APIC enabled. Some motherboards don't > > actually hook up IRQ 0 to pin 2. We used to support this by enabling IRQ 0 in > > the atpic and enabling the ExtINT pin to use both sets of PICs in tandem. > > However, this was very gross and had its own set of issues, so we removed the > > support for "mixed mode" a while ago. Also, the ACPI specification > > specifically forbids an OS from using "mixed mode". > > I thought that recent changes reenabled some of this. And what's to stop > some motherboards breaking the RTC too? No, mixed mode is still very much disabled. 
The RTC is different because I believe Windows still uses it (or at least
some older versions have used it with the APIC in the past), so it actually
gets exercised by WHQL testing, as opposed to IRQ 0 which does not get
tested. I think on some recentish Nvidia chipsets the ISA timer was actually
hooked directly to pin 0 on the first I/O APIC. There was no ExtInt pin, and
pin 2 was just dead. We had a separate quirk to deal with that, which I think
is still present (though now effectively unused).

> > My feeling, btw, is that the real solution is to not use a sampling clock for
> > per-process stats, but to just use the cycle counter and keep separate user,
> > system, and interrupt cycle counts (like the rux_runtime we have now).
>
> The total runtime info is already available (in rux_runtime). It is
> the main thing that we use to see that scheduling is broken :-) -- we
> see that the runtime is too large or small relative to %CPU. I think
> using this and never using ticks for scheduling would work OK. Schedulers
> shouldn't care about the difference between user and sys time. Something
> like this is also needed for tickless kernels.

I mostly care about splitting up the timers to attempt to remove the need
for statclock() by making more stats event-driven rather than sampled (to
help with a tickless kernel). I also would like to simplify calcru() and
remove the weird hacks to satisfy monotonicity.
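To make the "separate user, system, and interrupt cycle counts" idea concrete, here is a minimal user-space sketch in C. It is not FreeBSD code and does not reflect how rux_runtime is actually maintained: the struct, the enum, and the acct_* names are hypothetical, and counter values are passed in as plain integers instead of being read with rdtsc. What it shows is that every delta is taken between two readings of the same CPU's counter, so a constant offset between CPUs never enters the totals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accounting modes; the names are illustrative only. */
enum acct_mode { ACCT_USER, ACCT_SYS, ACCT_INTR, ACCT_NMODES };

struct thread_acct {
	uint64_t cycles[ACCT_NMODES];	/* accumulated cycles per mode */
	uint64_t start;			/* counter value when charging began */
	enum acct_mode mode;		/* mode currently being charged */
};

/* Thread is switched in on some CPU: remember that CPU's counter. */
static void
acct_switch_in(struct thread_acct *ta, uint64_t cpu_counter,
    enum acct_mode mode)
{
	ta->start = cpu_counter;
	ta->mode = mode;
}

/* Mode change on the same CPU (e.g. syscall entry or return). */
static void
acct_change_mode(struct thread_acct *ta, uint64_t cpu_counter,
    enum acct_mode new_mode)
{
	assert(cpu_counter >= ta->start);
	ta->cycles[ta->mode] += cpu_counter - ta->start;
	ta->start = cpu_counter;
	ta->mode = new_mode;
}

/* Thread is switched out: charge the final delta on this CPU. */
static void
acct_switch_out(struct thread_acct *ta, uint64_t cpu_counter)
{
	assert(cpu_counter >= ta->start);
	ta->cycles[ta->mode] += cpu_counter - ta->start;
}
```

A thread that is switched out on one CPU and switched in on another starts a fresh delta from the new CPU's counter, so the two counters may differ by any offset. Differences in frequency between CPUs are a separate matter and would still need normalizing.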
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 19:19:54 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B1BE01065670; Tue, 19 Jan 2010 19:19:54 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au [211.29.132.182]) by mx1.freebsd.org (Postfix) with ESMTP id 422C08FC18; Tue, 19 Jan 2010 19:19:53 +0000 (UTC) Received: from c220-239-227-214.carlnfd1.nsw.optusnet.com.au (c220-239-227-214.carlnfd1.nsw.optusnet.com.au [220.239.227.214]) by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o0JJJo1h020005 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 20 Jan 2010 06:19:51 +1100 Date: Wed, 20 Jan 2010 06:19:50 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Attilio Rao In-Reply-To: <3bbf2fe11001190941s37f62c48tb91be0061b658b2c@mail.gmail.com> Message-ID: <20100120055636.U68115@delplex.bde.org> References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <20100116205752.J64514@delplex.bde.org> <3bbf2fe11001160409w1dfdbb9j36458c52d596c92a@mail.gmail.com> <201001191144.23299.jhb@freebsd.org> <3bbf2fe11001190927m10f73775p7b68eb4d3ce0470a@mail.gmail.com> <274B568B-81D9-4554-8C3A-888FF0CD7B08@samsco.org> <3bbf2fe11001190941s37f62c48tb91be0061b658b2c@mail.gmail.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="0-2077489133-1263928790=:68115" Cc: FreeBSD Arch , Scott Long , Ed Maste Subject: Re: [PATCH] Statclock aliasing by LAPIC X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Jan 2010 19:19:54 -0000 This message is in MIME format. 
The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools.

--0-2077489133-1263928790=:68115
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Tue, 19 Jan 2010, Attilio Rao wrote:

> 2010/1/19 Scott Long :
>> On Jan 19, 2010, at 10:27 AM, Attilio Rao wrote:
>>>
>>> 2010/1/19 John Baldwin :
>>>> My feeling, btw, is that the real solution is to not use a sampling
>>>> clock for per-process stats, but to just use the cycle counter and keep
>>>> separate user, system, and interrupt cycle counts (like the rux_runtime
>>>> we have now). This makes calcru() trivial and eliminates many of the
>>>> weird "going backwards", etc. problems. The only issue with this
>>>> approach is that not all platforms have a cheap cycle counter (many
>>>> embedded platforms lack one I think), so you would almost need to
>>>> support both modes of operation and maybe have an #define in to choose
>>>> between the two modes.
>>>
>>> Generally that would be a good idea, but the problem is not only for
>>> the architectures not supporting it, but also for architectures that
>>> do (eg. TSC de-synchronization in some SMP environment).
>>>
>>
>> For process stats, TSC desync isn't a big problem. As a process migrates
>> from one CPU to the other, its stats from the old cpu will be recorded, then
>> stats will be started on the new cpu. The only problem here is with
>> normalizing the different TSC's to a common reference. Maybe that can be
>> done when computing cp_times? This is definitely a case where 'perfect' is
>> the enemy of 'a hell of a lot better than we have now'.
>

Only the frequencies would need normalization, since the TSCs are per-CPU
and they hopefully don't get reset by suspend etc. Separate frequencies
for separate CPUs are not supported now.
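A minimal sketch of what normalizing only the frequencies could look like, in C. Everything here is an assumption made for illustration: the cycles_to_usec() name is hypothetical, per-CPU frequencies are taken as known constants in Hz, and the thread says separate per-CPU frequencies are not supported at present. Each per-CPU cycle delta is converted to microseconds with that CPU's own frequency before the results are summed, so no common counter origin is needed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a cycle delta measured on one CPU to microseconds using that
 * CPU's frequency. Splitting into whole seconds and a remainder avoids
 * overflowing the multiplication for large deltas (assuming freq_hz is
 * far below 2^64 / 10^6, which holds for any realistic clock rate).
 */
static uint64_t
cycles_to_usec(uint64_t cycles, uint64_t freq_hz)
{
	assert(freq_hz != 0);
	uint64_t whole = cycles / freq_hz;	/* full seconds */
	uint64_t rem = cycles % freq_hz;	/* leftover cycles, < freq_hz */
	return whole * 1000000 + rem * 1000000 / freq_hz;
}
```

A thread's runtime would then be the sum of cycles_to_usec(delta, freq) over the intervals it ran on each CPU, each interval converted with the frequency of the CPU it was measured on.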
> I wouldn't like to be mistaken, but IIRC in some benchmarks kris@ did > in the past years we were seeing TSC timers litterally going backwards > after the de-synchronization (even on absolute measurement). Do you really mean individual TSCs going backwards? P-state-invariance (?) should prevent the desync. If the TSCs actually desync, then TSC timecounters are sure to break, with timecounters going backwards being a typical result (certain calculations overflow if time deltas are unexpectedly large). Timecounters used to be used for the equivalent of rux_runtime. There were/are no checks for timecounters themselves going backwards, but sanity checks in the use of rux_runtime detected this. Now TSCs (if available) are normally used for rux_runtime. Recalibration of the TSC's assumed-common frequency is buggy and can easily cause bizarre user times when the frequency is changed. Apart from that, rux_runtime is correct. Good enough for scheduling even when incorrect. Bruce --0-2077489133-1263928790=:68115-- From owner-freebsd-arch@FreeBSD.ORG Tue Jan 19 19:39:45 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 170DE10656A7; Tue, 19 Jan 2010 19:39:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au [211.29.132.182]) by mx1.freebsd.org (Postfix) with ESMTP id 8736C8FC23; Tue, 19 Jan 2010 19:39:44 +0000 (UTC) Received: from c220-239-227-214.carlnfd1.nsw.optusnet.com.au (c220-239-227-214.carlnfd1.nsw.optusnet.com.au [220.239.227.214]) by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id o0JJdfaS009144 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 20 Jan 2010 06:39:42 +1100 Date: Wed, 20 Jan 2010 06:39:41 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: John Baldwin In-Reply-To: <201001191413.03682.jhb@freebsd.org> Message-ID: 
<20100120062222.N68139@delplex.bde.org> References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <201001191144.23299.jhb@freebsd.org> <20100120042822.L4223@besplex.bde.org> <201001191413.03682.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , FreeBSD Arch , Ed Maste Subject: Re: [PATCH] Statclock aliasing by LAPIC X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 19 Jan 2010 19:39:45 -0000 On Tue, 19 Jan 2010, John Baldwin wrote: > On Tuesday 19 January 2010 1:53:32 pm Bruce Evans wrote: >> On Tue, 19 Jan 2010, John Baldwin wrote: >>> My feeling, btw, is that the real solution is to not use a sampling clock for >>> per-process stats, but to just use the cycle counter and keep separate user, >>> system, and interrupt cycle counts (like the rux_runtime we have now). >> >> The total runtime info is already available (in rux_runtime). It is >> the main thing that we use to see that scheduling is broken :-) -- we >> see that the runtime is too large or small relative to %CPU. I think >> using this and never using ticks for scheduling would work OK. Schedulers >> shouldn't care about the difference between user and sys time. Something >> like this is also needed for tickless kernels. > > I mostly care about splitting up the timers to attempt to remove the need > for statclock() by making more stats event-driven rather than sampled (to > help with a tickless kernel). vm stats are inherently sampled at points not related to vm events, but could probably be sampled on other interrupts. > I also would like to simplify calcru() and > remove the weird hacks to satisfy monoticity (sp?). I don't know of any hacks in calcru(). Did someone break it? :-). 
Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Jan 20 01:46:58 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 021151065692 for ; Wed, 20 Jan 2010 01:46:58 +0000 (UTC) (envelope-from jroberson@jroberson.net) Received: from mail-ew0-f226.google.com (mail-ew0-f226.google.com [209.85.219.226]) by mx1.freebsd.org (Postfix) with ESMTP id 9954B8FC12 for ; Wed, 20 Jan 2010 01:46:57 +0000 (UTC) Received: by ewy26 with SMTP id 26so1735021ewy.3 for ; Tue, 19 Jan 2010 17:46:56 -0800 (PST) Received: by 10.213.100.203 with SMTP id z11mr6693912ebn.51.1263950374389; Tue, 19 Jan 2010 17:19:34 -0800 (PST) Received: from ?10.0.1.198? (udp022762uds.hawaiiantel.net [72.234.79.107]) by mx.google.com with ESMTPS id 14sm4662639ewy.7.2010.01.19.17.19.31 (version=SSLv3 cipher=RC4-MD5); Tue, 19 Jan 2010 17:19:33 -0800 (PST) Date: Tue, 19 Jan 2010 15:23:17 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: arch@freebsd.org Message-ID: User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII Cc: Subject: Softdep journaling X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 20 Jan 2010 01:46:58 -0000 Hello, Many of you may have already noticed that I have implemented a journaling layer that co-exists with softdep to eliminate fsck after an unclean shutdown. I have written about this here: http://jeffr-tech.livejournal.com/ And I have a patch against current here: http://people.freebsd.org/~jeff/suj.diff I have been working with McKusick and he has been providing review feedback. Tegge and kib have been reviewing my rename changes. Peter Holm has generously provided his time for testing. 
I am within a week of being able to commit this to CURRENT. I'm raising
this here so people can discuss the project and I can answer any questions
or concerns before it goes in the tree.

Briefly, I have added an intent log to softdep that journals block
allocation and free along with inode link count changes. After an unclean
shutdown a special fsck pass reads this journal and frees blocks and
inodes. The recovery pass is not like traditional block journaling as it
actually evaluates the filesystem state to determine how far along the
operation made it and rolls back intelligently.

The worst case journal recovery time I've seen is a couple of minutes,
however, I'm still generating a few hundred megabytes of text describing
the operation when I run fsck so that I can quickly resolve any bugs.
This worst case performance was generated using pho's stress2 and a
completely full 64MB journal containing nearly 2 million outstanding
records. Recovery time for a crash during buildworld, for example, is on
the order of 10 seconds even while producing the text log. Without the log
I expect the maximum on any drive to be around 2 minutes. Presently
recovery is actually CPU bound and I'm using 3 year old hardware. It
scales up with the size of the journal and down with the speed of the
processor. The size of the filesystem makes little difference.

The filesystem cannot be mounted read/write until the journal is
recovered or a full fsck pass is run. The filesystem will be backwards
compatible with earlier ffs implementations. The journal can be enabled or
disabled with tunefs. The only requirement is sufficient free space for the
journal, which is stored in a regular inode.

The patch I have presented is mostly complete. It only lacks the recovery
operation for partial truncation. I'm still running through various
scenarios to validate the checker, however, the kernel has been very
stable as of late.

Please raise any comments or concerns here.
I'm going to make another call for testers on current@ and want to keep that reserved for bug reports. Thanks, Jeff
2010.01.21.rrz5:55:12 From owner-freebsd-arch@FreeBSD.ORG Fri Jan 22 16:40:44 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 25F3010656A8; Fri, 22 Jan 2010 16:40:44 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-iw0-f198.google.com (mail-iw0-f198.google.com [209.85.223.198]) by mx1.freebsd.org (Postfix) with ESMTP id D28D78FC39; Fri, 22 Jan 2010 16:40:43 +0000 (UTC) Received: by iwn36 with SMTP id 36so1143038iwn.3 for ; Fri, 22 Jan 2010 08:40:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:received:in-reply-to :references:date:x-google-sender-auth:message-id:subject:from:to:cc :content-type; bh=xE53rO8bNzlvKckRN3NxusNBC3lJ/s5kLFnipcevNbU=; b=GESMf0dcBhPJZIMd4M67+/zgaldOUpW4L31TmQ58RFtSFrJAjOpXQgq2JvQs/gTg6a lIaijghCyNJp12VXZhTbJb7aVu2SoOzejq09be93la4nJBOf5oC89C33et+Y/5kuSwlG Dz+4BbkzavcLItEZ2Y4XOwTpAZwOHlkJjo+HY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; b=f5mDbYRoxEtuJ5Y0xzGydMhaDX/GjvV7mq0Xb2/BEJGwJ7obhupbWwOOzrqS2TzIaA grwXdom/fpRpvhtG4c0BTX1n7MwTAiwmX6faj9QlpQThvRiYo1d7OZbfKRBeGBMkl/Cw tWqaUqooym1O8qmWZaWPB6XPpi2P0umDxG96c= MIME-Version: 1.0 Sender: asmrookie@gmail.com Received: by 10.231.146.66 with SMTP id g2mr5113148ibv.60.1264178442036; Fri, 22 Jan 2010 08:40:42 -0800 (PST) In-Reply-To: <20100116125451.GA99364@stack.nl> References: <3bbf2fe11001091709t228c48cft1c17af686e9e9c46@mail.gmail.com> <20100116125451.GA99364@stack.nl> Date: Fri, 22 Jan 2010 17:40:41 +0100 X-Google-Sender-Auth: e23b8cb18b9fae37 Message-ID: <3bbf2fe11001220840o42a59d9cu476fe24063cd55b8@mail.gmail.com> From: Attilio Rao To: Jilles Tjoelker Content-Type: text/plain; charset=UTF-8 Cc: Giovanni Trematerra , 
freebsd-arch@freebsd.org Subject: Re: [PATCH] kthread_{suspend, resume, suspend_check} locking bugs X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 22 Jan 2010 16:40:44 -0000 2010/1/16 Jilles Tjoelker : > On Sun, Jan 10, 2010 at 02:09:43AM +0100, Attilio Rao wrote: >> I think that routines kthread_suspend(), kthread_resume() and >> kthread_suspend_check() are not adequately SMP protected. >> That is because, in particular, the critical path doesn't protect, >> together, TDF_KTH_SUSP and sleeping activity. The right pattern should >> be to use the thread lock spinlock as an interlock and use msleep. >> Such bugs have not been revealed probably because there has been a >> lack of testing of such primitives and there are not, currently, >> consumers within our stock kernel. > >> Additionally, kthread_suspend_check() seems to require to always pass >> curthread, which is silly (as we don't have to conform to any >> particular KPI), thus I think it is appropriate for the prototype to >> change. >> The following patch should fix the issue: >> http://www.freebsd.org/~attilio/kthread.diff > > The analysis and patch look sensible. After digging into this further with jhb, I made a new patch: http://www.freebsd.org/~attilio/kthread_races.diff The previous one has a problem: under ULE the thread lock changes, and thus panics, if the awakened thread is scheduled on a different runqueue from the one it went to sleep on. From this point of view it would be good to panic if a td_lock lock is passed to msleep_spin(), but that is not an easy condition to check (I thought about asserting that the lock's name is not that of a thread container lock... sigh...). There is also another problem in the original code.
The wait channels used by the suspender and the suspended thread are different, while they should be the same (and kproc_*() seems to agree with me). That could be fixed as well. Gianni, could you test this patch with the modules you made? Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Sat Jan 23 00:33:30 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B14891065679; Sat, 23 Jan 2010 00:33:30 +0000 (UTC) (envelope-from giovanni.trematerra@gmail.com) Received: from smtp.net.vodafone.it (smtp.net.vodafone.it [83.224.65.24]) by mx1.freebsd.org (Postfix) with ESMTP id 3194D8FC08; Sat, 23 Jan 2010 00:33:29 +0000 (UTC) Received: from matrix64.localdomain ([109.115.25.212]) by smtp.net.vodafone.it with ESMTP id o0N05DsA024036; Sat, 23 Jan 2010 01:05:15 +0100 Received: from matrix64.localdomain (localhost [127.0.0.1]) by matrix64.localdomain (8.14.3/8.14.3) with ESMTP id o0N05EwZ001504; Sat, 23 Jan 2010 01:05:15 +0100 (CET) (envelope-from giovanni.trematerra@gmail.com) Received: (from gianni@localhost) by matrix64.localdomain (8.14.3/8.14.3/Submit) id o0N059a6001503; Sat, 23 Jan 2010 01:05:09 +0100 (CET) (envelope-from giovanni.trematerra@gmail.com) X-Authentication-Warning: matrix64.localdomain: gianni set sender to giovanni.trematerra@gmail.com using -f Date: Sat, 23 Jan 2010 01:05:08 +0100 From: Giovanni Trematerra To: Attilio Rao Message-ID: <20100123000508.GA1428@matrix64.localdomain> References: <3bbf2fe11001091709t228c48cft1c17af686e9e9c46@mail.gmail.com> <20100116125451.GA99364@stack.nl> <3bbf2fe11001220840o42a59d9cu476fe24063cd55b8@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3bbf2fe11001220840o42a59d9cu476fe24063cd55b8@mail.gmail.com> User-Agent: Mutt/1.4.2.3i Cc: Jilles Tjoelker , freebsd-arch@freebsd.org Subject: Re:
[PATCH] kthread_{suspend, resume, suspend_check} locking bugs X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 23 Jan 2010 00:33:30 -0000 On Fri, Jan 22, 2010 at 05:40:41PM +0100, Attilio Rao wrote: > 2010/1/16 Jilles Tjoelker : > > On Sun, Jan 10, 2010 at 02:09:43AM +0100, Attilio Rao wrote: > >> I think that routines kthread_suspend(), kthread_resume() and > >> kthread_suspend_check() are not adequately SMP protected. > >> That is because, in particular, the critical path doesn't protect, > >> together, TDF_KTH_SUSP and sleeping activity. The right pattern should > >> be to use the thread lock spinlock as an interlock and use msleep. > >> Such bugs have not been revealed probably because there has been a > >> lack of testing of such primitives and there are not, currently, > >> consumers within our stock kernel. > > > >> Additionally, kthread_suspend_check() seems to require to always pass > >> curthread, which is silly (as we don't have to conform to any > >> particular KPI), thus I think it is appropriate for the prototype to > >> change. > >> The following patch should fix the issue: > >> http://www.freebsd.org/~attilio/kthread.diff > > > > The analysis and patch look sensible. > > After digging into this further with jhb, I made a new patch: > http://www.freebsd.org/~attilio/kthread_races.diff In kthread_suspend_check you could have written panic("%s: curthread is not a valid kthread", __func__); > > Gianni, could you test this patch with the modules you made? > With your latest patch, no deadlock occurs anymore. It seems OK.
Hope this helps. In case someone would like to review the kernel test module, here is the code:

/*-
 * Copyright (c) 2010
 *	Giovanni Trematerra
 *
 * Permission to use, copy, modify, and distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 *
 * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
 * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
 * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 */

/*
 * The header names were lost in the list archive.  The nine below are a
 * reconstruction of what this code needs, not necessarily the original list.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/kthread.h>

#ifdef TESTPAUSE_DEBUG
#define DPRINTF(x)	do { printf x; } while (0)
#else
#define DPRINTF(x)
#endif

static struct mtx test_global_lock;
static int global_condvar;	/* wait channel for module unload */
volatile int QUIT;		/* set to 1 to ask all threads to exit */
int test_thrcnt = 3;		/* number of test threads still running */

/* Repeatedly suspend the target thread; panic on timeout (deadlock). */
static void
thr_suspender(void *arg)
{
	struct thread *td = (struct thread *) arg;
	int error;

	for (;;) {
		if (QUIT == 1)
			break;
		error = kthread_suspend(td, 10*hz);
		if (error != 0) {
			if (error == EWOULDBLOCK)
				panic("Ooops: kthread deadlock\n");
			else
				panic("kthread_suspend error: %d\n", error);
			break;
		}
	}

	mtx_lock(&test_global_lock);
	test_thrcnt--;
	wakeup(&global_condvar);
	mtx_unlock(&test_global_lock);

	kthread_exit();
}

/* Repeatedly resume the target thread. */
static void
thr_resumer(void *arg)
{
	struct thread *td = (struct thread *) arg;
	int error;

	for (;;) {
		if (QUIT == 1)
			break;
		error = kthread_resume(td);
		if (error != 0)
			panic("%s: error on kthread_resume. error: %d\n",
			    __func__, error);
		if (QUIT == 1)
			break;
	}

	mtx_lock(&test_global_lock);
	test_thrcnt--;
	wakeup(&global_condvar);
	mtx_unlock(&test_global_lock);

	kthread_exit();
}

/* The target thread: keeps offering itself for suspension. */
static void
thr_getsuspended(void *arg)
{
	for (;;) {
		if (QUIT == 1)
			break;
		kthread_suspend_check(curthread);
	}

	mtx_lock(&test_global_lock);
	test_thrcnt--;
	wakeup(&global_condvar);
	mtx_unlock(&test_global_lock);

	kthread_exit();
}

static void
kthrdlk_init(void)
{
	struct proc *testproc;
	struct thread *newthr;
	int error;

	QUIT = 0;
	mtx_init(&test_global_lock, "thrdlk_lock", NULL, MTX_DEF);
	testproc = NULL;
	error = kproc_kthread_add(thr_getsuspended, NULL, &testproc, &newthr,
	    0, 0, "kthrdlk", "thr_getsuspended");
	if (error != 0)
		uprintf("cannot start thr_getsuspended error: %d\n", error);

	error = kproc_kthread_add(thr_resumer, newthr, &testproc, NULL,
	    0, 0, "testproc", "thr_resumer");
	if (error != 0)
		uprintf("cannot start thr_resumer error: %d\n", error);

	error = kproc_kthread_add(thr_suspender, newthr, &testproc, NULL,
	    0, 0, "testproc", "thr_suspender");
	if (error != 0)
		uprintf("cannot start thr_suspender error: %d\n", error);
}

static void
kthrdlk_done(void)
{
	int ret;

	/* wait for the kernel threads to end */
	mtx_lock(&test_global_lock);
	QUIT = 1;
	while (test_thrcnt != 0) {
		ret = mtx_sleep(&global_condvar, &test_global_lock, 0,
		    "waiting thrs end", 30 * hz);
		if (ret == EWOULDBLOCK) {
			uprintf("thrpause not die! remaining: %d", test_thrcnt);
			break;
		}
	}
	if (test_thrcnt == 0)
		DPRINTF(("All threads stopped \n"));

	mtx_destroy(&test_global_lock);
}

static int
kthrdlk_handler(module_t mod, int /*modeventtype_t*/ what, void *arg)
{
	switch (what) {
	case MOD_LOAD:
		kthrdlk_init();
		uprintf("kthrdlk loaded!\n");
		return (0);
	case MOD_UNLOAD:
		kthrdlk_done();
		uprintf("Bye Bye! kthrdlk unloaded!\n");
		return (0);
	}

	return (EOPNOTSUPP);
}

static moduledata_t mod_data = {
	"kthrdlk",
	kthrdlk_handler,
	0
};

MODULE_VERSION(kthrdlk, 1);
DECLARE_MODULE(kthrdlk, mod_data, SI_SUB_EXEC, SI_ORDER_ANY);

-- Gianni
From owner-freebsd-arch@FreeBSD.ORG Sat Jan 23 02:06:01 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 80D581065694 for ; Sat, 23 Jan 2010 02:06:01 +0000 (UTC) (envelope-from rpaulo@gmail.com) Received: from mail-fx0-f227.google.com (mail-fx0-f227.google.com [209.85.220.227]) by mx1.freebsd.org (Postfix) with ESMTP id 0C7A68FC19 for ; Sat, 23 Jan 2010 02:06:00 +0000 (UTC) Received: by fxm27 with SMTP id 27so90927fxm.3 for ; Fri, 22 Jan 2010 18:06:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:subject:mime-version :content-type:from:in-reply-to:date:cc:content-transfer-encoding :message-id:references:to:x-mailer; bh=9qepPiOdXjlkx9AXiWR8Rqa39+UWtWaLHifJ5UJbaiM=; b=o8uFXk9GJjqr3RqC7bt9OcWcv1wUquPUvKiDMmA10JIKG0KRxPoBwTcDfJjKQ7Eqd8 0Uw8xcpdA/T6hPbDCL92LIAa+6IUTFCwqx+abCiepLasJumMbgCNCw/2YbXZiGl/tABF P7IjpUHKUFN+kmZeXfe13Lo2uPHfspX64OEw0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:subject:mime-version:content-type:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to:x-mailer; b=u05s/wWamiHG8eI1xR4VdFIOJnI5mFILSTbWjhUwuc42NS9BSzT2RfAjZ/CCo2EP2v HAVt5ARnFYGGQbsKj0OpG2BzUsUztv/UeFnd5ZKGQaHpvdGbHsVRSG2Z++TfHIYNGNnA gbHyRApwy7Ky6gVTiyEX5wihkLM9SfIgZC0lI= Received: by 10.87.11.25 with SMTP id o25mr6018136fgi.23.1264210720249; Fri, 22 Jan 2010 17:38:40 -0800 (PST) Received: from ?10.0.10.4?
(54.81.54.77.rev.vodafone.pt [77.54.81.54]) by mx.google.com with ESMTPS id 13sm1619500fxm.9.2010.01.22.17.38.39 (version=TLSv1/SSLv3 cipher=RC4-MD5); Fri, 22 Jan 2010 17:38:39 -0800 (PST) Sender: Rui Paulo Mime-Version: 1.0 (Apple Message framework v1077) Content-Type: text/plain; charset=us-ascii From: Rui Paulo In-Reply-To: <20100123000508.GA1428@matrix64.localdomain> Date: Sat, 23 Jan 2010 01:38:37 +0000 Content-Transfer-Encoding: 7bit Message-Id: <98A68030-9AD6-4F80-8CCE-29FF29355675@freebsd.org> References: <3bbf2fe11001091709t228c48cft1c17af686e9e9c46@mail.gmail.com> <20100116125451.GA99364@stack.nl> <3bbf2fe11001220840o42a59d9cu476fe24063cd55b8@mail.gmail.com> <20100123000508.GA1428@matrix64.localdomain> To: Giovanni Trematerra X-Mailer: Apple Mail (2.1077) Cc: Attilio Rao , Jilles Tjoelker , freebsd-arch@freebsd.org Subject: Re: [PATCH] kthread_{suspend, resume, suspend_check} locking bugs X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 23 Jan 2010 02:06:01 -0000 On 23 Jan 2010, at 00:05, Giovanni Trematerra wrote: > On Fri, Jan 22, 2010 at 05:40:41PM +0100, Attilio Rao wrote: >> 2010/1/16 Jilles Tjoelker : >>> On Sun, Jan 10, 2010 at 02:09:43AM +0100, Attilio Rao wrote: >>>> I think that routines kthread_suspend(), kthread_resume() and >>>> kthread_suspend_check() are not adequately SMP protected. >>>> That is because, in particular, the critical path doesn't protect, >>>> together, TDF_KTH_SUSP and sleeping activity. The right pattern should >>>> be to use the thread lock spinlock as an interlock and use msleep. >>>> Such bugs have not been revealed probably because there has been a >>>> lack of testing of such primitives and there are not, currently, >>>> consumers within our stock kernel.
>>> >>>> Additionally, kthread_suspend_check() seems to require to always pass >>>> curthread, which is silly (as we don't have to conform to any >>>> particular KPI), thus I think it is appropriate for the prototype to >>>> change. >>>> The following patch should fix the issue: >>>> http://www.freebsd.org/~attilio/kthread.diff >>> >>> The analysis and patch look sensible. >> >> After digging into this further with jhb, I made a new patch: >> http://www.freebsd.org/~attilio/kthread_races.diff > > In kthread_suspend_check you could have written > panic("%s: curthread is not a valid kthread", __func__); > >> >> Gianni, could you test this patch with the modules you made? >> > > With your latest patch, no deadlock occurs anymore. > It seems OK. > > Hope this helps. > > In case someone would like to review the kernel test module, > here is the code: This is a great candidate for a regression add-on. -- Rui Paulo
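The locking pattern discussed in this thread (test the suspend flag and go to sleep under one interlock, with suspender and suspended thread sharing a single wait channel) can be sketched in userland with POSIX threads. This is only an analogue under stated assumptions, not the code in kthread_races.diff: struct susp, susp_check(), susp_suspend(), susp_resume() and susp_demo() are hypothetical names, the `requested` flag stands in for TDF_KTH_SUSP, and a mutex plus condition variable stand in for the kernel lock and msleep.

```c
/* Userland analogue only; all names below are hypothetical. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

struct susp {
	pthread_mutex_t lock;		/* interlock for both flags and both sleeps */
	pthread_cond_t  wchan;		/* the one wait channel both sides use */
	bool            requested;	/* stands in for TDF_KTH_SUSP */
	bool            parked;		/* target is asleep inside susp_check() */
};
#define SUSP_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false }

/* Target side: called at safe points; sleeps while suspension is requested. */
static void
susp_check(struct susp *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->requested) {
		s->parked = true;
		pthread_cond_broadcast(&s->wchan);	/* tell the suspender */
		/* Releases the lock and sleeps atomically: no lost wakeup. */
		pthread_cond_wait(&s->wchan, &s->lock);
	}
	s->parked = false;
	pthread_mutex_unlock(&s->lock);
}

/* Suspender side: set the flag, then wait until the target has parked. */
static void
susp_suspend(struct susp *s)
{
	pthread_mutex_lock(&s->lock);
	s->requested = true;
	while (!s->parked)
		pthread_cond_wait(&s->wchan, &s->lock);
	pthread_mutex_unlock(&s->lock);
}

static void
susp_resume(struct susp *s)
{
	pthread_mutex_lock(&s->lock);
	s->requested = false;
	pthread_cond_broadcast(&s->wchan);
	pthread_mutex_unlock(&s->lock);
}

struct worker {
	struct susp s;
	atomic_long ticks;	/* progress counter bumped by the worker */
	atomic_bool quit;
};

static void *
worker_main(void *arg)
{
	struct worker *w = arg;

	while (!atomic_load(&w->quit)) {
		susp_check(&w->s);
		atomic_fetch_add(&w->ticks, 1);
	}
	return (NULL);
}

/* Returns 0 if the worker made no progress while suspended. */
static int
susp_demo(void)
{
	struct worker w = { .s = SUSP_INITIALIZER };
	struct timespec ts = { 0, 50 * 1000 * 1000 };	/* 50 ms */
	pthread_t tid;
	long before, after;
	int rc;

	atomic_init(&w.ticks, 0);
	atomic_init(&w.quit, false);
	if (pthread_create(&tid, NULL, worker_main, &w) != 0)
		return (-1);
	susp_suspend(&w.s);
	before = atomic_load(&w.ticks);
	nanosleep(&ts, NULL);
	after = atomic_load(&w.ticks);
	atomic_store(&w.quit, true);
	susp_resume(&w.s);
	rc = pthread_join(tid, NULL);
	assert(rc == 0);
	(void)rc;
	return (before == after ? 0 : 1);
}
```

Calling susp_demo() from a small main() and building with -pthread should return 0. The point of the sketch is that the flag is only ever read or written with the lock held, and the sleep releases that same lock atomically, so a resume cannot slip in between the check and the sleep.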