From owner-freebsd-arch@FreeBSD.ORG Sun Dec 13 20:00:08 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E8E78106566C; Sun, 13 Dec 2009 20:00:07 +0000 (UTC) (envelope-from bz@FreeBSD.org) Received: from mail.cksoft.de (mail.cksoft.de [IPv6:2001:4068:10::3]) by mx1.freebsd.org (Postfix) with ESMTP id 6A0128FC13; Sun, 13 Dec 2009 20:00:07 +0000 (UTC) Received: from localhost (amavis.fra.cksoft.de [192.168.74.71]) by mail.cksoft.de (Postfix) with ESMTP id 59EE641C752; Sun, 13 Dec 2009 21:00:06 +0100 (CET) X-Virus-Scanned: amavisd-new at cksoft.de Received: from mail.cksoft.de ([192.168.74.103]) by localhost (amavis.fra.cksoft.de [192.168.74.71]) (amavisd-new, port 10024) with ESMTP id H6Eh4T16LjOt; Sun, 13 Dec 2009 21:00:05 +0100 (CET) Received: by mail.cksoft.de (Postfix, from userid 66) id C496D41C751; Sun, 13 Dec 2009 21:00:05 +0100 (CET) Received: from maildrop.int.zabbadoz.net (maildrop.int.zabbadoz.net [10.111.66.10]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.int.zabbadoz.net (Postfix) with ESMTP id 604354448EC; Sun, 13 Dec 2009 19:55:58 +0000 (UTC) Date: Sun, 13 Dec 2009 19:55:58 +0000 (UTC) From: "Bjoern A. Zeeb" X-X-Sender: bz@maildrop.int.zabbadoz.net To: John Baldwin In-Reply-To: <20091026185459.U91695@maildrop.int.zabbadoz.net> Message-ID: <20091213195501.H86040@maildrop.int.zabbadoz.net> References: <20091025134226.Q91695@maildrop.int.zabbadoz.net> <200910260830.25168.jhb@freebsd.org> <20091026185459.U91695@maildrop.int.zabbadoz.net> X-OpenPGP-Key: 0x14003F198FEFA3E77207EE8D2B58B8F83CCF1842 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: src/Makefile, universe, LINT, VIMAGE, .. 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Sun, 13 Dec 2009 20:00:08 -0000

On Mon, 26 Oct 2009, Bjoern A. Zeeb wrote:

Hi,

> On Mon, 26 Oct 2009, John Baldwin wrote:
>
> Hi,
>
>>> @@ -345,3 +333,18 @@
>>> fi
>>> .endif
>>> .endif
>>> +
>>> +universe_kernels: universe_kernels_foo
>>> +TARGET?= ${BUILD_ARCH}
>>> +KERNCONFS!= cd ${.CURDIR}/sys/${TARGET}/conf && \
>>> + find [A-Z0-9]*[A-Z0-9] -type f -maxdepth 0 \
>>> + ! -name DEFAULTS ! -name NOTES
>>> +KERNCONFS:= ${KERNCONFS}
>>> +universe_kernels_foo:
>>> +.for kernel in ${KERNCONFS}
>>> + @(cd ${.CURDIR} && env __MAKE_CONF=/dev/null \
>>> + ${MAKE} ${JFLAG} buildkernel TARGET=${TARGET} KERNCONF=${kernel} \
>>> + > _.${TARGET}.${kernel} 2>&1 || \
>>> + (echo "${TARGET} ${kernel} kernel failed," \
>>> + "check _.${TARGET}.${kernel} for details"| ${MAKEFAIL}))
>>> +.endfor
>>
>> Hmm, I'm not sure why you need a universe_kernels_foo target that
>> universe_kernels depends on?
>
> This is all about make and the variables after a target and within a
> target. Whatever else I tried: make complained. If you know the
> right/better solution that works I'll be happy to simplify this and
> update the patch.
>
> It shouldn't be named _foo though;)
>
>
>> Also, I would probably prefer to have
>> universe_kernels come after universe_$target and before universe_epilogue.
>
> I think it should be possible to sneak it in after the .endfor.

I fixed those; I needed to allow the target for the outer .if make()
though with that.

>>> Index: sys/conf/makeLINT.mk
>>> ===================================================================
>>> --- sys/conf/makeLINT.mk (revision 198467)
>>> +++ sys/conf/makeLINT.mk (working copy)
>>> @@ -5,7 +5,15 @@
>>>
>>> clean:
>>> rm -f LINT
>>> +.if ${TARGET} == "amd64" || ${TARGET} == "i386"
>>> + rm -f LINT=VIMAGE
>>> +.endif
>>
>> s/=/-/
>
> Yeah, everyone notices that one; it should be fixed in the patch at the
> URL originally referenced.
>
>> BTW, I'm not sure why you would only enable VIMAGE for these two archs
>> rather than doing it for all archs that have a LINT?
>
> Because it'll usually simply not make any sense to build a VIMAGE
> kernel for embedded platforms like arm, ... Also make universe time
> increases significantly with any platform; indeed amd64 is the worst
> now (again). We can talk about the proper set and I had thought of
> sparc64 as well. Obviously just building it everywhere simplifies
> things.

An updated patch to test would be here:

http://people.freebsd.org/~bz/20091213-01-make-LINT-VIMAGE.diff

/bz

--
Bjoern A. Zeeb It will not break if you know what you are doing.

From owner-freebsd-arch@FreeBSD.ORG Mon Dec 14 11:06:50 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id F1EFB1065672 for ; Mon, 14 Dec 2009 11:06:50 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id C84778FC15 for ; Mon, 14 Dec 2009 11:06:50 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id nBEB6owH075868 for ; Mon, 14 Dec 2009 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.3/8.14.3/Submit) id nBEB6otM075866 for freebsd-arch@FreeBSD.org; Mon, 14 Dec 2009 11:06:50 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 14 Dec 2009 11:06:50 GMT Message-Id: <200912141106.nBEB6otM075866@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-arch@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 14 Dec 2009 11:06:51 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases. S Tracker Resp. Description -------------------------------------------------------------------------------- o kern/120749 arch [request] Suggest upping the default kern.ps_arg_cache 1 problem total. 
From owner-freebsd-arch@FreeBSD.ORG Mon Dec 14 16:47:02 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4FE80106568B; Mon, 14 Dec 2009 16:47:02 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 1D0FF8FC1E; Mon, 14 Dec 2009 16:47:02 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id C467D46B32; Mon, 14 Dec 2009 11:47:01 -0500 (EST) Received: from jhbbsd.localnet (unknown [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id 25DE38A024; Mon, 14 Dec 2009 11:47:01 -0500 (EST) From: John Baldwin To: "Bjoern A. Zeeb" Date: Mon, 14 Dec 2009 11:19:36 -0500 User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; ) References: <20091025134226.Q91695@maildrop.int.zabbadoz.net> <20091026185459.U91695@maildrop.int.zabbadoz.net> <20091213195501.H86040@maildrop.int.zabbadoz.net> In-Reply-To: <20091213195501.H86040@maildrop.int.zabbadoz.net> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <200912141119.36165.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Mon, 14 Dec 2009 11:47:01 -0500 (EST) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: src/Makefile, universe, LINT, VIMAGE, .. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 14 Dec 2009 16:47:02 -0000 On Sunday 13 December 2009 2:55:58 pm Bjoern A. Zeeb wrote: > >> Also, I would probably prefer to have > >> universe_kernels come after universe_$target and before universe_epilogue. > > > > I think that should be possible to sneak it in after the the .endfor. > > I fixed those; I needed to allow the target for the outer .if make() > though with that. I think you can drop the 'KERNCONFS:= ${KERNCONFS}' line now. > >>> Index: sys/conf/makeLINT.mk > >>> =================================================================== > >>> --- sys/conf/makeLINT.mk (revision 198467) > >>> +++ sys/conf/makeLINT.mk (working copy) > >>> @@ -5,7 +5,15 @@ > >>> > >>> clean: > >>> rm -f LINT > >>> +.if ${TARGET} == "amd64" || ${TARGET} == "i386" > >>> + rm -f LINT=VIMAGE > >>> +.endif > >> > >> s/=/-/ > > > > Yeah, everyone notics that one; it should be fixed in the patch at the > > URL originally referenced. This is still here. 
:) -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Mon Dec 14 21:05:06 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 95E331065697; Mon, 14 Dec 2009 21:05:06 +0000 (UTC) (envelope-from bz@FreeBSD.org) Received: from mail.cksoft.de (mail.cksoft.de [IPv6:2001:4068:10::3]) by mx1.freebsd.org (Postfix) with ESMTP id 52B4D8FC16; Mon, 14 Dec 2009 21:05:06 +0000 (UTC) Received: from localhost (amavis.fra.cksoft.de [192.168.74.71]) by mail.cksoft.de (Postfix) with ESMTP id B351341C75B; Mon, 14 Dec 2009 22:05:05 +0100 (CET) X-Virus-Scanned: amavisd-new at cksoft.de Received: from mail.cksoft.de ([192.168.74.103]) by localhost (amavis.fra.cksoft.de [192.168.74.71]) (amavisd-new, port 10024) with ESMTP id 2sf-fU5cwaob; Mon, 14 Dec 2009 22:05:05 +0100 (CET) Received: by mail.cksoft.de (Postfix, from userid 66) id 3F0FD41C75A; Mon, 14 Dec 2009 22:05:05 +0100 (CET) Received: from maildrop.int.zabbadoz.net (maildrop.int.zabbadoz.net [10.111.66.10]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.int.zabbadoz.net (Postfix) with ESMTP id D5FC74448EC; Mon, 14 Dec 2009 21:01:41 +0000 (UTC) Date: Mon, 14 Dec 2009 21:01:41 +0000 (UTC) From: "Bjoern A. Zeeb" X-X-Sender: bz@maildrop.int.zabbadoz.net To: John Baldwin In-Reply-To: <200912141119.36165.jhb@freebsd.org> Message-ID: <20091214210054.A86040@maildrop.int.zabbadoz.net> References: <20091025134226.Q91695@maildrop.int.zabbadoz.net> <20091026185459.U91695@maildrop.int.zabbadoz.net> <20091213195501.H86040@maildrop.int.zabbadoz.net> <200912141119.36165.jhb@freebsd.org> X-OpenPGP-Key: 0x14003F198FEFA3E77207EE8D2B58B8F83CCF1842 MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: src/Makefile, universe, LINT, VIMAGE, .. 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 14 Dec 2009 21:05:06 -0000 On Mon, 14 Dec 2009, John Baldwin wrote: > I think you can drop the 'KERNCONFS:= ${KERNCONFS}' line now. So I did; thanks. >>>>> Index: sys/conf/makeLINT.mk >>>>> =================================================================== >>>>> --- sys/conf/makeLINT.mk (revision 198467) >>>>> +++ sys/conf/makeLINT.mk (working copy) >>>>> @@ -5,7 +5,15 @@ >>>>> >>>>> clean: >>>>> rm -f LINT >>>>> +.if ${TARGET} == "amd64" || ${TARGET} == "i386" >>>>> + rm -f LINT=VIMAGE >>>>> +.endif >>>> >>>> s/=/-/ >>> >>> Yeah, everyone notics that one; it should be fixed in the patch at the >>> URL originally referenced. > > This is still here. :) *grump* I had fixed it in the patch but not in my working tree. New try: http://people.freebsd.org/~bz/20091214-01-make-LINT-VIMAGE.diff -- Bjoern A. Zeeb It will not break if you know what you are doing. 
From owner-freebsd-arch@FreeBSD.ORG Tue Dec 15 09:50:14 2009
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4EB671065670 for ; Tue, 15 Dec 2009 09:50:14 +0000 (UTC) (envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32]) by mx1.freebsd.org (Postfix) with ESMTP id D00F68FC20 for ; Tue, 15 Dec 2009 09:50:13 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de over TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); Tue, 15 Dec 2009 10:38:07 +0100
Date: Tue, 15 Dec 2009 10:38:04 +0100 (CET)
From: Harti Brandt
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: arch@freebsd.org
Message-ID: <20091215103759.P97203@beagle.kn.op.dlr.de>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-OriginalArrivalTime: 15 Dec 2009 09:38:07.0697 (UTC) FILETIME=[4EE2C410:01CA7D6A]
Subject: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Tue, 15 Dec 2009 09:50:14 -0000

Hi all,

I'm working on our network statistics (in the context of SNMP) and wonder
to what extent we want them to be correct. I've re-read part of the past
discussions about 64-bit counters on 32-bit archs and got the impression
that there are users that would like to have almost correct statistics
(for accounting, for example). If this is the case I wonder whether the
way we do the statistics today is correct.

Basically all statistics are incremented or added to simply by a += b or
a++. As I understand, this worked fine in the old days, where you had
spl*() calls at the right places.
Nowadays when everything is SMP shouldn't we use at least atomic operations for this? Also I read that on architectures where cache coherency is not implemented in hardware even this does not help (I found a mail from jhb why for the mutex implementation this is not a problem, but I don't understand what to do for the += and ++ operations). I failed to find a way, though, to influence the caching policy (is there a function one can call to change the policy?). Any opinions? harti From owner-freebsd-arch@FreeBSD.ORG Tue Dec 15 16:43:03 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E3BDC1065697; Tue, 15 Dec 2009 16:43:03 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id B606C8FC1A; Tue, 15 Dec 2009 16:43:03 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 5859B46B23; Tue, 15 Dec 2009 11:43:03 -0500 (EST) Received: from jhbbsd.localnet (unknown [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id A40E58A01B; Tue, 15 Dec 2009 11:43:02 -0500 (EST) From: John Baldwin To: freebsd-arch@freebsd.org, Harti Brandt Date: Tue, 15 Dec 2009 08:12:35 -0500 User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; ) References: <20091215103759.P97203@beagle.kn.op.dlr.de> In-Reply-To: <20091215103759.P97203@beagle.kn.op.dlr.de> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <200912150812.35521.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Tue, 15 Dec 2009 11:43:02 -0500 (EST) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 
tests=AWL,BAYES_00, DATE_IN_PAST_03_06,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 15 Dec 2009 16:43:04 -0000 On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote: > Hi all, > > I'm working on our network statistics (in the context of SNMP) and wonder, > to what extend we want them to be correct. I've re-read part of the past > discussions about 64-bit counters on 32-bit archs and got the impression, > that there are users that would like to have almost correct statistics > (for accounting, for example). If this is the case I wonder whether the > way we do the statistics today is correct. > > Basically all statistics are incremented or added to simply by a += b oder > a++. As I understand, this worked fine in the old days, where you had > spl*() calls at the right places. Nowadays when everything is SMP > shouldn't we use at least atomic operations for this? Also I read that on > architectures where cache coherency is not implemented in hardware even > this does not help (I found a mail from jhb why for the mutex > implementation this is not a problem, but I don't understand what to do > for the += and ++ operations). I failed to find a way, though, to > influence the caching policy (is there a function one can call to > change the policy?). Atomic ops will always work for reliable statistics. However, I believe Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to what we do now for many of the 'cnt' stats (context switches, etc.). For 'cnt' each CPU has its own count of stats that are updated using non-atomic ops (since they are CPU local). 
sysctl handlers then sum up the various per- CPU counts to report global counts to userland. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Dec 15 17:07:56 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C523C106568F for ; Tue, 15 Dec 2009 17:07:56 +0000 (UTC) (envelope-from julian@elischer.org) Received: from outB.internet-mail-service.net (outb.internet-mail-service.net [216.240.47.225]) by mx1.freebsd.org (Postfix) with ESMTP id A9F408FC12 for ; Tue, 15 Dec 2009 17:07:56 +0000 (UTC) Received: from idiom.com (mx0.idiom.com [216.240.32.160]) by out.internet-mail-service.net (Postfix) with ESMTP id 1C08D44441; Tue, 15 Dec 2009 09:07:57 -0800 (PST) X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e X-Client-Authorized: MaGic Cook1e Received: from julian-mac.elischer.org (h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137]) by idiom.com (Postfix) with ESMTP id DB9182D6014; Tue, 15 Dec 2009 09:07:55 -0800 (PST) Message-ID: <4B27C279.8030402@elischer.org> Date: Tue, 15 Dec 2009 09:08:09 -0800 From: Julian Elischer User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) MIME-Version: 1.0 To: John Baldwin References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> In-Reply-To: <200912150812.35521.jhb@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Harti Brandt , freebsd-arch@freebsd.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 15 Dec 2009 17:07:56 -0000 John Baldwin wrote: > On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote: >> Hi all, >> >> I'm working on our network statistics (in the 
>> context of SNMP) and wonder,
>> to what extent we want them to be correct. I've re-read part of the past
>> discussions about 64-bit counters on 32-bit archs and got the impression,
>> that there are users that would like to have almost correct statistics
>> (for accounting, for example). If this is the case I wonder whether the
>> way we do the statistics today is correct.
>>
>> Basically all statistics are incremented or added to simply by a += b or
>> a++. As I understand, this worked fine in the old days, where you had
>> spl*() calls at the right places. Nowadays when everything is SMP
>> shouldn't we use at least atomic operations for this? Also I read that on
>> architectures where cache coherency is not implemented in hardware even
>> this does not help (I found a mail from jhb why for the mutex
>> implementation this is not a problem, but I don't understand what to do
>> for the += and ++ operations). I failed to find a way, though, to
>> influence the caching policy (is there a function one can call to
>> change the policy?).
>
> Atomic ops will always work for reliable statistics. However, I believe
> Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to
> what we do now for many of the 'cnt' stats (context switches, etc.). For
> 'cnt' each CPU has its own count of stats that are updated using non-atomic
> ops (since they are CPU local). sysctl handlers then sum up the various
> per-CPU counts to report global counts to userland.

The trouble is that PCPU and VNET collide. You then need to have per-CPU,
per-VNET counters, which would be yet a different pool of linker set
symbols.

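To make the scheme above concrete, here is a minimal user-space sketch of per-CPU counters that are updated without atomic ops and summed only when read, with the extra per-VNET dimension added. It is an illustration only, not kernel code: NCPU, NVNET, ipackets, stat_inc() and stat_sum() are hypothetical names, and a plain static array stands in for whatever per-CPU/per-VNET storage the kernel would really use.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU	4	/* hypothetical number of CPUs */
#define NVNET	2	/* hypothetical number of network stack instances */

/*
 * One slot per (vnet, cpu) pair.  The assumption is that only the
 * owning CPU ever writes its own slot, so updates need no atomic ops.
 */
static uint32_t ipackets[NVNET][NCPU];

/* Update path: plain, non-atomic increment of the local CPU's slot. */
static void
stat_inc(int vnet, int cpu)
{
	assert(vnet >= 0 && vnet < NVNET);
	assert(cpu >= 0 && cpu < NCPU);
	ipackets[vnet][cpu]++;
}

/*
 * Read path, i.e. what a sysctl handler would do: walk all CPUs and
 * add up their slots for one vnet.  The result may be slightly stale
 * if other CPUs are updating concurrently.
 */
static uint64_t
stat_sum(int vnet)
{
	uint64_t sum = 0;
	int cpu;

	assert(vnet >= 0 && vnet < NVNET);
	for (cpu = 0; cpu < NCPU; cpu++)
		sum += ipackets[vnet][cpu];
	return (sum);
}
```

Calling stat_inc(0, cpu) ten times spread over the CPUs and then stat_sum(0) gives 10 for that vnet, while the other vnet's total is unaffected. The cost Julian points at is visible in the array shape: storage grows with NVNET times NCPU.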
> From owner-freebsd-arch@FreeBSD.ORG Tue Dec 15 17:45:17 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 678A81065670 for ; Tue, 15 Dec 2009 17:45:17 +0000 (UTC) (envelope-from Hartmut.Brandt@dlr.de) Received: from smtp3.dlr.de (smtp3.dlr.de [129.247.252.33]) by mx1.freebsd.org (Postfix) with ESMTP id F238F8FC16 for ; Tue, 15 Dec 2009 17:45:16 +0000 (UTC) Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp3.dlr.de over TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); Tue, 15 Dec 2009 18:45:15 +0100 Date: Tue, 15 Dec 2009 18:45:13 +0100 (CET) From: Harti Brandt X-X-Sender: brandt_h@beagle.kn.op.dlr.de To: John Baldwin In-Reply-To: <200912150812.35521.jhb@freebsd.org> Message-ID: <20091215183859.S53283@beagle.kn.op.dlr.de> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> X-OpenPGP-Key: harti@freebsd.org MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-OriginalArrivalTime: 15 Dec 2009 17:45:15.0071 (UTC) FILETIME=[5BC21CF0:01CA7DAE] Cc: freebsd-arch@freebsd.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Harti Brandt List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 15 Dec 2009 17:45:17 -0000 On Tue, 15 Dec 2009, John Baldwin wrote: JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote: JB>> Hi all, JB>> JB>> I'm working on our network statistics (in the context of SNMP) and wonder, JB>> to what extend we want them to be correct. I've re-read part of the past JB>> discussions about 64-bit counters on 32-bit archs and got the impression, JB>> that there are users that would like to have almost correct statistics JB>> (for accounting, for example). 
JB>> If this is the case I wonder whether the
JB>> way we do the statistics today is correct.
JB>>
JB>> Basically all statistics are incremented or added to simply by a += b or
JB>> a++. As I understand, this worked fine in the old days, where you had
JB>> spl*() calls at the right places. Nowadays when everything is SMP
JB>> shouldn't we use at least atomic operations for this? Also I read that on
JB>> architectures where cache coherency is not implemented in hardware even
JB>> this does not help (I found a mail from jhb why for the mutex
JB>> implementation this is not a problem, but I don't understand what to do
JB>> for the += and ++ operations). I failed to find a way, though, to
JB>> influence the caching policy (is there a function one can call to
JB>> change the policy?).
JB>
JB>Atomic ops will always work for reliable statistics. However, I believe
JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to
JB>what we do now for many of the 'cnt' stats (context switches, etc.). For
JB>'cnt' each CPU has its own count of stats that are updated using non-atomic
JB>ops (since they are CPU local). sysctl handlers then sum up the various
JB>per-CPU counts to report global counts to userland.

I see. I was also thinking along these lines, but was not sure whether it
is worth the trouble. I suppose this does not help to implement 64-bit
counters on 32-bit architectures, though, because you cannot read them
reliably without locking to sum them up, right?

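The read problem with 64-bit counters on a 32-bit machine can be shown with a small deterministic simulation. This is only a sketch: struct split64 and the helper names are made up for illustration, and the "reader" is simply called between the two halves of the update instead of running on another CPU.

```c
#include <stdint.h>

/* A 64-bit counter as a 32-bit CPU stores it: two separate words. */
struct split64 {
	uint32_t lo;
	uint32_t hi;
};

/* First half of an increment: bump the low word; returns 1 on wrap. */
static int
inc_lo(struct split64 *c)
{
	c->lo++;
	return (c->lo == 0);
}

/* Second half of an increment: propagate the carry into the high word. */
static void
inc_hi(struct split64 *c)
{
	c->hi++;
}

/* Unlocked reader: combines whatever the two words hold right now. */
static uint64_t
read64(const struct split64 *c)
{
	return (((uint64_t)c->hi << 32) | c->lo);
}

/*
 * Start the counter at lo_start, perform one increment, and let the
 * reader look at the counter between the two stores.  Returns what
 * the reader saw.
 */
static uint64_t
torn_read_demo(uint32_t lo_start)
{
	struct split64 c = { .lo = lo_start, .hi = 0 };
	uint64_t seen;
	int carry;

	carry = inc_lo(&c);
	seen = read64(&c);	/* reader runs here, mid-update */
	if (carry)
		inc_hi(&c);
	return (seen);
}
```

With lo_start = 5 the reader sees 6, which is correct. With lo_start = 0xFFFFFFFF the low word has already wrapped to 0 but the carry has not been stored yet, so the reader sees 0 where the counter is about to become 2**32. That is the kind of error that only shows up at the wrap point.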
harti From owner-freebsd-arch@FreeBSD.ORG Tue Dec 15 19:39:04 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 23B4C1065676; Tue, 15 Dec 2009 19:39:04 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id E6A7B8FC13; Tue, 15 Dec 2009 19:39:03 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 881EA46B39; Tue, 15 Dec 2009 14:39:03 -0500 (EST) Received: from jhbbsd.localnet (unknown [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPA id BF5FB8A01B; Tue, 15 Dec 2009 14:39:02 -0500 (EST) From: John Baldwin To: Harti Brandt Date: Tue, 15 Dec 2009 13:13:28 -0500 User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; ) References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> <20091215183859.S53283@beagle.kn.op.dlr.de> In-Reply-To: <20091215183859.S53283@beagle.kn.op.dlr.de> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <200912151313.28326.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (bigwig.baldwin.cx); Tue, 15 Dec 2009 14:39:02 -0500 (EST) X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE autolearn=no version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx Cc: freebsd-arch@freebsd.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 15 
Dec 2009 19:39:04 -0000 On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote: > On Tue, 15 Dec 2009, John Baldwin wrote: > > JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote: > JB>> Hi all, > JB>> > JB>> I'm working on our network statistics (in the context of SNMP) and wonder, > JB>> to what extend we want them to be correct. I've re-read part of the past > JB>> discussions about 64-bit counters on 32-bit archs and got the impression, > JB>> that there are users that would like to have almost correct statistics > JB>> (for accounting, for example). If this is the case I wonder whether the > JB>> way we do the statistics today is correct. > JB>> > JB>> Basically all statistics are incremented or added to simply by a += b oder > JB>> a++. As I understand, this worked fine in the old days, where you had > JB>> spl*() calls at the right places. Nowadays when everything is SMP > JB>> shouldn't we use at least atomic operations for this? Also I read that on > JB>> architectures where cache coherency is not implemented in hardware even > JB>> this does not help (I found a mail from jhb why for the mutex > JB>> implementation this is not a problem, but I don't understand what to do > JB>> for the += and ++ operations). I failed to find a way, though, to > JB>> influence the caching policy (is there a function one can call to > JB>> change the policy?). > JB> > JB>Atomic ops will always work for reliable statistics. However, I believe > JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to > JB>what we do now for many of the 'cnt' stats (context switches, etc.). For > JB>'cnt' each CPU has its own count of stats that are updated using non-atomic > JB>ops (since they are CPU local). sysctl handlers then sum up the various per- > JB>CPU counts to report global counts to userland. > > I see. I was also thinking along these lines, but was not sure whether it > is worth the trouble. 
I suppose this does not help to implement 64-bit > counters on 32-bit architectures, though, because you cannot read them > reliably without locking to sum them up, right? Either that or you just accept that you have a small race since it is only stats. :) -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Wed Dec 16 18:19:49 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E5C261065693; Wed, 16 Dec 2009 18:19:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au [211.29.132.182]) by mx1.freebsd.org (Postfix) with ESMTP id 7A39A8FC16; Wed, 16 Dec 2009 18:19:49 +0000 (UTC) Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au (c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116]) by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id nBGIJjhj016826 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 17 Dec 2009 05:19:47 +1100 Date: Thu, 17 Dec 2009 05:19:45 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: John Baldwin In-Reply-To: <200912151313.28326.jhb@freebsd.org> Message-ID: <20091217021211.O35780@delplex.bde.org> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> <20091215183859.S53283@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Harti Brandt , freebsd-arch@FreeBSD.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 16 Dec 2009 18:19:50 -0000 On Tue, 15 Dec 2009, John Baldwin wrote: > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote: >> On Tue, 15 
Dec 2009, John Baldwin wrote:
>>
>> JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
>> JB>> Hi all,
>> JB>>
>> JB>> I'm working on our network statistics (in the context of SNMP) and wonder,
>> JB>> to what extent we want them to be correct. I've re-read part of the past
>> JB>> discussions about 64-bit counters on 32-bit archs and got the impression,
>> JB>> that there are users that would like to have almost correct statistics
>> JB>> (for accounting, for example). If this is the case I wonder whether the
>> JB>> way we do the statistics today is correct.
>> JB>>
>> JB>> Basically all statistics are incremented or added to simply by a += b or
>> JB>> a++. As I understand, this worked fine in the old days, where you had
>> JB>> spl*() calls at the right places. Nowadays when everything is SMP
>> JB>> shouldn't we use at least atomic operations for this? Also I read that on
>> JB>> architectures where cache coherency is not implemented in hardware even
>> JB>> this does not help (I found a mail from jhb why for the mutex
>> JB>> implementation this is not a problem, but I don't understand what to do
>> JB>> for the += and ++ operations). I failed to find a way, though, to
>> JB>> influence the caching policy (is there a function one can call to
>> JB>> change the policy?).
>> JB>
>> JB>Atomic ops will always work for reliable statistics. However, I believe
>> JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to
>> JB>what we do now for many of the 'cnt' stats (context switches, etc.). For
>> JB>'cnt' each CPU has its own count of stats that are updated using non-atomic
>> JB>ops (since they are CPU local). sysctl handlers then sum up the various
>> JB>per-CPU counts to report global counts to userland.

I don't like the bloat from this, but don't see anything better.
Julian said in another reply that there are even more complications for
VIMAGE.

>> I see.
I was also thinking along these lines, but was not sure whether it
>> is worth the trouble.  I suppose this does not help to implement 64-bit
>> counters on 32-bit architectures, though, because you cannot read them
>> reliably without locking to sum them up, right?
>
> Either that or you just accept that you have a small race since it is only stats. :)

Actually, you can do better with a generation count.  The generation count
would at least tell you if you lost a race.  The generation count should
only be maintained while summing other counts, since it must be global and
incremented by atomic ops (to avoid the races without even more costly
locking which would make the generation count irrelevant) so maintaining
it all the time would more than defeat the point of having per-CPU counters
(all CPUs would compete for it at the same address).  Probably not worth
it for statistics.  Except, if userland had control over it, then userland
could decide the policy.

Actually2, this solves your original problem!, provided the races are so
rarely lost that looping to recover from them works: Once counters are
per-CPU, they can be 64-bits with no complications until they are summed.
Detection of lost races is essential for summing them on 32-bit systems,
unlike for 32-bit counters, since a lost race at the point where the low
32 bits wraps around may give an error of 2**32 in the sum, while a lost
race for a 32-bit counter only makes the sum a bit too small (unless the
32-bit counter wrapped).

Simple version:
- bloat PCPU_INC(var) to do something like the following:
      if (PCPU_GET(counter_summing_mode))
          atomic_add_int(&counter_gen, 1);
      OLD_PCPU_INC(var);
- set PCPU_GET(counter_summing_mode) while summing.  Needs heavyweight
  synchronization (IPIs?) to set and clear the flag on other CPUs.
Must also make all other CPUs flush pending writes (so that a 64-bit counter cannot be half-written at the beginning of the summing), but this will happen automatically with any heavyweight synchronization. Unsimple versions: to avoid bloating PCPU_INC(), write-protect all counters while summing, and count generations in the trap handler ... However, I prefer summing 32-bit counters (with heuristics to detect wraparound) to a 64-bit sum, like I think you already do for SNMP. Wraparound heuristics may still be useful with the generation count: suppose the generation count increases faster than you can sum; then looping to get a coherent sum doesn't work, and wraparound must be ruled out or fixed up in another way; the 32-bit wraparound heuristic works perfectly since we can guarantee to sum faster than a 32-bit counter can wrap twice. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Dec 17 08:10:29 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CAD7E106568B; Thu, 17 Dec 2009 08:10:29 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au [211.29.132.190]) by mx1.freebsd.org (Postfix) with ESMTP id 6028B8FC12; Thu, 17 Dec 2009 08:10:28 +0000 (UTC) Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au (c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116]) by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id nBH8AP0P006827 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 17 Dec 2009 19:10:26 +1100 Date: Thu, 17 Dec 2009 19:10:25 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Bruce Evans In-Reply-To: <20091217021211.O35780@delplex.bde.org> Message-ID: <20091217181553.Q36492@delplex.bde.org> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> 
<20091215183859.S53283@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091217021211.O35780@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Harti Brandt , John Baldwin , freebsd-arch@FreeBSD.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 17 Dec 2009 08:10:29 -0000 On Thu, 17 Dec 2009, Bruce Evans wrote: > ... > Actually, you can do better with a generation count. The generation count > would at least tell you if you lost a race. The generation count should > only be maintained while summing other counts, since it must be global and > incremented by atomic ops (to avoid the races without even more costly > locking which would make the generation count irrelevant) so maintaining > it all the time would more than defeat the point of having per-CPU counters > (all CPUs would compete for it at the same address). ... Actually3, the generation count can be per-CPU and accessed without atomic ops (provided reads of it on other CPUs return a consistent possibly-stale value). > Simple version: > - bloat PCPU_INC(var) to do something like the following: > if (PCPU_GET(counter_summing_mode)) > atomic_add_int(&counter_gen, 1); > OLD_PCPU_INC(var); > - set PCPU_GET(counter_summing_mode) while summing. Needs heavyweight > synchronization (IPIs?) to set and clear the flag on other CPUs. Must > also make all other CPUs flush pending writes (so that a 64-bit counter > cannot be half-written at the beginning of the summing), but this will > happen automatically with any heavyweight synchronization. 
Better version:
- bloat PCPU_INC(var) to do something like the following:
      OLD_PCPU_INC(counter_gen);
      OLD_PCPU_INC(var);
- sum all PCPU_GET(counter_gen) before summing the subset of ordinary
  counters of interest.  This gives a value <= the unracy current sum
  of the generation counters, by reading consistent possibly-stale
  values.
  Then sync all counters as above.  Note that the order of the above
  increments would be backwards if we used write ordering instead of
  a full sync -- with only write ordering the sum of the generation
  counts would be too high here if we happened to read it on 1 of the
  CPUs in between the above increments.  This order is chosen since I
  don't want to have 2 increments of counter_gen in the above and/or
  further complications and bloat, so there must be some order, and
  the above order works right later.
  Then sum selected ordinary counters.
  Then sync the generation counters (or all counters, or arrange for
  write ordering) as above.
  Then sum the generation counters.  This gives a value >= the unracy
  current sum at the end of summing the selected counters.
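The ordering described above can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions, not kernel code and not anything from the FreeBSD tree: NCPU, struct pcpu_stats, pcpu_inc(), sync_all_cpus(), mid_sum_hook and bump_cpu1() are made-up names, the per-CPU data is a plain array, and the heavyweight sync is a no-op stub. It shows only the sequence "sum generations, sync, sum counters, sync, sum generations, compare".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NCPU 4	/* hypothetical CPU count for this sketch */

struct pcpu_stats {
	uint64_t counter_gen;	/* bumped before every counter update */
	uint64_t var;		/* the statistic itself */
};

static struct pcpu_stats pcpu[NCPU];

/* Stand-in for the heavyweight sync (IPIs or similar); a no-op here. */
static void sync_all_cpus(void) { }

/* Writer side: generation first, then the counter, as in the mail. */
static void pcpu_inc(int cpu)
{
	pcpu[cpu].counter_gen++;
	pcpu[cpu].var++;
}

static uint64_t sum_gen(void)
{
	uint64_t s = 0;

	for (int i = 0; i < NCPU; i++)
		s += pcpu[i].counter_gen;
	return s;
}

/* Demo-only hook so a caller can inject an update in the middle of a sum. */
static void (*mid_sum_hook)(void);

/* Demo-only helper: simulates CPU 1 updating its counter during a sum. */
static void bump_cpu1(void)
{
	pcpu_inc(1);
}

/* Returns true and stores the sum if no update was seen while summing. */
static bool sum_var(uint64_t *out)
{
	uint64_t gen_before, gen_after, s = 0;

	gen_before = sum_gen();		/* <= true generation sum */
	sync_all_cpus();
	for (int i = 0; i < NCPU; i++) {
		s += pcpu[i].var;
		if (i == 0 && mid_sum_hook != NULL)
			mid_sum_hook();
	}
	sync_all_cpus();
	gen_after = sum_gen();		/* >= true generation sum */
	if (gen_before != gen_after)
		return false;		/* lost a race; caller may retry */
	*out = s;
	return true;
}
```

A caller would loop on sum_var() until it returns true, which only makes sense if updates are rare relative to the time one sum takes.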
Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Dec 17 08:28:12 2009 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C1509106568B; Thu, 17 Dec 2009 08:28:12 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au [211.29.132.189]) by mx1.freebsd.org (Postfix) with ESMTP id 3F8638FC12; Thu, 17 Dec 2009 08:28:11 +0000 (UTC) Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au (c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116]) by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id nBH8S8sx014527 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 17 Dec 2009 19:28:09 +1100 Date: Thu, 17 Dec 2009 19:28:08 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Bruce Evans In-Reply-To: <20091217181553.Q36492@delplex.bde.org> Message-ID: <20091217191535.U36525@delplex.bde.org> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> <20091215183859.S53283@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091217021211.O35780@delplex.bde.org> <20091217181553.Q36492@delplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Harti Brandt , John Baldwin , freebsd-arch@FreeBSD.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 17 Dec 2009 08:28:12 -0000 On Thu, 17 Dec 2009, Bruce Evans wrote: > On Thu, 17 Dec 2009, Bruce Evans wrote: > >> ... > Actually3, the generation count can be per-CPU and accessed without atomic > ops (provided reads of it on other CPUs return a consistent possibly-stale > value). 
> >> Simple version: > > Better version: > ... Duh, this is far too complicated and bloated. Counters can be their own generation counts -- you just read them again to see if they are quiescent. A heavyweight sync before each of the (sets of) reads is still necessary. Self-generation counters give a separate generation counter for each normal counter, so quiescence can be easily be checked for and/or enforced per-counter. Bruce From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 11:27:13 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C8E5B1065672; Sat, 19 Dec 2009 11:27:13 +0000 (UTC) (envelope-from uqs@spoerlein.net) Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:198:206::1]) by mx1.freebsd.org (Postfix) with ESMTP id 4F1468FC16; Sat, 19 Dec 2009 11:27:13 +0000 (UTC) Received: from acme.spoerlein.net (localhost.spoerlein.net [IPv6:::1]) by acme.spoerlein.net (8.14.3/8.14.3) with ESMTP id nBJBRCkd046767 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 19 Dec 2009 12:27:12 +0100 (CET) (envelope-from uqs@spoerlein.net) Received: (from uqs@localhost) by acme.spoerlein.net (8.14.3/8.14.3/Submit) id nBJBRCn9046766; Sat, 19 Dec 2009 12:27:12 +0100 (CET) (envelope-from uqs@spoerlein.net) Date: Sat, 19 Dec 2009 12:27:12 +0100 From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= To: John Baldwin Message-ID: <20091219112711.GR55913@acme.spoerlein.net> Mail-Followup-To: John Baldwin , Harti Brandt , freebsd-arch@freebsd.org References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> <20091215183859.S53283@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200912151313.28326.jhb@freebsd.org> User-Agent: Mutt/1.5.20 (2009-06-14) Cc: Harti Brandt , freebsd-arch@freebsd.org Subject: Re: 
network statistics in SMP
X-List-Received-Date: Sat, 19 Dec 2009 11:27:13 -0000

On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> > I see. I was also thinking along these lines, but was not sure whether it
> > is worth the trouble. I suppose this does not help to implement 64-bit
> > counters on 32-bit architectures, though, because you cannot read them
> > reliably without locking to sum them up, right?
>
> Either that or you just accept that you have a small race since it is only stats. :)

This might be stupid, but can we not easily *read* 64bit counters
on 32bit machines like this:

do {
    h1 = read_upper_32bits;
    l1 = read_lower_32bits;
    h2 = read_upper_32bits;
    l2 = read_lower_32bits; /* not needed */
} while (h1 != h2);

sum64 = (h1<<32) + l1;

or something like that? If h2 does not change between readings, no
wrap-around has occurred. If l1 was read in between the readings of h1
and h2, the code above is sound. Right?
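For reference, a compilable rendering of the loop above might look like the sketch below. The struct split64 layout and the split64_read() name are assumptions made up for this illustration. The one substantive change is the cast before the shift: shifting a 32-bit value left by 32 is undefined in C, so the high half is widened first. The sketch says nothing about how the writer orders its two stores, which is exactly what the soundness argument depends on.

```c
#include <stdint.h>

/* Hypothetical layout: a 64-bit counter kept as two 32-bit halves. */
struct split64 {
	volatile uint32_t hi;
	volatile uint32_t lo;
};

/* Re-read the high half until it is the same before and after the low half. */
static uint64_t split64_read(const struct split64 *c)
{
	uint32_t h1, l1, h2;

	do {
		h1 = c->hi;
		l1 = c->lo;
		h2 = c->hi;
	} while (h1 != h2);

	/* Widen before shifting; (h1 << 32) on a 32-bit type is undefined. */
	return ((uint64_t)h1 << 32) | l1;
}
```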
Regards, Uli From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 11:42:24 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8965A1065746; Sat, 19 Dec 2009 11:42:24 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe01.swip.net [212.247.154.1]) by mx1.freebsd.org (Postfix) with ESMTP id AC4FE8FC08; Sat, 19 Dec 2009 11:42:23 +0000 (UTC) X-Cloudmark-Score: 0.000000 [] X-Cloudmark-Analysis: v=1.0 c=1 a=MnI1ikcADjEx7bvsp0jZvQ==:17 a=Fa0p2dj8wnFP207fzp8A:9 a=NRoDMZWEEsvWI_krVPgA:7 a=f3fVchOcrPINIEe5_G_UAmhSYaQA:4 a=u4nCZWMP6T-Rh-9c:21 Received: from [188.126.201.140] (account mc467741@c2i.net HELO laptop.adsl.tele2.no) by mailfe01.swip.net (CommuniGate Pro SMTP 5.2.16) with ESMTPA id 293568362; Sat, 19 Dec 2009 12:42:21 +0100 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Sat, 19 Dec 2009 12:44:14 +0100 User-Agent: KMail/1.11.4 (FreeBSD/9.0-CURRENT; KDE/4.2.4; i386; ; ) References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net> In-Reply-To: <20091219112711.GR55913@acme.spoerlein.net> X-Face: (%:6u[ldzJ`0qjD7sCkfdMmD*RxpO< =?iso-8859-1?q?Q0yAl=7E=3F=60=27F=3FjDVb=5DE6TQ7=27=23h-VlLs=7Dk/=0A=09?=(yxg(p!IL.`#ng"%`BMrham7%UK,}VH\wUOm=^>wEEQ+KWt[{J#x6ow~JO:,zwp.(t; @ =?iso-8859-1?q?Aq=0A=09=3A4=3A=26nFCgDb8=5B3oIeTb=5E=27?=",; u{5{}C9>"PuY\)!=#\u9SSM-nz8+SR~B\!qBv MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <200912191244.17803.hselasky@c2i.net> Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= , Harti Brandt Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Sat, 19 Dec 2009 11:42:24 -0000

On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
> On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
> > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> > > I see. I was also thinking along these lines, but was not sure whether
> > > it is worth the trouble. I suppose this does not help to implement
> > > 64-bit counters on 32-bit architectures, though, because you cannot
> > > read them reliably without locking to sum them up, right?
> >
> > Either that or you just accept that you have a small race since it is
> > only stats. :)
>
> This might be stupid, but can we not easily *read* 64bit counters
> on 32bit machines like this:
>
> do {
>     h1 = read_upper_32bits;
>     l1 = read_lower_32bits;
>     h2 = read_upper_32bits;
>     l2 = read_lower_32bits; /* not needed */
> } while (h1 != h2);

Hi,

Just a comment. You know you don't need a while loop to get a stable value?
Should be implemented like this, in my opinion:

h1 = read_upper_32bits;
l1 = read_lower_32bits;
h2 = read_upper_32bits;

if (h1 != h2)
    l1 = 0xffffffffUL;

sum64 = (h1<<32) | l1;

--HPS

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 14:56:43 2009
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 8D1A21065679;
    Sat, 19 Dec 2009 14:56:43 +0000 (UTC)
    (envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
    by mx1.freebsd.org (Postfix) with ESMTP id 1CC1F8FC14;
    Sat, 19 Dec 2009 14:56:42 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de
    over TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959);
    Sat, 19 Dec 2009 15:56:40 +0100
Date: Sat, 19 Dec 2009 15:56:38 +0100 (CET)
From: Harti Brandt
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: Ulrich Spörlein
In-Reply-To:
<20091219112711.GR55913@acme.spoerlein.net> Message-ID: <20091219154206.E93919@beagle.kn.op.dlr.de> References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912150812.35521.jhb@freebsd.org> <20091215183859.S53283@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net> X-OpenPGP-Key: harti@freebsd.org MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-OriginalArrivalTime: 19 Dec 2009 14:56:40.0372 (UTC) FILETIME=[78948740:01CA80BB] Cc: freebsd-arch@freebsd.org Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Harti Brandt List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 19 Dec 2009 14:56:43 -0000 On Sat, 19 Dec 2009, Ulrich Sprlein wrote: US>On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote: US>> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote: US>> > I see. I was also thinking along these lines, but was not sure whether it US>> > is worth the trouble. I suppose this does not help to implement 64-bit US>> > counters on 32-bit architectures, though, because you cannot read them US>> > reliably without locking to sum them up, right? US>> US>> Either that or you just accept that you have a small race since it is only stats. :) US> US>This might be stupid, but can we not easily *read* 64bit counters US>on 32bit machines like this: US> US>do { US> h1 = read_upper_32bits; US> l1 = read_lower_32bits; US> h2 = read_upper_32bits; US> l2 = read_lower_32bits; /* not needed */ US>} while (h1 != h2); US> US>sum64 = (h1<<32) + l1; US> US>or something like that? If h2 does not change between readings, no US>wrap-around has occured. If l1 was read in between the readings of h1 US>and h2, the code above is sound. Right? 
I suppose this works only if it is guaranteed that the CPU modifying the
64-bit value does so faster than the CPU reading the data:

CPU1            CPU2
----            ----
write new h
                read h1 (new h)
                read l1 (old l)
                read h2 (new h)
write new l

It doesn't work either when the CPU first writes L and then H.

harti

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 15:30:13 2009
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 06C3B1065672;
    Sat, 19 Dec 2009 15:30:13 +0000 (UTC)
    (envelope-from brde@optusnet.com.au)
Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au [211.29.132.189])
    by mx1.freebsd.org (Postfix) with ESMTP id 796E58FC08;
    Sat, 19 Dec 2009 15:30:11 +0000 (UTC)
Received: from besplex.bde.org (c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116])
    by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id nBJFU3qZ026724
    (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
    Sun, 20 Dec 2009 02:30:04 +1100
Date: Sun, 20 Dec 2009 02:30:03 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Hans Petter Selasky
In-Reply-To: <200912191244.17803.hselasky@c2i.net>
Message-ID: <20091219232119.L1555@besplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
    <200912151313.28326.jhb@freebsd.org>
    <20091219112711.GR55913@acme.spoerlein.net>
    <200912191244.17803.hselasky@c2i.net>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-509698132-1261236603=:1555"
Cc: Ulrich Spörlein , Harti Brandt , freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-List-Received-Date: Sat, 19 Dec 2009 15:30:13 -0000

This message is in MIME format.
The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.

--0-509698132-1261236603=:1555
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Sat, 19 Dec 2009, Hans Petter Selasky wrote:

> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
>> On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
>>> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>>>> I see. I was also thinking along these lines, but was not sure whether
>>>> it is worth the trouble. I suppose this does not help to implement
>>>> 64-bit counters on 32-bit architectures, though, because you cannot
>>>> read them reliably without locking to sum them up, right?
>>>
>>> Either that or you just accept that you have a small race since it is
>>> only stats. :)
>>
>> This might be stupid, but can we not easily *read* 64bit counters
>> on 32bit machines like this:
>>
>> do {
>>     h1 = read_upper_32bits;
>>     l1 = read_lower_32bits;
>>     h2 = read_upper_32bits;
>>     l2 = read_lower_32bits; /* not needed */
>> } while (h1 != h2);

No.  See my previous^N reply, but don't see it about this since it was
wrong about this for all N :-).

> Just a comment. You know you don't need a while loop to get a stable value?
> Should be implemented like this, in my opinion:
>
> h1 = read_upper_32bits;
> l1 = read_lower_32bits;
> h2 = read_upper_32bits;
>
> if (h1 != h2)
>     l1 = 0xffffffffUL;
>
> sum64 = (h1<<32) | l1;

Also wrong :-).  Apart from write ordering problems (1), the write of
the second half (presumably the top half) might not have completed
when the above looks at it.  Then both of the above will see:
    h1 = old value
    l1 = new value
    h2 = old value (since the new one has not been written yet).
The race window for this can be arbitrarily long, since the second write
can be delayed for arbitrarily long (e.g., by interrupt handling).
Even if we ensure write ordering and no interrupts, the above has many
problems.
- we can't reasonably guarantee that the reads of l1 and h2 will execute
  sufficiently faster than the writes of l1 and h1/h2 so that the above
  will see h2 after l1.  I think the writes will usually go slightly
  faster since they will go through a write buffer, provided the 2 halves
  are in a single cache line, but this is unclear.  SMP with different
  CPU frequencies is not really supported, but works now modulo timecounter
  problems, and we probably want to support completely independent CPU
  frequencies, with some CPUs throttled.
- I don't understand why the above compares the high values.  Comparing the
  low values seems to work better.
- I can't see how to fix up the second method to be useful.  It is faster,
  but some delays seem to be necessary and they might as well be partly
  in a slower method.

Another try:
- enforce write ordering in writers and read ordering in the reader
- make sure that the reader runs fast enough.  This might require using
  critical_enter() in the reader.
- then many cases work with no complications:
  (a) if l1 is not small and not large, then h1 must be associated with l1,
      since then the low value can't roll over while we are looking
      at the pair, so the high value can't change while we are looking.
      So we can just use h1 and l1, without reading h2 or l2.
  (b) similarly, if l1 is small, then h2 is associated with l1.  So we
      can just use h2 with l1, without reading l2.
- otherwise, l1 is large:
  (c) if l1 = l2, then h1 must be associated with l1, since some writes
      of the high value associated with writing not-so-large low values
      have had plenty of time to complete (in fact, the one for (l1-2)
      or (l2-1) must have completed, and "large" only needs to be 1
      or 2 to ensure that these values don't wrap.  E.g., if l1 is
      0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
      wrapped recently so h1 hasn't incremented recently.
      So we can use h1 with l1, after reading l2 (still need to read h2,
      in case we don't get here).
  (d) if l1 != l2, then ordering implies that the write of the high value
      associated with l1 has completed.  We might have missed reading
      this value, since we might have read h1 too early and might have
      read h2 too late, but in the usual case h1 == h2 and then both
      h's are associated with l1, while if h1 != h2 then we can loop
      again and surely find h1 == h2 (and l1 small, so case (c)), or
      we can use the second method.  We had to read all 4 values to
      determine what to do here, and can usually use 2 of them directly.

(1) Write ordering is guaranteed on amd64 but I think not on all arches.
    You could pessimize PCPU_INC() with memory barriers on some arches to
    get write ordering.

(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
    this.  You cannot prevent the second write from being delayed for
    arbitrarily long by a trap and its handling, except by breaking
    at least the debugger traps needed to debug this.

In my previous^2 reply, I said heavyweight synchronization combined
with extra atomic generation counters would work.  The heavyweight
synchronization would have to be heavier than I thought -- it would
have to wait for all other CPUs to complete the pairs of writes for
64-bit counters, if any, and for this it would have to do more than
IPI's -- it should change priorities and reschedule to ensure that the
half-writes (if any) have a chance of completing soon...  Far too
complicated for this.  Disabling interrupts for the non-atomic PCPU_INC()s
is probably best.  Duh, this or worse (locking) is required on the
writer side anyway, else increments won't be atomic.  Locking would
actually automatically give the rescheduling stuff for the heavyweight
synchronization -- you would acquire the lock in the reader and of
course in the writers, and get priority propagation to complete the
one writer allowed to hold the lock iff any is holding it.
Locking might not be too bad for a few 64-bit counters.  So I've changed my
mind yet again and prefer locking to critical_enter().  It's cleaner
and works for traps.  I just remembered that rwatson went the opposite
way and changed some locking to critical_enter() in UMA.  I prefer the
old way, and at least in old versions of FreeBSD I got panics trying
to debug near this (single-stepping malloc()?).

In my previous^1 reply, I said lighter weight synchronization combined
with no extra or atomic counters (use counters as their own generation
counter) would work.  But the synchronization still needs to be heavy,
or interrupts disabled, as above.

Everything must be read before and after to test for getting a coherent
set of values, so the loop in the first method has the minimal number
of reads (for a single 64-bit counter).  With sync, the order for each
pair in it doesn't matter on either the reader or writer (there must be
a sync or 2 instead).

Bruce
--0-509698132-1261236603=:1555--

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 16:02:29 2009
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id DC466106566B for ;
    Sat, 19 Dec 2009 16:02:29 +0000 (UTC)
    (envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
    by mx1.freebsd.org (Postfix) with ESMTP id 557688FC25 for ;
    Sat, 19 Dec 2009 16:02:28 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de
    over TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959);
    Sat, 19 Dec 2009 17:02:27 +0100
Date: Sat, 19 Dec 2009 17:02:23 +0100 (CET)
From: Harti Brandt
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: Bruce Evans
In-Reply-To: <20091219232119.L1555@besplex.bde.org>
Message-ID: <20091219164818.L1741@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
    <200912151313.28326.jhb@freebsd.org>
<20091219112711.GR55913@acme.spoerlein.net> <200912191244.17803.hselasky@c2i.net> <20091219232119.L1555@besplex.bde.org> X-OpenPGP-Key: harti@freebsd.org MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="1964543108-214838482-1261238543=:1741" X-OriginalArrivalTime: 19 Dec 2009 16:02:27.0737 (UTC) FILETIME=[A964A090:01CA80C4] Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= , freebsd-arch@freebsd.org, Hans Petter Selasky Subject: Re: network statistics in SMP X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Harti Brandt List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 19 Dec 2009 16:02:29 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --1964543108-214838482-1261238543=:1741 Content-Type: TEXT/PLAIN; charset=koi8-r Content-Transfer-Encoding: QUOTED-PRINTABLE On Sun, 20 Dec 2009, Bruce Evans wrote: BE>On Sat, 19 Dec 2009, Hans Petter Selasky wrote: BE> BE>> On Saturday 19 December 2009 12:27:12 Ulrich Sp=F6rlein wrote: BE>> > On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote: BE>> > > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote: BE>> > > > I see. I was also thinking along these lines, but was not sure BE>> > > > whether BE>> > > > it is worth the trouble. I suppose this does not help to impleme= nt BE>> > > > 64-bit counters on 32-bit architectures, though, because you can= not BE>> > > > read them reliably without locking to sum them up, right? BE>> > >=20 BE>> > > Either that or you just accept that you have a small race since it= is BE>> > > only stats. 
:)
BE>> >
BE>> > This might be stupid, but can we not easily *read* 64bit counters
BE>> > on 32bit machines like this:
BE>> >
BE>> > do {
BE>> >     h1 = read_upper_32bits;
BE>> >     l1 = read_lower_32bits;
BE>> >     h2 = read_upper_32bits;
BE>> >     l2 = read_lower_32bits; /* not needed */
BE>> > } while (h1 != h2);
BE>
BE>No.  See my previous^N reply, but don't see it about this since it was
BE>wrong about this for all N :-).
BE>
BE>> Just a comment. You know you don't need a while loop to get a stable value?
BE>> Should be implemented like this, in my opinion:
BE>>
BE>> h1 = read_upper_32bits;
BE>> l1 = read_lower_32bits;
BE>> h2 = read_upper_32bits;
BE>>
BE>> if (h1 != h2)
BE>>     l1 = 0xffffffffUL;
BE>>
BE>> sum64 = (h1<<32) | l1;
BE>
BE>Also wrong :-).  Apart from write ordering problems (1), the write of
BE>the second half (presumably the top half) might not have completed when the
BE>above looks at it.  Then both of the above will see:
BE>    h1 = old value
BE>    l1 = new value
BE>    h2 = old value (since the new one has not been written yet).
BE>The race window for this can be arbitrarily long, since the second write
BE>can be delayed for arbitrarily long (e.g., by interrupt handling).  Even
BE>if we ensure write ordering and no interrupts, the above has many problems.
BE>- we can't reasonably guarantee that the reads of l1 and h2 will execute
BE>  sufficiently faster than the writes of l1 and h1/h2 so that the above
BE>  will see h2 after l1.  I think the writes will usually go slightly
BE>  faster since they will go through a write buffer, provided the 2 halves
BE>  are in a single cache line, but this is unclear.  SMP with different
BE>  CPU frequencies is not really supported, but works now modulo timecounter
BE>  problems, and we probably want to support completely independent CPU
BE>  frequencies, with some CPUs throttled.
BE>- I don't understand why the above compares the high values.
Comparing the
BE>  low values seems to work better.
BE>- I can't see how to fix up the second method to be useful.  It is faster,
BE>  but some delays seem to be necessary and they might as well be partly
BE>  in a slower method.
BE>
BE>Another try:
BE>- enforce write ordering in writers and read ordering in the reader
BE>- make sure that the reader runs fast enough.  This might require using
BE>  critical_enter() in the reader.
BE>- then many cases work with no complications:
BE>  (a) if l1 is not small and not large, then h1 must be associated with l1,
BE>      since then the low value can't roll over while we are looking
BE>      at the pair, so the high value can't change while we are looking.
BE>      So we can just use h1 and l1, without reading h2 or l2.
BE>  (b) similarly, if l1 is small, then h2 is associated with l1.  So we
BE>      can just use h2 with l1, without reading l2.
BE>- otherwise, l1 is large:
BE>  (c) if l1 = l2, then h1 must be associated with l1, since some writes
BE>      of the high value associated with writing not-so-large low values
BE>      have had plenty of time to complete (in fact, the one for (l1-2)
BE>      or (l2-1) must have completed, and "large" only needs to be 1
BE>      or 2 to ensure that these values don't wrap.  E.g., if l1 is
BE>      0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
BE>      wrapped recently so h1 hasn't incremented recently.  So we can
BE>      use h1 with l1, after reading l2 (still need to read h2, in case
BE>      we don't get here).
BE>  (d) if l1 != l2, then ordering implies that the write of the high value
BE>      associated with l1 has completed.  We might have missed reading
BE>      this value, since we might have read h1 too early and might have
BE>      read h2 too late, but in the usual case h1 == h2 and then both
BE>      h's are associated with l1, while if h1 != h2 then we can loop
BE>      again and surely find h1 == h2 (and l1 small, so case (c)), or
BE>      we can use the second method.
BE>      We had to read all 4 values to determine what to do here, and
BE>      can usually use 2 of them directly.
BE>
BE>(1) Write ordering is guaranteed on amd64 but I think not on all arches.
BE>    You could pessimize PCPU_INC() with memory barriers on some arches to
BE>    get write ordering.
BE>
BE>(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
BE>    this. You cannot prevent the second write from being delayed for
BE>    arbitrarily long by a trap and its handling, except by breaking
BE>    at least the debugger traps needed to debug this.
BE>
BE>In my previous^2 reply, I said heavyweight synchronization combined
BE>with extra atomic generation counters would work. The heavyweight
BE>synchronization would have to be heavier than I thought -- it would
BE>have to wait for all other CPUs to complete the pairs of writes for
BE>64-bit counters, if any, and for this it would have to do more than
BE>IPI's -- it should change priorities and reschedule to ensure that the
BE>half-writes (if any) have a chance of completing soon... Far too
BE>complicated for this. Disabling interrupts for the non-atomic PCPU_INC()s
BE>is probably best. Duh, this or worse (locking) is required on the
BE>writer side anyway, else increments won't be atomic. Locking would
BE>actually automatically give the rescheduling stuff for the heavyweight
BE>synchronization -- you would acquire the lock in the reader and of
BE>course in the writers, and get priority propagation to complete the
BE>one writer allowed to hold the lock iff any is holding it. Locking
BE>might not be too bad for a few 64-bit counters. So I've changed my
BE>mind yet again and prefer locking to critical_enter(). It's cleaner
BE>and works for traps. I just remembered that rwatson went the opposite
BE>way and changed some locking to critical_enter() in UMA. I prefer the
BE>old way, and at least in old versions of FreeBSD I got panics trying
BE>to debug near this (single-stepping malloc()?).
BE>
BE>In my previous^1 reply, I said lighter weight synchronization combined
BE>with no extra or atomic counters (use counters as their own generation
BE>counter) would work. But the synchronization still needs to be heavy,
BE>or interrupts disabled, as above.
BE>
BE>Everything must be read before and after to test for getting a coherent
BE>set of values, so the loop in the first method has the minimal number
BE>of reads (for a single 64-bit counter). With sync, the order for each
BE>pair in it doesn't matter on either the reader or writer (there must be
BE>a sync or 2 instead).

To be honest, I'm lost now. Couldn't we just use the largest atomic type
for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
thread?

Are the 5-6 atomic operations really that costly given the many operations
done on an IP packet? Are they more costly than a heavyweight sync for
each ++ or +=?

Or we could use the PCPU stuff, use just ++ and += for modifying the
statistics (32bit) and do the 32->64 bit stuff for all platforms with a
kernel thread per CPU (do we have this?). Between that thread and the
sysctl we could use a heavy sync.

Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
largest atomic type for the platform, handle the aggregation and (on IA32)
the 32->64 bit stuff in a kernel thread.

Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
machine and do routing at link speed, though. This might overflow the IP
input/output byte counter (which we don't have yet) too fast.
harti

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 17:15:51 2009
Date: Sun, 20 Dec 2009 04:15:42 +1100 (EST)
From: Bruce Evans
To: Harti Brandt
Cc: Ulrich Spörlein, freebsd-arch@freebsd.org, Hans Petter Selasky
Subject: Re: network statistics in SMP
Message-ID: <20091220032452.W2429@besplex.bde.org>
In-Reply-To: <20091219164818.L1741@beagle.kn.op.dlr.de>

On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
>
> [... complications]
>
> To be honest, I'm lost now.
> Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?

That's probably best (except without the atomic operations) (like I said
originally. I tried to spell out the complications to make it clear that
they would be too much except for incomplete ones).

> Are the 5-6 atomic operations really that costly given the many operations
> done on an IP packet? Are they more costly than a heavyweight sync for
> each ++ or +=?

rwatson found that even non-atomic operations are quite costly, since
at least on amd64 and i386, ones that write (or any access?) the same
address (or cache line?) apparently involve much the same hardware
activity (cache snoop?) as atomic ones implemented by locking the bus.
I think this is mostly historical -- it should be necessary to lock the
bus to get the slow version. Per-CPU counters give separate addresses
and also don't require the bus lock. I don't like the complexity for
per-CPU counters but don't use big SMP systems enough to know what the
locks cost in real applications.

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the
> sysctl we could use a heavy sync.

I don't like the squillions of threads in FreeBSD-post-4, but this seems
to need its own one and there isn't one yet AFAIK. I think a thread is
only needed for the 32-bit stuff (since aggregation has to use the
current values and it shouldn't have to ask a thread to sum them). The
thread should maintain only the high 32 or 33 bits of the 64-bit counters.
Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
that these bits can be accessed without locking. The synchronization is
still interesting.
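
A minimal sketch of the "thread maintains only the high bits" idea, assuming
the fixup code looks at each 32-bit counter at least once per wrap period.
The names ext_counter and ext_counter_poll are made up for illustration and
are not existing FreeBSD interfaces. The synchronization between the poller
and any reader of the 64-bit value is deliberately left out, since that is
the open question here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping kept by the fixup thread for one counter. */
struct ext_counter {
	uint32_t last_low;	/* low word seen at the previous poll */
	uint32_t high;		/* number of wraps observed so far */
};

/*
 * Fold the current value of a 32-bit counter into a 64-bit value.
 * Assumes the counter wrapped at most once since the previous poll.
 */
static uint64_t
ext_counter_poll(struct ext_counter *ec, uint32_t low_now)
{
	assert(ec != NULL);
	if (low_now < ec->last_low)	/* low word went backwards: it wrapped */
		ec->high++;
	ec->last_low = low_now;
	return (((uint64_t)ec->high << 32) | low_now);
}
```

If the poller runs too rarely and the counter wraps twice between polls, one
wrap is silently lost, so the poll interval has to be chosen against the
fastest counter.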
> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
> largest atomic type for the platform, handle the aggregation and (on IA32)
> the 32->64 bit stuff in a kernel thread.

I don't see why using atomic or locks for just the 64 bit counters is good.
We will probably end up with too many 64-bit counters, especially if they
don't cost much when not read.

I just thought of another implementation to reduce reads: trap on
overflow and handle all the complications in the trap handler, or
just set a flag to tell the fixup thread to run and normally don't
run the fixup thread. This seems to not quite work -- arranging
for the trap would be costly (needs "into" instruction on i386?).
Similarly for explicit tests for wraparound (PCPU_INC() could be a
function call that does the test and handles wraparound in a fully
locked fashion. We don't care that this code executes slowly since
it rarely executes, but we care that the test pessimizes the usual
case).

There is also "lock cmpxchg8b" on i386. I think this can be used in a
loop to implement atomic 64-bit ops (?). Simpler, but slower in
PCPU_INC(). I prefer a function call version of PCPU_INC() to this.
That should be faster in the usual case and only much larger if we
have too many 64-bit counters.

> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> machine and do routing at link speed, though. This might overflow the IP
> input/output byte counter (which we don't have yet) too fast.

Not with a mere 10Gbit/s. That's ~1GB/s so it takes 4 seconds to overflow
a 32-bit byte counter. A bit counter would take a while to overflow too.
Are there any faster incrementors? TSCs also take O(1) seconds to overflow,
and timecounter logic depends on no timecounter overflowing much faster
than that.
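
For reference, the wrap time of a 32-bit byte counter can be worked out
directly. The helper below is only a sketch (wrap_seconds_u32_bytes is a
made-up name), and it assumes the link is fully loaded with the counter
incremented once per byte on the wire.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32-bit byte counter wraps at the given aggregate bit rate. */
static double
wrap_seconds_u32_bytes(double bits_per_second)
{
	const double range_bytes = 4294967296.0;	/* 2^32 */

	assert(bits_per_second > 0.0);
	return (range_bytes / (bits_per_second / 8.0));
}
```

At 10 Gbit/s this gives about 3.4 seconds. With an assumed aggregate of
80 Gbit/s (for example four 10 Gbit/s adapters counted in both directions)
it drops to about 0.43 seconds.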
Bruce

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 20:01:35 2009
Date: Sat, 19 Dec 2009 21:01:35 +0100 (CET)
From: Harti Brandt
To: Bruce Evans
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
Message-ID: <20091219204217.D1741@beagle.kn.op.dlr.de>
In-Reply-To: <20091220032452.W2429@besplex.bde.org>

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Harti Brandt wrote:
BE>
BE>> On Sun, 20 Dec 2009, Bruce Evans wrote:
BE>>
BE>> [... complications]
BE>>
BE>> To be honest, I'm lost now.
BE>> Couldn't we just use the largest atomic type
BE>> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
BE>> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
BE>> thread?
BE>
BE>That's probably best (except without the atomic operations) (like I said
BE>originally. I tried to spell out the complications to make it clear that
BE>they would be too much except for incomplete ones).
BE>
BE>> Are the 5-6 atomic operations really that costly given the many operations
BE>> done on an IP packet? Are they more costly than a heavyweight sync for
BE>> each ++ or +=?
BE>
BE>rwatson found that even non-atomic operations are quite costly, since
BE>at least on amd64 and i386, ones that write (or any access?) the same
BE>address (or cache line?) apparently involve much the same hardware
BE>activity (cache snoop?) as atomic ones implemented by locking the bus.
BE>I think this is mostly historical -- it should be necessary to lock the
BE>bus to get the slow version. Per-CPU counters give separate addresses
BE>and also don't require the bus lock. I don't like the complexity for
BE>per-CPU counters but don't use big SMP systems enough to know what the
BE>locks cost in real applications.
BE>
BE>> Or we could use the PCPU stuff, use just ++ and += for modifying the
BE>> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
BE>> kernel thread per CPU (do we have this?). Between that thread and the
BE>> sysctl we could use a heavy sync.
BE>
BE>I don't like the squillions of threads in FreeBSD-post-4, but this seems
BE>to need its own one and there isn't one yet AFAIK. I think a thread is
BE>only needed for the 32-bit stuff (since aggregation has to use the
BE>current values and it shouldn't have to ask a thread to sum them). The
BE>thread should maintain only the high 32 or 33 bits of the 64-bit counters.
BE>Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
BE>that these bits can be accessed without locking.
BE>The synchronization is still interesting.
BE>
BE>> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
BE>> largest atomic type for the platform, handle the aggregation and (on IA32)
BE>> the 32->64 bit stuff in a kernel thread.
BE>
BE>I don't see why using atomic or locks for just the 64 bit counters is good.
BE>We will probably end up with too many 64-bit counters, especially if they
BE>don't cost much when not read.

On a 32-bit arch when reading a 32-bit value on one CPU while the other
CPU is modifying it, the read will probably be always correct given the
variable is correctly aligned. On a 64-bit arch when reading a 64-bit
value on one CPU while the other one is adding to it, do I always get the
correct value? I'm not sure about this, which is why I put atomic_*()
there, assuming that they will make this correct.

The idea is (for 32-bit platforms):

    struct pcpu_stats {
        uint32_t in_bytes;
        uint32_t in_packets;
    };

    struct pcpu_hc_stats {
        uint64_t hc_in_bytes;
        uint64_t hc_in_packets;
    };

    /* driver; IP stack; ... */
    ...
    pcpu_stats->in_bytes += bytes;
    pcpu_stats->in_packets++;
    ...

    /* per CPU kernel thread for 32-bit arch */
    lock(pcpu_hc_stats);
    ...
    val = pcpu_stats->in_bytes;
    /* low word went backwards: the 32-bit counter wrapped, so carry */
    if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
        pcpu_hc_stats->hc_in_bytes += 0x100000000ULL;
    /* keep the high word, replace the low word with the current value */
    pcpu_hc_stats->hc_in_bytes =
        (pcpu_hc_stats->hc_in_bytes & 0xffffffff00000000ULL) | val;
    ...
    unlock(pcpu_hc_stats);

    /* sysctl */
    memset(&stats, 0, sizeof(stats));
    foreach(cpu) {
        lock(pcpu_hc_stats(cpu));
        ...
        stats.in_bytes += pcpu_hc_stats(cpu)->hc_in_bytes;
        ...
        unlock(pcpu_hc_stats(cpu));
    }
    copyout(stats);

On 64-bit archs we can go without the locks and the thread given that we
can reliably read the 64-bit per CPU numbers (can we?).

BE>I just thought of another implementation to reduce reads: trap on
BE>overflow and handle all the complications in the trap handler, or
BE>just set a flag to tell the fixup thread to run and normally don't
BE>run the fixup thread.
BE>This seems to not quite work -- arranging
BE>for the trap would be costly (needs "into" instruction on i386?).
BE>Similarly for explicit tests for wraparound (PCPU_INC() could be a
BE>function call that does the test and handles wraparound in a fully
BE>locked fashion. We don't care that this code executes slowly since
BE>it rarely executes, but we care that the test pessimizes the usual
BE>case).
BE>
BE>There is also "lock cmpxchg8b" on i386. I think this can be used in a
BE>loop to implement atomic 64-bit ops (?). Simpler, but slower in
BE>PCPU_INC(). I prefer a function call version of PCPU_INC() to this.
BE>That should be faster in the usual case and only much larger if we
BE>have too many 64-bit counters.
BE>
BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
BE>> machine and do routing at link speed, though. This might overflow the IP
BE>> input/output byte counter (which we don't have yet) too fast.
BE>
BE>Not with a mere 10Gbit/s. That's ~1GB/s so it takes 4 seconds to overflow
BE>a 32-bit byte counter. A bit counter would take a while to overflow too.
BE>Are there any faster incrementors? TSCs also take O(1) seconds to overflow,
BE>and timecounter logic depends on no timecounter overflowing much faster
BE>than that.

If you have 4 10GBit/s adapters each operating full-duplex at link speed
you wrap in under 0.5 seconds, maybe even faster if you have some kind of
tunnels where each packet counts several times.
But I suppose this will not be so easy to implement with IA32 :-)

harti

From owner-freebsd-arch@FreeBSD.ORG Sat Dec 19 21:09:11 2009
Date: Sat, 19 Dec 2009 21:32:19 +0100
From: Ulrich Spörlein
To: Harti Brandt
Cc: bde@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
Message-ID: <20091219203219.GS55913@acme.spoerlein.net>
In-Reply-To: <20091219154206.E93919@beagle.kn.op.dlr.de>

On Sat, 19.12.2009 at 15:56:38 +0100, Harti Brandt wrote:
> On Sat, 19 Dec 2009, Ulrich Spörlein wrote:
>
> >On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
> >> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> >> > I see. I was also thinking along these lines, but was not sure whether it
> >> > is worth the trouble. I suppose this does not help to implement 64-bit
> >> > counters on 32-bit architectures, though, because you cannot read them
> >> > reliably without locking to sum them up, right?
> >>
> >> Either that or you just accept that you have a small race since it is only stats. :)
> >
> >This might be stupid, but can we not easily *read* 64bit counters
> >on 32bit machines like this:
> >
> >do {
> >    h1 = read_upper_32bits;
> >    l1 = read_lower_32bits;
> >    h2 = read_upper_32bits;
> >    l2 = read_lower_32bits; /* not needed */
> >} while (h1 != h2);
> >
> >sum64 = (h1<<32) + l1;
> >
> >or something like that? If h2 does not change between readings, no
> >wrap-around has occurred. If l1 was read in between the readings of h1
> >and h2, the code above is sound. Right?
>
> I suppose this works only if it would be guaranteed that the CPU modifying
> the 64-bit value does this somehow faster than the CPU reading the data:
>
> CPU1                CPU2
> ----                ----
> write new h
>                     read h1 (new h)
>                     read l1 (old l)
>                     read h2 (new h)
> write new l
>
> It doesn't work either when the CPU first writes L and then H.

To be honest, I didn't even think about the 64 bit writes being
non-atomic, too. So, of course my suggestion was way too naive. Also
thanks to Bruce for re-iterating the whole write/read ordering stuff yet
again. :)

Regards,
Uli
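
The "extra generation counter" variant mentioned earlier in the thread can
be sketched in user-space C11 as below. This is not code from the thread or
from FreeBSD: seq_counter and its functions are hypothetical names, it
assumes exactly one writer per counter (for example one per CPU), and it
uses the default sequentially consistent atomics for simplicity rather than
tuned barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names; not an existing kernel interface. */
struct seq_counter {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	_Atomic uint32_t lo;	/* low half of the 64-bit count */
	_Atomic uint32_t hi;	/* high half of the 64-bit count */
};

/* Writer side; assumes a single writer per counter. */
static void
seq_counter_add(struct seq_counter *c, uint32_t n)
{
	unsigned int s = atomic_load(&c->seq);
	uint32_t old_lo = atomic_load(&c->lo);
	uint32_t new_lo = old_lo + n;

	assert((s & 1) == 0);			/* no update already in progress */
	atomic_store(&c->seq, s + 1);		/* mark update in progress */
	atomic_store(&c->lo, new_lo);
	if (new_lo < old_lo)			/* low half wrapped: carry */
		atomic_store(&c->hi, atomic_load(&c->hi) + 1);
	atomic_store(&c->seq, s + 2);		/* mark update complete */
}

/* Reader side: retry until both halves belong to the same generation. */
static uint64_t
seq_counter_read(struct seq_counter *c)
{
	unsigned int s1, s2;
	uint32_t lo, hi;

	do {
		s1 = atomic_load(&c->seq);
		lo = atomic_load(&c->lo);
		hi = atomic_load(&c->hi);
		s2 = atomic_load(&c->seq);
	} while (s1 != s2 || (s1 & 1) != 0);
	return (((uint64_t)hi << 32) | lo);
}
```

A writer that is interrupted between its two stores to seq keeps every
reader spinning until it resumes, which is the delay problem Bruce describes,
and the two extra stores per increment are exactly the kind of writer-side
cost the thread is trying to avoid.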