From owner-freebsd-arch@FreeBSD.ORG  Sun Dec 13 20:00:08 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E8E78106566C;
	Sun, 13 Dec 2009 20:00:07 +0000 (UTC) (envelope-from bz@FreeBSD.org)
Received: from mail.cksoft.de (mail.cksoft.de [IPv6:2001:4068:10::3])
	by mx1.freebsd.org (Postfix) with ESMTP id 6A0128FC13;
	Sun, 13 Dec 2009 20:00:07 +0000 (UTC)
Received: from localhost (amavis.fra.cksoft.de [192.168.74.71])
	by mail.cksoft.de (Postfix) with ESMTP id 59EE641C752;
	Sun, 13 Dec 2009 21:00:06 +0100 (CET)
X-Virus-Scanned: amavisd-new at cksoft.de
Received: from mail.cksoft.de ([192.168.74.103])
	by localhost (amavis.fra.cksoft.de [192.168.74.71]) (amavisd-new,
	port 10024)
	with ESMTP id H6Eh4T16LjOt; Sun, 13 Dec 2009 21:00:05 +0100 (CET)
Received: by mail.cksoft.de (Postfix, from userid 66)
	id C496D41C751; Sun, 13 Dec 2009 21:00:05 +0100 (CET)
Received: from maildrop.int.zabbadoz.net (maildrop.int.zabbadoz.net
	[10.111.66.10])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mail.int.zabbadoz.net (Postfix) with ESMTP id 604354448EC;
	Sun, 13 Dec 2009 19:55:58 +0000 (UTC)
Date: Sun, 13 Dec 2009 19:55:58 +0000 (UTC)
From: "Bjoern A. Zeeb" <bz@FreeBSD.org>
X-X-Sender: bz@maildrop.int.zabbadoz.net
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <20091026185459.U91695@maildrop.int.zabbadoz.net>
Message-ID: <20091213195501.H86040@maildrop.int.zabbadoz.net>
References: <20091025134226.Q91695@maildrop.int.zabbadoz.net>
	<200910260830.25168.jhb@freebsd.org>
	<20091026185459.U91695@maildrop.int.zabbadoz.net>
X-OpenPGP-Key: 0x14003F198FEFA3E77207EE8D2B58B8F83CCF1842
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@freebsd.org
Subject: Re: src/Makefile, universe, LINT, VIMAGE, ..
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 13 Dec 2009 20:00:08 -0000

On Mon, 26 Oct 2009, Bjoern A. Zeeb wrote:

Hi,

> On Mon, 26 Oct 2009, John Baldwin wrote:
>
> Hi,
>
>>> @@ -345,3 +333,18 @@
>>>   	fi
>>>   .endif
>>>   .endif
>>> +
>>> +universe_kernels: universe_kernels_foo
>>> +TARGET?=	${BUILD_ARCH}
>>> +KERNCONFS!=	cd ${.CURDIR}/sys/${TARGET}/conf && \
>>> +		find [A-Z0-9]*[A-Z0-9] -type f -maxdepth 0 \
>>> +		! -name DEFAULTS ! -name NOTES
>>> +KERNCONFS:=	${KERNCONFS}
>>> +universe_kernels_foo:
>>> +.for kernel in ${KERNCONFS}
>>> +	@(cd ${.CURDIR} && env __MAKE_CONF=/dev/null \
>>> +	    ${MAKE} ${JFLAG} buildkernel TARGET=${TARGET} KERNCONF=${kernel} \
>>> +	    > _.${TARGET}.${kernel} 2>&1 || \
>>> +	    (echo "${TARGET} ${kernel} kernel failed," \
>>> +	    "check _.${TARGET}.${kernel} for details"| ${MAKEFAIL}))
>>> +.endfor
>> 
>> Hmm, I'm not sure why you need a universe_kernels_foo target that
>> universe_kernels depends on?
>
> This is all about make and the variables after a target and within a
> target. Whatever else I tried, make complained.  If you know the
> right/better solution that works I'll be happy to simplify this and
> update the patch.
>
> It shouldn't be named _foo though;)
>
>
>> Also, I would probably prefer to have
>> universe_kernels come after universe_$target and before universe_epilogue.
>
> I think it should be possible to sneak it in after the .endfor.

I fixed those; to do so I also needed to allow the new target in the
outer .if make() check.


>>> Index: sys/conf/makeLINT.mk
>>> ===================================================================
>>> --- sys/conf/makeLINT.mk	(revision 198467)
>>> +++ sys/conf/makeLINT.mk	(working copy)
>>> @@ -5,7 +5,15 @@
>>>
>>>   clean:
>>>   	rm -f LINT
>>> +.if ${TARGET} == "amd64" || ${TARGET} == "i386"
>>> +	rm -f LINT=VIMAGE
>>> +.endif
>> 
>> s/=/-/
>
> Yeah, everyone notices that one; it should be fixed in the patch at the
> URL originally referenced.
>
>> BTW, I'm not sure why you would only enable VIMAGE for these two archs
>> rather than doing it for all archs that have a LINT?
>
> Because it'll usually simply not make any sense to build a VIMAGE
> kernel for embedded platforms like arm, ...  Also make universe time
> increases significantly with any platform; indeed amd64 is the worst
> now (again).  We can talk about the proper set and I had thought of
> sparc64 as well.  Obviously just building it everywhere simplifies
> things.


An updated patch to test would be here:
http://people.freebsd.org/~bz/20091213-01-make-LINT-VIMAGE.diff

/bz

-- 
Bjoern A. Zeeb         It will not break if you know what you are doing.

From owner-freebsd-arch@FreeBSD.ORG  Mon Dec 14 11:06:50 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F1EFB1065672
	for <freebsd-arch@FreeBSD.org>; Mon, 14 Dec 2009 11:06:50 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id C84778FC15
	for <freebsd-arch@FreeBSD.org>; Mon, 14 Dec 2009 11:06:50 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id nBEB6owH075868
	for <freebsd-arch@FreeBSD.org>; Mon, 14 Dec 2009 11:06:50 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.3/8.14.3/Submit) id nBEB6otM075866
	for freebsd-arch@FreeBSD.org; Mon, 14 Dec 2009 11:06:50 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 14 Dec 2009 11:06:50 GMT
Message-Id: <200912141106.nBEB6otM075866@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
	owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Cc: 
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 14 Dec 2009 11:06:51 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.


From owner-freebsd-arch@FreeBSD.ORG  Mon Dec 14 16:47:02 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4FE80106568B;
	Mon, 14 Dec 2009 16:47:02 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 1D0FF8FC1E;
	Mon, 14 Dec 2009 16:47:02 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id C467D46B32;
	Mon, 14 Dec 2009 11:47:01 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id 25DE38A024;
	Mon, 14 Dec 2009 11:47:01 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: "Bjoern A. Zeeb" <bz@freebsd.org>
Date: Mon, 14 Dec 2009 11:19:36 -0500
User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; )
References: <20091025134226.Q91695@maildrop.int.zabbadoz.net>
	<20091026185459.U91695@maildrop.int.zabbadoz.net>
	<20091213195501.H86040@maildrop.int.zabbadoz.net>
In-Reply-To: <20091213195501.H86040@maildrop.int.zabbadoz.net>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <200912141119.36165.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Mon, 14 Dec 2009 11:47:01 -0500 (EST)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE
	autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: freebsd-arch@freebsd.org
Subject: Re: src/Makefile, universe, LINT, VIMAGE, ..
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 14 Dec 2009 16:47:02 -0000

On Sunday 13 December 2009 2:55:58 pm Bjoern A. Zeeb wrote:
> >> Also, I would probably prefer to have
> >> universe_kernels come after universe_$target and before universe_epilogue.
> >
> > I think it should be possible to sneak it in after the .endfor.
> 
> I fixed those; I needed to allow the target for the outer .if make()
> though with that.

I think you can drop the 'KERNCONFS:= ${KERNCONFS}' line now.

> >>> Index: sys/conf/makeLINT.mk
> >>> ===================================================================
> >>> --- sys/conf/makeLINT.mk	(revision 198467)
> >>> +++ sys/conf/makeLINT.mk	(working copy)
> >>> @@ -5,7 +5,15 @@
> >>>
> >>>   clean:
> >>>   	rm -f LINT
> >>> +.if ${TARGET} == "amd64" || ${TARGET} == "i386"
> >>> +	rm -f LINT=VIMAGE
> >>> +.endif
> >> 
> >> s/=/-/
> >
> > Yeah, everyone notices that one; it should be fixed in the patch at the
> > URL originally referenced.

This is still here. :)

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Mon Dec 14 21:05:06 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 95E331065697;
	Mon, 14 Dec 2009 21:05:06 +0000 (UTC) (envelope-from bz@FreeBSD.org)
Received: from mail.cksoft.de (mail.cksoft.de [IPv6:2001:4068:10::3])
	by mx1.freebsd.org (Postfix) with ESMTP id 52B4D8FC16;
	Mon, 14 Dec 2009 21:05:06 +0000 (UTC)
Received: from localhost (amavis.fra.cksoft.de [192.168.74.71])
	by mail.cksoft.de (Postfix) with ESMTP id B351341C75B;
	Mon, 14 Dec 2009 22:05:05 +0100 (CET)
X-Virus-Scanned: amavisd-new at cksoft.de
Received: from mail.cksoft.de ([192.168.74.103])
	by localhost (amavis.fra.cksoft.de [192.168.74.71]) (amavisd-new,
	port 10024)
	with ESMTP id 2sf-fU5cwaob; Mon, 14 Dec 2009 22:05:05 +0100 (CET)
Received: by mail.cksoft.de (Postfix, from userid 66)
	id 3F0FD41C75A; Mon, 14 Dec 2009 22:05:05 +0100 (CET)
Received: from maildrop.int.zabbadoz.net (maildrop.int.zabbadoz.net
	[10.111.66.10])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mail.int.zabbadoz.net (Postfix) with ESMTP id D5FC74448EC;
	Mon, 14 Dec 2009 21:01:41 +0000 (UTC)
Date: Mon, 14 Dec 2009 21:01:41 +0000 (UTC)
From: "Bjoern A. Zeeb" <bz@FreeBSD.org>
X-X-Sender: bz@maildrop.int.zabbadoz.net
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <200912141119.36165.jhb@freebsd.org>
Message-ID: <20091214210054.A86040@maildrop.int.zabbadoz.net>
References: <20091025134226.Q91695@maildrop.int.zabbadoz.net>
	<20091026185459.U91695@maildrop.int.zabbadoz.net>
	<20091213195501.H86040@maildrop.int.zabbadoz.net>
	<200912141119.36165.jhb@freebsd.org>
X-OpenPGP-Key: 0x14003F198FEFA3E77207EE8D2B58B8F83CCF1842
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-arch@freebsd.org
Subject: Re: src/Makefile, universe, LINT, VIMAGE, ..
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 14 Dec 2009 21:05:06 -0000

On Mon, 14 Dec 2009, John Baldwin wrote:

> I think you can drop the 'KERNCONFS:= ${KERNCONFS}' line now.

So I did; thanks.

>>>>> Index: sys/conf/makeLINT.mk
>>>>> ===================================================================
>>>>> --- sys/conf/makeLINT.mk	(revision 198467)
>>>>> +++ sys/conf/makeLINT.mk	(working copy)
>>>>> @@ -5,7 +5,15 @@
>>>>>
>>>>>   clean:
>>>>>   	rm -f LINT
>>>>> +.if ${TARGET} == "amd64" || ${TARGET} == "i386"
>>>>> +	rm -f LINT=VIMAGE
>>>>> +.endif
>>>>
>>>> s/=/-/
>>>
>>> Yeah, everyone notices that one; it should be fixed in the patch at the
>>> URL originally referenced.
>
> This is still here. :)

*grump*  I had fixed it in the patch but not in my working tree.

New try:
http://people.freebsd.org/~bz/20091214-01-make-LINT-VIMAGE.diff

-- 
Bjoern A. Zeeb         It will not break if you know what you are doing.

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 15 09:50:14 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4EB671065670
	for <arch@freebsd.org>; Tue, 15 Dec 2009 09:50:14 +0000 (UTC)
	(envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
	by mx1.freebsd.org (Postfix) with ESMTP id D00F68FC20
	for <arch@freebsd.org>; Tue, 15 Dec 2009 09:50:13 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de over
	TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); 
	Tue, 15 Dec 2009 10:38:07 +0100
Date: Tue, 15 Dec 2009 10:38:04 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: arch@freebsd.org
Message-ID: <20091215103759.P97203@beagle.kn.op.dlr.de>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-OriginalArrivalTime: 15 Dec 2009 09:38:07.0697 (UTC)
	FILETIME=[4EE2C410:01CA7D6A]
Cc: 
Subject: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt <harti@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Dec 2009 09:50:14 -0000

Hi all,

I'm working on our network statistics (in the context of SNMP) and wonder, 
to what extent we want them to be correct. I've re-read part of the past 
discussions about 64-bit counters on 32-bit archs and got the impression, 
that there are users that would like to have almost correct statistics 
(for accounting, for example). If this is the case I wonder whether the 
way we do the statistics today is correct.

Basically all statistics are incremented or added to simply by a += b or 
a++. As I understand, this worked fine in the old days, where you had 
spl*() calls at the right places. Nowadays when everything is SMP 
shouldn't we use at least atomic operations for this? Also I read that on 
architectures where cache coherency is not implemented in hardware even 
this does not help (I found a mail from jhb why for the mutex 
implementation this is not a problem, but I don't understand what to do 
for the += and ++ operations). I failed to find a way, though, to 
influence the caching policy (is there a function one can call to 
change the policy?).

Any opinions?
harti

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 15 16:43:03 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E3BDC1065697;
	Tue, 15 Dec 2009 16:43:03 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id B606C8FC1A;
	Tue, 15 Dec 2009 16:43:03 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 5859B46B23;
	Tue, 15 Dec 2009 11:43:03 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id A40E58A01B;
	Tue, 15 Dec 2009 11:43:02 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org,
 Harti Brandt <harti@freebsd.org>
Date: Tue, 15 Dec 2009 08:12:35 -0500
User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; )
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
In-Reply-To: <20091215103759.P97203@beagle.kn.op.dlr.de>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <200912150812.35521.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Tue, 15 Dec 2009 11:43:02 -0500 (EST)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,
	DATE_IN_PAST_03_06,RDNS_NONE autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: 
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Dec 2009 16:43:04 -0000

On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
> Hi all,
> 
> I'm working on our network statistics (in the context of SNMP) and wonder, 
> to what extent we want them to be correct. I've re-read part of the past 
> discussions about 64-bit counters on 32-bit archs and got the impression, 
> that there are users that would like to have almost correct statistics 
> (for accounting, for example). If this is the case I wonder whether the 
> way we do the statistics today is correct.
> 
> Basically all statistics are incremented or added to simply by a += b or 
> a++. As I understand, this worked fine in the old days, where you had 
> spl*() calls at the right places. Nowadays when everything is SMP 
> shouldn't we use at least atomic operations for this? Also I read that on 
> architectures where cache coherency is not implemented in hardware even 
> this does not help (I found a mail from jhb why for the mutex 
> implementation this is not a problem, but I don't understand what to do 
> for the += and ++ operations). I failed to find a way, though, to 
> influence the caching policy (is there a function one can call to 
> change the policy?).

Atomic ops will always work for reliable statistics.  However, I believe 
Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to 
what we do now for many of the 'cnt' stats (context switches, etc.).  For 
'cnt' each CPU has its own count of stats that are updated using non-atomic 
ops (since they are CPU local).  sysctl handlers then sum up the various
per-CPU counts to report global counts to userland.
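
The per-CPU scheme described above can be sketched roughly as follows. This
is a simplified userland model, not the kernel's actual 'cnt' code:
DEMO_MAXCPU, struct demo_pcpu_stats and the demo_*() functions are
hypothetical names, and the CPU index is passed in explicitly instead of
being taken from the running CPU.

```c
#include <assert.h>

#define DEMO_MAXCPU	4	/* hypothetical; stands in for the real CPU count */

/* One private copy of the counters per CPU (hypothetical layout). */
struct demo_pcpu_stats {
	unsigned long ctxsw;
};

static struct demo_pcpu_stats demo_pcpu[DEMO_MAXCPU];

/*
 * Update path: only CPU 'cpu' ever writes its own slot, so a plain
 * non-atomic increment is enough.
 */
static void
demo_count_ctxsw(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_MAXCPU);
	demo_pcpu[cpu].ctxsw++;
}

/*
 * Read path: what a sysctl handler would do, namely add up the
 * per-CPU values to produce the global count reported to userland.
 */
static unsigned long
demo_sum_ctxsw(void)
{
	unsigned long total = 0;

	for (int i = 0; i < DEMO_MAXCPU; i++)
		total += demo_pcpu[i].ctxsw;
	return (total);
}
```

A real kernel version would also have to make sure the thread is not moved
to another CPU between picking its slot and incrementing it; that part is
left out here. The sum is a snapshot and may lag slightly behind updates
that are in flight on other CPUs.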

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 15 17:07:56 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C523C106568F
	for <freebsd-arch@freebsd.org>; Tue, 15 Dec 2009 17:07:56 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outB.internet-mail-service.net (outb.internet-mail-service.net
	[216.240.47.225])
	by mx1.freebsd.org (Postfix) with ESMTP id A9F408FC12
	for <freebsd-arch@freebsd.org>; Tue, 15 Dec 2009 17:07:56 +0000 (UTC)
Received: from idiom.com (mx0.idiom.com [216.240.32.160])
	by out.internet-mail-service.net (Postfix) with ESMTP id 1C08D44441;
	Tue, 15 Dec 2009 09:07:57 -0800 (PST)
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org
	(h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137])
	by idiom.com (Postfix) with ESMTP id DB9182D6014;
	Tue, 15 Dec 2009 09:07:55 -0800 (PST)
Message-ID: <4B27C279.8030402@elischer.org>
Date: Tue, 15 Dec 2009 09:08:09 -0800
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812)
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
In-Reply-To: <200912150812.35521.jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Harti Brandt <harti@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Dec 2009 17:07:56 -0000

John Baldwin wrote:
> On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
>> Hi all,
>>
>> I'm working on our network statistics (in the context of SNMP) and wonder, 
>> to what extent we want them to be correct. I've re-read part of the past 
>> discussions about 64-bit counters on 32-bit archs and got the impression, 
>> that there are users that would like to have almost correct statistics 
>> (for accounting, for example). If this is the case I wonder whether the 
>> way we do the statistics today is correct.
>>
>> Basically all statistics are incremented or added to simply by a += b or 
>> a++. As I understand, this worked fine in the old days, where you had 
>> spl*() calls at the right places. Nowadays when everything is SMP 
>> shouldn't we use at least atomic operations for this? Also I read that on 
>> architectures where cache coherency is not implemented in hardware even 
>> this does not help (I found a mail from jhb why for the mutex 
>> implementation this is not a problem, but I don't understand what to do 
>> for the += and ++ operations). I failed to find a way, though, to 
>> influence the caching policy (is there a function one can call to 
>> change the policy?).
> 
> Atomic ops will always work for reliable statistics.  However, I believe 
> Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to 
> what we do now for many of the 'cnt' stats (context switches, etc.).  For 
> 'cnt' each CPU has its own count of stats that are updated using non-atomic 
> ops (since they are CPU local).  sysctl handlers then sum up the various
> per-CPU counts to report global counts to userland.

The trouble is that PCPU and VNET collide.  You then need to have
per-CPU, per-VNET counters, which would be yet another pool of
linker set symbols.
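
As a rough illustration of what "per-CPU, per-VNET" means for the data
layout, here is a sketch in which every network stack instance carries its
own array of per-CPU slots. It deliberately uses plain dynamic allocation
instead of linker sets, and struct demo_vnet, DEMO_NCPU and the
demo_vnet_*() functions are hypothetical names, not the VNET implementation.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NCPU	4	/* hypothetical CPU count */

/* Hypothetical per-instance state: one counter slot per CPU. */
struct demo_vnet {
	unsigned long ipackets[DEMO_NCPU];
};

/* Create a network stack instance with all counters zeroed. */
static struct demo_vnet *
demo_vnet_alloc(void)
{
	return (calloc(1, sizeof(struct demo_vnet)));
}

/* Update path: indexed first by instance, then by the updating CPU. */
static void
demo_vnet_count(struct demo_vnet *vn, int cpu)
{
	assert(vn != NULL && cpu >= 0 && cpu < DEMO_NCPU);
	vn->ipackets[cpu]++;
}

/* Read path: sum the per-CPU slots of one instance only. */
static unsigned long
demo_vnet_sum(const struct demo_vnet *vn)
{
	unsigned long total = 0;

	for (int i = 0; i < DEMO_NCPU; i++)
		total += vn->ipackets[i];
	return (total);
}
```

The storage cost grows with the product of instances and CPUs, which is one
reason a separate pool for such counters would be needed.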



From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 15 17:45:17 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 678A81065670
	for <freebsd-arch@freebsd.org>; Tue, 15 Dec 2009 17:45:17 +0000 (UTC)
	(envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp3.dlr.de (smtp3.dlr.de [129.247.252.33])
	by mx1.freebsd.org (Postfix) with ESMTP id F238F8FC16
	for <freebsd-arch@freebsd.org>; Tue, 15 Dec 2009 17:45:16 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp3.dlr.de over
	TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); 
	Tue, 15 Dec 2009 18:45:15 +0100
Date: Tue, 15 Dec 2009 18:45:13 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <200912150812.35521.jhb@freebsd.org>
Message-ID: <20091215183859.S53283@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-OriginalArrivalTime: 15 Dec 2009 17:45:15.0071 (UTC)
	FILETIME=[5BC21CF0:01CA7DAE]
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt <harti@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Dec 2009 17:45:17 -0000

On Tue, 15 Dec 2009, John Baldwin wrote:

JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
JB>> Hi all,
JB>> 
JB>> I'm working on our network statistics (in the context of SNMP) and wonder, 
JB>> to what extent we want them to be correct. I've re-read part of the past 
JB>> discussions about 64-bit counters on 32-bit archs and got the impression, 
JB>> that there are users that would like to have almost correct statistics 
JB>> (for accounting, for example). If this is the case I wonder whether the 
JB>> way we do the statistics today is correct.
JB>> 
JB>> Basically all statistics are incremented or added to simply by a += b or 
JB>> a++. As I understand, this worked fine in the old days, where you had 
JB>> spl*() calls at the right places. Nowadays when everything is SMP 
JB>> shouldn't we use at least atomic operations for this? Also I read that on 
JB>> architectures where cache coherency is not implemented in hardware even 
JB>> this does not help (I found a mail from jhb why for the mutex 
JB>> implementation this is not a problem, but I don't understand what to do 
JB>> for the += and ++ operations). I failed to find a way, though, to 
JB>> influence the caching policy (is there a function one can call to 
JB>> change the policy?).
JB>
JB>Atomic ops will always work for reliable statistics.  However, I believe 
JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to 
JB>what we do now for many of the 'cnt' stats (context switches, etc.).  For 
JB>'cnt' each CPU has its own count of stats that are updated using non-atomic 
JB>ops (since they are CPU local).  sysctl handlers then sum up the various
JB>per-CPU counts to report global counts to userland.

I see. I was also thinking along these lines, but was not sure whether it 
is worth the trouble. I suppose this does not help to implement 64-bit 
counters on 32-bit architectures, though, because you cannot read them 
reliably without locking to sum them up, right?

harti

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 15 19:39:04 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 23B4C1065676;
	Tue, 15 Dec 2009 19:39:04 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id E6A7B8FC13;
	Tue, 15 Dec 2009 19:39:03 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 881EA46B39;
	Tue, 15 Dec 2009 14:39:03 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id BF5FB8A01B;
	Tue, 15 Dec 2009 14:39:02 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: Harti Brandt <harti@freebsd.org>
Date: Tue, 15 Dec 2009 13:13:28 -0500
User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; )
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
In-Reply-To: <20091215183859.S53283@beagle.kn.op.dlr.de>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <200912151313.28326.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Tue, 15 Dec 2009 14:39:02 -0500 (EST)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE
	autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 15 Dec 2009 19:39:04 -0000

On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> On Tue, 15 Dec 2009, John Baldwin wrote:
> 
> JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
> JB>> Hi all,
> JB>> 
> JB>> I'm working on our network statistics (in the context of SNMP) and wonder, 
> JB>> to what extent we want them to be correct. I've re-read part of the past 
> JB>> discussions about 64-bit counters on 32-bit archs and got the impression, 
> JB>> that there are users that would like to have almost correct statistics 
> JB>> (for accounting, for example). If this is the case I wonder whether the 
> JB>> way we do the statistics today is correct.
> JB>> 
> JB>> Basically all statistics are incremented or added to simply by a += b or 
> JB>> a++. As I understand, this worked fine in the old days, where you had 
> JB>> spl*() calls at the right places. Nowadays when everything is SMP 
> JB>> shouldn't we use at least atomic operations for this? Also I read that on 
> JB>> architectures where cache coherency is not implemented in hardware even 
> JB>> this does not help (I found a mail from jhb why for the mutex 
> JB>> implementation this is not a problem, but I don't understand what to do 
> JB>> for the += and ++ operations). I failed to find a way, though, to 
> JB>> influence the caching policy (is there a function one can call to 
> JB>> change the policy?).
> JB>
> JB>Atomic ops will always work for reliable statistics.  However, I believe 
> JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to 
> JB>what we do now for many of the 'cnt' stats (context switches, etc.).  For 
> JB>'cnt' each CPU has its own count of stats that are updated using non-atomic 
> JB>ops (since they are CPU local).  sysctl handlers then sum up the various per-
> JB>CPU counts to report global counts to userland.
> 
> I see. I was also thinking along these lines, but was not sure whether it 
> is worth the trouble. I suppose this does not help to implement 64-bit 
> counters on 32-bit architectures, though, because you cannot read them 
> reliably without locking to sum them up, right?

Either that or you just accept that you have a small race since it is only stats. :)

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 16 18:19:49 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E5C261065693;
	Wed, 16 Dec 2009 18:19:49 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au
	[211.29.132.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 7A39A8FC16;
	Wed, 16 Dec 2009 18:19:49 +0000 (UTC)
Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	(c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116])
	by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nBGIJjhj016826
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 17 Dec 2009 05:19:47 +1100
Date: Thu, 17 Dec 2009 05:19:45 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: John Baldwin <jhb@FreeBSD.org>
In-Reply-To: <200912151313.28326.jhb@freebsd.org>
Message-ID: <20091217021211.O35780@delplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Harti Brandt <harti@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 16 Dec 2009 18:19:50 -0000

On Tue, 15 Dec 2009, John Baldwin wrote:

> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>> On Tue, 15 Dec 2009, John Baldwin wrote:
>>
>> JB>On Tuesday 15 December 2009 4:38:04 am Harti Brandt wrote:
>> JB>> Hi all,
>> JB>>
>> JB>> I'm working on our network statistics (in the context of SNMP) and wonder,
>> JB>> to what extent we want them to be correct. I've re-read part of the past
>> JB>> discussions about 64-bit counters on 32-bit archs and got the impression,
>> JB>> that there are users that would like to have almost correct statistics
>> JB>> (for accounting, for example). If this is the case I wonder whether the
>> JB>> way we do the statistics today is correct.
>> JB>>
>> JB>> Basically all statistics are incremented or added to simply by a += b or
>> JB>> a++. As I understand, this worked fine in the old days, where you had
>> JB>> spl*() calls at the right places. Nowadays when everything is SMP
>> JB>> shouldn't we use at least atomic operations for this? Also I read that on
>> JB>> architectures where cache coherency is not implemented in hardware even
>> JB>> this does not help (I found a mail from jhb why for the mutex
>> JB>> implementation this is not a problem, but I don't understand what to do
>> JB>> for the += and ++ operations). I failed to find a way, though, to
>> JB>> influence the caching policy (is there a function one can call to
>> JB>> change the policy?).
>> JB>
>> JB>Atomic ops will always work for reliable statistics.  However, I believe
>> JB>Robert is working on using per-CPU statistics for TCP, UDP, etc. similar to
>> JB>what we do now for many of the 'cnt' stats (context switches, etc.).  For
>> JB>'cnt' each CPU has its own count of stats that are updated using non-atomic
>> JB>ops (since they are CPU local).  sysctl handlers then sum up the various per-
>> JB>CPU counts to report global counts to userland.

I don't like the bloat from this, but don't see anything better.  Julian
said in another reply that there are even more complications for VIMAGE.
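
For reference, the per-CPU scheme quoted above boils down to roughly the
following.  This is only a minimal sketch: NCPU, ipackets[], stat_inc() and
stat_sum() are made-up names for illustration, not the kernel's PCPU or
'cnt' interfaces, and the caller is assumed to pass its own CPU number.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4			/* hypothetical CPU count for this sketch */

/* One slot per CPU; only that CPU ever writes its own slot. */
static uint32_t ipackets[NCPU];

/* Update path: plain, non-atomic increment of the local slot. */
static void
stat_inc(int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	ipackets[cpu]++;
}

/* Read path (e.g. a sysctl handler): add up every CPU's slot. */
static uint64_t
stat_sum(void)
{
	uint64_t total = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		total += ipackets[cpu];
	return (total);
}
```

The sum can be slightly stale, since nothing stops other CPUs from
updating their slots while the loop runs.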

>> I see. I was also thinking along these lines, but was not sure whether it
>> is worth the trouble. I suppose this does not help to implement 64-bit
>> counters on 32-bit architectures, though, because you cannot read them
>> reliably without locking to sum them up, right?
>
> Either that or you just accept that you have a small race since it is only stats. :)

Actually, you can do better with a generation count.  The generation count
would at least tell you if you lost a race.  The generation count should
only be maintained while summing other counts, since it must be global and
incremented by atomic ops (to avoid the races without even more costly
locking which would make the generation count irrelevant) so maintaining
it all the time would more than defeat the point of having per-CPU counters
(all CPUs would compete for it at the same address).  Probably not worth
it for statistics.  Except, if userland had control over it, then userland
could decide the policy.

Actually2, this solves your original problem!, provided the races are
so rarely lost that looping to recover from them works: Once counters
are per-CPU, they can be 64-bits with no complications until they are
summed.  Detection of lost races is essential for summing them on
32-bit systems, unlike for 32-bit counters, since a lost race at the
point where the low 32 bits wrap around may give an error of 2**32
in the sum, while a lost race for a 32-bit counter only makes the sum
a bit too small (unless the 32-bit counter wrapped).
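
To make the 2**32 error concrete, here is a small worked sketch.  It
assumes the reader picks up the stale high half together with the new low
half; combine() and torn_read_error() are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Glue two separately read 32-bit halves back together. */
static uint64_t
combine(uint32_t hi, uint32_t lo)
{
	return (((uint64_t)hi << 32) | lo);
}

/*
 * Error seen by a reader that pairs the high half of 'before' with the
 * low half of 'after', i.e. the writer was caught between its two stores.
 */
static uint64_t
torn_read_error(uint64_t before, uint64_t after)
{
	assert(after >= before);
	return (after - combine((uint32_t)(before >> 32), (uint32_t)after));
}
```

With before = 0x1FFFFFFFF and after = 0x200000000 the error is exactly
2**32; when the low half does not carry (say 5 -> 6), the error is 0.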

Simple version:
- bloat PCPU_INC(var) to do something like the following:
 	if (PCPU_GET(counter_summing_mode))
 		atomic_add_int(&counter_gen, 1);
 	OLD_PCPU_INC(var);
- set PCPU_GET(counter_summing_mode) while summing.  Needs heavyweight
   synchronization (IPIs?) to set and clear the flag on other CPUs.  Must
   also make all other CPUs flush pending writes (so that a 64-bit counter
   cannot be half-written at the beginning of the summing), but this will
   happen automatically with any heavyweight synchronization.

Unsimple versions: to avoid bloating PCPU_INC(), write-protect all
counters while summing, and count generations in the trap handler ...

However, I prefer summing 32-bit counters (with heuristics to detect
wraparound) to a 64-bit sum, like I think you already do for SNMP.

Wraparound heuristics may still be useful with the generation count:
suppose the generation count increases faster than you can sum; then
looping to get a coherent sum doesn't work, and wraparound must be
ruled out or fixed up in another way; the 32-bit wraparound heuristic
works perfectly since we can guarantee to sum faster than a 32-bit
counter can wrap twice.
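
One common form of such a wraparound heuristic, sketched under the
assumption that the 32-bit counter is sampled before it can advance by
2**32 between two samples.  struct wrap_acc and wrap_acc_update() are
hypothetical names, not existing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct wrap_acc {
	uint32_t last;		/* previous raw 32-bit sample */
	uint64_t total;		/* accumulated 64-bit value */
};

/* Fold a new 32-bit sample into the 64-bit total. */
static void
wrap_acc_update(struct wrap_acc *acc, uint32_t now)
{
	assert(acc != NULL);
	/* Unsigned subtraction is modulo 2**32, so a single wrap cancels out. */
	acc->total += (uint32_t)(now - acc->last);
	acc->last = now;
}
```

For example, going from a sample of 0xFFFFFFF0 to 0x10 adds 0x20 to the
total even though the raw counter wrapped in between.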

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 17 08:10:29 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CAD7E106568B;
	Thu, 17 Dec 2009 08:10:29 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au
	[211.29.132.190])
	by mx1.freebsd.org (Postfix) with ESMTP id 6028B8FC12;
	Thu, 17 Dec 2009 08:10:28 +0000 (UTC)
Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	(c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116])
	by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nBH8AP0P006827
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 17 Dec 2009 19:10:26 +1100
Date: Thu, 17 Dec 2009 19:10:25 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20091217021211.O35780@delplex.bde.org>
Message-ID: <20091217181553.Q36492@delplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091217021211.O35780@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Harti Brandt <harti@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>,
	freebsd-arch@FreeBSD.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 17 Dec 2009 08:10:29 -0000

On Thu, 17 Dec 2009, Bruce Evans wrote:

> ...
> Actually, you can do better with a generation count.  The generation count
> would at least tell you if you lost a race.  The generation count should
> only be maintained while summing other counts, since it must be global and
> incremented by atomic ops (to avoid the races without even more costly
> locking which would make the generation count irrelevant) so maintaining
> it all the time would more than defeat the point of having per-CPU counters
> (all CPUs would compete for it at the same address).  ...

Actually3, the generation count can be per-CPU and accessed without atomic
ops (provided reads of it on other CPUs return a consistent possibly-stale
value).

> Simple version:
> - bloat PCPU_INC(var) to do something like the following:
> 	if (PCPU_GET(counter_summing_mode))
> 		atomic_add_int(&counter_gen, 1);
> 	OLD_PCPU_INC(var);
> - set PCPU_GET(counter_summing_mode) while summing.  Needs heavyweight
>  synchronization (IPIs?) to set and clear the flag on other CPUs.  Must
>  also make all other CPUs flush pending writes (so that a 64-bit counter
>  cannot be half-written at the beginning of the summing), but this will
>  happen automatically with any heavyweight synchronization.

Better version:
- bloat PCPU_INC(var) to do something like the following:
 	OLD_PCPU_INC(counter_gen);
 	OLD_PCPU_INC(var);
- sum all PCPU_GET(counter_gen) before summing the subset of ordinary
   counters of interest.  This gives a value <= the unracy current sum
   of the generation counters, by reading consistent possibly-stale
   values.

   Then sync all counters as above.  Note that the order of the above
   increments would be backwards if we used write ordering instead of
   a full sync -- with only write ordering the sum of the generation
   counts would be too high here if we happened to read it on 1 of the
   CPUs in between the above increments.  This order is chosen since
   I don't want to have 2 increments of counter_gen in the above and/or
   further complications and bloat, so there must be some order, and
   the above order works right later.

   Then sum selected ordinary counters.

   Then sync the generation counters (or all counters, or arrange for
   write ordering) as above.

   Then sum the generation counters.  This gives a value >= the unracy
   current sum at the end of summing the selected counters.
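
The steps above could be sketched as follows.  This is a single-threaded
model only: sync_all_cpus() is an empty placeholder for the heavyweight
sync, and NCPU, struct pcpu_ctr, pcpu_inc(), sum_gen() and sum_var() are
names assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4			/* hypothetical CPU count */

struct pcpu_ctr {
	uint32_t counter_gen;	/* bumped on every counter update */
	uint64_t var;		/* the statistic itself */
};

static struct pcpu_ctr pcpu[NCPU];

/* Placeholder for the heavyweight sync (IPIs or similar); a no-op here. */
static void
sync_all_cpus(void)
{
}

/* The bloated increment: generation first, then the counter. */
static void
pcpu_inc(int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	pcpu[cpu].counter_gen++;
	pcpu[cpu].var++;
}

static uint32_t
sum_gen(void)
{
	uint32_t gen = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		gen += pcpu[cpu].counter_gen;
	return (gen);
}

/* Returns true if no update was observed while 'var' was being summed. */
static bool
sum_var(uint64_t *out)
{
	uint32_t before, after;
	uint64_t sum = 0;

	before = sum_gen();	/* <= true generation sum at the start */
	sync_all_cpus();
	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += pcpu[cpu].var;
	sync_all_cpus();
	after = sum_gen();	/* >= true generation sum at the end */
	*out = sum;
	return (before == after);
}
```

A caller would loop on sum_var() until it returns true, or give up and
fall back to another method if the generation sum keeps moving.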

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 17 08:28:12 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C1509106568B;
	Thu, 17 Dec 2009 08:28:12 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au
	[211.29.132.189])
	by mx1.freebsd.org (Postfix) with ESMTP id 3F8638FC12;
	Thu, 17 Dec 2009 08:28:11 +0000 (UTC)
Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	(c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116])
	by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nBH8S8sx014527
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 17 Dec 2009 19:28:09 +1100
Date: Thu, 17 Dec 2009 19:28:08 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20091217181553.Q36492@delplex.bde.org>
Message-ID: <20091217191535.U36525@delplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091217021211.O35780@delplex.bde.org>
	<20091217181553.Q36492@delplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Harti Brandt <harti@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org>,
	freebsd-arch@FreeBSD.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 17 Dec 2009 08:28:12 -0000

On Thu, 17 Dec 2009, Bruce Evans wrote:

> On Thu, 17 Dec 2009, Bruce Evans wrote:
>
>> ...
> Actually3, the generation count can be per-CPU and accessed without atomic
> ops (provided reads of it on other CPUs return a consistent possibly-stale
> value).
>
>> Simple version:
>
> Better version:
> ...

Duh, this is far too complicated and bloated.  Counters can be their own
generation counts -- you just read them again to see if they are quiescent.
A heavyweight sync before each of the (sets of) reads is still necessary.
Self-generation counters give a separate generation counter for each
normal counter, so quiescence can easily be checked for and/or enforced
per-counter.
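
A rough sketch of the read-it-twice idea, again as a single-threaded
model: heavy_sync() is an empty stand-in for the heavyweight sync, and
read_quiescent() with its max_tries bound is an assumption of this
example rather than anything proposed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the heavyweight sync; a no-op in this sketch. */
static void
heavy_sync(void)
{
}

/*
 * Read *ctr twice, with a sync before each read, and accept the value
 * only if both reads agree.  Gives up after max_tries attempts.
 */
static bool
read_quiescent(const volatile uint64_t *ctr, int max_tries, uint64_t *out)
{
	assert(ctr != NULL && out != NULL);
	for (int i = 0; i < max_tries; i++) {
		heavy_sync();
		uint64_t first = *ctr;
		heavy_sync();
		uint64_t second = *ctr;
		if (first == second) {
			*out = first;
			return (true);
		}
	}
	return (false);
}
```

Because each counter is checked against itself, a busy counter only
delays its own read and not the reads of the idle ones.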

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 11:27:13 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C8E5B1065672;
	Sat, 19 Dec 2009 11:27:13 +0000 (UTC)
	(envelope-from uqs@spoerlein.net)
Received: from acme.spoerlein.net (acme.spoerlein.net [IPv6:2a01:198:206::1])
	by mx1.freebsd.org (Postfix) with ESMTP id 4F1468FC16;
	Sat, 19 Dec 2009 11:27:13 +0000 (UTC)
Received: from acme.spoerlein.net (localhost.spoerlein.net [IPv6:::1])
	by acme.spoerlein.net (8.14.3/8.14.3) with ESMTP id nBJBRCkd046767
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 19 Dec 2009 12:27:12 +0100 (CET)
	(envelope-from uqs@spoerlein.net)
Received: (from uqs@localhost)
	by acme.spoerlein.net (8.14.3/8.14.3/Submit) id nBJBRCn9046766;
	Sat, 19 Dec 2009 12:27:12 +0100 (CET)
	(envelope-from uqs@spoerlein.net)
Date: Sat, 19 Dec 2009 12:27:12 +0100
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@spoerlein.net>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <20091219112711.GR55913@acme.spoerlein.net>
Mail-Followup-To: John Baldwin <jhb@freebsd.org>,
	Harti Brandt <harti@freebsd.org>, freebsd-arch@freebsd.org
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200912151313.28326.jhb@freebsd.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: Harti Brandt <harti@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 11:27:13 -0000

On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> > I see. I was also thinking along these lines, but was not sure whether it 
> > is worth the trouble. I suppose this does not help to implement 64-bit 
> > counters on 32-bit architectures, though, because you cannot read them 
> > reliably without locking to sum them up, right?
> 
> Either that or you just accept that you have a small race since it is only stats. :)

This might be stupid, but can we not easily *read* 64bit counters
on 32bit machines like this:

do {
    h1 = read_upper_32bits;
    l1 = read_lower_32bits;
    h2 = read_upper_32bits;
    l2 = read_lower_32bits; /* not needed */
} while (h1 != h2);

sum64 = (h1<<32) + l1;

or something like that? If h2 does not change between readings, no
wrap-around has occurred. If l1 was read in between the readings of h1
and h2, the code above is sound. Right?

Regards,
Uli

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 11:42:24 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8965A1065746;
	Sat, 19 Dec 2009 11:42:24 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe01.swip.net [212.247.154.1])
	by mx1.freebsd.org (Postfix) with ESMTP id AC4FE8FC08;
	Sat, 19 Dec 2009 11:42:23 +0000 (UTC)
X-Cloudmark-Score: 0.000000 []
X-Cloudmark-Analysis: v=1.0 c=1 a=MnI1ikcADjEx7bvsp0jZvQ==:17
	a=Fa0p2dj8wnFP207fzp8A:9 a=NRoDMZWEEsvWI_krVPgA:7
	a=f3fVchOcrPINIEe5_G_UAmhSYaQA:4 a=u4nCZWMP6T-Rh-9c:21
Received: from [188.126.201.140] (account mc467741@c2i.net HELO
	laptop.adsl.tele2.no)
	by mailfe01.swip.net (CommuniGate Pro SMTP 5.2.16)
	with ESMTPA id 293568362; Sat, 19 Dec 2009 12:42:21 +0100
From: Hans Petter Selasky <hselasky@c2i.net>
To: freebsd-arch@freebsd.org
Date: Sat, 19 Dec 2009 12:44:14 +0100
User-Agent: KMail/1.11.4 (FreeBSD/9.0-CURRENT; KDE/4.2.4; i386; ; )
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
In-Reply-To: <20091219112711.GR55913@acme.spoerlein.net>
X-Face: (%<A9p';5>:6u[ldzJ`0qjD7sCkfdMmD*RxpO<
	=?iso-8859-1?q?Q0yAl=7E=3F=60=27F=3FjDVb=5DE6TQ7=27=23h-VlLs=7Dk/=0A=09?=(yxg(p!IL.`#ng"%`BMrham7%UK,}VH\wUOm=^>wEEQ+KWt[{J#x6ow~JO:,zwp.(t;
	@ =?iso-8859-1?q?Aq=0A=09=3A4=3A=26nFCgDb8=5B3oIeTb=5E=27?=",;
	u{5{}C9>"PuY\)!=#\u9SSM-nz8+SR~B\!qBv
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200912191244.17803.hselasky@c2i.net>
Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= <uqs@spoerlein.net>,
	Harti Brandt <harti@freebsd.org>
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 11:42:24 -0000

On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
> On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
> > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> > > I see. I was also thinking along these lines, but was not sure whether
> > > it is worth the trouble. I suppose this does not help to implement
> > > 64-bit counters on 32-bit architectures, though, because you cannot
> > > read them reliably without locking to sum them up, right?
> >
> > Either that or you just accept that you have a small race since it is
> > only stats. :)
>
> This might be stupid, but can we not easily *read* 64bit counters
> on 32bit machines like this:
>
> do {
>     h1 = read_upper_32bits;
>     l1 = read_lower_32bits;
>     h2 = read_upper_32bits;
>     l2 = read_lower_32bits; /* not needed */
> } while (h1 != h2);


Hi,

Just a comment. You know you don't need a while loop to get a stable value?
Should be implemented like this, in my opinion:

h1 = read_upper_32bits;
l1 = read_lower_32bits;
h2 = read_upper_32bits;

if (h1 != h2)
	l1 = 0xffffffffUL;

sum64 = (h1<<32) | l1;

--HPS

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 14:56:43 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8D1A21065679;
	Sat, 19 Dec 2009 14:56:43 +0000 (UTC)
	(envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
	by mx1.freebsd.org (Postfix) with ESMTP id 1CC1F8FC14;
	Sat, 19 Dec 2009 14:56:42 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de over
	TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); 
	Sat, 19 Dec 2009 15:56:40 +0100
Date: Sat, 19 Dec 2009 15:56:38 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@spoerlein.net>
In-Reply-To: <20091219112711.GR55913@acme.spoerlein.net>
Message-ID: <20091219154206.E93919@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-OriginalArrivalTime: 19 Dec 2009 14:56:40.0372 (UTC)
	FILETIME=[78948740:01CA80BB]
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt <harti@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 14:56:43 -0000

On Sat, 19 Dec 2009, Ulrich Spörlein wrote:

US>On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
US>> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
US>> > I see. I was also thinking along these lines, but was not sure whether it 
US>> > is worth the trouble. I suppose this does not help to implement 64-bit 
US>> > counters on 32-bit architectures, though, because you cannot read them 
US>> > reliably without locking to sum them up, right?
US>> 
US>> Either that or you just accept that you have a small race since it is only stats. :)
US>
US>This might be stupid, but can we not easily *read* 64bit counters
US>on 32bit machines like this:
US>
US>do {
US>    h1 = read_upper_32bits;
US>    l1 = read_lower_32bits;
US>    h2 = read_upper_32bits;
US>    l2 = read_lower_32bits; /* not needed */
US>} while (h1 != h2);
US>
US>sum64 = (h1<<32) + l1;
US>
US>or something like that? If h2 does not change between readings, no
US>wrap-around has occurred. If l1 was read in between the readings of h1
US>and h2, the code above is sound. Right?

I suppose this works only if it is guaranteed that the CPU modifying the
64-bit value finishes its update faster than the other CPU reads the data:

CPU1                    CPU2
----                    ----
write new h
			read h1 (new h)
			read l1 (old l)
			read h2 (new h)
write new l

It doesn't work either when the CPU first writes L and then H.
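
The interleaving in the table can be replayed deterministically.  This is
a sketch only: both "CPUs" are simulated in one thread by issuing the
accesses in the order shown, and struct split64 and replay_interleaving()
are names made up for the example.

```c
#include <assert.h>
#include <stdint.h>

struct split64 {
	uint32_t hi;
	uint32_t lo;
};

/* Replay the table: the counter goes from 2**32 - 1 to 2**32. */
static uint64_t
replay_interleaving(void)
{
	struct split64 c = { .hi = 0, .lo = 0xFFFFFFFFu };
	uint32_t h1, l1, h2;

	c.hi = 1;	/* CPU1: write new h */
	h1 = c.hi;	/* CPU2: read h1 (new h) */
	l1 = c.lo;	/* CPU2: read l1 (old l) */
	h2 = c.hi;	/* CPU2: read h2 (new h) */
	c.lo = 0;	/* CPU1: write new l */

	assert(h1 == h2);	/* the h1 == h2 check passes ... */
	return (((uint64_t)h1 << 32) | l1);	/* ... on a torn value */
}
```

The reader ends up with 0x1FFFFFFFF, although the counter only ever held
0xFFFFFFFF and 0x100000000.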

harti

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 15:30:13 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 06C3B1065672;
	Sat, 19 Dec 2009 15:30:13 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au
	[211.29.132.189])
	by mx1.freebsd.org (Postfix) with ESMTP id 796E58FC08;
	Sat, 19 Dec 2009 15:30:11 +0000 (UTC)
Received: from besplex.bde.org (c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	[220.239.235.116])
	by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nBJFU3qZ026724
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 20 Dec 2009 02:30:04 +1100
Date: Sun, 20 Dec 2009 02:30:03 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Hans Petter Selasky <hselasky@c2i.net>
In-Reply-To: <200912191244.17803.hselasky@c2i.net>
Message-ID: <20091219232119.L1555@besplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
	<200912191244.17803.hselasky@c2i.net>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-509698132-1261236603=:1555"
Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= <uqs@spoerlein.net>,
	Harti Brandt <harti@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 15:30:13 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-509698132-1261236603=:1555
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Sat, 19 Dec 2009, Hans Petter Selasky wrote:

> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
>> On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
>>> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
>>>> I see. I was also thinking along these lines, but was not sure whether
>>>> it is worth the trouble. I suppose this does not help to implement
>>>> 64-bit counters on 32-bit architectures, though, because you cannot
>>>> read them reliably without locking to sum them up, right?
>>>
>>> Either that or you just accept that you have a small race since it is
>>> only stats. :)
>>
>> This might be stupid, but can we not easily *read* 64bit counters
>> on 32bit machines like this:
>>
>> do {
>>     h1 = read_upper_32bits;
>>     l1 = read_lower_32bits;
>>     h2 = read_upper_32bits;
>>     l2 = read_lower_32bits; /* not needed */
>> } while (h1 != h2);

No.  See my previous^N reply, but don't see it about this since it was
wrong about this for all N :-).

> Just a comment. You know you don't need a while loop to get a stable value?
> Should be implemented like this, in my opinion:
>
> h1 = read_upper_32bits;
> l1 = read_lower_32bits;
> h2 = read_upper_32bits;
>
> if (h1 != h2)
> 	l1 = 0xffffffffUL;
>
> sum64 = (h1<<32) | l1;

Also wrong :-).  Apart from write ordering problems (1), the write of
the second half (presumably the top half) might not have completed
when the above looks at it.  Then both of the above will see:
     h1 = old value
     l1 = new value
     h2 = old value (since the new one has not been written yet).
The race window for this can be arbitrarily long, since the second write
can be delayed for arbitrarily long (e.g., by interrupt handling).  Even
if we ensure write ordering and no interrupts, the above has many problems.
- we can't reasonably guarantee that the reads of l1 and h2 will execute
   sufficiently faster than the writes of l1 and h1/h2 so that the above
   will see h2 after l1.  I think the writes will usually go slightly
   faster since they will go through a write buffer, provided the 2 halves
   are in a single cache line, but this is unclear.  SMP with different
   CPU frequencies is not really supported, but works now modulo timecounter
   problems, and we probably want to support completely independent CPU
   frequencies, with some CPUs throttled.
- I don't understand why the above compares the high values.  Comparing the
   low values seems to work better.
- I can't see how to fix up the second method to be useful.  It is faster,
   but some delays seem to be necessary and they might as well be partly
   in a slower method.
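
The failure described above can be replayed deterministically.  The
following is a minimal single-threaded sketch, not kernel code: the
struct and function names are hypothetical, and the delayed second
write is modelled by simply calling the two halves of the increment
separately with a read in between.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit counter stored as two 32-bit halves. */
struct split64 {
	uint32_t lo;
	uint32_t hi;
};

/* First half of a non-atomic increment; returns 1 if a carry is pending. */
static int
inc_low_half(struct split64 *c)
{
	c->lo++;
	return (c->lo == 0);
}

/* Second half: propagate the pending carry into the high word. */
static void
inc_high_half(struct split64 *c)
{
	assert(c->lo == 0);	/* only valid right after the low word wrapped */
	c->hi++;
}

/* The reader from the first method: retry while the high word changes. */
static uint64_t
read_retry_on_high(const struct split64 *c)
{
	uint32_t h1, l1, h2;

	do {
		h1 = c->hi;
		l1 = c->lo;
		h2 = c->hi;
	} while (h1 != h2);
	return (((uint64_t)h1 << 32) | l1);
}
```

Starting from lo = 0xffffffff, hi = 0, a read placed between
inc_low_half() and inc_high_half() sees h1 == h2 == 0 and l1 == 0, so
the loop exits with 0, which is neither the old value nor the new one.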

Another try:
- enforce write ordering in writers and read ordering in the reader
- make sure that the reader runs fast enough.  This might require using
   critical_enter() in the reader.
- then many cases work with no complications:
   (a) if l1 is not small and not large, then h1 must be associated with l1,
       since then the low value can't roll over while we are looking
       at the pair, so the high value can't change while we are looking.
       So we can just use h1 and l1, without reading h2 or l2.
   (b) similarly, if l1 is small, then h2 is associated with l1.  So we
       can just use h2 with l1, without reading l2.
- otherwise, l1 is large:
   (c) if l1 == l2, then h1 must be associated with l1, since some writes
       of the high value associated with writing not-so-large low values
       have had plenty of time to complete (in fact, the one for (l1-2)
       or (l2-1) must have completed, and "large" only needs to be 1
       or 2 to ensure that these values don't wrap).  E.g., if l1 is
       0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
       wrapped recently so h1 hasn't incremented recently.  So we can
       use h1 with l1, after reading l2 (still need to read h2, in case
       we don't get here).
   (d) if l1 != l2, then ordering implies that the write of the high value
       associated with l1 has completed.  We might have missed reading
       this value, since we might have read h1 too early and might have
       read h2 too late, but in the usual case h1 == h2 and then both
       h's are associated with l1, while if h1 != h2 then we can loop
       again and surely find h1 == h2 (and l1 small, so case (b)), or
       we can use the second method.  We had to read all 4 values to
       determine what to do here, and can usually use 2 of them directly.
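
The decision step of cases (a)-(d) can be written down as a pure
function.  This is only a sketch of the selection logic: it assumes the
four words were already read in the order h1, l1, h2, l2 with the
ordering and reader-speed guarantees listed above, and the LOW_SMALL and
LOW_LARGE thresholds and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	LOW_SMALL	0x00000100u	/* hypothetical "small" bound */
#define	LOW_LARGE	0xffffff00u	/* hypothetical "large" bound */

/*
 * Pick the high word that belongs to l1.  Returns 0 and stores the
 * 64-bit value in *out, or returns -1 if the caller should read again.
 */
static int
pick_pair(uint32_t h1, uint32_t l1, uint32_t h2, uint32_t l2, uint64_t *out)
{
	uint32_t h;

	assert(out != NULL);
	if (l1 >= LOW_SMALL && l1 < LOW_LARGE)
		h = h1;		/* (a): no wrap can be near */
	else if (l1 < LOW_SMALL)
		h = h2;		/* (b): a recent wrap has reached h2 */
	else if (l1 == l2)
		h = h1;		/* (c): large, but has not wrapped yet */
	else if (h1 == h2)
		h = h1;		/* (d), usual case */
	else
		return (-1);	/* (d), h1 != h2: loop again */
	*out = ((uint64_t)h << 32) | l1;
	return (0);
}
```

The function does nothing about the ordering problems themselves; it
only shows which pair each case would use once those are solved.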

(1) Write ordering is guaranteed on amd64 but I think not on all arches.
     You could pessimize PCPU_INC() with memory barriers on some arches to
     get write ordering.

(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
     this.  You cannot prevent the second write from being delayed for
     arbitrarily long by a trap and its handling, except by breaking
     at least the debugger traps needed to debug this.

In my previous^2 reply, I said heavyweight synchronization combined
with extra atomic generation counters would work.  The heavyweight
synchronization would have to be heavier than I thought -- it would
have to wait for all other CPUs to complete the pairs of writes for
64-bit counters, if any, and for this it would have to do more than
IPI's -- it should change priorities and reschedule to ensure that the
half-writes (if any) have a chance of completing soon...  Far too
complicated for this.  Disabling interrupts for the non-atomic PCPU_INC()s
is probably best.  Duh, this or worse (locking) is required on the
writer side anyway, else increments won't be atomic.  Locking would
actually automatically give the rescheduling stuff for the heavyweight
synchronization -- you would acquire the lock in the reader and of
course in the writers, and get priority propagation to complete the
one writer allowed to hold the lock iff any is holding it.  Locking
might not be too bad for a few 64-bit counters.  So I've changed my
mind yet again and prefer locking to critical_enter().  It's cleaner
and works for traps.  I just remembered that rwatson went the opposite
way and changed some locking to critical_enter() in UMA.  I prefer the
old way, and at least in old versions of FreeBSD I got panics trying
to debug near this (single-stepping malloc()?).
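
The locked variant is short enough to sketch.  This is a userland
illustration under stated assumptions: a pthread mutex stands in for a
kernel mutex, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical 64-bit statistic protected by its own lock. */
struct locked_counter {
	pthread_mutex_t lock;
	uint64_t value;
};

/* Writer: the whole 64-bit update happens with the lock held. */
static void
locked_counter_add(struct locked_counter *c, uint64_t n)
{
	int error;

	error = pthread_mutex_lock(&c->lock);
	assert(error == 0);
	c->value += n;
	error = pthread_mutex_unlock(&c->lock);
	assert(error == 0);
	(void)error;
}

/* Reader: takes the same lock, so it never sees a half-written pair. */
static uint64_t
locked_counter_read(struct locked_counter *c)
{
	uint64_t v;
	int error;

	error = pthread_mutex_lock(&c->lock);
	assert(error == 0);
	v = c->value;
	error = pthread_mutex_unlock(&c->lock);
	assert(error == 0);
	(void)error;
	return (v);
}
```

A counter would be initialized with PTHREAD_MUTEX_INITIALIZER and a
starting value; the cost is one lock round trip per increment.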

In my previous^1 reply, I said lighter weight synchronization combined
with no extra or atomic counters (use counters as their own generation
counter) would work.  But the synchronization still needs to be heavy,
or interrupts disabled, as above.

Everything must be read before and after to test for getting a coherent
set of values, so the loop in the first method has the minimal number
of reads (for a single 64-bit counter).  With sync, the order for each
pair in it doesn't matter on either the reader or writer (there must be
a sync or 2 instead).

Bruce
--0-509698132-1261236603=:1555--

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 16:02:29 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DC466106566B
	for <freebsd-arch@freebsd.org>; Sat, 19 Dec 2009 16:02:29 +0000 (UTC)
	(envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
	by mx1.freebsd.org (Postfix) with ESMTP id 557688FC25
	for <freebsd-arch@freebsd.org>; Sat, 19 Dec 2009 16:02:28 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de over
	TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); 
	Sat, 19 Dec 2009 17:02:27 +0100
Date: Sat, 19 Dec 2009 17:02:23 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20091219232119.L1555@besplex.bde.org>
Message-ID: <20091219164818.L1741@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
	<200912191244.17803.hselasky@c2i.net>
	<20091219232119.L1555@besplex.bde.org>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="1964543108-214838482-1261238543=:1741"
X-OriginalArrivalTime: 19 Dec 2009 16:02:27.0737 (UTC)
	FILETIME=[A964A090:01CA80C4]
Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= <uqs@spoerlein.net>,
	freebsd-arch@freebsd.org, Hans Petter Selasky <hselasky@c2i.net>
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt <harti@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 16:02:29 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--1964543108-214838482-1261238543=:1741
Content-Type: TEXT/PLAIN; charset=koi8-r
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Hans Petter Selasky wrote:
BE>
BE>> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
BE>> > On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
BE>> > > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
BE>> > > > I see. I was also thinking along these lines, but was not sure
BE>> > > > whether
BE>> > > > it is worth the trouble. I suppose this does not help to implement
BE>> > > > 64-bit counters on 32-bit architectures, though, because you cannot
BE>> > > > read them reliably without locking to sum them up, right?
BE>> > >
BE>> > > Either that or you just accept that you have a small race since it is
BE>> > > only stats. :)
BE>> >
BE>> > This might be stupid, but can we not easily *read* 64bit counters
BE>> > on 32bit machines like this:
BE>> >
BE>> > do {
BE>> >     h1 = read_upper_32bits;
BE>> >     l1 = read_lower_32bits;
BE>> >     h2 = read_upper_32bits;
BE>> >     l2 = read_lower_32bits; /* not needed */
BE>> > } while (h1 != h2);
BE>
BE>No.  See my previous^N reply, but don't see it about this since it was
BE>wrong about this for all N :-).
BE>
BE>> Just a comment. You know you don't need a while loop to get a stable value?
BE>> Should be implemented like this, in my opinion:
BE>>
BE>> h1 = read_upper_32bits;
BE>> l1 = read_lower_32bits;
BE>> h2 = read_upper_32bits;
BE>>
BE>> if (h1 != h2)
BE>> 	l1 = 0xffffffffUL;
BE>>
BE>> sum64 = (h1<<32) | l1;
BE>
BE>Also wrong :-).  Apart from write ordering problems (1), the write of
BE>the second half (presumably the top half) might not have completed when the
BE>above looks at it.  Then both of the above will see:
BE>    h1 = old value
BE>    l1 = new value
BE>    h2 = old value (since the new one has not been written yet).
BE>The race window for this can be arbitrarily long, since the second write
BE>can be delayed for arbitrarily long (e.g., by interrupt handling).  Even
BE>if we ensure write ordering and no interrupts, the above has many problems.
BE>- we can't reasonably guarantee that the reads of l1 and h2 will execute
BE>  sufficiently faster than the writes of l1 and h1/h2 so that the above
BE>  will see h2 after l1.  I think the writes will usually go slightly
BE>  faster since they will go through a write buffer, provided the 2 halves
BE>  are in a single cache line, but this is unclear.  SMP with different
BE>  CPU frequencies is not really supported, but works now modulo timecounter
BE>  problems, and we probably want to support completely independent CPU
BE>  frequencies, with some CPUs throttled.
BE>- I don't understand why the above compares the high values.  Comparing the
BE>  low values seems to work better.
BE>- I can't see how to fix up the second method to be useful.  It is faster,
BE>  but some delays seem to be necessary and they might as well be partly
BE>  in a slower method.
BE>
BE>Another try:
BE>- enforce write ordering in writers and read ordering in the reader
BE>- make sure that the reader runs fast enough.  This might require using
BE>  critical_enter() in the reader.
BE>- then many cases work with no complications:
BE>  (a) if l1 is not small and not large, then h1 must be associated with l1,
BE>      since then the low value can't roll over while we are looking
BE>      at the pair, so the high value can't change while we are looking.
BE>      So we can just use h1 and l1, without reading h2 or l2.
BE>  (b) similarly, if l1 is small, then h2 is associated with l1.  So we
BE>      can just use h2 with l1, without reading l2.
BE>- otherwise, l1 is large:
BE>  (c) if l1 == l2, then h1 must be associated with l1, since some writes
BE>      of the high value associated with writing not-so-large low values
BE>      have had plenty of time to complete (in fact, the one for (l1-2)
BE>      or (l2-1) must have completed, and "large" only needs to be 1
BE>      or 2 to ensure that these values don't wrap).  E.g., if l1 is
BE>      0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
BE>      wrapped recently so h1 hasn't incremented recently.  So we can
BE>      use h1 with l1, after reading l2 (still need to read h2, in case
BE>      we don't get here).
BE>  (d) if l1 != l2, then ordering implies that the write of the high value
BE>      associated with l1 has completed.  We might have missed reading
BE>      this value, since we might have read h1 too early and might have
BE>      read h2 too late, but in the usual case h1 == h2 and then both
BE>      h's are associated with l1, while if h1 != h2 then we can loop
BE>      again and surely find h1 == h2 (and l1 small, so case (b)), or
BE>      we can use the second method.  We had to read all 4 values to
BE>      determine what to do here, and can usually use 2 of them directly.
BE>
BE>(1) Write ordering is guaranteed on amd64 but I think not on all arches.
BE>    You could pessimize PCPU_INC() with memory barriers on some arches to
BE>    get write ordering.
BE>
BE>(2) You could pessimize PCPU_INC() with critical_enter() to mostly prevent
BE>    this.  You cannot prevent the second write from being delayed for
BE>    arbitrarily long by a trap and its handling, except by breaking
BE>    at least the debugger traps needed to debug this.
BE>
BE>In my previous^2 reply, I said heavyweight synchronization combined
BE>with extra atomic generation counters would work.  The heavyweight
BE>synchronization would have to be heavier than I thought -- it would
BE>have to wait for all other CPUs to complete the pairs of writes for
BE>64-bit counters, if any, and for this it would have to do more than
BE>IPI's -- it should change priorities and reschedule to ensure that the
BE>half-writes (if any) have a chance of completing soon...  Far too
BE>complicated for this.  Disabling interrupts for the non-atomic PCPU_INC()s
BE>is probably best.  Duh, this or worse (locking) is required on the
BE>writer side anyway, else increments won't be atomic.  Locking would
BE>actually automatically give the rescheduling stuff for the heavyweight
BE>synchronization -- you would acquire the lock in the reader and of
BE>course in the writers, and get priority propagation to complete the
BE>one writer allowed to hold the lock iff any is holding it.  Locking
BE>might not be too bad for a few 64-bit counters.  So I've changed my
BE>mind yet again and prefer locking to critical_enter().  It's cleaner
BE>and works for traps.  I just remembered that rwatson went the opposite
BE>way and changed some locking to critical_enter() in UMA.  I prefer the
BE>old way, and at least in old versions of FreeBSD I got panics trying
BE>to debug near this (single-stepping malloc()?).
BE>
BE>In my previous^1 reply, I said lighter weight synchronization combined
BE>with no extra or atomic counters (use counters as their own generation
BE>counter) would work.  But the synchronization still needs to be heavy,
BE>or interrupts disabled, as above.
BE>
BE>Everything must be read before and after to test for getting a coherent
BE>set of values, so the loop in the first method has the minimal number
BE>of reads (for a single 64-bit counter).  With sync, the order for each
BE>pair in it doesn't matter on either the reader or writer (there must be
BE>a sync or 2 instead).

To be honest, I'm lost now. Couldn't we just use the largest atomic type
for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
thread?

Are the 5-6 atomic operations really that costly given the many operations
done on an IP packet? Are they more costly than a heavyweight sync for
each ++ or +=?

Or we could use the PCPU stuff, use just ++ and += for modifying the
statistics (32bit) and do the 32->64 bit stuff for all platforms with a
kernel thread per CPU (do we have this?). Between that thread and the
sysctl we could use a heavy sync.

Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
largest atomic type for the platform, handle the aggregation and (on IA32)
the 32->64 bit stuff in a kernel thread.

Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
machine and do routing at link speed, though. This might overflow the IP
input/output byte counter (which we don't have yet) too fast.

harti
--1964543108-214838482-1261238543=:1741--

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 17:15:51 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 222B41065676;
	Sat, 19 Dec 2009 17:15:51 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id AC4D98FC1A;
	Sat, 19 Dec 2009 17:15:50 +0000 (UTC)
Received: from besplex.bde.org (c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	[220.239.235.116])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nBJHFgbo021902
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 20 Dec 2009 04:15:43 +1100
Date: Sun, 20 Dec 2009 04:15:42 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Harti Brandt <harti@freebsd.org>
In-Reply-To: <20091219164818.L1741@beagle.kn.op.dlr.de>
Message-ID: <20091220032452.W2429@besplex.bde.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
	<200912191244.17803.hselasky@c2i.net>
	<20091219232119.L1555@besplex.bde.org>
	<20091219164818.L1741@beagle.kn.op.dlr.de>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Ulrich =?iso-8859-1?q?Sp=F6rlein?= <uqs@spoerlein.net>,
	freebsd-arch@freebsd.org, Hans Petter Selasky <hselasky@c2i.net>
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 17:15:51 -0000


On Sat, 19 Dec 2009, Harti Brandt wrote:

> On Sun, 20 Dec 2009, Bruce Evans wrote:
>
> [... complications]
>
> To be honest, I'm lost now. Couldn't we just use the largest atomic type
> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
> thread?

That's probably best (except without the atomic operations) (like I said
originally.  I tried to spell out the complications to make it clear that
they would be too much except for incomplete ones).

> Are the 5-6 atomic operations really that costly given the many operations
> done on an IP packet? Are they more costly than a heavyweight sync for
> each ++ or +=?

rwatson found that even non-atomic operations are quite costly, since
at least on amd64 and i386, ones that write (or any access?) the same
address (or cache line?) apparently involve much the same hardware
activity (cache snoop?) as atomic ones implemented by locking the bus.
I think this is mostly historical -- it should be necessary to lock the
bus to get the slow version.  Per-CPU counters give separate addresses
and also don't require the bus lock.  I don't like the complexity for
per-CPU counters but don't use big SMP systems enough to know what the
locks cost in real applications.

> Or we could use the PCPU stuff, use just ++ and += for modifying the
> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
> kernel thread per CPU (do we have this?). Between that thread and the
> sysctl we could use a heavy sync.

I don't like the squillions of threads in FreeBSD-post-4, but this seems
to need its own one and there isn't one yet AFAIK.  I think a thread is
only needed for the 32-bit stuff (since aggregation has to use the
current values and it shouldn't have to ask a thread to sum them).  The
thread should maintain only the high 32 or 33 bits of the 64-bit counters.
Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
that these bits can be accessed without locking.  The synchronization is
still interesting.

> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
> largest atomic type for the platform, handle the aggregation and (on IA32)
> the 32->64 bit stuff in a kernel thread.

I don't see why using atomic or locks for just the 64 bit counters is good.
We will probably end up with too many 64-bit counters, especially if they
don't cost much when not read.

I just thought of another implementation to reduce reads: trap on
overflow and handle all the complications in the trap handler, or
just set a flag to tell the fixup thread to run and normally don't
run the fixup thread.  This seems to not quite work -- arranging
for the trap would be costly (needs "into" instruction on i386?).
Similarly for explicit tests for wraparound (PCPU_INC() could be a
function call that does the test and handles wraparound in a fully
locked fashion.  We don't care that this code executes slowly since
it rarely executes, but we care that the test pessimizes the usual
case).
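
A function-call increment with an explicit wraparound test might look
like the following.  This is a userland sketch with hypothetical names:
a pthread mutex stands in for whatever lock the kernel would use, the
counter is assumed to have a single writer (as a per-CPU counter would),
and the plain 32-bit accesses would need to be atomic loads and stores
in real concurrent code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-CPU counter: one writer, occasional readers. */
struct wrap_counter {
	pthread_mutex_t lock;	/* taken on wraparound and by readers */
	uint32_t lo;
	uint32_t hi;
};

/* Writer: the usual case is one add and one compare, with no lock. */
static void
wrap_counter_add(struct wrap_counter *c, uint32_t n)
{
	uint32_t sum = c->lo + n;
	int error;

	if (sum >= c->lo) {
		c->lo = sum;	/* no carry: fast path */
		return;
	}
	/* Carry out of the low word: update both halves under the lock. */
	error = pthread_mutex_lock(&c->lock);
	assert(error == 0);
	c->lo = sum;
	c->hi++;
	error = pthread_mutex_unlock(&c->lock);
	assert(error == 0);
	(void)error;
}

/* Reader: holding the lock keeps a wraparound from landing mid-read. */
static uint64_t
wrap_counter_read(struct wrap_counter *c)
{
	uint64_t v;
	int error;

	error = pthread_mutex_lock(&c->lock);
	assert(error == 0);
	v = ((uint64_t)c->hi << 32) | c->lo;
	error = pthread_mutex_unlock(&c->lock);
	assert(error == 0);
	(void)error;
	return (v);
}
```

The compare in the fast path is the pessimization in question: it runs
on every increment even though the locked branch is almost never taken.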

There is also "lock cmpxchg8b" on i386.  I think this can be used in a
loop to implement atomic 64-bit ops (?).  Simpler, but slower in
PCPU_INC().  I prefer a function call version of PCPU_INC() to this.
That should be faster in the usual case and only much larger if we
have too many 64-bit counters.
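
For comparison, the compare-and-exchange loop can be sketched with C11
atomics, used here as a portable stand-in that postdates this thread.
The function name is hypothetical; on i386, compilers typically
implement the 64-bit compare-and-exchange with "lock cmpxchg8b", but
that is an assumption about code generation, not something this sketch
guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically add n to *p with a compare-and-exchange retry loop. */
static uint64_t
atomic_add_64_cas(_Atomic uint64_t *p, uint64_t n)
{
	uint64_t old;

	assert(p != NULL);
	old = atomic_load_explicit(p, memory_order_relaxed);
	/*
	 * On failure the call stores the current value of *p in 'old',
	 * so each retry recomputes old + n from fresh data.
	 */
	while (!atomic_compare_exchange_weak(p, &old, old + n))
		;
	return (old + n);
}
```

Every increment pays for a locked read-modify-write here, which is the
"slower in PCPU_INC()" cost mentioned above.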

> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
> machine and do routing at link speed, though. This might overflow the IP
> input/output byte counter (which we don't have yet) too fast.

Not with a mere 10Gb/s.  That's ~1GB/s so it takes about 4 seconds to overflow
a 32-bit byte counter.  A bit counter would take a while to overflow too.
Are there any faster incrementors?  TSCs also take O(1) seconds to overflow,
and timecounter logic depends on no timecounter overflowing much faster
than that.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 20:01:35 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4E55F1065676
	for <freebsd-arch@freebsd.org>; Sat, 19 Dec 2009 20:01:35 +0000 (UTC)
	(envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
	by mx1.freebsd.org (Postfix) with ESMTP id D451D8FC16
	for <freebsd-arch@freebsd.org>; Sat, 19 Dec 2009 20:01:34 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de over
	TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); 
	Sat, 19 Dec 2009 21:01:32 +0100
Date: Sat, 19 Dec 2009 21:01:35 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20091220032452.W2429@besplex.bde.org>
Message-ID: <20091219204217.D1741@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
	<200912191244.17803.hselasky@c2i.net>
	<20091219232119.L1555@besplex.bde.org>
	<20091219164818.L1741@beagle.kn.op.dlr.de>
	<20091220032452.W2429@besplex.bde.org>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-OriginalArrivalTime: 19 Dec 2009 20:01:32.0832 (UTC)
	FILETIME=[0FBC6A00:01CA80E6]
Cc: freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt <harti@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 20:01:35 -0000

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Harti Brandt wrote:
BE>
BE>> On Sun, 20 Dec 2009, Bruce Evans wrote:
BE>> 
BE>> [... complications]
BE>> 
BE>> To be honest, I'm lost now. Couldn't we just use the largest atomic type
BE>> for the given platform and atomic_inc/atomic_add/atomic_fetch and handle
BE>> the 32->64 bit stuff (for IA32) as I do it in bsnmp, but as a kernel
BE>> thread?
BE>
BE>That's probably best (except without the atomic operations) (like I said
BE>originally.  I tried to spell out the complications to make it clear that
BE>they would be too much except for incomplete ones).
BE>
BE>> Are the 5-6 atomic operations really that costly given the many operations
BE>> done on an IP packet? Are they more costly than a heavyweight sync for
BE>> each ++ or +=?
BE>
BE>rwatson found that even non-atomic operations are quite costly, since
BE>at least on amd64 and i386, ones that write (or any access?) the same
BE>address (or cache line?) apparently involve much the same hardware
BE>activity (cache snoop?) as atomic ones implemented by locking the bus.
BE>I think this is mostly historical -- it should be necessary to lock the
BE>bus to get the slow version.  Per-CPU counters give separate addresses
BE>and also don't require the bus lock.  I don't like the complexity for
BE>per-CPU counters but don't use big SMP systems enough to know what the
BE>locks cost in real applications.
BE>
BE>> Or we could use the PCPU stuff, use just ++ and += for modifying the
BE>> statistics (32bit) and do the 32->64 bit stuff for all platforms with a
BE>> kernel thread per CPU (do we have this?). Between that thread and the
BE>> sysctl we could use a heavy sync.
BE>
BE>I don't like the squillions of threads in FreeBSD-post-4, but this seems
BE>to need its own one and there isn't one yet AFAIK.  I think a thread is
BE>only needed for the 32-bit stuff (since aggregation has to use the
BE>current values and it shouldn't have to ask a thread to sum them).  The
BE>thread should maintain only the high 32 or 33 bits of the 64-bit counters.
BE>Maybe there should be a thread per CPU (ugh) with per-CPU extra bits so
BE>that these bits can be accessed without locking.  The synchronization is
BE>still interesting.
BE>
BE>> Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
BE>> largest atomic type for the platform, handle the aggregation and (on IA32)
BE>> the 32->64 bit stuff in a kernel thread.
BE>
BE>I don't see why using atomic or locks for just the 64 bit counters is good.
BE>We will probably end up with too many 64-bit counters, especially if they
BE>don't cost much when not read.

On a 32-bit arch, when one CPU reads a 32-bit value while another CPU is
modifying it, the read will probably always be correct, given that the
variable is correctly aligned. On a 64-bit arch, when one CPU reads a 64-bit
value while another one is adding to it, do I always get the correct value?
I'm not sure about this, which is why I put atomic_*() there, assuming that
they will make this correct.

The idea is (for 32-bit platforms):

struct pcpu_stats {
	uint32_t in_bytes;
	uint32_t in_packets;
};

struct pcpu_hc_stats {
	uint64_t hc_in_bytes;
	uint64_t hc_in_packets;
};

/* driver; IP stack; ... */
...
pcpu_stats->in_bytes += bytes;
pcpu_stats->in_packets++;
...

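/* per CPU kernel thread for 32-bit arch */
lock(pcpu_hc_stats);
...
val = pcpu_stats->in_bytes;
/* the low word went backwards, so the 32-bit counter has wrapped */
if ((uint32_t)pcpu_hc_stats->hc_in_bytes > val)
	pcpu_hc_stats->hc_in_bytes += 0x100000000ULL;
/* keep the high word, take the low word from the 32-bit counter */
pcpu_hc_stats->hc_in_bytes = (pcpu_hc_stats->hc_in_bytes &
    0xffffffff00000000ULL) | val;
...
unlock(pcpu_hc_stats);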

/* sysctl */

memset(&stats, 0, sizeof(stats));
foreach(cpu) {
	lock(pcpu_hc_stats(cpu));
	...
	stats.in_bytes += pcpu_hc_stats(cpu)->hc_in_bytes;
	...
	unlock(pcpu_hc_stats(cpu));
}
copyout(stats);

On 64-bit archs we can go without the locks and the thread given that we
can reliably read the 64-bit per CPU numbers (can we?).
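
The folding step and the sysctl-side sum can be isolated into two small
functions for testing.  This is a sketch with hypothetical function
names and without the locking; it assumes the thread samples each 32-bit
counter at least once per wrap period, since a second wrap between
samples would go unnoticed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fold a fresh 32-bit sample into the 64-bit shadow value: if the
 * sample is below the previously recorded low word, the 32-bit counter
 * has wrapped once, so carry into the high word.
 */
static uint64_t
fold_low32(uint64_t hc, uint32_t val)
{
	if ((uint32_t)hc > val)
		hc += 0x100000000ULL;
	return ((hc & 0xffffffff00000000ULL) | val);
}

/* Aggregate the per-CPU 64-bit shadow values, as the sysctl would. */
static uint64_t
sum_hc(const uint64_t *hc, size_t ncpu)
{
	uint64_t total = 0;
	size_t i;

	assert(hc != NULL || ncpu == 0);
	for (i = 0; i < ncpu; i++)
		total += hc[i];
	return (total);
}
```

For example, folding the sample 0x10 into a shadow value of
0x1fffffff0 yields 0x200000010.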

BE>I just thought of another implementation to reduce reads: trap on
BE>overflow and handle all the complications in the trap handler, or
BE>just set a flag to tell the fixup thread to run and normally don't
BE>run the fixup thread.  This seems to not quite work -- arranging
BE>for the trap would be costly (needs "into" instruction on i386?).
BE>Similarly for explicit tests for wraparound (PCPU_INC() could be a
BE>function call that does the test and handles wraparound in a fully
BE>locked fashion.  We don't care that this code executes slowly since
BE>it rarely executes, but we care that the test pessimizes the usual
BE>case).
BE>
BE>There is also "lock cmpxchg8b" on i386.  I think this can be used in a
BE>loop to implement atomic 64-bit ops (?).  Simpler, but slower in
BE>PCPU_INC().  I prefer a function call version of PCPU_INC() to this.
BE>That should be faster in the usual case and only much larger if we
BE>have too many 64-bit counters.
BE>
BE>> Using 32 bit stats may fail if you put in several 10GBit/s adapters into a
BE>> machine and do routing at link speed, though. This might overflow the IP
BE>> input/output byte counter (which we don't have yet) too fast.
BE>
BE>Not with a mere 10Gb/s.  That's ~1GB/s so it takes about 4 seconds to overflow
BE>a 32-bit byte counter.  A bit counter would take a while to overflow too.
BE>Are there any faster incrementors?  TSCs also take O(1) seconds to overflow,
BE>and timecounter logic depends on no timecounter overflowing much faster
BE>than that.

If you have 4 10GBit/s adapters, each operating full-duplex at link speed, you
wrap in under 0.5 seconds, maybe even faster if you have some kind of tunnels
where each packet is counted several times. But I suppose this will not be so
easy to implement with IA32 :-)
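
The two figures are easy to check: a 32-bit byte counter wraps after
2^32 bytes, one 10 Gbit/s link carries 1.25 GB/s, and four full-duplex
adapters add up to 80 Gbit/s.  A small helper (the name is hypothetical)
makes the arithmetic explicit:

```c
#include <assert.h>

/* Seconds until a 32-bit byte counter wraps at the given bit rate. */
static double
wrap_seconds(double bits_per_second)
{
	assert(bits_per_second > 0.0);
	return (4294967296.0 / (bits_per_second / 8.0));
}
```

wrap_seconds(10e9) is about 3.4 seconds for a single link in one
direction, and wrap_seconds(80e9) is about 0.43 seconds for the
four-adapter case.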

harti

From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 19 21:09:11 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CD80C106566B;
	Sat, 19 Dec 2009 21:09:11 +0000 (UTC)
	(envelope-from uqs@spoerlein.net)
Received: from acme.spoerlein.net (acme.spoerlein.net [217.20.127.186])
	by mx1.freebsd.org (Postfix) with ESMTP id 3D3128FC1C;
	Sat, 19 Dec 2009 21:09:10 +0000 (UTC)
Received: from acme.spoerlein.net (localhost.spoerlein.net [IPv6:::1])
	by acme.spoerlein.net (8.14.3/8.14.3) with ESMTP id nBJKWJPT057539
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 19 Dec 2009 21:32:19 +0100 (CET)
	(envelope-from uqs@spoerlein.net)
Received: (from uqs@localhost)
	by acme.spoerlein.net (8.14.3/8.14.3/Submit) id nBJKWJYV057538;
	Sat, 19 Dec 2009 21:32:19 +0100 (CET)
	(envelope-from uqs@spoerlein.net)
Date: Sat, 19 Dec 2009 21:32:19 +0100
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@spoerlein.net>
To: Harti Brandt <hartmut.brandt@dlr.de>
Message-ID: <20091219203219.GS55913@acme.spoerlein.net>
Mail-Followup-To: Harti Brandt <hartmut.brandt@dlr.de>,
	freebsd-arch@freebsd.org, bde@freebsd.org
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
	<200912150812.35521.jhb@freebsd.org>
	<20091215183859.S53283@beagle.kn.op.dlr.de>
	<200912151313.28326.jhb@freebsd.org>
	<20091219112711.GR55913@acme.spoerlein.net>
	<20091219154206.E93919@beagle.kn.op.dlr.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20091219154206.E93919@beagle.kn.op.dlr.de>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: bde@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 19 Dec 2009 21:09:11 -0000

On Sat, 19.12.2009 at 15:56:38 +0100, Harti Brandt wrote:
> On Sat, 19 Dec 2009, Ulrich Spörlein wrote:
> 
> >On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
> >> On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
> >> > I see. I was also thinking along these lines, but was not sure whether it 
> >> > is worth the trouble. I suppose this does not help to implement 64-bit 
> >> > counters on 32-bit architectures, though, because you cannot read them 
> >> > reliably without locking to sum them up, right?
> >> 
> >> Either that or you just accept that you have a small race since it is only stats. :)
> >
> >This might be stupid, but can we not easily *read* 64bit counters
> >on 32bit machines like this:
> >
> >do {
> >    h1 = read_upper_32bits;
> >    l1 = read_lower_32bits;
> >    h2 = read_upper_32bits;
> >    l2 = read_lower_32bits; /* not needed */
> >} while (h1 != h2);
> >
> >sum64 = (h1<<32) + l1;
> >
> >or something like that? If h2 does not change between readings, no
> >wrap-around has occurred. If l1 was read in between the readings of h1
> >and h2, the code above is sound. Right?
> 
> I suppose this works only if it would be guaranteed that the CPU modifying 
> the 64-bit value does this somehow faster than the CPU reading the data:
> 
> CPU1                    CPU2
> ----                    ----
> write new h
> 			read h1 (new h)
> 			read l1 (old l)
> 			read h2 (new h)
> write new l
> 
> It doesn't work either when the CPU first writes L and then H.

To be honest, I didn't even think about the 64 bit writes being
non-atomic, too. So, of course my suggestion was way too naive.
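
For the archive: one way that might make such a split read safe is to wrap the
two halves in a sequence counter, so the reader can also detect a write that is
in progress. The following is only a sketch using C11 atomics, not something
taken from the tree. It assumes a single writer per counter (for example one
counter per CPU), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit counter kept as two 32-bit halves. */
struct split_counter {
	_Atomic uint32_t seq;	/* even: stable, odd: update in progress */
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

/* Add to the counter. Assumes exactly one writer per counter. */
static void
split_counter_add(struct split_counter *c, uint32_t inc)
{
	uint32_t s, lo, hi, nlo;

	assert(c != NULL);
	s = atomic_load_explicit(&c->seq, memory_order_relaxed);
	lo = atomic_load_explicit(&c->lo, memory_order_relaxed);
	hi = atomic_load_explicit(&c->hi, memory_order_relaxed);
	nlo = lo + inc;

	/* Mark the update as in progress before touching the halves. */
	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->lo, nlo, memory_order_relaxed);
	if (nlo < lo)		/* low half wrapped, carry into high half */
		atomic_store_explicit(&c->hi, hi + 1, memory_order_relaxed);
	/* Publish: the sequence number is even again. */
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Read a consistent 64-bit value; retry if a write raced with us. */
static uint64_t
split_counter_read(struct split_counter *c)
{
	uint32_t s1, s2, lo, hi;

	assert(c != NULL);
	do {
		s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
		lo = atomic_load_explicit(&c->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&c->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s1 & 1u) != 0 || s1 != s2);

	return (((uint64_t)hi << 32) | lo);
}
```

Unlike the h1/h2 comparison, the reader here retries whenever the sequence
number is odd or has changed, so it cannot combine a new high half with an old
low half. The price is an extra word per counter and two extra stores on every
update, which may or may not be acceptable for per-packet statistics.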

Also thanks to Bruce for re-iterating the whole write/read ordering
stuff yet again. :)

Regards,
Uli