From: "Robert N. M. Watson" <rwatson@FreeBSD.org>
Date: Sun, 20 Dec 2009 13:46:54 +0000
To: Harti Brandt
Cc: Ulrich Spörlein, Hans Petter Selasky, freebsd-arch@freebsd.org
Subject: Re: network statistics in SMP
In-Reply-To: <20091220134738.V46221@beagle.kn.op.dlr.de>
Message-Id: <5230C2B2-57A5-4982-928A-43756BF8C1C4@FreeBSD.org>
References: <20091215103759.P97203@beagle.kn.op.dlr.de>
 <200912151313.28326.jhb@freebsd.org>
 <20091219112711.GR55913@acme.spoerlein.net>
 <200912191244.17803.hselasky@c2i.net>
 <20091219232119.L1555@besplex.bde.org>
 <20091219164818.L1741@beagle.kn.op.dlr.de>
 <20091220134738.V46221@beagle.kn.op.dlr.de>

On 20 Dec 2009, at 13:19, Harti Brandt wrote:

> RW>Frequent writes to the same cache line across multiple cores are
> RW>remarkably expensive, as they trigger the cache coherency protocol
> RW>(mileage may vary). For example, a single non-atomically incremented
> RW>counter cut the performance of gettimeofday() to 1/6th on an 8-core
> RW>system when parallel system calls were made across all cores. On many
> RW>current systems, the cost of an "atomic" operation is now fairly
> RW>reasonable as long as the cache line is held exclusively by the
> RW>current CPU. However, if we can avoid them that has value, as we
> RW>update quite a few global stats on the way through the network stack.
>
> Hmm. I'm not sure that gettimeofday() is comparable to forwarding an IP
> packet. I would expect that a single increment is a good percentage of
> the entire processing (in terms of number of operations) for
> gettimeofday(), while in IP forwarding it is somewhere in the noise
> floor. In the simplest case the packet is acted upon by the receiving
> driver, the IP input function, the IP output function and the sending
> driver -- not to mention IP filters, firewalls, tunnels, dummynet and
> whatever else. The relative cost of the increment should be much less.
> But I may be wrong, of course.

If processing is occurring on multiple CPUs -- for example, you are
receiving UDP via two ithreads -- then 4-8 cache lines being contended
due to stats is a lot. Our goal should be (for 9.0) to avoid having any
contended cache lines in the common case when processing independent
streams on different CPUs.
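To make the contended-cache-line point concrete, here is a minimal sketch
of per-CPU counters in plain C. MAXCPU, struct ipstat_pcpu and
ipstat_inc_total() are made-up names for illustration, not the real
kernel API; the point is only that the hot path increments a counter in a
cache line owned by the local CPU, with no atomics and no sharing.

    /*
     * Illustrative sketch only: per-CPU statistics to avoid contended
     * cache lines.  All names here are placeholders.
     */
    #include <stdint.h>

    #define MAXCPU          64
    #define CACHE_LINE_SIZE 64

    struct ipstat_pcpu {
            uint64_t ips_total;     /* packets received */
            uint64_t ips_delivered; /* packets handed to upper layers */
            uint64_t ips_forwarded; /* packets forwarded */
    } __attribute__((aligned(CACHE_LINE_SIZE)));   /* one line per CPU */

    static struct ipstat_pcpu ipstat[MAXCPU];

    /*
     * Hot path: a plain increment of the local CPU's copy.  No atomic
     * instruction and no write to a cache line shared with other CPUs.
     */
    static inline void
    ipstat_inc_total(int cpu)
    {
            ipstat[cpu].ips_total++;
    }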
> I would really like to sort that out before any kind of ABI freeze
> happens. Ideally all the statistics would be accessible via sysctl(),
> have a version number, and include all or most of the required
> statistics, with a simple way to add new fields without breaking
> anything. Also the field sizes (64 vs. 32 bit) should be correct at the
> kernel/user interface.
>
> My current feeling after reading this thread is that the low-level
> kernel side is probably beyond what I can do with the time I have, and
> would sidetrack me too far from the work on bsnmp. What I would like to
> do is fix the kernel/user interface and let the people who know how to
> do it handle the low-level side.
>
> I would really not like to have to deal with a changing user/kernel
> interface in current if we go in several steps with the kernel stuff.

I think we should treat the statistics-gathering and statistics-reporting
interfaces as entirely separable problems. Statistics are updated far
more frequently than they are queried, so making the query process a bit
more expensive (reformatting from an efficient 'update' format to an
application-friendly 'report' format) should be fine.

One question to think about is whether cross-CPU summaries alone are
sufficient, or whether we actually also want to be able to directly
monitor per-CPU statistics at the IP layer. The former would maintain the
status quo, making per-CPU behaviour purely part of the 'update' step;
the latter would change the 'report' format as well. I've been focused
primarily on 'update', but at least for my work it would be quite helpful
to have per-CPU stats in the 'report' format as well.

> I will try to come up with a patch for the kernel/user interface in the
> mean time. This will be for 9.x only, obviously.

Sounds good -- and the kernel stats capture can "grow into" the full
report format as it matures.

> Doesn't this help for output only? For the input statistics there will
> still be per-ifnet statistics.

Most ifnet-layer stats should really be per-queue, both for input and
output, which may help.

> An interesting question from the SNMP point of view is what happens to
> the statistics if you move interfaces around between vimages. In any
> case it would be good if we could abstract away from all the
> complications while going kernel->userland.

At least for now, the interface is effectively recreated when it moves
vimage, and only the current vimage is able to monitor it. That could be
considered a bug, but it might also be a simplifying assumption or even a
feature. Likewise, it's worth remembering that the ifnet index space is
per-vimage.

Robert
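A matching sketch of the 'report' side described above, reusing the
placeholder ipstat[] array and MAXCPU from the earlier sketch: the
per-CPU 'update' copies are only folded into an application-friendly
structure when userland actually asks, so the query path pays the cost
rather than the packet path. Again, the names are illustrative, not the
real sysctl code.

    /*
     * Illustrative sketch only: fold the per-CPU "update" copies into a
     * single "report" structure at query time.
     */
    struct ipstat_report {
            uint64_t ips_total;
            uint64_t ips_delivered;
            uint64_t ips_forwarded;
    };

    static void
    ipstat_collect(struct ipstat_report *rp)
    {
            int cpu;

            rp->ips_total = rp->ips_delivered = rp->ips_forwarded = 0;
            for (cpu = 0; cpu < MAXCPU; cpu++) {
                    rp->ips_total     += ipstat[cpu].ips_total;
                    rp->ips_delivered += ipstat[cpu].ips_delivered;
                    rp->ips_forwarded += ipstat[cpu].ips_forwarded;
            }
            /* A sysctl handler would copy *rp out to userland here. */
    }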