Date: Sat, 19 Dec 2009 17:02:23 +0100 (CET)
From: Harti Brandt
To: Bruce Evans
Cc: Ulrich Spörlein, freebsd-arch@freebsd.org, Hans Petter Selasky
Subject: Re: network statistics in SMP
In-Reply-To: <20091219232119.L1555@besplex.bde.org>
Message-ID: <20091219164818.L1741@beagle.kn.op.dlr.de>
References: <20091215103759.P97203@beagle.kn.op.dlr.de> <200912151313.28326.jhb@freebsd.org> <20091219112711.GR55913@acme.spoerlein.net> <200912191244.17803.hselasky@c2i.net> <20091219232119.L1555@besplex.bde.org>

On Sun, 20 Dec 2009, Bruce Evans wrote:

BE>On Sat, 19 Dec 2009, Hans Petter Selasky wrote:
BE>
BE>> On Saturday 19 December 2009 12:27:12 Ulrich Spörlein wrote:
BE>> > On Tue, 15.12.2009 at 13:13:28 -0500, John Baldwin wrote:
BE>> > > On Tuesday 15 December 2009 12:45:13 pm Harti Brandt wrote:
BE>> > > > I see. I was also thinking along these lines, but was not sure
BE>> > > > whether it is worth the trouble. I suppose this does not help to
BE>> > > > implement 64-bit counters on 32-bit architectures, though, because
BE>> > > > you cannot read them reliably without locking to sum them up, right?
BE>> > >
BE>> > > Either that or you just accept that you have a small race since it is
BE>> > > only stats. :)
BE>> >
BE>> > This might be stupid, but can we not easily *read* 64bit counters
BE>> > on 32bit machines like this:
BE>> >
BE>> > do {
BE>> >         h1 = read_upper_32bits;
BE>> >         l1 = read_lower_32bits;
BE>> >         h2 = read_upper_32bits;
BE>> >         l2 = read_lower_32bits; /* not needed */
BE>> > } while (h1 != h2);
BE>
BE>No.  See my previous^N reply, but don't see it about this since it was
BE>wrong about this for all N :-).
BE>
BE>> Just a comment. You know you don't need a while loop to get a stable
BE>> value? Should be implemented like this, in my opinion:
BE>>
BE>> h1 = read_upper_32bits;
BE>> l1 = read_lower_32bits;
BE>> h2 = read_upper_32bits;
BE>>
BE>> if (h1 != h2)
BE>>         l1 = 0xffffffffUL;
BE>>
BE>> sum64 = ((uint64_t)h1 << 32) | l1;
BE>
BE>Also wrong :-).  Apart from write ordering problems (1), the write of
BE>the second half (presumably the top half) might not have completed when
BE>the above looks at it.  Then both of the above will see:
BE>        h1 = old value
BE>        l1 = new value
BE>        h2 = old value (since the new one has not been written yet).
BE>The race window for this can be arbitrarily long, since the second
BE>write can be delayed for arbitrarily long (e.g., by interrupt handling).
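
To make sure I understand the failure mode: the writer is the problem,
because the two halves are written with two separate stores. A minimal
sketch of what I think PCPU_INC() boils down to on i386 (the struct and
the names are made up for illustration):

	#include <stdint.h>

	/* one 64-bit counter kept as two 32-bit halves */
	struct cnt64 {
		volatile uint32_t lo;
		volatile uint32_t hi;
	};

	static void
	cnt64_inc(struct cnt64 *c)
	{
		if (++c->lo == 0)	/* first store: low half wraps to 0 */
			c->hi++;	/* second store: may be delayed arbitrarily */
	}

A reader that runs between the two stores sees h1 == h2 == old and
l1 == 0, so both loops above happily terminate with a value that is
2^32 too small. Is that the scenario you mean?
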
BE>Even if we ensure write ordering and no interrupts, the above has many
BE>problems:
BE>- we can't reasonably guarantee that the reads of l1 and h2 will execute
BE>  sufficiently faster than the writes of l1 and h1/h2 so that the above
BE>  will see h2 after l1.  I think the writes will usually go slightly
BE>  faster since they will go through a write buffer, provided the 2 halves
BE>  are in a single cache line, but this is unclear.  SMP with different
BE>  CPU frequencies is not really supported, but works now modulo
BE>  timecounter problems, and we probably want to support completely
BE>  independent CPU frequencies, with some CPUs throttled.
BE>- I don't understand why the above compares the high values.  Comparing
BE>  the low values seems to work better.
BE>- I can't see how to fix up the second method to be useful.  It is
BE>  faster, but some delays seem to be necessary and they might as well
BE>  be partly in a slower method.
BE>
BE>Another try:
BE>- enforce write ordering in writers and read ordering in the reader
BE>- make sure that the reader runs fast enough.  This might require using
BE>  critical_enter() in the reader.
BE>- then many cases work with no complications:
BE>  (a) if l1 is not small and not large, then h1 must be associated with
BE>      l1, since then the low value can't roll over while we are looking
BE>      at the pair, so the high value can't change while we are looking.
BE>      So we can just use h1 and l1, without reading h2 or l2.
BE>  (b) similarly, if l1 is small, then h2 is associated with l1.  So we
BE>      can just use h2 with l1, without reading l2.
BE>- otherwise, l1 is large:
BE>  (c) if l1 == l2, then h1 must be associated with l1, since some writes
BE>      of the high value associated with writing not-so-large low values
BE>      have had plenty of time to complete (in fact, the one for (l1-2)
BE>      or (l2-1) must have completed, and "large" only needs to be 1
BE>      or 2 to ensure that these values don't wrap).  E.g., if l1 is
BE>      0xFFFFFFFF, then it is about to wrap, but it certainly hasn't
BE>      wrapped recently so h1 hasn't incremented recently.  So we can
BE>      use h1 with l1, after reading l2 (we still need to read h2, in
BE>      case we don't get here).
BE>  (d) if l1 != l2, then ordering implies that the write of the high
BE>      value associated with l1 has completed.  We might have missed
BE>      reading this value, since we might have read h1 too early and
BE>      might have read h2 too late, but in the usual case h1 == h2 and
BE>      then both h's are associated with l1, while if h1 != h2 then we
BE>      can loop again and surely find h1 == h2 (and l1 small, so case
BE>      (b)), or we can use the second method.  We had to read all 4
BE>      values to determine what to do here, and can usually use 2 of
BE>      them directly.
BE>
BE>(1) Write ordering is guaranteed on amd64 but I think not on all arches.
BE>    You could pessimize PCPU_INC() with memory barriers on some arches
BE>    to get write ordering.
BE>
BE>(2) You could pessimize PCPU_INC() with critical_enter() to mostly
BE>    prevent this.  You cannot prevent the second write from being
BE>    delayed for arbitrarily long by a trap and its handling, except by
BE>    breaking at least the debugger traps needed to debug this.
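
Trying to turn your cases (a)-(d) into code, I end up with something
like the following sketch. The thresholds, the rmb() read barriers and
the layout are my guesses; the writer is assumed to store the low half
first with write ordering enforced:

	#include <stdint.h>
	#include <machine/atomic.h>	/* for rmb(), where the arch has it */

	struct cnt64 {
		volatile uint32_t lo;
		volatile uint32_t hi;
	};

	#define	CNT_SMALL	2			/* below: low half just wrapped */
	#define	CNT_LARGE	(0xffffffffU - 2)	/* above: low half about to wrap */

	static uint64_t
	cnt64_fetch(struct cnt64 *c)
	{
		uint32_t h1, l1, h2, l2;

		for (;;) {
			h1 = c->hi;
			rmb();
			l1 = c->lo;
			rmb();
			h2 = c->hi;
			rmb();
			l2 = c->lo;

			if (l1 >= CNT_SMALL && l1 <= CNT_LARGE)
				return (((uint64_t)h1 << 32) | l1);	/* (a) */
			if (l1 < CNT_SMALL)
				return (((uint64_t)h2 << 32) | l1);	/* (b) */
			if (l1 == l2)
				return (((uint64_t)h1 << 32) | l1);	/* (c) */
			if (h1 == h2)
				return (((uint64_t)h1 << 32) | l1);	/* (d), usual case */
			/* (d) with h1 != h2: retry */
		}
	}

Is that about right? The reader still depends on running fast enough
relative to the writer, as you say, so it would need critical_enter()
around the four reads.
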
BE>In my previous^2 reply, I said heavyweight synchronization combined
BE>with extra atomic generation counters would work.  The heavyweight
BE>synchronization would have to be heavier than I thought -- it would
BE>have to wait for all other CPUs to complete the pairs of writes for
BE>64-bit counters, if any, and for this it would have to do more than
BE>IPIs -- it should change priorities and reschedule to ensure that the
BE>half-writes (if any) have a chance of completing soon...  Far too
BE>complicated for this.  Disabling interrupts for the non-atomic
BE>PCPU_INC()s is probably best.  Duh, this or worse (locking) is
BE>required on the writer side anyway, else increments won't be atomic.
BE>Locking would actually automatically give the rescheduling stuff for
BE>the heavyweight synchronization -- you would acquire the lock in the
BE>reader and of course in the writers, and get priority propagation to
BE>complete the one writer allowed to hold the lock iff any is holding
BE>it.  Locking might not be too bad for a few 64-bit counters.  So I've
BE>changed my mind yet again and prefer locking to critical_enter().
BE>It's cleaner and works for traps.  I just remembered that rwatson
BE>went the opposite way and changed some locking to critical_enter()
BE>in UMA.  I prefer the old way, and at least in old versions of
BE>FreeBSD I got panics trying to debug near this (single-stepping
BE>malloc()?).
BE>
BE>In my previous^1 reply, I said lighter weight synchronization combined
BE>with no extra or atomic counters (using the counters as their own
BE>generation counters) would work.  But the synchronization still needs
BE>to be heavy, or interrupts disabled, as above.
BE>
BE>Everything must be read before and after to test for getting a
BE>coherent set of values, so the loop in the first method has the
BE>minimal number of reads (for a single 64-bit counter).  With sync,
BE>the order for each pair in it doesn't matter on either the reader or
BE>writer (there must be a sync or 2 instead).

To be honest, I'm lost now. Couldn't we just use the largest atomic
type for the given platform with atomic_inc/atomic_add/atomic_fetch and
handle the 32->64 bit stuff (for IA32) as I do in bsnmp, but in a
kernel thread?

Are the 5-6 atomic operations really that costly given the many
operations done on an IP packet? Are they more costly than a
heavyweight sync for each ++ or +=?

Or we could use the PCPU stuff, use just ++ and += for modifying the
statistics (32 bit) and do the 32->64 bit stuff for all platforms with
a kernel thread per CPU (do we have this?). Between that thread and the
sysctl we could use a heavy sync.

Or we could use PCPU and atomic_inc/atomic_add/atomic_fetch with the
largest atomic type for the platform, and handle the aggregation and
(on IA32) the 32->64 bit stuff in a kernel thread.

Using 32 bit stats may fail if you put several 10GBit/s adapters into a
machine and do routing at link speed, though. This might overflow the
IP input/output byte counter (which we don't have yet) too fast.

harti
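
PS: For the PCPU variant, what I have in mind is roughly the following
(just an untested sketch with made-up names; the aggregation thread
must run often enough that a 32-bit counter cannot wrap more than once
between two runs):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>

	/* per-CPU 32-bit statistic, modified with plain ++/+= by its CPU only */
	struct pcpu_stat {
		uint32_t	ps_cnt;
	};

	/* 64-bit aggregate, maintained by the per-CPU kernel thread */
	struct stat64 {
		struct mtx	s_mtx;	/* the heavy sync with the sysctl reader */
		uint64_t	s_val;
		uint32_t	s_last;	/* last snapshot of ps_cnt */
	};

	static void
	stat64_update(struct stat64 *s, struct pcpu_stat *ps)
	{
		uint32_t now = ps->ps_cnt;	/* a single 32-bit read is atomic */

		mtx_lock(&s->s_mtx);
		/* the mod-2^32 difference absorbs one wrap of ps_cnt */
		s->s_val += (uint32_t)(now - s->s_last);
		s->s_last = now;
		mtx_unlock(&s->s_mtx);
	}

The sysctl handler would take s_mtx while summing the per-CPU
aggregates, so the reader and the thread are serialized without
touching the fast path at all.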