From owner-freebsd-current@FreeBSD.ORG Thu May  6 01:02:34 2004
Date: Thu, 6 May 2004 18:02:10 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Andrew Gallatin
cc: freebsd-current@FreeBSD.org
cc: Gerrit Nagelhout
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <20040506164710.I19057@gamplex.bde.org>
In-Reply-To: <16537.23378.375946.857908@grasshopper.cs.duke.edu>
References: <16537.23378.375946.857908@grasshopper.cs.duke.edu>

On Wed, 5 May 2004, Andrew Gallatin wrote:

> Bruce Evans writes:
> > So there seems to be something wrong with your benchmark.  Locking
> > the bus for the SMP case always costs about 20+ cycles, but this
> > hasn't changed since RELENG_4 and mutexes can't be made much faster
> > in the uncontested case since their overhead is dominated by the
> > bus lock time.
>
> Actually, I think his tests are accurate and bus locked instructions
> take an eternity on P4.  See
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>
> For example, with your test above, I see 212 cycles for the UP case
> on a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with
> a simple slock = 0; reduces that count to 18 cycles.

This seems to be right, unfortunately.  I wonder if this has anything
to do with freebsd.org having no P4 machines.

> If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
> then I think you should do it.  Of course, that still leaves mutexes
> as very expensive on SMP (253 cycles on the 2.53GHz from above).

I forgot (again) that there are memory access ordering issues.  A lock
may be needed to get everything synced.  See the comment before the
i386 versions in i386/include/atomic.h.  A single lock may be enough.
The best example I could think of easily is:

%%%
	int foo;	/* supposedly protected by sched_lock */
	...
	mtx_lock(&mtx);
	if (foo == 0)
		foo++;
	mtx_unlock(&mtx);
	KASSERT(foo == 1, ("oops"));
%%%

On at least amd64's, reads can be done out of order relative to all
other reads and relative to writes to different memory locations.
mtx_lock(&mtx) doesn't go near foo's memory location, so foo may be
read before the lock is acquired.  If this code is interrupted after
foo is read but before the lock is acquired, then the interrupt
handler may run the same code and bump foo to 1.  Then on return, if
the out-of-order read is still valid, the above will bump foo again.
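For reference, the release-store comparison behind the 212 vs 18 cycle
numbers above looks roughly like this (a sketch with made-up names,
not the real i386/include/atomic.h code):

%%%
/* The bus-locked flavour: xchgl with a memory operand locks the bus. */
static __inline void
store_rel_xchg(volatile u_int *p, u_int v)
{

	__asm __volatile("xchgl %1,%0" : "+m" (*p), "+r" (v) : : "memory");
}

/* The proposed non-SMP flavour: compiler barrier plus a plain store. */
static __inline void
store_rel_plain(volatile u_int *p, u_int v)
{

	__asm __volatile("" : : : "memory");
	*p = v;
}
%%%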
However, the lock in the mtx_unlock() in the interrupt handler
presumably makes the out-of-order read invalid (does it?), so on
return from the interrupt handler foo will be read again and found to
be 1 (since even if the read is out of order relative to the mtx
locking, it is ordered relative to the write to foo's memory
location).  If this is correct, then someone's home-made locking using
atomic_cmpset for unlock was more than a style bug :-).

To possibly reduce the locking overhead for at least the non-SMP case
on i386's and amd64's, there are the [lms]fence instructions and
serializing instructions.  On AthlonXPs, movl + sfence takes about
half as many cycles as xchgl.  I think lfence is actually needed, but
AthlonXPs only have sfence.  All the serializing instructions seem to
be too heavyweight to help here.  On sledge's amd64, the saving using
lfence is smaller (atomic_cmpset_acq_int + movl: 6 cycles;
atomic_cmpset_acq_int + movl + lfence: 15 cycles; atomic_cmpset_acq_int
+ atomic_store_rel_int: 21 cycles) (all including loop overhead; the
kind of loop is sketched below).

Bruce
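Roughly the kind of loop meant above, as a user-level sketch (the
lock_acq()/lock_rel_xchg() pair and slock here are stand-ins, not the
kernel's atomic_cmpset_acq_int()/atomic_store_rel_int(), so the exact
counts will differ):

%%%
#include <stdio.h>

static volatile unsigned slock;

static __inline unsigned long long
rdtsc(void)
{
	unsigned lo, hi;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32 | lo);
}

/* Test-and-set acquire; stands in for an atomic_cmpset_acq_int() loop. */
static __inline void
lock_acq(void)
{
	unsigned v;

	do {
		v = 1;
		__asm __volatile("xchgl %1,%0"
		    : "+m" (slock), "+r" (v) : : "memory");
	} while (v != 0);
}

/* Bus-locked release; stands in for atomic_store_rel_int(&slock, 0). */
static __inline void
lock_rel_xchg(void)
{
	unsigned v = 0;

	__asm __volatile("xchgl %1,%0"
	    : "+m" (slock), "+r" (v) : : "memory");
}

int
main(void)
{
	unsigned long long t0, t1;
	int i, n = 1000000;

	t0 = rdtsc();
	for (i = 0; i < n; i++) {
		lock_acq();
		lock_rel_xchg();	/* or: slock = 0; or movl + [ls]fence */
	}
	t1 = rdtsc();
	printf("%llu cycles per lock/unlock pair (including loop overhead)\n",
	    (t1 - t0) / n);
	return (0);
}
%%%

Replacing lock_rel_xchg() with the plain store, or with a movl
followed by lfence or sfence, gives the comparisons discussed above;
whether a fence (and which one) is actually sufficient is the open
question.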