Date:      Mon, 8 Nov 2004 13:17:52 -0500
From:      John Baldwin <jhb@FreeBSD.org>
To:        Stephan Uphoff <ups@tree.com>
Cc:        Robert Watson <rwatson@FreeBSD.org>
Subject:   Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID:  <200411081317.52153.jhb@FreeBSD.org>
In-Reply-To: <1099779852.8097.68.camel@palm.tree.com>
References:  <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org> <1099779852.8097.68.camel@palm.tree.com>

On Saturday 06 November 2004 05:24 pm, Stephan Uphoff wrote:
> On Fri, 2004-11-05 at 07:01, Robert Watson wrote:
> > On Fri, 29 Oct 2004, Mike Silbersack wrote:
> > > I think we really need some sort of light-weight critical_enter that
> > > simply assures you that you won't get rescheduled to another CPU, but
> > > gives no guarantees beyond that.
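(As a rough illustration only: a pin-only primitive of that sort could be
little more than a per-thread pin count, much like sched_pin()/sched_unpin().
The names below are invented for the sketch; note that it only stops migration
to another CPU and does nothing about preemption or interrupts on the current
CPU, which is exactly the hole pointed out below.)

#include <sys/param.h>
#include <sys/proc.h>		/* curthread, td_pinned */

/*
 * Hypothetical "pin-only" primitive: keeps the scheduler from moving
 * the current thread to another CPU, nothing more.
 */
static __inline void
cpu_pin(void)
{
	curthread->td_pinned++;
}

static __inline void
cpu_unpin(void)
{
	curthread->td_pinned--;
}
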
> >
> > <snip>
> >
> > > Er, wait - I guess I'm forgetting something, there exists the potential
> > > for the interrupt that preempted whatever was calling arc4random to
> > > also call arc4random, thereby breaking things...
> >
> > I've been looking at related issues for the last couple of days and must
> > have missed this thread while at EuroBSDCon.  Alan Cox pointed me at it,
> > so here I am. :-)
> >
> > Right now, the cost of acquiring and dropping an uncontended sleep
> > mutex on a UP kernel is very low -- about 21 cycles on my PIII and 40 on
> > my P4, including some efficiency problems in my measurement which
> > probably add a non-trivial overhead.  Compare this with the SMP versions
> > on the PIII (90 cycles) and P4 (260 cycles!).  Critical sections on the
> > SMP PIII are about the same cost as the SMP mutex, but on the P4 a
> > critical section is less than half the cost.  Getting to a model where
> > critical sections were as cheap as UP sleep mutexes, or where we could
> > use a similar combination of primitives (such as UP mutexes with pinning)
> > would be very useful. Otherwise, optimizing through use of critical
> > sections will improve SMP but potentially damage performance on UP. 
> > There's been a fair amount of discussion of such approaches, including
> > the implementation briefly present in the FreeBSD tree.  I know John Baldwin
> > and Justin Gibbs both have theories and plans in this area.
> >
> > If we do create a UP mutex primitive for use on SMP, I would suggest we
> > actually expand the contents of the UP mutex structure slightly to
> > include a cpu number that can be asserted, along with pinning, when an
> > operation is attempted and INVARIANTS is present.  One of the great
> > strengths of the mutex/lock model is a strong assertion capability, both
> > for the purposes of documentation and testing, so we should make sure
> > that carries into any new synchronization primitives.
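(A rough sketch of that suggestion, with all names invented for illustration:
the UP-flavored mutex would record the CPU it was acquired on, and an
INVARIANTS-only assert would check both the CPU and that the thread is
pinned.)

#include <sys/param.h>
#include <sys/systm.h>		/* KASSERT() */
#include <sys/proc.h>		/* curthread, td_pinned */
#include <sys/pcpu.h>		/* PCPU_GET() */

struct upmtx {
	volatile struct thread	*um_owner;	/* owning thread, or NULL */
#ifdef INVARIANTS
	int			 um_cpu;	/* CPU the lock was acquired on */
#endif
};

static __inline void
upmtx_assert(struct upmtx *um)
{
#ifdef INVARIANTS
	KASSERT(um->um_cpu == PCPU_GET(cpuid),
	    ("upmtx acquired on CPU %d but used on CPU %d",
	    um->um_cpu, PCPU_GET(cpuid)));
	KASSERT(curthread->td_pinned > 0,
	    ("upmtx used without pinning the thread"));
#endif
}
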
> >
> > Small table of synchronization primitives below; in each case, the count
> > is in cycles and reflects the cost of acquiring and dropping the
> > primitive (lock+unlock, enter+exit).  The P4 is a 3 GHz box, and the PIII
> > is an 800 MHz box.  Note that the synchronization primitives requiring
> > atomic operations are substantially pessimized on the P4 vs the PIII.
> >
> > A discussion with John Baldwin and Scott Long yesterday revealed that the
> > UP spin mutex is currently pessimized from a critical section to a
> > critical section plus mutex internals due to a need for mtx_owned() on
> > spin locks.  I'm not convinced that explains the entire performance
> > irregularity I see for P4 spin mutexes on UP, however.  Note that 39 (P4
> > UP sleep mutex) + 120 (P4 UP critical section) falls well short of 274
> > (P4 UP spin mutex).  Figuring out what's going on there would be a
> > good idea, although it could well be a property of my measurement
> > environment.  I'm currently using this to do measurements:
> >
> >     //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
> >
> > In all of the below, the standard deviation is very small if you're
> > careful about not bumping into hard clock or other interrupts during
> > testing, especially when it comes to spin mutexes and critical sections.
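(For anyone without access to that depot path, the shape of such a measurement
is roughly the sketch below -- rdtsc() around a lock/unlock loop.  This is not
the actual test_synch_timing.c, and the names are invented for illustration.)

#include <sys/param.h>
#include <sys/systm.h>		/* printf() */
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/cpufunc.h>	/* rdtsc() */

#define	ITERATIONS	1000000

static struct mtx test_mtx;

static void
measure_sleep_mutex(void)
{
	uint64_t before, after;
	int i;

	mtx_init(&test_mtx, "test mtx", NULL, MTX_DEF);
	/*
	 * As noted above, hardclock and other interrupts skew individual
	 * samples, so in practice repeat this and take the minimum.
	 */
	before = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		mtx_lock(&test_mtx);
		mtx_unlock(&test_mtx);
	}
	after = rdtsc();
	printf("lock+unlock: %ju cycles\n",
	    (uintmax_t)((after - before) / ITERATIONS));
	mtx_destroy(&test_mtx);
}
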
> >
> > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > robert@fledge.watson.org      Principal Research Scientist, McAfee
> > Research
> >
> >         sleep mutex     crit section    spin mutex
> >         UP      SMP     UP      SMP     UP      SMP
> > PIII    21      90      83      81      112     141
> > P4      39      260     120     119     274     342
>
> Nice catch!
> On a UP system, releasing a spin mutex involves an xchgl operation, while
> releasing an uncontested sleep mutex uses cmpxchgl.
> Since xchgl does an implicit LOCK (and cmpxchgl does NOT), this could
> explain why the spin mutex needs a lot more cycles.
> This should be easy to fix since the xchgl is not needed on a UP system.
> Right now I am sick and don't trust my own code so I won't write a patch
> for the next few days ... hopefully someone else can get to it first.

I've tried changing store_rel() to just do a simple store, since writes are 
ordered on x86, but benchmarks on SMP showed that it actually hurt.  However, 
it would probably be good to at least do that for UP.  The current patch, 
which does it for all kernels, is:

--- //depot/vendor/freebsd/src/sys/i386/include/atomic.h	2004/03/12 21:50:47
+++ //depot/projects/smpng/sys/i386/include/atomic.h	2004/08/02 15:16:35
@@ -69,7 +69,7 @@
 
 int atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src);
 
-#define	ATOMIC_STORE_LOAD(TYPE, LOP, SOP)			\
+#define	ATOMIC_STORE_LOAD(TYPE, LOP)			\
 u_##TYPE	atomic_load_acq_##TYPE(volatile u_##TYPE *p);	\
 void		atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
 
@@ -175,12 +175,12 @@
 #if defined(I386_CPU)
 
 /*
- * We assume that a = b will do atomic loads and stores.
- *
- * XXX: This is _NOT_ safe on a P6 or higher because it does not guarantee
- * memory ordering.  These should only be used on a 386.
+ * We assume that a = b will do atomic loads and stores.  However, on a
+ * PentiumPro or higher, reads may pass writes, so for that case we have
+ * to use a serializing instruction (i.e. with LOCK) to do the load.  For
+ * the 386 case we can use a simple read since 386s don't support SMP.
  */
-#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)		\
+#define ATOMIC_STORE_LOAD(TYPE, LOP)			\
 static __inline u_##TYPE				\
 atomic_load_acq_##TYPE(volatile u_##TYPE *p)		\
 {							\
@@ -190,14 +190,14 @@
 static __inline void					\
 atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
 {							\
+	__asm __volatile("" : : : "memory");		\
 	*p = v;						\
-	__asm __volatile("" : : : "memory");		\
 }							\
 struct __hack
 
 #else /* !defined(I386_CPU) */
 
-#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)		\
+#define ATOMIC_STORE_LOAD(TYPE, LOP)			\
 static __inline u_##TYPE				\
 atomic_load_acq_##TYPE(volatile u_##TYPE *p)		\
 {							\
@@ -211,16 +211,11 @@
 	return (res);					\
 }							\
 							\
-/*							\
- * The XCHG instruction asserts LOCK automagically.	\
- */							\
 static __inline void					\
 atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
 {							\
-	__asm __volatile(SOP				\
-	: "+m" (*p),			/* 0 */		\
-	  "+r" (v)			/* 1 */		\
-	: : "memory");				 	\
+	__asm __volatile("" : : : "memory");		\
+	*p = v;						\
 }							\
 struct __hack
 
@@ -230,7 +225,7 @@
 
 extern int atomic_cmpset_int(volatile u_int *, u_int, u_int);
 
-#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)				\
+#define ATOMIC_STORE_LOAD(TYPE, LOP)					\
 extern u_##TYPE atomic_load_acq_##TYPE(volatile u_##TYPE *p);		\
 extern void atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
 
@@ -258,10 +253,10 @@
 ATOMIC_ASM(add,	     long,  "addl %1,%0",  "ir",  v);
 ATOMIC_ASM(subtract, long,  "subl %1,%0",  "ir",  v);
 
-ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1", "xchgb %b1,%0");
-ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "xchgw %w1,%0");
-ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1",  "xchgl %1,%0");
-ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1",  "xchgl %1,%0");
+ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1");
+ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1");
+ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1");
+ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1");
 
 #undef ATOMIC_ASM
 #undef ATOMIC_STORE_LOAD
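
For the "at least do that for UP" option, a rough and untested sketch (not
part of the patch above) would be to keep the xchgl only in SMP kernels;
expanded by hand for the int flavor it would look something like:

static __inline void
atomic_store_rel_int(volatile u_int *p, u_int v)
{
#ifdef SMP
	/* xchgl asserts LOCK implicitly and serves as the release barrier. */
	__asm __volatile("xchgl %1,%0"
	: "+m" (*p),			/* 0 */
	  "+r" (v)			/* 1 */
	: : "memory");
#else
	/* Writes are ordered on x86, so a compiler barrier suffices on UP. */
	__asm __volatile("" : : : "memory");
	*p = v;
#endif
}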

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org


