Date:      Mon, 08 Nov 2004 21:28:16 -0500
From:      Stephan Uphoff <ups@tree.com>
To:        John Baldwin <jhb@FreeBSD.org>
Cc:        Robert Watson <rwatson@FreeBSD.org>
Subject:   Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID:  <1099967296.24619.745.camel@palm.tree.com>
In-Reply-To: <200411081317.52153.jhb@FreeBSD.org>
References:  <Pine.NEB.3.96L.1041105102349.90766E-100000@fledge.watson.org> <1099779852.8097.68.camel@palm.tree.com> <200411081317.52153.jhb@FreeBSD.org>

On Mon, 2004-11-08 at 13:17, John Baldwin wrote: 
> On Saturday 06 November 2004 05:24 pm, Stephan Uphoff wrote:
> > On Fri, 2004-11-05 at 07:01, Robert Watson wrote:
> > > On Fri, 29 Oct 2004, Mike Silbersack wrote:
> > > > I think we really need some sort of light-weight critical_enter that
> > > > simply assures you that you won't get rescheduled to another CPU, but
> > > > gives no guarantees beyond that.
> > >
> > > <snip>
> > >
> > > > Er, wait - I guess I'm forgetting something, there exists the potential
> > > > for the interrupt that preempted whatever was calling arc4random to
> > > > also call arc4random, thereby breaking things...
> > >
> > > I've been looking at related issues for the last couple of days and must
> > > have missed this thread while at EuroBSDCon.  Alan Cox pointed me at it,
> > > so here I am. :-)
> > >
> > > Right now, the cost of acquiring and dropping an uncontended sleep
> > > mutex on a UP kernel is very low -- about 21 cycles on my PIII and 40 on
> > > my P4, including some efficiency problems in my measurement which
> > > probably add a non-trivial overhead.  Compare this with the SMP versions
> > > on the PIII (90 cycles) and P4 (260 cycles!).  Critical sections on the
> > > SMP PIII are about the same cost as the SMP mutex, but on the P4 a
> > > critical section is less than half the cost.  Getting to a model where
> > > critical sections were as cheap as UP sleep mutexes, or where we could
> > > use a similar combination of primitives (such as UP mutexes with pinning)
> > > would be very useful. Otherwise, optimizing through use of critical
> > > sections will improve SMP but potentially damage performance on UP. 
> > > There's been a fair amount of discussion of such approaches, including
> > > the implementation briefly present in FreeBSD.  I know John Baldwin
> > > and Justin Gibbs both have theories and plans in this area.
> > >
> > > If we do create a UP mutex primitive for use on SMP, I would suggest we
> > > actually expand the contents of the UP mutex structure slightly to
> > > include a cpu number that can be asserted, along with pinning, when an
> > > operation is attempted and INVARIANTS is present.  One of the great
> > > strengths of the mutex/lock model is a strong assertion capability, both
> > > for the purposes of documentation and testing, so we should make sure
> > > that carries into any new synchronization primitives.
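
A minimal sketch of what such a UP mutex with a CPU-number assertion could
look like; the struct, field, and macro names below are illustrative only,
not an existing FreeBSD API:

/*
 * Hypothetical UP-only mutex: besides the lock word it records the CPU
 * the lock was taken on, so INVARIANTS kernels can assert that the lock
 * is only used together with pinning to that CPU.
 */
struct up_mtx {
        volatile u_int  um_locked;      /* 0 = free, 1 = held */
#ifdef INVARIANTS
        u_int           um_cpu;         /* CPU that acquired the lock */
#endif
};

#ifdef INVARIANTS
#define UP_MTX_ASSERT_OWNED(m)                                          \
        KASSERT((m)->um_locked != 0 && (m)->um_cpu == PCPU_GET(cpuid),  \
            ("up_mtx %p not held on this CPU", (m)))
#else
#define UP_MTX_ASSERT_OWNED(m)
#endif
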
> > >
> > > Small table of synchronization primitives below; in each case, the count
> > > is in cycles and reflects the cost of acquiring and dropping the
> > > primitive (lock+unlock, enter+exit).  The P4 is a 3 GHz box, and the PIII
> > > is an 800 MHz box.  Note that the synchronization primitives requiring
> > > atomic operations are substantially pessimized on the P4 vs the PIII.
> > >
> > > A discussion with John Baldwin and Scott Long yesterday revealed that the
> > > UP spin mutex is currently pessimized from a critical section to a
> > > critical section plus mutex internals due to a need for mtx_owned() on
> > > spin locks.  I'm not convinced that explains the entire performance
> > > irregularity I see for P4 spin mutexes on UP, however.  Note that 39 (P4 UP
> > > sleep mutex) + 120 (P4 UP critical section) = 159, well short of 274 (P4 UP
> > > spin mutex).  Figuring out what's going on there would be a
> > > good idea, although it could well be a property of my measurement
> > > environment.  I'm currently using this to do measurements:
> > >
> > >     //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
> > >
> > > In all of the below, the standard deviation is very small if you're
> > > careful about not bumping into hard clock or other interrupts during
> > > testing, especially when it comes to spin mutexes and critical sections.
> > >
> > > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > > robert@fledge.watson.org      Principal Research Scientist, McAfee
> > > Research
> > >
> > >         sleep mutex     crit section    spin mutex
> > >         UP      SMP     UP      SMP     UP      SMP
> > > PIII    21      90      83      81      112     141
> > > P4      39      260     120     119     274     342
> >
> > Nice catch!
> > On a UP system, releasing a spin mutex involves an xchgl operation, while
> > releasing an uncontested sleep mutex uses cmpxchgl.
> > Since xchgl does an implicit LOCK (and cmpxchgl does NOT), this could
> > explain why the spin mutex needs a lot more cycles.
> > This should be easy to fix since the xchgl is not needed on a UP system.
> > Right now I am sick and don't trust my own code, so I won't write a patch
> > for the next few days ... hopefully someone else can get to it first.
> 
> I've tried changing the store_rel() to just do a simple store since writes are 
> ordered on x86, but benchmarks on SMP showed that it actually hurt. 

Strange - any idea why?
Can you recommend a specific benchmark for this?
I would like to try some variations.
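
For concreteness, a measurement like this usually boils down to a tight
rdtsc loop around the lock/unlock pair, roughly like the sketch below.
This is only a rough illustration of the shape of such a test, not
Robert's test_synch_timing.c; interrupts need to be kept out of the way
(or the outliers filtered) for stable numbers:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* Read the CPU's timestamp counter (EDX:EAX). */
static __inline uint64_t
tsc_read(void)
{
        uint32_t lo, hi;

        __asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (((uint64_t)hi << 32) | lo);
}

#define TEST_ITERATIONS 1000000

/* Average cycles for one mtx_lock/mtx_unlock pair on the given mutex. */
static uint64_t
measure_mtx_cycles(struct mtx *m)
{
        uint64_t start, end;
        int i;

        start = tsc_read();
        for (i = 0; i < TEST_ITERATIONS; i++) {
                mtx_lock(m);
                mtx_unlock(m);
        }
        end = tsc_read();
        return ((end - start) / TEST_ITERATIONS);
}
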


>  However, 
> it would probably be good to at least do that for UP.  The current patch to 
> do it for all kernels is:
> 
> --- //depot/vendor/freebsd/src/sys/i386/include/atomic.h	2004/03/12 21:50:47
> +++ //depot/projects/smpng/sys/i386/include/atomic.h	2004/08/02 15:16:35
> @@ -69,7 +69,7 @@
>  
>  int atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src);
>  
> -#define	ATOMIC_STORE_LOAD(TYPE, LOP, SOP)			\
> +#define	ATOMIC_STORE_LOAD(TYPE, LOP)			\
>  u_##TYPE	atomic_load_acq_##TYPE(volatile u_##TYPE *p);	\
>  void		atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
>  
> @@ -175,12 +175,12 @@
>  #if defined(I386_CPU)
>  
>  /*
> - * We assume that a = b will do atomic loads and stores.
> - *
> - * XXX: This is _NOT_ safe on a P6 or higher because it does not guarantee
> - * memory ordering.  These should only be used on a 386.
> + * We assume that a = b will do atomic loads and stores.  However, on a
> + * PentiumPro or higher, reads may pass writes, so for that case we have
> + * to use a serializing instruction (i.e. with LOCK) to do the load.  For
> + * the 386 case we can use a simple read since 386s don't support SMP.
>   */
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)		\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)			\
>  static __inline u_##TYPE				\
>  atomic_load_acq_##TYPE(volatile u_##TYPE *p)		\
>  {							\
> @@ -190,14 +190,14 @@
>  static __inline void					\
>  atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
>  {							\
> +	__asm __volatile("" : : : "memory");		\
>  	*p = v;						\
> -	__asm __volatile("" : : : "memory");		\
>  }							\
>  struct __hack
>  
>  #else /* !defined(I386_CPU) */
>  
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)		\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)			\
>  static __inline u_##TYPE				\
>  atomic_load_acq_##TYPE(volatile u_##TYPE *p)		\
>  {							\
> @@ -211,16 +211,11 @@
>  	return (res);					\
>  }							\
>  							\
> -/*							\
> - * The XCHG instruction asserts LOCK automagically.	\
> - */							\
>  static __inline void					\
>  atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
>  {							\
> -	__asm __volatile(SOP				\
> -	: "+m" (*p),			/* 0 */		\
> -	  "+r" (v)			/* 1 */		\
> -	: : "memory");				 	\
> +	__asm __volatile("" : : : "memory");		\
> +	*p = v;						\
>  }							\
>  struct __hack
>  
> @@ -230,7 +225,7 @@
>  
>  extern int atomic_cmpset_int(volatile u_int *, u_int, u_int);
>  
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP)				\
> +#define ATOMIC_STORE_LOAD(TYPE, LOP)					\
>  extern u_##TYPE atomic_load_acq_##TYPE(volatile u_##TYPE *p);		\
>  extern void atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
>  
> @@ -258,10 +253,10 @@
>  ATOMIC_ASM(add,	     long,  "addl %1,%0",  "ir",  v);
>  ATOMIC_ASM(subtract, long,  "subl %1,%0",  "ir",  v);
>  
> -ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1", "xchgb %b1,%0");
> -ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "xchgw %w1,%0");
> -ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1",  "xchgl %1,%0");
> -ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1",  "xchgl %1,%0");
> +ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1");
> +ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1");
> +ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1");
> +ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1");
>  
>  #undef ATOMIC_ASM
>  #undef ATOMIC_STORE_LOAD
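
For reference, for the "int" case the change above boils down to the
following, hand-expanded from the ATOMIC_STORE_LOAD macro and simplified
for illustration (the _old/_new suffixes are only there for side-by-side
comparison):

/*
 * Old behavior: the release store is an xchgl, which carries an implicit
 * LOCK prefix, so every release pays for a locked bus cycle even on UP.
 */
static __inline void
atomic_store_rel_int_old(volatile u_int *p, u_int v)
{
        __asm __volatile("xchgl %1,%0"
        : "+m" (*p),                    /* 0 */
          "+r" (v)                      /* 1 */
        : : "memory");
}

/*
 * New behavior: on x86, stores are not reordered with earlier stores, so
 * a compiler barrier followed by a plain store is enough for release
 * semantics.
 */
static __inline void
atomic_store_rel_int_new(volatile u_int *p, u_int v)
{
        __asm __volatile("" : : : "memory");    /* compiler barrier */
        *p = v;
}
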


