Date:      Wed, 5 May 2004 17:23:30 -0400 (EDT)
From:      Andrew Gallatin <gallatin@cs.duke.edu>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        Gerrit Nagelhout <gnagelhout@sandvine.com>
Subject:   RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID:  <16537.23378.375946.857908@grasshopper.cs.duke.edu>
In-Reply-To: <20040505222636.H15444@gamplex.bde.org>
References:  <FE045D4D9F7AED4CBFF1B3B813C85337021AB377@mail.sandvine.com> <20040505222636.H15444@gamplex.bde.org>

Bruce Evans writes:

 > 
 > Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
 > Celeron 366 SMP system:              35                    48
 > 
 > The extra cycles for the SMP case are just the extra cost of one lock
 > instruction.  Note that SMP should cost twice as much extra, but the
 > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
 > which always locks the bus.  After fixing this:
 > 
 > Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
 > Celeron 366 SMP system:              10                    48
 > 
 > Mutexes take longer than simple locks, but not much longer unless the
 > lock is contested.  In particular, they don't lock the bus any more
 > and the extra cycles for locking dominate (even in the !SMP case due
 > to the pessimization).
 > 
 > So there seems to be something wrong with your benchmark.  Locking the
 > bus for the SMP case always costs about 20+ cycles, but this hasn't
 > changed since RELENG_4 and mutexes can't be made much faster in the
 > uncontested case since their overhead is dominated by the bus lock
 > time.
 > 

Actually, I think his tests are accurate and bus-locked instructions
take an eternity on P4.  See
http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html

For example, with your test above, I see 212 cycles for the UP case on
a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
simple slock = 0; reduces that count to 18 cycles.
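
For anyone who wants to repeat the comparison outside the kernel, here
is a rough userland sketch.  It is not the test quoted above and not
FreeBSD code: it uses C11 <stdatomic.h> in place of atomic(9), the names
slock, release_xchg, release_store and time_release are made up for
illustration, and the "acquire" is just a relaxed store.  It assumes
that x86 compilers typically lower atomic_exchange to xchg (implicitly
locked) and a release store to a plain mov, and it reports nanoseconds
from clock_gettime rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the lock word in the quoted test. */
static atomic_int slock;

/* Release with an exchange: on x86 this is usually an xchg, which
 * locks the bus whether or not the machine is SMP. */
static void release_xchg(atomic_int *l)
{
    (void)atomic_exchange_explicit(l, 0, memory_order_release);
}

/* Release with a store: on x86 this is usually a plain mov. */
static void release_store(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per set/release pair.  The function-pointer call
 * adds the same overhead to both variants, so only the difference
 * between them is meaningful. */
static double time_release(void (*release)(atomic_int *), long iters)
{
    uint64_t start;
    long i;

    assert(iters > 0);
    start = now_ns();
    for (i = 0; i < iters; i++) {
        /* Pretend acquire: mark the lock as held. */
        atomic_store_explicit(&slock, 1, memory_order_relaxed);
        release(&slock);
    }
    return (double)(now_ns() - start) / (double)iters;
}
```

Calling time_release(release_xchg, n) and time_release(release_store, n)
from a small driver with a large n and comparing the two averages should
show the gap; multiply by the clock rate in GHz to get approximate cycles.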

If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
then I think you should do it.  Of course, that still leaves mutexes
as very expensive on SMP (253 cycles on the 2.53GHz P4 from above).

Drew


