Date:      Thu, 6 May 2004 20:18:58 +1000 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Gerrit Nagelhout <gnagelhout@sandvine.com>
Cc:        'Andrew Gallatin' <gallatin@cs.duke.edu>
Subject:   RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID:  <20040506184749.R19447@gamplex.bde.org>
In-Reply-To: <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>
References:  <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>

On Wed, 5 May 2004, Gerrit Nagelhout wrote:

> Andrew Gallatin wrote:
> > If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
> > then I think you should do it.  Of course, that still leaves mutexes
> > as very expensive on SMP (253 cycles on the 2.53GHz from above).

See my other reply [1 memory barrier but not 2 seems to be needed for
each lock/unlock pair in the !SMP case, and the xchgl accidentally (?)
provides it; perhaps [lms]fence would give a faster memory barrier].
More ideas on this:
- compilers should probably now generate memory barrier instructions for
  volatile variables (so volatile variables would be even slower :-).  I
  haven't seen gcc on i386's do this.
- jhb once tried changing mtx_lock_spin(mtx)/mtx_unlock_spin(mtx) to
  critical_enter()/critical_exit().  This didn't work because it broke
  mtx_assert().  It might also not work because it removes the memory
  barrier.  critical_enter() only has the very weak memory barrier in
  disable_intr() on i386's.

> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?

I can't think of anything for the SMP case.  See above for the !SMP case.

> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)
>
> 2.8GHz Xeon
>                       UP        SMP
> write             904427     661312
> socket           1327692    1067743
> select            554131     434390
> gettimeofday     1734963     252479
>
> 1.3GHz PIII
>                       UP        SMP
> write             746705     532223
> socket           1179819     977448
> select            727811     556537
> gettimeofday     1849862     186387

That's why the Xeon is relatively slower under -current and SMP.  -current
just does more locking and more of other things.

> The really interesting one is gettimeofday.  For both the Xeon & PIII,
> the UP is much better than SMP, but the UP for PIII is better than that
> of the Xeon.  I may try to get the results for 5.2.1 later.  I can
> forward the source code of this program to anyone else who wants to try
> it out.

gettimeofday() is slower for SMP because it uses a different timecounter.
This is a hardware problem -- there is no good timecounter available.
It looks like the TSC timecounter is being used for the UP cases and
either the i8254 or the ACPI-slow timecounter for the SMP cases.
Reading the TSC takes about 10-12 cycles on most i386's (probably
many more on P4 ;-).  Syscall overhead adds a lot to this, but
gettimeofday() still takes much less than a microsecond.  The fastest
I've seen recently is 260ns/578 cycles for clock_gettime() on an
AthlonXP.  OTOH, reading the i8254 takes about 4000ns, so clock_gettime()
takes 4190ns on the same AthlonXP system that takes 260ns with the TSC
timecounter.  This system also has a slow ACPI timer, so clock_gettime()
takes 1397ns with the ACPI-fast timecounter and about 3 times as long
with the ACPI-slow timecounter.  Recently-fixed bugs often made it use
the ACPI-slow timecounter even though the ACPI-fast timecounter always
works.

Slow timecounters mainly affect workloads that do too many context
switches or timestamps on tinygrams.  That's probably true for your
workloads but not mine.
I only notice them when I run microbenchmarks.  The simplest one that
shows them is "ping -fq localhost".  There are normally 7 timestamps
per packet (1 to put in the packet in userland, 2 for bookkeeping in
userland, 2 for pessimization of netisrs in the kernel and 2 for
tripping on our own Giant foot in the kernel).  RELENG_4 only has the
userland ones.  With reasonably fast CPUs (1GHz+ or so) and slow
timecounters, making even one of these timestamps takes longer than
everything else.

Bruce


