Date: Thu, 6 May 2004 20:18:58 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Gerrit Nagelhout
Cc: freebsd-current@freebsd.org, 'Andrew Gallatin'
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <20040506184749.R19447@gamplex.bde.org>

On Wed, 5 May 2004, Gerrit Nagelhout wrote:

> Andrew Gallatin wrote:
> > If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
> > then I think you should do it.  Of course, that still leaves mutexes
> > as very expensive on SMP (253 cycles on the 2.53GHz from above).

See my other reply [1 memory barrier but not 2 seems to be needed for
each lock/unlock pair in the !SMP case, and the xchgl accidentally (?)
provides it; perhaps [lms]fence would give a faster memory barrier].

More ideas on this:
- compilers should probably now generate memory barrier instructions for
  volatile variables (so volatile variables would be even slower :-).
  I haven't seen gcc on i386's do this.
- jhb once tried changing mtx_lock_spin(mtx)/mtx_unlock_spin(mtx) to
  critical_enter()/critical_exit().  This didn't work because it broke
  mtx_assert().  It might also not work because it removes the memory
  barrier.  critical_enter() only has the very weak memory barrier in
  disable_intr() on i386's.

> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?

I can't think of anything for the SMP case.  See above for the !SMP case.

> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)
>
> 2.8GHz Xeon
>                     UP        SMP
> write           904427     661312
> socket         1327692    1067743
> select          554131     434390
> gettimeofday   1734963     252479
>
> 1.3GHz PIII
>                     UP        SMP
> write           746705     532223
> socket         1179819     977448
> select          727811     556537
> gettimeofday   1849862     186387

That's why the Xeon is relatively slower under -current and SMP.
-current just does more locking and more of other things.

> The really interesting one is gettimeofday.  For both the Xeon & PIII,
> the UP is much better than SMP, but the UP for PIII is better than that
> of the Xeon.  I may try to get the results for 5.2.1 later.  I can
> forward the source code of this program to anyone else who wants to try
> it out.

gettimeofday() is slower for SMP because it uses a different timecounter.
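
A test of that kind is essentially just a timed loop around the syscall.
A minimal sketch (a guess at the shape of such a program, not necessarily
what Gerrit's code does; the iteration count is arbitrary) is:

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	struct timeval start, finish, tv;
	double elapsed;
	long i, n;

	n = 1000000;			/* arbitrary iteration count */
	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		gettimeofday(&tv, NULL);	/* the syscall being timed */
	gettimeofday(&finish, NULL);
	elapsed = (finish.tv_sec - start.tv_sec) +
	    (finish.tv_usec - start.tv_usec) / 1e6;
	printf("%.0f gettimeofday calls/s\n", n / elapsed);
	return (0);
}

Timing with gettimeofday() itself is fine here because the loop body is
the thing being measured; substituting write(), socket() or select()
calls would give the other rows.  Run under SMP it should show the slow
timecounter immediately.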
The SMP slowness is a hardware problem -- there is no good timecounter
available.  It looks like the TSC timecounter is being used for the UP
cases and either the i8254 or the ACPI-slow timecounter for the SMP
cases.

Reading the TSC takes about 10-12 cycles on most i386's (probably many
more on P4 ;-).  Syscall overhead adds a lot to this, but gettimeofday()
still takes much less than a microsecond.  The fastest I've seen recently
is 260 ns / 578 cycles for clock_gettime() on an AthlonXP.  OTOH, reading
the i8254 takes about 4000 ns, so clock_gettime() takes 4190 ns on the
same AthlonXP system that takes 260 ns with the TSC timecounter.  This
system also has a slow ACPI timer, so clock_gettime() takes 1397 ns with
the ACPI-fast timecounter and about 3 times as long with the ACPI-slow
timecounter.  Recently-fixed bugs made it often use the ACPI-slow
timecounter although the ACPI-fast timecounter always works.

Slow timecounters mainly affect workloads that do too many context
switches or timestamps on tinygrams.  Yours probably do, but mine don't.
I only notice them when I run microbenchmarks.  The simplest one that
shows them is "ping -fq localhost".  There are normally 7 timestamps per
packet (1 to put in the packet in userland, 2 for bookkeeping in
userland, 2 for pessimization of netisrs in the kernel and 2 for tripping
on our own Giant foot in the kernel).  RELENG_4 only has the userland
ones.  With reasonably fast CPUs (1GHz+ or so) and slow timecounters,
making even one of these timestamps takes longer than everything else.

Bruce