From owner-cvs-all@FreeBSD.ORG  Tue Nov  9 19:34:31 2004
Return-Path: <owner-cvs-all@FreeBSD.ORG>
Delivered-To: cvs-all@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 43BB216A4CE
	for <cvs-all@freebsd.org>; Tue,  9 Nov 2004 19:34:31 +0000 (GMT)
Received: from duchess.speedfactory.net (duchess.speedfactory.net
	[66.23.201.84])	by mx1.FreeBSD.org (Postfix) with SMTP id 8FEC043D5D
	for <cvs-all@freebsd.org>; Tue,  9 Nov 2004 19:34:30 +0000 (GMT)
	(envelope-from ups@tree.com)
Received: (qmail 28604 invoked by uid 89); 9 Nov 2004 19:34:25 -0000
Received: from duchess.speedfactory.net (66.23.201.84)
  by duchess.speedfactory.net with SMTP; 9 Nov 2004 19:34:25 -0000
Received: (qmail 28555 invoked by uid 89); 9 Nov 2004 19:34:24 -0000
Received: from unknown (HELO palm.tree.com) (66.23.216.49)
  by duchess.speedfactory.net with SMTP; 9 Nov 2004 19:34:24 -0000
Received: from [127.0.0.1] (localhost.tree.com [127.0.0.1])
	by palm.tree.com (8.12.10/8.12.10) with ESMTP id iA9JYN5R030523;
	Tue, 9 Nov 2004 14:34:23 -0500 (EST)
	(envelope-from ups@tree.com)
From: Stephan Uphoff <ups@tree.com>
To: Peter Wemm <peter@wemm.org>
In-Reply-To: <200411091057.54867.peter@wemm.org>
References: <Pine.NEB.3.96L.1041109103037.73102S-100000@fledge.watson.org>
	<1100024464.29384.30.camel@palm.tree.com>
	<200411091057.54867.peter@wemm.org>
Content-Type: text/plain
Message-Id: <1100028863.29384.111.camel@palm.tree.com>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.6 
Date: Tue, 09 Nov 2004 14:34:23 -0500
Content-Transfer-Encoding: 7bit
cc: src-committers@freebsd.org
cc: John Baldwin <jhb@freebsd.org>
cc: Alan Cox <alc@freebsd.org>
cc: cvs-src@freebsd.org
cc: Mike Silbersack <silby@silby.com>
cc: cvs-all@freebsd.org
cc: Robert Watson <rwatson@freebsd.org>
cc: Julian Elischer <julian@elischer.org>
Subject: Re: cvs commit: src/sys/i386/i386 pmap.c
X-BeenThere: cvs-all@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: CVS commit messages for the entire tree <cvs-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-all>,
	<mailto:cvs-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/cvs-all>
List-Post: <mailto:cvs-all@freebsd.org>
List-Help: <mailto:cvs-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-all>,
	<mailto:cvs-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 09 Nov 2004 19:34:31 -0000

On Tue, 2004-11-09 at 13:57, Peter Wemm wrote:
> On Tuesday 09 November 2004 10:21 am, Stephan Uphoff wrote:
> > On Tue, 2004-11-09 at 13:02, Julian Elischer wrote:
> > > Robert Watson wrote:
> > > >This change made a large difference, and eliminates the
> > > > unexplained costs. Here's a revised table as compared to the
> > > > above:
> > > >
> > > > >        sleep mutex   crit section   spin mutex   new spin mutex
> > > > >        UP     SMP    UP     SMP     UP    SMP    UP     SMP
> > > > > PIII   21     81     83     81      112   141    95     141
> > > > > P4     39     260    120    119     274   342    132    231
> > > >
> > > >So it basically cut 140 cycles off the P4 UP spin lock, 15 off the
> > > > PIII UP spin lock, and 110 cycles off the P4 SMP spin lock.  The
> > > > PIII SMP spin lock looks the same.  Keep in mind that all of
> > > > these measurements have a standard deviation of between 0 and 3
> > > > cycles, most in the 1 range.  Also keep in mind that these are
> > > > entirely uncontended measurements.
> > > >
> > > >Assuming that these changes are correct, and pass whatever tests
> > > > people have in mind, this would be a very strong merge candidate
> > > > for performance reasons.  The difference is visible in packet
> > > > send tests from user space as a percentage or two improvement on
> > > > > UP on my P4, although it's a little hard to tell due to the noise.
> > >
> > > Can you explain why a spin mutex is more expensive than a sleep
> > > mutex (I assume this is uncontested)?
> >
> > cli() and sti() used for the critical section are expensive.
> 
> ... on INTEL cpus!  Don't make the mistake of assuming that all x86 cpus 
> are as slow as Intel's P4 family on this stuff.   Other cpus don't have 
> the same massive microcode penalty.  My recollection is that athlon 
> (and athlon64 cpus in 32 bit mode) take about 8-12 clocks to do a cli 
> or sti, compared to 300+ for a P4 cpu.  And things like 50-90 clocks 
> for an invlpg vs 1200-1600 clocks for a P4.
> 
> Please don't accidentally penalize those of us with cpus that were 
> designed for good all-round performance.  The P4 family was designed 
> for games and 3d graphics, not all-round performance.
> 
> (This isn't aimed at anybody in particular..  I just wanted to remind 
> people that the P4 code is a particularly pathological case (and the 
> writing is on the wall for that core).  Other cpus, including Intel's 
> newer non-P4 cores, don't have the same pathological problems.)

Good points.
This seems to lead to the same choices as in my last email:
non-optimal code, lots of compile options, or self-modifying code.

Is there any reason not to use self-modifying code here, as Linux does
for memory barriers? (Andi Kleen, "[PATCH] Runtime memory barrier
patching", http://lkml.org/lkml/2003/4/21/168)

Maybe this would even allow shipping SMP-capable kernels by default
again.

	Stephan