Date:      Tue, 09 Nov 2004 11:00:50 -0800
From:      Julian Elischer <julian@elischer.org>
To:        Peter Wemm <peter@wemm.org>
Cc:        Robert Watson <rwatson@freebsd.org>
Subject:   Re: cvs commit: src/sys/i386/i386 pmap.c
Message-ID:  <419113E2.6050609@elischer.org>
In-Reply-To: <200411091057.54867.peter@wemm.org>
References:  <Pine.NEB.3.96L.1041109103037.73102S-100000@fledge.watson.org> <4191062A.6090009@elischer.org> <1100024464.29384.30.camel@palm.tree.com> <200411091057.54867.peter@wemm.org>


Peter Wemm wrote:

>On Tuesday 09 November 2004 10:21 am, Stephan Uphoff wrote:
>  
>
>>On Tue, 2004-11-09 at 13:02, Julian Elischer wrote:
>>    
>>
>>>Robert Watson wrote:
>>>      
>>>
>>>>This change made a large difference, and eliminates the
>>>>unexplained costs. Here's a revised table as compared to the
>>>>above:
>>>>
>>>>        sleep mutex   crit section  spin mutex    new spin mutex
>>>>        UP    SMP     UP    SMP     UP    SMP     UP    SMP
>>>>PIII    21    81      83    81      112   141     95    141
>>>>P4      39    260     120   119     274   342     132   231
>>>>
>>>>So it basically cut 140 cycles off the P4 UP spin lock, 15 off the
>>>>PIII UP spin lock, and 110 cycles off the P4 SMP spin lock.  The
>>>>PIII SMP spin lock looks the same.  Keep in mind that all of
>>>>these measurements have a standard deviation of between 0 and 3
>>>>cycles, most in the 1 range.  Also keep in mind that these are
>>>>entirely uncontended measurements.
>>>>
>>>>Assuming that these changes are correct, and pass whatever tests
>>>>people have in mind, this would be a very strong merge candidate
>>>>for performance reasons.  The difference is visible in packet
>>>>send tests from user space as a percentage or two improvement on
>>>>UP on my P4, although it's a little hard to tell due to the noise.
>>>>        
>>>>
>>>Can you explain why a spin mutex is more expensive than a sleep
>>>mutex (I assume this is uncontended)?
>>>      
>>>
>>cli() and sti() used for the critical section are expensive.
>>    
>>
>
>... on INTEL cpus!  Don't make the mistake of assuming that all x86 cpus 
>are as slow as Intel's P4 family on this stuff.   Other cpus don't have 
>the same massive microcode penalty.  My recollection is that athlon 
>(and athlon64 cpus in 32 bit mode) take about 8-12 clocks to do a cli 
>or sti, compared to 300+ for a P4 cpu.  And things like 50-90 clocks 
>for an invlpg vs 1200-1600 clocks for a P4.
>
>Please don't accidentally penalize those of us with cpus that were 
>designed for good all-round performance.  The P4 family was designed 
>for games and 3d graphics, not all-round performance.
>
>(This isn't aimed at anybody in particular..  I just wanted to remind 
>people that the P4 code is a particularly pathological case (and the 
>writing is on the wall for that core).  Other cpus, including Intel's 
>newer non-P4 cores, don't have the same pathological problems.)
>

ok

Maybe Robert can send you his benchmarks to run on your hardware.



