Date:      Wed, 05 May 2004 19:54:53 -0600
From:      Scott Long <scottl@freebsd.org>
To:        Gerrit Nagelhout <gnagelhout@sandvine.com>
Cc:        'Andrew Gallatin' <gallatin@cs.duke.edu>
Subject:   Re: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID:  <40999AED.9080606@freebsd.org>
In-Reply-To: <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>
References:  <FE045D4D9F7AED4CBFF1B3B813C85337021AB38C@mail.sandvine.com>

Gerrit Nagelhout wrote:
> Andrew Gallatin wrote:
> 
>>Bruce Evans writes:
>>
>> > 
>> > Athlon XP2600 UP system:  !SMP case: 22 cycles   SMP case: 37 cycles
>> > Celeron 366 SMP system:              35                    48
>> > 
>> > The extra cycles for the SMP case are just the extra cost of one lock
>> > instruction.  Note that SMP should cost twice as much extra, but the
>> > non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl
>> > which always locks the bus.  After fixing this:
>> > 
>> > Athlon XP2600 UP system:  !SMP case:  6 cycles   SMP case: 37 cycles
>> > Celeron 366 SMP system:              10                    48
>> > 
>> > Mutexes take longer than simple locks, but not much longer unless the
>> > lock is contested.  In particular, they don't lock the bus any more
>> > and the extra cycles for locking dominate (even in the !SMP case due
>> > to the pessimization).
>> > 
>> > So there seems to be something wrong with your benchmark.  Locking the
>> > bus for the SMP case always costs about 20+ cycles, but this hasn't
>> > changed since RELENG_4 and mutexes can't be made much faster in the
>> > uncontested case since their overhead is dominated by the bus lock
>> > time.
>> > 
>>
>>Actually, I think his tests are accurate and bus locked instructions
>>take an eternity on P4.  See
>>http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html 
>>
>>For example, with your test above, I see 212 cycles for the UP case on
>>a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with a
>>simple slock = 0; reduces that count to 18 cycles.
>>
>>If it's really safe to remove the xchg* from non-SMP atomic_store_rel*,
>>then I think you should do it.  Of course, that still leaves mutexes
>>as very expensive on SMP (253 cycles on the 2.53GHz from above).
>>
>>Drew
>>
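For anyone who wants to reproduce the comparison in userland, here is a
minimal sketch using C11 atomics rather than the kernel's
atomic_store_rel_int().  It assumes x86, where a release store normally
compiles to a plain mov and an atomic exchange to an implicitly locked
xchg.  The function names, the BENCH_MAIN macro and the iteration count
are made up for illustration, and it reports nanoseconds from
clock_gettime() rather than cycles, so the numbers will not line up
exactly with the ones quoted above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Uncontested acquire (CAS), then release with a plain release store. */
uint64_t bench_store_release(atomic_int *lock, unsigned long iters)
{
    uint64_t start = now_ns();

    for (unsigned long i = 0; i < iters; i++) {
        int expected = 0;
        atomic_compare_exchange_strong_explicit(lock, &expected, 1,
            memory_order_acquire, memory_order_relaxed);
        atomic_store_explicit(lock, 0, memory_order_release);
    }
    uint64_t elapsed = now_ns() - start;
    assert(atomic_load(lock) == 0);     /* lock must end up free */
    return elapsed;
}

/* Same acquire, but release with an exchange (a second locked op on x86). */
uint64_t bench_xchg_release(atomic_int *lock, unsigned long iters)
{
    uint64_t start = now_ns();

    for (unsigned long i = 0; i < iters; i++) {
        int expected = 0;
        atomic_compare_exchange_strong_explicit(lock, &expected, 1,
            memory_order_acquire, memory_order_relaxed);
        (void)atomic_exchange_explicit(lock, 0, memory_order_seq_cst);
    }
    uint64_t elapsed = now_ns() - start;
    assert(atomic_load(lock) == 0);
    return elapsed;
}

#ifdef BENCH_MAIN   /* hypothetical switch: build with -DBENCH_MAIN */
int main(void)
{
    atomic_int lock = 0;
    const unsigned long iters = 10000000UL;

    uint64_t plain = bench_store_release(&lock, iters);
    uint64_t xchg = bench_xchg_release(&lock, iters);
    printf("release store: %.2f ns/iter\n", (double)plain / iters);
    printf("xchg release:  %.2f ns/iter\n", (double)xchg / iters);
    return 0;
}
#endif
```

Building with something like "cc -O2 -DBENCH_MAIN" gives a standalone
program.  Multiplying ns/iter by the clock rate in GHz gives an
approximate cycle count per lock/unlock pair.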
> 
> 
> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?
> This might also explain why we are seeing much worse system call 
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)

Int 0x80 system calls are known to be extremely expensive on a P4.  I
think that Jeff Roberson measured them as taking 300 cycles on average.
Some work was done on implementing the alternate sysenter/sysexit
method, but I don't think it was ever finished.  I think that it was
shown to have a modest speed improvement, but there was still a lot of
overhead that made it slow on a P4.  There are other possible
optimizations, such as a shared page that lets userland avoid entering
the kernel for calls like getpid and gettimeofday, but that opens some
security risks that have to be dealt with.
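
A rough way to get a calls/s figure for a cheap system call is to time a
tight getpid() loop.  This is only a sketch: the function name, the
BENCH_MAIN macro and the iteration count are made up, and it assumes
that getpid() in your libc really enters the kernel on every call (some
older libc versions cached the result, which would make the number
meaningless).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Time `iters` back-to-back getpid() calls and return calls per second. */
double getpid_calls_per_sec(unsigned long iters)
{
    struct timespec t0, t1;
    volatile pid_t sink = 0;    /* keeps the calls from being optimized out */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < iters; i++)
        sink = getpid();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
        (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)iters / secs : 0.0;
}

#ifdef BENCH_MAIN   /* hypothetical switch: build with -DBENCH_MAIN */
int main(void)
{
    printf("getpid: %.0f calls/s\n", getpid_calls_per_sec(5000000UL));
    return 0;
}
#endif
```

Dividing the CPU clock rate in Hz by the calls/s result gives an
approximate cycles-per-call figure, which includes the libc wrapper and
loop overhead as well as the kernel entry and exit.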

Scott


