From: Gerrit Nagelhout <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org
Date: Wed, 5 May 2004 12:32:42 -0400
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance

Bruce Evans wrote:
>> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles (per
>> LOCK&UNLOCK, and dividing by 100) under UP, and ~300 cycles for SMP.
>> Assuming 10 locks for every packet (which is conservative), at
>> 500Kpps, this accounts for:
>> 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8 billion cycles)
>
> 300 cycles seems far too much.  I get the following times for slightly
> simpler locking in userland:
>
> %%%
> #define _KERNEL
> #include <machine/atomic.h>
> ...
>
> int slock;
> ...
> 	for (i = 0; i < 1000000; i++) {
> 		while (atomic_cmpset_acq_int(&slock, 0, 1) == 0)
> 			;
> 		atomic_store_rel_int(&slock, 0);
> 	}
> %%%
>
> Athlon XP2600 UP system:  !SMP case: 22 cycles  SMP case: 37 cycles
> Celeron 366 SMP system:              35 cycles            48 cycles
>
> The extra cycles for the SMP case are just the extra cost of one lock
> instruction.  Note that SMP should cost twice as much extra, but the
> non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl,
> which always locks the bus.  After fixing this:
>
> Athlon XP2600 UP system:  !SMP case:  6 cycles  SMP case: 37 cycles
> Celeron 366 SMP system:              10 cycles            48 cycles
>
> Mutexes take longer than simple locks, but not much longer unless the
> lock is contested.  In particular, they don't lock the bus any more,
> and the extra cycles for locking dominate (even in the !SMP case, due
> to the pessimization).
>
> So there seems to be something wrong with your benchmark.  Locking the
> bus for the SMP case always costs about 20+ cycles, but this hasn't
> changed since RELENG_4, and mutexes can't be made much faster in the
> uncontested case since their overhead is dominated by the bus lock
> time.
>
> -current is slower than RELENG_4, especially for networking, because
> it does lots more locking and may contest locks more, and when it hits
> a lock and for some other operations it does slow context switches.
> Your profile didn't seem to show much of the latter 2, so the problem
> for bridging may be that there is just too much fine-grained locking.
>
> The profile didn't seem quite right.  I was missing all the call
> counts and times.  The times are not useful for short runs unless
> high-resolution profiling is used, but the call counts are.  Profiling
> has been broken in -current since last November, so some garbage needs
> to be ignored to interpret profiles.
>
> Bruce

I wonder if the lock instruction is simply much more expensive on the
Xeon architecture.
I ran a program very similar to yours, with and without the "lock"
instruction:

static inline int
_osiCondSet32Locked(volatile unsigned *ptr, unsigned old, unsigned replace)
{
	int ok;

	__asm __volatile(
		"mov %2, %%eax;"
		"movl $1, %0;"		/* ok = 1 */
		"lock;"
		"cmpxchgl %3, %1;"	/* if (%eax == *ptr) *ptr = replace */
		"jz 0f;"		/* jump if exchanged */
		"movl $0, %0;"		/* ok = 0 */
		"0:"
		: "=&mr" (ok), "+m" (*ptr)
		: "mr" (old), "r" (replace)
		: "eax", "memory");
	return ok;
}

unsigned int value;
...
	for (i = 0; i < iterations; i++) {
		_osiCondSet32Locked(&value, 0, 1);
	}

and got the following results:

PIII (550MHz) w/o lock:   8 cycles
PIII (550MHz) w/  lock:  26 cycles
Xeon (2.8GHz) w/o lock:  12 cycles
Xeon (2.8GHz) w/  lock: 132 cycles

This means that on the Xeon, each lock instruction takes 120 cycles!
This is close to the 300 I mentioned before (assuming that both EM_LOCK
and EM_UNLOCK use the lock instruction).  I have tried reading through
the Intel optimization guide for hints on making this better, but I
haven't been able to find anything useful (so far).  This would
certainly explain why running 5.2.1 under SMP is performing so poorly
for me.  If anyone is interested in running this test, I can forward
the source code for this program.

The profiling I did was missing the call counts because I didn't
compile mcount into the key modules (bridge, if_em, etc.); it slowed
things down too much, so I relied on just the stats from the interrupt.
I think I ran it long enough to get reasonable results out of it,
though.  I have used this kind of profiling extensively on 4.7 in order
to optimize this application.

Thanks,

Gerrit
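
P.S. For anyone who wants to try this before I forward the real source,
something along the following lines should be enough to reproduce the
measurement.  This is only a rough sketch, not the program that produced
the numbers above: the rdtsc() wrapper, the main() scaffolding and the
iteration count are placeholders I'm making up here, and
_osiCondSet32Locked() is the routine shown earlier in this mail.

#include <stdio.h>
#include <stdint.h>

/* Paste _osiCondSet32Locked() from above here. */

static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	/* Read the CPU timestamp counter. */
	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int
main(void)
{
	unsigned value = 0;
	uint64_t start, end;
	unsigned i, iterations = 1000000;	/* placeholder count */

	start = rdtsc();
	for (i = 0; i < iterations; i++) {
		/*
		 * Only the first call actually exchanges; the rest take
		 * the failure path, but every call still executes the
		 * locked cmpxchgl, which is what is being timed.
		 */
		(void)_osiCondSet32Locked(&value, 0, 1);
	}
	end = rdtsc();

	printf("%llu cycles per iteration\n",
	    (unsigned long long)((end - start) / iterations));
	return (0);
}

Dividing the TSC delta by the iteration count gives the per-iteration
cost; running it once with and once without the "lock" prefix and
subtracting the two numbers isolates the cost of the lock itself, which
is how I arrived at the ~120 cycles above.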