From: Gerrit Nagelhout <gnagelhout@sandvine.com>
To: freebsd-current@freebsd.org
Date: Wed, 5 May 2004 12:32:42 -0400
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance

Bruce Evans wrote:
>> On a single hyperthreaded Xeon 2.8GHz, it took ~30 cycles (per
>> LOCK&UNLOCK, and dividing by 100) under UP, and ~300 cycles for SMP.
>> Assuming 10 locks for every packet (which is conservative), at
>> 500Kpps, this accounts for:
>> 300 * 10 * 500000 = 1.5 billion cycles (out of 2.8 billion cycles)
>
> 300 cycles seems far too much.  I get the following times for slightly
> simpler locking in userland:
>
> %%%
> #define _KERNEL
> #include <machine/atomic.h>
> ...
>
> int slock;
> ...
> 	for (i = 0; i < 1000000; i++) {
> 		while (atomic_cmpset_acq_int(&slock, 0, 1) == 0)
> 			;
> 		atomic_store_rel_int(&slock, 0);
> 	}
> %%%
>
> Athlon XP2600 UP system:  !SMP case: 22 cycles  SMP case: 37 cycles
> Celeron 366 SMP system:              35 cycles            48 cycles
>
> The extra cycles for the SMP case are just the extra cost of one lock
> instruction.  Note that SMP should cost twice as much extra, but the
> non-SMP atomic_store_rel_int(&slock, 0) is pessimized by using xchgl,
> which always locks the bus.  After fixing this:
>
> Athlon XP2600 UP system:  !SMP case:  6 cycles  SMP case: 37 cycles
> Celeron 366 SMP system:              10 cycles            48 cycles
>
> Mutexes take longer than simple locks, but not much longer unless the
> lock is contested.  In particular, they don't lock the bus any more,
> and the extra cycles for locking dominate (even in the !SMP case, due
> to the pessimization).
>
> So there seems to be something wrong with your benchmark.  Locking the
> bus for the SMP case always costs about 20+ cycles, but this hasn't
> changed since RELENG_4, and mutexes can't be made much faster in the
> uncontested case since their overhead is dominated by the bus lock
> time.
>
> -current is slower than RELENG_4, especially for networking, because
> it does lots more locking and may contest locks more, and when it hits
> a lock and for some other operations it does slow context switches.
> Your profile didn't seem to show much of the latter 2, so the problem
> for bridging may be that there is just too much fine-grained locking.
>
> The profile didn't seem quite right.  I was missing all the call
> counts and times.  The times are not useful for short runs unless
> high-resolution profiling is used, but the call counts are.  Profiling
> has been broken in -current since last November, so some garbage needs
> to be ignored to interpret profiles.
>
> Bruce

I wonder if the lock instruction is simply much more expensive on the
Xeon architecture.
I ran a program very similar to yours, with and without the "lock"
instruction:

static inline int
_osiCondSet32Locked(volatile unsigned *ptr, unsigned old, unsigned replace)
{
	int ok;

	__asm __volatile(
		"mov %2, %%eax;"
		"movl $1, %0;"		/* ok = 1 */
		"lock;"
		"cmpxchgl %3, %1;"	/* if (%eax == *ptr) *ptr = replace */
		"jz 0f;"		/* jump if exchanged */
		"movl $0, %0;"		/* ok = 0 */
		"0:"
		: "=&mr" (ok), "+m" (*ptr)
		: "mr" (old), "r" (replace)
		: "eax", "memory");
	return ok;
}

unsigned int value;
...
	for (i = 0; i < iterations; i++) {
		_osiCondSet32Locked(&value, 0, 1);
	}

and got the following results:

PIII (550MHz) w/o lock:   8 cycles
PIII (550MHz) w/  lock:  26 cycles
Xeon (2.8GHz) w/o lock:  12 cycles
Xeon (2.8GHz) w/  lock: 132 cycles

This means that on the Xeon, each lock instruction takes 120 cycles!
This is close to the 300 I mentioned before (assuming that both EM_LOCK
and EM_UNLOCK use the lock instruction).  I have tried reading through
the Intel optimization guide for hints on making this better, but I
haven't been able to find anything useful (so far).  This would
certainly explain why running 5.2.1 under SMP is performing so poorly
for me.  If anyone is interested in running this test, I can forward
the source code for this program.

The profiling I did was missing the call counts because I didn't
compile mcount into the key modules (bridge, if_em, etc.); it slowed
things down too much, so I relied on just the stats from the interrupt.
I think I ran it long enough to get reasonable results out of it,
though.  I have used this kind of profiling extensively on 4.7 in order
to optimize this application.

Thanks,

Gerrit
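
P.S. For anyone who wants to try this before I forward the real source,
something along the following lines should be enough to reproduce the
measurement.  This is only a rough sketch, not the program that produced
the numbers above: the rdtsc() wrapper, the main() scaffolding and the
iteration count are placeholders I'm making up here, and
_osiCondSet32Locked() is the routine shown earlier in this mail.

#include <stdio.h>
#include <stdint.h>

/* Paste _osiCondSet32Locked() from above here. */

static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	/* Read the CPU timestamp counter. */
	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int
main(void)
{
	unsigned value = 0;
	uint64_t start, end;
	unsigned i, iterations = 1000000;	/* placeholder count */

	start = rdtsc();
	for (i = 0; i < iterations; i++) {
		/*
		 * Only the first call actually exchanges; the rest take
		 * the failure path, but every call still executes the
		 * locked cmpxchgl, which is what is being timed.
		 */
		(void)_osiCondSet32Locked(&value, 0, 1);
	}
	end = rdtsc();

	printf("%llu cycles per iteration\n",
	    (unsigned long long)((end - start) / iterations));
	return (0);
}

Dividing the TSC delta by the iteration count gives the per-iteration
cost; running it once with and once without the "lock" prefix and
subtracting the two numbers isolates the cost of the lock itself, which
is how I arrived at the ~120 cycles above.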