From: Gerrit Nagelhout
To: 'Andrew Gallatin', Bruce Evans, freebsd-current@freebsd.org
Date: Wed, 5 May 2004 21:16:35 -0400
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance

Andrew Gallatin wrote:
> Bruce Evans writes:
> >
> > Athlon XP2600 UP system:  !SMP case: 22 cycles  SMP case: 37 cycles
> > Celeron 366 SMP system:               35                  48
> >
> > The extra cycles for the SMP case are just the extra cost of one
> > lock instruction.  Note that SMP should cost twice as much extra,
> > but the non-SMP atomic_store_rel_int(&slock, 0) is pessimized by
> > using xchgl, which always locks the bus.  After fixing this:
> >
> > Athlon XP2600 UP system:  !SMP case:  6 cycles  SMP case: 37 cycles
> > Celeron 366 SMP system:               10                  48
> >
> > Mutexes take longer than simple locks, but not much longer unless
> > the lock is contested.  In particular, they don't lock the bus any
> > more, and the extra cycles for locking dominate (even in the !SMP
> > case due to the pessimization).
> >
> > So there seems to be something wrong with your benchmark.  Locking
> > the bus for the SMP case always costs about 20+ cycles, but this
> > hasn't changed since RELENG_4, and mutexes can't be made much
> > faster in the uncontested case since their overhead is dominated
> > by the bus lock time.
>
> Actually, I think his tests are accurate and bus-locked instructions
> take an eternity on the P4.  See
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>
> For example, with your test above, I see 212 cycles for the UP case
> on a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with
> a simple slock = 0; reduces that count to 18 cycles.
>
> If it's really safe to remove the xchg* from non-SMP
> atomic_store_rel*, then I think you should do it.  Of course, that
> still leaves mutexes as very expensive on SMP (253 cycles on the
> 2.53GHz from above).
>
> Drew

I wonder if there is anything that can be done to make the locking
more efficient for the Xeon.  Are there any other locking types that
could be used instead?  This might also explain why we are seeing much
worse system call performance under 4.7 in SMP versus UP.  Here is a
table of results for some system call tests I ran (the numbers are
calls/s; a rough sketch of the timing loop follows the tables):

2.8GHz Xeon            UP        SMP
write              904427     661312
socket            1327692    1067743
select             554131     434390
gettimeofday      1734963     252479

1.3GHz PIII            UP        SMP
write              746705     532223
socket            1179819     977448
select             727811     556537
gettimeofday      1849862     186387
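In case the method matters, the timing loop is essentially the
following (only a sketch of the approach, not the actual test program;
the real one runs the same kind of loop for write, socket and select
as well, and NCALLS is just an illustrative value):

#include <stdio.h>
#include <sys/time.h>

#define NCALLS  1000000         /* syscalls per measurement (illustrative) */

int
main(void)
{
        struct timeval start, end, tv;
        double secs;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < NCALLS; i++)
                gettimeofday(&tv, NULL);        /* syscall under test */
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) +
            (end.tv_usec - start.tv_usec) / 1e6;
        printf("gettimeofday: %.0f calls/s\n", NCALLS / secs);
        return (0);
}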
The really interesting one is gettimeofday.  For both the Xeon and the
PIII, UP is much better than SMP, and the PIII's UP number is actually
better than the Xeon's.  I may try to get the results for 5.2.1 later.
I can forward the source code of this program to anyone else who wants
to try it out.

Thanks,
Gerrit
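P.S.  To make sure I understand the locking point above: as I read
Bruce's and Drew's comments, the difference is roughly the following
(only a sketch, not the actual atomic.h code; the function names are
made up):

#include <sys/types.h>
#include <machine/atomic.h>

static volatile u_int slock;

/*
 * Release as done today: atomic_store_rel_int() is implemented with
 * xchgl, which always asserts the bus lock -- the ~200-cycle hit Drew
 * measured on the P4.
 */
static __inline void
release_current(void)
{
        atomic_store_rel_int(&slock, 0);
}

/*
 * The suggested !SMP release: a plain store, which should be enough on
 * a uniprocessor and avoids the bus-locked instruction entirely.
 */
static __inline void
release_up_only(void)
{
        slock = 0;
}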