Date: Thu, 6 May 2004 20:18:58 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Gerrit Nagelhout
Cc: freebsd-current@freebsd.org, 'Andrew Gallatin'
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <20040506184749.R19447@gamplex.bde.org>

On Wed, 5 May 2004, Gerrit Nagelhout wrote:

> Andrew Gallatin wrote:
> > If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
> > then I think you should do it.  Of course, that still leaves mutexes
> > as very expensive on SMP (253 cycles on the 2.53GHz from above).

See my other reply [1 memory barrier but not 2 seems to be needed for
each lock/unlock pair in the !SMP case, and the xchgl accidentally (?)
provides it; perhaps [lms]fence would give a faster memory barrier].

More ideas on this:
- compilers should probably now generate memory barrier instructions for
  volatile variables (so volatile variables would be even slower :-).
  I haven't seen gcc on i386's do this.
- jhb once tried changing mtx_lock_spin(mtx)/mtx_unlock_spin(mtx) to
  critical_enter()/critical_exit().  This didn't work because it broke
  mtx_assert().  It might also not work because it removes the memory
  barrier.  critical_enter() only has the very weak memory barrier in
  disable_intr() on i386's.

> I wonder if there is anything that can be done to make the locking more
> efficient for the Xeon.  Are there any other locking types that could
> be used instead?

I can't think of anything for the SMP case.  See above for the !SMP case.

> This might also explain why we are seeing much worse system call
> performance under 4.7 in SMP versus UP.  Here is a table of results
> for some system call tests I ran.  (The numbers are calls/s)
>
> 2.8GHz Xeon
>                     UP        SMP
> write           904427     661312
> socket         1327692    1067743
> select          554131     434390
> gettimeofday   1734963     252479
>
> 1.3GHz PIII
>                     UP        SMP
> write           746705     532223
> socket         1179819     977448
> select          727811     556537
> gettimeofday   1849862     186387

That's why the Xeon is relatively slower under -current and SMP.
-current just does more locking and more of other things.

> The really interesting one is gettimeofday.  For both the Xeon & PIII,
> the UP is much better than SMP, but the UP for PIII is better than that
> of the Xeon.  I may try to get the results for 5.2.1 later.  I can
> forward the source code of this program to anyone else who wants to try
> it out.

gettimeofday() is slower for SMP because it uses a different timecounter.
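
A test of that kind is essentially just a timed loop around the syscall.
A minimal sketch (a guess at the shape of such a program, not necessarily
what Gerrit's code does; the iteration count is arbitrary) is:

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	struct timeval start, finish, tv;
	double elapsed;
	long i, n;

	n = 1000000;			/* arbitrary iteration count */
	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		gettimeofday(&tv, NULL);	/* the syscall being timed */
	gettimeofday(&finish, NULL);
	elapsed = (finish.tv_sec - start.tv_sec) +
	    (finish.tv_usec - start.tv_usec) / 1e6;
	printf("%.0f gettimeofday calls/s\n", n / elapsed);
	return (0);
}

Timing with gettimeofday() itself is fine here because the loop body is
the thing being measured; substituting write(), socket() or select()
calls would give the other rows.  Run under SMP it should show the slow
timecounter immediately.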
The SMP slowness is a hardware problem -- there is no good timecounter
available.  It looks like the TSC timecounter is being used for the UP
cases and either the i8254 or the ACPI-slow timecounter for the SMP
cases.

Reading the TSC takes about 10-12 cycles on most i386's (probably many
more on P4 ;-).  Syscall overhead adds a lot to this, but gettimeofday()
still takes much less than a microsecond.  The fastest I've seen recently
is 260 ns / 578 cycles for clock_gettime() on an AthlonXP.  OTOH, reading
the i8254 takes about 4000 ns, so clock_gettime() takes 4190 ns on the
same AthlonXP system that takes 260 ns with the TSC timecounter.  This
system also has a slow ACPI timer, so clock_gettime() takes 1397 ns with
the ACPI-fast timecounter and about 3 times as long with the ACPI-slow
timecounter.  Recently-fixed bugs made it often use the ACPI-slow
timecounter although the ACPI-fast timecounter always works.

Slow timecounters mainly affect workloads that do too many context
switches or timestamps on tinygrams.  Yours probably do, but mine don't.
I only notice them when I run microbenchmarks.  The simplest one that
shows them is "ping -fq localhost".  There are normally 7 timestamps per
packet (1 to put in the packet in userland, 2 for bookkeeping in
userland, 2 for pessimization of netisrs in the kernel and 2 for tripping
on our own Giant foot in the kernel).  RELENG_4 only has the userland
ones.  With reasonably fast CPUs (1GHz+ or so) and slow timecounters,
making even one of these timestamps takes longer than everything else.

Bruce