From owner-freebsd-current@FreeBSD.ORG Thu May  6 01:02:34 2004
Date: Thu, 6 May 2004 18:02:10 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Andrew Gallatin
cc: freebsd-current@FreeBSD.org
cc: Gerrit Nagelhout
Subject: RE: 4.7 vs 5.2.1 SMP/UP bridging performance
Message-ID: <20040506164710.I19057@gamplex.bde.org>
In-Reply-To: <16537.23378.375946.857908@grasshopper.cs.duke.edu>
References: <16537.23378.375946.857908@grasshopper.cs.duke.edu>

On Wed, 5 May 2004, Andrew Gallatin wrote:

> Bruce Evans writes:
> > So there seems to be something wrong with your benchmark.  Locking
> > the bus for the SMP case always costs about 20+ cycles, but this
> > hasn't changed since RELENG_4 and mutexes can't be made much faster
> > in the uncontested case since their overhead is dominated by the
> > bus lock time.
>
> Actually, I think his tests are accurate and bus locked instructions
> take an eternity on P4.  See
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0109.3/0687.html
>
> For example, with your test above, I see 212 cycles for the UP case
> on a 2.53GHz P4.  Replacing the atomic_store_rel_int(&slock, 0) with
> a simple slock = 0; reduces that count to 18 cycles.

This seems to be right, unfortunately.  I wonder if this has anything
to do with freebsd.org having no P4 machines.

> If its really safe to remove the xchg* from non-SMP atomic_store_rel*,
> then I think you should do it.  Of course, that still leaves mutexes
> as very expensive on SMP (253 cycles on the 2.53GHz from above).

I forgot (again) that there are memory access ordering issues.  A lock
may be needed to get everything synced.  See the comment before the
i386 versions in i386/include/atomic.h.  A single lock may be enough.
The best example I could think of easily is:

%%%
	int foo;	/* supposedly protected by sched_lock */
	...
	mtx_lock(&mtx);
	if (foo == 0)
		foo++;
	mtx_unlock(&mtx);
	KASSERT(foo == 1, ("oops"));
%%%

On at least amd64's, reads can be done out of order relative to all
other reads and relative to writes to different memory locations.
mtx_lock(&mtx) doesn't go near foo's memory location, so foo may be
read before the lock is acquired.  If this code is interrupted after
foo is read but before the lock is acquired, then the interrupt
handler may run the same code and bump foo to 1.  Then on return, if
the out-of-order read is still valid, the above will bump foo again.
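For reference, the release-store comparison behind the 212 vs 18 cycle
numbers above looks roughly like this (a sketch with made-up names,
not the real i386/include/atomic.h code):

%%%
/* The bus-locked flavour: xchgl with a memory operand locks the bus. */
static __inline void
store_rel_xchg(volatile u_int *p, u_int v)
{

	__asm __volatile("xchgl %1,%0" : "+m" (*p), "+r" (v) : : "memory");
}

/* The proposed non-SMP flavour: compiler barrier plus a plain store. */
static __inline void
store_rel_plain(volatile u_int *p, u_int v)
{

	__asm __volatile("" : : : "memory");
	*p = v;
}
%%%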
However, the lock in the mtx_unlock() in the interrupt handler
presumably makes the out-of-order read invalid (does it?), so on
return from the interrupt handler foo will be read again and found to
be 1 (since even if the read is out of order relative to the mtx
locking, it is ordered relative to the write to foo's memory
location).  If this is correct, then someone's home-made locking using
atomic_cmpset for unlock was more than a style bug :-).

To possibly reduce the locking overhead for at least the non-SMP case
on i386's and amd64's, there are the [lms]fence instructions and
serializing instructions.  On AthlonXPs, movl + sfence takes about
half as many cycles as xchgl.  I think lfence is actually needed, but
AthlonXPs only have sfence.  All the serializing instructions seem to
be too heavyweight to help here.  On sledge's amd64, the saving using
lfence is smaller (atomic_cmpset_acq_int + movl: 6 cycles;
atomic_cmpset_acq_int + movl + lfence: 15 cycles; atomic_cmpset_acq_int
+ atomic_store_rel_int: 21 cycles) (all including loop overhead; the
kind of loop is sketched below).

Bruce
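Roughly the kind of loop meant above, as a user-level sketch (the
lock_acq()/lock_rel_xchg() pair and slock here are stand-ins, not the
kernel's atomic_cmpset_acq_int()/atomic_store_rel_int(), so the exact
counts will differ):

%%%
#include <stdio.h>

static volatile unsigned slock;

static __inline unsigned long long
rdtsc(void)
{
	unsigned lo, hi;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32 | lo);
}

/* Test-and-set acquire; stands in for an atomic_cmpset_acq_int() loop. */
static __inline void
lock_acq(void)
{
	unsigned v;

	do {
		v = 1;
		__asm __volatile("xchgl %1,%0"
		    : "+m" (slock), "+r" (v) : : "memory");
	} while (v != 0);
}

/* Bus-locked release; stands in for atomic_store_rel_int(&slock, 0). */
static __inline void
lock_rel_xchg(void)
{
	unsigned v = 0;

	__asm __volatile("xchgl %1,%0"
	    : "+m" (slock), "+r" (v) : : "memory");
}

int
main(void)
{
	unsigned long long t0, t1;
	int i, n = 1000000;

	t0 = rdtsc();
	for (i = 0; i < n; i++) {
		lock_acq();
		lock_rel_xchg();	/* or: slock = 0; or movl + [ls]fence */
	}
	t1 = rdtsc();
	printf("%llu cycles per lock/unlock pair (including loop overhead)\n",
	    (t1 - t0) / n);
	return (0);
}
%%%

Replacing lock_rel_xchg() with the plain store, or with a movl
followed by lfence or sfence, gives the comparisons discussed above;
whether a fence (and which one) is actually sufficient is the open
question.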