Date: Tue, 27 Jan 2004 16:02:45 +1100 (EST)
From: Bruce Evans
To: Maxim Sobolev
Cc: Dag-Erling Smørgrav, freebsd-current@FreeBSD.ORG, Peter Jeremy
Subject: Re: 80386 support in -current
Message-ID: <20040127153333.U4204@gamplex.bde.org>
In-Reply-To: <4014EFA0.4060801@portaone.com>
References: <20040124074052.GA12597@cirb503493.alcatel.com.au>
    <20040125143203.G29442@gamplex.bde.org>
    <4014EFA0.4060801@portaone.com>

On Mon, 26 Jan 2004, Maxim Sobolev wrote:

> Out of curiosity I had run the bench on my good ol' P4 2GHz notebook,
> and was very surprised that it much slower than even PIII-400 in most cases:
>
> (no flags)                   212.22 cycles/call
> -DNO_MPLOCKED -DI386_CPU     117.32 cycles/call
> -DI386_CPU                   117.19 cycles/call
> -DNO_MPLOCKED                 31.39 cycles/call
>
> So, indeed, xchg is *lot* slower on p4 in non-SMP case than cmpxchgl, I

Slowness of xchg vs (unlocked) cmpxchg is normal (xchg forces a
lock which is expensive).  I'm going to change the xchg to a mov in the
UP case.

More worrying and interesting is that everything is way slower than on
an Athlon-XP.  Locking seems to be much more expensive than cli/sti.

> > Athlon XP1600 NO_MPLOCKED:               2.02 cycles/call
> > Athlon XP1600:                          18.07 cycles/call
> > Athlon XP1600 I386_CPU NO_MPLOCKED:     19.06 cycles/call
> > Athlon XP1600 I386_CPU:                 19.06 cycles/call
> > Celeron 400 NO_MPLOCKED:                 5.03 cycles/call
> > Celeron 400:                            25.36 cycles/call
> > Celeron 400 I386_CPU NO_MPLOCKED:       35.27 cycles/call
> > Celeron 400 I386_CPU:                   35.32 cycles/call

Of course the cycle counts for locked instructions are much longer on
the P4, because its CPU frequency is faster and its memory frequency is
much faster.  However, the Athlon is running at 1532 MHz (nominally
1400MHz with a 266MHz FSB, with everything overclocked by 1532/1400), so
its frequencies are not very different from the P4.  But somehow it is
15 times faster by cycle count for the unlocked cmpxchg (2.02 vs 31.39
cycles/call).  Locking the cmpxchg apparently takes 16 cycles on the
Athlon (18.07 - 2.02) and 181 cycles on the P4 (212.22 - 31.39).
cli/sti locking (plus a couple of extra instructions for the i386 case)
apparently takes 17 cycles on the Athlon (19.06 - 2.02) and 86 cycles on
the P4 (117.32 - 31.39).

> had tried to rewrite atomic_readandclear_int() using cmpxchg - in
> non-SMP case it became more than 10 times faster than current xchg
> version (15 cycles vs. 200 cycles). However, when I've hacked all

atomic_readandclear_int() should be changed too.  It still uses
essentially my original plain-i386ish code, which may have been written
without understanding that xchg has an implicit lock.  But it is not
used much.  xchg is used mainly in _release_lock_quick() for spinlocks.
Non-spin locks use _release_lock() which uses cmpxchg.
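
For readers who want to see what is being compared, here is a rough sketch of read-and-clear written both ways. It is not the code in atomic.h: it uses portable C11 <stdatomic.h>, and the names readandclear_xchg() and readandclear_cas() are made up for illustration. On x86, compilers typically emit xchg for the exchange and a lock-prefixed cmpxchg for the compare-and-swap, so this shows only that the two forms return the same result. It does not reproduce the unlocked UP variant measured above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper: read-and-clear via an atomic exchange.
 * Returns the old value and leaves 0 behind.
 */
static unsigned int
readandclear_xchg(atomic_uint *p)
{
	assert(p != NULL);
	return atomic_exchange(p, 0u);
}

/*
 * Hypothetical helper: the same operation via a compare-and-swap loop.
 * When the compare fails, atomic_compare_exchange_weak() stores the
 * value it found into `old`, so the loop simply retries with it.
 */
static unsigned int
readandclear_cas(atomic_uint *p)
{
	unsigned int old;

	assert(p != NULL);
	old = atomic_load(p);
	while (!atomic_compare_exchange_weak(p, &old, 0u))
		;	/* retry with the refreshed value in old */
	return old;
}
```

Both helpers return the previous value and leave zero in *p. Any difference is in the instruction the compiler picks and whether it is locked, which is the cost the cycle counts above are measuring.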
> functions in atomic.h to use cmpxchg instead of xchg, and run make world
> benchmark on kernels without this change and with it, I found that there
> was hardly any improvement in performance, despite expected decrease of
> mutex unlocking operation.

Does anyone know how many mutex calls there are for makeworld?

I've noticed that for almost anything that you can count for makeworld,
although the count may look large it only accounts for epsilon% of the
time.  E.g., a makeworld that took 2540 seconds did about 600000 context
switches.  Context switches are very expensive -- they take between 1.1
and 87.6 usec according to lmbench2.  If they take 87.6 usec then 600000
of them take 52.5 seconds which is significant, but I suspect an average
one takes closer to 1.1 usec than 87.6 (87.6 is with 16 processes
writing to 64K).

Bruce
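
The context-switch estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic. The helper names switch_seconds() and share_percent() are invented for illustration, and the inputs are the figures quoted in this message (600000 switches, 1.1 to 87.6 usec each, 2540 seconds total).

```c
#include <assert.h>

/* Total seconds spent in `count` events costing `usec_each` microseconds each. */
static double
switch_seconds(double count, double usec_each)
{
	assert(count >= 0.0 && usec_each >= 0.0);
	return count * usec_each / 1e6;
}

/* Percentage of a run lasting `total_seconds` that `part_seconds` accounts for. */
static double
share_percent(double part_seconds, double total_seconds)
{
	assert(total_seconds > 0.0);
	return 100.0 * part_seconds / total_seconds;
}
```

With these figures, switch_seconds(600000, 87.6) comes to about 52.56 seconds, roughly 2% of the 2540-second run. switch_seconds(600000, 1.1) is only 0.66 seconds, under 0.03% of the run. The real share lies somewhere between those two bounds, depending on what an average switch costs.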