From owner-freebsd-current@FreeBSD.ORG Mon Jan 26 02:45:57 2004
From: Maxim Sobolev <sobomax@portaone.com>
Date: Mon, 26 Jan 2004 12:44:48 +0200
To: Bruce Evans
cc: Dag-Erling Smørgrav
cc: freebsd-current@FreeBSD.ORG
cc: Peter Jeremy
Message-ID: <4014EFA0.4060801@portaone.com>
In-Reply-To: <20040125143203.G29442@gamplex.bde.org>
Subject: Re: 80386 support in -current

Out of curiosity I ran the benchmark on my good ol' P4 2GHz notebook, and was very surprised that it is much slower than even a PIII-400 in most cases:

212.22 cycles/call	-DNO_MPLOCKED -DI386_CPU
117.32 cycles/call	-DI386_CPU
117.19 cycles/call	-DNO_MPLOCKED
 31.39 cycles/call

So, indeed, xchg is a *lot* slower on the P4 in the non-SMP case than cmpxchgl. I tried rewriting
atomic_readandclear_int() using cmpxchg: in the non-SMP case it became more than 10 times faster than the current xchg version (15 cycles vs. 200 cycles). However, when I hacked all the functions in atomic.h to use cmpxchg instead of xchg and ran a make world benchmark on kernels with and without this change, I found that there was hardly any improvement in performance, despite the expected speedup of the mutex unlocking operation.

-Maxim

===================================================================
RCS file: /sys/i386/include/atomic.h,v
retrieving revision 1.1
diff -du -r1.1 /sys/i386/include/atomic.h
--- /sys/i386/include/atomic.h	2004/01/25 22:25:56	1.1
+++ /sys/i386/include/atomic.h	2004/01/26 10:31:14
@@ -217,10 +217,10 @@
 static __inline void						\
 atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)	\
 {								\
-	__asm __volatile(SOP					\
+	__asm __volatile(__XSTRING(MPLOCKED) SOP		\
	: "+m" (*p),			/* 0 */			\
	  "+r" (v)			/* 1 */			\
-	: : "memory");						\
+	: : "memory", "eax");					\
 }								\
 struct __hack

@@ -258,10 +258,10 @@
 ATOMIC_ASM(add,	     long,  "addl %1,%0",  "ir",  v);
 ATOMIC_ASM(subtract, long,  "subl %1,%0",  "ir",  v);

-ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1", "xchgb %b1,%0");
-ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "xchgw %w1,%0");
-ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1",  "xchgl %1,%0");
-ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1",  "xchgl %1,%0");
+ATOMIC_STORE_LOAD(char,	"cmpxchgb %b0,%1", "movb %b1,%%al; 1: cmpxchgb %b1,%0; jnz 1b");
+ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "movw %w1,%%ax; 1: cmpxchgw %w1,%0; jnz 1b");
+ATOMIC_STORE_LOAD(int,	"cmpxchgl %0,%1",  "movl %1,%%eax; 1: cmpxchgl %1,%0; jnz 1b");
+ATOMIC_STORE_LOAD(long,	"cmpxchgl %0,%1",  "movl %1,%%eax; 1: cmpxchgl %1,%0; jnz 1b");

 #undef ATOMIC_ASM
 #undef ATOMIC_STORE_LOAD

@@ -418,11 +418,15 @@
 	u_int result;

 	__asm __volatile (
-	"	xorl	%0,%0 ;		"
-	"	xchgl	%1,%0 ;		"
+	"	" __XSTRING(MPLOCKED) "	"
+	"1:				"
+	"	cmpxchgl %3,%2 ;	"
+	"	jnz	1b ;		"
 	"# atomic_readandclear_int"
-	: "=&r" (result)		/* 0 (result) */
-	: "m" (*addr));			/* 1 (addr) */
+	: "=a" (result)			/* 0 (result) */
+	: "0" (*addr),			/* 1 */
+	  "m" (*addr),			/* 2 (addr) */
+	  "r" (0));			/* 3 */

 	return (result);
 }

@@ -433,11 +437,15 @@
 	u_long result;

 	__asm __volatile (
-	"	xorl	%0,%0 ;		"
-	"	xchgl	%1,%0 ;		"
+	"	" __XSTRING(MPLOCKED) "	"
+	"1:				"
+	"	cmpxchgl %3,%2 ;	"
+	"	jnz	1b ;		"
 	"# atomic_readandclear_int"
-	: "=&r" (result)		/* 0 (result) */
-	: "m" (*addr));			/* 1 (addr) */
+	: "=a" (result)			/* 0 (result) */
+	: "0" (*addr),			/* 1 */
+	  "m" (*addr),			/* 2 (addr) */
+	  "r" (0));			/* 3 */

 	return (result);
 }

Bruce Evans wrote:
> On Sat, 24 Jan 2004, Dag-Erling Smørgrav wrote:
>
>> Peter Jeremy writes:
>>
>>> Does anyone know why FreeBSD 5.x would not run on a 386SX/387SX
>>> combination?  I realise the performance would be very poor but I
>>> don't see any reason why it wouldn't work at all.
>>
>> It should run fine (though quite slowly) on a 386 with a 387 FPU, but
>> you need to roll your own release.  The reason why we don't support
>> the 386 out of the box is that a kernel that will run on a 386 will be
>> very inefficient on newer CPUs (the synchronization code relies on a
>> particular instruction which was introduced with the 486 and must be
>> emulated on the 386)
>
> This is the specious reason.  The synchronization code relies on a
> particular instruction that might be very inefficient to emulate on a
> 386, but emulation is not done; the instruction is just replaced by
> an instruction or sequence of instructions that is slower in some cases
> and faster in others (mostly slower, but not especially so, except
> probably on P4's).  The actual reason was mostly that the 386 version
> doesn't work in the SMP case or on "P6 [sic] or higher", and
> making it work well would be too hard.
>
> SMP is now in GENERIC, so support for it is more important than when
> I386_CPU was removed from GENERIC.
>
> The ifdef tangle for this stuff combined with lack of testing seems to
> have broken the 386 support in practice.
Libraries are now chummy with
> the kernel implementation of atomic operations, but not chummy enough to
> know when it actually works in userland.  libthr uses the kernel
> atomic_cmpset_*(), but this never works on plain i386's in userland
> (the I386_CPU version doesn't work unless the application gains i/o
> privilege, since it uses cli/sti, and the !I386_CPU version doesn't
> work because it uses cmpxchg).
>
> Some benchmarks for atomic_cmpset_int() run in userland:
>
> Athlon XP1600 NO_MPLOCKED:		 2.02 cycles/call
> Athlon XP1600:				18.07 cycles/call
> Athlon XP1600 I386_CPU NO_MPLOCKED:	19.06 cycles/call
> Athlon XP1600 I386_CPU:			19.06 cycles/call
> Celeron 400 NO_MPLOCKED:		 5.03 cycles/call
> Celeron 400:				25.36 cycles/call
> Celeron 400 I386_CPU NO_MPLOCKED:	35.27 cycles/call
> Celeron 400 I386_CPU:			35.32 cycles/call
>
> %%%
> #include <sys/types.h>
>
> /*
>  * This is a userland benchmark, so lock prefixes are normally forced (for
>  * the !I386_CPU version only).  Compile it with -DNO_MPLOCKED to cancel
>  * this.
>  */
> #ifdef NO_MPLOCKED
> #define _KERNEL
> #endif
> #include <machine/atomic.h>
> #undef _KERNEL
>
> #include <machine/cpufunc.h>
>
> #include <err.h>
> #include <fcntl.h>
> #include <stdio.h>
>
> #define NITER 100000000
>
> int
> main(void)
> {
> 	uint64_t tsc0, tsc1, tsc2;
> 	volatile u_int dst;
> 	int i;
>
> #ifdef I386_CPU
> 	if (open("/dev/io", O_RDONLY) < 0)
> 		err(1, "open");
> #endif
> 	dst = 0;
> 	tsc0 = rdtsc();
> 	for (i = 0; i < NITER; i++) {
> #if 0
> 		atomic_store_rel_int(&dst, 0);
> #else
> 		dst = 0;
> #endif
> 	}
> 	tsc1 = rdtsc();
> 	for (i = 0; i < NITER; i++) {
> 		atomic_cmpset_int(&dst, 0, 1);
> #if 0
> 		/*
> 		 * XXX mtx_unlock*() would use this, but it expands to
> 		 * xchgl in the !I386_CPU case, so it gives a locked
> 		 * instruction even in the !SMP case.  The locking
> 		 * more than doubles the runtime for this benchmark.
> 		 * Don't do it, since we're benchmarking
> 		 * atomic_cmpset_int(), not atomic_store_rel_int().
> 		 */
> 		atomic_store_rel_int(&dst, 0);
> #else
> 		dst = 0;
> #endif
> 	}
> 	tsc2 = rdtsc();
> 	printf("%.2f cycles/call\n",
> 	    ((tsc2 - tsc1) - (tsc1 - tsc0)) / (double)NITER);
> 	return (0);
> }
> %%%
>
> Notes:
> - the atomic_cmpset_int() call tests the usual case of an uncontested lock.
> - cli/sti takes about the same time as a lock prefix on the benchmarked
>   CPUs.  The lock is always forced in userland, so the I386_CPU version
>   gives only a tiny pessimization for time in userland on these CPUs.
>   It mainly pessimizes for use (it doesn't actually work without i/o
>   privilege even in the !SMP case).
> - the kernel sometimes uses xchg instead of "[lock] cmpxchg".  The lock
>   prefix for xchg is implicit, so the !SMP case uses unnecessary lock
>   prefixes.  This pessimizes mtx_unlock*() by about the same amount as
>   not supporting I386_CPU optimizes mtx_lock*() (on the benchmarked
>   CPUs).  Also, the cli/sti in the I386_CPU version of atomic_cmpset*()
>   are just a waste of time for use in mtx_lock_spin(), since
>   mtx_lock_spin() has already done the cli.  So the inefficiency of
>   the I386_CPU version is just a misimplementation detail in many cases.
> - I believe cli and/or sti takes 300 cycles on a P4, so the I386_CPU
>   version is correctly described as "very inefficient" for P4's.
>
> Bruce