Date: Sun, 25 Jan 2004 17:28:31 +1100 (EST)
From: Bruce Evans
To: Dag-Erling Smørgrav
Cc: Peter Jeremy, freebsd-current@freebsd.org
Subject: Re: 80386 support in -current

On Sat, 24 Jan 2004, Dag-Erling Smørgrav wrote:

> Peter Jeremy writes:
> > Does anyone know why FreeBSD 5.x would not run on a 386SX/387SX
> > combination? I realise the performance would be very poor but I
> > don't see any reason why it wouldn't work at all.
>
> It should run fine (though quite slowly) on a 386 with a 387 FPU, but
> you need to roll your own release.
> The reason why we don't support
> the 386 out of the box is that a kernel that will run on a 386 will be
> very inefficient on newer CPUs (the synchronization code relies on a
> particular instruction which was introduced with the 486 and must be
> emulated on the 386)

This is the specious reason. The synchronization code relies on a
particular instruction that might be very inefficient to emulate on a
386, but emulation is not done; the instruction is just replaced by an
instruction or sequence of instructions that is slower in some cases and
faster in others (mostly slower, but not especially so, except probably
on P4's).

The actual reason was mostly that the 386 version doesn't work in the
SMP case or ([or]?) on "P6 [sic] or higher" and making it work well
would be too hard. SMP is now in GENERIC, so support for it is more
important than when I386_CPU was removed from GENERIC.

The ifdef tangle for this stuff combined with lack of testing seems to
have broken the 386 support in practice.

Libraries are now chummy with the kernel implementation of atomic
operations, but not chummy enough to know when it actually works in
userland. libthr uses the kernel atomic_cmpset_*(), but this never
works on plain i386's in userland (the I386_CPU version doesn't work
unless the application gains i/o privilege since it uses cli/sti, and
the !I386_CPU version doesn't work because it uses cmpxchg).

Some benchmarks for atomic_cmpset_int() run in userland:

Athlon XP1600 NO_MPLOCKED:            2.02 cycles/call
Athlon XP1600:                       18.07 cycles/call
Athlon XP1600 I386_CPU NO_MPLOCKED:  19.06 cycles/call
Athlon XP1600 I386_CPU:              19.06 cycles/call
Celeron 400 NO_MPLOCKED:              5.03 cycles/call
Celeron 400:                         25.36 cycles/call
Celeron 400 I386_CPU NO_MPLOCKED:    35.27 cycles/call
Celeron 400 I386_CPU:                35.32 cycles/call

%%%
#include <sys/types.h>

/*
 * This is a userland benchmark, so lock prefixes are normally forced (for
 * the !I386_CPU version only).  Compile it with -DNO_MPLOCKED to cancel
 * this.
 */
#ifdef NO_MPLOCKED
#define	_KERNEL
#endif
#include <machine/atomic.h>
#undef _KERNEL
#include <machine/cpufunc.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>

#define	NITER	100000000

int
main(void)
{
	uint64_t tsc0, tsc1, tsc2;
	volatile u_int dst;
	int i;

#ifdef I386_CPU
	/* The I386_CPU version uses cli/sti, which needs i/o privilege. */
	if (open("/dev/io", O_RDONLY) < 0)
		err(1, "open");
#endif
	dst = 0;
	/* First loop: measure the loop and store overhead alone. */
	tsc0 = rdtsc();
	for (i = 0; i < NITER; i++) {
#if 0
		atomic_store_rel_int(&dst, 0);
#else
		dst = 0;
#endif
	}
	/* Second loop: the same overhead plus atomic_cmpset_int(). */
	tsc1 = rdtsc();
	for (i = 0; i < NITER; i++) {
		atomic_cmpset_int(&dst, 0, 1);
#if 0
		/*
		 * XXX mtx_unlock*() would use this, but it expands to
		 * xchgl in the !I386_CPU case so it gives a locked
		 * instruction even in the !SMP case.  The locking
		 * more than doubles the runtime for this benchmark.
		 * Don't do it, since we're benchmarking
		 * atomic_cmpset_int(), not atomic_store_rel_int().
		 */
		atomic_store_rel_int(&dst, 0);
#else
		dst = 0;
#endif
	}
	tsc2 = rdtsc();
	/* Subtract the first loop's time to isolate the cmpset cost. */
	printf("%.2f cycles/call\n",
	    ((tsc2 - tsc1) - (tsc1 - tsc0)) / (double)NITER);
	return (0);
}
%%%

Notes:
- the atomic_cmpset_int() tests the usual case of an uncontested lock.
- cli/sti takes about the same time as a lock prefix on the benchmarked
  CPUs. The lock is always forced in userland, so the I386_CPU version
  gives only a tiny pessimization for time in userland on these CPUs.
  It mainly pessimizes for use (it doesn't actually work without i/o
  privilege even in the !SMP case).
- the kernel sometimes uses xchg instead of "[lock] cmpxchg". The lock
  prefix for xchg is implicit. So the !SMP case uses unnecessary lock
  prefixes. This pessimizes mtx_unlock*() by about the same amount as
  not supporting I386_CPU optimizes mtx_lock*() (on the benchmarked
  CPUs). Also, the cli/sti in the I386_CPU version of atomic_cmpset*()
  are just a waste of time for use in mtx_lock_spin(), since
  mtx_lock_spin() has already done the cli.
  So the inefficiency of the I386_CPU version is just a
  misimplementation detail in many cases.
- I believe cli and/or sti takes 300 cycles on a P4, so the I386_CPU
  version is correctly described as "very inefficient" for P4's.

Bruce
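
For readers who want to see what the benchmarked operation does in the
uncontested-lock case, here is a minimal sketch. It is an assumption of
this note, not the code from <machine/atomic.h>: it uses C11
<stdatomic.h> instead of the kernel's inline assembly, and the names
cmpset_int(), try_lock() and unlock() are hypothetical. On x86 a
compiler will typically implement the compare-and-exchange with a
locked cmpxchg, so this sketch would have the same problem on a real
386 that is described above for the !I386_CPU version.

```c
#include <stdatomic.h>

/*
 * Hypothetical stand-in for atomic_cmpset_int(): if *dst equals expect,
 * store src and return 1; otherwise leave *dst alone and return 0.
 */
static int
cmpset_int(atomic_uint *dst, unsigned int expect, unsigned int src)
{
	/* On failure C11 copies the observed value into expect; it is discarded. */
	return (atomic_compare_exchange_strong(dst, &expect, src));
}

/* Try to take a lock word: 0 means free, 1 means held. */
static int
try_lock(atomic_uint *lock)
{
	return (cmpset_int(lock, 0, 1));
}

/* Release the lock with a release-ordered store. */
static void
unlock(atomic_uint *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}
```

The first try_lock() on a free lock word succeeds, which is the path
the benchmark loop exercises; a second try_lock() without an
intervening unlock() fails and leaves the word unchanged.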