Date: Sun, 25 Jan 2004 17:28:31 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@gamplex.bde.org
To: Dag-Erling Smørgrav <des@des.no>
In-Reply-To: <xzpptd9qsf0.fsf@dwp.des.no>
Message-ID: <20040125143203.G29442@gamplex.bde.org>
References: <20040124074052.GA12597@cirb503493.alcatel.com.au>
 <xzpptd9qsf0.fsf@dwp.des.no>
cc: Peter Jeremy <PeterJeremy@optushome.com.au>
cc: freebsd-current@freebsd.org
Subject: Re: 80386 support in -current

On Sat, 24 Jan 2004, Dag-Erling Smørgrav wrote:

> Peter Jeremy <PeterJeremy@optushome.com.au> writes:
> > Does anyone know why FreeBSD 5.x would not run on a 386SX/387SX
> > combination?  I realise the performance would be very poor but I
> > don't see any reason why it wouldn't work at all.
>
> It should run fine (though quite slowly) on a 386 with a 387 FPU, but
> you need to roll your own release.  The reason why we don't support
> the 386 out of the box is that a kernel that will run on a 386 will be
> very inefficient on newer CPUs (the synchronization code relies on a
> particular instruction which was introduced with the 486 and must be
> emulated on the 386)

This is the specious reason.  The synchronization code relies on a
particular instruction that might be very inefficient to emulate on a
386, but emulation is not done; the instruction is just replaced by
an instruction or sequence of instructions that is slower in some cases
and faster in others (mostly slower, but not especially so, except
probably on P4's).  The actual reason was mostly that the 386 version
doesn't work in the SMP case or ([or]?) on "P6 [sic] or higher" and
making it work well would be too hard.

SMP is now in GENERIC, so support for it is more important than when
I386_CPU was removed from GENERIC.

The ifdef tangle for this stuff combined with lack of testing seems to
have broken the 386 support in practice.  Libraries are now chummy with
the kernel implementation of atomic operations, but not chummy enough to
know when it actually works in userland.  libthr uses the kernel
atomic_cmpset_*(), but this never works on plain i386's in userland
(the I386_CPU version doesn't work unless the application gains i/o
privilege since it uses cli/sti, and the !I386_CPU version doesn't
work because it uses cmpxchg).
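
For readers who don't have the semantics in their head, here is a minimal
sketch of what a compare-and-set gives a lock implementation.  It uses C11
<stdatomic.h>, not the FreeBSD <machine/atomic.h> code, and the sketch_*
names are made up for illustration.  On i486 and later a compiler would
normally implement the exchange with "lock cmpxchg", which is the
instruction a plain 386 lacks.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical stand-in for atomic_cmpset_int(): if *dst == expect, store
 * src and return nonzero; otherwise leave *dst alone and return 0.
 * This is C11, not the FreeBSD <machine/atomic.h> implementation.
 */
static int
sketch_cmpset_int(atomic_uint *dst, unsigned int expect, unsigned int src)
{
	return (atomic_compare_exchange_strong(dst, &expect, src));
}

/* Toy lock word: 0 = unowned, 1 = owned. */
static int
sketch_trylock(atomic_uint *lock)
{
	return (sketch_cmpset_int(lock, 0, 1));
}

/* Small self-check showing the uncontested and contested cases. */
static void
sketch_selfcheck(void)
{
	atomic_uint lock = 0;

	assert(sketch_trylock(&lock));		/* uncontested: succeeds */
	assert(!sketch_trylock(&lock));		/* already owned: fails */
	assert(atomic_load(&lock) == 1);	/* failed attempt changed nothing */
}
```

The uncontested path (first call in sketch_selfcheck()) is the one the
benchmark numbers in this message measure.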

Some benchmarks for atomic_cmpset_int() run in userland:

Athlon XP1600          NO_MPLOCKED:             2.02 cycles/call
Athlon XP1600:                                 18.07 cycles/call
Athlon XP1600 I386_CPU NO_MPLOCKED:            19.06 cycles/call
Athlon XP1600 I386_CPU:                        19.06 cycles/call
Celeron 400            NO_MPLOCKED:             5.03 cycles/call
Celeron 400:                                   25.36 cycles/call
Celeron 400   I386_CPU NO_MPLOCKED:            35.27 cycles/call
Celeron 400   I386_CPU:                        35.32 cycles/call

%%%
#include <sys/types.h>

/*
 * This is a userland benchmark, so lock prefixes are normally forced (for
 * the !I386_CPU version only).  Compile it with -DNO_MPLOCKED to cancel
 * this.
 */
#ifdef NO_MPLOCKED
#define	_KERNEL
#endif
#include <machine/atomic.h>
#undef _KERNEL

#include <machine/cpufunc.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>

#define	NITER	100000000

int
main(void)
{
	uint64_t tsc0, tsc1, tsc2;
	volatile u_int dst;
	int i;

#ifdef I386_CPU
	/* Opening /dev/io gives the process the i/o privilege cli/sti need. */
	if (open("/dev/io", O_RDONLY) < 0)
		err(1, "open");
#endif
	dst = 0;
	tsc0 = rdtsc();
	/* Baseline loop: loop overhead plus the store only. */
	for (i = 0; i < NITER; i++) {
#if 0
		atomic_store_rel_int(&dst, 0);
#else
		dst = 0;
#endif
	}
	tsc1 = rdtsc();
	/* Measured loop: the same store plus one atomic_cmpset_int(). */
	for (i = 0; i < NITER; i++) {
		atomic_cmpset_int(&dst, 0, 1);
#if 0
		/*
		 * XXX mtx_unlock*() would use this, but it expands to
		 * xchgl in the !I386_CPU case so it gives a locked
		 * instruction even in the !SMP case.  The locking
		 * more than doubles the runtime for this benchmark.
		 * Don't do it, since we're benchmarking
		 * atomic_cmpset_int(), not atomic_store_rel_int().
		 */
		atomic_store_rel_int(&dst, 0);
#else
		dst = 0;
#endif
	}
	tsc2 = rdtsc();
	/* Subtract the baseline so only the cmpset cost is reported. */
	printf("%.2f cycles/call\n",
	    ((tsc2 - tsc1) - (tsc1 - tsc0)) / (double)NITER);
	return (0);
}
%%%
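
The subtraction in the final printf() is the whole measurement method: the
first loop gives the cost of the loop plus the store, and that is taken
away from the second loop.  A minimal sketch of that arithmetic as a
helper follows.  The name net_cycles_per_call() is hypothetical, and the
signed intermediate is my assumption, added so that a noisy run where the
baseline comes out slower yields a negative result instead of wrapping.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Net cost per call from three timestamp readings: tsc0..tsc1 brackets the
 * baseline loop and tsc1..tsc2 brackets the measured loop, each of niter
 * iterations.  Hypothetical helper, not part of the benchmark above.
 */
static double
net_cycles_per_call(uint64_t tsc0, uint64_t tsc1, uint64_t tsc2,
    uint64_t niter)
{
	int64_t baseline = (int64_t)(tsc1 - tsc0);
	int64_t measured = (int64_t)(tsc2 - tsc1);

	return ((double)(measured - baseline) / (double)niter);
}
```

With readings of 0, 1000 and 3000 over 100 iterations, the baseline is
1000 cycles, the measured loop is 2000, and the net cost is 10 cycles per
call.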

Notes:
- the atomic_cmpset_int() call tests the usual case of an uncontested lock.
- cli/sti takes about the same time as a lock prefix on the benchmarked
  CPUs.  The lock is always forced in userland, so the I386_CPU version
  gives only a tiny pessimization for time in userland on these CPUs.
  It mainly pessimizes for use (it doesn't actually work without i/o
  privilege even in the !SMP case).
- the kernel sometimes uses xchg instead of "[lock] cmpxchg".  The lock
  prefix for xchg is implicit.  So the !SMP case uses unnecessary lock
  prefixes.  This pessimizes mtx_unlock*() by about the same amount as
  not supporting I386_CPU optimizes mtx_lock*() (on the benchmarked
  CPUs).  Also, the cli/sti in the I386_CPU version of atomic_cmpset*()
  are just a waste of time for use in mtx_lock_spin(), since
  mtx_lock_spin() has already done the cli.  So the inefficiency of
  the I386_CPU version is just a misimplementation detail in many cases.
- I believe cli and/or sti takes 300 cycles on a P4, so the I386_CPU
  version is correctly described as "very inefficient" for P4's.
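
To make the xchg point concrete, here is a sketch of two ways to release
the same toy lock word, again in C11 rather than with the kernel's
atomic_store_rel_int(), and with made-up sketch_* names.  On x86 a
compiler will typically turn the exchange into xchg (locked whether or not
a prefix is written) and the release store into an ordinary mov; that is
an expectation about common compilers, not something this code checks.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Release by exchanging 0 into the lock word.  Typically an xchg on x86,
 * so the bus lock is paid even on a uniprocessor.
 */
static void
sketch_unlock_xchg(atomic_uint *lock)
{
	(void)atomic_exchange(lock, 0);
}

/*
 * Release with a plain release-ordered store.  Typically an ordinary mov
 * on x86, with no locked cycle.
 */
static void
sketch_unlock_store(atomic_uint *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}
```

Both leave the lock word at 0; only the instruction cost differs.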

Bruce