From owner-freebsd-current  Mon Jul 12 16: 7: 2 1999
Delivered-To: freebsd-current@freebsd.org
Received: from alcanet.com.au (border.alcanet.com.au [203.62.196.10])
	by hub.freebsd.org (Postfix) with ESMTP id E2E9F14E73
	for <freebsd-current@FreeBSD.ORG>; Mon, 12 Jul 1999 16:06:29 -0700 (PDT)
	(envelope-from jeremyp@gsmx07.alcatel.com.au)
Received: by border.alcanet.com.au id <40574>; Tue, 13 Jul 1999 08:46:57 +1000
Date: Tue, 13 Jul 1999 09:04:57 +1000
From: Peter Jeremy <jeremyp@gsmx07.alcatel.com.au>
Subject: Re: "objtrm" problem probably found
To: dfr@nlsystems.com, dillon@apollo.backplane.com,
	freebsd-current@FreeBSD.ORG, luoqi@watermarkgroup.com, mike@ducky.net
Message-Id: <99Jul13.084657est.40574@border.alcanet.com.au>
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Doug Rabson <dfr@nlsystems.com> wrote:
>We don't need the lock prefix for the current SMP implementation. A lock
>prefix would be needed in a multithreaded implementation but should not be
>added unless the kernel is an SMP kernel otherwise UP performance would
>suffer.

Modulo the issue of UP vs SMP modules, the code would seem to be simply:

#ifdef SMP
#define ATOMIC_ASM(type,op)	\
    __asm __volatile ("lock; " op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
#else
#define ATOMIC_ASM(type,op)	\
    __asm __volatile (op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
#endif

Or (maybe more clearly):

#ifdef SMP
#define	SMP_LOCK	"lock; "
#else
#define	SMP_LOCK
#endif

#define ATOMIC_ASM(type,op)	\
    __asm __volatile (SMP_LOCK op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))


John-Mark Gurney <gurney_j@efn.org> wrote:
>actually, I'm not so sure, it guarantees that NO other bus operation
>will succeed while this is happening... what happens if a pci bus
>mastering card makes a modification to this value?

This is a valid point, but I don't think it's directly related to this
thread.  I think it's up the the various PCI device driver writers to
ensure that objects shared between a PCI device and driver are
correctly locked.  The mechanism to do this is likely to be device-
specific: Lock prefixes only protect objects no larger than 32 or 64
bits (depending on the processor), cards may require locked accesses
to much larger structures.

I believe the API to the PCI-locking routines should be distinct from
the API for SMP locks - even though the underlying implementation may
be common.


Oliver Fromme <olli@dorifer.heim3.tu-clausthal.de> wrote:
>In my list of i386 clock cycles, the lock prefix is listed with
>0 (zero) cycles.
My i486 book states 1 cycle, although that cycle can be overlapped with
several other combinations that add a cycle to the basic instruction
execution time.  I don't know about the Pentium and beyond timings.  In
any case, we have real-world timings, which are more useful.


Mike Haertel <mike@ducky.net> wrote:
>Although function calls are more expensive than inline code,
>they aren't necessarily a lot more so, and function calls to
>non-locked RMW operations are certainly much cheaper than
>inline locked RMW operations.
Based on my timings below, this is correct, though counter-intuitive.
Given the substantial cost of indirect function calls, I don't
this this would be acceptable, though.  I think compiling modules
separately for UP/SMP is a better choice.


In Message-id: <19990 121806.LAA70634@apollo.backplane.com>,
Matthew Dillon <dillon@apollo.backplane.com> provided some hard figures
for a dual PIII-450.  Expanding those figures for a range of machines
(all are UP except the PIII-450, which are Matt's SMP figures), and
adding the cost of using indirect function calls (patches to Matt's
code at the end):

        i386SX-25   P-133   PII-266  PIII-450  nproc  locks
mode  0   1950.23    39.65    26.31     9.21     EMPTY
mode  1   3340.59    71.74    24.45    16.48     1      no  tight
mode  2   3237.57    71.18    25.27    23.65     2      no  tight
mode  3   3367.65   282.31   153.29    93.02     1     yes  tight
mode  4   3263.64   285.58   152.94   160.82     2     yes  tight
mode  5   9439.15   446.16    60.40    37.64     1      no  spread
mode  6  10231.96   467.39    60.16    89.28     2      no  spread
mode  7  10660.05   725.80   153.18    88.32     1     yes  spread
mode  8   9990.18   755.87   155.18   161.08     2     yes  spread

mode  9   5544.82   131.31    49.96        ?     EMPTY
mode 10   7234.97   174.20    64.81        ?     1      no  tight
mode 11   7212.14   178.72    64.87        ?     2      no  tight
mode 12   7355.46   304.74   182.75        ?     1     yes  tight
mode 13   6956.54   327.11   180.21        ?     2     yes  tight
mode 14  13603.72   582.02   100.10        ?     1      no  spread
mode 15  13443.54   543.97   101.13        ?     2      no  spread
mode 16  13731.94   717.31   207.12        ?     1     yes  spread
mode 17  13379.62   800.31   207.70        ?     2     yes  spread

Modes 9 through 17 are equivalent to modes 0-8 except that the
operation is performed via a call thru a pointer-to-function.
(Mode 9 is a pointer to a nop).

Apart from the noticable improvement in overall speed from left to
right, this shows that the lock prefix is _very_ expensive on
Pentium and above, even in a UP configuration.  It makes no difference
on a 386.  (I can produce the 486 figures tonight after I get home).
It also suggests that moving to an indirect function call (to allow
run-time UP/SMP selection) will be quite painful.

>    As you can see, the lock prefix creates a stall condition on the locked
>    memory, but does *NOT* stall other memory.
This is at least CPU dependent (and may also depend on the motherboard
chipset).  The i486 states that `all memory is locked'.

>    Therefore I believe the impact will be unnoticeable.  On a duel 
>    450MHz P-III we are talking 37 ns vs 88 ns - an overhead of 50 ns
>    for the one processor case, and an overhead of 72 ns for the two
>    processor case.
Whilst that's true for a P-III, it's definitely not true for most
lesser machines (which are probably more common - and are likely to
remain so for a while).

Based on the impact above, I believe the lock prefixes should not
be inserted until they are necessary - even if it does mean we wind
up with /modules and /modules.smp.  I don't believe that moving
to indirect function pointers is a reasonable work-around.

Finally, my patches to Matt's last code:
--- lock2.c~	Tue Jul 13 07:43:10 1999
+++ lock2.c	Tue Jul 13 08:38:35 1999
@@ -32,6 +32,13 @@
         ATOMIC_ASM_NOLOCK(int, "addl %1,%0");
 }
 
+static void nop(void *p, u_int v) {}
+
+static void (*add_lockp)(void *p, u_int v) = atomic_add_int;
+static void (*add_nolockp)(void *p, u_int v) = atomic_add_int_nolock;
+static void (*nopp)(void *p, u_int v) = nop;
+
+
 volatile int GX[8];	/* note: not shared between processes */
 
 int
@@ -51,7 +58,7 @@
     ftruncate(fd, pgsize);
     ptr = mmap(NULL, pgsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
 
-    for (m = 0; m <= 8; ++m) {
+    for (m = 0; m <= 17; ++m) {
 	pid_t pid = -1;
 	int nproc = 1;
 	const char *lcks = "EMPTY";
@@ -119,6 +126,67 @@
 		    ;
 	    }
 	    break;
+	case 8+9:
+	    pid = fork();
+	    nproc = 2;
+	    /* fall through */
+	case 7+9:
+	    for (i = 0; i < LOOPS; ++i) {
+		(*add_lockp)(ptr, 1);
+		GX[0] = 1;
+		GX[1] = 1;
+		GX[2] = 1;
+		GX[3] = 1;
+		GX[4] = 1;
+		GX[5] = 1;
+		GX[6] = 1;
+		GX[7] = 1;
+	    }
+	    lcks = "yes";
+	    break;
+	case 6+9:
+	    pid = fork();
+	    nproc = 2;
+	    /* fall through */
+	case 5+9:
+	    for (i = 0; i < LOOPS; ++i) {
+		(*add_nolockp)(ptr, 1);
+		GX[0] = 1;
+		GX[1] = 1;
+		GX[2] = 1;
+		GX[3] = 1;
+		GX[4] = 1;
+		GX[5] = 1;
+		GX[6] = 1;
+		GX[7] = 1;
+	    }
+	    lcks = "no";
+	    break;
+	case 4+9:
+	    pid = fork();
+	    nproc = 2;
+	    /* fall through */
+	case 3+9:
+	    for (i = 0; i < LOOPS; ++i) {
+		(*add_lockp)(ptr, 1);
+	    }
+	    lcks = "yes";
+	    break;
+	case 2+9:
+	    pid = fork();
+	    nproc = 2;
+	    /* fall through */
+	case 1+9:
+	    for (i = 0; i < LOOPS; ++i) {
+		(*add_nolockp)(ptr, 1);
+	    }
+	    lcks = "no";
+	    break;
+	case 0+9:
+	    for (i = 0; i < LOOPS; ++i) {
+		(*nopp)(ptr, 1);
+	    }
+	    break;
 	default:
 	    printf("huh?\n");
 	    exit(1);
@@ -131,7 +199,7 @@
 
 	usec = tv2.tv_usec + 1000000 - tv1.tv_usec + (tv2.tv_sec - tv1.tv_sec - 1) * 1000000;
 
-	printf("mode %d\t%6.2f ns/loop nproc=%d lcks=%s\n", m, (double)usec * 1000.0 / (double)LOOPS / (double)nproc, nproc, lcks);
+	printf("mode %2d\t%6.2f ns/loop nproc=%d lcks=%s\n", m, (double)usec * 1000.0 / (double)LOOPS / (double)nproc, nproc, lcks);
     }
     return(0);
 }


Peter


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message