From owner-freebsd-current Mon Jul 12 16: 7: 2 1999 Delivered-To: freebsd-current@freebsd.org Received: from alcanet.com.au (border.alcanet.com.au [203.62.196.10]) by hub.freebsd.org (Postfix) with ESMTP id E2E9F14E73 for ; Mon, 12 Jul 1999 16:06:29 -0700 (PDT) (envelope-from jeremyp@gsmx07.alcatel.com.au) Received: by border.alcanet.com.au id <40574>; Tue, 13 Jul 1999 08:46:57 +1000 Date: Tue, 13 Jul 1999 09:04:57 +1000 From: Peter Jeremy Subject: Re: "objtrm" problem probably found To: dfr@nlsystems.com, dillon@apollo.backplane.com, freebsd-current@FreeBSD.ORG, luoqi@watermarkgroup.com, mike@ducky.net Message-Id: <99Jul13.084657est.40574@border.alcanet.com.au> Sender: owner-freebsd-current@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Doug Rabson wrote: >We don't need the lock prefix for the current SMP implementation. A lock >prefix would be needed in a multithreaded implementation but should not be >added unless the kernel is an SMP kernel otherwise UP performance would >suffer. Modulo the issue of UP vs SMP modules, the code would seem to be simply: #ifdef SMP #define ATOMIC_ASM(type,op) \ __asm __volatile ("lock; " op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p)) #else #define ATOMIC_ASM(type,op) \ __asm __volatile (op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p)) #endif Or (maybe more clearly): #ifdef SMP #define SMP_LOCK "lock; " #else #define SMP_LOCK #endif #define ATOMIC_ASM(type,op) \ __asm __volatile (SMP_LOCK op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p)) John-Mark Gurney wrote: >actually, I'm not so sure, it guarantees that NO other bus operation >will succeed while this is happening... what happens if a pci bus >mastering card makes a modification to this value? This is a valid point, but I don't think it's directly related to this thread. I think it's up the the various PCI device driver writers to ensure that objects shared between a PCI device and driver are correctly locked. The mechanism to do this is likely to be device- specific: Lock prefixes only protect objects no larger than 32 or 64 bits (depending on the processor), cards may require locked accesses to much larger structures. I believe the API to the PCI-locking routines should be distinct from the API for SMP locks - even though the underlying implementation may be common. Oliver Fromme wrote: >In my list of i386 clock cycles, the lock prefix is listed with >0 (zero) cycles. My i486 book states 1 cycle, although that cycle can be overlapped with several other combinations that add a cycle to the basic instruction execution time. I don't know about the Pentium and beyond timings. In any case, we have real-world timings, which are more useful. Mike Haertel wrote: >Although function calls are more expensive than inline code, >they aren't necessarily a lot more so, and function calls to >non-locked RMW operations are certainly much cheaper than >inline locked RMW operations. Based on my timings below, this is correct, though counter-intuitive. Given the substantial cost of indirect function calls, I don't this this would be acceptable, though. I think compiling modules separately for UP/SMP is a better choice. In Message-id: <19990 121806.LAA70634@apollo.backplane.com>, Matthew Dillon provided some hard figures for a dual PIII-450. Expanding those figures for a range of machines (all are UP except the PIII-450, which are Matt's SMP figures), and adding the cost of using indirect function calls (patches to Matt's code at the end): i386SX-25 P-133 PII-266 PIII-450 nproc locks mode 0 1950.23 39.65 26.31 9.21 EMPTY mode 1 3340.59 71.74 24.45 16.48 1 no tight mode 2 3237.57 71.18 25.27 23.65 2 no tight mode 3 3367.65 282.31 153.29 93.02 1 yes tight mode 4 3263.64 285.58 152.94 160.82 2 yes tight mode 5 9439.15 446.16 60.40 37.64 1 no spread mode 6 10231.96 467.39 60.16 89.28 2 no spread mode 7 10660.05 725.80 153.18 88.32 1 yes spread mode 8 9990.18 755.87 155.18 161.08 2 yes spread mode 9 5544.82 131.31 49.96 ? EMPTY mode 10 7234.97 174.20 64.81 ? 1 no tight mode 11 7212.14 178.72 64.87 ? 2 no tight mode 12 7355.46 304.74 182.75 ? 1 yes tight mode 13 6956.54 327.11 180.21 ? 2 yes tight mode 14 13603.72 582.02 100.10 ? 1 no spread mode 15 13443.54 543.97 101.13 ? 2 no spread mode 16 13731.94 717.31 207.12 ? 1 yes spread mode 17 13379.62 800.31 207.70 ? 2 yes spread Modes 9 through 17 are equivalent to modes 0-8 except that the operation is performed via a call thru a pointer-to-function. (Mode 9 is a pointer to a nop). Apart from the noticable improvement in overall speed from left to right, this shows that the lock prefix is _very_ expensive on Pentium and above, even in a UP configuration. It makes no difference on a 386. (I can produce the 486 figures tonight after I get home). It also suggests that moving to an indirect function call (to allow run-time UP/SMP selection) will be quite painful. > As you can see, the lock prefix creates a stall condition on the locked > memory, but does *NOT* stall other memory. This is at least CPU dependent (and may also depend on the motherboard chipset). The i486 states that `all memory is locked'. > Therefore I believe the impact will be unnoticeable. On a duel > 450MHz P-III we are talking 37 ns vs 88 ns - an overhead of 50 ns > for the one processor case, and an overhead of 72 ns for the two > processor case. Whilst that's true for a P-III, it's definitely not true for most lesser machines (which are probably more common - and are likely to remain so for a while). Based on the impact above, I believe the lock prefixes should not be inserted until they are necessary - even if it does mean we wind up with /modules and /modules.smp. I don't believe that moving to indirect function pointers is a reasonable work-around. Finally, my patches to Matt's last code: --- lock2.c~ Tue Jul 13 07:43:10 1999 +++ lock2.c Tue Jul 13 08:38:35 1999 @@ -32,6 +32,13 @@ ATOMIC_ASM_NOLOCK(int, "addl %1,%0"); } +static void nop(void *p, u_int v) {} + +static void (*add_lockp)(void *p, u_int v) = atomic_add_int; +static void (*add_nolockp)(void *p, u_int v) = atomic_add_int_nolock; +static void (*nopp)(void *p, u_int v) = nop; + + volatile int GX[8]; /* note: not shared between processes */ int @@ -51,7 +58,7 @@ ftruncate(fd, pgsize); ptr = mmap(NULL, pgsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); - for (m = 0; m <= 8; ++m) { + for (m = 0; m <= 17; ++m) { pid_t pid = -1; int nproc = 1; const char *lcks = "EMPTY"; @@ -119,6 +126,67 @@ ; } break; + case 8+9: + pid = fork(); + nproc = 2; + /* fall through */ + case 7+9: + for (i = 0; i < LOOPS; ++i) { + (*add_lockp)(ptr, 1); + GX[0] = 1; + GX[1] = 1; + GX[2] = 1; + GX[3] = 1; + GX[4] = 1; + GX[5] = 1; + GX[6] = 1; + GX[7] = 1; + } + lcks = "yes"; + break; + case 6+9: + pid = fork(); + nproc = 2; + /* fall through */ + case 5+9: + for (i = 0; i < LOOPS; ++i) { + (*add_nolockp)(ptr, 1); + GX[0] = 1; + GX[1] = 1; + GX[2] = 1; + GX[3] = 1; + GX[4] = 1; + GX[5] = 1; + GX[6] = 1; + GX[7] = 1; + } + lcks = "no"; + break; + case 4+9: + pid = fork(); + nproc = 2; + /* fall through */ + case 3+9: + for (i = 0; i < LOOPS; ++i) { + (*add_lockp)(ptr, 1); + } + lcks = "yes"; + break; + case 2+9: + pid = fork(); + nproc = 2; + /* fall through */ + case 1+9: + for (i = 0; i < LOOPS; ++i) { + (*add_nolockp)(ptr, 1); + } + lcks = "no"; + break; + case 0+9: + for (i = 0; i < LOOPS; ++i) { + (*nopp)(ptr, 1); + } + break; default: printf("huh?\n"); exit(1); @@ -131,7 +199,7 @@ usec = tv2.tv_usec + 1000000 - tv1.tv_usec + (tv2.tv_sec - tv1.tv_sec - 1) * 1000000; - printf("mode %d\t%6.2f ns/loop nproc=%d lcks=%s\n", m, (double)usec * 1000.0 / (double)LOOPS / (double)nproc, nproc, lcks); + printf("mode %2d\t%6.2f ns/loop nproc=%d lcks=%s\n", m, (double)usec * 1000.0 / (double)LOOPS / (double)nproc, nproc, lcks); } return(0); } Peter To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-current" in the body of the message