From owner-freebsd-smp  Tue Jan  2 13:43:22 2001
From owner-freebsd-smp@FreeBSD.ORG  Tue Jan  2 13:43:18 2001
Return-Path:
Delivered-To: freebsd-smp@freebsd.org
Received: from fast.cs.utah.edu (fast.cs.utah.edu [155.99.212.1])
	by hub.freebsd.org (Postfix) with ESMTP id B6C1237B400;
	Tue, 2 Jan 2001 13:43:17 -0800 (PST)
Received: (from vanmaren@localhost)
	by fast.cs.utah.edu (8.9.1/8.9.1) id OAA08265;
	Tue, 2 Jan 2001 14:43:10 -0700 (MST)
Date: Tue, 2 Jan 2001 14:43:10 -0700 (MST)
From: Kevin Van Maren
Message-Id: <200101022143.OAA08265@fast.cs.utah.edu>
To: jhb@FreeBSD.ORG, vanmaren@fast.cs.utah.edu
Subject: Re: atomic increment?
Cc: cp@bsdi.com, smp@FreeBSD.ORG
Sender: owner-freebsd-smp@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> There is also a desire to try and keep the atomic API from being too huge I
> think.  Atomic operations are _expensive_.  One thing you are forgetting about
> on the x86 is that an atomic op on an SMP system requires a 'lock' prefix.  The
> cost of locking the bus drowns out the savings you may get by getting one or
> two less instructions.

atomic_increment/decrement doesn't inflate it *that* much.  For most
arches it can point to atomic_add.  I prefer ++ to += 1 for things
that go up or down by one.  atomic_add of 1 may make more sense for
some uses, but not for others.

Believe me, I'm not forgetting the LOCK prefix.  But LOCKing the
memory bus only happens on processors before the Pentium Pro.  On the
P6 and later, the LOCK signal is normally NOT asserted on the bus (for
cacheable memory locations).  However, it is still "expensive" because
it is a memory fence, which can stall the speculative processor core.
But it should be much less expensive than a cache miss in any case.

[Yes, even on the P6, the extra cost will be larger than that of a
single simple instruction in the cache.  But just because we spend 30%
of our time doing something doesn't mean we can't make the code faster
by eliminating 20% elsewhere.  And I'm NOT advocating atomic_increment
as a way to get around the gcc bug that causes it to move an immediate
into a temp register: that performance boost should come from fixing
gcc.]

For many uses, atomic ops are better than a mutex, because:

-) I have to acquire the memory I'm modifying in an exclusive state,
   even without the use of atomic operations.  On a P6, the
   incremental cost of using an atomic operation is basically just
   that of preventing speculative reads from passing the op.

-) If I use a mutex, I have to perform at least two LOCKed operations
   (one to acquire, one to release).  Plus, not only do I have to
   obtain exclusive ownership of the variable, I also have to acquire
   exclusive ownership of the mutex cacheline.  [If the mutex is in
   the same cacheline as the variable, and it isn't highly contested,
   the incremental cost is very low.]

-) These performance comparisons still assume that the mutex isn't
   already owned, because if it is, the expense goes through the roof,
   either in spinning on the mutex or in blocking.  [Yes, other
   platforms may have different relative performance issues.]

If you want to see slow, statically predict the wrong branch on
IA64...  I think it was something like an extra 50 cycles per loop
iteration.
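[To make the cost comparison above concrete, here is a minimal,
hypothetical i386 sketch in GCC inline asm.  The function names are
made up for illustration; this is not the actual mutex(9) or atomic(9)
implementation, only the shape of the instruction sequences being
compared.]

    /*
     * A bare atomic increment: one LOCKed read-modify-write on the
     * counter's cache line (cache-locked on P6, bus-locked earlier).
     */
    static __inline void
    atomic_increment_sketch(volatile int *p)
    {
            __asm __volatile("lock; incl %0" : "+m" (*p));
    }

    /*
     * A naive spinlock protecting the same counter: the acquire is a
     * LOCKed xchg, and a release done with another LOCKed op doubles
     * the serializing cost -- plus the lock word's cache line must be
     * owned exclusively in addition to the counter's.
     */
    static __inline void
    spin_acquire_sketch(volatile int *lk)
    {
            int old;

            do {
                    old = 1;
                    /* xchg with a memory operand is implicitly LOCKed. */
                    __asm __volatile("xchgl %0,%1"
                        : "+r" (old), "+m" (*lk) : : "memory");
            } while (old != 0);
    }

    static __inline void
    spin_release_sketch(volatile int *lk)
    {
            int zero = 0;

            __asm __volatile("xchgl %0,%1"
                : "+r" (zero), "+m" (*lk) : : "memory");
    }

Counting the serializing operations in each path is the whole
argument: one for the atomic increment, at least two (acquire plus
release) for the mutex-protected increment, before any contention.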
> > One more thought on atomic operations: If we don't assume assignments
> > are atomic, and always use atomic_load and atomic_store, then we a) can
> > easily provide atomic 64-bit operations on x86 (quick hack would be
> > to use a single mutex for all 64-bit operations), and b) we can port
> > to platforms where atomic_add requires a mutex to protect the atomic_add
> > or atomic_cmpset sequence.  [Slow as molasses]  On x86, the load/store
> > macros are NOPs, but the use also (c) makes it clear that we are
> > manipulating a variable we perform atomic operations on.
>
> Note that the only atomic_load and atomic_store primitives are those that
> include memory barriers (and I think they are broken on the x86 for that
> matter; they need to use a lock'd cmpxchgl in the load case and a lock'd
> xchgl in the store case I think.)

The macros do (aligned) atomic assignments properly, but without the
acquire and release semantics provided by LOCK: it is currently
incorrect to release a lock by doing an "atomic_store(&lock, 0)".  For
that to be allowed, they would have to use a LOCKed xchg or something,
as you say.

From "man 9 atomic":

     The first form just performs the operation without any explicit
     barriers.  The second form uses a read memory barrier, and the
     final variant uses a write memory barrier.

     The atomic_load() functions always have acquire semantics.

Which is not the case for the current IA32 code (reads can pass
unLOCKed reads).

     The atomic_store() functions always have release semantics.

Which is not the case for the current IA32 code (reads can pass
unLOCKed writes).

I think separate memory barrier operations might also make sense.  On
IA64, we have the "mf" and "mf.a" operations; on IA32, we can emulate
a barrier with "lock; addl $0,0(%esp)" for SMP, and a NOP for
uniprocessors (which is as free as you can make it, since (%esp) had
better be exclusively owned by the CPU).  Memory barrier ops could be
used under the "enhanced" load/store primitives, but they would be
slower than directly encoded versions.

I wonder how many problems we're going to have going from x86, where
all LOCKed atomic operations have acquire and release semantics, to
IA64, where they must be explicitly encoded.  I guess it depends on
how many places we use atomic primitives.

Kevin
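[As a rough illustration of the barrier emulation mentioned above,
here is a hypothetical i386 sketch in GCC inline asm.  The function
name and the SMP #ifdef are made up for illustration; this is not the
actual atomic(9) implementation.]

    /*
     * Full memory fence.  On SMP, a LOCKed no-op read-modify-write on
     * the stack serializes memory accesses; the stack line should
     * already be exclusively owned by this CPU, so it stays cheap.
     * On a uniprocessor, only the compiler needs to be kept from
     * reordering.
     */
    static __inline void
    memory_barrier_sketch(void)
    {
    #ifdef SMP
            __asm __volatile("lock; addl $0,0(%%esp)" : : : "memory");
    #else
            __asm __volatile("" : : : "memory");
    #endif
    }

An atomic_store() with release semantics could then be sketched as
such a fence followed by the plain store, or, as noted above, by using
a LOCKed xchg for the store itself.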