From owner-freebsd-smp  Tue Jan  2 13:43:22 2001
From owner-freebsd-smp@FreeBSD.ORG  Tue Jan  2 13:43:18 2001
Return-Path:
Delivered-To: freebsd-smp@freebsd.org
Received: from fast.cs.utah.edu (fast.cs.utah.edu [155.99.212.1])
	by hub.freebsd.org (Postfix) with ESMTP id B6C1237B400;
	Tue, 2 Jan 2001 13:43:17 -0800 (PST)
Received: (from vanmaren@localhost)
	by fast.cs.utah.edu (8.9.1/8.9.1) id OAA08265;
	Tue, 2 Jan 2001 14:43:10 -0700 (MST)
Date: Tue, 2 Jan 2001 14:43:10 -0700 (MST)
From: Kevin Van Maren
Message-Id: <200101022143.OAA08265@fast.cs.utah.edu>
To: jhb@FreeBSD.ORG, vanmaren@fast.cs.utah.edu
Subject: Re: atomic increment?
Cc: cp@bsdi.com, smp@FreeBSD.ORG
Sender: owner-freebsd-smp@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> There is also a desire to try and keep the atomic API from being too huge I
> think.  Atomic operations are _expensive_.  One thing you are forgetting about
> on the x86 is that an atomic op on an SMP system requires a 'lock' prefix.  The
> cost of locking the bus drowns out the savings you may get by getting one or
> two less instructions.

atomic_increment/decrement doesn't inflate it *that* much.  For most
arches it can point to atomic_add.  I prefer ++ to += 1 for things
that go up or down by one.  atomic_add of 1 may make more sense for
some uses, but not for others.

Believe me, I'm not forgetting the LOCK prefix.  But LOCKing the
memory bus only happens on processors before the Pentium Pro.  On the
P6 and later, the LOCK signal is normally NOT asserted on the bus (for
cacheable memory locations).  However, it is still "expensive" because
it is a memory fence, which can stall the speculative processor core.
But it should be much less expensive than a cache miss in any case.

[Yes, even on the P6, the extra cost will be larger than that of a
single simple instruction in the cache.  But just because we spend 30%
of our time doing something doesn't mean we can't make the code faster
by eliminating 20% elsewhere.  And I'm NOT advocating atomic_increment
as a way to get around the gcc bug that causes it to move an immediate
into a temp register: that performance boost should come from fixing
gcc.]

For many uses, atomic ops are better than a mutex, because:

-) I have to acquire the memory I'm modifying in an exclusive state,
   even without the use of atomic operations.  On a P6, the
   incremental cost of using an atomic operation is basically just
   that of preventing speculative reads from passing the op.

-) If I use a mutex, I have to perform at least two LOCKed operations
   (one to acquire, one to release).  Plus, not only do I have to
   obtain exclusive ownership of the variable, I also have to acquire
   exclusive ownership of the mutex cacheline.  [If the mutex is in
   the same cacheline as the variable, and it isn't highly contested,
   the incremental cost is very low.]

-) These performance comparisons still assume that the mutex isn't
   already owned, because if it is, the expense goes through the roof,
   either in spinning on the mutex or in blocking.  [Yes, other
   platforms may have different relative performance issues.]

If you want to see slow, statically predict the wrong branch on
IA64...  I think it was something like an extra 50 cycles per loop
iteration.
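[To make the cost comparison above concrete, here is a minimal,
hypothetical i386 sketch in GCC inline asm.  The function names are
made up for illustration; this is not the actual mutex(9) or atomic(9)
implementation, only the shape of the instruction sequences being
compared.]

    /*
     * A bare atomic increment: one LOCKed read-modify-write on the
     * counter's cache line (cache-locked on P6, bus-locked earlier).
     */
    static __inline void
    atomic_increment_sketch(volatile int *p)
    {
            __asm __volatile("lock; incl %0" : "+m" (*p));
    }

    /*
     * A naive spinlock protecting the same counter: the acquire is a
     * LOCKed xchg, and a release done with another LOCKed op doubles
     * the serializing cost -- plus the lock word's cache line must be
     * owned exclusively in addition to the counter's.
     */
    static __inline void
    spin_acquire_sketch(volatile int *lk)
    {
            int old;

            do {
                    old = 1;
                    /* xchg with a memory operand is implicitly LOCKed. */
                    __asm __volatile("xchgl %0,%1"
                        : "+r" (old), "+m" (*lk) : : "memory");
            } while (old != 0);
    }

    static __inline void
    spin_release_sketch(volatile int *lk)
    {
            int zero = 0;

            __asm __volatile("xchgl %0,%1"
                : "+r" (zero), "+m" (*lk) : : "memory");
    }

Counting the serializing operations in each path is the whole
argument: one for the atomic increment, at least two (acquire plus
release) for the mutex-protected increment, before any contention.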
> > One more thought on atomic operations: If we don't assume assignments
> > are atomic, and always use atomic_load and atomic_store, then we a) can
> > easily provide atomic 64-bit operations on x86 (quick hack would be
> > to use a single mutex for all 64-bit operations), and b) we can port
> > to platforms where atomic_add requires a mutex to protect the atomic_add
> > or atomic_cmpset sequence.  [Slow as molasses]  On x86, the load/store
> > macros are NOPs, but the use also (c) makes it clear that we are
> > manipulating a variable we perform atomic operations on.
>
> Note that the only atomic_load and atomic_store primitives are those that
> include memory barriers (and I think they are broken on the x86 for that
> matter; they need to use a lock'd cmpxchgl in the load case and a lock'd
> xchgl in the store case I think.)

The macros do (aligned) atomic assignments properly, but without the
acquire and release semantics provided by LOCK: it is currently
incorrect to release a lock by doing an "atomic_store(&lock, 0)".  For
that to be allowed, they would have to use a LOCKed xchg or something,
as you say.

From "man 9 atomic":

     The first form just performs the operation without any explicit
     barriers.  The second form uses a read memory barrier, and the
     final variant uses a write memory barrier.

     The atomic_load() functions always have acquire semantics.

Which is not the case for the current IA32 code (reads can pass
unLOCKed reads).

     The atomic_store() functions always have release semantics.

Which is not the case for the current IA32 code (reads can pass
unLOCKed writes).

I think separate memory barrier operations might also make sense.  On
IA64, we have the "mf" and "mf.a" operations; on IA32, we can emulate
a barrier with "lock; addl $0,0(%esp)" for SMP, and a NOP for
uniprocessors (which is as free as you can make it, since (%esp) had
better be exclusively owned by the CPU).  Memory barrier ops could be
used under the "enhanced" load/store primitives, but they would be
slower than directly encoded versions.

I wonder how many problems we're going to have going from x86, where
all LOCKed atomic operations have acquire and release semantics, to
IA64, where they must be explicitly encoded.  I guess it depends on
how many places we use atomic primitives.

Kevin
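[As a rough illustration of the barrier emulation mentioned above,
here is a hypothetical i386 sketch in GCC inline asm.  The function
name and the SMP #ifdef are made up for illustration; this is not the
actual atomic(9) implementation.]

    /*
     * Full memory fence.  On SMP, a LOCKed no-op read-modify-write on
     * the stack serializes memory accesses; the stack line should
     * already be exclusively owned by this CPU, so it stays cheap.
     * On a uniprocessor, only the compiler needs to be kept from
     * reordering.
     */
    static __inline void
    memory_barrier_sketch(void)
    {
    #ifdef SMP
            __asm __volatile("lock; addl $0,0(%%esp)" : : : "memory");
    #else
            __asm __volatile("" : : : "memory");
    #endif
    }

An atomic_store() with release semantics could then be sketched as
such a fence followed by the plain store, or, as noted above, by using
a LOCKed xchg for the store itself.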