Date:      Tue, 28 Oct 2014 21:08:27 +0100
From:      Attilio Rao <attilio@freebsd.org>
To:        Andrew Turner <andrew@fubar.geek.nz>
Cc:        "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>, Adrian Chadd <adrian@freebsd.org>, Mateusz Guzik <mjguzik@gmail.com>, Konstantin Belousov <kib@freebsd.org>, Alan Cox <alc@rice.edu>
Subject:   Re: atomic ops
Message-ID:  <CAJ-FndCsvLV_B3Q0boyK78980chM79hFf_dRyEqRtxzMJkpD5g@mail.gmail.com>
In-Reply-To: <20141028175318.709d2ef6@bender.lan>
References:  <20141028025222.GA19223@dft-labs.eu> <CAJ-FndCWZt7YwFswt70QvbXA5c8Q_cYME2m3OwHTjCv8Nu3s=Q@mail.gmail.com> <20141028142510.10a9d3cb@bender.lan> <CAJ-FndD=9MgK608ra8%2BeMy=cAdq%2BA0xRp9u3xFrwtPEk8eH4CA@mail.gmail.com> <20141028175318.709d2ef6@bender.lan>

On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner <andrew@fubar.geek.nz> wrote:
> On Tue, 28 Oct 2014 15:33:06 +0100
> Attilio Rao <attilio@freebsd.org> wrote:
>> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner <andrew@fubar.geek.nz>
>> wrote:
>> > On Tue, 28 Oct 2014 14:18:41 +0100
>> > Attilio Rao <attilio@freebsd.org> wrote:
>> >
>> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com>
>> >> wrote:
>> >> > As was mentioned sometime ago, our situation related to atomic
>> >> > ops is not ideal.
>> >> >
>> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
>> >> > provide full memory barriers, which is stronger than needed.
>> >> >
>> >> > Moreover, load is implemented as lock cmpxchg on var address, so
>> >> > it is additionally slower especially when cpus compete.
>> >>
>> >> I already explained this once privately: full memory barriers are
>> >> not stronger than needed.
>> >> FreeBSD has different semantics than Linux. We historically
>> >> enforce a full barrier on _acq() and _rel() rather than just a
>> >> read and write barrier, hence we need a different implementation
>> >> than Linux. There is code that relies on this property, like the
>> >> locking primitives (release a mutex, for instance).
>> >
>> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
>> > there are only full barriers. On both 32 and 64-bit ARMv8 ARM has
>> > added support for load-acquire and store-release atomic
>> > instructions. For the use in atomic instructions we can assume
>> > these only operate on the address passed to them.
>> >
>> > It is unlikely we will use them in the 32-bit port however I would
>> > like to know the expected semantics of these atomic functions to
>> > make sure we get them correct in the arm64 port. I have been
>> > advised by one of the ARM Linux kernel maintainers on the problems
>> > they have found using these instructions but have yet to determine
>> > what our atomic functions guarantee.
>>
>> For FreeBSD the "reference doc" is atomic(9).
>> It clearly states:
>
> There may also be a difference between what it states, how they are
> implemented, and what developers assume they do. I'm trying to make
> sure I get them correct.

atomic(9) is our reference, so there should be no difference between
what it states and what all architectures implement.
I can say that x86 follows atomic(9) well. I'm not competent enough to
judge if all the !x86 arches follow it completely.
I can understand that developers may get confused. The FreeBSD scheme
is pretty unique. It comes from the fact that, historically, the membar
support was initially designed for x86. The super-widespread Linux
design, instead, tried to cover all architectures in its description.
It became very well known, and I think it also pushed companies
like Intel to invest in improving the performance of things like explicit
read/write barriers, etc.

>> The second variant of each operation includes a read memory barrier.
>> This barrier ensures that the effects of this operation are completed
>> before the effects of any later data accesses.  As a result, the
>> operation is said to have acquire semantics as it acquires a
>> pseudo-lock requiring further operations to wait until it has
>> completed.  To denote this, the suffix ``_acq'' is inserted into the
>> function name immediately prior to the ``_<type>'' suffix.  For
>> example, to subtract two integers ensuring that any later writes will
>> happen after the subtraction is performed, use
>> atomic_subtract_acq_int().
>
> It depends on the point we guarantee the acquire barrier to be. On ARMv8
> the function will be a load/modify/write sequence. If we use a
> load-acquire operation for atomic_subtract_acq_int, for example, for a
> pointer P and value to subtract X:
>
> loop:
>  load-acquire *P to N
>  perform N = N - X
>  store-exclusive N to *P
>  if the store failed goto loop
>
> where N and X are both registers.
>
> This will mean no access after this loop will happen before it, but
> they may happen within it, e.g. if there was a later access A the
> following may be possible:
>
> Load P
> Access A
> Store P

No, this will be broken in FreeBSD if "Access A" is later.
If "Access A" comes before the membar, it doesn't really matter if it gets
interleaved with any of the operations in the atomic instruction.
Ideally, it could even move past the Store P itself.
But if "Access A" is later (and you want to implement an _acq()
barrier), then it absolutely cannot get in the middle of the atomic_*
operation.
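
For illustration only, the load/modify/store loop quoted above can be
approximated with C11 atomics. This is a sketch under assumptions: it uses
<stdatomic.h> rather than atomic(9), and sketch_subtract_acq() is a
hypothetical name, not FreeBSD's atomic_subtract_acq_int(). The acquire
ordering here attaches to the load half of the exchange, so whether this
shape is strong enough for the _acq() semantics described in this thread is
exactly the open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of atomic(9)): subtract x from *p with
 * acquire ordering on the successful exchange and return the new value.
 */
static inline unsigned int
sketch_subtract_acq(_Atomic unsigned int *p, unsigned int x)
{
	unsigned int old;

	assert(p != NULL);
	old = atomic_load_explicit(p, memory_order_relaxed);
	/*
	 * Retry until no other CPU changed *p between our load and our
	 * store. A failed exchange refreshes 'old' with the current value.
	 */
	while (!atomic_compare_exchange_weak_explicit(p, &old, old - x,
	    memory_order_acquire, memory_order_relaxed))
		;
	return (old - x);
}
```

A caller would use it as sketch_subtract_acq(&counter, 1). If the stronger
reading is required, where later accesses may not slip between the load and
the store, this acquire-only form should not be assumed to be sufficient.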

> We know the store will eventually happen: if it fails, e.g. because
> another processor accessed *P, the loop simply iterates again.
>
> The other point is we can guarantee any store-release, and therefore
> any prior access, has happened before a later load-acquire even if it's
> on another processor.

No, we can never guarantee anything about the visibility of the operations
to other CPUs.
We only make guarantees about how the operations are posted on the system
bus (or how they are locally visible).
Keeping in mind that the FreeBSD model comes from x86, you can sense that
some things are tailored to the x86 model, which doesn't have any rule or
ordering on the global visibility of the operations.
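
One way to picture the distinction between ordering and visibility is the
usual publish/consume idiom, sketched here with C11 atomics. This is an
assumption-laden illustration: it uses <stdatomic.h>, whose release/acquire
rules are not the same as the atomic(9) semantics discussed in this thread,
and struct sketch_box, sketch_publish() and sketch_try_consume() are
hypothetical names. The writer only orders its own stores; nothing in the
code says when the reader will observe the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mailbox shared between two CPUs. */
struct sketch_box {
	int		payload;	/* plain data */
	atomic_bool	ready;		/* publication flag */
};

/* Writer: the release store keeps the payload write ordered before the flag. */
static inline void
sketch_publish(struct sketch_box *b, int value)
{
	assert(b != NULL);
	b->payload = value;
	atomic_store_explicit(&b->ready, true, memory_order_release);
}

/*
 * Reader: returns false until the flag is observed by this CPU. The code
 * expresses ordering only; it does not bound how long that takes.
 */
static inline bool
sketch_try_consume(struct sketch_box *b, int *out)
{
	assert(b != NULL && out != NULL);
	if (!atomic_load_explicit(&b->ready, memory_order_acquire))
		return (false);
	*out = b->payload;
	return (true);
}
```

The reader has to poll (or otherwise wait) because the only promise is
relative order: once the flag is seen, the payload written before it is
seen too.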

> ...
>
>> The bottom line of all this is that read memory barriers ensure that
>> the effects of the operations you are making (a load in the case of
>> atomic_load_acq_int(), for example) are completed before any later
>> data accesses. "Data accesses" covers *all* operations,
>> including reads, writes, etc. This is very different from what Linux
>> assumes for its rmb() barrier, for example, which just orders loads. So
>> for FreeBSD there is no _acq -> rmb() analogy and there is no _rel ->
>> wmb() analogy.
>
> On ARMv8, using the above pseudo-code, later operations
> will not be moved before the load-acquire, but they may happen before
> its store. Having discussed this with John Baldwin I don't think this
> is a problem due to the nature of the store operation being allowed to
> fail if another processor has written its memory.
>
>>
>> This must be kept well in mind when trying to optimize the atomic_*()
>> operations.
>
> At this point I'm more interested in getting them correct as they will
> be important when I start on SMP support.

Sure. The thread started as an "optimization of x86", but it applies
to all atomic_* on every architecture FreeBSD supports.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


