Date:      Tue, 28 Oct 2014 14:18:41 +0100
From:      Attilio Rao <attilio@freebsd.org>
To:        Mateusz Guzik <mjguzik@gmail.com>
Cc:        Adrian Chadd <adrian@freebsd.org>, Alan Cox <alc@rice.edu>, Konstantin Belousov <kib@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject:   Re: atomic ops
Message-ID:  <CAJ-FndCWZt7YwFswt70QvbXA5c8Q_cYME2m3OwHTjCv8Nu3s=Q@mail.gmail.com>
In-Reply-To: <20141028025222.GA19223@dft-labs.eu>
References:  <20141028025222.GA19223@dft-labs.eu>

On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> As was mentioned sometime ago, our situation related to atomic ops is
> not ideal.
>
> atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
> full memory barriers, which is stronger than needed.
>
> Moreover, load is implemented as lock cmpxchg on var address, so it is
> additionally slower especially when cpus compete.

I already explained this once privately: full memory barriers are not
stronger than needed.
FreeBSD has different semantics than Linux. We historically enforce a
full barrier on _acq() and _rel() rather than just a read and write
barrier, hence we need a different implementation than Linux.
There is code that relies on this property, like the locking
primitives (release a mutex, for instance).
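To illustrate the kind of code that depends on the ordering attached to
acquire and release operations, here is a minimal sketch of a toy lock.
It is written with C11 <stdatomic.h> so it can be compiled in userland;
it is not FreeBSD's mutex code and does not use atomic(9). The names
toy_lock, toy_lock_acquire(), toy_lock_release() and toy_increment()
are hypothetical, and C11 acquire/release ordering is only assumed to
be a lower bound on what the kernel primitives provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock for illustration; not FreeBSD's struct mtx. */
struct toy_lock {
	atomic_bool held;
};

static int protected_counter;	/* data guarded by the lock */

static void
toy_lock_acquire(struct toy_lock *l)
{
	bool expected = false;

	/* Acquire: accesses after the lock cannot move before it. */
	while (!atomic_compare_exchange_weak_explicit(&l->held, &expected,
	    true, memory_order_acquire, memory_order_relaxed))
		expected = false;	/* a failed CAS overwrites expected */
}

static void
toy_lock_release(struct toy_lock *l)
{
	/* The caller must own the lock. */
	assert(atomic_load_explicit(&l->held, memory_order_relaxed));

	/* Release: accesses before the unlock cannot move after it. */
	atomic_store_explicit(&l->held, false, memory_order_release);
}

static int
toy_increment(struct toy_lock *l)
{
	int v;

	toy_lock_acquire(l);
	v = ++protected_counter;	/* must stay inside the critical section */
	toy_lock_release(l);
	return (v);
}
```

If the store in toy_lock_release() carried no ordering at all, the
update to protected_counter could become visible after the lock
appears free, which is the class of breakage a semantic change would
risk.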

In short: optimizing the implementation for performance is fine and
due. Changing the semantics is not fine, unless you have reviewed and
fixed all the uses of _rel() and _acq().

> On amd64 it is sufficient to place a compiler barrier in such cases.
>
> Next, we lack some atomic ops in the first place.
>
> Let's define some useful terms:
> smp_wmb - no writes can be reordered past this point
> smp_rmb - no reads can be reordered past this point
>
> With this in mind, we lack ops which would guarantee only the following:
>
> 1. var = tmp; smp_wmb();
> 2. tmp = var; smp_rmb();
> 3. smp_rmb(); tmp = var;
>
> This matters since what we can use already to emulate this is way
> heavier than needed on aforementioned amd64 and most likely other archs.

I can see the value of such barriers in case you just want to
synchronize operations with respect to reads or writes.
I also believe that on the newest Intel processors (for which we should
optimize) rmb() and wmb() got significantly faster than mb(). However,
the most interesting case would be arm and mips, I assume. That's
where you would see a bigger perf difference if you optimize the
membar paths.
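As a rough sketch of how a write-side barrier pairs with a read-side
barrier, the following userland example publishes a value through a
flag. The smp_wmb() and smp_rmb() macros here are hypothetical
stand-ins built on C11 fences, which are somewhat stronger than a pure
write-only or read-only barrier; publish(), consume(), payload and
ready are likewise made up for illustration and are not an existing
FreeBSD KPI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins. A C11 release fence also orders earlier
 * loads, and an acquire fence also orders later stores, so both are
 * stronger than a strict wmb/rmb.
 */
#define	smp_wmb()	atomic_thread_fence(memory_order_release)
#define	smp_rmb()	atomic_thread_fence(memory_order_acquire)

static int payload;		/* plain data being handed over */
static atomic_int ready;	/* flag announcing that payload is valid */

static void
publish(int value)
{
	payload = value;
	smp_wmb();		/* payload write stays before the flag write */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

static int
consume(int *out)
{
	assert(out != NULL);

	if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
		return (0);	/* nothing published yet */
	smp_rmb();		/* flag read stays before the payload read */
	*out = payload;
	return (1);
}
```

Neither side needs a full barrier here: the writer only orders its two
stores and the reader only orders its two loads.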

Last time I looked into it, in the FreeBSD kernel the Linux-ish
rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
handling of 16-bit operands, and the implementation of "faster" bus
barriers.
Initially I had thought about just confining the smp_*() to a Linux
compat layer and fixing the other 2 in this way: for 16-bit operands
just pad to 32 bits, as the C11 standard also does. For the bus
barriers, just grow more versions to actually include the rmb()/wmb()
scheme within.

At this point, I understand we may instead want to support the
concept of write-only or read-only barriers. This means that if we want
to keep the concept tied to the current _acq()/_rel() scheme we will
end up with a KPI explosion.

I'm not the one making the call here, but for a faster and more
granular approach, possibly we can end up using smp_rmb() and
smp_wmb() directly. As I said, I'm not the one making the call.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


