Date: Wed, 29 Oct 2014 20:04:59 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Cc: Adrian Chadd <adrian@freebsd.org>, Alan Cox <alc@rice.edu>, Konstantin Belousov <kib@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject: Re: atomic ops
Message-ID: <20141029190459.GA25368@dft-labs.eu>
In-Reply-To: <CAJ-FndCWZt7YwFswt70QvbXA5c8Q_cYME2m3OwHTjCv8Nu3s=Q@mail.gmail.com>
References: <20141028025222.GA19223@dft-labs.eu> <CAJ-FndCWZt7YwFswt70QvbXA5c8Q_cYME2m3OwHTjCv8Nu3s=Q@mail.gmail.com>

On Tue, Oct 28, 2014 at 02:18:41PM +0100, Attilio Rao wrote:
> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > As was mentioned some time ago, our situation related to atomic ops is
> > not ideal.
> >
> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
> > full memory barriers, which is stronger than needed.
> >
> > Moreover, load is implemented as lock cmpxchg on var address, so it is
> > additionally slower, especially when cpus compete.
>
> I already explained this once privately: full memory barriers are not
> stronger than needed.
> FreeBSD has different semantics than Linux. We historically enforce a
> full barrier on _acq() and _rel() rather than just a read and write
> barrier, hence we need a different implementation than Linux.
> There is code that relies on this property, like the locking
> primitives (release a mutex, for instance).
>

I mean stronger than needed in some cases; a popular one is
fget_unlocked. We provide no "lightest sufficient" barrier (which would
also be cheaper).

Another case which would benefit greatly is sys/sys/seq.h. As noted in
some other thread, using load_acq as it is destroys performance.

I don't dispute the need for full barriers, although it is unclear
which current consumers of load_acq actually need a full barrier.

> In short: optimizing the implementation for performance is fine and
> due. Changing the semantics is not fine, unless you have reviewed and
> fixed all the uses of _rel() and _acq().
>
> > On amd64 it is sufficient to place a compiler barrier in such cases.
> >
> > Next, we lack some atomic ops in the first place.
> >
> > Let's define some useful terms:
> > smp_wmb - no writes can be reordered past this point
> > smp_rmb - no reads can be reordered past this point
> >
> > With this in mind, we lack ops which would guarantee only the following:
> >
> > 1. var = tmp; smp_wmb();
> > 2. tmp = var; smp_rmb();
> > 3. smp_rmb(); tmp = var;
> >
> > This matters since what we can use already to emulate this is way
> > heavier than needed on aforementioned amd64 and most likely other archs.
>
> I can see the value of such barriers in case you want to just
> synchronize operations with regard to reads or writes.
> I also believe that on the newest Intel processors (for which we should
> optimize) rmb() and wmb() got significantly faster than mb(). However
> the most interesting case would be for arm and mips, I assume. That's
> where you would see a bigger perf difference if you optimize the
> membar paths.
>
> Last time I looked into it, in the FreeBSD kernel the Linux-ish
> rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
> handling of 16-bit operands and implementation of "faster" bus
> barriers.
> Initially I had thought about just confining the smp_*() in a Linux
> compat layer and fixing the other 2 in this way: for 16-bit operands
> just pad to 32 bits, as the C11 standard also does. For the bus
> barriers, just grow more versions to actually include the rmb()/wmb()
> scheme within.
>
> At this point, I understand we may want to instead support the
> concept of a write-only or read-only barrier. This means that if we want
> to keep the concept tied to the current _acq()/_rel() scheme we will
> end up with a KPI explosion.
>
> I'm not the one making the call here, but for a faster and more
> granular approach, possibly we can end up using smp_rmb() and
> smp_wmb() directly. As I said I'm not the one making the call.
>

Well, I don't know the original motivation for expressing stuff with
_load_acq and _store_rel.

Anyway, maybe we could do something along these lines (expressing
intent, not actual code):

mb_producer_start(p, v) { *p = v; smp_wmb(); }
mb_producer(p, v) { smp_wmb(); *p = v; }
mb_producer_end(p, v) { mb_producer(p, v); }

type mb_consumer(p) { var = *p; smp_rmb(); return (var); }
type mb_consumer_start(p) { return (mb_consumer(p)); }
type mb_consumer_end(p) { smp_rmb(); return (*p); }

-- 
Mateusz Guzik <mjguzik gmail.com>
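
For illustration only, here is how the mb_producer()/mb_consumer() pair
sketched above might look in portable C11. This is a minimal sketch, not
FreeBSD code. It assumes that C11 fences can stand in for smp_wmb() and
smp_rmb(), although a release fence is somewhat stronger than a pure
write barrier and an acquire fence somewhat stronger than a pure read
barrier. The names payload, ready, publish() and try_consume() are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Assumption: C11 fences stand in for the smp_wmb()/smp_rmb() of the
 * message. They are not exact equivalents: both fences order slightly
 * more than a write-only or read-only barrier would.
 */
static inline void smp_wmb(void) { atomic_thread_fence(memory_order_release); }
static inline void smp_rmb(void) { atomic_thread_fence(memory_order_acquire); }

/* mb_producer: earlier writes are ordered before the store to *p. */
static inline void
mb_producer(atomic_int *p, int v)
{
	smp_wmb();
	atomic_store_explicit(p, v, memory_order_relaxed);
}

/* mb_consumer: the load of *p is ordered before later reads. */
static inline int
mb_consumer(atomic_int *p)
{
	int var = atomic_load_explicit(p, memory_order_relaxed);

	smp_rmb();
	return (var);
}

/* Hypothetical shared state: a payload guarded by a ready flag. */
static int payload;
static atomic_int ready;

static void
publish(int value)
{
	payload = value;		/* plain write of the data */
	mb_producer(&ready, 1);		/* flag store cannot pass the data write */
}

/* Returns 1 and fills *out if data was published, 0 otherwise. */
static int
try_consume(int *out)
{
	if (mb_consumer(&ready) == 0)
		return (0);
	*out = payload;			/* cannot be hoisted above the flag load */
	return (1);
}
```

A producer thread would call publish() once, and a consumer would poll
try_consume() until it returns 1. Neither side needs a full barrier or a
locked instruction; the write-side ordering sits in the producer and the
read-side ordering in the consumer.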