From owner-freebsd-arch@FreeBSD.ORG Sun Oct 5 06:37:57 2014
Date: Sun, 5 Oct 2014 08:37:51 +0200
From: Mateusz Guzik
To: Attilio Rao
Cc: Alan Cox, Konstantin Belousov, Johan Schuijt, "freebsd-arch@freebsd.org"
Subject: Re: [PATCH 1/2] Implement simple sequence counters with memory barriers.
Message-ID: <20141005063750.GA9262@dft-labs.eu>

On Sat, Oct 04, 2014 at 11:37:16AM +0200, Attilio Rao wrote:
> On Sat, Oct 4, 2014 at 7:28 AM, Mateusz Guzik wrote:
> > Reviving. Sorry everyone for such a big delay, $life.
> >
> > On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
> >> On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik wrote:
> >> > Well, my memory-barrier-and-so-on-fu is rather weak.
> >> >
> >> > I had another look at the issue. At least on amd64, it looks like only
> >> > a compiler barrier is required for both reads and writes.
> >> >
> >> > The AMD64 Architecture Programmer’s Manual Volume 2: System Programming,
> >> > section 7.2 "Multiprocessor Memory Access Ordering", states:
> >> >
> >> > "Loads do not pass previous loads (loads are not reordered). Stores do
> >> > not pass previous stores (stores are not reordered)."
> >> >
> >> > Since the modifying code only performs a series of writes and we
> >> > expect exclusive writers, I find it applicable to this scenario.
> >> >
> >> > I checked the Linux sources and the generated assembly; they indeed
> >> > issue only a compiler barrier on amd64 (and for Intel processors as
> >> > well).
> >> >
> >> > atomic_store_rel_int on amd64 seems fine in this regard, but the only
> >> > function for loads issues a lock cmpxchg, which kills performance
> >> > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
> >> >
> >> > Additionally, release and acquire semantics seem to be a stronger
> >> > guarantee than needed.
> >>
> >> This statement left me puzzled and got me to look at our x86 atomic.h for
> >> the first time in years. It appears that our implementation of
> >> atomic_load_acq_int() on x86 is, umm ..., unconventional. That is, it is
> >> enforcing a constraint that simple acquire loads don't normally enforce.
> >> For example, the C11 stdatomic.h simple acquire load doesn't enforce this
> >> constraint. Moreover, our own implementation of atomic_load_acq_int() on
> >> ia64, where the mapping from atomic_load_acq_int() to machine instructions
> >> is straightforward, doesn't enforce this constraint either.
> >
> > By 'this constraint' I presume you mean a full memory barrier.
> >
> > It is unclear to me whether one can just get rid of it currently. It
> > definitely would be beneficial.
> >
> > In the meantime, if for some reason a full barrier is still needed, we
> > can speed up concurrent load_acq of the same variable considerably.
> > There is no need to lock cmpxchg on the same address. We should be able
> > to replace it with, roughly:
> > lock add $0,(%rsp);
> > movl ...;
>
> When I looked into some AMD manual (I think the same one which reports
> using lock add $0,(%rsp)) I recall that the (reported) combined latency
> of the "lock add" + "movl" instructions is higher than that of the
> single "cmpxchg".
> Moreover, I think that the simple movl is going to lock the cache line
> anyway, so I doubt the "lock add" is going to provide any benefit. The
> only benefit I can think of is that we will be able to use _acq()
> barriers on read-only memory with this trick (which is not possible
> today, as the timecounter code can testify).
>
> Whether the latencies for "lock add" + "movl" have changed in the latest
> Intel processors I can't say for sure; it may be worth looking into.

I stated in my previous mail that it is faster, and I have a trivial
benchmark to back it up.

In fget_unlocked there is an atomic_load_acq at the beginning (I have
patches which get rid of it, btw).
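To make the comparison concrete, below is a rough user-space sketch of
the two load variants being discussed; the function names are made up
for illustration and this is not the actual atomic(9) code:

#include <stdint.h>

/* current style: full barrier via a locked cmpxchg on the shared address */
static inline uint32_t
load_acq_cmpxchg(volatile uint32_t *p)
{
	uint32_t res = 0;

	__asm__ __volatile__("lock; cmpxchgl %0,%1"
	    : "+a" (res), "+m" (*p)
	    :
	    : "memory", "cc");
	return (res);
}

/* proposed style: full barrier on our own stack, then a plain movl */
static inline uint32_t
load_acq_lockadd(volatile uint32_t *p)
{
	uint32_t res;

	__asm__ __volatile__("lock; addl $0,(%%rsp)" : : : "memory", "cc");
	res = *p;				/* plain load of the shared word */
	__asm__ __volatile__("" : : : "memory");	/* compiler-only acquire */
	return (res);
}

The difference is that the first variant performs a locked
read-modify-write on the shared word itself, bouncing its cache line
between all readers, while the second only issues the locked operation
on the calling thread's own stack and merely reads the shared word.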
After the code is changed to lock add + movl, we get a significant speed
up in a microbenchmark of 15 threads going read -> fget_unlocked:

x vanilla-readpipe
+ lockadd-readpipe
    N           Min           Max        Median           Avg        Stddev
x  20      11073800      13429593      12266195      12190982     629380.16
+  20      53414354      54152272      53567250      53791945     322012.74
Difference at 95.0% confidence
        4.1601e+07 +/- 319962
        341.244% +/- 2.62458%
        (Student's t, pooled s = 499906)

This is on an Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz.

This seems to make sense, since we only read from the shared area and
the lock add is performed on addresses private to the executing threads.

fwiw, lock cmpxchg on %rsp gives a comparable speed up.

Of course one would need to actually measure this stuff to get a better
idea of what's really going on inside the CPU.

--
Mateusz Guzik
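For readers who don't have the patch at hand, here is an
illustrative-only sketch of the sequence counter pattern this thread is
about. The names below are made up and do not correspond to the actual
patch; the compiler-only barriers assume the amd64 ordering rules quoted
above, and other architectures would need real acquire/release barriers.

#include <stdint.h>

typedef volatile uint32_t seq_t;

#define	compiler_barrier()	__asm__ __volatile__("" : : : "memory")

/* Writer side; writers are assumed to be serialized externally. */
static inline void
seq_write_begin(seq_t *sp)
{
	(*sp)++;		/* counter becomes odd: update in progress */
	compiler_barrier();	/* keep protected stores after the counter bump */
}

static inline void
seq_write_end(seq_t *sp)
{
	compiler_barrier();	/* keep protected stores before the counter bump */
	(*sp)++;		/* counter becomes even: data stable again */
}

/* Reader side: snapshot the counter, copy the data, then re-check. */
static inline uint32_t
seq_read_begin(seq_t *sp)
{
	uint32_t seq;

	while (((seq = *sp) & 1) != 0)
		;		/* writer in progress, spin */
	compiler_barrier();	/* keep protected loads after the counter read */
	return (seq);
}

static inline int
seq_read_retry(seq_t *sp, uint32_t seq)
{
	compiler_barrier();	/* keep protected loads before the re-read */
	return (*sp != seq);	/* non-zero: redo the read section */
}

A reader copies the protected data between seq_read_begin() and
seq_read_retry() and repeats the copy whenever the counter has changed
in the meantime.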